In this recipe we'll explore how to perform six-frame translation on a nucleotide sequence. Briefly, when an unknown sequence is obtained from the environment, there are six possible reading frames for translation. The six reading frames for a sequence start at positions 1, 2, and 3 in the forward orientation (denoted 1, 2, and 3 in scikit-bio) and positions 1, 2, and 3 in the reverse orientation (denoted -1, -2, and -3 in scikit-bio).

Six-frame translation can be used to find the possible proteins a nucleotide sequence might encode.
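To make the six reading frames concrete, here is a minimal pure-Python sketch that slices a short RNA string into the nucleotides read by each frame. It is an illustration only, not how scikit-bio implements translation; the helper names `reverse_complement` and `six_frames` and the toy input sequence are assumptions made for this example.

```python
# Complement map for RNA; this sketch assumes only A, C, G and U are present.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Return a dict mapping frame number (1, 2, 3, -1, -2, -3) to the
    nucleotides read in that frame, trimmed to whole codons."""
    frames = {}
    for strand, sign in ((seq, 1), (reverse_complement(seq), -1)):
        for offset in range(3):
            sub = strand[offset:]
            # Drop trailing bases that do not form a complete codon.
            sub = sub[:len(sub) - len(sub) % 3]
            frames[sign * (offset + 1)] = sub
    return frames


frames = six_frames("AUGGCCUAA")  # hypothetical toy sequence
for frame in (1, 2, 3, -1, -2, -3):
    print(frame, frames[frame])
```

Frames 1 to 3 read the input as given, while frames -1 to -3 read its reverse complement, which is why a coding sequence on the opposite strand can still be found.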

First let's define a single RNA sequence.

In [1]:
from __future__ import print_function

from skbio import RNA
rna = RNA("GUGGGCUUUUUUAUAUCUUAAUUUGCCAUACUAAUUGCGGCAAUCGCAUGAGGCGUUUUAUUAUUCCAUACACAUAACUUUUCGACUUUAGCUUCAGUAAGAUAUGCAAUCCUCAGGGUAUCCUUCAUCCUUUCAAUCGCUUUUUUUUGUGAAUCUAUAUGUUGACUACCUGGUACUUCUACUUGAAAAGUUGCACCAUUCUUAAAAGUAAUGAUAGCCAUCUCUCUUUUUCCAGCUAGAGAUUCUGUAUACGACAAUAUCUUAUCAUUUAGCGUAUGUAUUUGUGUGUUGUGGUAUUCUGCACACAAAUCAGUAAUAUUUUGAGGUGUUCCAUGUGCAUAUGCUGAAGAUAGUAAAACUGUAAAAAAAACACCAAAUUUUAAUUUAAUCAU")
rna
Out[1]:
RNA
---------------------------------------------------------------------
Stats:
    length: 392
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 33.42%
---------------------------------------------------------------------
0   GUGGGCUUUU UUAUAUCUUA AUUUGCCAUA CUAAUUGCGG CAAUCGCAUG AGGCGUUUUA
60  UUAUUCCAUA CACAUAACUU UUCGACUUUA GCUUCAGUAA GAUAUGCAAU CCUCAGGGUA
...
300 GCACACAAAU CAGUAAUAUU UUGAGGUGUU CCAUGUGCAU AUGCUGAAGA UAGUAAAACU
360 GUAAAAAAAA CACCAAAUUU UAAUUUAAUC AU

Next let's create a genetic code object. The default genetic code in scikit-bio is the vertebrate nuclear genetic code, but others exist with minor differences (e.g., some codons code for different amino acids, or the set of start or stop codons is slightly different) and can be obtained via GeneticCode.from_ncbi. Since we're going to translate the cholera toxin RNA sequence (produced by the Vibrio cholerae bacterium), we'll use NCBI's Bacterial, Archaeal and Plant Plastid Code (transl_table=11):

In [2]:
from skbio import GeneticCode
gc = GeneticCode.from_ncbi(11)
gc
Out[2]:
GeneticCode (Bacterial, Archaeal and Plant Plastid)
-------------------------------------------------------------------------
  AAs  = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
Starts = ---M---------------M------------MMMM---------------M------------
Base1  = UUUUUUUUUUUUUUUUCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2  = UUUUCCCCAAAAGGGGUUUUCCCCAAAAGGGGUUUUCCCCAAAAGGGGUUUUCCCCAAAAGGGG
Base3  = UCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAG
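The table above can be read column by column: each position in the AAs and Starts rows corresponds to the codon spelled by the same position in Base1, Base2 and Base3. As a rough sketch of how to decode it, the following pure-Python snippet rebuilds a codon lookup from the two strings printed above. The variable names are assumptions for this example, and this is not scikit-bio's internal representation.

```python
# Rows copied from the GeneticCode repr above (NCBI transl_table=11).
AAS    = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STARTS = "---M---------------M------------MMMM---------------M------------"

# Base1 varies slowest and Base3 fastest, each cycling through U, C, A, G.
BASES = "UCAG"
codons = [b1 + b2 + b3 for b1 in BASES for b2 in BASES for b3 in BASES]

codon_to_aa = dict(zip(codons, AAS))
start_codons = [c for c, s in zip(codons, STARTS) if s == "M"]
stop_codons = [c for c, aa in zip(codons, AAS) if aa == "*"]

print("start codons:", start_codons)
print("stop codons: ", stop_codons)
```

Decoding the Starts row this way shows that this genetic code lists several start codons in addition to AUG.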

Now we can perform six-frame translation of our RNA sequence, which gives us a generator of the six translated sequences:

In [3]:
for e in gc.translate_six_frames(rna):
    print(repr(e), end="\n\n")
Protein
---------------------------------------------------------------------
Stats:
    length: 130
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: True
---------------------------------------------------------------------
0   VGFFIS*FAI LIAAIA*GVL LFHTHNFSTL ASVRYAILRV SFILSIAFFC ESIC*LPGTS
60  T*KVAPFLKV MIAISLFPAR DSVYDNILSF SVCICVLWYS AHKSVIF*GV PCAYAEDSKT
120 VKKTPNFNLI

Protein
---------------------------------------------------------------------
Stats:
    length: 130
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: True
---------------------------------------------------------------------
0   WAFLYLNLPY *LRQSHEAFY YSIHITFRL* LQ*DMQSSGY PSSFQSLFFV NLYVDYLVLL
60  LEKLHHS*K* **PSLFFQLE ILYTTISYHL AYVFVCCGIL HTNQ*YFEVF HVHMLKIVKL
120 *KKHQILI*S

Protein
---------------------------------------------------------------------
Stats:
    length: 130
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: True
---------------------------------------------------------------------
0   GLFYILICHT NCGNRMRRFI IPYT*LFDFS FSKICNPQGI LHPFNRFFL* IYMLTTWYFY
60  LKSCTILKSN DSHLSFSS*R FCIRQYLII* RMYLCVVVFC TQISNILRCS MCIC*R**NC
120 KKNTKF*FNH

Protein
---------------------------------------------------------------------
Stats:
    length: 130
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: True
---------------------------------------------------------------------
0   MIKLKFGVFF TVLLSSAYAH GTPQNITDLC AEYHNTQIHT LNDKILSYTE SLAGKREMAI
60  ITFKNGATFQ VEVPGSQHID SQKKAIERMK DTLRIAYLTE AKVEKLCVWN NKTPHAIAAI
120 SMAN*DIKKP

Protein
---------------------------------------------------------------------
Stats:
    length: 130
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: True
---------------------------------------------------------------------
0   *LN*NLVFFL QFYYLQHMHM EHLKILLICV QNTTTHKYIR *MIRYCRIQN L*LEKERWLS
60  LLLRMVQLFK *KYQVVNI*I HKKKRLKG*R IP*GLHILLK LKSKSYVYGI IKRLMRLPQL
120 VWQIKI*KSP

Protein
---------------------------------------------------------------------
Stats:
    length: 130
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: True
---------------------------------------------------------------------
0   D*IKIWCFFY SFTIFSICTW NTSKYY*FVC RIPQHTNTYA K**DIVVYRI SSWKKRDGYH
60  YF*EWCNFSS RSTR*STYRF TKKSD*KDEG YPEDCISY*S *SRKVMCME* *NASCDCRN*
120 YGKLRYKKAH

The six protein sequences represent each possible reading frame in the RNA sequence, but start and stop codons are not taken into account by default since we don't know if we have a full-length sequence.
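Because the default output keeps the stop character (*) in place, one quick way to inspect a translated frame is to split it at the stops and look at the stop-free stretches. The snippet below is a small sketch of that idea on plain strings; the helper name `stop_free_segments` is hypothetical, and the excerpt is the first 20 characters of the frame 1 translation shown above.

```python
def stop_free_segments(protein, stop_char="*"):
    """Split a translated frame at stop characters, dropping empty pieces."""
    return [seg for seg in protein.split(stop_char) if seg]


# First 20 characters of the frame 1 translation printed above.
excerpt = "VGFFIS*FAILIAAIA*GVL"
segments = stop_free_segments(excerpt)
longest = max(segments, key=len)
print(segments, longest)
```

A frame that is frequently interrupted by stops, leaving only short segments, is less likely to encode a real protein than one with a long uninterrupted stretch.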

If instead we want to look only at putative proteins coded by each sequence, we could require that translation start at a start codon, and stop at a stop codon, and then review the resulting sequences.

In [4]:
for e in gc.translate_six_frames(rna, start='require', stop='require'):
    print(repr(e), end="\n\n")
Protein
--------------------------
Stats:
    length: 6
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: False
--------------------------
0 MGFFIS

Protein
--------------------------
Stats:
    length: 3
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: False
--------------------------
0 MPY

Protein
--------------------------
Stats:
    length: 20
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: False
--------------------------
0 MLICHTNCGN RMRRFIIPYT

Protein
---------------------------------------------------------------------
Stats:
    length: 124
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: False
---------------------------------------------------------------------
0   MIKLKFGVFF TVLLSSAYAH GTPQNITDLC AEYHNTQIHT LNDKILSYTE SLAGKREMAI
60  ITFKNGATFQ VEVPGSQHID SQKKAIERMK DTLRIAYLTE AKVEKLCVWN NKTPHAIAAI
120 SMAN

Protein
----------------------------------------
Stats:
    length: 35
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: False
----------------------------------------
0 MVFFLQFYYL QHMHMEHLKI LLICVQNTTT HKYIR

Protein
----------------------------
Stats:
    length: 24
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: False
----------------------------
0 MKIWCFFYSF TIFSICTWNT SKYY

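Note that each sequence starts with M and that the stop translation character has been trimmed off. M is the amino acid encoded by AUG, the canonical start codon; this genetic code also lists alternative start codons (such as GUG, which begins the first sequence above), and the start codon is translated as M when start='require' is used. One of these, the -1 reading frame (i.e., the first frame of the reverse complement of the input sequence), looks more like a real protein-coding sequence than the others because of its length (124 amino acids).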
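The behavior seen in these outputs can be approximated in a few lines of plain Python: scan one frame for the first start codon, emit M for it, and translate until the first stop codon, which is not included in the result. This is only a sketch of the logic under those assumptions, not scikit-bio's implementation; the function name `translate_orf` is hypothetical, and the codon table is rebuilt from the AAs and Starts rows of the genetic code printed earlier.

```python
# Codon table for NCBI transl_table=11, rebuilt from the repr shown earlier.
BASES = "UCAG"
AAS    = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STARTS = "---M---------------M------------MMMM---------------M------------"
CODONS = [b1 + b2 + b3 for b1 in BASES for b2 in BASES for b3 in BASES]
CODON_TO_AA = dict(zip(CODONS, AAS))
START_CODONS = {c for c, s in zip(CODONS, STARTS) if s == "M"}


def translate_orf(rna):
    """Translate one frame from its first start codon to the next stop codon.

    Raises ValueError if the frame lacks a start codon or a following stop.
    """
    # Split into whole codons, ignoring any incomplete codon at the end.
    codons = [rna[i:i + 3] for i in range(0, len(rna) - 2, 3)]
    for first, codon in enumerate(codons):
        if codon in START_CODONS:
            break
    else:
        raise ValueError("no start codon found")
    protein = ["M"]  # the start codon is reported as M, whatever it encodes
    for codon in codons[first + 1:]:
        aa = CODON_TO_AA[codon]
        if aa == "*":
            return "".join(protein)  # the stop character is trimmed off
        protein.append(aa)
    raise ValueError("no stop codon found")


# First 21 nucleotides of the RNA sequence defined in In [1].
print(translate_orf("GUGGGCUUUUUUAUAUCUUAA"))
```

On the first 21 nucleotides of the input this reproduces the six-residue result reported for frame 1, where the leading GUG (V in the default output) becomes M.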

As a next step, try searching the putative proteins against a reference database (e.g., by BLASTing them using NCBI's blastp tool) to figure out which of these might be an actual protein coding sequence and what it codes for.

In the future, we'll have support for remote database searching in scikit-bio. You can track progress on that here.