In this recipe we'll explore how to perform six-frame translation on a nucleotide sequence. Briefly, when an unknown sequence is obtained from the environment, there are six possible reading frames for translation. The six reading frames for a sequence start at positions 1, 2, and 3 in the forward orientation (denoted 1, 2, and 3 in scikit-bio) and positions 1, 2, and 3 in the reverse orientation (denoted -1, -2, and -3 in scikit-bio).
Six-frame translation can be used to find the possible proteins a nucleotide sequence might encode.
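To make the frame numbering concrete, here is a minimal pure-Python sketch, independent of scikit-bio, that splits a sequence into complete codons for each of the six frames. The helper names reverse_complement_rna and six_frame_codons are hypothetical, and the sketch assumes an ungapped sequence containing only A, C, G, and U.

```python
def reverse_complement_rna(seq):
    """Reverse-complement an RNA string (assumes only A, C, G, U)."""
    complement = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq))


def six_frame_codons(seq):
    """Map each frame (1, 2, 3, -1, -2, -3) to its list of complete codons."""
    strands = {1: seq, -1: reverse_complement_rna(seq)}
    frames = {}
    for sign, strand in strands.items():
        for offset in range(3):
            # Frame n skips n - 1 bases, then reads whole codons only;
            # any trailing one or two bases are dropped.
            usable_end = offset + (len(strand) - offset) // 3 * 3
            frames[sign * (offset + 1)] = [
                strand[i:i + 3] for i in range(offset, usable_end, 3)
            ]
    return frames


frames = six_frame_codons("GUGGGCUUUU")
for frame in (1, 2, 3, -1, -2, -3):
    print(frame, frames[frame])
```

Negative frames are read from the reverse complement, so frame -1 starts at the last base of the input and runs backwards along the opposite strand.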
First let's define a single RNA sequence.
from __future__ import print_function
from skbio import RNA
rna = RNA("GUGGGCUUUUUUAUAUCUUAAUUUGCCAUACUAAUUGCGGCAAUCGCAUGAGGCGUUUUAUUAUUCCAUACACAUAACUUUUCGACUUUAGCUUCAGUAAGAUAUGCAAUCCUCAGGGUAUCCUUCAUCCUUUCAAUCGCUUUUUUUUGUGAAUCUAUAUGUUGACUACCUGGUACUUCUACUUGAAAAGUUGCACCAUUCUUAAAAGUAAUGAUAGCCAUCUCUCUUUUUCCAGCUAGAGAUUCUGUAUACGACAAUAUCUUAUCAUUUAGCGUAUGUAUUUGUGUGUUGUGGUAUUCUGCACACAAAUCAGUAAUAUUUUGAGGUGUUCCAUGUGCAUAUGCUGAAGAUAGUAAAACUGUAAAAAAAACACCAAAUUUUAAUUUAAUCAU")
rna
RNA
---------------------------------------------------------------------
Stats:
    length: 392
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 33.42%
---------------------------------------------------------------------
0   GUGGGCUUUU UUAUAUCUUA AUUUGCCAUA CUAAUUGCGG CAAUCGCAUG AGGCGUUUUA
60  UUAUUCCAUA CACAUAACUU UUCGACUUUA GCUUCAGUAA GAUAUGCAAU CCUCAGGGUA
...
300 GCACACAAAU CAGUAAUAUU UUGAGGUGUU CCAUGUGCAU AUGCUGAAGA UAGUAAAACU
360 GUAAAAAAAA CACCAAAUUU UAAUUUAAUC AU
Next let's create a genetic code object. The default genetic code in scikit-bio is the vertebrate nuclear genetic code, but others exist that differ in minor ways (e.g., some codons code for different amino acids, or the set of stop codons is slightly different) and can be obtained via the GeneticCode.from_ncbi factory. Since we're going to translate the cholera toxin RNA sequence (produced by the Vibrio cholerae bacterium), we'll use NCBI's Bacterial, Archaeal and Plant Plastid Code (transl_table=11):
from skbio import GeneticCode
gc = GeneticCode.from_ncbi(11)
gc
GeneticCode (Bacterial, Archaeal and Plant Plastid)
-------------------------------------------------------------------------
  AAs  = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
Starts = ---M---------------M------------MMMM---------------M------------
Base1  = UUUUUUUUUUUUUUUUCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2  = UUUUCCCCAAAAGGGGUUUUCCCCAAAAGGGGUUUUCCCCAAAAGGGGUUUUCCCCAAAAGGGG
Base3  = UCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAG
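One way to read this table: column i describes the codon formed by the i-th characters of Base1, Base2, and Base3; the AAs row gives the amino acid that codon encodes (* for a stop), and an M in the Starts row marks a codon that can initiate translation. The pure-Python sketch below rebuilds that lookup from the rows shown above. The names codon_table, start_codons, and translate_frame are hypothetical, and this illustrates the table layout rather than scikit-bio's implementation.

```python
# Rows copied from the GeneticCode repr for NCBI translation table 11.
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STARTS = "---M---------------M------------MMMM---------------M------------"
BASE1 = "UUUUUUUUUUUUUUUUCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG"
BASE2 = "UUUUCCCCAAAAGGGGUUUUCCCCAAAAGGGGUUUUCCCCAAAAGGGGUUUUCCCCAAAAGGGG"
BASE3 = "UCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAG"

codon_table = {}      # codon -> amino acid ('*' marks a stop codon)
start_codons = set()  # codons flagged with M in the Starts row
for b1, b2, b3, aa, start in zip(BASE1, BASE2, BASE3, AAS, STARTS):
    codon = b1 + b2 + b3
    codon_table[codon] = aa
    if start == "M":
        start_codons.add(codon)


def translate_frame(seq, offset=0):
    """Translate every complete codon from `offset`, keeping stops as '*'."""
    end = offset + (len(seq) - offset) // 3 * 3
    return "".join(codon_table[seq[i:i + 3]] for i in range(offset, end, 3))


print(sorted(start_codons))
# First 30 bases of the RNA sequence defined earlier in this recipe.
print(translate_frame("GUGGGCUUUUUUAUAUCUUAAUUUGCCAUA"))
```

Note that this table has seven start codons, not just AUG, which matters once translation is required to begin at a start codon.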
Next we perform six-frame translation of our RNA sequence, which gives us a generator of the six translated sequences:
for e in gc.translate_six_frames(rna):
    print(repr(e), end="\n\n")
Protein
---------------------------------------------------------------------
Stats:
    length: 130
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: True
---------------------------------------------------------------------
0   VGFFIS*FAI LIAAIA*GVL LFHTHNFSTL ASVRYAILRV SFILSIAFFC ESIC*LPGTS
60  T*KVAPFLKV MIAISLFPAR DSVYDNILSF SVCICVLWYS AHKSVIF*GV PCAYAEDSKT
120 VKKTPNFNLI

Protein
---------------------------------------------------------------------
Stats:
    length: 130
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: True
---------------------------------------------------------------------
0   WAFLYLNLPY *LRQSHEAFY YSIHITFRL* LQ*DMQSSGY PSSFQSLFFV NLYVDYLVLL
60  LEKLHHS*K* **PSLFFQLE ILYTTISYHL AYVFVCCGIL HTNQ*YFEVF HVHMLKIVKL
120 *KKHQILI*S

Protein
---------------------------------------------------------------------
Stats:
    length: 130
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: True
---------------------------------------------------------------------
0   GLFYILICHT NCGNRMRRFI IPYT*LFDFS FSKICNPQGI LHPFNRFFL* IYMLTTWYFY
60  LKSCTILKSN DSHLSFSS*R FCIRQYLII* RMYLCVVVFC TQISNILRCS MCIC*R**NC
120 KKNTKF*FNH

Protein
---------------------------------------------------------------------
Stats:
    length: 130
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: True
---------------------------------------------------------------------
0   MIKLKFGVFF TVLLSSAYAH GTPQNITDLC AEYHNTQIHT LNDKILSYTE SLAGKREMAI
60  ITFKNGATFQ VEVPGSQHID SQKKAIERMK DTLRIAYLTE AKVEKLCVWN NKTPHAIAAI
120 SMAN*DIKKP

Protein
---------------------------------------------------------------------
Stats:
    length: 130
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: True
---------------------------------------------------------------------
0   *LN*NLVFFL QFYYLQHMHM EHLKILLICV QNTTTHKYIR *MIRYCRIQN L*LEKERWLS
60  LLLRMVQLFK *KYQVVNI*I HKKKRLKG*R IP*GLHILLK LKSKSYVYGI IKRLMRLPQL
120 VWQIKI*KSP

Protein
---------------------------------------------------------------------
Stats:
    length: 130
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: True
---------------------------------------------------------------------
0   D*IKIWCFFY SFTIFSICTW NTSKYY*FVC RIPQHTNTYA K**DIVVYRI SSWKKRDGYH
60  YF*EWCNFSS RSTR*STYRF TKKSD*KDEG YPEDCISY*S *SRKVMCME* *NASCDCRN*
120 YGKLRYKKAH
The six protein sequences correspond to the six possible reading frames of the RNA sequence. By default, start and stop codons are not taken into account, since we don't know whether we have a full-length sequence, so stop codons appear inside the translations as *.
If instead we want to look only at putative proteins coded by each sequence, we could require that translation start at a start codon, and stop at a stop codon, and then review the resulting sequences.
for e in gc.translate_six_frames(rna, start='require', stop='require'):
    print(repr(e), end="\n\n")
Protein
--------------------------
Stats:
    length: 6
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: False
--------------------------
0 MGFFIS

Protein
--------------------------
Stats:
    length: 3
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: False
--------------------------
0 MPY

Protein
--------------------------
Stats:
    length: 20
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: False
--------------------------
0 MLICHTNCGN RMRRFIIPYT

Protein
---------------------------------------------------------------------
Stats:
    length: 124
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: False
---------------------------------------------------------------------
0   MIKLKFGVFF TVLLSSAYAH GTPQNITDLC AEYHNTQIHT LNDKILSYTE SLAGKREMAI
60  ITFKNGATFQ VEVPGSQHID SQKKAIERMK DTLRIAYLTE AKVEKLCVWN NKTPHAIAAI
120 SMAN

Protein
----------------------------------------
Stats:
    length: 35
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: False
----------------------------------------
0 MVFFLQFYYL QHMHMEHLKI LLICVQNTTT HKYIR

Protein
----------------------------
Stats:
    length: 24
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: False
----------------------------
0 MKIWCFFYSF TIFSICTWNT SKYY
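Note that each sequence starts with M (methionine). With start='require', translation begins at the first start codon in the frame, and that codon is translated as M whether it is AUG or one of this genetic code's alternative start codons (for example, the GUG that opens frame 1, which was translated as V in the earlier output). The stop translation character has also been trimmed off. One of these, from the -1 reading frame (i.e., read from the reverse complement of the input sequence), looks more like a real protein-coding sequence than the others because of its length (124 amino acids).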
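The effect of start='require' and stop='require' can be sketched in pure Python. This is an illustration inferred from the output above, not scikit-bio's implementation: it assumes translation begins at the first start codon found in the frame (emitted as M), ends just before the first stop codon after it, and fails when either is missing. The function name translate_orf is hypothetical; the start codons are the ones marked M in the Starts row of the genetic code table.

```python
from itertools import product

# NCBI table 11 amino acids, in the same U, C, A, G codon order as the table.
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("UCAG", repeat=3), AAS)}
# Codons flagged with M in the Starts row of table 11.
START_CODONS = {"UUG", "CUG", "AUU", "AUC", "AUA", "AUG", "GUG"}


def translate_orf(seq, offset=0):
    """Translate from the first start codon up to (excluding) the next stop."""
    end = offset + (len(seq) - offset) // 3 * 3
    codons = [seq[i:i + 3] for i in range(offset, end, 3)]
    starts = [i for i, codon in enumerate(codons) if codon in START_CODONS]
    if not starts:
        raise ValueError("no start codon in this reading frame")
    protein = ["M"]  # the initiating codon is always read as methionine
    for codon in codons[starts[0] + 1:]:
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            return "".join(protein)  # the stop character is not included
        protein.append(amino_acid)
    raise ValueError("no stop codon after the start codon")


# First 77 bases of the RNA sequence used in this recipe.
prefix = (
    "GUGGGCUUUUUUAUAUCUUAAUUUGCCAUA"
    "CUAAUUGCGGCAAUCGCAUGAGGCGUUUUA"
    "UUAUUCCAUACACAUAA"
)
for offset in range(3):
    print(offset + 1, translate_orf(prefix, offset))
```

On this prefix the sketch reproduces the three forward-frame results shown above (MGFFIS, MPY, and MLICHTNCGNRMRRFIIPYT), including the alternative start codons GUG, UUG, and AUC being read as M.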
As a next step, try searching the putative proteins against a reference database (e.g., by BLASTing them using NCBI's blastp tool) to figure out which of these might be an actual protein coding sequence and what it codes for.
In the future, we'll have support for remote database searching in scikit-bio. You can track progress on that here.