#!/usr/bin/env python
# coding: utf-8

# In this recipe we'll explore how to perform six-frame translation on a nucleotide sequence. Briefly, when an unknown sequence is obtained from the environment, there are six possible reading frames for translation. The six reading frames for a sequence start at positions 1, 2, and 3 in the forward orientation (denoted ``1``, ``2``, and ``3`` in scikit-bio) and positions 1, 2, and 3 in the reverse orientation (denoted ``-1``, ``-2``, and ``-3`` in scikit-bio).
# 
# Six-frame translation can be used to find the possible proteins a nucleotide sequence might encode.

# First let's define a single RNA sequence.

# In[1]:


from __future__ import print_function

from skbio import RNA
rna = RNA("GUGGGCUUUUUUAUAUCUUAAUUUGCCAUACUAAUUGCGGCAAUCGCAUGAGGCGUUUUAUUAUUCCAUACACAUAACUUUUCGACUUUAGCUUCAGUAAGAUAUGCAAUCCUCAGGGUAUCCUUCAUCCUUUCAAUCGCUUUUUUUUGUGAAUCUAUAUGUUGACUACCUGGUACUUCUACUUGAAAAGUUGCACCAUUCUUAAAAGUAAUGAUAGCCAUCUCUCUUUUUCCAGCUAGAGAUUCUGUAUACGACAAUAUCUUAUCAUUUAGCGUAUGUAUUUGUGUGUUGUGGUAUUCUGCACACAAAUCAGUAAUAUUUUGAGGUGUUCCAUGUGCAUAUGCUGAAGAUAGUAAAACUGUAAAAAAAACACCAAAUUUUAAUUUAAUCAU")
rna
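

# To make the six reading frames concrete, here is a minimal pure-Python sketch (independent of scikit-bio) that lists the complete codons read in each frame of a short toy sequence. The helper names ``reverse_complement`` and ``codons_by_frame`` are hypothetical and exist only for this illustration; the frame labels follow the ``1``, ``2``, ``3``, ``-1``, ``-2``, ``-3`` convention described above.

# In[ ]:


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous RNA string."""
    complement = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq))


def codons_by_frame(seq):
    """Map each frame label to the list of complete codons read in that frame."""
    frames = {}
    for strand_sign, strand in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            # Ignore trailing bases that don't form a complete codon.
            end = len(strand) - (len(strand) - offset) % 3
            frames[strand_sign * (offset + 1)] = [
                strand[i:i + 3] for i in range(offset, end, 3)]
    return frames


# Hypothetical 10-base toy sequence, not taken from the cholera toxin RNA.
toy_frames = codons_by_frame("AUGGCCUAAG")
for frame in (1, 2, 3, -1, -2, -3):
    print(frame, toy_frames[frame])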


# Next let's create a genetic code object. The default genetic code in scikit-bio is the standard genetic code (NCBI translation table 1), but others exist which contain minor differences (e.g., codons code for different amino acids, or the set of start or stop codons is slightly different) and can be obtained via `GeneticCode.from_ncbi`. Since we're going to translate the cholera toxin RNA sequence (produced by the *Vibrio cholerae* bacterium), we'll use NCBI's [Bacterial, Archaeal and Plant Plastid Code](http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG11) (transl_table=11):

# In[2]:


from skbio import GeneticCode
gc = GeneticCode.from_ncbi(11)
gc
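

# As a rough illustration of how two genetic codes can differ, the sketch below loads NCBI tables 1 and 11 and compares their amino acid and start codon assignments. It assumes the ``name``, ``amino_acids``, and ``starts`` attributes of ``GeneticCode`` as documented in recent scikit-bio releases; older versions may expose these differently.

# In[ ]:


from skbio import GeneticCode

standard = GeneticCode.from_ncbi(1)
bacterial = GeneticCode.from_ncbi(11)

print(standard.name)
print(bacterial.name)

# Each attribute is a 64-character sequence with one position per codon.
print("Same amino acid assignments:",
      str(standard.amino_acids) == str(bacterial.amino_acids))
print("Same start codons:",
      str(standard.starts) == str(bacterial.starts))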


# We can now perform six-frame translation of our RNA sequence. ``translate_six_frames`` returns a generator of the six translated sequences:

# In[3]:


for e in gc.translate_six_frames(rna):
    print(repr(e), end="\n\n")


# The six protein sequences represent each possible reading frame in the RNA sequence, but start and stop codons are not taken into account by default (both ``start`` and ``stop`` default to ``'ignore'``) since we don't know if we have a full-length sequence.
# 
# If instead we want to look only at putative proteins coded by each sequence, we could require that translation start at a start codon, and stop at a stop codon, and then review the resulting sequences.

# In[4]:


for e in gc.translate_six_frames(rna, start='require', stop='require'):
    print(repr(e), end="\n\n")
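

# As an aside: with ``start='require'`` or ``stop='require'``, scikit-bio's documented behavior is to raise a ``ValueError`` when a frame has no start codon or no stop codon. The sketch below translates one frame at a time and skips frames that fail. It uses a short hypothetical toy sequence (``toy_rna``) rather than the sequence above, and assumes the ``reading_frames`` attribute and the ``reading_frame`` parameter of ``GeneticCode.translate`` as found in recent scikit-bio releases.

# In[ ]:


from skbio import RNA, GeneticCode

toy_rna = RNA("AUGCCACUUUAA")
toy_gc = GeneticCode.from_ncbi(11)

putative = {}
for frame in toy_gc.reading_frames:
    try:
        putative[frame] = toy_gc.translate(
            toy_rna, reading_frame=frame, start='require', stop='require')
    except ValueError:
        # This frame lacks a start codon or a stop codon, so skip it.
        continue

for frame, protein in sorted(putative.items()):
    print(frame, str(protein))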


# Note that the sequences from the six-frame translation with ``start='require'`` and ``stop='require'`` each start with M (the amino acid encoded by AUG, the canonical start codon in this genetic code) and that the stop translation character has been trimmed off. One of these, the ``-1`` frame (i.e., read from position 1 of the reverse complement of the input sequence), is longer than the others and so looks more like a real protein-coding sequence.
# 
# As a next step, try searching the putative proteins against a reference database (e.g., by BLASTing them using [NCBI's `blastp` tool](http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome)) to figure out which of these might be an actual protein-coding sequence and what it codes for.
# 
# In the future, we'll have support for remote database searching in scikit-bio. You can track progress on that [here](https://github.com/biocore/scikit-bio/issues/225).