#!/usr/bin/env python
# coding: utf-8

# Several recent studies of amplicon taxonomic assignment methods have suggested that training Naive Bayes taxonomic classifiers against only the region of a sequence that was amplified, rather than a full length sequence, will give better taxonomic assignment results ([Mizrahi-Man et al. 2013](http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0053608), [Werner et al. 2012](http://www.nature.com/ismej/journal/v6/n1/full/ismej201182a.html)). 
# 
# This recipe shows how an alignment can be trimmed to cover only the region bounded by a pair of primers, using scikit-bio's functionality for sequence matching. The list of [``skbio.DNA``](http://scikit-bio.org/docs/0.5.0/generated/skbio.sequence.DNA.html) objects that results from this process could be used to train a Naive Bayes classifier.
# 
# First we'll load an alignment into a [``skbio.TabularMSA``](http://scikit-bio.org/docs/0.5.0/generated/skbio.alignment.TabularMSA.html) object (*MSA* stands for *multiple sequence alignment*).
# We're going to work with the [qiime-default-reference](https://github.com/biocore/qiime-default-reference) so we have easy access to some sequences. If you want to adapt this recipe to create a region specific reference collection for another set of sequences, just assign the ``reference_alignment_filepath`` variable to a new filepath.

# In[1]:


import qiime_default_reference
import skbio

reference_alignment_filepath = qiime_default_reference.get_template_alignment()
reference_alignment = skbio.TabularMSA.read(reference_alignment_filepath, constructor=skbio.DNA)
reference_alignment


# Next, we'll define the forward and reverse primers as ``DNA`` objects. The primers that we're using here are pulled from [Supplementary File 1](http://www.nature.com/ismej/journal/v6/n8/extref/ismej20128x2.txt) of [Caporaso et al. 2012](http://www.nature.com/ismej/journal/v6/n8/full/ismej20128a.html). Note that we're reverse complementing the reverse primer when we load it here. For the sake of computational efficiency, scikit-bio does not automatically try to match the reverse or reverse complement of a sequence, as some programs like BLAST and uclust do.

# In[2]:


fwd_primer = skbio.DNA("GTGCCAGCMGCCGCGGTAA", metadata={'id': 'fwd-primer'})
rev_primer = skbio.DNA("GGACTACHVGGGTWTCTAAT", metadata={'id': 'rev-primer'}).reverse_complement()
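

# To make the reverse-complement step concrete, here is a minimal pure-Python sketch of what reverse complementing a degenerate primer involves. The names ``TOY_IUPAC_COMPLEMENT`` and ``toy_reverse_complement`` are hypothetical helpers for illustration only; the recipe itself relies on scikit-bio's ``DNA.reverse_complement``.

# In[ ]:


# Each IUPAC code maps to the code that represents the complementary bases
# (for example, R = A/G pairs with Y = C/T; S, W and N are their own complements).
TOY_IUPAC_COMPLEMENT = str.maketrans('ACGTRYSWKMBDHVN', 'TGCAYRSWMKVHDBN')


def toy_reverse_complement(sequence):
    """Complement every character, then reverse the result."""
    return sequence.translate(TOY_IUPAC_COMPLEMENT)[::-1]


print(toy_reverse_complement("GGACTACHVGGGTWTCTAAT"))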


# These primers contain some degeneracies, or characters representing multiple bases. In practice, this means that each of these primer sequences actually represents a collection of sequences, once the degeneracies are expanded. We can see what non-degenerate sequences are represented by our degenerate forward primer sequence as follows:

# In[3]:


for nondegenerate_sequence in fwd_primer.expand_degenerates():
    print(nondegenerate_sequence)
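

# As a rough sketch of what expanding degeneracies means, the same idea can be written with only the standard library. ``TOY_IUPAC_BASES`` and ``toy_expand_degenerates`` are hypothetical names used for illustration; they are not part of scikit-bio, and the order of the expanded sequences may differ from what ``expand_degenerates`` yields.

# In[ ]:


from itertools import product

# Standard IUPAC nucleotide codes and the bases each one stands for.
TOY_IUPAC_BASES = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT', 'K': 'GT', 'M': 'AC',
    'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT',
}


def toy_expand_degenerates(sequence):
    """Return every non-degenerate sequence represented by ``sequence``."""
    choices = [TOY_IUPAC_BASES[base] for base in sequence]
    return [''.join(combination) for combination in product(*choices)]


# The forward primer has a single degenerate position (M = A or C),
# so it expands to two sequences.
toy_fwd_variants = toy_expand_degenerates("GTGCCAGCMGCCGCGGTAA")
print(toy_fwd_variants)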


# We're going to match any of the non-degenerate variants of each primer in this example.
# 
# The typical way to approach the problem of finding the boundaries of a short sequence in a longer sequence would be to use pairwise alignment. However, we're going to try a different approach here, since pairwise alignment is inherently slow (it scales quadratically). Because these are sequencing primers, they're designed to be unique (so there shouldn't be multiple matches of a primer to a sequence), and they're designed to match as many sequences as possible. So let's try using regular expressions to match our sequencing primers in the reference database. Regular expression matching scales linearly, so it is much faster to apply to many sequences. The scikit-bio [``GrammaredSequence``](http://scikit-bio.org/docs/0.5.0/generated/skbio.sequence.GrammaredSequence.html) objects have a method, [``to_regex``](http://scikit-bio.org/docs/0.5.0/generated/skbio.sequence.GrammaredSequence.to_regex.html), that converts a degenerate (or nondegenerate) sequence into a regular expression.

# In[4]:


fwd_primer.to_regex()


# In[5]:


rev_primer.to_regex()
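

# The conversion from a degenerate sequence to a regular expression can be sketched in plain Python: each degenerate character becomes a character class listing the bases it stands for. ``TOY_REGEX_BASES`` and ``toy_degenerate_to_regex`` are hypothetical helpers for illustration, and the exact pattern text that scikit-bio's ``to_regex`` produces may differ from this sketch.

# In[ ]:


import re

# Standard IUPAC nucleotide codes and the bases each one stands for.
TOY_REGEX_BASES = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT', 'K': 'GT', 'M': 'AC',
    'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT',
}


def toy_degenerate_to_regex(sequence):
    """Build a regex pattern string that matches any expansion of ``sequence``."""
    parts = []
    for base in sequence:
        bases = TOY_REGEX_BASES[base]
        # Non-degenerate characters are kept as-is; degenerate ones
        # become a character class such as [AC].
        parts.append(bases if len(bases) == 1 else '[' + bases + ']')
    return ''.join(parts)


toy_fwd_pattern = toy_degenerate_to_regex("GTGCCAGCMGCCGCGGTAA")
print(toy_fwd_pattern)
print(bool(re.fullmatch(toy_fwd_pattern, "GTGCCAGCAGCCGCGGTAA")))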


# We can use these to create a new regular expression that will match the sequence, excluding the forward and reverse primers (since those are nearly the same in all sequences, they won't be useful in training a taxonomic classifier, so we choose not to include them). 

# In[6]:


regex = '{0}(.*){1}'.format(fwd_primer.to_regex().pattern,
                            rev_primer.to_regex().pattern)
print(regex)


# Next, let's apply this regular expression to all of our reference sequences. This will let us find out how many reference sequences our pattern matches, and the start and stop positions of each match.

# In[7]:


import pandas as pd

starts = []
stops = []

seq_count = 0
match_count = 0

for seq in reference_alignment:
    seq_count += 1
    for match in seq.find_with_regex(regex, ignore=seq.gaps()):
        match_count += 1
        starts.append(match.start)
        stops.append(match.stop)

starts = pd.Series(starts)
stops = pd.Series(stops)
match_percentage = (match_count / seq_count) * 100
print('{0} of {1} ({2:.2f}%) sequences have exact matches to the regular expression.'.format(match_count, seq_count, match_percentage))
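

# To illustrate how a match can be found while skipping gap characters and still be reported in alignment coordinates, here is a minimal standard-library sketch on a made-up aligned sequence. ``toy_find_ignoring_gaps`` is a hypothetical helper, not scikit-bio's implementation; it assumes the pattern has one capture group and only reports the first match.

# In[ ]:


import re


def toy_find_ignoring_gaps(aligned_sequence, pattern, gap_chars='-.'):
    """Return (start, stop) of capture group 1 in aligned coordinates, or None."""
    # Remember which aligned positions hold real (non-gap) characters.
    kept_positions = [i for i, char in enumerate(aligned_sequence)
                      if char not in gap_chars]
    ungapped = ''.join(aligned_sequence[i] for i in kept_positions)

    match = re.search(pattern, ungapped)
    if match is None:
        return None
    start, stop = match.span(1)
    if start == stop:
        # An empty group has no characters to map back to the alignment.
        return None
    # Translate ungapped coordinates back to positions in the alignment.
    return kept_positions[start], kept_positions[stop - 1] + 1


toy_aligned = "--AC-GTTT-GA-CC--"
toy_span = toy_find_ignoring_gaps(toy_aligned, "ACG(.*)CC")
print(toy_span, toy_aligned[toy_span[0]:toy_span[1]])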


# Because we're matching our primers against aligned reference sequences, finding matches in around 60% of them is enough to tell us where to slice all of the sequences: the purpose of a multiple sequence alignment is to normalize position numbers across all of the sequences in a collection. One problem with matching against an alignment, though, is that gap characters would interrupt our matches. We get around this using the ``ignore`` parameter to [``DNA.find_with_regex``](http://scikit-bio.org/docs/0.5.0/generated/skbio.sequence.DNA.find_with_regex.html), which takes a boolean vector (a fancy name for an array or list of boolean values) indicating positions that should be ignored in the regular expression match.
# 
# We can next look at the distribution of start positions and stop positions for each match. As you can see, these nearly always match the same position, and therefore tell us where we want to slice our aligned sequences to extract the region between our primers.

# In[8]:


starts.describe()


# In[9]:


stops.describe()


# The positions that we want to slice our alignment at can be found as follows:

# In[10]:


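# Series.mode() returns a Series (sorted, possibly with several tied values),
# so take its first element explicitly before converting to int.
fwd_primer_end = int(starts.mode().iloc[0])
rev_primer_start = int(stops.mode().iloc[0])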
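

# The idea of taking the most common position can also be sketched without pandas. ``toy_most_common_position`` is a hypothetical helper and the positions below are made-up values for illustration; when several positions are tied, this sketch returns the smallest one.

# In[ ]:


from collections import Counter


def toy_most_common_position(positions):
    """Return the most frequent value, breaking ties by choosing the smallest."""
    counts = Counter(positions)
    highest_count = max(counts.values())
    return min(position for position, count in counts.items()
               if count == highest_count)


# Made-up match positions: most matches agree, with one outlier.
toy_starts = [2227, 2227, 2230, 2227]
print(toy_most_common_position(toy_starts))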


# Next, we're ready to filter out all positions outside of this range. We do this by specifying that range using [``TabularMSA.iloc``](http://scikit-bio.org/docs/0.5.0/generated/skbio.alignment.TabularMSA.iloc.html). This gives us a ``TabularMSA`` that has the same number of sequences that we started with, but fewer positions in each ``Sequence``. 

# In[11]:


filtered_reference_alignment = reference_alignment.iloc[..., fwd_primer_end:rev_primer_start]


# In[12]:


filtered_reference_alignment


# Because Naive Bayes classifiers are generally trained on unaligned sequences, we'd also want to remove all gaps from the alignment to get the unaligned sequences. These are the sequences that we'd want to use to train our classifier.

# In[13]:


filtered_reference_sequences = [e.degap() for e in filtered_reference_alignment]
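

# The slicing and degapping steps can be sketched together on a made-up alignment of plain strings. ``toy_slice_and_degap`` and ``toy_alignment`` are hypothetical names used only for illustration; the recipe itself uses ``TabularMSA.iloc`` and ``DNA.degap``.

# In[ ]:


def toy_slice_and_degap(aligned_sequences, start, stop, gap_chars='-.'):
    """Keep alignment columns start..stop-1, then drop gap characters."""
    trimmed = [sequence[start:stop] for sequence in aligned_sequences]
    return [''.join(char for char in sequence if char not in gap_chars)
            for sequence in trimmed]


# Every row has the same length, so one column range applies to all rows,
# including rows where the primers themselves would not have matched exactly.
toy_alignment = ["ACG-TTTGA-CC",
                 "ACGCTT-GAGCC",
                 "A-G-TTTGA-C-"]
toy_trimmed = toy_slice_and_degap(toy_alignment, 3, 10)
print(toy_trimmed)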


# In[14]:


filtered_reference_sequences[0]


# In[15]:


filtered_reference_sequences[-1]