Several recent studies of amplicon taxonomic assignment methods have suggested that training Naive Bayes taxonomic classifiers against only the region of a sequence that was amplified, rather than a full length sequence, will give better taxonomic assignment results (Mizrahi-Man et al. 2013, Werner et al. 2012).

This recipe shows how an alignment can be trimmed to cover only a region bound by a pair of primers, using scikit-bio's functionality for sequencing matching. The list of skbio.DNA objects that results from this process could be used to train a Naive Bayes classifier.

First we'll load an alignment into a skbio.TablularMSA object (MSA stands for multiple sequence alignment). We're going to work with the qiime-default-reference so we have easy access to some sequences. If you want to adapt this recipe to create a region specific reference collection for another set of sequences, just assign the reference_alignment_filepath variable to a new filepath.

In [1]:
import qiime_default_reference
import skbio

reference_alignment_filepath = qiime_default_reference.get_template_alignment()
reference_alignment

Out[1]:
TabularMSA[DNA]
-----------------------------------------------------------------------
Stats:
sequence count: 4797
position count: 7682
-----------------------------------------------------------------------
--------------------------------- ... ---------------------------------
--------------------------------- ... ---------------------------------
...
--------------------------------- ... ---------------------------------
--------------------------------- ... ---------------------------------

Next, we'll define the forward and reverse primers as DNA objects. The primers that we're using here are pulled from Supplementary File 1 of Caporaso et al. 2012. Note that we're reverse complementing the reverse primer when we load it here. scikit-bio does not automatically try to match the reverse or reverse complement of a sequence, as some programs like BLAST and uclust do (for the sake of computational efficency).

In [2]:
fwd_primer = skbio.DNA("GTGCCAGCMGCCGCGGTAA", {'id':'fwd-primer'})
rev_primer = skbio.DNA("GGACTACHVGGGTWTCTAAT", {'id':'rev-primer'}).reverse_complement()


These primers contain some degeneracies, or characters representing multiple bases. In practice, this means that each of these primer sequences actually represents a collection of sequences, once the degeneracies are expanded. We can see what non-degenerate sequences are represented by our degenerate forward primer sequence as follows:

In [3]:
for nondegenerate_sequence in fwd_primer.expand_degenerates():
print(nondegenerate_sequence)

GTGCCAGCCGCCGCGGTAA
GTGCCAGCAGCCGCGGTAA


We're going to match any of the non-degenerate variants of each primer in this example.

The typical way to approach the problem of finding the boundaries of a short sequence in a longer sequence would be to use pairwise alignment. But, we're going to try a different approach here since pairwise alignment is inherently slow (it scales quadratically). Because these are sequencing primers, they're designed to be unique (so there shouldn't be multiple matches of a primer to a sequence), and they're designed to match as many sequences as possible. So let's try using regular expressions to match our sequencing primers in the reference database. Regular expression matching scales linearly, so is much faster to apply to many sequences. The scikit-bio GrammaredSequence objects contain a method, to_regex, that allows for conversion of a degenerate (or nondegenerate) sequence into a regular expression.

In [4]:
fwd_primer.to_regex()

Out[4]:
re.compile(r'GTGCCAGC[CA]GCCGCGGTAA', re.UNICODE)
In [5]:
rev_primer.to_regex()

Out[5]:
re.compile(r'ATTAGA[AT]ACCC[GCT][GAT]GTAGTCC', re.UNICODE)

We can use these to create a new regular expression that will match the sequence, excluding the forward and reverse primers (since those are nearly the same in all sequences, they won't be useful in training a taxonomic classifier, so we choose not to include them).

In [6]:
regex = '{0}(.*){1}'.format(fwd_primer.to_regex().pattern,
rev_primer.to_regex().pattern)
print(regex)

GTGCCAGC[CA]GCCGCGGTAA(.*)ATTAGA[AT]ACCC[GCT][GAT]GTAGTCC


Next, let's apply this regular expression to all of our reference sequences. This will let us find out how many reference sequences our pattern matches, and the start and stop positions of each match.

In [7]:
import pandas as pd

starts = []
stops = []

seq_count = 0
match_count = 0

for seq in reference_alignment:
seq_count += 1
for match in seq.find_with_regex(regex, ignore=seq.gaps()):
match_count += 1
starts.append(match.start)
stops.append(match.stop)

starts = pd.Series(starts)
stops = pd.Series(stops)
match_percentage = (match_count / seq_count) * 100
print('{0} of {1} ({2:.2f}%) sequences have exact matches to the regular expression.'.format(match_count, seq_count, match_percentage))

3033 of 4797 (63.23%) sequences have exact matches to the regular expression.


Since we're matching our aligned reference sequences against an alignment, finding matches in around 60% of our sequences gives us an idea of how to slice all of our sequences, since the purpose of a multiple sequence alignment is to normalize the position numbers across all of the sequences in a sequence collection. One problem with matching against an alignment though is that the gaps in the alignment could make it harder to match our regular expression as the gaps would disrupt our matches. We get around this using the ignore parameter to DNA.find_with_regex, which takes a boolean vector (a fancy name for an array or list of boolean values) indicating positions that should be ignored in the regular expression match.

We can next look at the distribution of start positions and stop positions for each match. As you can see, these nearly always match the same position, and therefore tell us where we want to slice our aligned sequences to extract the region between our primers.

In [8]:
starts.describe()

Out[8]:
count    3033.000000
mean     2263.022090
std         1.916592
min      2220.000000
25%      2263.000000
50%      2263.000000
75%      2263.000000
max      2358.000000
dtype: float64
In [9]:
stops.describe()

Out[9]:
count    3033.000000
mean     4051.132542
std         7.281395
min      4050.000000
25%      4051.000000
50%      4051.000000
75%      4051.000000
max      4452.000000
dtype: float64

The positions that we want to slice our alignment at can be found as follows:

In [10]:
fwd_primer_end = int(starts.mode())
rev_primer_start = int(stops.mode())


Next, we're ready to filter out all positions outside of this range. We do this by specifying that range using TabularMSA.iloc. This gives us a TabularMSA that has the same number of sequences that we started with, but fewer positions in each Sequence.

In [11]:
filtered_reference_alignment = reference_alignment.iloc[..., fwd_primer_end:rev_primer_start]

In [12]:
filtered_reference_alignment

Out[12]:
TabularMSA[DNA]
-----------------------------------------------------------------------
Stats:
sequence count: 4797
position count: 1788
-----------------------------------------------------------------------
T--AC---AG-AG-GGT-GCA-A-G-CG-TTAA ... ----------G-TGGG-GAG-C-A-AACA--GG
T--AC---GA-AG-GGG-GCT-A-G-CG-TTGT ... ---------GTGGGGG--AG-C-G-AACA--GG
...
C--AC---CG-GC-AGC-TTA-A-G-TG-GTGA ... ----------C-AGGG-GAG-C-G-AACG--GG
T--AC---CA-GC-TCC-GCG-A-G-TG-GTTG ... ----------T-GGGG-GAG-C-A-AAGG--GG

Because Naive Bayes classifiers are generally trained on unaligned sequences, we'd also want to remove all gaps from the alignment to get the unaligned sequences. These are the sequences that we'd want to use to train our classifier.

In [13]:
filtered_reference_sequences = [e.degap() for e in filtered_reference_alignment]

In [14]:
filtered_reference_sequences[0]

Out[14]:
DNA
---------------------------------------------------------------------
'description': ''
'id': '1111561'
Stats:
length: 253
has gaps: False
has degenerates: False
has definites: True
GC-content: 52.57%
---------------------------------------------------------------------
0   TACAGAGGGT GCAAGCGTTA ATCGGATTGA CTGGGCGTAA AGGGCGCGTA GGCGGTAAGA
60  TAAGTCAGAT GTTAAAAACC CGAGCTCAAC TTGGGGACTG CATTTGAAAC TATCTCACTA
120 GAGTACAGTA GAGGAGAGCG GAATTTCCGG TGTAGCGGTG AAATGCGTAG ATATCGGAAG
180 GAACACCAGT GGCGAAGGCG GCTCTCTGGA CTGACACTGA CGCTGAGGCG CGAAAGCGTG
240 GGGAGCAAAC AGG
In [15]:
filtered_reference_sequences[-1]

Out[15]:
DNA
---------------------------------------------------------------------
'description': ''
'id': '4483258'
Stats:
length: 253
has gaps: False
has degenerates: False
has definites: True
GC-content: 66.01%
---------------------------------------------------------------------
0   TACCAGCTCC GCGAGTGGTT GGGGCGTTTA CTGGGCCTAA AGCGCCCGTA GCCGGCCCGG
60  TAAGTCACCC CTGAAATCCA TGGGCTCAAC CCATGGGCCG GGGGTGATAC TGCCGGGCTT
120 GGGGGTGGGA GAGGCCGAGG GTACTCCCGA GGTAGGGGCG AAATCCGATA ATCTCGGGAG
180 GACCACCAGT GGCGGAAGCG CTCGGCTGGA ACACGCCCGA CGGTGAGGGG CGAAAGCTGG
240 GGGAGCAAAG GGG