Table of Contents 0. Purpose 0. Background 0. Goals 0. Hints 0. Getting started 0. Question 1 0. Question 2 0. Question 3 0. Question 4 0. More hints
The purpose of this exercise is to have you combine a few of the topics we have covered in class up to this point, and get you working with code. (Don't panic: you don't have to write code - everything in this assignment can be achieved with cutting and pasting code that we've already written for you.)
You will need to be familiar with local and global alignments (see the pairwise alignment section) and the process of PCR (see the molecular biology section). The aim is to develop your bioinformatics problem solving skills, while also introducing you to interacting with the IPython Notebook.
You should start by reading the section 4.3 here of Genomes, by TA Brown. You may find the entire chapter useful.
A common process in bioinformatics is looking at the composition of microorganisms in a given environment. For instance, we could take a sample from a desk in an office, the gut of a human subject, or the Southern Ocean and ask what microorganisms are present in each of the different samples. The most common way to answer this question is to sequence the 16S rRNA gene. What makes this gene so useful is that it contains several conserved areas, which means that we can isolate it from the full genomes of many organisms using PCR, however it also contains highly variable regions so that we can tell the organisms apart. The sequence of the 16S rRNA therefore serves as a fingerprint for a microorganism. If we find it in a sample, that suggests that the organism is present in that sample.
In this exercise we will provide you with the full-length 16S rRNA sequences from five different bacterial organisms, and ten candidate primer sequences. Using both global and local sequence alignment, we'll ask you to select what you think is the best single primer pair to use for amplifying 16S rRNA from these five organisms for use in profiling diverse communities of microorganisms.
Throughout this notebook there will be questions that you are required to answer. They will be written in bold so you know they are required and they should also help you with the overall process.
The first thing you will want to do is import a couple of functions that will be necessary for this problem.
%pylab inline
from iab.algorithms import sw_align_nt, nw_align_nt
Next, in order to make sure the function was imported properly, and to see how it works run the help
command on it. Read the help text carefully, it will be important to understand exactly what each function does.
help(sw_align_nt)
help(nw_align_nt)
This next function, slice_sequence
, will let you let you easily extract segments of a sequence that are of interest to you (for example, the region between where two primers align.
def slice_sequence(sequence, start_pos, end_pos):
""" Given a sequence, return the substring between start_pos and end_pos
Parameters
----------
sequence: string
The sequence to be sliced
start_pos: int
The starting position for the new sequence
end_pos: int
The ending position for the new sequence
Returns
-------
string
A substring of the input string between start_pos and end_pos
"""
if len(sequence) == 0:
raise ValueError("The sequence is empty")
if start_pos < 1:
raise ValueError("Starting position must be greater than zero.")
if end_pos > len(sequence):
raise ValueError("Ending position cannot be larger than the length of the sequence.")
if start_pos > end_pos:
raise ValueError("The starting position must be less than the ending positions.")
return sequence[start_pos-1:end_pos]
The following cell contains the full-length 16S rRNA sequences of five diverse bacterial organisms. Make sure to run this cell in order to load the sequences into memory.
# k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterobacteriaceae; g__Escherichia; s__coli
s1 = ['656881','GGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAGGGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTTGTTGGTGGGGTAACGGCTCACCAAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACGGAAGTTTTCAGAGATGAGAATGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTA']
# k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__S24-7; g__; s__
s2 = ['305251','GATGAACGCTAGCGACAGGCTTAACACATGCAAGTCGAGGGGCAGCGAGATTGTGGCAACACGATTGTCGGCGACCGGCGCACTGGTGAGTAACACGTATGCAACCTGCCGCGCACTGGGGGATAATCTTGGGAAACCGAGTCTAATACCCCGTAGGCCTTGTTGCCGCATGGTAATAAGGTAAGAGGAGTGATCCGATGCGCGATGGGCATGCGGCGCATTAGCTAGTTGGCGGGGTAACAGCCCACCAAGGCGACGATGCGTAGGGGTTCTGAGAGGAAGGTCCCCCACACTGGTACTGAGACACGGACCAGACTCCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGCGGAAGCCTGAACCAGCCAAGTCGCGTGCGGGAGGGAGGCCCTACGGGTCGTAAACCGCTTTTGATGGGGGGTAACCATGCGGACGAGTCCGCATCTGAGAGCACCCATCGAATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGCTATTCAAGTCAGCGGTCAAATTGCGGTGCTCAACGCCGTATCGCCGTTGAAACTGAGTTGGCTAGAGTGAGAGTGAGGAAGGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATTGCGCAGAACTCCGATTGCGAAGGCAGCTTTCCAATTCTCTACTGACGCTCATGCACGAAAGCGTGGGTATCGAACAGGATTAGATACCCTGGTAGTCCACGCGGTAAACGATGGTCACTAGCTGTGCGCCCTGATTAAAGGGAGCGTGGCCGAGCGAAAGCGTTAAGTGACCCACCTTGGGAGTACGCCGGCAACGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTCAAACGCTTGCGGGAGTCTATTTGAAAGGATAGATGCCCTTCGGGGCTGCAAGCGAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCCCATCTTCAGTTGCCGTCGGGTAGAGCCGGGCACTCTGGAGAGACTGCCGGCGCAAGCTGTGAGGAAGGCGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGCGGGGCACAGAGGGAAGCCAGGCGGTGACGTCGAGCGGATCCCGAAAACCCGTCTCAGTTCGGATCGGAGTCTGCAGCTCGACTCCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCATGGCGCGGTGAATACGTTCCCGGGCCT']
# k__Bacteria; p__Cyanobacteria; c__Chloroplast; o__Chlorophyta; f__Ulvophyceae; g__; s__
s3 = ['577032', 'ATGAACGCTGGCGGCATGCTTAACACATGCAAGTTGAACGGGTTTAAGTTTATTAAACTTAAACAAGTAGCGGACGGGTGAGTAACGCGTAAGAACCTACCTTTAGGTAAGGAATAACTATTGGAAGCGATAGATAATACCTTATAAGCTTATAGTAAAAGATAAAATCGCCTAAGGATGGGCTTGCGTCTGATTAGCTTGTTGGTGATTTAAAAGATTCACCAAGGCAACGATCAGTAGTTGGTCTAAGAGGATGATCAACCACACTGGGACTGAGACACGGCCCAGACCTCTACGGAGGGCAGCAGTGAGGAATTTTCCGCAATGGGCGAAAGCCTGACGGAGCAATGCCGCGTGGAGGATGAAAGCTTGTGAGTCGTAAACTCCTTTTCTTAGTGAAGAAATAAGACGGTATCTAAGGAATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCAAGCGTTATTCGGAATTATTGGGCGTAAAGCGTCTGTAGGTGGTTTTTTAAGTCTACTGTTAAATATTAAGGCTTAACCTTAAAAAAGCGGTATGAAACTAAAAAACTTGAGTTTAGTAGAGGTAGAGGGAATTCTCGGTGTAGTGGTGAAATGCGTAGAGGTCGGGAAGAACACCGGTAGCGAAAGCGCTCTACTGGGCTAAAACTGACACTCAGAGACGAAAGCTAAAGTAGCGAATGGGATTAGATACCCCAGTAGTCTTAGCTGTAAACGATGGGTACTAGATGTTGCGCGTATCGATCCGTGCAGTATCGTAGCTAACGCGTTAAGTACCCCGCCTGGGAAGTATGCTCGCAAGAGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCAGAACTTGACATGTCACAAATTTTCTTGAAAAAGAAAAGTGCCTTAGGGAGTGTGAACACAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTTTTTAGTTGCCATCATTTAGTTGGGAACTCTAAAAAGACTGCCGGTGACAAACCGGAGGAAGGTGAGGATGACGTCAAGTCAGCATGCCCCTTATGTTCTGGGCTACACACGTGCTACAATGATTATGACAAAGGGTAGCGAATTCGCGAGAATCAGCCAATCTCATAAACATAGTCTAAGTTCGGATTGCAGGCTGAAACTCGCCTGCATGAAGCTGGAATCGCTAGTAATCGCCGGTCAGCTATACGGCGGTGAATCCGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGAAGTTGGCTACGCCCGAAGTCGTTATCTTAACCTTTTTGGAGGGAGGCGCCTAAGGTGGAGCCAGTGACTGGGGTGA']
# k__Bacteria; p__GN02; c__BD1-5; o__; f__; g__; s__
s4 = ['200762','AGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCGTGCCTAACACATGCAAGTCGAGCGGTAACGGGTGTAGCAATACATGCTGACGAGCGGCGGACGGGTGAGCAATATTTGGGAATCTGCCTATTAGTGGGGGACAACCCGGGGAAACTCGGGCTAATACCGCATACGCTCTACGGAGGAAAGCCGGGGACCGCAAGGCCTGGCGCTAATAGATGAGCCCAAATCGGATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCGACGATCCGTAGCTGGTCTGAGAGGACGACCAGCCACACCGGAACTGAGACACGGTCCGGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATGGAGCGACACCGCGTGAAGGATGAAGCCTTTGTTGGTGTAAACTTCTTTTCTCTGGGAAGATAATGACGGTACCAGAGGAATAAGGGGCGGCAAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGCCCCAAGCGTTATCCGGAATTACTGGGCGTAAAGCGTCTGTAGGTGGTCTGGAAAGTTTCAAGTGAAAGGTCAGGGCTTAACCCTGTACTTGCTTGGAAAACTATCAGACTTGAGTGCGGGAGAGGCAAGCAGAACTGTATGAGTAGGGGTGCAATCCGTTGATACATACAAGAATACCAAAAGCGAAGGCAGCTTGCTGGAACGCTACTGACACTGAGAGACGAAAGCGTGGGGAGCAAAAGGGATTAGATACCCCTGTAGTCCACGCCCTAAACGATGGATGCTAAATGTCGGCGCAAGCCGGTGTTTCAAGCTAACGCATTAAGCATCCCGCCTGAGGAGTACGGCCGCAAGGTTAAAACTCAAATGAATAGACGGGGACCCGCACAAGCAGTGGATCATGTGGTTTAATTCGACACTAAACGAGGAACCTCACCTAGGCTTGACATTGATAGAATTTGCTGGAAACAGCGAAGTGCCTGCAAGGGAACTTGAAAACAGGCGCTGCATGGTTGTCGTCAGCTCGTGCCTTGAGGTGTTCGGTTAAGTCCGTTAACGAGCGCAACCCATGTCGTTAGTTATTATGTCTAACGAGACTGCTCGAGTTAATCGAGAGGAAGGTGTGGATGACGTCAAATCAGCATGGCCCTTATGCCTAGGGCTACACACATGATACAATGGTCGGTACAAAGGGTTGCCAAGTGGTAACACGGAGCCAATCCCAGAAAGCCGATCTCAGTCCAGATTGAGGGCTGCAACTCGCCCTCATGAAGTTGGAATTGCTAGTAATCGTGAATCAGCTATGTCACGGTGAATCTGTTCCCGGGTCTTGTACTCACCGCCCGTCAAACCATGGGAGGTGTGCGTACCTGAAGTCCTTCGAGTAATACGGAGGCCCACGGTAAACACACTGACTGGGGTTAAGTCGTAACAAGGTA']
# k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterobacteriaceae; g__; s__
s5 = ['3728119','CTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGATGACGGGAGCTTGCTCCTTGATTCAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGACAACGTTTCGAAAGGAACGCTAATACCGCATACGTCCTACGGGAGAAAGCAGGGGACCTTCGGGCCTTGCGCTATCAGATGAGCCTAGGTCGGATTAGCTAGTAGGTGAGGTAATGGCTCACCTAGGCGACGATCCGTAACTGGTCTGAGAGGATGATCAGTCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGTCTTCGGATTGTAAAGCACTTTAAGTTGGGAGGAAGGGCAGTAAGCTAATACCTTGCTGTTTTGACGTTACCGACAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATACGGCAGGGATGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGCTAGGCTGGTTCGTTAAGTCTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGAGCTAGAGTATTGTAGAGGGTGGTGGAATTTCAGGTGTAGCGGTGAAATGCGTAGATATCTGGAAGGAATACCGGTGGCGAAGGCGACCCCCTGGACATGATACTGACGCTCATGTCGTCTTAGCGATAATGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGTTTGGCTTCCGGAGCTAACGCGTTAAGTTGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTACTCTTGACATCCAGAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTCTGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTTTGGCCGGGAACTCATAGGAAACTGCCAGTGATCAACTGGAAGAAAGTGGGGATGACCTCCAGTCATCATGGCCCTTACCAGTATGGCTACACACGTGCTACAATGGCGCATACCAAGAGAATCGACCTCGCGAGAGCGAGCGGACTTTATCAAGTGCGTCGTAATCCGGATTGGAGTCTGCCACTCTCACTCGATGAAGTCCGAATCGCTAGTAATCGTGGATTCAGAATTGCTTCGGTGTGAATATCGTTCCCGGGCCTTTGTACACACCCGCCCGGTCACACCATGGG']
sequences = [s1, s2, s3, s4, s5]
You can now use slice_sequence
to extract a region of interest from a sequence - for example, the region between the starting position of two primer hits. The example below would extract the region between 50 and 250 from sequence 1.
slice_sequence(sequences[0][1], 50, 250)
Hint: The numbers here may look a little bit funny. That's because the first item in a list in python is 0, not 1. So, if you want the first entry from the list sequences
you would type sequences[0]
, and if you want the second sequence you would type sequences[1]
. Try this in the cell below until you are comfortable getting the sequence you want. Then try to slice a sequence to the region between positions 25 and 50.
This cell contains the potential primers that we will use to amplify specific regions.
primers = [('p1', 'TTCCGGTTGATCCNGCCGGA'), # F21
('p2', 'ACNGCTCAGTAACACGT'), # F109
('p3', 'GCTGCCTCCCGTAGGAGT'), # F338
('p4', 'TACGGNAGGCAGCAG'), # F343
('p5', 'GTGCCAGCNGCCGCGGTAA'), # F515
('p6', 'ATTAGATACCCNGGTAGTCC'), # F770
('p7', 'ATTAGATACCCNNGTAGTCC'), # R806 (reverse complement)
('p8', 'AGGAATTGGCGGGGCAGCAC'), # R915 (reverse complement)
('p9', 'AAACTNAAAGGAATTGACGG'), # F926
('p10', 'AGGTNNGNATGCCCCNAA')] # R1240 (reverse complement)
If you want to locally align a primer against a primer, a primer against a sequence, or a sequence against a sequence, you could run the following:
aln1, aln2, score, start1, start2 = sw_align_nt(primers[0][1], primers[1][1])
print(aln1)
print(aln2)
print(score)
print(start1)
print(start2)
aln1, aln2, score, start1, start2 = sw_align_nt(primers[0][1], sequences[0][1])
print(aln1)
print(aln2)
print(score)
print(start1)
print(start2)
aln1, aln2, score, start1, start2 = sw_align_nt(primers[0][1], sequences[0][1])
print(aln1)
print(aln2)
print(score)
If you want to globally align a primer against a primer, you could run the following:
aln1, aln2, score = nw_align_nt(primers[0][1], primers[1][1])
print(aln1)
print(aln2)
print(score)
Notice that there are two additional return values from sw_align_nt
than there are from nw_align_nt
. Look at the help for each function to figure out what these values are. Why does it make sense to get them from sw_align_nt
, but not from nw_align_nt
?
In the cell below, try to globally align a primer against a sequence. Then try to globally align a different primer against a different sequence.
Hint: if you want to normalize an alignment score by it's length, you can do the following. This may come in handy when comparing alignments.
aln1, aln2, score = nw_align_nt(primers[0][1], sequences[0][1])
print(score / len(aln1))
At this point you have the necessary functions to complete the assignment. Your goal is to pick the best pair of 16S primers, a forward and a reverse primer, given the above five input sequences.
The best primers will:
What is the difference between a local and global alignment in terms of what is aligned? What are the differences in the algorithms that support this? (One paragraph)
What is the sequence in s2 from position 500 to 505?
Hint: the first base in a sequence is position 1; the first item in a Python list is index 0.
What is the Smith-Waterman alignment score of primer p6
against sequence s4
? Where in s4
does the alignment start?
Hint: copy and paste from other cells, that way you only have to make a couple small changes.
What is the best pair of primers from the list of available primers to use for amplifying the 16S region for sequencing for the purposes of identifying the organisms present?
The best pair of primers will align well to (and therefore be likely to anneal with) the 16S sequences from all of the organisms present in the list.
The best pair of primers will amplify a region of DNA that is between 100 and 400 base pairs long.
The best pair of primers will amplify a region that is highly variable (in other words, the amplified regions across the organisms should not align well).
Think about whether you want to use global or local alignments for the different steps. Are there times when you would want to use a gap penalty other than the default?
The N
character present in some of the primer sequences signifies a "degenerate" base, meaning it could be an 'A', 'T', 'G' or 'C'. You shouldn't worry about these for this exercise.
Good Luck!