Notebook

4.1 Local sequence alignment exercises [edit]¶

Table of Contents 0. Purpose 0. Background 0. Goals 0. Hints 0. Getting started 0. Question 1 0. Question 2 0. Question 3 0. Question 4 0. More hints

4.1.1 Purpose [edit]¶

The purpose of this exercise is to have you combine a few of the topics we have covered in class up to this point, and get you working with code. (Don't panic: you don't have to write code - everything in this assignment can be achieved with cutting and pasting code that we've already written for you.)

You will need to be familiar with local and global alignments (see the pairwise alignment section) and the process of PCR (see the molecular biology section). The aim is to develop your bioinformatics problem solving skills, while also introducing you to interacting with the IPython Notebook.

4.1.2 Background [edit]¶

You should start by reading the section 4.3 here of Genomes, by TA Brown. You may find the entire chapter useful.

A common process in bioinformatics is looking at the composition of microorganisms in a given environment. For instance, we could take a sample from a desk in an office, the gut of a human subject, or the Southern Ocean and ask what microorganisms are present in each of the different samples. The most common way to answer this question is to sequence the 16S rRNA gene. What makes this gene so useful is that it contains several conserved areas, which means that we can isolate it from the full genomes of many organisms using PCR, however it also contains highly variable regions so that we can tell the organisms apart. The sequence of the 16S rRNA therefore serves as a fingerprint for a microorganism. If we find it in a sample, that suggests that the organism is present in that sample.

4.1.3 Goals [edit]¶

In this exercise we will provide you with the full-length 16S rRNA sequences from five different bacterial organisms, and ten candidate primer sequences. Using both global and local sequence alignment, we'll ask you to select what you think is the best single primer pair to use for amplifying 16S rRNA from these five organisms for use in profiling diverse communities of microorganisms.

Throughout this notebook there will be questions that you are required to answer. They will be written in bold so you know they are required and they should also help you with the overall process.

4.1.4 Hints [edit]¶

Read all of the cells containing text very carefully!

You may write code or use a text editor if you wish, however all of the tools necessary to answer the questions are present in this notebook.

Spend some time thinking about what the question is and how you can go about answering it. This assignment is based largely on problem solving skills and it may take time to develop a good strategy.

Get help, that's what office hours are for!

You are allowed to discuss the assignment with other students, however your work needs to be your own. Using or looking at code or commands generated by another student is strictly prohibited. If you're in doubt over whether some type of interaction is acceptable for this assignment, ask.

4.1.5 Getting started [edit]¶

The first thing you will want to do is import a couple of functions that will be necessary for this problem.

In [1]:

%pylab inline
from iab.algorithms import sw_align_nt, nw_align_nt

Next, in order to make sure the function was imported properly, and to see how it works run the help command on it. Read the help text carefully, it will be important to understand exactly what each function does.

In [2]:

help(sw_align_nt)

In [3]:

help(nw_align_nt)

This next function, slice_sequence, will let you let you easily extract segments of a sequence that are of interest to you (for example, the region between where two primers align.

In [4]:

def slice_sequence(sequence, start_pos, end_pos):
    """ Given a sequence, return the substring between start_pos and end_pos

        Parameters
        ----------
        sequence: string
            The sequence to be sliced
        start_pos: int
            The starting position for the new sequence
        end_pos: int
            The ending position for the new sequence

        Returns
        -------
        string
            A substring of the input string between start_pos and end_pos

    """
    if len(sequence) == 0:
        raise ValueError("The sequence is empty")
    if start_pos < 1:
        raise ValueError("Starting position must be greater than zero.")
    if end_pos > len(sequence):
        raise ValueError("Ending position cannot be larger than the length of the sequence.")
    if start_pos > end_pos:
        raise ValueError("The starting position must be less than the ending positions.")
    return sequence[start_pos-1:end_pos]

The following cell contains the full-length 16S rRNA sequences of five diverse bacterial organisms. Make sure to run this cell in order to load the sequences into memory.

In [5]:

# k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterobacteriaceae; g__Escherichia; s__coli
s1 = ['656881','GGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAGGGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTTGTTGGTGGGGTAACGGCTCACCAAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACGGAAGTTTTCAGAGATGAGAATGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTA']

# k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__S24-7; g__; s__
s2 = ['305251','GATGAACGCTAGCGACAGGCTTAACACATGCAAGTCGAGGGGCAGCGAGATTGTGGCAACACGATTGTCGGCGACCGGCGCACTGGTGAGTAACACGTATGCAACCTGCCGCGCACTGGGGGATAATCTTGGGAAACCGAGTCTAATACCCCGTAGGCCTTGTTGCCGCATGGTAATAAGGTAAGAGGAGTGATCCGATGCGCGATGGGCATGCGGCGCATTAGCTAGTTGGCGGGGTAACAGCCCACCAAGGCGACGATGCGTAGGGGTTCTGAGAGGAAGGTCCCCCACACTGGTACTGAGACACGGACCAGACTCCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGCGGAAGCCTGAACCAGCCAAGTCGCGTGCGGGAGGGAGGCCCTACGGGTCGTAAACCGCTTTTGATGGGGGGTAACCATGCGGACGAGTCCGCATCTGAGAGCACCCATCGAATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGCTATTCAAGTCAGCGGTCAAATTGCGGTGCTCAACGCCGTATCGCCGTTGAAACTGAGTTGGCTAGAGTGAGAGTGAGGAAGGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATTGCGCAGAACTCCGATTGCGAAGGCAGCTTTCCAATTCTCTACTGACGCTCATGCACGAAAGCGTGGGTATCGAACAGGATTAGATACCCTGGTAGTCCACGCGGTAAACGATGGTCACTAGCTGTGCGCCCTGATTAAAGGGAGCGTGGCCGAGCGAAAGCGTTAAGTGACCCACCTTGGGAGTACGCCGGCAACGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTCAAACGCTTGCGGGAGTCTATTTGAAAGGATAGATGCCCTTCGGGGCTGCAAGCGAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCCCATCTTCAGTTGCCGTCGGGTAGAGCCGGGCACTCTGGAGAGACTGCCGGCGCAAGCTGTGAGGAAGGCGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGCGGGGCACAGAGGGAAGCCAGGCGGTGACGTCGAGCGGATCCCGAAAACCCGTCTCAGTTCGGATCGGAGTCTGCAGCTCGACTCCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCATGGCGCGGTGAATACGTTCCCGGGCCT']

# k__Bacteria; p__Cyanobacteria; c__Chloroplast; o__Chlorophyta; f__Ulvophyceae; g__; s__
s3 = ['577032', 'ATGAACGCTGGCGGCATGCTTAACACATGCAAGTTGAACGGGTTTAAGTTTATTAAACTTAAACAAGTAGCGGACGGGTGAGTAACGCGTAAGAACCTACCTTTAGGTAAGGAATAACTATTGGAAGCGATAGATAATACCTTATAAGCTTATAGTAAAAGATAAAATCGCCTAAGGATGGGCTTGCGTCTGATTAGCTTGTTGGTGATTTAAAAGATTCACCAAGGCAACGATCAGTAGTTGGTCTAAGAGGATGATCAACCACACTGGGACTGAGACACGGCCCAGACCTCTACGGAGGGCAGCAGTGAGGAATTTTCCGCAATGGGCGAAAGCCTGACGGAGCAATGCCGCGTGGAGGATGAAAGCTTGTGAGTCGTAAACTCCTTTTCTTAGTGAAGAAATAAGACGGTATCTAAGGAATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCAAGCGTTATTCGGAATTATTGGGCGTAAAGCGTCTGTAGGTGGTTTTTTAAGTCTACTGTTAAATATTAAGGCTTAACCTTAAAAAAGCGGTATGAAACTAAAAAACTTGAGTTTAGTAGAGGTAGAGGGAATTCTCGGTGTAGTGGTGAAATGCGTAGAGGTCGGGAAGAACACCGGTAGCGAAAGCGCTCTACTGGGCTAAAACTGACACTCAGAGACGAAAGCTAAAGTAGCGAATGGGATTAGATACCCCAGTAGTCTTAGCTGTAAACGATGGGTACTAGATGTTGCGCGTATCGATCCGTGCAGTATCGTAGCTAACGCGTTAAGTACCCCGCCTGGGAAGTATGCTCGCAAGAGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCAGAACTTGACATGTCACAAATTTTCTTGAAAAAGAAAAGTGCCTTAGGGAGTGTGAACACAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTTTTTAGTTGCCATCATTTAGTTGGGAACTCTAAAAAGACTGCCGGTGACAAACCGGAGGAAGGTGAGGATGACGTCAAGTCAGCATGCCCCTTATGTTCTGGGCTACACACGTGCTACAATGATTATGACAAAGGGTAGCGAATTCGCGAGAATCAGCCAATCTCATAAACATAGTCTAAGTTCGGATTGCAGGCTGAAACTCGCCTGCATGAAGCTGGAATCGCTAGTAATCGCCGGTCAGCTATACGGCGGTGAATCCGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGAAGTTGGCTACGCCCGAAGTCGTTATCTTAACCTTTTTGGAGGGAGGCGCCTAAGGTGGAGCCAGTGACTGGGGTGA']

# k__Bacteria; p__GN02; c__BD1-5; o__; f__; g__; s__
s4 = ['200762','AGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCGTGCCTAACACATGCAAGTCGAGCGGTAACGGGTGTAGCAATACATGCTGACGAGCGGCGGACGGGTGAGCAATATTTGGGAATCTGCCTATTAGTGGGGGACAACCCGGGGAAACTCGGGCTAATACCGCATACGCTCTACGGAGGAAAGCCGGGGACCGCAAGGCCTGGCGCTAATAGATGAGCCCAAATCGGATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCGACGATCCGTAGCTGGTCTGAGAGGACGACCAGCCACACCGGAACTGAGACACGGTCCGGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATGGAGCGACACCGCGTGAAGGATGAAGCCTTTGTTGGTGTAAACTTCTTTTCTCTGGGAAGATAATGACGGTACCAGAGGAATAAGGGGCGGCAAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGCCCCAAGCGTTATCCGGAATTACTGGGCGTAAAGCGTCTGTAGGTGGTCTGGAAAGTTTCAAGTGAAAGGTCAGGGCTTAACCCTGTACTTGCTTGGAAAACTATCAGACTTGAGTGCGGGAGAGGCAAGCAGAACTGTATGAGTAGGGGTGCAATCCGTTGATACATACAAGAATACCAAAAGCGAAGGCAGCTTGCTGGAACGCTACTGACACTGAGAGACGAAAGCGTGGGGAGCAAAAGGGATTAGATACCCCTGTAGTCCACGCCCTAAACGATGGATGCTAAATGTCGGCGCAAGCCGGTGTTTCAAGCTAACGCATTAAGCATCCCGCCTGAGGAGTACGGCCGCAAGGTTAAAACTCAAATGAATAGACGGGGACCCGCACAAGCAGTGGATCATGTGGTTTAATTCGACACTAAACGAGGAACCTCACCTAGGCTTGACATTGATAGAATTTGCTGGAAACAGCGAAGTGCCTGCAAGGGAACTTGAAAACAGGCGCTGCATGGTTGTCGTCAGCTCGTGCCTTGAGGTGTTCGGTTAAGTCCGTTAACGAGCGCAACCCATGTCGTTAGTTATTATGTCTAACGAGACTGCTCGAGTTAATCGAGAGGAAGGTGTGGATGACGTCAAATCAGCATGGCCCTTATGCCTAGGGCTACACACATGATACAATGGTCGGTACAAAGGGTTGCCAAGTGGTAACACGGAGCCAATCCCAGAAAGCCGATCTCAGTCCAGATTGAGGGCTGCAACTCGCCCTCATGAAGTTGGAATTGCTAGTAATCGTGAATCAGCTATGTCACGGTGAATCTGTTCCCGGGTCTTGTACTCACCGCCCGTCAAACCATGGGAGGTGTGCGTACCTGAAGTCCTTCGAGTAATACGGAGGCCCACGGTAAACACACTGACTGGGGTTAAGTCGTAACAAGGTA']

# k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterobacteriaceae; g__; s__
s5 = ['3728119','CTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGATGACGGGAGCTTGCTCCTTGATTCAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGACAACGTTTCGAAAGGAACGCTAATACCGCATACGTCCTACGGGAGAAAGCAGGGGACCTTCGGGCCTTGCGCTATCAGATGAGCCTAGGTCGGATTAGCTAGTAGGTGAGGTAATGGCTCACCTAGGCGACGATCCGTAACTGGTCTGAGAGGATGATCAGTCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGTCTTCGGATTGTAAAGCACTTTAAGTTGGGAGGAAGGGCAGTAAGCTAATACCTTGCTGTTTTGACGTTACCGACAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATACGGCAGGGATGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGCTAGGCTGGTTCGTTAAGTCTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGAGCTAGAGTATTGTAGAGGGTGGTGGAATTTCAGGTGTAGCGGTGAAATGCGTAGATATCTGGAAGGAATACCGGTGGCGAAGGCGACCCCCTGGACATGATACTGACGCTCATGTCGTCTTAGCGATAATGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGTTTGGCTTCCGGAGCTAACGCGTTAAGTTGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTACTCTTGACATCCAGAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTCTGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTTTGGCCGGGAACTCATAGGAAACTGCCAGTGATCAACTGGAAGAAAGTGGGGATGACCTCCAGTCATCATGGCCCTTACCAGTATGGCTACACACGTGCTACAATGGCGCATACCAAGAGAATCGACCTCGCGAGAGCGAGCGGACTTTATCAAGTGCGTCGTAATCCGGATTGGAGTCTGCCACTCTCACTCGATGAAGTCCGAATCGCTAGTAATCGTGGATTCAGAATTGCTTCGGTGTGAATATCGTTCCCGGGCCTTTGTACACACCCGCCCGGTCACACCATGGG']

sequences = [s1, s2, s3, s4, s5]

You can now use slice_sequence to extract a region of interest from a sequence - for example, the region between the starting position of two primer hits. The example below would extract the region between 50 and 250 from sequence 1.

In [6]:

slice_sequence(sequences[0][1], 50, 250)

Hint: The numbers here may look a little bit funny. That's because the first item in a list in python is 0, not 1. So, if you want the first entry from the list sequences you would type sequences[0], and if you want the second sequence you would type sequences[1]. Try this in the cell below until you are comfortable getting the sequence you want. Then try to slice a sequence to the region between positions 25 and 50.

In [7]:

This cell contains the potential primers that we will use to amplify specific regions.

In [8]:

primers = [('p1', 'TTCCGGTTGATCCNGCCGGA'), # F21
           ('p2', 'ACNGCTCAGTAACACGT'), # F109
           ('p3', 'GCTGCCTCCCGTAGGAGT'), # F338
           ('p4', 'TACGGNAGGCAGCAG'), # F343
           ('p5', 'GTGCCAGCNGCCGCGGTAA'), # F515
           ('p6', 'ATTAGATACCCNGGTAGTCC'), # F770
           ('p7', 'ATTAGATACCCNNGTAGTCC'), # R806 (reverse complement)
           ('p8', 'AGGAATTGGCGGGGCAGCAC'), # R915 (reverse complement)
           ('p9', 'AAACTNAAAGGAATTGACGG'), # F926
           ('p10', 'AGGTNNGNATGCCCCNAA')] # R1240 (reverse complement)

If you want to locally align a primer against a primer, a primer against a sequence, or a sequence against a sequence, you could run the following:

In [9]:

aln1, aln2, score, start1, start2 = sw_align_nt(primers[0][1], primers[1][1])
print(aln1)
print(aln2)
print(score)
print(start1)
print(start2)

In [10]:

aln1, aln2, score, start1, start2 = sw_align_nt(primers[0][1], sequences[0][1])
print(aln1)
print(aln2)
print(score)
print(start1)
print(start2)

In [11]:

aln1, aln2, score, start1, start2 = sw_align_nt(primers[0][1], sequences[0][1])
print(aln1)
print(aln2)
print(score)

If you want to globally align a primer against a primer, you could run the following:

In [12]:

aln1, aln2, score = nw_align_nt(primers[0][1], primers[1][1])
print(aln1)
print(aln2)
print(score)

Notice that there are two additional return values from sw_align_nt than there are from nw_align_nt. Look at the help for each function to figure out what these values are. Why does it make sense to get them from sw_align_nt, but not from nw_align_nt?

In the cell below, try to globally align a primer against a sequence. Then try to globally align a different primer against a different sequence.

In [13]:

Hint: if you want to normalize an alignment score by it's length, you can do the following. This may come in handy when comparing alignments.

In [14]:

aln1, aln2, score = nw_align_nt(primers[0][1], sequences[0][1])
print(score / len(aln1))

At this point you have the necessary functions to complete the assignment. Your goal is to pick the best pair of 16S primers, a forward and a reverse primer, given the above five input sequences.

The best primers will:

Anneal to all of the 16S rRNA sequences well. This will be determined by achieving a high alignment score between the primer and all of the sequences (though it is OK for there to be some mismatches).
They should amplify a region that is 100-400 base pairs long, due to limitations of current sequencing technology.
Finally, the region that is amplified (i.e., between the primers) should be very different across all species, to allow for accurate fingerprinting of the different species.

4.1.6 Question 1 [edit]¶

What is the difference between a local and global alignment in terms of what is aligned? What are the differences in the algorithms that support this? (One paragraph)

4.1.7 Question 2 [edit]¶

What is the sequence in s2 from position 500 to 505?

Hint: the first base in a sequence is position 1; the first item in a Python list is index 0.

In [15]:

4.1.8 Question 3 [edit]¶

What is the Smith-Waterman alignment score of primer p6 against sequence s4? Where in s4 does the alignment start?

Hint: copy and paste from other cells, that way you only have to make a couple small changes.

In [16]:

4.1.9 Question 4 [edit]¶

What is the best pair of primers from the list of available primers to use for amplifying the 16S region for sequencing for the purposes of identifying the organisms present?

4.1.10 More hints [edit]¶

The best pair of primers will align well to (and therefore be likely to anneal with) the 16S sequences from all of the organisms present in the list.

The best pair of primers will amplify a region of DNA that is between 100 and 400 base pairs long.

The best pair of primers will amplify a region that is highly variable (in other words, the amplified regions across the organisms should not align well).

Think about whether you want to use global or local alignments for the different steps. Are there times when you would want to use a gap penalty other than the default?

The N character present in some of the primer sequences signifies a "degenerate" base, meaning it could be an 'A', 'T', 'G' or 'C'. You shouldn't worry about these for this exercise.

Good Luck!

In [17]: