Multiple sequence alignment exercises

Purpose

In this assignment you'll use multiple sequence alignment to reconstruct the phylogeny of a group of organisms based on their 16S rRNA sequences. This assignment builds on the previous one: there you identified good primers for amplifying 16S from diverse organisms, and here we're using those sequences to group organisms by their relatedness. Because modern DNA-sequencing-based experiments commonly produce very large numbers of sequences, it is common to group similar sequences and then work with a representative sequence for each group, which improves computational efficiency. We'll explore these ideas in more detail throughout the next segments of the class.
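To make the idea of grouping similar sequences around representatives concrete, here is a minimal sketch. It is only an illustration under stated assumptions: the function names (kmer_set, kmer_jaccard_distance, greedy_cluster), the k-mer size, and the 0.3 threshold are all hypothetical choices made for this example, and the distance used here is not necessarily what the kmer_distance function imported later in this notebook computes.

```python
def kmer_set(seq, k=3):
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard_distance(seq1, seq2, k=3):
    """Fraction of distinct k-mers that the two sequences do NOT share."""
    kmers1, kmers2 = kmer_set(seq1, k), kmer_set(seq2, k)
    union = kmers1 | kmers2
    if not union:
        return 0.0
    return 1.0 - float(len(kmers1 & kmers2)) / len(union)


def greedy_cluster(seqs, threshold=0.3, k=3):
    """Group (seq_id, sequence) pairs around representative sequences.

    Each sequence joins the first representative within `threshold`;
    if none is close enough, it becomes the representative of a new group.
    Returns a dict mapping representative id -> list of member ids.
    """
    representatives = []  # (seq_id, sequence) of each group's first member
    clusters = {}
    for seq_id, seq in seqs:
        for rep_id, rep_seq in representatives:
            if kmer_jaccard_distance(seq, rep_seq, k) <= threshold:
                clusters[rep_id].append(seq_id)
                break
        else:
            # No existing representative was close enough.
            representatives.append((seq_id, seq))
            clusters[seq_id] = [seq_id]
    return clusters


# Toy sequences (hypothetical, not taken from the assignment data).
toy_seqs = [
    ("s1", "ACGTACGTACGT"),
    ("s2", "ACGTACGTACGA"),
    ("s3", "TTTTGGGGCCCC"),
]
toy_clusters = greedy_cluster(toy_seqs)
print(toy_clusters)
```

Note that in this sketch the result depends on the order in which sequences are visited and on the threshold chosen, which is one of the drawbacks of this style of grouping worth keeping in mind.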

From a bioinformatics standpoint, we usually start with sequences in FASTA format, much like the sequences in the cell below. See here for an explanation of the FASTA format.
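As a rough illustration of the format, a FASTA record is a header line beginning with ">" followed by one or more lines of sequence. The sketch below reads such records using only plain Python. The function name read_fasta and the toy records are assumptions made for this example; the notebook itself uses parse_fasta from scikit-bio, whose behavior may differ in its details (for instance, in how it treats descriptions on the header line).

```python
def read_fasta(lines):
    """Yield (seq_id, sequence) pairs from an iterable of FASTA lines.

    Simplification: the whole header line after ">" is used as the id.
    """
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header ends the previous record, if there was one.
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            seq_id, chunks = line[1:], []
        else:
            # Sequence data may be wrapped across several lines.
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


# Toy records (hypothetical, not taken from the assignment data).
toy_fasta = """>seq1
ACGTACGT
ACGT
>seq2
GGGTTTAA"""

records = list(read_fasta(toy_fasta.split("\n")))
print(records)
```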

At this point, you should be feeling fairly comfortable interacting with the IPython Notebook. This assignment will give you additional practice while you explore the ideas mentioned above.

Goals

Continue to work with IPython Notebooks and interact with Python code. Understand what multiple sequence alignment is used for, and the concept of grouping sequences into clusters, or OTUs. Consider the possible drawbacks of these methods.

Hints

  • Read all of the cells containing text very carefully!

  • You may write code or use a text editor if you wish; however, all of the tools necessary to answer the questions are present in this notebook.

  • Get help: that's what office hours are for!

  • You are allowed to discuss the assignment with other students; however, your work needs to be your own. Using or looking at code or commands generated by another student is strictly prohibited. If you're in doubt about whether some type of interaction is acceptable for this assignment, ask.

Below are a few functions that you will need to complete the assignment.

Remember, to learn what a function does you can run:

help(name_of_function)

Try this with the functions below to see what they do.
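For instance, help prints a function's docstring, so it works on any function that has one. The small function below, gc_content, is a hypothetical example written for this illustration; it is not one of the functions provided with the assignment.

```python
def gc_content(seq):
    """Return the fraction of bases in seq that are G or C."""
    if not seq:
        return 0.0
    return float(seq.count("G") + seq.count("C")) / len(seq)


# Prints the function's signature followed by its docstring.
help(gc_content)

# Calling the function itself, on a toy sequence.
print(gc_content("GGCCAT"))
```

The same call pattern applies to the imported functions, for example help(kmer_distance), once the import cell below has been run.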

In [ ]:
from __future__ import division

from skbio.parse.sequences import parse_fasta
from skbio import BiologicalSequence, SequenceCollection

from iab.algorithms import progressive_msa_and_tree, iterative_msa_and_tree, kmer_distance, guide_tree_from_sequences 

The cell below contains the sequences that you will be working with throughout the assignment.

In [ ]:
seqs_16s = """>881726
GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGATTCATCCTTCGGGATGGGTTAGCGGCGGACGGGTGAGTAACACGTAGGCAACCTGCCTGCAAGTCCGGGATAACTAACGGAAACGTTAGCTAATACCGGATACGCGGTTGGATCGCATGATCCGATCGGGAAAGACGGCGCAAGCTGCCACTTGTAGATGGGCCTGCGGCGCATTAGCTAGTTGGTGGGGTAACGGCTCACCAAGGCGACGATGCGTAGCCGACCTGAGAGAGTGATCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCCGCAATGGACGCAAGTCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTTCTCGGATCGTAAAGCTCTGTTGCCAGGGAAGAACGCTCGGGAGAGTAACTGCTCTCGAGGTGACGGTACCTGAGAAGAAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTCGATTAAGTTTGGTGTTTAAGCCCGGGGCTCAACCCCGGTTCGCACTGAAAACTGATCGACTTGAGTGTAGGAGAGGAAAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGCCTATAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGCATGCTAGGTGTTAGGGGTTTCGATACCCTTGGTGCCGAAGTCAACACAGTAAGCATGCCGCCTGGGGAGTACGGTCGCAAGACTGAAACTCAAAGGAATTGACGGGGACCCGCACAAGCAGTGGAGTATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCCCCTGAATCCTCTAGAGATAGAGGCGGCCCTTCGGGGACAGGGGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGATCGTAGTTGCCAGCACTTCGGGTGGGCACTCTAGGATGACTGCCGGTGACAAACCGGAGGAAGGCGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTACTACAATGGCCGGTACAACGGGCTGCGAAGCCGCGAGGTGGAGCCAATCCCAGAAAGCCGGTCTCAGTTCAGATTGCAGGCTGCAACTCGCCTGCATGAAGTCGGAATTGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGTCT
>793074
GAATGAACGCTGGCGGCGTGCTTAATAATGCAAGTCGAGCGCGTAGCAATACGAGCGGCGCACGGGTGCGTAACACGTAGGTCATCTGCCTCTAGGTCGGGGATAACTGCGGGAAACTGCAGCTAATACCCGATGATATCGAGAGATCAAAGCTTCGGTGCCTAGAGAGGAGCCTGCGGCTCATTAGCTAGTTGGTGGGGTAACGGCCTACCAAGGCCACGATGAGTAGCCGGCCTGAGAGGGCGATCGGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTAGGGAATATTGGGCAATGGGCGAAAGCCTGACCCAGCAACGCCGCGTGAGTGATGAAGCCTTTCGGGGTGTAAAGCTCTTTTGGCAGGGACGAATCAATGACGGTACCTGCGTAATAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGGGGGGGCAAGCGTTATTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGCTTCTTAAGTCGGGTGTTTAATGTCGGGGCTCAACTCCGGCGCTGCACTCGATACTGGGAGGCTAGAGTACTCGAGAGGAAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTTAGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGAGAGTAACTGACGCTCAGAGCGCGAAAGCCAGGGGATCGAACGGGATTAGATACCCCGGTAGTCCTGGCTGTAAACGATGGGTACTAGATGTCGCCGGTATCAATCCCGGCGGTATCGTCGCTAACGCATTAAGTACCCCGCCTGGGGAGTACGCTCGCAAGAGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGACTTGACATACCTCGGACCGGACCTAGAGATAGGACCTTCTCCCGTAAGGGAGCCGGGGATACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCCATCCCTAGTTGCCAGCGAGTCATGTCGGGAACTCTAGGGAGACTGCCGTTGATAAAACGAGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGTCCAGGGCTACACACGTGCTACAATGGCCACCACAAAGGGTCGCAATACCGTGAGGTGGAGCTAATCCCAAAAAGGTGGCCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGTCGGAATCGCTAGTAATCGCGGATCAGAACGCCGCGGTGAATACAGTTCCCGGGCCTTGTACACACCGCCCGTCACACCACGAGAGCTGGTTGCGTTAGAAGTCGCCAGGCCAACCGCAAGGGGGCAGGCGCCGAATGCGTGATGAGTGATTGGGGT
>669210
AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGCGCGCCTAACACATGCAAGTCGAACGGACTAGCCCCTTCGGGGGCGAAGTTAGTGGCGAACGGGTGAGTAACGCGTAAGTAACCTGCCCCCGGGACTGGGATAACAGCTCGAAAGAGCCGCTAATACCGGATAATTGTTGCAACACTTAGGAGTTGTAACTAAAGAAGGCCTCTGTTTCAAGCTTTCACCTGGGGATGGGCTTGCGTCCCATTAGCTTGTTGGTGAGGTAACGGCTCACCAAGGAAACGATGGGTAGCCGGCCTGAGAGGGTGGTCGGTCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGAGGAATCTTGCGCAATGGGCTAACGCCTGACGCAGCGACGCCGCGTGGACGATGAAGCTTTTCGGAGTGTAAAGTCCTTTCAGGAGGGAAGAAATGCCGGTAGTGTGAATAACACACCGGTTTGACGGTACCTCAAGAAGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAACACGGAGGGGGCAAGCGTTGTTCGGAATCACTGGGCGTAAAGAGCGCGTAGGTGGTTGTGTAAGTCGGATGTGAAATCCCTCGGCTCAACCGAGGAACTGCGTTCGAAACTACATAGCTAGAGGGCAGGAGAGGAGAGCGGAATTCCCAGTGTAGCGGTGAAATGCGCAGATATTGGGAAGAACACCGGTGGCGAAGGCGGCTCTCTGGACTGTTCCTGACACTGAGGCGCGAAAGCCAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTGGCCGTAAACGATGGGCACTAGGTGTGGGGGGTGTCGATCCCCCCCGTGCCGCAGCTAACGCATTAAGTGCCCCGCCTGGGAAGTACGATCGCAAGGTTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGACGCAACGCGAAGAACCTTACCGGAATTTGACATGTTTCTGACGGCCTGCAGAAATGCAGGCTTCCCCTCGGGGCAGATACACAGGAGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGCCCTTAGTTGCCATCGGTTCGGCCGGGAACTCTAAGGGGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCAGTATGGCCTTTATGTTCCGGGCTACACACGTGCTACAATGGCTGGTACAAAGGGTCGCGATGCCGTGAGGTGGAGCCAATCCCAAAAAGCCAGTCTTAGTTCGGATTGGAGTCTGCAACTCGACTCTATGAAGCCGGAATCGCTAGTAATCGTGGATCAGCACGCCATGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCACGAAAGTTGGTTGTACCAGAAGTCATTGGGCTAACCCTTTTGGGGGGCAGATGCCGAAGGTATGGTCAGCGATTGGGGTGAAGTCGTAACAAGGTAACC
>583705
ACGGGTGAGTAACGCGTATGCAACCTACCTCGGAAAAGGGGATGACTGGTGGAAACGGGGATTAATGCCCCCTAGGGTTGTTTCTCTGCCTGGGTGAGCCGTTACTATTGGAACCGATTGAGATGGCCATGTTGGTCATTTCCTGGTTGGTGAGGTTACCTCACACCAAGGCGACGATGACTACGGGGTCTAAAAGGATGGTCCCGCACACTGGTACTGAGACACGGACCAGACTCCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGGGCAACCCTGAACCAGCCATGCCGCGTGAAGGAAGACGGCCCTATGGGTTGTAAACTTCTTTTATATGGGAATAAAGAGAGGTACGTGTACCTCAGTGAATGTACCATATGAATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCAGGCGTTATCCGGATTTATTAGGTTTAAAGGGTGCGTAGGCGGGATACTAAGTCAGTGGTGAAAGTTTGCGGCTCAACCGTAAAATCGCCATTGATACTGGTATTCTTGAGTATACAGGAAGTAGGCGGAATGTGTAGTGTAGCGGTGAAATGCATAGATATTACACAGAACACCGATTGCGAAGGCAGCTTACTATAGTATAACTGACGCTGATGCACGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGATTACTGGTTGTGCGCGATACACAGTGCGCGACTGAGCGAAAGCATTAAGTAATCCACCTGGGGAGTACGGCGGCAACGCTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCTGGGCTTAAATGTAGAGTGCATGGAGTGGAAACATTCCTTTCCTTCGGGACTCTTTACAAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGGTTAAGTCCCATAACGAGCGCAACCCCTATCATTAGTTGCTAACAGGTCAAGCTGAGGACTCTAGCGAAACTGCCGGTGTAAACCGTGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTATGTCCAGGGCTACACACGTGTTACGATGGCCAGTACAAAGGGTAGCTACCTGGTGACAGGATGCTAATCTCAAAAGCTGGTCTCAGTTCGGATTGGAGTCTGCAACTCGACTCCATGAAGTTGGATTCGCTAGTAATCGTATATCAGCCATGATACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCAAACCATGGAAGCTGGGGGTACCTGAAGTACGTCACCGCAAGGAGCGTCCTAGGGTAAATCTAGTGACTGGGGTTAAGTGGTAACAAGGTAACC
>524860
AGAGTTTGATCCTGGCTCAGAACGAACGTTGGCGGCATGGATGAGGCATGCAAGTCGCGGGAATCCCCAGCAATGGGGGGAACCGGCGTAAGGGGCAGTAAGGCGTAGGTACCTACCCCCAGGTCCGGGATAGCCCGCCGAGAGGCGGGGTAATACCGGATGACCTCGGGAGAGCAAAGCTCCGGCGCCTGAGGCGGGGCCTACGTGATATTACCTAGTTGGCGGGGTAACGGCCCACCAAGGGGGAGATGTCTAGCGGGTGTGAGAGCACGACCCGCGCCACTCGCACTGAGACACTGGCGAGACACCTACGGGTGGCTGCAGTCGAGGATCTTCGGCAATGGGGGCAACCCTGACCGAGCGATGCCGCGTGGGCGACGAAGGCCTTCGGGTTGTAAAGCCCTGTCGAGGGGGAGAAAGCCGCAAGGCGGATCCATCCCTGGAGGAAGCTCGGGCTAAGTTCGTGCCAGCAGCCGCGGTAAGACGAACCGAGCGAACGTTGTTCGGAATCACTGGGCTTAAAGGGCGCGTAGGCGGGCTGCCGCGTCCGGGGTGAAATCCCACGGCTCAACCGTGGAACGGCCCCGGGTACGGGCGGCCTCGAGGGGGATAGGGGCGTGCGGAACTGTGGGTGGAGCGGTGAAATGCGTTGATATCCACAGGAACTCCGGTGGCGAAGGCGGCACGCTGGATCCTCTCTGACGCTGAGGCGCGGAAGCCAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTGGCCCTAAACGATGAGAACTGGGTAGTAGCCCTGGCATGGGGTTACTGCCGCAGCCAAAGTGCTAAGTTCTCCGCCTGGGGAGTATGGTCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCTCACACAAGCGGTGGAGCATGTGGCTTAATTCGAGGCTACGCGAAGAACCTTATCCTGGACTTGACATGTGCGAAAGCGCCAGCAGGTAGGACCCGGAAACGGGAACGAACGGTATCCAACCCGGAAGCTGGTACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTCGGGTTAAGTCCCATAACGAGCGAAACCCTTACCCCTTGTTGCAACCCGAAAGGGGCACTCGAGGGGGACTGCCGGTGTCAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGTCCAGGGCTGCACACGTGCTACAATGGCGTGGACAGAGGGACGCGACTGCGCGAGCAGAAGCCGACCCCCGAAAGCACGCCCCAGTTCAGATCGCAGGCTGCAACCCGCCTGCGTGAAGCCGGAATCGCTAGTAATCGCGGGTCAGCAACACCGCGGTGAATGTGTTCCTGAGCCTTGTACACACCGCCCGTCAAGCCACGAAAGGGAGGGACGTCCGAAGTCGCCTCGCGGCGCCGAAGACGGACTTCCTGATTGGGACTAAGTCGTAACAAGGTAACC
>501793
GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAACGGAGTGGTTGAAGGAGCTTGCTCTTTTGATCGCTTAGTGGCAGACGGGTGAGTAACACGTAGGCAACCTGGCTGTAAGACGGGGATAACTGGCGGAAACGTGAGCTAAAACCGGATGGTCGGCTTGAGGGCATCCTCGAGTCGGGAAAGGACGGAGCAATCTGTCGCTTACAGATGGGCCTGCGGCGCATTAGCTAGTTGGTAGGGTAACGGCCTACCAAGGCGACGATGCGTAGCCGACCTGAGAGGGTGAACGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCGGCAATGGACGAAAGTCTGACCGAGCAACGCCGCGTGAGTGATGAAGGTTTTCGGATCGTAAAGCTCTGTTGCCAGGGAAGAACGCCAGGGAGAGTAACTGCTCTCTGGGTGACGGTACCTGAGAAGAAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTCGTTTAAGTCTCATGTCTAAACCCCGGGGCTCAACCTCGGGGTGCATGGGAAACTGGGCGACTGGAGTGCATGTGAGGAAAGTGGAATTCCACGTGTAGCGGTGGAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGCCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAATGCTAGGTGTTAGGGGTTTCGATACCCTTGGTGCCGAAGTTAACACATTAAGCATTCCGCCTGGGGAGTACGGTCGCAAGACTGAAACTCAAAGGAATTGACGGGGACCCGCACAAGCAGTGGGGTATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCCGATGAAACATGCAGAGATGTGTGCCCTCTTCGGAGCATTGGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTAAGGTTAGTTGCCAGCAGGTGAAGCTGGGCACTCTAACATGACTGCCGGTGACAAACCGGAGGAAGGCGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTACTACAATGGCCAGTACAACGGGAAGCGAAGTGGCGACACGGAGCCAATCTTAGAAAGCTGGTCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGTCGGAATTGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGTCT
>296752
AGAGTTTGATCTCTGGCTCAGAACAAACGCTGGCGGTGCGTCTTAAGCATGCAAGTCGAGCGATGGTAGGGGGCTTGCTCCCTATTCATAGCGGCGGACTGGTGAGTAACGCGTAGATGACATACCTTTTGCTGGGGGATAGCTTGTGGAAACACAGGGTAATACCGCATACGATTGAGGCGGTTAGAGCGCTTCAATCAAAGCCTTGTATGGGGCGGCAGTTGAGTGGTCTGCGTACTATTAGCTTGTTGGTGGGGTAACGGCTCACCAAGGCGATGATAGTTATCCGGCCTGAGAGGGTGAACGGACACATTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGCTAAGAATATTCCGCAATGGGCGAAAGCCTGACGGAGCGACACCGCATGAGTGATGAAGGTCGAAAGATTGTAAAATTCTTTTTGAGAGTGATGAATAAAGTCGAGCAGTAATGCTCGGTGATGACGGTAACTTTTGAATAAGGGGTGGCTAATTACGTGCCAGCAGCCGCGGTAACACGTAAGCCCCAAGCGTTGTTCGGAATTATTGGGTGTAAAGGGCATGTAGGTGGTCTTGCAAGCTTGATGTGAAATCTTACAGCTTAACTGTAAAACTGCATTGAGAACTGCATAACTTAAGAATAACTGAGGCGCAACTGGAATTCCAGGTGTAGGGGTGAAATCTGTAGATATCTGGAAGAACACCAATGGCGAAGGCAAGTTGCGAGCAGATTATTGACACTGAGGTGCGAAGGTGCGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCGCACAGACAACGATGTACACTGGGCGTCTGGCTTTATGCTGGGTGCCGTAGTAAACGCGATAAGTGTACCGCCTGGGGAGTATGCTCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGAAATCTTACCTGGGTTTGACATTTAGTGGAATTGTATAGAGATATGCAAGGTACTTGTACCCGCTAAACAGGTGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTACTGCTAGTTACTAACAGGTTATGCTGAGGACTCTAGCGGAACTGCCGGTGACAAACCGGAGGAAGATGGGGATGACGTCAAGTCATCATGGCCCTTATGTCCAGGGCAACACACGTGCTACAATGGTTGTAACAAAGTGATGCGAAATCGCAAGATGAAGCAAAACGCAGAAATGCAATCGTAGTTCGGATTGGAGTCTGAAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCATATCAGCACGATGCGGTGAATACGTTCCCGGGCCTTGCACACACCGCCCGTCA
>293514
AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGTGCGTCTTAAGCATGCAAGTCGAGCGATGGTAGGGGGCTTGCTCCCTATTCATAGCGGCGGACTGGTGAGTAACGCGTAGATGACATACCTTTTGCTGGGGGATAGCTTGTGGAAACACAGGGTAATACCGCATACGATTGAGGCGGTTAGAGCGCTTCAATCAAAGCCTTGTATGGGGCGGCAGTTGAGTGGTCTGCGTACTATTAGCTTGTTGGTGGGGTAACGGCTCACCAAGGCGATGATAGTTATCCGGCCTGAGAGGGTGAACGGACACATTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGCTAAGAATATTCCGCAATGGGCGAAAGCCTGACGGAGCGACACCGCGTGAGTGATGAAGGTCGAAAGATTGTAAAACTCTTTTGAGAGTGATGAATAAGTCGAGCAGTAATGCTCGGTGATGACGGTAACTTTTGAATAAGGGGTGGCTAATTACGTGCCAGCAGCCGCGGTAACACGTAAGCCCCAAGCGTTGTTCGGAATTATTGGGCGTAAAGGGCATGTAGGTGGTCTTGCAAGCTTGATGTGAAATCTTACAGCTTAACTGTAAAACTGCATTGAGAACTGCATAACTTAAGAATAACTGAGGCGCAACTGGAATTCCAGGTGTAGGGGTGAAATCTGTAGATATCTGGAAGAACACCAATGGCGAAGGCAAGTTGCGAGCAGATTATTGACACTGAGGTGCGAAGGTGCGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCGCACAGTCAACGATGTACACTGGGCGTCTGGCTTTATGCTGGGTGCCGTAGTAAACGCGATAAGTGTACCGCCTGGGGAGTATGCTCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGGAATCTTACCTGGGTTTGACATACACATTATCTTTGCAGAGATGTAAAGCGGGGGTAACCCCAATGTGAACAGGTGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATTGCCAGTTACTAACAAGTTAAGTTGAGGACTCTGGCGAAACTGCCGGTGACAAATCGGAGGAAGATGGGGATGACGTCAAGTCATCATGGCCCTTATGTCCAGGGCAACACACGTGCTACAATGGTTGAGACAGAGTGATGCTAAGTCGCAAGATGGAGCAAAACGCAGAAATTCAATCGTAGTTCGGATTGGAGTCTGAAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCATATCAGCACGATGCGGTGAATACGTTCCCGGGCCTTGCACACACCGCCCGTCA
>292553
AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGTGCGTCTTAAGCATGCAAGTCGAGCGATGAATGAGGGGCTTGCTCCTTATTCATAGCGGCGGACTGGTGAGTAACGCGTAGATGACATGTCGATGGCAGGGGGATAGCCAGTAGAAATATTGGGTAATACCGCGTATCCTTCTTGTTGTTAGAGGACAAGAAGAAAAGCCTTGTATGGGGCGGCTATTGAGTGGTCTGCGTACTATTAGTTTGTTGGTGGGGTAACGGCCTACCAAGACTATGATAGTTATCCGGCCTGAGAGGGTGAACGGACACATTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGCTAAGAATATTCCGCAATGGGCGAAAGCCTGACGGAGCGACACCGCGTGAGTGATGAAGGTCGAAAGATTGTAAAACTCTTTTGAATATGATGAATAAGTCAAGCAGTAATGCTTGGCGATGACGGTAGTGTTTGAATAAGGGGTGGCTAATTACGTGCCAGCAGCCGCGGTAACACGTAAGCCCCAAGCGTTGTTCGGAATTATTGGGCGTAAAGGGCATGTAGGTGGTTTTGTAAGCTTGATGTGAAATCTTACAGCTTAACTGTAAAACTGCATTGAGAACTGCAGAACTAGAGTAACTGAGGTGCAACTGGAATTCCAGGTGTAGGGGTGAAATCTGTAGATATCTGGAAGAACACCAATGGCGAAGGCAAGTTGCAAGCAGATTACTGACACTGAGGTGCGAAGGTGCGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCGCACAGTCAACGATGTACACTGGGCGTCTGGCTTTATGCTGGGTGCCGTAGTAAACGCGATAAGTGTACCGCCTGGGGAGTATGCTCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGAAATCTTACCTGGGTTTGACATTTAGTGGAATTGTATAGAGATATGCAAGGTACTTGTACCCGCTAAACAGGTGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTACTGCTAGTTACTAACAGGTTATGCTGAGGACTCTAGCGGAACTGCCGGTGACAAACCGGAGGAAGATGGGGATGACGTCAAGTCATCATGGCCCTTATGTCCAGGGCAACACACGTGCTACAATGGTTGTAACAAAGTGATGCGAAATCGCAAGATGAAGCAAAACGCAGAAATGCAATCGTAGTTCGGATTGGAGTCTGAAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCATATCAGCACGATGCGGTGAATACGTTCCCGGGCCTTGCACTAACGGCCCGTCA
>266495
AGTTTGATCCTGGCTCAAGATGAACGCTAGCGGCAGGCTTAACACATGCAAGTCAAAGGGCAACGGGGAGAGTGCTTGCACTCTCTGCCGGCGACTGGCGCACGGGTGAGTAACACTTATGCAGACACTGCCTTCCACAGGGCGGACAACCTCTCCCAAAGGGAGGCTAATCCCGCGTATATCCCTTGGGGGCATCCCCGGGGGAGGAAAGGATTACCGGTGTGCAGGATGGGCATGCGGCGCATTACGCAGTAGGCGGGGTAACGGCCCACCTAACCGACCATGCGTATGGGTTCTGAGAGGAAGGCCCCCCACACTGGTACTGAGACACTGACCAGACTCCTACTGGAGGCAGCAGTGAGGAACATTGGTCAATGGGCGGGAGCCTGAACCAGCAAACCCGCGTGAAGGAAGAAGGCGCCGAACGTCGTAAACTTCTTTTGTCCGGGATCAAAGGGCGCCACGTGTGGCGTTGTGAGTGTACCTGTAGAGAAAGCTTCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGTAGGGAGCGAGCGTTGTCCGGATTTATTGGGTGTAAAGGGCGCGTAGGTGGTCGGTTAAGTCAGGTGTGAAAGCTCGGGGCTCAACCCGGAGGATCCGCTGGAACTTTGGTGTCATGAGGCGCAGGAGAAGTAAAGTGGAATTCGTGGTGTAGCGGTGAAATGCATAGATATTGGGCGGAACTCCGGTGGCGAAGGCAGCGTTCTGGCGCGTGCCTGACGCTGAGGCGCGAAAGCGTGGGTATCGAACGGGATTAGATACCCCGGTAGTCCACGCAGTAAACGATGAATACTGGGTGTCGGACCCATAGAACGTTTGGGTGCGCGCAGCGAAAGCGATAAGCATTCCAAGTGGGGAGTACACCGGCAGTGATGAGACTCAAAGGAATCGACGGGGGTTCGCACAAGTGGAGGGATATGTGGTTTAATTAGACGATAAGTGAGGAACGTGACCCGGGTTCAACAGGGAGTCGACAGGGGCAGAGATTCCCTCTTCCACGGACGTCTTCCGAGGTGGGGCATGGTTGTCAGTCAGCTACGTGCCGTGAGGTGTCGGCTTAAGTGCCATAAGGTGTGCAACACGGGCAGACAGTTGCTAACGGGTAGAGCAGTGGAATGTGTAGTGATTGCAGGGGCAAGCCGCGAGGAAGGGGGGGATGATGTCAAATCAGCGCGGCCCTTAGGTCAGGGGTGACACACGTGCTGCAATGGCGGGGACAGAGGGATGTGAAGAGGCGACGTGGAGCGAACCCCAAAAACCCCGCCCCAGTTAGGATTGTAGTATGCAACCCGAATACATGAAGCCGGAATAGGTAGTAATCGCGGATCAGAATGCAGCGGTGAATAAGTTCCCGGCTCTAGCACACACCGCCCGTCA
>229854
GAGTTTGATCCTGGCTCAGATTGAACGCTGGCGGCATGCTTAACACATGCAAGTCGAACGGCAGCATGACTTAGCTTGCTAAGTTGATGGCGAGTGGCGAACGGGTGAGTAACGCGTAGGAATATGCCTTAAAGAGGGGGACAACTTGGGGAAACTCAAGCTAATACCGCATAAACTCTTCGGAGAAAAGCTGGGGACTTTCGAGCCTGGCGCTTTAAGATTAGCCTGCGTCCGATTAGCTAGTTGGTAGGGTAAAGGCCTACCAAGGCGACGATCAGTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGGGCAACCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGCCTGAGGGTTGTAAAGCACTTTCAGTGGGGAGGAGGGTTTCCCGGTTAAGAGCTAGGGGCATTGGACGTTACCCACAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCCGCGGTAATACGGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCCGTTAAAANGGTGCCTAAGGTGGTTTGGATNAGTTATGTGTTAAATTCCCTGGCGCCTCCACCCTGGNGCCAGGTCCATANTAAAAACTGTTAAACTCCGAAGTATGGGCACAAGGTAANTTGGAAANTTCCGGTGGTNANCCGNTGAAAATGCGCTTAGAGATNCGGGAAGGGACCACCCCAGTGGGGAAGGCGGCTACCTGGCCTAATAACTGACATTGAGGCACGAAAAGCGTGGGGAGCAACCAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGTCAACTAGCTGTNGGTTATATGAATATAATTAGTGGCGAAGCTAACGCGATAAGTTGACCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATNGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTACCCTTGACATACAGTAAATCTTTCAGAGATGAGAGAGTGCCTTCGGGAATACTGATACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTATCTCTAGTTGCCAGCGAGTAATGTCGGGAACTCTAAAGAGACTGCCGGTGACAAACCGGAGGAAGGCGGGGACGACGTCAAGTCATCATGGCCCTTACGGGTAGGGCTACACACGTGCTACAATGGCCGATACAGAGGGGCGCGAAGGAGCGATCTGGAGCAAATCTTATAAAGTCGGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGCGAATCAGCATGTCGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGCTGCACCAGAAGTAGATAGTCTAACCGCAAGGGGGACGTTTACCACGGTGTGGTTCATGACTGGGGTGAAGTCGTAACAAGGTAGCCG
>182569
AGAGTTTGATCCTGGCTCAGGATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCATGGTGTATCAATATATCTATGGCGACCAGCGCACCGGTGATGCACACCTCTCCTACCTGCCCCTTACTCCGGGATGATCTTTCTAAAAAAATATTACTACTCCATGGTATTACCGAAAAACGTCTTTTTGTTGTTTAAAAACTTCGATGGTGGAAGGTGATGCTTTCTATTATATACTTGGTGGGGTAACAGCCCACCACCTCAGCGATGAATAGGGGTTCTAATAAGAAGGTCCCCCCCATGGTAACTGGGCCCCGGTCCAAATTCTTCGGGAAGCCACCAGTGAGGATTATTGTTCAATGGCGGAGATTTTGACCCAGCCCAAGTAGCGTGAAGGATGACTGCTCCCATAGGTGGTAAACTTCTTTTATATGGGAATAAAGTGAGTCACGTGTGTCTTTTTGTATGTATCATATGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATTCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGTTTGTTAAGTCAGTGGTGAAAGTTTGGGGCTCAACCGTGAAATTGCATTTGATACTGGCGGTCTTGAGTGCAGTAGAGGTGGGCGGAATTTGTGGTGTAGCGGTGAAATGCTTAGATATCATGCAGAACTCCGATTGCGAAGGCAGCTCACCGGAGTGTATCTGACGTTGAGGCTCGAAAGTGTGGGTATCAAACAGGATTAGATACCCTGGTAGTCCACACAGTAAAGAAGGAATATTGTCGTTGTGGGATCTCCATTAAGGGGTCAAGGGAAAGCATTAATTATTCCCCTGGGGGAGTAGTCCGCCAGAGGTGAAATTAAAAGAAATGGAGGGGGGCCGGCCCAAGGGAAGGACCATGTGGTTTAATTGGAGGATAGGGGAGGACCTTTCCCGGGGTTGAAAGTGCAAATGAATTATGGGGAGAGCCATTCCCTTCAAGGCATGAGAGAAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGGTTAAGTCCCATAACGAGCGCAACCCTTATCTTCAGTTACTATCAGGTCAAGCTGAGCACTCTGGAGAGACTGCCGTTGTAAGATGAGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCTACACACGTGTTACAATGGGGGGTACAGAAGGCAGCTACCCAGCGACAGGATGCCAATCCCAAAAACCTATCTCAGTTCGGATTGAAGTCTGCAACCCGCCTTCGTGAAGTTGGATTCGCTAGTAATCGCGCATCAGCCATGGCGCGGTGAATACGTTCCCGGGCCTTGCACACACCGCCCGTCA
>1719550
TCCTGGCTCAGAACGAACGTTGGCGGCGTGGATTAGGCATGCAAGTCGCGCGAATCCCCGCAAGGGGGGAAGCGGCGTAAGGGGCAGTAAGGCGTGGGTACCTACCCGGGGGTCGGGGATAGCCCGTCGAGAGACGGGGTAATACCCGATGACGTGGAGACACCAAAGGTCCGCCGCCCTCGGCGGGGCCCACGTGATATTAGCTAGTTGGCGGGGTAACGGCCCACCAAGGCGGGGATGTCTAGCGGGTGTGAGAGCACGACCCGCGCCACTGGCACTGAGACACTGGCCAGACACCTACGGGTGGCTGCAGTCGAGGATCTTCGGCAATGGGGGAAACCCTGACCGAGCGACGCCGCGTGGGCGACGAAGGCCTTCGGGTTGTAAAGCCCTGTCGAGGGGGAGAAAGCCTTAACCGGGTGATCTATCCCTGGAGGAAGCACGGGCTAAGTTCGTGCCAGCAGCCGCGGTAAGACGAACCGTGCGAACGTTGTTCGGAATCACTGGGCTTAAAGGGCGCGTAGGCGGGTTGCCGCGTCCGGGGTGAAATCCCACGGCTCAACCGTGGGGCGGCCCCGGGTACGGGCAGCCTCGAGGAGAGTAGGGGCATGCGGAACTCTGGGTGGAGCGGTGAAATGCGTTGATATCCAGAGGAACTCCGGTGGCGAAGGCGGCATGCTGGACCCTTCCTGACGCTGAGGCGCGAAAGCCAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTGGCCCTAAACGATGAGAACTAGGTAGCCGGCCGGACATGGGCTGGCTGCCGGAGCCAAAGTGCTAAGTTCTCCGCCTGGGGAGTATGGCCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCTCACACAAGCGGTGGAGCATGTGGCTTAATTCGAGGCTACGCGAAGAACCTTATCCCGGGCTTGACATGTTCGAAAGAGGCTCGAAGTAGCCCGCGGAAACGTGGGGCCAACGGTATCCAGTCCGGAGCGAGCTACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTCGGGTTAAGTCCCATAACGAGCGAAACCCTTACCCTCAGTTGCTTACTAGGACTCTGGGGGGACTGCCGGTGTCAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCCCGGGGCTGCACACGTGCTACAATGGCGTGGACAAAGAGACGCGAGCCCGCGAGGGGGAGCCAATCTCAGAAAGCACGCCCCAGTTCAGATCGCAGGCTGCAACTCGCCTGCGTGAAGCCGGAATCGCTAGTAATCGCGGGTCAGCAACACCGCGGTGAATGTGTTCCTGAGCCTTGTACACACCGCCCGTCAAGCCACGAAAGAGAGGGACGTCCGGAGTCGCCTTCACCGGTGCCGAAGACGGACTTCTTGATTGGGACTAAGTCGTAACAAGGTAACC
>1794723
TTAGAGTTTGATCCTGGCTCAGAACGAACGTTGGCGGCGTGGATTAGGCATGCAAGTCTCGCGAATCCCCGCAAGGGGGGAAGCGGCGTAAGGGGCAGTAAGGCGTGGGTAACCCACCCCGGGGCCCGGGATAGCCCGTCGAGAGACGGGGTAATACCGGGCGACGCAGCGTGCCGGCATCGGTGTGCTGCCAAAGGTCCGCCGCCCCGGGCGGGGCCCACGTGGTATTAGCTAGTTGGTGGGGTGACGGCCCACCAAGGCGGAGATGCCTAGCGGGTGTGAGAGCACGACCCGCGCCACTGGCACTGAGACACTGGCCAGACACCTACGGGTGGCTGCAGTCGAGGATCTTCGGCAATGGGGGCAACCCTGACCGAGCGACGCCGCGTGGGCGACGAAGGCCTTCGGGTTGTAAAGCCCTGTCGAGGGGGAGAAACGTCCCGCAAGGGGCCTGATCTATCCCTGGAGGAAGCACGAGCTAAGTTCGTGCCAGCAGCCGCGGTAAGACGAACCGTGCGAACGTTGTTCGGAATCACTGGGCTTAAAGGGCGCGTAGGCGGGCTGCCGAGTCCGGGGTGAAATCCTCCCGCTCAACGGGAGAACGGCCCCGGGTACTGGCGGCCTCGAGGCGGGTAGGGGCGTGCGGAACACTGGGTGGAGCGGTGAAATGCGTTGATATCCAGTGGAACTCCGGTGGCGAAGGCGGCACGCTGGACCCGTCTGACGCTGAGGCGCGAAAGCCAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTGGCCCTAAACGTTGAGAACTAGGTAGTCGGCCGGACATGGGCTGACTGCCGGAGCGAAAGTGCTAAGTTCTCCGCCTGGGGAGTATGGTCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCTCACACAAGCGGTGGAGCATGTGGCTTAATTCGAGGCAACGCGAAGAACCTTATCCCGGGCTTGACATGTGCGAAAGCGTCTGGGGGTACCCGCCGGAAACGGCCGGGGAAGGTATCCAGTCCTGAACCAGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTCGGGTTAAGTCCCATAACGAGCGAAACCCTTACCCTCAGTTGCCAGCGGGTCACGCCGGGGACTCTGGGGGGACTGCCGGTGTCAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCCCGGGGCTGCACACGTGCTACAATGGCGTGGACAAAGGGGCGCGAACGCGCGAGCGGGAGCCGACCCCGGAAAGCACGCCCCAGTTCAGATCGCAGGCTGCAACTCGCCTGCGTGAAGTCGGAATCGCTAGTAATCGCGGGTCAGCAACACCGCGGTGAATGTGTTCCTGAGCCTTGTACACACCGCCCGTCAAGCCACGAAAGGGAGGGACGGCCGAAGTCGCGCCCCGCGCGCCGACGCCGGACTTCCCGATTGGGACTAAGTCGTAACAAGGTAACC
>1142181
CACGTGGGTCATTTGCCCCGAAGCCCGGGATAGCCCATGGAAACATGGATTAATACCGGATGTGGTTGGAGTACACAGGTGCTCCGTATTAAACGGTAGGTAGCAATACCTTCCGCTTCGGGATAAGCCCGCGGCCCATTAGCTAGTTGGTGGGGTAAGACCCAACCAAGGAGACAACCGGGAGCCGGACAGAAAGGGTGACGGCCACATTGGGACTGAGAAACGGCCCGATCCTACGGAGGCAGCAGTAAGAATCTTCCGCATGAACGAAGTCCGACCGAGCGACGCGCTGAGTGATGAAGGTGTTATGCATCGTAAAGCTCCTTCGGGGAGGAGAATAAGCATAGTCCAAAAGGCTATGTGATGACGACCCTCCCTAAAGAAGCCCCGGCTAATTACGTGCAGCAGCGCGGCAATACGTAAGGGGTAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGTGTGCAGGCGGGAGAGTAAGTTGGGGGTGAAATCTACGGGCCCAACCCGTAAACTGCCCTCAAAACTGCTTTTCTTGAGTGCAGGAGAGGAGACTGGAATTCCTAGTGTAGGAGTGAAATCTGTAGATATTAGGAAGAACACCGGTGGCGAAGGCGAGTCTCTGGCCTGACACTGACGCTGATACACGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGTTGTGCACTAGATGTTGGGGGTGTCAATCCCCTCAGTGTCGCAGTTAACGCATTAAGTGCACCGCCTGGGGAGTATGCTCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCAGGGCTTGACATACAGGTGCCGGGCTGTGAAAGCAGTCCTCTCTTCCGAGCGCCTGTACAGGTGTTGCACGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGTTAAGTCCCCCAACGAGCGCAACCCCTATTGTCTGTGCCATCATTAAGTTGGCACTCGAACGAAACTGCCGGTGATAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATATGGCCCTTAGGCCTGGCTACACGTGCTACAATGGACAGTACAAGAGTCGCAAGACCGAAAGGTGGACCATCCAAAAGCTGTCCTCAGTTCCGATTGAAGTCTGAAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGCATCAGAATGGCGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCACCCGAGTTGGAAGTACTTGAAGTCGCTGATCTAACCTTCGGGAGGAAGGCGCCGATTGTACGTCTGATAAGGGGGGTGAAGTCGTAACAAGGTAACC
>2683209
CTGGCGGCGTGGTTTAGGCATGCAAGTCGAACGCGAAAGATTTACTTCGGTAAATTGAGTAGAGTGGCGAACGGGTGAGTAATACGTACGAATCTACCTTAAAGACAGGGATAGTCCCGGGAAACTGGGTTTAATACCTGATGGTATCCGGCTTTGCCGGATTAAAGACGGCCTCTATTTATAAGCTGTTACTTTTAGATGAGCGTGCGCTCCATTAGTTAGTTGGTAAGGTAAGAGCTTACCAAGGCGATGATGGATAGGCGTCCTTAACGGGTGGTCGCCCACACTGGGATTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGTCGAGAATCGTCTACAATGAACGCAAGTTTGATAGTGCGACGCCGCGTGAATGAAGAAGCATTTCGGTGTGTAAAATTCTTTTATATAAGAACAGTGCATGTATGGTAAATAATTATACGTGAGAGATAGTACTATATGAATAAGCTCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGAGCAAGCGTTGTCCGGAATTACTAGGTGTAAAGGGTAAGTAGGCGGAAATTTAAGTCTCCGGTTAAATCTTCGGGCTCAACCCGAAATCTGCCTGAGATACTGGATTTCTAGAGTAAAGCAGATGAAGGCGGAATTCCTGGAGTAGCGGTGGAATGCGTAGATATCAGGAAGAACACCCATAGCGAAGGCAGCTTTCAATGCTATTACTGACGCTCAATTACGAAGGTGCGGGTATCGAACAGGATTAGATACCCTGGTAGTCCGCACAGTAAACGATATGTACTTGATATTGGATGTTGAAAATTCAGTGTCGTAGCTAACGCGTTAAGTACATCACCTGGGGACTAACGGCCGCAAGGTTAAAACTCAAAGGAATTGACGGGGGCCCACACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGAACTTGACATGCCGAGAATCCTGTAGAAATATGGGAGTGCCTTTTTTGGAGCTCGGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATCTTTAGTTGCTACCATTAAGTTGAGGACTCTAAAGAGACTGCCAGAGTACAAATCTGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGTCCTTATGTTCAGGGCTACACACGTGCTACAATGGTTGGAACAAAAGGCAGCGAAGGGGCGACCCGGAGCTAATCTCCAAACCCAATCTTAGTCCGGATTGCAGTCTGCAACTCGACTGCATGAAGTTGGAATCGCTAGTAATCGTGAGTCAGCATATCACGGTGAACATGTTCCTGGGCCTTGTACACACCGCCCGTCAAGTCAGCCGAATCGAGTGCACCCGAAGAAGGTGAGTTAATTAGACAGCTTTCGAAGGTGTGCTTGTAAGGGGGACTAAGTC
>2784824
AGTGGCGCACGGGTGAGTAACGCGTGGGTAACTTGCCTTTAAGTGAGGGATAACCCACTGAAAGGTGGACTAATACCTCATAAGACCACAGTGCTACGGCAGCGTGGTCAAAGGTGGCTTTATTAAAAGCTGCCGCTTGGAGAGAGACCCGCGTCCCATCAGCTTGTTGGTAAGGTAATGGCTTACCAAGGCCGAGACGGGTAGCTGGTCTGAGAGGATGGCCAGCCACACTGGAACTGAAACACGGTCCAGACTCCTACGGGAGGCAGCAGTGAGGAATCTTGCGCAATGGGGGGAACCCTGACGCAGCAACGCCGCGTGAGTGAAGAAGGTCTTCGGGTCGTAAAGCCCTGTCGGGAGGGAAGAAACAGTTATGCATGAATAATGCATAACCTTGACGGTACCTCCNGAGGAAGCACCGGCCAACTCCGTGCCAGCAGCCGCGGTAAAACGGAGGGTGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGAGCGTGTAGGCGGATAGATAAGTCGAGTGTGAAAGCCCTCAGCTTAACTGAGGAAGTGCATTCGAAACTATCTTTCTTGGGTACGGAAGAGGGAAGTGGAATTCCCGGTGTAGGGGTGAAATCCGTAGATATCGGGAGGAATACCAGTGGCGAAGGCGACTTCCTGGACCGTCACTGACGCTGAGACGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGGGCACTAGGTGTATCTCGCTTAGCGGGATGTGCCGTAGCTAACGCATTAAGTGCCCCGCCTGGGGAGTACGGTCGCAAGACTAAAACTCAAAGGAATTGACGGGGGCCCGCACAAGTGGTGGAGCATGTGGTTTAATTCGACGCAACGCGAAGAACCTTACCTGGGTTTGACATGCCGAGAATCTGCCAGAAATGGTGGAGTGCCCCGTTAGGGGAACTCGGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCACCTTTAGTTGCCAGCATTAAGTTGGGCACTCTAAAGGGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGACGACGTCNAGTCCTCATGGCCTTTATACCCAGGGCTACACACGTGCTACAATGGCCAGTACAAAGGGCTGCAATCCCGCGAGGGGGAGCCAACCCCAAAAATCTGGTCTTAGTTCGGATTGGAGTCTGCAACTCGACTCCATAAAGGTGGAATCGCTAGTAATCGTGAATCAGCACGTCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCACGAAAGTTGGCTGTACCAGAAGTTGCTGAGCTAACTCGCCTCGGCGGGAGGCAGGCACCTAAGGTGTGGTTGATGATTGGGGTGAAGT
>2941516
TTAGAGTTTGATCCTGGCTCAGGATGAACGCTAGCGATAGGCCTAACACATGCAAGTCGAGGGGTAACAGGGTAGCAATACCGCTGACGACCGGCAAATGGGTGAGTAACGCGTATGCAACCTACCGATAACAGTTGGATAGCTCCCTGAAAGGGGAATTAAACCGGCATGACACTATGAGATCGCCTGTTTTCATAGTTAAATATTTATAGGTTATTGATGGGCATGCGTGACATTAGCAAGTTGGTGAGGTAACGGCTCACCAATGCTACGATGTCTAGGGGTTCTGAGAGGAAGGTCCCCCACACTGGTACTGAGACACGGACCAGACTCCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGACGGAAGTCTGAACCACCCACTTCGCGTGCAGGATGACTGCCCTATGGGTTGTAAACTGCTTTTATATAAGAGGAACAGTATTTATGTATAGATATTTGCCAGTATTATATGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGAGTTTAAAGGGTGAGTACGCGGTAGTATAAGTCAGCGGTGATAACTCGCAGCTCATCTGTAAGCTTGCCGTTGACACTGTATTACTTGACTTAACGTTGAGGTATGCTGAATGGGGGGGGGTTACCCGTTGAAATGCATTAATCAAAACAACAGACCACCCGATTTGCGGACGGCAGCAAAACTACACTGTCCACTGACGCTGATGCACAAAAGGCGTGGGTATCAAACAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGATTACTGGTTGTTTGTGATACACTGCAAGTGACTGAGCGAAAGCACTAAGTAATCCACTTGGCGAGTACGTCGGCAACGATGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTCTAATTCGAGGCAACGCGAAGAACCTTACCCAGACTTGACATCTAGGAAAGGTCCTTGAAAGAGGATCGTGCCCGCAAGGGAATCCTAAGACAGGTGTTGCATGGCTGTCGTCAGCTCCTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGCTTACAGTTACCATCGGTTCGGCCGGGGACTCTGTAAGGACTGCCGCTGATAAAGCGAAGGAAGGCGGGGACGACGTCAAGCAATCACGGCCCTTACGTCTGGGGCTACACACGTGCTACAATGGCCGGTACAATGAGTCGCAAAACCGCGAGGTCAAGCTAATCTCAAAAAACCGGTCTCAGTTCGGATTGGAGTCTGCAACCCGACTTCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCATGGCGCGGCGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCA
>998428
GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGAGTTGTTCCTTCGGGGACAGCTTAGCGGCGGACGGGTGAGTAACACGTAGGCAACCTGCCTGCAGGACCGGGATAACCCACGGAAACGTGAGCTAATACCGGATAGATGGTTCCCTCGCATGAGGGGATCAGGAAAGACGGGGCAACCTGTCACTTGTAGATGGGCCTGCGGCGCATTAGCTAGTTGGCGAGGTAACGGCTCACCAAGGCGACGATGCGTAGCCGACCTGAGAGGGTGAACGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCCGCAATGGACGAAAGTCTGACGGAGCAACGCCGCGTGAGTGAGGAAGGTCTTCGGATCGTAAAGCTCTGTTGCCAAGGAAGAACGCTTGGTGGAGTAACTGCCATCAAGGTGACGGTACTTGAGAAGAAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTGTCCGGAATTATGGGCGTAAGCGCGCGCAGCGGTTCTTTAAGTCTGAGGTTAAATGCAGGGCTCAACCTTGTAACGCCTTGGAAACTGGGGGACTGGAGTGTAGGAGAGGAAAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGCCTATAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAGGTGTTAGGGGTTTCGATACCCTTGGTGCCGAAGTTAACACAGTAAGCACTCCGCCTGGGGAGTACGCTCGCAAGAGTGAAACTCAAAGGAATTGACGGGGACCCGCACAGGCAGTGGAGTATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCCGATGAAACGTCTAGAGATAGGCGCCCTCTTCGGAGCATTGGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGAATTCAGTTGCCAGCACTTCGGGTGGGCACTCTGAATTGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCGTGCCCCTTATGACCTGGGCTACACACGTACTACAATGGTCGGTACAACGGGCAGCGAAGCCGCGAGGCGGAGCCAATCCTAGAAAAGCCGATCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGTCGGAATTGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGTCT
>4343117
AACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGCGTGCGCGGTTCACGAACTTGTACGTGGATGGGCGCACGGCGCAGGGGGGCGTAACACGTGGGCACTCTGCCCTCCGATGGGGAATACTCCCGCGAACCGGGGGCTAATACCGCATAACATTCCGAGGACTTGGGTTCTTGGATTCAAAGCAGTGATGCCTGTGAGGAGGAGCCCGCGCCCGATTAGCTAGTTGGTAGGGTAACGGCCTACCTCGGCAATGATCGGTAGCTGGTCTGAGAGGATAATCAGACACACTGCAACTGAAACGAGGCCCAGACTCCTACCGTAGGGAACGCTGGGGAATCTTGCCTTCTGGGCGAAAGCATGACCCAACGACGCCGCGTGGGGGATGAAGCTTTTGCTAGTGTAAACCCCTTTTCACTGGTAAGAATGCACGCAAGGGAGCGACAGTACCCTGGCAAGAAGCCCCGGCTAACTACGTGCCACCCGCCTCGGTAAGACCTAGGGGGCCAGCGTTGTTCGGAATTACTGGGTGTATAGGGTACTTATGCGGTGCGACAAGTTGGGAGTGAAATCTCTGGGCTTAACCCAGAGGCTGCTTCTCAAACTGCTATGCTTGATTGTGACAGAGGCTCTTGAAATTGCAGGAGTAGCGTTGAAATGCATGTATATCTGCAAGATCACCCGAGATATGGACGAACAGCTGGATCACAAGTGACGCTGAGGAACGAAAGCTACGCTGAGCGAACAGGATTATATACACTGGTAGTCCTAGCACTAAACGATCATGACTTGCGGTGACGACCGTTCGGACGTCTCCCGGAGCTAACGCGTTAAGTCCTGCACCTGGGGAGTACGGTCGCAGACTGGAAGTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAACATGTGGTTCAATTCGACGCTACGCGAGGAACCTTACCTGGTTCGAAATTCTTATGACCAGCTGTAGAATTACGGCTTTCCTTCAAGAGACATGAGTCTAGGCGCTCCATGGCTGTCGTCAGTTCGTTCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGCACGTAGTTACTACTCGCAAGAGAGGACTCTACGTGGACTGCTCCGGATAACGGAGAGGAAGGTGGGAATGACGTCAAGTCCGCATGGCCTTTATGTCCAGGGCTACACACGTGTTACAATGCAGGGTACAAACCGTTGCCAACCCGCGAGGGGGAGCTAATCGGATAAAACTGTGCTCAGTTCGGATTGCAGTCTGCAACTCGACTGCATGAAGCTGGAATCGCTAGTAATGGGGATCAGCTTGACGCCGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACATCACGAAAGTGAGCTCACCTAGAAGTCGCCACGCTAACCGCAAGGGGGCAGGCGCCCAAGGTATGACTCATGATTGGGGTG
>4353661
GGATGAACGCTAGCGGGAGGCTTAATACATGCAAGTCGAGGGTGAAGCTTTCTTCGGAAAGTGGAAACCGGCGAACGGGTGCGTAACGCGTACGCAACTTACCCCTTGCTGGAGAATAGCCCCGGGAAACTGGGATTAATGCTCCATGGTATGGTGAAATCGCATGATTTTATCATTAAAGGTTACGGCAAGGGATAGGCGTGCGTCCCATTAGCTTGTTGGTGAGGTAACGGCTCACCAATGCAAACGATGGGTAGCTGGTCTGAGAGGATGATCAGCCACACGGGCACTGAGACACGGGCCCGACTCCTACGGGAGGCAGCAGTAGGGAATATTGGACAATGGACGAAAGTCTGATCCAGCCATCCCGCGTGCAGGACGAATGCCCTATGGGTTGTAAACTGCTTTTCTAAGGAAAGAAATATCTCATTCATGAGGTGCTGACGGTACCTTAGGAATAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGCGGTATGATAAGTCAGTGGTGAAAGCCCGGGGCTCAACTCCGGAACTGCCGTTGATACTGTCATACTTGAGTCCAGTTGAGGTGGGCGGAATGATACATGTAGCGGTGAAATGCTTAGATATGTATCAGAACACCGACTGCGAAGGCAGCTCACTAAACTGGTACTGACGCTGAGGCACGAAAGCGTGGGTAGCGAACAGGATTAGATACCCTGGTAGTCCACGCCCTAAACGATGCTAACTCGGTATGTGCGATATACTGTACGTGCCTGAGGGAAACCGTTAAGTTAGCCACCTGGGGAGTACGTTCGCAAGAATGAAACTCAAAGGAATTGACGGGGGTCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCTGGGCTCTAATGTACCACGCCCGACCCTGAAAGGGGTCTTCTTCTTCGGAAGCGGGGTACAAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTCCTTAGTTGCCAGCGGTCCGGCCGGGGACTCTAAGGAGACTGCCTTCGCAAGGAGTGAGGAAGGAGGGGACGACGTCAAATCATCATGGCCTTTATGCCCAGGGCTACACACGTGCTACAATGGTGAGGACAAAGGGCAGCCACTTAGCGATAAGGAGCAAATCCCAAAAACCTCACCTCAGTTCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAACCGCTAGTAATCGCAGATCAGACATGCTGCGGTGAATACGTTCCCGGACCTTGTACACACCGCCCGTCAAGCCATGGAGCCGGGTGTACCTTAAGGCGATAACCGAAAGGAGTTGCCCAAGGTA"""

_seqs_16s = []
for seq_id, seq in list(parse_fasta(seqs_16s.split("\n"))):
    _seqs_16s.append(BiologicalSequence(seq, seq_id))
seqs_16s = SequenceCollection(_seqs_16s)


tax = """669210	k__Bacteria; p__; c__; o__; f__; g__; s__
881726	k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Paenibacillaceae; g__Paenibacillus; s__
296752	k__Bacteria; p__Spirochaetes; c__Spirochaetes; o__Spirochaetales; f__Spirochaetaceae; g__Treponema; s__
1794723	k__Bacteria; p__Planctomycetes; c__Planctomycetia; o__Gemmatales; f__Gemmataceae; g__Gemmata; s__
2941516	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Marinilabiaceae; g__; s__
793074	k__Bacteria; p__; c__; o__; f__; g__; s__
4353661	k__Bacteria; p__Bacteroidetes; c__[Saprospirae]; o__[Saprospirales]; f__; g__; s__
292553	k__Bacteria; p__Spirochaetes; c__Spirochaetes; o__Spirochaetales; f__Spirochaetaceae; g__Treponema; s__
2784824	k__Bacteria; p__Proteobacteria; c__Deltaproteobacteria; o__Syntrophobacterales; f__Syntrophaceae; g__; s__
1719550	k__Bacteria; p__Planctomycetes; c__Planctomycetia; o__Gemmatales; f__Gemmataceae; g__Gemmata; s__
182569	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__
266495	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__S24-7; g__; s__
524860	k__Bacteria; p__Planctomycetes; c__Planctomycetia; o__Gemmatales; f__Gemmataceae; g__Gemmata; s__
293514	k__Bacteria; p__Spirochaetes; c__Spirochaetes; o__Spirochaetales; f__Spirochaetaceae; g__Treponema; s__
2683209	k__Bacteria; p__WWE1; c__[Cloacamonae]; o__[Cloacamonales]; f__; g__; s__
501793	k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Paenibacillaceae; g__Paenibacillus; s__
229854	k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Legionellales; f__Legionellaceae; g__Legionella; s__
583705	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__; g__; s__
1142181	k__Bacteria; p__Spirochaetes; c__GN05; o__SBYZ_6080; f__; g__; s__
998428	k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Paenibacillaceae; g__Paenibacillus; s__
4343117	k__Bacteria; p__Acidobacteria; c__DA052; o__Ellin6513; f__; g__; s__"""

tax_lookup = dict([e.strip().split('\t') for e in tax.split('\n')])

Question 1

What fraction of the 3-base k-words (k = 3) found in the following two sequences are unique to one of them, i.e., present in only one of the two sequences?
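As a hedged illustration of the idea, the sketch below computes a k-word distance on two short toy sequences (not the ones from this question). It assumes the usual definition, the fraction of distinct k-words pooled from both sequences that occur in only one of them. The names kmer_set and toy_kmer_distance are made up for this example and are not the notebook's kmer_distance, so check help(kmer_distance) for the exact behavior of the function you are asked to use.

```python
def kmer_set(seq, k):
    """Return the set of all overlapping k-words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_kmer_distance(seq_a, seq_b, k=3):
    """Fraction of distinct k-words, pooled from both sequences,
    that occur in only one of the two sequences.

    Assumes at least one sequence is k bases or longer.
    """
    kmers_a = kmer_set(seq_a, k)
    kmers_b = kmer_set(seq_b, k)
    all_kmers = kmers_a | kmers_b       # every distinct k-word observed
    unique_kmers = kmers_a ^ kmers_b    # k-words seen in exactly one sequence
    return len(unique_kmers) / float(len(all_kmers))


# Toy input: 4 of the 6 distinct 3-words occur in only one sequence.
print(toy_kmer_distance("ACGTAC", "ACGTTT"))
```

Identical sequences give a distance of 0, and sequences that share no k-words give a distance of 1.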

In [ ]:
seq1 = BiologicalSequence('AGCTAGCATCGATCGATCGATGCATGCAT')
seq2 = BiologicalSequence('AGCTCGGCATCGAGGGCAGTCAATCGATCT')
In [ ]:
help(kmer_distance)
In [ ]:
# Compute the kmer distance in this cell

Question 2

Display the guide tree for the sequences in the cell below.

In [ ]:
query_seqs = SequenceCollection(
              [BiologicalSequence("ACGATGACCAGTGCTACCAGT", "s1"),
               BiologicalSequence("AACGATCGATCGATCGTGCTA", "s2"),
               BiologicalSequence("AACGATCTGCTA", "s3"),
               BiologicalSequence("CGATCGATGACATGCATG", "s4"),
               BiologicalSequence("CGATCTGCAT", "s5")])
In [ ]:
help(guide_tree_from_sequences)
In [ ]:
# Display the guide tree in this cell.

Question 3

How does the guide tree from Question 2 differ from the tree generated after 1 iteration of iterative multiple sequence alignment, and from the tree generated after 5 iterations? Display the trees for both 1 and 5 iterations of iterative multiple sequence alignment.

In [ ]:
help(iterative_msa_and_tree)
In [ ]:
from skbio.alignment import global_pairwise_align_nucleotide
# add your command for 1 iteration of iterative multiple sequence alignment here
# hint: pass pairwise_aligner=global_pairwise_align_nucleotide
In [ ]:
# add your command for 5 iterations of iterative multiple sequence alignment here
# hint: pass pairwise_aligner=global_pairwise_align_nucleotide

Question 4

Generate and display a tree based on progressive alignment of the sequences from the second cell (the ones in the seqs_16s variable). This step can take about 10 minutes to complete.

In [ ]:
help(progressive_msa_and_tree)
In [ ]:
# Add your command for progressive alignment and tree building here
# hint: pass pairwise_aligner=global_pairwise_align_nucleotide

Question 5

Using the tree representing the sequences from question four as a guide, define clusters (i.e., groups) of sequences at 90% and 70% identity. There is not a single right answer for this, or a single method for grouping sequences. Go about this systematically, and describe the process that you're going through in a couple of paragraphs. These groups are usually referred to as operational taxonomic units, or OTUs, because they represent a hypothesis about the taxonomic relatedness of a group of sequences (which is a proxy for a hypothesis about the relatedness of the group of organisms containing those sequences in their genomes).
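One possible way to be systematic, shown here only as a sketch, is greedy clustering: take the sequences in a fixed order, add each one to the first existing cluster whose representative it matches at or above the threshold, and otherwise start a new cluster. The function greedy_cluster, the ids a to d, and the identity values below are all hypothetical and are not results from the 16S sequences in this notebook.

```python
def greedy_cluster(seq_ids, identity, threshold):
    """Greedy clustering against cluster representatives.

    identity is any function taking two ids and returning a fraction
    between 0 and 1. Returns a list of (representative_id, member_ids).
    """
    clusters = []
    for seq_id in seq_ids:
        for rep_id, members in clusters:
            if identity(seq_id, rep_id) >= threshold:
                members.append(seq_id)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append((seq_id, [seq_id]))
    return clusters


# Hypothetical pairwise identities (fractions), for illustration only.
toy_identities = {
    frozenset(["a", "b"]): 0.95,
    frozenset(["a", "c"]): 0.75,
    frozenset(["b", "c"]): 0.72,
    frozenset(["a", "d"]): 0.50,
    frozenset(["b", "d"]): 0.52,
    frozenset(["c", "d"]): 0.55,
}


def toy_identity(id1, id2):
    """Look up the toy identity for an unordered pair of ids."""
    return 1.0 if id1 == id2 else toy_identities[frozenset([id1, id2])]


toy_ids = ["a", "b", "c", "d"]
otus_90 = greedy_cluster(toy_ids, toy_identity, 0.90)
otus_70 = greedy_cluster(toy_ids, toy_identity, 0.70)
print(otus_90)
print(otus_70)
```

With this approach the clusters depend on the order in which sequences are visited and on which member serves as the representative, which is one reason there is no single right answer. Your tree can help you choose a sensible order and decide which pairs are worth comparing.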

If you want to obtain a given sequence, you can now do so by looking up its identifier with seqs_16s.get_seq:

In [ ]:
print seqs_16s.get_seq('4343117')
In [ ]:
print seqs_16s.get_seq('4353661')

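To compute the pairwise identity for two sequences, use pairwise_percent_id as follows. Despite its name, it returns a fraction between 0 and 1, so 90% identity corresponds to a value of 0.90: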

In [ ]:
from skbio.alignment import global_pairwise_align_nucleotide
from skbio import BiologicalSequence

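def pairwise_percent_id(seq1_id, seq2_id, seq_lookup):
    """Return the identity (a fraction from 0 to 1) of two sequences, given their ids."""
    seq1 = seq_lookup.get_seq(seq1_id)
    seq2 = seq_lookup.get_seq(seq2_id)
    # Globally align the pair, then convert the alignment distance to an identity.
    aln = global_pairwise_align_nucleotide(seq1, seq2)
    return 1 - aln.distances()[0][1]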
In [ ]:
print pairwise_percent_id('793074', '4353661', seqs_16s)
In [ ]:
# Compute additional pairwise identities, as necessary, to answer this question here. Show all of your commands!

Discuss your results here.

Question 6

Choose one representative sequence from each of the clusters you defined in question 5. Look these up in tax_lookup by their ids to get the taxonomy of each sequence, and include those in the results below. When a taxonomic level has nothing after its prefix (for example, g__ or s__), there is no known taxonomic assignment for that sequence at that level.
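If it helps to compare taxonomies level by level, a taxonomy string in the format used in tax can be split into its levels. The sketch below is one way to do that; RANK_NAMES, parse_taxonomy and deepest_assigned are names made up for this example, and the example string is copied from the entry for 4353661 above.

```python
# Map each one-letter prefix to its rank name.
RANK_NAMES = {"k": "kingdom", "p": "phylum", "c": "class", "o": "order",
              "f": "family", "g": "genus", "s": "species"}


def parse_taxonomy(tax_string):
    """Split a taxonomy string into (rank, name) pairs.

    name is None when the level is unassigned (e.g. 'g__').
    """
    levels = []
    for field in tax_string.split(";"):
        prefix, _, name = field.strip().partition("__")
        levels.append((RANK_NAMES[prefix], name or None))
    return levels


def deepest_assigned(tax_string):
    """Return the most specific (rank, name) pair that has an assignment."""
    assigned = [(rank, name) for rank, name in parse_taxonomy(tax_string) if name]
    return assigned[-1] if assigned else None


example = ("k__Bacteria; p__Bacteroidetes; c__[Saprospirae]; "
           "o__[Saprospirales]; f__; g__; s__")
print(parse_taxonomy(example))
print(deepest_assigned(example))
```

In the notebook, the same functions could be applied to a looked-up value such as tax_lookup['4353661'].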

In [ ]:
print tax_lookup['4343117']
In [ ]:
print tax_lookup['4353661']
In [ ]:
# Perform additional taxonomy look-ups here

Discuss your results here.

Question 7

Is the taxonomy of the representative sequences consistent with the phylogenetic tree you generated in question 4? For your 90% and 70% OTUs, list three taxa (e.g., at the phylum, class, or species level) that are monophyletic, if any, and three taxa that are not monophyletic, if any. Discuss two specific reasons why some taxa might appear to not be monophyletic based on your tree.
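As a reminder of what monophyletic means in practice: a taxon is monophyletic on a rooted tree if the tips assigned to it make up exactly one clade, with no other tips inside that clade. The sketch below checks this on a toy tree written as nested tuples. The tuple representation, the tip names t1 to t5, and the taxon labels X, Y and Z are all hypothetical, and this is not the tree object returned by the functions in this notebook.

```python
def collect_clades(node, clades):
    """Append the tip set of every clade under node to clades.

    Returns the tip set of node itself. Tips are strings; internal
    nodes are tuples of child nodes.
    """
    if isinstance(node, tuple):
        tips = frozenset()
        for child in node:
            tips = tips | collect_clades(child, clades)
    else:
        tips = frozenset([node])
    clades.append(tips)
    return tips


def is_monophyletic(tree, tip_to_taxon, taxon):
    """True if the tips labeled with taxon form exactly one clade of the tree."""
    clades = []
    collect_clades(tree, clades)
    members = frozenset(tip for tip, label in tip_to_taxon.items()
                        if label == taxon)
    return len(members) > 0 and members in clades


# Toy rooted tree and toy labels, for illustration only.
toy_tree = ((("t1", "t2"), "t3"), ("t4", "t5"))
toy_taxa = {"t1": "X", "t2": "X", "t3": "Y", "t4": "Y", "t5": "Z"}

for taxon in ["X", "Y", "Z"]:
    print(taxon, is_monophyletic(toy_tree, toy_taxa, taxon))
```

Here X is monophyletic because t1 and t2 form their own clade, while Y is not because the smallest clade containing t3 and t4 also contains tips from other taxa. The result depends on where the tree is rooted, so the same check on a differently rooted tree can give a different answer.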

Discuss your results here.