#!/usr/bin/env python # coding: utf-8 # **Source of the materials**: Biopython cookbook (adapted) # Status: Draft # Multiple Sequence Alignment objects {#chapter:Bio.AlignIO} # =================================== # # This chapter is about Multiple Sequence Alignments, by which we mean a # collection of multiple sequences which have been aligned together – # usually with the insertion of gap characters, and addition of leading or # trailing gaps – such that all the sequence strings are the same length. # Such an alignment can be regarded as a matrix of letters, where each row # is held as a `SeqRecord` object internally. # # We will introduce the `MultipleSeqAlignment` object which holds this # kind of data, and the `Bio.AlignIO` module for reading and writing them # as various file formats (following the design of the `Bio.SeqIO` module # from the previous chapter). Note that both `Bio.SeqIO` and `Bio.AlignIO` # can read and write sequence alignment files. The appropriate choice will # depend largely on what you want to do with the data. # # The final part of this chapter is about our command line wrappers for # common multiple sequence alignment tools like ClustalW and MUSCLE. # # Parsing or Reading Sequence Alignments # -------------------------------------- # # We have two functions for reading in sequence alignments, # `Bio.AlignIO.read()` and `Bio.AlignIO.parse()` which following the # convention introduced in `Bio.SeqIO` are for files containing one or # multiple alignments respectively. # # Using `Bio.AlignIO.parse()` will return an *iterator* which # gives `MultipleSeqAlignment` objects. Iterators are typically used in a # for loop. Examples of situations where you will have multiple different # alignments include resampled alignments from the PHYLIP tool `seqboot`, # or multiple pairwise alignments from the EMBOSS tools `water` or # `needle`, or Bill Pearson’s FASTA tools. # # However, in many situations you will be dealing with files which contain # only a single alignment. In this case, you should use the # `Bio.AlignIO.read()` function which returns a single # `MultipleSeqAlignment` object. # # Both functions expect two mandatory arguments: # # 1. The first argument is a *handle* to read the data from, # typically an open file (see Section \[sec:appendix-handles\]), or # a filename. # # 2. The second argument is a lower case string specifying the # alignment format. As in `Bio.SeqIO` we don’t try and guess the file # format for you! See for a full # listing of supported formats. # # There is also an optional `seq_count` argument which is discussed in # Section \[sec:AlignIO-count-argument\] below for dealing with ambiguous # file formats which may contain more than one alignment. # # A further optional `alphabet` argument allowing you to specify the # expected alphabet. This can be useful as many alignment file formats do # not explicitly label the sequences as RNA, DNA or protein – which means # `Bio.AlignIO` will default to using a generic alphabet. # # ### Single Alignments # # As an example, consider the following annotation rich protein alignment # in the PFAM or Stockholm file format: # # # ``` # # STOCKHOLM 1.0 # #=GS COATB_BPIKE/30-81 AC P03620.1 # #=GS COATB_BPIKE/30-81 DR PDB; 1ifl ; 1-52; # #=GS Q9T0Q8_BPIKE/1-52 AC Q9T0Q8.1 # #=GS COATB_BPI22/32-83 AC P15416.1 # #=GS COATB_BPM13/24-72 AC P69541.1 # #=GS COATB_BPM13/24-72 DR PDB; 2cpb ; 1-49; # #=GS COATB_BPM13/24-72 DR PDB; 2cps ; 1-49; # #=GS COATB_BPZJ2/1-49 AC P03618.1 # #=GS Q9T0Q9_BPFD/1-49 AC Q9T0Q9.1 # #=GS Q9T0Q9_BPFD/1-49 DR PDB; 1nh4 A; 1-49; # #=GS COATB_BPIF1/22-73 AC P03619.2 # #=GS COATB_BPIF1/22-73 DR PDB; 1ifk ; 1-50; # COATB_BPIKE/30-81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA # #=GR COATB_BPIKE/30-81 SS -HHHHHHHHHHHHHH--HHHHHHHH--HHHHHHHHHHHHHHHHHHHHH---- # Q9T0Q8_BPIKE/1-52 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA # COATB_BPI22/32-83 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA # COATB_BPM13/24-72 AEGDDP...AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA # #=GR COATB_BPM13/24-72 SS ---S-T...CHCHHHHCCCCTCCCTTCHHHHHHHHHHHHHHHHHHHHCTT-- # COATB_BPZJ2/1-49 AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA # Q9T0Q9_BPFD/1-49 AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA # #=GR Q9T0Q9_BPFD/1-49 SS ------...-HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH-- # COATB_BPIF1/22-73 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA # #=GR COATB_BPIF1/22-73 SS XX-HHHH--HHHHHH--HHHHHHH--HHHHHHHHHHHHHHHHHHHHHHH--- # #=GC SS_cons XHHHHHHHHHHHHHHHCHHHHHHHHCHHHHHHHHHHHHHHHHHHHHHHHC-- # #=GC seq_cons AEssss...AptAhDSLpspAT-hIu.sWshVsslVsAsluIKLFKKFsSKA # // # # ``` # # This is the seed alignment for the Phage\_Coat\_Gp8 (PF05371) PFAM # entry, downloaded from a now out of date release of PFAM from # . We can load this file as follows (assuming # it has been saved to disk as “PF05371\_seed.sth” in the current working # directory): # # # In[1]: from Bio import AlignIO alignment = AlignIO.read("data/PF05371_seed.sth", "stockholm") # # This code will print out a summary of the alignment: # # # In[2]: print(alignment) # # You’ll notice in the above output the sequences have been truncated. We # could instead write our own code to format this as we please by # iterating over the rows as `SeqRecord` objects: # # # In[3]: from Bio import AlignIO alignment = AlignIO.read("data/PF05371_seed.sth", "stockholm") print("Alignment length %i" % alignment.get_alignment_length()) # In[4]: for record in alignment: print("%s - %s" % (record.seq, record.id)) # # You could also use the alignment object’s `format` method to show it in # a particular file format – see Section \[sec:alignment-format-method\] # for details. # # Did you notice in the raw file above that several of the sequences # include database cross-references to the PDB and the associated known # secondary structure? Try this: # # # In[5]: for record in alignment: if record.dbxrefs: print("%s %s" % (record.id, record.dbxrefs)) # # To have a look at all the sequence annotation, try this: # # # In[6]: for record in alignment: print(record) # # Sanger provide a nice web interface at # which will actually let # you download this alignment in several other formats. This is what the # file looks like in the FASTA file format: # # # ``` # >COATB_BPIKE/30-81 # AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA # >Q9T0Q8_BPIKE/1-52 # AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA # >COATB_BPI22/32-83 # DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA # >COATB_BPM13/24-72 # AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA # >COATB_BPZJ2/1-49 # AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA # >Q9T0Q9_BPFD/1-49 # AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA # >COATB_BPIF1/22-73 # FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA # # ``` # # Note the website should have an option about showing gaps as periods # (dots) or dashes, we’ve shown dashes above. Assuming you download and # save this as file “PF05371\_seed.faa” then you can load it with almost # exactly the same code: # # # ``` # from Bio import AlignIO # alignment = AlignIO.read("PF05371_seed.faa", "fasta") # print(alignment) # # ``` # # All that has changed in this code is the filename and the format string. # You’ll get the same output as before, the sequences and record # identifiers are the same. However, as you should expect, if you check # each `SeqRecord` there is no annotation nor database cross-references # because these are not included in the FASTA file format. # # Note that rather than using the Sanger website, you could have used # `Bio.AlignIO` to convert the original Stockholm format file into a FASTA # file yourself (see below). # # With any supported file format, you can load an alignment in exactly the # same way just by changing the format string. For example, use “phylip” # for PHYLIP files, “nexus” for NEXUS files or “emboss” for the alignments # output by the EMBOSS tools. There is a full listing on the wiki page # () and in the built in documentation # (also # [online](http://biopython.org/DIST/docs/api/Bio.AlignIO-module.html)): # # # In[7]: from Bio import AlignIO help(AlignIO) # # ### Multiple Alignments # # The previous section focused on reading files containing a single # alignment. In general however, files can contain more than one # alignment, and to read these files we must use the `Bio.AlignIO.parse()` # function. # # Suppose you have a small alignment in PHYLIP format: # # # ``` # 5 6 # Alpha AACAAC # Beta AACCCC # Gamma ACCAAC # Delta CCACCA # Epsilon CCAAAC # # ``` # # If you wanted to bootstrap a phylogenetic tree using the PHYLIP tools, # one of the steps would be to create a set of many resampled alignments # using the tool `bootseq`. This would give output something like this, # which has been abbreviated for conciseness: # # # ``` # 5 6 # Alpha AAACCA # Beta AAACCC # Gamma ACCCCA # Delta CCCAAC # Epsilon CCCAAA # 5 6 # Alpha AAACAA # Beta AAACCC # Gamma ACCCAA # Delta CCCACC # Epsilon CCCAAA # 5 6 # Alpha AAAAAC # Beta AAACCC # Gamma AACAAC # Delta CCCCCA # Epsilon CCCAAC # ... # 5 6 # Alpha AAAACC # Beta ACCCCC # Gamma AAAACC # Delta CCCCAA # Epsilon CAAACC # # ``` # # If you wanted to read this in using `Bio.AlignIO` you could use: # # # In[8]: from Bio import AlignIO alignments = AlignIO.parse("data/resampled.phy", "phylip") for alignment in alignments: print(alignment) print("") # # As with the function `Bio.SeqIO.parse()`, using `Bio.AlignIO.parse()` # returns an iterator. If you want to keep all the alignments in memory at # once, which will allow you to access them in any order, then turn the # iterator into a list: # # # In[10]: from Bio import AlignIO alignments = list(AlignIO.parse("data/resampled.phy", "phylip")) last_align = alignments[-1] first_align = alignments[0] # # ### Ambiguous Alignments {#sec:AlignIO-count-argument} # # Many alignment file formats can explicitly store more than one # alignment, and the division between each alignment is clear. However, # when a general sequence file format has been used there is no such block # structure. The most common such situation is when alignments have been # saved in the FASTA file format. For example consider the following: # # # ``` # >Alpha # ACTACGACTAGCTCAG--G # >Beta # ACTACCGCTAGCTCAGAAG # >Gamma # ACTACGGCTAGCACAGAAG # >Alpha # ACTACGACTAGCTCAGG-- # >Beta # ACTACCGCTAGCTCAGAAG # >Gamma # ACTACGGCTAGCACAGAAG # # ``` # # This could be a single alignment containing six sequences (with repeated # identifiers). Or, judging from the identifiers, this is probably two # different alignments each with three sequences, which happen to all have # the same length. # # What about this next example? # # # ``` # >Alpha # ACTACGACTAGCTCAG--G # >Beta # ACTACCGCTAGCTCAGAAG # >Alpha # ACTACGACTAGCTCAGG-- # >Gamma # ACTACGGCTAGCACAGAAG # >Alpha # ACTACGACTAGCTCAGG-- # >Delta # ACTACGGCTAGCACAGAAG # # ``` # # Again, this could be a single alignment with six sequences. However this # time based on the identifiers we might guess this is three pairwise # alignments which by chance have all got the same lengths. # # This final example is similar: # # # ``` # >Alpha # ACTACGACTAGCTCAG--G # >XXX # ACTACCGCTAGCTCAGAAG # >Alpha # ACTACGACTAGCTCAGG # >YYY # ACTACGGCAAGCACAGG # >Alpha # --ACTACGAC--TAGCTCAGG # >ZZZ # GGACTACGACAATAGCTCAGG # # ``` # # In this third example, because of the differing lengths, this cannot be # treated as a single alignment containing all six records. However, it # could be three pairwise alignments. # # Clearly trying to store more than one alignment in a FASTA file is not # ideal. However, if you are forced to deal with these as input files # `Bio.AlignIO` can cope with the most common situation where all the # alignments have the same number of records. One example of this is a # collection of pairwise alignments, which can be produced by the EMBOSS # tools `needle` and `water` – although in this situation, `Bio.AlignIO` # should be able to understand their native output using “emboss” as the # format string. # # To interpret these FASTA examples as several separate alignments, we can # use `Bio.AlignIO.parse()` with the optional `seq_count` argument which # specifies how many sequences are expected in each alignment (in these # examples, 3, 2 and 2 respectively). For example, using the third example # as the input data: # # # In[11]: for alignment in AlignIO.parse(handle, "fasta", seq_count=2): print("Alignment length %i" % alignment.get_alignment_length()) for record in alignment: print("%s - %s" % (record.seq, record.id)) print("") # # Using `Bio.AlignIO.read()` or `Bio.AlignIO.parse()` without the # `seq_count` argument would give a single alignment containing all six # records for the first two examples. For the third example, an exception # would be raised because the lengths differ preventing them being turned # into a single alignment. # # If the file format itself has a block structure allowing `Bio.AlignIO` # to determine the number of sequences in each alignment directly, then # the `seq_count` argument is not needed. If it is supplied, and doesn’t # agree with the file contents, an error is raised. # # Note that this optional `seq_count` argument assumes each alignment in # the file has the same number of sequences. Hypothetically you may come # across stranger situations, for example a FASTA file containing several # alignments each with a different number of sequences – although I would # love to hear of a real world example of this. Assuming you cannot get # the data in a nicer file format, there is no straight forward way to # deal with this using `Bio.AlignIO`. In this case, you could consider # reading in the sequences themselves using `Bio.SeqIO` and batching them # together to create the alignments as appropriate. # # Writing Alignments # ------------------ # # We’ve talked about using `Bio.AlignIO.read()` and `Bio.AlignIO.parse()` # for alignment input (reading files), and now we’ll look at # `Bio.AlignIO.write()` which is for alignment output (writing files). # This is a function taking three arguments: some `MultipleSeqAlignment` # objects (or for backwards compatibility the obsolete `Alignment` # objects), a handle or filename to write to, and a sequence format. # # Here is an example, where we start by creating a few # `MultipleSeqAlignment` objects the hard way (by hand, rather than by # loading them from a file). Note we create some `SeqRecord` objects to # construct the alignment from. # # # In[9]: from Bio.Alphabet import generic_dna from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio.Align import MultipleSeqAlignment align1 = MultipleSeqAlignment([ SeqRecord(Seq("ACTGCTAGCTAG", generic_dna), id="Alpha"), SeqRecord(Seq("ACT-CTAGCTAG", generic_dna), id="Beta"), SeqRecord(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma"), ]) align2 = MultipleSeqAlignment([ SeqRecord(Seq("GTCAGC-AG", generic_dna), id="Delta"), SeqRecord(Seq("GACAGCTAG", generic_dna), id="Epsilon"), SeqRecord(Seq("GTCAGCTAG", generic_dna), id="Zeta"), ]) align3 = MultipleSeqAlignment([ SeqRecord(Seq("ACTAGTACAGCTG", generic_dna), id="Eta"), SeqRecord(Seq("ACTAGTACAGCT-", generic_dna), id="Theta"), SeqRecord(Seq("-CTACTACAGGTG", generic_dna), id="Iota"), ]) my_alignments = [align1, align2, align3] # # Now we have a list of `Alignment` objects, we’ll write them to a PHYLIP # format file: # # # In[10]: from Bio import AlignIO AlignIO.write(my_alignments, "my_example.phy", "phylip") # # And if you open this file in your favourite text editor it should look # like this: # # # ``` # 3 12 # Alpha ACTGCTAGCT AG # Beta ACT-CTAGCT AG # Gamma ACTGCTAGDT AG # 3 9 # Delta GTCAGC-AG # Epislon GACAGCTAG # Zeta GTCAGCTAG # 3 13 # Eta ACTAGTACAG CTG # Theta ACTAGTACAG CT- # Iota -CTACTACAG GTG # # ``` # # Its more common to want to load an existing alignment, and save that, # perhaps after some simple manipulation like removing certain rows or # columns. # # Suppose you wanted to know how many alignments the `Bio.AlignIO.write()` # function wrote to the handle? If your alignments were in a list like the # example above, you could just use `len(my_alignments)`, however you # can’t do that when your records come from a generator/iterator. # Therefore the `Bio.AlignIO.write()` function returns the number of # alignments written to the file. # # *Note* - If you tell the `Bio.AlignIO.write()` function to write to a # file that already exists, the old file will be overwritten without any # warning. # # ### Converting between sequence alignment file formats {#sec:converting-alignments} # # Converting between sequence alignment file formats with `Bio.AlignIO` # works in the same way as converting between sequence file formats with # `Bio.SeqIO` (Section \[sec:SeqIO-conversion\]). We load generally the # alignment(s) using `Bio.AlignIO.parse()` and then save them using the # `Bio.AlignIO.write()` – or just use the `Bio.AlignIO.convert()` helper # function. # # For this example, we’ll load the PFAM/Stockholm format file used earlier # and save it as a Clustal W format file: # # # In[11]: from Bio import AlignIO count = AlignIO.convert("data/PF05371_seed.sth", "stockholm", "PF05371_seed.aln", "clustal") print("Converted %i alignments" % count) # # Or, using `Bio.AlignIO.parse()` and `Bio.AlignIO.write()`: # # # In[12]: from Bio import AlignIO alignments = AlignIO.parse("data/PF05371_seed.sth", "stockholm") count = AlignIO.write(alignments, "PF05371_seed.aln", "clustal") print("Converted %i alignments" % count) # # The `Bio.AlignIO.write()` function expects to be given multiple # alignment objects. In the example above we gave it the alignment # iterator returned by `Bio.AlignIO.parse()`. # # In this case, we know there is only one alignment in the file so we # could have used `Bio.AlignIO.read()` instead, but notice we have to pass # this alignment to `Bio.AlignIO.write()` as a single element list: # # # In[13]: from Bio import AlignIO alignment = AlignIO.read("data/PF05371_seed.sth", "stockholm") AlignIO.write([alignment], "PF05371_seed.aln", "clustal") # # Either way, you should end up with the same new Clustal W format file # “PF05371\_seed.aln” with the following content: # # # ``` # CLUSTAL X (1.81) multiple sequence alignment # # # COATB_BPIKE/30-81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSS # Q9T0Q8_BPIKE/1-52 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVS # COATB_BPI22/32-83 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSS # COATB_BPM13/24-72 AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS # COATB_BPZJ2/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFAS # Q9T0Q9_BPFD/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS # COATB_BPIF1/22-73 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS # # COATB_BPIKE/30-81 KA # Q9T0Q8_BPIKE/1-52 RA # COATB_BPI22/32-83 KA # COATB_BPM13/24-72 KA # COATB_BPZJ2/1-49 KA # Q9T0Q9_BPFD/1-49 KA # COATB_BPIF1/22-73 RA # # ``` # # Alternatively, you could make a PHYLIP format file which we’ll name # “PF05371\_seed.phy”: # # # In[14]: from Bio import AlignIO AlignIO.convert("data/PF05371_seed.sth", "stockholm", "PF05371_seed.phy", "phylip") # # This time the output looks like this: # # # ``` # 7 52 # COATB_BPIK AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIRLFKKFSS # Q9T0Q8_BPI AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIKLFKKFVS # COATB_BPI2 DGTSTATSYA TEAMNSLKTQ ATDLIDQTWP VVTSVAVAGL AIRLFKKFSS # COATB_BPM1 AEGDDP---A KAAFNSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS # COATB_BPZJ AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFAS # Q9T0Q9_BPF AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS # COATB_BPIF FAADDATSQA KAAFDSLTAQ ATEMSGYAWA LVVLVVGATV GIKLFKKFVS # # KA # RA # KA # KA # KA # KA # RA # # ``` # # One of the big handicaps of the original PHYLIP alignment file format is # that the sequence identifiers are strictly truncated at ten characters. # In this example, as you can see the resulting names are still unique - # but they are not very readable. As a result, a more relaxed variant of # the original PHYLIP format is now quite widely used: # # # In[15]: from Bio import AlignIO AlignIO.convert("data/PF05371_seed.sth", "stockholm", "PF05371_seed.phy", "phylip-relaxed") # # This time the output looks like this, using a longer indentation to # allow all the identifers to be given in full:: # # # ``` # 7 52 # COATB_BPIKE/30-81 AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIRLFKKFSS # Q9T0Q8_BPIKE/1-52 AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIKLFKKFVS # COATB_BPI22/32-83 DGTSTATSYA TEAMNSLKTQ ATDLIDQTWP VVTSVAVAGL AIRLFKKFSS # COATB_BPM13/24-72 AEGDDP---A KAAFNSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS # COATB_BPZJ2/1-49 AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFAS # Q9T0Q9_BPFD/1-49 AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS # COATB_BPIF1/22-73 FAADDATSQA KAAFDSLTAQ ATEMSGYAWA LVVLVVGATV GIKLFKKFVS # # KA # RA # KA # KA # KA # KA # RA # # ``` # # If you have to work with the original strict PHYLIP format, then you may # need to compress the identifers somehow – or assign your own names or # numbering system. This following bit of code manipulates the record # identifiers before saving the output: # # # In[16]: from Bio import AlignIO alignment = AlignIO.read("data/PF05371_seed.sth", "stockholm") name_mapping = {} for i, record in enumerate(alignment): name_mapping[i] = record.id record.id = "seq%i" % i print(name_mapping) AlignIO.write([alignment], "PF05371_seed.phy", "phylip") # # Here is the new (strict) PHYLIP format output: # # # ``` # 7 52 # seq0 AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIRLFKKFSS # seq1 AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIKLFKKFVS # seq2 DGTSTATSYA TEAMNSLKTQ ATDLIDQTWP VVTSVAVAGL AIRLFKKFSS # seq3 AEGDDP---A KAAFNSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS # seq4 AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFAS # seq5 AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS # seq6 FAADDATSQA KAAFDSLTAQ ATEMSGYAWA LVVLVVGATV GIKLFKKFVS # # KA # RA # KA # KA # KA # KA # RA # # ``` # # In general, because of the identifier limitation, working with *strict* # PHYLIP file formats shouldn’t be your first choice. Using the # PFAM/Stockholm format on the other hand allows you to record a lot of # additional annotation too. # # ### Getting your alignment objects as formatted strings {#sec:alignment-format-method} # # The `Bio.AlignIO` interface is based on handles, which means if you want # to get your alignment(s) into a string in a particular file format you # need to do a little bit more work (see below). However, you will # probably prefer to take advantage of the alignment object’s `format()` # method. This takes a single mandatory argument, a lower case string # which is supported by `Bio.AlignIO` as an output format. For example: # # # ``` # from Bio import AlignIO # alignment = AlignIO.read("PF05371_seed.sth", "stockholm") # print(alignment.format("clustal")) # # ``` # # As described in Section \[sec:SeqRecord-format\], the `SeqRecord` object # has a similar method using output formats supported by `Bio.SeqIO`. # # Internally the `format()` method is using the `StringIO` string based # handle and calling `Bio.AlignIO.write()`. You can do this in your own # code if for example you are using an older version of Biopython: # # # ``` # from Bio import AlignIO # from StringIO import StringIO # # alignments = AlignIO.parse("PF05371_seed.sth", "stockholm") # # out_handle = StringIO() # AlignIO.write(alignments, out_handle, "clustal") # clustal_data = out_handle.getvalue() # # print(clustal_data) # # ``` # # Manipulating Alignments {#sec:manipulating-alignments} # ----------------------- # # Now that we’ve covered loading and saving alignments, we’ll look at what # else you can do with them. # # ### Slicing alignments # # First of all, in some senses the alignment objects act like a Python # `list` of `SeqRecord` objects (the rows). With this model in mind # hopefully the actions of `len()` (the number of rows) and iteration # (each row as a `SeqRecord`) make sense: # # # In[17]: from Bio import AlignIO alignment = AlignIO.read("data/PF05371_seed.sth", "stockholm") print("Number of rows: %i" % len(alignment)) # In[18]: for record in alignment: print("%s - %s" % (record.seq, record.id)) # # You can also use the list-like `append` and `extend` methods to add more # rows to the alignment (as `SeqRecord` objects). Keeping the list # metaphor in mind, simple slicing of the alignment should also make sense # - it selects some of the rows giving back another alignment object: # # # In[19]: print(alignment) # In[20]: print(alignment[3:7]) # # What if you wanted to select by column? Those of you who have used the # NumPy matrix or array objects won’t be surprised at this - you use a # double index. # # # In[21]: print(alignment[2, 6]) # # Using two integer indices pulls out a single letter, short hand for # this: # # # In[22]: print(alignment[2].seq[6]) # # You can pull out a single column as a string like this: # # # In[23]: print(alignment[:, 6]) # # You can also select a range of columns. For example, to pick out those # same three rows we extracted earlier, but take just their first six # columns: # # # In[24]: print(alignment[3:6, :6]) # # Leaving the first index as `:` means take all the rows: # # # In[25]: print(alignment[:, :6]) # # This brings us to a neat way to remove a section. Notice columns 7, 8 # and 9 which are gaps in three of the seven sequences: # # # In[26]: print(alignment[:, 6:9]) # # Again, you can slice to get everything after the ninth column: # # # In[27]: print(alignment[:, 9:]) # # Now, the interesting thing is that addition of alignment objects works # by column. This lets you do this as a way to remove a block of columns: # # # In[28]: edited = alignment[:, :6] + alignment[:, 9:] print(edited) # # Another common use of alignment addition would be to combine alignments # for several different genes into a meta-alignment. Watch out though - # the identifiers need to match up (see Section \[sec:SeqRecord-addition\] # for how adding `SeqRecord` objects works). You may find it helpful to # first sort the alignment rows alphabetically by id: # # # In[29]: edited.sort() print(edited) # # Note that you can only add two alignments together if they have the same # number of rows. # # ### Alignments as arrays # # Depending on what you are doing, it can be more useful to turn the # alignment object into an array of letters – and you can do this with # NumPy: # # # In[30]: import numpy as np from Bio import AlignIO alignment = AlignIO.read("data/PF05371_seed.sth", "stockholm") align_array = np.array([list(rec) for rec in alignment], np.character) print("Array shape %i by %i" % align_array.shape) # # If you will be working heavily with the columns, you can tell NumPy to # store the array by column (as in Fortran) rather then its default of by # row (as in C): # # # In[31]: align_array = np.array([list(rec) for rec in alignment], np.character, order="F") # # Note that this leaves the original Biopython alignment object and the # NumPy array in memory as separate objects - editing one will not update # the other! # # Alignment Tools {#sec:alignment-tools} # --------------- # # There are *lots* of algorithms out there for aligning sequences, both # pairwise alignments and multiple sequence alignments. These calculations # are relatively slow, and you generally wouldn’t want to write such an # algorithm in Python. Instead, you can use Biopython to invoke a command # line tool on your behalf. Normally you would: # # 1. Prepare an input file of your unaligned sequences, typically this # will be a FASTA file which you might create using `Bio.SeqIO` # (see Chapter \[chapter:Bio.SeqIO\]). # # 2. Call the command line tool to process this input file, typically via # one of Biopython’s command line wrappers (which we’ll discuss here). # # 3. Read the output from the tool, i.e. your aligned sequences, # typically using `Bio.AlignIO` (see earlier in this chapter). # # All the command line wrappers we’re going to talk about in this chapter # follow the same style. You create a command line object specifying the # options (e.g. the input filename and the output filename), then invoke # this command line via a Python operating system call (e.g. using the # `subprocess` module). # # Most of these wrappers are defined in the `Bio.Align.Applications` # module: # # # In[32]: import Bio.Align.Applications dir(Bio.Align.Applications) # # (Ignore the entries starting with an underscore – these have special # meaning in Python.) The module `Bio.Emboss.Applications` has wrappers # for some of the [EMBOSS suite](http://emboss.sourceforge.net/), # including `needle` and `water`, which are described below in # Section \[seq:emboss-needle-water\], and wrappers for the EMBOSS # packaged versions of the PHYLIP tools (which EMBOSS refer to as one of # their EMBASSY packages - third party tools with an EMBOSS style # interface). We won’t explore all these alignment tools here in the # section, just a sample, but the same principles apply. # # ### ClustalW {#sec:align_clustal} # # ClustalW is a popular command line tool for multiple sequence alignment # (there is also a graphical interface called ClustalX). Biopython’s # `Bio.Align.Applications` module has a wrapper for this alignment tool # (and several others). # # Before trying to use ClustalW from within Python, you should first try # running the ClustalW tool yourself by hand at the command line, to # familiarise yourself the other options. You’ll find the Biopython # wrapper is very faithful to the actual command line API: # # # In[33]: from Bio.Align.Applications import ClustalwCommandline help(ClustalwCommandline) # # For the most basic usage, all you need is to have a FASTA input file, # such as # [opuntia.fasta](http://biopython.org/DIST/docs/tutorial/examples/opuntia.fasta) # (available online or in the Doc/examples subdirectory of the Biopython # source code). This is a small FASTA file containing seven prickly-pear # DNA sequences (from the cactus family *Opuntia*). # # By default ClustalW will generate an alignment and guide tree file with # names based on the input FASTA file, in this case `opuntia.aln` and # `opuntia.dnd`, but you can override this or make it explicit: # # # In[34]: from Bio.Align.Applications import ClustalwCommandline cline = ClustalwCommandline("clustalw2", infile="data/opuntia.fasta") print(cline) # # Notice here we have given the executable name as `clustalw2`, indicating # we have version two installed, which has a different filename to version # one (`clustalw`, the default). Fortunately both versions support the # same set of arguments at the command line (and indeed, should be # functionally identical). # # You may find that even though you have ClustalW installed, the above # command doesn’t work – you may get a message about “command not found” # (especially on Windows). This indicated that the ClustalW executable is # not on your PATH (an environment variable, a list of directories to be # searched). You can either update your PATH setting to include the # location of your copy of ClustalW tools (how you do this will depend on # your OS), or simply type in the full path of the tool. For example: # # # In[35]: import os from Bio.Align.Applications import ClustalwCommandline clustalw_exe = r"C:\Program Files\new clustal\clustalw2.exe" clustalw_cline = ClustalwCommandline(clustalw_exe, infile="data/opuntia.fasta") # # # In[37]: assert os.path.isfile(clustalw_exe), "Clustal W executable missing" stdout, stderr = clustalw_cline() # # Remember, in Python strings `\n` and `\t` are by default interpreted as # a new line and a tab – which is why we’re put a letter “r” at the start # for a raw string that isn’t translated in this way. This is generally # good practice when specifying a Windows style file name. # # Internally this uses the `subprocess` module which is now the # recommended way to run another program in Python. This replaces older # options like the `os.system()` and the `os.popen*` functions. # # Now, at this point it helps to know about how command line tools “work”. # When you run a tool at the command line, it will often print text output # directly to screen. This text can be captured or redirected, via two # “pipes”, called standard output (the normal results) and standard error # (for error messages and debug messages). There is also standard input, # which is any text fed into the tool. These names get shortened to stdin, # stdout and stderr. When the tool finishes, it has a return code (an # integer), which by convention is zero for success. # # When you run the command line tool like this via the Biopython wrapper, # it will wait for it to finish, and check the return code. If this is non # zero (indicating an error), an exception is raised. The wrapper then # returns two strings, stdout and stderr. # # In the case of ClustalW, when run at the command line all the important # output is written directly to the output files. Everything normally # printed to screen while you wait (via stdout or stderr) is boring and # can be ignored (assuming it worked). # # What we care about are the two output files, the alignment and the guide # tree. We didn’t tell ClustalW what filenames to use, but it defaults to # picking names based on the input file. In this case the output should be # in the file `opuntia.aln`. You should be able to work out how to read in # the alignment using `Bio.AlignIO` by now: # # # In[38]: from Bio import AlignIO align = AlignIO.read("data/opuntia.aln", "clustal") print(align) # # In case you are interested (and this is an aside from the main thrust of # this chapter), the `opuntia.dnd` file ClustalW creates is just a # standard Newick tree file, and `Bio.Phylo` can parse these: # # # In[39]: from Bio import Phylo tree = Phylo.read("data/opuntia.dnd", "newick") Phylo.draw_ascii(tree) # # Chapter \[sec:Phylo\] covers Biopython’s support for phylogenetic trees # in more depth. # # ### MUSCLE # # MUSCLE is a more recent multiple sequence alignment tool than ClustalW, # and Biopython also has a wrapper for it under the # `Bio.Align.Applications` module. As before, we recommend you try using # MUSCLE from the command line before trying it from within Python, as the # Biopython wrapper is very faithful to the actual command line API: # # # In[40]: from Bio.Align.Applications import MuscleCommandline help(MuscleCommandline) # # For the most basic usage, all you need is to have a FASTA input file, # such as # [opuntia.fasta](http://biopython.org/DIST/docs/tutorial/examples/opuntia.fasta) # (available online or in the Doc/examples subdirectory of the Biopython # source code). You can then tell MUSCLE to read in this FASTA file, and # write the alignment to an output file: # # # In[41]: from Bio.Align.Applications import MuscleCommandline cline = MuscleCommandline(input="data/opuntia.fasta", out="opuntia.txt") print(cline) # # Note that MUSCLE uses “-in” and “-out” but in Biopython we have to use # “input” and “out” as the keyword arguments or property names. This is # because “in” is a reserved word in Python. # # By default MUSCLE will output the alignment as a FASTA file (using # gapped sequences). The `Bio.AlignIO` module should be able to read this # alignment using `format=fasta`. You can also ask for ClustalW-like # output: # # # In[42]: from Bio.Align.Applications import MuscleCommandline cline = MuscleCommandline(input="data/opuntia.fasta", out="opuntia.aln", clw=True) print(cline) # # Or, strict ClustalW output where the original ClustalW header line is # used for maximum compatibility: # # # In[43]: from Bio.Align.Applications import MuscleCommandline cline = MuscleCommandline(input="data/opuntia.fasta", out="opuntia.aln", clwstrict=True) print(cline) # # The `Bio.AlignIO` module should be able to read these alignments using # `format=clustal`. # # MUSCLE can also output in GCG MSF format (using the `msf` argument), but # Biopython can’t currently parse that, or using HTML which would give a # human readable web page (not suitable for parsing). # # You can also set the other optional parameters, for example the maximum # number of iterations. See the built in help for details. # # You would then run MUSCLE command line string as described above for # ClustalW, and parse the output using `Bio.AlignIO` to get an alignment # object. # # ### MUSCLE using stdout # # Using a MUSCLE command line as in the examples above will write the # alignment to a file. This means there will be no important information # written to the standard out (stdout) or standard error (stderr) handles. # However, by default MUSCLE will write the alignment to standard output # (stdout). We can take advantage of this to avoid having a temporary # output file! For example: # # # In[44]: from Bio.Align.Applications import MuscleCommandline muscle_cline = MuscleCommandline(input="data/opuntia.fasta") print(muscle_cline) # # If we run this via the wrapper, we get back the output as a string. In # order to parse this we can use `StringIO` to turn it into a handle. # Remember that MUSCLE defaults to using FASTA as the output format: # # # In[45]: from Bio.Align.Applications import MuscleCommandline muscle_cline = MuscleCommandline(input="data/opuntia.fasta") stdout, stderr = muscle_cline() from io import StringIO from Bio import AlignIO align = AlignIO.read(StringIO(stdout), "fasta") print(align) # # The above approach is fairly simple, but if you are dealing with very # large output text the fact that all of stdout and stderr is loaded into # memory as a string can be a potential drawback. Using the `subprocess` # module we can work directly with handles instead: # # # In[49]: import subprocess import sys from Bio.Align.Applications import MuscleCommandline muscle_cline = MuscleCommandline(input="data/opuntia.fasta") child = subprocess.Popen(str(muscle_cline), stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=(sys.platform != "win32"), universal_newlines=True) from Bio import AlignIO align = AlignIO.read(child.stdout, "fasta") print(align) # # ### MUSCLE using stdin and stdout # # We don’t actually *need* to have our FASTA input sequences prepared in a # file, because by default MUSCLE will read in the input sequence from # standard input! Note this is a bit more advanced and fiddly, so don’t # bother with this technique unless you need to. # # First, we’ll need some unaligned sequences in memory as `SeqRecord` # objects. For this demonstration I’m going to use a filtered version of # the original FASTA file (using a generator expression), taking just six # of the seven sequences: # # # In[73]: from Bio import SeqIO records = (r for r in SeqIO.parse("data/opuntia.fasta", "fasta") if len(r) < 900) # # Then we create the MUSCLE command line, leaving the input and output to # their defaults (stdin and stdout). I’m also going to ask for strict # ClustalW format as for the output. # # # In[90]: from Bio.Align.Applications import MuscleCommandline muscle_cline = MuscleCommandline(clwstrict=True) print(muscle_cline) # # Now for the fiddly bits using the `subprocess` module, stdin and stdout: # # # In[91]: import subprocess import sys child = subprocess.Popen(str(cline), stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True, shell=(sys.platform != "win32")) # # That should start MUSCLE, but it will be sitting waiting for its FASTA # input sequences, which we must supply via its stdin handle: # # # In[92]: SeqIO.write(records, child.stdin, "fasta") # In[93]: child.stdin.close() # # After writing the six sequences to the handle, MUSCLE will still be # waiting to see if that is all the FASTA sequences or not – so we must # signal that this is all the input data by closing the handle. At that # point MUSCLE should start to run, and we can ask for the output: # # # In[94]: from Bio import AlignIO align = AlignIO.read(child.stdout, "clustal") print(align) # # Wow! There we are with a new alignment of just the six records, without # having created a temporary FASTA input file, or a temporary alignment # output file. However, a word of caution: Dealing with errors with this # style of calling external programs is much more complicated. It also # becomes far harder to diagnose problems, because you can’t try running # MUSCLE manually outside of Biopython (because you don’t have the input # file to supply). There can also be subtle cross platform issues (e.g. # Windows versus Linux, Python 2 versus Python 3), and how you run your # script can have an impact (e.g. at the command line, from IDLE or an # IDE, or as a GUI script). These are all generic Python issues though, # and not specific to Biopython. # # If you find working directly with `subprocess` like this scary, there is # an alternative. If you execute the tool with `muscle_cline()` you can # supply any standard input as a big string, `muscle_cline(stdin=...)`. # So, provided your data isn’t very big, you can prepare the FASTA input # in memory as a string using `StringIO` (see # Section \[sec:appendix-handles\]): # # # In[95]: from Bio import SeqIO records = (r for r in SeqIO.parse("data/opuntia.fasta", "fasta") if len(r) < 900) handle = StringIO() SeqIO.write(records, handle, "fasta") # In[96]: data = handle.getvalue() # # You can then run the tool and parse the alignment as follows: # # # In[97]: stdout, stderr = muscle_cline(stdin=data) from Bio import AlignIO align = AlignIO.read(StringIO(stdout), "clustal") print(align) # # You might find this easier, but it does require more memory (RAM) for # the strings used for the input FASTA and output Clustal formatted data. # # ### EMBOSS needle and water {#seq:emboss-needle-water} # # The [EMBOSS](http://emboss.sourceforge.net/) suite includes the `water` # and `needle` tools for Smith-Waterman algorithm local alignment, and # Needleman-Wunsch global alignment. The tools share the same style # interface, so switching between the two is trivial – we’ll just use # `needle` here. # # Suppose you want to do a global pairwise alignment between two # sequences, prepared in FASTA format as follows: # # # ``` # >HBA_HUMAN # MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG # KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP # AVHASLDKFLASVSTVLTSKYR # # ``` # # in a file `alpha.fasta`, and secondly in a file `beta.fasta`: # # # ``` # >HBB_HUMAN # MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK # VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG # KEFTPPVQAAYQKVVAGVANALAHKYH # # ``` # # Let’s start by creating a complete `needle` command line object in one # go: # # # In[98]: from Bio.Emboss.Applications import NeedleCommandline needle_cline = NeedleCommandline(asequence="data/alpha.faa", bsequence="data/beta.faa", gapopen=10, gapextend=0.5, outfile="needle.txt") print(needle_cline) # # Why not try running this by hand at the command prompt? You should see # it does a pairwise comparison and records the output in the file # `needle.txt` (in the default EMBOSS alignment file format). # # Even if you have EMBOSS installed, running this command may not work – # you might get a message about “command not found” (especially on # Windows). This probably means that the EMBOSS tools are not on your PATH # environment variable. You can either update your PATH setting, or simply # tell Biopython the full path to the tool, for example: # # # In[99]: from Bio.Emboss.Applications import NeedleCommandline needle_cline = NeedleCommandline(r"C:\EMBOSS\needle.exe", asequence="data/alpha.faa", bsequence="data/beta.faa", gapopen=10, gapextend=0.5, outfile="needle.txt") # # Remember in Python that for a default string `\n` or `\t` means a new # line or a tab – which is why we’re put a letter “r” at the start for a # raw string. # # At this point it might help to try running the EMBOSS tools yourself by # hand at the command line, to familiarise yourself the other options and # compare them to the Biopython help text: # # # In[100]: from Bio.Emboss.Applications import NeedleCommandline help(NeedleCommandline) # # Note that you can also specify (or change or look at) the settings like # this: # # # In[101]: from Bio.Emboss.Applications import NeedleCommandline needle_cline = NeedleCommandline() needle_cline.asequence="data/alpha.faa" needle_cline.bsequence="data/beta.faa" needle_cline.gapopen=10 needle_cline.gapextend=0.5 needle_cline.outfile="needle.txt" print(needle_cline) # In[102]: print(needle_cline.outfile) # # Next we want to use Python to run this command for us. As explained # above, for full control, we recommend you use the built in Python # `subprocess` module, but for simple usage the wrapper object usually # suffices: # # # In[103]: stdout, stderr = needle_cline() print(stdout + stderr) # # Next we can load the output file with `Bio.AlignIO` as discussed earlier # in this chapter, as the `emboss` format: # # # In[104]: from Bio import AlignIO align = AlignIO.read("needle.txt", "emboss") print(align) # In[ ]: