#!/usr/bin/env python
# coding: utf-8
# **Source of the materials**: Biopython cookbook (adapted)
# Status: Draft
# Multiple Sequence Alignment objects {#chapter:Bio.AlignIO}
# ===================================
#
# This chapter is about Multiple Sequence Alignments, by which we mean a
# collection of multiple sequences which have been aligned together –
# usually with the insertion of gap characters, and addition of leading or
# trailing gaps – such that all the sequence strings are the same length.
# Such an alignment can be regarded as a matrix of letters, where each row
# is held as a `SeqRecord` object internally.
#
# We will introduce the `MultipleSeqAlignment` object which holds this
# kind of data, and the `Bio.AlignIO` module for reading and writing them
# as various file formats (following the design of the `Bio.SeqIO` module
# from the previous chapter). Note that both `Bio.SeqIO` and `Bio.AlignIO`
# can read and write sequence alignment files. The appropriate choice will
# depend largely on what you want to do with the data.
#
# The final part of this chapter is about our command line wrappers for
# common multiple sequence alignment tools like ClustalW and MUSCLE.
#
# Parsing or Reading Sequence Alignments
# --------------------------------------
#
# We have two functions for reading in sequence alignments,
# `Bio.AlignIO.read()` and `Bio.AlignIO.parse()` which following the
# convention introduced in `Bio.SeqIO` are for files containing one or
# multiple alignments respectively.
#
# Using `Bio.AlignIO.parse()` will return an *iterator* which
# gives `MultipleSeqAlignment` objects. Iterators are typically used in a
# for loop. Examples of situations where you will have multiple different
# alignments include resampled alignments from the PHYLIP tool `seqboot`,
# or multiple pairwise alignments from the EMBOSS tools `water` or
# `needle`, or Bill Pearson’s FASTA tools.
#
# However, in many situations you will be dealing with files which contain
# only a single alignment. In this case, you should use the
# `Bio.AlignIO.read()` function which returns a single
# `MultipleSeqAlignment` object.
#
# Both functions expect two mandatory arguments:
#
# 1. The first argument is a *handle* to read the data from,
# typically an open file (see Section \[sec:appendix-handles\]), or
# a filename.
#
# 2. The second argument is a lower case string specifying the
# alignment format. As in `Bio.SeqIO` we don’t try and guess the file
# format for you! See for a full
# listing of supported formats.
#
# There is also an optional `seq_count` argument which is discussed in
# Section \[sec:AlignIO-count-argument\] below for dealing with ambiguous
# file formats which may contain more than one alignment.
#
# A further optional `alphabet` argument allowing you to specify the
# expected alphabet. This can be useful as many alignment file formats do
# not explicitly label the sequences as RNA, DNA or protein – which means
# `Bio.AlignIO` will default to using a generic alphabet.
#
# ### Single Alignments
#
# As an example, consider the following annotation rich protein alignment
# in the PFAM or Stockholm file format:
#
#
# ```
# # STOCKHOLM 1.0
# #=GS COATB_BPIKE/30-81 AC P03620.1
# #=GS COATB_BPIKE/30-81 DR PDB; 1ifl ; 1-52;
# #=GS Q9T0Q8_BPIKE/1-52 AC Q9T0Q8.1
# #=GS COATB_BPI22/32-83 AC P15416.1
# #=GS COATB_BPM13/24-72 AC P69541.1
# #=GS COATB_BPM13/24-72 DR PDB; 2cpb ; 1-49;
# #=GS COATB_BPM13/24-72 DR PDB; 2cps ; 1-49;
# #=GS COATB_BPZJ2/1-49 AC P03618.1
# #=GS Q9T0Q9_BPFD/1-49 AC Q9T0Q9.1
# #=GS Q9T0Q9_BPFD/1-49 DR PDB; 1nh4 A; 1-49;
# #=GS COATB_BPIF1/22-73 AC P03619.2
# #=GS COATB_BPIF1/22-73 DR PDB; 1ifk ; 1-50;
# COATB_BPIKE/30-81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA
# #=GR COATB_BPIKE/30-81 SS -HHHHHHHHHHHHHH--HHHHHHHH--HHHHHHHHHHHHHHHHHHHHH----
# Q9T0Q8_BPIKE/1-52 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA
# COATB_BPI22/32-83 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA
# COATB_BPM13/24-72 AEGDDP...AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
# #=GR COATB_BPM13/24-72 SS ---S-T...CHCHHHHCCCCTCCCTTCHHHHHHHHHHHHHHHHHHHHCTT--
# COATB_BPZJ2/1-49 AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA
# Q9T0Q9_BPFD/1-49 AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
# #=GR Q9T0Q9_BPFD/1-49 SS ------...-HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH--
# COATB_BPIF1/22-73 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA
# #=GR COATB_BPIF1/22-73 SS XX-HHHH--HHHHHH--HHHHHHH--HHHHHHHHHHHHHHHHHHHHHHH---
# #=GC SS_cons XHHHHHHHHHHHHHHHCHHHHHHHHCHHHHHHHHHHHHHHHHHHHHHHHC--
# #=GC seq_cons AEssss...AptAhDSLpspAT-hIu.sWshVsslVsAsluIKLFKKFsSKA
# //
#
# ```
#
# This is the seed alignment for the Phage\_Coat\_Gp8 (PF05371) PFAM
# entry, downloaded from a now out of date release of PFAM from
# . We can load this file as follows (assuming
# it has been saved to disk as “PF05371\_seed.sth” in the current working
# directory):
#
#
# In[1]:
from Bio import AlignIO
alignment = AlignIO.read("data/PF05371_seed.sth", "stockholm")
#
# This code will print out a summary of the alignment:
#
#
# In[2]:
print(alignment)
#
# You’ll notice in the above output the sequences have been truncated. We
# could instead write our own code to format this as we please by
# iterating over the rows as `SeqRecord` objects:
#
#
# In[3]:
from Bio import AlignIO
alignment = AlignIO.read("data/PF05371_seed.sth", "stockholm")
print("Alignment length %i" % alignment.get_alignment_length())
# In[4]:
for record in alignment:
print("%s - %s" % (record.seq, record.id))
#
# You could also use the alignment object’s `format` method to show it in
# a particular file format – see Section \[sec:alignment-format-method\]
# for details.
#
# Did you notice in the raw file above that several of the sequences
# include database cross-references to the PDB and the associated known
# secondary structure? Try this:
#
#
# In[5]:
for record in alignment:
if record.dbxrefs:
print("%s %s" % (record.id, record.dbxrefs))
#
# To have a look at all the sequence annotation, try this:
#
#
# In[6]:
for record in alignment:
print(record)
#
# Sanger provide a nice web interface at
# which will actually let
# you download this alignment in several other formats. This is what the
# file looks like in the FASTA file format:
#
#
# ```
# >COATB_BPIKE/30-81
# AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA
# >Q9T0Q8_BPIKE/1-52
# AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA
# >COATB_BPI22/32-83
# DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA
# >COATB_BPM13/24-72
# AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
# >COATB_BPZJ2/1-49
# AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA
# >Q9T0Q9_BPFD/1-49
# AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
# >COATB_BPIF1/22-73
# FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA
#
# ```
#
# Note the website should have an option about showing gaps as periods
# (dots) or dashes, we’ve shown dashes above. Assuming you download and
# save this as file “PF05371\_seed.faa” then you can load it with almost
# exactly the same code:
#
#
# ```
# from Bio import AlignIO
# alignment = AlignIO.read("PF05371_seed.faa", "fasta")
# print(alignment)
#
# ```
#
# All that has changed in this code is the filename and the format string.
# You’ll get the same output as before, the sequences and record
# identifiers are the same. However, as you should expect, if you check
# each `SeqRecord` there is no annotation nor database cross-references
# because these are not included in the FASTA file format.
#
# Note that rather than using the Sanger website, you could have used
# `Bio.AlignIO` to convert the original Stockholm format file into a FASTA
# file yourself (see below).
#
# With any supported file format, you can load an alignment in exactly the
# same way just by changing the format string. For example, use “phylip”
# for PHYLIP files, “nexus” for NEXUS files or “emboss” for the alignments
# output by the EMBOSS tools. There is a full listing on the wiki page
# () and in the built in documentation
# (also
# [online](http://biopython.org/DIST/docs/api/Bio.AlignIO-module.html)):
#
#
# In[7]:
from Bio import AlignIO
help(AlignIO)
#
# ### Multiple Alignments
#
# The previous section focused on reading files containing a single
# alignment. In general however, files can contain more than one
# alignment, and to read these files we must use the `Bio.AlignIO.parse()`
# function.
#
# Suppose you have a small alignment in PHYLIP format:
#
#
# ```
# 5 6
# Alpha AACAAC
# Beta AACCCC
# Gamma ACCAAC
# Delta CCACCA
# Epsilon CCAAAC
#
# ```
#
# If you wanted to bootstrap a phylogenetic tree using the PHYLIP tools,
# one of the steps would be to create a set of many resampled alignments
# using the tool `bootseq`. This would give output something like this,
# which has been abbreviated for conciseness:
#
#
# ```
# 5 6
# Alpha AAACCA
# Beta AAACCC
# Gamma ACCCCA
# Delta CCCAAC
# Epsilon CCCAAA
# 5 6
# Alpha AAACAA
# Beta AAACCC
# Gamma ACCCAA
# Delta CCCACC
# Epsilon CCCAAA
# 5 6
# Alpha AAAAAC
# Beta AAACCC
# Gamma AACAAC
# Delta CCCCCA
# Epsilon CCCAAC
# ...
# 5 6
# Alpha AAAACC
# Beta ACCCCC
# Gamma AAAACC
# Delta CCCCAA
# Epsilon CAAACC
#
# ```
#
# If you wanted to read this in using `Bio.AlignIO` you could use:
#
#
# In[8]:
from Bio import AlignIO
alignments = AlignIO.parse("data/resampled.phy", "phylip")
for alignment in alignments:
print(alignment)
print("")
#
# As with the function `Bio.SeqIO.parse()`, using `Bio.AlignIO.parse()`
# returns an iterator. If you want to keep all the alignments in memory at
# once, which will allow you to access them in any order, then turn the
# iterator into a list:
#
#
# In[10]:
from Bio import AlignIO
alignments = list(AlignIO.parse("data/resampled.phy", "phylip"))
last_align = alignments[-1]
first_align = alignments[0]
#
# ### Ambiguous Alignments {#sec:AlignIO-count-argument}
#
# Many alignment file formats can explicitly store more than one
# alignment, and the division between each alignment is clear. However,
# when a general sequence file format has been used there is no such block
# structure. The most common such situation is when alignments have been
# saved in the FASTA file format. For example consider the following:
#
#
# ```
# >Alpha
# ACTACGACTAGCTCAG--G
# >Beta
# ACTACCGCTAGCTCAGAAG
# >Gamma
# ACTACGGCTAGCACAGAAG
# >Alpha
# ACTACGACTAGCTCAGG--
# >Beta
# ACTACCGCTAGCTCAGAAG
# >Gamma
# ACTACGGCTAGCACAGAAG
#
# ```
#
# This could be a single alignment containing six sequences (with repeated
# identifiers). Or, judging from the identifiers, this is probably two
# different alignments each with three sequences, which happen to all have
# the same length.
#
# What about this next example?
#
#
# ```
# >Alpha
# ACTACGACTAGCTCAG--G
# >Beta
# ACTACCGCTAGCTCAGAAG
# >Alpha
# ACTACGACTAGCTCAGG--
# >Gamma
# ACTACGGCTAGCACAGAAG
# >Alpha
# ACTACGACTAGCTCAGG--
# >Delta
# ACTACGGCTAGCACAGAAG
#
# ```
#
# Again, this could be a single alignment with six sequences. However this
# time based on the identifiers we might guess this is three pairwise
# alignments which by chance have all got the same lengths.
#
# This final example is similar:
#
#
# ```
# >Alpha
# ACTACGACTAGCTCAG--G
# >XXX
# ACTACCGCTAGCTCAGAAG
# >Alpha
# ACTACGACTAGCTCAGG
# >YYY
# ACTACGGCAAGCACAGG
# >Alpha
# --ACTACGAC--TAGCTCAGG
# >ZZZ
# GGACTACGACAATAGCTCAGG
#
# ```
#
# In this third example, because of the differing lengths, this cannot be
# treated as a single alignment containing all six records. However, it
# could be three pairwise alignments.
#
# Clearly trying to store more than one alignment in a FASTA file is not
# ideal. However, if you are forced to deal with these as input files
# `Bio.AlignIO` can cope with the most common situation where all the
# alignments have the same number of records. One example of this is a
# collection of pairwise alignments, which can be produced by the EMBOSS
# tools `needle` and `water` – although in this situation, `Bio.AlignIO`
# should be able to understand their native output using “emboss” as the
# format string.
#
# To interpret these FASTA examples as several separate alignments, we can
# use `Bio.AlignIO.parse()` with the optional `seq_count` argument which
# specifies how many sequences are expected in each alignment (in these
# examples, 3, 2 and 2 respectively). For example, using the third example
# as the input data:
#
#
# In[11]:
for alignment in AlignIO.parse(handle, "fasta", seq_count=2):
print("Alignment length %i" % alignment.get_alignment_length())
for record in alignment:
print("%s - %s" % (record.seq, record.id))
print("")
#
# Using `Bio.AlignIO.read()` or `Bio.AlignIO.parse()` without the
# `seq_count` argument would give a single alignment containing all six
# records for the first two examples. For the third example, an exception
# would be raised because the lengths differ preventing them being turned
# into a single alignment.
#
# If the file format itself has a block structure allowing `Bio.AlignIO`
# to determine the number of sequences in each alignment directly, then
# the `seq_count` argument is not needed. If it is supplied, and doesn’t
# agree with the file contents, an error is raised.
#
# Note that this optional `seq_count` argument assumes each alignment in
# the file has the same number of sequences. Hypothetically you may come
# across stranger situations, for example a FASTA file containing several
# alignments each with a different number of sequences – although I would
# love to hear of a real world example of this. Assuming you cannot get
# the data in a nicer file format, there is no straight forward way to
# deal with this using `Bio.AlignIO`. In this case, you could consider
# reading in the sequences themselves using `Bio.SeqIO` and batching them
# together to create the alignments as appropriate.
#
# Writing Alignments
# ------------------
#
# We’ve talked about using `Bio.AlignIO.read()` and `Bio.AlignIO.parse()`
# for alignment input (reading files), and now we’ll look at
# `Bio.AlignIO.write()` which is for alignment output (writing files).
# This is a function taking three arguments: some `MultipleSeqAlignment`
# objects (or for backwards compatibility the obsolete `Alignment`
# objects), a handle or filename to write to, and a sequence format.
#
# Here is an example, where we start by creating a few
# `MultipleSeqAlignment` objects the hard way (by hand, rather than by
# loading them from a file). Note we create some `SeqRecord` objects to
# construct the alignment from.
#
#
# In[9]:
from Bio.Alphabet import generic_dna
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align import MultipleSeqAlignment
align1 = MultipleSeqAlignment([
SeqRecord(Seq("ACTGCTAGCTAG", generic_dna), id="Alpha"),
SeqRecord(Seq("ACT-CTAGCTAG", generic_dna), id="Beta"),
SeqRecord(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma"),
])
align2 = MultipleSeqAlignment([
SeqRecord(Seq("GTCAGC-AG", generic_dna), id="Delta"),
SeqRecord(Seq("GACAGCTAG", generic_dna), id="Epsilon"),
SeqRecord(Seq("GTCAGCTAG", generic_dna), id="Zeta"),
])
align3 = MultipleSeqAlignment([
SeqRecord(Seq("ACTAGTACAGCTG", generic_dna), id="Eta"),
SeqRecord(Seq("ACTAGTACAGCT-", generic_dna), id="Theta"),
SeqRecord(Seq("-CTACTACAGGTG", generic_dna), id="Iota"),
])
my_alignments = [align1, align2, align3]
#
# Now we have a list of `Alignment` objects, we’ll write them to a PHYLIP
# format file:
#
#
# In[10]:
from Bio import AlignIO
AlignIO.write(my_alignments, "my_example.phy", "phylip")
#
# And if you open this file in your favourite text editor it should look
# like this:
#
#
# ```
# 3 12
# Alpha ACTGCTAGCT AG
# Beta ACT-CTAGCT AG
# Gamma ACTGCTAGDT AG
# 3 9
# Delta GTCAGC-AG
# Epislon GACAGCTAG
# Zeta GTCAGCTAG
# 3 13
# Eta ACTAGTACAG CTG
# Theta ACTAGTACAG CT-
# Iota -CTACTACAG GTG
#
# ```
#
# Its more common to want to load an existing alignment, and save that,
# perhaps after some simple manipulation like removing certain rows or
# columns.
#
# Suppose you wanted to know how many alignments the `Bio.AlignIO.write()`
# function wrote to the handle? If your alignments were in a list like the
# example above, you could just use `len(my_alignments)`, however you
# can’t do that when your records come from a generator/iterator.
# Therefore the `Bio.AlignIO.write()` function returns the number of
# alignments written to the file.
#
# *Note* - If you tell the `Bio.AlignIO.write()` function to write to a
# file that already exists, the old file will be overwritten without any
# warning.
#
# ### Converting between sequence alignment file formats {#sec:converting-alignments}
#
# Converting between sequence alignment file formats with `Bio.AlignIO`
# works in the same way as converting between sequence file formats with
# `Bio.SeqIO` (Section \[sec:SeqIO-conversion\]). We load generally the
# alignment(s) using `Bio.AlignIO.parse()` and then save them using the
# `Bio.AlignIO.write()` – or just use the `Bio.AlignIO.convert()` helper
# function.
#
# For this example, we’ll load the PFAM/Stockholm format file used earlier
# and save it as a Clustal W format file:
#
#
# In[11]:
from Bio import AlignIO
count = AlignIO.convert("data/PF05371_seed.sth", "stockholm", "PF05371_seed.aln", "clustal")
print("Converted %i alignments" % count)
#
# Or, using `Bio.AlignIO.parse()` and `Bio.AlignIO.write()`:
#
#
# In[12]:
from Bio import AlignIO
alignments = AlignIO.parse("data/PF05371_seed.sth", "stockholm")
count = AlignIO.write(alignments, "PF05371_seed.aln", "clustal")
print("Converted %i alignments" % count)
#
# The `Bio.AlignIO.write()` function expects to be given multiple
# alignment objects. In the example above we gave it the alignment
# iterator returned by `Bio.AlignIO.parse()`.
#
# In this case, we know there is only one alignment in the file so we
# could have used `Bio.AlignIO.read()` instead, but notice we have to pass
# this alignment to `Bio.AlignIO.write()` as a single element list:
#
#
# In[13]:
from Bio import AlignIO
alignment = AlignIO.read("data/PF05371_seed.sth", "stockholm")
AlignIO.write([alignment], "PF05371_seed.aln", "clustal")
#
# Either way, you should end up with the same new Clustal W format file
# “PF05371\_seed.aln” with the following content:
#
#
# ```
# CLUSTAL X (1.81) multiple sequence alignment
#
#
# COATB_BPIKE/30-81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSS
# Q9T0Q8_BPIKE/1-52 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVS
# COATB_BPI22/32-83 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSS
# COATB_BPM13/24-72 AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
# COATB_BPZJ2/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFAS
# Q9T0Q9_BPFD/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
# COATB_BPIF1/22-73 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS
#
# COATB_BPIKE/30-81 KA
# Q9T0Q8_BPIKE/1-52 RA
# COATB_BPI22/32-83 KA
# COATB_BPM13/24-72 KA
# COATB_BPZJ2/1-49 KA
# Q9T0Q9_BPFD/1-49 KA
# COATB_BPIF1/22-73 RA
#
# ```
#
# Alternatively, you could make a PHYLIP format file which we’ll name
# “PF05371\_seed.phy”:
#
#
# In[14]:
from Bio import AlignIO
AlignIO.convert("data/PF05371_seed.sth", "stockholm", "PF05371_seed.phy", "phylip")
#
# This time the output looks like this:
#
#
# ```
# 7 52
# COATB_BPIK AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIRLFKKFSS
# Q9T0Q8_BPI AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIKLFKKFVS
# COATB_BPI2 DGTSTATSYA TEAMNSLKTQ ATDLIDQTWP VVTSVAVAGL AIRLFKKFSS
# COATB_BPM1 AEGDDP---A KAAFNSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
# COATB_BPZJ AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFAS
# Q9T0Q9_BPF AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
# COATB_BPIF FAADDATSQA KAAFDSLTAQ ATEMSGYAWA LVVLVVGATV GIKLFKKFVS
#
# KA
# RA
# KA
# KA
# KA
# KA
# RA
#
# ```
#
# One of the big handicaps of the original PHYLIP alignment file format is
# that the sequence identifiers are strictly truncated at ten characters.
# In this example, as you can see the resulting names are still unique -
# but they are not very readable. As a result, a more relaxed variant of
# the original PHYLIP format is now quite widely used:
#
#
# In[15]:
from Bio import AlignIO
AlignIO.convert("data/PF05371_seed.sth", "stockholm", "PF05371_seed.phy", "phylip-relaxed")
#
# This time the output looks like this, using a longer indentation to
# allow all the identifers to be given in full::
#
#
# ```
# 7 52
# COATB_BPIKE/30-81 AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIRLFKKFSS
# Q9T0Q8_BPIKE/1-52 AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIKLFKKFVS
# COATB_BPI22/32-83 DGTSTATSYA TEAMNSLKTQ ATDLIDQTWP VVTSVAVAGL AIRLFKKFSS
# COATB_BPM13/24-72 AEGDDP---A KAAFNSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
# COATB_BPZJ2/1-49 AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFAS
# Q9T0Q9_BPFD/1-49 AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
# COATB_BPIF1/22-73 FAADDATSQA KAAFDSLTAQ ATEMSGYAWA LVVLVVGATV GIKLFKKFVS
#
# KA
# RA
# KA
# KA
# KA
# KA
# RA
#
# ```
#
# If you have to work with the original strict PHYLIP format, then you may
# need to compress the identifers somehow – or assign your own names or
# numbering system. This following bit of code manipulates the record
# identifiers before saving the output:
#
#
# In[16]:
from Bio import AlignIO
alignment = AlignIO.read("data/PF05371_seed.sth", "stockholm")
name_mapping = {}
for i, record in enumerate(alignment):
name_mapping[i] = record.id
record.id = "seq%i" % i
print(name_mapping)
AlignIO.write([alignment], "PF05371_seed.phy", "phylip")
#
# Here is the new (strict) PHYLIP format output:
#
#
# ```
# 7 52
# seq0 AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIRLFKKFSS
# seq1 AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIKLFKKFVS
# seq2 DGTSTATSYA TEAMNSLKTQ ATDLIDQTWP VVTSVAVAGL AIRLFKKFSS
# seq3 AEGDDP---A KAAFNSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
# seq4 AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFAS
# seq5 AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
# seq6 FAADDATSQA KAAFDSLTAQ ATEMSGYAWA LVVLVVGATV GIKLFKKFVS
#
# KA
# RA
# KA
# KA
# KA
# KA
# RA
#
# ```
#
# In general, because of the identifier limitation, working with *strict*
# PHYLIP file formats shouldn’t be your first choice. Using the
# PFAM/Stockholm format on the other hand allows you to record a lot of
# additional annotation too.
#
# ### Getting your alignment objects as formatted strings {#sec:alignment-format-method}
#
# The `Bio.AlignIO` interface is based on handles, which means if you want
# to get your alignment(s) into a string in a particular file format you
# need to do a little bit more work (see below). However, you will
# probably prefer to take advantage of the alignment object’s `format()`
# method. This takes a single mandatory argument, a lower case string
# which is supported by `Bio.AlignIO` as an output format. For example:
#
#
# ```
# from Bio import AlignIO
# alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
# print(alignment.format("clustal"))
#
# ```
#
# As described in Section \[sec:SeqRecord-format\], the `SeqRecord` object
# has a similar method using output formats supported by `Bio.SeqIO`.
#
# Internally the `format()` method is using the `StringIO` string based
# handle and calling `Bio.AlignIO.write()`. You can do this in your own
# code if for example you are using an older version of Biopython:
#
#
# ```
# from Bio import AlignIO
# from StringIO import StringIO
#
# alignments = AlignIO.parse("PF05371_seed.sth", "stockholm")
#
# out_handle = StringIO()
# AlignIO.write(alignments, out_handle, "clustal")
# clustal_data = out_handle.getvalue()
#
# print(clustal_data)
#
# ```
#
# Manipulating Alignments {#sec:manipulating-alignments}
# -----------------------
#
# Now that we’ve covered loading and saving alignments, we’ll look at what
# else you can do with them.
#
# ### Slicing alignments
#
# First of all, in some senses the alignment objects act like a Python
# `list` of `SeqRecord` objects (the rows). With this model in mind
# hopefully the actions of `len()` (the number of rows) and iteration
# (each row as a `SeqRecord`) make sense:
#
#
# In[17]:
from Bio import AlignIO
alignment = AlignIO.read("data/PF05371_seed.sth", "stockholm")
print("Number of rows: %i" % len(alignment))
# In[18]:
for record in alignment:
print("%s - %s" % (record.seq, record.id))
#
# You can also use the list-like `append` and `extend` methods to add more
# rows to the alignment (as `SeqRecord` objects). Keeping the list
# metaphor in mind, simple slicing of the alignment should also make sense
# - it selects some of the rows giving back another alignment object:
#
#
# In[19]:
print(alignment)
# In[20]:
print(alignment[3:7])
#
# What if you wanted to select by column? Those of you who have used the
# NumPy matrix or array objects won’t be surprised at this - you use a
# double index.
#
#
# In[21]:
print(alignment[2, 6])
#
# Using two integer indices pulls out a single letter, short hand for
# this:
#
#
# In[22]:
print(alignment[2].seq[6])
#
# You can pull out a single column as a string like this:
#
#
# In[23]:
print(alignment[:, 6])
#
# You can also select a range of columns. For example, to pick out those
# same three rows we extracted earlier, but take just their first six
# columns:
#
#
# In[24]:
print(alignment[3:6, :6])
#
# Leaving the first index as `:` means take all the rows:
#
#
# In[25]:
print(alignment[:, :6])
#
# This brings us to a neat way to remove a section. Notice columns 7, 8
# and 9 which are gaps in three of the seven sequences:
#
#
# In[26]:
print(alignment[:, 6:9])
#
# Again, you can slice to get everything after the ninth column:
#
#
# In[27]:
print(alignment[:, 9:])
#
# Now, the interesting thing is that addition of alignment objects works
# by column. This lets you do this as a way to remove a block of columns:
#
#
# In[28]:
edited = alignment[:, :6] + alignment[:, 9:]
print(edited)
#
# Another common use of alignment addition would be to combine alignments
# for several different genes into a meta-alignment. Watch out though -
# the identifiers need to match up (see Section \[sec:SeqRecord-addition\]
# for how adding `SeqRecord` objects works). You may find it helpful to
# first sort the alignment rows alphabetically by id:
#
#
# In[29]:
edited.sort()
print(edited)
#
# Note that you can only add two alignments together if they have the same
# number of rows.
#
# ### Alignments as arrays
#
# Depending on what you are doing, it can be more useful to turn the
# alignment object into an array of letters – and you can do this with
# NumPy:
#
#
# In[30]:
import numpy as np
from Bio import AlignIO
alignment = AlignIO.read("data/PF05371_seed.sth", "stockholm")
align_array = np.array([list(rec) for rec in alignment], np.character)
print("Array shape %i by %i" % align_array.shape)
#
# If you will be working heavily with the columns, you can tell NumPy to
# store the array by column (as in Fortran) rather then its default of by
# row (as in C):
#
#
# In[31]:
align_array = np.array([list(rec) for rec in alignment], np.character, order="F")
#
# Note that this leaves the original Biopython alignment object and the
# NumPy array in memory as separate objects - editing one will not update
# the other!
#
# Alignment Tools {#sec:alignment-tools}
# ---------------
#
# There are *lots* of algorithms out there for aligning sequences, both
# pairwise alignments and multiple sequence alignments. These calculations
# are relatively slow, and you generally wouldn’t want to write such an
# algorithm in Python. Instead, you can use Biopython to invoke a command
# line tool on your behalf. Normally you would:
#
# 1. Prepare an input file of your unaligned sequences, typically this
# will be a FASTA file which you might create using `Bio.SeqIO`
# (see Chapter \[chapter:Bio.SeqIO\]).
#
# 2. Call the command line tool to process this input file, typically via
# one of Biopython’s command line wrappers (which we’ll discuss here).
#
# 3. Read the output from the tool, i.e. your aligned sequences,
# typically using `Bio.AlignIO` (see earlier in this chapter).
#
# All the command line wrappers we’re going to talk about in this chapter
# follow the same style. You create a command line object specifying the
# options (e.g. the input filename and the output filename), then invoke
# this command line via a Python operating system call (e.g. using the
# `subprocess` module).
#
# Most of these wrappers are defined in the `Bio.Align.Applications`
# module:
#
#
# In[32]:
import Bio.Align.Applications
dir(Bio.Align.Applications)
#
# (Ignore the entries starting with an underscore – these have special
# meaning in Python.) The module `Bio.Emboss.Applications` has wrappers
# for some of the [EMBOSS suite](http://emboss.sourceforge.net/),
# including `needle` and `water`, which are described below in
# Section \[seq:emboss-needle-water\], and wrappers for the EMBOSS
# packaged versions of the PHYLIP tools (which EMBOSS refer to as one of
# their EMBASSY packages - third party tools with an EMBOSS style
# interface). We won’t explore all these alignment tools here in the
# section, just a sample, but the same principles apply.
#
# ### ClustalW {#sec:align_clustal}
#
# ClustalW is a popular command line tool for multiple sequence alignment
# (there is also a graphical interface called ClustalX). Biopython’s
# `Bio.Align.Applications` module has a wrapper for this alignment tool
# (and several others).
#
# Before trying to use ClustalW from within Python, you should first try
# running the ClustalW tool yourself by hand at the command line, to
# familiarise yourself the other options. You’ll find the Biopython
# wrapper is very faithful to the actual command line API:
#
#
# In[33]:
from Bio.Align.Applications import ClustalwCommandline
help(ClustalwCommandline)
#
# For the most basic usage, all you need is to have a FASTA input file,
# such as
# [opuntia.fasta](http://biopython.org/DIST/docs/tutorial/examples/opuntia.fasta)
# (available online or in the Doc/examples subdirectory of the Biopython
# source code). This is a small FASTA file containing seven prickly-pear
# DNA sequences (from the cactus family *Opuntia*).
#
# By default ClustalW will generate an alignment and guide tree file with
# names based on the input FASTA file, in this case `opuntia.aln` and
# `opuntia.dnd`, but you can override this or make it explicit:
#
#
# In[34]:
from Bio.Align.Applications import ClustalwCommandline
cline = ClustalwCommandline("clustalw2", infile="data/opuntia.fasta")
print(cline)
#
# Notice here we have given the executable name as `clustalw2`, indicating
# we have version two installed, which has a different filename to version
# one (`clustalw`, the default). Fortunately both versions support the
# same set of arguments at the command line (and indeed, should be
# functionally identical).
#
# You may find that even though you have ClustalW installed, the above
# command doesn’t work – you may get a message about “command not found”
# (especially on Windows). This indicated that the ClustalW executable is
# not on your PATH (an environment variable, a list of directories to be
# searched). You can either update your PATH setting to include the
# location of your copy of ClustalW tools (how you do this will depend on
# your OS), or simply type in the full path of the tool. For example:
#
#
# In[35]:
import os
from Bio.Align.Applications import ClustalwCommandline
clustalw_exe = r"C:\Program Files\new clustal\clustalw2.exe"
clustalw_cline = ClustalwCommandline(clustalw_exe, infile="data/opuntia.fasta")
#
#
# In[37]:
assert os.path.isfile(clustalw_exe), "Clustal W executable missing"
stdout, stderr = clustalw_cline()
#
# Remember, in Python strings `\n` and `\t` are by default interpreted as
# a new line and a tab – which is why we’re put a letter “r” at the start
# for a raw string that isn’t translated in this way. This is generally
# good practice when specifying a Windows style file name.
#
# Internally this uses the `subprocess` module which is now the
# recommended way to run another program in Python. This replaces older
# options like the `os.system()` and the `os.popen*` functions.
#
# Now, at this point it helps to know about how command line tools “work”.
# When you run a tool at the command line, it will often print text output
# directly to screen. This text can be captured or redirected, via two
# “pipes”, called standard output (the normal results) and standard error
# (for error messages and debug messages). There is also standard input,
# which is any text fed into the tool. These names get shortened to stdin,
# stdout and stderr. When the tool finishes, it has a return code (an
# integer), which by convention is zero for success.
#
# When you run the command line tool like this via the Biopython wrapper,
# it will wait for it to finish, and check the return code. If this is non
# zero (indicating an error), an exception is raised. The wrapper then
# returns two strings, stdout and stderr.
#
# In the case of ClustalW, when run at the command line all the important
# output is written directly to the output files. Everything normally
# printed to screen while you wait (via stdout or stderr) is boring and
# can be ignored (assuming it worked).
#
# What we care about are the two output files, the alignment and the guide
# tree. We didn’t tell ClustalW what filenames to use, but it defaults to
# picking names based on the input file. In this case the output should be
# in the file `opuntia.aln`. You should be able to work out how to read in
# the alignment using `Bio.AlignIO` by now:
#
#
# In[38]:
from Bio import AlignIO
align = AlignIO.read("data/opuntia.aln", "clustal")
print(align)
#
# In case you are interested (and this is an aside from the main thrust of
# this chapter), the `opuntia.dnd` file ClustalW creates is just a
# standard Newick tree file, and `Bio.Phylo` can parse these:
#
#
# In[39]:
from Bio import Phylo
tree = Phylo.read("data/opuntia.dnd", "newick")
Phylo.draw_ascii(tree)
#
# Chapter \[sec:Phylo\] covers Biopython’s support for phylogenetic trees
# in more depth.
#
# ### MUSCLE
#
# MUSCLE is a more recent multiple sequence alignment tool than ClustalW,
# and Biopython also has a wrapper for it under the
# `Bio.Align.Applications` module. As before, we recommend you try using
# MUSCLE from the command line before trying it from within Python, as the
# Biopython wrapper is very faithful to the actual command line API:
#
#
# In[40]:
from Bio.Align.Applications import MuscleCommandline
help(MuscleCommandline)
#
# For the most basic usage, all you need is to have a FASTA input file,
# such as
# [opuntia.fasta](http://biopython.org/DIST/docs/tutorial/examples/opuntia.fasta)
# (available online or in the Doc/examples subdirectory of the Biopython
# source code). You can then tell MUSCLE to read in this FASTA file, and
# write the alignment to an output file:
#
#
# In[41]:
from Bio.Align.Applications import MuscleCommandline
cline = MuscleCommandline(input="data/opuntia.fasta", out="opuntia.txt")
print(cline)
#
# Note that MUSCLE uses “-in” and “-out” but in Biopython we have to use
# “input” and “out” as the keyword arguments or property names. This is
# because “in” is a reserved word in Python.
#
# By default MUSCLE will output the alignment as a FASTA file (using
# gapped sequences). The `Bio.AlignIO` module should be able to read this
# alignment using `format=fasta`. You can also ask for ClustalW-like
# output:
#
#
# In[42]:
from Bio.Align.Applications import MuscleCommandline
cline = MuscleCommandline(input="data/opuntia.fasta", out="opuntia.aln", clw=True)
print(cline)
#
# Or, strict ClustalW output where the original ClustalW header line is
# used for maximum compatibility:
#
#
# In[43]:
from Bio.Align.Applications import MuscleCommandline
cline = MuscleCommandline(input="data/opuntia.fasta", out="opuntia.aln", clwstrict=True)
print(cline)
#
# The `Bio.AlignIO` module should be able to read these alignments using
# `format=clustal`.
#
# MUSCLE can also output in GCG MSF format (using the `msf` argument), but
# Biopython can’t currently parse that, or using HTML which would give a
# human readable web page (not suitable for parsing).
#
# You can also set the other optional parameters, for example the maximum
# number of iterations. See the built in help for details.
#
# You would then run MUSCLE command line string as described above for
# ClustalW, and parse the output using `Bio.AlignIO` to get an alignment
# object.
#
# ### MUSCLE using stdout
#
# Using a MUSCLE command line as in the examples above will write the
# alignment to a file. This means there will be no important information
# written to the standard out (stdout) or standard error (stderr) handles.
# However, by default MUSCLE will write the alignment to standard output
# (stdout). We can take advantage of this to avoid having a temporary
# output file! For example:
#
#
# In[44]:
from Bio.Align.Applications import MuscleCommandline
muscle_cline = MuscleCommandline(input="data/opuntia.fasta")
print(muscle_cline)
#
# If we run this via the wrapper, we get back the output as a string. In
# order to parse this we can use `StringIO` to turn it into a handle.
# Remember that MUSCLE defaults to using FASTA as the output format:
#
#
# In[45]:
from Bio.Align.Applications import MuscleCommandline
muscle_cline = MuscleCommandline(input="data/opuntia.fasta")
stdout, stderr = muscle_cline()
from io import StringIO
from Bio import AlignIO
align = AlignIO.read(StringIO(stdout), "fasta")
print(align)
#
# The above approach is fairly simple, but if you are dealing with very
# large output text the fact that all of stdout and stderr is loaded into
# memory as a string can be a potential drawback. Using the `subprocess`
# module we can work directly with handles instead:
#
#
# In[49]:
import subprocess
import sys
from Bio.Align.Applications import MuscleCommandline
muscle_cline = MuscleCommandline(input="data/opuntia.fasta")
child = subprocess.Popen(str(muscle_cline),
stdout=subprocess.PIPE, stderr=subprocess.PIPE,
shell=(sys.platform != "win32"),
universal_newlines=True)
from Bio import AlignIO
align = AlignIO.read(child.stdout, "fasta")
print(align)
#
# ### MUSCLE using stdin and stdout
#
# We don’t actually *need* to have our FASTA input sequences prepared in a
# file, because by default MUSCLE will read in the input sequence from
# standard input! Note this is a bit more advanced and fiddly, so don’t
# bother with this technique unless you need to.
#
# First, we’ll need some unaligned sequences in memory as `SeqRecord`
# objects. For this demonstration I’m going to use a filtered version of
# the original FASTA file (using a generator expression), taking just six
# of the seven sequences:
#
#
# In[73]:
from Bio import SeqIO
records = (r for r in SeqIO.parse("data/opuntia.fasta", "fasta") if len(r) < 900)
#
# Then we create the MUSCLE command line, leaving the input and output to
# their defaults (stdin and stdout). I’m also going to ask for strict
# ClustalW format as for the output.
#
#
# In[90]:
from Bio.Align.Applications import MuscleCommandline
muscle_cline = MuscleCommandline(clwstrict=True)
print(muscle_cline)
#
# Now for the fiddly bits using the `subprocess` module, stdin and stdout:
#
#
# In[91]:
import subprocess
import sys
child = subprocess.Popen(str(cline),
stdin=subprocess.PIPE, stdout=subprocess.PIPE,
stderr=subprocess.PIPE, universal_newlines=True,
shell=(sys.platform != "win32"))
#
# That should start MUSCLE, but it will be sitting waiting for its FASTA
# input sequences, which we must supply via its stdin handle:
#
#
# In[92]:
SeqIO.write(records, child.stdin, "fasta")
# In[93]:
child.stdin.close()
#
# After writing the six sequences to the handle, MUSCLE will still be
# waiting to see if that is all the FASTA sequences or not – so we must
# signal that this is all the input data by closing the handle. At that
# point MUSCLE should start to run, and we can ask for the output:
#
#
# In[94]:
from Bio import AlignIO
align = AlignIO.read(child.stdout, "clustal")
print(align)
#
# Wow! There we are with a new alignment of just the six records, without
# having created a temporary FASTA input file, or a temporary alignment
# output file. However, a word of caution: Dealing with errors with this
# style of calling external programs is much more complicated. It also
# becomes far harder to diagnose problems, because you can’t try running
# MUSCLE manually outside of Biopython (because you don’t have the input
# file to supply). There can also be subtle cross platform issues (e.g.
# Windows versus Linux, Python 2 versus Python 3), and how you run your
# script can have an impact (e.g. at the command line, from IDLE or an
# IDE, or as a GUI script). These are all generic Python issues though,
# and not specific to Biopython.
#
# If you find working directly with `subprocess` like this scary, there is
# an alternative. If you execute the tool with `muscle_cline()` you can
# supply any standard input as a big string, `muscle_cline(stdin=...)`.
# So, provided your data isn’t very big, you can prepare the FASTA input
# in memory as a string using `StringIO` (see
# Section \[sec:appendix-handles\]):
#
#
# In[95]:
from Bio import SeqIO
records = (r for r in SeqIO.parse("data/opuntia.fasta", "fasta") if len(r) < 900)
handle = StringIO()
SeqIO.write(records, handle, "fasta")
# In[96]:
data = handle.getvalue()
#
# You can then run the tool and parse the alignment as follows:
#
#
# In[97]:
stdout, stderr = muscle_cline(stdin=data)
from Bio import AlignIO
align = AlignIO.read(StringIO(stdout), "clustal")
print(align)
#
# You might find this easier, but it does require more memory (RAM) for
# the strings used for the input FASTA and output Clustal formatted data.
#
# ### EMBOSS needle and water {#seq:emboss-needle-water}
#
# The [EMBOSS](http://emboss.sourceforge.net/) suite includes the `water`
# and `needle` tools for Smith-Waterman algorithm local alignment, and
# Needleman-Wunsch global alignment. The tools share the same style
# interface, so switching between the two is trivial – we’ll just use
# `needle` here.
#
# Suppose you want to do a global pairwise alignment between two
# sequences, prepared in FASTA format as follows:
#
#
# ```
# >HBA_HUMAN
# MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
# KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP
# AVHASLDKFLASVSTVLTSKYR
#
# ```
#
# in a file `alpha.fasta`, and secondly in a file `beta.fasta`:
#
#
# ```
# >HBB_HUMAN
# MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
# VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
# KEFTPPVQAAYQKVVAGVANALAHKYH
#
# ```
#
# Let’s start by creating a complete `needle` command line object in one
# go:
#
#
# In[98]:
from Bio.Emboss.Applications import NeedleCommandline
needle_cline = NeedleCommandline(asequence="data/alpha.faa", bsequence="data/beta.faa",
gapopen=10, gapextend=0.5, outfile="needle.txt")
print(needle_cline)
#
# Why not try running this by hand at the command prompt? You should see
# it does a pairwise comparison and records the output in the file
# `needle.txt` (in the default EMBOSS alignment file format).
#
# Even if you have EMBOSS installed, running this command may not work –
# you might get a message about “command not found” (especially on
# Windows). This probably means that the EMBOSS tools are not on your PATH
# environment variable. You can either update your PATH setting, or simply
# tell Biopython the full path to the tool, for example:
#
#
# In[99]:
from Bio.Emboss.Applications import NeedleCommandline
needle_cline = NeedleCommandline(r"C:\EMBOSS\needle.exe",
asequence="data/alpha.faa", bsequence="data/beta.faa",
gapopen=10, gapextend=0.5, outfile="needle.txt")
#
# Remember in Python that for a default string `\n` or `\t` means a new
# line or a tab – which is why we’re put a letter “r” at the start for a
# raw string.
#
# At this point it might help to try running the EMBOSS tools yourself by
# hand at the command line, to familiarise yourself the other options and
# compare them to the Biopython help text:
#
#
# In[100]:
from Bio.Emboss.Applications import NeedleCommandline
help(NeedleCommandline)
#
# Note that you can also specify (or change or look at) the settings like
# this:
#
#
# In[101]:
from Bio.Emboss.Applications import NeedleCommandline
needle_cline = NeedleCommandline()
needle_cline.asequence="data/alpha.faa"
needle_cline.bsequence="data/beta.faa"
needle_cline.gapopen=10
needle_cline.gapextend=0.5
needle_cline.outfile="needle.txt"
print(needle_cline)
# In[102]:
print(needle_cline.outfile)
#
# Next we want to use Python to run this command for us. As explained
# above, for full control, we recommend you use the built in Python
# `subprocess` module, but for simple usage the wrapper object usually
# suffices:
#
#
# In[103]:
stdout, stderr = needle_cline()
print(stdout + stderr)
#
# Next we can load the output file with `Bio.AlignIO` as discussed earlier
# in this chapter, as the `emboss` format:
#
#
# In[104]:
from Bio import AlignIO
align = AlignIO.read("needle.txt", "emboss")
print(align)
# In[ ]: