Introduction to Bioinformatics
A masters course by Blaž Zupan and Tomaž Curk
University of Ljubljana (C) 2016-2018

Disclaimer: this work is a first draft of our working notes. The images were obtained from various web sites, but mainly from the wikipedia entries with explanations of different biological entities. Material is intended for our bioinformatics class and is not meant for distribution.

Lecture Notes Part 5¶

Sequence Alignment¶

Molecular Roots of Evolution¶

We, living creatures on earth, evolved. In fact, we are still evolving. The heritable characteristics of our population change over successive generations. When cells divide, the DNA of a cell is replicated: two identical replicates of the DNA are made from one original DNA molecule. Well, not totally identical. When replicating human DNA, for instance, there are at least a few errors left after the replications. Like, on average, there are about seven nucleotides that do not match those from the original DNA. The cell includes machinery to detect and repair such mistakes, but even that machinery sometimes overlooks the mistakes and this could be passed to the organisms of the next generation. These replication mistakes are called mutations. Mutations can be harmless (changes in "junk" part of the DNA) or harmful (the organism with a mutated DNA does not survive), but some could be beneficial and make the organism better fit than the old generation. Organisms with better fitness would proliferate throughout the population. This process, called evolution, happens all the time, for all organisms on Earth.

Our description of molecular background of evolution is very simplistic. But will suffice for now. In the previous lecture we learned about genes, units of heredity and segments of the DNA that code for proteins. Particularly interesting are mutations that change the coding sequence of the gene, as the resulting protein may have a different sequence of amino acids. It is "may", as different sequence of nucleotides may result in the same amino acid chain (think about the number of different codons for each amino acid and redundancy in the DNA-protein coding tables).

Two different species of organisms with a common ancestor may therefore have some differences in the genome sequence, and therefore may have (slightly) different genes that encode similar proteins. If these genes are coming from the same ancestor gene, two gene variants are called orthologous genes. The difference between genes arises due to nucleotide substitution, insertion of the sequence or deletion of the DNA sequence. The later two types of changes in the DNA are called indels. Another kind of mistake in the DNA replication appears where a part of the DNA is duplicated, that is, mistakenly appears several times in the new genome. As there could be now several genes encoding the same protein, all but one of these genes can be changed with no harm. Multiple copies of the same genes that are then, within a single species, changed accross the evolution, give raise to gene families, a set of similar genes that are all stem from the same predecessor. Such genes are called parlogous genes. Orthologs and paralogs are jointly referred as homologous genes.

There is a great value of finding homologous genes within the same species or accross the species. To find them, we need to define the distance between sequences and means to express these distances numerically. Sequences of DNA that we would like to compare need first to be aligned to expose the similarities and differences. This procedure is referred to as sequence alignment. Sequence alignment and scoring of similarity of sequences give rise to many different procedures of bioinformatics, and help us in tasks such as:

gene function prediction: knowing the function of a gene in one species, and finding a similar gene in the other species, we can now make an inference,
gene finding: knowing genome structure in one organism can help us annotate the gene of another organism
phylogeny: based on comparison of gene sequences of organisms we can infer which organisms had common predecessors, and we reconstruct the tree of life,
sequence assembly: knowing the sequence of one genome we can assemble the sequence of another gene from small fragments of the DNA that have been sequence using modern high-throughput sequencing methods.

So What Is Sequence Alignment?¶

Consider two sequences $s$ and $t$:

In [1]:

s, t = ("ATACGTA", "TATGATA")

and consider their possible alignment:

In [2]:

A1 = ("ATACGTA-", "-TATGATA")

Let us pretty-print the alignment:

In [3]:

def pp_alignment(s, t):
    print(s)
    print("".join("|" if si==ti else " " for si, ti in zip(s, t)))
    print(t)

In [4]:

pp_alignment(*A1)

ATACGTA-
 || |   
-TATGATA

The aligned sequences have the same length. We actually talk about the length of alignment. In our case, $|A_1(s,t)|=c=7$. The symbol "-" denotes an indel, a skip in the sequence due to deleted nucleotide in one sequence or added nucleotide in the other one.

Consider a different alignment:

In [5]:

A2 = ("ATACG-TA", "-TATGATA")
pp_alignment(*A2)

ATACG-TA
 || | ||
-TATGATA

and another one

In [6]:

A3 = ("-AT-ACGTA", "TATGA--TA")
pp_alignment(*A3)

-AT-ACGTA
 || |  ||
TATGA--TA

Which alignment is better? We need some means to quantitatively assess the alignments and compute the alignment score.

A Scoring Function¶

Let $x$ and $y$ be aligned sequences, that is, $x$ a version of $s$ after the alignment, and $y$ a version of $t$ after the alignment. Let $x_i$ be an $i$-th symbol of aligned sequence $x$, and $y_i$ be an $i$-th symbol of aligned sequence $y$. We define a scoring function for position $i$ to be $\sigma(x_i,y_i)$. The alignment score is then defined as:

$$M(A(s,t))=\sum_{i=1}^c \sigma(x_i, y_i)$$

All we need now is a concrete scoring function $\sigma$. Let us define one, and score our alignments. We will first use a very simple scoring function, which returns -1 if the symbols are different, and 1 if the symbols at some position of the alignment are the same.

In [7]:

def simple_sigma(a, b):
    return -1 if a != b else 1

In [8]:

def M(A, sigma):
    return sum(sigma(xi, yi) for xi, yi in zip(*A))

In [10]:

alignments = [A1, A2, A3]
for A in alignments:
    pp_alignment(*A)
    print("Score: %d\n" % M(A, simple_sigma))

ATACGTA-
 || |   
-TATGATA
Score: -2

ATACG-TA
 || | ||
-TATGATA
Score: 2

-AT-ACGTA
 || |  ||
TATGA--TA
Score: 1

The best alignment, according to out scoring function simple_sigma is alignment $A_2$. What if change the scoring function? Say, assuming that indels really hurt, because they can change the whole reading frame, we should punish them more than we punish a mismatch of the nucleotides. Let us try is out:

In [11]:

def punished_indels(a, b):
    return -2 if (a == "-" or b == "-") else (-1 if a != b else 2)

First, a check if this works.

In [12]:

punished_indels("A", "C"), punished_indels("-", "T"), punished_indels("G", "G")

Out[12]:

(-1, -2, 2)

Fine, looks like it works. Now we use the scoring function punished_indels to score the alignments:

In [13]:

for A in alignments:
    pp_alignment(*A)
    print("Score: %d\n" % M(A, punished_indels))

ATACGTA-
 || |   
-TATGATA
Score: -1

ATACG-TA
 || | ||
-TATGATA
Score: 5

-AT-ACGTA
 || |  ||
TATGA--TA
Score: 2

Again, alignment $A_2$ looks best, but now the margin is bigger.

Where Are We Now?¶

We can score the alignments. Given the scoring functions. The scoring functions can be more complicated, but for now, the ones we have define will do fine. One problem though. How do we get an alignment? We would actually like to find an alignment $A^*$ that, given a scoring function, maximizes the alignment score $M$, such that there is no other alignment $A$ where $M(A^*)<M(A)$.

One think that we could do is check out all possible alignments. There are quite many. Say, for alignment sequence of length $2n$ there are $n$ possible insertion sites. The number of possible alignments is thus

$$\binom{2n}{n}={(2n)! \over (n!)^2} \approx {2^{2n} \over \sqrt{\Pi n}}$$

We used the Stirling's approximation to factorial. Let's write this in code and see what comes out:

In [14]:

from math import sqrt, pi
def foo(n):
    return round(2**(2*n) / sqrt(pi * n))

In [15]:

foo(10), foo(20)

Out[15]:

(187079, 138710677319)

Even for a short sequences of only 20 nucleotides, there are too many possible alignments. No way to use brute force to find optimal alginment.

Representation Of A Space of All Possible Alignments¶

Let us start with with two very short sequences:

In [16]:

s, t = "ATGA", "TCA"

We now construct a table with symbols from s in columns and symbols from t in rows. We also add an indel symbol "-" at the first column and row of the table.

DTP

We claim that any walk from upper left corner to lower right corner of the table, where we can move just one place right, down or diagonally, represents an alignment. Horizontal or vertical move inserts the indel in one of the sequences and consumes the symbol of the other, while a diagonal move consumes a symbol from both sequences.

This looks complicated, but let us consider an example. We will use the walk that is marked with solid squares. In the first horizontal move we consume "T" from $t$ and introduce an indel in $s$. The next move is vertical: we consume a symbol "A" from $s$ but nothing from $t$, essentially introducing an indel to aligned sequence $t$. At this stage, our alignment is (-A, T-). Next move is diagonal, we consume T from $s$ and C from $t$. Next is a horizontal move, consuming A from $t$ and introducing indel to $s$. At this point, the alignment is (-AT-, T-CA). The final two moves are vertical, consuming the remaining two letters from $s$ and introducing two indels to $t$. Our final alignment is therefore

In [17]:

A1 = "-AT-GA", "T-CA--"
pp_alignment(*A1)
print("Score: %d\n" % M(A1, punished_indels))

-AT-GA
      
T-CA--
Score: -11

Hm, probably not the best alignment, but still a valid one. The path in our table that starts with the same two moves as before, but then takes the dotted path, represents the following alignment:

In [18]:

A2 = "-ATGA--", "T----CA"
pp_alignment(*A2)
print("Score: %d\n" % M(A2, punished_indels))

-ATGA--
       
T----CA
Score: -14

According to our scoring function this is an even worse alignment. But again, it is a valid one. And then again we are just trying to convince you, the reader, the walks in the table we have constructed represent all possible alignments.

There is also something else we would like to point out. Both our alignments, A1 and A2, start with the same two moves in the table, a horizontal one and a vertical one, reaching a shadowed square. The two alignment then branch from the shadowed square. Our aligned sequence at that point is (-A, T-), and the score of this partial alignment is:

In [19]:

A_partial = "-A", "T-"
M(A_partial, punished_indels)

Out[19]:

-4

Notice that this score does not change with the rest of the alignment. That is, whatever follows, the score for this partial alignment with which we started with stays the same. That is simply a result from the fact that our scoring function refers only to a pair of symbols at a certain position, and does not take into consideration the aligned sequence before (or after) this position.

To compute the total alignment score, we can thus break the aligned sequence to subsequences, compute the alignment score for aligned subsequences and sum this up. For alignment A1, we thus can compute the score until the shaded square, and the score for the remaining part of the alignment:

In [20]:

M(("-A", "T-"), punished_indels) + M(("T-GA", "CA--"), punished_indels)

Out[20]:

-11

We can do the same for A2. Again, we have obtained the same result as before, where we have considered the complete aligned sequence:

In [21]:

M(("-A", "T-"), punished_indels) + M(("TGA--", "---CA"), punished_indels)

Out[21]:

-14

Needleman-Wunsch Algorithm for Optimal Global Alignment¶

Which means, if know the alignment score at the point of the shaded square, this can be used to compute the score for all alignments that start with this particular alignemnt sequence, and we just need to add the score that we get for the alignemnt after the shaded square.

Great. But that does not tell us anything about the optimal alignment. Yet, if we would know what the optimal alignment until the shaded square was, any alignment subsequence that would follow the shaded square would precede the optimal alignment until the shaded square.

Right! We can now start with empty alignment. This is an upper left corner of our table. And we can now visit all the squares of the table to the right and down. Each square can be reached from three previous positions through a vertical, horizontal or diagonal move. But now that we know that each previous position denotes a starting subsequence of alignment with the score that will just add to our score of the remaining alignment sequence, we can compute the score at each cell by using the score from any of the previous three cells plus the score for the particular move.Horizontal and vertical moves introduce indels, so we add -2 to the score of the corresponding previous cell (-2 due to our punished_indels function). Diagonal move consumes symbols from $s$ and $t$, so we add either 2 if the symbols match or -1 if they do not match.

Above, we have defined a recursive procedure that can compute the scores for all of the cells of the alignment table. We will refer to the procedure as dynamic programming, and to the table as dynamic programming table. Let $M_{i,j}$ be the $i$-th row and $j$-th column entry of the table. Then,

$$M_{i,j}= \max \begin{cases} M_{i-1,j}+\sigma(s_i,-)\\ M_{i,j-1}+\sigma(-,t_i)\\ M_{i-1,j-1}+\sigma(s_i,t_i) \end{cases}$$

In this equation, the first case refers to the vertical, the second to the horizontal and the third to the diagonal move.

The scores in each of the cells are the scores of optimal alignments at that particular cell, and the path to that particular cell (where the path was chosen to obtain the maximal score) is the path that defines the optimal alignment that would lead us to particular cell. The global alignment, that is, the one that consumes all the symbols from $s$ and $t$ is the alignment where we have reached the lower right cell in the dynamic programming table. If we follow the path that lead us to the score computed in this cell, we obtain the optimal alignment. So simple! For the sequences of length $|s|$ and $|t|$, our computation complexity is only in the order of $|s|\times|t|$.

Above, we forgot about one thing. Intialization. But this is simple. With empy aligment at the start, we are in the upper left cell. The score is 0 (we have not done anything yet). We cal also fill out all the entries in the first row, going from left to right and adding -2, the punishment for introducing the indel. Similar for the first column.

The algorithm we have described above is called Needleman-Wunsch algorithm. It belongs to the dynamic programming type of algorithms. It consist of three stages:

initialization
computation of entries of dynamic programming table
trace-back to reconstruct the optimal alignment

For the alignment of our sequences $s$ and $t$, this is how our dynamic programming table looks like just after the initialization:

	-	T	C	T	A
-	0	-2	-4	-6	-8
A	-2
T	-4
G	-6
A	-8

And this is how it looks like after all its entries have been computed:

	-	T	C	T	A
-	0	-2	-4	-6	-8
A	-2	-1	-3	-5	-4
T	-4	0	-2	-1	-3
G	-6	-2	-1	-3	-2
A	-8	-4	-3	-2	-1

The optimal alignment has a score of -1. We have bolded the cells that are on the path of the contruction of the optimal alignment. Remember, to find this path, we have to start at lower right, and trace back the path, each time taking the previous cell from which we have yielded the maximal score. The optimal alignment corresponding to the path marked in our dynamic programming table is:

In [22]:

A = "ATG-A", "-TCTA"

Let us check out if its alignment score really matches the one from our dynamic programming table.

In [23]:

M(A, punished_indels)

Out[23]:

-1

It does. Huh. We are fine.