Alignment: Definitions and algorithms
FROM DOT PLOTS TO ALIGNMENT
Let
and
be 2 biomolecular sequences of the same type; DNA, RNA or
protein. In a dot plot,
is placed from top to bottom in
the 0th column of an
m+1 x n+1 matrix.
is placed from left to right in the 0th row of the same
matrix. ``Dots'' plotted in row i and column j indicate at least
that ai = bj. If the dot plot uses a ``window'' and ``stringency''
method with w > 1, or a k-word method with k > 1, then not all
matches are plotted in general.
Another way to write down the information in a dot plot is to write the two sequences from left to right, one above the other, and to use line segments to join ai to bj whenever there is a dot in row i and column j. The dot plot in Figure 5 would yield the image in figure 1. This can be regarded as a (bipartite) graph, with the two sequence elements acting as the vertices. We can call this a ``dot plot graph''.
The graph corresponding to the dot plot in Figure 6 would be very complicated. The idea of ``alignment'' is to take a subset of this graph, yielding an ``alignment graph'', or what is normally called an ``alignment of two sequences''. There are 2 requirements that must be satisfied:
The graph in Figure 1 violates both of these conditions. Note that both solid and dotted lines have been used in this Figure. Taking only the solid lines yields a valid alignment. The same is true for the dotted lines. Note that these 2 valid alignments are not the only ones that can be achieved by deleting edges. The whole theory of sequence alignment is to decide what edges to use.
This leads us to the first definition of sequence alignment. This definition will evolve as the course progresses.
DEFINITION: An alignment between sequences
and
is a set of integer pairs,
that satisfy:
Residues aih and bih are said to be ``aligned'' or ``matched'' with each other. Later on, we will drop the requirement that matched pairs must be equal.
We now change the look of the alignment graph by agreeing to write the
2 sequences so that the edges are always vertical lines. This is
equivalent to saying that matched residues are always ``directly over
one another'', or ``in the same column''. In order to accomplish this,
gaps must be left in 1 or both sequences, sometimes at the ends.
These gaps are usually written explicitly, using the symbol ``-''.
For example, an alignment of
AATAACATTCACGCGTCTGC and
AATATTGACGCATTCTGCAT appears below.
AATAACATTCACGC-GTCTGC-- ||| ||| |||| ||||| AAT---ATTGACGCATTCTGCAT
Written like this, we can describe the alignment in evolutionary
terms. We might imagine that
and
are the
descendants of a common ancestor,
.
The sequence
changed into
and
through a series of ``primitive
evolutionary events''. We define these to be:
Sequence elements that do not change are called ``conserved''. In an alignment, matched pairs of residues are inferred to be conserved. This may not be true, since the same change may occur in both.
When ai appears above bj and
,
we know that
a change has occurred. It is possible that more than 1 change has
occurred.
When ai appears above a gap, we know that either ai has been
inserted into the original sequence and appears in
,
or that
ai was in
and was deleted from
.
Similarly when
a gap appears above bj, we know that either bj has been inserted
into the original sequence and appears in
,
or that bj was
in
and was deleted from
.
Thus a gap in an alignment indicates that either an insertion or a deletion has occurred. We cannot deduce when a residue is either inserted or deleted from both sequences. The inference of an ancestral sequence is always error prone, and is approached using statistical methods. A series of consecutive ``-''s is usually called a (single) gap. The number of dashes is referred to as the size, or length of the gap. Dashes at the end of a sequence are called ``end gaps''. Because of the insertion/deletion ambiguity, gaps are often called ``indels''.
SEQUENCE EDITING
Consider the process of turning
into
using a
series of substitutions, insertions and deletions. We call this
``editing''
into
.We start by writing
from left to right without any gaps. We place a copy of
below it. We make a single change in this copy. If a substitution is
made to ai, then ai is changed in the copy. If ai is deleted, then
this character is replaced by a ``-'' in the copy. If an insertion
occurs after ai, then this element is added to the copy and a
``-'' is added at this position in the original sequence. We continue
in this manner, getting k+1 rows for k edit operations. Every time
an element is inserted, a gap character is added in the same position
all the way up to the top. Eliminating all middle rows gives us an
alignment of
with
.
Conversely, an alignment is a record of editing. Matched residues are not edited. ai over bj means that bj is substituted for ai. ai over a gap means that ai is deleted. A gap over bj indicates that bj has been inserted.
MINIMUM EDIT DISTANCE
It is clear that we can edit
to
using at most
m+n edit operations. That is, we can delete all the residues of
,
one by one, and then add all the residues of
,
one
by one. In fact, we could do this in at most
operations,
using
substitutions followed by
insertions or deletions. This is the best
that could be done to turn ACACACACACA into GTGGTGTTGTGTGT.
ACACACACACA--- GTGGTGTTGTGTGT
In general, we can do better.
Let
be the
minimum number of edits required to turn
into
.
Proposition:
is a metric
Proof:
How can we compute
? For
,
let
be the
ith prefix of
.
That is
.
Similarly,
for
.
Let
.
We define
and
to be the empty string. It is clear that
if i or j are 0. That is, Di,0 = i and
D0,j = j. Suppose that we know Dh,k for all
and
,
but not for (i,j). Can we compute Di,j?
In fact, if i > 0 and j > 0, then
Proof:
Thus
The equality follows from the fact that at least 1 of the following
must occur in transforming
to
Working in term of ``edit distance'' is cumbersome. It is easier to
work with alignments. If
is the sequence alphabet, let
,
where
is
.
That is, we add the dash character to the
alphabet. We call d the ``(edit) cost function''. Assume that
.
We assume that d(a,a) = 0. d will usually be
symmetric, but it does not have to be. If it is not symmetric, then
will not be a metric.
At this point, we will change the original definition of alignment by dropping the first condition that aih=bjh for all matched pairs. The d function will make that distinction for us.
Let
be an
alignment between
and
.
Put this alignment into
``alignment format'':
.
This also gives us a
``generalized edit distance'', and a generalized minimum edit
distance. If
d(a,-) = d(-,b) = 1 and d(a,b) = 1 for
This recursion is called the ``fill algorithm''.
Proof: An alignment between
and
can be created by
EXAMPLE TTATGGACTT versus CTTGGCTAGG. d = 2 and 1 for all substitutions and indels, respectively.
| C | T | T | G | G | C | T | A | G | G | ||
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
| T | 1 | 2 | 3 | 4 | 5 | 7 | 8 | 9 | |||
| T | 2 | 3 | 2 | 3 | 4 | 6 | 7 | 8 | |||
| A | 3 | 4 | 3 | 2 | 3 | 4 | 5 | 6 | 6 | 7 | |
| T | 4 | 5 | 4 | 5 | 6 | 6 | 7 | 8 | |||
| G | 5 | 6 | 5 | 4 | 5 | 6 | 7 | ||||
| G | 6 | 7 | 6 | 5 | 4 | 5 | 6 | ||||
| A | 7 | 8 | 7 | 6 | 5 | 4 | 5 | 6 | 6 | 7 | |
| C | 8 | 8 | 7 | 6 | 5 | 5 | 6 | 7 | 8 | ||
| T | 9 | 8 | 7 | 6 | 5 | 5 | 6 | 7 | |||
| T | 10 | 9 | 8 | 7 | 6 | 6 | 7 | 8 |
The D matrix after the fill algorithm is executed.
The dots,
,
indicate positions where ai = bj.
| C | T | T | G | G | C | T | A | G | G | ||
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
| T | 1 | ||||||||||
| T | 2 | ||||||||||
| A | 3 | ||||||||||
| T | 4 | ||||||||||
| G | 5 | ||||||||||
| G | 6 | ||||||||||
| A | 7 | ||||||||||
| C | 8 | ||||||||||
| T | 9 | ||||||||||
| T | 10 |
The D matrix before filling. The 0th row and column have been defined.
The value of Di,j can be determined if Dh,k is known for
,
,
.
There are
different ``fill orders'' that can be used.
Column fill :
for (j=1; j<=m; j++){
for (i=1; i <=n; i++){
Define D[i,j];
}
}
Row fill :
for (i=1; i<=n; i++){
for (j=1; j <=m; j++){
Define D[i,j];
}
}
The set
Diagonal fill :
for (d=2; d<=m+n; d++){
for (i=min{1,d-n}; i<=min{m,d-1}; i++){
j = d - i;
Define D[i,j];
}
}
Note that with the current algorithm, we only need to keep the present and previous column (row) to execute the fill algorithm in column (row) fill order. For the diagonal fill, we need the 2 previous co-diagonals.
TRACEBACK : After the ``fill algorithm'' is executed, an alignment can be constructed using the ``traceback algorithm''. The alignment is constructed from right to left; that is, backwards. This is because we start with Dm,n, and determine which of the 3 cases resulted in the value of Dm,n. This process continues. A traceback algorithm is explained below and a slight variant is illustrated in Figure 2.
![]() |
| C | T | T | G | G | C | T | A | G | G | ||
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
| T | 1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
| T | 2 | 3 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
| A | 3 | 4 | 2 | 3 | 4 | 5 | 6 | 5 | 6 | 7 | |
| T | 4 | 5 | 4 | 4 | 5 | 6 | 5 | 6 | 7 | 8 | |
| G | 5 | 6 | 5 | 4 | 4 | 5 | 6 | 7 | 6 | 7 | |
| G | 6 | 7 | 6 | 5 | 4 | 4 | 5 | 6 | 7 | 6 | |
| A | 7 | 8 | 7 | 6 | 5 | 5 | 6 | 5 | 6 | 7 | |
| C | 8 | 7 | 8 | 7 | 6 | 5 | 5 | 6 | 7 | 8 | |
| T | 9 | 8 | 7 | 8 | 7 | 6 | 5 | ||||
| T | 10 | 9 | 8 | 7 | 8 | 7 | 6 | 5 | 6 | 7 |
The execution of the traceback is illustrated above. The symbols:
``
'', ``
'' and ``
'' mean:
The resulting alignment is:
-TTATGGACT---T | ||| || C-T-TGG-CTAGG-
GAP PENALTIES :
We have already introduced the notion of a gap as a series of insertions or deletions in one sequence with respect to another in the context of an alignment. The ``size'' of a gap is the number of indels. Our definition of a gap allows for consecutive indels in one or both sequences. In the alignment above, -/C and T/- are consecutive entries that contain indels. We call this a single gap. Similarly, the four last columns: -/A, -/G, -/G and T/- constitute 1 gap, not 2. (This is not standard usage.)
The cost of a gap often depends only on its size, s. A function
,
which gives the cost
g(s) to a gap of size s, is called a ``gap cost'' or ``gap
penalty'' function. g(0) is always assumed to be 0. In our
general definition,
d(a,-) = d(-,a) can take on different values for
each letter of the (DNA or amino acid) alphabet. If the value of
d(a,-) is the same
,
the a gap of size s has cost
,
where
.
A gap penalty function of the form
is called an ``affine gap penalty function'',
where
and
are non-negative constants.
Alignment cost with gap cost functions.
In this case, d(ai,bj) still gives the cost of substituting ai
with bj, but d(ai,-) and d(-,bj) are no longer used.
The cost of an alignment cannot be determined by adding up one summand
for each column. If there are
gaps of size s in an
alignment, then the cost of an alignment between
and
is
At this time, we add 2 ``fake matches'' to every alignment to take
care of the end gaps. This means that we do not need to list the end
gaps as special cases. We will assume that every alignment (not the
``local'' alignments that will be considered later) begins with
(i0,j0) = (0,0) and ends with
(ik+1,jk+1) = (m+1,n+1).
We assume that these ``fake matches'' have 0 weight.
Note that a gap occurs whenever
ih-ih-1+jh-jh-1>2, and the
gap size is
ih-ih-1+jh-jh-1-2, for
.
Thus:
There are algorithms to compute minimum cost alignments for such gap functions, but we will not consider them. Instead, we will deal with these gap functions when we look at ``maximum similarity'' alignment.
PATH GRAPHS
Alignments can be visualized in so-called ``path graphs'' that are related to dot plots. Alignments are constructed by eye using dot plots. Path graphs allow one to define the set of all possible alignments in a formal, yet intuitive way.
One type of path graph is what I call the ``truck route'' path graph. I use this term because the path is connected. It cannot hop from one location to another. Truck route path graphs formalize the representation that shows matches and gaps explicitly.
An
m+1 x n+1 square grid of points is constructed. The grid can
described as
,
where i refers to row, and j to column. Two distinct points
(i,j) and (i',j') have an edge between them if and only if
.
Any collection of connected edges from (0,0) to (m+1,n+1) that moves from each node (i-1,j-1) to either (i,j-1), (i,j) or to (i-1,j), describes an alignment.
When a diagonal edge,
is used, we say that
a(i) is matched with b(j).
When a vertical edge,
is used, we say that
a(i) is inserted with respect to sequence
,
or that a(i)
is in a gap.
Similarly, when a horizontal edge,
is
used, we say that b(j) is inserted with respect to sequence
,
or that b(j) is in a gap.
Figure 3 illustrates an alignment together with its path graph. These kinds of path graphs are the most useful ones for looking at multiple sequence alignments. They are not good for describing alignments with affine gap penalties.
![]() |
![]() |
In contrast, the ``airline route'' path graph uses an m x n grid, and matched pairs correspond to vertices, not edges. A path consists of a series of vertices, as defined above. Gaps are inferred when either il+1 - il > 1 or jl+1 - jl > 1. The alignments that have been introduced in the course up to now are best visualized using ``airline route'' path graphs.
Consider the m x n array of integer pairs:
.
Consider all line segments joining (i,j) to (i-1,j), (i,j) to
(i,j-1) and (i,j) to
.
(We exclude negative indices.) A ``path'' from (0,0) to
(m,n) is a sequence of these connected line segments, joining
distinct pairs
,
where
and
.
There is a 1:1 correspondence between paths
and alignments, with the segment:
SIMILAR (DNA) SEQUENCES
If 2 (DNA) sequences are known to be fairly similar, say 90% or better and n and m are large (say 10,000 or larger) and ``roughly the same size'', then much time (and space) can be saved in alignment.
Diagonal band fill
The set
MAXIMUM SIMILARITY ALIGNMENT
As before,
is the sequence alphabet, and
.
We call w the
``weight function'' that gives a measure of similarity. For example,
would be a natural and simple weight function.
So would
.
We assume that w is symmetric.
We can think of w as the inverse of d. However, gap costs are
treated the same as before. The definition of alignment remains the
same. This time, we try to maximize the similarity of an alignment, so
that the gap penalties (gap costs) are subtracted, not added.
If there are
gaps of size s in an
alignment, then the cost of an alignment between
and
is
If
,
then we can define
for all letters a and b. In this case, the alignment similarity is
,
where a' and b' are
characters in
.
In this case, we define
to be the
maximum similarity of all alignments between
and
.
Sufficient condition for
to be
equivalent to
.
Suppose that w and d are score functions for sequence alignment by similarity maximization and distance minimization, respectively. Let the gap penalty for distance minimization be given by some function g. We wish to construct a maximum similarity alignment.
Suppose, as well, that
w(a,b) + d(a,b) = C for some constant C,
and that
.
Then distance minimization
alignment is equivalent to similarity maximizing alignment.
Proof: For any alignment, there are k matched pairs and a total of m+n-2k elements in gaps, including possible end gaps.
Let
be the number of gaps of size l. Then
Example The familiar case where
gives what is called the ``Levenstein distance'' between 2
sequences. In this case, g(s) = s. If we let
,
which seems ``natural'', then C=1, so g'(s) = s/2 as a gap penalty
for maximum similarity alignment would give identical results.
For a general gap function, g, how can we compute
and a maximum similarity alignment?
Let Si,j be the maximum similarity of an alignment between
and
.
Note that
S0,j = -g(j) and that
Si,0 =
-g(i). What sort of recursion can be used in this case? The answer is
given below.
Proof : An optimal alignment either ends with a match
between ai and bj, or with an end gap. The
Si-1,j-1+w(ai,bj) term adds the weight of an ai/bj match to
an optimal alignment between
and
.
This
either extends a match or closes an end gap. The second term considers
all possible ``last matched pair'' cases.
While this formula is straightforward, computation time is
,
since one must backtrack for each (i,j) pair. We
can do better when g has a special form.
Case 1.
,
where
is a positive constant.
This is the ``linear gap penalty'' case.
Case 2.
,
where
and
are non-negative constants. (Both are not 0.)
This is the ``affine gap penalty'' case. The linear gap penalty
recursion breaks down because when we add an unpaired ai
to an optimal alignment between
and
or bj to an optimal alignment between
and
we don't know if we are creating a new gap, with a cost of
,
or just extending an existing one, with a cost of
only
.
This can be solved in a number of ways. They all involve
the use of at least one extra array that must be filled.
``Zuker's method'': Let S1i,j be the maximum similarity between
and
that ends with a gap. That is, ai and
bj are not matched with each other. Let S2i,j be the maximum
similarity between
and
that ends with an
ai/bj match. Clearly
Initial conditions.
Recursions:
Proof : We consider cases. For S1i,j:
For S2i,j, ai is paired with bj. ai-1 and bj-1 are either paired with each other, or they are not.
Alignment examples.
AACAGGCAGGCCATT--AGGGAG--CTACGG-GTTGTGCCGACCGTCCAAT---AATGGTCGGCCTCTC |: |:| :|:|||: || :|| || ||: :||:|| || |||:| ||| |:::| | AGAAAG--AGTCATCTGAGCAAGGACT-CGATATTATG--GA---TCCGA-AGAAAT--TTAACGT---
In this example, identity is indicated by |,
transitions (
,
)
by :, and transversions (
,
,
,
)
by blanks, respectively.
Example:
= AGGCTACGG and
= AGGGACTCGAT. w = 10,
2 and -5 for identity, transitions and transvertions, respectively.
Optimal alignment:
AGG--CTACGG- ||| || ||: AGGGACT-CGAT
S1:
| A | G | G | G | A | C | T | C | G | A | T | ||
| -11 | -12 | -13 | -14 | -15 | -16 | -17 | -18 | -19 | -20 | -21 | ||
| A | -11 | -12 | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 | -9 | -10 |
| G | -12 | -1 | -2 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
| G | -13 | -2 | 9 | 8 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 |
| C | -14 | -3 | 8 | 19 | 18 | 17 | 16 | 17 | 16 | 15 | 14 | 13 |
| T | -15 | -4 | 7 | 18 | 17 | 16 | 17 | 16 | 27 | 26 | 25 | 24 |
| A | -16 | -5 | 6 | 17 | 16 | 15 | 16 | 27 | 26 | 25 | 24 | 25 |
| C | -17 | -6 | 5 | 16 | 15 | 16 | 15 | 26 | 25 | 26 | 25 | 24 |
| G | -18 | -7 | 4 | 15 | 14 | 15 | 26 | 25 | 26 | 25 | 36 | 35 |
| G | -19 | -8 | 3 | 14 | 15 | 14 | 25 | 24 | 25 | 36 | 35 | 38 |
S2:
| A | G | G | G | A | C | T | C | G | A | T | ||
| 0 | ||||||||||||
| A | 10 | -9 | -10 | -11 | -4 | -20 | -21 | -22 | -16 | -9 | -25 | |
| G | -9 | 20 | 9 | 8 | -1 | -9 | -10 | -11 | 3 | -6 | -14 | |
| G | -10 | 9 | 30 | 19 | 10 | 2 | 1 | 0 | 14 | 5 | -3 | |
| C | -18 | -7 | 4 | 25 | 14 | 28 | 19 | 26 | 10 | 9 | 15 | |
| T | -19 | -8 | 3 | 14 | 20 | 19 | 38 | 21 | 21 | 10 | 24 | |
| A | -5 | -2 | 9 | 20 | 27 | 15 | 14 | 33 | 29 | 36 | 20 | |
| C | -21 | -10 | 1 | 12 | 15 | 37 | 18 | 37 | 28 | 24 | 38 | |
| G | -15 | 4 | 15 | 26 | 17 | 11 | 32 | 21 | 47 | 30 | 20 | |
| G | -16 | 3 | 14 | 25 | 28 | 12 | 21 | 27 | 36 | 49 | 31 |
Up to now, we have been dealing with what is called ``global
alignment''. When dealing with edit distance, we must transform one
sequence into another. As we dealt with minimum cost alignment and
maximum similariy alignment, we always kept the notion that we are
dealing with all of
and
.
In particular, we
penalize ``end gaps''. We will now consider another class of alignment
problem.
Local sequence alignment :
If
and
,
let
and let
.
Let
For a general gap function, g, how can we compute
and a maximum similarity alignment? Such an alignment is
called a ``local alignment''.
Let Hi,j be the maximum similarity of an alignment between
and
that includes either the (ai,bj) match or
else counts an end gap, or is empty.
Then
Note that
Hi,0 = H0,j = 0, since these values can only be
or
.
The recursion for i>0 and
j>0 is:
Proof : An optimal alignment either ends with a match
between ai and bj, or with an end gap. The
Hi-1,j-1+w(ai,bj) term adds the weight of an ai/bj match to
an optimal local alignment between
and
.
This
either extends a match or closes an end gap. The second term considers
all possible ``last matched pair'' cases. If the last 2 terms are
negative, then the 0 option (no alignment) can be used.
While this formula is straightforward, computation time is
,
since one must backtrack for each (i,j) pair. As
with S, we can do better when g has a special form. In these forms,
the algorithm is named ``Smith-Waterman'' after Temple Smith and
Michael Waterman who first published it.
Case 1.
,
where
is a positive constant.
This is the ``linear gap penalty'' case.
Case 2. Affine gap penalty.
.
``Zuker's method'': Let H1i,j be the maximum similarity between
and
that ends with a gap. That is, ai and
bj are not matched with each other. Let H2i,j be the maximum
similarity between
and
that ends with an
ai/bj match. It is understood that either H1i,j or
H2i,j will be 0 (no alignment) if there is no alignment
with a non-negative score.
Clearly
The initial conditions are simple. H1i,0 = H10,j = H2i,0 = H20,j = 0.
Recursions:
Proof : We consider cases. For H1i,j:
For H2i,j, if there is an alignment with positive score, then ai is paired with bj. ai-1 and bj-1 are either paired with each other, or they are not.
Alignment examples. Identity, transition and transversion weights are
1, 0 and -1, respectively.
.
= TCTTCTCCAAGGCGTTAACT and
= AACTTCGTTTGAGGCTTCTT
10
CTTC-TCCAAGGC
|||| |:::||||
CTTCGTTTGAGGC
10
H1
| A | A | C | T | T | C | G | T | T | T | G | A | G | G | C | T | T | C | T | T | ||
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| T | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| C | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 2 |
| T | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 3 |
| C | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 2 |
| C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 3 |
| A | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
| A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| G | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 3 | 2 | 1 | 0 | 0 | 0 | 0 |
| G | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 2 | 4 | 3 | 2 | 1 | 0 | 0 |
| C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 4 | 3 | 5 | 4 | 3 | 2 | 1 |
| G | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 5 | 4 | 4 | 3 | 2 | 1 |
| T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 4 | 4 | 4 | 5 | 4 | 3 |
| T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 1 | 3 | 4 | 5 | 5 | 5 | 4 |
| A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 0 | 2 | 3 | 5 | 5 | 4 | 4 |
| A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 2 | 4 | 4 | 4 | 3 |
| C | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 3 | 3 | 3 | 3 |
| T | 0 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 2 | 4 |
H2
| A | A | C | T | T | C | G | T | T | T | G | A | G | G | C | T | T | C | T | T | ||
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| T | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 |
| C | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 0 | 1 |
| T | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 3 | 1 |
| T | 0 | 0 | 0 | 0 | 1 | 3 | 1 | 0 | 2 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 3 | 1 | 2 | 4 |
| C | 0 | 0 | 0 | 1 | 0 | 1 | 4 | 0 | 0 | 2 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 4 | 1 | 2 |
| T | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 3 | 3 | 2 | 3 | 1 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 5 | 3 |
| C | 0 | 0 | 0 | 1 | 0 | 2 | 2 | 1 | 3 | 3 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 2 | 2 | 2 | 5 |
| C | 0 | 0 | 0 | 1 | 1 | 0 | 3 | 1 | 1 | 3 | 3 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 3 | 2 | 3 |
| A | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 2 | 3 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 1 |
| A | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 2 | 4 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| G | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 2 | 5 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| G | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 3 | 6 | 2 | 1 | 0 | 0 | 0 | 0 |
| C | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 7 | 4 | 3 | 3 | 1 | 0 |
| G | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 1 | 3 | 3 | 6 | 4 | 3 | 2 | 1 |
| T | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 3 | 6 | 7 | 4 | 4 | 3 |
| T | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 0 | 1 | 4 | 2 | 0 | 0 | 0 | 0 | 2 | 5 | 7 | 7 | 6 | 5 |
| A | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 3 | 2 | 2 | 0 | 0 | 0 | 2 | 4 | 6 | 6 | 5 |
| A | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 3 | 3 | 2 | 0 | 0 | 1 | 2 | 4 | 5 | 5 |
| C | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 2 | 2 | 1 | 1 | 1 | 2 | 5 | 4 | 5 |
| T | 0 | 0 | 0 | 0 | 4 | 2 | 0 | 0 | 1 | 2 | 1 | 0 | 0 | 1 | 1 | 1 | 2 | 2 | 3 | 6 | 5 |
The Sankoff algorithm
Given a non-negative integer, k, compute an optimal global alignment between A and B that has at most k gaps. There are no gap penalties.
Let
be the maximal score of any alignment on (the
prefixes)
and
that includes the (i,j) pair and
that has at most k gaps. Similarly,
is the maximal
score of any alignment that may or may not include the (i,j) pair
and that has at most k gaps. End gaps are counted in both cases.
There are no gap penalties, so the following boundary conditions are appropriate:
Recursion step:
For i > 0 and j > 0,
For
,
Note that
Examples: global and local alignment
Alignment of 2 random DNA sequences of length 100. The match
scores are 10, 1 and -5 for matches, transitions and transversions,
respectively. The gap penalties are given by
and
.
A global alignment has score 437, and is given by:
10 20 30 40 50 60
G---CG-----CT-ACGC--G-TGA-T-ATTGGG-TT-G-GCGAGTGTTACGAGACCTCTC-A-TCAGAACCTCCCGCGG
| || || |||| | ||| | |: ||| || | :||:|| ||| ||| ||| | | |||| | ||| ||||
GTCCCGCAAAACTAACGCAAGATGATTAACGGGGATTAGAACGGGTCTTA-GAG--CTCACTAGTCAG--CGTCCGGCGG
10 20 30 40 50 60 70
70 80 90 100
CCTAGGGCGGATACGGATAGACGGATCAGGCGATTGGTG
|| |: :|| | || || : | | | | | |:||
CC----GT-AAT-CTGA-AG-T-G-T-A-CCTA-TAGT-
80 90 100
An optimal local alignment achieves a score of 460.
10 20 30 40 50 60
GC---GCT-ACGC--G-TGA-T-ATTGGG-TT-G-GCGAGTGTTACGAGACCTCTC-A-TCAGAACCTCCCGCGGCC-TA
|| :|| |||| | ||| | |: ||| || | :||:|| ||| ||| ||| | | |||| | ||| |||||| ||
GCAAAACTAACGCAAGATGATTAACGGGGATTAGAACGGGTCTTA-GAG--CTCACTAGTCAG--CGTCCGGCGGCCGTA
10 20 30 40 50 60 70 80
70 80 90
GGGCGGATACGGATAG-ACGGATCAG
: | || | | | | || || ||
-ATCTGA-A--G-T-GTACCTAT-AG
90
Note that the local alignment is not just a truncation at both ends of
the global alignment. Changing
from 0 to 20 changes the
optimal global and local alignments to eliminate many gaps.
Global: score = 154
10 20 30 40 50 60 70
GCGCTAC-------GCGTGATATTGGGTTGGCGAGTGTTACGAGACCTCTCATCAGAACCTCCCG--CGGCCTAGGGCGG
|: |::| :||::| || |:||::||:| :||| :| : || |:|||:|:: |:: |:|| | |||||
GTCCCGCAAAACTAACGCAAGAT--GATTAACGGGGATTAGAACGGGTC---TTAGAGCTCACTAGTCAGCGTCCGGCGG
10 20 30 40 50 60 70
80 90 100
ATACGGATAGACGGATCAGGCGATTGGTG
::::: : || | :| :| |:||
CCGTAATCTGAAGTGT---ACCTATAGT-
80 90 100
Local: score = 273
10 20 30 40 50 60 70
ACGCGTGAT-ATT-----GGGTTGG--CGAGTGTTACGAGACCTCTCATCAGAACCTCCCGCGGCCTAGGGCGGATACGG
||||: ||| ||| ||:||:| ||:|| ||| ||| :| || :|||| | ||| |||||| :: | || ::|
ACGCAAGATGATTAACGGGGATTAGAACGGGTCTTA-GAGCTCACTAGTCAG--CGTCCGGCGGCCGTAATCTGAAGTGT
20 30 40 50 60 70 80 90
80
ATAGACGG
|: |::|
ACCTATAG
Generalized alignment
Let
be a locally finite partially ordered set.
is actually a dual,
,
where
is a
non-empty set consisting of elements {pi}, and where
.
For
p1, p2 and
,
we write
if, and
only if
is true. ``
'' is called a partial
order. If
,
then we define the ``closed interval''
.
By
definition, the ``open interval'', (p1,p2), excludes p1 and p2.
The following properties hold:
We define
to be the number of
elements in (p1,p2) and
to be
the size of a maximal chain in (p1,p2).
When
,
we say that p1 is less than or equal to
p2 in the partial order.
If neither
nor
,
then we say that p1 and p2 are not comperable in
the partial order.
Finally, we write
to mean
and
.
In this case, we say that p1 is
less than p2 in the partial order.
A ``chain'' is a totally ordered subset of
.
That is, any 2
elements of a chain are comparable. The elements of a chain can be
assumed to be distinct, and any finite chain can be written uniquely
as
.
We say that ``p2 covers p1'' if
and
[p1,p2] = {p1,p2} (equivalent to
). We write this as
An example of a locally finite partially ordered set is the set of all
finite subsets of the integers, where
is
.
The general setting for alignment is a triplet,
If p1 is not
,
then
g(p1,p2) = 0. The functions w and
g are called the ``weight'' and ``gap penalty'' functions,
respectively. The intervals [p1,p2]1 and [p1,p2]2 refer to
the 2 partial orders,
and
,
respectively.
Let
be a chain of length k with respect to
.
Each pair pi, pi+1 is called a link (the ith
link) in the chain. If pi is not
,
or if
,
but
,
we say that
the ith link is ``weak''.
Then the ``weight'' of
,
,
is defined
by
PROBLEM 1: ``GLOBAL''
Given an interval [pa,pb]1,
compute a chain containing pa and pb with an optimal weight; that
is, with a weight as large as possible.
PROBLEM 2: ``LOCAL''
(SMITH-WATERMAN)
Given an interval
[pa,pb]1, compute a chain with an optimal weight. (An empty
chain has a weight of 0 by definition).
It is possible to construct solutions to problems 1 and 2 in time
.
The solutions are achieved via recursions.
Problem 1: Let
be the maximal score of any chain in
[pa,pi]1 that includes pa and pi.
Clearly,
.
Recursion step:
For
,
Why is this recursion correct? Because [pa,pb]1 is finite, all
the elements pi can be arranged into groups, g1, g2,
etc., where g0 = {pa} and
for k > 0. For some k,
.
Thus, equation 1 can be computed successively for
.
The recursion in equation 1 is called the ``fill
algorithm''. It fills an (abstract) array of numbers with optimal
values. No chains are computed. After the fill algorithm is executed,
all we know are optimal chain lengths for all intervals of the form
.
In particular,
is the optimal score.
The construction of an optimal chain is achieved with a ``traceback algorithm''. It constructs a chain in reverse order and works as follows:
Problem 2: Let
be the maximal score of any chain in
[pa,pi]1 that includes pi if it is non-empty.
Clearly,
.
This is because
we don't have to settle for a negative score if w is always <
0. We can choose the empty chain whose score is 0 by definition.
Recursion step:
For
,
This recursion can be computed in the same order as the one in equation 1. That is, first for elements that cover pa, then for elements that cover those elements, and so on.
What is new here is the 0. It means that for any pi, we can
reject a chain ending at pi if its score is < 0. A chain is free
to begin anywhere.
is not necessarily the score of
an optimal chain. Instead,
The traceback algorithm is slightly different.
The recursions in equations 1 and 1 look back into
the interval [pa,pi]1 for all
.
This yields
an overall time performance of the fill algorithms as
.
Examples
1. The comparison of 2 DNA, RNA or aa (amino acid, or protein) sequences.
and
are 2 biomolecular sequences of the same type.
.
An element of this set is an
ordered pair of integers, (i,j). In practice, we take pa = (0,0)
and
pb = (m+1,n+1) for Problem 1, and pa = (1,1) and pb =
(m,n) for Problems 2 and 3. The convention adopted here is to write a
2D grid with (1,1) on the upper-left and (m,n) on the lower
right. Usage varies. Weak links are called ``gaps'', or ``indels'', which
is an abbreviation for ``insertions or deletions''.
:
:
2. The comparison of a DNA (or RNA) sequence with a protein sequence.
is a DNA (RNA) sequence and
is an amino acid (protein) sequence.
As above
.
Also pa = (0,0)
and
pb = (m+1,n+1) for Problem 1, and pa = (1,1) and pb =
(m,n) for Problem 2.
:
:
A chain is a set of distinct ordered pairs
satisfying
and
.
For example 1, gap penalties are usually of the form g(s), where s
is the ``size'' of the gap, which is
i2 - i1 + j2 - j1 -2. An
affine gap penalty is of the form
,
except that g(0) = 0.
The score function, w is defined as follows. The sequence elements
are from some alphabet,
.
There is a ``weight'', or
``scoring'' function,
.
Then
w(i,j) = w(ai,b(j)) (slight ``abuse of notation''). Note
that w(ai,b(j)) is sometimes written as w(i,j) or wi,j,
which is a slight abuse of notation.
For example 2, gap penalties can be of the form g(s1,s2), where
s1+s2 is the ``size'' of the gap.
s1 = i2 - i1 - 1, which is
the number of inserted DNA characters, and
s2 = j2 - j1 - 1,
which is the number of inserted amino acid characters. An affine gap
penalty is of the form
,
with the understanding that g(0) = 0.
That is, we penalize DNA and amino acid insertions/deletions differently.
For example 1, the score function, w is defined as follows. The
sequence elements are from some alphabet,
.
There is a
``weight'', or ``scoring'' function,
.
Then
w(i,j) = w(ai,bj) (slight ``abuse of
notation''). Note that w(ai,bj) is sometimes written as w(i,j)
or wi,j, which is a slight abuse of notation.
For example 2, the score function, w is defined as follows. The
sequence elements are from 2 alphabets,
and
.
There is a ``weight'', or ``scoring'' function,
.
Then
w(i,j) =
w(ai,ai+1,ai+2,bj) (slight ``abuse of notation''). Note that
w(ai,ai+1,ai+2,bj) is sometimes written as w(i,j) or
wi,j, which is a slight abuse of notation. That is when we
compare ai with bj, we use the triplet
aiai+1ai+2 to
translate the DNA into an amino acid. This means that w(m-1,j) and
w(m,j) are not defined, since they refer to DNA characters past the
end of the sequence. These numbers can either be defined to be
(i.e. do not allow these matches) or they can be
defined knowing that the DNA sequence really extends beyond m.
For global sequence alignment, pa = (0,0) and pb = (m+1,n+1). This is an artificial construct so that the alignment does not have to begin with a (1,1) match or end with a (m,n) match, and yet a penalty is charged for these ``end gaps'', should they occur. We take w(0,0) = w(m+1,n+1) = 0.
ExampleThe DNA sequence for the mat gene of coliphage MS2
is aligned with the mat protein of coliphage GA.
and
.
Weight matrix is PAM128 (which will be
defined later in the course).
Optimal local alignment:
360 380 400
AAC GTT GCG AAC CGG GCG TCG ACC GAA GTC CTG CAA AAG GTC ACC CAG GGT AAT
| : | | | | | | | : : | : |
Asn Ile Asp Gln Arg Ala Ser Val Glu Val Leu Asn Lys Leu Ser Gln Ser Asn
120 130
420 440 460
TTT AAC CTT GGT GTT GCT TTA GCA GAG GCC AGG TCG ACA GCC TCA CAA CTC GCG
: | : | | | : | | | : | | | | |
Leu Asn Ile Gly Val Ala Ile Ala Glu Ala Lys Met Thr Ala Ser Leu Leu Ala
140 150
480 500
ACG CAA ACC ATT GCG CTC GTG AAG GCG TAC ACT GCC GCT CGT CGC GGT AAT TGG
| : | | | : : | | | | | : | | | |
Lys Gln Ser Ile Ala Leu Ile Arg Ala Tyr Thr Ala Ala Lys Arg Gly Asn Trp
160 170
520 540 560
CGC CAG GCG CTC CGC TAC CTT GCC CTA AAC GAA GAT CGA --- AAG TTT CGA TCA
| : | | : : | | : :
Arg Glu Val Leu Ser Gln Leu Leu Ile Ser Glu His Arg Phe Arg Ala Pro Ala
180
580 600 620
AAA CAC GTG GCC GGC AGG TGG TTG GAG TTG CAG TTC GGT TGG TTA CCA CTA ATG
| : : | | | | | | | : | | | | | |
Lys Asp Leu Gly Gly Arg Trp Leu Glu Leu Gln Tyr Gly Trp Leu Pro Leu Met
190 200
640 660
AGT GAT ATC CAG GGT GCA TAT GAG ATG CTT ACG AAG GTT CAC CTT CAA GAG TTT
| | : : | | : : | | | |
Ser Asp Leu Lys Ala Ala Tyr Asp Leu Leu Thr Gln Thr Lys Leu Pro Ala Phe
210 220
680 700 720
CTT CCT ATG AGA GCC GTA CGT CAG GTC GGT ACT AAC ATC AAG TTA GAT GGC CGT
: | : | | | | : |
Met Pro Leu Arg Val Thr Arg Thr Val Gly Gly Thr His Asn Tyr Lys Val Arg
230 240
740 760
CTG TCG TAT CCA GCT G --- CAA ACT TCC AGA CAA CGT G CAA CAT ATC GCG ---
: | : | | : : : :
Asn Val Glu Ser Ala - Gly Asp Thr Trp Ser Tyr Arg - His Arg Leu Ser Val
250
780 800 820
ACG TAT CGT G ATA TGG TTT TAC ATA AAC GAT GCA CGT TTG GCA TGG TTG TCG
| | | | : : | : | : | | | | |
Asn Tyr Arg - Ile Trp Tyr Phe Ile Ser Asp Pro Arg Leu Ala Trp Ala Ser
260 270
840 860 880
TCT CTA GGT ATC TTG AAC CCA CTA GGT ATA GTG TGG GAA AAG GTG CCT TTC TCA
| | | : | | | | | | | | | |
Ser Leu Gly Leu Leu Asn Pro Leu Glu Ile Tyr Trp Glu Lys Thr Pro Trp Ser
280 290
900 920
TTC GTT GTC GAC TGG CTC CTA CCT GTA GGT AAC ATG CTC GAG GGC CTT ACG GCC
| | | | | : | | | | | : : | : : :
Phe Val Val Asp Trp Phe Leu Pro Val Gly Asn Leu Ile Glu Ala Met Ser Asn
300 310
940 960 980
CCC GTG GGA TGC TCC TAC ATG TCA GGA ACA GTT ACT GAC GTA A TAA CGG GTG
| : | : | | | | : :
Pro Leu Gly Leu Asp Ile Ile Ser Gly Thr Lys Thr --- --- - Trp Gln Leu
320
1000 1020 1040
AGT CCA TCA TAA GCG TTG ACG CTC CCT ACG GGT GGA CTG TGG AGA GAC AGG GCA
: | | | : : | | |
Glu Ser Lys Leu Asn Ala Thr Leu Pro Ala Ser Gly --- Trp Ser Gly Thr Ala
330 340
1060 1080
CTG CTA AGG CCC AAA TCT CAG CCA TGC A TCG AGG GGT ACA ATC C --- GTA TGG
| : : : | : |
Lys Leu Thr Ala Tyr Ala Lys Ala Tyr - Asp Arg Ser Thr Phe - Tyr Ser Phe
350 360
1100 1120 1140
CCA ACA ACT GGC GCG TAC GTA AAG TCT CCT TTC TCG ATG GTC CAT ACC TTA GAT
| | : | | | | | : | : | :
Pro Thr Pro Leu Pro Tyr Val Lys Ser Pro Leu Ser Gly Leu His Leu Ala Asn
370
1160
GCG TTA GCA TTA ATC AGG CAA CGG CTC TCT AGA
| | | | | | | | |
Ala Leu Ala Leu Ile Asn Gln Arg Leu Lys Arg
380 390
An algorithm for global alignment of a DNA sequence to an amino acid sequence.
Let S1i,j be the maximum similarity between
and
that ends with a gap. That is, ai and
bj are not matched with each other. Let S2i,j be the maximum
similarity between
and
that ends with an
ai/bj match. Clearly
Initial conditions.
Recursions:
![]() | Michael Zuker Professor of Mathematical Sciences Rensselaer Polytechnic Institute 2004-01-24 |