MATH-4961 / MATH-6961 / CSCI-4964 / CSCI-6964

Alignment: Definitions and algorithms

FROM DOT PLOTS TO ALIGNMENT

Let ${\bf A} = a_1a_2a_3 \dots a_m$ and ${\bf B} = b_1b_2b_3 \dots
b_n$ be 2 biomolecular sequences of the same type; DNA, RNA or protein. In a dot plot, ${\bf A}$ is placed from top to bottom in the 0th column of an m+1 x n+1 matrix. ${\bf B}$ is placed from left to right in the 0th row of the same matrix. ``Dots'' plotted in row i and column j indicate at least that ai = bj. If the dot plot uses a ``window'' and ``stringency'' method with w > 1, or a k-word method with k > 1, then not all matches are plotted in general.

Another way to write down the information in a dot plot is to write the two sequences from left to right, one above the other, and to use line segments to join ai to bj whenever there is a dot in row i and column j. The dot plot in Figure 5 would yield the image in figure 1. This can be regarded as a (bipartite) graph, with the two sequence elements acting as the vertices. We can call this a ``dot plot graph''.


  
Figure 1: The ``Alignment'' corresponding to the dot plot in Figure 5 (dot plot notes).
\begin{figure}
\epsfig{file=images/A_B_2align.ps,width=0.2\textwidth}
\end{figure}

The graph corresponding to the dot plot in Figure 6 would be very complicated. The idea of ``alignment'' is to take a subset of this graph, yielding an ``alignment graph'', or what is normally called an ``alignment of two sequences''. There are 2 requirements that must be satisfied:

1.
Each vertex has at most one edge.
2.
No 2 edges cross each other.

The graph in Figure 1 violates both of these conditions. Note that both solid and dotted lines have been used in this Figure. Taking only the solid lines yields a valid alignment. The same is true for the dotted lines. Note that these 2 valid alignments are not the only ones that can be achieved by deleting edges. The whole theory of sequence alignment is to decide what edges to use.

This leads us to the first definition of sequence alignment. This definition will evolve as the course progresses.

DEFINITION: An alignment between sequences ${\bf A}$ and ${\bf B}$ is a set of integer pairs, $\{(i_h,j_h)\}_{h=1}^{k}$ that satisfy:

1.
aih = bjh for $1 \leq h \leq k$, (this condition is temporary and will be dropped later)
2.
ih < ih+1 for $1 \leq h < k$ and
3.
jh < jh+1 for $1 \leq h < k$.

Residues aih and bih are said to be ``aligned'' or ``matched'' with each other. Later on, we will drop the requirement that matched pairs must be equal.

We now change the look of the alignment graph by agreeing to write the 2 sequences so that the edges are always vertical lines. This is equivalent to saying that matched residues are always ``directly over one another'', or ``in the same column''. In order to accomplish this, gaps must be left in 1 or both sequences, sometimes at the ends. These gaps are usually written explicitly, using the symbol ``-''. For example, an alignment of
${\bf A} =$ AATAACATTCACGCGTCTGC and
${\bf B} =$ AATATTGACGCATTCTGCAT appears below.

AATAACATTCACGC-GTCTGC--
|||   ||| ||||  |||||  
AAT---ATTGACGCATTCTGCAT

Written like this, we can describe the alignment in evolutionary terms. We might imagine that ${\bf A}$ and ${\bf B}$ are the descendants of a common ancestor, ${\bf U}$. The sequence ${\bf U}$ changed into ${\bf A}$ and ${\bf B}$ through a series of ``primitive evolutionary events''. We define these to be:

1.
Substitution: One residue changes to another.
2.
Insertion: A residue is added to a sequence.
3.
Deletion: A residue is deleted from a sequence.

Sequence elements that do not change are called ``conserved''. In an alignment, matched pairs of residues are inferred to be conserved. This may not be true, since the same change may occur in both.

When ai appears above bj and $a_i \neq b_j$, we know that a change has occurred. It is possible that more than 1 change has occurred.

When ai appears above a gap, we know that either ai has been inserted into the original sequence and appears in ${\bf A}$, or that ai was in ${\bf U}$ and was deleted from ${\bf B}$. Similarly when a gap appears above bj, we know that either bj has been inserted into the original sequence and appears in ${\bf B}$, or that bj was in ${\bf U}$ and was deleted from ${\bf A}$.

Thus a gap in an alignment indicates that either an insertion or a deletion has occurred. We cannot deduce when a residue is either inserted or deleted from both sequences. The inference of an ancestral sequence is always error prone, and is approached using statistical methods. A series of consecutive ``-''s is usually called a (single) gap. The number of dashes is referred to as the size, or length of the gap. Dashes at the end of a sequence are called ``end gaps''. Because of the insertion/deletion ambiguity, gaps are often called ``indels''.

SEQUENCE EDITING

Consider the process of turning ${\bf A}$ into ${\bf B}$ using a series of substitutions, insertions and deletions. We call this ``editing'' ${\bf A}$ into ${\bf B}$.We start by writing ${\bf A}$ from left to right without any gaps. We place a copy of ${\bf A}$ below it. We make a single change in this copy. If a substitution is made to ai, then ai is changed in the copy. If ai is deleted, then this character is replaced by a ``-'' in the copy. If an insertion occurs after ai, then this element is added to the copy and a ``-'' is added at this position in the original sequence. We continue in this manner, getting k+1 rows for k edit operations. Every time an element is inserted, a gap character is added in the same position all the way up to the top. Eliminating all middle rows gives us an alignment of ${\bf A}$ with ${\bf B}$.

Conversely, an alignment is a record of editing. Matched residues are not edited. ai over bj means that bj is substituted for ai. ai over a gap means that ai is deleted. A gap over bj indicates that bj has been inserted.

MINIMUM EDIT DISTANCE

It is clear that we can edit ${\bf A}$ to ${\bf B}$ using at most m+n edit operations. That is, we can delete all the residues of ${\bf A}$, one by one, and then add all the residues of ${\bf B}$, one by one. In fact, we could do this in at most $\max\{m,n\}$ operations, using $\min\{m,n\}$ substitutions followed by $\max\{m,n\}-\min\{m,n\}$ insertions or deletions. This is the best that could be done to turn ACACACACACA into GTGGTGTTGTGTGT.

ACACACACACA---

GTGGTGTTGTGTGT

In general, we can do better.

Let ${\cal D}({\bf A},{\bf B})$ be the minimum number of edits required to turn ${\bf A}$ into ${\bf B}$.

Proposition: ${\cal D}$ is a metric

Proof:

How can we compute ${\cal D}({\bf A},{\bf B})$? For $1 \leq i \leq m$, let ${\bf A}_i$ be the ith prefix of ${\bf A}$. That is ${\bf A}_i = a_1a_2\dots a_i$. Similarly, ${\bf B}_j = b_1b_2\dots b_j$ for $1 \leq j \leq n$. Let $D_{i,j} = {\cal D}({\bf A}_i,{\bf B}_j)$. We define ${\bf A}_0$ and ${\bf B}_0$ to be the empty string. It is clear that $D_{i,j} =
\max\{i,j\}$ if i or j are 0. That is, Di,0 = i and D0,j = j. Suppose that we know Dh,k for all $0 \leq h \leq
i$ and $0 \leq k \leq j$, but not for (i,j). Can we compute Di,j?

In fact, if i > 0 and j > 0, then

\begin{displaymath}
D_{i,j} = \min\{D_{i-1,j}+1 , D_{i,j-1}+1, D_{i-1,j-1}+1-\delta(a_i,b_j)\}
\end{displaymath}

Proof:

Thus

\begin{displaymath}
D_{i,j} \leq \min\{D_{i-1,j}+1 , D_{i,j-1}+1, D_{i-1,j-1}+1-\delta(a_i,b_j)\}
\end{displaymath}

The equality follows from the fact that at least 1 of the following must occur in transforming ${\bf A}_i$ to ${\bf B}_{j}$

Working in term of ``edit distance'' is cumbersome. It is easier to work with alignments. If ${\cal A}$ is the sequence alphabet, let $d:
{\cal A}' \times {\cal A}' \rightarrow [0,\infty)$, where ${\bf A}'$ is ${\cal A} \cup \{-\}$. That is, we add the dash character to the alphabet. We call d the ``(edit) cost function''. Assume that $a, b
\in {\cal A}$. We assume that d(a,a) = 0. d will usually be symmetric, but it does not have to be. If it is not symmetric, then ${\cal D}$ will not be a metric.

At this point, we will change the original definition of alignment by dropping the first condition that aih=bjh for all matched pairs. The d function will make that distinction for us.

Let $(i_1,j_1), (i_2,j_2), (i_3,j_3), \dots, (i_k,j_k)$ be an alignment between ${\bf A}$ and ${\bf B}$. Put this alignment into ``alignment format'':

\begin{displaymath}
\begin{array}{lccccc}
{\bf A}' = & a'_1 & a'_2 & a'_3 & \dot...
...\
{\bf B}' = & b'_1 & b'_2 & b'_3 & \dots b'_{k'}
\end{array}\end{displaymath}

The a'is and b'js are either residues or gap characters. Deleting the gaps from ${\bf A}'$ and ${\bf B}'$ gives us back ${\bf A}$ and ${\bf B}$. Then the cost of the alignment is given by $\displaystyle \sum_{h=1}^{k'} d(a'_h,b'_h)$. This also gives us a ``generalized edit distance'', and a generalized minimum edit distance. If d(a,-) = d(-,b) = 1 and d(a,b) = 1 for $a \neq b$, then we have the old definition. ${\cal D}({\bf A},{\bf B})$ is the minimum edit distance between ${\bf A}$ and ${\bf B}$, which is the same as the minimum alignment cost. What is Di,j now? We'll give a formula and proof thinking about alignments, not edits.
Clearly, $D_{i,0} = \sum_{h=1}^{i} d(a_h,-)$ and $D_{0,j} = \sum_{h=1}^{j} d(-,b_h)$. Suppose that we know Dh,k for all $0 \leq h \leq
i$ and $0 \leq k \leq j$, but not for (i,j). Then:

\begin{displaymath}\begin{array}{cl}
D_{i,j} = \min\{ & D_{i-1,j}+d(a_i,-) , D_{i,j-1}+d(-,b_j), \\
& D_{i-1,j-1}+d(a_i,b_j)\} \end{array}\end{displaymath}

This recursion is called the ``fill algorithm''.

Proof: An alignment between ${\bf A}_i$ and ${\bf B}_{j}$ can be created by

This establishes the $\leq$ part. To establish equality, consider an optimal alignment of ${\bf A}_i$ and ${\bf B}_{j}$. The last 2 characters are either ai/-, -/bj or ai/bj. This yields equality for cases 1, 2 or 3, respectively.

EXAMPLE TTATGGACTT versus CTTGGCTAGG. d = 2 and 1 for all substitutions and indels, respectively.

    C T T G G C T A G G
  0 1 2 3 4 5 6 7 8 9 10
T 1 2 $\cdot$1 $\cdot$2 3 4 5 $\cdot$6 7 8 9
T 2 3 $\cdot$2 $\cdot$1 2 3 4 $\cdot$5 6 7 8
A 3 4 3 2 3 4 5 6 $\cdot$5 6 7
T 4 5 $\cdot$4 $\cdot$3 4 5 6 $\cdot$5 6 7 8
G 5 6 5 4 $\cdot$3 $\cdot$4 5 6 7 $\cdot$6 $\cdot$7
G 6 7 6 5 $\cdot$4 $\cdot$3 4 5 6 $\cdot$7 $\cdot$6
A 7 8 7 6 5 4 5 6 $\cdot$5 6 7
C 8 $\cdot$7 8 7 6 5 $\cdot$4 5 6 7 8
T 9 8 $\cdot$7 $\cdot$8 7 6 5 $\cdot$4 5 6 7
T 10 9 $\cdot$8 $\cdot$7 8 7 6 $\cdot$5 6 7 8

The D matrix after the fill algorithm is executed. The dots, $\cdot$, indicate positions where ai = bj.

    C T T G G C T A G G
  0 1 2 3 4 5 6 7 8 9 10
T 1                    
T 2                    
A 3                    
T 4                    
G 5                    
G 6                    
A 7                    
C 8                    
T 9                    
T 10                    

The D matrix before filling. The 0th row and column have been defined.

The value of Di,j can be determined if Dh,k is known for $0 \leq h \leq
i$, $0 \leq k \leq j$, $(h,k) \neq (i,j)$. There are different ``fill orders'' that can be used.

Column fill :

  for (j=1; j<=m; j++){
     for (i=1; i <=n; i++){
        Define D[i,j];
     }
  }

Row fill :

  for (i=1; i<=n; i++){
     for (j=1; j <=m; j++){
        Define D[i,j];
     }
  }

The set

\begin{displaymath}
{\rm codia}_d = \{(i,j) \: \vert \: 1 \leq i \leq m, \: 1 \leq j \leq n, \: j+i=d\}
\end{displaymath}

is called the ``dth co-diagonal''. d varies from 2 to m+n. If $(i,j) \in {\rm dia}_d$, then Di,j can be computed using values of Dh,k for (h,k) in ${\rm codia}_{d'}$, d' < d.

Diagonal fill :

   for (d=2; d<=m+n; d++){
      for (i=min{1,d-n}; i<=min{m,d-1}; i++){
         j = d - i;
         Define D[i,j];
      }
   }

Note that with the current algorithm, we only need to keep the present and previous column (row) to execute the fill algorithm in column (row) fill order. For the diagonal fill, we need the 2 previous co-diagonals.

TRACEBACK : After the ``fill algorithm'' is executed, an alignment can be constructed using the ``traceback algorithm''. The alignment is constructed from right to left; that is, backwards. This is because we start with Dm,n, and determine which of the 3 cases resulted in the value of Dm,n. This process continues. A traceback algorithm is explained below and a slight variant is illustrated in Figure 2.

1.
Set $i=m, \: j=n$. The alignment is empty. Proceed to 2.
2.
Test for:
(a)
Di,j = Di-1,j + d(ai,-). Proceed to 3.
(b)
Di,j = Di,j-1 + d(-,bj). Proceed to 4.
(c)
Di,j = Di-1,j-1 + d(ai,bj). Proceed to 5.
3.
Set i=i-1. Add ai/- to the alignment. Proceed to 6.
4.
Set j=j-1. Add -/bj to the alignment. Proceed to 6.
5.
Set $i=i-1, \: j=j-1$. Add ai/bj to the alignment. Proceed to 6.

6.
Test for i=0 or j=0:
(a)
$i>0, \: j>0$. Proceed to 2
(b)
$i=0, \: j>0$. Add $--{\dots}-/b_1b_2{\dots}b_j$ to the alignment. Stop.
(c)
$i>0, \: j=0$. Add $a_1a_2{\dots}a_i/--{\dots}-$ to the alignment. Stop.
(d)
$i=0, \: j=0$. Stop.


  
Figure 2: In this variant of the traceback algorithm, only the matched pairs, ai/bj are listed in the alignment. The gaps are inferred.
\begin{figure}
\epsfig{file=Tback1.ps,width=0.8\textwidth}
\end{figure}

    C T T G G C T A G G
  0 1 2 3 4 5 6 7 8 9 10
T 1 $\uparrow$2 1 2 3 4 5 6 7 8 9
T 2 3 $\nwarrow$2 1 2 3 4 5 6 7 8
A 3 4 $\uparrow$3 2 3 4 5 6 5 6 7
T 4 5 4 $\nwarrow$3 4 5 6 5 6 7 8
G 5 6 5 4 $\nwarrow$3 4 5 6 7 6 7
G 6 7 6 5 4 $\nwarrow$3 4 5 6 7 6
A 7 8 7 6 5 $\uparrow$4 5 6 5 6 7
C 8 7 8 7 6 5 $\nwarrow$4 5 6 7 8
T 9 8 7 8 7 6 5 $\nwarrow$4 $\leftarrow$5 $\leftarrow$6 $\leftarrow$7
T 10 9 8 7 8 7 6 5 6 7 $\uparrow$8

The execution of the traceback is illustrated above. The symbols: ``$\uparrow$'', ``$\leftarrow$'' and ``$\nwarrow$'' mean:

1.
Move up (Set i=i-1),
2.
Move left (Set j=j-1) and
3.
Move up and left. (Set i=i-1 and j=j-1), respectively.

The resulting alignment is:

-TTATGGACT---T
  | ||| ||
C-T-TGG-CTAGG-

GAP PENALTIES :

We have already introduced the notion of a gap as a series of insertions or deletions in one sequence with respect to another in the context of an alignment. The ``size'' of a gap is the number of indels. Our definition of a gap allows for consecutive indels in one or both sequences. In the alignment above, -/C and T/- are consecutive entries that contain indels. We call this a single gap. Similarly, the four last columns: -/A, -/G, -/G and T/- constitute 1 gap, not 2. (This is not standard usage.)

The cost of a gap often depends only on its size, s. A function $g:\{0,1,2,\dots\} \rightarrow [0,\infty)$, which gives the cost g(s) to a gap of size s, is called a ``gap cost'' or ``gap penalty'' function. g(0) is always assumed to be 0. In our general definition, d(a,-) = d(-,a) can take on different values for each letter of the (DNA or amino acid) alphabet. If the value of d(a,-) is the same $\forall a$, the a gap of size s has cost $\beta s$, where $\beta = d(a,-)$. A gap penalty function of the form $g(s)=\alpha+\beta s$ is called an ``affine gap penalty function'', where $\alpha$ and $\beta$ are non-negative constants.

Alignment cost with gap cost functions.

In this case, d(ai,bj) still gives the cost of substituting ai with bj, but d(ai,-) and d(-,bj) are no longer used. The cost of an alignment cannot be determined by adding up one summand for each column. If there are $\Delta_s$ gaps of size s in an alignment, then the cost of an alignment between ${\bf A}$ and ${\bf B}$ is

\begin{displaymath}
\sum_{h=1}^{k}d(a_{i_h},b_{j_h}) + \sum_{s=1}^{\infty} \Delta_s g(s)
\end{displaymath}

At this time, we add 2 ``fake matches'' to every alignment to take care of the end gaps. This means that we do not need to list the end gaps as special cases. We will assume that every alignment (not the ``local'' alignments that will be considered later) begins with (i0,j0) = (0,0) and ends with (ik+1,jk+1) = (m+1,n+1). We assume that these ``fake matches'' have 0 weight. Note that a gap occurs whenever ih-ih-1+jh-jh-1>2, and the gap size is ih-ih-1+jh-jh-1-2, for $1 \leq h \leq
k+1$. Thus:

\begin{displaymath}
\sum_{h=1}^{k+1}g(i_h-i_{h-1}+j_h-j_{h-1}-2) = \sum_{s=1}^{\infty}
\Delta_s g(s).
\end{displaymath}

There are algorithms to compute minimum cost alignments for such gap functions, but we will not consider them. Instead, we will deal with these gap functions when we look at ``maximum similarity'' alignment.

PATH GRAPHS

Alignments can be visualized in so-called ``path graphs'' that are related to dot plots. Alignments are constructed by eye using dot plots. Path graphs allow one to define the set of all possible alignments in a formal, yet intuitive way.

One type of path graph is what I call the ``truck route'' path graph. I use this term because the path is connected. It cannot hop from one location to another. Truck route path graphs formalize the representation that shows matches and gaps explicitly.

An m+1 x n+1 square grid of points is constructed. The grid can described as $\{(i,j) \; \vert \; 0 \leq i \leq m, 0 \leq j \leq m\}$, where i refers to row, and j to column. Two distinct points (i,j) and (i',j') have an edge between them if and only if $\max \{
\vert i-i'\vert, \vert j-j'\vert \} = 1$.

Any collection of connected edges from (0,0) to (m+1,n+1) that moves from each node (i-1,j-1) to either (i,j-1), (i,j) or to (i-1,j), describes an alignment.

When a diagonal edge, $(i-1,j-1) \mapsto (i,j)$ is used, we say that a(i) is matched with b(j).

When a vertical edge, $(i-1,j-1) \mapsto (i,j-1)$ is used, we say that a(i) is inserted with respect to sequence ${\bf b}$, or that a(i) is in a gap.

Similarly, when a horizontal edge, $(i-1,j-1) \mapsto (i-1,j)$ is used, we say that b(j) is inserted with respect to sequence ${\bf
a}$, or that b(j) is in a gap.

Figure 3 illustrates an alignment together with its path graph. These kinds of path graphs are the most useful ones for looking at multiple sequence alignments. They are not good for describing alignments with affine gap penalties.


  
Figure 3: An alignment of 2 DNA sequences together with its ``truck route'' path graph. Every column in the alignment corresponds to an edge in the graph.
\begin{figure}\epsfig{file=path-graph-1.ps,width=0.8\textwidth}\end{figure}


  
Figure 4: An alignment of 2 DNA sequences together with its ``airline route'' path graph. Matched pairs of residues correspond to vertices in path graph. Gaps are not shown explicitly, but are inferred when the path jumps more than 1 unit in either direction. The yellow edges are part of the ``traceback path'' that will be discussed later.
\begin{figure}\epsfig{file=path-graph-2.ps,width=0.8\textwidth}\end{figure}

In contrast, the ``airline route'' path graph uses an m x n grid, and matched pairs correspond to vertices, not edges. A path consists of a series of vertices, as defined above. Gaps are inferred when either il+1 - il > 1 or jl+1 - jl > 1. The alignments that have been introduced in the course up to now are best visualized using ``airline route'' path graphs.

Consider the m x n array of integer pairs: ${\cal G}_{m,n} =
\{(i,j) \: \vert \: 0 \leq i \leq m, \: 0 \leq j \leq n\}$. Consider all line segments joining (i,j) to (i-1,j), (i,j) to (i,j-1) and (i,j) to $(i-1,j-1), \: \forall \: (i,j) \in {\cal
G}_{m,n}$. (We exclude negative indices.) A ``path'' from (0,0) to (m,n) is a sequence of these connected line segments, joining distinct pairs $(0,0)=(i_0,j_0), (i_1,j_1), (i_2,j_2), (i_3,j_3),
\dots (i_k,j_k)= (m,n)$, where $0 \leq i_h - i_{h-1} \leq 1$ and $0 \leq j_h - j_{h-1} \leq 1,
\: \forall \: 0<h\leq k$. There is a 1:1 correspondence between paths and alignments, with the segment:

SIMILAR (DNA) SEQUENCES

If 2 (DNA) sequences are known to be fairly similar, say 90% or better and n and m are large (say 10,000 or larger) and ``roughly the same size'', then much time (and space) can be saved in alignment.

Diagonal band fill

The set

\begin{displaymath}
{\rm dia}_d = \{(i,j) \: \vert \: 1 \leq i \leq m, \: 1 \leq j \leq n, \: j-i=d\}
\end{displaymath}

is called the ``dth diagonal''. d varies from 1-m to n-1. If ${\bf A}$ and ${\bf B}$ are roughly of the same length and are fairly similar, then a minimum cost alignment will lie ``close to'' the 0th diagonal, ${\rm dia}_0$. This means that for column fill, it suffices to consider only values of i that satisfy $\vert j-i\vert \leq bw$ when filling the jth column, where bw (band width) is a constant. In practice, bw might be 100 when m and n are many thousands. The values of Di,j just outside the boundaries are assumed to be $+\infty$ (ie., a large number). There is a risk of computing an alignment that is not minimum cost. In practice, bw can be increased and new alignments can be computed until the minimum cost no longer changes. When m and n are not roughly equal, the bounds for i in the jth column can be computed by demanding that $\vert i-[\frac{jm}{n}]\vert \leq bw$.

MAXIMUM SIMILARITY ALIGNMENT

As before, ${\cal A}$ is the sequence alphabet, and $w:
{\cal A} \times {\cal A} \rightarrow {\bf R}$. We call w the ``weight function'' that gives a measure of similarity. For example, $w(a,b)=\delta(a,b)$ would be a natural and simple weight function. So would $w(a,b)=2\delta(a,b)-1$. We assume that w is symmetric.

We can think of w as the inverse of d. However, gap costs are treated the same as before. The definition of alignment remains the same. This time, we try to maximize the similarity of an alignment, so that the gap penalties (gap costs) are subtracted, not added. If there are $\Delta_s$ gaps of size s in an alignment, then the cost of an alignment between ${\bf A}$ and ${\bf B}$ is

\begin{displaymath}
\sum_{h=1}^{k}w(a_{i_h},b_{j_h}) - \sum_{s=1}^{\infty} \Delta_s g(s)
\end{displaymath}

If $g(s) = {\beta}s$, then we can define $w(a,-) = w(-,b) = -\beta$ for all letters a and b. In this case, the alignment similarity is $\displaystyle \sum_{h=1}^{k} w(a'_h,b'_h)$, where a' and b' are characters in ${\cal A}' = {\cal A} \cup \{-\}$. In this case, we define $\displaystyle {\cal S}({\bf A},{\bf B})$ to be the maximum similarity of all alignments between ${\bf A}$ and ${\bf B}$.

Sufficient condition for ${\cal D}({\bf A},{\bf B})$ to be equivalent to ${\cal S}({\bf A},{\bf B})$.

Suppose that w and d are score functions for sequence alignment by similarity maximization and distance minimization, respectively. Let the gap penalty for distance minimization be given by some function g. We wish to construct a maximum similarity alignment.

Suppose, as well, that w(a,b) + d(a,b) = C for some constant C, and that $g(s) \geq \frac{Cs}{2}$. Then distance minimization alignment is equivalent to similarity maximizing alignment.

Proof: For any alignment, there are k matched pairs and a total of m+n-2k elements in gaps, including possible end gaps.

Let $\Delta_l$ be the number of gaps of size l. Then

\begin{displaymath}
m+n-2k = \sum_{l=1}^{\infty} l \Delta_{l}.
\end{displaymath}

Thus


\begin{displaymath}
\begin{array}{cl}
& \displaystyle \sum_{h=1}^{k} d(a_{i_h},b...
... \sum_{l=1}^{\infty} (g(l) -Cl/2)\Delta_l \right ].
\end{array}\end{displaymath}

Hence, ${\cal D}({\bf A},{\bf B}) = C(m+n)/2 - {\cal S}({\bf A},{\bf
B})$ if we take g(s) - Cs/2 to be the similarity gap penalty (non-negative).

Example The familiar case where $d(a',b') = 1 - \delta(a',b')$ gives what is called the ``Levenstein distance'' between 2 sequences. In this case, g(s) = s. If we let $w(a,b)=\delta(a,b)$, which seems ``natural'', then C=1, so g'(s) = s/2 as a gap penalty for maximum similarity alignment would give identical results.

For a general gap function, g, how can we compute ${\cal S}({\bf A},{\bf B})$ and a maximum similarity alignment?

Let Si,j be the maximum similarity of an alignment between ${\bf A}_i$ and ${\bf B}_{j}$. Note that S0,j = -g(j) and that Si,0 = -g(i). What sort of recursion can be used in this case? The answer is given below.


\begin{displaymath}
\begin{array}{ll}
S_{i,j} = \max \left\{ \right. &
S_{i-1,j-...
...0 \end{array} }
\left\{S_{i-l,j-k}-g(k+l) \right\}.
\end{array}\end{displaymath}

Proof : An optimal alignment either ends with a match between ai and bj, or with an end gap. The Si-1,j-1+w(ai,bj) term adds the weight of an ai/bj match to an optimal alignment between ${\bf A}_{i-1}$ and ${\bf B}_{j-1}$. This either extends a match or closes an end gap. The second term considers all possible ``last matched pair'' cases.

While this formula is straightforward, computation time is ${\cal
O}(m^{2}n^{2})$, since one must backtrack for each (i,j) pair. We can do better when g has a special form.

Case 1. $g(s) = {\beta}s$, where $\beta$ is a positive constant. This is the ``linear gap penalty'' case.


\begin{displaymath}
S_{i,j} = \max \left \{ S_{i-1,j-1}+w(a_i,b_j),
S_{i-1,j}- \beta, S_{i,j-1} - \beta \right \} .
\end{displaymath}

Case 2. $g(s)=\alpha+\beta s$, where $\alpha$ and $\beta$ are non-negative constants. (Both are not 0.) This is the ``affine gap penalty'' case. The linear gap penalty recursion breaks down because when we add an unpaired ai to an optimal alignment between ${\bf A}_{i-1}$ and ${\bf B}_{j}$ or bj to an optimal alignment between ${\bf A}_i$ and ${\bf B}_{j-1}$ we don't know if we are creating a new gap, with a cost of $\alpha + \beta$, or just extending an existing one, with a cost of only $\beta$. This can be solved in a number of ways. They all involve the use of at least one extra array that must be filled.

``Zuker's method'': Let S1i,j be the maximum similarity between ${\bf A}_i$ and ${\bf B}_{j}$ that ends with a gap. That is, ai and bj are not matched with each other. Let S2i,j be the maximum similarity between ${\bf A}_i$ and ${\bf B}_{j}$ that ends with an ai/bj match. Clearly

\begin{displaymath}
{\cal S}({\bf A},{\bf B}) = \max \{ S^{1}_{m,n} , S^{2}_{m,n} \}
\end{displaymath}

Initial conditions.

Recursions:


\begin{displaymath}
\begin{array}{lcll}
S^{1}_{i,j} & = & \max \{ & S^{1}_{i-1,j...
...& S^{1}_{i-1,j-1}, S^{2}_{i-1,j-1} \} + w(a_i,b_j).
\end{array}\end{displaymath}

Proof : We consider cases. For S1i,j:

For S2i,j, ai is paired with bj. ai-1 and bj-1 are either paired with each other, or they are not.

Alignment examples.

 AACAGGCAGGCCATT--AGGGAG--CTACGG-GTTGTGCCGACCGTCCAAT---AATGGTCGGCCTCTC
 |: |:|  :|:|||:  || :||  || ||: :||:||  ||   |||:|    |||  |:::| |      
 AGAAAG--AGTCATCTGAGCAAGGACT-CGATATTATG--GA---TCCGA-AGAAAT--TTAACGT---

In this example, identity is indicated by |, transitions ( $A{\leftrightarrow}G$, $C{\leftrightarrow}T$) by :, and transversions ( $A{\leftrightarrow}C$, $A{\leftrightarrow}T$, $G{\leftrightarrow}C$, $G{\leftrightarrow}T$) by blanks, respectively.

Example: ${\bf A}$ = AGGCTACGG and ${\bf B}$ = AGGGACTCGAT. w = 10, 2 and -5 for identity, transitions and transvertions, respectively.

Optimal alignment:

 AGG--CTACGG-
 |||  || ||: 
 AGGGACT-CGAT

S1:

    A G G G A C T C G A T
  $-\infty$ -11 -12 -13 -14 -15 -16 -17 -18 -19 -20 -21
A -11 -12 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10
G -12 -1 -2 9 8 7 6 5 4 3 2 1
G -13 -2 9 8 19 18 17 16 15 14 13 12
C -14 -3 8 19 18 17 16 17 16 15 14 13
T -15 -4 7 18 17 16 17 16 27 26 25 24
A -16 -5 6 17 16 15 16 27 26 25 24 25
C -17 -6 5 16 15 16 15 26 25 26 25 24
G -18 -7 4 15 14 15 26 25 26 25 36 35
G -19 -8 3 14 15 14 25 24 25 36 35 38

S2:

    A G G G A C T C G A T
  0 $-\infty$ $-\infty$ $-\infty$ $-\infty$ $-\infty$ $-\infty$ $-\infty$ $-\infty$ $-\infty$ $-\infty$ $-\infty$
A $-\infty$ 10 -9 -10 -11 -4 -20 -21 -22 -16 -9 -25
G $-\infty$ -9 20 9 8 -1 -9 -10 -11 3 -6 -14
G $-\infty$ -10 9 30 19 10 2 1 0 14 5 -3
C $-\infty$ -18 -7 4 25 14 28 19 26 10 9 15
T $-\infty$ -19 -8 3 14 20 19 38 21 21 10 24
A $-\infty$ -5 -2 9 20 27 15 14 33 29 36 20
C $-\infty$ -21 -10 1 12 15 37 18 37 28 24 38
G $-\infty$ -15 4 15 26 17 11 32 21 47 30 20
G $-\infty$ -16 3 14 25 28 12 21 27 36 49 31

Up to now, we have been dealing with what is called ``global alignment''. When dealing with edit distance, we must transform one sequence into another. As we dealt with minimum cost alignment and maximum similariy alignment, we always kept the notion that we are dealing with all of ${\bf A}$ and ${\bf B}$. In particular, we penalize ``end gaps''. We will now consider another class of alignment problem.

Local sequence alignment :

If $1 \leq i_1 \leq i_2 \leq m$ and $1 \leq j_1 \leq j_2 \leq n$, let ${\bf A}_{i_1,i_2} = a_{i_1}a_{i_{1}+1}{\dots}a_{i_2}$ and let ${\bf B}_{j_1,j_2} = b_{j_1}b_{j_{1}+1}{\dots}b_{j_2}$. Let

\begin{displaymath}
{\cal H}({\bf A},{\bf B}) = \max \left \{ 0, \right.
\begin{...
...cal S}({\bf A}_{i_1,i_2},{\bf B}_{j_1,j_2}) \left. \right \} .
\end{displaymath}

For a general gap function, g, how can we compute ${\cal H}({\bf
A},{\bf B})$ and a maximum similarity alignment? Such an alignment is called a ``local alignment''.

Let Hi,j be the maximum similarity of an alignment between ${\bf A}_i$ and ${\bf B}_{j}$ that includes either the (ai,bj) match or else counts an end gap, or is empty.

Then

\begin{displaymath}
{\cal H}({\bf A},{\bf B}) =
\begin{array}{c} \max \\
i,j
\end{array} \{ H_{i,j} \}.
\end{displaymath}

Note that Hi,0 = H0,j = 0, since these values can only be $\max \{0,-g(i)\}$ or $\max \{0,-g(j)\}$. The recursion for i>0 and j>0 is:


\begin{displaymath}
\begin{array}{ll}
H_{i,j} = \max \left\{0, \right. &
H_{i-1,...
...0 \end{array} }
\left\{H_{i-l,j-k}-g(k+l) \right\}.
\end{array}\end{displaymath}

Proof : An optimal alignment either ends with a match between ai and bj, or with an end gap. The Hi-1,j-1+w(ai,bj) term adds the weight of an ai/bj match to an optimal local alignment between ${\bf A}_{i-1}$ and ${\bf B}_{j-1}$. This either extends a match or closes an end gap. The second term considers all possible ``last matched pair'' cases. If the last 2 terms are negative, then the 0 option (no alignment) can be used.

While this formula is straightforward, computation time is ${\cal
O}(m^{2}n^{2})$, since one must backtrack for each (i,j) pair. As with S, we can do better when g has a special form. In these forms, the algorithm is named ``Smith-Waterman'' after Temple Smith and Michael Waterman who first published it.

Case 1. $g(s) = {\beta}s$, where $\beta$ is a positive constant. This is the ``linear gap penalty'' case.


\begin{displaymath}
\begin{array}{ll}
H_{i,j} = \max \left \{0, \right. & H_{i-1...
...ft. H_{i-1,j}- \beta, H_{i,j-1} - \beta \right \} .
\end{array}\end{displaymath}

Case 2. Affine gap penalty. $g(s)=\alpha+\beta s$.

``Zuker's method'': Let H1i,j be the maximum similarity between ${\bf A}_i$ and ${\bf B}_{j}$ that ends with a gap. That is, ai and bj are not matched with each other. Let H2i,j be the maximum similarity between ${\bf A}_i$ and ${\bf B}_{j}$ that ends with an ai/bj match. It is understood that either H1i,j or H2i,j will be 0 (no alignment) if there is no alignment with a non-negative score.

Clearly

\begin{displaymath}
{\cal H}({\bf A},{\bf B}) =
\begin{array}{c}
\max \\ i,j
\end{array}\{ H^{2}_{i,j} \}
\end{displaymath}

The initial conditions are simple. H1i,0 = H10,j = H2i,0 = H20,j = 0.

Recursions:


\begin{displaymath}
\begin{array}{lcll}
H^{1}_{i,j} & = & \max \{0, & H^{1}_{i-1...
...i-1,j-1}+w(a_i,b_j), H^{2}_{i-1,j-1}+w(a_i,b_j) \}.
\end{array}\end{displaymath}

Proof : We consider cases. For H1i,j:

For H2i,j, if there is an alignment with positive score, then ai is paired with bj. ai-1 and bj-1 are either paired with each other, or they are not.

Alignment examples. Identity, transition and transversion weights are 1, 0 and -1, respectively. $\alpha = \beta = 1$.
${\bf A}$ = TCTTCTCCAAGGCGTTAACT and
${\bf B}$ = AACTTCGTTTGAGGCTTCTT

        10   
CTTC-TCCAAGGC
|||| |:::||||
CTTCGTTTGAGGC
      10

H1
    A A C T T C G T T T G A G G C T T C T T
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
T 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0
C 0 0 0 0 0 1 0 2 1 0 0 0 0 0 0 0 0 1 0 2 2
T 0 0 0 0 0 0 2 1 1 1 0 1 0 0 0 0 0 0 2 1 3
C 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 3 2
C 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 2 3
A 0 0 0 0 0 0 1 0 1 1 1 0 1 0 0 0 0 0 1 1 2
A 0 0 0 0 0 0 0 1 0 0 0 1 0 2 1 0 0 0 0 0 1
G 0 0 0 0 0 0 0 0 0 0 0 0 2 1 3 2 1 0 0 0 0
G 0 0 0 0 0 0 0 0 0 0 0 0 1 3 2 4 3 2 1 0 0
C 0 0 0 0 0 0 0 0 0 0 0 0 0 2 4 3 5 4 3 2 1
G 0 0 0 0 0 0 0 0 0 0 0 0 0 1 3 5 4 4 3 2 1
T 0 0 0 0 0 0 0 0 0 1 0 0 0 0 2 4 4 4 5 4 3
T 0 0 0 0 0 0 0 0 1 0 2 1 0 0 1 3 4 5 5 5 4
A 0 0 0 0 0 0 0 0 0 2 1 1 0 0 0 2 3 5 5 4 4
A 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 2 4 4 4 3
C 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 1 3 3 3 3
T 0 0 0 1 0 2 1 0 0 0 0 0 0 0 0 0 0 2 3 2 4

H2
    A A C T T C G T T T G A G G C T T C T T
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T 0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 0 1 1 0 1 1
C 0 0 0 1 0 1 2 0 0 1 1 0 0 0 0 1 0 1 2 0 1
T 0 0 0 0 2 1 1 1 1 1 2 0 0 0 0 0 2 1 1 3 1
T 0 0 0 0 1 3 1 0 2 2 2 1 0 0 0 0 1 3 1 2 4
C 0 0 0 1 0 1 4 0 0 2 2 1 0 0 0 1 0 1 4 1 2
T 0 0 0 0 2 1 1 3 3 2 3 1 0 0 0 0 2 1 1 5 3
C 0 0 0 1 0 2 2 1 3 3 2 2 0 0 0 1 0 2 2 2 5
C 0 0 0 1 1 0 3 1 1 3 3 1 1 0 0 1 1 0 3 2 3
A 0 1 1 0 0 0 0 3 0 0 2 3 2 1 0 0 0 0 0 2 1
A 0 1 2 0 0 0 0 1 2 0 0 2 4 2 1 0 0 0 0 0 1
G 0 0 1 1 0 0 0 1 0 1 0 1 2 5 3 0 0 0 0 0 0
G 0 0 0 0 0 0 0 1 0 0 0 1 1 3 6 2 1 0 0 0 0
C 0 0 0 1 0 0 1 0 1 0 0 0 0 0 2 7 4 3 3 1 0
G 0 0 0 0 0 0 0 2 0 0 0 1 0 1 3 3 6 4 3 2 1
T 0 0 0 0 1 1 0 0 3 1 1 0 0 0 0 3 6 7 4 4 3
T 0 0 0 0 1 2 1 0 1 4 2 0 0 0 0 2 5 7 7 6 5
A 0 1 1 0 0 0 1 1 0 0 3 2 2 0 0 0 2 4 6 6 5
A 0 1 2 0 0 0 0 1 0 0 1 3 3 2 0 0 1 2 4 5 5
C 0 0 0 3 0 0 1 0 1 0 1 0 2 2 1 1 1 2 5 4 5
T 0 0 0 0 4 2 0 0 1 2 1 0 0 1 1 1 2 2 3 6 5




The Sankoff algorithm

Given a non-negative integer, k, compute an optimal global alignment between A and B that has at most k gaps. There are no gap penalties.

Let ${\bf V}_k(i,j)$ be the maximal score of any alignment on (the prefixes)
${\bf A}_i = a_1, a_2, a_3, \dots, a_i$ and
${\bf B}_j
= b_1, b_2, b_3, \dots, b_j$
that includes the (i,j) pair and that has at most k gaps. Similarly, ${\bf W}_k(i,j)$ is the maximal score of any alignment that may or may not include the (i,j) pair and that has at most k gaps. End gaps are counted in both cases.

There are no gap penalties, so the following boundary conditions are appropriate:

Recursion step:

For i > 0 and j > 0, For $k \geq 0$,

 
$\displaystyle {\bf V}_k(i,j) =$ $\textstyle w(a(i),b(j)) + \max \{ {\bf V}_k(i-1,j-1), {\bf W}_{k-1}(i-1,j-1) \}$    


 
$\displaystyle {\bf W}_k(i,j) = \max \{ {\bf V}_k(i,j), {\bf W}_{k-1}(i-1,j), {\bf W}_{k-1}(i,j-1) \}$      

Note that

\begin{displaymath}\begin{array}{lcl}
{\bf V}_0(i,i) & = & {\bf W}_0(i,i) = \sum...
...f W}_0(i,j) = - \infty \quad {\rm for} \: i \neq j.
\end{array}\end{displaymath}

Examples: global and local alignment

Alignment of 2 random DNA sequences of length 100. The match scores are 10, 1 and -5 for matches, transitions and transversions, respectively. The gap penalties are given by $\alpha = 0$ and $\beta =
5$. A global alignment has score 437, and is given by:

                   10           20           30        40          50        60 
G---CG-----CT-ACGC--G-TGA-T-ATTGGG-TT-G-GCGAGTGTTACGAGACCTCTC-A-TCAGAACCTCCCGCGG
|   ||     || ||||  | ||| | |: ||| || | :||:|| ||| |||  ||| | | ||||  | ||| ||||
GTCCCGCAAAACTAACGCAAGATGATTAACGGGGATTAGAACGGGTCTTA-GAG--CTCACTAGTCAG--CGTCCGGCGG
        10        20        30        40        50           60          70
       70        80        90       100                                         
CCTAGGGCGGATACGGATAGACGGATCAGGCGATTGGTG
||    |: :|| | || || : | | |  | | |:|| 
CC----GT-AAT-CTGA-AG-T-G-T-A-CCTA-TAGT-
        80            90           100                                          

An optimal local alignment achieves a score of 460.

              10           20           30        40          50        60      
GC---GCT-ACGC--G-TGA-T-ATTGGG-TT-G-GCGAGTGTTACGAGACCTCTC-A-TCAGAACCTCCCGCGGCC-TA
||   :|| ||||  | ||| | |: ||| || | :||:|| ||| |||  ||| | | ||||  | ||| |||||| ||
GCAAAACTAACGCAAGATGATTAACGGGGATTAGAACGGGTCTTA-GAG--CTCACTAGTCAG--CGTCCGGCGGCCGTA
   10        20        30        40        50           60          70        80

   70        80         90                                                      
GGGCGGATACGGATAG-ACGGATCAG
 : | || |  | | | ||  || ||
-ATCTGA-A--G-T-GTACCTAT-AG
              90

Note that the local alignment is not just a truncation at both ends of the global alignment. Changing $\alpha$ from 0 to 20 changes the optimal global and local alignments to eliminate many gaps.

Global: score = 154
               10        20        30        40        50          60        70 
GCGCTAC-------GCGTGATATTGGGTTGGCGAGTGTTACGAGACCTCTCATCAGAACCTCCCG--CGGCCTAGGGCGG
|: |::|       :||::| ||  |:||::||:| :||| :| :  ||   |:|||:|:: |::  |:|| |  |||||
GTCCCGCAAAACTAACGCAAGAT--GATTAACGGGGATTAGAACGGGTC---TTAGAGCTCACTAGTCAGCGTCCGGCGG
        10        20          30        40           50        60        70     

       80        90       100                                                   
ATACGGATAGACGGATCAGGCGATTGGTG
 ::::: : || | :|   :|   |:|| 
CCGTAATCTGAAGTGT---ACCTATAGT-
   80        90          100                                                    

Local: score = 273
   10              20          30        40        50        60        70       
ACGCGTGAT-ATT-----GGGTTGG--CGAGTGTTACGAGACCTCTCATCAGAACCTCCCGCGGCCTAGGGCGGATACGG
||||: ||| |||     ||:||:|  ||:|| ||| ||| :| || :||||  | ||| ||||||  :: | || ::| 
ACGCAAGATGATTAACGGGGATTAGAACGGGTCTTA-GAGCTCACTAGTCAG--CGTCCGGCGGCCGTAATCTGAAGTGT
    20        30        40        50         60          70        80        90 

 80                                                                            
ATAGACGG
|:  |::|
ACCTATAG

Generalized alignment

Let ${\cal P}$ be a locally finite partially ordered set.
${\cal P}$ is actually a dual, $({\cal P}, \preceq)$, where ${\cal P}$ is a non-empty set consisting of elements {pi}, and where $\preceq:
{\cal P}\times {\cal P}\rightarrow \{{\rm true,false}\}$. For p1, p2 and $p_3 \in {\cal P}$, we write $p_1 \preceq p_2$ if, and only if $\preceq(p_1,p_2)$ is true. ``$\preceq$'' is called a partial order. If $p_1 \preceq p_2$, then we define the ``closed interval''
$[p_1,p_2] = \{ p_i \in {\cal P}\; \arrowvert \; p_1 \preceq p_i \preceq p_2 \}$.
By definition, the ``open interval'', (p1,p2), excludes p1 and p2. The following properties hold:

We define $\arrowvert [p_1,p_2] \arrowvert$ to be the number of elements in (p1,p2) and $\Arrowvert [p_1,p_2] \Arrowvert$ to be the size of a maximal chain in (p1,p2).

When $p_1 \preceq p_2$, we say that p1 is less than or equal to p2 in the partial order.

If neither $p_1 \preceq p_2$ nor $p_2
\preceq p_1$, then we say that p1 and p2 are not comperable in the partial order.

Finally, we write $p_1 \prec p_2$ to mean $p_1 \preceq p_2$ and $p_1 \neq p_2$. In this case, we say that p1 is less than p2 in the partial order.

A ``chain'' is a totally ordered subset of ${\cal P}$. That is, any 2 elements of a chain are comparable. The elements of a chain can be assumed to be distinct, and any finite chain can be written uniquely as $p_1 \prec p_2 \prec p_3 \dots \prec p_2$. We say that ``p2 covers p1'' if $p_1 \prec p_2$ and [p1,p2] = {p1,p2} (equivalent to $(p_1,p_2) =
\emptyset$). We write this as $p_1 \vartriangleleft p_2$

An example of a locally finite partially ordered set is the set of all finite subsets of the integers, where $\preceq$ is $\subseteq$.

The general setting for alignment is a triplet,

\begin{displaymath}
({\cal P},\preceq_1,\preceq_2),
\end{displaymath}

where $({\cal P},\preceq_1)$ and $({\cal P},\preceq_2)$ are locally finite sets and $p_1 \preceq_1 p_2
\Longrightarrow p_1 \preceq_2 p_2$. We say that ``$\preceq_1$ contains $\preceq_2$'' or that $\preceq_1$ and $\preceq_2$ are the ``weak partial order'' and the ``strong partial order'', respectively. There are 2 real valued functions,
$w:{\cal P}\rightarrow {\bf R}$ and
$g:{\cal P}\times {\cal P}\rightarrow {\bf R^{+}}$.

If p1 is not $\preceq_1 p_2$, then g(p1,p2) = 0. The functions w and g are called the ``weight'' and ``gap penalty'' functions, respectively. The intervals [p1,p2]1 and [p1,p2]2 refer to the 2 partial orders, $\preceq_1$ and $\preceq_2$, respectively.

Let ${\bf P} = \{p_1 \prec_1 p_2 \prec_1 p_3 \dots \prec_1
p_{k-1} \prec_1 p_k \}$ be a chain of length k with respect to $\preceq_1$. Each pair pi, pi+1 is called a link (the ith link) in the chain. If pi is not $\prec_2 p_{i+1}$, or if $p_i
\prec_2 p_{i+1}$, but $(p_i,p_{i+1})_2 \neq \emptyset$, we say that the ith link is ``weak''.

Then the ``weight'' of ${\bf P}$, $S({\bf P})$, is defined by

 \begin{displaymath}
S({\bf P}) = \sum_{i = 1}^{k} w(p_i) - \sum_{i = 1}^{k-1} g(p_i,p_{i+1}). \nonumber
\end{displaymath}  

Although not strictly necessary, we assume that g is 0 for ``strong'' (ie. not weak) links where pi+1 covers pi. Thus the second sum in equation [*] can be restricted to the weak links. The weight of an empty chain is defined to be 0.

PROBLEM 1: ``GLOBAL''
Given an interval [pa,pb]1, compute a chain containing pa and pb with an optimal weight; that is, with a weight as large as possible.

PROBLEM 2: ``LOCAL''
(
SMITH-WATERMAN)
Given an interval [pa,pb]1, compute a chain with an optimal weight. (An empty chain has a weight of 0 by definition).

It is possible to construct solutions to problems 1 and 2 in time ${\cal O}(\arrowvert (p_a,p_b)_1 \arrowvert^{2})$.

The solutions are achieved via recursions.

Problem 1: Let ${\bf S}(p_a,p_i)$ be the maximal score of any chain in [pa,pi]1 that includes pa and pi.

Clearly, ${\bf S}(p_a,p_a) = w(p_a)$.

Recursion step:

For $p_a \prec_1 p_i$,


 
$\displaystyle {\bf S}(p_a,p_i) = w(p_i) \; +$ $\textstyle \max$ $\displaystyle \{ {\bf S}(p_a,p) - g(p,p_i) \}$  
  $\textstyle p_a \preceq_1 p \prec_1 p_i$    

Why is this recursion correct? Because [pa,pb]1 is finite, all the elements pi can be arranged into groups, g1, g2, etc., where g0 = {pa} and $g_k = \{ p \in [p_a,p_b]_1
\arrowvert \; p \; {\rm covers \; some \; element \; of} \; g_{k-1}$ for k > 0. For some k, $g_{k'} = \emptyset \; \forall \; k' \geq
k$. Thus, equation 1 can be computed successively for $g_1,
g_2, \dots$.

The recursion in equation 1 is called the ``fill algorithm''. It fills an (abstract) array of numbers with optimal values. No chains are computed. After the fill algorithm is executed, all we know are optimal chain lengths for all intervals of the form $[p_a,p]_1, \; p_a \preceq_1 p \preceq_1 p_b$. In particular, ${\bf
S}(p_a,p_b)$ is the optimal score.

The construction of an optimal chain is achieved with a ``traceback algorithm''. It constructs a chain in reverse order and works as follows:

1.
BEGIN Let i = 1 and set p1 = pb.
2.
Find $p \in [p_a,p_i]_1$ such that ${\bf S}(p_a,p_i) = {\bf S}(p_a,p) - g(p,p_i) + w(p_i)$.
Such a p must exist because of equation 1.
3.
Set i = i+1 and let pi = p.
4.
If pi = pa, continue with 5. Otherwise, continue with 2.
5.
The chain $p_i, p_{i-1}, \dots, p_1$ is a solution.

Problem 2: Let ${\bf H}(p_a,p_i)$ be the maximal score of any chain in [pa,pi]1 that includes pi if it is non-empty.

Clearly, ${\bf H}(p_a,p_a) = \max \{0, w(p_a) \}$. This is because we don't have to settle for a negative score if w is always < 0. We can choose the empty chain whose score is 0 by definition.

Recursion step:

For $p_a \prec_1 p_i$,


 
$\displaystyle {\bf H}(p_a,p_i) =$ $\textstyle \max$ $\displaystyle \{0, {\bf H}(p_a,p) - g(p,p_i) + w(p_i) \}$  
  $\textstyle p_a \preceq_1 p \prec_1 p_i$    

This recursion can be computed in the same order as the one in equation 1. That is, first for elements that cover pa, then for elements that cover those elements, and so on.

What is new here is the 0. It means that for any pi, we can reject a chain ending at pi if its score is < 0. A chain is free to begin anywhere. ${\bf H}(p_a,p_b)$ is not necessarily the score of an optimal chain. Instead,

$\displaystyle {\bf H}_{max} =$ $\textstyle \max$ $\displaystyle \{ {\bf H}(p_a,p) \}$  
  $\textstyle p_a \preceq_1 p \preceq_1 p_b$    

is the score of an optimal chain. ${\bf H}_{max}$ can be computed recursively during the execution of the fill algorithm.

The traceback algorithm is slightly different.

1.
BEGIN Let i = 1 and choose p1 such that ${\bf
H}_{max} = {\bf H}(p_a,p_1)$.
2.
Find $p \in [p_a,p_i]_1$ such that
${\bf H}(p_a,p_i) = {\bf H}(p_a,p) - g(p,p_i) + w(p_i)$.
Such a p must exist because of equation 1.
3.
Set i = i+1 and let pi = p.
4.
If pi = pa, or if ${\bf H}(p_a,p_i) = 0$ continue with 5. Otherwise, continue with 2.
5.
The chain $p_i, p_{i-1}, \dots, p_1$ is a solution.

The recursions in equations 1 and 1 look back into the interval [pa,pi]1 for all $p_i \in [p_a,p_b]_1$. This yields an overall time performance of the fill algorithms as ${\cal O}(\arrowvert (p_a,p_b)_1 \arrowvert^{2})$.

Examples

1. The comparison of 2 DNA, RNA or aa (amino acid, or protein) sequences.

${\bf A} = a_1a_2a_3 \dots a_m$ and ${\bf B} = b_1b_2b_3 \dots
b_n$ are 2 biomolecular sequences of the same type.

${\cal P} = {\bf Z} \times {\bf Z}$. An element of this set is an ordered pair of integers, (i,j). In practice, we take pa = (0,0) and pb = (m+1,n+1) for Problem 1, and pa = (1,1) and pb = (m,n) for Problems 2 and 3. The convention adopted here is to write a 2D grid with (1,1) on the upper-left and (m,n) on the lower right. Usage varies. Weak links are called ``gaps'', or ``indels'', which is an abbreviation for ``insertions or deletions''.

$\preceq_1$ : $ (i_1,j_1) \preceq_1 (i_2,j_2) \Longleftrightarrow
i_1 \leq i_2 \; {\rm and} \; j_1 \leq j_2$

$\preceq_2$ : $ (i_1,j_1) \preceq_2 (i_2,j_2) \Longleftrightarrow
i_2 - i_1 = j_2 - j_1 \geq 0$

2. The comparison of a DNA (or RNA) sequence with a protein sequence.

${\bf A} = a_1a_2a_3 \dots a_m$ is a DNA (RNA) sequence and ${\bf B} = b_1b_2b_3 \dots
b_n$ is an amino acid (protein) sequence.

As above ${\cal P} = {\bf Z} \times {\bf Z}$. Also pa = (0,0) and pb = (m+1,n+1) for Problem 1, and pa = (1,1) and pb = (m,n) for Problem 2.

$\preceq_1$ : $ (i_1,j_1) \preceq_1 (i_2,j_2) \Longleftrightarrow
i_1 \leq i_2+3 \; {\rm and} \; j_1 \leq j_2$

$\preceq_2$ : $ (i_1,j_1) \preceq_2 (i_2,j_2) \Longleftrightarrow
i_2 - i_1 = 3(j_2 - j_1) \geq 0$

A chain is a set of distinct ordered pairs
$(i_1,j_1), (i_2,j_2), (i_3,j_3), \dots, (i_k,j_k)$
satisfying $i_1 \leq i_2+3 \leq i_3+3 \leq
\dots \leq i_k$ and
$j_1 < j_2 < j_3 < \dots <
j_k$.

For example 1, gap penalties are usually of the form g(s), where s is the ``size'' of the gap, which is i2 - i1 + j2 - j1 -2. An affine gap penalty is of the form $g(s) = \alpha + \beta \times s$, except that g(0) = 0.

The score function, w is defined as follows. The sequence elements are from some alphabet, ${\cal A}$. There is a ``weight'', or ``scoring'' function, $w:
{\cal A} \times {\cal A} \rightarrow {\bf R}$. Then w(i,j) = w(ai,b(j)) (slight ``abuse of notation''). Note that w(ai,b(j)) is sometimes written as w(i,j) or wi,j, which is a slight abuse of notation.

For example 2, gap penalties can be of the form g(s1,s2), where s1+s2 is the ``size'' of the gap. s1 = i2 - i1 - 1, which is the number of inserted DNA characters, and s2 = j2 - j1 - 1, which is the number of inserted amino acid characters. An affine gap penalty is of the form $g(s) = \alpha + \beta_D \times s_1 + \beta_P
\times s_2 $, with the understanding that g(0) = 0. That is, we penalize DNA and amino acid insertions/deletions differently.

For example 1, the score function, w is defined as follows. The sequence elements are from some alphabet, ${\cal A}$. There is a ``weight'', or ``scoring'' function, $w:
{\cal A} \times {\cal A} \rightarrow {\bf R}$. Then w(i,j) = w(ai,bj) (slight ``abuse of notation''). Note that w(ai,bj) is sometimes written as w(i,j) or wi,j, which is a slight abuse of notation.

For example 2, the score function, w is defined as follows. The sequence elements are from 2 alphabets, ${\cal A}$ and ${\cal
B}$. There is a ``weight'', or ``scoring'' function, $w : {\cal A}
\times {\cal B} \rightarrow {\bf R}$. Then w(i,j) = w(ai,ai+1,ai+2,bj) (slight ``abuse of notation''). Note that w(ai,ai+1,ai+2,bj) is sometimes written as w(i,j) or wi,j, which is a slight abuse of notation. That is when we compare ai with bj, we use the triplet aiai+1ai+2 to translate the DNA into an amino acid. This means that w(m-1,j) and w(m,j) are not defined, since they refer to DNA characters past the end of the sequence. These numbers can either be defined to be $+\infty$ (i.e. do not allow these matches) or they can be defined knowing that the DNA sequence really extends beyond m.

For global sequence alignment, pa = (0,0) and pb = (m+1,n+1). This is an artificial construct so that the alignment does not have to begin with a (1,1) match or end with a (m,n) match, and yet a penalty is charged for these ``end gaps'', should they occur. We take w(0,0) = w(m+1,n+1) = 0.

ExampleThe DNA sequence for the mat gene of coliphage MS2 is aligned with the mat protein of coliphage GA. $\alpha = 10$ and $\beta = 4$. Weight matrix is PAM128 (which will be defined later in the course).

Optimal local alignment:

    360                        380                        400           
AAC GTT GCG AAC CGG GCG TCG ACC GAA GTC CTG CAA AAG GTC ACC CAG GGT AAT 
 |   :           |   |   |       |   |   |       |   :   :   |   :   |  
Asn Ile Asp Gln Arg Ala Ser Val Glu Val Leu Asn Lys Leu Ser Gln Ser Asn 
            120                                     130                 

            420                        440                        460   
TTT AAC CTT GGT GTT GCT TTA GCA GAG GCC AGG TCG ACA GCC TCA CAA CTC GCG 
 :   |   :   |   |   |   :   |   |   |   :       |   |   |       |   |  
Leu Asn Ile Gly Val Ala Ile Ala Glu Ala Lys Met Thr Ala Ser Leu Leu Ala 
                    140                                     150         

                    480                        500                      
ACG CAA ACC ATT GCG CTC GTG AAG GCG TAC ACT GCC GCT CGT CGC GGT AAT TGG 
     |   :   |   |   |   :   :   |   |   |   |   |   :   |   |   |   |  
Lys Gln Ser Ile Ala Leu Ile Arg Ala Tyr Thr Ala Ala Lys Arg Gly Asn Trp 
                            160                                     170 

  520                       540                            560          
CGC CAG GCG CTC CGC TAC CTT GCC CTA AAC GAA GAT CGA --- AAG TTT CGA TCA 
 |   :       |           |       :   :   |       |       :           :  
Arg Glu Val Leu Ser Gln Leu Leu Ile Ser Glu His Arg Phe Arg Ala Pro Ala 
                                    180                                 

              580                       600                        620  
AAA CAC GTG GCC GGC AGG TGG TTG GAG TTG CAG TTC GGT TGG TTA CCA CTA ATG 
 |       :   :   |   |   |   |   |   |   |   :   |   |   |   |   |   |  
Lys Asp Leu Gly Gly Arg Trp Leu Glu Leu Gln Tyr Gly Trp Leu Pro Leu Met 
    190                                     200                         

                      640                       660                     
AGT GAT ATC CAG GGT GCA TAT GAG ATG CTT ACG AAG GTT CAC CTT CAA GAG TTT 
 |   |   :       :   |   |   :   :   |   |               |           |  
Ser Asp Leu Lys Ala Ala Tyr Asp Leu Leu Thr Gln Thr Lys Leu Pro Ala Phe 
            210                                     220
   680                        700                       720             
CTT CCT ATG AGA GCC GTA CGT CAG GTC GGT ACT AAC ATC AAG TTA GAT GGC CGT 
 :   |   :   |           |       |   |               :               |  
Met Pro Leu Arg Val Thr Arg Thr Val Gly Gly Thr His Asn Tyr Lys Val Arg 
                    230                                     240         

           740                            760                           
CTG TCG TAT CCA GCT G --- CAA ACT TCC AGA CAA CGT G CAA CAT ATC GCG --- 
             :   |         :   |               |     :   :   :   :      
Asn Val Glu Ser Ala - Gly Asp Thr Trp Ser Tyr Arg - His Arg Leu Ser Val 
                              250                                       

  780                        800                        820           
ACG TAT CGT G ATA TGG TTT TAC ATA AAC GAT GCA CGT TTG GCA TGG TTG TCG 
     |   |     |   |   :   :   |   :   |   :   |   |   |   |       |  
Asn Tyr Arg - Ile Trp Tyr Phe Ile Ser Asp Pro Arg Leu Ala Trp Ala Ser 
260                                       270                         

            840                        860                        880   
TCT CTA GGT ATC TTG AAC CCA CTA GGT ATA GTG TGG GAA AAG GTG CCT TTC TCA 
 |   |   |   :   |   |   |   |       |       |   |   |       |       |  
Ser Leu Gly Leu Leu Asn Pro Leu Glu Ile Tyr Trp Glu Lys Thr Pro Trp Ser 
            280                                     290                 

                    900                        920                      
TTC GTT GTC GAC TGG CTC CTA CCT GTA GGT AAC ATG CTC GAG GGC CTT ACG GCC 
 |   |   |   |   |   :   |   |   |   |   |   :   :   |   :   :   :      
Phe Val Val Asp Trp Phe Leu Pro Val Gly Asn Leu Ile Glu Ala Met Ser Asn 
                    300                                     310         

  940                       960                         980           
CCC GTG GGA TGC TCC TAC ATG TCA GGA ACA GTT ACT GAC GTA A TAA CGG GTG 
 |   :   |               :   |   |   |       |                 :   :  
Pro Leu Gly Leu Asp Ile Ile Ser Gly Thr Lys Thr --- --- - Trp Gln Leu 
                            320                                       

           1000                       1020                       1040   
AGT CCA TCA TAA GCG TTG ACG CTC CCT ACG GGT GGA CTG TGG AGA GAC AGG GCA 
     :                   |   |   |   :   :   |       |               |  
Glu Ser Lys Leu Asn Ala Thr Leu Pro Ala Ser Gly --- Trp Ser Gly Thr Ala 
        330                                         340                 

                   1060                        1080                     
CTG CTA AGG CCC AAA TCT CAG CCA TGC A TCG AGG GGT ACA ATC C --- GTA TGG 
     |       :       :       :             |   :   |                    
Lys Leu Thr Ala Tyr Ala Lys Ala Tyr - Asp Arg Ser Thr Phe - Tyr Ser Phe 
                    350                                         360
      1100                       1120                      1140         
CCA ACA ACT GGC GCG TAC GTA AAG TCT CCT TTC TCG ATG GTC CAT ACC TTA GAT 
 |   |           :   |   |   |   |   |   :   |       :   |           :  
Pro Thr Pro Leu Pro Tyr Val Lys Ser Pro Leu Ser Gly Leu His Leu Ala Asn 
                                370                                     

              1160                          
GCG TTA GCA TTA ATC AGG CAA CGG CTC TCT AGA 
 |   |   |   |   |       |   |   |       |  
Ala Leu Ala Leu Ile Asn Gln Arg Leu Lys Arg 
380                                     390

An algorithm for global alignment of a DNA sequence to an amino acid sequence.

Let S1i,j be the maximum similarity between ${\bf A}_i$ and ${\bf B}_{j}$ that ends with a gap. That is, ai and bj are not matched with each other. Let S2i,j be the maximum similarity between ${\bf A}_i$ and ${\bf B}_{j}$ that ends with an ai/bj match. Clearly

\begin{displaymath}
{\cal S}({\bf A},{\bf B}) = \max \{ S^{1}_{m,n} , S^{2}_{m,n} \}
\end{displaymath}

Initial conditions.

Recursions:


\begin{displaymath}
\begin{array}{lcll}
S^{1}_{i,j} & = & \max \{ & S^{1}_{i-1,j...
... S^{1}_{i-1,j-1},
S^{2}_{i-3,j-1} \} + w(a_i,b_j).
\end{array}\end{displaymath}


Michael Zuker
Professor of Mathematical Sciences
Rensselaer Polytechnic Institute
2004-01-24