MATH-4961 / MATH-6961 / CSCI-4964 / CSCI-6964

Alignment: Definitions and algorithms

FROM DOT PLOTS TO ALIGNMENT

Let ${\bf A} = a_1a_2a_3 \dots a_m$ and ${\bf B} = b_1b_2b_3 \dots
b_n$ be 2 biomolecular sequences of the same type; DNA, RNA or protein. In a dot plot, ${\bf A}$ is placed from top to bottom in the 0th column of an m+1 x n+1 matrix. ${\bf B}$ is placed from left to right in the 0th row of the same matrix. ``Dots'' plotted in row i and column j indicate at least that ai = bj. If the dot plot uses a ``window'' and ``stringency'' method with w > 1, or a k-word method with k > 1, then not all matches are plotted in general.

Another way to write down the information in a dot plot is to write the two sequences from left to right, one above the other, and to use line segments to join ai to bj whenever there is a dot in row i and column j. The dot plot in Figure 5 would yield the image in figure 1. This can be regarded as a (bipartite) graph, with the two sequence elements acting as the vertices. We can call this a ``dot plot graph''.


  
Figure 1: The ``Alignment'' corresponding to the dot plot in Figure 5 (dot plot notes).
\begin{figure}
\epsfig{file=images/A_B_2align.ps,width=0.2\textwidth}
\end{figure}

The graph corresponding to the dot plot in Figure 6 would be very complicated. The idea of ``alignment'' is to take a subset of this graph, yielding an ``alignment graph'', or what is normally called an ``alignment of two sequences''. There are 2 requirements that must be satisfied:

1.
Each vertex has at most one edge.
2.
No 2 edges cross each other.

The graph in Figure 1 violates both of these conditions. Note that both solid and dotted lines have been used in this Figure. Taking only the solid lines yields a valid alignment. The same is true for the dotted lines. Note that these 2 valid alignments are not the only ones that can be achieved by deleting edges. The whole theory of sequence alignment is to decide what edges to use.

This leads us to the first definition of sequence alignment. This definition will evolve as the course progresses.

DEFINITION: An alignment between sequences ${\bf A}$ and ${\bf B}$ is a set of integer pairs, $\{(i_h,j_h)\}_{h=1}^{k}$ that satisfy:

1.
aih = bjh for $1 \leq h \leq k$, (this condition is temporary and will be dropped later)
2.
ih < ih+1 for $1 \leq h < k$ and
3.
jh < jh+1 for $1 \leq h < k$.

Residues aih and bih are said to be ``aligned'' or ``matched'' with each other. Later on, we will drop the requirement that matched pairs must be equal.

We now change the look of the alignment graph by agreeing to write the 2 sequences so that the edges are always vertical lines. This is equivalent to saying that matched residues are always ``directly over one another'', or ``in the same column''. In order to accomplish this, gaps must be left in 1 or both sequences, sometimes at the ends. These gaps are usually written explicitly, using the symbol ``-''. For example, an alignment of
${\bf A} =$ AATAACATTCACGCGTCTGC and
${\bf B} =$ AATATTGACGCATTCTGCAT appears below.

AATAACATTCACGC-GTCTGC--
|||   ||| ||||  |||||  
AAT---ATTGACGCATTCTGCAT

Written like this, we can describe the alignment in evolutionary terms. We might imagine that ${\bf A}$ and ${\bf B}$ are the descendants of a common ancestor, ${\bf U}$. The sequence ${\bf U}$ changed into ${\bf A}$ and ${\bf B}$ through a series of ``primitive evolutionary events''. We define these to be:

1.
Substitution: One residue changes to another.
2.
Insertion: A residue is added to a sequence.
3.
Deletion: A residue is deleted from a sequence.

Sequence elements that do not change are called ``conserved''. In an alignment, matched pairs of residues are inferred to be conserved. This may not be true, since the same change may occur in both.

When ai appears above bj and $a_i \neq b_j$, we know that a change has occurred. It is possible that more than 1 change has occurred.

When ai appears above a gap, we know that either ai has been inserted into the original sequence and appears in ${\bf A}$, or that ai was in ${\bf U}$ and was deleted from ${\bf B}$. Similarly when a gap appears above bj, we know that either bj has been inserted into the original sequence and appears in ${\bf B}$, or that bj was in ${\bf U}$ and was deleted from ${\bf A}$.

Thus a gap in an alignment indicates that either an insertion or a deletion has occurred. We cannot deduce when a residue is either inserted or deleted from both sequences. The inference of an ancestral sequence is always error prone, and is approached using statistical methods. A series of consecutive ``-''s is usually called a (single) gap. The number of dashes is referred to as the size, or length of the gap. Dashes at the end of a sequence are called ``end gaps''. Because of the insertion/deletion ambiguity, gaps are often called ``indels''.

SEQUENCE EDITING

Consider the process of turning ${\bf A}$ into ${\bf B}$ using a series of substitutions, insertions and deletions. We call this ``editing'' ${\bf A}$ into ${\bf B}$.We start by writing ${\bf A}$ from left to right without any gaps. We place a copy of ${\bf A}$ below it. We make a single change in this copy. If a substitution is made to ai, then ai is changed in the copy. If ai is deleted, then this character is replaced by a ``-'' in the copy. If an insertion occurs after ai, then this element is added to the copy and a ``-'' is added at this position in the original sequence. We continue in this manner, getting k+1 rows for k edit operations. Every time an element is inserted, a gap character is added in the same position all the way up to the top. Eliminating all middle rows gives us an alignment of ${\bf A}$ with ${\bf B}$.

Conversely, an alignment is a record of editing. Matched residues are not edited. ai over bj means that bj is substituted for ai. ai over a gap means that ai is deleted. A gap over bj indicates that bj has been inserted.

MINIMUM EDIT DISTANCE

It is clear that we can edit ${\bf A}$ to ${\bf B}$ using at most m+n edit operations. That is, we can delete all the residues of ${\bf A}$, one by one, and then add all the residues of ${\bf B}$, one by one. In fact, we could do this in at most $\max\{m,n\}$ operations, using $\min\{m,n\}$ substitutions followed by $\max\{m,n\}-\min\{m,n\}$ insertions or deletions. This is the best that could be done to turn ACACACACACA into GTGGTGTTGTGTGT.

ACACACACACA---

GTGGTGTTGTGTGT

In general, we can do better.

Let ${\cal D}({\bf A},{\bf B})$ be the minimum number of edits required to turn ${\bf A}$ into ${\bf B}$.

Proposition: ${\cal D}$ is a metric

Proof:

How can we compute ${\cal D}({\bf A},{\bf B})$? For $1 \leq i \leq m$, let ${\bf A}_i$ be the ith prefix of ${\bf A}$. That is ${\bf A}_i = a_1a_2\dots a_i$. Similarly, ${\bf B}_j = b_1b_2\dots b_j$ for $1 \leq j \leq n$. Let $D_{i,j} = {\cal D}({\bf A}_i,{\bf B}_j)$. We define ${\bf A}_0$ and ${\bf B}_0$ to be the empty string. It is clear that $D_{i,j} =
\max\{i,j\}$ if i or j are 0. That is, Di,0 = i and D0,j = j. Suppose that we know Dh,k for all $0 \leq h \leq
i$ and $0 \leq k \leq j$, but not for (i,j). Can we compute Di,j?

In fact, if i > 0 and j > 0, then

\begin{displaymath}
D_{i,j} = \min\{D_{i-1,j}+1 , D_{i,j-1}+1, D_{i-1,j-1}+1-\delta(a_i,b_j)\}
\end{displaymath}

Proof:

Thus

\begin{displaymath}
D_{i,j} \leq \min\{D_{i-1,j}+1 , D_{i,j-1}+1, D_{i-1,j-1}+1-\delta(a_i,b_j)\}
\end{displaymath}

The equality follows from the fact that at least 1 of the following must occur in transforming ${\bf A}_i$ to ${\bf B}_{j}$

Working in term of ``edit distance'' is cumbersome. It is easier to work with alignments. If ${\cal A}$ is the sequence alphabet, let $d:
{\cal A}' \times {\cal A}' \rightarrow [0,\infty)$, where ${\bf A}'$ is ${\cal A} \cup \{-\}$. That is, we add the dash character to the alphabet. We call d the ``(edit) cost function''. Assume that $a, b
\in {\cal A}$. We assume that d(a,a) = 0. d will usually be symmetric, but it does not have to be. If it is not symmetric, then ${\cal D}$ will not be a metric.

At this point, we will change the original definition of alignment by dropping the first condition that aih=bjh for all matched pairs. The d function will make that distinction for us.

Let $(i_1,j_1), (i_2,j_2), (i_3,j_3), \dots, (i_k,j_k)$ be an alignment between ${\bf A}$ and ${\bf B}$. Put this alignment into ``alignment format'':

\begin{displaymath}
\begin{array}{lccccc}
{\bf A}' = & a'_1 & a'_2 & a'_3 & \dot...
...\
{\bf B}' = & b'_1 & b'_2 & b'_3 & \dots b'_{k'}
\end{array}\end{displaymath}

The a'is and b'js are either residues or gap characters. Deleting the gaps from ${\bf A}'$ and ${\bf B}'$ gives us back ${\bf A}$ and ${\bf B}$. Then the cost of the alignment is given by $\displaystyle \sum_{h=1}^{k'} d(a'_h,b'_h)$. This also gives us a ``generalized edit distance'', and a generalized minimum edit distance. If d(a,-) = d(-,b) = 1 and d(a,b) = 1 for $a \neq b$, then we have the old definition. ${\cal D}({\bf A},{\bf B})$ is the minimum edit distance between ${\bf A}$ and ${\bf B}$, which is the same as the minimum alignment cost. What is Di,j now? We'll give a formula and proof thinking about alignments, not edits.
Clearly, $D_{i,0} = \sum_{h=1}^{i} d(a_h,-)$ and $D_{0,j} = \sum_{h=1}^{j} d(-,b_h)$. Suppose that we know Dh,k for all $0 \leq h \leq
i$ and $0 \leq k \leq j$, but not for (i,j). Then:

\begin{displaymath}\begin{array}{cl}
D_{i,j} = \min\{ & D_{i-1,j}+d(a_i,-) , D_{i,j-1}+d(-,b_j), \\
& D_{i-1,j-1}+d(a_i,b_j)\} \end{array}\end{displaymath}

This recursion is called the ``fill algorithm''.

Proof: An alignment between ${\bf A}_i$ and ${\bf B}_{j}$ can be created by

This establishes the $\leq$ part. To establish equality, consider an optimal alignment of ${\bf A}_i$ and ${\bf B}_{j}$. The last 2 characters are either ai/-, -/bj or ai/bj. This yields equality for cases 1, 2 or 3, respectively.

EXAMPLE TTATGGACTT versus CTTGGCTAGG. d = 2 and 1 for all substitutions and indels, respectively.

    C T T G G C T A G G
  0 1 2 3 4 5 6 7 8 9 10
T 1 2 $\cdot$1 $\cdot$2 3 4 5 $\cdot$6 7 8 9
T 2 3 $\cdot$2 $\cdot$1 2 3 4 $\cdot$5 6 7 8
A 3 4 3 2 3 4 5 6 $\cdot$5 6 7
T 4 5 $\cdot$4 $\cdot$3 4 5 6 $\cdot$5 6 7 8
G 5 6 5 4 $\cdot$3 $\cdot$4 5 6 7 $\cdot$6 $\cdot$7
G 6 7 6 5 $\cdot$4 $\cdot$3 4 5 6 $\cdot$7 $\cdot$6
A 7 8 7 6 5 4 5 6 $\cdot$5 6 7
C 8 $\cdot$7 8 7 6 5 $\cdot$4 5 6 7 8
T 9 8 $\cdot$7 $\cdot$8 7 6 5 $\cdot$4 5 6 7
T 10 9 $\cdot$8 $\cdot$7 8 7 6 $\cdot$5 6 7 8

The D matrix after the fill algorithm is executed. The dots, $\cdot$, indicate positions where ai = bj.

    C T T G G C T A G G
  0 1 2 3 4 5 6 7 8 9 10
T 1                    
T 2                    
A 3                    
T 4                    
G 5                    
G 6                    
A 7                    
C 8                    
T 9                    
T 10                    

The D matrix before filling. The 0th row and column have been defined.

The value of Di,j can be determined if Dh,k is known for $0 \leq h \leq
i$, $0 \leq k \leq j$, $(h,k) \neq (i,j)$. There are different ``fill orders'' that can be used.

Column fill :

  for (j=1; j<=m; j++){
     for (i=1; i <=n; i++){
        Define D[i,j];
     }
  }

Row fill :

  for (i=1; i<=n; i++){
     for (j=1; j <=m; j++){
        Define D[i,j];
     }
  }

The set

\begin{displaymath}
{\rm codia}_d = \{(i,j) \: \vert \: 1 \leq i \leq m, \: 1 \leq j \leq n, \: j+i=d\}
\end{displaymath}

is called the ``dth co-diagonal''. d varies from 2 to m+n. If $(i,j) \in {\rm dia}_d$, then Di,j can be computed using values of Dh,k for (h,k) in ${\rm codia}_{d'}$, d' < d.

Diagonal fill :

   for (d=2; d<=m+n; d++){
      for (i=min{1,d-n}; i<=min{m,d-1}; i++){
         j = d - i;
         Define D[i,j];
      }
   }

Note that with the current algorithm, we only need to keep the present and previous column (row) to execute the fill algorithm in column (row) fill order. For the diagonal fill, we need the 2 previous co-diagonals.

TRACEBACK : After the ``fill algorithm'' is executed, an alignment can be constructed using the ``traceback algorithm''. The alignment is constructed from right to left; that is, backwards. This is because we start with Dm,n, and determine which of the 3 cases resulted in the value of Dm,n. This process continues. A traceback algorithm is explained below and a slight variant is illustrated in Figure 2.

1.
Set $i=m, \: j=n$. The alignment is empty. Proceed to 2.
2.
Test for:
(a)
Di,j = Di-1,j + d(ai,-). Proceed to 3.
(b)
Di,j = Di,j-1 + d(-,bj). Proceed to 4.
(c)
Di,j = Di-1,j-1 + d(ai,bj). Proceed to 5.
3.
Set i=i-1. Add ai/- to the alignment. Proceed to 6.
4.
Set j=j-1. Add -/bj to the alignment. Proceed to 6.
5.
Set $i=i-1, \: j=j-1$. Add ai/bj to the alignment. Proceed to 6.

6.
Test for i=0 or j=0:
(a)
$i>0, \: j>0$. Proceed to 2
(b)
$i=0, \: j>0$. Add $--{\dots}-/b_1b_2{\dots}b_j$ to the alignment. Stop.
(c)
$i>0, \: j=0$. Add $a_1a_2{\dots}a_i/--{\dots}-$ to the alignment. Stop.
(d)
$i=0, \: j=0$. Stop.


  
Figure 2: In this variant of the traceback algorithm, only the matched pairs, ai/bj are listed in the alignment. The gaps are inferred.
\begin{figure}
\epsfig{file=Tback1.ps,width=0.8\textwidth}
\end{figure}

    C T T G G C T A G G
  0 1 2 3 4 5 6 7 8 9 10
T 1 $\uparrow$2 1 2 3 4 5 6 7 8 9
T 2 3 $\nwarrow$2 1 2 3 4 5 6 7 8
A 3 4 $\uparrow$3 2 3 4 5 6 5 6 7
T 4 5 4 $\nwarrow$3 4 5 6 5 6 7 8
G 5 6 5 4 $\nwarrow$3 4 5 6 7 6 7
G 6 7 6 5 4 $\nwarrow$3 4 5 6 7 6
A 7 8 7 6 5 $\uparrow$4 5 6 5 6 7
C 8 7 8 7 6 5 $\nwarrow$4 5 6 7 8
T 9 8 7 8 7 6 5 $\nwarrow$4 $\leftarrow$5 $\leftarrow$6 $\leftarrow$7
T 10 9 8 7 8 7 6 5 6 7 $\uparrow$8

The execution of the traceback is illustrated above. The symbols: ``$\uparrow$'', ``$\leftarrow$'' and ``$\nwarrow$'' mean:

1.
Move up (Set i=i-1),
2.
Move left (Set j=j-1) and
3.
Move up and left. (Set i=i-1 and j=j-1), respectively.

The resulting alignment is:

-TTATGGACT---T
  | ||| ||
C-T-TGG-CTAGG-

GAP PENALTIES :

We have already introduced the notion of a gap as a series of insertions or deletions in one sequence with respect to another in the context of an alignment. The ``size'' of a gap is the number of indels. Our definition of a gap allows for consecutive indels in one or both sequences. In the alignment above, -/C and T/- are consecutive entries that contain indels. We call this a single gap. Similarly, the four last columns: -/A, -/G, -/G and T/- constitute 1 gap, not 2. (This is not standard usage.)

The cost of a gap often depends only on its size, s. A function $g:\{0,1,2,\dots\} \rightarrow [0,\infty)$, which gives the cost g(s) to a gap of size s, is called a ``gap cost'' or ``gap penalty'' function. g(0) is always assumed to be 0. In our general definition, d(a,-) = d(-,a) can take on different values for each letter of the (DNA or amino acid) alphabet. If the value of d(a,-) is the same $\forall a$, the a gap of size s has cost $\beta s$, where $\beta = d(a,-)$. A gap penalty function of the form $g(s)=\alpha+\beta s$ is called an ``affine gap penalty function'', where $\alpha$ and $\beta$ are non-negative constants.

Alignment cost with gap cost functions.

In this case, d(ai,bj) still gives the cost of substituting ai with bj, but d(ai,-) and d(-,bj) are no longer used. The cost of an alignment cannot be determined by adding up one summand for each column. If there are $\Delta_s$ gaps of size s in an alignment, then the cost of an alignment between ${\bf A}$ and ${\bf B}$ is

\begin{displaymath}
\sum_{h=1}^{k}d(a_{i_h},b_{j_h}) + \sum_{s=1}^{\infty} \Delta_s g(s)
\end{displaymath}

At this time, we add 2 ``fake matches'' to every alignment to take care of the end gaps. This means that we do not need to list the end gaps as special cases. We will assume that every alignment (not the ``local'' alignments that will be considered later) begins with (i0,j0) = (0,0) and ends with (ik+1,jk+1) = (m+1,n+1). We assume that these ``fake matches'' have 0 weight. Note that a gap occurs whenever ih-ih-1+jh-jh-1>2, and the gap size is ih-ih-1+jh-jh-1-2, for $1 \leq h \leq
k+1$. Thus:

\begin{displaymath}
\sum_{h=1}^{k+1}g(i_h-i_{h-1}+j_h-j_{h-1}-2) = \sum_{s=1}^{\infty}
\Delta_s g(s).
\end{displaymath}

There are algorithms to compute minimum cost alignments for such gap functions, but we will not consider them. Instead, we will deal with these gap functions when we look at ``maximum similarity'' alignment.

PATH GRAPHS

Alignments can be visualized in so-called ``path graphs'' that are related to dot plots. Alignments are constructed by eye using dot plots. Path graphs allow one to define the set of all possible alignments in a formal, yet intuitive way.

One type of path graph is what I call the ``truck route'' path graph. I use this term because the path is connected. It cannot hop from one location to another. Truck route path graphs formalize the representation that shows matches and gaps explicitly.

An m+1 x n+1 square grid of points is constructed. The grid can described as $\{(i,j) \; \vert \; 0 \leq i \leq m, 0 \leq j \leq m\}$, where i refers to row, and j to column. Two distinct points (i,j) and (i',j') have an edge between them if and only if $\max \{
\vert i-i'\vert, \vert j-j'\vert \} = 1$.

Any collection of connected edges from (0,0) to (m+1,n+1) that moves from each node (i-1,j-1) to either (i,j-1), (i,j) or to (i-1,j), describes an alignment.

When a diagonal edge, $(i-1,j-1) \mapsto (i,j)$ is used, we say that a(i) is matched with b(j).

When a vertical edge, $(i-1,j-1) \mapsto (i,j-1)$ is used, we say that a(i) is inserted with respect to sequence ${\bf b}$, or that a(i) is in a gap.

Similarly, when a horizontal edge, $(i-1,j-1) \mapsto (i-1,j)$ is used, we say that b(j) is inserted with respect to sequence ${\bf
a}$, or that b(j) is in a gap.

Figure 3 illustrates an alignment together with its path graph. These kinds of path graphs are the most useful ones for looking at multiple sequence alignments. They are not good for describing alignments with affine gap penalties.


  
Figure 3: An alignment of 2 DNA sequences together with its ``truck route'' path graph. Every column in the alignment corresponds to an edge in the graph.
\begin{figure}\epsfig{file=path-graph-1.ps,width=0.8\textwidth}\end{figure}


  
Figure 4: An alignment of 2 DNA sequences together with its ``airline route'' path graph. Matched pairs of residues correspond to vertices in path graph. Gaps are not shown explicitly, but are inferred when the path jumps more than 1 unit in either direction. The yellow edges are part of the ``traceback path'' that will be discussed later.
\begin{figure}\epsfig{file=path-graph-2.ps,width=0.8\textwidth}\end{figure}

In contrast, the ``airline route'' path graph uses an m x n grid, and matched pairs correspond to vertices, not edges. A path consists of a series of vertices, as defined above. Gaps are inferred when either il+1 - il > 1 or jl+1 - jl > 1. The alignments that have been introduced in the course up to now are best visualized using ``airline route'' path graphs.

Consider the m x n array of integer pairs: ${\cal G}_{m,n} =
\{(i,j) \: \vert \: 0 \leq i \leq m, \: 0 \leq j \leq n\}$. Consider all line segments joining (i,j) to (i-1,j), (i,j) to (i,j-1) and (i,j) to $(i-1,j-1), \: \forall \: (i,j) \in {\cal
G}_{m,n}$. (We exclude negative indices.) A ``path'' from (0,0) to (m,n) is a sequence of these connected line segments, joining distinct pairs $(0,0)=(i_0,j_0), (i_1,j_1), (i_2,j_2), (i_3,j_3),
\dots (i_k,j_k)= (m,n)$, where $0 \leq i_h - i_{h-1} \leq 1$ and $0 \leq j_h - j_{h-1} \leq 1,
\: \forall \: 0<h\leq k$. There is a 1:1 correspondence between paths and alignments, with the segment:

SIMILAR (DNA) SEQUENCES

If 2 (DNA) sequences are known to be fairly similar, say 90% or better and n and m are large (say 10,000 or larger) and ``roughly the same size'', then much time (and space) can be saved in alignment.

Diagonal band fill

The set

\begin{displaymath}
{\rm dia}_d = \{(i,j) \: \vert \: 1 \leq i \leq m, \: 1 \leq j \leq n, \: j-i=d\}
\end{displaymath}

is called the ``dth diagonal''. d varies from 1-m to n-1. If ${\bf A}$ and ${\bf B}$ are roughly of the same length and are fairly similar, then a minimum cost alignment will lie ``close to'' the 0th diagonal, ${\rm dia}_0$. This means that for column fill, it suffices to consider only values of i that satisfy $\vert j-i\vert \leq bw$ when filling the jth column, where bw (band width) is a constant. In practice, bw might be 100 when m and n are many thousands. The values of Di,j just outside the boundaries are assumed to be $+\infty$ (ie., a large number). There is a risk of computing an alignment that is not minimum cost. In practice, bw can be increased and new alignments can be computed until the minimum cost no longer changes. When m and n are not roughly equal, the bounds for i in the jth column can be computed by demanding that $\vert i-[\frac{jm}{n}]\vert \leq bw$.

MAXIMUM SIMILARITY ALIGNMENT

As before, ${\cal A}$ is the sequence alphabet, and $w:
{\cal A} \times {\cal A} \rightarrow {\bf R}$. We call w the ``weight function'' that gives a measure of similarity. For example, $w(a,b)=\delta(a,b)$ would be a natural and simple weight function. So would $w(a,b)=2\delta(a,b)-1$. We assume that w is symmetric.

We can think of w as the inverse of d. However, gap costs are treated the same as before. The definition of alignment remains the same. This time, we try to maximize the similarity of an alignment, so that the gap penalties (gap costs) are subtracted, not added. If there are $\Delta_s$ gaps of size s in an alignment, then the cost of an alignment between ${\bf A}$ and ${\bf B}$ is

\begin{displaymath}
\sum_{h=1}^{k}w(a_{i_h},b_{j_h}) - \sum_{s=1}^{\infty} \Delta_s g(s)
\end{displaymath}

If $g(s) = {\beta}s$, then we can define $w(a,-) = w(-,b) = -\beta$ for all letters a and b. In this case, the alignment similarity is $\displaystyle \sum_{h=1}^{k} w(a'_h,b'_h)$, where a' and b' are characters in ${\cal A}' = {\cal A} \cup \{-\}$. In this case, we define $\displaystyle {\cal S}({\bf A},{\bf B})$ to be the maximum similarity of all alignments between ${\bf A}$ and ${\bf B}$.

Sufficient condition for ${\cal D}({\bf A},{\bf B})$ to be equivalent to ${\cal S}({\bf A},{\bf B})$.

Suppose that w and d are score functions for sequence alignment by similarity maximization and distance minimization, respectively. Let the gap penalty for distance minimization be given by some function g. We wish to construct a maximum similarity alignment.

Suppose, as well, that w(a,b) + d(a,b) = C for some constant C, and that $g(s) \geq \frac{Cs}{2}$. Then distance minimization alignment is equivalent to similarity maximizing alignment.

Proof: For any alignment, there are k matched pairs and a total of m+n-2k elements in gaps, including possible end gaps.

Let $\Delta_l$ be the number of gaps of size l. Then

\begin{displaymath}
m+n-2k = \sum_{l=1}^{\infty} l \Delta_{l}.
\end{displaymath}

Thus


\begin{displaymath}
\begin{array}{cl}
& \displaystyle \sum_{h=1}^{k} d(a_{i_h},b...
... \sum_{l=1}^{\infty} (g(l) -Cl/2)\Delta_l \right ].
\end{array}\end{displaymath}

Hence, ${\cal D}({\bf A},{\bf B}) = C(m+n)/2 - {\cal S}({\bf A},{\bf
B})$ if we take g(s) - Cs/2 to be the similarity gap penalty (non-negative).

Example The familiar case where $d(a',b') = 1 - \delta(a',b')$ gives what is called the ``Levenstein distance'' between 2 sequences. In this case, g(s) = s. If we let $w(a,b)=\delta(a,b)$, which seems ``natural'', then C=1, so g'(s) = s/2 as a gap penalty for maximum similarity alignment would give identical results.

For a general gap function, g, how can we compute ${\cal S}({\bf A},{\bf B})$ and a maximum similarity alignment?

Let Si,j be the maximum similarity of an alignment between ${\bf A}_i$ and ${\bf B}_{j}$. Note that S0,j = -g(j) and that Si,0 = -g(i). What sort of recursion can be used in this case? The answer is given below.


\begin{displaymath}
\begin{array}{ll}
S_{i,j} = \max \left\{ \right. &
S_{i-1,j-...
...0 \end{array} }
\left\{S_{i-l,j-k}-g(k+l) \right\}.
\end{array}\end{displaymath}

Proof : An optimal alignment either ends with a match between ai and bj, or with an end gap. The Si-1,j-1+w(ai,bj) term adds the weight of an ai/bj match to an optimal alignment between ${\bf A}_{i-1}$ and ${\bf B}_{j-1}$. This either extends a match or closes an end gap. The second term considers all possible ``last matched pair'' cases.

While this formula is straightforward, computation time is ${\cal
O}(m^{2}n^{2})$, since one must backtrack for each (i,j) pair. We can do better when g has a special form.

Case 1. $g(s) = {\beta}s$, where $\beta$ is a positive constant. This is the ``linear gap penalty'' case.


\begin{displaymath}
S_{i,j} = \max \left \{ S_{i-1,j-1}+w(a_i,b_j),
S_{i-1,j}- \beta, S_{i,j-1} - \beta \right \} .
\end{displaymath}

Case 2. $g(s)=\alpha+\beta s$, where $\alpha$ and $\beta$ are non-negative constants. (Both are not 0.) This is the ``affine gap penalty'' case. The linear gap penalty recursion breaks down because when we add an unpaired ai to an optimal alignment between ${\bf A}_{i-1}$ and ${\bf B}_{j}$ or bj to an optimal alignment between ${\bf A}_i$ and ${\bf B}_{j-1}$ we don't know if we are creating a new gap, with a cost of $\alpha + \beta$, or just extending an existing one, with a cost of only $\beta$. This can be solved in a number of ways. They all involve the use of at least one extra array that must be filled.

``Zuker's method'': Let S1i,j be the maximum similarity between ${\bf A}_i$ and ${\bf B}_{j}$ that ends with a gap. That is, ai and bj are not matched with each other. Let S2i,j be the maximum similarity between ${\bf A}_i$ and ${\bf B}_{j}$ that ends with an ai/bj match. Clearly

\begin{displaymath}
{\cal S}({\bf A},{\bf B}) = \max \{ S^{1}_{m,n} , S^{2}_{m,n} \}
\end{displaymath}

Initial conditions.

Recursions:


\begin{displaymath}
\begin{array}{lcll}
S^{1}_{i,j} & = & \max \{ & S^{1}_{i-1,j...
...& S^{1}_{i-1,j-1}, S^{2}_{i-1,j-1} \} + w(a_i,b_j).
\end{array}\end{displaymath}

Proof : We consider cases. For S1i,j:

For S2i,j, ai is paired with bj. ai-1 and bj-1 are either paired with each other, or they are not.

Alignment examples.

 AACAGGCAGGCCATT--AGGGAG--CTACGG-GTTGTGCCGACCGTCCAAT---AATGGTCGGCCTCTC
 |: |:|  :|:|||:  || :||  || ||: :||:||  ||   |||:|    |||  |:::| |      
 AGAAAG--AGTCATCTGAGCAAGGACT-CGATATTATG--GA---TCCGA-AGAAAT--TTAACGT---

In this example, identity is indicated by |, transitions ( $A{\leftrightarrow}G$, $C{\leftrightarrow}T$) by :, and transversions ( $A{\leftrightarrow}C$, $A{\leftrightarrow}T$, $G{\leftrightarrow}C$, $G{\leftrightarrow}T$) by blanks, respectively.

Example: ${\bf A}$ = AGGCTACGG and ${\bf B}$ = AGGGACTCGAT. w = 10, 2 and -5 for identity, transitions and transvertions, respectively.

Optimal alignment:

 AGG--CTACGG-
 |||  || ||: 
 AGGGACT-CGAT

S1:

    A G G G A C T C G A T
  $-\infty$ -11 -12 -13 -14 -15 -16 -17 -18 -19 -20 -21
A -11 -12 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10
G -12 -1 -2 9 8 7 6 5 4 3 2 1
G -13 -2 9 8 19 18 17 16 15 14 13 12
C -14 -3 8 19 18 17 16 17 16 15 14 13
T -15 -4 7 18 17 16 17 16 27 26 25 24
A -16 -5 6 17 16 15 16 27 26 25 24 25
C -17 -6 5 16 15 16 15 26 25 26 25 24
G -18 -7 4 15 14 15 26 25 26 25 36 35
G -19 -8 3 14 15 14 25 24 25 36 35 38

S2:

    A G G G A C T C G A T
  0 $-\infty$ $-\infty$ $-\infty$ $-\infty$ $-\infty$ $-\infty$ $-\infty$ $-\infty$ $-\infty$ $-\infty$ $-\infty$
A $-\infty$ 10 -9 -10 -11 -4 -20 -21 -22 -16 -9 -25
G $-\infty$ -9 20 9 8 -1 -9 -10 -11 3 -6 -14
G $-\infty$ -10 9 30 19 10 2 1 0 14 5 -3
C $-\infty$ -18 -7 4 25 14 28 19 26 10 9 15
T $-\infty$ -19 -8 3 14 20 19 38 21 21 10 24
A $-\infty$ -5 -2 9 20 27 15 14 33 29 36 20
C $-\infty$ -21 -10 1 12 15 37 18 37 28 24 38
G $-\infty$ -15 4 15 26 17 11 32 21 47 30 20
G $-\infty$ -16 3 14 25 28 12 21 27 36 49 31

Up to now, we have been dealing with what is called ``global alignment''. When dealing with edit distance, we must transform one sequence into another. As we dealt with minimum cost alignment and maximum similariy alignment, we always kept the notion that we are dealing with all of ${\bf A}$ and ${\bf B}$. In particular, we penalize ``end gaps''. We will now consider another class of alignment problem.

Local sequence alignment :

If $1 \leq i_1 \leq i_2 \leq m$ and $1 \leq j_1 \leq j_2 \leq n$, let ${\bf A}_{i_1,i_2} = a_{i_1}a_{i_{1}+1}{\dots}a_{i_2}$ and let ${\bf B}_{j_1,j_2} = b_{j_1}b_{j_{1}+1}{\dots}b_{j_2}$. Let

\begin{displaymath}
{\cal H}({\bf A},{\bf B}) = \max \left \{ 0, \right.
\begin{...
...cal S}({\bf A}_{i_1,i_2},{\bf B}_{j_1,j_2}) \left. \right \} .
\end{displaymath}

For a general gap function, g, how can we compute ${\cal H}({\bf
A},{\bf B})$ and a maximum similarity alignment? Such an alignment is called a ``local alignment''.

Let Hi,j be the maximum similarity of an alignment between ${\bf A}_i$ and ${\bf B}_{j}$ that includes either the (ai,bj) match or else counts an end gap, or is empty.

Then

\begin{displaymath}
{\cal H}({\bf A},{\bf B}) =
\begin{array}{c} \max \\
i,j
\end{array} \{ H_{i,j} \}.
\end{displaymath}

Note that Hi,0 = H0,j = 0, since these values can only be $\max \{0,-g(i)\}$ or $\max \{0,-g(j)\}$. The recursion for i>0 and j>0 is:


\begin{displaymath}
\begin{array}{ll}
H_{i,j} = \max \left\{0, \right. &
H_{i-1,...
...0 \end{array} }
\left\{H_{i-l,j-k}-g(k+l) \right\}.
\end{array}\end{displaymath}

Proof : An optimal alignment either ends with a match between ai and bj, or with an end gap. The Hi-1,j-1+w(ai,bj) term adds the weight of an ai/bj match to an optimal local alignment between ${\bf A}_{i-1}$ and ${\bf B}_{j-1}$. This either extends a match or closes an end gap. The second term considers all possible ``last matched pair'' cases. If the last 2 terms are negative, then the 0 option (no alignment) can be used.

While this formula is straightforward, computation time is ${\cal
O}(m^{2}n^{2})$, since one must backtrack for each (i,j) pair. As with S, we can do better when g has a special form. In these forms, the algorithm is named ``Smith-Waterman'' after Temple Smith and Michael Waterman who first published it.

Case 1. $g(s) = {\beta}s$, where $\beta$ is a positive constant. This is the ``linear gap penalty'' case.


\begin{displaymath}
\begin{array}{ll}
H_{i,j} = \max \left \{0, \right. & H_{i-1...
...ft. H_{i-1,j}- \beta, H_{i,j-1} - \beta \right \} .
\end{array}\end{displaymath}

Case 2. Affine gap penalty. $g(s)=\alpha+\beta s$.

``Zuker's method'': Let H1i,j be the maximum similarity between ${\bf A}_i$ and ${\bf B}_{j}$ that ends with a gap. That is, ai and bj are not matched with each other. Let H2i,j be the maximum similarity between ${\bf A}_i$ and ${\bf B}_{j}$ that ends with an ai/bj match. It is understood that either H1i,j or H2i,j will be 0 (no alignment) if there is no alignment with a non-negative score.

Clearly

\begin{displaymath}
{\cal H}({\bf A},{\bf B}) =
\begin{array}{c}
\max \\ i,j
\end{array}\{ H^{2}_{i,j} \}
\end{displaymath}

The initial conditions are simple. H1i,0 = H10,j = H2i,0 = H20,j = 0.

Recursions:


\begin{displaymath}
\begin{array}{lcll}
H^{1}_{i,j} & = & \max \{0, & H^{1}_{i-1...
...i-1,j-1}+w(a_i,b_j), H^{2}_{i-1,j-1}+w(a_i,b_j) \}.
\end{array}\end{displaymath}

Proof : We consider cases. For H1i,j:

For H2i,j, if there is an alignment with positive score, then ai is paired with bj. ai-1 and bj-1 are either paired with each other, or they are not.

Alignment examples. Identity, transition and transversion weights are 1, 0 and -1, respectively. $\alpha = \beta = 1$.
${\bf A}$ = TCTTCTCCAAGGCGTTAACT and
${\bf B}$ = AACTTCGTTTGAGGCTTCTT

        10   
CTTC-TCCAAGGC
|||| |:::||||
CTTCGTTTGAGGC
      10

H1
    A A C T T C G T T T G A G G C T T C T T
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
T 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0
C 0 0 0 0 0 1 0 2 1 0 0 0 0 0 0 0 0 1 0 2 2
T 0 0 0 0 0 0 2 1 1 1 0 1 0 0 0 0 0 0 2 1 3
C 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 3 2
C 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 2 3
A 0 0 0 0 0 0 1 0 1 1 1 0 1 0 0 0 0 0 1 1 2
A 0 0 0 0 0 0 0 1 0 0 0 1 0 2 1 0 0 0 0 0 1
G 0 0 0 0 0 0 0 0 0 0 0 0 2 1 3 2 1 0 0 0 0
G 0 0 0 0 0 0 0 0 0 0 0 0 1 3 2 4 3 2 1 0 0
C 0 0 0 0 0 0 0 0 0 0 0 0 0 2 4 3 5 4 3 2 1
G 0 0 0 0 0 0 0 0 0 0 0 0 0 1 3 5 4 4 3 2 1
T 0 0 0 0 0 0 0 0 0 1 0 0 0 0 2 4 4 4 5 4 3
T 0 0 0 0 0 0 0 0 1 0 2 1 0 0 1 3 4 5 5 5 4
A 0 0 0 0 0 0 0 0 0 2 1 1 0 0 0 2 3 5 5 4 4
A 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 2 4 4 4 3
C 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 1 3 3 3 3
T 0 0 0 1 0 2 1 0 0 0 0 0 0 0 0 0 0 2 3 2 4

H2
    A A C T T C G T T T G A G G C T T C T T
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T 0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 0 1 1 0 1 1
C 0 0 0 1 0 1 2 0 0 1 1 0 0 0 0 1 0 1 2 0 1
T 0 0 0 0 2 1 1 1 1 1 2 0 0 0 0 0 2 1 1 3 1
T 0 0 0 0 1 3 1 0 2 2 2 1 0 0 0 0 1 3 1 2 4
C 0 0 0 1 0 1 4 0 0 2 2 1 0 0 0 1 0 1 4 1 2
T 0 0 0 0 2 1 1 3 3 2 3 1 0 0 0 0 2 1 1 5 3
C 0 0 0 1 0 2 2 1 3 3 2 2 0 0 0 1 0 2 2 2 5
C 0 0 0 1 1 0 3 1 1 3 3 1 1 0 0 1 1 0 3 2 3
A 0 1 1 0 0 0 0 3 0 0 2 3 2 1 0 0 0 0 0 2 1
A 0 1 2 0 0 0 0 1 2 0 0 2 4 2 1 0 0 0 0 0 1
G 0 0 1 1 0 0 0 1 0 1 0 1 2 5 3 0 0 0 0 0 0
G 0 0 0 0 0 0 0 1 0 0 0 1 1 3 6 2 1 0 0 0 0
C 0 0 0 1 0 0 1 0 1 0 0 0 0 0 2 7 4 3 3 1 0
G 0 0 0 0 0 0 0 2 0 0 0 1 0 1 3 3 6 4 3 2 1
T 0 0 0 0 1 1 0 0 3 1 1 0 0 0 0 3 6 7 4 4 3
T 0 0 0 0 1 2 1 0 1 4 2 0 0 0 0 2 5 7 7 6 5
A 0 1 1 0 0 0 1 1 0 0 3 2 2 0 0 0 2 4 6 6 5
A 0 1 2 0 0 0 0 1 0 0 1 3 3 2 0 0 1 2 4 5 5
C 0 0 0 3 0 0 1 0 1 0 1 0 2 2 1 1 1 2 5 4 5
T