Alignment: Definitions and algorithms
FROM DOT PLOTS TO ALIGNMENT
Let
and
be 2 biomolecular sequences of the same type; DNA, RNA or
protein. In a dot plot,
is placed from top to bottom in
the 0th column of an
m+1 x n+1 matrix.
is placed from left to right in the 0th row of the same
matrix. ``Dots'' plotted in row i and column j indicate at least
that ai = bj. If the dot plot uses a ``window'' and ``stringency''
method with w > 1, or a k-word method with k > 1, then not all
matches are plotted in general.
Another way to write down the information in a dot plot is to write the two sequences from left to right, one above the other, and to use line segments to join ai to bj whenever there is a dot in row i and column j. The dot plot in Figure 5 would yield the image in figure 1. This can be regarded as a (bipartite) graph, with the two sequence elements acting as the vertices. We can call this a ``dot plot graph''.
The graph corresponding to the dot plot in Figure 6 would be very complicated. The idea of ``alignment'' is to take a subset of this graph, yielding an ``alignment graph'', or what is normally called an ``alignment of two sequences''. There are 2 requirements that must be satisfied:
The graph in Figure 1 violates both of these conditions. Note that both solid and dotted lines have been used in this Figure. Taking only the solid lines yields a valid alignment. The same is true for the dotted lines. Note that these 2 valid alignments are not the only ones that can be achieved by deleting edges. The whole theory of sequence alignment is to decide what edges to use.
This leads us to the first definition of sequence alignment. This definition will evolve as the course progresses.
DEFINITION: An alignment between sequences
and
is a set of integer pairs,
that satisfy:
Residues aih and bih are said to be ``aligned'' or ``matched'' with each other. Later on, we will drop the requirement that matched pairs must be equal.
We now change the look of the alignment graph by agreeing to write the
2 sequences so that the edges are always vertical lines. This is
equivalent to saying that matched residues are always ``directly over
one another'', or ``in the same column''. In order to accomplish this,
gaps must be left in 1 or both sequences, sometimes at the ends.
These gaps are usually written explicitly, using the symbol ``-''.
For example, an alignment of
AATAACATTCACGCGTCTGC and
AATATTGACGCATTCTGCAT appears below.
AATAACATTCACGC-GTCTGC-- ||| ||| |||| ||||| AAT---ATTGACGCATTCTGCAT
Written like this, we can describe the alignment in evolutionary
terms. We might imagine that
and
are the
descendants of a common ancestor,
.
The sequence
changed into
and
through a series of ``primitive
evolutionary events''. We define these to be:
Sequence elements that do not change are called ``conserved''. In an alignment, matched pairs of residues are inferred to be conserved. This may not be true, since the same change may occur in both.
When ai appears above bj and
,
we know that
a change has occurred. It is possible that more than 1 change has
occurred.
When ai appears above a gap, we know that either ai has been
inserted into the original sequence and appears in
,
or that
ai was in
and was deleted from
.
Similarly when
a gap appears above bj, we know that either bj has been inserted
into the original sequence and appears in
,
or that bj was
in
and was deleted from
.
Thus a gap in an alignment indicates that either an insertion or a deletion has occurred. We cannot deduce when a residue is either inserted or deleted from both sequences. The inference of an ancestral sequence is always error prone, and is approached using statistical methods. A series of consecutive ``-''s is usually called a (single) gap. The number of dashes is referred to as the size, or length of the gap. Dashes at the end of a sequence are called ``end gaps''. Because of the insertion/deletion ambiguity, gaps are often called ``indels''.
SEQUENCE EDITING
Consider the process of turning
into
using a
series of substitutions, insertions and deletions. We call this
``editing''
into
.We start by writing
from left to right without any gaps. We place a copy of
below it. We make a single change in this copy. If a substitution is
made to ai, then ai is changed in the copy. If ai is deleted, then
this character is replaced by a ``-'' in the copy. If an insertion
occurs after ai, then this element is added to the copy and a
``-'' is added at this position in the original sequence. We continue
in this manner, getting k+1 rows for k edit operations. Every time
an element is inserted, a gap character is added in the same position
all the way up to the top. Eliminating all middle rows gives us an
alignment of
with
.
Conversely, an alignment is a record of editing. Matched residues are not edited. ai over bj means that bj is substituted for ai. ai over a gap means that ai is deleted. A gap over bj indicates that bj has been inserted.
MINIMUM EDIT DISTANCE
It is clear that we can edit
to
using at most
m+n edit operations. That is, we can delete all the residues of
,
one by one, and then add all the residues of
,
one
by one. In fact, we could do this in at most
operations,
using
substitutions followed by
insertions or deletions. This is the best
that could be done to turn ACACACACACA into GTGGTGTTGTGTGT.
ACACACACACA--- GTGGTGTTGTGTGT
In general, we can do better.
Let
be the
minimum number of edits required to turn
into
.
Proposition:
is a metric
Proof:
How can we compute
? For
,
let
be the
ith prefix of
.
That is
.
Similarly,
for
.
Let
.
We define
and
to be the empty string. It is clear that
if i or j are 0. That is, Di,0 = i and
D0,j = j. Suppose that we know Dh,k for all
and
,
but not for (i,j). Can we compute Di,j?
In fact, if i > 0 and j > 0, then
Proof:
Thus
The equality follows from the fact that at least 1 of the following
must occur in transforming
to
Working in term of ``edit distance'' is cumbersome. It is easier to
work with alignments. If
is the sequence alphabet, let
,
where
is
.
That is, we add the dash character to the
alphabet. We call d the ``(edit) cost function''. Assume that
.
We assume that d(a,a) = 0. d will usually be
symmetric, but it does not have to be. If it is not symmetric, then
will not be a metric.
At this point, we will change the original definition of alignment by dropping the first condition that aih=bjh for all matched pairs. The d function will make that distinction for us.
Let
be an
alignment between
and
.
Put this alignment into
``alignment format'':
.
This also gives us a
``generalized edit distance'', and a generalized minimum edit
distance. If
d(a,-) = d(-,b) = 1 and d(a,b) = 1 for
This recursion is called the ``fill algorithm''.
Proof: An alignment between
and
can be created by
EXAMPLE TTATGGACTT versus CTTGGCTAGG. d = 2 and 1 for all substitutions and indels, respectively.
| C | T | T | G | G | C | T | A | G | G | ||
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
| T | 1 | 2 | 3 | 4 | 5 | 7 | 8 | 9 | |||
| T | 2 | 3 | 2 | 3 | 4 | 6 | 7 | 8 | |||
| A | 3 | 4 | 3 | 2 | 3 | 4 | 5 | 6 | 6 | 7 | |
| T | 4 | 5 | 4 | 5 | 6 | 6 | 7 | 8 | |||
| G | 5 | 6 | 5 | 4 | 5 | 6 | 7 | ||||
| G | 6 | 7 | 6 | 5 | 4 | 5 | 6 | ||||
| A | 7 | 8 | 7 | 6 | 5 | 4 | 5 | 6 | 6 | 7 | |
| C | 8 | 8 | 7 | 6 | 5 | 5 | 6 | 7 | 8 | ||
| T | 9 | 8 | 7 | 6 | 5 | 5 | 6 | 7 | |||
| T | 10 | 9 | 8 | 7 | 6 | 6 | 7 | 8 |
The D matrix after the fill algorithm is executed.
The dots,
,
indicate positions where ai = bj.
| C | T | T | G | G | C | T | A | G | G | ||
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
| T | 1 | ||||||||||
| T | 2 | ||||||||||
| A | 3 | ||||||||||
| T | 4 | ||||||||||
| G | 5 | ||||||||||
| G | 6 | ||||||||||
| A | 7 | ||||||||||
| C | 8 | ||||||||||
| T | 9 | ||||||||||
| T | 10 |
The D matrix before filling. The 0th row and column have been defined.
The value of Di,j can be determined if Dh,k is known for
,
,
.
There are
different ``fill orders'' that can be used.
Column fill :
for (j=1; j<=m; j++){
for (i=1; i <=n; i++){
Define D[i,j];
}
}
Row fill :
for (i=1; i<=n; i++){
for (j=1; j <=m; j++){
Define D[i,j];
}
}
The set
Diagonal fill :
for (d=2; d<=m+n; d++){
for (i=min{1,d-n}; i<=min{m,d-1}; i++){
j = d - i;
Define D[i,j];
}
}
Note that with the current algorithm, we only need to keep the present and previous column (row) to execute the fill algorithm in column (row) fill order. For the diagonal fill, we need the 2 previous co-diagonals.
TRACEBACK : After the ``fill algorithm'' is executed, an alignment can be constructed using the ``traceback algorithm''. The alignment is constructed from right to left; that is, backwards. This is because we start with Dm,n, and determine which of the 3 cases resulted in the value of Dm,n. This process continues. A traceback algorithm is explained below and a slight variant is illustrated in Figure 2.
![]() |
| C | T | T | G | G | C | T | A | G | G | ||
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
| T | 1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
| T | 2 | 3 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
| A | 3 | 4 | 2 | 3 | 4 | 5 | 6 | 5 | 6 | 7 | |
| T | 4 | 5 | 4 | 4 | 5 | 6 | 5 | 6 | 7 | 8 | |
| G | 5 | 6 | 5 | 4 | 4 | 5 | 6 | 7 | 6 | 7 | |
| G | 6 | 7 | 6 | 5 | 4 | 4 | 5 | 6 | 7 | 6 | |
| A | 7 | 8 | 7 | 6 | 5 | 5 | 6 | 5 | 6 | 7 | |
| C | 8 | 7 | 8 | 7 | 6 | 5 | 5 | 6 | 7 | 8 | |
| T | 9 | 8 | 7 | 8 | 7 | 6 | 5 | ||||
| T | 10 | 9 | 8 | 7 | 8 | 7 | 6 | 5 | 6 | 7 |
The execution of the traceback is illustrated above. The symbols:
``
'', ``
'' and ``
'' mean:
The resulting alignment is:
-TTATGGACT---T | ||| || C-T-TGG-CTAGG-
GAP PENALTIES :
We have already introduced the notion of a gap as a series of insertions or deletions in one sequence with respect to another in the context of an alignment. The ``size'' of a gap is the number of indels. Our definition of a gap allows for consecutive indels in one or both sequences. In the alignment above, -/C and T/- are consecutive entries that contain indels. We call this a single gap. Similarly, the four last columns: -/A, -/G, -/G and T/- constitute 1 gap, not 2. (This is not standard usage.)
The cost of a gap often depends only on its size, s. A function
,
which gives the cost
g(s) to a gap of size s, is called a ``gap cost'' or ``gap
penalty'' function. g(0) is always assumed to be 0. In our
general definition,
d(a,-) = d(-,a) can take on different values for
each letter of the (DNA or amino acid) alphabet. If the value of
d(a,-) is the same
,
the a gap of size s has cost
,
where
.
A gap penalty function of the form
is called an ``affine gap penalty function'',
where
and
are non-negative constants.
Alignment cost with gap cost functions.
In this case, d(ai,bj) still gives the cost of substituting ai
with bj, but d(ai,-) and d(-,bj) are no longer used.
The cost of an alignment cannot be determined by adding up one summand
for each column. If there are
gaps of size s in an
alignment, then the cost of an alignment between
and
is
At this time, we add 2 ``fake matches'' to every alignment to take
care of the end gaps. This means that we do not need to list the end
gaps as special cases. We will assume that every alignment (not the
``local'' alignments that will be considered later) begins with
(i0,j0) = (0,0) and ends with
(ik+1,jk+1) = (m+1,n+1).
We assume that these ``fake matches'' have 0 weight.
Note that a gap occurs whenever
ih-ih-1+jh-jh-1>2, and the
gap size is
ih-ih-1+jh-jh-1-2, for
.
Thus:
There are algorithms to compute minimum cost alignments for such gap functions, but we will not consider them. Instead, we will deal with these gap functions when we look at ``maximum similarity'' alignment.
PATH GRAPHS
Alignments can be visualized in so-called ``path graphs'' that are related to dot plots. Alignments are constructed by eye using dot plots. Path graphs allow one to define the set of all possible alignments in a formal, yet intuitive way.
One type of path graph is what I call the ``truck route'' path graph. I use this term because the path is connected. It cannot hop from one location to another. Truck route path graphs formalize the representation that shows matches and gaps explicitly.
An
m+1 x n+1 square grid of points is constructed. The grid can
described as
,
where i refers to row, and j to column. Two distinct points
(i,j) and (i',j') have an edge between them if and only if
.
Any collection of connected edges from (0,0) to (m+1,n+1) that moves from each node (i-1,j-1) to either (i,j-1), (i,j) or to (i-1,j), describes an alignment.
When a diagonal edge,
is used, we say that
a(i) is matched with b(j).
When a vertical edge,
is used, we say that
a(i) is inserted with respect to sequence
,
or that a(i)
is in a gap.
Similarly, when a horizontal edge,
is
used, we say that b(j) is inserted with respect to sequence
,
or that b(j) is in a gap.
Figure 3 illustrates an alignment together with its path graph. These kinds of path graphs are the most useful ones for looking at multiple sequence alignments. They are not good for describing alignments with affine gap penalties.
![]() |
![]() |
In contrast, the ``airline route'' path graph uses an m x n grid, and matched pairs correspond to vertices, not edges. A path consists of a series of vertices, as defined above. Gaps are inferred when either il+1 - il > 1 or jl+1 - jl > 1. The alignments that have been introduced in the course up to now are best visualized using ``airline route'' path graphs.
Consider the m x n array of integer pairs:
.
Consider all line segments joining (i,j) to (i-1,j), (i,j) to
(i,j-1) and (i,j) to
.
(We exclude negative indices.) A ``path'' from (0,0) to
(m,n) is a sequence of these connected line segments, joining
distinct pairs
,
where
and
.
There is a 1:1 correspondence between paths
and alignments, with the segment:
SIMILAR (DNA) SEQUENCES
If 2 (DNA) sequences are known to be fairly similar, say 90% or better and n and m are large (say 10,000 or larger) and ``roughly the same size'', then much time (and space) can be saved in alignment.
Diagonal band fill
The set
MAXIMUM SIMILARITY ALIGNMENT
As before,
is the sequence alphabet, and
.
We call w the
``weight function'' that gives a measure of similarity. For example,
would be a natural and simple weight function.
So would
.
We assume that w is symmetric.
We can think of w as the inverse of d. However, gap costs are
treated the same as before. The definition of alignment remains the
same. This time, we try to maximize the similarity of an alignment, so
that the gap penalties (gap costs) are subtracted, not added.
If there are
gaps of size s in an
alignment, then the cost of an alignment between
and
is
If
,
then we can define
for all letters a and b. In this case, the alignment similarity is
,
where a' and b' are
characters in
.
In this case, we define
to be the
maximum similarity of all alignments between
and
.
Sufficient condition for
to be
equivalent to
.
Suppose that w and d are score functions for sequence alignment by similarity maximization and distance minimization, respectively. Let the gap penalty for distance minimization be given by some function g. We wish to construct a maximum similarity alignment.
Suppose, as well, that
w(a,b) + d(a,b) = C for some constant C,
and that
.
Then distance minimization
alignment is equivalent to similarity maximizing alignment.
Proof: For any alignment, there are k matched pairs and a total of m+n-2k elements in gaps, including possible end gaps.
Let
be the number of gaps of size l. Then
Example The familiar case where
gives what is called the ``Levenstein distance'' between 2
sequences. In this case, g(s) = s. If we let
,
which seems ``natural'', then C=1, so g'(s) = s/2 as a gap penalty
for maximum similarity alignment would give identical results.
For a general gap function, g, how can we compute
and a maximum similarity alignment?
Let Si,j be the maximum similarity of an alignment between
and
.
Note that
S0,j = -g(j) and that
Si,0 =
-g(i). What sort of recursion can be used in this case? The answer is
given below.
Proof : An optimal alignment either ends with a match
between ai and bj, or with an end gap. The
Si-1,j-1+w(ai,bj) term adds the weight of an ai/bj match to
an optimal alignment between
and
.
This
either extends a match or closes an end gap. The second term considers
all possible ``last matched pair'' cases.
While this formula is straightforward, computation time is
,
since one must backtrack for each (i,j) pair. We
can do better when g has a special form.
Case 1.
,
where
is a positive constant.
This is the ``linear gap penalty'' case.
Case 2.
,
where
and
are non-negative constants. (Both are not 0.)
This is the ``affine gap penalty'' case. The linear gap penalty
recursion breaks down because when we add an unpaired ai
to an optimal alignment between
and
or bj to an optimal alignment between
and
we don't know if we are creating a new gap, with a cost of
,
or just extending an existing one, with a cost of
only
.
This can be solved in a number of ways. They all involve
the use of at least one extra array that must be filled.
``Zuker's method'': Let S1i,j be the maximum similarity between
and
that ends with a gap. That is, ai and
bj are not matched with each other. Let S2i,j be the maximum
similarity between
and
that ends with an
ai/bj match. Clearly
Initial conditions.
Recursions:
Proof : We consider cases. For S1i,j:
For S2i,j, ai is paired with bj. ai-1 and bj-1 are either paired with each other, or they are not.
Alignment examples.
AACAGGCAGGCCATT--AGGGAG--CTACGG-GTTGTGCCGACCGTCCAAT---AATGGTCGGCCTCTC |: |:| :|:|||: || :|| || ||: :||:|| || |||:| ||| |:::| | AGAAAG--AGTCATCTGAGCAAGGACT-CGATATTATG--GA---TCCGA-AGAAAT--TTAACGT---
In this example, identity is indicated by |,
transitions (
,
)
by :, and transversions (
,
,
,
)
by blanks, respectively.
Example:
= AGGCTACGG and
= AGGGACTCGAT. w = 10,
2 and -5 for identity, transitions and transvertions, respectively.
Optimal alignment:
AGG--CTACGG- ||| || ||: AGGGACT-CGAT
S1:
| A | G | G | G | A | C | T | C | G | A | T | ||
| -11 | -12 | -13 | -14 | -15 | -16 | -17 | -18 | -19 | -20 | -21 | ||
| A | -11 | -12 | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 | -9 | -10 |
| G | -12 | -1 | -2 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
| G | -13 | -2 | 9 | 8 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 |
| C | -14 | -3 | 8 | 19 | 18 | 17 | 16 | 17 | 16 | 15 | 14 | 13 |
| T | -15 | -4 | 7 | 18 | 17 | 16 | 17 | 16 | 27 | 26 | 25 | 24 |
| A | -16 | -5 | 6 | 17 | 16 | 15 | 16 | 27 | 26 | 25 | 24 | 25 |
| C | -17 | -6 | 5 | 16 | 15 | 16 | 15 | 26 | 25 | 26 | 25 | 24 |
| G | -18 | -7 | 4 | 15 | 14 | 15 | 26 | 25 | 26 | 25 | 36 | 35 |
| G | -19 | -8 | 3 | 14 | 15 | 14 | 25 | 24 | 25 | 36 | 35 | 38 |
S2:
| A | G | G | G | A | C | T | C | G | A | T | ||
| 0 | ||||||||||||
| A | 10 | -9 | -10 | -11 | -4 | -20 | -21 | -22 | -16 | -9 | -25 | |
| G | -9 | 20 | 9 | 8 | -1 | -9 | -10 | -11 | 3 | -6 | -14 | |
| G | -10 | 9 | 30 | 19 | 10 | 2 | 1 | 0 | 14 | 5 | -3 | |
| C | -18 | -7 | 4 | 25 | 14 | 28 | 19 | 26 | 10 | 9 | 15 | |
| T | -19 | -8 | 3 | 14 | 20 | 19 | 38 | 21 | 21 | 10 | 24 | |
| A | -5 | -2 | 9 | 20 | 27 | 15 | 14 | 33 | 29 | 36 | 20 | |
| C | -21 | -10 | 1 | 12 | 15 | 37 | 18 | 37 | 28 | 24 | 38 | |
| G | -15 | 4 | 15 | 26 | 17 | 11 | 32 | 21 | 47 | 30 | 20 | |
| G | -16 | 3 | 14 | 25 | 28 | 12 | 21 | 27 | 36 | 49 | 31 |
Up to now, we have been dealing with what is called ``global
alignment''. When dealing with edit distance, we must transform one
sequence into another. As we dealt with minimum cost alignment and
maximum similariy alignment, we always kept the notion that we are
dealing with all of
and
.
In particular, we
penalize ``end gaps''. We will now consider another class of alignment
problem.
Local sequence alignment :
If
and
,
let
and let
.
Let
For a general gap function, g, how can we compute
and a maximum similarity alignment? Such an alignment is
called a ``local alignment''.
Let Hi,j be the maximum similarity of an alignment between
and
that includes either the (ai,bj) match or
else counts an end gap, or is empty.
Then
Note that
Hi,0 = H0,j = 0, since these values can only be
or
.
The recursion for i>0 and
j>0 is:
Proof : An optimal alignment either ends with a match
between ai and bj, or with an end gap. The
Hi-1,j-1+w(ai,bj) term adds the weight of an ai/bj match to
an optimal local alignment between
and
.
This
either extends a match or closes an end gap. The second term considers
all possible ``last matched pair'' cases. If the last 2 terms are
negative, then the 0 option (no alignment) can be used.
While this formula is straightforward, computation time is
,
since one must backtrack for each (i,j) pair. As
with S, we can do better when g has a special form. In these forms,
the algorithm is named ``Smith-Waterman'' after Temple Smith and
Michael Waterman who first published it.
Case 1.
,
where
is a positive constant.
This is the ``linear gap penalty'' case.
Case 2. Affine gap penalty.
.
``Zuker's method'': Let H1i,j be the maximum similarity between
and
that ends with a gap. That is, ai and
bj are not matched with each other. Let H2i,j be the maximum
similarity between
and
that ends with an
ai/bj match. It is understood that either H1i,j or
H2i,j will be 0 (no alignment) if there is no alignment
with a non-negative score.
Clearly
The initial conditions are simple. H1i,0 = H10,j = H2i,0 = H20,j = 0.
Recursions:
Proof : We consider cases. For H1i,j:
For H2i,j, if there is an alignment with positive score, then ai is paired with bj. ai-1 and bj-1 are either paired with each other, or they are not.
Alignment examples. Identity, transition and transversion weights are
1, 0 and -1, respectively.
.
= TCTTCTCCAAGGCGTTAACT and
= AACTTCGTTTGAGGCTTCTT
10
CTTC-TCCAAGGC
|||| |:::||||
CTTCGTTTGAGGC
10
H1
| A | A | C | T | T | C | G | T | T | T | G | A | G | G | C | T | T | C | T | T | ||
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| T | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| C | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 2 |
| T | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 3 |
| C | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 2 |
| C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 3 |
| A | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
| A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| G | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 3 | 2 | 1 | 0 | 0 | 0 | 0 |
| G | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 2 | 4 | 3 | 2 | 1 | 0 | 0 |
| C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 4 | 3 | 5 | 4 | 3 | 2 | 1 |
| G | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 5 | 4 | 4 | 3 | 2 | 1 |
| T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 4 | 4 | 4 | 5 | 4 | 3 |
| T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 1 | 3 | 4 | 5 | 5 | 5 | 4 |
| A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 0 | 2 | 3 | 5 | 5 | 4 | 4 |
| A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 2 | 4 | 4 | 4 | 3 |
| C | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 3 | 3 | 3 | 3 |
| T | 0 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 2 | 4 |
H2
| A | A | C | T | T | C | G | T | T | T | G | A | G | G | C | T | T | C | T | T | ||
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| T | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 |
| C | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 0 | 1 |
| T | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 3 | 1 |
| T | 0 | 0 | 0 | 0 | 1 | 3 | 1 | 0 | 2 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 3 | 1 | 2 | 4 |
| C | 0 | 0 | 0 | 1 | 0 | 1 | 4 | 0 | 0 | 2 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 4 | 1 | 2 |
| T | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 3 | 3 | 2 | 3 | 1 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 5 | 3 |
| C | 0 | 0 | 0 | 1 | 0 | 2 | 2 | 1 | 3 | 3 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 2 | 2 | 2 | 5 |
| C | 0 | 0 | 0 | 1 | 1 | 0 | 3 | 1 | 1 | 3 | 3 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 3 | 2 | 3 |
| A | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 2 | 3 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 1 |
| A | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 2 | 4 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| G | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 2 | 5 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| G | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 3 | 6 | 2 | 1 | 0 | 0 | 0 | 0 |
| C | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 7 | 4 | 3 | 3 | 1 | 0 |
| G | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 1 | 3 | 3 | 6 | 4 | 3 | 2 | 1 |
| T | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 3 | 6 | 7 | 4 | 4 | 3 |
| T | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 0 | 1 | 4 | 2 | 0 | 0 | 0 | 0 | 2 | 5 | 7 | 7 | 6 | 5 |
| A | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 3 | 2 | 2 | 0 | 0 | 0 | 2 | 4 | 6 | 6 | 5 |
| A | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 3 | 3 | 2 | 0 | 0 | 1 | 2 | 4 | 5 | 5 |
| C | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 2 | 2 | 1 | 1 | 1 | 2 | 5 | 4 | 5 |
| T |