next up previous

MATH-4961,6961 / CSCI-4971,6971

Markov chains

We have already discussed sequences of independent, identically distributed, or i.i.d., random variables. Let's consider a sequence of random variables, $X_1, X_2, X_3, \dots$ that each take on the same finitely many values. In our applications, the finite set will be the alphabet (DNA or protein),
${\cal A} = \{A_1, A_2,
\dots, A_K\}$ of size K, as defined previously. If the Xis are independent, then a single ``vector'' of numbers is needed to characterize the common distribution.
This vector, ${\bf P}$ is $(p_1,p_2,p_3, \dots, p_K)$, where ph = P(Xi = Ah), since we'll think of ${\bf P}$ as a single row, or as a 1 x K matrix.

In a Markov chain, instead of being independent, each Xi will depend on its immediate predecessor, Xi-1, for i>0. Instead of specifying K probabilities, we now have to specify K x K numbers that describe all possible conditional probabilities. Let

$\displaystyle {\bf P}= \left (
\begin{array}{ccccc}
p_{1,1} & p_{1,2} & p_{1,3}...
... & \vdots \\
p_{K,1} & p_{K,2} & p_{K,3} & \dots & p_{K,K}
\end{array}\right )$     (1)

be a K x K matrix of conditional probabilities, (pk,l) where $p_{k,l} = P( X_{i+1} = A_l \; \arrowvert \; X_i = A_k)$. Note that all the rows of ${\bf P}$ sum to 1.

Note: We will often use ${\bf Q}$ and (qk,l) in place of ${\bf P}$ and (pk,l) to identify these probabilities with those in the numerator of the ``log-odds'' formula.

Let ${\bf P}^n$ be the nth power of $\\ P$, and let (pk,l(n)) be the element in row k and column l. By definition, $\\ P^0 = I$, where I is the (K x K is understood) identity matrix. What is the interpretation of pk,l(n)?

It is easy to see that pk,l(n) is the probability of going from Ak to Al in n steps. That is,

\begin{displaymath}
p_{k,l}^{(n)} = P( X_{i+n} = A_l \; \arrowvert \; X_i = A_k).
\end{displaymath}

For n = 1, this is true by definition. Suppose it is true for some $n \geq 1$. Then

 \begin{displaymath}
p_{k,l}^{(n+1)} = \sum_{m=1}^{K} p_{k,m}p_{m,l}^{(n)}.
\end{displaymath} (2)

In equation 2, each summand describes the probability of starting at Ak, moving to Am in 1 step, and then moving from Am to AL in n steps, by the induction hypothesis. Summing over m takes care of all possibilities.

Let ${\bf F}^{(n)} = (f_1^{(n)},f_2^{(n)},f_3^{(n)}, \dots, f_K^{(n)})$ be a vector that describes the distribution of Xn. The initial distribution, ${\bf F}= {\bf F}^{(1)}$ must be given along with the K x K matrix of probabilities.

It is easy to see that

\begin{displaymath}
{\bf F}^{(n)} = {\bf F}{\bf P}^{n-1}
\end{displaymath}

for n > 0.

A distribution, ${\bf F}^*$, is called ``stationary'' if it has the property:

\begin{displaymath}
{\bf F}^* = {\bf F}^* {\bf P},
\end{displaymath}

so that ${\bf F}^* = {\bf F}^* {\bf P}^n \; \forall \; n \geq 1$ as well. By definition, all the rows of ${\bf P}$ sum to 1. Thus, all the rows of ${\bf P}- I$ sum to 0, so 1 is an eigenvalue of ${\bf P}$. A corresponding eigenvector, ${\bf F}^*$, normalized so that $\displaystyle
\sum_{k=1}^K f_k = 1$, is a stationary distribution. It is not unique in general.

A Markov chain is called ``ergodic'' if $\forall \; k,l \; \exists n
\; \ni \; p_{k,l}^{(n)} > 0$. In words, it is possible to move from every state to every other state with non-zero probability after a finite number of steps.

If ${\bf P}$ is ergodic, then there is a unique stationary distribution, ${\bf F}^*$ and for any starting distribution, ${\bf F}$,

\begin{displaymath}
\lim_{n \rightarrow \infty} {\bf F}^n = {\bf F}^*.
\end{displaymath}

Any Markov chain can be decomposed into separate ergodic components. To be specific, $\forall \; {\bf P}, \; \exists$ a permutation matrix, ${\bf\Pi}$ acting on $\{A_1, \dots, A_K\}$, such that

$\displaystyle {\bf\Pi}{\bf P}{\bf\Pi}^{-1} = \left (
\begin{array}{ccccc}
{\bf ...
...,1} & {\bf Z}_{M,2} & {\bf Z}_{M,3} & \dots & {\bf P}_{M}
\end{array} \right ),$     (3)

where ${\bf P}_i$ is an mi x mi ergodic Markov chain matrix for $1 \leq i < M$, and $\sum_{i=1}^{M} m_i = K$. ${\bf Z}_{i,j}$ is an mi x mj matrix of zeros. This decomposition is unique up to the order of the ${\bf P}_i$s. In this case, there are M linearly independent stationary distributions; one for each ergodic subchain.

In general,

\begin{displaymath}
\lim_{n \rightarrow \infty} {\bf P}^n =
\left ( \begin{arra...
...* \\
{\bf F}^* \\
\vdots \\
{\bf F}^*
\end{array} \right )
\end{displaymath}

for some stationary distribution. That is, ${\bf P}^n$ converges to a matrix where each of the K rows is ${\bf F}^*$. What this ``says'' is that for large n, the Markov chain $X_n, X_{2n}, X_{3n}, X_{4n}, \dots$ behaves more and more like a sequence of i.i.d. random variables.

We can think of the discrete n as a ``time'' or ``space'' variable. If we think of n as ``time'', then each move from n to n+1 can be called a ``generation'' or an ``epoch''. For example, in molecular evolution, an ``epoch'' might correspond to a million years.

Markov process It will be useful to think of finite valued Markov chains as functions of a continuous variable, t. Let ${\bf A}= (a_{k,l})$ be a K x K matrix such that $a_{k,l} \geq 0$ if $k \neq l$, $a_{k,k} \leq 0$ and $\sum_{l=1}^K a_{k,l} = 0 \; \forall \; l$. For any real $t \geq 0$, define P(t) = eAt.

Then $\forall t \geq 0, \; P(t)$ is a Markov chain matrix.

Proof: P(0) = I, which is a trivial Markov chain matrix. Note that $a_{k,l}^{(2)} = \sum_{m=1}^K a_{k,m}a_{m,l}$, so that $\sum_{k=1}^K a_{k,l}^{(2)} = 0 \; \forall \; l$. This is true for all higher powers of ${\bf A}$. Now choose some $\epsilon > 0$ such that

\begin{displaymath}
\sum_{j=1}^{\infty} \frac{(\epsilon {\bf A})^j}{j!}
\end{displaymath}

is a matrix with all terms < 1 in absolute value. Then P(s) is a Markov matrix $\forall 0 < s \leq \epsilon$ because the columns sum to 1 and are non-negative. For any t > 0, choose an integral $n > 0 \; \ni \; t/n
< \epsilon$. Then P(t/n) is a Markov matrix, so it's nth power is also.

${\bf A}$ is called the ``infinitesimal generator'' of the Markov process.

PAM theory

``PAM'' stands for ``Percent Accepted Mutation'', and is a term that was created by Margaret Dayhoff, a giant in protein sequence analysis. The basic evolutionary assumptions in ``PAM theory'' are:

1.
There is an original ancestor sequence whose residues are all evolving independently of one another.
2.
The residues are all evolving according to the same stochastic rules.
3.
Each residue is evolving according to a Markov chain or Markov process.

If a (discrete time) Markov chain is used, the ``n'' is regarded as a time unit and is called a ``generation'' or ``epoch''. For example, an epoch might correspond to 1,000,000 years. Margaret Dayhoff introduced a convention that the ``base'' matrix, $\\ P$, corresponds to what she called a ``1 PAM mutation data matrix''. What this means is that the average probability of change is 1%. This probability is defined as:

\begin{displaymath}
1 - \sum_{k=1}^{K} p_{k,k}f_{k}^*,
\end{displaymath}

where (fk*) is the stationary distribution (she dealt with ergodic systems). This means that there is a 1% chance of change in any epoch. ${\bf P}^n$ is called the ``PAM n'' or ``n PAM'' matrix. Although the choice of 1% for the base matrix is arbitrary, it was important that it be a small number. The reason is that for small n, the probability of some Ak changing to $A_l, \; l \neq k$, and then back to Ak should be ``very small''. This probability would be of the order of 0.0001 for n=2. In other words, the probability of change is close to n% in the PAM n matrix, for small n.

In a Markov process,

$\displaystyle P(X(t+\delta t) = A_l \; \arrowvert \; P(X(t) = A_k)$      
$\displaystyle \approx \left \{
\begin{array}{ll}
\delta t a_{k,l} & k \neq l \\
1 - \delta t a_{k,k} & k = l
\end{array}\right .$     (4)

for small $\delta t$, so that the percent change is proportional to time for small time increments.

According to PAM theory, sequences that are compared by alignment are assumed to evolve by the evolutionary process described above.

In defining scoring weights for comparing DNA or protein sequences, we will adopt the notation ${\bf Q}$ and ${\bf Q}(t)$ to describe a Markov chain matrix or a Markov process. We'll assume that the process is ergodic.

In matching Ak with Al in an alignment, we distinguish between two possibilities. One is the ``null hypothesis'' that the sequences are unrelated and that any ai and bj are chosen independently from the stationary distribution. The probability of this is fk*fl*, and corresponds to pkpl' in the discussion of weights and probabilities. The other possibility is that the matched residues are related by the evolutionary model. In this case, we imagen starting at Al and ending at Ak, with probability fl*ql,kn (or fl*q(t)l,k) for some n (or t). We can't distinguish this from starting at Ak and ending at Al, so we take ( fl*ql,kn + fk*qk,ln)/2 as the probability of this event. Note that this is symmetric in k and l. In scoring this match, we define

 \begin{displaymath}
\begin{array}{rcl}
w_{k,l} & = & \log \left ( \frac{(f_l^*q_...
...,k}^n}{2f_k^*} +
\frac{q_{k,l}^n}{2f_l^*} \right ).
\end{array}\end{displaymath} (5)

Note the use of the general ``log'' function. Margaret Dayhoff used base 10 in her work. These days, base 2 is commonly used, so that scores are measured in ``bits''. Note that matches that are more likely to occur according to the evolutionary model than by chance are given a positive score, and those more likely to occur by chance are given a negative score.

Why do we use ``log-odds'' (also called ``lod'') scores? Lets think about a local alignment without gaps. (Gaps complicate things!) We find some i, j and m such that $\sum_{k=0}^{m-1}
w(A_{i+k},A_{j+k})$ is maximized. The sum of logarithms is the logarithm of a product. If the logarithm is maximized, then so is the product. Because of our independence assumption, the product is the quotient of 2 probabilities: ``the probability that these 2 strings come from a common ancestor'' and ``the probability that these 2 strings are chosen randomly from some distribution''. Thus, finding a best ungapped local alignment maximizes a ratio of 2 probabilities.

DNA evolutionary models

The simplest model is known as the ``Jukes-Cantor'' 1 parameter model. In its discrete form, the mutation data matrix is

 \begin{displaymath}
{\bf Q}= \left (
\begin{array}{cccc}
1- 3a & a & a & a \\
...
...a & 1 - 3a & a \\
a & a & a & 1 - 3a
\end{array}
\right ),
\end{displaymath} (6)

for some small a. In words, any base and convert to any other with equal probability. This is a 1 PAM matrix for a = 0.01/3. This model is best expressed using the instantaneous form, where

 \begin{displaymath}
{\bf A}= \left (
\begin{array}{cccc}
-1 & 1/3 & 1/3 & 1/3 \...
...3 & -1 & 1/3 \\
1/3 & 1/3 & 1/3 & -1
\end{array}
\right ) .
\end{displaymath} (7)

Note that $e^{3a {\bf A}}$ corresponds roughly to the discrete form.


  
Figure 1: A schematic diagram of the Jukes-Cantor model for DNA evolution. All interconversions are equally likely, with probability a.
\begin{figure}
\epsfig{file=jc1.ps}
\end{figure}

We will immediately generalize this model to the ``Kimura 2 parameter'' model. In this case, we distinguish between transitions and transversions, reserving a for the transition probability and introducing b as the transversion probability. Let $r =
\frac{a}{2b}$. Then r is the expected ratio of transitions to transversions, since a transition can happen in a single way, while there are always two choices for transversions. r=1/2 in the Jukes-Cantor model. A more realistic value of r is 1.

In this case,

 \begin{displaymath}
{\bf Q}= \left (
\begin{array}{cccc}
1\!-\!a\!-\!2\!b & b &...
... & b \\
b & a & b & 1\!-\!a\!-\!2\!b
\end{array}
\right )
\end{displaymath} (8)

This model is also best expressed using the instantaneous form,


  
Figure 2: A schematic diagram of the Kimura model for DNA evolution. Transitions have probability a and transversions have probability b in each epoch.
\begin{figure}
\epsfig{file=kimura1.ps}
\end{figure}


 \begin{displaymath}
{\bf A}= \left (
\begin{array}{cccc}
-1 & \frac{0.5}{1+r} ...
...& \frac{r}{1+r} & \frac{0.5}{1+r} & -1
\end{array}
\right ) .
\end{displaymath} (9)

$e^{{\bf A}(a+2b)} \approx {\bf Q}$ in this case. The continuous model is used because the eigenvalues of ${\bf A}$ in equation 9, and hence $e^{{\bf A}t}$ are easily computed in ``closed form''. It can be shown that the eigenvalues of ${\bf A}$ are:

\begin{displaymath}
0, \qquad -\frac{2}{1+r}, \qquad \frac{1+2r}{1+r}, \; {\rm and} \qquad
\frac{1+2r}{1+r}.
\end{displaymath}

Then

 \begin{displaymath}
e^{tA} = \left (
\begin{array}{cccc}
f_1(t) & f_2(t) & f_3(...
...\\
f_2(t) & f_3(t) & f_2(t) & f_1(t)
\end{array}
\right ) ,
\end{displaymath} (10)

where

 \begin{displaymath}
\left \{ \begin{array}{rl}
f_1(t) = & \frac{1}{4} + \frac{1}...
...}} -
\frac{1}{2}e^{-\frac{t(1+2r)}{1+r}}
\end{array} \right .
\end{displaymath} (11)

The function f1, f2 and f3 correspond to ``no change'', ``transversion'' and ``transition'', respectively. The time, t is in 100 PAM units. That is 0.01 is roughly 1 PAM. The equilibrium distribution is
(1/4,1/4,1/4,1/4) in all cases. Equation 5 gives us the appropriate scores. Note that ${\bf A}$ is symmetric, and that fk* = 1/4 for $1 \leq k \leq 4$. This means that $w_{k,l}(t) = \log 4 (e^{{\bf A}t})_{k,l}$, so that


 \begin{displaymath}
\begin{array}{rl}
s_{identity}(t) = & \log \left ( 1 + e^{-\...
... - 2e^{-\frac{t(1+2r)}{1+r}} \right ) \nonumber \\
\end{array}\end{displaymath}  

Note how the transition and transversion scores $\rightarrow -\infty$ as $t \rightarrow 0$, that exact matches always have a positive score, and that transversions always have a negative score. It's easy to derive that transitions have a positive score $\iff$

\begin{displaymath}
\frac{t(2r-1)}{1+r} > \ln 2.
\end{displaymath}

This can only happen if r > 1/2, and then only for large enough t. There is no satisfactory theory on the choice of the a and b gap penalties in alignment, but it is wise to make b greater than the absolute value of the worst match weight, and a equal to the score of ``several good matches''. In hidden Markov models, we'll see how gaps and gap size correspond to certain probabilities.

Evolutionary distance

The problem with aligning sequences using a Jukes-Cantor, Kimura or related model is that we usually have no a priori idea of what the evolutionary time, t is. Can we estimate t? Suppose that we have an alignment. Ignoring gaps, we can count the number of identities, transitions and transversions, and express these in fractional form: fid, fts, ftv, respectively. These, in turn, are estimates of f1, f3, f2 from 10, respectively. Letting $A = e^{-\frac{t}{1+r}}$ and $B =
e^{-\frac{2rt}{1+r}}$, we can write, using 11,

 \begin{displaymath}
\left \{ \begin{array}{rl}
f_{id}(t) = & \frac{1}{4} + \frac...
...c{1}{4} + \frac{1}{4}A^2 - \frac{1}{2}AB
\end{array} \right .
\end{displaymath} (12)

This gives us

\begin{displaymath}
A = \sqrt{1-2f_{tv}},
\end{displaymath}


\begin{displaymath}
B = \frac{1-2f_{ts}-f_{tv}}{\sqrt{1-2f_{tv}}}
\end{displaymath}

and

\begin{displaymath}
r = \frac{\ln B}{2\ln A}.
\end{displaymath}

Finally, we can deduce

 \begin{displaymath}
t = -\frac{1}{2} \ln \left [ (1-2f_{ts}-f_{tv})\sqrt{1-2f_{tv}} \right ]
\end{displaymath} (13)

In practice, we can align 2 DNA sequences using some ``reasonable'' value of the evolutionary distance, t. This will yield a revised estimate of t from equation 14, and hence to a new scoring function. This gives a new alignment, revised values of fid, ftr, ftv and finally of t. The whole process can be repeated a small number of times until it converges.

A protein evolutionary model

For protein, we will not deal with a theoretical model as above. We could imagine a Kimura or Jukes-Cantor model acting on the DNA of genes, and this would lead to an evolutionary model of the resulting proteins, but such a model would be unrealistic since DNA evolution is constrained in coding regions. For this reason, Margaret Dayhoff used observed changes in homologous positions of related proteins.

At this point, I should state that molecular sequence are never homologous. The word ``homologous'' refers to common ancestry. The sequences of 2 DNA or protein molecules may be similar, according to an alignment, but they are never ``homologous''. If the two sequences are sufficiently similar, we might infer homology. Molecular sequences change more rapidly than molecular structure. For proteins, the 3D shapes of two distantly related proteins might be quite similar even though the sequences are completely dissimilar. Based on this well-known observation, Dayhoff looked at hundreds of well-aligned protein sequences that fell into a number of distinct classes, such as globins, cytochrome C's and so on. They were so similar that they could be confidently aligned ``by hand''. This gave her a database of 1572 observed substitutions, and this was used to construct the ``accepted point mutation matrix''. We'll call this matrix ${\bf M}'$. It is the precursor to the 20 x 20 1 PAM Markov chain matrix, ${\bf M}$.

Element mi,j' of ${\bf M}'$ ($i \neq j$) is the number of times that Ai is replaced by Aj, or vice versa. How do we get asymmetry out of this, and how do we normalize to the 1 PAM matrix, ${\bf M}$? Implicit in the computation that follows is the assumption that observed changes have occurred just once. Dayhoff computed an empirical distribution, $f_1, f_2, \dots, f_{20}$ from the alignments. She defined the term ``relative mutability'' of Aj by the total number of changes involving that amino acid divided by its frequency in the database of aligned sequences. That is:

\begin{displaymath}
mut_j \propto \sum_{i \neq j} \frac{m'_{i,j}}{f_j}
\end{displaymath}

The probability that Aj changes is 1 - mj,j, which is proportional to the mutability, so we can write

\begin{displaymath}
m_{j,j} = 1 - \lambda mut_j,
\end{displaymath}

for some constant, $\lambda$. Similarly,

\begin{displaymath}
m_{i,j} \propto mut_j m_{i,j}', \; \forall \; i \neq j.
\end{displaymath}

Since $\sum_{j=1}^{20} m_{i,j} = 1 \; \forall \; i$, we conclude that

\begin{displaymath}
m_{i,j} = \frac{\lambda mut_j m_{i,j}'}{\sum_{k \neq j} m_{k,j}'}, \;
\forall \; i \neq j.
\end{displaymath}

This leaves only $\lambda$ undetermined. But ${\bf M}= (m_{i,j})$ is 1 PAM, so

\begin{displaymath}
0.99 = \sum_{i=1}^{20} f_i m_{i,i}
\end{displaymath}

and

\begin{displaymath}
\lambda = \frac{1}{100\sum_{i=1}^{20} mut_i f_i}.
\end{displaymath}

This defines the 1 PAM matrix, assuming that the probability of change of an amino acid is proportional to its observed mutability. By construction, the observed distribution, (fi) is also the stationary distribution. Table 1 gives the relative mutabilities and distribution of the amino acids the Dayhoff used.


 
Table 1: Relative mutabilities and the distribution of amino acids in M. Dayhoff's database of observed amino acid changes.

    muti fi       muti fi
Ala A 100 0.087   Leu L 40 0.085
Arg R 65 0.041   Lys K 56 0.081
Asn N 134 0.040   Met M 94 0.015
Asp D 106 0.047   Phe F 41 0.040
Cys C 20 0.033   Pro P 56 0.051
Gln Q 93 0.038   Ser S 120 0.070
Glu E 102 0.050   Thr T 97 0.058
Gly G 49 0.089   Trp W 18 0.010
His H 66 0.034   Tyr Y 41 0.030
Ile I 96 0.037   Val V 20 0.065


Table 2 gives Dayhoff's 1 PAM matrix in its entirety.
 \begin{landscape}
% latex2html id marker 1331\begin{table}
\begin{tabular}{\ve...
...ilities of change, so that each column
adds up to 1.}
\end{table}\end{landscape}

By construction, $\frac{m_{i,j}}{f_i} = \frac{m_{j,i}}{f_j}$, so the LOD matrix for PAM n is $(\log \frac{m_{i,j}^{(n)}}{f_i})$. The symmetrized matrix, $(\frac{m_{i,j}^{(n)}}{f_i})$, is called a ``relatedness odds matrix'', ${\bf R}= (r_{i,j})$.

LOD scores without Markov models

BLOSUM - BLOcks SUbstitution Matrix

Dayhoff, because of limited data in her time, had to work with closely related sequences and extrapolate forward in time using Markov models. By 1990, it was no longer necessary to extrapolate, since relatedness matrices and LOD scores could be computed directly from groups of well-aligned sequences.

Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. We'll discuss the creation of blocks later on when we discuss multiple sequence alignment. We can think of blocks as the result of cutting out the best regions of multiple alignments.

Example

IPB000337: 7tm_3 G-protein coupled receptors family 3 (Metabotropic glutamate receptor-like) from the Blocks database.1

There are 16 blocks in this family, labeled
IPB000337A to IPB000337P.
Block IPB000337C is:


\begin{landscape}
\begin{verbatim}ID 7tm_3; BLOCK
AC IPB000337C; distance from p...
...ISISRVIGSFDIPLVSHFATCACLSDKQKYPSFFRTIPSDQFQA 100
//\end{verbatim}\end{landscape}

About 10 years ago, the Henikoff's began to construct LOD matrices starting with over 2000 Blocks. In every column of every block, all rows are compared with each other once. That is, rows k1 and k2 are examined $\forall \; 1 \leq k_1 \leq k_2 \leq k_{max}$. The total number of (Aj,Ai) pairs in all the row comparisons of all the columns of all the blocks is stored as fi,j, for $1 \leq i \leq
j \leq 20$. The probability of each (Ai,Aj) or (Aj,Ai) pair is defined simply as:

\begin{displaymath}
q_{i,j} = \frac{f_{i,j}}{\displaystyle \sum_{i=1}^{20} \sum_{j=1}^{i} f_{i,j}}
\end{displaymath}

The probability of occurrence of the ith amino acid in an (Ai,Aj) pair is

\begin{displaymath}
p_i = q_{i,i} + \sum_{j \neq i} q_{i,j}/2.
\end{displaymath}

Then the probability, ei,j, for the juxtaposition of Ai and Bj purely by chance is given by

\begin{displaymath}
e_{i,j} = \left \{ \begin{array}{ccc}
p_i p_j & {\rm for} & i = j \\
2p_i p_j & {\rm for} & i \neq j \end{array} \right .
\end{displaymath}

The scores are given by

\begin{displaymath}
s_{i,j} = \log_{2} \frac{q_{i,j}}{e_{i,j}}
\end{displaymath}

In practice, the scores are multiplied by 2 and then rounded to the nearest integer to give ``BLOSUM'' matrices in half-bit units. Two quantities are computed. The first is the familiar expected match weight:

\begin{displaymath}
E = \sum_{i=1}^{20} \sum_{j=1}^{i} p_{i} \times p_{j} \times s_{i,j}
\end{displaymath}

The second quantity is the ``relative entropy'':

\begin{displaymath}
\begin{array}{ccl}
H & = & \sum_{i=1}^{20} \sum_{j=1}^{i} q_...
...^{i} q_{i,j} \times \log_2
\frac{q_{i,j}}{e_{i,j}} \end{array}\end{displaymath}

This basic BLOSUM matrix can be compared to a Dayhoff PAM matrix with the same relative entropy. Relative entropy is maximum for 0 PAM and decreases with increasing n (or t). How can we generate BLOSUM matrices with different relative entropies?

A ``BLOSUM n'' matrix, $0 \leq n < 100$ is constructed as follows. In each block, any 2 sequence that have n% or more of their amino acids identical are combined into a single group. This results in a number of groups that becomes smaller (limit is 1) as n increases. The counting of pairs is now done only between groups, although a group to group comparison uses an average value. Suppose group k1 has l1 sequences and group k2 has l2 sequences. Then instead of adding a 1 or a 0 to all the fi,js, we add the number of Ais in group k1 times the number of Ajs in group k2 plus the number of Ajs in group k1 times the number of Ais in group k2 (if $i \neq j$) and divide the entire sum by l1 x l2. Thus the fi,js now take on fractional values. The result of this process is that for increasing n, changes in more distantly related proteins are counted while those in more closely related proteins are weighted less or eliminated altogether when an entire block becomes a single group. This gives a family of BLOSUM matrices with entropies that increase with n.

Dayhoff PAM n matrices and BLOSUM n matrices are considered equivalent when their relative entropies are (roughly) the same. For example, PAM 250 is comparable to BLOSUM 45 (relative entropy of $\approx
0.4$bits). PAM 120 and BLOSUM 80 are equivalent (1 bit). PAM 160 and BLOSUM 62 are equivalent and intermediate, with about 0.7 bits. The expected values for PAM 160 and BLOSUM 62 are -0.57 and -0.52, respectively.


 
Table 3: Blosum 62 matrix scaled to 1/3 bit units. Entropy = 0.6979, Expected = -0.5209
  A R N D C Q E G H I L K M F P S T W Y V
A 6 -2 -2 -3 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 2 0 -4 -3 0
R -2 8 -1 -2 -5 1 0 -3 0 -4 -3 3 -2 -4 -3 -1 -2 -4 -3 -4
N -2 -1 8 2 -4 0 0 -1 1 -5 -5 0 -3 -4 -3 1 0 -6 -3 -4
D -3 -2 2 9 -5 0 2 -2 -2 -5 -5 -1 -5 -5 -2 0 -2 -6 -5 -5
C -1 -5 -4 -5 13 -4 -5 -4 -4 -2 -2 -5 -2 -4 -4 -1 -1 -3 -4 -1
Q -1 1 0 0 -4 8 3 -3 1 -4 -3 2 -1 -5 -2 0 -1 -3 -2 -3
E -1 0 0 2 -5 3 7 -3 0 -5 -4 1 -3 -5 -2 0 -1 -4 -3 -4
G 0 -3 -1 -2 -4 -3 -3 8 -3 -6 -5 -2 -4 -5 -3 0 -2 -4 -5 -5
H -2 0 1 -2 -4 1 0 -3 11 -5 -4 -1 -2 -2 -3 -1 -3 -4 3 -5
I -2 -4 -5 -5 -2 -4 -5 -6 -5 6 2 -4 2 0 -4 -4 -1 -4 -2 4
L -2 -3 -5 -5 -2 -3 -4 -5 -4 2 6 -4 3 1 -4 -4 -2 -2 -2 1
K -1 3 0 -1 -5 2 1 -2 -1 -4 -4 7 -2 -5 -2 0 -1 -4 -3 -3
M -1 -2 -3 -5 -2 -1 -3 -4 -2 2 3 -2 8 0 -4 -2 -1 -2 -1 1
F -3 -4 -4 -5 -4 -5 -5 -5 -2 0 1 -5 0 9 -5 -4 -3 1 4 -1
P -1 -3 -3 -2 -4 -2 -2 -3 -3 -4 -4 -2 -4 -5 11 -1 -2 -5 -4 -4
S 2 -1 1 0 -1 0 0 0 -1 -4 -4 0 -2 -4 -1 6 2 -4 -3 -2
T 0 -2 0 -2 -1 -1 -1 -2 -3 -1 -2 -1 -1 -3 -2 2 7 -4 -2 0
W -4 -4 -6 -6 -3 -3 -4 -4 -4 -4 -2 -4 -2 1 -5 -4 -4 16 3 -4
Y -3 -3 -3 -5 -4 -2 -3 -5 3 -2 -2 -3 -1 4 -4 -3 -2 3 10 -2
V 0 -4 -4 -5 -1 -3 -4 -5 -5 4 1 -3 1 -1 -4 -2 0 -4 -2 6

There is a repository of BLOSUM matrices at:

ftp://ncbi.nlm.nih.gov/repository/blocks/unix/blosum/BLOSUM/

Table 3 gives the BLOSUM 62 matrix in 1/3 bit units (corresponds roughly with $10 \log_{10}$ used by Dayhoff, since $3/\ln
2 \approx 10/\ln 10$ because $10^3 = 1000 \approx 1024 = 2^{10}$).

The BLOSUM matrices are now considered to be superior to PAM matrices and are widely used.

LOD scores and single sequences

Recall finding runs of heads in a random sequence, $X_1, X_2, \dots$ of i.i.d. random variables, where P(Xi = H) = p and P(Xi = T) = q = 1-p. If we take qH =1 and qT = 0 to be the alternative model, then $s_H = \log_{1/p}(1/p) = 1$ and $s_T = \log_{1/p} (0/q) =
\infty$. Let H0 = 0 and

 \begin{displaymath}
H_i = \max \{ 0, H_{i-1} + s_{X_i} \}
\end{displaymath} (14)

The local maxima of the random Hi series gives us the lengths of all the runs. Peaks that are higher than $\log_{1/p}
n$ in $H_1 \dots H_n$ are ``significant'' if the deviation is high according to the extreme value distribution. This idea generalizes.

Now let $X_1, X_2, X_3, \dots$ be a random molecular sequence with values $A_k, 1 \leq k \leq K$. We'll assume here that the Xis are i.i.d.. When we consider ``Hidden Markov models'' later on, the situation will be more complicated.

Suppose that we are trying to identify some ``inhomogeneity'' in the random sequence. The simplest assumption is that there are really 2 distributions of letters that are begin used, with probabilities pi and qi respectively. Some stretches might be generated using the first distribution and others by the second. If we define $s_i = \log
(q_i/p_i)$, H0 = 0 and H(i) as in equation 15, then the regions between a zero of Hi and a local maximum would be estimates of where the qi distribution would be assumed to act.

As an example, we can consider the genome of various Pyrococcus species that live adjacent to deep sea vents at very high temperature and pressure. They have ``AT'' rich genomes, except for RNA genes that at ``GC'' rich. Suppose that the pis are the probabilities in the ``AT'' rich region, or perhaps of the entire genome. The qis would be the probabilities in the ``GC'' rich areas, obtained by guessing or by examining known RNA genes.


next up previous
Michael Zuker
Professor of Mathematical Sciences
Rensselaer Polytechnic Institute
2003-03-18