Markov chains
We have already discussed sequences of independent, identically
distributed, or i.i.d., random variables. Let's consider a
sequence of random variables,
that each take on
the same finitely many values. In our applications, the finite set
will be the alphabet (DNA or protein),
of size K, as defined previously. If the Xis are
independent, then a single ``vector'' of numbers is needed to
characterize the common distribution.
This vector,
is
,
where
ph = P(Xi = Ah),
since we'll think of
as a single
row, or as a 1 x K matrix.
In a Markov chain, instead of being independent, each Xi will
depend on its immediate predecessor, Xi-1, for i>0. Instead of
specifying K probabilities, we now have to specify K x K
numbers that describe all possible conditional probabilities. Let
![]() |
(1) |
Note: We will often use
and (qk,l) in place of
and
(pk,l) to identify these probabilities with those in the
numerator of the ``log-odds'' formula.
Let
be the nth power of
,
and let
(pk,l(n))
be the element in row k and column l. By definition,
,
where I is the (K x K is understood) identity matrix. What is
the interpretation of pk,l(n)?
It is easy to see that pk,l(n) is the probability of going
from Ak to Al in n steps. That is,
For n = 1, this is true by definition. Suppose it is true for some
.
Then
Let
be a vector that describes the distribution of Xn. The initial
distribution,
must be given along with the K x
K matrix of probabilities.
It is easy to see that
A distribution,
,
is called
``stationary'' if it has the property:
,
is a stationary distribution. It is not unique
in general.
A Markov chain is called ``ergodic'' if
.
In words, it is possible to move from
every state to every other state with non-zero probability after a
finite number of steps.
If
is ergodic, then there is a unique stationary distribution,
and for any starting distribution,
,
Any Markov chain can be decomposed into separate ergodic
components. To be specific,
a
permutation matrix,
acting on
,
such that
![]() |
(3) |
In general,
We can think of the discrete n as a ``time'' or ``space'' variable. If we think of n as ``time'', then each move from n to n+1 can be called a ``generation'' or an ``epoch''. For example, in molecular evolution, an ``epoch'' might correspond to a million years.
Markov process
It will be useful to think of finite valued Markov chains as
functions of a continuous variable, t. Let
be a K
x K matrix such that
if
,
and
.
For
any real
,
define P(t) = eAt.
Then
is a Markov chain matrix.
Proof: P(0) = I, which is a trivial Markov chain
matrix. Note that
,
so
that
.
This is true
for all higher powers of
.
Now choose some
such
that
is called the ``infinitesimal generator'' of the Markov process.
PAM theory
``PAM'' stands for ``Percent Accepted Mutation'', and is a term that was created by Margaret Dayhoff, a giant in protein sequence analysis. The basic evolutionary assumptions in ``PAM theory'' are:
If a (discrete time) Markov chain is used, the ``n'' is regarded as
a time unit and is called a ``generation'' or ``epoch''. For example,
an epoch might correspond to 1,000,000 years. Margaret Dayhoff
introduced a convention that the ``base'' matrix,
,
corresponds
to what she called a ``1 PAM mutation data matrix''. What this means
is that the average probability of change is 1%. This probability
is defined as:
In a Markov process,
![]() |
(4) |
According to PAM theory, sequences that are compared by alignment are assumed to evolve by the evolutionary process described above.
In defining scoring weights for comparing DNA or protein sequences, we
will adopt the notation
and
to describe a Markov chain
matrix or a Markov process. We'll assume that the process is ergodic.
In matching Ak with Al in an alignment, we distinguish between
two possibilities. One is the ``null hypothesis'' that the sequences
are unrelated and that any ai and bj are chosen independently
from the stationary distribution. The probability of this is
fk*fl*, and corresponds to pkpl' in the discussion of
weights and probabilities. The other possibility is that the matched
residues are related by the evolutionary model. In this case, we imagen
starting at Al and ending at Ak, with probability
fl*ql,kn (or
fl*q(t)l,k) for some n (or t). We
can't distinguish this from starting at Ak and ending at Al, so
we take (
fl*ql,kn + fk*qk,ln)/2 as the probability of
this event. Note that this is symmetric in k and l. In scoring
this match, we define
Why do we use ``log-odds'' (also called ``lod'') scores? Lets think
about a local alignment without gaps. (Gaps complicate things!) We
find some i, j and m such that
is maximized. The sum of logarithms is the
logarithm of a product. If the logarithm is maximized, then so is the
product. Because of our independence assumption, the product is the
quotient of 2 probabilities: ``the probability that these 2 strings
come from a common ancestor'' and ``the probability that these 2
strings are chosen randomly from some distribution''. Thus, finding a
best ungapped local alignment maximizes a ratio of 2 probabilities.
DNA evolutionary models
The simplest model is known as the ``Jukes-Cantor'' 1 parameter
model. In its discrete form, the mutation data matrix is
![]() |
We will immediately generalize this model to the ``Kimura 2
parameter'' model. In this case, we distinguish between transitions
and transversions, reserving a for the transition probability
and introducing b as the transversion probability. Let
.
Then r is the expected ratio of transitions
to transversions, since a transition can happen in a single way, while
there are always two choices for transversions. r=1/2 in the
Jukes-Cantor model. A more realistic value of r is 1.
In this case,
This model is also best expressed using the instantaneous form,
![]() |
Then
The function f1, f2 and f3 correspond to ``no change'',
``transversion'' and ``transition'', respectively. The time, t is in
100 PAM units. That is 0.01 is roughly 1 PAM. The equilibrium
distribution is
(1/4,1/4,1/4,1/4) in all cases. Equation
5 gives us the appropriate scores. Note that
is
symmetric, and that fk* = 1/4 for
.
This means
that
,
so that
Note how the transition and transversion scores
as
,
that exact matches always have a positive score,
and that transversions always have a negative score. It's easy to
derive that transitions have a positive score
Evolutionary distance
The problem with aligning sequences using a Jukes-Cantor, Kimura or
related model is that we usually have no a priori idea of what
the evolutionary time, t is. Can we estimate t? Suppose that we
have an alignment. Ignoring gaps, we can count the number of
identities, transitions and transversions, and express these in
fractional form:
fid, fts, ftv, respectively. These, in
turn, are estimates of f1, f3, f2 from 10,
respectively. Letting
and
,
we can write, using 11,
This gives us
In practice, we can align 2 DNA sequences using some ``reasonable'' value of the evolutionary distance, t. This will yield a revised estimate of t from equation 14, and hence to a new scoring function. This gives a new alignment, revised values of fid, ftr, ftv and finally of t. The whole process can be repeated a small number of times until it converges.
A protein evolutionary model
For protein, we will not deal with a theoretical model as above. We could imagine a Kimura or Jukes-Cantor model acting on the DNA of genes, and this would lead to an evolutionary model of the resulting proteins, but such a model would be unrealistic since DNA evolution is constrained in coding regions. For this reason, Margaret Dayhoff used observed changes in homologous positions of related proteins.
At this point, I should state that molecular sequence are never
homologous. The word ``homologous'' refers to common
ancestry. The sequences of 2 DNA or protein molecules may be similar,
according to an alignment, but they are never ``homologous''. If the
two sequences are sufficiently similar, we might infer
homology. Molecular sequences change more rapidly than molecular
structure. For proteins, the 3D shapes of two distantly related
proteins might be quite similar even though the sequences are
completely dissimilar. Based on this well-known observation, Dayhoff
looked at hundreds of well-aligned protein sequences that fell into a
number of distinct classes, such as globins, cytochrome C's and so on.
They were so similar that they could be confidently aligned ``by
hand''. This gave her a database of 1572 observed substitutions, and
this was used to construct the ``accepted point mutation
matrix''. We'll call this matrix
.
It is the precursor to
the 20 x 20 1 PAM Markov chain matrix,
.
Element mi,j' of
(
)
is the number of times that
Ai is replaced by Aj, or vice versa. How do we get
asymmetry out of this, and how do we normalize to the 1 PAM matrix,
? Implicit in the computation that follows is the assumption
that observed changes have occurred just once. Dayhoff computed an
empirical distribution,
from the
alignments. She defined the term ``relative mutability'' of Aj by
the total number of changes involving that amino acid divided by its
frequency in the database of aligned sequences. That is:
The probability that Aj changes is 1 - mj,j,
which is proportional to the mutability, so we can write
This defines the 1 PAM matrix, assuming that the probability of change of an amino acid is proportional to its observed mutability. By construction, the observed distribution, (fi) is also the stationary distribution. Table 1 gives the relative mutabilities and distribution of the amino acids the Dayhoff used.
|
|
Table 2 gives Dayhoff's 1 PAM matrix in its entirety.
By construction,
,
so the
LOD matrix for PAM n is
.
The
symmetrized matrix,
,
is called a
``relatedness odds matrix'',
.
LOD scores without Markov models
BLOSUM - BLOcks SUbstitution Matrix
Dayhoff, because of limited data in her time, had to work with closely related sequences and extrapolate forward in time using Markov models. By 1990, it was no longer necessary to extrapolate, since relatedness matrices and LOD scores could be computed directly from groups of well-aligned sequences.
Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. We'll discuss the creation of blocks later on when we discuss multiple sequence alignment. We can think of blocks as the result of cutting out the best regions of multiple alignments.
Example
IPB000337: 7tm_3 G-protein coupled receptors family 3 (Metabotropic glutamate receptor-like) from the Blocks database.1
There are 16 blocks in this family, labeled
IPB000337A to IPB000337P.
Block IPB000337C is:
About 10 years ago, the Henikoff's began to construct LOD matrices
starting with over 2000 Blocks. In every column of every block, all
rows are compared with each other once. That is, rows k1 and k2
are examined
.
The total
number of (Aj,Ai) pairs in all the row comparisons of all the
columns of all the blocks is stored as fi,j, for
.
The probability of each (Ai,Aj) or (Aj,Ai) pair is defined
simply as:
The probability of occurrence of the ith amino acid in an
(Ai,Aj) pair is
Then the probability, ei,j, for the juxtaposition of Ai and
Bj purely by chance is given by
The scores are given by
In practice, the scores are multiplied by 2 and then rounded to the
nearest integer to give ``BLOSUM'' matrices in half-bit units.
Two quantities are computed. The first is the familiar expected match
weight:
This basic BLOSUM matrix can be compared to a Dayhoff PAM matrix with the same relative entropy. Relative entropy is maximum for 0 PAM and decreases with increasing n (or t). How can we generate BLOSUM matrices with different relative entropies?
A ``BLOSUM n'' matrix,
is constructed as
follows. In each block, any 2 sequence that have n% or more of
their amino acids identical are combined into a single group. This
results in a number of groups that becomes smaller (limit is 1) as n
increases. The counting of pairs is now done only between groups,
although a group to group comparison uses an average value. Suppose
group k1 has l1 sequences and group k2 has l2
sequences. Then instead of adding a 1 or a 0 to all the
fi,js, we add the number of Ais in group k1 times the
number of Ajs in group k2 plus the number of
Ajs in group k1 times the
number of Ais in group k2 (if
)
and divide the entire
sum by
l1 x l2. Thus the fi,js now take on fractional
values. The result of this process is that for increasing n, changes
in more distantly related proteins are counted while those in more
closely related proteins are weighted less or eliminated altogether
when an entire block becomes a single group. This gives a family of
BLOSUM matrices with entropies that increase with n.
Dayhoff PAM n matrices and BLOSUM n matrices are considered equivalent
when their relative entropies are (roughly) the same. For example, PAM
250 is comparable to BLOSUM 45 (relative entropy of
bits). PAM 120 and BLOSUM 80 are equivalent (1 bit). PAM 160 and
BLOSUM 62 are equivalent and intermediate, with about 0.7 bits.
The expected values for PAM 160 and BLOSUM 62 are -0.57 and -0.52,
respectively.
|
There is a repository of BLOSUM matrices at:
ftp://ncbi.nlm.nih.gov/repository/blocks/unix/blosum/BLOSUM/
Table 3 gives the BLOSUM 62 matrix in 1/3 bit units
(corresponds roughly with
used by Dayhoff, since
because
).
The BLOSUM matrices are now considered to be superior to PAM matrices and are widely used.
LOD scores and single sequences
Recall finding runs of heads in a random sequence,
of i.i.d. random variables, where
P(Xi = H) = p and
P(Xi =
T) = q = 1-p. If we take qH =1 and qT = 0 to be the alternative
model, then
and
.
Let H0 = 0 and
The local maxima of the random Hi series gives us the lengths of
all the runs. Peaks that are higher than
in
are ``significant'' if the deviation is high
according to the extreme value distribution. This idea generalizes.
Now let
be a random molecular sequence with
values
.
We'll assume here that the Xis are
i.i.d.. When we consider ``Hidden Markov models'' later on, the
situation will be more complicated.
Suppose that we are trying to identify some ``inhomogeneity'' in the
random sequence. The simplest assumption is that there are really 2
distributions of letters that are begin used, with probabilities pi
and qi respectively. Some stretches might be generated using the
first distribution and others by the second. If we define
,
H0 = 0 and H(i) as in equation 15, then the
regions between a zero of Hi and a local maximum would be estimates
of where the qi distribution would be assumed to act.
As an example, we can consider the genome of various Pyrococcus species that live adjacent to deep sea vents at very high temperature and pressure. They have ``AT'' rich genomes, except for RNA genes that at ``GC'' rich. Suppose that the pis are the probabilities in the ``AT'' rich region, or perhaps of the entire genome. The qis would be the probabilities in the ``GC'' rich areas, obtained by guessing or by examining known RNA genes.
![]() | Michael Zuker Professor of Mathematical Sciences Rensselaer Polytechnic Institute 2003-03-18 |