A random variable is a measurable function
,
where S is a set,
is a sigma-algebra on
S, and
is a measure defined on
such that
.
The set M is a metric space in general. Usually it is
or a subset thereof. The triple
is called
a ``probability space''.
A random variable induces a measure,
on the target space,
M. This is called the ``distribution of X''.
All of this is more formal and general than what we shall need for
this course.
is usually written as P, or Pr; a shorthand for
``probability''. S is often finite, in which case
is the
collection of all subsets of S. In this case, it suffices to define
P for all
.
S could be countably infinite, as in the
non-negative integers. Finally, S could be
,
or an interval
such as [0,1] or
.
In this case, the measure is often
expressed in terms of a ``density function'', f. Thus
.
Independence
An event is a subset of S or M. Events
are
called ``independent'' if
Note that equation 1 is written for events in S. If the
ei were events in M, we would write
Random variables are called independent if equation 2 holds
for any collection of subsets of M. When events or random variable
are independent, people often say that ``they have nothing to do with
one another''. This is imprecise. A better way to think about
independence is to think ``product measure''. Random
variables
are called i.i.d. (independent
and identically distributed) if they all have the same distribution
and if any finite collection are independent.
Examples of independence and non-independence
EXAMPLE 1 X and Y are discrete random variables on the same space. The set of all possibilities for the values of X and Y are given in Table 1.
|
How can we prove that X and Y are independent in Table
1. It's simple. If pi,j is the probability in row i
and column j, then
.
EXAMPLE 2 X and Y are discrete random variables on the same space. A new set of all possibilities for the values of X and Y are given in Table 2.
|
How can we prove that X and Y are not independent in Table
2. Again, it's simple. Find a counterexample. For example,
and
.
The product of these
numbers is 0.05825 which is not equal to 0.05.
Mutually exclusive events
This concept is easier than that of independence. Events
are called ``mutually exclusive''
.
In this case,
Note that equation 3 does not imply mutually exclusive
events, since some events can have probability 0. In general,
Conditional probability and Bayes Theorem
If A and B are events, then we define the ``conditional
probability of A given B'',
by
Bayes theorem (This bit of trivia should really be called ``Bayes lemma'' at best. Bayes was a very minor figure in probability theory and certainly doesn't deserve to have his name used so often.)
Equation 5 can be rewritten as:
From this, it is easy to derive:
Example: Using Bayes Theorem
General Widgets (GW) is a company that manufactures widgets. Each widget needs 3 doodles. The doodles are manufactured by 4 different companies, C1, C2, C3 and C4. 10% of C1's doodles are defective, 5% of C2's doodles are defective, 2% of C3's doodles are defective and 1% of C4's doodles are defective. Because of price and availability, 40% of GW's doodles come from C1, 30% come from C2, 20% from C3 and the final 10% from C4. A widget fails if any one of its doodles is defective. The failures are independent of one another.
We first need to answer the question: ``What is the probability that a randomly selected doodle is defective?''
Let D be the event that a
doodle is defective. Let Ci be the event that company Ci produces
a doodle,
.
Then:
| = | ![]() |
||
| = | ![]() |
||
| = | |||
| = | 0.06 |
If a doodle is defective, what is the probability that it was made by C1?
| = | ![]() |
||
| = | ![]() |
||
| = |
Mean, variance and characteristic functions
For a real valued random variable, X, we define its ``expectation'',
or ``expected value'' as
![]() |
(8) |
![]() |
(9) |
![]() |
(10) |
The
nth moment of X ``about the mean is defined by
![]() |
(11) |
The ``characteristic function'',
,
is defined by
| (12) |
What good are characteristic functions?
1.
have the same distribution
2. If
are independent and
,
then
![]() |
(13) |
Some important probability distributions.
BINOMIAL DISTRIBUTION
Apart from the trivial case where X = 0 with probability 1, the
simplest case of a random variable is one that takes on just 2
values. These are often taken as 0 and 1, although -1 and +1
are sometimes used. Sometimes, we might write ``heads'' and ``tails''
for the 2 possibilities, indicating that X represents the results of
a coin toss. We'll adopt the following convention. X takes on the 2
values, 1 and 0 with probabilities p and q respectively, where
p + q = 1. Let
,
where the Xi are
independent copies of X. What is the distribution of Y. Clearly
Yn can take on any value between 0 and n. In fact, simple
combinatorics gives us:
![]() |
(14) |
POISSON DISTRIBUTION
Suppose that p is ``small'' and the ``n'' is large. To be more
explicit, let's set
and see what happens as
.
Stirling's formula gives the limiting value for
as
.
The characteristic
function is
.
It's easier to do this limiting
process in ``reciprocal space''. First,
.
Thus
.
This approaches
as
.
This limiting distribution is called the ``Poisson distribution''. Think of it as the ``law of rare events''.
UNIFORM DISTRIBUTION
This is given by a constant density function,
for
and f(x) = 0 otherwise.
GAUSSIAN DISTRIBUTION
Density is
Note that
and
are the mean and variance.
This is called the ``normal distribution''. The notation
``N(
,
)'' refers to this distribution.
If
are i.i.d.
with mean
and variance
,
then
![]() |
(15) |
EXPONENTIAL AND GAMMA DISTRIBUTIONS
The density of the exponential distribution is
for
.
The mean and variance are
and
,
respectively.
The density of the gamma distribution is given by
![]() |
(16) |
Note that
yields the exponential distribution. If X and
Y are independent random variables with gamma distributions
and
,
then X+Y has distribution
.
Thus sums of exponential distributions give
gamma distributions with increasing parameter
.
GEOMETRIC DISTRIBUTION
This is the discrete equivalent of the exponential distribution.
for some constant
,
q = 1 - p. A
simple calculation shows that
.
INFORMATION CONTENT:
ENTROPY AND RELATIVE ENTROPY
For a random variable, X with discrete probability values
the ``entropy'' is defined by
If X takes on finitely many values, n, a trite calculation using
Lagrange multipliers shows that H(X) is maximized when
.
In this case, the entropy is
.
Entropy is
maximized when the distribution is ``most spread out''. Also, note
that this spread out distribution is increasing with the number of
states.
Entropy was used by Shannon in his effort to understand information
content of messages. Let's consider the following problem. We're
receiving a message that's been encoded into bits (0's and 1's). If
there were no noise (no error) at all, we would have 1 bit of
information per bit received. Think of the signal as Yi +
Xi. Yi is the pure signal with no error. It is
``deterministic''. Xi is the ``random error''. It is 0 (no error) with
probability p and 1 (bit swap error) with probability 1-p. The
entropy is
.
This approaches 0 as
.
That is, small p implies small error. For p = 1/2,
the entropy is
in bits. That is, all information is
lost. Note that p = 1 is useful, since we can simply swap bits.
In this course, we will be computing entropy for ``sample
distributions''. If random variables
are
i.i.d., then putting a weight of
on each Xi yields a
sample distribution.
For 2 random variables, X and Y, define the ``relative entropy''
of X with respect to Y as
![]() |
(17) |
Finally we define the ``mutual information'' of X with respect to Y as
![]() |
(18) |
Mutual information is the relative entropy of the joint probability distribution of X and Y and the product distribution. As such, it is a good measure of the independence of X and Y.
Coin tossing and longest runs
Infinite tossing of a consistent coin.
Let
be a series of i.i.d. random
variables such that
and
,
where
0 < p < 1 and q = 1 - p. X is 1 for a head and is 0 for a tail.
We want to look at this series in a novel way. Let Ri be the length of the ith ``run of heads'', to be called ``run'' henceforth. By definition, a run begins at the first position, or at the first position after a T. A run ends when as soon as a T is encountered. The length of a run is the number of heads in the run. The tail is not counted. This means that we are including runs of zero length.
Another way of stating this is by defining Bi to be the beginning index of the ith run. Then B1 = 1 by definition and Bi = Bi-1 + Ri-1 + 1 for i > 0. Note that R1 = 0 if X1 = T.
The beginnings of runs for a particular sequence of coin tosses are shown below using the symbol |.
2 4 6 8 10 12 14 16
T T H H H T H H H H T T H T T H H ...
| | | | | | | |
| i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
| Bi | 1 | 2 | 3 | 7 | 12 | 13 | 15 | 16 | |
| Ri | 0 | 0 | 3 | 4 | 0 | 1 | 0 | ? |
The sequence of heads and tails are independent, and the new sequence of runs, each one ending with a tail, are also independent. This means that the random variables, Ri, are independent. particular, the Ris are not independent. In a sequence of finite length, n, we cannot say when the last run ends unless Xn = 0.
What is the distribution of the Ris? They have a common
distribution. In fact,
.
That is, Ri + 1 has the geometric distribution with parameter q.
Ri + 1 is the number of tosses until a T is encountered.
In particular,
The expectation is given by
We can now very easily answer the question ``What is the expected
number of runs of any particular size in n tosses?''.
A run of size k may begin anywhere with the same probability
q2pk, except for the first position. In any case, the expected
number of runs is simply q2pk(n-k+1). If we let
and think of n as ``large'', then:
Using a fixed number of coin tosses, n, results in run lengths that do not end cleanly at n. We will now replace the coin toss by a ``Poissonized version'' that has much better properties and that approximates the true coin toss sufficiently well for our conclusions to transfer.
Let
be a series of i.i.d. random
variables with the common ``run'' distribution discussed above
(equations 19 & 20). Let N be a Poisson random
variable independent of the R's and with mean nq. Consider the
``Poissonized'' coin toss:
consisting of the empty coin toss if N = 0, where R'i(H) means
``insert R'i heads in this position''. The expected length of this
``Poissonized'' coin toss is:
This is as expected and by design. Now let LN be the length of the longest run in the Poissonized coin toss. LN = 0 when N = 0.
Then
We now replace k by
,
obtaining
We now think of
while
is fixed. As
n becomes large, the behavior of LN is a better and better
approximation to Ln. In fact,
The distribution in equation 25 is an example of the ``extreme value distribution''.
It's density is given by differentiation and is
.
Moments are given by:
To evaluate the mean and variance, we digress briefly to consider the
``gamma function''. If the real part of z is > 0, we can write
![]() |
(28) |
![]() |
(29) |
| = | ![]() |
||
| = | (31) |
![]() |
(32) |
Differentiation of equation 30 yields:
The end result of these calculations is that
What we have not proved here is that LN for the ``Poissonized coin toss'' is a good enough approximation to the longest run problem with exactly n coin tosses. However, this is not the application we are interested in.
Computer simulations of coin tossing.
In 216 (65536) tosses of a fair coin, the computer generated 32659 heads and 32877 tails. There were 16400 runs of heads of size > 0. This and various other simulations gave results shown in Figures 1, 2, 3 and 4.
![]() |
Local alignment without gaps
Let's return to sequence alignment.
and
are 2 biomolecular sequences of the same type (from the
same alphabet). Let's refer to the alphabet as
,
where, for DNA, K=4 and
.
For proteins, K=20 and
S,T,V,W,Y}. This time, however, we
will regard the {ai} as i.i.d. random variables, and the same
with {bj}. Let
.
We do not assume that the
ais and bjs have the same distribution. The degree to which they
may differ is restricted, and is beyond the scope of this course. For
example, if
for all members of the
alphabet, then nothing useful results.
![]() |
Let Lm,n be the largest non-negative integer, k, such that
there is some
and some
such that
.
In other words, Lm,n
is the ``longest run of exact matches''. How is this problem similar
to the ``longest run of heads matches'' in coin tossing? We can think
of the familiar m x n ``dot plot'' matrix, where the ``blanks''
are ``tails'' and the ``dots'' are ``head''. Instead of a linear array
of coin tosses, we have a 2D grid and search for the maximum number of
dots in a row along successive diagonals, j-i = d, for
.
The total length of all these diagonals is mn. Of
course, the dots are not independent random variables, even though
they are independent of each other along any diagonal. Let ph and
p'h be
and
,
respectively. Then
,
so that
,
while
.
These 2 quantities will be
unequal in general.
It turns out that the non-independence of the dots on different
diagonals and the effect of stringing all the diagonals
together into a long ``coin toss'' of size mn doesn't make a lot of
difference. The basic results hold for large mn. That is
is an excellent approximation to
and
is an excellent
approximation to
.
Now we consider local alignment with a scoring function w. Note that
w(ai,bj) is now a random variable. We will assume that
,
so that no gaps will be predicted. Let
H(m,n) be the maximum local alignment weight achieved by aligning
with
.
We will now define sufficient conditions on w(ai,bj) so that H(m,n) behaves like the longest run in a coin toss.
Let
and
.
Also, we'll write
wh1,h2 for the more cumbersome
w(Ah1,Ah2).
Suppose that the match weights,
satisfy:
.
Then
![]() |
Proof: Define
.
Note that f(0) =
1. The negative expected score assures that f'(0) < 0. Thus
.
The existence of
h1' and h2' as described ensures that
.
Thus
.
The uniqueness of
follows from the fact
that f'' > 0 (convex function).
An immediate consequence of this simple theorem is that we can define
new probabilities:
Recall equation 25. We can rewrite it replacing
with
and derive:
In comparison, Karlin showed that for the H(m,n) statistic, with the
weights satisfying the above conditions, there is another constant,
K* >0 such that
![]() | Michael Zuker Professor of Mathematical Sciences Rensselaer Polytechnic Institute 2004-02-10 |