**A simple example of maximum likelihood estimation**

**http://mw511.biol.berkeley.edu/like/mle/**

**Consider the simple procedure of tossing a coin with the goal of
estimating the probability of heads for the coin. The probability of heads
for a fair coin is 0.5. However,**

**for this example we will assume that the probability of heads is
unknown (maybe the coin is strange in some way or that we are testing whether
or not the coin is fair). The**

**act of tossing the coin n times forms an experiment--a procedure
that, in theory, can be repeated an infinite number of times and has a
well-defined set of possible**

**outcomes. When flipping a coin, there are 2n possible sample outcomes,
w. The set of possible outcomes forms the sample space, W. For n = 3 tosses
of a coin, the**

**sample space is**

**W = {HHH, HHT, HTH, THH, TTH, THT, HTT, TTT},**

**where H denotes heads and T denotes tails. An event is a collection
of sample outcomes and is said to occur if the outcome of a particular
experiment is a member of the**

**event. Often, it is useful to think of a function whose domain is
the sample space, W. Such a function is known as a random variable. Examples
of random variables for the**

**coin flip experiment are (i) the number of times heads appear, (ii)
the number of times tails appears, and (iii) the number of flips until
a head appears. Random variables are**

**often denoted by uppercase letters, often X, Y, and Z.**

**As an example of likelihood estimation, the coin toss example will
more fully developed. The three main components of the statistical approach
are (i) the data, (ii) a model**

**describing the probability of observing the data, and (iii) a criterion
that allows us to move from the data and model to an estimate of the parameters
of the model.**

**Data: Assume that we have actually performed the coin flip experiment,
tossing a coin n = 10 times. We observe that the sequence of heads and
tails was H, H, H, T,**

**H, T, T, H, T, H. We will denote heads by 1 and tails by 0; hence,
our data will be coded as**

**x = {1, 1, 1, 0, 1, 0, 0, 1, 0, 1}**

**where x is a vector with elements x1 = 1, x2 = 1, x3 = 1, x4 = 0,
..., x10 = 1. In tossing the coin, we note that heads appeared 6 times
and tails appeared 4 times.**

**Model: An appropriate model that describes the probability of observing
heads for any single flip of the coin is the Bernoulli distribution. The
Bernoulli distribution has**

**the following form:**

**,**

**where p is the probability of heads and xi is the experimental outcome
of the i-th coin flip (i.e., heads or tails). The vertical line in the
function means "given". Note**

**that if p = 0.6, the function returns 0.6 if xi = 1 and 0.4 if xi
= 0. Many independent Bernoulli's gives the binomial distribution.**

**Criterion: Now that we have specified the data, x, and the model,
f(xi |p), we need a criterion to estimate the parameters of the model.
Here, the parameter of**

**interest (the parameter to be estimated) is p--the probability that
heads appears for any single toss of the coin. Several methods could be
applied at this point, including the**

**method of moments, least squares, and Bayesian estimation. However,
we will consider only the the method of maximum likelihood here.**

**The method of maximum likelihood was first proposed by the English
statistician and population geneticist R. A. Fisher. The maximum likelihood
method finds that**

**estimate of a parameter which maximizes the probability of observing
the data given a specific model for the data.**

**The likelihood function is defined as**

**The likelihood function is simply the joint probability of observing
the data. The large P means "product". The likelihood function is obtained
by multiplying the probability**

**function for each toss of the coin. In the case of the coin flip
experiment where we are assuming a Bernoulli distribution for each coin
flip, the likelihood function becomes**

**Taking the log of the likelihood function does not change that value
of p for which the likelihood is maximized. After taking the log of both
sides of the equation this**

**becomes:**

**The following figures show plots of likelihood, L, as a function
of p for several different possible outcomes of n = 10 flips of a coin.
Note that for the case in which 3**

**heads and 7 tails were the outcome of the experiment, that the likelihood
appears to be maximized at p = 0.3. Similarly, p = 0.5 for the case of
5 heads and 5 tails, p =**

**0.8 for the case of 8 heads and 2 tails, and p = 0.9 for the case
of 9 heads and 1 tail. The likelihood appears to be maximized when p is
the proportion of the time that**

**heads appeared in our experiment. This illustrates the "brute force"
way to find the maximum likelihood estimate of p.**

**Instead of determining the maximum likelihood value of p graphically,
we could also find the maximum likelihood estimate of p analytically. We
can do this by taking the**

**derivative of the likelihood function with respect to p and finding
where the slope is zero. When this is done, the maximum is found at . The
estimate of p (the**

**probability of heads) is just the proportion heads that we observed
in our experiment.**

**Hypothesis testing in a likelihood framework**

**Based on the data, a plausible measure of the relative tenability
of two competing hypotheses is the ratio of their likelihoods. Consider
the case in which H0 may specify**

**that Q element of w0, where w0 is a subset of the possible values
of Q and H1 may specify that Q element of w1, where w1 is disjoint from
w0. Here Q represents the**

**parameters of the model describing the data (e.g., p from the Bernoulli
distribution). The likelihood ratio is**

**.**

**Here, the likelihood calculated under the null hypothesis is in the
numerator and the likelihood calculated under the alternative hypothesis
is in the denominator. When L is**

**small, the null hypothesis is discredited whereas when L is large,
the alternative hypothesis is discredited.**

**One interesting fact concerns the case in which one hypothesis is
the subset (or special case) of another more general hypothesis. When the
null hypothesis is a subset (or**

**special case) of the alternative hypothesis, -2logL is distributed
according to a c2 distribution with q degress of freedom under the null
hypothesis, where q is the**

**difference in the number of free parameters between the general and
restricted hypotheses.**

**As an example, consider the coin toss example. Two hypotheses will
be considered. H0 will denote the restricted (null) hypothesis whereas
H1 will denote the unrestricted**

**(or general) hypothesis. We want to test the null hypothesis that
the coin is fair. Hence, under H0 p = 0.5. Under H1, p takes the value
between 0 and 1 which**

**maximizes the likelihood function. Under H0, the likelihood is**

**Under H1, the likelihood is**

**The likelihood ratio for a test of the null hypothesis that p = 0.5
is**

**To calculate the likelihood under the null hypothesis, one simply
substitutes 0.5 in for p in the likelihood function. We already discussed
how to calculate the likelihood**

**under the alternative (unrestricted) hypothesis. The likelihood under
the alternative hypothesis is maximized by substituting**

**for p in the likelihood function. The two hypotheses differ by 1
free parameter; the parameter p is fixed under H0 and free to vary under
H1. Hence, -2logL can be**

**compared to a c2 with 1 degree of freedom. If -2logL is greater than
3.84, then the null hypothesis that p = 0.5 can be rejected with a level
of Type I error (falsely rejecting**

**the null hypothesis) of 5%.**

**Instead of using the c2 distribution to test the significance of
the likelihood ratio, the null hypothesis can be tested in another way
-- through simulation under the null**

**hypothesis. This approach is called Markov simulation or parametric
bootstrapping. To illustrate this approach, we will again consider the
coin toss example. Let us assume**

**that we have tossed a coin 1000 times and have noted that heads appears
450 times. Once again, we want to consider as a null hypothesis that the
coin is fair (p = 0.5).**

**The alternative hypothesis will again be the likelihood calculated
with an unrestricted p.**

**The likelihood ratio test statistic for this example is -2logL =
10.018. Is this value greater than we would expect if p= 0.5? A computer
program was written that generates**

**1000 coin tosses under the assumption that p = 0.5. For every computer
replicate, we generate 1000 coin tosses, and calculate the likelihood ratio
test statistic (-2logL).**

**After this procedure is repeated many thousands or millions of times,
a histogram of the frequency that different -2logL values appear under
the null hypothesis is**

**constructed. The following figure shows the results of such a simulation:**

**The 95% cut-off for this simulated distribution is about 3.8, which
closely corresponds to what we would expect from the c2 distribution (3.84).
In fact, we can directly**

**compare the simulated distribution to a c2 distribution with 1 degree
of freedom. The two distributions are very similar.**

**In this case where we had 450 heads out of 1000 coin tosses, we can
reject the null hypothesis that p = 0.5 with greater than 0.995 probability;
we would not expect to**

**observe 450 heads out of 1000 tosses if the coin was fair. The coin
is biased towards tails.**

**In this case where we had 450 heads out of 1000 coin tosses, we can
reject the**

**null hypothesis that p = 0.5 with greater than 0.995 probability;
we would not**

**expect to observe 450 heads out of 1000 tosses if the coin was fair.
The coin is**

**biased towards tails.**

**Maximum Likelihood Estimation of Phylogeny**

**Data: The observed data are s aligned DNA sequences, each n nucleotides
in length. The sequences are represented by a s x n matrix X = {xij}, where
xij is**

**the nucleotide at the jth site in the ith sequence. An example data
matrix for 3 sequences (species) is**

**,**

**where x _{11} = A, x_{12} = T, x_{21} = C,
and so on.**

**Model: The sequences represent observations of random variables with
a probability distribution determined by a model of evolution. The process
of nucleotide**

**substitution is modeled as a continuous time Markov process, conditional
on a particular tree topology (t) and a vector of branch lengths v = {v1,
v2, ..., v2s-3},**

**which are treated as parameters. It is assumed that different nucleotide
sites within a sequence, and different lineages, experience independent
substitutions, and no**

**recombination occurs. The probabilities of nucleotides observed at
different sites may then be calculated separately. As an example of how
to calculate the probability of**

**observing one site, consider the tree shown in the Figure.**

**The tips of this tree are labelled with the nucleotides for site
1 of the example data matrix, above. The unknown internal states for site
j of the tree are labelled y4j and**

**y5j. Because the identity of the nucleotides at the internal nodes
is not known, the probability of observing the nucleotides at the tip of
the tree is a sum over all possible**

**assignments of nucleotides to internal nodes:**

**where**

**is the probability that nucleotide C is substituted for A over a
branch of length v, Q is a vector of parameters of the substitution model
(some possible parameters are the**

**transition:transversion rate ratio k, the overall rate of substitution
m, and parameters describing rate variation among sites), and p is the
probability distribution of**

**nucleotides for the ultimate ancestor (y5j). If the process is at
equilibrium then p is usually estimated using the observed nucleotide frequencies
(1). Assuming**

**independence among sites, the probability of the complete set of
observed sequences is**

**Likelihood: The likelihood function is used to obtain maximum likelihood
estimates (MLEs) of parameters and is the probabiity of the observed sequences
as a function**

**of the parameter values:**

**The MLE of a parameter is the value that maximizes the likelihood
function. The approach of Felsenstein (1) is to estimate the parameters
of the substitution process and**

**the branch lengths for each tree and then choose among tree topologies
by considering their relative likelihoods. The tree topology with the greatest
likelihood is chosen as**

**the MLE of phylogeny. The logarithm of the likelihood is often used
when very small probabilities are compared (this is a monotonic function
and does not alter the value**

**of the MLE).**

**References**

**1. J. Felsenstein, J. Mol. Evol. 17, 368 (1981).**

**Does among-site rate variation provide
an improved fit of the model to the data?**

Under the null hypothesis, all sites are assumed to have equal rates of substitution. One way of relaxing this assumption is to assume that the rates at different

sites are drawn from a probability distribution. The most commonly used distribution is the gamma because the gamma distribution's shape changes dramatically

depending on the parameter values of the distribution. Under the alternative hypothesis, sites are assumed to be gamma distributed. The gamma distribution

introduces 1 additional parameter. The likelihood ratio test statistic is -2logL = 2(logL0 - logL1), where L0 and L1 are the likelihoods under the null and

alternative hypotheses, respectively.

The significance of the likelihood ratio test statistic can be approximated using a c2 distribution (with s - 2 degrees of freedom) or by parametric bootstrapping.

The following example shows how to perform the likelihood ratio test of the molecular clock. The data are s = 5 albumin sequences from vertebrates (a fish,

frog, bird, mouse, and human). We will assume the Hasegawa, Kishino, and Yano (1985) model of DNA substitution.

The maximum likelihood under the null hypothesis is logL0 = -7628.025. The best estimate of phylogeny supports the monophyly of the mammals and

amniotes. The maximum likelihood estimates of the other parameters of the substitution model are:

k = 1.777 (transition/transversion rate bias)

p(A) = 0.29024 (frequency of nucleotide "A")

p(C) = 0.22242 (frequency of nucleotide "C")

p(G) = 0.23832 (frequency of nucleotide "G")

p(T) = 0.24902 (frequency of nucleotide "T")

The maximum likelihood under the alternative hypothesis is logL1 = -7569.052. The likelihood under the alternative hypothesis is higher than under the null

hypothesis because there is an additional free parameter in the substitution model (i.e., the shape parameter of the gamma distribution). The maximum likelihood

estimate of phylogeny is consistent with the monophyly of mammals and amniotes (though the tree is unrooted). The maximum likelihood estimates of other

parameters of the substitution model are:

k = 2.249 (transition/transversion rate bias)

a = 1.039 (shape parameter of gamma distribution)

p(A) = 0.290 (frequency of nucleotide "A")

p(C) = 0.222 (frequency of nucleotide "C")

p(G) = 0.238 (frequency of nucleotide "G")

p(T) = 0.249 (frequency of nucleotide "T")

Note that when among-site rate variation is not accounted for, that the transition transversion rate ratio is under estimated (1.777 without and 2.249 with gamma

distributed rate variation).

The likelihood ratio test statistic is -2logL = 117.946, which is asymptotically c2 distributed under the null hypothesis with 1 degrees of freedom. Comparing the

observed value of -2logL to a c2 with 3 df shows that the null hypothesis can be rejected at P << 0.001.

Testing the Molecular Clock

Under the null hypothesis, the phylogeny is rooted and the branch lengths are constrained such that all of the tips can be drawn at a single time plane. Under the

alternative hypothesis, each branch is allowed to vary independently. The alternative hypothesis invokes s - 2 additional parameters, where s is the number of

sequences. The likelihood ratio test statistic is -2logL = 2(logL0 - logL1), where L0 and L1 are the likelihoods under the null and alternative hypotheses,

respectively.

The significance of the likelihood ratio test statistic can be approximated using a c2 distribution (with s - 2 degrees of freedom) or by parametric bootstrapping.

The following example shows how to perform the likelihood ratio test of the molecular clock. The data are s = 5 albumin sequences from vertebrates (a fish, frog,

bird, mouse, and human). We will assume the Hasegawa, Kishino, and Yano (1985) model of DNA substitution with among site rate variation described using a

gamma distribution.

The maximum likelihood under the null hypothesis is logL0 = -7585.343. The best estimate of phylogeny supports the monophyly of the mammals and amniotes.

The maximum likelihood estimates of the other parameters of the substitution model are:

k = 2.284 (transition/transversion rate bias)

a = 1.001 (shape parameter of gamma distribution)

p(A) = 0.29024 (frequency of nucleotide "A")

p(C) = 0.22242 (frequency of nucleotide "C")

p(G) = 0.23832 (frequency of nucleotide "G")

p(T) = 0.24902 (frequency of nucleotide "T")

The maximum likelihood under the alternative hypothesis is logL1 = -7569.052. The likelihood under the alternative hypothesis is higher than under the null

hypothesis because there are more free parameters in the substitution model (i.e., no constraints on branch lengths). The maximum likelihood estimate of phylogeny

is consistent with the monophyly of mammals and amniotes (though the tree is unrooted). The maximum likelihood estimates of other parameters of the substitution

model are:

k = 2.249 (transition/transversion rate bias)

a = 1.039 (shape parameter of gamma distribution)

p(A) = 0.290 (frequency of nucleotide "A")

p(C) = 0.222 (frequency of nucleotide "C")

p(G) = 0.238 (frequency of nucleotide "G")

p(T) = 0.249 (frequency of nucleotide "T")

These (incidental) parameter estimates are very similar to the estimates under the null hypothesis.

The likelihood ratio test statistic is -2logL = 32.582, which is asymptotically c2 distributed under the null hypothesis with 3 degrees of freedom. Comparing the

observed value of -2logL to a c2 with 3 df shows
that the null hypothesis can be rejected at P < 0.001.

*Huelsenbeck, J. P., and B. Rannala. 1997.
Phylogenetic methods come of age: Testing hypotheses in a phylogenetic
context. Science 276:227-232.*

*The use of molecular phylogenies to examine evolutionary
questions has become commonplace with the automations of DNA sequencing
and the availability of*

*efficient computer programs to perform phylogenetic
analyses. The application of computer simulation and likelihood ratio
tests to evolutionary hypotheses*

*represents a recent methodological development
in this field. Likelihood ratio tests have enabled biologists to
address many questions in evolutionary biology*

*that have been difficult to resolve in the past,
such as whether host-parasite systems are cospeciating and whether models
of DNA substitution adequately explain*

*observed sequences.*

*Huelsenbeck, J. P., B. Rannala, and Z. Yang.
1997. Statistical tests of host-parasite cospeciation. Evolution
51(2):410-419.*

*A history of cospeciation (synchronous speciation)
among ecologically associated, but otherwise distantly related, species
is often revealed by a strong*

*correspondence of their phylogenies. In
this paper, we present several tests of cospeciation that use maximum-likelihood
and Bayesian methods of phylogenetic*

*estimation. The hypothese tested include:
(1) topological agreement of phylogenies for coevolving groups; (2) identical
speciation times of associated species;*

*and (3) identical evolutionary rates in genes
of associated species. These tests are applied to examine a possible
instance of host-parasite coevolution among*

*pocket gophers and lice using mitochondrial COI
DNA sequences. The observed differences between gopher and louse
trees cannot be explained by sampling*

*error and are consistent with a rate of host switching
about one-third the host speciation rate. A subset of the gopher-louse
data is consistent with a common*

*history of evolution (i.e. the topologies and
speciation times are identical). However, the relative rate of nucleotide
substitution is two to four times in the lice*

*than in the gophers.*

*Huelsenbeck, J. P., and B. Rannala. 1997.
Maximum likelihood estimation of phylogeny using stratigraphic data.
Paleobiology 23(2):174-180.*

*The stratigraphic distribution of fossil species
contains potential information about phylogeny because some phylogenetic
trees are more consistent with the*

*distribution of fossils in the rock record than
others. A maximum likelihood estimator of phylogeny is derived using
an explicit mathematical model of fossil*

*preservation. The method assumes that fossil
preservations within lineages follow an independent Poisson process, but
can be extended to include other*

*preservation models. The performance of
the method was examined using Monte Carlo simulation. The performance
of the maximum likelihood estimator of*

*topology increases with an increase in the preservation
rate. The method is biased, like other methods of phylogeny estimation,
when the rate of fossil*

*preservation is low; estimated trees tend to be
more asymmetrical than the true tree. The method appears to perform
well as a tree rooting criterion even when*

*preservation rates are low. We suggest several
possible extensions of this method to address other questions about the
nature of fossil preservation and the*

*process of speciaton and extinction over time
and space.*