Maximum Likelihood

Maximum Likelihood

A simple example of maximum likelihood estimation

http://mw511.biol.berkeley.edu/like/mle/

Consider the simple procedure of tossing a coin with the goal of estimating the probability of heads for the coin. The probability of heads for a fair coin is 0.5. However,

for this example we will assume that the probability of heads is unknown (maybe the coin is strange in some way or that we are testing whether or not the coin is fair). The

act of tossing the coin n times forms an experiment--a procedure that, in theory, can be repeated an infinite number of times and has a well-defined set of possible

outcomes. When flipping a coin, there are 2n possible sample outcomes, w. The set of possible outcomes forms the sample space, W. For n = 3 tosses of a coin, the

sample space is

W = {HHH, HHT, HTH, THH, TTH, THT, HTT, TTT},

where H denotes heads and T denotes tails. An event is a collection of sample outcomes and is said to occur if the outcome of a particular experiment is a member of the

event. Often, it is useful to think of a function whose domain is the sample space, W. Such a function is known as a random variable. Examples of random variables for the

coin flip experiment are (i) the number of times heads appear, (ii) the number of times tails appears, and (iii) the number of flips until a head appears. Random variables are

often denoted by uppercase letters, often X, Y, and Z.

As an example of likelihood estimation, the coin toss example will more fully developed. The three main components of the statistical approach are (i) the data, (ii) a model

describing the probability of observing the data, and (iii) a criterion that allows us to move from the data and model to an estimate of the parameters of the model.

Data: Assume that we have actually performed the coin flip experiment, tossing a coin n = 10 times. We observe that the sequence of heads and tails was H, H, H, T,

H, T, T, H, T, H. We will denote heads by 1 and tails by 0; hence, our data will be coded as

x = {1, 1, 1, 0, 1, 0, 0, 1, 0, 1}

where x is a vector with elements x1 = 1, x2 = 1, x3 = 1, x4 = 0, ..., x10 = 1. In tossing the coin, we note that heads appeared 6 times and tails appeared 4 times.

Model: An appropriate model that describes the probability of observing heads for any single flip of the coin is the Bernoulli distribution. The Bernoulli distribution has

the following form:

where p is the probability of heads and xi is the experimental outcome of the i-th coin flip (i.e., heads or tails). The vertical line in the function means "given". Note

that if p = 0.6, the function returns 0.6 if xi = 1 and 0.4 if xi = 0. Many independent Bernoulli's gives the binomial distribution.

Criterion: Now that we have specified the data, x, and the model, f(xi |p), we need a criterion to estimate the parameters of the model. Here, the parameter of

interest (the parameter to be estimated) is p--the probability that heads appears for any single toss of the coin. Several methods could be applied at this point, including the

method of moments, least squares, and Bayesian estimation. However, we will consider only the the method of maximum likelihood here.

The method of maximum likelihood was first proposed by the English statistician and population geneticist R. A. Fisher. The maximum likelihood method finds that

estimate of a parameter which maximizes the probability of observing the data given a specific model for the data.

The likelihood function is defined as

The likelihood function is simply the joint probability of observing the data. The large P means "product". The likelihood function is obtained by multiplying the probability

function for each toss of the coin. In the case of the coin flip experiment where we are assuming a Bernoulli distribution for each coin flip, the likelihood function becomes

Taking the log of the likelihood function does not change that value of p for which the likelihood is maximized. After taking the log of both sides of the equation this

becomes:

The following figures show plots of likelihood, L, as a function of p for several different possible outcomes of n = 10 flips of a coin. Note that for the case in which 3

heads and 7 tails were the outcome of the experiment, that the likelihood appears to be maximized at p = 0.3. Similarly, p = 0.5 for the case of 5 heads and 5 tails, p =

0.8 for the case of 8 heads and 2 tails, and p = 0.9 for the case of 9 heads and 1 tail. The likelihood appears to be maximized when p is the proportion of the time that

heads appeared in our experiment. This illustrates the "brute force" way to find the maximum likelihood estimate of p.

Instead of determining the maximum likelihood value of p graphically, we could also find the maximum likelihood estimate of p analytically. We can do this by taking the

derivative of the likelihood function with respect to p and finding where the slope is zero. When this is done, the maximum is found at . The estimate of p (the

probability of heads) is just the proportion heads that we observed in our experiment.

Hypothesis testing in a likelihood framework

Based on the data, a plausible measure of the relative tenability of two competing hypotheses is the ratio of their likelihoods. Consider the case in which H0 may specify

that Q element of w0, where w0 is a subset of the possible values of Q and H1 may specify that Q element of w1, where w1 is disjoint from w0. Here Q represents the

parameters of the model describing the data (e.g., p from the Bernoulli distribution). The likelihood ratio is

Here, the likelihood calculated under the null hypothesis is in the numerator and the likelihood calculated under the alternative hypothesis is in the denominator. When L is

small, the null hypothesis is discredited whereas when L is large, the alternative hypothesis is discredited.

One interesting fact concerns the case in which one hypothesis is the subset (or special case) of another more general hypothesis. When the null hypothesis is a subset (or

special case) of the alternative hypothesis, -2logL is distributed according to a c2 distribution with q degress of freedom under the null hypothesis, where q is the

difference in the number of free parameters between the general and restricted hypotheses.

As an example, consider the coin toss example. Two hypotheses will be considered. H0 will denote the restricted (null) hypothesis whereas H1 will denote the unrestricted

(or general) hypothesis. We want to test the null hypothesis that the coin is fair. Hence, under H0 p = 0.5. Under H1, p takes the value between 0 and 1 which

maximizes the likelihood function. Under H0, the likelihood is

Under H1, the likelihood is

The likelihood ratio for a test of the null hypothesis that p = 0.5 is

To calculate the likelihood under the null hypothesis, one simply substitutes 0.5 in for p in the likelihood function. We already discussed how to calculate the likelihood

under the alternative (unrestricted) hypothesis. The likelihood under the alternative hypothesis is maximized by substituting

for p in the likelihood function. The two hypotheses differ by 1 free parameter; the parameter p is fixed under H0 and free to vary under H1. Hence, -2logL can be

compared to a c2 with 1 degree of freedom. If -2logL is greater than 3.84, then the null hypothesis that p = 0.5 can be rejected with a level of Type I error (falsely rejecting

the null hypothesis) of 5%.

Instead of using the c2 distribution to test the significance of the likelihood ratio, the null hypothesis can be tested in another way -- through simulation under the null

hypothesis. This approach is called Markov simulation or parametric bootstrapping. To illustrate this approach, we will again consider the coin toss example. Let us assume

that we have tossed a coin 1000 times and have noted that heads appears 450 times. Once again, we want to consider as a null hypothesis that the coin is fair (p = 0.5).

The alternative hypothesis will again be the likelihood calculated with an unrestricted p.

The likelihood ratio test statistic for this example is -2logL = 10.018. Is this value greater than we would expect if p= 0.5? A computer program was written that generates

1000 coin tosses under the assumption that p = 0.5. For every computer replicate, we generate 1000 coin tosses, and calculate the likelihood ratio test statistic (-2logL).

After this procedure is repeated many thousands or millions of times, a histogram of the frequency that different -2logL values appear under the null hypothesis is

constructed. The following figure shows the results of such a simulation:

The 95% cut-off for this simulated distribution is about 3.8, which closely corresponds to what we would expect from the c2 distribution (3.84). In fact, we can directly

compare the simulated distribution to a c2 distribution with 1 degree of freedom. The two distributions are very similar.

In this case where we had 450 heads out of 1000 coin tosses, we can reject the null hypothesis that p = 0.5 with greater than 0.995 probability; we would not expect to

observe 450 heads out of 1000 tosses if the coin was fair. The coin is biased towards tails.

In this case where we had 450 heads out of 1000 coin tosses, we can reject the

null hypothesis that p = 0.5 with greater than 0.995 probability; we would not

expect to observe 450 heads out of 1000 tosses if the coin was fair. The coin is

biased towards tails.

Maximum Likelihood Estimation of Phylogeny

Data: The observed data are s aligned DNA sequences, each n nucleotides in length. The sequences are represented by a s x n matrix X = {xij}, where xij is

the nucleotide at the jth site in the ith sequence. An example data matrix for 3 sequences (species) is

where x₁₁ = A, x₁₂ = T, x₂₁ = C, and so on.

Model: The sequences represent observations of random variables with a probability distribution determined by a model of evolution. The process of nucleotide

substitution is modeled as a continuous time Markov process, conditional on a particular tree topology (t) and a vector of branch lengths v = {v1, v2, ..., v2s-3},

which are treated as parameters. It is assumed that different nucleotide sites within a sequence, and different lineages, experience independent substitutions, and no

recombination occurs. The probabilities of nucleotides observed at different sites may then be calculated separately. As an example of how to calculate the probability of

observing one site, consider the tree shown in the Figure.

The tips of this tree are labelled with the nucleotides for site 1 of the example data matrix, above. The unknown internal states for site j of the tree are labelled y4j and

y5j. Because the identity of the nucleotides at the internal nodes is not known, the probability of observing the nucleotides at the tip of the tree is a sum over all possible

assignments of nucleotides to internal nodes:

where

is the probability that nucleotide C is substituted for A over a branch of length v, Q is a vector of parameters of the substitution model (some possible parameters are the

transition:transversion rate ratio k, the overall rate of substitution m, and parameters describing rate variation among sites), and p is the probability distribution of

nucleotides for the ultimate ancestor (y5j). If the process is at equilibrium then p is usually estimated using the observed nucleotide frequencies (1). Assuming

independence among sites, the probability of the complete set of observed sequences is

Likelihood: The likelihood function is used to obtain maximum likelihood estimates (MLEs) of parameters and is the probabiity of the observed sequences as a function

of the parameter values:

The MLE of a parameter is the value that maximizes the likelihood function. The approach of Felsenstein (1) is to estimate the parameters of the substitution process and

the branch lengths for each tree and then choose among tree topologies by considering their relative likelihoods. The tree topology with the greatest likelihood is chosen as

the MLE of phylogeny. The logarithm of the likelihood is often used when very small probabilities are compared (this is a monotonic function and does not alter the value

of the MLE).

References

1. J. Felsenstein, J. Mol. Evol. 17, 368 (1981).

Does among-site rate variation provide an improved fit of the model to the data?

Under the null hypothesis, all sites are assumed to have equal rates of substitution. One way of relaxing this assumption is to assume that the rates at different

sites are drawn from a probability distribution. The most commonly used distribution is the gamma because the gamma distribution's shape changes dramatically

depending on the parameter values of the distribution. Under the alternative hypothesis, sites are assumed to be gamma distributed. The gamma distribution

introduces 1 additional parameter. The likelihood ratio test statistic is -2logL = 2(logL0 - logL1), where L0 and L1 are the likelihoods under the null and

alternative hypotheses, respectively.

The significance of the likelihood ratio test statistic can be approximated using a c2 distribution (with s - 2 degrees of freedom) or by parametric bootstrapping.

The following example shows how to perform the likelihood ratio test of the molecular clock. The data are s = 5 albumin sequences from vertebrates (a fish,

frog, bird, mouse, and human). We will assume the Hasegawa, Kishino, and Yano (1985) model of DNA substitution.

The maximum likelihood under the null hypothesis is logL0 = -7628.025. The best estimate of phylogeny supports the monophyly of the mammals and

amniotes. The maximum likelihood estimates of the other parameters of the substitution model are:

k = 1.777 (transition/transversion rate bias)

p(A) = 0.29024 (frequency of nucleotide "A")

p(C) = 0.22242 (frequency of nucleotide "C")

p(G) = 0.23832 (frequency of nucleotide "G")

p(T) = 0.24902 (frequency of nucleotide "T")

The maximum likelihood under the alternative hypothesis is logL1 = -7569.052. The likelihood under the alternative hypothesis is higher than under the null

hypothesis because there is an additional free parameter in the substitution model (i.e., the shape parameter of the gamma distribution). The maximum likelihood

estimate of phylogeny is consistent with the monophyly of mammals and amniotes (though the tree is unrooted). The maximum likelihood estimates of other

parameters of the substitution model are:

k = 2.249 (transition/transversion rate bias)

a = 1.039 (shape parameter of gamma distribution)

p(A) = 0.290 (frequency of nucleotide "A")

p(C) = 0.222 (frequency of nucleotide "C")

p(G) = 0.238 (frequency of nucleotide "G")

p(T) = 0.249 (frequency of nucleotide "T")

Note that when among-site rate variation is not accounted for, that the transition transversion rate ratio is under estimated (1.777 without and 2.249 with gamma

distributed rate variation).

The likelihood ratio test statistic is -2logL = 117.946, which is asymptotically c2 distributed under the null hypothesis with 1 degrees of freedom. Comparing the

observed value of -2logL to a c2 with 3 df shows that the null hypothesis can be rejected at P << 0.001.

Testing the Molecular Clock

Under the null hypothesis, the phylogeny is rooted and the branch lengths are constrained such that all of the tips can be drawn at a single time plane. Under the

alternative hypothesis, each branch is allowed to vary independently. The alternative hypothesis invokes s - 2 additional parameters, where s is the number of

sequences. The likelihood ratio test statistic is -2logL = 2(logL0 - logL1), where L0 and L1 are the likelihoods under the null and alternative hypotheses,

respectively.

The significance of the likelihood ratio test statistic can be approximated using a c2 distribution (with s - 2 degrees of freedom) or by parametric bootstrapping.

The following example shows how to perform the likelihood ratio test of the molecular clock. The data are s = 5 albumin sequences from vertebrates (a fish, frog,

bird, mouse, and human). We will assume the Hasegawa, Kishino, and Yano (1985) model of DNA substitution with among site rate variation described using a

gamma distribution.

The maximum likelihood under the null hypothesis is logL0 = -7585.343. The best estimate of phylogeny supports the monophyly of the mammals and amniotes.

The maximum likelihood estimates of the other parameters of the substitution model are:

k = 2.284 (transition/transversion rate bias)

a = 1.001 (shape parameter of gamma distribution)

p(A) = 0.29024 (frequency of nucleotide "A")

p(C) = 0.22242 (frequency of nucleotide "C")

p(G) = 0.23832 (frequency of nucleotide "G")

p(T) = 0.24902 (frequency of nucleotide "T")

The maximum likelihood under the alternative hypothesis is logL1 = -7569.052. The likelihood under the alternative hypothesis is higher than under the null

hypothesis because there are more free parameters in the substitution model (i.e., no constraints on branch lengths). The maximum likelihood estimate of phylogeny

is consistent with the monophyly of mammals and amniotes (though the tree is unrooted). The maximum likelihood estimates of other parameters of the substitution

model are:

k = 2.249 (transition/transversion rate bias)

a = 1.039 (shape parameter of gamma distribution)

p(A) = 0.290 (frequency of nucleotide "A")

p(C) = 0.222 (frequency of nucleotide "C")

p(G) = 0.238 (frequency of nucleotide "G")

p(T) = 0.249 (frequency of nucleotide "T")

These (incidental) parameter estimates are very similar to the estimates under the null hypothesis.

The likelihood ratio test statistic is -2logL = 32.582, which is asymptotically c2 distributed under the null hypothesis with 3 degrees of freedom. Comparing the

observed value of -2logL to a c2 with 3 df shows that the null hypothesis can be rejected at P < 0.001.

Huelsenbeck, J. P., and B. Rannala. 1997. Phylogenetic methods come of age: Testing hypotheses in a phylogenetic context. Science 276:227-232.

The use of molecular phylogenies to examine evolutionary questions has become commonplace with the automations of DNA sequencing and the availability of

efficient computer programs to perform phylogenetic analyses. The application of computer simulation and likelihood ratio tests to evolutionary hypotheses

represents a recent methodological development in this field. Likelihood ratio tests have enabled biologists to address many questions in evolutionary biology

that have been difficult to resolve in the past, such as whether host-parasite systems are cospeciating and whether models of DNA substitution adequately explain

observed sequences.

Huelsenbeck, J. P., B. Rannala, and Z. Yang. 1997. Statistical tests of host-parasite cospeciation. Evolution 51(2):410-419.

A history of cospeciation (synchronous speciation) among ecologically associated, but otherwise distantly related, species is often revealed by a strong

correspondence of their phylogenies. In this paper, we present several tests of cospeciation that use maximum-likelihood and Bayesian methods of phylogenetic

estimation. The hypothese tested include: (1) topological agreement of phylogenies for coevolving groups; (2) identical speciation times of associated species;

and (3) identical evolutionary rates in genes of associated species. These tests are applied to examine a possible instance of host-parasite coevolution among

pocket gophers and lice using mitochondrial COI DNA sequences. The observed differences between gopher and louse trees cannot be explained by sampling

error and are consistent with a rate of host switching about one-third the host speciation rate. A subset of the gopher-louse data is consistent with a common

history of evolution (i.e. the topologies and speciation times are identical). However, the relative rate of nucleotide substitution is two to four times in the lice

than in the gophers.

Huelsenbeck, J. P., and B. Rannala. 1997. Maximum likelihood estimation of phylogeny using stratigraphic data. Paleobiology 23(2):174-180.

The stratigraphic distribution of fossil species contains potential information about phylogeny because some phylogenetic trees are more consistent with the

distribution of fossils in the rock record than others. A maximum likelihood estimator of phylogeny is derived using an explicit mathematical model of fossil

preservation. The method assumes that fossil preservations within lineages follow an independent Poisson process, but can be extended to include other

preservation models. The performance of the method was examined using Monte Carlo simulation. The performance of the maximum likelihood estimator of

topology increases with an increase in the preservation rate. The method is biased, like other methods of phylogeny estimation, when the rate of fossil

preservation is low; estimated trees tend to be more asymmetrical than the true tree. The method appears to perform well as a tree rooting criterion even when

preservation rates are low. We suggest several possible extensions of this method to address other questions about the nature of fossil preservation and the

process of speciaton and extinction over time and space.