MLE

It’s routine practice in statistics to apply a statistical model to some process. Often (I’d even say, usually) the model depends on a certain number of parameters. Sooner or later, we’d like to know what the parameters are (or at least be able to estimate them). One of the most powerful methods in statistics for estimating the parameters of a model from a given set of data is called “MLE” for “Maximum Likelihood Estimation.”


The idea is straightforward. Suppose we have a model for a set of data. As an example, suppose we flip a coin n times and observe k occurrences of “heads.” As a statistical model, we’re safe treating the coin flip as a Bernoulli trial, so on any given flip there’s a certain probability (call it \theta) of getting heads and a complementary probability (1-\theta) of getting tails (we will ignore the possibility of the coin landing on its edge, rolling down a drain, or being carried off by a passing bird of prey). We’re also safe treating different coin flips as independent events.

In that case, if we know the single-flip probability \theta then we can compute the probability of getting k heads out of n flips. This turns out to be the binomial distribution, and is given by

f(k|n,\theta) = {n! \over k! (n-k)!} \theta^k (1-\theta)^{(n-k)}.

This function is called the likelihood function; it’s just the probability of getting the observed result, for a given set of parameter values. Incidentally, I’ve included the number of flips n as a parameter in this equation, but we won’t really treat it as one, we’ll assume it’s known — so there’s really only one parameter, the single-flip probability \theta.

To compute the MLE estimate of the parameter, we simply find the value of \theta which gives the maximum possible value of the likelihood f(k|\theta) for the available data (in this case, the number of heads k). For the coin flip problem (and many others) we can find the maximum-likelihood value of \theta by setting the derivative of the likelihood with respect to \theta equal to zero, i.e.

df / d\theta = 0.

Whatever value of \theta satisfies this equation must give a local minimum or maximum for the likelihood function — and if it’s a maximum it might be the overall maximum, hence it might be our MLE estimator. This turns out to be the case for the coin-flip experiment.

For our coin-flip problem we get

0 = df / d\theta = {n! \over k! (n-k)!} \Bigl [ k \theta^{(k-1)} (1-\theta)^{(n-k)} - (n-k) \theta^k (1-\theta)^{(n-k-1)} \Bigr ]

= \Bigl [ {k \over \theta} - {n-k \over 1-\theta} \Bigr ] f(k|\theta).

This will be zero when the quantity in brackets is zero, i.e., when

{k \over \theta} - {n-k \over 1-\theta} = 0 .

This equation isn’t difficult to solve for \theta when we know n and k; in fact we get

\theta_{MLE} = k/n.

This is the maximum likelihood estimate, or MLE, of the parameter \theta for the coin-flip problem. In this case (but not in all cases!) the MLE estimate k/n is the same as the estimate we get from the “method of moments.”
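To fill in the algebra: multiplying the bracketed condition through by \theta (1-\theta) gives k (1-\theta) = (n-k) \theta, so k = n \theta and \theta = k/n. As a sanity check, here is a minimal Python sketch that finds the maximum by brute force on a grid and compares it to k/n. The data (18 heads in 50 flips), the grid resolution, and the function names are made up for illustration; they are not from any real experiment.

```python
import math

def binomial_likelihood(theta, n, k):
    """f(k|n,theta): probability of k heads in n flips with head-probability theta."""
    return math.comb(n, k) * theta ** k * (1 - theta) ** (n - k)

def grid_mle(n, k, steps=10000):
    """Brute-force search over theta = 1/steps, 2/steps, ..., (steps-1)/steps."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda theta: binomial_likelihood(theta, n, k))

n, k = 50, 18                 # hypothetical data: 18 heads in 50 flips
theta_grid = grid_mle(n, k)   # numerical maximum of the likelihood
theta_closed_form = k / n     # the closed-form answer derived above
print(theta_grid, theta_closed_form)
```

A grid search is about the crudest optimizer there is, but it makes the point that nothing beyond "find the \theta with the largest likelihood" is going on.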

As another example, suppose we expect some event (say, the decay of a radioactive nucleus) to occur at a time which is governed by the exponential distribution. Then the probability density for the decay time of a single nucleus is

f(t|\theta) = \theta e^{-\theta t},

where \theta is the “decay rate,” the single parameter for this probability distribution. If we observe a number n of nuclei, and they decay at times t_1, t_2, ..., t_n, then we treat each as independent of all the others. In that case the likelihood function for a given parameter value \theta is the product of the probabilities of the individual events given that same parameter value

L(t_1,t_2,...,t_n|\theta) = f(t_1|\theta) f(t_2|\theta) ... f(t_n|\theta).

It’s convenient to define the log-likelihood as simply the logarithm of the likelihood

\lambda(t_1,t_2,...,t_n|\theta) = \ln(L(t_1,t_2,...,t_n|\theta))

= \ln(f(t_1|\theta)) + \ln(f(t_2|\theta)) + ... + \ln(f(t_n|\theta)).

Since the logarithm is a monotone function, the log-likelihood will be maximum for the same parameter value at which the likelihood is maximum. For the radioactive decay we have

\lambda = n \ln(\theta) - \theta \sum_{j=1}^n t_j .

The parameter value which maximizes the log-likelihood (and therefore the likelihood) is the MLE estimator, which is

\theta_{MLE} = n / (\sum_{j=1}^n t_j ) = 1 / \bar t ,

so the MLE estimate of the parameter \theta is 1 over the average decay time \bar t.
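The step in between is setting d\lambda / d\theta = n/\theta - \sum_{j=1}^n t_j equal to zero. Here is a small simulation sketch of the same thing, using only the Python standard library. The "true" decay rate of 2.0, the sample size, and the seed are arbitrary choices for illustration.

```python
import math
import random

def exp_log_likelihood(theta, times):
    """lambda = n ln(theta) - theta * sum(t_j) for exponential decay times."""
    return len(times) * math.log(theta) - theta * sum(times)

random.seed(0)
true_rate = 2.0   # assumed "true" decay rate used to simulate the data
times = [random.expovariate(true_rate) for _ in range(10000)]

theta_mle = len(times) / sum(times)   # n / sum(t_j) = 1 / (mean decay time)

# The log-likelihood should be lower a little to either side of the estimate.
ll_at_mle = exp_log_likelihood(theta_mle, times)
ll_below = exp_log_likelihood(0.99 * theta_mle, times)
ll_above = exp_log_likelihood(1.01 * theta_mle, times)
print(theta_mle, ll_at_mle > ll_below, ll_at_mle > ll_above)
```

With this many simulated decays the estimate should land close to the rate used to generate them, though it will not match exactly.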

When a model has more than one parameter, we simply treat the parameters as a vector, and the MLE estimates are the values of that vector which give the maximum value of the likelihood function (or the log-likelihood function) for the given observed data.
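As a two-parameter illustration (my own choice of example, not one worked above), take data assumed to follow a normal distribution with unknown mean \mu and variance \sigma^2. The standard result is that the MLE of \mu is the sample mean and the MLE of \sigma^2 is the mean squared deviation, dividing by n rather than n-1. The sketch below simulates data with made-up values and checks that nudging either parameter lowers the log-likelihood.

```python
import math
import random

def normal_log_likelihood(mu, sigma2, data):
    """Log-likelihood of independent normal data with mean mu and variance sigma2."""
    n = len(data)
    sum_sq = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sum_sq / (2 * sigma2)

random.seed(1)
data = [random.gauss(10.0, 3.0) for _ in range(5000)]   # assumed mu=10, sigma=3
n = len(data)

# Closed-form MLEs for the parameter vector (mu, sigma^2).
mu_mle = sum(data) / n
sigma2_mle = sum((x - mu_mle) ** 2 for x in data) / n   # divides by n, not n-1

# Compare against nearby points in the two-dimensional parameter space.
best = normal_log_likelihood(mu_mle, sigma2_mle, data)
neighbours = [(mu_mle + dm, sigma2_mle + ds)
              for dm in (-0.1, 0, 0.1) for ds in (-0.5, 0, 0.5)
              if (dm, ds) != (0, 0)]
is_local_max = all(normal_log_likelihood(m, s, data) < best for m, s in neighbours)
print(mu_mle, sigma2_mle, is_local_max)
```

When no closed form is available, the same log-likelihood function would simply be handed to a numerical optimizer instead.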

In most cases, MLE estimators have some profound advantages. For one thing, the MLE estimator is a consistent estimator. This means that as the number of available data grows, the MLE estimator converges in probability to the true value of the parameter or parameters. For another thing, as the sample size grows the distribution of the MLE estimator itself approaches the Gaussian, or normal, distribution. We can even compute the asymptotic-limit variance-covariance of the parameter vector as the inverse of the Fisher Information matrix. Also, the MLE estimator is efficient, meaning that no other asymptotically unbiased estimator will have lower asymptotic mean squared error. As if all that weren’t good enough, the MLE is often transformation-invariant, i.e., if \theta is some parameter (or vector of parameters), and \alpha = g(\theta) is some transformation of the parameters, then the MLE estimate \alpha_{MLE} of \alpha is simply g(\theta_{MLE}).
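For the radioactive decay example the Fisher information from n observations is n / \theta^2, so the asymptotic standard deviation of \theta_{MLE} is \theta / \sqrt{n}. The following sketch checks that by repeating a simulated experiment many times and looking at the scatter of the estimates. The rate, sample size, number of replicates, and seed are all arbitrary assumptions for illustration.

```python
import math
import random
import statistics

random.seed(2)
true_rate = 2.0     # assumed "true" decay rate
n = 400             # decays observed per simulated experiment
replicates = 2000   # number of simulated experiments

estimates = []
for _ in range(replicates):
    times = [random.expovariate(true_rate) for _ in range(n)]
    estimates.append(n / sum(times))   # theta_MLE = 1 / (mean decay time)

empirical_sd = statistics.stdev(estimates)
asymptotic_sd = true_rate / math.sqrt(n)   # square root of inverse Fisher information
mean_estimate = statistics.mean(estimates)
print(mean_estimate, empirical_sd, asymptotic_sd)
```

The two standard deviations should agree to within a few percent. At finite n the agreement is only approximate, and the average estimate sits slightly above the true rate, since this particular MLE has a small bias that shrinks as n grows.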

We’ve discussed the likelihood function before, in the context of Bayesian analysis. If we multiply the likelihood function by a prior probability distribution which expresses our knowledge or belief about the parameter values before observations are made, the result is proportional to the posterior probability distribution for the parameter values. If our prior probability is uniform, then the posterior probability is proportional to the likelihood function. If we then estimate the parameters by the values which maximize the posterior probability, the estimate will be identical to the MLE estimator. Therefore in a sense the MLE estimator can be thought of as a Bayesian estimate, namely the mode of the posterior distribution when we use a uniform prior.
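Here is a small sketch of that correspondence for the coin-flip problem, again with hypothetical data (18 heads in 50 flips). It finds the posterior mode on a grid twice: once with a uniform prior, and once with a prior peaked at \theta = 1/2 (a Beta(2,2) density, chosen only for contrast). The function names are my own.

```python
import math

def likelihood(theta, n, k):
    """Binomial likelihood f(k|n,theta)."""
    return math.comb(n, k) * theta ** k * (1 - theta) ** (n - k)

def posterior_mode(prior, n, k, steps=10000):
    """Grid search for the theta maximizing likelihood * prior (unnormalized posterior)."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda theta: likelihood(theta, n, k) * prior(theta))

def uniform_prior(theta):
    return 1.0

def peaked_prior(theta):
    return 6.0 * theta * (1 - theta)   # Beta(2,2) density, peaked at 1/2

n, k = 50, 18   # hypothetical data
mode_uniform = posterior_mode(uniform_prior, n, k)   # should coincide with k/n
mode_peaked = posterior_mode(peaked_prior, n, k)     # pulled toward 1/2
print(mode_uniform, mode_peaked, k / n)
```

With the uniform prior the mode lands on k/n, while the peaked prior moves it to about (k+1)/(n+2). The shift is small here and shrinks further as n grows.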

Because of its many advantages, because it applies to so many cases, and because it is theoretically so natural (from either a frequentist or Bayesian perspective), MLE estimation has become a workhorse for statistical analysis. In fact it’s rather common, when applying statistical models, not to worry too much about how we’ll estimate their parameters from observed data because if we can’t find a clever or simple way to do so we can always fall back on the MLE estimator (which often coincides with the clever and/or simple way). In fact many statisticians are inordinately fond of MLE as a methodology in general — and with good reason.

17 responses to “MLE”

  1. The advantage and disadvantage to a one room school is that you are constantly exposed to ideas which you can’t yet quite grasp. Some rise to the challenge; some sink; and some just tread water.

    I appreciate the post. Thanks!

  2. Great post.

  3. Count me among those who are serious Likelihood fans. The asymptotic Normality of the MLEs means you can also determine confidence intervals using the chi-squared distribution with DOF equal to the number of parameters. It is intuitive, and its relation to AIC and other information criteria makes it possible to do model comparison with models of differing complexity.

    I’ve also found that it is very useful for model averaging, and that if you use Anderson and Burnham’s AIC weights, it is exactly the same as taking the expectation value over all models if you start with a Uniform prior. Note that if the Prior is uninformative and roughly colocated with the central tendency of the likelihood, you get the same result.

    [Response: Indeed, with a large sample the prior usually has little impact on the result anyway. Yeah, MLE is uber-cool.]

  4. It’s a shame to respond to a serious and substantial post with an anecdotal oddity–but I can’t forbear to mention that when a buddy and I (completely quixotically/foolishly/pointlessly) did a coin-flip trial way back in high school, we actually did get one coin-flip which landed on edge. . .

    [Response: That’s the first I’ve ever heard of such an event. Perhaps I’ll include the possibility in future musings.]

    • Well, it illustrates “real world messiness” in several senses; the trials were carried out in an extremely untidy and cluttered room, and a misdirected flip landed the coin on the edge of a rumpled cloth, which held it more or less vertical. Clearly such a result would have been far less likely in a cleaner environment!

      I suppose there’s a protocol question, too; should we have just thrown that result out, instead of creating a third category in which to record it? (Which latter is what we did–with the utmost mock-seriousness.)

    • we actually did get one coin-flip which landed on edge. . .

      Well, then you can answer a question I’ve had for almost exactly 50 years: Did you gain telepathic powers when the coin landed on edge?

      This was, in fact, the premise of a Twilight Zone episode. I will date myself by saying that it’s one of my all-time favorites and that I remember quite well watching the episode’s original airing in February, 1961. (And, yes, that is Darrin from Bewitched in the photo.)

      Incidentally, I believe wikipedia is wrong in saying that he “inadvertently” knocked the coin over on his way home. As I recall, it was quite intentional–Darrin had found that telepathy was not exactly an unqualified boon.

    • Horatio Algeranon

      That would be MUSE : Maximum Unlikelihood (and Strangeness) Estimation.

  5. Typo in the derivative for your first example: second term should be (1-theta)^(n-k-1).

    [Response: Right you are. Fixed.]

  6. Dear Tamino
    I have been grappling with MLE as part of my PhD using multinomial logit modelling. I understand it conceptually but I am still having difficulty explaining it plainly to my supervisor. So it remains an ongoing process.
    This article has been of great help.
    Regards Doug

  7. David B. Benson

    Twice (in the same hallway) I missed slipping a coin into my pocket while walking down the hall. On both occasions the coins landed on edge and rolled down the hallway into the wall at the end.

  8. David B. Benson

    Invariant Bayesian estimation on manifolds:
    http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aos/1117114330
    arrives at an interesting conclusion.

    • Gavin's Pussycat

      David, nice article, but in my estimation it proves something that is already intuitively obvious: when you have a natural metric on a space, you will also have a natural measure (i.e., you know what the surface area or volume of a 1x1x1 co-ordinate “patch” will be: you get it from the determinant of the metric tensor); in that case, you can indeed define the MLE etc. in an invariant (or covariant) way. But this is not really different from saying that you can consistently determine air pressure in N/m^2 on the Earth’s surface, no matter what Earth surface co-ordinates you use: lat/lon, state plane, UTM, …

      The real problem is when you have a parameter space which does not have a natural metric/measure: like the 1-D space of CO2 doubling sensitivity values. Then you’re lost… and this is where naively using a uniform prior can lead you astray.

  9. Tamino, in the third paragraph, you say “we can compute the probability of getting k flips out of n trials”, you mean “k heads”, yes? Cos each trial is going to be a flip…

    [Response: Yes indeed.]

  10. Yes, MLE is great, as is crawling for those who haven’t yet tried walking. :) But as a mode of the posterior distribution, it is strictly speaking not translation-invariant, except for linear transformations. Sometimes models are easiest to specify as unidentifiable or almost unidentifiable – then MLE does not exist. And sometimes MLE fails by being too biased, for example with variance parameters. This leads deeper towards Bayesian estimation.

  11. Janne,
    An even more important example of bias in MLE is for the shape parameter of the Weibull. However, the bias can be quantified. Have you looked at Weighted averaging with weights based on Likelihood? Basically, it reduces to an expectation value over the Posterior distributions if you assume a uniform Prior. I’m working on a paper that uses it, and it seems to work pretty well.

  12. Ray, no I haven’t looked at weighted averaging. I use mostly standard MLE or MAP from standard R libraries, and then go to samplers if MLE is not enough -which enables all kinds of averages. The latter has not happened too often recently.

    (It’s hard to say anything meaningful on this general level, for people’s contexts and needs seem to vary wildly. But all kinds of approximative methods exist between MLE and full bayesian, and often they should be preferred.)