It’s routine practice in statistics to apply a statistical model to some process. Often (I’d even say, usually) the model depends on a certain number of parameters. Sooner or later, we’d like to know what the parameters are (or at least be able to estimate them). One of the most powerful methods in statistics for estimating the parameters of a model from a given set of data is called “MLE” for “Maximum Likelihood Estimation.”
The idea is straightforward. Suppose we have a model for a set of data. As an example, suppose we flip a coin $n$ times and observe $k$ occurrences of “heads.” As a statistical model, we’re safe treating the coin flip as a Bernoulli trial, so on any given flip there’s a certain probability (call it $p$) of getting heads and a complementary probability ($1-p$) of getting tails (we will ignore the possibility of the coin landing on its edge, rolling down a drain, or being carried off by a passing bird of prey). We’re also safe treating different coin flips as independent events.
In that case, if we know the single-flip probability $p$ then we can compute the probability of getting $k$ heads out of $n$ flips. This turns out to be the binomial distribution, and is given by
$$L(p, n) = P(k \mid p, n) = \binom{n}{k}\, p^{k} (1-p)^{n-k}.$$
This function is called the likelihood function; it’s just the probability of getting the observed result, for a given set of parameter values. Incidentally, I’ve included the number of flips $n$ as a parameter in this equation, but we won’t really treat it as one; we’ll assume it’s known, so there’s really only one parameter, the single-flip probability $p$.
To compute the MLE estimate of the parameter, we simply find the value of $p$ which gives the maximum possible value of the likelihood for the available data (in this case, the number of heads $k$). For the coin flip problem (and many others) we can find the maximum-likelihood value of $p$ by setting the derivative of the likelihood with respect to $p$ equal to zero, i.e.
$$\frac{\partial L}{\partial p} = 0.$$
Whatever value of $p$ satisfies this equation gives a stationary point of the likelihood function, usually a local minimum or maximum; if it’s a maximum it might be the overall maximum, hence it might be our MLE estimator. This turns out to be the case for the coin-flip experiment.
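For our coin-flip problem we get
$$\frac{\partial L}{\partial p} = \binom{n}{k}\, p^{k-1} (1-p)^{n-k-1} \left[ k(1-p) - (n-k)\,p \right] = 0.$$
This will be zero when the quantity in brackets is zero, i.e., when
$$k(1-p) = (n-k)\,p.$$
This equation isn’t difficult to solve for $p$ when we know $n$ and $k$; in fact we get
$$\hat{p} = \frac{k}{n}.$$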
This is the maximum likelihood estimate, or MLE, of the parameter $p$ for the coin-flip problem: the observed fraction of heads. In this case (but not in all cases!) the MLE estimate is the same as the estimate we get from the “method of moments.”
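As a sanity check on the coin-flip result, here is a minimal sketch that maximizes the binomial likelihood by brute force over a grid and compares the answer with the number of heads divided by the number of flips. The data (7 heads in 20 flips), the grid resolution, and the function names are illustrative assumptions, not anything from a real experiment.

```python
import math

def binomial_likelihood(p, n, k):
    """Probability of k heads in n independent flips with heads probability p."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def grid_mle(n, k, steps=10_000):
    """Return the grid point in [0, 1] with the largest likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: binomial_likelihood(p, n, k))

# Hypothetical data: 7 heads in 20 flips.
n, k = 20, 7
p_closed_form = k / n        # fraction of heads, the closed-form MLE
p_numeric = grid_mle(n, k)   # brute-force maximizer of the likelihood

print(p_closed_form, p_numeric)
```

A grid search is crude, but it makes no use of the derivative, so agreement between the two numbers is an independent check on the calculus.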
As another example, suppose we expect some event (say, the decay of a radioactive nucleus) to occur at a time $t$ which is governed by the exponential distribution. Then the probability density for the decay time of a single nucleus is
$$f(t \mid \lambda) = \lambda e^{-\lambda t},$$
where $\lambda$ is the “decay rate,” the single parameter for this probability distribution. If we observe a number $N$ of nuclei, and they decay at times $t_1, t_2, \ldots, t_N$, then we treat each as independent of all the others. In that case the likelihood function for a given parameter value is the product of the probability densities of the individual events given that same parameter value
$$L(\lambda) = \prod_{j=1}^{N} \lambda e^{-\lambda t_j} = \lambda^{N} \exp\left( -\lambda \sum_{j=1}^{N} t_j \right).$$
It’s convenient to define the log-likelihood as simply the logarithm of the likelihood
$$\ell(\lambda) = \ln L(\lambda).$$
Since the logarithm is a monotone increasing function, the log-likelihood will be maximum for the same parameter value at which the likelihood is maximum. For the radioactive decay we have
$$\ell(\lambda) = N \ln \lambda - \lambda \sum_{j=1}^{N} t_j.$$
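The parameter value which maximizes the log-likelihood (and therefore the likelihood) is the MLE estimator. Setting the derivative $d\ell/d\lambda = N/\lambda - \sum_{j} t_j$ equal to zero gives
$$\hat{\lambda} = \frac{N}{\sum_{j=1}^{N} t_j} = \frac{1}{\bar{t}},$$
so the MLE estimate of the parameter $\lambda$ is 1 over the average decay time $\bar{t}$.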
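The same kind of check works for the decay example. The sketch below, using a handful of made-up decay times, maximizes the exponential log-likelihood numerically with a simple ternary search and compares the result with 1 over the mean decay time. The helper `maximize`, the search interval, and the data are assumptions chosen for illustration.

```python
import math

def log_likelihood(lam, times):
    """Exponential log-likelihood: N*ln(lam) - lam*sum(times)."""
    return len(times) * math.log(lam) - lam * sum(times)

def maximize(f, lo, hi, iters=200):
    """Ternary search for the maximum of a unimodal function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1   # the maximum lies to the right of m1
        else:
            hi = m2   # the maximum lies to the left of m2
    return 0.5 * (lo + hi)

# Hypothetical decay times (arbitrary time units); their mean is 1.0.
times = [0.5, 1.2, 0.3, 2.0, 1.0]

lam_closed_form = len(times) / sum(times)   # 1 / (mean decay time)
lam_numeric = maximize(lambda lam: log_likelihood(lam, times), 1e-6, 10.0)

print(lam_closed_form, lam_numeric)
```

Ternary search is used here only because the log-likelihood has a single peak in the decay rate; any one-dimensional optimizer would do.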
When a model has more than one parameter, we simply treat the parameters as a vector, and the MLE estimates are the values of that vector which give the maximum value of the likelihood function (or the log-likelihood function) for the given observed data.
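To make the multi-parameter case concrete, here is a small sketch using a model that is not otherwise discussed in this post: a Gaussian with unknown mean and variance, so the parameter vector has two components. For that model the MLE is the sample mean together with the sample variance computed with a divisor of N (not N − 1). The data and the perturbation sizes are invented for illustration; the code simply confirms that nudging either parameter away from those values lowers the log-likelihood.

```python
import math

def gaussian_log_likelihood(mu, sigma2, data):
    """Log-likelihood of independent Gaussian data with mean mu, variance sigma2."""
    n = len(data)
    sum_sq = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - sum_sq / (2.0 * sigma2)

# Hypothetical measurements.
data = [4.9, 5.3, 5.1, 4.7, 5.0]

# Closed-form MLE for the two-component parameter vector (mu, sigma2).
mu_hat = sum(data) / len(data)
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / len(data)
best = gaussian_log_likelihood(mu_hat, sigma2_hat, data)

# Perturb each component and record the log-likelihood at nearby points.
nearby = [
    gaussian_log_likelihood(mu_hat + d_mu, sigma2_hat * scale, data)
    for d_mu in (-0.05, 0.0, 0.05)
    for scale in (0.9, 1.0, 1.1)
    if (d_mu, scale) != (0.0, 1.0)
]

print(mu_hat, sigma2_hat, all(value < best for value in nearby))
```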
In most cases, MLE estimators have some profound advantages. For one thing, the MLE estimator is a consistent estimator. This means that as the number of available data grows, the MLE estimator converges in probability to the true value of the parameter or parameters. For another thing, as the sample size grows the distribution of the MLE estimator itself approaches the Gaussian, or normal, distribution. We can even compute the asymptotic-limit variance-covariance of the parameter vector as the inverse of the Fisher Information matrix. Also, the MLE estimator is efficient, meaning that no other asymptotically unbiased estimator will have lower asymptotic mean squared error. As if all that weren’t good enough, the MLE is often transformation-invariant, i.e., if $\theta$ is some parameter (or vector of parameters), and $g(\theta)$ is some transformation of the parameters, then the MLE estimate of $g(\theta)$ is simply $g(\hat{\theta})$, where $\hat{\theta}$ is the MLE estimate of $\theta$.
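Transformation invariance is easy to see numerically with the decay example. As an illustration (the half-life is my choice of transformation, not something used elsewhere in this post), take the half-life to be ln 2 divided by the decay rate. The sketch below maximizes the likelihood directly over the half-life and compares the answer with ln 2 divided by the MLE of the decay rate. The decay times and the `maximize` helper are assumptions made for the example.

```python
import math

def log_likelihood_rate(lam, times):
    """Exponential log-likelihood as a function of the decay rate."""
    return len(times) * math.log(lam) - lam * sum(times)

def log_likelihood_half_life(half_life, times):
    """Same model, re-expressed through half_life = ln(2) / rate."""
    return log_likelihood_rate(math.log(2.0) / half_life, times)

def maximize(f, lo, hi, iters=200):
    """Ternary search for the maximum of a unimodal function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

# Hypothetical decay times (arbitrary time units).
times = [0.5, 1.2, 0.3, 2.0, 1.0]

rate_hat = len(times) / sum(times)                 # MLE of the decay rate
half_life_from_rate = math.log(2.0) / rate_hat     # transform the MLE
half_life_direct = maximize(                       # maximize over half-life itself
    lambda h: log_likelihood_half_life(h, times), 0.01, 10.0
)

print(half_life_from_rate, half_life_direct)
```

The two routes agree because re-labeling the parameter only relabels the horizontal axis of the likelihood curve; it doesn't move the peak to a different model.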
We’ve discussed the likelihood function before, in the context of Bayesian analysis. If we multiply the likelihood function by a prior probability distribution which expresses our knowledge or belief about the parameter values before observations are made, the result is proportional to the posterior probability distribution for the parameter values. If our prior probability is uniform, then the posterior probability is proportional to the likelihood function. If we then estimate the parameters by the values which maximize the posterior probability, the estimate will be identical to the MLE estimator. Therefore in a sense the MLE estimator can be thought of as a Bayesian estimate, namely the mode of the posterior distribution when we use a uniform prior.
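The link between MLE and the posterior mode can be checked with the coin-flip example. In the sketch below, the posterior is computed on a grid (up to a normalizing constant) as likelihood times prior, and its mode is located by brute force. With a uniform prior the mode lands on the fraction of heads; for contrast, a prior peaked at one half (a Beta(2, 2) density, chosen purely as an illustrative assumption) pulls the mode toward one half. The data and function names are hypothetical.

```python
import math

def likelihood(p, n, k):
    """Binomial probability of k heads in n flips with heads probability p."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def posterior_mode(prior, n, k, steps=10_000):
    """Grid point in [0, 1] maximizing likelihood * prior (unnormalized posterior)."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: likelihood(p, n, k) * prior(p))

def uniform_prior(p):
    return 1.0

def peaked_prior(p):
    # Beta(2, 2) density: favors values of p near one half.
    return 6.0 * p * (1.0 - p)

# Hypothetical data: 7 heads in 20 flips.
n, k = 20, 7
mode_uniform = posterior_mode(uniform_prior, n, k)   # coincides with the MLE k/n
mode_peaked = posterior_mode(peaked_prior, n, k)     # pulled toward 1/2

print(k / n, mode_uniform, mode_peaked)
```

For the peaked prior the exact posterior mode is (k + 1)/(n + 2), so the grid answer should sit within one grid step of that value.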
Because of its many advantages, because it applies to so many cases, and because it is theoretically so natural (from either a frequentist or Bayesian perspective), MLE estimation has become a workhorse for statistical analysis. It’s rather common, when applying statistical models, not to worry too much about how we’ll estimate their parameters from observed data, because if we can’t find a clever or simple way to do so we can always fall back on the MLE estimator (which often coincides with the clever and/or simple way). Many statisticians are inordinately fond of MLE as a methodology in general, and with good reason.