Bayesian statistics offers rich rewards, including that it gives you a probability distribution for everything. But there’s always the pesky question of how to define what’s called the prior probability distribution, especially when we don’t have much information to go on. In such cases, we usually try to define a “non-informative” (or maybe “non-informed”) prior, i.e., one which doesn’t make any assumptions, and has the smallest possible impact on the final answer so we can let the data speak for themselves.
Suppose for instance we’re flipping a coin to estimate the probability of getting “heads.” We can call this probability $\theta$, and it might have any value from 0 to 1. Out of $n$ total flips we observe $k$ heads. With Bayesian statistics we can use that information to compute a probability distribution for the possible values of the “heads rate” $\theta$.
We start with the distribution for the outcome if we already know the parameter $\theta$. For the coin flip, for example, this is given by the binomial distribution; the likelihood of getting $k$ heads out of $n$ flips is
$$p(k \mid \theta) = \binom{n}{k}\,\theta^k (1-\theta)^{n-k}.$$
This is called the likelihood function.
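As a small illustration, the binomial likelihood can be evaluated directly. This is only a sketch: the function name `binomial_likelihood` and the data (3 heads in 10 flips) are assumptions made up for the example, not values from the text.

```python
from math import comb

def binomial_likelihood(theta, k, n):
    """Probability of k heads in n flips when the heads rate is theta."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# Hypothetical data: 3 heads in 10 flips, tried at a few candidate heads rates.
for theta in (0.1, 0.3, 0.5):
    print(theta, binomial_likelihood(theta, k=3, n=10))

# For a fixed theta, the probabilities over all possible k add up to 1.
total_over_k = sum(binomial_likelihood(0.3, k, 10) for k in range(11))
```

Note that the likelihood sums to 1 over the possible outcomes $k$, but not over the parameter; turning it into a distribution for the parameter is the job of the prior and the normalizing constant.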
Now we must combine the likelihood function with some prior probability distribution for the parameter $\theta$, which we’ll call $p(\theta)$. We do so using Bayes’ theorem, and that gives us the main result: the posterior probability that the parameter has value $\theta$, given the data we’ve observed.
Specifically, the posterior probability distribution for the parameter $\theta$, given some data $x$, is proportional to the product of the likelihood function and the prior probability. “Proportional to” some quantity simply means that it’s some constant times that quantity, so
$$p(\theta \mid x) = C\, p(x \mid \theta)\, p(\theta).$$
We can determine the constant of proportionality using the fact that the posterior distribution, like all probability distributions, must give a total probability of 1.
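For the coin flip we certainly know the likelihood function. But what prior distribution should we use – especially if we have no information to go on? A common choice is the uniform prior, for which the prior probability is constant
$$p(\theta) = 1 \quad \text{for } 0 \le \theta \le 1,$$
so the posterior probability is proportional to (i.e. equal to some constant $C$ times) the likelihood function
$$p(\theta \mid k) = C \binom{n}{k}\,\theta^k (1-\theta)^{n-k}.$$
We determine $C$ by insisting that total probability adds up to 1
$$\int_0^1 C \binom{n}{k}\,\theta^k (1-\theta)^{n-k}\,d\theta = \frac{C}{n+1} = 1,$$
which gives the final form of the posterior distribution as
$$p(\theta \mid k) = (n+1)\binom{n}{k}\,\theta^k (1-\theta)^{n-k}.$$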
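The normalization can be checked numerically. The sketch below assumes the uniform-prior posterior for $k$ heads in $n$ flips is $(n+1)$ times the binomial likelihood, and integrates it with a simple midpoint rule. The helper names and the data (3 heads in 10 flips) are hypothetical.

```python
from math import comb

def uniform_posterior(theta, k, n):
    """Posterior density of the heads rate under a uniform prior."""
    return (n + 1) * comb(n, k) * theta**k * (1 - theta)**(n - k)

def integrate(f, a=0.0, b=1.0, steps=20_000):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

# Hypothetical data: 3 heads in 10 flips.
total = integrate(lambda t: uniform_posterior(t, k=3, n=10))
print(total)  # should be very close to 1
```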
What Bayes’ theorem has given us is not an estimate of $\theta$ (i.e., a point estimate), or a confidence interval, but an actual probability distribution for $\theta$. We can use this probability distribution to make point/interval estimates if we wish, and to estimate the uncertainty in those estimates. For instance, one point estimate is the mean of the distribution, which for the coin flip with $k$ heads in $n$ flips is $(k+1)/(n+2)$. We could also use the mode (most likely value) or the median of the distribution as point estimates.
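To make the three point estimates concrete, here is a rough sketch for the uniform-prior posterior. The mean and mode use closed forms, $(k+1)/(n+2)$ and $k/n$; the median is found by accumulating probability on a fine grid until it reaches one half. The function names, the grid size, and the data (3 heads in 10 flips) are illustrative assumptions.

```python
from math import comb

def uniform_posterior(theta, k, n):
    """Posterior density of the heads rate under a uniform prior."""
    return (n + 1) * comb(n, k) * theta**k * (1 - theta)**(n - k)

def point_estimates(k, n, steps=20_000):
    """Return (mean, mode, median) of the uniform-prior posterior."""
    mean = (k + 1) / (n + 2)
    mode = k / n
    # Median: add up probability cell by cell until half of it is covered.
    h = 1.0 / steps
    cumulative = 0.0
    median = 1.0
    for i in range(steps):
        theta = (i + 0.5) * h
        cumulative += uniform_posterior(theta, k, n) * h
        if cumulative >= 0.5:
            median = theta + 0.5 * h  # right edge of the current cell
            break
    return mean, mode, median

# Hypothetical data: 3 heads in 10 flips.
mean, mode, median = point_estimates(k=3, n=10)
print(mean, mode, median)
```

For a right-skewed posterior like this one the three estimates come out in the order mode, median, mean, which is a reminder that "the" point estimate is itself a choice.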
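But the uniform prior isn’t the only choice. There are good reasons to use what’s called a Jeffreys prior, which can be justified by minimizing a certain measure of the probable “discrepancy” between the model and reality. For the coin flip the Jeffreys prior is
$$p(\theta) = \frac{1}{\pi\sqrt{\theta(1-\theta)}},$$
and the posterior probability distribution turns out to be
$$p(\theta \mid k) = \frac{\Gamma(n+1)}{\Gamma(k+\tfrac{1}{2})\,\Gamma(n-k+\tfrac{1}{2})}\,\theta^{k-1/2}(1-\theta)^{n-k-1/2},$$
where $\Gamma$ is the Gamma function.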
Using the Jeffreys prior, the mean of the distribution is now $(k+\tfrac{1}{2})/(n+1)$. In most cases, and in particular if both $n$ and $k$ aren’t extremely small, the Jeffreys-prior posterior mean is extremely close to the uniform-prior posterior mean.
In fact if we have enough data (i.e., if both $n$ and $k$ are big enough), and it doesn’t take a lot, then the two distributions, one using the uniform prior and the other the Jeffreys prior, will be extremely similar. Even modestly large values of $n$ and $k$ make a decent match. For instance, if we make $n$ coin flips and observe $k$ heads, with both only modestly large, the two distributions (plotted below) are very similar:
By the time we get to many more flips with the same proportion of heads (so the ratio $k/n$ is unchanged) the two distributions are almost indistinguishable:
As is generally the case, when the amount of data grows the results using different priors converge to each other.
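One way to put a number on this convergence is to measure how far apart the two posteriors are. The sketch below assumes the uniform-prior posterior is a Beta distribution with parameters $(k+1,\ n-k+1)$ and the Jeffreys-prior posterior is a Beta distribution with parameters $(k+\tfrac{1}{2},\ n-k+\tfrac{1}{2})$, and computes half the integrated absolute difference between them. The data pairs (3 of 10 and 30 of 100) are hypothetical choices for illustration, not the values behind the plots in this post.

```python
from math import lgamma, log, exp

def beta_pdf(theta, a, b):
    """Density of a Beta(a, b) distribution at 0 < theta < 1."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(theta) + (b - 1) * log(1 - theta))

def prior_gap(k, n, steps=20_000):
    """Half the integrated absolute difference between the two posteriors."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h  # midpoints avoid theta = 0 and theta = 1
        uniform = beta_pdf(theta, k + 1, n - k + 1)
        jeffreys = beta_pdf(theta, k + 0.5, n - k + 0.5)
        total += abs(uniform - jeffreys) * h
    return 0.5 * total

# Hypothetical data with the same proportion of heads at two sample sizes.
gap_small = prior_gap(k=3, n=10)
gap_large = prior_gap(k=30, n=100)
print(gap_small, gap_large)
```

The gap is 0 when the two distributions are identical and 1 when they do not overlap at all, so a shrinking value as the sample grows is the numerical counterpart of the two curves merging. The midpoint rule used here is only adequate when $k \ge 1$; with $k = 0$ the Jeffreys posterior is unbounded near zero and would need more careful integration.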
But in some cases we don’t have enough data. We may actually have plenty of flips (we might have done 200), but we don’t have enough “heads” to make $k$ big enough for the two choices to converge. In fact if we’ve observed no heads, so $k = 0$ out of $n = 200$ flips, the posterior distributions using the uniform and Jeffreys priors both agree that $\theta$ is probably very low, but they do not agree well about the form of the distribution for those low values. The Jeffreys prior gives much more probability to small values of the parameter than does the uniform prior:
Computing their respective means emphasizes how different the results are. The Jeffreys prior gives a mean value of $1/(2n+2)$, which for $n = 200$ becomes $1/402 \approx 0.0025$. The uniform prior gives a posterior mean of $1/(n+2)$, which for $n = 200$ is $1/202 \approx 0.0050$. Since $n$ is large, the uniform prior gives a mean which is nearly twice as large as that from the Jeffreys prior (the ratio is $(2n+2)/(n+2)$, which approaches 2 as $n$ grows).
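The arithmetic is easy to reproduce. This sketch assumes the posterior means are $(k+1)/(n+2)$ for the uniform prior and $(k+\tfrac{1}{2})/(n+1)$ for the Jeffreys prior, and applies them to the case of no heads in 200 flips; the function names are made up for the example.

```python
def uniform_mean(k, n):
    """Posterior mean of the heads rate under a uniform prior."""
    return (k + 1) / (n + 2)

def jeffreys_mean(k, n):
    """Posterior mean of the heads rate under the Jeffreys prior."""
    return (k + 0.5) / (n + 1)

n = 200
u = uniform_mean(0, n)    # 1/202
j = jeffreys_mean(0, n)   # 1/402
ratio = u / j             # (2n + 2) / (n + 2)
print(u, j, ratio)
```

With zero heads the ratio depends only on $n$ and creeps toward 2 as $n$ grows, so collecting more flips that also come up tails does not bring the two answers together.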
There are very good reasons to use the Jeffreys prior; it’s based on sound theoretical principles. But in this last case, when the difference between the results from two plausible priors is so large, we need to be aware that the result may be overly sensitive to the prior. That calls for some extra caution. Perhaps it’s ill-advised to put too much faith in either choice: we have to recognize that uncertainty in our prior distribution is a big factor in the uncertainty of the final answer.
In fact I’d like to propose an aphorism. I forget who it was who said something like, I wouldn’t want to ride on an airplane whose safety depended on using the Lebesgue integral rather than the Riemann integral (referring to two different definitions of the integral, both of which serve perfectly well for practical problems). I’ll paraphrase that with my aphorism: I wouldn’t want to ride on an airplane whose safety depended on using the Jeffreys prior rather than a uniform prior.