Prior Knowledge

Bayesian statistics offers rich rewards, including that it gives you a probability distribution for everything. But there’s always the pesky question of how to define what’s called the prior probability distribution, especially when we don’t have much information to go on. In such cases, we usually try to define a “non-informative” (or maybe “non-informed”) prior, i.e., one which doesn’t make any assumptions, and has the smallest possible impact on the final answer so we can let the data speak for themselves.

Suppose for instance we’re flipping a coin to estimate the probability of getting “heads.” We can call this probability \theta, and it might have any value from 0 to 1. Out of n total flips we observe k heads. With Bayesian statistics we can use that information to compute a probability distribution for the possible values of the “heads rate” \theta.

We start with the distribution for the outcome if we already know the parameter \theta. For the coin flip, for example, this is given by the binomial distribution; the likelihood of getting k heads out of n flips is

p(x|\theta) = {n! \over k! (n-k)!} \theta^k (1-\theta)^{n-k}.

This is called the likelihood function.

Now we must combine the likelihood function with some prior probability distribution for the parameter \theta, which we’ll call \pi(\theta). We do so using Bayes’ theorem, and that gives us the main result: the posterior probability that the parameter has value \theta, given the data we’ve observed.

Specifically, the posterior probability distribution for the parameter \theta, given some data x, is proportional to the product of the likelihood function and the prior probability. “Proportional to” some quantity simply means that it’s some constant C times that quantity, so

p(\theta|x) = C ~ p(x|\theta) \pi(\theta).

We can determine the constant of proportionality using the fact that the posterior distribution, like all probability distributions, must give a total probability of 1.

For the coin flip we certainly know the likelihood function. But what prior distribution should we use – especially if we have no information to go on? A common choice is the uniform prior, for which the prior probability is constant

\pi(\theta) = 1,

so the posterior probability is proportional to (i.e. equal to some constant C times) the likelihood function

p(\theta|x) = C {n! \over k! (n-k)!} \theta^k (1-\theta)^{n-k}.

We determine C by insisting that total probability adds up to 1

\int_0^1 p(\theta|x) ~d\theta = 1,

which gives the final form of the posterior distribution as

p(\theta|x) = {(n+1)! \over k! (n-k)!} \theta^k (1-\theta)^{n-k}.
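We can check that normalizing constant numerically. A minimal sketch (it assumes SciPy, which isn't part of the post): integrate the binomial likelihood over \theta from 0 to 1, and the reciprocal of that area should be the constant C, which works out to n+1 for the uniform prior.

```python
from math import comb
from scipy.integrate import quad

# Check that the constant C normalizing the uniform-prior posterior
# equals n + 1: the likelihood integrates to k!(n-k)!/(n+1)! times
# the binomial coefficient, i.e. to 1/(n+1).
n, k = 20, 7

def likelihood(theta):
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

area, _ = quad(likelihood, 0.0, 1.0)
C = 1.0 / area                      # normalizing constant
print(C)                            # should be n + 1 = 21
```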

What Bayes’ theorem has given us is not an estimate of \theta (i.e., a point estimate), or a confidence interval, but an actual probability distribution for \theta. We can use this probability distribution to make point/interval estimates if we wish, and to estimate the uncertainty in those estimates. For instance, one point estimate is the mean of the distribution, which for the coin flip is (k+1)/(n+2). We could also use the mode (most likely value) or the median of the distribution as point estimates.
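Those point estimates are easy to compute in practice, since the uniform-prior posterior above is a Beta(k+1, n-k+1) distribution. A short sketch (again assuming SciPy):

```python
from scipy.stats import beta

# With a uniform prior the posterior is Beta(k+1, n-k+1).
n, k = 20, 7
post = beta(k + 1, n - k + 1)

post_mean = post.mean()             # equals (k+1)/(n+2)
post_median = post.median()         # no simple closed form
post_mode = k / n                   # mode of Beta(k+1, n-k+1) is k/n

print(post_mean, post_median, post_mode)
```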

But the uniform prior isn’t the only choice. There are good reasons to use what’s called a Jeffreys prior, which can be justified by minimizing a certain measure of the probable “discrepancy” between the model and reality, and which has the appealing property of being invariant under reparameterization of the problem. For the coin flip the Jeffreys prior is

\pi(\theta) \propto 1 / \sqrt{\theta (1-\theta)},

and the posterior probability distribution turns out to be

p(\theta|x) = {\Gamma(n+1) \over \Gamma(k+\frac{1}{2}) \Gamma(n-k+\frac{1}{2})} \theta^{k-\frac{1}{2}} (1-\theta)^{n-k-\frac{1}{2}},

where \Gamma is the Gamma function.

Using the Jeffreys prior, the mean of the distribution is now (k+\frac{1}{2})/(n+1). Unless n and k are both extremely small, the Jeffreys-prior posterior mean is extremely close to the uniform-prior posterior mean.
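A quick numerical check of that claim, using the n=20, k=7 example from further down (plain Python, nothing assumed beyond the two formulas above):

```python
# Compare the uniform-prior and Jeffreys-prior posterior means
# for a moderate sample; they agree closely.
n, k = 20, 7
mean_uniform = (k + 1) / (n + 2)        # uniform prior
mean_jeffreys = (k + 0.5) / (n + 1)     # Jeffreys prior
print(mean_uniform, mean_jeffreys)
```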

In fact if we have enough data (i.e., if both n and k are big enough, and it doesn’t take a lot), the two distributions, one using the uniform prior and the other the Jeffreys prior, will be extremely similar. Even modestly large values of n and k make a decent match. For instance, if we make n=20 coin flips and observe k=7 heads, the two distributions (plotted below) are very similar:

By the time we get to n=200 flips with the same proportion of heads (so k=70) the two distributions are almost indistinguishable:
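That convergence can be quantified. Here's a sketch (it assumes NumPy and SciPy) that measures the total variation distance between the two posteriors on a grid, for the two cases above; the distance shrinks as n grows:

```python
import numpy as np
from scipy.stats import beta

# Total variation distance between the uniform-prior posterior
# Beta(k+1, n-k+1) and the Jeffreys-prior posterior
# Beta(k+1/2, n-k+1/2), approximated on a uniform grid.
def tv_distance(n, k, m=200_000):
    theta = np.linspace(1e-9, 1 - 1e-9, m)
    dtheta = theta[1] - theta[0]
    p_unif = beta.pdf(theta, k + 1, n - k + 1)
    p_jeff = beta.pdf(theta, k + 0.5, n - k + 0.5)
    return 0.5 * np.sum(np.abs(p_unif - p_jeff)) * dtheta

tv_20 = tv_distance(20, 7)
tv_200 = tv_distance(200, 70)
print(tv_20, tv_200)        # the n=200 case is much closer
```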

As is generally the case, as the amount of data grows, the results using different priors converge to each other.

But in some cases we don’t have enough data. We may actually have plenty of flips (we might have done 200), but not enough “heads” to make k big enough for the two choices to converge. In fact if we’ve observed no heads, so k=0 out of n=200 flips, the posterior distributions using the uniform and Jeffreys priors both agree that \theta is probably very low, but they disagree noticeably about the form of the distribution at those low values. The Jeffreys prior gives much more probability for the parameter \theta to be small than does the uniform prior:

Computing their respective means emphasizes how different the results are. The Jeffreys prior gives a mean value of (k+\frac{1}{2})/(n+1), which for k=0 becomes 1/(2n+2). The uniform prior gives a posterior mean of (k+1)/(n+2), which for k=0 is 1/(n+2). Since n is large, the uniform prior gives a mean which is nearly twice as large as that from the Jeffreys prior.
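The nearly 2:1 discrepancy is easy to confirm directly from the two formulas (plain Python):

```python
# With k = 0 out of n = 200 flips, the two posterior means
# differ by nearly a factor of two.
n, k = 200, 0
mean_uniform = (k + 1) / (n + 2)        # 1/(n+2)
mean_jeffreys = (k + 0.5) / (n + 1)     # 1/(2n+2)
ratio = mean_uniform / mean_jeffreys
print(mean_uniform, mean_jeffreys, ratio)   # ratio is close to 2
```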

There are very good reasons to use the Jeffreys prior; it’s based on sound theoretical principles. But in this last case, when the difference between the use of two plausible priors is so large, we need to be aware that the result may be overly sensitive to the prior. That calls for some extra caution. Perhaps it’s ill-advised to put too much faith in either choice; we have to recognize that uncertainty in our prior distribution is a big factor in the uncertainty of the final answer.

In fact I’d like to propose an aphorism. I forget who it was who said something like, I wouldn’t want to ride on an airplane whose safety depended on using the Lebesgue integral rather than the Riemann integral (referring to two different definitions of the integral, both of which serve perfectly well for practical problems). I’ll paraphrase that with my aphorism: I wouldn’t want to ride on an airplane whose safety depended on using the Jeffreys prior rather than a uniform prior.


10 responses to “Prior Knowledge”

  1. Dikran Marsupial

    Nice introduction to Bayesian ideas. “We have to recognize that uncertainty in our prior distribution is a big factor in the uncertainty of the final answer” is certainly a useful take-home message. It could be argued that the Jeffreys prior is not as “un-informed” as the flat prior, as it is justified by the prior knowledge that our prior knowledge should be invariant to re-parameterisation, so the reason it gives different results is because it reflects a different state of prior knowledge about the problem. A nice thing about the Bayesian approach is that it encourages discussion of our assumptions (i.e. the prior) – the conclusion of even a correctly constructed logical argument is only as valid as its premises.

  2. Tamino,
    Interested to know what you think of model averaging, as proposed by Burnham and Anderson. Might the average over priors give a better result when the results are divergent?

  3. Brian Brademeyer

    Shouldn’t the final Jeffreys mean be 1/(2n+2)?

    [Response: Doh! Yes.]

  4. Gavin's Pussycat

    Wasn’t there this thing that the Jeffreys prior was ‘invariant’?
    I remember it was discussed in Jaynes. Unfortunately I found that a little too theoretical to appreciate. I understand that, e.g., the uniform prior is invariant under linear transformation. But what is the thing with Jeffreys?

  5. Nice discussion Tamino. I have always found it very intuitively appealing (read a really neat trick!) that, taking advantage of the fact that the Beta distribution is a conjugate prior for the binomial likelihood, and that Beta(1,1) ≡ Unif(0,1), and finally that the Beta distribution is so flexible for any continuous random variable defined on the unit interval [including the Jeffreys prior ≡ Beta(½ , ½) ], that you can express the mean of the posterior distribution, which itself is another Beta, as a weighted average.

    Using a slight re-parametrization of α and β in Beta(α , β) to Beta(μ , M), where μ = α / (α + β) is the prior mean and M = α + β, is a measure of the prior precision, we have

    Posterior Mean = M/(M+n) * μ + n/(M+n) * (k/n),

    Which is to say it’s the weighted average of the prior mean and the MLE, and the weighting reflects in a very reassuring way the relative information in our prior and our sample. [For both α > 1 and β > 1, the beta prior distribution is concave downward and increasingly concentrated around its mean as the sum α + β increases.] Truth-in-commenting disclosure: I owe all this fun stuff to Brad Carlin and Tom Louis’ fine Bayesian stats book.
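    The weighted-average identity above is easy to verify numerically (a sketch, assuming SciPy; the Jeffreys prior Beta(½, ½) serves as the example):

    ```python
    from scipy.stats import beta

    # Verify that the Beta-binomial posterior mean equals
    # M/(M+n) * mu + n/(M+n) * (k/n), with mu = a/(a+b) the
    # prior mean and M = a + b the prior "precision".
    a, b = 0.5, 0.5                     # Jeffreys prior Beta(1/2, 1/2)
    n, k = 20, 7
    mu, M = a / (a + b), a + b

    posterior_mean = beta(a + k, b + n - k).mean()
    weighted = M / (M + n) * mu + n / (M + n) * (k / n)
    print(posterior_mean, weighted)     # the two should match
    ```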

  6. since the posterior distributions are not symmetric, wouldn’t the posterior median be a better point estimate than the posterior mean? Are the posterior medians very different for the uniform and Jeffreys prior?

    [Response: The posterior median is an excellent point estimator, and it’s also invariant under transformations of the variable \theta. I don’t know for sure, but I suspect the uniform and Jeffreys posterior medians are in about the same 2:1 ratio as the means.]

  7. I’m rather surprised you don’t mention the use of the uniform/Jeffreys prior in estimates of climate sensitivity – JA has been banging on about this for years.

    [Response: And I think he has a good point.]

  8. The problem with JA’s argument about bounding climate sensitivity is that even with a Cauchy prior, the result depends critically on the location of the prior. I agree it would be nice to bound climate sensitivity a whole helluva lot better, but I get nervous when my prior dominates the analysis.

  9. David B. Benson

    Only slightly off-topic:
    Pisarenko & Rodkin
    Heavy-Tailed Distributions in Disaster Analysis
    Springer, 2010.

    Emphasis on earthquakes, but considers other distributions of extreme events as well.

  10. O/T – Tamino, are you still planning to do a second installment of this…?