A reader recently linked to a book about information and inference, which definitely leans toward the Bayesian rather than frequentist view of inference. I do too. But I’m not the avowed anti-frequentist that some Bayesians are.
The book contains what is, in my opinion, at least one example of Bayesian analysis gone wrong. Chapter 37 (in section IV) discusses Bayesian inference and sampling theory. It begins with an example of a clinical trial.
We are trying to reduce the incidence of an unpleasant disease called microsoftus. Two vaccinations, A and B, are tested on a group of volunteers. Vaccination B is a control treatment, a placebo treatment with no active ingredients. Of the 40 subjects, 30 are randomly assigned to have treatment A and the other 10 are given the control treatment B. We observe the subjects for one year after their vaccinations. Of the 30 in group A, one contracts microsoftus. Of the 10 in group B, three contract microsoftus. Is treatment A better than treatment B?
The author begins by undertaking a frequentist analysis of the null hypothesis that the two treatments have the same effectiveness. He applies a chi-square test (pretty standard), as well as a variant (Yates’s correction). He first tests using a critical value (at 95% confidence), concluding that the uncorrected chi-square rejects the null hypothesis (but not by much) while the corrected chi-square fails to do so (but not by much). Then he estimates a p-value, getting 0.07 — which is not significant at 95% confidence but is at 90% confidence. The overall result is that there’s some evidence that treatment A is better, but it’s certainly not conclusive. Incidentally, he also warns that since the observed numbers are small, the chi-square test is imperfect (but that’s not relevant to the point we’ll address).
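If you'd like to reproduce those numbers, here's a minimal sketch using scipy (the 2×2 table layout, rows for treatments and columns for sick/healthy counts, is my own arrangement, not the book's):

```python
# Chi-square test on the 2x2 contingency table, with and without
# the Yates continuity correction.
from scipy.stats import chi2_contingency

table = [[1, 29],  # treatment A: 1 sick, 29 healthy
         [3, 7]]   # treatment B: 3 sick,  7 healthy

for yates in (False, True):
    chi2, p, dof, expected = chi2_contingency(table, correction=yates)
    label = "Yates-corrected" if yates else "uncorrected"
    print(f"{label}: chi2 = {chi2:.2f}, p = {p:.3f}")

# uncorrected:     chi2 = 5.93, p = 0.015  (exceeds the 95% critical value 3.84)
# Yates-corrected: chi2 = 3.33, p = 0.068  (does not)
```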
Then he takes a Bayesian approach to evaluate the difference in effectiveness of treatment A and treatment B. He begins by saying:
OK, now let’s infer what we really want to know. We scrap the hypothesis that the two treatments have exactly equal effectivenesses, since we do not believe it.
Let $p_A$ be the probability of getting “microsoftus” with treatment A, and $p_B$ the probability with treatment B. He adopts a uniform prior, that all possible values of $p_A$ and $p_B$ are equally likely (a standard choice and a good one). “Possible” means between 0 and 1, as all probabilities must be.
He then uses the observed data to compute posterior probability distributions for $p_A$ and $p_B$. This makes it possible to compute the probability that $p_A < p_B$ (i.e., that you're less likely to get the disease with treatment A than with B). He concludes that the probability is 0.990, so there's a 99% chance that treatment A is superior to treatment B (the placebo).
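As a sanity check on that figure, here's a minimal Monte Carlo sketch (my own, not the book's): with uniform priors, the observed counts give Beta(2, 30) as the posterior for $p_A$ and Beta(4, 8) for $p_B$, and sampling both estimates the probability that $p_A < p_B$.

```python
# Monte Carlo estimate of P(p_A < p_B) under the two-parameter model:
# uniform priors plus the observed data yield Beta posteriors.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
p_a = rng.beta(2, 30, n)  # posterior for p_A: 1 sick out of 30
p_b = rng.beta(4, 8, n)   # posterior for p_B: 3 sick out of 10
print(f"P(p_A < p_B) ~= {np.mean(p_a < p_b):.3f}")  # ~ 0.990
```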
However, there’s a problem with this analysis. He assumed that $p_A$ and $p_B$ are different, so he has rejected the null hypothesis by assumption. Given that, it’s no surprise that his result resoundingly favors treatment A! I’ll also point out that his model, in which $p_A$ and $p_B$ are unequal, incorporates two parameters rather than the one in the null hypothesis ($p_A = p_B$), but he hasn’t included any inference penalty for the extra parameter (as would be required by any good information criterion) because he rejects the null by assumption.
Let’s do the analysis again. For the frequentist side, instead of using a chi-square test (a bit dicey with such small numbers) we’ll use the exact test, based on the hypergeometric distribution. Under the null hypothesis (that A and B have the same effect), the probability of getting $k_A$ cases in $n_A$ samples for A, and $k_B$ cases in $n_B$ samples for B, when we have $n = n_A + n_B$ total samples with $k = k_A + k_B$ cases, is

$$P(k_A) = \frac{\binom{n_A}{k_A} \binom{n_B}{k_B}}{\binom{n}{k}}.$$
The “combination” operator is the usual one, given by

$$\binom{n}{k} = \frac{n!}{k! \, (n-k)!}.$$
We compute the frequentist p-value by summing the probabilities for the observed, and more extreme, cases, i.e.,

$$p = P(1) + P(0) \approx 0.042,$$

counting the observed outcome (1 case in group A) and the more extreme one (0 cases in group A).
Using this exact test, the p-value is less than 5%, so by the 0.05 standard we actually end up rejecting the null hypothesis (but not by much). Hence it’s likely (one could even say “statistically significant”) that the treatment is effective, but that doesn’t mean it’s been proved conclusively.
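Here's a short sketch of that exact test, both straight from the hypergeometric formula and via scipy's Fisher exact test (which performs the same computation):

```python
# Exact test: probability of seeing the observed number of cases in
# group A, or fewer, given 4 cases among 40 subjects (margins fixed).
from math import comb
from scipy.stats import fisher_exact

n_a, n_b, k = 30, 10, 4  # group sizes and total number of cases

def p_hyper(k_a):
    """P(k_a cases in group A, k - k_a in group B) under the null."""
    return comb(n_a, k_a) * comb(n_b, k - k_a) / comb(n_a + n_b, k)

p_value = p_hyper(1) + p_hyper(0)  # observed case plus the more extreme one
print(f"exact p-value = {p_value:.4f}")  # ~ 0.0417, just under 0.05

_, p_fisher = fisher_exact([[1, 29], [3, 7]], alternative="less")
print(f"scipy fisher_exact = {p_fisher:.4f}")  # same answer
```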
Now let’s take a Bayesian approach. But instead of just assuming that the null hypothesis is false, let’s compare the null and alternate hypotheses. Under the null hypothesis there’s a single probability $p$ for both treatments A and B; under the alternate there are two different probabilities $p_A$ and $p_B$. I too will adopt the uniform prior probability for the $p$ values, that all possible values are equally likely.
I won’t give the full calculation, but I will give the final result. The probability of getting the observed result (the “data” $D$) under the null hypothesis is

$$P(D \mid H_0) = \binom{30}{1} \binom{10}{3} \int_0^1 p^4 (1-p)^{36} \, dp \approx 0.00096.$$
The probability of getting the observed result under the alternate hypothesis is

$$P(D \mid H_1) = \binom{30}{1} \int_0^1 p_A (1-p_A)^{29} \, dp_A \times \binom{10}{3} \int_0^1 p_B^3 (1-p_B)^7 \, dp_B = \frac{1}{31} \cdot \frac{1}{11} \approx 0.0029.$$
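For anyone who wants to check those evidences, here's a sketch. With a uniform prior the integrals are Beta functions, and under the alternate hypothesis each group's integral collapses to $1/(n+1)$:

```python
# Marginal likelihood (evidence) of the data under each hypothesis,
# assuming uniform priors on the probabilities.
from math import comb
from scipy.special import beta as B  # Euler Beta function

# Null: one shared p.  Integrate C(30,1) p (1-p)^29 * C(10,3) p^3 (1-p)^7.
p_d_h0 = comb(30, 1) * comb(10, 3) * B(5, 37)

# Alternate: independent p_A, p_B; each factor C(n,k) * B(k+1, n-k+1) = 1/(n+1).
p_d_h1 = (comb(30, 1) * B(2, 30)) * (comb(10, 3) * B(4, 8))

print(f"P(D|H0) = {p_d_h0:.5f}")                # ~ 0.00096
print(f"P(D|H1) = {p_d_h1:.5f}")                # ~ 0.00293 = (1/31)*(1/11)
print(f"Bayes factor = {p_d_h1 / p_d_h0:.2f}")  # ~ 3.05
```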
We see that the given data are about three times as likely under the alternate hypothesis (that treatment A differs from treatment B) as under the null that there’s no difference. But that’s not overwhelmingly more likely. Clearly it’s likely that the treatment is effective, but it’s far from proved conclusively.
To translate these likelihoods into a probability that the alternate hypothesis is true, we’d have to have a “prior probability” that the alternate is true. Let this prior probability be $P_1$. Then the probability that the alternate hypothesis is true is

$$P(H_1 \mid D) = \frac{P(D \mid H_1) \, P_1}{P(D \mid H_1) \, P_1 + P(D \mid H_0) \, (1 - P_1)}.$$
If we use a 50-50 prior (equal chance that the treatment works and that it doesn’t) we get $P(H_1 \mid D) \approx 0.75$. With this prior, the chance that treatment A is actually having an effect is only 75%, a far cry from the 99% previously claimed.
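Folding in the prior is one line of Bayes' rule. A sketch, using the evidences computed above:

```python
# Posterior probability of the alternate hypothesis for a given prior P1.
p_d_h0, p_d_h1 = 0.00096, 0.00293  # evidences from the previous sketch

def posterior(p1):
    return p_d_h1 * p1 / (p_d_h1 * p1 + p_d_h0 * (1 - p1))

print(f"50-50 prior: P(H1|D) = {posterior(0.5):.2f}")  # ~ 0.75
print(f"10% prior:   P(H1|D) = {posterior(0.1):.2f}")  # ~ 0.25, the prior matters
```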
And it’s well to bear in mind that even the 75% result depends on the prior, and although a 50-50 prior doesn’t assume much, it is an assumption and has no real justification. Perhaps the best we can say is that the data enhance the likelihood that the treatment is effective, increasing the odds ratio by about a factor of 3. But the odds ratio after this increase depends on the odds ratio before the increase, which is exactly the prior we don’t really have much information about!
In case you feel a bit cheated because I didn’t show the details of how to do the calculation, don’t feel too bad, because the author of the aforementioned text does so himself! Immediately after computing that the probability is 99% that the treatment is effective (which frankly I disagree with!), he considers the example of just one person given treatment A and one person given treatment B, and for that tiny example he performs exactly the hypothesis comparison I’ve done here for the case of 30 patients on treatment A and 10 on treatment B. A bit ironic, eh? You can read all about it there.