Bad Bayes Gone Bad

In the last post I discussed what I thought was a mistaken application of Bayesian analysis. I didn't claim that Bayesian analysis isn't appropriate for the problem; in fact, I showed the kind of Bayesian analysis which I think is appropriate. But some readers objected to my claims; I think we have some Bayesian zealots out there.


Let’s try again. Two treatments, A and B, are applied to attempt to forestall a disease (“microsoftus”). B is just a placebo, and was given to 10 subjects, 3 of whom were subsequently infected. A was given to 30 subjects, only 1 of whom came down with the disease. The Bayesian analysis I objected to concluded that the probability treatment A was better than B is 99% — definitely significant (one might even say strongly so). A frequentist analysis using an exact test rejected the null (that treatment A has no effect different from B) at 95% confidence, but just barely so.
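Both numbers are easy to reproduce. Here's a minimal R sketch of the two calculations (assuming the flat Beta(1,1) prior from MacKay's example for the Bayesian part, and Fisher's exact test for the frequentist part):

    # Treatment A: 1 of 30 infected; placebo B: 3 of 10 infected
    n <- 1e6
    thetaA <- rbeta(n, 1 + 1, 29 + 1)   # posterior Beta(2, 30) under a flat Beta(1,1) prior
    thetaB <- rbeta(n, 3 + 1,  7 + 1)   # posterior Beta(4, 8)
    mean(thetaA < thetaB)               # ~0.99, the Bayesian figure

    # frequentist exact test on the 2x2 table (infected / not infected)
    fisher.test(matrix(c(1, 29, 3, 7), nrow = 2, byrow = TRUE))$p.value   # ~0.04, just under 0.05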

Let’s change the numbers a bit. Only one subject was given treatment B (placebo), and the unfortunate soul subsequently contracted microsoftus. Only four subjects got treatment A, none of whom got the disease. Do we have statistically significant evidence that treatment A works better than placebo?

The scientist conducting the trial starts with:


We scrap the hypothesis that the two treatments have exactly equal effectivenesses, since we do not believe it.

Then he does the exact same analysis I objected to earlier, which led to 99% confidence in the former case. In this case, the probability that \theta_A < \theta_B turns out to be 0.9524. That's more than 95%, so he says we have significant — publishable — evidence that the treatment is working.

One of the referees for the paper is a frequentist, who says that this is mistaken. One of five subjects got the disease. Under the null hypothesis, the chance that the diseased individual was the one of the five who got treatment B is one in five. That might be good at 80% confidence, but it's nowhere near 95%. No significant result yet.
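The same sketch, with the new numbers, reproduces both figures (again assuming the flat Beta(1,1) prior and Fisher's exact test):

    # placebo B: 1 of 1 infected; treatment A: 0 of 4 infected
    n <- 1e6
    thetaA <- rbeta(n, 0 + 1, 4 + 1)   # posterior Beta(1, 5)
    thetaB <- rbeta(n, 1 + 1, 0 + 1)   # posterior Beta(2, 1)
    mean(thetaA < thetaB)              # ~0.9524 (exactly 20/21)

    # the referee's exact test: given a single infection, the chance it
    # fell on the one placebo subject out of five is 1/5
    fisher.test(matrix(c(0, 4, 1, 0), nrow = 2, byrow = TRUE))$p.value   # 0.2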


38 responses to “Bad Bayes Gone Bad”

  1. I should probably take time to learn how to calculate the 0.9524, but the frequentist approach makes so much sense to me that maybe I can’t learn new tricks.

    • Dikran Marsupial

      Pretty much the opposite for me: the Bayesian interpretation of probability (as a degree of belief in the validity of a proposition, rather than a long-run frequency) makes so much more sense; the frequentist approach seems rather odd. How can you have a probabilistic hypothesis-testing framework that can't talk of the probability that a hypothesis is true?

      The Bayesian approach seems to make much more sense to me; however, that doesn't mean I don't use frequentist hypothesis tests, or see their value. Perhaps we should discuss confidence intervals next? ;o)

  2. A “true” Bayesian would never say “that's more than 95%, we have significant evidence”. Bayesians do not pick arbitrary probability thresholds for deciding what is “significant” and what is not. Bayesian probabilities represent degrees of belief, not long-run probabilities of events (false positives etc.) happening. There are no type 1 or type 2 errors, and so no “statistical significance” in the traditional sense.

    If a Bayesian (or anyone) used 5 data points to test anything, they would have some pretty good “prior” belief about the effect they were looking for, and would be using the data to update their prior model (or check that their new data was consistent with their prior model) – which would not include ridiculous values like infection rates of 0.999999.

    Even if they had little prior idea of the true differences b/w A and B, a Bayesian could (and should) sensibly constrain their priors to prevent extreme values – especially in notoriously pathological binomial examples like this, where the max. likelihood would occur at values of 0 (A) and 1 (B). Just about any prior that did not allow, or greatly down-weighted, probabilities close to one would yield P(A<B) values less than the "magic" 0.95 (so even if a Bayesian did use such a threshold, they would save themselves the embarrassment of submitting a manuscript based on 5 cases and no prior information).

    [Response: All I did was apply the exact same analysis as before to different numbers. Now you want to insist on an informative prior?]

    • Only if you want sensible results… and I also wanted you to use an informative prior in the previous analysis, where informative means downweighting obviously absurd values (an option not open to frequentists).

      [Response: I used exactly the same prior that MacKay used. That's the point. As for "informative prior" -- for made-up data in a hypothetical, that's a stretch.]

  3. Tamino: "In order to get statistically significant results, you have to measure O(null), A and B in large quantities."

  4. The referee is wrong. Under the null the chance that the Treatment B subject got the disease is 1/5, but the chance that all the Treatment A subjects escaped the disease is (4/5)^4.

    The p-value then is (4/5)^4 x (1/5) ~= 8%, not too dissimilar to the posterior probability that \theta_A < \theta_B.

    [Response: Don't be silly.]

  5. Dikran Marsupial

    The Bayes factor is still only 20, which according to the usual (Jeffreys’) scale is “strong evidence” against H0, but not decisive. For “decisive evidence” against H0 you need a Bayes factor of 100 or more. The 0.9524 and 0.95 figures in the two tests are not directly comparable as they are not the answers to the same questions.

    However, as Kass and Raftery (and one of the zealots on the other post ;o) pointed out, it is important to assess the sensitivity of conclusions to the prior distributions used.

    [Response: All I did was apply the exact same analysis as before to different numbers.

    And no, the Bayes factor is not 20. To get that, you'd do the Bayesian analysis comparing different hypotheses.]

    • Dikran Marsupial

      The analysis may be the same, but the interpretation is “non-standard”. A Bayesian wouldn’t assume that the alternative hypothesis is “true” because the posterior probability of the alternative hypothesis is greater than 0.95 (the 95% is only a heuristic for frequentists anyway). The only things that are “true” and “false” to a Bayesian are things that are logically true or logically false. As a result a Bayesian analysis just discusses the relative support different hypotheses receive from the data. For making decisions, both hypotheses are considered and their losses are weighted according to their posterior probabilities.

      The Bayes factor is about 20. If the posterior probability of the alternative hypothesis is 0.9524, then the posterior probability of the (complementary) null hypothesis is 1-0.9524 = 0.0476. The posterior odds are the product of the prior odds (which are one) and the Bayes factor, so here the Bayes factor is equal to the posterior odds, i.e. 0.9524/(1-0.9524), which is approximately 20. A Bayesian would then compare that to Jeffreys’ scale and say there was “strong” (but not “very strong” or “decisive”) evidence for the alternative hypothesis. That is pretty much what MacKay did.

      Bayesian “hypothesis tests” are all about comparing the different hypotheses. It is frequentists that don’t bring the alternative hypothesis into their analysis; Bayesians do.

      To make a closer link between Bayesian and frequentist methods, view the Bayes factor as the Bayesian alternative to the frequentist likelihood ratio test.

      [Response: You don't like the result? Tell it to MacKay -- it's his procedure.]

      • Dikran Marsupial

        Sorry Tamino, the result is that the Bayes factor is 20, which is “strong” but not “very strong” or “decisive” evidence in favour of the alternative hypothesis, which seems perfectly reasonable to me.

        [edit]

        [Response: This has gone beyond tedious. MacKay makes his question clear:

        If we want to know the answer to the question `how probable is it that p_{A+} is smaller than p_{B+}?'

        and he makes his answer clear:

        is P(p_{A+} < p_{B+} | Data) = 0.990.

        ]

      • Dikran Marsupial

        Sorry Tamino, if you are going to delete posts that are politely attempting to point out your error, then it is a disappointing day for me as this had always been one of my favourite science blogs.

        You misunderstood MacKay’s example, and you are now misleading others by deleting the posts that explain your misunderstanding. A more mature attitude would be to leave the posts as they were (as they were polite and on-topic) and merely agree to disagree (and ask to leave it there – being a gentleman I would have done so).

        [Response: If you're going to simply refuse to believe MacKay meant what he said -- what I quoted verbatim -- there's no point in endless argument. Polite or no.

        You're entitled to your opinion. I'm entitled to put an end to what's pointless and endless.]

    • Dikran Marsupial

      Tamino, has it not occurred to you that it may be you that has misunderstood? I am not being argumentative for the sake of it. I can see that you have misunderstood MacKay, and allowing the misunderstanding to continue whilst deleting posts that explain the misunderstanding, *that* is a disservice to the readers (I for one would be interested in what Aslak had to say, especially as he has taken the trouble to perform the analysis himself).

      In that section the part that is a hypothesis test is the bit that says:

      “In conclusion, according to our Bayesian model, the data … give very strong evidence – about 99 to 1 – that the treatment A is superior to treatment B.”

      That is the only line that can be interpreted as a hypothesis test, and in it he gives the Bayes factor in the form of a ratio (99:1 – the one being the null hypothesis), and note he uses the interpretation in exactly the wording suggested by Jeffreys.

      Can you show me where MacKay uses a threshold of 95% to demonstrate statistical significance in the context of a Bayesian test?

      Your misunderstanding is in assuming that the posterior probability that thetaA is less than thetaB is a Bayesian hypothesis test. It isn’t, you need the null hypothesis as well for that.

      [Response: I repeat: MacKay asks:

      If we want to know the answer to the question `how probable is it that p_{A+} is smaller than p_{B+}?'

      and he makes his answer clear:

      is P(p_{A+} < p_{B+} | Data) = 0.990.

      Your argument boils down to refusing to believe that he meant what he said.

      You've had your say -- more than once. Subject closed]

  6. Janne Sinkkonen

    Take the frequentist t-test. The t-distribution is the posterior of the mean, given a sample from a gaussian r.v., and a conjugate prior. In the frequentist test you compute the confidence interval for the posterior of the mean, or for difference between means, under an implicit uninformative prior. Then you compute the p-value as the value of the cdf of the posterior at zero. You reject or accept H0 depending on how extreme the posterior is, *under H1*.

    Why do you (presumably) accept this, but not MacKay’s procedure, which is exactly the same, except for the form of the distributions (normal -> binomial, t -> beta)?
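    Here is a quick R sketch of that equivalence (made-up sample; under the standard noninformative prior p(mu, sigma^2) proportional to 1/sigma^2 the posterior of the mean is exactly the t distribution used in the test):

    set.seed(1)
    x <- rnorm(12, mean = 0.6)                 # any made-up sample will do
    n <- length(x); xbar <- mean(x); s <- sd(x)
    # one-sided frequentist p-value, H0: mu = 0 vs mu > 0
    p_freq <- t.test(x, mu = 0, alternative = "greater")$p.value
    # Bayesian posterior P(mu < 0 | data): mu | data ~ xbar + (s/sqrt(n)) * t_{n-1}
    p_bayes <- pt((0 - xbar) / (s / sqrt(n)), df = n - 1)
    c(p_freq, p_bayes)                         # numerically identical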

    Here is a plan for your next post: take a small sample, make a sign test, and conclude that the bad bayesian t-test with inflated p-values is invalid. :)

    The real reason for the differences is not bayes vs frequentist, but everything else: invalid approximations (chi-square), poor power (permutation test), so small samples that everything highly depends on test and modeling assumptions, including priors.

    Publishable? In what field would p=0.05 be publishable, with a sample size of {1, 4}? :)

    [Response: Boy, are you reachin'. There's no chi-square here, it's the hypergeometric distribution which is EXACT. There's no "modeling assumption," it's the standard null hypothesis anyone would use. And p=0.05 is publishable -- it's the de facto standard -- it has nothing to do with the sample size, that only affects one's ability to get to p=0.05.]

    • Janne Sinkkonen

      Chi-square refers to your earlier post. Modeling assumptions refer to chi-square and MacKay’s bayesian procedure. Even the permutation test, or “exact inference”, has its frequentist properties that may not be so good (according to Andrew Gelman, who referred to Agresti and Coull, 1998 – which I have not looked at).

      [Response: Chi-square was MacKay's choice -- I applied the exact test in my previous post as well as this one. A vague reference to "frequentist properties that may not be so good" is meaningless, and saying it's according to a paper you haven't looked at is funny.]

      In most application fields a paper with n={1,4} would not even get to the referees. Nobody plans such an experiment, and a report of one raises the question: what has gone wrong – where are the other data? Has the experiment been prematurely interrupted, or is there a hidden selection process? [I say this on the basis of about 20 years in psychology, brain research, computer science, and later machine learning.]

      [Response: Once again: I just applied the exact analysis used by MacKay with different numbers. Now you want to talk about other data, premature interruption, hidden selection ...]

      As for p, I would not accept a paper if the evidence was overall on the level of p=0.05. There is typically too large a grey area of twiddling with selection and test procedures, which makes 0.05 meaningless. OK, maybe as a poster or in a low-quality journal as “suggestive evidence”, but not in a good journal.

      [Response: So you wouldn't accept a paper if the evidence met the de facto standard?]

      On the other hand, if the integrity of the procedures could be guaranteed, as in an internal process of a research group or a company, then 0.05 is OK, even with {1,4}, to warrant further research etc.

      [Response: This is a hypothetical example to explore the validity of the *analysis* procedure. Your talk about "integrity" is just a distraction.]

      • Janne Sinkkonen

        I agree that publishability is a distraction, but the issue arose from your own assertion of p=0.05 being publishable regardless of n.

        The Agresti and Coull (1998) paper is directly from Gelman’s blog discussing your stuff, see http://www.stat.columbia.edu/~cook/movabletype/archives/2010/03/hey_statistics.html . I’m not an expert on the “frequentist properties”, but maybe you find the reference useful.

        I still see it more likely that the difference in p-values between your exact test and MacKay’s procedure is due to poor power of the exact test, rather than H0 being pre-emptively rejected by the MacKay procedure. There is also the issue of one vs. two tails.

        You don’t have any thoughts on the t-test analogy?

        In general, you could find Gelman’s comments informative. He is an experienced practitioner and an author of a classic book on the subject.

        I have enjoyed and shared (via Google Reader) many of your climate-related posts. I have to admit that this statistical sidetrack disappoints, both in content and in style. Regardless of that, try to keep up the good work.

        [Response: Do you really think that this example -- with 1 person in group B and 4 in group A -- establishes a 95% chance that treatment A is better than treatment B?

        The only problem with the *power* of the exact test is the too-small sample size. But apparently you think Bayesian analysis is some magic genie that can get 95% probability from such a sample. Wrong; it just *seems* to because the application of Bayesian analysis is wrong.]

      • Janne wrote: “I have to admit that this statistical sidetrack disappoints, both in content and in style.”

        I disagree, and so want to encourage Tamino to keep up the statistical posts. Usually they make me think, and sometimes I become wiser. It is also great that Tamino does not fall into a clear bayesian or frequentist category, but actually considers what the two approaches can offer.

  7. It is always good to really think through what it is you are comparing, and simple examples like this one are very illustrative.

    I calculated the Bayes factors (odds ratios) for these hypotheses:
    better: A better than B
    worse: A worse than B
    same: A just as good as B

    I get these Bayes Factors:

    HBetter/HWorse: 20
    HBetter/HSame: 5.7

    If we further know that the risk of getting sick is small, i.e. Theta < 0.1, then I get these odds ratios:

    HBetter/HWorse: 2.4
    HBetter/HSame: 1.5

    — Matlab code for the interested —
    maxP=0.1; %maximum conceivable risk of getting sick
    [Pa,Pb]=meshgrid(0:maxP/1000:maxP,0:maxP/1000:maxP);
    %we observe 0/4 getting sick under vaccine A
    %we observe 1/1 getting sick under placebo B
    L=Pb.*((1-Pa).^4); %likelihood of that happening.

    Abetter_ix=Pa<Pb;
    Asame_ix=Pa==Pb;
    Aworse_ix=Pa>Pb;

    %Bayesfactors
    BetterOverWorse=mean(L(Abetter_ix))./mean(L(Aworse_ix))
    BetterOverSame=mean(L(Abetter_ix))./mean(L(Asame_ix))

    • Gavin's Pussycat

      Aslak, yes, I get the same results. BTW your listing was messed up by ‘less than’ and ‘greater than’ being HTML characters. The relevant part should be

      Abetter_ix=Pa<Pb;
      Asame_ix=Pa==Pb;
      Aworse_ix=Pa>Pb;

  8. You write frequentists would argue that “the chances that the diseased individual was the one of five who got treatment B is: one of five. That might be good at 80% confidence…”. That is true, if there is a very small chance of getting sick. However, consider a case where we know a priori that there is a high risk of getting the disease… E.g. all subjects are unprotected sex-workers in South Africa, and we are talking about AIDS.

    [edit]

    [Response: This is getting tedious. The whole point is to evaluate the procedure, which includes no prior information -- exactly as was done by MacKay. Just the data.]

    • Tamino,

      There is NO SUCH THING as a Bayesian analysis that uses “just the data”. Such a concept does not, and cannot, exist. Period. Irrespective of what people claim about “no prior information”. This is a dangerous phrase used with a nod and a wink among the knowing and often used to (deliberately) mislead the unwary.

      I think this is part of the problem – priors that are claimed as being “ignorant” are actually making a very particular set of assumptions about the parameters. There is no way of getting round this other than not touching Bayesian probability. The only sensible course of action (IMO) is to choose the priors carefully AND check sensitivity to them (and other inputs if they are also uncertain).

    • In the snipped block, I argued that frequentists also have to deal with priors. I’ve elaborated on this here for the curious:
      http://www.glaciology.net/Home/Miscellaneous-Debris/toyexampleofbayesianvsfrequentistmethods

  9. There’s some discussion about these posts on Andrew Gelman’s blog ( http://scienceblogs.com/appliedstatistics/2010/03/hey_statistics_is_easy.php ). After some initial skepticism, down in the comments he comes around to Tamino’s side when he realizes that Tamino is posting in order to respond to a reader and making a limited point.

    Many comments here seem to be from people who read the post, miss its narrow intent, and complain that this isn’t what they would do if they were examining placebo-control in a Bayesian setup. Probably not, and I’ll venture that Tamino wouldn’t either! The post is about a narrow problem suggested by the quote in MacKay’s text, and is not a general examination of Bayesian inference…

  10. fwiw, Ian Musgrave in the comments at Gelman’s blog is not me… Living here in the US, it’s easy to forget that there are other Ians out there. :)

  11. As Dikran pointed out in the previous post, the main point of difference b/w MacKay’s “procedure” and Tamino’s frequentist tests is that MacKay’s test is one-tailed and Tamino’s frequentist test is 2-tailed.

    As Tamino points out, MacKay was quite clear what hypothesis he was testing:
    “how probable is it that p_{A+} is smaller than p_{B+}?”

    His test is perfectly valid for that. His null is that A >= B. Tamino’s null is that A = B. A frequentist could do the one-tailed test too, testing the null that A >= B.

    The Bayesian equivalent of Tamino’s frequentist null (A=B) hypothesis test is to do the model comparison (discussed in the previous post) – if you do that with the new data (0/4 for A, 1/1 for B), you get only marginally more support for the alternative model – and a Bayes factor < 3 (and the post. model prob. is certainly nowhere near .95).
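    A sketch of that model comparison for the n = {1, 4} data, assuming flat Beta(1,1) priors throughout (the original calculation may have used something slightly different): the marginal likelihood of each model is the binomial likelihood integrated over the prior.

    # H_same: thetaA = thetaB = theta, theta ~ Uniform(0,1)
    # H_diff: thetaA and thetaB independent, each ~ Uniform(0,1)
    m_same <- integrate(function(t) t * (1 - t)^4, 0, 1)$value      # = B(2,5) = 1/30
    m_diff <- integrate(function(t) t, 0, 1)$value *                # placebo: 1 of 1 infected
              integrate(function(t) (1 - t)^4, 0, 1)$value          # treatment: 0 of 4 infected
    m_diff / m_same    # Bayes factor for "different" over "same", about 3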

    Tamino’s objection seems to be with MacKay’s prior beliefs (that there is effectively zero probability that the treatment effects are identical) and/or his choice of hypothesis (A < B). That’s fine; Tamino can argue with MacKay about priors, and is free to use his own priors and test his own hypotheses. But there is little point in applying a frequentist interpretation to a Bayesian result, or in contrasting it with a frequentist result that is testing a different hypothesis.

    The first step in any analysis is to define the question, if that is done right, frequentist and Bayesian methods will usually give very similar results (except when the frequentist result is wrong).

  12. the 2nd last paragraph of my previous comment got mangled – it should read:

    Tamino’s objection seems to be with MacKay’s prior beleifs (that there is effectively zero prob that the treatment effects are identical) and / or his choice of hypothesis (that A 95% therefore its significant) to a Bayesian result or in contrasting it with a frequentist result that is testing a different hypothesis.

  13. OK one more try…the 2nd last para should end:

    ….choice of hypothesis (A<B). That’s fine, Tamino can argue with MacKay about priors, and is free to use his own priors and test his own hypotheses. But there is little point in applying a frequentist interpretation to a Bayesian result, or in contrasting it with the frequentist result that is testing a different hypothesis.

    [Response: I think you're the one who has utterly failed to understand what I'm saying. In any case, quit beating the dead horse.]

  14. >Response: Don’t be silly.

    The whole point of this post is that the Bayesian probability P(thetaA < thetaB) = .95 is inconsistent with a frequentist p-value of 0.2.

    How is it silly to point out that you've calculated the p-value wrong?

    [Response: Because I didn't. You did.]

    • So is that really the point of this post? That a Bayesian probability is inconsistent with a frequentist p-value?

      [Response: I guess there are two points. 1. Bayesians can go wrong too, especially when they reject the most likely explanation by assertion. 2. Dead horses can get a helluva beating.]

  15. >Response: Because I didn’t. You did.

    One of us has the wrong p-value, which means that one of us has made a mistake. Simply asserting that the other is wrong doesn’t help us determine where the mistake is. I’ve pointed out your mistake — you haven’t taken into account the probability of no one in Treatment A contracting microsoftus. What mistake do you believe I have made?

    Assuming the null hypothesis thetaA = thetaB = 1/5,

    P(yB/nB – yA/nA >= 1)
    = P(yB=1 & yA=0)
    = P(yB=1) x P(yA=0)
    = (4/5)^4 x (1/5)
    ~= 0.08

    [Response: The null hypothesis is NOT thetaA=thetaB=1/5. It's that there's no difference between A and B. Therefore: if only one person gets infected (as happened), there's an equal chance for each person regardless of treatment group. Since B has 1/5 of the persons, the probability that the one person is in group B is 1/5.]

  16. Assuming thetaA=thetaB, then P(yB=1 | yA+yB=1) = 1/5.

    That calculation is correct, but P(yB=1 | yA+yB=1) is not the p-value for this experiment!

    [Response: YES IT IS.]

    The p-value is P(yB/nB – yA/nA >= 1).

    Under the null hypothesis thetaA=thetaB=theta0, we have

    P(yB/nB – yA/nA >= 1) = (1-theta0)^4 x theta0

    Taking the MLE (yA + yB) / (nA + nB) gives us an estimate of 1/5 for theta0, and a p-value of ~ 0.08.

    [Response: Four women and one man are in a ski lodge. One of them suffered a broken leg that day. You say: the probability it was the man is 0.08. I say: wrong. It's 1/5.]

  17. (Sorry if this is a repost, my comment seems to have been lost somehow)

    >Response: Four women and one man are in a ski lodge. One of them suffered a broken leg that day. You say: the probability it was the man is 0.08. I say: wrong. It’s 1/5.

    No, I specifically agreed that thetaA = thetaB => P(yB=1 | yA+yB=1) = 1/5. But this probability is not the p-value.

    >Response: YES IT IS.

    A definition of p-value (from Wikipedia):

    In statistical hypothesis testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.

    The test statistic here is yB/nB – yA/nA, and its observed value is 1. Therefore the p-value is P(yB/nB – yA/nA >= 1).

    [Response: It's up to you to learn why you're wrong. Here's a clue: it's called "Fisher's exact test."]

  18. (Weird, this is the second time WordPress has eaten one of my comments. Anyone else having this problem?)

    [Response: It disappeared because you are wrong, you are obstinate, and I don't care to hear it any more. Go away.]

    • I’ve reviewed Barnard’s exact test, and come to the conclusion that I was wrong. Pete was right.

      The p-value for the stated data by Barnard’s test is 0.082, which is a better result than the 1/5 which results from Fisher’s test. Barnard’s test is more powerful than Fisher’s for small contingency tables; my insistence on the correctness of Fisher’s test was obstinacy.
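      For reference, both numbers can be checked in a couple of lines of R (a sketch; Barnard's test proper maximises the tail probability over the nuisance parameter, and here the maximum happens to fall at the pooled estimate theta = 1/5):

      tab <- matrix(c(0, 4,    # treatment A: 0 infected, 4 not
                      1, 0),   # placebo B:   1 infected, 0 not
                    nrow = 2, byrow = TRUE)
      fisher.test(tab)$p.value   # 0.2 -- conditions on the single observed infection
      (1/5) * (4/5)^4            # ~0.082 -- unconditional tail probability at theta = 1/5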

  19. >I’ve reviewed Barnard’s exact test, and come to the conclusion that I was wrong. Pete was right.

    Oh, crap — in that case please delete the passive-aggressive post I’ve got sitting in moderation ;-)

    [Response: The fault was mine.]

  20. Janne Sinkkonen

    OK, so Fisher’s test is not as powerful as Barnard’s test, because Fisher’s assumes two fixed margins. Using Wald’s statistics, both F. and B. test for the extremity of the observed contingency table, without knowing about the asymmetry between placebo and the treatment. So they should be compared to MacKay’s procedure made two-tailed, right?

    In the latter example of n={1,4}, we have then p_MacKay2 = 0.0958, p_Barnard = 0.0819. For the earlier example with more subjects, p_MacKay2 = 0.0252, p_Barnard = 0.0195. I used the same “uninformative” beta prior for the bayesian procedure as has been used earlier in the examples (parameters (1, 1)).

  21. David B. Benson

    Hard to be both a moderator and one of the respondents at the same time. Well done, methinks.

    • Janne Sinkkonen

      Let me conclude and write a bit more about priors and zealots.

      First some conclusions, that I think most of us can agree:

      - Fisher’s exact test is based on assumptions.
      - Barnard’s exact test is based on assumptions.
      - The bayesian test, “MacKay’s test”, is based on assumptions (prior here, likelihood is quite clear).
      - Taking everything together and comparing the various approaches, there is no systematic difference in p-values between the bayesian and frequentist takes on the problem. But for sure, the test and modeling assumptions one makes affect the p-value.

      Then the priors for MacKay’s procedure, and some other points related to it:

      From a certain viewpoint, the Beta(1,1) prior used so far here is a bit funny. It is uniform over \theta_A \times \theta_B, but if you plot p(\theta_A – \theta_B), you get a triangle-like distribution. It’s not bad, but it feels unnatural, and it actually prefers \theta’s close to each other, although not strongly (which is not bad).

      But what one is usually after in clinical trials is not \theta_A – \theta_B, but more like the ratio, \theta_A/\theta_B, or log(\theta_A/\theta_B), for these tell directly how helpful the treatment is, and are also more directly usable in a cost-benefit analysis. Now if you plot the Beta(1,1) prior on this (log) scale, it looks quite good. It strongly prefers the case of zero treatment effect.

      Parameters of Beta need not be exactly 1 – there’s apparently a paper coming out soon advocating 1/3 for certain cases, and other values are widely used in more complex models. I tried two smaller values, 1/3 and 0.5, and a larger one, 2. The important plot of log(\theta_A/\theta_B) does not qualitatively change, although it is hard to say about the behavior of the tails on the basis of my casual plots. On the scale of \theta_A-\theta_B and with Beta(2,2), one gets a gaussian-like prior, of course with truncated tails, and conversely a strong peak in the middle with Beta(0.5, 0.5). With Beta(1/3, 1/3) there are also peaks at \theta_A=0 and \theta_B=0.

      How does this affect the p-values? In general, increasing the parameter gives a more conservative p-value. With the larger data of n={10,30}, the difference between the priors is negligible (0.0224…0.0277), but with n={1,4} it is considerable (0.0268…0.176). Therefore, with n={1,4} I would refuse to draw any kind of conclusions, due to the high sensitivity to the prior.

      Note that from the bayesian procedure one also gets the whole posterior, so by plotting p(log(\theta_A/\theta_B)) you actually see how big the treatment effect is likely to be – not only whether it is significant or not. This is not a minor advantage in practice.

      In general the whole bayesian procedure, plot or test, is a one-liner in R, something like
      n=1000000; a=1; hist(log(rbeta(n,1+a,29+a))-log(rbeta(n,3+a,7+a)), 1000)  # posterior of log(theta_A/theta_B) for the n={10,30} data
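      And the prior-sensitivity check above is only a few lines more (a sketch, assuming the two-tailed version of the procedure mentioned earlier; it is Monte Carlo, so the third digit will wobble):

      p2 <- function(yA, nA, yB, nB, a, n = 1e6) {
        tA <- rbeta(n, yA + a, nA - yA + a)   # posterior of theta_A under a Beta(a, a) prior
        tB <- rbeta(n, yB + a, nB - yB + a)   # posterior of theta_B
        2 * min(mean(tA < tB), mean(tA > tB))
      }
      sapply(c(1/3, 1/2, 1, 2), function(a) p2(0, 4, 1, 1, a))    # n = {1, 4}
      sapply(c(1/3, 1/2, 1, 2), function(a) p2(1, 30, 3, 10, a))  # n = {10, 30}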

      Last, the zealots:

      I believe that it is generally more productive to be courteous rather than rude towards one’s commentators. And it may not be wise to be cock-sure of complex issues with diverging viewpoints and no ground truth (yet discovered). After all, this is not a political blog, and we are not Anthony Watts – at least I’m not (don’t know of the others – maybe he is using pseudonyms). But these are largely stylistic and personal issues, although they do affect the eventual output of the discussion and the experience of the participants.

      The philosophical distinction between bayesian and frequentist is much less in practice than it seems from theory. Most people, including me, are fine with both, and use what is available, usable, and produces good results. Here I’d use MacKay’s procedure because it is so simple with R, but for a larger contingency table I’d use chi-square – if the table is dense enough. If not, then some thinking and googling would be required.