A reader recently asked my opinion about this post at Skeptical Science, which is a comment on Ambaum 2010, *Significance Tests in Climate Science*, J. Climate, doi: 10.1175/2010JCLI3746.1.

The theme of the paper is that the common application of significance testing is misguided. Mainly, Ambaum objects that far too often researchers will apply a significance test, obtain a highly “significant” (i.e., very low) *p*-value, then declare that their hypothesis is significant at the 1-*p* confidence level. In fact, Ambaum says, this isn’t so. The actual meaning of a *p*-value is that the observed result, or one which is even more extreme, has only a chance *p* of occurring when an *alternate* hypothesis (the *null* hypothesis) is true.
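To make that definition concrete, here is a minimal sketch (my own illustrative example, not from the paper): the *p*-value for 16 heads in 20 tosses of a putatively fair coin is the probability, under the null hypothesis, of an outcome at least that extreme.

```python
from math import comb

def two_sided_binomial_p(k, n, p0=0.5):
    """P(an outcome at least as extreme as k) under the null P(heads) = p0."""
    pmf = [comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(n + 1)]
    # "at least as extreme" = every outcome no more probable than the observed one
    return sum(pr for pr in pmf if pr <= pmf[k])

# 16 heads in 20 tosses of a putatively fair coin
print(round(two_sided_binomial_p(16, 20), 4))  # prints 0.0118
```

A small *p* tells us the *null* is disfavored; it is not, by itself, the probability that any particular alternative is true.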

Ambaum also states that a large fraction of papers in the climate-science literature make this mistake — sometimes in a minor way which doesn’t alter the validity of the results, and sometimes in a major way which calls results into question. As Ambaum says:

A large fraction of papers in the climate literature includes erroneous uses of significance tests… The significance statistic is not a quantitative measure of how confident we can be of the ‘reality’ of a given result. It is concluded that a significance test very rarely provides useful quantitative information.

In large part, I agree with Ambaum. I’ve certainly struggled to emphasize to colleagues that a highly significant statistical result does *not* prove that one’s hypothesis is true; it merely negates the null hypothesis. I’ve also emphasized that there are many ways for the null hypothesis to be false, other than the researcher’s favored hypothesis being true. I’ve even tried to stress that the null hypothesis on which a *p*-value is based usually includes “more than meets the eye.” For instance, when using a linear regression test for trend the usual null hypothesis isn’t just that the data are trendless noise, but that they’re trendless *white* noise. Ambaum is certainly correct that far too much research is done without sufficient understanding of the actual implication of the results of statistical tests.
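The hidden “white” in that null hypothesis matters. Here is a sketch (my own simulation, with arbitrary parameters): trendless AR(1) “red” noise, judged against a slope threshold calibrated for *white* noise, yields spurious “significant” trends far more often than the nominal 5%.

```python
import random

random.seed(0)
N, SIMS, PHI = 100, 2000, 0.8   # series length, trials, AR(1) autocorrelation

def slope(y):
    """Ordinary least-squares slope of y against time t = 0..len(y)-1."""
    n = len(y)
    tbar, ybar = (n - 1) / 2, sum(y) / n
    num = sum((t - tbar) * (v - ybar) for t, v in enumerate(y))
    den = sum((t - tbar) ** 2 for t in range(n))
    return num / den

def white():
    return [random.gauss(0, 1) for _ in range(N)]

def red():
    y, x = [], 0.0
    for _ in range(N):
        x = PHI * x + random.gauss(0, 1)
        y.append(x)
    return y

# 95% point of |slope| under the *white*-noise null
w = sorted(abs(slope(white())) for _ in range(SIMS))
crit = w[int(0.95 * SIMS)]

# trendless *red* noise trips that threshold far more than 5% of the time
false_alarms = sum(abs(slope(red())) > crit for _ in range(SIMS)) / SIMS
print(false_alarms)   # typically well above 0.05
```

The low *p*-value in such a case says nothing about a trend; it merely reflects the falsity of the “white” part of the null.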

Yet there are parts of Ambaum’s paper with which I take exception. For one thing, I don’t believe climate science should be singled out for this property. I suggest that errors of statistical interpretation are just as common in most scientific fields as they are in climate science, and that in this regard climate science is typical of science in general.

For another thing, I don’t believe that “a significance test very rarely provides useful quantitative information.” I’d say the opposite: that a significance test almost *always* provides useful quantitative information. In fact, that’s the whole point of significance testing: to provide useful *quantitative* information.

Ambaum provides some very basic Bayesian results, chiefly to show that the probability our hypothesis “A” is true given a result “X” (which is often written P(A|X)) depends not only on the improbability of “X” under a null hypothesis, but also on the estimated “prior probability” of our theory. This is quite correct. But that doesn’t mean that the significance test result (which is the probability P(X|B) of “X” given null hypothesis “B”) isn’t useful. On the contrary, it’s incredibly useful. And it’s certainly quantitative.

It’s worthwhile to warn working scientists of the danger of equating P(X|B) with P(A|X). Linear regression for trend testing is a good example. Just because P(X|B) is incredibly small doesn’t mean that “A” is true, i.e., that there’s a trend which is linear. Was the regression based on the null hypothesis of white noise? If so, then the low *p*-value might be because the data are noise which isn’t white. Or it might be because there’s a trend, but it’s not a linear trend. Perhaps the trend is approximately linear but not exactly so. All we can safely conclude from the significance test is that the data aren’t just white noise.

It’s also possible that the null hypothesis is provably false but the significance test will *fail* to show it. If the trend is quadratic, and the minimum is near the center of the observed time span, it often happens that the *p*-value from linear regression will not be significant — but that doesn’t mean that the data are just noise (white or otherwise). It really is necessary to consider a spectrum of possible hypotheses, and to apply our best information (including prior knowledge) to get a reasonable estimate of the probability that our hypothesis is true — or that the null hypothesis is false.
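A tiny constructed example makes the point: a noise-free quadratic “trend” with its minimum at the center of the record has a least-squares linear slope of exactly zero, so a linear-trend test sees nothing at all.

```python
N = 101
t = list(range(N))
c = (N - 1) / 2
y = [(ti - c) ** 2 for ti in t]   # U-shaped trend, no noise whatsoever

tbar, ybar = sum(t) / N, sum(y) / N
b = sum((ti - tbar) * (yi - ybar) for ti, yi in zip(t, y)) \
    / sum((ti - tbar) ** 2 for ti in t)
print(b)   # 0.0 -- linear regression is blind to this (strong) trend
```

By symmetry, the positive and negative contributions to the slope cancel exactly; a test against the wrong family of alternatives can miss even a perfect signal.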

It’s also well to recall that in some cases we already know the null hypothesis is false even when the significance test doesn’t show that. Consider for example testing whether or not a coin is “fair,” i.e., whether or not the probability of “heads” and the probability of “tails” are both equal to 1/2. I submit that we can be quite confident even before the experiment that the null hypothesis is false. The coin is *not* fair, at least not perfectly so. The probability of “heads” might be only slightly different from 1/2, perhaps 0.5000000001 or 0.4999999999, but the chance that it’s exactly equal to 1/2 is vanishingly small. One would be hard pressed to identify a physical system for which the chance of a given binomial result is *exactly* 1/2.
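A sketch of that situation (using the normal approximation, and numbers I’ve made up): a coin with P(heads) = 0.501 is certainly not fair, yet the test cannot see it until the number of tosses is enormous.

```python
from math import erf, sqrt

def p_value(k, n):
    """Two-sided normal-approximation p-value for the null P(heads) = 1/2."""
    z = abs(k - n / 2) / sqrt(n / 4)
    return 1 - erf(z / sqrt(2))

# feed the test the *expected* head count for a coin with P(heads) = 0.501
for n in (10_000, 1_000_000, 100_000_000):
    k = round(0.501 * n)
    print(n, round(p_value(k, n), 4))
# 10,000 tosses: p near 0.84 (no hint of bias); 100,000,000: p near 0
```

So “failing to reject” never certifies the null; it only says the data cannot yet distinguish it from nearby alternatives.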

But that in no way degrades the usefulness of the *p*-value from a standard significance test. The idea behind tests based on a null hypothesis is that such a hypothesis makes it possible to *calculate* the probability of the given outcome. When the null hypothesis is chosen to be meaningful, it also provides (useful and quantitative) information about the likelihood of that null hypothesis — and that ain’t just whistlin’ dixie.

Standard hypothesis testing was made popular by the great statistician R. A. Fisher. Ambaum suggests that this is in part due to Fisher’s “marketing” of the idea. I disagree. Significance testing has been immensely popular for a century or so not just because of the “sales campaign” of a giant in the field of statistics, but also because it is *useful*. In fact, Fisher’s stature aside, it could never have achieved such prominence in science without this extreme usefulness.

Do we need to go beyond frequentist significance testing alone, and take full advantage of a better understanding of statistics? Of course. Do we need to be aware of the null hypothesis and the real meaning of a significant *p*-value? Yes. Are there many papers (in all fields of science) which mistakenly equate the unlikeliness of a result under the null hypothesis with the likelihood of a given alternate theory? Yes.

Should significance testing be regarded as rarely providing any useful quantitative information? Certainly not.

Wonderful overview. I have to admit to only lightly reading Ambaum’s paper when SkS covered it; I’ve been too bogged down to dive deeply into papers outside my field at the moment. However, it did remind me somewhat of similar complaints from 15 years ago in psychology – Cohen 1994, with the delightful little title

“The Earth Is Round (p < .05)”, which decries related misuse of significance testing in the social sciences of the time (i.e., interpreting p as the probability that the null is false, or that rejecting the null means the proposed alternative must be true). In spite of those complaints, my research methods classes are still making similar errors, and based on Ambaum (and anecdotal tales from my colleagues in physics) I can only assume similar problems are occurring in other disciplines.

Tamino, here I do agree with your statements. In psychiatry, with studies comparing Lexapro and Prozac, the control group may be biased and may create false results. In medicine, diabetes medicines may be prematurely found to be safe and effective when statistics are not applied and interpreted (including knowing their limitations) correctly. I do indeed, as you seem to, take kindly to Ambaum’s point. I would think then that: medicine, psychiatry and climate science would all benefit from having more qualified statisticians.

“In spite of those complaints, my research methods classes are still making similar errors, and based on Ambaum (and anecdotal tales from my colleagues in physics) I can only assume similar problems are occurring in other disciplines.”

Yeah, in the Psych MSc stats class I teach I use the NHST True/False quiz in the first week (it’s around the web somewhere). The average score has never risen above chance…

” I would think then that: medicine, psychiatry and climate science would all benefit from having more qualified statisticians. …”

Does that also apply to Climate Audit and George Mason University?

Thanks for your comment. It clears up quite a few issues I had with the Skeptical Science piece.

Actually, Ambaum’s paper echoes a lot of the criticisms that Bayesians have leveled against the use of frequentist significance tests. The point really seems to be that there are a lot of ways to screw up significance testing–but that is true whether one is a Bayesian, a Frequentist or whatever.

People need to realize that significance tests apply only to quite simple situations, and reality often is not simple. It’s a matter of knowing which model applies for the conditions of the problem.

JCH that applies to everyone. Hope that helps.

When we perform a test of statistical significance, what we would really like to ask is “what is the probability that the alternative hypothesis is true?”. A frequentist analysis fundamentally cannot give a direct answer to that question, as it cannot meaningfully talk of the probability of a hypothesis being true – a hypothesis is not a random variable; it is either true or false and has no “long run frequency”. Instead, the frequentist gives a rather indirect answer to the question by telling you the likelihood of the observations assuming the null hypothesis is true, and leaving it up to you to decide what to conclude from that. A Bayesian, on the other hand, can answer the question directly, as the Bayesian definition of probability is not based on long-run frequencies but on the state of knowledge of the truth of a proposition. The problem with frequentist statistical tests is that there is a tendency to interpret the result as if it were the result of a Bayesian test, which is natural as that is the form of answer we generally want, but still wrong.

The frequentist approach avoids the “subjectivity” of the Bayesian approach (although the extent of that “subjectivity” is debatable), but this is only achieved at the expense of not answering the question we would most like to ask. It could be argued that the frequentist approach merely shifts the subjectivity from the analysis to the interpretation (what should we conclude based on our p-value). Which form of analysis you should use depends on whether you find the “subjectivity” of the Bayesian approach or the “indirectness” of the frequentist approach most abhorrent! ;o)

At the end of the day, as long as the interpretation is consistent with the formulation, there is no problem and both forms of analysis are useful.

Excellent explanation, Dikran! I tried to say similar things over on Skeptical Science (here and then here), but emphasized the decision making practices of scientists in their everyday work processes. Eric L made an excellent comment in a similar vein.

Thanks Tom, I would encourage people to read Tom’s excellent posts on that thread as well – there are a lot of ways of making the same basic points!

I’ve laid out my objections to Bayesian statistics in this context over at Skeptical Science, though it occurs to me that this would be a better place to discuss this with people who can tell me why I’m wrong. I agree with you that a frequentist approach shifts the subjectivity to the interpretation, but I think that is usually the right thing to do. I understand Bayesian statistics as a useful way to combine several lines of evidence and possibly a pre-existing assumption, but typically a piece of published research is trying to establish one piece of the evidence. A frequentist analysis is something you can actually calculate for your result, whereas a Bayesian analysis is best done by including all the evidence you have including the work others have done, and is out of date the moment future research gives you reason to change your prior.

And ultimately a failure to use Bayesian reasoning when determining the significance in your results is something that will rarely cause wrong conclusions to be published and generally will weed out poorly supported conclusions, which is what you want from statistical significance tests. If I show you that, given the reasons I have for determining my prior confidence as 97.9% and the evidence I’ve found for my conclusion I can use Bayesian reasoning to get a confidence of 98%, should anyone pay attention to my work? Should we tell people who we think came to wrong conclusions and gave pretty strong evidence that their analysis is wrong because they didn’t consider how unlikely their conclusion is if you don’t consider the evidence they’ve provided?

Eric L.

I strongly recommend Ed Jaynes’ text on probability. It is truly a gem, full of the sorts of insights that come only after decades of pondering a field.

http://www.amazon.com/Probability-Theory-Logic-Science-Vol/dp/0521592712/ref=sr_1_1?ie=UTF8&qid=1289902144&sr=8-1

[Response: I made a “bet with myself” that within 24 hr of the question, someone would suggest Jaynes. You’re the third.]

Eric L., I would recommend E. T. Jaynes, “Probability Theory: The Logic of Science”, Cambridge University Press (ISBN 0521592712). Jaynes can (could) be a bit polemic at times, but his book is a classic – well worth reading.

Thanks for the replies. I’ve already replied to Dikran over at SkS so no need to duplicate that discussion here, but I think the best question to ask here is: assuming I really can’t work an actual class into my life, does anyone have a book to recommend, or an online lecture series, or anything like that which could give me a better foundation for understanding these issues? I took a freshman-level stats course, but what I know about Bayesian statistics I learned in data mining/machine learning classes, where it is fundamental, but they may not have covered some of the philosophy behind what it is you’re measuring or some of the other aspects of it that seem to be coming up. If you know a good primer that covers these issues, and what the “right” way to do significance testing is, and how you choose priors when you don’t really have much direct evidence for the prior, please share!

I’d extend Eric L.’s point that “typically a piece of published research is trying to establish one piece of the evidence,” by suggesting that the traditional frequentist statistics be in the Results section of a paper, without any interpretation but the most straightforward, narrow, unbiased, uncontroversial, inferentially correct, and therefore unexciting, nearly useless interpretation. That puts those statistics in their proper place as mere pieces of evidence to be used in subsequent reasoning about the theories that the author really cares about.

The Discussion or Conclusions sections are where I’d love to see Bayesian statistics! Those sections are where you’re supposed to be interpreting the results–explaining the results’ implications for theory. Usually, authors do that without Bayesian statistics, but Bayesian statistics are a wonderful tool for all the reasons Dikran and Ray described.

Eric L – I would not agree that shifting the subjectivity to the interpretation is the right thing to do, as it then occurs implicitly (it is rarely stated why a p-value less than a critical value is reason for rejecting or failing to reject the null hypothesis – it is essentially merely a convention), or in some cases the subjective element is introduced unwittingly, as many users of significance tests don’t fully understand the underlying framework. It is the implicit nature of the subjectivity that leads to misunderstandings like the p-value fallacy (saying there is a probability p that the null hypothesis is false) that is the subject of Ambaum’s paper.

I don’t understand your point about “A frequentist analysis is something you can actually calculate for your result”. All the usual frequentist significance tests have a direct Bayesian equivalent – you can calculate either, and in most cases you would draw the same conclusions from both. Also, it is not clear why a “Bayesian analysis is best performed including all the evidence”; surely the same is true of both frequentist and Bayesian analyses? Why wouldn’t you want to include all of the evidence in drawing your conclusion?

Also, if a posterior is dominated by the prior, that is only a problem if there is a flaw in the prior – if you can find no flaw in the prior, you are logically obliged to accept the posterior. In this case, what is the chain of reasoning for the prior being 97.9%? Questioning the prior is part of Bayesian analysis. As Ray Ladbury correctly points out (amongst other good points), there are Bayesian tests based on Bayes factors where the priors on the hypotheses do not appear at all (although of course there may still be priors on nuisance parameters).

Eric L.,

Frankly, the best argument for using Bayesian statistics is that in most scientific situations, probability is in fact subjective. It depends on the state of the knowledge of the scientist, whereas the hypothesis under consideration is either true (P=1) or false (P=0). What is more, one can in fact use Bayesian reasoning to assess only the significance of one’s results – e.g., by using the Bayes factor. Moreover, it is possible to remove most of the subjectivity by using minimally informative priors and empirical Bayesian methods. I have yet to find anything that the frequentist approach does better than the Bayesian approach – as long as the researcher is honest with himself and his audience.
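A sketch of the Bayes-factor idea (toy numbers of my own choosing, not from the comment): for 60 heads in 100 tosses, the two-sided p-value hovers near 0.05, yet the Bayes factor against the fair-coin null, taking a uniform prior on the bias under the alternative, slightly favors the null.

```python
from math import lgamma, exp, log, comb

k, n = 60, 100   # 60 heads in 100 tosses

# exact two-sided p-value under H0: P(heads) = 1/2
p_val = sum(comb(n, i) for i in range(n + 1)
            if abs(i - n / 2) >= abs(k - n / 2)) / 2**n

# Bayes factor BF10 with a uniform prior on the bias p under H1:
# the marginal likelihood under H1 is the Beta function B(k+1, n-k+1)
log_bf10 = lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2) - n * log(0.5)
print(round(p_val, 3), round(exp(log_bf10), 2))   # p near 0.057, BF10 near 0.91
```

BF10 < 1 means the data mildly favor the null even though the p-value is “almost significant” — exactly why the two numbers must not be conflated.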

Dikran Marsupial, yours is an exceptionally lucid clarification of this complex issue. Thank you for taking the time to write this very useful comment.

Cheers! My thoughts on this were clarified considerably through the discussions of this sort of issue at stats.stackexchange.com, which is a site I’d recommend to anyone interested in stats. While in practice both Bayesian and frequentist approaches are useful, the philosophical differences between the two frameworks are an interesting subject!

The key to fruitful science is to ask the right questions before conducting research. Lots of correct answers are unimportant. The climate research I most admire for its elegance, for its ability to get to the meat of the nut, is usually done by those whom global warming skeptics most vociferously condemn.

I’m still having some trouble understanding what this means in practice. Assuming I understand the Ambaum paper correctly, it boils down to trying to estimate two numbers, which are almost always neglected: P(r>r_0|H), the probability of a correlation above the detection threshold given that the hypothesis is true; and p(H)/p(barH), the odds ratio.

In some simple cases, I could see how to compute the first quantity. If you have a physical model for your hypothesis, and some way to approximate the “noise” under that hypothesis, then a Monte Carlo-type simulation would estimate P(r|H) and thus P(r>r_0|H). On the other hand, I have no clue how to estimate an odds ratio other than “educated guess”. Is it common to just assume the odds ratio to be unity if you have no helpful information to estimate it?
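The Monte Carlo idea for the first quantity can be sketched like this (the trend size, noise level, and threshold are all invented for illustration; r_0 of roughly 0.28 is about the 5% critical correlation for a series of length 50):

```python
import random

random.seed(1)
N, SIMS = 50, 4000
SLOPE, SIGMA, R0 = 0.03, 1.0, 0.279   # assumed trend, noise level, threshold

def pearson_r(y):
    """Correlation of y with time t = 0..len(y)-1."""
    n = len(y)
    tbar, ybar = (n - 1) / 2, sum(y) / n
    sty = sum((t - tbar) * (v - ybar) for t, v in enumerate(y))
    stt = sum((t - tbar) ** 2 for t in range(n))
    syy = sum((v - ybar) ** 2 for v in y)
    return sty / (stt * syy) ** 0.5

# P(r > r0 | H): the fraction of simulated realizations of H that clear r0
hits = sum(pearson_r([SLOPE * t + random.gauss(0, SIGMA) for t in range(N)]) > R0
           for _ in range(SIMS))
print(hits / SIMS)
```

Here H is “linear trend of assumed size plus white noise”; a different physical model for H just swaps out the simulated series.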

I usually find the examples (e.g., those in Ambaum’s paper) too simple or abstract to be of practical use. Does anyone have a better example, or a suggestion for a good reference on this subject? (I’m locating some of Ambaum’s references, but I’m sure there are others!)

AJM, unfortunately, this is one example—perhaps one of the more important ones—that illustrates the basic differences between Frequentist statistics and Bayesian statistics. Consequently, to really understand this difference requires a fair amount of study and thinking, and my personal belief is that one example isn’t going to give you that much insight. A few years back I took the time and simply sat in on a very good introductory course in Bayesian statistics, and given the huge returns it gave me, I would personally recommend it to every researcher who hasn’t already done so. [Yesterday, after reading the Ambaum paper, I reread the section on Bayesian hypothesis testing in the Carlin and Louis text, “Bayes and Empirical Bayes Methods for Data Analysis”, and found it very helpful. But that was using it as a refresher.]

Hi Tamino,

Just curious about this sentence: “The actual meaning of a p-value is that the observed result, or one which is even more extreme, has only a chance p of occurring when an alternate hypothesis (the null hypothesis) is true.”

Shouldn’t that only be when the NULL is true and not when the alternative?

[Response: Yes indeed, and that’s what I was trying to say. By “alternate” I was referring to the null hypothesis (as an alternate to the researcher’s theory), but since “alternate” usually means *not* the null, my wording was most sloppy.]

Eric L | November 16, 2010 at 1:39 am — E. T. Jaynes’ “Probability Theory: The Logic of Science” is highly recommended, albeit not perfect for the matters you mention. There are subsequent papers such as “Philosophy and the practice of Bayesian statistics”

http://arxiv.org/abs/1006.3868

For issues in climatology involving Bayesian reasoning, read Annan & Hargreaves, several papers. For an older paper along that line, try Tol, R.S.J. and A.F. de Vos (1998), ‘A Bayesian Statistical Analysis of the Enhanced Greenhouse Effect’, Climatic Change, 38, 87-112.

I came across this thread pretty much by accident. Some readers may find my attempt to explain p-values in Weather (2004) of some use?

http://onlinelibrary.wiley.com/doi/10.1002/wea.v59:3/issuetoc

Good article – I especially like the acknowledgement – I’ll remember to mention it next time I give my module on “how to review a paper”! ;o)

Good post.

Assuming P[H|x] = P[x|H] is often called the prosecutor’s fallacy. Prosecutors often show behaviour of the accused that is consistent with guilt, but that is not enough.
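Made-up numbers show how large the gap can be: even when the evidence is very unlikely for an innocent person, most people matching it may still be innocent.

```python
# Suppose the forensic evidence matches an innocent person with
# probability 1 in 10,000, the city has 1,000,000 people, and one is guilty.
p_match_given_innocent = 1e-4
population, n_guilty = 1_000_000, 1

expected_innocent_matches = p_match_given_innocent * (population - n_guilty)
p_guilty_given_match = n_guilty / (n_guilty + expected_innocent_matches)
print(round(p_guilty_given_match, 3))   # about 0.01, despite P(match|innocent) = 0.0001
```

P(match|innocent) is tiny, yet P(guilty|match) is only about 1% — the base rate (the prior) does the heavy lifting.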

“Yet there are parts of Ambaum’s paper with which I take exception. For one thing, I don’t believe climate science should be singled out for this property. I suggest that errors of statistical interpretation are just as common in most scientific fields as they are in climate science, and that in this regard climate science is typical of science in general.”

_____________________

Absolutely. My first reaction to the beginning of this post was “oh goody, another paper pointing out that significance test interpretations are biased and/or have various problems”. There were seemingly boatloads of these kinds of papers in ecology in the 1990s (or maybe my tolerance for them is just low, I don’t know – read one and you’ve read all you need to, if you’re paying attention, IMO). It’s likely a science-wide problem, because scientists aren’t statisticians for the most part – we take our half-baked statistical understandings and try to make the best of it. If Ambaum couched this as a climate-science-specific problem, then he’s painfully unaware of the social climate surrounding climate-science debates.

[Response: I don’t think Ambaum was accusing the climate science community of special fault — just that he was speaking *to* the climate science community. One could easily get the impression he thought climate science was more flawed than other disciplines, but I don’t see that he says so explicitly.]

Ambaum’s post at SkS concluded with this:

Perhaps it was not the intention, but that actually reads like it is aimed as much at “skeptics” — ignoring physics and “quibbling” over statistics — as it is at climate scientists.

The people who “quibble” most about statistics are those who have no other leg to stand on.

Good article. Are you familiar with Ziliak and McCloskey’s 2008 book “The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives”? If I remember correctly, those authors weren’t so enamoured with Fisher – ascribing his success to less than ideal ethics.

Good post with good comments, as usual. The issue of the proper interpretation of p values is in equal parts (a) long-standing, (b) amusing, and (c) intricate. Another recent paper that does a good job discussing the issues is: Wagenmakers, E.-J., “A practical solution to the pervasive problems of p values”, Psychonomic Bulletin & Review, 2007, 14, 779-804.

(Just note that there’s been a corrigendum in a subsequent issue, you need that as well if you really want to apply the detailed recommendations.)

There is a vast literature in cognitive science on how even experts (in ALL fields, not just climate science!) fail to interpret p values correctly. In most instances this matters very little because the formally incorrect interpretations are nonetheless heuristically useful and “correct enough” for practical purposes. After all, about 3 decades have passed since Meehl and Cohen and others first critiqued the commonly-fallacious interpretation of p values; and despite that, we have successfully amassed much robust scientific knowledge in the last 30 years.

There is also a hugely important distinction between true experiments and observational studies, which on the surface of it has nothing to do with p values but in actual fact makes all the difference to the heuristic value of their (mis)interpretation … (bit of bait here for Tamino; maybe you’d care to do a post on this at some point)?

The problem with p values was driven home to me in writing my dissertation in experimental particle physics. Mine was the usual bump-hunting thesis–look for a “statistically significant” bump on a histogram of reconstructed particle masses against a background of spurious noise. In the course of my research, I generated hundreds of plots, and 3-sigma bumps were pretty common. It was a rule of thumb that one needed 5 sigma of significance to publish for this very reason (a reason that seems to have utterly eluded Luboš Motl).
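That rule of thumb is easy to quantify (the plot counts here are my own illustrative choices): the chance of at least one 3-sigma fluctuation somewhere grows rapidly with the number of independent places one looks, while 5 sigma stays rare.

```python
from math import erf, sqrt

def tail(z):
    """One-sided Gaussian tail probability P(Z > z)."""
    return 0.5 * (1 - erf(z / sqrt(2)))

p3, p5 = tail(3.0), tail(5.0)   # per-plot chance of a 3-sigma / 5-sigma bump
for looks in (1, 100, 500):
    print(looks, round(1 - (1 - p3) ** looks, 3), round(1 - (1 - p5) ** looks, 6))
# with hundreds of histograms, a 3-sigma bump somewhere is close to a coin flip
```

This is the “look-elsewhere” effect: the nominal p-value of a single bump badly overstates its significance when one has searched many places for it.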

Frequentist statistical tests apply to a very limited and restrictive set of problems. Nonetheless, if one can identify the closest frequentist analog to one’s own experiment, they do yield useful information, even if the information is more qualitative than quantitative.

This is a little off topic, but, hey, it is at least stats related. I was trying to fit an empirically determined distribution to the best lognormal approximation, and it isn’t really a trivial problem. Least-squares won’t work, as it looks at absolute error rather than percent error, and so fails to give a good fit to the tails of the distribution.

For a variety of reasons, I decided to determine the fit by minimizing the Kullback-Leibler distance, and it worked very well for both the peak and the tails of the distribution. Given the form, it’s not too surprising, but the results were really quite good.
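A sketch of why a Kullback-Leibler fit behaves well here (my own toy reconstruction, not the actual analysis): minimizing KL(empirical ‖ model) is equivalent to maximizing the average log-density, and for a lognormal that optimum is simply the mean and standard deviation of the logged data — errors in the tails are weighted multiplicatively rather than additively.

```python
import random
from math import exp, log, sqrt

random.seed(2)
MU, SIGMA = 1.0, 0.5   # "true" lognormal parameters for the fake data
data = [exp(random.gauss(MU, SIGMA)) for _ in range(20_000)]

# Minimizing KL(empirical || lognormal(mu, s)) = maximizing mean log-density,
# whose optimum is the sample mean and sd of the log-data
logs = [log(x) for x in data]
mu_hat = sum(logs) / len(logs)
s_hat = sqrt(sum((v - mu_hat) ** 2 for v in logs) / len(logs))
print(round(mu_hat, 2), round(s_hat, 2))
```

The recovered parameters land close to the true (1.0, 0.5), peak and tails alike, whereas least squares on the density would be dominated by the peak.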