The theme of the paper is that the common application of significance testing is misguided. Ambaum's chief objection is that far too often researchers will apply a significance test, obtain a very low (“highly significant”) p-value, then declare that their hypothesis is confirmed at the 1-p confidence level. In fact, Ambaum says, this isn't so. The actual meaning of a p-value is that the observed result, or one even more extreme, has only a chance p of occurring when a different hypothesis — the null hypothesis — is true.
Ambaum also states that a large fraction of papers in the climate-science literature make this mistake — sometimes in a minor way which doesn’t alter the validity of the results, and sometimes in a major way which calls results into question. As Ambaum says:
A large fraction of papers in the climate literature includes erroneous uses of significance tests… The significance statistic is not a quantitative measure of how confident we can be of the ‘reality’ of a given result. It is concluded that a significance test very rarely provides useful quantitative information.
In large part, I agree with Ambaum. I’ve certainly struggled to emphasize to colleagues that a highly significant statistical result does not prove that one’s hypothesis is true; it merely negates the null hypothesis. I’ve also emphasized that there are many ways for the null hypothesis to be false other than the researcher’s favored hypothesis being true. I’ve even tried to stress that the null hypothesis on which a p-value is based usually includes “more than meets the eye.” For instance, when using linear regression to test for a trend, the usual null hypothesis isn’t just that the data are trendless noise, but that they’re trendless white noise. Ambaum is certainly correct that far too much research is done without sufficient understanding of what the results of statistical tests actually imply.
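To see how much the “white” part of that null hypothesis matters, here’s a minimal sketch in Python (assuming numpy and scipy; the AR(1) coefficient, series length, and seed are illustrative choices of my own). It generates trendless red noise, applies an ordinary linear-regression trend test whose null is white noise, and counts how often the result comes out “significant”:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, phi, trials = 100, 0.8, 2000   # series length, AR(1) coefficient, simulations
t = np.arange(n)

false_positives = 0
for _ in range(trials):
    # Trendless AR(1) ("red") noise: x[i] = phi * x[i-1] + e[i]
    e = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = e[0]
    for i in range(1, n):
        x[i] = phi * x[i - 1] + e[i]
    # Ordinary least-squares trend test; its null hypothesis is white noise
    if stats.linregress(t, x).pvalue < 0.05:
        false_positives += 1

print(f"false-positive rate: {false_positives / trials:.1%}")
```

With autocorrelation this strong, the false-positive rate lands far above the nominal 5%, which is precisely the danger of a null hypothesis that contains more than meets the eye.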
Yet there are parts of Ambaum’s paper with which I take exception. For one thing, I don’t believe climate science should be singled out on this score. I suggest that errors of statistical interpretation are just as common in most scientific fields as they are in climate science, and that in this regard climate science is typical of science in general.
For another thing, I don’t believe that “a significance test very rarely provides useful quantitative information.” I’d say the opposite: a significance test almost always provides useful quantitative information. In fact, that’s the whole point of significance testing: to provide useful quantitative information.
Ambaum provides some very basic Bayesian results, chiefly to show that the probability our hypothesis “A” is true given a result “X” (often written P(A|X)) depends not only on the improbability of “X” under a null hypothesis, but also on the estimated “prior probability” of our hypothesis. This is quite correct. But that doesn’t mean the significance test result (the probability P(X|B) of “X” given null hypothesis “B”) isn’t useful. On the contrary, it’s incredibly useful. And it’s certainly quantitative.
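A toy calculation makes the distinction concrete. The sketch below (my own illustrative numbers, not Ambaum’s, and with the p-value treated loosely as P(X|B) for simplicity) applies Bayes’ theorem to show how the same “significant” result yields wildly different values of P(A|X) depending on the prior:

```python
# P(A|X) = P(X|A) P(A) / [ P(X|A) P(A) + P(X|B) P(B) ], with P(B) = 1 - P(A)
p_X_given_B = 0.01   # the "significant" p-value: P(X | null hypothesis B)
p_X_given_A = 0.50   # assumed probability of the result if hypothesis A is true

for prior_A in (0.5, 0.1, 0.001):   # prior probability that A is true
    posterior = (p_X_given_A * prior_A) / (
        p_X_given_A * prior_A + p_X_given_B * (1 - prior_A)
    )
    print(f"P(A) = {prior_A:<6} ->  P(A|X) = {posterior:.3f}")
```

With a skeptical prior of 0.001, a p-value of 0.01 leaves P(A|X) below 0.05, nowhere near the “99% confidence” a naive reading would suggest.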
It’s worthwhile to warn working scientists of the danger of equating P(X|B) with P(A|X). Linear regression for trend testing is a good example. Just because P(X|B) is incredibly small doesn’t mean that “A” is true, i.e., that there’s a linear trend. Was the regression based on the null hypothesis of white noise? If so, then the low p-value might be because the data are noise which isn’t white. Or it might be because there’s a trend, but not a linear one. Perhaps the trend is approximately linear but not exactly so. All we can safely conclude from the significance test is that the data aren’t just white noise.
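Here’s a sketch of the second possibility (a real trend that isn’t linear), with synthetic data of my own devising: the true signal is exponential, yet the linear-regression p-value is tiny. The test rightly rejects white noise, but it says nothing about the shape of the trend:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
t = np.arange(100)
y = np.exp(t / 40.0) + rng.standard_normal(100)   # exponential trend plus white noise

fit = stats.linregress(t, y)
print(f"p-value: {fit.pvalue:.2e}")   # tiny: the data aren't just white noise...

# ...but the residuals from the straight-line fit are plainly structured,
# so the low p-value does not certify that the trend is linear
residuals = y - (fit.intercept + fit.slope * t)
lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(f"lag-1 residual autocorrelation: {lag1:.2f}")
```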
It’s also possible that the null hypothesis is provably false but the significance test fails to show it. If the trend is quadratic, and the minimum is near the center of the observed time span, it often happens that the p-value from linear regression will not be significant. But that doesn’t mean the data are just noise (white or otherwise). It really is necessary to consider a spectrum of possible hypotheses, and to apply our best information (including prior knowledge) to get a reasonable estimate of the probability that our hypothesis is true, or that the null hypothesis is false.
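A quick illustration, again with synthetic data: a quadratic signal with its minimum at mid-span is certainly not noise, yet linear regression is nearly blind to it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
t = np.arange(100)
signal = 0.002 * (t - 49.5) ** 2          # quadratic trend, minimum at mid-span
y = signal + rng.standard_normal(100)     # plus white noise

fit = stats.linregress(t, y)
print(f"slope = {fit.slope:.4f}, p-value = {fit.pvalue:.2f}")
# The fitted slope is essentially zero, so the p-value is typically far from
# significant, even though the data are anything but trendless noise.
```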
It’s also well to recall that in some cases we already know the null hypothesis is false even when the significance test doesn’t show that. Consider for example testing whether or not a coin is “fair,” i.e., whether or not the probability of “heads” and the probability of “tails” are both equal to 1/2. I submit that we can be quite confident even before the experiment that the null hypothesis is false. The coin is not fair, at least not perfectly so. The probability of “heads” might be only slightly different from 1/2, perhaps 0.5000000001 or 0.4999999999, but the chance that it’s exactly equal to 1/2 is vanishingly small. One would be hard pressed to identify a physical system for which the chance of a given binomial result is exactly 1/2.
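The numbers bear this out. In this sketch (using scipy’s binomtest; the size of the bias is an arbitrary illustration), even a million flips of a coin biased by one part in ten billion give essentially no evidence against the null of exact fairness:

```python
from scipy.stats import binomtest

n_flips = 1_000_000
true_p = 0.5000000001                      # an imperceptibly unfair coin
expected_heads = round(true_p * n_flips)   # about 500000 heads on average

result = binomtest(expected_heads, n_flips, p=0.5)
print(f"p-value: {result.pvalue:.3f}")     # ~1.0: no evidence against "fair"

# Detecting a bias d at this sample size is hopeless: roughly 1/d**2 = 1e20
# flips would be needed, so no feasible test can certify exact fairness.
```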
But that in no way degrades the usefulness of the p-value from a standard significance test. The idea behind tests based on a null hypothesis is that such a hypothesis makes it possible to calculate the probability of the given outcome. When the null hypothesis is chosen to be meaningful, the p-value also provides useful, quantitative information about the plausibility of that null hypothesis. And that ain’t just whistlin’ Dixie.
Standard hypothesis testing was made popular by the great statistician R. A. Fisher. Ambaum suggests that this is in part due to Fisher’s “marketing” of the idea. I disagree. Significance testing has been immensely popular for a century or so not just because of the “sales campaign” of a giant in the field of statistics, but because it is useful. In fact, Fisher’s stature aside, it could never have achieved such prominence in science without being so extremely useful.
Do we need more than just frequentist significance testing, and a better understanding of statistics generally? Of course. Do we need to be aware of the null hypothesis and the real meaning of a significant p-value? Yes. Are there many papers (in all fields of science) which mistakenly equate the unlikeliness of a result under the null hypothesis with the likelihood of a given alternate theory? Yes.
Should significance testing be regarded as rarely providing any useful quantitative information? Certainly not.