A recent article in ScienceNews calls into question scientific results established by statistics. It was excerpted by WUWT at great length (I wonder whether Anthony Watts knows the difference between “fair use” and violation of copyright), apparently in an attempt to discredit global warming science because, after all, it uses some statistics. Of course Watts fails to consider the outrageous statistical follies coming from his side of the fence (many from himself and his contributors). Lubos gets in on the act, purportedly defending statistics but insisting on a ridiculously high confidence level, and of course getting in some potshots at global warming science himself.
The ScienceNews article argues for caution in accepting published results based on statistical analysis, specifically mentioning results in medical research. It’s based on several papers, most notably Ioannidis (2005), whose results are disputed by some but which definitely contains insights worth being aware of. The gist of it is that in many circumstances statistically-based results should be taken with not just a grain, but a whole block of salt. WUWT of course titles its article with a quote from the ScienceNews article: “science’s dirtiest secret: The ‘scientific method’ of testing hypotheses by statistical analysis stands on a flimsy foundation.”
First let’s be clear about one thing: the foundation is not flimsy, it’s solid as a rock. Statistics works, it does what it’s supposed to do. But it is susceptible to misinterpretation, to false results purely due to randomness, to bias, and of course to error. That’s what the ScienceNews article is really about, although it takes liberties (in my opinion) in order to sensationalize the issue. But hey, that’s what magazines (not peer-reviewed journals) do.
One of the dangers inherent in statistical results is over-reliance on the “p-value.” The p-value is the probability of getting a result at least as extreme as the one observed, just by random accident, when the “null hypothesis” (usually, that there’s nothing interesting happening at all) is true. The de facto standard p-value for “statistical significance” is 0.05, or 5%, and when p-values are smaller than this the result is often said to be established with “95% confidence.”
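To make the idea concrete, here’s a minimal sketch in Python (the data and the null-hypothesis mean of zero are invented purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical measurements: 25 values whose true mean we want to test
# against the null hypothesis that the population mean is zero.
sample = rng.normal(loc=0.5, scale=1.0, size=25)

# One-sample t-test: how surprising would this sample mean be if the
# population mean really were zero?
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# By the de facto standard, p < 0.05 counts as "statistically significant."
print("significant at 5%" if p_value < 0.05 else "not significant at 5%")
```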
Problem is, a p-value of 5% doesn’t mean that the alternate hypothesis (the new drug reduces the chances of a heart attack) is 95% certain. For one thing, the p-value only addresses the null hypothesis, so a significant result is evidence against the null, but not necessarily for the alternate. Any deviation from the null hypothesis can produce a significant result, even a deviation that has nothing to do with the alternate; a two-sided test would reject the null just as readily if the drug increased the chance of a heart attack.
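Here’s a quick illustration of that last point, again with invented numbers: a “significant” result whose direction actually contradicts the hoped-for alternate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Invented trial data in which the "treatment" group fares *worse*
# (say, higher values mean more cardiac events) than the control group.
treatment = rng.normal(loc=1.0, scale=1.0, size=50)
control = rng.normal(loc=0.0, scale=1.0, size=50)

# The two-sided test rejects the null of "no difference" decisively,
# but the sign of the effect argues *against* the alternate hypothesis
# that the drug helps.
t_stat, p = stats.ttest_ind(treatment, control)
print(f"t = {t_stat:.2f}, p = {p:.2g}")
```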
For another thing, a p-value of 5% means that even when the null hypothesis is true, we can still expect the result to happen 5% of the time. That’s a small but still substantial probability. The 5% level is a good choice, and has served science well for a century, but no choice is perfect. Anything less and it’s just too hard to get results; too many important truths will be missed. Anything more, and the fraction of falsehoods will be so great that truth will get lost in the noise. There’s a reason 5% is the de facto standard: in most cases it works well. Even so, we need to be aware that “false alarms” not only can, but will happen (that too is the nature of statistics). That’s why replication (when possible) is so important.
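That built-in false-alarm rate is easy to demonstrate by simulation. Here’s a sketch (the sample size and number of trials are arbitrary choices of mine):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 10,000 simulated "experiments" in which the null hypothesis is true
# by construction: every sample is pure noise with mean zero.
n_experiments = 10_000
false_alarms = 0
for _ in range(n_experiments):
    sample = rng.normal(loc=0.0, scale=1.0, size=25)
    _, p = stats.ttest_1samp(sample, popmean=0.0)
    if p < 0.05:
        false_alarms += 1

# Close to 5% of the experiments come out "significant" even though
# there's nothing to find: false alarms are built into the method.
print(f"false-alarm rate: {false_alarms / n_experiments:.3f}")
```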
Another important consideration is that often (particularly in medical research) we have prior information that the chance of the alternate hypothesis being true is very small. This gives us a useful prior probability with which we can apply a Bayesian analysis. There’s great value in Bayesian analysis, and it sometimes reveals that although a given observation is unlikely (even 5% unlikely) under the null hypothesis, it’s even more unlikely under the alternate hypothesis. In that case, in spite of the apparently significant p-value, the evidence is against the alternate. Of course, Bayesian analysis has its own hazards, including the fact that if the prior probability isn’t known precisely, then it becomes a choice which is susceptible to (and often accused of) bias.
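Here’s a toy version of that calculation, with numbers I’ve made up purely for illustration (a 1% prior, 80% power, and the usual 5% false-alarm rate; none of these figures come from the article):

```python
prior = 0.01   # P(drug works): suppose only 1% of candidate drugs truly work
power = 0.80   # P(p < 0.05 | drug works): the trial's statistical power
alpha = 0.05   # P(p < 0.05 | drug is useless): the false-alarm rate

# Bayes' theorem: probability the drug works, given a "significant" result.
posterior = (power * prior) / (power * prior + alpha * (1 - prior))
print(f"P(drug works | p < 0.05) = {posterior:.3f}")  # about 0.14
```

With these assumptions, even a “significant” trial leaves the drug more likely useless than not, because the low prior probability dominates.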
Yet another danger is inherent in the way scientific results are selected by researchers for publication. A “significant” result is far more likely to be touted than one which is not, especially in fields like clinical trials which have immediate and important financial consequences. Negative results are in fact important, but they don’t get the press, publication, or visibility of positive results. This can lead to the phenomenon where 20 different research groups are investigating the same question, and only one of them gets a result which is significant at the 5% p-value level, but that’s the one which gets published. With 20 experiments, we expect one of them to give a p-value of 5% or less — but the others aren’t visible in the scientific literature, so the published evidence is one-sided.
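The arithmetic behind that expectation is simple, assuming the 20 experiments are independent and the null is true in every one:

```python
# 20 independent experiments, each testing a true null at the 5% level.
n, alpha = 20, 0.05

expected_false_alarms = n * alpha        # 1.0: we expect one "hit" by chance
p_at_least_one = 1 - (1 - alpha) ** n    # about 0.64

print(f"expected false alarms: {expected_false_alarms}")
print(f"P(at least one 'significant' result): {p_at_least_one:.2f}")
```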
Another aspect of the same issue is that often an investigation will study a vast number of possible factors. One might, for instance, study the influence of 30,000 genes on the likelihood of developing heart disease. Using a 5% p-value, we expect 1,500 of the genes to show a “significant” result just by accident. The vast number of “false positives” can be handled by proper statistical corrections, but unfortunately that’s not always done. I myself first met this issue when studying variable stars to look for changes in their periods of variation. In a sample of some 400 such stars, about 20 showed “significant” period changes, which is just what you’d expect by chance if none of them had really changed period.
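One standard remedy is to tighten the significance threshold to account for the number of tests. Here’s a minimal sketch using the Bonferroni correction (the simulated p-values are pure noise, standing in for 30,000 genes with no real effect):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated p-values for 30,000 genes, none of which has any real effect;
# under the null hypothesis, p-values are uniformly distributed on [0, 1].
n_genes = 30_000
p_values = rng.uniform(0.0, 1.0, size=n_genes)

naive_hits = np.sum(p_values < 0.05)                  # about 1,500 false positives
bonferroni_hits = np.sum(p_values < 0.05 / n_genes)   # corrected threshold

print(f"naive 'significant' genes:   {naive_hits}")
print(f"after Bonferroni correction: {bonferroni_hits}")  # almost surely 0
```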
Another factor which we all try to avoid but which is still very real is bias. This is especially problematic when there’s anything subjective about the data. In clinical trials, for instance, it’s sometimes necessary to exclude certain subjects from the analysis for legitimate reasons. But there’s at least a temptation to preferentially exclude those which contradict the preferred hypothesis, while preferentially retaining those which support it. And if any of the measurements is anything less than perfectly precise, there’s an all-too-human tendency to rate a “treatment” subject’s health one notch better while rating a “placebo” subject’s response one notch worse. This emphasizes the importance of the classic “double blind” experiment, which is highly desirable but, frankly, just not always possible.
It also happens that, even with a single data set, researchers who are motivated to support a preferred hypothesis will keep trying new statistical tests until they find one that gives what they want. It’s possible to compensate statistically for such a “suite” of tests (the same multiple-testing corrections mentioned above apply), but that is often not done; ideally, the suite of tests should be decided on before the data are even collected, but that’s rarer still.
As if that weren’t bad enough, accidents happen, and I’m not referring to statistical fluctuations but to plain old errors. These can be data errors or faulty analysis. After all, most scientists are not statisticians, and even standard statistical tests have subtleties which are often unappreciated. To err is human.
All told, there are many ways for the statistical analysis of experiments to give incorrect results. This may be especially true in medical research, for which the financial incentive is high, the prior probability is often very low, and the sample size may be severely limited by circumstances beyond anyone’s control. But that hardly means that the foundation of statistical analysis is flimsy; that’s just sensationalism. Nor do we need to adhere to an impractically high standard of statistical significance — as desirable as it is, effects are often small and gathering more data (as in clinical trials) can be very expensive and time-consuming, while delays in availability of new treatments can be devastating for a patient with serious disease and few treatment options.
It’s important for researchers to be aware of statistical pitfalls, valuable to consider all available results (including negative ones), and wise for the scientific community and the public to take results with at least a grain of salt. But distrusting statistics on principle is not just incorrect, it’s a foolish choice.