Suppose I showed you a close-up, on a live video feed, of where my shot at a target landed. The video is zoomed in on the target’s bulls-eye, and lo and behold, right there is my shot. No doubt about it: bulls-eye. I brag about what a good shot I am, and you’re impressed.
Then my cat gets on the table and accidentally zooms the camera view out to show the entire target (a cat? maybe not an accident). Now you can see that I took about 300 shots, they landed all over the place (some even missed the target completely), and one of them hit the bulls-eye. My guess: you would no longer be so impressed. You might even think my original camera view was misleading.
That, in essence, is how people still buy and sell the misleading claim that there was a “pause/hiatus/slowdown” in global surface temperature during the early 2000s. Some, I’d even venture to say most, do it without knowing what they’re doing. Some who have done it, and have been told why it’s a problem, persist in doing it still. The former I’d call an honest mistake; the latter I’d call not just misleading, but deliberate deception.
If you run a test to look for statistical significance, there’s always a chance you could conclude “yes” not because it is so, but just by accident due to random fluctuations. That’s an error of course; in statistics it’s called a “type I error.” The chance of that happening by accident alone is measured by the p-value of the test result. The usual standard is that a p-value of 0.05 or less is accepted as “significant,” a limit which is not carved in stone but is based on practical experience. Values of 0.05 or less are in the bulls-eye. Make the limit too high and we’ll have too many type I errors. Make the limit too low, making it too difficult to show significance, and we run too much risk of not concluding significance even though it’s true (a “type II” error).
But what if you run multiple tests? In that case, the chance of a type I error is greater simply because we took more shots at the target. How would we compensate for having so many chances?
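Before answering, a toy simulation (a minimal Python sketch, pure random noise, no real data) shows how big the effect is when you take 300 shots:

```python
import numpy as np
from scipy.stats import linregress

# Toy simulation: 300 "shots" at pure random noise, each tested for a trend.
# This just shows how often chance alone yields p <= 0.05.
rng = np.random.default_rng(0)
false_alarms = 0
for _ in range(300):
    y = rng.normal(size=12)                  # 12 values of pure noise
    p = linregress(np.arange(12), y).pvalue  # p-value for a trend in the noise
    if p <= 0.05:
        false_alarms += 1

print(false_alarms)   # typically around 300 * 0.05 = 15 accidental "hits"
```

With roughly 15 accidental “hits” expected out of 300, it’s no surprise at all when one of them lands in the bulls-eye.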
The simplest way is to apply the “Bonferroni correction.” If we run n tests and we want a significance level of 0.05, we multiply each p-value by n, or alternatively, we set the threshold to 0.05/n. There’s a more refined correction based on a more exact calculation (the Šidák correction), but in practice the Bonferroni correction is plenty good enough, at least for testing the lowest p-value among all the tests.
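As a quick illustration, here’s a minimal sketch of the Bonferroni correction in Python (the p-values are made up):

```python
# Minimal sketch of the Bonferroni correction (illustrative p-values only).
p_values = [0.003, 0.020, 0.041, 0.180, 0.620]
n = len(p_values)
alpha = 0.05

# Either scale each p-value up by n (capping at 1) ...
p_bonferroni = [min(p * n, 1.0) for p in p_values]
# ... or, equivalently, compare each raw p-value to alpha / n.
significant = [p <= alpha / n for p in p_values]

print(p_bonferroni)   # [0.015, 0.1, 0.205, 0.9, 1.0]
print(significant)    # [True, False, False, False, False]
```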
The proliferation of genetic testing has brought about analyses which involve a great many tests. We might be testing 10,000 different genes for significant results, which means 10,000 tests. Now the Bonferroni correction starts to go astray because it becomes so hard to reach significance. If we want results at an overall significance level of 0.05, we’d have to adjust the threshold to 0.05/10000 = 0.000005. That’s an extremely demanding standard, and we run too much chance of missing a genuinely significant result.
New methods handle it much better. One I’m rather fond of is the Benjamini-Hochberg method. There are various ways to implement it; the one I like is this: take all the p-values and sort them into ascending order, so we have n different p-values pk where p1 <= p2 <= … <= pn. Now define new values, sometimes called q-values, as

qk = n pk / k.
Lastly, we find the largest k for which qk <= 0.05 (or whatever our chosen threshold is) and posit significance for all the p-values up to and including the kth one.
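Here’s a minimal sketch of that recipe in Python (the p-values are invented, purely to show the mechanics):

```python
# Benjamini-Hochberg: sort the p-values, form q-values, find the largest k
# with q_k <= alpha, and call the k smallest p-values significant.
def benjamini_hochberg(p_values, alpha=0.05):
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices, smallest p first
    largest_k = 0
    for k, i in enumerate(order, start=1):
        q = p_values[i] * n / k            # q_k = n * p_k / k
        if q <= alpha:
            largest_k = k
    significant = [False] * n
    for k in range(largest_k):             # everything up to and including the k-th
        significant[order[k]] = True
    return significant

print(benjamini_hochberg([0.62, 0.003, 0.18, 0.02, 0.041]))
# -> [False, True, False, True, False]: the two smallest p-values come out significant
```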
There’s a refinement called the Benjamini–Hochberg–Yekutieli method, useful if the tests aren’t independent. We use a correction factor c which depends on how many tests are performed:

c(n) = 1 + 1/2 + 1/3 + … + 1/n.
If the tests are independent or positively correlated we use c = 1; if the tests are negatively correlated we use that formula to estimate it. The q-values are now

qk = c(n) n pk / k.
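A minimal sketch of that variant, assuming the p-values have already been sorted into ascending order as above:

```python
# Benjamini-Hochberg-Yekutieli: same recipe, but q_k = c(n) * n * p_k / k,
# where c(n) = 1 + 1/2 + ... + 1/n (used when the tests may be negatively correlated).
def by_q_values(sorted_p):
    n = len(sorted_p)
    c = sum(1.0 / i for i in range(1, n + 1))
    return [c * n * p / k for k, p in enumerate(sorted_p, start=1)]

print(by_q_values([0.003, 0.02, 0.041, 0.18, 0.62]))
# each q-value is larger than plain Benjamini-Hochberg by the factor c(5) ≈ 2.28
```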
Let’s apply this to testing for a pause/hiatus/slowdown in global surface temperature since 1975. We’ll use annual averages of NASA data, and test all 12-year-long intervals.
First I’ll fit a straight line (by least squares) to the whole record, then search the residuals of each 12-year interval for a significant trend, which we would regard as a change in the global warming rate.
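To make the procedure concrete, here’s a rough sketch of that calculation in Python. The file name “nasa_annual.csv” and the use of scipy’s linregress are my own choices for illustration, not necessarily how the original analysis was done, and this naive version ignores autocorrelation in the residuals, which a careful analysis has to account for (doing so makes the honest p-values larger, not smaller).

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical input: one annual global mean anomaly per year, 1975 onward.
years, temps = np.loadtxt("nasa_annual.csv", delimiter=",", unpack=True)

# Fit one straight line (least squares) to the whole record ...
fit = linregress(years, temps)
residuals = temps - (fit.intercept + fit.slope * years)

# ... then test every 12-year interval of the residuals for a trend of its own.
window = 12
p_values = []
for start in range(len(years) - window + 1):
    seg = slice(start, start + window)
    p_values.append(linregress(years[seg], residuals[seg]).pvalue)

print(sorted(p_values))   # the sorted p-values, ready for the Benjamini-Hochberg step
```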
I’ve added a red line indicating the fit with the lowest p-value, from 2002 through 2013. That particular p-value is only 0.00559, which is way lower than 0.05, leading some to conclude that it must represent a change in the global warming rate — a pause/hiatus/slowdown. Here are all the p-values sorted into ascending order:
The thin dashed red line is at a height of 0.05, the limit for declaring statistical significance.
Now let’s compute q-values according to the Benjamini-Hochberg method.
Again the thin dashed red line is at 0.05. It’s obvious that none of the q-values is anywhere near that low; the lowest of all is 0.179.
We don’t need the Benjamini–Hochberg–Yekutieli correction because the tests are positively correlated, but if we did use it the q-values would get even bigger. There’s no getting around it: not statistically significant.
Really, people, there is no evidence to support the claim of a pause/hiatus/slowdown in global surface temperature.
But people are still claiming it. Many are scientists who simply accept it, are unaware that it’s not valid, and want to understand what caused it. Many are climate deniers who want to dispute man-made global warming by any means necessary. Some are climate deniers who have been told, more than once, but refuse to accept it.
A question for Judith Curry: do you still maintain that there was a pause/hiatus/slowdown in global surface temperature in the early 2000s?
This blog is made possible by readers like you; join others by donating at My Wee Dragon.