# Multiple Testing

Suppose I showed you a close-up, on a live video feed, of where my shot at a target landed. The video is zoomed in on the target’s bulls-eye, and lo and behold, right there is my shot. No doubt about it: bulls-eye. I brag about what a good shot I am, and you’re impressed.

Then my cat gets on the table and accidentally zooms the camera view out to show the entire target (a cat? maybe not an accident). Now you can see that I took about 300 shots, they landed all over the place (some even missed the target completely), and one of them hit the bulls-eye. My guess: you would no longer be so impressed. You might even think my original camera view was misleading.

That, in essence, is how people still buy and sell the misleading claim that there was a “pause/hiatus/slowdown” in global surface temperature during the early 2000s. Some, I’d even venture to say most, do it without knowing what they’re doing. Some who have done it and have been told why it’s a problem persist in doing it still. The former I’d call an honest mistake; the latter I’d call not just misleading, but deliberate deception.

If you run a test to look for statistical significance, there’s always a chance you could conclude “yes” not because it is so, but just by accident due to random fluctuations. That’s an error, of course; in statistics it’s called a “type I error.” The p-value of the test result is the chance of getting a result at least that extreme from random fluctuations alone. The usual standard is that a p-value of 0.05 or less is accepted as “significant,” a limit which is not carved in stone but is based on practical experience. Values of 0.05 or less are in the bulls-eye. Make the limit too high, and we’ll have too many type I errors. Make the limit too low, making it too difficult to show significance, and we run too much risk of not concluding significance even though it’s true (a “type II” error).

But what if you run multiple tests? In that case, the chance of a type I error is greater simply because we took more shots at the target. How would we compensate for having so many chances?
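To put a rough number on "more shots at the target," here is a minimal sketch of the chance of at least one false alarm among n tests. It assumes the tests are independent and that each uses a 0.05 threshold; the function name `familywise_error_rate` is my own label, not something from a library.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one type I error among n_tests independent tests.

    Assumes independence: the chance that every test avoids a false
    alarm is (1 - alpha) ** n_tests, and we want the complement.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


# One shot, a handful of shots, and the 300 shots from the target story.
for n in (1, 14, 300):
    print(n, familywise_error_rate(n))
```

With correlated tests the number changes, but the direction does not: the more tests, the more likely at least one lands in the bulls-eye by accident.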

The simplest way is to apply the “Bonferroni correction.” If we run n tests and we want a significance level of 0.05, we multiply the p-value by n, or alternatively, we set the threshold to 0.05/n. There’s a more refined correction based on a more exact calculation (the Šidák correction), but in practice the Bonferroni correction is plenty good enough, at least for testing the lowest p-value among all the tests.
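A small sketch comparing the two thresholds may help show why Bonferroni is "plenty good enough." The Šidák threshold here is the standard one for independent tests, 1 - (1 - α)^(1/n); the function names are hypothetical and chosen only for this illustration.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold under the Bonferroni correction."""
    return alpha / n_tests


def sidak_threshold(alpha, n_tests):
    """Per-test threshold under the Sidak correction (independent tests).

    Equal to 1 - (1 - alpha) ** (1 / n_tests), written with log1p/expm1
    so that tiny thresholds are not lost to rounding.
    """
    return -math.expm1(math.log1p(-alpha) / n_tests)


for n in (10, 300, 10000):
    b = bonferroni_threshold(0.05, n)
    s = sidak_threshold(0.05, n)
    print(n, b, s, s / b)
```

The Šidák threshold is always slightly more lenient, but the two differ by only a few percent at α = 0.05, which is why the simpler rule is usually used.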

The proliferation of genetic testing has brought about analyses which make a lot of tests. We might be testing for significant results for 10,000 different genes, which involves 10,000 tests. Now the Bonferroni correction starts to go astray because it becomes so hard to reach significance. If we want results at an overall p-value of 0.05, we’d have to adjust the threshold to 0.05/10000 = 0.000005. That’s not easy — and we have too much chance of missing a significant result.

New methods handle it much better. One I’m rather fond of is the Benjamini-Hochberg method. There are various ways to implement it; the one I like is this: take all the p-values and sort them into ascending order, so we have $n$ different p-values $p_k$ where $p_1 \le p_2 \le \dots \le p_n$. Now define new values, sometimes called q-values, as

$q_k = \min_{j \ge k} (n p_j / j)$.

Lastly, we find the largest $k$ for which $q_k \le 0.05$ (or whatever our chosen threshold is) and posit significance for all the p-values up to and including the $k$th one.
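The recipe above can be sketched in a few lines of plain Python. This is one possible implementation, not necessarily how the figures in this post were produced; the function names and the example p-values are made up for illustration.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg q-values, returned in the same order as the input."""
    n = len(pvalues)
    # order[k - 1] is the index of the k-th smallest p-value.
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = float("inf")
    # Walk from the largest p-value down, so running_min is always
    # the minimum of n * p_j / j over all j >= k.
    for k in range(n, 0, -1):
        i = order[k - 1]
        running_min = min(running_min, n * pvalues[i] / k)
        qvalues[i] = running_min
    return qvalues


def bh_significant(pvalues, threshold=0.05):
    """True for every test whose q-value is at or below the threshold."""
    return [q <= threshold for q in bh_qvalues(pvalues)]


example = [0.01, 0.04, 0.03, 0.005]  # made-up p-values
print(bh_qvalues(example))
print(bh_significant(example))
```

Because the q-values never decrease as the sorted p-values increase, flagging every test with a q-value at or below the threshold is the same as finding the largest such k and accepting everything up to it.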

There’s a refinement called the Benjamini–Hochberg–Yekutieli method, useful if the tests aren’t independent. We use a correction factor $c$ which depends on how many tests are performed:

$c(n) = \sum_{j=1}^n 1/j$.

If the tests are independent or positively correlated we use $c = 1$; if the tests are negatively correlated we use that formula to estimate it. The q-values are now

$q_k = c(n) \min_{j \ge k} (n p_j / j)$.
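As a hedged sketch, the corrected q-values differ from the plain Benjamini-Hochberg ones only by the factor c(n). The code below is self-contained and the names `harmonic_correction` and `by_qvalues` are my own; the example p-values are invented.

```python
def harmonic_correction(n):
    """c(n) = 1 + 1/2 + ... + 1/n, the correction factor for n tests."""
    return sum(1.0 / j for j in range(1, n + 1))


def by_qvalues(pvalues):
    """q-values with the c(n) correction, in the same order as the input."""
    n = len(pvalues)
    c = harmonic_correction(n)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = float("inf")
    # Same backward sweep as plain Benjamini-Hochberg; the minimum of
    # n * p_j / j over j >= k is then scaled by c(n).
    for k in range(n, 0, -1):
        i = order[k - 1]
        running_min = min(running_min, n * pvalues[i] / k)
        qvalues[i] = c * running_min
    return qvalues


example = [0.01, 0.04, 0.03, 0.005]  # made-up p-values
print(harmonic_correction(len(example)))
print(by_qvalues(example))
```

Since c(n) is at least 1 and grows slowly with n (roughly like the logarithm of n), the correction can only make q-values larger, never smaller.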

Let’s apply this to testing for a pause/hiatus/slowdown in global surface temperature since 1975. We’ll use annual averages of NASA data, and test all 12-year-long intervals.

First I’ll fit a straight line (by least squares), then we’ll search the residuals for some significant trend, which we would regard as a change in the global warming rate.
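For readers who want to try the procedure themselves, here is a minimal sketch of the two steps: detrend the whole series, then test every 12-year window of the residuals for a trend. It uses synthetic data (a steady trend plus white noise), not the NASA record, and it assumes uncorrelated noise when computing p-values. The trend of 0.018 per year, the noise level, the random seed, and the end year of 2017 are all placeholders I chose for illustration.

```python
import math
import random


def t_two_sided_p(t, df, steps=2000):
    """Two-sided Student-t tail probability via Simpson's rule on the density."""
    t = abs(t)
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(u):
        return norm * (1 + u * u / df) ** (-(df + 1) / 2)

    h = t / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    # total * h / 3 approximates P(0 < T < t); the two tails are the rest.
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def ols(x, y):
    """Least-squares slope, intercept, and p-value for 'slope = 0' (white noise assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return slope, intercept, t_two_sided_p(slope / std_err, n - 2)


def window_tests(years, temps, width=12):
    """Detrend the full series, then test every width-year window of the residuals."""
    slope, intercept, _ = ols(years, temps)
    resid = [t - (intercept + slope * y) for y, t in zip(years, temps)]
    out = []
    for s in range(len(years) - width + 1):
        w_slope, _, p = ols(years[s:s + width], resid[s:s + width])
        out.append((years[s], years[s + width - 1], w_slope, p))
    return out


# Synthetic stand-in data: steady trend plus white noise, NOT the NASA record.
random.seed(0)
years = list(range(1975, 2018))
temps = [0.018 * (y - 1975) + random.gauss(0.0, 0.1) for y in years]

tests = window_tests(years, temps)
best = min(tests, key=lambda r: r[3])
print(len(tests), "windows; lowest p-value", round(best[3], 4), "for", best[0], "to", best[1])
```

The point of the exercise is that the lowest p-value among all the windows is the one that catches the eye, and it is exactly that one which needs a multiple-testing correction.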

I’ve added a red line indicating the fit with the lowest p-value, from 2002 through 2013. That particular p-value is only 0.00559, which is way lower than 0.05, leading some to conclude that it must represent a change in the global warming rate — a pause/hiatus/slowdown. Here are all the p-values sorted into ascending order:

The thin dashed red line is at a height of 0.05, the limit for declaring statistical significance.

Now let’s compute q-values according to the Benjamini-Hochberg method.

Again the thin dashed red line is at 0.05. It’s obvious that none of the q-values is anywhere near that low; the lowest of all is 0.179.
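As a rough consistency check on that 0.179, the short sketch below counts the 12-year windows and multiplies by the lowest p-value. The end year of 2017 is an assumption on my part (it is not stated above); with it, the arithmetic reproduces the quoted figure.

```python
# Assumption: annual data from 1975 through 2017 (the end year is a guess).
first_year, last_year, width = 1975, 2017, 12

n_years = last_year - first_year + 1
n_windows = n_years - width + 1      # number of 12-year intervals tested
lowest_p = 0.00559                   # lowest p-value quoted in the text

print(n_windows, round(n_windows * lowest_p, 3))
```

If that window count is right, the minimum in the q-value formula is reached at the very first term, so the lowest q-value is simply the lowest p-value multiplied by the number of tests, the same number the Bonferroni correction would give.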

We don’t need the Benjamini–Hochberg–Yekutieli correction because the tests are positively correlated, but if we did use it the q-values would get even bigger. There’s no getting around it: not statistically significant.

Really, people, there is no evidence to support the claim of a pause/hiatus/slowdown in global surface temperature.

But people are still claiming it. Many are scientists who simply accept it, are unaware that it’s not valid, and want to understand what caused it. Many are climate deniers who want to dispute man-made global warming by any means necessary. Some are climate deniers who have been told, more than once, but refuse to accept it.

A question for Judith Curry: do you still maintain that there was a pause/hiatus/slowdown in global surface temperature in the early 2000s?


### 30 responses to “Multiple Testing”

1. There’s a definitional question in play, IMO. It’s evident from this post–as clearly-written as ever, by the way, barring some of the mathematical detail being a bit over my head–that your definition of “pause/hiatus/slowdown” includes the requirement of a statistically-significant change in trend. That’s certainly sensible, particularly so with regard to projecting future warming, which is generally the main point.

However, it’s also clear that some do not include such a condition in their definition of “pause/hiatus/slowdown”. For them, it’s enough that the observed rate of warming did in fact change over some period of time. Those in this camp who are honest will not try to use the information in order to make projections of future warming–or to deny such projections, for that matter–because they recognize that a ‘slowdown’ of their type need not imply any change in underlying trend. I think such a description would capture the attitudes of some researchers who have, for example, been seeking the specific reasons for the ‘slowdown’. That may even apply to our ego-driven friend Sheldon Walker, who has ‘earned’ some virtual ink here.

Others will be propagandists, pure and simple.

I can’t help but feel that if there were a defined terminology differentiating between definitions #1 and #2, we’d be able to avoid much talking past one another. For example, an observed ‘slowdown’ sans implication of significance could be called a ‘deviation’ (implying the trend from which observations deviate is likely unchanged), whereas a ‘slowdown’ in which significance is claimed might be termed a ‘deceleration’ (implying actual change in the trend, as diagnosed by statistical significance).

• Al Rodger

I don’t think your #2 version has validity even if it is named differently.
There is a big methodological problem that faces those (like our dear departed Sheldon*) who attempt to find ‘slowdowns’ using wholly statistical methods like OLS. The method they use is wholly statistical so statistical rules apply. As a result, the idea of their test being valid while ignoring statistical significance is flat wrong.
There are other approaches which are less reliant on statistics and which could be used to show ‘slowdown’, but the ones with a chance of showing ‘slowdown’ hit another problem for denialists – you would have to admit that GCMs should project climate with reasonable accuracy and any ‘slowdown’ (here a ‘slowdown’ being a significant deviation below the GCM-projected level) results from poor physical representation within the GCMs. The big problem for denialists is that they want a lot more blood from GCMs than just failing to capture a ‘slowdown’. They want to bin all findings from GCMs.

*Sheldon’s latest grand analysis nailed up at Wattspuia is to show that 20% of the Earth has already exceeded the +1.5°C target global temperature increase” agreed at COP21 in Paris. (My bold.)

• I’m not sure I understand your point fully. If a researcher is interested in causes of variability, and can measure variability with, say, rates of change calculated by straightforward methods, what are they to do? They already know that it’s not going to be significant in the sense we’re talking about, right? Yet the variability remains. Should they ignore it? That doesn’t seem right–science is not about throwing up one’s hands and declaring something unknowable, as Akasofu implicitly does, for a denialist instance.

*As to Sheldon’s latest, I stand with Rhett Butler–even if I’m the one who dragged Sheldon’s name into this thread.

• Al Rodger

Expanding on the point, perhaps the first finding from the global temperature data is how remarkably linear it is, even going back to 1970. (Rahmstorf et al 2017 use data back to 1972). Were there acceleration or interdecadal wobbles, the idea of “slowdown” or “speed-up” could not be kicked into the box marked ‘fantasy’. But it is ruler-straight so “slowdown”, if real at all, becomes a conceptual property of a set of wobbly data which is rising in a linear fashion.
Within that remarkably linear trend, there are wobbles which are in-the-main the product of events that are best considered as random in nature – big volcanic eruptions, the frequency and size of ENSO both positive & negative. Also, at a monthly level, there is the noise of the weather systems & measurement errors (which adds 50% to the variability at a monthly level relative to at an annual level. This monthly noise with GISTEMP can see as much as 0.5ºC change in the average global temperature one-month-to-the-next.) Monthly or annual, all of this is effectively random noise and, at an annual level, even the false “slowdown” of the early 2000s can be explained in terms of these random-like volcanic & ENSO effects (as per Foster & Rahmstorf (2011)).
I will post this now and continue with a second serving, as the argument is getting a bit long & involved, and I’m off to the pub.

• A little bird told me that the figure 1.5°C will be in the news next week. I am somewhat chagrinned that Sheldon Walker might somehow try to ride on its coat-tails.

• Al Rodger

Following on from above…
The denialist assertion of there being a “slowdown” or a “hiatus” is derived in a number of different ways of varying complexity. (Although “complexity” may be the wrong word. For some it is enough that the word “hiatus” is used by climatologists to prove the existence of their “slowdown.”) Yet none of their assessments ever come close to being valid.
One approach they employ is the cherry-picking of a period from the record (and in some cases, cherry-picking the record as well) to show a lack of warming trend and thus a “slowdown.” But the use of OLS to calculate the trend for those cherry-picked periods results in a need to calculate Confidence Intervals and this in turn results in the need to address auto-correlation and multiple-testing. These necessary complications prevent the identification of any “slowdown” with this method. And even then, this method also results in broken trends which would make any findings non-physical nonsense.
An alternative not used (yet) by denialists that avoids these problems of Confidence Intervals would be to propose a period of linear trend and use it to identify using non-statistical means a later period with a lower trend. For instance, the suggestion that there has been a “slowdown” since 1997 could be tested by calculating the pre-1997 trend and then demonstrating a change post-1997. For instance, this graphic (usually 2 clicks to ‘download your attachment’) only uses statistical method to calculate the trend 1970-1997 and is thus freed from most statistical complications. But it shows no “slowdown” post-1997 as the wobbles are effectively identical to those seen pre-1997. Simply, there is no evident “slowdown.”

But that’s not really the end of the story. Back in 2012 when global temperatures were at the cool-end of a big wobble we were still using HadCRUT3 and Karl et al (2015) was some way off. The removal of biases in the global temperature record caused by measurement-coverage and other artifacts was certainly made more of an issue because of the bold denialist assertions about “slowdown.” And the denialism gave more poignancy to work that considered the lower-than-projected positive AGW forcings or the assessment of volcanic forcings or anthropogenic aerosols through that period.
And if you look more deeply (as climatology should), not just at temperature and not just at the size of wobbles but their global and regional characteristics, there are many more wobbly data to examine which give a fingerprint of the global wobbles and what caused them. And it is one step from such analysis to be comparing the wobbles, not with a straight-as-a-ruler trend line but with GCM output that hopefully can account for artifacts in the temperature record, varying AGW climate forcing and volcanic forcing. And if they are set to do it, GCMs can even cope with ENSO.
Of course, the findings of such work will not lessen the denialist protests but likely provide them with the occasional tasty quote to use in their nonsense.

2. Even Mike Mann thinks there was a slowdown. When I tried hard to get him to think again (based on your work), on Twitter he promptly barred me. One of the most bitterly disappointing reactions from a scientist that I’ve experienced (well, it’s actually the only bad reaction I’ve had but still). Have you had any conversations with him?

[Response: Yes, and I believe I (along with Stefan Rahmstorf) have persuaded him that the slowdown doesn’t have the supporting evidence needed. He has, in turn, shown evidence of a “slowdown” in very limited regions (but not globally). Bear in mind that I don’t speak for him.]

• rhymeswithgoalie

That, in essence, is how people still buy and sell the misleading claim that there was a “pause/hiatus/slowdown” in global surface temperature during the early 2000s.

I think one problem is the simplistic treatment of global air and surface temperature as reflecting all the additional heat trapped by GHGs, when the *ocean* plays a major role. ENSO states represent a major shifting of heat back and forth between the ocean and the surface. Of course, despite its coverage in introductory high school physics (if not earlier), many people really can’t grasp the difference between temperature and heat, so we’re stuck with a lot of mainstream fixation on surface+atmosphere temperatures.

• Michael Sweet

Tamino,
If you did an analysis of limited regions of the World wouldn’t that make the multiple choices problem even greater? Since there are hundreds of regions and you are searching for an outlier you would have to account for all your possible regions and not just the years you chose.

Given the large year to year variations in climate of limited regions I would think you could never overcome the multiple choice issue also to show statistical significance.

3. Tom Passin

Of course, the data *are* also consistent with a “hiatus”, or maybe we would better say a period with a small or zero slope. But so what? We know there are unmodeled aspects (or, yes, perhaps chance alone) that have caused what appear to be “slowdowns” over the entire span of the data, going back to say 1850. But the temperatures keep going up anyway, over longer time spans.

Anyway, p-values are a poor tool to gauge significance with, because we don’t actually know what the p-value is. We only know its estimate for our data set. And the variance of a sample p-value is quite large. It would be easy for a sample p-value to come out say 0.03 where the “true” value were really 0.25 or higher. So a sample p-value is information to consider, but not rely on too heavily.

[Response: I disagree. A p-value isn’t a property of the data, but of the test applied to the data. I don’t see that there is such a thing as a “true” p-value.]

• jgnfld

I think you have the whole problem reversed. To my mind–and I have posted quantitatively (in Monte Carlo analyses) on this problem before here though tamino does it much better–the question from a frequency POV is whether given the best estimate of frequencies the observations are expectable or not.

The “slowdown” is completely expectable given the observed trends and residual errors, just as there is a reasonable probability of seeing runs of 7 or even 8 heads in a row in 100 coin flips. You have to go to 10 or more heads in a row before the p-value goes below .05 and you might start thinking the flipping procedure is somehow biased.

4. “Many are scientists who simply accept it, are unaware that it’s not valid, and want to understand what caused it.”

If you are interested in the trend and want to ignore the “noise” produced by internal variability then, yes, the pause is not valid. However, there’s more to climate science than proving that a particular trend is happening. Studying the internal variability is a legitimate and useful thing to do.

The problem, of course, is the risk of misinterpretation where people try to pretend that interesting results regarding internal variability somehow counter those showing the trend.

• Timothy (likes zebras)

This depends on the source of the changes in global temperature from year to year.

If they are principally due to, say, temporary changes in ocean dynamics then they would be interesting to study.

If they are principally due to sampling uncertainty due to having a limited number of observation data points then you would be trying to explain changes in noise.

If there is a change that is statistically significant then you can be pretty sure it isn’t noise. Otherwise it might be.

• Yes, that seems to me to be a recurring confusion in this issue. And as you say, it’s only compounded by those who use the idea of a ‘slowdown’ dishonestly in order to promulgate denialist FUD.

And not without effect–I’m in ongoing contact with a gentleman who consumes said FUD, and he can’t even seem to take onboard the fact that insofar as there was a ‘slowdown’ (or as I termed it above, a ‘deviation’) at all, it’s well and truly over for this episode.

• If you are interested in the noise, then call it what it is–a fluctuation.

• Right! Couldn’t manage to come up with that one when I wrote the comment…

5. Gingerbaker

Do we even need sophisticated statistical analyses on this issue? Is there any reasonable proposed mechanism by which a true hiatus could occur?

[GHG] did not go down. Output of the sun did not go down. Ocean thermoclines did not suddenly flip upside down and then right themselves. The Earth’s core did not suddenly cool dramatically and then correct itself perfectly.

A claim of hiatus must present a scientifically rational mechanism to explain it, shouldn’t it?

6. If you can think of it, there is a relevant XKCD.

https://xkcd.com/882/

• I’ve had this pinned over my desk for the last two years in A3 glory. It serves as a recourse for silent pointing whenever someone comes into my office and starts mishandling the concept of multiple testing.

Sometimes they actually get it.

7. The Very Reverend Jebediah Hypotenuse

A claim of hiatus must present a scientifically rational mechanism to explain it, shouldn’t it?

No.
Because asking Dr Judith Curry scientific questions concerning the cause of the pause is exactly analogous to asking Judge Brett Kavanaugh historical questions concerning his judgement.

8. Sorry to beat a dead horse, but the problem, again, is because significance tests and p-values are being used at all, Benjamini-Hochberg or not.

Andrew Gelman, “Bayesian inference completely solves the multiple comparisons problem”, Statistical Modeling, Causal Inference, and Social Science, August 2016.

A. Gelman, J. Hill, M. Yajima, “Why we (usually) don’t have to worry about multiple comparisons”, Journal of Research on Educational Effectiveness, 2012, 5: 189-211.

A. Gelman, J. Carlin, “Some natural solutions to the p-value communication problem — and why they won’t work”, Journal of the American Statistical Association, 2017, 112, 899-901.

The alternative is to use Bayesian hierarchical models. Failing that, use empirical Bayes, per

B. Efron, Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Cambridge University Press, 2010.

9. AndyM

If we want to show the rate of warming is relatively constant (and accelerating) we need to calculate surface heat content rather than relying on a metric that is affected by the proportion of heat going into the oceans.

10. Mitch

“One person’s noise is another person’s signal…”
The problem with the hiatus is that it was cherry-picked to place doubt on the temperature trend. There are short term changes in that trend that result from redistributing heat on the planet that can last for a few years. The hiatus represents one of those times when heat was buried away from the surface, keeping temperatures from rising. Such pauses can give important data on how the planet stores and exchanges heat, and at what magnitude. However, it doesn’t say anything about the trend.

11. slowdown or not – why does it really matter, who apart from idiots expects the global temperature to rise in a straight line anyway

in the northern hemisphere we have increased solar forcing from January onwards, only a moron would suggest that just because march was colder than January – it somehow validates/supports whatever nonsense these flat-earthers are pushing

12. Phil.

Tamino, any chance you’ll update the plot of temperature rise corrected for El Nino, volcanos etc.?

[sure! It might be a few days]

13. ProfJ

Decadal variability is a real feature of the climate system and the “hiatus” is a real phenomenon that is a manifestation of that decadal variability. The residual of the global surface temperature really did take nearly two decades to beat the 1998 value.

I would think the analysis you are doing, showing that the “hiatus” is probably just noise, lumps decadal variability into the noise. That’s fine for longer term climate trends, and demonstrates nicely that the “hiatus” does not represent a significant change from the long term anthropogenic trend. But that doesn’t mean decadal variability isn’t a real and important phenomenon deserving of scientific study.

Something in the climate system caused the residuals to temporarily stop increasing. The El-Nino of 1998 explains part of it, but the El-Nino timescale is too short to explain the whole thing.

When you say “there is no evidence to support the claim of a pause/hiatus/slowdown in global surface temperature,” you are sweeping a lot of interesting climate dynamics under the rug. I get that many misuse decadal variability to argue that climate change is not real, but they also misuse almost every other part of science. That doesn’t mean we should pretend decadal variability is not important.

• Such variability is also a feature of ecological communities:

https://esajournals.onlinelibrary.wiley.com/doi/10.1002/eap.1785

Some of these can, from decadal climate variability, be driven through bifurcations. Many other ecological systems exhibit strong hysteresis, so decadal variability on top of a warming trend ought not be thought of as harmless, simply because it’s always been there. It might, in fact, feed back if Carbon retention is harmed, or CO2 take-up rates affected.