Sheldon Walker has made yet another post at WUWT claiming there was a “slowdown” in global temperature recently, this time titled “*Proof that the recent global warming slowdown is statistically significant (correcting for autocorrelation)*.”

The new post, rather than ending its title with “*at the 99% confidence level*” as the last one did, ends with “*correcting for autocorrelation*.” There are two reasons for the change. First, I’m not the only one who pointed out that his failure to account for autocorrelation invalidated his previous result; so did a brave few of the blog commenters. Kudos to Sheldon for not being so deep in denial that he rejected this fact.

Second, his new results (accounting for autocorrelation) didn’t reach significance at the 99% confidence level, so he’s lowered the requirement to 90% confidence. The rationale he gives for doing so is silly; it amounts to nothing more than *finding significance at higher confidence levels is hard.* Nick Stokes asks, rather pointedly, “So you lower the level until you get a ‘significant’ result?”

But even at the 90% confidence level, his result is still wrong. I’ll remind folks of what I already quoted in my reply to his previous post:

The statistics can be pretty tricky because the noise isn’t the simplest kind (referred to as “white noise”), it’s autocorrelated noise, and there are other statistical issues too (like “broken trends” and the “multiple testing problem”).

Yes, there are other statistical issues too.

One that I mentioned is the “multiple testing problem.” I’ve explained it before, but this time I think I’ll demonstrate it in action.

I generated some artificial data, covering the exact same time range, which we *know* has no trend: it’s nothing but random noise. Hence there’s no trend. None. I even used *white noise*, so we don’t need to deal with autocorrelation. Plain old trendless, no-autocorrelation white noise. When looking for a trend this is as simple as it gets, and because it’s straight out of a random number generator we already know the correct answer: no trend.

I then ran an analysis similar to Sheldon’s: I used 10-year + 1 month time spans, estimated the trend rate of each with linear regression, and tested the result for significance. Remember, there’s no trend (by construction) so we already know the answer.
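A minimal sketch of that kind of experiment (the seed, the span, and stepping one window per year are my own assumptions, not Tamino’s actual code):

```python
import numpy as np
from scipy.stats import linregress

# 48 years of monthly white noise: no trend by construction.
rng = np.random.default_rng(42)
months = 48 * 12
y = rng.normal(size=months)          # pure white noise, true trend is zero
t = np.arange(months) / 12.0         # time in years

window = 121                         # 10 years + 1 month of monthly data
hits = 0                             # windows with p < 0.1 ("90% confidence")
n_windows = 0
for start in range(0, months - window + 1, 12):   # one window per start year
    res = linregress(t[start:start + window], y[start:start + window])
    n_windows += 1
    if res.pvalue < 0.1:
        hits += 1
print(n_windows, hits)   # some windows look "significant" despite zero trend
```

Run it a few times with different seeds and you’ll routinely see windows that clear the 90% bar, even though the correct answer is always “no trend.”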

I’ve plotted the “*p*-value”, which tells us the statistical significance for each 10-year + 1 month time span, with values below 0.1 colored blue unless they also dip below 0.05, when they’re colored red:

Lo and behold, for five of the time spans the *p*-value is below 0.1, which means 90% confidence, and three of those are below 0.05, meaning 95% confidence! One of them even reaches down to 0.01006 (98.994% confidence). We found what looks like “statistical significance” (at 90% confidence) for no less than *five* of them.

But we know, without doubt, by construction, that there’s no trend. Not for *any* of the time spans. This is just random noise, the simplest kind, with no autocorrelation. No trend.

The reason that not one but five spans *seem* to reach statistical significance at 90% confidence (three of those reaching 95% confidence, one missing 99% confidence by a hair’s breadth) is the **multiple testing problem**. It’s rather important, really.
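For a rough sense of scale: with 38 windows, each given a 10% chance of a spurious “hit,” and treating the windows as independent (they overlap, so this is only a guide), five or more false alarms is not unusual at all:

```python
from scipy.stats import binom

# P(5 or more of 38 independent tests give p < 0.1 when nothing is there)
prob = binom.sf(4, 38, 0.1)
print(prob)   # ~0.33: five "significant" spans out of 38 is unremarkable
```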

Sheldon, you should do this test yourself. Do it more than once — 10 times or more — for more confidence that the result you get isn’t one of those one-time “odd duck” results.

One more thing, Sheldon. This is the 2nd time in a row you’ve reached an incorrect result; that’s the nature of learning more about time series analysis. Good on ya for diving deeper and putting in the work to do your analysis. What I find objectionable is that you didn’t say “Hey, look what I found … did I get this right?” Instead you announced “Proof!” Twice. Despite the fact that you were wrong both times.

As for your closing comment:

“Why don’t the warmists just accept that there was a recent slowdown. Refusing to accept the slowdown, in the face of evidence like this article, makes them look like foolish deniers. Some advice for foolish deniers, when you find that you are in a hole, stop digging.”

Can you see the irony in that statement?

This blog is made possible by readers like you; join others by donating at My Wee Dragon.

Didn’t one of the big complaints about, well, everything, use to be along the lines of “You’re doing the statistics wrong! Statistics is hard. Climatologists don’t understand these really difficult concepts. You need to have a professional statistician check your work!” I guess that only applies to other people.

Another problem with mis-using p-values like this (aside from multiple testing) is that you only know the p-value for the sample you are testing, not for the underlying population. The statistical variation in a p-value is large, and since small p-values change very non-linearly with the sample mean, you can get big changes in p for small changes in the sample mean.

Since the p-value should have a uniform distribution, its variance should be 1/12 (see Wikipedia at https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)). So the s.d. is 0.29. Even if you had a sample p-value of 1, it might easily have a population (i.e., a true mean) value of 1 – 0.29 = 0.71, or even lower. The p-value can be weak support for your conclusions.
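The uniform-distribution claim is easy to check by simulation: under a true null hypothesis, regression p-values are uniform on [0, 1] (the simulated series below is my own choice):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
t = np.arange(121) / 12.0            # ten years + a month, in years
pvals = np.array([
    linregress(t, rng.normal(size=121)).pvalue   # white noise: null is true
    for _ in range(2000)
])
print(pvals.mean())          # ~0.5, the mean of a uniform [0, 1] variable
print(pvals.std())           # ~0.29 = sqrt(1/12)
print((pvals < 0.1).mean())  # ~0.1: p < 0.1 happens about 10% of the time
```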

We can get an idea of the influence of this effect by applying the binomial distribution to the number of occurrences of p < 0.1. In Tamino's example, he got 5 out of 38 occasions when the p-value was less than 0.1, so the observed probability of getting a p-value < 0.1 was 5/38. The standard deviation of this count is (Npq)^0.5, where q = 1 – p = 1 – 5/38. So the s.d. is 2.1. The observed count should have been reported as 5 +/- 2.1, since you should always report both the mean and the variation.

For Sheldon's article, he reported 2 occurrences, not 5, out of 38. The calculated s.d. for his numbers is 1.4, so he should have reported the number of occurrences of the magic p-value as 2 +/- 1.4. The standard deviation of the difference between Sheldon's count and Tamino's would be sqrt(2.1^2 + 1.4^2) = 2.5. The observed difference was 3, barely more than 1 s.d., if Sheldon's data had the same properties as Tamino's (that is, random noise).
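The arithmetic here is worth checking directly; a quick sketch using the observed fractions:

```python
import math

def count_sd(N, k):
    """Binomial s.d. of a count: sqrt(N*p*q) with observed p = k/N."""
    p = k / N
    return math.sqrt(N * p * (1 - p))

sd_tamino = count_sd(38, 5)    # ~2.1
sd_sheldon = count_sd(38, 2)   # ~1.4
sd_diff = math.sqrt(sd_tamino**2 + sd_sheldon**2)  # ~2.5 for the difference
print(sd_tamino, sd_sheldon, sd_diff)
```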

Who thinks that this difference has any statistical significance, raise your hand!

“is that you only know the p-value for the sample you are testing, not for the underlying population”

Well p-values cannot be sensibly applied to populations of data anyway. If one knew the underlying population then one wouldn’t use p-values at all, they’d be able to directly and precisely answer whatever questions about hypotheses they’d like.

“Even if you had a sample p-value of 1, it might easily have a population (i.e., a true mean) value”

The sampling distribution of a p-value *given the null hypothesis* is a uniform distribution. If the null hypothesis is indeed true then no matter how much you sample, the sampling distribution of the data does not change, and the p-value does not converge to any sort of “population value”. As N increases, the p-value for any particular test will only converge—to zero—if the null hypothesis is false.

“Can you see the irony in that statement?”

Even I could see that one coming.

I suppose I should have used a count of 4 instead of 2 times for Sheldon’s results – he saw four places where he got a p-value of less than 0.1, and I was thinking only of the two times where cooling seemed to be happening, but that would only emphasize the similarity to random results.

/shrug. I’m fine with recognizing that there was a “slowdown”. It just doesn’t mean anything. It’s the nature of looking at short-term trends on data with any kind of noise.

Sheldon rediscovered the fact that ENSO, solar, and volcanic influences can cause short-term wiggles in the surface temperatures.

Walker somehow halved the warming rate: the “average” line from 1970 to 2017 that he plots is something like 0.7 C/century. Really, can we get any sort of methodological consistency from WUWT? It’s embarrassing.

Thanks for that very illuminating post. Sheldon picked a fight with the wrong time series analyst!

Echoing the comment of Alex C above, there are surely some serious cockups in poor old Sheldon’s methods that remain to be mentioned.

Firstly he says

“Autocorrelation does not affect the average warming rate, it affects the Confidence Interval and thus the significance of the average warming rate.”

And having managed to cock that up, do recall the Harrabin question of 2010:

If a 15-year period struggles to be significant, how can any but an exceptional 10-year period manage it?

Sheldon’s Graph 2 says it plots

“Warming rate (degrees celsius per century)”

If I plot the OLS warming rate per century for these, I get a very similar graph. If Sheldon is plotting the warming rate of the bottom of the 90% CI, I would expect to see a similar graph with the data all shifted down by a couple of degrees C. But Sheldon’s Graph 2 plots values that are reduced in magnitude by a factor of 2. He is probably in some manner plotting the statistical significance of any positive trend, with 1 = 90% – that would result in something like his Graph 2.

So all his grand analysis shows is that one ten-year sequence has a zero average trend. And as the OP sets out, that is what you would expect to find.

The go-to device for estimating statistics in a serially correlated dataset is the Politis and Romano stationary bootstrap, at least from a Frequentist perspective.
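A minimal NumPy sketch of the Politis and Romano stationary bootstrap (the AR(1) test series and the expected block length of 12 are my own illustrative choices, not tuned values):

```python
import numpy as np

def stationary_bootstrap(x, mean_block=12, n_boot=1000, seed=None):
    """Politis-Romano stationary bootstrap: concatenate blocks drawn with
    random start points and geometric lengths (mean `mean_block`), wrapping
    around the series, so serial correlation within blocks is preserved."""
    rng = np.random.default_rng(seed)
    n = len(x)
    out = np.empty((n_boot, n))
    for b in range(n_boot):
        i = 0
        while i < n:
            start = rng.integers(n)
            length = min(rng.geometric(1.0 / mean_block), n - i)
            out[b, i:i + length] = x[(start + np.arange(length)) % n]
            i += length
    return out

# Example: the bootstrap spread of the mean of an autocorrelated (AR(1))
# series is wider than the naive white-noise standard error suggests.
rng = np.random.default_rng(1)
x = np.empty(600)
x[0] = rng.normal()
for ti in range(1, 600):
    x[ti] = 0.6 * x[ti - 1] + rng.normal()

boot_means = stationary_bootstrap(x, mean_block=12, n_boot=500, seed=2).mean(axis=1)
naive_se = x.std(ddof=1) / np.sqrt(len(x))
print(boot_means.std(), naive_se)   # bootstrap spread exceeds the naive s.e.
```

This is exactly the failure mode in Sheldon’s first post: treating autocorrelated noise as if it were white noise makes confidence intervals look much tighter than they really are.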