Response to Sheldon Walker

Sheldon Walker commented on my most recent post about his most recent post. It began thus:



Sheldon Walker | February 7, 2018 at 12:10 am | Reply

Sheldon Walker: Oh, oh, I see, running away then. You yellow bastards! Come back here and take what’s coming to you. I’ll bite your legs off!


I’ve got to give you credit; you do have a sense of humor.



A quick update on some comments that you made.

1) That I would not be able to learn how to do a linear regression with a correction for autocorrelation, off the internet.
– I will give you half marks for this comment. I found most of the stuff on the internet about autocorrelation too technical. I wanted a practical example. So I had to work it out for myself. I triple checked my results to make sure that they were right. For each date interval I did 3 linear regressions:
1) first regression with no correction for autocorrelation. Used this to compare to the other regressions, to make sure that values were reasonable.
2) added lagged temperature anomaly. Did linear regression to work out how much autocorrelation was present. Over 741 regressions the average was about 0.58 (that figure is from memory, may be wrong). The amount of autocorrelation varied from about 0.4 to 0.7.
3) corrected the temperature anomaly for autocorrelation. Did another linear regression to make sure that it was gone. All 741 regressions had a residual autocorrelation of about 1.0 x e^-15
2) That I would not learn how to do a linear regression with a correction for autocorrelation overnight.
– You get full marks for this comment. It took me 6 days. Considering that I have a full time day job, I think that 6 days is ok.


I’m not convinced you’ve properly corrected for autocorrelation, some of the things you say make me suspect not. But I don’t know enough about the details to be sure.

You can save yourself a lot of trouble by studying annual averages instead of monthly averages. Their autocorrelation is far less than that of monthly data; there’s still some which lingers, and it makes it too likely to declare “significant!” when it’s not really, but the autocorrelation is so much less that at least your results will be “in the ballpark.”

You might think that using annual rather than monthly averages severely reduces the precision of regression. Counterintuitively, this is not so; the loss of precision is negligible. A (somewhat technical) illustration can be found in Foster & Brown, see section 4. If basic analysis indicates far greater precision with monthly rather than annual averages, it’s a sign that autocorrelation invalidates the apparent precision of analyzing monthly data without autocorrelation correction.
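A rough simulation can illustrate the point. The sketch below is not the analysis from Foster & Brown; it assumes trend-free AR(1) noise with a hypothetical lag-1 coefficient of 0.6 and 30 years of data, and all names in it are made up for illustration. It compares the actual scatter of fitted slopes from monthly data and from annual means, and then compares that scatter with the textbook standard error that ignores autocorrelation.

```python
import numpy as np

def ols_slope_and_naive_se(t, y):
    """OLS slope and its textbook standard error (which assumes white noise)."""
    tc = t - t.mean()
    slope = (tc @ y) / (tc @ tc)
    resid = y - y.mean() - slope * tc
    se = np.sqrt(resid @ resid / (len(y) - 2) / (tc @ tc))
    return slope, se

def ar1_noise(n, phi, sigma, rng):
    """Stationary AR(1) noise: x[i] = phi * x[i-1] + shock."""
    shocks = rng.normal(0.0, sigma, n)
    x = np.empty(n)
    x[0] = shocks[0] / np.sqrt(1.0 - phi**2)
    for i in range(1, n):
        x[i] = phi * x[i - 1] + shocks[i]
    return x

rng = np.random.default_rng(0)
n_years, phi, sigma, n_sims = 30, 0.6, 0.1, 1000   # all hypothetical values
t_month = np.arange(12 * n_years) / 12.0           # time in years
t_year = t_month.reshape(n_years, 12).mean(axis=1)

monthly = np.empty((n_sims, 2))
annual = np.empty((n_sims, 2))
for k in range(n_sims):
    x = ar1_noise(12 * n_years, phi, sigma, rng)   # no real trend at all
    monthly[k] = ols_slope_and_naive_se(t_month, x)
    annual[k] = ols_slope_and_naive_se(t_year, x.reshape(n_years, 12).mean(axis=1))

# Actual scatter of the fitted slopes across simulations (the real precision).
true_sd_monthly = monthly[:, 0].std()
true_sd_annual = annual[:, 0].std()
# Average precision claimed by the uncorrected textbook formula.
naive_se_monthly = monthly[:, 1].mean()
naive_se_annual = annual[:, 1].mean()

print(f"actual slope scatter  monthly: {true_sd_monthly:.5f}  annual: {true_sd_annual:.5f}")
print(f"naive standard error  monthly: {naive_se_monthly:.5f}  annual: {naive_se_annual:.5f}")
```

Under these assumptions the two actual scatters should come out nearly equal, while the naive monthly standard error should be much smaller than the scatter it claims to describe.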

I don’t recall saying you wouldn’t be able to figure out how to apply an autocorrelation correction “overnight.” Perhaps I wasn’t clear enough, but what I meant was the much more general comment that you couldn’t learn enough about trend analysis in general in this situation to do it right overnight. I’ll stand by that statement. If you can point to a direct quote of my referring specifically to autocorrelation correction in my statement of the need for more time to learn it right, then I’ll stand corrected.


I have fully analysed the GISTEMP data using the method that I developed. I found 9 slowdown trends which are significant at the 99% confidence level. They mostly start in 2001 and 2002. The longest lasts 14 or 15 years (sorry, can’t remember which – I don’t have the results with me). There were also 2 slowdown trends which are significant at the 95% confidence level, and 2 or 3 which were significant at the 90% confidence level.

Interestingly, there were 6 slowdown trends which were significant at the 90% confidence level, which started in 1997 and 1998. The famous slowdown that warmists say is due to the 1998 El Nino. They are much less significant than the 2002 slowdown.


No, you didn’t find “9 slowdown trends which are significant at the 99% confidence level.” You only thought you did, because you hadn’t allowed for autocorrelation or for the multiple testing problem.

No, you didn’t find “6 slowdown trends which were significant at the 90% confidence level,” because you still hadn’t taken the multiple testing problem into account (or the “broken trend” issue, but we’ll leave that for another day).

If you haven’t already done so, you really need to read this. It shows that you can get apparently significant trends or trend changes (lots of them) in plain old random noise, when multiple testing is not allowed for. If you really believe that plain old random noise can show real trends, then you need to “check yourself.”
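For readers who want to try this themselves, here is a minimal sketch. It is not the analysis from the linked post: it assumes white noise, a hypothetical 50-point series, and sliding 15-point windows (so every test has 13 degrees of freedom and a single critical value applies). Each window gets an ordinary trend test at the nominal 5% level, and the code counts how often at least one window looks "significant."

```python
import numpy as np

T_CRIT_DF13 = 2.160  # two-sided 5% critical value of Student's t, 13 d.o.f.
WIDTH = 15           # window length (hypothetical choice)

def window_t_stats(y, width=WIDTH):
    """t-statistic of the OLS slope in every sliding window of `width` points."""
    t = np.arange(width) - (width - 1) / 2.0   # centred time index
    stt = t @ t
    stats = []
    for start in range(len(y) - width + 1):
        seg = y[start:start + width]
        slope = (t @ seg) / stt
        resid = seg - seg.mean() - slope * t
        se = np.sqrt(resid @ resid / (width - 2) / stt)
        stats.append(slope / se)
    return np.array(stats)

rng = np.random.default_rng(42)
n_series, n_points = 2000, 50
n_windows = n_points - WIDTH + 1

series_with_a_hit = 0
total_hits = 0
for _ in range(n_series):
    noise = rng.normal(size=n_points)          # pure noise: the true trend is zero
    hits = np.abs(window_t_stats(noise)) > T_CRIT_DF13
    series_with_a_hit += int(hits.any())
    total_hits += int(hits.sum())

per_test_rate = total_hits / (n_series * n_windows)
family_rate = series_with_a_hit / n_series
print(f"false alarms per individual test:        {per_test_rate:.3f}")
print(f"series with at least one 'significant':  {family_rate:.3f}")
```

Each individual test should behave as advertised (about 5% false alarms), but the chance that some window in a series crosses the line should be several times larger, even though there is no trend anywhere.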

The multiple testing problem isn’t something I made up. But, even scientists have a hard time with it. It’s a very real phenomenon; in fact, it’s one of the main reasons that Fyfe et al. believed they had confirmed a “slowdown” when in fact they hadn’t (Stefan Rahmstorf, Niamh Cahill, and I published research about that specifically). Jim Breyer (Duke Univ.) is on something of a “mission” to make it better understood, especially in medical research. I’m trying to make it better understood in climate science. Time will tell how well I succeed.



Watch out for my results on WattsUpWithThat. It will be a while, because I want to include the same analysis for UAH and RSS. I have not done those yet.

I offered warmists a compromise about slowdowns. You turned it down. You won’t be offered it a second time.


Somehow, that news fails to disquiet me.

It seemed to me that the “compromise” you offered was to acknowledge that a “slowdown” in global surface temperature didn’t necessarily mean a slowdown in global warming. We agree on that. What I have tried to make you understand is that you haven’t found reliable evidence of a slowdown in the warming rate of surface temperature. You think you have, but you haven’t.

Seriously: I can tell you work hard at this. But seriously: you’ve still got a lot to learn. There’s no reason you can’t learn it, but it takes time, and a healthy dose of humility will help.





52 responses to “Response to Sheldon Walker”

  1. I spent weeks of my graduate statistics education learning about how to handle multiple tests. Dealing with it properly, consistently frustrated me in my attempts to find support for my hypotheses, but eventually I came to (emotional) grips with it. That was decades ago, so I’m surprised that Sheldon is so resistant to acknowledging the existence of that issue long after he was alerted to it. It’s really basic.

  2. “and a healthy dose of humility will help.”
    Seriously, this is one of the most important attitudes for this kind of statistical work. That, and continuous improvement – that is, trying to stress your methods and improve them, over and over.

  3. Unfortunately I think Sheldon’s desire to be proven right is stronger than his desire to understand. So when the evidence comes in that he was wrong he starts looking for a different analysis that will prove him right. Hence he is almost bound to run into the multiple testing problem.

  4. Sheldon Walker

    Hi Tamino,
    thank you for the advice. I have read your advice about “annual averages instead of monthly averages to reduce autocorrelation” on your website before. But the idea that the loss of precision is negligible going to annual averages is new to me.
    I know a little bit about autocorrelation and the multiple testing problem, but not a lot.
    You made the comment “No, you didn’t find “6 slowdown trends which were significant at the 90% confidence level,” because you still hadn’t taken the multiple testing problem into account (or the “broken trend” issue, but we’ll leave that for another day).”
    When I analyse the 741 date intervals, I treat every date interval in exactly the same way. I don’t treat date intervals where I expect to find a slowdown, any different to every other date interval. The 6 90% significant slowdown trends which start in 1997 and 1998, are exactly where you might expect to find slowdown trends (starting with the 1998 El Nino). I don’t find these slowdown trends in most other places. Why do I find them where I do?
    Are you suggesting that there is an insignificant slowdown there, but that my testing is incorrectly showing them as significant?
    I enjoy learning, and I will certainly be trying to increase my knowledge of things like autocorrelation and the multiple testing problem.
    You probably won’t believe me, but as a skeptic, I constantly question what I know, and question the validity of my results. What you call a lack of humility, is actually a deliberate choice of writing style. Most people don’t want to read articles which are written in a weak style. Given the animosity between warmists and skeptics, a strong confident attitude is required. If warmists were more reasonable, then maybe I wouldn’t need to be so brash.

    • “Are you suggesting that there is an insignificant slowdown there, but that my testing is incorrectly showing them as significant?”

      Tamino is arguing from a statistical POV, not a mathematical POV. We’re all in agreement that the line you get from minimizing the squared residuals from 2002 to 2012 is generally of a lower slope than lines you get from fitting, say, 2007 to 2017. Those are mathematical facts. We can also calculate things like t-scores and p-values, which again are plainly mathematical facts about certain data.

      What they *mean* is a statistics question, a question of formal inference. What a p-value attempts to be—that is, the underlying *intended* question to be answered—is a measure of how embarrassing some data appears to be for a “null” hypothesis (here, that the underlying real trend, which we can only estimate with samples, is unchanged between 1970-ish and now). It is that actual purpose that we want to satisfy, and when we have only one test, and one p-value, this is a proper interpretation of the mathematical fact: a low p-value (p < 0.05, conventionally) means particularly embarrassing data for the null, and a higher p-value means the data is not very surprising/embarrassing.

      The reason that we make such an inference is that, given that the null hypothesis is true, the p-value is uniformly distributed between 0 and 1. A p-value of < 0.05 would occur 5% of the time, and so when we do our one test and we see that particular value, we would infer that the data is embarrassing to the null hypothesis (and, further, maybe we would be embarrassed to believe the null hypothesis).

      There's a big pitfall when we have multiple tests, however, which is that we "get more chances" to have our significant value. Our many p-values can still actually all take values uniformly from 0 to 1, and with enough p-values we'll end up seeing < 0.05, not because that's a real signal, but because of random chance. The null hypothesis actually *promises* to generate p-values below 0.05, 5% of the time; or, p-values below 0.10, 10% of the time. So when you see something like 2-3 p-values below 0.10 when you did your autocorrelation-adjusted checks, out of some 40 tests, that's not embarrassing to the null; that's actually expected from the null.

      So what is Tamino saying? Tamino is saying that, regardless of the fact that some p-value may be below 0.10, or that some sample slope may be zero, we should pay attention to the question we desired answered at the outset: is there any data so embarrassing to the notion that the trend has not changed? And the answer is, no, because what we see (only a few p < 0.10 cases out of 40-ish) was actually promised to us by the null hypothesis. There is no "insignificant" slowdown; there is, in fact, no evidence of a slowdown at all.
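The arithmetic behind "promised by the null hypothesis" can be checked with a short sketch. It assumes the 40 tests are independent, which overlapping date intervals are not, so treat it as an idealized illustration rather than a model of the actual analysis.

```python
import numpy as np

n_tests, alpha = 40, 0.10

# Under the null, each p-value is uniform on [0, 1].
expected_hits = n_tests * alpha                 # average count of p < alpha
p_at_least_one = 1 - (1 - alpha) ** n_tests     # chance of one or more

rng = np.random.default_rng(1)
p_values = rng.uniform(size=(100_000, n_tests)) # 100,000 null "studies"
hits = (p_values < alpha).sum(axis=1)
frac_at_least_two = (hits >= 2).mean()

print(f"expected number of p < {alpha}: {expected_hits:.1f}")
print(f"P(at least one p < {alpha}):    {p_at_least_one:.3f}")
print(f"simulated mean count:          {hits.mean():.2f}")
print(f"simulated P(two or more):      {frac_at_least_two:.3f}")
```

With these numbers, seeing two or three p-values below 0.10 is the ordinary outcome when nothing is going on.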

      There are other, more robust methods of finding potential changes in trend, some of which have been explored here, like Chow tests for continuous trends:

      https://tamino.wordpress.com/2016/11/07/testing-for-change/
      (^Here, you've already been told about this before.)
      https://tamino.wordpress.com/2016/10/18/breaking-bad/
      https://tamino.wordpress.com/2017/07/11/climate-trend-change-do-it-right/

      The short and sweet is that these tests do not reveal a trend change.

      • As an aside, p<0.05 is only really a convention in things like small scale medical testing and its relatives, where there is a significant ethical cost to making experiments larger, and/or where the outcomes will usually only be used to inform further research. Treating a patient based on a single p<0.05 result is, and should be, an ethical no-no.

        In particle physics and astronomy, we typically use 5 sigma as a criterion for "detection", which is p<0.0000003 (under certain assumptions about the underlying distributions, of course).

        I'd be asking for a much, much stricter threshold than p<0.05 before I called off attempts to reduce or mitigate climate change, given that it's a global phenomenon which poses existential risks to our way of life, and is so much more consequential than treating a single patient.

        Link for anyone interested who isn't familiar with this yet: https://blogs.scientificamerican.com/observations/five-sigmawhats-that/

        [Response: In my opinion, the entire “5-sigma” thing is misunderstood. The real reason physicists require 5-sigma is that it’s their way of dealing with the multiple testing problem. Here’s a quote from the SciAm article you link to:

        “The reason for such stringent standards is that several three-sigma events have later turned out to be statistical anomalies, and physicists are loath to declare discovery and later find out that the result was just a blip. One factor is the “look elsewhere effect:” when analyzing very wide energy intervals, it is likely that you will see a statistically improbable event at some particular energy level. As a concrete example, there is just under a one percent chance of flipping an ordinary coin 100 times and getting at least 66 heads. But if a thousand people flip identical coins 100 times each, it becomes likely that a few people will get at least 66 heads each; one of those events on its own should not be interpreted as evidence that the coins were somehow rigged.”

        That is a fine description of the multiple testing problem. If you *do* account for multiple testing, that 5-sigma becomes a hell of a lot less. Compensating by insisting on 5-sigma works, but it’s not ideal and is far from the whole story. More important, it emphasizes that when multiple testing is not allowed for, 5-sigma does *not* correspond to p < 0.0000003. The 5-sigma requirement isn't what people think it is.

        There's also the fact that some things don't follow the normal distribution. For other distributions, even without multiple testing 5-sigma doesn't mean p < 0.0000003. The most one can say with certainty is Chebyshev's theorem, that p ≤ 0.04.

        I'm rather tired of physicists promoting their requirement of 5-sigma as somehow "better." It's not. It's simply their crude way of accounting for multiple testing.]
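The numbers in this exchange can be reproduced with the standard library alone. The sketch below computes the one-sided normal tail at 5 sigma, the distribution-free Chebyshev bound, and what a single-test p-value turns into after many independent looks. The figure of 1000 looks is a made-up illustration, not a number from any experiment, and the function names are hypothetical.

```python
import math

def normal_tail(k):
    """One-sided P(Z > k) for a standard normal variable."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))

def chebyshev_bound(k):
    """Upper bound on P(|X - mean| >= k * sd) for any finite-variance distribution."""
    return 1.0 / k**2

def family_wise_p(p_single, n_looks):
    """Chance of at least one hit in n_looks independent tests (an assumption)."""
    return -math.expm1(n_looks * math.log1p(-p_single))

p5 = normal_tail(5)
print(f"normal, one-sided 5-sigma:   {p5:.3g}")
print(f"Chebyshev bound at 5-sigma:  {chebyshev_bound(5):.3g}")
print(f"5-sigma after 1000 looks:    {family_wise_p(p5, 1000):.3g}")
```

The gap between the normal-theory value and the Chebyshev bound is the point of the remark about non-normal distributions, and the last line shows how quickly a tiny single-test p-value grows once many places are searched.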

      • Paul Grimes,
        Indeed. Listen to Tamino. I was an experimental particle physicist, and we were quite cognizant of the reason for the 5-sigma rule (multiple testing)–a certain Czech string theorist whose name must not be spoken for fear of summoning him notwithstanding. Significance testing, like much of the formalism Fisher and Pearson worked out (mostly in opposition), is all convention. The whole idea of the null hypothesis is an awkward recognition that one cannot assign a probability to a hypothesis in an absolute sense. That is one reason why Bayesians reject the whole idea of significance per se.

        [Response: As a former frequentist, now a rookie Bayesian, I agree. But I don’t regard frequentist statistics as useless, and more important, it’s so imprinted on modern science that its use helps us communicate. Even WUWTians know what a p-value means (at least, some of them).]

    • Sheldon,
      Consider the following two thought experiments.
      Consider an experiment in which you toss a coin 15 times. On average you expect 7.5 Heads per trial. If you got 15 Tails in a row, you’d be surprised, since the odds of that are 1 in 32768. However, if you conducted 32768 such trials, having one in which you got 15 Tails would not be surprising at all, would it?

      Let’s say you want to look at whether the rate at which you get “Heads” changes. You toss the coin 32783 times (32768 + 15), and you notice that there are 15 Tails in a row from tosses 4451 to 4465–surely that represents a significant slowdown in the rate of “Heads,” right? Nope.

      In reality, what you are doing is even more rigged to give you “significant” changes, because you aren’t even pre-defining the duration of your pause.
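The second thought experiment can be worked out exactly rather than argued by intuition. The sketch below assumes a fair coin and independent tosses, and tracks the probability of the current streak of Tails toss by toss; the function name is hypothetical.

```python
def prob_run_of_tails(n_tosses, run_length):
    """Probability that a fair coin shows at least one run of `run_length` Tails."""
    # state[k] = probability that the current streak of Tails has length k
    # and no streak of `run_length` has been completed so far.
    state = [0.0] * run_length
    state[0] = 1.0
    for _ in range(n_tosses):
        new_state = [0.0] * run_length
        new_state[0] = 0.5 * sum(state)          # Heads resets the streak
        for k in range(run_length - 1):
            new_state[k + 1] = 0.5 * state[k]    # Tails extends the streak
        # The mass 0.5 * state[run_length - 1] completes a run and is dropped,
        # so sum(state) is the probability that no run has happened yet.
        state = new_state
    return 1.0 - sum(state)

single_trial = prob_run_of_tails(15, 15)         # 15 tosses, all Tails
long_sequence = prob_run_of_tails(32783, 15)     # 15 Tails in a row, anywhere

print(f"15 Tails in 15 tosses:           {single_trial:.3g}")
print(f"15 Tails somewhere in 32783:     {long_sequence:.3f}")
```

A streak that is a 1-in-32768 event in a single pre-specified block of 15 tosses becomes unremarkable once you are free to find it anywhere in a long record.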

    • Sheldon,
      I would not spend six days resolving the autocorrelation problem. I would rather visit Cowtan’s trend calculator:
      http://www.ysbl.york.ac.uk/~cowtan/applets/trend/trend.html
      and make an examination of sliding 15 year intervals.
      The lowest 15y trend in recent Gistemp, 1998-2012 (remember to use the fractional 2013, etc. as endpoint), is 0.096 +/- 0.144 C/decade, so the trend doesn’t differ significantly from a hypothetical 0.20 C/decade long term trend. And we have not even considered the multiple testing issue…

      Last year Ryan Maue tweeted something like “models don’t produce 15-year pauses”. I checked whether this was true, if models did, or did not, make 15-year trend detours of around 0.1 C from some kind of long-term trend.

      I used the CCSM4 ensemble, with six members (i.e., six realisations of the CCSM4 “climate system”), and made 15 year running trends of the individual members. The metric of comparison is the difference between one model and the average of the other five. If one alternatively wants the difference between a model and the six-member ensemble average, the trend divergence should simply be reduced by one sixth.

      There are plenty of slow-downs and speed-ups where single members diverge 0.1 C/decade, or more, from “the rest of the pack”.
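The "reduced by one sixth" relation between the two comparison metrics is simple algebra, and a quick numerical check makes it concrete. The six trend values below are random placeholders, not CCSM4 output.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 15-year trends (C/decade) for six ensemble members.
trends = rng.normal(0.2, 0.1, size=6)

mean_of_others = (trends.sum() - trends) / 5      # leave-one-out average
diff_vs_others = trends - mean_of_others          # metric used in the comment
diff_vs_full = trends - trends.mean()             # alternative metric

# x - S/6 = (5/6) * (x - (S - x)/5), where S is the sum over all six members.
print(np.round(diff_vs_full / diff_vs_others, 6))
```

Every ratio should print as 5/6, so a 0.1 C/decade divergence from the other five members corresponds to roughly 0.083 C/decade from the full ensemble mean.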

    • Sheldon,
      the autocorrelation is why using monthly data does not gain you much over using annual data here. Neighboring months tell you nearly the same thing. Your gain in precision from using monthly data is much less than you are expecting.
      Then there is the complication of needing to fit seasonality in the model if you are using monthly data. At least two extra terms will be required in the model to capture this.
      The more complicated a model is the more things can go wrong. If you are just looking at the temperature and if you are only interested in long term trends then the annual averages will tell you nearly the same thing as the monthly averages and require a simpler analysis.

  5. Sheldon,
    In much of science, they require >95% confidence (~2 standard deviations) to claim significance. This is arbitrary. Fisher and Pearson both knew it was arbitrary–one of the few things they ever agreed on. In contrast, in particle physics, they require at least 5 standard deviations–precisely because of the multiple testing problem. So, how many trends are you calculating? If it were 20, you’d expect 1 to be significant at the >95% CL even in the absence of any auto-correlation.

  6. Tamino: “…and because it’s straight out of a random number generator we already know the correct answer: no trend.”

    What exactly is required for a random number generator? If I have a random number generator for numbers in the range 1..100, and I use it to generate a sequence of 10 numbers, what if it gives me 1, 2, 3, 4, 5, 6, 7, 8, 9, 10?

    My first thought would be there is a bug in it, but a random number generator could do that, and it has a trend.

    [Response: No, it doesn’t have a trend. I know why, but you don’t and I doubt you ever will.

    In the meantime, I suggest you acquire a perfect random number generator which gives integers from 1 to 100. Generate a sequence of ten and test whether or not you get what you’ve proposed (1 to 10 in order). Do that test once every second until you get it. You should finish in around 3 trillion years. But don’t despair; you might get lucky and hit it on the first try!]
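The waiting time quoted above follows from counting sequences. A minimal check, assuming independent uniform draws from 1 to 100 and one attempt per second:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Ten independent draws from 100 values: 100**10 equally likely sequences,
# so a specific one takes 100**10 attempts on average (geometric distribution).
n_sequences = 100 ** 10
expected_years = n_sequences / SECONDS_PER_YEAR

print(f"possible sequences:    {n_sequences:.3g}")
print(f"expected wait (years): {expected_years:.3g}")
```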

    • Random turns out to be one of the most difficult terms to define in statistics. Kolmogorov tried several different definitions, but was never completely satisfied. Bruno de Finetti was driven to such despair over the inadequacy of definitions of random that it was one factor that drove him toward Bayesian probability.

      As to random number generators, my favorite quote comes from von Neumann: “Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin.” Most random number generators aren’t random. Also, if you generate random numbers uniformly on the interval from 0 to 1, you can transform them to follow nearly any probability distribution.
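The last remark refers to what is usually called inverse transform sampling: push uniform numbers through the inverse of the target distribution's cumulative distribution function. Here is a minimal sketch for the exponential distribution; the rate of 2.0 and the function name are arbitrary choices for illustration.

```python
import numpy as np

def sample_exponential(rate, size, rng):
    """Exponential samples built from uniform ones via the inverse CDF."""
    u = rng.uniform(size=size)        # uniform on [0, 1)
    # CDF: F(x) = 1 - exp(-rate * x), so F^-1(u) = -log(1 - u) / rate.
    return -np.log1p(-u) / rate

rng = np.random.default_rng(7)
samples = sample_exponential(rate=2.0, size=200_000, rng=rng)

print(f"sample mean: {samples.mean():.4f} (theory: 0.5)")
print(f"sample sd:   {samples.std():.4f} (theory: 0.5)")
```

The same recipe works for any distribution whose inverse CDF can be computed.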

    • Tamino: “No, it doesn’t have a trend. I know why, but you don’t and I doubt you ever will.”

      That’s why I asked the question, so why respond that way? I understand the improbability of getting the numbers 1 through 10 in ascending order from a random number generator. Why did you think I don’t understand that? Those are rhetorical questions; this one is real: How do you know (without testing) there is no trend in a finite sequence of numbers from a random number generator?

      I believe you. I’m not trying to cast doubt. I have no nefarious motive. It sounds like you are saying it is impossible for a random number generator to generate a sequence of numbers that contain a trend. Not just extremely improbable but it has probability 0. You’re saying it is impossible, not just extremely improbable, for a random number generator to generate the annual global average surface temperature for the last 30 years.

      That doesn’t sound right, so I’m asking.

      [Response: My apologies. I get so many comments that are genuinely contentious, sometimes it’s hard to tell the difference (esp. without non-verbal communication). I took your comment for one of those. My mistake.

      A sequence of numbers (let’s call it a time series, and the sequence numbers “time”) will follow a probability distribution. If and only if the probability distribution is time-dependent, then there is a trend. In this case, by construction the probability distribution has no time-dependence. Therefore, no trend.

      A trend is *not* an apparent pattern in that time series. Patterns can seem to exist even when they don’t, due to randomness. When we do statistical tests, we don’t get perfect knowledge; we only get a probabilistic result. We often express it as a “p-value,” which emphasizes that it’s probabilistic. Note that it’s also possible for a trend to be present and quite strong, but *not* revealed by the test because of randomness.

      Of course that means all statistical tests involve uncertainty. If a perfect random-number generator returns the aforementioned sequence, the statistical test will return such a low p-value that we’ll (incorrectly) conclude it shows a trend. But the chance of that happening is so low, we’re safe in taking that chance.]

      • Thanks for that. I now see the importance of time-dependence in the meaning of “trend.” But, it wouldn’t be enough, would it, to build a random number generator that takes exactly one year to generate each random number? That wouldn’t qualify as time dependence, because we know the numbers are random. We know something about the cause of the numbers.

        It seems to me that Sheldon Walker, by claiming there is a pause from t1 to t2, is claiming there is a cause, which is not randomness, and which is strong enough to nullify the known greenhouse effect of the increasing CO2 level during the period from t1 to t2. I think that anyone who argues there is a pause must either propose a cause for it, which is not natural variation, or he must come out as a greenhouse effect denier.

        [Response: The “time” referred to in the previous response wasn’t years, it was simply an index. If a random-number generator produces a new result every millisecond, you could think of it as “time” in milliseconds.

        There are other things that affect temperature besides greenhouse gases. It is possible that one of them could create an actual trend change, without negating the continual warming influence of GHGs. What I haven’t yet seen, is actual evidence that this has happened in the last 50 years or so.]

      • A random number generator could use the current time value to compute the next random number. Then you could say the sequence is time-dependent. But if the next random number computation produces the same result no matter how long you wait to do it, it’s not time-dependent.

      • Surely the point is that the random 1,2,3,4,5,6 is not a trend because it comes from a random sequence and is solely a statistical artifact; surely this continues to be part & parcel of Sheldon Walker’s misunderstandings about his chosen task. Were such a sequence 1,2,3,4,5,6 not present in every billion or so random numbers, the numbers would be shown not to be random. Walker fails to grasp that he is working statistically and must abide by the rules.
        ♣ His 10-year trends are incompatible with trends over contiguous time periods (as in the ‘escalator’ criticism of denialist cherry-picked temperature trends) and thus are a statistical construct that cannot be a model of physical quantities.
        ♣ With the 10-year data set considered in isolation, the calculation of confidence intervals remains entirely dependent on statistical constructs. Without such confidence intervals the analytical method is pretty toothless (indeed limbless).
        ♣ Even the average trend calculated using OLS hands the physical evidence over to a statistical analysis in that the wobbly data is being treated entirely as signal+noise rather than physical-measurements. (Consider a multi-year period of flat trend which ends with a short-but-sharp dip of a few months. Such a physical reality is converted into a multi-year period of negative trend within the statistical model.) (Or, more explicitly, consider the exceptionally below-the-10-year-trend of the average temperature in 2008. This dip is physically evident. Yet, even though it features in the data used by Walker’s flag-waving error-filled 2002-12 analysis, it has little impact on his result because Walker’s trends give emphasis to the wobbles at the ends of his 2002-12 analysis.)

        I would suggest the existence of wobbles (dips & lumps) are what drives Sheldon Walker to make these ridiculous analyses.
        GISS wobbles. So there are points where it is running above the average-trend-line-over-the-long-term and points where it is running below.
        One could attempt to demonstrate whether or not these wobbles are altering the average-trend-line-over-the-long-term, but that is not what Walker is about. (He says, if you can believe him, that he “usually” talks of “temporary slowdown or pause.”)
        Walker is attempting but entirely failing to demonstrate if a particular set of wobbles constitute a significant divergence from the average-trend-line-over-the-long-term. Walker’s wrong-headedness will not change that.
        Yet the physical dips and lumps remain evident. Their size can be easily measured. Determining their significance is a much bigger ask.

      • Martin,
        In some ways, I think your confusion gets to the heart of Sheldon’s problem. He views the trend purely as arising from the numbers–even over the short term where noise dominates. The trend is that which persists in the series. If you have two random series that diverge over a portion of the numbers therein, it doesn’t matter if their average behavior over the long haul is the same. In the end, they get you to the same place.

      • Now we’re getting at the point I’m trying to hit. Walker is claiming there is a statistically significant pause. Tamino has proved there is no statistically significant pause. This argument (whether it is Walker’s or Monckton’s or Steven Goddard’s) is about statistical significance, and despite Tamino having proved there is no pause, they can keep claiming there is a pause and trying to prove it is there, because if the argument is about statistical significance, then that’s just statistics, not certainty. Maybe it wasn’t statistically significant, but if it’s about statistics, maybe there really was a pause that didn’t last long enough to become statistically significant.

        Nobody ever asks Walker or Monckton or Goddard: For the sake of argument, suppose we pretend Tamino didn’t prove there was no pause. For the sake of argument, suppose it really was a statistically significant pause. Then what is your point?

      • Martin Smith: “Maybe it wasn’t statistically significant, but if it’s about statistics, maybe there really was a pause that didn’t last long enough to become statistically significant. ”

        And maybe there are invisible purple leprechauns in the trunk of my car? You can’t know until you look, right?

        Actually, there is another reason to posit that no pause–significant or not–has occurred: Physics. There have been no significant changes in forcing, no big volcanic eruptions… So, just as I can reject the proposition that there are leprechauns of any color in my trunk because no one has ever produced evidence of leprechauns, I can also reject the idea of a pause.

      • Martin Smith: “But if the next random number computation produces the same result no matter how long you wait to do it, it’s not time-dependent.”

        It’s also not truly random, as repeating the calculation will produce identical results–which we could predict once we’ve seen the first results.

      • Tamino’s use of the phrase “probability distribution is time-dependent” is key. We often (virtually always?) talk about the result of a regression by saying “Y is a function of X”, but that’s not really what a regression provides. What a regression is telling you is that “the mean value of Y is a function of X”. There is still a distribution of individual Y values around mean Y at that value of X – and that distribution is smaller than the overall distribution of Y for all X. (standard error of regression vs. standard deviation of Y)

        By design, Tamino’s random time sequence with no trend has a mean value of Y that has no trend – unless you look at too short an interval to get a good estimate of the mean (which isn’t a trend… because it’s too short!).
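The distinction between the spread of Y around its fitted mean and the overall spread of Y can be seen in a small simulated example. All numbers below (intercept, slope, noise level) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 500)
# Hypothetical data: the MEAN of y depends on x; individual values scatter around it.
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, x.size)

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

sd_y = y.std(ddof=1)                                   # overall spread of y
se_regression = np.sqrt(resid @ resid / (x.size - 2))  # spread around the fitted mean

print(f"fitted mean of y:             {intercept:.2f} + {slope:.2f} * x")
print(f"standard deviation of y:      {sd_y:.2f}")
print(f"standard error of regression: {se_regression:.2f}")
```

The standard error of regression recovers roughly the noise level that was put in, while the standard deviation of y is larger because it also includes the change in the mean across x.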

    • Martin Smith:

      “maybe there really was a pause that didn’t last long enough to become statistically significant. […] For the sake of argument, suppose it really was a statistically significant pause. Then what is your point?”

      If I understand you correctly you’re essentially calling for a focus on the more important question of practical significance, contrasted to statistical significance—power analysis, in a way. I would think though that until “skeptics” can even get basic interpretation (or even calculation) of statistics correct, it’s too much to ask to consider the sort of “meta-subjects” like effect sizes that would be required for the full inference package. There’s already a complete unwillingness to properly consider exogenous factors like ENSO as being in that realm.

      • Alex C: “I would think though that until “skeptics” can even get basic interpretation (or even calculation) of statistics correct…”
        I agree, but I’m not really calling for a focus on the practical as opposed to the statistical. I’m saying call their bluff. Walker’s effort might be an exception, but I think their real purpose of claiming there is a pause/slowdown/hiatus is really just to create doubt about the veracity of the whole climate science effort. But the models predict that there will be pauses, so even if there is a statistically significant pause, it is consistent with everything else predicted. It isn’t damaging evidence, so their bluff should be called: “If it is a pause, what are you claiming it means?”

      • I don’t understand what you would gain with asking them that question—they’ve already provided the answer without having to be asked. Their answer is “that means the globe is cooling and climate science is wrong”.

        This isn’t a bluff on their part, as if letting them continue down this route will expose that they have no cards to play. There’s no hidden lack of understanding that can be revealed by letting them reach their end game; they’re not required to somehow show an understanding in order to sow doubt. Why you want to ask such open ended questions, where the goal of the denier is actually to leave them open-ended *because that sows doubt*, I just don’t get.

      • Alex C: “I don’t understand what you would gain with asking them that question—they’ve already provided the answer without having to be asked. Their answer is “that means the globe is cooling and climate science is wrong”.”

        “Why you want to ask such open ended questions, where the goal of the denier is actually to leave them open-ended *because that sows doubt*, I just don’t get.”

        Your two remarks illustrate the problem. They don’t actually provide the answer, like you claim. Their repeated attempts to prove there is a pause is what they want to do, because it suggests “the globe is cooling and climate science is wrong.” They just get to keep throwing the suggestion out there, and uninformed people read it and think climate science is dubious. They gain by leaving the question open, not by actually proving there was a pause. If they actually proved there was a pause, then, if the pause ended, they would have to admit that the climate models predict that such pauses will occur and that the fact that such a pause did occur validates the models. But if the pause didn’t end, then climate scientists would already be looking at it to try to understand the cause.

  7. You have 741 types of jelly bean. 6 of them seem to cause acne. Is that surprising?

    https://www.xkcd.com/882/
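A rough back-of-envelope version of the jelly bean question, as a sketch only. The 741 comes from the post; the 1% significance level is an assumption, and so is treating the tests as independent (overlapping windows are certainly not independent, so this is illustrative at best).

```python
from math import comb

n_tests = 741   # number of regressions mentioned in the post
alpha = 0.01    # ASSUMPTION: each test is run at the 1% level (99% confidence)

# Expected number of "significant" results if every null hypothesis is true.
expected_false_positives = n_tests * alpha

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), i.e. n independent tests."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))

# Chance of seeing 6 or more "hits" from pure noise, under the assumptions above.
p_six_or_more = prob_at_least(6, n_tests, alpha)

print(f"expected false positives: {expected_false_positives:.2f}")
print(f"P(at least 6 by chance):  {p_six_or_more:.2f}")
```

Under these assumptions, a handful of "significant" results out of 741 tests is about what chance alone would deliver.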

  8. There is still an important detail missing from the discussion. It is entirely possible that all of the current global mean surface temperature products are still underestimating “hiatus” trends.
    Hausfather et al (2017) found that HadSST3 is probably underestimating the SST trend due to a change in the ship bias over the past 10 years. Cowtan et al (2018) obtained essentially the same result by using coastal weather stations and coastal ship observations to infer SST bias. HadSST3 is used in HadCRUT4, C&W and Berkeley.
    If I replace HadSST3 by either COBE-SST2 or the coastal hybrid record using a low ice mask, then the trend on 1998-2012 further increases to 0.13C/decade.
    ERSSTv4/5 are not affected by this issue, and yet GISTEMP shows a lower trend. This could be due to the much sparser Arctic station sampling in GHCNv3 combined with known limitations in the homogenization algorithm in the presence of rapid Arctic warming – we won’t know for sure until they switch to GHCNv4.
    The second issue is the impact of changing sea ice boundaries on the blending of air and sea surface temperature. This is an unresolved problem.
    We (i.e. all of the temperature providers) are still working on these issues. Never has such a short period been the subject of such detailed attention, and current temperature record products are not necessarily capable of answering the kinds of questions which are being asked about it. In this respect, I view a lot of the hiatus hysteria as being in part a problem of temperature data users who have not sufficiently educated themselves in the temperature data literature before making use of the data (and in that category I include the authors of the 200+ hiatus papers we have catalogued).
    For more details, see this page and in particular the most recent update:
    http://www-users.york.ac.uk/~kdc3/papers/coverage2013/updates.html

  9. Why is Sheldon Walker worth all that trouble?

    [Response: I could be wrong, but my impression is that he really does want to get it right. For instance, when informed (by multiple sources) that autocorrelation invalidates his results, he faced the issue rather than deny it. It’s not trivial to do so, but he worked at it.

    I’m also convinced that his cognitive bias makes it extremely difficult to abandon his pre-conception that there was a genuine “slowdown.” Well, cognitive bias makes it hard for everybody. For those of us who keep working at it, and really want to get it right, there’s a real chance to learn. If we abandon all hope …]

    • He is interested in learning, but one suspects only so that he appears more convincing to his fellow skeptics. So if he uses autocorrelation – big bonus. “Wow he sounds authoritative!” go the masses on WUWT. Of course the main thing he’s learned so far is that he must not mention the multiple testing problem, as that blows so many skeptical “arguments” out of the water.

  10. Besides the autocorrelation in the monthly/annual series itself, your “sliding window” monthly approach introduces a whole additional level of autocorrelation which would still be present in any annual conclusions as well. It can be dealt with but I doubt you’d like the results of dealing with it.

    • There’s correlation between consecutive p-values (as they use mostly similar data), and the windowing artificially amplifies certain frequencies in the data, which introduces spurious trends… it’s really not a robust way of finding trend changes.
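The correlation between neighbouring windows is easy to see in a small simulation. This is a sketch under stated assumptions: the series is trendless independent Gaussian noise, and the 120-point window and 1200-point length are arbitrary illustrative choices. It looks at consecutive slopes rather than p-values, but the same overlap drives both.

```python
import random
import statistics

def ols_slope(y):
    """Least-squares slope of y against the index 0..n-1."""
    n = len(y)
    mx, my = (n - 1) / 2, statistics.fmean(y)
    sxx = sum((i - mx) ** 2 for i in range(n))
    return sum((i - mx) * (yi - my) for i, yi in enumerate(y)) / sxx

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    den = (sum((ai - ma) ** 2 for ai in a) * sum((bi - mb) ** 2 for bi in b)) ** 0.5
    return num / den

rng = random.Random(1)                              # arbitrary fixed seed
series = [rng.gauss(0, 1) for _ in range(1200)]     # no trend, no autocorrelation
window = 120                                        # hypothetical window length

# Slide the window one step at a time and record the trend in each window.
slopes = [ols_slope(series[s:s + window]) for s in range(len(series) - window + 1)]

# Correlate each window's slope with the slope of the next window.
lag1 = pearson(slopes[:-1], slopes[1:])
print(f"correlation between consecutive window slopes: {lag1:.3f}")
```

Even though the underlying data are independent, adjacent windows share all but one point, so their slopes move almost in lockstep. Counting them as separate pieces of evidence overstates what the data say.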

  11. Martin: I asked a confirmed lotto addict (well non-clinical) once if she would consider betting on the numbers 1,2,3,4,5,6,7. Her answer was “No, that set could never win”.

    Mulling on her answer for a while gives some insight into the nature of probability.

    • @jgnfld
      The real problem with such number patterns in a lottery is that many other people choose them too, so any win is shared and the payout is lower.

      • Actually I seem to remember reading once about a lottery where the winning numbers were multiples of 7s and many people who told their bosses to F-off and quit were a bit disappointed. Or this may just be apocryphal.

        But that doesn’t get at the 1:7 can’t win idea. That’s a different superstition.

    • I don’t think her answer says anything about the nature of probability. It says she doesn’t understand the nature of probability. But when Tamino wrote: “…and because it’s straight out of a random number generator we already know the correct answer: no trend,” if he really just means it is extremely improbable that a random number generator would generate data with a trend in it, then he didn’t say it correctly.

      • I very much disagree. Her thought was that that sequence was LESS probable than other sequences. I see this in many other people too.

      • Yes, but that’s about the nature of her superstition, not about the nature of probability.

      • Martin Smith,
        No. Because the random numbers are generated randomly, and the underlying model does not have a trend, THERE IS NO TREND.
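A minimal sketch of the distinction being argued over here: the generating model has no trend, yet individual samples still show nonzero fitted slopes, and a predictable fraction of them look "significant". The sample size, noise level and number of trials are hypothetical, and the noise standard deviation is treated as known so a normal threshold can be used.

```python
import random
import statistics

def ols_slope(y):
    """Least-squares slope of y against the index 0..n-1."""
    n = len(y)
    mx, my = (n - 1) / 2, statistics.fmean(y)
    sxx = sum((i - mx) ** 2 for i in range(n))
    return sum((i - mx) * (yi - my) for i, yi in enumerate(y)) / sxx

rng = random.Random(2)          # arbitrary fixed seed
n, trials = 30, 5000            # hypothetical sample size and number of repeats

# With known unit noise SD, the slope's standard error is 1 / sqrt(sum (i - mean)^2).
sxx = n * (n * n - 1) / 12
slope_se = 1 / sxx ** 0.5

# Every series comes from the same trendless model: independent N(0, 1) values.
slopes = [ols_slope([rng.gauss(0, 1) for _ in range(n)]) for _ in range(trials)]

# Fraction of fitted slopes beyond the usual two-sided 5% threshold.
frac_significant = sum(abs(s) > 1.96 * slope_se for s in slopes) / trials

print(f"mean fitted slope:           {statistics.fmean(slopes):+.4f}")
print(f"fraction 'significant' at 5%: {frac_significant:.3f}")
```

The fitted slopes scatter around zero and roughly one in twenty crosses the threshold, which is exactly what a 5% test is designed to do when there is no trend in the model.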

  12. Sheldon, out of curiosity, did your analysis show any speedup trends to 99% confidence level during the same timeframe? (Serious question – I’d like to know if the trends you found had “opposites”, perhaps in the years leading into those El Nino events.)

  13. Don’t know if this will clear anything up or not, but here goes…

    A trend exists if and only if the underlying physical (or mathematical) process produces a trend. What we test with p-values and so on is the likelihood, given that we don’t know the underlying process, that we have detected a trend.

    A lot of denier activity seems to be devoted to playing around with statistics. They start from the premise that global warming was guessed at from climate statistics; it then follows that if their side can do the statistics better, global warming goes away. In fact global warming was predicted (in 1896) for reasons having to do with radiation physics. The close correlation (r = 0.91 for carbon dioxide and dT 1850-2016) only confirms the prediction from the physics. The physics came first.

    • Here’s an analysis from the direction of “sliding window” analyses such as Mr. Walker is performing.

      First, let’s look at the 45 years of GISS J-D annual data from 1973-2017. How many 15 year ‘hiatuses’ do we find as we slide our window across 15 year intervals? 6

      Now let’s model a bit. (Note I am modeling using independent, nonautocorrelated data here.) Let’s construct 100,000 45-year periods. With a little bit of parallel processing in R, the distribution of the number of 15-year ‘hiatuses’ in a sliding window analysis is straightforward. Here is the histogram.

      Note that the mean number of ‘hiatuses’ IS about 6.
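For readers who want to try this kind of sliding-window simulation themselves, here is a rough Python sketch. It is not the commenter's R code. Everything specific in it is an assumption: the trend (0.018 °C/yr), the noise SD (0.1 °C), the use of 1,000 rather than 100,000 simulated periods, and above all the definition of a 'hiatus' as a 15-year window whose OLS slope is not significantly positive at the two-sided 5% level. The comment does not say which definition was used.

```python
import random
import statistics

def slope_t(y):
    """OLS slope of y against 0..n-1, plus its t-statistic."""
    n = len(y)
    mx, my = (n - 1) / 2, statistics.fmean(y)
    sxx = sum((i - mx) ** 2 for i in range(n))
    slope = sum((i - mx) * (yi - my) for i, yi in enumerate(y)) / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * i) ** 2 for i, yi in enumerate(y))
    se = (rss / (n - 2) / sxx) ** 0.5
    return slope, slope / se

YEARS, WINDOW = 45, 15
TREND, NOISE_SD = 0.018, 0.1   # ASSUMED values, in deg C per year and deg C
T_CRIT = 2.160                 # two-sided 5% critical value, Student's t, 13 df

def count_hiatuses(series):
    """Number of sliding windows whose trend is NOT significantly positive (assumed definition)."""
    starts = range(len(series) - WINDOW + 1)
    return sum(slope_t(series[s:s + WINDOW])[1] < T_CRIT for s in starts)

rng = random.Random(3)         # arbitrary fixed seed
counts = []
for _ in range(1000):
    # Steady warming plus independent (non-autocorrelated) noise.
    series = [TREND * t + rng.gauss(0, NOISE_SD) for t in range(YEARS)]
    counts.append(count_hiatuses(series))

mean_count = statistics.fmean(counts)
print(f"mean number of 'hiatus' windows per 45-year period: {mean_count:.1f}")
```

The point of the exercise is that a steadily warming series with ordinary noise produces some 'hiatus' windows as a matter of course; how many depends on the assumed trend, noise and definition.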

  14. Would this be a good opportunity to complain that, thanks to social media, an entire generation now thinks that “trending” means large in magnitude rather than having a nonzero first derivative?
    As if statistics wasn’t hard enough to understand already.

    [Response: Yes, it’s a good opportunity. Any suggestions? Can we get Neil deGrasse Tyson on the job?]

  15. Ah, ye good olde multiple comparison problem! The funniest (to me) demonstration of the necessity of correcting for multiple comparisons was an fMRI study done by Bennett et al. (It actually won an igNobel prize!) The one single participant in this study: “One mature Atlantic Salmon (Salmo salar) participated in the fMRI study. The salmon was approximately 18 inches long, weighed 3.8 lbs, and was not alive at the time of scanning.” The dead fish was then “shown a series of photographs depicting human individuals in social situations with a specified emotional valence,” and was “asked to determine what emotion the individual in the photo must have been experiencing.”

    Did Bennett et al. see differences in brain activity (in the dead salmon) to the different pictures? Yes they did :-D Why did that happen? Because, as the poster points out, they did not appropriately correct for multiple comparisons. Again, this study was done to illustrate the absolute necessity of doing so.

    Here’s the poster of that study in case anyone’s interested:
    http://users.stat.umn.edu/~corbett/classes/5303/Bennett-Salmon-2009.pdf

    Good to see climatologists struggling with these issues as well.
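As a sketch of what "correcting for multiple comparisons" can look like, here is the simplest option, a Bonferroni correction, applied to simulated null results. The number of tests and the significance level are hypothetical, and this is not the procedure used in the salmon study; it only illustrates the general idea.

```python
import random

rng = random.Random(4)      # arbitrary fixed seed
n_tests = 10_000            # hypothetical number of simultaneous tests
alpha = 0.05                # hypothetical per-study significance level

# Under a true null hypothesis a p-value is uniform on [0, 1],
# so these stand in for tests where nothing at all is going on.
p_values = [rng.random() for _ in range(n_tests)]

# Uncorrected: each test is judged against alpha on its own.
uncorrected_hits = sum(p < alpha for p in p_values)

# Bonferroni: each test is judged against alpha / n_tests, which keeps the
# chance of ANY false positive across the whole family at or below alpha.
bonferroni_threshold = alpha / n_tests
corrected_hits = sum(p < bonferroni_threshold for p in p_values)

print(f"uncorrected 'discoveries': {uncorrected_hits}")
print(f"Bonferroni 'discoveries':  {corrected_hits}")
```

Without a correction, pure noise yields hundreds of apparent findings here. Bonferroni is conservative, and less blunt corrections exist, but the contrast shows why some correction is needed.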

    • There’s even the slight chance that the multiple comparisons techniques cannot themselves remove such effects. This is where a subjective Bayesian approach comes in handy: it matters not what the evidence says, because the prior probability of the dead fish having a different cognitive response to one image v. another is zero*.

      (*As in, nobody would be able to name finite odds they’d take a bet at that it would; or I’d like to meet the person that could do so.)

  16. The fool is still at it, pumping out posts for Willard at an unseemly rate.

    Walker’s OP from two days ago plots out OLS trends for all the 741 possible year-&-one-month-starting/ending-January combinations since 1970 using (presumably) GISS LOTI monthly data (the fool doesn’t manage to say), plotting trend-length against OLS trend. Having then arbitrarily decided his ‘slowdown’ spans the period Jan01-Jan15, he expresses surprise that all 15 plotted data points sit where they do. “I didn’t expect the slowdown trends to be clustered so much to the left,” he tells the Wattsupian passers-by. He appears to feel that this surprise-result is worthy of an OP on Wattsupia. Or in other words, Walker thinks it is worth knowing that the fool ‘calculated a set of OLS trends, all of which used mostly identical data and guess what, they all have similar trends.’

    And yesterday Walker set out the findings gleaned by hitting the LOESS function on some unsuspecting spreadsheet. The flattening of presumably a heavily weighted LOESS analysis is used to fantasise that his ‘slowdown’ is as significant as the required temperature response to proper AGW mitigation. Of course, the fool is seemingly unaware that the warming rate since 1970, if assumed linear, was only +0.167ºC/decade (1970-2001). It has now risen to 0.180ºC/decade (1970-2017). So as well as Walker’s ‘slowdown’ there has been an associated ‘speed-up’ that appears to have been larger than Walker’s ‘slowdown’.

    • So how did he choose the degree of smoothing for the LOESS analysis?
      I have done an analysis of this sort of data using penalized regression splines with the smoothing parameter chosen by generalized cross validation. This led to the maximum amount of smoothing being done. That is, the optimum predictor was a linear regression.
      I think he has just got himself into more trouble.

      • He needs to show that his results are ACTUALLY “unexpected”. It is pretty easy to show they are not unexpected at all but pretty much on the mark.

    • Re. the 1970-2017 interval and Sheldon’s comment “I didn’t expect the slowdown trends to be clustered so much to the left,” someone might point out to Sheldon that no one really predicted Mt. Pinatubo in 1991 all that far in advance. One should be able to take into account its effects on a sliding window analysis long after the fact, however.

  17. He does not realize that the data will confess to whatever he wants if he tortures it enough. We have to somehow make him see that such confessions are usually false. The poor data.