Statistics can be tricky

And for more than one reason.

Sometimes it’s because the math is hard. As the science of statistics has advanced, and computers have made it practical to use ever-more-sophisticated methods, the math has gotten more sophisticated, which often means just plain harder. It’s getting more and more difficult to stay current with the latest developments. But it’s worth the effort; as the methods have gotten more complicated, they’ve also gotten better.

But sometimes it’s for an age-old reason, one which few lay people and not enough scientists (not even enough statisticians) fully appreciate: you have to understand what your data are doing, and why. Often, you really need to feel it in your gut.

Back in 1998 Patrick Michaels et al. (Clim. Res., 10, 27-33) published an analysis of temperature variability. I know enough about Michaels that I don’t trust him as far as I can spit, even after I’m dead. But that’s not related to the problem with this paper; the problem is a failure to understand what the data are doing, and why.

Their approach was simple: take temperature anomaly data, which they refer to as “IPCC” data (it’s probably HadCRU data, but that’s not relevant to the topic at hand). Then compute the amount of variation the data show during small time spans (for the case to be illustrated, one year). This generates a time series of variability as a function of year, one which can be studied for trends. Here in fact is their description of their analysis of monthly average temperatures:

“We first looked at the variability of the monthly temperature anomalies within each year using the IPCC monthly gridcell temperature anomaly dataset. For each gridcell in this dataset which contained a complete set of 12 monthly anomalies in any year, we simply calculated the intra-annual variance in the temperature anomalies for that year as:”


They can then take the individual time series for each grid cell to compute a grand average time series. For the monthly anomalies they describe the result thus:

“For the period 1947 to 1996, we produced a single areally-weighted average variance term for each year using the 1041 cells with nearly continuous data. This produced a time-series of intra-annual variance levels from 1947 to 1996 over a large spatial scale (Fig. 3). The trend in this time-series is towards decreasing intra-annual variance of monthly temperature anomalies and is highly significant…”
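To make the procedure concrete, here is a minimal sketch in Python of the kind of calculation they describe. It is not their code. The paper’s exact variance formula isn’t reproduced above, so the sample variance (dividing by 11) is an assumption, as are the cosine-of-latitude area weights, the function names, and the toy random data standing in for real grid cells.

```python
import numpy as np

def intra_annual_variance(anom):
    """Variance of the 12 monthly anomalies within each year.

    anom has shape (n_years, 12, n_cells). A year with any missing month
    (NaN) in a cell yields NaN for that cell, mimicking the requirement of
    a complete set of 12 monthly anomalies.
    Assumption: sample variance (ddof=1); the paper's formula may differ.
    """
    return np.var(anom, axis=1, ddof=1)        # shape (n_years, n_cells)

def area_weighted_mean(var_by_cell, lat_deg):
    """Average over grid cells with weights proportional to cos(latitude).

    Assumption: cos(latitude) stands in for the paper's areal weighting.
    Cells that are NaN in a given year are left out of that year's average.
    """
    w = np.cos(np.deg2rad(lat_deg))            # shape (n_cells,)
    valid = ~np.isnan(var_by_cell)             # shape (n_years, n_cells)
    w2d = np.where(valid, w, 0.0)              # zero weight where data are missing
    total = np.where(valid, var_by_cell, 0.0) * w2d
    return total.sum(axis=1) / w2d.sum(axis=1)

# Toy data: 50 years, 12 months, 3 hypothetical grid cells.
rng = np.random.default_rng(0)
years = np.arange(1947, 1997)
anom = rng.normal(0.0, 1.0, size=(years.size, 12, 3))
anom[5, 3, 1] = np.nan                         # one incomplete year in one cell
lat = np.array([10.0, 45.0, 70.0])

var_by_cell = intra_annual_variance(anom)
series = area_weighted_mean(var_by_cell, lat)  # one variance value per year
```

The resulting `series` is the sort of yearly variability time series that can then be examined for trends.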

It’s essentially a correct description of their results, and here’s their graph (their figure 3):


Indeed these data show a decreasing trend. But not for the reason they think; it’s not because of a decrease in intra-annual variability.

If you look closely at the graph, you might notice that their variability measure seems to decrease right around 1960. In fact, to the eye it looks more like a step change (at 1960) than a linear trend. I already know that the two patterns mimic each other very well, so if I only had the data but no idea what it was or how it came about, I doubt I’d proclaim a step change. I’d probably favor the linear hypothesis, based on physical considerations. But I wouldn’t rule out either possibility.

By now some of you might be thinking, “If it’s HadCRU data [which I suspect it is], don’t they use a baseline starting in 1961?” Why, yes they do. It’s the baseline which defines the reference for computing anomaly values.

Anomaly is the difference between a given month’s temperature and the average for that same month during the baseline period. When you subtract that average to convert to anomalies, you’re not necessarily removing “the” annual cycle, you’re removing the average annual cycle during the baseline period.

If the annual cycle itself changes, then subtracting the baseline-period average won’t remove it; it will leave behind a “residue” of the annual cycle, namely, the difference between the then-annual-cycle and the baseline-period annual cycle. That’s why, when I’ve done multiple regression of temperature against such factors as el Nino, volcanic eruptions, and solar variations, I’ve also included an annual cycle — so that if there is a “residue” of the annual cycle due to baseline effects, it can be accounted for in the regression.
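For the curious, here is a rough sketch of what including an annual cycle in a regression can look like. It uses made-up data (a linear trend, a leftover annual cycle, and noise) and only the first annual harmonic; the coefficient values and variable names are purely illustrative, and a real analysis would add the el Nino, volcanic, and solar terms as further columns.

```python
import numpy as np

rng = np.random.default_rng(1)
t = 1975 + np.arange(480) / 12.0               # 40 years of monthly time stamps

# Synthetic anomalies: trend + leftover annual cycle + noise (all made up).
true_trend = 0.017                             # deg C per year
residue_amp = 0.15                             # deg C
y = (true_trend * (t - t.mean())
     + residue_amp * np.cos(2 * np.pi * t)
     + rng.normal(0.0, 0.1, t.size))

# Design matrix: intercept, linear trend, and first annual harmonic.
X = np.column_stack([
    np.ones_like(t),
    t - t.mean(),
    np.cos(2 * np.pi * t),
    np.sin(2 * np.pi * t),
])

# Ordinary least squares fit of all terms at once.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted_trend = coef[1]
fitted_amp = np.hypot(coef[2], coef[3])        # amplitude of the fitted cycle
```

Because the cosine and sine columns soak up any residual annual cycle, it can’t masquerade as noise or leak into the other coefficients.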

I’d bet dollars to donuts that it’s because of the baseline effect that they see different variability before the baseline than they do after.
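The baseline effect is easy to reproduce with artificial data. The sketch below is a toy, not the HadCRU data or the paper’s calculation: it assumes a pure annual cycle whose amplitude slowly grows (hypothetical values), with no noise and no trend, then forms anomalies against a 1961 to 1990 baseline and computes the variance within each year.

```python
import numpy as np

years = np.arange(1947, 1997)
months = np.arange(12)

# Synthetic temperatures: a pure annual cycle whose amplitude GROWS slowly
# (hypothetical values), with no noise and no long-term trend.
amplitude = np.linspace(9.0, 10.0, years.size)             # one value per year
temps = amplitude[:, None] * np.cos(2 * np.pi * months / 12)[None, :]

def anomalies(temps, years, base_start, base_end):
    """Subtract the average annual cycle of the baseline period."""
    in_base = (years >= base_start) & (years <= base_end)
    return temps - temps[in_base].mean(axis=0)

anom = anomalies(temps, years, 1961, 1990)
intra_var = anom.var(axis=1, ddof=1)                        # one value per year

# Compare the pre-baseline years with the baseline years.
before = intra_var[years < 1961].mean()
during = intra_var[(years >= 1961) & (years <= 1990)].mean()
slope = np.polyfit(years, intra_var, 1)[0]                  # fitted linear trend
quietest_year = years[np.argmin(intra_var)]
```

In this toy example the actual annual cycle gets larger over time, yet the variance of the anomalies is larger before the baseline than during it, the fitted trend is negative, and the quietest year sits inside the baseline period, purely because of where the baseline was placed.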

As I mentioned, I don’t trust Patrick Michaels one little bit. But that’s not the point here. I also see this kind of mistake from scientists I do trust. For example, back in 2012 I criticized Hansen et al. (especially here, but in posts before and after it as well) for making what I believe is a similar (but not identical) mistake. Again, it had to do with the dependence of variability on the baseline period. And make no mistake about it, Jim Hansen is one of those people I do trust.

The real message is that to become a truly good analyst, you have to do more than just learn the math. Anybody can figure out the equations and do the tests, but it takes a good intuition, a good feel for why things might be happening, to notice some of the more subtle effects that can change the result entirely. If I were hiring a statistician, of course I’d want thorough knowledge of standard techniques and how to compute them. But when it came down to selecting the winner out of the finalists, I would disdain the I-know-all-the-latest-techniques guy for the gal who has, in her gut, that innate sense, that “feel” for what’s happening.

That’s not to say that knowledge of the techniques and methods isn’t necessary — you just can’t get by without them. And it’s not to say that you shouldn’t choose this as your field because you’re worried you might not have that “feel” for things. Unless you’re a super-genius, the intuition doesn’t come from the classroom but from experience; don’t expect to have it fully developed right after graduation, but when you’ve got 10 or 20 years experience under your belt, you’ll know.

And you’ll be glad you’ve got it. After all, statistics can be tricky.

16 responses to “Statistics can be tricky”

  1. Andrew dodds

    It’s not a gut feeling, it’s a 10 billion node neural net that’s seen a lot of training data..

  2. “Unless you’re a super-genius, the intuition doesn’t come from the classroom but from experience; don’t expect to have it fully developed right after graduation, but when you’ve got 10 or 20 years experience under your belt, you’ll know.”

    It’s the same in music theory: you usually don’t know, coming out of school, whether you’re really going to develop that feel for which question to ask when. Scary, given how much you already have invested. One of the trials of youth.

  3. And there will be another “step change” in the data circa 1990 as the baseline period ends.

  4. There is a similar ignorance about the reduction of the variance in the anomaly period here.

  5. Tamino, I seem to remember something similar about artists, including both playwrights and classical composers. First they learn what has been done before. They learn how to imitate the forms. Their knowledge is largely formulaic at that point. But then like a chess or martial arts master, they are able to move beyond the forms, beyond what they have been taught, seeing how things are related.

    With playwrights their work will become more organic, perhaps with characterization, plot, theme and style coming together, supporting one another through and through down to the level of details. In classical music they will see when to bend or break the rules, and their creativity leads to new forms or even schools of music.

    With chess masters they will view chess not in terms of moving individual pieces but in groups that work together as a unit, and they will see when to break those units up and form new units. With martial arts masters their movement will become more fluid, with one form seamlessly becoming another. Fully aware of one’s surroundings, activity may at times even be experienced as effortless play.

  6. Interesting post Tamino. I would suggest this paper (below) as it discusses and aims to deal with the issue at hand.

    Click to access CRU_ANOVA.pdf

  7. I trust Hansen too. Which means I read his papers once, trying to understand what he said and why, before thinking about it myself. Yours, on stats, I read twice. I remember the fine distinction you made with the Hansen paper; it’s when you got upped to read twice. There is, however, a distinction between that ‘oops’ and the face-banging misrepresentations of some work.
    I could even write the Patrick Michaels one off as oops, except getting a finding of reduced variability… gives me a nervous twitch. Climate just ain’t going to do that. Not without a good reason, and if it had one, knowing what it was would be important. And even if it did, moving the mean would still give records. My best feel is that the stats/thinking were stopped when a desired answer was found. It’s a kind of gradient descent algorithm: you keep doing stats until an answer you can represent to mean what you want pops out, and that is the stopping criterion. Well, it is for people I don’t trust.

    I stop, when I run out of questions.

    [Response: You’ve hit the nail on the head. All of us, I think, show a little bit of acute stop-when-you-get-what-you-want-itis, but for deniers it’s a chronic disease.]

  8. The data and facts you have presented for a better understanding of global Weirding are correct and easy to follow.
    Thanks for the hard work.
    Global Weirding News

  9. As I got to about half way through reading this my thought was that it wasn’t a step change as such but a slope down over roughly the period 1950 to 1970 so I wondered if it was due to better and more-consistent measurements being introduced, particularly in the less-well-covered grid cells. In other words, some of the earlier variation could have been due to measurement noise.

    Any idea if that could be part of the story? I guess an approach would be to compare variations in nearby grid cells. If they’re well correlated then it’s likely real, if not then more likely measurement error.

    [Response: I keep telling myself to knuckle down and do a thorough analysis of temperature variability — but I always seem to have something else keeping me busy. Oh well, I guess it’s better to have too much to do, than not enough.]

  10. typo: scienists <— 'ni' should be 'nit' (nitpicking impulse uncontrollable)
    variability: minor gut twitch, adding CO2 to the atmosphere would amount to 'putting a lid on' nighttime heat loss to space, which could reduce variability overall?

  11. ok, far too obscure failed joke, “scienists” should read “scientists” -ni- / -nti-
    My apology.

  12. Two questions:
    1) Reading comprehension: I look at their Fig. 3 and the step change appears to occur in the late 1940’s. What am I missing?
    2) Methods: If you’re going to compute variation over time wouldn’t it be better to use a kernel method? More specifically, rather than computing the variance as shown, why not use a Parzen window or LOWESS (or the like) to compute the trend in the quantity of interest and then a kernel method to compute an analogue of variance? Doing so would produce a result which is less sensitive to peculiarities in the data, e.g., uneven data density, occasional anomalies.

    Let x(t) be the data at time t and xt(t) be the estimated trend at time t. Using the kernel approach the variation is
    s^2(t_j) = sum_i{ K(t_i,t_j) * (x(t_i)-xt(t_i))^2 } / sum_i{ K(t_i,t_j) }
    where i are the indices of data which fall within the averaging window. Standard stuff. Why or why not use that approach?

  13. Personally, I have always wondered why climate scientists converged on the anomaly-compared-to-a-subset-of-the-data-baseline method in the first place. In my past stats work in the psychological/health program evaluation sphere that is a procedure I have never once seen used.

    Often one subtracts out the grand mean. Regression analysis is based on then subtracting out various subset means. But you just never see the mean taken of a subset and then applied to the whole dataset. Extracting a degree of freedom in a restricted range but not in the rest seems strange to me.

    I realize there are levels of complexity here having to do with location changes etc. at each single site at the data reduction stage, etc. But I am talking about using an anomaly based on 30 year subsets of the reduced (“hoaxed” :-o in denier terms) datasets.

    BTW, I perfectly understand the use of anomaly data versus absolute data. That is not my question.

    I’ve never seen a source on the history of this procedure. Does anyone know of one? Is there a sound statistical reason to prefer using a subset to define an anomaly versus/over using the entire available range? Or is this an accident of physicists not consulting statisticians again?

  14. Something strange happened in the first sentence. It should read: “anomaly-compared-to-a-subset-of-the-data-baseline method”