Because scientists, you’re doing it wrong.
There’s a mistake you keep making; no matter how many times I point it out, even in the peer-reviewed literature, you keep doin’ it wrong. If you’re a scientist, then maybe yes this means you. There are at least two new papers (yes, new) that did it again. And yes, they’re about the so-called “hiatus.”
I’m not just going to show you how you’re doin’ it wrong. I’ll show you how to do it right — or at least, better. And I’ll share some programs to help.
Doin’ it Wrong
What’s the usual way to claim statistical significance for a trend change? Step 1: separate the data into “before” and “after” the time you think the trend changed. Step 2: fit a trend to each. Step 3: compare the trend rates (how fast they’re rising or falling), using information about how uncertain your estimates are in the first place. Step 4: apply a test (usually a “t-test”) to see whether or not the trend rates are really different.
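In R, the recipe looks something like this (a bare-bones sketch for illustration only; the function name “naive.test” and everything inside it are mine, not the attached program):

    # the usual "before/after" recipe, with the change time chosen in advance
    naive.test <- function(t, x, tbreak) {
      fit1 <- lm(x ~ t, subset = t <= tbreak)  # Steps 1-2: fit "before"
      fit2 <- lm(x ~ t, subset = t >  tbreak)  #            fit "after"
      b1 <- coef(summary(fit1))["t", ]         # Step 3: slopes, with their
      b2 <- coef(summary(fit2))["t", ]         #         standard errors
      tstat <- (b2["Estimate"] - b1["Estimate"]) /
        sqrt(b1["Std. Error"]^2 + b2["Std. Error"]^2)
      df <- fit1$df.residual + fit2$df.residual
      unname(2 * pt(-abs(tstat), df))          # Step 4: two-sided t-test
    }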
It seems about as straightforward as could be. But it isn’t; there are two complications which are overlooked in this approach. The first depends on what one means by “trend change,” but it’s the second that alters the statistics so profoundly, it blows the t-test out of the water.
We first have to be sure what we mean by “trend change.” Most would say it means that the slope (i.e. rate of change) is not constant, so in our test the slope is allowed to change at some point.
If we fit such a model to, say, yearly average global temperature from HadCRUT4, using only the data from 1975 through 2014 and nothing else, the best-fit model changes slope in 2005. Yet in published research, that particular year has never even been “on the radar” as a time of significant change. Why the mismatch?
It’s because many researchers mean more than just slope change when they say “trend change.” They also allow the value itself to change, suddenly and discontinuously, with a jump at the very moment the slope changes.
Whether or not you have a physical reason to allow such a “jump discontinuity,” there’s no getting around that it’s a statistical degree of freedom in your model, one you can’t ignore. We usually ignore the intercept in a linear model (justifiably so), but this model has two intercepts and we can’t ignore them both.
The right way to test such a model is to recognize that the null hypothesis (linear trend with no change to slope or intercept) is a subset of the test model (slope change plus jump discontinuity) and apply an F-test. The procedure was published over 50 years ago by Gregory Chow, and is called the Chow Test. The attached program will do it for you. It will also compute the p-value for a slope change with no jump discontinuity.
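To see what’s actually being tested, here’s a minimal sketch in R of the null model and both test models, with the change time t0 fixed in advance; because the straight line is nested in each test model, anova() supplies the F-test. (The name “chow.at” is mine, for illustration; the attached program does the real work.)

    # F-test for a trend change at a pre-chosen time t0
    chow.at <- function(t, x, t0) {
      ramp <- pmax(t - t0, 0)           # extra slope after t0
      jump <- as.numeric(t > t0)        # jump discontinuity at t0
      fit0 <- lm(x ~ t)                 # null: one straight line
      fit1 <- lm(x ~ t + ramp)          # slope change only ("unbroken")
      fit2 <- lm(x ~ t + ramp + jump)   # slope change plus jump ("broken")
      c(p.unbroken = anova(fit0, fit1)$"Pr(>F)"[2],
        p.broken   = anova(fit0, fit2)$"Pr(>F)"[2])
    }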
Incidentally, both the above models — the first, unbroken one, and the second “broken” trend — give a p-value of about 0.03. That’s 97% confidence, right? Wrong.
Getting a p-value of 0.03 is like taking a lottery ticket and scratching it off to reveal a winner — at the 3% level! That only happens by accident 3% of the time, so presto: evidence, right? Not this time.
You didn’t play a randomly selected lottery ticket. You played the one that looked like it had the best chance. And you had plenty to choose from.
When you get to pick the time (like here), you’re not just playing one lottery ticket. There are, oh let’s say about 10 to choose from, so of course you’ll pick the one that says “winner.” If you have 10 lottery tickets, at 3% chance each, and one of them hits, is that really a surprise? What are the odds? Whatever they are, you must compensate for the multiple choices of the time of trend change. Failure to do so is the fallacy of multiple trials.
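If the tickets really were independent (overlapping trend fits aren’t, which is exactly why the Monte Carlo below is needed), the odds would be easy to work out:

    1 - (1 - 0.03)^10   # about 0.26, i.e. a 26% chance of at least one "winner"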
And how profoundly does it affect the statistics, you wonder? Those p-values of 0.03 are actually 0.13 for the unbroken trend and 0.37 for the broken trend. Now how much evidence is there for a trend change, particularly the broken trend since 1997? Unless we’re accepting conclusions at the 63% confidence level …
Monte Carlo is your Best Friend
How, precisely, do you compensate for multiple trials? Monte Carlo to the rescue. I ran 10,000 Monte Carlo simulations of random noise, did both the Chow test and the test for unbroken trend change, for each possible trend change time, found the best in each run, and tallied these “best p-values” to estimate their probability distribution.
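In outline, the procedure looks something like this (a bare-bones sketch, not the attached program; it assumes the chow.at function sketched above, and uses plain white noise):

    # null distribution of the *best* (smallest) p-value over change times
    monte.best <- function(n, nsim = 500, buffer = 5) {
      t <- 1:n
      t0s <- (buffer + 1):(n - buffer)  # candidate change times, keeping a
                                        # few points on each side of the break
      best <- matrix(NA, nsim, 2,
                     dimnames = list(NULL, c("p.unbroken", "p.broken")))
      for (i in 1:nsim) {
        x <- rnorm(n)                   # pure noise: no trend change at all
        p <- sapply(t0s, function(t0) chow.at(t, x, t0))
        best[i, ] <- apply(p, 1, min)   # the "winning ticket" in this run
      }
      best
    }

The realistic p-value for an observed best p-value is then just the fraction of noise-only runs that do at least as well by accident, e.g. mean(monte.best(40)[, "p.broken"] <= 0.03).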
The attached program will do that for you too. It defaults to 500 simulations. If you’re good with R, you’ll be able to figure it out. I provide no support for any of these programs.
Scientists … this means you. You really are doin’ it wrong, and mistaken published research not only leads our science astray, it’s exploited by the dementors. Admit your past mistakes, and swallow the chocolate frog.
Run Run Run
The attached file is called “chow.xls” so WordPress will allow me to include it. It’s really plain text, R code. Change the name to “chow.r” or just “chow.txt” and use it as you wish.
If you have a time series of values (t,x) and they’re in time order, you can run the program as “mychow = chow(t,x)”. You’ll see what happens. If you want to translate naive p-values (from that program) to realistic ones with Monte Carlo simulations, run “myvals = montechow(n)” where “n” is the number of observations in the time series. When “n” gets much above 50, maybe do “myvals = montechow(n,200)” for only 200 simulations instead of 500.
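For example (the data here are made up, a steady trend plus noise, and I’m assuming you’ve renamed the file and put it in your working directory):

    source("chow.r")                  # the renamed attachment
    t <- 1975:2014                    # 40 "years"
    x <- 0.017*(t - 1975) + rnorm(length(t), sd = 0.1)  # fake temperatures
    mychow <- chow(t, x)              # naive p-values
    myvals <- montechow(length(t))    # Monte Carlo best-p-value distribution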
This blog is made possible by readers like you; join others by donating at Peaseblossom’s Closet.