Because scientists, you’re doing it wrong.

There’s a mistake you keep making, as many times as I point it out, even in the peer-reviewed literature, you keep doin’ it wrong. If you’re a scientist, then maybe yes this means you. There are at least two new papers (yes, new) that did it again. And yes, they’re about the so-called “hiatus.”

I’m not just going to show you how you’re doin’ it wrong. I’ll show you how to do it right — or at least, better. And I’ll share some programs to help.

**Doin’ it Wrong**

What’s the usual way to claim statistical significance for a trend change? Step 1: separate the data into “before” and “after” the time you think the trend changed. Step 2: fit a trend to each. Step 3: compare the trends rates (how fast they’re rising or falling), using information about how uncertain your estimates are in the first place. Step 4: apply a test (usually a “*t*-test”) to see whether or not the trend rates are really different.

It seems about as straightforward as could be. But it isn’t; there are two complications which are overlooked in this approach. The first depends on what one means by “trend change,” but it’s the second that alters the statistics so profoundly, it blows the *t*-test out of the water.

**Trend Change**

We first have to be sure what we mean by “trend change.” Most would say it means that the slope (i.e. rate of change) is not constant, so in our test the slope is allowed to change at some point.

If we seek to fit such a model to, say, yearly average global temperature data from HadCRUT4, and use only the data from 1975 through 2014 and nothing else, the best fit model is this one:

It changes slope in 2005, but in published research that particular year as a time of significant change has not even been “on the radar.”

It’s because many researchers mean more than just slope change when they say “trend change.” They also allow the value itself to change, suddenly and discontinuously, like this model:

Whether or not you have a physical reason to allow such a “jump discontinuity,” there’s no getting around that it’s a statistical degree of freedom in your model, one you can’t ignore. We usually ignore the intercept in a linear model (justifiably so), but this model has two intercepts and we can’t ignore them both.

The right way to test such a model is to recognize that the *null hypothesis* (linear trend with no change to slope or intercept) is a subset of the test model (slope change plus jump discontinuity) and apply an *F*-test. The procedure was published over 50 years ago by Gregory Chow, and is called the Chow Test. The attached program will do it for you. It will also compute the *p*-value for a slope change with no jump discontinuity.

Incidentally, both the above models — the first, unbroken one, and the second “broken” trend — give a *p*-value of about 0.03. That’s 97% confidence, right? Wrong.

**Horribly Wrong**

Getting a *p*-value of 0.03 is like taking a lottery ticket and scratching it off to reveal a winner — at the 3% level! That only happens by accident 3% of the time, so presto: evidence, right? Not this time.

You didn’t play a randomly selected lottery ticket. You played the one that *looked like* it had the best chance. And you had plenty to choose from.

When you get to pick the time (like here), you’re not just playing one lottery ticket. There are, oh let’s say about 10 to choose from, so of course you’ll pick the one that says “winner.” If you have 10 lottery tickets, at 3% chance each, and one of them hits, is that really a surprise? What are the odds? Whatever they are, **you must compensate for the multiple choices of the time of trend change**. Failure to do so is the *fallacy of multiple trials*.

And how profoundly does it affect the statistics, you wonder? Those *p*-values of 0.03 are actually 0.13 for the unbroken trend and 0.37 for the broken trend. Now how much evidence is there for a trend change, particularly the broken trend since 1997? Unless we’re accepting conclusions at the 63% confidence level …

**Monte Carlo is your Best Friend**

How, *precisely*, do you compensate for multiple trials? Monte Carlo to the rescue. I ran 10,000 Monte Carlo simulations of random noise, did both the Chow test and the test for unbroken trend change, for each possible trend change time, found the best in each run, and tallied these “best *p*-values” to estimate their probability distribution.

The attached program will do that for you too. It defaults to 500 simulations. If you’re good with R, you’ll be able to figure it out. I provide no support for any of these programs.

**Heed**

Scientists … this means you. You really are doin’ it wrong, and mistaken published research not only leads our science astray, it’s exploited by the dementors. Admit your past mistakes, and swallow the chocolate frog.

**Run Run Run**

The attached file is called “chow.xls” so WordPress will allow me to include it. It’s really plain text, R code. Change the name to “chow.r” or just “chow.txt” and use it as you wish.

If you have a time series of values (t,x) and they’re in time order, you can run the program as “mychow = chow(t,x)”. You’ll see what happens. If you want to translate naive *p*-values (from that program) to realistic ones with Monte Carlo simulations, run “myvals = montechow(n)” where “n” is the number of observations in the time series. When “n” gets much above 50, maybe do “myvals = montechow(n,200)” for only 200 simulations instead of 500.

This blog is made possible by readers like you; join others by donating at Peaseblossom’s Closet.

Nice post! So, I’d like to do the corresponding Bayesian assessment, using Exact Bayesian Segmentation via the

EBSpackage inR. I have HadCRUT4 at home, but it would be good to use exactly the same numerical values. Happy to do this as a Guest Post or at my blog (http://667-per-cm.net) or whatever way is appropriate. What do you think?[

Response:It sounds like a good idea, I’d be happy to host the post. But do remember — the IMPORTANT thing is not to get them the perfect test, just one that’s right.]This is not what I intend to be as the final, because

thatdemands an indication ofchangepoints, but the graph below shows some of the work I have been doing on this question:This is a so-called

Kalman smootherapplied toboththe GISTEMP and HadCRUT4 series, attempting to devise a synopsis. The result is based uponMaximumLikelihoodEstimation of the variance of the series, and in the graph you can see confidence bounds on the individual series as well as the final result. The result shows no particular inflection points in the 2000s.The proposed use of theRpackageEBS is supposed to give a probability of segment change at each time step, and a report on that effort will be forthcoming.Nice post indeed! Is there a variant of the Chow test that works well with auto-correlated time-series?

[

Response:Not that I’m aware. I strongly recommend, instead of using monthly data use annual averages to reduce autocorrelation. Counter to intuition, doing so doesn’t really degrade the precision of linear regression.]Cheers Tamino, I did look into this a couple of years ago, mostly to counter the possible argument that there is a statistically significant breakpoint in the monthly data (rather than just pointing out that the autocorrelation invalidates the Chow test). It seems somewhat surprising that it isn’t a solved problem as auto-correlated time series with breakpoints doesn’t sound like a particularly obscure problem (but time-series is not my field)!

*******************************************

*******************************************

NOTE: THIS HAS BEEN WORDED TO IGNORE YOUR CO-AUTHORSHIP OF THE CITED PAPER. I DON’T KNOW WHAT YOUR POLICY IS ABOUT THAT BUT ASSUMED THIS ROUTE FOR SAFETY. I found that paper very interesting which led to this exercise of my own.

********************************************

********************************************

After recent reading a recent Environmental Research Letters paper which looked at this problem (http://iopscience.iop.org/article/10.1088/1748-9326/aa6825/pdf) using simulations, I played with some Monte Carlo simulations of my own looking at the power side of the problem. That is, given a series with a known underlying trend (I used the two used in the above paper: UAH6 and GISS), what is the probability of seeing a significant result if one intentionally samples only from those random distributions which happen to start at the 1998 residual value (+/- .1sd) and go for n years?

With UAH6 the percent of significant results seen at various series lengths (years) when cherrypicking only from those random distributions whose initial value is 3.08 sd (+/- .1 sd) versus picking from all distributions regardless of initial value is:

Series length: 15 17 19 21 23 25 27 years

% signif found: 0.07 0.28 0.86 2.00 3.52 5.34 7.11 (cherrypicked)

% signif found: 38.1 49.4 61.7 73.3 83.2 91.0 95.5 (~cherrypicked)

Cherrypicking only those distributions which happen to start at a large local max clearly has a huge effect for decades.

With GISS the results show much more power as the 1998 residual is lower (1.68 sd), the trend is higher, and the sd is lower. Even so cherrypicking the start point from the 1998 local max has a large effect out to ~19 years even when we know the underlying distribution exists in the data:

Series length: 14 15 16 17 18 19 20 years

% signif found: 48.4 61.8 74 2 84.1 91.0 95.4 98.0 (cherrypicked)

% signif found: 76.6 84.0 89.9 94.2 96.9 98.5 99.4 (~cherrypicked)

I looked at other aspects as well. I would have no problem sharing the R code with anyone who is interested, though it’s a fairly straightforward exercise to create so long as one is familiar with multicore programming in R (at 100K iterations it takes a while to complete on less than an i-7…in testing, results are fairly stable after ~10K iterations, however). The code has not been reviewed by anyone else, but I have not identified fundamental errors in testing to date.

Sorry, there was an typo in the UAH table. The cherrypicked line is out by a factor of 10. It should read:

Series length: 15 17 19 21 23 25 27 years

% signif found: 0.71 2.8 8.6 19.9 35.2 53.4 71.1 (cherrypicked)

% signif found: 38.1 49.4 61.7 73.3 83.2 91.0 95.5 (~cherrypicked)

The text remains unchanged.

IMO it is useful to remember, what a “trend” can be looked at: e.g. the constant and linear part of a Taylor series, which is an approach to approximate a much more complex function with something mathematically easy to handle. Practically every function can be approximated linearly if you only make the interval short enough. In a real-world time series, we get always some complex and unexplained (not necessarily unexplainable!) behaviour, that we neglect consciously, and so reduce the information considerably. We label what we decide to neglect “noise”. Why do we do this? It became so much a habit to approximate, that we forgot actually the reason, if we knew it once at all, that is.

For many cases, the underlying laws and rules are pretty simple. A simple law with a couple of parameters produces a mathematically simple graph with little information in it, basically reflecting the information of the parameters put into the model. Often, some parameters are missing and subject to our research task. So with a measured graph reduced to the appropriate number of degrees of freedom, we are able to deduce the values of the missing parameters. With the time series of heights of a falling weight, we can deduce the acceleration in the gravitational field, and with some model of the resistance of air, we can even calculate the density of the air from the deviation from the quadratic law, given a good precision.

In climate science the situation is somewhat different: the reality as well as the models are both full of unpredictability. The models yield in their totality a relatively broad temperature path, or shoud I say: boulevard. But each single model run yields the path of a drunken guzzler on this boulevard of plausibility. And the real world is one of those model runs: it may as well behave quite erratic – as long as it stay on our boulevard, it is still within the scope of the modeling. So I actually don’t see so much sense in this or that way of approximating the lower earth temperature time series.

Whether the real world with its 10^30 or so degrees of freedom behaves more – or less – drunk than the models with their 10^6 or so is another, though interesting, question.

@Kinimod,

Yeah, but not all parts of the boulevard are equiprobable. It is possible that next year’s temperature will be uniformly +5C higher than this year, but it is not likely. That smaller excursions over the short term are more likely than larger ones imposes a stochastic continuity if not an analytical one. The point of detecting changes in trends as proposed is that they are saying, for the changepoint “Something interesting happened here”. The proposed analysis is useful, because other, classical means of analyzing the data

http://math.ucr.edu/home/baez/ecological/galkowski/rtsTrends.png from here don’t show it, that is, they miss it.

I think looking at trend changes in terms of statistics is valid, but I would also like to see geophysics invoked to model the underlying variability. What I have been doing is research on fitting the significant ENSO variability to the cyclic lunisolar forcing patterns. I presented this idea at last year’s AGU and will submit another abstract to this fall’s conference. There is likely enough training information in the NINO3.4 signal from the years 1880 to 1920 to effectively fit the signal in all other years. This lunar forcing also fits ENSO proxy measures based on coral data dating back to 1650. And it also agrees with preliminary research findings that NASA JPL has had in the proposal stage.

The relevance is that if we can compensate the non-secular variability by known geophysical factors, any secular trend will become more apparent.

BTW, even wilder is that within the last year, two significant papers on lunar forcing as triggering to earthquakes have appeared. This will have the effect of adding deterministic elements to the stochastic prediction of earthquakes in the future. I would recommend tracking this lunar effect because it is fascinating to see how concepts change as we get better and more complete data (satellite, seismic, etc) over the years. Lunar forcing is just not for ocean tides. Paul Pukite (@whut)

This reminds me of this

https://www.xkcd.com/882/

Taking 20 test instead of 1 increases (if independent) the p-value from 1-0.95=0.05 to 1-0.95^20=0.64.

Claiming it is still 0.05, despite of more than one choice, and having found something, is either incompetence or fraud (or both).

A previous post here (bit lazy to find it now) looked for the false positive rate for a sequential “slope from this year” significance egg hunt method, performed on white noise realizations. Came out close to 30% IIRC; independence of each check is definitely not a correct assumption here as each choice of start point uses most of the data as in the previous and in the next choice.

Following publication of Fyfe et al (2016), their response to the lack of statisitcal significance evident in the change in rate of global average temperature took time to develop but did seem to be simply saying

“There may be no statistical significance but there was a change in trend. And that change in trend was the result of physical processes not random ones. So the lack of significance doesn’t matter.”That is my interpretation of the Fyfe et al input into the

‘slowdown discussion’mid last year.Of course, the change in trends are calculated using statistical techniques so it is a brave man who denies the need for statistical significance. But it is correct to say that there is a physical basis for the variations from underlying trend in global warming.

The problem is perhaps that papers like Fyfe et al (2016) (unlike say Foster & Rahmstorf (2010) ) expend too much effort establishing the

“slowdown”and too little time discussing the underlying physical basis. As a result, they can keep saying“we do not believe that warming has ceased”until they are blue in the face, it will not stop the deluded denialists waving their paper as proof-positive that AGW has stopped/is a myth.It is not problematic to me, at least, to say that a possible causative factor was not picked up in a particular analysis where there is high noise relative to the proposed effect. It is quite another thing, again in my opinion at least, to say that a causative factor was surely operating, but we just couldn’t measure it.

That’s where I get off Fyfe et al-like reasoning.

That said, especially in the satellite data and to a lesser degree in GISS, I do have problems with a simple trend plus random error analysis as it does not appear that the errors are random and independent once the trend is removed. Large positive excursions (el Nino’s) are not balanced off by large negative excursions of the same relative magnitude and they occur at intervals which skew the analysis depending on initial start value such that we do, in fact, get people proposing “pauses”. In the satellite data in particular, as I demonstrate above, “pauses” are completely expected if one starts the analysis at an extreme el Nino maximum: It is identifying a significant warming trend that is actually the UNEXPECTED result for many years into the future after such an occurrence.

I’ve never been sure of the best way out of this dilemma except to note the power of regression to identify a real trend is very low under such circumstances.

[

Response:The fact that “Large positive excursions (el Nino’s) are not balanced off by large negative excursions of the same relative magnitude” is not evidence of non-randomness. “Random” doesn’t mean “Gaussian” or “normal.” It just follows some probability distribution which is itself skewed.]@jgnfld, @Tamino,

As I recently pointed out (in writing) to colleagues,

randomnessas a construct and a basis for analysis is pretty questionable, and most modern texts on probability theory don’t even try to define it. (They were trying to invoke the idea of a randomly chosen Internet Protocol address.) To see the classical attack on it, read Jaynes, perE. T. Jaynes,Probability Theory: The Logic of Science, Cambridge University Press, 2003, section 3.8.1.Random variables are often defined in terms of Borel sets and measures, but that’s about it. Even a text as classic as DeGroot and Schervish, while not Bayesian, really

dislikesthe long run limiting distribution thing.This conceptually why hypothesis testing doesn’t sit well with Bayesians like Jaynes.

[

Response:I disagree that “randomness as a construct and a basis for analysis is pretty questionable.” As for defining it, here’s mine: not predictable, even]in principle.I think that Kolmogorov’s definition for a series of random numbers was one that could not be represented with fewer bits than the sum of the bits for all the numbers. He was never fully satisfied with it. This was one of the issues that caused Bruno de Finetti to give up on a frequentist approach to probability.

All I’ve been trying to say is that the proper sampling distribution for an analysis starting from a chosen point has to include the information residing in the fact of where the chosen point lies with respect to the underlying trend. Certainly Monte Carlo modeling seems to bear this out. And I don’t think a Bayesian would really argue with that POV any more than a frequentist (which I am closer to).

A chosen start point conveys information about future probabilities every bit as much as Monte Hall’s door.

On determinism versus randomness, Laplace is the guy to study. He started the conversation on Bayesian analysis with his introduction of the “Sunrise Problem”. And he also worked out the deterministic equations of tidal fluid flow that eventually evolved into Navier-Stokes. All these years later, we are still battling what is deterministic and what is stochastic regrading climate behaviors. As there is obviously still much work to do, I took Laplace’s Tidal Equations and simplified to find a solution to QBO here http://contextearth.com/compact-qbo-derivation/

Contrary to what Richard Lindzen says, QBO is one of those climate behaviors that is obviously deterministic. It’s not quite like predicting the sunrise every morning, but with some more work, we may get there.

My main point is that most scientists are punting on analyzing for determinism instead of digging in to the problems using the applied physics and math that has been known since Laplace’s time.

I couldn’t agree more. I think fitting global temperatures with discontinuous line segments can quickly lead to absurd results as the number of segments increases — e.g. a staircase of zero or negative trend segments to describe an obviously increasing temperature. I’ve been messing around with the method described by Seidel and Lanzante (JOURNAL OF GEOPHYSICAL RESEARCH, VOL. 109, D14108, doi:10.1029/2003JD004414, 2004), which uses the “Schwarz Bayesian Information Criterion” to decide if a single linear trend, or various other possiblities (piece-wise continuous linear, piece-wise discontinuous linear, etc) is a better fit after taking into account the increase in degrees of freedom in the various methods. I’m wondering if you have any comment/ideas/opinions about this method. Feel free to either reply here or to me directly.

One more thing — I think that when using annual averages, one ought to try using different definitions of year — because ENSO events are usually split in two by our “normal” definition of a year (Jan-Dec), which tends to squelch ENSO variability in annual mean time series. If you use July-June years instead, a different picture emerges.

[

Response:I suspect that using information criteria (like AIC (Akaike) or BIC (Schwarz/Bayesian)) is probably the wave of the future (I’ll take a look at Seidel and Lanzante). I have wondered how to treat the choice of the change-point time as a parameter, I’ll bet they shed some light on that.And you’re quite right about how we define “year” for annual averages. Surely, July-June better isolates the impact of a single ENSO event. Maybe it’s worthwhile adapting the Chow test (and others) to autocorrelated noise so we don’t need annual averages at all.

Here, I can define the year differently as long as I explain it, but for the general public that can sometimes be problematic — as soon as we re-define “year” some regard it as artificial in spite of the fact that *any* choice is arbitrary.]Hi Tamino,

This post reminds that I got 90% of the way to building an interactive website that could be used to explore Bayesian changepoint analyses of climate data, using the approach described here (http://onlinelibrary.wiley.com/doi/10.1890/09-0998.1/full*), or BIC/AIC approximations thereof (which users can fit in real time), and the climate data you supplied on subscription (has that service ceased or is just me that no longer gets updates?). The models even deal with autocorrelation (at least partially). There is also an “Interactive piecewise’ mode that can be used to see instantly the effect of cherrypicking changepoints, allowing discontinuities (step changes), or ignoring autocorrelation. The site uses plotly functions to make interactive plots (with values and other info available on hover).

I never had time to finish it completely, or funds to keep it up to date and properly hosted, but there is still a version that mostly works here..

https://tanytarsus.shinyapps.io/climatetrends2/

I’d be curious to hear if you and your readers think there is any value in finishing and making a site like this widely available? (see also the ‘Add station data’ button to view and analyse data from specific climate stations…).

*its relatively old, but is still the simplest method I know that allows the number and location of changepoints to be unknown parameters.

In Eli’s humble opinion the problem is that global temperature anomaly is a weak metric for evaluating climate change. It would be better to develop a better one

To amplify @Eli’s comment, Isaac Held has discussed this a bit, including the broad outline of possible candidates.

I don’t know enough about these models to know if calculating a

likelihoodfor a specific dataset using them even makes sense to say. I suspect it makes sense but is probably really hard. So, in that case, perhaps something like Owen’sempirical likelihoodmight be an approach like Wood sketches in “Statistical inference for noisy nonlinear ecological dynamic systems”, or Mengersen, Pudlo, and Robert in “Bayesian computation via empirical likelihood”. Devising a summary statistic for comparison might take a bit of work, but, freed from having to use a physical observable, Holden, Edwards, Hensman, and Wilkinson have a working paper on this, “ABC for climate: Dealing with expensive simulators”.