Cherry p

Here’s some data, annual values for the time span from 1979 through 2013:


If we test for a trend using linear regression (the trend line is also plotted) we get a “p-value” of about 0.54. That’s not less than 0.05 (the de facto cutoff for “significance” which gives us “95% confidence”), so no, that trend is not statistically significant. I believe the correct statistical nomenclature would be: “No way.”

Now let’s look for a change in trend by testing whether we can get statistical significance using just the more recent data. We’ll try every possible start year from 1979 through 2009 and use the data all the way up to 2013 (so we’ll have at least 5 years of data in every case), in order to identify which start year gives the smallest p-value and therefore the most likely real trend. That gives us this:


Note that I highlighted the final 15 years of data in red, and added a trend line for those data (also in red). If we test that trend line for statistical significance, we get a p-value of 0.0055. That’s highly significant, at over 99% confidence. Should we conclude that the final 15 years of data demonstrate, convincingly, a significant departure from the trend that preceded it?

The answer is “No.”

I say that with confidence because the data are the output of a random-number generator, i.e. by construction they’re white noise with no trend. I know because I created them using a random-number generator. But that’s not why we should reject the conclusion of a convincing change in trend.

If you take some data and test for trend, but the data are just white noise, then the “p-value” you get is actually the probability of getting such a result or stronger when the data are just white noise. If that p-value is small enough (usually, 0.05 or less for 95% confidence) then we can claim statistical significance because the chance of getting a p-value that small, or less, is only 0.05.

But that’s not what we did. We tried every possible start year which included at least 5 data points. That means we gave ourselves a lot of chances to get a small p-value, so the actual chance of finding some start year which gives a p-value 0.05 or less is quite a bit bigger than 0.05. In fact, in this particular case the chance of finding at least one start year which gives a p-value of 0.05 or less, is just about 0.39.

Yeah, you read that right. When we do this particular analysis on plain old white noise, we have a 39% chance of finding a “statistically significant” trend.
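For anyone who wants to check that figure, here is a minimal Python sketch of the experiment. It is not the code behind this post: the function names, the number of simulated series, and the seed are illustrative assumptions, and the p-value uses a closed-form finite series for Student's t with integer degrees of freedom (Abramowitz & Stegun, sec. 26.7) so that only the standard library is needed.

```python
import math
import random


def two_sided_p(t, df):
    """P(|T| >= |t|) for Student's t with integer df >= 1 (finite series)."""
    theta = math.atan(abs(t) / math.sqrt(df))
    s, c = math.sin(theta), math.cos(theta)
    if df % 2 == 0:
        term = total = 1.0
        for j in range(1, df // 2):
            term *= (2 * j - 1) / (2 * j) * c * c
            total += term
        inside = s * total
    else:
        total = 0.0
        if df > 1:
            term = total = c
            for j in range(1, (df - 1) // 2):
                term *= (2 * j) / (2 * j + 1) * c * c
                total += term
        inside = (2.0 / math.pi) * (theta + s * total)
    return max(0.0, 1.0 - inside)


def slope_p_value(y):
    """Two-sided p-value for the least-squares slope of y against 0, 1, 2, ..."""
    n = len(y)
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))
    syy = sum((v - y_mean) ** 2 for v in y)
    rss = syy - sxy * sxy / sxx          # residual sum of squares
    if rss <= 0.0:
        return 0.0                       # perfect fit, degenerate case
    t = (sxy / sxx) / math.sqrt(rss / (n - 2) / sxx)
    return two_sided_p(t, n - 2)


def min_p_over_start_years(y, min_len=5):
    """Smallest p-value over every start year leaving at least min_len points."""
    return min(slope_p_value(y[s:]) for s in range(len(y) - min_len + 1))


def simulate(n_series=2000, n_years=35, alpha=0.05, seed=1):
    """Return (rate for one honest test, rate when the start year is picked)."""
    rng = random.Random(seed)
    single = scanned = 0
    for _ in range(n_series):
        y = [rng.gauss(0.0, 1.0) for _ in range(n_years)]   # white noise, no trend
        single += slope_p_value(y) <= alpha
        scanned += min_p_over_start_years(y) <= alpha
    return single / n_series, scanned / n_series


if __name__ == "__main__":
    single_rate, scanned_rate = simulate()
    print(f"full series only: {single_rate:.3f}  best start year: {scanned_rate:.3f}")
```

If the sketch is right, the first number should sit near 0.05 and the second near the 0.39 quoted above, with some scatter from the limited number of simulated series.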

Why such a difference between the “apparent” p-value and reality? Because when we pick out the start year that gives the lowest p-value we’re cherry-picking. It’s “cherry-picking” because we picked the start year because of the result it gives. It usually, as in this example, involves choosing a start year which is an extreme value.

If we want to allow this kind of “cherry-picking” but not invalidate our statistics, we have to allow for the vastly greater chance of getting small p-values somewhere. I ran Monte Carlo simulations which indicate that in this specific case (35 years of data, white noise, minimum 5 years for trend) if we want a genuine p-value of 0.05 (for genuine 95% confidence), we have to require an observed p-value of 0.0038.

Yeah, you read that right. For a genuine p-value of 0.05 we have to require an observed p-value of 0.0038.
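One way such a cutoff can be estimated, sketched in Python under the same assumptions (35 years, white noise, at least 5 points per trend): simulate many series, keep the smallest p-value each one yields over all allowed start years, and read off the 5th percentile of those minima. The names `simulated_min_p`, `adjusted_cutoff`, and `genuine_p` are hypothetical, and this is an illustration rather than the simulation actually used for the post.

```python
import bisect
import math
import random


def two_sided_p(t, df):
    """P(|T| >= |t|) for Student's t with integer df >= 1 (finite series)."""
    theta = math.atan(abs(t) / math.sqrt(df))
    s, c = math.sin(theta), math.cos(theta)
    if df % 2 == 0:
        term = total = 1.0
        for j in range(1, df // 2):
            term *= (2 * j - 1) / (2 * j) * c * c
            total += term
        inside = s * total
    else:
        total = 0.0
        if df > 1:
            term = total = c
            for j in range(1, (df - 1) // 2):
                term *= (2 * j) / (2 * j + 1) * c * c
                total += term
        inside = (2.0 / math.pi) * (theta + s * total)
    return max(0.0, 1.0 - inside)


def slope_p_value(y):
    """Two-sided p-value for the least-squares slope of y against 0, 1, 2, ..."""
    n = len(y)
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))
    syy = sum((v - y_mean) ** 2 for v in y)
    rss = syy - sxy * sxy / sxx
    if rss <= 0.0:
        return 0.0
    t = (sxy / sxx) / math.sqrt(rss / (n - 2) / sxx)
    return two_sided_p(t, n - 2)


def simulated_min_p(n_series=4000, n_years=35, min_len=5, seed=2):
    """Sorted list of the best (smallest) p-value found in each noise series."""
    rng = random.Random(seed)
    minima = []
    for _ in range(n_series):
        y = [rng.gauss(0.0, 1.0) for _ in range(n_years)]
        minima.append(min(slope_p_value(y[s:])
                          for s in range(n_years - min_len + 1)))
    return sorted(minima)


def adjusted_cutoff(sorted_minima, alpha=0.05):
    """Observed p-value that only a fraction alpha of noise series can match."""
    k = max(0, int(alpha * len(sorted_minima)) - 1)
    return sorted_minima[k]


def genuine_p(observed_p, sorted_minima):
    """Fraction of noise series whose best start year does at least this well."""
    return bisect.bisect_right(sorted_minima, observed_p) / len(sorted_minima)


if __name__ == "__main__":
    minima = simulated_min_p()
    print("cutoff for genuine 0.05:", adjusted_cutoff(minima))
    print("genuine p for observed 0.0055:", genuine_p(0.0055, minima))
```

The estimated cutoff wobbles from run to run because it depends on the tail of the simulated distribution, so more series are needed for a stable third digit. Changing the series length, the minimum window, or the noise model changes the answer, which is why the numbers above apply only to this specific case.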

And that is why we should reject the claim of statistically significant trend “since 1999.” Because the observed p-value (0.0055) isn’t less than the required cutoff (0.0038) when allowing for “honest cherry-picking.”

As I said, the start time which is selected by such a procedure is usually an extreme value. Look at any of the many graphs of global temperature, think about the oft-repeated claim about a “pause” in global warming since 1998 — then consider how the start year for that claim was selected.


44 responses to “Cherry p”

  1. Steve McIntyre should hire you.

  2. Good example, but it is still hard to beat the cherry-pick in McIntyre & McKitrick (2005) in GRL, as described by Deep Climate and later by Nick Stokes:

    1) Generate 10,000 time-series with overlong persistence and other issues, which essentially guarantee that some will show a desired pattern.
    2) Sort by a newly-selected pattern that selects for that pattern, and get samples from the top 1% and claim that the decentered PCA creates big (positive) hockey sticks from red noise (sort of).

    3) Get article in prime spot on front page of Wall Street Journal 2 days after GRL publication date, despite rarity of science articles on WSJ front page. That was written by Antonio Regalado, now at MIT Technology Review.

    Cherry orchard, easy as 1-2-3.

  3. Richard Simons

    Years ago I was involved with forage crop variety testing. At the time it was normal to rank the varieties then to use a t-test to identify those that were ‘significantly’ better than the others. Unfortunately, in almost every case no correction was made to control the experimentwise (vs comparisonwise) error rate. When I made the appropriate correction, both on my data and those of others, almost all the differences became non-significant – a decidedly discouraging conclusion.

  4. Also known as the look elsewhere effect…

  5. Based on the 99% significance but only a 39% chance of randomly getting 95% significance, I’m assuming that you picked a clear example from a set of different random runs. That fits your definition of cherry picking!

    There’s nothing wrong with selecting a clear example, but then it’s no longer fair to describe it as randomly generated white noise. It is a desired pattern selected from a larger set of randomly generated white noise.

    It’s a good example nevertheless.

    • If the cumulative probability of finding such a p-value was greater than 0.39 then the point would not only be stronger but the accusation of choosing a graph that makes the point would be weaker.

      If it was really a 5% chance of getting a graph that shows what Tamino showed, then yes accuse him of picking a clear example from a set of random runs. But why would the blog post be made in the first place? It’s *because* it is so high as 39% that the blog post was written; it’s silly to come back with “because it is so low as 39%”.

      • Right, 39% chance to “find” a trend with 95% confidence. What is the chance of finding a trend with 99% confidence? Less than 5% I’m sure. I’m not complaining about the likelihood of this happening, I’m complaining about the characterization of the data as “random white noise”.

        [Response: I generated 10,000 sets of white noise. 39% of them had a stretch with a p-value 0.05 or less. I didn’t keep generating sets until I got this result, it just happened.

        And by the way, you’re wrong about the chance of getting a p-value less than 0.01 (99% confidence). It’s greater than 5%.]

        Imagine generating random temperatures until you found one that matched trends in global warming. Then imagine claiming that there is no real trend in temperatures because you can get the same results by generating white noise without a trend. This is the problem with cherry picking a random result and treating it as representative of random data.

        [Response: That’s not what was done. Not even close. Do we have new “debate rules” where you get to make stuff up but the rest of us have to stick to facts?

        39% of the runs had p-values less than 0.05. I picked one of them to show what it looked like. If you want to call that cherry-picking, go ahead. But stop bothering us with such nonsense.]

  6. John, sorry I’m pretty ignorant of post high school stats. Can you explain what ‘overlong persistence’ is and how it affects things?

    • Andy: Deep Climate gave a detailed technical explanation, but that is likely more than you want. I’m sure a time series expert like tamino could do better, and he had a nice post a while back that showed that the hockey stick was in the data, no artifact, but I’ve lost the link to that somewhere.

      a) Statistically, temperature anomalies for year N and N+1 could be independent, with random noise in each year from ENSOs, volcanoes, etc, like tamino’s graph.

      b) Now imagine another graph, where Year N and N+1 had a nonzero correlation, so that you plotted newvalue(N+1) = value(N) + random-value, and you could go further, where effects persisted for many years. But, in real life the year-to-year persistence of noise effects disappears after a few years.

      c) If you use longer persistence and simulate such 10,000 times, at least some of the curves will depart into all sorts of shapes, and longer persistence helps generate bigger departures.

      d) Then, make up a rule that puts upward hockey sticks at top, and (about same number of) downward hockey sticks at bottom, and select among the top 1%. The shapes depend on various parameters, but it’s hard to get the extreme hockey sticks without unrealistically long persistence.

      It is (barely) conceivable that the extra persistence (and some other problems) were errors or incompetence, but the 100:1 sort+cherry-pick code is explicit.
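To make the role of persistence in a) through d) concrete, here is a small Python sketch. It assumes the simplest persistent noise model, AR(1) with coefficient `phi` (`phi = 0` is white noise), scaled so every series has the same overall variance. The "blade size" measure is a hypothetical stand-in for a hockey-stick shape and is not the selection rule used in the paper under discussion.

```python
import math
import random
import statistics


def ar1_series(n, phi, rng):
    """AR(1) noise x[t] = phi * x[t-1] + e[t], scaled to unit variance."""
    innov_sd = math.sqrt(1.0 - phi * phi)
    x = rng.gauss(0.0, 1.0)              # start in the stationary distribution
    series = [x]
    for _ in range(n - 1):
        x = phi * x + rng.gauss(0.0, innov_sd)
        series.append(x)
    return series


def blade_size(series, blade_len=20):
    """Hypothetical hockey-stick measure: mean of the end minus mean of the rest."""
    return (statistics.fmean(series[-blade_len:])
            - statistics.fmean(series[:-blade_len]))


def typical_blade(phi, n=100, n_series=2000, seed=3):
    """Average absolute blade size over many simulated series."""
    rng = random.Random(seed)
    return statistics.fmean(
        abs(blade_size(ar1_series(n, phi, rng))) for _ in range(n_series)
    )


if __name__ == "__main__":
    for phi in (0.0, 0.3, 0.9):
        print(f"phi={phi}: typical blade {typical_blade(phi):.2f}")
```

With the variance held fixed, stronger persistence alone makes the end of a series wander further from the rest, so sorting thousands of such series and keeping the top 1% will turn up far more dramatic shapes than white noise would.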

    • Andy, hi.

      Just to add to what John Mashey has already said. McIntyre way overcooked the noise, using the ARFIMA algorithm, which gives an average de-correlation time of roughly 19 years. Viz.:

      (1+.9)/(1-.9) = 19

      ARFIMA is roughly equivalent to AR1(x) with a coefficient of .9. This is not relevant to anything that could occur in nature. As John stated:

      ” But, in real life the year-to-year persistence of noise effects disappears after a few years.”

      Yeah, 2 or 3 years, but not 19 years. In short, it’s just McIntyre playing silly buggers with stats, because his stuff is not subject to the laws of physics/science or peer review. So he can get away with almost anything. Which he did. This jiggery-pokery is best explained here (von Storch tried it on as well. Another familiar name in these circles):

      How Red are my Proxies?
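The de-correlation arithmetic in this comment is easy to reproduce. The Python sketch below assumes plain AR(1) noise with lag-one coefficient `phi`; the function names are illustrative, and the effective sample size is the usual rough approximation rather than an exact result.

```python
def decorrelation_time(phi):
    """Approximate de-correlation time of AR(1) noise: (1 + phi) / (1 - phi)."""
    if not -1.0 < phi < 1.0:
        raise ValueError("phi must lie strictly between -1 and 1")
    return (1.0 + phi) / (1.0 - phi)


def effective_sample_size(n, phi):
    """Rough count of independent values among n AR(1) samples."""
    return n / decorrelation_time(phi)


if __name__ == "__main__":
    print(decorrelation_time(0.9))          # about 19 years
    print(decorrelation_time(0.5))          # 3 years
    print(effective_sample_size(35, 0.9))   # fewer than 2 independent values
```

On this approximation, a de-correlation time of 2 to 3 years corresponds to a coefficient between about 1/3 and 1/2, well short of 0.9.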

    • Here’s a non-rigorous “arm-waving” explanation of the underlying cause of the overlong persistence in McIntyre’s synthetic noise (copy/pasted/modified from a post I made in another forum):

      The random noise generated by McIntyre was contaminated with hockey-stick signal statistics, for the reason outlined below.

      What McIntyre’s noise-generation script did (and I verified it by examining it myself) was read in Mann’s tree-ring data, compute autocorrelations of the tree-ring time-series, and then use those autocorrelation results to generate his synthetic random noise.

      However, the script failed to perform one critically important step — it did not filter out the underlying “hockey-stick” signal from the tree-ring data before computing the autocorrelations.

      One must remember that the tree-ring data contains *signal* as well as noise — if you are going to use that data as a “template” (as McIntyre did) for your synthetic random noise, *you first must remove the signal components*.

      Otherwise, your random noise will be contaminated with signal-statistics, rendering it useless for evaluating the “noise-only” behavior of a procedure.

      This alone invalidates McIntyre’s entire “hockey sticks from random noise” argument.

      To recap, McIntyre’s synthetic random noise was contaminated with “hockey stick” signal statistics because he did not bother to remove the hockey-stick signal from Mann’s tree-ring data prior to using that tree-ring data as a “template” for his synthetic noise.

      It is that slowly-varying “hockey stick” climate signal that gives rise to the excessively long persistence in McIntyre’s synthetic noise. That excessively long persistence is the result of a very basic screwup on McIntyre’s part.

  7. “Because the observed p-value (0.0055) isn’t less than the required cutoff (0.0038) when allowing for ‘honest cherry-picking.’ ”


    I’m uneasy with the idea that you would, in principle, accept that a significant pause or trend change had occurred if the p-value for the 1998-onwards trend was <0.0038. (You are not necessarily making this claim, but it might seem to follow from your statement.) It still wouldn't be "honest" cherry-picking, as far as I am concerned, because the temperature in 1998 was not simply a random result. It was a spectacularly atypical result. Generally, we should not make statistical decisions based on single outliers.

    If the p-value was <0.0038 (say, p=0.0030), it would indicate that *something* atypical had occurred, but we would then need to consider what it was that was atypical: the extreme heat of 1998, consistent with an unusually strong El Nino on top of the trend; a true change in the trend; a collapse of AGW theory requiring a reworking of physics as we know it; a combination of these; or something else. We would be obliged to note the statistically atypical result, but we would not be obliged to assume that the significance related to a trend change, when other explanations are possible.

    Anyone trying to draw long-term inferences from a temperature series should be prepared to ditch any single outlying year from the discussion. If a contentious conclusion about climate requires the inclusion of any single atypical year, then surely the conclusion is brittle. The atypical year may be of interest in its own right, but should not dominate the discussion of trends. That's why I can be confident that Curry etc are not speaking in good faith when they keep talking about the pause. If they couldn't make their argument sans-1998, then their argument is invalid, and not worth serious consideration.

    I'm not suggesting *you* need to hear this, but talk of the pause is wrong at many more levels than just the cherry-picked start date.

    Fawningly yours, etc,


    • “I’m uneasy with the idea that you would, in principle, accept that a significant pause or trend change had occurred if the p-value for the 1998-onwards trend was <0.0038. (You are not necessarily making this claim, but it might seem to follow from your statement.) It still wouldn't be "honest" cherry-picking, as far as I am concerned, because the temperature in 1998 was not simply a random result. It was a spectacularly atypical result. Generally, we should not make statistical decisions based on single outliers."

      Yes, but given the nature of the data, a p-value of <0.0038 for the best 5 year run you can find would be something that only happens 5% of the time. That doesn't mean there is a reason for it, but it may mean it's worth looking for one.

      • “Yes, but given the nature of the data, a p-value of <0.0038 for the best 5 year run you can find would be something that only happens 5% of the time."

        Yes, that's what Tamino said.

        "That doesn't mean there is a reason for it, but it may mean its worth looking for one."

        Yes, that's what I said.

        But even then the reason for the mildly unlikely (<5%) result would be inherently *unlikely* to be a change in the underlying warming trend, especially when the finding of an apparent trend-change would be relying on a single outlying year that we already know was a strong El Nino. Just because we found the unlikeliness using trend statistics would not mean it was the trend that was significant, except in the trite and misleading sense of the numbers falling out that way. The unlikeliness, along with the implied obligation to look for an explanation, could be concentrated in a single year.

    • In the present case, we can have confidence that a trend change hasn’t occurred rather simply: the regression slope for 1998-2013 (.0099K/yr +- .0096 [95% CI], Cowtan & Way) is not significantly different from the regression slope for any period of equal or greater length ending in 1997. And that’s even AFTER we cherry-pick the start year.

  8. But still, if the terms “pause” or “hiatus” are repeated enough in the news or blogs and even some scientists seem to acknowledge “the pause” doesn’t that make it significant? I was arguing about this with a GOP member of the US House of Reps. I somehow thought that someone with an MD degree would know enough about science and trends in data to at least have second thoughts when looking at the data. One problem is that despite my comments he did not seem willing to look at the data before 1998. Finally I was “unfriended” on his Facebook page.

  9. If you say and read something enough times and don’t have the skills to calculate things out, that’s what you get. I can’t say the number of times I’ve read “unanticipated hiatus”. It may be “unexpected” to someone who “calculates” by eye and no examination of the overall distribution.

    Unfortunately we are not going to change innumeracy very soon. Once 3 heads come up too many just know a 4th is “unexpected”.

  10. This is a very common problem in many fields of science. I have seen a paper that was used to demonstrate that GMOs are toxic. The authors examined something like 40 potential pathologies. One of those came out with p<0.05. Proof!

    I have seen the same thing for some cancer risks. Just start at the potential source of the problem and draw concentric circles. Stop when you get the appropriate p-value. Note that in almost all cases, the cancer that is affected is leukemia or a brain tumour. Of course, these are among the rarest ones, which provides enough randomness to get a false positive.

    This is why physicists don't trust anything below 3 sigma and use Monte Carlo simulations to check everything. With experience, particle physicists discovered that 4-sigma events were bogus most of the time. Now they require at least 5 sigma.
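The "40 pathologies, one with p<0.05" situation can be put in numbers with a short Python sketch. It assumes the 40 tests are independent, which real studies rarely satisfy exactly, and the function names are illustrative.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent null tests gives p <= alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def sidak_threshold(n_tests, alpha=0.05):
    """Per-test cutoff that keeps the overall false-positive chance at alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


def bonferroni_threshold(n_tests, alpha=0.05):
    """Simpler, slightly stricter per-test cutoff: alpha divided by n_tests."""
    return alpha / n_tests


if __name__ == "__main__":
    print(round(chance_of_false_positive(40), 4))   # 0.8715
    print(sidak_threshold(40))                      # about 0.00128
    print(bonferroni_threshold(40))                 # 0.00125
```

So with 40 independent looks at pure noise, at least one "significant" result turns up about 87% of the time, and an honest 95% confidence claim would need a single-test p-value near 0.0013.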

  11. I’ve generated quite a few of these pseudo-temperature data sets myself. One thing I’ve noticed is that simply drawing lines between the points leads the eye to find spurious correlations, especially at the end. Say there is one point 5 or 8 or 12 intervals from the right hand end that’s unusually high – fairly common by chance. If the end point goes down relative to its neighbor, you automatically see a near-plateau from that higher point to the end. If the last point goes up, it has no weight to its right to enhance its apparent upward trend. So the visual impression is still that of a plateau. And the illusion can be pretty compelling if you haven’t learned to look past it.

    So just one high point 5 or 10 years from the end is all it takes to create an apparent but false “hiatus”. The odds are pretty high that you’ll get this situation fairly often in a data set like these surface temperature ones.

    If you plot the data with dots instead of lines, you don’t get that visual illusion.

    • A small point, but I think a vital one: The correlations you observe in that scenario are entirely NONspurious. They are completely expectable and you can even calculate pretty much how expectable they are.

      • Of course there are apparent correlations. But there is a visual illusion effect (at least for me) that magnifies what’s really there when you are looking at the last part of the graph, if you join the points by lines. After all, the lines practically by definition serve to link points so as to make order appear … in this case, even if the data are random.

  12. This is such a common problem; thanks for such a nice illustration. Another similar example is common in the large cohort data mining field (such as the Nurses’ Health Study). In a hypothesis-driven investigation (is moderate alcohol consumption associated with diabetes?) the statistics are very different than in an open-ended study (what variables are associated with diabetes?). In conversations with students, they find it very troubling that the same finding (moderate alcohol consumption is associated with diabetes) will be either significant or not depending on how the question was asked.

  13. In case anyone is interested in checking this, I’ve replicated the selection process in R:
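    #Function that outputs p-value for linear regression test
    lmp <- function (modelobject) {
      if (!inherits(modelobject, "lm")) stop("Not an object of class 'lm' ")
      f <- summary(modelobject)$fstatistic
      p <- pf(f[1],f[2],f[3],lower.tail=FALSE)
      attributes(p) <- NULL
      return(p) #Restored: the return and closing brace were missing as posted
    }
    tot = 0 #To keep track of number of datasets that give p<0.05
    for(a in 1:1000){
      years = 2013-1979+1
      t = seq(1979,2013)
      q = rnorm(years) #Assumed: white noise with no trend (this line was missing as posted)
      for(n in 1:(years-5)){ #Start years 1979-2008, so the shortest fit uses 6 points
        fit=lm(q[n:years] ~ t[n:years]) #Fitting from our start year
        if(lmp(fit) < 0.05){ #Restored test, per the description below
          tot = tot + 1
          break #If dataset gives p<0.05, move to next
        }
      }
    }
    print(tot) #Number of datasets (out of 1000) with at least one p<0.05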

    Essentially, the program will create a large number of datasets, and calculate p-values for successive start years for each one. With any given dataset, once a p-value of less than 0.05 is obtained, we count that one and move on to the next dataset. We don't want to count *all* instances of p<0.05, because p-values of a dataset are correlated to p-values of subsets of that data. So if we get a p<0.05 in a dataset, we will likely get another one with the next start date; so we only want to find the first, and move on.

    The program takes a while to run when "a" can go to 1000, about a minute. When I run it I get counts that are close to 390, which is what we'd expect from a 39% likelihood given above.

  14. uknowispeaksense

    Reblogged this on uknowispeaksense.

  15. OT Horatio, i am sure that these ladies and gents would love to perform some of your greats like uncertainty, but your blog is by invite only


  16. One consequence of this is there is less chance of cherry-picking if you use decadal averages instead of monthly figures since there are fewer samples (1/120 as many) hence less opportunity for the “look elsewhere” effect.

    And of course there’s no change in the trend from the 70’s to the 00’s (though there was flattening from the 40’s to the 70’s)

  17. If we know that the data are random, of course no p-value whatsoever tells us anything. We always put in a model of the system and its statistics we are examining, and better explicitly than implicitly. Then we can use statistics to test it. The best we can expect is a coherent overall picture. We’ll never know whether the picture is the “truth”, as there are always several underlying models which give a coherent picture with the data. (So for the sake of economy, we take the simplest we can think of, which is called “Occam’s razor”.)
    If we extract trend lines from noise, we need a model in which a trend line embedded in noise is connected to some model parameter. The statement “We have a trend here with p < 0.05.” is in itself senseless without a model assumption. Such assumptions are e.g. that the noise or competing effects in the interesting frequency band are low enough to see the effect, that we have certain time constants not too short or long, and a reasonably well-behaved step response function.
    This sounds a bit trivial, as the model (atmosphere, GHGs, climate) seems obvious. But if we plunge into the statistics without referring to our model, and thus to the complete picture model + data + statistics, we easily lose orientation and create misunderstandings.

    • Yes, but a p<0.05 result may make you look for an underlying reason, or at least get some more data and see if the pattern continues.

    • There is some sense in what you are saying but you need to look at both sides of the coin. Statistically the present surface temp values are absolutely expected statistically in pretty much every canonical series. That is, THERE IS NO “HIATUS” in them. Yet how many physicists have spent how much time working on precisely why the present observations are occurring?

      The middle ground is to partition out what one can, true, including Pacific/Arctic/etc. oscillations and see what that does. But in a system as complex and (likely) chaotic as the climate of the planet, there will always be “hiatuses” some of which may not in fact need any model to explain them whatsoever except the “luck” of the particular starting value and particular evolutionary path.

      • “some of which may not in fact need any model to explain them whatsoever except the “luck” of the particular starting value and particular evolutionary path.”

        I was appreciating your comment up to that point. Is it chaotic or is it noise?

        [Response: Don’t underestimate the similarities. The outcome of a coin flip or a dice roll is chaotic — but it’s perfectly valid to treat as random.]

      • As tamino says.

        Also differentiating some chaotic systems from some random systems is not at all a trivial task.

        Basically I do think climate is deterministic, it’s just that I’m not at all sure it’s completely predictable over all ranges. In fact, I’m “skeptical” that it can ever be.

      • Actually, a coin flip is deterministic but the attractors are very small, which makes it very hard to predict. Dice rolling is chaotic.

      • No problem with *choosing* what is the signal and what is the noise, for whatever useful reason. I found your language (“no need for any model”) confusing, and wanted to be clear that’s what you were doing.

      • The signal is the part that you’re looking at for this year’s research project. The noise is either the stuff you figured out in last year’s research project, or the stuff you’ll start looking at for next year’s research project.

      • Re. M. Gardener: To my mind there is no need for a model to “explain” why a particular path occurred in a chaotic process any more than there is a need to “explain” why a particular random occurrence came up.

        There may be a need to “explain” what’s happening now, but (again to my mind) the focus for many appears to be focused on what is really an apparent rather than real point.

      • JGarland:

        I think this is semantics.
        First, I assume everyone agrees that we are not talking about quantum physics randomness?
        So, we *choose* to treat something as ‘random’…
        a) because it is just too hard to do otherwise
        b) because it isn’t relevant to testing our hypothesis.

        Where we have a misunderstanding is in you saying “To my mind there is no need for a model to “explain” why a particular path occurred in a chaotic process”. I think what you mean is “there is no need to explain a particular path”– as in choice (b).

        But that choice of exclusion depends in the first place on a model, doesn’t it? That is something of what I understood kinimod to be getting at in the first place, and with which I am sympathetic.

      • Last comment: When how a particular path evolves depends on the particular value of a parameter in the nth decimal place at the edge of measurability or beyond, and when the evolution will evolve differently at nearby points in that same decimal place, I see no particular need to seek an explanation for the why of that path. All I can do is follow a number of evolutions and look at the means, trends, and variations in them. That’s exactly how I’d treat a truly random phenomenon. That’s about the best I can put it semantically.

        I’ll leave the last word to you.

  18. How do you get the adjusted p value? From Monte carlo?

    [Response: Yes. The stats seem to be sensitive to sample size, and very sensitive to the character of the noise if it’s not white. So, before applying this I would recommend Monte Carlo for the specific conditions in order to define the stats.]