In the last post I mentioned that when we have two different estimates, each with its own uncertainty range (note: I use the 95% confidence interval almost all the time, or to be precise the ±2σ range), the fact that their ranges overlap isn’t the proper statistical test for whether the estimates are significantly different. Somebody asked about that.

Just for a gut feeling: I know that when error ranges overlap, there are values that fall in the “plausible range” for both estimates, which suggests that the estimates may well be in agreement. But sometimes, those “plausible in both ranges” values are unlikely in both ranges. Unlikely isn’t so implausible, but unlikely for *both* is unlikely squared, and that’s too implausible to be plausible.

What follows is some of the math. It’s really quite simple, but I know that turns off some readers, while others want it. If you’re in the first group, feel free to skip it and enjoy the remainder of the day.

Let the two estimates be *x* and *y*, and suppose their “standard errors” (the 1σ errors) are σ_{x} and σ_{y}. If we were testing whether or not *x* (or *y*) is zero, we’d form a test statistic like *t* = *x*/σ_{x} (or *t* = *y*/σ_{y}). But here we’re interested in whether or not the *difference* is zero, with the difference being *d* = *x* – *y*. So what’s the “standard error” for *d*?

That turns out to be σ_{d} = √(σ_{x}^{2} + σ_{y}^{2}). Our test statistic will be *t* = *d*/σ_{d}, or in more detail, *t* = (*x* – *y*) / √(σ_{x}^{2} + σ_{y}^{2}).
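In code form, that arithmetic is only a couple of lines. Here’s a minimal sketch in Python (the numbers are hypothetical, purely for illustration):

```python
import math

def diff_test_stat(x, sigma_x, y, sigma_y):
    """Test statistic for the difference of two independent estimates.

    sigma_x and sigma_y are the 1-sigma standard errors of x and y.
    """
    sigma_d = math.sqrt(sigma_x**2 + sigma_y**2)  # standard error of d = x - y
    return (x - y) / sigma_d

# Hypothetical example: x = 1.0 with sigma_x = 0.4, y = 0.2 with sigma_y = 0.3
t = diff_test_stat(1.0, 0.4, 0.2, 0.3)
print(round(t, 2))  # 1.6
```

Note that this formula assumes the two estimates are independent; if they’re correlated, a covariance term enters σ_{d}.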

Statistically we’ll usually treat that as a *t*-test, and we’ll wonder about the number of “degrees of freedom” to use, which can be a sticky issue. But in many cases (including global temperature since 1979) the degrees of freedom is large enough (even allowing for autocorrelation) that we can safely treat it as large, in which case the *t* distribution approaches the normal distribution.
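In that large-degrees-of-freedom limit, the two-sided p-value comes straight from the normal distribution. A sketch using only the Python standard library (math.erfc gives the normal tail probability without needing an external stats package):

```python
import math

def two_sided_p_normal(t):
    """Two-sided p-value for test statistic t, treating t as standard normal
    (valid when the degrees of freedom are large)."""
    return math.erfc(abs(t) / math.sqrt(2))

# t = 2.12 (the worked example below) gives p just under 0.05
print(round(two_sided_p_normal(2.12), 3))  # 0.034
```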

But the important point is that two error ranges can overlap (at least in part) even when the estimates are significantly different. Suppose the two estimates have the same standard error, σ_{x} = σ_{y} = σ. Then σ_{d} = σ√2. If *d* = 3σ, their difference has a test statistic *t* = 3/√2 = 2.12, which indicates that yes, their difference is statistically significant, in spite of the fact that their error ranges overlap.
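That worked example is easy to check numerically. A sketch (with σ = 1 and hypothetical estimates x = 3, y = 0, so that d = 3σ) showing that the ±2σ intervals overlap even though the difference clears the significance threshold:

```python
import math

sigma = 1.0
x, y = 3.0, 0.0                        # hypothetical estimates; d = 3*sigma

# The ±2-sigma (roughly 95%) interval for each estimate
x_lo, x_hi = x - 2*sigma, x + 2*sigma  # (1.0, 5.0)
y_lo, y_hi = y - 2*sigma, y + 2*sigma  # (-2.0, 2.0)
overlap = x_lo < y_hi and y_lo < x_hi  # True: both intervals contain (1, 2)

# Test statistic for the difference, using sigma_d = sigma * sqrt(2)
t = (x - y) / (sigma * math.sqrt(2))   # 3/sqrt(2), about 2.12
significant = abs(t) > 2               # beyond the ±2-sigma cutoff

print(overlap, significant)  # True True
```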

I know, it’s pretty simple really. But it’s the kind of detail that non-statisticians (even the mathematically savvy) may not be aware of, and it’s a mistake I’ve seen made at many levels (names will not be named).

Thanks to kind readers for donations to the blog. If you’d like to help, please visit the donation link below.

This blog is made possible by readers like you; join others by donating at My Wee Dragon.

I first worked out the math you are talking about on an exam in grad school. The prof, R. G. Demaree, gave a terribly unfair (to my grad-student mind) midterm with 36 T-F questions, ALL of which were false but ALL of which appeared on the face of it to be true. Worst test I ever took. I answered T for about 5 of them. Got this one right at least.

The easiest was along the lines of: There can be more negative than positive correlations found in a correlation matrix? T/F

Bayesian estimation supersedes the t-test: Online calculator.

You don’t explicitly state how small t should be for it not to be statistically significant. Is it 2?

I am probably one of the guilty. When discussing climate data on blogs I have a quick rule of thumb: I tend to assume that two means are significantly different if their 95% confidence limits do not overlap.

[Response: If their limits do not overlap, then indeed the difference is significant. The confusion arises because sometimes it’s significant even when the intervals do overlap.]