A previous post addressed some issues with linear regression, “linear” meaning we’re fitting a straight line to some data. Let’s devote another post to scrutinizing the issue — so this post is all about the math, readers who aren’t that interested can rest assured we’ll get back to climate science soon.
It was mentioned in a comment that least-squares regression is BLUE. In this acronym, “B” is for “best” meaning “least-variance” — but for practical purposes it means (among other things) that if a linear trend is present, we have a better chance to detect it with fewer data points using least-squares than with any other linear unbiased estimator. “U” is for “unbiased,” meaning that the line we expect to get is the true trend line. Both of these are highly desirable qualities.
Finally, “L” is for “linear,” which in this context has nothing to do with the fact that our model trend is a straight line. It means that the best-fit line we get is a linear function of the input data. Therefore if we’re fitting data x as a linear function of time t, and it happens that the data x are the sum of two other data sets a and b, then the best-fit line to x is the sum of the best-fit line to a and the best-fit line to b. In some (perhaps even many) contexts that is a remarkably useful property.
A Sea Ice Parable
Let’s make some artificial data — just for demonstration — monthly averages which we’ll call “Arctic sea ice extent.” Once upon a time, on a planet not-so-far-far-away, it looked like this:
We can compute anomaly as is commonly done, as the difference between each month’s value and the average for that same month throughout the entire time span, which looks like this:
Least-squares regression indicates an average trend of -14.4 thousand km^2/yr. We can also apply some robust regression methods, e.g. least-absolute-difference regression says -14.3, and the Theil-Sen estimate is -14.1. All three methods give similar, and good, results.
Now suppose the total sea ice data are the sum of sea ice in three major regions. Region 1 is the Arctic ocean, region 2 is the mid-latitudes, and region the third is Big Bay (which is actually connected to the ocean but is called a bay). Here’s the data for region 1:
And here are the anomalies:
Least-squares regression gives a trend rate for region 1 of -3.7 thousand km^2 per year, but the noise is obviously not following the normal distribution. We could instead try Theil-Sen or least-absolute-deviation regression, but when we do we find that they give an estimated trend of zero!
The problem (and it is indeed a problem) is that these methods are pretty much treating the nonzero values as outlier noise. They’re really taking a “majority view” of the trend, and the majority of the estimates based on point-pairs are zero.
Frankly, that’s clearly not right. During the early years there was little or no melting at all in the Arctic ocean, but recently there has been consistent, even extensive, Arctic ocean melt. This is most assuredly a trend, but the robust regression methods fail to detect it.
The mid-latitudes don’t show such behavior, instead their data look like this:
with anomalies like this:
All three methods once again give good estimates, least-squares indicating -9.7 thousand km^2/yr, least-absolute-deviation -10.3, and Theil-Sen -10.0.
When we look at Big Bay we notice the same problem as with the Arctic ocean, but in reverse:
Now we note that there was regular seasonal freezing during the early years, but it has declined so much that in recent years it’s usual to have no freezing at all. Here are the anomalies:
Least-squares regression reveals the declining trend, estimating it at -1.0 thousand km^2/yr. But once again both Theil-Sen and least-absolute-deviation regression indicate a trend rate of zero.
If we use a robust method to estimate the trend in each region, then add the trends together to get an estimate for the entire Arctic, then Theil-Sen gives -10.0 and least-absolute deviation -10.3 thousand km^2/yr. But these are in conflict with the estimates from these same methods for the Arctic as a whole. So, this is one of those cases where the robust regression methods have led us astray.
If, on the other hand, we estimate the trend in each region separately by least-squares regression, then add the regional trends together, we get exactly the same estimate which least-squares gives for the Arctic as a whole. And that’s good! It’s because least-squares is a linear estimate, which we remind the reader means that it is a linear function of the input data.
Yes this is an artificial example, but it’s not so different from the real situation with Arctic sea ice. It illustrates one of the palpable advantages of a linear estimator, and of all the linear estimators, least-squares regression can be called “best.”
Robust methods are terrific, and in some circumstance indispensable. We’ve also seen a case in another recent post of least-squares being far less than optimal for the analysis. But overall, I still stand by least-squares as the single best overall estimator of linear trend around. And one of the reasons is that it’s a linear estimate of linear trend.