A previous post addressed some issues with linear regression, “linear” meaning we’re fitting a straight line to some data. Let’s devote another post to scrutinizing the issue — so this post is all about the math, readers who aren’t that interested can rest assured we’ll get back to climate science soon.
It was mentioned in a comment that least-squares regression is BLUE. In this acronym, “B” is for “best” meaning “least-variance” — but for practical purposes it means (among other things) that if a linear trend is present, we have a better chance to detect it with fewer data points using least-squares than with any other linear unbiased estimator. “U” is for “unbiased,” meaning that the line we expect to get is the true trend line. Both of these are highly desirable qualities.
Finally, “L” is for “linear,” which in this context has nothing to do with the fact that our model trend is a straight line. It means that the best-fit line we get is a linear function of the input data. Therefore if we’re fitting data x as a linear function of time t, and it happens that the data x are the sum of two other data sets a and b, then the best-fit line to x is the sum of the best-fit line to a and the best-fit line to b. In some (perhaps even many) contexts that is a remarkably useful property.
A Sea Ice Parable
Let’s make some artificial data — just for demonstration — monthly averages which we’ll call “Arctic sea ice extent.” Once upon a time, on a planet not-so-far-far-away, it looked like this:
We can compute anomaly as is commonly done, as the difference between each month’s value and the average for that same month throughout the entire time span, which looks like this:
Least-squares regression indicates an average trend of -14.4 thousand km^2/yr. We can also apply some robust regression methods, e.g. least-absolute-difference regression says -14.3, and the Theil-Sen estimate is -14.1. All three methods give similar, and good, results.
Now suppose the total sea ice data are the sum of sea ice in three major regions. Region 1 is the Arctic ocean, region 2 is the mid-latitudes, and region the third is Big Bay (which is actually connected to the ocean but is called a bay). Here’s the data for region 1:
And here are the anomalies:
Least-squares regression gives a trend rate for region 1 of -3.7 thousand km^2 per year, but the noise is obviously not following the normal distribution. We could instead try Theil-Sen or least-absolute-deviation regression, but when we do we find that they give an estimated trend of zero!
The problem (and it is indeed a problem) is that these methods are pretty much treating the nonzero values as outlier noise. They’re really taking a “majority view” of the trend, and the majority of the estimates based on point-pairs are zero.
Frankly, that’s clearly not right. During the early years there was little or no melting at all in the Arctic ocean, but recently there has been consistent, even extensive, Arctic ocean melt. This is most assuredly a trend, but the robust regression methods fail to detect it.
The mid-latitudes don’t show such behavior, instead their data look like this:
with anomalies like this:
All three methods once again give good estimates, least-squares indicating -9.7 thousand km^2/yr, least-absolute-deviation -10.3, and Theil-Sen -10.0.
When we look at Big Bay we notice the same problem as with the Arctic ocean, but in reverse:
Now we note that there was regular seasonal freezing during the early years, but it has declined so much that in recent years it’s usual to have no freezing at all. Here are the anomalies:
Least-squares regression reveals the declining trend, estimating it at -1.0 thousand km^2/yr. But once again both Theil-Sen and least-absolute-deviation regression indicate a trend rate of zero.
If we use a robust method to estimate the trend in each region, then add the trends together to get an estimate for the entire Arctic, then Theil-Sen gives -10.0 and least-absolute deviation -10.3 thousand km^2/yr. But these are in conflict with the estimates from these same methods for the Arctic as a whole. So, this is one of those cases where the robust regression methods have led us astray.
If, on the other hand, we estimate the trend in each region separately by least-squares regression, then add the regional trends together, we get exactly the same estimate which least-squares gives for the Arctic as a whole. And that’s good! It’s because least-squares is a linear estimate, which we remind the reader means that it is a linear function of the input data.
Yes this is an artificial example, but it’s not so different from the real situation with Arctic sea ice. It illustrates one of the palpable advantages of a linear estimator, and of all the linear estimators, least-squares regression can be called “best.”
Robust methods are terrific, and in some circumstance indispensable. We’ve also seen a case in another recent post of least-squares being far less than optimal for the analysis. But overall, I still stand by least-squares as the single best overall estimator of linear trend around. And one of the reasons is that it’s a linear estimate of linear trend.
Ed Deming tested thousands of systems. He said,
“The behaviour of dynamic feed back systems as they seek a new equlibrium is always non linear.”
That was as true for our industrial chemical reactors as it is for the Arctic Sea ice or the Greenland Ice sheet.
The Arctic Sea Ice is a dynamic feedback system that is out of equlibrium.
We will see very non-linear behaviour in the Arctic Sea Ice System.
Me thinks that Tamino was referring to the the Principle of Linear Superposition…. a somewhat different beast.
As opposed to a step change.
The problem with using OLS on Time Series data is that there is bound to be some autocorrelation present. (to use your example, that the period change in the sea ice is correlated with the previous periods change in sea ice). As these lag effects are not explicitly put into the regression in a standard OLS regression (unlike in AR models), they will end up being located in the error term of the OLS regression. They will be, in effect omitted variables. The presence of OVB violates one of the main assumptions of OLS (that the error term be uncorrelated with the regressors).
No BLUE with this approach.
[Response: Generalized least squares is BEST for autocorrelated noise. However, OLS is still unbiased, and uncertainties of the parameters can be corrected for autocorrelation in a straightforward way. So OLS may not be BEST, but it’s still damn good.]
While not perhaps Tamino’s overall point, I take this exercise in part to demonstrate that if your statistical synthesis suggests the ice is not melting, and it is, you need to re-check your synthesis. The good thing about physical data is you can always go out and see if the ice is melting.
Horatio was under the impression that if reality conflicted with your statistical test, it was just an indication that you should look for an alternate reality.
Isn’t that what the 60’s were all about?
If you have a regression with autocorrelated residuals, what’s an easy way to correct for it? (Note, I didn’t say an efficient way or the best way). So far I’ve been using Cochrane-Orcutt iteration, but that’s kind of involved.
[Response: See the appendix of Foster & Rahmstorf 2011.]
Where would I find that…?
Just to reiterate a point that Tamino makes clearly, but that seems to cause slight confusion … “linear” in the statistical sense used here means that the regression model is a linear function of the data, whether or not the model itself describes a straight line.
In this sense, robust regression is nonlinear even when it’s used to fit a perfectly straight line. And OLS is still linear even when it’s used to fit a curve, such as quadratic or polynomial models.
Elaborating on the OLS is BLUE theme: if errors are iid *and* normally distributed, then OLS is BUE: the best (lowest variance) unbiased estimator whether linear or not. But if errors are not normal then OLS is merely BLUE, and some nonlinear estimator might be better. Like robust regression.
Although it still won’t have the “sum of the means equals the mean of the sums” property demonstrated nicely above.
“linear” in the statistical sense used here means that the regression model is a linear function of the data, whether or not the model itself describes a straight line.
in other words for a function L to be a linear function of the data it means that for constants a and b L(ax +by) = aL(x) + bL(y) where x and y are the data.