From time to time it is suggested that ordinary least squares, a.k.a. “OLS,” is inappropriate for some particular trend analysis. Sometimes this is a “word to the wise” because OLS actually is inappropriate (or at least, inferior to other choices). Sometimes (in tamino’s humble opinion) this is because an individual has seen situations in which OLS performs poorly, and is sufficiently impressed by robust regression as a substitute, to form the faulty opinion that it’s superior to OLS generally. For the record, this comment is not one of those cases.
In reality, OLS is the workhorse of trend analysis and there are very good reasons for that. It’s founded on some very simple, and very common, assumptions about the data, and if those assumptions hold true, OLS is the best method for linear trend detection and estimation. It can be dangerous to use the word “best” in a statistical analysis, but in this case I feel justified in doing so.
Of course that raises some nontrivial questions. What are those assumptions? When might they not hold true? How could we tell? What should we do if we can establish that the OLS assumptions aren’t valid?
Let’s consider the case of linear regression, in which we have a set of data values taken at times , and we wish to fit a straight line to those data. A straight line has the form
where is the intercept of the line and is the slope. Any possible choices for and define a straight line, but for most such choices the line is obviously not a good model of the data. So, how do we choose “good” values for the slope and intercept? That’s where “regression” comes in.
We can always try lots and lots of values for the parameters, then choose the values that give us the “best” fit. That’s a terribly inefficient way to fit a line to data, but — it actually works! Yet it still requires a way to decide which possible line is “best.” And that can be very tricky indeed!
There is a sensible way — choose the line which gives the maximum likelihood of the observed data. We could even call this “maximum likelihood regression.” But now we have yet another problem: how do we compute the likelihood?
To do so, we have to have a model for the data, and that model must include the behavior of the noise, not just of the underlying signal. Let’s start with the simplest (or at least, most natural) assumption about the data: that the data really do follow a straight line, with deviations which are zero-mean i.i.d. and follow the normal distribution. In this case our model is
where is a normally distributed random variable. Note that implicit in this model is the assumption that all the values come from the same normal distribution, i.e., they all exhibit the same standard deviation , because we’ve insisted they are “i.i.d.” which stands for “independent identically-distributed.” We further assume that the mean of that distribution is zero. And the “independent” part means that different noise values have no dependence on each other.
Now we can compute the likelihood of our data, given a model — which means, given the slope , the intercept , and the standard deviation . If we know the values and , we can compute the values as
These values are called the residuals for our model, and if our model is correct then they’re equal to the true noise values. The probability density for any particular value is
The joint probability for all the noise values, often called the likelihood function, is the product of their individual probabilities
The maximum-likelihood solution is that which gives the maximum probability.
It’s generally easier to work with the log-likelihood, which is just the logarithm of the likelihood function
Maximizing the likelihood function is equivalent to maximizing the log-likelihood function. Furthermore, the log-likelihood involves a sum, which is easier to deal with in most contexts than a product.
If we find the values of and which maximize the log-likelihood, they will be the values which maximize the quantity
In fact maximizing this quantity with respect to and turns out not to depend on the value of . We need only choose the slope and intercept which maximize the quantity
This is the negative of the sum of the squares of the residuals. Maximizing a negative in the same as minimizing a positive, so the solution for the slope and intercept is that which minimizes the sum of squares of the residuals — hence the name “least squares.” Therefore the least-squares solution is the maximum-likelihood solution when the given assumptions hold true.
Now we can begin to appreciate when the ordinary least-squares (OLS) solution might not be optimal, or even appropriate. When the data violate the given assumptions, OLS is not the maximum-likelihood solution. It may not even be a particularly “good” solution (although it usually is).
What assumptions might be violated? The most often-considered case is when the model of the noise isn’t right. For example, we often consider trend analysis when the noise values aren’t independent. In fact temperature data (especially monthly data) exhibit autocorrelation, where values nearby in time are correlated with each other. If they’re correlated, they’re not independent — and the OLS solution is no longer maximum-likelihood. In such a case, the maximum-likelihood solution is given by generalized least squares (GLS). However, it’s usually a lot easier (and nearly as good) to use OLS but apply a correction when computing the uncertainties of the solution.
Another very common case is when the “normal distribution” assumption doesn’t hold. This is often indicated by the presence of outliers, which are noise values which are quite far from the rest of the data. There’s not universal agreement on exactly how to define an outlier, but sometimes they stick out like a sore thumb. Suppose, for instance, you were surveying the annual incomes of 100 people who were all members of “Oprah’s book club” in Chicago. You might find that 99 of the people had a variety of incomes which formed a nice smooth distribution (probably not the normal distribution, but at least reasonably smooth), but the 100th member was Oprah herself — who made fifty million dollars last year (for Oprah, that’s a slow year)! Her income is so different from that of all the other club members that it sticks out like a sore thumb — we would all agree it’s an outlier.
Extreme values — outliers — can have an extreme impact on the result of OLS. Therefore they can change the result dramatically, sometimes even invalidate it, all because the OLS assumptions aren’t correct, the noise does not follow the normal distribution. Identifying such cases can sometimes be tricky. After all, even noise which does follow the normal distribution can show extreme values! In such a case, those extreme values should be retained and OLS is the right way to go. But in some cases (like Oprah’s book club) the normal-distribution assumption is so obviously not right, and the OLS solution is so strongly affected by the outlier, that OLS isn’t just less than optimal, it’s downright untrustworthy.
Such behavior isn’t limited to regression. Suppose, for instance, we simply wanted to know the average income of members of the Chicago branch of Oprah’s book club. The mean (a.k.a. arithmetic average) of the 99% might be $50,000 per year, which wouldn’t set off any alarm bells. But when Oprah is included, the mean increases to over half a million dollars per person per year! That does set off alarm bells — unless we were talking about Mitt Romney’s country club, such a figure is hard to believe.
In such a case it is often useful to compute a statistic which is resistant to the presence of outliers. For the average, for instance, there is such a statistic called the median. It’s the value which falls right in the middle of the probability distribution, so there’s equal chance of a given sample being above or below that value. If the median income of the 99 non-Oprah members is $44,000 per year, then even if we include Oprah the median value may well still be $44,000 per year, and certainly won’t be anywhere near half a million! In this case the median gives a much more realistic portrayal of the typical income of members of the book club. And if the “average” is meant to convey what’s “typical,” then the median succeeds while the mean utterly fails. This is robust statistics in action. It works.
There are also methods for linear regression which are resistant to the presence of outliers, which fall into the category of robust regression. Probably the most common is to find the solution which minimizes the sum of the absolute values of the residuals rather than the sum of their squares. This is the method of least absolute deviations. It is highly resistant to outliers (as long as there aren’t many of them), and in fact the median is the least-absolute-deviation estimate of the center of a distribution (whereas the mean is the least-squared deviation estimate).
Such a strategy can be generalized by choosing some other function than the square (which gives rise to least-squares regression), or the absolute value (which gives rise to least-absolute-deviation regression), for which we will minimize its sum over all values of the residuals. Such functions are called M-estimators. M-estimators need not actually represent the log-likelihood function, although that was the originating idea and is why they’re called “M”-estimators. In fact both the square and the absolute value can be used as M-estimators, so this broad class includes both least-squares and least-absolute-deviation regression, although when least-squares is selected is doesn’t accomplish the usual purpose — to be resistant to outliers. I have occasionally used the Cauchy distribution as an M-estimator because it’s very close to the square when the residual is small, it suppresses outliers very well, and, well, because it’s interesting!
And I should mention non-parametric regression, which makes no assumption whatever about the distribution of the noise and requires no function to minimize. A popular example is the Theil-Sen estimator, which is the median of all the slopes computed from pairs of data points in the sample. It’s strongly resistant to outliers, to cases where the distribution of the noise is highly skewed, and it’s simple to compute. What could be better!
Incidentally, there are other drawbacks to robust regression which might impact whether it’s the appropriate choice for analysis. For one thing, there are situations in which the robust-regression solution is ambiguous. For another, robust regression (and least-absolute-deviation regression specifically) often requires a lot more computation than least-squares regression, for which the exact “best” solution can be directly computed.
Having indulged in a great deal of general discussion, I’ll finally discuss the context in which this recently came up — sea ice data. Data for the southern hemisphere from the Univ. of Bremen, extending back to 1973, indicate that the very early data include some rather extreme outliers. They’re not so extreme as to invalidate OLS — but they’re extreme enough that it would appear that some form of robust regression is definitely more appropriate. In fact, I don’t have access to that data, but judging visually from the graph I’d guess that it’s straightforward to show that the residuals do not follow the normal distribution and therefore the OLS assumptions do not apply. This definitely calls for robust regression! Or … does it?
Perhaps not. It’s true that the assumptions of OLS do not hold. But, neither do the assumptions behind robust regression. Robust regression is predicated on the model that the data follow a straight line, plus noise which is not normally distributed. If that were true, then the extreme early values would be the result of extreme noise values, but would still represent deviation from a straight-line underlying signal. But what if the underlying signal itself is not a straight line? What if those large deviations aren’t due to noise, but because the signal itself was higher in those earlier years? In that case, the outliers may not indicate non-normal noise, they may indicate that the signal really was higher in those early years, a fact which should reduce the estimated trend rate. But most robust regression methods will treat those values as outlier noise, and they will have negligible (if any!) impact on the estimated trend.
In such a case, which regression method you choose might actually depend on what you’re trying to estimate physically. Let me consider again the incomes of Oprah’s book clube, to make an analogy. If we want to know the “typical” income of a member, then Oprah’s outlier is pretty much irrelevant and should not have undue influence on the result — the median (a robust measure) is definitely appropriate. But if we’re more interested in the spending power of Oprah’s book club, then it would be a capital mistake to downplay the largest value. In that case, the mean — despite it’s non-robust nature and how strongly it’s affected by a single value — is more appropriate.
In the case of southern hemisphere sea ice data, estimates prior to 1973 (when the Univ. Bremen data begin) indicate that Antarctic sea ice really was more extensive prior to the satellite era, and the sharp decline indicated by the earliest Univ. Bremen data is part of the actual trend. In that case, if we’re using the Univ. Bremen data to look for the long-term trend, the I submit it may be more logical to apply OLS because it gives full weight to the earliest data, whose extreme values represent the signal, not the noise.
If, on the other hand, we’re more interested in just the recent trend — then those earliest values are evidently out of line compared to what followed. In that case, it might indeed be a better idea to use robust regression so they don’t have undue influence on the estimated trend.
It’s also good to keep in mind that sea ice data for the southern hemisphere since 1973 is demonstrably not following a straight line. In fact the northern hemisphere sea ice data show demonstrable departure from a straight line in the last decade. Treating those departures as “outliers” and substituting robust regression for OLS would be, in my opinion, a mistake. A vastly better approach is to treat the trend as nonlinear.
That doesn’t mean a straight line isn’t a useful model! And it doesn’t invalidate linear regression (OLS or robust) as a useful tool to measure the size of the trend. But it should not be forgotten that, because the signal itself is not linear, robust regression is not automatically a better choice for trend analysis of sea ice, southern or northern.