A reader recently inquired about using the Theil-Sen slope to estimate trends in temperature data, rather than the more usual least-squares regression. The Theil-Sen estimator is a non-parametric method to estimate a slope (perhaps more properly, a “distribution-free” method) which is robust, i.e., it is resistant to the presence of outliers (extremely deviant data values) which can wreak havoc with least-squares regression. It also doesn’t rely on the noise following the normal distribution; it’s truly a distribution-free method. Even when the data are normally distributed and outliers are absent, it’s still competitive with least-squares regression.
So — why not use Theil-Sen?
There is one catch. The noise in temperature data tends to exhibit autocorrelation, meaning that values close together in time are correlated with each other. We know how to compensate for this with least-squares regression, but I don’t know how it affects the Theil-Sen estimator. Probably somebody does (I’d be surprised if nobody has studied this question), but I haven’t seen their work.
Even so, my intuition is that autocorrelation will have about the same effect on the Theil-Sen estimator as on the least-squares estimator, and can be compensated for in the same way. So, I decided to run some tests on artificial data in order to investigate the issue.
First, let’s show just how good the Theil-Sen estimator is when there’s no autocorrelation. I generated 500 artificial time series of random noise following the normal distribution (without autocorrelation), then estimated the slope and the uncertainty in the slope by both least-squares (LS) and Theil-Sen (TS). Of course the “real” slope is zero because there’s no signal, just noise. Here’s the result:
Both methods gave almost the same result in every case. The standard deviation of the LS estimates was 0.0034, that of the TS estimates was 0.0035. The average estimated standard error for LS was 0.0034, for TS was 0.0035. Clearly, both methods work excellently and give nearly identical results.
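For readers who want to try this kind of experiment themselves, here is a minimal sketch in plain Python. It is not the code behind the numbers above. The series length (100 points at unit spacing), the unit noise variance, and the function names are all assumptions chosen for illustration. The Theil-Sen slope is computed as the median of the slopes between all pairs of points.

```python
import random
import statistics
from itertools import combinations


def ls_slope(t, y):
    """Ordinary least-squares slope of y against t."""
    t_mean = statistics.fmean(t)
    y_mean = statistics.fmean(y)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    return sxy / sxx


def theil_sen_slope(t, y):
    """Theil-Sen slope: the median of the slopes between all pairs of points."""
    slopes = [(y[j] - y[i]) / (t[j] - t[i])
              for i, j in combinations(range(len(t)), 2)
              if t[j] != t[i]]
    return statistics.median(slopes)


def simulate(n_series=500, n_points=100, seed=0):
    """Return LS and TS slope estimates for pure white-noise series.

    n_points=100 and unit-variance noise are assumptions, not values
    taken from the experiment described in the text.
    """
    rng = random.Random(seed)
    t = list(range(n_points))
    ls, ts = [], []
    for _ in range(n_series):
        y = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        ls.append(ls_slope(t, y))
        ts.append(theil_sen_slope(t, y))
    return ls, ts


if __name__ == "__main__":
    ls, ts = simulate()
    print("SD of LS slopes:", statistics.stdev(ls))
    print("SD of TS slopes:", statistics.stdev(ts))
```

Under these assumptions the theoretical standard deviation of the LS slope is 1/sqrt(n(n²-1)/12), roughly 0.0035 for n = 100, so the two printed values should both land near that figure.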
We can even look at the confidence limits for both methods. For TS, the confidence limits aren’t computed using the standard error, but using the distribution of the estimated slopes of each pair of data points. Here are the lower 95% confidence limits by both methods:
Here are the upper 95% confidence limits by both methods:
Again, both methods do an excellent job and they give nearly identical results. One would be hard pressed to give a compelling reason not to use Theil-Sen.
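One common way to get Theil-Sen confidence limits from the pairwise slopes is Sen's rank-based method: sort the slopes, then pick the ranks that bracket the median by an amount set by the variance of Kendall's S statistic. The sketch below assumes that method, distinct time values, and no tied data. It is not necessarily the exact procedure used for the plots above, and the function name is hypothetical. SciPy's `scipy.stats.theilslopes` provides a ready-made version for those who prefer a library.

```python
import math
import random
from itertools import combinations
from statistics import NormalDist, median


def theil_sen_with_limits(t, y, confidence=0.95):
    """Return (lower, slope, upper) using Sen's rank-based limits.

    Assumes all t values are distinct and there are no ties in y.
    """
    n = len(t)
    slopes = sorted((y[j] - y[i]) / (t[j] - t[i])
                    for i, j in combinations(range(n), 2))
    n_pairs = len(slopes)
    # Two-sided normal quantile, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Standard deviation of Kendall's S statistic when there are no ties.
    sigma_s = math.sqrt(n * (n - 1) * (2 * n + 5) / 18.0)
    # Zero-based ranks of the lower and upper limits, clamped to the list.
    lower_rank = max(int(round((n_pairs - z * sigma_s) / 2.0)) - 1, 0)
    upper_rank = min(int(round((n_pairs + z * sigma_s) / 2.0)), n_pairs - 1)
    return slopes[lower_rank], median(slopes), slopes[upper_rank]


if __name__ == "__main__":
    rng = random.Random(0)
    t = list(range(100))
    y = [rng.gauss(0.0, 1.0) for _ in t]  # white noise, true slope is zero
    lower, slope, upper = theil_sen_with_limits(t, y)
    print(f"TS slope {slope:+.4f}, 95% limits [{lower:+.4f}, {upper:+.4f}]")
```

Note that these limits assume independent noise, so they are subject to the same autocorrelation caveat as the least-squares ones.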
But what about autocorrelated noise? I created another 500 artificial time series following an ARMA(1,1) process, with its parameters set to mimic the noise in monthly global temperature data. Then I estimated the slope and its standard error by both methods, but did not compensate the LS estimate for autocorrelation. Here’s the result:
The standard deviation of the LS estimates was 0.0088, but the average standard error was only 0.0040. This is because the standard error estimates are wrong — they’re not compensated for autocorrelation.
The standard deviation of the TS estimates was 0.0088, but the average standard error was only 0.0041. Again, the standard error estimates are wrong because they’re not compensated for autocorrelation. The interesting point is that the autocorrelation seems to have the same effect on the TS slopes that it had on the LS slopes.
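The autocorrelated experiment can be sketched the same way. The ARMA(1,1) parameters below (phi = 0.8, theta = -0.3) are hypothetical placeholders, since the values actually used are not given here, and the series length and function names are likewise assumptions. The sketch compares the actual spread of the slope estimates with the naive least-squares standard error, which assumes uncorrelated residuals.

```python
import math
import random
import statistics
from itertools import combinations


def arma11(n, phi, theta, rng, burn_in=200):
    """Generate n values of x[t] = phi*x[t-1] + e[t] + theta*e[t-1]."""
    x_prev, e_prev = 0.0, 0.0
    out = []
    for k in range(n + burn_in):
        e = rng.gauss(0.0, 1.0)
        x = phi * x_prev + e + theta * e_prev
        if k >= burn_in:  # discard the start-up transient
            out.append(x)
        x_prev, e_prev = x, e
    return out


def ls_fit(t, y):
    """Return the LS slope and its naive standard error (no correction)."""
    n = len(t)
    t_mean, y_mean = statistics.fmean(t), statistics.fmean(y)
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    slope = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y)) / sxx
    resid = [yi - y_mean - slope * (ti - t_mean) for ti, yi in zip(t, y)]
    s2 = sum(r * r for r in resid) / (n - 2)
    return slope, math.sqrt(s2 / sxx)


def theil_sen_slope(t, y):
    """Median of the slopes between all pairs of points."""
    return statistics.median((y[j] - y[i]) / (t[j] - t[i])
                             for i, j in combinations(range(len(t)), 2))


def compare(n_series=500, n_points=100, phi=0.8, theta=-0.3, seed=0):
    """Spread of LS and TS slopes versus the mean naive LS standard error.

    phi, theta and n_points are hypothetical, illustrative values.
    """
    rng = random.Random(seed)
    t = list(range(n_points))
    ls, ts, naive = [], [], []
    for _ in range(n_series):
        y = arma11(n_points, phi, theta, rng)
        slope, se = ls_fit(t, y)
        ls.append(slope)
        naive.append(se)
        ts.append(theil_sen_slope(t, y))
    return {"ls_sd": statistics.stdev(ls),
            "ts_sd": statistics.stdev(ts),
            "ls_mean_naive_se": statistics.fmean(naive)}


if __name__ == "__main__":
    for name, value in compare().items():
        print(f"{name}: {value:.4f}")
```

With positively autocorrelated noise, the spread of the slopes should come out noticeably larger than the mean naive standard error for both estimators, which is the pattern described above.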
We can also look at the 95% lower and upper confidence limits, without any autocorrelation correction:
Again, the results are nearly identical. More to the point, if we apply the same correction to the TS slope as to the LS slope, we’ll get correct results. I conclude that there’s no good reason not to use TS for temperature data, but we must still correct for autocorrelation, and we do so in the same way as for LS.
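The correction itself is not spelled out here, so the following is only one plausible version. For ARMA(1,1) noise the autocorrelations decay as ρ_k = ρ_1 φ^(k-1), so the variance of a trend estimate is inflated, for long series, by roughly ν = 1 + 2ρ_1/(1 - φ), with φ estimated as ρ_2/ρ_1. The naive standard error is then multiplied by sqrt(ν). The function names below are hypothetical, and the sketch assumes the residuals from the fitted trend are supplied by the caller.

```python
import math
from statistics import fmean


def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    m = fmean(x)
    d = [v - m for v in x]
    denom = sum(v * v for v in d)
    return sum(d[i] * d[i + lag] for i in range(len(d) - lag)) / denom


def inflation_factor(rho1, rho2):
    """Variance inflation nu = 1 + 2*rho1/(1 - phi), with phi = rho2/rho1.

    Assumes ARMA(1,1)-like noise. Returns 1.0 (no inflation) when the
    lag-1 autocorrelation is not positive, which is a design choice of
    this sketch rather than part of the formula.
    """
    if rho1 <= 0.0:
        return 1.0
    phi = rho2 / rho1
    if phi >= 1.0:
        raise ValueError("estimated phi >= 1: noise model does not apply")
    return 1.0 + 2.0 * rho1 / (1.0 - phi)


def corrected_se(naive_se, residuals):
    """Scale a naive standard error by sqrt(nu) estimated from residuals."""
    rho1 = autocorr(residuals, 1)
    rho2 = autocorr(residuals, 2)
    return naive_se * math.sqrt(inflation_factor(rho1, rho2))


if __name__ == "__main__":
    # Example: lag-1 and lag-2 autocorrelations of 0.5 and 0.25 give nu = 3.
    print(inflation_factor(0.5, 0.25))
```

Because the factor depends only on the residuals, the same scaling can be applied to a standard error from either estimator, which is the sense in which the two are corrected "in the same way".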
Just for fun, I applied both LS and TS to annual average global temperature data since 1975 from NASA GISS. I used annual averages because they show much less autocorrelation than monthly averages (it’s much less, but it’s still there!), and I did not apply an autocorrelation correction. Here’s the result:
The estimates, and the estimated uncertainties, are indistinguishable. I also computed, by both methods, the slope for data from various start years through 2012:
Again the results are pretty much indistinguishable. It should be noted that the error bars are too small, because there is still some residual autocorrelation which is not compensated — but with annual averages, the uncompensated estimates at least get us in the ballpark.
Computing the slope from 1975 to various end years also gives similar results:
Finally, I computed the slopes for all 15-year time spans from 1975 to the present:
Once again the two methods give the same answers.
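A sliding-window comparison like this is easy to reproduce. The sketch below uses a synthetic placeholder series (a made-up trend plus white noise), not the NASA GISS data, and the function names are hypothetical. Substitute real annual averages to repeat the comparison.

```python
import random
from itertools import combinations
from statistics import fmean, median


def ls_slope(t, y):
    """Ordinary least-squares slope of y against t."""
    t_mean, y_mean = fmean(t), fmean(y)
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    return sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y)) / sxx


def theil_sen_slope(t, y):
    """Median of the slopes between all pairs of points."""
    return median((y[j] - y[i]) / (t[j] - t[i])
                  for i, j in combinations(range(len(t)), 2))


def window_trends(years, values, span=15):
    """Return (start_year, ls, ts) for each run of `span` consecutive points."""
    results = []
    for start in range(len(years) - span + 1):
        t = years[start:start + span]
        y = values[start:start + span]
        results.append((t[0], ls_slope(t, y), theil_sen_slope(t, y)))
    return results


if __name__ == "__main__":
    rng = random.Random(0)
    years = list(range(1975, 2013))
    # Synthetic placeholder, NOT real GISS data: made-up trend plus noise.
    values = [0.02 * (yr - 1975) + rng.gauss(0.0, 0.1) for yr in years]
    for start, ls, ts in window_trends(years, values):
        print(f"{start}-{start + 14}: LS {ls:+.4f}  TS {ts:+.4f}")
```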
All this reinforces the conclusion that there’s no reason not to use Theil-Sen for trend estimation of temperature data. There’s also no reason not to use least squares. Both methods require correction for autocorrelation, and both methods give reliable results.