In the last post we looked at smoothing time series by focusing mainly, or exclusively, on local data, i.e., data which are nearby (in time) to the moment at which we’re trying to estimate the smooth. This is usually accomplished by having an observation window or weighting function which is centered at the moment of estimation, and we let this window “slide” through time to generate smoothed estimates at all times. We even showed that fitting a polynomial or a Fourier series globally (i.e. to all the data at once) gives results which are, for times not near the edges of the observation window, nearly the same.
But we also identified a problem with local fits, that if we insist the “smoothing window” be entirely (or even mostly) within the “observation window” (which does after all seem like a good idea) then, unlike with global fits, we don’t get higher uncertainty near the beginning and end of the observation window — because we don’t get any estimate at all. We lose a slice of time near the start, and near the end. We saw this with the simplest local smooth (the “moving average”), and it holds for others as well.
We also found that if we do allow the center of smoothing window to go all the way to the edge of time, we’ll get an estimate but. If there’s a nonzero trend at the edges, we’ll only be able to see it on one side of our “smoothing moment” and this causes a bias in the smooth. If we only see the “down” side we’ll get an artificially low estimate, if we only see the “up” side we’ll get an artificially high estimate. And the fact is, that data do show nonzero trend at the edges. Often.
Maybe there’s a way around that. If we want to compensate for the bias at the edges of time introduced by a nonzero trend, why not estimate what that trend is and compensate for it? And how would we estimate the trend? An obvious choice is by fitting a straight line to the data in the smoothing window by least squares.
We let the “smoothing window” slide through time, but now we let its center go all the way from the first observation moment to the last. If data on one side or the other are incomplete or missing, no problem. We fit a straight line to the data within the smoothing window and use it to estimate the smoothed value at the moment where we’ve centered our smoothing window. Advance the smoothing window to the next moment of interest, rinse, repeat. Within the heart of the observation window, when edge effects are not a problem, this will give us exactly the same result as a simple moving average if the data are evenly sampled in time. But very near the edge, we can get an actual estimate and it won’t include that nasty bias brought about by trend-at-the-end. And there’s another benefit: that if the data are not evenly sampled in time, we avoid many of the resultant complications.
Of course, we still lose precision near the edges because, let’s face it, there’s just not as much data and it’s all on one side of the edge! That’s just a fact of life — you can’t get as much precision if at some particular moment the available data are lopsided and there’s not as much of it.
That’s why, when Bill Cleveland (at Bell Labs) developed the LOWESS smooth (for “locally weighted scatterplot smoothing), he defined the “smoothing window” to include, not a fixed amount of time before and after the moment of interest, but a fixed number of data points which are nearest (in time) to the moment of interest. This means that the earliest and latest estimates are based on the same number of data. This solution isn’t perfect, because the uncertainty of the local regression is greater near its own edge. But let’s face it, we will never be able to estimate at the edge with as much precision as in the middle, because we only have data on one side of the moment of interest. But we can get to the edge of observed time, and the bias introduced by having only one “side” of the picture is greatly reduced.
The full LOWESS algorithm allows the use of higher-order polynomials (not just straight lines) for the “local fit,” but experience has shown that it’s best to keep the polynomial order low. In fact the “standard” implementation in R uses straight lines for the local fits (I prefer a 2nd-degree polynomial for local fit). It also allows for repeated application of the smoothing, at each iteration downweighting the data which (according to the preceding iteration) may be “outlier” data.
Maybe best of all, the LOWESS algorithm produces a nice “smooth” smooth because the local fit isn’t just a plain ordinary least-squares fit, it’s a weighted least-squares fit. Cleveland’s choice for weighting (and still the standard for LOWESS) was the “tricube function.” I prefer a different weighting function, for some rather complicated reasons that I won’t go into. I probably ought to write a paper about this.
The LOWESS smooth is not just logical enough to attract attention. Much experience has shown that it is just plain good. Damn good! It goes all the way to the edge of time (even beyond if you want, but that’s ill-advised), it reduces local error to near its minimum (for a given time scale of smoothing), and in the course of real-world application it has become one of the most popular smoothing methods.
There is one aspect of using straight lines for the local fits which I regard as a distinct advantage. At the very edge of time, if we use a 2nd-degree (quadratic) polynomial, we include in our smoothed estimate the estimated quadratic influence, i.e. the estimated acceleration. I regard this as insufficiently precise to be optimal. Using straight lines includes no endpoint acceleration, so is not vulnerable to this. Instead, the “first” and “last” estimates are from straight-line fits so they tend not to accelerate. Because of this the slope at the endpoints tends not to wiggle, a condition which is sometimes (in other methods) imposed as a constraint. In fact this is referred to as the “minimum roughness criterion” because we can view acceleration (local curvature) as the degree of “roughness” so we’re imposing minimum roughness at the edge.
Linear local fits, however, can underestimate the extreme excursions of the signal, not at the edges but in the middle! Not by much — really, not enough to worry about — but I decided that in my own implementation I would use an “almost quadratic” local fit. I actually use a quartic (4th-degree) polynomial fit, but it is constrained to have zero acceleration at the edges of the smoothing window. Because of the constraints, this fit function has the same number of degrees of freedom as a quadratic fit but it still conforms to the minimum-roughness criterion.
And here’s another advantage: we know how to calculate the uncertainty in a point estimate derived from a local least-squares fit, so we can compute the uncertainty associated with our smooth. (Bear in mind that this is the uncertainty of the smoothed value, not of the “instantaneous local” value.) In my own implementation I include this calculation, so I’m even able to put “error bars” on the smoothed values. I have yet to see this done by anyone else with LOWESS smoothing, which frankly surprises me. Maybe I ought to write a paper abou this.
But wait — there’s more!!! If we can use the value of a local fit to estimate the smoothed value at a particular moment, then why not use the slope from a local fit to estimate the slope at a particular moment? Why not, indeed? I’ve included that feature in my implementation as well, so my own personal smoothing program computes the estimated smoothed value, the uncertainty in that estimate, and the estimated slope, at each moment of interest. Like standard LOWESS, it defaults to using the times of observation as the “moments of interest” — but it also allows using any set of times you choose. I suppose I should get that paper out.
Theory is one thing, practice is sometimes another; let’s see it in action. Here’s some data: annual average temperature anomaly from 1910 through 2012 for Australia:
Clearly it’s been on the rise since the mid-1900s. But there’s also visual evidence of a downturn at the end. Can smoothing pick that up? Is it even real?
Let’s apply a “slow” smooth, one with a long time scale of about 30 years, which some might regard as “climate” (but let’s not get into that discussion now). I’ll calculate 3 smooths: a vanilla LOWESS smooth (in red, using the standard R implementation), a spline smooth (in blue), and moving averages (green):
Here’s a close-up on just the smooths:
The LOWESS and spline smooths are nearly identical. The moving averages follow them well, but with a good bit of jittering around — and not even very “smooth” jittering. Perhaps most important, the smooths (lowess and spline) go all the way to the edges of time.
There’s no sign here of a downturn at the end. But is that just because we’re smoothing so much that we smoothed it right out of existence? Let’s try a faster smooth, with a time scale of about 15 years:
And here are just the smooths:
Again, the lowess and spline smooths agree well and the moving averages follow them well (but not up to the edges). But now, there is a distinct downturn at the end. I can hear Anthony Watts proclaiming that pronounced cooling is imminent in Australia!
On a different topic entirely: if you’re reading this but not at its blog of origin (tamino.wordpress.com), and at the top is a line saying “Written by Amego Sando,” then you should be aware that that is a lie. This post was not written by Amego Sando, he’s just a thief who steals other people’s blog posts and re-posts them while claiming credit for authorship. Oh — and if you actually happen to be Amego Sando and you’re reading this: va te faire mettre.
Back to the subject: How can we tell whether the downturn at the end is really part of the “signal” or just an artifact of the “noise”? The right way is to isolate the episode in question and test it by other means than smoothing! But there is at least a way to get a clue from smoothing, if we can computer the uncertainty range of the smooth.
So, I applied my own version of the lowess algorithm. I computed fast and slow smooths using the same time scales as before, and here they are (fast in blue, slow in red) with the error ranges (plus or minus two times the standard error) shown as dashed lines:
It’s obvious that at the end, the slow smooth is still going up but the fast one is going down. However, they are still in agreement within their error ranges. In fact, the last half of the “slow” smooth is entirely within the error range of the “fast” smooth (and vice versa), which suggests that the fast smooth doesn’t really establish going in a new direction at the end, it’s just wiggling around due to noise. This isn’t proof by any means, and truly to answer the question (downturn or no?) we should use other methods than smoothing. In this particular case, those other methods happen to agree that the downturn is not established with statistical signficance.
The real moral of the story is that if you want to find a “fast” wiggle, you have to use a “fast” smooth (i.e., short time scale). Slow smooths just don’t allow enough “wiggle room” to get ’em! But those wiggles might be noise response rather than genuine trend, so don’t jump to conclusions without some further exploration.
It’s also clear that if we allow faster wiggles at the end, then we have to allow them in the middle too. Note that the “fast” smooth has a lot of wiggles which, in this particular case, don’t correspond to statistically singificant signal, they might be just noise. That’s not just possible, it’s inevitable — if you allow fast wiggles, you’re going to end up finding them almost everywhere, whether “real” or not.
I’d also like to mention a couple of other approaches to the endpoint problem which fall under the umbrella of “global” rather than local smoothing methods. One is something we’ve already mentioned, padding the data, then applying some global fit (Fourier series are popular when doing this). We can impose desired constraints by our choice of padded data values. If we pad the data with repeated zeroes we get the “minimum norm constraint,” which tends to keep the smoothed values near zero at the endpoints. If we pad by taking the original data and adding it to the end after “flipping” it in time (i.e., in reverse time order), we get the “minimum slope” constraint because the smooth will tend to have slope zero at the end. Finally, if we take the original data and flip it in both time and data, around the final data point, we get the “minimum roughness” constraint.
Another approach gaining popularity is to put more effort into choosing the functions we use for global regression. The method du jour is “singular spectrum analysis” (SSA), which identifies functions which are often “almost” cyclic but whose periods and peculiarities are a reflection of the data, not of some standard set of functions (like polynomials or trigonometric functions). This has been shown to choose the functions well, producing an excellent smooth all the way up to the edges.
It is even becoming popular to use SSA but to generate artificial noise for Monte Carlo simulations so we can evaluate the uncertainty of the smooth (a procedure sometimes called MC-SSA for “Monte Carlo Singluar Spectrum Analysis”). In fact this is one of the strong appeals of this method, that it does not assume a predetermined form for the smooth but still allows computing uncertainties.
I prefer my own implementation of the lowess smooth because I can compute uncertainties with it already. I’ve run a number of tests on data sets to which others have applied MC-SSA and found that the results are pretty much indistinguishable. That really shouldn’t surprise us. The data are what they are — the changes at any particular time scale are what they are — and if a smoothing method is doing its job well then it should “tell it like it is” — which isn’t going to change just because we use a different algorithm to estimate it.
One final note. For the puzzling question whether or not Australian temperature showed evidence of a downturn in progress, I used the data from 1910 through 2012 from Australia’s Bureau of Meteorology. But 2013 is now complete and we have another data point which we can use to test the idea. And here, circled in red, is that final data point added to the graph: