Smooth 3

In the last post we looked at smoothing time series by focusing mainly, or exclusively, on local data, i.e., data which are nearby (in time) to the moment at which we’re trying to estimate the smooth. This is usually accomplished by having an observation window or weighting function which is centered at the moment of estimation, and we let this window “slide” through time to generate smoothed estimates at all times. We even showed that fitting a polynomial or a Fourier series globally (i.e. to all the data at once) gives results which are, for times not near the edges of the observation window, nearly the same.


But we also identified a problem with local fits: if we insist the “smoothing window” be entirely (or even mostly) within the “observation window” (which does, after all, seem like a good idea), then, unlike with global fits, we don’t merely get higher uncertainty near the beginning and end of the observation window; we don’t get any estimate at all. We lose a slice of time near the start, and near the end. We saw this with the simplest local smooth (the “moving average”), and it holds for others as well.

We also found that if we do allow the center of the smoothing window to go all the way to the edge of time, we’ll get an estimate, but if there’s a nonzero trend at the edges we’ll only be able to see it on one side of our “smoothing moment,” and this causes a bias in the smooth. If we only see the “down” side we’ll get an artificially low estimate; if we only see the “up” side we’ll get an artificially high estimate. And the fact is, data do show nonzero trend at the edges. Often.

Maybe there’s a way around that. If we want to compensate for the bias at the edges of time introduced by a nonzero trend, why not estimate what that trend is and compensate for it? And how would we estimate the trend? An obvious choice is to fit a straight line, by least squares, to the data in the smoothing window.

We let the “smoothing window” slide through time, but now we let its center go all the way from the first observation moment to the last. If data on one side or the other are incomplete or missing, no problem. We fit a straight line to the data within the smoothing window and use it to estimate the smoothed value at the moment where we’ve centered our smoothing window. Advance the smoothing window to the next moment of interest, rinse, repeat. Within the heart of the observation window, when edge effects are not a problem, this gives exactly the same result as a simple moving average if the data are evenly sampled in time. But very near the edge, we can get an actual estimate, and it won’t include that nasty bias brought about by trend-at-the-end. And there’s another benefit: if the data are not evenly sampled in time, we avoid many of the resulting complications.
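
As a rough sketch of the idea in R (just an illustration added for concreteness, not anyone’s production code; it assumes the data are in vectors t and y, and “half” is half the width of the smoothing window):

    # Sketch only: local straight-line smoothing with a fixed-width window whose
    # center is allowed to run all the way to the first and last observations.
    local_linear_smooth <- function(t, y, half) {
      sapply(t, function(t0) {
        w <- abs(t - t0) <= half                        # points inside the smoothing window
        fit <- lm(yy ~ tt, data = data.frame(tt = t[w], yy = y[w]))
        predict(fit, newdata = data.frame(tt = t0))     # smoothed value at the window center
      })
    }

Near the edges the window simply contains fewer points, all on one side, but the straight-line fit still returns an estimate at the center.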

Of course, we still lose precision near the edges because, let’s face it, there’s just not as much data and it’s all on one side of the moment we’re estimating! That’s just a fact of life: you can’t get as much precision if, at some particular moment, the available data are lopsided and there’s not as much of it.

That’s why, when Bill Cleveland (at Bell Labs) developed the LOWESS smooth (for “locally weighted scatterplot smoothing”), he defined the “smoothing window” to include, not a fixed amount of time before and after the moment of interest, but a fixed number of data points which are nearest (in time) to the moment of interest. This means that the earliest and latest estimates are based on the same number of data points. This solution isn’t perfect, because the uncertainty of the local regression is greater near its own edge. But let’s face it, we will never be able to estimate at the edge with as much precision as in the middle, because we only have data on one side of the moment of interest. Still, we can get to the edge of observed time, and the bias introduced by having only one “side” of the picture is greatly reduced.

The full LOWESS algorithm allows the use of higher-order polynomials (not just straight lines) for the “local fit,” but experience has shown that it’s best to keep the polynomial order low. In fact the “standard” implementation in R uses straight lines for the local fits (I prefer a 2nd-degree polynomial for the local fit). It also allows for repeated application of the smoothing, at each iteration downweighting the data which (according to the preceding iteration) may be “outlier” data.

Maybe best of all, the LOWESS algorithm produces a nice “smooth” smooth because the local fit isn’t just a plain ordinary least-squares fit, it’s a weighted least-squares fit. Cleveland’s choice for weighting (and still the standard for LOWESS) was the “tricube function.” I prefer a different weighting function, for some rather complicated reasons that I won’t go into. I probably ought to write a paper about this.
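
To make the idea concrete, here is a rough sketch (in R) of a single LOWESS step: take the k data points nearest in time to the moment of interest, weight them with the tricube function, and fit a straight line by weighted least squares. It’s only an illustration; the robustness iterations are omitted, and base R’s lowess() does the whole job.

    # Sketch of one LOWESS estimate at a single moment t0 (no robustness iterations).
    lowess_point <- function(t, y, t0, k) {
      d   <- abs(t - t0)
      idx <- order(d)[1:k]                    # the k points nearest (in time) to t0
      u   <- d[idx] / max(d[idx])             # distances scaled to [0, 1]
      w   <- (1 - u^3)^3                      # tricube weight function
      fit <- lm(yy ~ tt, data = data.frame(tt = t[idx], yy = y[idx]), weights = w)
      predict(fit, newdata = data.frame(tt = t0))
    }

    # The built-in version does this at every point, plus the robustness steps:
    # f is the fraction of points in each window, iter the number of iterations.
    # lowess(t, y, f = k / length(y), iter = 3)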

The LOWESS smooth isn’t just logical enough to attract attention: much experience has shown that it is just plain good. Damn good! It goes all the way to the edge of time (even beyond if you want, but that’s ill-advised), it reduces local error to near its minimum (for a given time scale of smoothing), and in the course of real-world application it has become one of the most popular smoothing methods.

There is one aspect of using straight lines for the local fits which I regard as a distinct advantage. At the very edge of time, if we use a 2nd-degree (quadratic) polynomial, we include in our smoothed estimate the estimated quadratic influence, i.e. the estimated acceleration, and at the very edge that acceleration estimate is too imprecise to be worth much. Straight-line fits include no endpoint acceleration, so they aren’t vulnerable to this. Because of this the slope at the endpoints tends not to wiggle, a condition which is sometimes (in other methods) imposed as a constraint. In fact this is referred to as the “minimum roughness criterion,” because we can view acceleration (local curvature) as the degree of “roughness,” so we’re imposing minimum roughness at the edge.

Linear local fits, however, can underestimate the extreme excursions of the signal, not at the edges but in the middle! Not by much — really, not enough to worry about — but I decided that in my own implementation I would use an “almost quadratic” local fit. I actually use a quartic (4th-degree) polynomial fit, but it is constrained to have zero acceleration at the edges of the smoothing window. Because of the constraints, this fit function has the same number of degrees of freedom as a quadratic fit but it still conforms to the minimum-roughness criterion.
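
For the record, here is one way such a constrained quartic could be set up (a sketch based on the description above, not the actual implementation). Requiring zero second derivative at the window edges, at distance h from the center, forces the cubic coefficient to vanish and ties the quadratic coefficient to the quartic one, leaving a three-parameter fit with basis functions 1, t, and t^4 - 6*h^2*t^2:

    # Sketch: "almost quadratic" local fit -- a quartic constrained to have zero
    # second derivative (zero acceleration) at the edges of the smoothing window.
    # With time centered on t0 and half-width h, imposing p''(-h) = p''(h) = 0
    # reduces the quartic to the form a0 + a1*t + a4*(t^4 - 6*h^2*t^2).
    constrained_quartic_value <- function(t, y, t0, h) {
      w   <- abs(t - t0) <= h
      tt  <- t[w] - t0                         # center time on the moment of interest
      q   <- tt^4 - 6 * h^2 * tt^2             # the constrained quartic basis function
      fit <- lm(y[w] ~ tt + q)                 # plain OLS here; a real version would also weight
      unname(coef(fit)[1])                     # at tt = 0 both tt and q vanish,
    }                                          # so the intercept is the smoothed value at t0

This has the same number of free parameters as a quadratic fit (three), which matches the degrees-of-freedom claim above.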

And here’s another advantage: we know how to calculate the uncertainty in a point estimate derived from a local least-squares fit, so we can compute the uncertainty associated with our smooth. (Bear in mind that this is the uncertainty of the smoothed value, not of the “instantaneous local” value.) In my own implementation I include this calculation, so I’m even able to put “error bars” on the smoothed values. I have yet to see this done by anyone else with LOWESS smoothing, which frankly surprises me. Maybe I ought to write a paper about this.

But wait — there’s more!!! If we can use the value of a local fit to estimate the smoothed value at a particular moment, then why not use the slope from a local fit to estimate the slope at a particular moment? Why not, indeed? I’ve included that feature in my implementation as well, so my own personal smoothing program computes the estimated smoothed value, the uncertainty in that estimate, and the estimated slope, at each moment of interest. Like standard LOWESS, it defaults to using the times of observation as the “moments of interest” — but it also allows using any set of times you choose. I suppose I should get that paper out.
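
Here’s a sketch of how all three quantities could come out of a single local weighted fit in R. It’s only an illustration, not the program described above, and the standard errors reported by summary() are only an approximation to the uncertainty of the smooth, since they take the weights at face value.

    # Sketch: smoothed value, its standard error, and the local slope (with its
    # standard error) from one locally weighted straight-line fit at moment t0.
    local_fit_summary <- function(t, y, t0, k) {
      d   <- abs(t - t0)
      idx <- order(d)[1:k]
      u   <- d[idx] / max(d[idx])
      w   <- (1 - u^3)^3                                   # tricube weights, for concreteness
      fit <- lm(yy ~ tt, data = data.frame(tt = t[idx] - t0, yy = y[idx]), weights = w)
      cf  <- summary(fit)$coefficients
      c(value    = cf["(Intercept)", "Estimate"],          # smoothed value at t0
        value_se = cf["(Intercept)", "Std. Error"],        # its standard error
        slope    = cf["tt", "Estimate"],                    # estimated slope at t0
        slope_se = cf["tt", "Std. Error"])                  # uncertainty of that slope
    }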

Theory is one thing, practice is sometimes another; let’s see it in action. Here’s some data: annual average temperature anomaly from 1910 through 2012 for Australia:

[Figure oz1: annual average temperature anomaly for Australia, 1910-2012]

Clearly it’s been on the rise since the mid-1900s. But there’s also visual evidence of a downturn at the end. Can smoothing pick that up? Is it even real?

Let’s apply a “slow” smooth, one with a long time scale of about 30 years, which some might regard as “climate” (but let’s not get into that discussion now). I’ll calculate 3 smooths: a vanilla LOWESS smooth (in red, using the standard R implementation), a spline smooth (in blue), and moving averages (green):

[Figure oz_slow: the data with the three ~30-year smooths: LOWESS (red), spline (blue), moving averages (green)]

Here’s a close-up on just the smooths:

[Figure oz_slow2: close-up of the three slow smooths]

The LOWESS and spline smooths are nearly identical. The moving averages follow them well, but with a good bit of jittering around — and not even very “smooth” jittering. Perhaps most important, the smooths (lowess and spline) go all the way to the edges of time.
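
For anyone who wants to try something similar, smooths of this kind can be produced with nothing but base R. This is a sketch: the variable names year and anom are assumptions, and matching a particular “time scale” with f or spar is only approximate.

    n  <- length(anom)
    lo <- lowess(year, anom, f = 30 / n)                    # LOWESS: roughly 30 points (years) per window
    sp <- smooth.spline(year, anom)                         # spline; adjust spar or df to taste
    ma <- stats::filter(anom, rep(1 / 30, 30), sides = 2)   # centered 30-year moving average (NA at the ends)

    plot(year, anom, type = "l", col = "grey")
    lines(lo, col = "red")
    lines(sp$x, sp$y, col = "blue")
    lines(year, ma, col = "green")

The faster smooth is the same thing with a smaller window, e.g. lowess(year, anom, f = 15 / n).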

There’s no sign here of a downturn at the end. But is that just because we’re smoothing so much that we smoothed it right out of existence? Let’s try a faster smooth, with a time scale of about 15 years:

[Figure oz_fast: the data with the ~15-year smooths]

And here are just the smooths:

[Figure oz_fast2: close-up of the fast smooths]

Again, the lowess and spline smooths agree well and the moving averages follow them well (but not up to the edges). But now, there is a distinct downturn at the end. I can hear Anthony Watts proclaiming that pronounced cooling is imminent in Australia!

On a different topic entirely: if you’re reading this but not at its blog of origin (tamino.wordpress.com), and at the top is a line saying “Written by Amego Sando,” then you should be aware that that is a lie. This post was not written by Amego Sando, he’s just a thief who steals other people’s blog posts and re-posts them while claiming credit for authorship. Oh, and if you actually happen to be Amego Sando and you’re reading this: go screw yourself.

Back to the subject: How can we tell whether the downturn at the end is really part of the “signal” or just an artifact of the “noise”? The right way is to isolate the episode in question and test it by other means than smoothing! But there is at least a way to get a clue from smoothing, if we can compute the uncertainty range of the smooth.

So, I applied my own version of the lowess algorithm. I computed fast and slow smooths using the same time scales as before, and here they are (fast in blue, slow in red) with the error ranges (plus or minus two times the standard error) shown as dashed lines:

[Figure oz_mylow: fast (blue) and slow (red) smooths with plus-or-minus two standard errors shown as dashed lines]

It’s obvious that at the end, the slow smooth is still going up but the fast one is going down. However, they still agree within their error ranges. In fact, the last half of the “slow” smooth lies entirely within the error range of the “fast” smooth (and vice versa), which suggests that the fast smooth doesn’t really establish a new direction at the end; it’s just wiggling around due to noise. This isn’t proof by any means, and to truly answer the question (downturn or no?) we should use other methods than smoothing. In this particular case, those other methods happen to agree that the downturn is not established with statistical significance.
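
If a smoothing routine returns both the smoothed values and their standard errors, drawing error ranges like the dashed lines above is straightforward. In this sketch, my_lowess() is a hypothetical stand-in for such a routine (it is not an actual R function), assumed to return the fit and its standard error:

    sm <- my_lowess(year, anom)                              # hypothetical: returns $fit and $se
    lines(year, sm$fit, col = "blue")                        # the smooth itself
    lines(year, sm$fit + 2 * sm$se, col = "blue", lty = 2)   # upper error range (+2 standard errors)
    lines(year, sm$fit - 2 * sm$se, col = "blue", lty = 2)   # lower error range (-2 standard errors)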

The real moral of the story is that if you want to find a “fast” wiggle, you have to use a “fast” smooth (i.e., short time scale). Slow smooths just don’t allow enough “wiggle room” to get ’em! But those wiggles might be noise response rather than genuine trend, so don’t jump to conclusions without some further exploration.

It’s also clear that if we allow faster wiggles at the end, then we have to allow them in the middle too. Note that the “fast” smooth has a lot of wiggles which, in this particular case, don’t correspond to statistically significant signal; they might be just noise. That’s not just possible, it’s inevitable: if you allow fast wiggles, you’re going to end up finding them almost everywhere, whether “real” or not.

I’d also like to mention a couple of other approaches to the endpoint problem which fall under the umbrella of “global” rather than local smoothing methods. One is something we’ve already mentioned: padding the data, then applying some global fit (Fourier series are popular when doing this). We can impose desired constraints by our choice of padded data values. If we pad the data with repeated zeroes we get the “minimum norm” constraint, which tends to keep the smoothed values near zero at the endpoints. If we pad by taking the original data and adding it to the end after “flipping” it in time (i.e., in reverse time order), we get the “minimum slope” constraint, because the smooth will tend to have slope zero at the end. Finally, if we take the original data and flip it in both time and value, around the final data point, we get the “minimum roughness” constraint, because the smooth’s curvature will then tend toward zero at the end.
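
The three padding schemes are easy to write down explicitly. As a sketch (padding only the end of a series y of length n; the start is handled the same way in mirror image, and pad_len is an arbitrary choice):

    n        <- length(y)
    pad_len  <- 20                                        # how far to pad past the end (arbitrary)
    tail_rev <- rev(y)[2:(pad_len + 1)]                   # y[n-1], y[n-2], ... (the tail, time-reversed)

    pad_zero  <- c(y, rep(0, pad_len))                    # "minimum norm": pad with zeroes
    pad_slope <- c(y, tail_rev)                           # "minimum slope": flip in time only
    pad_rough <- c(y, 2 * y[n] - tail_rev)                # "minimum roughness": flip in time and value
                                                          #   about the final data point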

Another approach gaining popularity is to put more effort into choosing the functions we use for global regression. The method du jour is “singular spectrum analysis” (SSA), which identifies functions which are often “almost” cyclic but whose periods and peculiarities are a reflection of the data, not of some standard set of functions (like polynomials or trigonometric functions). This has been shown to choose the functions well, producing an excellent smooth all the way up to the edges.
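
For the curious, here is a bare-bones sketch of the basic SSA recipe (embed, decompose, reconstruct). It’s only an illustration of the idea, not any particular published implementation; the window length L and the number of components k are choices the analyst has to make.

    # Minimal SSA smooth: build the trajectory matrix, take its SVD, reconstruct
    # from the leading k components, then average along the anti-diagonals.
    ssa_smooth <- function(y, L, k) {
      N <- length(y)
      K <- N - L + 1
      X <- sapply(1:K, function(j) y[j:(j + L - 1)])     # L x K trajectory (Hankel) matrix
      s <- svd(X)
      Xk <- s$u[, 1:k, drop = FALSE] %*%
            diag(s$d[1:k], k, k) %*%
            t(s$v[, 1:k, drop = FALSE])                  # rank-k reconstruction
      rec <- numeric(N)                                  # diagonal averaging back to a series
      cnt <- numeric(N)
      for (j in 1:K) {
        idx <- j:(j + L - 1)
        rec[idx] <- rec[idx] + Xk[, j]
        cnt[idx] <- cnt[idx] + 1
      }
      rec / cnt
    }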

It is even becoming popular to combine SSA with artificial noise generated for Monte Carlo simulations, so we can evaluate the uncertainty of the smooth (a procedure sometimes called MC-SSA, for “Monte Carlo Singular Spectrum Analysis”). In fact one of the strong appeals of this method is that it does not assume a predetermined form for the smooth but still allows computing uncertainties.

I prefer my own implementation of the lowess smooth because I can compute uncertainties with it already. I’ve run a number of tests on data sets to which others have applied MC-SSA and found that the results are pretty much indistinguishable. That really shouldn’t surprise us. The data are what they are — the changes at any particular time scale are what they are — and if a smoothing method is doing its job well then it should “tell it like it is” — which isn’t going to change just because we use a different algorithm to estimate it.

One final note. For the puzzling question of whether or not Australian temperature showed evidence of a downturn in progress, I used the data from 1910 through 2012 from Australia’s Bureau of Meteorology. But 2013 is now complete, and we have another data point which we can use to test the idea. And here, circled in red, is that final data point added to the graph:

[Figure oz2013: the same graph with the 2013 value added, circled in red]

22 responses to “Smooth 3”

  1. Brilliant. I like the last picture most.

  2. Thanks Tamino for a great explanation of smoothing. I did wonder when reading Smooth 1 why you were only talking about global functions like polynomials and Fourier series, now I see waiting was a good idea. I get the impression that you really like the LOWESS smooth from this post and many previous posts, now I see why.

    I think I have spotted some typos in your post, “Maybe I ought to write a paper abou this”, should be about.

    “if we can computer the uncertainty range of the smooth.” should probably be compute.

  3. It was interesting to read: “Note that the “fast” smooth has a lot of wiggles which, in this particular case, don’t correspond to statistically significant signal; they might be just noise.”
    I suppose to explain that statement we’d have to define ‘Temperature Anomaly- Australia’ with a bit more precision.
    We have a measurement for this quantity which is subject to measurement error. Without that error it would be equal to the actual annual averages of Australia’s max/min daily temperatures for the calendar year. But that real value will be ‘lumpy’. (If nothing else, rainfall can be shown easily to be ‘lumpy’, and so too can cloudiness. Therefore temperature, which is very dependent on the likes of cloudiness & raininess, will naturally be ‘lumpy’.)
    So the ‘signal’ as stated is thus (I assume) an artificial ‘climatic temperature anomaly,’ an underlying value that remains after variations in ‘weather’ and the ‘lumpy’ clouds and rain are factored out of the analysis.
    And so I would have thought that, as the analysis seeks to understand an artificial (but useful) construct, what is wiggle and what is signal is open to discussion. The wiggle c1950 dipping below the slow smoothed confidence limits will be within the analysis & not just noise if wiggles of that size/duration are of concern to us. Also the existence of that wiggle c1950 perhaps (?) suggests the apparent dip in the fast smoothing to 2012 is itself not anomalous within a long-term rising trend.

  4. Guess who plagiarized your post in its entirety in his …/Smooth-3/ post?

    [Response: The same guy who now has posted, under his own name, an unambiguous statement that he is a thief who steals blog posts and claims authorship. Next time, it’ll be in bold face type.]

  5. Google hasn’t improved their answer on plagiarism since 2009:
    http://productforums.google.com/forum/#!forum/websearch/websearch/dzov-2-fpro

    It reminds me of cellphone companies not willing to turn off stolen phones.

  6. that should be:
    http://productforums.google.com/forum/#!topic/websearch/4vwvzWSjM6g

    for Google’s answer on what to do when someone’s copied your website. It amounts to “not our problem, if you don’t like it, set the DMCA on ’em”

  7. even better, here’s Google’s Search guy.
    http://searchengineland.com/googles-matt-cutts-25-30-of-the-webs-content-is-duplicate-content-thats-okay-180063

    Seems to amount to “Google shows the most valuable* site, when multiple copies are found, no matter who the original author is.” And value, to Google, means it generates income for them.

    Damn shame.

    I wonder if putting the author’s name and link explicitly, somewhere in every main post would help — at least if the scraper deleted that from every scraped copy it’d be evidence for a DMCA complaint.

  8. Excellent post, I’ve never understood how Lowess worked until now.
    Australia is very susceptible to ENSO which is why 2010-2012 were relatively ‘cool’. It is scary how hot last year was given this was only a shift to a neutral phase. 2014 is shaping up as an exceptionally hot year as well.

  9. “Damn shame.”

    How else would you find duplicates of your work if you couldn’t use a search engine to do so?

    It should be possible to get this person kicked off their ISP if their ISP is reputable (which is probably not the case).

  10. I don’t say this for every post, but I should. Thanks, Tamino. I learn something from every one of your posts. You have an extraordinary gift for translating techno-speak into language we non-math-minded people can understand. This series of posts is especially good! Now I know what a Lowess smooth is, when to use it and why it’s so important. I’m sure I’m not alone in my appreciation of all your time and hard work. I know you are making a difference in the world.

  11. Thanks yet again; excellent stuff. I wonder if you can offer a view on the intellectual wasteland that is financial data time series plotting. There you will find smoothing sophistication extending to boxcars — but plotted trailing to “solve” the leading edge problem — and, the raging favourite, exponentially decaying weighting, giving finite weight to every point in the historical series.

    These guys manage billions of our money. Why do they not have a clue?

  12. Horatio Algeranon

    “The real moral of the story is that if you want to find a downward trend, you have to use a “fast” smooth (i.e., short time scale).”

  13. Tamino: “In my own implementation I include this calculation, so I’m even able to put “error bars” on the smoothed values. I have yet to see this done by anyone else with LOWESS smoothing, which frankly surprises me. Maybe I ought to write a paper about this.”

    I wrote a Turbo Pascal LOWESS (in DOS) implementation in the late 1980s that kept the error data and could show the error bars – of course, mean +/- N sigma lines are usually more useful, and that’s what I actually showed. Points that were outside the clipping level (typically 2 sigma) were shown differently so you could assess whether or not to clip them out.

    I’m sorry to say that my current program is less capable and only shows the smoothed result … so far. (My old one doesn’t run under DOSBox.)

    [Response: I like to use +/- 2 sigma myself. I tend to refer to uncertainties in general as “error bars” because it conjures up the right idea.

    And I too have some old programs which do excellent stuff but run under DOS. I thought I was the only one! So far, all mine run OK under DOSbox.]

  14. Thank you for this excellent post. Could you give a few clues to your preferred weighting function, and also why you prefer 2nd degree?

    [Response: I like 2nd degree because local linear fits will underestimate the extremity of the extremes. It’s really not enough to worry about — it’s just my own preference.

    The weighting function is based on how rapidly the terms of a Taylor series expansion become significant, when defining “band-limited” in a different-than-usual way, as functions whose Taylor series coefficients decay rapidly enough.

    I think I really should write up all the details for a journal. It’ll be a little while before I’m able to get to it.]

    Here’s a link for matlab users: http://www.mathworks.se/help/curvefit/smoothing-data.html
    It also has a pretty good explanation.

  15. This was an incredibly valuable set of posts and should act as a resource to many over the years in areas far beyond climate.

    Thank you.

  16. Tamino: “And I too have some old programs which do excellent stuff but run under DOS. I thought I was the only one! So far, all mine run OK under DOSbox.”

    Some of mine do too, including some Turbo Pascal graphical ones. Not the LOWESS program, though. I have an old curve fitting program an associate wrote, using cubic splines under tension, that still runs – does a beautiful job, too, much better than regular cubic splines.

  17. Horatio Algeranon

    How does LOWESS compare to Savitzky-Golay for smoothing? (eg, when one chooses the same degree polynomial for the local fit)

    I used to program hand-held spectrometers (Ion Mobility and Raman) for the purpose of compound detection/ID and know SG can work quite well when one knows the width of the noise “peaks” relative to that of the features of interest and chooses the width of the filter to take this into account.

    And for anyone who thinks “smoothing is just for visual inspection”, I can vouch that that is simply not true. The smoothing greatly reduces the “false alarms” (false ID’s) for the kind of instruments I worked on. Indeed, without it, they would be much less reliable (if not useless).

    [Response: First, a SG filter isn’t a *weighted* least squares fit (although it certainly could be done that way). Perhaps most important, though, is that it is truly a *filter* so it requires an even time sampling. In fact one of the purposes (as was made clear in their original paper) was to publish the filter coefficients so that it would be practical even when using a desk calculator! (Yeah, it’s that old)]

    • Horatio Algeranon

      SG may be “old” (which need not mean “inferior”, since OLS is even older :) but in many (if not most) cases regarding climate data, the time sampling actually is even.

      I was simply curious how the smoothing results compare to LOWESS, but perhaps SG is simply “not as good as LOWESS”, in which case that is good to know.

      [Response: In my opinion, using a *weighted* least squares is a superior choice. But for evenly sampled data, a SG filter is just fine (and it’s easy enough to compute with a desktop calculator).]

      • Isn’t SG optimised for fit rather than smoothness? In my (very limited) experience it seems to produce somewhat “cuspy” smooths. Of the simple numerical methods, the old Spencer’s 15 and 21-point smooths seem to do better.

        [Response: You can trade off between “fit” and “smoothness” by varying the width of the SG filter. I’m not sure what you mean by “cuspy”.]

      • Here: http://gergs.net/wp-content/uploads/2014/01/Savitzky-Golay.png

        That compares a 15-point cubic SG smooth with a 15-point Spencer’s, for an Australian stock index. The inset shows the filter weights. Note the cusps in the SG smooth in July and August.

        Don’t want to pretend I understand (I don’t!), but this guy (pdf, p7) says SG optimises error reduction (his “error reducing index”) over smoothing (“smoothing index”, defined in terms of the third derivative).

  18. UPDATE to my earlier post about my old LOWESS program not running in DOSBox. Turns out that it does. My test data file had something subtle off about it, and the old program wasn’t robust enough to handle it. I cleaned up the data and the program still works well. After all these years!

  19. Nice post, thanks. That reminds me:

    Cleveland wrote several classics: The Elements of Graphing Data (1985, 1994) and Visualizing Data (1993); read them in that order. They sit on my handiest shelf between Tufte and Tukey books.

    As a historical/nostalgia note, Bill worked in an organization that not only did great statistics research, but a lot of computer science and software like UNIX & C & a whole lot of tools that have gotten widespread use.

    Circa 1978, R. C. Prim was Executive Director of a research division @ Bell Labs – Murray Hill.
    John Tukey was Associate Director, i.e., could do what he wanted without administrative bother.

    One of Prim’s labs was run by Henry O. Pollak, who I was lucky to hear speak while in high school, about why theoretical minimal spanning tree algorithms were worth a lot of $ to the telephone company.

    Under Pollak were a bunch of departments, a sample of whose people is:
    Bill Cleveland, now at Purdue
    John Chambers (S)
    Joseph Kruskal
    Ron Graham, great juggler and the popularizer of Erdos numbers … his being 1.
    Michael Garey
    David S. Johnson

    Nearby, in the same part of the building was another of Prim’s labs with people computing folks might recognize; to pick a few with Wikipedia pages: Doug McIlroy, Al Aho, Robert Morris Sr, Ravi Sethi, Ken Thompson, Jeff Ullman, AG Fraser, Stu Feldman, Steve C Johnson, Brian Kernighan, Dennis Ritchie, Steve Bourne, Mike Lesk …

    Amazing groups … monopoly money is nice when you have it.