[Note: please see the UPDATE at the end of this post]
In the last post we looked at smoothing data by fitting a polynomial, and by fitting a Fourier series. We found that we could control the degree of “smoothness” in either case (by polynomial degree or by cutoff frequency) and that with the right amount of smoothness both methods did a reasonable (and sometimes outstanding) job over most of the observed time span, but did much worse near the endpoints of observation. Let’s forget about the endpoint mess for the moment, and concentrate on the performance of the smoothing near the middle of the window of observation.
We’ll make it as simple as possible. Suppose we have a lot of observations xk, k = 1,2,…,N of some variable over time, and the times of observation extend from -0.5 to +0.5 so the width of the entire window is exactly 1 time unit. We further insist that the observations are evenly spaced throughout time. What would our smoothing techniques give for the smoothed value at the time t = 0, i.e. at the midpoint of the observation window?
Let’s take our data and fit a polynomial of degree zero. This is just a constant, and the least-squares solution to xk = const is f(t) = x̄, where x̄ is the average of all the xk:
$$\bar{x} = \frac{1}{N}\sum_{k=1}^{N} x_k = \frac{1}{N}\sum_{k=1}^{N} w_k x_k,$$
where wk = 1 for all k. Note that at the last step I’ve inserted a mysterious and apparently unnecessary “weight vector” wk, unnecessary because in this case it happens to be equal to 1 for all k.
Since our “smooth” is a constant throughout time, its value at time zero is equal to that constant, i.e., to x̄, which is equal to the mean value of the product of a “weight vector” wk with the data vector xk.
Now let’s fit a 1st-degree polynomial (a straight line) instead of a constant. This will certainly give us a different smooth! But it turns out that the difference makes no contribution, none at all, to the smoothed value at time t = 0. It’s still the mean value of a “weight vector” wk with the data vector xk, and the weight vector is still 1 for all k values so it’s also equal to the arithmetic mean of the data.
Now fit a 2nd-degree (quadratic) polynomial. In this case, the smoothed value at time t = 0 is still the mean value of a weight vector times the data vector, but the weight vector is no longer all 1′s. Instead, it looks like this over time:
Note that the largest (absolute) values of the weights are near t = 0, so data points near (in time) to t = 0 have more influence on our smoothed value at t = 0 — as they should. Note also that large values near the extremes tend to make the smoothed value at t = 0 negative because their weights are negative, and by doing so they define the contribution of the quadratic term to our smoothed estimate at t = 0.
Going to 3rd degree (quadratic to cubic) makes no change in the weight vector, just as going from constant to linear made none. In fact, because we’re only considering the smoothed value at t = 0 and our observations are equally spaced symmetrically about t = 0, the odd-degree terms don’t affect the smoothed value there.
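For readers who want to reproduce these weight vectors, here is a minimal sketch in Python with NumPy. It assumes N evenly spaced times from -0.5 to +0.5 and an ordinary least-squares fit; the function name `poly_weights_at_zero` and the choice N = 101 are mine, purely for illustration. The fitted value at t = 0 is the constant coefficient of the polynomial, which is a linear combination of the data, and the coefficients of that combination (times N) are the weights wk.

```python
import numpy as np

def poly_weights_at_zero(n_points, degree):
    """Weights w_k such that the least-squares polynomial of the given degree,
    evaluated at t = 0, equals mean(w * x)."""
    t = np.linspace(-0.5, 0.5, n_points)
    # Design matrix with columns 1, t, t^2, ..., t^degree.
    A = np.vander(t, degree + 1, increasing=True)
    # Least-squares coefficients are pinv(A) @ x; the value at t = 0 is the
    # constant coefficient, i.e. the first row of pinv(A) applied to x.
    row = np.linalg.pinv(A)[0]
    return t, n_points * row

t, w0 = poly_weights_at_zero(101, 0)
_, w1 = poly_weights_at_zero(101, 1)
_, w2 = poly_weights_at_zero(101, 2)
_, w3 = poly_weights_at_zero(101, 3)

print(np.allclose(w0, 1.0), np.allclose(w1, 1.0))  # constant and line: every weight is 1
print(np.allclose(w2, w3))                          # the cubic term changes nothing at t = 0
print(w2[50], w2[0])                                # positive at the centre, negative at the edge
```

Raising `degree` to 4, 6, or 24 and plotting the weights against `t` should show the increasingly wiggly shapes described here, although very high degrees in this simple monomial basis can become numerically ill-conditioned.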
But at 4th degree our weight vector has more wiggles
and at 6th degree more still
In fact if we go all the way to a 24th-degree polynomial, the weight vector looks like this:
It’s taking on a characteristic shape. As the order of the polynomial increases, the weight vector which defines the smoothed value at t = 0 approaches the sinc function, plotted here in red atop the weight function for polynomial degree 24:
In both cases, the largest (absolute) values are the weights near t = 0, so data values near that time will dominate the smoothed estimate. It’s almost like our smoothed value at t = 0 will be a “weighted average” of the data values near t = 0. This isn’t actually correct because some of the weights are negative, so it’s not really a weighted average; it would better be called a “mean weighted value” or something like that.
Let’s try something only slightly different. Let’s smooth our data with a 1st-order Fourier series, then compute the smoothed value at time t = 0. This too turns out to be the mean weighted value of a weight vector wk with the data xk, with this weight vector:
That’s certainly reminiscent of the weight vector for a 2nd-degree polynomial. Let’s go to a 2nd-order Fourier series to get this weight vector
Hmmm… I see a pattern emerging here. Let’s kick it up to an 8th-order Fourier series:
Once again, the smoothed value at t = 0 is dominated by data values near t = 0, with a wavy-looking weight function to account for more complex behavior. And once again, we can compare this weight function to the sinc function and find it matches even better:
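The Fourier weight vectors can be computed the same way. The sketch below is only an illustration under my own assumptions: the sample times are placed at bin centres so that the window spans exactly one period of the fundamental, and the names `fourier_weights_at_zero` and `dirichlet_kernel` are hypothetical. Under those assumptions the sines and cosines are orthogonal on the sample times, and the weights reduce to the closed form 1 + 2 Σ cos(2πjt), the Dirichlet kernel.

```python
import numpy as np

def fourier_weights_at_zero(n_points, order):
    """Weights w_k such that a least-squares Fourier series of the given order
    (fundamental period 1), evaluated at t = 0, equals mean(w * x)."""
    # Assumption: bin-centred sample times, so the window is exactly one period.
    t = (np.arange(n_points) + 0.5) / n_points - 0.5
    cols = [np.ones(n_points)]
    at_zero = [1.0]
    for j in range(1, order + 1):
        cols += [np.cos(2 * np.pi * j * t), np.sin(2 * np.pi * j * t)]
        at_zero += [1.0, 0.0]  # cos(0) = 1, sin(0) = 0
    A = np.column_stack(cols)
    # Fitted value at t = 0 is at_zero @ coefficients = at_zero @ pinv(A) @ x.
    row = np.array(at_zero) @ np.linalg.pinv(A)
    return t, n_points * row

def dirichlet_kernel(t, order):
    """Closed form 1 + 2 * sum_j cos(2 pi j t), for comparison."""
    j = np.arange(1, order + 1)[:, None]
    return 1.0 + 2.0 * np.cos(2 * np.pi * j * t).sum(axis=0)

t, w8 = fourier_weights_at_zero(200, 8)
print(np.allclose(w8, dirichlet_kernel(t, 8)))
```

The Dirichlet kernel can also be written sin((2m+1)πt)/sin(πt) for order m, and near t = 0, where sin(πt) ≈ πt, that is close to a scaled sinc function, which is consistent with the good match noted above.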
This suggests that it might be useful, when smoothing, to focus attention mostly — or even only — on the data values which are actually nearby (in time) to the moment at which we’re trying to compute our smooth. We might even define a weight function of our own, and use it to compute a “mean weighted value” by shifting its peak from moment to moment — placing the peak of the weighting function at whatever particular time t (whether zero or not) we’re focusing on at the moment.
In fact, for higher polynomial degree or Fourier order, computing the entire smooth (not just at t = 0) with polynomials or Fourier series is actually quite close to just that — taking a weight function (the one for “at t = 0″) and shifting it through time to compute a “mean weighted value” at each moment which depends mostly on the data values near that moment. The process itself — a “sliding” mean weighted value — is nothing more nor less than the convolution of the data with a filter, with the filter being the weight vector itself.
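To make the “sliding mean weighted value is a convolution” statement concrete, here is a small sketch. The weights are arbitrary numbers chosen for illustration, not any of the weight vectors plotted above, and `sliding_weighted_mean` is a hypothetical helper name.

```python
import numpy as np

def sliding_weighted_mean(x, w):
    """Slide the weight vector w along x; at each position return mean(w * segment)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(w)
    return np.array([np.mean(w * x[i:i + n]) for i in range(len(x) - n + 1)])

rng = np.random.default_rng(0)
x = rng.normal(size=50)
w = np.array([-0.5, 1.0, 3.0, 1.5, 0.0])  # arbitrary illustrative weights

direct = sliding_weighted_mean(x, w)
# np.convolve flips its second argument, so pass the reversed weights.
via_convolution = np.convolve(x, w[::-1] / len(w), mode="valid")
print(np.allclose(direct, via_convolution))
```

For a symmetric weight vector the reversal makes no difference, which is why the distinction between convolution and a sliding weighted mean rarely matters for smoothing filters.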
All this nice filter stuff depended on our estimating the smooth in the middle of the time span, or at least not being near the edge of time. Rather than consider how edge effects play out in our polynomial or Fourier smooths, or how things change when the time sampling is uneven (which can get quite complicated), let’s focus on the essential idea that our smooth should depend on the data nearby in time, and elevate that to the status of a general principle. Essentially, instead of smoothing data by fitting some function or sequence of functions globally, let’s compute smoothed values based on using mainly — or only — local data.
Of course that raises the issue of what “local” means. We are free to choose whatever definition suits our need, and in particular by making “local” mean a very narrow or very broad slice of time we can control the degree of “smoothness” of our estimate, just as we did for global fits by varying their parameters (polynomial degree or cutoff frequency).
The Simplest Smooth
The simplest way to estimate the smoothed value at a particular moment of time would be simply to average the data near that moment. Since we’ve elevated “nearby in time” to a principle, we do have to choose how nearby in time, i.e. how wide our “averaging window” is in order to set the time scale of the smoothing.
Let’s try the simplest thing of all: just cut time into slices, maybe even make them all the same width, and average the data within each slice. We’ll even call each slice a “bin” just to use the same terminology as others. Heck, we could even compute a standard error associated with that average, and a confidence interval if we wanted to [note: we will utterly ignore autocorrelation and treat the noise as "white noise"]. We could do this with, say, 10-year-long time slices of annual average global temperature from NASA GISS, to get this (data in black, 10-year averages with error bars as large red dots):
That’s great, in fact we do this kind of thing all the time — but is it smoothing? It doesn’t compute values at all the observed times and it isn’t defined for in-between values. It’s not even clear how to compute residuals.
But hey — it does accomplish the purpose (or at least, one of the purposes), namely, that it reduces the noise a lot more than it alters the “signal”, if such things exist. More correctly it reduces the total variation of the “fast” part much more than it interferes with the variation of the “slow” part, so it has at least estimated the slow variation for us.
And that’s one of the purposes of smoothing. In some particular cases it may be the only relevant purpose, in which case this simple computation of averages has done the job. Something to be aware of is that there are no “smooth police” to make sure you do everything according to traditional standards. If a particular method tells us what we wanted to know without leading us astray, then it’s doing its job.
Another benefit of this simple averaging scheme is that we can describe the “essential” (i.e. slow, or long-term) changes using fewer data points. This used to be one of the most valuable aspects, because it means fewer calculations. For extremely large data sets, computing the Date-Compensated Discrete Fourier Transform while oversampling the spectrum at 20 times the “nominal” frequency resolution can be a bit of a problem, requiring prohibitive resources for data storage/access and arithmetic calculations. If instead we analyze the averages rather than the raw data, we can reduce our computational workload while losing almost no information about long-term behavior. We will lose the information about short-term (fast) behavior, but if we’re interested in the long term then that’s OK. [Important note: if you do this, in the “reduced” data set make the time equal to the average time of all data within the bin, just as the value is the average value within the bin.]
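The binning described above can be sketched in a few lines. This is an illustration only: `bin_averages` is a hypothetical helper, the input is a synthetic series rather than the NASA GISS data, and the standard errors assume white noise, as stated earlier. Each bin gets the average time as well as the average value.

```python
import numpy as np

def bin_averages(t, x, width):
    """Average time, average value, and standard error of the mean in
    consecutive bins of the given width (white-noise assumption)."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    index = np.floor((t - t.min()) / width).astype(int)
    rows = []
    for b in np.unique(index):
        in_bin = index == b
        n = in_bin.sum()
        # Standard error needs at least two points in the bin.
        se = x[in_bin].std(ddof=1) / np.sqrt(n) if n > 1 else np.nan
        rows.append((t[in_bin].mean(), x[in_bin].mean(), se))
    return np.array(rows)

# Synthetic stand-in for an annual series (not real temperature data).
years = np.arange(1880, 1900)
values = (years - 1880).astype(float)
binned = bin_averages(years, values, width=10)
print(binned)  # columns: average time, average value, standard error
```

An approximate 95% confidence interval would then be the average plus or minus about two standard errors, again ignoring autocorrelation.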
And for those anal-retentive smoothers who insist on a function defined at all intermediate times, we can even get this result from a least-squares fit of “step functions” which are 10 years wide (or however wide our averaging interval is):
This enables us to interpolate (since it’s defined at all intermediate times), and to compute residuals if we want to study them. (Personally I’d rather define a complete function by the “connect-the-dots” procedure). For those who object that this “smooth” isn’t at all “smooth,” I suggest you enroll at the “smooth police academy.” Lesson 1: this kind of smoothing is out of your jurisdiction.
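The equivalence between a least-squares fit of step functions and plain bin averages is easy to check numerically. The sketch below uses synthetic data (a gentle trend plus white noise, my own choice) and one indicator column per 10-year bin.

```python
import numpy as np

years = np.arange(1880, 1910, dtype=float)
rng = np.random.default_rng(1)
values = 0.01 * (years - 1880) + rng.normal(scale=0.1, size=years.size)

width = 10.0
index = np.floor((years - years.min()) / width).astype(int)
n_bins = index.max() + 1

# One step function per bin: 1 inside the bin, 0 outside.
steps = (index[:, None] == np.arange(n_bins)[None, :]).astype(float)
heights, *_ = np.linalg.lstsq(steps, values, rcond=None)

bin_means = np.array([values[index == b].mean() for b in range(n_bins)])
fitted = steps @ heights      # the step-function "smooth" at each observed time
residuals = values - fitted   # available for further study

print(np.allclose(heights, bin_means))
```

Because the step functions don't overlap, the fit decouples into one average per bin, which is why the fitted heights agree with the bin means.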
We don’t have to restrict ourselves to non-overlapping time slices. We could start with the first 10 years of data and use them to compute an average time and average data value, then instead of shifting the “averaging window” 10 years ahead, we shift it only 1 data point ahead in time and compute yet another average time/data. In fact we can keep doing this, advancing the averaging window by 1 data point each time, until our window finally covers the final data point. This will give us a set of “average times” and “average values” which, when applied to global temperature from NASA, looks like this:
This smooth isn’t very smooth either, which is only a problem if we care about that. This is the simplest case of what’s called a “moving average.”
But another, very real problem is that the “smoothed” data covers a smaller time span than the raw data. Our very 1st bin average is (for evenly sampled data) halfway into the first “bin,” so we lose about half a bin’s worth of time at the beginning of the data set, and again at the end. Data from 1880 through 2013 will therefore become “smoothed data” from 1885 through 2008. That’s no problem deep in the heart of the 20th century, but if the thing you’re most interested in is the behavior near the end of a time series (what’s happening now), then losing part of the time span at the end is a distinct disadvantage of moving averages.
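Here is a minimal sketch of that moving average, advancing the window one data point at a time and averaging both the times and the values. The helper name `moving_average` and the synthetic values are assumptions for illustration; only the years 1880 through 2013 and the 10-point window come from the discussion above.

```python
import numpy as np

def moving_average(t, x, n):
    """Average time and average value in every window of n consecutive points."""
    box = np.ones(n) / n
    t_avg = np.convolve(np.asarray(t, dtype=float), box, mode="valid")
    x_avg = np.convolve(np.asarray(x, dtype=float), box, mode="valid")
    return t_avg, x_avg

years = np.arange(1880, 2014)
values = 0.005 * (years - 1880)  # synthetic stand-in, not real temperature data
t_avg, x_avg = moving_average(years, values, 10)

print(t_avg[0], t_avg[-1])       # first and last average times
print(len(years), len(t_avg))    # points before and after smoothing
```

The first and last average times come out at about 1884.5 and 2008.5, so roughly half a window is lost at each end, and the 134 annual values become 125 smoothed values.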
There are ways to extend the moving averages to the edges of the observed time window. We could, for instance, use a smaller window as we get near the edge of time. Alternatively, we could “pad” the raw data, adding data points at times before/after observations. What data values to add is a tricky problem! Sometimes one will pad a time series with the 1st data value at the beginning and the last data value at the end. Or, one can pad the time series with the mean data value. Perhaps most common (in some contexts) is to pad the time series with values of zero, which is sometimes OK but sometimes disastrous! (This is just me, but personally I hate this whole strategy — I never met a padding method I didn’t dislike).
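The three padding choices mentioned above map directly onto modes of NumPy's `np.pad`. The sketch below is an illustration with a hypothetical helper, `padded_moving_average`, applied to a noise-free rising line so the distortion at the end is easy to see.

```python
import numpy as np

def padded_moving_average(x, n, mode, **pad_kwargs):
    """Centered n-point moving average (n odd) after padding n // 2 points
    on each side, so the output has the same length as the input."""
    half = n // 2
    padded = np.pad(np.asarray(x, dtype=float), half, mode=mode, **pad_kwargs)
    return np.convolve(padded, np.ones(n) / n, mode="valid")

x = np.arange(10.0)  # rising straight line 0, 1, ..., 9

edge = padded_moving_average(x, 3, "edge")   # repeat first/last value
mean = padded_moving_average(x, 3, "mean")   # pad with the overall mean
zero = padded_moving_average(x, 3, "constant", constant_values=0.0)  # pad with zeros

print(edge[-1], mean[-1], zero[-1])  # compare with the true final value, 9
```

All three end values fall below 9, with zero padding the furthest off, which is one way to see why the choice of padding values is tricky.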
When we computed our averages (either moving or not), we simply averaged the data in the given bin. We could substitute a “weighted average,” or, if some of the weights are negative, a “mean weighted value,” for the straight average. This is the “filter” idea we talked about earlier when discussing global fits of polynomial or Fourier series. The averages we’ve computed so far, giving equal weight to each data point within the averaging window, are nothing more nor less than the simplest example of such a filter, for which the weights are all equal within the averaging window but all zero outside of it. This is sometimes called a “boxcar filter.”
One obvious strategy is to give more weight to nearer (in time) data values by using larger weights in the center of our filter. A common choice is a “wedge filter” (also called a “tent filter”), which increases linearly from its left edge to the middle, then decreases linearly back to zero. We control the degree of smoothing by varying the width of the averaging window (the part of the filter which is non-zero). Another common choice is to use weights which follow the normal distribution, a case called “Gaussian smoothing” (since the normal distribution is also called a “Gaussian”). We can’t choose the width of the window any more because the window is infinite (Gaussian weights approach but don’t get to zero), but we can choose the width of the Gaussian itself as our control parameter for smoothness, since Gaussians include a “scale parameter” (the standard deviation if it’s a probability density) which is ideal for adjusting the time scale.
Here, for instance, are the results of a 10-year wide boxcar filter (in blue), a 12-year wide tent filter (green), and a Gaussian smooth with scale parameter 3 years (red):
The three smooths are so close that it’s a bit hard to tell them apart on the graph, except at the edges. And that’s a good thing — it means that we’re getting consistent results from one method to another.
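A sketch of the three filters follows, written as weight functions of the time offset and applied as normalized weighted averages. Everything here is illustrative: the function names are hypothetical, the data are synthetic rather than the temperature series plotted above, and dividing by the sum of the weights (which quietly truncates the window near the edges) is one possible convention, not necessarily the one used for the figure.

```python
import numpy as np

def boxcar(dt, width):
    # Equal weight inside the window, zero outside.
    return (np.abs(dt) <= width / 2).astype(float)

def tent(dt, width):
    # Rises linearly to 1 at the centre, falls back to 0 at the window edges.
    return np.clip(1.0 - np.abs(dt) / (width / 2), 0.0, None)

def gaussian(dt, scale):
    # Never exactly zero; `scale` plays the role of a standard deviation.
    return np.exp(-0.5 * (dt / scale) ** 2)

def kernel_smooth(t, x, kernel, param):
    """Weighted average of x at each time in t, with weights kernel(t_i - t_k)."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    w = kernel(t[:, None] - t[None, :], param)
    return (w @ x) / w.sum(axis=1)

years = np.arange(1880, 2014, dtype=float)
rng = np.random.default_rng(2)
values = 0.006 * (years - 1880) + rng.normal(scale=0.1, size=years.size)

smooth_box = kernel_smooth(years, values, boxcar, 10.0)
smooth_tent = kernel_smooth(years, values, tent, 12.0)
smooth_gauss = kernel_smooth(years, values, gaussian, 3.0)
```

Note that with integer-year sampling, the `<=` in `boxcar` makes a 10-year window include 11 points; changing it to `<` would give 9. Plotting the three results against `years` is an easy way to compare them.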
Notice that I extended the Gaussian smooth all the way to the edges of the time span. That’s not necessarily a good idea, and in this case the values near the edges are not to be trusted. The reason is that a Gaussian smooth is still an “average” (weighted) over a “window”, and the data may show some trend over time so that earlier or later data are rising or falling. That’s no problem in the middle of the time span because at any given moment, the data which are earlier might be lower-than-average and the data which are later in time might be higher-than-average but their deviations will offset each other to cancel out and return a reliable whole-bin average. But at the very edge of the time span there is no data “later in time”, there’s only the “earlier in time” half with its below-average values to make our estimate come out below average. Like here:
The data are artificial, a straight line plus white noise, but the Gaussian smooth curves way too high at the start and too low at the end and the difference is substantial. I did use a wide Gaussian filter (scale parameter 10 years) to exaggerate the effect, but it’s very real and in some practical cases quite pronounced. Although this example has used a Gaussian filter, the “trend at the beginning or end” problem affects moving averages in general, not just Gaussian smooths. This is the principal reason that in some fields one usually does not extend a smoothing filter all the way to the edge, but restricts oneself to “complete” windows.
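The edge bias is easy to reproduce. The sketch below is a simplified version of the experiment described above: I've left out the noise so the bias is exactly visible, and the slope of 0.01 per year is an arbitrary assumption. The Gaussian scale of 10 years matches the text.

```python
import numpy as np

years = np.arange(1880, 2014, dtype=float)
line = 0.01 * (years - 1880)  # noise-free rising straight line

scale = 10.0                  # wide Gaussian, as in the example above
dt = years[:, None] - years[None, :]
w = np.exp(-0.5 * (dt / scale) ** 2)
smooth = (w @ line) / w.sum(axis=1)  # normalized Gaussian-weighted average

bias = smooth - line
print(bias[0], bias[len(years) // 2], bias[-1])  # start, middle, end
```

At the start only later (higher) values are available, so the smooth sits above the line; at the end only earlier (lower) values are available, so it sits below; in the middle the two sides cancel and the bias is negligible.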
The bottom line is that “edge effects” in smoothing are hard to suppress, and in fact it’s impossible to eliminate them. But there are things that can help, both with global smooths (like polynomial/Fourier series) and with local smooths. Also, the filter idea is truly excellent, brilliant in fact, with lots of advantageous mathematical properties, until the time sampling is uneven, when things go horribly wrong.
All of which leaves us plenty to talk about in the next smooth post.
UPDATE UPDATE UPDATE
There’s a person who calls himself “Amego Sando” associated with a website called “Nouvelles et satellite scientifique” who has a very bad habit.
He takes MY blog posts, and reposts them, word-for-word, with all the pretty pictures. Here’s one. At the top of each he claims this:
“Written by Amego Sando”
That is a lie. Amego Sando is nothing but a “petit” thief.
I don’t mind if people re-post my stuff, but I insist on two things:
1. Link to the original
2. Credit the author (me, not Amego Sando)
I’ve placed a comment on that site demanding my posts be removed. I very much doubt that it will get any response, or even ever be approved.
So here’s a question for the more savvy among you: If this organization and the thief Amego Sando don’t set this right — how do I go about forcing them to do so?