Suppose you are asked to determine how y, the severity of an unhealthy reaction to some toxin, relates to x, the amount of toxin one is exposed to. You expect ahead of time that the relationship will be monotonic, i.e. that increasing the exposure cannot decrease the severity, and that in fact the relationship will be “smooth” (whatever that means). You also have just enough laboratory resources to collect some actual data, so at equally spaced exposure levels from 1 to 9 you measure the reaction severity as accurately as you can, coming up with these data:
Suppose now that we want to know the likely severity at some point in between those integer-valued measurement points from 1 to 9? What if we wanted to know the severity for an exposure of 4.5, or 2.271? In other words, what if we want to interpolate to get an estimate?
The classic way would be to define a function at all values x from lowest to highest — not just the x values at which we have observations — by simply “connecting the dots” (as has already been done in the above graph). The line segments from one data point to the next define the function values at intermediate x values.
That’s fine, and the function so defined is at least continuous. But it’s not smooth, it makes instantaneous changes in direction at the observed x values (in other words, it’s derivative is undefined at the observed x values). In between those values, it’s perfectly smooth — a straight line segment is as smooth as can be — so this “classic” interpolation function is piecewise-smooth. But it isn’t smooth, and the merest glance at the jagged corners connecting the straight-line segments makes that obvious.
And we already said we expect y to be a smooth function of x. Perhaps we could find a smooth function which matches all the data values at the measured levels. It’s easy to do so: we can fit any 9 data points with an 8th-degree polynomial. Then we can hope that the polynomial approximates the smooth behavior we’re after, and compute the value of that polynomial at the “in-between” points in order to interpolate. The smooth function looks like this:
Ouch. That’s kinda wiggly and bumpy in ways that you don’t expect reality to be. First of all, we expected a monotone increase — up only — so that points with higher x can’t have lower y, but this “smooth” curve goes both up and down willy-nilly. Hell, it even dips to negative values, which for our “severity” variable is nonsense. And, it just keeps changing direction too fast and too often to be sufficiently “smooth” for us, it’s too “wiggly” and “bumpy”, too … “rough.”
Well, polynomials aren’t the only way to model a completely general function: we could use a Fourier series instead. We can model any nine data points with a 4th-order Fourier series, which gives us this:
Pfeh! That’s just as bad the polynomial fit!
What is Smooth?
Hold on for a moment here. We started out saying that polynomials (and the Fourier series) could give us a smooth function through our data — and they do. But then we say it’s not “smooth enough”? Just what does one mean by “smooth” anyway?
The strict mathematical definition of a smooth function is one which is infinitely differentiable. Polynomials meet that criterion, as do sines and cosines, so both polynomials and Fourier series do indeed provide us with smooth functions. But we didn’t like the fast wiggling, the jiggling around from point to point of our smooth functions either. What we really want is something that’s more smooth than just the fit-all-the-data-perfectly option.
It just so happens that most of the “rough stuff” is the fast stuff, and when it comes to Fourier analysis, the fast stuff is the high-frequency stuff. What if I fit a Fourier series to my data, but only to 1st order, i.e. using only the lowest non-zero frequency? Then, I should have a slower — and, we expect, smoother — fit. It looks like this:
The fit isn’t very good. But it isn’t horrible either. There’s genuine (and statistically significant!) correlation between the data values and the model values. It obviously doesn’t capture the variation well, but it does capture some important aspects of the overall quantitative behavior.
It also highlights a difference when we try to model our data using only sufficiently “smooth” functions. Namely, that our model no longer matches all the data perfectly. There are differences, which we can call residuals. In the case of our 1st-order Fourier fit, the differences are substantial.
We could try to capture more of the “signal” — or so we might call the true value of y as a function of x if we thought such a thing even existed — by using a higher order Fourier series. If we go to 2nd order, we get this model:
The match is better but not great, and it’s already wiggling around faster than we were hoping for. But it is getting closer.
What if we try the “use only slow functions” strategy with polynomials? In this case “slow” generally means low degree while “fast” means high degree. If we limit ourselves to a 2nd-degree (quadratic) polynomial, we get this model of the data:
Now we’re getting somewhere! The fit is outstanding, it’s plenty “smooth” enough to satisfy anybody, and it’s always going in the “up” direction.
There are still residuals, although they’re quite small and don’t show any apparent pattern. What we have done is to separate the data into two parts: the smooth part (in this case, a quadratic polynomial) and the rough part (the residuals).
We might even hypothesize that the smooth (in this case, the quadratic fit) is our best approximation of reality, and that the residuals are an example of the “noise” (departure from expectation) in the system.
And we’d be exactly right. These data were created by computing a quadratic function and adding random noise.
Since the signal itself is quadratic, so is the best polynomial degree for smoothing. For higher degree polynomials, the excess wiggling — especially at the endpoints — is due to the noise in the data. If we fit a straight line (a 1st-degree polynomial) to some data, then we’ve modelled the data with two functions ( and ) so we require two parameters: slope and intercept. Now increase the order from 1st to 2nd for a quadratic fit. In one sense, the extra “function” we’re fitting is . But that function can easily have nonzero average value, so in a way it is “like” the function which we’ve already included in our fit. What we’re really interested in is what this extra function does that the other functions don’t already do for us. This turns out to be captured, not by power functions , but by “orthogonal polynomials,” each of which is one degree higher, and all of which are “orthogonal” to each other (which, very loosely speaking, means they don’t duplicate the patterns found in the other polynomials).
The first two orthogonal polynomials are just and (where is the average time value) that we’ve already been using. The next 8 orthogonal polynomials look like this:
Two things are worth noting near the beginning and end of the observed time span. First, the values are larger, more so the larger the degree of the orthogonal polynomial. Second, the wiggles get closer together, i.e. they get faster. These properties of the fit functions persist in the final result, so when we use a high-degree polynomial we tend to get much larger uncertainty (i.e. noise response) as well as bigger and faster wiggles (also due to noise alone) near the beginning and end, exactly as we observed with our 8th-degree polynomial fit. A higher polynomial degree usually only makes things worse.
The probable error of a polynomial smooth which is due to noise alone is determined by something which we can call the “uncertainty function,” which gives the expected contribution of that polynomial to the variance of the estimated y value at a given x value. The uncertainty function tallies variance, but we can take its square root to compute the “standard error function” giving the probable error as a function of time. Here it is for polynomials of degree zero (a constant) up through degree 5, for polynomials which cover the time span :
Note how the uncertainty (i.e. the contribution of noise to the smooth function) is exaggerated near the endpoints, the more so the higher the polynomial degree — with just a 5th-degree polynomial, the standard errors are already three times as large at the endpoints as in the middle. And, don’t forget about that extra wiggling near the enpoints too; the combination of exaggerated endpoint uncertainty and exaggerated endpoint wiggling makes polynomial smoothing with degree higher than 3 or 4 at the most extremely untrustworthy near the endpoints of the time span.
The “fast” (high-degree) polynomials had too much wiggling at the endpoints, but the slow (2nd-degree) worked fine. Of course that’s because the signal itself was a 2nd-degree polynomial. For Fourier series on the other hand, even in the “slow” case it didn’t fit very well. For the Fourier series the fit is poor because Fourier series are designed to create a periodic function. Whatever smooth function it returns will actually be periodic with period equal to the time span of the observed data. In fact we get the same smooth function if we fit a Fourier series to repeated copies of the data:
Note that in order to repeat, it has to dive back down at the end of each “cycle” toward those low values at the beginning of the next “cycle.” To do so, it has to exaggerate the wiggles, especially at the end. And that’s just to fit the signal, even without any noise. This is another case where the essential properties of the functions we’re using persist in the final result.
There are many ways to ameliorate this (and other) problems, but none of them entirely eliminate it. The fact remains that periodic functions have essential tendencies which persist in any Fourier-based smooth, and the problematic aspect is the behavior of the smooth near the endpoints.
It should be mentioned that for times well away from the endpoints, a polynomial smooth and a Fourier-based smooth both give oustanding results if the “time scale” (cutoff frequency for Fourier, polynomial degree for polynomials) is well chosen.
A More Generic Smooth
We’ve tried using classes of functions (polynomials, Fourier series) and restricting them to the “slow” ones in order to keep things sufficiently “smooth.” Perhaps instead we could seek some completely general smooth function which optimizes some criterion which combines both “fit” (how closely does it match the data) with “smoothness.” It’s easy to define how well it fits the data — the sum of the squares of the residuals is only the most common of many methods. But how do we define “smoothness” for some function in general?
The idea is that it’s the bending of the smooth curve that accounts for its “roughness,” and that the bending is measured by the second time derivative of the function. Of course, for that to exist the function has to be twice-differentiable, but that’s fine because we want a nice “smooth” function. It may not be “technically” smooth (infinitely differentiable) but it will at least be smooth-looking.
To measure of the goodness-of-fit (or should I say badness-of-fit), take the usual sum of the squared residuals. To measure the roughness, integrate the square of the 2nd derivative over the observed time span. Combine these two quantities into a weighted average, giving more weight to the roughness if you want an extra-smooth smooth but more weight to the badness-of-fit if you want an extra-good fit. Hence this method involves a parameter (actually it can involve many, but let’s not get into details) which controls how “smooth” the final smooth will be. This is nothing new, with polynomials we controlled smoothness by polynomial degree and with Fourier series by cutoff frequency.
Then: find the function which minimizes the weighted average of badness-of-fit and roughness. The solution turns out to be a function which is not smooth, i.e. not infinitely differentiable, but is piecewise-smooth, i.e. it’s made of a finite number of pieces which are themselves smooth. Furthermore, the pieces are joined as smoothly as possible by requiring that where they meet, they have the same value, the same derivative, and the same 2nd derivative. The result is called a spline smooth. The pieces themselves turn out to be cubic polynomials, so the smooth function is sometimes referred to as a “cubic spline.” If we apply this method to our toxicity data, with a reasonably smooth smooth we get this:
Global and Local Smooths
Fitting functions like polynomials or Fourier series to the entire set of data, and finding a function which optimizes some measure of total goodness as a spline smooth does, might be called “global” smoothing methods because they fit a smooth to the entire data set, both computationally and conceptually. However, one can also look at smoothing as a local problem, in which the value of the smooth at some particular time is determined by the data values which are nearby in time to the given moment. In the next post, we’ll take a look at some methods and issues related to local smoothing.