# Category Archives: mathematics

## Ranges Overlap

In the last post I mentioned that when we have two different estimates, each with its own uncertainty range (note: I use the 95% confidence interval almost all the time, or to be precise the ±2σ range), the fact that their ranges overlap isn’t the proper statistical test for whether the estimates are significantly different. Somebody asked about that.

Just for a gut feeling: I know that when error ranges overlap, there are values that fall in the “plausible range” for both estimates, which suggests that the estimates may well be in agreement. But sometimes, those “plausible in both ranges” values are unlikely in both ranges. Unlikely isn’t so implausible, but unlikely for both is unlikely squared, and that’s too implausible to be plausible.

What follows is some of the math, and it’s really simple, really, but I know that turns off some readers. Others want it; feel free to skip it and enjoy the remainder of the day.

## Multiple Testing

Suppose I showed you a close-up, on a live video feed, of where my shot at a target landed. The video is zoomed in on the target’s bulls-eye, and lo and behold, right there is my shot. No doubt about it: bulls-eye. I brag about what a good shot I am, and you’re impressed.

Then my cat gets on the table and accidentally zooms the camera view out to show the entire target (a cat? maybe not an accident). Now you can see that I took about 300 shots, they landed all over the place (some even missed the target completely), and one of them hit the bulls-eye. My guess: you would no longer be so impressed. You might even think my original camera view was misleading.

That, in essence, is how people still buy and sell the misleading claim that there was a “pause/hiatus/slowdown” in global surface temperature during the early 2000s. Some, I’d even venture to say most, do it without knowing what they’re doing. Some who have done it but have been told why it’s a problem, persist in doing it still. The former I’d call an honest mistake; the latter I’d call not just misleading, but deliberate deception.

## Climate Trend Change? Do it Right

Because scientists, you’re doing it wrong.

There’s a mistake you keep making, as many times as I point it out, even in the peer-reviewed literature, you keep doin’ it wrong. If you’re a scientist, then maybe yes this means you. There are at least two new papers (yes, new) that did it again. And yes, they’re about the so-called “hiatus.”

I’m not just going to show you how you’re doin’ it wrong. I’ll show you how to do it right — or at least, better. And I’ll share some programs to help.

## Smooth 3

In the last post we looked at smoothing time series by focusing mainly, or exclusively, on local data, i.e., data which are nearby (in time) to the moment at which we’re trying to estimate the smooth. This is usually accomplished by having an observation window or weighting function which is centered at the moment of estimation, and we let this window “slide” through time to generate smoothed estimates at all times. We even showed that fitting a polynomial or a Fourier series globally (i.e. to all the data at once) gives results which are, for times not near the edges of the observation window, nearly the same.

## Smooth 2

[Note: please see the UPDATE at the end of this post]

In the last post we looked at smoothing data by fitting a polynomial, and by fitting a Fourier series. We found that we could control the degree of “smoothness” in either case (by polynomial degree or by cutoff frequency) and that with the right amount of smoothness both methods did a reasonable (and sometimes outstanding) job over most of the observed time span, but did much worse near the endpoints of observation. Let’s forget about the endpoint mess for the moment, and concentrate on the performance of the smoothing near the middle of the window of observation.

## Smooth 1

Suppose you are asked to determine how y, the severity of an unhealthy reaction to some toxin, relates to x, the amount of toxin one is exposed to. You expect ahead of time that the relationship will be monotonic, i.e. that increasing the exposure cannot decrease the severity, and that in fact the relationship will be “smooth” (whatever that means). You also have just enough laboratory resources to collect some actual data, so at equally spaced exposure levels from 1 to 9 you measure the reaction severity as accurately as you can, coming up with these data:

Suppose now that we want to know the likely severity at some point in between those integer-valued measurement points from 1 to 9? What if we wanted to know the severity for an exposure of 4.5, or 2.271? In other words, what if we want to interpolate to get an estimate?

The classic way would be to define a function at all values x from lowest to highest — not just the x values at which we have observations — by simply “connecting the dots” (as has already been done in the above graph). The line segments from one data point to the next define the function values at intermediate x values.

That’s fine, and the function so defined is at least continuous. But it’s not smooth, it makes instantaneous changes in direction at the observed x values (in other words, it’s derivative is undefined at the observed x values). In between those values, it’s perfectly smooth — a straight line segment is as smooth as can be — so this “classic” interpolation function is piecewise-smooth. But it isn’t smooth, and the merest glance at the jagged corners connecting the straight-line segments makes that obvious.

And we already said we expect y to be a smooth function of x. Perhaps we could find a smooth function which matches all the data values at the measured levels. It’s easy to do so: we can fit any 9 data points with an 8th-degree polynomial. Then we can hope that the polynomial approximates the smooth behavior we’re after, and compute the value of that polynomial at the “in-between” points in order to interpolate. The smooth function looks like this:

Ouch. That’s kinda wiggly and bumpy in ways that you don’t expect reality to be. First of all, we expected a monotone increase — up only — so that points with higher x can’t have lower y, but this “smooth” curve goes both up and down willy-nilly. Hell, it even dips to negative values, which for our “severity” variable is nonsense. And, it just keeps changing direction too fast and too often to be sufficiently “smooth” for us, it’s too “wiggly” and “bumpy”, too … “rough.”

Well, polynomials aren’t the only way to model a completely general function: we could use a Fourier series instead. We can model any nine data points with a 4th-order Fourier series, which gives us this:

Pfeh! That’s just as bad the polynomial fit!

What is Smooth?

Hold on for a moment here. We started out saying that polynomials (and the Fourier series) could give us a smooth function through our data — and they do. But then we say it’s not “smooth enough”? Just what does one mean by “smooth” anyway?

The strict mathematical definition of a smooth function is one which is infinitely differentiable. Polynomials meet that criterion, as do sines and cosines, so both polynomials and Fourier series do indeed provide us with smooth functions. But we didn’t like the fast wiggling, the jiggling around from point to point of our smooth functions either. What we really want is something that’s more smooth than just the fit-all-the-data-perfectly option.

It just so happens that most of the “rough stuff” is the fast stuff, and when it comes to Fourier analysis, the fast stuff is the high-frequency stuff. What if I fit a Fourier series to my data, but only to 1st order, i.e. using only the lowest non-zero frequency? Then, I should have a slower — and, we expect, smoother — fit. It looks like this:

The fit isn’t very good. But it isn’t horrible either. There’s genuine (and statistically significant!) correlation between the data values and the model values. It obviously doesn’t capture the variation well, but it does capture some important aspects of the overall quantitative behavior.

It also highlights a difference when we try to model our data using only sufficiently “smooth” functions. Namely, that our model no longer matches all the data perfectly. There are differences, which we can call residuals. In the case of our 1st-order Fourier fit, the differences are substantial.

We could try to capture more of the “signal” — or so we might call the true value of y as a function of x if we thought such a thing even existed — by using a higher order Fourier series. If we go to 2nd order, we get this model:

The match is better but not great, and it’s already wiggling around faster than we were hoping for. But it is getting closer.

What if we try the “use only slow functions” strategy with polynomials? In this case “slow” generally means low degree while “fast” means high degree. If we limit ourselves to a 2nd-degree (quadratic) polynomial, we get this model of the data:

Now we’re getting somewhere! The fit is outstanding, it’s plenty “smooth” enough to satisfy anybody, and it’s always going in the “up” direction.

There are still residuals, although they’re quite small and don’t show any apparent pattern. What we have done is to separate the data into two parts: the smooth part (in this case, a quadratic polynomial) and the rough part (the residuals).

We might even hypothesize that the smooth (in this case, the quadratic fit) is our best approximation of reality, and that the residuals are an example of the “noise” (departure from expectation) in the system.

And we’d be exactly right. These data were created by computing a quadratic function and adding random noise.

Noise Response

Since the signal itself is quadratic, so is the best polynomial degree for smoothing. For higher degree polynomials, the excess wiggling — especially at the endpoints — is due to the noise in the data. If we fit a straight line (a 1st-degree polynomial) to some data, then we’ve modelled the data with two functions ($f_o(t) = 1$ and $f_1(t) = t$) so we require two parameters: slope and intercept. Now increase the order from 1st to 2nd for a quadratic fit. In one sense, the extra “function” we’re fitting is $f_2(t) = t^2$. But that function can easily have nonzero average value, so in a way it is “like” the function $f_o(t) = 1$ which we’ve already included in our fit. What we’re really interested in is what this extra function does that the other functions don’t already do for us. This turns out to be captured, not by power functions $f_j(t) = t^j$, but by “orthogonal polynomials,” each of which is one degree higher, and all of which are “orthogonal” to each other (which, very loosely speaking, means they don’t duplicate the patterns found in the other polynomials).

The first two orthogonal polynomials are just $f_o(t) = 1$ and $f_1(t) = t - \bar t$ (where $\bar t$ is the average time value) that we’ve already been using. The next 8 orthogonal polynomials look like this:

Two things are worth noting near the beginning and end of the observed time span. First, the values are larger, more so the larger the degree of the orthogonal polynomial. Second, the wiggles get closer together, i.e. they get faster. These properties of the fit functions persist in the final result, so when we use a high-degree polynomial we tend to get much larger uncertainty (i.e. noise response) as well as bigger and faster wiggles (also due to noise alone) near the beginning and end, exactly as we observed with our 8th-degree polynomial fit. A higher polynomial degree usually only makes things worse.

The probable error of a polynomial smooth which is due to noise alone is determined by something which we can call the “uncertainty function,” which gives the expected contribution of that polynomial to the variance of the estimated y value at a given x value. The uncertainty function tallies variance, but we can take its square root to compute the “standard error function” giving the probable error as a function of time. Here it is for polynomials of degree zero (a constant) up through degree 5, for polynomials which cover the time span $-0.5 \le t \le 0.5$:

Note how the uncertainty (i.e. the contribution of noise to the smooth function) is exaggerated near the endpoints, the more so the higher the polynomial degree — with just a 5th-degree polynomial, the standard errors are already three times as large at the endpoints as in the middle. And, don’t forget about that extra wiggling near the enpoints too; the combination of exaggerated endpoint uncertainty and exaggerated endpoint wiggling makes polynomial smoothing with degree higher than 3 or 4 at the most extremely untrustworthy near the endpoints of the time span.

Function Misfit

The “fast” (high-degree) polynomials had too much wiggling at the endpoints, but the slow (2nd-degree) worked fine. Of course that’s because the signal itself was a 2nd-degree polynomial. For Fourier series on the other hand, even in the “slow” case it didn’t fit very well. For the Fourier series the fit is poor because Fourier series are designed to create a periodic function. Whatever smooth function it returns will actually be periodic with period equal to the time span of the observed data. In fact we get the same smooth function if we fit a Fourier series to repeated copies of the data:

Note that in order to repeat, it has to dive back down at the end of each “cycle” toward those low values at the beginning of the next “cycle.” To do so, it has to exaggerate the wiggles, especially at the end. And that’s just to fit the signal, even without any noise. This is another case where the essential properties of the functions we’re using persist in the final result.

There are many ways to ameliorate this (and other) problems, but none of them entirely eliminate it. The fact remains that periodic functions have essential tendencies which persist in any Fourier-based smooth, and the problematic aspect is the behavior of the smooth near the endpoints.

It should be mentioned that for times well away from the endpoints, a polynomial smooth and a Fourier-based smooth both give oustanding results if the “time scale” (cutoff frequency for Fourier, polynomial degree for polynomials) is well chosen.

A More Generic Smooth

We’ve tried using classes of functions (polynomials, Fourier series) and restricting them to the “slow” ones in order to keep things sufficiently “smooth.” Perhaps instead we could seek some completely general smooth function which optimizes some criterion which combines both “fit” (how closely does it match the data) with “smoothness.” It’s easy to define how well it fits the data — the sum of the squares of the residuals is only the most common of many methods. But how do we define “smoothness” for some function in general?

The idea is that it’s the bending of the smooth curve that accounts for its “roughness,” and that the bending is measured by the second time derivative of the function. Of course, for that to exist the function has to be twice-differentiable, but that’s fine because we want a nice “smooth” function. It may not be “technically” smooth (infinitely differentiable) but it will at least be smooth-looking.

To measure of the goodness-of-fit (or should I say badness-of-fit), take the usual sum of the squared residuals. To measure the roughness, integrate the square of the 2nd derivative over the observed time span. Combine these two quantities into a weighted average, giving more weight to the roughness if you want an extra-smooth smooth but more weight to the badness-of-fit if you want an extra-good fit. Hence this method involves a parameter (actually it can involve many, but let’s not get into details) which controls how “smooth” the final smooth will be. This is nothing new, with polynomials we controlled smoothness by polynomial degree and with Fourier series by cutoff frequency.

Then: find the function which minimizes the weighted average of badness-of-fit and roughness. The solution turns out to be a function which is not smooth, i.e. not infinitely differentiable, but is piecewise-smooth, i.e. it’s made of a finite number of pieces which are themselves smooth. Furthermore, the pieces are joined as smoothly as possible by requiring that where they meet, they have the same value, the same derivative, and the same 2nd derivative. The result is called a spline smooth. The pieces themselves turn out to be cubic polynomials, so the smooth function is sometimes referred to as a “cubic spline.” If we apply this method to our toxicity data, with a reasonably smooth smooth we get this:

Global and Local Smooths

Fitting functions like polynomials or Fourier series to the entire set of data, and finding a function which optimizes some measure of total goodness as a spline smooth does, might be called “global” smoothing methods because they fit a smooth to the entire data set, both computationally and conceptually. However, one can also look at smoothing as a local problem, in which the value of the smooth at some particular time is determined by the data values which are nearby in time to the given moment. In the next post, we’ll take a look at some methods and issues related to local smoothing.

## Sampling Rate, part 2

“Suppose we have a signal which is band-limited, say it’s limited to the frequency band from 0 to 0.5 cycles per day,” says the engineering professor to the class in digital signal processing. “If we observe this signal at regular intervals with a sampling rate which is at least twice the bandwidth — in this case, at least once per day — then we can use Fourier analysis to reconstruct the signal. We can even interpolate it to fill in the gaps. This is one of the most common applications of Fourier analysis in the real world — we observe a signal, then use its Fourier transform either to reconstruct the signal or simply to identify its Fourier components (and therefore its physical nature).”

## Sampling Rate

T, from my mechanical engineering world we have strict rules on sampling rates vs. signal frequency rates. Ie you cannot reliably measure a 60hz ac sine wave with a 5hz analog sampling device. The result ends up being strange results that don’t show spikes well and also might not show averages well either. Can you help me understand how 120 year sampling proxies can resolve relatively high frequency temperature spikes?

This objection comes up so often from those who are accustomed to data which are evenly sampled in the time domain, and the misconception is so firmly imprinted on so many people, that it’s worth illustrating how uneven time sampling overcomes such limitations.

## Back to School

Much of what’s wrong with the online discussion of global warming is revealed by a recent reader comment on RealClimate.

Greg Goodman thinks that he’s taking climate scientists to school — he actually “lectures” the RealClimate readership about their supposed need to “dig a bit deeper” into the data on Arctic sea ice (both extent and area). He shows a graph based on some analysis which — unbeknownst to him — actually reveals that he doesn’t know what the hell he’s doing. He thinks he has established the presence of “cyclic variations” of which the climate science community is ignorant, and concludes that climate scientists are missing “important clues” about “internal fluctuations” which, of course, those inadequate computer models just can’t handle.

One would be hard pressed to find a more clear-cut example of hubris.

Climate scientists who study sea ice have been all over the data, every piece of it, but instead of making the mistakes Goodman makes they’ve been as careful and rigorous as their expertise and experience allow. They have certainly dug a whole helluva lot deeper than Greg Goodman has, or probably is capable of. It’s Goodman who needs to go back to school.

## Theil-Sen

A reader recently inquired about using the Theil-Sen slope to estimate trends in temperature data, rather than the more usual least-squares regression. The Theil-Sen estimator is a non-parametric method to estimate a slope (perhaps more properly, a “distribution-free” method) which is robust, i.e., it is resistant to the presence of outliers (extremely variant data values) which can wreak havoc with least-squares regression. It also doesn’t rely on the noise following the normal distribution, it’s truly a distribution-free method. Even when the data are normally distributed and outliers are absent, it’s still competitive with least-squares regression.