Regression

It sometimes happens that we have only limited access to direct measurements of some variable of interest, but we have abundant data on some other, related variable. In such cases we can use the “other” variable as a proxy for our target variable. We attempt to determine the relationship between them, then use the measurement of one as input to that relationship in order to estimate the other. Voila! Of course such indirect estimates will be imperfect, but at least they’re an approximation, we hope a useful one. Why, such practice has even been applied to climate science.


Let X be the predictor variable (the “proxy” variable), for which we have lots of data, and let Y be the “target” variable. Suppose further that there’s a strict linear relationship between them

Y = \alpha + \beta X.

In such a case, if we know X then we can determine Y exactly, as long as we know the two defining parameters \alpha and \beta. To determine those, all we need is at least two cases for which we know both X and Y. Then the proxy X is a perfect proxy for the target Y, which we can determine without error. How wonderful! It’s also unusual, or as they say in these parts, “T’aint likely.”
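
Explicitly, given two such cases (X_1, Y_1) and (X_2, Y_2) with X_1 \ne X_2, the parameters follow immediately from

\beta = {Y_2 - Y_1 \over X_2 - X_1}, \qquad \alpha = Y_1 - \beta X_1.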

A much more common case is that we can’t directly observe the target variable Y without some kind of noise entering into the mix; what we actually observe is y which includes some random variation v

y = Y + v.

We expect v to be a random variable, and usually expect it to have zero mean. It might be measurement error, it might be a random physical process which contributes to y, but it’s in addition to the definitive value Y which is a simple consequence of the influence of X. We now have the model

y = Y + v = \alpha + \beta X + v,

where v is random. We often even assume that the random term v follows a normal distribution with zero mean and unknown standard deviation. We can still estimate the parameters \alpha and \beta from samples of X and y, by regression (usually least-squares regression). Least-squares regression also enables us to estimate the standard deviation of v (which we’ll call \sigma_v).
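
For the record (this is just the standard least-squares recipe, nothing specific to proxies), with n paired samples (X_j, y_j) the estimates are

\hat \beta = {\sum (X_j - \bar X)(y_j - \bar y) \over \sum (X_j - \bar X)^2}, \qquad \hat \alpha = \bar y - \hat \beta \bar X,

and the noise level is estimated from the residuals,

\hat \sigma_v^2 = {1 \over n-2} \sum (y_j - \hat \alpha - \hat \beta X_j)^2.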

Knowing all those parameters, if we have a value for X for which we don’t know the corresponding values of Y and y, we can estimate them using our model. We can use X as a proxy for Y, but it’s no longer a perfect proxy — not because the relation between X and Y is imperfect, but because our estimates of the parameters \alpha and \beta are imperfect (since they’re based on noisy data). But at least the estimated values of \alpha and \beta are unbiased, i.e., we expect to get the correct result on average; the imperfection is only random and doesn’t preferentially lean one way or the other.

We can also use X as a proxy for y, in which case we can compute a probability distribution for y. The variance of y will be the variance \sigma_v^2 of the random component v, while the expected value of y will be Y (which we estimate from Y = \alpha + \beta X).

An example may clarify. I generated X values following a uniform distribution from -1 to +1, used the relationship Y = X (so the correct parameters are \alpha=0 and \beta=1), and added noise to the Y values to get y values; here’s a scatterplot of X against y (the solid line represents the correct relationship Y=X):

If we fit a line by least-squares regression we get a nearly perfect estimate of the correct relationship:

We can even confirm that the relationship is at least approximately linear by computing averages over small intervals of X:

Clearly X and Y are (at least approximately) linearly related, and clearly our regression line is almost exactly the true relationship. There’s no problem using X as a proxy for Y or y; everything is hunky-dory and peachy-keen.
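
For readers who want to reproduce something like this, here’s a minimal sketch in Python (numpy); the sample size, the noise level of 0.3, and the random seed are my own choices, not necessarily what was used for the plots above:

    import numpy as np

    rng = np.random.default_rng(42)
    n = 500
    X = rng.uniform(-1.0, 1.0, n)        # predictor, observed without noise
    Y = X                                # strict relationship: alpha = 0, beta = 1
    y = Y + rng.normal(0.0, 0.3, n)      # noisy observation of the target

    # least-squares fit of y on X (np.polyfit returns the slope first, then the intercept)
    beta_hat, alpha_hat = np.polyfit(X, y, 1)

    # residual-based estimate of sigma_v
    resid = y - (alpha_hat + beta_hat * X)
    sigma_v_hat = np.sqrt(np.sum(resid**2) / (n - 2))
    print(alpha_hat, beta_hat, sigma_v_hat)   # close to 0, 1, and 0.3

    # averages of y over small intervals of X, to check that the relation is linear
    edges = np.linspace(-1.0, 1.0, 11)
    bin_means = [y[(X >= lo) & (X < hi)].mean() for lo, hi in zip(edges[:-1], edges[1:])]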

Imperfect Proxies

Now let’s turn things around a bit. Let’s suppose there’s no noise in the response variable, so we get to observe Y instead of the noisy version y. But let’s also suppose that there is noise in the predictor variable, so instead of observing X we observe

x = X + u,

where u is a random component (possibly genuine signal, perhaps just measurement error) in the predictor variable which follows a normal distribution with mean zero and variance \sigma_u^2. Here’s some sample data according to this prescription:

As expected, it looks a lot like the previous artificial data, except now we’ve added noise to X to give x, rather than adding noise to Y to get y.

If we now estimate the relationship by fitting a line with linear regression, we get an estimate which is noticeably different from the true relationship between X and Y:

The slope of our regression line is significantly less than the slope of the true relationship line. Does this mean that we should abandon this regression if we want to use x as a proxy for Y?

Not necessarily! Let’s again compute averages of the Y values over small intervals of x:

Note that the averages no longer follow a straight line. As we get nearer to the edges of the observed interval, the relationship between “average Y” and x deviates more and more from the relationship between Y and X.
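
A quick simulation shows both effects, the shrunken slope and the bending near the edges of the observed range (the noise level \sigma_u = 0.5 is, again, my own choice for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000
    X = rng.uniform(-1.0, 1.0, n)        # true predictor
    Y = X                                # noise-free target
    x = X + rng.normal(0.0, 0.5, n)      # what we actually observe: x = X + u

    # regression of Y on the noisy x: the fitted slope comes out well below 1
    beta_hat, alpha_hat = np.polyfit(x, Y, 1)

    # the expected attenuation factor Var(X) / (Var(X) + sigma_u^2)
    attenuation = X.var() / (X.var() + 0.5**2)
    print(beta_hat, attenuation)          # both roughly 0.57, not 1

    # averages of Y over small intervals of x bend away from the line Y = X near the edges
    edges = np.linspace(x.min(), x.max(), 13)
    bin_means = [Y[(x >= lo) & (x < hi)].mean() for lo, hi in zip(edges[:-1], edges[1:])]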

And that’s as it should be, because x no longer represents a direct measurement of X. Instead it’s a combination of X and the random term u. Suppose, for instance, that X is a person’s true height, u is measurement error, and x is the measured or estimated height. Then an observed value x of 8 feet tall probably does not represent a true height X of 8 feet tall. It’s more likely that the true height is on the very tall side and that the measurement error is also positive, so the most likely value of X is a bit less than 8 feet. In that case, the most likely value of Y will be less than it would be if X were truly 8 feet tall.

This emphasizes that when we use one variable as a proxy for another, what we really get is the conditional probability for the target, given the proxy. When x is 8 feet, the measurement error is more likely to be positive than negative because we know a priori that it’s much more likely for people to be shorter than 8 feet than it is for them to be taller than that. Essentially we have a prior distribution of X values which informs our best estimate of X given x, which in turn informs our best estimate of Y or y, given x.

In such a case, the regression line is a good linear estimate of the conditional expectation of Y (or y) given x. If we then have samples of x to use as a proxy to estimate Y or y, the regression line is the right choice if (and only if) the proxy values are drawn from the same distribution as the calibration data.
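
To make that concrete, suppose (purely for illustration, this is an assumption about the prior, not something the data hand us) that X is normally distributed with mean \mu_X and variance \sigma_X^2, and that the noise u is normal with variance \sigma_u^2. Then

E[X | x] = \mu_X + {\sigma_X^2 \over \sigma_X^2 + \sigma_u^2} (x - \mu_X),

so the best estimate of Y given x is \alpha + \beta E[X|x], a line in x whose slope is \beta shrunk by the factor \sigma_X^2 / (\sigma_X^2 + \sigma_u^2). That same factor attenuates the least-squares slope whatever the distribution of X; with the uniform X of the example above the conditional expectation isn’t exactly a straight line, which is why the binned averages bend near the edges.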

Terra Incognita

In the previous example, we’d use the regression line to interpret x as a proxy, in spite of its significant deviation from the relationship involving X. This is because our regression line takes into account the prior information, namely, the conditional probability for X given x.

But suppose that we now want to use x data as a proxy, but the new x values are not drawn from the same distribution as those we used to determine the regression line (to calibrate the proxy). We might have used, say, the volume of a rock as a proxy for its weight. To calibrate that proxy, we have a near-perfect device for measuring weight but our volume measurements are imperfect guesses. And, we calibrated our proxy using only rocks from 2000 to 4000 cubic centimeters (cm^3), but we’ll use our proxy relationship to estimate the weight of rocks from 200 to 400 cm^3. Here’s the calibration data, with the true relationship shown as a black line and the regression-fit relationship as a red line:

Here’s the calibration data together with the new data (plotted as estimated volumes and true weights), with both the true-relationship line and the regression line extrapolated to the much lower-volume range:

Clearly the proxy is off — it overestimated weights because the regression fit underestimated the slope of the true-relationship line. And since our new data are drawn from a different distribution than the calibration data, the regression line no longer represents the conditional expectation of weight, given the estimated volume.

What we really want in this case is to get the slope of the true-relationship line; only then is it valid to extrapolate so far outside the range of the calibration data. But how would we get that?

If (and only if!) we know a priori that the weight data used for calibration are noise-free (or so near to noise-free that we can safely make that assumption), then we can find the true-relationship slope by inverse regression. Instead of regressing weight on volume, we regress volume on weight. That gives us this:

Note that the regression line is so near the true-relationship line that they’re indistinguishable. If we use the inverse-regression line to extrapolate to the new range, we get this:

Again everything is hunky-dory.
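
Here’s a sketch of the whole rock exercise in Python; the density of 2.7 g/cm^3, the noise level, and the sample size are my own stand-ins, since the actual numbers behind the plots aren’t given:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    V = rng.uniform(2000.0, 4000.0, n)     # true volumes of the calibration rocks, in cm^3
    W = 2.7 * V                            # true (noise-free) weights in grams
    v_obs = V + rng.normal(0.0, 300.0, n)  # imperfect volume estimates

    # direct regression of weight on estimated volume: the slope is attenuated
    b_direct, a_direct = np.polyfit(v_obs, W, 1)

    # inverse regression: regress the noisy volume on the noise-free weight, then invert
    c1, c0 = np.polyfit(W, v_obs, 1)
    b_inv = 1.0 / c1
    a_inv = -c0 / c1
    print(b_direct, b_inv)                 # b_inv is close to 2.7; b_direct is noticeably smaller

    # extrapolate both fits down to the new range, 200 to 400 cm^3
    v_new = np.linspace(200.0, 400.0, 5)
    print(a_direct + b_direct * v_new)     # overestimates the true weights 2.7 * v_new
    print(a_inv + b_inv * v_new)           # close to the truth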

Double Trouble

Unfortunately things are rarely so simple. The foregoing assumes that the target variable is noise-free, but that’s rarely the case (’tain’t likely). And then there’s the much more realistic situation in which there might be a strict relationship between X and Y, but we don’t get to observe either one noise-free. Instead we observe x=X+u and y=Y+v, where both u and v are random. They might be zero-mean and normally distributed, but both our variables have noise. And of course, we don’t want to restrict our proxy to values drawn from the same distribution as the calibration data; we need to extrapolate it to unknown territory, so we really need to estimate the true-relationship parameters.

That particular problem is known as “errors-in-variables” regression. If we know the ratio of the variances \sigma_u^2 and \sigma_v^2 of the proxy noise and the target noise, then we can estimate the true-relationship slope by a process called total least squares. Ordinary least squares finds the coefficients which minimize the sum of squared errors in the target:

SSE = \sum (y_j - \hat y_j)^2,

where \hat y_j are the model values. Total least squares minimizes the total sum of squared errors

TSE = \sum {(y_j - \hat y_j)^2 \over \sigma_v^2} + \sum {(x_j - \hat x_j)^2 \over \sigma_u^2}.

To solve this problem we only need to know the ratio of \sigma_v to \sigma_u. But we still need a priori information about them. Sometimes such information is available. We might, for instance, have multiple measurements of x and y for the same physical case, which enables us to estimate those variances. But it’s more usual that such is not the case. One might even say, ’tain’t likely.
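
When the ratio is known, the minimization has a closed-form solution (this is often called Deming regression). Here’s a sketch, with noise levels of my own choosing:

    import numpy as np

    def eiv_slope(x, y, ratio):
        # ratio = sigma_v^2 / sigma_u^2, target-noise variance over proxy-noise variance
        sxx = np.var(x, ddof=1)
        syy = np.var(y, ddof=1)
        sxy = np.cov(x, y)[0, 1]
        return (syy - ratio * sxx
                + np.sqrt((syy - ratio * sxx)**2 + 4.0 * ratio * sxy**2)) / (2.0 * sxy)

    rng = np.random.default_rng(2)
    n = 1000
    X = rng.uniform(-1.0, 1.0, n)
    x = X + rng.normal(0.0, 0.4, n)        # noisy proxy, sigma_u = 0.4
    y = X + rng.normal(0.0, 0.2, n)        # noisy target, sigma_v = 0.2; true slope is 1

    print(np.polyfit(x, y, 1)[0])          # ordinary least squares: attenuated, about 0.68
    print(eiv_slope(x, y, (0.2 / 0.4)**2)) # close to the true slope of 1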

However, sometimes we don’t even have a clue about the variance ratio. Even then, we can sometimes get a valid estimate: it sometimes works to use a slope estimate other than the usual (least-squares) value, based on higher moments of the distribution of the observed values. But there are cases where even that doesn’t work. If, for instance, the distribution of the noise-free variable X is normal, and the noise variables u and v are also normally distributed, then there’s simply no way to identify, from the raw data alone, either the correct ratio of the noise variances or the slope of the true-relationship line. Fortunately, the “insoluble case” is pretty rare too.

And what does this have to do with climate science? There’s a new paper by Ammann et al. (Climate of the Past, 6, 273-279, 2010) which proposes a method to improve the proxy calibration by using part of the data to compute a correction factor. Different parts are used for the correction factor calculation, a la cross-validation. The net result is that the method indicates that the amount of variance in past climate is a bit greater than previously estimated based on proxy data.

10 responses to “Regression”

  1. I was just trying to show someone over on WUWT why a 50% (0.5) correlation coefficient was not “pathetic” but actually good enough to break out the champagne.

  2. I think I got something out of this. I use regression frequently, and I assume that the predictor variable is known without error, even though I’m sure it’s not. I’ll read this again to try and understand what I’m supposed to do about it. I liked the example of the measurement of 8 ft tall. We’ve had several such extreme measurements at work recently, and I think we’ve tended to assume normal error around them.

  3. Tony O'Brien

    They speak of attenuation in the results, which is not really surprising. Even floating averages hide the true rate of change. Add in lags, thresholds, and seasonal timing, not to mention other factors, and I doubt I could have got much of a signal at all.

    The more we understand the past, the better we will understand the future. (Just as well I am not a statistician.)

    Thank you for explaining it in a way I can almost understand.

  4. Well done, Tamino. You made this analysis comprehensible to an utter novice in statistics.

  5. The Ammann et al. article is clearly written and worth reading. Tamino’s purpose wasn’t immediately clear to me until I went on to read the article, and realized that I’d been led to that door by a well-laid path of stepping stones.

    I knew that errors in variables tend to attenuate (bias towards zero) regression slopes, but hadn’t thought about how this might affect proxy temperature reconstructions until staring at Ammann’s Figure 2.

    Thanks.

  6. “Correcting for signal attenuation from noisy proxy data in climate reconstructions” (Ammann et al.)

    Perhaps a better title would be

    “Progression of regression for progression* or regression**? ”

    *”correction for bias in the signal amplitude”

    **”increased variance”

  7. Jacob Mack

    Dead on, Tamino. Very good explanation of important statistical methods.

  8. Good read related to this topic:

    [link: ch1.pdf]
