It sometimes happens that we have only limited access to direct measurements of some variable of interest, but we have abundant data on some other, related variable. In such cases we can use the “other” variable as a proxy for our target variable. We attempt to determine the relationship between them, then use the measurement of one as input to that relationship in order to estimate the other. Voila! Of course such indirect estimates will be imperfect, but at least they’re an approximation, we hope a useful one. Why, such practice has even been applied to climate science.
Let x be the predictor variable (the “proxy” variable), for which we have lots of data, and let y be the “target” variable. Suppose further that there’s a strict linear relationship between them: y = α + βx.
In such a case, if we know x then we can determine y exactly, as long as we know the two defining parameters α and β. To determine those, all we need is at least two cases for which we know both x and y. Then the proxy x is a perfect proxy for the target y, which we can determine without error. How wonderful! It’s also unusual, or as they say in these parts, “T’aint likely.”
A much more common case is that we can’t directly observe the target variable without some kind of noise entering into the mix; what we actually observe is z = y + ε, which includes some random variation ε.
We expect ε to be a random variable, and usually expect it to have zero mean. It might be measurement error, it might be a random physical process which contributes to z, but it’s in addition to the definitive value y, which is a simple consequence of the influence of x. We now have the model

z = α + βx + ε,
where ε is random. We often even assume that the random term follows a normal distribution with zero mean and unknown standard deviation. We can still estimate the parameters α and β from samples of x and z, by regression (usually least-squares regression). The least-squares regression also enables us to estimate the standard deviation of ε (which we’ll call σ_ε) as well.
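To make this concrete, here’s a minimal Python sketch of the calibration step, writing the model as z = α + βx + ε. The true values used here (α = 2, β = 3, σ_ε = 0.5) are illustrative assumptions, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_true, beta_true, sigma_true = 2.0, 3.0, 0.5   # assumed for illustration

x = rng.uniform(-1.0, 1.0, size=1000)               # abundant proxy data
z = alpha_true + beta_true * x + rng.normal(0.0, sigma_true, size=1000)

# Least-squares fit: polyfit returns (slope, intercept) for degree 1
beta_hat, alpha_hat = np.polyfit(x, z, deg=1)

# Residual standard deviation estimates sigma_eps (n - 2 degrees of freedom)
resid = z - (alpha_hat + beta_hat * x)
sigma_hat = np.sqrt(np.sum(resid**2) / (len(x) - 2))

print(alpha_hat, beta_hat, sigma_hat)   # each lands near its assumed true value
```

With a thousand calibration points the estimates of α, β, and σ_ε all land close to the assumed truth, which is all the “proxy calibration” step requires.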
Knowing all those parameters, if we have a value of x for which we don’t know the corresponding values of y and z, we can estimate them using our model. We can use x as a proxy for y, but it’s no longer a perfect proxy. That’s not because the relation between x and y is imperfect, but because our estimates of the parameters α and β are imperfect (since they’re based on noisy data). But at least the estimated values of α and β are unbiased, i.e., we expect to get the correct result; the imperfection is only random and doesn’t preferentially lean one way or the other.
We can also use x as a proxy for z, in which case we can compute a probability distribution for z. The variance of z will be the variance of the random component ε, while the expected value of z will be y = α + βx (which we estimate from x).
An example may clarify. I generated x values following a uniform distribution from -1 to +1, used a strict linear relationship y = α + βx (so the correct parameters are the intercept α and the slope β), and added noise ε to the y values to get z values; here’s a scatterplot of z against x (the solid line represents the correct relationship y = α + βx):
If we fit a line by least-squares regression we get a nearly perfect estimate of the correct relationship:
We can even confirm that the relationship is at least approximately linear by computing averages over small intervals of x:
Clearly x and z are (at least approximately) linearly related, and clearly our regression line is almost exactly the true relationship. There’s no problem using x as a proxy for y or z; everything is hunky-dory and peachy-keen.
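That binned-averages check can be sketched in Python too: bin the x values, average z within each bin, and compare the bin means to the assumed true line (all parameter values here are illustrative, matching the earlier sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 5000)
z = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, 5000)   # assumed true relationship

edges = np.linspace(-1.0, 1.0, 11)               # ten bins of width 0.2
centers = 0.5 * (edges[:-1] + edges[1:])
means = np.array([z[(x >= lo) & (x < hi)].mean()
                  for lo, hi in zip(edges[:-1], edges[1:])])

# If the relationship is linear, the bin means hug the line 2 + 3*x
print(np.max(np.abs(means - (2.0 + 3.0 * centers))))
```

The maximum deviation of the bin means from the true line is on the order of the standard error of each bin mean, which is what “at least approximately linear” looks like numerically.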
Now let’s turn things around a bit. Let’s suppose there’s no noise in the response variable, so we get to observe y itself instead of the noisy version z. But let’s also suppose that there is noise in the predictor variable, so instead of observing x we observe

w = x + η,
where η is a random component (possibly genuine signal, perhaps just measurement error) in the predictor variable which follows a normal distribution with mean zero and variance σ_η². Here’s some sample data according to this prescription:
As expected, it looks a lot like the previous artificial data, except now we’ve added noise to x to give w, rather than adding noise to y to get z.
If we now estimate the relationship by fitting a line with linear regression, we get an estimate which is noticeably different from the true relationship between x and y:
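This shrinking of the fitted slope is the classical “regression dilution” effect: with w = x + η, the least-squares slope of y on w is pulled toward zero by roughly the factor var(x) / (var(x) + var(η)). A quick Python sketch, with all numbers invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
beta_true = 3.0
x = rng.uniform(-1.0, 1.0, 10000)       # var(x) = 1/3 for uniform(-1, 1)
y = 2.0 + beta_true * x                 # noise-free response
w = x + rng.normal(0.0, 0.5, 10000)     # noisy predictor, var(eta) = 0.25

slope_fit = np.polyfit(w, y, 1)[0]
attenuation = (1/3) / (1/3 + 0.25)      # expected shrinkage factor

print(slope_fit, beta_true * attenuation)   # both well below the true slope 3
```

The fitted slope comes out close to the attenuated value β · var(x)/(var(x) + var(η)) rather than the true β, which is exactly the mismatch visible in the plot.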
The slope of our regression line is significantly less than the slope of the true-relationship line. Does this mean that we should abandon this regression if we want to use w as a proxy for y?
Not necessarily! Let’s again compute averages of the y values over small intervals of w:
Note that the averages no longer follow a straight line. As we get nearer to the edges of the observed interval, the relationship between “average y” and w deviates more and more from the relationship between y and x.
And that’s as it should be, because w no longer represents a direct measurement of x. Instead it’s a combination of x and the random term η. Suppose, for instance, that x is a person’s true height, η is measurement error, and w is the measured or estimated height. Then an observed value of 8 feet tall probably does not represent a true height of 8 feet. It’s more likely that the true height is on the very tall side but the measurement error is also positive, so the most likely value of x is a bit less than 8 feet. In that case, the most likely value of y will be less than it would be if x were truly 8 feet tall.
This emphasizes that when we use one variable as a proxy for another, what we really get is the conditional probability for the target, given the proxy. When w is 8 feet, the measurement error is more likely to be positive than negative, because we know a priori that it’s much more likely for people to be shorter than 8 feet than to be taller than that. Essentially we have a prior distribution of x values which informs our best estimate of x given w, which in turn informs our best estimate of y (or z), given w.
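For a normal prior and normal measurement error this shrinkage has a simple closed form: E[x | w] = μ + τ²/(τ² + σ²) · (w - μ). A small worked example in Python, with heights in inches; the prior mean, prior spread, and error size are all made-up numbers:

```python
# Posterior mean of true height x given measurement w, under the assumed
# model x ~ N(mu, tau^2) and w = x + eta with eta ~ N(0, sigma^2).
mu, tau, sigma = 69.0, 3.0, 2.0   # assumed prior mean/sd and measurement sd
w = 96.0                          # a measured height of 8 feet

shrink = tau**2 / (tau**2 + sigma**2)
posterior_mean = mu + shrink * (w - mu)

print(posterior_mean)   # well below 96: best estimate is "less than 8 feet"
```

The measured 96 inches gets pulled strongly back toward the prior mean, because an 8-foot reading is far more plausibly “tall person plus positive error” than a genuine 8-footer.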
In such a case, the regression line is a good linear estimate of the conditional expectation of y (or z) given w. If we then had samples of w to use as a proxy in order to estimate y or z, then the regression line is the right choice if (and only if) the proxy values are drawn from the same distribution as the calibration data.
In the previous example, we’d use the regression line to interpret w as a proxy, in spite of its significant deviation from the true relationship between x and y. This is because our regression line takes into account the prior information, namely, the conditional probability of x given w.
But suppose that we now want to use data as a proxy, but the new values are not drawn from the same distribution as those we used to determine the regression line (to calibrate the proxy). We might have used, say, the volume of a rock as a proxy for its weight. To calibrate that proxy, we have a near-perfect device for measuring weight but our volume measurements are imperfect guesses. And, we calibrated our proxy using only rocks from 2000 to 4000 cubic centimeters (cm3), but we’ll use our proxy relationship to estimate the weight of rocks from 200 to 400 cm3. Here’s the calibration data, with the true relationship shown as a black line and the regression-fit relationship as a red line:
Here’s the calibration data together with the new data (plotted as estimated volumes and true weights), with both the true-relationship line and the regression line extrapolated to the much lower-volume range:
Clearly the proxy is off: it overestimated weights because the regression fit underestimated the slope of the true-relationship line. And since our new data are drawn from a different distribution than the calibration data, the regression line no longer represents the conditional expectation of weight, given the estimated volume.
What we really want in this case is to get the slope of the true-relationship line; only then is it valid to extrapolate so far outside the range of the calibration data. But how would we get that?
If (and only if!) we know a priori that the weight data used for calibration are noise-free (or so near to noise-free that we can safely make that assumption), then we can find the true-relationship slope by inverse regression. Instead of regressing weight on volume, we regress volume on weight. That gives us this:
Note that the regression line is so near the true-relationship line that they’re indistinguishable. If we use the inverse-regression line to extrapolate to the new range, we get this:
Again everything is hunky-dory.
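Under the stated assumption (noise-free weights, noisy volume estimates), the inverse-regression trick can be sketched like this; the density and noise level are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
beta_true = 2.7                               # assumed: weight = 2.7 * volume
vol = rng.uniform(2000.0, 4000.0, 2000)       # true volumes, calibration range
weight = beta_true * vol                      # noise-free weights
vol_est = vol + rng.normal(0.0, 300.0, 2000)  # imperfect volume guesses

# Direct regression of weight on estimated volume underestimates the slope
slope_direct = np.polyfit(vol_est, weight, 1)[0]

# Inverse regression: regress volume on weight, then invert the fitted slope
slope_inverse = 1.0 / np.polyfit(weight, vol_est, 1)[0]

print(slope_direct, slope_inverse)
```

The direct fit is attenuated, but because weight is noise-free the regression of volume on weight is unbiased, so inverting its slope recovers the true-relationship slope and makes the extrapolation safe.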
Unfortunately things are rarely so simple. The foregoing assumes that the target variable is noise-free, but that’s rarely the case (’tain’t likely). And then there’s the much more realistic situation in which there might be a strict relationship between x and y, but we don’t get to observe either one noise-free. Instead we observe w = x + η and z = y + ε, where both η and ε are random. They might be zero-mean and normally distributed, but both our variables have noise. And of course, we don’t want to restrict our proxy to values drawn from the same distribution as the calibration data; we need to extrapolate it to unknown territory, so we really need to estimate the true-relationship parameters.
That particular problem is known as “errors-in-variables” regression. If we know the ratio of the variances σ_η² and σ_ε² of the proxy noise and the target noise, then we can estimate the true-relationship slope by a process called total least squares. Ordinary least squares finds the coefficients which minimize the sum of squared errors in the target:

SSE = Σ (z_i - ẑ_i)²,

where ẑ_i are the model values. Total least squares minimizes the total sum of squared errors in both variables:

SST = Σ [ (z_i - ẑ_i)² / σ_ε² + (w_i - ŵ_i)² / σ_η² ],

where ŵ_i and ẑ_i are the model values of the predictor and the target.
To solve this problem we only need to know the ratio of σ_ε² to σ_η². But we still need a priori information about them. Sometimes such information is available. We might, for instance, have multiple measurements of w and z for the same physical case, which enables us to estimate those variances. But it’s more usual that such is not the case. One might even say, ’tain’t likely.
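With the ratio δ = σ_ε²/σ_η² known, the errors-in-variables slope has a closed form (Deming regression, a special case of total least squares). A Python sketch with invented noise levels:

```python
import numpy as np

rng = np.random.default_rng(4)
beta_true = 3.0
x = rng.uniform(-1.0, 1.0, 20000)                      # noise-free predictor
w = x + rng.normal(0.0, 0.4, 20000)                    # observed proxy
z = 2.0 + beta_true * x + rng.normal(0.0, 0.2, 20000)  # observed target

delta = 0.2**2 / 0.4**2              # assumed known noise-variance ratio
sxx = np.var(w, ddof=1)
syy = np.var(z, ddof=1)
sxy = np.cov(w, z)[0, 1]

# Deming-regression slope (closed form for the total-least-squares fit)
d = syy - delta * sxx
slope_tls = (d + np.sqrt(d**2 + 4.0 * delta * sxy**2)) / (2.0 * sxy)
slope_ols = sxy / sxx                # ordinary least squares, attenuated

print(slope_tls, slope_ols)
```

The ordinary least-squares slope is attenuated as before, while the Deming slope recovers the true-relationship slope, provided the assumed variance ratio is right.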
However, sometimes we don’t even have a clue about the variance ratio. Even then, a valid estimate is possible in some cases. It sometimes works to use a slope estimate other than the usual (least-squares) value, based on higher moments of the distribution of the observed values. But there are cases where even that doesn’t work: if, for instance, the distribution of the noise-free variable x is normal, and the noise variables η and ε are also normally distributed, then there’s just no way to identify, from the raw data alone, the correct ratio of the noise variances or the slope of the true-relationship line. Fortunately, the “insoluble case” is pretty rare too.
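One such higher-moment estimator works when the noise-free predictor’s distribution is skewed (nonzero third central moment). For centered observations w̃ and z̃, the identities E[w̃ z̃²] = β² μ₃ and E[w̃² z̃] = β μ₃ hold under the errors-in-variables model, so their ratio estimates β with no knowledge of the noise variances at all. A sketch with a skewed (exponential) predictor and invented noise levels:

```python
import numpy as np

rng = np.random.default_rng(5)
beta_true = 3.0
x = rng.exponential(1.0, 200000)                        # skewed true predictor
w = x + rng.normal(0.0, 0.5, 200000)                    # noisy predictor
z = 2.0 + beta_true * x + rng.normal(0.0, 0.5, 200000)  # noisy target

wc = w - w.mean()
zc = z - z.mean()

# Ratio of third-order cross moments; fails when E[x~^3] = 0 (e.g. normal x)
slope_moments = np.mean(wc * zc**2) / np.mean(wc**2 * zc)

print(slope_moments)
```

This is exactly the insoluble case in miniature: swap the exponential predictor for a normal one and the third moments vanish, leaving the ratio with nothing to estimate.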
And what does this have to do with climate science? There’s a new paper by Ammann et al. (Climate of the Past, 6, 273-279, 2010) which proposes a method to improve the proxy calibration by using part of the data to compute a correction factor. Different parts are used for the correction factor calculation, a la cross-validation. The net result is that the method indicates that the amount of variance in past climate is a bit greater than previously estimated based on proxy data.