Since the subject of “ridge regression” came up in discussions on RealClimate recently, I thought I’d give a very brief description of what the heck it is.
I tried to keep the math to a minimum, but I failed. There's no getting around the fact that this is a mathematical topic, so there's lots of math anyway. But it's still only a peek at the subject — I hope it at least gives a little perspective on what ridge regression is and why it's used. Oh well, at least the "rigor police" have been banished.
In regression problems we're usually trying to estimate the parameters for some model. Perhaps the simplest case is linear regression of a single variable — we use it, for instance, to estimate the trend rate over time, as for instance of global temperature. We have a target variable, which we'll call $y$. We have a set of predictor variables, which we'll call $x_1, x_2$, etc. Our model is a linear one:

$y = \beta_1 x_1 + \beta_2 x_2 + \dots + \varepsilon,$

where $\beta_1, \beta_2$, etc., are the parameters of our model. The quantities $\varepsilon$ are random terms, which we usually (but not always) assume are white noise; in fact we often assume they're Gaussian white noise. We'd like to use observed data (observed values of the $x$'s and $y$) to estimate these parameters.
Note: I’ve left out a constant term (a $\beta_0$ term), in order to simplify the discussion which follows. This is actually justifiable if we center all the data (i.e., offset it so that its mean is zero, both the target and the predictors).
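As a quick sanity check on that claim, here's a minimal numpy sketch (the data and parameter values are made up for illustration) showing that after centering both the target and the predictor, a no-intercept fit recovers the same slope as a fit with an explicit constant term:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a nonzero intercept (illustrative values only)
n = 200
x = rng.normal(size=n)
y = 3.0 + 1.5 * x + 0.1 * rng.normal(size=n)

# Fit with an explicit intercept column
A = np.column_stack([np.ones(n), x])
beta_with_intercept = np.linalg.lstsq(A, y, rcond=None)[0]  # [beta_0, beta_1]

# Center both target and predictor, then fit with no intercept at all
xc = x - x.mean()
yc = y - y.mean()
beta_centered = (xc @ yc) / (xc @ xc)  # one-predictor least squares

print(beta_with_intercept[1], beta_centered)  # the two slopes agree
```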
Of course we expect to have more than one set of observed values (both for the predictor variables and the response variable). We can call them $x_{1t}, x_{2t}$, etc. (where $t$ indexes the values — it might be a time index, but it doesn’t have to be) and $y_t$. Then the model is that the response depends on the predictors at the same time (or whatever the index — which needn’t be a “time index” — may refer to):

$y_t = \beta_1 x_{1t} + \beta_2 x_{2t} + \dots + \varepsilon_t.$
We can arrange the predictors into a convenient matrix. If we have $n$ observations of $k$ predictors, this will be an $n \times k$ matrix $X$. It’s important to note this is not a square matrix; in fact the number of data values $n$ is usually (but not always!) larger than the number of predictors $k$. We can also arrange the parameters into a vector $\beta$. Viewing the response variable as an $n$-vector $y$, our model becomes

$y = X\beta + \varepsilon,$

where $\varepsilon$ is now a vector of the random noise in the observed data vector $y$.
Of course we still need a method to estimate the parameter vector $\beta$. The most common method is least squares regression. We find the parameter values which minimize the sum of squared residuals

$SSR = |y - X\beta|^2.$

The solution turns out to be a matrix equation

$\hat\beta = (X^T X)^{-1} X^T y,$

where $X^T$ is the transpose of the matrix $X$, and the exponent “-1” indicates the matrix inverse of the given quantity. We can even justify this in probabilistic terms, by noting that if we have the correct parameter vector $\beta$, then the vector of residuals

$e = y - X\beta$

is equal to the noise vector $\varepsilon$, and if their distribution is Gaussian then the probability density function for that vector is proportional to

$\exp\left( -|y - X\beta|^2 / (2\sigma^2) \right).$
We expect the true parameters to give us nearly the most likely (i.e., maximum probability) result, so the least-squares solution, by minimizing the sum of squared residuals, gives the maximum-likelihood values of the parameter vector $\beta$. Even when the noise isn’t Gaussian, as long as it’s white noise we know from the Gauss-Markov theorem that the least-squares estimate gives the “best linear unbiased estimator” (BLUE) of the parameters. And that’s great! It’s one of the reasons least squares is so popular. Its estimates are unbiased (the expected values of the parameters are the true values), and of all the linear unbiased estimators, it gives the least variance (most precise answer).
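The least-squares recipe above can be sketched in a few lines of numpy. The data here are simulated (all values are hypothetical), so the true parameter vector is known and we can check that the normal-equations estimate recovers it:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate a linear model y = X beta + noise (hypothetical values)
n, k = 100, 3
X = rng.normal(size=(n, k))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.05 * rng.normal(size=n)

# Least-squares estimate from the normal equations:
# beta_hat = (X^T X)^{-1} X^T y
# (solving the linear system is preferred over forming the explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_hat)  # close to beta_true
```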
There are cases, however, for which the best linear unbiased estimator is not necessarily the “best” estimator. One pertinent case occurs when two (or more) of the predictor variables are very strongly correlated. For instance, if we regress global temperature as a function of time, and of the logarithm of population, then we find that those two predictors are strongly correlated. Since both of them are increasing (in similar fashion), will the global temperature increase be attributed to one or the other (or both or neither) factor? In such cases, the matrix $X^T X$ has a determinant which is close to zero, which makes it “ill-conditioned”: the matrix can’t be inverted with as much precision as we’d like, and there’s uncomfortably large variance in the final parameter estimates.
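To see the problem concretely, here's a small simulation (with hypothetical numbers throughout) using two nearly collinear predictors. The condition number of $X^T X$ is enormous, and the least-squares estimates swing wildly from one noise realization to the next:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two nearly collinear predictors (think: "time" and something tracking time)
n = 100
t = np.linspace(0.0, 1.0, n)
x1 = t
x2 = t + 1e-3 * rng.normal(size=n)  # almost identical to x1
X = np.column_stack([x1, x2])

# X^T X is nearly singular, so its condition number is huge
cond = np.linalg.cond(X.T @ X)
print(cond)

# OLS estimates of [1, 1] vary wildly across noise realizations
beta_true = np.array([1.0, 1.0])
estimates = []
for _ in range(200):
    y = X @ beta_true + 0.1 * rng.normal(size=n)
    estimates.append(np.linalg.lstsq(X, y, rcond=None)[0])
print(np.std(estimates, axis=0))  # large spread in each coefficient
```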
One approach is to use an estimator which is no longer unbiased, but has considerably less variance than the least-squares estimator. A straightforward way to do this is “Tikhonov regularization,” also known as “ridge regression.” We add a matrix $\Gamma$ to the matrix $X^T X$, where $\Gamma$ is usually proportional to the identity matrix, i.e.

$\Gamma = \lambda I.$

It doesn’t have to be, in fact there are cases for which we specifically want a different choice, but that’s the simplest case. Essentially we’re minimizing the new quantity

$|y - X\beta|^2 + \beta^T \Gamma \beta.$
Hence we minimize a quantity which is the sum of the squared residuals, plus a term usually proportional to the sum (or often a weighted sum) of the squared parameters. Essentially, we “penalize” large values of the parameters in the quantity we’re seeking to minimize.
The solution is

$\hat\beta = (X^T X + \Gamma)^{-1} X^T y.$
Now the matrix we need to invert, $X^T X + \Gamma$, no longer has determinant near zero, so the solution does not lead to uncomfortably large variance in the estimated parameters. And that’s a good thing.
We pay a price for this. The new estimates are no longer unbiased: their expected values are not equal to the true values. Generally they tend to underestimate the true values. However, the variance of this new estimator can be so much lower than that of the least-squares estimator that the total expected mean squared error is also less — and that makes it (in a certain sense) a “better” estimator, surely a better-behaved one.
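Here is a minimal sketch of the ridge estimator, assuming the simple choice $\Gamma = \lambda I$ and a nearly collinear simulated design (all numbers hypothetical). The ridge estimate is shrunk slightly toward zero, but it is far better behaved than ordinary least squares on this design:

```python
import numpy as np

rng = np.random.default_rng(2)

# Nearly collinear design (hypothetical data)
n = 100
t = np.linspace(0.0, 1.0, n)
X = np.column_stack([t, t + 1e-3 * rng.normal(size=n)])
beta_true = np.array([1.0, 1.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    """Ridge estimate: (X^T X + lam * I)^{-1} X^T y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

beta_ols = ridge(X, y, 0.0)    # ordinary least squares (lam = 0)
beta_ridge = ridge(X, y, 0.1)  # penalized: much lower variance

print(beta_ols, beta_ridge)
```

Note that with `lam = 0.0` the same function reduces to ordinary least squares, which makes the effect of the penalty easy to compare.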
Relationship to PCA
We could take a different approach. We could perform principal components analysis (PCA) on our predictor variables, which is strongly related to singular value decomposition (SVD) of the predictor matrix $X$. SVD enables us to express the matrix as a product of terms

$X = U \Sigma V^T.$

The matrices $U$ and $V$ are unitary, and more important, the matrix $\Sigma$ is diagonal (i.e., all its off-diagonal components are zero), with diagonal elements $\sigma_j$ (the singular values). The ridge-regression solution (taking $\Gamma = \lambda I$), when viewed in terms of the decomposed predictor matrix, has the form

$\hat\beta = V D U^T y.$

The matrix $D$ is still diagonal but its diagonal elements are

$d_j = \dfrac{\sigma_j}{\sigma_j^2 + \lambda}.$
Now the interesting part: for singular values $\sigma_j$ which are large (so that $\sigma_j^2$ is large compared to the regularization parameter $\lambda$), we have $d_j \approx 1/\sigma_j$, just as in ordinary least squares, so the regularization has little effect on the result. But for singular values which are small (so that $\sigma_j^2$ is small compared to $\lambda$), we have $d_j \approx \sigma_j/\lambda$, so their influence is much reduced.
The elements with largest $\sigma_j$ are the principal components which have the largest singular values, i.e., the earliest principal components. So, ridge regression suppresses the leading principal components only slightly, but the later ones greatly. This reveals the kinship between ridge regression and another technique, which is to do a principal components analysis and simply omit the low-order principal components altogether, without any “shrinkage” of the leading principal components.
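The algebra above is easy to verify numerically. This sketch (with arbitrary simulated data) computes the ridge estimate both directly and via the SVD form, and prints the shrinkage factor $\sigma_j^2/(\sigma_j^2 + \lambda)$ applied to each principal component relative to ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(3)

# Any design matrix will do for checking the algebra (hypothetical data)
n, k = 50, 4
X = rng.normal(size=(n, k))
y = rng.normal(size=n)
lam = 2.0

# Direct ridge solution: (X^T X + lam * I)^{-1} X^T y
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# SVD form: beta = V D U^T y, with d_j = sigma_j / (sigma_j^2 + lam)
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
d = sigma / (sigma**2 + lam)
beta_svd = Vt.T @ (d * (U.T @ y))

print(np.allclose(beta_direct, beta_svd))  # the two forms agree

# Shrinkage relative to OLS (whose factors would be 1 / sigma_j):
shrinkage = d * sigma  # equals sigma_j^2 / (sigma_j^2 + lam), in (0, 1)
print(shrinkage)  # near 1 for large singular values, near 0 for small ones
```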
We could also try approaching the problem from a Bayesian perspective. Suppose the noise is Gaussian, so the likelihood function is, just as before, proportional to

$\exp\left( -|y - X\beta|^2 / (2\sigma^2) \right).$

In a Bayesian analysis we multiply the likelihood function by a prior probability for the parameters. Suppose we expect a priori that they will be small. Then our prior probability might be a Gaussian with mean value zero, and inverse variance matrix $\Gamma/\sigma^2$, i.e.

$p(\beta) \propto \exp\left( -\beta^T \Gamma \beta / (2\sigma^2) \right).$

Then the posterior probability will be proportional to

$\exp\left( -\left( |y - X\beta|^2 + \beta^T \Gamma \beta \right) / (2\sigma^2) \right).$

The factor in the exponent is proportional to

$|y - X\beta|^2 + \beta^T \Gamma \beta.$
Therefore the posterior mode, which is the most likely value within the Bayesian framework, is exactly equal to the ridge-regression estimate.
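As a numerical check of this equivalence, the sketch below (simulated data, taking $\Gamma = \lambda I$) finds the posterior mode by minimizing $|y - X\beta|^2 + \lambda |\beta|^2$ with plain gradient descent, and confirms that it matches the closed-form ridge estimate:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data; Gamma = lam * I plays the role of the prior precision
n, k = 60, 3
X = rng.normal(size=(n, k))
y = rng.normal(size=n)
lam = 1.0

# Closed-form ridge estimate
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Maximize the log-posterior numerically: minimize |y - X b|^2 + lam |b|^2.
# Half the gradient of that objective is X^T (X b - y) + lam * b.
b = np.zeros(k)
step = 1.0 / (np.linalg.norm(X, 2) ** 2 + lam)  # safe step size
for _ in range(5000):
    b -= step * (X.T @ (X @ b - y) + lam * b)

print(np.allclose(b, beta_ridge, atol=1e-6))  # posterior mode = ridge estimate
```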
The similarity of ridge regression to both PCA and to a Bayesian approach is an example of the “unity” which we often see in mathematics, where strikingly different approaches lead to the same, or similar, results. Also, taking steps to reduce the variance in our estimated parameters, even at the cost of making the estimates biased (one hopes, only slightly so), can sometimes improve the reliability of results dramatically.