Ridge Regression

Since the subject of “ridge regression” came up in discussions on RealClimate recently, I thought I’d give a very brief description of what the heck it is.

I tried to keep the math to a minimum, but I failed. There’s no getting around that fact that this is a mathematical topic so there’s lots of math anyway. But it’s still only a peek at the subject — I hope that it at least gives a little perspective on what ridge regression is and why it’s used. Oh well, at least the “rigor police” have been banished.

In regression problems we’re usually trying to estimate the parameters for some model. Perhaps the simplest case is linear regression of a single variable — we use it, for instance, to estimate the trend rate over time, as for instance of global temperature. We have a target variable, which we’ll call y. We have a set of predictor variables, which we’ll call x_1, x_2, etc. Our model is a linear one:

y = \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon,

where \beta_1, \beta_2, \beta_3, etc., are the parameters of our model. The quantities \epsilon are random terms, which we usually (but don’t always) assume are white noise, in fact we often assume they’re Gaussian white noise. We’d like to use observed data (observed values of the x‘s and y) to estimate these parameters.

Note: I’ve left out a constant term (a \beta_o term), in order to simplify the discussion which follows. This is actually justifiable if we center all the data (i.e., offset it so that its mean is zero, both the target and the predictors).

Of course we expect to have more than one set of observed values (both for the predictor variables and the response variable). We can call them y_t (where t indexes the values — it might be a time index, but it doesn’t have to be) and x_{t1}, x_{t2}, etc. Then the model is that the response y_t depends on the predictors at the same time (or whatever the index — which needn’t be a “time index” — may refer to)

y_t = \beta_1 x_{t1} + \beta_2 x_{t2} + ... + \epsilon_t.

We can arrange the predictors into a convenient matrix. If we have n observations of k predictors, this will be an n \times k matrix {\bf X}. It’s important to note this is not a square matrix, in fact the number of data values n is usually (but not always!) larger than the number of predictors k. We can also arrange the k parameters into a vector \beta. Viewing the response variable as an n-vector, our model becomes

y = {\bf X} \beta + \epsilon.

where \epsilon is now a vector of the random noise in the observed data vector y.

Of course we still need a method to estimate the parameter vector \beta. The most common method is least squares regression. We find the parameter values which minimize the sum of squared residuals

SSR = \sum_t |y - {\bf X} \beta|^2 .

The solution turns out to be a matrix equation

\beta = ({\bf X^T X})^{-1} {\bf X^T} y,

where {\bf X^T} is the transpose of the matrix {\bf X}, and the exponent “-1” indicates the matrix inverse of the given quantity. We can even justify this in probabilistic terms, by noting that if we have the correct parameter vector \beta, then the vector of residuals

R = y - {\bf X} \beta = \epsilon,

is equal to the noise vector \epsilon, and if their distribution is Gaussian then the probability density function for that vector is proportional to

e^{-\frac{1}{2} ||R||^2}.

We expect the true parameters to give us nearly the most likely (i.e., maximum probability) result, so the least-squares solution, by minimzing the sum of squared residuals, gives the maximum-likelihood values of the parameter vector \beta. Even when the noise isn’t Gaussian, as long as it’s white noise we know from the Gauss-Markov theorem that the least-squares estimate gives the “best linear unbiased estimator” (BLUE) of the parameters. And that’s great! It’s one of the reasons least squares is so popular. Its estimates are unbiased (the expected values of the parameters are the true values), and of all the unbiased estimators, it gives the least variance (most precise answer).

There are cases, however, for which the best linear unbiased estimator is not necessarily the “best” estimator. One pertinent case occurs when the two (or more) of the predictor variables are very strongly correlated. For instance, if we regress global temperature as a function of time, and of the logarithm of population, then we find that those two predictors are strongly correlated. Since both of them are increasing (in similar fashion), will the global temperature increase be due to one or the other (or both or neither) factor? In such cases, the matrix {\bf X^T X} has a determinant which is close to zero, which makes it “ill-conditioned” so the matrix can’t be inverted with as much precision as we’d like, there’s uncomfortably large variance in the final parameter estimates.

One approach is to use an estimator which is no longer unbiased, but has considerably less variance than the least-squares estimator. A straightforward way to do this is “Tikhonov regularization,” also known as “ridge regression.” We add a matrix \Gamma^T \Gamma to the matrix {\bf X^T X}, where \Gamma is usually proportional to the identity matrix, i.e.

\Gamma = \lambda {\bf I}.

It doesn’t have to be, in fact there are cases for which we specifically want a different choice, but that’s the simplest case. Essentially we’re minimizing the new quantity

||y - {\bf X} \beta||^2 + || \Gamma \beta ||^2.

Hence we minimize a quantity which is the sum of the squared residuals, plus a term usually proportional to the sum (or often a weighted sum) of the squared parameters. Essentially, we “penalize” large values of the paramters in the quantity we’re seeking to minimize.

The solution is

\beta = ({\bf X^T X} + \Gamma^T \Gamma)^{-1} {\bf X^T} y.

Now the matrix we need to invert no longer has determinant near zero, so the solution does not lead to uncomfortably large variance in the estimated parameters. And that’s a good thing.

We pay a price for this. The new estimates are no longer unbiased, their expected values are not equal to the true values. Generally they tend to underestimate the true values. However, the variance of this new estimate can be so much lower than that of the least-squares estimator, that the total expected mean squared error is also less — and that makes it (in a certain sense) a “better” estimator, surely a better-behaved one.

Relationship to PCA

We could take a different approach. We could perform principal components analysis (PCA) on our predictor variables, which is strongly related to singular value decomposition (SVD) of the predictor matrix {\bf X}. SVD enables us to express the matrix as a product of terms

{\bf X} = {\bf UDV^T}.

The matrices {\bf U} and {\bf V} are unitary, and more important, the matrix {\bf D} is diagonal (i.e., all its off-diagonal components are zero), with diagnoal elements d_j. The ridge-regression solution, when viewed in terms of the decomposed predictor matrix, has the form

\beta = {\bf V \tilde D U^T} y.

The matrix {\bf \tilde D} is still diagonal but its diagonal elements are

\tilde d_j = d_j / (d_j^2 + \lambda^2).

Now the interesting part: for elements d_j which are large compared to the regularization parameter \lambda, the regularization has little effect on the result. But for elements which are small compared to the regularization parameter, their influence is much reduced.

The elements with largest d_j are the principal components which have the largest singular values, i.e., the earliest principal components. So, ridge regression suppresses the leading principal components only slightly, but the later ones greatly. This reveals the kinship between ridge regression and another technique, which is to do a principal components analysis and simply omit the low-order principal components altogether, without any “shrinkage” of the leading principal components.

Bayesian Interpretation

We could also try approaching the problem from a Bayesian perspective. Suppose the noise is Gaussian, so the likelihood function is, just as before, proportional to

e^{-\frac{1}{2} ||R||^2}.

In a Bayesian analysis we multiply the likelihood function by a prior probability for the parameters. Suppose we expect a priori that they will be small. Then our prior probability might be a Gaussian with mean value zero, and inverse variance matrix \Gamma^T \Gamma, i.e.

\pi(\beta) \propto e^{-\frac{1}{2} ||\Gamma \beta||^2}.

Then the posterior probability will be proportional to

e^{-\frac{1}{2} ( ||R||^2 + ||\Gamma \beta||^2)}.

The factor in the exponent is proportional to

|| y - {\bf X} \beta||^2 + || \Gamma \beta||^2.

Therefore the posterior mode, which is the most likely value within the Bayesian framework, is exactly equal to the ridge-regression estimate.

The similarity of ridge regression to both PCA, and to a Bayesian approach, are examples of the “unity” which we often see in mathematics, where strikingly different approaches lead to the same, or similar, results. Also, taking steps to reduce the variance in our estimated parameters, even at the cost of making the estimates biased (one hopes, only slightly so) can sometimes improve the reliability of results dramatically.


16 responses to “Ridge Regression

  1. Thanks for that, which is the first explanation of Ridge that I’ve seen. And I even skimmed it. I guess the question on the bated breath of all your fans will be: as you say, Ridge may be better under some circumstances, but it has some penalties. Could you assess whether the circumstances actually aplly in the case-of-interest, and what the practical effect of the penalties is likely to be?

    [Response: I haven’t studied the Steig/O’Donnell papers in nearly enough detail to know. Nor do I intend to. Eric Steig made a compelling case for his opinion on his RC posts, and I really don’t care to read what O’Donnell has to say since he seems to have poisoned his posts with needless and unjustified vitriol.

    But I’d guess that some sort of compensation for highly correlated predictor variables was necessary, since both authors used forms of it. Generally the impact of ridge regression is to underestimate the parameters, but I doubt that it’s a serious problem in that case.]

  2. Nice, but now you’ve got me wondering how you’re supposed to choose what value of \lambda to use.

    [Response: Of course! I neglected to mention that.

    Sometimes an “ad hoc” choice is made, but it’s more usual to try various values and test their performance (usually using cross-validation). In fact somebody even derived a formula for the optimal choice which results from leave-one-out cross-validation. Or, if you get to the result using the Bayesian approach you might have a good reason for a particular choice based on constructing your prior probability distribution.]

    • Dikran Marsupial

      For the Bayesian approach, a more usual procedure would be to maximise the marginal likelihood of the model (also known as the Bayesian “evidence” for the model). IIRC, this is also known as a “type II maximum likelihood” approach, and was developed for the ridge regression model by John Skilling.

      Tamino, is the leave-one-out method applicable to ridge regression models in a time-series context – is there some adjustment for the serial correllation?

  3. When you say “try various values” that makes it sound like guessing. Is it possible with a small amount of data (and I was shocked at how small the Antarctic data sets were) to arrive at a “validated” value that’s just a happy accident? O’Donnell seems hepped on the idea that Steig got the right answer but that his process was flawed.

    [Response: “Try various values” doesn’t have to be guessing, it can be done in a thorough fashion. After all, that’s what a Fourier periodogram is — you try various frequency values and see what works. But the frequencies you try are usually rigorously chosen to cover the valid possibilities.

    Again, I don’t know enough of the details to offer a useful opinion on their dispute. But I know which fella I trust.]

  4. Jeff Freymueller

    Thanks for another clear and informative post. Your efforts are much appreciated. I enjoy reading the blog as time allows, and especially like the fact that I can learn some things here.

    Now I know that “ridge regression” is just another term for “damped least squares” as we generally call it on the geophysical inverse side of things, at least for the case you described where the matrix gamma is proportional to the identity. I often have a hard time figuring out what statisticians are talking about because they often use a completely different terminology for the same math….

    My answer to Jeffrey Davis’ question (not in conflict with yours) is that with such a method you no longer have “a solution”, but rather a family of solutions as you vary the parameter lambda. You can define an optimal lambda based on some criterion (as noted in your reply to Ernst), but it is generally wise not to pin too much meaning on any feature in the family of solutions that does not persist for a reasonable range of lambdas. It can be hard to assess this in real-world problems, which is why judgement and physical understanding of the problem always remain so important.

    Now a question. What is specific about “iridge” as a type of ridge regression? Presumably it incorporates a specific method to choose the matrix gamma?

  5. Thanks. This puts it very nicely into a reference frame that is useful for me: linear inverse problems and bayesian inference. I have to read your post in more detail later. (Terminology-wise I much prefer the tikhonov regularization over ridge regression, but that is probably just where I am coming from).

  6. Tamino, I really enjoyed reading this post, especially the connections showing the unity of statistics. I assume IRIDGE is a specific piece of software or numerical recipe for ridge regression. I suspect if it had been named ITIKHONOV the approach might have been more commonly known as Tikhonov regularization. Anyway, Steig and O’Donnell were arguing over the value of the parameter KGND. Could this parameter be related to your lambda?

    • k_gnd and lambda both play a similar role. The main difference is that lambda is a continuous parameter whereas k_gnd specifies a discrete truncation point.

      The term Tikhonov Regularisation has priority; the method has been rediscovered many times since, and has therefore accrued several different names.

      iridge comes from Tapio Schneider’s RegEM code. The i is for ‘individual’ (as opposed to ‘multiple’ in mridge), which means that a different value of lambda is chosen (by generalised cross-validation) for each column of data.

      • Dikran Marsupial

        Choosing a different value of lambda for each attribute via cross-validation would be a *very* risky thing to do. The cross-validation estimator has a finite variance, which can be very large especially if leave-one-out cross-validation is used. That means if you have a large number of attributes you will end up over-fitting the cross-validation error (unless you have a lot of data). Something like LASSO or LARS would be a much better bet.

  7. Gavin's Pussycat

    Tamino, I enjoyed reading this especially because you use the standard statistical notation, which is very different from that used in geodesy… but of course the logic is the same.

    In geodesy, the situation is not so much that two or more predictors are strongly correlated, but that some of the parameters are completely unconstrained by the observations. This happens, e.g., when you have angle and distance observations in a geodetic network, which by themselves are unable to constrain the co-ordinate values of the network points (if these are chosen as model parameters): you can freely translate or rotate your network solution without any of the observations changing. This is called a “datum defect”, and shows as one or more zero eigenvalues in the “normal matrix” $X^T X$. It can be fixed by a priori fixing the co-ordinates to a set of approximate values (e.g., read from a map), with very low weights.

    This is of course a bit special situation, where eigenvalues are either precisely zero, or clearly non-zero. Here, a very low weight for the prior info will do, introducing no bias in practice (it pulls the co-ordinates to the assumed approximate values, but should not change angles or distances). I understand that in climatology this is not quite as clean; the parameters that are poorly constrained are the coefficients of the higher-eigenvalue principal components of the reconstruction, representing higher spatial resolution than the proxies are able to properly capture. Then, you can use this Tikhonov technique which introduces bias, or alternatively throw away altogether the smaller eigenvalues $d_j$ — the “total truncated least-squares” approach.

    I hope that this different perspective helps some readers.

  8. Thank you, still can’t do it, but I think I got parts of this. Understood it so, that this is essentially a method that evens out some the gaps in some data sets, given the correlation between individual data points is somewhat known (and thus lambda must be screened?), but by doing this something (some of the accuracy?) is lost? Am I totally off with this? If I’d be at university I might ask for a practical example as this appears to be one statistical method practising scientists (or future ones) might want to know, or just to tease the stats lecturer who thinks this is trivial.

  9. Thanks, Gavin’s Pussycat, at least there are familiar words I don’t have to translate :-).

  10. Halldór Björnsson

    This is the clearest explanation of ridge regression I have seen. The connection to damped least squares (and variance quelling techniques) in geophysical inverse theory is obvious as Freymueller points out. I’ve never done a survey of this, but damped least squares might be one of those straightforward techniques that get applied under different names in different fields.

    I’d never realized that dropping eigenvectors that “explain a small amount of variance” was akin to variance quelling; ‘unity’ in our professional lives may go unnoticed a lot of the time. It is fun to see connections like these.

    Thank you for a very enjoyable post.

  11. why is it called “ridge” regression? should one normalize/standardize both x and y before computing the \beta values? also why does one “center” the X matrix?
    Could you suggest a book that discusses ridge regression especially in the light of the questions I have? Thanks.