Last time we saw that the 1st principal component is the direction of the projection which gives the maximum variation. So … how would we find that, anyway?
A projection is defined by its direction. Many people think of a direction as a vector, but it’s not. A vector has both direction and distance. But a direction has no distance associated with it. The geometrical object associated with a direction can be called a ray. We can represent our ray by a unit vector, i.e., a vector with length = 1.
Then it’s easy! The projection of any data vector is just its (vector) dot product with the unit vector which defines the ray.
An example may help. Suppose we have data series, giving an
-dimensional vector at each of
points of time (note that
and
are different variables).. Call these data values
, for
. At a given moment of time
, they take the values
. From each series
, i.e. for each
, compute the average value over time
.
Then define a zeroed data series, for all times , as
.
Note they are not yet normalized, just zeroed.
Suppose further that the unit vector defining the direction of our projection is . Then the dot product of the ray vector with the zeroed data vector gives the projected value
:
.
Because we zeroed the original data, the average value over time of this projection will be zero
.
Then the average variation (not variance) of the projected values will be
.
We can write this as
.
from which we see that the sum over time doesn’t involve the unit vector defining the ray. Hence we can compute the sum over time, once and for all, by computing the matrix
.
Then the variation of any projection defined by unit vector is given by
.
The matrix completely characterizes how the variation in a projection depends on its direction
. We can call this matrix the covariance matrix. Note that if we interchange the index variables
and
, the value of
is unchanged, so the covariance matrix is a symmetric matrix.
The problem of finding the projection which gives the maximum variation, is therefore equivalent to the problem of finding the unit vector (ray) which gives the maximum value when contracted with the covariance matrix. That unit vector defines a ray, the direction of which defines the maximum-variation projection, and that’s our 1st principal component.
The astute reader will note that I haven’t yet normalized the data. Normalization per se isn’t necessarily the appropriate thing to do. What’s necessary is that if we’re going to estimate variation as “distance” in our -dimensional space, then we have to make sure that our different dimensions — the different axes of our space — are on the same scale. This can be tricky: if the first data series is degrees Celsius, and the 2nd series is CO2 concentration in ppmv, how do we define a scale which applies to both?
The usual solution is to normalize all the series, i.e., to expand or contract each series so that its average variation is the same (equal to 1). This amounts to assuming that a variation of 1 standard deviation of one variable, should be thought of as representing the same “distance” as a variation of 1 standard deviation of any other variable — it’s the standard deviation that defines the relative scale between different data series. We may have reason not to normalize the data. If the data series themselves suggest a physically meaningful scaling, we may argue for keeping it. Even better, we might want the scaling for which the noise has the same size in all the many dimensions.
The “ideal” situation (if there is such a thing) is to identify the directions in our multidimensional space which act independently; these are the “meaningful” directions (whatever that means). But identifying them based on objective criteria is tricky business, and in the majority of cases the directions identified by PCA (and some of its variants) are plenty close enough to capture the relevant information, and as good as we’re gonna get anyway.
So the general default is to normalize the data. For each series, i.e. for each , we’ll first compute the average variation (not variance) of the data
.
I say variation rather than variance, because a proper estimate of the variance would involve dividing by rather than
. But I’m not interested in a proper estimate of the variance. I just want a data series which has average value zero (it’s been zeroed) and which has rms value 1 (it’s average squared value is 1). This defines my normalized series as
.
Now we have an interesting situation. Using the normalized series , we can compute the covariance matrix just as we did with the non-normalized series. When we do so, all the diagonal elements are the average squared value of a normalized series, which is just 1. The off-diagonal elements are just the correlation coefficients between the data series. So let’s write the covariance matrix using normalized series as
,
and call it the correlation matrix. Its diagonal elements are 1, its off-diagonal elements are the correlations between the data series. Like the covariance matrix, the correlation matrix is symmetric.
So we know how to compute our correlation matrix. How then do we find the unit vector giving the maximum variation? We want to maximize the quantity
,
but we need to impose the constraint that the vector is a unit vector
.
We can include the constraint by defining a Lagrange undetermined multiplier , and using it to define a quantity
.
Then we find the vector by finding the maxima and minima of . Substituting the value of
, and setting the derivative with respect to any component
of the vector equal to zero, we get
This tells us that
.
Therefore if is our ray of maximum variation, then the matrix
multiplied by the vector
gives a multiple of the vector
. This is the classic definition of an eigenvector. If multiplying a matrix by some vector, the resultant vector is a multiple of the original vector, then that vector is an eigenvector of the matrix.
If we multiply the eigenvector by some constant — we may double it or multiply it by a billion or by one tenth — then multiply it with the matrix, the result is still a multiple of the original. Hence the length of the eigenvector is irrelevant; all vectors which point in the same direction are the same eigenvector. Therefore the eigenvector equation actually defines a ray rather than a single vector, but we can characterize that ray by a unit vector in the appropriate direction.
Let’s see what this means in the 2-dimensional case. Our correlation matrix will be 2 by 2, diagonal elements 1, and the off-diagonal elements are equal to the correlation between the two data series.
Since the length of an eigenvector doesn’t change the eigenvector equation, let’s choose the length so that the 1st component, , is equal to 1. Let
be the 2nd component. Then the eigenvector is
If we multiply the matrix by our eigenvector, we have to get a multiple of our eigenvector,
The first thing we note is that . Next we note that
.
From this we conclude that . So,
. This means we have the following two eigenvectors:
Actually it’s wise if we conform to the custom of normalizing these eigenvectors, i.e., changing the length so that they’re unit vectors. This gives
The first still represents an equal contribution from each dimension, i.e., the sum of the series (divided by the square root of two). The second still represents a decrease in one equal to the increase of the other, i.e., the difference of the series (divided by the square root of two). The first eigenvector has eigenvalue , the second has eigenvalue
. If
is greater than 0, then the first eigenvalue is larger and the first eigenvector is the 1st principal component. If
is less than 0, then the second eigenvalue is larger and the second eigenvector is the 1st principal component.
In higher dimensions, still each eigenvector defines a projection, and the variation associated with that projection is equal to the eigenvalue! Hence the eigenvector with the largest eigenvalue is the 1st principal component, that with the next-largest eigenvalue is the 2nd principal component, etc. Because the correlation matrix (or covariance matrix) is real and symmetric, it turns out that all of its eigenvectors are perpendicular to each other (”orthogonal”).
Consider, e.g., the case of three series. The correlation matrix has 1’s along the diagonal and correlations between the series off the diagonal:
Suppose the first two series correlate strongly, so that is positive, even close to 1. But series three doesn’t correlate with either of the others, neither positively nor negatively: both
and
are zero. Because the 3rd series doesn’t correlate with either of the others, it “decouples” from the system and constitutes its own eigenvector:
Its eigenvalue is 1. Meanwhile the first two series mix just as they would in the two-dimensional case, giving eigenvectors
with eigenvalues and
. The end result is that the story common to series 1 and 2 was the big story, series 3 is the medium story, and the difference between 1 and 2 is the small story.
In other cases we may have a group of series which correlate amongst themselves, but don’t correlate with any of the rest of the series. In such a case, the group will decouple from the system, and act as a separate group. The eigenvalue decomposition enables us to isolate the strongest signal present within each group, as one of its principal components.
In any case we can expect to capture most of the variation in the data using far fewer principal components than we have original data series. And those we do use will be the ones with the best signal-to-noise ratio.
In the next post we’ll interrupt our series, although we will apply the simplest PCA analysis to some temperature data in order to investigate some other temperature data. Then we’ll return to how PCA has been used in proxy reconstructions, and ultimately address some of the issues of thorny controversy.








10 responses so far ↓
Don Fontaine // February 20, 2008 at 2:35 am
Thank you for these posts.
fred // February 20, 2008 at 6:59 am
It gets better still. A real public service, this.
David B. Benson // February 20, 2008 at 8:00 pm
What Don Fontaine and fred said.
P. Lewis // February 20, 2008 at 8:48 pm
This will take a day or three(!) to get to grips with personally.
But it’s not all gobbledegook to me, so there’s definite hope. Keep it coming.
(Rhetorical) How do you manage all this and your day job, and have a life?
elspi // February 20, 2008 at 9:31 pm
\begin{nitpick}
“Because the correlation matrix (or covariance matrix) is real and symmetric, it turns out that all of its eigenvectors are perpendicular to each other (”orthogonal”).”
This is true if all eigenspaces are 1-dimensional,
and in general you can choose an orthonormal basis of eigenvectors, but if an eigenspace has dimension at least two then there are lots of unit eigenvectors in that eigenspace which are not perpendicular to one another.
“Is it possible that in this setting all eigenspace are one dimensional and that you have just made a fool of yourself”
Yes
\end{nitpick}
KH // February 24, 2008 at 6:28 pm
It seems that one might be able to relate this to singular value decomposition. Is that the case?
S2 // February 24, 2008 at 10:38 pm
elpsi wrote:
Just when I thought I was beginning to understand this. :(
But if “all eigenspace (sic)” only have one dimension, doesn’t that mean that nothing changes?
In Tamino’s lake example, the only way to get all eigenspaces to be one-dimensional would be to make all the ice-out times agree (to the picosecond)?
I’m happy enough to make a fool of myself, as long as I can (in retrospect) understand why.
elspi // February 25, 2008 at 3:04 am
“In Tamino’s lake example, the only way to get all eigenspaces to be one-dimensional would be to make all the ice-out times agree (to the picosecond)?”
No, I don’t think so.
I think (now) that with probability one, the eigenspaces will have dimension one. Eigenspaces of dimension more than 2 should correspond to multiple roots of the characteristic polynomial. A random polynomial will have no multiple roots (with probability one). The characteristic polynomial of a symmetric matrix isn’t exactly random (even it the entries of the matrix are) but I wager it is random enough to get no multiple roots.
I don’t think that the dimension of the eigenspaces matters for the applications in any case. That is why I called it a nitpick.
Martin Vermeer // February 25, 2008 at 9:08 pm
Yes. Or EOF (Empirical Orthogonal Functions). All very similar.
mmghosh // March 6, 2008 at 3:24 pm
Its easier to get to grips with the maths after this post. Many thanks.
Thanks to the other posters who put up the other sites too in the previous threads.
Leave a Comment