Open Mind

PCA, part 1

February 16, 2008 · 66 Comments

It’ll take several posts to explore the subject of PCA (principal components analysis) sufficiently to get a good idea of how it relates to some of the climate science that’s often under discussion (i.e., hockey sticks and all that). So here’s part 1, an introduction to PCA.

Some time ago I came across data for ice-out day (the day of the year on which the lake ice breaks up) for a number of the New England lakes. Here’s the data for two of them, lakes Auburn and Moosehead:

2lakes.jpg


We can see right away that the changes are similar; when one lake has ice-out earlier or later in the year, the other lake tends to do the same. So these two data series aren’t telling us two completely different stories; to a large degree, they’re telling us the same story.

What story are they both telling us? We can get a clearer picture by bringing them onto the same scale, which we can do by normalizing the series. To do that, we first calculate the average and the standard deviation for each series. Letting \bar x be the series average, and \sigma the series standard deviation, we then replace each data value x by

z = (x - \bar x) / \sigma.

In this way we create a new series which looks just like the old one — it goes up and down by exactly the right proportion at exactly the right time — but we’ve “zeroed” it (its average value is zero) and we’ve expanded or contracted it to “standard” size (its standard deviation is one). If we do this to both the series for Auburn lake and Moosehead lake, then compare these normalized series, we get:

2norm.jpg

Now we can see just how similar are the stories told by these two series!

But of course we don’t just want to see it; we want to quantify it. Let’s get some insight by looking at the pair of two series as a single 2-dimensional data series, where the 1st dimension is the normalized Auburn data and the 2nd is the normalized Moosehead data. Then let’s plot our data, not as a function of time, but simply as points in the 2-dimensional space:

2dlakes.jpg

Clearly the two dimensions correlate with each other; when one is high or low, the other tends to be high or low also. In fact the data points tend to cluster about the diagonal line; that’s the line which is closest to the data in the least-squares sense, i.e., it gives the minimum sum of squared distances from data points to the line. Since the variation away from the line is minimum, the variation along the line is maximum. If we take each data point and move it to the point on the line closest to it, i.e., if we project the data onto the line, we get the maximum amount of variation from this line and no other.

And that’s an important point. After all, it’s the changes which contain the information in data; if the data never changes we don’t have much information! Along this projection line the changes have the greatest variation, so this is the *biggest* story being told by these two data series.

Whenever we have two normalized data series, and we make a 2-dimensional plot of that data, one of the diagonal lines always gives the projection with the most total variation, and the other diagonal line (perpendicular to the first diagonal line) gives the projection with the least variation. The direction of the line which gives the maximum-variation projection is the direction of the first prinipal component. The direction which is perpendicular to that, which gives the minimum-variation projection, is the direction of the second principal component.

If the series are correlated (like Auburn and Moosehead lakes), then the 1st principal component will be the diagonal from lower left to upper right, while the other diagonal (lower right to upper left) is the 2nd principal component. In such a case, it’s the story the two series have in common that is the “big” story, while the difference between the series is the “small” story. If, on the other hand, the series are anticorrelated, then these diagonals swap order, with lower-right-to-upper left showing the most variation and becoming 1st, the lower-left-to-upper-right now 2nd. Then it’s the difference between the series that is the “big” story while the changes common to both are the small story.

For Auburn and Moosehead lakes, the 1st principal component points in the direction of equal increase for both (normalized) series. Hence it represents an equal combination of the two series, or their sum. Actually it represents their sum divided by the square root of two, but that’s just a scale factor.

The 2nd principal component points in the direction of increasing normalized Moosehead data while Auburn decreases by the same amount. Hence this represents an increase in one accompanied by equal decrease in the other, which is simply the change in the difference between the Auburn and Moosehead series. Actually it represents their difference divided by the square root of two, but that’s just a scale factor.

When we have just two data series, so our series vector is 2-dimensional, the principal components always turn out to be the sum of the normalized series divided by the square root of two, and the difference between the normalized series divided by the square root of two. The sum gives the story which is common to the two series while the difference gives the story which is different between the two series. Whichever direction leaves the most variation in the projected data is the 1st, representing the majority of the variation, the other (the 2nd) represents the minority. And here’s what that gives for these two Maine lakes:

2dpc.jpg

Evidently most of the variation is in the 1st principal component, PC1, while the 2nd, PC2, shows only a little. It’s the sum (or average, or sum divided by the square root of two, or however we choose to scale it) that gives the “big picture” from these two data series.

And now for the payoff: because the two principal component directions are perpendicular, we can in a sense call their projections “independent” (I’ll play fast and loose with the truth here, but the point is valid). Hence we may expect that the noise will distribute itself evenly among all the principal components. But clearly the signal doesn’t distribute itself evenly among the principal components! It’s mostly in the 1st PC, as indicated by its vastly greater variation than the 2nd PC, due to the strong correlation between the data series. If the noise level is the same in the two PCs, and the signal level is vastly greater in the 1st PC, then it’s the 1st PC that gives us the best signal-to-noise ratio. And it’s that signal we’re hoping to identify.

It may happen that the two series are neither correlated nor anticorrelated, i.e., that their correlation coefficient is r=0. In this case, neither diagonal defines a higher-variation projection than the other. In fact, all directions define a projection which gives the same variation. There’s no “preferred direction” in our 2-dimensional vector space, by which to define principal components. We can, however, still choose the sum divided by the square root of two, and the difference divided by the square root of two, in full knowledge that they give the same variation and which we choose to call 1st, which 2nd, is entirely arbitrary.

That’s all well and good; the maximum-variation projection gives the big picture, the minimum-variation projection gives the small picture, and with just two data series one is the scaled sum, the other the scaled difference. But what if we had more than two series to start with? If we consider the time period from 1950 to 2004, then we have ice-out data, not for just two New England lakes, but for ten. Here’s the data:

ice10.jpg

Clearly all the lakes show similar changes. We can see this even better in the normalized series:

ice10norm.jpg

Again we see that they all tend to go up and down together! But they don’t do so perfectly. At times the changes are strikingly similar, but there are some notable differences: around 1967, 1974, and 1987, for instance. We’ll see how these emerge from PCA.

Form the data for the ten lakes into a 10-dimensional vector, and plot each year’s data as a point in this 10-dimensional space. I won’t bother to graph it, since I haven’t yet figured out how to make clear 10-dimensional graphs. But we can determine, mathematically, the line which best fits the data, in the sense that when we project the data onto this line we get the maximum variation — just as we did in the 2-dimensional case. The direction of this line will be the first principal component.

Next we look for a vector which is perpendicular to the 1st principal component, whose direction defines the projection which gives the maximum variation. This is the 2nd principal component.

We repeat this process until we have as many principal components as is the dimension of our space (in this case, 10). All the principal components are perpendicular to each other. The projection defined by the 1st principal component gives the most variation, that defined by the 2nd principal component give the 2nd-most variation, etc. The last principal component gives the least variation, and usually represents either a very small signal, or just noise. With 10 dimensions, the last principal component usually has variations too small to contribute to our understanding of the signal.

For the 10 New England lakes, all the data series correlate positively with each other, i.e., they all strongly tend to vary together in the same direction. Under such circumstances the 1st principal component is close to (but not exactly) an average for all 10 lakes, times the square root of 10. To be exact, it’s a weighted average of the normalized series for all 10 lakes, times the square root of 10. When all data series correlate strongly, they’re all telling us mostly the same story, so the first principal component, PC1, turns out to be the variations which they all share in common. Here it is, plotted through time (thick black line), compared to the 10 original data series (thin lines), when plot to the same scale (the PC is plotted on the right-hand axis):

ice10pc1.jpg

We noted earlier that in addition to a common story, the data from the different lakes sometimes showed notable disagreements. The strongest correlation among those disagreements gives us our 2nd principal component, which turns out to be roughly proportional to the difference between the average of five lakes (Auburn, Cobboseecontee, Houghton’s, China Lake, and Damariscotta) and that of three other lakes (Moosehead, Aziscohos, and 1st Connecticut). This difference between the two groups gives us the 2nd principal component (thick red line):

ice10pc2.jpg

We see that it’s largest around 1967, 1974, and 1987 — just the times we noted that the series showed differences. The 2nd principal component has captured much of the difference, although it hasn’t captured it all. Much of the remainder can be found in the 3rd principal component (thick blue line):

ice10pc3.jpg

The 1st principal component accounts for 73% of the variation of the data series. The 2nd PC accounts for a further 13%, making 86% explained by the first two PCs. The 3rd PC accounts for another 6% of the variation, meaning that the first three PCs cover 92% of the variation present in the original 10 data series.

That’s most of the variation, and most of the information, in those data series. We can represent that information by these three signal-rich series (the first three PCs) rather than the ten noisier series (the original lake ice-out data). In fact examination of the 10th principal component shows just how small it is (thick purple line):

ice10pc10.jpg

If we discard the 10th principal component, the variation we lose — hence the information we lose — is very small, and is almost entirely noise anyway. Likewise for the 9th principal component, and the 8th, etc. By discarding principal components which have very small variation, we get rid of a fair share of the noise, lose almost no signal, and simplify the input data to our analysis. Instead of characterizing the ice-out data by 10 time series, we can do so effectively by using only three: the first three principal components.

And that can be important. Suppose, for example, we’re using ice-out data as a proxy for springtime temperature. If we try to estimate springtime temperature as a function of the data for all 10 lakes, we have so many “predictor” variables that we’re likely to suffer from overfitting. That’s when we have so many predictor variables that we start to fit the noise rather than just the signal. This causes the fit to look really great during the calibration period! — after all, we’re even fitting the noise! But the fit we create for the calibration noise only magnifies the noise in our proxy reconstruction. Noise is, by definition, unpredictable.

By using only the first three principal components instead of the ten raw data series, we have only three predictor variables so we reduce the problem of overfitting. But we’ve lost mostly noise, almost none of the signal. What could be better? If we use the first three principal components, from 1950 to 2003, as predictor variables to estimate springtime temperature, Then we get this fit between actual observed values (black line) and the predicted “proxy reconstruction” values (red line)

proxy10.jpg

The proxy is quite worthy; it rises and falls along with the observed temperature with strong correlation.

This is the general idea of principal component analysis, or PCA. In the next installment, we’ll learn the mathematics that enables us to determine what those maximum-variation directions are.

Categories: Global Warming · climate change

66 responses so far ↓

  • John Cross // February 17, 2008 at 2:30 am

    Tamino: Excellent post - I have been looking forward to this for a while!!!!

    Would it be possible to plot the principle components from the 10 lakes on one graph to illustrate how PCs change?

    A rather simplistic explanation of the PC method that I once came across is to imagine 3 sets of data which are plotted against each other in a 3D plot (similar to your 2 lake plot). Now we imagine we can pick up this 3D box and look at it from different directions. We move it around until we see the data lining up the best. This orientation can be thought of as PC1. The other two PCs are the orthogonal views. Would you consider this broadly correct?

    Thanks
    John

    [Response: It's quite correct. You can imagine doing it in three dimensions, and I can graph it in two, but no matter how many dimensions, we can do it mathematically.

    I'm not sure how interesting plotting all the PCs would be; the 4th through 10th are small, and mainly jitter caused by noise. The signal is in the first three.]

  • Timothy Chase // February 17, 2008 at 3:43 am

    Thank you, Tamino.

    I am looking forward to this series. I briefly touched on eigen values and eigen vectors back in the eighties (I will probably have to brush up on that as well) when learning the mathematics behind fairly basic quantum mechanics, but this is something I never got to and would nevertheless like to understand.

  • nanny_govt_sucks // February 17, 2008 at 6:11 am

    If the lake ice were broken up each year by a group of guys with axes around the same month of each year (date might vary a bit because some of the guys might be on vacation, or sick, etc…) would it not be possible to calculate a PC1 that reflected not temperature, but the activities of this group of guys with axes? And then maybe the PC2, or PC3 would reflect temperature?

    In this case (I know this example is a bit far-fetched, but please bear with me) how would you know what the PC1 (or 2 or 3) represents?

  • P. Lewis // February 17, 2008 at 11:03 am

    The Lake Ice-Out Data for New England for what Tamino has analysed here (and the data for another 19 New England lakes) and the lakes’ lat/lon data can be obtained from the USGS if anyone would like to work through it themselves (since that is the only way to “really” learn rather than just to appreciate).

  • fred // February 17, 2008 at 12:05 pm

    Yes, thanks from me too. I know what a lot of work it is to patiently lay out this stuff in a clear basic way. Like writing user guides! Tedious but valuable. Thank you.

  • S2 // February 17, 2008 at 12:46 pm

    Thanks.
    I understand the 2D example, but I’m struggling a bit with the 10D one. I may have a go at a 3D one to help me make sense of it.

    Actually it represents their sum divided by the square root of two, but that’s just a scale factor.

    To be exact, it’s a weighted average of the normalized series for all 10 lakes, times the square root of 10.

    That confused me for a moment, until I realised that the first is using a sum and the second is using an average.

  • Dodo // February 17, 2008 at 4:54 pm

    Nice lesson for all of us that have forgotten mot of the statistics we learned in school. I hope your real You is a teacher.

    One question: You have categorized the post under “global warming”. A couple of lakes in New England showing not even significant local warming - so what’s the point?

  • Phil B. // February 17, 2008 at 5:01 pm

    Looking forward to Part 2. I was surprised not to see an upward trend in the data.

  • John Cross // February 17, 2008 at 6:38 pm

    [I’m not sure how interesting plotting all the PCs would be; the 4th through 10th are small, and mainly jitter caused by noise. The signal is in the first three.]

    I agree, but I thought it would be interesting to see how quickly they decayed and how similar the PCs are after a certain limit.

    Regards,
    John

  • Paz // February 17, 2008 at 9:45 pm

    “I agree, but I thought it would be interesting to see how quickly they decayed and how similar the PCs are after a certain limit.”

    shouldn’t they not be totally dissimilar even after the 4th, because they are orthogonal to one another? That is, the variance captured by one predictor would not be not available for the next one? So, in general is it true that each component has a zero-correlation to the all others, or am i misunderstanding things?

    by the way, very nice post. I really enjoy your site. Although I am from a totally different field (Neuroscience), I have learned a lot about about statistics that I can apply.

  • B Buckner // February 17, 2008 at 10:10 pm

    Phil B - I may be mistaken but I think the ice out day is the number of days since Jan 1 when the ice out occurs; therefore a lower number equals earlier melting.

  • P. Lewis // February 17, 2008 at 11:07 pm

    Dodo said:

    One question: You have categorized the post under “global warming”. A couple of lakes in New England showing not even significant local warming - so what’s the point?

    Perhaps since Tamino said

    It’ll take several posts to explore the subject of PCA (principal components analysis) sufficiently to get a good idea of how it relates to some of the climate science that’s often under discussion (i.e., hockey sticks and all that). So here’s part 1, an introduction to PCA.

    is why.

  • JimV // February 17, 2008 at 11:38 pm

    As I suspected from the briefer write-up at RealClimate, the mathematics appears to be the same procedure that is used to reduce the number of degrees of freedom for vibrational analyses of coupled components, by using the first several natural modes of each component. In that case it is primarily used to reduce the size, time, and cost of vibration models. It seems even more clever to use it with statistical data to eliminate some of the noise, so at this point my main question is why would one ever not want to use it? (Of course there are probably pathalogical cases that would defeat any given procedure.)

    Thanks very much for the explanation.

  • nanny_govt_sucks // February 18, 2008 at 12:59 am

    If we use the first three principal components, from 1950 to 2003, as predictor variables to estimate springtime temperature,

    Why would 3 orthogonal PCs all represent temperature? From your description above it sounds like the different PCs should reflect different “stories”, and temperature would be (might be) only one of the “stories”.

  • John Cross // February 18, 2008 at 2:46 am

    Paz: the shape of them should be very dissimilar but I am guessing that the size or amplitude of them would be about the same.

    The PCs that explain more of the variance are - for lack of a better term - larger than the ones that only explain a little (the associated PC have larger eigen values). If you look at PC1, it has a lot of ups and downs while PC2 has fewer one and they are smaller, PC3 has even fewer and smaller ones.

    I would guess that the PCs that are picking up mostly random noise would look to all have about the same amplitude on a graph.

    Tamino, please correct me if I am wrong here.

    Regards,
    John

  • Timothy Chase // February 18, 2008 at 6:08 am

    JimV wrote:

    As I suspected from the briefer write-up at RealClimate, the mathematics appears to be the same procedure that is used to reduce the number of degrees of freedom for vibrational analyses of coupled components, by using the first several natural modes of each component. In that case it is primarily used to reduce the size, time, and cost of vibration models. It seems even more clever to use it with statistical data to eliminate some of the noise, so at this point my main question is why would one ever not want to use it? (Of course there are probably pathalogical cases that would defeat any given procedure.)

    Thanks very much for the explanation.

    I don’t pretend to understand any of this as of yet. However, it turns out that you can use principle component analysis in the the study of bimodal climate modes where the two polar states of the climate mode will be essentially negatives of each other. You have seen that most likely with El Nino or some other climate mode — the warm areas become cold, the cold becomes hot, etc.. However, this assumes that the mode is susceptible to linear analysis, and some climate modes may very well not be — and there has been literature to this effect for a while.

    See for example:

    The Preferred Structure of Variability of the Northern Hemisphere Atmospheric Circulation
    A.H. Monahan, et al.
    Geophysical Research Letters, Vol 28, No. 6, pgs 1019-1022, Mar 15, 2001

    More recently, see:

    Methods in multivariate statistical analysis are essential for working with large amounts of geophysical data, data from observational arrays, from satellites, or from numerical model output. In classical multivariate statistical analysis, there is a hierarchy of methods, starting with linear regression at the base, followed by principal component analysis (PCA) and finally canonical correlation analysis (CCA). A multivariate time series method, the singular spectrum analysis (SSA), has been a fruitful extension of the PCA technique. The common drawback of these classical methods is that only linear structures can be correctly extracted from the data. Since the late 1980s, neural network methods have become popular for performing nonlinear regression and classification. More recently, neural network methods have been extended to perform nonlinear PCA (NLPCA), nonlinear CCA (NLCCA), and nonlinear SSA (NLSSA). This paper presents a unified view of the NLPCA, NLCCA, and NLSSA techniques and their applications to various data sets of the atmosphere and the ocean (especially for the El Nino-Southern Oscillation and the stratospheric quasi-biennial oscillation). These data sets reveal that the linear methods are often too simplistic to describe real-world systems, with a tendency to scatter a single oscillatory phenomenon into numerous unphysical modes or higher harmonics, which can be largely alleviated in the new nonlinear paradigm.

    Hsieh, W. W. (2004), Nonlinear multivariate and time series analysis by neural network methods, Rev. Geophys., 42, RG1003, doi:10.1029/2002RG000112.
    http://www.ocgy.ubc.ca/~william/Pubs/Rev.Geop.pdf

    It is worth noting that methods which are intended to capture the non-linear behavior of climate modes are themselves generalizations of the linear methods. Thus for example, one might employ kernel PCA instead of PCA.

  • P. Lewis // February 18, 2008 at 11:35 am

    nanny_govt_sucks asked:

    Why would 3 orthogonal PCs all represent temperature? From your description above it sounds like the different PCs should reflect different “stories”, and temperature would be (might be) only one of the “stories”.

    Temperature may or may not be affected by more than ‘one story’. For example (not necessarily all relevant to this example): synoptic conditions, ocean circulation conditions, diurnal/seasonal changes, long-term climate change conditions, altitude, latitude/longitude, solar output, change in proxy behaviour, change in monitoring regime, instrument drift, … are all variables that might conceivably have an effect you may wish to evaluate.

  • fred // February 18, 2008 at 2:15 pm

    JimV

    I don’t think there is any argument about using PCA in general. The argument is about something quite different, whether the specific methods used in MBH98 were a correct application of PCA. So far Tamino has given an admirably clear explication of the method on an uncontroversial and simple sample data set. And he must have worked hard to make it that simple and clear! The interesting part about climate will come when he gets to MBH98 and discusses those methods of application.

    The controversy is summarized in Wegman. It will be interesting to see Tamino’s take on this, especially if he is as clear about it as he has been in the first installment about the basics.

    The two pdf tutorial references someone gave earlier are very nice and clear accounts as well. Thanks for those.

  • stewart // February 18, 2008 at 4:01 pm

    Tamino, are you going to post on ways to determine if a PCA is statistically significant or not? That will have some relevance in part 2, when different approaches are used which may allocate variance along different axes/components.
    Fred, here’s a comment from Andrew Vickers, a biostatician, that goes to the point of your comment and similar ones. “Statistics is often seen as a set of laws, handed down from above, violation of which constitutes a transgression. As a statistician, I am repeatedly asked whether a particular statistical analysis is “allowed” or whether it would be “against the rules”; as a statistics teacher, my students’ questions often concern ‘right and wrong.’”
    There’s no laws, there’s simply more effective and less effective statistical approaches to answering your questions.

  • Phil B. // February 18, 2008 at 4:03 pm

    B. Buckner, I agree. Earlier spring melts would result in a downward trend.

  • nanny_govt_sucks // February 18, 2008 at 7:19 pm

    Temperature may or may not be affected by more than ‘one story’. For example (not necessarily all relevant to this example): synoptic conditions, ocean circulation conditions, diurnal/seasonal changes, long-term climate change conditions, altitude, latitude/longitude, solar output, change in proxy behaviour, change in monitoring regime, instrument drift, … are all variables that might conceivably have an effect you may wish to evaluate.

    I still don’t get it. If your PC1 through PC3 reflect ocean circulation, instrument drift and altitude respectively, then any correltation to spring temperature would be completely spurius.

  • stewart // February 19, 2008 at 1:45 am

    PC1 is the element that accounts for most variance on all measures simultaneously. PC2 is the common dimension that accounts for the largest possible degree of remaining variance, etc. NaGS, an important element of science is validating your findings by looking for external correlates, and the internal structure of each PC. This allows you to say what each PC is summarizing, so you can compare results to others (and sometimes, discover that my PC2 is the same as your PC3, due to some characteristics of the data used, etc.).
    John C, what you are describing is what’s called a ’scree plot’. When using PCA for data reduction, when you want to retain the maximum portion of the reliable variance, you will typically keep those components with eigenvalues above the general run of the scree.

  • moptop // February 19, 2008 at 4:37 am

    One interesting feature of ice out time, I live on a lake that right now has about 18 inches out in the open water, and more in the bays, is that wind patterns affect it considerably. This year, the ice had almost melted the ice completely as it sat in the water, when a gentle breeze that barely stirred the trees carried it away. Other years, high winds carry the ice to shore with force when it is still pretty thick, seeing it pile up on beaches, and even threaten cottages. It is something to watch a sheet of ice that weighs ton upon ton carried into shore and just keep crashing upon itself. I am determined to make a youtube of the next spectacular one.

    My point is that it is not a pure proxy for temperature, although I am sure it is pretty well correlated.

    The reason they call it “ice out”, which I never understood until I witnessed it, is that when the ice blows away, it piles up on one shore or another and never comes back. You look out one minute, ice, next minute, blue water.

  • Fred // February 19, 2008 at 10:28 am

    He is just explaining how to do PCA. Not how you then go on to evaluate the results. He’s specifically not covering the trap of auto correlation. This is probably in part 3 of the series. So yes, the results could be showing an association between two time series that is spurious. But he’s still showing the right way to generate the stats and get to the principal components. What you want is coming later.

    Caveat - I am no expert at all! Correct if wrong.

  • nanny_govt_sucks // February 19, 2008 at 6:09 pm

    NaGS, an important element of science is validating your findings by looking for external correlates, and the internal structure of each PC. This allows you to say what each PC is summarizing,

    OK, but can 3 orthogonal PCs from the same PCA analysis all be summarizing spring temperature? Tamino seems to imply this above when he combines PCs 1 through 3.

  • Martin Vermeer // February 19, 2008 at 6:34 pm

    Yes, this is a very good explanation of the idea behind PCA.

    BTW this technique is related to SVD (Singular Value Decomposition), also often used to analyse spatiotemporal data. I described this technique in my lecture notes (not nearly as good, but with a nice example involving Finnish tide gauges) at the above web link.

    Tamino you could have pointed out that these analyses are easy to make using Matlab, or its free clone Octave.

  • S2 // February 19, 2008 at 10:31 pm

    It’s beginning to make sense to me.

    OK, but can 3 orthogonal PCs from the same PCA analysis all be summarizing spring temperature?

    Tamino’s example has nothing whatsoever to do with Spring temperature - he is simply comparing two (or 10) sets of data. What the data is doesn’t really matter - it could just as easily have been the number of apples sold by different stores. Or, as Timothy said, you could compare September Sea-ice in the two hemispheres.The data (on its own) doesn’t explain what’s causing it.

    My problem was - if you go above two dimensions, how do you determine what PC1 is? I tried it using GISS, Hadley & RSS annual temps (I could have used apple sales from Asda, Tesco & Sainsbury’s - I just happened to have the temp data to hand).
    And I think I’ve got it now - I don’t decide what PC1 is - the maths does it for me. Essentially I end up with a 3D version of Tamino’s 3rd graph above, with a slope having a general equation of ax +by + cz.

    Could someone please tell me if I’ve got that horribly wrong?

    I’m looking forward to the rest of the series.

    [Response: You've basically got it right. The projection along PC1 (or any PC) is ax+by+cz, where coefficients a, b, and c are the coordinates of the principal component vector.]

  • Hank Roberts // February 20, 2008 at 12:47 am

    Martin, is there a remedial page for people who look at your coursework and haven’t a clue? What are the required courses before that one? I’d love to have some better idea what you’re teaching, but I’d need to start in early calculus somewhere. I’m just being curious, I have no expectation of mastering the math.

  • John Willit // February 20, 2008 at 3:29 am

    If you stick to freeze-up dates in a certain location, that is one thing. If you extend it to every lake on the planet (all 100 million of them) then that is anothing thing.

    But you cannot use tree rings in your PCA since tree rings increase based on growing conditions (which includes everything from precipitation to desease to insect infestations ) and secondly, tree so not grow in perfect concentric circles, especially the Strip Bark trees used by Mann.

    So skip the trees, give us the data for all 100 million lakes for the past 2,000 years and we can then see the results.

  • Martin Vermeer // February 20, 2008 at 2:48 pm

    Hank,

    yes this is the _advanced_ adjustment calculus course, now called Statistical Methods in Geodesy… there is a basic course too, but it is not mine and not on line. Then there’s the introductory geodesy material, but it’s not in English :-(

    Actually you only need to look at Ch. 12, specifically 12.2. The example data is all there, you should be able to do it with Matlab/Octave (don’t even _think_ of trying Excel) , though I admit it’s a bit short on details. I could mail you my matlab script to play with if you’re interested.

  • Hank Roberts // February 20, 2008 at 4:26 pm

    I’m not up to running the software but perhaps others will. And this is a help for me just to read with a bit more understanding of what you’re doing. Thank you.

  • John Cross // February 20, 2008 at 8:52 pm

    Martin: You said “(don’t even _think_ of trying Excel)” and actually, when I was playing around with the PCA method a few years ago I did try in excel!

    It is not an easy task especially since Excel does not have nice functions for Eigen vectors and values and no simple way of making them. I did get around this by installing a set of population tools designed for excel (something called poptools I think) and it did have an Eigen function. None the less, it was a pain!!

    I will check out Octave!

    Thanks
    John

  • Ellen // February 20, 2008 at 9:19 pm

    After reading this article,I want to go out to see the movie with my boy firend.
    Goodbye and Good Luck.

  • Martin Vermeer // February 20, 2008 at 9:58 pm

    John: Octave is a great piece of software. Like Matlab by the way. It’s actually a little bit unfair for the MathWorks people behind Matlab, as they are a hard working and honest lot working closely with the scientific community. But. Matlab is pricey. Not a problem if your university pays, but it _is_ a problem for hobbyists and poor students.

    Octave has improved enormously over the last three years. More precisely, its Matlab compatibility has improved, to the point where also the plot commands are largely compatible. And today, even the Windows version is easy to install and runs well, so you’re not stuck if you’re stuck :-/

  • nanny_govt_sucks // February 20, 2008 at 10:58 pm

    the first three PCs cover 92% of the variation present in the original 10 data series.

    That’s most of the variation, and most of the information, in those data series.

    Why 92%? Why not 90, or 95%? Was 92% decided on beforehand, or was it decided after you had a look at the data? Is selection of PCs arbitrary or are there rules?

  • steven mosher // February 20, 2008 at 11:32 pm

    Fred.

    Let Tamino finish. He’s doing a good thing here explaining mathematical methods to folks. Leave MBH out of this, otherwise we end up in a food fight and food fights dont educate anyone.
    except in how to throw whip cream word pies.

    If Tamino wanted to argue that MBH used acceptable methods he would go argue with UC or others at CA. That is not his goal. After reading here a while ( and instigating some fights, sorry .. T) I see that Tamino is really an educator at heart. He writes best, when he is explaining things. Just the facts maam.
    So, lets encourage the best.

    [Response: Thank you.]

  • David Warkentin // February 21, 2008 at 2:49 am

    n_g_s -

    The amounts of variation covered by each PC are results determined by the data, not parameters directly chosen by the analyst. They’re in the same proportion as the eigenvalues Tamino describes in Part 2.

    (I say not “directly chosen” because they are affected by the choice of scaling - in this case, that choice is made when each data record is normalized to have unity standard deviation.)

  • dhogaza // February 21, 2008 at 4:01 am

    I see that Tamino is really an educator at heart. He writes best, when he is explaining things. Just the facts maam.
    So, lets encourage the best.

    I agree with this entirely. I suspect that in the near future, we’ll learn if Steven Mosher has learned anything.

  • J // February 21, 2008 at 1:26 pm

    Steve Mosher wrote:

    After reading here a while [...] I see that Tamino is really an educator at heart. He writes best, when he is explaining things.

    I agree. My favorite posts here have been the ones where Tamino took some subject I knew little or nothing about (like the boreholes posts) and explained it thoroughly and clearly.

    I wish there was some easier way of browsing back through old posts. Maybe somebody else could organize an index to Open Mind? Yes, I know how to use the Search tool, but browsing is often useful too.

  • nanny_govt_sucks // February 21, 2008 at 7:59 pm

    The amounts of variation covered by each PC are results determined by the data, not parameters directly chosen by the analyst.

    I don’t know what question you’re responding to.

    Tamino chose PCs 1 through 3 and combined them in a graph above. My question is “why?”. 92% of variation explained appears to have been his cutoff, but this seems arbitrary, unless I’m missing something.

  • Hank Roberts // February 21, 2008 at 8:30 pm

    Nan, what you’re missing is that the three principal components emerged from the particular set of data. Tamino didn’t “choose” them as in creating them. He’s showing you what emerges from the data set.

    Take a different data set, you get different results.

  • J // February 21, 2008 at 9:40 pm

    NGS wrote:

    Tamino chose PCs 1 through 3 and combined them in a graph above. My question is “why?”. 92% of variation explained appears to have been his cutoff, but this seems arbitrary, unless I’m missing something.

    The percentage of variation explained by each individual PC is not arbitrary, it’s an objective result of the process (which involves maximizing the variance in the 1st PC, then maximizing the remaining variance in the 2nd, 3rd, 4th etc in order).

    The decision about which/how many PCs to “keep” is arbitrary. Typically, as in this case, the first few PCs will get you up over 90%. That might be sufficient, or you might want to include a few of the lesser components, if there are features detectable in them that seem important or relevant to whatever you’re studying.

    There aren’t really “rules” for this, any more than there are for anything else in statistics. (People who don’t study statistics formally often mistakenly believe that there is some way of categorizing statistical analysis as “right” or “wrong”, when it’s usually more of a distinction between “useful” and “not useful”.)

    If the 1st PC is able to satisfactorily address whatever question you are trying to answer, then you can use it alone. If not, you might need to add the second, third, …. But parsimony is valued in science, and — all else being equal — the simpler your model/explanation can be, the better.

  • nanny_govt_sucks // February 21, 2008 at 10:02 pm

    Nan, what you’re missing is that the three principal components emerged from the particular set of data. Tamino didn’t “choose” them as in creating them.

    I didn’t say that. He obviously chose the first 3 PCs (instead of the first 2 or 4) and combined them in his graphic above. I’m looking for the logic or rules behind this choice. Tamino?

  • tamino // February 21, 2008 at 10:07 pm

    Indeed there aren’t hard-and-fast rules about how many PCs to include in all cases.

    There are, however, some clear-cut cases. Suppose we have 12 dimensions and therefore 12 PCs. If the first 2 PCs explain 90% of the variation, and the other 10 PCs explain 1% each, then the fact that among the “remaining 10,” none stands out, they’re all about equal, argues strongly that they belong together in a group: the “low-variance PCs which are mostly noise” group. Then we clearly use just the first 2 PCs.

    But it’s not always the case that a few PCs stand out while all the rest are subdued by the crowd; sometimes it’s not perfectly clear where to draw the line between meaningful and not.

    Principal components capture the substantial variations in a data set with fewer dimensions than the raw data, but that “fewer” is not necessarily one. We may need 2 or 3 or more PCs to capture the substantial variations that exist, that’s just a property of the data itself, not of its relationship to any *other* quantity like temperature. Once we have reduced the noise and complexity of our data by using PCs, we can then proceed to study the relationship of other things to this (possibly multidimensional) reduced representation.

  • Paul Middents // February 21, 2008 at 11:30 pm

    I would echo all the other folks finding this set by Tamino particularly interesting and informative. He is obviously a teacher at heart if not by profession.

    Thanks to P. Lewis for pointing us to the data source for Tamino’s example. I would encourage anyone seriously interested in this to at least work through the 2-D case. Excel will do this handily. I have also moved it to Mathematica in preparation for trying the 10D example.

    For Hank, a first course in Linear Algebra is almost as life changing as the first exposure to a real stats or probability theory course. I took both after retiring from the Navy and starting a community college teaching career. The Linear Algebra course is about the minimum for sorting through Martin Vermeer’s excellent class notes.

    It is interesting to try other pairs of lakes for the 2D case. I tried Kezar in Maine and Ponkapoag Pond in eastern Massachusetts. This selection was not very scientific. My son worked at a resort on Lake Kezar in the 80’s and my daughter lives in central Massachusetts. As it turned out the two are quite different as the latitudes and proximity to the coast should have indicated. Ponkapoag has about twice the standard deviation of Kezar and the two are not as well correlated as Tamino’s example. PC1 for Kezar and Ponkapoag has significantly less variation during the first half of the data set but more recent years don’t show nearly as sharp a difference between PC1 and PC2.

    One question that has come up as a result of doing my homework:

    When plotting the normalized values against one another, Tamino says that for this example the 45 degree diagonal represents the “best fit”. I would interpret this as a slope=1. When I used a simple linear regression in both Excel and Mathematica I got a best fit slope of about 0.76 for Tamino’s example and 0.42 for the less well correlated Kezar/Ponkapoag. Is this an indication of the noise or lack of correlation between the two data sets? Or am I just screwed up?

    For Tamino and all the other knowledgeable amateurs and dedicated professionals contributing at Open Mind, at Real Climate and at Rabbett Run, thanks.

  • P. Lewis // February 22, 2008 at 2:35 am

    In answer to Paul Middents, my understanding is that the diagonals are the rotated axes (new coordinate system) of your data’s maximum variance/variation, not slopes.

  • Hank Roberts // February 22, 2008 at 3:55 am

    > linear algebra
    Tamino, if you feel inclined toward opening another blog, “Prerequisites to an Open Mind” — or pointing to something like, I’d be game. My beloved spouse mastered calculus, which I stumbled through, and could coach me. Else, I found
    http://joshua.smcvt.edu/linearalgebra/linalg.html
    (which wants that basic calculus).

    Paul, thanks for the encouragement.

  • steven mosher // February 22, 2008 at 4:08 am

    One thing that would be a nice addition would be the ability to upload graphics in comments. I know it’s possible in WordPress ( winking at other sites that permit this) I think it’s an add in.

    Anyway, thanks for doing this Tamino.

    I have some other topics that might be of interest to folks.. Again keeping it germane yet
    not a food fight.

    1. Hurst phenomena
    2. change point analysis

    The second is probably more interesting and more accessible to folks. Occasionally, you talk about Climate regimes, and you pick out time periods to talk about 1900-1940s, the 40s to the late seventies, and the late 70s until today.

    I liked your approach of dividing the regimes, and then I read this from NOAA. Anyway, it’s not a sceptical point, but one that I thought might just interest folks and expand a bit on how regimes can be selected. I see it as an objective method for avoiding the cherry picking charge
    which is always a food fight starter.

    Hi Dhog. Nice to see you.

    http://ams.confex.com/ams/pdfpapers/100694.pdf

  • Christopher // February 27, 2008 at 9:21 pm

    Is this data available in the public domain? Apart from the webpages at http://me.water.usgs.gov/iceout.html? I’d like to recreate this example in Matlab or R. The whole notion of turning a subset of PCs into a proxy still isn’t quite there for me. And this is very clearly explained that I think I’d get it if I went thru the steps myself.

  • Principle Component Analysis « flea_on_the_bulldog // February 28, 2008 at 2:52 pm

    [...] PCA Part 1 [...]

  • TCO // March 3, 2008 at 5:19 am

    Sorry if these comments are not right on. Bear with me. My knowledge of these fields is amateur. But I’m trying to get a feel for things.

    A. What is the “signal” and is it a bit of a tautology, a circularity? I mean the signal is whatever gives most expression of the data and PCA goes in the direction of the data.

    [Response: Separating signal from noise ain't easy. Some things can be positively identified as signal, it's harder to say for sure that something is noise. It's also complicated by the fact that some climate factors (volcanic eruptions, el Nino) are not, strictly speaking, noise -- they're governed by firm physical laws -- but they're either chaotic or simply too complex for our present understanding to predict. So, they are often (and quite validly) *treated* as noise. One of the things I learned recently is that I was mistaken to give such a strong impression that the lowest-order PCs are reliably mostly noise. They usually are, but there are exceptions in which the low-order PCs identify the smaller (but sometimes more interesting) signal.

    Perhaps the best one can do is establish that certain components are *consistent* with the behavior of noise while others are not. So, we can sometimes identify signal components without ambiguity, but what's left may be noise, or may be signal which has behavior indistinguishable from noise.

    The best discriminator is usually: more data. Sooner or later, the non-noise character will make itself apparent. But that's not always practical.]

    B. Sometimes the signal may be physically meaningful, may be related to the variable of concern. But not always. Isn’t there a danger in PCA, that one will assume this to be the case when really PCA is just a way of slicing the data?

    [Response: Yes. The final installment (in the works) emphasizes that PCA is really just a transformation of the data -- a way to look at it from a different perspective.

    But keep in mind that the regression itself gives us statistics we can use to evaluate whether the fit is meaningful or not.]

    C. Is it really so useful to do PCA, extract “signal”, throw away “noise” and then do regressions? Isn’t it more statistically proper to just do multiple regressions on the data themselves with parameters of interst?

    [Response: Multiple regression has the distinct advantage that it doesn't discard *any* of the information in the data. But it has the disadvantage that is keeps all of the "degrees of freedom." The more degrees of freedom in the "predictor" variables, the more the regression will fit the noise rather than signal, degrading the quality. This is the classic problem called *overfitting*. It's like fitting higher and higher order polynomials to a time series. If we have a high enough degree, the fit will match the data *perfectly*. But that's only because we've included enough degrees of freedom to fit all the noise as well as all the signal -- such a fit is essentially meaningless.

    So like most things, it's a tradeoff -- reduce the information and you will reduce the impact of noise on the regression, but you may also lose some information.]

    D. The transforms to subtract mean and to normalize by SD can be done for many problems other than PCA (regressions). No?

    [Response: Yes. Such transformations lose no information (the data series can always be "transformed back" if we wish).]

    E. Also, there are times dividing by SD is helpful and times not. I’ve seen a medical example with blood work and diagnosis, where one problem requires SD dividing (and is hurt if you don’t) and one with the converse effect!

    F. PCA in that it does the “turning the 3-dimensional box around to find the best directions to look at things, seems most useful for problems in pattern recognition, in initial data examination, in chemical noses, etc. I don’t know that it is so useful versus just doing regressions.

    [Response: When the number of predictor variables gets *very* large (and there are cases with hundreds of predictors) the overfitting problem can be nontrivial. In such cases using PCA for regression is not just justified, it (or some other dimensional reduction method) may be necessary.]

  • TCO // March 5, 2008 at 1:22 am

    All that writing and no italics reply?

    [Response: This blog has had very heavy traffic lately; it's not easy to keep up. I'll answer your question; look for a response to that post.]

  • TCO // March 5, 2008 at 2:46 am

    In multiple regressions, you can just use a few terms, just as in PCA, you just use a few terms. I’ve never seen people feel that they needed to use hundreds of terms, just because there were than many degrees of freedom. Instead people build reasonable models with expected variables (and perhaps squared terms) using the methods in Box Hunter Hunter and other DOE advisories. And the extra degrees of freedom, you just don’t bother using them…they’re gravy…they’re your protection against overfitting. Yes, if the data is noisy, you might have a low expression of Rsq because the model does not model noise…but that’s the reality. Also, you are really testing the variables versus output versus making an implicit assumption that the first PC contains most of the signal (of the variable of interest).

    [Response: Indeed you can. But there are dangers, if you don't know a prior which terms are meaningful or if the meaningful information is a combination of terms rather than a subset. To take an extreme example (only for illustration): suppose we have 2 series -- rainfall and temperature -- but we have to reduce it to one (I said it was just for illustration). Leaving one out means we can either have rainfall or temperature, but not both, and it may be the case that neither is sufficient. PCA enables us to form a single series which is a combination of information from both physical processes. Hence it may (or may not) solve the problem.

    PCA is not "the" way to reduce dimension or combine data sets. But it's *one* way.]

  • TCO // March 5, 2008 at 3:50 am

    I think it’s great for something like that. Where you want to reduce dimension (say to store patterns). And where “signature” or “fingerprint” is the important thing. but if you DO want to understand temp versus data, why not do the regression.

    P.s. Bare with me. I tend to think about these things intuitively since my knowledge of stats is based on a college undergrad course in DOE, some corporate six sigma…and reading Climate Audit.

  • The Flea // March 5, 2008 at 4:33 pm

    Tamino: You do not need to post this, but I seem to recall discussions in the past where it was indicated that an index or catalog of your posts would be valuable. I have undertaken to develop one called Flea on the Bulldog and found here:

    http://fleaondog.wordpress.com/

    It will not be a fast work but I should be able to get something usable in about a month. I was very clear on the site that I have no connection with you and there is nothing required of you to do for this except perhaps read it every so often and offer corrections if I don’t categorize your posts correctly. Of course if you wish to be more involved I will be pleased to set you up as an admin.

    If at some point you think it is useful, you may consider linking to it from your site. On the other hand, if you don’t like the idea, please just let me know and I will be happy to delete it - no problems, no questions asked.

    The Flea!

    [Response: I'm flattered. I don't really have time to join the indexing effort, but I'll check in from time to time.]

  • mmghosh // March 5, 2008 at 9:19 pm

    Flea,

    Nice site. It will be useful for all those of us who have recently started viewing this site for educating ourselves.

    Small typo on the left hand column (surface).

    Regards

  • steven mosher // March 6, 2008 at 12:59 am

    TCO, my good friend.

    I think it will be interesting to compare Mann’s failure to center the mean properly with Watt’s failure to account for the difference in anomaly period. Especially since the former made it through “peer review” and the latter was spotted by “web review”

    Wegman was pretty clear so it will be interesting to watch how this is handled:

    “The controversy of Mann’s methods lies in that the
    proxies are centered on the mean of the period 1902-1995, rather than on the whole time
    period. This mean is, thus, actually decentered low, which will cause it to exhibit a larger
    variance, giving it preference for being selected as the first principal component. The net
    effect of this decentering using the proxy data in MBH98 and MBH99 is to produce a
    “hockey stick” shape. Centering the mean is a critical factor in using the principal
    component methodology properly. It is not clear that Mann and associates realized the
    error in their methodology at the time of publication.”

    My preference, since I believe in AGW and you do not, is to say Mann was in error and one error
    by one man does not bring a theory down. But
    his mistakes seem to be defended beyond all reason. this only feeds denialists like you.

    You and I still have a debate to finish about the future of science publishing. But not here, no fighting at Tammy town, unless we agree to no name calling.

    [Response: I let this through because you're involved in an active discussion. But please, please, please, can people not wait even a few days to hammer on non-centered PCA?

    Absolutely no more on that until the next PCA post. I'm sorry if that interferes with discussions, but I gotta draw the line somewhere.]

  • The Flea // March 6, 2008 at 2:39 am

    mmghosh: Thanks for pointing out the typo - corrected! But after all, what do you expect from a flea. Its amazing I can reach the keyboard at all.

    Best,
    Flea.

  • Johan i Kanada // March 7, 2008 at 1:13 am

    Good intro to PCA… even I might have understood something… but I am not sure.

    The PCs are statistically orthogonal, but what does that say about the physical reality? (Assuming, of course, that one analyzes some physical observations.)

    Is it reasonable to assume that each PC can have more than one underlying physical cause (PC1 = f(temp, sunshine, wind, people with axes…) for example, but also PC2 = g(temp, sunshine, wind, people with axes…))?

    [Response: Yes. The PCs don't by themselves necessarily disentangle the influences, and the influences themselves can exhibit interdependence. We hope that the regression (when we get to that step) will show the relationship between a subset of PCs and the physical parameter we're trying to isolate.]

  • Johan i Kanada // March 7, 2008 at 5:39 pm

    It seems important to keep in mind that PCs do not have any physical meaning, i.e. they are purely statistical constructs, and do not correspond to a physical reality.
    (Whereas a Fourier transform e.g. actually does corresponds to the physical reality (a signal’s frequency content).)

    Is this correctly understood?

    [Response: Yes and no. Purely mathematical analysis (PCA, or Fourier, or general time series) can tell us about signals that exist in the data. But they don't attach those signals to physics. In all these cases, the signal so determined is meaningful and significant (if we've done our job right), but connecting that to physics, or to other signals we've determined, still remains.

    In a sense, all genuine signals are related to physics. If we note an annual cycle, or an upward trend, or (gasp!) a hockey stick, we can be confident there's a *cause* for that behavior -- it's not just random, if it were it would be noise rather than signal. But is that pattern related to temperature? Or some other known variable, or some unknown (perhaps even undreamed of) phenomenon? For example: if we see an annual cycle in data, is that because of the changing temperature throughout the seasons, or the changing elevation angle of the noonday sun, or -- the annual cycle of slave labor graduate students we've enlisted to do the data collection?

    Alas, nature rarely makes it easy.]

  • TCO // March 8, 2008 at 9:30 pm

    You say that doing PCA is good when the variables of concern are not well known. But in the case of climate reconstructions, we know EXACTLY what we’re looking for. A temperature proxy.

  • Hank Roberts // March 8, 2008 at 10:05 pm

    But, TCO, are you saying that any of the temperature proxies are well known?

    The issue isn’t knowing _what_ the variables are, it’s how well they are known.

    Anything could be a variable, if you don’t know about it.

    When you know something contributes to the outcome for sure, but you don’t know it very well — that’s when the variable isn’t well known.

    Parsing.

  • TCO // March 8, 2008 at 11:26 pm

    You are losing me.

  • Hank Roberts // March 9, 2008 at 12:03 am

    You can’t know — as in understand, as in “well known” or “exactly” — just by knowing what you’re looking for.

    To know a variable you have to know something about how it varies.

    To know the variables plural, you have to know something about enough of them to explain significant variation (which can be just a small part of the total, of course).

    To know any one variable exactly you’d have to know everything about it, in the way your proxy expressed it.

    Nobody knows exactly everything about anything in science. You’d want either religion or mathematics for proofs.

  • TCO // March 9, 2008 at 4:45 am

    The intention in MBH is to examine temperature proxies versus observed temp and thus make models of previous unmeasured temps.

    This is not the case of some series where we don’t know what variables are of interest as forcing mechanisms.

  • Hank Roberts // March 9, 2008 at 6:16 am

    But how well do you know the variables? Will they explain the observations if you call them?

    Look at the IPCC’s statement of how well each identified forcing is understood and the error bars.

    Figuring out which components are primary, secondary and so forth is a serious task to take on.

Leave a Comment