In the preceding post we examined criticism of one of the Berkeley team’s papers, Decadal Variations in the Global Atmospheric Land Temperatures (Muller et al. 2011, hereafter referred to as M2011), by a fake skeptic. Now I’ll offer my own critique.
The more I study this paper, the less I like it. There are a few serious problems and some bogus numbers. They’ve taken steps which don’t invalidate analysis but do make it a helluva lot harder. And there are unanswered questions which are relevant to the central theme. In fact the more I think about it … maybe I was too hard on Doug Keenan? Nah.
I have quite a lot to say about this paper, so I’m going to do so in two posts. In this, the first, I’ll address some of the statistical issues as well as one of the central results.
The paper computes correlation (actually cross-correlation, which allows for a lag between the factors) between land-only temperature and ocean oscillation indexes, including AMO (Atlantic Multidecadal Oscillation), ENSO (El Niño Southern Oscillation), PDO (Pacific Decadal Oscillation), NAO (North Atlantic Oscillation), and AO (Arctic Oscillation). The temperature data they used for this comparison was an average of four data sets: the land-only data from NASA GISS, NOAA/NCDC, and HadCRU, plus a Berkeley analysis based on a random sample of data sets which were not used in any of the other temperature estimates. Because of greater variance in data prior to 1950, the analysis is restricted to the time span 1950 through 2010.
As for the statistical analysis, for one thing, there isn’t enough of it — some of the most important questions aren’t addressed at all. Also, in spite of the use of Monte Carlo tests to compensate for the quirks in the data, the standard errors they quote aren’t based on such tests. Instead they’re naive, and simply in error. Further, there are some issues with the Monte Carlo tests themselves. And although it’s by no means forbidden to smooth data before analysis, doing so imposes the severest standard of care and rigor, a standard which I’m beginning to suspect simply hasn’t been met.
M2011 specifically examine variations on time scales from 2 to 15 years. Therefore before analysis, all data were subject to two filtering steps. First, a 12-month moving-average filter was applied. This “smoothing” step effectively acts as a low-pass filter, removing rapid fluctuations but leaving slower ones in place. Then, a 5th-degree polynomial fit was subtracted from the data, to produce a “pre-whitened” data set which was the object of all subsequent analyses. Subtracting the low-order polynomial effectively acts as a high-pass filter, removing the very slow fluctuations. Finally, the time series were rescaled: they were normalized so that they would all have the same total variance (i.e., all be on the same “scale”).
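To make the three steps concrete, here is a minimal sketch in plain Python. It is not the code M2011 used. The function names are hypothetical, the moving average simply drops the 11 points for which a full window isn't available, and the polynomial is removed by projecting out orthogonalized powers of time, which is one way of subtracting a least-squares fit.

```python
import math

def moving_average(x, window=12):
    """Running mean over `window` points; returns len(x) - window + 1 values."""
    total = sum(x[:window])
    out = [total / window]
    for i in range(window, len(x)):
        total += x[i] - x[i - window]
        out.append(total / window)
    return out

def remove_polynomial(y, degree=5):
    """Subtract the least-squares polynomial of the given degree.

    Powers of time (rescaled to [-1, 1]) are orthonormalized by modified
    Gram-Schmidt, and the projection of y onto each is removed in turn.
    """
    n = len(y)
    t = [2.0 * i / (n - 1) - 1.0 for i in range(n)]
    resid = list(y)
    basis = []
    for k in range(degree + 1):
        col = [ti ** k for ti in t]
        for q in basis:
            dot = sum(c * qi for c, qi in zip(col, q))
            col = [c - dot * qi for c, qi in zip(col, q)]
        norm = math.sqrt(sum(c * c for c in col))
        q = [c / norm for c in col]
        basis.append(q)
        dot = sum(r * qi for r, qi in zip(resid, q))
        resid = [r - dot * qi for r, qi in zip(resid, q)]
    return resid

def normalize(x):
    """Shift to zero mean and scale to unit variance."""
    mean = sum(x) / len(x)
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    return [(v - mean) / sd for v in x]

def prefilter(x, window=12, degree=5):
    """Low-pass (moving average), high-pass (polynomial removal), rescale."""
    return normalize(remove_polynomial(moving_average(x, window), degree))
```

Feeding it 732 monthly values (1950 through 2010) returns 721 filtered values with zero mean and unit variance. A pure 12-month sinusoid comes out of `moving_average` as essentially zero, which is the low-pass behavior described above.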
Eliminating slow fluctuations was motivated by a desire to remove the long-term trend due to global warming or other influences. This is important because the time series are very “red” — meaning the variance is dominated by slow fluctuations. If it’s not removed, then it will overwhelm the variations at medium timescales which M2011 want to study. But eliminating fast fluctuations seems to me unnecessary — it really doesn’t interfere with medium range fluctuations — and has the severe drawback of introducing artificial autocorrelation into the data. I would have omitted this step, instead turning the focus onto fluctuations on time scales up to (but not greater than) 15 years.
Perhaps their choice was motivated by the fact that the time series show a significant (but not necessarily large) annual cycle. This is because even those time series which are computed as anomalies (like temperature and AMO) have had the average annual cycle during the baseline period removed. However, the annual cycle itself is subject to fluctuations, so removing its average form during a baseline period leaves a residual annual cycle. That residual is especially noticeable here because the analysis covers 1950 through 2010, a span over which the various annual cycles can deviate from their averages during the “baseline period” used for anomaly calculation.
The 12-month moving-average filter eliminates fluctuations with a period of 1 year. I would instead have removed any residual annual cycle directly (say, by fitting a Fourier series) rather than invite the extreme complications due to artificial autocorrelation from the moving-average filter. However, their choice is a valid one and does concord with their intention to study fluctuation in the time scale range 2 to 15 years, although it imposes extreme demands on the statistical analysis — especially as the time series already have strong autocorrelation even before the moving-average step. For some questions addressed by the paper those demands are perhaps met, but for others they’re not.
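As an illustration of the alternative mentioned above, removing a residual annual cycle by fitting a Fourier series might look like the following sketch. The function name, the choice of two harmonics, and the restriction to a whole number of years are all assumptions made to keep the example short. With a whole number of years the sine and cosine terms are orthogonal, so each least-squares coefficient is a simple projection.

```python
import math

def remove_annual_cycle(x, harmonics=2, period=12):
    """Subtract a least-squares fit of the first few annual harmonics.

    Assumes len(x) is a whole number of periods and harmonics < period / 2,
    so the sine and cosine columns are mutually orthogonal.
    """
    n = len(x)
    if n % period != 0:
        raise ValueError("this sketch assumes a whole number of years")
    if harmonics >= period / 2:
        raise ValueError("harmonics must be below period / 2")
    resid = list(x)
    for h in range(1, harmonics + 1):
        w = 2.0 * math.pi * h / period
        cos_h = [math.cos(w * i) for i in range(n)]
        sin_h = [math.sin(w * i) for i in range(n)]
        # Projection coefficients (each column has squared norm n / 2).
        a = 2.0 / n * sum(xi * ci for xi, ci in zip(x, cos_h))
        b = 2.0 / n * sum(xi * si for xi, si in zip(x, sin_h))
        resid = [r - a * ci - b * si for r, ci, si in zip(resid, cos_h, sin_h)]
    return resid
```

Unlike the moving average, this touches only the annual frequency and its harmonics, so it adds no artificial autocorrelation to the rest of the signal.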
I would also have removed the slow fluctuations by a method other than 5th-degree polynomial fit because polynomials of even modest order can show undesirable behavior near the ends of the time series — I would probably have used a lowess smooth instead. But again, their choice is a valid one.
The main conclusion is that land-only global temperature correlates more strongly with AMO than with the other indexes, even ENSO. For correlation of temperature with AMO, M2011 report a peak correlation at lag 0, with value r=0.65 +/- 0.04. The quoted “+/-” value is said to refer to 1 standard error (so a 95% confidence interval, in the normal approximation, would be plus or minus twice that value). However, the quoted value for the standard error is just plain wrong. That’s the value which would apply if the input data series were white noise, but they’re not. They’re autocorrelated even before the moving-average filter is applied, and the autocorrelation is even stronger after its application. Even if one omits the moving-average filter (which I think is a much better idea), the standard error according to a crude calculation should be at least twice as large. With the moving-average filter, by a crude calculation the standard error is about 4 times as large, and considering all the uncertainties present (the impact of the moving-average filter, the crudeness of the calculation) it could well be even more than that. It is possible that M2011 simply reported the standard errors which were spit out by whatever program computed the correlations, which would be based on a white-noise model. Regardless, all the quoted standard errors for correlations are simply bogus, far too low.
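One common rough way to see how autocorrelation inflates the standard error of a correlation is a Bartlett-style approximation: the white-noise standard error gets multiplied by about sqrt(1 + 2 Σ ρx(k) ρy(k)), summed over lags k ≥ 1, where ρx and ρy are the autocorrelations of the two series. The sketch below implements that. It is not necessarily the crude calculation referred to above, the function names are hypothetical, and strictly speaking the approximation applies under the null hypothesis of no true correlation, so treat it as a rough guide only.

```python
import math

def autocorr(x, max_lag):
    """Sample autocorrelation at lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    var = sum(v * v for v in d)
    return [sum(d[i] * d[i + k] for i in range(n - k)) / var
            for k in range(max_lag + 1)]

def se_inflation(rho_x, rho_y):
    """Factor multiplying the white-noise standard error of r.

    rho_x and rho_y are autocorrelation lists starting at lag 0.
    """
    total = 1.0 + 2.0 * sum(a * b for a, b in zip(rho_x[1:], rho_y[1:]))
    if total <= 0:
        raise ValueError("estimated autocorrelations give a non-positive sum")
    return math.sqrt(total)

# Example: two AR(1)-like series, each with lag-1 autocorrelation 0.6.
rho = [0.6 ** k for k in range(200)]
factor = se_inflation(rho, rho)  # sqrt(1.36 / 0.64), a bit under 1.5
```

Put the other way around, an inflation factor of 4 corresponds to an effective sample size only one sixteenth of the nominal number of data points.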
Fortunately they don’t rely on those values for statistical tests. Unfortunately, they don’t actually test the main claim, that the AMO correlation is stronger than the ENSO correlation. Therefore the primary result of M2011 is not demonstrated by their analysis.
They do directly test whether or not the AMO correlation is statistically significant. This is done by Monte Carlo simulations, i.e., generating multiple artificial AMO signals with the same basic properties in order to see how they correlate with the temperature data. To get artificial signals with the same properties as the actual (smoothed) AMO data, they note that there are 16 times at which the smoothed AMO signal crosses the zero line headed upward. They split the AMO data at these 16 points, which creates 17 segments of data. These 17 segments are then “scrambled,” i.e., arranged into a random order, to generate artificial data with behavior (such as autocorrelation) which is like that of the original smoothed AMO, but is not the same as the original smoothed AMO.
On the face of it this sounds like a sound procedure, especially since it would seem to guarantee that where different segments are joined together they will both represent an upward crossing of the zero line, so the scrambled signal should maintain “continuity” and be a realistic “fake” smoothed AMO. But that’s not quite true, because the two endpoint segments are incomplete; they’re not full “one upward zero crossing to the next” sections. Here, for instance, is one such “scrambling” of the AMO series (to which I’ve applied the same 12-month moving average filter and 5th-degree polynomial filter, but I didn’t bother to normalize the series, which makes no difference for this particular point):
I’ve marked with red arrows where section 17 ends, and where section 1 begins, because at those points there’s an unreal discontinuity in the artificial signal. Perhaps a better idea would have been to keep the first and last sections in place, and randomly scramble the 15 sections in between.
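The scrambling procedure, and the keep-the-ends variant suggested above, can be sketched as follows. This is a reconstruction from the description, not M2011's code. The function names and the `keep_ends` flag are hypothetical.

```python
import random

def upward_crossings(x):
    """Indices i where the series goes from negative at i-1 to non-negative at i."""
    return [i for i in range(1, len(x)) if x[i - 1] < 0 <= x[i]]

def split_at(x, cuts):
    """Split x into consecutive segments, each cut starting a new segment."""
    bounds = [0] + list(cuts) + [len(x)]
    return [x[a:b] for a, b in zip(bounds[:-1], bounds[1:])]

def scramble(x, rng=None, keep_ends=False):
    """Shuffle the segments between upward zero crossings.

    With keep_ends=True the incomplete first and last segments stay put
    and only the complete interior segments are shuffled.
    """
    rng = rng or random.Random()
    segments = split_at(x, upward_crossings(x))
    if keep_ends and len(segments) > 2:
        middle = segments[1:-1]
        rng.shuffle(middle)
        segments = [segments[0]] + middle + [segments[-1]]
    else:
        rng.shuffle(segments)
    return [v for seg in segments for v in seg]
```

With `keep_ends=True` every join is a genuine upward zero crossing, so the number of upward crossings in the scrambled series matches the original. With the unrestricted shuffle an endpoint segment can land in the interior, which produces the kind of discontinuity marked by the red arrows.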
I have another misgiving about their Monte Carlo procedure. Because of the combination of strong autocorrelation in the original data, and additional autocorrelation introduced by the moving-average filter, the autocorrelation of the smoothed AMO data persists to large lags. Therefore scrambling the “one upward zero crossing to the next” sections might alter the autocorrelation structure significantly at lags for which it still has an impact on the analysis results. I haven’t done the analysis to prove this, but my intuition tells me that the scrambling process is courting trouble.
And it’s all because the autocorrelation is made so strong, and so persistent, by the moving-average filter. Clearly M2011 are aware of the fact that a simple white-noise test is insufficient, and it’s true that Monte Carlo tests can overcome a great many severe difficulties without even having to know what the noise structure is. But I suspect that M2011 didn’t really appreciate just how severe the problems become with the application of their moving-average filter, and may not have invested careful enough thought into designing tests which would be proof against the severity of the problem.
In spite of the difficulties with the Monte Carlo tests, I don’t think they’re hampered so much as to be useless for testing the significance of the AMO correlation. This is especially true as the indicated p-value from their tests is less than 0.000001, which would be significant even if it were too low by four orders of magnitude! Therefore I accept the result that the correlation with AMO is statistically significant.
But that doesn’t establish that the AMO correlation is stronger than the ENSO correlation. It’s like testing two coins to determine whether they’re “fair” (i.e., have equal chances of landing heads or tails). Suppose we flip each one 100 times. Coin “A” gives 62 heads and 38 tails, which rejects the null hypothesis (fair coin) with statistical significance. Coin “B” gives 58 heads and 42 tails, which fails to reject the null hypothesis. But that does not establish (with statistical significance) that coin “A” has a higher “heads” probability than coin “B”. They could very well both have a 60%-40% chance of heads over tails.
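The arithmetic behind the coin example can be checked with a few lines of Python. This sketch uses the normal approximation to the binomial and a pooled two-proportion z-test for the comparison. Both are assumptions about how one might run the tests, and the function names are hypothetical.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def z_fair(heads, flips):
    """z statistic for the null hypothesis that P(heads) = 0.5."""
    p_hat = heads / flips
    return (p_hat - 0.5) / math.sqrt(0.25 / flips)

def z_difference(heads_a, flips_a, heads_b, flips_b):
    """z statistic for the null hypothesis that both coins share one P(heads)."""
    p_a = heads_a / flips_a
    p_b = heads_b / flips_b
    pooled = (heads_a + heads_b) / (flips_a + flips_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / flips_a + 1.0 / flips_b))
    return (p_a - p_b) / se

p_coin_a = two_sided_p(z_fair(62, 100))                 # z = 2.4, p about 0.016
p_coin_b = two_sided_p(z_fair(58, 100))                 # z = 1.6, p about 0.11
p_a_vs_b = two_sided_p(z_difference(62, 100, 58, 100))  # z about 0.58, p about 0.56
```

Coin “A” is significantly unfair at the 5% level, coin “B” is not, and yet the direct comparison of the two is nowhere near significant. The same logic applies to comparing the AMO and ENSO correlations.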
M2011 don’t give a graph of the cross-correlation function (CCF) between AMO (or other indexes) and their average temperature record, but they do graph AMO correlation with each component temperature record. Here, for instance, is their graph of the CCF for AMO and the various temperature records:
It’s worth noting that the various CCFs don’t all peak at lag 0. An extreme closeup shows that for some (most?) of the temperature records, the peak happens at a very small negative lag:
I don’t have their Berkeley temperature record, but I filtered the AMO and the land-only data from NASA GISS by their method and computed the CCF, giving this (my plot covers a smaller range of lags, since I already know from their analysis about where the peak lies):
For the GISS temperature data, the peak correlation is at lag -1 months, meaning that temperature leads AMO rather than lags it. This argues against causality from AMO to temperature, but only very (very!) weakly since the difference between the lag 0 correlation and lag -1 correlation is so tiny.
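For readers who want to reproduce this kind of calculation, here is a minimal cross-correlation sketch using the same sign convention: a positive lag means temperature lags the index, a negative lag means temperature leads. The function names are hypothetical, and each lag's correlation is computed from the overlapping portion of the two series with its own mean and variance, which is one of several common variants.

```python
import math

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    da = [v - ma for v in a]
    db = [v - mb for v in b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return num / den

def cross_correlation(index, temp, max_lag):
    """Return {lag: r}, pairing index[t] with temp[t + lag]."""
    n = len(index)
    out = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = index[:n - lag], temp[lag:]
        else:
            a, b = index[-lag:], temp[:n + lag]
        out[lag] = corr(a, b)
    return out

def peak_lag(ccf):
    """Lag at which the correlation is largest."""
    return max(ccf, key=lambda lag: ccf[lag])
```

Both inputs should already have been through the same filtering steps. A result of `peak_lag(...) == -1` corresponds to the situation described above, with temperature leading the index by one month.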
I did the same computation, but without the 12-month moving-average filter:
Now the peak correlation is at lag -2 months (again temperature leads AMO) and the difference from the lag 0 correlation is larger. I think this suggests two things. First, it’s yet another reason it may have been better to omit the moving-average filter. Second, the argument against causality from AMO to temperature is stronger. It’s still very weak — but based only on the time series, the argument for causality is even weaker.
The correlation between ENSO and temperature is not as strong as that between AMO and temperature. But it does peak at a realistic lag for a causal relationship, as temperature lags ENSO by 5 months:
For this computation I used the MEI (Multivariate ENSO Index) to characterize ENSO, rather than the Niño 3.4 index used by M2011. Even so, using their choice doesn’t alter the result substantively: the correlation is weaker than with AMO (probably!), but the lag is causally realistic.
I also computed the CCF between ENSO and AMO. The peak correlation is at lag 8 months and is much stronger than the lag 0 correlation, so if the relationship is causal then it’s ENSO driving AMO rather than the other way around:
Given all these considerations, I find implausible the speculation of M2011 that:
“However, it is also interesting to consider whether oceanic changes in the AMO may be driving short-term fluctuations in land surface temperature.”
I agree that it’s interesting to consider! But I consider it to be implausible. It’s hardly impossible, but it seems more likely to me that the correlation of AMO with land-only temperature reflects a common cause rather than causality from AMO to temperature.
There are other issues which I’d like to discuss, so there’s lots more to say. But this post is already long enough — so those will appear in Part II (coming soon).