In this post I’d like to examine another claim made in one of the Berkeley papers, that there is a periodic fluctuation in the AMO (Atlantic Multidecadal Oscillation) with period about 9.1 years.

The claim is based on Fourier analysis of the AMO time series:

It should be noted, however, that this isn’t the raw AMO time series. First, it’s only the data since 1950. Second, it has been smoothed with a 12-month moving-average filter and detrended with a 5th-degree polynomial. When I do the same thing, then compute a Fourier transform, I get a similar result:

Although the two Fourier spectra look similar, you may have noticed that the actual numerical power values are dramatically different. That’s because Muller et al. (2011, hereafter M2011) used the normalization convention which is customary in geophysics, for which the total integrated power in the power spectrum equals 1, whereas my program uses the scaling customary in astronomy, where the *average* power equals 1. It’s just a different choice of scaling.

We both used Fourier transforms with no taper. However, M2011 padded the time series with zeros in order to oversample the periodogram. This enables one to identify a frequency more precisely. After all, the plain Fourier transform at the “default” frequency spacing looks like this:

I simply oversampled the periodogram by a factor of 4, since my program has a feature to do that without zero-padding. We probably didn’t oversample the periodogram at the same rate — but it doesn’t make much difference.

In any case, clearly there’s a peak at frequency 0.11 cycles/yr (period about 9 years). But is it significant of periodic behavior? To answer that question, M2011 perform a Monte Carlo analysis. They noted that there are 16 points at which the smoothed, detrended time series crosses the zero line headed upward. This divides the time series into 17 segments. Then they created 10,000 artificial time series by randomly permuting the 17 segments and computed the power spectrum of each. According to M2011, only 170 of 10,000 such artificial series gave a spectral peak at least as strong as that observed in the actual AMO data. Hence they conclude that there’s only a 170/10000 = 1.7% chance of this peak being accidental, i.e., that it is statistically significant at 98.3% confidence (p-value 1.7%).

I’m not convinced that the period is real. I’m not convinced it *isn’t* real either. On the face of it, a p-value of 1.7% is strong evidence. However, I reproduced their test exactly as it was described. I took the same AMO data from 1950 to the present, applied a 12-month moving-average filter, and subtracted a 5th-degree polynomial fit to create a smoothed time series. Then I split it into 17 segments at the 16 points at which it crosses the zero line headed upward. Then I generated 10,000 random permutations of those segments to create artificial time series, compute the Fourier spectrum, and noted the power level of the strongest peak for each artificial data set. I got a very different result from exactly the same test, that 461 out of 10,000 artificial data sets showed a power level at least as high as that observed for the real data.

I believe I’ve interpreted their method correctly, and I’ve double-checked my program for errors. But my result is flatly incompatible with theirs. One of us has made a mistake somewhere. It might be me.

But if any interested reader has the “skillz” and wants to try it, please do. I think it would be a worthwhile effort — just be sure to follow exactly the same procedure.

Even if my result is correct, that only increases the p-value to 4.61% so it’s still significant at 95.4% confidence. But frankly, I don’t completely trust the Monte Carlo simulation. I don’t completely distrust it either, but I don’t think it’s precise enough to be reliable at a “borderline” confidence level. Here’s why.

The autocorrelation of the AMO data is very strong. Even if you omit the 12-month moving-average filter, the lag-1 autocorrelation is 0.857 (*very* strong)! In fact the autocorrelation structure is very much like that of an AR(1) noise process. So, I generated 10,000 artificial data sets of an AR(1) noise process with autoregression coefficient 0.857. Then I applied a 12-month moving-average filter followed by subtracting a 5th-degree polynomial fit. Then I noted the highest power level in a Fourier spectrum.

For each artificial data set, I *also* split it into segments at each upward crossing of the zero line, randomly permuted the segments, then recorded the highest power level in the Fourier spectrum of the permuted artificial data set.

Here’s the distribution function (not the probability density function or “pdf” but the cumulative distribution function or “cdf”) from the non-permuted (in black) and the permuted (in red) data sets:

The vertical line marks the power level (in my analysis) of the actual AMO data. Here’s a close-up:

Note that the permuted data sets consistently give *higher* significance levels than the unpermuted. The “significance level” of the actual AMO peak power for the unpermuted data is only 90.5% confidence, but for the permuted data it’s 93.7% confidence. Hence the process of “splicing and dicing” the artificial data reduced the p-value to only 2/3 of its actual value, inflating the significance level of the peak. It seems to me to be possible, even likely, the the Monte Carlo test used by M2011 likewise inflated the significance level.

If I correctly reproduced their method, then the indicated p-value according to the Monte Carlo test is 4.6% rather than 1.7%. If the Monte Carlo test inflated significance, then it’s really more like 7%, which is only 93% confidence. I quite agree that those are some nontrivial “if”s. I also agree that 93% confidence is still genuine evidence of periodic behavior, although it’s not as strong as reported in M2011.

Incidentally, one can compute the Fourier spectrum without the 12-month moving-average filter (which I regard as wholly unnecessary and introduces tremendous, unnecessary complications). There’s still a peak at the given period:

However, when I generated artificial AR(1) noise series with the same autocorrelation but *without* the 12-month moving-average filter, then used them to define the likelihood of a periodogram peak at the observed power level, the peak is only significant at the 90% confidence level. Again, that’s some evidence of periodic behavior but isn’t nearly as strong as reported in M2011.

Of course, there’s more data for the AMO than used by M2011, since the data linked at the beginning of this post start in 1856 rather than 1950. I detrended the entire data set with a lowess smooth:

Then I Fourier analyzed the entire time span:

The given peak is still present, but it’s still only about 90% significant. We can find the same peak if we use only the data *prior* to 1950, but it’s weaker, and it’s actually *broader*:

This suggests that possibly the fluctuation is present but its characteristics are variable. So — time for a wavelet transform!

Fluctuation at the given frequency is somewhat persistent, but there are also times when it seems to be absent on the whole. In addition, the frequency as well as strength seems somewhat variable, as shown by a closeup of the wavelet transform:

Here’s my bottom line: there seems to be a persistent fluctuation with period about 9 years, but both the period and the strength are variable. In fact the period seems to have been a good bit longer (frequency lower) prior to 1900 (although I can’t testify about the data quality from that era). Whether to call it a “period” or a “characteristic time scale” is, I think, still an open question — but on the whole I’d say “periodic” is not a bad description. There’s a good case for periodicity, but it’s not yet *proved*.

Could one alleviate some of the autocorrelation by using annual values rather than 12-month smooths.

I also have to ask, you detrended the already linearly detrended NA SSTs with a loess smooth if I’m reading correctly, why is this necessary prior to your fourier the 2nd time around?

[

Response:Strictly speaking it isn’t *necessary*. But the time series is so red — so dominated by low-frequency behavior — that removing the slowest fluctuations helps the low-but-not-SO-low frequency behavior “rise above the noise.”]This is good, speedy work. Very cool. But to me the greatest take home point is that frequentist stats can bog people down, especially if they don’t understand the nuances of interpretation. When would it be ‘proved’? And what if it changed over time so that it was ‘proved’ for some time periods and not for others? Does one’s interpretation of the p-value depend upon what one intends to do with the estimate? It seems there’s good reason to believe that some cyclicity (?) exists, and while it shouldn’t be assumed perhaps it should be expected.

I would suppose that the North Atlantic supports oceanic Rossby waves with return equatorial followed by coastal Kelvin wave. A period of around 9 years is believable (as is many another period). One would have to look via satellite data or by measuring the coastal Kelvin component.

The only Rossby/Kelvin wave I’m positive about is in the North Pacific with a period quite close to 3.75 years. It is resonate in the North Pacific and the coastal Kelvin wave has been measured from the oceanographic pier in La Jolla to somewhere in Alaska. I’m a little uncerain as to exactly what keeps it energized but it appears to always be there.

As for the North Atlantic data, it might be that before around 1950 CE one Rossby wave mode was energized and after that a mode with a slightly longer period, dunno.

Correction: coastal Kelvin waves have been measured along the west coast of North America. I don’t know whether or not those coastal Kelvin waves were associated with the ~3.75 year period, with El Nino or with something else.

Here are images from satellite data of an equatorial Kelvin wave in action:

http://earthobservatory.nasa.gov/IOTD/view.php?id=43105

Good work! It might have been also very interesting to add a more detailed discussion on the lag part (which you have mentioned somewhere else), and of course some comparative values that can be seen with the same ‘full BEST-like method’ from PDO :)

One of the first things I did playing around with temperature data was to Fourier Transform it looking for periodicities. The 0.1 year^-1 peak always jumps out, as well as something around 0.3 which took to be related to El Nino.

Anyway, I had been assuming the 0.1 year^-1 is essentially the solar cycle. Have you tried a regression against sunspot numbers to see if that accounts for most of it?

[

Response:I don’t find significant response at 0.1 yr^-1.]Maybe I’m misinterpreting something? You state “clearly there’s a peak at frequency 0.11 cycles/yr” in the text and your wavelet transform seems to show a peak a little below 0.1 early in the record moving higher later; so roughly around 0.1 yr^-1. It’s been a while since I looked at the temperature records – reviewing the spreadsheet where I looked at this about 3 years back, every one of the temperature records (GISS, HadCru, RSS, UAH) had several noticeable peaks in frequency domain, with the two strongest being a little below 0.3 yr^-1 (ENSO?) and this one a little above 0.1 yr^-1. I didn’t do any statistical significance tests, so they may not be “significant” in that sense but it seemed interesting at the time.

Anyway, the sunspot peak in the frequency domain is around 0.09 yr^-1; there’s a bit of width or a side peak in the 20th century record that spreads that around 0.1, so I’d always thought there could be some relationship. And you did find some amplitude for solar cycle in your analysis here:

http://tamino.wordpress.com/2011/01/20/how-fast-is-earth-warming/

[

Response:The clear peak around 0.11 yr^-1 refers only to the post-1950 data.I’ve looked again at temperature data (GISS) and the bare spectrum has nothing like a significant peak anywhere near the solar cycle. I also detrended it (with a lowess smooth) and although there are peaks in that frequency band, they’re nowhere near significant. Bear in mind that a periodogram always wiggles up and down so there will always be “peaks” in any frequency range — but for GISS temperature data they’re not even close to significant.

I don’t deny solar-cycle influence on temperature, and it seems the best evidence of it is from Camp & Tung. But it seems to me to be too weak to be detected with Fourier analysis. I have heard (from Urs Neu) that if you remove from temperature data the volcanic and el Nino signals, and the global warming trend, then it *can* be detected — but I haven’t confirmed this.]OT here, but a question about BEST.

Has anybody determined which of the three temperature series, NOAA, GisTemp, and HadCrut, is in closet agreement with BEST?

BEST monthly data (1880/1-2010/5) correlation coefficient with:

HAD: .810

GISS: .868

NCDC: .823

kap55 – thanks. I’m a math illiterate. I looked at some stat sites, and if I get the drift, which is in doubt, GISS is the most correlated of the three?

You guys can’t hurt my feelings. My kid just scored 3.5(?) standard deviations above the whatever on his last med school test: 99.94%. I think that’s good?

Hmm.. I can’t reproduce your result…

For the GISTEMP data I get a very similar correlation, but for the Hadley-Data I get very different results. And I think (as you titled those data as “HAD”) I know the main source of this error: You need to take the Hadley CRUTEM3 data, not HadCRUT3, because only the frist one is land-only temps.

If you do that, you get the following correlation coefficients for…

BEST (01/1880-05/2010) versus…

– CRUTEM3: 0,857

– GISTEMP: 0,858

BEST (01/1950-12/2009) versus…

– CRUTEM3: 0,890

– GISTEMP: 0,910

BEST (01/1980-12/2009) versus…

– CRUTEM3: 0,844

– GISTEMP: 0,892

So, over the complete period, correlation with GISTEMP and CRUTEM3 is nearly identical. But when you look at more recent periods (since 1950 or 1980) BEST correlates much more better with GISTEMP than with CRUTEM3. And this is no wonder, as BEST and GIS cover the whole earth, whereas CRUTEM3 lacks many areas especially in the arctic.

A couple of years ago I read an explanation on the NASA website for the possible causes for the difference between the GisTemp and HadCrut result.

It would seem to me HadCrut needs to be either repaired or dethroned.

Interesting. Willis E. was making a fuss about the poorer correlation between BEST and the other records over the last 12 or so years. Does the Hadcrut/GISS divergence hold true for land only?

I just looked at the monthly BEST data…

And I found something curious… I wonder if someone else has noticed that yet?

The last values are:

…

02/2010 1,086

03/2010 0,859

04/2010 -1,035

05/2010 1,098

Notice the April 2010 value: -1,035 (!)

This is in no way realistic. The last time a value lower than -1,0 occured was more than 50 years ago.

Well, ok, the stated uncertainty for this value is also enormous (2,7°C), around 27 (!) times higher than in the last years before. But this value affects the averaged (e.g. annual) temperature data also, even in those graphs, shown in the four papers.

For example here is a comparison between a plot of the monthly BEST data vs the graph of figure 1 from their UHI paper:

In both graphs you can see a drop at the very end of the graph, which is completely and only caused by this mysterious April 2010 value. If you create the annual average without this month’s value (drawn in the image linked above in red), you get instead a slight rise (clearly visible at the very end of the graph when the black and red line seperates).

I believe, the reason for the high uncertainty estimate of the last two months in their data and the weird April 2010 value, is, that most of the stations they used for their analysis had not yet reported those values. The ones that did report data, were situated mostly in below average temperature areas, which then lead to the mysteriously low April 2010 value. But if that is the case, then I wonder why they did not just stop with their Analysis in March 2010, when they had enough data.

Last but not least: This has of course no influence on the conclusions made in their papers, but I just find it weird…

[

Response:I noticed that too, and guessed (as you did) that it was due to a sparsity of data on which that value is based.]Poorer quality data at the early end of the AMO time series is likely, but poor quality would tend to introduce random errors or biases. It seems to me it would be very hard to find a class of data flaws that could cause a peak in an energy spectrum to gradually change to a somewhat different period.

FWIW, I’m getting the same results you are. Ran ten lots of 10,000 samples each, and got p-values from .044 to .050, with a mean of .048, nicely consistent with your own.

[

Response:Hmmm… That’s pretty strong evidence that I didn’t make a mistake in my programming. Which means that either: 1) M2011 made a mistake in their programming; or 2) I’ve misinterpreted their method (which doesn’t seem likely). Thanks for the extra investigation!]You’re welcome; it was good practice, anyway.

I tried to look through the Berkeley code, but unless I’m missing something, the relevant parts aren’t there yet; it’s just the stuff they used to create and analyse BEST itself, I think. So option 1 is currently impossible to check. Option 2 seems unlikely, since I followed their description in my own attempt at replication. Granted, I’d obviously seen yours beforehand, so that may have coloured my interpretation, but I don’t think so. There’s just not that much wiggle room there.

Hmm. May be just coincidence, but I get results far closer to Muller and co using R’s default taper value of 0.1; tends to give p-values of around 0.015, on average.

After zero-padding the series to a length of 2048, which gives me a periodogram matching the shape of Muller’s reasonably well but not perfectly, I’m seeing p-values of ~0.04. No “skillz”, just a script kiddie with little understanding of the method, so don’t take this as evidence of anything.

David Benson makes interesting points about possible mechanisms for a ~9 yr variation in SSTs. However, it needs to be taken into account that the AMO is the average for the whole North Atlantic (north of the equator) and the ocean dynamics in the tropics are quite different from those in the mid-latitudes, which are again different from those in the sub-polar and polar regions. Hence, a first step is to check from what region this signal arises. I can think of several ways of doing this, pherhaps the simplest is to start with gridded historic data set (perhaps this one here http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.ersst.html), and at each gridpoint run a filter that only retains temporal variability close to 9 years. The result will be a gridded data set where the only temporal evolution signal has power at ~9 years. Make a map of the temporal standard deviation of this data set. This map would show where the activity on the 9 year timescale in the Atlantic really is.

Speaking of cycles of ~ 9 years in the North Atlantic, I recently happened on Humlum et al., “Identifying natural contributions to late Holocene climate change,”

Global and Planetary Change79, no. 1-2 (October 2011). They claim novelty for the fun-with-Fourier-and-wavelets method, applied to a century of surface temps on Svalbard and GISP2 ice-core data from Greenland.Tamino, I’d love to hear your take on this, but I advise against imbibing hot beverages while reading. I was particularly astonished by the 99.9% confidence they claim both for a 3,598-year cycle in 4,000 years of data and for a 68.4-year cycle in 99 years of data…

Oops, sorry, forgot the link:

http://www.sciencedirect.com/science/article/pii/S0921818111001457

doi:10.1016/j.gloplacha.2011.09.005

For those comparing GISTEMP to BEST note that gistemp doesn’t have an actual land-only temp series for comparison. Their met only is not the same as land-only.

> Humlum et al.

The first under “Related Articles” list on that page is

Solar and volcanic fingerprints in tree-ring chronologies over the past 2000 years

http://www.sciencedirect.com/science/article/pii/S0031018211005141

doi:10.1016/j.palaeo.2011.10.014

“… solar-climate associations remain weak, with for example no clear linkage distinguishable in the southwestern United States drought records at centennial time scales. Other forcing factors, namely volcanic activity, appear to mask the solar signal in space and time. To investigate this hypothesis, we attempted to extract volcanic signals from the temperature proxies using a statistical modelling approach. Wavelet analysis of the volcanic contribution reveals significant periodicities near the DeVries frequency during the Little Ice Age (LIA). This remarkable and coincidental superposition of the signals makes it very difficult to separate volcanic and solar forcing during the LIA. Nevertheless, the “volcano free” temperature records show significant periodicities near the DeVries periodicity during the entire past 1500 years, further pointing to solar mechanisms and emphasising the need for solar related studies in the absence of strong multi-decadal volcanic forcing.

You are looking for a stable repetitive cycle in a forced dynamic system.

The fact that the system is forced means that patterns and periods are not stable.

CO2 has been increasing for a hundred years, forcing has been increasing for a hundred years, so we do not have a stable reference period.

As the climate changes, the pattern and period of the AMO change.

I have been doing some work on the full AMO data set (linked to near the start of the post), and I think the roughly 9 year feature is real, but I don’t think it’s exactly an oscillation. I’ve been comparing the detrended AMO data with synthetic noise data generated with a Gaussian distribution. Instead, it looks like the AMO data set is filtered with a corner frequency in the vicinity of the 9 year feature. The higher frequencies roll off with a 1/f characteristic. The lower frequencies also roll off, but the slope and the exact corner frequency depend on how you detrend the data.

I detrended with a LOWESS smooth that I then subtracted from the base data, similar to what Tamino reported. The degree of low-frequency roll off varied with the LOWESS smoothing width – smaller width, more LF rolloff. The approximately 9-year period feature moved to a slightly higher frequencies as I reduced the smoothing window with, but the effect was small.

Comparing with the FFT of the Gaussian noise data rolled off by a simple 1-pole low-pass filter, the two looked almost identical from the corner frequency on up. The amplitude of the peak-like feature was very similar between the two FFTs (this was true for several different noise runs). Since the Gaussian noise data sets don’t have a 9-year feature, I conclude that the AMO feature is essentially the same as in the filtered noise data – an artifact of the filtering.

Here are the steps I followed in this work:

1. Import or create the data set.

2. For the AMO data, detrend by subtracting a LOWESS smooth of the data from the original. The results are not very sensitive to the window width – I tried widths of about 50 – 200 months.

3. Filter the Gaussian noise data set with a low-pass, single pole filter with a corner frequency corresponding to the roughly 9 year period.

4. Apply a 1/2 cycle cosine window to each data set (removes edge artifacts and improves resolution of peaks).

5. Apply FFT to each windowed data set. I used an FFT routine that works for non-power-of two lengths, so I didn’t have to pad the data sets.

6. Compare both FFTs using a log-log plot.

To sum up, the detrended AMO data look like Gaussian noise passed through a mild bandpass filter. The details of the low-frequency rolloff interact with the specifics of the detrending, but the high-frequency rolloff is pretty independent.