Getting model-data comparison right

One of the favorite criticisms harped on by deniers is that global temperature isn’t rising as fast as computer models have predicted. So far, comparisons have shown that observed temperature is on the low end, even skirting the significantly low end, of model results. Deniers generally use this to imply, or say outright, that not only are models “wrong wrong wrong” but the whole of climate science is “wrong wrong wrong.”

Of course it might be a valid criticism of the models, but not of global warming theory which most decidedly does not depend on complex computer models. The models are just our best way of forecasting what the future will bring; they aren’t necessary to understand, or confirm by many observations (not just temperature data), the physics behind man-made climate change.

But it’s still important to understand why models are diverging from observations, because models really are our best tool for knowing what to expect. A new paper by Cowtan et al. has identified one of the crucial reasons, namely, that models aren’t diverging from observations by nearly as much as has been believed so far.

How could that be? It’s simple, really, when you realize that so far people have been comparing apples to oranges. Global temperature from model runs is the global average of surface air temperature (SAT), but global temperature from observational data is a blended average of surface air temperature over land and ice regions with sea surface temperature (SST) over open ocean. One thing we know, and have known for a long time, is that not only are SAT and SST different, they’re exhibiting different trends.

When I heard about this, my first thought was “Of course. Why didn’t I think of that?” Because it really is one of those things that’s obviously true — obvious, that is, once someone thinks of it.
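To make the apples-to-oranges point concrete, here is a minimal sketch in Python. The three trend values are hypothetical numbers chosen only for illustration (they are not from Cowtan et al.), and the land fraction is the usual rough figure of about 29%. It compares a "model-style" global trend (air temperature everywhere) with an "observation-style" trend (air temperature over land, SST over ocean).

```python
LAND_FRACTION = 0.29  # rough share of Earth's surface that is land


def global_trend(land_trend, ocean_trend, land_fraction=LAND_FRACTION):
    """Area-weighted combination of a land trend and an ocean trend."""
    return land_fraction * land_trend + (1.0 - land_fraction) * ocean_trend


# Hypothetical trends in deg C per decade (illustrative only).
sat_land = 0.25   # surface air temperature over land
sat_ocean = 0.15  # surface air temperature over ocean
sst_ocean = 0.13  # sea surface temperature (assumed to warm more slowly)

model_style = global_trend(sat_land, sat_ocean)  # SAT everywhere
obs_style = global_trend(sat_land, sst_ocean)    # SAT over land, SST over ocean
apparent_gap = model_style - obs_style

print(f"model-style trend: {model_style:.4f}")
print(f"obs-style trend:   {obs_style:.4f}")
print(f"apparent gap:      {apparent_gap:.4f}")
```

With these made-up inputs the two "global temperatures" differ even though the underlying climate is identical in both cases; the gap comes purely from what is being averaged.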

A good summary of the situation was given by one of the co-authors, Mike Mann:

A number of us had independently noticed that at least some of the apparent discrepancy in past comparisons of observed and modeled warming appeared to be an artifact of an apples-and-oranges comparison: Observational global average temperatures employ sea surface temperature (SST) over the oceans, while model-estimated temperatures have typically used surface air temperature (SAT) over the oceans. Since SSTs are warming more slowly than SATs (for physically-understood reasons) that leads to an apparent divergence between the two quantities.

As we learned of each others’ parallel efforts and joined forces, led by Kevin, this turned into a far more exhaustive and authoritative analysis by a team of leading experts. What we found is that it is highly non-trivial to do the comparison right. One key complication that arises is that the observations typically extrapolate land temperatures over sea-ice covered regions since the SST is not accessible in that case. But the distribution of sea ice changes seasonally, and there is a long-term trend toward decreasing sea ice in many regions. So the observations actually represent a moving target. To do this right requires treating the model temperature field in precisely the same way as the observations, which means using a time-dependent land/sea mask!

So suffice it to say that past comparisons of observed and model-predicted warming (including e.g. those shown in the most recent IPCC report) haven’t quite been correct. The apparent divergence between model- and observed warming appears to be in substantial part an artifact of. Doing the comparison properly, we reconcile a large chunk (38%) of the discrepancy. The rest can easily be explained by other factors that have been examined in recent work, e.g. errors in the radiative forcing used in the model simulations and the fact that the models and observations have experienced different realizations of internal decadal variability.

Indeed, while the central idea (compare like to like) is simple, getting that right isn’t. One of the difficulties is the switch from SAT over ice-covered ocean to SST over open ocean when the ice melts, a change which is certainly time-dependent and is itself trending. Another is that most global temperature estimates from observations are based on anomalies rather than absolute temperature. This too is related to sea ice: anomalies are computed relative to long-term average conditions, and seawater temperature underneath ice changes hardly at all, being constrained by the freezing point of water. As sea ice has declined, this has introduced a cool bias at the point where the ice melts.
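The sea-ice effect on anomalies can be seen in a toy single-grid-cell calculation. This is only a sketch under stated assumptions: the function `blend` and all temperatures are hypothetical, chosen to mimic a cell that is fully ice-covered during the baseline period and fully open later, and it is not the procedure used in the paper.

```python
def blend(air, sea, ice_fraction):
    """Use the air value over the ice-covered part of the cell and the
    sea value over the open-water part."""
    return ice_fraction * air + (1.0 - ice_fraction) * sea


# Hypothetical baseline climatology (deg C): cell fully ice-covered.
air_clim, sea_clim, ice_base = -20.0, -1.8, 1.0
# Hypothetical later state: ice gone, air far warmer, water barely changed
# because it was pinned near the freezing point under the ice.
air_now, sea_now, ice_now = -1.0, -1.5, 0.0

# Pure air-temperature change, as a model's SAT diagnostic would report it.
sat_change = air_now - air_clim

# Blending absolute temperatures first, then differencing.
change_from_absolutes = (blend(air_now, sea_now, ice_now)
                         - blend(air_clim, sea_clim, ice_base))

# Blending anomalies: each variable is first referenced to its own climatology.
air_anom = air_now - air_clim
sea_anom = sea_now - sea_clim
change_from_anomalies = blend(air_anom, sea_anom, ice_now) - blend(0.0, 0.0, ice_base)

print(sat_change, change_from_absolutes, round(change_from_anomalies, 3))
```

In this toy case the anomaly-based blend registers only the small change in water temperature once the ice is gone, while the air above the cell has warmed by many degrees. That is the cool bias described above, exaggerated here by making the cell switch completely from ice to open water.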

The situation is further complicated by the fact that for proper comparison, one should process model data in the same way as the observational data set to which it’s compared. Some use interpolation for infilling sparsely- or un-observed regions, others simply mask out those regions. Then there’s the issue of sea ice; should one use monthly averages as though sea ice were constant throughout the month, or allow for monthly variability in sea ice cover? The authors tested several choices, paying particular attention to emulating as closely as possible the procedure used for the HadCRUT4 data.
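As a rough illustration of what "mask out those regions" means for a model field, the sketch below computes an area-weighted mean on a tiny latitude-longitude grid, once over every cell and once over only the cells flagged as observed. The grid, the anomaly values, and the coverage mask are all hypothetical, and the helper `area_weighted_mean` is not taken from any of the data sets' actual code; the cos(latitude) weighting is the standard approximation for regular grids.

```python
import math


def area_weighted_mean(field, lats_deg, mask=None):
    """Mean of field[i][j] weighted by cos(latitude of row i).

    If mask is given, only cells where mask[i][j] is True are included.
    """
    total = 0.0
    weight_sum = 0.0
    for i, lat in enumerate(lats_deg):
        weight = math.cos(math.radians(lat))
        for j, value in enumerate(field[i]):
            if mask is not None and not mask[i][j]:
                continue  # cell is unobserved, so leave it out
            total += weight * value
            weight_sum += weight
    return total / weight_sum


# Hypothetical model anomalies (deg C) on three latitude bands, two longitudes.
lats = [75.0, 0.0, -75.0]
model_anomaly = [[3.0, 3.0],   # Arctic band warming fastest
                 [1.0, 1.0],
                 [1.0, 1.0]]
# Hypothetical observational coverage: nothing observed in the Arctic band.
coverage = [[False, False],
            [True, True],
            [True, True]]

full_mean = area_weighted_mean(model_anomaly, lats)
masked_mean = area_weighted_mean(model_anomaly, lats, mask=coverage)
print(f"full coverage: {full_mean:.3f}  masked like obs: {masked_mean:.3f}")
```

Here the masked mean is lower than the full mean because the fastest-warming cells are the ones left out, which is why a model field should be reduced to the same coverage before it is compared with a masked observational series.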

Their figure 3 illustrates the magnitude of the effect:


Figure 3: Difference between global mean blended temperature and air temperature, for different variants of the blending calculation, averaged over 84 historical + RCP8.5 simulations. Blended temperatures show less warming than air temperatures; hence the sign of the difference is negative for recent decades. Results are shown for the four permutations of masked versus global and absolute temperatures versus anomalies (with variable sea ice in each case). Two additional series for the absolute and anomaly methods with fixed ice show that fixing the sea ice boundary eliminates the effect of using anomalies. The final series shows the HadCRUT4 method, which shows similar behaviour to the other anomaly methods.

All by itself, doing an apples-to-apples comparison reduced the discrepancy between models and observations. When, in addition, one re-computes computer model results using the fact that estimates of climate forcing have improved since the runs used for IPCC reports, the discrepancy between models and data is considerably reduced.

Of course this is bad news for the deniers, who will find one of their favorite criticisms undermined. I expect a hissy-fit to follow. But it’s good news for the rest of us, because it means we can have more confidence in our best (albeit certainly imperfect) forecasts of what to expect in the future, and what will be the consequences of the actions we choose to address the growing problem. If only we can get the U.S. government to address the problem at all.

19 responses to “Getting model-data comparison right”

  1. It seems the paper is paywalled at the AGU but Kevin Cowtan has a page devoted to it at his website:

    It also contains a neat video lecture where he hammers a bunch of nails into the denialist’s hiatus coffin.

  2. This one is already causing copious conniptions at Tony’s place.

    Instead of using the actual charts in the Cowtan et al paper, he replaces them with a bizarre John Christy chart that compares tropical (20S-20N) satellite data to an average of 102 CMIP5 models. And he gives that graph a highly misleading title that would lead the unsuspecting to believe it actually came from the paper.

    Instead of apples and oranges, Watts has moncked (my own invented term) this one up to the point that he’s deceptively comparing apples and aspartame.

    • Rob:
      Worse yet, as far as I can see Spencer and Christy are still just ignoring Po-Chedley et al. 2015, treating it as if it didn’t exist, and pretending that this “tropical troposphere hot-spot” is still “missing”. But Po-Chedley et al. solved that problem months ago. The graph Watts posted is both disingenuous — because it has nothing to do with the Cowtan et al paper — and also obsolete, because it doesn’t account for the improvements in Po-Chedley et al. 2015.

    • Rob Nicholls

      ‘Watts has moncked…this one up…’
      An excellent turn of phrase.

  3. Martin Stolpe

    Mike Mann uploaded a non-paywalled preprint on his website:

  4. Comparing apples to apples, the best global temperature index to compare with global SAT from models is actually the “old” Gistemp dTs. The coverage over oceans is poor though, with temperatures infilled 1200 km from island and coastal GHCN stations.
    I would love to see a global BEST based on land stations only but with full global kriging (no land mask). I believe that an “unlimited” BEST land, because of its large database and ability to use short segments of temperature data, would be an even better alternative than Gistemp dTs for models vs observations comparisons.

  5. A ‘purloined letter’ indeed–hiding right out in the open.

    Language side note: FWIW, I particularly relished Mike’s phrase “highly non-trivial.”

  6. No doubt the deniers will be quite happy about this. They have complained for a while that the models are “wrong” and the predicted global temperature does not match observed temps. This paper shows that to a certain extent they were right, insofar as the global temperatures were not being calculated in a way that allowed for the most accurate comparisons with observed data. With the use of blended land/air and sea surface temperatures, and more accurate forcings, the discrepancies are considerably less. That should be a great relief to deniers who were concerned about the “model-observation” gap.

  7. Zeke Hausfather


    Actually, in the paper we mask models to have the same coverage as HadCRUT4 (as well as C&W and Berkeley in the supp mats), which effectively accounts for coverage differences.

    • christianjo


      I don’t think that’s really true. Coverage should matter, because some parts of natural and anthropogenic climate variability in the models are not in phase with observations.

      E.g. sea ice cover, which in the models is not as low as in observations. This could cause a bias through the WACCy pattern, which cools the middle latitudes and warms the high latitudes. That pattern is not present in the models because of their weaker sea ice loss. So if you cut out the Arctic latitudes, you can get a small cool bias relative to the models.

    • Zeke, it should be easy to create a BEST dTs by simply removing the land mask. Are you not curious about the result? Have you not tried that at any stage of the development of the BEST land temp series? An in-house product with the ambition to estimate the complete global 2 m air temp.
      However, I guess I have to wait for Gistemp dTs with ISTI/GHCNv4. Gistemp dTs (or GHCN v3) is getting really poor with islands; they don’t even have Hawaii in recent years.
      I would love to see Lewis and Curry using Gistemp dTs, instead of Hadcrut4, for calculations of climate sensitivity. That would increase their numbers by 40-50%. The other way around, if they had used a global observational index like Cowtan&Way or BEST their numbers would have increased by 10%, plus a further addition of about 20% (my estimate) for SST not being marine 2 m temps.

  8. Been saying this for ages and identified a probable total of about 15% bias in naive model/obs comparisons due to coverage and SAT/SST issues. Looks like their numbers are similar.

    On the other hand, I’ve also found that almost all CMIP5 model runs are inadequately initialised in relation to known historical conditions at 1850 – in particular the period immediately preceding featured fairly extreme volcanic activity. Comparing “uninitialised” runs against runs initialised by simulation of the past millennium from the same model suggests that the CMIP5 ensemble would have simulated about 15% greater warming on average for 1850-present if properly initialised. So, over the full period it’s a wash. However, nearly all the initialisation bias appears to be spent by about 1950 so comparisons tied to 1961-1990 or later baselines are solely biased in one direction.

  9. Looking for a comment on this by someone who might be more knowledgeable about model-observational comparison practice.

    Wilks cautions in his Section 7.7.1 of Statistical Methods in the Atmospheric Sciences, 3rd edition (2011), that if estimates of variance are sought for ensembles like HadCRUT4 (the application to HadCRUT4 is mine, not his) …

    … the dispersion of a forecast ensemble can at best only approximate the [probability density function] of forecast uncertainty … In particular, a forecast ensemble may reflect errors both in statistical location (most or all ensemble members being well away from the actual state of the atmosphere, but relatively nearer to each other) and dispersion (either under- or overrepresenting the forecast uncertainty). Often, operational ensemble forecasts are found to exhibit too little dispersion …, which leads to overconfidence in probability assessment if ensemble relative frequencies are interpreted as estimating probabilities.

    Strikingly, if such variances are estimated using some methods on HadCRUT4 (e.g., Fyfe, Gillett, Zwiers, Nature Climate Change, 2013), they look tiny compared with variances estimated from climate model runs. Of course, then, whatever the discrepancy between their means, SST vs SAT or whatever, it looks worse because the pooled variance is much smaller than the model variance.

    So, what’s the story about HadCRUT4 ensemble variance? Is there gross underdispersion using some methods, a case of the Wilks comment? Is it because “There is basically one observational record in climate research” (Kharin, 2008) and climate models are trying to forecast “all possible futures”? Or is it something else?

    My hunch is that agreement between observations and models oughtn’t just be in the first moment, the mean, but to some degree second and higher moments. And I’d expect there to be quite a lot of slop. So getting tiny variances from observations just doesn’t make sense to me.


  10. So can there be (or is there) an observational record that uses surface air temperatures over the ocean rather than sea surface temperatures? Or is that too hard to do?

    • As I understand it, there’s no practical way to do a comparable true 2-meter temperature record at sea. There is some marine air temperature data, but it’s ship-based, which means that it’s relatively scanty and not particularly helpfully distributed. Apparently, it also has issues related to the varying heights of ship decks above the surface.

  11. I suspect if you remove the influence of ENSO, much of the remaining difference from the ensemble mean (with updated Schmidt forcings) would be eliminated (hint).

  12. Is there a CMIP-5 model output dataset that is narrowly targeted at layers of troposphere above “surface,” against which the RSS and UAH variety of indices properly should be compared?

  13. I believe the higher-than surface troposphere at least in some layers is thought to be more sensitive to ENSO and other ocean changes (due to increased energy transfer when warm ocean water causes more evaporation that releases energy when condensing at altitude). If so, then the model runs’ spread should be larger than for surface temperature.