As some of you know, I devised a method for aligning temperature data records which I believe is better than the “reference station method” used by NASA GISS. However, the difference is small and it doesn’t change the overall global result when small regions are averaged, then those regional results are area-weight-averaged to produce a global estimate. It’s an interesting, and possibly useful, refinement which doesn’t change the overall final answer.
In any case, the method computes an offset for each station which will bring it into alignment with the other stations, based on the idea that the difference between station data sets which are at nearby locations is approximately constant. RomanM modified my method by computing a separate offset for each station for each month. This compensates for possible differences in the annual cycle of different nearby stations. If your purpose is to study the annual cycle then this is a loss of information, but if your purpose is to study anomaly (the departure from the annual cycle) this is a gain. Most of the time we really want to get at the anomalies, so on the whole I’d say his method is better.
But nowadays I use yet another method. It’s based on the Berkeley method for combining station data to form a global average. They use a “weighting function” to represent the area over which a station should be considered to have an influence. The temperature estimate for a specific location is the weighted average of nearby stations, with closer stations given greater weight by the weighting function. I ignored the weighting function altogether so that all stations in a given region can be equally weighted to compute a regional average. My modification is therefore specifically tailored to produce a local estimate, whereas the Berkeley method is designed to include everything in one fell swoop and produce a global estimate.
Anyway, here are the details.
We have a number of station data records for the temperature estimate at station j, at time . These can be temperature anomalies so the seasonal cycle is already removed, or they can be raw temperature. Note that the straight Berkeley method uses anomalies, because they intend to combine stations worldwide, but I allow raw data so that nearby stations can be combined to give an estimate which includes the annual cycle.
We assume that there’s an offset for station j (one offset for each station), and the offsets together make up a vector of offsets. Therefore, while the data record for station j is , its contribution to the regional average will be
We also assume that there’s a single regional average temperature for each time . The offsets and the temperature averages are determined simultaneously, by minimizing the sum of squared deviations of offset station records from the regional average. In other words, we minimize the quantity
Defining equations for the offsets and average temperatures result from taking partial derivatives of this expression and setting them equal to zero. Therefore
We can rearrange these two equations. The 1st says that for any time we have what I’ll call equation (1):
where is the number of stations with data at time . The 2nd says that for any station j we have what I’ll call equation (2):
where is the number of time points for which station j has data.
This system of equations is actually underdetermined. If I add a constant c to all the station offsets , and subtract that same constant c from all the average temperatures , then the quantity Q to be minimized is unchanged — so the offsets and average temperatures are only determined up to an additive constant. Therefore I impose the additional condition that the offset for the very first station, , is required to be zero. Other choices would do as well. In fact, there might be merit in requiring that the average of all the station offsets must be zero.
To compute the values, I begin by setting all station offsets to zero. Then I use equation (1) to compute the average temperature at all times . Then I use those values for in equation (2) to compute updated values for the station offsets . Then I subtract to the value of from all the offsets so that I’ll meet the requirement .
These values are only approximate, so I repeat that procedure, at each step computing new estimates of offsets and temperatures . I also compute the sum of the absolute values of the differences in the offsets at each iteration. When this sum of differences is less than a particular threshold, I consider that the algorithm has converged and use the latest values of the offsets to compute final values of the average temperatures.
About the code:
It’s an R function. Its inputs are a set of times t, a set of temperatures x, and a set of station identifiers part. It returns a list with two elements, the computed average temperatures and the station offsets.
The first entry in the list is actually a data frame with 5 columns. Column 1 is the time, column 2 is the average temperature for that time, column 3 is the number of stations with data for that time, column 4 is the standard error of the average, column 5 is the standard deviation of the offset temperatures. The second entry is simply a vector of the station offsets, and the 1st element of this vector will be zero.
There are a few other options. You may want the data frame to include the raw input data, in which case you can include the option “full=T” and the data frame will have more than 5 columns, it will also include a column for each raw input data set.
You can also set your own threshold for when the calculation is considered to have converged. By default, if the sum of absolute values of the changes in station offsets is less than 0.001, it is considered to have converged. If you want to insist it must be less than 0.000001, include the option “offres=.000001”.
For data to be combined at a particular time, the time values must match exactly. However, if you want to round off the times to a particular number of significant digits after the decimal point before insisting that they match, say rounding times to 3 decimal digits before computing, include the option “tround=3”.
By default the average temperatures will be rounded to 4 digits after the decimal point. If you want 6 digits instead, include the option “xround=6”.
To get the code itself, download this pdf file, then copy-and-paste it into a text editor and save that as an R function.
There’s always the possibility that I’ve made an error, or that my code is fine but my description is faulty.
And it almost certainly could be optimized — perhaps someone can modify it so that it’s more efficient either in the time of computation or in the memory required. Feel free.