While looking around for tornado data, I found a fascinating page by Harold Brooks in which he builds a model of the likelihood of a “significant tornado day,” which I’ll call an “STD” (yeah, it’s a funny choice). This is defined as a day with at least one tornado of Fujita scale F2 or stronger.
He was interested in studying runs of consecutive STDs. At first, he simply used the bare probability of an STD (which depends strongly on the time of year) to estimate the likelihood of consecutive runs, but it utterly failed to mimic the observed data.
Let’s explore such a model — that the probability of any given day being an STD is simply some constant which might depend on the time of year, but doesn’t depend on what may have happened on other specific days. In fact, let’s start simple by ignoring the time-of-year dependence, studying the model in which the probability of an STD is the same constant for all days of the year. Let that probability be , then the probability of not being an STD is of course . In some given number () of days, how many runs of consecutive STDs should we find?
Suppose yesterday was not an STD. If today also isn’t, then it completes a “run” — in this case, a run of zero consecutive STDs. The probability of that happening is simply the probability today isn’t an STD, i.e.:
What’s the probability of a run of 1 day (but not longer)? That’s the probability that today is and STD but tomorrow isn’t, which is
The probability of a run of 2 is the probability that both today and tomorrow are STDs, but the day after isn’t:
You probably see the pattern: the probability of a run of STDs in a row is:
i.e., days in a row yes, terminated by a “no” day (some of you may recognize this as a re-parameterization of the geometric distribution). Note that a “run” is defined by the fact that it must end with a non-STD, so every non-STD completes a run and a “run of ” lasts days (including the run-ending non-STD). It’s also worth noting that the total probability of all possible run lengths (from zero to infinity) is
as it must be (whew!).
That gives the complete probability distribution for the lengths of runs of consecutive STDs. But — how many runs will we observe in a given number () of days?
Since a “run of ” is days long, the average length of a run is simply the expected value (the average) of . This can be computed as (angle brackets denote the expected value):
I’ll leave out the details (feel free to indulge yourself!), also mention that there’s a clever way to get this without any computation at all(!), and just state the result:
Since the average run length is , the expected number of runs in days is
Out of those runs, we expect runs of length zero, runs of length 1, and in general
runs of length .
Let’s compute an example. Suppose the single-day probability of an STD is , and we consider days (one hundred years). Then the expected number of runs of various lengths are:
In a 100-year span, we expect nearly 18000 runs of zero, 5400 runs of one, 1600 runs of two, 483 runs of three, and 145 runs of four. The chance of a run of ten is only 0.11, so only about one year in a thousand will have a run of ten consecutive STDs. The chance of a run of nine is just 0.35, so we might well get one in a century (or we well might not), but the chance of two such runs in a century is less than one out of twenty — ’tain’t likely.
Just for giggles, I ran a simulation of this procedure and observed 17,851 runs of zero, 5428 runs of one, 1635 runs of two, 477 runs of three, 133 runs of four, 33 runs of five, 11 runs of six, and 5 runs of seven. I didn’t get any runs of eight (you’d only expect 1) or longer (you wouldn’t expect any), and all the simulated counts are well within theoretical expectation.
For actual counts of significant tornado days (STDs), the bare probability of an STD depends strongly on time of year. Here’s Brooks’ estimate:
He based this on data from Tom Grazulis’ book Signficant Tornadoes 1680-1991 combined with other data through 1995. I estimated the same thing using data from NOAA-NWS and got this:
Using Brooks’ estimate, the bare probability never really gets much above 0.3 and is well below that for most of the year. Using my estimate, it never gets above 0.4 and is well below that for most of the year. The bottom line is that we should expect long runs of consecutive STDs to be even more rare than indicated by the simple constant-probability model with . In fact they should be quite a bit more rare — in a single century we shouldn’t expect any runs as long as eight, and the likelihood of a run of nine is exceedingly small.
But the observed data show otherwise. In spite of the small bare probability of an STD, in the Grazulis data (used by Brooks) over the last century there are two runs of nine in the data record (15-23 May 1949 and 6-14 June 1967). This is too unlikely to be plausible if the simple independent-probability model is correct.
Incidentally, in the NOAA-NWS data there are even longer runs of consecutive STDs. In fact there are three runs of 12 in the record (May 23 to June 3, 1954; June 7 to Jun 18, 1962; and May 9 to May 20, 1982). Of course, the higher bare probability estimated from the NOAA-NWS model means we should expect longer runs. But using those (time-of-year dependent) figures, the observed number of long runs is still well above what’s expected from the bare-probability model.
We are led to the conclusion that the probability of an STD on any given day isn’t just dependent on the time of year, it depends on something else as well — something which increases the likelihood of long runs of consecutive STDs.
Markov Chain Model:
Brooks was able to mimic observations of long runs of consecutive STDs, by using a 1st-order Markov Chain model. In this model, there are two different probabilities of a significant tornado day (STD) — one which applies if the previous day was not an STD, another which applies if the previous day was an STD. Brooks uses standard notation and calls the probability of STD if the previous day was not ““, the probability of STD if the previous day was also, ““. They’re often referred to as “transition probabilities.” I chose to simplify things a bit by using the notation and .
Both these probabilities are time-of-year dependent, and here’s Brooks’ estimate of the probabilities throughout the year based on the data from Grazulis:
Incidentally, the other two probabilities are just and .
I estimated the same quantities using the NOAA-NWS data, which does seem to give somewhat higher bare STD probabilities. It also gives somewhat higher transition probabilities:
Armed with these estimates, it’s straightforward to run a simulation which estimates the number of runs of any given length which we’d expect to see in a year. Both Brooks and I executed runs of 10 million simulated days. Brooks ended up with this comparison of expected vs observed runs of length per year, based on the Grazulis data:
I got this comparison of expected vs observed runs of length per year, based on the NOAA-NWD data (expected values in black, observed in red):
The match is actually quite good. The large visual discrepancy for very long runs is due to two factors: first, the y-axis is logarithmic so small numbers show exaggerated differences; second, long runs are rare so their random fluctuations are larger in proportion to the number of events.
The Markov Chain model actually does give realistic values, but for both estimates (Brooks’ using Grazulis’ data, and mine using NOAA-NWS data) the observed frequency of very long runs is a bit higher than the theoretical estimate. In particular, three occurrences of a “run of 12” in the 58 years of NOAA-NWS data is more than is plausible. It is remotely possible that this is just a statistical fluke, but it’s more likely that the 1st-order Markov Chain model, while an excellent approximation (which the bare-probability model is not), is still incomplete.