One of the subjects that interests me (a lot of people, in fact) is *extreme values*. By definition, they’re not very common. It follows that when we look at observed data to discover what the likelihood of extreme values is, we have little data to go on.

In fact, sometimes there’s a sense in which we have *no* data to go on. One of the more interesting questions is, how likely is some quantity to reach a given value which is larger than any we’ve yet observed? That is, without doubt, quite a challenge. But these days it’s a recurring theme as observations take us into new territory, terra incognita.

The situation is inherently uncertain, but not hopeless. We’re interested in the total probability of a value being as big as, or greater than, some value *x*. That is just 1 minus the cdf (cumulative distribution function), something called the *survival function*.

We do know some of the properties of the survival function. Of course it’s always in the range from 0 (no chance of being that high or higher) to 1 (everything is that high or higher). We also know that as *x* increases, the survival function *S(x)* cannot increase, i.e. the survival function is *monotone nonincreasing*. In almost all cases we expect it to be *monotone decreasing*. Finally, for *x* equal to negative infinity the survival function is 1 (every actual value is bigger than negative infinity) while for *x* equal to infinity, the survival function is 0 (no chance of a value being infinite or bigger). These are just mirror images of the properties of the cdf, since the survival function is 1 minus the cdf.

So the survival function starts out at 1, then decreases to 0; the extreme values are those near the end. We can take a look for some well-known probability distributions. Suppose for instance something actually follows the normal distribution. Here’s the pdf for the normal distribution (the bell curve, you’ve probably seen it before):

More relevant to our discussion is the survival function, which looks like this:

For extremes, we’re interested in the tail of this distribution. Here it is for *x* values 2 or greater, which is “pretty extreme” (but not extremely extreme) when we’re working in units of standard deviations:

Not only does the survival function necessarily decay to zero as *x* increases, for the normal distribution it decays with extreme rapidity. That means that the chances of extreme values, for a variable following the normal distribution, are extremely low. To put it in more familiar terms, ‘taint likely.
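To put numbers on that rapid decay, here’s a minimal Python sketch using only the standard library (`normal_sf` is just my name for the helper; it’s the quantity statistics packages usually call `sf`):

```python
import math

def normal_sf(x):
    """Survival function of the standard normal: P(X >= x) = erfc(x / sqrt(2)) / 2."""
    return 0.5 * math.erfc(x / math.sqrt(2))

# The tail probability shrinks with extreme rapidity as x grows
# (x here is in units of standard deviations):
for x in (2.0, 4.0, 6.0):
    print(f"S({x}) = {normal_sf(x):.3g}")
```

Each extra couple of standard deviations knocks the tail probability down by several orders of magnitude.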

Other distributions don’t show such rapid decay of the survival function. Here (for an “extreme” example) is the t-distribution with only 1 degree of freedom:

Its survival function looks like this.

It gives us an example of a *heavy-tailed* distribution. In fact here’s that heavy tail:

Compared to the normal distribution it has a *very* heavy tail. The tail is so heavy that I haven’t plotted it in terms of standard deviations, simply because with only 1 degree of freedom the variance diverges, so the standard deviation can’t even be computed.
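A sketch of just how heavy that tail is: the t-distribution with 1 degree of freedom is the Cauchy distribution, whose survival function has the closed form 1/2 − arctan(*x*)/π, so we can compare it with the normal tail directly (standard library only; the function names are mine):

```python
import math

def cauchy_sf(x):
    """Survival function of the t-distribution with 1 df (the Cauchy distribution)."""
    return 0.5 - math.atan(x) / math.pi

def normal_sf(x):
    """Survival function of the standard normal."""
    return 0.5 * math.erfc(x / math.sqrt(2))

# At x = 6 the Cauchy still has roughly a 5% chance of exceedance,
# while the normal tail is down around one in a billion:
print(cauchy_sf(6.0), normal_sf(6.0))
```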

What about an in-between case? The archetype is the one distribution for which the pdf looks exactly the same as the survival function for allowed *x* values, the *exponential distribution*. It’s only defined for nonnegative *x* values, i.e. negative *x* values aren’t allowed:

One of the fascinating properties of the exponential distribution is that its tail also looks the same:
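That self-similarity is easy to verify numerically. A quick sketch for the standard (rate 1) exponential, where both the pdf and the survival function equal e^(−*x*) on the allowed range:

```python
import math

def expon_pdf(x):
    """pdf of the standard exponential: e**(-x) for x >= 0, zero otherwise."""
    return math.exp(-x) if x >= 0 else 0.0

def expon_sf(x):
    """Survival function of the standard exponential: also e**(-x) for x >= 0."""
    return math.exp(-x) if x >= 0 else 1.0

# pdf and survival function coincide at every allowed x:
for x in (0.0, 1.0, 3.5):
    assert expon_pdf(x) == expon_sf(x)
```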

This is a case for which the survival function decays faster than the heavy-tailed t-distribution (with 1 or more degrees of freedom) but not as fast as the normal distribution.

There are even distributions whose tails decay faster than normal. An example is the *uniform distribution*, for which the survival function decays so fast that it actually hits zero in finite time:
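For the uniform distribution on [0, 1] the survival function is simply 1 − *x* on its support, so hitting zero at a finite value is explicit in a sketch:

```python
def uniform_sf(x):
    """Survival function of the uniform distribution on [0, 1]."""
    if x < 0.0:
        return 1.0
    if x > 1.0:
        return 0.0
    return 1.0 - x

# The survival function reaches exactly zero at the finite value x = 1,
# and stays there:
print(uniform_sf(0.5), uniform_sf(1.0), uniform_sf(2.0))
```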

Of course we can’t know what the real distribution is, other than in certain exceptional (and rare) circumstances. But we can approximate the tail of a distribution as decaying slowly, at intermediate speed (exponential distribution), or rapidly. This leads to the Pickands–Balkema–de Haan theorem. It states that for many underlying probability distributions, the survival function for large values of *x* is well approximated by the generalized Pareto distribution, often expressed in the form

$$S(x) = \left( 1 + \frac{k (x - \mu)}{\sigma} \right)^{-1/k} .$$

The quantities *μ* and *σ* don’t have their usual meanings, the mean and standard deviation of the distribution. Instead they are a generic *location parameter* and *scale parameter*. The quantity *k* is sometimes referred to as the *shape parameter*.

There are three “regimes” for the shape parameter. When *k* is positive, the survival function decays as *x* increases according to a power law. Therefore the survival function gets lower and lower as *x* increases (as it must), but never quite reaches zero. In fact, because of its power-law decrease it decays rather “slowly” (compared to other distributions). When *k* is negative, it decays but actually hits zero at a finite value of *x* — this is the case where there’s an upper limit to the *x* values which are possible at all.

When *k* is equal to zero the expression for the generalized Pareto distribution is undefined, but we can use the limiting distribution as *k* goes to zero, which turns out to be the exponential distribution.
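All three regimes can be sketched with one small function. I’m using the same sign convention as in the text (positive *k* means a heavy, power-law tail; some textbooks flip the sign), and `gpd_sf` is just my name for the survival function:

```python
import math

def gpd_sf(x, k, mu=0.0, sigma=1.0):
    """Survival function of the generalized Pareto distribution:
    S(x) = (1 + k*(x - mu)/sigma) ** (-1/k),
    with the exponential limit exp(-(x - mu)/sigma) when k == 0."""
    z = (x - mu) / sigma
    if z < 0:
        return 1.0
    if k == 0:
        return math.exp(-z)
    t = 1.0 + k * z
    if t <= 0:  # only possible for k < 0: we're past the finite endpoint
        return 0.0
    return t ** (-1.0 / k)

# k > 0: slow power-law decay; k = 0: exponential decay;
# k < 0: the survival function has already hit zero (endpoint at x = 2 here):
print(gpd_sf(5.0, 0.5), gpd_sf(5.0, 0.0), gpd_sf(5.0, -0.5))
```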

Hence one approach to estimating the probability of extreme values more extreme than any yet observed is to fit a generalized Pareto distribution to the tail that we *have* been able to observe. We then extrapolate that to higher *x* values and voilà — we have an estimate of their probability.
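Here’s a sketch of what that fit-and-extrapolate recipe might look like, on synthetic exponential data where the true answer is known (*k* = 0, *σ* = 1). I’m using a simple method-of-moments estimator for the generalized Pareto parameters (real analyses usually prefer maximum likelihood), and all the function names are mine:

```python
import math
import random
import statistics

def fit_gpd_moments(exceedances):
    """Method-of-moments estimates (k_hat, sigma_hat) for a generalized
    Pareto fit to exceedances over a threshold.  For a GPD with k < 1/2,
    mean = sigma/(1-k) and variance = sigma**2/((1-k)**2 * (1-2k)),
    which invert to the formulas below."""
    m = statistics.mean(exceedances)
    v = statistics.variance(exceedances)
    k_hat = 0.5 * (1.0 - m * m / v)
    sigma_hat = m * (1.0 - k_hat)
    return k_hat, sigma_hat

# Synthetic "observations": exponential data, so the true tail has k = 0.
random.seed(42)
sample = [random.expovariate(1.0) for _ in range(100_000)]

# Peaks over threshold: keep only the top 5% as the "observed tail".
threshold = sorted(sample)[int(0.95 * len(sample))]
exceedances = [x - threshold for x in sample if x > threshold]
k_hat, sigma_hat = fit_gpd_moments(exceedances)

def tail_prob(x):
    """Extrapolated estimate of P(X > x) for x above the threshold."""
    p_exceed = len(exceedances) / len(sample)  # estimate of P(X > threshold)
    z = (x - threshold) / sigma_hat
    if abs(k_hat) < 1e-9:
        s = math.exp(-z)
    else:
        t = 1.0 + k_hat * z
        s = 0.0 if t <= 0 else t ** (-1.0 / k_hat)
    return p_exceed * s

# Estimated probability of exceeding a value bigger than any yet observed:
print(k_hat, sigma_hat, tail_prob(max(sample) + 1.0))
```

The fitted shape should come out near zero and the scale near 1, recovering the exponential tail we started from.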

The procedure is fraught with uncertainty. But that doesn’t mean it’s useless (however much certain deniers might want you to believe that). As imperfect as it might be, it gives us a realistic (at least in the ballpark) and quantitative estimate.

How could we tell how much of the “observed tail” to use, and whether or not the extreme part of the tail is heavy, light, or in between? A diagnostic I’ve found useful is the *logarithm* of the survival function which we’ve estimated from the data. Actually it’s useful to use the negative of the logarithm, a quantity referred to in survival theory as the *cumulative hazard function*

$$H(x) = -\ln S(x) .$$

For in-between decay (exponential distribution) the cumulative hazard function follows a straight line. Here for instance is the cumulative hazard function for the standard exponential distribution:
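From data, a natural version of this diagnostic is the negative log of the *empirical* survival function. A sketch (standard library only), checked against the straight-line case: for standard exponential data the points should scatter around the line H(x) = x.

```python
import math
import random

def empirical_cumulative_hazard(sample):
    """Return (x, H_hat(x)) pairs, where H_hat = -log of the empirical
    survival function S_hat(x) = (number of values > x) / n.  The largest
    value is dropped, since S_hat = 0 there and the log blows up."""
    xs = sorted(sample)
    n = len(xs)
    return [(xs[i], -math.log(1.0 - (i + 1) / n)) for i in range(n - 1)]

# Straight-line check: for standard exponential data, H(x) = x.
random.seed(0)
data = [random.expovariate(1.0) for _ in range(50_000)]
pairs = empirical_cumulative_hazard(data)
x_mid, h_mid = pairs[25_000]
print(x_mid, h_mid)  # these should be close to each other (both near ln 2)
```

Plotting the pairs and eyeballing the curvature at the high end (downward for heavy tails, upward for light ones) is the diagnostic.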

For slowly decaying tails, the cumulative hazard function curves downward as *x* increases to extreme values, like in the t-distribution:

For rapidly decaying tails, however, it curves upward as *x* increases. For the generalized Pareto distribution with negative shape parameter it will go to infinity (the survival function goes to zero) at a finite value of *x*, as for example the uniform distribution:

Another useful diagnostic is something else borrowed from survival theory, the *mean residual lifetime*. But I’ve gotten mathematical enough already, so I’ll leave it to interested readers to research the topic themselves.

Not all distributions approach the generalized Pareto distribution for large *x* values. For example, for the normal distribution the cumulative hazard function does curve upward for large *x* values:

But it does *not* reach infinity in finite time (the survival function does not go to zero in finite time), so we can’t really extrapolate to very very large *x* values. Likewise for the log-normal distribution.

But such an approach can still be useful to extrapolate to values higher than observed but not *too much* higher. The survival function for the normal distribution is locally approximately exponential; its cumulative hazard curves upward, but not by a lot, as long as we’re only looking a *little* above the highest-yet-observed values.

All of which reinforces my opinion, that this approach to extreme values is fraught with uncertainty but is by no means hopeless.

One final note: such attempts to estimate the likelihood of bigger-than-yet-seen values are different from the other tactic in extreme value theory, estimating the distribution of the “hottest-per-year” or “biggest-per-century” values. That’s covered by the other part of extreme value theory, which comes complete with its own set of special distributions (Gumbel, Fréchet, reversed Weibull). But that is a topic for another day.

One nice use for a t-distribution is as a prior on a parameter in a Bayesian calculation: it gives you something more general than a Gaussian, but not as unprincipled as an “uninformative prior” like a uniform, and not quite as wild as a Cauchy (sometimes called “Lorentz”). Still, I have seen Cauchy distributions used as priors at times.

“How could we tell how much of the “observed tail” to use, and whether or not the extreme part of the tail is heavy, light, or in between?”

Why wouldn’t you use the whole tail, if not the whole distribution?

Aren’t you throwing away potentially important information when you use only part?

That was interesting — extremely so.

Thanks.

If you know the distribution exactly, you could indeed use it, but you don’t so you can’t. As for how much of the tail to use, I like thinking of financial analogues, for which people believe they have good mental models. If you think of income distributions, which are very skewed compared to climate data, would including the top 50%, 10%, 1%, or 0.1% as the model-training dataset be better or worse for predicting the top 0.001% of income? As you go to smaller and smaller subsets of the top of the distribution, you lose training set size, but you do get values closer to the values of interest. While if you include larger sets, your model might be more representative of the bulk, but less representative of the extremes.

I guess I wasn’t clear.

I meant the whole distribution of values.

if you are going to fit some parametric function to the tail, why not fit it to the whole thing (all the values)?

or better yet, find a known distribution that approximates the distribution of values.

Physical processes are producing temperature distributions, so I find it hard to believe that there are no distributions that have been studied that would do an adequate job of describing “non-gaussian” distributions.

I find it hard to accept “extrapolations” based on only tails, particularly when one does not even use the entire tail (e.g., values greater than the mean), not least of all because the tails are precisely where these distributions are “sparse”.

I should have said “why not find a parametric function that fits all the values?”, because the tail function is “tailored”, so to speak.

Horatio Algeranon, the generalized Pareto distribution is valid only for the tails, not for the full distribution. That you are looking at the tails is one of the conditions for the mathematical theorem that shows that this distribution fits.

The situation is similar to the central limit theorem. If you average enough values (of any distribution with a finite variance) you get a normal distribution. The difficulty is, what are enough values.

In the same way you only get the generalized Pareto distribution when you fit it to the tails. The difficulty is, how far into the tail do you have to go. That depends on the data at hand.

Because of this theorem the use of the generalized Pareto distribution is more than just fitting “some parametric function”.