Survival, part I

The topic: it’s probably not what you think.

This post has nothing to do with climate change. It’s about the survival function, which finds application in many fields, most notably medical research (where survival is considered important) and failure testing. I’ve come up with a new way to estimate it from observed data, and most important, what I believe is a much-improved way to estimate confidence limits.

The traditional way is called the Kaplan-Meier (KM) estimator. Back in 1957, Edward L. Kaplan and Paul Meier both submitted manuscripts defining essentially the same method to the Journal of the American Statistical Association. The editor at the time was John Tukey, who suggested they join forces and submit together. Which they did.

To say their method became a success, is a gold-medal understatement. It’s nearly universal these days, indispensible in medical research, and their original paper has become one of the most-cited scientific papers of all time. When your work has been referred to by other scientific publications over 30,000 times, you know it’s had an impact. They well deserve one of the most flattering accolades one can achieve in statistics: to have a method named after you.

The germ of the idea is simple. Suppose we observe a number of individuals, say n of them. At a particular time t, some may have died (or been diagnosed, or gone into remission, or any kind of notable event) while others still live. Life or death, in this case, can be viewed as a binomial event. If the number still alive is k, then the chance of surviving at least that long is easy to estimate using binomial statistics: it’s just k / n.

So far, so ridiculously easy. But medical research (and failure testing, and other applications) comes with a catch: sometimes we don’t know whether some patients lived or died because we lost track of them. Maybe they were checking in every week (so it’s easy to tell whether they’re dead or alive), but stopped checking in and we can’t find out what happened. So we have some data — they lived at least “so long” — but we don’t know how long they lived after that. Or, perhaps we simply chose to end the study after a year or two. Even if we know the status of everyone who was still alive at that time, we don’t know how long they lived because their lives aren’t over yet.

Such data are referred to as censored data. In this particular case they’re right-censored because the lifetime is at least some given length, but could also be any value higher than that. Left-censored data mean that the event happened no later than a given time, but we don’t know how much earlier. Finally, interval-censored data are events that happen no earlier than one given time and no later than some other, but could have occured at any time in between. Medical research tends to be dominated by right-censored data.

Censoring complicates the calculation of binomial statistics for the Kaplan-Meier estimator. We can’t just ignore it, or ignore it after it’s censored, because that will bias the results. Even if it didn’t, we wouldn’t want to; many studies need the maximum information but suffer from small samples, it’s too much of a loss to ignore any of it.

Therefore Kaplan and Meier devised their method, also called the product-limit estimator. It’s ingenious; interested readers can find details on Wikipedia (linked to at the start of this post).

Of course just estimating the survival function isn’t enough, we need to know how uncertain our estimate is. A method to do that had already been worked out by Greenwood, and is called “Greenwood’s method.” It estimates the standard error of the Kaplan-Meier estimate, then uses the normal-distribution approximation to assign confidence limits. It’s akin to the calculation of confidence limits for binomial statistics called the Wald interval.

If, for instance, you want 95% confidence intervals, you take the mean value (the Kaplan-Meier estimate), subtract 1.96 times the standard error to get the lower confidence limit, add 1.96 times the standard error for the upper confidence limit. That approximation can be shown to apply asymptotically, meaning as the sample size gets larger and larger. But it’s just an approximation, a better one the more data we have.

One of the things noticed before long was that at late times, the lower confidence limit estimated by Greenwood’s method tended to be too low. Another method was devised to estimate the standard error with censored data, which gave somewhat improved results, called “Peto’s method.”

But the main problem is that the Wald interval isn’t very good for binomial statistics. This is especially a problem with small samples, even medium-size samples. It’s even a problem with very large samples when the number of “events” or “non-events” (deaths or survivals) gets very low. Unfortunately, for survival analysis it’s often important to know what’s happening at precisely those times, either very early (e.g. infant mortality) or very late (total life expectancy).

Those are the times for which the naive Wald interval for binomial statistics can give nonsense results, like a lower confidence limit less than zero. That of course is impossible; the probability of survival (or of anything, for that matter) can’t be less than none. Likewise it can lead to upper confidence limits greater than one, similarly impossible. The custom in such cases is simply to truncate the confidence limits at zero below and one above. It works, the result is at least no longer nonsense, but it’s not very good and certainly not very satisyfing.

The survival function estimate is essentially a probability, so it’s always between zero and one. If we log-transform, the result is between minus infinity and zero. If we compute the standard error of the log-transformed value, then we can subtract 1.96 times that to get a lower limit which is surely finite (and negative), that can be exponentiated (un-log-transformed) to give a lower limit estimate which is never below zero. Problem solved! Or at least, improved. Such a confidence interval is called, not surprisingly, “log-transformed.”

It solves the problem of the lower limit being negative, but doesn’t solve the problem of the upper limit being above one. So, we can log-transform the negative of the log-transform (whew!), compute the standard error of that, add and subtract 1.96 standard errors, then un-log-transform the un-log-transformed resultant limits, and we’re guaranteed our confidence limits will never be below zero or above one. Such an interval is called, not surprisingly, log-log-transformed.

Hence we seem to have three different ways to estimate confidence limits for the Kaplan-Meier estimate, “plain,” “log,” and “log-log.” But actually there are six, because for any of the three we can use the standard error from Greenwood’s method, or Peto’s.

The simple fact that there are six viable alternatives for confidence limits, should be taken as a sign: we can’t have much “confidence” in our confidence limits.

Perhaps at this point you’re thinking, “What’s wrong, Tamino? All this text and no graphs?” Fear not.

I retrieved some data for survival of AIDS patients. It’s only an illustrative sample, and I even limited it to those who did not have a history of IV drug use. That left 51 subjects, and for 9 of them the data were censored but for the other 42 the time of death was known. Here’s the plot of the KM estimate, with the “plain” (95%) confidence limits shown as dashed lines:


The little vertical lines (tick marks, if you like) mark the times at which subjects were censored, i.e. were lost track of in the study.

This also makes evident another property of the KM estimate, which I consider a shortcoming, that it’s not a continuous function of time. That’s the way it is; between event times the numbers don’t change so the KM estimate doesn’t change either. But it’s implausible that the survival function is constant from one moment of death until the next is imminent. Some have attempted to deal with this by smoothing the KM estimate, but my new method takes a different approach.

How do those plain confidence limits compare to alternatives? I won’t bother showing the versions using Peto’s method, there are enough already with Greenwood’s. Here are the plain (already shown), log, and log-log estimated confidence intervals (all 95% confidence):


There are clearly visible differences among them (of course). What’s surprising is how big the differences are. It’s easier to see if we zoom in, say, on the earliest times, the first six months:


Just after month 1, the upper confidence limit by the log-log method is more than twice as far from 1 as the other two estimates. The lower confidence limit is also substantially different. Here’s a zoom-in of the latest times, the last months:


For the lower confidence limit it’s the “plain” estimate that is the “odd man out.” It even falls to zero at month 57 (it gets truncated there). But for the upper confidence limit it’s the log estimate that’s the extreme. In both cases, the differences can well be described as substantial.

Regardless which is the odd-ball or how far apart the estimates are, it’s an open question which one is closest to being “right.” Until now, I believe. I think my new method doesn’t just give good answers, at certain moments of time it gives the correct answers.

That’s why I’ve written it up and submitted to the Journal of the American Statistical Association. I chose them for two reasons. First, it’s a top-flight statistics journal. Second, it’s where Kaplan & Meier published their groundbreaking paper over 50 years ago.

And what’s the new method, you wonder? There’s another post coming soon.


15 responses to “Survival, part I

  1. The technique you describe may also be used for estimating freeway capacity (as in automobile highways), per Ozguven and Ozbay,

  2. If you really have developed a new method for this, for God’s sake submit it to a journal!

    [Response: Already submitted]

  3. Or patent it!

  4. Exciting… even for those of us who’ve never heard of Peto before… I await the next ‘installment.’

  5. Having been down the patent path with something much more innocuous, I can hardly imaging what a stats application would involve. This has relevance in work I am associated with and will follow it with interest… well done.

  6. I *really* want to see this new method.

    Those poor AIDS patients. It seems like this data is from quite some time ago – what was the source?

  7. They well deserve one of the most flattering accolades one can achieve in statistics: to have a method named after you.

    My old undergraduate mathematics instructor Professor I. M. Average was insufferable about this.

  8. Looking forward to part 2. We use survival function quite a bit in our ecological population work.

    [Response: Do you use Kaplan-Meier?]

  9. Beautifully presented. Hanging out for the next bit!

  10. I have used Kaplan-Meier to estimate the reliability of telecommunications equipment. I have forgotten what method I used to estimate confidence intervals – but the methods (Peto’s, Greenwood’s etc) are in Meeker & Escobar’s “Statistical Methods for Reliability Data”

    I read somewhere that Kaplan-Meier’s original paper is one of the most cited in mathematics..There is an alternative to the K-M called the Nelson-Aalen Estimator that sums the hazard rates in the intervals to obtain a cumulative hazard rate function. The survival funaction (or reliability function) can be derived from the cumulative hazard rate function.

  11. Forgot to add that I look forward to the paper – I hope we hear more about it. Good work.

  12. Tamino, yes, we used K-M on some of our work on red-backed voles in the southern Yukon on eight 2.54 ha plots doing mark and recapture (intensive), and radio-collaring. The project started in late 80s, went intensively for 10 years, then I came in shortly after the 10-yr mark and continued work on small mammals. I was in till early 2000s, and was up to my diodes in data—I had a stats background, but this was a whole new crash-course level. I look at some of my books from that time and marvel that once I actually knew some of this.

    Far as I know they’ve continued to trap, tag, and release small mammals in the spring and late summer on a smaller scale, but I doubt the budget is there now for the radio collaring (esp. since the research station was one of the ones partially defunded by our PM).

    [Response: The reason I ask is, I’d like my program (which is still in development, but works) to go through some more tests than I’ve applied, in particular, actual working data.]

  13. We use the Kaplan-Meier estimate where I work as well, and by we I mean the biostatisticians who analyze my data. I’ve been thinking a lot about this stuff lately, and wondering about including an additional type of failure to our conventional mortality models. Part of my role at work is to develop disease models with animals for agricultural vaccine research. I’m interested in combining an assessment at the end of the challenge, i.e. after a period of time with no deaths in any group.

    In some cases, death is a very poor end point for evaluating vaccine efficacy, for instance if the disease doesn’t normally cause much death, and instead produces chronic illness that can present some kind of lesion. But I don’t want to examine all the animals every day and have the stress of handling overtake the response to the disease. So say I can only examine the animals at the end of the trial, and end up with binomial data, pass/fail. It sounds like this may be an example of interval censoring? Could that easilly be combined with mortality data to produce a Kaplan-Meier estimate? Does the fact that I only measure this once and at the end make the probability of the event- a lesion- happening later impossible to determine?

    I haven’t had a solid opportunity to discuss this with them yet, and would love to hear your opinion Tamino!

    [Response: It sounds kinda like interval data. Of course the event is “lesion” rather than “death” and KM (or my method) really treats “events” whether mortality or not. In that case you could treat lesion as the event of note, and if death occurs treat that either as positive sign of lesion, or as censoring. But, if you examine all the subjects at the end of the trial, and all have been in the experiment for the same amount of time, and none died in mid-experiment, then it really is plain old binomial data.

    I think I’d need more details to give you really good answers. I’d be interested to hear more.]

  14. Peter Congdon has a number of examples from the medical and biological literature pertaining to survival analysis in two of his textbooks which might be good trials for your algorithm. He has an index to these at, and I’m thinking of Chapter 10 of APPLIED BAYESIAN MODELLING and Chapter 13 of BAYESIAN STATISTICAL MODELLING, 2nd edition. The datasets are saved as WinBUGS files, which makes them slightly inaccessible, even if WinBUGS is open source.

    I have it, and if you would find these useful, I’d be happy to extract the data and repackage them for your use with a little notice. Congdon’s textbooks have descriptions and references to the literature from where they were drawn. It would be good for your purposes, I think, since in each case you would have results from the original papers, which Congdon almost invariably cites, and then the results from his Bayesian analysis.