Survival, part III: the program

Note: See the update at the end of the post.

Since I’ve devised a new way to estimate the survival function, I wrote some code (in R) to execute it. The program is in its early stages of development, so there’s a lot of work still to be done. But one of the ways to learn about it is to let people play with it. So, I’ve decided to share my code so others can have some of the fun.


I’ll start with the main caveat: it’s slow. There’s a lot of calculation involved! And, it’s written in R which is an interpreted language rather than a compiled language like C. This means that if you have a large sample it tends to crawl, and the more heavily censored the data are, the slower it crawls. Samples of around 500 take several minutes, samples of around 1000 are painfully slow. Again, the more heavily censored, the slower it gets.

Now, to how to run it. First import it with the single command:

source(“gfsurv.r”)

Then you’re ready to use it. There are two primary inputs. One is a set of times, let’s call them “t”. The other is a set of censoring indicators, let’s call them “cens”. The censoring indicators are equal to 0 for censored data, equal to 1 for event data, so all the values should be either 0 or 1. When you’ve got them, you can estimate the survival function (and assign it to a variable, let’s call it “gf”) thus:

gf = gfsurv(t,cens)

This will put the results into a data frame called “gf” with the following columns:

t = times
S = survival function
se = standard error
lower.ci = lower confidence interval
upper.ci = upper confidence interval
H = cumulative hazard function

It also produces a plot.

There’s one other option. In addition to passing it times and censoring indicators, you can also pass it a factor which separates the data into different groups. Suppose, for instance, you have a variable called “gender” which indicates male or female. You can compute separate survival functions for each gender with this call:

gf = gfsurv(t,cens,gender)

This will return a list with two parts, one for males and one for females. It will also produce a plot showing both, each in a different color. Do note that groups with fewer than 4 uncensored data will simply be ignored.

There’s also a plotting program, in case you’ve got one of those multiple-set results and you want to plot it. You can call it like this:

gfplot(gf,ylim=c(.5,1),lwd=2,conf.lim=F)

That instruct the program to plot the results, adjust the y-axis so it only goes from 0.5 to 1, use line width 2 for heavier lines, and NOT to plot confidence limits.

As usual, it’s entirely possible I’ve made some mistake which means it won’t run on your system even though it does on mine.

So, there it is. Except of course, the program itself. WordPress won’t allow me to upload an R program, or even a text file, so I changed the name to “gfsurv.xls” to make it think it’s an Excel file. Just change the name back, from “gfsurv.xls” to “gfsurv.r” and you should be good to go. Here ’tis:

gfsurv

Enjoy!

UPDATE

I’ve made a few changes. The most important one is that you can now choose your own confidence level. The default is 0.95 (for 95% confidence), but if you want, say, 99% confidence you can change the call to this:

gf = gfsurv(t,cens,fact,conf.level=0.99)

The other change is that you can now use the “gfplot” routine to plot an already-computed result, even if you didn’t pass a factor so it has only one survival function in it.

Here it is, and as before I renamed it to have a suffix “.xls” just so WordPress will allow me to upload it:

gfsurv

As before, enjoy!

11 responses to “Survival, part III: the program

  1. Tamino – great job.
    I have few questions.
    First – do you consider rewriting it in C/C++ and only giving it interface to R? I am immensly interested in fast version that could handle several thousand of objects and repeat it for numerous tens of factors in a reasonable time.
    I would be interested in doing that if that’s the only option.
    Second – how should I handle situation when large chunk (or even most) of data is censored – common situation when the study lasts few years and subjects survive past the end of the study. Should I add multiple censoring events at the same time – end of study?
    And more generally – if there are more than one events at the same day, should I simply write them as multiple entries with the same time?
    Third – how can we use the function in the multiple test situation? From your description I understand that you use one confidence level (0.95). When using your function in numerous tests one should take this into account. One could use some form of false discovery control, but this would be tricky and costly, the Bonferoni correction would be easiest, however, it requires flexible confidence intervals – e.g.performing 20 tests would require raw confidence level 0.001 and wider than standard confidence intervals.

    [Response: I won’t be doing it in C (or others), I just don’t have the skills.

    Yes, add multiple events at the same time (censored or not), just like with KM.

    I’ve been meaning to add an option to choose the confidence level (other than the default 95%), it’s probably the next change I’ll make.]

  2. Yeah…that code with all the nested loops will run slowly on R. Nor do I see obvious “apply” speedups at first glance.

    You will write an R package with the relevant function coded in C, correct???!!! :-)

    [Response: Actually, I’ll try to find someone else who has the skills to do that. But first, I want to get the R version refined, including more functionality.]

  3. I’ve changed the program to allow you to choose your own confidence level (other than the default 95%) if desired.

  4. Excellent work, GF/Tamino. I hope that others adapt the code for C and other languages and applications. I’m keen to see how this is received, especially in epidemiology.

  5. I have tried to speed up the code and got it to run about 30% faster using R by vectorizing bbcdf.

    Before getting there, I think, I have found a subtle bug (with limited impact), while checking my implemention against yours. In your implementation bbcdf(z,n,0) == bbcdf(z,n,1), whereas I think it ought to be bbcdf(z,n,0) == 1.0 . This should have little impact, as bbcdf(z,n,1) is usually close to 1.0 . The behaviour is caused by the line
    for (k in 0:(j-1))
    For j == 0 the loop gets executed for k == 0 and k == -1, instead of being skipped entirely (as I think was intended). The k == -1 execution adds 0.0, but the k == 0 run gives the same contribution as for j == 1. Checking for j == 0 and forcing the output to 1.0 will solve this bug/feature, if desired.

    Back to the speeding up. Using a function from the “apply”-family may reduce the C-accent of the code, but does not seem to speed it up, at least in my trials. However, vectorizing does. Here is a simplified – but working – version to showcase what I have changed

    ==================================
    bbcdf = function(z,n,j) {
    colSums(outer(j:n, z, function(x,y) dbinom(x,n,y)))
    }
    ==================================

    Using dbinom(k,n,z) == choose(n,k)*z^k*(1-z)^(n-k) tidies up the code, but does not affect execution speed. The gain in speed comes from turning the for loop into an outer product and retrieving column sums using an optimized procedure from base R.

    And here is the production version, using the same j<n/2 trick you used in your implementation and in addition checking for the j==0 case. If wordpress should kill the indentions and you are using RStudio, mark the section and press Ctrl-Shift-A (/Code/Reformat Code)

    ========================
    bbcdf = function(z,n,j) {
    if (j == 0) {
    bb <- rep(1.0, length(z))
    } else {
    if (j < n / 2) {
    bb <- 1.0 – colSums(outer(0:(j-1), z, function(x,y) dbinom(x,n,y)))
    }else{
    bb <- colSums(outer(j:n, z, function(x,y) dbinom(x,n,y)))
    }
    }
    bb
    }
    ========================

    In case you want to further tidy up the code (no speed change), here is another drop-in replacement:
    =========================
    bbpdf = function(z,n,j){
    n * dbinom(j-1,n-1,z)
    }
    =========================

  6. Dick Veldkamp

    This maybe a dumb question, but I wondered if you did consider generating Monte Carlo style artificial data based on known distributions? That way, the survival function would be exactly known, and your new algorithm should of course reproduce the corresponding confidence levels.

    [Response: It’s not a dumb question, it’s a smart one. I’ve already done some, will probably do more before long.]

  7. github ?

    So we can all contribute.

  8. I have put some more work into speeding up the code within R. The code is by now about 10 times faster. Code is here (rename from .xls to .R:
    https://bluegrue.files.wordpress.com/2015/08/gfsurv_v-r.xls
    and a benchmark suite is here:
    https://bluegrue.files.wordpress.com/2015/08/gfsurvbenchmark-r.xls

    Major changes:
    – vectorized bbcdf
    – vectorized calculation of the weight[] matrix (that’s the biggy in terms of time)
    – seperated calculation of cdf[], se[] and upper.ci[], lower.ci[]. Vectorized the former, refactored the latter.

    I’ve run speed tests vs the Aids2 data from package MASS. I have used the first samples of the “hs” subset.
    # samples/censored gfsurv1.R gfsurv_v.R
    600/59 1.9sec 0.42sec
    800/101 5.9sec 0.8sec
    1000/174 23.3sec 1.9sec
    2465/933 ???sec 83.6sec
    For the smaller sample sizes a relatively large part of the time is taken up by printing the graphs.

    There are slight differences in the output of the order of 10^-13, which are numerical in nature. The R function colSums() will return slightly different results than an explicit calculation in a for loop, if the numbers of terms are large and vary over many orders of magnitude. I have tested my code that it otherwise yields the same results as yours.

    I hope you find my changes readable and useful. Use as you see fit.

    [Response: Thanks. I’ll put it through its paces.]

    • A final bit of optimization. I have vectorized wbbcdf. This drops the computation time on my PC from about 83secs to about 50secs for the 2465/933 testcase. Here’s the drop-in replacement code:

      wbbcdf = function(z,n,j,w){
      lut <- c(cumsum(w[length(w):1]), rep(1.0,n-j))
      colSums(lut * outer((j+1-length(w)):n,z, function(jj,prob) dbinom(jj,n,prob)))
      }

      I observed that the original implementation repeatedly calculated the same dbinom() call when calling bbcdf. So I took the bbcdf() code into wbbcdf() and changed the order of summation. Now I simply multiply the cummulated sum of the weights (filled with ones at the end) with the required dbinom() once and sum up the results. This speeds up the calculation of lower.ci and upper.ci that much, that no further optimization is needed. A side note: using dbinom is superior to direct evaluation of choose(n,k)*z^k*(1-z)^(n-k), as it works for higher n. n=2000, k=1972, z=0.891 (an example coming up in the test runs) will overflow choose(n,k) and result in a NaN, whereas dbinom(k,n,z) works nicely.

      As a matter of taste you may want to change redefine.lim(), too. The version below uses the standard function unitroot(), which does not require an initial guess and works as fast as your implementation. I have left the "lim" parameter for compatibility reasons only.

      refine.lim <- function(lim = NA, qcut = NA, j = NA,
      w = NULL, ntotal = NA, threshold = 0.0001) {
      uniroot(function(q) wbbcdf(q,ntotal,j,w) – qcut, interval = c(0,1),
      tol = threshold)$root
      }

      You could also simply drop the uniroot line into the main code, instead of encapsulating it inside refine.lim().

      I think, this is as far as I can reasonably push the execution speed within R; I doubt a C implementation would be much faster. The time sink is the calculation of the weight matrix, the rest of the calculation is by now neglible in comparison.

      [Response: Excellent work. If the paper is accepted (I expect it’ll need some big revisions though), then I’ll provide R code along with it. I’d like to include your revisions, and of course give you proper credit. There’s plenty of time to deal with that; no hurry.]

  9. A long time ago I played with a form of survival. Imagine a fair casino, and an infinite supply of gamblers, who each arrived with one dollar, that they put on red or black (or any 50/50 proposition). If they lose, that is the end of them. If they win, they now had two dollars, and they will then bet one of those dollars on a 50/50 proposition. And they continue until they have lost their money.
    So I run this as a simulation with 100 gamblers starting out, and of course about half of them last just one turn. Some more drop out after 3 turns (you can’t drop out after 2 turns, because you must have at least one dollar left). I’m quite surprised to find that one lucky gambler last some 90 odd turns before their money runs out.
    Anyway, I run it again, and I’m waiting for it to finish when all the punters have lost their money. And it doesn’t. And after some time, I realise why. Its a fair casino. If all the punters always lose their money, the casino always comes out ahead – and it can’t if its a fair casino. So at least one gambler needs to not lose their money.
    So I do it again, with the roulette wheel having a zero to give the casino the edge. And it is still surprising just how long some gamblers last.