Survival Analysis in R (2023)

This tutorial provides an introduction to survival analysis, and toconducting a survival analysis in R. This tutorial was originallypresented at the Memorial Sloan Kettering Cancer Center R-Presentersseries on August 30, 2018. It was then modified for a more extensivetraining at Memorial Sloan Kettering Cancer Center in March, 2019.Updates, sometimes significant, are made when new functionality becomesavailable in R. This tutorial reflects my own opinions about the bestfunctionality available in R for survival analysis.

This presentation will cover some basics of survival analysis, andthe following series of tutorial papers can be helpful for additionalreading:

Clark, T., Bradburn, M., Love, S., & Altman, D. (2003). Survivalanalysis part I: Basic concepts and first analyses. 232-238. ISSN0007-0920.

M J Bradburn, T G Clark, S B Love, & D G Altman. (2003). SurvivalAnalysis Part II: Multivariate data analysis – an introduction toconcepts and methods. British Journal of Cancer, 89(3), 431-436.

Bradburn, M., Clark, T., Love, S., & Altman, D. (2003). Survivalanalysis Part III: Multivariate data analysis – choosing a model andassessing its adequacy and fit. 89(4), 605-11.

Clark, T., Bradburn, M., Love, S., & Altman, D. (2003). Survivalanalysis part IV: Further concepts and methods in survival analysis.781-786. ISSN 0007-0920.

The basics

Survival data are time-to-event data that consist of a distinct starttime and end time.

Examples from cancer:

  • Time from surgery to death
  • Time from start of treatment to progression
  • Time from response to recurrence

Time-to-event data are common in many other fields. Some otherexamples include:

  • Time from HIV infection to development of AIDS
  • Time to heart attack
  • Time to onset of substance abuse
  • Time to initiation of sexual activity
  • Time to machine malfunction

Because time-to-event data are common in many fields, it also goes bynames besides survival analysis including:

  • Reliability analysis
  • Duration analysis
  • Event history analysis
  • Time-to-event analysis

A key feature of survival data is censoring.

Survival Analysis in R (1)

RICH JT, NEELY JG, PANIELLO RC, VOELKER CCJ, NUSSENBAUM B, WANG EW. APRACTICAL GUIDE TO UNDERSTANDING KAPLAN-MEIER CURVES. Otolaryngologyhead and neck surgery: official journal of American Academy ofOtolaryngology Head and Neck Surgery. 2010;143(3):331-336. doi:10.1016/j.otohns.2010.05.007.

A subject may be censored due to:

  • Loss to follow-up
  • Withdrawal from study
  • No event by end of fixed study period

Specifically these are examples of right censoring.Left censoring and interval censoring are also possible, and methodsexist to analyze these types of data, but this tutorial will be focus onright censoring.

To illustrate the impact of censoring, suppose we have the followingdata:

Survival Analysis in R (2)

How would we compute the proportion who are event-free at 10years?

  • Subjects 6 and 7 were event-free at 10 years.
  • Subjects 2, 9, and 10 had the event before 10years.
  • Subjects 1, 3, 4, 5, and 8 were censored before 10years, so we don’t know whether they had the event or not at 10years. But we know something about them - that they were each followedfor a certain amount of time without the event of interest prior tobeing censored.

Survival analysis techniques provide a way to appropriatelyaccount for censored patients in the analysis.

Other reasons specialized analysis techniques are needed:

  • The distribution of follow-up times is skewed, and may differbetween censored patients and those with events
  • Follow-up times are always positive

Example of the distribution of follow-up times according to eventstatus:

Survival Analysis in R (3)

To analyze survival data, we need to know the observed time \(Y_i\) and the event indicator \(\delta_i\). For subject \(i\):

  • Observed time \(Y_i = \min(T_i,C_i)\) where \(T_i\) = eventtime and \(C_i\) = censoring time
  • Event indicator \(\delta_i\) = 1 ifevent observed (i.e.\(T_i \leq C_i\)),= 0 if censored (i.e.\(T_i >C_i\))

The probability that a subject will survive beyond any givenspecified time

\[S(t) = Pr(T>t) = 1 -F(t)\]

\(S(t)\): survival function \(F(t) = Pr(T \leq t)\): cumulativedistribution function

In theory the survival function is smooth; in practice we observeevents on a discrete time scale.

The survival probability at a certain time, \(S(t)\), is a conditional probability ofsurviving beyond that time, given that an individual has survived justprior to that time. The survival probability can be estimated as thenumber of patients who are alive without loss to follow-up at that time,divided by the number of patients who were alive just prior to thattime.

The Kaplan-Meier estimate of survival probability ata given time is the product of these conditional probabilities up untilthat given time.

At time 0, the survival probability is 1, i.e.\(S(t_0) = 1\).


In this section, we will use the following packages:

# install.packages(c("survival", "lubridate", "ggsurvfit", "gtsummary", "tidycmprsk"))# remotes::install_github("zabore/condsurv")# remotes::install_github("zabore/ezfun")library(survival)library(lubridate)library(ggsurvfit)library(gtsummary)library(tidycmprsk)library(condsurv)

The lung dataset

Throughout this section, we will use the lung datasetfrom the {survival} package as example data. The data contain subjectswith advanced lung cancer from the North Central Cancer Treatment Group.We will focus on the following variables throughout this tutorial:

  • time: Observed survival time in days
  • status: censoring status 1=censored, 2=dead
  • sex: 1=Male, 2=Female

Note that the status is coded in a non-standard way in thisdataset. Typically you will see 1=event, 0=censored. Let’s recodeit to avoid confusion:

lung <- lung %>% mutate( status = recode(status, `1` = 0, `2` = 1) )

Now we have:

  • time: Observed survival time in days
  • status: censoring status 0=censored, 1=dead
  • sex: 1=Male, 2=Female

Here are the first 6 observations:

head(lung[, c("time", "status", "sex")])
## time status sex## 1 306 1 1## 2 455 1 1## 3 1010 0 1## 4 210 1 1## 5 883 1 1## 6 1022 0 1

Note: the Surv() function in the {survival}package accepts by default TRUE/FALSE, where TRUE is event and FALSE iscensored; 1/0 where 1 is event and 0 is censored; or 2/1 where 2 isevent and 1 is censored. Please take care to ensure the eventindicator is properly formatted.

Calculating survival times

Data will often come with start and end dates rather thanpre-calculated survival times. The first step is to make sure these areformatted as dates in R.

Let’s create a small example dataset with variablessx_date for surgery date and last_fup_date forthe last follow-up date:

(Video) Survival Analysis in R

date_ex <- tibble( sx_date = c("2007-06-22", "2004-02-13", "2010-10-27"), last_fup_date = c("2017-04-15", "2018-07-04", "2016-10-31") )date_ex
## # A tibble: 3 × 2## sx_date last_fup_date## <chr> <chr> ## 1 2007-06-22 2017-04-15 ## 2 2004-02-13 2018-07-04 ## 3 2010-10-27 2016-10-31

We see these are both character variables, but we need them to beformatted as dates.

We will use the {lubridate} package to work with dates. In this case,we need to use the ymd() function to change the format,since the dates are currently in the character format where the yearcomes first, followed by the month, and followed by the day.

date_ex <- date_ex %>% mutate( sx_date = ymd(sx_date), last_fup_date = ymd(last_fup_date) )date_ex
## # A tibble: 3 × 2## sx_date last_fup_date## <date> <date> ## 1 2007-06-22 2017-04-15 ## 2 2004-02-13 2018-07-04 ## 3 2010-10-27 2016-10-31

Now we see that the two dates are formatted as date rather than ascharacter. Access the help page with ?ymd to see all dateformat options.

Now that the dates are formatted, we need to calculate the differencebetween start and end dates in some units, usually months or years.Using the {lubridate} package, the operator %--% designatesa time interval, which is then converted to the number of elapsedseconds using as.duration() and finally converted to yearsby dividing by dyears(1), which gives the number of secondsin a year.

date_ex <- date_ex %>% mutate( os_yrs = as.duration(sx_date %--% last_fup_date) / dyears(1) )date_ex
## # A tibble: 3 × 3## sx_date last_fup_date os_yrs## <date> <date> <dbl>## 1 2007-06-22 2017-04-15 9.82## 2 2004-02-13 2018-07-04 14.4 ## 3 2010-10-27 2016-10-31 6.01

Now we have our observed time for use in survival analysis.

Note: we need to load the {lubridate} package using a callto library in order to be able to access the specialoperators (similar to situation with pipes - i.e.we can’t uselubridate::ymd() and then expect to use the specialoperators).

Creating survival objects and curves

The Kaplan-Meier method is the most common way to estimate survivaltimes and probabilities. It is a non-parametric approach that results ina step function, where there is a step down each time an eventoccurs.

The Surv() function from the {survival} package createsa survival object for use as the response in a model formula. There willbe one entry for each subject that is the survival time, which isfollowed by a + if the subject was censored. Let’s look atthe first 10 observations:

Surv(lung$time, lung$status)[1:10]
## [1] 306 455 1010+ 210 883 1022+ 310 361 218 166

We see that subject 1 had an event at time 306 days, subject 2 had anevent at time 455 days, subject 3 was censored at time 1010 days,etc.

The survfit() function creates survival curves using theKaplan-Meier method based on a formula. Let’s generate the overallsurvival curve for the entire cohort, assign it to objects1, and look at the structure using str():

s1 <- survfit(Surv(time, status) ~ 1, data = lung)str(s1)
## List of 16## $ n : int 228## $ time : num [1:186] 5 11 12 13 15 26 30 31 53 54 ...## $ n.risk : num [1:186] 228 227 224 223 221 220 219 218 217 215 ...## $ n.event : num [1:186] 1 3 1 2 1 1 1 1 2 1 ...## $ n.censor : num [1:186] 0 0 0 0 0 0 0 0 0 0 ...## $ surv : num [1:186] 0.996 0.982 0.978 0.969 0.965 ...## $ std.err : num [1:186] 0.0044 0.00885 0.00992 0.01179 0.01263 ...## $ cumhaz : num [1:186] 0.00439 0.0176 0.02207 0.03103 0.03556 ...## $ std.chaz : num [1:186] 0.00439 0.0088 0.00987 0.01173 0.01257 ...## $ type : chr "right"## $ logse : logi TRUE## $ : num 0.95## $ conf.type: chr "log"## $ lower : num [1:186] 0.987 0.966 0.959 0.947 0.941 ...## $ upper : num [1:186] 1 1 0.997 0.992 0.989 ...## $ call : language survfit(formula = Surv(time, status) ~ 1, data = lung)## - attr(*, "class")= chr "survfit"

Some key components of this survfit object that will beused to create survival curves include:

  • time: the timepoints at which the curve has a step, least one event occurred
  • surv: the estimate of survival at the correspondingtime

Kaplan-Meier plots

We will use the {ggsurvfit} package to generate Kaplan-Meier plots.This package aims to ease plotting of time-to-event endpoints using thepower of the {ggplot2} package. See details.

Note: alternatively, survival plots can be created usingbase R or the {survminer} package.

The {ggsurvfit} package works best if you create thesurvfit object using the includedggsurvfit::survfit2() function, which uses the same syntaxto what we saw previously with survival::survfit(). Theggsurvfit::survfit2() tracks the environment from thefunction call, which allows the plot to have better default values forlabeling and p-value reporting.

survfit2(Surv(time, status) ~ 1, data = lung) %>% ggsurvfit() + labs( x = "Days", y = "Overall survival probability" )

Survival Analysis in R (4)

The default plot in ggsurvfit() shows the step functiononly. We can add the confidence interval usingadd_confidence_interval():

survfit2(Surv(time, status) ~ 1, data = lung) %>% ggsurvfit() + labs( x = "Days", y = "Overall survival probability" ) + add_confidence_interval()

Survival Analysis in R (5)

Typically we will also want to see the numbers at risk in a tablebelow the x-axis. We can add this usingadd_risktable():

survfit2(Surv(time, status) ~ 1, data = lung) %>% ggsurvfit() + labs( x = "Days", y = "Overall survival probability" ) + add_confidence_interval() + add_risktable()

Survival Analysis in R (6)

Plots can be customized using many standard {ggplot2} options.

Estimating \(x\)-year survival

One quantity often of interest in a survival analysis is theprobability of surviving beyond a certain number of years, \(x\).

For example, to estimate the probability of surviving to \(1\) year, use summary with thetimes argument (Note: the timevariable in the lung data is actually in days, so we needto use times = 365.25)

summary(survfit(Surv(time, status) ~ 1, data = lung), times = 365.25)
## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)## ## time n.risk n.event survival std.err lower 95% CI upper 95% CI## 365 65 121 0.409 0.0358 0.345 0.486

We find that the \(1\)-yearprobability of survival in this study is 41%.

The associated lower and upper bounds of the 95% confidence intervalare also displayed.

The \(1\)-year survival probabilityis the point on the y-axis that corresponds to \(1\) year on the x-axis for the survivalcurve.

Survival Analysis in R (7)

What happens if you use a “naive” estimate?

121 of the 228 patients in the lung data died by \(1\) year so the “naive” estimate iscalculated as:

\[\Big(1 - \frac{121}{228}\Big) \times 100= 47\%\] You get an incorrect estimate of the\(1\)-year probability of survival whenyou ignore the fact that 42 patients were censored before \(1\) year.

Recall the correct estimate of the \(1\)-year probability of survival,accounting for censoring using the Kaplan-Meier method, was 41%.

Ignoring censoring leads to an overestimate of theoverall survival probability. Imagine two studies, each with 228subjects. There are 165 deaths in each study. Censoring is ignored inone (pink line), censoring is accounted for in the other (blue line).The censored subjects only contribute information for a portion of thefollow-up time, and then fall out of the risk set, thus pulling down thecumulative probability of survival. Ignoring censoring erroneouslytreats patients who are censored as part of the risk set for the entirefollow-up period.

Survival Analysis in R (8)

We can produce nice tables of \(x\)-time survival probability estimatesusing the tbl_survfit() function from the {gtsummary}package:

survfit(Surv(time, status) ~ 1, data = lung) %>% tbl_survfit( times = 365.25, label_header = "**1-year survival (95% CI)**" )
Characteristic 1-year survival (95% CI)
Overall41% (34%, 49%)

Estimating median survival time

Another quantity often of interest in a survival analysis is theaverage survival time, which we quantify using the median. Survivaltimes are not expected to be normally distributed so the mean is not anappropriate summary.

We can obtain the median survival directly from thesurvfit object:

survfit(Surv(time, status) ~ 1, data = lung)
## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)## ## n events median 0.95LCL 0.95UCL## [1,] 228 165 310 285 363

We see the median survival time is 310 days The lower and upperbounds of the 95% confidence interval are also displayed.

Median survival is the time corresponding to a survival probabilityof \(0.5\):

(Video) Survival Analysis in R

Survival Analysis in R (9)

What happens if you use a “naive” estimate?

Summarize the median survival time among the 165 patients whodied:

lung %>% filter(status == 1) %>% summarize(median_surv = median(time))
## median_surv## 1 226

You get an incorrect estimate of median survivaltime of 226 days when you ignore the fact that censored patients alsocontribute follow-up time.

Recall the correct estimate of median survival timeis 310 days.

Ignoring censoring will lead to an underestimate ofmedian survival time because the follow-up time that censored patientscontribute is excluded (pink line). The true survival curve accountingfor censoring in the lung data is shown in blue forcomparison.

Survival Analysis in R (10)

We can produce nice tables of median survival time estimates usingthe tbl_survfit() function from the {gtsummary}package:

survfit(Surv(time, status) ~ 1, data = lung) %>% tbl_survfit( probs = 0.5, label_header = "**Median survival (95% CI)**" )
Characteristic Median survival (95% CI)
Overall310 (285, 363)

Comparing survival times between groups

We can conduct between-group significance tests using a log-ranktest. The log-rank test equally weights observations over the entirefollow-up time and is the most common way to compare survival timesbetween groups. There are versions that more heavily weight the early orlate follow-up that could be more appropriate depending on the researchquestion (see ?survdiff for different test options).

We get the log-rank p-value using the survdiff function.For example, we can test whether there was a difference in survival timeaccording to sex in the lung data:

survdiff(Surv(time, status) ~ sex, data = lung)
## Call:## survdiff(formula = Surv(time, status) ~ sex, data = lung)## ## N Observed Expected (O-E)^2/E (O-E)^2/V## sex=1 138 112 91.6 4.55 10.3## sex=2 90 53 73.4 5.68 10.3## ## Chisq= 10.3 on 1 degrees of freedom, p= 0.001

We see that there was a significant difference in overall survivalaccording to sex in the lung data, with a p-value of p =0.001.

The Cox regression model

We may want to quantify an effect size for a single variable, orinclude more than one variable into a regression model to account forthe effects of multiple variables.

The Cox regression model is a semi-parametric model that can be usedto fit univariable and multivariable regression models that havesurvival outcomes.

\[h(t|X_i) = h_0(t) \exp(\beta_1 X_{i1} +\cdots + \beta_p X_{ip})\]

\(h(t)\): hazard, or theinstantaneous rate at which events occur \(h_0(t)\): underlying baseline hazard

Some key assumptions of the model:

  • non-informative censoring
  • proportional hazards

Note: parametric regression models for survival outcomes arealso available, but they won’t be addressed in this tutorial

We can fit regression models for survival data using thecoxph() function from the {survival} package, which takes aSurv() object on the left hand side and has standard syntaxfor regression formulas in R on the right hand side.

coxph(Surv(time, status) ~ sex, data = lung)
## Call:## coxph(formula = Surv(time, status) ~ sex, data = lung)## ## coef exp(coef) se(coef) z p## sex -0.5310 0.5880 0.1672 -3.176 0.00149## ## Likelihood ratio test=10.63 on 1 df, p=0.001111## n= 228, number of events= 165

We can obtain tables of results using thetbl_regression() function from the {gtsummary} package,with the option to exponentiate set to TRUE to return the hazard ratiorather than the log hazard ratio:

coxph(Surv(time, status) ~ sex, data = lung) %>% tbl_regression(exp = TRUE) 
Characteristic HR1 95% CI1 p-value
sex0.590.42, 0.820.001
1 HR = Hazard Ratio, CI = Confidence Interval

The quantity of interest from a Cox regression model is ahazard ratio (HR). The HR represents the ratio ofhazards between two groups at any particular point in time. The HR isinterpreted as the instantaneous rate of occurrence of the event ofinterest in those who are still at risk for the event. It isnot a risk, though it is commonly mis-interpreted assuch. If you have a regression parameter \(\beta\), then HR = \(\exp(\beta)\).

A HR < 1 indicates reduced hazard of death whereas a HR > 1indicates an increased hazard of death.

So the HR = 0.59 implies that 0.59 times as many females are dying asmales, at any given time. Stated differently, females have asignificantly lower hazard of death than males in these data.

In Part 1 we covered using log-rank tests and Cox regression toexamine associations between covariates of interest and survivaloutcomes. But these analyses rely on the covariate being measured atbaseline, that is, before follow-up time for the eventbegins. What happens if you are interested in a covariate that ismeasured after follow-up time begins?

Example: Overall survival is measured from treatmentstart, and interest is in the association between complete response totreatment and survival.

Anderson et al (JCO, 1983) described why traditional methods such aslog-rank tests or Cox regression are biased in favor of responders inthis scenario, and proposed the landmark approach. The nullhypothesis in the landmark approach is that survival fromlandmark does not depend on response status at landmark.

Anderson, J., Cain, K., & Gelber, R. (1983). Analysis of survivalby tumor response. Journal of Clinical Oncology : Official Journal ofthe American Society of Clinical Oncology, 1(11), 710-9.

Some other possible covariates of interest in cancer research thatmay not be measured at baseline include:

  • tranplant failure
  • graft versus host disease
  • second resection
  • adjuvant therapy
  • compliance
  • adverse events

The BMT dataset

Throughout this section we will use the BMT dataset from{SemiCompRisks} package as an example dataset. The data consist of 137bone marrow transplant patients. Variables of interest include:

  • T1 time (in days) to death or last follow-up
  • delta1 death indicator; 1=Dead, 0=Alive
  • TA time (in days) to acute graft-versus-hostdisease
  • deltaA acute graft-versus-host disease indicator;1-Developed acute graft-versus-host disease, 0-Never developed acutegraft-versus-host disease

First, load the data for use in examples throughout:

# install.packages("SemiCompRisks)data(BMT, package = "SemiCompRisks")

Here are the first 6 observations:

head(BMT[, c("T1", "delta1", "TA", "deltaA")])
## T1 delta1 TA deltaA## 1 2081 0 67 1## 2 1602 0 1602 0## 3 1496 0 1496 0## 4 1462 0 70 1## 5 1433 0 1433 0## 6 1377 0 1377 0

Landmark approach

  1. Select a fixed time after baseline as your landmark time.Note: this should be done based on clinical information, priorto data inspection
  2. Subset population for those followed at least until landmark time.Note: always report the number excluded due to the event ofinterest or censoring prior to the landmark time.
  3. Calculate follow-up from landmark time and apply traditionallog-rank tests or Cox regression.

In the BMT data, interest is in the association betweenacute graft versus host disease (aGVHD) and survival. But aGVHD isassessed after the transplant, which is our baseline, or start offollow-up, time.

Step 1 Select landmark time

Typically aGVHD occurs within the first 90 days following transplant,so we use a 90-day landmark.

Step 2 Subset population for those followed at leastuntil landmark time

lm_dat <- BMT %>% filter(T1 >= 90) 

We exclude 15 patients who were not followed until the landmark timeof 90 days.

(Video) Survival Analysis in R (in 8-minutes)

Note: All 15 excluded patients died before the 90 daylandmark.

Step 3 Calculate follow-up time from landmark andapply traditional methods.

lm_dat <- lm_dat %>% mutate( lm_T1 = T1 - 90 )
survfit2(Surv(lm_T1, delta1) ~ deltaA, data = lm_dat) %>% ggsurvfit() + labs( x = "Days from 90-day landmark", y = "Overall survival probability" ) + add_risktable()

Survival Analysis in R (11)

In Cox regression you can use the subset option incoxph to exclude those patients who were not followedthrough the landmark time, and we can view the results using thetbl_regression() function from the {gtsummary} package:

coxph( Surv(T1, delta1) ~ deltaA, subset = T1 >= 90, data = BMT ) %>% tbl_regression(exp = TRUE)
Characteristic HR1 95% CI1 p-value
deltaA1.080.57, 2.070.8
1 HR = Hazard Ratio, CI = Confidence Interval

Time-dependent covariate approach

An alternative to a landmark analysis is incorporation of atime-dependent covariate. This may be more appropriate than landmarkanalysis when:

  • the value of a covariate is changing over time
  • there is not an obvious landmark time
  • use of a landmark would lead to many exclusions

Analysis of time-dependent covariates requires setup of a specialdataset, in a format known as counting process format. See the detailedpaper on this by the author of the {survival} package UsingTime Dependent Covariates and Time Dependent Coefficients in the CoxModel.

There was no ID variable in the BMT data, which isneeded to create the special dataset, so create an ID variable calledmy_id:

BMT <- rowid_to_column(BMT, "my_id")

Use the tmerge function with the event andtdc function options to create the special dataset.

  • tmerge() creates a long dataset with multiple timeintervals for the different covariate values for each patient
  • event() creates the new event indicator to go with thenewly created time intervals
  • tdc() creates the time-dependent covariate indicator togo with the newly created time intervals
td_dat <- tmerge( data1 = BMT %>% select(my_id, T1, delta1), data2 = BMT %>% select(my_id, T1, delta1, TA, deltaA), id = my_id, death = event(T1, delta1), agvhd = tdc(TA) )

To see what this does, let’s look at the data for the first 5individual patients. The variables of interest in the original datalooked like:

## my_id T1 delta1 TA deltaA## 1 1 2081 0 67 1## 2 2 1602 0 1602 0## 3 3 1496 0 1496 0## 4 4 1462 0 70 1## 5 5 1433 0 1433 0

The new dataset for these same patients looks like:

## my_id T1 delta1 tstart tstop death agvhd## 1 1 2081 0 0 67 0 0## 2 1 2081 0 67 2081 0 1## 3 2 1602 0 0 1602 0 0## 4 3 1496 0 0 1496 0 0## 5 4 1462 0 0 70 0 0## 6 4 1462 0 70 1462 0 1## 7 5 1433 0 0 1433 0 0

We see that patients 1 and 4 now have multiple rows of data becausethey both developed aGVHD at some point in time after baseline.

Now we can analyze this time-dependent covariate as usual using Coxregression with coxph and an alteration to our use ofSurv to include arguments to both time andtime2:

coxph( Surv(time = tstart, time2 = tstop, event = death) ~ agvhd, data = td_dat ) %>% tbl_regression(exp = TRUE)
Characteristic HR1 95% CI1 p-value
agvhd1.400.81, 2.430.2
1 HR = Hazard Ratio, CI = Confidence Interval

We find that acute graft versus host disease is not significantlyassociated with death using either landmark analysis or a time-dependentcovariate.

Often one will want to use landmark analysis for visualization of asingle covariate, and Cox regression with a time-dependent covariate forunivariable and multivariable modeling.

Competing risks analyses may be used when subjects have multiplepossible events in a time-to-event setting.


  • recurrence
  • death from disease
  • death from other causes
  • treatment response

All or some of these (among others) may be possible events in anygiven study.

The fundamental problem that may lead to the need for specializedstatistical methods is unobserved dependence among the various eventtimes. For example, one can imagine that patients who recur are morelikely to die, and therefore times to recurrence and times to deathwould not be independent events.

There are two approaches to analysis in the presence of multiplepotential outcomes:

  1. Cause-specific hazard of a given event: this represents the rate perunit of time of the event among those not having failed from otherevents
  2. Subdistribution hazard of a given event: this represents the rateper unit of time of the event as well as the influence of competingevents

Each of these approaches may only illuminate one important aspect ofthe data while possibly obscuring others, and the chosen approach shoulddepend on the question of interest.

When the events are independent (almost never true), cause-specifichazards is unbiased. When the events are dependent, a variety of resultscan be obtained depending on the setting.

Cumulative incidence using 1 minus the Kaplan-Meier estimate isalways >= cumulative incidence using competing risks methods, so canonly lead to an overestimate of the cumulative incidence, though theamount of overestimation depends on event rates and dependence amongevents.

To establish that a covariate is indeed acting on the event ofinterest, cause-specific hazards may be preferred for treatment orprognostic marker effect testing. To establish overall benefit,subdistribution hazards may be preferred for building prognosticnomograms or considering health economic effects to get a better senseof the influence of treatment and other covariates on an absolutescale.

Some references:

Dignam JJ, Zhang Q, Kocherginsky M. The use and interpretation ofcompeting risks regression models. Clin Cancer Res.2012;18(8):2301-8.

Kim HT. Cumulative incidence in competing risks data and competingrisks regression analysis. Clin Cancer Res. 2007 Jan 15;13(2 Pt1):559-65.

Satagopan JM, Ben-Porat L, Berwick M, Robson M, Kutler D, AuerbachAD. A note on competing risks in survival data analysis. Br J Cancer.2004;91(7):1229-35.

Austin, P., & Fine, J. (2017). Practical recommendations forreporting Fine‐Gray model analyses for competing risk data. Statisticsin Medicine, 36(27), 4391-4400.

We can obtain the cause-specific hazard of a given event using themethods introduced in the previous section, where the event of interestcounts as an event and any competing events are censored at the date ofthe competing even.

The rest of this section will focus on methods for subdistribtuionhazards.

The primary package we will use for competing risks analysis is the{tidycmprsk} package.

The Melanoma dataset

We will use the Melanoma data from the {MASS} package toillustrate these concepts. It contains variables:

  • time survival time in days, possibly censored.
  • status 1 died from melanoma, 2 alive, 3 dead from othercauses.
  • sex 1 = male, 0 = female.
  • age age in years.
  • year of operation.
  • thickness tumor thickness in mm.
  • ulcer 1 = presence, 0 = absence.
# install.packages("MASS")data(Melanoma, package = "MASS")

The status variable in these data are coded in anon-standard way. Let’s recode to avoid confusion:

Melanoma <- Melanoma %>% mutate( status = as.factor(recode(status, `2` = 0, `1` = 1, `3` = 2)) )

Now we have:

  • status 0=alive, 1=died from melanoma, 2=dead from othercauses.

Here are the first 6 patients:

## time status sex age year thickness ulcer## 1 10 2 1 76 1972 6.76 1## 2 30 2 1 56 1968 0.65 0## 3 35 0 1 41 1977 1.34 0## 4 99 2 0 71 1968 2.90 0## 5 185 1 1 52 1965 12.08 1## 6 204 1 1 28 1971 4.84 1

Cumulative incidence for competing risks

A non-parametric estimate of the cumulative incidence of the event ofinterest. At any point in time, the sum of the cumulative incidence ofeach event is equal to the total cumulative incidence of any event (nottrue in the cause-specific setting). Gray’s test is a modifiedChi-squared test used to compare 2 or more groups.

(Video) Survival Analysis Part 5 | Kaplan Meier Model in R with RStudio

Estimate the cumulative incidence in the context of competing risksusing the cuminc function from the {tidycmprsk} package. Bydefault this requires the status to be a factor variable with censoredpatients coded as 0.

cuminc(Surv(time, status) ~ 1, data = Melanoma)
## ## time n.risk estimate std.error 95% CI ## 1,000 171 0.127 0.023 0.086, 0.177 ## 2,000 103 0.230 0.030 0.174, 0.291 ## 3,000 54 0.310 0.037 0.239, 0.383 ## 4,000 13 0.339 0.041 0.260, 0.419 ## 5,000 1 0.339 0.041 0.260, 0.419 ## ## time n.risk estimate std.error 95% CI ## 1,000 171 0.034 0.013 0.015, 0.066 ## 2,000 103 0.050 0.016 0.026, 0.087 ## 3,000 54 0.058 0.017 0.030, 0.099 ## 4,000 13 0.106 0.032 0.053, 0.179 ## 5,000 1 0.106 0.032 0.053, 0.179

We can use the ggcuminc() function from the {ggsurvfit}package to plot the cumulative incidence. By default it plots the firstevent type only. So the following plot shows the cumulative incidence ofdeath from melanoma:

cuminc(Surv(time, status) ~ 1, data = Melanoma) %>% ggcuminc() + labs( x = "Days" ) + add_confidence_interval() + add_risktable()

Survival Analysis in R (12)

If we want to include both event types, specify the outcomes in theggcuminc(outcome=) argument:

cuminc(Surv(time, status) ~ 1, data = Melanoma) %>% ggcuminc(outcome = c("1", "2")) + ylim(c(0, 1)) + labs( x = "Days" )

Survival Analysis in R (13)

Now let’s say we wanted to examine death from melanoma or othercauses in the Melanoma data, according toulcer, the presence or absence of ulceration. We canestimate the cumulative incidence at various times by group and displaythat in a table using the tbl_cuminc() function from the{tidycmprsk} package, and add Gray’s test to test for a differencebetween groups over the entire follow-up period using theadd_p() function.

cuminc(Surv(time, status) ~ ulcer, data = Melanoma) %>% tbl_cuminc( times = 1826.25, label_header = "**{time/365.25}-year cuminc**") %>% add_p()
Characteristic 5-year cuminc p-value1
09.1% (4.6%, 15%)
139% (29%, 49%)
1 Gray's Test

Then we can see the plot of death due to melanoma, according toulceration status, as before using ggcuminc() from the{ggsurvfit} package:

cuminc(Surv(time, status) ~ ulcer, data = Melanoma) %>% ggcuminc() + labs( x = "Days" ) + add_confidence_interval() + add_risktable()

Survival Analysis in R (14)

Competing risks regression

There are two approaches to competing risks regression:

  1. Cause-specific hazards
    • instantaneous rate of occurrence of the given type of event insubjects who are currently event‐free
    • estimated using Cox regression (coxph function)
  2. Subdistribution hazards
    • instantaneous rate of occurrence of the given type of event insubjects who have not yet experienced an event of that type
    • estimated using Fine-Gray regression (crrfunction)

Let’s say we’re interested in looking at the effect of age and sex ondeath from melanoma, with death from other causes as a competingevent.

The crr() function from the {tidycmprsk} package willestimate the subdistribution hazards.

crr(Surv(time, status) ~ sex + age, data = Melanoma)
## ## Variable Coef SE HR 95% CI p-value ## sex 0.588 0.272 1.80 1.06, 3.07 0.030 ## age 0.013 0.009 1.01 0.99, 1.03 0.18

And we can generate tables of formatted results using thetbl_regression() function from the {gtsummary} package,with the option exp = TRUE to obtain the hazard ratioestimates:

crr(Surv(time, status) ~ sex + age, data = Melanoma) %>% tbl_regression(exp = TRUE)
Characteristic HR1 95% CI1 p-value
sex1.801.06, 3.070.030
age1.010.99, 1.030.2
1 HR = Hazard Ratio, CI = Confidence Interval

We see that male sex (recall that 1=male, 0=female in these data) issignificantly associated with increased hazard of death due to melanoma,whereas age was not significantly associated with death due tomelanoma.

Alternatively, if we wanted to use the cause-specific hazardsregression approach, we first need to censor all subjects who didn’thave the event of interest, in this case death from melanoma, and thenuse coxph as before. So patients who died from other causesare now censored for the cause-specific hazard approach to competingrisks. Again we generate a table of formatted results using thetbl_regression() function from the {gtsummary} package:

coxph( Surv(time, ifelse(status == 1, 1, 0)) ~ sex + age, data = Melanoma ) %>% tbl_regression(exp = TRUE)
Characteristic HR1 95% CI1 p-value
sex1.821.08, 3.070.025
age1.021.00, 1.030.056
1 HR = Hazard Ratio, CI = Confidence Interval

And in this case we obtain similar results using the two approachesto competing risks regression.

So far we’ve covered:

  • The basics of survival analysis including the Kaplan-Meier survivalfunction and Cox regression
  • Landmark analysis and time-dependent covariates
  • Cumulative incidence and regression for competing risksanalyses

In this section I’ll include a variety of bits and pieces of thingsthat may come up and be handy to know:

  1. Assessing the proportional hazards assumption
  2. Making a smooth survival plot based on \(x\)-year survival according to a continuouscovariate
  3. Conditional survival

Assessing proportional hazards

One assumption of the Cox proportional hazards regression model isthat the hazards are proportional at each point in time throughoutfollow-up. The cox.zph() function from the {survival}package allows us to check this assumption. It results in two mainthings:

  1. A hypothesis test of whether the effect of each covariate differsaccording to time, and a global test of all covariates at once.
    • This is done by testing for an interaction effect between thecovariate and log(time)
    • A significant p-value indicates that the proportional hazardsassumption is violated
  2. Plots of the Schoenfeld residuals
    • Deviation from a zero-slope line is evidence that the proportionalhazards assumption is violated
mv_fit <- coxph(Surv(time, status) ~ sex + age, data = lung)cz <- cox.zph(mv_fit)print(cz)
## chisq df p## sex 2.608 1 0.11## age 0.209 1 0.65## GLOBAL 2.771 2 0.25

Survival Analysis in R (15)Survival Analysis in R (16)

Here we see that with p-values >0.05, we do not reject the nullhypothesis, and conclude that the proportional hazards assumption issatisfied for each individual covariate, and also for the modeloverall.

Smooth survival plot

Sometimes you will want to visualize a survival estimate according toa continuous variable. The sm.survival function from thesm package allows you to do this for a quantile of thedistribution of survival data. The default quantile isp = 0.5 for median survival.

# install.packages("sm")library(sm)sm.options( list( xlab = "Age (years)", ylab = "Median time to death (years)") )sm.survival( x = lung$age, y = lung$time, status = lung$status, h = sd(lung$age) / nrow(lung)^(-1/4) )

Survival Analysis in R (17)

  • The x’s represent events
  • The o’s represent censoring
  • The line is a smoothed estimate of median survival according to age
    • In this case, too smooth!

The option h is the smoothing parameter. This should berelated to the standard deviation of the continuous covariate, \(x\). Suggested to start with \(\frac{sd(x)}{n^{-1/4}}\) then reduce by\(1/2\), \(1/4\), etc to get a good amount ofsmoothing. The previous plot was too smooth so let’s reduce it by \(1/6\):

sm.survival( x = lung$age, y = lung$time, status = lung$status, h = (1/6) * sd(lung$age) / nrow(lung)^(-1/4) )

Survival Analysis in R (18)

Now we can see that median time to death decreases slightly as ageincreases.

Conditional survival

Sometimes it is of interest to generate survival estimates among agroup of patients who have already survived for some length of time.

\[S(y|x) = \frac{S(x +y)}{S(x)}\]

  • \(y\): number of additionalsurvival years of interest
  • \(x\): number of years a patienthas already survived

Zabor, E., Gonen, M., Chapman, P., & Panageas, K. (2013). Dynamicprognostication using conditional survival estimates. Cancer, 119(20),3589-3592.

The estimates are easy to generate with basic math on your own, andare also implemented in the {condsurv} package available from

We can use the conditional_surv_est() function from the{condsurv} package to get estimates and 95% confidence intervals. Let’scondition on survival to 6-months

fit1 <- survfit(Surv(time, status) ~ 1, data = lung)prob_times <- seq(365.25, 182.625 * 4, 182.625)purrr::map_df( prob_times, ~conditional_surv_est( basekm = fit1, t1 = 182.625, t2 = .x) ) %>% mutate(months = round(prob_times / 30.4)) %>% select(months, everything()) %>% kable()

Recall that our initial \(1\)-yearsurvival estimate was 0.41. We see that for patients who have alreadysurvived 12 months this increases to 0.58.

We can also visualize conditional survival data based on differentlengths of time survived using the condKMggplot() functionfrom the {condsurv} package:

gg_conditional_surv( basekm = fit1, at = prob_times, main = "Conditional survival in lung data", xlab = "Days" ) + labs(color = "Conditional time")

Survival Analysis in R (19)

The resulting plot has one survival curve for each time on which wecondition. In this case the first line is the overall survival curvesince it is conditioning on time 0.

(Video) Survival Analysis using R (part 1)


How do you do a survival analysis in R? ›

The R package named survival is used to carry out survival analysis. This package contains the function Surv() which takes the input data as a R formula and creates a survival object among the chosen variables for analysis. Then we use the function survfit() to create a plot for the analysis.

How do you calculate survival probability in R? ›

To begin our analysis, we use the formula Surv(futime, status) ~ 1 and the survfit() function to produce the Kaplan-Meier estimates of the probability of survival over time. The times parameter of the summary() function gives some control over which times to print.

How do you estimate a survival analysis? ›

For each time interval, survival probability is calculated as the number of subjects surviving divided by the number of patients at risk. Subjects who have died, dropped out, or move out are not counted as “at risk” i.e., subjects who are lost are considered “censored” and are not counted in the denominator.

Is survival analysis better than logistic regression? ›

There is much more statistical information as well, as survival analyses tend to have greater statistical power to detect a significant treatment or exposure effect than methods for binary outcomes such as logistic regression.

How do you read a survival plot? ›

The Kaplan-Meier plot can be interpreted as follow: The horizontal axis (x-axis) represents time in days, and the vertical axis (y-axis) shows the probability of surviving or the proportion of people surviving. The lines represent survival curves of the two groups. A vertical drop in the curves indicates an event.

How do you compare survival curves in R? ›

Log Rank Test in R, the most frequent technique to compare survival curves between two groups is to use a log-rank test. Test hypotheses: Ho: In terms of survivability, there is no difference between the two groups. Hi: There is a survival differential between the two groups.

How do you calculate hazard rate from survival probability? ›

The hazard function is λ(t) = f(t)/S(t). It is the probability that the person or machine or business dies in the next instant, given that it survived to time t. λ(t) = f(t)/S(t) = λexp(−λt)/exp(−λt) = λ so the hazard function is a constant.

How do you read a Kaplan-Meier curve? ›

Kaplan-Meier-Curve [Simply Explained] - YouTube

How do you read a log rank test? ›

Log-Rank Test [Simply Explained] - YouTube

How do you calculate average survival time? ›

In computing the mean survival time estimate, what's done is to take the value of the survival time for each step in the function, multiply it by the duration of time for which the function stayed at that level, and then sum these products over all of the steps in the function.

How do you calculate standard error in survival analysis? ›

In the case of censoring it should be divided by the number of obsevation minus the number of censored observations in the first year. For a proportion the standard error is sqrt(p*(1-p)/n).

Is censored data included in survival analysis? ›

A subject is said to be at risk if the original event has occurred, but the final event has not. A key characteristic that distinguishes survival analysis from other areas in statistics is that survival data are usually censored or incomplete in some way.

What is a survival object in R? ›

The simplest survival object produced by Surv in R, for example, can be thought of as a matrix, with one row for each case, the first column representing the last follow-up time and the second indicating whether there was an event at that time (typically 0 for censored observations, 1 for events).

How does Coxph work in R? ›

The R summary for the Cox model gives the hazard ratio (HR) for the second group relative to the first group, that is, female versus male. The beta coefficient for sex = -0.53 indicates that females have lower risk of death (lower survival rates) than males, in these data. Hazard ratios.

What is the difference between Kaplan-Meier and Cox regression? ›

KM Survival Analysis cannot use multiple predictors, whereas Cox Regression can. KM Survival Analysis can run only on a single binary predictor, whereas Cox Regression can use both continuous and binary predictors. KM is a non-parametric procedure, whereas Cox Regression is a semi-parametric procedure.


1. Survival Analysis on Cancer data | RStudio Tutorial
(LiquidBrain Bioinformatics)
2. Survival Analysis using R - Tutorial for beginners
(Farhan Haq)
3. Kaplan Meier Curves in R Studio
(David Shimunov)
4. Survival Analysis Part 11 | Cox Proportional Hazards Model in R with RStudio
(MarinStatsLectures-R Programming & Statistics)
5. R Tutorial : Survival Analysis / Time-to-Event Analysis in R
6. Survival Analysis Part 12 | Checking Cox PH Model Assumptions in R with RStudio
(MarinStatsLectures-R Programming & Statistics)
Top Articles
Latest Posts
Article information

Author: Arielle Torp

Last Updated: 06/06/2023

Views: 5800

Rating: 4 / 5 (41 voted)

Reviews: 88% of readers found this page helpful

Author information

Name: Arielle Torp

Birthday: 1997-09-20

Address: 87313 Erdman Vista, North Dustinborough, WA 37563

Phone: +97216742823598

Job: Central Technology Officer

Hobby: Taekwondo, Macrame, Foreign language learning, Kite flying, Cooking, Skiing, Computer programming

Introduction: My name is Arielle Torp, I am a comfortable, kind, zealous, lovely, jolly, colorful, adventurous person who loves writing and wants to share my knowledge and understanding with you.