-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathexample.Rmd
171 lines (127 loc) · 8.69 KB
/
example.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
---
title: "Example penalized LTRC survival model with simulated data"
author: "Sarah F. McGough, Devin Incerti, Svetlana Lyalina, Ryan Copping, Balasubramanian Narasimhan, Robert Tibshirani"
date: "`r Sys.Date()`"
output:
html_document:
toc: yes
toc_depth: 2
toc_float: true
number_sections: TRUE
---
```{r setup, include=FALSE}
# So that it can be viewed on GitHub pages, render with:
# rmarkdown::render(input = "example.Rmd", output_dir = "docs")
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
This guide shows how to use penalized survival models for prediction with left-truncated and right-censored data (LTRC) as described in [McGough *et al.* 2021](https://onlinelibrary.wiley.com/doi/10.1002/sim.9136). The analyses utilize a simulated dataset based on the Foundation Medicine and Flatiron Health Clinico-Genomic Database (CGDB), as described in the paper.
We will use `glmnet` to fit the models, `rsample` to split the data into training and test sets, and various features of the `survival` package. We will also need a script containing custom functions for calibrating survival predictions.
```{r, message = FALSE, warning = FALSE}
# Packages used in this script
library("glmnet")
library("rsample")
library("survival")
# Custom functions for calibration
library("data.table") # Dependency for calibration functions
library("ggplot2") # Required to visualize calibrations
source("R/calibrate.R") # Our custom calibration functions
theme_set(theme_bw()) # ggplot theme
# There is some randomness in the analysis include splitting
# the data into train/test sets and cross validation.
set.seed(7)
```
# LTRC data
While we are not permitted to share data from the CGDB, we can simulate data that is consistent with what we observe. Our simulated dataset is contained in the file `simdata.rds`. Details on how we simulated the data are described in the paper. If you would like to see the code used to simulate the data you can view/run the file `simdata.R`.
```{r simdata}
simdata <- as.data.frame(readRDS("simdata.rds"))
head(simdata)
```
There are two characteristics of the data that are artificial and would not be observed in reality. First, we simulate (latent) `death_time`, but deaths are only observed if they occur prior to censoring; that is, since the data is right-censored, we only observe `event_time`. Second, some patients are not included in the data (i.e., they are `hidden`), because their `event_time` occurs prior to the time they would have entered the dataset (their `entry_time`). These patients are left-truncated.
To create a realistic LTRC dataset we will consequently remove the latent `death_time` variable and subset the data to patients that were not left-truncated.
```{r simdataAfterLT}
simdata <- simdata[simdata$hidden == 0, ]
simdata$death_time <- NULL
```
## Train and test sets
For prediction problems, it is common to split the data into training and tests sets. Here we use the `rsample` package to randomly split the data so that 75% of patients are in the training set and 25% are in the test set.
```{r split}
data_split <- initial_split(simdata, prop = .75)
train_data <- training(data_split)
test_data <- testing(data_split)
```
# Model fitting
## Model data
### Input matrix
`glmnet` requires an input matrix `x`, which must be a `matrix` object. Here we consider a simple example with a small number of predictors contained in `x`.
```{r xvars}
x_vars <- attr(simdata, "x_vars")
x_vars
```
We create `x` matrices for both the training and the test sets.
```{r xData}
x_train <- as.matrix(train_data[, x_vars])
x_test <- as.matrix(test_data[, x_vars])
```
### Response
When fitting Cox proportional hazards models with `glmnet`, the response should be a `survival::Surv` object. The key difference between modeling LTRC and RC data is that the response must contain start-stop intervals when using LTRC data. Start-stop `Surv` objects are created by passing the time that a patient enters the dataset (`time`), the follow-up time (`time2`), and a status indicator equal to 1 if a patient has died and 0 if they are alive.
As with `x`, we create `y` for both the training and the test sets.
```{r trainTestData}
y_train <- Surv(time = train_data$entry_time,
time2 = train_data$event_time,
event = train_data$dead)
y_test <- Surv(test_data$entry_time,
test_data$event_time,
test_data$dead)
```
## Cross validation
We will fit a penalized lasso model appropriate for LTRC data using `cv.glmnet()` (note: requires `glmnet` > 4.1). The value of the tuning parameter $\lambda$ that minimizes the partial likelihood deviance (`lambda.min`) is selected using 5-fold cross validation. For more details, see the vignette on the [glmnet website](https://glmnet.stanford.edu/articles/Coxnet.html).
```{r glmnetCV}
cvfit <- cv.glmnet(x_train, y_train, family = "cox", nfolds = 5)
```
## Model coefficients
The coefficients (i.e., log hazard ratios) depend on the value of $\lambda$. We can easily view the coefficients for a specific value of $\lambda$ using `coef()`. We might want to examine coefficients from a model that minimizes the deviance:
```{r coefLambdaMin}
coef(cvfit, s = "lambda.min")
```
Or alternatively, we may choose a model that minimizes deviance within 1 standard deviation of the minimum (`lambda.1se`). Note that this results in a sparser model:
```{r coefLambda1se}
coef(cvfit, s = "lambda.1se")
```
The full regularization path can also be plotted:
```{r plotCV}
plot(cvfit)
```
# Model predictions and performance
We will make out-of-sample predictions on the test set (25% of data) from models where the deviance was minimized (`lambda.min`). With survival data, there is no single best way to summarize predictions or to assess the performance of a model. We will first consider the C-index, which is a measure of the extent to which a model can *discriminate* across patients. Then we will examine survival probabilities and the extent to which predicted probabilities are well *calibrated*; that is, the extent to which predicted probabilities are similar to observed probabilities.
## C-index
The C-index can be computed from the linear predictor (i.e., the predicted log hazards). By default, the `predict.cv.glmnet()` method predicts the linear predictor for each patient.
```{r}
lp_test <- predict(cvfit, newx = x_test, s = "lambda.min")
head(lp_test)
```
The `survival::concordance()` function will compute the C-index in a manner appropriate for LTRC data. When making out of sample predictions, it is convenient to specify the `object` argument as a `formula` that relates the response to the linear predictor.
```{r cindex}
concordance(y_test ~ lp_test, reverse = TRUE)
```
## Survival probabilities and calibration
### Survival probabilities
Models fit with `glmnet` now have a `survfit` method which can be used to generate survival curves. Note that the values of `x` and `y` used to train the model must also be passed so that the baseline hazard can be computed.
```{r}
surv_test <- survfit(cvfit, newx = x_test, s = "lambda.min",
x = x_train, y = y_train)
```
Plots using base `R` graphics can be quickly generated using the `plot.survfit()` method, which produces a survival curve for each patient in the test set.
```{r}
plot(surv_test)
```
Note that we could have alternatively used a general purpose plotting package like `ggplot` or a specialized plotting package for survival analysis like `survminer` to produce more modern looking graphics. It might also be a good idea to aggregate predicted survival curves across patients in a meaningful way.
### Calibration
While there are many great `R` packages such a `rms` and `pec` that can be used to evaluate a survival model (including via calibration), none (to our knowledge) work with LTRC data. We consequently wrote our own function to calibrate survival models for the paper (available in the file `R/calibrate.R`), which we will use here. `calibrate()` is a generic function with a method for `survfit` objects. We also created a generic `autplot()` method to quickly plot the output produced by `calibrate()`.
Let's calibrate the model at semi-annual intervals for up to 3 years. Each point consists of patients in a given decile of predicted survival at a particular time point. The x-axis is the average predicted survival probability and the y-axis is the pseudo observed survival probability computed using the (left truncation adjusted) Kaplan-Meier estimator. Perfectly calibrated predictions are those that lie along the 45-degree line so that predicted and observed survival probabilities are equal.
```{r calibrate}
# calibrate.survfit() method
cal_test <- calibrate(object = surv_test, times = seq(.5, 3, .5), y = y_test)
# autoplot.calibrate() method
autoplot(cal_test)
```