In this notebook, we will present a simple example of the PCHazard
method described in this paper.
For a more verbose introduction to pycox
see this notebook.
import numpy as np
import matplotlib.pyplot as plt
# For preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn_pandas import DataFrameMapper
import torch # For building the networks
import torchtuples as tt # Some useful functions
from pycox.datasets import metabric
from pycox.models import PCHazard
from pycox.evaluation import EvalSurv
## Uncomment to install `sklearn-pandas`
# ! pip install sklearn-pandas
np.random.seed(1234)
_ = torch.manual_seed(123)
We load the METABRIC data set as a pandas DataFrame and split the data in in train, test and validation.
The duration
column gives the observed times and the event
column contains indicators of whether the observation is an event (1) or a censored observation (0).
df_train = metabric.read_df()
df_test = df_train.sample(frac=0.2)
df_train = df_train.drop(df_test.index)
df_val = df_train.sample(frac=0.2)
df_train = df_train.drop(df_val.index)
df_train.head()
x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | duration | event | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5.603834 | 7.811392 | 10.797988 | 5.967607 | 1.0 | 1.0 | 0.0 | 1.0 | 56.840000 | 99.333336 | 0 |
1 | 5.284882 | 9.581043 | 10.204620 | 5.664970 | 1.0 | 0.0 | 0.0 | 1.0 | 85.940002 | 95.733330 | 1 |
3 | 6.654017 | 5.341846 | 8.646379 | 5.655888 | 0.0 | 0.0 | 0.0 | 0.0 | 66.910004 | 239.300003 | 0 |
4 | 5.456747 | 5.339741 | 10.555724 | 6.008429 | 1.0 | 0.0 | 0.0 | 1.0 | 67.849998 | 56.933334 | 1 |
5 | 5.425826 | 6.331182 | 10.455145 | 5.749053 | 1.0 | 1.0 | 0.0 | 1.0 | 70.519997 | 123.533333 | 0 |
The METABRIC dataset has 9 covariates: x0, ..., x8
.
We will standardize the 5 numerical covariates, and leave the binary covariates as is.
Note that PyTorch require variables of type 'float32'
.
We like using the sklearn_pandas.DataFrameMapper
to make feature mappers.
cols_standardize = ['x0', 'x1', 'x2', 'x3', 'x8']
cols_leave = ['x4', 'x5', 'x6', 'x7']
standardize = [([col], StandardScaler()) for col in cols_standardize]
leave = [(col, None) for col in cols_leave]
x_mapper = DataFrameMapper(standardize + leave)
x_train = x_mapper.fit_transform(df_train).astype('float32')
x_val = x_mapper.transform(df_val).astype('float32')
x_test = x_mapper.transform(df_test).astype('float32')
The survival methods require individual label transforms, so we have included a proposed label_transform
for each method.
In this case label_transform
is just a shorthand for the class pycox.preprocessing.label_transforms.LabTransDiscreteTime
.
The PCHazard
is a continuous-time method, but requires defined intervals in which the hazard is constant. Hence, we need to perform an operation similar to a discretization of the time scale.
We let num_durations
define the size of this (equidistant) interval grid.
num_durations = 10
labtrans = PCHazard.label_transform(num_durations)
get_target = lambda df: (df['duration'].values, df['event'].values)
y_train = labtrans.fit_transform(*get_target(df_train))
y_val = labtrans.transform(*get_target(df_val))
train = (x_train, y_train)
val = (x_val, y_val)
# We don't need to transform the test labels
durations_test, events_test = get_target(df_test)
type(labtrans)
pycox.preprocessing.label_transforms.LabTransPCHazard
Note that y_train
now consist of three labels: the interval index, the event indicator, and the proportion of the interval before the event/censoring occur (i.e, $\rho(t_i)$ in the paper).
y_train
(array([2, 2, 6, ..., 1, 5, 3]), array([0., 1., 0., ..., 1., 0., 0.], dtype=float32), array([0.7965465 , 0.69519496, 0.7370492 , ..., 0.06606603, 0.5865238 , 0.96302533], dtype=float32))
We make a neural net with torch
.
For simple network structures, we can use the MLPVanilla
provided by torchtuples
.
For building more advanced network architectures, see for example the tutorials by PyTroch.
The following net is an MLP with two hidden layers (with 32 nodes each), ReLU activations, and num_nodes
output nodes.
We also have batch normalization and dropout between the layers.
in_features = x_train.shape[1]
num_nodes = [32, 32]
out_features = labtrans.out_features
batch_norm = True
dropout = 0.1
net = tt.practical.MLPVanilla(in_features, num_nodes, out_features, batch_norm, dropout)
If you instead want to build this network with torch
you can uncomment the following code.
It is essentially equivalent to the MLPVanilla
, but without the torch.nn.init.kaiming_normal_
weight initialization.
# net = torch.nn.Sequential(
# torch.nn.Linear(in_features, 32),
# torch.nn.ReLU(),
# torch.nn.BatchNorm1d(32),
# torch.nn.Dropout(0.1),
# torch.nn.Linear(32, 32),
# torch.nn.ReLU(),
# torch.nn.BatchNorm1d(32),
# torch.nn.Dropout(0.1),
# torch.nn.Linear(32, out_features)
# )
To train the model we need to define an optimizer. You can choose any torch.optim
optimizer, but here we instead use one from tt.optim
as it has some added functionality.
We use the Adam
optimizer, but instead of choosing a learning rate, we will use the scheme proposed by Smith 2017 to find a suitable learning rate with model.lr_finder
. See this post for an explanation.
We also set duration_index
which connects the output nodes of the network the the discretization times. This is only useful for prediction and does not affect the training procedure.
model = PCHazard(net, tt.optim.Adam, duration_index=labtrans.cuts)
batch_size = 256
lr_finder = model.lr_finder(x_train, y_train, batch_size, tolerance=8)
_ = lr_finder.plot()
lr_finder.get_best_lr()
0.10722672220103299
Often, this learning rate is a little high, so we instead set it manually to 0.01
model.optimizer.set_lr(0.01)
We include the EarlyStopping
callback to stop training when the validation loss stops improving. After training, this callback will also load the best performing model in terms of validation loss.
epochs = 100
callbacks = [tt.callbacks.EarlyStopping()]
log = model.fit(x_train, y_train, batch_size, epochs, callbacks, val_data=val)
0: [0s / 0s], train_loss: 2.6413, val_loss: 2.5174 1: [0s / 0s], train_loss: 2.3406, val_loss: 2.3192 2: [0s / 0s], train_loss: 2.1494, val_loss: 2.0976 3: [0s / 0s], train_loss: 1.9278, val_loss: 1.8521 4: [0s / 0s], train_loss: 1.7077, val_loss: 1.6468 5: [0s / 0s], train_loss: 1.5645, val_loss: 1.5185 6: [0s / 0s], train_loss: 1.4957, val_loss: 1.4789 7: [0s / 0s], train_loss: 1.4887, val_loss: 1.4919 8: [0s / 0s], train_loss: 1.4821, val_loss: 1.4874 9: [0s / 0s], train_loss: 1.4499, val_loss: 1.4807 10: [0s / 0s], train_loss: 1.4514, val_loss: 1.4747 11: [0s / 0s], train_loss: 1.4425, val_loss: 1.4793 12: [0s / 0s], train_loss: 1.4182, val_loss: 1.4848 13: [0s / 0s], train_loss: 1.4268, val_loss: 1.4878 14: [0s / 0s], train_loss: 1.4193, val_loss: 1.4861 15: [0s / 0s], train_loss: 1.4099, val_loss: 1.4978 16: [0s / 0s], train_loss: 1.3999, val_loss: 1.5006 17: [0s / 0s], train_loss: 1.4148, val_loss: 1.4996 18: [0s / 0s], train_loss: 1.4060, val_loss: 1.5080 19: [0s / 0s], train_loss: 1.3819, val_loss: 1.5106 20: [0s / 0s], train_loss: 1.3740, val_loss: 1.5111
_ = log.plot()
For evaluation we first need to obtain survival estimates for the test set.
This can be done with model.predict_surv
which returns an array of survival estimates, or with model.predict_surv_df
which returns the survival estimates as a dataframe.
However, we need to define at how many points we want to get the predictions.
The default (model.sub = 1
) is just to use the interval knots, but by increasing the model.sub
argument, we replace the knots with and equidistant number of points in each interval.
This is very similar to the interpolation of the discrete methods such as LogisticHazard
and PMF
.
surv = model.predict_surv_df(x_test)
surv.iloc[:, :5].plot(drawstyle='steps-post')
plt.ylabel('S(t | x)')
_ = plt.xlabel('Time')
model.sub = 10
surv = model.predict_surv_df(x_test)
surv.iloc[:, :5].plot(drawstyle='steps-post')
plt.ylabel('S(t | x)')
_ = plt.xlabel('Time')
The EvalSurv
class contains some useful evaluation criteria for time-to-event prediction.
We set censor_surv = 'km'
to state that we want to use Kaplan-Meier for estimating the censoring distribution.
ev = EvalSurv(surv, durations_test, events_test, censor_surv='km')
We start with the event-time concordance by Antolini et al. 2005.
ev.concordance_td('antolini')
0.6573402857263979
We can plot the the IPCW Brier score for a given set of times. Here we just use 100 time-points between the min and max duration in the test set. Note that the score becomes unstable for the highest times. It is therefore common to disregard the rightmost part of the graph.
time_grid = np.linspace(durations_test.min(), durations_test.max(), 100)
ev.brier_score(time_grid).plot()
plt.ylabel('Brier score')
_ = plt.xlabel('Time')
In a similar manner, we can plot the the IPCW negative binomial log-likelihood.
ev.nbll(time_grid).plot()
plt.ylabel('NBLL')
_ = plt.xlabel('Time')
The two time-dependent scores above can be integrated over time to produce a single score Graf et al. 1999. In practice this is done by numerical integration over a defined time_grid
.
ev.integrated_brier_score(time_grid)
0.16404163283537904
ev.integrated_nbll(time_grid)
0.4844648690929724