Survival regression diagnostics

Key ideas: Survival analysis, proportional hazards regression, Cox model, regression diagnostics

This notebook illustrates a few ways to perform diagnostic checks on proportional hazards regression models. We currently lack a few things needed to do this properly (Kaplan-Meier survival curve plots and deviance residuals for Cox models).

Some useful links:

http://www.mwsug.org/proceedings/2006/stats/MWSUG-2006-SD08.pdf

http://www.stat.columbia.edu/~madigan/G6101/notes/survival4.pdf

Here are the import statements:

In [1]:
import numpy as np
import pandas as pd
from statsmodels.duration.hazard_regression import PHReg

The data are from a fairly large study looking at the relationship between overall survival (mortality) and various molecular risk factors. 'futime' is the time until the subject either dies or is censored. 'death' is an indicator variable that tells us whether follow-up time corresponds to a death or to a censoring event.

In [2]:
url = "http://vincentarelbundock.github.io/Rdatasets/csv/survival/flchain.csv"
data = pd.read_csv(url)
del data["chapter"]
data = data.dropna()
data["lam"] = data["lambda"]
data["female"] = 1*(data["sex"] == "F")
data["year"] = data["sample.yr"] - min(data["sample.yr"])
print data.head()
   Unnamed: 0  age sex  sample.yr  kappa  lambda  flc.grp  creatinine  mgus  \
0           1   97   F       1997   5.70   4.860       10         1.7     0   
1           2   92   F       2000   0.87   0.683        1         0.9     0   
2           3   94   F       1997   4.36   3.850       10         1.4     0   
3           4   92   F       1996   2.42   2.220        9         1.0     0   
4           5   93   F       1996   1.32   1.690        6         1.1     0   

   futime  death    lam  female  year  
0      85      1  4.860       1     2  
1    1281      1  0.683       1     5  
2      69      1  3.850       1     2  
3     115      1  2.220       1     1  
4    1039      1  1.690       1     1  

[5 rows x 14 columns]

Here is the regression model that we will be working with:

In [3]:
status = np.asarray(data["death"])
mod = PHReg.from_formula("futime ~ 0 + age + female + creatinine + " +
                         "np.sqrt(kappa) + np.sqrt(lam) + year + mgus", 
                         data, status=status, ties="efron")
rslt = mod.fit()
print rslt.summary()
                           Results: PHReg
====================================================================
Model:                      PH Reg         Sample size:         6524
Dependent variable:         futime         Num. events:         1962
Ties:                       Efron                                   
--------------------------------------------------------------------
                log HR log HR SE   HR      t    P>|t|  [0.025 0.975]
--------------------------------------------------------------------
age             0.1012    0.0025 1.1065 40.9289 0.0000 1.1012 1.1119
female         -0.2817    0.0474 0.7545 -5.9368 0.0000 0.6875 0.8280
creatinine      0.0134    0.0411 1.0135  0.3271 0.7436 0.9351 1.0985
np.sqrt(kappa)  0.4047    0.1147 1.4988  3.5288 0.0004 1.1971 1.8766
np.sqrt(lam)    0.7046    0.1117 2.0230  6.3056 0.0000 1.6251 2.5183
year            0.0477    0.0192 1.0489  2.4902 0.0128 1.0102 1.0890
mgus            0.3160    0.2532 1.3717  1.2479 0.2121 0.8350 2.2532
====================================================================
Confidence intervals are for the hazard ratios

The martingale residuals can be used to identify unusual observations. Here is a plot of the martingale residuals against case number. We don't know anything about how the cases are ordered, so the accumulation of more negative values on the left side of the plot may not be meaningful.

In [4]:
plt.plot(rslt.martingale_residuals, 'o', alpha=0.5)
plt.xlabel("Case number")
plt.ylabel("Martingale residual")
Out[4]:
<matplotlib.text.Text at 0x3ea1850>

The horizontal bands in the margingale residual plot are usually present and do not indicate anything meaningful. There is one notably extreme value. We'll look into that value next.

In [5]:
ii = np.flatnonzero(rslt.martingale_residuals < -5)
print data.iloc[ii,:]
print data.mean()

# Figure out where the extreme value lies in the covariate space
# For some reason this doesn't work in Pandas
pr = (np.asarray(data) > np.asarray(data)[ii,:]).mean(0)
pr = pd.Series(pr, data.columns)
print "\n", pr
      Unnamed: 0  age sex  sample.yr  kappa  lambda  flc.grp  creatinine  \
3613        3614   61   M       1998   20.5    16.7       10         9.6   

      mgus  futime  death   lam  female  year  
3613     0    3856      0  16.7       0     3  

[1 rows x 14 columns]
Unnamed: 0    3780.781576
age             65.057787
sample.yr     1996.623237
kappa            1.451986
lambda           1.728203
flc.grp          5.537860
creatinine       1.093516
mgus             0.014715
futime        3647.502146
death            0.300736
lam              1.728203
female           0.550582
year             1.623237
dtype: float64

Unnamed: 0    0.512569
age           0.566370
sex           0.000000
sample.yr     0.106990
kappa         0.000000
lambda        0.000766
flc.grp       0.000000
creatinine    0.000307
mgus          0.014715
futime        0.612048
death         0.300736
lam           0.000766
female        0.550582
year          0.106990
dtype: float64

The observation with amn extreme martingale residual has extremely high values for kappa, lambda, and creatinine (either the highest in the whole data set, or in the top 0.1% of values).

We can also plot the martingale residuals against various covariates.

In [6]:
plt.plot(np.sqrt(data["creatinine"]), rslt.martingale_residuals, 'o', alpha=0.5)
plt.xlabel("sqrt Creatinine")
plt.ylabel("Martingale residual")
Out[6]:
<matplotlib.text.Text at 0x3a3b8d0>

Here is another plot of martingale residuals against a continuous covariate.

In [7]:
plt.plot(np.sqrt(data["kappa"]), rslt.martingale_residuals, 'o', alpha=0.5)
plt.xlabel("sqrt Kappa")
plt.ylabel("Martingale residual")
Out[7]:
<matplotlib.text.Text at 0x4144dd0>

Plots of the Schoenfeld residuals against time can be informative about whether the hazards are reasonably proportional, as modeled. Note that Schoenfeld residuals are not defined for censored cases. There is a slight increase in the Schoenfeld residuals with respect to age.

In [8]:
sr = rslt.schoenfeld_residuals
ii = np.flatnonzero(pd.notnull(sr[:,0]))
plt.plot(mod.endog[ii], sr[ii,0], 'o')
plt.xlabel("Time")
plt.ylabel("Age Schoenfeld residual")
Out[8]:
<matplotlib.text.Text at 0x440ebd0>

Here is another plot using Schoenfeld residuals to diagnose non-proportional hazards.

In [9]:
sr = rslt.schoenfeld_residuals
ii = np.flatnonzero(pd.notnull(sr[:,0]))
plt.plot(mod.endog[ii], sr[ii,3], 'o', alpha=0.5)
plt.xlabel("Time")
plt.ylabel("sqrt(kappa) Schoenfeld residual")
Out[9]:
<matplotlib.text.Text at 0x4414810>