Proportional hazards regression

Key ideas: survival analysis, proportional hazards regression, Cox model, R, Stata

In this notebook we fit a basic proportional hazards regression model in Statsmodels, R, and Stata, and compare the results.

In [1]:
import numpy as np
import pandas as pd
from statsmodels.duration.hazard_regression import PHReg

The data are from a fairly large study looking at the relationship between a few biochemical variables and survival (mortality).

In [2]:
url = "http://vincentarelbundock.github.io/Rdatasets/csv/survival/flchain.csv"
data = pd.read_csv(url)
del data["chapter"]
data = data.dropna()
data["lam"] = data["lambda"]
data["female"] = 1*(data["sex"] == "F")
data["year"] = data["sample.yr"] - min(data["sample.yr"])
print data.head()
   Unnamed: 0  age sex  sample.yr  kappa  lambda  flc.grp  creatinine  mgus  \
0           1   97   F       1997   5.70   4.860       10         1.7     0   
1           2   92   F       2000   0.87   0.683        1         0.9     0   
2           3   94   F       1997   4.36   3.850       10         1.4     0   
3           4   92   F       1996   2.42   2.220        9         1.0     0   
4           5   93   F       1996   1.32   1.690        6         1.1     0   

   futime  death    lam  female  year  
0      85      1  4.860       1     2  
1    1281      1  0.683       1     5  
2      69      1  3.850       1     2  
3     115      1  2.220       1     1  
4    1039      1  1.690       1     1  

[5 rows x 14 columns]

Statsmodels

In [5]:
status = np.asarray(data["death"])
mod = PHReg.from_formula("futime ~ 0 + age + female + creatinine + " +
                         "np.sqrt(kappa) + np.sqrt(lam) + year + mgus", 
                         data, status=status, ties="efron")
rslt = mod.fit()
print rslt.summary()
                           Results: PHReg
====================================================================
Model:                      PH Reg         Sample size:         6524
Dependent variable:         futime         Num. events:         1962
Ties:                       Efron                                   
--------------------------------------------------------------------
                log HR log HR SE   HR      t    P>|t|  [0.025 0.975]
--------------------------------------------------------------------
age             0.1012    0.0025 1.1065 40.9289 0.0000 1.1012 1.1119
female         -0.2817    0.0474 0.7545 -5.9368 0.0000 0.6875 0.8280
creatinine      0.0134    0.0411 1.0135  0.3271 0.7436 0.9351 1.0985
np.sqrt(kappa)  0.4047    0.1147 1.4988  3.5288 0.0004 1.1971 1.8766
np.sqrt(lam)    0.7046    0.1117 2.0230  6.3056 0.0000 1.6251 2.5183
year            0.0477    0.0192 1.0489  2.4902 0.0128 1.0102 1.0890
mgus            0.3160    0.2532 1.3717  1.2479 0.2121 0.8350 2.2532
====================================================================
Confidence intervals are for the hazard ratios

R

Here is an R program that fits the same survival model:

library(survival)

data = read.csv("http://vincentarelbundock.github.io/Rdatasets/csv/survival/flchain.csv")

ii = is.finite(data$creatinine)
data = data[ii,]
data$female = 1*(data$sex == "F")
data$year = data$sample.yr - min(data$sample.yr)

surv = Surv(data$futime, data$death)
md = coxph(surv ~ age + female + creatinine + sqrt(kappa) + sqrt(lambda) + 
           year + mgus, data)
print(summary(md))

Here is the result of running this program in R:

Call:
coxph(formula = surv ~ age + female + creatinine + sqrt(kappa) + 
    sqrt(lambda) + year + mgus, data = data)

  n= 6524, number of events= 1962 

                  coef exp(coef)  se(coef)      z Pr(>|z|)    
age           0.101202  1.106501  0.002473 40.929  < 2e-16 ***
female       -0.281688  0.754509  0.047448 -5.937 2.91e-09 ***
creatinine    0.013435  1.013525  0.041068  0.327 0.743568    
sqrt(kappa)   0.404687  1.498834  0.114681  3.529 0.000417 ***
sqrt(lambda)  0.704566  2.022969  0.111736  6.306 2.87e-10 ***
year          0.047728  1.048885  0.019166  2.490 0.012767 *  
mgus          0.316021  1.371659  0.253242  1.248 0.212068    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

             exp(coef) exp(-coef) lower .95 upper .95
age             1.1065     0.9038    1.1012     1.112
female          0.7545     1.3254    0.6875     0.828
creatinine      1.0135     0.9867    0.9351     1.098
sqrt(kappa)     1.4988     0.6672    1.1971     1.877
sqrt(lambda)    2.0230     0.4943    1.6251     2.518
year            1.0489     0.9534    1.0102     1.089
mgus            1.3717     0.7290    0.8350     2.253

Concordance= 0.793  (se = 0.007 )
Rsquare= 0.321   (max possible= 0.994 )
Likelihood ratio test= 2529  on 7 df,   p=0
Wald test            = 2464  on 7 df,   p=0
Score (logrank) test = 2928  on 7 df,   p=0

Stata

Below is a Stata program to fit the same Cox model as was fit above using Statsmodels and R. Note that Stata and R handle failure times that are zero (and more generally, failure times that equal entry times) differently. Stata drops these cases (of which there are three in this data set). The workaround is to shift the failure time to the right slightly, see:

http://www.stata.com/support/faqs/statistics/time-and-cox-model/

clear

copy http://vincentarelbundock.github.io/Rdatasets/csv/survival/flchain.csv flchain.csv, replace

import delimited flchain

generate female = 1
replace female = 0 if sex == "M"

replace futime = 0.001 if futime == 0

generate sqrt_kappa = sqrt(kappa)
generate sqrt_lambda = sqrt(lambda)
generate year = sampleyr - 1995
generate creatinine1 = real(creatinine)

drop chapter

stset futime, failure(death)

stcox age female creatinine1 sqrt_kappa sqrt_lambda year mgus, nohr efron

Here are the Stata results:

. stset futime, failure(death)

     failure event:  death != 0 & death < .
obs. time interval:  (0, futime]
 exit on or before:  failure

------------------------------------------------------------------------------
     7874  total observations
        0  exclusions
------------------------------------------------------------------------------
     7874  observations remaining, representing
     2169  failures in single-record/single-failure data
 2.88e+07  total analysis time at risk and under observation
                                              at risk from t =         0
                                   earliest observed entry t =         0
                                        last observed exit t =      5215

. 
. stcox age female creatinine1 sqrt_kappa sqrt_lambda year mgus, nohr efron

         failure _d:  death
   analysis time _t:  futime

Iteration 0:   log likelihood = -16702.426
Iteration 1:   log likelihood = -15684.424
Iteration 2:   log likelihood = -15441.111
Iteration 3:   log likelihood =   -15437.9
Iteration 4:   log likelihood = -15437.782
Iteration 5:   log likelihood = -15437.782
Iteration 6:   log likelihood = -15437.782
Refining estimates:
Iteration 0:   log likelihood = -15437.782

Cox regression -- Efron method for ties

No. of subjects =         6524                     Number of obs   =      6524
No. of failures =         1962
Time at risk    =     23796304
                                                   LR chi2(7)      =   2529.29
Log likelihood  =   -15437.782                     Prob > chi2     =    0.0000

------------------------------------------------------------------------------
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .1012024   .0024726    40.93   0.000     .0963561    .1060486
      female |  -.2816877   .0474475    -5.94   0.000    -.3746832   -.1886923
 creatinine1 |   .0134347   .0410684     0.33   0.744    -.0670578    .0939273
  sqrt_kappa |   .4046874   .1146813     3.53   0.000     .1799163    .6294586
 sqrt_lambda |   .7045662   .1117365     6.31   0.000     .4855667    .9235656
        year |   .0477281   .0191664     2.49   0.013     .0101627    .0852934
        mgus |   .3160207   .2532423     1.25   0.212    -.1803252    .8123666
------------------------------------------------------------------------------