# Proportional hazards regression¶

Key ideas: survival analysis, proportional hazards regression, Cox model, R, Stata

In this notebook we fit a basic proportional hazards regression model in Statsmodels, R, and Stata, and compare the results.

In [1]:
import numpy as np
import pandas as pd
from statsmodels.duration.hazard_regression import PHReg


The data are from a fairly large study looking at the relationship between a few biochemical variables and survival (mortality).

In [2]:
url = "http://vincentarelbundock.github.io/Rdatasets/csv/survival/flchain.csv"
del data["chapter"]
data = data.dropna()
data["lam"] = data["lambda"]
data["female"] = 1*(data["sex"] == "F")
data["year"] = data["sample.yr"] - min(data["sample.yr"])

   Unnamed: 0  age sex  sample.yr  kappa  lambda  flc.grp  creatinine  mgus  \
0           1   97   F       1997   5.70   4.860       10         1.7     0
1           2   92   F       2000   0.87   0.683        1         0.9     0
2           3   94   F       1997   4.36   3.850       10         1.4     0
3           4   92   F       1996   2.42   2.220        9         1.0     0
4           5   93   F       1996   1.32   1.690        6         1.1     0

futime  death    lam  female  year
0      85      1  4.860       1     2
1    1281      1  0.683       1     5
2      69      1  3.850       1     2
3     115      1  2.220       1     1
4    1039      1  1.690       1     1

[5 rows x 14 columns]


## Statsmodels¶

In [5]:
status = np.asarray(data["death"])
mod = PHReg.from_formula("futime ~ 0 + age + female + creatinine + " +
"np.sqrt(kappa) + np.sqrt(lam) + year + mgus",
data, status=status, ties="efron")
rslt = mod.fit()
print rslt.summary()

                           Results: PHReg
====================================================================
Model:                      PH Reg         Sample size:         6524
Dependent variable:         futime         Num. events:         1962
Ties:                       Efron
--------------------------------------------------------------------
log HR log HR SE   HR      t    P>|t|  [0.025 0.975]
--------------------------------------------------------------------
age             0.1012    0.0025 1.1065 40.9289 0.0000 1.1012 1.1119
female         -0.2817    0.0474 0.7545 -5.9368 0.0000 0.6875 0.8280
creatinine      0.0134    0.0411 1.0135  0.3271 0.7436 0.9351 1.0985
np.sqrt(kappa)  0.4047    0.1147 1.4988  3.5288 0.0004 1.1971 1.8766
np.sqrt(lam)    0.7046    0.1117 2.0230  6.3056 0.0000 1.6251 2.5183
year            0.0477    0.0192 1.0489  2.4902 0.0128 1.0102 1.0890
mgus            0.3160    0.2532 1.3717  1.2479 0.2121 0.8350 2.2532
====================================================================
Confidence intervals are for the hazard ratios


## R¶

Here is an R program that fits the same survival model:

library(survival)

ii = is.finite(data$creatinine) data = data[ii,] data$female = 1*(data$sex == "F") data$year = data$sample.yr - min(data$sample.yr)

surv = Surv(data$futime, data$death)
md = coxph(surv ~ age + female + creatinine + sqrt(kappa) + sqrt(lambda) +
year + mgus, data)
print(summary(md))

Here is the result of running this program in R:

Call:
coxph(formula = surv ~ age + female + creatinine + sqrt(kappa) +
sqrt(lambda) + year + mgus, data = data)

n= 6524, number of events= 1962

coef exp(coef)  se(coef)      z Pr(>|z|)
age           0.101202  1.106501  0.002473 40.929  < 2e-16 ***
female       -0.281688  0.754509  0.047448 -5.937 2.91e-09 ***
creatinine    0.013435  1.013525  0.041068  0.327 0.743568
sqrt(kappa)   0.404687  1.498834  0.114681  3.529 0.000417 ***
sqrt(lambda)  0.704566  2.022969  0.111736  6.306 2.87e-10 ***
year          0.047728  1.048885  0.019166  2.490 0.012767 *
mgus          0.316021  1.371659  0.253242  1.248 0.212068
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

exp(coef) exp(-coef) lower .95 upper .95
age             1.1065     0.9038    1.1012     1.112
female          0.7545     1.3254    0.6875     0.828
creatinine      1.0135     0.9867    0.9351     1.098
sqrt(kappa)     1.4988     0.6672    1.1971     1.877
sqrt(lambda)    2.0230     0.4943    1.6251     2.518
year            1.0489     0.9534    1.0102     1.089
mgus            1.3717     0.7290    0.8350     2.253

Concordance= 0.793  (se = 0.007 )
Rsquare= 0.321   (max possible= 0.994 )
Likelihood ratio test= 2529  on 7 df,   p=0
Wald test            = 2464  on 7 df,   p=0
Score (logrank) test = 2928  on 7 df,   p=0

## Stata¶

Below is a Stata program to fit the same Cox model as was fit above using Statsmodels and R. Note that Stata and R handle failure times that are zero (and more generally, failure times that equal entry times) differently. Stata drops these cases (of which there are three in this data set). The workaround is to shift the failure time to the right slightly, see:

http://www.stata.com/support/faqs/statistics/time-and-cox-model/

clear

copy http://vincentarelbundock.github.io/Rdatasets/csv/survival/flchain.csv flchain.csv, replace

import delimited flchain

generate female = 1
replace female = 0 if sex == "M"

replace futime = 0.001 if futime == 0

generate sqrt_kappa = sqrt(kappa)
generate sqrt_lambda = sqrt(lambda)
generate year = sampleyr - 1995
generate creatinine1 = real(creatinine)

drop chapter

stset futime, failure(death)

stcox age female creatinine1 sqrt_kappa sqrt_lambda year mgus, nohr efron

Here are the Stata results:

. stset futime, failure(death)

failure event:  death != 0 & death < .
obs. time interval:  (0, futime]
exit on or before:  failure

------------------------------------------------------------------------------
7874  total observations
0  exclusions
------------------------------------------------------------------------------
7874  observations remaining, representing
2169  failures in single-record/single-failure data
2.88e+07  total analysis time at risk and under observation
at risk from t =         0
earliest observed entry t =         0
last observed exit t =      5215

.
. stcox age female creatinine1 sqrt_kappa sqrt_lambda year mgus, nohr efron

failure _d:  death
analysis time _t:  futime

Iteration 0:   log likelihood = -16702.426
Iteration 1:   log likelihood = -15684.424
Iteration 2:   log likelihood = -15441.111
Iteration 3:   log likelihood =   -15437.9
Iteration 4:   log likelihood = -15437.782
Iteration 5:   log likelihood = -15437.782
Iteration 6:   log likelihood = -15437.782
Refining estimates:
Iteration 0:   log likelihood = -15437.782

Cox regression -- Efron method for ties

No. of subjects =         6524                     Number of obs   =      6524
No. of failures =         1962
Time at risk    =     23796304
LR chi2(7)      =   2529.29
Log likelihood  =   -15437.782                     Prob > chi2     =    0.0000

------------------------------------------------------------------------------
_t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
age |   .1012024   .0024726    40.93   0.000     .0963561    .1060486
female |  -.2816877   .0474475    -5.94   0.000    -.3746832   -.1886923
creatinine1 |   .0134347   .0410684     0.33   0.744    -.0670578    .0939273
sqrt_kappa |   .4046874   .1146813     3.53   0.000     .1799163    .6294586
sqrt_lambda |   .7045662   .1117365     6.31   0.000     .4855667    .9235656
year |   .0477281   .0191664     2.49   0.013     .0101627    .0852934
mgus |   .3160207   .2532423     1.25   0.212    -.1803252    .8123666
------------------------------------------------------------------------------