Survival analysis with dependent events

Key ideas: Proportional hazards, survival analysis, event history analysis, Cox model, sandwich covariance

This notebook illustrates how the robust 'sandwich covariance' can be used to analyze survival-type data in which each subject may have several, possibly dependent events.

In [1]:
import numpy as np
from statsmodels.duration.hazard_regression import PHReg

Simulated data

First we use simulate data to illustrate how the robust covariance matrix works in some simple settings. We simulate survival times from an exponential distribution, with the log hazard values determined by a linear function of five covariates. For this simple illustration, all observations are uncensored.

In [2]:
n = 300 # Sample size
p = 5 # Number of covariates

exog = np.random.normal(size=(n, p))
lin_pred = exog.sum(1)
endog = -np.exp(-lin_pred)*np.log(np.random.uniform(size=n))

Here is the basic proportional hazards model for this data set:

In [3]:
mod = PHReg(endog, exog)
rslt = mod.fit()
print rslt.summary()
                    Results: PHReg
=======================================================
Model:                   PH Reg     Sample size:    300
Dependent variable:      y          Num. events:    300
Ties:                    Breslow                       
-------------------------------------------------------
   log HR log HR SE   HR      t    P>|t|  [0.025 0.975]
-------------------------------------------------------
x1 0.9882    0.0769 2.6863 12.8483 0.0000 2.3104 3.1233
x2 0.9216    0.0753 2.5134 12.2459 0.0000 2.1687 2.9129
x3 0.9691    0.0767 2.6355 12.6301 0.0000 2.2675 3.0631
x4 1.0897    0.0816 2.9733 13.3603 0.0000 2.5341 3.4887
x5 1.0375    0.0756 2.8220 13.7293 0.0000 2.4336 3.2725
=======================================================
Confidence intervals are for the hazard ratios

Next we take the same data set and duplicate each value four times. In a naive analysis, this will leave the parameter estimates unchanged, but the standard errors will be reduced by a factor of two. Of course this is incorrect becuase there is in fact no new information in the three repeated values for each subject.

In [4]:
exog_rep = np.kron(exog, np.ones((4,1)))
endog_rep = np.kron(endog, np.ones(4))
groups = np.kron(np.arange(n), np.ones(4))

Here is the proportional hazards regression model fit to the data with replicated records.

In [5]:
mod_rep = PHReg(endog_rep, exog_rep)
rslt_rep = mod_rep.fit()
print rslt_rep.summary() 
                    Results: PHReg
=======================================================
Model:                  PH Reg     Sample size:    1200
Dependent variable:     y          Num. events:    1200
Ties:                   Breslow                        
-------------------------------------------------------
   log HR log HR SE   HR      t    P>|t|  [0.025 0.975]
-------------------------------------------------------
x1 0.9882    0.0385 2.6863 25.6966 0.0000 2.4912 2.8966
x2 0.9216    0.0376 2.5134 24.4918 0.0000 2.3347 2.7058
x3 0.9691    0.0384 2.6355 25.2603 0.0000 2.4446 2.8413
x4 1.0897    0.0408 2.9733 26.7207 0.0000 2.7449 3.2207
x5 1.0375    0.0378 2.8220 27.4587 0.0000 2.6206 3.0390
=======================================================
Confidence intervals are for the hazard ratios

Here is the proportional hazards regression fit using the robust covariance matrix to accommodate the (perfect) dependence within groups. The standard errors roughly match what we obtained with the un-duplicated data.

In [6]:
mod_rep_a = PHReg(endog_rep, exog_rep)
rslt_rep_a = mod_rep_a.fit(groups=groups)
print rslt_rep_a.summary()
                    Results: PHReg
=======================================================
Model:                PH Reg    Num. events:       1200
Dependent variable:   y         Max. group size:   4   
Ties:                 Breslow   Min. group size:   4   
Sample size:          1200      Avg. group size:   4.0 
-------------------------------------------------------
   log HR log HR SE   HR      t    P>|t|  [0.025 0.975]
-------------------------------------------------------
x1 0.9882    0.0721 2.6863 13.7018 0.0000 2.3322 3.0941
x2 0.9216    0.0733 2.5134 12.5675 0.0000 2.1769 2.9019
x3 0.9691    0.0717 2.6355 13.5138 0.0000 2.2899 3.0332
x4 1.0897    0.0746 2.9733 14.6018 0.0000 2.5687 3.4416
x5 1.0375    0.0717 2.8220 14.4717 0.0000 2.4521 3.2478
=======================================================
Confidence intervals are for the hazard ratios
Standard errors account for dependence within groups

We can also check that using the robust covariance when the events are actually independent does not needlessly inflate the variance.

In [7]:
n = 100 # Sample size
p = 5 # Number of covariates

exog = np.random.normal(size=(4*n, p))
lin_pred = exog.sum(1)
endog = -np.exp(-lin_pred)*np.log(np.random.uniform(size=4*n))
groups = np.kron(np.arange(n), np.ones(4))

for method in "naive", "robust":
    if method == "robust":
        mod_rep_b = PHReg(endog, exog)
        rslt_rep_b = mod_rep_b.fit(groups=groups)
    else:
        mod_rep_b = PHReg(endog, exog)
        rslt_rep_b = mod_rep_b.fit()
    print rslt_rep_b.summary()
    print "\n"
                    Results: PHReg
=======================================================
Model:                   PH Reg     Sample size:    400
Dependent variable:      y          Num. events:    400
Ties:                    Breslow                       
-------------------------------------------------------
   log HR log HR SE   HR      t    P>|t|  [0.025 0.975]
-------------------------------------------------------
x1 1.1389    0.0690 3.1232 16.5122 0.0000 2.7283 3.5753
x2 1.2356    0.0719 3.4404 17.1854 0.0000 2.9882 3.9611
x3 1.1830    0.0689 3.2640 17.1787 0.0000 2.8519 3.7357
x4 1.1353    0.0681 3.1121 16.6664 0.0000 2.7231 3.5566
x5 1.1574    0.0683 3.1815 16.9533 0.0000 2.7831 3.6370
=======================================================
Confidence intervals are for the hazard ratios


                    Results: PHReg
=======================================================
Model:                 PH Reg    Num. events:       400
Dependent variable:    y         Max. group size:   4  
Ties:                  Breslow   Min. group size:   4  
Sample size:           400       Avg. group size:   4.0
-------------------------------------------------------
   log HR log HR SE   HR      t    P>|t|  [0.025 0.975]
-------------------------------------------------------
x1 1.1389    0.0728 3.1232 15.6382 0.0000 2.7078 3.6024
x2 1.2356    0.0799 3.4404 15.4692 0.0000 2.9419 4.0235
x3 1.1830    0.0708 3.2640 16.7171 0.0000 2.8413 3.7496
x4 1.1353    0.0712 3.1121 15.9351 0.0000 2.7065 3.5784
x5 1.1574    0.0859 3.1815 13.4707 0.0000 2.6885 3.7650
=======================================================
Confidence intervals are for the hazard ratios
Standard errors account for dependence within groups


Next we simulate a survival data set in which conesecutive bocks of four observations are dependent, but not perfectly dependent as above. We include a parameter that allows us to control the degree of dependence.

In [8]:
n = 300 # Sample size
p = 5 # Number of covariates

exog = np.random.normal(size=(n, p))
exog = np.kron(exog, np.ones((4,1)))
lin_pred = exog.sum(1)
endog = -np.exp(-lin_pred)*np.log(np.random.uniform(size=4*n))
groups = np.kron(np.arange(n), np.ones(4))
endog_a = endog.reshape((n,4)).mean(1)
endog_a = np.kron(endog_a, np.ones(4))

# Adjust this parameter to control the amount of within-subject dependence
# 0 = perfect dependence, 1 = no dependence
a = 0.5

endog = a*endog + (1-a)*endog_a
In [9]:
for method in "robust", "naive":
    if method == "robust":
        mod_rep_i = PHReg(endog, exog)
        rslt_rep_i = mod_rep_i.fit(groups=groups)
    else:    
        mod_rep_i = PHReg(endog, exog)
        rslt_rep_i = mod_rep_i.fit()
    print rslt_rep_i.summary()
    print "\n"
                    Results: PHReg
=======================================================
Model:                PH Reg    Num. events:       1200
Dependent variable:   y         Max. group size:   4   
Ties:                 Breslow   Min. group size:   4   
Sample size:          1200      Avg. group size:   4.0 
-------------------------------------------------------
   log HR log HR SE   HR      t    P>|t|  [0.025 0.975]
-------------------------------------------------------
x1 1.5805    0.0597 4.8575 26.4676 0.0000 4.3210 5.4607
x2 1.6111    0.0604 5.0086 26.6596 0.0000 4.4491 5.6384
x3 1.7075    0.0657 5.5149 25.9994 0.0000 4.8488 6.2725
x4 1.7531    0.0578 5.7724 30.3293 0.0000 5.1541 6.4648
x5 1.6474    0.0575 5.1937 28.6730 0.0000 4.6406 5.8128
=======================================================
Confidence intervals are for the hazard ratios
Standard errors account for dependence within groups


                    Results: PHReg
=======================================================
Model:                  PH Reg     Sample size:    1200
Dependent variable:     y          Num. events:    1200
Ties:                   Breslow                        
-------------------------------------------------------
   log HR log HR SE   HR      t    P>|t|  [0.025 0.975]
-------------------------------------------------------
x1 1.5805    0.0466 4.8575 33.8966 0.0000 4.4333 5.3224
x2 1.6111    0.0459 5.0086 35.0739 0.0000 4.5773 5.4804
x3 1.7075    0.0459 5.5149 37.2238 0.0000 5.0407 6.0337
x4 1.7531    0.0494 5.7724 35.4667 0.0000 5.2394 6.3596
x5 1.6474    0.0472 5.1937 34.8959 0.0000 4.7347 5.6972
=======================================================
Confidence intervals are for the hazard ratios