Proportional hazards regression

This notebook demonstrates prediction for the proportional hazards regression model using the predict and get_distribution functions.

Key ideas: Proportional hazards regression, Cox model, prediction

In [1]:
import numpy as np
import pandas as pd
from statsmodels.duration.hazard_regression import PHReg

We start with a simple simulated data set with 1000 observations and no censoring. The event times for the first 500 observations are distributed as standard exponential values, which implies that the hazard function has a constant value of 1, and the cumulative hazard function has the graph y = x. The event times in the second set of 500 values are exponentially distributed with mean 2, which implies that the hazard function has a constant value of 1/2, and the cumulative hazard function follows the graph y = x/2.

In [2]:
exog = np.zeros(1000)
exog[500:] = 1
endog = -np.exp(np.log(2)*exog) * np.log(np.random.uniform(size=1000))

Here is the fitted model.

In [3]:
mod = PHReg(endog, exog)
rslt = mod.fit()
print rslt.summary()
                     Results: PHReg
========================================================
Model:                   PH Reg     Sample size:    1000
Dependent variable:      y          Num. events:    1000
Ties:                    Breslow                        
--------------------------------------------------------
    log HR log HR SE   HR      t    P>|t|  [0.025 0.975]
--------------------------------------------------------
x1 -0.6573    0.0664 0.5183 -9.8917 0.0000 0.4550 0.5904
========================================================
Confidence intervals are for the hazard ratios

The baseline hazard function corresponds to a subject with exog == 0, and having a cumulative hazard function equal to y = x.

In [4]:
bch = rslt.baseline_cumulative_hazard
bch = bch[0] # Only one stratum here
time, cumhaz, surv = tuple(bch)
plt.clf()
plt.plot(time, cumhaz, '-o', alpha=0.6)
plt.grid(True)
plt.xlabel("Time")
plt.ylabel("Cumulative hazard")
Out[4]:
<matplotlib.text.Text at 0x4fbec10>

Now we can use the predict function to recover the cumulative hazard functions for the two subgroups of the data corresponding to exog == 0 and exog == 1.

In [5]:
chaz = rslt.predict(pred_type='cumhaz')
chaz = chaz.predicted_values
plt.clf()
plt.plot(endog[0:500], chaz[0:500], 'o', label='x=0')
plt.plot(endog[500:], chaz[500:], 'o', label='x=1')
leg = plt.legend(numpoints=1, handletextpad=0.0001, loc='lower right')
leg.draw_frame(False)
plt.grid(True)
plt.xlabel("Time")
plt.ylabel("Cumulative hazard")
Out[5]:
<matplotlib.text.Text at 0x4e3b350>

Here are plots of the estimated survival functions for these two groups.

In [6]:
chaz = rslt.predict(pred_type='surv')
chaz = chaz.predicted_values
plt.clf()
plt.plot(endog[0:500], chaz[0:500], 'o', label='x=0')
plt.plot(endog[500:], chaz[500:], 'o', label='x=1')
leg = plt.legend(numpoints=1, handletextpad=0.0001, loc='upper right')
leg.draw_frame(False)
plt.grid(True)
plt.xlabel("Time")
plt.ylabel("Survival probability")
Out[6]:
<matplotlib.text.Text at 0x4e3fa90>

Next we use a Weibull distribution to obtain a non-constant baseline hazard function.

In [7]:
shape = 2
exog = np.zeros(1000)
exog[500:] = 1
hr = np.exp(np.log(2)*exog)
endog = hr**(-1/float(shape)) * np.random.weibull(shape, size=1000)

Here is the fitted model.

In [8]:
mod = PHReg(endog, exog)
rslt = mod.fit()
print rslt.summary()
                    Results: PHReg
=======================================================
Model:                  PH Reg     Sample size:    1000
Dependent variable:     y          Num. events:    1000
Ties:                   Breslow                        
-------------------------------------------------------
   log HR log HR SE   HR      t    P>|t|  [0.025 0.975]
-------------------------------------------------------
x1 0.6825    0.0668 1.9788 10.2131 0.0000 1.7359 2.2557
=======================================================
Confidence intervals are for the hazard ratios

For a baseline observation, the cumulative hazard function has the form time^shape, so we plot that curve together with the estimated cumulative hazard function.

In [9]:
bch = rslt.baseline_cumulative_hazard
bch = bch[0] # Only one stratum here
time, cumhaz, surv = tuple(bch)
plt.clf()
plt.plot(time, cumhaz, '-', label='Estimate')
plt.plot(time, time**shape, '-', label='Truth')
plt.grid(True)
plt.legend(loc='upper left')
plt.xlabel("Time")
plt.ylabel("Cumulative hazard")
Out[9]:
<matplotlib.text.Text at 0x39b6ed0>

Predicted distributions

We can obtain an estimate of the fitted distribution of survival times (with no censoring) for each subject. These are discrete distributions, supported on the distinct failure times in a stratum.

To demonstrate this feature, we will use simulated data from an exponential distribution whose mean value depends on a single covariate, and for simplicity we will take all observations to be uncensored.

In [10]:
exog = np.linspace(-1, 1, 600)
endog = -np.exp(exog) * np.log(np.random.uniform(size=600))
mod = PHReg(endog, exog)
rslt = mod.fit()
dist = rslt.get_distribution()

First we will plot the true and fitted mean survival times.

In [11]:
mns_true = np.exp(exog)
plt.plot(mns_true, dist.mean(), 'o', alpha=0.5)
plt.xlabel("True mean survival time")
plt.ylabel("Estimated mean survival time")
plt.grid(True)

Next we will generate a sample from the fitted survival time distribution for each subject, and average these draws to estimate the mean survival time.

In [12]:
sample_mn = 0.
for k in range(1000):
    sample_mn += dist.rvs()
sample_mn /= 1000

plt.xlabel("True means")
plt.ylabel("Sample means")
plt.plot(mns_true, sample_mn, 'o', alpha=0.5)
Out[12]:
[<matplotlib.lines.Line2D at 0x5796450>]

Next we will plot the true standard deviations against the sample standard deviations of simulated data.

In [13]:
sample_va = 0.
for k in range(1000):
    sample_va += (dist.rvs() - mns_true)**2
sample_va /= 1000

plt.plot(mns_true, np.sqrt(sample_va), 'o', alpha=0.5)
plt.xlabel("True standard deviation")
plt.ylabel("Estimated standard deviation")
Out[13]:
<matplotlib.text.Text at 0x557da10>

Here we do the same thing with the third central moment.

In [14]:
sample_m3 = 0.
for k in range(1000):
    sample_m3 += (dist.rvs() - mns_true)**3
sample_m3 /= 1000

plt.plot(sample_m3, 2*mns_true**3, 'o', alpha=0.5)
plt.xlabel("Sample 3rd moment")
plt.ylabel("True 3rd moment")
Out[14]:
<matplotlib.text.Text at 0x557e990>

We can also compare the observed and exact cumulative distribution functions.

In [15]:
tr = 3
sample_p = 0.
for k in range(1000):
    sample_p += (dist.rvs() <= tr)
sample_p /= 1000

plt.plot(sample_p, 1 - np.exp(-tr/mns_true), 'o', alpha=0.5)
plt.xlabel("Sample proportion")
plt.ylabel("True proportion")
plt.grid(True)