This notebook demonstrates several methods for survival analysis that are available in Statsmodels, using the NHANES III study data.

In [1]:

```
import statsmodels
import numpy as np
import pandas as pd
import statsmodels.api as sm
import os
import matplotlib.pyplot as plt
%matplotlib inline
```

NHANES is a cross sectional study (subjects are assessed at one time point and are never re-assessed by NHANES). The NHANES study team periodically uses social security and other sources of death records to determine which study subjects are still alive at a given follow up point. Here we will use the mortality data from 2011. People who are died prior to 2011 should have a date of death in these mortality files. People who are alive in 2011 will be considered to be censored in the survival analyses.

The mortality data are available at the link below as a fixed width format (FWF) file. The file to retrieve is called "NHANES_III_MORT_2011_PUBLIC.dat".

http://www.cdc.gov/nchs/data-linkage/mortality-public.htm

We downloaded the mortality data and gzipped it. To read a FWF file we need a specification defining which values appear in each column of the data file. This is contained in the SAS script provided with the data:

ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/datalinkage/linked_mortality/SAS-Read-in-Program-All-Surveys.sas

Copy lines 119 to 141 from the SAS program and place it into a text file called "mortality_fields.txt". The following utility function "get_colspecs" will read this file and use it to produce a list of start/end points for the pandas `read_fwf`

function to use when reading the mortality data file.

In [2]:

```
def get_colspecs(fname):
fields = []
fid = open(fname)
columns = []
colspecs = []
for line in fid:
v = line.rstrip().split()
columns.append(v[0])
u = v[1].split("-")
if len(u) == 1:
u = [int(u[0]), int(u[0])]
else:
u = [int(u[0]), int(u[1])]
u[0] -= 1
colspecs.append(u)
return columns, colspecs
```

Now we are ready to read the mortality information:

In [3]:

```
columns, colspecs = get_colspecs("mortality_fields.txt")
fname = "NHANES_III_MORT_2011_PUBLIC.dat.gz"
mort = pd.read_fwf(fname, colspecs=colspecs, header=None, names=columns,
compression="gzip")
```

The data dictionary for the mortality data is here:

http://www.cdc.gov/nchs/data/datalinkage/public_use_data_dictionary_11_17_2015.pdf

The only variables we will use here are "PERMTH_INT", which is the time duration from the NHANES clinical assessment to the last follow up time, and "MORTSTAT", which indicates whether the follow up interval ends with death or censoring. We will also need the "SEQN" variable to erge the mortality and assessment data. Therefore we wil drop the other columns, and drop the cases with missing data on these key variables.

In [4]:

```
mort = mort[["MORTSTAT", "PERMTH_INT", "SEQN"]].dropna()
```

We will also need the data from the NHANES 3 assessment. The data and codebooks are available here:

http://www.cdc.gov/nchs/nhanes/nh3data.htm

In the analyses below we will use the adult data. As with the mortality data considered above, this is a fixed-width format file. The column specifications are given in the SAS file that can be downloaded from the above link.

In [5]:

```
columns, colspecs = get_colspecs("adult_fields.txt")
fname = "adult.dat.gz"
data = pd.read_fwf(fname, colspecs=colspecs, header=None, names=columns,
compression="gzip")
```

Now we merge the mortality and visit data.

In [6]:

```
df = pd.merge(data, mort, left_on="SEQN", right_on="SEQN")
print(mort.shape)
print(data.shape)
print(df.shape)
```

Next we drop the columns that we will not use below, and drop cases with missing data.

In [7]:

```
df = df[["PERMTH_INT", "MORTSTAT", "HSAGEIR", "HSSEX", "HAZA8AK1", "SEQN"]].dropna()
```

Just for orientation, a histogram of age values is shown below.

In [8]:

```
plt.hist(df.HSAGEIR, color='grey')
plt.xlabel("Age (years)", size=16)
plt.ylabel("Frequency", size=16)
```

Out[8]:

We can now create a survival function object. To do this we need to provide the duration values ("PERMTH_INT") and the mortality status variable ("MORTSTAT"). The "summary" method of the survival function returns a dataframe containing the Kaplan-Meier estimates of the survival function at its steps. For any survival analysis it is important to define time zero in a meaningful way. There are various ways to do this here, but one approach suitable here is to select people in a narrow age range, say 50-55. This means we are looking at the survival function for one "birth cohort" -- people who were roughly age 52.5 when the NHANES III assessmsents took place.

In [9]:

```
ii = (df.HSAGEIR >= 50) & (df.HSAGEIR < 55)
df = df.loc[ii]
```

In [10]:

```
sf1 = sm.duration.SurvfuncRight(df["PERMTH_INT"], df["MORTSTAT"], "")
sf1.summary().head()
```

Out[10]:

In [11]:

```
sf1.plot()
plt.grid(True)
plt.ylim(0.6, 1)
plt.xlabel("Time (months)", size=16)
plt.ylabel("Proportion alive", size=16)
```

Out[11]:

In [12]:

```
plt.grid(True)
plt.fill_between(sf1.surv_times, sf1.surv_prob - 2*sf1.surv_prob_se, sf1.surv_prob + 2*sf1.surv_prob_se, color="lightgrey")
plt.plot(sf1.surv_times, sf1.surv_prob, '-', color='black')
plt.ylim(0.7, 1)
plt.xlabel("Time (months)", size=16)
plt.ylabel("Proportion alive", size=16)
```

Out[12]:

In [13]:

```
lcb, ucb = sf1.simultaneous_cb(transform='arcsin')
plt.grid(True)
plt.plot(sf1.surv_times, sf1.surv_prob, '-', color='black')
plt.fill_between(sf1.surv_times, lcb, ucb, color='lightgrey')
```

Out[13]:

The inverse of the survival function is the survival quantile function. It tells us the survival time corresponding to each probability point (e.g. what is the 10th percentile of survival times). These estimates may be undefined for probabiity points that are "not reached". For example, in the overall NHANES sample the median survival time is not reached.

The first plot below shows the survival quantles up to the 20th percentile, which is the highest probability point that is reached in these data.

In [14]:

```
qp = np.linspace(0.01, 0.19, 20)
sq = [sf1.quantile(x) for x in qp]
plt.clf()
plt.grid(True)
plt.plot(qp, sq, '-')
plt.xlabel("Probability", size=16)
plt.ylabel("Survival quantile", size=16)
```

Out[14]:

We can put pointwise confidence bands around the survival quantile curves.

In [15]:

```
ci = [sf1.quantile_ci(x) for x in qp]
ci = np.asarray(ci)
plt.clf()
plt.grid(True)
plt.fill_between(qp, ci[:, 0], ci[:, 1], color='lightgrey')
plt.plot(qp, sq, '-', color='black')
plt.xlabel("Probability", size=16)
plt.ylabel("Survival quantile", size=16)
```

Out[15]:

To illustrate the methods for comparing groups, we will compare survival curves for women and men. The gender information is in the "HSSEX" variable. HSSEX=1 indicates males and HSSEX=2 indicates females.

In [16]:

```
ii = (df.HSSEX==1)
sfm = sm.duration.SurvfuncRight(df.loc[ii, "PERMTH_INT"], df.loc[ii, "MORTSTAT"], "Male")
ii = (df.HSSEX==2)
sff = sm.duration.SurvfuncRight(df.loc[ii, "PERMTH_INT"], df.loc[ii, "MORTSTAT"], "Female")
```

In [17]:

```
ax = plt.gca()
plt.grid(True)
sfm.plot(ax)
sff.plot(ax)
plt.xlabel("Time (months)", size=16)
plt.ylabel("Proportion alive", size=16)
plt.ylim(0.5, 1)
ha, lb = ax.get_legend_handles_labels()
plt.legend(ha, lb, loc="lower left")
```

Out[17]:

In [18]:

```
chisq, pval = sm.duration.survdiff(df.PERMTH_INT, df.MORTSTAT, df.HSSEX)
print(chisq, pval)
```

To move beyond group-wise comparisons, we can use regression analysis. There are several methods of regression analysis that are suitable for data with censoring. The most commonly used in practice is the proportional hazards regression model, or "Cox model". Below we run a very simple proportional hazards regression model using gender as a predictor of mortality.

In [19]:

```
model1 = sm.PHReg(df.PERMTH_INT, df.HSSEX, status=df.MORTSTAT)
result1 = model1.fit()
result1.summary()
```

Out[19]:

In [20]:

```
ii = (df.HAZA8AK1 != 888)
df = df.loc[ii]
plt.hist(df.HAZA8AK1.values)
plt.xlabel("SBP", size=16)
plt.ylabel("Frequency", size=16)
```

Out[20]:

In [21]:

```
model2 = sm.PHReg.from_formula("PERMTH_INT ~ 0 + HAZA8AK1*HSSEX", status="MORTSTAT", data=df)
result2 = model2.fit()
result2.summary()
```

Out[21]:

Since the blood pressure effect may not be linear, we can try modeling it using splines.

In [22]:

```
model3 = sm.PHReg.from_formula("PERMTH_INT ~ 0 + bs(HAZA8AK1, 4) + HSSEX", status="MORTSTAT", data=df)
result3 = model3.fit()
result3.summary()
```

Out[22]:

`predict_functional`

function to graph the relationship between blood pressure and the mortality hazard. The association is U-shaped -- people with very low or very high blood pressure are both at higher risk of death. The risk of death for people with low blood pressure could be due to confounding with other factors that induce both low blood pressure and death.

In [23]:

```
from statsmodels.sandbox.predict_functional import predict_functional
plt.clf()
plt.grid(True)
sexl = {1: "Male", 2: "Female"}
for sex in 1,2:
pr, cb, fv = predict_functional(result3, "HAZA8AK1", values={"HSSEX": sex})
plt.plot(fv, pr.predicted_values, label=sexl[sex])
ha, lb = plt.gca().get_legend_handles_labels()
plt.legend(ha, lb, loc="upper left")
plt.xlabel("SBP", size=16)
plt.ylabel("Log hazard ratio", size=16)
```

Out[23]: