## MICE for duration data¶

This notebook shows how to use MICE to fit surival regression models with missing data in Statsmodels.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.sandbox.mice import mice
from statsmodels.duration.hazard_regression import PHReg


The following function simulates a data set for survival analysis. There are two covariates in addition to the survival time and status. All variables have missing values. The missing value patterns for the survival time and status variables are identical.

In [2]:
def gendat(n):

# Covariates
x1 = np.random.normal(size=n)
x2 = np.random.normal(size=n)

# The event time
event_time = np.random.exponential(size=n) * np.exp(x1)

# The duration of observation
obs_time = np.random.exponential(size=n)

# The event or censoring time
time = np.where(event_time < obs_time, event_time, obs_time)

# The censoring status
status = np.where(time == event_time, 1, 0)

df = pd.DataFrame({"time": time, "status": status, "x1": x1, "x2": x2})

# Create missing values
df.time.iloc[0:100] = np.nan
df.status.iloc[0:100] = np.nan
df.x1.iloc[80:150] = np.nan
df.x2.iloc[140:200] = np.nan

return df


Next we create a data set and wrap it in a MICEData object. We will have the survival status variable "follow" the time variable. When using PMM to do imputation, each imputed time value is actually an observed time value from the data set. By having status follow time, the status value and time value will always be obtained from the same observation. Note that this requires that the status and time values have the same missingness pattern, which would normally be the case in practice.

Also note that we must indicate that the string "status" is a variable name and the corresponding data must be obtained from the data frame.

Finally, the intercept is not identified in a proportional hazards regression model, so we must explicitly exclude the intercept by including "0 + " in the formula.

In [3]:
df = gendat(500)
idata = mice.MICEData(df)
idata.set_imputer("time", "0 + x1 + x2", model_class=PHReg, init_kwds={"status": mice.PatsyFormula("status")},
predict_kwds={"pred_type": "hr"})
idata.set_imputer("status", model_class=sm.GLM, init_kwds={"family": sm.families.Binomial()})

/projects/sage/sage-6.9/local/lib/python2.7/site-packages/pandas/core/indexing.py:115: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)


Next we run a few rounds of imputation.

In [4]:
for j in range(5):
x = idata.next_sample()


Here is a histrogram of the imputed survival times in the final iteration:

In [5]:
_ = idata.plot_imputed_hist('time')


Here is a scatterplot showing the fitted and imputed or observed survival times at the final MICE iteration. Note that since the linear predictor corresponds to the hazard rate, there is an inverse relationship between the "time fitted" axis (which is actually a fitted hazard ratio) and the observed or imputed time axis.

In [6]:
_ = idata.plot_fit_obs('time')


Here is a scatterplot showing the relationship between the survival times and the values of the first predictor variable.

In [7]:
_ = idata.plot_bivariate('time', 'x1', jitter=(0, 0.1))