MICE for duration data

This notebook shows how to use MICE to fit surival regression models with missing data in Statsmodels.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.sandbox.mice import mice
from statsmodels.duration.hazard_regression import PHReg

The following function simulates a data set for survival analysis. There are two covariates in addition to the survival time and status. All variables have missing values. The missing value patterns for the survival time and status variables are identical.

In [2]:
def gendat(n):
    # Covariates
    x1 = np.random.normal(size=n)
    x2 = np.random.normal(size=n)
    # The event time
    event_time = np.random.exponential(size=n) * np.exp(x1)
    # The duration of observation
    obs_time = np.random.exponential(size=n)
    # The event or censoring time
    time = np.where(event_time < obs_time, event_time, obs_time)
    # The censoring status
    status = np.where(time == event_time, 1, 0)
    df = pd.DataFrame({"time": time, "status": status, "x1": x1, "x2": x2})
    # Create missing values
    df.time.iloc[0:100] = np.nan
    df.status.iloc[0:100] = np.nan
    df.x1.iloc[80:150] = np.nan
    df.x2.iloc[140:200] = np.nan
    return df

Next we create a data set and wrap it in a MICEData object. We will have the survival status variable "follow" the time variable. When using PMM to do imputation, each imputed time value is actually an observed time value from the data set. By having status follow time, the status value and time value will always be obtained from the same observation. Note that this requires that the status and time values have the same missingness pattern, which would normally be the case in practice.

Also note that we must indicate that the string "status" is a variable name and the corresponding data must be obtained from the data frame.

Finally, the intercept is not identified in a proportional hazards regression model, so we must explicitly exclude the intercept by including "0 + " in the formula.

In [3]:
df = gendat(500)
idata = mice.MICEData(df)
idata.set_imputer("time", "0 + x1 + x2", model_class=PHReg, init_kwds={"status": mice.PatsyFormula("status")}, 
                  predict_kwds={"pred_type": "hr"})
idata.set_imputer("status", model_class=sm.GLM, init_kwds={"family": sm.families.Binomial()})
/projects/sage/sage-6.9/local/lib/python2.7/site-packages/pandas/core/indexing.py:115: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)

Next we run a few rounds of imputation.

In [4]:
for j in range(5):
    x = idata.next_sample()

Here is a histrogram of the imputed survival times in the final iteration:

In [5]:
_ = idata.plot_imputed_hist('time')

Here is a scatterplot showing the fitted and imputed or observed survival times at the final MICE iteration. Note that since the linear predictor corresponds to the hazard rate, there is an inverse relationship between the "time fitted" axis (which is actually a fitted hazard ratio) and the observed or imputed time axis.

In [6]:
_ = idata.plot_fit_obs('time')

Here is a scatterplot showing the relationship between the survival times and the values of the first predictor variable.

In [7]:
_ = idata.plot_bivariate('time', 'x1', jitter=(0, 0.1))