This notebook shows how to use MICE to fit surival regression models with missing data in Statsmodels.

In [1]:

```
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.sandbox.mice import mice
from statsmodels.duration.hazard_regression import PHReg
```

In [2]:

```
def gendat(n):
# Covariates
x1 = np.random.normal(size=n)
x2 = np.random.normal(size=n)
# The event time
event_time = np.random.exponential(size=n) * np.exp(x1)
# The duration of observation
obs_time = np.random.exponential(size=n)
# The event or censoring time
time = np.where(event_time < obs_time, event_time, obs_time)
# The censoring status
status = np.where(time == event_time, 1, 0)
df = pd.DataFrame({"time": time, "status": status, "x1": x1, "x2": x2})
# Create missing values
df.time.iloc[0:100] = np.nan
df.status.iloc[0:100] = np.nan
df.x1.iloc[80:150] = np.nan
df.x2.iloc[140:200] = np.nan
return df
```

Next we create a data set and wrap it in a MICEData object. We will have the survival status variable "follow" the time variable. When using PMM to do imputation, each imputed time value is actually an observed time value from the data set. By having status follow time, the status value and time value will always be obtained from the same observation. Note that this requires that the status and time values have the same missingness pattern, which would normally be the case in practice.

Also note that we must indicate that the string "status" is a variable name and the corresponding data must be obtained from the data frame.

Finally, the intercept is not identified in a proportional hazards regression model, so we must explicitly exclude the intercept by including "0 + " in the formula.

In [3]:

```
df = gendat(500)
idata = mice.MICEData(df)
idata.set_imputer("time", "0 + x1 + x2", model_class=PHReg, init_kwds={"status": mice.PatsyFormula("status")},
predict_kwds={"pred_type": "hr"})
idata.set_imputer("status", model_class=sm.GLM, init_kwds={"family": sm.families.Binomial()})
```

Next we run a few rounds of imputation.

In [4]:

```
for j in range(5):
x = idata.next_sample()
```

Here is a histrogram of the imputed survival times in the final iteration:

In [5]:

```
_ = idata.plot_imputed_hist('time')
```

In [6]:

```
_ = idata.plot_fit_obs('time')
```

In [7]:

```
_ = idata.plot_bivariate('time', 'x1', jitter=(0, 0.1))
```