This notebook demonstrates the statsmodels MICE implementation.

The CHAIN data set, analyzed below, has also been used to illustrate the R `mi` package; section 4 of the paper describing that package presents an analysis of this data set conducted in R.

In [1]:

```
import sys

# Point Python at a development checkout of statsmodels containing the MICE code.
sys.path.insert(0, "/projects/57433cc7-78ab-4105-a525-ba087aa3e2fc/statsmodels-mice2")
%matplotlib inline
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.sandbox.mice import mice
import matplotlib.pyplot as plt
```

First we load the data and do a bit of cleanup.

In [2]:

```
data = pd.read_csv("chain.csv")
del data["Unnamed: 0"]  # drop the unnamed index column written out with the CSV
data.columns = [x.replace(".W1", "") for x in data.columns]  # strip the ".W1" suffix
print(data.head())
```

We can make some simple graphs to visualize the missing data patterns. The colored cells correspond to missing values.

In [3]:

```
imp = mice.MICEData(data)
_ = imp.plot_missing_pattern()
```
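The same missingness information can also be tabulated numerically with plain pandas. A minimal sketch on a small synthetic frame (invented values, not the CHAIN data):

```python
import numpy as np
import pandas as pd

# Small synthetic frame with missing values (illustration only, not the CHAIN data).
df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": [np.nan, np.nan, 1.0, 2.0],
    "z": [1.0, 2.0, 3.0, 4.0],
})

# Number of missing values in each column.
n_missing = df.isnull().sum()
print(n_missing)

# Distinct missing-data patterns: each row of booleans marks which
# columns are missing, and value_counts tallies how often each pattern occurs.
patterns = df.isnull().value_counts()
print(patterns)
```

This gives a quick numeric complement to the plot: which variables are missing, and which combinations of missingness occur together.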

We can exclude the complete cases and completely observed variables in order to make it easier to see the different missing data patterns.

In [4]:

```
_ = imp.plot_missing_pattern(hide_complete_rows=True, hide_complete_columns=True)
```

Here is a simple example of how to conduct an analysis using MICE. The formula specifies the "analysis model", which is the model we primarily want to fit and interpret. Additional "imputation models" are used to impute missing values of each variable that has missing values. In this example we use the defaults, so all the imputation models use ordinary least squares, with each variable imputed from a model in which every other variable has a main effect.

In [5]:

```
mi = mice.MICE("h39b ~ age + c28 + pcs + mcs37 + b05 + haartadhere", sm.OLS, imp)
result = mi.fit(20, 5)  # 20 burn-in cycles, then 5 imputed data sets
print(result.summary())
```
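The pooled results above are produced by combining the per-imputation fits with Rubin's rules: the pooled coefficient is the mean of the estimates across imputed data sets, and its variance is the average within-imputation variance plus an inflated between-imputation variance. A minimal numpy sketch with made-up numbers (not taken from the fit above):

```python
import numpy as np

# Hypothetical per-imputation estimates and variances for one coefficient
# (invented for illustration; m = 5 imputed data sets).
est = np.array([0.52, 0.48, 0.55, 0.50, 0.45])
var = np.array([0.010, 0.012, 0.011, 0.009, 0.010])

m = len(est)
qbar = est.mean()                 # pooled point estimate
w = var.mean()                    # within-imputation variance
b = est.var(ddof=1)               # between-imputation variance
t = w + (1 + 1.0 / m) * b         # total variance (Rubin's rules)
se = np.sqrt(t)                   # pooled standard error
print(qbar, se)
```

The between-imputation term is what distinguishes multiple imputation from single imputation: it propagates the uncertainty about the missing values into the standard errors.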

We can assess how well the imputed values match the observed values by looking at histograms of the marginal distributions. These histograms show the imputed values from the final imputed data set together with the observed values.

In [6]:

```
plt.clf()
for col in data.columns:
    plt.figure()
    ax = plt.axes()
    _ = imp.plot_imputed_hist(col, ax=ax)
```

We can also look at the marginal relationship between each variable and the outcome variable. The `plot_bivariate` method colors the points according to whether they are missing or observed on each variable in the scatterplot. We hope to see the same trends and degree of scatter among the observed and imputed points.

In [7]:

```
plt.clf()
jitter = {"age": None, "c28": None, "pcs": None, "mcs37": 0.1,
          "b05": 0.1, "haartadhere": 0.1}
for col in data.columns:
    if col == "h39b":
        continue
    _ = imp.plot_bivariate("h39b", col, jitter=jitter[col])
```
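The same kind of diagnostic can be built with plain matplotlib: plot the observed points in one color and points involving an imputed value in another, then compare the trends by eye. A minimal self-contained sketch on synthetic data (the values and the imputation mask here are invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=0.5, size=100)

# Pretend roughly 20% of the y values were imputed.
imputed = rng.rand(100) < 0.2

fig, ax = plt.subplots()
ax.scatter(x[~imputed], y[~imputed], color="grey", label="observed")
ax.scatter(x[imputed], y[imputed], color="red", label="imputed")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

If the imputation model is reasonable, the red and grey point clouds should show a similar slope and a similar amount of scatter.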

Another useful diagnostic is to plot the fitted values from the imputation model against the observed and imputed values, with the imputed values taken from the final imputed data set. We hope to see similar trends and similar degrees of scatter in the observed and imputed values. This plot can be made for any variable in the data set; here we show the result for h39b, the variable with the greatest number of missing values.

In [8]:

```
plt.clf()
_ = imp.plot_fit_obs("h39b")
```