This notebook demonstrates the statsmodels MICE implementation.

The CHAIN data set, analyzed below, has also been used to illustrate the R mi package. Section 4 of this paper describes an analysis of the data set conducted in R.

In [1]:

```
import sys
sys.path.insert(0, "/projects/57433cc7-78ab-4105-a525-ba087aa3e2fc/statsmodels-mice2")
%matplotlib inline
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.sandbox.mice import mice
import matplotlib.pyplot as plt
```

First we load the data and do a bit of cleanup.

In [2]:

```
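data = pd.read_csv("chain.csv")
# Drop the index column that was written out when the CSV was saved
del data["Unnamed: 0"]
# Strip the ".W1" suffix from the variable names
data.columns = [x.replace(".W1", "") for x in data.columns]
print(data.head())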
```

In [3]:

```
imp = mice.MICEData(data)
_ = imp.plot_missing_pattern()
```

In [4]:

```
_ = imp.plot_missing_pattern(hide_complete_rows=True, hide_complete_columns=True)
```
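As a numeric complement to the missing-pattern plots, the sketch below tabulates missingness with plain pandas. It uses a small made-up frame (`toy`, with invented values) rather than the CHAIN data, so it runs without `chain.csv`; the column names are borrowed from this notebook only for readability, and nothing here is part of the statsmodels API.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "age": [30.0, np.nan, 45.0, 52.0],
    "pcs": [50.1, 48.3, np.nan, np.nan],
    "mcs37": [1.0, 0.0, 1.0, 0.0],
})

# True where a value is missing
mask = toy.isnull()

# Number of missing values in each column
missing_counts = mask.sum()

# Encode each row's missingness pattern as a string such as "010"
# (1 = missing, 0 = observed, in column order), then count how often each occurs.
pattern_codes = mask.apply(lambda row: "".join("1" if v else "0" for v in row), axis=1)
patterns = pattern_codes.value_counts()

print(missing_counts)
print(patterns)
```

Each pattern string corresponds to one distinct row type in the missing-pattern plot, and its count is the number of rows sharing that pattern.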

In [5]:

```
mi = mice.MICE("h39b ~ age + c28 + pcs + mcs37 + b05 + haartadhere", sm.OLS, imp)
result = mi.fit(20, 5)
print(result.summary())
```

In [6]:

```
plt.clf()
for col in data.columns:
    plt.figure()
    ax = plt.axes()
    _ = imp.plot_imputed_hist(col, ax=ax)
```

The `plot_bivariate` method colors the points according to whether they are missing or observed on each variable in the scatterplot. We hope to see the same trends and degree of scatter among the observed and imputed points.

In [7]:

```
plt.clf()
jitter = {"age": None, "c28": None, "pcs": None, "mcs37": 0.1,
          "b05": 0.1, "haartadhere": 0.1}
for col in data.columns:
    if col == "h39b":
        continue
    _ = imp.plot_bivariate("h39b", col, jitter=jitter[col])
```
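To make the grouping behind such a plot concrete, here is a minimal sketch of how each row could be classified by its missingness on a pair of variables. The helper `missing_status` and the frame `toy` are hypothetical names introduced for illustration; this is not the statsmodels implementation, and the actual categories and colors used by `plot_bivariate` may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in both columns.
toy = pd.DataFrame({
    "h39b": [1.0, np.nan, 3.0, np.nan],
    "age": [30.0, 41.0, np.nan, np.nan],
})

def missing_status(df, col1, col2):
    """Label each row by which of the two variables was missing before imputation."""
    m1 = df[col1].isnull()
    m2 = df[col2].isnull()
    status = pd.Series("both observed", index=df.index)
    status[m1 & ~m2] = col1 + " missing"
    status[~m1 & m2] = col2 + " missing"
    status[m1 & m2] = "both missing"
    return status

status = missing_status(toy, "h39b", "age")
print(status.value_counts())
```

A scatterplot of the imputed data could then use one color per label, which makes it easy to compare the trend and scatter of the observed points against the imputed ones.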

In [8]:

```
plt.clf()
_ = imp.plot_fit_obs("h39b")
```