*Key ideas:* Gaussian GEE, Poisson GEE, score testing, offsets

In [1]:

```
import numpy as np
import pandas as pd
import statsmodels.api as sm
```

In [2]:

```
data = pd.read_csv("cbpp.csv")

# Naive incidence rate: the proportion of each herd affected in each period
data["rate"] = data["incidence"] / data["size"].astype(np.float64)
```

A basic approach is to model the naive incidence rates as Gaussian, with an exchangeable within-herd dependence structure (i.e. the covariance matrix of the 4 observations for one herd has constant off-diagonal elements). We also allow the mean to vary with `period` (the time variable). We treat `period` as a categorical variable, so each of the four time periods has its own mean value.
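To make the dependence structure concrete, here is a small sketch (added for illustration, not part of the original notebook) of an exchangeable correlation matrix for the four observations in one herd; the value 0.3 is arbitrary, not an estimate from the data.

```python
import numpy as np

def exchangeable_corr(n, rho):
    # Compound symmetry: 1 on the diagonal, a single
    # common correlation rho in every off-diagonal cell.
    return (1 - rho) * np.eye(n) + rho * np.ones((n, n))

print(exchangeable_corr(4, 0.3))
```

Under this structure, GEE estimates the single correlation parameter from the data rather than a separate correlation for each pair of periods.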

In [3]:

```
ga = sm.families.Gaussian()
ex = sm.cov_struct.Exchangeable()
model1 = sm.GEE.from_formula("rate ~ 0 + C(period)", groups="herd",
                             data=data, family=ga, cov_struct=ex)
result1 = model1.fit()
print(result1.summary())
```

Next we fit the same model with an intercept, so the effects of periods 2-4 are expressed as differences relative to period 1.

In [4]:

```
model2 = sm.GEE.from_formula("rate ~ C(period)", groups="herd",
                             data=data, family=ga, cov_struct=ex)
result2 = model2.fit()
print(result2.summary())
```

The estimated within-herd correlation, shown below, is quite small, so there is very little evidence of within-herd dependence.

In [5]:

```
print(ex.summary())
```

In [6]:

```
fig = result2.plot_isotropic_dependence(min_n=1)
```

We can test the hypothesis that time periods 2-4 have a common mean value using a score test. This involves fitting the model under a linear constraint in which the mean parameters for periods 2-4 are held equal to each other. Currently, this cannot be done using the `GEE.from_formula` method, so we specify `endog`, `exog`, and the constraints explicitly. First we construct the data as numpy arrays.

In [7]:

```
endog = data["rate"]
period = data["period"]
groups = data["herd"]
n = len(endog)

# Indicator coding: an intercept plus indicators for periods 2-4
exog = [np.ones(n), 1*(period == 2), 1*(period == 3), 1*(period == 4)]
exog = np.array(exog).T
```

In [8]:

```
# The constraint L @ beta = R forces the period 2-4 coefficients to be equal
L = np.zeros((2, 4))
L[0, :] = [0, 1, -1, 0]
L[1, :] = [0, 0, 1, -1]
R = np.zeros(2)
```
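As a quick sanity check (an illustration added here, not part of the original analysis), any coefficient vector whose period 2-4 effects are equal satisfies `L @ beta = R`, while one with unequal effects does not. The coefficient values below are arbitrary.

```python
import numpy as np

L = np.zeros((2, 4))
L[0, :] = [0, 1, -1, 0]
L[1, :] = [0, 0, 1, -1]
R = np.zeros(2)

beta_null = np.array([0.1, 0.05, 0.05, 0.05])  # periods 2-4 equal
beta_alt = np.array([0.1, 0.05, 0.02, 0.08])   # periods 2-4 unequal

print(np.allclose(L @ beta_null, R))  # True: constraint satisfied
print(np.allclose(L @ beta_alt, R))   # False: constraint violated
```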

Now we are ready to fit the model:

In [9]:

```
model3 = sm.GEE(endog, exog, groups=groups, family=ga, cov_struct=ex,
                constraint=(L, R))
result3 = model3.fit()
print(result3.summary())
```

In [10]:

```
print(model3.score_test_results)
```

Alternatively, we can use Patsy to process the formulas and constraints for us.

In [11]:

```
import patsy
dmat = patsy.dmatrix("C(period)", data=data)
di = dmat.design_info
lc = di.linear_constraint("C(period)[T.2] = C(period)[T.3] = C(period)[T.4]")
constraint = (np.asarray(lc.coefs), np.asarray(lc.constants))
model4 = sm.GEE(np.asarray(data["rate"]), np.asarray(dmat),
                groups=np.asarray(groups), family=ga, cov_struct=ex,
                constraint=constraint)
result4 = model4.fit()
print(result4.summary())
print(model4.score_test_results)
```

There isn't a lot of data for assessing the link function, but we can still examine the plot of residuals against fitted values, shown below. There is some evidence of an increasing mean/variance relationship.

In [12]:

```
import matplotlib.pyplot as plt

plt.plot(result1.fittedvalues, result1.resid, 'o', alpha=0.6)
plt.xlabel("Fitted values", size=17)
plt.ylabel("Residuals", size=17)
plt.xlim(0, 0.25)
```

Out[12]:

Since the outcomes are counts, a more direct approach is a Poisson GEE for `incidence`, with the log herd size as an offset, so that the linear predictor models the log incidence rate.

In [13]:

```
po = sm.families.Poisson()
ex = sm.cov_struct.Exchangeable()
size = np.log(np.asarray(data["size"]))
model5 = sm.GEE.from_formula("incidence ~ C(period)", groups=groups,
                             data=data, offset=size, family=po, cov_struct=ex)
result5 = model5.fit(cov_type='naive')
print(result5.summary())
print(ex.summary())
```
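The role of the offset can be seen directly: with a log link, including `log(size)` in the linear predictor with a fixed coefficient of one means the remaining coefficients describe the log rate. A small numeric sketch (the numbers below are arbitrary, not fitted values from the model):

```python
import numpy as np

herd_size = 25.0  # illustrative herd size, not a value from the data
xb = -2.0         # illustrative linear predictor x @ beta

# With a log link: log E[incidence] = log(herd_size) + x @ beta ...
expected_count = np.exp(np.log(herd_size) + xb)

# ... which equals herd_size * exp(x @ beta), i.e. herd size times the rate
same = herd_size * np.exp(xb)
print(expected_count, same)
```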

Below we fit a negative binomial model to the counts.
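The negative binomial family allows the variance to grow faster than the mean, which matches the mean/variance pattern seen in the residual plot. Under the parameterization used by `sm.families.NegativeBinomial`, the variance function is Var(Y) = mu + alpha * mu^2; a one-line sketch (the function name is ours, for illustration):

```python
def nb_variance(mu, alpha=1.0):
    # Negative binomial variance function: the Poisson
    # variance (mu) plus an overdispersion term alpha * mu**2.
    return mu + alpha * mu ** 2

print(nb_variance(2.0))       # 6.0 with alpha = 1
print(nb_variance(2.0, 0.0))  # 2.0: reduces to Poisson when alpha = 0
```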

In [14]:

```
nb = sm.families.NegativeBinomial(alpha=1.)
ex = sm.cov_struct.Exchangeable()
size = np.log(np.asarray(data["size"]))
model6 = sm.GEE.from_formula("incidence ~ C(period)", groups=groups,
data=data, offset=size, family=nb, cov_struct=ex)
result6 = model6.fit()
print result6.summary()
print ex.summary()
```