*Key ideas:* Gaussian GEE, Poisson GEE, score testing, offsets

The CBPP (contagious bovine pleuropneumonia) dataset (included in the R lme4 package) contains data on the incidence of new cases of CBPP in 15 herds of cattle. Each herd is assessed for CBPP during four time periods. This is a longitudinal data set, since the same herds are observed in the four time periods. In this notebook we use generalized estimating equations (GEE) to compare the CBPP incidence rates among the time periods, while accounting for the possibility that there is dependence within herds across time.

In [1]:

```
import numpy as np
import pandas as pd
import statsmodels.api as sm
```

We begin by constructing a variable that contains the naive estimate of the incidence rate (the number of incident cases divided by the herd size).

In [2]:

```
data = pd.read_csv("cbpp.csv")
data["rate"] = data["incidence"] / data["size"].astype(np.float64)
```

A basic approach is to model the naive incidence rates as Gaussian, with an exchangeable within-herd dependence structure (i.e. the covariance matrix of the 4 observations for one herd has constant off-diagonal elements). We also allow the mean to vary with `period` (the time variable). We treat `period` as a categorical variable, so each of the four time periods has its own mean value.
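Before fitting, it may help to see what the two codings of `period` used below look like. This is a minimal sketch with made-up period labels and hypothetical coefficient values, contrasting the cell-means coding (`0 + C(period)`, one mean per period) with the default treatment coding (intercept plus deviations from period 1):

```python
import numpy as np

# Hypothetical periods for 8 observations (2 herds x 4 periods).
period = np.array([1, 2, 3, 4, 1, 2, 3, 4])

# Cell-means coding ("rate ~ 0 + C(period)"): one indicator per period,
# so each coefficient is that period's mean incidence rate.
cell_means = np.column_stack([1.0 * (period == p) for p in (1, 2, 3, 4)])

# Treatment coding ("rate ~ C(period)"): the intercept is the period-1
# mean, the remaining coefficients are deviations of periods 2-4 from it.
treatment = np.column_stack([np.ones(len(period))] +
                            [1.0 * (period == p) for p in (2, 3, 4)])

# Both matrices span the same column space: a fixed linear map converts
# treatment-coded coefficients into cell means.
T = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 0, 0, 1]], dtype=float)
beta_treat = np.array([0.2, -0.1, -0.12, -0.15])  # hypothetical values
cell_mean_coefs = T @ beta_treat
assert np.allclose(treatment @ beta_treat, cell_means @ cell_mean_coefs)
```

The two codings give equivalent fits; they only change how the mean parameters are reported.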

In [3]:

```
ga = sm.families.Gaussian()
ex = sm.cov_struct.Exchangeable()
model1 = sm.GEE.from_formula("rate ~ 0 + C(period)", groups="herd",
                             data=data, family=ga, cov_struct=ex)
result1 = model1.fit()
print(result1.summary())
```

We see that after the first period, the incidence rate drops substantially. We don't have a lot of background information about this data set, but for the sake of discussion we can treat this as a hypothesis that can be studied as if it were pre-specified (rather than one that was discovered using the data). If we reparameterize the model in terms of deviations from the first period, we see that there is statistical evidence that periods 2-4 have lower incidence rates than period 1 (due to the Z-scores for periods 2-4 being rather large in magnitude).

In [4]:

```
model2 = sm.GEE.from_formula("rate ~ C(period)", groups="herd",
                             data=data, family=ga, cov_struct=ex)
result2 = model2.fit()
print(result2.summary())
```

We also see that there is very little evidence of within-herd correlation.

In [5]:

```
print(ex.summary())
```
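For intuition, the exchangeable working correlation for the four period observations within one herd has a single off-diagonal parameter. A minimal sketch, using a hypothetical value `r` rather than the fitted estimate:

```python
import numpy as np

# Exchangeable working correlation for one herd's four observations:
# every off-diagonal entry is the same parameter r (hypothetical here).
r = 0.05
R_ex = np.full((4, 4), r)
np.fill_diagonal(R_ex, 1.0)

assert np.allclose(R_ex, R_ex.T)        # symmetric
assert np.allclose(np.diag(R_ex), 1.0)  # unit diagonal
```

A fitted `r` near zero, as seen above, means the working model is close to independence.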

The weak dependence is corroborated by the following dependence plot; however, the sample size of this data set is too small to learn much about the covariance structure.

In [6]:

```
fig = result2.plot_isotropic_dependence(min_n=1)
```

We can test the hypothesis that time periods 2-4 have a common mean value using a score test. This involves fitting the model under a linear constraint in which the mean parameters for periods 2-4 are held equal to each other. Currently, this cannot be done using the `GEE.from_formula` method, so we specify `endog`, `exog`, and the constraints explicitly. First we construct the data as numpy arrays.

In [7]:

```
endog = data["rate"]
period = data["period"]
groups = data["herd"]
n = len(endog)
exog = [np.ones(n), 1*(period == 2), 1*(period == 3), 1*(period == 4)]
exog = np.array(exog).T
```

Next we specify the constraints that define the null hypothesis that we are testing. We are testing params[1] = params[2] = params[3], or equivalently, params[1] - params[2] = 0 and params[2] - params[3] = 0. This can be expressed as a matrix equation of the form L*params = R (where R = 0 in this case).

In [8]:

```
L = np.zeros((2,4))
L[0,:] = [0, 1, -1, 0]
L[1,:] = [0, 0, 1, -1]
R = np.zeros(2)
```
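As a sanity check, we can verify that `L @ params` equals `R` exactly when the period 2-4 parameters are equal. The parameter vectors below are hypothetical, not fitted values:

```python
import numpy as np

# The constraint matrix from the null hypothesis params[1] = params[2] = params[3].
L = np.zeros((2, 4))
L[0, :] = [0, 1, -1, 0]
L[1, :] = [0, 0, 1, -1]
R = np.zeros(2)

# A hypothetical parameter vector satisfying the null (equal entries 1-3)
# and one violating it.
params_null = np.array([0.2, -0.1, -0.1, -0.1])
params_alt = np.array([0.2, -0.1, -0.12, -0.15])

assert np.allclose(L @ params_null, R)      # constraint holds under the null
assert not np.allclose(L @ params_alt, R)   # constraint fails otherwise
```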

Now we are ready to fit the model:

In [9]:

```
model3 = sm.GEE(endog, exog, groups=groups, family=ga, cov_struct=ex,
                constraint=(L, R))
result3 = model3.fit()
print(result3.summary())
```

The score statistic tells us how rapidly the fit would improve if the constraint that the rates at time points 2-4 are equal were relaxed. Based on the score test, there is no statistical evidence that periods 2-4 differ from each other.

In [10]:

```
print(model3.score_test_results)
```

Alternatively, we can use Patsy to process the formulas and constraints for us.

In [11]:

```
import patsy
dmat = patsy.dmatrix("C(period)", data=data)
di = dmat.design_info
lc = di.linear_constraint("C(period)[T.2] = C(period)[T.3] = C(period)[T.4]")
constraint = (np.asarray(lc.coefs), np.asarray(lc.constants))
model4 = sm.GEE(np.asarray(data["rate"]), np.asarray(dmat),
                groups=np.asarray(groups), family=ga, cov_struct=ex,
                constraint=constraint)
result4 = model4.fit()
print(result4.summary())
print(model4.score_test_results)
```

There isn't a lot of data for assessing the link function, but we can still check the plot of residuals against fitted values, shown below. There is some evidence of an increasing mean/variance relationship.

In [12]:

```
import matplotlib.pyplot as plt
plt.plot(result1.fittedvalues, result1.resid, 'o', alpha=0.6)
plt.xlabel("Fitted values", size=17)
plt.ylabel("Residuals", size=17)
plt.xlim(0, 0.25)
```

To address this, we can model the number of transmission events as following a Poisson distribution. This is based on a conceptual model in which there is a fixed transmission probability, and every transmission event is independent of the others. We use an offset to account for differences in the herd sizes.
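The role of the offset can be sketched directly: with a log link, adding `log(size)` to the linear predictor makes the expected count proportional to herd size, so the model effectively describes per-animal rates. The herd sizes and linear predictor values below are hypothetical:

```python
import numpy as np

# With a log link, E[incidence] = exp(X @ beta + log(size)),
# which is the same as size * exp(X @ beta): the coefficients
# describe the per-animal incidence rate, not the raw count.
size = np.array([10.0, 25.0, 40.0])   # hypothetical herd sizes
xb = np.array([-2.0, -2.5, -3.0])     # hypothetical linear predictors
mean_counts = np.exp(xb + np.log(size))
assert np.allclose(mean_counts, size * np.exp(xb))
```

This is why the offset enters as `log(size)` with an implicit coefficient of 1, rather than as a fitted covariate.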

In [13]:

```
po = sm.families.Poisson()
ex = sm.cov_struct.Exchangeable()
size = np.log(np.asarray(data["size"]))
model5 = sm.GEE.from_formula("incidence ~ C(period)", groups=groups,
                             data=data, offset=size, family=po, cov_struct=ex)
result5 = model5.fit(cov_type='naive')
print(result5.summary())
print(ex.summary())
```

The dependence is still weak. The scale factor is quite large, indicating that there is a lot of heterogeneity among the herds in CBPP transmission.

Below we fit a negative binomial model to the counts.
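The negative binomial family allows the variance to exceed the mean, which matches the heterogeneity noted above. A minimal sketch of its variance function, `Var(Y) = mu + alpha * mu**2`, using hypothetical mean values:

```python
import numpy as np

# Negative binomial variance with dispersion parameter alpha:
# Var(Y) = mu + alpha * mu**2, which grows faster than the Poisson
# variance (Var = mu) as the mean increases.
alpha = 1.0
mu = np.array([0.5, 2.0, 8.0])  # hypothetical mean counts
nb_var = mu + alpha * mu ** 2
assert np.all(nb_var >= mu)     # never less than the Poisson variance
```

Setting `alpha=1.` in the family below is a fixed choice of this dispersion parameter, not an estimate from the data.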

In [14]:

```
nb = sm.families.NegativeBinomial(alpha=1.)
ex = sm.cov_struct.Exchangeable()
size = np.log(np.asarray(data["size"]))
model6 = sm.GEE.from_formula("incidence ~ C(period)", groups=groups,
                             data=data, offset=size, family=nb, cov_struct=ex)
result6 = model6.fit()
print(result6.summary())
print(ex.summary())
```