*Key ideas:* Poisson GEE, exchangeable and autoregressive covariance structures, longitudinal count data

In this notebook, we fit a Poisson GEE to the epilepsy data from the MASS library in R. The data come from a study in which the number of seizures was recorded for each of 59 subjects over a 16 week period, with a treatment introduced at the end of week 8. The seizure counts were binned as follows: weeks 1-8, 9-10, 11-12, 13-14, and 15-16. This is a longitudinal data set with five measurements per subject, and there is likely to be within-subject dependence.

In the analysis below, the total number of seizures during the first 8 weeks is treated as a baseline measure, and the effectiveness of the treatment is based on the seizure counts for weeks 9-16 (binned into 4 intervals that are each 2 weeks in duration).

More information about the data can be found here:

http://vincentarelbundock.github.io/Rdatasets/doc/MASS/epil.html

Here are the standard import statements:

In [1]:

```
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
from statsmodels.genmod.generalized_estimating_equations import GEE
from statsmodels.genmod.cov_struct import (Exchangeable,
    Independence, Autoregressive)
from statsmodels.genmod.families import Poisson
```

The data are in a clean csv file and are easy to read into Python:

In [2]:

```
data = pd.read_csv("epil.csv")
print(data.head())
```

We can start with a plot that shows how the baseline seizure count is related to the seizure count in each of the 4 follow-up periods. It is clear that subjects having more seizures in the baseline period also had more seizures in the follow-up periods. We use the square root transform to reduce skew (note also that the square root transform is the variance stabilizing transformation for the Poisson distribution).

In [3]:

```
for period in range(1, 5):
    ii = data['period'] == period
    x = np.sqrt(data['base'].loc[ii])
    y = np.sqrt(data['y'].loc[ii])
    plt.plot(x, y, 'o', color='rgbc'[period-1], label=str(period), alpha=0.5)
plt.xlabel("Baseline", size=17)
plt.ylabel("Follow-up", size=17)
plt.ylim(bottom=-0.5)
ha, lb = plt.gca().get_legend_handles_labels()
leg = plt.figlegend(ha, lb, "center right", numpoints=1, handletextpad=0.0001)
leg.draw_frame(False)
```
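As a quick aside, the variance-stabilizing property mentioned above is easy to check numerically: for Poisson draws with increasingly large means, the variance of sqrt(Y) settles near 1/4, while the variance of Y itself grows with the mean. This is a standalone simulation sketch, separate from the epilepsy data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Var(Y) = mu for Y ~ Poisson(mu), but Var(sqrt(Y)) -> 1/4 as mu grows.
for mu in [5, 20, 80]:
    y = rng.poisson(mu, size=200_000)
    print(f"mu={mu}: var(Y)={y.var():.2f}, var(sqrt(Y))={np.sqrt(y).var():.3f}")
```

This is why plotting on the square root scale keeps the vertical spread of the points roughly comparable across the range of baseline counts.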

Next we fit a GEE model in which the working dependence structure specifies no dependence within subjects.

In [4]:

```
fam = Poisson()
ind = Independence()
model1 = GEE.from_formula("y ~ age + trt + base", "subject", data, cov_struct=ind, family=fam)
result1 = model1.fit()
print(result1.summary())
```

Consistent with the scatterplot shown above, subjects with a greater number of baseline seizures have more seizures in the follow-up period. The standard deviation of the number of baseline seizures is 26.7, so a subject who is 1 standard deviation above the mean number of seizures during the baseline period has on average exp(26.7*0.0226) = 1.8 times as many seizures during the follow-up period (compared to a subject whose baseline value falls at the mean). There is also a positive association between age and the post-treatment seizure counts: every additional decade of age is associated with exp(10*0.0223) = 1.25 times as many seizures. The subjects in the treatment arm have fewer seizures, but there is no statistical evidence of a treatment effect.
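These rate ratios are just exponentiated coefficient-times-change calculations; using the coefficient estimates reported in the summary (0.0226 for base, 0.0223 for age) as inputs:

```python
import numpy as np

# Rate ratio for a one-SD (26.7) increase in the baseline seizure count
print(np.exp(26.7 * 0.0226))  # roughly 1.8

# Rate ratio for a 10-year increase in age
print(np.exp(10 * 0.0223))    # roughly 1.25
```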

The scale parameter is estimated to be around 5:

In [5]:

```
result1.scale
```

This means that the variance is around 5 times the mean. This is a sign of substantial overdispersion: since the variance of a Poisson distribution is equal to its mean, the scale parameter should be approximately equal to 1 if the data follow a Poisson regression model.
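To see why scale ≈ 1 would be expected under a correctly specified Poisson model, here is a small simulation (synthetic counts, unrelated to the epilepsy data): Poisson draws have variance roughly equal to their mean, while negative binomial draws with the same mean are overdispersed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Poisson: variance equals the mean, so the var/mean ratio is near 1.
y_pois = rng.poisson(8.0, size=100_000)
print(y_pois.var() / y_pois.mean())

# Negative binomial with the same mean of 8 but extra-Poisson variation:
# mean = n*(1-p)/p = 8, variance = mean/p = 24, so var/mean is about 3.
n, p = 4, 1 / 3
y_nb = rng.negative_binomial(n, p, size=100_000)
print(y_nb.var() / y_nb.mean())
```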

Next we plot the empirical mean/variance relationship, by plotting the mean of the repeated measures for each subject against the corresponding variance (note that this does not account for the effects of the covariates). In the log/log plot, the relationship is roughly linear with a slope that is much greater than 1, reflecting the overdispersion noted above.

In [6]:

```
yg = model1.cluster_list(np.asarray(data["y"]))
ymn = [x.mean() for x in yg]
yva = [x.var() for x in yg]
plt.grid(True)
plt.plot(np.log(ymn), np.log(yva), 'o')
plt.plot([-2, 6], [-2, 6], '-', color='grey')
plt.xlabel("Log mean", size=17)
plt.ylabel("Log variance", size=17)
```

The exchangeable working dependence structure gives nearly the same parameter estimates and standard errors in this case as the independence structure, but we learn from the exchangeable model that the within-subject correlation coefficient is roughly 0.4.

In [7]:

```
fam = Poisson()
ex = Exchangeable()
model2 = GEE.from_formula("y ~ age + trt + base", "subject", data, cov_struct=ex, family=fam)
result2 = model2.fit()
print(result2.summary())
print(ex.summary())
```
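Concretely, the exchangeable structure posits a working correlation matrix with a single common off-diagonal value for the four follow-up measurements. Plugging in the roughly 0.4 estimate quoted above (typed in by hand here for illustration, not extracted from the fit):

```python
import numpy as np

rho = 0.4  # approximate within-subject correlation from the exchangeable fit
R = rho * np.ones((4, 4)) + (1 - rho) * np.eye(4)
print(R)
```

Every pair of follow-up measurements on a subject is assumed to be equally correlated, regardless of how far apart in time they are.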

We can also try fitting the model with an autoregressive dependence structure. In this case, starting values are needed to get the estimating algorithm to converge.

In [8]:

```
time = np.kron(np.ones(data.shape[0] // 4), (0, 1, 2, 3))
fam = Poisson()
ar = Autoregressive()
ar.dep_params = 0.4
model3 = GEE.from_formula("y ~ age + trt + base", "subject", data, time=time, cov_struct=ar, family=fam)
result3 = model3.fit(start_params=result2.params)
print(result3.summary())
print(ar.summary())
print("scale=%.2f" % (result3.scale))
```

Comparing the autoregressive and exchangeable models, we see that the regression coefficients are quite similar. Nevertheless, we can probe the dependence structure a bit more by making a plot of the empirical correlation between residuals as a function of the time spacing. Here the time point values are 0, 1, 2, 3, which are the default integer time values for the four post-treatment measurements. The plot below suggests that the dependence decreases with time, so the autoregressive structure seems to be more suitable.

In [9]:

```
fig = result1.plot_isotropic_dependence()
plt.grid(True)
```
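The idea behind this plot can be sketched by hand: gather residual pairs within each subject, group them by time lag, and compute the correlation per lag. A minimal illustration on simulated residuals with AR(1)-type dependence (synthetic data, four time points per subject):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate residuals for 500 subjects at 4 time points with AR(1) correlation 0.5.
rho, T = 0.5, 4
R = rho ** np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
resid = rng.standard_normal((500, T)) @ np.linalg.cholesky(R).T

# Empirical correlation of within-subject residual pairs at each lag;
# under AR(1) this should decay roughly as 0.5, 0.25, 0.125.
for lag in range(1, T):
    a = np.concatenate([resid[:, t] for t in range(T - lag)])
    b = np.concatenate([resid[:, t + lag] for t in range(T - lag)])
    print(lag, np.corrcoef(a, b)[0, 1])
```

A decaying pattern like this favors the autoregressive structure; under exchangeability the per-lag correlations would be roughly flat.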

Next we look at how one of the regression parameters (the age effect) changes as we hold the autocorrelation parameter fixed at various values. The broken vertical line shows the estimated autocorrelation parameter.

In [10]:

```
rslts = result3.params_sensitivity(0., 0.9, 5)
dp = [x.cov_struct.dep_params for x in rslts]
age_params = np.asarray([x.params[2] for x in rslts])
age_se = np.asarray([x.bse[2] for x in rslts])
plt.grid(True)
plt.plot(dp, age_params, '-', color='orange', lw=4)
plt.plot(dp, age_params + age_se, '-', color='grey')
plt.plot(dp, age_params - age_se, '-', color='grey')
plt.axvline(result3.cov_struct.dep_params, ls='--', color='black')
plt.xlabel("Autocorrelation parameter", size=14)
plt.ylabel("Age effect", size=14)
```
