Generalized Estimating Equations

Key ideas: Poisson GEE, exchangeable and autoregressive covariance structures, longitudinal count data

In this notebook, we fit a Poisson GEE to the epilepsy data from the MASS library in R. The data come from a study in which the number of seizures was recorded on each of 236 subjects over a 16 week period, with a treatment introduced at the end of week 8. The seizure counts were binned as follows: weeks 1-8, 9-10, 11-12, 13-14, and 15-16. This is a longitudinal data set with five measurements per subject, and there is likely to be within-subject dependence.

In the analysis below, the total number of seizures during the first 8 weeks is treated as a baseline measure, and the effectiveness of the treatment is based on the seizure counts for weeks 9-16 (binned into 4 intervals that are each 2 weeks in duration).

More information about the data can be found here:

http://vincentarelbundock.github.io/Rdatasets/doc/MASS/epil.html

Here are the standard import statements:

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np
import pandas as pd
from statsmodels.genmod.generalized_estimating_equations import GEE
from statsmodels.genmod.cov_struct import (Exchangeable,
    Independence,Autoregressive)
from statsmodels.genmod.families import Poisson

The data are in a clean csv file and are easy to read into Python:

In [2]:
data = pd.read_csv("epil.csv")
print(data.head())
   Unnamed: 0  y      trt  base  age  V4  subject  period     lbase      lage
0           1  5  placebo    11   31   0        1       1 -0.756354  0.114204
1           2  3  placebo    11   31   0        1       2 -0.756354  0.114204
2           3  3  placebo    11   31   0        1       3 -0.756354  0.114204
3           4  3  placebo    11   31   1        1       4 -0.756354  0.114204
4           5  3  placebo    11   30   0        2       1 -0.756354  0.081414

We can start with a plot that shows how the baseline seizure count is related to the seizure count in each of the 4 follow-up periods. It is clear that subjects having more seizures in the baseline period also had more seizures in the follow-up periods. We use the square root transform to reduce skew (note also the the square root transform is the variance stabilizing transformation for the Poisson distribution).

In [3]:
for period in range(1,5):
    ii = data['period'] == period
    x = np.sqrt(data['base'].loc[ii])
    y = np.sqrt(data['y'].loc[ii])
    plt.plot(x, y, 'o', color='rgbc'[period-1], label=str(period), alpha=0.5)
plt.xlabel("Baseline", size=17)
plt.ylabel("Follow-up", size=17)
plt.ylim(ymin=-0.5)
ha, lb = plt.gca().get_legend_handles_labels()
leg = plt.figlegend(ha, lb, "center right", numpoints=1, handletextpad=0.0001)
leg.draw_frame(False)

Next we fit a GEE model in which the working dependence structure specifies no dependence within subjects.

In [4]:
fam = Poisson()
ind = Independence()
model1 = GEE.from_formula("y ~ age + trt + base", "subject", data, cov_struct=ind, family=fam)
result1 = model1.fit()
print(result1.summary())
                               GEE Regression Results                              
===================================================================================
Dep. Variable:                           y   No. Observations:                  236
Model:                                 GEE   No. clusters:                       59
Method:                        Generalized   Min. cluster size:                   4
                      Estimating Equations   Max. cluster size:                   4
Family:                            Poisson   Mean cluster size:                 4.0
Dependence structure:         Independence   Num. iterations:                    51
Date:                     Wed, 23 Mar 2016   Scale:                           5.087
Covariance type:                    robust   Time:                         18:05:10
====================================================================================
                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.5730      0.361      1.589      0.112        -0.134     1.280
trt[T.progabide]    -0.1519      0.171     -0.888      0.375        -0.487     0.183
age                  0.0223      0.011      1.960      0.050      2.11e-06     0.045
base                 0.0226      0.001     18.451      0.000         0.020     0.025
==============================================================================
Skew:                          3.7823   Kurtosis:                      28.6672
Centered skew:                 2.7597   Centered kurtosis:             21.9865
==============================================================================

Consistent with the scatterplot shown above, subjects with a greater number of baseline seizures have more seizures in the followup period. The standard deviation of the number of baseline seizures is 26.7, so a subject who is 1 standard deviation above the mean number of seizures during the baseline period has on average exp(26.7*0.0226) = 1.7 times more seizures during the follow-up period (compared to a subject whose baseline value falls at the mean). There is also a positive association between age and the post-treatment seizure counts: every additional decade of age is associated with exp(10*0.0223) = 1.25 times more seizures. The subjects in the treatment arm have fewer seizures, but there is no statistical evidence of a treatment effect.

The scale parameter is estimated to be around 5:

In [5]:
result1.scale
Out[5]:
5.0873839366294833

This means that the variance is around 5 times the mean. This is a sign of substantial overdisperion, since the variance of a Poisson distribution is equal to its mean, hence the scale parameter should be approximately equal 1 if the data follow a Poisson regression model.

Next we plot the empirical mean/variance relationship, by plotting the mean of the repeated measures for each subject against the corresponding variance (note that this does not account for the effects of the covariates). In the log/log plot, the relationship is roughly linear with a slope that is much greater than 1, reflecting the overdispersion noted above.

In [6]:
yg = model1.cluster_list(np.asarray(data["y"]))
ymn = [x.mean() for x in yg]
yva = [x.var() for x in yg]
plt.grid(True)
plt.plot(np.log(ymn), np.log(yva), 'o')
plt.plot([-2, 6], [-2, 6], '-', color='grey')
plt.xlabel("Log variance", size=17)
plt.ylabel("Log mean", size=17)
Out[6]:
<matplotlib.text.Text at 0x7f881ecc0be0>

The exchangeable working dependence structure gives the same parameter estimates and standard errors in this case as the independence structure, but we learn from the exchangeable model that the within-subject correlation coefficient is roughly 0.4.

In [7]:
fam = Poisson()
ex = Exchangeable()
model2 = GEE.from_formula("y ~ age + trt + base", "subject", data, cov_struct=ex, family=fam)
result2 = model2.fit()
print(result2.summary())
print(ex.summary())
                               GEE Regression Results                              
===================================================================================
Dep. Variable:                           y   No. Observations:                  236
Model:                                 GEE   No. clusters:                       59
Method:                        Generalized   Min. cluster size:                   4
                      Estimating Equations   Max. cluster size:                   4
Family:                            Poisson   Mean cluster size:                 4.0
Dependence structure:         Exchangeable   Num. iterations:                    51
Date:                     Wed, 23 Mar 2016   Scale:                           5.087
Covariance type:                    robust   Time:                         18:05:11
====================================================================================
                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.5730      0.361      1.589      0.112        -0.134     1.280
trt[T.progabide]    -0.1519      0.171     -0.888      0.375        -0.487     0.183
age                  0.0223      0.011      1.960      0.050      2.11e-06     0.045
base                 0.0226      0.001     18.451      0.000         0.020     0.025
==============================================================================
Skew:                          3.7823   Kurtosis:                      28.6672
Centered skew:                 2.7597   Centered kurtosis:             21.9865
==============================================================================
The correlation between two observations in the same cluster is 0.393

We can also try fitting the model with an autoregressive dependence structure. In this case, starting values are needed to get the estimating algorithm to converge.

In [8]:
time = np.kron(np.ones(data.shape[0]/4), (0,1,2,3))
fam = Poisson()
ar = Autoregressive()
ar.dep_params = 0.4
model3 = GEE.from_formula("y ~ age + trt + base", "subject", data, time=time, cov_struct=ar, family=fam)
result3 = model3.fit(start_params=result2.params)
print(result3.summary())
print(ar.summary())
print("scale=%.2f" % (result2.scale))
                               GEE Regression Results                              
===================================================================================
Dep. Variable:                           y   No. Observations:                  236
Model:                                 GEE   No. clusters:                       59
Method:                        Generalized   Min. cluster size:                   4
                      Estimating Equations   Max. cluster size:                   4
Family:                            Poisson   Mean cluster size:                 4.0
Dependence structure:       Autoregressive   Num. iterations:                    26
Date:                     Wed, 23 Mar 2016   Scale:                           5.064
Covariance type:                    robust   Time:                         18:05:11
====================================================================================
                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.4123      0.422      0.977      0.329        -0.415     1.240
trt[T.progabide]    -0.0704      0.164     -0.430      0.667        -0.391     0.250
age                  0.0274      0.013      2.108      0.035         0.002     0.053
base                 0.0223      0.001     18.252      0.000         0.020     0.025
==============================================================================
Skew:                          3.9903   Kurtosis:                      30.1753
Centered skew:                 2.7597   Centered kurtosis:             21.9865
==============================================================================
Autoregressive(1) dependence parameter: 0.500

scale=5.09

Dependence structure diagnostics

Comparing the autoregressive and exchangeable models, we see that the regression coefficients are quite similar. Nevertheless, we can probe the dependence structure a bit more by making a plot of the empirical correlation between residuals as a function of the time spacing. Here the time point values are 0, 1, 2, 3, which are the default integer time values for the four post-treatment measurements. The plot below suggests that the dependence decreases with time, so the autoregressive structure seems to be more suitable.

In [9]:
fig = result1.plot_isotropic_dependence()
plt.grid(True)

Next we look at how one of the regression parameters (the age effect) changes as we hold the autocorrelation parameter fixed at various values. The broken vertical line shows the estimated autocorrelation parameter.

In [10]:
rslts = result3.params_sensitivity(0., 0.9, 5)
dp = [x.cov_struct.dep_params for x in rslts]
age_params = np.asarray([x.params[2] for x in rslts])
age_se = np.asarray([x.bse[2] for x in rslts])
plt.grid(True)
plt.plot(dp, age_params, '-', color='orange', lw=4)
plt.plot(dp, age_params + age_se, '-', color='grey')
plt.plot(dp, age_params - age_se, '-', color='grey')
plt.axvline(result3.cov_struct.dep_params, ls='--', color='black')
plt.xlabel("Autocorrelation parameter", size=14)
plt.ylabel("Age effect", size=14)
Out[10]:
<matplotlib.text.Text at 0x7f881eba8198>