Marginal models for longitudinal count data with Generalized Estimating Equations

Key ideas: Gaussian GEE, Poisson GEE, score testing, offsets

The CBPP (Contagious bovine pleuropneumonia) dataset (part of the R LMER library) contains data on the incidence of new cases of CBPP in 15 herds of cattle. Each herd is assessed for CBPP during four time periods. This is a longitudinal data set, since the same herds are observed in the four time periods. In this notebook we use generalized estimating equations (GEE) to compare the CBP incidence rates among the time periods, while accounting for the possibility that there is dependence within herds across time.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

We begin by constructing a variable that contains the naive estimate of the incidence rate (the number of incident cases divided by the herd size).

In [2]:
data = pd.read_csv("cbpp.csv")
data["rate"] = data["incidence"] / data["size"].astype(np.float64)

Gaussian approach

A basic approach is to model the naive incidence rates as Gaussian, with an exchangeable within-herd dependence structure (i.e. the covariance matrix of the 4 observations for one herd has constant off-diagonal elements). We also allow the mean to vary with period (the time variable). We treat period as a categorical variable, so each of the four time periods has its own mean value.

In [3]:
ga = sm.families.Gaussian()
ex = sm.cov_struct.Exchangeable()

model1 = sm.GEE.from_formula("rate ~ 0 + C(period)", groups="herd",
                      data=data, family=ga, cov_struct=ex)
result1 = model1.fit()
print result1.summary()
                               GEE Regression Results                              
===================================================================================
Dep. Variable:                        rate   No. Observations:                   56
Model:                                 GEE   No. clusters:                       15
Method:                        Generalized   Min. cluster size:                   1
                      Estimating Equations   Max. cluster size:                   4
Family:                           Gaussian   Mean cluster size:                 3.7
Dependence structure:         Exchangeable   Num. iterations:                     4
Date:                     Sat, 18 Oct 2014   Scale:                           0.016
Covariance type:                    robust   Time:                         22:29:51
================================================================================
                   coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
C(period)[1]     0.2198      0.042      5.206      0.000         0.137     0.303
C(period)[2]     0.0741      0.020      3.714      0.000         0.035     0.113
C(period)[3]     0.0894      0.037      2.401      0.016         0.016     0.162
C(period)[4]     0.0413      0.019      2.133      0.033         0.003     0.079
==============================================================================
Skew:                          1.1918   Kurtosis:                       1.4013
Centered skew:                 0.5332   Centered kurtosis:              0.6795
==============================================================================

We see that after the first period, the incidence rate drops substantially. We don't have a lot of background information about this data set, but for the sake of discussion we can treat this as a hypothesis that can be studied as if it were pre-specified (rather than one that was discovered using the data). If we reparameterize the model in terms of deviations from the first period, we see that there is statistical evidence that periods 2-4 have lower incidence rates than period 1 (due to the Z-scores for periods 2-4 being rather large in magnitude).

In [4]:
model2 = sm.GEE.from_formula("rate ~ C(period)", groups="herd",
                      data=data, family=ga, cov_struct=ex)
result2 = model2.fit()
print result2.summary()
                               GEE Regression Results                              
===================================================================================
Dep. Variable:                        rate   No. Observations:                   56
Model:                                 GEE   No. clusters:                       15
Method:                        Generalized   Min. cluster size:                   1
                      Estimating Equations   Max. cluster size:                   4
Family:                           Gaussian   Mean cluster size:                 3.7
Dependence structure:         Exchangeable   Num. iterations:                     2
Date:                     Sat, 18 Oct 2014   Scale:                           0.016
Covariance type:                    robust   Time:                         22:29:51
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept          0.2198      0.042      5.206      0.000         0.137     0.303
C(period)[T.2]    -0.1457      0.051     -2.855      0.004        -0.246    -0.046
C(period)[T.3]    -0.1304      0.054     -2.396      0.017        -0.237    -0.024
C(period)[T.4]    -0.1785      0.046     -3.870      0.000        -0.269    -0.088
==============================================================================
Skew:                          1.1918   Kurtosis:                       1.4013
Centered skew:                 0.5332   Centered kurtosis:              0.6795
==============================================================================

We also see that there is very little evidence of within-herd correlation.

In [5]:
print ex.summary()
The correlation between two observations in the same cluster is 0.004

The weak dependence is corroborated by the following dependence plot, however the sample size of this data set is too small to learn much about the covariance structure.

In [6]:
fig = result2.plot_isotropic_dependence(min_n=1)

Score test comparing periods 2-4 to period 1

We can test the hypothesis that time periods 2-4 have a common mean value using a score test. This involves fitting the model under a linear constraint in which the mean parameters for periods 2-4 are held equal to each other. Currently, this cannot be done using the GEE.from_formula method, so we specify endog, exog, and the constraints explicitly. First we construct the data as numpy arrays.

In [7]:
endog = data["rate"]
period = data["period"]
groups = data["herd"]

n = len(endog)
exog = [np.ones(n), 1*(period == 2), 1*(period == 3), 1*(period == 4)]
exog = np.array(exog).T

Next we specify the constraints that define the null hypothesis that we are testing. We are testing params[1] = params[2] = params[3], or equivalently, params[1] - params[2] = 0 and params[2] - params[3] = 0. This can be expressed as a matrix equation of the form L*params = R (where R = 0 in this case).

In [8]:
L = np.zeros((2,4))
L[0,:] = [0, 1, -1, 0]
L[1,:] = [0, 0, 1, -1]
R = np.zeros(2)

Now we are ready to fit the model:

In [9]:
model3 = sm.GEE(endog, exog, groups=groups, family=ga, cov_struct=ex,
                constraint=(L, R))
result3 = model3.fit()
print result3.summary()
                               GEE Regression Results                              
===================================================================================
Dep. Variable:                        rate   No. Observations:                   56
Model:                                 GEE   No. clusters:                       15
Method:                        Generalized   Min. cluster size:                   1
                      Estimating Equations   Max. cluster size:                   4
Family:                           Gaussian   Mean cluster size:                 3.7
Dependence structure:         Exchangeable   Num. iterations:                     4
Date:                     Sat, 18 Oct 2014   Scale:                           0.015
Covariance type:                    robust   Time:                         22:29:51
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.2198      0.042      5.206      0.000         0.137     0.303
x1            -0.1510      0.046     -3.273      0.001        -0.241    -0.061
x2            -0.1510      0.046     -3.273      0.001        -0.241    -0.061
x3            -0.1510      0.046     -3.273      0.001        -0.241    -0.061
==============================================================================
Skew:                          1.2807   Kurtosis:                       1.6141
Centered skew:                 0.5474   Centered kurtosis:              0.7898
==============================================================================

The score statistic tells us how fast the likelihood increases if we did not impose the constraint that the rates at time points 2-4 are equal. Based on the score test, there is no statistical evidence that periods 2-4 are different from each other.

In [10]:
print model3.score_test_results
{'df': 2, 'p-value': 0.57225449912148196, 'statistic': 1.1163429160392608}

Alternatively, we can use Patsy to process the formulas and constraints for us.

In [11]:
import patsy
dmat = patsy.dmatrix("C(period)", data=data)
di = dmat.design_info
lc = di.linear_constraint("C(period)[T.2] = C(period)[T.3] = C(period)[T.4]")

constraint = (np.asarray(lc.coefs), np.asarray(lc.constants))

model4 = sm.GEE(np.asarray(data["rate"]), np.asarray(dmat), groups=np.asarray(groups), family=ga, cov_struct=ex,
         constraint=constraint)
result4 = model4.fit()
print result4.summary()
print model4.score_test_results
                               GEE Regression Results                              
===================================================================================
Dep. Variable:                           y   No. Observations:                   56
Model:                                 GEE   No. clusters:                       15
Method:                        Generalized   Min. cluster size:                   1
                      Estimating Equations   Max. cluster size:                   4
Family:                           Gaussian   Mean cluster size:                 3.7
Dependence structure:         Exchangeable   Num. iterations:                     2
Date:                     Sat, 18 Oct 2014   Scale:                           0.015
Covariance type:                    robust   Time:                         22:29:51
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.2198      0.042      5.206      0.000         0.137     0.303
x1            -0.1510      0.046     -3.273      0.001        -0.241    -0.061
x2            -0.1510      0.046     -3.273      0.001        -0.241    -0.061
x3            -0.1510      0.046     -3.273      0.001        -0.241    -0.061
==============================================================================
Skew:                          1.2807   Kurtosis:                       1.6141
Centered skew:                 0.5474   Centered kurtosis:              0.7898
==============================================================================
{'df': 2, 'p-value': 0.57225449912621951, 'statistic': 1.1163429160227036}

Poisson approach

There isn't a lot of data for assessing the link function, but we can still check the fitted values on residuals plot, shown below. There is some evidence for an increasing mean/variance relationship.

In [12]:
plt.plot(result1.fittedvalues, result1.resid, 'o', alpha=0.6)
plt.xlabel("Fitted values", size=17)
plt.ylabel("Residuals", size=17)
plt.xlim(0, 0.25)
Out[12]:
(0, 0.25)

To address this, we can model the number of transmission events as following a Poisson distribution. This is based on a conceptual model in which there is a fixed transmission probability, and every transmission event is independent of the others. We use an offset to account for differences in the herd sizes.

In [13]:
po = sm.families.Poisson()
ex = sm.cov_struct.Exchangeable()
size = np.log(np.asarray(data["size"]))

model5 = sm.GEE.from_formula("incidence ~ C(period)", groups=groups,
                      data=data, offset=size, family=po, cov_struct=ex)
result5 = model5.fit(cov_type='naive')
print result5.summary()
print ex.summary()
                               GEE Regression Results                              
===================================================================================
Dep. Variable:                   incidence   No. Observations:                   56
Model:                                 GEE   No. clusters:                       15
Method:                        Generalized   Min. cluster size:                   1
                      Estimating Equations   Max. cluster size:                   4
Family:                            Poisson   Mean cluster size:                 3.7
Dependence structure:         Exchangeable   Num. iterations:                    11
Date:                     Sat, 18 Oct 2014   Scale:                           8.248
Covariance type:                     naive   Time:                         22:29:52
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept         -1.5176      0.368     -4.126      0.000        -2.239    -0.797
C(period)[T.2]    -0.9982      0.777     -1.285      0.199        -2.521     0.525
C(period)[T.3]    -1.1174      0.839     -1.331      0.183        -2.762     0.528
C(period)[T.4]    -1.5641      1.129     -1.385      0.166        -3.778     0.650
==============================================================================
Skew:                          2.3583   Kurtosis:                       5.5733
Centered skew:                 1.5871   Centered kurtosis:              4.0577
==============================================================================
The correlation between two observations in the same cluster is 0.028

The dependence is still weak. The scale factor is quite large, indicating that there is a lot of heterogeneity among the herds in CBPP transmission.

Negative binomial approach

Below we fit a negative binomial model to the counts.

In [14]:
nb = sm.families.NegativeBinomial(alpha=1.)
ex = sm.cov_struct.Exchangeable()
size = np.log(np.asarray(data["size"]))

model6 = sm.GEE.from_formula("incidence ~ C(period)", groups=groups,
                      data=data, offset=size, family=nb, cov_struct=ex)
result6 = model6.fit()
print result6.summary()
print ex.summary()
                               GEE Regression Results                              
===================================================================================
Dep. Variable:                   incidence   No. Observations:                   56
Model:                                 GEE   No. clusters:                       15
Method:                        Generalized   Min. cluster size:                   1
                      Estimating Equations   Max. cluster size:                   4
Family:                   NegativeBinomial   Mean cluster size:                 3.7
Dependence structure:         Exchangeable   Num. iterations:                    11
Date:                     Sat, 18 Oct 2014   Scale:                           4.201
Covariance type:                    robust   Time:                         22:29:52
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept         -1.5163      0.199     -7.625      0.000        -1.906    -1.127
C(period)[T.2]    -1.0196      0.349     -2.922      0.003        -1.704    -0.336
C(period)[T.3]    -0.9816      0.430     -2.281      0.023        -1.825    -0.138
C(period)[T.4]    -1.6129      0.359     -4.497      0.000        -2.316    -0.910
==============================================================================
Skew:                          2.3583   Kurtosis:                       5.5741
Centered skew:                 1.5874   Centered kurtosis:              4.0588
==============================================================================
The correlation between two observations in the same cluster is 0.015