Generalized Estimating Equations

Key ideas: Gaussian GEE, longitudinal data, exchangeable dependence, autoregressive dependence

In this notebook, we analyze a longitudinal data set of CD4 counts for children with HIV. The children are assigned to one of two treatment arms. In addition, each child may or may not be on anti-retroviral therapy (ART). There are various ways to approach modeling the data. Here we will express the CD4 level as a function of the treatment group, ART status, and age at the time of the assessment. The model that we fit is a "marginal model" that gives us the mean CD4 level for children of a given age, ART status, and treatment status.

These data are discussed in Gelman and Hill's "Data analysis using regression and multilevel/hierarchical models" and have also been considered in many other places. The data can be found at http://www.stat.columbia.edu/~gelman/arm/software/.

We start with the basic import statements.

In [1]:
import numpy as np
import pandas as pd
import datetime
import statsmodels.api as sm
import matplotlib.pyplot as plt

%matplotlib inline

First we load the data and take a peek at the first few rows.

In [2]:
data = pd.read_csv("arm_cd4_allvar.csv")
print(data.head())
   VISIT  newpid        VDATE  CD4PCT  arv    visage  treatmnt  CD4CNT  \
0      1       1   6/29/1988     18.0  0.0  3.910000         1   323.0   
1      4       1   1/19/1989     37.0  0.0  4.468333         1   610.0   
2      7       1   4/13/1989     13.0  0.0  4.698333         1   324.0   
3     10       1          NaN     NaN  0.0  5.005000         1     NaN   
4     13       1  11/30/1989     13.0  0.0  5.330833         1   626.0   

   baseage  
0     3.91  
1     3.91  
2     3.91  
3     3.91  
4     3.91  

Note that there are two treatment variables in the dataset. One is the treatment of interest (treatmnt). In addition, ARV treatment information was available and used as a control covariate. The following crosstab shows how these two treatments are related.

In [3]:
pd.crosstab(data.treatmnt, data.arv)
Out[3]:
arv 0.0 1.0
treatmnt
1 561 53
2 464 46

Since there are missing values we will drop them (this is "listwise deletion" or "complete case analysis").

In [4]:
data = data.dropna()

We will need the age of the child at the time of the visit. This takes a bit of date processing, since the data set only contains the age at the baseline visit, and the dates of all visits. First we get the date of the first visit for each child.

In [5]:
data["VDATE"] = pd.to_datetime(data["VDATE"])
data["FirstVisit"] = data.groupby('newpid')["VDATE"].transform(np.min)
data["Day"] = (data["VDATE"] - data["FirstVisit"]) / np.timedelta64(1, 'D')
print(data.head())
   VISIT  newpid      VDATE  CD4PCT  arv    visage  treatmnt  CD4CNT  baseage  \
0      1       1 1988-06-29    18.0  0.0  3.910000         1   323.0     3.91   
1      4       1 1989-01-19    37.0  0.0  4.468333         1   610.0     3.91   
2      7       1 1989-04-13    13.0  0.0  4.698333         1   324.0     3.91   
4     13       1 1989-11-30    13.0  0.0  5.330833         1   626.0     3.91   
6     19       1 1990-06-07    12.0  1.0  5.848333         1   220.0     3.91   

  FirstVisit    Day  
0 1988-06-29    0.0  
1 1988-06-29  204.0  
2 1988-06-29  288.0  
4 1988-06-29  519.0  
6 1988-06-29  708.0  

Now we can compute each child's age on the date of each visit.

In [6]:
data["Age"] = data["baseage"] + data["Day"] / 365
print(data.head())
   VISIT  newpid      VDATE  CD4PCT  arv    visage  treatmnt  CD4CNT  baseage  \
0      1       1 1988-06-29    18.0  0.0  3.910000         1   323.0     3.91   
1      4       1 1989-01-19    37.0  0.0  4.468333         1   610.0     3.91   
2      7       1 1989-04-13    13.0  0.0  4.698333         1   324.0     3.91   
4     13       1 1989-11-30    13.0  0.0  5.330833         1   626.0     3.91   
6     19       1 1990-06-07    12.0  1.0  5.848333         1   220.0     3.91   

  FirstVisit    Day       Age  
0 1988-06-29    0.0  3.910000  
1 1988-06-29  204.0  4.468904  
2 1988-06-29  288.0  4.699041  
4 1988-06-29  519.0  5.331918  
6 1988-06-29  708.0  5.849726  

The dependent variable in a linear regression does not always have to be symmetrically distributed, but it is a good idea to check the marginal distribution before proceeding. In this case it is reasonably symmetric.

In [7]:
plt.hist(data.CD4PCT) 
_ = plt.xlabel("CD4PCT")

GEE with independent working covariance

Here is a very basic linear model that estimates the conditional mean CD4 percentage as a function of the subject's age at the time of the visit, the antiretroviral therapy status, and the treatment group. Based on this analysis, older children seem to have lower CD4 levels, and children on ARV have higher CD4 levels, but there is no clear effect of the antiretroviral therapy.

In [8]:
model1 = sm.GEE.from_formula("CD4PCT ~ Age + arv + treatmnt", groups="newpid", data=data)
result1 = model1.fit()
print(result1.summary())
                               GEE Regression Results                              
===================================================================================
Dep. Variable:                      CD4PCT   No. Observations:                  978
Model:                                 GEE   No. clusters:                      226
Method:                        Generalized   Min. cluster size:                   1
                      Estimating Equations   Max. cluster size:                   7
Family:                           Gaussian   Mean cluster size:                 4.3
Dependence structure:         Independence   Num. iterations:                     2
Date:                     Wed, 23 Mar 2016   Scale:                         171.820
Covariance type:                    robust   Time:                         12:49:37
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     23.9086      2.997      7.977      0.000        18.034    29.783
Age           -0.9923      0.373     -2.661      0.008        -1.723    -0.261
arv            1.6919      2.070      0.817      0.414        -2.365     5.749
treatmnt       2.7020      1.635      1.653      0.098        -0.503     5.907
==============================================================================
Skew:                          0.2773   Kurtosis:                      -0.3432
Centered skew:                 0.2924   Centered kurtosis:              3.2945
==============================================================================

This GEE estimate uses the default independence working covariance structure, so the coefficient estimates are the same as would be obtained from simple linear regression, although the standard errors are different. We can check this by fitting the OLS estimates directly:

In [9]:
model2 = sm.OLS.from_formula("CD4PCT ~ Age + arv + treatmnt", data)
result2 = model2.fit()
print(result2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 CD4PCT   R-squared:                       0.041
Model:                            OLS   Adj. R-squared:                  0.038
Method:                 Least Squares   F-statistic:                     13.71
Date:                Wed, 23 Mar 2016   Prob (F-statistic):           9.12e-09
Time:                        12:49:37   Log-Likelihood:                -3902.3
No. Observations:                 978   AIC:                             7813.
Df Residuals:                     974   BIC:                             7832.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     23.9086      1.495     15.996      0.000        20.975    26.842
Age           -0.9923      0.183     -5.421      0.000        -1.351    -0.633
arv            1.6919      1.451      1.166      0.244        -1.155     4.539
treatmnt       2.7020      0.842      3.209      0.001         1.050     4.354
==============================================================================
Omnibus:                       18.827   Durbin-Watson:                   0.803
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               17.333
Skew:                           0.277   Prob(JB):                     0.000172
Kurtosis:                       2.657   Cond. No.                         19.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The default standard errors returned by GEE are the robust standard errors which account for the dependence among repeated measures taken on the same child. We can override this and request the "naive" standard errors, which agree exactly with OLS (however generally this should not be done in practice). Note that the robust standard errors are larger than the naive standard errors, which reflects the fact that there is usually less information in a data set with dependent values compared to a dataset with independent values.

In [10]:
model3 = sm.GEE.from_formula("CD4PCT ~ Age + arv + treatmnt", "newpid", data)
result3 = model3.fit(cov_type='naive')
print(result3.summary())
                               GEE Regression Results                              
===================================================================================
Dep. Variable:                      CD4PCT   No. Observations:                  978
Model:                                 GEE   No. clusters:                      226
Method:                        Generalized   Min. cluster size:                   1
                      Estimating Equations   Max. cluster size:                   7
Family:                           Gaussian   Mean cluster size:                 4.3
Dependence structure:         Independence   Num. iterations:                     2
Date:                     Wed, 23 Mar 2016   Scale:                         171.820
Covariance type:                     naive   Time:                         12:49:38
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     23.9086      1.495     15.996      0.000        20.979    26.838
Age           -0.9923      0.183     -5.421      0.000        -1.351    -0.634
arv            1.6919      1.451      1.166      0.243        -1.151     4.535
treatmnt       2.7020      0.842      3.209      0.001         1.052     4.352
==============================================================================
Skew:                          0.2773   Kurtosis:                      -0.3432
Centered skew:                 0.2924   Centered kurtosis:              3.2945
==============================================================================

GEE with exchangeable covariance

Next we use a GEE estimator with a more interesting covariance structure. The exchangeable covariance structure treats all pairs of repeated measures on a subject as being equally correlated. The parameter estimates (regression main effects) are now different from what we had above. To the extent that the exchangeable covariance structure fits the data, these should be better (more statistically efficient) estimates.

In [11]:
ex = sm.cov_struct.Exchangeable()
model4 = sm.GEE.from_formula("CD4PCT ~ Age + arv + treatmnt", groups="newpid", data=data,
         cov_struct=ex)
result4 = model4.fit()
print(result4.summary())
                               GEE Regression Results                              
===================================================================================
Dep. Variable:                      CD4PCT   No. Observations:                  978
Model:                                 GEE   No. clusters:                      226
Method:                        Generalized   Min. cluster size:                   1
                      Estimating Equations   Max. cluster size:                   7
Family:                           Gaussian   Mean cluster size:                 4.3
Dependence structure:         Exchangeable   Num. iterations:                    10
Date:                     Wed, 23 Mar 2016   Scale:                         175.419
Covariance type:                    robust   Time:                         12:49:38
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     27.4284      2.801      9.792      0.000        21.939    32.918
Age           -1.7609      0.336     -5.244      0.000        -2.419    -1.103
arv            1.9451      1.181      1.647      0.100        -0.369     4.260
treatmnt       1.9567      1.569      1.247      0.212        -1.119     5.032
==============================================================================
Skew:                          0.2521   Kurtosis:                      -0.3788
Centered skew:                 0.2914   Centered kurtosis:              3.3861
==============================================================================

The covariance structure has a summary method that tells us that the observations within a subject are substantially correlated.

In [12]:
print(ex.summary())
The correlation between two observations in the same cluster is 0.704

GEE with autoregressive covariance structure

We can also fit the model using an autoregressive covariance structure, in which the errors become less dependent for pairs of CD4 values points that are further apart in time.

In [13]:
ar = sm.cov_struct.Autoregressive()
model5 = sm.GEE.from_formula("CD4PCT ~ Age + arv + treatmnt", groups="newpid", data=data, time="Age",
         cov_struct=ar)
result5 = model5.fit()
print(result5.summary())
print(ar.summary())
                               GEE Regression Results                              
===================================================================================
Dep. Variable:                      CD4PCT   No. Observations:                  978
Model:                                 GEE   No. clusters:                      226
Method:                        Generalized   Min. cluster size:                   1
                      Estimating Equations   Max. cluster size:                   7
Family:                           Gaussian   Mean cluster size:                 4.3
Dependence structure:       Autoregressive   Num. iterations:                    11
Date:                     Wed, 23 Mar 2016   Scale:                         171.898
Covariance type:                    robust   Time:                         12:49:39
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     24.3002      2.861      8.493      0.000        18.693    29.908
Age           -1.0563      0.356     -2.963      0.003        -1.755    -0.358
arv            2.2118      1.370      1.615      0.106        -0.473     4.896
treatmnt       2.4728      1.576      1.569      0.117        -0.616     5.562
==============================================================================
Skew:                          0.2707   Kurtosis:                      -0.3444
Centered skew:                 0.2975   Centered kurtosis:              3.2915
==============================================================================
Autoregressive(1) dependence parameter: 0.492

One way to assess the fit of the covariance structure is to do a graphical analysis of the residuals, focusing on whether the errors become less dependent when spaced further in time. Based on the following plot, the autocorrelation does not seem to drop with increasing age difference, so the exchangeable correlation structure may be more appropriate for this data set.

In [14]:
fig = result1.plot_isotropic_dependence()
fig.get_axes()[0].set_ylim(0, 1)
Out[14]:
(0, 1)

We can also look at the relationship between fitted values and residuals to see if there is any evidence of a mean/variance relationship. There is no such evidence here.

In [15]:
plt.plot(result1.fittedvalues, result1.resid, 'o', alpha=0.5)
plt.xlabel("Fitted values", size=17)
plt.ylabel("Residuals", size=17)
Out[15]:
<matplotlib.text.Text at 0x7f5382138358>