Generalized Estimating Equations¶

Key ideas: Nested dependence structures

This notebook demonstrates how regression models for nested data can be estimated using GEE.

The data are from a study of GCSE (General Certificate of Seconday Education) exam scores in the UK. We know the school in which each student is enrolled. In addition, the schools are clustered into "local educational areas" (LEA). Thus we have two nested levels of clustering (schools within LEA's). The top level (LEA) can be handled by treating it as the GEE group. The inner level (school) can be handled using a nested dependence structure.

All of the models here use a linear link and a Gaussian model for the data.

The data are available from: http://www.bristol.ac.uk/cmm/learning/support/datasets/

Here are all the libraries that we will need:

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm


The data file has no column labels, so we attach them manually.

In [2]:
data = pd.read_csv("a-level-chemistry.txt", header=None, sep=' ')
data.columns = ["Board", "A-Score", "Gtot", "Gnum",
"Gender", "Age", "Inst_Type", "LEA",
"Institute", "Student"]


The GCSE has several components, and not all students take all of the components. The variable 'Gtot' in the data set is the sum of exam scores over all the components taken by a given student. We divide the score sum by the number of components taken, in order to obtain a mean score that is not inflated just because a student takes more exams.

In [3]:
data["GCSE"] = data["Gtot"] / data["Gnum"]


Our primary focus here is on the nested dependence structure. For comparison, we will also consider a single-level dependence structure determined only by the school. Since the school codes are recycled across the LEA's, we need to create a new variable that combines the two codes.

In [4]:
data["School"] = [str(x) + str(y) for x,y in
zip(data["LEA"], data["Institute"])]
us = set(data["School"])
us = {x: k for k,x in enumerate(list(us))}
data["School"] = [us[x] for x in data["School"]]


These are all the variables in the analysis.

In [5]:
qdata = data[["GCSE", "Gender", "Age", "LEA", "Institute", "School"]]
qdata = np.asarray(qdata)

In [6]:
family = sm.families.Gaussian()


The initial model considers only the school effect, and does not account for the fact that the schools are nested in LEA's.

In [7]:
ex = sm.cov_struct.Exchangeable()
model1 = sm.GEE.from_formula("GCSE ~ Age + Gender", "LEA",
data=data, family=family, cov_struct=ex)
result1 = model1.fit()
print result1.summary()

                               GEE Regression Results
===================================================================================
Dep. Variable:                        GCSE   No. Observations:                31022
Model:                                 GEE   No. clusters:                      131
Method:                        Generalized   Min. cluster size:                  10
Estimating Equations   Max. cluster size:                 969
Family:                           Gaussian   Mean cluster size:               236.8
Dependence structure:         Exchangeable   Num. iterations:                    10
Date:                     Sun, 19 Oct 2014   Scale:                           0.738
Covariance type:                    robust   Time:                         00:24:19
==============================================================================
coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      6.0896      0.022    271.745      0.000         6.046     6.134
Age            0.0134      0.001      9.927      0.000         0.011     0.016
Gender         0.3258      0.013     25.641      0.000         0.301     0.351
==============================================================================
Skew:                         -0.5403   Kurtosis:                       0.4449
Centered skew:                -0.5121   Centered kurtosis:              0.4452
==============================================================================


The dependence among subjects within a school is fairly weak:

In [8]:
print ex.summary()

The correlation between two observations in the same cluster is 0.039


Next we fit the model using the nested dependence structure. Since the iterations are somewhat slow, we limit the number of iterations. This may result in a warning message.

In [9]:
ne = sm.cov_struct.Nested()
model2 = sm.GEE.from_formula("GCSE ~ Age + Gender", "LEA",
data=data, family=family, cov_struct=ne,
dep_data=data["Institute"])
result2 = model2.fit() #maxiter=10)
print result2.summary()

                               GEE Regression Results
===================================================================================
Dep. Variable:                        GCSE   No. Observations:                31022
Model:                                 GEE   No. clusters:                      131
Method:                        Generalized   Min. cluster size:                  10
Estimating Equations   Max. cluster size:                 969
Family:                           Gaussian   Mean cluster size:               236.8
Dependence structure:               Nested   Num. iterations:                    11
Date:                     Sun, 19 Oct 2014   Scale:                           0.758
Covariance type:                    robust   Time:                         00:24:42
==============================================================================
coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      5.9790      0.020    294.902      0.000         5.939     6.019
Age            0.0129      0.001     10.587      0.000         0.010     0.015
Gender         0.3699      0.012     31.882      0.000         0.347     0.393
==============================================================================
Skew:                         -0.5306   Kurtosis:                       0.4397
Centered skew:                -0.5020   Centered kurtosis:              0.4395
==============================================================================


The nested dependence structure captures more of the dependence in the data compared to the exchangeable dependence structure. Component 1 is the variation among LEA's. It is fairly weak, suggesting that the schools within a LEA are fairly heterogeneous. The variance among schools within an LEA is stronger, suggesting that students within a school share similar characteristics. Variation among students (residual variance in the table below) is the greatest contributor to the overall variance.

In [10]:
print ne.summary()

Variance estimates
------------------
Component 1: 0.043
Component 2: 0.176
Residual: 0.539