Generalized Estimating Equations

Key ideas: GEE, ordinal data

This notebook demonstrates how ordinal regression models can be estimated using GEE. Ordinal regression models have a dependent variable that takes on finitely many values that are ordered.

The data are from a study of A-level exam scores in the UK. A student receives one of 6 possible ordered scores on the exam, so the data are ordinal. The students are clustered based on the school that they attended prior to taking the exam. In addition to gender and age, we also have access to the student's score on a previous exam. We can use regression analysis to assess how these three predictors are related to the score on the A-level exam. GEE allows us to do this while accounting for the ordered structure of the response, and the (presumed) non-independence among students who attend the same school. This non-independence could be driven by socioeconomic and other factors that cluster within neighborhoods or schools.

The data are available from: http://www.bristol.ac.uk/cmm/learning/support/datasets/

First we import all the libraries that we will need:

In [15]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

The data are available from the link http://www.bristol.ac.uk/cmm/learning/support/datasets/ as the archive momeg.zip, which contains several files. Here we will focus on the chemistry scores, contained in the file a-level-chemistry.txt.

In [16]:
data = pd.read_csv("a-level-chemistry.txt", header=None, sep=' ')
data.columns = ["Board", "A-Score", "Gtot", "Gnum",
                "Gender", "Age", "Inst_Type", "LEA",
                "Institute", "Student"]
print data.head()
print data.describe()
   Board  A-Score  Gtot  Gnum  Gender  Age  Inst_Type  LEA  Institute  Student
0      3        4    53     8       1    3          7    1          1       23
1      3       10    61     8       1   -3          7    1          1       24
2      3       10    58     8       1   -4          7    1          1       26
3      3       10    60     8       1   -2          7    1          1       28
4      3        8    58     9       1   -1          7    1          1       30
              Board       A-Score          Gtot          Gnum        Gender  \
count  31022.000000  31022.000000  31022.000000  31022.000000  31022.000000   
mean       4.156631      5.812585     55.951196      8.870157      0.443556   
std        1.975071      3.319167     11.061807      1.033699      0.496812   
min        1.000000      0.000000      0.000000      1.000000      0.000000   
25%        3.000000      4.000000     49.000000      8.000000      0.000000   
50%        3.000000      6.000000     56.000000      9.000000      0.000000   
75%        6.000000      8.000000     63.000000      9.000000      1.000000   
max        8.000000     10.000000    103.000000     14.000000      1.000000   

                Age     Inst_Type           LEA     Institute        Student  
count  31022.000000  31022.000000  31022.000000  31022.000000   31022.000000  
mean      -0.467765      4.936980     83.695216     18.410386   94734.095158  
std        3.430142      3.363572     36.940333     17.364621   56578.654087  
min       -6.000000      0.000000      1.000000      1.000000      23.000000  
25%       -3.000000      1.000000     52.000000      5.000000   45292.750000  
50%       -1.000000      5.000000     95.000000     13.000000   91269.500000  
75%        3.000000      7.000000    116.000000     26.000000  144722.750000  
max        5.000000     11.000000    131.000000    100.000000  196035.000000  

Here is a plot of the marginal distribution of exam scores.

In [17]:
sc = np.unique(data["A-Score"])
sc_n = [sum(data["A-Score"] == x) for x in sc]
plt.bar(sc-0.25, sc_n, 0.5)
plt.gca().set_xticks(sc)
plt.gca().set_xticklabels([str(x) for x in sc])
plt.xlim(-1, 11)
plt.xlabel("A-level score", size=17)
plt.ylabel("Number of students", size=17)
Out[17]:
<matplotlib.text.Text at 0x7f6ddd572e90>

We will use the student's score on the GCSE as one predictor of the A-level score. This can be viewed as a control for baseline achievement. The GCSE is a more basic exam than the A-level exam, but there should be a positive association between GCSE and A-level scores.

The GCSE has several components, and not all students take all of the components. The variable 'Gtot' in the data set is the sum of scores over all the components taken by a given student. We divide the sum by the number of components taken to obtain a mean score that is not inflated just because a student takes more exams.

In [18]:
data["GCSE"] = data["Gtot"] / data["Gnum"]

There are two levels of clustering in these data. The "LEA" is a grouping of schools in a local area, and the "Institute" variable is the actual school that a student attends. Institute codes are recycled across the LEA's, so here we create a composite code that reflects both the LEA and the institute.

In [19]:
data["School"] = [str(x) + str(y) for x,y in
                  zip(data["LEA"], data["Institute"])]

us = set(data["School"])
us = {x: k for k,x in enumerate(list(us))}
data["School"] = [us[x] for x in data["School"]]

These are all the variables in the analysis.

In [20]:
endog = data["A-Score"]
exog = data[["Gender", "Age", "GCSE"]]
groups = data["School"]

We will convert the GCSE score to a Z-score since we don't really care about the scale.

In [21]:
mn = exog.loc[:, "GCSE"].mean()
sd = exog.loc[:, "GCSE"].std()
exog.loc[:, "GCSE"] = (exog.loc[:, "GCSE"] - mn) / sd

As a basic initial model we will treat all the indicators as being independent. Although this is clearly not correct, the GEE estimates and standard errors are still meaningful.

In [22]:
ind = sm.cov_struct.Independence()
model0 = sm.OrdinalGEE(endog, exog, groups, cov_struct=ind)
result0 = model0.fit()
print result0.summary()
print ind.summary()
                           OrdinalGEE Regression Results                           
===================================================================================
Dep. Variable:                     A-Score   No. Observations:               155110
Model:                          OrdinalGEE   No. clusters:                     2410
Method:                        Generalized   Min. cluster size:                   5
                      Estimating Equations   Max. cluster size:                 940
Family:                           Binomial   Mean cluster size:                64.4
Dependence structure:         Independence   Num. iterations:                     7
Date:                     Mon, 12 Jan 2015   Scale:                           1.000
Covariance type:                    robust   Time:                         00:50:18
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
I(y>0.0)       3.1493      0.039     80.567      0.000         3.073     3.226
I(y>2.0)       1.9428      0.031     62.975      0.000         1.882     2.003
I(y>4.0)       0.8754      0.028     31.262      0.000         0.821     0.930
I(y>6.0)      -0.2568      0.028     -9.070      0.000        -0.312    -0.201
I(y>8.0)      -1.6892      0.033    -51.327      0.000        -1.754    -1.625
Gender        -0.5722      0.032    -18.041      0.000        -0.634    -0.510
Age           -0.0298      0.003     -9.751      0.000        -0.036    -0.024
GCSE           1.7411      0.026     66.114      0.000         1.689     1.793
==============================================================================
Skew:                         -0.1833   Kurtosis:                       0.5139
Centered skew:                -0.1453   Centered kurtosis:              0.3663
==============================================================================
Observations within a cluster are modeled as being independent.

Next we fit the same model with an exhangeable covariance structure. This dependence structure treats all binary indicators from a given cluster as having the same pairwise correlation. Since some indicators are "between subjects" and others are "within subjects", this may not reflect the true dependence very well. Here we limit the iterations so there will be a convergence warning.

In [23]:
ex = sm.cov_struct.Exchangeable()
model1 = sm.OrdinalGEE(endog, exog, groups, cov_struct=ex)
result1 = model1.fit(maxiter=5)
print result1.summary()
print ex.summary()
                           OrdinalGEE Regression Results                           
===================================================================================
Dep. Variable:                     A-Score   No. Observations:               155110
Model:                          OrdinalGEE   No. clusters:                     2410
Method:                        Generalized   Min. cluster size:                   5
                      Estimating Equations   Max. cluster size:                 940
Family:                           Binomial   Mean cluster size:                64.4
Dependence structure:         Exchangeable   Num. iterations:                     5
Date:                     Mon, 12 Jan 2015   Scale:                           1.000
Covariance type:                    robust   Time:                         00:50:28
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
I(y>0.0)       2.9170      0.046     64.065      0.000         2.828     3.006
I(y>2.0)       1.7610      0.034     51.547      0.000         1.694     1.828
I(y>4.0)       0.7190      0.029     24.401      0.000         0.661     0.777
I(y>6.0)      -0.4017      0.029    -13.840      0.000        -0.459    -0.345
I(y>8.0)      -1.8439      0.034    -54.260      0.000        -1.910    -1.777
Gender        -0.5778      0.026    -21.841      0.000        -0.630    -0.526
Age           -0.0305      0.003    -10.403      0.000        -0.036    -0.025
GCSE           1.7143      0.025     67.338      0.000         1.664     1.764
==============================================================================
Skew:                         -0.1111   Kurtosis:                       0.4678
Centered skew:                -0.0767   Centered kurtosis:              0.3215
==============================================================================
The correlation between two observations in the same cluster is 0.048
/projects/3c16f532-d309-4425-850f-a41a8c8f8c92/statsmodels-master/statsmodels/genmod/generalized_estimating_equations.py:1078: IterationLimitWarning: Iteration limit reached prior to convergence
  IterationLimitWarning)

The dependence is weak, perhaps due to the oversimplicity of the dependence model:

Next we plot the conditional probability distributions of the A-level exam scores for females and males. Age and the GCSE score are held fixed at their mean values.

In [24]:
ev = [{"Gender": 1}, {"Gender": 0}]
fig = result1.plot_distribution(None, ev)
ax = fig.get_axes()[0]
leg = plt.figlegend(ax.get_lines(), ["Females", "Males"], "upper center", ncol=2)
leg.draw_frame(False)

Here is a plot of the fitted distribution of A-level exam scores for females of average age, when the GCSE exam score is either 1 standard deviation below or 1 standard deviation above the mean.

In [25]:
ev = [{"GCSE": -1, "Gender": 1}, {"GCSE": 1, "Gender": 0}]
fig = result1.plot_distribution(None, ev)
ax = fig.get_axes()[0]
leg = plt.figlegend(ax.get_lines(), ["-1SD GCSE", "+1SD GCSE"], "upper center", ncol=2)
leg.draw_frame(False)

Now we can try a more sophisticated dependence model which accounts for the differing correlations between within-subject and between-subject indicators (as well as for different degrees of correlation among the within-subject indicators).

In [29]:
dep = sm.cov_struct.GlobalOddsRatio("ordinal")

Here we fit the model again using the new dependence structure. The parameter estimates from the exchangeable model can be used as starting values. This currently takes quite a while to run to convergence, so here we limit the number of iterations. Since we limit the number of iterations, we get a convergence warning message. We can ignore this message here since we have done this deliberately.

In [ ]:
model2 = sm.OrdinalGEE(endog, exog, groups, cov_struct=dep)
result2 = model2.fit(start_params=result1.params, maxiter=2)
print result2.summary()
                           OrdinalGEE Regression Results                           
===================================================================================
Dep. Variable:                     A-Score   No. Observations:               155110
Model:                          OrdinalGEE   No. clusters:                     2410
Method:                        Generalized   Min. cluster size:                   5
                      Estimating Equations   Max. cluster size:                 940
Family:                           Binomial   Mean cluster size:                64.4
Dependence structure:      GlobalOddsRatio   Num. iterations:                     2
Date:                     Mon, 12 Jan 2015   Scale:                           1.000
Covariance type:                    robust   Time:                         01:07:51
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
I(y>0.0)       3.0021      0.036     83.751      0.000         2.932     3.072
I(y>2.0)       1.8451      0.028     65.154      0.000         1.790     1.901
I(y>4.0)       0.7979      0.025     31.969      0.000         0.749     0.847
I(y>6.0)      -0.3360      0.025    -13.689      0.000        -0.384    -0.288
I(y>8.0)      -1.8018      0.027    -66.548      0.000        -1.855    -1.749
Gender        -0.5759      0.025    -23.013      0.000        -0.625    -0.527
Age           -0.0297      0.003    -11.060      0.000        -0.035    -0.024
GCSE           1.7267      0.021     83.945      0.000         1.686     1.767
==============================================================================
Skew:                         -0.1419   Kurtosis:                       0.4969
Centered skew:                -0.1056   Centered kurtosis:              0.3495
==============================================================================
/projects/3c16f532-d309-4425-850f-a41a8c8f8c92/statsmodels-master/statsmodels/genmod/generalized_estimating_equations.py:1078: IterationLimitWarning: Iteration limit reached prior to convergence
  IterationLimitWarning)

We have already captured much more of the dependence structure than when working with the exchangeable correlation. However due to the large sample size, this does not impact the parameter estimates or standard errors very much.

In [ ]:
print dep.summary()
Global odds ratio: 1.713