Comparing GEE and MixedLM

Keywords: Linear mixed models, generalized estimating equations, Stata

In this notebook we show how GEE and mixed models can be used as alternative approaches to analyzing a data set with a simple clustering structure. We also demonstrate how equivalent models can be fit in Statsmodels and in Stata.

The data are from a survey of individuals in Vietnam, focusing on health care expenditures. The main variable of interest is the amount of money spent on health care, which will be the dependent variable in all models considered here. This variable is log transformed, so we are looking at the proportional change in health care expenditure when contrasting groups defined in terms of the covariates. The survey was conducted as a cluster sample of communes, so we will used mixed models and GEE to account for the clustering structure in the data.

Here are the import statements:

In [1]:
import pandas as pd
from statsmodels.genmod.generalized_estimating_equations import GEE
from statsmodels.genmod.dependence_structures import Exchangeable, Independence
from statsmodels.regression.mixed_linear_model import MixedLM

Next we read the data and take a peek at the first few rows.

In [2]:
fname = "http://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/VietNamI.csv"
data = pd.read_csv(fname)
print data.head()
   Unnamed: 0  pharvis   lnhhexp       age     sex  married  educ  illness  \
0           1        0  2.730363  3.761200    male        1     2        1   
1           2        0  2.737248  2.944439  female        0     0        1   
2           3        0  2.266935  2.564950    male        0     4        0   
3           4        1  2.392753  3.637586  female        1     3        1   
4           5        1  3.105335  3.295837    male        1     3        1   

   injury  illdays  actdays  insurance  commune  
0       0        7        0          0      192  
1       0        4        0          0      167  
2       0        0        0          1       76  
3       0        3        0          1      123  
4       0       10        0          0      148  

[5 rows x 13 columns]

Age is log transformed in the data set, but we don't want this, so we exponentiate it back to the year scale.

In [3]:
data["age"] = np.exp(data["age"])

Education is recorded in years, but we don't care about the units here so we will convert it to Z-scores.

In [4]:
data["educ"] = (data["educ"] - data["educ"].mean()) / data["educ"].std()

Here is a basic linear model fit using GEE, with the default Gaussian family and linear link. The data are modeled as being independent within communes.

In [5]:
ind = Independence()
fml = "lnhhexp ~ age + sex + married + educ + insurance + injury + actdays"
mod_gee = GEE.from_formula(fml, groups=data["commune"], cov_struct=ind, data=data)
mod_gee = mod_gee.fit()
print mod_gee.summary()
print mod_gee.cov_struct.summary()
                               GEE Regression Results                              
===================================================================================
Dep. Variable:                     lnhhexp   No. Observations:                27765
Model:                                 GEE   No. clusters:                      194
Method:                        Generalized   Min. cluster size:                  51
                      Estimating Equations   Max. cluster size:                 206
Family:                           Gaussian   Mean cluster size:               143.1
Dependence structure:         Independence   Num. iterations:                     2
Date:                     Sun, 20 Jul 2014   Scale:                           0.361
Covariance type:                    robust   Time:                         06:49:00
===============================================================================
                  coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept       2.5646      0.031     82.534      0.000         2.504     2.626
sex[T.male]    -0.0072      0.006     -1.200      0.230        -0.019     0.005
age             0.0014      0.000      3.179      0.001         0.001     0.002
married        -0.0562      0.013     -4.315      0.000        -0.082    -0.031
educ            0.1472      0.016      9.084      0.000         0.115     0.179
insurance       0.1524      0.027      5.564      0.000         0.099     0.206
injury          0.0212      0.074      0.288      0.773        -0.123     0.165
actdays        -0.0066      0.005     -1.284      0.199        -0.017     0.003
==============================================================================
Skew:                          0.4577   Kurtosis:                       0.5664
Centered skew:                 0.0407   Centered kurtosis:              0.9566
==============================================================================
Observations within a cluster are modeled as being independent.

Here is Stata code to fit this same model:

clear

copy http://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/VietNamI.csv vietnam.csv, replace                          
import delimited vietnam

generate sex2 = 1 if sex == "male"
replace sex2 = 0 if sex == "female"

egen zeduc = std(educ)
gen age2 = exp(age)

xtgee lnhhexp age2 sex2 married zeduc insurance injury actdays,///
      fam(gaus) link(iden) i(commune) corr(ind) robust

Here is the output from Stata:

Iteration 1: tolerance = 1.062e-13

GEE population-averaged model                   Number of obs      =     27765
Group variable:                    commune      Number of groups   =       194
Link:                             identity      Obs per group: min =        51
Family:                           Gaussian                     avg =     143.1
Correlation:                   independent                     max =       206
                                                Wald chi2(7)       =    182.43
Scale parameter:                  .3605621      Prob > chi2        =    0.0000

Pearson chi2(27765):              10011.01      Deviance           =  10011.01
Dispersion (Pearson):             .3605621      Dispersion         =  .3605621

                                (Std. Err. adjusted for clustering on commune)
------------------------------------------------------------------------------
             |               Robust
     lnhhexp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        age2 |   .0014177    .000447     3.17   0.002     .0005415    .0022938
        sex2 |  -.0071507   .0059765    -1.20   0.232    -.0188645    .0045631
     married |  -.0561731   .0130533    -4.30   0.000     -.081757   -.0305892
       zeduc |   .1472443   .0162504     9.06   0.000     .1153941    .1790946
   insurance |   .1523975   .0274584     5.55   0.000       .09858     .206215
      injury |   .0211572   .0736902     0.29   0.774    -.1232731    .1655874
     actdays |   -.006571   .0051319    -1.28   0.200    -.0166293    .0034873
       _cons |   2.564619   .0311538    82.32   0.000     2.503559    2.625679
------------------------------------------------------------------------------

Next we fit the model using GEE with an exchangeable dependence structure.

In [6]:
ex = Exchangeable()
fml = "lnhhexp ~ age + sex + married + educ + insurance + injury + actdays"
mod_gee = GEE.from_formula(fml, groups=data["commune"], cov_struct=ex, data=data)
mod_gee = mod_gee.fit()
print mod_gee.summary()
print mod_gee.cov_struct.summary()
                               GEE Regression Results                              
===================================================================================
Dep. Variable:                     lnhhexp   No. Observations:                27765
Model:                                 GEE   No. clusters:                      194
Method:                        Generalized   Min. cluster size:                  51
                      Estimating Equations   Max. cluster size:                 206
Family:                           Gaussian   Mean cluster size:               143.1
Dependence structure:         Exchangeable   Num. iterations:                     5
Date:                     Sun, 20 Jul 2014   Scale:                           0.364
Covariance type:                    robust   Time:                         06:49:01
===============================================================================
                  coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept       2.6034      0.032     82.066      0.000         2.541     2.666
sex[T.male]    -0.0050      0.004     -1.163      0.245        -0.013     0.003
age            -0.0009      0.000     -3.522      0.000        -0.001    -0.000
married         0.0172      0.009      1.960      0.050      4.87e-07     0.034
educ            0.1238      0.009     14.535      0.000         0.107     0.141
insurance       0.0881      0.014      6.522      0.000         0.062     0.115
injury          0.0409      0.046      0.886      0.376        -0.050     0.132
actdays        -0.0077      0.004     -1.983      0.047        -0.015 -9.03e-05
==============================================================================
Skew:                          0.4829   Kurtosis:                       0.5541
Centered skew:                 0.0832   Centered kurtosis:              0.9251
==============================================================================
The correlation between two observations in the same cluster is 0.483
The Stata command to fit this model is:

xtgee lnhhexp age2 sex2 married zeduc insurance injury actdays, ///              
      fam(gaus) link(iden) i(commune) corr(exc) robust

The result of fitting this model is:
GEE population-averaged model                   Number of obs      =     27765
Group variable:                    commune      Number of groups   =       194
Link:                             identity      Obs per group: min =        51
Family:                           Gaussian                     avg =     143.1
Correlation:                  exchangeable                     max =       206
                                                Wald chi2(7)       =    440.94
Scale parameter:                  .3634145      Prob > chi2        =    0.0000

                                (Std. Err. adjusted for clustering on commune)
------------------------------------------------------------------------------
             |               Robust
     lnhhexp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        age2 |  -.0008954   .0002549    -3.51   0.000    -.0013949   -.0003958
        sex2 |  -.0050254   .0043309    -1.16   0.246    -.0135139     .003463
     married |   .0172334   .0088151     1.95   0.051    -.0000438    .0345106
       zeduc |   .1238274   .0085411    14.50   0.000     .1070871    .1405676
   insurance |   .0881371   .0135478     6.51   0.000     .0615838    .1146903
      injury |   .0409336   .0463357     0.88   0.377    -.0498827    .1317499
     actdays |  -.0077043   .0038948    -1.98   0.048    -.0153381   -.0000706
       _cons |   2.603396   .0318052    81.85   0.000     2.541059    2.665733
------------------------------------------------------------------------------

Next we fit a linear mixed model, with a random intercept for each commune.

In [7]:
mod_lme = MixedLM.from_formula(fml, groups=data["commune"], data=data)
mod_lme = mod_lme.fit()
print mod_lme.summary()
          Mixed Linear Model Regression Results
=========================================================
Model:            MixedLM Dependent Variable: lnhhexp    
No. Observations: 27765   Method:             REML       
No. Groups:       194     Scale:              0.1822     
Min. group size:  51      Likelihood:         -16275.3316
Max. group size:  206     Converged:          Yes        
Mean group size:  143.1                                  
---------------------------------------------------------
               Coef.  Std.Err.   z    P>|z| [0.025 0.975]
---------------------------------------------------------
Intercept       2.603    0.032 82.258 0.000  2.541  2.666
sex[T.male]    -0.005    0.005 -0.974 0.330 -0.015  0.005
age            -0.001    0.000 -5.436 0.000 -0.001 -0.001
married         0.017    0.007  2.591 0.010  0.004  0.030
educ            0.124    0.003 40.621 0.000  0.118  0.130
insurance       0.088    0.008 11.469 0.000  0.073  0.103
injury          0.041    0.033  1.244 0.214 -0.024  0.105
actdays        -0.008    0.003 -2.681 0.007 -0.013 -0.002
Z1 RE           0.189    0.046                           
=========================================================

Here is the Stata code to fit the mixed linear model:

xtmixed lnhhexp age2 sex2 married zeduc insurance injury actdays || commune:, reml

Here is the fitted mixed model from Stata:

Mixed-effects REML regression                   Number of obs      =     27765
Group variable: commune                         Number of groups   =       194

                                                Obs per group: min =        51
                                                               avg =     143.1
                                                               max =       206


                                                Wald chi2(7)       =   2076.70
Log restricted-likelihood = -16275.332          Prob > chi2        =    0.0000

------------------------------------------------------------------------------
     lnhhexp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        age2 |  -.0008973   .0001651    -5.44   0.000    -.0012208   -.0005738
        sex2 |  -.0050251   .0051614    -0.97   0.330    -.0151412    .0050911
     married |   .0172924   .0066729     2.59   0.010     .0042137     .030371
       zeduc |   .1237944   .0030475    40.62   0.000     .1178215    .1297673
   insurance |   .0880734   .0076774    11.47   0.000      .073026    .1031209
      injury |   .0409419   .0328861     1.24   0.213    -.0235137    .1053974
     actdays |  -.0077056   .0028735    -2.68   0.007    -.0133376   -.0020737
       _cons |   2.603433   .0316463    82.27   0.000     2.541407    2.665458
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
commune: Identity            |
                   sd(_cons) |   .4345079   .0222826      .3929579    .4804512
-----------------------------+------------------------------------------------
                sd(Residual) |   .4268603    .001818      .4233119    .4304385
------------------------------------------------------------------------------
LR test vs. linear regression: chibar2(01) = 17991.25 Prob >= chibar2 = 0.0000

The random effect labeed "Z1 RE" in the Statsmodels output is a variance parameter, whereas Stata produces a standard deviation parameter. The results are equivalent:

In [8]:
0.4345079**2
Out[8]:
0.18879711516241

Next we compare the parameter estimates from the GEE and mixed models. For this linear model, the parameter estimates are nearly identical. This will generally be true for linear models, but not for nonlinear models.

In [9]:
plt.plot(mod_lme.params[1:-1], mod_gee.params[1:], 'o')
plt.plot([0, 0.14], [0, 0.14], '-')
plt.xlabel("MixedLM coefficient", size=14)
plt.ylabel("GEE coefficient", size=14)
Out[9]:
<matplotlib.text.Text at 0x545e490>

Next we compare the standard errors. The GEE standard errors will typically be larger than the mixed model standard errors. For linear models, GEE and mixed models estimate the same population parameters. Since the mixed model uses maximum likelihood (or, almost equivalently, restricted maximum likelihood), it should be nearly optimal in terms of statistical efficiency, and hence will have the smallest standard errors of any (approximately) unbiased procedure. One possible advantage of GEE is that it does not model the random intercepts as being normally distributed.

In [10]:
plt.plot(mod_lme.bse[1:-1], mod_gee.bse[1:], 'o')
plt.plot([0,0.05], [0,0.05], '-')
plt.xlabel("MixedLM SE", size=14)
plt.ylabel("GEE SE", size=14)
Out[10]:
<matplotlib.text.Text at 0x5a718d0>

Finally, we compare the Z-scores. The Z-scores are fairly consistent for the smaller values (e.g. |Z| < 3), where small differences in the Z-scores may be important. For the effects that are clearly real, the mixed models Z-scores have larger magnitudes, but this has little impact on our interpretation of the results.

In [11]:
plt.plot(mod_lme.tvalues[1:-1], mod_gee.tvalues[1:], 'o')
plt.plot([-10,15], [-10,15], '-')
plt.xlabel("MixedLM Z-score", size=14)
plt.ylabel("GEE Z-score", size=14)
Out[11]:
<matplotlib.text.Text at 0x5fd8e10>