Marginal regression models for proportions with GEE

Following Papke and Woolridge (1996, 2008) we can use GEE to fit a marginal probit or logit model when the dependent variable is a proportion.

See here for additional information about the methods and data set:

http://faculty.smu.edu/millimet/classes/eco6375/papers/papke%20wooldridge%202008.pdf

In [1]:
import statsmodels.api as sm
import numpy as np
import pandas as pd
In [2]:
data = pd.io.stata.read_stata("meap92_01.dta")

The response variable is a proportion. There are no zeros, but there are 22 cases n which the response variable is excactly equal to 1.

In [3]:
hist(data.math4)
plt.xlabel("math4")
print np.sum(data.math4 == 0)
print np.sum(data.math4 == 1)
0
22

Here is the probit model used by Papke and Woolridge:

In [4]:
fml = "math4 ~ lavgrexp + lunch + lenroll + alavgrexp + alunch + alenroll + y96 + y97 + y98 + y99 + y00 + y01"
fa = sm.families.Binomial(sm.families.links.probit)
model = sm.GEE.from_formula(fml, "distid", data, cov_struct=sm.cov_struct.Exchangeable(), family=fa)
result = model.fit()
In [5]:
print result.summary()
print result.cov_struct.summary()
                               GEE Regression Results                              
===================================================================================
Dep. Variable:                       math4   No. Observations:                 3507
Model:                                 GEE   No. clusters:                      501
Method:                        Generalized   Min. cluster size:                   7
                      Estimating Equations   Max. cluster size:                   7
Family:                           Binomial   Mean cluster size:                 7.0
Dependence structure:         Exchangeable   Num. iterations:                     7
Date:                     Sat, 11 Oct 2014   Scale:                           0.064
Covariance type:                    robust   Time:                         15:51:47
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -1.9147      0.752     -2.546      0.011        -3.389    -0.441
lavgrexp       0.8846      0.206      4.297      0.000         0.481     1.288
lunch         -0.2373      0.209     -1.136      0.256        -0.647     0.172
lenroll        0.0876      0.139      0.632      0.528        -0.184     0.359
alavgrexp     -0.5835      0.223     -2.612      0.009        -1.021    -0.146
alunch        -0.9755      0.217     -4.499      0.000        -1.401    -0.550
alenroll      -0.0821      0.139     -0.589      0.556        -0.355     0.191
y96           -0.0365      0.018     -2.045      0.041        -0.071    -0.002
y97           -0.1471      0.027     -5.390      0.000        -0.201    -0.094
y98            0.2515      0.034      7.471      0.000         0.186     0.318
y99            0.2149      0.037      5.867      0.000         0.143     0.287
y00            0.3046      0.040      7.640      0.000         0.226     0.383
y01            0.2257      0.044      5.147      0.000         0.140     0.312
==============================================================================
Skew:                         -0.1648   Kurtosis:                       0.7052
Centered skew:                -0.1670   Centered kurtosis:              2.0229
==============================================================================
The correlation between two observations in the same cluster is 0.490

Here is the same analaysis done in Stata:

. xtgee math4 lavgrexp lunch lenroll alavgrexp alunch alenroll y96-y01, fa(bi) link(probit) corr(exch) robust

Iteration 1: tolerance = .01938386
Iteration 2: tolerance = .00128988
Iteration 3: tolerance = .00001032
Iteration 4: tolerance = 2.708e-07

GEE population-averaged model                   Number of obs      =      3507
Group variable:                     distid      Number of groups   =       501
Link:                               probit      Obs per group: min =         7
Family:                           binomial                     avg =       7.0
Correlation:                  exchangeable                     max =         7
                                                Wald chi2(12)      =   1815.43
Scale parameter:                         1      Prob > chi2        =    0.0000

                                 (Std. Err. adjusted for clustering on distid)
------------------------------------------------------------------------------
             |             Semirobust
       math4 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    lavgrexp |    .884564   .2060662     4.29   0.000     .4806817    1.288446
       lunch |  -.2372942   .2091221    -1.13   0.256    -.6471659    .1725775
     lenroll |   .0875629   .1387427     0.63   0.528    -.1843677    .3594935
   alavgrexp |  -.5835138   .2236705    -2.61   0.009      -1.0219   -.1451277
      alunch |  -.9754696   .2170624    -4.49   0.000    -1.400904   -.5500351
    alenroll |  -.0820307   .1393712    -0.59   0.556    -.3551933    .1911318
         y96 |  -.0364771   .0178529    -2.04   0.041    -.0714681    -.001486
         y97 |  -.1471389   .0273264    -5.38   0.000    -.2006976   -.0935801
         y98 |   .2515377   .0337018     7.46   0.000     .1854833     .317592
         y99 |   .2148552   .0366599     5.86   0.000      .143003    .2867073
         y00 |   .3046286   .0399143     7.63   0.000     .2263981    .3828591
         y01 |   .2256619   .0438877     5.14   0.000     .1396437    .3116801
       _cons |  -1.914975   .7528262    -2.54   0.011    -3.390487   -.4394628
------------------------------------------------------------------------------

. . estat wcorrelation

Estimated within-distid correlation matrix R:

      |        c1         c2         c3         c4         c5         c6         c7 
------+-----------------------------------------------------------------------------
   r1 |         1                                                                   
   r2 |  .4914078          1                                                        
   r3 |  .4914078   .4914078          1                                             
   r4 |  .4914078   .4914078   .4914078          1                                  
   r5 |  .4914078   .4914078   .4914078   .4914078          1                       
   r6 |  .4914078   .4914078   .4914078   .4914078   .4914078          1            
   r7 |  .4914078   .4914078   .4914078   .4914078   .4914078   .4914078          1

Here is the same model, except with a logit link:

In [6]:
model2 = sm.GEE.from_formula(fml, "distid", data, cov_struct=sm.cov_struct.Exchangeable(), family=sm.families.Binomial())
result2 = model2.fit()
print result2.summary()
print result2.cov_struct.summary()
                               GEE Regression Results                              
===================================================================================
Dep. Variable:                       math4   No. Observations:                 3507
Model:                                 GEE   No. clusters:                      501
Method:                        Generalized   Min. cluster size:                   7
                      Estimating Equations   Max. cluster size:                   7
Family:                           Binomial   Mean cluster size:                 7.0
Dependence structure:         Exchangeable   Num. iterations:                     8
Date:                     Sat, 11 Oct 2014   Scale:                           0.064
Covariance type:                    robust   Time:                         15:51:48
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -3.1770      1.268     -2.505      0.012        -5.662    -0.692
lavgrexp       1.3917      0.344      4.050      0.000         0.718     2.065
lunch         -0.4209      0.346     -1.216      0.224        -1.099     0.258
lenroll        0.1891      0.230      0.821      0.412        -0.262     0.641
alavgrexp     -0.9014      0.379     -2.378      0.017        -1.644    -0.159
alunch        -1.5630      0.358     -4.365      0.000        -2.265    -0.861
alenroll      -0.1734      0.231     -0.750      0.454        -0.627     0.280
y96           -0.0584      0.029     -1.988      0.047        -0.116    -0.001
y97           -0.2363      0.045     -5.215      0.000        -0.325    -0.148
y98            0.4248      0.056      7.520      0.000         0.314     0.536
y99            0.3653      0.061      5.945      0.000         0.245     0.486
y00            0.5203      0.067      7.761      0.000         0.389     0.652
y01            0.3886      0.074      5.275      0.000         0.244     0.533
==============================================================================
Skew:                         -0.1580   Kurtosis:                       0.7019
Centered skew:                -0.1662   Centered kurtosis:              2.0124
==============================================================================
The correlation between two observations in the same cluster is 0.491

Here is the same analysis using Stata:

. xtgee math4 lavgrexp lunch lenroll alavgrexp alunch alenroll y96-y01, fa(bi) link(logit) corr(exch) robust

Iteration 1: tolerance = .03823289
Iteration 2: tolerance = .0038322
Iteration 3: tolerance = .00004492
Iteration 4: tolerance = 4.634e-06
Iteration 5: tolerance = 5.825e-08

GEE population-averaged model                   Number of obs      =      3507
Group variable:                     distid      Number of groups   =       501
Link:                                logit      Obs per group: min =         7
Family:                           binomial                     avg =       7.0
Correlation:                  exchangeable                     max =         7
                                                Wald chi2(12)      =   1725.03
Scale parameter:                         1      Prob > chi2        =    0.0000

                                 (Std. Err. adjusted for clustering on distid)
------------------------------------------------------------------------------
             |               Robust
       math4 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    lavgrexp |   1.391682   .3439966     4.05   0.000     .7174607    2.065903
       lunch |  -.4208871   .3465353    -1.21   0.225    -1.100084    .2583097
     lenroll |   .1890935   .2305562     0.82   0.412    -.2627884    .6409754
   alavgrexp |  -.9014303   .3794192    -2.38   0.018    -1.645078   -.1577823
      alunch |  -1.562824   .3584047    -4.36   0.000    -2.265284   -.8603634
    alenroll |  -.1733469   .2315958    -0.75   0.454    -.6272663    .2805725
         y96 |  -.0584388   .0294272    -1.99   0.047     -.116115   -.0007626
         y97 |  -.2363386   .0453651    -5.21   0.000    -.3252525   -.1474246
         y98 |   .4248137   .0565481     7.51   0.000     .3139815    .5356459
         y99 |   .3653465   .0615172     5.94   0.000     .2447749     .485918
         y00 |    .520279   .0671024     7.75   0.000     .3887606    .6517973
         y01 |   .3886253   .0737458     5.27   0.000     .2440861    .5331645
       _cons |  -3.177677   1.269384    -2.50   0.012    -5.665625   -.6897297
------------------------------------------------------------------------------

. . estat wcorrelation

Estimated within-distid correlation matrix R:

      |        c1         c2         c3         c4         c5         c6         c7 
------+-----------------------------------------------------------------------------
   r1 |         1                                                                   
   r2 |  .4925912          1                                                        
   r3 |  .4925912   .4925912          1                                             
   r4 |  .4925912   .4925912   .4925912          1                                  
   r5 |  .4925912   .4925912   .4925912   .4925912          1                       
   r6 |  .4925912   .4925912   .4925912   .4925912   .4925912          1            
   r7 |  .4925912   .4925912   .4925912   .4925912   .4925912   .4925912          1