Josef Perktold All Rights Reserved
In this notebook we compare three types of models from statsmodels that, in some cases, represent the same underlying model. The models are for binary or count responses in the linear exponential family. The three model groups are:
discrete models
: Logit, Probit, Poisson and Negative Binomial in statsmodels.discrete which are standard maximum likelihood models.
generalized linear model (GLM)
: A general model class for one parameter distributions in the linear exponential family, which includes Binomial with Logit or Probit link, Poisson, and Negative Binomial with fixed dispersion.
generalized estimating equations models (GEE)
: GEE covers the distributions of GLM, but allows for cluster correlation with one-way panel or cluster data. When we use the Independence correlation structure, observations are treated as uncorrelated in the estimation of the parameters, which is the same as in the corresponding GLM and discrete models.
The parameters of the conditional expectation with distributions in the linear exponential family (LEF) can be consistently estimated even if the correlation and variances are misspecified. However, if those are misspecified, the usual standard errors for the parameter estimates are not correct. We can use robust sandwich covariance matrices, which are robust to misspecified correlation or heteroscedasticity. Overdispersion in Poisson or Binomial models is a special case of misspecified variances.
GEE does not assume by default that the within cluster correlation is correctly specified. All other models report the usual "nonrobust" standard errors by default; however, we can choose a robust covariance in those models using the cov_type keyword in fit. This means that GLM or the discrete models with cluster robust standard errors are equivalent to GEE with independence correlation, except for different small sample corrections. While GEE is specialized to one-way cluster correlation, with GLM or the discrete models we can also choose heteroscedasticity robust (HC), autocorrelation and heteroscedasticity robust (HAC), or panel time series robust covariance matrices.
GLM and the discrete models have a large overlap of identical models; however, they differ in their approach and in the directions in which it is possible or easy to extend them. GLM comes mostly from the statistics tradition, its main optimization approach is iteratively reweighted least squares, and it covers, besides discrete distributions, also continuous LEF distributions, where the mean function can be estimated without specifying the dispersion parameter. GLM also provides a choice of nonlinearities through the specification of a link function. The discrete models in statsmodels come mostly from the econometrics tradition and use standard Newton or quasi-Newton optimizers. These optimizers allow easier extensions to multiparameter distributions. For example, GLM with Negative Binomial takes the dispersion parameter as given, while the NegativeBinomial model in discrete estimates the dispersion parameter jointly with the parameters of the mean function. There is a trend towards convergent evolution in statsmodels, so that GLM and the corresponding discrete models become more similar in their available features.
GLM, the discrete models, the linear models (OLS, WLS), and other models that inherit the generic maximum likelihood model features share the same implementation of robust standard errors in statsmodels and have the same options. Models that use a robust covariance by default, GEE, RLM and QuantileRegression, have specialized implementations and do not yet follow the same pattern, or have different choices of robust covariance matrices.
In the following we compare these models for the Poisson, Negative Binomial, and the Binomial-Logit and Binomial-Probit cases.
About the notebook
This notebook does not use unique model and result names, so running only parts of it might pick up the wrong instances; that is, it will use the latest instance that has been calculated, not the one calculated in the preceding cells. Use Run All or Run All Above.
get_robustcov_results
History prior to inclusion in this repo:
initial script try_robust_poisson.py
notebook version in gist
First we need some imports. This includes importing the example data from a unittest module, where it is already given as endog and exog.
The initial version, and the internal implementation, uses a helper function get_robustcov_results that does the sandwich covariance calculations. This function is no longer intended for users. The standard usage is now through the cov_type argument in the fit method of the model. Besides simpler access to the robust covariances, this also provides a more robust code path that avoids potential conflicts between calculations based on different covariance types.
import pandas as pd
The following is reusing some setup code from a test module
from statsmodels.discrete.tests.test_sandwich_cov import *
Imports for Generalized Estimating Equations, GEE, which is not part of the unit test module.
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod.generalized_estimating_equations import GEE
from statsmodels.genmod.cov_struct import Independence
res2 = results_st.results_poisson_hc1
mod = smd.Poisson(endog, exog)
res1 = mod.fit(disp=False)
print(res1.bse)
yr_con 0.053336 op_75_79 0.112778 const 0.135680 dtype: float64
# TODO: move to an Appendix
# the code fragility with cached attributes and retro-fitting of results
from statsmodels.base.covtype import get_robustcov_results
# Warning: why does this give different results than cov_type in fit in the next cell?
# Because bse has already been cached by the print in the previous cell.
#res_hc0_ = cls.res1.get_robustcov_results('HC1')
get_robustcov_results(res1._results, 'HC1', use_self=True)
bse_rob = res1.bse
nobs, k_vars = mod.exog.shape
corr_fact = (nobs) / float(nobs - 1.)
# for bse we need sqrt of correction factor
corr_fact = np.sqrt(1./corr_fact)
print(res1.bse)
# create the result but don't access anything before calculating the robust cov, then it works
res1h = mod.fit(disp=False)
get_robustcov_results(res1h._results, 'HC1', use_self=True)
print(res1h.bse)
# this raises exception, use_self=False is not supported anymore
#res1h2 = get_robustcov_results(res1h._results, 'HC1', use_self=False)
yr_con 0.053336 op_75_79 0.112778 const 0.135680 dtype: float64 yr_con 0.189488 op_75_79 0.544821 const 0.655468 dtype: float64
The default covariance in Poisson and GLM is cov_type="nonrobust", which assumes that the likelihood function, or, in the LEF case, the mean and variance of the response variable, are correctly specified.
res1f_nr = mod.fit(disp=False)
print(res1f_nr.cov_type)
print(res1f_nr.bse)
nonrobust yr_con 0.053336 op_75_79 0.112778 const 0.135680 dtype: float64
If the cov_type is one of the "HC" variants, then the Godambe-Eicker-Huber-White robust standard errors are calculated. These are robust to heteroscedasticity and a misspecified distribution, but assume that the observations are independently distributed or uncorrelated.
Note: Distinct HCx variants are only available for the linear regression models. In other models, all HCx variants are currently treated in the same way and do not differ in their small sample adjustments.
res1f_hc1 = mod.fit(disp=False, cov_type='HC1', cov_kwds={})
print(res1f_hc1.cov_type)
print(res1f_hc1.bse)
HC1 yr_con 0.189488 op_75_79 0.544821 const 0.655468 dtype: float64
res1f_hc1 = mod.fit(disp=False, cov_type='HC1')
print(res1f_hc1.cov_type)
print(res1f_hc1.bse)
HC1 yr_con 0.189488 op_75_79 0.544821 const 0.655468 dtype: float64
We can calculate cluster robust standard errors using cov_type='cluster' and providing the group labels that indicate cluster membership in cov_kwds. cov_kwds needs to be a dictionary with the information and keys corresponding to the cov_type.
res1f = mod.fit(disp=False, cov_type='cluster', cov_kwds=dict(groups=group))
print(res1f.cov_type)
print(res1f.bse)
cluster yr_con 0.205667 op_75_79 0.114459 const 1.139941 dtype: float64
res_clu_nc = mod.fit(cov_type='cluster', cov_kwds=dict(groups=group, use_correction=False))
print(res_clu_nc.bse)
Optimization terminated successfully. Current function value: 10.418931 Iterations 14 yr_con 0.178293 op_75_79 0.099224 const 0.988214 dtype: float64
GLM with family Poisson and the default link function, which is log, is equivalent to the Poisson model in discrete.
print('\nGLM')
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod.families import Poisson
from statsmodels.genmod import families
mod = GLM(endog, exog, family=Poisson())
#old version
#res_glm_nr = mod.fit()
#get_robustcov_results(res_glm_nr._results, 'HC1', use_self=True)
res_glm_nr = mod.fit(cov_type='nonrobust') # this is also the default
print(res_glm_nr.cov_type)
print(res_glm_nr.bse)
print(np.sqrt(np.diag(res_glm_nr.cov_params()))) # just checking
GLM nonrobust yr_con 0.053336 op_75_79 0.112778 const 0.135680 dtype: float64 [ 0.05333642 0.11277776 0.13568042]
res_glm_clu = mod.fit()
get_robustcov_results(res_glm_clu._results, cov_type='cluster', groups=group, use_self=True)
print(res_glm_clu.cov_type)
print(res_glm_clu.bse)
print(np.sqrt(np.diag(res_glm_clu.cov_params())))
cluster yr_con 0.205667 op_75_79 0.114459 const 1.139941 dtype: float64 [ 0.20566683 0.11445894 1.13994051]
res_glm_clu2 = mod.fit(cov_type='cluster', cov_kwds=dict(groups=group))
print(res_glm_clu2.cov_type + ' - fit')
print(res_glm_clu2.bse)
print(np.sqrt(np.diag(res_glm_clu2.cov_params())))
cluster - fit yr_con 0.205667 op_75_79 0.114459 const 1.139941 dtype: float64 [ 0.20566683 0.11445894 1.13994051]
res_glm_hc1 = mod.fit(cov_type='HC1', cov_kwds={})
print(res_glm_hc1.cov_type + ' - fit')
print(res_glm_hc1.bse)
HC1 - fit yr_con 0.189488 op_75_79 0.544821 const 0.655468 dtype: float64
res_glm_clu_nsc = mod.fit(cov_type='cluster', cov_kwds=dict(groups=group, correction=False))
print(res_glm_clu_nsc.bse)
yr_con 0.205667 op_75_79 0.114459 const 1.139941 dtype: float64
GEE has three options for the covariance of the parameter estimates that are specific to GEE. Covariance type 'naive' is the standard nonrobust covariance; it assumes that the correlation and variances are correctly specified and, under independence, is identical to the one in GLM or the corresponding discrete model. "robust" requests cluster robust standard errors that do not use the small sample degrees of freedom correction in the denominator that GLM and the discrete models use by default. The "bias_reduced" covariance is a cluster robust covariance that reduces the bias of the estimate and improves the coverage and size of Wald tests in small samples.
fam = Poisson() #families.Poisson()
ind = Independence()
mod1 = GEE(endog, exog, group, cov_struct=ind, family=fam)
res_gee = mod1.fit()
#print res_gee.summary()
print(res_gee.bse)
yr_con 0.178293 op_75_79 0.099224 const 0.988214 dtype: float64
res_gee_nr = mod1.fit(cov_type='naive')
print(res_gee_nr.bse)
yr_con 0.053336 op_75_79 0.112778 const 0.135680 dtype: float64
res_gee_bc = mod1.fit(cov_type='bias_reduced')
print(res_gee_bc.bse)
yr_con 0.235464 op_75_79 0.119337 const 1.260559 dtype: float64
The robust standard errors differ from those of the cluster robust GLM or Poisson model, which lie between the "robust" and the bias reduced robust standard errors of GEE.
print(res_glm_clu2.bse.values, 'GLM')
print(res_gee.bse.values, 'GEE robust')
print(res_gee_bc.bse.values, 'GEE bias reduced')
[ 0.20566683 0.11445894 1.13994051] GLM [ 0.17829252 0.09922443 0.98821413] GEE robust [ 0.23546361 0.11933672 1.26055918] GEE bias reduced
The cluster robust covariance matrices in GLM and the discrete models make by default a correction for a small number of clusters. The default 'robust' standard errors in GEE do not make this adjustment, which explains the scale differences. Note that we have a very small number of clusters in this example, and any differences in small sample adjustments can have a large effect in this case.
The bias reduced robust covariance in GEE is not just a scale adjustment; the standard errors differ in this example by 4% to 15% from the GLM cluster robust standard errors with small sample correction.
# compare GEE with robust and Poisson with cluster with correction.
res_gee.bse / res_glm_clu2.bse.values
yr_con 0.8669 op_75_79 0.8669 const 0.8669 dtype: float64
# compare GEE with robust and Poisson with cluster without correction.
res_gee.bse / res_clu_nc.bse
yr_con 1 op_75_79 1 const 1 dtype: float64
# compare GEE with bias-reduced and GLM with cluster with correction.
print(res_gee_bc.bse.values / res_glm_clu2.bse.values)
[ 1.14487891 1.04261596 1.10581136]
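The uniform ratio of about 0.8669 above can be reproduced directly. Assuming the Stata-style small sample correction g/(g-1) * (n-1)/(n-k) for the cluster robust covariance (g clusters, n observations, k parameters), its inverse square root matches the ratio of the GEE 'robust' to the corrected standard errors for this dataset:

```python
import numpy as np

nobs, n_groups, k_params = 34, 5, 3
corr_fact = (n_groups / (n_groups - 1.0)) * ((nobs - 1.0) / (nobs - k_params))
# standard errors scale with the square root of the covariance correction
print(np.sqrt(1.0 / corr_fact))  # ≈ 0.8669, matching the ratio above
```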
The estimated Poisson model shows overdispersion. There is currently no option for robust covariance matrices in GLM or Poisson that only corrects for over- or underdispersion. The HC standard errors correct for dispersion but also for general unspecified heteroscedasticity. An estimate of the dispersion can be obtained by dividing the sum of squared Pearson residuals by the number of observations minus the number of estimated parameters.
np.sqrt(res_glm_clu2.pearson_chi2 / res_glm_clu2.df_resid)
5.0365298571458696
print(res_gee.summary())
GEE Regression Results =================================================================================== Dep. Variable: accident No. Observations: 34 Model: GEE No. clusters: 5 Method: Generalized Min. cluster size: 6 Estimating Equations Max. cluster size: 7 Family: Poisson Mean cluster size: 6.8 Dependence structure: Independence Num. iterations: 14 Date: Sun, 18 Oct 2015 Scale: 1.000 Covariance type: robust Time: 15:12:32 ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ yr_con -0.0217 0.178 -0.122 0.903 -0.371 0.328 op_75_79 0.2215 0.099 2.232 0.026 0.027 0.416 const 2.2697 0.988 2.297 0.022 0.333 4.207 ============================================================================== Skew: 1.8833 Kurtosis: 2.5279 Centered skew: -0.0381 Centered kurtosis: 2.2025 ==============================================================================
In the following we compare the results of the three models with cluster robust standard errors. The first table confirms that the parameter estimates are identical, up to differences in the convergence criteria of the optimization.
res_all = [res1f, res_glm_clu, res_gee]
res_names = ['Poisson-cluster', 'GLM-cluster', 'GEE']
#print pd.concat((res_glm_clu.params, res_gee.params), axis=1)
print('Parameters')
pd.DataFrame(np.round(np.column_stack([np.asarray(getattr(res, 'params')) for res in res_all]), 4),
index=res_glm_clu.params.index, columns=res_names)
Parameters
Poisson-cluster | GLM-cluster | GEE | |
---|---|---|---|
yr_con | -0.0217 | -0.0217 | -0.0217 |
op_75_79 | 0.2215 | 0.2215 | 0.2215 |
const | 2.2697 | 2.2697 | 2.2697 |
print('Standard Errors')
pd.DataFrame(np.round(np.column_stack([np.asarray(getattr(res, 'bse')) for res in res_all]), 4),
index=res_glm_clu.params.index, columns=res_names)
Standard Errors
Poisson-cluster | GLM-cluster | GEE | |
---|---|---|---|
yr_con | 0.2057 | 0.2057 | 0.1783 |
op_75_79 | 0.1145 | 0.1145 | 0.0992 |
const | 1.1399 | 1.1399 | 0.9882 |
print('\nNegative Binomial')
mod = smd.NegativeBinomial(endog, exog)
res_glm_nr = mod.fit(disp=False)
get_robustcov_results(res_glm_nr._results, 'HC1', use_self=True)
print(res_glm_nr.cov_type)
print(res_glm_nr.bse)
print(np.sqrt(np.diag(res_glm_nr.cov_params())))
Negative Binomial HC1 yr_con 0.257201 op_75_79 0.554121 const 0.749135 alpha 0.551116 dtype: float64 [ 0.25720074 0.55412081 0.74913526 0.55111619]
# old version TODO delete
res_glm_clu = mod.fit(disp=False)
get_robustcov_results(res_glm_clu._results, cov_type='cluster', groups=group, use_self=True)
print(res_glm_clu.cov_type)
print(res_glm_clu.bse)
print(np.sqrt(np.diag(res_glm_clu.cov_params())))
cluster yr_con 0.285445 op_75_79 0.104592 const 1.293787 alpha 0.549495 dtype: float64 [ 0.2854446 0.1045925 1.29378708 0.54949506]
res_glm_clu2 = mod.fit(disp=False, cov_type='cluster', cov_kwds=dict(groups=group))
print(res_glm_clu2.cov_type + ' - fit')
print(res_glm_clu2.bse)
print(np.sqrt(np.diag(res_glm_clu2.cov_params())))
bse_st = np.array([ 0.2721609 , 0.09972456, 1.23357855, 0.5239233 ])
print(bse_st)
print('Stata')
print(res_glm_clu2.bse.values / bse_st)
cluster - fit yr_con 0.285445 op_75_79 0.104592 const 1.293787 alpha 0.549495 dtype: float64 [ 0.2854446 0.1045925 1.29378708 0.54949506] [ 0.2721609 0.09972456 1.23357855 0.5239233 ] Stata [ 1.04880825 1.04881382 1.04880803 1.04880822]
res_glm_hc1 = mod.fit(disp=False, cov_type='HC1', cov_kwds={})
print(res_glm_hc1.cov_type + ' - fit')
print(res_glm_hc1.bse)
HC1 - fit yr_con 0.257201 op_75_79 0.554121 const 0.749135 alpha 0.551116 dtype: float64
print('\nGLM Logit')
endog_bin = (endog > endog.mean()).astype(int)
mod1 = GLM(endog_bin, exog,
family=families.Binomial())
res_glm_logit_clu = mod1.fit()
print(res_glm_logit_clu.cov_type + ' - fit')
print(res_glm_logit_clu.bse)
res_glm_logit_clu2 = mod1.fit(cov_type='cluster', cov_kwds=dict(groups=group))
print(res_glm_logit_clu2.cov_type + ' - fit')
print(res_glm_logit_clu2.bse)
print(np.sqrt(np.diag(res_glm_logit_clu2.cov_params())))
GLM Logit nonrobust - fit yr_con 0.385787 op_75_79 0.826742 const 1.048827 dtype: float64 cluster - fit yr_con 0.337142 op_75_79 0.593020 const 1.864865 dtype: float64 [ 0.33714199 0.59302014 1.86486478]
Note: because of copy-paste, some result names above are misleading; for example, res_glm_logit_clu uses the default nonrobust covariance despite its name.
print('\nLogit')
endog_bin = (endog > endog.mean()).astype(int)
mod1 = smd.Logit(endog_bin, exog)
res_logit_clu = mod1.fit()
print(res_logit_clu.cov_type + ' - fit')
print(res_logit_clu.bse)
res_logit_clu2 = mod1.fit(cov_type='cluster', cov_kwds=dict(groups=group))
print(res_logit_clu2.cov_type + ' - fit')
print(res_logit_clu2.bse)
print(np.sqrt(np.diag(res_logit_clu2.cov_params())))
Logit Optimization terminated successfully. Current function value: 0.589409 Iterations 5 nonrobust - fit yr_con 0.385787 op_75_79 0.826742 const 1.048827 dtype: float64 Optimization terminated successfully. Current function value: 0.589409 Iterations 5 cluster - fit yr_con 0.337142 op_75_79 0.593020 const 1.864865 dtype: float64 [ 0.33714199 0.59302014 1.86486478]
fam = families.Binomial()
ind = Independence()
mod1 = GEE(endog_bin, exog, group, cov_struct=ind, family=fam)
res_gee = mod1.fit() #start_params=res_glm_logit_clu.params)
#print res_gee.summary()
print(res_gee.bse)
yr_con 0.292268 op_75_79 0.514089 const 1.616651 dtype: float64
Compare the GLM cluster robust standard errors with the GEE bias reduced robust standard errors:
print(res_glm_logit_clu2.bse.values)
print(np.sqrt(np.diag(res_gee.cov_robust_bc)))
[ 0.33714199 0.59302014 1.86486478] [ 0.37699761 0.62587787 2.0525678 ]
res_all = [res_logit_clu2, res_glm_logit_clu2, res_gee]
res_names = ['Logit-cluster', 'GLM-cluster', 'GEE']
#print pd.concat((res_glm_clu.params, res_gee.params), axis=1)
print('Parameters')
pd.DataFrame(np.round(np.column_stack([np.asarray(getattr(res, 'params')) for res in res_all]), 4),
index=res_glm_logit_clu.params.index, columns=res_names)
Parameters
Logit-cluster | GLM-cluster | GEE | |
---|---|---|---|
yr_con | 0.3394 | 0.3394 | 0.3394 |
op_75_79 | 0.8832 | 0.8832 | 0.8832 |
const | -2.0881 | -2.0881 | -2.0881 |
print('Standard Errors')
pd.DataFrame(np.round(np.column_stack([np.asarray(getattr(res, 'bse')) for res in res_all]), 4),
index=res_glm_logit_clu.params.index, columns=res_names)
Standard Errors
Logit-cluster | GLM-cluster | GEE | |
---|---|---|---|
yr_con | 0.3371 | 0.3371 | 0.2923 |
op_75_79 | 0.5930 | 0.5930 | 0.5141 |
const | 1.8649 | 1.8649 | 1.6167 |
print('\nProbit')
endog_bin = (endog > endog.mean()).astype(int)
mod1 = smd.Probit(endog_bin, exog)
res_probit_clu = mod1.fit()
print(res_probit_clu.cov_type + ' - fit')
print(res_probit_clu.bse)
res_probit_clu2 = mod1.fit(cov_type='cluster', cov_kwds=dict(groups=group))
print(res_probit_clu2.cov_type + ' - fit')
print(res_probit_clu2.bse)
print(np.sqrt(np.diag(res_probit_clu2.cov_params())))
res_probit_hac = mod1.fit(cov_type='HAC', cov_kwds=dict(maxlags=3))
print(res_probit_hac.cov_type + ' - fit')
print(res_probit_hac.bse)
print(np.sqrt(np.diag(res_probit_hac.cov_params())))
Probit Optimization terminated successfully. Current function value: 0.589823 Iterations 5 nonrobust - fit yr_con 0.232729 op_75_79 0.490381 const 0.604882 dtype: float64 Optimization terminated successfully. Current function value: 0.589823 Iterations 5 cluster - fit yr_con 0.192395 op_75_79 0.307330 const 1.041974 dtype: float64 [ 0.19239499 0.30732984 1.04197447] Optimization terminated successfully. Current function value: 0.589823 Iterations 5 HAC - fit yr_con 0.241707 op_75_79 0.256575 const 0.820305 dtype: float64 [ 0.24170689 0.25657473 0.82030549]
#mod1 = GLM(endog_bin, exog, family=families.Binomial(links.CDFLink))
mod1 = GLM(endog_bin, exog, family=families.Binomial(links.probit))
res_glm_probit_clu = mod1.fit()
print(res_glm_probit_clu.cov_type + ' - fit')
print(res_glm_probit_clu.bse)
print(np.sqrt(np.diag(res_glm_probit_clu.cov_params())))
nonrobust - fit yr_con 0.232456 op_75_79 0.487844 const 0.610261 dtype: float64 [ 0.23245625 0.4878443 0.61026116]
res_glm_probit_clu2 = mod1.fit(cov_type='cluster', cov_kwds=dict(groups=group))
print(res_glm_probit_clu2.cov_type + ' - fit')
print(res_glm_probit_clu2.bse)
print(np.sqrt(np.diag(res_glm_probit_clu2.cov_params())))
cluster - fit yr_con 0.194533 op_75_79 0.312240 const 1.052924 dtype: float64 [ 0.19453294 0.31223975 1.05292424]
print('\nprobit params')
print(res_probit_clu.params)
print(res_glm_probit_clu.params)
print(res_probit_clu.bse)
print(res_glm_probit_clu.bse)
probit params yr_con 0.202947 op_75_79 0.523378 const -1.253350 dtype: float64 yr_con 0.202947 op_75_79 0.523378 const -1.253350 dtype: float64 yr_con 0.232729 op_75_79 0.490381 const 0.604882 dtype: float64 yr_con 0.232456 op_75_79 0.487844 const 0.610261 dtype: float64
print(pd.concat((res_probit_clu.params, res_glm_probit_clu.params), axis=1))
print(pd.concat((res_probit_clu.bse, res_glm_probit_clu.bse), axis=1))
print(pd.concat((res_probit_clu2.params, res_glm_probit_clu2.params), axis=1))
print(pd.concat((res_probit_clu2.bse, res_glm_probit_clu2.bse), axis=1))
# TODO: use summary_col to compare estimators
0 1 yr_con 0.202947 0.202947 op_75_79 0.523378 0.523378 const -1.253350 -1.253350 0 1 yr_con 0.232729 0.232456 op_75_79 0.490381 0.487844 const 0.604882 0.610261 0 1 yr_con 0.202947 0.202947 op_75_79 0.523378 0.523378 const -1.253350 -1.253350 0 1 yr_con 0.192395 0.194533 op_75_79 0.307330 0.312240 const 1.041974 1.052924
There is a difference in the numerical Hessian between GLM-Probit and discrete Probit, and consequently in the standard errors. Discrete Probit uses an analytical expression for the Hessian while GLM-Probit uses the generic GLM calculation where one part is based on finite difference calculations.
#res_probit_clu = res_probit_clu._results # not wrapped
res_glm_probit_clu = res_glm_probit_clu._results
sc_probit = res_probit_clu.model.jac(res_probit_clu.params)
sc_glm_probit = res_glm_probit_clu.model.score_obs(res_probit_clu.params)
print(np.max(np.abs(sc_probit - sc_glm_probit)))
hess_glm_probit = res_glm_probit_clu.model.hessian(res_probit_clu.params)
hess_probit = res_probit_clu.model.hessian(res_probit_clu.params)
print(np.max(np.abs(hess_probit - hess_glm_probit)))
print(hess_probit / hess_glm_probit)
8.881784197e-16 0.0319808160248 [[ 0.99975476 1.0007156 0.99958227] [ 1.0007156 0.99793767 0.99793767] [ 0.99958227 0.99793767 1.00047687]]
fam = families.Binomial(links.probit)
ind = Independence()
mod1 = GEE(endog_bin, exog, group, cov_struct=ind, family=fam)
res_gee = mod1.fit(start_params=res_glm_probit_clu.params)
#print res_gee.summary()
print(res_gee.bse)
yr_con 0.169572 op_75_79 0.274044 const 0.916809 dtype: float64
Compare the GLM cluster robust standard errors with the GEE bias reduced robust standard errors:
print(res_glm_probit_clu2.bse.values)
print(np.sqrt(np.diag(res_gee.cov_robust_bc)))
[ 0.19453294 0.31223975 1.05292424] [ 0.21855456 0.33301099 1.16412386]
res_all = [res_probit_clu, res_glm_probit_clu, res_gee]
res_names = ['Probit-cluster', 'GLM-cluster', 'GEE']
#print pd.concat((res_glm_clu.params, res_gee.params), axis=1)
print('Parameters')
pd.DataFrame(np.round(np.column_stack([np.asarray(getattr(res, 'params')) for res in res_all]), 4),
index=res_probit_clu.params.index, columns=res_names)
Parameters
Probit-cluster | GLM-cluster | GEE | |
---|---|---|---|
yr_con | 0.2029 | 0.2029 | 0.2029 |
op_75_79 | 0.5234 | 0.5234 | 0.5234 |
const | -1.2534 | -1.2534 | -1.2534 |
print('Standard Errors')
pd.DataFrame(np.round(np.column_stack([np.asarray(getattr(res, 'bse')) for res in res_all]), 4),
index=res_probit_clu.params.index, columns=res_names)
Standard Errors
Probit-cluster | GLM-cluster | GEE | |
---|---|---|---|
yr_con | 0.2327 | 0.2325 | 0.1696 |
op_75_79 | 0.4904 | 0.4878 | 0.2740 |
const | 0.6049 | 0.6103 | 0.9168 |
This is a stub, just for a quick comparison.
We compare OLS, GLM and GEE, with Gaussian family and linear link, which are the defaults for GLM and GEE.
As we can see below, both the nonrobust and the cluster robust (without correction) standard errors are identical across the three models.
import statsmodels.regression.linear_model as linear
mod_ols = linear.OLS(endog, exog)
res_ols_nr = mod_ols.fit()
res_ols_clu_nc = mod_ols.fit(cov_type='cluster', cov_kwds=dict(groups=group, use_correction=False))
mod_glmgau = GLM(endog, exog)
res_glmgau_nr = mod_glmgau.fit()
res_glmgau_clu_nc = mod_glmgau.fit(cov_type='cluster', cov_kwds=dict(groups=group, use_correction=False))
mod_geegau = GEE(endog, exog, group, cov_struct=Independence())
res_geegau_nr = mod_geegau.fit(cov_type='naive')
res_geegau_rob = mod_geegau.fit(cov_type='robust')
res_geegau_bc = mod_geegau.fit(cov_type='bias_reduced')
print(res_ols_nr.bse.values)
print(res_glmgau_nr.bse.values)
print(res_geegau_nr.bse.values)
[ 2.83491152 5.82922209 7.04487696] [ 2.83491152 5.82922209 7.04487696] [ 2.83491152 5.82922209 7.04487696]
print(res_ols_clu_nc.bse.values)
print(res_glmgau_clu_nc.bse.values)
print(res_geegau_rob.bse.values)
[ 2.03582198 1.27532744 9.94937361] [ 2.03582198 1.27532744 9.94937361] [ 2.03582198 1.27532744 9.94937361]
As a reference, we print the summary for GEE, which contains information about the cluster structure.
print(res_geegau_rob.summary())
GEE Regression Results =================================================================================== Dep. Variable: accident No. Observations: 34 Model: GEE No. clusters: 5 Method: Generalized Min. cluster size: 6 Estimating Equations Max. cluster size: 7 Family: Gaussian Mean cluster size: 6.8 Dependence structure: Independence Num. iterations: 2 Date: Sun, 18 Oct 2015 Scale: 262.251 Covariance type: robust Time: 15:12:35 ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ yr_con -0.2339 2.036 -0.115 0.909 -4.224 3.756 op_75_79 2.2898 1.275 1.795 0.073 -0.210 4.789 const 9.7344 9.949 0.978 0.328 -9.766 29.235 ============================================================================== Skew: 1.8831 Kurtosis: 2.5279 Centered skew: -0.0364 Centered kurtosis: 2.2008 ==============================================================================