Marginal regression with nested dependence using GEE¶

Key ideas: GEE, nested dependence, simulation studies

This notebook is a simulation study to assess how well the dependence parameters are estimated in a GEE analysis of nested data.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm


These are parameters that determine the sample sizes and the population values of the variance parameters.

In [2]:
# Standard_deviation of errors that are independent among individuals
obs_sd = 3

# The standard deviation of errors that are shared within subgroups
subgroup_sd = 2

# The standard deviation of errors that are shared within (top-level) groups
group_sd = 1

# The number of groups
n_group = 100

# The number of subgroups in each group
n_subgroup = 4

# The number of observations in each subgroup
n_obs = 5


Next we have a function to simulate data.

In [3]:
def generate_data():

# exog data is iid standard normal
exog = [np.random.normal(size=(n_obs, 2)) for i in range(n_group * n_subgroup)]

# Storage
endog = []
groups = []
nests = []

# Generate the endog values
ii = 0
for i in range(n_group):

# Group effect, shared by all observations within a group
group_e = group_sd * np.random.normal()

for j in range(n_subgroup):

# Subgroup effect, shared by all observations within a subgroup
subgroup_e = subgroup_sd * np.random.normal()

# All regression slopes are equal to 1.
expval = exog[ii].sum(1)

# The total random error for one observation
obs_e = obs_sd * np.random.normal(size=n_obs)

# The endog value for one observation
y = expval + group_e + subgroup_e + obs_e

endog.append(y)
groups.append(i*np.ones(n_obs))
nests.append(j*np.ones(n_obs))
ii += 1

exog = np.concatenate(exog)
endog = np.concatenate(endog)
groups = np.concatenate(groups)
nests = np.concatenate(nests)

return endog, exog, groups, nests


Next we simulate several datasets and use GEE to estimate the regression and dependence parameters in each data set.

In [4]:
nrep = 100

params = []
dep_params = []
scales = []
for rp in range(nrep):

endog, exog, groups, nests = generate_data()

family = sm.families.Gaussian()
ne = sm.cov_struct.Nested()

mod1 = sm.GEE(endog, exog, groups=groups, family=family, cov_struct=ne, dep_data=nests)
rslt1 = mod1.fit()
params.append(rslt1.params)
dep_params.append(ne.dep_params)
scales.append(rslt1.scale)

params = np.asarray(params)
dep_params = np.asarray(dep_params)
scales = np.asarray(scales)


The regression coefficients should equal (1,1), as hard-coded above.

In [5]:
print params.mean(0) # should return [1, 1] approximately

[ 1.00835738  0.99717758]


The dependence parameters should equal group_sd^2 and subgroup^sd^2

In [6]:
# These two should be approximately equal
print dep_params.mean(0)
print [group_sd**2, subgroup_sd**2]

[ 1.02916337  4.0213762 ]
[1, 4]


The error variance is the scale parameter minus the sum of all other variance components.

In [7]:
# These two should be approximately equal
print scales.mean() - dep_params.mean(0).sum()
print obs_sd**2

8.98561856896
9