Marginal regression with nested dependence using GEE

Key ideas: GEE, nested dependence, simulation studies

This notebook is a simulation study to assess how well the dependence parameters are estimated in a GEE analysis of nested data.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

These parameters determine the sample sizes and the population values of the variance components.

In [2]:
# The standard deviation of errors that are independent across observations
obs_sd = 3

# The standard deviation of errors that are shared within subgroups
subgroup_sd = 2

# The standard deviation of errors that are shared within (top-level) groups
group_sd = 1

# The number of groups
n_group = 100

# The number of subgroups in each group
n_subgroup = 4

# The number of observations in each subgroup
n_obs = 5
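
Under these settings the variance components add: the marginal variance of each observation is group_sd^2 + subgroup_sd^2 + obs_sd^2, two observations in the same subgroup have covariance group_sd^2 + subgroup_sd^2, and two observations in the same group but different subgroups have covariance group_sd^2. As a rough sanity check, these implied population quantities can be computed directly (this small snippet is only illustrative and is not used in the analysis below):

total_var = group_sd**2 + subgroup_sd**2 + obs_sd**2    # 1 + 4 + 9 = 14
cov_within_subgroup = group_sd**2 + subgroup_sd**2      # 1 + 4 = 5
cov_within_group = group_sd**2                          # 1
print(total_var, cov_within_subgroup, cov_within_group)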

Next we have a function to simulate data.

In [3]:
def generate_data():
    
    # exog data is iid standard normal
    exog = [np.random.normal(size=(n_obs, 2)) for i in range(n_group * n_subgroup)]
    
    # Storage
    endog = []
    groups = []
    nests = []
    
    # Generate the endog values
    ii = 0
    for i in range(n_group):
        
        # Group effect, shared by all observations within a group
        group_e = group_sd * np.random.normal()
        
        for j in range(n_subgroup):
            
            # Subgroup effect, shared by all observations within a subgroup
            subgroup_e = subgroup_sd * np.random.normal()
            
            # All regression slopes are equal to 1.
            expval = exog[ii].sum(1)
            
            # Independent errors, one for each observation in the subgroup
            obs_e = obs_sd * np.random.normal(size=n_obs)
            
            # The endog values for the observations in this subgroup
            y = expval + group_e + subgroup_e + obs_e
            
            endog.append(y)
            groups.append(i*np.ones(n_obs))
            nests.append(j*np.ones(n_obs))
            ii += 1
            
    exog = np.concatenate(exog)
    endog = np.concatenate(endog)
    groups = np.concatenate(groups)
    nests = np.concatenate(nests)

    return endog, exog, groups, nests
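
As a quick usage check, a single simulated data set should contain n_group * n_subgroup * n_obs = 100 * 4 * 5 = 2000 observations; the following illustrative call (not part of the analysis below) verifies the array shapes:

endog, exog, groups, nests = generate_data()
print(endog.shape, exog.shape, groups.shape)  # expected: (2000,) (2000, 2) (2000,)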

Next we simulate several data sets and use GEE to estimate the regression and dependence parameters in each one.

In [4]:
nrep = 100

params = []
dep_params = []
scales = []
for rp in range(nrep):
        
    endog, exog, groups, nests = generate_data()

    family = sm.families.Gaussian()
    ne = sm.cov_struct.Nested()

    mod1 = sm.GEE(endog, exog, groups=groups, family=family, cov_struct=ne, dep_data=nests)
    rslt1 = mod1.fit()
    params.append(rslt1.params)
    dep_params.append(ne.dep_params)
    scales.append(rslt1.scale)

params = np.asarray(params)
dep_params = np.asarray(dep_params)
scales = np.asarray(scales)

The regression coefficients should equal (1,1), as hard-coded above.

In [5]:
print(params.mean(0))  # should be approximately [1, 1]
[ 1.00835738  0.99717758]

The estimated dependence parameters should be approximately equal to group_sd^2 and subgroup_sd^2.

In [6]:
# These two should be approximately equal
print(dep_params.mean(0))
print([group_sd**2, subgroup_sd**2])
[ 1.02916337  4.0213762 ]
[1, 4]

The scale parameter estimates the total marginal variance of an observation, so the observation-level error variance is the scale parameter minus the sum of the other variance components.

In [7]:
# These two should be approximately equal
print(scales.mean() - dep_params.mean(0).sum())
print(obs_sd**2)
8.98561856896
9
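
One could also summarize the Monte Carlo variability of the estimates across the 100 replications, for example via their standard deviations:

print(params.std(0))      # spread of the regression coefficient estimates
print(dep_params.std(0))  # spread of the estimated variance components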