# Generalized Estimating Equations

Key ideas: GEE, nominal data, simulation studies, sensitivity analyses

This notebook demonstrates GEE analysis for a nominal response variable, using simulated data. We simulate data from multinomial distributions with group-wise dependence and covariate effects. "Group-wise dependence" means that the outcomes for observations in the same group are correlated. "Covariate effects" means that the marginal outcome distribution for each observation depends on a vector of covariates. We show that we are able to recover much of this structure through the regression analysis.

Here are the import statements:

In [1]:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.stats.distributions import norm


Next we define the two values that control how the data are simulated. The data are structured as a set of dependent paired observations (values in different pairs are independent of each other). Each value in a pair can take on one of three possible states (i.e. the sample space of the dependent variable has three values). The values in params control the regression structure. The probabilities of the three outcomes are proportional to exp(params[0]' x), exp(params[1]' x) and exp(params[2]' x). The third category is the "reference category".

In [2]:
n = 500
params = [np.r_[1, 0, -1], np.r_[0, 1, 0], np.r_[0, 0, 0]]
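
To make the parameterization concrete, here is a small self-contained sketch computing the three outcome probabilities for a single observation (the covariate vector x below is arbitrary, chosen only for illustration):

```python
import numpy as np

# Multinomial-logit probabilities for one (hypothetical) covariate vector.
params = [np.r_[1, 0, -1], np.r_[0, 1, 0], np.r_[0, 0, 0]]
x = np.r_[0.5, -0.2, 1.0]

lin = np.array([np.dot(x, p) for p in params])  # one linear predictor per category
pr = np.exp(lin)
pr /= pr.sum()  # normalize so the three probabilities sum to 1
```

Because params[2] is the zero vector, the linear predictor for the third category is always 0, which is what makes it the reference category.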


The following 'prob' function takes the covariate data and the parameters and returns a matrix of probabilities. Each row of the result is a discrete probability distribution that is the marginal outcome distribution for one observation.

In [3]:
def prob(exog, params):
    # One linear predictor per category, then normalize (multinomial logit).
    pr = [np.dot(exog, p) for p in params]
    pr = np.vstack(pr).T
    pr = np.exp(pr)
    pr /= pr.sum(1)[:, None]
    return pr


The 'sample' function returns paired observations, based on the marginal distributions pr1 and pr2. (The i^th rows of pr1 and pr2 are the marginal distributions for the two members of one paired sample.) Dependence within each pair is induced using a Gaussian copula with correlation parameter r.

In [4]:
def sample(pr1, pr2, r):

    n = len(pr1)

    # Correlated standard normal draws (Gaussian copula with parameter r).
    z1 = np.random.normal(size=n)
    z2 = r*z1 + np.sqrt(1 - r**2)*np.random.normal(size=n)

    cpr1 = np.cumsum(pr1, 1)
    cpr2 = np.cumsum(pr2, 1)

    endog1 = np.zeros(n)
    endog2 = np.zeros(n)

    for i in range(n):
        # Map each normal draw to a category by counting how many
        # inverse-CDF thresholds it exceeds.
        q1 = norm.ppf(cpr1[i, :])
        q2 = norm.ppf(cpr2[i, :])
        endog1[i] = np.sum(z1[i] > q1)
        endog2[i] = np.sum(z2[i] > q2)

    return endog1, endog2
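
As a quick sanity check on the copula construction (this check is illustrative and not part of the simulation study), pairs sampled with a positive correlation parameter should agree more often than the roughly 1/3 expected under independence when both marginals are uniform over the three states:

```python
import numpy as np
from scipy.stats.distributions import norm

# Illustrative check: uniform three-state marginals, copula correlation 0.6.
rng = np.random.RandomState(0)
r, m = 0.6, 100000
z1 = rng.normal(size=m)
z2 = r*z1 + np.sqrt(1 - r**2)*rng.normal(size=m)

q = norm.ppf(np.cumsum(np.ones(3) / 3))  # category thresholds

y1 = (z1[:, None] > q).sum(1)
y2 = (z2[:, None] > q).sum(1)

agreement = np.mean(y1 == y2)  # noticeably above 1/3 when r > 0
```

With r = 0 the two members of a pair would be independent and the agreement rate would be close to 1/3.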


Here is a function to generate one data set.

In [5]:
def generate_data(r):

    exog1 = np.random.normal(size=(n, 3))
    exog2 = np.random.normal(size=(n, 3))

    pr1 = prob(exog1, params)
    pr2 = prob(exog2, params)

    endog1, endog2 = sample(pr1, pr2, r)

    # Interleave the paired observations so that consecutive rows
    # belong to the same group.
    endog, exog = [], []
    for i in range(n):
        endog.append(endog1[i])
        endog.append(endog2[i])
        exog.append(exog1[i, :])
        exog.append(exog2[i, :])
    endog = np.asarray(endog)
    exog = np.asarray(exog)

    groups = np.kron(np.arange(n), [1, 1])

    return endog, exog, groups


### Fitting a model using an independence covariance structure

Next we generate a dataset and fit it using an independence covariance structure. The parameter estimates can be compared to their population values specified above as params.

In [6]:
endog, exog, groups = generate_data(0.4)
ind = sm.cov_struct.Independence()
model = sm.NominalGEE(endog, exog, groups, cov_struct=ind)
result = model.fit()
pa = result.params
print(result.summary())

                           NominalGEE Regression Results
===================================================================================
Dep. Variable:                           y   No. Observations:                 2000
Model:                          NominalGEE   No. clusters:                      500
Method:                        Generalized   Min. cluster size:                   4
Estimating Equations   Max. cluster size:                   4
Family:                       _Multinomial   Mean cluster size:                 4.0
Dependence structure:         Independence   Num. iterations:                    26
Date:                     Mon, 12 Jan 2015   Scale:                           1.000
Covariance type:                    robust   Time:                         00:26:21
==============================================================================
coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1[0.0]        0.9592      0.090     10.690      0.000         0.783     1.135
x2[0.0]       -0.0216      0.078     -0.277      0.782        -0.175     0.132
x3[0.0]       -1.0851      0.087    -12.404      0.000        -1.257    -0.914
x1[1.0]       -0.1116      0.078     -1.426      0.154        -0.265     0.042
x2[1.0]        0.9270      0.086     10.763      0.000         0.758     1.096
x3[1.0]       -0.0480      0.080     -0.599      0.549        -0.205     0.109
==============================================================================
Skew:                          0.4961   Kurtosis:                      -0.5085
Centered skew:                 0.2899   Centered kurtosis:             -0.3846
==============================================================================


Next we make a plot that displays the fitted probabilities for two hypothetical subjects.

In [7]:
p = result.plot_distribution(None, [{"x1": 0, "x2": 0, "x3": 1}, {"x1": 0, "x2": 0, "x3": 5}])
ax = p.get_axes()[0]
ax.set_xlim(-0.05, 2.05)
ax.set_ylim(-0.05, 1.05)
li = ax.lines
leg = plt.figlegend(li, ("1", "2"), "upper center", ncol=2)
leg.draw_frame(False)


### Fitting a model to capture the dependence

Next we fit a model with a "global odds ratio" dependence structure. This means that any two pairs of binary indicators derived from different observations in the same group are modeled as having the same odds ratio, irrespective of which thresholds were used to produce the values.
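
To build intuition about this dependence structure, here is a rough sketch of the idea (not the estimator that statsmodels uses internally): dichotomize both members of a pair at every threshold, sum the resulting 2x2 tables, and take the odds ratio of the pooled table.

```python
import numpy as np

# Rough illustration of a pooled (global) odds ratio for paired
# three-category outcomes y1, y2 coded 0, 1, 2.
def crude_global_or(y1, y2):
    table = np.zeros((2, 2))
    for c1 in [0, 1]:        # threshold for the first member of the pair
        for c2 in [0, 1]:    # threshold for the second member
            a1 = (y1 <= c1).astype(int)
            a2 = (y2 <= c2).astype(int)
            for i in [0, 1]:
                for j in [0, 1]:
                    table[i, j] += np.sum((a1 == i) & (a2 == j))
    # Odds ratio of the pooled table (assumes no zero cells).
    return table[0, 0] * table[1, 1] / (table[0, 1] * table[1, 0])
```

Perfectly concordant pairs give a pooled odds ratio well above 1, while independent pairs give a value near 1.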

In [8]:
gor = sm.cov_struct.GlobalOddsRatio("nominal")
model = sm.NominalGEE(endog, exog, groups, cov_struct=gor)
result = model.fit(start_params=pa, maxiter=100)
print(result.summary())
print(gor.summary())

                           NominalGEE Regression Results
===================================================================================
Dep. Variable:                           y   No. Observations:                 2000
Model:                          NominalGEE   No. clusters:                      500
Method:                        Generalized   Min. cluster size:                   4
Estimating Equations   Max. cluster size:                   4
Family:                       _Multinomial   Mean cluster size:                 4.0
Dependence structure:      GlobalOddsRatio   Num. iterations:                    47
Date:                     Mon, 12 Jan 2015   Scale:                           1.000
Covariance type:                    robust   Time:                         00:26:41
==============================================================================
coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1[0.0]        0.8530      0.079     10.823      0.000         0.698     1.007
x2[0.0]       -0.0650      0.070     -0.928      0.354        -0.202     0.072
x3[0.0]       -0.9353      0.072    -12.940      0.000        -1.077    -0.794
x1[1.0]       -0.1758      0.076     -2.299      0.021        -0.326    -0.026
x2[1.0]        0.8493      0.074     11.476      0.000         0.704     0.994
x3[1.0]        0.0157      0.074      0.212      0.832        -0.130     0.161
==============================================================================
Skew:                          0.5066   Kurtosis:                      -0.5997
Centered skew:                 0.2964   Centered kurtosis:             -0.4604
==============================================================================
Global odds ratio: 1.813



The following plot shows the fitted global odds ratio at each iteration of the fitting procedure. The starting point is the "crude odds ratio", which is biased toward the null value of 1. Only a few iterations are needed to remove most of this bias.

In [9]:
plt.plot(result.fit_history['dep_params'])
plt.xlabel("Iteration", size=14)
plt.ylabel("Global odds ratio", size=14)

Out[9]:
<matplotlib.text.Text at 0x7f7e32d94410>

We can also check to see how fast the fitting algorithm converged.

In [14]:
plt.plot([np.sqrt(np.sum(x**2)) for x in result.fit_history['score']])
plt.xlabel("Iteration", size=14)
plt.ylabel("Score norm", size=14)

Out[14]:
<matplotlib.text.Text at 0x7f7e32a2f610>

### Sensitivity analysis

Next we conduct a sensitivity analysis to assess how the parameter estimates vary with the assumed degree of within-cluster dependence (the global odds ratio). We fit a sequence of models with odds ratios ranging from 1 to 3. For each model the dependence structure is held fixed, so only the regression coefficients are being optimized.

In [15]:
results = result.params_sensitivity(1, 3.0, 3)

/projects/3c16f532-d309-4425-850f-a41a8c8f8c92/statsmodels-master/statsmodels/genmod/generalized_estimating_equations.py:1078: IterationLimitWarning: Iteration limit reached prior to convergence
IterationLimitWarning)


Next we plot the parameter estimates (+/-2 SE) obtained in the sensitivity analysis against the global odds ratio value.

In [16]:
dep_params = np.asarray([x.cov_struct.dep_params for x in results])
for ix in range(3):
    pa = np.asarray([x.params[ix] for x in results])
    se = np.asarray([x.bse[ix] for x in results])
    plt.plot(dep_params, pa, color='purple', ls='-')
    plt.plot(dep_params, pa + 2*se, color='grey')
    plt.plot(dep_params, pa - 2*se, color='grey')
plt.xlabel("Global odds ratio", size=14)
plt.ylabel("Parameter estimate", size=14)