Generalized Estimating Equations for count data

Key ideas: Poisson regression, negative binomial regression, overdispersion, simulation studies, Stata comparisons, R comparisons

Two GLM families are commonly used with count data: the Poisson family and the negative binomial family.

In a Poisson distribution, the mean is equal to the variance. In practice, however, count data often exhibit "overdispersion", meaning that the variance is greater than the mean. When a Poisson regression model is fit using the GLM or GEE framework, a specific form of departure from strict mean/variance equality is allowed: the variance is modeled as proportional to the mean, with a proportionality constant that can be any number greater than or equal to 1 (in a true Poisson distribution, this proportionality constant is always equal to 1).
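
This proportionality constant is called the scale parameter. A standard way to estimate it is the Pearson statistic divided by the residual degrees of freedom; here is a minimal sketch, where the observed counts, fitted means, and parameter count are made up for illustration:

import numpy as np

y = np.array([0., 2, 1, 4, 3, 7])             # hypothetical observed counts
mu = np.array([0.5, 1.8, 1.2, 3.5, 2.9, 5.1]) # hypothetical fitted means
p = 2                                         # number of mean parameters

# Pearson statistic: sum of squared Pearson residuals
pearson = np.sum((y - mu)**2 / mu)

# Scale estimate: Pearson statistic over residual degrees of freedom
scale_hat = pearson / (len(y) - p)
print scale_hat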

The negative binomial distribution has an additional shape parameter. If the value of this parameter is a, then the variance at a given mean value m is m + a*m^2. Thus when a = 0 the mean/variance relationship is the same as in the Poisson model, and larger values of a correspond to greater overdispersion. In Statsmodels this parameter must be set explicitly before fitting the model (below we illustrate how to compare the fits obtained under different values of this parameter).
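
As a quick numerical check of this mean/variance relationship, we can use scipy's parameterization of the negative binomial (a size parameter 1/a and a success probability 1/(1 + a*m)); the values of a and m below are arbitrary:

from scipy.stats.distributions import nbinom

a, m = 1.5, 2.0                     # hypothetical shape parameter and mean
size, prob = 1 / a, 1 / (1 + a * m) # scipy's parameterization
mean, var = nbinom.stats(size, prob)
print mean, var, m + a * m**2       # var agrees with m + a*m^2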

In [1]:
from scipy.stats.distributions import nbinom
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

Poisson regression

First we simulate data with a marginal mean structure that matches the basic Poisson GLM (a log link, so the mean is the exponential of the linear predictor), and with a variance that is a constant multiple (equal to scale) of the mean. The observations in the data set are independent of each other. Here we fit the model using GEE with an independence covariance structure, so the estimates will be the same as those obtained using GLM (a quick check of this appears after the next cell). The conditional mean of endog is exp(beta' * exog), and the conditional variance of endog is scale times the conditional mean. Note that the data are not actually Poisson distributed (they are drawn from a negative binomial distribution), so the GLM/GEE Poisson model is mis-specified, although it is a reasonable approximation to the true distribution of the data.

In [2]:
def simdata_qpoisson(n, scale):
    exog = np.random.normal(size=(n, 3))
    params = (-1)**np.arange(3) * 0.3
    lin_pred = np.dot(exog, params)
    expval = np.exp(lin_pred) # Mean structure
    if scale == 1:
        # Exactly Poisson distributed data
        endog = np.random.poisson(expval, size=len(expval))
    else:
        # Negative binomial draws with mean expval and variance
        # scale * expval, using scipy's size/probability parameterization
        b = 1 / float(scale)
        a = expval * b / (1 - b)
        endog = nbinom.rvs(a, b, size=n)
    return endog, exog, expval
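
To verify the claim that GEE with an independence working covariance reproduces the GLM estimates, here is a minimal sketch comparing the two fits on a single simulated data set (the group construction mirrors the simulation cells below):

endog, exog, _ = simdata_qpoisson(500, 2)

# Ordinary GLM fit
glm_rslt = sm.GLM(endog, exog, family=sm.families.Poisson()).fit()

# GEE fit with an independence working covariance, groups of size 2
groups = np.kron(np.arange(250), np.r_[1, 1])
gee_rslt = sm.GEE(endog, exog, groups=groups,
                  cov_struct=sm.cov_struct.Independence(),
                  family=sm.families.Poisson()).fit()

# The two sets of coefficient estimates agree up to convergence tolerance
print np.max(np.abs(glm_rslt.params - gee_rslt.params))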

The following plot illustrates the mean/variance relationship. The purple points show empirical means and variances when the scale parameter is 2, and the orange points show empirical means and variances when the scale parameter is 1.

In [3]:
plt.clf()
for scale in 1, 2:

    endog, exog, expval = simdata_qpoisson(20000, scale)

    # Bin the observations on deciles of the true mean
    pctl_levels = np.arange(10, 91, 10).tolist()
    pctls = np.percentile(expval, pctl_levels)
    di = np.digitize(expval, pctls)
    split = []
    for d in np.unique(di):
        split.append(endog[di == d])

    # Empirical mean and variance within each bin
    mean = [np.mean(x) for x in split]
    var = [np.var(x) for x in split]

    color = {1: "orange", 2: "purple"}[scale]
    plt.plot(mean, var, 'o', alpha=0.6, color=color)
    # Reference line: variance = scale * mean
    plt.plot([0, 3], [0, scale*3], '-', color=color)
plt.grid(True)
plt.xlabel("Mean", size=15)
plt.ylabel("Variance", size=15)
Out[3]:
[Plot: empirical mean/variance pairs for scale = 1 (orange) and scale = 2 (purple), with reference lines variance = scale * mean]

Here is the simulation. We fit the marginal regression model using groups of size 2, even though we know that the observations were simulated independently.

In [4]:
n = 500 # The sample size
nrep = 20 # Number of simulation replications

rslt = []
for scale in 1, 2, 3, 4:
    params_est = []
    scale_est = []
    for rp in range(nrep):
        endog, exog, _ = simdata_qpoisson(n, scale)
        fam = sm.families.Poisson()
        ind = sm.cov_struct.Independence()
        # Groups of size 2: 0, 0, 1, 1, 2, 2, ...
        groups = np.kron(np.arange(n/2), np.r_[1, 1])
        mod1 = sm.GEE(endog, exog, groups=groups, cov_struct=ind, family=fam)
        rslt1 = mod1.fit()
        params_est.append(rslt1.params)
        scale_est.append(rslt1.scale)
    params_est = np.asarray(params_est)
    rslt.append([scale, np.mean(scale_est), np.std(scale_est), params_est.mean(0), params_est.std(0)])

Now we convert the lists of lists to a Pandas data frame, which allows us to make a nicely formatted table. For each setting in the simulation study, we display the true scale parameter, along with the mean and standard deviation of the estimated scale parameters. For moderate sample sizes, we see that the bias in the estimated scale parameter is slightly negative, and the uncertainty in the estimated scale parameter is quite high.

Here is what we get for the scale estimates:

In [5]:
df = [x[0:3] for x in rslt]
df = pd.DataFrame(df, columns=["True scale", "Mean estimated scale", "Standard error"])
print df
   True scale  Mean estimated scale  Standard error
0           1              0.986109        0.078713
1           2              1.898801        0.253234
2           3              3.005000        0.465713
3           4              3.943440        0.649029

Here is what we get for the parameter estimates, which should be close to the true values 0.3, -0.3, 0.3:

In [6]:
df = [x[3] for x in rslt]
df = pd.DataFrame(df, columns=["Params%d" % k for k in range(1, 4)])
df["True scale"] = [1, 2, 3, 4]
print df
    Params1   Params2   Params3  True scale
0  0.282129 -0.306494  0.307466           1
1  0.303764 -0.296812  0.286591           2
2  0.307309 -0.309480  0.307750           3
3  0.293419 -0.297983  0.290763           4

Here are the standard deviations of the parameter estimates:

In [7]:
df = [x[4] for x in rslt]
df = pd.DataFrame(df, columns=["Params%d" % k for k in range(1, 4)])
df["True scale"] = [1, 2, 3, 4]
print df
    Params1   Params2   Params3  True scale
0  0.038929  0.046856  0.035198           1
1  0.046287  0.036544  0.037742           2
2  0.060722  0.077816  0.055488           3
3  0.055069  0.073605  0.102340           4

Negative binomial regression models

Next we carry out a similar analysis using negative binomial GLM/GEE models. The function defined in the following cell simulates data matching the model of a negative binomial GLM, using a log link function (the default for the negative binomial family in Statsmodels, although it is not the canonical link).

In [8]:
def simdata_nbinom(n, alpha):
    exog = np.random.normal(size=(n, 3))
    params = (-1)**np.arange(3) * 0.3
    lin_pred = np.dot(exog, params)
    # Convert the mean exp(lin_pred) and shape alpha to scipy's
    # size/probability parameterization, giving variance m + alpha*m^2
    u = np.exp(-lin_pred) / alpha
    p = u / (1 + u)
    endog = nbinom.rvs(1 / alpha, p, size=n)
    return endog, exog
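
As a rough check that the simulator has the intended mean/variance structure, we can fix the mean and inspect a large sample; the values of alpha and m here are arbitrary:

alpha, m = 2.0, 1.5                  # hypothetical shape parameter and mean
u = (1 / m) / alpha                  # same transformation as in simdata_nbinom
p = u / (1 + u)
draws = nbinom.rvs(1 / alpha, p, size=100000)
print draws.mean(), m                # empirical vs. intended mean
print draws.var(), m + alpha * m**2  # empirical vs. intended variance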

Next we simulate data from this model and fit negative binomial GLMs to it. Data are simulated for several values of alpha, and for each simulated data set we fit a sequence of negative binomial GLMs using different values of the alpha parameter. We then plot the log likelihood at the MLE against the value of alpha that was set when fitting the model (the maximizing value of alpha is extracted in the sketch after the plot). Note that this procedure is possible in a GLM, since there is a likelihood, but not in a GEE analysis, which has no likelihood. There is an AIC and a quasi-likelihood for GEE analyses, but these are not currently available in Statsmodels.

We see that at least when the sample size is large, the MLE of the shape parameter (alpha) is reasonably close to the truth.

In [9]:
n = 1000
plt.clf()
ax = plt.axes()

for alpha in 1., 2., 3.:

    endog, exog = simdata_nbinom(n, alpha)

    # Profile the log likelihood over a grid of alpha values
    loglike = []
    for a in np.linspace(1.0, 5, 20):
        fam = sm.families.NegativeBinomial(alpha=a)
        model = sm.GLM(endog, exog, family=fam)
        result = model.fit()
        loglike.append([a, result.llf])

    loglike = np.asarray(loglike)
    # Plot relative to each curve's maximum so the curves are comparable
    plt.plot(loglike[:, 0], loglike[:, 1] - loglike[:, 1].max(), label=str(alpha))

ha, lb = ax.get_legend_handles_labels()
leg = plt.figlegend(ha, lb, "upper center", ncol=3)
leg.draw_frame(False)
plt.xlabel("alpha", size=17)
plt.ylabel("log likelihood", size=17)
plt.ylim(-20, 0)
Out[9]:
[Plot: profiled log likelihood (relative to its maximum) versus alpha, for data simulated with alpha = 1, 2, 3]
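
The value of alpha maximizing the profiled log likelihood can be read directly off the grid; this minimal sketch reuses loglike from the last iteration of the loop above (i.e. the data simulated with alpha = 3):

# Grid point with the highest log likelihood
imax = np.argmax(loglike[:, 1])
print loglike[imax, 0]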

Next we perform a simulation study in which GEE models are fit to simulated negative binomial data sets, with each model fit using the true value of alpha.

In [10]:
n = 500 # The sample size
nrep = 100 # Number of simulation replications
true_alphas = [0.5, 1., 2., 4., 8.]

rslt = []
for alpha in true_alphas:
    scale_est = []
    params_est = []
    for rp in range(nrep):
        endog, exog = simdata_nbinom(n, alpha)
        # Fit using the true alpha, so the variance model is correctly specified
        fam = sm.families.NegativeBinomial(alpha=alpha)
        ind = sm.cov_struct.Independence()
        # Groups of size 2: 0, 0, 1, 1, 2, 2, ...
        groups = np.kron(np.arange(n/2), np.r_[1, 1])
        model = sm.GEE(endog, exog, groups=groups, cov_struct=ind, family=fam)
        result = model.fit()
        scale_est.append(result.scale)
        params_est.append(result.params)
    params_est = np.asarray(params_est)
    rslt.append([np.mean(scale_est), np.std(scale_est), params_est.mean(0), params_est.std(0)])
/projects/3c16f532-d309-4425-850f-a41a8c8f8c92/statsmodels-master/statsmodels/genmod/generalized_estimating_equations.py:1010: IterationLimitWarning: Iteration limit reached prior to convergence
  IterationLimitWarning)

Since the negative binomial variance function with the correct alpha fully accounts for the overdispersion, the estimated scale parameters should all be close to 1.

In [11]:
df = [x[0:2] for x in rslt]
df = pd.DataFrame(df, columns=["Mean estimated scale", "Standard error"])
print df
   Mean estimated scale  Standard error
0              0.987027        0.104587
1              0.979143        0.143386
2              0.991896        0.158683
3              0.966593        0.202967
4              0.887862        0.267265

The parameter estimates should equal 0.3, -0.3, 0.3.

In [12]:
df = [x[2] for x in rslt]
df = pd.DataFrame(df, columns=["Slope1", "Slope2", "Slope3"])
df["True alpha"] = true_alphas
print df
     Slope1    Slope2    Slope3  True alpha
0  0.292653 -0.311477  0.294707         0.5
1  0.300464 -0.298377  0.305033         1.0
2  0.287237 -0.282707  0.302635         2.0
3  0.297589 -0.301080  0.297237         4.0
4  0.309510 -0.301496  0.287678         8.0
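
The Monte Carlo standard deviations of the parameter estimates were also collected (position 3 of each entry of rslt); they could be tabulated in the same way as in the Poisson analysis above (output not shown):

df = [x[3] for x in rslt]
df = pd.DataFrame(df, columns=["Slope1", "Slope2", "Slope3"])
df["True alpha"] = true_alphas
print df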