Statsmodels Generalized Linear Models

Key ideas: Power analysis, logistic regression, simulation study

This notebook shows how to use simulation to conduct a power analysis. The setting is a study in which logistic regression is used to estimate the effect of a covariate of interest in the presence of a second confounding covariate.

We start with the usual import statements:

In [1]:
import numpy as np
import pandas as pd
from statsmodels.discrete.discrete_model import Logit

Several factors influence the standard error of a regression coefficient in logistic regression, including:

  • The sample size
  • The variance of the variable of interest
  • The correlation between the variable of interest and other covariates
  • The overall proportion of successes versus failures

As is the case in most power analyses, it is necessary to speculate about a number of aspects of the data generating model. Here we will use a generating model in which the two covariates are jointly Gaussian. Without loss of generality, they can both have variance equal to 1. We will treat the first covariate, denoted "X1", as a confounder that is centered at zero. The covariate of primary interest, denoted "X2", may have a mean that differs from zero. We also need to specify the correlation "r" between the two covariates, and the population values of the regression coefficients.

We take a functional programming approach, implementing the function "gen_gendat" below that returns a function that can be called to produce simulated data sets. The returned function takes no arguments, and simulates data according to the parameters specified when it was created. The population structure is determined by:

  • n -- the sample size
  • r -- the correlation between the two covariates
  • mu -- the mean of the variable of interest (the second covariate)
  • params -- a 3-dimensional vector of coefficients [intercept, confounder effect, variable of interest effect]
In [2]:
def gen_gendat(n, r, mu, params):
    def gendat():
        # Column 0 is the intercept; columns 1 and 2 hold the confounder
        # and the covariate of interest, respectively.
        exog = np.random.normal(size=(n, 3))
        exog[:,0] = 1
        # Induce correlation r between the two covariates while keeping unit variance.
        exog[:,2] = r*exog[:,1] + np.sqrt(1-r**2)*exog[:,2]
        exog[:,2] += mu
        # Simulate binary responses from the logistic regression model.
        linpred = np.dot(exog, params)
        expval = 1 / (1 + np.exp(-linpred))
        endog = 1*(np.random.uniform(size=n) <= expval)
        return endog, exog
    return gendat
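
As a quick sanity check (a minimal sketch, not part of the power analysis itself), we can call one of the returned functions once and confirm that the simulated covariates have correlation close to the specified value of r, and that with a zero coefficient vector roughly half of the responses are successes:

gdat = gen_gendat(1000, 0.3, 0, np.r_[0, 0, 0])
endog, exog = gdat()
# Empirical correlation between the confounder and the covariate of interest (~0.3).
print(np.corrcoef(exog[:,1], exog[:,2])[0, 1])
# Overall success rate (~0.5 when all coefficients are zero).
print(endog.mean())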

Next we implement another function called "stderr" to conduct the power simulation. Its first argument, "gendat", is a function that can be called with no arguments to produce data; calling it returns the pair (endog, exog), where endog is a vector of binary responses and exog is a matrix of covariates. A suitable value of "gendat" can be obtained by calling "gen_gendat" defined above. The second argument is the number of simulation replications. The "stderr" function returns the mean and standard deviation of the estimated values of params[2] over the replications.

In [3]:
def stderr(gendat, nrep):

    pr = []
    for jr in range(nrep):
        # Simulate a data set and fit the logistic regression.
        endog, exog = gendat()
        mod = Logit(endog, exog)
        rslt = mod.fit(disp=False)
        # Keep the estimated coefficient of the covariate of interest.
        pr.append(rslt.params[2])

    pr = np.asarray(pr)
    return np.mean(pr), np.std(pr)

Now we can run the simulation. First we hold everything fixed except the sample size. The standard error should decrease by a factor of 1/sqrt(2) with each successive doubling of the sample size.

In [4]:
nrep = 1000
se = []
sample_sizes = [50, 100, 200]
for n in sample_sizes:
    gdat = gen_gendat(n, 0.3, 0, np.r_[0, 0, 0])
    se.append(stderr(gdat, nrep))
    
se = pd.DataFrame(se, index=sample_sizes, columns=("Mean", "SD"))
print(se)
         Mean        SD
50   0.005844  0.355883
100  0.006356  0.233696
200  0.004289  0.157229

[3 rows x 2 columns]
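
As a rough check (a minimal sketch, assuming the DataFrame "se" from the cell above is still in scope), the ratio of each SD to the previous one should be close to 1/sqrt(2) ≈ 0.707:

# Ratios of successive simulated SDs; each should be roughly 1/sqrt(2).
print(se["SD"].values[1:] / se["SD"].values[:-1])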

The next simulation shows that the standard error increases as the correlation between the confounder and the variable of interest increases. In this example, the increase is modest unless the correlation is quite strong.

In [5]:
nrep = 1000
se = []
r_values = [0, 0.25, 0.5, 0.75]
for r in r_values:
    gdat = gen_gendat(100, r, 0, np.r_[0, 0, 0])
    se.append(stderr(gdat, nrep))
    
se = pd.DataFrame(se, index=r_values, columns=("Mean", "SD"))
print(se)
          Mean        SD
0.00  0.005133  0.216416
0.25 -0.015310  0.226289
0.50 -0.009271  0.241154
0.75 -0.012453  0.322934

[4 rows x 2 columns]
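
In a linear model, correlation r between two covariates inflates the standard error of a coefficient by the factor 1/sqrt(1 - r^2), and the same factor appears to describe these logistic regression results approximately. The following sketch (an illustration added here, not part of the original analysis, assuming "se" and "r_values" from the cell above) compares the simulated SDs to that prediction, anchored at the r = 0 row:

# Predicted SDs from the variance inflation factor, using the r = 0 result as baseline.
r_arr = np.asarray(r_values, dtype=float)
predicted = se["SD"].values[0] / np.sqrt(1 - r_arr**2)
print(pd.DataFrame({"Simulated": se["SD"].values, "Predicted": predicted}, index=r_values))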

Increasing the intercept leads to a greater imbalance between the success and failure rates. Since the Fisher information contributed by each observation is proportional to p(1 - p), pushing the success probability p away from 1/2 reduces the information, and hence the power, as seen below.

In [6]:
se = []
icept_values = [0, 0.5, 1, 1.5]
for icept in icept_values:
    gdat = gen_gendat(100, 0, 0, np.r_[icept, 0, 0])
    se.append(stderr(gdat, nrep))
    
se = pd.DataFrame(se, index=icept_values, columns=("Mean", "SD"))
print(se)
         Mean        SD
0.0  0.018262  0.224892
0.5 -0.002686  0.217836
1.0 -0.015501  0.248451
1.5 -0.005094  0.291535

[4 rows x 2 columns]
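
Because the covariate coefficients are zero here, every observation has the same success probability p = 1/(1 + exp(-icept)), so the standard error should scale roughly like 1/sqrt(p(1 - p)). The following sketch (an illustration added here, not part of the original analysis, assuming "se" and "icept_values" from the cell above) compares the simulated SDs to that prediction, anchored at the zero-intercept row:

# Constant success probability implied by each intercept value.
p = 1 / (1 + np.exp(-np.asarray(icept_values, dtype=float)))
# Predicted SDs, anchored at the intercept = 0 result where p = 0.5.
predicted = se["SD"].values[0] * 0.5 / np.sqrt(p * (1 - p))
print(pd.DataFrame({"Simulated": se["SD"].values, "Predicted": predicted}, index=icept_values))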