Key ideas: Power analysis, logistic regression, simulation study
This notebook shows how to use simulation to conduct a power analysis. The setting is a study in which logistic regression is used to estimate the effect of a covariate of interest in the presence of a second confounding covariate.
We start with the usual import statements:
```python
import numpy as np
import pandas as pd

from statsmodels.discrete.discrete_model import Logit
```
Several factors influence the standard error of a regression coefficient in logistic regression, including: the sample size, the correlation between the covariate of interest and the confounder, and the degree of imbalance between the success and failure rates. Each of these is explored in the simulations below.
As is the case in most power analyses, it is necessary to speculate on a number of aspects of the generating model. Here we will use a generating model in which the two covariates are jointly Gaussian. Without loss of generality, they can both have variance equal to 1. We will treat the first variable, denoted "X1", as a centered confounder. The variable of primary interest, denoted "X2", will have a mean that may differ from zero. We also need to specify the correlation "r" between the two covariates, and the population values of the regression coefficients.
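As a sanity check on this construction, the sketch below (variable names are illustrative) builds two unit-variance Gaussians with correlation r, shifts the second to mean mu, and verifies the empirical moments:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, mu = 100_000, 0.5, 1.0

# Start from independent standard normals.
x1 = rng.normal(size=n)

# Mixing with weight r and sqrt(1 - r^2) preserves unit variance
# and gives corr(x1, x2) = r.
x2 = r*x1 + np.sqrt(1 - r**2)*rng.normal(size=n)
x2 = x2 + mu  # shift the covariate of interest to mean mu

print(np.corrcoef(x1, x2)[0, 1])  # close to r = 0.5
print(x2.mean(), x2.var())        # close to mu = 1 and variance 1
```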
We take a functional programming approach, implementing the function "gen_gendat" below that returns a function that can be called to produce simulated data sets. The returned function takes no arguments, and simulates data according to the parameters specified when it was created. The population structure is determined by:

* n: the sample size
* r: the correlation between the two covariates
* mu: the mean of the covariate of interest
* params: the population regression coefficients (intercept, confounder, covariate of interest)
```python
def gen_gendat(n, r, mu, params):
    def gendat():
        exog = np.random.normal(size=(n, 3))
        exog[:, 0] = 1
        exog[:, 2] = r*exog[:, 1] + np.sqrt(1 - r**2)*exog[:, 2]
        exog[:, 2] += mu
        linpred = np.dot(exog, params)
        expval = 1 / (1 + np.exp(-linpred))
        endog = 1*(np.random.uniform(size=n) <= expval)
        return endog, exog
    return gendat
```
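As a quick check of the returned closure (the parameter values below are illustrative, and the function is repeated here so the sketch is self-contained):

```python
import numpy as np

def gen_gendat(n, r, mu, params):
    # Same definition as above.
    def gendat():
        exog = np.random.normal(size=(n, 3))
        exog[:, 0] = 1
        exog[:, 2] = r*exog[:, 1] + np.sqrt(1 - r**2)*exog[:, 2]
        exog[:, 2] += mu
        linpred = np.dot(exog, params)
        expval = 1 / (1 + np.exp(-linpred))
        endog = 1*(np.random.uniform(size=n) <= expval)
        return endog, exog
    return gendat

gendat = gen_gendat(1000, 0.3, 0, np.r_[0, 0, 0])
endog, exog = gendat()
print(endog.shape, exog.shape)  # (1000,) (1000, 3)
print(endog.mean())             # near 0.5, since all coefficients are zero
```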
Next we implement another function, called "stderr", to conduct the power simulation. Its first argument, "gendat", is a function that can be called with no arguments to produce a simulated data set as the pair (endog, exog), where endog is a vector of binary responses and exog is a matrix of covariates; a suitable value for this argument can be obtained by calling "gen_gendat" defined above. The second argument is the number of simulation replications. The "stderr" function returns the mean and standard deviation, over the replications, of the estimated coefficient of the covariate of interest.
```python
def stderr(gendat, nrep):
    pr = []
    for jr in range(nrep):
        endog, exog = gendat()
        mod = Logit(endog, exog)
        rslt = mod.fit(disp=False)
        pr.append(rslt.params)
    pr = np.asarray(pr)
    # Mean and SD over replications of the estimated
    # coefficient of the covariate of interest (column 2).
    return np.mean(pr[:, 2]), np.std(pr[:, 2])
```
Now we can run the simulation. First we hold everything fixed except the sample size. The standard error should decrease by a factor of 1/sqrt(2) with each successive doubling of the sample size.
```python
nrep = 1000
se = []
sample_sizes = [50, 100, 200]
for n in sample_sizes:
    gdat = gen_gendat(n, 0.3, 0, np.r_[0, 0, 0])
    se.append(stderr(gdat, nrep))
se = pd.DataFrame(se, index=sample_sizes, columns=("Mean", "SD"))
print(se)
```
```
         Mean        SD
50   0.005844  0.355883
100  0.006356  0.233696
200  0.004289  0.157229

[3 rows x 2 columns]
```
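We can check the claimed scaling against the table above: each successive SD should be roughly 1/sqrt(2) ≈ 0.707 times the previous one. The ratios come out a little below that, which is plausibly a small-sample effect in the logistic MLE; values below are copied from the output:

```python
import numpy as np

# SD column from the simulation output above.
sd = {50: 0.355883, 100: 0.233696, 200: 0.157229}

for n in (50, 100):
    # Each ratio should be roughly 1/sqrt(2) ~ 0.707.
    print(n, sd[2*n] / sd[n])
```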
The next simulation shows that as the correlation between the confounder and the variable of interest increases, the standard error increases. In this example, the increase isn't too great unless the correlation is quite strong.
```python
nrep = 1000
se = []
r_values = [0, 0.25, 0.5, 0.75]
for r in r_values:
    gdat = gen_gendat(100, r, 0, np.r_[0, 0, 0])
    se.append(stderr(gdat, nrep))
se = pd.DataFrame(se, index=r_values, columns=("Mean", "SD"))
print(se)
```
```
          Mean        SD
0.00  0.005133  0.216416
0.25 -0.015310  0.226289
0.50 -0.009271  0.241154
0.75 -0.012453  0.322934

[4 rows x 2 columns]
```
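This pattern matches the classical variance inflation factor: with a single correlated covariate, the standard error of the coefficient of interest grows by roughly 1/sqrt(1 - r²) relative to the uncorrelated case. A quick comparison against the table above (SD values copied from the output):

```python
import numpy as np

# SD column from the simulation output above, keyed by r.
sd = {0.0: 0.216416, 0.25: 0.226289, 0.5: 0.241154, 0.75: 0.322934}

for r in (0.25, 0.5, 0.75):
    predicted = 1 / np.sqrt(1 - r**2)  # variance inflation factor (SE scale)
    observed = sd[r] / sd[0.0]
    print(r, predicted, observed)
```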
Increasing the intercept leads to a greater imbalance between the success and failure rates, which inflates the standard error and thus reduces the power, as seen below.
```python
se = []
icept_values = [0, 0.5, 1, 1.5]
for icept in icept_values:
    gdat = gen_gendat(100, 0, 0, np.r_[icept, 0, 0])
    se.append(stderr(gdat, nrep))
se = pd.DataFrame(se, index=icept_values, columns=("Mean", "SD"))
print(se)
```
```
         Mean        SD
0.0  0.018262  0.224892
0.5 -0.002686  0.217836
1.0 -0.015501  0.248451
1.5 -0.005094  0.291535

[4 rows x 2 columns]
```
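This tracks the asymptotic approximation that the Fisher information about a logistic regression coefficient scales with p(1 - p), where p is the overall success probability. The sketch below is a rough check (it ignores the variation of p across subjects), with SD values copied from the output above:

```python
import numpy as np

def expit(x):
    # Inverse logit: maps the intercept to the success probability at x = 0.
    return 1 / (1 + np.exp(-x))

# SD column from the simulation output above, keyed by intercept.
sd = {0.0: 0.224892, 0.5: 0.217836, 1.0: 0.248451, 1.5: 0.291535}

for icept in (0.5, 1.0, 1.5):
    p = expit(icept)
    # SE inflation relative to the balanced case p = 1/2,
    # under the approximation SE ~ 1/sqrt(n * p * (1 - p)).
    predicted = np.sqrt(0.25 / (p * (1 - p)))
    observed = sd[icept] / sd[0.0]
    print(icept, predicted, observed)
```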