*Key ideas:* Power analysis, logistic regression, simulation study

This notebook shows how to use simulation to conduct a power analysis. The setting is a study in which logistic regression is used to estimate the effect of a covariate of interest in the presence of a second confounding covariate.

We start with the usual import statements:

In [1]:

```
import numpy as np
import pandas as pd
from statsmodels.discrete.discrete_model import Logit
```

Several factors influence the standard error of a regression coefficient in logistic regression, including:

- The sample size
- The variance of the variable of interest
- The correlation between the variable of interest and other covariates
- The overall proportion of successes versus failures

As in most power analyses, it is necessary to speculate about several aspects of the generating model. Here we use a generating model in which the two covariates are jointly Gaussian; without loss of generality, both can have variance equal to 1. We treat the first variable, denoted "X1", as a confounder that is centered. The variable of primary interest, denoted "X2", will have a mean that may differ from zero. We also need to specify the correlation "r" between the two covariates, and the population values of the regression coefficients.
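The device used below to induce the correlation is worth checking in isolation. This quick sketch (not part of the original notebook) confirms that setting X2 = r\*X1 + sqrt(1 - r^2)\*Z, with X1 and Z independent standard normal draws, gives X1 and X2 correlation r while keeping unit variance:

```
import numpy as np

np.random.seed(0)
r = 0.5

# X1 and Z are independent standard normal samples.
x1 = np.random.normal(size=100000)
z = np.random.normal(size=100000)

# X2 is constructed so that corr(X1, X2) = r and Var(X2) = 1.
x2 = r*x1 + np.sqrt(1 - r**2)*z

print(np.corrcoef(x1, x2)[0, 1])  # close to 0.5
print(x2.std())                   # close to 1
```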

We take a functional programming approach: the function "gen_gendat" below returns a function that can be called to produce simulated data sets. The returned function takes no arguments and simulates data according to the parameters specified when it was created. The population structure is determined by:

- n -- the sample size
- r -- the correlation between the two covariates
- mu -- the mean of the variable of interest (the second covariate)
- params -- a 3-dimensional vector of coefficients [intercept, confounder effect, variable of interest effect]

In [2]:

```
def gen_gendat(n, r, mu, params):
    def gendat():
        exog = np.random.normal(size=(n, 3))
        exog[:, 0] = 1
        exog[:, 2] = r*exog[:, 1] + np.sqrt(1 - r**2)*exog[:, 2]
        exog[:, 2] += mu
        linpred = np.dot(exog, params)
        expval = 1 / (1 + np.exp(-linpred))
        endog = 1*(np.random.uniform(size=n) <= expval)
        return endog, exog
    return gendat
```

In [3]:

```
def stderr(gendat, nrep):
    pr = []
    for jr in range(nrep):
        endog, exog = gendat()
        mod = Logit(endog, exog)
        rslt = mod.fit(disp=False)
        pr.append(rslt.params[2])
    pr = np.asarray(pr)
    return np.mean(pr), np.std(pr)
```

In [4]:

```
nrep = 1000
se = []
sample_sizes = [50, 100, 200]
for n in sample_sizes:
    gdat = gen_gendat(n, 0.3, 0, np.r_[0, 0, 0])
    se.append(stderr(gdat, nrep))
se = pd.DataFrame(se, index=sample_sizes, columns=("Mean", "SD"))
print(se)
```

In [5]:

```
nrep = 1000
se = []
r_values = [0, 0.25, 0.5, 0.75]
for r in r_values:
    gdat = gen_gendat(100, r, 0, np.r_[0, 0, 0])
    se.append(stderr(gdat, nrep))
se = pd.DataFrame(se, index=r_values, columns=("Mean", "SD"))
print(se)
```

In [6]:

```
se = []
icept_values = [0, 0.5, 1, 1.5]
for icept in icept_values:
    gdat = gen_gendat(100, 0, 0, np.r_[icept, 0, 0])
    se.append(stderr(gdat, nrep))
se = pd.DataFrame(se, index=icept_values, columns=("Mean", "SD"))
print(se)
```
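The simulated standard deviations above are the key ingredient of a power calculation. As a sketch (the helper below uses the standard normal approximation for a Wald test and is not part of the original notebook), the power of a two-sided test of the coefficient of interest can be approximated from an assumed effect size and a simulated standard error:

```
import math

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def approx_power(beta, se, alpha_z=1.96):
    # Normal-approximation power of a two-sided Wald test of beta = 0
    # at the 0.05 level (alpha_z = 1.96 is the critical value).
    return norm_cdf(beta / se - alpha_z) + norm_cdf(-beta / se - alpha_z)

# For example, an effect of 0.5 with a simulated SE of 0.25 gives
# roughly 52% power; under the null (beta = 0) the "power" reduces
# to the type I error rate, about 0.05.
print(approx_power(0.5, 0.25))
print(approx_power(0.0, 1.0))
```

Plugging the SD column from any of the tables above into `approx_power`, together with a plausible effect size, converts the simulation output into an approximate power for each design setting.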