Simulated clinical trial¶

Simulated randomized clinical trials (RCT) are useful because we know the true treatment effects being estimated. Consider a two-treatment (A, B) RCT for hypertension where covariate adjustment is used for baseline systolic blood pressure (SBP) and where there are two outcomes: (1) incidence of death or stroke (DS) within one year and (2) systolic blood pressure at one year after randomization. Even though time to DS would be a preferred outcome (and would handle censoring) we ignore the timing of events for this example and use a binary logistic model for DS. SBP is assumed to follow a normal distribution with constant SD $\sigma=7$ given baseline SBP. The true B:A treatment effect is assumed to be a mean 7mmHg drop in SBP with B. We assume a correlation between SBP reduction and incidence of DS by forming a population logistic model for DS in which baseline and 1y SBP each have a regression coefficient of 0.05 for predicting DS and treatment has a coefficient of log(0.8). To estimate the true effect of treatment on DS not adjusted for follow-up SBP we first simulate a trial with $n=40000$.

The regression coefficient for treatment in the proper outcome model that did not adjust for 1y SBP was round(btrt, 4) which corresponds to a B:A odds ratio of round(exp(btrt), 4) which we take as the true treatment effect on the binary outcome.

The trial could easily be run sequentially but we treat the sample size as fixed at $n=1500$ and simulate the trial data as such.

In [1]:

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import pystan
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import logistic

np.random.seed(1)

/Users/fonnescj/anaconda3/envs/dev/lib/python3.5/site-packages/Cython/Distutils/old_build_ext.py:30: UserWarning: Cython.Distutils.old_build_ext does not properly handle dependencies and is deprecated.
  "Cython.Distutils.old_build_ext does not properly handle dependencies "

In [10]:

invlogit = lambda x: 1 / (1 + np.exp(-x))

In [11]:

A,B = 0,1

def sim(n):
    trt = np.r_[np.ones(int(n/2)), np.zeros(int(n/2))]
    sbp0 = np.random.normal(140, 7, size=n)
    sbp = sbp0 - 5 - 7*(trt==B) + np.random.normal(scale=5, size=n)
    logit = -2.6 - np.log(0.8)*(trt==B) + 0.05*(sbp0 - 140) + 0.05*(sbp - 130)
    ds = np.arange(2)[(np.random.random(n) <= invlogit(logit)).astype(int)]
    
    df = pd.DataFrame({'trt': trt,
                      'sbp0': sbp0,
                      'sbp': sbp,
                      'ds': ds})
    return df

In [12]:

n = 1500
data = sim(n)
data.plot.scatter('sbp0', 'sbp', c='trt', cmap='plasma', alpha=0.4);

In [13]:

fit1 = smf.ols(formula='sbp ~ sbp0 + trt', data=data).fit()
fit1.summary()

Out[13]:

OLS Regression Results
Dep. Variable:	sbp	R-squared:	0.712
Model:	OLS	Adj. R-squared:	0.712
Method:	Least Squares	F-statistic:	1852.
Date:	Mon, 07 Nov 2016	Prob (F-statistic):	0.00
Time:	17:14:18	Log-Likelihood:	-4532.6
No. Observations:	1500	AIC:	9071.
Df Residuals:	1497	BIC:	9087.
Df Model:	2
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
Intercept	-4.8409	2.595	-1.865	0.062	-9.932 0.250
sbp0	0.9978	0.018	54.137	0.000	0.962 1.034
trt	-6.9345	0.257	-27.007	0.000	-7.438 -6.431

Omnibus:	2.361	Durbin-Watson:	2.015
Prob(Omnibus):	0.307	Jarque-Bera (JB):	2.373
Skew:	0.071	Prob(JB):	0.305
Kurtosis:	2.866	Cond. No.	2.84e+03

In [14]:

fit2 = smf.glm(formula='ds ~ sbp0 + sbp + trt', data=data, family=sm.families.Binomial()).fit()
fit2.summary()

Out[14]:

Generalized Linear Model Regression Results
Dep. Variable:	ds	No. Observations:	1500
Model:	GLM	Df Residuals:	1496
Model Family:	Binomial	Df Model:	3
Link Function:	logit	Scale:	1.0
Method:	IRLS	Log-Likelihood:	-461.49
Date:	Mon, 07 Nov 2016	Deviance:	922.98
Time:	17:14:18	Pearson chi2:	1.50e+03
No. Iterations:	8

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	-15.7563	1.922	-8.197	0.000	-19.524 -11.989
sbp0	0.0784	0.022	3.572	0.000	0.035 0.121
sbp	0.0173	0.018	0.983	0.325	-0.017 0.052
trt	0.2147	0.215	0.998	0.318	-0.207 0.636

Specify Stan model

In [15]:

model = """
data {
    int n;
    vector[n] x;
    real y1[n];
    int y2[n];
    vector[n] treat;
    vector<lower=0>[2] sigma_b;
    vector[2] Zero;
}

parameters {
    vector[2] alpha;
    vector[2] beta;
    vector[2] mu;
    real<lower=0> sigma;
    corr_matrix[2] Omega_b; 
}


transformed parameters {
    vector[n] theta1;
    vector[n] theta2;
    cov_matrix[2] Sigma; 
    
    Sigma <- quad_form_diag(Omega_b, sigma_b); 
    
    theta1 <- mu[1] + alpha[1]*x + beta[1]*treat;
    theta2 <- mu[2] + alpha[2]*x + beta[2]*treat;
}

model {
    beta ~ multi_normal(Zero, Sigma); 
    Omega_b ~ lkj_corr(1);

    sigma ~ cauchy(0, 5);
    
    y1 ~ normal(theta1, sigma);
    y2 ~ bernoulli_logit(theta2);
}

generated quantities {
    real A;
    real B;
    real A_or_B;
    real A_and_B;

    A <- if_else(beta[1] < -7, 1, 0);
    B <- if_else(exp(beta[2]) < 0.8, 1, 0);
    
    A_or_B <- if_else(A || B, 1, 0);
    A_and_B <- if_else(A && B, 1, 0);
}
"""

In [16]:

fit = pystan.stan(model_code=model, 
                  data=dict(x=data.sbp0, 
                            y1=data.sbp, 
                            y2=data.ds, 
                            treat=data.trt, n=n, sigma_b=(2, 1.5), Zero=(0,0)))

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_456ec81631c7e21b1d5f3580dbc6353b NOW.

In [17]:

fit['beta'].mean(0)

Out[17]:

array([-6.8113808 ,  0.09079104])

In [18]:

np.exp(fit['beta'][:,1]).mean()

Out[18]:

1.1123857364294381

In [19]:

fit.plot(pars=['beta']);

Probabilities of A OR B and A AND B::

In [20]:

sample = fit.extract(pars=['A_or_B', 'A_and_B'])

In [21]:

[sample[s].mean(0) for s in sample]

Out[21]:

[0.25574999999999998, 0.0082500000000000004]

Correlation between betas

In [22]:

fit['Omega_b'].mean(0)

Out[22]:

array([[ 1.        , -0.01236086],
       [-0.01236086,  1.        ]])