Baseline adjustment simulation study

A "pre/post" study assesses each subject at two time points. Let y1 and y2 denote the responses of a subject at these two time points. A common analysis is to look at how the "change score" y2 - y1 is related to a covariate x. One way to approach this question is to regress y2 - y1 on x and y1. Glymour et al. raise questions about the ability of this analytic approach to properly capture the role of x. They claim that when y1 and y2 are measured with error, the assessment of the relationship between x and the change score is biased toward overstating the strength of the effect. Here we look into this issue and consider several alternative approaches to conducting the analysis.

In the example presented by Glymour et al., y1 and y2 are "cognitive scores" that measure an underlying "cognitive state", and x is educational attainment. There is no question that there is a cross-sectional relationship between x and both y1 and y2, and these associations likely extend to cross-sectional associations between x and the two unobserved cognitive states. It is unclear, however, whether x is associated with the change in cognitive state.

Reference:

Glymour MM, Weuve J, Berkman LF, Kawachi I, Robins JM (2005). When Is Baseline Adjustment Useful in Analyses of Change? An Example with Education and Cognitive Change. American Journal of Epidemiology 162(3):267-278.

http://aje.oxfordjournals.org/content/162/3/267.long

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

The following function simulates data from a pre/post longitudinal study. The parameter r2_xc controls the strength of the association (if any) between the treatment variable x and the change in the state variables.

In [2]:
def gendat(n, meas_err_cov, r2_x1, r2_xc, long=False):
    """
    Simulate pre/post data.
    
    Parameters
    ----------
    n : int
        Sample size
    meas_err_cov : array
        The covariance matrix of the measurement errors.
    r2_x1 : float
        The R^2 for predicting the baseline state from x.
    r2_xc : float
        The R^2 for predicting the change of state from x.  Set to 0
        to satisfy the null hypothesis.
    long : boolean
        If True, return the data in long format.
    """
    x = np.random.normal(size=n)
    
    state1 = np.sqrt(r2_x1) * x + np.sqrt(1 - r2_x1) * np.random.normal(size=n)
    state_change = -1 + np.sqrt(r2_xc) * x + np.sqrt(1 - r2_xc) * np.random.normal(size=n)
    state2 = state1 + state_change + 0.5 * np.random.normal(size=n)

    meas_err = np.random.multivariate_normal(np.zeros(2), meas_err_cov, n)
    score1 = state1 + meas_err[:, 0]
    score2 = state2 + meas_err[:, 1]

    change_score = score2 - score1

    data = pd.DataFrame(index=range(n))
    data["change_score"] = change_score
    data["score1"] = score1
    data["score2"] = score2
    data["state1"] = state1
    data["state2"] = state2
    data["x"] = x
    
    if long:
        data["id"] = range(1000)
        data = pd.melt(data, id_vars=["id", "x"], value_vars=["score1", "score2"], value_name="score", var_name="time")
        data = data.sort_values("id")
        data["time"] = data["time"].replace({"score1": 1, "score2": 2})

    return data
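
As a quick illustration of the simulated data, here is a small call in long format (a sketch added for exposition; the covariance matrix is chosen arbitrarily, and the cov_me matrix used in the analyses is defined below).

In [ ]:
# Illustrative call: five subjects in long format, two rows per subject.
demo = gendat(5, np.asarray([[0.2, 0.0], [0.0, 0.2]]), r2_x1=0.3, r2_xc=0, long=True)
print(demo.head(6))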

Calibrate the simulation population

Here we check to make sure that the simulated data appear realistic.

In [3]:
# Covariance matrix of measurement errors when measurement error is present
cov_me = np.asarray([[0.2, 0], [0, 0.2]])

# Covariance matrix of measurement errors when no measurement error is present
cov_0 = np.zeros((2, 2))
In [4]:
data = gendat(1000, cov_me, r2_x1=0.3, r2_xc=0)
ax = plt.axes()
_ = ax.hist(data.state1, histtype='step', label="Baseline", lw=4, alpha=0.5)
_ = ax.hist(data.state2, histtype='step', label="Follow-up", lw=4, alpha=0.5)
ha, lb = ax.get_legend_handles_labels()
leg = plt.figlegend(ha, lb, "upper center", ncol=2)
leg.draw_frame(False)
plt.xlabel("Cognitive state")
plt.ylabel("Frequency")
[Figure: overlaid histograms of baseline and follow-up cognitive state]
In [5]:
plt.plot(data.state1, data.state2, 'o', alpha=0.2)
plt.xlabel("Baseline cognitive state")
plt.ylabel("Cognitive state at follow-up")
[Figure: scatter plot of cognitive state at follow-up against baseline cognitive state]
In [6]:
plt.plot(data.state1, data.score1, 'o', alpha=0.5)
plt.xlabel("Baseline cognitive state")
plt.ylabel("Baseline cognitive score")
[Figure: scatter plot of baseline cognitive score against baseline cognitive state]

OLS approaches

First we look at the situation where there is no measurement error (i.e. cognitive score and cognitive state are equal). The plot compares the quantiles of the Z-statistic for the effect of education on change in cognitive score to standard normal quantiles, for a linear model regressing the change score on the baseline score (y1) and the treatment variable (x). Under the null hypothesis (r2_xc = 0), the points should fall close to the 45-degree line.

In [7]:
def ols_bl(cov, r2_x1, r2_xc):
    zscores = []
    for j in range(100):
        data = gendat(1000, cov, r2_x1=r2_x1, r2_xc=r2_xc)
        model = sm.OLS.from_formula("change_score ~ x + score1", data)
        result = model.fit()
        zscores.append(result.tvalues["x"])
    zscores = np.asarray(zscores)
    zscores.sort()
    return zscores
    
zscores_ols_bl = ols_bl(cov_0, 0.3, 0)
sm.ProbPlot(zscores_ols_bl).qqplot(line='45')
plt.grid(True)
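
As a numeric summary, we can also check the empirical size of a nominal 5% two-sided test. The reject_rate helper below is a sketch added here for illustration; it is not part of the original analysis.

In [ ]:
def reject_rate(zscores):
    # Fraction of simulated Z-statistics exceeding the two-sided 5% critical value.
    return np.mean(np.abs(zscores) > 1.96)

# With no measurement error, this should be close to 0.05.
print(reject_rate(zscores_ols_bl))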

Next we introduce a bit of measurement error into the cognitive scores. The baseline measurement error enters the change score with a negative sign, so conditioning on the observed baseline score biases its coefficient; since x is correlated with the baseline score, the coefficient of x absorbs part of that bias.

In [8]:
zscores_ols_bl = ols_bl(cov_me, 0.3, 0)
sm.ProbPlot(zscores_ols_bl).qqplot(line='45')
plt.grid(True)
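
The same size check applies here; values well above 0.05 would indicate the bias described above (this uses the hypothetical reject_rate helper sketched earlier).

In [ ]:
print(reject_rate(zscores_ols_bl))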

Now we drop baseline cognitive score as a predictor of the cognitive score change.

In [9]:
def ols_simple(cov, r2_x1, r2_xc):
    zscores = []
    for j in range(100):
        data = gendat(1000, cov, r2_x1, r2_xc)
        model = sm.OLS.from_formula("change_score ~ x", data)
        result = model.fit()
        zscores.append(result.tvalues["x"])
    zscores = np.asarray(zscores)
    zscores.sort()
    return zscores

zscores_ols_simple = ols_simple(cov_me, 0.3, 0)
sm.ProbPlot(zscores_ols_simple).qqplot(line='45')
plt.grid(True)

As a hypothetical exercise, we can condition on the actual baseline cognitive state, rather than on the score (which is observed with error):

In [ ]:
def ols_impossible(cov, r2_x1, r2_xc):
    zscores = []
    for j in range(100):
        data = gendat(1000, cov, r2_x1, r2_xc)
        model = sm.OLS.from_formula("change_score ~ x + state1", data)
        result = model.fit()
        zscores.append(result.tvalues["x"])
    zscores = np.asarray(zscores)
    zscores.sort()
    return zscores

zscores_ols_impossible = ols_impossible(cov_me, 0.3, 0)
sm.ProbPlot(zscores_ols_impossible).qqplot(line='45')
plt.grid(True)

GEE

A different approach is to regress the individual scores on education, time, and an education x time interaction. The interaction captures any protective effect of education on the rate of cognitive decline. We can use GEE to account for the within-subject correlation between the two time points. This procedure seems to have the correct level when there is no effect.

In [ ]:
def gee(cov, r2_x1, r2_xc):
    zscores = []
    for j in range(100):
        data = gendat(1000, cov, r2_x1, r2_xc, long=True)
        model = sm.GEE.from_formula("score ~ x * time", groups="id",
                                    cov_struct=sm.cov_struct.Exchangeable(), data=data)
        result = model.fit()
        zscores.append(result.tvalues["x:time"])
    zscores = np.asarray(zscores)
    zscores.sort()
    return zscores

zscores_gee_0 = gee(cov_me, 0.3, 0)
sm.ProbPlot(zscores_gee_0).qqplot(line='45')
plt.grid(True)
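
The empirical size of the GEE-based test can be checked the same way (again using the reject_rate helper sketched earlier).

In [ ]:
# Should be near 0.05 if the GEE Z-statistics are approximately standard normal under the null.
print(reject_rate(zscores_gee_0))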

Next we look at the same analysis when there is a relationship between x and the change in state (r2_xc = 0.1). There is at least some power to detect an association between x and the change in cognitive score.

In [ ]:
zscores_gee_1 = gee(cov_me, 0.3, 0.1)

sm.ProbPlot(zscores_gee_1).qqplot(line='45')
plt.grid(True)
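
A rough empirical power estimate at the 5% level (once more using the reject_rate helper sketched above):

In [ ]:
print(reject_rate(zscores_gee_1))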

To get a sense of the power, we can compare the GEE procedure to the best possible OLS procedure, namely the OLS regression of the change score on x and the true baseline state.

In [ ]:
# This is the same procedure as ols_impossible above, applied with r2_xc = 0.1.
zscores_gee_impossible = ols_impossible(cov_me, 0.3, 0.1)
plt.plot(zscores_gee_impossible, zscores_gee_1, 'o')
plt.plot([4, 10], [4, 10], '-', color='grey')
plt.xlabel("Change score with true baseline")
plt.ylabel("GEE")
plt.grid(True)

We can also compare the GEE procedure to the OLS change score regression adjusting for the observed baseline score (which we know can be biased).

In [ ]:
zscores_ols_bl = ols_bl(cov_me, 0.3, 0.1)

plt.clf()
plt.plot(zscores_ols_bl, zscores_gee_1, 'o')
plt.plot([4, 12], [4, 12], '-', color='grey')
plt.xlabel("Change score with estimated baseline")
plt.ylabel("GEE")
plt.grid(True)

Finally, we can compare the GEE results to what we obtain when regressing the change score on x without baseline adjustment.

In [ ]:
zscores_ols_simple_1 = ols_simple(cov_me, 0.3, 0.1)

plt.plot(zscores_ols_simple_1, zscores_gee_1, 'o')
plt.plot([4, 12], [4, 12], '-', color='grey')
plt.xlabel("Change score with no baseline adjustment")
plt.ylabel("GEE")
plt.grid(True)
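
To summarize the power comparison numerically, we can look at the mean Z-statistic across simulation replications for each procedure (an illustrative summary using the arrays computed above).

In [ ]:
print("OLS, true baseline:    ", zscores_gee_impossible.mean())
print("OLS, observed baseline:", zscores_ols_bl.mean())
print("OLS, no baseline:      ", zscores_ols_simple_1.mean())
print("GEE:                   ", zscores_gee_1.mean())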

Correlated measurement error

Finally, we repeat the GEE analysis with measurement errors that are correlated across the two time points, first under the alternative (r2_xc = 0.1) and then under the null (r2_xc = 0).

In [ ]:
cov_me_dep = np.asarray([[0.2, 0.1], [0.1, 0.2]])
In [ ]:
zscores_gee_dep_1 = gee(cov_me_dep, 0.3, 0.1)
In [ ]:
zscores_gee_dep_0 = gee(cov_me_dep, 0.3, 0)
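
Assuming the runs above complete, the same QQ diagnostics used earlier apply; the cells below are a sketch mirroring the previous ones.

In [ ]:
sm.ProbPlot(zscores_gee_dep_1).qqplot(line='45')
plt.grid(True)
In [ ]:
sm.ProbPlot(zscores_gee_dep_0).qqplot(line='45')
plt.grid(True)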