Notebook

Answer all questions and submit your answers as a Jupyter notebook, LaTeX document, or Markdown document. Each question is worth 25 points.

This homework is due Friday, March 18, 2022.

In [1]:

import numpy as np
import pymc as pm
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Set seed
np.random.seed(42)

Question 1

The goal of this problem is to investigate the role of the proposal distribution in a Metropolis-Hastings algorithm designed to simulate from the posterior distribution of the mixture parameter $\delta$.

1. Simulate 200 realizations from the mixture distribution $$y_i \sim \delta N(7, 0.5^2) + (1-\delta) N(10, 0.5^2)$$ with $\delta = 0.7$. Plot a histogram of these data.
2. An approximate Bayesian computing (ABC) algorithm estimates the posterior distribution by using the model to produce artificial data sets from parameter values simulated from the prior distribution. A simulated parameter value is accepted if its artificial data set is "close enough" to the observed data. Implement an ABC procedure to simulate from the posterior distribution of $\delta$, using your data from part (1).
3. Implement a random walk M-H algorithm with proposal $\delta^{\prime} = \delta^{(i)} + \epsilon$ with $\epsilon \sim \text{Unif}(-1, 1)$.
4. Reparameterize the problem, letting $U = \log\left[\frac{\delta}{1 - \delta}\right]$ and $u^{\prime} = u^{(i)} + \epsilon$. Implement a random walk chain in $U$-space.
5. Compare the estimates and convergence behavior of the three algorithms.
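
As a rough illustration of parts (1) and (2), the sketch below simulates the mixture data and runs a simple rejection-style ABC sampler under the $\text{Unif}(0, 1)$ prior on $\delta$ that the problem specifies. The helper names (`simulate_mixture`, `abc_rejection`), the choice of the sample mean as the summary statistic, the tolerance `tol=0.05`, and the number of prior draws are all assumptions made for this example, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_mixture(delta, n, rng):
    """Draw n values from delta*N(7, 0.5^2) + (1 - delta)*N(10, 0.5^2)."""
    from_first = rng.random(n) < delta            # True -> component N(7, 0.5^2)
    means = np.where(from_first, 7.0, 10.0)
    return rng.normal(means, 0.5)

# Part (1): observed data with delta = 0.7
y = simulate_mixture(0.7, 200, rng)

def abc_rejection(y_obs, n_draws, tol, rng):
    """Keep prior draws whose simulated sample mean is within tol of the observed mean.

    The sample mean is an assumed summary statistic; under this mixture
    E[y] = 10 - 3*delta, so it carries information about delta.
    """
    accepted = []
    for _ in range(n_draws):
        delta = rng.uniform(0.0, 1.0)             # draw from the Unif(0, 1) prior
        y_sim = simulate_mixture(delta, len(y_obs), rng)
        if abs(y_sim.mean() - y_obs.mean()) < tol:
            accepted.append(delta)
    return np.array(accepted)

abc_draws = abc_rejection(y, n_draws=5000, tol=0.05, rng=rng)
print(f"accepted {abc_draws.size} of 5000 draws, mean delta = {abc_draws.mean():.3f}")
```

A smaller tolerance gives a closer approximation to the posterior at the cost of fewer accepted draws; other summary statistics or distance measures are equally valid choices.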

In part (1), you are asked to simulate data from a distribution with $\delta$ known. For parts (2)–(4), assume $\delta$ is unknown with prior $\delta \sim \text{Unif}(0, 1)$. For parts (2)–(4), provide an appropriate plot and a table summarizing the output of the algorithm. To facilitate comparisons, use the same number of iterations, random seed, starting values, and burn-in period for all implementations of the algorithm.
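
The two random walk chains in parts (3) and (4) can share one sampler if each is given its own log target. The sketch below is one possible arrangement, not a reference solution: the function names, the 5000 iterations, the burn-in of 1000, and the regenerated stand-in data are assumptions. In $U$-space the target picks up the Jacobian term $\log\delta + \log(1-\delta)$, because $d\delta/du = \delta(1-\delta)$ when $\delta = 1/(1+e^{-u})$.

```python
import numpy as np

# Stand-in for the data of part (1); regenerated so the block runs on its own.
data_rng = np.random.default_rng(42)
y = data_rng.normal(np.where(data_rng.random(200) < 0.7, 7.0, 10.0), 0.5)

def log_lik(delta):
    """Mixture log-likelihood; the shared normal constant cancels in the M-H ratio."""
    dens7 = np.exp(-0.5 * ((y - 7.0) / 0.5) ** 2)
    dens10 = np.exp(-0.5 * ((y - 10.0) / 0.5) ** 2)
    return np.sum(np.log(delta * dens7 + (1.0 - delta) * dens10))

def log_post_delta(delta):
    """Log posterior in delta-space under the Unif(0, 1) prior."""
    if not 0.0 < delta < 1.0:
        return -np.inf                            # outside the prior support
    return log_lik(delta)

def log_post_u(u):
    """Log posterior in U-space: likelihood plus log Jacobian log(delta*(1 - delta))."""
    delta = 1.0 / (1.0 + np.exp(-u))
    log_jacobian = -np.log1p(np.exp(-u)) - np.log1p(np.exp(u))
    return log_lik(delta) + log_jacobian

def random_walk_mh(log_target, start, n_iter, rng):
    """Random walk Metropolis with epsilon ~ Unif(-1, 1); returns chain and acceptance rate."""
    chain = np.empty(n_iter)
    current, current_lp = start, log_target(start)
    n_accept = 0
    for i in range(n_iter):
        proposal = current + rng.uniform(-1.0, 1.0)
        proposal_lp = log_target(proposal)
        # Symmetric proposal, so the acceptance ratio is the ratio of targets
        if np.log(rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
            n_accept += 1
        chain[i] = current
    return chain, n_accept / n_iter

n_iter, burn = 5000, 1000
# Same seed and equivalent starting values (delta = 0.5 corresponds to u = 0)
delta_chain, delta_rate = random_walk_mh(log_post_delta, 0.5, n_iter, np.random.default_rng(1))
u_chain, u_rate = random_walk_mh(log_post_u, 0.0, n_iter, np.random.default_rng(1))
delta_from_u = 1.0 / (1.0 + np.exp(-u_chain))     # map the U-space chain back to delta

for name, draws, rate in [("delta", delta_chain[burn:], delta_rate),
                          ("U", delta_from_u[burn:], u_rate)]:
    print(f"{name:6s} mean={draws.mean():.3f} sd={draws.std():.3f} accept={rate:.2f}")
```

Comparing the two acceptance rates and trace plots is one way to see how the proposal scale interacts with the parameterization.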

In [ ]:

Question 2

Carlin (1992) considers a Bayesian approach to meta-analysis, and includes the following example of 22 trials of beta-blockers to prevent mortality after myocardial infarction. These data are given below.

In one possible random effects model, we assume the true baseline mean (on a log-odds scale) $\mu_i$ in trial $i$ is drawn from some population distribution. Let $r^C_i$ denote the number of events in the control group in trial $i$, and $r^T_i$ denote the number of events under active treatment in trial $i$. Our model is:

$$\begin{aligned} r^C_i &\sim \text{Binomial}\left(p^C_i, n^C_i\right) \\ r^T_i &\sim \text{Binomial}\left(p^T_i, n^T_i\right) \\ \text{logit}\left(p^C_i\right) &= \mu_i \\ \text{logit}\left(p^T_i\right) &= \mu_i + \delta \\ \mu_i &\sim \text{Normal}(m, s). \end{aligned}$$

In this case, we want to make inferences about the population effect $m$, and the predictive distribution for the effect $\delta_{\text{new}}$ in a new trial.

This particular model uses a random effect for the population mean, and a fixed effect for the treatment effect. There are 3 other models you could fit to represent all possible combinations of fixed or random effects for these two parameters.
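
As an illustration of one of the alternatives, a random treatment effect could be written by replacing the common $\delta$ with trial-specific effects drawn from a population distribution. The hyperparameter names $d$ and $\tau$ below are assumptions introduced for this sketch; they do not appear in the model above.

$$\begin{aligned} \text{logit}\left(p^T_i\right) &= \mu_i + \delta_i \\ \delta_i &\sim \text{Normal}(d, \tau) \\ \delta_{\text{new}} &\sim \text{Normal}(d, \tau). \end{aligned}$$

Under this variant, $\delta_{\text{new}}$ is the predicted effect in a new trial. A fixed baseline could analogously be expressed by replacing the hierarchical $\mu_i$ with a single common $\mu$, which is one possible reading of "fixed effect" here.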

Build all 4 models in PyMC to estimate the treatment effect, and:

- use convergence diagnostics to check for convergence in each model
- use posterior predictive checks to compare the fit of the models
- use DIC to compare the models as approximations of the true generating model

Which model would you select and why?

In [13]:

# Events in the treatment group of each trial
r_t_obs = np.array([3, 7, 5, 102, 28, 4, 98, 60, 25, 138, 64, 45, 9, 57, 25, 33, 28, 8, 6, 32, 27, 22])
# Treatment group sizes
n_t_obs = np.array([38, 114, 69, 1533, 355, 59, 945, 632, 278, 1916, 873, 263, 291, 858, 154, 207, 251, 151, 174, 209, 391, 680])
# Events in the control group of each trial
r_c_obs = np.array([3, 14, 11, 127, 27, 6, 152, 48, 37, 188, 52, 47, 16, 45, 31, 38, 12, 6, 3, 40, 43, 39])
# Control group sizes
n_c_obs = np.array([39, 116, 93, 1520, 365, 52, 939, 471, 282, 1921, 583, 266, 293, 883, 147, 213, 122, 154, 134, 218, 364, 674])
N = len(n_c_obs)

In [ ]: