Hierarchical or multilevel modeling is a generalization of regression modeling.
Multilevel models are regression models in which the constituent model parameters are given probability models. This implies that model parameters are allowed to vary by group.
Observational units are often naturally clustered. Clustering induces dependence between observations, despite random sampling of clusters and random sampling within clusters.
A hierarchical model is a particular multilevel model where parameters are nested within one another.
Some multilevel structures are not hierarchical.
We will motivate this topic using an environmental epidemiology example.
Radon is a radioactive gas that enters homes through contact points with the ground. It is a carcinogen that is the primary cause of lung cancer in non-smokers. Radon levels vary greatly from household to household.
The EPA did a study of radon levels in 80,000 houses. Two important predictors:
We will focus on modeling radon levels in Minnesota.
The hierarchy in this example is households within county.
First, we import the data from a local file, and extract Minnesota's data.
%matplotlib inline
import numpy as np
import pandas as pd
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_context('notebook')
import warnings
warnings.filterwarnings("ignore", module="mkl_fft")
warnings.filterwarnings("ignore", module="matplotlib")
DATA_URL = 'https://raw.githubusercontent.com/fonnesbeck/Bios8366/master/data/'
try:
srrs2 = pd.read_csv('../data/srrs2.dat')
except FileNotFoundError:
srrs2 = pd.read_csv(DATA_URL + 'srrs2.dat')
# Import radon data
srrs2.columns = srrs2.columns.map(str.strip)
srrs_mn = srrs2[srrs2.state=='MN'].copy()
RANDOM_SEED = 20090425
Next, obtain the county-level predictor, uranium, by combining two variables.
try:
cty = pd.read_csv('../data/cty.dat')
except FileNotFoundError:
cty = pd.read_csv(DATA_URL + 'cty.dat')
srrs_mn['fips'] = srrs_mn.stfips*1000 + srrs_mn.cntyfips
cty_mn = cty[cty.st=='MN'].copy()
cty_mn['fips'] = 1000*cty_mn.stfips + cty_mn.ctfips
Use the merge
method to combine home- and county-level information in a single DataFrame.
srrs_mn = srrs_mn.merge(cty_mn[['fips', 'Uppm']], on='fips')
srrs_mn = srrs_mn.drop_duplicates(subset='idnum')
u = np.log(srrs_mn.Uppm).unique()
n = len(srrs_mn)
We also need a lookup table (dict
) for each unique county, for indexing.
srrs_mn.county = srrs_mn.county.map(str.strip)
county, mn_counties = srrs_mn.county.factorize()
Finally, create local copies of variables.
radon = srrs_mn.activity
srrs_mn['log_radon'] = log_radon = np.log(radon + 0.1).values
floor_measure = srrs_mn.floor.values
Distribution of radon levels in MN (log scale):
srrs_mn.activity.apply(lambda x: np.log(x+0.1)).hist(bins=25, grid=False);
The two conventional alternatives to modeling radon exposure represent the two extremes of the bias-variance tradeoff:
*Complete pooling*:
Treat all counties the same, and estimate a single radon level.
$$y_i = \alpha + \beta x_i + \epsilon_i$$*No pooling*:
Model radon in each county independently.
$$y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i$$where $j = 1,\ldots,85$
The errors $\epsilon_i$ may represent measurement error, temporal within-house variation, or variation among houses.
Here are the point estimates of the slope and intercept for the complete pooling model:
floor = srrs_mn.floor.values
log_radon = srrs_mn.log_radon.values
with pm.Model(rng_seeder=RANDOM_SEED) as pooled_model:
mu = pm.Normal('mu', 0, sd=1e5)
beta = pm.Normal('beta', mu=0, sd=1e5)
sigma = pm.HalfCauchy('sigma', 5)
theta = mu + beta*floor
y = pm.Normal('y', theta, sd=sigma, observed=log_radon)
pm.model_to_graphviz(pooled_model)
with pooled_model:
pooled_trace = pm.sample()
mu_mean = pooled_trace.posterior.mean(dim=("chain", "draw")).mu.values
beta_mean = pooled_trace.posterior.mean(dim=("chain", "draw")).beta.values
plt.scatter(srrs_mn.floor, np.log(srrs_mn.activity+0.1))
xvals = np.linspace(-0.2, 1.2)
plt.plot(xvals, beta_mean*xvals + mu_mean, 'r--');
Estimates of county radon levels for the unpooled model:
coords={'county': mn_counties}
with pm.Model(coords=coords, rng_seeder=RANDOM_SEED) as unpooled_model:
mu = pm.Normal('mu', 0, sd=1e5, dims='county')
beta = pm.Normal('beta', 0, sd=1e5)
sigma = pm.HalfCauchy('sigma', 5)
theta = mu[county] + beta*floor
y = pm.Normal('y', theta, sd=sigma, observed=log_radon)
pm.model_to_graphviz(unpooled_model)
with unpooled_model:
unpooled_trace = pm.sample()
az.plot_forest(
unpooled_trace,
var_names=['mu'],
ess=True, r_hat=True,
combined=True,
figsize=(6,18)
);
unpooled_estimates = unpooled_trace.posterior.mean(dim=('chain', 'draw')).mu
unpooled_se = unpooled_trace.posterior.std(dim=('chain', 'draw')).mu
We can plot the ordered estimates to identify counties with high radon levels:
unpooled_means = unpooled_trace.posterior.mean(dim=("chain", "draw"))
unpooled_hdi = az.hdi(unpooled_trace)
unpooled_means_iter = unpooled_means.sortby("mu")
unpooled_hdi_iter = unpooled_hdi.sortby(unpooled_means_iter.mu)
_, ax = plt.subplots(figsize=(10,6))
xticks = np.arange(0, 86, 6)
unpooled_means_iter.plot.scatter(x="county", y="mu", ax=ax, alpha=0.8)
ax.vlines(
np.arange(mn_counties.size),
unpooled_hdi_iter.mu.sel(hdi="lower"),
unpooled_hdi_iter.mu.sel(hdi="higher"),
color="orange",
alpha=0.6,
)
ax.set(ylabel="Radon estimate", ylim=(-2, 4.5))
ax.set_xticks(xticks)
ax.set_xticklabels(unpooled_means_iter.county.values[xticks])
ax.tick_params(rotation=45)
sns.despine(trim=True);
Here are visual comparisons between the pooled and unpooled estimates for a subset of counties representing a range of sample sizes.
sample_counties = ('LAC QUI PARLE', 'AITKIN', 'KOOCHICHING',
'DOUGLAS', 'CLAY', 'STEARNS', 'RAMSEY', 'ST LOUIS')
fig, axes = plt.subplots(2, 4, figsize=(12, 6), sharey=True, sharex=True)
axes = axes.ravel()
m = unpooled_trace.posterior.mean(dim=("chain", "draw")).beta
for i,c in enumerate(sample_counties):
y = srrs_mn.log_radon[srrs_mn.county==c]
x = srrs_mn.floor[srrs_mn.county==c]
axes[i].scatter(x + np.random.randn(len(x))*0.01, y, alpha=0.4)
# No pooling model
b = unpooled_estimates.sel(county=c)
# Plot both models and data
xvals = np.linspace(0, 1)
axes[i].plot(xvals, m.values*xvals+b.values)
axes[i].plot(xvals, beta_mean*xvals+mu_mean, 'r--')
axes[i].set_xticks([0,1])
axes[i].set_xticklabels(['basement', 'floor'])
axes[i].set_ylim(-1, 3)
axes[i].set_title(c)
if not i%2:
axes[i].set_ylabel('log radon level')
Neither of these models are satisfactory:
When we pool our data, we imply that they are sampled from the same model. This ignores any variation among sampling units (other than sampling variance):
When we analyze data unpooled, we imply that they are sampled independently from separate models. At the opposite extreme from the pooled case, this approach claims that differences between sampling units are to large to combine them:
In a hierarchical model, parameters are viewed as a sample from a population distribution of parameters. Thus, we view them as being neither entirely different or exactly the same. This is *parital pooling*.
We can use PyMC to easily specify multilevel models, and fit them using Markov chain Monte Carlo.
The simplest partial pooling model for the household radon dataset is one which simply estimates radon levels, without any predictors at any level. A partial pooling model represents a compromise between the pooled and unpooled extremes, approximately a weighted average (based on sample size) of the unpooled county estimates and the pooled estimates.
$$\hat{\alpha} \approx \frac{(n_j/\sigma_y^2)\bar{y}_j + (1/\sigma_{\alpha}^2)\bar{y}}{(n_j/\sigma_y^2) + (1/\sigma_{\alpha}^2)}$$Estimates for counties with smaller sample sizes will shrink towards the state-wide average.
Estimates for counties with larger sample sizes will be closer to the unpooled county estimates.
with pm.Model(coords=coords, rng_seeder=RANDOM_SEED) as partial_pooling:
# Priors
mu_a = pm.Normal('mu_a', mu=0., sd=1e5)
sigma_a = pm.HalfCauchy('sigma_a', 5)
# Random intercepts
mu = pm.Normal('mu', mu=mu_a, sd=sigma_a, dims='county')
# Model error
sigma_y = pm.HalfCauchy('sigma_y',5)
# Expected value
y_hat = mu[county]
# Data likelihood
y_like = pm.Normal('y_like', mu=y_hat, sd=sigma_y, observed=log_radon)
pm.model_to_graphviz(partial_pooling)
with partial_pooling:
partial_pooling_trace = pm.sample(tune=2000)
N_county = srrs_mn.groupby("county")["idnum"].count().values
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, trace, level in zip(
axes,
(unpooled_trace, partial_pooling_trace),
("no pooling", "partial pooling"),
):
# add variable with x values to xarray dataset
trace.posterior = trace.posterior.assign_coords({"N_county": ("county", N_county)})
# plot means
trace.posterior.mean(dim=("chain", "draw")).plot.scatter(
x="N_county", y="mu", ax=ax, alpha=0.9
)
ax.hlines(
partial_pooling_trace.posterior.mu.mean(),
0.9,
max(N_county) + 1,
alpha=0.4,
ls="--",
label="Est. population mean",
)
# plot hdi
hdi = az.hdi(trace).mu
ax.vlines(N_county, hdi.sel(hdi="lower"), hdi.sel(hdi="higher"), color="orange", alpha=0.5)
ax.set(
title=f"{level.title()} Estimates",
xlabel="Nbr obs in county (log scale)",
xscale="log",
ylabel="Log radon",
)
ax.legend(fontsize=10)
Notice the difference between the unpooled and partially-pooled estimates, particularly at smaller sample sizes. The former are both more extreme and more imprecise.
This model allows intercepts to vary across county, according to a random effect.
$$y_i = \alpha_{j[i]} + \beta x_{i} + \epsilon_i$$where
$$\epsilon_i \sim N(0, \sigma_y^2)$$and the intercept random effect:
$$\alpha_{j[i]} \sim N(\mu_{\alpha}, \sigma_{\alpha}^2)$$As with the the “no-pooling” model, we set a separate intercept for each county, but rather than fitting separate least squares regression models for each county, multilevel modeling shares strength among counties, allowing for more reasonable inference in counties with little data.
with pm.Model(coords=coords, rng_seeder=RANDOM_SEED) as varying_intercept:
# Priors
mu_a = pm.Normal('mu_a', mu=0., tau=0.0001)
sigma_a = pm.HalfCauchy('sigma_a', 5)
# Random intercepts
mu = pm.Normal('mu', mu=mu_a, sd=sigma_a, dims='county')
# Common slope
beta = pm.Normal('beta', mu=0., sd=1e5)
# Model error
sd_y = pm.HalfCauchy('sd_y', 5)
# Expected value
y_hat = mu[county] + beta * floor_measure
# Data likelihood
y_like = pm.Normal('y_like', mu=y_hat, sd=sd_y, observed=log_radon)
pm.model_to_graphviz(varying_intercept)
with varying_intercept:
varying_intercept_trace = pm.sample(tune=2000)
pm.plot_forest(varying_intercept_trace, var_names=['mu'], figsize=(6,18), combined=True, ess=True, r_hat=True);
pm.plot_posterior(varying_intercept_trace, var_names=['sigma_a', 'beta']);
The estimate for the floor
coefficient is approximately -0.66, which can be interpreted as houses without basements having about half ($\exp(-0.66) = 0.52$) the radon levels of those with basements, after accounting for county.
az.summary(varying_intercept_trace, var_names=['beta'])
import xarray as xr
xvals = xr.DataArray([0, 1], dims="Level", coords={"Level": ["Basement", "Floor"]})
post = varying_intercept_trace.posterior # alias for readability
theta = (
(post.mu + post.beta * xvals).mean(dim=("chain", "draw")).to_dataset(name="Mean log radon")
)
_, ax = plt.subplots()
theta.plot.scatter(x="Level", y="Mean log radon", alpha=0.2, color="k", ax=ax) # scatter
ax.plot(xvals, theta["Mean log radon"].T, "k-", alpha=0.2)
# add lines too
ax.set_title("MEAN LOG RADON BY COUNTY");
It is easy to show that the partial pooling model provides more objectively reasonable estimates than either the pooled or unpooled models, at least for counties with small sample sizes.
sample_counties = ('LAC QUI PARLE', 'AITKIN', 'KOOCHICHING',
'DOUGLAS', 'CLAY', 'STEARNS', 'RAMSEY', 'ST LOUIS')
fig, axes = plt.subplots(2, 4, figsize=(12, 6), sharey=True, sharex=True)
axes = axes.ravel()
m = unpooled_trace.posterior.mean(dim=("chain", "draw")).beta
for i,c in enumerate(sample_counties):
y = srrs_mn.log_radon[srrs_mn.county==c]
x = srrs_mn.floor[srrs_mn.county==c]
axes[i].scatter(x + np.random.randn(len(x))*0.01, y, alpha=0.4)
# No pooling model
b = unpooled_estimates.sel(county=c)
# Plot both models and data
xvals = np.linspace(0, 1)
axes[i].plot(xvals, m.values*xvals+b.values)
axes[i].plot(xvals, beta_mean*xvals+mu_mean, 'r--')
varying_intercept_trace.posterior.sel(county=c).beta
post = varying_intercept_trace.posterior.sel(county='DOUGLAS').mean(dim=("chain", "draw"))
theta = (
post.mu.values + post.beta.values * xvals
)
axes[i].plot(xvals, theta, 'k:')
axes[i].set_xticks([0,1])
axes[i].set_xticklabels(['basement', 'floor'])
axes[i].set_ylim(-1, 3)
axes[i].set_title(c)
if not i%2:
axes[i].set_ylabel('log radon level')
Alternatively, we can posit a model that allows the counties to vary according to how the location of measurement (basement or floor) influences the radon reading.
$$y_i = \alpha + \beta_{j[i]} x_{i} + \epsilon_i$$with pm.Model(coords=coords, rng_seeder=RANDOM_SEED) as varying_slope:
# Priors
mu_b = pm.Normal('mu_b', mu=0., sd=1e5)
sigma_b = pm.HalfCauchy('sigma_b', 5)
# Common intercepts
mu = pm.Normal('mu', mu=0., sd=1e5)
# Random slopes
beta = pm.Normal('beta', mu=mu_b, sd=sigma_b, dims='county')
# Model error
sigma_y = pm.HalfCauchy('sigma_y',5)
# Expected value
y_hat = mu + beta[county] * floor_measure
# Data likelihood
y_like = pm.Normal('y_like', mu=y_hat, sd=sigma_y, observed=log_radon)
with varying_slope:
varying_slope_trace = pm.sample()
az.plot_forest(varying_slope_trace, var_names=['beta'], figsize=(6,18), combined=True, ess=True, r_hat=True);
xvals = xr.DataArray([0, 1], dims="Level", coords={"Level": ["Basement", "Floor"]})
post = varying_slope_trace.posterior # alias for readability
theta = (
(post.mu + post.beta * xvals).mean(dim=("chain", "draw")).to_dataset(name="Mean log radon")
)
_, ax = plt.subplots()
theta.plot.scatter(x="Level", y="Mean log radon", alpha=0.2, color="k", ax=ax) # scatter
ax.plot(xvals, theta["Mean log radon"].T, "k-", alpha=0.2)
# add lines too
ax.set_title("MEAN LOG RADON BY COUNTY");
The partial pooling models specified above uses a centered parameterization of the slope random effect. That is, the individual county effects are distributed around a county mean, with a spread controlled by the hierarchical standard deviation parameter. As the preceding plot reveals, this constraint serves to shrink county estimates toward the overall mean, to a degree proportional to the county sample size. This is exactly what we want, and the model appears to fit well--the Gelman-Rubin statistics are exactly 1.
But, on closer inspection, there are signs of trouble. Specifically, let's look at the trace of the random effects, and their corresponding standard deviation:
fig, axs = plt.subplots(nrows=2)
axs[0].plot(varying_slope_trace.posterior.sel(chain=0)['sigma_b'], alpha=.5);
axs[0].set(ylabel='sigma_b');
axs[1].plot(varying_slope_trace.posterior.sel(chain=0)['beta'], alpha=.05);
axs[1].set(ylabel='beta');
Notice that when the chain reaches the lower end of the parameter space for $\sigma_b$, it appears to get "stuck" and the entire sampler, including the random slopes b
, mixes poorly.
Jointly plotting the random effect variance and one of the individual random slopes demonstrates what is going on.
x = varying_slope_trace.posterior['beta'].sel(chain=0, county='AITKIN').to_series()
x.name='slope'
y = varying_slope_trace.posterior['sigma_b'].sel(chain=0).to_series()
y.name='slope group variance'
jp = sns.jointplot(x=x, y=y, ylim=(0, .7));
When the group variance is small, this implies that the individual random slopes are themselves close to the group mean. This results in a funnel-shaped relationship between the samples of group variance and any of the slopes (particularly those with a smaller sample size).
In itself, this is not a problem, since this is the behavior we expect. However, if the sampler is tuned for the wider (unconstrained) part of the parameter space, it has trouble in the areas of higher curvature. The consequence of this is that the neighborhood close to the lower bound of $\sigma_b$ is sampled poorly; indeed, in our chain it is not sampled at all below 0.1. The result of this will be biased inference.
Now that we've spotted the problem, what can we do about it? The best way to deal with this issue is to reparameterize our model. Notice the random slopes in this version:
with pm.Model(coords=coords, rng_seeder=RANDOM_SEED) as varying_slope_noncentered:
# Priors
mu_b = pm.Normal('mu_b', mu=0., sd=1e5)
sigma_b = pm.HalfCauchy('sigma_b', 5)
# Common intercepts
mu = pm.Normal('mu', mu=0., sd=1e5)
# Non-centered random slopes
# Centered: b = pm.Normal('b', mu_b, sd=sigma_b, shape=counties)
z = pm.Normal('z', mu=0, sd=1, dims='county')
beta = pm.Deterministic("beta", mu_b + z * sigma_b, dims='county')
# Model error
sigma_y = pm.HalfCauchy('sigma_y',5)
# Expected value
y_hat = mu + beta[county] * floor_measure
# Data likelihood
y_like = pm.Normal('y_like', mu=y_hat, sd=sigma_y, observed=log_radon)
pm.model_to_graphviz(varying_slope_noncentered)
This is a non-centered parameterization. By this, we mean that the random deviates are no longer explicitly modeled as being centered on $\mu_b$. Instead, they are independent standard normals $\upsilon$, which are then scaled by the appropriate value of $\sigma_b$, before being location-transformed by the mean.
This model samples much better.
with varying_slope_noncentered:
noncentered_trace = pm.sample(tune=2000, target_accept=.9)
Notice that the bottlenecks in the traces are gone.
fig, axs = plt.subplots(nrows=2)
axs[0].plot(noncentered_trace.posterior.sel(chain=0)['sigma_b'], alpha=.5);
axs[0].set(ylabel='sigma_b');
axs[1].plot(noncentered_trace.posterior.sel(chain=0)['beta'], alpha=.05);
axs[1].set(ylabel='beta');
And, we are now fully exploring the support of the posterior.
x = noncentered_trace.posterior['beta'].sel(chain=0, county='AITKIN').to_series()
x.name='slope'
y = noncentered_trace.posterior['sigma_b'].sel(chain=0).to_series()
y.name='slope group variance'
jp = sns.jointplot(x=x, y=y, ylim=(0, .7));
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, constrained_layout=True)
az.plot_posterior(varying_slope_trace, var_names=['sigma_b'], ax=ax1)
az.plot_posterior(noncentered_trace, var_names=['sigma_b'], ax=ax2)
ax1.set_title('Centered (top) and non-centered (bottom)');
The most general model allows both the intercept and slope to vary by county:
$$y_i = \alpha_{j[i]} + \beta_{j[i]} x_{i} + \epsilon_i$$with pm.Model(coords=coords, rng_seeder=RANDOM_SEED) as varying_intercept_slope:
# Priors
mu_a = pm.Normal('mu_a', mu=0., sd=1e5)
sigma_a = pm.HalfCauchy('sigma_a', 5)
mu_b = pm.Normal('mu_b', mu=0., sd=1e5)
sigma_b = pm.HalfCauchy('sigma_b', 5)
# Random intercepts
mu = pm.Normal('mu', mu=mu_a, sd=sigma_a, dims='county')
# Random slopes
beta = pm.Normal('beta', mu=mu_b, sd=sigma_b, dims='county')
# Model error
sigma_y = pm.Uniform('sigma_y', lower=0, upper=100)
# Expected value
y_hat = mu[county] + beta[county] * floor_measure
# Data likelihood
y_like = pm.Normal('y_like', mu=y_hat, sd=sigma_y, observed=log_radon)
with varying_intercept_slope:
varying_intercept_slope_trace = pm.sample(tune=2000)
az.plot_energy(varying_intercept_slope_trace)
az.plot_forest(varying_intercept_slope_trace, var_names=['mu','beta'], figsize=(6,24), combined=True, ess=True, r_hat=True);
import xarray as xr
xvals = xr.DataArray([0, 1], dims="Level", coords={"Level": ["Basement", "Floor"]})
post = varying_intercept_slope_trace.posterior # alias for readability
theta = (
(post.mu + post.beta * xvals).mean(dim=("chain", "draw")).to_dataset(name="Mean log radon")
)
_, ax = plt.subplots()
theta.plot.scatter(x="Level", y="Mean log radon", alpha=0.2, color="k", ax=ax) # scatter
ax.plot(xvals, theta["Mean log radon"].T, "k-", alpha=0.2)
# add lines too
ax.set_title("MEAN LOG RADON BY COUNTY");
Reparameterize the varying_intercept_slope
model to be non-centered, and compare the resulting parameter estimates.
with pm.Model(coords=coords, rng_seeder=RANDOM_SEED) as varying_intercept_slope_noncentered:
# Priors
mu_a = pm.Normal('mu_a', mu=0., sd=1e5)
sigma_a = pm.HalfCauchy('sigma_a', 5)
mu_b = pm.Normal('mu_b', mu=0., sd=1e5)
sigma_b = pm.HalfCauchy('sigma_b', 5)
# Random intercepts
z_mu = pm.Normal('z_mu', mu=0, sd=1, dims='county')
mu = pm.Deterministic("mu", mu_a + z_mu * sigma_a, dims='county')
# Random slopes
z_beta = pm.Normal('z_beta', mu=0, sd=1, dims='county')
beta = pm.Deterministic("beta", mu_b + z_beta * sigma_b, dims='county')
# Model error
sigma_y = pm.Uniform('sigma_y', lower=0, upper=100)
# Expected value
y_hat = mu[county] + beta[county] * floor_measure
# Data likelihood
y_like = pm.Normal('y_like', mu=y_hat, sd=sigma_y, observed=log_radon)
with varying_intercept_slope_noncentered:
varying_intercept_slope_noncentered_trace = pm.sample(tune=2000, target_accept=.9)
az.plot_forest(varying_intercept_slope_noncentered_trace, var_names=['mu','beta'], figsize=(6,24), combined=True, ess=True, r_hat=True);
A primary strength of multilevel models is the ability to handle predictors on multiple levels simultaneously. If we consider the varying-intercepts model above:
$$y_i = \alpha_{j[i]} + \beta x_{i} + \epsilon_i$$we may, instead of a simple random effect to describe variation in the expected radon value, specify another regression model with a county-level covariate. Here, we use the county uranium reading $u_j$, which is thought to be related to radon levels:
$$\alpha_j = \gamma_0 + \gamma_1 u_j + \zeta_j$$$$\zeta_j \sim N(0, \sigma_{\alpha}^2)$$Thus, we are now incorporating a house-level predictor (floor or basement) as well as a county-level predictor (uranium).
Note that the model has both indicator variables for each county, plus a county-level covariate. In classical regression, this would result in collinearity. In a multilevel model, the partial pooling of the intercepts towards the expected value of the group-level linear model avoids this.
Group-level predictors also serve to reduce group-level variation $\sigma_{\alpha}$. An important implication of this is that the group-level estimate induces stronger pooling.
with pm.Model(coords=coords, rng_seeder=RANDOM_SEED) as hierarchical_intercept:
# Priors
sigma_a = pm.HalfCauchy('sigma_a', 5)
# County uranium model
gamma_0 = pm.Normal('gamma_0', mu=0., sd=1e5)
gamma_1 = pm.Normal('gamma_1', mu=0., sd=1e5)
# Uranium model for intercept
mu_a = pm.Deterministic('mu_a', gamma_0 + gamma_1*u)
# County variation not explained by uranium
epsilon_a = pm.Normal('epsilon_a', mu=0, sd=1, dims='county')
mu = pm.Deterministic('mu', mu_a + sigma_a*epsilon_a)
# Common slope
beta = pm.Normal('beta', mu=0., sd=1e5)
# Model error
sigma_y = pm.Uniform('sigma_y', lower=0, upper=100)
# Expected value
y_hat = mu[county] + beta * floor_measure
# Data likelihood
y_like = pm.Normal('y_like', mu=y_hat, sd=sigma_y, observed=log_radon)
with hierarchical_intercept:
hierarchical_intercept_trace = pm.sample(tune=2000)
uranium = u
post = hierarchical_intercept_trace.posterior.assign_coords(uranium=uranium)
avg_a = post["mu_a"].mean(dim=("chain", "draw")).values[np.argsort(uranium)]
avg_a_county = post["mu"].mean(dim=("chain", "draw"))
avg_a_county_hdi = az.hdi(post, var_names="mu")["mu"]
_, ax = plt.subplots()
ax.plot(uranium[np.argsort(uranium)], avg_a, "k--", alpha=0.6, label="Mean intercept")
az.plot_hdi(
uranium,
post["mu"],
fill_kwargs={"alpha": 0.1, "color": "k", "label": "Mean intercept HPD"},
ax=ax,
)
ax.scatter(uranium, avg_a_county, alpha=0.8, label="Mean county-intercept")
ax.vlines(
uranium,
avg_a_county_hdi.sel(hdi="lower"),
avg_a_county_hdi.sel(hdi="higher"),
alpha=0.5,
color="orange",
)
plt.xlabel("County-level uranium")
plt.ylabel("Intercept estimate")
plt.legend(fontsize=9);
The standard errors on the intercepts are narrower than for the partial-pooling model without a county-level covariate.
In some instances, having predictors at multiple levels can reveal correlation between individual-level variables and group residuals. We can account for this by including the average of the individual predictors as a covariate in the model for the group intercept.
$$\alpha_j = \gamma_0 + \gamma_1 u_j + \gamma_2 \bar{x} + \zeta_j$$These are broadly referred to as *contextual effects*.
# Create new variable for mean of floor across counties
xbar = srrs_mn.groupby('county')['floor'].mean().values
xbar.shape, u.shape
type(floor_idx)
with pm.Model(coords=coords, rng_seeder=RANDOM_SEED) as contextual_effect:
floor_idx = pm.Data("floor_idx", floor, mutable=True)
county_idx = pm.Data("county_idx", county, mutable=True)
# Priors
sigma_a = pm.HalfCauchy('sigma_a', 5)
# County uranium model for slope
gamma = pm.Normal('gamma', mu=0., sd=1e5, shape=3)
# Uranium model for intercept
mu_a = pm.Deterministic('mu_a', gamma[0] + gamma[1]*u + gamma[2]*xbar)
# County variation not explained by uranium
epsilon_a = pm.Normal('epsilon_a', mu=0, sd=1, dims='county')
mu = pm.Deterministic('mu', mu_a + sigma_a*epsilon_a)
# Common slope
beta = pm.Normal('beta', mu=0., sd=1e15)
# Model error
sigma_y = pm.Uniform('sigma_y', lower=0, upper=100)
# Expected value
y_hat = mu[county_idx] + beta * floor_idx
# Data likelihood
y_like = pm.Normal('y_like', mu=y_hat, sd=sigma_y, observed=log_radon)
pm.model_to_graphviz(contextual_effect)
with contextual_effect:
contextual_effect_trace = pm.sample(tune=2000)
az.plot_forest(contextual_effect_trace, var_names=['gamma'], combined=True, ess=True, r_hat=True);
az.summary(contextual_effect_trace, var_names=['gamma'])
So, we might infer from this that counties with higher proportions of houses without basements tend to have higher baseline levels of radon. Perhaps this is related to the soil type, which in turn might influence what type of structures are built.
Gelman (2006) used cross-validation tests to check the prediction error of the unpooled, pooled, and partially-pooled models
root mean squared cross-validation prediction errors:
There are two types of prediction that can be made in a multilevel model:
For example, if we wanted to make a prediction for a new house with no basement in St. Louis and Kanabec counties, we just need to sample from the radon model with the appropriate intercept.
That is,
$$\tilde{y}_i \sim N(\alpha_{69} + \beta (x_i=1), \sigma_y^2)$$Because we judiciously set the county index and floor values as shared variables earlier, we can modify them directly to the desired values (69 and 1 respectively) and sample corresponding posterior predictions, without having to redefine and recompile our model. Using the model just above:
prediction_coords = {"obs_id": ["ST LOUIS", "KANABEC"]}
with contextual_effect:
pm.set_data(
{"county_idx": np.array([69, 31]),
"floor_idx": np.array([1, 1])}
)
stl_pred = pm.sample_posterior_predictive(
contextual_effect_trace.posterior
)
contextual_effect_trace.extend(stl_pred)
contextual_effect_trace
az.plot_posterior(contextual_effect_trace, group='posterior_predictive');
How would we make a prediction from a new county (e.g. one not included in this dataset)?
# Write your answer here
Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models (1st ed.). Cambridge University Press.
Betancourt, M. J., & Girolami, M. (2013). Hamiltonian Monte Carlo for Hierarchical Models.
Gelman, A. (2006). Multilevel (Hierarchical) modeling: what it can and cannot do. Technometrics, 48(3), 432–435.