title: Monotonic Effects in PyMC3
tags: PyMC3, Bayesian Statistics, Papers
Last week I came across the following tweet from Paul Bürkner about a paper he coauthored on including ordinal predictors in Bayesian regression models, and I thought the approach was very clever.
Have you ever wondered how to handle ordinal predictors in your regression models? We propose a simple and intuitive method that is readily available via #brms and @mcmc_stan: https://t.co/dKg4AphvsG
— Paul Bürkner (@paulbuerkner) November 2, 2018
The code in the paper uses `brms` and Stan to illustrate these concepts. In this post I'll be replicating some of the paper's analysis in Python and PyMC3, mostly for my own edification.
%matplotlib inline
from itertools import product
import arviz as az
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
from rpy2.robjects import pandas2ri, r
import seaborn as sns
from theano import shared
sns.set()
The paper uses data from the `ordPens` R package, which we download and load into a Pandas `DataFrame`.
%%bash
if [[ ! -e data/ordPens ]]
then
    mkdir -p data
    wget -q -O data/ordPens_0.3-1.tar.gz https://cran.r-project.org/src/contrib/ordPens_0.3-1.tar.gz
    tar xzf data/ordPens_0.3-1.tar.gz -C data
fi
!ls data/ordPens/data/
ICFCoreSetCWP.RData
pandas2ri.activate()
r.load('data/ordPens/data/ICFCoreSetCWP.RData');
all_df = r['ICFCoreSetCWP']
all_df.head()
| | b1602 | b122 | b126 | b130 | b134 | b140 | b147 | b152 | b164 | b180 | ... | e450 | e455 | e460 | e465 | e570 | e575 | e580 | e590 | s770 | phcs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 1 | 2 | 1 | 1 | 0 | 0 | 2 | 0 | 0 | ... | 4 | 0 | 0 | 0 | 3 | 3 | 4 | 3 | 0 | 44.33 |
2 | 3 | 2 | 2 | 3 | 3 | 2 | 3 | 3 | 3 | 1 | ... | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 21.09 |
3 | 0 | 1 | 2 | 1 | 1 | 0 | 1 | 2 | 0 | 0 | ... | 4 | 0 | 0 | 0 | 3 | 3 | 4 | 0 | 0 | 41.74 |
4 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 2 | -1 | 0 | 0 | 2 | 2 | 1 | 1 | 33.96 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 46.29 |
5 rows × 68 columns
The variable of interest here is `phcs`, which is a subjective physical health score. The predictors we are interested in are `d450` and `d455`.
df = all_df[['d450', 'd455', 'phcs']]
df.head()
| | d450 | d455 | phcs |
|---|---|---|---|
1 | 0 | 2 | 44.33 |
2 | 3 | 3 | 21.09 |
3 | 0 | 2 | 41.74 |
4 | 3 | 2 | 33.96 |
5 | 0 | 0 | 46.29 |
These predictors are ratings on a five-point scale (0-4) of the patient's impairment while walking (`d450`) and moving around (`d455`). For more information on this data, consult the `ordPens` documentation.
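As a quick sanity check (my addition, not part of the paper's analysis), we can tabulate the observed levels of each predictor to confirm that they lie on the 0-4 scale.
# Not in the paper: count how often each level of the ordinal predictors occurs
for col in ['d450', 'd455']:
    print(df[col].value_counts().sort_index(), end='\n\n')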
The following plots show a fairly strong monotonic relationship between `d450`, `d455`, and `phcs`.
fig, (d450_ax, d455_ax) = plt.subplots(ncols=2, sharey=True, figsize=(16, 6))
sns.stripplot('d450', 'phcs', data=df,
jitter=0.1, color='C0', alpha=0.75,
ax=d450_ax);
sns.stripplot('d455', 'phcs', data=df,
jitter=0.1, color='C0', alpha=0.75,
ax=d455_ax);
d455_ax.set_ylabel("");
fig.tight_layout();
The big idea of the paper is to include monotonic effects due to these ordinal predictors as follows. A scalar $b \sim N(0, 10^2)$ parameterizes the overall strength and direction of the relationship, and a Dirichlet vector $\xi \sim \textrm{Dirichlet}(1, \ldots, 1)$ encodes how much of $b$ is gained at each level. The parameters $b$ and $\xi$ are combined into

$$\textrm{mo}(i) = b \sum_{k = 0}^i \xi_k,$$

which can be included as a term in a regression model. If $i < j$, then

$$\textrm{mo}(j) - \textrm{mo}(i) = b \sum_{k = i + 1}^j \xi_k,$$

which has the same sign as $b$ because the $\xi_k$ are nonnegative. The effect of this term is therefore monotonic (increasing when $b > 0$, decreasing when $b < 0$), as desired.
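Before building the PyMC3 version, a small NumPy sketch (with made-up values of $b$ and $\xi$, purely for illustration) shows how the cumulative sum of the Dirichlet weights yields a monotonic sequence of effects.
# Illustrative values only: b sets the direction/strength, ξ is a point on the simplex
b = -2.5
ξ = np.array([0.1, 0.4, 0.2, 0.25, 0.05])  # sums to one

mo = b * ξ.cumsum()  # mo(i) = b · Σ_{k ≤ i} ξ_k
mo                   # decreasing because b < 0; it would be increasing for b > 0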
The following function constructs this distribution in PyMC3.
def monotonic_prior(name, n_cat):
    # b controls the overall strength and direction of the effect
    b = pm.Normal(f'b_{name}', 0., 10.)
    # ξ gives the proportion of b gained at each level of the predictor
    ξ = pm.Dirichlet(f'ξ_{name}', np.ones(n_cat))
    # mo(i) = b · Σ_{k ≤ i} ξ_k, a monotonic sequence in i
    return pm.Deterministic(f'mo_{name}', b * ξ.cumsum())
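Wrapping the cumulative sum in `pm.Deterministic` means the transformed `mo_*` values are stored in the trace alongside `b` and `ξ`, which makes them easy to inspect after sampling.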
With this notation in hand, our model is
With this notation in hand, our model is

$$
\begin{align*}
    \mu_i   & = \beta_0 + \textrm{mo}_{\textrm{d450}}(j_i) + \textrm{mo}_{\textrm{d455}}(k_i) \\
    \beta_0 & \sim N(0, 10^2) \\
    y_i     & \sim N(\mu_i, \sigma^2) \\
    \sigma  & \sim \textrm{HalfNormal}(5^2),
\end{align*}
$$

where $j_i$ and $k_i$ are the levels of `d450` and `d455` for the $i$-th patient, respectively, and $y_i$ is that patient's `phcs` score.
We now express this model in PyMC3.
d450 = df['d450'].values
d450_cats = np.unique(d450)
d450_n_cat = d450_cats.size
d450_ = shared(d450)
d455 = df['d455'].values
d455_cats = np.unique(d455)
d455_n_cat = d455_cats.size
d455_ = shared(d455)
phcs = df['phcs'].values
with pm.Model() as model:
    β0 = pm.Normal('β0', 0., 10.)
    mo_d450 = monotonic_prior('d450', d450_n_cat)
    mo_d455 = monotonic_prior('d455', d455_n_cat)
    μ = β0 + mo_d450[d450_] + mo_d455[d455_]
    σ = pm.HalfNormal('σ', 5.)
    phcs_obs = pm.Normal('phcs', μ, σ, observed=phcs)
We now sample from the model's posterior distribution.
CHAINS = 3
SEED = 934520 # from random.org
SAMPLE_KWARGS = {
'draws': 1000,
'tune': 1000,
'chains': CHAINS,
'random_seed': list(SEED + np.arange(CHAINS))
}
with model:
    trace = pm.sample(**SAMPLE_KWARGS)
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (3 chains in 2 jobs) NUTS: [σ, ξ_d455, b_d455, ξ_d450, b_d450, β0] Sampling 3 chains: 100%|██████████| 6000/6000 [00:41<00:00, 145.59draws/s] The number of effective samples is smaller than 25% for some parameters.
We use arviz
to check the performance of our sampler.
inf_data = az.convert_to_inference_data(trace)
The energy plot, BFMI, and Gelman-Rubin statistics show no cause for concern.
az.plot_energy(inf_data);
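The BFMI of each chain can also be computed directly (an extra check I have added); values far below one, roughly under 0.2-0.3, would indicate a problem.
# Extra check: estimated Bayesian fraction of missing information per chain
az.bfmi(inf_data)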
az.gelman_rubin(inf_data).max()
<xarray.Dataset> Dimensions: () Data variables: β0 float64 1.0 b_d450 float64 1.0 b_d455 float64 1.0 ξ_d450 float64 1.0 mo_d450 float64 1.0 ξ_d455 float64 1.0 mo_d455 float64 1.0 σ float64 1.0
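Since the sampler warned that the number of effective samples is small for some parameters, it is also worth glancing at the per-parameter summaries (another check I have added); the effective sample size columns, whose names vary across ArviZ versions, should be large enough for stable posterior estimates.
# Extra check: posterior summaries, including effective sample sizes
az.summary(inf_data)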
We now sample from the model's posterior predictive distribution and visualize the results.
pp_d450, pp_d455 = np.asarray(list(zip(*product(d450_cats, d455_cats))))
d450_.set_value(pp_d450)
d455_.set_value(pp_d455)
with model:
    pp_trace = pm.sample_posterior_predictive(trace)
100%|██████████| 3000/3000 [00:07<00:00, 388.49it/s]
pp_df = pd.DataFrame({
'd450': pp_d450,
'd455': pp_d455,
'pp_phcs_mean': pp_trace['phcs'].mean(axis=0)
})
The important feature of this encoding of ordinal predictors is that the ξ parameters allow each level of the predictor to contribute a different amount to the overall effect. This is in contrast to the common practice of including ordinal predictors as linear terms, which forces the change between consecutive levels to be the same.
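One way to see this directly (my addition, not a figure from the paper) is to plot the posteriors of the ξ vectors; visibly unequal increments are exactly what a linear-in-the-level term cannot capture.
# Not in the paper: posteriors of the simplex weights for each predictor
az.plot_forest(inf_data, var_names=['ξ_d450', 'ξ_d455']);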
REF_CAT = 1
fig, (d450_ax, d455_ax) = plt.subplots(ncols=2, sharey=True, figsize=(16, 6))
(pp_df.pivot_table('pp_phcs_mean', 'd450', 'd455')
.plot(marker='o', ax=d450_ax));
d450_ax.set_xticks(d450_cats);
d450_ax.set_ylabel("Posterior predictive phcs");
(pp_df.pivot_table('pp_phcs_mean', 'd455', 'd450')
.plot(marker='o', ax=d455_ax));
d455_ax.set_xticks(d455_cats);
fig.tight_layout();
The following plot corresponds to Figure 3 in the original paper, and the dark lines agree with the mean in that figure quite well.
fig, (d450_ax, d455_ax) = plt.subplots(ncols=2, sharey=True, figsize=(16, 6))
(pp_df[pp_df['d455'] != REF_CAT]
.pivot_table('pp_phcs_mean', 'd450', 'd455')
.plot(marker='o', c='k', alpha=0.5, legend=False,
ax=d450_ax));
(pp_df[pp_df['d455'] == REF_CAT]
.plot('d450', 'pp_phcs_mean',
marker='o', c='k',
label=f"Refernce category (d455 = {REF_CAT})",
ax=d450_ax));
d450_ax.set_xticks(d450_cats);
d450_ax.set_ylabel("Posterior excpected phcs");
(pp_df[pp_df['d450'] != REF_CAT]
.pivot_table('pp_phcs_mean', 'd455', 'd450')
.plot(marker='o', c='k', alpha=0.5, legend=False,
ax=d455_ax));
(pp_df[pp_df['d450'] == REF_CAT]
.plot('d455', 'pp_phcs_mean',
marker='o', c='k',
label=f"Refernce category (d450 = {REF_CAT})",
ax=d455_ax));
d455_ax.set_xticks(d455_cats);
fig.tight_layout();
For reference, we compare this model to one that includes `d450` and `d455` as linear predictors, that is, $\mu_i = \beta_0 + \beta_{\textrm{d450}} \cdot j_i + \beta_{\textrm{d455}} \cdot k_i$, with normal priors on the slopes and the same priors on $\beta_0$ and $\sigma$ as before. After resetting the shared predictor variables to the observed data, we express this model in PyMC3.
d450_.set_value(d450)
d455_.set_value(d455)
with pm.Model() as linear_model:
    β0 = pm.Normal('β0', 0., 10.)
    β_d450 = pm.Normal('β_d450', 0., 10.)
    β_d455 = pm.Normal('β_d455', 0., 10.)
    μ = β0 + β_d450 * d450_ + β_d455 * d455_
    σ = pm.HalfNormal('σ', 5.)
    phcs_obs = pm.Normal('phcs', μ, σ, observed=phcs)
with linear_model:
    linear_trace = pm.sample(**SAMPLE_KWARGS)
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (3 chains in 2 jobs) NUTS: [σ, β_d455, β_d450, β0] Sampling 3 chains: 100%|██████████| 6000/6000 [00:07<00:00, 771.92draws/s]
As in the paper, we compare these models by stacking their posterior predictive distributions.
comp_df = (pm.compare({
model: trace,
linear_model: linear_trace
})
.rename({
0: "Paper",
1: "Linear"
}))
comp_df
| | WAIC | pWAIC | dWAIC | weight | SE | dSE | var_warn |
|---|---|---|---|---|---|---|---|
Paper | 2825.24 | 6.22 | 0 | 1 | 29.01 | 0 | 0 |
Linear | 2830.2 | 3.7 | 4.97 | 0 | 29.09 | 4.42 | 0 |
We see that the model from the paper has a lower WAIC and receives 100% of the weight, a strong sign that it is superior to the linear model.
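This comparison can also be shown graphically; the plot below is my addition, and the exact function name may vary across PyMC3 versions.
# Additional visualization of the WAIC comparison (API may differ by PyMC3 version)
pm.compareplot(comp_df);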
This post is available as a Jupyter notebook here.