Lecture 3: Geocentric Models
Lecture 4: Wiggly Orbits
Proposed problems and solutions, ported to Python from the original R code, for the exercises of the week.
%matplotlib inline
import pymc3 as pm
import numpy as np
import pandas as pd
import theano
from scipy import stats
from sklearn.metrics import mean_squared_error
import math
import matplotlib.pyplot as plt
import altair as alt
alt.data_transformers.enable('default', max_rows=None)
import arviz as az
import warnings
warnings.filterwarnings('ignore')
Example of building a prior manually (kind of a pain):
x = np.linspace(100, 250, 100)
plt.plot(x, stats.norm.pdf(x, 178, 20));  # prior for mu: Normal(178, 20)
x = np.linspace(-10, 60, 100)
plt.plot(x, stats.uniform.pdf(x, 0, 50));  # prior for sigma: Uniform(0, 50)
n_samples = 1000
sample_mu = stats.norm.rvs(loc=178, scale=20, size=n_samples)      # sample mu from its prior
sample_sigma = stats.uniform.rvs(loc=0, scale=50, size=n_samples)  # sample sigma from its prior
prior_h = stats.norm.rvs(loc=sample_mu, scale=sample_sigma)        # implied prior over heights
pm.kdeplot(prior_h)
plt.xlabel('heights', fontsize=14)
plt.yticks([]);
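For comparison, the same prior predictive simulation can be done inside a pymc3 model with `sample_prior_predictive`, which we will lean on for the rest of the notebook. A minimal sketch (the model and variable names here are mine, not from the course):

# Minimal sketch: the same prior predictive simulation via pymc3 instead of
# scipy. `prior_model` and `h` are hypothetical names of my choosing.
with pm.Model() as prior_model:
    mu = pm.Normal('mu', mu=178, sd=20)             # prior for the mean height
    sigma = pm.Uniform('sigma', lower=0, upper=50)  # prior for the spread
    h = pm.Normal('h', mu=mu, sd=sigma)             # implied height distribution
    prior_h_pm = pm.sample_prior_predictive(samples=1000)

pm.kdeplot(prior_h_pm['h'])
plt.xlabel('heights', fontsize=14)
plt.yticks([]);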
The weights listed below were recorded in the !Kung census, but heights were not recorded for these individuals. Provide predicted heights and 89% compatibility intervals for each of these individuals. That is, fill in the table below, using model-based predictions.
aux = pd.DataFrame(np.array([45, 40, 65, 31, 53])).reset_index().rename(columns={'index':'Individual',0:'weight'})
aux['expected height']=np.nan
aux['89% interval'] = np.nan
aux
|   | Individual | weight | expected height | 89% interval |
|---|---|---|---|---|
| 0 | 0 | 45 | NaN | NaN |
| 1 | 1 | 40 | NaN | NaN |
| 2 | 2 | 65 | NaN | NaN |
| 3 | 3 | 31 | NaN | NaN |
| 4 | 4 | 53 | NaN | NaN |
Let's get the data.
The weights that interest us are all adult weights, so we can analyze only the adults and make an okay linear approximation. If you did something else, that's okay. I deliberately made the question a little vague. Loading the data, selecting out adults, and doing the regression from the book:
d = pd.read_csv('../../data/Howell1.csv', sep=';', header=0)
d.head()
|   | height | weight | age | male |
|---|---|---|---|---|
| 0 | 151.765 | 47.825606 | 63.0 | 1 |
| 1 | 139.700 | 36.485807 | 63.0 | 0 |
| 2 | 136.525 | 31.864838 | 65.0 | 0 |
| 3 | 156.845 | 53.041915 | 41.0 | 1 |
| 4 | 145.415 | 41.276872 | 51.0 | 0 |
d2 = d[d.age >= 18].reset_index(drop=True)
d2 = d2.assign(weight_c=pd.Series(d2.weight - d2.weight.mean()))
This is the first of many times a model is defined in pymc3 in this course, so I'll take a moment to explain its general structure. First the model is defined and named, in this case as `model_1`. Then there are four basic sections within it:

- **Data**: where the data the model needs is defined. I'm taking advantage here of the `Data` container class recently included in pymc3 v3.7. This class wraps the theano shared variable class and lets the model be aware of its inputs and outputs. In later notebooks I usually define the data outside of the model because it saves some space when comparing models, but for an individual model I think this is just a bit nicer. It is very useful for making predictions with the model on data not used for fitting, something extremely common in Machine Learning ($X_{test}$ anyone?). In the course and the book this is usually called out-of-sample prediction, but in general Data Science we refer to it simply as measuring the accuracy of the model on the test set. More information on the `Data` class can be found here and here, and on out-of-sample (and in-sample) prediction here.
- **Priors**: where the priors are defined. These priors are the initial distributions our parameters evolve from. They are supposed to represent our knowledge about the system and the nature of reality and maths, heh. For instance, a standard deviation cannot be negative, so we initialize it as a Uniform distribution with lower bound zero. In the course this is mentioned again and again.
- **Regression**: where we define the actual structure of the model, in this case a linear regression (to estimate $\mu$), and where we also define the distribution of our output variable, in this case `height_hat`. I will be using the "hat" suffix in the output variable of almost all models to express that it is the model's estimate of the original variable, in this case `height`.
- **Prior, trace and posterior**: where the priors are sampled from and where the posteriors are estimated and sampled from. Most of the time we will need neither the prior nor the posterior predictions, only the parameters, which are stored in the trace. However, I think it's good policy to keep these lines in all models, to keep all the concepts in mind, and just uncomment them when we need them (like in this exercise).
with pm.Model() as model_1:
# Data
weight = pm.Data('weight', d2['weight_c'].values)
height = pm.Data('height', d2['height'].values)
# Priors
alpha = pm.Normal('alpha', mu=178, sd=20)
beta = pm.Normal('beta', mu=0, sd=10)
sigma = pm.Uniform('sigma', lower=0, upper=50)
# Regression
mu = alpha + beta * weight
height_hat = pm.Normal('height_hat', mu=mu, sd=sigma, observed=height)
# Prior sampling, trace definition and posterior sampling
prior = pm.sample_prior_predictive()
posterior_1 = pm.sample(draws=1000, tune=1000)
posterior_pred_1 = pm.sample_posterior_predictive(posterior_1)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, beta, alpha]
Sampling 4 chains, 0 divergences: 100%|██████████| 8000/8000 [00:02<00:00, 2787.39draws/s]
100%|██████████| 4000/4000 [00:54<00:00, 73.70it/s]
Let's get summaries and traces using both `az` and `pm`.
az.summary(posterior_1, credible_interval=.89).round(2)
#pm.summary(posterior_1, alpha=.11).round(2)# also possible
|   | mean | sd | mcse_mean | mcse_sd | hpd_5.5% | hpd_94.5% | ess_mean | ess_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| alpha | 154.6 | 0.28 | 0.0 | 0.0 | 154.17 | 155.04 | 6154.0 | 6152.0 | 6156.0 | 2962.0 | 1.0 |
| beta | 0.9 | 0.04 | 0.0 | 0.0 | 0.84 | 0.97 | 6599.0 | 6599.0 | 6655.0 | 2465.0 | 1.0 |
| sigma | 5.1 | 0.19 | 0.0 | 0.0 | 4.80 | 5.41 | 5620.0 | 5581.0 | 5667.0 | 3028.0 | 1.0 |
Both pymc3 and arviz have traceplot functions to see what the distribution of each parameter looks like. There are usually four curves per parameter, since four chains is the default when running the model.
pm.traceplot(posterior_1);
#az.plot_trace(posterior_1); #also possible
Let's get a predicted value for each of our original samples. This is not requested by the exercise, but I think it makes sense to first estimate how good our predictions are in-sample (on $X_{train}$) before moving on to out-of-sample predictions (on $X_{test}$).
posterior_pred_1['height_hat'].shape
(4000, 352)
We got 1000 draws (times 4 chains) of `height_hat` for each of the original samples. Let's just take the mean for each original sample and build a new column for it in the original dataset.
d2 = d2.assign(height_hat=np.mean(posterior_pred_1['height_hat'],axis=0))
plot = alt.Chart(d2)\
.mark_circle()\
.encode(
x=alt.X('height', title='height',scale=alt.Scale(domain=(135, 180))),
y=alt.Y('height_hat', title='height_hat',scale=alt.Scale(domain=(135, 180)))
)
plot
print(f'The RMSE in X_train is {round(math.sqrt(mean_squared_error(d2.height.values, d2.height_hat.values)),2)} cm.')
The RMSE in X_train is 5.07 cm.
I think that's a reasonably good RMSE. Let's now build the High Posterior Density interval for each original sample.
height_hat_hpd = pm.hpd(posterior_pred_1['height_hat'], alpha=0.11)
#height_hat_hpd = az.hpd(posterior_pred_1['height_hat'], credible_interval=.89) # is also good
hpdi = pd.DataFrame(height_hat_hpd).rename({0:'lower_hpdi',1:'upper_hpdi'}, axis=1)
d3 = pd.concat([d2,hpdi], axis = 1)
round(d3.head(5),2)
|   | height | weight | age | male | weight_c | height_hat | lower_hpdi | upper_hpdi |
|---|---|---|---|---|---|---|---|---|
| 0 | 151.76 | 47.83 | 63.0 | 1 | 2.84 | 157.03 | 148.86 | 165.17 |
| 1 | 139.70 | 36.49 | 63.0 | 0 | -8.50 | 146.79 | 139.01 | 155.29 |
| 2 | 136.52 | 31.86 | 65.0 | 0 | -13.13 | 142.63 | 134.62 | 151.08 |
| 3 | 156.84 | 53.04 | 41.0 | 1 | 8.05 | 162.00 | 153.85 | 170.50 |
| 4 | 145.42 | 41.28 | 51.0 | 0 | -3.71 | 151.41 | 143.28 | 159.62 |
Let's finally predict the values requested by the exercise. We had our original variables stored in `aux`.

In order to use the model to predict on unseen data, we need to swap the values stored in the `Data` containers within the model. I know, it's a bit convoluted, but you get used to it, heh. In this case we had two inputs, `weight` and `height`, but we only have information on `weight` for our new individuals, so we only update that variable. It is a bit odd, though, that the dependent variable, in this case `height`, can just remain holding the original training data and things still work fine. I guess `sample_posterior_predictive` does not use it; since the dimensions do not match (the training data has 352 individuals and the test data only 5), that's the only possible explanation.

I read here that some people define a fake holdout $y$ with the same dimensions as $X_{test}$, but at least now it seems unnecessary (check also this, where only the predictors are updated). Maybe someone can PR this and shed some light here.

In any case, the predictions are good and match perfectly those obtained in the original R solutions.
weight.set_value(aux.weight.values - np.mean(d2.weight))  # center the new weights with the training mean, as done for fitting
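As an aside, since pymc3 3.7 the same swap can also be done through `pm.set_data`, which targets the `Data` containers by name instead of touching the underlying theano shared variable directly. A sketch of the equivalent call (same centering as above):

# Equivalent to the weight.set_value(...) call above, via pm.set_data.
pm.set_data({'weight': aux.weight.values - np.mean(d2.weight)}, model=model_1)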
Once the data is swapped, we use `sample_posterior_predictive` again over the original trace and model.
posterior_pred_1 = pm.sample_posterior_predictive(trace = posterior_1, samples = 500, model = model_1)
100%|██████████| 500/500 [00:07<00:00, 64.27it/s]
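A quick sanity check (my addition, not in the original solution): the posterior predictive samples should now follow the shape of the new data, 500 draws for the 5 new individuals instead of the 352 training rows.

posterior_pred_1['height_hat'].shape  # expected: (500, 5)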
And we finally get the requested values for the new individuals.
aux['expected height'] = posterior_pred_1['height_hat'].mean(axis=0)
aux['89% interval'] = pd.Series(list(pm.hpd(posterior_pred_1['height_hat'], alpha=0.11)))
aux
|   | Individual | weight | expected height | 89% interval |
|---|---|---|---|---|
| 0 | 0 | 45 | 154.378392 | [147.10980069780555, 162.98875419930548] |
| 1 | 1 | 40 | 150.080918 | [142.40517323434042, 158.0363212364431] |
| 2 | 2 | 65 | 172.656271 | [165.08652880603088, 181.55263532157568] |
| 3 | 3 | 31 | 141.740800 | [133.28582732918625, 150.67856920275216] |
| 4 | 4 | 53 | 161.565233 | [151.87099317877485, 168.51969307816304] |
Model the relationship between `height` (cm) and the natural logarithm of weight (log-kg): `log(weight)`. Use the entire `Howell1` data frame, all 544 rows, adults and non-adults. Use any model type from Chapter 4 (from the book) that you think useful: an ordinary linear regression, a polynomial or a spline. Plot the posterior predictions against the raw data.
d2 = d
d2 = d2.assign(log_weight_c=pd.Series(np.log(d2.weight) - np.log(d2.weight).mean()))
with pm.Model() as model_2:
# Data
log_weight = pm.Data('log_weight', d2['log_weight_c'].values)
height = pm.Data('height', d2['height'].values)
# Priors
alpha = pm.Normal('alpha', mu=178, sd=20)
beta = pm.Normal('beta', mu=0, sd=1)
sigma = pm.Uniform('sigma', lower=0, upper=50)
# Regression
mu = alpha + beta * log_weight
height_hat = pm.Normal('height_hat', mu=mu, sd=sigma, observed=height)
# Prior sampling, trace definition and posterior sampling
prior = pm.sample_prior_predictive()
posterior_2 = pm.sample(draws=1000, tune=1000)
posterior_pred_2 = pm.sample_posterior_predictive(posterior_2)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, beta, alpha]
Sampling 4 chains, 0 divergences: 100%|██████████| 8000/8000 [00:04<00:00, 1901.01draws/s]
100%|██████████| 4000/4000 [01:06<00:00, 70.39it/s]
Let's get summaries and traces using both `az` and `pm`.
az.summary(posterior_2, credible_interval=.89).round(2)
# pm.summary(posterior_2, alpha=.11).round(2) # also possible
|   | mean | sd | mcse_mean | mcse_sd | hpd_5.5% | hpd_94.5% | ess_mean | ess_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| alpha | 138.32 | 0.79 | 0.02 | 0.01 | 137.16 | 139.68 | 2234.0 | 2234.0 | 2236.0 | 2656.0 | 1.0 |
| beta | 16.19 | 1.35 | 0.03 | 0.02 | 14.11 | 18.38 | 1921.0 | 1900.0 | 1939.0 | 1804.0 | 1.0 |
| sigma | 18.58 | 0.93 | 0.02 | 0.02 | 17.11 | 20.02 | 1923.0 | 1923.0 | 1930.0 | 1846.0 | 1.0 |
pm.traceplot(posterior_2);
# az.plot_trace(posterior_2); # also possible
d2 = d2.assign(height_hat=np.mean(posterior_pred_2['height_hat'],axis=0)) # just as the previous exercise, we recover the mean of height_hat for each sample
height_pred_hpd = pm.hpd(posterior_pred_2['height_hat'], alpha=0.11) # and the High Posterior Density interval for each sample as well
# az.hpd(posterior_pred_2['height_hat'], credible_interval=.89) is also good
Some harmless data wrangling and plotting:
hpdi = pd.DataFrame(height_pred_hpd).rename({0:'lower_hpdi',1:'upper_hpdi'}, axis=1)
d2 = pd.concat([d2,hpdi], axis = 1)
d2_plot = d2.melt(id_vars=['weight'],
value_vars=['height','height_hat'],
var_name='height_origin',
value_name='heights')
plot = alt.Chart(d2_plot.query('height_origin.isin(["height","height_hat"])'))\
.mark_point()\
.encode(
x=alt.X('weight', title='weight',scale=alt.Scale(domain=(0, 80))),
y=alt.Y('heights', title='heights',scale=alt.Scale(domain=(0, 180))),
color='height_origin'
)
plot_hpdi = alt.Chart(d2)\
.mark_area(opacity=0.2, color='orange')\
.encode(
x=alt.X('weight', title='weight',scale=alt.Scale(domain=(0, 80))),
y='lower_hpdi:Q',
y2='upper_hpdi:Q'
)
plot_2 = alt.Chart(d2)\
.mark_point(color='black')\
.encode(
x=alt.X('height', title='height',scale=alt.Scale(domain=(0, 180))),
y=alt.Y('height_hat', title='height_hat',scale=alt.Scale(domain=(0, 180)))
)
alt.hconcat(plot+plot_hpdi,plot_2)
Results are meh. I don't know why I get a worse fit than the R solution; since the R solution does not show its parameters, I cannot trace back the discrepancy, but the code looks fine to me. Let's see what happens when we fit a second-degree polynomial regression.
d2 = d
d2 = d2.assign(weight_c=pd.Series(d2.weight - d2.weight.mean())/d2.weight.std())
d2 = d2.assign(weight_c_2=pd.Series(d2.weight_c**2))
with pm.Model() as model_2_2:
# Data
weight_c = pm.Data('weight_c', d2['weight_c'].values)
weight_c_2 = pm.Data('weight_c_2', d2['weight_c_2'].values)
height = pm.Data('height', d2['height'].values)
# Priors
alpha = pm.Normal('alpha', mu=178, sd=20)
beta1 = pm.Lognormal('beta1', mu=0, sd=1) # note this is a lognormal, not a normal, just to follow the code of the book as close as possible (a normal works just as good)
beta2 = pm.Normal('beta2', mu=0, sd=1)
sigma = pm.Uniform('sigma', lower=0, upper=50)
# Regression
mu = alpha + beta1 * weight_c + beta2 * weight_c_2
height_hat = pm.Normal('height_hat', mu=mu, sd=sigma, observed=height)
# Prior sampling, trace definition and posterior sampling
prior = pm.sample_prior_predictive()
posterior_2_2 = pm.sample(draws=1000, tune=1000)
posterior_pred_2_2 = pm.sample_posterior_predictive(posterior_2_2)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, beta2, beta1, alpha]
Sampling 4 chains, 0 divergences: 100%|██████████| 8000/8000 [00:04<00:00, 1697.10draws/s]
100%|██████████| 4000/4000 [00:56<00:00, 70.83it/s]
Let's get summaries and traces using both `az` and `pm`.
az.summary(posterior_2_2, credible_interval=.89).round(2)
# pm.summary(posterior_2_2, alpha=.11).round(2) # also possible
|   | mean | sd | mcse_mean | mcse_sd | hpd_5.5% | hpd_94.5% | ess_mean | ess_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| alpha | 146.04 | 0.37 | 0.01 | 0.0 | 145.48 | 146.64 | 2650.0 | 2650.0 | 2647.0 | 3019.0 | 1.0 |
| beta2 | -7.79 | 0.27 | 0.00 | 0.0 | -8.21 | -7.36 | 2446.0 | 2445.0 | 2447.0 | 2542.0 | 1.0 |
| beta1 | 21.74 | 0.29 | 0.00 | 0.0 | 21.23 | 22.16 | 3142.0 | 3142.0 | 3155.0 | 3140.0 | 1.0 |
| sigma | 5.80 | 0.18 | 0.00 | 0.0 | 5.52 | 6.09 | 3414.0 | 3408.0 | 3422.0 | 2868.0 | 1.0 |
pm.traceplot(posterior_2_2);
#az.plot_trace(posterior_2_2); #also possible
d2 = d2.assign(height_hat=np.mean(posterior_pred_2_2['height_hat'],axis=0))
height_pred_hpd = pm.hpd(posterior_pred_2_2['height_hat'], alpha=0.11)
# az.hpd(posterior_pred_2_2['height_hat'], credible_interval=.89) is also good
hpdi = pd.DataFrame(height_pred_hpd).rename({0:'lower_hpdi',1:'upper_hpdi'}, axis=1)
d2 = pd.concat([d2,hpdi], axis = 1)
d2_plot = d2.melt(id_vars=['weight'],
value_vars=['height','height_hat'],
var_name='height_origin',
value_name='heights')
plot = alt.Chart(d2_plot.query('height_origin.isin(["height","height_hat"])'))\
.mark_point()\
.encode(
x=alt.X('weight', title='weight',scale=alt.Scale(domain=(0, 80))),
y=alt.Y('heights', title='heights',scale=alt.Scale(domain=(0, 180))),
color='height_origin'
)
plot_hpdi = alt.Chart(d2)\
.mark_area(opacity=0.2, color='orange')\
.encode(
x=alt.X('weight', title='weight',scale=alt.Scale(domain=(0, 80))),
y='lower_hpdi:Q',
y2='upper_hpdi:Q'
)
plot_2 = alt.Chart(d2)\
.mark_point(color='black')\
.encode(
x=alt.X('height', title='height',scale=alt.Scale(domain=(0, 180))),
y=alt.Y('height_hat', title='height_hat',scale=alt.Scale(domain=(0, 180)))
)
alt.hconcat(plot+plot_hpdi,plot_2)
Much better now.
Plot the prior predictive distribution for the polynomial regression model in Chapter 4 (just what we did in the previous exercise). You can modify the code that plots the linear regression prior predictive distribution. 20 or 30 parabolas from the prior should suffice to show where the prior probability resides. Can you modify the prior distributions of `alpha`, `beta1`, and `beta2` so that the prior predictions stay within the biologically reasonable outcome space? That is to say: do not try to fit the data by hand. But do try to keep the curves consistent with what you know about height and weight, before seeing these exact data.
We have already sampled the prior with the `sample_prior_predictive` function within the model. Super-easy syntax. Let's do it; 30 samples are plenty for plotting.
az.summary(prior, credible_interval=.89).round(2)
# using summary with the priors is just a bit ugly, a bunch of warnings trigger and NaNs are produced
arviz.stats.stats_utils - WARNING - Array contains NaN-value. (message repeated six times)
|   | mean | sd | mcse_mean | mcse_sd | hpd_5.5% | hpd_94.5% | ess_mean | ess_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| sigma_interval__ | 0.16 | 1.63 | NaN | NaN | -5.68 | NaN | NaN | NaN | NaN | NaN | NaN |
| sigma | 26.05 | 13.65 | NaN | NaN | 0.17 | NaN | NaN | NaN | NaN | NaN | NaN |
| beta1_log__ | -0.08 | 1.03 | NaN | NaN | -2.81 | NaN | NaN | NaN | NaN | NaN | NaN |
| beta1 | 1.57 | 2.06 | NaN | NaN | 0.06 | NaN | NaN | NaN | NaN | NaN | NaN |
| alpha | 178.34 | 20.12 | NaN | NaN | 120.19 | NaN | NaN | NaN | NaN | NaN | NaN |
| beta2 | -0.06 | 1.02 | NaN | NaN | -2.75 | NaN | NaN | NaN | NaN | NaN | NaN |
| height_hat | 178.31 | 35.66 | 0.88 | 0.63 | 122.61 | 234.7 | 1622.0 | 1622.0 | 1556.0 | 3843.0 | 1.22 |
Let's just recover the values from the priors manually and build a pandas dataframe.
df = pd.DataFrame()
df = df.assign(height_hat = pd.Series(prior['height_hat'].mean(axis=1)),
alpha = pd.Series(prior['alpha']),
sigma = pd.Series(prior['sigma']),
beta1 = pd.Series(prior['beta1']),
beta2 = pd.Series(prior['beta2']))
df.head()
|   | height_hat | alpha | sigma | beta1 | beta2 |
|---|---|---|---|---|---|
| 0 | 217.866044 | 219.399265 | 23.173812 | 3.352894 | -1.657478 |
| 1 | 178.677972 | 175.505789 | 46.415113 | 0.176993 | 0.567291 |
| 2 | 161.901267 | 160.900475 | 13.919769 | 1.652082 | 1.164084 |
| 3 | 198.518299 | 199.519911 | 34.265158 | 1.179330 | 1.502513 |
| 4 | 168.899595 | 167.592501 | 27.945360 | 0.739642 | -0.053964 |
df.mean() # it checks out with the az.summary results
height_hat    178.308150
alpha         178.337894
sigma          26.047420
beta1           1.571378
beta2          -0.057667
dtype: float64
Now some data wrangling.
priors = pd.DataFrame()
for i in range(30): # 30 parabolas is fine
priors['prior_'+str(i)] = df.loc[i,'alpha'] + df.loc[i,'beta1'] * d2['weight_c'] + df.loc[i,'beta2'] * d2['weight_c_2']
aux = pd.concat([d2.weight_c, priors], axis=1)
aux_plot = aux.melt(id_vars=['weight_c'],
value_vars=list(aux.columns)[1:],
var_name='prior',
value_name='height')
plot = alt.Chart(aux_plot)\
.mark_line()\
.encode(
x=alt.X('weight_c', title='weight_c'),
y=alt.Y('height', title='height'),
color=alt.Color('prior', legend=None)
)
plot.properties(title='beta1 ~ dlnorm(0, 1) , beta2 ~ dnorm(0, 1)')
Recall that the world's tallest person was 270 cm tall. The tallest person in the sample is about 180 cm. The prior curvature is not very strong; those parabolas hardly bend at all. We can increase the standard deviation on the `beta2` prior, but that produces some silly shapes (left below), where average weight is either the tallest or the shortest. That can't be right. The basic problem is that `beta2` needs to be negative to make the curve bend down, but `beta1` also has to change in order to move the maximum height to the right. It's all a bit confusing, and it is the key reason that working with polynomial models is so hard. The prior on the right below can only bend down, because I've made the linear model `a + beta1*weight_c - beta2*weight_c_2` and given `beta2` a Lognormal prior.
d2 = d
d2 = d2.assign(weight_c=pd.Series(d2.weight - d2.weight.mean())/d2.weight.std())
d2 = d2.assign(weight_c_2=pd.Series(d2.weight_c**2))
Please note I'm running three models in the same snippet for simplicity. The only differences are the priors.
with pm.Model() as model_3:
# Data
weight_c = pm.Data('weight_c', d2['weight_c'].values)
weight_c_2 = pm.Data('weight_c_2', d2['weight_c_2'].values)
height = pm.Data('height', d2['height'].values)
# Priors
alpha = pm.Normal('alpha', mu=178, sd=20)
beta1_1 = pm.Lognormal('beta1_1', mu=0, sd=1)
beta2_1 = pm.Normal('beta2_1', mu=0, sd=10) # increasing the sd
beta1_2 = pm.Lognormal('beta1_2', mu=2, sd=1)
beta2_2 = pm.Lognormal('beta2_2', mu=1, sd=1) # making beta2 a lognormal
beta1_3 = pm.Normal('beta1_3', mu=0, sd=100) # flat prior
beta2_3 = pm.Normal('beta2_3', mu=0, sd=100) # flat prior
sigma = pm.Uniform('sigma', lower=0, upper=50)
# Regressions
mu_1 = alpha + beta1_1 * weight_c + beta2_1 * weight_c_2
height_hat_1 = pm.Normal('height_hat_1', mu=mu_1, sd=sigma, observed=height)
mu_2 = alpha + beta1_2 * weight_c - beta2_2 * weight_c_2 # note the sign change
height_hat_2 = pm.Normal('height_hat_2', mu=mu_2, sd=sigma, observed=height)
mu_3 = alpha + beta1_3 * weight_c + beta2_3 * weight_c_2
height_hat_3 = pm.Normal('height_hat_3', mu=mu_3, sd=sigma, observed=height)
# Prior sampling, trace definition and posterior sampling
prior = pm.sample_prior_predictive()
posterior_3 = pm.sample()
posterior_pred_3 = pm.sample_posterior_predictive(posterior_3)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, beta2_3, beta1_3, beta2_2, beta1_2, beta2_1, beta1_1, alpha]
Sampling 4 chains, 0 divergences: 100%|██████████| 4000/4000 [00:04<00:00, 948.38draws/s]
100%|██████████| 2000/2000 [01:31<00:00, 21.90it/s]
df = pd.DataFrame()
df = df.assign(alpha = pd.Series(prior['alpha']),
sigma = pd.Series(prior['sigma']),
beta1_1 = pd.Series(prior['beta1_1']),
beta2_1 = pd.Series(prior['beta2_1']),
beta1_2 = pd.Series(prior['beta1_2']),
beta2_2 = pd.Series(prior['beta2_2']),
beta1_3 = pd.Series(prior['beta1_3']),
beta2_3 = pd.Series(prior['beta2_3']))
df.head()
|   | alpha | sigma | beta1_1 | beta2_1 | beta1_2 | beta2_2 | beta1_3 | beta2_3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 174.900998 | 7.264140 | 0.939067 | -11.344971 | 3.215034 | 0.980401 | -6.743351 | -15.686045 |
| 1 | 192.935702 | 8.193026 | 0.519352 | -3.573985 | 2.470305 | 2.705127 | 67.586147 | 56.931776 |
| 2 | 194.906673 | 13.476760 | 1.030064 | -5.529674 | 6.735777 | 3.593835 | 205.460102 | 19.219400 |
| 3 | 191.899950 | 1.687804 | 6.197893 | -16.481562 | 28.411498 | 15.928632 | 2.799394 | 57.436045 |
| 4 | 182.414696 | 28.177356 | 0.581860 | 6.988846 | 22.424196 | 3.230487 | -72.042675 | 238.269823 |
Now some data wrangling.
priors_1 = pd.DataFrame()
priors_2 = pd.DataFrame()
priors_3 = pd.DataFrame()
for i in range(30): # 30 parabolas is fine
priors_1['prior_'+str(i)] = df.loc[i,'alpha'] + df.loc[i,'beta1_1'] * d2['weight_c'] + df.loc[i,'beta2_1'] * d2['weight_c_2']
priors_2['prior_'+str(i)] = df.loc[i,'alpha'] + df.loc[i,'beta1_2'] * d2['weight_c'] - df.loc[i,'beta2_2'] * d2['weight_c_2'] # note the sign change
priors_3['prior_'+str(i)] = df.loc[i,'alpha'] + df.loc[i,'beta1_3'] * d2['weight_c'] + df.loc[i,'beta2_3'] * d2['weight_c_2']
aux_1 = pd.concat([d2.weight_c, priors_1], axis=1)
aux_2 = pd.concat([d2.weight_c, priors_2], axis=1)
aux_3 = pd.concat([d2.weight_c, priors_3], axis=1)
aux_1_plot = aux_1.melt(id_vars=['weight_c'],
value_vars=list(aux_1.columns)[1:],
var_name='prior',
value_name='height')
aux_2_plot = aux_2.melt(id_vars=['weight_c'],
value_vars=list(aux_2.columns)[1:],
var_name='prior',
value_name='height')
aux_3_plot = aux_3.melt(id_vars=['weight_c'],
value_vars=list(aux_3.columns)[1:],
var_name='prior',
value_name='height')
plot_1 = alt.Chart(aux_1_plot)\
.mark_line(clip=True)\
.encode(
x=alt.X('weight_c', title='weight_c', scale=alt.Scale(domain=(-2, 2))),
y=alt.Y('height', title='height', scale=alt.Scale(domain=(50, 250))),
color=alt.Color('prior', legend=None)
)
plot_2 = alt.Chart(aux_2_plot)\
.mark_line(clip=True)\
.encode(
x=alt.X('weight_c', title='weight_c', scale=alt.Scale(domain=(-2, 2))),
y=alt.Y('height', title='height', scale=alt.Scale(domain=(50, 250))),
color=alt.Color('prior', legend=None)
)
alt.hconcat(plot_1.properties(title='beta1 ~ dlnorm(0, 1) , beta2 ~ dnorm(0, 1)'),
plot_2.properties(title='beta1 ~ dlnorm(2, 1) , beta2 ~ dlnorm(1, 1)'))
A key problem in getting reasonable curves here is that, obviously, `alpha`, `beta1` and `beta2` are correlated in the family of reasonable curves. But the priors are uncorrelated; they are independent of one another (see the sketch at the end of this section for one way such a correlation could be encoded). Still, if you can get independent priors to at least live within some reasonable space of outcome values, that's a lot better than flat priors. What would flat priors look like here? Something like this:
plot_3 = alt.Chart(aux_3_plot)\
.mark_line(clip=True)\
.encode(
x=alt.X('weight_c', title='weight_c', scale=alt.Scale(domain=(-2, 2))),
y=alt.Y('height', title='height', scale=alt.Scale(domain=(50, 250))),
color=alt.Color('prior', legend=None)
)
plot_3.properties(title='beta1 ~ dnorm(0, 100) , beta2 ~ dnorm(0, 100)')
These prior curves actually strongly favor explosive growth or shrinkage near the mean. This is a general phenomenon with “flat” priors: Once the predictor is at all complicated, “flat” does not imply “no relationship.” Do any of the priors above make a difference for inference in this sample? No. There is a lot of data and the model is quite simple, in terms of the way that parameters relate to predictions. This will not always be the case.
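As a closing aside on the correlation point above: one way to encode dependence between `alpha`, `beta1` and `beta2` would be a joint multivariate normal prior. Here is a purely hypothetical sketch, with means and a covariance matrix of my own choosing (a negative `beta1`/`beta2` covariance so that steeper lines tend to come with stronger downward bends); it is not from the course or the book.

# Hypothetical sketch: a correlated joint prior over (alpha, beta1, beta2).
# All numbers below are my assumptions, for illustration only.
import numpy as np
import pymc3 as pm

prior_mu = np.array([178.0, 15.0, -5.0])
prior_cov = np.array([[400.0,   0.0,   0.0],   # alpha independent
                      [  0.0,  25.0, -10.0],   # beta1 and beta2
                      [  0.0, -10.0,  25.0]])  # negatively correlated

with pm.Model() as model_corr:
    coefs = pm.MvNormal('coefs', mu=prior_mu, cov=prior_cov, shape=3)
    sigma = pm.Uniform('sigma', lower=0, upper=50)
    mu = coefs[0] + coefs[1] * d2['weight_c'].values + coefs[2] * d2['weight_c_2'].values
    height_hat = pm.Normal('height_hat', mu=mu, sd=sigma, observed=d2['height'].values)
    prior_corr = pm.sample_prior_predictive(samples=30)  # 30 curves, as before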