Lecture 19: Gaussian Processes
Lecture 20: Missing Values and Measurement Error
Proposed problems, with solutions worked in Python (PyMC3), for the exercises of the week.
%matplotlib inline
import pymc3 as pm
import numpy as np
import pandas as pd
import theano
import theano.tensor as tt
from scipy import stats
from scipy.special import expit as logistic
from sklearn import preprocessing
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
import altair as alt
alt.data_transformers.enable('default', max_rows=None)
import arviz as az
import warnings
warnings.filterwarnings('ignore')
Consider the relationship between brain volume (brain) and body mass (body) in data(Primates301). These values are presented as single values for each species. However, there is always a range of sizes within a species, and some of these measurements are taken from very small samples. So these values are measured with some unknown error.
We don't have the raw measurements to work with—that would be best. But we can imagine what might happen if we had them. Suppose error is proportional to the measurement. This makes sense, because larger animals have larger variation. As a consequence, the uncertainty is not uniform across the values and this could mean trouble.
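A quick numpy sketch (with made-up sizes, not the primate data) of why proportional error makes the uncertainty non-uniform:

```python
import numpy as np

np.random.seed(0)

# hypothetical true sizes spanning two orders of magnitude
true_size = np.array([0.01, 0.1, 1.0])

# measurement error proportional to the value: sd = 10% of each size
obs = true_size + np.random.normal(0.0, 1.0, size=(10000, 3)) * (0.1 * true_size)

# the absolute spread of the measurements grows with the size itself
spreads = obs.std(axis=0)
print(spreads)  # roughly [0.001, 0.01, 0.1]
```

The error is uniform on the relative scale but very different on the absolute scale, which is exactly the situation the regression has to cope with.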
Let's make up some standard errors for these measurements, to see what might happen. Load the data and scale the measurements so the maximum is 1 in both cases:
d = pd.read_csv('../../data/Primates301.csv', header=0, sep=';')
d.dropna(subset=['brain','body'], inplace=True)
d['B'] = d['brain']/np.max(d['brain'])
d['M'] = d['body']/np.max(d['body'])
d['Bse'] = d['B']*0.1
d['Mse'] = d['M']*0.1
d.head()
index | name | genus | species | subspecies | spp_id | genus_id | social_learning | research_effort | brain | body | group_size | gestation | weaning | longevity | sex_maturity | maternal_investment | B | M | Bse | Mse
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Allenopithecus_nigroviridis | Allenopithecus | nigroviridis | NaN | 1 | 1 | 0.0 | 6.0 | 58.02 | 4655.0 | 40.0 | NaN | 106.15 | 276.0 | NaN | NaN | 0.118102 | 0.035808 | 0.011810 | 0.003581 |
2 | Alouatta_belzebul | Alouatta | belzebul | NaN | 3 | 3 | 0.0 | 15.0 | 52.84 | 6395.0 | 7.4 | NaN | NaN | NaN | NaN | NaN | 0.107558 | 0.049192 | 0.010756 | 0.004919 |
3 | Alouatta_caraya | Alouatta | caraya | NaN | 4 | 3 | 0.0 | 45.0 | 52.63 | 5383.0 | 8.9 | 185.92 | 323.16 | 243.6 | 1276.72 | 509.08 | 0.107130 | 0.041408 | 0.010713 | 0.004141 |
4 | Alouatta_guariba | Alouatta | guariba | NaN | 5 | 3 | 0.0 | 37.0 | 51.70 | 5175.0 | 7.4 | NaN | NaN | NaN | NaN | NaN | 0.105237 | 0.039808 | 0.010524 | 0.003981 |
5 | Alouatta_palliata | Alouatta | palliata | NaN | 6 | 3 | 3.0 | 79.0 | 49.88 | 6250.0 | 13.1 | 185.42 | 495.60 | 300.0 | 1578.42 | 681.02 | 0.101533 | 0.048077 | 0.010153 | 0.004808 |
I made up the standard errors for B and M above, assuming the error is 10% of each measurement. Ignoring measurement error, the PyMC3 translation of the corresponding ulam model is:
Bse = theano.shared(np.array(d['Bse']))
M = theano.shared(np.array(d['M']))
Mse = theano.shared(np.array(d['Mse']))
B = theano.shared(np.array(d['B']))
with pm.Model() as model_11:
    # Priors
    a = pm.Normal('a', mu=0, sd=1)
    b = pm.Normal('b', mu=0, sd=0.5)
    sigma = pm.Exponential('sigma', lam=1)
    # Regression
    mu = a + b*tt.log(M)
    brain_hat = pm.Lognormal('brain_hat', mu, sigma, observed=B)
    # Prior sampling, trace definition and posterior sampling
    #prior = pm.sample_prior_predictive()
    posterior_11 = pm.sample(tune=3000)
    #posterior_pred_11 = pm.sample_posterior_predictive(posterior_11)
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (4 chains in 4 jobs) NUTS: [sigma, b, a] Sampling 4 chains, 0 divergences: 100%|██████████| 14000/14000 [00:09<00:00, 1439.68draws/s]
Your job is to add the measurement errors to this model. Use the divorce/marriage example in the chapter as a guide. It might help to initialize the unobserved true values of B and M using the observed values, by adding a start list to ulam (in PyMC3 this is done by passing a start dictionary to the pm.sample function).
Running the model without measurement error (the code was given in the assignment), we get:
az.summary(posterior_11, var_names=['a','b', 'sigma'], credible_interval=.89).round(2)
param | mean | sd | mcse_mean | mcse_sd | hpd_5.5% | hpd_94.5% | ess_mean | ess_sd | ess_bulk | ess_tail | r_hat
---|---|---|---|---|---|---|---|---|---|---|---|
a | 0.42 | 0.06 | 0.0 | 0.0 | 0.34 | 0.52 | 770.0 | 770.0 | 765.0 | 936.0 | 1.00 |
b | 0.78 | 0.01 | 0.0 | 0.0 | 0.76 | 0.80 | 819.0 | 819.0 | 818.0 | 1011.0 | 1.01 |
sigma | 0.29 | 0.02 | 0.0 | 0.0 | 0.26 | 0.32 | 1021.0 | 1005.0 | 1033.0 | 859.0 | 1.01 |
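As a sanity check on what this fit implies: with a lognormal likelihood and mu = a + b*log(M), the median brain volume is a power law in body mass, exp(a) * M^b, which is the curve form used in the plotting code further down. Plugging in the posterior means from the summary above (illustrative arithmetic only):

```python
import numpy as np

# posterior means from the summary above
a, b = 0.42, 0.78

# the median of Lognormal(mu, sigma) is exp(mu), so with mu = a + b*log(M)
M = 0.5
median_B = np.exp(a + b * np.log(M))

# identical to the power-law form exp(a) * M**b
assert np.isclose(median_B, np.exp(a) * M**b)
print(median_B)  # about 0.89
```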
We'll plot this implied relationship in a moment, after fitting the measurement error model too. To build the measurement error model, all we really need to do is add the observation process. This means that the observed values arise from their own distribution, each having a true value as the mean. We use these unknown true values in the regression. It looks like this:
N = len(d)
with pm.Model() as model_12:
    # Priors
    sigma = pm.Exponential('sigma', lam=1)
    a = pm.Normal('a', 0., 1.)
    b = pm.Normal('b', 0., 1.)
    # Regression on the unobserved true values
    M_true = pm.Normal('M_true', mu=0.5, sd=1, shape=N)
    mu = a + b*tt.log(M_true)
    B_true = pm.Lognormal('B_true', mu, sigma, shape=N)
    # Observation process: observed values centered on the true values
    B_obs = pm.Normal('B_obs', B_true, Bse, observed=B)
    M_obs = pm.Normal('M_obs', M_true, Mse, observed=M)
    # Start values, additional kwargs for NUTS, and posterior sampling
    start = dict(B_true=d.B.values, M_true=d.M.values)
    posterior_12 = pm.sample(1000,
                             tune=1000,
                             start=start,
                             nuts_kwargs=dict(target_accept=.95))
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (4 chains in 4 jobs) NUTS: [B_true, M_true, b, a, sigma] Sampling 4 chains, 0 divergences: 100%|██████████| 8000/8000 [00:31<00:00, 251.14draws/s]
az.summary(posterior_12, var_names=['a','b', 'sigma'], credible_interval=.89).round(2)
param | mean | sd | mcse_mean | mcse_sd | hpd_5.5% | hpd_94.5% | ess_mean | ess_sd | ess_bulk | ess_tail | r_hat
---|---|---|---|---|---|---|---|---|---|---|---|
a | 0.42 | 0.06 | 0.0 | 0.0 | 0.32 | 0.51 | 4743.0 | 4679.0 | 4774.0 | 2964.0 | 1.0 |
b | 0.78 | 0.01 | 0.0 | 0.0 | 0.76 | 0.81 | 4753.0 | 4752.0 | 4754.0 | 3002.0 | 1.0 |
sigma | 0.26 | 0.02 | 0.0 | 0.0 | 0.23 | 0.29 | 4990.0 | 4918.0 | 5002.0 | 3137.0 | 1.0 |
For now, notice that the posterior distributions of a and b are nearly identical to those from model_11. Adding measurement error hasn't changed a thing! Plotting the regression against the observed values:
d['B_true'] = np.mean(posterior_12['B_true'], axis=0)
d['M_true'] = np.mean(posterior_12['M_true'], axis=0)
aux = pd.DataFrame()
aux['x'] = np.arange(len(d))/len(d)
aux['avg_curve'] = np.exp(posterior_12['a'].mean())*aux['x']**(posterior_12['b'].mean())
for i in range(50):
    aux[f'curve_{i}'] = np.exp(posterior_12['a'][i])*aux['x']**(posterior_12['b'][i])
aux_plot = aux.melt(id_vars=['x'],
value_vars=aux.columns[2:].tolist(),
var_name='posteriors',
value_name='y')
plot = alt.Chart(d)\
.mark_circle()\
.encode(
x=alt.X('M',title='body mass',scale=alt.Scale(domain=(0, 1))),
y=alt.Y('B',title='brain volume',scale=alt.Scale(domain=(0, 1)))
)
plot_2 = alt.Chart(aux_plot)\
.mark_line(clip=True)\
.encode(
x=alt.X('x',scale=alt.Scale(domain=(0, 1))),
y=alt.Y('y',scale=alt.Scale(domain=(0, 1))),
color = alt.Color('posteriors', legend=None)
)
plot_3 = alt.Chart(d)\
.mark_point()\
.encode(
x=alt.X('M_true',scale=alt.Scale(domain=(0, 1))),
y=alt.Y('B_true',scale=alt.Scale(domain=(0, 1)))
)
plot_4 = alt.Chart(aux)\
.mark_line(clip=True, color='black')\
.encode(
x=alt.X('x',scale=alt.Scale(domain=(0, 1))),
y=alt.Y('avg_curve',scale=alt.Scale(domain=(0, 1)))
)
alt.hconcat(plot + plot_2, plot + plot_3 + plot_4)
(On the left chart) The two points in the upper right are gorillas. Most primates are small, and obviously gorillas have something special going on.
(On the right chart) Now let's plot the estimated values on this. The open points are the posterior mean estimates. Notice that they have moved towards the regression line, as you'd expect. But even the outlier gorillas haven't moved much. The assumed error just isn't big enough to get them any closer. If you increase the amount of error, you can get all of the species to fall right on the regression line. Try for example 30% error. The model will mix poorly, but take a look at the inferred true values. The truth of this example is that there are just so many small primates that they dominate the relationship. And their measurement errors are also smaller (in absolute terms). So adding plausible amounts of measurement error here doesn't make a big difference. We still don't have a good explanation for gorillas.
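Why bigger assumed error moves the points more: in the simpler normal-normal case, the posterior mean of a true value is a precision-weighted average of the observation and the regression prediction. This is only a sketch of the intuition with illustrative numbers (the actual model above is lognormal), but it shows the mechanics:

```python
def shrunk_mean(b_obs, se, mu_reg, sigma):
    # precision-weighted average of the observation and the regression prediction
    w_obs, w_reg = 1 / se**2, 1 / sigma**2
    return (w_obs * b_obs + w_reg * mu_reg) / (w_obs + w_reg)

# a gorilla-like outlier sitting well above the regression line (made-up numbers)
b_obs, mu_reg, sigma = 1.0, 0.5, 0.29

print(shrunk_mean(b_obs, 0.10 * b_obs, mu_reg, sigma))  # 10% error: barely moves
print(shrunk_mean(b_obs, 0.30 * b_obs, mu_reg, sigma))  # 30% error: pulled toward the line
```

With 10% error the observation dominates; tripling the error makes the regression prior carry comparable weight, which is why 30% error drags even the gorillas toward the line.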
Before moving on, I'll also plot the estimated species values with compatibility ellipses. (Note: the hpd call below uses alpha=0.1, so the ellipses span 90% intervals rather than 50%.)
aux_ellipse = pd.concat([pd.DataFrame(pm.stats.hpd(posterior_12['B_true'], alpha=0.1)).rename({0:'B_true_lower_hpdi',1:'B_true_upper_hpdi'}, axis=1),
pd.DataFrame(pm.stats.hpd(posterior_12['M_true'], alpha=0.1)).rename({0:'M_true_lower_hpdi',1:'M_true_upper_hpdi'}, axis=1),
pd.DataFrame(np.mean(posterior_12['B_true'], axis=0)).rename({0:'B_true_avg'}, axis=1),
pd.DataFrame(np.mean(posterior_12['M_true'], axis=0)).rename({0:'M_true_avg'}, axis=1)],
axis=1)
aux_ellipse['B_true_hpdi'] = abs(aux_ellipse.B_true_lower_hpdi - aux_ellipse.B_true_upper_hpdi)
aux_ellipse['M_true_hpdi'] = abs(aux_ellipse.M_true_lower_hpdi - aux_ellipse.M_true_upper_hpdi)
ells = [Ellipse(xy=aux_ellipse.loc[i,['M_true_avg','B_true_avg']].values,
width=aux_ellipse.loc[i,'M_true_hpdi'],
height=aux_ellipse.loc[i,'B_true_hpdi'])
for i in range(len(aux_ellipse))]
fig = plt.figure(0, figsize=(8, 6))
ax = fig.add_subplot(111, aspect='equal')
ax.set_ylabel('brain volume')
ax.set_xlabel('body mass')
for e in ells:
    ax.add_artist(e)
    e.set_alpha(0.1)
ax.set_xlim(0, 1.2)
ax.set_ylim(0, 1.2)
plt.show()
Now consider missing values; this data set is lousy with them. You can ignore measurement error in this problem. Let's get a quick idea of the missing values by counting them in each variable:
d = pd.read_csv('../../data/Primates301.csv', header=0, sep=';')
d.isna().sum()
name 0 genus 0 species 0 subspecies 267 spp_id 0 genus_id 0 social_learning 98 research_effort 115 brain 117 body 63 group_size 114 gestation 161 weaning 185 longevity 181 sex_maturity 194 maternal_investment 197 dtype: int64
We’ll continue to focus on just brain and body, to stave off insanity. Consider only those species with measured body masses:
d.dropna(subset=['body'], inplace=True)
d['B'] = d['brain']/np.max(d['brain'])
d['M'] = d['body']/np.max(d['body'])
d.isna().sum()
name 0 genus 0 species 0 subspecies 223 spp_id 0 genus_id 0 social_learning 39 research_effort 53 brain 56 body 0 group_size 52 gestation 98 weaning 122 longevity 118 sex_maturity 131 maternal_investment 134 B 56 M 0 dtype: int64
First, consider whether there is a pattern to the missing values. Does it look like missing values are associated with particular values of body mass? Draw a DAG that represents how missingness works in this case. Which type (MCAR, MAR, MNAR) is this?
Compare the inferences to an analysis that drops all the missing values. Has anything changed? Why or why not? Hint: Consider the density of data in the ranges where there are missing values. You might want to plot the imputed brain sizes together with the observed values.
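One quick way to look for such a pattern is to compare body mass between species with and without a brain measurement. A toy sketch of the check (made-up numbers mimicking the pattern, not the real data frame):

```python
import numpy as np
import pandas as pd

# toy frame in which brain (B) is missing mostly for small-bodied species
toy = pd.DataFrame({
    'M': [0.01, 0.02, 0.03, 0.20, 0.50, 1.00],
    'B': [np.nan, np.nan, 0.02, 0.15, 0.40, 1.00],
})

# median body mass, split by whether the brain value is missing
check = toy.groupby(toy['B'].isna())['M'].median()
print(check)  # the True (missing) group has much smaller body mass
```

Running the same groupby on the real d shows the missing-brain species sitting at the small end of the body mass range.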
Looks like the missing brain values are almost all for small-bodied species. This implies at least a MAR mechanism ("dog eats homework conditional on something observed," here body mass). Let's express it with a DAG: M is body mass, B (unobserved, which the DAG convention marks with a circle) is brain size, R_B is the missingness mechanism, and B* is the observed brain sizes (with missing values). The arrow from M to R_B indicates that body size influences missingness. In this case, it would imply that small body size makes a missing brain value more likely.
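To make the MAR idea concrete, here is a small simulation (all numbers made up) in which R_B depends only on the observed M:

```python
import numpy as np

np.random.seed(2)

N = 1000
M = np.random.uniform(0.01, 1.0, N)                               # body mass, fully observed
B = np.exp(0.4 + 0.8 * np.log(M) + np.random.normal(0, 0.3, N))   # true brain size

# missingness mechanism R_B: depends only on the observed M (MAR)
p_missing = np.where(M < 0.1, 0.6, 0.05)
R_B = np.random.uniform(size=N) < p_missing
B_star = np.where(R_B, np.nan, B)                                 # B* = observed brain sizes

# missing brain values concentrate among the small-bodied species
print(M[R_B].mean(), M[~R_B].mean())
```

Because the missingness depends only on a fully observed variable, conditioning on M in the model is enough to correct for it, which is why imputation is straightforward here.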
Now let's do some imputation. Remember that the model for imputation is really no different than an ordinary model. It just needs a prior for any variable with missing values. In this case, the missing values are in the outcome, so the likelihood is the prior we need. So the model doesn't change at all.
# PyMC3 handles missing values quite naturally: pass a masked array as
# `observed`, and the masked entries become parameters to be imputed.
# (Despite its name, `body` holds the scaled brain values B here.)
body = d['B'].values.copy()
mask = np.isnan(body)
# equivalent alternative via a sentinel value:
#body[mask] = -999
#body = np.ma.masked_values(body, value=-999)
body = np.ma.masked_array(data=body, mask=mask)
M = theano.shared(np.array(d['M']))
B = theano.shared(np.array(d['B']))
with pm.Model() as model_21:
    # Priors
    a = pm.Normal('a', mu=0, sd=1)
    b = pm.Normal('b', mu=0, sd=1)
    sigma = pm.Exponential('sigma', lam=1)
    # Regression
    mu = a + b*tt.log(M)
    # note that `body` contains masked values, which PyMC3 imputes automatically
    brain_hat = pm.Lognormal('brain_hat', mu, sigma, observed=body)
    # Prior sampling, trace definition and posterior sampling
    #prior = pm.sample_prior_predictive()
    posterior_21 = pm.sample(tune=2000)
    #posterior_pred_21 = pm.sample_posterior_predictive(posterior_21)
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (4 chains in 4 jobs) NUTS: [brain_hat_missing, sigma, b, a] Sampling 4 chains, 0 divergences: 100%|██████████| 10000/10000 [01:25<00:00, 117.63draws/s]
Here is the posterior summary for the imputation model:
az.summary(posterior_21, var_names=['a','b', 'sigma'], credible_interval=.89).round(2)
param | mean | sd | mcse_mean | mcse_sd | hpd_5.5% | hpd_94.5% | ess_mean | ess_sd | ess_bulk | ess_tail | r_hat
---|---|---|---|---|---|---|---|---|---|---|---|
a | 0.43 | 0.06 | 0.0 | 0.0 | 0.33 | 0.51 | 1333.0 | 1328.0 | 1332.0 | 1340.0 | 1.0 |
b | 0.78 | 0.01 | 0.0 | 0.0 | 0.76 | 0.81 | 1140.0 | 1137.0 | 1140.0 | 1108.0 | 1.0 |
sigma | 0.29 | 0.02 | 0.0 | 0.0 | 0.27 | 0.32 | 1912.0 | 1902.0 | 1899.0 | 1321.0 | 1.0 |
The same analysis on only complete cases:
d = pd.read_csv('../../data/Primates301.csv', header=0, sep=';')
d.dropna(subset=['body', 'brain'], inplace=True)
d['B'] = d['brain']/np.max(d['brain'])
d['M'] = d['body']/np.max(d['body'])
M = theano.shared(np.array(d['M']))
B = theano.shared(np.array(d['B']))
with pm.Model() as model_22:
    # Priors
    a = pm.Normal('a', mu=0, sd=1)
    b = pm.Normal('b', mu=0, sd=1)
    sigma = pm.Exponential('sigma', lam=1)
    # Regression
    mu = a + b*tt.log(M)
    brain_hat = pm.Lognormal('brain_hat', mu, sigma, observed=B)
    # Prior sampling, trace definition and posterior sampling
    #prior = pm.sample_prior_predictive()
    posterior_22 = pm.sample(tune=2000)
    #posterior_pred_22 = pm.sample_posterior_predictive(posterior_22)
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (4 chains in 4 jobs) NUTS: [sigma, b, a] Sampling 4 chains, 0 divergences: 100%|██████████| 10000/10000 [00:06<00:00, 1527.08draws/s]
az.summary(posterior_22, var_names=['a','b', 'sigma'], credible_interval=.89).round(2)
param | mean | sd | mcse_mean | mcse_sd | hpd_5.5% | hpd_94.5% | ess_mean | ess_sd | ess_bulk | ess_tail | r_hat
---|---|---|---|---|---|---|---|---|---|---|---|
a | 0.43 | 0.06 | 0.0 | 0.0 | 0.34 | 0.52 | 773.0 | 762.0 | 774.0 | 899.0 | 1.00 |
b | 0.78 | 0.01 | 0.0 | 0.0 | 0.76 | 0.80 | 751.0 | 749.0 | 751.0 | 978.0 | 1.00 |
sigma | 0.29 | 0.02 | 0.0 | 0.0 | 0.27 | 0.31 | 1166.0 | 1166.0 | 1137.0 | 870.0 | 1.01 |
Let's add a quick model comparison:
model_21.name = 'model_21'
model_22.name = 'model_22'
pm.compare({model_21: posterior_21,
            model_22: posterior_22})
model | WAIC | pWAIC | dWAIC | weight | SE | dSE | var_warn
---|---|---|---|---|---|---|---|
model_22 | -869.7 | 2.46 | 0 | 1 | 37.81 | 0 | 0 |
model_21 | -869.41 | 2.59 | 0.29 | 0 | 37.81 | 0.03 | 0 |
Really no difference from before. Let's plot the imputed values:
d = pd.read_csv('../../data/Primates301.csv', header=0, sep=';')
d.dropna(subset=['body'], inplace=True)
d['B'] = d['brain']/np.max(d['brain'])
d['M'] = d['body']/np.max(d['body'])
aux = pd.concat([pd.DataFrame(d.M[np.isnan(d.loc[:, 'B'])]).reset_index(drop=True),
pd.DataFrame(np.mean(posterior_21['brain_hat_missing'], axis=0)).reset_index(drop=True).rename({0:'B'}, axis=1),
pd.DataFrame(pm.stats.hpd(posterior_21['brain_hat_missing'], alpha=0.5)).rename({0:'B_lower_hpdi',1:'B_upper_hpdi'}, axis=1)],
axis = 1)
# generate the points
points = alt.Chart(aux).mark_point(
filled=True,
size=50,
color='black'
).encode(
x=alt.X('M', title='body mass M',scale=alt.Scale(domain=(0, 1))),
y=alt.Y('B', title='brain size B',scale=alt.Scale(domain=(0, 1)))
)
points2 = alt.Chart(d).mark_point(
filled=True,
size=50,
color='blue',
opacity=0.3
).encode(
x=alt.X('M', scale=alt.Scale(domain=(0, 1))),
y=alt.Y('B', scale=alt.Scale(domain=(0, 1)))
)
# generate the error bars
errorbars = alt.Chart(aux).mark_errorbar().encode(
x=alt.X("M"),
y=alt.Y("B_lower_hpdi:Q", title=None),
y2="B_upper_hpdi:Q"
)
points + errorbars + points2
Black open points are the imputed values, with 50% compatibility intervals. Imputation hasn't done much, apparently because all but one of the missing values are in a very dense region of the body size range. So almost no information was lost; the missing information is redundant.