We fit a hierarchical model to some batting average data.
import numpy as np
import pystan
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
The number of hits for batter $i$ is x[i].
x = np.array([ 4, 2, 10, 5, 2, 3, 4, 11, 2, 6, 7, 11, 2, 4, 5, 6, 7,
4, 4, 7, 3, 2, 3, 5, 9, 3, 3, 6, 5, 8, 4, 2, 4, 2,
6, 1, 5, 2, 5, 10, 7, 7, 1, 4, 5, 3, 4, 6, 7, 5])
The number of at bats for batter $i$ is n[i].
n = np.array([20, 25, 24, 14, 14, 17, 16, 25, 17, 19, 31, 25, 21, 16, 29, 17, 19,
20, 20, 21, 26, 18, 16, 24, 18, 25, 15, 19, 19, 25, 23, 15, 17, 15,
17, 14, 19, 9, 18, 25, 23, 25, 10, 22, 20, 16, 22, 15, 28, 18])
x / n
array([0.2 , 0.08 , 0.41666667, 0.35714286, 0.14285714, 0.17647059, 0.25 , 0.44 , 0.11764706, 0.31578947, 0.22580645, 0.44 , 0.0952381 , 0.25 , 0.17241379, 0.35294118, 0.36842105, 0.2 , 0.2 , 0.33333333, 0.11538462, 0.11111111, 0.1875 , 0.20833333, 0.5 , 0.12 , 0.2 , 0.31578947, 0.26315789, 0.32 , 0.17391304, 0.13333333, 0.23529412, 0.13333333, 0.35294118, 0.07142857, 0.26315789, 0.22222222, 0.27777778, 0.4 , 0.30434783, 0.28 , 0.1 , 0.18181818, 0.25 , 0.1875 , 0.18181818, 0.4 , 0.25 , 0.27777778])
plt.hist(n, bins=10, alpha=0.5,
histtype='stepfilled', color='steelblue',
edgecolor='none');
plt.xlabel("Number of At Bats");
plt.hist(x/n, bins=10,alpha=0.5,
histtype='stepfilled', color='steelblue',
edgecolor='none');
plt.xlabel("Batting Average");
Player $i$ has some unobserved true batting average $p_i$. The number of hits for player $i$ is modeled as a Binomial with $n_i$ trials (the number of at bats) and success probability $p_i$: $$ x_i \sim \mathrm{Binomial}(n_i,p_i)$$
The binomial likelihood is probably the most natural model, although it does assume constant player performance over time and no "streaks."
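As a quick sanity check on this likelihood, we can evaluate the binomial pmf for a hypothetical batter with SciPy (the values below are illustrative, not taken from the data):

```python
import numpy as np
from scipy.stats import binom

# Hypothetical batter: true average p = 0.25 over n = 20 at bats.
p_true, n_ab = 0.25, 20

# P(x = 5 hits) under the binomial model.
prob_5 = binom.pmf(5, n_ab, p_true)

# The expected number of hits is n * p.
expected_hits = n_ab * p_true
print(prob_5, expected_hits)
```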
What is the form of the distribution of batting averages? This is a difficult question to answer. One simple model assumes $p_i \sim \mathrm{Beta}(\alpha,\beta)$ for some values of $\alpha$ and $\beta$. This is at least a reasonable choice because the Beta distribution has support on the interval $[0,1]$ (it puts all its probability there), as any distribution of batting averages must.
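The Beta family is flexible: a few illustrative $(\alpha,\beta)$ pairs (hypothetical values, not fit to the data) give very different shapes while keeping all mass on $[0,1]$:

```python
from scipy.stats import beta

# Hypothetical (alpha, beta) pairs and the shapes they imply:
#   (1, 1)   -> uniform on [0, 1]
#   (2, 8)   -> right-skewed, mean 0.2
#   (10, 30) -> concentrated around mean 0.25
shapes = {(a, b): beta.mean(a, b) for (a, b) in [(1, 1), (2, 8), (10, 30)]}
print(shapes)
```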
Finally we need to choose a distribution for $\alpha$ and $\beta$, known as the hyperprior distribution. This gets even harder to justify. We would like a prior that allows for many shapes. We choose $$ \pi(\alpha,\beta) \propto (\alpha + \beta)^{-3}\mathbb{1}_{\alpha\geq 1,\beta \geq 1}$$
The purpose of requiring $\alpha,\beta \geq 1$ is to ensure unimodality of the $p$ distribution. This seemed reasonable based on the histogram of $x/n$. The $-3$ in the exponent discourages large values of $\alpha$ and $\beta$, which would cause the Beta distribution of $p$ to become highly concentrated.
We can use Stan to draw from the prior, which is useful for verifying that we have constructed at least a somewhat reasonable prior.
prior_model = """
parameters {
  real<lower=1> alpha; // parameter in beta prior
  real<lower=1> beta;  // parameter in beta prior
}
model {
  target += (-3.0)*log(alpha + beta);
}
"""
sm = pystan.StanModel(model_code=prior_model)
INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_7b2d2b586d3dc302072674033c5f9c51 NOW.
We are going to draw samples from the prior.
fit = sm.sampling(iter=1000, chains=2)
from scipy.stats import beta
la = fit.extract(permuted=True) # return a dictionary of arrays
p = np.linspace(0,1,1000)
for ii in np.arange(100):
plt.plot(p,beta.pdf(p,la['alpha'][ii],la['beta'][ii]),color='blue',alpha=0.2);
plt.ylim(0, 10);
plt.xlabel("p");
plt.ylabel("Draws from the Prior Distribution");
The prior looks at least quasi-reasonable, although anyone who knows baseball will notice that it puts too much probability on $p$ distributions with many large batting averages.
Write the model in Stan code.
batting_model = """
data {
  int<lower=0> N;    // number of batters
  int<lower=0> x[N]; // number of hits
  int<lower=0> n[N]; // number of at bats (i.e., attempts)
}
parameters {
  real<lower=0,upper=1> p[N]; // true batting average
  real<lower=1> alpha;        // parameter in beta prior
  real<lower=1> beta;         // parameter in beta prior
}
model {
  x ~ binomial(n, p);
  p ~ beta(alpha, beta);
  target += (-3.0)*log(alpha + beta);
}
"""
sm = pystan.StanModel(model_code=batting_model)
INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_7f7047aaef59430c1994a073b9020d12 NOW.
batting_dat = {'n': n,
'x': x,
'N':x.size}
batting_dat
{'N': 50, 'n': array([20, 25, 24, 14, 14, 17, 16, 25, 17, 19, 31, 25, 21, 16, 29, 17, 19, 20, 20, 21, 26, 18, 16, 24, 18, 25, 15, 19, 19, 25, 23, 15, 17, 15, 17, 14, 19, 9, 18, 25, 23, 25, 10, 22, 20, 16, 22, 15, 28, 18]), 'x': array([ 4, 2, 10, 5, 2, 3, 4, 11, 2, 6, 7, 11, 2, 4, 5, 6, 7, 4, 4, 7, 3, 2, 3, 5, 9, 3, 3, 6, 5, 8, 4, 2, 4, 2, 6, 1, 5, 2, 5, 10, 7, 7, 1, 4, 5, 3, 4, 6, 7, 5])}
fit = sm.sampling(data=batting_dat, iter=5000, chains=4)
print(fit)
Inference for Stan model: anon_model_7f7047aaef59430c1994a073b9020d12.
4 chains, each with iter=5000; warmup=2500; thin=1;
post-warmup draws per chain=2500, total post-warmup draws=10000.

         mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
p[0]     0.24  4.1e-3   0.05   0.15   0.21   0.24   0.27   0.34    133   1.03
p[1]     0.21    0.01   0.05    0.1   0.17   0.21   0.24   0.29     15   1.07
p[2]     0.29    0.01   0.05   0.21   0.25   0.27   0.32   0.42     20   1.06
p[3]     0.27  2.5e-3   0.05   0.18   0.23   0.26    0.3   0.39    429   1.01
p[4]     0.23  4.8e-4   0.05   0.13    0.2   0.23   0.26   0.33  10000   1.01
p[5]     0.23  4.6e-4   0.05   0.14   0.21   0.24   0.26   0.33  10000   1.01
p[6]     0.25  4.7e-4   0.05   0.16   0.22   0.25   0.28   0.35  10000   1.01
p[7]      0.3    0.01   0.06   0.21   0.26   0.29   0.33   0.43     22   1.06
p[8]     0.23    0.01   0.05   0.12   0.19   0.23   0.26   0.31     19   1.08
p[9]     0.26  4.9e-4   0.05   0.17   0.23   0.26   0.29   0.37  10000   1.01
p[10]    0.24  3.4e-3   0.05   0.15   0.21   0.25   0.27   0.34    178   1.03
p[11]     0.3  3.3e-3   0.05   0.21   0.27    0.3   0.33   0.43    256   1.01
p[12]    0.21  9.4e-3   0.05   0.11   0.18   0.22   0.24    0.3     25   1.05
p[13]    0.25  4.9e-4   0.05   0.15   0.22   0.25   0.28   0.35  10000   1.01
p[14]    0.23  4.9e-3   0.04   0.13    0.2   0.23   0.25   0.31     77   1.03
p[15]    0.27  7.7e-3   0.05   0.18   0.24   0.26    0.3   0.39     45   1.04
p[16]    0.27  3.3e-3   0.05   0.18   0.24   0.27    0.3   0.39    244   1.02
p[17]    0.24  6.1e-3   0.05   0.14   0.21   0.24   0.27   0.33     59   1.03
p[18]    0.24  4.6e-4   0.05   0.14   0.21   0.24   0.27   0.33  10000   1.02
p[19]    0.27  4.8e-4   0.05   0.18   0.24   0.26   0.29   0.38  10000   1.01
p[20]    0.21  8.6e-3   0.05   0.11   0.18   0.21   0.25   0.29     30   1.09
p[21]    0.22  4.2e-3   0.05   0.12   0.19   0.22   0.25   0.31    126   1.04
p[22]    0.24  4.8e-4   0.05   0.14   0.21   0.24   0.27   0.33  10000   1.03
p[23]    0.24    0.01   0.05   0.15   0.21   0.24   0.27   0.33     17   1.06
p[24]     0.3    0.01   0.06   0.21   0.25   0.29   0.33   0.44     28   1.07
p[25]    0.21  7.2e-3   0.05   0.12   0.18   0.21   0.25   0.29     41   1.08
p[26]    0.24    0.01   0.05   0.14   0.21   0.24   0.28   0.34     24   1.07
p[27]    0.26  4.8e-4   0.05   0.17   0.23   0.26   0.29   0.37  10000    1.0
p[28]    0.25  4.7e-4   0.05   0.16   0.22   0.25   0.27   0.35  10000    1.0
p[29]    0.26  3.6e-3   0.05   0.18   0.23   0.25   0.29   0.37    179   1.03
p[30]    0.23    0.01   0.05   0.14    0.2   0.23   0.26   0.32     20   1.07
p[31]    0.23  4.7e-4   0.05   0.13    0.2   0.23   0.25   0.32  10000   1.02
p[32]    0.24  4.7e-4   0.05   0.15   0.22   0.24   0.27   0.35  10000   1.01
p[33]    0.23  8.3e-3   0.05   0.13    0.2   0.23   0.26   0.32     34   1.05
p[34]    0.27  5.1e-4   0.05   0.18   0.24   0.26   0.29   0.39  10000    1.0
p[35]    0.22    0.01   0.05   0.11   0.19   0.23   0.25   0.31     23   1.06
p[36]    0.25  4.6e-4   0.05   0.16   0.22   0.25   0.28   0.35  10000    1.0
p[37]    0.24  5.2e-4   0.05   0.14   0.21   0.24   0.27   0.36  10000   1.01
p[38]    0.25  4.8e-4   0.05   0.16   0.23   0.25   0.28   0.36  10000   1.01
p[39]    0.29  3.8e-3   0.05    0.2   0.26   0.28   0.32   0.41    196   1.01
p[40]    0.26  2.3e-3   0.05   0.18   0.23   0.25   0.28   0.37    407   1.01
p[41]    0.25  7.0e-3   0.05   0.17   0.22   0.25   0.28   0.36     45   1.04
p[42]    0.23  4.9e-4   0.05   0.12    0.2   0.23   0.25   0.33  10000   1.01
p[43]    0.23  3.8e-3   0.05   0.14    0.2   0.24   0.26   0.32    147   1.03
p[44]    0.25  4.7e-4   0.05   0.16   0.22   0.25   0.28   0.35  10000   1.02
p[45]    0.24  2.4e-3   0.05   0.14   0.21   0.24   0.26   0.33    401   1.02
p[46]    0.24  8.6e-3   0.05   0.14    0.2   0.24   0.27   0.32     31   1.07
p[47]    0.27  3.6e-3   0.05   0.18   0.24   0.27    0.3   0.41    232   1.01
p[48]    0.25  3.0e-3   0.04   0.17   0.23   0.25   0.28   0.34    212   1.02
p[49]    0.25  4.7e-4   0.05   0.17   0.22   0.25   0.28   0.36  10000   1.02
alpha    33.1   16.36  32.72   5.15  10.28  17.87  41.07 109.75      4   2.02
beta    99.16   47.81  95.62  15.55  31.41  54.21 128.84 320.73      4   1.93
lp__   -552.7    8.57  19.17 -582.7 -567.6 -556.4 -538.3 -516.0      5   1.46

Samples were drawn using NUTS at Tue Apr 10 23:04:27 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
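Note the alpha and beta rows: n_eff of 4 and Rhat near 2 indicate the chains have not mixed well over the hyperparameters, even though most of the p[i] look fine. Rhat compares between-chain and within-chain variance; a minimal sketch of the idea (the classic non-split version, not Stan's exact split-chain computation):

```python
import numpy as np

def rhat(chains):
    """Basic potential scale reduction factor for equal-length chains.

    Classic (non-split) version; Stan's reported Rhat uses split
    chains, so values here are only illustrative."""
    chains = np.asarray(chains)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)

# Two chains exploring the same region -> Rhat near 1.
c = np.linspace(0.0, 1.0, 100)
r_good = rhat([c, c + 0.01])
# Two chains stuck in different regions -> Rhat well above 1,
# as in the alpha and beta rows above.
r_bad = rhat([c, c + 5.0])
print(r_good, r_bad)
```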
We examine how the posterior means compare to the MLEs and posterior draws from the Beta distribution.
la = fit.extract(permuted=True) # return a dictionary of arrays
## obtain posterior means for p parameters
p_pm = la['p'].mean(axis=0)
plt.hist(x/n, bins=10, density=True, alpha=0.5,
histtype='stepfilled', color='steelblue',
edgecolor='none');
plt.hist(p_pm, bins=10, density=True, alpha=0.5,
histtype='stepfilled', color='red',
edgecolor='none');
plt.xlabel("p_hat");
The posterior means (red) are much more concentrated than the maximum likelihood estimates (blue).
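This concentration is the usual shrinkage effect. Conditional on $(\alpha,\beta)$, the posterior for $p_i$ is $\mathrm{Beta}(\alpha + x_i, \beta + n_i - x_i)$ by conjugacy, so its mean pulls the MLE $x_i/n_i$ toward the prior mean $\alpha/(\alpha+\beta)$. A sketch with hypothetical hyperparameter values (not the fitted ones):

```python
# Shrinkage of one batter's estimate, conditional on fixed hyperparameters.
# Hypothetical values alpha = 10, beta = 30 (prior mean 0.25); x, n are
# the third batter's data (10 hits in 24 at bats).
alpha_, beta_ = 10.0, 30.0
x_i, n_i = 10, 24

mle = x_i / n_i                                      # the raw batting average
prior_mean = alpha_ / (alpha_ + beta_)
# Conditional posterior is Beta(alpha + x, beta + n - x), with mean:
post_mean = (alpha_ + x_i) / (alpha_ + beta_ + n_i)
print(mle, prior_mean, post_mean)
```

The posterior mean always lands between the prior mean and the MLE, which is exactly the pull toward the center seen in the red histogram.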
p = np.linspace(0,0.6,1000)
for ii in np.arange(la['beta'].size,step=100):
plt.plot(p,beta.pdf(p,la['alpha'][ii],la['beta'][ii]),color='blue',alpha=0.1);
plt.plot(p,beta.pdf(p,10,30),color='red',alpha=1);
plt.xlabel("p");
plt.ylabel("Draws from the Posterior");
The red curve is the true beta distribution used to simulate the $p_i$. The blue curves are draws from the posterior. With more batters and more at bats for each batter, the posterior draws should concentrate around the true curve in red.
plt.plot(la['alpha'],la['beta'],'.');
plt.xlabel("alpha");
plt.ylabel("beta");