We fit a hierarchical model to some batting average data.
import numpy as np
import pystan
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
The number of hits for batter $i$ is x[i].
x = np.array([ 4, 2, 10, 5, 2, 3, 4, 11, 2, 6, 7, 11, 2, 4, 5, 6, 7,
4, 4, 7, 3, 2, 3, 5, 9, 3, 3, 6, 5, 8, 4, 2, 4, 2,
6, 1, 5, 2, 5, 10, 7, 7, 1, 4, 5, 3, 4, 6, 7, 5])
The number of at bats for batter $i$ is n[i].
n = np.array([20, 25, 24, 14, 14, 17, 16, 25, 17, 19, 31, 25, 21, 16, 29, 17, 19,
20, 20, 21, 26, 18, 16, 24, 18, 25, 15, 19, 19, 25, 23, 15, 17, 15,
17, 14, 19, 9, 18, 25, 23, 25, 10, 22, 20, 16, 22, 15, 28, 18])
x / n
array([0.2 , 0.08 , 0.41666667, 0.35714286, 0.14285714, 0.17647059, 0.25 , 0.44 , 0.11764706, 0.31578947, 0.22580645, 0.44 , 0.0952381 , 0.25 , 0.17241379, 0.35294118, 0.36842105, 0.2 , 0.2 , 0.33333333, 0.11538462, 0.11111111, 0.1875 , 0.20833333, 0.5 , 0.12 , 0.2 , 0.31578947, 0.26315789, 0.32 , 0.17391304, 0.13333333, 0.23529412, 0.13333333, 0.35294118, 0.07142857, 0.26315789, 0.22222222, 0.27777778, 0.4 , 0.30434783, 0.28 , 0.1 , 0.18181818, 0.25 , 0.1875 , 0.18181818, 0.4 , 0.25 , 0.27777778])
plt.hist(n, bins=10, alpha=0.5,
histtype='stepfilled', color='steelblue',
edgecolor='none');
plt.xlabel("Number of At Bats");
plt.hist(x/n, bins=10,alpha=0.5,
histtype='stepfilled', color='steelblue',
edgecolor='none');
plt.xlabel("Batting Average");
Player $i$ has some unobserved true batting average $p_i$. The number of hits for player $i$ is modeled as a Binomial with $n_i$ trials (the number of at bats) and success probability $p_i$: $$ x_i \sim \mathrm{Binomial}(n_i,p_i)$$
The binomial likelihood is probably the most natural model, although it does assume constant player performance over time and no "streaks."
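As a quick sanity check on this likelihood, we can evaluate the binomial pmf for a hypothetical batter with SciPy (the values below are illustrative, not taken from the data):

```python
import numpy as np
from scipy.stats import binom

# Hypothetical batter: true average p = 0.25 over n = 20 at bats.
p_true, n_ab = 0.25, 20

# P(x = 5 hits) under the binomial model.
prob_5 = binom.pmf(5, n_ab, p_true)

# The expected number of hits is n * p.
expected_hits = n_ab * p_true
print(prob_5, expected_hits)
```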
What is the form of the distribution of batting averages? This is a difficult question to answer. One simple model assumes $p_i \sim \mathrm{Beta}(\alpha,\beta)$ for some values of $\alpha$ and $\beta$. This is at least a reasonable choice because the Beta distribution has support on the interval $[0,1]$ (it puts all its probability there), as any distribution of batting averages must.
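The Beta family is flexible: a few illustrative $(\alpha,\beta)$ pairs (hypothetical values, not fit to the data) give very different shapes while keeping all mass on $[0,1]$:

```python
from scipy.stats import beta

# Hypothetical (alpha, beta) pairs and the shapes they imply:
#   (1, 1)   -> uniform on [0, 1]
#   (2, 8)   -> right-skewed, mean 0.2
#   (10, 30) -> concentrated around mean 0.25
shapes = {(a, b): beta.mean(a, b) for (a, b) in [(1, 1), (2, 8), (10, 30)]}
print(shapes)
```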
Finally we need to choose a distribution for $\alpha$ and $\beta$, known as the hyperprior distribution. This gets even harder to justify. We would like a prior that allows for many shapes. We choose $$ \pi(\alpha,\beta) \propto (\alpha + \beta)^{-3}\mathbb{1}_{\alpha\geq 1,\beta \geq 1}$$
The purpose of requiring $\alpha,\beta \geq 1$ is to ensure unimodality of the $p$ distribution. This seemed reasonable based on the histogram of $x/n$. The $-3$ in the exponent discourages large values of $\alpha$ and $\beta$, which would cause the Beta distribution of $p$ to become highly concentrated.
We can use Stan to draw from the prior, which is useful for verifying that we have constructed at least a somewhat reasonable prior.
prior_model = """
parameters {
  real<lower=1> alpha; // parameter in beta prior
  real<lower=1> beta;  // parameter in beta prior
}
model {
  target += (-3.0)*log(alpha + beta);
}
"""
sm = pystan.StanModel(model_code=prior_model)
INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_7b2d2b586d3dc302072674033c5f9c51 NOW.
We are going to draw samples from the prior.
fit = sm.sampling(iter=1000, chains=2)
from scipy.stats import beta
la = fit.extract(permuted=True) # return a dictionary of arrays
p = np.linspace(0,1,1000)
for ii in np.arange(100):
plt.plot(p,beta.pdf(p,la['alpha'][ii],la['beta'][ii]),color='blue',alpha=0.2);
plt.ylim(0, 10);
plt.xlabel("p");
plt.ylabel("Draws from the Prior Distribution");
The prior looks at least quasi-reasonable, although anyone who knows baseball will notice that it puts too much probability on $p$ distributions with many large batting averages.
Write the model in Stan code.
batting_model = """
data {
  int<lower=0> N;    // number of batters
  int<lower=0> x[N]; // number of hits
  int<lower=0> n[N]; // number of at bats (i.e., attempts)
}
parameters {
  real<lower=0,upper=1> p[N]; // true batting average
  real<lower=1> alpha;        // parameter in beta prior
  real<lower=1> beta;         // parameter in beta prior
}
model {
  x ~ binomial(n, p);
  p ~ beta(alpha, beta);
  target += (-3.0)*log(alpha + beta);
}
"""
sm = pystan.StanModel(model_code=batting_model)
INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_7f7047aaef59430c1994a073b9020d12 NOW.
batting_dat = {'n': n,
'x': x,
'N':x.size}
batting_dat
{'N': 50, 'n': array([20, 25, 24, 14, 14, 17, 16, 25, 17, 19, 31, 25, 21, 16, 29, 17, 19, 20, 20, 21, 26, 18, 16, 24, 18, 25, 15, 19, 19, 25, 23, 15, 17, 15, 17, 14, 19, 9, 18, 25, 23, 25, 10, 22, 20, 16, 22, 15, 28, 18]), 'x': array([ 4, 2, 10, 5, 2, 3, 4, 11, 2, 6, 7, 11, 2, 4, 5, 6, 7, 4, 4, 7, 3, 2, 3, 5, 9, 3, 3, 6, 5, 8, 4, 2, 4, 2, 6, 1, 5, 2, 5, 10, 7, 7, 1, 4, 5, 3, 4, 6, 7, 5])}
fit = sm.sampling(data=batting_dat, iter=5000, chains=4)
print(fit)
Inference for Stan model: anon_model_7f7047aaef59430c1994a073b9020d12.
4 chains, each with iter=5000; warmup=2500; thin=1;
post-warmup draws per chain=2500, total post-warmup draws=10000.

         mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
p[0]     0.24  4.1e-3   0.05   0.15   0.21   0.24   0.27   0.34    133   1.03
p[1]     0.21    0.01   0.05    0.1   0.17   0.21   0.24   0.29     15   1.07
p[2]     0.29    0.01   0.05   0.21   0.25   0.27   0.32   0.42     20   1.06
p[3]     0.27  2.5e-3   0.05   0.18   0.23   0.26    0.3   0.39    429   1.01
p[4]     0.23  4.8e-4   0.05   0.13    0.2   0.23   0.26   0.33  10000   1.01
p[5]     0.23  4.6e-4   0.05   0.14   0.21   0.24   0.26   0.33  10000   1.01
p[6]     0.25  4.7e-4   0.05   0.16   0.22   0.25   0.28   0.35  10000   1.01
p[7]      0.3    0.01   0.06   0.21   0.26   0.29   0.33   0.43     22   1.06
p[8]     0.23    0.01   0.05   0.12   0.19   0.23   0.26   0.31     19   1.08
p[9]     0.26  4.9e-4   0.05   0.17   0.23   0.26   0.29   0.37  10000   1.01
p[10]    0.24  3.4e-3   0.05   0.15   0.21   0.25   0.27   0.34    178   1.03
p[11]     0.3  3.3e-3   0.05   0.21   0.27    0.3   0.33   0.43    256   1.01
p[12]    0.21  9.4e-3   0.05   0.11   0.18   0.22   0.24    0.3     25   1.05
p[13]    0.25  4.9e-4   0.05   0.15   0.22   0.25   0.28   0.35  10000   1.01
p[14]    0.23  4.9e-3   0.04   0.13    0.2   0.23   0.25   0.31     77   1.03
p[15]    0.27  7.7e-3   0.05   0.18   0.24   0.26    0.3   0.39     45   1.04
p[16]    0.27  3.3e-3   0.05   0.18   0.24   0.27    0.3   0.39    244   1.02
p[17]    0.24  6.1e-3   0.05   0.14   0.21   0.24   0.27   0.33     59   1.03
p[18]    0.24  4.6e-4   0.05   0.14   0.21   0.24   0.27   0.33  10000   1.02
p[19]    0.27  4.8e-4   0.05   0.18   0.24   0.26   0.29   0.38  10000   1.01
p[20]    0.21  8.6e-3   0.05   0.11   0.18   0.21   0.25   0.29     30   1.09
p[21]    0.22  4.2e-3   0.05   0.12   0.19   0.22   0.25   0.31    126   1.04
p[22]    0.24  4.8e-4   0.05   0.14   0.21   0.24   0.27   0.33  10000   1.03
p[23]    0.24    0.01   0.05   0.15   0.21   0.24   0.27   0.33     17   1.06
p[24]     0.3    0.01   0.06   0.21   0.25   0.29   0.33   0.44     28   1.07
p[25]    0.21  7.2e-3   0.05   0.12   0.18   0.21   0.25   0.29     41   1.08
p[26]    0.24    0.01   0.05   0.14   0.21   0.24   0.28   0.34     24   1.07
p[27]    0.26  4.8e-4   0.05   0.17   0.23   0.26   0.29   0.37  10000    1.0
p[28]    0.25  4.7e-4   0.05   0.16   0.22   0.25   0.27   0.35  10000    1.0
p[29]    0.26  3.6e-3   0.05   0.18   0.23   0.25   0.29   0.37    179   1.03
p[30]    0.23    0.01   0.05   0.14    0.2   0.23   0.26   0.32     20   1.07
p[31]    0.23  4.7e-4   0.05   0.13    0.2   0.23   0.25   0.32  10000   1.02
p[32]    0.24  4.7e-4   0.05   0.15   0.22   0.24   0.27   0.35  10000   1.01
p[33]    0.23  8.3e-3   0.05   0.13    0.2   0.23   0.26   0.32     34   1.05
p[34]    0.27  5.1e-4   0.05   0.18   0.24   0.26   0.29   0.39  10000    1.0
p[35]    0.22    0.01   0.05   0.11   0.19   0.23   0.25   0.31     23   1.06
p[36]    0.25  4.6e-4   0.05   0.16   0.22   0.25   0.28   0.35  10000    1.0
p[37]    0.24  5.2e-4   0.05   0.14   0.21   0.24   0.27   0.36  10000   1.01
p[38]    0.25  4.8e-4   0.05   0.16   0.23   0.25   0.28   0.36  10000   1.01
p[39]    0.29  3.8e-3   0.05    0.2   0.26   0.28   0.32   0.41    196   1.01
p[40]    0.26  2.3e-3   0.05   0.18   0.23   0.25   0.28   0.37    407   1.01
p[41]    0.25  7.0e-3   0.05   0.17   0.22   0.25   0.28   0.36     45   1.04
p[42]    0.23  4.9e-4   0.05   0.12    0.2   0.23   0.25   0.33  10000   1.01
p[43]    0.23  3.8e-3   0.05   0.14    0.2   0.24   0.26   0.32    147   1.03
p[44]    0.25  4.7e-4   0.05   0.16   0.22   0.25   0.28   0.35  10000   1.02
p[45]    0.24  2.4e-3   0.05   0.14   0.21   0.24   0.26   0.33    401   1.02
p[46]    0.24  8.6e-3   0.05   0.14    0.2   0.24   0.27   0.32     31   1.07
p[47]    0.27  3.6e-3   0.05   0.18   0.24   0.27    0.3   0.41    232   1.01
p[48]    0.25  3.0e-3   0.04   0.17   0.23   0.25   0.28   0.34    212   1.02
p[49]    0.25  4.7e-4   0.05   0.17   0.22   0.25   0.28   0.36  10000   1.02
alpha    33.1   16.36  32.72   5.15  10.28  17.87  41.07 109.75      4   2.02
beta    99.16   47.81  95.62  15.55  31.41  54.21 128.84 320.73      4   1.93
lp__   -552.7    8.57  19.17 -582.7 -567.6 -556.4 -538.3 -516.0      5   1.46

Samples were drawn using NUTS at Tue Apr 10 23:04:27 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
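Note the alpha and beta rows: n_eff of 4 and Rhat near 2 indicate the chains have not mixed well over the hyperparameters, even though most of the p[i] look fine. Rhat compares between-chain and within-chain variance; a minimal sketch of the idea (the classic non-split version, not Stan's exact split-chain computation):

```python
import numpy as np

def rhat(chains):
    """Basic potential scale reduction factor for equal-length chains.

    Classic (non-split) version; Stan's reported Rhat uses split
    chains, so values here are only illustrative."""
    chains = np.asarray(chains)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)

# Two chains exploring the same region -> Rhat near 1.
c = np.linspace(0.0, 1.0, 100)
r_good = rhat([c, c + 0.01])
# Two chains stuck in different regions -> Rhat well above 1,
# as in the alpha and beta rows above.
r_bad = rhat([c, c + 5.0])
print(r_good, r_bad)
```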
We examine how the posterior means compare to the MLEs and posterior draws from the Beta distribution.
la = fit.extract(permuted=True) # return a dictionary of arrays
## obtain posterior means for p parameters
p_pm = la['p'].mean(axis=0)
plt.hist(x/n, bins=10, density=True, alpha=0.5,
histtype='stepfilled', color='steelblue',
edgecolor='none');
plt.hist(p_pm, bins=10, density=True, alpha=0.5,
histtype='stepfilled', color='red',
edgecolor='none');
plt.xlabel("p_hat");
The posterior means (red) are much more concentrated than the maximum likelihood estimates (blue).
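This concentration is the usual shrinkage effect. Conditional on $(\alpha,\beta)$, the posterior for $p_i$ is $\mathrm{Beta}(\alpha + x_i, \beta + n_i - x_i)$ by conjugacy, so its mean pulls the MLE $x_i/n_i$ toward the prior mean $\alpha/(\alpha+\beta)$. A sketch with hypothetical hyperparameter values (not the fitted ones):

```python
# Shrinkage of one batter's estimate, conditional on fixed hyperparameters.
# Hypothetical values alpha = 10, beta = 30 (prior mean 0.25); x, n are
# the third batter's data (10 hits in 24 at bats).
alpha_, beta_ = 10.0, 30.0
x_i, n_i = 10, 24

mle = x_i / n_i                                      # the raw batting average
prior_mean = alpha_ / (alpha_ + beta_)
# Conditional posterior is Beta(alpha + x, beta + n - x), with mean:
post_mean = (alpha_ + x_i) / (alpha_ + beta_ + n_i)
print(mle, prior_mean, post_mean)
```

The posterior mean always lands between the prior mean and the MLE, which is exactly the pull toward the center seen in the red histogram.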
p = np.linspace(0,0.6,1000)
for ii in np.arange(la['beta'].size,step=100):
plt.plot(p,beta.pdf(p,la['alpha'][ii],la['beta'][ii]),color='blue',alpha=0.1);
plt.plot(p,beta.pdf(p,10,30),color='red',alpha=1);
plt.xlabel("p");
plt.ylabel("Draws from the Posterior");
The red curve is the true beta distribution used to simulate the $p_i$. The blue curves are draws from the posterior. With more batters and more at bats for each batter, the posterior draws should concentrate around the true curve in red.
plt.plot(la['alpha'],la['beta'],'.');
plt.xlabel("alpha");
plt.ylabel("beta");