Second Edition
Copyright 2020 Allen B. Downey
License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
# If we're running on Colab, install empiricaldist
# https://pypi.org/project/empiricaldist/
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    !pip install empiricaldist
# Get utils.py and create directories
import os
if not os.path.exists('utils.py'):
    !wget https://github.com/AllenDowney/ThinkBayes2/raw/master/code/soln/utils.py

if not os.path.exists('figs'):
    !mkdir figs
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from empiricaldist import Pmf, Cdf
from utils import decorate, savefig
Three dimensions!
In 1996 and 1997, Mowat and Strobeck deployed bear traps at sites in British Columbia and Alberta in an effort to estimate the population of grizzly bears. They describe the experiment in "Estimating Population Size of Grizzly Bears Using Hair Capture, DNA Profiling, and Mark-Recapture Analysis".
Each "trap" consists of a lure and several strands of barbed wire intended to capture samples of hair from bears that visit it. From the hair samples, the researchers use DNA analysis to identify individual bears.
During the first session, on June 29, 1996, the researchers deployed traps at 76 sites. Returning 10 days later, they obtained 1043 hair samples and identified 23 different bears. During a second 10-day session they obtained 1191 samples from 19 different bears, where 4 of the 19 were from bears they had identified in the first batch.
To estimate the population of bears from this data, we need a model for the probability that each bear will be observed during each session. As a starting place, we'll make the simplest assumption, that every bear in the population has the same (unknown) probability of being sampled during each round.
With these assumptions we can compute the probability of the data for a range of possible populations.
As an example, let's suppose that the actual population of bears is 200.
After the first session, 23 of the 200 bears have been identified. During the second session, if we choose 19 bears at random, what is the probability that 4 of them were previously identified?
I'll define
N: actual (unknown) population size, 200.
K: number of bears identified in the first session, 23.
n: number of bears observed in the second session, 19 in the example.
k: the number of bears in the second session that had previously been identified, 4.
For given values of N, K, and n, the distribution of k is described by the hypergeometric distribution:
$PMF(k) = {K \choose k}{N-K \choose n-k}/{N \choose n}$
To understand why, consider:
The denominator, ${N \choose n}$, is the number of subsets of size $n$ we could choose from a population of $N$ bears.
The numerator is the number of subsets that contain $k$ bears from the previously identified $K$ and $n-k$ from the previously unseen $N-K$.
SciPy provides hypergeom, which we can use to compute this PMF.
from scipy.stats import hypergeom
N = 200
K = 23
n = 19
ks = np.arange(12)
ps = hypergeom(N, K, n).pmf(ks)
plt.bar(ks, ps)
decorate(xlabel='Number of bears observed twice',
ylabel='PMF',
title='Hypergeometric distribution of k (known population 200)')
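As a sanity check (my addition, not part of the text), the PMF formula above can be evaluated directly from binomial coefficients and compared with SciPy's implementation; the two should agree to floating-point precision.

```python
from scipy.special import comb
from scipy.stats import hypergeom

def hypergeom_pmf(k, N, K, n):
    """PMF(k) = C(K, k) * C(N-K, n-k) / C(N, n)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

N, K, n, k = 200, 23, 19, 4
print(hypergeom_pmf(k, N, K, n), hypergeom(N, K, n).pmf(k))
```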
So that's the distribution of k given N, K, and n.
Now let's go the other way: given K, n, and k, how can we estimate the total population, N?
As a starting place, let's suppose that, prior to this study, an expert in this domain would have estimated that the population is between 50 and 500, and equally likely to be any value in that range.
Ns = np.arange(50, 501)
prior_N = Pmf(1, Ns)
prior_N.index.name = 'N'
So that's our prior.
To compute the likelihood of the data, we can use hypergeom with constants K and n, and a range of values of N.
K = 23
n = 19
k = 4
likelihood = hypergeom(Ns, K, n).pmf(k)
We can compute the posterior in the usual way.
posterior_N = prior_N * likelihood
posterior_N.normalize()
34.97606148975133
And here's what it looks like.
posterior_N.plot()
decorate(xlabel='Population of bears (N)',
ylabel='PDF',
title='Posterior distribution of N')
The most likely value is 109.
posterior_N.max_prob()
109
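For comparison (my addition, not part of the original analysis), the classical Lincoln-Petersen estimator, $\hat N = Kn/k$, gives nearly the same answer as the MAP value.

```python
# Classical Lincoln-Petersen estimate of the population: N_hat = K * n / k
K, n, k = 23, 19, 4
N_hat = K * n / k
print(N_hat)  # 109.25
```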
But the distribution is skewed to the right, so the posterior mean is substantially higher.
posterior_N.mean()
173.79880627085643
posterior_N.credible_interval(0.9)
array([ 77., 363.])
Now let's make a model with two parameters, N and p, where p is the probability of observing each bear during each session. Here's a uniform prior for p.
ps = np.linspace(0, 1, 101)
prior_p = Pmf(1, ps)
prior_p.index.name = 'p'
from utils import make_joint
joint_prior = make_joint(prior_N, prior_p)
N_mesh, p_mesh = np.meshgrid(Ns, ps)
N_mesh.shape
(101, 451)
The likelihood of the first session is the binomial probability of observing K of the N bears, each with probability p.
from scipy.stats import binom
like1 = binom.pmf(K, N_mesh, p_mesh)
like1.sum()
229.56130903199954
For the second session, we observe k of the K previously identified bears and n-k of the N-K bears not seen before.
like2 = binom.pmf(k, K, p_mesh) * binom.pmf(n-k, N_mesh-K, p_mesh)
like2.sum()
24.771767481921685
from utils import normalize
joint_posterior = joint_prior * like1 * like2
normalize(joint_posterior)
def plot_contour(joint, **options):
    """Plot a joint distribution.

    joint: DataFrame representing a joint PMF
    """
    cs = plt.contour(joint.columns, joint.index, joint, **options)
    decorate(xlabel=joint.columns.name,
             ylabel=joint.index.name)
    return cs
plot_contour(joint_posterior)
decorate(title='Joint posterior distribution of N and p')
from utils import marginal
marginal_N = marginal(joint_posterior, 0)
posterior_N.plot(color='gray')
marginal_N.plot()
decorate(xlabel='Population of bears (N)',
ylabel='PDF',
title='Posterior marginal distribution of N')
marginal_N.mean(), marginal_N.credible_interval(0.9)
(138.7505213647261, array([ 68., 277.]))
marginal_p = marginal(joint_posterior, 1)
marginal_p.plot()
decorate(xlabel='Probability of observing a bear',
ylabel='PDF',
title='Posterior marginal distribution of p')
from seaborn import JointGrid
def joint_plot(joint, **options):
    x = joint.columns.name
    x = 'x' if x is None else x
    y = joint.index.name
    y = 'y' if y is None else y

    # make a JointGrid with minimal data
    data = pd.DataFrame({x: [0], y: [0]})
    g = JointGrid(x=x, y=y, data=data, **options)

    # replace the contour plot
    g.ax_joint.contour(joint.columns,
                       joint.index,
                       joint,
                       cmap='viridis')

    # replace the marginals
    marginal_x = marginal(joint, 0)
    g.ax_marg_x.plot(marginal_x.qs, marginal_x.ps)
    marginal_y = marginal(joint, 1)
    g.ax_marg_y.plot(marginal_y.ps, marginal_y.qs)
joint_plot(joint_posterior)
To compare candidate values of N, we can check how well each one fits the data. For a given N, the observation probability p is roughly the average number of bears observed per session divided by N.
mean = (23 + 19) / 2
N1 = 138
p = mean/N1
p
0.15217391304347827
from scipy.stats import binom
binom(N1, p).std()
4.219519857292647
binom(N1, p).pmf([23, 19]).prod()
0.007113233315460428
N2 = 173
p = mean/N2
p
0.12138728323699421
binom(N2, p).std()
4.2954472470306415
binom(N2, p).pmf([23, 19]).prod()
0.006918408214247037
The data are slightly more likely with N=138 than with N=173.
A few years ago my occasional correspondent John D. Cook wrote an excellent blog post about the Lincoln index, which is a way to estimate the number of errors in a document (or program) by comparing results from two independent testers.
http://www.johndcook.com/blog/2010/07/13/lincoln-index/
Here's his presentation of the problem:
"Suppose you have a tester who finds 20 bugs in your program. You want to estimate how many bugs are really in the program. You know there are at least 20 bugs, and if you have supreme confidence in your tester, you may suppose there are around 20 bugs. But maybe your tester isn't very good. Maybe there are hundreds of bugs. How can you have any idea how many bugs there are? There's no way to know with one tester. But if you have two testers, you can get a good idea, even if you don't know how skilled the testers are."
Suppose the first tester finds 20 bugs, the second finds 15, and they find 3 in common; how can we estimate the number of bugs?
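Before the Bayesian analysis, here's the classical Lincoln index estimate (my addition): multiply the two bug counts and divide by the overlap.

```python
# Lincoln index estimate: n0 * n1 / k11,
# where k11 is the number of bugs found by both testers
n0, n1, k11 = 20, 15, 3
estimate = n0 * n1 / k11
print(estimate)  # 100.0
```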
n0 = 20
n1 = 15
k11 = 3
k10 = n0 - k11
k01 = n1 - k11
k10, k01
(17, 12)
# there are at least n0 + n1 - k11 = 32 distinct bugs
Ns = np.arange(32, 350)
prior_N = Pmf(1, Ns)
prior_N.index.name = 'N'
First, suppose we know the probability that each tester finds a bug.
p0, p1 = 0.2, 0.15
like0 = binom.pmf(n0, Ns, p0)
like0.sum()
4.9999998675182855
like1 = binom.pmf(k01, Ns-n0, p1) * binom.pmf(k11, n0, p1)
like1.sum()
1.6188593076346907
likelihood = like0 * like1
likelihood.shape
(318,)
prior_N.shape
(318,)
posterior_N = prior_N * likelihood
posterior_N.normalize()
0.10960645032182464
posterior_N.plot()
decorate(xlabel='N',
         ylabel='PMF',
         title='Posterior distribution of N with known p0, p1')
posterior_N.mean()
102.12500000000003
Now suppose we don't know the testers' abilities. We'll use uniform priors for p0 and p1.
p0 = np.linspace(0, 1, 61)
prior_p0 = Pmf(1, p0)
prior_p0.index.name = 'p0'
p1 = np.linspace(0, 1, 51)
prior_p1 = Pmf(1, p1)
prior_p1.index.name = 'p1'
from utils import make_joint
joint = make_joint(prior_p0, prior_p1)
joint.shape
(51, 61)
joint_pmf = Pmf(joint.transpose().stack())
joint_pmf.head()
p0   p1
0.0  0.00    1
     0.02    1
     0.04    1
     0.06    1
     0.08    1
dtype: int64
joint_prior = make_joint(prior_N, joint_pmf)
joint_prior.shape
(3111, 318)
joint_prior.head()
[DataFrame of ones: 3111 rows indexed by (p0, p1), 318 columns indexed by N from 32 to 349; first 5 rows shown]
likelihood = joint_prior.copy()
Ns = joint_prior.columns

for (p0, p1) in joint_prior.index:
    like0 = binom.pmf(n0, Ns, p0)
    like1 = binom.pmf(k01, Ns-n0, p1) * binom.pmf(k11, n0, p1)
    likelihood.loc[p0, p1] = like0 * like1
likelihood.to_numpy().sum()
8.930285561728951
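A note on performance (my addition): the double loop over (p0, p1) can be replaced with NumPy broadcasting. This sketch rebuilds the same likelihood as a 3-D array; the axis order (p0, p1, N) is my choice.

```python
import numpy as np
from scipy.stats import binom

n0, k01, k11 = 20, 12, 3
Ns = np.arange(32, 350)
p0s = np.linspace(0, 1, 61)
p1s = np.linspace(0, 1, 51)

# align the three axes: (p0, p1, N)
P0 = p0s[:, None, None]
P1 = p1s[None, :, None]
N = Ns[None, None, :]

like = (binom.pmf(n0, N, P0) *
        binom.pmf(k01, N - n0, P1) *
        binom.pmf(k11, n0, P1))
print(like.shape)  # (61, 51, 318)
print(like.sum())  # should match the sum of the loop-based likelihood
```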
from utils import normalize
joint_posterior = joint_prior * likelihood
normalize(joint_posterior)
joint_posterior.shape
(3111, 318)
from utils import marginal
posterior_N = marginal(joint_posterior, 0)
posterior_N.plot()
posterior_pmf = marginal(joint_posterior, 1)
posterior_pmf.shape
(3111,)
posterior_joint_ps = posterior_pmf.unstack().transpose()
posterior_joint_ps.head()
[DataFrame: rows indexed by p1, columns by p0 from 0.0 to 1.0; first 5 of 51 rows shown, 61 columns]
plot_contour(posterior_joint_ps)
decorate(title='Posterior joint distribution of p0 and p1')
posterior_p0 = marginal(posterior_joint_ps, 0)
posterior_p1 = marginal(posterior_joint_ps, 1)
posterior_p0.plot(label='p0')
posterior_p1.plot(label='p1')
decorate(xlabel='Probability of finding a bug',
         ylabel='PDF',
         title='Posterior marginal distributions of p0 and p1')
posterior_p0.mean(), posterior_p0.credible_interval(0.9)
(0.22970406720316164, array([0.1, 0.4]))
posterior_p1.mean(), posterior_p1.credible_interval(0.9)
(0.17501024717938662, array([0.06, 0.32]))
joint_plot(posterior_joint_ps)
# the first entry, the number of bears never observed, is unknown;
# -1000 is a placeholder
data = [-1000, 63, 55, 18, 69, 17, 21, 28]
index = pd.MultiIndex.from_product([[0, 1]]*3)
Kijk = pd.Series(data, index)
Kijk
0  0  0   -1000
      1      63
   1  0      55
      1      18
1  0  0      69
      1      17
   1  0      21
      1      28
dtype: int64
Kijk.xs(1, level=0).sum()
135
Kijk.xs(1, level=1).sum()
122
Kijk.xs(1, level=2).sum()
126
n = pd.Series(1, range(3))
for level in n.index:
    n[level] = Kijk.xs(1, level=level).sum()
n
0    135
1    122
2    126
dtype: int64
Kixx = Kijk.sum(level=0)  # in recent pandas: Kijk.groupby(level=0).sum()
Kixx
0   -864
1    135
dtype: int64
Kijx = Kijk.sum(level=[0,1])
Kijx
0  0   -937
   1     73
1  0     86
   1     49
dtype: int64
def num_observed(K):
    return np.asarray(K)[1:].sum()
s0 = num_observed(Kixx)
s0
135
s1 = num_observed(Kijx)
s1
208
s2 = num_observed(Kijk)
s2
271
s = [s0, s1, s2]
s
[135, 208, 271]
k = pd.concat([Kixx, Kijx, Kijk])
k
0            -864
1             135
(0, 0)       -937
(0, 1)         73
(1, 0)         86
(1, 1)         49
(0, 0, 0)   -1000
(0, 0, 1)      63
(0, 1, 0)      55
(0, 1, 1)      18
(1, 0, 0)      69
(1, 0, 1)      17
(1, 1, 0)      21
(1, 1, 1)      28
dtype: int64
k[1]
135
k[0,1]
73
k[0,0,1]
63
k[0]
-864
As an example, suppose the actual population is 300. Then k0 = N - s[0] bears were not seen in the first session, k00 = N - s[1] were not seen in the first two sessions, and k000 = N - s[2] were never seen.
N = 300
k0 = N - s[0]
k0
165
k00 = N - s[1]
k00
92
k000 = N - s[2]
k000
29
Ns = np.arange(s2, 500, 5)
prior_N = Pmf(1, Ns)
prior_N.index.name = 'N'
ps = np.linspace(0, 1, 101)
prior_p = Pmf(1, ps)
prior_p.index.name = 'p'
from utils import make_joint
joint_prior = make_joint(prior_N, prior_p)
def likelihood_round1(ps, n, k, s, joint_prior):
    like = joint_prior.copy()
    for N in joint_prior:
        like[N] = binom.pmf(n[0], N, ps)
    return like
like1 = likelihood_round1(ps, n, k, s, joint_prior)
def likelihood_round2(ps, n, k, s, joint_prior):
    like = joint_prior.copy()
    for N in joint_prior:
        k0 = N - s[0]
        like[N] = (binom.pmf(k[0,1], k0, ps) *
                   binom.pmf(k[1,1], k[1], ps))
    return like
like2 = likelihood_round2(ps, n, k, s, joint_prior)
like2.to_numpy().sum()
0.40773670523278144
joint_posterior2 = joint_prior * like1 * like2
normalize(joint_posterior2)
plot_contour(joint_posterior2)
decorate(title='Joint posterior of N and p')
marginal_N = marginal(joint_posterior2, 0)
marginal_N.plot()
decorate(xlabel='N',
ylabel='PDF',
title='Posterior marginal distribution of N')
marginal_N.mean(), marginal_N.credible_interval(0.9)
(342.19942566476345, array([296., 396.]))
marginal_p = marginal(joint_posterior2, 1)
marginal_p.plot()
decorate(xlabel='p',
ylabel='PDF',
title='Posterior marginal distribution of p')
def likelihood_round3(ps, n, k, s, joint_prior):
    like = joint_prior.copy()
    for N in joint_prior:
        k00 = N - s[1]
        like[N] = (binom.pmf(k[0,0,1], k00, ps) *
                   binom.pmf(k[0,1,1], k[0,1], ps) *
                   binom.pmf(k[1,0,1], k[1,0], ps) *
                   binom.pmf(k[1,1,1], k[1,1], ps))
    return like
like3 = likelihood_round3(ps, n, k, s, joint_prior)
like3.to_numpy().sum()
1.7620042777092345e-07
joint_posterior3 = joint_posterior2 * like3
normalize(joint_posterior3)
plot_contour(joint_posterior3)
decorate(title='Joint posterior of N and p')
marginal_N = marginal(joint_posterior3, 0)
marginal_N.plot()
decorate(xlabel='N',
ylabel='PDF',
title='Posterior marginal distribution of N')
marginal_N.mean(), marginal_N.credible_interval(0.9)
(391.0058839098183, array([356., 431.]))
marginal_p = marginal(joint_posterior3, 1)
marginal_p.plot()
decorate(xlabel='p',
ylabel='PDF',
title='Posterior marginal distribution of p')
joint_plot(joint_posterior3)
# another three-round dataset; the first entry is again a placeholder
# for the unobserved count
data_sb = [-100, 60, 49, 4, 247, 112, 142, 12]
num_observed(data_sb)
626
def make_stats(data, num_rounds):
    index = pd.MultiIndex.from_product([[0, 1]]*num_rounds)
    K = pd.Series(data, index)

    n = pd.Series(0, range(num_rounds))
    for level in n.index:
        n[level] = K.xs(1, level=level).sum()

    t = [K.sum(level=list(range(i+1)))
         for i in range(num_rounds)]
    s = [num_observed(Kx) for Kx in t]
    k = pd.concat(t)
    return n, k, s
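Note that make_stats relies on Series.sum(level=...), which was removed in recent versions of pandas. Here's an equivalent sketch (my addition) using groupby, checked against the three-session bear data.

```python
import numpy as np
import pandas as pd

def make_stats_groupby(data, num_rounds):
    """Variant of make_stats that uses groupby instead of sum(level=...)."""
    index = pd.MultiIndex.from_product([[0, 1]] * num_rounds)
    K = pd.Series(data, index)

    # number of individuals observed in each round
    n = pd.Series([K.xs(1, level=i).sum() for i in range(num_rounds)])

    # partial sums over the first i+1 levels
    t = [K.groupby(level=list(range(i + 1))).sum()
         for i in range(num_rounds)]

    # cumulative number observed, skipping the unobserved placeholder
    s = [np.asarray(Kx)[1:].sum() for Kx in t]
    k = pd.concat(t)
    return n, k, s

n, k, s = make_stats_groupby([-1000, 63, 55, 18, 69, 17, 21, 28], 3)
print(list(n), s)  # [135, 122, 126] [135, 208, 271]
```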
n, k, s = make_stats(data_sb, 3)
n
0    513
1    207
2    188
dtype: int64
k
0             13
1            513
(0, 0)       -40
(0, 1)        53
(1, 0)       359
(1, 1)       154
(0, 0, 0)   -100
(0, 0, 1)     60
(0, 1, 0)     49
(0, 1, 1)      4
(1, 0, 0)    247
(1, 0, 1)    112
(1, 1, 0)    142
(1, 1, 1)     12
dtype: int64
s
[513, 566, 626]
Ns = np.arange(s[2], 1000, 5)
prior_N = Pmf(1, Ns)
prior_N.index.name = 'N'
prior_N.shape
(75,)
This time we'll use priors that restrict each observation probability to a plausible range.
probs0 = np.linspace(0.5, 1.0, 51)
prior_p0 = Pmf(1, probs0)
prior_p0.index.name = 'p0'
prior_p0.head()
p0
0.50    1
0.51    1
0.52    1
0.53    1
0.54    1
dtype: int64
probs1 = np.linspace(0.1, 0.5, 41)
prior_p1 = Pmf(1, probs1)
prior_p1.index.name = 'p1'
prior_p1.head()
p1
0.10    1
0.11    1
0.12    1
0.13    1
0.14    1
dtype: int64
probs2 = np.linspace(0.1, 0.4, 31)
prior_p2 = Pmf(1, probs2)
prior_p2.index.name = 'p2'
prior_p2.head()
p2
0.10    1
0.11    1
0.12    1
0.13    1
0.14    1
dtype: int64
def make_joint3(prior0, prior1, prior2):
    joint2 = make_joint(prior0, prior1)
    joint2_pmf = Pmf(joint2.transpose().stack())
    joint3 = make_joint(prior2, joint2_pmf)
    return joint3
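To see what make_joint3 builds, here's a pure-pandas sketch (my addition) that constructs the same shape directly: a MultiIndex over (p0, p1) crossed with the values of N.

```python
import numpy as np
import pandas as pd

p0s = np.linspace(0.5, 1.0, 51)
p1s = np.linspace(0.1, 0.5, 41)
Ns = np.arange(626, 1000, 5)

# rows are (p0, p1) pairs, columns are values of N
idx = pd.MultiIndex.from_product([p0s, p1s], names=['p0', 'p1'])
joint3 = pd.DataFrame(1, index=idx, columns=pd.Index(Ns, name='N'))
print(joint3.shape)  # (2091, 75)
```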
joint_prior = make_joint3(prior_p0, prior_p1, prior_N)
joint_prior.head()
[DataFrame of ones: 2091 rows indexed by (p0, p1), 75 columns indexed by N from 626 to 996; first 5 rows shown]
likelihood = joint_prior.copy()
Ns = joint_prior.columns

for (p0, p1) in joint_prior.index:
    like0 = binom.pmf(k[1], Ns, p0)
    k0 = Ns - s[0]
    like1 = binom.pmf(k[0,1], k0, p1) * binom.pmf(k[1,1], k[1], p1)
    likelihood.loc[p0, p1] = like0 * like1
likelihood.to_numpy().sum()
0.018731276129625645
from utils import normalize
joint_posterior = joint_prior * likelihood
normalize(joint_posterior)
joint_posterior.shape
(2091, 75)
from utils import marginal
posterior_N = marginal(joint_posterior, 0)
posterior_N.plot()
decorate(xlabel='N',
ylabel='PDF',
title='Posterior for N after two rounds')
posterior_N.mean(), posterior_N.credible_interval(0.9)
(692.2431131826501, array([656., 736.]))
posterior_pmf = marginal(joint_posterior, 1)
posterior_pmf.head()
p0   p1
0.5  0.10    2.947207e-37
     0.11    1.861223e-32
     0.12    1.953523e-28
     0.13    4.500420e-25
     0.14    2.829563e-22
dtype: float64
posterior_joint_ps = posterior_pmf.unstack().transpose()
posterior_joint_ps.head()
[DataFrame: rows indexed by p1, columns by p0 from 0.50 to 1.00; first 5 of 41 rows shown, 51 columns]
plot_contour(posterior_joint_ps)
decorate(title='Posterior joint distribution of p0 and p1')
posterior_p0 = marginal(posterior_joint_ps, 0)
posterior_p1 = marginal(posterior_joint_ps, 1)
posterior_p0.plot(label='p0')
posterior_p1.plot(label='p1')
decorate(xlabel='Probability of finding a bug',
         ylabel='PDF',
         title='Posterior marginal distributions of p0 and p1')
Next we combine the two-round posterior with the prior for the third parameter, p2.
joint3 = make_joint(prior_p2, posterior_pmf)
joint3.head()
[DataFrame: rows indexed by (p0, p1), 31 columns indexed by p2 from 0.10 to 0.40; each row is constant across p2; first 5 rows shown]
joint3_pmf = Pmf(joint3.stack())
joint3_pmf.head()
p0   p1   p2
0.5  0.1  0.10    2.947207e-37
          0.11    2.947207e-37
          0.12    2.947207e-37
          0.13    2.947207e-37
          0.14    2.947207e-37
dtype: float64
prior4 = make_joint(posterior_N, joint3_pmf)
prior4.head()
[DataFrame: rows indexed by (p0, p1, p2), 75 columns indexed by N from 626 to 996; first 5 of 64821 rows shown]
prior4.shape
(64821, 75)
joint2 = make_joint(posterior_N, prior_p2)
joint2.head()
[DataFrame: rows indexed by p2, 75 columns indexed by N; each row repeats the posterior of N; first 5 rows shown]
like2 = joint2.copy()
Ns = joint2.columns

for p2 in joint2.index:
    k00 = Ns - s[1]
    like = (binom.pmf(k[0,0,1], k00, p2) *
            binom.pmf(k[0,1,1], k[0,1], p2) *
            binom.pmf(k[1,0,1], k[1,0], p2) *
            binom.pmf(k[1,1,1], k[1,1], p2))
    like2.loc[p2] = like
like2.to_numpy().sum()
2.874211433485911e-13
like2.head()
[DataFrame of likelihoods: rows indexed by p2, 75 columns indexed by N; first 5 rows shown]
like4 = prior4.copy()

for (p0, p1, p2) in prior4.index:
    like4.loc[p0, p1, p2] = like2.loc[p2]
like4.to_numpy().sum()
6.009976107419042e-10
posterior4 = prior4 * like4
normalize(posterior4)
marginal_N = marginal(posterior4, 0)
marginal_N.plot()
decorate(xlabel='N',
ylabel='PDF',
title='Posterior for N after three rounds')
marginal_N.mean(), marginal_N.credible_interval(0.9)
(764.7788307540033, array([731., 801.]))
posterior_p012 = marginal(posterior4, 1)
posterior_p012.unstack().head()
[DataFrame: rows indexed by (p0, p1), 31 columns indexed by p2 from 0.10 to 0.40; first 5 rows shown]
posterior_p2 = marginal(posterior_p012.unstack(), 0)
posterior_p2.plot()
posterior_p01 = marginal(posterior_p012.unstack(), 1)
joint_plot(posterior_p01.unstack().transpose())
# a four-round dataset; the first entry is again a placeholder
# for the unobserved count
data4 = [-10000, 10, 182, 8, 74, 7, 20, 14, 709, 12, 650, 46, 104, 18, 157, 58]
num_observed(data4)
2069
n, k, s = make_stats(data4, 4)
n
0    1754
1     452
2    1135
3     173
dtype: int64
k
0              -9685
1               1754
(0, 0)         -9800
(0, 1)           115
(1, 0)          1417
(1, 1)           337
(0, 0, 0)      -9990
(0, 0, 1)        190
(0, 1, 0)         81
(0, 1, 1)         34
(1, 0, 0)        721
(1, 0, 1)        696
(1, 1, 0)        122
(1, 1, 1)        215
(0, 0, 0, 0)  -10000
(0, 0, 0, 1)      10
(0, 0, 1, 0)     182
(0, 0, 1, 1)       8
(0, 1, 0, 0)      74
(0, 1, 0, 1)       7
(0, 1, 1, 0)      20
(0, 1, 1, 1)      14
(1, 0, 0, 0)     709
(1, 0, 0, 1)      12
(1, 0, 1, 0)     650
(1, 0, 1, 1)      46
(1, 1, 0, 0)     104
(1, 1, 0, 1)      18
(1, 1, 1, 0)     157
(1, 1, 1, 1)      58
dtype: int64
s
[1754, 1869, 2059, 2069]
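As a closing consistency check (my addition): the cumulative totals in s should be nondecreasing, and the last should equal the number of distinct items observed in any round.

```python
import numpy as np

data4 = [-10000, 10, 182, 8, 74, 7, 20, 14,
         709, 12, 650, 46, 104, 18, 157, 58]
s = [1754, 1869, 2059, 2069]

# skip the placeholder in the first entry; every item observed at
# least once is counted exactly once in the remaining entries
total = np.asarray(data4)[1:].sum()
print(total)  # 2069
assert total == s[-1]
assert all(a <= b for a, b in zip(s, s[1:]))
```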