Second Edition
Copyright 2020 Allen B. Downey
License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
# If we're running on Colab, install empiricaldist
# https://pypi.org/project/empiricaldist/
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    !pip install empiricaldist
# Get utils.py and create directories
import os
if not os.path.exists('utils.py'):
    !wget https://github.com/AllenDowney/ThinkBayes2/raw/master/code/soln/utils.py

if not os.path.exists('figs'):
    !mkdir figs
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from empiricaldist import Pmf, Cdf
from utils import decorate, savefig
Three dimensions!
In 1996 and 1997, Mowat and Strobeck deployed bear traps at sites in British Columbia and Alberta in an effort to estimate the population of grizzly bears. They describe the experiment in "Estimating Population Size of Grizzly Bears Using Hair Capture, DNA Profiling, and Mark-Recapture Analysis".
Each "trap" consists of a lure and several strands of barbed wire intended to capture samples of hair from bears that visit it. From the hair samples, the researchers use DNA analysis to identify individual bears.
During the first session, on June 29, 1996, the researchers deployed traps at 76 sites. Returning 10 days later, they obtained 1043 hair samples and identified 23 different bears. During a second 10-day session they obtained 1191 samples from 19 different bears, where 4 of the 19 were from bears they had identified in the first batch.
To estimate the population of bears from this data, we need a model for the probability that each bear will be observed during each session. As a starting place, we'll make the simplest assumption, that every bear in the population has the same (unknown) probability of being sampled during each round.
With these assumptions we can compute the probability of the data for a range of possible populations.
As an example, let's suppose that the actual population of bears is 200.
After the first session, 23 of the 200 bears have been identified. During the second session, if we choose 19 bears at random, what is the probability that 4 of them were previously identified?
I'll define
N: actual (unknown) population size, 200.
K: number of bears identified in the first session, 23.
n: number of bears observed in the second session, 19 in the example.
k: the number of bears in the second session that had previously been identified, 4.
For given values of N, K, and n, the distribution of k is described by the hypergeometric distribution:
$PMF(k) = {K \choose k}{N-K \choose n-k}/{N \choose n}$
To understand why, consider:
The denominator, ${N \choose n}$, is the number of subsets of size $n$ we could choose from a population of $N$ bears.
The numerator is the number of subsets that contain $k$ bears from the previously identified $K$ and $n-k$ from the previously unseen $N-K$.
SciPy provides hypergeom, which we can use to compute this PMF.
from scipy.stats import hypergeom
N = 200
K = 23
n = 19
ks = np.arange(12)
ps = hypergeom(N, K, n).pmf(ks)
plt.bar(ks, ps)
decorate(xlabel='Number of bears observed twice',
ylabel='PMF',
title='Hypergeometric distribution of k (known population 200)')
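As a sanity check (my addition, not part of the text), the PMF formula above can be evaluated directly from binomial coefficients and compared with SciPy's implementation; the two should agree to floating-point precision.

```python
from scipy.special import comb
from scipy.stats import hypergeom

def hypergeom_pmf(k, N, K, n):
    """PMF(k) = C(K, k) * C(N-K, n-k) / C(N, n)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

N, K, n, k = 200, 23, 19, 4
print(hypergeom_pmf(k, N, K, n), hypergeom(N, K, n).pmf(k))
```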
So that's the distribution of k given N, K, and n.
Now let's go the other way: given K, n, and k, how can we estimate the total population, N?
As a starting place, let's suppose that, prior to this study, an expert in this domain would have estimated that the population is between 50 and 500, and equally likely to be any value in that range.
Ns = np.arange(50, 501)
prior_N = Pmf(1, Ns)
prior_N.index.name = 'N'
So that's our prior.
To compute the likelihood of the data, we can use hypergeom with constants K and n, and a range of values of N.
K = 23
n = 19
k = 4
likelihood = hypergeom(Ns, K, n).pmf(k)
We can compute the posterior in the usual way.
posterior_N = prior_N * likelihood
posterior_N.normalize()
34.97606148975133
And here's what it looks like.
posterior_N.plot()
decorate(xlabel='Population of bears (N)',
ylabel='PDF',
title='Posterior distribution of N')
The most likely value is 109.
posterior_N.max_prob()
109
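For comparison (my addition, not part of the original analysis), the classical Lincoln-Petersen estimator, $\hat N = Kn/k$, gives nearly the same answer as the MAP value.

```python
# Classical Lincoln-Petersen estimate of the population: N_hat = K * n / k
K, n, k = 23, 19, 4
N_hat = K * n / k
print(N_hat)  # 109.25
```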
But the distribution is skewed to the right, so the posterior mean is substantially higher.
posterior_N.mean()
173.79880627085643
posterior_N.credible_interval(0.9)
array([ 77., 363.])
Now let's make a model with two parameters, N and p, where p is the probability of observing each bear during each session. Here's a uniform prior for p.
ps = np.linspace(0, 1, 101)
prior_p = Pmf(1, ps)
prior_p.index.name = 'p'
from utils import make_joint
joint_prior = make_joint(prior_N, prior_p)
N_mesh, p_mesh = np.meshgrid(Ns, ps)
N_mesh.shape
(101, 451)
The likelihood of the first session is the binomial probability of observing K of the N bears, each with probability p.
from scipy.stats import binom
like1 = binom.pmf(K, N_mesh, p_mesh)
like1.sum()
229.56130903199954
For the second session, we observe k of the K previously identified bears and n-k of the N-K bears not seen before.
like2 = binom.pmf(k, K, p_mesh) * binom.pmf(n-k, N_mesh-K, p_mesh)
like2.sum()
24.771767481921685
from utils import normalize
joint_posterior = joint_prior * like1 * like2
normalize(joint_posterior)
def plot_contour(joint, **options):
    """Plot a joint distribution.

    joint: DataFrame representing a joint PMF
    """
    cs = plt.contour(joint.columns, joint.index, joint, **options)
    decorate(xlabel=joint.columns.name,
             ylabel=joint.index.name)
    return cs
plot_contour(joint_posterior)
decorate(title='Joint posterior distribution of N and p')
from utils import marginal
marginal_N = marginal(joint_posterior, 0)
posterior_N.plot(color='gray')
marginal_N.plot()
decorate(xlabel='Population of bears (N)',
ylabel='PDF',
title='Posterior marginal distribution of N')
marginal_N.mean(), marginal_N.credible_interval(0.9)
(138.7505213647261, array([ 68., 277.]))
marginal_p = marginal(joint_posterior, 1)
marginal_p.plot()
decorate(xlabel='Probability of observing a bear',
ylabel='PDF',
title='Posterior marginal distribution of p')
from seaborn import JointGrid
def joint_plot(joint, **options):
    x = joint.columns.name
    x = 'x' if x is None else x
    y = joint.index.name
    y = 'y' if y is None else y

    # make a JointGrid with minimal data
    data = pd.DataFrame({x: [0], y: [0]})
    g = JointGrid(x=x, y=y, data=data, **options)

    # replace the contour plot
    g.ax_joint.contour(joint.columns,
                       joint.index,
                       joint,
                       cmap='viridis')

    # replace the marginals
    marginal_x = marginal(joint, 0)
    g.ax_marg_x.plot(marginal_x.qs, marginal_x.ps)
    marginal_y = marginal(joint, 1)
    g.ax_marg_y.plot(marginal_y.ps, marginal_y.qs)
joint_plot(joint_posterior)
To compare candidate values of N, we can check how well each one fits the data. For a given N, the observation probability p is roughly the average number of bears observed per session divided by N.
mean = (23 + 19) / 2
N1 = 138
p = mean/N1
p
0.15217391304347827
from scipy.stats import binom
binom(N1, p).std()
4.219519857292647
binom(N1, p).pmf([23, 19]).prod()
0.007113233315460428
N2 = 173
p = mean/N2
p
0.12138728323699421
binom(N2, p).std()
4.2954472470306415
binom(N2, p).pmf([23, 19]).prod()
0.006918408214247037
The data are slightly more likely with N=138 than with N=173.
A few years ago my occasional correspondent John D. Cook wrote an excellent blog post about the Lincoln index, which is a way to estimate the number of errors in a document (or program) by comparing results from two independent testers.
http://www.johndcook.com/blog/2010/07/13/lincoln-index/
Here's his presentation of the problem:
"Suppose you have a tester who finds 20 bugs in your program. You want to estimate how many bugs are really in the program. You know there are at least 20 bugs, and if you have supreme confidence in your tester, you may suppose there are around 20 bugs. But maybe your tester isn't very good. Maybe there are hundreds of bugs. How can you have any idea how many bugs there are? There's no way to know with one tester. But if you have two testers, you can get a good idea, even if you don't know how skilled the testers are."
Suppose the first tester finds 20 bugs, the second finds 15, and they find 3 in common; how can we estimate the number of bugs?
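Before the Bayesian analysis, here's the classical Lincoln index estimate (my addition): multiply the two bug counts and divide by the overlap.

```python
# Lincoln index estimate: n0 * n1 / k11,
# where k11 is the number of bugs found by both testers
n0, n1, k11 = 20, 15, 3
estimate = n0 * n1 / k11
print(estimate)  # 100.0
```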
n0 = 20
n1 = 15
k11 = 3
k10 = n0 - k11
k01 = n1 - k11
k10, k01
(17, 12)
# there are at least n0 + n1 - k11 = 32 distinct bugs
Ns = np.arange(32, 350)
prior_N = Pmf(1, Ns)
prior_N.index.name = 'N'
First, suppose we know the probability that each tester finds a bug.
p0, p1 = 0.2, 0.15
like0 = binom.pmf(n0, Ns, p0)
like0.sum()
4.9999998675182855
like1 = binom.pmf(k01, Ns-n0, p1) * binom.pmf(k11, n0, p1)
like1.sum()
1.6188593076346907
likelihood = like0 * like1
likelihood.shape
(318,)
prior_N.shape
(318,)
posterior_N = prior_N * likelihood
posterior_N.normalize()
0.10960645032182464
posterior_N.plot()
decorate(xlabel='N',
         ylabel='PMF',
         title='Posterior distribution of N with known p0, p1')
posterior_N.mean()
102.12500000000003
Now suppose we don't know the testers' abilities. We'll use uniform priors for p0 and p1.
p0 = np.linspace(0, 1, 61)
prior_p0 = Pmf(1, p0)
prior_p0.index.name = 'p0'
p1 = np.linspace(0, 1, 51)
prior_p1 = Pmf(1, p1)
prior_p1.index.name = 'p1'
from utils import make_joint
joint = make_joint(prior_p0, prior_p1)
joint.shape
(51, 61)
joint_pmf = Pmf(joint.transpose().stack())
joint_pmf.head()
p0   p1
0.0  0.00    1
     0.02    1
     0.04    1
     0.06    1
     0.08    1
dtype: int64
joint_prior = make_joint(prior_N, joint_pmf)
joint_prior.shape
(3111, 318)
joint_prior.head()
[DataFrame of ones: 3111 rows indexed by (p0, p1), 318 columns indexed by N from 32 to 349; first 5 rows shown]
likelihood = joint_prior.copy()
Ns = joint_prior.columns

for (p0, p1) in joint_prior.index:
    like0 = binom.pmf(n0, Ns, p0)
    like1 = binom.pmf(k01, Ns-n0, p1) * binom.pmf(k11, n0, p1)
    likelihood.loc[p0, p1] = like0 * like1
likelihood.to_numpy().sum()
8.930285561728951
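A note on performance (my addition): the double loop over (p0, p1) can be replaced with NumPy broadcasting. This sketch rebuilds the same likelihood as a 3-D array; the axis order (p0, p1, N) is my choice.

```python
import numpy as np
from scipy.stats import binom

n0, k01, k11 = 20, 12, 3
Ns = np.arange(32, 350)
p0s = np.linspace(0, 1, 61)
p1s = np.linspace(0, 1, 51)

# align the three axes: (p0, p1, N)
P0 = p0s[:, None, None]
P1 = p1s[None, :, None]
N = Ns[None, None, :]

like = (binom.pmf(n0, N, P0) *
        binom.pmf(k01, N - n0, P1) *
        binom.pmf(k11, n0, P1))
print(like.shape)  # (61, 51, 318)
print(like.sum())  # should match the sum of the loop-based likelihood
```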
from utils import normalize
joint_posterior = joint_prior * likelihood
normalize(joint_posterior)
joint_posterior.shape
(3111, 318)
from utils import marginal
posterior_N = marginal(joint_posterior, 0)
posterior_N.plot()
posterior_pmf = marginal(joint_posterior, 1)
posterior_pmf.shape
(3111,)
posterior_joint_ps = posterior_pmf.unstack().transpose()
posterior_joint_ps.head()
[DataFrame: rows indexed by p1, columns by p0 from 0.0 to 1.0; first 5 of 51 rows shown, 61 columns]
plot_contour(posterior_joint_ps)
decorate(title='Posterior joint distribution of p0 and p1')
posterior_p0 = marginal(posterior_joint_ps, 0)
posterior_p1 = marginal(posterior_joint_ps, 1)
posterior_p0.plot(label='p0')
posterior_p1.plot(label='p1')
decorate(xlabel='Probability of finding a bug',
         ylabel='PDF',
         title='Posterior marginal distributions of p0 and p1')
posterior_p0.mean(), posterior_p0.credible_interval(0.9)
(0.22970406720316164, array([0.1, 0.4]))
posterior_p1.mean(), posterior_p1.credible_interval(0.9)
(0.17501024717938662, array([0.06, 0.32]))
joint_plot(posterior_joint_ps)
# the first entry, the number of bears never observed, is unknown;
# -1000 is a placeholder
data = [-1000, 63, 55, 18, 69, 17, 21, 28]
index = pd.MultiIndex.from_product([[0, 1]]*3)
Kijk = pd.Series(data, index)
Kijk
0  0  0   -1000
      1      63
   1  0      55
      1      18
1  0  0      69
      1      17
   1  0      21
      1      28
dtype: int64
Kijk.xs(1, level=0).sum()
135
Kijk.xs(1, level=1).sum()
122
Kijk.xs(1, level=2).sum()
126
n = pd.Series(1, range(3))
for level in n.index:
    n[level] = Kijk.xs(1, level=level).sum()
n
0    135
1    122
2    126
dtype: int64
Kixx = Kijk.sum(level=0)  # in recent pandas: Kijk.groupby(level=0).sum()
Kixx
0   -864
1    135
dtype: int64
Kijx = Kijk.sum(level=[0,1])
Kijx
0  0   -937
   1     73
1  0     86
   1     49
dtype: int64
def num_observed(K):
    return np.asarray(K)[1:].sum()
s0 = num_observed(Kixx)
s0
135
s1 = num_observed(Kijx)
s1
208
s2 = num_observed(Kijk)
s2
271
s = [s0, s1, s2]
s
[135, 208, 271]
k = pd.concat([Kixx, Kijx, Kijk])
k
0            -864
1             135
(0, 0)       -937
(0, 1)         73
(1, 0)         86
(1, 1)         49
(0, 0, 0)   -1000
(0, 0, 1)      63
(0, 1, 0)      55
(0, 1, 1)      18
(1, 0, 0)      69
(1, 0, 1)      17
(1, 1, 0)      21
(1, 1, 1)      28
dtype: int64
k[1]
135
k[0,1]
73
k[0,0,1]
63
k[0]
-864
As an example, suppose the actual population is 300. Then k0 = N - s[0] bears were not seen in the first session, k00 = N - s[1] were not seen in the first two sessions, and k000 = N - s[2] were never seen.
N = 300
k0 = N - s[0]
k0
165
k00 = N - s[1]
k00
92
k000 = N - s[2]
k000
29
Ns = np.arange(s2, 500, 5)
prior_N = Pmf(1, Ns)
prior_N.index.name = 'N'
ps = np.linspace(0, 1, 101)
prior_p = Pmf(1, ps)
prior_p.index.name = 'p'
from utils import make_joint
joint_prior = make_joint(prior_N, prior_p)
def likelihood_round1(ps, n, k, s, joint_prior):
    like = joint_prior.copy()
    for N in joint_prior:
        like[N] = binom.pmf(n[0], N, ps)
    return like
like1 = likelihood_round1(ps, n, k, s, joint_prior)
def likelihood_round2(ps, n, k, s, joint_prior):
    like = joint_prior.copy()
    for N in joint_prior:
        k0 = N - s[0]
        like[N] = (binom.pmf(k[0,1], k0, ps) *
                   binom.pmf(k[1,1], k[1], ps))
    return like
like2 = likelihood_round2(ps, n, k, s, joint_prior)
like2.to_numpy().sum()
0.40773670523278144
joint_posterior2 = joint_prior * like1 * like2
normalize(joint_posterior2)
plot_contour(joint_posterior2)
decorate(title='Joint posterior of N and p')
marginal_N = marginal(joint_posterior2, 0)
marginal_N.plot()
decorate(xlabel='N',
ylabel='PDF',
title='Posterior marginal distribution of N')
marginal_N.mean(), marginal_N.credible_interval(0.9)
(342.19942566476345, array([296., 396.]))
marginal_p = marginal(joint_posterior2, 1)
marginal_p.plot()
decorate(xlabel='p',
ylabel='PDF',
title='Posterior marginal distribution of p')
def likelihood_round3(ps, n, k, s, joint_prior):
    like = joint_prior.copy()
    for N in joint_prior:
        k00 = N - s[1]
        like[N] = (binom.pmf(k[0,0,1], k00, ps) *
                   binom.pmf(k[0,1,1], k[0,1], ps) *
                   binom.pmf(k[1,0,1], k[1,0], ps) *
                   binom.pmf(k[1,1,1], k[1,1], ps))
    return like
like3 = likelihood_round3(ps, n, k, s, joint_prior)
like3.to_numpy().sum()
1.7620042777092345e-07
joint_posterior3 = joint_posterior2 * like3
normalize(joint_posterior3)
plot_contour(joint_posterior3)
decorate(title='Joint posterior of N and p')
marginal_N = marginal(joint_posterior3, 0)
marginal_N.plot()
decorate(xlabel='N',
ylabel='PDF',
title='Posterior marginal distribution of N')
marginal_N.mean(), marginal_N.credible_interval(0.9)
(391.0058839098183, array([356., 431.]))
marginal_p = marginal(joint_posterior3, 1)
marginal_p.plot()
decorate(xlabel='p',
ylabel='PDF',
title='Posterior marginal distribution of p')
joint_plot(joint_posterior3)
# another three-round dataset; the first entry is again a placeholder
# for the unobserved count
data_sb = [-100, 60, 49, 4, 247, 112, 142, 12]
num_observed(data_sb)
626
def make_stats(data, num_rounds):
    index = pd.MultiIndex.from_product([[0, 1]]*num_rounds)
    K = pd.Series(data, index)

    n = pd.Series(0, range(num_rounds))
    for level in n.index:
        n[level] = K.xs(1, level=level).sum()

    t = [K.sum(level=list(range(i+1)))
         for i in range(num_rounds)]
    s = [num_observed(Kx) for Kx in t]
    k = pd.concat(t)
    return n, k, s
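Note that make_stats relies on Series.sum(level=...), which was removed in recent versions of pandas. Here's an equivalent sketch (my addition) using groupby, checked against the three-session bear data.

```python
import numpy as np
import pandas as pd

def make_stats_groupby(data, num_rounds):
    """Variant of make_stats that uses groupby instead of sum(level=...)."""
    index = pd.MultiIndex.from_product([[0, 1]] * num_rounds)
    K = pd.Series(data, index)

    # number of individuals observed in each round
    n = pd.Series([K.xs(1, level=i).sum() for i in range(num_rounds)])

    # partial sums over the first i+1 levels
    t = [K.groupby(level=list(range(i + 1))).sum()
         for i in range(num_rounds)]

    # cumulative number observed, skipping the unobserved placeholder
    s = [np.asarray(Kx)[1:].sum() for Kx in t]
    k = pd.concat(t)
    return n, k, s

n, k, s = make_stats_groupby([-1000, 63, 55, 18, 69, 17, 21, 28], 3)
print(list(n), s)  # [135, 122, 126] [135, 208, 271]
```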
n, k, s = make_stats(data_sb, 3)
n
0    513
1    207
2    188
dtype: int64
k
0             13
1            513
(0, 0)       -40
(0, 1)        53
(1, 0)       359
(1, 1)       154
(0, 0, 0)   -100
(0, 0, 1)     60
(0, 1, 0)     49
(0, 1, 1)      4
(1, 0, 0)    247
(1, 0, 1)    112
(1, 1, 0)    142
(1, 1, 1)     12
dtype: int64
s
[513, 566, 626]
Ns = np.arange(s[2], 1000, 5)
prior_N = Pmf(1, Ns)
prior_N.index.name = 'N'
prior_N.shape
(75,)
This time we'll use priors that restrict each observation probability to a plausible range.
probs0 = np.linspace(0.5, 1.0, 51)
prior_p0 = Pmf(1, probs0)
prior_p0.index.name = 'p0'
prior_p0.head()
p0
0.50    1
0.51    1
0.52    1
0.53    1
0.54    1
dtype: int64
probs1 = np.linspace(0.1, 0.5, 41)
prior_p1 = Pmf(1, probs1)
prior_p1.index.name = 'p1'
prior_p1.head()
p1
0.10    1
0.11    1
0.12    1
0.13    1
0.14    1
dtype: int64
probs2 = np.linspace(0.1, 0.4, 31)
prior_p2 = Pmf(1, probs2)
prior_p2.index.name = 'p2'
prior_p2.head()
p2
0.10    1
0.11    1
0.12    1
0.13    1
0.14    1
dtype: int64
def make_joint3(prior0, prior1, prior2):
    joint2 = make_joint(prior0, prior1)
    joint2_pmf = Pmf(joint2.transpose().stack())
    joint3 = make_joint(prior2, joint2_pmf)
    return joint3
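To see what make_joint3 builds, here's a pure-pandas sketch (my addition) that constructs the same shape directly: a MultiIndex over (p0, p1) crossed with the values of N.

```python
import numpy as np
import pandas as pd

p0s = np.linspace(0.5, 1.0, 51)
p1s = np.linspace(0.1, 0.5, 41)
Ns = np.arange(626, 1000, 5)

# rows are (p0, p1) pairs, columns are values of N
idx = pd.MultiIndex.from_product([p0s, p1s], names=['p0', 'p1'])
joint3 = pd.DataFrame(1, index=idx, columns=pd.Index(Ns, name='N'))
print(joint3.shape)  # (2091, 75)
```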
joint_prior = make_joint3(prior_p0, prior_p1, prior_N)
joint_prior.head()
[DataFrame of ones: 2091 rows indexed by (p0, p1), 75 columns indexed by N from 626 to 996; first 5 rows shown]
likelihood = joint_prior.copy()
Ns = joint_prior.columns

for (p0, p1) in joint_prior.index:
    like0 = binom.pmf(k[1], Ns, p0)
    k0 = Ns - s[0]
    like1 = binom.pmf(k[0,1], k0, p1) * binom.pmf(k[1,1], k[1], p1)
    likelihood.loc[p0, p1] = like0 * like1
likelihood.to_numpy().sum()
0.018731276129625645
from utils import normalize
joint_posterior = joint_prior * likelihood
normalize(joint_posterior)
joint_posterior.shape
(2091, 75)
from utils import marginal
posterior_N = marginal(joint_posterior, 0)
posterior_N.plot()
decorate(xlabel='N',
ylabel='PDF',
title='Posterior for N after two rounds')
posterior_N.mean(), posterior_N.credible_interval(0.9)
(692.2431131826501, array([656., 736.]))
posterior_pmf = marginal(joint_posterior, 1)
posterior_pmf.head()
p0   p1
0.5  0.10    2.947207e-37
     0.11    1.861223e-32
     0.12    1.953523e-28
     0.13    4.500420e-25
     0.14    2.829563e-22
dtype: float64
posterior_joint_ps = posterior_pmf.unstack().transpose()
posterior_joint_ps.head()
[DataFrame: rows indexed by p1, columns by p0 from 0.50 to 1.00; first 5 of 41 rows shown, 51 columns]
plot_contour(posterior_joint_ps)
decorate(title='Posterior joint distribution of p0 and p1')
posterior_p0 = marginal(posterior_joint_ps, 0)
posterior_p1 = marginal(posterior_joint_ps, 1)
posterior_p0.plot(label='p0')
posterior_p1.plot(label='p1')
decorate(xlabel='Probability of finding a bug',
         ylabel='PDF',
         title='Posterior marginal distributions of p0 and p1')
Next we combine the two-round posterior with the prior for the third parameter, p2.
joint3 = make_joint(prior_p2, posterior_pmf)
joint3.head()
[DataFrame: rows indexed by (p0, p1), 31 columns indexed by p2 from 0.10 to 0.40; each row is constant across p2; first 5 rows shown]
joint3_pmf = Pmf(joint3.stack())
joint3_pmf.head()
p0   p1   p2
0.5  0.1  0.10    2.947207e-37
          0.11    2.947207e-37
          0.12    2.947207e-37
          0.13    2.947207e-37
          0.14    2.947207e-37
dtype: float64
prior4 = make_joint(posterior_N, joint3_pmf)
prior4.head()
[DataFrame: rows indexed by (p0, p1, p2), 75 columns indexed by N from 626 to 996; first 5 of 64821 rows shown]
prior4.shape
(64821, 75)
joint2 = make_joint(posterior_N, prior_p2)
joint2.head()
[DataFrame: rows indexed by p2, 75 columns indexed by N; each row repeats the posterior of N; first 5 rows shown]
like2 = joint2.copy()
Ns = joint2.columns

for p2 in joint2.index:
    k00 = Ns - s[1]
    like = (binom.pmf(k[0,0,1], k00, p2) *
            binom.pmf(k[0,1,1], k[0,1], p2) *
            binom.pmf(k[1,0,1], k[1,0], p2) *
            binom.pmf(k[1,1,1], k[1,1], p2))
    like2.loc[p2] = like
like2.to_numpy().sum()
2.874211433485911e-13
like2.head()
[DataFrame of likelihoods: rows indexed by p2, 75 columns indexed by N; first 5 rows shown]
like4 = prior4.copy()

for (p0, p1, p2) in prior4.index:
    like4.loc[p0, p1, p2] = like2.loc[p2]
like4.to_numpy().sum()
6.009976107419042e-10
posterior4 = prior4 * like4
normalize(posterior4)
marginal_N = marginal(posterior4, 0)
marginal_N.plot()
decorate(xlabel='N',
ylabel='PDF',
title='Posterior for N after three rounds')
marginal_N.mean(), marginal_N.credible_interval(0.9)
(764.7788307540033, array([731., 801.]))
posterior_p012 = marginal(posterior4, 1)
posterior_p012.unstack().head()
[DataFrame: rows indexed by (p0, p1), 31 columns indexed by p2 from 0.10 to 0.40; first 5 rows shown]
posterior_p2 = marginal(posterior_p012.unstack(), 0)
posterior_p2.plot()
posterior_p01 = marginal(posterior_p012.unstack(), 1)
joint_plot(posterior_p01.unstack().transpose())
# a four-round dataset; the first entry is again a placeholder
# for the unobserved count
data4 = [-10000, 10, 182, 8, 74, 7, 20, 14, 709, 12, 650, 46, 104, 18, 157, 58]
num_observed(data4)
2069
n, k, s = make_stats(data4, 4)
n
0    1754
1     452
2    1135
3     173
dtype: int64
k
0              -9685
1               1754
(0, 0)         -9800
(0, 1)           115
(1, 0)          1417
(1, 1)           337
(0, 0, 0)      -9990
(0, 0, 1)        190
(0, 1, 0)         81
(0, 1, 1)         34
(1, 0, 0)        721
(1, 0, 1)        696
(1, 1, 0)        122
(1, 1, 1)        215
(0, 0, 0, 0)  -10000
(0, 0, 0, 1)      10
(0, 0, 1, 0)     182
(0, 0, 1, 1)       8
(0, 1, 0, 0)      74
(0, 1, 0, 1)       7
(0, 1, 1, 0)      20
(0, 1, 1, 1)      14
(1, 0, 0, 0)     709
(1, 0, 0, 1)      12
(1, 0, 1, 0)     650
(1, 0, 1, 1)      46
(1, 1, 0, 0)     104
(1, 1, 0, 1)      18
(1, 1, 1, 0)     157
(1, 1, 1, 1)      58
dtype: int64
s
[1754, 1869, 2059, 2069]
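As a closing consistency check (my addition): the cumulative totals in s should be nondecreasing, and the last should equal the number of distinct items observed in any round.

```python
import numpy as np

data4 = [-10000, 10, 182, 8, 74, 7, 20, 14,
         709, 12, 650, 46, 104, 18, 157, 58]
s = [1754, 1869, 2059, 2069]

# skip the placeholder in the first entry; every item observed at
# least once is counted exactly once in the remaining entries
total = np.asarray(data4)[1:].sum()
print(total)  # 2069
assert total == s[-1]
assert all(a <= b for a, b in zip(s, s[1:]))
```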