Copyright 2020 Allen B. Downey
License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
# If we're running on Colab, install libraries
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
!pip install pymc3
!pip install arviz
!pip install graphviz
In an excellent blog post, John D. Cook wrote about the Lincoln index, which is a way to estimate the number of errors in a document (or program) by comparing results from two independent testers.
Here's his presentation of the problem:
"Suppose you have a tester who finds 20 bugs in your program. You want to estimate how many bugs are really in the program. You know there are at least 20 bugs, and if you have supreme confidence in your tester, you may suppose there are around 20 bugs. But maybe your tester isn't very good. Maybe there are hundreds of bugs. How can you have any idea how many bugs there are? There's no way to know with one tester. But if you have two testers, you can get a good idea, even if you don't know how skilled the testers are."
Suppose the first tester finds 20 bugs, the second finds 15, and they find 3 in common; how can we estimate the number of bugs?
I'll use the following notation for the data:
For this model, I'll express the data in different notation:
k11 is the number of bugs found by both testers,
k10 is the number of bugs found by the first tester but not the second,
k01 is the number of bugs found by the second tester but not the first, and
k00 is the unknown number of undiscovered bugs.
Here are the values for all but k00
:
k10 = 20 - 3
k01 = 15 - 3
k11 = 3
num_seen = k01 + k10 + k11
num_seen
Here's the model:
import pymc3 as pm
with pm.Model() as model5:
p0 = pm.Beta('p0', alpha=1, beta=1)
p1 = pm.Beta('p1', alpha=1, beta=1)
N = pm.DiscreteUniform('N', num_seen, 350)
q0 = 1-p0
q1 = 1-p1
ps = [q0*q1, q0*p1, p0*q1, p0*p1]
k00 = N - num_seen
data = pm.math.stack((k00, k01, k10, k11))
y = pm.Multinomial('y', n=N, p=ps, observed=data)
with model5:
trace5 = pm.sample(500)
with model5:
pm.plot_posterior(trace5)
with model5:
pm.traceplot(trace5)