Exploration of a problem interpreting binary test results
Copyright 2015 Allen Downey. MIT License.
from __future__ import print_function, division
import thinkbayes2
from sympy import symbols
p is the prevalence of a condition.
s is the sensitivity of the test.
The false positive rate is known to be either t1 (with probability q) or t2 (with probability 1-q).
p, q, s, t1, t2 = symbols('p q s t1 t2')
I'll use a through h for each of the 8 possible conditions.
a, b, c, d, e, f, g, h = symbols('a b c d e f g h')
And here are the probabilities of the conditions.
a = q * p * s
b = q * p * (1-s)
c = q * (1-p) * t1
d = q * (1-p) * (1-t1)
e = (1-q) * p * s
f = (1-q) * p * (1-s)
g = (1-q) * (1-p) * t2
h = (1-q) * (1-p) * (1-t2)
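As a sanity check (my addition, not part of the original analysis), the eight probabilities should cover all possible outcomes, so they should sum to 1:
total = a + b + c + d + e + f + g + h
total.simplify()
1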
pmf1 = thinkbayes2.Pmf()
pmf1['sick'] = p*s
pmf1['notsick'] = (1-p)*t1
pmf1
Pmf({'notsick': t1*(-p + 1), 'sick': p*s})
nc1 = pmf1.Normalize()
nc1.simplify()
p*s - t1*(p - 1)
pmf2 = thinkbayes2.Pmf()
pmf2['sick'] = p*s
pmf2['notsick'] = (1-p)*t2
pmf2
Pmf({'notsick': t2*(-p + 1), 'sick': p*s})
nc2 = pmf2.Normalize()
nc2.simplify()
p*s - t2*(p - 1)
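Note that the normalizing constants nc1 and nc2 are the total probabilities of a positive test under t1 and t2. As a quick numeric check (my addition; the values match d1 below):
vals = dict(p=0.1, s=0.5, t1=0.2, t2=0.8)
nc1.evalf(subs=vals), nc2.evalf(subs=vals)
(0.230000000000000, 0.770000000000000)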
pmf_t = thinkbayes2.Pmf({t1:q, t2:1-q})
pmf_t[t1] *= nc1
pmf_t[t2] *= nc2
pmf_t.Normalize()
1.0*q*(p*s + t1*(-p + 1)) + (-1.0*q + 1.0)*(p*s + t2*(-p + 1))
pmf_t.Mean().simplify()
1.0*(q*t1*(p*s - t1*(p - 1)) - t2*(q - 1)*(p*s - t2*(p - 1)))/(q*(p*s - t1*(p - 1)) - (q - 1)*(p*s - t2*(p - 1)))
d1 = dict(q=0.5, p=0.1, s=0.5, t1=0.2, t2=0.8)
pmf_t.Mean().evalf(subs=d1)
0.662000000000000
d2 = dict(q=0.75, p=0.1, s=0.5, t1=0.4, t2=0.8)
pmf_t.Mean().evalf(subs=d2)
0.554000000000000
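For comparison (my addition), the prior mean of t is q*t1 + (1-q)*t2, which is 0.5 under both parameter sets. So a single positive test shifts the expected false positive rate upward in both cases:
prior_t = q*t1 + (1-q)*t2
prior_t.evalf(subs=d1), prior_t.evalf(subs=d2)
(0.500000000000000, 0.500000000000000)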
pmf_t[t1].evalf(subs=d2)
0.615000000000000
x = pmf_t[t1] * pmf1['sick'] + pmf_t[t2] * pmf2['sick']
x.simplify()
1.0*p*s/(q*(p*s - t1*(p - 1)) - (q - 1)*(p*s - t2*(p - 1)))
x.evalf(subs=d1)
0.100000000000000
x.evalf(subs=d2)
0.100000000000000
t = q * t1 + (1-q) * t2
pmf = thinkbayes2.Pmf()
pmf['sick'] = p*s
pmf['notsick'] = (1-p)*t
pmf
Pmf({'notsick': (-p + 1)*(q*t1 + t2*(-q + 1)), 'sick': p*s})
pmf.Normalize()
p*s + (-p + 1)*(q*t1 + t2*(-q + 1))
pmf['sick'].simplify()
1.0*p*s/(p*s - (p - 1)*(q*t1 - t2*(q - 1)))
pmf['sick'].evalf(subs=d1)
0.100000000000000
pmf['sick'].evalf(subs=d2)
0.100000000000000
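The mixture result x and the collapsed result pmf['sick'] agree numerically. As a sketch (my addition), we can check that they agree symbolically too; the difference should simplify to zero (possibly a float zero, since Normalize introduces float coefficients):
(x - pmf['sick']).simplify()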
gold = thinkbayes2.Pmf()
gold['0 sick t1'] = q * (1-p)**2 * t1**2
gold['1 sick t1'] = q * 2*p*(1-p) * s * t1
gold['2 sick t1'] = q * p**2 * s**2
gold['0 sick t2'] = (1-q) * (1-p)**2 * t2**2
gold['1 sick t2'] = (1-q) * 2*p*(1-p) * s * t2
gold['2 sick t2'] = (1-q) * p**2 * s**2
gold.Normalize()
p**2*q*s**2 + p**2*s**2*(-q + 1) + 2*p*q*s*t1*(-p + 1) + p*s*t2*(-p + 1)*(-2*q + 2) + q*t1**2*(-p + 1)**2 + t2**2*(-p + 1)**2*(-q + 1)
p0 = gold['0 sick t1'] + gold['0 sick t2']
p0.evalf(subs=d1)
0.852895633323010
p0.evalf(subs=d2)
0.826831935836675
t = q * t1 + (1-q) * t2
collapsed = thinkbayes2.Pmf()
collapsed['0 sick'] = (1-p)**2 * t**2
collapsed['1 sick'] = 2*p*(1-p) * s * t
collapsed['2 sick'] = p**2 * s**2
collapsed.Normalize()
p**2*s**2 + 2*p*s*(-p + 1)*(q*t1 + t2*(-q + 1)) + (-p + 1)**2*(q*t1 + t2*(-q + 1))**2
collapsed['0 sick'].evalf(subs=d1)
0.810000000000000
collapsed['0 sick'].evalf(subs=d2)
0.810000000000000
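Unlike the single-test case, the collapsed model does not match the gold standard: it yields 0.81 for both parameter sets, versus about 0.853 and 0.827 above. Here's the size of the gap under d1 (my addition):
(collapsed['0 sick'] - p0).evalf(subs=d1)  # about -0.043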
pmf1 = thinkbayes2.Pmf()
pmf1['0 sick'] = (1-p)**2 * t1**2
pmf1['1 sick'] = 2*p*(1-p) * s * t1
pmf1['2 sick'] = p**2 * s**2
nc1 = pmf1.Normalize()
pmf2 = thinkbayes2.Pmf()
pmf2['0 sick'] = (1-p)**2 * t2**2
pmf2['1 sick'] = 2*p*(1-p) * s * t2
pmf2['2 sick'] = p**2 * s**2
nc2 = pmf2.Normalize()
pmf_t = thinkbayes2.Pmf({t1:q, t2:1-q})
pmf_t[t1] *= nc1
pmf_t[t2] *= nc2
pmf_t.Normalize()
1.0*q*(p**2*s**2 + 2*p*s*t1*(-p + 1) + t1**2*(-p + 1)**2) + (-1.0*q + 1.0)*(p**2*s**2 + 2*p*s*t2*(-p + 1) + t2**2*(-p + 1)**2)
x = pmf_t[t1] * pmf1['0 sick'] + pmf_t[t2] * pmf2['0 sick']
x.simplify()
1.0*(p - 1)**2*(q*t1**2 + t2**2*(-q + 1))/(q*(p**2*s**2 - 2*p*s*t1*(p - 1) + t1**2*(p - 1)**2) - (q - 1)*(p**2*s**2 - 2*p*s*t2*(p - 1) + t2**2*(p - 1)**2))
x.evalf(subs=d1), p0.evalf(subs=d1)
(0.852895633323010, 0.852895633323010)
x.evalf(subs=d2), p0.evalf(subs=d2)
(0.826831935836675, 0.826831935836675)
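As in the single-test case, we can confirm the agreement symbolically (my addition); the difference between the mixture result and the gold standard should simplify to zero:
(x - p0).simplify()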
pmf_t represents the distribution of t.
pmf_t = thinkbayes2.Pmf({t1:q, t2:1-q})
pmf_t.Mean().simplify()
1.0*q*t1 - 1.0*t2*(q - 1)
I'll consider two sets of parameters, d1 and d2, which have the same mean value of t.
d1 = dict(q=0.5, p=0.1, s=0.5, t1=0.2, t2=0.8)
pmf_t.Mean().evalf(subs=d1)
0.500000000000000
d2 = dict(q=0.75, p=0.1, s=0.5, t1=0.4, t2=0.8)
pmf_t.Mean().evalf(subs=d2)
0.500000000000000
prob takes two numbers that represent odds in favor and returns the corresponding probability.
def prob(yes, no):
    # convert odds in favor (yes:no) to a probability
    return yes / (yes + no)
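For example, odds of 1:3 in favor correspond to probability 0.25:
prob(1, 3)
0.25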
Scenario A
In the first scenario, there are two kinds of people in the world, or two kinds of tests, so there are four outcomes that yield positive tests: two true positives (a and e) and two false positives (c and g).
We can compute the probability of a true positive given a positive test:
res = prob(a+e, c+g)
res.simplify()
p*s/(-p*q*t1 + p*q*t2 + p*s - p*t2 + q*t1 - q*t2 + t2)
In this scenario, the two parameter sets yield the same answer:
res.evalf(subs=d1)
0.100000000000000
res.evalf(subs=d2)
0.100000000000000
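In fact (my addition), the Scenario A result is algebraically identical to the collapsed single-test model above, which uses the mixed false positive rate t = q*t1 + (1-q)*t2; the difference should simplify to zero:
(res - pmf['sick']).simplify()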
Scenario B
Now suppose that instead of two kinds of people, or two kinds of tests, the distribution of t represents our uncertainty about t. That is, we are only considering one test, and we think the false positive rate is the same for everyone, but we don't know what it is.
In this scenario, we need to think about the sampling process that brings patients to see doctors. There are three possibilities:
B1. Only patients who test positive see a doctor.
B2. All patients see a doctor with equal probability, regardless of test results and regardless of whether they are sick or not.
B3. Patients are more or less likely to see a doctor, depending on the test results and whether they are sick or not.
Scenario B1
If patients only see a doctor after testing positive, the doctor doesn't learn anything about t just because a patient tests positive. In that case, the doctor should compute the conditional probabilities:
p1 is the probability the patient is sick given a positive test and t1
p2 is the probability the patient is sick given a positive test and t2
We can compute p1 and p2, form pmf_p, and compute its mean:
p1 = prob(a, c)
p1.simplify()
p*s/(p*s - t1*(p - 1))
p1.evalf(subs=d1)
0.217391304347826
p2 = prob(e, g)
p2.simplify()
p*s/(p*s - t2*(p - 1))
p2.evalf(subs=d1)
0.0649350649350649
pmf_p = thinkbayes2.Pmf([p1, p2])
pmf_p.Mean().simplify()
0.5*p*s*(2*p*s - t1*(p - 1) - t2*(p - 1))/((p*s - t1*(p - 1))*(p*s - t2*(p - 1)))
pmf_p.Mean().evalf(subs=d1)
0.141163184641446
p1.evalf(subs=d2), p2.evalf(subs=d2), pmf_p.Mean().evalf(subs=d2)
(0.121951219512195, 0.0649350649350649, 0.0934431422236300)
Scenario B2
If all patients see a doctor, the doctor can learn about t based on the number of positive and negative tests.
The likelihood of a positive test given t1 is (a+c)/q.
The likelihood of a positive test given t2 is (e+g)/(1-q).
update takes a pmf and updates it with these likelihoods.
def update(pmf):
    post = pmf.Copy()
    # multiply each hypothesis by the likelihood of a positive test
    post[p1] *= (a + c) / q
    post[p2] *= (e + g) / (1-q)
    post.Normalize()
    return post
post is what we should believe about p after seeing one patient with a positive test:
post = update(pmf_p)
post[p1].simplify()
(-0.5*p*s + 0.5*t1*(p - 1))/(-1.0*p*s + 0.5*t1*(p - 1) + 0.5*t2*(p - 1))
post.Mean().simplify()
-1.0*p*s/(-1.0*p*s + 0.5*t1*(p - 1) + 0.5*t2*(p - 1))
When q is 0.5, the posterior mean is p:
post.Mean().evalf(subs=d1)
0.100000000000000
But other distributions of t yield different values.
post.Mean().evalf(subs=d2)
0.0847457627118644
Let's see what we get after seeing two patients:
post2 = update(post)
post2.Mean().simplify()
1.0*p*s*(2*p*s - t1*(p - 1) - t2*(p - 1))/((p*s - t1*(p - 1))**2 + (p*s - t2*(p - 1))**2)
Positive tests are more likely under t2 than t1, so each positive test makes it more likely that t=t2. So the expected value of p converges on p2.
post2.Mean().evalf(subs=d1)
0.0774233508826262
post2.Mean().evalf(subs=d2)
0.0775295663600526
post3 = update(post2)
post3.Mean().evalf(subs=d1)
0.0688926818860678
post3.Mean().evalf(subs=d2)
0.0724135699794844
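To see the convergence more directly, here's a sketch (my addition, reusing the update function above) that iterates the update and tracks the posterior mean, which should approach p2 (about 0.065 under both parameter sets):
beliefs = pmf_p.Copy()
for i in range(10):
    beliefs = update(beliefs)
    print(i, beliefs.Mean().evalf(subs=d1))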