Title: Sampling from Probability Distributions
Author: Thomas Breuel
Institution: UniKL
from pylab import *
We can easily sample from distributions in Python. The basic sampling functions are randn (standard normal density) and rand (uniform density on $[0,1)$).
data = randn(10000)
_=hist(data,bins=100)
If we want a binomial variable with $p=0.1$, we can sample as follows:
n = 1000
p = 0.1
sum(rand(n)<p)
109
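Equivalently, NumPy provides a built-in binomial sampler; a quick sketch (assuming `numpy` is imported directly rather than via pylab):

```python
import numpy as np

n, p = 1000, 0.1
# numpy's binomial sampler; equivalent in distribution to counting
# how many of n uniform draws fall below p.
sample = np.random.binomial(n, p)
print(sample)  # a value near n*p = 100
```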
p = 0.5
dist = array([sum(rand(n)<p) for i in range(10000)])
_=hist(dist,bins=20)
print(mean(dist), var(dist))
500.114 246.746804
Theoretical values are:
$$ \bar{x} = n ~ p $$
$$ v = n ~ p ~ (1-p) $$
print(n*p)
print(n*p*(1-p))
500.0 250.0
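As a sanity check, we can rerun the experiment and print the empirical moments side by side with the theoretical ones (a sketch using plain NumPy):

```python
import numpy as np

n, p = 1000, 0.5
# Repeat the binomial experiment and compare empirical moments
# against the theoretical mean n*p and variance n*p*(1-p).
dist = np.array([np.sum(np.random.rand(n) < p) for i in range(10000)])
print(dist.mean(), n * p)           # both near 500
print(dist.var(), n * p * (1 - p))  # both near 250
```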
We often estimate parameters from measurements. Let's do this for a normal density.
data = randn(1000)
print(mean(data), var(data))
0.0317909909166 0.877621361124
Note that our parameter estimates themselves have a distribution of errors.
means = [mean(randn(1000)) for i in range(1000)]
_=hist(means,bins=100)
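The spread of these estimated means is predicted by the standard error of the mean, $\sigma/\sqrt{n}$, which for $n=1000$ standard normal samples is $1/\sqrt{1000} \approx 0.032$. A quick check, assuming `numpy` directly:

```python
import numpy as np

n = 1000
# Distribution of the sample mean of n standard normal draws.
means = np.array([np.random.randn(n).mean() for i in range(1000)])
# The standard error of the mean predicts the spread: sigma / sqrt(n).
predicted = 1.0 / np.sqrt(n)
print(means.std(), predicted)  # both near 0.032
```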
Another interesting topic is extreme value distributions. They matter a great deal in biology and business, where "the best" of a population often "win" or are measured. They also matter when people quote statistics like "90 percent of population X are women/men, therefore there must be bias".
N = 100000
maxs = array([amax(randn(N)) for i in range(1000)])
_=hist(maxs,bins=100)
Now shift one population by a small amount (0.2 here, i.e. 20 percent of the standard deviation) and observe how strongly this biases the maxima.
max1s = array([amax(randn(N)+0.2) for i in range(1000)])
_=hist(maxs,bins=100,alpha=0.7)
_=hist(max1s,bins=100,alpha=0.7)
sum(max1s>maxs)*1.0/len(maxs)
0.70599999999999996
Interestingly, you can get much greater extreme value differences without any bias at all, simply by having more variance in one population than in another.
maxs = array([amax(randn(10000)) for i in range(1000)])
max2s = array([amax(1.1*randn(10000)) for i in range(1000)])
_=hist(maxs,bins=100,alpha=0.7)
_=hist(max2s,bins=100,alpha=0.7)
sum(max2s>maxs)*1.0/len(maxs)
0.82399999999999995
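A rough rule of thumb (not derived here) is that the expected maximum of $N$ standard normal samples grows like $\sqrt{2 \ln N}$ to leading order, so even a 10 percent increase in standard deviation shifts the typical maximum by roughly $0.1 \sqrt{2 \ln N}$. A sketch comparing that leading-order approximation to simulation, keeping in mind that it overshoots somewhat at finite $N$:

```python
import numpy as np

N = 10000
# Leading-order approximation for the expected maximum of N standard
# normals; it overestimates somewhat at finite N.
approx = np.sqrt(2 * np.log(N))  # about 4.29 for N = 10000
maxs = np.array([np.random.randn(N).max() for i in range(200)])
print(maxs.mean(), approx)
```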
Let's look at another important example from statistics. Consider a population of size $N = 1000000$:
N = 1000000
Now assume that about 1 in 1000 people in the population are infected with some disease.
infected = (rand(N)<1e-3)
sum(infected)
1045
Also, assume we have a test that is 99 percent accurate. We give the test to the entire population and measure the results.
testresult = where(rand(N)<1e-2,1-infected,infected)
sum(testresult)
11120
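The simulated count of positives (about 11,000) agrees with what the error rates predict; a quick check in plain Python:

```python
N = 1000000
p_infected = 1e-3
p_error = 1e-2
# Expected number of positive tests: true positives plus false positives.
expected = N * (p_infected * (1 - p_error) + (1 - p_infected) * p_error)
print(expected)  # about 10980
```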
Now, let's look at the population of people whose test result is positive. What fraction of those people are infected?
sum(infected[testresult==True])*1.0/sum(testresult==True)
0.092985611510791363
(Implications)
That's not good: less than 10 percent of the population that tests positive is actually infected. Ninety percent of those who test positive are false positives even though the test is 99 percent accurate.
This is a common issue in medicine, anti-terrorism screening, and many other problems that involve detecting rare conditions: all of these end up generating enormous numbers of false positives.
(Bayesian Calculation)
Mathematically, we write this in terms of conditional probabilities:
prior probability:

$$P(i)$$

probabilities of test outcomes:

$$P(t=0|i=0) = 0.99$$
$$P(t=1|i=1) = 0.99$$

Bayes formula:

$$ P(i=1|t=1) = \frac{P(t=1|i=1) P(i=1)}{P(t=1)} = \frac{P(t=1|i=1) P(i=1)}{P(t=1|i=0) P(i=0) + P(t=1|i=1) P(i=1)} $$
$$ = \frac{0.99 \cdot 0.001}{0.01 \cdot 0.999 + 0.99 \cdot 0.001} \approx \frac{0.001}{0.01} = 0.1 $$

(Derivation of Bayes Formula)
Joint density ($c$ and $x$ could be any variables):

$$ p(c,x) $$

Conditional density:

$$ p(c,x) = P(c|x) ~ p(x) $$

Bayes formula:

$$ P(c|x) ~ p(x) = p(x|c) ~ P(c) $$
$$ P(c|x) = \frac{p(x|c) P(c)}{p(x)} $$
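Plugging the numbers from the infection example into Bayes formula confirms the simulated fraction directly:

```python
# Bayes' formula with the numbers from the infection example.
p_i = 0.001             # prior P(i=1)
p_t_given_i = 0.99      # P(t=1|i=1), test sensitivity
p_t_given_not_i = 0.01  # P(t=1|i=0), false positive rate

p_t = p_t_given_not_i * (1 - p_i) + p_t_given_i * p_i  # P(t=1)
posterior = p_t_given_i * p_i / p_t                    # P(i=1|t=1)
print(posterior)  # about 0.09, matching the simulation
```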