Probability is a way to quantify the uncertainty associated with events chosen from some universe of events.
The laws of probability, so true in general, so fallacious in particular.
—Edward Gibbon
P(E) means “the probability of the event E.”
We’ll use probability theory to build and evaluate models.
Roughly speaking, we say that two events E and F are dependent if knowing something about whether E happens gives us information about whether F happens (and vice versa). Otherwise they are independent.
When two events E and F are independent, by definition we have:
$$P(E,F) = P(E)\,P(F)$$
If E and F are not necessarily independent (and if the probability of F is not zero), we define the probability of E conditional on F as:
$$P(E \mid F) = \frac{P(E,F)}{P(F)}$$
We often rewrite this as:
$$P(E,F) = P(E \mid F)\,P(F)$$
When E and F are independent:
$$P(E \mid F) = P(E)$$
which is the mathematical way of saying that knowing F occurred gives us no additional information about whether E occurred.
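For example (a quick simulated check, not in the original text): roll one fair die, and let E be the event "the roll is at least 4" and F the event "the roll is even." Then P(E) = 1/2 but P(E | F) = 2/3, so E and F are dependent.
import random

random.seed(0)
rolls = [random.randint(1, 6) for _ in range(100000)]

p_f = sum(r % 2 == 0 for r in rolls) / len(rolls)              # P(F) ~ 1/2
p_ef = sum(r >= 4 and r % 2 == 0 for r in rolls) / len(rolls)  # P(E, F) ~ 1/3

print(p_ef / p_f)   # P(E | F) ~ 2/3, not P(E) = 1/2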
Consider a family with two (unknown) children, and assume that each child is equally likely to be a boy or a girl, and that the gender of the second child is independent of the gender of the first. The joint event (B, G) ("both children are girls and the older child is a girl") is just the event B ("both children are girls").
The probability of the event "both children are girls" (B) conditional on the event "the older child is a girl" (G) is:
$$P(B \mid G) = \frac{P(B,G)}{P(G)} = \frac{P(B)}{P(G)} = \frac{1/4}{1/2} = 1/2$$
The probability of the event "both children are girls" conditional on the event "at least one of the children is a girl" (L) is:
$$P(B \mid L) = \frac{P(B,L)}{P(L)} = \frac{P(B)}{P(L)} = \frac{1/4}{3/4} = 1/3$$
We can check these answers by simulation:
from collections import Counter
import math, random
import matplotlib.pyplot as plt # pyplot
def random_kid():
    return random.choice(["boy", "girl"])

random_kid()
'girl'
both_girls = 0
older_girl = 0
either_girl = 0

random.seed(100)
for _ in range(10000):
    younger = random_kid()
    older = random_kid()
    if older == "girl":
        older_girl += 1
    if older == "girl" and younger == "girl":
        both_girls += 1
    if older == "girl" or younger == "girl":
        either_girl += 1

print("P(both | older):", both_girls / older_girl)     # 0.491 ~ 1/2
print("P(both | either): ", both_girls / either_girl)  # 0.333 ~ 1/3
P(both | older): 0.4906832298136646
P(both | either):  0.33271913661489866
Using the definition of conditional probability twice tells us that:
$$P(E \mid F) = \frac{P(E,F)}{P(F)} = \frac{P(F \mid E)\,P(E)}{P(F)}$$
The complement of an event E, written E', is the "opposite" of that event ("E doesn't happen").
The event F can be split into the two mutually exclusive events "F and E" and "F and E'", so that:
$$P(F) = P(F,E) + P(F,E') = P(F \mid E)\,P(E) + P(F \mid E')\,P(E')$$
Substituting this into the denominator gives Bayes’s Theorem:
$$P(E \mid F) = \frac{P(F \mid E)\,P(E)}{P(F \mid E)\,P(E) + P(F \mid E')\,P(E')}$$
**Why data scientists are smarter than doctors**
Imagine a certain disease that affects 1 in every 10,000 people.
And imagine that there is a test for this disease that gives the correct result (“diseased” if you have the disease, “nondiseased” if you don’t) 99% of the time.
Let’s use T for the event “your test is positive” and D for the event “you have the disease.”
Then Bayes’s Theorem says that the probability that you have the disease, conditional on testing positive, is:
$$P(D \mid T) = \frac{P(T \mid D)\,P(D)}{P(T \mid D)\,P(D) + P(T \mid D')\,P(D')}$$
Here we know that $P(T \mid D) = 0.99$, $P(D) = 1/10000 = 0.0001$, $P(T \mid D') = 0.01$, and $P(D') = 0.9999$.
Question: Can you compute $P(D \mid T)$?
Joke: most doctors will guess that $P(D \mid T)$ is approximately 98%.
If you substitute these numbers into Bayes’s Theorem, you find $P(D \mid T) = 0.98\%$.
That is, fewer than 1% of the people who test positive actually have the disease!
This assumes that people take the test more or less at random. If only people with certain symptoms take the test, we would instead have to condition on the event “positive test and symptoms,” and the number would likely be a lot higher.
A more intuitive way to see this is to imagine a population of 1 million people. You’d expect 100 of them to have the disease, and 99 of those 100 to test positive. On the other hand, you’d expect 999,900 of them not to have the disease, and 9,999 of those to test positive. That means you’d expect only 99 out of (99 + 9,999) positive testers to actually have the disease.
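Plugging the numbers into Bayes’s Theorem confirms this (a minimal sketch, not part of the original text):
p_d = 1 / 10000      # P(D): 1 in 10,000 people has the disease
p_t_d = 0.99         # P(T | D): positive 99% of the time if diseased
p_t_not_d = 0.01     # P(T | D'): false positive rate

p_d_t = (p_t_d * p_d) / (p_t_d * p_d + p_t_not_d * (1 - p_d))
print(p_d_t)   # 0.00980..., i.e. about 0.98%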
A random variable is a variable whose possible values have an associated probability distribution.
The associated distribution gives the probabilities that the variable realizes each of its possible values.
The expected value of a random variable is the average of its values weighted by their probabilities.
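For instance (a small illustration, not in the original): a fair die roll takes each value 1 through 6 with probability 1/6, so its expected value is 3.5.
die_ev = sum(x * (1 / 6) for x in range(1, 7))   # sum of value * probability
print(die_ev)   # 3.5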
A discrete distribution associates positive probability with discrete outcomes.
Often we’ll want to model distributions across a continuum of outcomes. For example, the uniform distribution puts equal weight on all the numbers between 0 and 1. Because there are infinitely many numbers between 0 and 1, the weight it assigns to individual points must necessarily be zero.
Instead, a continuous distribution is described with a probability density function (pdf), such that the probability of seeing a value in a certain interval equals the integral of the density function over that interval.
def uniform_pdf(x):
    return 1 if 0 <= x < 1 else 0

uniform_pdf(0.5)
1
The probability that a random variable following that distribution is between 0.2 and 0.3 is 1/10, since the density is 1 on an interval of length 0.1.
The cumulative distribution function (cdf) gives the probability that a random variable is less than or equal to a certain value.
def uniform_cdf(x):
    "returns the probability that a uniform random variable is <= x"
    if x < 0:   return 0   # uniform random is never less than 0
    elif x < 1: return x   # e.g. P(X <= 0.4) = 0.4
    else:       return 1   # uniform random is always less than 1

uniform_cdf(0.8)
0.8

uniform_cdf(0.3) - uniform_cdf(0.2)   # 0.1, up to floating-point rounding
0.09999999999999998
xs = [(i - 10) / 10 for i in range(31)]   # -1.0 to 2.0 in steps of 0.1
ys = [uniform_cdf(x) for x in xs]
plt.plot(xs, ys)
plt.title('The uniform cdf')
plt.xlabel('$x$')
plt.ylabel('CDF')
plt.show()
The normal distribution is the king of distributions.
The probability density of the normal distribution is
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where $\mu$ is the mean and $\sigma$ is the standard deviation; together they completely determine the distribution.
def normal_pdf(x, mu=0, sigma=1):
    sqrt_two_pi = math.sqrt(2 * math.pi)
    return math.exp(-(x - mu) ** 2 / 2 / sigma ** 2) / (sqrt_two_pi * sigma)

normal_pdf(5)
1.4867195147342979e-06
def plot_normal_pdfs(plt):
    xs = [x / 10.0 for x in range(-50, 50)]
    plt.plot(xs, [normal_pdf(x, sigma=1) for x in xs], '-', label='mu=0,sigma=1')
    plt.plot(xs, [normal_pdf(x, sigma=2) for x in xs], '--', label='mu=0,sigma=2')
    plt.plot(xs, [normal_pdf(x, sigma=0.5) for x in xs], ':', label='mu=0,sigma=0.5')
    plt.plot(xs, [normal_pdf(x, mu=-1) for x in xs], '-.', label='mu=-1,sigma=1')
    plt.xlabel('Z', fontsize=20)
    plt.ylabel('Probability', fontsize=20)
    plt.legend()
    plt.show()

plot_normal_pdfs(plt)
def normal_cdf(x, mu=0, sigma=1):
    # the normal cdf has no elementary closed form, but math.erf computes it
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2
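The cdf lets us compute interval probabilities by subtraction. As a quick check (not in the original) of the familiar fact that a standard normal falls within one standard deviation of its mean about 68% of the time:
print(normal_cdf(1) - normal_cdf(-1))   # ~ 0.6827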
def plot_normal_cdfs(plt):
    xs = [x / 10.0 for x in range(-50, 50)]
    plt.plot(xs, [normal_cdf(x, sigma=1) for x in xs], '-', label='mu=0,sigma=1')
    plt.plot(xs, [normal_cdf(x, sigma=2) for x in xs], '--', label='mu=0,sigma=2')
    plt.plot(xs, [normal_cdf(x, sigma=0.5) for x in xs], ':', label='mu=0,sigma=0.5')
    plt.plot(xs, [normal_cdf(x, mu=-1) for x in xs], '-.', label='mu=-1,sigma=1')
    plt.xlabel('Z', fontsize=20)
    plt.ylabel('Probability', fontsize=20)
    plt.legend(loc=4)   # bottom right
    plt.show()

plot_normal_cdfs(plt)
One reason the normal distribution is so useful is the central limit theorem, which says (in essence) that a random variable defined as the average of a large number of independent and identically distributed (IID) random variables is itself approximately normally distributed.
In particular, if $x_1, \ldots, x_n$ are random variables with mean $\mu$ and standard deviation $\sigma$, and if $n$ is large, then
$$\frac{(x_1 + \cdots + x_n) - \mu n}{\sigma \sqrt{n}} = \frac{\frac{x_1 + \cdots + x_n}{n} - \mu}{\sigma / \sqrt{n}}$$
is approximately normally distributed with mean 0 and standard deviation 1.
In probability theory, the central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a "bell curve") even if the original variables themselves are not normally distributed.
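A minimal simulation of this (a sketch, not in the original): averages of uniform(0, 1) variables, which are decidedly non-normal individually, standardize to something that looks like a standard normal.
n = 100
means = [sum(random.random() for _ in range(n)) / n for _ in range(10000)]

# a uniform(0, 1) variable has mu = 0.5 and sigma = sqrt(1/12)
mu, sigma = 0.5, math.sqrt(1 / 12)
zs = [(m - mu) / (sigma / math.sqrt(n)) for m in means]

plt.hist(zs, bins=50, density=True, color='0.75')   # simulated averages
xs = [x / 10.0 for x in range(-40, 41)]
plt.plot(xs, [normal_pdf(x) for x in xs])           # standard normal pdf
plt.title('CLT: standardized means of uniforms')
plt.show()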
Binomial random variables have two parameters, n and p.
A Binomial(n,p) random variable is simply the sum of n independent Bernoulli(p) random variables, each of which equals 1 with probability p and 0 with probability 1 − p.
The mean of a Bernoulli($p$) variable is $p$, and its variance is $p(1-p)$ (a quick simulated check follows the histogram code below).
The central limit theorem says that as $n$ gets large, a Binomial(n,p) variable is approximately a normal random variable with mean $\mu = np$ and standard deviation $\sigma = \sqrt{np(1-p)}$.
def bernoulli_trial(p):
    return 1 if random.random() < p else 0

def binomial(p, n):
    return sum(bernoulli_trial(p) for _ in range(n))

def make_hist(p, n, num_points):
    data = [binomial(p, n) for _ in range(num_points)]

    # use a bar chart to show the actual binomial samples
    histogram = Counter(data)
    plt.bar([x - 0.4 for x in histogram.keys()],
            [v / num_points for v in histogram.values()],
            0.8, color='0.75')

    mu = p * n
    sigma = math.sqrt(n * p * (1 - p))

    # use a line chart to show the normal approximation
    xs = range(min(data), max(data) + 1)
    ys = [normal_cdf(i + 0.5, mu, sigma) - normal_cdf(i - 0.5, mu, sigma)
          for i in xs]
    plt.plot(xs, ys)
    plt.show()

make_hist(0.75, 100, 10000)
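As a quick sanity check (not in the original) of the Bernoulli mean and variance claimed above:
trials = [bernoulli_trial(0.3) for _ in range(100000)]
mean = sum(trials) / len(trials)
variance = sum((t - mean) ** 2 for t in trials) / len(trials)
print(mean, variance)   # close to p = 0.3 and p * (1 - p) = 0.21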
scipy.stats contains pdf and cdf functions for most of the popular probability distributions.
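For example (assuming scipy is installed), the hand-rolled functions above have library counterparts:
from scipy import stats

print(stats.norm.pdf(0.0))              # matches normal_pdf(0.0)
print(stats.norm.cdf(0.8))              # matches normal_cdf(0.8)
print(stats.uniform.cdf(0.8))           # matches uniform_cdf(0.8)
print(stats.binom.pmf(75, 100, 0.75))   # exact Binomial(100, 0.75) probability of 75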