# imports for the tutorial
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook
We define the following:
Term | Usually denoted by | Definition | Example |
---|---|---|---|
Experiment | | | |
Sample | $\omega$ | | |
Sample Space | $\Omega$ | | |
Event | $A$ | | The empty set $\emptyset$; the entire set (any outcome): $\Omega$ |
Event Space | $\mathcal{F}$ | | |
Probability | $P$, $\Pr$ | | $P(\emptyset)=0$, $P(\Omega)=1$ |
possible_outcomes = ['H', 'T']
probabilities = [0.5, 0.5]
# toss a coin twice
first_toss = np.random.choice(possible_outcomes, p=probabilities)
print("first toss result: ", first_toss)
second_toss = np.random.choice(possible_outcomes, p=probabilities)
print("second toss result: ", second_toss)
first toss result:  H
second toss result:  T
Find the probability of rolling an even number or a multiple of 3 with a fair six-sided die.
Why is this different from just asking for the probability of rolling a 6? Because we are asking about two different events (an even number, and a multiple of 3), and we want the probability of their union.
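Since the sample space is finite, the union probability can be checked by direct enumeration, assuming a fair six-sided die:

```python
# Verify P(even or multiple of 3) on a fair six-sided die by enumeration.
outcomes = range(1, 7)
even = {n for n in outcomes if n % 2 == 0}    # {2, 4, 6}
mult3 = {n for n in outcomes if n % 3 == 0}   # {3, 6}

p = lambda event: len(event) / 6
# Inclusion-exclusion: P(A U B) = P(A) + P(B) - P(A and B)
p_union = p(even) + p(mult3) - p(even & mult3)
print(p_union)          # 1/2 + 1/3 - 1/6 = 2/3
print(p(even | mult3))  # counting the union directly gives the same 2/3
```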
Assume $X$ is a continuous random variable. We define the following:
def plot_normal_pdf_cdf(mu=0, sigma=1):
    x = np.linspace(-10, 10, 1000)
    x_pdf = (1 / np.sqrt(2 * np.pi * sigma ** 2)) * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
    x_cdf = np.cumsum(x_pdf) * (x[1] - x[0])  # approximate CDF: cumulative Riemann sum of the PDF
    fig = plt.figure(figsize=(8, 5))
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(x, x_pdf, label='PDF')
    ax.plot(x, x_cdf, label='CDF')
    ax.grid()
    ax.legend()
    ax.set_xlabel('x')
    ax.set_title('PDF and CDF of Normal Distribution')
plot_normal_pdf_cdf(mu=0, sigma=1)
plot_normal_pdf_cdf(mu=0, sigma=0.3)
# notice how the pdf can be larger than 1
Suppose the events $B_1, \ldots, B_k$ are mutually exclusive and form a partition of the sample space (i.e., exactly one of them must occur). Then for any event $A$ with $Pr(A) > 0$, Bayes' rule states: $$Pr(B_i|A) = \frac{Pr(A,B_i)}{Pr(A)} = \frac{Pr(A|B_i)Pr(B_i)}{Pr(A)} = \frac{Pr(A|B_i)Pr(B_i)}{\sum_{j=1}^k Pr(A|B_j)Pr(B_j)}$$
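A minimal numeric sketch of Bayes' rule with a two-event partition; the prior and likelihood values below are made up purely for illustration:

```python
# Bayes' rule with a (made-up) two-event partition B1, B2:
# an observation A that occurs with different probabilities under each hypothesis.
prior = [0.01, 0.99]       # Pr(B1), Pr(B2) -- a partition of the sample space
likelihood = [0.99, 0.05]  # Pr(A | B1), Pr(A | B2)

# Denominator: total probability Pr(A) = sum_j Pr(A|Bj) * Pr(Bj)
pr_a = sum(l * p for l, p in zip(likelihood, prior))
posterior = [l * p / pr_a for l, p in zip(likelihood, prior)]
print(posterior)  # Pr(B1|A) ~ 0.167, Pr(B2|A) ~ 0.833
```

Note how a strong likelihood (0.99) is still dominated by the small prior (0.01): the posterior for $B_1$ is only about 1/6.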
Given a dataset where each sample is a male or a female together with their height, what is the probability that a person with a given height is female? That is, calculate $Pr(\textit{Gender} = \textit{Female} \mid \textit{Height} = X \text{ cm})$.
# load the data
dataset = pd.read_csv('./datasets/heights_dataset.csv')
# use only the heights
dataset = dataset.drop('Weight', axis=1)
# inch -> cm
dataset['Height'] = dataset['Height'] * 2.54
## print the number of rows in the data set
number_of_rows = len(dataset)
print('Number of rows in the dataset: {}'.format(number_of_rows))
## show the first 10 rows
dataset.head(10)
Number of rows in the dataset: 10000
 | Gender | Height |
---|---|---|
0 | Male | 187.571423 |
1 | Male | 174.706036 |
2 | Male | 188.239668 |
3 | Male | 182.196685 |
4 | Male | 177.499761 |
5 | Male | 170.822660 |
6 | Male | 174.714106 |
7 | Male | 173.605229 |
8 | Male | 170.228132 |
9 | Male | 161.179495 |
# let's plot the histogram
figure = plt.figure()
ax = figure.add_subplot(1,1,1)
# males are the first 5000 rows, females the last 5000; plot both on the same axes
dataset[:5000].rename(columns={"Height": "Male"}).plot.hist(ax=ax)
dataset[5000:].rename(columns={"Height": "Female"}).plot.hist(ax=ax)
ax.grid()
ax.set_xlabel('Height (cm)')
The mean is the probability-weighted average of all possible values.
The variance is a measure of the "spread" of the distribution (it can also be thought of as a measure of confidence).
What is the mean and variance of the heights of males? females? combined together?
# easy with pandas
print("the mean of males' height is: {:.3f} cm".format(dataset[:5000].Height.mean()))
print("the variance of males' height is: {:.3f} cm^2".format(dataset[:5000].Height.var()))
print("the std of males' height is: {:.3f} cm".format(dataset[:5000].Height.std()))
the mean of males' height is: 175.327 cm
the variance of males' height is: 52.896 cm^2
the std of males' height is: 7.273 cm
print("the mean of females' height is: {:.3f} cm".format(dataset[5000:].Height.mean()))
print("the variance of females' height is: {:.3f} cm^2".format(dataset[5000:].Height.var()))
print("the std of females' height is: {:.3f} cm".format(dataset[5000:].Height.std()))
the mean of females' height is: 161.820 cm
the variance of females' height is: 46.903 cm^2
the std of females' height is: 6.849 cm
print("the mean of total height is: {:.3f} cm".format(dataset.Height.mean()))
print("the variance of total height is: {:.3f} cm^2".format(dataset.Height.var()))
print("the std of total height is: {:.3f} cm".format(dataset.Height.std()))
the mean of total height is: 168.574 cm
the variance of total height is: 95.506 cm^2
the std of total height is: 9.773 cm
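Notice that the combined variance (~95.5 cm²) is much larger than either group's, because the two group means differ. The law of total variance, $\text{Var}(X) = E[\text{Var}(X|G)] + \text{Var}(E[X|G])$, makes this precise. A sketch on synthetic data (not the csv used above), with equal group sizes:

```python
import numpy as np

# Law of total variance on two equally sized groups (synthetic heights):
# Var(X) = E[Var(X|G)]  (average within-group variance)
#        + Var(E[X|G])  (variance of the group means)
rng = np.random.default_rng(0)
males = rng.normal(175.3, 7.3, 5000)
females = rng.normal(161.8, 6.8, 5000)
combined = np.concatenate([males, females])

within = (males.var() + females.var()) / 2            # E[Var(X|G)]
means = np.array([males.mean(), females.mean()])
between = ((means - combined.mean()) ** 2).mean()     # Var(E[X|G])
print(combined.var(), within + between)               # the two agree
```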
Covariance is a measure of linear dependency between two variables.
We define the covariance between two random variables (RVs) as $\sigma_{xy}$:
REMEMBER:
Independence $\rightarrow$ Uncorrelated BUT Uncorrelated $\nrightarrow$ Independence
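A classic counterexample, sketched with NumPy: take $X$ symmetric around zero and $Y = X^2$. Then $Y$ is a deterministic function of $X$ (so they are clearly dependent), yet their sample correlation is near zero:

```python
import numpy as np

# Y = X**2 with X symmetric around 0: dependent, but uncorrelated.
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 100_000)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(r)  # close to 0 -- uncorrelated, but obviously not independent
```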
It is a measure of linear correlation between two variables X and Y, denoted $\rho$ or $r_{xy}$:
Below are examples (from the presentation) of variables that are correlated, but where one is not necessarily caused by the other:
If one is calculating the average temperature of 10 objects in a room, and nine of them are between 20 and 25 degrees Celsius, but an oven is at 175 °C, the median of the data will be between 20 and 25 °C but the mean temperature will be between 35.5 and 40 °C. In this case, the median better reflects the temperature of a randomly sampled object (but not the temperature in the room) than the mean. Naively interpreting the mean as "a typical sample", equivalent to the median, is incorrect. As illustrated in this case, outliers may indicate data points that belong to a different population than the rest of the sample set.
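The same example in numbers (the nine moderate temperatures below are made up within the stated 20-25 °C range):

```python
import numpy as np

# Nine objects at 20-25 degrees C plus one 175 degrees C oven.
temps = np.array([20, 21, 21.5, 22, 22.5, 23, 23.5, 24, 25, 175.0])

print(np.median(temps))  # 22.75 -- a "typical" object, robust to the outlier
print(temps.mean())      # 37.75 -- pulled far up by the single oven
```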
All the examples (from the presentation) below share the same Pearson's r:
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{1,2} & \cdots & \sigma_{1,d} \\ \sigma_{2,1} & \sigma_2^2 & \cdots & \sigma_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d,1} & \sigma_{d,2} & \cdots & \sigma_d^2 \end{pmatrix}$$
num_samples = 1000
num_variables = 5
a = np.random.random(size=(1, num_variables))
# rank-1 PSD matrix plus a small diagonal ridge -> a valid covariance matrix
sigma = a * a.T + np.eye(num_variables) * 1e-4
# draw one sample from the multivariate normal
mult_var = np.random.multivariate_normal(np.zeros(num_variables), sigma)
print("mu:")
print(np.zeros(num_variables))
print("Sigma:")
print(sigma)
print("draw a sample from each variable:")
print(mult_var)
mu:
[0. 0. 0. 0. 0.]
Sigma:
[[0.04800934 0.0189633  0.13069076 0.03377338 0.08095299]
 [0.0189633  0.00760599 0.05172954 0.01336806 0.03204252]
 [0.13069076 0.05172954 0.35660823 0.0921296  0.22082975]
 [0.03377338 0.01336806 0.0921296  0.02390833 0.05706729]
 [0.08095299 0.03204252 0.22082975 0.05706729 0.13688724]]
draw a sample from each variable:
[-0.24298271 -0.09838847 -0.72393916 -0.176606   -0.41587365]
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix}$$
$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad-bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$$
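A quick sanity check of the closed-form 2x2 inverse against `np.linalg.inv`, on an arbitrary symmetric example:

```python
import numpy as np

# Closed-form 2x2 inverse: (1/(ad-bc)) * [[d, -b], [-c, a]]
a, b, c, d = 1.0, -0.5, -0.5, 1.5
m = np.array([[a, b], [c, d]])

det = a * d - b * c
m_inv = (1 / det) * np.array([[d, -b], [-c, a]])

print(np.allclose(m_inv, np.linalg.inv(m)))  # True
```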
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
from scipy.stats import multivariate_normal
def plot_3d_normal_dist():
    # Our 2-dimensional distribution will be over variables X and Y
    N = 60
    X = np.linspace(-3, 3, N)
    Y = np.linspace(-3, 4, N)
    X, Y = np.meshgrid(X, Y)
    # Mean vector and covariance matrix
    mu = np.array([0., 1.])
    Sigma = np.array([[1., -0.5], [-0.5, 1.5]])
    # Pack X and Y into a single 3-dimensional array
    pos = np.empty(X.shape + (2,))
    pos[:, :, 0] = X
    pos[:, :, 1] = Y
    F = multivariate_normal(mu, Sigma)
    Z = F.pdf(pos)
    # Create a surface plot and projected filled contour plot under it.
    fig = plt.figure(figsize=(8, 5))
    ax = fig.add_subplot(projection='3d')  # fig.gca(projection='3d') is deprecated in recent Matplotlib
    ax.plot_surface(X, Y, Z, rstride=3, cstride=3, linewidth=1, antialiased=True,
                    cmap=cm.viridis)
    # cset = ax.contourf(X, Y, Z, zdir='z', offset=-0.15, cmap=cm.viridis)
    # Adjust the limits, ticks and view angle
    ax.set_zlim(-0.15, 0.2)
    ax.set_zticks(np.linspace(0, 0.2, 5))
    ax.view_init(27, -21)
    plt.show()
%matplotlib notebook
plot_3d_normal_dist()
Given 5,000 height samples for each gender, estimate a Gaussian for the heights of each gender.
We want to achieve something like this:
Given $\{x_i\}_{i=1}^n$ i.i.d samples of $X \sim N(\mu, \sigma^2)$, what is the MLE?
The first thing to ask yourself is: what are the parameters in this problem? In our case, the parameters are $\theta = [\mu, \sigma^2]$; it is just a matter of notation.
As usual, find the point where the derivative w.r.t. $\theta$ is 0.
Summary: $$ \hat{\mu}_{MLE} = \frac{1}{n} \sum_{i=1}^n x_i$$ $$\hat{\sigma^2}_{MLE} = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu}_{MLE})^2 $$
Do these look familiar? These are the empirical mean and variance!
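Indeed, `np.mean` and `np.var` (with the default `ddof=0`) compute exactly these estimators; a quick check on synthetic samples:

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.normal(5, 6, 1000)

# MLE estimators written out explicitly
mu_hat = samples.sum() / len(samples)
var_hat = ((samples - mu_hat) ** 2).sum() / len(samples)

# np.mean and np.var (ddof=0) are the same quantities
print(np.isclose(mu_hat, samples.mean()), np.isclose(var_hat, samples.var()))  # True True
```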
def plot_normal_mle():
    mu_real = 5
    var_real = 36
    num_samples = 1000
    samples = np.random.normal(mu_real, np.sqrt(var_real), size=(num_samples))
    # MLE estimates: empirical mean and (biased) empirical variance
    mu_mle = np.sum(samples) / num_samples
    var_mle = np.sum(np.square(samples - mu_mle)) / num_samples
    x = np.linspace(-30, 30, 10000)
    f_x_mle = (1 / np.sqrt(2 * np.pi * var_mle)) * np.exp(-0.5 * (np.square(x - mu_mle)) / var_mle)
    # set bins for histogram
    n_bins = 100
    bins_edges = np.linspace(samples.min(), samples.max() + 1e-9, n_bins + 1)
    fig = plt.figure(figsize=(5, 5))
    ax = fig.add_subplot(1, 1, 1)
    ax.grid()
    ax.set_ylabel('Histogram (PDF)')
    ax.set_xlabel('x')
    # plot histogram
    ax.hist(samples, bins=bins_edges, density=True, label="Samples")
    # plot estimation
    ax.plot(x, f_x_mle, linewidth=3, color='red', label="MLE")
    ax.legend()
# let's see how the MLE performs
plot_normal_mle()
Given $\{x_i\}_{i=1}^n$ i.i.d samples of $X \sim N(\mu, \Sigma)$, what is the MLE?
The final results are pretty much the same, but with vectors and matrices, though the math is a little more complicated. $$ \hat{\overline{\mu}}_{MLE} = \frac{1}{n} \sum_{i=1}^n \overline{x_i} $$ $$ \hat{\Sigma}_{MLE} = \frac{1}{n} \sum_{i=1}^n (\overline{x_i} - \hat{\overline{\mu}}_{MLE}) (\overline{x_i} - \hat{\overline{\mu}}_{MLE})^{T}$$
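A sketch of these estimators on synthetic data (the true parameters below are arbitrary): the MLE mean is the sample mean vector, and the MLE covariance is the biased ($1/n$) sample covariance built from outer products:

```python
import numpy as np

# Multivariate Gaussian MLE on synthetic 2-D data
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
x = rng.multivariate_normal(true_mu, true_sigma, size=50_000)  # shape (n, d)

mu_mle = x.mean(axis=0)                       # sample mean vector
centered = x - mu_mle
sigma_mle = centered.T @ centered / len(x)    # sum of outer products / n

print(mu_mle)     # close to true_mu
print(sigma_mle)  # close to true_sigma
```

The matrix form `centered.T @ centered / n` is exactly $\frac{1}{n}\sum_i (\overline{x_i} - \hat{\overline{\mu}})(\overline{x_i} - \hat{\overline{\mu}})^T$, and agrees with `np.cov(x, rowvar=False, bias=True)`.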
Using the above, we will use the following:
Given $\{x_i\}_{i=1}^n$ i.i.d samples of $X \sim \textit{Geom}(\theta)$, what is the MLE?
Assume: