Today we will continue to learn about randomness, which can come from either simulations or real-life data.
Simulations are samples obtained from an underlying probability distribution.
Ansatz: real-life data follow a certain underlying (unknown) probability distribution.
The two most common descriptive statistics for a dataset are the mean and the standard deviation.
If time allows, we will learn a bit about normalization.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# homework 2 problem 3 distance
Use the `np.random.uniform` function to obtain 100 samples (the `size=` parameter) from the uniform distribution between 0 and 1. Then use pyplot's `hist` function to plot the histogram with various numbers of bins (set the `bins` parameter to 2, 10, and 100).
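One possible solution sketch for this exercise (the figure titles are my own addition):

```python
import numpy as np
import matplotlib.pyplot as plt

samples = np.random.uniform(0, 1, size=100)  # 100 draws from U(0, 1)

# one histogram per bin count: 2 bins is too coarse, 100 is too fine for 100 samples
for bins in (2, 10, 100):
    plt.hist(samples, bins=bins, edgecolor='black')
    plt.title(f"bins = {bins}")
    plt.show()
```

With only 100 samples, 10 bins usually gives the most readable picture; 100 bins puts roughly one sample per bin, so the histogram looks like noise.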
There are two functions for the normal distribution in the `numpy.random` submodule: `np.random.normal()` and `np.random.randn()`.
# numpy array with Gaussian (normal) random numbers
np.random.normal(loc=0.0, scale=1.0, size=(2,3))
# below are samples drawn from this distribution
# np.random.randn() returns r.v. ~ N(0,1)
np.random.seed(42)
np.random.randn(2,3)
What if we want to use `np.random.randn()` to generate samples from a normal distribution with mean $\mu$ and standard deviation $\sigma$?
HINT: $(X - \mu)/\sigma \sim \mathcal{N}(0,1)$, so if $Z \sim \mathcal{N}(0,1)$ then $\sigma Z + \mu \sim \mathcal{N}(\mu, \sigma)$.
# answer
X = 2*np.random.randn(3,3) + 5 # samples drawn from N(5,2): scale by sigma=2, shift by mu=5
print(X)
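As a quick sanity check (my addition, not part of the original notebook), with a large sample the empirical mean and standard deviation should be close to 5 and 2:

```python
import numpy as np

np.random.seed(0)                     # for reproducibility
Y = 2 * np.random.randn(100_000) + 5  # scale then shift: Y ~ N(5, 2)

# empirical statistics should approximate mu = 5, sigma = 2
print(np.mean(Y), np.std(Y))
```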
N = 5000 # sample size
mu = 0.0 # mean
sigma = 1.0 # standard dev
X = np.random.normal(loc=mu, scale=sigma, size=N)
plt.axis([-6, 6, 0, N/100]) # set up the axis
plt.hist(X, bins=500) # bin width = (data range)/(number of bins)
plt.show()
words = ["love", "hate", "tender", "care", "deep"]
words = np.array(words)
np.random.choice(words)
poem = np.empty([4, 3], dtype=object)
for i in range(4):      # rows
    for j in range(3):  # columns
        poem[i, j] = np.random.choice(words)
print(poem)
# if we want to format it better: one word per line, a blank line between rows
for row in poem:
    print()
    for word in row:
        print(word)
Say I am given a numpy array `X` full of numbers (e.g. grades).
X = np.array([67, 62, 78, 67, 64, 52, 50, 80, 50, 94,
77, 62, 78, 67, 44, 52, 70, 80, 50, 94,
100, 61, 59, 56, 30, 91, 60, 54, 34, 98])
np.min(X), np.max(X)
plt.hist(X, bins=8, edgecolor='black')
plt.show()
Question: If we knew that these numbers came from a normal distribution $N(\mu, \sigma)$, what is the most likely normal distribution that this data would have come from?
Answer: It's the normal distribution $N(\mu, \sigma)$ where $\mu$ is the mean of the data and $\sigma$ is the standard deviation of the data. The mean is the average of the data; the standard deviation is the square root of the average squared distance to the mean. Better to write down the formulas:
$$\mu = \frac{1}{N}\sum_{i=1}^N x_i, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2}$$
In Python, we can use `np.mean` and `np.std` to compute these.
If $X$ is a random variable: $$\operatorname {E} [X]=\sum _{i=1}^{\infty }x_{i}\,p_{i},$$ or $$\operatorname {E} [X] = \int _{-\infty }^{+\infty }x p(x)\,dx.$$
When the mean is discussed, we usually mean the sample mean (funny word play there): the average computed on a given set of samples, that is, a set of outcomes of a probability distribution. The sample mean is only an estimate of the "actual average" (the expected value) of the underlying probability distribution. For instance, one may study how the sample mean behaves in the limit as the sample size goes to infinity. The expected value, by contrast, is determined by the distribution itself; that distribution can generate different samples with different sample means.
If $\{x_i\}$ are samples drawn from the distribution of a random variable $X$, in general
$$
\mu = \frac{1}{N}\sum_{i=1}^N x_i \neq \operatorname{E}[X].
$$
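To make this concrete, here is a small illustration (a fair six-sided die, my own example): the expected value is exactly $3.5$, while the sample mean of a finite sample typically differs from it.

```python
import numpy as np

faces = np.arange(1, 7)          # outcomes of a fair die
p = np.full(6, 1/6)              # uniform probabilities
expectation = np.sum(faces * p)  # E[X] = sum of x_i * p_i = 3.5

np.random.seed(1)
rolls = np.random.choice(faces, size=1000)  # 1000 simulated rolls
sample_mean = rolls.mean()

print(expectation, sample_mean)  # sample mean is close to, but generally not equal to, 3.5
```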
mu, sigma = np.mean(X), np.std(X)
print("mean of the samples is: ", mu)
print("standard deviation of the samples is: ", sigma)
Recall that the Gaussian distribution $N(\mu, \sigma)$ has density function:
$$ p(x) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
We will now draw the bell curve:
from math import sqrt, pi, e
N = 300 # grid size, not sample size
Z = np.linspace(0,100,N)
gaussian_func = lambda mu, sigma: (lambda x: 1/(sqrt(2*pi)*sigma) * e**(-0.5*(x - mu)*(x-mu)/(sigma**2)))
plt.plot(Z, gaussian_func(mu, sigma)(Z)) # blue curve
plt.hist(X, bins= 8, density=True, edgecolor='black')
# orange histogram, "density = True" means we plot the density histogram instead of the absolute numbers
plt.show()
Conclusion: For any data-set, the mean and standard deviation of the best-fitting Gaussian distribution are the mean and standard deviation of the data-set.
In this week's lab practice, we have seen how to import real-life data from the UCI machine learning dataset repository. Today we will try `numpy`'s built-in loading.
Download `winequality-red.csv` from https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009/ or from Canvas and put it in the same directory as this notebook.
Check the csv file using Excel on the lab computer. Import the data using the following command.
wine_data = np.loadtxt('winequality-red.csv', delimiter=',', skiprows=1) # Kaggle version.
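If the download is not at hand, the same `np.loadtxt` call can be tried on a small stand-in CSV (the file name and values below are made up for illustration only):

```python
import numpy as np

# write a tiny CSV with a header row, mimicking the layout of the Kaggle file
with open('tiny_wine.csv', 'w') as f:
    f.write('fixed acidity,pH,quality\n')
    f.write('7.4,3.51,5\n')
    f.write('7.8,3.20,5\n')

# delimiter=',' splits on commas; skiprows=1 skips the header line
data = np.loadtxt('tiny_wine.csv', delimiter=',', skiprows=1)
print(data.shape)  # (2, 3): two rows, three columns
```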
Repeat the above procedure (histogram plus best-fit bell curve) for each column of `wine_data`.
# plt.hist first, then create a linspace between the min and max of each column using a certain number of points
print("max is: ", max(wine_data[:,0]))
print("min is: ",min(wine_data[:,0]))
mu = np.mean(wine_data[:,0])
sigma = np.std(wine_data[:,0])
print("mean is:", mu)
print("standard dev is:", sigma)
z = np.linspace(4.6, 15.9, 300)
plt.plot(z,gaussian_func(mu,sigma)(z))
plt.hist(wine_data[:,0], bins= 100, density=True, edgecolor='black')
plt.show()
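One way to repeat this for every column is a loop. The sketch below uses a small synthetic stand-in array `data`, since the real CSV may not be present; with the file downloaded, `wine_data` would take its place:

```python
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt, pi, e

# same density function as above
gaussian_func = lambda mu, sigma: (lambda x: 1/(sqrt(2*pi)*sigma) * e**(-0.5*(x - mu)**2/sigma**2))

# synthetic stand-in for wine_data (the real file has 12 columns)
np.random.seed(0)
data = np.random.normal(size=(200, 3))

for col in range(data.shape[1]):
    x = data[:, col]
    mu, sigma = np.mean(x), np.std(x)
    z = np.linspace(x.min(), x.max(), 300)
    plt.plot(z, gaussian_func(mu, sigma)(z))            # best-fit bell curve
    plt.hist(x, bins=30, density=True, edgecolor='black')
    plt.show()
```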