Today we will continue to learn about randomness, which can come from either simulations or real-life data.
Simulations are samples obtained from an underlying probability distribution.
Ansatz: real-life data follow a certain underlying (unknown) probability distribution.
The two most common descriptive statistics for a dataset are the mean and the standard deviation.
If time allows, we will learn a bit about normalization.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# homework 2 problem 3 distance
Use the `np.random.uniform` function to obtain 100 samples (the `size=` parameter) from the uniform distribution between 0 and 1. Then use pyplot's `hist` function to plot the histogram with various numbers of bins (set the `bins` parameter to 2, 10, and 100).
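One possible solution sketch for this exercise (the figure titles are my own addition):

```python
import numpy as np
import matplotlib.pyplot as plt

samples = np.random.uniform(0, 1, size=100)  # 100 draws from U(0, 1)

# one histogram per bin count: 2 bins is too coarse, 100 is too fine for 100 samples
for bins in (2, 10, 100):
    plt.hist(samples, bins=bins, edgecolor='black')
    plt.title(f"bins = {bins}")
    plt.show()
```

With only 100 samples, 10 bins usually gives the most readable picture; 100 bins puts roughly one sample per bin, so the histogram looks like noise.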
There are two functions for the normal distribution in the `numpy.random` submodule: `np.random.normal()` and `np.random.randn()`.
# numpy array with Gaussian (normal) random numbers
np.random.normal(loc=0.0, scale=1.0, size=(2,3))
# below are samples drawn from this distribution
# np.random.randn() returns r.v. ~ N(0,1)
np.random.seed(42)
np.random.randn(2,3)
What if we want to use `np.random.randn()` to generate samples from a normal distribution with mean $\mu$ and standard deviation $\sigma$?
HINT: $(X - \mu)/\sigma \sim \mathcal{N}(0,1)$, so if $Z \sim \mathcal{N}(0,1)$ then $\sigma Z + \mu \sim \mathcal{N}(\mu, \sigma)$.
# answer
X = 2*np.random.randn(3,3) + 5 # samples drawn from N(5,2): scale by sigma=2, shift by mu=5
print(X)
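As a quick sanity check (my addition, not part of the original notebook), with a large sample the empirical mean and standard deviation should be close to 5 and 2:

```python
import numpy as np

np.random.seed(0)                     # for reproducibility
Y = 2 * np.random.randn(100_000) + 5  # scale then shift: Y ~ N(5, 2)

# empirical statistics should approximate mu = 5, sigma = 2
print(np.mean(Y), np.std(Y))
```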
N = 5000 # sample size
mu = 0.0 # mean
sigma = 1.0 # standard dev
X = np.random.normal(loc=mu, scale=sigma, size=N)
plt.axis([-6, 6, 0, N/100]) # set up the axis
plt.hist(X, bins=500) # bin width = (data range)/(number of bins)
plt.show()
words = ["love", "hate", "tender", "care", "deep"]
words = np.array(words)
np.random.choice(words)
poem = np.empty([4, 3], dtype=object)
for i in range(4):      # rows
    for j in range(3):  # columns
        poem[i, j] = np.random.choice(words)
print(poem)
# if we want to format it better: one word per line, a blank line between rows
for row in poem:
    print()
    for word in row:
        print(word)
Say I am given a numpy array `X` full of numbers (e.g. grades).
X = np.array([67, 62, 78, 67, 64, 52, 50, 80, 50, 94,
77, 62, 78, 67, 44, 52, 70, 80, 50, 94,
100, 61, 59, 56, 30, 91, 60, 54, 34, 98])
np.min(X), np.max(X)
plt.hist(X, bins=8, edgecolor='black')
plt.show()
Question: If we knew that these numbers came from a normal distribution $N(\mu, \sigma)$, what is the most likely normal distribution that this data would have come from?
Answer: It's the normal distribution $N(\mu, \sigma)$ where $\mu$ is the mean of the data and $\sigma$ is the standard deviation of the data. The mean is the average of the data; the standard deviation is the square root of the average squared distance to the mean. Better to write down the formulas:
$$\mu = \frac{1}{N}\sum_{i=1}^N x_i, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2}$$
In Python, we can use `np.mean` and `np.std` to compute these.
If $X$ is a random variable: $$\operatorname {E} [X]=\sum _{i=1}^{\infty }x_{i}\,p_{i},$$ or $$\operatorname {E} [X] = \int _{-\infty }^{+\infty }x p(x)\,dx.$$
When the mean is discussed, we usually mean the sample mean (funny word play there): the average computed on a given set of samples, that is, a set of outcomes of a probability distribution. The sample mean is only an estimate of the "actual average" (the expected value) of the underlying probability distribution. For instance, one may study how the sample mean behaves in the limit as the sample size goes to infinity. The expected value, by contrast, is determined by the distribution itself; that distribution can generate different samples with different sample means.
If $\{x_i\}$ are samples drawn from the distribution of a random variable $X$, in general
$$
\mu = \frac{1}{N}\sum_{i=1}^N x_i \neq \operatorname{E}[X].
$$
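To make this concrete, here is a small illustration (a fair six-sided die, my own example): the expected value is exactly $3.5$, while the sample mean of a finite sample typically differs from it.

```python
import numpy as np

faces = np.arange(1, 7)          # outcomes of a fair die
p = np.full(6, 1/6)              # uniform probabilities
expectation = np.sum(faces * p)  # E[X] = sum of x_i * p_i = 3.5

np.random.seed(1)
rolls = np.random.choice(faces, size=1000)  # 1000 simulated rolls
sample_mean = rolls.mean()

print(expectation, sample_mean)  # sample mean is close to, but generally not equal to, 3.5
```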
mu, sigma = np.mean(X), np.std(X)
print("mean of the samples is: ", mu)
print("standard deviation of the samples is: ", sigma)
Recall that the Gaussian distribution $N(\mu, \sigma)$ has density function:
$$ p(x) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
We will now draw the bell curve:
from math import sqrt, pi, e
N = 300 # grid size, not sample size
Z = np.linspace(0,100,N)
gaussian_func = lambda mu, sigma: (lambda x: 1/(sqrt(2*pi)*sigma) * e**(-0.5*(x - mu)*(x-mu)/(sigma**2)))
plt.plot(Z, gaussian_func(mu, sigma)(Z)) # blue curve
plt.hist(X, bins= 8, density=True, edgecolor='black')
# orange histogram, "density = True" means we plot the density histogram instead of the absolute numbers
plt.show()
Conclusion: For any data-set, the mean and standard deviation of the best-fitting Gaussian distribution are the mean and standard deviation of the data-set.
In this week's lab practice, we have seen how to import real-life data from the UCI machine learning dataset repository. Today we will try `numpy`'s built-in loading.
Download `winequality-red.csv` from https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009/ or from Canvas and put it in the same directory as this notebook.
Check the csv file using Excel on the lab computer. Import the data using the following command.
wine_data = np.loadtxt('winequality-red.csv', delimiter=',', skiprows=1) # Kaggle version.
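If the download is not at hand, the same `np.loadtxt` call can be tried on a small stand-in CSV (the file name and values below are made up for illustration only):

```python
import numpy as np

# write a tiny CSV with a header row, mimicking the layout of the Kaggle file
with open('tiny_wine.csv', 'w') as f:
    f.write('fixed acidity,pH,quality\n')
    f.write('7.4,3.51,5\n')
    f.write('7.8,3.20,5\n')

# delimiter=',' splits on commas; skiprows=1 skips the header line
data = np.loadtxt('tiny_wine.csv', delimiter=',', skiprows=1)
print(data.shape)  # (2, 3): two rows, three columns
```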
Repeat the above procedure (histogram plus best-fit bell curve) for each column of `wine_data`.
# plt.hist first, then create a linspace between the min and max of each column using a certain number of points
print("max is: ", max(wine_data[:,0]))
print("min is: ",min(wine_data[:,0]))
mu = np.mean(wine_data[:,0])
sigma = np.std(wine_data[:,0])
print("mean is:", mu)
print("standard dev is:", sigma)
z = np.linspace(4.6, 15.9, 300)
plt.plot(z,gaussian_func(mu,sigma)(z))
plt.hist(wine_data[:,0], bins= 100, density=True, edgecolor='black')
plt.show()
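One way to repeat this for every column is a loop. The sketch below uses a small synthetic stand-in array `data`, since the real CSV may not be present; with the file downloaded, `wine_data` would take its place:

```python
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt, pi, e

# same density function as above
gaussian_func = lambda mu, sigma: (lambda x: 1/(sqrt(2*pi)*sigma) * e**(-0.5*(x - mu)**2/sigma**2))

# synthetic stand-in for wine_data (the real file has 12 columns)
np.random.seed(0)
data = np.random.normal(size=(200, 3))

for col in range(data.shape[1]):
    x = data[:, col]
    mu, sigma = np.mean(x), np.std(x)
    z = np.linspace(x.min(), x.max(), 300)
    plt.plot(z, gaussian_func(mu, sigma)(z))            # best-fit bell curve
    plt.hist(x, bins=30, density=True, edgecolor='black')
    plt.show()
```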