In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.stats as st
from IPython.display import Latex,display,Math
import pandas as pd
print("Modules Imported!")
Modules Imported!

# Maximum Likelihood Estimation

## Properties of MLE

[4pts]Task: Write a function that, given samples drawn from $N(\mu,\sigma^2)$, returns the maximum likelihood estimates of $\mu$ and $\sigma^2$.

In [4]:
def NormML(y):
    # ML estimates: sample mean and (biased, 1/n) sample variance
    mu_hat = np.mean(y)
    sigma_hat = np.var(y)  # ML estimate of sigma^2
    return [mu_hat, sigma_hat]

### Bias

The tendency of an estimator to over- or under-estimate can be measured by its bias.

[10pts]Task: Prove that the estimator in the previous task is unbiased when estimating the mean but biased when estimating the variance. Find the value of the bias for the variance. Is it an over-estimation or an under-estimation? (You can turn in this part either using Markdown or attach a pdf or a picture of your solution.)


We now want to see the effect of bias via simulation.

[10pts]Task: Generate 10000 batches of $N(3,4)$ samples, each containing 10 samples. Compute the estimates for each batch and then take their averages across batches. How do these averages compare with the true values? Is there a difference between the behaviors for the mean and the variance?

In [5]:
theta0 = [3,4] #[mean,variance]
mynorm = st.norm(theta0[0],np.sqrt(theta0[1]))
N = 10000 # number of batches
n = 10 # samples per batch
y = mynorm.rvs([N,n])

# Hints: create variables as following:
# estimates for mu and sigma
# estimates for mu
# estimates for sigma
# average in lieu of expected value
# average in lieu of expected value
# print results
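One possible completion of this cell, for reference (a self-contained sketch; the variable names follow the hints above, and the seed is an assumption added for reproducibility):

```python
import numpy as np
import scipy.stats as st

theta0 = [3, 4]                          # [mean, variance]
mynorm = st.norm(theta0[0], np.sqrt(theta0[1]))
N, n = 10000, 10                         # batches, samples per batch
y = mynorm.rvs([N, n], random_state=0)

mu_hats = np.mean(y, axis=1)             # ML mean estimate, one per batch
sig2_hats = np.var(y, axis=1)            # ML (1/n) variance estimate, per batch
avg_mu = np.mean(mu_hats)                # average in lieu of expected value
avg_sig2 = np.mean(sig2_hats)            # average in lieu of expected value
print(avg_mu, avg_sig2)                  # avg_mu near 3; avg_sig2 near (n-1)/n * 4 = 3.6
```

The mean estimate averages to the true value, while the variance estimate averages below it, consistent with the $(n-1)/n$ bias factor of the ML variance estimator.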

### Consistency

Intuitively, an estimator is consistent if it gives us the correct solution when we have lots of data. To see this formally, consider a data set $\{y_1,\dotsc,y_N\}$ where the distribution of each $y_i$ depends on some parameter $\theta$. Denote the true value of $\theta$ by $\theta_0$. Let the ML estimator for this dataset be denoted as $\hat\theta_N$. We'd like to see how the accuracy of this estimator changes as $N\to\infty$. In particular, the estimator is said to be consistent if $\hat\theta_N$ gets closer and closer to the true value $\theta_0$: $$\lim_{N\to\infty} \hat\theta_N=\theta_0.$$

[13pts]Task: Generate a vector $y$ consisting of $N$ normal samples drawn from $N(\mu_0,\sigma_0^2)$ for $\mu_0=3$, $\sigma_0^2=4$ and a large value of $N$, say $N=500$. Then compute the estimates for $\mu$ and $\sigma^2$ using the first $n$ samples (that is, y[0:n]) as $n$ ranges from 1 to $N$, choosing an appropriate step size. Plot the result and discuss it in the context of consistency. Spoiler Alert: ML is consistent.

In [6]:
theta0 = [3,4] #[mean,variance]
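A possible sketch of this experiment (assumptions: a step size of 2 starting at $n=2$, so the variance estimate is always defined, and a fixed seed for reproducibility):

```python
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

theta0 = [3, 4]                          # [mean, variance]
N = 500
y = st.norm(theta0[0], np.sqrt(theta0[1])).rvs(N, random_state=0)

ns = np.arange(2, N + 1, 2)              # n >= 2 so the variance estimate is meaningful
mu_hats = np.array([np.mean(y[:n]) for n in ns])
sig2_hats = np.array([np.var(y[:n]) for n in ns])   # ML (1/n) variance

plt.plot(ns, mu_hats, label=r'$\hat\mu_n$')
plt.plot(ns, sig2_hats, label=r'$\hat\sigma^2_n$')
plt.axhline(theta0[0], color='k', ls='--')          # true mean
plt.axhline(theta0[1], color='k', ls='--')          # true variance
plt.xlabel('n')
plt.legend()
```

Both curves should wander for small $n$ and settle near the dashed true values as $n$ grows, which is exactly the consistency behavior described above.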

## Multivariate Normal Estimation

Consider the dataset $\{x_1,\dotsc,x_N\}$, where each $x_i$ is a vector of size $m$. Assuming this data is multivariate normal, i.e., $x_i\sim \mathcal N(\mu,K)$, we would like to estimate $\mu$ and $K$ using maximum likelihood.

[5pts]Task: Write a module for estimating $\mu$ and $K$. Assume that the input is an $N\times m$ array $X$, where each row $x_i$ of $X$ is an $m$-dimensional data sample, with distribution $x_i\sim \mathcal N(\mu,K)$.

In [7]:
def MultiVarNormML(X):
    # ML estimates: sample mean and (biased, 1/N) sample covariance
    muhat = np.mean(X, axis=0)
    Khat = (X - muhat).T @ (X - muhat) / X.shape[0]
    return muhat, Khat

We will use this estimator to explore relationships from a dataset about math performance for a group of students. First we read the data and then normalize the columns that we care about to have variance equal to 1. These features are:

• Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
• Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
• studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
• failures - number of past class failures (numeric: n if 1<=n<3, else 4)
• goout - going out with friends (numeric: from 1 - very low to 5 - very high)
• G3 - final grade (numeric: from 0 to 20, output target)
In [6]:
dims = ['studytime','goout','failures','Medu','Fedu','G3']
for s in dims:                     # T is the DataFrame holding the student data (read in earlier)
    T[s] = T[s]/np.std(T[s])       # normalize each feature to unit variance
X = T[dims].values

[10pts]Task: Find the ML estimate of $K$ for this data and discuss a couple of relationships that you observe. In particular, what appears to be more strongly correlated with final grade: mother's or father's education?
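Before applying the estimator to the student data, it can be sanity-checked on synthetic data with a known covariance (a self-contained sketch; the $2\times 2$ `K_true` below is made up for illustration):

```python
import numpy as np

def MultiVarNormML(X):
    # ML estimates: sample mean and (biased, 1/N) sample covariance
    muhat = np.mean(X, axis=0)
    Khat = (X - muhat).T @ (X - muhat) / X.shape[0]
    return muhat, Khat

rng = np.random.default_rng(0)
K_true = np.array([[1.0, 0.6], [0.6, 1.0]])    # a positively correlated pair
X = rng.multivariate_normal(np.zeros(2), K_true, size=5000)
muhat, Khat = MultiVarNormML(X)                # Khat should be close to K_true
```

Since the student features are normalized to unit variance, the off-diagonal entries of $\hat K$ can be read (approximately) as correlations; comparing the Medu–G3 and Fedu–G3 entries answers the question about the parents' education.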

In [8]:


### One more estimator property: Asymptotic Normality (optional, no extra credit, do at your own risk!)

From the asymptotic normality of the ML estimator we have $$\sqrt{n}(\hat\mu_{ML}-\mu)\to N(0,I^{-1}(\mu))$$ where $I$ is the Fisher information and in this case equals $I(\mu)=\frac{1}{\sigma^2}$.

And for $\sigma^2$, $$\sqrt{n}({\hat\sigma^2}_{ML}-\sigma^2)\to N(0,I^{-1}(\sigma^2)),$$ where $I(\sigma^2)=1/(2\sigma^4)$.

[15pts]Task: Using the batches and estimates from the Bias section, plot the histograms for the estimates (centered and scaled by $\sqrt{n}$ as above) and compare them with the limit distribution given on the right-hand side of the above expressions. Are the approximations by the limit distributions accurate to the same degree for both the mean and the variance? If not, why do you think that is the case? (You can try batches with more samples and verify that for large batches, the approximations become increasingly better.)
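For the mean, the comparison might look like this (a sketch reusing the Bias-section setup; the histogram of $\sqrt{n}(\hat\mu_{ML}-\mu)$ is overlaid with the $N(0,\sigma^2)$ limit density):

```python
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

mu0, sig2 = 3, 4
N, n = 10000, 10                               # batches, samples per batch
y = st.norm(mu0, np.sqrt(sig2)).rvs([N, n], random_state=0)
z = np.sqrt(n) * (np.mean(y, axis=1) - mu0)    # should approach N(0, sigma^2)

plt.hist(z, bins=50, density=True, alpha=0.5)
t = np.linspace(-8, 8, 200)
plt.plot(t, st.norm(0, np.sqrt(sig2)).pdf(t))  # limit density, I^{-1}(mu) = sigma^2
```

The analogous plot for $\sqrt{n}(\hat\sigma^2_{ML}-\sigma^2)$ against $N(0,2\sigma^4)$ fits noticeably worse at $n=10$, partly because of the $O(1/n)$ bias and skew of the variance estimate.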

[5pts]Bonus: Prove that $I(\mu)=\frac{1}{\sigma^2}$.

In [9]:


# Bayesian Estimation

## Bernoulli data and Beta prior

Suppose you are given a sample of $n$ iid Bernoulli random variables with probability of success equal to $\theta$.

[8pts]Task: Write a function that produces the frozen posterior of $\theta$ given a Beta prior with parameters $\alpha$ and $\beta$. Recall that a frozen RV is an object representing an RV with fixed parameters, e.g.,

rv = st.beta(a, b)

and can be called for example to compute the pdf at a certain point as

rv.pdf(0.3)
In [10]:
def bayesBernBeta(data, prior):
    """
    input:
    - data is an iid sample of Bernoulli RVs
    - prior=[alpha,beta], prior Beta parameters
    """
    # conjugate update: successes add to alpha, failures add to beta
    k = np.sum(data)
    posterior = st.beta(prior[0] + k, prior[1] + len(data) - k)
    return posterior

The data is given below, in three different sizes.

In [11]:
# the data is a sample of 1000 iid bernoulli rvs
sample = np.array([0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 
0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1])
small_sample = sample[:3]
med_sample = sample[:50]
large_sample = sample
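As a quick sanity check on the conjugate update (a self-contained sketch; the first three observations of `sample` are 0, 1, 0):

```python
import numpy as np
import scipy.stats as st

small = np.array([0, 1, 0])              # sample[:3]
a, b = 1, 1                              # uniform Beta(1,1) prior
k, n = int(np.sum(small)), len(small)    # 1 success, 2 failures
posterior = st.beta(a + k, b + n - k)    # Beta(2, 3) by conjugacy
print(posterior.mean())                  # posterior mean (k+1)/(n+2) = 0.4
```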

We will use three different beta priors:

• non-informative: uniform: $Beta(1,1)$.
• non-informative: Jeffreys': $Beta(1/2,1/2)$.
• informative: we believe that the prior peaks at $\theta = 1/3$ and, furthermore, we believe that the probability that $\theta<0.5$ is $60\%$.

[25pts]Task: For the last case, you must find $\alpha$ and $\beta$ parameters for a Beta prior that satisfy both conditions. Generate 9 posterior plots, one for each data/prior combination, and discuss their differences.

In [12]:
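For the informative prior, the mode condition $(\alpha-1)/(\alpha+\beta-2)=1/3$ pins down $\beta = 2\alpha-1$, and the remaining tail condition can be solved numerically (a sketch; the bracket $[1.01, 20]$ is an assumption that happens to contain the root):

```python
import scipy.stats as st
from scipy.optimize import brentq

# mode at 1/3  =>  (alpha-1)/(alpha+beta-2) = 1/3  =>  beta = 2*alpha - 1
def prob_gap(alpha):
    # gap between P(theta < 0.5) under Beta(alpha, 2*alpha-1) and the target 0.6
    return st.beta(alpha, 2 * alpha - 1).cdf(0.5) - 0.6

alpha = brentq(prob_gap, 1.01, 20.0)     # root-find the tail condition
beta = 2 * alpha - 1
print(alpha, beta)
```

The resulting $(\alpha,\beta)$ can then be passed to `bayesBernBeta` as the informative prior.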