Lecture 34: A Look Ahead; Examples of Regression and Sampling from a Finite Population

Stat 110, Prof. Joe Blitzstein, Harvard University

The Top-10 List

1. Conditioning ... is the soul of statistics!
2. Symmetry ... is powerful but dangerous
3. Random variables and their distributions
4. Stories (proofs, backgrounds of the distributions covered)
5. Linearity
6. Indicator random variables
7. LOTUS
8. Law of Large Numbers
9. Central Limit Theorem
10. Markov Chains

Items 1 through 4 deal with the Big Picture™ questions: What is randomness? How do we think about uncertainty?

Items 5 through 7 are for computing expected values (mean, variance & standard deviation).

Items 8 through 10 are important for understanding long-run behavior.

Where to go from here?

Some topics to study from here on out:

Statistical inference (we have data, need to estimate parameters or make predictions)
Regression & linear models
Finance
Computational biology
Stochastic processes

Advice

Learn R
Learn C
Read Mostly Harmless Econometrics

Ex. A Simple Linear Regression

You've seen this before:

$\begin{align} Y &= \beta_0 + \beta_1 \, X + \epsilon \end{align}$

We want to use $X$ to predict $Y$.
The $\beta_j$ are the linear coefficients, with $\beta_0$ the intercept: the expected value of $Y$ when $X=0$ (the default value).
$\epsilon$ is the error term (since $X$ does not predict $Y$ perfectly).
A common assumption is $\mathbb{E}(\epsilon | X) = 0$ (the error is centered at 0 given $X$; the distribution of $\epsilon$ may or may not be normal).

So how would we solve for $\beta_1$?

We can start by treating $Cov$ as an operator: take the covariance of both sides with $X$ and expand using bilinearity.

$\begin{align} Cov(Y, X) &= Cov\left( \beta_0 + \beta_1 \, X + \epsilon, \, X \right) \\ &= Cov(\beta_0, X) + Cov(\beta_1 \, X, X) + Cov(\epsilon, X) &\quad \text{by bilinearity} \\ \\ \text{now } Cov(\beta_0, X) &= 0 &\quad \text{since } Cov \text{ of a constant with anything is } 0 \\ \\ \text{and } Cov(\beta_1 \, X, X) &= \beta_1 \, Cov(X, X) &\quad \text{pulling out the constant } \beta_1 \\ &= \beta_1 \, Var(X) &\quad \text{by definition of } Var \\ \\ \text{and since } \mathbb{E}(\epsilon) &= \mathbb{E}\left( \mathbb{E}(\epsilon|X) \right) = \mathbb{E}(0) = 0 &\quad \text{by Adam's Law} \\ \text{and further } \mathbb{E}(\epsilon \, X) &= \mathbb{E}\left( \mathbb{E}(\epsilon \, X | X) \right) &\quad \text{by Adam's Law} \\ &= \mathbb{E}\left( X \, \mathbb{E}(\epsilon | X) \right) &\quad \text{since } X \text{ is known, we can pull it out} \\ &= \mathbb{E}(0) \\ &= 0 \\ \text{so } Cov(\epsilon, X) &= \mathbb{E}(\epsilon \, X) - \mathbb{E}(\epsilon) \, \mathbb{E}(X) \\ &= 0 - 0 = 0 \\ \\ \text{hence } Cov(Y, X) &= \beta_1 \, Var(X) \\ \Rightarrow \beta_1 &= \frac{Cov(X,Y)}{Var(X)} &\quad \text{(population version)} \end{align}$
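As a sanity check on this result, here is a minimal simulation sketch. The parameter values, the exponential distribution for $X$, and the uniform distribution for $\epsilon$ are all assumptions chosen for illustration; the only property that matters is $\mathbb{E}(\epsilon | X) = 0$, which holds here because $\epsilon$ is independent of $X$ and has mean 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true parameters, chosen only for illustration
beta_0, beta_1 = 2.0, 0.5
n = 200_000

# X is deliberately non-normal; eps is uniform on (-1, 1) and independent of X,
# so E(eps | X) = 0 holds without any normality assumption
X = rng.exponential(scale=3.0, size=n)
eps = rng.uniform(-1.0, 1.0, size=n)
Y = beta_0 + beta_1 * X + eps

# Sample version of Cov(X, Y) / Var(X)
cov_matrix = np.cov(X, Y)
beta_1_hat = cov_matrix[0, 1] / cov_matrix[0, 0]

print(beta_1_hat)  # should land close to the assumed beta_1 = 0.5
```

With a large simulated sample, the ratio of sample covariance to sample variance should settle near the assumed slope, even though neither $X$ nor $\epsilon$ is normal.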

Calculate $\beta_1$ with $Cov(X,Y)$ and $Var(X)$

Here we calculate $\beta_1 = \frac{Cov(X,Y)}{Var(X)} \,$ using numpy.cov:

In [1]:

import numpy as np

X = np.array([95, 85, 80, 70, 60])
Y = np.array([85, 95, 70, 65, 70])

# numpy.cov(X, Y) returns the matrix
# [ Cov(X,X), Cov(X,Y)]
# [ Cov(X,Y), Cov(Y,Y)]
covM = np.cov(X,Y)

beta_1 = covM[0,1]/covM[0,0]
beta_1

Out[1]:

0.64383561643835618

Calculate $\beta_1$ via sklearn LinearRegression API

For comparison's sake, we also obtain $\beta_1$ via sklearn.linear_model.LinearRegression:

In [2]:

from sklearn import linear_model

regr = linear_model.LinearRegression()
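# scikit-learn expects a 2-D feature array of shape (n_samples, n_features);
# reshape is used instead of np.matrix, which recent versions reject
regr.fit(X.reshape(-1, 1), Y).coef_[0]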

Out[2]:

0.64383561643835607
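The intercept follows from the same model: taking expectations of both sides and using $\mathbb{E}(\epsilon) = 0$ gives $\mathbb{E}(Y) = \beta_0 + \beta_1 \, \mathbb{E}(X)$, so $\beta_0 = \mathbb{E}(Y) - \beta_1 \, \mathbb{E}(X)$. The sketch below applies the sample version of this to the same five data points and cross-checks it against numpy.polyfit; using polyfit as the reference is a choice made here, not something from the lecture.

```python
import numpy as np

X = np.array([95, 85, 80, 70, 60])
Y = np.array([85, 95, 70, 65, 70])

# Slope from Cov(X, Y) / Var(X), as in the cell above
cov_matrix = np.cov(X, Y)
beta_1 = cov_matrix[0, 1] / cov_matrix[0, 0]

# Intercept from the sample version of beta_0 = E(Y) - beta_1 * E(X)
beta_0 = Y.mean() - beta_1 * X.mean()

# Cross-check: a degree-1 least-squares fit returns (slope, intercept)
slope, intercept = np.polyfit(X, Y, deg=1)

print(beta_0, beta_1)
print(intercept, slope)
```

Both lines should print the same pair of numbers up to floating-point rounding.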

Ex. Sampling from a Finite Population

Here's the set-up:

We have a finite population of size $N$.
Let $Y_1, Y_2, \dots, Y_N$ be the values of some quantity of interest (height, weight, opinion).
Each person in the population can be uniquely identified.
The $Y_j$ are fixed, non-random values, but nevertheless unknown.
We use some sampling scheme to obtain a sample of size $n$.
Using this sample, we want to estimate the sum of the $Y_j$ (or perhaps their average).
Assume that the probability that person $j$ ends up in our sample (the inclusion probability) is $p_j$, and that its true value is known.
Our sample data takes the form $(X_1,Z_1),(X_2,Z_2),\dots,(X_n,Z_n)$, where
- $Z_j$ is the ID of the $j^{th}$ person in our sample
- $X_j = Y_{Z_j}$ is that person's value

The difference between $X_j$ and $Y_j$

It is important to understand the difference between $X_j$ and $Y_j$ :

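$Y_j$ is a fixed, non-random value: the value belonging to person $j$ in the population.
$X_j = Y_{Z_j}$ is a random variable: it is the value of the $j^{th}$ person in our sample, and who that person is depends on the random sampling (each person $i$ ends up in the sample with probability $p_i$).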
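A small sketch can make the distinction concrete. The population values below are hypothetical, IDs are 0-based array indices, and simple random sampling without replacement is assumed purely for illustration; the lecture leaves the sampling scheme general.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: Y is fixed once and never changes
Y = np.array([150.0, 162.0, 171.0, 158.0, 180.0, 166.0])
N, n = len(Y), 3

def draw_sample():
    # Z holds the (random) IDs of the sampled people;
    # sampling without replacement is an assumption of this sketch
    Z = rng.choice(N, size=n, replace=False)
    # X_j = Y_{Z_j}: fixed values looked up at random IDs
    X = Y[Z]
    return Z, X

Z1, X1 = draw_sample()
Z2, X2 = draw_sample()

print(Z1, X1)
print(Z2, X2)
```

Running `draw_sample` repeatedly leaves `Y` untouched while `Z` and `X` vary from draw to draw, which is exactly the sense in which the $X_j$ are random and the $Y_j$ are not.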

How do we get an unbiased estimator for the total?

Let $t_y$ be the true population total $\sum_{i=1}^{N} Y_i$. How can we use random sampling of this finite population to construct an estimator $\hat{t}_y$?

The claim is that $\hat{t}_y = \sum_{j=1}^{n} \frac{X_j}{p_{Z_j}}$ is the unbiased estimator of $t_y$ we are looking for.

$\begin{align} \hat{t}_y &= \sum_{j=1}^{n} \frac{X_j}{p_{Z_j}} \\ &= \sum_{j=1}^{N} \frac{I_j \, Y_j}{p_j} &\quad \text{where } I_j = 1 \text{ if person } j \text{ is included in the sample, } 0 \text{ otherwise}\\\\ \mathbb{E}(\hat{t}_y) &= \mathbb{E}\left( \sum_{j=1}^{N} \frac{I_j \, Y_j}{p_j} \right) \\ &= \sum_{j=1}^{N} \frac{\mathbb{E}(I_j) \, Y_j}{p_j} &\quad \text{by linearity} \\ &= \sum_{j=1}^{N} \frac{p_j \, Y_j}{p_j} &\quad \text{since } \mathbb{E}(I_j) = P(I_j = 1) = p_j \\ &= \boxed{ \sum_{j=1}^{N} Y_j } = t_y \end{align}$

This is known as the Horvitz-Thompson estimator; the technique is also called inverse probability weighting.
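The unbiasedness claim can be checked numerically. The sketch below is one possible set-up, not the lecture's: the population values and inclusion probabilities are hypothetical, and each person is assumed to be included independently with their own probability, so the sample size is random rather than a fixed $n$. The linearity argument does not need independence; it is assumed here only because it is easy to simulate.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical fixed population values and known inclusion probabilities
Y = np.array([3.0, 8.0, 1.0, 12.0, 5.0, 7.0])
p = np.array([0.2, 0.5, 0.1, 0.9, 0.4, 0.6])
t_y = Y.sum()  # true total (unknown in practice)

# Assumed scheme: person j is included independently with probability p_j.
# Each row of I is one simulated sample, written as indicators I_1, ..., I_N.
reps = 100_000
I = rng.random((reps, len(Y))) < p

# Horvitz-Thompson estimate for each simulated sample: sum of I_j * Y_j / p_j
estimates = (I * (Y / p)).sum(axis=1)

print(t_y, estimates.mean())  # the average estimate should be close to t_y
print(estimates.std())        # individual estimates can still vary a lot
```

The average over many simulated samples tracks the true total, while the spread of the individual estimates hints at why unbiasedness alone does not make an estimator good.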

But is an unbiased estimator good?

Basu's Circus Elephants

Statistics is not easy: it takes effort to keep your eyes open and to question whether a tentative method is really going to yield a sensible answer. Here is an anecdote illustrating how blindly applying the Horvitz-Thompson estimator can end in disaster.

The circus owner is planning to ship his 50 adult elephants and so he needs a rough estimate of the total weight of the elephants. As weighing an elephant is a cumbersome process, the owner wants to estimate the total weight by weighing just one elephant. Which elephant should he weigh?
So the owner looks back on his records and discovers a list of the elephants' weights taken 3 years ago. He finds that 3 years ago Sambo the middle-sized elephant was the average (in weight) elephant in his herd. He checks with the elephant trainer who reassures him (the owner) that Sambo may still be considered to be the average elephant in the herd. Therefore, the owner plans to weigh Sambo and take 50y (where y is the present weight of Sambo) as an estimate of the total weight of the 50 elephants.
But the circus statistician is horrified when he learns of the owner's proposed sampling plan. "How can you get an unbiased estimate of Y this way?", protests the statistician.
So, together they work out a compromise sampling plan. With the help of a table of random numbers they devise a plan that allots a selection probability of 99/100 to Sambo and equal selection probabilities of 1/4900 to each of the other 49 elephants. Naturally, Sambo is selected and the owner is happy.
"How are you going to estimate Y?", asks the statistician.
"Why? The estimate ought to be 50y of course," says the owner.
"Oh! No! That cannot possibly be right," says the statistician, "I recently read an article in the Annals of Mathematical Statistics where it is proved that the Horvitz-Thompson estimator is the unique hyperadmissible estimator in the class of all generalized polynomial unbiased estimators."
"What is the Horvitz-Thompson estimate in this case?" asks the owner, duly impressed.
"Since the selection probability for Sambo in our plan was 99/100," says the statistician, "the proper estimate of Y is 100y/99 and not 50y."
"And, how would you have estimated Y", inquires the incredulous owner, "if our sampling plan made us select, say, the big elephant Jumbo?"
"According to what I understand of the Horvitz-Thompson estimation method," says the unhappy statistician, "the proper estimate of Y would then have been 4900y, where y is Jumbo's weight".
That is how the statistician lost his circus job (and perhaps became a teacher of statistics)
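The arithmetic behind the story can be sketched as follows. The elephant weights are hypothetical (the anecdote gives none); only the selection probabilities 99/100 and 1/4900 come from the story. Since a single elephant is sampled, its selection probability is also its inclusion probability.

```python
import numpy as np

# Hypothetical weights in kg: 49 other elephants, plus Sambo (index 0)
# set to their average so that he is "the average elephant"
others = np.linspace(3000.0, 5000.0, 49)
sambo = others.mean()
weights = np.concatenate(([sambo], others))

# Selection probabilities from the story
p = np.array([99 / 100] + [1 / 4900] * 49)

true_total = weights.sum()

# Horvitz-Thompson estimate if elephant i is the one selected: y_i / p_i
ht_estimates = weights / p

# Expected value of the estimator over the sampling plan
expected_ht = np.sum(p * ht_estimates)

print(true_total)               # the quantity being estimated
print(50 * sambo)               # the owner's estimate, 50y
print(ht_estimates[0])          # HT estimate when Sambo is selected, 100y/99
print(ht_estimates[1:].min())   # smallest HT estimate otherwise, 4900y
print(expected_ht)              # equals true_total, so the estimator is unbiased
```

Under these assumed weights the estimator is exactly unbiased, yet every value it can actually take is far from the truth: much too small when Sambo is selected and much too large otherwise.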

View Lecture 34: A Look Ahead | Statistics 110 on YouTube.