Term Definitions

TODO: Complete

  • $\hat Y$ : Predicted $Y$, aka the fitted value
  • $\bar Y$: Sample Mean
  • $\mu$: Population Mean
  • $\beta_n$: model coefficient
  • $s$: Sample standard deviation
  • $\sigma$: Population standard deviation

Coursera Regressions Week 1

Least Squares

In [15]:
import pandas as pd
import matplotlib.pyplot as plt
import rpy2
from IPython.display import YouTubeVideo
%load_ext rpy2.ipython
%matplotlib inline
In [2]:
heig = pd.DataFrame.from_csv('data/Galton.csv', index_col=None)
heig = pd.melt(heig, value_vars = ['child', 'parent'], value_name = 'height')
In [3]:
%%R -i heig
require(ggplot2)
df <- as.data.frame(heig)
g <- ggplot(df, aes(height, fill=variable)) + geom_histogram(binwidth=1) + theme_bw()
g <- g + facet_grid(~variable)

g
Loading required package: ggplot2

Find the middle via least squares

Let $Y_i$ be the height of child $i$ for $i = 1, \ldots, n = 928$, and define the middle as the value of $\mu$ that minimizes

$\sum\limits_{i=1}^n (Y_i - \mu)^2$

  • The middle gives us the 'physical' center of mass of the graph
  • You might have guessed that the answer that minimizes the squared distance above is $\mu = \bar{Y}$

Experiment

$\sum\limits_{i=1}^n (Y_i - \mu)^2$

Let's try different values of $\mu$ to see where the minimal value is above

In [4]:
child = heig[heig['variable'] == 'child']
In [5]:
import numpy as np
child.head()
Out[5]:
variable height
0 child 61.7
1 child 61.7
2 child 61.7
3 child 61.7
4 child 61.7
In [6]:
hrange = np.arange(child['height'].min(), child['height'].max(), .5)
In [7]:
MSE = lambda x: ((child.height - x)**2).mean()
In [8]:
MSEdf = pd.DataFrame([(_, MSE(_)) for _ in hrange], columns = ['mu', 'MSE'])

Let's plot different values for $\mu$ and see what we get

In [9]:
pd.DataFrame.plot(MSEdf, x = 'mu', y='MSE', title='Minimizing Mu');
In [10]:
child.mean()
Out[10]:
height    68.08847
dtype: float64

<img "source=(/images/least_squares.proof.jpg">

The math follows as

$$ \begin{align} \sum_{i=1}^n (Y_i - \mu)^2 & = \ \sum_{i=1}^n (Y_i - \bar Y + \bar Y - \mu)^2 \\ & = \sum_{i=1}^n (Y_i - \bar Y)^2 + \ 2 \sum_{i=1}^n (Y_i - \bar Y) (\bar Y - \mu) +\ \sum_{i=1}^n (\bar Y - \mu)^2 \\ & = \sum_{i=1}^n (Y_i - \bar Y)^2 + \ 2 (\bar Y - \mu) \sum_{i=1}^n (Y_i - \bar Y) +\ \sum_{i=1}^n (\bar Y - \mu)^2 \\ & = \sum_{i=1}^n (Y_i - \bar Y)^2 + \ 2 (\bar Y - \mu) (\sum_{i=1}^n Y_i - n \bar Y) +\ \sum_{i=1}^n (\bar Y - \mu)^2 \\ & = \sum_{i=1}^n (Y_i - \bar Y)^2 + \sum_{i=1}^n (\bar Y - \mu)^2\\ & \geq \sum_{i=1}^n (Y_i - \bar Y)^2 \ \end{align} $$

  • $\mu$ : the candidate values (our different guesses for the middle)
  • $\bar{Y}$ : the value of $\mu$ that minimizes $\sum\limits_{i=1}^n (Y_i - \mu)^2$

He adds and subtracts $\bar{Y}$ and then expands and simplifies.

The end result is that for any arbitrary $\mu$, the sum is greater than or equal to the sum when $\mu = \bar{Y}$, with equality only at $\mu = \bar{Y}$.
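A quicker way to see this (my own addition, not from the lecture) is to take the derivative with respect to $\mu$ and set it to zero:

$$ \frac{d}{d\mu}\sum_{i=1}^n (Y_i - \mu)^2 = -2\sum_{i=1}^n (Y_i - \mu) = 0 \implies n\mu = \sum_{i=1}^n Y_i \implies \mu = \bar Y $$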

Comparing children's heights and their parents' heights

In [11]:
# reimport so I don't have to unmelt
heig = pd.DataFrame.from_csv('data/Galton.csv', index_col=None)
# get counts of values for the bubble plot
h = pd.DataFrame({'count' : heig.groupby( [ "child", "parent"] ).size()}).reset_index()
h[:10]
Out[11]:
child parent count
0 61.7 64.0 1
1 61.7 64.5 1
2 61.7 65.5 1
3 61.7 68.5 1
4 61.7 70.5 1
5 62.2 64.5 1
6 62.2 66.5 3
7 62.2 67.5 3
8 63.2 64.0 2
9 63.2 64.5 4
In [12]:
%%R -i h
require(ggplot2)
require(hexbin)
df <- as.data.frame(h)
g <- ggplot(df, aes(parent, child, size=count)) + geom_point() + ggtitle("Bubble Plot of Height")
g + theme_bw()
Loading required package: hexbin

Regression through the origin

Suppose that $X_i$ are the parent's heights

Consider picking the slope $\beta$ that minimizes

$\sum\limits_{i=1}^n (Y_i - X_i\beta)^2$ -- this in effect minimizes the (squared) error terms $\epsilon_i$

This uses the origin as a pivot point, picking the line that minimizes the sum of the squared vertical distances from the points to the line

A line through the origin is:

$y = x\beta$

  • It has a y-intercept of 0.
  • the slope is $\beta$
  • $x_i\beta=\hat{y}$ is the fitted value
  • $y_i - \hat{y}$ is the vertical distance between the data point and the fitted line
  • it is NOT the orthogonal distance
  • we don't really want the line to go through the origin
  • if we subtract the means so that the origin is the mean of the parents' and child's heights, then going through the origin makes sense
  • the origin is the average of both variables
  • $\mu$: population mean
  • $\bar X$: sample mean
  • $\sigma$: population standard deviation
  • $s$: sample standard deviation
In [13]:
col = 'child'
h[col] = h[col] - h[col].mean()
col = 'parent'
h[col] = h[col] - h[col].mean()
In [14]:
%%R -i h
require(ggplot2)
require(hexbin)
df <- as.data.frame(h)
g <- ggplot(df, aes(parent, child, size=count)) + geom_point() + ggtitle("Bubble Plot Height De-meaned") + theme_bw()
g2 <- geom_smooth(data=df, aes(parent, child, weight=count), colour='red', method=lm, se=FALSE)
g3 <- geom_smooth(data=df, aes(parent, child), colour='green', method=lm, se=FALSE)
g4 <- geom_smooth(data=df, aes(parent, child), colour='blue', method=loess, se=FALSE)
g + g2 + g3 + g4 + theme(legend.position="none")

Note the difference in the linear fit when the size of the dots is considered.

  • red = size considered
  • green = size not considered
  • blue = loess

Find the best $\beta$

find:

$\min\sum\limits_{i=1}^n (Y_i - X_i\beta)^2$

In [15]:
col = 'child'
heig[col] = heig[col] - heig[col].mean()
col = 'parent'
heig[col] = heig[col] - heig[col].mean()
In [16]:
MSE = lambda beta: ((heig['child'] - heig['parent'] * beta)**2).mean()
In [17]:
# try slopes between -3 and 3
betas = np.arange(-3, 3, .01)
In [18]:
df = pd.DataFrame([(beta, MSE(beta),) for beta in betas], columns = ['beta', 'MSE'])
In [19]:
df.plot(x = 'beta', y='MSE', title = 'finding the best Beta empirically');
In [20]:
min_beta = float(df[df['MSE']==df['MSE'].min()]['beta'])
df[df['MSE']==df['MSE'].min()]
Out[20]:
beta MSE
365 0.65 5.000338
In [21]:
float(min_beta)
Out[21]:
0.6499999999999222

Let's fit the linear regression and see if we get the same answer

Using scipy

In [22]:
from scipy.stats import linregress
In [23]:
slope, intercept, rvalue, pvalue, stderr = linregress(heig['parent'], heig['child'])
In [24]:
slope
Out[24]:
0.64629058199363942
In [25]:
MSE(slope)
Out[25]:
5.0002937655516639

Using StatsModels

In [26]:
import statsmodels.formula.api as sm
In [27]:
result = sm.ols(formula="child ~ parent", data=heig).fit()

Some Basic Notation and Background

We write $X_1, X_2, ... X_n$ to describe $n$ data points

  • Uppercase $X$ for a random variable
  • lowercase $x$ for a particular value
  • greek letters for things we don't know

Empirical Mean

$\bar{X} = \dfrac{1}{n}\sum\limits_{i=1}^n X_i$

If we subtract the mean from all data points, we get data that has a mean of 0.

$\tilde{X_i} = X_i - \bar{X}$, and the mean of the $\tilde{X_i}$ is $0$

Remember the mean is the least squares solution for minimizing:

$\sum\limits_{i=1}^n (X_i - \mu)^2$

The empirical standard deviation and variance

$\text{Variance: } s^2 = \dfrac{1}{n-1}\sum\limits_{i = 1}^n {\left( {X_i - \bar X} \right)^2 } $

Normalization

$Z_i = \dfrac{X_i - \bar X}{s}$

  • Normalizing gives an empirical mean of zero and an empirical standard deviation of one
  • Its units are described in standard deviations from the mean
  • A z value of 2 means the normalized observation is 2 standard deviations from the mean

The empirical covariance

of $n$ observations is ..

$Cov(X, Y) = \dfrac{1}{n-1}\sum\limits_{i=1}^n(X_i - \bar X)(Y_i - \bar Y) = \dfrac{1}{n-1}\left(\sum\limits_{i=1}^nX_iY_i - n\bar X\bar Y\right)$

  • think of covariance as a function that takes two vectors as an input
  • its units are the units of $X$ times the units of $Y$

Correlation is:

$ Cor(X, Y) = \dfrac{Cov(X, Y)}{S_xS_y}$

Where

  • $S_x$ is the estimate of the standard deviation for the $X$ observations
  • $S_y$ is the estimate of the standard deviation for the $Y$ observations
  • Unit free, between -1 and 1

$Cor(X, Y) = \pm 1$ only when the $X$ and $Y$ observations fall perfectly on a positively or negatively sloped line

$Cor(X, Y) = 0$ implies no linear relationship
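A tiny sketch (my own addition) to drive home the "linear" caveat: a perfect but symmetric nonlinear relationship can have correlation of essentially zero.

# correlation ~ 0 despite a perfect (nonlinear) relationship y = x^2
xs = pd.Series(np.linspace(-1, 1, 101))
ys = xs**2
print(xs.corr(ys))  # ~0 (up to floating point)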

Least squares estimation of regression lines

  • Let $Y_i$ be the $i^{th}$ child's height
  • Let $X_i$ be the $i^{th}$ parents' average height

Consider finding the best line

Child's Height = $\beta_0$ + Parent's Height $\cdot \beta_1$

Use least squares

$ \sum\limits_{i=1}^n(Y_i - (\beta_0 + \beta_1X_i))^2 $

  • the difference between the actual and the predicted

$ \sum\limits_{i=1}^n{\big(Y_i - \hat Y\big)}^2 $ - the difference between the actual and the predicted

So how do we actually do it?

Mean only regression

So after a lot of stuff...

$ \hat\beta_1 = \dfrac{\sum\limits_{i=1}^n Y_iX_i}{\sum\limits_{i=1}^nX_i^2} $

When we restrict the line through the origin
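That formula comes from setting the derivative of the through-the-origin criterion to zero (my own fill-in of the step, not shown in the lecture):

$$ \frac{d}{d\beta}\sum_{i=1}^n (Y_i - X_i\beta)^2 = -2\sum_{i=1}^n X_i(Y_i - X_i\beta) = 0 \implies \hat\beta_1 = \frac{\sum_{i=1}^n Y_iX_i}{\sum_{i=1}^n X_i^2} $$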

Figuring out the final forms we can use to get $\beta_n$

... lots of math... which can be summarized as

$\hat \beta_1 = Cor(Y, X) \dfrac{Sd(Y)}{Sd(X)}$

also stated as

$\hat \beta_1 = \dfrac{\sum(X_i-\bar X)(Y_i - \bar Y)}{\sum(X_i - \bar X)^2}$

and

$\hat \beta_0 = \bar Y - \hat \beta_1 \bar X$

And our regression line always travels through $(\bar X, \bar Y)$

In [28]:
heig = pd.DataFrame.from_csv('data/Galton.csv', index_col=None)
heig.head()
Out[28]:
child parent
0 61.7 70.5
1 61.7 68.5
2 61.7 65.5
3 61.7 64.5
4 61.7 64.0
In [29]:
y = 'child'
x = 'parent'

beta1 = heig[y].corr(heig[x]) * heig[y].std() / heig[x].std()
In [30]:
beta0 = heig[y].mean() - beta1 * heig[x].mean()
beta0
Out[30]:
23.941530180421658
In [31]:
beta1
Out[31]:
0.64629058199351186
In [32]:
result = sm.ols(formula="child ~ parent", data=heig).fit()
In [33]:
result.params
Out[33]:
Intercept    23.941530
parent        0.646291
dtype: float64
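A quick sanity check (my own addition, using the beta0 and beta1 just computed): the fitted line passes through $(\bar X, \bar Y)$.

# plugging the mean parent height into the fitted line should return the mean child height
x_bar = heig['parent'].mean()
y_bar = heig['child'].mean()
print(beta0 + beta1 * x_bar)  # equals y_bar (up to floating point)
print(y_bar)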

Regression to the Mean

$P(Y < x| X=x)$ gets bigger as $x$ heads to very large values

$P(Y > x| X=x)$ gets bigger as $x$ heads to very small values
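A minimal simulation sketch (my own addition, not from the lecture): for standardized $(X, Y)$ with correlation $\rho < 1$, $E[Y \mid X = x] = \rho x$, so extreme $X$ values tend to be paired with less extreme $Y$ values.

# simulate standardized (X, Y) with correlation rho = 0.5
rho, n_sim = 0.5, 100000
xs = np.random.normal(size=n_sim)
ys = rho * xs + np.sqrt(1 - rho**2) * np.random.normal(size=n_sim)
big = xs > 2                       # condition on very large X values
print((ys[big] < xs[big]).mean())  # close to 1: P(Y < x | X = x) is large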

Quiz

q1

In [34]:
# q1 - what's a better way to do this?
x = [0.18, -1.54, 0.42, 0.95]
w = [2, 1, 3, 1]

Give the value of $\mu$ that minimizes $\sum\limits_{i=1}^nw_i(x_i - \mu)^2$

In [35]:
df = pd.DataFrame([x, w]).T
df.columns = ['x', 'w']
df
Out[35]:
x w
0 0.18 2
1 -1.54 1
2 0.42 3
3 0.95 1
In [36]:
square = lambda x: x**2
f1 = lambda mu: (df['w'] * (df['x'] - mu).apply(square)).sum()
In [37]:
mus = np.arange(0, 1, .01)
f1(.18)
Out[37]:
3.7241
In [38]:
mus = pd.DataFrame([(mu, f1(mu),) for mu in mus], columns = ['mu', 'fx'])
In [39]:
pd.DataFrame.plot(mus, x = 'mu', y='fx', title='Minimizing Mu');
In [40]:
mus[mus['fx']==mus['fx'].min()]
Out[40]:
mu fx
15 0.15 3.7166
In [41]:
# q2
x = [0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42]
y = [1.39, 0.72, 1.55, 0.48, 1.19, -1.59, 1.23, -0.65, 1.49, 0.05]
df = pd.DataFrame([x, y]).T
df.columns = ['x', 'y']
df.head()
Out[41]:
x y
0 0.80 1.39
1 0.47 0.72
2 0.51 1.55
3 0.73 0.48
4 0.36 1.19
In [42]:
#result = sm.ols(formula="child ~ parent", data=heig).fit()
# y intercept set to zero
result = sm.ols(formula="y ~ x - 1", data=df).fit()
In [43]:
result.summary()
/Users/idris/.virtualenvs/regres/lib/python2.7/site-packages/scipy/stats/stats.py:1205: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
  int(n))
Out[43]:
OLS Regression Results
Dep. Variable: y R-squared: 0.183
Model: OLS Adj. R-squared: 0.092
Method: Least Squares F-statistic: 2.018
Date: Tue, 30 Sep 2014 Prob (F-statistic): 0.189
Time: 15:48:14 Log-Likelihood: -14.561
No. Observations: 10 AIC: 31.12
Df Residuals: 9 BIC: 31.42
Df Model: 1
coef std err t P>|t| [95.0% Conf. Int.]
x 0.8263 0.582 1.421 0.189 -0.490 2.142
Omnibus: 2.980 Durbin-Watson: 3.044
Prob(Omnibus): 0.225 Jarque-Bera (JB): 1.581
Skew: -0.959 Prob(JB): 0.454
Kurtosis: 2.658 Cond. No. 1.00

So that's the code way to do it. Let's do it symbolically too.

$\beta_1 = \dfrac{\sum(X_i - \bar X)(Y_i - \bar Y)}{\sum(X_i - \bar X)^2}$

wait... that's close, but it's the formula for the general slope.

For a RTO ("regression through origin") there is no intercept, so the slope comes from minimizing $\sum(Y_i - X_i\beta)^2$ directly, which amounts to dropping the $\bar X$ and $\bar Y$ terms from the formula above:

$\hat\beta_1 = \dfrac{\sum X_iY_i}{\sum X_i^2}$

In [44]:
x = 'x'
y = 'y'
num = (df[x] * df[y]).sum()
den = (df[x]).apply(square).sum()
In [45]:
num / den
Out[45]:
0.82625166087128599

q3

In [46]:
%%R
require(datasets)
data(mtcars)
head(mtcars)
lm(mpg~wt, data=mtcars)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt  
     37.285       -5.344  
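A hedged Python mirror of the same fit (my own addition): this assumes the mtcars data frame has been pulled into Python, e.g. via %%R -o mtcars as done later in these notes.

# assumes: mtcars is available as a pandas DataFrame
sm.ols("mpg ~ wt", data=mtcars).fit().params  # Intercept ~ 37.285, wt ~ -5.344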

q4

Consider data with an outcome (Y) and a predictor (X). The standard deviation of the predictor is one half that of the outcome. The correlation between the two variables is .5. What value would the slope coefficient be for the regression model with Y as the outcome and X as the predictor?

  • $Cor(X, Y) = \frac{1}{2}$
  • $Sd(Y) = 2 \cdot Sd(X)$
  • $\hat \beta_1 = Cor(Y, X) \dfrac{Sd(Y)}{Sd(X)}$
  • $\hat \beta_1 = \frac{1}{2} \cdot \dfrac{2\,Sd(X)}{Sd(X)} = 1$

q5

Students were given two hard tests and scores were normalized to have empirical mean 0 and variance 1. The correlation between the scores on the two tests was 0.4. What would be the expected score on Quiz 2 for a student who had a normalized score of 1.5 on Quiz 1?

$Cor(q1, q2) = 0.4 $

$ Z_{q1} = 1.5 $

In [47]:
1.5 * 0.4  # for normalized scores, E[Z_q2 | Z_q1] = Cor(q1, q2) * Z_q1
Out[47]:
0.6000000000000001

q6

In [48]:
x = [8.58, 10.46, 9.01, 9.64, 8.86]
s = pd.Series(x)
In [49]:
# normalized values
ns = (s - s.mean())/s.std()
ns
Out[49]:
0   -0.971866
1    1.531022
2   -0.399397
3    0.439337
4   -0.599095
dtype: float64
In [50]:
ns.mean()
Out[50]:
1.8873791418627661e-15

q7

In [51]:
df
Out[51]:
x y
0 0.80 1.39
1 0.47 0.72
2 0.51 1.55
3 0.73 0.48
4 0.36 1.19
5 0.58 -1.59
6 0.57 1.23
7 0.85 -0.65
8 0.44 1.49
9 0.42 0.05
In [52]:
result = sm.ols("y ~ x", data=df).fit()
In [53]:
result.summary()
Out[53]:
OLS Regression Results
Dep. Variable: y R-squared: 0.076
Model: OLS Adj. R-squared: -0.039
Method: Least Squares F-statistic: 0.6620
Date: Tue, 30 Sep 2014 Prob (F-statistic): 0.439
Time: 15:48:14 Log-Likelihood: -13.666
No. Observations: 10 AIC: 31.33
Df Residuals: 8 BIC: 31.94
Df Model: 1
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 1.5675 1.252 1.252 0.246 -1.320 4.455
x -1.7128 2.105 -0.814 0.439 -6.568 3.142
Omnibus: 3.809 Durbin-Watson: 2.515
Prob(Omnibus): 0.149 Jarque-Bera (JB): 1.534
Skew: -0.956 Prob(JB): 0.464
Kurtosis: 3.158 Cond. No. 8.37

q9

What value minimizes the sum of the squared distances between these points and it?

By definition it's the mean (see the least squares derivation above)

In [54]:
df['x'].mean()
Out[54]:
0.57299999999999995

q10

Consider taking the slope having fit Y as the outcome and X as the predictor, β1, and the slope from fitting X as the outcome and Y as the predictor, γ1, and dividing the two as β1/γ1. What is this ratio always equal to?

In [55]:
df
Out[55]:
x y
0 0.80 1.39
1 0.47 0.72
2 0.51 1.55
3 0.73 0.48
4 0.36 1.19
5 0.58 -1.59
6 0.57 1.23
7 0.85 -0.65
8 0.44 1.49
9 0.42 0.05
In [56]:
r1 = sm.ols("x~y", data=df).fit()
r2 = sm.ols("y~x", data=df).fit()
In [57]:
r2.params.x / r1.params.y
Out[57]:
38.390772016849851
In [58]:
df['y'].var() / df['x'].var()
Out[58]:
38.390772016849652

$\beta_1 = Cor(X, Y)\dfrac{S_Y}{S_X}$

$\gamma_1 = Cor(X, Y)\dfrac{S_X}{S_Y}$

$\dfrac{\beta_1}{\gamma_1} = \dfrac{Cor(X, Y)\dfrac{S_Y}{S_X}}{Cor(X, Y)\dfrac{S_X}{S_Y}} = \dfrac{S_Y}{S_X}\cdot\dfrac{S_Y}{S_X} = \dfrac{S_Y^2}{S_X^2} = \dfrac{Var(Y)}{Var(X)}$

Week 2

Statistical linear regression models

Basic regression model with additive Gaussian errors

  • Least squares is an estimation tool, how do we do inference?
  • Develop a probabilistic model for linear regression
  • $\epsilon$ is the error

$Y_i = \beta_0 + \beta_1X_i + \epsilon_i$

slope intercept line with error term

Here the $\epsilon_i$ are assumed iid $N(0, \sigma^2)$

$ E[Y_i | X_i = x_i] = \mu_i = \beta_0 + \beta_1x_i $

The expected value of $Y_i$ given $x_i$ is the regression equation: take the expected value of the model above, and since $E[\epsilon_i]=0$ the error term drops out.

$ Var(Y_i | X_i = x_i) = \sigma^2 $

Likelihood

  • see the statistical inference class
  • The least squares estimate for $\mu_i$ is exactly the maximum likelihood estimate (see the short derivation below)
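Why least squares equals maximum likelihood here (a short fill-in, my own addition): with iid $N(0, \sigma^2)$ errors, minus twice the log likelihood is

$$ -2\log \mathcal{L}(\beta_0, \beta_1, \sigma^2) = n\log(2\pi\sigma^2) + \frac{1}{\sigma^2}\sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2 $$

so for any fixed $\sigma^2$, maximizing the likelihood over $(\beta_0, \beta_1)$ is the same as minimizing the sum of squares.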

Recap

Interpreting regression coefficients

$\beta_0$ is the expected value of the response when the predictor is 0

$E[Y|X=0] = \beta_0 + \beta_1 \cdot 0 = \beta_0$

This isn't always interesting, e.g. when $X=0$ is impossible or far outside the range of observed values

You can shift the data by $a$, which changes the intercept but not the slope. In this way you can make the intercept interesting

$\beta_1$ is the expected change in response for a 1 unit change in the predictor

Multiplication of $X$ by a factor of $a$ results in dividing the coefficient by a factor of $a$
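The shift and scale facts in one line each (my own fill-in of the algebra):

$$ Y = \beta_0 + \beta_1 X + \epsilon = (\beta_0 + a\beta_1) + \beta_1 (X - a) + \epsilon, \qquad Y = \beta_0 + \frac{\beta_1}{a}(aX) + \epsilon $$

Shifting $X$ by $a$ only moves the intercept; multiplying $X$ by $a$ divides the slope by $a$.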

Using the regression coefficients for prediction

$\hat Y = \hat \beta_0 + \hat \beta_1X$

  • Why are these $\hat \beta$ instead of just $\beta$?
  • Because they are estimates of the true $\beta$ values, specifically the least squares estimates

Statistical Regression Models Examples

In [59]:
%%R -o diamond
require(UsingR)
require(ggplot2)
data(diamond)
g <- ggplot(diamond, aes(carat, price)) + geom_point() + theme_bw()
g <- g + geom_smooth(method=lm, se=FALSE)
g

In [60]:
formula = "price ~ carat"
r1 = sm.ols(formula, data=diamond).fit()
In [61]:
r1.summary()
Out[61]:
OLS Regression Results
Dep. Variable: price R-squared: 0.978
Model: OLS Adj. R-squared: 0.978
Method: Least Squares F-statistic: 2070.
Date: Tue, 30 Sep 2014 Prob (F-statistic): 6.75e-40
Time: 15:48:16 Log-Likelihood: -233.20
No. Observations: 48 AIC: 470.4
Df Residuals: 46 BIC: 474.1
Df Model: 1
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -259.6259 17.319 -14.991 0.000 -294.487 -224.765
carat 3721.0249 81.786 45.497 0.000 3556.398 3885.651
Omnibus: 0.739 Durbin-Watson: 1.994
Prob(Omnibus): 0.691 Jarque-Bera (JB): 0.181
Skew: 0.056 Prob(JB): 0.913
Kurtosis: 3.280 Cond. No. 18.5
In [62]:
# same formula, but subtracting out the carat mean to get a more interpretable intercept
f = "price ~ I(carat - carat.mean())"
r2 = sm.ols(f, data=diamond).fit()
In [63]:
# note that the slope is the same
print(r1.params)
print(r2.params)
print(diamond.price.mean())
Intercept    -259.625907
carat        3721.024852
dtype: float64
Intercept                   500.083333
I(carat - carat.mean())    3721.024852
dtype: float64
500.083333333

$500 (the intercept) is the expected price for an average-sized diamond in the data.

And the intercept is exactly the mean of the prices

In [64]:
diamond.price.mean()
Out[64]:
500.08333333333331

Changing Scale

In [65]:
# A one-carat increase is a big jump; rescale the regressor to units of 1/10 carat.
f = "price ~ I(carat *10)"
r3 = sm.ols(f, data=diamond).fit()
r3.params
Out[65]:
Intercept       -259.625907
I(carat * 10)    372.102485
dtype: float64
In [66]:
# And the slope gets divided by the factor we multiplied the regressor by
In [67]:
df = pd.DataFrame([0.16, 0.27, 0.34], columns=['carat'])
df['price'] = r3.predict(df)
df
Out[67]:
carat price
0 0.16 335.738069
1 0.27 745.050803
2 0.34 1005.522542
In [68]:
diamond['price_predict'] = r3.predict(diamond)
In [69]:
%%R -i diamond
g <- ggplot(diamond, aes(carat, price)) + geom_point(color='green')
g <- g + geom_line(data=diamond, aes(carat, price_predict), colour='blue', linetype=2) 
g <- g + geom_point(data=diamond, aes(carat, price_predict), colour='red', shape=3) 
g + theme_bw()
In [70]:
# green dots - actual observations
# blue dashed line - predictions
# red dashes - places where we made predictions. Note that these are the same x_i values where we have observed points

Residuals and residual variation

$\hat Y_i = \hat \beta_0 + \hat \beta_1X_i$

$e_i = Y_i - \hat Y_i$ which is the vertical distance between the observed data point and the regression line

$e_i \neq \epsilon_i$.

$e_i$ is the observed error, not the true error which we don't get to know

Least squares minimizes $\sum\limits_{i=1}^ne_i^2$

The $e_i$ can be thought of as the estimates of the $\epsilon_i$

Properties of the residuals

<img "src=images/properties_of_the_residuals.png"/>

  • Useful for investigating poor model fit
  • Residuals can be thought of as the outcome $Y$ with the linear association of the predictor $X$ removed
  • residual variation is the variation after removing the systematic variation
  • residual plots highlight poor model fit
In [71]:
%%R -o diamond
data(diamond)
y <- diamond$price
x <- diamond$carat
n <- length(y)
fit <- lm(y~x)
e <- resid(fit)
yhat <- predict(fit)
max(abs(e - (y-yhat)))
[1] 9.485746e-13
In [72]:
diamond
x = diamond.carat
y = diamond.price
results = sm.ols("price ~ carat", data=diamond).fit()
In [73]:
e = results.resid
In [74]:
yhat = results.predict()
In [75]:
df = pd.DataFrame([e, y - yhat]).T
df.columns = ['e', 'y - yhat']
In [76]:
df.head()
Out[76]:
e y - yhat
0 -17.948318 -17.948318
1 -7.738069 -7.738069
2 -22.948318 -22.948318
3 -85.158566 -85.158566
4 -28.630306 -28.630306
In [77]:
(df.e - df['y - yhat']).abs().max()
Out[77]:
0.0
In [78]:
from statsmodels.api import graphics

What we're looking for is a pattern in the residuals. A pattern indicates our model is missing something

See this for more details on plotting regressions in statsmodels

In [79]:
fig, ax = plt.subplots(figsize=(12, 8))
fig = graphics.plot_fit(results, "carat", ax=ax)
In [80]:
fig, ax = plt.subplots(figsize=(12, 8))
graphics.plot_regress_exog(results, "carat", fig);
In [81]:
# generated data set
from random import uniform, normalvariate, seed
from math import sin
In [82]:
seed(123)
x = [uniform(-3, 3) for _ in xrange(100)]
y = [_ + sin(_) + normalvariate(mu=100, sigma=.2) for _ in x]
df = pd.DataFrame([_ for _ in zip(x, y)], columns = ['x', 'y'])
df = df.sort('x')
In [83]:
plt.scatter(df.x, df.y);
In [84]:
results = sm.ols("y ~ x", data=df).fit()
In [85]:
# why does this show up twice??
# todo: look at the source and figure out why
fig, ax = plt.subplots(figsize=(12, 8))
graphics.plot_regress_exog(results, "x", fig);
In [86]:
plt.plot(df.x, results.resid, 'o');

Heteroskedasticity

  • The model isn't right because the variance isn't constant as a function of $x$
In [87]:
x = [uniform(0, 6) for _ in xrange(100)]
df = pd.DataFrame(x, columns = ['x'])
In [88]:
ygen = lambda x: x + normalvariate(0, .001*x)
df['y'] = [ygen(x) for x in df.x]
In [89]:
df = df.sort('x')
plt.plot(df.x, df.y, 'o');
In [90]:
r_hetero = sm.ols("y~x", data=df).fit()
In [91]:
plt.plot(df.x, r_hetero.resid, 'o');
In [92]:
fig, ax = plt.subplots(figsize=(12, 8))
graphics.plot_regress_exog(r_hetero, "x", fig);
  • Residuals highlight stuff you will not see in scatter plots

Estimating Residual Variation

  • Model $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$ where $\epsilon_i \sim N(0, \sigma^2)$.
  • The ML estimate of $\sigma^2$ is $\frac{1}{n}\sum_{i=1}^n e_i^2$, the average squared residual.
  • Most people use $$ \hat \sigma^2 = \frac{1}{n-2}\sum_{i=1}^n e_i^2. $$

  • The $n-2$ instead of $n$ is so that $E[\hat \sigma^2] = \sigma^2$

Diamond Example

In [93]:
diamond.head()
Out[93]:
carat price
0 0.17 355
1 0.16 328
2 0.17 350
3 0.18 325
4 0.25 642
In [94]:
%%R
y <- diamond$price; x <- diamond$carat; n <- length(y)
fit <- lm(y ~ x)
summary(fit)$sigma
sqrt(sum(resid(fit)^2) / (n - 2))
[1] 31.84052
In [95]:
y = 'price'; x = 'carat'
fit = sm.ols('%s ~ %s' % (y, x, ), data=diamond).fit()
In [96]:
e = (fit.predict(diamond) - diamond.price)**2
  • calculating:

$$ \hat \sigma^2 = \frac{1}{n-2}\sum_{i=1}^n e_i^2. $$

In [97]:
from math import sqrt
sigma = sqrt((e.shape[0]-2)**-1 * e.sum())
sigma
Out[97]:
31.840522265031762
In [98]:
# Normalized by N-1 by default. This can be changed using the ddof argument
sqrt(fit.resid.var(ddof=2))
Out[98]:
31.840522265031762

Summarizing Variation

$$ \sum_{i=1}^n (Y_i - \bar Y)^2 = \sum_{i=1}^n (Y_i - \hat Y_i)^2 + \sum_{i=1}^n (\hat Y_i - \bar Y)^2 $$

Or

Total Variation = Residual Variation + Regression Variation

Define the percent of total varation described by the model as $$ R^2 = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} = 1 - \frac{\sum_{i=1}^n (Y_i - \hat Y_i)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} $$

Key:

$Y_i$ - $i^{th}$ observation of $Y$

$\hat Y_i$ - Predicted value of $i^{th} Y$

$n$ - number of observations

$\bar Y$ - Empirical mean of $Y$

Total Variation = $\sum_{i=1}^n (\bar Y - Y_i)^2$

Residual Variation = $\sum_{i=1}^n(\hat Y_i - Y_i)^2$

Regression Variation = $\sum_{i=1}^n (\hat Y_i - \bar Y)^2$

$R^2 = \dfrac{\text{Regression Variation}}{\text{Total Variation}} = 1 - \dfrac{\text{Residual Variation}}{\text{Total Variation}}$

Relation between $R^2$ and $r$ (the correlation)

Recall that $(\hat Y_i - \bar Y) = \hat \beta_1 (X_i - \bar X)$ so that

$$ R^2 = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} = \hat \beta_1^2 \frac{\sum_{i=1}^n(X_i - \bar X)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} = Cor(Y, X)^2 $$ Since, recall, $$ \hat \beta_1 = Cor(Y, X)\frac{Sd(Y)}{Sd(X)} $$ So, $R^2$ is literally $r$ squared.
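A quick numerical check (my own addition) on the diamond fit from earlier, comparing the by-hand $R^2$, statsmodels' rsquared attribute, and the squared correlation:

# compare: R^2 computed by hand, statsmodels' value, and Cor(price, carat)^2
price = diamond['price']
price_hat = fit.predict(diamond)
ss_res = ((price - price_hat)**2).sum()
ss_tot = ((price - price.mean())**2).sum()
print(1 - ss_res / ss_tot)
print(fit.rsquared)
print(diamond['price'].corr(diamond['carat'])**2)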

The pitfalls of relying solely on $R^2$

Taken from here

In [99]:
from __future__ import print_function
"""
Edward Tufte uses this example from Anscombe to show 4 datasets of x
and y that have the same mean, standard deviation, and regression
line, but which are qualitatively different.

matplotlib fun for a rainy day
"""

from pylab import *

x =  array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = array([8,8,8,8,8,8,8,19,8,8,8])
y4 = array([6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.50,5.56,7.91,6.89])

def fit(x):
    return 3+0.5*x



xfit = array( [amin(x), amax(x) ] )

subplot(221)
plot(x,y1,'ks', xfit, fit(xfit), 'r-', lw=2)
axis([2,20,2,14])
setp(gca(), xticklabels=[], yticks=(4,8,12), xticks=(0,10,20))
text(3,12, 'I', fontsize=20)

subplot(222)
plot(x,y2,'ks', xfit, fit(xfit), 'r-', lw=2)
axis([2,20,2,14])
setp(gca(), xticklabels=[], yticks=(4,8,12), yticklabels=[], xticks=(0,10,20))
text(3,12, 'II', fontsize=20)

subplot(223)
plot(x,y3,'ks', xfit, fit(xfit), 'r-', lw=2)
axis([2,20,2,14])
text(3,12, 'III', fontsize=20)
setp(gca(), yticks=(4,8,12), xticks=(0,10,20))

subplot(224)

xfit = array([amin(x4),amax(x4)])
plot(x4,y4,'ks', xfit, fit(xfit), 'r-', lw=2)
axis([2,20,2,14])
setp(gca(), yticklabels=[], yticks=(4,8,12), xticks=(0,10,20))
text(3,12, 'IV', fontsize=20)

#verify the stats
pairs = (x,y1), (x,y2), (x,y3), (x4,y4)
for x,y in pairs:
    print ('mean=%1.2f, std=%1.2f, r=%1.2f'%(mean(y), std(y), corrcoef(x,y)[0][1]))

show()
mean=7.50, std=1.94, r=0.82
mean=7.50, std=1.94, r=0.82
mean=7.50, std=1.94, r=0.82
mean=7.50, std=1.94, r=0.82

So that's how a single $R^2$ (and correlation) can hide problems; what about the residual plots?

In [100]:
x =  array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = array([8,8,8,8,8,8,8,19,8,8,8])
y4 = array([6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.50,5.56,7.91,6.89])

df = pd.DataFrame([x, x4, y1, y2, y3, y4]).T
df.columns = ['x', 'x4', 'y1', 'y2', 'y3', 'y4']

def regress(x, y, data):
    formula = '%s ~ %s' % (y, x, )
    return sm.ols(formula, data=data).fit()

subplot(221)
plt.plot(df.x, regress('x', 'y1', df).resid, 'ro');
plt.title('I');
subplot(222)
plt.plot(df.x, regress('x', 'y2', df).resid, 'ro');
plt.title('II');
subplot(223)
plt.plot(df.x, regress('x', 'y3', df).resid, 'ro');
plt.title('III');
subplot(224)
plt.plot(df.x, regress('x4', 'y4', df).resid, 'ro');
plt.title('IV');

and the residual plots make the problems obvious too

Inference in Regression

Consider the model $$ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i $$

$\epsilon_i \sim N(0, \sigma^2)$.

We assume that the true model is known.

$\hat \beta_0 = \bar Y - \hat \beta_1 \bar X $

$\hat \beta_1 = Cor(X, Y) \cdot \dfrac{Sd(Y)}{Sd(X)} $


Both of these are estimates ($\hat \beta_1$), since they are computed from the observed data; the unhatted $\beta_1$ is the unknown true slope.

$\hat \beta_1= \dfrac{\sum(X_i - \bar X)(Y_i - \bar Y)}{\sum(X_i - \bar X)^2}$

Tests for Regression Coefficients

Review

  • Statistics like $\frac{\hat \theta - \theta}{\hat \sigma_{\hat \theta}}$ often have the following properties.
    1. Is normally distributed and has a finite sample Student's T distribution if the estimated variance is replaced with a sample estimate (under normality assumptions).
    2. Can be used to test $H_0 : \theta = \theta_0$ versus $H_a : \theta >, <, \neq \theta_0$.
    3. Can be used to create a confidence interval for $\theta$ via $\hat \theta \pm Q_{1-\alpha/2} \hat \sigma_{\hat \theta}$ where $Q_{1-\alpha/2}$ is the relevant quantile from either a normal or T distribution.
  • In the case of regression with iid sampling assumptions and normal errors, our inferences will follow very similarly to what you saw in your inference class.
  • We won't cover asymptotics for regression analysis, but suffice it to say that under assumptions on the ways in which the $X$ values are collected, the iid sampling model, and mean model, the normal results hold to create intervals and confidence intervals

Standard errors (conditioned on X)

$$ \begin{align} Var(\hat \beta_1) & = Var\left(\frac{\sum_{i=1}^n (Y_i - \bar Y) (X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2}\right) \\ & = \frac{Var\left(\sum_{i=1}^n Y_i (X_i - \bar X) \right) }{\left(\sum_{i=1}^n (X_i - \bar X)^2 \right)^2} \\ & = \frac{\sum_{i=1}^n \sigma^2(X_i - \bar X)^2}{\left(\sum_{i=1}^n (X_i - \bar X)^2 \right)^2} \\ & = \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar X)^2} \\ \end{align} $$

Results

  • $\sigma_{\hat \beta_1}^2 = Var(\hat \beta_1) = \sigma^2 / \sum_{i=1}^n (X_i - \bar X)^2$
  • $\sigma_{\hat \beta_0}^2 = Var(\hat \beta_0) = \left(\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2 }\right)\sigma^2$
  • In practice, $\sigma$ is replaced by its estimate.
  • It's probably not surprising that under iid Gaussian errors $$ \frac{\hat \beta_j - \beta_j}{\hat \sigma_{\hat \beta_j}} $$ follows a $t$ distribution with $n-2$ degrees of freedom and a normal distribution for large $n$.
  • This can be used to create confidence intervals and perform hypothesis tests.

Example

$H_0: \beta_n = 0$

$H_a: \beta_n \neq 0$

In [101]:
%%R -o diamond
library(UsingR); data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
beta1 <- cor(y, x) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
e <- y - beta0 - beta1 * x
sigma <- sqrt(sum(e^2) / (n-2)) 
ssx <- sum((x - mean(x))^2)
seBeta0 <- (1 / n + mean(x) ^ 2 / ssx) ^ .5 * sigma 
seBeta1 <- sigma / sqrt(ssx)
tBeta0 <- beta0 / seBeta0; 
tBeta1 <- beta1 / seBeta1
pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
coefTable <- rbind(c(beta0, seBeta0, tBeta0, pBeta0), c(beta1, seBeta1, tBeta1, pBeta1))
colnames(coefTable) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
rownames(coefTable) <- c("(Intercept)", "x")
coefTable
             Estimate Std. Error   t value      P(>|t|)
(Intercept) -259.6259   17.31886 -14.99094 2.523271e-19
x           3721.0249   81.78588  45.49715 6.751260e-40
In [102]:
import scipy.stats as st
n = 46
st.t.pdf(abs(1), df = n - 2)
Out[102]:
0.23924692073493825
In [103]:
x = diamond['carat']
y = diamond['price']
n = float(len(y))
beta1 = y.corr(x) * y.std() / x.std() # regression slope
beta0 = y.mean() - beta1 * x.mean()   # regression intercept
e = y - (beta1 * x + beta0) # errors
sigma = ((e**2).sum()/(n - 2))**0.5 # residual standard deviation estimate (n-2 denominator)
ssx = ((x - x.mean())**2).sum() # sum of squared Xs
seBeta0 = (1/n + x.mean()**2 / ssx)**0.5 * sigma
seBeta1 = sigma / ssx**0.5
tBeta0 = beta0 / seBeta0
tBeta1 = beta1 / seBeta1
# TODO: review hypothesis testing
pBeta0 = 2 * (1-st.t.cdf(abs(tBeta0), df = n - 2))
pBeta1 = 2 * (1-st.t.cdf(abs(tBeta1), df = n - 2))
l = [[beta0, seBeta0, tBeta0, pBeta0], 
     [beta1, seBeta1, tBeta1, pBeta1]]
coefTable = pd.DataFrame((_ for _ in l), 
                         columns = ["Estimate", "Std. Error", "t value", "P(>|t|)"],
                         index = ["(Intercept)", "x"])

coefTable
Out[103]:
Estimate Std. Error t value P(>|t|)
(Intercept) -259.625907 17.318856 -14.990938 0
x 3721.024852 81.785880 45.497155 0
In [104]:
fit = sm.ols("price ~ carat", data=diamond).fit()
fit.summary()
Out[104]:
OLS Regression Results
Dep. Variable: price R-squared: 0.978
Model: OLS Adj. R-squared: 0.978
Method: Least Squares F-statistic: 2070.
Date: Tue, 30 Sep 2014 Prob (F-statistic): 6.75e-40
Time: 15:48:23 Log-Likelihood: -233.20
No. Observations: 48 AIC: 470.4
Df Residuals: 46 BIC: 474.1
Df Model: 1
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -259.6259 17.319 -14.991 0.000 -294.487 -224.765
carat 3721.0249 81.786 45.497 0.000 3556.398 3885.651
Omnibus: 0.739 Durbin-Watson: 1.994
Prob(Omnibus): 0.691 Jarque-Bera (JB): 0.181
Skew: 0.056 Prob(JB): 0.913
Kurtosis: 3.280 Cond. No. 18.5
In [105]:
%%R
data(diamond)
fit <- lm("price ~ carat", data=diamond)
summary(fit)
Call:
lm(formula = "price ~ carat", data = diamond)

Residuals:
    Min      1Q  Median      3Q     Max 
-85.159 -21.448  -0.869  18.972  79.370 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -259.63      17.32  -14.99   <2e-16 ***
carat        3721.02      81.79   45.50   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 31.84 on 46 degrees of freedom
Multiple R-squared:  0.9783,	Adjusted R-squared:  0.9778 
F-statistic:  2070 on 1 and 46 DF,  p-value: < 2.2e-16


Interlude to find Python equivalent of R's:

pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)

In [106]:
print ('\n'.join(st.t.pdf.__doc__.split('\n')[:100]))
        Probability density function at x of the given RV.

        Parameters
        ----------
        x : array_like
            quantiles
        arg1, arg2, arg3,... : array_like
            The shape parameter(s) for the distribution (see docstring of the
            instance object for more information)
        loc : array_like, optional
            location parameter (default=0)
        scale : array_like, optional
            scale parameter (default=1)

        Returns
        -------
        pdf : ndarray
            Probability density function evaluated at x

        
In [107]:
import numpy as np
x = np.arange(-3, 3, .01)
ypdf = [st.t.pdf(_, n - 2) for _ in x]
In [108]:
ycdf = [1 - st.t.cdf(_, n - 2) for _ in x]
In [109]:
f, axarr = plt.subplots(2, sharex=True)
axarr[0].plot(x, ypdf)
axarr[0].set_title('PDF')
axarr[1].plot(x, ycdf)
axarr[1].set_title('CDF');
In [110]:
%%R -o dfR
# test to figure out what the pt function is doing
require(ggplot2)
n <- 48
x <- seq(-3, 3, by=.01)
y <- pt(x, df = n - 2, lower.tail = FALSE)
df1 <- data.frame(x, y, 'false')
names(df1) <- c('x', 'y', 'tail')

y <- pt(x, df = n - 2, lower.tail = TRUE)
df2 <- data.frame(x, y, 'true')
names(df2) <- c('x', 'y', 'tail')
dfR <- rbind(df1, df2)
ggplot(dfR, aes(x, y, colour=tail)) + geom_line() + theme_bw() + ggtitle("Effect of lower.tail parameter in `pt` function")

So the pt function is the t CDF; lower.tail=FALSE gives the upper tail probability $P(T > x)$

In [111]:
dfR = dfR[dfR['tail'] == 'false']
In [112]:
f, axarr = plt.subplots(2, sharex=True)
axarr[0].plot(dfR['x'], dfR['y'])
axarr[0].set_title('R CDF')
axarr[1].plot(x, ycdf)
axarr[1].set_title('Python CDF');

End Interlude


Getting a confidence interval

In [113]:
%%R
sumCoef <- summary(fit)$coefficients
sumCoef
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) -259.6259   17.31886 -14.99094 2.523271e-19
carat       3721.0249   81.78588  45.49715 6.751260e-40
In [114]:
%%R
sumCoef <- summary(fit)$coefficients
sumCoef[1,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[1, 2]
[1] -294.4870 -224.7649
In [115]:
%%R
sumCoef[2,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[2, 2]
[1] 3556.398 3885.651
In [116]:
coefTable['Estimate']['x'] + pd.Series([-1, 1]) * coefTable['Std. Error']['x'] * st.t.ppf(.975, fit.df_resid)
Out[116]:
0    3556.398413
1    3885.651290
dtype: float64
In [117]:
coefTable['Estimate']['(Intercept)'] + pd.Series([-1, 1]) * coefTable['Std. Error']['(Intercept)'] * st.t.ppf(.975, fit.df_resid)
Out[117]:
0   -294.486957
1   -224.764858
dtype: float64

What are the different R t-distribution functions doing?

In [118]:
%%R
require(reshape)
require(ggplot2)
n <- 3
by <- 0.001
x <- seq(-n, n, by)
qt <- qt(x, df = 30)
pt <- pt(x, df=30)
dt <- dt(x, df=30)
rt <- rt(x, df=30)

df <- data.frame(x, qt, pt, dt, rt)
names(df) <- c('x', 'qt\nquantile', 'pt\ndistribution', 'dt\ndensity', 'rt\nrandom')
mdf <- melt.data.frame(df, id = c('x'))
g <- ggplot(mdf, aes(x, value, colour=variable)) + geom_point(alpha=0.1)
g <- g + facet_wrap(~variable, scale='free_y') + theme_bw()
g <- g + ggtitle("R Functions for T-Distribution") + theme(legend.position="none")
g <- g + ylab('f(x)')
g

  • pt and qt are inverses of each other.
  • The input of pt is a quantile $x$ and you get back $P(T \leq x)$ (or $P(T > x)$ with lower.tail=FALSE)
  • The input of qt is a probability $p \in [0, 1]$ and you get back the $x$ such that $P(T \leq x) = p$

In [119]:
x = np.arange(-3, 3, .001)
dt = st.t.pdf(x, 30)
qt = st.t.ppf(x, 30)
pt = st.t.cdf(x, 30)
f, axarr = plt.subplots(3, sharex=True)
axarr[0].plot(x, dt)
axarr[0].set_title('Python st.t.pdf = R dt')
axarr[1].plot(x, qt)
axarr[1].set_title('Python st.t.ppf = R qt')
axarr[2].plot(x, pt)
axarr[2].set_title('Python st.t.cdf = R pt');
In [120]:
st.t.ppf(.975, 46)
Out[120]:
2.0128955952945886
In [121]:
st.norm.ppf(.975)
Out[121]:
1.959963984540054
In [122]:
%%R
qt(.975, df = fit$df)
[1] 2.012896

Prediction of Outcomes

  • Consider predicting $Y$ at a value of $X$
    • Predicting the price of a diamond given the carat
    • Predicting the height of a child given the height of the parents
  • The obvious estimate for prediction at point $x_0$ is $$ \hat \beta_0 + \hat \beta_1 x_0 $$
  • A standard error is needed to create a prediction interval.
  • There's a distinction between intervals for the regression line at point $x_0$ and the prediction of what a $y$ would be at point $x_0$.
  • SE for the regression line at $x_0$: $\hat \sigma\sqrt{\frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}$
  • SE for predicting a new $y$ at $x_0$: $\hat \sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}$

Confidence Intervals

A confidence interval for $\theta$ is $\hat \theta \pm Q_{1-\alpha/2} \hat \sigma_{\hat \theta}$, where $Q_{1-\alpha/2}$ is the relevant t quantile.
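A by-hand Python sketch (my own addition) of both intervals at a single point, say $x_0 = 0.3$ carats, using the two standard error formulas above (the value 0.3 is just an illustrative choice):

# 95% interval for the regression line and for a new observation at x0
xd, yd = diamond['carat'], diamond['price']
n_d = len(yd)
fit_d = sm.ols("price ~ carat", data=diamond).fit()
b0, b1 = fit_d.params['Intercept'], fit_d.params['carat']
sigma_hat = np.sqrt(fit_d.resid.var(ddof=2))     # residual sd with n-2 denominator
ssx_d = ((xd - xd.mean())**2).sum()
x0 = 0.3
se_line = sigma_hat * np.sqrt(1.0/n_d + (x0 - xd.mean())**2 / ssx_d)
se_pred = sigma_hat * np.sqrt(1 + 1.0/n_d + (x0 - xd.mean())**2 / ssx_d)
q = st.t.ppf(0.975, df=n_d - 2)
yhat0 = b0 + b1 * x0
print(yhat0 + np.array([-1, 1]) * q * se_line)  # interval for the line at x0
print(yhat0 + np.array([-1, 1]) * q * se_pred)  # prediction interval at x0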

Plotting the prediction intervals

In [123]:
%%R
data(diamond)
x <- diamond$carat; y <- diamond$price
plot(x, y, frame=FALSE,xlab="Carat",ylab="Dollars",pch=21,col="black", bg="lightblue", cex=2)
abline(fit, lwd = 2)
xVals <- seq(min(x), max(x), by = .01)
yVals <- beta0 + beta1 * xVals
se1 <- sigma * sqrt(1 / n + (xVals - mean(x))^2/ssx)      # orange
se2 <- sigma * sqrt(1 + 1 / n + (xVals - mean(x))^2/ssx)  # blue
lines(xVals, yVals + 2 * se1, col='orange')
lines(xVals, yVals - 2 * se1, col='orange')
lines(xVals, yVals + 2 * se2, col='blue')
lines(xVals, yVals - 2 * se2, col='blue')

black line = OLS regression line

orange lines = interval for the regression line at each x (se1)

blue lines = prediction interval for a new observation at each x (se2)

Discussion

  • Both intervals have varying widths.
    • Least width at the mean of the Xs.
  • We are quite confident in the regression line, so that interval is very narrow.
    • If we knew $\beta_0$ and $\beta_1$ this interval would have zero width.
  • The prediction interval must incorporate the variability in the data around the line.
    • Even if we knew $\beta_0$ and $\beta_1$ this interval would still have width.

ok... I'm taking this one on faith

Quiz 2

q1

Consider the following data with x as the predictor and y as the outcome.

Give a P-value for the two sided hypothesis test of whether β1 from a linear regression model is 0 or not.

In [124]:
x = [0.61, 0.93, 0.83, 0.35, 0.54, 0.16, 0.91, 0.62, 0.62]
y = [0.67, 0.84, 0.6, 0.18, 0.85, 0.47, 1.1, 0.65, 0.36]
df = pd.DataFrame(zip(x, y), columns = ['x', 'y'])

$\beta_1 = Cor(x, y) \cdot \dfrac{Sd_y}{Sd_x}$

$\sigma_{\hat \beta_1}^2 = Var(\hat \beta_1) = \dfrac{\sigma^2}{\sum_{i=1}^n (X_i - \bar X)^2}$

$ \hat \sigma^2 = \frac{1}{n-2}\sum_{i=1}^n e_i^2 $

$ ssx = {\sum_{i=1}^n (X_i - \bar X)^2}$ (convenience value)

$\hat \beta_0 = \bar Y - \hat \beta_1 \bar X $

$ t = \dfrac{\hat \beta_1 - \beta_1^{(H_0)}}{\hat \sigma_{\hat \beta_1}}$

In [125]:
from math import sqrt
x = df['x']
y = df['y']
n = len(x)
beta1 = y.corr(x) * (y.std() / x.std())
beta0 = y.mean() - beta1 * x.mean()
ssx = ((x - mean(x))**2).sum()
yhat = beta0 + beta1 * x
e = y - yhat
sigma = sqrt(1.0/(n-2) * (e**2).sum())
se_beta1 = sqrt((sigma**2 / ssx))
t_stat = beta1 / se_beta1  # t statistic for H0: beta1 = 0
p_value = 2 * (1 - st.t.cdf(abs(t_stat), df=n - 2))  # two-sided p-value, ~0.053
t_stat
Out[125]:
2.3254912115169395
In [126]:
results = sm.ols('y~x', data=df).fit().summary()
results
/Users/idris/.virtualenvs/regres/lib/python2.7/site-packages/scipy/stats/stats.py:1205: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=9
  int(n))
Out[126]:
OLS Regression Results
Dep. Variable: y R-squared: 0.436
Model: OLS Adj. R-squared: 0.355
Method: Least Squares F-statistic: 5.408
Date: Tue, 30 Sep 2014 Prob (F-statistic): 0.0530
Time: 15:48:27 Log-Likelihood: 1.8658
No. Observations: 9 AIC: 0.2684
Df Residuals: 7 BIC: 0.6629
Df Model: 1
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 0.1885 0.206 0.914 0.391 -0.299 0.676
x 0.7224 0.311 2.325 0.053 -0.012 1.457
Omnibus: 1.435 Durbin-Watson: 1.384
Prob(Omnibus): 0.488 Jarque-Bera (JB): 0.698
Skew: -0.067 Prob(JB): 0.705
Kurtosis: 1.642 Cond. No. 5.85

q2

Consider the previous problem, give the estimate of the residual standard deviation.

In [127]:
sigma
Out[127]:
0.2229980659078784

q3

In the mtcars data set, fit a linear regression model of weight (predictor) on mpg (outcome). Get a 95% confidence interval for the expected mpg at the average weight. What is the lower endpoint?

In [128]:
%%R -o mtcars
data(mtcars)
In [129]:
r = sm.ols('mpg ~ I(wt - wt.mean())', data=mtcars).fit()
In [130]:
alpha = 0.05
20.0906 + pd.Series([-1, 1]) * st.t.ppf(1 - alpha/2, 30) * .538
Out[130]:
0    18.991857
1    21.189343
dtype: float64
In [131]:
t = r.summary().tables[1]
print (t.as_text())
=====================================================================================
                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept            20.0906      0.538     37.313      0.000        18.991    21.190
I(wt - wt.mean())    -5.3445      0.559     -9.559      0.000        -6.486    -4.203
=====================================================================================
In [132]:
(20.0906 - 18.991) / .538
Out[132]: