# Term Definitions¶

• $\hat Y$: predicted $Y$, aka the fitted value
• $\bar Y$: sample mean
• $\mu$: population mean
• $\beta_n$: model coefficient
• $s$: sample standard deviation
• $\sigma$: population standard deviation

# Least Squares¶

In [15]:
import pandas as pd
import matplotlib.pyplot as plt
%load_ext rpy2.ipython
%matplotlib inline

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython

In [2]:
heig = pd.read_csv('data/Galton.csv')
heig = pd.melt(heig, value_vars = ['child', 'parent'], value_name = 'height')

In [3]:
%%R -i heig
require(ggplot2)
df <- as.data.frame(heig)
g <- ggplot(df, aes(height, fill=variable)) + geom_histogram(binwidth=1) + theme_bw()
g <- g + facet_grid(~variable)

g

Loading required package: ggplot2


# Find the middle via least squares¶

Let $Y_i$ be the height of child $i$ for $i = 1, \ldots, n = 928$, and define the middle as the value of $\mu$ that minimizes

$\sum\limits_{i=1}^n (Y_i - \mu)^2$

• The middle gives us the 'physical' center of mass of the graph
• You might have guessed that the answer that minimizes the squared distance above is $\mu = \bar{Y}$

# Experiment¶

$\sum\limits_{i=1}^n (Y_i - \mu)^2$

Let's try different values of $\mu$ and see where the expression above is minimized

In [4]:
child = heig[heig['variable'] == 'child']

In [5]:
import numpy as np
child.head()

Out[5]:
  variable  height
0    child    61.7
1    child    61.7
2    child    61.7
3    child    61.7
4    child    61.7
In [6]:
hrange = np.arange(child['height'].min(), child['height'].max(), .5)

In [7]:
# averaging instead of summing: same minimizer, nicer scale
MSE = lambda x: ((child.height - x)**2).mean()

In [8]:
MSEdf = pd.DataFrame([(_, MSE(_)) for _ in hrange], columns = ['mu', 'MSE'])


### Let's plot different values for $\mu$ and see what we get¶

In [9]:
MSEdf.plot(x = 'mu', y='MSE', title='Minimizing Mu');

In [10]:
child.mean()

Out[10]:
height    68.08847
dtype: float64

<img src="/images/least_squares.proof.jpg">

The math follows as

\begin{align}
\sum_{i=1}^n (Y_i - \mu)^2 & = \sum_{i=1}^n (Y_i - \bar Y + \bar Y - \mu)^2 \\
& = \sum_{i=1}^n (Y_i - \bar Y)^2 + 2 \sum_{i=1}^n (Y_i - \bar Y)(\bar Y - \mu) + \sum_{i=1}^n (\bar Y - \mu)^2 \\
& = \sum_{i=1}^n (Y_i - \bar Y)^2 + 2 (\bar Y - \mu) \sum_{i=1}^n (Y_i - \bar Y) + \sum_{i=1}^n (\bar Y - \mu)^2 \\
& = \sum_{i=1}^n (Y_i - \bar Y)^2 + 2 (\bar Y - \mu) \left(\sum_{i=1}^n Y_i - n \bar Y\right) + \sum_{i=1}^n (\bar Y - \mu)^2 \\
& = \sum_{i=1}^n (Y_i - \bar Y)^2 + \sum_{i=1}^n (\bar Y - \mu)^2 \\
& \geq \sum_{i=1}^n (Y_i - \bar Y)^2
\end{align}

• $\mu$: the candidate value we are guessing
• $\bar{Y}$: the value that minimizes $\sum\limits_{i=1}^n (Y_i - \mu)^2$

The trick is to add and subtract $\bar{Y}$, then expand and simplify; the cross term vanishes because $\sum_{i=1}^n (Y_i - \bar Y) = 0$.

The end result is that for any arbitrary $\mu$, the sum is greater than or equal to the sum when $\mu = \bar{Y}$

# Comparing children's heights and their parents' heights¶

In [11]:
# reimport so I don't have to unmelt
heig = pd.read_csv('data/Galton.csv')
# get counts of values for the bubble plot
h = pd.DataFrame({'count' : heig.groupby( [ "child", "parent"] ).size()}).reset_index()
h[:10]

Out[11]:
child parent count
0 61.7 64.0 1
1 61.7 64.5 1
2 61.7 65.5 1
3 61.7 68.5 1
4 61.7 70.5 1
5 62.2 64.5 1
6 62.2 66.5 3
7 62.2 67.5 3
8 63.2 64.0 2
9 63.2 64.5 4
In [12]:
%%R -i h
require(ggplot2)
require(hexbin)
df <- as.data.frame(h)
g <- ggplot(df, aes(parent, child, size=count)) + geom_point() + ggtitle("Bubble Plot of Height")
g + theme_bw()

Loading required package: hexbin


## Regression through the origin¶

Suppose that $X_i$ are the parents' heights

Consider picking the slope $\beta$ that minimizes

$\sum\limits_{i=1}^n (Y_i - X_i\beta)^2$ -- in effect this minimizes the size of the errors $\epsilon_i$

This uses the origin as a pivot point, picking the line that minimizes the sum of the squared vertical distances from the points to the line

A line through the origin is:

$y = x\beta$

• It has a y-intercept of 0.
• the slope is $\beta$
• $x_i\beta=\hat{y}$ is the fitted value
• $y_i - \hat{y}$ is the vertical distance between the data point and the fitted line
• it is NOT the orthogonal distance
• we don't really want the line to go through the origin
• if we subtract the means so that the origin is the mean of the parents' and children's heights, then going through the origin makes sense
• the origin is then the average of both variables
• $\mu$: population mean
• $\bar X$: sample mean
• $\sigma$: population standard deviation
• $s$: sample standard deviation
In [13]:
# de-mean both columns so the origin is the average of both variables
for col in ['child', 'parent']:
    h[col] = h[col] - h[col].mean()

In [14]:
%%R -i h
require(ggplot2)
require(hexbin)
df <- as.data.frame(h)
g <- ggplot(df, aes(parent, child, size=count)) + geom_point() + ggtitle("Bubble Plot Height De-meaned") + theme_bw()
g2 <- geom_smooth(data=df, aes(parent, child, weight=count), colour='red', method=lm, se=FALSE)
g3 <- geom_smooth(data=df, aes(parent, child), colour='green', method=lm, se=FALSE)
g4 <- geom_smooth(data=df, aes(parent, child), colour='blue', method=loess, se=FALSE)
g + g2 + g3 + g4 + theme(legend.position="none")


Note the difference in the linear fit when the size of the dots is considered.

• red = size considered
• green = size not considered
• blue = loess

### Find the best $\beta$¶

find:

$\min\sum\limits_{i=1}^n (Y_i - X_i\beta)^2$

In [15]:
for col in ['child', 'parent']:
    heig[col] = heig[col] - heig[col].mean()

In [16]:
MSE = lambda beta: ((heig['child'] - heig['parent'] * beta)**2).mean()

In [17]:
# try slopes between -3 and 3
betas = np.arange(-3, 3, .01)

In [18]:
df = pd.DataFrame([(beta, MSE(beta),) for beta in betas], columns = ['beta', 'MSE'])

In [19]:
df.plot(x = 'beta', y='MSE', title = 'finding the best Beta empirically');

In [20]:
min_beta = float(df[df['MSE']==df['MSE'].min()]['beta'])
df[df['MSE']==df['MSE'].min()]

Out[20]:
beta MSE
365 0.65 5.000338
In [21]:
float(min_beta)

Out[21]:
0.6499999999999222

## Let's create the linear regression and see if we get the same answer¶

### Using scipy¶

In [22]:
from scipy.stats import linregress

In [23]:
slope, intercept, rvalue, pvalue, stderr = linregress(heig['parent'], heig['child'])

In [24]:
slope

Out[24]:
0.64629058199363942
In [25]:
MSE(slope)

Out[25]:
5.0002937655516639

### Using StatsModels¶

In [26]:
import statsmodels.formula.api as sm

In [27]:
result = sm.ols(formula="child ~ parent", data=heig).fit()


# Some Basic Notation and Background¶

We write $X_1, X_2, ... X_n$ to describe $n$ data points

• Uppercase $X$ for a random variable
• lowercase $x$ for a particular value
• Greek letters for things we don't know

Empirical Mean

$\bar{X} = \dfrac{1}{n}\sum\limits_{i=1}^n X_i$

If we subtract the mean from all data points, we get centered data with a mean of 0.

$\tilde{X_i} = X_i - \bar{X}$ has empirical mean $0$

Remember the mean is the least squares solution for minimizing:

$\sum\limits_{i=1}^n (X_i - \mu)^2$

# The empirical standard deviation and variance¶

$\text{Variance: } s^2 = \dfrac{1}{n-1}\sum\limits_{i = 1}^n {\left( {X_i - \bar X} \right)^2 }$

# Normalization¶

$Z_i = \dfrac{X_i - \bar X}{s}$

• Normalizing gives an empirical mean of zero and an empirical standard deviation of 1
• Its units are standard deviations from the mean
• A z value of 2 means the normalized observation is 2 standard deviations from the mean

# The empirical covariance¶

of $n$ observations is:

$Cov(X, Y) = \dfrac{1}{n-1}\sum\limits_{i=1}^n(X_i - \bar X)(Y_i - \bar Y) = \dfrac{1}{n-1}\left(\sum\limits_{i=1}^nX_iY_i - n\bar X\bar Y\right)$

• think of covariance as a function that takes two vectors as an input
• has the units of $X$ times the units of $Y$
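
A quick numeric check of the two equivalent forms above (a sketch using the Galton columns; pandas' built-in Series.cov should agree with both):

x, y = heig['parent'], heig['child']
n = len(x)
form1 = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)
form2 = ((x * y).sum() - n * x.mean() * y.mean()) / (n - 1)
(form1, form2, x.cov(y))  # all three match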

## Correlation is:¶

$Cor(X, Y) = \dfrac{Cov(X, Y)}{S_xS_y}$

Where

• $S_x$ is the estimate of the standard deviation for the $X$ observations
• $S_y$ is the estimate of the standard deviation for the $Y$ observations
• Unit free, between -1 and 1

$Cor(X, Y) = \pm 1$ only when the $X$ and $Y$ observations fall perfectly on a positively or negatively sloped line

$Cor(X, Y) = 0$ implies no linear relationship

# Least squares estimation of regression lines¶

• Let $Y_i$ be the $i^{th}$ child's height
• Let $X_i$ be the $i^{th}$ parents' average height

Consider finding the best line

Child's Height = $\beta_0$ + Parent's Height $\cdot \beta_1$

Use least squares

$\sum\limits_{i=1}^n(Y_i - (\beta_0 + \beta_1X_i))^2$

• the squared difference between the actual and the predicted, summed over the data

Writing the fitted value as $\hat Y_i = \beta_0 + \beta_1X_i$, this is

$\sum\limits_{i=1}^n{\big(Y_i - \hat Y_i\big)}^2$

OK... so how do we actually do it?

# Mean only regression¶

If there is no predictor, the least squares estimate is just the sample mean, $\hat \mu = \bar Y$ (the proof above).

When we instead restrict the line to pass through the origin, a lot of algebra gives

$\hat\beta_1 = \dfrac{\sum\limits_{i=1}^n Y_iX_i}{\sum\limits_{i=1}^nX_i^2}$
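
As a quick check (a sketch; at this point heig holds the de-meaned child and parent columns), the closed form reproduces the slope we found empirically above:

beta_rto = (heig['child'] * heig['parent']).sum() / (heig['parent']**2).sum()
beta_rto  # ~0.6463, matching linregress and the grid search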

# Figuring out the final forms we can use to get $\beta_n$¶

... lots of math... which can be summarized as

$\hat \beta_1 = Cor(Y, X) \dfrac{Sd(Y)}{Sd(X)}$

also stated as

$\hat \beta_1 = \dfrac{\sum(X_i-\bar X)(Y_i - \bar Y)}{\sum(X_i - \bar X)^2}$

and

$\hat \beta_0 = \bar Y - \hat \beta_1 \bar X$

And our regression line always travels through $(\bar X, \bar Y)$

In [28]:
heig = pd.read_csv('data/Galton.csv')
heig.head()

Out[28]:
   child  parent
0   61.7    70.5
1   61.7    68.5
2   61.7    65.5
3   61.7    64.5
4   61.7    64.0
In [29]:
y = 'child'
x = 'parent'

beta1 = heig[y].corr(heig[x]) * heig[y].std() / heig[x].std()

In [30]:
beta0 = heig[y].mean() - beta1 * heig[x].mean()
beta0

Out[30]:
23.941530180421658
In [31]:
beta1

Out[31]:
0.64629058199351186
In [32]:
result = sm.ols(formula="child ~ parent", data=heig).fit()

In [33]:
result.params

Out[33]:
Intercept    23.941530
parent        0.646291
dtype: float64

# Regression to the Mean¶

$P(Y < x| X=x)$ gets bigger as $x$ heads to very large values

$P(Y > x| X=x)$ gets bigger as $x$ heads to very small values
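
A small simulation sketch of both statements (an assumption for illustration: $X$ and $Y$ are standard bivariate normal with correlation 0.5, not the Galton data):

import numpy as np
np.random.seed(0)
rho = 0.5
x = np.random.randn(100000)
y = rho * x + np.sqrt(1 - rho**2) * np.random.randn(100000)
big, small = x > 2, x < -2
print((y[big] < x[big]).mean())      # well above 0.5: given a large X, Y is usually smaller
print((y[small] > x[small]).mean())  # well above 0.5: given a small X, Y is usually larger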

# Quiz¶

### q1¶

In [34]:
# q1 - what's a better way to do this?
x = [0.18, -1.54, 0.42, 0.95]
w = [2, 1, 3, 1]


Give the value of $\mu$ that minimizes $\sum\limits_{i=1}^nw_i(x_i - \mu)^2$

In [35]:
df = pd.DataFrame([x, w]).T
df.columns = ['x', 'w']
df

Out[35]:
x w
0 0.18 2
1 -1.54 1
2 0.42 3
3 0.95 1
In [36]:
square = lambda x: x**2
f1 = lambda mu: (df['w'] * (df['x'] - mu).apply(square)).sum()

In [37]:
mus = np.arange(0, 1, .01)
f1(.18)

Out[37]:
3.7241
In [38]:
mus = pd.DataFrame([(mu, f1(mu),) for mu in mus], columns = ['mu', 'fx'])

In [39]:
mus.plot(x = 'mu', y='fx', title='Minimizing Mu');

In [40]:
mus[mus['fx']==mus['fx'].min()]

Out[40]:
mu fx
15 0.15 3.7166
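
To answer the comment in the cell above: the closed-form minimizer of $\sum w_i(x_i - \mu)^2$ is the weighted mean $\dfrac{\sum w_ix_i}{\sum w_i}$, which numpy computes directly:

np.average(x, weights=w)  # 0.1471..., the exact minimizer (the grid search found 0.15)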
In [41]:
# q2
x = [0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42]
y = [1.39, 0.72, 1.55, 0.48, 1.19, -1.59, 1.23, -0.65, 1.49, 0.05]
df = pd.DataFrame([x, y]).T
df.columns = ['x', 'y']
df.head()
Out[41]:
x y
0 0.80 1.39
1 0.47 0.72
2 0.51 1.55
3 0.73 0.48
4 0.36 1.19
In [42]:
#result = sm.ols(formula="child ~ parent", data=heig).fit()
# y intercept set to zero
result = sm.ols(formula="y ~ x - 1", data=df).fit()

In [43]:
result.summary()

/Users/idris/.virtualenvs/regres/lib/python2.7/site-packages/scipy/stats/stats.py:1205: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
int(n))

Out[43]:
Dep. Variable: y    R-squared: 0.183    Adj. R-squared: 0.092
Model: OLS          F-statistic: 2.018  Prob (F-statistic): 0.189
No. Observations: 10    Df Residuals: 9

         coef    std err      t    P>|t|    [95.0% Conf. Int.]
x      0.8263      0.582  1.421    0.189      -0.490    2.142

## So that's the code way to do it. Let's do it symbolically too.¶

$\beta_1 = \dfrac{\sum(X_i - \bar X)(Y_i - \bar Y)}{\sum(X_i - \bar X)^2}$

wait... that's close.

But for an RTO ("regression through the origin"), $\bar X$ and $\bar Y$ are both treated as $0$, so it reduces to

$\beta_1 = \dfrac{\sum X_iY_i}{\sum X_i^2}$

In [44]:
x = 'x'
y = 'y'
num = (df[x] * df[y]).sum()
den = (df[x]).apply(square).sum()

In [45]:
num / den

Out[45]:
0.82625166087128599

### q3¶

In [46]:
%%R
require(datasets)
data(mtcars)
lm(mpg~wt, data=mtcars)

Loading required package: datasets

Call:
lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt
37.285       -5.344
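
For a Python cross-check (a sketch; it assumes mtcars has been exported to Python with %%R -o mtcars, as is done for q3 of Quiz 2 below):

fit_mtcars = sm.ols('mpg ~ wt', data=mtcars).fit()
fit_mtcars.params  # Intercept ~37.285, wt ~-5.344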



## q4¶

Consider data with an outcome (Y) and a predictor (X). The standard deviation of the predictor is one half that of the outcome. The correlation between the two variables is .5. What value would the slope coefficient be for the regression model with Y as the outcome and X as the predictor?

• $Cor(X, Y) = \frac{1}{2}$
• $2 \cdot Sd(X) = Sd(Y)$
• $\hat \beta_1 = Cor(Y, X) \dfrac{Sd(Y)}{Sd(X)}$
• $\hat \beta_1 = \frac{1}{2} \cdot \dfrac{2 \cdot Sd(X)}{Sd(X)} = \frac{1}{2}\cdot 2 = 1$

### q5¶

Students were given two hard tests and scores were normalized to have empirical mean 0 and variance 1. The correlation between the scores on the two tests was 0.4. What would be the expected score on Quiz 2 for a student who had a normalized score of 1.5 on Quiz 1?

$Cor(q1, q2) = 0.4$

$Z_{q1} = 1.5$

In [47]:
# for normalized scores the regression slope equals the correlation,
# so E[Z_q2 | Z_q1 = 1.5] = 0.4 * 1.5
1.5 * 0.4

Out[47]:
0.6000000000000001

### q6¶

In [48]:
x = [8.58, 10.46, 9.01, 9.64, 8.86]
s = pd.Series(x)

In [49]:
# normalized values
ns = (s - s.mean())/s.std()
ns

Out[49]:
0   -0.971866
1    1.531022
2   -0.399397
3    0.439337
4   -0.599095
dtype: float64
In [50]:
ns.mean()

Out[50]:
1.8873791418627661e-15

### q7¶

In [51]:
df

Out[51]:
x y
0 0.80 1.39
1 0.47 0.72
2 0.51 1.55
3 0.73 0.48
4 0.36 1.19
5 0.58 -1.59
6 0.57 1.23
7 0.85 -0.65
8 0.44 1.49
9 0.42 0.05
In [52]:
result = sm.ols("y ~ x", data=df).fit()

In [53]:
result.summary()

Out[53]:
Dep. Variable: y    R-squared: 0.076     Adj. R-squared: -0.039
Model: OLS          F-statistic: 0.6620  Prob (F-statistic): 0.439
No. Observations: 10    Df Residuals: 8

                coef    std err        t    P>|t|    [95.0% Conf. Int.]
Intercept     1.5675      1.252    1.252    0.246        -1.320    4.455
x            -1.7128      2.105   -0.814    0.439        -6.568    3.142

### q9¶

What value minimizes the sum of the squared distances between these points and itself?

By definition it's the mean; see the least squares proof above.

In [54]:
df['x'].mean()

Out[54]:
0.57299999999999995

### q10¶

Consider taking the slope having fit Y as the outcome and X as the predictor, β1, and the slope from fitting X as the outcome and Y as the predictor, γ1, and dividing the two as β1/γ1. What is this ratio always equal to?

In [55]:
df

Out[55]:
x y
0 0.80 1.39
1 0.47 0.72
2 0.51 1.55
3 0.73 0.48
4 0.36 1.19
5 0.58 -1.59
6 0.57 1.23
7 0.85 -0.65
8 0.44 1.49
9 0.42 0.05
In [56]:
r1 = sm.ols("x~y", data=df).fit()
r2 = sm.ols("y~x", data=df).fit()

In [57]:
r2.params.x / r1.params.y

Out[57]:
38.390772016849851
In [58]:
df['y'].var() / df['x'].var()

Out[58]:
38.390772016849652

$\beta_1 = Cor(X, Y)\dfrac{S_Y}{S_X}$

$\gamma_1 = Cor(X, Y)\dfrac{S_X}{S_Y}$

$\dfrac{\beta_1}{\gamma_1} = \dfrac{Cor(X, Y)\dfrac{S_Y}{S_X}}{Cor(X, Y)\dfrac{S_X}{S_Y}} = \dfrac{S_Y}{S_X}\cdot\dfrac{S_Y}{S_X} = \dfrac{S_Y^2}{S_X^2} = \dfrac{Var(Y)}{Var(X)}$

# Week 2¶

## Statistical linear regression models¶

### Basic regression model with additive Gaussian errors¶

• Least squares is an estimation tool, how do we do inference?
• Develop a probabilistic model for linear regression
• $\epsilon$ is the error

$Y_i = \beta_0 + \beta_1X_i + \epsilon_i$

a slope-intercept line plus an error term

Here the $\epsilon_i$ are assumed iid $N(0, \sigma^2)$

$E[Y_i | X_i = x_i] = \mu_i = \beta_0 + \beta_1x_i$

The expected value of $Y_i$ given $x_i$ is the regression line: take the expected value of the model above and use $E[\epsilon_i]=0$

$Var(Y_i | X_i = x_i) = \sigma^2$

## Likelihood¶

• see the statistical inference class
• The least squares estimate for $\mu_i$ is exactly the maximum likelihood estimate

## Interpreting regression coefficients¶

$\beta_0$ is the expected value of the response when the predictor is 0

$E[Y|X=0] = \beta_0 + \beta_1 \cdot 0 = \beta_0$

This isn't always interesting, when $X=0$ is impossible or far outside of the range of possible values

You can shift the data by $a$, which changes the intercept but not the slope. In this way you can make the intercept interesting

$\beta_1$ is the expected change in response for a 1 unit change in the predictor

Multiplication of $X$ by a factor of $a$ results in dividing the coefficient by a factor of $a$
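
Both facts follow from rewriting the model:

$$Y_i = \beta_0 + \beta_1X_i + \epsilon_i = (\beta_0 + a\beta_1) + \beta_1(X_i - a) + \epsilon_i$$

so regressing on the shifted predictor $X_i - a$ changes the intercept to $\beta_0 + a\beta_1$ and leaves the slope alone, while

$$Y_i = \beta_0 + \dfrac{\beta_1}{a}(aX_i) + \epsilon_i$$

shows that multiplying the predictor by $a$ divides the slope by $a$.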

### Using the regression coefficients for prediction¶

$\hat Y = \hat \beta_0 + \hat \beta_1X$

• Why are these $\hat \beta$ instead of just $\beta$?
• Because they are estimates of the true $\beta$ values, specifically the least squares estimates

## Statistical Regression Models Examples¶

In [59]:
%%R -o diamond
require(UsingR)
require(ggplot2)
data(diamond)
g <- ggplot(diamond, aes(carat, price)) + geom_point() + theme_bw()
g <- g + geom_smooth(method=lm, se=FALSE)
g

Loading required package: UsingR



In [60]:
formula = "price ~ carat"
r1 = sm.ols(formula, data=diamond).fit()

In [61]:
r1.summary()

Out[61]:
Dep. Variable: price    R-squared: 0.978    Adj. R-squared: 0.978
Model: OLS              F-statistic: 2070   Prob (F-statistic): 6.75e-40
No. Observations: 48    Df Residuals: 46

                 coef    std err         t    P>|t|    [95.0% Conf. Int.]
Intercept   -259.6259     17.319   -14.991    0.000      -294.487  -224.765
carat       3721.0249     81.786    45.497    0.000      3556.398  3885.651
In [62]:
# same formula, but subtracting out the carat mean to get a usable intercept value
f = "price ~ I(carat - carat.mean())"
r2 = sm.ols(f, data=diamond).fit()

In [63]:
# note that the slope is the same
print(r1.params)
print(r2.params)
print(diamond.price.mean())

Intercept    -259.625907
carat        3721.024852
dtype: float64
Intercept                   500.083333
I(carat - carat.mean())    3721.024852
dtype: float64
500.083333333


In [71]:
%%R
y <- diamond$price
x <- diamond$carat
n <- length(y)
fit <- lm(y~x)
e <- resid(fit)
yhat <- predict(fit)
max(abs(e - (y-yhat)))

[1] 9.485746e-13

In [72]:
x = diamond.carat
y = diamond.price
results = sm.ols("price ~ carat", data=diamond).fit()

In [73]:
e = results.resid

In [74]:
yhat = results.predict()

In [75]:
df = pd.DataFrame([e, y - yhat]).T
df.columns = ['e', 'y - yhat']

In [76]:
df.head()

Out[76]:
           e   y - yhat
0 -17.948318 -17.948318
1  -7.738069  -7.738069
2 -22.948318 -22.948318
3 -85.158566 -85.158566
4 -28.630306 -28.630306

In [77]:
(df.e - df['y - yhat']).abs().max()

Out[77]:
0.0

In [78]:
from statsmodels.api import graphics

### What we're looking for is a pattern in the residuals. A pattern indicates our model is missing something¶

See this for more details on plotting regressions in statsmodels

In [79]:
fig, ax = plt.subplots(figsize=(12, 8))
fig = graphics.plot_fit(results, "carat", ax=ax)

In [80]:
fig, ax = plt.subplots(figsize=(12, 8))
graphics.plot_regress_exog(results, "carat", fig);

In [81]:
# generated data set
from random import uniform, normalvariate, seed
from math import sin

In [82]:
seed(123)
x = [uniform(-3, 3) for _ in xrange(100)]
y = [_ + sin(_) + normalvariate(mu=100, sigma=.2) for _ in x]
df = pd.DataFrame([_ for _ in zip(x, y)], columns = ['x', 'y'])
df = df.sort('x')

In [83]:
plt.scatter(df.x, df.y);

In [84]:
results = sm.ols("y ~ x", data=df).fit()

In [85]:
# why does this show up twice?
# TODO: look at the source and figure it out
fig, ax = plt.subplots(figsize=(12, 8))
graphics.plot_regress_exog(results, "x", fig);

In [86]:
plt.plot(df.x, results.resid, 'o');

### Heteroskedasticity¶

• The model isn't right because the variance isn't constant as a function of $x$

In [87]:
x = [uniform(0, 6) for _ in xrange(100)]
df = pd.DataFrame(x, columns = ['x'])

In [88]:
ygen = lambda x: x + normalvariate(0, .001*x)
df['y'] = [ygen(x) for x in df.x]

In [89]:
df = df.sort('x')
plt.plot(df.x, df.y, 'o');

In [90]:
r_hetero = sm.ols("y~x", data=df).fit()

In [91]:
plt.plot(df.x, r_hetero.resid, 'o');

In [92]:
fig, ax = plt.subplots(figsize=(12, 8))
graphics.plot_regress_exog(r_hetero, "x", fig);

• Residuals highlight stuff you will not see in scatter plots

## Estimating Residual Variation¶

• Model $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$ where $\epsilon_i \sim N(0, \sigma^2)$.
• The ML estimate of $\sigma^2$ is $\frac{1}{n}\sum_{i=1}^n e_i^2$, the average squared residual.
• Most people use $$\hat \sigma^2 = \frac{1}{n-2}\sum_{i=1}^n e_i^2.$$
• The $n-2$ instead of $n$ is so that $E[\hat \sigma^2] = \sigma^2$

## Diamond Example¶

In [93]:
diamond.head()

Out[93]:
   carat  price
0   0.17    355
1   0.16    328
2   0.17    350
3   0.18    325
4   0.25    642

In [94]:
%%R
y <- diamond$price; x <- diamond$carat; n <- length(y)
fit <- lm(y ~ x)
summary(fit)$sigma
sqrt(sum(resid(fit)^2) / (n - 2))

[1] 31.84052

In [95]:
y = 'price'; x = 'carat'
fit = sm.ols('%s ~ %s' % (y, x, ), data=diamond).fit()

In [96]:
e = (fit.predict(diamond) - diamond.price)**2  # squared residuals

• calculating:

$$\hat \sigma^2 = \frac{1}{n-2}\sum_{i=1}^n e_i^2.$$

In [97]:
from math import sqrt
sigma = sqrt((e.shape[0]-2)**-1 * e.sum())
sigma

Out[97]:
31.840522265031762
In [98]:
# Normalized by N-1 by default. This can be changed using the ddof argument
sqrt(fit.resid.var(ddof=2))

Out[98]:
31.840522265031762
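
statsmodels also exposes this estimate directly: mse_resid is the sum of squared residuals divided by the residual degrees of freedom, so its square root is the same $\hat \sigma$:

sqrt(fit.mse_resid)  # ~31.8405, same as above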

## Summarizing Variation¶

$$\sum_{i=1}^n (Y_i - \bar Y)^2 = \sum_{i=1}^n (Y_i - \hat Y_i)^2 + \sum_{i=1}^n (\hat Y_i - \bar Y)^2$$

Or

Total Variation = Residual Variation + Regression Variation

Define the percent of total varation described by the model as $$R^2 = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} = 1 - \frac{\sum_{i=1}^n (Y_i - \hat Y_i)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2}$$

### Key:¶

$Y_i$ - $i^{th}$ observation of $Y$

$\hat Y_i$ - Predicted value of $i^{th} Y$

$n$ - number of observations

$\bar Y$ - Empirical mean of $Y$

Total Variation = $\sum_{i=1}^n (\bar Y - Y_i)^2$

Residual Variation = $\sum_{i=1}^n(\hat Y_i - Y_i)^2$

Regression Variation = $\sum_{i=1}^n (\hat Y_i - \bar Y)^2$

$R^2 = \dfrac{\text{Regression Variation}}{\text{Total Variation}} = 1 - \dfrac{\text{Residual Variation}}{\text{Total Variation}}$
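
A quick numeric check of the decomposition using the diamond fit from above (a sketch; fit is the price ~ carat model):

y = diamond['price']
yhat = fit.predict(diamond)
total = ((y - y.mean())**2).sum()
residual = ((y - yhat)**2).sum()
regression = ((yhat - y.mean())**2).sum()
(regression / total, 1 - residual / total, fit.rsquared)  # all ~0.978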

## Relation between $R^2$ and $r$ (the correlation)¶

Recall that $(\hat Y_i - \bar Y) = \hat \beta_1 (X_i - \bar X)$ so that

$$R^2 = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} = \hat \beta_1^2 \frac{\sum_{i=1}^n(X_i - \bar X)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} = Cor(Y, X)^2$$

since, recall,

$$\hat \beta_1 = Cor(Y, X)\frac{Sd(Y)}{Sd(X)}$$

So $R^2$ is literally $r$ squared.

## The pitfalls of relying solely on $R^2$¶

Taken from here

In [99]:
from __future__ import print_function
"""
Edward Tufte uses this example from Anscombe to show 4 datasets of x
and y that have the same mean, standard deviation, and regression
line, but which are qualitatively different.

matplotlib fun for a rainy day
"""

from pylab import *

x =  array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = array([8,8,8,8,8,8,8,19,8,8,8])
y4 = array([6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.50,5.56,7.91,6.89])

def fit(x):
    return 3 + 0.5*x

xfit = array( [amin(x), amax(x) ] )

subplot(221)
plot(x,y1,'ks', xfit, fit(xfit), 'r-', lw=2)
axis([2,20,2,14])
setp(gca(), xticklabels=[], yticks=(4,8,12), xticks=(0,10,20))
text(3,12, 'I', fontsize=20)

subplot(222)
plot(x,y2,'ks', xfit, fit(xfit), 'r-', lw=2)
axis([2,20,2,14])
setp(gca(), xticklabels=[], yticks=(4,8,12), yticklabels=[], xticks=(0,10,20))
text(3,12, 'II', fontsize=20)

subplot(223)
plot(x,y3,'ks', xfit, fit(xfit), 'r-', lw=2)
axis([2,20,2,14])
text(3,12, 'III', fontsize=20)
setp(gca(), yticks=(4,8,12), xticks=(0,10,20))

subplot(224)

xfit = array([amin(x4),amax(x4)])
plot(x4,y4,'ks', xfit, fit(xfit), 'r-', lw=2)
axis([2,20,2,14])
setp(gca(), yticklabels=[], yticks=(4,8,12), xticks=(0,10,20))
text(3,12, 'IV', fontsize=20)

#verify the stats
pairs = (x,y1), (x,y2), (x,y3), (x4,y4)
for x, y in pairs:
    print('mean=%1.2f, std=%1.2f, r=%1.2f' % (mean(y), std(y), corrcoef(x, y)[0][1]))

show()

mean=7.50, std=1.94, r=0.82
mean=7.50, std=1.94, r=0.82
mean=7.50, std=1.94, r=0.82
mean=7.50, std=1.94, r=0.82


### So that's the bad $R^2$. What about the residual plots?¶

In [100]:
x =  array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = array([8,8,8,8,8,8,8,19,8,8,8])
y4 = array([6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.50,5.56,7.91,6.89])

df = pd.DataFrame([x, x4, y1, y2, y3, y4]).T
df.columns = ['x', 'x4', 'y1', 'y2', 'y3', 'y4']

def regress(x, y, data):
    formula = '%s ~ %s' % (y, x, )
    return sm.ols(formula, data=data).fit()

subplot(221)
plt.plot(df.x, regress('x', 'y1', df).resid, 'ro');
plt.title('I');
subplot(222)
plt.plot(df.x, regress('x', 'y2', df).resid, 'ro');
plt.title('II');
subplot(223)
plt.plot(df.x, regress('x', 'y3', df).resid, 'ro');
plt.title('III');
subplot(224)
plt.plot(df.x, regress('x4', 'y4', df).resid, 'ro');
plt.title('IV');


and the residuals clearly show the problems too

## Inference in Regression¶

Consider the model $$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$

where the $\epsilon_i$ are iid $N(0, \sigma^2)$.

We assume that the true model is known.

$\hat \beta_0 = \bar Y - \hat \beta_1 \bar X$

$\hat \beta_1 = Cor(X, Y) \cdot \dfrac{Sd(Y)}{Sd(X)}$

$\hat \beta_1= \dfrac{\sum(X_i - \bar X)(Y_i - \bar Y)}{\sum(X_i - \bar X)^2}$

(This last formula is also the estimate, so $\hat \beta_1$ is correct; it's just an equivalent closed form of the one above.)

## Review¶

• Statistics like $\frac{\hat \theta - \theta}{\hat \sigma_{\hat \theta}}$ often have the following properties.
1. Are normally distributed, and have a finite sample Student's T distribution if the true variance is replaced with a sample estimate (under normality assumptions).
2. Can be used to test $H_0 : \theta = \theta_0$ versus $H_a : \theta >, <, \neq \theta_0$.
3. Can be used to create a confidence interval for $\theta$ via $\hat \theta \pm Q_{1-\alpha/2} \hat \sigma_{\hat \theta}$ where $Q_{1-\alpha/2}$ is the relevant quantile from either a normal or T distribution.
• In the case of regression with iid sampling assumptions and normal errors, our inferences will follow very similarly to what you saw in your inference class.
• We won't cover asymptotics for regression analysis, but suffice it to say that under assumptions on the ways in which the $X$ values are collected, the iid sampling model, and the mean model, the normal results hold to create confidence intervals

## Standard errors (conditioned on X)¶

\begin{align}
Var(\hat \beta_1) & = Var\left(\frac{\sum_{i=1}^n (Y_i - \bar Y) (X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2}\right) \\
& = \frac{Var\left(\sum_{i=1}^n Y_i (X_i - \bar X) \right) }{\left(\sum_{i=1}^n (X_i - \bar X)^2 \right)^2} \\
& = \frac{\sum_{i=1}^n \sigma^2(X_i - \bar X)^2}{\left(\sum_{i=1}^n (X_i - \bar X)^2 \right)^2} \\
& = \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar X)^2}
\end{align}

## Results¶

• $\sigma_{\hat \beta_1}^2 = Var(\hat \beta_1) = \sigma^2 / \sum_{i=1}^n (X_i - \bar X)^2$
• $\sigma_{\hat \beta_0}^2 = Var(\hat \beta_0) = \left(\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2 }\right)\sigma^2$
• In practice, $\sigma$ is replaced by its estimate.
• It's probably not surprising that under iid Gaussian errors $$\frac{\hat \beta_j - \beta_j}{\hat \sigma_{\hat \beta_j}}$$ follows a $t$ distribution with $n-2$ degrees of freedom and a normal distribution for large $n$.
• This can be used to create confidence intervals and perform hypothesis tests.

### Example¶

$H_0: \beta_n = 0$

$H_a: \beta_n \neq 0$

In [101]:
%%R -o diamond
library(UsingR); data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
beta1 <- cor(y, x) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
e <- y - beta0 - beta1 * x
sigma <- sqrt(sum(e^2) / (n-2))
ssx <- sum((x - mean(x))^2)
seBeta0 <- (1 / n + mean(x) ^ 2 / ssx) ^ .5 * sigma
seBeta1 <- sigma / sqrt(ssx)
tBeta0 <- beta0 / seBeta0;
tBeta1 <- beta1 / seBeta1
pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
coefTable <- rbind(c(beta0, seBeta0, tBeta0, pBeta0), c(beta1, seBeta1, tBeta1, pBeta1))
colnames(coefTable) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
rownames(coefTable) <- c("(Intercept)", "x")
coefTable

             Estimate Std. Error   t value      P(>|t|)
(Intercept) -259.6259   17.31886 -14.99094 2.523271e-19
x           3721.0249   81.78588  45.49715 6.751260e-40

In [102]:
import scipy.stats as st
n = 46
# first stab at mimicking R's pt(); the correct mapping is worked out in the interlude below
st.t.pdf(abs(1), df = n - 2)

Out[102]:
0.23924692073493825
In [103]:
x = diamond['carat']
y = diamond['price']
n = float(len(y))
beta1 = y.corr(x) * y.std() / x.std() # regression slope
beta0 = y.mean() - beta1 * x.mean()   # regression intercept
e = y - (beta1 * x + beta0) # residuals
sigma = ((e**2).sum()/(n - 2))**0.5 # residual standard deviation (unbiased variance version)
ssx = ((x - x.mean())**2).sum() # sum of squared Xs
seBeta0 = (1/n + x.mean()**2 / ssx)**0.5 * sigma
seBeta1 = sigma / ssx**0.5
tBeta0 = beta0 / seBeta0
tBeta1 = beta1 / seBeta1
# TODO: review hypothesis testing
pBeta0 = 2 * (1-st.t.cdf(abs(tBeta0), df = n - 2))
pBeta1 = 2 * (1-st.t.cdf(abs(tBeta1), df = n - 2))
l = [[beta0, seBeta0, tBeta0, pBeta0],
     [beta1, seBeta1, tBeta1, pBeta1]]
coefTable = pd.DataFrame(l,
                         columns = ["Estimate", "Std. Error", "t value", "P(>|t|)"],
                         index = ["(Intercept)", "x"])

coefTable

Out[103]:
               Estimate  Std. Error    t value  P(>|t|)
(Intercept) -259.625907   17.318856 -14.990938        0
x           3721.024852   81.785880  45.497155        0
In [104]:
fit = sm.ols("price ~ carat", data=diamond).fit()
fit.summary()

Out[104]:
Dep. Variable: price    R-squared: 0.978    Adj. R-squared: 0.978
Model: OLS              F-statistic: 2070   Prob (F-statistic): 6.75e-40
No. Observations: 48    Df Residuals: 46

                 coef    std err         t    P>|t|    [95.0% Conf. Int.]
Intercept   -259.6259     17.319   -14.991    0.000      -294.487  -224.765
carat       3721.0249     81.786    45.497    0.000      3556.398  3885.651
In [105]:
%%R
data(diamond)
fit <- lm("price ~ carat", data=diamond)
summary(fit)

Call:
lm(formula = "price ~ carat", data = diamond)

Residuals:
Min      1Q  Median      3Q     Max
-85.159 -21.448  -0.869  18.972  79.370

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  -259.63      17.32  -14.99   <2e-16 ***
carat        3721.02      81.79   45.50   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 31.84 on 46 degrees of freedom
Multiple R-squared:  0.9783,	Adjusted R-squared:  0.9778
F-statistic:  2070 on 1 and 46 DF,  p-value: < 2.2e-16



### Interlude to find the Python equivalent of R's:¶

pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)

In [106]:
print ('\n'.join(st.t.pdf.__doc__.split('\n')[:100]))

        Probability density function at x of the given RV.

Parameters
----------
x : array_like
quantiles
arg1, arg2, arg3,... : array_like
The shape parameter(s) for the distribution (see docstring of the
instance object for more information)
loc : array_like, optional
location parameter (default=0)
scale : array_like, optional
scale parameter (default=1)

Returns
-------
pdf : ndarray
Probability density function evaluated at x


In [107]:
import numpy as np
x = np.arange(-3, 3, .01)
ypdf = [st.t.pdf(_, n - 2) for _ in x]

In [108]:
ycdf = [1 - st.t.cdf(_, n - 2) for _ in x]

In [109]:
f, axarr = plt.subplots(2, sharex=True)
axarr[0].plot(x, ypdf)
axarr[0].set_title('PDF')
axarr[1].plot(x, ycdf)
axarr[1].set_title('Upper tail: 1 - CDF');

In [110]:
%%R -o dfR
# test to figure out wtf the pt function is doing
require(ggplot2)
n <- 48
x <- seq(-3, 3, by=.01)
y <- pt(x, df = n - 2, lower.tail = FALSE)
df1 <- data.frame(x, y, 'false')
names(df1) <- c('x', 'y', 'tail')

y <- pt(x, df = n - 2, lower.tail = TRUE)
df2 <- data.frame(x, y, 'true')
names(df2) <- c('x', 'y', 'tail')
dfR <- rbind(df1, df2)
ggplot(dfR, aes(x, y, colour=tail)) + geom_line() + theme_bw() + ggtitle("Effect of lower.tail parameter in pt function")


So the pt function is the CDF (and with lower.tail=FALSE, one minus the CDF)

In [111]:
dfR = dfR[dfR['tail'] == 'false']

In [112]:
f, axarr = plt.subplots(2, sharex=True)
axarr[0].plot(dfR['x'], dfR['y'])
axarr[0].set_title('R pt, lower.tail=FALSE')
axarr[1].plot(x, ycdf)
axarr[1].set_title('Python 1 - st.t.cdf');


### Getting a confidence interval¶

In [113]:
%%R
sumCoef <- summary(fit)$coefficients
sumCoef

             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) -259.6259   17.31886 -14.99094 2.523271e-19
carat       3721.0249   81.78588  45.49715 6.751260e-40

In [114]:
%%R
sumCoef[1,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[1, 2]

[1] -294.4870 -224.7649

In [115]:
%%R
sumCoef[2,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[2, 2]

[1] 3556.398 3885.651

In [116]:
coefTable['Estimate']['x'] + pd.Series([-1, 1]) * coefTable['Std. Error']['x'] * st.t.ppf(.975, fit.df_resid)

Out[116]:
0    3556.398413
1    3885.651290
dtype: float64
In [117]:
coefTable['Estimate']['(Intercept)'] + pd.Series([-1, 1]) * coefTable['Std. Error']['(Intercept)'] * st.t.ppf(.975, fit.df_resid)

Out[117]:
0   -294.486957
1   -224.764858
dtype: float64
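
statsmodels can produce both intervals in one call via the fitted results' conf_int method (rows are the parameters, columns the lower and upper bounds):

fit.conf_int(alpha=0.05)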

What are these different R t-distribution functions doing?

In [118]:
%%R
require(reshape)
require(ggplot2)
n <- 3
by <- 0.001
x <- seq(-n, n, by)
qt <- qt(x, df = 30)
pt <- pt(x, df=30)
dt <- dt(x, df=30)
rt <- rt(x, df=30)

df <- data.frame(x, qt, pt, dt, rt)
names(df) <- c('x', 'qt\nquantile', 'pt\ndistribution', 'dt\ndensity', 'rt\nrandom')
mdf <- melt.data.frame(df, id = c('x'))
g <- ggplot(mdf, aes(x, value, colour=variable)) + geom_point(alpha=0.1)
g <- g + facet_wrap(~variable, scale='free_y') + theme_bw()
g <- g + ggtitle("R Functions for T-Distribution") + theme(legend.position="none")
g <- g + ylab('f(x)')
g

Loading required package: reshape



• pt and qt are inverses of each other.
• The input of pt is a quantile $z$ and you get back $P(T \le z)$ (or $P(T > z)$ with lower.tail=FALSE)
• The input of qt is a probability $p, p \in [0, 1]$, and you get back the $z$ such that $P(T \le z) = p$
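
The same round trip in Python with scipy (st.t.ppf is the inverse of st.t.cdf):

z = 1.5
p = st.t.cdf(z, 30)   # R: pt(1.5, df=30)
st.t.ppf(p, 30)       # R: qt(p, df=30); returns 1.5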

In [119]:
x = np.arange(-3, 3, .001)
dt = st.t.pdf(x, 30)
qt = st.t.ppf(x, 30)
pt = st.t.cdf(x, 30)
f, axarr = plt.subplots(3, sharex=True)
axarr[0].plot(x, dt)
axarr[0].set_title('Python st.t.pdf = R dt')
axarr[1].plot(x, qt)
axarr[1].set_title('Python st.t.ppf = R qt')
axarr[2].plot(x, pt)
axarr[2].set_title('Python st.t.cdf = R pt');

In [120]:
st.t.ppf(.975, 46)

Out[120]:
2.0128955952945886
In [121]:
st.norm.ppf(.975)

Out[121]:
1.959963984540054
In [122]:
%%R
qt(.975, df = fit$df)

[1] 2.012896

## Prediction of Outcomes¶

• Consider predicting $Y$ at a value of $X$
• Predicting the price of a diamond given the carat
• Predicting the height of a child given the height of the parents
• The obvious estimate for prediction at point $x_0$ is $\hat \beta_0 + \hat \beta_1 x_0$
• A standard error is needed to create a prediction interval.
• There's a distinction between intervals for the regression line at point $x_0$ and the prediction of what a $y$ would be at point $x_0$.
• Line at $x_0$ se, $\hat \sigma\sqrt{\frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}$
• Prediction interval se at $x_0$, $\hat \sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}$

### Confidence Intervals¶

$\theta$ via $\hat \theta \pm Q_{1-\alpha/2} \hat \sigma_{\hat \theta}$

### Plotting the prediction intervals¶

In [123]:
%%R
data(diamond)
x <- diamond$carat; y <- diamond$price
plot(x, y, frame=FALSE, xlab="Carat", ylab="Dollars", pch=21, col="black", bg="lightblue", cex=2)
abline(fit, lwd = 2)
xVals <- seq(min(x), max(x), by = .01)
yVals <- beta0 + beta1 * xVals
se1 <- sigma * sqrt(1 / n + (xVals - mean(x))^2/ssx) # orange
se2 <- sigma * sqrt(1 + 1 / n + (xVals - mean(x))^2/ssx) # blue
lines(xVals, yVals + 2 * se1, col='orange')
lines(xVals, yVals - 2 * se1, col='orange')
lines(xVals, yVals + 2 * se2, col='blue')
lines(xVals, yVals - 2 * se2, col='blue')

black line = OLS regression line

orange lines = confidence interval for the regression line

blue lines = prediction interval for new observations

### Discussion¶

• Both intervals have varying widths.
• Least width at the mean of the Xs.
• We are quite confident in the regression line, so that interval is very narrow.
• If we knew $\beta_0$ and $\beta_1$ this interval would have zero width.
• The prediction interval must incorporate the variability in the data around the line.
• Even if we knew $\beta_0$ and $\beta_1$ this interval would still have width.

ok... I'm taking this one on faith

## Quiz 2¶

#### q1¶

Consider the following data with x as the predictor and y as the outcome. Give a P-value for the two sided hypothesis test of whether β1 from a linear regression model is 0 or not.

In [124]:
x = [0.61, 0.93, 0.83, 0.35, 0.54, 0.16, 0.91, 0.62, 0.62]
y = [0.67, 0.84, 0.6, 0.18, 0.85, 0.47, 1.1, 0.65, 0.36]
df = pd.DataFrame(zip(x, y), columns = ['x', 'y'])

The pieces we need:

$\hat \beta_1 = Cor(x, y) \cdot \dfrac{Sd_y}{Sd_x}$

$\sigma_{\hat \beta_1}^2 = Var(\hat \beta_1) = \dfrac{\sigma^2}{\sum_{i=1}^n (X_i - \bar X)^2}$

$\hat \sigma^2 = \dfrac{1}{n-2}\sum_{i=1}^n e_i^2$

$ssx = \sum_{i=1}^n (X_i - \bar X)^2$ (convenience value)

$\hat \beta_0 = \bar Y - \hat \beta_1 \bar X$

$t = \dfrac{\hat \beta_1 - \beta_{1,H_0}}{\hat \sigma_{\hat \beta_1}}$

In [125]:
from math import sqrt
x = df['x']
y = df['y']
n = len(x)
beta1 = y.corr(x) * (y.std() / x.std())
beta0 = y.mean() - beta1 * x.mean()
ssx = ((x - x.mean())**2).sum()
yhat = beta0 + beta1 * x
e = y - yhat
sigma = sqrt(1.0/(n-2) * (e**2).sum())
se_beta1 = sqrt(sigma**2 / ssx)
t_stat = beta1 / se_beta1  # t statistic for H0: beta1 = 0; its p-value is in the summary below
t_stat

Out[125]:
2.3254912115169395
In [126]:
results = sm.ols('y~x', data=df).fit().summary()
results

/Users/idris/.virtualenvs/regres/lib/python2.7/site-packages/scipy/stats/stats.py:1205: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=9
int(n))

Out[126]:
Dep. Variable: y    R-squared: 0.436    Adj. R-squared: 0.355
Model: OLS          F-statistic: 5.408  Prob (F-statistic): 0.0530
No. Observations: 9    Df Residuals: 7

                coef    std err        t    P>|t|    [95.0% Conf. Int.]
Intercept     0.1885      0.206    0.914    0.391        -0.299    0.676
x             0.7224      0.311    2.325    0.053        -0.012    1.457

#### q2¶

Consider the previous problem, give the estimate of the residual standard deviation.

In [127]:
sigma

Out[127]:
0.2229980659078784

#### q3¶

In the mtcars data set, fit a linear regression model of weight (predictor) on mpg (outcome). Get a 95% confidence interval for the expected mpg at the average weight. What is the lower endpoint?

In [128]:
%%R -o mtcars
data(mtcars)

In [129]:
r = sm.ols('mpg ~ I(wt - wt.mean())', data=mtcars).fit()

In [130]:
alpha = 0.05
20.0906 + pd.Series([-1, 1]) * st.t.ppf(1 - alpha/2, 30) * .538

Out[130]:
0    18.991857
1    21.189343
dtype: float64
In [131]:
t = r.summary().tables[1]
print (t.as_text())

=====================================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept            20.0906      0.538     37.313      0.000        18.991    21.190
I(wt - wt.mean())    -5.3445      0.559     -9.559      0.000        -6.486    -4.203
=====================================================================================
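
The same interval comes straight from the fitted results (conf_int returns the 95% bounds by default):

r.conf_int().loc['Intercept']  # the lower bound ~18.99 is the answer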

In [132]:
(20.0906 - 18.991) / .538

Out[132]: