Notebook

Question 1¶

Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.

In [15]:

import pandas as pd
import statsmodels.api as sm # t-stats and their p-values for LR coefficients not available via sklearn.


ad_data = pd.read_csv('advertising.csv')
X = ad_data[['TV','radio','newspaper']]
Y = ad_data['sales']

X_wi_intercept = sm.add_constant(X)
regr = sm.OLS(Y, X_wi_intercept).fit()
regr.summary()

Out[15]:

OLS Regression Results
Dep. Variable:	sales	R-squared:	0.897
Model:	OLS	Adj. R-squared:	0.896
Method:	Least Squares	F-statistic:	570.3
Date:	Fri, 31 Aug 2018	Prob (F-statistic):	1.58e-96
Time:	16:28:14	Log-Likelihood:	-386.18
No. Observations:	200	AIC:	780.4
Df Residuals:	196	BIC:	793.6
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
const	2.9389	0.312	9.422	0.000	2.324	3.554
TV	0.0458	0.001	32.809	0.000	0.043	0.049
radio	0.1885	0.009	21.893	0.000	0.172	0.206
newspaper	-0.0010	0.006	-0.177	0.860	-0.013	0.011

Omnibus:	60.414	Durbin-Watson:	2.084
Prob(Omnibus):	0.000	Jarque-Bera (JB):	151.241
Skew:	-1.327	Prob(JB):	1.44e-33
Kurtosis:	6.332	Cond. No.	454.

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Intercept term

coeff is 2.9389
std err is 0.312
t stat is 9.44
p-value for t-stat is < 10^-3

interpretation

Assuming the following null hypothesis is true: 'There is no association between the number of units sold and the event that there is no budget allocated for tv, radio or newspaper ads', there is a proability of < 10 ^-3 that the following relationship: 'There will be 2939 units sold in the event that there is no budget allocated for tv, radio or newspaper ads' will be observed from the data. This is strong evidence that the null hypothesis is false, we may conclude that the alternative hypothesis is most likely true.

TV term

coeff is 0.0458
std err is 0.001
t stat is 32.809
p-value for t-stat is < 10^-3

iterpretation

Assuming the following null hypothesis is true: 'There is no association between the budget allocated for TV ads and units sold', there is a probability of < 10^-3 that the following relationship: 'There will be 46 additional units sold per additional $1000 allocated for TV advertising' will be observed from the data. This is strong evidence that the null hypothesis is false, we may conclude that the alternative hypothesis is most likely true.

Radio term

coeff is 0.1885
std err is 0.009
t stat is 21.893
p-value for t-stat is < 10^-3

iterpretation

Assuming the following null hypothersis is true: 'There is no association between the budget allocated for Radio ads and units sold', there is a probability of < 10^-3 that the following reltionship: 'There will be 189 additional units sold per additional $1000 allocated for radio adverstising' will be observed from the data. This is strong evidence that the null hypothesis is false, we may conclude that the alternative hypothesis is most likely true.

Newspaper term

coeff is -0.0010
std err is 0.006
t stat is -0.177
p-value for t-stat is 0.860

iterpretation

Assuming the following null hypothersis is true: 'There is no association between the budget allocated for newspaper ads and units sold', there is a probability of < 0.86 that the following reltionship: 'There will be 1 fewer unit sold per additional $1000 allocated for newspaper adverstising' will be observed from the data. This is strong evidence that the null hypothesis is true, we may conclude that the alternative hypothesis is most likely false.

Question 2¶

Carefully explain the differences between the KNN classifier and KNN regression methods.

The range of a KNN classifier is a discrete unordered set. The range of a KNN regression model is the set of all real numbers.
The output of a KNN classifier is computed as the most represented class label of the k observations with the smallest euclidean distance from the input. The output of a KNN regression model is the mean of the k observations with the smallest euclidean distance from the input.

Question 3¶

Suppose we have a data set with five predictors, X1 = GPA, X2 = IQ, X3 = Gender (1 for Female and 0 for Male), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get:

β0 = 50 

β1 (GPA) = 20

β2 (IQ) = 0.07

β3 (GENDER 1=female) = 35

β4 (GPA * IQ) = 0.01

β5 (GPA * GENDER) = −10

(a) Which answer is correct, and why?

i. For a fixed value of IQ and GPA, males earn more on average than females.

ii. For a fixed value of IQ and GPA, females earn more on average than males.

iii. For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough.

iv. For a fixed value of IQ and GPA, females earn more on average than males provided that the GPA is high enough.

i) Not enough information - it depends on what value GPA is fixed at.

ii) Not enough information - it depends on what value GPA is fixed at.

iii) TRUE: If we fix IQ @ 0, GPA must be > than (50 + 35) / 20 = 4.25 for male to have > salary than female.

iv) FALSE: Salary for a female is penalised in proportion to their GPA by β5.

Question 4¶

I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. Y = β0 +β1X +β2X2 +β3X3 +ε.

(a) Suppose that the true relationship between X and Y is linear, i.e. Y = β0 + β1X + ε. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

(b) Answer (a) using test rather than training RSS.

(c) Suppose that the true relationship between X and Y is not linear, but we don’t know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

(d) Answer (c) using test rather than training RSS.

In [70]:

# (a) If the true relationship is perfectly linear and there is no noise, no model can
# perform better than simple linear regression (RSS=0). The cubic polynomial can perform
# equivalently as the coefficients for the higher order terms can be zero. 

import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression

x = np.arange(1, 10).reshape(9, -1)
y = 2 * np.arange(1, 10).reshape(9, -1)
regr = LinearRegression()
regr.fit(x, y)
print('simple regression coefficients: {}'.format(regr.coef_))
print('simple regression r-squared score: {}'.format(regr.score(x,y)))


regr = LinearRegression()
quad = x ** 2
cub = x ** 3
x = np.hstack((x,quad,cub))
regr.fit(x, 2 * np.arange(1, 10).reshape(9, -1))
print('ploynomial regression coefficients: {}'.format(regr.coef_))
print('ploynomial regression r-squared score: {}'.format(regr.score(x,y)))

# (a ctd) If the true relationship is perfectly linear but there is some noise in the data,
# i.e there is not enough information in X to perfectly predict y, then simple linear regression
# will yield a higher RSS on the training set than polynomial regression.

x = np.arange(1, 10).reshape(9, -1)
y = 2 * np.arange(1, 10).reshape(9, -1) + np.random.normal(0, size=(9, 1))
regr = LinearRegression()
regr.fit(x, y)
print('noisey: simple regression coefficients: {}'.format(regr.coef_))
print('noisey: simple regression r-squared score: {}'.format(regr.score(x,y)))

regr = LinearRegression()
quad = x ** 2
cub = x ** 3
x = np.hstack((x,quad,cub))
regr.fit(x, 2 * np.arange(1, 10).reshape(9, -1))
print('noisey: ploynomial regression coefficients: {}'.format(regr.coef_))
print('noisey: ploynomial regression r-squared score: {}'.format(regr.score(x,y)))

simple regression coefficients: [[2.]]
simple regression r-squared score: 1.0
ploynomial regression coefficients: [[ 2.00000000e+00 -1.11820197e-15  2.83538789e-17]]
ploynomial regression r-squared score: 1.0
noisey: simple regression coefficients: [[2.03470809]]
noisey: simple regression r-squared score: 0.99233862990075
noisey: ploynomial regression coefficients: [[ 2.00000000e+00 -1.11820197e-15  2.83538789e-17]]
noisey: ploynomial regression r-squared score: 0.9901443862648076

In [72]:

# (b) First part of answer to (a) still holds if no noise + perfect linear relationship. In the
# case of noise; RSS on the test set will be lower for simple linear regression than for polynomial
# regression, as the polynomial regression model has higher variance.

# (c) The training RSS for the polynomial model will be lower than the training RSS for the 
# simple model as it has lower bias.

# (d) There isn't enough information to tell. The ploynomial model has lower bias but higher
# variance, the total effect on the RSS depends on the relative strength of these two factors.
# If the bias decreases faster (with model complexity) than the variance increases, then one
# would expect the polynomial model to outperfrom the simple model on the test set. The converse
# is also true.

Question 5¶

Consider the fitted values that result from performing linear regres- sion without an intercept. In this setting, the ith fitted value takes the form: where:

y_hat_i = Xi * β_hat

where:

β_hat = sum(Xi * yi) / sum(x^2)

Show that we can write:

y_hat_i = sum(ai * yi)

What is ai?

Note: We interpret this result by saying that the fitted values from linear regression are linear combinations of the response values.

Question 6¶

Using (3.4), argue that in the case of simple linear regression, the least squares line always passes through the point (x_bar, y_bar)