Excercises from Chapter 3 of An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.
I've elected to use Python instead of R.
Q1. Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.
A. The null hypotheses are
The p-values given in table 3.4 suggest that we can reject the null hypotheses 1 & 2, it seems likely that there is a relationship between TV ads and Sales, and radio ads and sales.
The p-value associated with the t-statistic for newspaper ads is high which suggests that we cannot reject null hypothesis 3. This suggests that there is no significant relationship between newspaper ads and sales.
Q2. Carefully explain the differences between the KNN classifier and KNN regression methods.
A. The KNN classifier and the KNN regression methods are largely similar. The KNN classifier determines a decision boundary which can be used to segment data into 2 or more clusters or groups. KNN regression is non-parmetric method for estimating a regression function that can be used to predict some quantitivie variable.
Q3. Suppose we have a data set with five predictors, X1 = GPA, X2 = IQ, X3 = Gender (1 for Female and 0 for Male), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get β_0 = 50, β_1 = 20 , β_2 = 0.07 , β_3 = 35 , β_4 = 0.01 , β_5 = −10 .
(a) Which answer is correct, and why?
A. iii. For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough.
Because X3 is our dummy variable for gender with 1 for female and 0 male, and coefficient 35, which means – all else being equal – the model will estimate a starting salary for females $35k higher than for males, but there is an additional interaction variable concerning GPA and gender which means if GPA > 3.5 then males earn more than females.
(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0
def f(gpa, iq, gender):
return 50 + 20*gpa + 0.07*iq + 35*gender + 0.01*gpa*iq + (-10*gpa*gender)
gpa = 4
iq = 110
gender = 1
print('$' + str(f(gpa, iq, gender) * 1000))
$137100.0
(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.
False: the interaction effect might be small but we would need to inspect the standard error to understand if this interaction effect is significant. If the standard error is also very small then it might still be considered a significant effect.
Q4. I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e.
Y = β0 + β1X + β2X^2 + β3X^3 + ε
(a) Suppose that the true relationship between X and Y is linear, i.e. Y = β0 + β1X + ε. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
A: We would expect the training RSS for the cubic model because it is more flexible which allows it to fit more closely variance in the training data – which will reduce RSS despite this note being representative of a closer approaximation to the true linear relationship that is f(x).
(b) Answer (a) using test rather than training RSS.
A: We would expect the test RSS for the linear regression to be lower because the assumption of high bias is correct and so the lack of flexibility in that model is of no cost in estimating the true f(x). The cubic model is more flexible, and so is likely to overfit the training data meaning that the fit of the model will be affected by variance in the training data that is not representive of the true f(x).
(c) Suppose that the true relationship between X and Y is not linear, but we don’t know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
A: We expect training RSS to decrease as the the variance/flexibility of our model increases. This holds true regardles of the true value of f(x). So we expect the cubic model to result in a lower training RSS
(d) Answer (c) using test rather than training RSS.
There is not enough information to answer this fully.
If the true relationship is highly non-linear and there is low noise (or irreducible error) in our training data then we might expect the more flexible cubic model to deliver a better test RSS.
However, if the relationship is only slightly non-linear or the noise in our training data is high then a linear model might deliver better results.
5. Consider the fitted values that result from performing linear regression without an intercept. In this setting, the i-th fitted value takes the form:
What is a_i′ ?
Note: We interpret this result by saying that the fitted values from linear regression are linear combinations of the response values.
6. Using (3.4), argue that in the case of simple linear regression, the least squares line always passes through the point (x ̄, y ̄).
7. It is claimed in the text that in the case of simple linear regression of Y onto X, the R2 statistic (3.17) is equal to the square of the correlation between X and Y (3.18). Prove that this is the case. For simplicity, you may assume that x ̄ = y ̄ = 0.