Exercises from Chapter 4 of An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.
I've elected to use Python instead of R.
import numpy as np
Hint: For this problem, you should follow the arguments laid out in Section 4.4.2, but without making the assumption that $\sigma_1^2 = \ldots = \sigma_K^2$.
Note: A hypercube is a generalization of a cube to an arbitrary number of dimensions. When p = 1, a hypercube is simply a line segment, when p = 2 it is a square, and when p = 100 it is a 100-dimensional cube.
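As a quick numerical aside (not part of the original answer), if we assume, as in the exercise this note accompanies, that predictions use only observations within 10% of the range of each feature, then the usable fraction of a unit hypercube is $0.1^p$ and vanishes rapidly with dimension:
for p in [1, 2, 100]:
    # The fraction of observations "near" a test point shrinks as 0.1**p
    print('p = ' + str(p) + ': fraction available = ' + '{:.3g}'.format(0.1 ** p))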
Improve, as an increased sample size reduces a more flexible model's tendency to overfit the training data.
False. If the Bayes decision boundary is linear, then a more flexible model is prone to overfitting: it will fit noise in the training data, which will reduce its accuracy when making predictions on the test set.
For multiple logistic regression, a prediction $p(X)$ is given by

$$p(X) = \frac{\exp{(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}}{1 + \exp{(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}}$$

# Estimated coefficients: intercept, hours studied, undergrad GPA
beta = np.array([-6, 0.05, 1])
# Observation: 1 for the intercept, X1 = 40 hours studied, X2 = 3.5 GPA
X = np.array([1, 40, 3.5])
# Logistic function of the linear predictor beta^T X
pX = np.exp(beta.T @ X) / (1 + np.exp(beta.T @ X))
print('p(X) = ' + str(np.around(pX, 4)))
p(X) = 0.3775
50 hrs
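To see this, set $p(X) = 0.5$, which makes the log-odds zero, and solve for the hours studied $X_1$:

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0 \implies -6 + 0.05 X_1 + 3.5 = 0 \implies X_1 = \frac{2.5}{0.05} = 50$$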
$p_1(4) = 0.752$
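This can be checked numerically with Bayes' theorem and normal densities; a minimal sketch, assuming the values stated in the exercise ($\pi_{yes} = 0.8$, $\mu_{yes} = 10$, $\mu_{no} = 0$, shared variance $\sigma^2 = 36$):
# Prior probabilities: 80% of companies issued a dividend
pi_yes, pi_no = 0.8, 0.2
# Class means and shared variance of the predictor X
mu_yes, mu_no = 10, 0
var = 36

def normal_density(x, mu, var):
    # Gaussian density with mean mu and variance var
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = 4
# Posterior probability of a dividend given X = 4
p1 = (pi_yes * normal_density(x, mu_yes, var)
      / (pi_yes * normal_density(x, mu_yes, var)
         + pi_no * normal_density(x, mu_no, var)))
print('p1(4) = ' + str(np.around(p1, 3)))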
KNN with k=1 is a highly flexible, non-parametric model; it is prone to overfitting, in which case we would observe a low training error and a high test error.
The test error will be most indicative of the model's performance on new observations.
We know that the average error rate for KNN is 18%. We expect the test error to be higher than the training error, therefore the best possible test error is 18% (assuming 18% error in training).
The worst possible test error is 36% (assuming 0% error in training).
Therefore the KNN test error is somewhere in the range 18–36%.
The logistic regression achieves a test error of 30%. This inflexible model is failing to account for some variance in the data, but we do not know whether this variance is noise (an irreducible error) or variance in the true relationship that could be captured by a more flexible model.
Without any further information, we can calculate the probability that KNN produces a lower than 30% test error as:
$p = \frac{30-18}{36-18} = \frac{2}{3}$
Therefore we should prefer the KNN method.
INCORRECT: We know k=1, so the training error will be 0%. Since the average of the training and test error rates is 18%, the test error must be 36%, which is worse than logistic regression's 30%. We should therefore prefer logistic regression.
$odds = \frac{p(X)}{1 - p(X)} = 0.37$
Rearranging for $p(X)$:
$p(X) = 0.37 - 0.37p(X)$
$p(X) + 0.37p(X) = 0.37$
$p(X) = \frac{0.37}{1 + 0.37}$
$p(X) = 0.27$
$odds = \frac{p(X)}{1 - p(X)} = \frac{0.16}{1 - 0.16} = 0.19$
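As a quick numeric check of both conversions (a minimal sketch; the helper names are my own, not from the exercise):
def odds_to_prob(odds):
    # Invert odds = p / (1 - p) to get p = odds / (1 + odds)
    return odds / (1 + odds)

def prob_to_odds(p):
    # odds = p / (1 - p)
    return p / (1 - p)

print('p(X) = ' + str(np.around(odds_to_prob(0.37), 2)))  # 0.27
print('odds = ' + str(np.around(prob_to_odds(0.16), 2)))  # 0.19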