We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, . . . , p predictors. Explain your answers:
(a) Which of the three models with k predictors has the smallest training RSS?
The model obtained by best subset selection has training RSS equal to or smaller than that of the other two, since the best subset procedure considers every possible model with k predictors, whereas forward and backward stepwise each examine only a small fraction of them.
For k=1, best subset and forward stepwise will always obtain the same model.
For k=p, best subset and backward stepwise will always obtain the same model.
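A minimal simulation sketch of the training-RSS claim in (a), assuming scikit-learn is available (the data and the greedy forward-stepwise loop are illustrative, not from the text): for every k, the best-subset training RSS is no larger than the forward-stepwise training RSS.

import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def rss(features):
    # Training RSS of an ordinary least squares fit on the given feature subset.
    Xs = X[:, list(features)]
    pred = LinearRegression().fit(Xs, y).predict(Xs)
    return np.sum((y - pred) ** 2)

fwd = []
for k in range(1, p + 1):
    # Best subset: examine every subset of size k.
    best = min(itertools.combinations(range(p), k), key=rss)
    # Forward stepwise: grow the previous model greedily by one predictor.
    fwd = min((fwd + [j] for j in range(p) if j not in fwd), key=rss)
    print(k, round(rss(best), 2), "<=", round(rss(fwd), 2))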
(b) Which of the three models with k predictors has the smallest test RSS?
This cannot be determined in general. Best subset is guaranteed to win on training RSS, but its more exhaustive search can also overfit, so on a given data set any of the three could have the smallest test RSS.
(c) True or False (a small empirical check of statements i. and v. follows the list):
i. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection. True: forward stepwise builds each model by adding one predictor to the previous one.
ii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection. True: backward stepwise obtains the k-variable model by deleting one predictor from the (k+1)-variable model.
iii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection. False: the two procedures follow different search paths, so there is no such guarantee.
iv. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection. False, for the same reason.
v. The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k+1)-variable model identified by best subset selection. False: best subset re-searches all subsets at each size, so the chosen models need not be nested.
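A small empirical check of statements i. and v. on synthetic data (scikit-learn assumed; the near-duplicated column is only there to make a non-nested best-subset path more likely). The forward-stepwise path is nested by construction, while the best-subset path need not be.

import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n, p = 60, 6
X = rng.normal(size=(n, p))
# Correlated copy of column 0, to encourage best subset to "jump" between sizes.
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n)
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(size=n)

def rss(features):
    Xs = X[:, list(features)]
    pred = LinearRegression().fit(Xs, y).predict(Xs)
    return np.sum((y - pred) ** 2)

fwd, fwd_path, best_path = [], [], []
for k in range(1, p + 1):
    fwd = min((fwd + [j] for j in range(p) if j not in fwd), key=rss)
    fwd_path.append(set(fwd))
    best_path.append(set(min(itertools.combinations(range(p), k), key=rss)))

# Statement i: each forward-stepwise model is nested in the next (always True).
print(all(a <= b for a, b in zip(fwd_path, fwd_path[1:])))
# Statement v: best-subset models need not be nested (this may print False).
print(all(a <= b for a, b in zip(best_path, best_path[1:])))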
(a) The lasso, relative to least squares, is:
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
(b) Repeat (a) for ridge regression relative to least squares.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
(c) Repeat (a) for non-linear methods relative to least squares.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
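A hedged simulation sketch of (a): with many irrelevant predictors and a modest sample size, the lasso's drop in variance can more than pay for its extra bias. scikit-learn is assumed, and the data-generating process and alpha are illustrative choices, not tuned values.

import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(0)
n, p = 50, 40
beta = np.zeros(p)
beta[:5] = 1.0  # only a few predictors truly matter

def test_mse(model, reps=200):
    errs = []
    for _ in range(reps):
        X = rng.normal(size=(n, p))
        y = X @ beta + rng.normal(size=n)
        X_te = rng.normal(size=(1000, p))
        y_te = X_te @ beta + rng.normal(size=1000)
        pred = model.fit(X, y).predict(X_te)
        errs.append(np.mean((y_te - pred) ** 2))
    return np.mean(errs)

print("OLS   test MSE:", test_mse(LinearRegression()))
print("Lasso test MSE:", test_mse(Lasso(alpha=0.1)))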
Suppose we estimate the regression coefficients in a linear regression model by minimizing $\sum_{i=1}^{n}\big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\big)^2$ subject to $\sum_{j=1}^{p}|\beta_j| \le s$, for a particular value of s. For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.
(a) As we increase s from 0, the training RSS will:
iv. Steadily decrease. At s = 0 every coefficient is forced to zero (the null model); as s grows the constraint region expands, so the best achievable fit to the training data can only improve, eventually reaching the least squares fit once the constraint stops binding.
(b) Repeat (a) for test RSS.
ii. Decrease initially, and then eventually start increasing in a U shape.
(c) Repeat (a) for variance.
iii. Steadily increase. A larger budget s allows the coefficients to grow toward their least squares values, giving a more flexible fit and hence higher variance.
(d) Repeat (a) for (squared) bias.
iv. Steadily decrease. Less shrinkage means the fitted model can track the true relationship more closely, so the squared bias falls.
(e) Repeat (a) for the irreducible error.
v. Remain constant.
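A sketch of part (a) above using scikit-learn's Lasso: its penalty parameter alpha is inversely related to the budget s, so the loop below fits over a grid of alphas, recovers s as the sum of absolute fitted coefficients, and plots training RSS against s (synthetic, illustrative data). The curve should fall steadily as s grows.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

s_vals, train_rss = [], []
for alpha in np.logspace(-3, 1, 50):
    fit = Lasso(alpha=alpha).fit(X, y)
    s_vals.append(np.sum(np.abs(fit.coef_)))              # the implied budget s
    train_rss.append(np.sum((y - fit.predict(X)) ** 2))   # training RSS at that s

plt.plot(s_vals, train_rss)
plt.xlabel("s = sum of |beta_j|")
plt.ylabel("training RSS")
plt.show()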
(Ridge optimisation objective: minimize $\sum_{i=1}^{n}\big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$, as we increase λ from 0.)
Increasing λ shrinks the coefficients toward zero, the opposite direction of travel to increasing s in the previous question, so the directional answers flip: (a) iii. Steadily increase; (b) ii. Decrease initially, then increase in a U shape; (c) iv. Steadily decrease; (d) iii. Steadily increase; (e) v. Remain constant.
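And a matching sketch for the ridge objective, with scikit-learn's Ridge alpha standing in for λ: training error should rise steadily with λ while test error traces a U shape (synthetic data, illustrative grid).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n, p = 60, 40
beta = rng.normal(size=p)
X = rng.normal(size=(n, p))
y = X @ beta + 3 * rng.normal(size=n)
X_te = rng.normal(size=(1000, p))
y_te = X_te @ beta + 3 * rng.normal(size=1000)

lambdas = np.logspace(-2, 4, 60)
train_mse, test_mse = [], []
for lam in lambdas:
    fit = Ridge(alpha=lam).fit(X, y)
    # Mean squared errors (proportional to RSS at a fixed sample size).
    train_mse.append(np.mean((y - fit.predict(X)) ** 2))
    test_mse.append(np.mean((y_te - fit.predict(X_te)) ** 2))

plt.semilogx(lambdas, train_mse, label="training MSE")
plt.semilogx(lambdas, test_mse, label="test MSE")
plt.xlabel("lambda (Ridge alpha)")
plt.ylabel("mean squared error")
plt.legend()
plt.show()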
Suppose that $n = 2$, $p = 2$, $x_{11} = x_{12}$, $x_{21} = x_{22}$. Furthermore, suppose that $y_1 + y_2 = 0$ and $x_{11} + x_{21} = 0$ and $x_{12} + x_{22} = 0$, so that the estimate for the intercept in a least squares, ridge regression, or lasso model is zero: $\hat\beta_0 = 0$.
(a) Write out the ridge regression optimization problem in this setting. (b) Argue that in this setting, the ridge coefficient estimates satisfy $\hat\beta_1 = \hat\beta_2$.
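One way to write out (a) and (b) under these simplifications, as a sketch: with $\hat\beta_0 = 0$ and $x_{11} = x_{12}$, $x_{21} = x_{22}$, the ridge problem is

$$\min_{\beta_1,\beta_2}\ \big(y_1 - (\beta_1+\beta_2)x_{11}\big)^2 + \big(y_2 - (\beta_1+\beta_2)x_{21}\big)^2 + \lambda\big(\beta_1^2 + \beta_2^2\big).$$

The data enter only through the sum $\beta_1 + \beta_2$. For any fixed sum $c$, the penalty $\beta_1^2 + \beta_2^2$ is strictly minimised at $\beta_1 = \beta_2 = c/2$, so every ridge minimiser must satisfy $\hat\beta_1 = \hat\beta_2$.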
(c) Write out the lasso optimization problem in this setting. (d) Argue that in this setting, the lasso coefficients $\hat\beta_1$ and $\hat\beta_2$ are not unique; in other words, there are many possible solutions to the optimization problem in (c). Describe these solutions.
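Analogously for (c) and (d), the lasso problem replaces the penalty above with $\lambda(|\beta_1| + |\beta_2|)$, which is constant along any line $\beta_1 + \beta_2 = c$ with $\beta_1$ and $\beta_2$ of the same sign, so the whole segment of such points attaining the optimal sum solves the problem. A small numerical sketch of both claims on data satisfying the exercise's conditions, assuming scikit-learn is available (the particular numbers are arbitrary):

import numpy as np
from sklearn.linear_model import Ridge

# x11 = x12, x21 = x22, y1 + y2 = 0, x11 + x21 = 0, x12 + x22 = 0
X = np.array([[2.0, 2.0],
              [-2.0, -2.0]])
y = np.array([3.0, -3.0])
lam = 1.0

# Ridge fitted without an intercept (beta_0 = 0 here): the two coefficients agree.
ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print("ridge coefficients:", ridge.coef_)  # the two entries are equal

# Lasso objective evaluated directly: for a fixed sum beta1 + beta2,
# every nonnegative split gives exactly the same objective value.
def lasso_obj(b1, b2):
    resid = y - X @ np.array([b1, b2])
    return np.sum(resid ** 2) + lam * (abs(b1) + abs(b2))

print(lasso_obj(0.6, 0.0), lasso_obj(0.3, 0.3), lasso_obj(0.1, 0.5))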
(a) Consider (6.12) with p = 1. For some choice of y1 and λ > 0, plot (6.12) as a function of β1. Your plot should confirm that (6.12) is solved by (6.14).
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

y = 10
lmbda = 2

def eq_612(beta):
    # (6.12) with p = 1: (y1 - beta1)^2 + lambda * beta1^2
    return (y - beta) ** 2 + lmbda * beta ** 2

betas = np.linspace(-3, 10, 101)
f_of_betas = [eq_612(b) for b in betas]
sns.lineplot(x=betas, y=f_of_betas)
# (6.14): the ridge minimiser is y1 / (1 + lambda)
plt.axvline(x=y / (1 + lmbda), c='y')
plt.show()
(b) Consider (6.13) with p = 1. For some choice of y1 and λ > 0, plot (6.13) as a function of β1. Your plot should confirm that (6.13) is solved by (6.15).
y = 10
lmbda = 2

def eq_613(beta):
    # (6.13) with p = 1: (y1 - beta1)^2 + lambda * |beta1|
    return (y - beta) ** 2 + lmbda * np.abs(beta)

betas = np.linspace(-3, 20, 101)
f_of_betas = [eq_613(b) for b in betas]
sns.lineplot(x=betas, y=f_of_betas)

# (6.15): the lasso minimiser is y1 soft-thresholded by lambda / 2
if y > lmbda / 2:
    solution = y - lmbda / 2
elif y < -lmbda / 2:
    solution = y + lmbda / 2
else:
    solution = 0

plt.axvline(x=solution, c='y')
plt.show()