(a) The sample size n is extremely large, and the number of predictors p is small.
(b) The number of predictors p is extremely large, and the number of observations n is small.
(c) The relationship between the predictors and response is highly non-linear.
(d) The variance of the error terms, i.e. σ² = Var(ε), is extremely high.
Answers below refer to the bias-variance decomposition of the expected test error:

$$E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\epsilon)$$
(a) Better. With an extremely large n, a flexible method can fit the signal closely while the abundance of observations keeps its variance in check.
(b) Worse. With p large and n small, a flexible method would overfit the few available observations (high variance).
(c) Better. A highly non-linear relationship needs a flexible fit; an inflexible method would suffer high bias.
(d) Worse. With very noisy data, a flexible method would chase the noise, inflating variance without reducing bias. A simulation sketch estimating the three terms of the decomposition follows.
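As a minimal sketch (my own illustration, assuming a sin(x) data-generating process like the one used later in this notebook), we can estimate the three terms at a single test point by repeatedly refitting a model on fresh training sets:

import numpy as np

np.random.seed(0)
f = np.sin            # true function (an assumption for illustration)
x0, sigma = 1.0, 1/3  # test point and noise SD (assumptions)
x = np.linspace(0, np.pi, 25)

preds = []
for _ in range(2000):
    y = f(x) + np.random.normal(0, sigma, x.size)  # fresh training set
    coefs = np.polyfit(x, y, deg=2)                # fit a quadratic
    preds.append(np.polyval(coefs, x0))            # prediction at x0
preds = np.array(preds)

var_fhat = preds.var()                # Var[f_hat(x0)]
bias_sq = (preds.mean() - f(x0))**2   # Bias[f_hat(x0)]^2
# Their sum plus Var(epsilon) approximates the expected test MSE at x0.
print(var_fhat, bias_sq, sigma**2)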
(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
(a) regression, inference, n=500, p=3
(b) classification, prediction, n=20, p=13
(c) regression, prediction, n=52, p=3 (the response, the % change in USD/Euro, is not a predictor)
(a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.
(b) Explain why each of the five curves has the shape displayed in part (a).
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(0)
# Generate noisy samples from a known true function, y = sin(x)
x = np.arange(0, np.pi, np.pi / 25)
noise = np.random.normal(0, 1, x.size) / 3
y = np.sin(x)
true_df = pd.DataFrame(data={'x': x, 'y': y})
train_df = pd.DataFrame(data={'x': x, 'y': y + noise})

# Visualise polynomial fits of increasing flexibility against the true function
plt.ylim(-2, 2)
sns.lineplot(x='x', y='y', data=true_df)                    # true function
sns.regplot(x='x', y='y', data=train_df, order=1, ci=None)  # underfit: linear
sns.regplot(x='x', y='y', data=train_df, order=2, ci=None)  # reasonable fit
sns.regplot(x='x', y='y', data=train_df, order=7, ci=None)  # overfit: degree 7
# TODO: simulate the flexibility-vs-error plots on p36 of ISL, which are the
# plots this question is really after; a sketch of that simulation follows
# part (b) below.
(b) Referring to the plots on p36 of ISL, three scenarios are depicted, from leftmost to rightmost:
(1) The true function is somewhat non-linear
(2) The true function is nearly linear
(3) The true function is very non-linear
In each scenario the five curves behave the same way. Training error decreases monotonically with flexibility, since a more flexible model can fit the training observations ever more closely. Test error is U-shaped: it falls while added flexibility removes bias faster than it adds variance, then rises once the model starts fitting noise. Squared bias decreases with flexibility, because fewer assumptions are imposed on the form of f. Variance increases with flexibility, because the fit tracks the particular training sample more closely. The Bayes (irreducible) error Var(ε) is a horizontal line that test error can approach but never cross.
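A minimal sketch of such a simulation (my own construction, not ISL's code, assuming a sin(x) truth): fit polynomials of increasing degree and compare training and test MSE.

import numpy as np

np.random.seed(1)
x = np.linspace(0, np.pi, 50)
f = np.sin(x)
y_train = f + np.random.normal(0, 1/3, x.size)
y_test = f + np.random.normal(0, 1/3, x.size)  # fresh noise, same x

# Expect training MSE to fall as degree grows, while test MSE
# bottoms out at a moderate degree and then creeps back up.
for deg in range(1, 11):
    coefs = np.polyfit(x, y_train, deg)
    pred = np.polyval(coefs, x)
    print('degree {:2d}: train MSE {:.3f}, test MSE {:.3f}'.format(
        deg, np.mean((y_train - pred)**2), np.mean((y_test - pred)**2)))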
(a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
(b) Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
(c) Describe three real-life applications in which cluster analysis might be useful.
(a)
Response: Weed vs. Crop
Predictors: Image
Goal: Prediction. We are not interested in interpreting the input pixels, only in accurate classification.
Response: Fraud vs Not Fraud
Predictors: Time, location, merchant identifier, ...
Goal: Both, though accurate prediction comes first; interpretability is a bonus, as identifying strong predictors of fraud could plausibly aid efforts to prevent it.
Response: Fake news or Not
Predictors: Content, Time, IP address, comments
Goal: Both, though accurate prediction comes first; interpretability is a bonus, as identifying strong predictors of fake news could plausibly aid efforts to label it as such faster.
(b)
Response: House Price
Predictors: Number of bedrooms, street, nearest school...
Goal: Both, though accurate prediction comes first; interpretability is a bonus, as identifying strong predictors of house price could plausibly aid efforts to take advantage of real-estate opportunities.
Response: Stock Price
Predictors: historical stock data
Goal: Prediction. If we could make accurate predictions, we would be rich.
Response: Life expectancy
Predictors: lifestyle, family history of disease, genome
Goal: Both. An actuary would like an accurate prediction; an individual would like to know where to direct effort in lifestyle changes to live longer.
(c)
Customer segmentation: grouping customers by purchasing behaviour so marketing can target each segment.
Gene expression: grouping genes or tissue samples with similar expression profiles to suggest shared function or disease subtypes.
Document grouping: clustering news articles or support tickets by topic to organise them without hand labelling.
What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
Very flexible vs less flexible: a very flexible method imposes fewer assumptions on the form of f, so it can capture a wider range of shapes (lower bias), but it needs more data, is more prone to overfitting (higher variance), and is harder to interpret.
When flexible preferred: when n is large relative to p, when the true relationship is highly non-linear, and when accurate prediction matters more than interpretability.
When less flexible preferred: when n is small or p is large, when the relationship is close to linear, and when the goal is inference, so interpretability matters. A sketch comparing the two on the same data follows.
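A minimal sketch (my own illustration, reusing the sin(x) setup from above) comparing a very flexible model, 1-nearest-neighbour regression, with a much less flexible one, linear regression:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

np.random.seed(2)
x = np.linspace(0, np.pi, 50).reshape(-1, 1)
y_train = np.sin(x).ravel() + np.random.normal(0, 0.5, 50)
y_test = np.sin(x).ravel() + np.random.normal(0, 0.5, 50)  # fresh noise

for name, model in [('linear', LinearRegression()),
                    ('1-NN', KNeighborsRegressor(n_neighbors=1))]:
    pred = model.fit(x, y_train).predict(x)
    print('{}: train MSE {:.3f}, test MSE {:.3f}'.format(
        name, np.mean((y_train - pred)**2), np.mean((y_test - pred)**2)))
# 1-NN drives training error to zero, but with this much noise it typically
# generalises worse than the linear fit here; with a very non-linear truth
# and a large n, the ranking can reverse.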
Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?
Parametric: makes assumptions about the structure of the function we are trying to approximate in order to reduce the search space of possible functions. For example, a linear model assumes a linear function and need only search over the space of possible coefficients, rather than the space of all possible functions.
Non-parametric: makes no strong assumption about the structure of the function we are trying to approximate, e.g. KNN.
Advantages / disadvantages:
Non-parametric models must search over a very large space of functions and become intractable for high-dimensional data, e.g. the curse of dimensionality with KNN (illustrated below); they also typically need many more observations. Parametric models avoid this problem.
Non-parametric models make no assumption about the structure of the true function and therefore have low bias. Parametric models have higher bias, and if the assumed form is badly wrong, the fit will be poor.
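A minimal sketch (my own illustration) of the curse of dimensionality that hurts non-parametric methods like KNN: as the number of dimensions p grows, a point's nearest neighbour is barely closer than an average point, so "local" averaging stops being local.

import numpy as np

np.random.seed(3)
n = 1000
for p in [1, 2, 10, 100]:
    X = np.random.uniform(0, 1, size=(n, p))
    d = np.linalg.norm(X - X[0], axis=1)[1:]  # distances from the first point
    print('p={:3d}: nearest {:.2f}, average {:.2f}'.format(p, d.min(), d.mean()))
# As p grows, the nearest distance climbs towards the average distance.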
The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.
| Obs | X1 | X2 | X3 | Y     |
|-----|----|----|----|-------|
| 1   | 0  | 3  | 0  | Red   |
| 2   | 2  | 0  | 0  | Red   |
| 3   | 0  | 1  | 3  | Red   |
| 4   | 0  | 1  | 2  | Green |
| 5   | -1 | 0  | 1  | Green |
| 6   | 1  | 1  | 1  | Red   |
Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.
(a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.
(b) What is our prediction with K = 1? Why?
(c) What is our prediction with K = 3? Why?
(d) If the Bayes decision boundary in this problem is highly non-linear, then would we expect the best value for K to be large or small? Why?
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

X = pd.DataFrame(data={'X1': [0, 2, 0, 0, -1, 1],
                       'X2': [3, 0, 1, 1, 0, 1],
                       'X3': [0, 0, 3, 2, 1, 1]})
test_point = pd.DataFrame(data={'X1': [0], 'X2': [0], 'X3': [0]})
# (a) Euclidean distances from the test point, sorted nearest first
nbrs = NearestNeighbors(n_neighbors=6, algorithm='ball_tree').fit(X)
dist, idx = nbrs.kneighbors(test_point)
print('(a): nearest observations: {}'.format(idx[0] + 1))
print('     distances: {}'.format(np.round(dist[0], 2)))
# (b)
# Answer - Green: the single nearest observation (obs 5, distance 1.41) is Green.
# (c)
# Answer - Red: the 3 nearest observations (obs 5, 6, 2) are Green, Red, Red,
# so the majority vote is Red.
# (d)
# Answer - Small: a smaller K gives a more flexible fit with lower bias (and
# higher variance), which is what tracing a highly non-linear Bayes decision
# boundary requires.
(a): nearest observations: [5 6 2 4 1 3]
     distances: [1.41 1.73 2.   2.24 3.   3.16]
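As a quick check of (b) and (c), a sketch using scikit-learn's KNeighborsClassifier, reusing X and test_point from the block above with the labels from the table:

from sklearn.neighbors import KNeighborsClassifier

y = ['Red', 'Red', 'Red', 'Green', 'Green', 'Red']
for k in (1, 3):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print('K={}: {}'.format(k, clf.predict(test_point)[0]))
# Expected: K=1 -> Green, K=3 -> Red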