(a) The sample size n is extremely large, and the number of predictors p is small.
(b) The number of predictors p is extremely large, and the number of observations n is small.
(c) The relationship between the predictors and response is highly non-linear.
(d) The variance of the error terms, i.e. σ² = Var(ε), is extremely high.
Answers below refer to the bias-variance decomposition of the expected test error:

$$E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\epsilon)$$
(a) Better. With an extremely large n, a flexible method can fit the signal closely while the abundance of observations keeps its variance in check.
(b) Worse. With p large and n small, a flexible method would overfit the few available observations (high variance).
(c) Better. A highly non-linear relationship needs a flexible fit; an inflexible method would suffer high bias.
(d) Worse. With very noisy data, a flexible method would chase the noise, inflating variance without reducing bias. A simulation sketch estimating the three terms of the decomposition follows.
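As a minimal sketch (my own illustration, assuming a sin(x) data-generating process like the one used later in this notebook), we can estimate the three terms at a single test point by repeatedly refitting a model on fresh training sets:

import numpy as np

np.random.seed(0)
f = np.sin            # true function (an assumption for illustration)
x0, sigma = 1.0, 1/3  # test point and noise SD (assumptions)
x = np.linspace(0, np.pi, 25)

preds = []
for _ in range(2000):
    y = f(x) + np.random.normal(0, sigma, x.size)  # fresh training set
    coefs = np.polyfit(x, y, deg=2)                # fit a quadratic
    preds.append(np.polyval(coefs, x0))            # prediction at x0
preds = np.array(preds)

var_fhat = preds.var()                # Var[f_hat(x0)]
bias_sq = (preds.mean() - f(x0))**2   # Bias[f_hat(x0)]^2
# Their sum plus Var(epsilon) approximates the expected test MSE at x0.
print(var_fhat, bias_sq, sigma**2)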
(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
(a) regression, inference, n=500, p=3
(b) classification, prediction, n=20, p=13
(c) regression, prediction, n=52, p=3 (the response, the % change in USD/Euro, is not a predictor)
(a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.
(b) Explain why each of the five curves has the shape displayed in part (a).
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(0)
# Generate noisy samples from a known true function, y = sin(x)
x = np.arange(0, np.pi, np.pi / 25)
noise = np.random.normal(0, 1, x.size) / 3
y = np.sin(x)
true_df = pd.DataFrame(data={'x': x, 'y': y})
train_df = pd.DataFrame(data={'x': x, 'y': y + noise})

# Visualise polynomial fits of increasing flexibility against the true function
plt.ylim(-2, 2)
sns.lineplot(x='x', y='y', data=true_df)                    # true function
sns.regplot(x='x', y='y', data=train_df, order=1, ci=None)  # underfit: linear
sns.regplot(x='x', y='y', data=train_df, order=2, ci=None)  # reasonable fit
sns.regplot(x='x', y='y', data=train_df, order=7, ci=None)  # overfit: degree 7
# TODO: simulate the flexibility-vs-error plots on p36 of ISL, which are the
# plots this question is really after; a sketch of that simulation follows
# part (b) below.
(b) Referring to the plots on p36 of ISL, three scenarios are depicted, from leftmost to rightmost:
(1) The true function is somewhat non-linear
(2) The true function is nearly linear
(3) The true function is very non-linear
In each scenario the five curves behave the same way. Training error decreases monotonically with flexibility, since a more flexible model can fit the training observations ever more closely. Test error is U-shaped: it falls while added flexibility removes bias faster than it adds variance, then rises once the model starts fitting noise. Squared bias decreases with flexibility, because fewer assumptions are imposed on the form of f. Variance increases with flexibility, because the fit tracks the particular training sample more closely. The Bayes (irreducible) error Var(ε) is a horizontal line that test error can approach but never cross.
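A minimal sketch of such a simulation (my own construction, not ISL's code, assuming a sin(x) truth): fit polynomials of increasing degree and compare training and test MSE.

import numpy as np

np.random.seed(1)
x = np.linspace(0, np.pi, 50)
f = np.sin(x)
y_train = f + np.random.normal(0, 1/3, x.size)
y_test = f + np.random.normal(0, 1/3, x.size)  # fresh noise, same x

# Expect training MSE to fall as degree grows, while test MSE
# bottoms out at a moderate degree and then creeps back up.
for deg in range(1, 11):
    coefs = np.polyfit(x, y_train, deg)
    pred = np.polyval(coefs, x)
    print('degree {:2d}: train MSE {:.3f}, test MSE {:.3f}'.format(
        deg, np.mean((y_train - pred)**2), np.mean((y_test - pred)**2)))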
(a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
(b) Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
(c) Describe three real-life applications in which cluster analysis might be useful.
(a)
Response: Weed vs. Crop
Predictors: Image
Goal: Prediction. We are not interested in interpreting the input pixels, only in accurate classification.
Response: Fraud vs Not Fraud
Predictors: Time, location, merchant identifier, ...
Goal: Both, though accurate prediction comes first; interpretability is a bonus, as identifying strong predictors of fraud could plausibly aid efforts to prevent it.
Response: Fake news or Not
Predictors: Content, Time, IP address, comments
Goal: Both, though accurate prediction comes first; interpretability is a bonus, as identifying strong predictors of fake news could plausibly aid efforts to label it as such faster.
(b)
Response: House Price
Predictors: Number of bedrooms, street, nearest school...
Goal: Both, though accurate prediction comes first; interpretability is a bonus, as identifying strong predictors of house price could plausibly aid efforts to take advantage of real-estate opportunities.
Response: Stock Price
Predictors: historical stock data
Goal: Prediction. If we could make accurate predictions, we would be rich.
Response: Life expectancy
Predictors: lifestyle, family history of disease, genome
Goal: Both. An actuary would like an accurate prediction; an individual would like to know where to direct effort in lifestyle changes to live longer.
(c)
Customer segmentation: grouping customers by purchasing behaviour so marketing can target each segment.
Gene expression: grouping genes or tissue samples with similar expression profiles to suggest shared function or disease subtypes.
Document grouping: clustering news articles or support tickets by topic to organise them without hand labelling.
What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
Very flexible vs less flexible: a very flexible method imposes fewer assumptions on the form of f, so it can capture a wider range of shapes (lower bias), but it needs more data, is more prone to overfitting (higher variance), and is harder to interpret.
When flexible preferred: when n is large relative to p, when the true relationship is highly non-linear, and when accurate prediction matters more than interpretability.
When less flexible preferred: when n is small or p is large, when the relationship is close to linear, and when the goal is inference, so interpretability matters. A sketch comparing the two on the same data follows.
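A minimal sketch (my own illustration, reusing the sin(x) setup from above) comparing a very flexible model, 1-nearest-neighbour regression, with a much less flexible one, linear regression:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

np.random.seed(2)
x = np.linspace(0, np.pi, 50).reshape(-1, 1)
y_train = np.sin(x).ravel() + np.random.normal(0, 0.5, 50)
y_test = np.sin(x).ravel() + np.random.normal(0, 0.5, 50)  # fresh noise

for name, model in [('linear', LinearRegression()),
                    ('1-NN', KNeighborsRegressor(n_neighbors=1))]:
    pred = model.fit(x, y_train).predict(x)
    print('{}: train MSE {:.3f}, test MSE {:.3f}'.format(
        name, np.mean((y_train - pred)**2), np.mean((y_test - pred)**2)))
# 1-NN drives training error to zero, but with this much noise it typically
# generalises worse than the linear fit here; with a very non-linear truth
# and a large n, the ranking can reverse.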
Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?
Parametric: makes assumptions about the structure of the function we are trying to approximate in order to reduce the search space of possible functions. For example, a linear model assumes a linear function and need only search over the space of possible coefficients, rather than the space of all possible functions.
Non-parametric: makes no strong assumption about the structure of the function we are trying to approximate, e.g. KNN.
Advantages / disadvantages:
Non-parametric models must search over a very large space of functions and become intractable for high-dimensional data, e.g. the curse of dimensionality with KNN (illustrated below); they also typically need many more observations. Parametric models avoid this problem.
Non-parametric models make no assumption about the structure of the true function and therefore have low bias. Parametric models have higher bias, and if the assumed form is badly wrong, the fit will be poor.
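A minimal sketch (my own illustration) of the curse of dimensionality that hurts non-parametric methods like KNN: as the number of dimensions p grows, a point's nearest neighbour is barely closer than an average point, so "local" averaging stops being local.

import numpy as np

np.random.seed(3)
n = 1000
for p in [1, 2, 10, 100]:
    X = np.random.uniform(0, 1, size=(n, p))
    d = np.linalg.norm(X - X[0], axis=1)[1:]  # distances from the first point
    print('p={:3d}: nearest {:.2f}, average {:.2f}'.format(p, d.min(), d.mean()))
# As p grows, the nearest distance climbs towards the average distance.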
The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.
| Obs | X1 | X2 | X3 | Y     |
|-----|----|----|----|-------|
| 1   | 0  | 3  | 0  | Red   |
| 2   | 2  | 0  | 0  | Red   |
| 3   | 0  | 1  | 3  | Red   |
| 4   | 0  | 1  | 2  | Green |
| 5   | -1 | 0  | 1  | Green |
| 6   | 1  | 1  | 1  | Red   |
Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.
(a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.
(b) What is our prediction with K = 1? Why?
(c) What is our prediction with K = 3? Why?
(d) If the Bayes decision boundary in this problem is highly non-linear, then would we expect the best value for K to be large or small? Why?
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

X = pd.DataFrame(data={'X1': [0, 2, 0, 0, -1, 1],
                       'X2': [3, 0, 1, 1, 0, 1],
                       'X3': [0, 0, 3, 2, 1, 1]})
test_point = pd.DataFrame(data={'X1': [0], 'X2': [0], 'X3': [0]})
# (a) Euclidean distances from the test point, sorted nearest first
nbrs = NearestNeighbors(n_neighbors=6, algorithm='ball_tree').fit(X)
dist, idx = nbrs.kneighbors(test_point)
print('(a): nearest observations: {}'.format(idx[0] + 1))
print('     distances: {}'.format(np.round(dist[0], 2)))
# (b)
# Answer - Green: the single nearest observation (obs 5, distance 1.41) is Green.
# (c)
# Answer - Red: the 3 nearest observations (obs 5, 6, 2) are Green, Red, Red,
# so the majority vote is Red.
# (d)
# Answer - Small: a smaller K gives a more flexible fit with lower bias (and
# higher variance), which is what tracing a highly non-linear Bayes decision
# boundary requires.
(a): nearest observations: [5 6 2 4 1 3]
     distances: [1.41 1.73 2.   2.24 3.   3.16]
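As a quick check of (b) and (c), a sketch using scikit-learn's KNeighborsClassifier, reusing X and test_point from the block above with the labels from the table:

from sklearn.neighbors import KNeighborsClassifier

y = ['Red', 'Red', 'Red', 'Green', 'Green', 'Red']
for k in (1, 3):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print('K={}: {}'.format(k, clf.predict(test_point)[0]))
# Expected: K=1 -> Green, K=3 -> Red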