1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
(a) The sample size n is extremely large, and the number of predictors p is small.
better - with an extremely large sample size, a more flexible approach can fit the data closely without overfitting, so it would generally obtain a better fit than an inflexible approach
(b) The number of predictors p is extremely large, and the number of observations n is small.
worse - a flexible method would overfit the small number of observations
(c) The relationship between the predictors and response is highly non-linear.
better - with more degrees of freedom, a flexible model would obtain a better fit
(d) The variance of the error terms, i.e. $\sigma^2 = Var(\varepsilon)$, is extremely high.
worse - a flexible method would fit the noise in the error terms, increasing variance without improving the estimate of $f$
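A quick simulation sketch of case (d), assuming a linear true $f$, a high error variance, and illustrative variable names:

set.seed(1)
x = runif(50)
y = 2 * x + rnorm(50, sd = 2)                 # Var(eps) is high relative to the signal
fit.inflexible = lm(y ~ x)                    # low variance, little bias here
fit.flexible = smooth.spline(x, y, df = 20)   # high flexibility chases the noise (overfits)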
2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide $n$ and $p$.
(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
*regression; inference; quantitative output of CEO salary based on the firm's features
$n = 500$ - firms in the US
$p = 3$ - profit, number of employees, industry*
(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
*classification; prediction; predicting the new product's success or failure
$n = 20$ - similar products previously launched
$p = 13$ - price charged, marketing budget, comp. price, ten other variables*
(c) We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market.
*regression; prediction; quantitative output of % change
$n = 52$ - weeks of 2012 weekly data
$p = 3$ - % change in US market, % change in British market, % change in German market*
3. We now revisit the bias-variance decomposition.
(a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.
(b) Explain why each of the five curves has the shape displayed in part (a).
The training MSE declines monotonically as flexibility increases: as flexibility increases, the fitted curve follows the observed data more and more closely.

The test MSE initially declines as flexibility increases, but at some point it levels off and then starts to increase again (the characteristic U-shape). When a fitted curve yields a small training MSE but a large test MSE we are overfitting the data: the procedure tries too hard to find patterns in the training data that may be caused by chance rather than by true properties of the unknown $f$.

The squared bias decreases monotonically and the variance increases monotonically; as a general rule, as we use more flexible methods, the variance will increase and the bias will decrease. Variance refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set: if the curve fits the observations very closely, changing any single point may cause $\hat{f}$ to change considerably, resulting in high variance. Bias refers to the error introduced by approximating a real-life problem with a much simpler model: it is unlikely that any real-life problem truly has the simple linear form assumed by, say, linear regression, so performing linear regression will introduce some bias in the estimate of $f$.

The irreducible error is a constant, so its curve is a horizontal line. It lies below the test MSE curve because the expected test MSE is always greater than $\mathrm{Var}(\varepsilon)$ (see relation $(2.7)$ below).
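For reference, relation $(2.7)$ decomposes the expected test MSE at a point $x_0$ into exactly these three components:

$$E\left(y_0 - \hat{f}(x_0)\right)^2 = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\varepsilon).$$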
4. You will now think of some real-life applications for statistical learning.
(a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
Classification 1 – Is this TV series/movie/ad campaign going to be successful or not (Response: Success/Failure; Predictors: Money spent, Talent, Running time, Producer, TV channel, Air time slot, etc.; Goal: Prediction).
Classification 2 – Should this applicant be admitted into Harvard University or not (Response: Admit/Not admit; Predictors: SAT scores, GPA, Socio-economic strata, Income of parents, Essay effectiveness, Potential, etc.; Goal: Prediction).
Classification 3 – Salk Polio vaccine trials – Successful/Not successful (Response: Did the child get polio or not; Predictors: Age, Geography, General health condition, Control/Test group, etc.; Goal: Prediction).
(b) Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
Regression 1 – GDP growth in European economies (Response: What is the GDP of countries predicted to be by 2050; Predictors: Population, Per capita income, Education, Average life expectancy, Tax revenue, Government spending, etc.; Goal: Inference).
Regression 2 – What is the average house sale price in XXX neighborhood over the next 5 years (Response: Average house in XXX neighborhood will sell for Y next year, Z the year after, T after that, etc.; Predictors: Proximity to transit, Parks, Schools, Average size of family, Average income of family, Crime rate, Price flux in surrounding neighborhoods, etc.; Goal: Inference).
Regression 3 – Gas mileage that a new car design will result in (Response: With certain parameters being set, X is the mileage we will get out of this car; Predictors: Fuel type, Number of cylinders, Engine version, etc.; Goal: Inference).
(c) Describe three real-life applications in which cluster analysis might be useful.
Cluster 1 – Division of countries into Developed, Developing and Third World (Response: By 2050, countries in Asia can be split into these following clusters; Predictors: Per capita income, Purchasing power parity, Average birth rate, Average number of years of education received, Average death rate, Population, etc.; Goal: Prediction).
Cluster 2 – Division of the average working population into income segments for taxation purposes (Response: This worker falls under this taxation bracket; Predictors: Income, Job industry, Job segment, Size of company, etc.; Goal: Prediction).
Cluster 3 – Cluster new movies being produced into ratings G/PG/R/PG-13 etc. (Response: This movie is R/PG/PG-13; Predictors: Violent content, Sexual language, Theme, etc.; Goal: Prediction).
5. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
The advantages of a very flexible approach for regression or classification are a closer fit to non-linear relationships and lower bias.
The disadvantages of a very flexible approach are that it requires estimating a greater number of parameters, that it may follow the noise too closely (overfit), and that it has higher variance.
A more flexible approach would be preferred to a less flexible approach when we are interested in prediction rather than in the interpretability of the results.
A less flexible approach would be preferred when we are interested in inference and the interpretability of the results.
More flexible | Less flexible |
---|---|
Less interpretable model (harder to use for inference). | More interpretable model (better suited to inference). |
Can yield quite accurate predictions (but beware of overfitting). | May not yield accurate predictions when the true $f$ is complex. |
If $f$ is highly nonlinear and we have a lot of observations, then a flexible nonlinear model may work very well. | If $f$ is linear, then linear regression may have little or no bias. |
6. Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?
A parametric approach reduces the problem of estimating f down to one of estimating a set of parameters because it assumes a form for f.
A non-parametric approach does not assume a functional form for f and so requires a very large number of observations to accurately estimate f.
The advantages of a parametric approach to regression or classification are that it simplifies the modeling of $f$ to estimating a few parameters and that it requires fewer observations than a non-parametric approach.
The disadvantages of a parametric approach are the potential to estimate $f$ inaccurately if the assumed form of $f$ is wrong, or to overfit the observations if a more flexible model is used.
Parametric | Non-parametric |
---|---|
Reduces the problem of estimating $f$ to estimating a set of parameters. | Seeks an estimate of $f$ that gets as close to the data points as possible without being too rough or wiggly. |
Simplifies the problem because it is generally easier to estimate a set of parameters than to fit an entirely arbitrary function. | May accurately fit a wider range of possible shapes for $f$. |
If we use more flexible models it may lead to overfitting the data. | A very large number of observations is required to obtain an accurate estimate for $f$. |
7. The table below provides a training data set containing 6 observations, 3 predictors, and 1 qualitative response variable. Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.
(a) The Euclidean distance between each observation and the test point is given in the Distance column of the table below.
Obs. | X1 | X2 | X3 | Distance(0, 0, 0) | Y |
---|---|---|---|---|---|
1 | 0 | 3 | 0 | 3 | Red |
2 | 2 | 0 | 0 | 2 | Red |
3 | 0 | 1 | 3 | sqrt(10) ~ 3.16 | Red |
4 | 0 | 1 | 2 | sqrt(5) ~ 2.24 | Green |
5 | -1 | 0 | 1 | sqrt(2) ~ 1.41 | Green |
6 | 1 | 1 | 1 | sqrt(3) ~ 1.73 | Red |
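The Distance column can be reproduced in R; a minimal sketch, where knn.df is a hypothetical data frame holding the table above:

knn.df = data.frame(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1),
  Y  = factor(c("Red", "Red", "Red", "Green", "Green", "Red"))
)
# Euclidean distance from each training observation to the test point (0, 0, 0)
sqrt(knn.df$X1^2 + knn.df$X2^2 + knn.df$X3^2)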
(b) What is our prediction with $K = 1$? Why?
If $K = 1$ then $x_5\in\mathcal{N}_0$ and we have
$P(Y = \mathrm{Red} | X = x_0) = \frac{1}{1}\sum_{i\in\mathcal{N}_0}I(y_i = \mathrm{Red}) = I(y_5 = \mathrm{Red}) = 0$
and
$P(Y = \mathrm{Green} | X = x_0) = \frac{1}{1}\sum_{i\in\mathcal{N}_0}I(y_i = \mathrm{Green}) = I(y_5 = \mathrm{Green}) = 1.$
Our prediction is then Green. Observation 5 is the closest neighbor for K = 1.
(c) What is our prediction with $K = 3$? Why?
If $K = 3$ then $x_2,x_5,x_6\in\mathcal{N}_0$ and we have
$P(Y = \mathrm{Red} | X = x_0) = \frac{1}{3}\sum_{i\in\mathcal{N}_0}I(y_i = \mathrm{Red}) = \frac{1}{3}(1 + 0 + 1) = \frac{2}{3}$
and
$P(Y = \mathrm{Green} | X = x_0) = \frac{1}{3}\sum_{i\in\mathcal{N}_0}I(y_i = \mathrm{Green}) = \frac{1}{3}(0 + 1 + 0) = \frac{1}{3}.$
Our prediction is then Red. Observations 2, 5, 6 are the closest neighbors for K = 3. 2 is Red, 5 is Green, and 6 is Red.
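As a cross-check, a sketch using the knn() function from the class package (with the knn.df data frame built above) reproduces both predictions:

library(class)
knn(train = knn.df[, c("X1", "X2", "X3")], test = c(0, 0, 0), cl = knn.df$Y, k = 1)  # Green
knn(train = knn.df[, c("X1", "X2", "X3")], test = c(0, 0, 0), cl = knn.df$Y, k = 3)  # Red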
(d) If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for $K$ to be large or small? Why?
Small. A small $K$ would be flexible for a non-linear decision boundary, whereas a large $K$ would try to fit a more linear boundary because it takes more points into consideration. As $K$ becomes larger, the boundary becomes smoother.
8. This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US; the variables can be seen as the columns of the data preview below.
Before reading the data into R, it can be viewed in Excel or a text editor.
college = read.csv("../../../data/College.csv")
head(college)
X | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 |
Adelphi University | Yes | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 |
Adrian College | Yes | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 |
Agnes Scott College | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 |
Alaska Pacific University | Yes | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 |
Albertson College | Yes | 587 | 479 | 158 | 38 | 62 | 678 | 41 | 13500 | 3335 | 500 | 675 | 67 | 73 | 9.4 | 11 | 9727 | 55 |
rownames(college) = college[,1]
head(college[, 1:5])
X | Private | Apps | Accept | Enroll | |
---|---|---|---|---|---|
Abilene Christian University | Abilene Christian University | Yes | 1660 | 1232 | 721 |
Adelphi University | Adelphi University | Yes | 2186 | 1924 | 512 |
Adrian College | Adrian College | Yes | 1428 | 1097 | 336 |
Agnes Scott College | Agnes Scott College | Yes | 417 | 349 | 137 |
Alaska Pacific University | Alaska Pacific University | Yes | 193 | 146 | 55 |
Albertson College | Albertson College | Yes | 587 | 479 | 158 |
You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored.
college = college[,-1]
head(college[, 1:5])
Private | Apps | Accept | Enroll | Top10perc | |
---|---|---|---|---|---|
Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 |
Adelphi University | Yes | 2186 | 1924 | 512 | 16 |
Adrian College | Yes | 1428 | 1097 | 336 | 22 |
Agnes Scott College | Yes | 417 | 349 | 137 | 60 |
Alaska Pacific University | Yes | 193 | 146 | 55 | 16 |
Albertson College | Yes | 587 | 479 | 158 | 38 |
Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.
summary(college)
 Private        Apps           Accept          Enroll       Top10perc    
 No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
 Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
           Median : 1558   Median : 1110   Median : 434   Median :23.00  
           Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
           3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
           Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
   Top25perc      F.Undergrad     P.Undergrad        Outstate    
 Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
 1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
 Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
 Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
 3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
 Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
   Room.Board       Books           Personal         PhD        
 Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
 1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
 Median :4200   Median : 500.0   Median :1200   Median : 75.00  
 Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
 3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
 Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
    Terminal       S.F.Ratio      perc.alumni        Expend     
 Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
 1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
 Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
 Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
 3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
 Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
   Grad.Rate     
 Min.   : 10.00  
 1st Qu.: 53.00  
 Median : 65.00  
 Mean   : 65.46  
 3rd Qu.: 78.00  
 Max.   :118.00  
ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
pairs(college[,1:10])
iii. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
plot(college$Private, college$Outstate,
xlab = "Private University", ylab = "Out of State tuition in USD", main = "Outstate Tuition Plot")
iv. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.
Elite = rep("No", nrow(college))
Elite[college$Top10perc > 50] = "Yes"
Elite = as.factor(Elite)
college = data.frame(college, Elite)
Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
summary(college$Elite)
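This shows 78 elite universities (Yes) and 699 non-elite (No).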
plot(college$Elite, college$Outstate,
xlab = "Elite University", ylab ="Out of State tuition in USD", main = "Outstate Tuition Plot")
v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow = c(2, 2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
par(mfrow = c(2,2))
hist(college$Books, col = 2, xlab = "Books", ylab = "Count")
hist(college$PhD, col = 3, xlab = "PhD", ylab = "Count")
hist(college$Grad.Rate, col = 4, xlab = "Grad Rate", ylab = "Count")
hist(college$perc.alumni, col = 6, xlab = "% alumni", ylab = "Count")
vi. Continue exploring the data, and provide a brief summary of what you discover.
summary(college$PhD)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8.00   62.00   75.00   72.66   85.00  103.00 
It is a little strange to have universities where $103\%$ of the faculty hold PhDs; let us see how many universities report this percentage, and their names.
weird.phd = college[college$PhD == 103, ]
nrow(weird.phd)
rownames(weird.phd)
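Only one university reports a PhD percentage of 103; in the original data this is Texas A&M University at Galveston.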
par(mfrow=c(1,1))
plot(college$Outstate, college$Grad.Rate)
High tuition correlates with a high graduation rate.
plot(college$Accept / college$Apps, college$S.F.Ratio)
Colleges with a low acceptance rate tend to have a low student:faculty ratio.
plot(college$Top10perc, college$Grad.Rate)
Colleges with the most students from the top 10% of their high school class do not necessarily have the highest graduation rate. Also, a graduation rate above $100\%$ is clearly erroneous!
9. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
Auto = read.csv("../../../data/Auto.csv", header=T, na.strings="?")
Auto = na.omit(Auto)
str(Auto)
'data.frame':   392 obs. of  9 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
 $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:5] 33 127 331 337 355
  ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
summary(Auto)
      mpg          cylinders      displacement     horsepower        weight    
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
 1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
 Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
 Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
 Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
  acceleration        year           origin                      name    
 Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
 1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
 Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
 Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
 3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
 Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
                                                 (Other)           :365  
Quantitative variables: mpg, cylinders, displacement, horsepower, weight, acceleration, year
Qualitative variables: name, origin
sapply(Auto[, 1:7], range)
 | mpg | cylinders | displacement | horsepower | weight | acceleration | year |
---|---|---|---|---|---|---|---|
min | 9.0 | 3 | 68 | 46 | 1613 | 8.0 | 70 |
max | 46.6 | 8 | 455 | 230 | 5140 | 24.8 | 82 |
sapply(Auto[, 1:7], mean)
sapply(Auto[, 1:7], sd)
subsetAuto = Auto[-(10:85),]
sapply(subsetAuto[, 1:7], range)
 | mpg | cylinders | displacement | horsepower | weight | acceleration | year |
---|---|---|---|---|---|---|---|
min | 11.0 | 3 | 68 | 46 | 1649 | 8.5 | 70 |
max | 46.6 | 8 | 455 | 230 | 4997 | 24.8 | 82 |
sapply(subsetAuto[, 1:7], mean)
sapply(subsetAuto[, 1:7], sd)
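The same comparison can be collected in one call; a compact sketch:

# min, max, mean and sd for each quantitative column of the subset
sapply(subsetAuto[, 1:7], function(x) c(min = min(x), max = max(x), mean = mean(x), sd = sd(x)))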
pairs(Auto)
We seem to get more miles per gallon from 4-cylinder vehicles than from the others. Weight, displacement and horsepower show an inverse relationship with mpg. We see an overall increase in mpg over the years; it almost doubled in one decade. Japanese cars (origin 3) have higher mpg than US or European cars.
plot(Auto$mpg, Auto$weight)
Heavier weight correlates with lower mpg.
plot(Auto$mpg, Auto$cylinders)
More cylinders, less mpg.
plot(Auto$mpg, Auto$year)
Cars become more efficient over time.
pairs(Auto)
See descriptions of plots in (e)
All of the predictors show some correlation with mpg. The name predictor has too few observations per value, though, so using it as a predictor is likely to overfit the data and will not generalize well.
The cylinders, horsepower, year and origin variables can be used as predictors. Displacement and weight were not used because they are highly correlated with horsepower and with each other.
cor(Auto$weight, Auto$horsepower)
cor(Auto$weight, Auto$displacement)
cor(Auto$displacement, Auto$horsepower)
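All three pairwise correlations are high (roughly 0.86 to 0.93), which supports dropping displacement and weight in favor of horsepower.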
10. This exercise involves the Boston housing data set.
library(MASS)
Now the data set is contained in the object Boston.
Boston
crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
0.02985 | 0.0 | 2.18 | 0 | 0.458 | 6.430 | 58.7 | 6.0622 | 3 | 222 | 18.7 | 394.12 | 5.21 | 28.7 |
0.08829 | 12.5 | 7.87 | 0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5 | 311 | 15.2 | 395.60 | 12.43 | 22.9 |
0.14455 | 12.5 | 7.87 | 0 | 0.524 | 6.172 | 96.1 | 5.9505 | 5 | 311 | 15.2 | 396.90 | 19.15 | 27.1 |
0.21124 | 12.5 | 7.87 | 0 | 0.524 | 5.631 | 100.0 | 6.0821 | 5 | 311 | 15.2 | 386.63 | 29.93 | 16.5 |
0.17004 | 12.5 | 7.87 | 0 | 0.524 | 6.004 | 85.9 | 6.5921 | 5 | 311 | 15.2 | 386.71 | 17.10 | 18.9 |
0.22489 | 12.5 | 7.87 | 0 | 0.524 | 6.377 | 94.3 | 6.3467 | 5 | 311 | 15.2 | 392.52 | 20.45 | 15.0 |
0.11747 | 12.5 | 7.87 | 0 | 0.524 | 6.009 | 82.9 | 6.2267 | 5 | 311 | 15.2 | 396.90 | 13.27 | 18.9 |
0.09378 | 12.5 | 7.87 | 0 | 0.524 | 5.889 | 39.0 | 5.4509 | 5 | 311 | 15.2 | 390.50 | 15.71 | 21.7 |
0.62976 | 0.0 | 8.14 | 0 | 0.538 | 5.949 | 61.8 | 4.7075 | 4 | 307 | 21.0 | 396.90 | 8.26 | 20.4 |
0.63796 | 0.0 | 8.14 | 0 | 0.538 | 6.096 | 84.5 | 4.4619 | 4 | 307 | 21.0 | 380.02 | 10.26 | 18.2 |
0.62739 | 0.0 | 8.14 | 0 | 0.538 | 5.834 | 56.5 | 4.4986 | 4 | 307 | 21.0 | 395.62 | 8.47 | 19.9 |
1.05393 | 0.0 | 8.14 | 0 | 0.538 | 5.935 | 29.3 | 4.4986 | 4 | 307 | 21.0 | 386.85 | 6.58 | 23.1 |
0.78420 | 0.0 | 8.14 | 0 | 0.538 | 5.990 | 81.7 | 4.2579 | 4 | 307 | 21.0 | 386.75 | 14.67 | 17.5 |
0.80271 | 0.0 | 8.14 | 0 | 0.538 | 5.456 | 36.6 | 3.7965 | 4 | 307 | 21.0 | 288.99 | 11.69 | 20.2 |
0.72580 | 0.0 | 8.14 | 0 | 0.538 | 5.727 | 69.5 | 3.7965 | 4 | 307 | 21.0 | 390.95 | 11.28 | 18.2 |
1.25179 | 0.0 | 8.14 | 0 | 0.538 | 5.570 | 98.1 | 3.7979 | 4 | 307 | 21.0 | 376.57 | 21.02 | 13.6 |
0.85204 | 0.0 | 8.14 | 0 | 0.538 | 5.965 | 89.2 | 4.0123 | 4 | 307 | 21.0 | 392.53 | 13.83 | 19.6 |
1.23247 | 0.0 | 8.14 | 0 | 0.538 | 6.142 | 91.7 | 3.9769 | 4 | 307 | 21.0 | 396.90 | 18.72 | 15.2 |
0.98843 | 0.0 | 8.14 | 0 | 0.538 | 5.813 | 100.0 | 4.0952 | 4 | 307 | 21.0 | 394.54 | 19.88 | 14.5 |
0.75026 | 0.0 | 8.14 | 0 | 0.538 | 5.924 | 94.1 | 4.3996 | 4 | 307 | 21.0 | 394.33 | 16.30 | 15.6 |
0.84054 | 0.0 | 8.14 | 0 | 0.538 | 5.599 | 85.7 | 4.4546 | 4 | 307 | 21.0 | 303.42 | 16.51 | 13.9 |
0.67191 | 0.0 | 8.14 | 0 | 0.538 | 5.813 | 90.3 | 4.6820 | 4 | 307 | 21.0 | 376.88 | 14.81 | 16.6 |
0.95577 | 0.0 | 8.14 | 0 | 0.538 | 6.047 | 88.8 | 4.4534 | 4 | 307 | 21.0 | 306.38 | 17.28 | 14.8 |
0.77299 | 0.0 | 8.14 | 0 | 0.538 | 6.495 | 94.4 | 4.4547 | 4 | 307 | 21.0 | 387.94 | 12.80 | 18.4 |
1.00245 | 0.0 | 8.14 | 0 | 0.538 | 6.674 | 87.3 | 4.2390 | 4 | 307 | 21.0 | 380.23 | 11.98 | 21.0 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
4.87141 | 0 | 18.10 | 0 | 0.614 | 6.484 | 93.6 | 2.3053 | 24 | 666 | 20.2 | 396.21 | 18.68 | 16.7 |
15.02340 | 0 | 18.10 | 0 | 0.614 | 5.304 | 97.3 | 2.1007 | 24 | 666 | 20.2 | 349.48 | 24.91 | 12.0 |
10.23300 | 0 | 18.10 | 0 | 0.614 | 6.185 | 96.7 | 2.1705 | 24 | 666 | 20.2 | 379.70 | 18.03 | 14.6 |
14.33370 | 0 | 18.10 | 0 | 0.614 | 6.229 | 88.0 | 1.9512 | 24 | 666 | 20.2 | 383.32 | 13.11 | 21.4 |
5.82401 | 0 | 18.10 | 0 | 0.532 | 6.242 | 64.7 | 3.4242 | 24 | 666 | 20.2 | 396.90 | 10.74 | 23.0 |
5.70818 | 0 | 18.10 | 0 | 0.532 | 6.750 | 74.9 | 3.3317 | 24 | 666 | 20.2 | 393.07 | 7.74 | 23.7 |
5.73116 | 0 | 18.10 | 0 | 0.532 | 7.061 | 77.0 | 3.4106 | 24 | 666 | 20.2 | 395.28 | 7.01 | 25.0 |
2.81838 | 0 | 18.10 | 0 | 0.532 | 5.762 | 40.3 | 4.0983 | 24 | 666 | 20.2 | 392.92 | 10.42 | 21.8 |
2.37857 | 0 | 18.10 | 0 | 0.583 | 5.871 | 41.9 | 3.7240 | 24 | 666 | 20.2 | 370.73 | 13.34 | 20.6 |
3.67367 | 0 | 18.10 | 0 | 0.583 | 6.312 | 51.9 | 3.9917 | 24 | 666 | 20.2 | 388.62 | 10.58 | 21.2 |
5.69175 | 0 | 18.10 | 0 | 0.583 | 6.114 | 79.8 | 3.5459 | 24 | 666 | 20.2 | 392.68 | 14.98 | 19.1 |
4.83567 | 0 | 18.10 | 0 | 0.583 | 5.905 | 53.2 | 3.1523 | 24 | 666 | 20.2 | 388.22 | 11.45 | 20.6 |
0.15086 | 0 | 27.74 | 0 | 0.609 | 5.454 | 92.7 | 1.8209 | 4 | 711 | 20.1 | 395.09 | 18.06 | 15.2 |
0.18337 | 0 | 27.74 | 0 | 0.609 | 5.414 | 98.3 | 1.7554 | 4 | 711 | 20.1 | 344.05 | 23.97 | 7.0 |
0.20746 | 0 | 27.74 | 0 | 0.609 | 5.093 | 98.0 | 1.8226 | 4 | 711 | 20.1 | 318.43 | 29.68 | 8.1 |
0.10574 | 0 | 27.74 | 0 | 0.609 | 5.983 | 98.8 | 1.8681 | 4 | 711 | 20.1 | 390.11 | 18.07 | 13.6 |
0.11132 | 0 | 27.74 | 0 | 0.609 | 5.983 | 83.5 | 2.1099 | 4 | 711 | 20.1 | 396.90 | 13.35 | 20.1 |
0.17331 | 0 | 9.69 | 0 | 0.585 | 5.707 | 54.0 | 2.3817 | 6 | 391 | 19.2 | 396.90 | 12.01 | 21.8 |
0.27957 | 0 | 9.69 | 0 | 0.585 | 5.926 | 42.6 | 2.3817 | 6 | 391 | 19.2 | 396.90 | 13.59 | 24.5 |
0.17899 | 0 | 9.69 | 0 | 0.585 | 5.670 | 28.8 | 2.7986 | 6 | 391 | 19.2 | 393.29 | 17.60 | 23.1 |
0.28960 | 0 | 9.69 | 0 | 0.585 | 5.390 | 72.9 | 2.7986 | 6 | 391 | 19.2 | 396.90 | 21.14 | 19.7 |
0.26838 | 0 | 9.69 | 0 | 0.585 | 5.794 | 70.6 | 2.8927 | 6 | 391 | 19.2 | 396.90 | 14.10 | 18.3 |
0.23912 | 0 | 9.69 | 0 | 0.585 | 6.019 | 65.3 | 2.4091 | 6 | 391 | 19.2 | 396.90 | 12.92 | 21.2 |
0.17783 | 0 | 9.69 | 0 | 0.585 | 5.569 | 73.5 | 2.3999 | 6 | 391 | 19.2 | 395.77 | 15.10 | 17.5 |
0.22438 | 0 | 9.69 | 0 | 0.585 | 6.027 | 79.7 | 2.4982 | 6 | 391 | 19.2 | 396.90 | 14.33 | 16.8 |
0.06263 | 0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273 | 21.0 | 391.99 | 9.67 | 22.4 |
0.04527 | 0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273 | 21.0 | 396.90 | 9.08 | 20.6 |
0.06076 | 0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 396.90 | 5.64 | 23.9 |
0.10959 | 0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273 | 21.0 | 393.45 | 6.48 | 22.0 |
0.04741 | 0 | 11.93 | 0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1 | 273 | 21.0 | 396.90 | 7.88 | 11.9 |
Read about the data set:
?Boston
How many rows are in this data set? How many columns? What do the rows and columns represent?
dim(Boston)
506 rows, 14 columns
Each row represents one of 506 suburbs of Boston; each column is a variable (13 predictors plus the response medv, the median home value).
pairs(Boston)
crim correlates with: age, dis, rad, tax, ptratio
zn correlates with: indus, nox, age, lstat
indus correlates with: age, dis
nox correlates with: age, dis
dis correlates with: lstat
lstat correlates with: medv
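These visual impressions can be checked numerically with the correlation matrix:

round(cor(Boston), 2)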
par(mfrow = c(2, 2))
plot(Boston$nox, Boston$crim)
plot(Boston$rm, Boston$crim)
plot(Boston$age, Boston$crim)
plot(Boston$dis, Boston$crim)
hist(Boston$crim, breaks = 50)
Most suburbs have very low crime rates ($80\%$ of the data falls in crim < 20).
pairs(Boston[Boston$crim < 20, ])
There may be a relationship between crim and age, dis, rad, tax and ptratio.
par(mfrow = c(3, 2))
plot(Boston$age, Boston$crim, main = "Older homes, more crime")
plot(Boston$dis, Boston$crim, main = "Closer to work-area, more crime")
plot(Boston$rad, Boston$crim, main = "Higher index of accessibility to radial highways, more crime")
plot(Boston$tax, Boston$crim, main = "Higher tax rate, more crime")
plot(Boston$ptratio, Boston$crim, main = "Higher pupil:teacher ratio, more crime")
par(mfrow=c(1,3))
hist(Boston$crim[Boston$crim > 1], breaks=25)
hist(Boston$tax, breaks=25)
hist(Boston$ptratio, breaks=25)
Most suburbs have low crime rates, but there is a long tail: 18 suburbs appear to have a crime rate above 20, reaching above 80.
Tax rates show a large divide: most suburbs have low rates, with a separate peak at 660-680.
Pupil:teacher ratios skew towards higher values, but there are no extreme outliers.
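The tail counts can be verified directly; for example:

sum(Boston$crim > 20)   # 18 suburbs with a crime rate above 20
range(Boston$tax)       # tax rates run from 187 to 711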
nrow(Boston[Boston$chas == 1, ])
median(Boston$ptratio)
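From the summary output below, the mean of the chas indicator is 0.06917, i.e. 0.06917 × 506 = 35 suburbs bound the Charles River, and the median pupil:teacher ratio is 19.05.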
t(subset(Boston, medv == min(Boston$medv)))
399 | 406 | |
---|---|---|
crim | 38.3518 | 67.9208 |
zn | 0.0000 | 0.0000 |
indus | 18.1000 | 18.1000 |
chas | 0.0000 | 0.0000 |
nox | 0.6930 | 0.6930 |
rm | 5.4530 | 5.6830 |
age | 100.0000 | 100.0000 |
dis | 1.4896 | 1.4254 |
rad | 24.0000 | 24.0000 |
tax | 666.0000 | 666.0000 |
ptratio | 20.2000 | 20.2000 |
black | 396.9000 | 384.9700 |
lstat | 30.5900 | 22.9800 |
medv | 5.0000 | 5.0000 |
summary(Boston)
      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptratio          black       
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
 Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat            medv      
 Min.   : 1.73   Min.   : 5.00  
 1st Qu.: 6.95   1st Qu.:17.02  
 Median :11.36   Median :21.20  
 Mean   :12.65   Mean   :22.53  
 3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :37.97   Max.   :50.00  
Both of these suburbs have the lowest medv ($5000). Relative to the ranges above, their crime rates are well above the third quartile, age and rad are at their maximums, and tax and the pupil:teacher ratio are high. Not the best place to live, but certainly not the worst.
nrow(Boston[Boston$rm > 7, ])
nrow(Boston[Boston$rm > 8, ])
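64 suburbs average more than seven rooms per dwelling and 13 average more than eight; the latter count is consistent with the subset summary below, where the mean of chas is 2/13 ≈ 0.1538.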
summary(subset(Boston, rm > 8))
      crim               zn            indus             chas       
 Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
 1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
 Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
 Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
 3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
 Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
      nox               rm             age             dis       
 Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
 1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
 Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
 Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
 3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
 Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
      rad              tax           ptratio          black      
 Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :354.6  
 1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:384.5  
 Median : 7.000   Median :307.0   Median :17.40   Median :386.9  
 Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :385.2  
 3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:389.7  
 Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :396.9  
     lstat           medv     
 Min.   :2.47   Min.   :21.9  
 1st Qu.:3.32   1st Qu.:41.7  
 Median :4.14   Median :48.3  
 Mean   :4.31   Mean   :44.2  
 3rd Qu.:5.12   3rd Qu.:50.0  
 Max.   :7.44   Max.   :50.0  
Compared with the ranges for the full data set, suburbs that average more than eight rooms per dwelling have relatively low crime rates, low lstat, and noticeably higher medv.