1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
(a) The sample size n is extremely large, and the number of predictors p is small.
better - with an extremely large sample size, a more flexible approach can fit the data closely without overfitting, so it would generally obtain a better fit than an inflexible approach
(b) The number of predictors p is extremely large, and the number of observations n is small.
worse - a flexible method would overfit the small number of observations
(c) The relationship between the predictors and response is highly non-linear.
better - with more degrees of freedom, a flexible model would obtain a better fit
(d) The variance of the error terms, i.e. $\sigma^2 = Var(\varepsilon)$, is extremely high.
worse - a flexible method would fit the noise in the error terms, increasing variance without improving the estimate of $f$
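A quick simulation sketch of case (d), assuming a linear true $f$, a high error variance, and illustrative variable names:

set.seed(1)
x = runif(50)
y = 2 * x + rnorm(50, sd = 2)                 # Var(eps) is high relative to the signal
fit.inflexible = lm(y ~ x)                    # low variance, little bias here
fit.flexible = smooth.spline(x, y, df = 20)   # high flexibility chases the noise (overfits)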
2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide $n$ and $p$.
(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
*regression; inference; quantitative output of CEO salary based on the firm's features
$n = 500$ - firms in the US
$p = 3$ - profit, number of employees, industry*
(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
*classification; prediction; predicting the new product's success or failure
$n = 20$ - similar products previously launched
$p = 13$ - price charged, marketing budget, comp. price, ten other variables*
(c) We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market.
*regression; prediction; quantitative output of % change
$n = 52$ - weeks of 2012 weekly data
$p = 3$ - % change in US market, % change in British market, % change in German market*
3. We now revisit the bias-variance decomposition.
(a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.
(b) Explain why each of the five curves has the shape displayed in part (a).
The training MSE declines monotonically as flexibility increases: as flexibility increases, the fitted curve follows the observed data more and more closely.

The test MSE initially declines as flexibility increases, but at some point it levels off and then starts to increase again (the characteristic U-shape). When a fitted curve yields a small training MSE but a large test MSE we are overfitting the data: the procedure tries too hard to find patterns in the training data that may be caused by chance rather than by true properties of the unknown $f$.

The squared bias decreases monotonically and the variance increases monotonically; as a general rule, as we use more flexible methods, the variance will increase and the bias will decrease. Variance refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set: if the curve fits the observations very closely, changing any single point may cause $\hat{f}$ to change considerably, resulting in high variance. Bias refers to the error introduced by approximating a real-life problem with a much simpler model: it is unlikely that any real-life problem truly has the simple linear form assumed by, say, linear regression, so performing linear regression will introduce some bias in the estimate of $f$.

The irreducible error is a constant, so its curve is a horizontal line. It lies below the test MSE curve because the expected test MSE is always greater than $\mathrm{Var}(\varepsilon)$ (see relation $(2.7)$ below).
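For reference, relation $(2.7)$ decomposes the expected test MSE at a point $x_0$ into exactly these three components:

$$E\left(y_0 - \hat{f}(x_0)\right)^2 = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\varepsilon).$$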
4. You will now think of some real-life applications for statistical learning.
(a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
Classification 1 – Is this TV series/movie/ad campaign going to be successful or not (Response: Success/Failure; Predictors: Money spent, Talent, Running time, Producer, TV channel, Air time slot, etc.; Goal: Prediction).
Classification 2 – Should this applicant be admitted into Harvard University or not (Response: Admit/Not admit; Predictors: SAT scores, GPA, Socio-economic strata, Income of parents, Essay effectiveness, Potential, etc.; Goal: Prediction).
Classification 3 – Salk Polio vaccine trials – Successful/Not successful (Response: Did the child get polio or not; Predictors: Age, Geography, General health condition, Control/Test group, etc.; Goal: Prediction).
(b) Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
Regression 1 – GDP growth in European economies (Response: What is the GDP of countries predicted to be by 2050; Predictors: Population, Per capita income, Education, Average life expectancy, Tax revenue, Government spending, etc.; Goal: Inference).
Regression 2 – What is the average house sale price in XXX neighborhood over the next 5 years (Response: Average house in XXX neighborhood will sell for Y next year, Z the year after, T after that, etc.; Predictors: Proximity to transit, Parks, Schools, Average size of family, Average income of family, Crime rate, Price flux in surrounding neighborhoods, etc.; Goal: Inference).
Regression 3 – Gas mileage that a new car design will result in (Response: With certain parameters being set, X is the mileage we will get out of this car; Predictors: Fuel type, Number of cylinders, Engine version, etc.; Goal: Inference).
(c) Describe three real-life applications in which cluster analysis might be useful.
Cluster 1 – Division of countries into Developed, Developing and Third World (Response: By 2050, countries in Asia can be split into these following clusters; Predictors: Per capita income, Purchasing power parity, Average birth rate, Average number of years of education received, Average death rate, Population, etc.; Goal: Prediction).
Cluster 2 – Division of the average working population into income segments for taxation purposes (Response: This worker falls under this taxation bracket; Predictors: Income, Job industry, Job segment, Size of company, etc.; Goal: Prediction).
Cluster 3 – Cluster new movies being produced into ratings G/PG/R/PG-13 etc. (Response: This movie is R/PG/PG-13; Predictors: Violent content, Sexual language, Theme, etc.; Goal: Prediction).
5. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
The advantages of a very flexible approach for regression or classification are a closer fit to non-linear relationships and lower bias.
The disadvantages of a very flexible approach are that it requires estimating a greater number of parameters, that it may follow the noise too closely (overfit), and that it has higher variance.
A more flexible approach would be preferred to a less flexible approach when we are interested in prediction rather than in the interpretability of the results.
A less flexible approach would be preferred when we are interested in inference and the interpretability of the results.
More flexible | Less flexible |
---|---|
Less interpretable model (harder to use for inference). | More interpretable model (better suited to inference). |
Can yield quite accurate predictions (but beware of overfitting). | May not yield accurate predictions when the true $f$ is complex. |
If $f$ is highly nonlinear and we have a lot of observations, then a flexible nonlinear model may work very well. | If $f$ is linear, then linear regression may have little or no bias. |
6. Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?
A parametric approach reduces the problem of estimating f down to one of estimating a set of parameters because it assumes a form for f.
A non-parametric approach does not assume a functional form for f and so requires a very large number of observations to accurately estimate f.
The advantages of a parametric approach to regression or classification are that it simplifies the modeling of $f$ to estimating a few parameters and that it requires fewer observations than a non-parametric approach.
The disadvantages of a parametric approach are the potential to estimate $f$ inaccurately if the assumed form of $f$ is wrong, or to overfit the observations if a more flexible model is used.
Parametric | Non-parametric |
---|---|
Reduces the problem of estimating $f$ to estimating a set of parameters. | Seeks an estimate of $f$ that gets as close to the data points as possible without being too rough or wiggly. |
Simplifies the problem because it is generally easier to estimate a set of parameters than to fit an entirely arbitrary function. | May accurately fit a wider range of possible shapes for $f$. |
If we use more flexible models it may lead to overfitting the data. | A very large number of observations is required to obtain an accurate estimate for $f$. |
7. The table below provides a training data set containing 6 observations, 3 predictors, and 1 qualitative response variable. Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.
(a) The Euclidean distance between each observation and the test point is given in the Distance column of the table below.
Obs. | X1 | X2 | X3 | Distance(0, 0, 0) | Y |
---|---|---|---|---|---|
1 | 0 | 3 | 0 | 3 | Red |
2 | 2 | 0 | 0 | 2 | Red |
3 | 0 | 1 | 3 | sqrt(10) ~ 3.16 | Red |
4 | 0 | 1 | 2 | sqrt(5) ~ 2.24 | Green |
5 | -1 | 0 | 1 | sqrt(2) ~ 1.41 | Green |
6 | 1 | 1 | 1 | sqrt(3) ~ 1.73 | Red |
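The Distance column can be reproduced in R; a minimal sketch, where knn.df is a hypothetical data frame holding the table above:

knn.df = data.frame(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1),
  Y  = factor(c("Red", "Red", "Red", "Green", "Green", "Red"))
)
# Euclidean distance from each training observation to the test point (0, 0, 0)
sqrt(knn.df$X1^2 + knn.df$X2^2 + knn.df$X3^2)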
(b) What is our prediction with $K = 1$? Why?
If $K = 1$ then $x_5\in\mathcal{N}_0$ and we have
$P(Y = \mathrm{Red} | X = x_0) = \frac{1}{1}\sum_{i\in\mathcal{N}_0}I(y_i = \mathrm{Red}) = I(y_5 = \mathrm{Red}) = 0$
and
$P(Y = \mathrm{Green} | X = x_0) = \frac{1}{1}\sum_{i\in\mathcal{N}_0}I(y_i = \mathrm{Green}) = I(y_5 = \mathrm{Green}) = 1.$
Our prediction is then Green. Observation 5 is the closest neighbor for K = 1.
(c) What is our prediction with $K = 3$? Why?
If $K = 3$ then $x_2,x_5,x_6\in\mathcal{N}_0$ and we have
$P(Y = \mathrm{Red} | X = x_0) = \frac{1}{3}\sum_{i\in\mathcal{N}_0}I(y_i = \mathrm{Red}) = \frac{1}{3}(1 + 0 + 1) = \frac{2}{3}$
and
$P(Y = \mathrm{Green} | X = x_0) = \frac{1}{3}\sum_{i\in\mathcal{N}_0}I(y_i = \mathrm{Green}) = \frac{1}{3}(0 + 1 + 0) = \frac{1}{3}.$
Our prediction is then Red. Observations 2, 5, 6 are the closest neighbors for K = 3. 2 is Red, 5 is Green, and 6 is Red.
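As a cross-check, a sketch using the knn() function from the class package (with the knn.df data frame built above) reproduces both predictions:

library(class)
knn(train = knn.df[, c("X1", "X2", "X3")], test = c(0, 0, 0), cl = knn.df$Y, k = 1)  # Green
knn(train = knn.df[, c("X1", "X2", "X3")], test = c(0, 0, 0), cl = knn.df$Y, k = 3)  # Red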
(d) If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for $K$ to be large or small? Why?
Small. A small $K$ would be flexible for a non-linear decision boundary, whereas a large $K$ would try to fit a more linear boundary because it takes more points into consideration. As $K$ becomes larger, the boundary becomes smoother.
8. This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US; the variables can be seen as the columns of the data preview below.
Before reading the data into R, it can be viewed in Excel or a text editor.
college = read.csv("../../../data/College.csv")
head(college)
X | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 |
Adelphi University | Yes | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 |
Adrian College | Yes | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 |
Agnes Scott College | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 |
Alaska Pacific University | Yes | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 |
Albertson College | Yes | 587 | 479 | 158 | 38 | 62 | 678 | 41 | 13500 | 3335 | 500 | 675 | 67 | 73 | 9.4 | 11 | 9727 | 55 |
rownames(college) = college[,1]
head(college[, 1:5])
X | Private | Apps | Accept | Enroll | |
---|---|---|---|---|---|
Abilene Christian University | Abilene Christian University | Yes | 1660 | 1232 | 721 |
Adelphi University | Adelphi University | Yes | 2186 | 1924 | 512 |
Adrian College | Adrian College | Yes | 1428 | 1097 | 336 |
Agnes Scott College | Agnes Scott College | Yes | 417 | 349 | 137 |
Alaska Pacific University | Alaska Pacific University | Yes | 193 | 146 | 55 |
Albertson College | Albertson College | Yes | 587 | 479 | 158 |
You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored.
college = college[,-1]
head(college[, 1:5])
Private | Apps | Accept | Enroll | Top10perc | |
---|---|---|---|---|---|
Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 |
Adelphi University | Yes | 2186 | 1924 | 512 | 16 |
Adrian College | Yes | 1428 | 1097 | 336 | 22 |
Agnes Scott College | Yes | 417 | 349 | 137 | 60 |
Alaska Pacific University | Yes | 193 | 146 | 55 | 16 |
Albertson College | Yes | 587 | 479 | 158 | 38 |
Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.
summary(college)
 Private        Apps           Accept          Enroll       Top10perc    
 No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
 Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
           Median : 1558   Median : 1110   Median : 434   Median :23.00  
           Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
           3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
           Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
   Top25perc      F.Undergrad     P.Undergrad        Outstate    
 Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
 1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
 Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
 Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
 3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
 Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
   Room.Board       Books           Personal         PhD        
 Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
 1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
 Median :4200   Median : 500.0   Median :1200   Median : 75.00  
 Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
 3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
 Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
    Terminal       S.F.Ratio      perc.alumni        Expend     
 Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
 1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
 Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
 Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
 3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
 Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
   Grad.Rate     
 Min.   : 10.00  
 1st Qu.: 53.00  
 Median : 65.00  
 Mean   : 65.46  
 3rd Qu.: 78.00  
 Max.   :118.00  
ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
pairs(college[,1:10])
iii. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
plot(college$Private, college$Outstate,
xlab = "Private University", ylab = "Out of State tuition in USD", main = "Outstate Tuition Plot")
iv. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.
Elite = rep("No", nrow(college))
Elite[college$Top10perc > 50] = "Yes"
Elite = as.factor(Elite)
college = data.frame(college, Elite)
Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
summary(college$Elite)
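This shows 78 elite universities (Yes) and 699 non-elite (No).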
plot(college$Elite, college$Outstate,
xlab = "Elite University", ylab ="Out of State tuition in USD", main = "Outstate Tuition Plot")
v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow = c(2, 2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
par(mfrow = c(2,2))
hist(college$Books, col = 2, xlab = "Books", ylab = "Count")
hist(college$PhD, col = 3, xlab = "PhD", ylab = "Count")
hist(college$Grad.Rate, col = 4, xlab = "Grad Rate", ylab = "Count")
hist(college$perc.alumni, col = 6, xlab = "% alumni", ylab = "Count")
vi. Continue exploring the data, and provide a brief summary of what you discover.
summary(college$PhD)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8.00   62.00   75.00   72.66   85.00  103.00 
It is a little strange to have universities where $103\%$ of the faculty hold PhDs; let us see how many universities report this percentage, and their names.
weird.phd = college[college$PhD == 103, ]
nrow(weird.phd)
rownames(weird.phd)
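Only one university reports a PhD percentage of 103; in the original data this is Texas A&M University at Galveston.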
par(mfrow=c(1,1))
plot(college$Outstate, college$Grad.Rate)
High tuition correlates with a high graduation rate.
plot(college$Accept / college$Apps, college$S.F.Ratio)
Colleges with a low acceptance rate tend to have a low student:faculty ratio.
plot(college$Top10perc, college$Grad.Rate)
Colleges with the most students from the top 10% of their high school class do not necessarily have the highest graduation rate. Also, a graduation rate above $100\%$ is clearly erroneous!
9. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
Auto = read.csv("../../../data/Auto.csv", header=T, na.strings="?")
Auto = na.omit(Auto)
str(Auto)
'data.frame':   392 obs. of  9 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
 $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:5] 33 127 331 337 355
  ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
summary(Auto)
      mpg          cylinders      displacement     horsepower        weight    
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
 1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
 Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
 Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
 Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
  acceleration        year           origin                      name    
 Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
 1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
 Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
 Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
 3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
 Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
                                                 (Other)           :365  
Quantitative variables: mpg, cylinders, displacement, horsepower, weight, acceleration, year
Qualitative variables: name, origin
sapply(Auto[, 1:7], range)
 | mpg | cylinders | displacement | horsepower | weight | acceleration | year |
---|---|---|---|---|---|---|---|
min | 9.0 | 3 | 68 | 46 | 1613 | 8.0 | 70 |
max | 46.6 | 8 | 455 | 230 | 5140 | 24.8 | 82 |
sapply(Auto[, 1:7], mean)
sapply(Auto[, 1:7], sd)
subsetAuto = Auto[-(10:85),]
sapply(subsetAuto[, 1:7], range)
 | mpg | cylinders | displacement | horsepower | weight | acceleration | year |
---|---|---|---|---|---|---|---|
min | 11.0 | 3 | 68 | 46 | 1649 | 8.5 | 70 |
max | 46.6 | 8 | 455 | 230 | 4997 | 24.8 | 82 |
sapply(subsetAuto[, 1:7], mean)
sapply(subsetAuto[, 1:7], sd)
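The same comparison can be collected in one call; a compact sketch:

# min, max, mean and sd for each quantitative column of the subset
sapply(subsetAuto[, 1:7], function(x) c(min = min(x), max = max(x), mean = mean(x), sd = sd(x)))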
pairs(Auto)
We seem to get more miles per gallon from 4-cylinder vehicles than from the others. Weight, displacement and horsepower show an inverse relationship with mpg. We see an overall increase in mpg over the years; it almost doubled in one decade. Japanese cars (origin 3) have higher mpg than US or European cars.
plot(Auto$mpg, Auto$weight)
Heavier weight correlates with lower mpg.
plot(Auto$mpg, Auto$cylinders)
More cylinders, less mpg.
plot(Auto$mpg, Auto$year)
Cars become more efficient over time.
pairs(Auto)
See descriptions of plots in (e)
All of the predictors show some correlation with mpg. The name predictor has too few observations per value, though, so using it as a predictor is likely to overfit the data and will not generalize well.
The cylinders, horsepower, year and origin variables can be used as predictors. Displacement and weight were not used because they are highly correlated with horsepower and with each other.
cor(Auto$weight, Auto$horsepower)
cor(Auto$weight, Auto$displacement)
cor(Auto$displacement, Auto$horsepower)
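All three pairwise correlations are high (roughly 0.86 to 0.93), which supports dropping displacement and weight in favor of horsepower.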
10. This exercise involves the Boston housing data set.
library(MASS)
Now the data set is contained in the object Boston.
Boston
crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
0.02985 | 0.0 | 2.18 | 0 | 0.458 | 6.430 | 58.7 | 6.0622 | 3 | 222 | 18.7 | 394.12 | 5.21 | 28.7 |
0.08829 | 12.5 | 7.87 | 0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5 | 311 | 15.2 | 395.60 | 12.43 | 22.9 |
0.14455 | 12.5 | 7.87 | 0 | 0.524 | 6.172 | 96.1 | 5.9505 | 5 | 311 | 15.2 | 396.90 | 19.15 | 27.1 |
0.21124 | 12.5 | 7.87 | 0 | 0.524 | 5.631 | 100.0 | 6.0821 | 5 | 311 | 15.2 | 386.63 | 29.93 | 16.5 |
0.17004 | 12.5 | 7.87 | 0 | 0.524 | 6.004 | 85.9 | 6.5921 | 5 | 311 | 15.2 | 386.71 | 17.10 | 18.9 |
0.22489 | 12.5 | 7.87 | 0 | 0.524 | 6.377 | 94.3 | 6.3467 | 5 | 311 | 15.2 | 392.52 | 20.45 | 15.0 |
0.11747 | 12.5 | 7.87 | 0 | 0.524 | 6.009 | 82.9 | 6.2267 | 5 | 311 | 15.2 | 396.90 | 13.27 | 18.9 |
0.09378 | 12.5 | 7.87 | 0 | 0.524 | 5.889 | 39.0 | 5.4509 | 5 | 311 | 15.2 | 390.50 | 15.71 | 21.7 |
0.62976 | 0.0 | 8.14 | 0 | 0.538 | 5.949 | 61.8 | 4.7075 | 4 | 307 | 21.0 | 396.90 | 8.26 | 20.4 |
0.63796 | 0.0 | 8.14 | 0 | 0.538 | 6.096 | 84.5 | 4.4619 | 4 | 307 | 21.0 | 380.02 | 10.26 | 18.2 |
0.62739 | 0.0 | 8.14 | 0 | 0.538 | 5.834 | 56.5 | 4.4986 | 4 | 307 | 21.0 | 395.62 | 8.47 | 19.9 |
1.05393 | 0.0 | 8.14 | 0 | 0.538 | 5.935 | 29.3 | 4.4986 | 4 | 307 | 21.0 | 386.85 | 6.58 | 23.1 |
0.78420 | 0.0 | 8.14 | 0 | 0.538 | 5.990 | 81.7 | 4.2579 | 4 | 307 | 21.0 | 386.75 | 14.67 | 17.5 |
0.80271 | 0.0 | 8.14 | 0 | 0.538 | 5.456 | 36.6 | 3.7965 | 4 | 307 | 21.0 | 288.99 | 11.69 | 20.2 |
0.72580 | 0.0 | 8.14 | 0 | 0.538 | 5.727 | 69.5 | 3.7965 | 4 | 307 | 21.0 | 390.95 | 11.28 | 18.2 |
1.25179 | 0.0 | 8.14 | 0 | 0.538 | 5.570 | 98.1 | 3.7979 | 4 | 307 | 21.0 | 376.57 | 21.02 | 13.6 |
0.85204 | 0.0 | 8.14 | 0 | 0.538 | 5.965 | 89.2 | 4.0123 | 4 | 307 | 21.0 | 392.53 | 13.83 | 19.6 |
1.23247 | 0.0 | 8.14 | 0 | 0.538 | 6.142 | 91.7 | 3.9769 | 4 | 307 | 21.0 | 396.90 | 18.72 | 15.2 |
0.98843 | 0.0 | 8.14 | 0 | 0.538 | 5.813 | 100.0 | 4.0952 | 4 | 307 | 21.0 | 394.54 | 19.88 | 14.5 |
0.75026 | 0.0 | 8.14 | 0 | 0.538 | 5.924 | 94.1 | 4.3996 | 4 | 307 | 21.0 | 394.33 | 16.30 | 15.6 |
0.84054 | 0.0 | 8.14 | 0 | 0.538 | 5.599 | 85.7 | 4.4546 | 4 | 307 | 21.0 | 303.42 | 16.51 | 13.9 |
0.67191 | 0.0 | 8.14 | 0 | 0.538 | 5.813 | 90.3 | 4.6820 | 4 | 307 | 21.0 | 376.88 | 14.81 | 16.6 |
0.95577 | 0.0 | 8.14 | 0 | 0.538 | 6.047 | 88.8 | 4.4534 | 4 | 307 | 21.0 | 306.38 | 17.28 | 14.8 |
0.77299 | 0.0 | 8.14 | 0 | 0.538 | 6.495 | 94.4 | 4.4547 | 4 | 307 | 21.0 | 387.94 | 12.80 | 18.4 |
1.00245 | 0.0 | 8.14 | 0 | 0.538 | 6.674 | 87.3 | 4.2390 | 4 | 307 | 21.0 | 380.23 | 11.98 | 21.0 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
4.87141 | 0 | 18.10 | 0 | 0.614 | 6.484 | 93.6 | 2.3053 | 24 | 666 | 20.2 | 396.21 | 18.68 | 16.7 |
15.02340 | 0 | 18.10 | 0 | 0.614 | 5.304 | 97.3 | 2.1007 | 24 | 666 | 20.2 | 349.48 | 24.91 | 12.0 |
10.23300 | 0 | 18.10 | 0 | 0.614 | 6.185 | 96.7 | 2.1705 | 24 | 666 | 20.2 | 379.70 | 18.03 | 14.6 |
14.33370 | 0 | 18.10 | 0 | 0.614 | 6.229 | 88.0 | 1.9512 | 24 | 666 | 20.2 | 383.32 | 13.11 | 21.4 |
5.82401 | 0 | 18.10 | 0 | 0.532 | 6.242 | 64.7 | 3.4242 | 24 | 666 | 20.2 | 396.90 | 10.74 | 23.0 |
5.70818 | 0 | 18.10 | 0 | 0.532 | 6.750 | 74.9 | 3.3317 | 24 | 666 | 20.2 | 393.07 | 7.74 | 23.7 |
5.73116 | 0 | 18.10 | 0 | 0.532 | 7.061 | 77.0 | 3.4106 | 24 | 666 | 20.2 | 395.28 | 7.01 | 25.0 |
2.81838 | 0 | 18.10 | 0 | 0.532 | 5.762 | 40.3 | 4.0983 | 24 | 666 | 20.2 | 392.92 | 10.42 | 21.8 |
2.37857 | 0 | 18.10 | 0 | 0.583 | 5.871 | 41.9 | 3.7240 | 24 | 666 | 20.2 | 370.73 | 13.34 | 20.6 |
3.67367 | 0 | 18.10 | 0 | 0.583 | 6.312 | 51.9 | 3.9917 | 24 | 666 | 20.2 | 388.62 | 10.58 | 21.2 |
5.69175 | 0 | 18.10 | 0 | 0.583 | 6.114 | 79.8 | 3.5459 | 24 | 666 | 20.2 | 392.68 | 14.98 | 19.1 |
4.83567 | 0 | 18.10 | 0 | 0.583 | 5.905 | 53.2 | 3.1523 | 24 | 666 | 20.2 | 388.22 | 11.45 | 20.6 |
0.15086 | 0 | 27.74 | 0 | 0.609 | 5.454 | 92.7 | 1.8209 | 4 | 711 | 20.1 | 395.09 | 18.06 | 15.2 |
0.18337 | 0 | 27.74 | 0 | 0.609 | 5.414 | 98.3 | 1.7554 | 4 | 711 | 20.1 | 344.05 | 23.97 | 7.0 |
0.20746 | 0 | 27.74 | 0 | 0.609 | 5.093 | 98.0 | 1.8226 | 4 | 711 | 20.1 | 318.43 | 29.68 | 8.1 |
0.10574 | 0 | 27.74 | 0 | 0.609 | 5.983 | 98.8 | 1.8681 | 4 | 711 | 20.1 | 390.11 | 18.07 | 13.6 |
0.11132 | 0 | 27.74 | 0 | 0.609 | 5.983 | 83.5 | 2.1099 | 4 | 711 | 20.1 | 396.90 | 13.35 | 20.1 |
0.17331 | 0 | 9.69 | 0 | 0.585 | 5.707 | 54.0 | 2.3817 | 6 | 391 | 19.2 | 396.90 | 12.01 | 21.8 |
0.27957 | 0 | 9.69 | 0 | 0.585 | 5.926 | 42.6 | 2.3817 | 6 | 391 | 19.2 | 396.90 | 13.59 | 24.5 |
0.17899 | 0 | 9.69 | 0 | 0.585 | 5.670 | 28.8 | 2.7986 | 6 | 391 | 19.2 | 393.29 | 17.60 | 23.1 |
0.28960 | 0 | 9.69 | 0 | 0.585 | 5.390 | 72.9 | 2.7986 | 6 | 391 | 19.2 | 396.90 | 21.14 | 19.7 |
0.26838 | 0 | 9.69 | 0 | 0.585 | 5.794 | 70.6 | 2.8927 | 6 | 391 | 19.2 | 396.90 | 14.10 | 18.3 |
0.23912 | 0 | 9.69 | 0 | 0.585 | 6.019 | 65.3 | 2.4091 | 6 | 391 | 19.2 | 396.90 | 12.92 | 21.2 |
0.17783 | 0 | 9.69 | 0 | 0.585 | 5.569 | 73.5 | 2.3999 | 6 | 391 | 19.2 | 395.77 | 15.10 | 17.5 |
0.22438 | 0 | 9.69 | 0 | 0.585 | 6.027 | 79.7 | 2.4982 | 6 | 391 | 19.2 | 396.90 | 14.33 | 16.8 |
0.06263 | 0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273 | 21.0 | 391.99 | 9.67 | 22.4 |
0.04527 | 0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273 | 21.0 | 396.90 | 9.08 | 20.6 |
0.06076 | 0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 396.90 | 5.64 | 23.9 |
0.10959 | 0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273 | 21.0 | 393.45 | 6.48 | 22.0 |
0.04741 | 0 | 11.93 | 0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1 | 273 | 21.0 | 396.90 | 7.88 | 11.9 |
Read about the data set:
?Boston
How many rows are in this data set? How many columns? What do the rows and columns represent?
dim(Boston)
506 rows, 14 columns
Each row represents one of 506 suburbs of Boston; each column is a variable (13 predictors plus the response medv, the median home value).
pairs(Boston)
crim correlates with: age, dis, rad, tax, ptratio
zn correlates with: indus, nox, age, lstat
indus correlates with: age, dis
nox correlates with: age, dis
dis correlates with: lstat
lstat correlates with: medv
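These visual impressions can be checked numerically with the correlation matrix:

round(cor(Boston), 2)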
par(mfrow = c(2, 2))
plot(Boston$nox, Boston$crim)
plot(Boston$rm, Boston$crim)
plot(Boston$age, Boston$crim)
plot(Boston$dis, Boston$crim)
hist(Boston$crim, breaks = 50)
Most suburbs have very low crime rates ($80\%$ of the data falls in crim < 20).
pairs(Boston[Boston$crim < 20, ])
There may be a relationship between crim and age, dis, rad, tax and ptratio.
par(mfrow = c(3, 2))
plot(Boston$age, Boston$crim, main = "Older homes, more crime")
plot(Boston$dis, Boston$crim, main = "Closer to work-area, more crime")
plot(Boston$rad, Boston$crim, main = "Higher index of accessibility to radial highways, more crime")
plot(Boston$tax, Boston$crim, main = "Higher tax rate, more crime")
plot(Boston$ptratio, Boston$crim, main = "Higher pupil:teacher ratio, more crime")
par(mfrow=c(1,3))
hist(Boston$crim[Boston$crim > 1], breaks=25)
hist(Boston$tax, breaks=25)
hist(Boston$ptratio, breaks=25)
Most suburbs have low crime rates, but there is a long tail: 18 suburbs appear to have a crime rate above 20, reaching above 80.
Tax rates show a large divide: most suburbs have low rates, with a separate peak at 660-680.
Pupil:teacher ratios skew towards higher values, but there are no extreme outliers.
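The tail counts can be verified directly; for example:

sum(Boston$crim > 20)   # 18 suburbs with a crime rate above 20
range(Boston$tax)       # tax rates run from 187 to 711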
nrow(Boston[Boston$chas == 1, ])
median(Boston$ptratio)
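From the summary output below, the mean of the chas indicator is 0.06917, i.e. 0.06917 × 506 = 35 suburbs bound the Charles River, and the median pupil:teacher ratio is 19.05.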
t(subset(Boston, medv == min(Boston$medv)))
399 | 406 | |
---|---|---|
crim | 38.3518 | 67.9208 |
zn | 0.0000 | 0.0000 |
indus | 18.1000 | 18.1000 |
chas | 0.0000 | 0.0000 |
nox | 0.6930 | 0.6930 |
rm | 5.4530 | 5.6830 |
age | 100.0000 | 100.0000 |
dis | 1.4896 | 1.4254 |
rad | 24.0000 | 24.0000 |
tax | 666.0000 | 666.0000 |
ptratio | 20.2000 | 20.2000 |
black | 396.9000 | 384.9700 |
lstat | 30.5900 | 22.9800 |
medv | 5.0000 | 5.0000 |
summary(Boston)
      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptratio          black       
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
 Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat            medv      
 Min.   : 1.73   Min.   : 5.00  
 1st Qu.: 6.95   1st Qu.:17.02  
 Median :11.36   Median :21.20  
 Mean   :12.65   Mean   :22.53  
 3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :37.97   Max.   :50.00  
Both of these suburbs have the lowest medv ($5000). Relative to the ranges above, their crime rates are well above the third quartile, age and rad are at their maximums, and tax and the pupil:teacher ratio are high. Not the best place to live, but certainly not the worst.
nrow(Boston[Boston$rm > 7, ])
nrow(Boston[Boston$rm > 8, ])
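64 suburbs average more than seven rooms per dwelling and 13 average more than eight; the latter count is consistent with the subset summary below, where the mean of chas is 2/13 ≈ 0.1538.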
summary(subset(Boston, rm > 8))
      crim               zn            indus             chas       
 Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
 1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
 Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
 Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
 3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
 Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
      nox               rm             age             dis       
 Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
 1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
 Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
 Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
 3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
 Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
      rad              tax           ptratio          black      
 Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :354.6  
 1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:384.5  
 Median : 7.000   Median :307.0   Median :17.40   Median :386.9  
 Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :385.2  
 3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:389.7  
 Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :396.9  
     lstat           medv     
 Min.   :2.47   Min.   :21.9  
 1st Qu.:3.32   1st Qu.:41.7  
 Median :4.14   Median :48.3  
 Mean   :4.31   Mean   :44.2  
 3rd Qu.:5.12   3rd Qu.:50.0  
 Max.   :7.44   Max.   :50.0  
Compared with the ranges for the full data set, suburbs that average more than eight rooms per dwelling have relatively low crime rates, low lstat, and noticeably higher medv.