1. Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.
The null hypotheses associated with Table 3.4 are that the advertising budgets of "TV", "radio" and "newspaper" have no effect on sales; more precisely, $H_0^{(1)} : \beta_1 = 0$, $H_0^{(2)} : \beta_2 = 0$ and $H_0^{(3)} : \beta_3 = 0$. The p-values for "TV" and "radio" are highly significant while the p-value for "newspaper" is not, so we reject $H_0^{(1)}$ and $H_0^{(2)}$ and do not reject $H_0^{(3)}$. We may conclude that spending on TV and radio advertising is associated with increased sales, while the newspaper advertising budget does not appear to affect sales once the TV and radio budgets are accounted for.
More on p-values: a p-value is the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one actually observed; a small p-value therefore indicates that the observed association would be unlikely to arise by chance alone.
2. Carefully explain the differences between the KNN classifier and KNN regression methods.
The KNN classifier is typically used to solve classification problems (those with a qualitative response) by identifying the neighborhood of $x_0$ and then estimating the conditional probability $P(Y = j | X = x_0)$ for class $j$ as the fraction of points in the neighborhood whose response values equal $j$. The KNN regression method is used to solve regression problems (those with a quantitative response) by again identifying the neighborhood of $x_0$ and then estimating $f(x_0)$ as the average of all the training responses in the neighborhood.
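The contrast is easy to see side by side; below is a minimal sketch on simulated data, assuming the class and FNN packages (which provide knn() and knn.reg()) are installed:
set.seed(1)
x.train = matrix(rnorm(100), ncol = 2)               # 50 training points in 2D
y.class = factor(x.train[, 1] + x.train[, 2] > 0)    # qualitative response
y.num   = x.train[, 1] + rnorm(50)                   # quantitative response
x.test  = matrix(rnorm(10), ncol = 2)                # 5 test points
# Classifier: predicted class is the majority vote among the K nearest neighbours
class::knn(train = x.train, test = x.test, cl = y.class, k = 3)
# Regression: predicted value is the average response among the K nearest neighbours
FNN::knn.reg(train = x.train, test = x.test, y = y.num, k = 3)$pred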
3. Suppose we have a data set with five predictors, $X_1 = \mathrm{GPA}$, $X_2 = \mathrm{IQ}$, $X_3 = \mathrm{Gender}$ (1 for Female and 0 for Male), $X_4 = \mathrm{Interaction\ between\ GPA\ and\ IQ}$, and $X_5 = \mathrm{Interaction\ between\ GPA\ and\ Gender}$. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get $\hat{\beta}_0 = 50$, $\hat{\beta}_1 = 20$, $\hat{\beta}_2 = 0.07$, $\hat{\beta}_3 = 35$, $\hat{\beta}_4 = 0.01$, $\hat{\beta}_5 = -10$.
(a) Which answer is correct, and why?
The least squares line is given by $\hat{y} = 50 + 20GPA + 0.07IQ + 35Gender + 0.01GPA\times IQ - 10GPA\times Gender$, which becomes $\hat{y} = 50 + 20GPA + 0.07IQ + 0.01GPA\times IQ$ for males and $\hat{y} = 85 + 10GPA + 0.07IQ + 0.01GPA\times IQ$ for females. So, for fixed values of IQ and GPA, the starting salary for males is higher than for females on average if and only if $50 + 20GPA\ge 85 + 10GPA$, which is equivalent to $GPA\ge 3.5$. Therefore iii. is the correct answer.
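As a quick sanity check, here is a short sketch; the salary() function below simply encodes the fitted coefficients from the exercise:
# Salary difference (male minus female) as a function of GPA, at fixed IQ
salary = function(gpa, iq, gender)
  50 + 20*gpa + 0.07*iq + 35*gender + 0.01*gpa*iq - 10*gpa*gender
gpa = seq(2, 4.5, by = 0.5)
salary(gpa, iq = 100, gender = 0) - salary(gpa, iq = 100, gender = 1)
# Negative below GPA = 3.5, zero at 3.5, positive above it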
(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0.
It suffices to plug the given values into the least squares line for females given above: $\hat{y} = 85 + 10\times 4.0 + 0.07\times 110 + 0.01\times 4.0\times 110 = 85 + 40 + 7.7 + 4.4 = 137.1$, which gives a predicted starting salary of \$137,100.
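The same arithmetic as a one-line check in R, using the full fitted equation with Gender = 1:
50 + 20*4 + 0.07*110 + 35*1 + 0.01*4*110 - 10*4*1  # 137.1, i.e. $137,100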
(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.
False. To verify whether the GPA/IQ interaction has an impact on the quality of the model we need to test the hypothesis $H_0 : \beta_4 = 0$ and look at the p-value associated with the $t$ or the $F$ statistic to draw a conclusion: the size of the coefficient alone says nothing about the strength of the evidence, since it depends on the scale of the predictors.
4. I collect a set of data ($n = 100$ observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \varepsilon$ .
(a) Suppose that the true relationship between $X$ and $Y$ is linear, i.e. $Y = \beta_0 + \beta_1 X + \varepsilon$. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
The cubic regression contains the linear model as a special case ($\beta_2 = \beta_3 = 0$), so least squares over the cubic model can never produce a larger training RSS: the extra flexibility allows the cubic fit to follow the training noise more closely. We would therefore expect the training RSS for the cubic regression to be lower than (or at worst equal to) that of the linear regression, even though the true relationship is linear.
(b) Answer (a) using test rather than training RSS.
With test RSS the situation reverses: since the true relationship is linear, the extra cubic terms mainly fit noise in the training data (overfitting), so we would expect the test RSS for the linear regression to be lower than for the cubic regression.
(c) Suppose that the true relationship between X and Y is not linear, but we don’t know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
For training RSS the answer does not depend on the true relationship: the cubic model nests the linear one, so the cubic regression will again have the lower (or at worst equal) training RSS.
(d) Answer (c) using test rather than training RSS.
For test RSS there is not enough information to tell. If the true relationship is close to linear, the linear regression may yield a lower test RSS; if it is strongly non-linear, the cubic regression may be the better option.
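A small simulation illustrating (a)-(d); this is a sketch with an arbitrary linear truth, not part of the original exercise:
set.seed(1)
x = rnorm(100); y = 1 + 2 * x + rnorm(100)                  # linear truth
x.test = rnorm(100); y.test = 1 + 2 * x.test + rnorm(100)
fit1 = lm(y ~ x)                                            # linear fit
fit3 = lm(y ~ poly(x, 3))                                   # cubic fit
c(train.lin = sum(resid(fit1)^2), train.cub = sum(resid(fit3)^2))  # cubic never higher
c(test.lin = sum((y.test - predict(fit1, data.frame(x = x.test)))^2),
  test.cub = sum((y.test - predict(fit3, data.frame(x = x.test)))^2))  # linear typically lower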
5. Consider the fitted values that result from performing linear regression without an intercept. In this setting, the i-th fitted value takes the form $\hat{y_i} = x_i\hat{\beta}$, where $\hat{\beta} = \frac{\sum_{i=1}^n x_iy_i}{\sum_{k=1}^nx_k^2}$. Show that we can write $\hat{y}_i = \sum_{j=1}^na_jy_j$. What is $a_j$ ?
We have immediately that $\hat{y}_i = x_i\frac{\sum_{j=1}^n x_jy_j}{\sum_{k=1}^nx_k^2} = \sum_{j=1}^n\frac{x_ix_j}{\sum_{k=1}^nx_k^2}y_j = \sum_{j=1}^na_jy_j$, so $a_j = \frac{x_ix_j}{\sum_{k=1}^nx_k^2}$: the fitted value is a linear combination of the observed responses.
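A quick numerical check of this identity on simulated data, for an arbitrary index $i = 3$:
set.seed(1)
x = rnorm(10); y = rnorm(10)
beta = sum(x * y) / sum(x^2)     # the no-intercept least squares estimate
i = 3
a = x[i] * x / sum(x^2)          # the weights a_j for this i
c(x[i] * beta, sum(a * y))       # the two values agree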
6. Using (3.4), argue that in the case of simple linear regression, the least squares line always passes through the point $(\overline{x},\overline{y})$.
The least squares line equation is $y = \hat{\beta}_0 + \hat{\beta}_1x$ with, by (3.4), $\hat{\beta}_0 = \overline{y} - \hat{\beta}_1\overline{x}$. Substituting $\overline{x}$ for $x$ we obtain $y = \hat{\beta}_0 + \hat{\beta}_1\overline{x} = \overline{y} - \hat{\beta}_1\overline{x} + \hat{\beta}_1\overline{x} = \overline{y}$. We may conclude that the least squares line passes through the point $(\overline{x},\overline{y})$.
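A numerical illustration on simulated data (any data set would do):
set.seed(1)
x = rnorm(20); y = 1 + 2 * x + rnorm(20)
fit = lm(y ~ x)
predict(fit, data.frame(x = mean(x))) - mean(y)  # zero up to floating-point error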
7. It is claimed in the text that in the case of simple linear regression of $Y$ onto $X$, the $R^2$ statistic (3.17) is equal to the square of the correlation between $X$ and $Y$ (3.18). Prove that this is the case. For simplicity, you may assume that $\overline{x} = \overline{y} = 0$.
We have $R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_i(y_i - \hat{y}_i)^2}{\sum_jy_j^2}$ with $\hat{y}_i = \hat{\beta}_1x_i$ and $\hat{\beta}_1 = \sum_jx_jy_j/\sum_jx_j^2$ (using $\overline{x} = \overline{y} = 0$). Expanding the residual sum of squares, $\sum_i(y_i - \hat{\beta}_1x_i)^2 = \sum_iy_i^2 - 2\hat{\beta}_1\sum_ix_iy_i + \hat{\beta}_1^2\sum_ix_i^2 = \sum_iy_i^2 - \frac{(\sum_ix_iy_i)^2}{\sum_jx_j^2}$, and finally $R^2 = \frac{(\sum_ix_iy_i)^2}{\sum_jx_j^2\sum_jy_j^2} = Cor(X, Y)^2$.
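The identity is easy to confirm numerically; the centring step below enforces the assumption $\overline{x} = \overline{y} = 0$ (although in simple linear regression with an intercept the identity holds regardless):
set.seed(1)
x = rnorm(50); y = 3 * x + rnorm(50)
x = x - mean(x); y = y - mean(y)                # enforce zero means
c(summary(lm(y ~ x))$r.squared, cor(x, y)^2)    # the two values coincide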
8. This question involves the use of simple linear regression on the "Auto" data set.
(a) Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output. For example :
Auto = read.csv("../../../data/Auto.csv", header = T, na.strings = "?")
Auto = na.omit(Auto)
summary(Auto)
      mpg          cylinders      displacement    horsepower        weight
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613
 1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225
 Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804
 Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615
 Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140
  acceleration        year           origin                    name
 Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5
 1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5
 Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5
 Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4
 3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4
 Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4
                                                 (Other)           :365
lm.fit1 = lm(mpg ~ horsepower, data = Auto)
summary(lm.fit1)
Call:
lm(formula = mpg ~ horsepower, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max
-13.5710  -3.2592  -0.3435   2.7630  16.9240

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.935861   0.717499   55.66   <2e-16 ***
horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.906 on 390 degrees of freedom
Multiple R-squared:  0.6059,  Adjusted R-squared:  0.6049
F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
There is a relationship between horsepower and mpg, as determined by testing the null hypothesis that all regression coefficients are zero ($H_0 : \beta_i = 0\ \forall i$). Since the F-statistic is far larger than 1 and its p-value is close to zero, we can reject the null hypothesis and state that there is a statistically significant relationship between horsepower and mpg.
We may note that, as $R^2$ is equal to 0.6059, about 60.6% of the variability in "mpg" can be explained using "horsepower".
As the coefficient of "horsepower" is negative, the relationship is also negative: the linear regression indicates that the more horsepower an automobile has, the lower its fuel efficiency (mpg) will be.
predict(lm.fit1, data.frame(horsepower=c(98)), interval="confidence")
| | fit | lwr | upr |
|---|---|---|---|
1 | 24.46708 | 23.97308 | 24.96108 |
predict(lm.fit1, data.frame(horsepower=c(98)), interval="prediction")
| | fit | lwr | upr |
|---|---|---|---|
1 | 24.46708 | 14.8094 | 34.12476 |
(b) Plot the response and the predictor. Use the abline() function to display the least squares regression line.
plot(Auto$horsepower, Auto$mpg, main = "Scatterplot of mpg vs. horsepower", xlab = "horsepower", ylab = "mpg", col = "blue")
abline(lm.fit1, col = "red")
(c) Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
par(mfrow = c(2, 2))
plot(lm.fit1)
The plot of residuals versus fitted values indicates the presence of non-linearity in the data. The plot of standardized residuals versus leverage indicates the presence of a few possible outliers (standardized residuals higher than 2 or lower than -2) and a few high-leverage points.
9. This question involves the use of multiple linear regression on the Auto data set.
(a) Produce a scatterplot matrix which includes all the variables in the data set.
pairs(Auto)
(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
cor(subset(Auto, select = -name))
| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin |
|---|---|---|---|---|---|---|---|---|
mpg | 1.0000000 | -0.7776175 | -0.8051269 | -0.7784268 | -0.8322442 | 0.4233285 | 0.5805410 | 0.5652088 |
cylinders | -0.7776175 | 1.0000000 | 0.9508233 | 0.8429834 | 0.8975273 | -0.5046834 | -0.3456474 | -0.5689316 |
displacement | -0.8051269 | 0.9508233 | 1.0000000 | 0.8972570 | 0.9329944 | -0.5438005 | -0.3698552 | -0.6145351 |
horsepower | -0.7784268 | 0.8429834 | 0.8972570 | 1.0000000 | 0.8645377 | -0.6891955 | -0.4163615 | -0.4551715 |
weight | -0.8322442 | 0.8975273 | 0.9329944 | 0.8645377 | 1.0000000 | -0.4168392 | -0.3091199 | -0.5850054 |
acceleration | 0.4233285 | -0.5046834 | -0.5438005 | -0.6891955 | -0.4168392 | 1.0000000 | 0.2903161 | 0.2127458 |
year | 0.5805410 | -0.3456474 | -0.3698552 | -0.4163615 | -0.3091199 | 0.2903161 | 1.0000000 | 0.1815277 |
origin | 0.5652088 | -0.5689316 | -0.6145351 | -0.4551715 | -0.5850054 | 0.2127458 | 0.1815277 | 1.0000000 |
(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance :
lm.fit2 = lm(mpg ~ . - name, data = Auto)
summary(lm.fit2)
Call:
lm(formula = mpg ~ . - name, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max
-9.5903 -2.1565 -0.1169  1.8690 13.0604

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
cylinders     -0.493376   0.323282  -1.526  0.12780
displacement   0.019896   0.007515   2.647  0.00844 **
horsepower    -0.016951   0.013787  -1.230  0.21963
weight        -0.006474   0.000652  -9.929  < 2e-16 ***
acceleration   0.080576   0.098845   0.815  0.41548
year           0.750773   0.050973  14.729  < 2e-16 ***
origin         1.426141   0.278136   5.127 4.67e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared:  0.8215,  Adjusted R-squared:  0.8182
F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
Yes, there is a relationship between the predictors and the response, as shown by testing the null hypothesis that all the regression coefficients are zero. The F-statistic is far from 1 (with a small p-value), indicating evidence against the null hypothesis.
We can answer this question by checking the p-values associated with each predictor's t-statistic.
We can see that displacement, weight, year, and origin have a statistically significant relationship, while cylinders, horsepower, and acceleration do not.
The coefficient of the "year" variable suggests that the average effect of an increase of 1 year is an increase of 0.7508 in "mpg" (all other predictors remaining constant).
(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Do the leverage plots identify any observations with unusually high leverage?
par(mfrow = c(2, 2))
plot(lm.fit2)
As before, the plot of residuals versus fitted values indicates the presence of mild non-linearity in the data. The plot of standardized residuals versus leverage indicates the presence of a few possible outliers (standardized residuals higher than 2 or lower than -2) and one high-leverage point.
(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant ?
lm.fit3 = lm(mpg ~ .*., data = Auto[, 1:8])
summary(lm.fit3)
Call:
lm(formula = mpg ~ . * ., data = Auto[, 1:8])

Residuals:
    Min      1Q  Median      3Q     Max
-7.6303 -1.4481  0.0596  1.2739 11.1386

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                3.548e+01  5.314e+01   0.668  0.50475
cylinders                  6.989e+00  8.248e+00   0.847  0.39738
displacement              -4.785e-01  1.894e-01  -2.527  0.01192 *
horsepower                 5.034e-01  3.470e-01   1.451  0.14769
weight                     4.133e-03  1.759e-02   0.235  0.81442
acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
year                       6.974e-01  6.097e-01   1.144  0.25340
origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051
cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157
cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000
cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 .
cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 .
cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482
displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867
displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 .
displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853
displacement:year          5.934e-03  2.391e-03   2.482  0.01352 *
displacement:origin        2.398e-02  1.947e-02   1.232  0.21875
horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124
horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 .
horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916
horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931
weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596
weight:year               -2.245e-04  2.127e-04  -1.056  0.29182
weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623
acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 *
acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
year:origin                1.393e-01  7.399e-02   1.882  0.06062 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.695 on 363 degrees of freedom
Multiple R-squared:  0.8893,  Adjusted R-squared:  0.8808
F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16
The interactions acceleration:origin, acceleration:year and displacement:year appear to be statistically significant at the 5% level.
(f) Try a few different transformations of the variables, such as $\log{X}$, $\sqrt{X}$, $X^2$. Comment on your findings.
par(mfrow = c(2, 2))
plot(log(Auto$horsepower), Auto$mpg)
plot(sqrt(Auto$horsepower), Auto$mpg)
plot((Auto$horsepower)^2, Auto$mpg)
We limit ourselves to examining "horsepower" as the sole predictor. The log transformation appears to give the most linear-looking plot.
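To go beyond visual inspection, one can also compare the $R^2$ of the three corresponding fits; a quick sketch (values not reproduced here):
summary(lm(mpg ~ log(horsepower), data = Auto))$r.squared
summary(lm(mpg ~ sqrt(horsepower), data = Auto))$r.squared
summary(lm(mpg ~ I(horsepower^2), data = Auto))$r.squared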
10. This question should be answered using the Carseats data set.
(a) Fit a multiple regression model to predict Sales using Price, Urban and US.
Carseats = read.csv("../../../data/Carseats.csv", header=T, na.strings="?")
summary(Carseats)
     Sales          CompPrice       Income        Advertising
 Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000
 1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000
 Median : 7.490   Median :125   Median : 69.00   Median : 5.000
 Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635
 3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000
 Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000
   Population        Price        ShelveLoc        Age          Education
 Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0
 1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75   1st Qu.:12.0
 Median :272.0   Median :117.0   Medium:219   Median :54.50   Median :14.0
 Mean   :264.8   Mean   :115.8                Mean   :53.32   Mean   :13.9
 3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0
 Max.   :509.0   Max.   :191.0                Max.   :80.00   Max.   :18.0
 Urban       US
 No :118   No :142
 Yes:282   Yes:258
attach(Carseats)
lm.fit4 = lm(Sales ~ Price + Urban + US)
summary(lm.fit4)
Call:
lm(formula = Sales ~ Price + Urban + US)

Residuals:
    Min      1Q  Median      3Q     Max
-6.9206 -1.6220 -0.0564  1.5786  7.0581

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
Price       -0.054459   0.005242 -10.389  < 2e-16 ***
UrbanYes    -0.021916   0.271650  -0.081    0.936
USYes        1.200573   0.259042   4.635 4.86e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared:  0.2393,  Adjusted R-squared:  0.2335
F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
(b) Provide an interpretation of each coefficient in the model. Be careful - some of the variables in the model are qualitative!
Since "Sales" is measured in thousands of units, the coefficient of the "Price" variable may be interpreted by saying that the average effect of a price increase of one dollar is a decrease of about 54.46 units sold, all other predictors remaining fixed. The coefficient of the "Urban" variable may be interpreted by saying that, on average, unit sales in urban locations are about 21.92 units lower than in rural locations, all other predictors remaining fixed. The coefficient of the "US" variable may be interpreted by saying that, on average, a US store sells about 1200.57 more units than a non-US store, all other predictors remaining fixed.
(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
The model may be written as $Sales = 13.0435 + (-0.0545)\times Price + (-0.0219)\times Urban + (1.2006)\times US + \varepsilon$ with $Urban = 1$ if the store is in an urban location and $0$ if not, and $US = 1$ if the store is in the US and $0$ if not.
(d) For which of the predictors can you reject the null hypothesis $H_0 : \beta_j = 0$?
Based on the p-values of the individual t-statistics, we can reject the null hypothesis for the "Price" and "US" variables, but not for "Urban".
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
lm.fit5 = lm(Sales ~ Price + US)
summary(lm.fit5)
Call:
lm(formula = Sales ~ Price + US)

Residuals:
    Min      1Q  Median      3Q     Max
-6.9269 -1.6286 -0.0574  1.5766  7.0515

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
Price       -0.05448    0.00523 -10.416  < 2e-16 ***
USYes        1.19964    0.25846   4.641 4.71e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared:  0.2393,  Adjusted R-squared:  0.2354
F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
(f) How well do the models in (a) and (e) fit the data?
The $R^2$ for the smaller model is essentially the same as for the bigger model: in both cases about 23.9% of the variability in "Sales" is explained.
Based on the RSE and $R^2$ of the two regressions, they fit the data similarly, with the model from (e) fitting slightly better (it has a marginally lower RSE and a higher adjusted $R^2$).
(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(lm.fit5)
| | 2.5 % | 97.5 % |
|---|---|---|
(Intercept) | 11.79032020 | 14.27126531 |
Price | -0.06475984 | -0.04419543 |
USYes | 0.69151957 | 1.70776632 |
(h) Is there evidence of outliers or high leverage observations in the model from (e) ?
plot(predict(lm.fit5), rstudent(lm.fit5))
All studentized residuals appear to be bounded by -3 and 3, so no potential outliers are suggested by the linear regression.
par(mfrow = c(2, 2))
plot(lm.fit5)
The plot of standardized residuals versus leverage indicates the presence of a few possible outliers (standardized residuals higher than 2 or lower than -2) and one high-leverage point.
A few observations greatly exceed the average leverage $(p + 1)/n = 3/400 \approx 0.0075$ on the leverage-statistic plot, which suggests that the corresponding points have high leverage.
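A direct way to flag such points is to compare each leverage statistic with the average leverage; the 3x cutoff below is a common rule of thumb, not a fixed standard:
mean(hatvalues(lm.fit5))                                   # equals (p + 1)/n
which(hatvalues(lm.fit5) > 3 * mean(hatvalues(lm.fit5)))   # unusually high leverage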
11. In this problem we will investigate the t-statistic for the null hypothesis $H_0 : \beta = 0$ in simple linear regression without an intercept. To begin, we generate a predictor $x$ and a response $y$ as follows.
set.seed(1)
x = rnorm(100)
y = 2 * x + rnorm(100)
(a) Perform a simple linear regression of $y$ onto $x$, without an intercept. Report the coefficient estimate $\hat{\beta}$, the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis $H_0$. Comment on these results.
lm.fit6 = lm(y ~ x + 0)
summary(lm.fit6)
Call:
lm(formula = y ~ x + 0)

Residuals:
    Min      1Q  Median      3Q     Max
-1.9154 -0.6472 -0.1771  0.5056  2.3109

Coefficients:
  Estimate Std. Error t value Pr(>|t|)
x   1.9939     0.1065   18.73   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9586 on 99 degrees of freedom
Multiple R-squared:  0.7798,  Adjusted R-squared:  0.7776
F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16
According to the summary above, we have a value of 1.9939 for $\hat{\beta}$, a value of 0.1065 for the standard error, a value of 18.7259 for the t-statistic and a value of $2.6422 \times 10^{-34}$ for the p-value. The small p-value allows us to reject $H_0$.
(b) Now perform a simple linear regression of $x$ onto $y$, without an intercept. Report the coefficient estimate $\hat{\beta}$, the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis $H_0$. Comment on these results.
lm.fit7 = lm(x ~ y + 0)
summary(lm.fit7)
Call:
lm(formula = x ~ y + 0)

Residuals:
    Min      1Q  Median      3Q     Max
-0.8699 -0.2368  0.1030  0.2858  0.8938

Coefficients:
  Estimate Std. Error t value Pr(>|t|)
y  0.39111    0.02089   18.73   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4246 on 99 degrees of freedom
Multiple R-squared:  0.7798,  Adjusted R-squared:  0.7776
F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16
According to the summary above, we have a value of 0.3911 for $\hat{\beta}$, a value of 0.0209 for the standard error, a value of 18.7259 for the t-statistic and a value of $2.6422 \times 10^{-34}$ for the p-value. The small p-value allows us to reject $H_0$.
(c) What is the relationship between the results obtained in (a) and (b)?
We obtain the same value for the t-statistic and consequently the same value for the corresponding p-value.
Both results reflect the same underlying line fit in (a): $y = 2x + \epsilon$ can equivalently be written as $x = 0.5(y - \epsilon)$, so the two regressions simply describe the same relationship from opposite directions.
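A related identity worth checking: the product of the two no-intercept slope estimates equals $(\sum_i x_iy_i)^2/(\sum_i x_i^2 \sum_i y_i^2)$, which is exactly the common $R^2$ reported for both fits:
coef(lm.fit6) * coef(lm.fit7)  # approximately 0.7798, the shared R-squared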
(d) For the regression of $Y$ onto $X$ without an intercept, the t-statistic for $H_0 : \beta = 0$ takes the form $\hat{\beta}/SE(\hat{\beta})$, where $\hat{\beta}$ is given by (3.38), and where $SE(\hat{\beta}) = \sqrt{\frac{\sum_{i=1}^n(y_i - x_i\hat{\beta})^2}{(n - 1)\sum_{i=1}^nx_i^2}}$.
(These formulas are slightly different from those given in Sections 3.1.1 and 3.1.2, since here we are performing regression without an intercept.) Show algebraically, and confirm numerically in R, that the t-statistic can be written as
$\frac{\sqrt{n - 1}\sum_{i=1}^nx_iy_i}{\sqrt{(\sum_{i=1}^nx_i^2)(\sum_{i=1}^ny_i^2) - (\sum_{i=1}^nx_iy_i)^2}}$.
We have
$t = \frac{\sum_ix_iy_i/\sum_jx_j^2}{\sqrt{\sum_i(y_i - x_i\hat{\beta})^2/(n - 1)\sum_jx_j^2}} = \frac{\sqrt{n - 1}\sum_ix_iy_i}{\sqrt{\sum_jx_j^2\sum_i(y_i - x_i\sum_jx_jy_j/\sum_jx_j^2)^2}}$. Expanding the square in the denominator gives $\sum_jx_j^2\sum_i(y_i - x_i\hat{\beta})^2 = \sum_jx_j^2\sum_iy_i^2 - 2(\sum_ix_iy_i)^2 + (\sum_ix_iy_i)^2 = (\sum_jx_j^2)(\sum_jy_j^2) - (\sum_jx_jy_j)^2$, so that $t = \frac{\sqrt{n - 1}\sum_ix_iy_i}{\sqrt{(\sum_jx_j^2)(\sum_jy_j^2) - (\sum_jx_jy_j)^2}}$.
Now let's verify this result numerically.
n = length(x)
t = sqrt(n - 1)*(x %*% y)/sqrt(sum(x^2) * sum(y^2) - (x %*% y)^2)
as.numeric(t)
We may see that the value of t computed above is exactly the t-statistic reported in the summaries of lm.fit6 and lm.fit7.
(e) Using the results from (d), argue that the t-statistic for the regression of $y$ onto $x$ is the same t-statistic for the regression of $x$ onto $y$.
The formula for the t-statistic in (d) is symmetric in $x$ and $y$: replacing every $x_i$ by $y_i$ and vice versa leaves it unchanged. Hence $t(x, y) = t(y, x)$, i.e. the t-statistic for the regression of $y$ onto $x$ is the same as that for the regression of $x$ onto $y$.
(f) In R, show that when regression is performed with an intercept, the t-statistic for $H_0 : \beta_1 = 0$ is the same for the regression of $y$ onto $x$ as it is for the regression of $x$ onto $y$.
lm.fit8 = lm(y ~ x)
summary(lm.fit8)
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-1.8768 -0.6138 -0.1395  0.5394  2.3462

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03769    0.09699  -0.389    0.698
x            1.99894    0.10773  18.556   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9628 on 98 degrees of freedom
Multiple R-squared:  0.7784,  Adjusted R-squared:  0.7762
F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16
lm.fit9 = lm(x ~ y)
summary(lm.fit9)
Call:
lm(formula = x ~ y)

Residuals:
     Min       1Q   Median       3Q      Max
-0.90848 -0.28101  0.06274  0.24570  0.85736

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.03880    0.04266    0.91    0.365
y            0.38942    0.02099   18.56   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4249 on 98 degrees of freedom
Multiple R-squared:  0.7784,  Adjusted R-squared:  0.7762
F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16
It is again easy to see that the t-statistics for both fits are equal, namely 18.556.
12. This problem involves simple linear regression without an intercept.
(a) Recall that the coefficient estimate $\hat{\beta}$ for the linear regression of $Y$ onto $X$ without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of $X$ onto $Y$ the same as the coefficient estimate for the regression of $Y$ onto $X$ ?
The coefficient estimate for the regression of $Y$ onto $X$ is $\hat{\beta} = \frac{\sum_ix_iy_i}{\sum_jx_j^2}$;
The coefficient estimate for the regression of $X$ onto $Y$ is $\hat{\beta}' = \frac{\sum_ix_iy_i}{\sum_jy_j^2}$.
The coefficients are the same iff $\sum_jx_j^2 = \sum_jy_j^2$.
(b) Generate an example in R with $n = 100$ observations in which the coefficient estimate for the regression of $X$ onto $Y$ is different from the coefficient estimate for the regression of $Y$ onto $X$.
set.seed(1)
x = 1:100
y = 2 * x + rnorm(100, sd = 0.1)
lm.fit10 = lm(y ~ x + 0)
summary(lm.fit10)
Call:
lm(formula = y ~ x + 0)

Residuals:
      Min        1Q    Median        3Q       Max
-0.223590 -0.062560  0.004426  0.058507  0.230926

Coefficients:
   Estimate Std. Error t value Pr(>|t|)
x 2.0001514  0.0001548   12920   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09005 on 99 degrees of freedom
Multiple R-squared:  1,  Adjusted R-squared:  1
F-statistic: 1.669e+08 on 1 and 99 DF,  p-value: < 2.2e-16
lm.fit11 = lm(x ~ y + 0)
summary(lm.fit11)
Call:
lm(formula = x ~ y + 0)

Residuals:
      Min        1Q    Median        3Q       Max
-0.115418 -0.029231 -0.002186  0.031322  0.111795

Coefficients:
  Estimate Std. Error t value Pr(>|t|)
y 5.00e-01   3.87e-05   12920   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.04502 on 99 degrees of freedom
Multiple R-squared:  1,  Adjusted R-squared:  1
F-statistic: 1.669e+08 on 1 and 99 DF,  p-value: < 2.2e-16
(c) Generate an example in R with $n = 100$ observations in which the coefficient estimate for the regression of $X$ onto $Y$ is the same as the coefficient estimate for the regression of $Y$ onto $X$.
x = 1:100
y = 100:1
lm.fit12 = lm(y ~ x + 0)
summary(lm.fit12)
Call:
lm(formula = y ~ x + 0)

Residuals:
   Min     1Q Median     3Q    Max
-49.75 -12.44  24.87  62.18  99.49

Coefficients:
  Estimate Std. Error t value Pr(>|t|)
x   0.5075     0.0866    5.86 6.09e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 50.37 on 99 degrees of freedom
Multiple R-squared:  0.2575,  Adjusted R-squared:  0.25
F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08
lm.fit13 = lm(x ~ y + 0)
summary(lm.fit13)
Call:
lm(formula = x ~ y + 0)

Residuals:
   Min     1Q Median     3Q    Max
-49.75 -12.44  24.87  62.18  99.49

Coefficients:
  Estimate Std. Error t value Pr(>|t|)
y   0.5075     0.0866    5.86 6.09e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 50.37 on 99 degrees of freedom
Multiple R-squared:  0.2575,  Adjusted R-squared:  0.25
F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08
13. In this exercise you will create some simulated data and will fit simple linear regression models to it. Make sure to use set.seed(1) prior to starting part (a) to ensure consistent results.
(a) Using the rnorm() function, create a vector, x, containing 100 observations drawn from a $N(0,1)$ distribution. This represents a feature, $X$.
set.seed(1)
x = rnorm(100)
(b) Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a $N(0, 0.25)$ distribution.
eps = rnorm(100, sd = 0.25)
(c) Using x and eps, generate a vector y according to the model $Y = -1 + 0.5X + \varepsilon$. What is the length of the vector y ? What are the values of $\beta_0$ and $\beta_1$ in this linear model ?
y = -1 + 0.5 * x + eps
length(y)
y is of length 100. The values of $\beta_0$ and $\beta_1$ are -1 and 0.5 respectively.
(d) Create a scatterplot displaying the relationship between x and y. Comment on what you observe.
plot(x, y)
The relationship between "x" and "y" looks linear with some noise introduced by the "eps" variable.
(e) Fit a least squares linear model to predict y using x. Comment on the model obtained. How do $\hat{\beta}_0$ and $\hat{\beta}_1$ compare to $\beta_0$ and $\beta_1$ ?
lm.fit14 = lm(y ~ x)
summary(lm.fit14)
Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max
-0.46921 -0.15344 -0.03487  0.13485  0.58654

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.00942    0.02425  -41.63   <2e-16 ***
x            0.49973    0.02693   18.56   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2407 on 98 degrees of freedom
Multiple R-squared:  0.7784,  Adjusted R-squared:  0.7762
F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16
The linear regression fits a model whose coefficients are close to the true values used to generate the data. The model has a large F-statistic with a near-zero p-value, so the null hypothesis can be rejected.
The values of $\hat{\beta}_0$ and $\hat{\beta}_1$ are quite close to $\beta_0 = -1$ and $\beta_1 = 0.5$.
(f) Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() function to create an appropriate legend.
plot(x, y)
abline(lm.fit14, col = "red")
abline(-1, 0.5, col = "blue")
legend("topleft", c("Least square", "Regression"), col = c("red", "blue"), lty = c(1, 1))
(g) Now fit a polynomial regression model that predicts y using x and x^2. Is there evidence that the quadratic term improves the model fit ? Explain your answer.
lm.fit15 = lm(y ~ x + I(x^2))
summary(lm.fit15)
Call:
lm(formula = y ~ x + I(x^2))

Residuals:
    Min      1Q  Median      3Q     Max
-0.4913 -0.1563 -0.0322  0.1451  0.5675

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.98582    0.02941 -33.516   <2e-16 ***
x            0.50429    0.02700  18.680   <2e-16 ***
I(x^2)      -0.02973    0.02119  -1.403    0.164
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2395 on 97 degrees of freedom
Multiple R-squared:  0.7828,  Adjusted R-squared:  0.7784
F-statistic: 174.8 on 2 and 97 DF,  p-value: < 2.2e-16
The coefficient for "x^2" is not significant as its p-value is higher than 0.05. So there is not sufficient evidence that the quadratic term improves the model fit even though the $R^2$ is slightly higher than the linear model.
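An equivalent check is an F-test comparing the two nested models; for a single added term this reproduces the t-test on the quadratic coefficient (the F-statistic is the square of the t-statistic):
anova(lm.fit14, lm.fit15)  # its p-value matches the t-test on I(x^2)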
(h) Repeat (a)-(f) after modifying the data generation process in such a way that there is less noise in the data. The initial model should remain the same. Describe your results.
set.seed(1)
x = rnorm(100)
eps = rnorm(100, sd = 0.0025)
y = -1 + 0.5 * x + eps
plot(x, y)
lm.fit16 = lm(y ~ x)
summary(lm.fit16)
abline(lm.fit16, col = "red")
abline(-1, 0.5, col = "blue")
legend("topleft", c("Least square", "Regression"), col = c("red", "blue"), lty = c(1, 1))
Call:
lm(formula = y ~ x)

Residuals:
       Min         1Q     Median         3Q        Max
-0.0046921 -0.0015344 -0.0003487  0.0013485  0.0058654

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.0000942  0.0002425   -4125   <2e-16 ***
x            0.4999973  0.0002693    1857   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.002407 on 98 degrees of freedom
Multiple R-squared:  1,  Adjusted R-squared:  1
F-statistic: 3.447e+06 on 1 and 98 DF,  p-value: < 2.2e-16
We reduced the noise by decreasing the variance of the normal distribution used to generate the error term $\varepsilon$. We may see that the coefficients are very close to the previous ones, but now, as the relationship is nearly linear, we have a much higher $R^2$ namely 1. Moreover, the two lines overlap each other as we have very little noise.
(i) Repeat (a)-(f) after modifying the data generation process in such a way that there is more noise in the data. The initial model should remain the same. Describe your results.
set.seed(1)
x = rnorm(100)
eps = rnorm(100, sd = 2.5)
y = -1 + 0.5 * x + eps
plot(x, y)
lm.fit17 = lm(y ~ x)
summary(lm.fit17)
abline(lm.fit17, col = "red")
abline(-1, 0.5, col = "blue")
legend("topleft", c("Least square", "Regression"), col = c("red", "blue"), lty = c(1, 1))
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-4.6921 -1.5344 -0.3487  1.3485  5.8654

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.0942     0.2425  -4.513 1.78e-05 ***
x             0.4973     0.2693   1.847   0.0678 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.407 on 98 degrees of freedom
Multiple R-squared:  0.03363,  Adjusted R-squared:  0.02377
F-statistic: 3.41 on 1 and 98 DF,  p-value: 0.06781
We increased the noise by increasing the variance of the normal distribution used to generate the error term $\varepsilon$. We may see that the coefficients are again very close to the previous ones, but now, as the relationship is not quite linear, we have a much lower $R^2$. Moreover, the two lines are wider apart but are still really close to each other as we have a fairly large data set.
(j) What are the confidence intervals for $\beta_0$ and $\beta_1$ based on the original data set, the noisier data set, and the less noisy data set ? Comment on your results.
confint(lm.fit14)
| | 2.5 % | 97.5 % |
|---|---|---|
(Intercept) | -1.0575402 | -0.9613061 |
x | 0.4462897 | 0.5531801 |
confint(lm.fit16)
| | 2.5 % | 97.5 % |
|---|---|---|
(Intercept) | -1.0005754 | -0.9996131 |
x | 0.4994629 | 0.5005318 |
confint(lm.fit17)
| | 2.5 % | 97.5 % |
|---|---|---|
(Intercept) | -1.57540186 | -0.6130612 |
x | -0.03710293 | 1.0318010 |
As the noise increases, the confidence intervals widen. With less noise, there is more predictability in the data set.
14. This problem focuses on the collinearity problem.
(a) Perform the following commands in R.
set.seed(1)
x1 = runif(100)
x2 = 0.5 * x1 + rnorm(100)/10
y = 2 + 2*x1 + 0.3*x2 + rnorm(100)
The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?
The form of the linear model is $Y = 2 + 2X_1 +0.3X_2 + \varepsilon$ with $\varepsilon$ a $N(0,1)$ random variable. The regression coefficients are respectively 2, 2 and 0.3.
(b) What is the correlation between x1 and x2 ? Create a scatterplot displaying the relationship between the variables.
cor(x1, x2)
plot(x1, x2)
The variables seem highly correlated.
(c) Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\beta}_2$ ? How do these relate to the true $\beta_0$, $\beta_1$ and $\beta_2$ ? Can you reject the null hypothesis $H_0 : \beta_1 = 0$ ? How about the null hypothesis $H_0 : \beta_2 = 0$ ?
lm.fit18 <- lm(y ~ x1 + x2)
summary(lm.fit18)
Call:
lm(formula = y ~ x1 + x2)

Residuals:
    Min      1Q  Median      3Q     Max
-2.8311 -0.7273 -0.0537  0.6338  2.3359

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
x1            1.4396     0.7212   1.996   0.0487 *
x2            1.0097     1.1337   0.891   0.3754
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.056 on 97 degrees of freedom
Multiple R-squared:  0.2088,  Adjusted R-squared:  0.1925
F-statistic: 12.8 on 2 and 97 DF,  p-value: 1.164e-05
The coefficients $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\beta}_2$ are respectively 2.1305, 1.4396 and 1.0097. Only $\hat{\beta}_0$ is close to $\beta_0$. As the p-value is less than 0.05 we may reject $H_0$ for $\beta_1$, however we may not reject $H_0$ for $\beta_2$ as the p-value is higher than 0.05.
(d) Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis $H_0 : \beta_1 = 0$ ?
lm.fit19 <- lm(y ~ x1)
summary(lm.fit19)
Call:
lm(formula = y ~ x1)

Residuals:
     Min       1Q   Median       3Q      Max
-2.89495 -0.66874 -0.07785  0.59221  2.45560

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
x1            1.9759     0.3963   4.986 2.66e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.055 on 98 degrees of freedom
Multiple R-squared:  0.2024,  Adjusted R-squared:  0.1942
F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06
The coefficient for "x1" in this last model is very different from the one with "x1" and "x2" as predictors. In this case "x1" is highly significant as its p-value is very low, so we may reject $H_0$.
(e) Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis $H_0 : \beta_1 = 0$ ?
lm.fit20 <- lm(y ~ x2)
summary(lm.fit20)
Call:
lm(formula = y ~ x2)

Residuals:
     Min       1Q   Median       3Q      Max
-2.62687 -0.75156 -0.03598  0.72383  2.44890

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
x2            2.8996     0.6330    4.58 1.37e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.072 on 98 degrees of freedom
Multiple R-squared:  0.1763,  Adjusted R-squared:  0.1679
F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05
The coefficient for "x2" in this last model is very different from the one with "x1" and "x2" as predictors. In this case "x2" is highly significant as its p-value is very low, so we may again reject $H_0$.
(f) Do the results obtained in (c)-(e) contradict each other ? Explain your answer.
No, the results do not contradict each other. As the predictors "x1" and "x2" are highly correlated, we are in the presence of collinearity, and it can then be difficult to determine how each predictor separately is associated with the response. Since collinearity reduces the accuracy of the estimates of the regression coefficients, it causes the standard errors of $\hat{\beta}_1$ and $\hat{\beta}_2$ to grow (we have standard errors of 0.7212 and 1.1337 for "x1" and "x2" respectively in the model with two predictors, versus only 0.3963 and 0.633 in the models with a single predictor). Consequently, we may fail to reject $H_0$ in the presence of collinearity: the importance of the "x2" variable has been masked.
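A standard diagnostic for this situation is the variance inflation factor; a quick sketch, assuming the car package (not used elsewhere in these answers) is installed:
library(car)    # provides vif()
vif(lm.fit18)   # values well above 5-10 are commonly taken to indicate problematic collinearity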
(g) Now suppose we obtain one additional observation, which was unfortunately mismeasured.
x1 = c(x1, 0.1)
x2 = c(x2, 0.8)
y = c(y, 6)
Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models ? In each model, is this observation an outlier ? A high-leverage point ? Explain your answers.
lm.fit21 = lm(y ~ x1 + x2)
summary(lm.fit21)
Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max
-2.73348 -0.69318 -0.05263  0.66385  2.30619

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.2267     0.2314   9.624 7.91e-16 ***
x1            0.5394     0.5922   0.911  0.36458
x2            2.5146     0.8977   2.801  0.00614 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.075 on 98 degrees of freedom
Multiple R-squared:  0.2188,  Adjusted R-squared:  0.2029
F-statistic: 13.72 on 2 and 98 DF,  p-value: 5.564e-06
lm.fit22 = lm(y ~ x1)
summary(lm.fit22)
Call:
lm(formula = y ~ x1)

Residuals:
    Min      1Q  Median      3Q     Max
-2.8897 -0.6556 -0.0909  0.5682  3.5665

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.2569     0.2390   9.445 1.78e-15 ***
x1            1.7657     0.4124   4.282 4.29e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.111 on 99 degrees of freedom
Multiple R-squared:  0.1562,  Adjusted R-squared:  0.1477
F-statistic: 18.33 on 1 and 99 DF,  p-value: 4.295e-05
lm.fit23 = lm(y ~ x2)
summary(lm.fit23)
Call:
lm(formula = y ~ x2)

Residuals:
     Min       1Q   Median       3Q      Max
-2.64729 -0.71021 -0.06899  0.72699  2.38074

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.3451     0.1912  12.264  < 2e-16 ***
x2            3.1190     0.6040   5.164 1.25e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.074 on 99 degrees of freedom
Multiple R-squared:  0.2122,  Adjusted R-squared:  0.2042
F-statistic: 26.66 on 1 and 99 DF,  p-value: 1.253e-06
par(mfrow=c(2,2))
plot(lm.fit21)
par(mfrow=c(2,2))
plot(lm.fit22)
par(mfrow=c(2,2))
plot(lm.fit23)
In the model with two predictors, the last point is a high-leverage point. In the model with "x1" as sole predictor, the last point is an outlier. In the model with "x2" as sole predictor, the last point is a high-leverage point.
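These claims can be checked directly on the added observation (index 101): its leverage can be compared with the average leverage $(p + 1)/n$, and its studentized residual with the usual ±3 rule of thumb. A sketch for the two-predictor model:
c(leverage = hatvalues(lm.fit21)[101], rstudent = rstudent(lm.fit21)[101])
mean(hatvalues(lm.fit21))   # average leverage, (p + 1)/n = 3/101 here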
15. This problem involves the Boston data set, which we saw in the lab for this chapter. We will now try to predict per capita crime rate using the other variables in this data set. In other words, per capita crime rate is the response, and the other variables are the predictors.
(a) For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response ? Create some plots to back up your assertions.
Boston = read.csv("../../../data/Boston.csv", header=T, na.strings="?")
Boston$chas <- factor(Boston$chas, labels = c("N","Y"))
summary(Boston)
      crim                zn             indus       chas         nox
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   N:471   Min.   :0.3850
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   Y: 35   1st Qu.:0.4490
 Median : 0.25651   Median :  0.00   Median : 9.69           Median :0.5380
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14           Mean   :0.5547
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10           3rd Qu.:0.6240
 Max.   :88.97620   Max.   :100.00   Max.   :27.74           Max.   :0.8710
       rm             age              dis              rad
 Min.   :3.561   Min.   :  2.90   Min.   : 1.130   Min.   : 1.000
 1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100   1st Qu.: 4.000
 Median :6.208   Median : 77.50   Median : 3.207   Median : 5.000
 Mean   :6.285   Mean   : 68.57   Mean   : 3.795   Mean   : 9.549
 3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188   3rd Qu.:24.000
 Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.000
      tax           ptratio          black            lstat
 Min.   :187.0   Min.   :12.60   Min.   :  0.32   Min.   : 1.73
 1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38   1st Qu.: 6.95
 Median :330.0   Median :19.05   Median :391.44   Median :11.36
 Mean   :408.2   Mean   :18.46   Mean   :356.67   Mean   :12.65
 3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23   3rd Qu.:16.95
 Max.   :711.0   Max.   :22.00   Max.   :396.90   Max.   :37.97
      medv
 Min.   : 5.00
 1st Qu.:17.02
 Median :21.20
 Mean   :22.53
 3rd Qu.:25.00
 Max.   :50.00
attach(Boston)
lm.zn = lm(crim~zn)
summary(lm.zn) # yes
Call:
lm(formula = crim ~ zn)

Residuals:
   Min     1Q Median     3Q    Max
-4.429 -4.222 -2.620  1.250 84.523

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.45369    0.41722  10.675  < 2e-16 ***
zn          -0.07393    0.01609  -4.594 5.51e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.435 on 504 degrees of freedom
Multiple R-squared:  0.04019,  Adjusted R-squared:  0.03828
F-statistic: 21.1 on 1 and 504 DF,  p-value: 5.506e-06
lm.indus = lm(crim~indus)
summary(lm.indus) # yes
Call:
lm(formula = crim ~ indus)

Residuals:
    Min      1Q  Median      3Q     Max
-11.972  -2.698  -0.736   0.712  81.813

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.06374    0.66723  -3.093  0.00209 **
indus        0.50978    0.05102   9.991  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.866 on 504 degrees of freedom
Multiple R-squared:  0.1653,  Adjusted R-squared:  0.1637
F-statistic: 99.82 on 1 and 504 DF,  p-value: < 2.2e-16
lm.chas = lm(crim~chas)
summary(lm.chas) # no
Call:
lm(formula = crim ~ chas)

Residuals:
   Min     1Q Median     3Q    Max
-3.738 -3.661 -3.435  0.018 85.232

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.7444     0.3961   9.453   <2e-16 ***
chasY        -1.8928     1.5061  -1.257    0.209
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.597 on 504 degrees of freedom
Multiple R-squared:  0.003124,  Adjusted R-squared:  0.001146
F-statistic: 1.579 on 1 and 504 DF,  p-value: 0.2094
lm.nox = lm(crim~nox)
summary(lm.nox) # yes
Call:
lm(formula = crim ~ nox)

Residuals:
    Min      1Q  Median      3Q     Max
-12.371  -2.738  -0.974   0.559  81.728

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -13.720      1.699  -8.073 5.08e-15 ***
nox           31.249      2.999  10.419  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.81 on 504 degrees of freedom
Multiple R-squared:  0.1772,  Adjusted R-squared:  0.1756
F-statistic: 108.6 on 1 and 504 DF,  p-value: < 2.2e-16
lm.rm = lm(crim~rm)
summary(lm.rm) # yes
Call:
lm(formula = crim ~ rm)

Residuals:
   Min     1Q Median     3Q    Max
-6.604 -3.952 -2.654  0.989 87.197

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   20.482      3.365   6.088 2.27e-09 ***
rm            -2.684      0.532  -5.045 6.35e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.401 on 504 degrees of freedom
Multiple R-squared:  0.04807,  Adjusted R-squared:  0.04618
F-statistic: 25.45 on 1 and 504 DF,  p-value: 6.347e-07
lm.age = lm(crim~age)
summary(lm.age) # yes
Call:
lm(formula = crim ~ age)

Residuals:
   Min     1Q Median     3Q    Max
-6.789 -4.257 -1.230  1.527 82.849

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.77791    0.94398  -4.002 7.22e-05 ***
age          0.10779    0.01274   8.463 2.85e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.057 on 504 degrees of freedom
Multiple R-squared:  0.1244,  Adjusted R-squared:  0.1227
F-statistic: 71.62 on 1 and 504 DF,  p-value: 2.855e-16
lm.dis = lm(crim~dis)
summary(lm.dis) # yes
Call:
lm(formula = crim ~ dis)

Residuals:
   Min     1Q Median     3Q    Max
-6.708 -4.134 -1.527  1.516 81.674

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.4993     0.7304  13.006   <2e-16 ***
dis          -1.5509     0.1683  -9.213   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.965 on 504 degrees of freedom
Multiple R-squared:  0.1441,  Adjusted R-squared:  0.1425
F-statistic: 84.89 on 1 and 504 DF,  p-value: < 2.2e-16
lm.rad = lm(crim~rad)
summary(lm.rad) # yes
Call:
lm(formula = crim ~ rad)

Residuals:
    Min      1Q  Median      3Q     Max
-10.164  -1.381  -0.141   0.660  76.433

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.28716    0.44348  -5.157 3.61e-07 ***
rad          0.61791    0.03433  17.998  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.718 on 504 degrees of freedom
Multiple R-squared:  0.3913,  Adjusted R-squared:  0.39
F-statistic: 323.9 on 1 and 504 DF,  p-value: < 2.2e-16
lm.tax = lm(crim~tax)
summary(lm.tax) # yes
Call:
lm(formula = crim ~ tax)

Residuals:
    Min      1Q  Median      3Q     Max
-12.513  -2.738  -0.194   1.065  77.696

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.528369   0.815809  -10.45   <2e-16 ***
tax          0.029742   0.001847   16.10   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.997 on 504 degrees of freedom
Multiple R-squared:  0.3396,  Adjusted R-squared:  0.3383
F-statistic: 259.2 on 1 and 504 DF,  p-value: < 2.2e-16
lm.ptratio = lm(crim~ptratio)
summary(lm.ptratio) # yes
Call:
lm(formula = crim ~ ptratio)

Residuals:
   Min     1Q Median     3Q    Max
-7.654 -3.985 -1.912  1.825 83.353

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.6469     3.1473  -5.607 3.40e-08 ***
ptratio       1.1520     0.1694   6.801 2.94e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.24 on 504 degrees of freedom
Multiple R-squared:  0.08407,  Adjusted R-squared:  0.08225
F-statistic: 46.26 on 1 and 504 DF,  p-value: 2.943e-11
lm.black = lm(crim~black)
summary(lm.black) # yes
Call:
lm(formula = crim ~ black)

Residuals:
    Min      1Q  Median      3Q     Max
-13.756  -2.299  -2.095  -1.296  86.822

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.553529   1.425903  11.609   <2e-16 ***
black       -0.036280   0.003873  -9.367   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.946 on 504 degrees of freedom
Multiple R-squared:  0.1483,  Adjusted R-squared:  0.1466
F-statistic: 87.74 on 1 and 504 DF,  p-value: < 2.2e-16
lm.lstat = lm(crim~lstat)
summary(lm.lstat) # yes
Call:
lm(formula = crim ~ lstat)

Residuals:
    Min      1Q  Median      3Q     Max
-13.925  -2.822  -0.664   1.079  82.862

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.33054    0.69376  -4.801 2.09e-06 ***
lstat        0.54880    0.04776  11.491  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.664 on 504 degrees of freedom
Multiple R-squared:  0.2076,  Adjusted R-squared:  0.206
F-statistic: 132 on 1 and 504 DF,  p-value: < 2.2e-16
lm.medv = lm(crim~medv)
summary(lm.medv) # yes
Call:
lm(formula = crim ~ medv)

Residuals:
   Min     1Q Median     3Q    Max
-9.071 -4.022 -2.343  1.298 80.957

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.79654    0.93419   12.63   <2e-16 ***
medv        -0.36316    0.03839   -9.46   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.934 on 504 degrees of freedom
Multiple R-squared:  0.1508,  Adjusted R-squared:  0.1491
F-statistic: 89.49 on 1 and 504 DF,  p-value: < 2.2e-16
To find which predictors are significant, we have to test $H_0 : \beta_1 = 0$. All predictors have a p-value less than 0.05 except "chas", so we may conclude that there is a statistically significant association between each predictor and the response except for the "chas" predictor.
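The thirteen slope p-values can also be collected in a single pass rather than reading thirteen summaries; a compact sketch reusing the data loaded above:
preds = setdiff(names(Boston), "crim")
sapply(preds, function(p)
  summary(lm(reformulate(p, response = "crim"), data = Boston))$coefficients[2, 4])
# Every p-value is below 0.05 except the one for chas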
(b) Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis $H_0 : \beta_j = 0$ ?
lm.all = lm(crim ~ ., data = Boston)
summary(lm.all)
Call:
lm(formula = crim ~ ., data = Boston)

Residuals:
   Min     1Q Median     3Q    Max
-9.924 -2.120 -0.353  1.019 75.051

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  17.033228   7.234903   2.354 0.018949 *
zn            0.044855   0.018734   2.394 0.017025 *
indus        -0.063855   0.083407  -0.766 0.444294
chasY        -0.749134   1.180147  -0.635 0.525867
nox         -10.313535   5.275536  -1.955 0.051152 .
rm            0.430131   0.612830   0.702 0.483089
age           0.001452   0.017925   0.081 0.935488
dis          -0.987176   0.281817  -3.503 0.000502 ***
rad           0.588209   0.088049   6.680 6.46e-11 ***
tax          -0.003780   0.005156  -0.733 0.463793
ptratio      -0.271081   0.186450  -1.454 0.146611
black        -0.007538   0.003673  -2.052 0.040702 *
lstat         0.126211   0.075725   1.667 0.096208 .
medv         -0.198887   0.060516  -3.287 0.001087 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.439 on 492 degrees of freedom
Multiple R-squared:  0.454,  Adjusted R-squared:  0.4396
F-statistic: 31.47 on 13 and 492 DF,  p-value: < 2.2e-16
We may reject the null hypothesis for "zn", "dis", "rad", "black" and "medv".
(c) How do your results from (a) compare to your results from (b) ? Create a plot displaying the univariate regression coefficients from (a) on the x-axis, and the multiple regression coefficients from (b) on the y-axis. That is, each predictor is displayed as a single point on the plot. Its coefficient in a simple linear regression model is shown on the x-axis, and its coefficient estimate in the multiple linear regression model is shown on the y-axis.
# Simple regresion
x = c(coefficients(lm.zn)[2],
coefficients(lm.indus)[2],
coefficients(lm.chas)[2],
coefficients(lm.nox)[2],
coefficients(lm.rm)[2],
coefficients(lm.age)[2],
coefficients(lm.dis)[2],
coefficients(lm.rad)[2],
coefficients(lm.tax)[2],
coefficients(lm.ptratio)[2],
coefficients(lm.black)[2],
coefficients(lm.lstat)[2],
coefficients(lm.medv)[2])
# Multiple regresion
y = coefficients(lm.all)[2:14]
plot(x, y)
There is a difference between the simple and multiple regression coefficients. This difference is due to the fact that in the simple regression case, the slope term represents the average effect of an increase in the predictor, ignoring the other predictors, whereas in the multiple regression case it represents the average effect of an increase in the predictor while holding the other predictors fixed. It does make sense for the multiple regression to suggest no relationship between the response and some of the predictors while the simple linear regression implies the opposite, because the correlation matrix below shows strong relationships between some of the predictors.
cor(Boston[-c(1, 4)])
| | zn | indus | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
zn | 1.0000000 | -0.5338282 | -0.5166037 | 0.3119906 | -0.5695373 | 0.6644082 | -0.3119478 | -0.3145633 | -0.3916785 | 0.1755203 | -0.4129946 | 0.3604453 |
indus | -0.5338282 | 1.0000000 | 0.7636514 | -0.3916759 | 0.6447785 | -0.7080270 | 0.5951293 | 0.7207602 | 0.3832476 | -0.3569765 | 0.6037997 | -0.4837252 |
nox | -0.5166037 | 0.7636514 | 1.0000000 | -0.3021882 | 0.7314701 | -0.7692301 | 0.6114406 | 0.6680232 | 0.1889327 | -0.3800506 | 0.5908789 | -0.4273208 |
rm | 0.3119906 | -0.3916759 | -0.3021882 | 1.0000000 | -0.2402649 | 0.2052462 | -0.2098467 | -0.2920478 | -0.3555015 | 0.1280686 | -0.6138083 | 0.6953599 |
age | -0.5695373 | 0.6447785 | 0.7314701 | -0.2402649 | 1.0000000 | -0.7478805 | 0.4560225 | 0.5064556 | 0.2615150 | -0.2735340 | 0.6023385 | -0.3769546 |
dis | 0.6644082 | -0.7080270 | -0.7692301 | 0.2052462 | -0.7478805 | 1.0000000 | -0.4945879 | -0.5344316 | -0.2324705 | 0.2915117 | -0.4969958 | 0.2499287 |
rad | -0.3119478 | 0.5951293 | 0.6114406 | -0.2098467 | 0.4560225 | -0.4945879 | 1.0000000 | 0.9102282 | 0.4647412 | -0.4444128 | 0.4886763 | -0.3816262 |
tax | -0.3145633 | 0.7207602 | 0.6680232 | -0.2920478 | 0.5064556 | -0.5344316 | 0.9102282 | 1.0000000 | 0.4608530 | -0.4418080 | 0.5439934 | -0.4685359 |
ptratio | -0.3916785 | 0.3832476 | 0.1889327 | -0.3555015 | 0.2615150 | -0.2324705 | 0.4647412 | 0.4608530 | 1.0000000 | -0.1773833 | 0.3740443 | -0.5077867 |
black | 0.1755203 | -0.3569765 | -0.3800506 | 0.1280686 | -0.2735340 | 0.2915117 | -0.4444128 | -0.4418080 | -0.1773833 | 1.0000000 | -0.3660869 | 0.3334608 |
lstat | -0.4129946 | 0.6037997 | 0.5908789 | -0.6138083 | 0.6023385 | -0.4969958 | 0.4886763 | 0.5439934 | 0.3740443 | -0.3660869 | 1.0000000 | -0.7376627 |
medv | 0.3604453 | -0.4837252 | -0.4273208 | 0.6953599 | -0.3769546 | 0.2499287 | -0.3816262 | -0.4685359 | -0.5077867 | 0.3334608 | -0.7376627 | 1.0000000 |
For example, "age" and "dis" are strongly negatively correlated (-0.75): when "age" is high, "dis" tends to be low. Hence a simple linear regression that only examines "crim" versus "age" shows higher values of "age" associated with higher values of "crim", even though "age" itself may have little direct effect on "crim". In other words, "age" acts as a surrogate for "dis": "age" gets credit for the effect of "dis" on "crim".
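This surrogate effect can be checked directly by refitting the "age" regression with "dis" added; a quick sketch, not part of the original output:
# Compare the "age" slope with and without "dis" in the model; if "age"
# is merely proxying for "dis", its slope should move toward zero once
# "dis" enters the model.
coefficients(lm(crim ~ age, data = Boston))
coefficients(lm(crim ~ age + dis, data = Boston))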
(d) Is there evidence of non-linear association between any of the predictors and the response? To answer this question, for each predictor $X$, fit a model of the form $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \varepsilon$.
lm.zn = lm(crim ~ poly(zn, 3))
summary(lm.zn) # 1, 2
Call:
lm(formula = crim ~ poly(zn, 3))

Residuals:
   Min     1Q Median     3Q    Max
-4.821 -4.614 -1.294  0.473 84.130

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    3.6135     0.3722   9.709  < 2e-16 ***
poly(zn, 3)1 -38.7498     8.3722  -4.628  4.7e-06 ***
poly(zn, 3)2  23.9398     8.3722   2.859  0.00442 **
poly(zn, 3)3 -10.0719     8.3722  -1.203  0.22954
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.372 on 502 degrees of freedom
Multiple R-squared:  0.05824,	Adjusted R-squared:  0.05261
F-statistic: 10.35 on 3 and 502 DF,  p-value: 1.281e-06
lm.indus = lm(crim ~ poly(indus, 3))
summary(lm.indus) # 1, 2, 3
Call:
lm(formula = crim ~ poly(indus, 3))

Residuals:
   Min     1Q Median     3Q    Max
-8.278 -2.514  0.054  0.764 79.713

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)        3.614      0.330  10.950  < 2e-16 ***
poly(indus, 3)1   78.591      7.423  10.587  < 2e-16 ***
poly(indus, 3)2  -24.395      7.423  -3.286  0.00109 **
poly(indus, 3)3  -54.130      7.423  -7.292  1.2e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.423 on 502 degrees of freedom
Multiple R-squared:  0.2597,	Adjusted R-squared:  0.2552
F-statistic: 58.69 on 3 and 502 DF,  p-value: < 2.2e-16
lm.nox = lm(crim ~ poly(nox, 3))
summary(lm.nox) # 1, 2, 3
Call:
lm(formula = crim ~ poly(nox, 3))

Residuals:
   Min     1Q Median     3Q    Max
-9.110 -2.068 -0.255  0.739 78.302

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     3.6135     0.3216  11.237  < 2e-16 ***
poly(nox, 3)1  81.3720     7.2336  11.249  < 2e-16 ***
poly(nox, 3)2 -28.8286     7.2336  -3.985 7.74e-05 ***
poly(nox, 3)3 -60.3619     7.2336  -8.345 6.96e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.234 on 502 degrees of freedom
Multiple R-squared:  0.297,	Adjusted R-squared:  0.2928
F-statistic: 70.69 on 3 and 502 DF,  p-value: < 2.2e-16
lm.rm = lm(crim ~ poly(rm, 3))
summary(lm.rm) # 1, 2
Call:
lm(formula = crim ~ poly(rm, 3))

Residuals:
    Min      1Q  Median      3Q     Max
-18.485  -3.468  -2.221  -0.015  87.219

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    3.6135     0.3703   9.758  < 2e-16 ***
poly(rm, 3)1 -42.3794     8.3297  -5.088 5.13e-07 ***
poly(rm, 3)2  26.5768     8.3297   3.191  0.00151 **
poly(rm, 3)3  -5.5103     8.3297  -0.662  0.50858
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.33 on 502 degrees of freedom
Multiple R-squared:  0.06779,	Adjusted R-squared:  0.06222
F-statistic: 12.17 on 3 and 502 DF,  p-value: 1.067e-07
lm.age = lm(crim ~ poly(age, 3))
summary(lm.age) # 1, 2, 3
Call:
lm(formula = crim ~ poly(age, 3))

Residuals:
   Min     1Q Median     3Q    Max
-9.762 -2.673 -0.516  0.019 82.842

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     3.6135     0.3485  10.368  < 2e-16 ***
poly(age, 3)1  68.1820     7.8397   8.697  < 2e-16 ***
poly(age, 3)2  37.4845     7.8397   4.781 2.29e-06 ***
poly(age, 3)3  21.3532     7.8397   2.724  0.00668 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.84 on 502 degrees of freedom
Multiple R-squared:  0.1742,	Adjusted R-squared:  0.1693
F-statistic: 35.31 on 3 and 502 DF,  p-value: < 2.2e-16
lm.dis = lm(crim ~ poly(dis, 3))
summary(lm.dis) # 1, 2, 3
Call:
lm(formula = crim ~ poly(dis, 3))

Residuals:
    Min      1Q  Median      3Q     Max
-10.757  -2.588   0.031   1.267  76.378

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     3.6135     0.3259  11.087  < 2e-16 ***
poly(dis, 3)1 -73.3886     7.3315 -10.010  < 2e-16 ***
poly(dis, 3)2  56.3730     7.3315   7.689 7.87e-14 ***
poly(dis, 3)3 -42.6219     7.3315  -5.814 1.09e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.331 on 502 degrees of freedom
Multiple R-squared:  0.2778,	Adjusted R-squared:  0.2735
F-statistic: 64.37 on 3 and 502 DF,  p-value: < 2.2e-16
lm.rad = lm(crim ~ poly(rad, 3))
summary(lm.rad) # 1, 2
Call:
lm(formula = crim ~ poly(rad, 3))

Residuals:
    Min      1Q  Median      3Q     Max
-10.381  -0.412  -0.269   0.179  76.217

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     3.6135     0.2971  12.164  < 2e-16 ***
poly(rad, 3)1 120.9074     6.6824  18.093  < 2e-16 ***
poly(rad, 3)2  17.4923     6.6824   2.618  0.00912 **
poly(rad, 3)3   4.6985     6.6824   0.703  0.48231
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.682 on 502 degrees of freedom
Multiple R-squared:  0.4,	Adjusted R-squared:  0.3965
F-statistic: 111.6 on 3 and 502 DF,  p-value: < 2.2e-16
lm.tax = lm(crim ~ poly(tax, 3))
summary(lm.tax) # 1, 2
Call:
lm(formula = crim ~ poly(tax, 3))

Residuals:
    Min      1Q  Median      3Q     Max
-13.273  -1.389   0.046   0.536  76.950

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     3.6135     0.3047  11.860  < 2e-16 ***
poly(tax, 3)1 112.6458     6.8537  16.436  < 2e-16 ***
poly(tax, 3)2  32.0873     6.8537   4.682 3.67e-06 ***
poly(tax, 3)3  -7.9968     6.8537  -1.167    0.244
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.854 on 502 degrees of freedom
Multiple R-squared:  0.3689,	Adjusted R-squared:  0.3651
F-statistic: 97.8 on 3 and 502 DF,  p-value: < 2.2e-16
lm.ptratio = lm(crim ~ poly(ptratio, 3))
summary(lm.ptratio) # 1, 2, 3
Call:
lm(formula = crim ~ poly(ptratio, 3))

Residuals:
   Min     1Q Median     3Q    Max
-6.833 -4.146 -1.655  1.408 82.697

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)          3.614      0.361  10.008  < 2e-16 ***
poly(ptratio, 3)1   56.045      8.122   6.901 1.57e-11 ***
poly(ptratio, 3)2   24.775      8.122   3.050  0.00241 **
poly(ptratio, 3)3  -22.280      8.122  -2.743  0.00630 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.122 on 502 degrees of freedom
Multiple R-squared:  0.1138,	Adjusted R-squared:  0.1085
F-statistic: 21.48 on 3 and 502 DF,  p-value: 4.171e-13
lm.black = lm(crim ~ poly(black, 3))
summary(lm.black) # 1
Call:
lm(formula = crim ~ poly(black, 3))

Residuals:
    Min      1Q  Median      3Q     Max
-13.096  -2.343  -2.128  -1.439  86.790

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)       3.6135     0.3536  10.218   <2e-16 ***
poly(black, 3)1 -74.4312     7.9546  -9.357   <2e-16 ***
poly(black, 3)2   5.9264     7.9546   0.745    0.457
poly(black, 3)3  -4.8346     7.9546  -0.608    0.544
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.955 on 502 degrees of freedom
Multiple R-squared:  0.1498,	Adjusted R-squared:  0.1448
F-statistic: 29.49 on 3 and 502 DF,  p-value: < 2.2e-16
lm.lstat = lm(crim ~ poly(lstat, 3))
summary(lm.lstat) # 1, 2
Call:
lm(formula = crim ~ poly(lstat, 3))

Residuals:
    Min      1Q  Median      3Q     Max
-15.234  -2.151  -0.486   0.066  83.353

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)       3.6135     0.3392  10.654   <2e-16 ***
poly(lstat, 3)1  88.0697     7.6294  11.543   <2e-16 ***
poly(lstat, 3)2  15.8882     7.6294   2.082   0.0378 *
poly(lstat, 3)3 -11.5740     7.6294  -1.517   0.1299
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.629 on 502 degrees of freedom
Multiple R-squared:  0.2179,	Adjusted R-squared:  0.2133
F-statistic: 46.63 on 3 and 502 DF,  p-value: < 2.2e-16
lm.medv = lm(crim ~ poly(medv, 3))
summary(lm.medv) # 1, 2, 3
Call:
lm(formula = crim ~ poly(medv, 3))

Residuals:
    Min      1Q  Median      3Q     Max
-24.427  -1.976  -0.437   0.439  73.655

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)       3.614      0.292  12.374  < 2e-16 ***
poly(medv, 3)1  -75.058      6.569 -11.426  < 2e-16 ***
poly(medv, 3)2   88.086      6.569  13.409  < 2e-16 ***
poly(medv, 3)3  -48.033      6.569  -7.312 1.05e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.569 on 502 degrees of freedom
Multiple R-squared:  0.4202,	Adjusted R-squared:  0.4167
F-statistic: 121.3 on 3 and 502 DF,  p-value: < 2.2e-16
See the inline comments above, where the numbers after each summary() call record which polynomial degrees are statistically significant: the answer is yes for most predictors. For "zn", "rm", "rad", "tax" and "lstat", the cubic coefficient is not statistically significant but the quadratic one is, so there is evidence of a quadratic association. For "indus", "nox", "age", "dis", "ptratio" and "medv", all three polynomial coefficients are significant, supporting the cubic fit. For "black", neither the quadratic nor the cubic coefficient is significant, so no non-linear effect is visible. ("chas" is a binary dummy variable, so a polynomial fit is not meaningful for it and it is omitted here.)