## Diagnostics for simple regression¶

• Goodness of fit of regression: analysis of variance.

• $F$-statistics.

• Residuals.

• Diagnostic plots.

## Geometry of least squares¶

Here are three pictures that help to describe different models we might fit.

### The full model¶

• This picture is meant to depict the regression model $$Y = \gamma \cdot 1 + X \beta + \epsilon.$$

• The $\gamma$ coefficient represents movement along the horizontal axis above, labelled $\pmb{1}$.

• The $\beta$ coefficient represents movement along the axis $X$ above.

• The vector $\hat{Y}$ is the vector of fitted values in the above model.

### The reduced model¶

• This picture is meant to depict the regression model $$Y = \gamma \cdot 1 + \epsilon.$$

• The $\gamma$ coefficient represents movement along the horizontal axis above, labelled $\pmb{1}$.

• Since $\beta=0$, we have assumed there is no movement along the $X$ axis.

• The vector $\hat{Y}$ is the vector of fitted values in the above model.

### Both models together¶

• The above picture tries to capture both models in one image.

• There is a new vector: $\hat{Y} - \bar{Y} \cdot \pmb{1}$. This vector is the difference in fits between the two previous models.

## Goodness of fit¶

• The closer $\hat{Y}$ is to the ${1}$ axis, the less "variation" there is along the $X$ axis.

• This closeness can be measured by the length of the vector $\hat{Y}-\bar{Y} \cdot 1$.

• The square of a vector's length is the sum of its elements squared. These quantities are usually referred to as sums of squares.

### Sums of squares¶

\begin{aligned} SSE &= \sum_{i=1}^n(Y_i - \widehat{Y}_i)^2 = \sum_{i=1}^n (Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 X_i)^2 \\ SSR &= \sum_{i=1}^n(\overline{Y} - \widehat{Y}_i)^2 = \sum_{i=1}^n (\overline{Y} - \widehat{\beta}_0 - \widehat{\beta}_1 X_i)^2 \\ SST &= \sum_{i=1}^n(Y_i - \overline{Y})^2 = SSE + SSR \\ R^2 &= \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \widehat{Cor}(\pmb{X},\pmb{Y})^2. \end{aligned}
• The quantity $SSE$, or error sum of squares, is the squared length of the vector $Y - \hat{Y}$ which protrudes perpendicular to the plane.

• The quantity $SSR$, or regression sum of squares, is the length of the vector $\hat{Y} - \bar{Y} \cdot 1$.

• The quantity $SST$, or total sum of squares, is the length of the vector $Y - \bar{Y} \cdot 1$.

• The quantity $R^2$ is a measure of the goodness of fit of the simple linear regression model. Values near 1 indicate much of the total variability in $Y$ is explained by the regression model.

### Mean squares¶

• Each sum of squares gets an extra bit of information associated to them, called their degrees of freedom.

• Roughly speaking, the degrees of freedom can be determined by dimension counting.

• The $SSE$ has $n-2$ degrees of freedom because it is the squared length of a vector that lies in $n-2$ dimensions. To see this, note that it is perpendicular to the 2-dimensional plane formed by the $X$ axis and the $1$ axis.

• The $SST$ has $n-1$ degrees of freedom because it is the squared length of a vector that lies in $n-1$ dimensions. In this case, this vector is perpendicular to the $1$ axis.

• The $SSR$ has 1 degree of freedom because it is the squared length of a vector that lies in the 2-dimensional plane but is perpendicular to the $1$ axis.

\begin{aligned} MSE &= \frac{1}{n-2}\sum_{i=1}^n(Y_i - \widehat{Y}_i)^2 \\ MSR &= \sum_{i=1}^n(\overline{Y} - \widehat{Y}_i)^2 \\ MST &= \frac{1}{n-1}\sum_{i=1}^n(Y_i - \overline{Y})^2 \\ \end{aligned}

### A different visualization¶

These sums of squares can be visualized by other means as well. We will illustrate with a synthetic dataset.

In [1]:
%%R -h 600 -w 600
X = seq(0, 20, length = 21)
Y = 0.5 * X + 1 + rnorm(21)
Y.lm = lm(Y ~ X)
meanY = mean(Y)
Yhat = predict(Y.lm)
plot(X,Y, pch=23, bg='red', cex=2)


The total sum of squares, $SST$, is the sum of the squared differences between the Y values and the sample mean of the Y values.

In [2]:
%%R -h 600 -w 600
plot(X, Y, pch = 23, bg = "red", main='Total sum of squares', cex=2)
abline(Y.lm, pch=23, col='green', lwd=2)
abline(h = meanY, col = "yellow", lwd = 2)
for (i in 1:21) {
points(X[i], meanY, pch = 23, bg = "yellow")
lines(c(X[i], X[i]), c(Y[i], meanY))
}


The error sum of squares, $SSE$, is the sum of the squared differences between the $Y$ values and the $\hat{Y}$ values, i.e. the fitted values of the regression model.

In [3]:
%%R -h 600 -w 600
plot(X, Y, pch = 23, bg = "red", main="Error sum of squares", cex=2)
abline(Y.lm, col = "green", lwd = 2)
for (i in 1:21) {
points(X[i], Yhat[i], pch = 23, bg = "green")
lines(c(X[i], X[i]), c(Y[i], Yhat[i]))
}
abline(h = meanY, col = "yellow", lwd = 2)


Finally, the regression sum of squares, $SSR$ is the sum of the squared differences between the $\hat{Y}$ values and the sample mean of the $Y$ values.

In [4]:
%%R -h 600 -w 600
plot(X, Y, pch = 23, bg = "red", main="Regression sum of squares", cex=2)
abline(Y.lm, col = "green", lwd = 2)
abline(h = meanY, col = "yellow", lwd = 2)
for (i in 1:21) {
points(X[i], Yhat[i], pch = 23, bg = "green")
points(X[i], meanY, pch = 23, bg = "yellow")
lines(c(X[i], X[i]), c(meanY, Yhat[i]))
}


## Definition of $R^2$¶

As noted above, if the regression model fits very well, then $SSR$ will be large relative to $SST$. The $R^2$ score is just the ratio of these sums of squares.

We'll verify this on the wages data.

In [5]:
%%R
url = "http://stats191.stanford.edu/data/wage.csv"
wages = read.table(url, sep = ",", header = T)
wages.lm = lm(logwage ~ education, data=wages)


Let's verify our claim $SST=SSE+SSR$:

In [6]:
%%R
SSE = sum(resid(wages.lm)^2)
SST = sum((wages$logwage - mean(wages$logwage))^2)
SSR = sum((mean(wages$logwage) - predict(wages.lm))^2) data.frame(SST, SSE + SSR)   SST SSE...SSR 1 410.2148 410.2148  The$R^2$is also closely related to the$F$statistic reported as the goodness of fit in summary of lm. In [7]: %%R F = (SSR / 1) / (SSE / wages.lm$df)
print(F)
summary(wages.lm)

[1] 340.0297

Call:
lm(formula = logwage ~ education, data = wages)

Residuals:
Min       1Q   Median       3Q      Max
-1.78239 -0.25265  0.01636  0.27965  1.61101

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.239194   0.054974   22.54   <2e-16 ***
education   0.078600   0.004262   18.44   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4038 on 2176 degrees of freedom
Multiple R-squared:  0.1351,	Adjusted R-squared:  0.1347
F-statistic:   340 on 1 and 2176 DF,  p-value: < 2.2e-16



Compare this to the F-statistic output here:

In [8]:
%%R
summary(wages.lm)

Call:
lm(formula = logwage ~ education, data = wages)

Residuals:
Min       1Q   Median       3Q      Max
-1.78239 -0.25265  0.01636  0.27965  1.61101

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.239194   0.054974   22.54   <2e-16 ***
education   0.078600   0.004262   18.44   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4038 on 2176 degrees of freedom
Multiple R-squared:  0.1351,	Adjusted R-squared:  0.1347
F-statistic:   340 on 1 and 2176 DF,  p-value: < 2.2e-16



x = anscombe$x2 y = y + rnorm(length(y)) * 0.45 plot(x, y, pch = 23, bg = "orange", cex = 2, ylab = "Y", xlab = "X") simple.lm = lm(y ~ x) abline(simple.lm, lwd = 2, col = "red", lty = 2)  Let's take a look at the residuals from this model. Patterns in these residual plots may suggest something like a quadratic effect is missing, but they can also suggest some sort of serial dependence in the random errors. We will discuss this later, when we discuss correlated-errors. In [15]: %%R -h 600 -w 600 plot(x, resid(simple.lm), ylab = "Residual", xlab = "X", pch = 23, bg = "orange", cex = 2) abline(h = 0, lwd = 2, col = "red", lty = 2)  We will add a quadratic term to our model. This is our first example of a multiple linear regression model. In [16]: %%R -h 600 -w 600 quadratic.lm = lm(y ~ poly(x, 2)) Xsort = sort(x) plot(x, y, pch = 23, bg = "orange", cex = 2, ylab = "Y", xlab = "X") lines(Xsort, predict(quadratic.lm, list(x = Xsort)), col = "red", lty = 2, lwd = 2)  The residuals of the quadratic model have no apparent pattern in them, suggesting this is a better fit than the simple linear regression model. In [17]: %%R -h 600 -w 600 plot(x, resid(quadratic.lm), ylab = "Residual", xlab = "X", pch = 23, bg = "orange", cex = 2) abline(h = 0, lwd = 2, col = "red", lty = 2)  ### Assessing normality of errors¶ Another common diagnostic plot is the qqplot where qq stands for Quantile-Quantile. Roughly speaking, a qqplot is designed to see if the quantiles of two distributions match. • The function qqnorm can be used to ascertain if a sample of numbers are roughly normally distributed. If the points lie on the diagonal line, this is evidence that the sample is normally distributed. Various departures from the diagonal indicate skewness, asymmetry, etc. • If$e_i, 1\leq i \leq n$were really a sample of$N(0, \sigma^2)$then their sample quantiles should be close to the sample quantiles of the$N(0, \sigma^2)$distribution. The$qqnorm$plot is a plot of $$e_{(i)} \ {\rm vs.} \ \mathbb{E}(\varepsilon_{(i)}), \qquad 1 \leq i \leq n.$$ where$e_{(i)}$is the$i$-th smallest residual (order statistic) and$\mathbb{E}(\varepsilon_{(i)})$is the expected value for independent$\varepsilon_i$'s$\sim N(0,\sigma^2)$. In [18]: %%R -h 600 -w 600 qqnorm(resid(simple.lm), pch = 23, bg = "orange", cex = 2)  In [19]: %%R -h 600 -w 600 qqnorm(resid(quadratic.lm), pch = 23, bg = "orange", cex = 2)  In these two examples, the qqplot does not seem vastly different, even though we know the simple model is incorrect in this case. This indicates that several diagnostic tools should be used in assessing a model. ### Assessing constant variance assumption¶ One plot that is sometimes used to determine whether the variance is constant or not is a plot of$X$against$e=Y-\hat{Y}$. If there is a pattern to the spread in this plot, it may indicate that the variance changes as a function of$X$. In our earlier plots, we noticed a trend in this plot, not necessarily evidence of changing variance. The dataset below, taken from some work done with Dr. Robert Shafer here at Stanford http://hivdb.stanford.edu, plots HIV virus load against a score related to the the genetic makeup of a patient’s virus shows clear non-constant variance. It also provides a clear example of an outlier, or a point that is a clear departure from the model. In [20]: %%R -h 800 -w 800 url = 'http://stats191.stanford.edu/data/HIV.VL.table' viral.load = read.table(url, header=T) attach(viral.load) plot(GSS, VL, pch=23, bg='orange', cex=2) viral.lm = lm(VL ~ GSS) abline(viral.lm, col='red', lwd=2)  In [21]: %%R -h 800 -w 800 good = (VL < 200000) plot(GSS, VL, pch=23, bg='orange', cex=2) viral.lm.good = lm(VL ~ GSS, subset=good) abline(viral.lm.good, col='green', lwd=2)  When we plot the residuals against the fitted values for this model (even with the outlier removed) we see that the variance clearly depends on$GSS\$. They also do not seem symmetric around 0 so perhaps the Gaussian model is not appropriate.

In [22]:
%%R -h 800 -w 800
plot(GSS[good], resid(viral.lm.good), pch=23,
bg='orange', cex=2, xlab='GSS', ylab='Residual')
abline(h=0, lwd=2, col='red', lty=2)


### Outliers¶

Outliers can be obvious to spot (or not) but very difficult to define rigorously. Roughly speaking, they points where the model really does not fit.

They might correspond to mistakes in data transcription, lab errors, who knows? If possible, they should be identified and (hopefully) explained.