In a given regression situation, there are often many choices to be made. Recall our usual setup $$ Y_{n \times 1} = X_{n \times p} \beta_{p \times 1} + \epsilon_{n \times 1}. $$

Any *subset $A \subset \{1, \dots, p\}$* yields a new regression model
$$
{\cal M}(A): Y_{n \times 1} = X[,A] \beta[A] + \epsilon_{n \times 1}
$$
by setting $\beta[A^c]=0$.

**Model selection** is, roughly speaking, how to choose $A$ among the
$2^p$ possible choices.

Here is a dataset from the book that we will use to explore different model selection approaches.

Variable | Description |
---|---|

$V$ | votes for a presidential candidate |

$I$ | are they incumbent? |

$D$ | Democrat or Republican incumbent? |

$W$ | wartime election? |

$G$ | GDP growth rate in election year |

$P$ | (absolute) GDP deflator growth rate |

$N$ | number of quarters in which GDP growth rate $> 3.2\%$ |

In [ ]:

```
#url = 'http://stats191.stanford.edu/data/election.table'
url = '../data/election.table'
election.table = read.table(url, header=T)
pairs(election.table[,2:ncol(election.table)], cex.labels=3, pch=23,
bg='orange', cex=2)
```

- When we have many predictors (with many possible interactions), it can be difficult to find a good model.
- Which main effects do we include?
- Which interactions do we include?
- Model selection procedures try to
*simplify / automate*this task. - Election data has $2^6=64$ different models with just main effects!

This is generally an "unsolved" problem in statistics: there are no magic procedures to get you the "best model."

In some sense, model selection is "data mining."

Data miners / machine learners often work with very many predictors.

Our model selection problem is generally at a much smaller scale than "data mining" problems.

Still, it is a hard problem.

**Inference after selection is full of pitfalls!**

- Suppose we fit a a model $F: \quad Y_{n \times 1} = X_{n \times p} \beta_{p \times 1} + \varepsilon_{n \times 1}$ with predictors ${ X}_1, \dots, { X}_p$.
- In reality, some of the $\beta$’s may be zero. Let’s suppose that $\beta_{j+1}= \dots= \beta_{p}=0$.
- Then, any model that includes $\beta_0, \dots, \beta_j$ is
*correct*: which model gives the*best*estimates of $\beta_0, \dots, \beta_j$? - Principle of
*parsimony*(i.e. Occam’s razor) says that the model with*only*${X}_1, \dots, {X}_j$ is "best".

- For simplicity, let’s assume that $j=1$ so there is only one coefficient to estimate.
- Then, because each model gives an
*unbiased*estimate of $\beta_1$ we can compare models based on $\text{Var}(\widehat{\beta}_1).$ - The best model, in terms of this variance, is the one containing only ${ X}_1$.
- What if we didn’t know that only $\beta_1$ was non-zero (which we don’t know in general)?
- In this situation, we must choose a set of variables.

- To "implement" a model selection procedure, we first need a criterion or benchmark to compare two models.
- Given a criterion, we also need a search strategy.
- With a limited number of predictors, it is possible to search all possible models (
`leaps`

in`R`

).

Possible criteria:

- $R^2$: not a good criterion. Always increase with model size $\implies$ "optimum" is to take the biggest model.
- Adjusted $R^2$: better. It "penalized" bigger models. Follows principle of parsimony / Occam’s razor.
- Mallow’s $C_p$ – attempts to estimate a model’s predictive power, i.e. the power to predict a new observation.

Leaps takes a design matrix as argument: throw away the intercept column or leaps will complain.

In [ ]:

```
election.lm = lm(V ~ I + D + W + G:I + P + N, election.table)
election.lm
```

In [ ]:

```
X = model.matrix(election.lm)[,-1]
library(leaps)
election.leaps = leaps(X, election.table$V, nbest=3, method='r2')
best.model.r2 = election.leaps$which[which((election.leaps$r2 ==
max(election.leaps$r2))),]
best.model.r2
```

Let's plot the $R^2$ as a function of the model size. We see that the full model does include all variables.

In [ ]:

```
plot(election.leaps$size, election.leaps$r2, pch=23, bg='orange', cex=2)
```

As we add more and more variables to the model – even random ones, $R^2$ will increase to 1.

Adjusted $R^2$ tries to take this into account by replacing sums of squares by

*mean squares*$$R^2_a = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} = 1 - \frac{MSE}{MST}.$$

In [ ]:

```
election.leaps = leaps(X, election.table$V, nbest=3, method='adjr2')
best.model.adjr2 = election.leaps$which[which((election.leaps$adjr2 == max(election.leaps$adjr2))),]
best.model.adjr2
```

In [ ]:

```
plot(election.leaps$size, election.leaps$adjr2, pch=23, bg='orange',
cex=2)
```

- $C_p({\cal M}) = \frac{SSE({\cal M})}{\widehat{\sigma}^2} + 2 \cdot p({\cal M}) - n.$
- $\widehat{\sigma}^2=SSE(F)/df_F$ is the "best" estimate of $\sigma^2$ we have (use the fullest model), i.e. in the election data it uses all 6 main effects.
- $SSE({\cal M})$ is the $SSE$ of the model ${\cal M}$.
- $p({\cal M})$ is the number of predictors in ${\cal M}$.
- This is an estimate of the expected mean-squared error of $\widehat{Y}({\cal M})$, it takes
*bias*and*variance*into account.

In [ ]:

```
election.leaps = leaps(X, election.table$V, nbest=3, method='Cp')
best.model.Cp = election.leaps$which[which((election.leaps$Cp == min(election.leaps$Cp))),]
best.model.Cp
```

In [ ]:

```
plot(election.leaps$size, election.leaps$Cp, pch=23, bg='orange', cex=2)
```

Given a criterion, we now have to decide how we are going to search through the possible models.

"Best subset": search all possible models and take the one with highest $R^2_a$ or lowest $C_p$ leaps. Such searches are typically feasible only up to $p=30$ or $40$.

Stepwise (forward, backward or both): useful when the number of predictors is large. Choose an initial model and be "greedy".

"Greedy" means always take the biggest jump (up or down) in your selected criterion.

`R`

¶- "Best subset": use the function
`leaps`

. Works only for multiple linear regression models. - Stepwise: use the function
`step`

. Works for any model with Akaike Information Criterion (AIC). In multiple linear regression, AIC is (almost) a linear function of $C_p$.

- Akaike (AIC) defined as $$AIC({\cal M}) = - 2 \log L({\cal M}) + 2 \cdot p({\cal M})$$ where $L({\cal M})$ is the maximized likelihood of the model.
- Bayes (BIC) defined as $$BIC({\cal M}) = - 2 \log L({\cal M}) + \log n \cdot p({\cal M})$$
- Strategy can be used for whenever we have a likelihood, so this generalizes to many statistical models.

- In linear regression with unknown $\sigma^2$ $$-2 \log L({\cal M}) = n \log(2\pi \widehat{\sigma}^2_{MLE}) + n$$ where $\widehat{\sigma}^2_{MLE} = \frac{1}{n} SSE(\widehat{\beta})$
- In linear regression with known $\sigma^2$ $$-2 \log L({\cal M}) = n \log(2\pi \sigma^2) + \frac{1}{\sigma^2} SSE(\widehat{\beta})$$ so AIC is very much like Mallow’s $C_p$.

In [ ]:

```
n = nrow(X)
p = 7 + 1
c(n * log(2*pi*sum(resid(election.lm)^2)/n) + n + 2*p, AIC(election.lm))
```

BIC will typically choose a model as small or smaller than AIC (if using the same search direction).

As our sample size grows, under some assumptions, it can be shown that

- AIC will (asymptotically) always choose a model that contains the true model, i.e. it won’t leave any variables out.
- BIC will (asymptotically) choose exactly the right model.

Let's take a look at `step`

in action. Probably the simplest
strategy is *forward stepwise* which tries to add one variable at a time,
as long as it can find a resulting model whose AIC is better than
its current position.

When it can make no further additions, it terminates.

In [ ]:

```
election.step.forward = step(lm(V ~ 1, election.table),
list(upper = ~ I + D + W + G:I + P + N),
direction='forward',
k=2)
```

In [ ]:

```
summary(election.step.forward)
```

We notice that although the *full* model we gave it had the interaction `I:G`

, the function `step`

never tried to use it. This is
due to some rules implemented in `step`

that do not include an interaction unless both main effects are already in the model. In this case, because neither $I$ nor $G$ were added, the interaction was never considered.

In the `leaps`

example, we gave the function the design matrix
and it did not have to consider interactions: they were already encoded in the design matrix.

The only difference between AIC and BIC is the price paid
per variable. This is the argument `k`

to `step`

. By default `k=2`

and for BIC
we set `k=log(n)`

. If we set `k=0`

it will always add variables.

In [ ]:

```
election.step.forward.BIC = step(lm(V ~ 1, election.table),
list(upper = ~ I + D + W +G:I + P + N),
direction='forward', k=log(nrow(X)))
```

In [ ]:

```
summary(election.step.forward.BIC)
```

Just for fun, let's consider backwards stepwise. This starts at a full model and tries to delete variables.

There is also a `direction="both"`

option.

In [ ]:

```
election.step.backward = step(election.lm, direction='backward')
```

In [ ]:

```
summary(election.step.backward)
```

Yet another model selection criterion is $K$-fold cross-validation.

- Fix a model ${\cal M}$. Break data set into $K$ approximately equal sized groups $(G_1, \dots, G_K)$.
- For (i in 1:K) Use all groups except $G_i$ to fit model, predict outcome in group $G_i$ based on this model $\widehat{Y}_{j,{\cal M}, G_i}, j \in G_i$.
- Similar to what we saw in Cook's distance / DFFITS.
- Estimate $CV({\cal M}) = \frac{1}{n}\sum_{i=1}^K \sum_{j \in G_i} (Y_j - \widehat{Y}_{j,{\cal M},G_i})^2.$

- It is a general principle that can be used in other situations to "choose parameters."
- Pros (partial list): "objective" measure of a model.
- Cons (partial list): all we know about inference is
*usually*"out the window" (also true for other model selection procedures). - If goal is not really inference about certain specific parameters, it is a reasonable way to compare models.

In [ ]:

```
library(boot)
election.glm = glm(V ~ ., data=election.table)
cv.glm(model.frame(election.glm), election.glm, K=5)$delta
```

Let's plot our $C_p$ versus the $CV$ score.

Keep in mind that there is additional randomness in the $CV$ score due to the random assignments to groups.

In [ ]:

```
plot(election.leaps$Cp, election.leaps$CV, pch=23, bg='orange', cex=2)
```

In [ ]:

```
plot(election.leaps$size, election.leaps$CV, pch=23, bg='orange', cex=2)
best.model.Cv = election.leaps$which[which((election.leaps$CV
== min(election.leaps$CV))),]
best.model.Cv
```

The model selected depends on the criterion used.

Criterion | Model |
---|---|

$R^2$ | ~ $ I + D + W +G:I + P + N$ |

$R^2_a$ | ~ $ I + D + P + N$ |

$C_p$ | ~ $D+P+N$ |

AIC forward | ~ $D+P$ |

BIC forward | ~ $D$ |

AIC backward | ~ $I + D + N + I:G$ |

5-fold CV | ~ $ I+W$ |

**The selected model is random and depends on which method we use!**

- Many other "criteria" have been proposed.
- Some work well for some types of data, others for different data.
- Check diagnostics!
- These criteria (except cross-validation) are not "direct measures" of predictive power, though Mallow’s $C_p$ is a step in this direction.
- $C_p$ measures the quality of a model based on both
*bias*and*variance*of the model. Why is this important? *Bias-variance*tradeoff is ubiquitous in statistics. More soon.

Each of the above criteria return a model. The `summary`

provides
$p$-values.

In [ ]:

```
summary(election.step.forward)
```

We can also form confidence intervals. **But, can we trust these intervals or tests!**

In [ ]:

```
confint(election.step.forward)
```

To illustrate the "dangers" of trusting the above $p$-values, I will
use the `selectiveInference`

package which has a variant of forward stepwise.

In [ ]:

```
library(selectiveInference)
plot(fs(X, V))
```

In [ ]:

```
fsInf(fs(X, V))
```

Let's generate data for which we know all coefficients are zero and look at the $p$-values.

We will look at the *naive* p-values which ignore selection, as well
as a certain kind of corrected $p$-values from `selectiveInference`

.

In [ ]:

```
X = matrix(rnorm(2000), 100, 20)
nsim = 1000
naiveP = c()
for (i in 1:nsim) {
Z = rnorm(nrow(X))
fsfit = fs(X, Z, maxsteps=1)
fsinf = fsInf(fsfit, sigma=1)
naive.lm = lm(Z ~ X[,fsinf$vars[1]])
naiveP = c(naiveP, summary(naive.lm)$coef[2,4])
}
```

In [ ]:

```
plot(ecdf(naiveP), lwd=3, col='red')
abline(0,1, lwd=2, lty=2)
```

What is the type I error of the naive test?

In [ ]:

```
ecdf(naiveP)(0.05)
```

A simple fix for this problem is to randomly split the data into two groups (often but not always of equal size).

This gives valid inference.

Which variable is chosen is random as it depends on the split.

Also, we have access to less data to form confidence intervals, test hypotheses.

In [ ]:

```
splitP = c()
for (i in 1:nsim) {
Z = rnorm(nrow(X))
subset = sample(1:nrow(X), nrow(X)/2, replace=FALSE)
subset_c = rep(TRUE, nrow(X))
subset_c[subset] = FALSE
fsfit = fs(X[subset,], Z[subset], maxsteps=1)
split.lm = lm(Z ~ X[,fsinf$vars[1]], subset=subset_c)
splitP = c(splitP, summary(split.lm)$coef[2,4])
}
```

In [ ]:

```
plot(ecdf(splitP), lwd=3, col='red')
abline(0,1, lwd=2, lty=2)
```

It turns out that it is possible to modify the usual $t$-statistic so that we still get valid tests after the first step. Its description is a little beyond the level of this course.

In [ ]:

```
exactP = c()
for (i in 1:nsim) {
Z = rnorm(nrow(X))
fsfit = fs(X, Z, maxsteps=1)
fsinf = fsInf(fsfit, sigma=1)
exactP = c(exactP, fsinf$pv[1])
}
```

In [ ]:

```
plot(ecdf(exactP), lwd=3, col='blue')
plot(ecdf(naiveP), lwd=3, col='red', add=TRUE)
abline(0,1, lwd=2, lty=2)
```

Both the exact and data splitting give valid p-values.

Which is more powerful?

Let's put some signal in and see.

In [ ]:

```
exactP_A = c()
splitP_A = c()
beta1 = .3 # signal size
for (i in 1:nsim) {
Z = rnorm(nrow(X)) + X[,1] * beta1
fsfit = fs(X, Z, maxsteps=1)
fsinf = fsInf(fsfit, sigma=1)
exactP_A = c(exactP_A, fsinf$pv[1])
# data splitting pvalue
subset = sample(1:nrow(X), nrow(X)/2, replace=FALSE)
subset_c = rep(TRUE, nrow(X))
subset_c[subset] = FALSE
fsfit = fs(X[subset,], Z[subset], maxsteps=1)
split.lm = lm(Z ~ X[,fsinf$vars[1]], subset=subset_c)
splitP_A = c(splitP_A, summary(split.lm)$coef[2,4])
}
```

In [ ]:

```
plot(ecdf(exactP_A), lwd=3, col='blue', xlim=c(0,1))
plot(ecdf(splitP_A), lwd=3, col='red', add=TRUE)
abline(0,1, lwd=2, lty=2)
```

- Which has shorter confidence intervals?

In [ ]:

```
exactL = c()
splitL = c()
beta1 = 0.5
for (i in 1:nsim) {
Z = rnorm(nrow(X)) + X[,1] * beta1
fsfit = fs(X, Z, maxsteps=1)
fsinf = fsInf(fsfit, sigma=1)
exactL = c(exactL, fsinf$ci[1,2] - fsinf$ci[1,1])
# data splitting pvalue
subset = sample(1:nrow(X), nrow(X)/2, replace=FALSE)
subset_c = rep(TRUE, nrow(X))
subset_c[subset] = FALSE
fsfit = fs(X[subset,], Z[subset], maxsteps=1)
split.lm = lm(Z ~ X[,fsinf$vars[1]], subset=subset_c)
conf = confint(split.lm, level=0.9)
splitL = c(splitL, conf[2,2] - conf[2,1])
}
```

In [ ]:

```
data.frame(median(exactL), median(splitL))
```

This `exactP`

was only discovered recently!

For about 15 years, I have been saying things along the lines of

```
Inference after model selection is basically out the window.
Forget all we taught you about t and F distributions as it is no longer true...
```

It turns out that inference after selection is possible, and it doesn't force us to throw away all of our tools for inference.

*But, it is a little more complicated to describe.*