One goal of a regression analysis is to "build" a model that predicts well -- the AIC / $C_p$ and cross-validation selection criteria are based on this idea.
This is slightly different from the goal of making inferences about $\beta$ that we have focused on so far.
What does "predict well" mean?
For what value of $\alpha$ is $\hat{Y}(\alpha)$ unbiased?
Is this the best estimate of $\mu$ in terms of MSE?
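Before simulating, here is a short worked calculation (assuming, as in the code below, that $Y_1, \dots, Y_n$ are i.i.d. with mean $\mu$ and variance $\sigma^2$, and that $\hat{Y}(\alpha) = \alpha \bar{Y}$):
$$\begin{aligned}
E[\alpha \bar{Y}] &= \alpha \mu, \qquad \text{so } \alpha\bar{Y} \text{ is unbiased only when } \alpha = 1, \\
\text{Bias}(\alpha \bar{Y}) &= (\alpha - 1)\mu, \\
\text{Var}(\alpha \bar{Y}) &= \frac{\alpha^2 \sigma^2}{n}.
\end{aligned}$$
These are the bias and variance quantities computed in the simulation below.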
mu = 0.5        # true mean
sigma = 5       # noise standard deviation
nsample = 100   # sample size n
ntrial = 1000   # number of simulated datasets
# mean squared error of the estimate mu.hat for the target mu
MSE = function(mu.hat, mu) {
    return(sum((mu.hat - mu)^2) / length(mu))
}
alpha = seq(0, 1, length=20)           # shrinkage factors to try
mse = numeric(length(alpha))
# theoretical bias and variance of alpha * Ybar
bias = (1 - alpha) * mu
variance = alpha^2 * sigma^2 / nsample
for (i in 1:ntrial) {
    Z = rnorm(nsample) * sigma + mu    # one sample of size n
    for (j in 1:length(alpha)) {
        mse[j] = mse[j] + MSE(alpha[j] * mean(Z) * rep(1, nsample), mu * rep(1, nsample))
    }
}
mse = mse / ntrial                     # average over the simulated datasets
plot(alpha, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
xlab=expression(paste('Shrinkage factor,', alpha)),
ylab=expression(paste('MSE(', alpha, ')')),
cex.lab=1.2)
lines(alpha, bias^2, col='green', lwd=2)
lines(alpha, variance, col='blue', lwd=2)
Shrinkage can be thought of as "constrained" or "penalized" minimization.
Constrained form: minimize $\sum_{i=1}^n (Y_i - \mu)^2$ subject to $\mu^2 \leq C$.
Penalized form: minimize $\sum_{i=1}^n (Y_i - \mu)^2 + \lambda \mu^2$, which is equivalent to the constrained form for some $\lambda=\lambda_C$.
**We see that $\widehat{\mu}_{\lambda} = \bar{Y} \cdot \left(\frac{n}{n+\lambda}\right).$**
**In other words, considering all penalized estimators traces out the MSE curve above.**
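As a quick numerical check (a small sketch; lambda = 50 is just an illustrative value), R's optimize function recovers $\bar{Y} \cdot n/(n+\lambda)$ as the minimizer of the penalized criterion:
# penalized criterion: sum((Z - m)^2) + lambda * m^2
Z = rnorm(nsample) * sigma + mu
lambda = 50                                        # illustrative penalty value
penalized = function(m) { sum((Z - m)^2) + lambda * m^2 }
optimize(penalized, interval=c(-10, 10))$minimum   # numerical minimizer
mean(Z) * nsample / (nsample + lambda)             # closed-form minimizer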
# alpha = n/(n+lambda)  <=>  lambda = n/alpha - n  (alpha = 0 corresponds to lambda = Inf)
lam = nsample / alpha - nsample
plot(lam, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
xlab=expression(paste('Penalty parameter,', lambda)),
ylab=expression(paste('MSE(', lambda, ')')))
lines(lam, bias^2, col='green', lwd=2)
lines(lam, variance, col='blue', lwd=2)
$$\begin{aligned}
MSE_{pop}(\alpha) &= \text{Var}( \alpha \bar{Y}) + \text{Bias}(\alpha \bar{Y})^2 + \text{Var}(Y_{new}) \\
&= \frac{\alpha^2 \sigma^2}{n} + \mu^2 (1 - \alpha)^2 + \text{Var}(Y_{new})
\end{aligned}$$
**We see that the optimal $\alpha$ depends on the unknown $SNR=\mu/(\sigma/\sqrt{n})$; a short derivation follows below, and for our values ($\mu=0.5$, $\sigma=5$, $n=100$) the optimum is $\alpha^*=1/2$.**
**In practice we might hope to estimate MSE with cross-validation.**
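Minimizing $MSE_{pop}(\alpha)$ over $\alpha$ (the term $\text{Var}(Y_{new})$ does not involve $\alpha$, so it can be ignored):
$$\frac{d}{d\alpha}\left[\frac{\alpha^2 \sigma^2}{n} + \mu^2 (1 - \alpha)^2\right] = \frac{2\alpha\sigma^2}{n} - 2\mu^2(1-\alpha) = 0
\quad \Longrightarrow \quad
\alpha^* = \frac{\mu^2}{\mu^2 + \sigma^2/n} = \frac{SNR^2}{1+SNR^2}.$$
This is the value marked by the vertical blue line in the plots below.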
Let's see how our theoretical choice matches the simulated MSE curve for samples of size 100.
plot(alpha, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
xlab=expression(paste('Shrinkage parameter ', alpha)),
ylab=expression(paste('MSE(', alpha, ')')))
abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)
If $\lambda=m$, an integer, then $\widehat{\mu}_{\lambda}$ is the sample mean of $(Y_1, \dots, Y_n, 0, \dots, 0) \in \mathbb{R}^{n+m}$.
This is equivalent to adding $m$ data points with $Y=0$.
To a Bayesian,
this extra data is a prior distribution and we are computing the so-called MAP or posterior mode.
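A one-line check of this equivalence (a small sketch; m = 25 is just an illustrative integer value of $\lambda$):
m = 25                                # lambda = m, an integer
Y = rnorm(nsample) * sigma + mu
mean(c(Y, rep(0, m)))                 # sample mean after appending m zeros
mean(Y) * nsample / (nsample + m)     # penalized estimator with lambda = m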
Model selection with AIC (or $C_p$) is a version of penalized regression: the best subsets AIC criterion trades off the residual sum of squares against the number of nonzero coefficients,
$$\hat{\beta}_{AIC} = \text{argmin}_{\beta} \frac{1}{\sigma^2} \|Y-X\beta\|^2_2 + 2 \|\beta\|_0,$$
where $$ \|\beta\|_0 = \#\left\{j : \beta_j \neq 0 \right\} $$ is called the $\ell_0$ norm.
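For illustration, here is a brute-force sketch of best subsets selection on a small simulated design (the data, the number of predictors, and the use of R's built-in AIC, which profiles out $\sigma^2$, are all just for the example):
# enumerate all subsets of p = 5 predictors, fit each by least squares,
# and keep the subset with the smallest AIC
set.seed(1)
n = 50; p = 5
X = matrix(rnorm(n * p), n, p)
Y = X[, 1] * 2 + rnorm(n)                  # only the first predictor matters
best_fit = lm(Y ~ 1)                       # start from the intercept-only model
best_aic = AIC(best_fit)
best_subset = integer(0)
for (size in 1:p) {
    for (subset in combn(p, size, simplify=FALSE)) {
        fit = lm(Y ~ X[, subset, drop=FALSE])
        if (AIC(fit) < best_aic) {
            best_aic = AIC(fit)
            best_subset = subset
        }
    }
}
best_subset                                # predictors chosen by best subsets AIC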
Not all biased models are better – we need a way to find "good" biased models.
Inference ($F$, $\chi^2$ tests, etc) is not quite exact for biased models.
Though, there has been some recent work to address the issue of post-selection inference, at least for some penalized regression problems.
Assume that the columns $(X_j)_{1 \leq j \leq p}$ have zero mean and standard deviation 1, and that $Y$ has zero mean.
This is called the standardized model.
The ridge estimator is
$$\hat{\beta}_{\lambda} = \text{argmin}_{\beta} \|Y-X\beta\|^2_2 + \lambda \|\beta\|^2_2.$$
Through a Lagrange multiplier argument, this corresponds to a quadratic constraint on the $\beta_j$'s.
This is the natural generalization of the penalized
version of our shrinkage estimator.
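As a computational sketch (assuming the standardized model; setting the gradient of the ridge criterion to zero gives the closed form $\hat{\beta}_{\lambda} = (X^TX + \lambda I)^{-1}X^TY$; lambda = 10 and the toy data are just for illustration):
# manual ridge fit on a standardized design: solve (X'X + lambda I) beta = X'Y
set.seed(2)
n = 100; p = 4
X = scale(matrix(rnorm(n * p), n, p))      # columns with mean 0, SD 1
Y = X %*% c(1, 0.5, 0, 0) + rnorm(n)
Y = Y - mean(Y)                            # center the response
lambda = 10
beta.ridge = solve(t(X) %*% X + lambda * diag(p), t(X) %*% Y)
beta.ridge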
library(lars)     # for the diabetes data
data(diabetes)
library(MASS)     # for lm.ridge
diabetes.ridge = lm.ridge(diabetes$y ~ diabetes$x,
                          lambda=seq(0, 100, 0.5))
plot(diabetes.ridge, lwd=3)    # ridge coefficient paths as a function of lambda
Here is a function to estimate the CV error for our one-parameter example. In practice, we only have one sample with which to form the CV curve. In the example below, I will compute the average CV error over the ntrial simulated samples to show that it is roughly comparable in shape to the MSE curve.
# K-fold cross-validation for the shrinkage estimator alpha * mean(Z):
# returns the CV estimate of prediction error and a rough standard error
CV = function(Z, alpha, K=5) {
    cve = numeric(K)
    n = length(Z)
    for (i in 1:K) {
        g = seq(as.integer((i-1)*n/K)+1, as.integer(i*n/K))   # indices of the i-th fold
        mu.hat = mean(Z[-g]) * alpha                          # estimate from the other K-1 folds
        cve[i] = sum((Z[g] - mu.hat)^2)                       # prediction error on fold i
    }
    return(c(sum(cve) / n, sd(cve) / sqrt(n)))
}
Let's see how the parameter chosen by 5-fold CV compares to our theoretical choice.
alpha = seq(0.0, 1, length=20)
avg.cv = numeric(length(alpha))
for (i in 1:ntrial) {
    Z = rnorm(nsample) * sigma + mu
    for (j in 1:length(alpha)) {
        current_cv = CV(Z, alpha[j])
        avg.cv[j] = avg.cv[j] + current_cv[1]   # accumulate the CV estimate of error
    }
}
avg.cv = avg.cv / ntrial                        # average CV curve over the trials
plot(alpha, avg.cv, type='l', lwd=2, col='green',
xlab='Shrinkage parameter, alpha', ylab='Average CV(alpha)')
abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)
abline(v=min(alpha[avg.cv == min(avg.cv)]), col='red', lty=2)
The curve above shows what would happen if we could repeat this and average over many samples. In reality, we only get one sample.
Let's see what one curve looks like on our sample. This is the result we might get in practice on a given data set.
# CV curve from a single sample of size 100
cv = numeric(length(alpha))
cv.sd = numeric(length(alpha))
Z = rnorm(nsample) * sigma + mu
for (j in 1:length(alpha)) {
    current_cv = CV(Z, alpha[j])
    cv[j] = current_cv[1]
    cv.sd[j] = current_cv[2]
}
plot(alpha, cv, type='l', lwd=2, col='green',
     xlab='Shrinkage parameter, alpha', ylab='CV(alpha)', xlim=c(0,1))
abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)   # theoretical optimum
abline(v=min(alpha[cv == min(cv)]), col='red', lty=2)      # CV-chosen value

# repeat with a larger sample (n = 1000)
cv = numeric(length(alpha))
cv.sd = numeric(length(alpha))
nsample = 1000
Z = rnorm(nsample) * sigma + mu
for (j in 1:length(alpha)) {
    current_cv = CV(Z, alpha[j])
    cv[j] = current_cv[1]
    cv.sd[j] = current_cv[2]
}
plot(alpha, cv, type='l', lwd=2, col='green',
     xlab='Shrinkage parameter, alpha', ylab='CV(alpha)', xlim=c(0,1))
abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)   # theoretical optimum
abline(v=min(alpha[cv == min(cv)]), col='red', lty=2)      # CV-chosen value
par(cex.lab=1.5)
plot(diabetes.ridge$lambda, diabetes.ridge$GCV, xlab='Lambda',
ylab='GCV', type='l', lwd=3, col='orange')
select(diabetes.ridge)    # prints lambda estimates, including the value minimizing GCV
The LASSO estimator is
$$\hat{\beta}_{\lambda} = \text{argmin}_{\beta} \frac{1}{2} \|Y-X\beta\|^2_2 + \lambda \|\beta\|_1,$$
where $$ \|\beta\|_1 = \sum_{j=1}^p |\beta_j| $$ is the $\ell_1$ norm.
In theory and practice, it works well when many $\beta_j$'s are 0 and, unlike ridge, it gives "sparse" solutions.
It is a (computable) approximation to the best subsets AIC model.
It is computable because the minimization problem is a convex problem. The plots below show the one-dimensional LASSO objective $\frac{(Z-\beta)^2}{2} + \lambda |\beta|$ for several values of $Z$, with $\lambda=3$.
Z = 2
lam = 3
beta = seq(-4, 4, length=801)
value = (beta - Z)^2/2 + lam * abs(beta)
plot(beta, value, type='l', lwd=2, col='red', main="Z=2, lam=3")
Z = 4
lam = 3
beta = seq(-4, 4, length=801)
value = (beta - Z)^2/2 + lam * abs(beta)
plot(beta, value, type='l', lwd=2, col='red', main="Z=4, lam=3")
Z = -4
lam = 3
beta = seq(-4, 4, length=801)
value = (beta - Z)^2/2 + lam * abs(beta)
plot(beta, value, type='l', lwd=2, col='red', main="Z=-4, lam=3")
Differentiating the one-dimensional objective for $\beta > 0$ gives $\beta - Z + \lambda = 0$, i.e. $\hat{\beta}_{\lambda}(Z) = Z-\lambda$, but only when $(Z-\lambda)>0$.
Similarly, when $(Z+\lambda) < 0$ we have $\hat{\beta}_{\lambda}(Z) = Z+\lambda$.
The only other case is $|Z| \leq \lambda$. In this region there are no zeros of the derivative, so the optimal point must be 0.
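Putting the three cases together gives the soft-thresholding rule. A small sketch checking it numerically against the objective (the function name soft.thresh and the test values are just for illustration):
# soft-thresholding: the solution of the one-dimensional LASSO problem
soft.thresh = function(Z, lam) { sign(Z) * pmax(abs(Z) - lam, 0) }
# compare with a numerical minimizer of (Z - beta)^2/2 + lam * abs(beta)
lam = 3
for (Z in c(2, 4, -4)) {
    numerical = optimize(function(b) (Z - b)^2/2 + lam * abs(b), c(-10, 10))$minimum
    print(c(Z=Z, soft.thresh=soft.thresh(Z, lam), numerical=numerical))
}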
library(lars)
data(diabetes)
diabetes.lasso = lars(diabetes$x, diabetes$y, type='lasso')
plot(diabetes.lasso)
The lars package has a built-in function to estimate the CV error.
par(cex.lab=1.5)
cv.lars(diabetes$x, diabetes$y, K=10, type='lasso')
glmnet
library(glmnet)
G = glmnet(diabetes$x, diabetes$y)
plot(G)
plot(cv.glmnet(diabetes$x, diabetes$y))
cv.glmnet(diabetes$x, diabetes$y)$lambda.1se
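The cv.glmnet object reports two conventional choices of $\lambda$: lambda.min, which minimizes the CV error, and lambda.1se, the largest $\lambda$ whose CV error is within one standard error of the minimum. A short sketch comparing the two on the diabetes data (the object name cv.diabetes is just illustrative):
cv.diabetes = cv.glmnet(diabetes$x, diabetes$y)
c(cv.diabetes$lambda.min, cv.diabetes$lambda.1se)           # the two conventional choices of lambda
b.min = as.numeric(coef(cv.diabetes, s='lambda.min'))[-1]   # coefficients, dropping the intercept
b.1se = as.numeric(coef(cv.diabetes, s='lambda.1se'))[-1]
c(sum(b.min != 0), sum(b.1se != 0))                         # number of nonzero coefficients at each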
library(glmnet)
G = glmnet(X_HIV, Y_HIV)
plot(G)
CV = cv.glmnet(X_HIV, Y_HIV)   # note: this overwrites the CV() helper function defined above
plot(CV)
glmnet
beta.hat = coef(G, s=CV$lambda.1se)
beta.hat # might want to use as.numeric(beta.hat) instead of a sparse vector
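To see which variables are selected at this value of $\lambda$ (a small follow-up sketch using the beta.hat just computed; the row names come from the coefficient matrix returned by coef):
selected = rownames(beta.hat)[as.numeric(beta.hat) != 0]
selected[selected != '(Intercept)']    # predictors with nonzero coefficients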
The elastic net (ENET) is a mix between the LASSO and ridge regression, and is sometimes a more stable estimator than the LASSO.
The ENET estimator, in the parametrization used by glmnet, is
$$\hat{\beta}_{\lambda, \alpha} = \text{argmin}_{\beta} \frac{1}{2n} \|Y-X\beta\|^2_2 + \lambda \left( \alpha \|\beta\|_1 + \frac{1-\alpha}{2} \|\beta\|^2_2 \right), \qquad 0 \leq \alpha \leq 1.$$
plot(glmnet(X_HIV, Y_HIV, alpha=0.25))   # a mix weighted toward the ridge penalty
plot(glmnet(X_HIV, Y_HIV, alpha=0))      # alpha=0 is pure ridge