We've talked about correcting our regression estimator in two contexts: WLS (weighted least squares) and GLS (generalized least squares).
Both require a model of the errors for the correction.
In both cases, we use a two-stage procedure to "whiten" the data and then apply the OLS model to the "whitened" data.
What if we don't have a model for the errors?
We will use the bootstrap!
Specifically, the pairs bootstrap: we resample the observations, i.e. both the features and the outcome together.
This is a departure from our OLS model (or our WLS / GLS models for the errors), where we have essentially treated $X$ as fixed.
In our usual model, $\beta$ is clearly defined. What is $\beta$ without this assumption?
We can take $\beta$ to be the population least squares coefficient $$ \beta(F) = \text{argmin}_{\beta} E_F\left[(\pmb{Y} - \pmb{X}^T\beta)^2\right], $$ where $(\pmb{X}, \pmb{Y}) \sim F$, leading to $$ \beta(F) = \left(E_F[\pmb{X}\pmb{X}^T]\right)^{-1} E_F[\pmb{X} \cdot \pmb{Y}]. $$
The natural estimate is $\beta(\hat{F}_n)$, where $\hat{F}_n$ is the empirical distribution of our sample of $n$ observations from $F$.
Under regularity conditions, $\beta(\hat{F}_n)$ is consistent for $\beta(F)$, and $$ n^{1/2}(\beta(\hat{F}_n) - \beta(F)) \to N(0, \Sigma(F)) $$ for some covariance matrix $\Sigma=\Sigma(F)$ depending only on $F$.
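Because expectations under $\hat{F}_n$ are just sample averages, the plug-in estimate $\beta(\hat{F}_n)$ is exactly the OLS estimator. A quick numerical check (the simulated data here is purely illustrative):

```r
set.seed(1)
n <- 100
X <- cbind(1, rexp(n))                 # design matrix with intercept
Y <- 3 + 2.5 * X[, 2] + rnorm(n)

# plug-in: (E_Fhat[X X^T])^{-1} E_Fhat[X Y] = (X^T X / n)^{-1} (X^T Y / n)
beta_plugin <- solve(crossprod(X) / n, crossprod(X, Y) / n)

# agrees with OLS to numerical precision
beta_ols <- coef(lm(Y ~ X[, 2]))
max(abs(beta_plugin - beta_ols))
```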
Rather than whitening with some weight matrix $W$, where $W$ might come from some model, in this setting we will use the OLS estimate -- but correct its variance!
Can we get our hands on $\text{Var}(X^TY)$ or $\text{Var}(\hat{\beta})$ without a model?
There are many variants of the bootstrap, most using roughly this structure:
# assumes X, Y (with n observations) and a number of replicates B are defined
boot_sample = c()
for (b in 1:B) {
    # resample the cases (rows) with replacement
    idx_star = sample(1:n, n, replace=TRUE)
    X_star = X[idx_star,]
    Y_star = Y[idx_star]
    boot_sample = rbind(boot_sample, coef(lm(Y_star ~ X_star)))
}
# bootstrap estimate of Var(beta_hat)
cov_beta_boot = cov(boot_sample)
cov_beta_boot
This estimated covariance can be used to estimate $\text{Var}(a^T\hat{\beta})$ for confidence intervals or general linear hypothesis tests.
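As a sketch of how that might look, here is a normal-theory interval for $a^T\hat{\beta}$ built from the bootstrap covariance (the simulated data, the contrast $a$, and the number of replicates are all illustrative choices):

```r
set.seed(2)
n <- 50
B <- 500
X <- rexp(n)
Y <- 3 + 2.5 * X + X * (rexp(n) - 1)

# pairs bootstrap of the coefficients
boot_sample <- c()
for (b in 1:B) {
    idx_star <- sample(1:n, n, replace = TRUE)
    boot_sample <- rbind(boot_sample, coef(lm(Y[idx_star] ~ X[idx_star])))
}
cov_beta_boot <- cov(boot_sample)

# interval for a^T beta, e.g. the fitted mean at x = 1
a <- c(1, 1)
est <- sum(a * coef(lm(Y ~ X)))
SE <- sqrt(drop(t(a) %*% cov_beta_boot %*% a))
ci <- c(est - qnorm(0.975) * SE, est + qnorm(0.975) * SE)
ci
```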
library(car)
n = 50
X = rexp(n)
Y = 3 + 2.5 * X + X * (rexp(n) - 1) # our usual model is false here! W=X^{-2}
Y.lm = lm(Y ~ X)
confint(Y.lm)
pairs.Y.lm = Boot(Y.lm, coef, method='case', R=1000)
confint(pairs.Y.lm, type='norm') # using bootstrap SE
confint(pairs.Y.lm, type='perc') # using percentiles
The boot package

The Boot function in car is a wrapper around the more general boot function. Here is an example using boot.
library(boot)
D = data.frame(X, Y)
bootstrap_stat = function(D, bootstrap_idx) {
    # refit the model on the resampled cases and return the coefficients
    return(summary(lm(Y ~ X, data=D[bootstrap_idx,]))$coef[,1])
}
boot_results = boot(D, bootstrap_stat, R=500)
confint(boot_results, type='norm') # using bootstrap SE
confint(boot_results, type='perc') # using percentiles
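The boot package also provides its own interval function, boot.ci, which does not require car. A sketch, with simulated data mirroring the example above:

```r
library(boot)

set.seed(3)  # seed and data are illustrative
n <- 50
X <- rexp(n)
Y <- 3 + 2.5 * X + X * (rexp(n) - 1)
D <- data.frame(X, Y)

# statistic: refit on the resampled cases, return coefficients
bootstrap_stat <- function(D, idx) {
    coef(lm(Y ~ X, data = D[idx, ]))
}
boot_results <- boot(D, bootstrap_stat, R = 500)

# normal and percentile intervals for the slope (index = 2)
boot.ci(boot_results, type = c("norm", "perc"), index = 2)
```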
noise = function(n) { return(rexp(n) - 1) }
simulate_correct = function(n=100, b=0.5) {
    X = rexp(n)
    Y = 3 + b * X + noise(n)
    Y.lm = lm(Y ~ X)
    # parametric interval for the slope
    int_param = confint(Y.lm)[2,]
    # pairs bootstrap interval
    pairs.Y.lm = Boot(Y.lm, coef, method='case', R=1000)
    pairs_SE = sqrt(cov(pairs.Y.lm$t)[2,2]) # $t is the bootstrap sample
    int_pairs = c(coef(Y.lm)[2] - qnorm(0.975) * pairs_SE,
                  coef(Y.lm)[2] + qnorm(0.975) * pairs_SE)
    # did each interval cover the true slope b?
    result = c((int_param[1] < b) * (int_param[2] > b),
               (int_pairs[1] < b) * (int_pairs[2] > b))
    names(result) = c('parametric', 'bootstrap')
    return(result)
}
simulate_correct()
nsim = 100
coverage = c()
for (i in 1:nsim) {
    coverage = rbind(coverage, simulate_correct())
}
print(apply(coverage, 2, mean))
simulate_incorrect = function(n=100, b=0.5) {
    X = rexp(n)
    # the usual model is quite off here -- Var(X^TY) is not well
    # approximated by sigma^2 * X^TX...
    Y = 3 + b * X + X * noise(n)
    Y.lm = lm(Y ~ X)
    # parametric interval for the slope
    int_param = confint(Y.lm)[2,]
    # pairs bootstrap interval
    pairs.Y.lm = Boot(Y.lm, coef, method='case', R=1000)
    # $t is the bootstrap sample of coefficients; we want the 2nd coefficient
    pairs_SE = sqrt(cov(pairs.Y.lm$t)[2,2])
    int_pairs = c(coef(Y.lm)[2] - qnorm(0.975) * pairs_SE,
                  coef(Y.lm)[2] + qnorm(0.975) * pairs_SE)
    # did each interval cover the true slope b?
    result = c((int_param[1] < b) * (int_param[2] > b),
               (int_pairs[1] < b) * (int_pairs[2] > b))
    names(result) = c('parametric', 'bootstrap')
    return(result)
}
simulate_incorrect()
nsim = 100
coverage = c()
for (i in 1:nsim) {
    coverage = rbind(coverage, simulate_incorrect())
}
print(apply(coverage, 2, mean))