In previous lecture we learned how to use CLT to estimate confidence intervals for a sample mean. For example, when we assume that $\sigma$ of the populaiton in known (sample size > 30), 95% interval is $$\left[\bar{x}-1.96 \dfrac{\sigma}{\sqrt{n}},~\bar{x}+1.96 \dfrac{\sigma}{\sqrt{n}} \right]$$
For smalle sample sizes, we use quantiles of t-distribution with $n-1$ degrss of freedom (df), where $n$ is the sample size.
However, if we do not know the underlying distribution of the variable we would like to calculate confindecne interval for, we can use the empirical bootstrap.
Say you have sample of size $n$ $$ x_1,\ldots,x_n$$ drawn from an unknown distribution $F$. An emperical boostrap sample is a resample of the same size $n$: $$ x_1^*,\ldots,x_n^*$$
We can think of it as a sample from empirical distrbution defined by the data. Now, instead of using original data to calculate a statistic $s$. Instead of calculating $\hat{s}$ from original data, we can use the resampled data to calculate $s^*$, using the same formula. By calulating $s^*$ several times, we get an empirical sample from the distribution of $s$!
Say we have the following sample
x = c(30, 37, 36, 43, 42, 43, 43, 46, 41, 42)
(x_hat = mean(x))
hist(x)
We are interested in $\bar{x} - \mu$, where $\mu$ is the true populaion mean. We approximate it by $$\delta^* = \bar{x}^* - \bar{x}$$ where $\bar{x}^*$ is the mean of an empirical bootstrap.
B = replicate(200,sample(x, length(x), replace = TRUE, prob = NULL))
hist(B, breaks=4)
(B = matrix(B, nrow=length(x)))
colMeans(B)
delta_star = colMeans(B) - x_hat
print(delta_star, digits=2)
mean(colMeans(B))
(d = quantile(delta_star, c(.05,.95))) # 90% confidence interval
(ci = x_hat - c(d[2], d[1]))
Basic assimptions of bootstrap:
Note, that point estimate is not any more accurate then the one obtained from original data. Bootstrap is all about understanding uncertainty about the estimate (robustness)!
Another advantage of bootstrap, we can do any statistic not only the mean. Lets try the median!
library("matrixStats")
(x_hat = median(x))
colMedians(B)
delta_star = colMedians(B) - x_hat
(d = quantile(delta_star, c(.05,.95))) # 90% confidence interval
(ci = x_hat - c(d[2], d[1]))
Old Faithful is a geyser in Yellowstone National Park in Wyoming: http://en.wikipedia.org/wiki/Old_Faithful. There is a publicly available data set which gives the durations of 272 consecutive eruptions. Here is a histogram of the data.
Question: Estimate the median length of an eruption and give a 90% confidence interval for the median.
# This code computes 3 bootstrap statistics for the old
# faithful data.
# 1. A 90% CI for the median
# 2. A 90% CI for the mean
# 3. P(|xbar - mu| > 5.
print('Old Faithful Data')
# Number of bootstrap samples to generate
nboot = 1000
#Make sure the working directory contains oldfaithful.txt
ofl_data = read.table('https://www.dropbox.com/s/kh0758t8h4vkb89/OldFaithful.txt?dl=1')[[1]];
n = length(ofl_data)
hist(ofl_data, nclass=20, col=2)
#Sample median and mean
(oflMedian = median(ofl_data))
(oflMean = mean(ofl_data))
# COMMENT: Could generate one sample at a time by doing everything in the for loop.
#Generate nboot empirical bootstrap trials
# each trial consists of n samples from the data
x = sample(ofl_data, n*nboot, replace=TRUE)
(bootdata = matrix(x,nrow=n, ncol=nboot))
#Compute the medians and means of the bootstrap trials
bootMedians = rep(0,nboot)
bootMeans = rep(0,nboot)
for (j in 1:nboot)
{
bootMedians[j] = median(bootdata[,j])
bootMeans[j] = mean(bootdata[,j])
}
hist(bootMedians, nclass=20)
hist(bootMeans, nclass=20)
# Compute the .05 and .95 quantiles for the differences
# Quantile has many methods of interpolating to compute the quantiles.
# E.g for median, quantil(1:4, .5) = 2.5. We just go with the default method
d_median05 = quantile(bootMedians - oflMedian, .05);
d_median95 = quantile(bootMedians - oflMedian, .95);
d_mean05 = quantile(bootMeans - oflMean, .05);
d_mean95 = quantile(bootMeans - oflMean, .95);
# Compute the 90% bootstrap confidence intervals
CI_median = oflMedian - c(d_median95, d_median05)
CI_mean = oflMean - c(d_mean95, d_mean05)
# Compute the fraction of the bootstrap samples where
#the |xbar_star - xbar| > 5
probDiff5 = sum(abs(bootMeans - oflMean) > 5)/nboot
s = sprintf("Mean = %.3f, median = %.3f", oflMean, oflMedian)
print(s)
s = sprintf("CI_median: [%.3f, %.3f]", CI_median[1], CI_median[2])
print(s)
s = sprintf("CI_mean: [%.3f, %.3f]", CI_mean[1], CI_mean[2])
print(s)
s = sprintf("P(|xbar_star - xbaar| > 5) = %.3f", probDiff5)
print(s)
qqnorm(rnorm(3000))