Bootstrap and confidence interval

In previous lecture we learned how to use CLT to estimate confidence intervals for a sample mean. For example, when we assume that $\sigma$ of the populaiton in known (sample size > 30), 95% interval is $$\left[\bar{x}-1.96 \dfrac{\sigma}{\sqrt{n}},~\bar{x}+1.96 \dfrac{\sigma}{\sqrt{n}} \right]$$

For smalle sample sizes, we use quantiles of t-distribution with $n-1$ degrss of freedom (df), where $n$ is the sample size.

However, if we do not know the underlying distribution of the variable we would like to calculate confindecne interval for, we can use the empirical bootstrap.

Say you have sample of size $n$ $$ x_1,\ldots,x_n$$ drawn from an unknown distribution $F$. An emperical boostrap sample is a resample of the same size $n$: $$ x_1^*,\ldots,x_n^*$$

We can think of it as a sample from empirical distrbution defined by the data. Now, instead of using original data to calculate a statistic $s$. Instead of calculating $\hat{s}$ from original data, we can use the resampled data to calculate $s^*$, using the same formula. By calulating $s^*$ several times, we get an empirical sample from the distribution of $s$!

Example

Say we have the following sample

In [1]:
x = c(30, 37, 36, 43, 42, 43, 43, 46, 41, 42)
(x_hat = mean(x))
40.3
In [2]:
hist(x)

We are interested in $\bar{x} - \mu$, where $\mu$ is the true populaion mean. We approximate it by $$\delta^* = \bar{x}^* - \bar{x}$$ where $\bar{x}^*$ is the mean of an empirical bootstrap.

In [4]:
B = replicate(200,sample(x, length(x), replace = TRUE, prob = NULL))
hist(B, breaks=4)
In [35]:
(B = matrix(B, nrow=length(x)))
41424636434346433030⋯ 42424337414243434330
43433642434136424243⋯ 46433742364241414242
30364343423641463743⋯ 43433043374236424243
41424636364337434341⋯ 41364642304241373637
37303743434342433741⋯ 43434243433743434230
36304342434143424142⋯ 41303736433643434642
42424341364343364330⋯ 43414643434330434142
46414342373043434343⋯ 43433036434343433742
43413741373030434246⋯ 43413742363042434341
43304246304143424243⋯ 30414346364241433046
In [36]:
colMeans(B)
  1. 40.2
  2. 37.7
  3. 41.6
  4. 41.2
  5. 39
  6. 39.1
  7. 40.4
  8. 42.3
  9. 40
  10. 40.2
  11. 42.8
  12. 39
  13. 41.5
  14. 41.3
  15. 40.6
  16. 39.2
  17. 40.8
  18. 41.4
  19. 41
  20. 40.6
  21. 40.3
  22. 38.6
  23. 36.7
  24. 40.4
  25. 40.8
  26. 40.5
  27. 38.9
  28. 41.4
  29. 41.4
  30. 40.5
  31. 39.5
  32. 37.8
  33. 41
  34. 40.1
  35. 39.3
  36. 38.9
  37. 40.8
  38. 42
  39. 42.9
  40. 39.8
  41. 40.4
  42. 38.3
  43. 41.1
  44. 39.9
  45. 41.8
  46. 41.9
  47. 39.5
  48. 42.3
  49. 38.2
  50. 40.8
  51. 40.8
  52. 39
  53. 40.3
  54. 40.6
  55. 39.7
  56. 39
  57. 41.2
  58. 39.5
  59. 40.4
  60. 38.9
  61. 38.3
  62. 38.2
  63. 39.3
  64. 40.6
  65. 39.3
  66. 40.1
  67. 41.2
  68. 42.4
  69. 42.9
  70. 37.9
  71. 41.5
  72. 42.3
  73. 41.3
  74. 41.1
  75. 43.1
  76. 40.7
  77. 41.6
  78. 41.3
  79. 40.2
  80. 39.4
  81. 39.8
  82. 40.4
  83. 42
  84. 40.2
  85. 40.8
  86. 41.9
  87. 38.5
  88. 39
  89. 41.7
  90. 41.8
  91. 38.2
  92. 40.5
  93. 40.5
  94. 41.6
  95. 38.6
  96. 40.6
  97. 38.5
  98. 38.2
  99. 39.7
  100. 39.8
  101. 42.4
  102. 38.4
  103. 39.3
  104. 41.7
  105. 43
  106. 39
  107. 41.2
  108. 37.4
  109. 39.9
  110. 41.3
  111. 42.1
  112. 40.7
  113. 39
  114. 39.4
  115. 40.3
  116. 40.9
  117. 40.2
  118. 40.3
  119. 40.8
  120. 40.7
  121. 41.8
  122. 41.3
  123. 40.5
  124. 41.1
  125. 40.9
  126. 38.6
  127. 42.2
  128. 40.9
  129. 40.7
  130. 40.7
  131. 40.5
  132. 39
  133. 41.5
  134. 39.4
  135. 42.6
  136. 42.1
  137. 40.5
  138. 39.8
  139. 41
  140. 41
  141. 43.3
  142. 41.9
  143. 38.7
  144. 40.5
  145. 40.6
  146. 37.8
  147. 38.7
  148. 38.1
  149. 41.3
  150. 42.4
  151. 39.4
  152. 37
  153. 37.5
  154. 41.5
  155. 42.1
  156. 40
  157. 42.3
  158. 42.4
  159. 41.1
  160. 39.1
  161. 39.8
  162. 39
  163. 42.7
  164. 40.7
  165. 39.2
  166. 39.8
  167. 41.5
  168. 39.4
  169. 41.6
  170. 39.4
  171. 41.6
  172. 40.5
  173. 41.8
  174. 41
  175. 41
  176. 38
  177. 40.4
  178. 41.9
  179. 41.7
  180. 41.8
  181. 39.3
  182. 41.2
  183. 41.2
  184. 38.8
  185. 39.5
  186. 41.2
  187. 40.3
  188. 40.7
  189. 40.4
  190. 37.2
  191. 41.5
  192. 40.3
  193. 39.1
  194. 41
  195. 38.8
  196. 39.9
  197. 40.3
  198. 42.1
  199. 40.2
  200. 39.5
In [37]:
delta_star = colMeans(B) - x_hat
print(delta_star, digits=2)
mean(colMeans(B))
  [1] -0.1 -2.6  1.3  0.9 -1.3 -1.2  0.1  2.0 -0.3 -0.1  2.5 -1.3  1.2  1.0  0.3
 [16] -1.1  0.5  1.1  0.7  0.3  0.0 -1.7 -3.6  0.1  0.5  0.2 -1.4  1.1  1.1  0.2
 [31] -0.8 -2.5  0.7 -0.2 -1.0 -1.4  0.5  1.7  2.6 -0.5  0.1 -2.0  0.8 -0.4  1.5
 [46]  1.6 -0.8  2.0 -2.1  0.5  0.5 -1.3  0.0  0.3 -0.6 -1.3  0.9 -0.8  0.1 -1.4
 [61] -2.0 -2.1 -1.0  0.3 -1.0 -0.2  0.9  2.1  2.6 -2.4  1.2  2.0  1.0  0.8  2.8
 [76]  0.4  1.3  1.0 -0.1 -0.9 -0.5  0.1  1.7 -0.1  0.5  1.6 -1.8 -1.3  1.4  1.5
 [91] -2.1  0.2  0.2  1.3 -1.7  0.3 -1.8 -2.1 -0.6 -0.5  2.1 -1.9 -1.0  1.4  2.7
[106] -1.3  0.9 -2.9 -0.4  1.0  1.8  0.4 -1.3 -0.9  0.0  0.6 -0.1  0.0  0.5  0.4
[121]  1.5  1.0  0.2  0.8  0.6 -1.7  1.9  0.6  0.4  0.4  0.2 -1.3  1.2 -0.9  2.3
[136]  1.8  0.2 -0.5  0.7  0.7  3.0  1.6 -1.6  0.2  0.3 -2.5 -1.6 -2.2  1.0  2.1
[151] -0.9 -3.3 -2.8  1.2  1.8 -0.3  2.0  2.1  0.8 -1.2 -0.5 -1.3  2.4  0.4 -1.1
[166] -0.5  1.2 -0.9  1.3 -0.9  1.3  0.2  1.5  0.7  0.7 -2.3  0.1  1.6  1.4  1.5
[181] -1.0  0.9  0.9 -1.5 -0.8  0.9  0.0  0.4  0.1 -3.1  1.2  0.0 -1.2  0.7 -1.5
[196] -0.4  0.0  1.8 -0.1 -0.8
40.376
In [38]:
(d = quantile(delta_star, c(.05,.95))) # 90% confidence interval
5%
-2.205
95%
2.1
In [39]:
(ci = x_hat - c(d[2], d[1]))
95%
38.2
5%
42.505

Basic assimptions of bootstrap:

  • empirical distribution will be a good approximation of the true distribution

Note, that point estimate is not any more accurate then the one obtained from original data. Bootstrap is all about understanding uncertainty about the estimate (robustness)!

Another advantage of bootstrap, we can do any statistic not only the mean. Lets try the median!

In [5]:
library("matrixStats")
(x_hat = median(x))
colMedians(B)
delta_star = colMedians(B) - x_hat
(d = quantile(delta_star, c(.05,.95))) # 90% confidence interval
(ci = x_hat - c(d[2], d[1]))
42
  1. 42.5
  2. 43
  3. 41.5
  4. 42.5
  5. 41
  6. 43
  7. 43
  8. 42.5
  9. 41.5
  10. 41.5
  11. 41.5
  12. 42
  13. 37
  14. 42
  15. 42
  16. 41.5
  17. 42
  18. 42
  19. 42.5
  20. 42
  21. 38.5
  22. 42
  23. 37
  24. 42
  25. 42
  26. 42
  27. 37
  28. 43
  29. 41.5
  30. 42.5
  31. 42
  32. 41
  33. 41
  34. 37
  35. 43
  36. 41.5
  37. 37
  38. 39.5
  39. 37
  40. 42
  41. 41.5
  42. 41.5
  43. 42
  44. 42
  45. 41.5
  46. 42
  47. 42.5
  48. 37
  49. 39.5
  50. 42
  51. 41.5
  52. 41.5
  53. 42
  54. 39
  55. 42.5
  56. 41.5
  57. 42
  58. 42
  59. 41.5
  60. 39
  61. 42
  62. 41
  63. 41.5
  64. 41
  65. 42
  66. 43
  67. 41.5
  68. 39
  69. 41.5
  70. 42.5
  71. 42
  72. 43
  73. 42.5
  74. 43
  75. 39
  76. 41
  77. 41
  78. 39
  79. 42.5
  80. 42.5
  81. 43
  82. 42.5
  83. 42
  84. 42
  85. 39.5
  86. 39
  87. 42.5
  88. 42
  89. 42
  90. 42.5
  91. 42
  92. 43
  93. 41.5
  94. 39
  95. 42.5
  96. 42.5
  97. 42.5
  98. 41
  99. 41
  100. 42
  101. 42
  102. 42
  103. 41.5
  104. 41.5
  105. 43
  106. 41
  107. 33.5
  108. 42.5
  109. 43
  110. 42
  111. 42.5
  112. 43
  113. 42.5
  114. 42.5
  115. 42
  116. 42
  117. 42.5
  118. 42
  119. 42
  120. 37
  121. 42
  122. 39
  123. 41.5
  124. 42.5
  125. 43
  126. 42
  127. 42.5
  128. 41.5
  129. 43
  130. 42
  131. 36.5
  132. 42
  133. 42
  134. 43
  135. 42
  136. 41.5
  137. 42
  138. 42
  139. 37
  140. 42
  141. 43
  142. 42.5
  143. 42
  144. 42
  145. 42
  146. 43
  147. 41.5
  148. 42.5
  149. 39
  150. 42.5
  151. 37
  152. 42
  153. 41.5
  154. 41.5
  155. 41.5
  156. 42
  157. 42
  158. 37
  159. 41
  160. 42.5
  161. 42
  162. 39
  163. 41
  164. 42.5
  165. 42
  166. 42
  167. 43
  168. 41
  169. 42
  170. 41
  171. 42
  172. 42.5
  173. 41.5
  174. 42
  175. 42
  176. 42.5
  177. 43
  178. 42.5
  179. 42
  180. 42
  181. 39
  182. 39
  183. 33
  184. 42.5
  185. 41.5
  186. 43
  187. 43
  188. 41.5
  189. 39.5
  190. 42.5
  191. 42.5
  192. 37
  193. 42
  194. 42.5
  195. 39
  196. 42.5
  197. 39
  198. 42.5
  199. 42.5
  200. 42
5%
-5
95%
1
95%
41
5%
47

Example: Old Faithful Data

Old Faithful is a geyser in Yellowstone National Park in Wyoming: http://en.wikipedia.org/wiki/Old_Faithful. There is a publicly available data set which gives the durations of 272 consecutive eruptions. Here is a histogram of the data.

Question: Estimate the median length of an eruption and give a 90% confidence interval for the median.

In [6]:
# This code computes 3 bootstrap statistics for the old
# faithful data.
# 1. A 90% CI for the median
# 2. A 90% CI for the mean
# 3. P(|xbar - mu| > 5.
       
print('Old Faithful Data')
       
# Number of bootstrap samples to generate
nboot = 1000

#Make sure the working directory contains oldfaithful.txt
ofl_data = read.table('https://www.dropbox.com/s/kh0758t8h4vkb89/OldFaithful.txt?dl=1')[[1]];
n = length(ofl_data) 
hist(ofl_data, nclass=20, col=2)
[1] "Old Faithful Data"
In [7]:
#Sample median and mean
(oflMedian = median(ofl_data))
(oflMean = mean(ofl_data))
240
209.268382352941
In [8]:
# COMMENT: Could generate one sample at a time by doing everything in the for loop.

#Generate nboot empirical bootstrap trials
# each trial consists of n samples from the data
x = sample(ofl_data, n*nboot, replace=TRUE)
(bootdata = matrix(x,nrow=n, ncol=nboot))
245257255247120300258306132278⋯ 107124256268102296168221296118
210242278240261265143261267278⋯ 257200254290200131108207270184
296112235122144243210120109110⋯ 276306173237128121265296282112
285235293112141242270272250116⋯ 207270143288240173136226271263
105260115118214134276254223130⋯ 270124120104105111282262224294
126245262275132285265249250282⋯ 272210278226269210138245282144
105296110230112184276107174235⋯ 149230260240254119230132249105
233113119261282260132275274112⋯ 132262270102173265262244240265
291141245282121268302259296226⋯ 259269270244142121224137250275
105240238205268246261269230240⋯ 112282262225235255294261270256
199266142107291105296240113120⋯ 140184267272230258290267247118
249275249254235116249109258210⋯ 214249225110240262137289216110
108109288259277125199272255134⋯ 293113230130102230261231260265
125265270105200258270230112130⋯ 122271105214268306244111284260
260288288111112210282225245214⋯ 110254290112119257276265304246
238184105110133126110235245282⋯ 278116287283142230276275168115
120294221306109121105131121105⋯ 282122118142266230112265113282
128272255289282276125265254237⋯ 242144112108288276265272250118
184110277237108294288115282251⋯ 246112256265251113302262112231
28827524212610715812126728096 ⋯ 288229254270254112261231270273
141274286245280144113249259278⋯ 233268242126261121119249288112
289210245276109168286130144246⋯ 138268110205272243272145245266
288250105268225278275262288282⋯ 109278230112207252145225267173
284243115245130119105266131274⋯ 108278261272272288304249119255
242138284102108282110255142237⋯ 284104137251214270275128110149
238112137221260105275261260112⋯ 173230244233273112116129282246
259245254260270119112233244110⋯ 250273242270158276214235144125
216242144267254288265122270244⋯ 113245113131262113205113254230
121230134270113276255233112287⋯ 100257184110121184240280267112
216199254250120142249294288112⋯ 244255246225121158120260128294
⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋱⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮
126248293132280250210291235108⋯ 110131210119248226135105168245
244140110242255282244260134120⋯ 118243105280288107108250302270
108249149255216248261107265245⋯ 238246110226250240133168249257
254262141126120254108290250144⋯ 257105300271260294133262267261
108245231293261113102251270107⋯ 124105124272116270230184274230
280113116260288288132119136250⋯ 246122216255102270110116112272
231144306120145267245226265111⋯ 126168304121245118261112256260
260238105275278130110254288274⋯ 279107293107258271261277104113
138272257140102258258112110149⋯ 270274144216110240283266149145
143270288113184255226267157288⋯ 130143119235120250248245216238
133270126306111250136288111240⋯ 158237137245112126278276262137
121238250238115255105125216293⋯ 266278251304262126173144275258
112306288275240224302282249130⋯ 11124512025615827228896 96 240
230130249210102254141205240282⋯ 250129242233288262230118121111
108293120270275132247138238237⋯ 238274244289257216119248110113
282269249226254104112261282242⋯ 110100116243265245111255288288
276226238283259275135268133132⋯ 258112107270216120276277259113
231254283282240255108113286108⋯ 116121230119119257233116276290
132286111137122255230266240216⋯ 235277108240270265242235231260
110116270111231205113233282280⋯ 130264235126262296276256243226
246230120275290246296250304235⋯ 112134174294272111118282275115
200266105250261226116113245115⋯ 267257202282105112248250254285
248214132116149257282235105116⋯ 262288107256128205282280294113
105110304134267231261264250288⋯ 265262173248300144132122230263
113205291259269132272275140112⋯ 255285109290296108168260267210
255226112126109272260168158214⋯ 272274271109275260235287140140
270262116121240120105278262233⋯ 274255110100230231302132221288
200230205267109242113105266240⋯ 250246282105254267265275230261
244210288157255130112275247128⋯ 112258110111249214265116268240
221133119105282284282260270257⋯ 255216270255240288109255276288
In [9]:
#Compute the medians and means of the bootstrap trials
bootMedians = rep(0,nboot)
bootMeans = rep(0,nboot)
for (j in 1:nboot)
{
  bootMedians[j] = median(bootdata[,j])
  bootMeans[j] = mean(bootdata[,j])  
}
hist(bootMedians, nclass=20)       
hist(bootMeans, nclass=20)
In [10]:
# Compute the .05 and .95 quantiles for the differences
# Quantile has many methods of interpolating to compute the quantiles.
# E.g for median, quantil(1:4, .5) = 2.5. We just go with the default method
d_median05 = quantile(bootMedians - oflMedian, .05);
d_median95 = quantile(bootMedians - oflMedian, .95);

d_mean05 = quantile(bootMeans - oflMean, .05);
d_mean95 = quantile(bootMeans - oflMean, .95);

# Compute the 90% bootstrap confidence intervals
CI_median = oflMedian - c(d_median95, d_median05)
CI_mean = oflMean - c(d_mean95, d_mean05)
       
# Compute the fraction of the bootstrap samples where 
#the  |xbar_star - xbar|  > 5 
probDiff5 = sum(abs(bootMeans - oflMean) > 5)/nboot 
     
s = sprintf("Mean = %.3f, median = %.3f", oflMean, oflMedian)
print(s)
s = sprintf("CI_median: [%.3f, %.3f]", CI_median[1], CI_median[2])
print(s)
s = sprintf("CI_mean: [%.3f, %.3f]", CI_mean[1], CI_mean[2])
print(s)
s = sprintf("P(|xbar_star - xbaar| > 5) = %.3f", probDiff5)
print(s)
[1] "Mean = 209.268, median = 240.000"
[1] "CI_median: [234.500, 250.000]"
[1] "CI_mean: [202.572, 216.310]"
[1] "P(|xbar_star - xbaar| > 5) = 0.231"
In [12]:
qqnorm(rnorm(3000))