Bootstrap and confidence interval

In previous lecture we learned how to use CLT to estimate confidence intervals for a sample mean. For example, when we assume that $\sigma$ of the populaiton in known (sample size > 30), 95% interval is $$\left[\bar{x}-1.96 \dfrac{\sigma}{\sqrt{n}},~\bar{x}+1.96 \dfrac{\sigma}{\sqrt{n}} \right]$$

For smalle sample sizes, we use quantiles of t-distribution with $n-1$ degrss of freedom (df), where $n$ is the sample size.

However, if we do not know the underlying distribution of the variable we would like to calculate confindecne interval for, we can use the empirical bootstrap.

Say you have sample of size $n$ $$ x_1,\ldots,x_n$$ drawn from an unknown distribution $F$. An emperical boostrap sample is a resample of the same size $n$: $$ x_1^*,\ldots,x_n^*$$

We can think of it as a sample from empirical distrbution defined by the data. Now, instead of using original data to calculate a statistic $s$. Instead of calculating $\hat{s}$ from original data, we can use the resampled data to calculate $s^*$, using the same formula. By calulating $s^*$ several times, we get an empirical sample from the distribution of $s$!

Example

Say we have the following sample

In [1]:
x = c(30, 37, 36, 43, 42, 43, 43, 46, 41, 42)
(x_hat = mean(x))
40.3
In [2]:
hist(x)

We are interested in $\bar{x} - \mu$, where $\mu$ is the true populaion mean. We approximate it by $$\delta^* = \bar{x}^* - \bar{x}$$ where $\bar{x}^*$ is the mean of an empirical bootstrap.

In [4]:
B = replicate(200,sample(x, length(x), replace = TRUE, prob = NULL))
hist(B, breaks=4)
In [35]:
(B = matrix(B, nrow=length(x)))
4142463643434643303042424337414243434330
4343364243413642424346433742364241414242
3036434342364146374343433043374236424243
4142463636433743434141364642304241373637
3730374343434243374143434243433743434230
3630434243414342414241303736433643434642
4242434136434336433043414643434330434142
4641434237304343434343433036434343433742
4341374137303043424643413742363042434341
4330424630414342424330414346364241433046
In [36]:
colMeans(B)
  1. 40.2
  2. 37.7
  3. 41.6
  4. 41.2
  5. 39
  6. 39.1
  7. 40.4
  8. 42.3
  9. 40
  10. 40.2
  11. 42.8
  12. 39
  13. 41.5
  14. 41.3
  15. 40.6
  16. 39.2
  17. 40.8
  18. 41.4
  19. 41
  20. 40.6
  21. 40.3
  22. 38.6
  23. 36.7
  24. 40.4
  25. 40.8
  26. 40.5
  27. 38.9
  28. 41.4
  29. 41.4
  30. 40.5
  31. 39.5
  32. 37.8
  33. 41
  34. 40.1
  35. 39.3
  36. 38.9
  37. 40.8
  38. 42
  39. 42.9
  40. 39.8
  41. 40.4
  42. 38.3
  43. 41.1
  44. 39.9
  45. 41.8
  46. 41.9
  47. 39.5
  48. 42.3
  49. 38.2
  50. 40.8
  51. 40.8
  52. 39
  53. 40.3
  54. 40.6
  55. 39.7
  56. 39
  57. 41.2
  58. 39.5
  59. 40.4
  60. 38.9
  61. 38.3
  62. 38.2
  63. 39.3
  64. 40.6
  65. 39.3
  66. 40.1
  67. 41.2
  68. 42.4
  69. 42.9
  70. 37.9
  71. 41.5
  72. 42.3
  73. 41.3
  74. 41.1
  75. 43.1
  76. 40.7
  77. 41.6
  78. 41.3
  79. 40.2
  80. 39.4
  81. 39.8
  82. 40.4
  83. 42
  84. 40.2
  85. 40.8
  86. 41.9
  87. 38.5
  88. 39
  89. 41.7
  90. 41.8
  91. 38.2
  92. 40.5
  93. 40.5
  94. 41.6
  95. 38.6
  96. 40.6
  97. 38.5
  98. 38.2
  99. 39.7
  100. 39.8
  101. 42.4
  102. 38.4
  103. 39.3
  104. 41.7
  105. 43
  106. 39
  107. 41.2
  108. 37.4
  109. 39.9
  110. 41.3
  111. 42.1
  112. 40.7
  113. 39
  114. 39.4
  115. 40.3
  116. 40.9
  117. 40.2
  118. 40.3
  119. 40.8
  120. 40.7
  121. 41.8
  122. 41.3
  123. 40.5
  124. 41.1
  125. 40.9
  126. 38.6
  127. 42.2
  128. 40.9
  129. 40.7
  130. 40.7
  131. 40.5
  132. 39
  133. 41.5
  134. 39.4
  135. 42.6
  136. 42.1
  137. 40.5
  138. 39.8
  139. 41
  140. 41
  141. 43.3
  142. 41.9
  143. 38.7
  144. 40.5
  145. 40.6
  146. 37.8
  147. 38.7
  148. 38.1
  149. 41.3
  150. 42.4
  151. 39.4
  152. 37
  153. 37.5
  154. 41.5
  155. 42.1
  156. 40
  157. 42.3
  158. 42.4
  159. 41.1
  160. 39.1
  161. 39.8
  162. 39
  163. 42.7
  164. 40.7
  165. 39.2
  166. 39.8
  167. 41.5
  168. 39.4
  169. 41.6
  170. 39.4
  171. 41.6
  172. 40.5
  173. 41.8
  174. 41
  175. 41
  176. 38
  177. 40.4
  178. 41.9
  179. 41.7
  180. 41.8
  181. 39.3
  182. 41.2
  183. 41.2
  184. 38.8
  185. 39.5
  186. 41.2
  187. 40.3
  188. 40.7
  189. 40.4
  190. 37.2
  191. 41.5
  192. 40.3
  193. 39.1
  194. 41
  195. 38.8
  196. 39.9
  197. 40.3
  198. 42.1
  199. 40.2
  200. 39.5
In [37]:
delta_star = colMeans(B) - x_hat
print(delta_star, digits=2)
mean(colMeans(B))
  [1] -0.1 -2.6  1.3  0.9 -1.3 -1.2  0.1  2.0 -0.3 -0.1  2.5 -1.3  1.2  1.0  0.3
 [16] -1.1  0.5  1.1  0.7  0.3  0.0 -1.7 -3.6  0.1  0.5  0.2 -1.4  1.1  1.1  0.2
 [31] -0.8 -2.5  0.7 -0.2 -1.0 -1.4  0.5  1.7  2.6 -0.5  0.1 -2.0  0.8 -0.4  1.5
 [46]  1.6 -0.8  2.0 -2.1  0.5  0.5 -1.3  0.0  0.3 -0.6 -1.3  0.9 -0.8  0.1 -1.4
 [61] -2.0 -2.1 -1.0  0.3 -1.0 -0.2  0.9  2.1  2.6 -2.4  1.2  2.0  1.0  0.8  2.8
 [76]  0.4  1.3  1.0 -0.1 -0.9 -0.5  0.1  1.7 -0.1  0.5  1.6 -1.8 -1.3  1.4  1.5
 [91] -2.1  0.2  0.2  1.3 -1.7  0.3 -1.8 -2.1 -0.6 -0.5  2.1 -1.9 -1.0  1.4  2.7
[106] -1.3  0.9 -2.9 -0.4  1.0  1.8  0.4 -1.3 -0.9  0.0  0.6 -0.1  0.0  0.5  0.4
[121]  1.5  1.0  0.2  0.8  0.6 -1.7  1.9  0.6  0.4  0.4  0.2 -1.3  1.2 -0.9  2.3
[136]  1.8  0.2 -0.5  0.7  0.7  3.0  1.6 -1.6  0.2  0.3 -2.5 -1.6 -2.2  1.0  2.1
[151] -0.9 -3.3 -2.8  1.2  1.8 -0.3  2.0  2.1  0.8 -1.2 -0.5 -1.3  2.4  0.4 -1.1
[166] -0.5  1.2 -0.9  1.3 -0.9  1.3  0.2  1.5  0.7  0.7 -2.3  0.1  1.6  1.4  1.5
[181] -1.0  0.9  0.9 -1.5 -0.8  0.9  0.0  0.4  0.1 -3.1  1.2  0.0 -1.2  0.7 -1.5
[196] -0.4  0.0  1.8 -0.1 -0.8
40.376
In [38]:
(d = quantile(delta_star, c(.05,.95))) # 90% confidence interval
5%
-2.205
95%
2.1
In [39]:
(ci = x_hat - c(d[2], d[1]))
95%
38.2
5%
42.505

Basic assimptions of bootstrap:

  • empirical distribution will be a good approximation of the true distribution

Note, that point estimate is not any more accurate then the one obtained from original data. Bootstrap is all about understanding uncertainty about the estimate (robustness)!

Another advantage of bootstrap, we can do any statistic not only the mean. Lets try the median!

In [5]:
library("matrixStats")
(x_hat = median(x))
colMedians(B)
delta_star = colMedians(B) - x_hat
(d = quantile(delta_star, c(.05,.95))) # 90% confidence interval
(ci = x_hat - c(d[2], d[1]))
42
  1. 42.5
  2. 43
  3. 41.5
  4. 42.5
  5. 41
  6. 43
  7. 43
  8. 42.5
  9. 41.5
  10. 41.5
  11. 41.5
  12. 42
  13. 37
  14. 42
  15. 42
  16. 41.5
  17. 42
  18. 42
  19. 42.5
  20. 42
  21. 38.5
  22. 42
  23. 37
  24. 42
  25. 42
  26. 42
  27. 37
  28. 43
  29. 41.5
  30. 42.5
  31. 42
  32. 41
  33. 41
  34. 37
  35. 43
  36. 41.5
  37. 37
  38. 39.5
  39. 37
  40. 42
  41. 41.5
  42. 41.5
  43. 42
  44. 42
  45. 41.5
  46. 42
  47. 42.5
  48. 37
  49. 39.5
  50. 42
  51. 41.5
  52. 41.5
  53. 42
  54. 39
  55. 42.5
  56. 41.5
  57. 42
  58. 42
  59. 41.5
  60. 39
  61. 42
  62. 41
  63. 41.5
  64. 41
  65. 42
  66. 43
  67. 41.5
  68. 39
  69. 41.5
  70. 42.5
  71. 42
  72. 43
  73. 42.5
  74. 43
  75. 39
  76. 41
  77. 41
  78. 39
  79. 42.5
  80. 42.5
  81. 43
  82. 42.5
  83. 42
  84. 42
  85. 39.5
  86. 39
  87. 42.5
  88. 42
  89. 42
  90. 42.5
  91. 42
  92. 43
  93. 41.5
  94. 39
  95. 42.5
  96. 42.5
  97. 42.5
  98. 41
  99. 41
  100. 42
  101. 42
  102. 42
  103. 41.5
  104. 41.5
  105. 43
  106. 41
  107. 33.5
  108. 42.5
  109. 43
  110. 42
  111. 42.5
  112. 43
  113. 42.5
  114. 42.5
  115. 42
  116. 42
  117. 42.5
  118. 42
  119. 42
  120. 37
  121. 42
  122. 39
  123. 41.5
  124. 42.5
  125. 43
  126. 42
  127. 42.5
  128. 41.5
  129. 43
  130. 42
  131. 36.5
  132. 42
  133. 42
  134. 43
  135. 42
  136. 41.5
  137. 42
  138. 42
  139. 37
  140. 42
  141. 43
  142. 42.5
  143. 42
  144. 42
  145. 42
  146. 43
  147. 41.5
  148. 42.5
  149. 39
  150. 42.5
  151. 37
  152. 42
  153. 41.5
  154. 41.5
  155. 41.5
  156. 42
  157. 42
  158. 37
  159. 41
  160. 42.5
  161. 42
  162. 39
  163. 41
  164. 42.5
  165. 42
  166. 42
  167. 43
  168. 41
  169. 42
  170. 41
  171. 42
  172. 42.5
  173. 41.5
  174. 42
  175. 42
  176. 42.5
  177. 43
  178. 42.5
  179. 42
  180. 42
  181. 39
  182. 39
  183. 33
  184. 42.5
  185. 41.5
  186. 43
  187. 43
  188. 41.5
  189. 39.5
  190. 42.5
  191. 42.5
  192. 37
  193. 42
  194. 42.5
  195. 39
  196. 42.5
  197. 39
  198. 42.5
  199. 42.5
  200. 42
5%
-5
95%
1
95%
41
5%
47

Example: Old Faithful Data

Old Faithful is a geyser in Yellowstone National Park in Wyoming: http://en.wikipedia.org/wiki/Old_Faithful. There is a publicly available data set which gives the durations of 272 consecutive eruptions. Here is a histogram of the data.

Question: Estimate the median length of an eruption and give a 90% confidence interval for the median.

In [6]:
# This code computes 3 bootstrap statistics for the old
# faithful data.
# 1. A 90% CI for the median
# 2. A 90% CI for the mean
# 3. P(|xbar - mu| > 5.
       
print('Old Faithful Data')
       
# Number of bootstrap samples to generate
nboot = 1000

#Make sure the working directory contains oldfaithful.txt
ofl_data = read.table('https://www.dropbox.com/s/kh0758t8h4vkb89/OldFaithful.txt?dl=1')[[1]];
n = length(ofl_data) 
hist(ofl_data, nclass=20, col=2)
[1] "Old Faithful Data"
In [7]:
#Sample median and mean
(oflMedian = median(ofl_data))
(oflMean = mean(ofl_data))
240
209.268382352941
In [8]:
# COMMENT: Could generate one sample at a time by doing everything in the for loop.

#Generate nboot empirical bootstrap trials
# each trial consists of n samples from the data
x = sample(ofl_data, n*nboot, replace=TRUE)
(bootdata = matrix(x,nrow=n, ncol=nboot))
245257255247120300258306132278107124256268102296168221296118
210242278240261265143261267278257200254290200131108207270184
296112235122144243210120109110276306173237128121265296282112
285235293112141242270272250116207270143288240173136226271263
105260115118214134276254223130270124120104105111282262224294
126245262275132285265249250282272210278226269210138245282144
105296110230112184276107174235149230260240254119230132249105
233113119261282260132275274112132262270102173265262244240265
291141245282121268302259296226259269270244142121224137250275
105240238205268246261269230240112282262225235255294261270256
199266142107291105296240113120140184267272230258290267247118
249275249254235116249109258210214249225110240262137289216110
108109288259277125199272255134293113230130102230261231260265
125265270105200258270230112130122271105214268306244111284260
260288288111112210282225245214110254290112119257276265304246
238184105110133126110235245282278116287283142230276275168115
120294221306109121105131121105282122118142266230112265113282
128272255289282276125265254237242144112108288276265272250118
184110277237108294288115282251246112256265251113302262112231
28827524212610715812126728096 288229254270254112261231270273
141274286245280144113249259278233268242126261121119249288112
289210245276109168286130144246138268110205272243272145245266
288250105268225278275262288282109278230112207252145225267173
284243115245130119105266131274108278261272272288304249119255
242138284102108282110255142237284104137251214270275128110149
238112137221260105275261260112173230244233273112116129282246
259245254260270119112233244110250273242270158276214235144125
216242144267254288265122270244113245113131262113205113254230
121230134270113276255233112287100257184110121184240280267112
216199254250120142249294288112244255246225121158120260128294
126248293132280250210291235108110131210119248226135105168245
244140110242255282244260134120118243105280288107108250302270
108249149255216248261107265245238246110226250240133168249257
254262141126120254108290250144257105300271260294133262267261
108245231293261113102251270107124105124272116270230184274230
280113116260288288132119136250246122216255102270110116112272
231144306120145267245226265111126168304121245118261112256260
260238105275278130110254288274279107293107258271261277104113
138272257140102258258112110149270274144216110240283266149145
143270288113184255226267157288130143119235120250248245216238
133270126306111250136288111240158237137245112126278276262137
121238250238115255105125216293266278251304262126173144275258
11230628827524022430228224913011124512025615827228896 96 240
230130249210102254141205240282250129242233288262230118121111
108293120270275132247138238237238274244289257216119248110113
282269249226254104112261282242110100116243265245111255288288
276226238283259275135268133132258112107270216120276277259113
231254283282240255108113286108116121230119119257233116276290
132286111137122255230266240216235277108240270265242235231260
110116270111231205113233282280130264235126262296276256243226
246230120275290246296250304235112134174294272111118282275115
200266105250261226116113245115267257202282105112248250254285
248214132116149257282235105116262288107256128205282280294113
105110304134267231261264250288265262173248300144132122230263
113205291259269132272275140112255285109290296108168260267210
255226112126109272260168158214272274271109275260235287140140
270262116121240120105278262233274255110100230231302132221288
200230205267109242113105266240250246282105254267265275230261
244210288157255130112275247128112258110111249214265116268240
221133119105282284282260270257255216270255240288109255276288
In [9]:
#Compute the medians and means of the bootstrap trials
bootMedians = rep(0,nboot)
bootMeans = rep(0,nboot)
for (j in 1:nboot)
{
  bootMedians[j] = median(bootdata[,j])
  bootMeans[j] = mean(bootdata[,j])  
}
hist(bootMedians, nclass=20)       
hist(bootMeans, nclass=20)
In [10]:
# Compute the .05 and .95 quantiles for the differences
# Quantile has many methods of interpolating to compute the quantiles.
# E.g for median, quantil(1:4, .5) = 2.5. We just go with the default method
d_median05 = quantile(bootMedians - oflMedian, .05);
d_median95 = quantile(bootMedians - oflMedian, .95);

d_mean05 = quantile(bootMeans - oflMean, .05);
d_mean95 = quantile(bootMeans - oflMean, .95);

# Compute the 90% bootstrap confidence intervals
CI_median = oflMedian - c(d_median95, d_median05)
CI_mean = oflMean - c(d_mean95, d_mean05)
       
# Compute the fraction of the bootstrap samples where 
#the  |xbar_star - xbar|  > 5 
probDiff5 = sum(abs(bootMeans - oflMean) > 5)/nboot 
     
s = sprintf("Mean = %.3f, median = %.3f", oflMean, oflMedian)
print(s)
s = sprintf("CI_median: [%.3f, %.3f]", CI_median[1], CI_median[2])
print(s)
s = sprintf("CI_mean: [%.3f, %.3f]", CI_mean[1], CI_mean[2])
print(s)
s = sprintf("P(|xbar_star - xbaar| > 5) = %.3f", probDiff5)
print(s)
[1] "Mean = 209.268, median = 240.000"
[1] "CI_median: [234.500, 250.000]"
[1] "CI_mean: [202.572, 216.310]"
[1] "P(|xbar_star - xbaar| > 5) = 0.231"
In [12]:
qqnorm(rnorm(3000))
In [ ]: