# Bootstrap and confidence intervals

In the previous lecture we learned how to use the CLT to estimate confidence intervals for a sample mean. For example, when the population $\sigma$ is known (or the sample size is > 30), the 95% interval is $$\left[\bar{x}-1.96 \dfrac{\sigma}{\sqrt{n}},~\bar{x}+1.96 \dfrac{\sigma}{\sqrt{n}} \right]$$
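As a quick sketch, the normal-based interval can be computed directly; the data and the known $\sigma = 4$ below are illustrative assumptions, not part of the lecture:

```r
# Normal-based 95% CI for the mean, assuming sigma is known
# (the data and sigma = 4 are illustrative assumptions)
x = c(30, 37, 36, 43, 42, 43, 43, 46, 41, 42)
sigma = 4
n = length(x)
(ci = mean(x) + c(-1, 1) * 1.96 * sigma / sqrt(n))
```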

For small sample sizes, we use quantiles of the t-distribution with $n-1$ degrees of freedom (df), where $n$ is the sample size.
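For instance (again with an illustrative sample), the t-based interval uses `qt` for the critical value; the result matches `t.test`:

```r
# t-based 95% CI for the mean of a small sample (sigma unknown)
x = c(30, 37, 36, 43, 42, 43, 43, 46, 41, 42)   # illustrative data
n = length(x)
t_crit = qt(0.975, df = n - 1)                  # two-sided 95% critical value
(ci = mean(x) + c(-1, 1) * t_crit * sd(x) / sqrt(n))
```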

However, if we do not know the underlying distribution of the variable we would like to compute a confidence interval for, we can use the empirical bootstrap.

Say you have a sample of size $n$, $$x_1,\ldots,x_n,$$ drawn from an unknown distribution $F$. An empirical bootstrap sample is a resample of the same size $n$: $$x_1^*,\ldots,x_n^*$$

We can think of it as a sample from the empirical distribution defined by the data. Instead of calculating the statistic $\hat{s}$ from the original data, we can use the resampled data to calculate $s^*$, using the same formula. By calculating $s^*$ many times, we obtain an empirical sample from the distribution of $s$!
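The whole procedure fits in one line of R: resample with replacement and recompute the statistic on each resample (the mean here; any statistic works). The data and the choice of 1000 resamples are illustrative:

```r
# Empirical bootstrap sketch: resample with replacement, recompute the statistic
set.seed(1)                                      # for reproducibility
x = c(30, 37, 36, 43, 42, 43, 43, 46, 41, 42)    # illustrative data
s_star = replicate(1000, mean(sample(x, length(x), replace = TRUE)))
hist(s_star)                                     # empirical distribution of s*
```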

## Example

Say we have the following sample

In [1]:
x = c(30, 37, 36, 43, 42, 43, 43, 46, 41, 42)
(x_hat = mean(x))

40.3
In [2]:
hist(x)


We are interested in $\bar{x} - \mu$, where $\mu$ is the true population mean. We approximate it by $$\delta^* = \bar{x}^* - \bar{x},$$ where $\bar{x}^*$ is the mean of an empirical bootstrap sample.

In [4]:
B = replicate(200, sample(x, length(x), replace = TRUE))  # 200 bootstrap resamples
hist(B, breaks=4)

In [35]:
(B = matrix(B, nrow=length(x)))

 41 42 46 36 43 43 46 43 30 30 ⋯  (a 10 × 200 matrix; each column is one bootstrap resample of x; output truncated)
In [36]:
colMeans(B)

1. 40.2
2. 37.7
3. 41.6
⋯
200. 39.5

(200 bootstrap means; output truncated)
In [37]:
delta_star = colMeans(B) - x_hat
print(delta_star, digits=2)
mean(colMeans(B))

  [1] -0.1 -2.6  1.3  0.9 -1.3 -1.2  0.1  2.0 -0.3 -0.1  2.5 -1.3  1.2  1.0  0.3
[16] -1.1  0.5  1.1  0.7  0.3  0.0 -1.7 -3.6  0.1  0.5  0.2 -1.4  1.1  1.1  0.2
[31] -0.8 -2.5  0.7 -0.2 -1.0 -1.4  0.5  1.7  2.6 -0.5  0.1 -2.0  0.8 -0.4  1.5
[46]  1.6 -0.8  2.0 -2.1  0.5  0.5 -1.3  0.0  0.3 -0.6 -1.3  0.9 -0.8  0.1 -1.4
[61] -2.0 -2.1 -1.0  0.3 -1.0 -0.2  0.9  2.1  2.6 -2.4  1.2  2.0  1.0  0.8  2.8
[76]  0.4  1.3  1.0 -0.1 -0.9 -0.5  0.1  1.7 -0.1  0.5  1.6 -1.8 -1.3  1.4  1.5
[91] -2.1  0.2  0.2  1.3 -1.7  0.3 -1.8 -2.1 -0.6 -0.5  2.1 -1.9 -1.0  1.4  2.7
[106] -1.3  0.9 -2.9 -0.4  1.0  1.8  0.4 -1.3 -0.9  0.0  0.6 -0.1  0.0  0.5  0.4
[121]  1.5  1.0  0.2  0.8  0.6 -1.7  1.9  0.6  0.4  0.4  0.2 -1.3  1.2 -0.9  2.3
[136]  1.8  0.2 -0.5  0.7  0.7  3.0  1.6 -1.6  0.2  0.3 -2.5 -1.6 -2.2  1.0  2.1
[151] -0.9 -3.3 -2.8  1.2  1.8 -0.3  2.0  2.1  0.8 -1.2 -0.5 -1.3  2.4  0.4 -1.1
[166] -0.5  1.2 -0.9  1.3 -0.9  1.3  0.2  1.5  0.7  0.7 -2.3  0.1  1.6  1.4  1.5
[181] -1.0  0.9  0.9 -1.5 -0.8  0.9  0.0  0.4  0.1 -3.1  1.2  0.0 -1.2  0.7 -1.5
[196] -0.4  0.0  1.8 -0.1 -0.8

40.376
In [38]:
(d = quantile(delta_star, c(.05,.95))) # 90% confidence interval

5%
-2.205
95%
2.1
In [39]:
(ci = x_hat - c(d[2], d[1]))

95%
38.2
5%
42.505

Basic assumptions of the bootstrap:

• the empirical distribution is a good approximation of the true distribution

Note that the point estimate is no more accurate than the one obtained from the original data. The bootstrap is all about quantifying the uncertainty of the estimate (its robustness)!

Another advantage of the bootstrap is that it works for any statistic, not only the mean. Let's try the median!

In [5]:
library("matrixStats")
(x_hat = median(x))
colMedians(B)
delta_star = colMedians(B) - x_hat
(d = quantile(delta_star, c(.05,.95))) # 90% confidence interval
(ci = x_hat - c(d[2], d[1]))

42
1. 42.5
2. 43
3. 41.5
⋯
200. 42

(200 bootstrap medians; output truncated)
5%
-5
95%
1
95%
41
5%
47

# Example: Old Faithful Data

Old Faithful is a geyser in Yellowstone National Park in Wyoming: http://en.wikipedia.org/wiki/Old_Faithful. There is a publicly available data set which gives the durations of 272 consecutive eruptions. Here is a histogram of the data.

Question: Estimate the median length of an eruption and give a 90% confidence interval for the median.

In [6]:
# This code computes 3 bootstrap statistics for the Old
# Faithful data:
# 1. A 90% CI for the median
# 2. A 90% CI for the mean
# 3. P(|xbar_star - xbar| > 5)

print('Old Faithful Data')

# Number of bootstrap samples to generate
nboot = 1000

# Make sure the working directory contains oldfaithful.txt
ofl_data = scan("oldfaithful.txt")  # read the durations (assumes one number per line)
n = length(ofl_data)
hist(ofl_data, nclass=20, col=2)

[1] "Old Faithful Data"

In [7]:
#Sample median and mean
(oflMedian = median(ofl_data))
(oflMean = mean(ofl_data))

240
209.268382352941
In [8]:
# COMMENT: Could generate one sample at a time by doing everything in the for loop.

#Generate nboot empirical bootstrap trials
# each trial consists of n samples from the data
x = sample(ofl_data, n*nboot, replace=TRUE)
(bootdata = matrix(x,nrow=n, ncol=nboot))

 245 257 255 247 120 300 258 306 132 278 ⋯  (a 272 × 1000 matrix; each column is one bootstrap resample of ofl_data; output truncated)
In [9]:
#Compute the medians and means of the bootstrap trials
bootMedians = rep(0,nboot)
bootMeans = rep(0,nboot)
for (j in 1:nboot)
{
bootMedians[j] = median(bootdata[,j])
bootMeans[j] = mean(bootdata[,j])
}
hist(bootMedians, nclass=20)
hist(bootMeans, nclass=20)

In [10]:
# Compute the .05 and .95 quantiles of the differences.
# quantile() has several methods of interpolating to compute quantiles;
# e.g. for the median, quantile(1:4, .5) = 2.5. We just use the default method.
d_median05 = quantile(bootMedians - oflMedian, .05);
d_median95 = quantile(bootMedians - oflMedian, .95);

d_mean05 = quantile(bootMeans - oflMean, .05);
d_mean95 = quantile(bootMeans - oflMean, .95);

# Compute the 90% bootstrap confidence intervals
CI_median = oflMedian - c(d_median95, d_median05)
CI_mean = oflMean - c(d_mean95, d_mean05)

# Compute the fraction of the bootstrap samples where
# |xbar_star - xbar| > 5
probDiff5 = sum(abs(bootMeans - oflMean) > 5)/nboot

s = sprintf("Mean = %.3f, median = %.3f", oflMean, oflMedian)
print(s)
s = sprintf("CI_median: [%.3f, %.3f]", CI_median[1], CI_median[2])
print(s)
s = sprintf("CI_mean: [%.3f, %.3f]", CI_mean[1], CI_mean[2])
print(s)
s = sprintf("P(|xbar_star - xbar| > 5) = %.3f", probDiff5)
print(s)

[1] "Mean = 209.268, median = 240.000"
[1] "CI_median: [234.500, 250.000]"
[1] "CI_mean: [202.572, 216.310]"
[1] "P(|xbar_star - xbar| > 5) = 0.231"

In [12]:
# Sanity check: a normal Q-Q plot of standard-normal draws lies close to a straight line
qqnorm(rnorm(3000))