Bootstrapping

Parametric inference can be non-robust:

  • inaccurate if parametric assumptions are violated
  • if we rely on asymptotic results, we may not achieve an acceptable level of accuracy in finite samples

Parametric inference can be difficult:

  • derivation of sampling distribution may not be possible

An alternative is to estimate the sampling distribution of a statistic empirically without making assumptions about the form of the population.

The bootstrap is a simulation tool for assessing accuracy. It is related to cross-validation, which we will encounter later in the course when we talk about tuning machine learning models.

This approach is most commonly used as a non-parametric method for calculating standard errors and confidence intervals.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('notebook')

np.random.seed(42)

The Jackknife

A simple precursor to bootstrapping is the jackknife (Quenouille 1949), which facilitates the estimation of the bias and variance of an estimator. Recall that the bias of an estimator $\widehat{\theta}$ of $\theta$ is:

$$Bias(\widehat{\theta}) = E(\widehat{\theta}) - \theta$$

Consider calculating an estimate $\widehat{\theta}_{(-i)}$ using the $n-1$ values that remain after removing the $i^{th}$ observation from the dataset. If this is repeated for each $i=1,\ldots,n$, we can average the results to obtain:

$$\bar{\theta}_{(n)} = n^{-1} \sum_i \widehat{\theta}_{(-i)}$$

The jackknife bias estimate is:

$$b_{jack} = (n-1)(\bar{\theta}_{(n)} - \widehat{\theta})$$

It can be shown that $b_{jack}$ estimates the bias up to order $O(n^{-2})$.

Thus, a bias-corrected estimator is:

$$\widehat{\theta}_{jack} = \widehat{\theta} - b_{jack}$$
In [2]:
def jackknife_bias(data, func, **kwargs):
    """Jackknife estimate of the bias of func applied to data."""
    theta_hat = func(data, **kwargs)
    n = data.shape[0]
    idx = np.arange(n)
    # Average of the n leave-one-out estimates
    theta_jack = sum(func(data[idx!=i], **kwargs) for i in range(n))/n
    return (n-1) * (theta_jack - theta_hat)
In [3]:
x = np.random.normal(0, 2, 100)
In [4]:
np.std(x)
Out[4]:
1.8072323532892591
In [5]:
np.std(x) - jackknife_bias(x, np.std)
Out[5]:
1.82071996604174

Another way of expressing $\widehat{\theta}_{jack}$ is:

$$\widehat{\theta}_{jack} = n^{-1} \sum_i \widetilde{\theta}_i$$

where $\widetilde{\theta}_i$ are known as pseudovalues and are defined as:

$$\widetilde{\theta}_i = n \widehat{\theta} - (n-1) \widehat{\theta}_{(-i)}$$
In [6]:
def jackknife(data, func, **kwargs):
    """Bias-corrected jackknife estimate, computed as the mean of the pseudovalues."""
    theta_hat = func(data, **kwargs)
    n = data.shape[0]
    idx = np.arange(n)
    return sum(n*theta_hat - (n-1)*func(data[idx!=i], **kwargs) for i in range(n))/n
In [7]:
jackknife(x, np.std)
Out[7]:
1.8207199660417088

Correspondingly, the jackknife variance estimate of the estimator $\widehat{\theta}$ is:

$$v_{jack} = \frac{\widetilde{s}^2}{n}$$

where $\widetilde{s}^2$ is the sample variance of the pseudo-values:

$$\widetilde{s}^2 = \frac{\sum_i (\widetilde{\theta}_i - n^{-1}\sum_i \widetilde{\theta}_i)^2}{n-1}$$

Under certain regularity conditions, $v_{jack}$ is a consistent estimator of $V(\widehat{\theta})$. It can be inconsistent for some statistics, such as the median, that are not smooth.

Exercise

Write a function that implements a jackknife variance estimator.

In [8]:
# write your answer here
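
One possible solution sketch (not the only approach), reusing the pseudovalue construction from the `jackknife` function above:

def jackknife_variance(data, func, **kwargs):
    # Jackknife variance estimate: sample variance of the pseudovalues, divided by n
    theta_hat = func(data, **kwargs)
    n = data.shape[0]
    idx = np.arange(n)
    pseudovalues = np.array([n*theta_hat - (n-1)*func(data[idx!=i], **kwargs)
                             for i in range(n)])
    return pseudovalues.var(ddof=1) / n

jackknife_variance(x, np.std)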

The Bootstrap

The bootstrap is a resampling method introduced by Brad Efron (1979) that allows one to approximate the sampling distribution of a statistic empirically, and thereby obtain estimates of quantities such as its mean and variance.

In general, consider a statistic $T_n = g(X_1, \ldots, X_n)$, which has variance $V_F(T_n)$. Note this implies that the variance is a function of the underlying $F$, which is unknown.

One approach is to estimate $V_F(T_n)$ with $V_{\widehat{F}_n}(T_n)$, the plug-in estimator of the variance. But $V_{\widehat{F}_n}(T_n)$ may be difficult to compute analytically, so we can approximate it with a simulation estimate $V_{boot}$.

Here is the algorithm:

  1. For each $r$ in $R$ iterations: draw $X_{r1}^*, \ldots, X_{rn}^* \sim \widehat{F}_n$
  2. Calculate the statistic for each bootstrap sample: $T^*_{rn} = g(X^*_{r1}, \ldots, X^*_{rn})$
  3. Calculate the variance: $V_{boot} = \frac{1}{R} \sum_r \left(T^*_{rn} - \bar{T}^*_{\cdot n} \right)^2$

Rather than finding a way of drawing $n$ points at random from $\widehat{F}_n$, we can instead draw a sample of size $n$ with replacement from the original data. So, step 1 can be modified to:

  1. For each $r$: draw $X_{r1}^*, \ldots, X_{rn}^*$ with replacement from $X_{1}, \ldots, X_{n}$

These are called bootstrap samples:

$$S_1^* = \{X_{11}^*, X_{12}^*, \ldots, X_{1n}^*\}$$

We regard the sample $S$ as an "estimate" of the population $P$:

population : sample :: sample : bootstrap sample
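
As a minimal sketch of the algorithm above (using the simulated `x` from earlier, the sample mean as the statistic $g$, and NumPy's np.random.choice, introduced below, to resample with replacement):

R = 1000
n = len(x)
# Steps 1-2: resample with replacement and compute the statistic on each bootstrap sample
T_star = np.array([np.mean(np.random.choice(x, size=n, replace=True)) for _ in range(R)])
# Step 3: bootstrap estimate of the variance of the sample mean
V_boot = np.mean((T_star - T_star.mean())**2)
V_boot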

To generate samples with replacement in Python, there are a handful of approaches. We can use NumPy to generate random integers (np.random.randint), and use these to index DataFrame rows with iloc.

In [9]:
prostate_data = pd.read_table('../data/prostate.data.txt', index_col=0)
prostate_data.head()
Out[9]:
lcavol lweight age lbph svi lcp gleason pgg45 lpsa train
1 -0.579818 2.769459 50 -1.386294 0 -1.386294 6 0 -0.430783 T
2 -0.994252 3.319626 58 -1.386294 0 -1.386294 6 0 -0.162519 T
3 -0.510826 2.691243 74 -1.386294 0 -1.386294 7 20 -0.162519 T
4 -1.203973 3.282789 58 -1.386294 0 -1.386294 6 0 -0.162519 T
5 0.751416 3.432373 62 -1.386294 0 -1.386294 6 0 0.371564 T
In [10]:
random_ind = np.random.randint(0, prostate_data.shape[0], 5)
prostate_data.iloc[random_ind]
Out[10]:
lcavol lweight age lbph svi lcp gleason pgg45 lpsa train
9 -0.776529 3.539509 47 -1.386294 0 -1.386294 6 0 1.047319 F
62 1.997418 3.719651 63 1.619388 1 1.909542 7 40 2.853592 F
37 1.423108 3.657131 73 -0.579819 0 1.658228 8 15 2.157559 T
97 3.471966 3.974998 68 0.438255 1 2.904165 7 20 5.582932 F
51 1.091923 3.993603 68 -1.386294 0 -1.386294 7 50 2.656757 T

NumPy's random.choice function makes this even easier, and allows for custom sampling probabilities when non-uniform sampling is desired.

In [11]:
np.random.choice(prostate_data.index, 5)
Out[11]:
array([44, 24, 79, 59, 32])
In [12]:
prostate_data.loc[np.random.choice(prostate_data.index, 5)]
Out[12]:
lcavol lweight age lbph svi lcp gleason pgg45 lpsa train
96 2.882564 3.773910 68 1.558145 1 1.558145 7 80 5.477509 T
88 1.731656 3.369018 62 -1.386294 1 0.300105 7 30 3.712352 T
52 1.660131 4.234831 64 2.073172 0 -1.386294 6 0 2.677591 T
62 1.997418 3.719651 63 1.619388 1 1.909542 7 40 2.853592 F
58 0.463734 3.764682 49 1.423108 0 -1.386294 6 0 2.794228 T

Exercise

Generate 20 samples with replacement from the prostate dataset, weighting samples inversely by age.

In [13]:
# Write your answer here
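
One possible approach (a sketch, not the only solution), using np.random.choice with normalized inverse-age probabilities:

p = 1 / prostate_data.age
p = p / p.sum()                          # sampling probabilities must sum to 1
sampled = np.random.choice(prostate_data.index, size=20, replace=True, p=p.values)
prostate_data.loc[sampled]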

As of pandas version 0.16.1, resampling data from a DataFrame became even easier with the addition of the sample method.

In [14]:
prostate_data.sample(20, replace=True)
Out[14]:
lcavol lweight age lbph svi lcp gleason pgg45 lpsa train
52 1.660131 4.234831 64 2.073172 0 -1.386294 6 0 2.677591 T
12 -1.347074 3.598681 63 1.266948 0 -1.386294 6 0 1.266948 T
39 2.660959 4.085136 68 1.373716 1 1.832581 7 35 2.213754 T
2 -0.994252 3.319626 58 -1.386294 0 -1.386294 6 0 -0.162519 T
3 -0.510826 2.691243 74 -1.386294 0 -1.386294 7 20 -0.162519 T
56 1.266948 4.280132 66 2.122262 0 -1.386294 7 15 2.718001 T
81 1.467874 3.070376 66 0.559616 0 0.223144 7 40 3.516013 T
59 0.542324 4.178226 70 0.438255 0 -1.386294 7 20 2.806386 T
2 -0.994252 3.319626 58 -1.386294 0 -1.386294 6 0 -0.162519 T
2 -0.994252 3.319626 58 -1.386294 0 -1.386294 6 0 -0.162519 T
92 2.532903 3.677566 61 1.348073 1 -1.386294 7 15 4.129551 T
54 2.127041 4.121473 68 1.766442 0 1.446919 7 40 2.691243 F
87 2.024193 3.731699 58 1.638997 0 -1.386294 6 0 3.680091 T
96 2.882564 3.773910 68 1.558145 1 1.558145 7 80 5.477509 T
97 3.471966 3.974998 68 0.438255 1 2.904165 7 20 5.582932 F
1 -0.579818 2.769459 50 -1.386294 0 -1.386294 6 0 -0.430783 T
19 -0.562119 3.267666 41 -1.386294 0 -1.386294 6 0 1.558145 T
2 -0.994252 3.319626 58 -1.386294 0 -1.386294 6 0 -0.162519 T
53 0.512824 3.633631 64 1.492904 0 0.048790 7 70 2.684440 F
44 1.771557 3.896909 61 -1.386294 0 0.810930 7 6 2.374906 F

Bootstrap Estimates

From our bootstrapped samples, we can extract estimates of the expectation and its variance:

$$\bar{T}^* = \hat{E}(T^*) = \frac{\sum_r T_r^*}{R}$$

$$\hat{\text{Var}}(T^*) = \frac{\sum_r (T_r^* - \bar{T}^*)^2}{R-1}$$

Under appropriate regularity conditions, $\frac{s^2}{\hat{\text{Var}}(T^*)} \rightarrow 1$ as $n \rightarrow \infty$.
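
For example (a minimal sketch, bootstrapping the median of the simulated `x`):

reps = np.array([np.median(np.random.choice(x, size=len(x), replace=True))
                 for _ in range(1000)])
reps.mean(), reps.var(ddof=1)            # E_hat(T*) and Var_hat(T*), dividing by R-1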

Bootstrap Confidence Intervals

There are a handful of ways for constructing confidence intervals from bootstrap samples, varying in ease of calculation and accuracy.

Bootstrap Normal Interval

Perhaps the simplest bootstrap interval is the normal interval:

$$T_n \pm z_{\alpha/2} \widehat{SE}_{boot}$$

where $\widehat{SE}_{boot}$ is an estimate of the standard error using the bootstrap sample. Of course, this interval is not accurate unless the distribution of $T_n$ is close to Gaussian.
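
A minimal sketch of the normal interval for the median of `x` (assuming scipy is available for the Gaussian quantile):

from scipy.stats import norm

T_n = np.median(x)
reps = np.array([np.median(np.random.choice(x, size=len(x), replace=True))
                 for _ in range(1000)])
se_boot = reps.std(ddof=1)
z = norm.ppf(1 - 0.05/2)                 # z_{alpha/2} for a 95% interval
(T_n - z*se_boot, T_n + z*se_boot)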

An improvement on the normal interval is the pivotal interval. Let $\theta = T(F)$ and $\widehat{\theta}_n = T(\widehat{F}_n)$, and further define the pivot $P_n = \widehat{\theta}_n - \theta$.

Let $H(p)$ denote the CDF of the pivot:

$$H(p) = Pr_F(P_n \le p)$$

Now define the interval $C_n = (a, b)$ where:

$$\begin{aligned} a &= \widehat{\theta}_n - H^{-1}(1-\alpha/2) \\ b &= \widehat{\theta}_n - H^{-1}(\alpha/2). \end{aligned}$$

It can be shown that $C_n$ is an exact $1 - \alpha$ confidence interval for $\theta$. But clearly, $a$ and $b$ depend on the unknown $H$.

A pivot confidence interval uses a bootstrap estimate of $H$ as an approximation:

$$\widehat{H}(p) = \frac{1}{R} \sum_r I\left[(\widehat{\theta}^*_{nr} - \widehat{\theta}_{n}) \le p\right]$$

From this, an approximate $1-\alpha$ confidence interval is $C_n = (\widehat{a}, \widehat{b})$:

$$\begin{aligned} \widehat{a} &= \widehat{\theta}_n - \widehat{H}^{-1}(1-\alpha/2) = 2\widehat{\theta}_n - \theta^*_{1 - \alpha/2} \\ \widehat{b} &= \widehat{\theta}_n - \widehat{H}^{-1}(\alpha/2) = 2\widehat{\theta}_n - \theta^*_{\alpha/2}. \end{aligned}$$

Let's calculate the pivot interval for the standard deviation of the log-weights in the prostate dataset.

In [15]:
weights = prostate_data.lweight.copy()
n = weights.shape[0]
weights.std(ddof=0)
Out[15]:
0.42619720436063674
In [16]:
R = 1000
samples = np.array([weights.sample(n, replace=True) for r in range(R)])
samples.shape
Out[16]:
(1000, 97)
In [17]:
theta_hat = weights.std(ddof=0)
In [18]:
estimates = samples.std(axis=1)
In [19]:
2*theta_hat - np.percentile(estimates, [97.5, 2.5])
Out[19]:
array([0.36307114, 0.49457513])

An alternative approach for calculating a pivotal interval is to define:

$$Z_n = \frac{T_n - \theta}{\widehat{SE}_{boot}}$$

which can be approximated with:

$$Z^*_{nr} = \frac{T^*_{nr} - T_n}{\widehat{SE}^*_{boot}}$$

where $\widehat{SE}^*_{boot}$ is an estimate of the standard error of $T^*_{nr}$, computed within the $r^{th}$ bootstrap sample (typically via a nested bootstrap).

Here, the quantiles of $Z^*_{n1}, \ldots, Z^*_{nR}$ should approximate the true quantiles of the distribution of $Z_n$. We can then calculate the following interval:

$$C_n = (T_n - z^*_{1-\alpha/2}\widehat{SE}_{boot}, T_n - z^*_{\alpha/2}\widehat{SE}_{boot})$$

where $z^*_{a}$ is the $a$ sample quantile of $Z^*_{n1}, \ldots, Z^*_{nR}$.

This is a studentized pivotal interval.

This interval is more computationally intensive because $\widehat{SE}^*_{boot}$ has to be calculated for each bootstrap sample, but it is more accurate than the non-studentized interval.

Exercise

Write a function to calculate studentized pivotal intervals for arbitrary functions.

In [20]:
# Write your answer here
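
One possible solution sketch (not the only approach), using a nested bootstrap to estimate the standard error of each replicate:

def studentized_interval(data, func, R=1000, R_inner=50, alpha=0.05):
    data = np.asarray(data)
    n = len(data)
    T_n = func(data)

    T_star = np.empty(R)
    Z_star = np.empty(R)
    for r in range(R):
        boot = np.random.choice(data, size=n, replace=True)
        T_star[r] = func(boot)
        # Inner bootstrap to estimate the standard error of this replicate
        inner = [func(np.random.choice(boot, size=n, replace=True)) for _ in range(R_inner)]
        Z_star[r] = (T_star[r] - T_n) / np.std(inner, ddof=1)

    se_boot = T_star.std(ddof=1)
    z_lo, z_hi = np.percentile(Z_star, [100*alpha/2, 100*(1 - alpha/2)])
    return T_n - z_hi*se_boot, T_n - z_lo*se_boot

studentized_interval(prostate_data.lweight.values, np.std, R=200)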

Bootstrap Percentile Intervals

An even simpler interval involves using the empirical quantiles of the bootstrapped statistics. Consider a monotone transformation $U = m(T)$ such that $U \sim N(m(\theta), c^2)$. Importantly, we don't need to know what $m$ is, only that it exists. If we let $U^*_r = m(T^*_r)$, then $U^*_{(R\alpha/2)} = m(T^*_{(R\alpha/2)})$ because the monotone transformation preserves the quantiles.

Since $U \sim N(m(\theta), c^2)$, the $\alpha/2$ quantile of $U$ is $m(\theta) - z_{\alpha/2} c$. From this, it can be shown that:

$$Pr(T^*_{R\alpha/2} \le \theta \le T^*_{R(1 - \alpha/2)}) = Pr\left( -z_{\alpha/2} \le \frac{U - m(T)}{c} \le z_{\alpha/2} \right) = 1 - \alpha$$

This employs the ordered bootstrap replicates:

$$T_{(1)}^*, T_{(2)}^*, \ldots, T_{(R)}^*$$

Simply extract the $100(\alpha/2)$ and $100(1-\alpha/2)$ percentiles:

$$T_{[(R+1)\alpha/2]}^* \lt \theta \lt T_{[(R+1)(1-\alpha/2)]}^*$$
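
For instance, reusing the `estimates` array of bootstrapped standard deviations from the pivot-interval calculation above, the 95% percentile interval is simply:

# Empirical 2.5th and 97.5th percentiles of the bootstrap replicates
np.percentile(estimates, [2.5, 97.5])
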
In [21]:
prostate_data.head()
Out[21]:
lcavol lweight age lbph svi lcp gleason pgg45 lpsa train
1 -0.579818 2.769459 50 -1.386294 0 -1.386294 6 0 -0.430783 T
2 -0.994252 3.319626 58 -1.386294 0 -1.386294 6 0 -0.162519 T
3 -0.510826 2.691243 74 -1.386294 0 -1.386294 7 20 -0.162519 T
4 -1.203973 3.282789 58 -1.386294 0 -1.386294 6 0 -0.162519 T
5 0.751416 3.432373 62 -1.386294 0 -1.386294 6 0 0.371564 T

Extract a subset of variables, and take bootstrap samples.

In [22]:
data_subset = prostate_data[['lcavol', 'lweight', 'lbph', 'lcp', 'lpsa']]

bootstrap_data = (data_subset.sample(data_subset.shape[0], replace=True) for _ in range(1000))

We will use scikit-learn's LinearRegression model to fit a regression model to each bootstrap sample, and store the resulting coefficients.

In [23]:
from sklearn.linear_model import LinearRegression

regmod = LinearRegression()

# Empty array to store estimates
coefs = np.empty((1000, 4))

for i,X in enumerate(bootstrap_data):
    y = X.pop('lpsa')
    regmod.fit(X, y)
    coefs[i] = regmod.coef_
In [24]:
coefs.mean(0)
Out[24]:
array([0.58320937, 0.58820544, 0.05492929, 0.09324581])

Sort coefficients, and extract percentiles:

In [25]:
coefs.sort(axis=0)
boot_se = coefs[[25, 975], :].T
In [26]:
# Plot means
coef_means = coefs.mean(0)
plt.plot(coef_means, 'ro')

# Plot bootstrap intervals
for i in range(len(coef_means)):
    plt.errorbar(x=[i,i], y=boot_se[i], color='red')
    
plt.xlim(-0.5, 3.5)
plt.xticks(range(len(coef_means)), ['lcavol', 'lweight', 'lbph', 'lcp'])
plt.axhline(0, color='k', linestyle='--')
Out[26]:
<matplotlib.lines.Line2D at 0x78e642bb4ba8>

Note, however, that the pivot confidence interval is generally more accurate than the percentile interval.

Also, it's important to remember that the bootstrap is an asymptotic method, so the coverage of the confidence interval is expected to be $1 - \alpha + r_n$, where $r_n \propto 1/\sqrt{n}$.

Parametric Bootstrap

The bootstrap approaches we have outlined above are all non-parametric, though parametric inference is possible. If we have $X_1, \ldots, X_n \sim p(x|\theta)$, where $\widehat{\theta}$ is the MLE, then let $\phi = g(\theta)$ and $\widehat{\phi} = g(\widehat{\theta})$.

An approach for estimating the standard error of $\widehat{\phi}$ is to compute the standard deviation of the bootstrapped $\widehat{\phi}^*_1, \ldots, \widehat{\phi}^*_R$. To do this, we can draw bootstrap samples from the parametric distribution $p(x|\widehat{\theta})$:

$$\widehat{X}_1^*, \ldots, \widehat{X}_n^* \sim p(x|\widehat{\theta})$$
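
A minimal sketch (assuming a normal model for the simulated `x` from earlier, and taking $\phi = g(\theta)$ to be the 95th percentile of the distribution, $\mu + 1.645\sigma$):

# MLEs of the normal parameters
mu_hat, sigma_hat = x.mean(), x.std()

phi_star = np.empty(1000)
for r in range(1000):
    # Draw a bootstrap sample from the fitted parametric model p(x | theta_hat)
    x_star = np.random.normal(mu_hat, sigma_hat, size=len(x))
    # Re-estimate theta and evaluate phi = g(theta) on the bootstrap sample
    phi_star[r] = x_star.mean() + 1.645 * x_star.std()

phi_star.std(ddof=1)    # bootstrap estimate of the standard error of phi_hat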

References

Wasserman, L. (2006). All of Nonparametric Statistics. Springer Science & Business Media.