Notebook

Model Assessment¶

Chapter 7. Model Assessment and Selection

前面章节介绍了各种模型，本章讲解如何选择模型，重点是如何评估一个模型的好坏。

引入损失函数 $L(Y,\hat{f}(x))$, 通常有 \begin{align} L(Y,\hat{f}(x)) = \begin{cases}(Y -\hat{f}(x))^2 & \textrm{squared error} \\ \left|Y-\hat{f}(x)\right| & \textrm{absolute error.}\end{cases} \end{align} 定义 test error, or generalization error 为在一个独立测试样本上的误差 \begin{align} \textrm{Err}_{T} = E\left[L(Y,\hat{f}(X)) \big| T\right] \end{align} 在所有测试样本上做平均，得到 expected prediction error or expected test error \begin{align} \textrm{Err} = E[L(Y,\hat{f}(X))] = E[\textrm{Err}_{T}] \end{align} 基于训练数据，模型也会有一定的误差，定义为 training error， \begin{align} \overline{\textrm{err}} = \frac{1}{N}\sum_{i=1}^N[L(y_i,\hat{f}(x_i))] \end{align}

如果数据足够，通常会将数据以2：1：1的比例分为训练、验证、和测试三部分，训练产生的各个模型，通过验证数据选择最佳的 (model selection)，最后用测试数据评估所选择的模型 (model assessment)。常用的验证分析方法有 AIC、BIC、MDL、SRM。在数据量不足的情况下，引入 Cross-Validation 和 Bootstrap。

内容概览¶

Bias-variance decomposition. 以 linear regression 为例，假设 $Y=f(X)+\epsilon$ with $E[\epsilon] = 0$ and $\text{Var}(\epsilon) = \sigma_{\epsilon}^2$. 在某个输入点 $x_0$ 的 suqared-error 为

\begin{align} \text{Err(x_0)} =& E[(Y-\hat{f}(x_0))^2\vert X=x_0] \\ =& \sigma_{\epsilon}^2 + [E\hat{f}(x_0)-f(x_0)]^2 + E[\hat{f}(x_0)-E\hat{f}(x_0)]^2 \\ =& \textrm{Irreducible Error} + \textrm{Bias}^2 + \textrm{Variance} \end{align}

Optimism of the Training Error Rate. 同样是基于训练数据中的 $N$ 个输入 $x_i$，观察另外一组输出 $Y^0$，可以得到 in-sample error

\begin{align} {\textrm{Err}_{\text{in}}} = \frac{1}{N}\sum_{i=1}^NE_{Y^0}[L(Y_i^0,\hat{f}(x_i))\vert T] \end{align}

引入 optimism 及其 expection， \begin{align} \text{op} \equiv & \text{Err}_{\text{in}} - \overline{\text{err}} \\ \omega \equiv & E_{\mathbf{y}}(\text{op}) \end{align} For squared error, 0-1, and other loss function, \begin{align} \omega = \frac{2}{N} \sum_{i=1}^{N}\text{Cov}(\hat{y}_i,y_i) \end{align} If $\hat{y}$ is obtained by a linear fit with $d$ inputs or basis functions, \begin{align} \omega = 2\frac{d}{N}\sigma_{\epsilon}^2 \end{align}

Akaike information criterion (AIC)

\begin{align} \text{AIC} = -\frac{2}{N}\cdot loglik + 2\frac{d}{N} \end{align}

Bayesian information criterion (BIC)

\begin{align} \text{BIC} = -2 \cdot loglik + \log N \cdot d \end{align}

Cross-Validation

\begin{align} \text{CV}(\hat{f}) = \frac{1}{N}\sum_{i=1}^{N}L(y_i,\hat{f}^{-k(i)}(x_i)) \end{align}

Bootstrap 重抽样。重数据集中每次抽 N 个数，抽样 B 次。

\begin{align} \widehat{\text{Err}} =\frac{1}{B}\frac{1}{N}\sum_{b=1}^B\sum_{i=1}^{N}L(y_i,\hat{f}^{*b}(x_i)) \end{align}

Bootstrap 有个 overfit 的问题，因为训练的数据与测试的数据可能重合，当 $N \rightarrow \infty$ 的时候，重合概率为 $\lim_{N\rightarrow \infty} 1+\left(1-\frac{1}{N}\right)^N \approx 1- e^{-1} = 0.632$。这个正是 switching network 中的 blocking probability。

具体内容¶

本章讨论的内容很基础，也很有意思，深挖的话是一个很长的历史故事，写起来应该很精彩。但是最近实在是太懒了，而且可以预见，之后 AI 用到的无非是 Cross-validation 或者 bootstrap。这两个试用于数据集小的情况，非常好用，但是说不清楚道理。Bootstrap 的内涵，以后再说吧。