In [1]:

```
# %load ../../preconfig.py
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['axes.grid'] = False
import seaborn as sns
sns.set(color_codes=True)
import numpy as np
import pandas as pd
#import itertools
import logging
logger = logging.getLogger()
```

Input: predictor, independent variable, feature

Output: responses, dependent variable

supervised learning

quantitative:

eg: $x \in [0, 10]$ $\to$ regression

qualitative: categorical variable, discrete variable, factor; represented numerically by codes: dummy variables.

eg: $x \in \{0, 1, \dotsc, 10\}$ $\to$ classification

ordered categorical:

eg: $x \in \{\text{small, medium, large}\}$

there is an ordering between the values, but no metric notion is appropriate.

All vectors are assumed to be *column vectors*.

Input:

vector: $X$

matrix: $\mathbf{X}$

\begin{align}
\mathbf{X}_{N \times p} &= \begin{bmatrix} \mathbf{x}_1 & \cdots & \mathbf{x}_j & \cdots & \mathbf{x}_p \end{bmatrix} \\
&= \begin{bmatrix} x_1^T \\ \vdots \\ x_i^T \\ \vdots \\ x_N^T \end{bmatrix}
\end{align}

Output:

quantitative outputs: $Y$

qualitative outputs: $G$

Prediction: $\hat{Y}$, $\hat{G}$

training data: $\{(x_i, y_i)\}$ or $\{(x_i, g_i)\}$, $i = 1, \dotsc, N$

Given: $X^T = (X_1, X_2, \dotsc, X_p)$

Predict: $$\hat{Y} = \hat{\beta}_0 + \displaystyle \sum_{j=1}^p X_j \hat{\beta}_j$$

If we include the constant variable 1 in $X$, i.e. $X^T = (1, X_1, \dotsc, X_p)$, and include $\hat{\beta}_0$ in $\hat{\beta}$, then $\hat{Y} = X^T \hat{\beta}$.

residual sum of squares: $$\text{RSS} (\beta) = (\mathbf{y} - \mathbf{X} \beta)^T (\mathbf{y} - \mathbf{X} \beta)$$

Goal: $$\hat{\beta} = \text{argmin}_{\beta} \text{ RSS}(\beta)$$

Differentiating w.r.t. $\beta$, we get the *normal equations*:

\begin{align}
\frac{\mathrm{d}}{\mathrm{d} \beta} \text{RSS} (\beta) &= 0 \\
\mathbf{X}^T (\mathbf{y} - \mathbf{X} \beta) &= 0
\end{align}
If $\mathbf{X}^T \mathbf{X}$ is nonsingular, then the unique solution is given by
$$\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

In [2]:

```
from sklearn import datasets, linear_model
```

In [3]:

```
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
```

In [4]:

```
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
```

In [5]:

```
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
print('intercept: {}, coef: {}'.format(regr.intercept_, regr.coef_))
```

In [122]:

```
# Design matrix columns: a constant x_0 = 1 (intercept) and the single feature x_1
df = pd.DataFrame(
    {
        'x_0': 1,
        'x_1': np.array(diabetes_X_train).flatten(),
        'y': np.array(diabetes_y_train).flatten()
    }
)
```

In [123]:

```
X = np.matrix(df.loc[:, ['x_0', 'x_1']])
Y = np.matrix(df.loc[:, 'y']).T
```

In [128]:

```
# hat{beta} = (X^T X)^{-1} X^T y -- should match the intercept and coefficient from sklearn above
beta = (X.T * X).I * X.T * Y
beta
```

Out[128]:

In [127]:

```
sns.jointplot(x='x_1', y='y', data=df, kind='reg')
```

Out[127]:

The $k$-nearest neighbor fit for $\hat{Y}$ is defined as follows: $$\hat{Y}(x) = \frac{1}{k} \displaystyle \sum_{x_i \in N_k(x)} y_i$$ where $N_k(x)$ is the neighborhood of $x$ defined by the $k$ closest points $x_i$ in the training sample.

In words, we find the $k$ observations with $x_i$ closest to $x$ in input space, and average their responses.
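As a sketch (not part of the original analysis), this averaging can be checked on the diabetes split defined above: a manual $k$-NN average at one query point should agree with scikit-learn's `KNeighborsRegressor` (up to ties in distance); $k = 5$ is an arbitrary choice.

```
from sklearn.neighbors import KNeighborsRegressor

k = 5
x0 = diabetes_X_test[0]  # one query point; reuses the train/test split from the cells above

# Manual k-NN fit: average the responses of the k training points closest to x0
dist = np.abs(diabetes_X_train[:, 0] - x0[0])
idx = np.argsort(dist)[:k]
y_hat_manual = diabetes_y_train[idx].mean()

# scikit-learn performs the same neighborhood averaging
knn = KNeighborsRegressor(n_neighbors=k).fit(diabetes_X_train, diabetes_y_train)
y_hat_sklearn = knn.predict(x0.reshape(1, -1))[0]

print(y_hat_manual, y_hat_sklearn)
```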

The linear model makes huge assumptions about structure and yields stable but possibly inaccurate predictions.

The method of $k$-nearest neighbors makes very mild structural assumptions: its predictions are often accurate but can be unstable.

| | variance | bias |
|---|---|---|
| Least Squares | low | high |
| Nearest Neighbors | high | low |

Each method has its own situations for which it works best:

Scenario 1 (favors linear regression): the training data in each class were generated from bivariate Gaussian distributions with uncorrelated components and different means.

Scenario 2 (favors nearest neighbors): the training data in each class came from a mixture of 10 low-variance Gaussian distributions, with individual means themselves distributed as Gaussian.

A large subset of the most popular techniques in use today are variants of these two simple procedures. The following list describes some ways in which these simple procedures have been enhanced:

Kernel methods: replace the hard 0/1 neighborhood weights of $k$-nearest neighbors with weights in $(0, 1)$ that decrease smoothly to zero with distance from the target point.

In high-dimensional spaces: distance kernels are modified to emphasize some variables more than others.

Local regression: fits linear models locally rather than constant ones.

Linear models: fit to a basis expansion of the original inputs.

Projection pursuit and neural network models: sums of nonlinearly transformed linear models.

In [93]:

```
from sklearn.utils import resample
np.random.seed(1)
# Ten mixture means per class ('b' and 'r'), drawn from bivariate Gaussians centered at (1, 0) and (0, 1)
m_b = np.random.multivariate_normal([1,0], np.eye(2), 10)
m_r = np.random.multivariate_normal([0,1], np.eye(2), 10)
```

In [94]:

```
# 100 samples for class 'b': pick one of the 10 means at random, then draw from N(mean, I/5)
s_b = [np.random.multivariate_normal(m_b[x,:], 1/5*np.eye(2), 1) for x in np.random.choice(m_b.shape[0], 100)]
s_b = pd.DataFrame(np.array([x[0,:] for x in s_b]), columns=['x_1', 'x_2'] )
```

In [95]:

```
s_r = [np.random.multivariate_normal(m_r[x,:], 1/5*np.eye(2), 1) for x in np.random.choice(m_r.shape[0], 100)]
s_r = pd.DataFrame(np.array([x[0,:] for x in s_r]), columns=['x_1', 'x_2'] )
```

In [96]:

```
sample = pd.concat([s_b, s_r], keys=[0, 1], axis=0, names=['color', 'ind'])
sample = sample.reset_index(level=[0], col_fill=['color'])
```

In [97]:

```
sample.plot(kind='scatter', x='x_1', y='x_2', c='color')
```

Out[97]:

*loss function*: penalize errors in prediction.

By far the most common and convenient loss function is *squared error loss*: $$L(Y, f(X)) = (Y - f(X))^2$$

This leads us to a criterion for choosing $f$, \begin{align} \text{EPE}(f) &= E(Y - f(X))^2 \\ &= E_{X} E_{Y|X} ([Y - f(X)]^2 | X) \end{align}

and we see that it suffices to minimize EPE pointwise: \begin{align} f(x) &= \text{argmin}_{c} E_{Y|X}([Y - c]^2 | X = x) \\ &= E(Y | X = x) \end{align} Thus the best prediction of $Y$ at any point $X=x$ is the conditional mean, when best is measured by average squared error.
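A minimal numerical sketch of this pointwise claim (the distribution of $Y \mid X = x$ below is an arbitrary assumption): among constants $c$, the average squared error is minimized near the sample mean.

```
rng = np.random.RandomState(0)
y_given_x = rng.normal(loc=2.0, scale=1.0, size=10000)  # pretend draws of Y | X = x

c_grid = np.linspace(0, 4, 401)
risk = np.array([np.mean((y_given_x - c) ** 2) for c in c_grid])

# the minimizing constant is (approximately) the conditional mean
print(c_grid[risk.argmin()], y_given_x.mean())
```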

Nearest-neighbor methods attempt to implement this recipe directly using the training data: $$\hat{f}(x) = \text{Ave}(y_i \mid x_i \in N_{k}(x))$$

Two approximations are happening here:

expectation is approximated by averaging over sample data;

conditioning at a point is relaxed to conditioning on some region "close" to the target point.

We solve for $\beta$ theoretically: $$\beta = [E(X X^T)]^{-1} E(XY)$$ The least squares solution $$\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$ amounts to replacing the expectation by averages over the training data.

What the two methods have in common:

- both $k$-nearest neighbors and least squares end up approximating conditional expectations by averages.

Where they differ in model assumptions:

- Least squares assumes $f(x)$ is well approximated by a globally linear function.

- $k$-nearest neighbors assumes $f(x)$ is well approximated by a locally constant function.

For the $L_1$ loss $E|Y - f(X)|$, the solution is the conditional median, $$\hat{f}(x) = \text{median}(Y | X=x)$$ and its estimates are more robust than those for the conditional mean.

$L_1$ criteria have discontinuities in their derivatives, which have hindered their widespread use.

An estimate $\hat{G}$ will assume values in $\mathcal{G}$, the set of possible classes.

Our loss function can be represented by a $K \times K$ matrix $\mathbf{L}$, where $K = \text{card}(\mathcal{G})$. $\mathbf{L}$ will be zero on the diagonal and nonnegative elsewhere.

Most often we use the *zero-one* loss function, and the expected prediction error is
\begin{align}
\text{EPE} &= E[L(G, \hat{G}(X))] \\
&= E_X \displaystyle \sum_{k=1}^{K} L[\mathcal{G}_k, \hat{G}(X)] \text{ Pr}(\mathcal{G}_k | X)
\end{align}

and again it suffices to minimize EPE pointwise: \begin{equation} \hat{G}(x) = \text{argmin}_{g \in \mathcal{G}} \displaystyle \sum_{k=1}^{K} L(\mathcal{G}_{k}, g) \text{ Pr}(\mathcal{G}_k | X = x) \end{equation}

With the 0-1 loss function this simplifies to
\begin{align}
\hat{G}(x) &= \text{argmin}_{g \in \mathcal{G}} [1 - \text{ Pr}(g | X = x)] \\
&= \text{argmax}_{g \in \mathcal{G}} \text{ Pr}(g | X = x)
\end{align}
This reasonable solution is known as the *Bayes classifier*: classify to the most probable class, using the conditional (discrete) distribution $\text{Pr}(G | X)$. The error rate of the Bayes classifier is called the *Bayes rate*.
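Because the generating mixture for the simulated data above is known (the means `m_b`, `m_r` and the common component covariance $\tfrac{1}{5}\mathbf{I}$), the Bayes classifier can be evaluated directly. A sketch, assuming equal class priors and equally weighted mixture components (which matches how the sample was drawn); it requires `scipy`:

```
from scipy.stats import multivariate_normal

def mixture_density(x, means, cov=np.eye(2) / 5):
    # equally weighted mixture of Gaussians with a common covariance
    return np.mean([multivariate_normal.pdf(x, mean=m, cov=cov) for m in means])

def bayes_classify(x):
    # with equal priors, Pr(G | X = x) is proportional to the class-conditional density,
    # so classify to the class whose mixture density is larger at x
    return 0 if mixture_density(x, m_b) > mixture_density(x, m_r) else 1

pred = np.array([bayes_classify(x) for x in sample[['x_1', 'x_2']].values])
print('Bayes classifier training error rate:', (pred != sample['color']).mean())
```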

*curse of dimensionality*

Suppose in $p$-dimensional space we send out a hypercubical neighborhood about a target point to capture a fraction $r$ of the observations; its expected edge length will be $e_p(r) = r^{1/p}$.

$e_{10}(0.01) = 0.63$ and $e_{10}(0.1) = 0.8$, while the entire range for each input is only 1.0. Such neighborhoods are no longer "local".
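A quick check of the numbers quoted above, using $e_p(r) = r^{1/p}$ (illustration only):

```
def edge_length(r, p):
    # expected edge length of a hypercube capturing a fraction r of uniform data in p dimensions
    return r ** (1.0 / p)

print(edge_length(0.01, 10), edge_length(0.1, 10))  # ~0.63 and ~0.79
```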

In [3]:

```
plt.figure(figsize=(15,8))
plt.imshow(plt.imread('./res/fig2_6.png'))
```

Out[3]:

Consider $N$ data points uniformly distributed in a $p$-dimensional unit ball centered at the origin.

The median distance from the origin to the closest data point is: $$d(p,N) = (1 - (\frac{1}{2})^{1/N})^{1/p}$$

For $N = 500, p = 10$, $d(p,N) \approx 0.52$, more than halfway to the boundary. Prediction, however, is much more difficult near the edges of the training sample.
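The median-distance formula can be checked numerically (illustration only):

```
def median_dist(p, N):
    # median distance from the origin to the closest of N uniform points in the p-dimensional unit ball
    return (1 - 0.5 ** (1.0 / N)) ** (1.0 / p)

print(median_dist(10, 500))  # ~0.52
```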

To maintain the same sampling density as $N_1 = 100$ points in one dimension, we would need $N_{10} = 100^{10}$ points in ten dimensions. Thus, in high dimensions all feasible training samples sparsely populate the input space.

Suppose we have a training set $\tau$ consisting of 1000 examples $x_i$ generated uniformly on $[-1,1]^p$, with $Y = f(X) = e^{-8 \|X\|^2}$ (no measurement error), and we use the 1-nearest-neighbor rule to predict $y_0$ at the test point $x_0 = 0$.

The mean squared error in estimating $f(0)$ has a *bias-variance decomposition*:

\begin{align}
MSE(x_0) &= E_{\tau} [f(x_0) - \hat{y}_0]^2 \\
&= E_{\tau} [\hat{y}_0 - E_{\tau}(\hat{y}_0)]^2 + [E_{\tau}(\hat{y}_0) - f(x_0)]^2 \\
&= \text{Var}_{\tau} (\hat{y}_0) + \text{Bias}^2 (\hat{y}_0)
\end{align}

As $p$ increases, the nearest neighbor tends to stray further from the target point, and both bias and variance are incurred.

The complexity of functions of *many variables* can grow exponentially with the dimension.
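A small Monte Carlo sketch of this experiment, estimating $f(0)$ with the 1-nearest neighbor (the repetition count and the set of dimensions are arbitrary choices, far smaller than in the figure below):

```
def one_nn_at_origin(p, N=1000):
    # one training set: N uniform points on [-1, 1]^p with Y = exp(-8 ||X||^2)
    X = np.random.uniform(-1, 1, size=(N, p))
    y = np.exp(-8 * np.sum(X ** 2, axis=1))
    nearest = np.argmin(np.sum(X ** 2, axis=1))  # nearest neighbor of the target x_0 = 0
    return y[nearest]

np.random.seed(0)
for p in [1, 2, 5, 10]:
    y0 = np.array([one_nn_at_origin(p) for _ in range(100)])
    bias2 = (y0.mean() - 1.0) ** 2   # f(0) = 1
    var = y0.var()
    print(p, var + bias2)            # MSE(x_0) = Var + Bias^2 grows with p
```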

In [7]:

```
plt.figure(figsize=(8,15))
plt.imshow(plt.imread('./res/fig2_7.png'))
```

Out[7]:

Suppose $Y = X^T \beta + \varepsilon$ where $\varepsilon \sim N(0, \sigma^2)$, and we fit the model by least squares. For a test point $x_0$,

\begin{align}
\text{EPE}(x_0) &= E_{y_0 | x_0} E_{\tau}(y_0 - \hat{y}_0)^2 \\
&= \text{Var}(y_0 | x_0) + E_{\tau}[\hat{y}_0 - E_{\tau}\hat{y}_0]^2 + [E_{\tau}\hat{y}_0 - x_0^T \beta]^2 \\
&= \text{Var}(y_0 | x_0) + \text{Var}_{\tau}(\hat{y}_0) + \text{Bias}^2(\hat{y}_0) \\
&= \sigma^2 + E_{\tau}\, x_0^T (\mathbf{X}^T \mathbf{X})^{-1} x_0 \sigma^2 + 0^2
\end{align}

If $N$ is large and $\tau$ were selected at random, and assuming $E(X) = 0$, then $\mathbf{X}^T \mathbf{X} \to N \text{Cov}(X)$ and

\begin{align}
E_{x_0} \text{EPE}(x_0) &\sim E_{x_0} x_0^T \text{Cov}(X)^{-1} x_0 \sigma^2 / N + \sigma^2 \\
&= \text{trace} [\text{Cov}(X)^{-1} \text{Cov}(x_0)] \sigma^2 / N + \sigma^2 \\
&= \sigma^2 (p / N) + \sigma^2
\end{align}
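A Monte Carlo sketch of this approximation (the choices $N = 500$, $\sigma = 1$, $\beta = \mathbf{1}$ and the repetition count are assumptions for illustration): the simulated EPE should be close to $\sigma^2 (p / N) + \sigma^2$.

```
def epe_linear(p, N=500, sigma=1.0, reps=200):
    rng = np.random.RandomState(0)
    beta = np.ones(p)
    err = []
    for _ in range(reps):
        # draw a training set, fit by least squares, then evaluate at a fresh test point
        X = rng.normal(size=(N, p))
        y = X.dot(beta) + rng.normal(scale=sigma, size=N)
        beta_hat = np.linalg.pinv(X).dot(y)
        x0 = rng.normal(size=p)
        y0 = x0.dot(beta) + rng.normal(scale=sigma)
        err.append((y0 - x0.dot(beta_hat)) ** 2)
    return np.mean(err)

for p in [1, 5, 10, 20]:
    print(p, epe_linear(p), p / 500 + 1.0)
```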

In [9]:

```
plt.figure(figsize=(15,8))
plt.imshow(plt.imread('./res/fig2_8.png'))
```

Out[9]:

We discuss a framework for incorporating various models into the prediction problem.

additive error model: $$Y = f(X) + \epsilon$$ where $E(\epsilon) = 0$ and $\epsilon$ is independent of $X$. Hence, $f(x) = E(Y | X = x)$.

For a 0-1 coded version of $G$, the data arise from independent binary trials, with the probability of one particular outcome being $p(X)$ and the other $1 - p(X)$: \begin{align} E(Y | X = x) &= p(x) \\ \text{Var}(Y | X = x) &= E(Y^2 | X = x) - E^2(Y | X = x) \\ &= p(x) - p^2(x) \end{align}

learning by example

The data pairs $(x_i, y_i)$ are viewed as points in a $(p+1)$-dimensional space; the goal of learning is to obtain a useful approximation to $f(x)$ for all $x$ in some region of $\mathbb{R}^{p}$.

Linear basis expansions:

$$f_{\theta}(x) = \displaystyle \sum_{k=1}^K h_k(x) \theta_k$$ where the $h_k$ are a suitable set of functions or transformations of the input vector $x$.

We find the $\theta$ in $f_\theta$ by minimizing $\text{RSS}(\theta)$.
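A minimal sketch of fitting such an expansion by least squares; the polynomial basis $h_k(x) = x^k$ and the toy data are assumptions for illustration:

```
rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 50)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=50)

# basis matrix H[i, k] = h_k(x_i) = x_i^k for k = 0..3
H = np.vander(x, N=4, increasing=True)

# theta minimizing RSS(theta) = ||y - H theta||^2
theta = np.linalg.pinv(H).dot(y)
print(theta)
```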

maximum likelihood estimation:

$$L(\theta) = \displaystyle \sum_{i=1}^{N} \log \text{Pr}_{\theta}(y_i)$$

We now introduce classes of structured approaches.

$$\text{RSS}(f) = \displaystyle \sum_{i=1}^{N} (y_i - f(x_i))^2$$

For infinite $N$, the minimizing $f$ approaches the limiting conditional expectation.

For finite $N$, there are infinitely many solutions, so we add constraints to single out one solution.

The strength of the constraint is dictated by:

the neighborhood size.

the metric used.

Any method that attempts to produce locally varying functions in small isotropic neighborhoods will run into problems in high dimensions.

And conversely, all methods that overcome the dimensionality problems have an associated metric for measuring neighborhoods, which basically does NOT allow the neighborhood to be simultaneously small in all directions.

Classifying nonparametric regression techniques:

Penalty function, or *regularization*, methods express our prior belief that $f$ exhibits a certain type of smooth behavior, and indeed can usually *be cast in a Bayesian framework* (see the ridge sketch below):

- The penalty $J$ corresponds to a log-prior, $\text{PRSS}(f; \lambda)$ to the log-posterior distribution, and minimizing $\text{PRSS}(f; \lambda)$ amounts to finding the posterior mode.
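A sketch of the penalized-RSS idea with a ridge penalty $J(\beta) = \|\beta\|^2$ (the toy data and $\lambda = 10$ are assumptions for illustration):

```
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 5))
y = X[:, 0] + rng.normal(scale=0.5, size=50)

# Ridge minimizes PRSS(beta; lambda) = ||y - X beta||^2 + lambda ||beta||^2
ridge = Ridge(alpha=10.0).fit(X, y)
ols = LinearRegression().fit(X, y)
print('ridge coef:', ridge.coef_)   # shrunk toward zero relative to least squares
print('OLS coef:  ', ols.coef_)
```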

Kernel methods and local regression provide estimates by specifying the nature of the local neighborhood, and of the class of regular functions fitted locally.

In general we can define a local regression estimate of $f(x_0)$ as $f_{\hat{\theta}}(x_0)$, where $\hat{\theta}$ minimizes $$\text{RSS}(f_{\theta}, x_0) = \displaystyle \sum_{i=1}^N K_\lambda(x_0, x_i) (y_i - f_\theta(x_i))^2$$
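A sketch of this locally weighted fit at a single target point $x_0$, with a Gaussian kernel for $K_\lambda$ and a local linear model $f_\theta(x) = \theta_0 + \theta_1 x$ (the toy data and bandwidth are assumptions for illustration):

```
rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * x) + rng.normal(scale=0.3, size=100)

def local_linear_fit(x0, lam=0.1):
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)     # kernel weights K_lambda(x0, x_i)
    H = np.column_stack([np.ones_like(x), x])    # local linear model: columns 1 and x
    W = np.diag(w)
    # theta minimizing the kernel-weighted residual sum of squares
    theta = np.linalg.solve(H.T.dot(W).dot(H), H.T.dot(W).dot(y))
    return theta[0] + theta[1] * x0

print(local_linear_fit(0.5))
```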

The model for $f$ is a linear expansion of basis functions $$f_\theta(x) = \displaystyle \sum_{m=1}^M \theta_m h_m(x)$$

eg: radial basis functions, neural networks

Dictionary methods: the basis functions are chosen adaptively.

In [2]:

```
plt.imshow(plt.imread('./res/fig2_11.png'))
```

Out[2]: