#!/usr/bin/env python
# coding: utf-8

# In[57]:

# %load ../../preconfig.py
get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt
plt.rcParams['axes.grid'] = False
import seaborn as sns
sns.set(color_codes=True)

import numpy as np
import pandas as pd
#import itertools

import sklearn
import sklearn.metrics  # imported explicitly: sklearn.metrics.accuracy_score is used below

import logging
logger = logging.getLogger()


# Exercises
# =========

# ### Ex 2.1
# ##### Question
# Suppose each of $K$ classes has an associated target $t_k$, which is a vector of all zeros, except a one in the $k$th position. Show that classifying to the largest element of $\hat{y}$ amounts to choosing the closest target, $\min_k \|t_k - \hat{y}\|$, if the elements of $\hat{y}$ sum to one.
#
#
# ##### Solution
# Given:
# $t_k = e_k$;
# $\sum_i \hat{y}_i = 1$.
#
# We need to prove: $\text{arg max}_i \hat{y}_i = \text{arg min}_k \|t_k - \hat{y}\|$.
#
# Proof (minimizing the distance is the same as minimizing its square):
#
# \begin{align}
# \text{arg min}_k \|t_k - \hat{y}\|^2 &= \text{arg min}_k \left ( \displaystyle \sum_{j \neq k} \hat{y}_j^2 + (\hat{y}_k - 1)^2 \right )\\
# &= \text{arg min}_k \left ( \displaystyle \sum_{j} \hat{y}_j^2 - 2 \hat{y}_k + 1 \right ) \\
# &= \text{arg min}_k \left ( - 2 \hat{y}_k \right ) \quad \text{since } \sum_j \hat{y}_j^2 \text{ and } 1 \text{ do not depend on } k \\
# &= \text{arg max}_k \hat{y}_k
# \end{align}
#
# (Note that with the squared Euclidean distance the constraint $\sum_i \hat{y}_i = 1$ is not actually needed.)

# ### Ex 2.2
# ##### Question
# Show how to compute the Bayes decision boundary for the simulation example in Figure 2.5.
#
#
# ##### Solution
# [ref: Elements of Statistical Learning - Andrew Tulloch](https://github.com/ajtulloch/Elements-of-Statistical-Learning/blob/master/ESL-Solutions.pdf)
#
# As Eq. (2.22) of the textbook:
# \begin{align}
# \hat{G}(x) &= \text{arg min}_{g \in \mathcal{G}} [ 1 - P(g | X = x) ] \\
# &= \text{arg max}_{g \in \mathcal{G}} P(g | X = x)
# \end{align}
#
# The optimal Bayes decision boundary is where:
# \begin{align}
# P(\text{orange} | X = x) &= P(\text{blue} | X = x) \\
# \frac{P( X = x | \text{orange} ) P( \text{orange} )}{P(X = x)} &= \frac{P( X = x | \text{blue} ) P( \text{blue} )}{P(X = x)} \\
# P( X = x | \text{orange} ) P( \text{orange} ) &= P( X = x | \text{blue} ) P( \text{blue} )
# \end{align}
#
# As described in Section 2.3.3, $P(\text{orange}) = P(\text{blue})$, and the class densities $P(X = x | \text{orange})$ and $P(X = x | \text{blue})$ are known mixtures of bivariate Gaussians. Hence the optimal Bayes decision boundary can be computed exactly: it is the set of points where the two class densities are equal.
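# The boundary can also be traced numerically. The cell below is an illustrative sketch, not the book's own code: it redraws the ten mixture centers per class from $N((1,0)^T, I)$ (blue) and $N((0,1)^T, I)$ (orange) as described in Section 2.3.3, evaluates both class densities (equal-weight mixtures of $N(m_k, I/5)$) on a grid, and plots the contour where the two densities are equal.

# In[ ]:

# Numerical sketch of the Bayes decision boundary for a Section-2.3.3-style
# mixture simulation (the centers here are a fresh draw, not the book's realization).
from scipy.stats import multivariate_normal

rng = np.random.RandomState(0)
m_blue = rng.multivariate_normal([1, 0], np.eye(2), size=10)    # centers of the blue class
m_orange = rng.multivariate_normal([0, 1], np.eye(2), size=10)  # centers of the orange class

def mixture_density(points, centers, var=1 / 5):
    """Equal-weight mixture of N(m_k, var * I) densities, evaluated at each row of `points`."""
    return np.mean([multivariate_normal(mean=m, cov=var * np.eye(2)).pdf(points)
                    for m in centers], axis=0)

xx, yy = np.meshgrid(np.linspace(-3, 4, 200), np.linspace(-3, 4, 200))
grid = np.c_[xx.ravel(), yy.ravel()]
diff = (mixture_density(grid, m_orange) - mixture_density(grid, m_blue)).reshape(xx.shape)

# Bayes boundary: equal class densities (the equal priors cancel)
plt.contour(xx, yy, diff, levels=[0], colors='purple')
plt.title('Bayes decision boundary for one simulated mixture')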
# ### Ex 2.3
# ##### Question
# Derive equation (2.24).
#
# Suppose there are $N$ points, uniformly distributed in the unit sphere in $\mathbb{R}^p$. What is the median distance from the origin to the closest data point?
#
# ##### Solution
# [ref: Example Sheet 1: Solutions](http://www2.stat.duke.edu/~banks/cam-lectures.dir/Examples1-solns.pdf)
#
# (1)
# Let $r$ be the median distance from the origin to the closest data point, and let $r_{\text{closest}}$ denote the (random) distance from the origin to the closest of the $N$ points.
#
# Because $r$ is the median of $r_{\text{closest}}$,
# $$P(r_{\text{closest}} \geq r) = \frac{1}{2}$$
#
# Because $r_{\text{closest}}$ is the minimum of the $N$ distances, the event $\{ r_{\text{closest}} \geq r \}$ is exactly the event that all $N$ points have distance $\geq r$ from the origin.
#
# Together, we get:
# $$P(\text{all } N \text{ points have distance } \geq r) = \frac{1}{2}$$
#
# (2)
# First, all points are uniformly distributed in the unit sphere in $\mathbb{R}^p$.
#
# Second, [the $p$-dimensional volume of a Euclidean ball of radius $R$ in $p$-dimensional Euclidean space is](https://en.wikipedia.org/wiki/Volume_of_an_n-ball):
# $$V_p(R) = \frac{\pi^{p/2}}{\Gamma(\frac{p}{2} + 1)}R^p$$
#
# Together, for any single point $x$,
# \begin{align}
# P(x \text{ has distance } \geq r) &= 1 - P(x \text{ has distance } < r) \\
# &= 1 - \frac{\pi^{p/2}}{\Gamma(\frac{p}{2} + 1)}r^p \Big{/} \frac{\pi^{p/2}}{\Gamma(\frac{p}{2} + 1)}1^p \\
# &= 1 - r^p
# \end{align}
#
# Then, since the $N$ points are independent:
# \begin{align}
# P(\text{all } N \text{ points have distance } \geq r) &= P^N (x \text{ has distance } \geq r) \\
# &= (1 - r^p)^N
# \end{align}
#
# (3)
# In all,
# $$\frac{1}{2} = P(\text{all } N \text{ points have distance } \geq r) = (1 - r^p)^N,$$
#
# which gives Eq. (2.24):
# $$d(p, N) = r = \left (1 - \left (\frac{1}{2}\right )^{1/N}\right )^{1/p}$$
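# A quick Monte Carlo check of this formula (my own sketch, not part of the exercise): sample $N$ points uniformly in the unit ball of $\mathbb{R}^p$ via the usual direction-times-radius construction, record the closest distance, and compare the empirical median over many trials with $(1 - (1/2)^{1/N})^{1/p}$. With $N = 500$, $p = 10$ both should be close to the value $\approx 0.52$ quoted in the text.

# In[ ]:

def median_closest_distance(N, p, n_trials=2000, seed=0):
    """Empirical median (over trials) of the distance from the origin to the
    closest of N points drawn uniformly from the unit ball in R^p."""
    rng = np.random.RandomState(seed)
    closest = np.empty(n_trials)
    for t in range(n_trials):
        g = rng.normal(size=(N, p))
        directions = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform on the sphere
        radii = rng.uniform(size=(N, 1)) ** (1 / p)                # CDF r^p => uniform in the ball
        closest[t] = np.linalg.norm(directions * radii, axis=1).min()
    return np.median(closest)

N, p = 500, 10
print('formula: {:.3f}, simulation: {:.3f}'.format(
    (1 - 0.5 ** (1 / N)) ** (1 / p), median_closest_distance(N, p)))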
# ### Ex 2.4
# ##### Question
# The edge effect problem discussed on page 23 is not peculiar to uniform sampling from bounded domains.
#
# Consider inputs drawn from a spherical multinormal distribution $X \sim N(0,I_p)$. The squared distance from any sample point to the origin has a $\chi^2_p$ distribution with mean $p$. Consider a prediction point $x_0$ drawn from this distribution, and let $a = x_0 \big{/} \| x_0 \|$ be an associated unit vector. Let $z_i = a^T x_i$ be the projection of each of the training points on this direction.
#
# Show that the $z_i$ are distributed $N(0,1)$ with expected squared distance from the origin 1, while the target point has expected squared distance $p$ from the origin.
#
# ##### Solution
# $z_i = a^T x_i$ is a linear combination of the components of $x_i$. Moreover, $x_i \sim N(0, I_p)$ means that its components are independent standard normals, and $a$ (which depends only on $x_0$) is independent of $x_i$, so we may treat $a$ as a fixed unit vector.
#
# As [the variance of a linear combination](https://en.wikipedia.org/wiki/Variance#Sum_of_uncorrelated_variables_.28Bienaym.C3.A9_formula.29) is:
# $$\operatorname{Var}\left( \sum_{j=1}^p a_j X_j\right) = \sum_{j=1}^p a_j^2\operatorname{Var}(X_j)$$
#
# We get:
# \begin{align}
# E(z_i) &= a^T E(x_i) \\
# & = 0
# \end{align}
# and
# \begin{align}
# \operatorname{Var} (z_i) &= \sum_j a_j^2 \operatorname{Var}(x_{ij}) \\
# & = \sum_j a_j^2 \\
# & = \| a \|^2_2 \\
# & = 1
# \end{align}
#
# Thus, $z_i \sim N(0,1)$ (a linear combination of jointly Gaussian variables is again Gaussian), with expected squared distance from the origin $E(z_i^2) = \operatorname{Var}(z_i) = 1$.
#
# The target point $x_0$ has expected squared distance $E \| x_0 \|_2^2 = p$ from the origin, since $\| x_0 \|_2^2 \sim \chi^2_p$ with mean $p$.

# ### Ex 2.5
# #### (a)
# Derive equation (2.27). The last line makes use of (3.8) through a conditioning argument.
#
# #### Solution
# First, we give:
#
# 1. for $y_0 = x_0^T \beta + \epsilon; \ \epsilon \sim N(0, \sigma^2)$:
# + $E_{y_0 | x_0}(y_0) = E(y_0 | x_0) = E(x_0^T \beta + \epsilon) = x_0^T \beta$
# + $\operatorname{Var}_{y_0 | x_0}(y_0) = \operatorname{Var}(y_0 | x_0) = \operatorname{Var}(x_0^T \beta + \epsilon) = \operatorname{Var}(\epsilon) = \sigma^2$
#
# 2. for $\hat{y}_0 = x_0^T \hat{\beta} = x_0^T \beta + x_0^T (X^T X)^{-1} X^T \epsilon$, where $\epsilon$ here is the vector of training errors:
# + expected value:
# \begin{equation}
# E_{\tau}(\hat{y_0}) = x_0^T \beta = E(y_0 | x_0) \quad \text{unbiased}
# \end{equation}
# + variance:
# \begin{align}
# \operatorname{Var}_{\tau}(\hat{y_0}) &= \operatorname{Var}_{\tau}(x_0^T \hat{\beta}) \\
# &= x_0^T \operatorname{Var}_{\tau}(\hat{\beta}) x_0 \\
# &= x_0^T E_{\tau}((X^T X)^{-1} \sigma^2) x_0 \quad \text{conditioning on } X \text{, see Eq 3.8} \\
# &= E_{\tau} x_0^T (X^T X)^{-1} x_0 \sigma^2
# \end{align}
#
# 3. [Proof of the variance and bias relationship](https://en.wikipedia.org/wiki/Mean_squared_error):
# \begin{align}
# E( (\hat{\theta} - \theta)^2 ) &= E \left ( \left (\hat{\theta} - E(\hat{\theta}) \right )^2 \right ) + (E(\hat{\theta}) - \theta)^2 \\
# &= \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}^2 (\hat{\theta}, \theta)
# \end{align}
#
# Thus,
# \begin{align}
# \operatorname{EPE}(x_0) &= E_{y_0 | x_0} E_{\tau} (y_0 - \hat{y}_0)^2 \\
# &= E_{\tau} E_{\color{blue}{y_0 | x_0}} (\color{blue}{y_0} - \hat{y}_0)^2 \\
# &= E_{\tau} \left ( \operatorname{Var}(y_0 | x_0) + (E_{y_0 | x_0}(y_0) - \hat{y}_0)^2 \right ) \\
# &= \operatorname{Var}(y_0 | x_0) + E_{\color{blue}{\tau}} \left( E(y_0 | x_0) - \color{blue}{\hat{y}_0} \right )^2 \\
# &= \operatorname{Var}(y_0 | x_0) + \operatorname{Var}_{\tau}(\hat{y}_0) + \left ( E_{\tau}(\hat{y}_0) - E(y_0 | x_0) \right)^2 \\
# &= \operatorname{Var}(y_0 | x_0) + \operatorname{Var}_{\tau}(\hat{y}_0) + \left ( E_{\tau}(\hat{y}_0) - x_0^T \beta \right)^2 \\
# &= \sigma^2 + E_{\tau} x_0^T (X^T X)^{-1} x_0 \sigma^2 + 0^2
# \end{align}

# #### (b)
# Derive equation (2.28), making use of the cyclic property of the trace operator [trace(AB) = trace(BA)], and its linearity (which allows us to interchange the order of trace and expectation).
#
#
# #### Solution
# [ref: A Solution Manual and Notes for: The Elements of Statistical Learning by Jerome Friedman, Trevor Hastie, and Robert Tibshirani](https://www.google.com.sg/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwiF6sf2hsfLAhWSco4KHfJQCCwQFggbMAA&url=http%3A%2F%2Fwaxworksmath.com%2FAuthors%2FG_M%2FHastie%2FWriteUp%2Fweatherwax_epstein_hastie_solutions_manual.pdf&usg=AFQjCNH3VN6HgCDHtXNIbJtAjEEQNZFINA&sig2=b_zFhNYsupRwqtY62dGnwA)
#
# 1. $x_0$ is a $p \times 1$ vector and $\mathbf{X}$ is an $N \times p$ matrix, hence $x_0^T (\mathbf{X}^T \mathbf{X})^{-1} x_0 = C_{1 \times 1} = \operatorname{trace}(C_{1 \times 1})$
#
# 2. [property of the covariance matrix](https://en.wikipedia.org/wiki/Covariance_matrix)
# \begin{align}
# \operatorname{Cov}(x_0) &= E(x_0 x_0^T) - E(x_0) E(x_0)^T \\
# &= E(x_0 x_0^T) \quad \text{as } E(x_0) = 0 \text{, since $x_0$ is drawn from the same zero-mean distribution as } \mathbf{X}
# \end{align}
#
# Thus,
# \begin{align}
# E_{x_0} \operatorname{EPE}(x_0) &= E_{x_0} x_0^T (\mathbf{X}^T \mathbf{X})^{-1} x_0 \sigma^2 + \sigma^2 \\
# &= E_{x_0} \operatorname{trace} \left ( x_0^T (\mathbf{X}^T \mathbf{X})^{-1} x_0 \right ) \sigma^2 + \sigma^2 \\
# &= E_{x_0} \operatorname{trace} \left ( (\mathbf{X}^T \mathbf{X})^{-1} x_0 x_0^T \right ) \sigma^2 + \sigma^2 \quad \text{cyclic property} \\
# &\approx E_{x_0} \operatorname{trace} \left ( \operatorname{Cov}^{-1}(\mathbf{X}) x_0 x_0^T \right ) \frac{\sigma^2}{N} + \sigma^2 \quad \text{as } \mathbf{X}^T \mathbf{X} \to N \operatorname{Cov}(\mathbf{X}) \\
# &= \operatorname{trace} \left ( \operatorname{Cov}^{-1}(\mathbf{X}) \, E_{x_0}(x_0 x_0^T) \right ) \frac{\sigma^2}{N} + \sigma^2 \quad \text{linearity, interchange}\\
# &= \operatorname{trace} \left ( \operatorname{Cov}^{-1}(\mathbf{X}) \, \operatorname{Cov}(x_0) \right ) \frac{\sigma^2}{N} + \sigma^2 \quad \text{see 2. above}\\
# &= \operatorname{trace} (I_p) \frac{\sigma^2}{N} + \sigma^2 \quad \text{as } \operatorname{Cov}(x_0) = \operatorname{Cov}(\mathbf{X}) \\
# &= p \frac{\sigma^2}{N} + \sigma^2
# \end{align}
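# As a sanity check of Eq. (2.28) (my own sketch, assuming the setting of the exercise: no intercept, $E(Y|X) = X^T\beta$, and both the training inputs and $x_0$ drawn from $N(0, I_p)$), the cell below estimates $E_{x_0}\operatorname{EPE}(x_0)$ by refitting least squares on fresh training sets and compares it with $\sigma^2 (p/N) + \sigma^2$. The agreement is only approximate, because $\mathbf{X}^T\mathbf{X} \approx N\operatorname{Cov}(\mathbf{X})$ is itself an approximation for finite $N$.

# In[ ]:

def simulate_epe(N=100, p=5, sigma=1.0, n_rep=5000, seed=0):
    """Monte Carlo estimate of E_{x_0} EPE(x_0) for least squares with
    X ~ N(0, I_p), y = X beta + eps, eps ~ N(0, sigma^2)."""
    rng = np.random.RandomState(seed)
    beta = rng.normal(size=p)                 # an arbitrary fixed "true" coefficient vector
    errors = np.empty(n_rep)
    for r in range(n_rep):
        X = rng.normal(size=(N, p))           # fresh training inputs
        y = X @ beta + rng.normal(scale=sigma, size=N)
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        x0 = rng.normal(size=p)               # fresh prediction point
        y0 = x0 @ beta + rng.normal(scale=sigma)
        errors[r] = (y0 - x0 @ beta_hat) ** 2
    return errors.mean()

N, p, sigma = 100, 5, 1.0
print('simulated E_x0 EPE: {:.3f}, formula sigma^2 (p/N + 1): {:.3f}'.format(
    simulate_epe(N, p, sigma), sigma ** 2 * (p / N + 1)))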
# ### Ex 2.6
# #### Question
# Consider a regression problem with inputs $x_i$ and outputs $y_i$, and a parameterized model $f_{\theta}(x)$ to be fit by least squares. Show that if there are observations with tied or identical values of $x$, then the fit can be obtained from a reduced weighted least squares problem.
#
# #### Solution
# [ref: A Solution Manual and Notes for: The Elements of Statistical Learning by Jerome Friedman, Trevor Hastie, and Robert Tibshirani](https://www.google.com.sg/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwiF6sf2hsfLAhWSco4KHfJQCCwQFggbMAA&url=http%3A%2F%2Fwaxworksmath.com%2FAuthors%2FG_M%2FHastie%2FWriteUp%2Fweatherwax_epstein_hastie_solutions_manual.pdf&usg=AFQjCNH3VN6HgCDHtXNIbJtAjEEQNZFINA&sig2=b_zFhNYsupRwqtY62dGnwA)
#
# Assume we have $N$ training samples, and let $N_u$ be the number of *unique* inputs $x$. For the $i$th unique input $x_i$, let $y_{i1}, y_{i2}, \dotsc, y_{i n_i}$ denote its $n_i$ associated observations, and write $\bar{y}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}$.
#
# \begin{align}
# \displaystyle \operatorname{argmin}_{\theta} \sum_{k=1}^{N} (y_k - f_{\theta}(x_k))^2 &= \operatorname{argmin}_{\theta} \sum_{i=1}^{N_u} \sum_{j=1}^{n_i} (y_{ij} - f_{\theta}(x_i))^2 \\
# &= \operatorname{argmin}_{\theta} \sum_{i=1}^{N_u} \sum_{j=1}^{n_i} \left ( y_{ij}^2 - 2 f_{\theta}(x_i) y_{ij} + f_{\theta}(x_i)^2 \right ) \\
# &= \operatorname{argmin}_{\theta} \sum_{i=1}^{N_u} n_i \left ( \color{blue}{\frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}^2} - 2 f_{\theta}(x_i) \bar{y}_i + f_{\theta}(x_i)^2 \right ) \\
# &= \operatorname{argmin}_{\theta} \sum_{i=1}^{N_u} n_i \left ( \color{red}{\bar{y}_i^2} - 2 f_{\theta}(x_i) \bar{y}_i + f_{\theta}(x_i)^2 \color{red}{\; - \; \bar{y}_i^2} + \color{blue}{\frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}^2} \right ) \\
# &= \operatorname{argmin}_{\theta} \left ( \sum_{i=1}^{N_u} n_i \left ( \bar{y}_i - f_{\theta}(x_i) \right )^2 - \sum_{i=1}^{N_u} n_i \bar{y}_i^2 + \sum_{i=1}^{N_u} \sum_{j=1}^{n_i} y_{ij}^2 \right ) \\
# &= \operatorname{argmin}_{\theta} \left ( \sum_{i=1}^{N_u} n_i \left ( \bar{y}_i - f_{\theta}(x_i) \right )^2 + \mathcal{C} \right ) \quad \text{as the last two sums do not depend on } \theta \\
# &= \operatorname{argmin}_{\theta} \sum_{i=1}^{N_u} n_i \left ( \bar{y}_i - f_{\theta}(x_i) \right )^2
# \end{align}
#
# Thus it is a *weighted* least squares problem, with each squared residual $(\bar{y}_i - f_{\theta}(x_i))^2$ weighted by $n_i$, and a *reduced* one, since it involves only the $N_u \leq N$ unique inputs ($N_u < N$ whenever ties are present).
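# A small numerical check of this equivalence (my own sketch; the exercise only asks for the derivation): fit a straight line by ordinary least squares on data with repeated $x$ values, and by weighted least squares on the unique $x$ values using the group means $\bar{y}_i$ and weights $n_i$. The two coefficient vectors agree.

# In[ ]:

# Ordinary least squares on the full data vs. weighted least squares on the
# reduced (unique-x) data; the two fits coincide.
rng = np.random.RandomState(0)
x = np.repeat(np.arange(5.0), [3, 1, 4, 2, 5])           # tied x values
y = 2.0 + 0.7 * x + rng.normal(scale=0.3, size=x.size)

X_full = np.column_stack([np.ones_like(x), x])
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

x_u = np.unique(x)
n_i = np.array([np.sum(x == v) for v in x_u])             # multiplicities n_i
y_bar = np.array([y[x == v].mean() for v in x_u])         # group means ybar_i
X_u = np.column_stack([np.ones_like(x_u), x_u])
W = np.diag(n_i.astype(float))
beta_reduced = np.linalg.solve(X_u.T @ W @ X_u, X_u.T @ W @ y_bar)

print('full OLS:            ', beta_full)
print('reduced weighted LS: ', beta_reduced)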
# ### Ex 2.7
# Suppose we have a sample of $N$ pairs $x_i, y_i$ drawn i.i.d. from the distribution characterized as follows:
# \begin{align}
# &x_i \sim h(x), &\text{the design density} \\
# &y_i = f(x_i) + \epsilon_i, &\text{$f$ is the regression function} \\
# &\epsilon_i \sim (0, \sigma^2), &\text{mean zero, variance $\sigma^2$}
# \end{align}
#
# We construct an estimator for $f$ *linear* in the $y_i$,
# $$\hat{f}(x_0) = \displaystyle \sum_{i=1}^N \ell_i(x_0; \mathcal{X}) y_i,$$
# where the weights $\ell_i(x_0; \mathcal{X})$ do not depend on the $y_i$, but do depend on the entire training sequence of $x_i$, denoted here by $\mathcal{X}$.

# #### (a)
# Show that linear regression and $k$-nearest-neighbor regression are members of this class of estimators. Describe explicitly the weights $\ell_i(x_0; \mathcal{X})$ in each of these cases.
#
# #### solution
# 1. For linear regression, $\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$,
# so:
# \begin{align}
# \hat{f}(x_0) &= x_0^T \hat{\beta} \\
# &= x_0^T (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \\
# &= \displaystyle \sum_{i=1}^N \left [ x_0^T (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \right ]_i \ y_i
# \end{align}
# that is, $\ell_i(x_0; \mathcal{X}) = \left [ x_0^T (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \right ]_i$.
#
# 2. For $k$-nearest-neighbor regression,
# \begin{align}
# \hat{f}(x_0) &= \frac{1}{k} \displaystyle \sum_{x_i \in N_k(x_0)} y_i \\
# &= \displaystyle \sum_{i=1}^N \frac{1}{k} \, I(x_i \in N_k(x_0)) \ y_i
# \end{align}
# that is, $\ell_i(x_0; \mathcal{X}) = \frac{1}{k} I(x_i \in N_k(x_0))$.
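# A quick numerical illustration of these weights (my own sketch): build $\ell(x_0; \mathcal{X})$ explicitly for both estimators and check that $\sum_i \ell_i(x_0; \mathcal{X})\, y_i$ reproduces the usual predictions.

# In[ ]:

# Explicit linear-smoother weights ell_i(x0; X) for least squares and k-NN.
rng = np.random.RandomState(0)
N, p, k = 50, 3, 5
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=N)
x0 = rng.normal(size=p)

# Linear regression: ell = x0^T (X^T X)^{-1} X^T
ell_lr = x0 @ np.linalg.solve(X.T @ X, X.T)
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(ell_lr @ y, x0 @ beta_hat))           # same as x0^T beta_hat

# k-NN: ell_i = 1/k on the k nearest training points, 0 elsewhere
neighbors = np.argsort(np.linalg.norm(X - x0, axis=1))[:k]
ell_knn = np.zeros(N)
ell_knn[neighbors] = 1.0 / k
print(np.allclose(ell_knn @ y, y[neighbors].mean()))    # same as the k-NN average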
# #### (b)
# Decompose the conditional mean-squared error
# $$E_{\mathcal{Y} | \mathcal{X}} ( f(x_0) - \hat{f}(x_0) )^2$$
# into a conditional squared bias and a conditional variance component. Like $\mathcal{X}$, $\mathcal{Y}$ represents the entire training sequence of $y_i$.
#
# #### solution
# [ref: Mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)
#
# Here $\mathcal{X}$ is fixed and only $\mathcal{Y}$ varies; $x_0$ and $f(x_0)$ are also fixed.
# So:
# \begin{align}
# E_{\mathcal{Y} | \mathcal{X}}(\hat{f}(x_0)) &= E_{\mathcal{Y} | \mathcal{X}} \displaystyle \sum_{i=1}^N \ell_i(x_0; \mathcal{X}) y_i \\
# &= \displaystyle \sum_{i=1}^N \ell_i(x_0; \mathcal{X}) \, E_{\mathcal{Y} | \mathcal{X}} ( y_i ) \\
# &= \displaystyle \sum_{i=1}^N \ell_i(x_0; \mathcal{X}) \, E_{\mathcal{Y} | \mathcal{X}} ( f(x_i) + \epsilon_i ) \\
# &= \displaystyle \sum_{i=1}^N \ell_i(x_0; \mathcal{X}) \, \left ( f(x_i) + E_{\mathcal{Y} | \mathcal{X}} ( \epsilon_i ) \right ) \\
# &= \displaystyle \sum_{i=1}^N \ell_i(x_0; \mathcal{X}) \, f(x_i) \\
# &= C \quad \text{a constant once $\mathcal{X}$ is fixed}
# \end{align}
#
# Thus:
#
# \begin{align}
# {} & E_{\mathcal{Y} | \mathcal{X}} \left ( f(x_0) - \hat{f}(x_0) \right )^2 \\
# = & E_{\mathcal{Y} | \mathcal{X}} \left ( \hat{f}(x_0) - f(x_0) \right )^2 \\
# = & E_{\mathcal{Y} | \mathcal{X}} \left ( \left ( \hat{f}(x_0) - E_{\mathcal{Y} | \mathcal{X}}(\hat{f}(x_0)) \right ) + \left ( E_{\mathcal{Y} | \mathcal{X}}(\hat{f}(x_0)) - f(x_0) \right ) \right )^2 \\
# = & E_{\mathcal{Y} | \mathcal{X}} \left ( \hat{f}(x_0) - E_{\mathcal{Y} | \mathcal{X}}(\hat{f}(x_0)) \right )^2 + 2 \, E_{\mathcal{Y} | \mathcal{X}} \left [ \left ( \hat{f}(x_0) - E_{\mathcal{Y} | \mathcal{X}}(\hat{f}(x_0)) \right ) \underbrace{\left ( E_{\mathcal{Y} | \mathcal{X}}(\hat{f}(x_0)) - f(x_0) \right )}_\text{constant} \right ] + E_{\mathcal{Y} | \mathcal{X}} \underbrace{\left ( E_{\mathcal{Y} | \mathcal{X}}(\hat{f}(x_0)) - f(x_0) \right )^2}_\text{constant} \\
# = & \operatorname{Var}_{\mathcal{Y} | \mathcal{X}}(\hat{f}(x_0)) + 2 \left ( C - f(x_0) \right ) \underbrace{E_{\mathcal{Y} | \mathcal{X}} \left ( \hat{f}(x_0) - E_{\mathcal{Y} | \mathcal{X}}(\hat{f}(x_0)) \right )}_{= \, 0} + \left ( E_{\mathcal{Y} | \mathcal{X}}(\hat{f}(x_0)) - f(x_0) \right )^2 \\
# = & \operatorname{Var}_{\mathcal{Y} | \mathcal{X}}(\hat{f}(x_0)) + \operatorname{Bias}^2 (\hat{f}(x_0), f(x_0))
# \end{align}

# #### (c)
# Decompose the (unconditional) mean-squared error
# $$E_{\mathcal{Y}, \mathcal{X}} ( f(x_0) - \hat{f}(x_0) )^2$$
# into a squared bias and a variance component.
#
# #### solution
# \begin{align}
# E_{\mathcal{Y}, \mathcal{X}} (f(x_0) - \hat{f}(x_0))^2 &= E_{\mathcal{X}} E_{\mathcal{Y} | \mathcal{X}} (f(x_0) - \hat{f}(x_0))^2
# \end{align}
# and the same argument as in (b), with $E_{\mathcal{Y}, \mathcal{X}}$ in place of $E_{\mathcal{Y} | \mathcal{X}}$, gives
# \begin{align}
# E_{\mathcal{Y}, \mathcal{X}} (f(x_0) - \hat{f}(x_0))^2 = \operatorname{Var}_{\mathcal{Y}, \mathcal{X}}(\hat{f}(x_0)) + \left ( E_{\mathcal{Y}, \mathcal{X}}(\hat{f}(x_0)) - f(x_0) \right )^2
# \end{align}

# #### (d)
# Establish a relationship between the squared biases and variances in the above two cases.
#
# #### solution
# ##### 1. For the variance,
# As shown in (b),
# $$E_{\mathcal{Y} | \mathcal{X}}(\hat{f}(x_0)) = \displaystyle \sum_{i=1}^N \ell_i(x_0; \mathcal{X}) \, f(x_i)$$
#
# and also:
# $$ \hat{f}(x_0) = \displaystyle \sum_{i=1}^N \ell_i(x_0; \mathcal{X}) y_i, \qquad y_i = f(x_i) + \epsilon_i $$
#
# Thus,
# \begin{align}
# \operatorname{Var}_{\mathcal{Y} | \mathcal{X}}(\hat{f}(x_0)) &= E_{\mathcal{Y} | \mathcal{X}} \left ( \hat{f}(x_0) - E_{\mathcal{Y} | \mathcal{X}} \left ( \hat{f}(x_0) \right ) \right )^2 \\
# &= E_{\mathcal{Y} | \mathcal{X}} \left ( \displaystyle \sum_{i=1}^N \ell_i(x_0; \mathcal{X}) \left ( y_i - f(x_i) \right ) \right )^2 \\
# &= E_{\mathcal{Y} | \mathcal{X}} \left ( \displaystyle \sum_{i=1}^N \ell_i(x_0; \mathcal{X}) \epsilon_i \right )^2 \\
# &= \sigma^2 \displaystyle \sum_{i=1}^N \ell_i^2(x_0; \mathcal{X}) \quad \text{since the $\epsilon_i$ are uncorrelated with mean 0 and variance } \sigma^2
# \end{align}
#
# ##### 2. For the squared bias,
# \begin{align}
# \operatorname{Bias}^2 (\hat{f}(x_0), f(x_0)) &= \left ( E_{\mathcal{Y} | \mathcal{X}} \hat{f}(x_0) - f(x_0) \right )^2 \\
# &= \left( \displaystyle \sum_{i=1}^N \ell_i(x_0; \mathcal{X}) \, f(x_i) - f(x_0) \right )^2
# \end{align}
#
# ##### 3. Relationship
# The conditional variance is driven only by the noise $\epsilon$ (through $\sigma^2$ and the weights), while the conditional squared bias is driven only by the regression function $f$ (through the weights). Taking expectations over $\mathcal{X}$ and using the law of total variance, the unconditional variance in (c) equals the average conditional variance plus $\operatorname{Var}_{\mathcal{X}} \left ( E_{\mathcal{Y} | \mathcal{X}}(\hat{f}(x_0)) \right )$, and the average conditional squared bias exceeds the unconditional squared bias by exactly the same term.
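# The conditional decomposition in (b) is easy to verify numerically. The cell below is my own sketch, not part of the exercise: it fixes a design $\mathcal{X}$ and the $k$-NN smoother weights $\ell_i(x_0; \mathcal{X})$, redraws only the noise many times, and checks that the Monte Carlo variance and squared bias add up to the conditional MSE (and that the variance matches $\sigma^2 \sum_i \ell_i^2$).

# In[ ]:

# Conditional bias-variance decomposition at x0 for a fixed design and a k-NN smoother.
rng = np.random.RandomState(0)
N, k, sigma = 30, 5, 0.5
f = lambda x: np.sin(2 * x)                        # true regression function
x_train = np.sort(rng.uniform(0, 3, size=N))       # fixed design (the calligraphic X)
x0 = 1.5

# k-NN smoother weights ell_i(x0; X): 1/k on the k nearest x_i, 0 elsewhere
ell = np.zeros(N)
ell[np.argsort(np.abs(x_train - x0))[:k]] = 1.0 / k

# Redraw only the noise, i.e. sample from Y | X
n_rep = 20000
f_hat = np.array([ell @ (f(x_train) + rng.normal(scale=sigma, size=N))
                  for _ in range(n_rep)])

mse = np.mean((f(x0) - f_hat) ** 2)
var = f_hat.var()
bias2 = (f_hat.mean() - f(x0)) ** 2
print('conditional MSE: {:.4f}, Var + Bias^2: {:.4f}'.format(mse, var + bias2))
print('Var: {:.4f}, sigma^2 * sum(ell_i^2): {:.4f}'.format(var, sigma ** 2 * np.sum(ell ** 2)))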
# ### Ex 2.8
# Compare the classification performance of linear regression and $k$-nearest neighbor classification on the zipcode data. In particular, consider only the 2's and 3's, and $k$ = 1, 3, 5, 7 and 15. Show both the training and test error for each choice. The zipcode data are available from the book website www-stat.stanford.edu/ElemStatLearn.

# In[25]:

train_dat = pd.read_csv('./res/zip.train', header=None, sep=' ')
train_dat.rename(columns={0: 'digital'}, inplace=True)
train_dat = train_dat.query('digital == 2 or digital == 3')   # keep only the 2's and 3's
train_dat.dropna(axis=1, inplace=True)                        # drop the empty trailing column created by the separator


# In[31]:

train_dat.set_index('digital', inplace=True)


# In[32]:

train_dat.head(3)


# In[43]:

# sanity check: render the first training digit as a 16 x 16 image
plt.imshow(np.reshape(train_dat.iloc[0].values, (16, 16)))


# In[28]:

test_dat = pd.read_csv('./res/zip.test', header=None, sep=' ')
test_dat.rename(columns={0: 'digital'}, inplace=True)
test_dat = test_dat.query('digital == 2 or digital == 3')
test_dat.dropna(axis=1, inplace=True)


# In[33]:

test_dat.set_index('digital', inplace=True)


# In[34]:

test_dat.head(3)


# In[68]:

def eva(conf, train_dat, test_dat):
    """Fit the classifier described by `conf` and report training and test
    accuracy (error rate = 1 - accuracy)."""
    x_train = train_dat.values
    y_train = train_dat.index.values
    conf['cls'].fit(x_train, y_train)

    x_test = test_dat.values
    y_test = test_dat.index.values

    train_accuracy = sklearn.metrics.accuracy_score(y_train, conf['cls'].predict(x_train))
    test_accuracy = sklearn.metrics.accuracy_score(y_test, conf['cls'].predict(x_test))
    print('{}, parameter: {}, train accuracy: {:.4}, test accuracy: {:.4}'.format(
        conf['name'], conf['parameter'], train_accuracy, test_accuracy))


# In[69]:

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

configuration = [
    # LogisticRegression serves as the linear classifier here, standing in for
    # least squares regression on a 0/1 coding of the two classes.
    {'cls': LogisticRegression(), 'name': 'LogisticRegression', 'parameter': None},
    {'cls': KNeighborsClassifier(n_neighbors=1), 'name': 'KNeighborsClassifier', 'parameter': 'N=1'},
    {'cls': KNeighborsClassifier(n_neighbors=3), 'name': 'KNeighborsClassifier', 'parameter': 'N=3'},
    {'cls': KNeighborsClassifier(n_neighbors=5), 'name': 'KNeighborsClassifier', 'parameter': 'N=5'},
    {'cls': KNeighborsClassifier(n_neighbors=7), 'name': 'KNeighborsClassifier', 'parameter': 'N=7'},
    {'cls': KNeighborsClassifier(n_neighbors=15), 'name': 'KNeighborsClassifier', 'parameter': 'N=15'},
]

for conf in configuration:
    eva(conf, train_dat, test_dat)


# ### Ex 2.9
# Consider a linear regression model with $p$ parameters, fit by least squares to a set of training data $(x_1,y_1), \dotsc,(x_N,y_N)$ drawn at random from a population. Let $\hat{\beta}$ be the least squares estimate. Suppose we have some test data $(\hat{x}_1, \hat{y}_1), \dotsc,(\hat{x}_M, \hat{y}_M)$ drawn at random from the same population as the training data.
#
# If $R_{tr}(\beta) = \frac{1}{N} \sum_1^N (y_i - \beta^T x_i)^2$, and $R_{te}(\beta) = \frac{1}{M} \sum_1^M (\hat{y}_i - \beta^T \hat{x}_i)^2$, prove that
# $$E[R_{tr}(\hat{\beta})] \leq E[R_{te}(\hat{\beta})],$$
# where the expectations are over all that is random in each expression.
#
# #### solution
# Ref: Hint from [Homework 2 - Hector Corrada Bravo](http://www.cbcb.umd.edu/~hcorrada/PracticalML/assignments/hw2.pdf)
#
# Define $\beta^{\ast}$ to be the least squares estimate computed on the **test data** (for this argument take $M = N$, so that the two samples play symmetric roles).
#
# 1.
# Since the training data and the test data are both drawn at random from the same population, $R_{tr}(\hat{\beta})$ and $R_{te}(\beta^{\ast})$ are the minimized average squared errors on two i.i.d. samples of the same size, hence identically distributed, so:
# $$E[R_{tr}(\hat{\beta})] = E[R_{te}(\beta^{\ast})]$$
#
# 2.
# On the other hand, because $\beta^{\ast}$ is the least squares estimate on the **test data**, namely,
# $$\beta^\ast = \operatorname{argmin}_\beta \frac{1}{M} \sum_1^M (\hat{y}_i - \beta^T \hat{x}_i)^2, $$
# we have:
# $$R_{te}(\beta^\ast) = \frac{1}{M} \sum_1^M (\hat{y}_i - \left (\color{blue}{\beta^{\ast}} \right )^T \hat{x}_i)^2 \leq \frac{1}{M} \sum_1^M (\hat{y}_i - \color{blue}{\hat{\beta}}^T \hat{x}_i)^2 = R_{te}(\hat{\beta}) $$
# Thus:
# $$E[R_{te}(\beta^\ast)] \leq E[R_{te}(\hat{\beta})]$$
#
# 3.
# Finally, combining the two:
# $$E[R_{tr}(\hat{\beta})] = E[R_{te}(\beta^{\ast})] \leq E[R_{te}(\hat{\beta})],$$
# namely,
# $$E[R_{tr}(\hat{\beta})] \leq E[R_{te}(\hat{\beta})]$$
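# A quick simulation of this inequality (my own sketch, with $M = N$ and a well-specified linear model): repeatedly draw training and test samples from the same population, fit least squares on the training sample, and compare the average training and test risks.

# In[ ]:

def train_test_risks(N=40, p=10, sigma=1.0, n_rep=2000, seed=0):
    """Average R_tr(beta_hat) and R_te(beta_hat) over repeated draws of
    training and test samples of size N from the same linear population."""
    rng = np.random.RandomState(seed)
    beta = rng.normal(size=p)
    r_tr = np.empty(n_rep)
    r_te = np.empty(n_rep)
    for r in range(n_rep):
        X = rng.normal(size=(N, p))
        y = X @ beta + rng.normal(scale=sigma, size=N)
        X_new = rng.normal(size=(N, p))
        y_new = X_new @ beta + rng.normal(scale=sigma, size=N)
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        r_tr[r] = np.mean((y - X @ beta_hat) ** 2)
        r_te[r] = np.mean((y_new - X_new @ beta_hat) ** 2)
    return r_tr.mean(), r_te.mean()

r_tr, r_te = train_test_risks()
print('E[R_tr] estimate: {:.3f} <= E[R_te] estimate: {:.3f}'.format(r_tr, r_te))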