%pylab inline
Populating the interactive namespace from numpy and matplotlib
In this notebook I'll show probabilistic interpretation of the nearest neighbours algorith as a mixture of Gaussians. Following Barber 2012. First I'll give an example of ... Next, I'll show how to reformulate ... using
This pretty much follows from Barber, 2012
kNN is simple to understand and implement, and often used as a baseline.
Some limitations of this approach.
We can reformulate the kNN as a class conditional mixture of Gaussians.
In Barber 2012 shows a an of the nearest neighbour method as the limiting case of a mixture model.
What follows is a solution for Exercise 158 from Barber 2012 in python.
Write a routine SoftNearNeigh(xtrain,xtest,trainlabels,sigma) to implement soft nearest neighbours, analogous to nearNeigh.m . Here sigma is the variance σ 2 in equation (14.3.1). As above, the file NNdata.mat contains training and test data for the handwritten digits 5 and 9. Using leave one out cross-validation, find the optimal σ 2 and use this to compute the classification accuracy of the method on the test data. Hint: you may have numerical difficulty with this method. To avoid this, consider using the logarithm, and how to numerically compute log ( e a
e b ) for large (negative) a and b . See also logsumexp.m .
For this exercise we'll be using a subsent of the MNIST dataset provided in BRMLtoolkit at Barber 2012.
import scipy.io as sio
nndata = sio.loadmat('/Users/gm/Downloads/BRMLtoolkit/data/NNdata.mat')
nndata
{'__globals__': [], '__header__': b'MATLAB 5.0 MAT-file, Platform: PCWIN, Created on: Sat Aug 01 14:24:46 2009', '__version__': '1.0', 'test5': array([[0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]], dtype=uint8), 'test9': array([[0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]], dtype=uint8), 'train5': array([[0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]], dtype=uint8), 'train9': array([[0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)}
We want to solve a binary classification problem.
class0_train = nndata['train5']
class0_test = nndata['test5']
class1_train = nndata['train9']
class1_test = nndata['test9']
Barber follows a generative approach and uses kernel density estimation (Parzen estimator) to interpret kNN as the limiting case of a mixture of Gaussians. An isotropic Gaussian of width $\sigma^2$ is placed at each data point, and a mixture is used to model each class.
With kernel density estimation we want to approximate a PDF with a mixture of continuos probability distributions. A Parzen estimator centers a probability distribution at each data point $\textbf{x}_n$ as
$P(\textbf{x}) = \frac{1}{N} \sum_{n=1}^{N} P(\textbf{x}|\textbf{x}_{n})$
For a D dimensional $\textbf{x}$ we choose an isotropic Gaussian $P(\textbf{x}|\textbf{x}_{n}) = \mathcal{N} (\textbf{x}|\textbf{x}_{n}, \sigma^2 \textbf{I}_{D})$, which gives the mixture
$P(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2 \pi \sigma^2)^{D/2}}\mathcal{e}^{- (\textbf{x} - \textbf{x}_n)^2 / 2\sigma^2} $
Given classes $c = {0, 1}$, we consider the following mixture model
$P(\textbf{x}|c=0) = \frac{1}{N_0} \sum_{n \in \textit{class 0}} \mathcal{N}(\textbf{x}| \textbf{x}_n, \sigma^2\textbf{I}) = \frac{1}{N_0} \frac{1}{(2 \pi \sigma^2)^{\frac{D}{2}}} \sum_{n \in \textit{class 0}} e^{-(\textbf{x} - \textbf{x}_n)^2 / (2 \sigma^2) } $
$P(\textbf{x}|c=1) = \frac{1}{N_1} \sum_{n \in \textit{class 1}} \mathcal{N}(\textbf{x}| \textbf{x}_n, \sigma^2\textbf{I}) = \frac{1}{N_1} \frac{1}{(2 \pi \sigma^2)^{\frac{D}{2}}} \sum_{n \in \textit{class 1}} e^{-(\textbf{x} - \textbf{x}_n)^2 / (2 \sigma^2) } $
To classify a new instance $\textbf{x}_{*}$ we calculate the posterior for both classes and take the ratio
$\frac{P(c=0|\textbf{x}_{\star})}{P(c=1|\textbf{x}_{\star})} = \frac{P(\textbf{x}_{\star}|c=0) P(c=0)}{P(\textbf{x}_{\star}|c=1) P(c=1)}$. If this ratio is $\gt 1$ than we classify $\textbf{n}_{\star}$ as class 0, otherwise as class 1. The class probabilities can be determined by maximum likelihood with the following setting $P(c) = \sum_{i=0}^N \frac{N_{i}}{N}$.
TODO: proof!
To understand how this relates to the nearest neighbour method, we need to consider the case $\sigma^2 \rightarrow 0$.
Note that both nominator and denominator are sums of exponentials. Intuitively, if the veriance is small, the nominator will be dominated by the term for which point $x_{n_0} \in class 0$ is closest to $\textbf{x}_{\star}$. The same holds for the denominator and points in class 1.
$\frac{1}{(2 \pi \sigma^2)^{\frac{D}{2}}}$ cancels out, and for vanishingly small values of $\sigma$ we have
$\frac{P(c=0|\textbf{x}_{\star})}{P(c=1|\textbf{x}_{\star})} \approx $
$\frac{ e^{-(\textbf{x}_{\star} - x_{n_0})^2 / (2 \sigma^2)} P(c=0)/N_{0}} {e^{-(\textbf{x}_{\star} - x_{n_1})^2 / (2 \sigma^2)}P(c=1)/N_{1}} $
$\frac{ e^{-(\textbf{x}_{\star} - x_{n_0})^2 / (2 \sigma^2)}} {e^{-(\textbf{x}_{\star} - x_{n_1})^2 / (2 \sigma^2)}} $
In the limit $\sigma^2 \rightarrow 0$ we have
$\frac{ e^{-(\textbf{x}_{\star} - x_{n_0})^2}}{e^{-(\textbf{x}_{\star} - x_{n_1})^2}} $
so we classify $\textbf{x}_{\star}$ as class 0 if $\textbf{x}_{\star}$ has a point in class 0 than the closest point in class 1.
import numpy as np
def log_sum_exp(x):
"""
Log likelihood of a Parzen estimator
"""
a = np.max(x)
return a + np.log(np.sum(np.exp(x-a)))
def log_mean_exp(x):
"""
Log likelihood of a Parzen estimator
"""
a = np.max(x)
return a + np.log(np.mean(np.exp(x-a)))
def parzen(x, mu, sigma=1.0):
"""
Makes a function that allows the evalution of a Parzen
estimator where the Kernel is a normal distribution with
stddev sigma and with points at mu.
Parameters
-----------
x : numpy array
Classification input
mu : numpy matrix
Contains the data points over which this distribution is based.
sigma : scalar
The standard deviation of the normal distribution around each data \
point.
Returns
-------
lpdf : callable
Estimator of the log of the probability density under a point.
"""
# z = (1 / mu.shape[0]) * (1 / np.sqrt(sigma * np.pi * 2.0))
z = mu.shape[0] * np.sqrt(sigma * np.pi * 2.0)
e = (-(x - mu)**2.0) / (2.0 * sigma)
log_p = log_mean_exp(e)
return log_p - z
sigmas = [1e-13]
priorC0 = class0_train.T.shape[0] / (class0_train.T.shape[0] + class1_train.T.shape[0])
priorC1 = class1_train.T.shape[0] / (class1_train.T.shape[0] + class1_train.T.shape[0])
for sigma in sigmas:
correct = 0
for x in class0_test.T:
c0p = priorC0 * parzen(x, class0_train.T, sigma=sigma)
c1p = priorC1 * parzen(x, class1_train.T, sigma=sigma)
if (c0p / c1p) > 1:
correct += 1
print(sigma, correct, class1_test.shape[1], correct / class0_test.shape[1])
correct = 0
for x in class1_test.T:
c0p = priorC0 * parzen(x, class0_train.T, sigma=sigma)
c1p = priorC1 * parzen(x, class1_train.T, sigma=sigma)
if (c0p / c1p) <= 1:
correct += 1
print(sigma, correct, class0_test.shape[1], correct / (class1_test.shape[1]))
1e-13 231 292 0.791095890410959 1e-13 214 292 0.7328767123287672
Problem; when (c0p / c1p) we have a tie! How should we interpret that? My assumption is to assing ties to class 1!
X_train = np.append(class0_train.T, class1_train.T, axis=0)
y_train = ['a'] * class0_train.shape[1] + ['b'] * class1_train.shape[1]
X_train.shape
(1200, 784)
X_test = np.append(class0_test.T, class1_test.T, axis=0)
y_test = ['a'] * class0_test.shape[1] + ['b'] * class1_test.shape[1]
X_test.shape, len(y_test)
((584, 784), 584)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=10,
n_jobs=4,
metric='euclidean',
algorithm='ball_tree')
fitted = knn.fit(X_train, y_train)
?KNeighborsClassifier
y_pred = fitted.predict(X_test)
from sklearn.metrics import accuracy_score, confusion_matrix
print("Accuracy: {}", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
Accuracy: {} 0.98801369863 [[292 0] [ 7 285]]