Notebook

In [1]:

%pylab inline

Populating the interactive namespace from numpy and matplotlib

In this notebook I'll show probabilistic interpretation of the nearest neighbours algorith as a mixture of Gaussians. Following Barber 2012. First I'll give an example of ... Next, I'll show how to reformulate ... using

Theory¶

This pretty much follows from Barber, 2012

kNN is simple to understand and implement, and often used as a baseline.

Some limitations of this approach.

In metric based methods, how do we measure distance? euclidean distance does not account for how data is distributed
The whole dataset needs to be stored to make a classification since the novel point must be compared to all of the train points.
Each distance calculation can be expensive if the datapoints are high dimensional

We can reformulate the kNN as a class conditional mixture of Gaussians.

Probabilistic NN¶

In Barber 2012 shows a an of the nearest neighbour method as the limiting case of a mixture model.

What follows is a solution for Exercise 158 from Barber 2012 in python.

Write a routine SoftNearNeigh(xtrain,xtest,trainlabels,sigma) to implement soft nearest neighbours, analogous to nearNeigh.m . Here sigma is the variance σ 2 in equation (14.3.1). As above, the file NNdata.mat contains training and test data for the handwritten digits 5 and 9. Using leave one out cross-validation, find the optimal σ 2 and use this to compute the classification accuracy of the method on the test data. Hint: you may have numerical difficulty with this method. To avoid this, consider using the logarithm, and how to numerically compute log ( e a

e b ) for large (negative) a and b . See also logsumexp.m .

Data¶

For this exercise we'll be using a subsent of the MNIST dataset provided in BRMLtoolkit at Barber 2012.

In [1]:

import scipy.io as sio

In [2]:

nndata = sio.loadmat('/Users/gm/Downloads/BRMLtoolkit/data/NNdata.mat')

In [3]:

nndata

Out[3]:

{'__globals__': [],
 '__header__': b'MATLAB 5.0 MAT-file, Platform: PCWIN, Created on: Sat Aug 01 14:24:46 2009',
 '__version__': '1.0',
 'test5': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ..., 
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'test9': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ..., 
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'train5': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ..., 
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'train9': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ..., 
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)}

We want to solve a binary classification problem.

In [4]:

class0_train = nndata['train5']
class0_test = nndata['test5']

class1_train = nndata['train9']
class1_test = nndata['test9']

Barber follows a generative approach and uses kernel density estimation (Parzen estimator) to interpret kNN as the limiting case of a mixture of Gaussians. An isotropic Gaussian of width $\sigma^2$ is placed at each data point, and a mixture is used to model each class.

The Parzen estimator¶

With kernel density estimation we want to approximate a PDF with a mixture of continuos probability distributions. A Parzen estimator centers a probability distribution at each data point $\textbf{x}_n$ as

$P(\textbf{x}) = \frac{1}{N} \sum_{n=1}^{N} P(\textbf{x}|\textbf{x}_{n})$

For a D dimensional $\textbf{x}$ we choose an isotropic Gaussian $P(\textbf{x}|\textbf{x}_{n}) = \mathcal{N} (\textbf{x}|\textbf{x}_{n}, \sigma^2 \textbf{I}_{D})$, which gives the mixture

$P(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2 \pi \sigma^2)^{D/2}}\mathcal{e}^{- (\textbf{x} - \textbf{x}_n)^2 / 2\sigma^2} $

Nearest Neighbour classification¶

Given classes $c = {0, 1}$, we consider the following mixture model

$P(\textbf{x}|c=0) = \frac{1}{N_0} \sum_{n \in \textit{class 0}} \mathcal{N}(\textbf{x}| \textbf{x}_n, \sigma^2\textbf{I}) = \frac{1}{N_0} \frac{1}{(2 \pi \sigma^2)^{\frac{D}{2}}} \sum_{n \in \textit{class 0}} e^{-(\textbf{x} - \textbf{x}_n)^2 / (2 \sigma^2) } $

$P(\textbf{x}|c=1) = \frac{1}{N_1} \sum_{n \in \textit{class 1}} \mathcal{N}(\textbf{x}| \textbf{x}_n, \sigma^2\textbf{I}) = \frac{1}{N_1} \frac{1}{(2 \pi \sigma^2)^{\frac{D}{2}}} \sum_{n \in \textit{class 1}} e^{-(\textbf{x} - \textbf{x}_n)^2 / (2 \sigma^2) } $

To classify a new instance $\textbf{x}_{*}$ we calculate the posterior for both classes and take the ratio

$\frac{P(c=0|\textbf{x}_{\star})}{P(c=1|\textbf{x}_{\star})} = \frac{P(\textbf{x}_{\star}|c=0) P(c=0)}{P(\textbf{x}_{\star}|c=1) P(c=1)}$. If this ratio is $\gt 1$ than we classify $\textbf{n}_{\star}$ as class 0, otherwise as class 1. The class probabilities can be determined by maximum likelihood with the following setting $P(c) = \sum_{i=0}^N \frac{N_{i}}{N}$.

TODO: proof!

To understand how this relates to the nearest neighbour method, we need to consider the case $\sigma^2 \rightarrow 0$.

Note that both nominator and denominator are sums of exponentials. Intuitively, if the veriance is small, the nominator will be dominated by the term for which point $x_{n_0} \in class 0$ is closest to $\textbf{x}_{\star}$. The same holds for the denominator and points in class 1.

$\frac{1}{(2 \pi \sigma^2)^{\frac{D}{2}}}$ cancels out, and for vanishingly small values of $\sigma$ we have

$\frac{P(c=0|\textbf{x}_{\star})}{P(c=1|\textbf{x}_{\star})} \approx $

$\frac{ e^{-(\textbf{x}_{\star} - x_{n_0})^2 / (2 \sigma^2)} P(c=0)/N_{0}} {e^{-(\textbf{x}_{\star} - x_{n_1})^2 / (2 \sigma^2)}P(c=1)/N_{1}} $

$\frac{ e^{-(\textbf{x}_{\star} - x_{n_0})^2 / (2 \sigma^2)}} {e^{-(\textbf{x}_{\star} - x_{n_1})^2 / (2 \sigma^2)}} $

In the limit $\sigma^2 \rightarrow 0$ we have

$\frac{ e^{-(\textbf{x}_{\star} - x_{n_0})^2}}{e^{-(\textbf{x}_{\star} - x_{n_1})^2}} $

so we classify $\textbf{x}_{\star}$ as class 0 if $\textbf{x}_{\star}$ has a point in class 0 than the closest point in class 1.

Implementation¶

In [428]:

import numpy as np

def log_sum_exp(x):
    """
    Log likelihood of a Parzen estimator
    """  
    a = np.max(x)
    return a + np.log(np.sum(np.exp(x-a)))

def log_mean_exp(x):
    """
    Log likelihood of a Parzen estimator
    """  
    a = np.max(x)
    return a + np.log(np.mean(np.exp(x-a)))

def parzen(x, mu, sigma=1.0):
    """
    Makes a function that allows the evalution of a Parzen 
    estimator where the Kernel is a normal distribution with 
    stddev sigma and with points at mu.
    
    Parameters
    -----------
    x : numpy array
        Classification input
    mu : numpy matrix
        Contains the data points over which this distribution is based.
    sigma : scalar
        The standard deviation of the normal distribution around each data \
        point.
    Returns
    -------
    lpdf : callable
        Estimator of the log of the probability density under a point.
    """
    # z = (1 / mu.shape[0]) * (1 / np.sqrt(sigma * np.pi * 2.0))
    z = mu.shape[0] * np.sqrt(sigma * np.pi * 2.0)
    e = (-(x - mu)**2.0) / (2.0 * sigma)
    log_p = log_mean_exp(e) 
    return log_p - z

In [445]:

sigmas = [1e-13]

priorC0 = class0_train.T.shape[0] / (class0_train.T.shape[0] + class1_train.T.shape[0])
priorC1 = class1_train.T.shape[0] / (class1_train.T.shape[0] + class1_train.T.shape[0])


for sigma in sigmas:
    correct = 0
    for x in class0_test.T:
        c0p = priorC0 * parzen(x, class0_train.T, sigma=sigma) 
        c1p = priorC1 * parzen(x, class1_train.T, sigma=sigma) 

        if (c0p / c1p) > 1:
            correct += 1

    print(sigma, correct, class1_test.shape[1], correct / class0_test.shape[1])

    correct = 0
    for x in class1_test.T:
        c0p = priorC0 * parzen(x, class0_train.T, sigma=sigma) 
        c1p = priorC1 * parzen(x, class1_train.T, sigma=sigma)

        if (c0p / c1p) <= 1:
            correct += 1

    print(sigma, correct, class0_test.shape[1], correct / (class1_test.shape[1]))

1e-13 231 292 0.791095890410959
1e-13 214 292 0.7328767123287672

Problem; when (c0p / c1p) we have a tie! How should we interpret that? My assumption is to assing ties to class 1!

kNN¶

In [110]:

X_train = np.append(class0_train.T, class1_train.T, axis=0)
y_train = ['a'] * class0_train.shape[1] + ['b'] * class1_train.shape[1]

In [112]:

X_train.shape

Out[112]:

(1200, 784)

In [120]:

X_test = np.append(class0_test.T, class1_test.T, axis=0)
y_test = ['a'] * class0_test.shape[1] + ['b'] * class1_test.shape[1]

In [296]:

X_test.shape, len(y_test)

Out[296]:

((584, 784), 584)

In [402]:

from sklearn.neighbors import  KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=10, 
                           n_jobs=4, 
                           metric='euclidean', 
                           algorithm='ball_tree')
fitted = knn.fit(X_train, y_train)

In [291]:

?KNeighborsClassifier

In [404]:

y_pred = fitted.predict(X_test)

In [405]:

from sklearn.metrics import accuracy_score, confusion_matrix

print("Accuracy: {}", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

Accuracy: {} 0.98801369863
[[292   0]
 [  7 285]]