Implement the k-nearest neighbors (KNN) algorithm for classification.
This notebook introduces the nearest neighbors model for classification. We will first implement the model ourselves and test it on the Iris flower data set, where the goal is to predict the class of a flower based on its features. Next, we will use the Endometrium vs. Uterus cancer data set, where the goal is to separate endometrium tumors from the uterine ones.
After finishing this notebook, you should be able to explain the nearest neighbors model and use its scikit-learn implementation.
The Iris flower data set consists of 150 data points, where each data point is a feature vector $\boldsymbol x \in \mathbb{R}^4$ describing the attributes of a flower in the data set.
The four dimensions represent:
- sepal length (cm)
- sepal width (cm)
- petal length (cm)
- petal width (cm)

and the corresponding target $y \in \mathbb{Z}$ describes the class of the flower. There are three classes of flowers in this data set:
- Iris Setosa
- Iris Versicolour
- Iris Virginica
import sys
assert sys.version_info >= (3, 6)
import numpy
assert numpy.__version__ >="1.17.3"
import numpy as np
import matplotlib.pyplot as plt
import pandas
assert pandas.__version__ >= "0.25.1"
import pandas as pd
import sklearn
assert sklearn.__version__ >= "0.21.3"
from sklearn import datasets
%matplotlib inline
iris = datasets.load_iris()
print('data shape is {}'.format(iris.data.shape))
print('class shape is {}'.format(iris.target.shape))
For simplicity, we will use only the first two features (i.e., sepal length and sepal width) of the data set to classify the flowers.
X = iris.data[:, :2]
y = iris.target
We create a scatter plot of the data set below. The x- and y-axes represent the sepal length and sepal width, and the colors of the points represent the three classes of flowers.
from matplotlib.colors import ListedColormap
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
# light counterpart of cmap_bold, used later for the decision-region plot
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
K = 3
x = X[-1]  # an example test point, used below

fig, ax = plt.subplots(figsize=(4, 4))
for i, iris_class in enumerate(['Iris Setosa', 'Iris Versicolour', 'Iris Virginica']):
    idx = y == i
    ax.scatter(X[idx, 0], X[idx, 1],
               c=cmap_bold.colors[i], edgecolor='k',
               s=20, label=iris_class)
ax.set(xlabel='sepal length (cm)', ylabel='sepal width (cm)')
ax.legend();
The idea behind a k-nearest neighbor classifier is very simple: to classify a new point, find the $k$ training points closest to it and predict the class that is most common among those $k$ neighbors.
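In symbols, writing $N_k(\boldsymbol x)$ for the set of indices of the $k$ training points closest to a query point $\boldsymbol x$ (notation introduced here for clarity), the prediction is the majority vote
$$\hat{y} = \arg\max_{c} \sum_{i \in N_k(\boldsymbol x)} \mathbb{1}[y_i = c].$$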
First, we need to compute the pairwise distance matrix, whose entries are the distances between the rows of two matrices.
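Concretely, using the Euclidean distance (the distance used in the implementation below), each entry is
$$D_{ij} = \lVert \boldsymbol x_i - \boldsymbol y_j \rVert_2 = \sqrt{\sum_{d=1}^{D} (X_{id} - Y_{jd})^2},$$
where $\boldsymbol x_i$ denotes the $i$th row of $X$ and $\boldsymbol y_j$ the $j$th row of $Y$.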
def pairwise_distance_matrix(X, Y):
    """Compute the pairwise distances between the rows of X and the rows of Y.

    Arguments
    ---------
    X: ndarray of size (N, D)
    Y: ndarray of size (M, D)

    Returns
    -------
    distance_matrix: ndarray of shape (N, M), where each entry
        distance_matrix[i, j] is the Euclidean distance between the
        ith row of X and the jth row of Y.
    """
    N, D = X.shape
    M, _ = Y.shape
    # Broadcast to shape (N, M, D) to get all pairwise differences,
    # then take the Euclidean norm along the feature axis.
    differences = X[:, None, :] - Y[None, :, :]
    distance_matrix = np.linalg.norm(differences, axis=-1)
    return distance_matrix
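A quick sanity check of the vectorized implementation against a naive double loop (the shapes of the random matrices here are arbitrary):
rng = np.random.RandomState(0)
A = rng.randn(5, 4)
B = rng.randn(3, 4)
D = pairwise_distance_matrix(A, B)
# The naive version computes each entry with an explicit norm.
D_naive = np.array([[np.linalg.norm(a - b) for b in B] for a in A])
assert np.allclose(D, D_naive)
print(D.shape)  # (5, 3)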
def KNN(k, X, y, x):
    """K nearest neighbors classifier.

    Arguments
    ---------
    k: number of nearest neighbors
    X: training set, ndarray of size (N, D)
    y: training labels, ndarray of size (N,)
    x: test point, ndarray of size (1, D) or (D,)

    Returns
    -------
    the class with the most votes among the k nearest neighbors of x
    """
    x = x.reshape(1, -1) if len(x.shape) == 1 else x
    N, D = X.shape
    num_classes = len(np.unique(y))
    # distances between the test point and every training point
    distances = pairwise_distance_matrix(x, X).ravel()
    # make the predictions
    ypred = np.zeros(num_classes)
    # find the labels of the k-nearest neighbors
    classes = y[np.argsort(distances)][:k]
    for c in np.unique(classes):
        ypred[c] = len(classes[classes == c])
    return np.argmax(ypred)
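As a quick sanity check, we can classify the example point x defined earlier, using the remaining flowers as the training set:
# Classify x = X[-1] with all other points as training data.
pred = KNN(K, X[:-1], y[:-1], x)
print('predicted class: {}, true class: {}'.format(pred, y[-1]))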
# Visualize the decision regions: classify every point on a grid covering
# the data, then color the grid by the predicted class.
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
step = 0.1
xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                     np.arange(y_min, y_max, step))

ypred = []
for data in np.array([xx.ravel(), yy.ravel()]).T:
    ypred.append(KNN(K, X, y, data.reshape(1, 2)))

fig, ax = plt.subplots(figsize=(5, 5))
ax.pcolormesh(xx, yy, np.array(ypred).reshape(xx.shape), cmap=cmap_light)
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, edgecolor='k', s=20);
In this data set, each observation is a tumor, described by the expression of 3,000 genes. The expression of a gene is a measure of how much of that gene is present in the biological sample. Because this affects how much of the protein the gene codes for is produced, and because proteins dictate what cells can do, gene expression gives us valuable information about the tumor. In particular, the expression of the same gene in the same individual differs across tissues.
The goal is to separate the endometrium tumors from the uterine ones.
# read the tumor data set: `data/endometrium_uterus_tumor.csv`
endometrium_data = pd.read_csv('data/endometrium_uterus_tumor.csv')
print(endometrium_data.shape)
endometrium_data.head()
np.unique(endometrium_data['Tissue'])
Create the data matrix X by dropping the target column (Tissue) and the columns unnecessary for the classification (ID_REF). Encode the target y to be 0 for Endometrium tumors and 1 for Uterus. Hint: check the method get_dummies of pandas.
# drop from the dataframe the columns Tissue and ID_REF
X = endometrium_data.drop(columns=['Tissue', 'ID_REF']).values
# encode the target variable and get its values (0 = Endometrium, 1 = Uterus)
y = pd.get_dummies(endometrium_data['Tissue'])['Uterus'].values
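A quick check of the encoding (class counts; index 0 = Endometrium and index 1 = Uterus under the encoding above):
print(np.bincount(y))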
Plot a scatter of the first two variables of the data set, colored by class.
plt.scatter(X[:,0], X[:,1], c=y)
from sklearn import model_selection

def stratifiedKFold(y, num_folds):
    # Stratified folds preserve the class proportions of y in each fold.
    kf = model_selection.StratifiedKFold(n_splits=num_folds)
    folds_regr = [(tr, te) for (tr, te) in kf.split(np.zeros(y.size), y)]
    return folds_regr
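A small illustration (toy labels chosen arbitrarily): with stratification, each test fold contains both classes in roughly the original 2:1 proportion.
# Toy example: 8 negatives and 4 positives split into 2 stratified folds.
toy_y = np.array([0] * 8 + [1] * 4)
for tr, te in stratifiedKFold(toy_y, 2):
    print('test labels:', toy_y[te])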
Create 10 stratified cross-validation folds on the data.
cv_folds = stratifiedKFold(y, 10)
Cross-validate 20 k-NN classifiers on the loaded data set using the cross_validate function of the module cross_validation.
from cross_validation import cross_validate
In scikit-learn, a k-neighbours classifier can be initialised as knn_clf = sklearn.neighbors.KNeighborsClassifier(n_neighbors=k)
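A minimal sketch of that API, using the first of our cv_folds (the value k=5 here is arbitrary):
from sklearn import neighbors
# Fit on the training indices of the first fold; predict_proba returns,
# for each test point, the fraction of its k neighbors in each class.
tr, te = cv_folds[0]
knn_clf = neighbors.KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X[tr], y[tr])
print(knn_clf.predict_proba(X[te])[:3])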
from sklearn import neighbors
from sklearn import metrics

aurocs_clf = []
# Create a range of values of k: the 20 odd numbers from 1 to 39.
k_range = range(1, 40, 2)
for k in k_range:
    clf_knn = neighbors.KNeighborsClassifier(n_neighbors=k)  # create a k-neighbors classifier
    y_pred = cross_validate(X, y, clf_knn, cv_folds)
    fpr, tpr, thresholds = metrics.roc_curve(y, y_pred[:, 1])
    aurocs_clf.append(metrics.auc(fpr, tpr))
plt.plot(k_range, aurocs_clf, color='blue')
plt.xlabel('Number of nearest neighbours')
plt.ylabel('Cross-validated AUC')
plt.title('Nearest neighbours classification - cross validated AUC.')
Question. Find the best value of the parameter n_neighbors, i.e. the one that gives the maximum value of AUC.
best_k = np.argmax(aurocs_clf)
print(k_range[best_k])
Use sklearn.model_selection.GridSearchCV to find the best number of neighbors (i.e., the n_neighbors parameter).
from sklearn import model_selection

classifier = neighbors.KNeighborsClassifier()  # base classifier to tune
param_grid = {'n_neighbors': k_range}
clf_knn_opt = model_selection.GridSearchCV(classifier,
                                           param_grid=param_grid,
                                           cv=cv_folds,
                                           scoring='roc_auc')
clf_knn_opt.fit(X, y)
Print the best number of neighbors found by the grid search.
print(clf_knn_opt.best_params_)
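For reference, the fitted GridSearchCV object also stores the cross-validated score achieved by the best parameter:
# Cross-validated AUC of the best value of n_neighbors.
print(clf_knn_opt.best_score_)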
Question: Is the optimum equal to the one computed earlier (i.e. best_k)?