Implement the k-nearest neighbors (KNN) algorithm for classification.
This notebook introduces the nearest neighbors model for classification. We will first implement the model ourselves and test it on the Iris flower data set, where the goal is to predict the class of a flower based on its features. Next, we will use the Endometrium vs. Uterus cancer data set, where the goal is to separate endometrium tumors from the uterine ones.
After finishing this notebook, you should be able to explain the nearest neighbors model and use its scikit-learn implementation.
The Iris flower data set consists of 150 data points, where each data point is a feature vector $\boldsymbol x \in \mathbb{R}^4$ describing the attributes of a flower in the data set.
The four dimensions represent:
- sepal length (cm)
- sepal width (cm)
- petal length (cm)
- petal width (cm)

and the corresponding target $y \in \mathbb{Z}$ describes the class of the flower. There are three classes of flowers in this data set:
- Iris Setosa
- Iris Versicolour
- Iris Virginica
import sys
assert sys.version_info >= (3, 6)
import numpy
assert numpy.__version__ >="1.17.3"
import numpy as np
import matplotlib.pyplot as plt
import pandas
assert pandas.__version__ >= "0.25.1"
import pandas as pd
import sklearn
assert sklearn.__version__ >= "0.21.3"
from sklearn import datasets
%matplotlib inline
iris = datasets.load_iris()
print('data shape is {}'.format(iris.data.shape))
print('class shape is {}'.format(iris.target.shape))
For simplicity, we will use only the first two features (i.e., sepal length and sepal width) of the data set to classify the flowers.
X = iris.data[:, :2]
y = iris.target
We create a scatter plot of the data set below. The x- and y-axes represent the sepal length and sepal width, and the colors of the points represent the three classes of flowers.
from matplotlib.colors import ListedColormap
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
# light counterpart of cmap_bold, used later for the decision-region plot
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
K = 3
x = X[-1]  # an example test point, used below

fig, ax = plt.subplots(figsize=(4, 4))
for i, iris_class in enumerate(['Iris Setosa', 'Iris Versicolour', 'Iris Virginica']):
    idx = y == i
    ax.scatter(X[idx, 0], X[idx, 1],
               c=cmap_bold.colors[i], edgecolor='k',
               s=20, label=iris_class)
ax.set(xlabel='sepal length (cm)', ylabel='sepal width (cm)')
ax.legend();
The idea behind a k-nearest neighbor classifier is very simple: to classify a new point, find the $k$ training points closest to it and predict the class that is most common among those $k$ neighbors.
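In symbols, writing $N_k(\boldsymbol x)$ for the set of indices of the $k$ training points closest to a query point $\boldsymbol x$ (notation introduced here for clarity), the prediction is the majority vote
$$\hat{y} = \arg\max_{c} \sum_{i \in N_k(\boldsymbol x)} \mathbb{1}[y_i = c].$$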
First, we need to compute the pairwise distance matrix, whose entries are the distances between the rows of two matrices.
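Concretely, using the Euclidean distance (the distance used in the implementation below), each entry is
$$D_{ij} = \lVert \boldsymbol x_i - \boldsymbol y_j \rVert_2 = \sqrt{\sum_{d=1}^{D} (X_{id} - Y_{jd})^2},$$
where $\boldsymbol x_i$ denotes the $i$th row of $X$ and $\boldsymbol y_j$ the $j$th row of $Y$.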
def pairwise_distance_matrix(X, Y):
    """Compute the pairwise distances between the rows of X and the rows of Y.

    Arguments
    ---------
    X: ndarray of size (N, D)
    Y: ndarray of size (M, D)

    Returns
    -------
    distance_matrix: ndarray of shape (N, M), where each entry
        distance_matrix[i, j] is the Euclidean distance between the
        ith row of X and the jth row of Y.
    """
    N, D = X.shape
    M, _ = Y.shape
    # Broadcast to shape (N, M, D) to get all pairwise differences,
    # then take the Euclidean norm along the feature axis.
    differences = X[:, None, :] - Y[None, :, :]
    distance_matrix = np.linalg.norm(differences, axis=-1)
    return distance_matrix
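A quick sanity check of the vectorized implementation against a naive double loop (the shapes of the random matrices here are arbitrary):
rng = np.random.RandomState(0)
A = rng.randn(5, 4)
B = rng.randn(3, 4)
D = pairwise_distance_matrix(A, B)
# The naive version computes each entry with an explicit norm.
D_naive = np.array([[np.linalg.norm(a - b) for b in B] for a in A])
assert np.allclose(D, D_naive)
print(D.shape)  # (5, 3)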
def KNN(k, X, y, x):
    """K nearest neighbors classifier.

    Arguments
    ---------
    k: number of nearest neighbors
    X: training set, ndarray of size (N, D)
    y: training labels, ndarray of size (N,)
    x: test point, ndarray of size (1, D) or (D,)

    Returns
    -------
    the class with the most votes among the k nearest neighbors of x
    """
    x = x.reshape(1, -1) if len(x.shape) == 1 else x
    N, D = X.shape
    num_classes = len(np.unique(y))
    # distances between the test point and every training point
    distances = pairwise_distance_matrix(x, X).ravel()
    # make the predictions
    ypred = np.zeros(num_classes)
    # find the labels of the k-nearest neighbors
    classes = y[np.argsort(distances)][:k]
    for c in np.unique(classes):
        ypred[c] = len(classes[classes == c])
    return np.argmax(ypred)
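As a quick sanity check, we can classify the example point x defined earlier, using the remaining flowers as the training set:
# Classify x = X[-1] with all other points as training data.
pred = KNN(K, X[:-1], y[:-1], x)
print('predicted class: {}, true class: {}'.format(pred, y[-1]))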
# Visualize the decision regions: classify every point on a grid covering
# the data, then color the grid by the predicted class.
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
step = 0.1
xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                     np.arange(y_min, y_max, step))

ypred = []
for data in np.array([xx.ravel(), yy.ravel()]).T:
    ypred.append(KNN(K, X, y, data.reshape(1, 2)))

fig, ax = plt.subplots(figsize=(5, 5))
ax.pcolormesh(xx, yy, np.array(ypred).reshape(xx.shape), cmap=cmap_light)
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, edgecolor='k', s=20);
In this data set, each observation is a tumor, described by the expression of 3,000 genes. The expression of a gene is a measure of how much of that gene is present in the biological sample. Because this affects how much of the protein the gene codes for is produced, and because proteins dictate what cells can do, gene expression gives us valuable information about the tumor. In particular, the expression of the same gene in the same individual differs across tissues.
The goal is to separate the endometrium tumors from the uterine ones.
# read the tumor data set: `data/endometrium_uterus_tumor.csv`
endometrium_data = pd.read_csv('data/endometrium_uterus_tumor.csv')
print(endometrium_data.shape)
endometrium_data.head()
np.unique(endometrium_data['Tissue'])
Create the data matrix X by dropping the target column (Tissue) and the columns unnecessary for the classification (ID_REF). Encode the target y to be 0 for Endometrium tumors and 1 for Uterus. Hint: check the method get_dummies of pandas.
# drop from the dataframe the columns Tissue and ID_REF
X = endometrium_data.drop(columns=['Tissue', 'ID_REF']).values
# encode the target variable and get its values (0 = Endometrium, 1 = Uterus)
y = pd.get_dummies(endometrium_data['Tissue'])['Uterus'].values
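A quick check of the encoding (class counts; index 0 = Endometrium and index 1 = Uterus under the encoding above):
print(np.bincount(y))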
Plot a scatter of the first two variables of the data set, colored by class.
plt.scatter(X[:,0], X[:,1], c=y)
from sklearn import model_selection

def stratifiedKFold(y, num_folds):
    # Stratified folds preserve the class proportions of y in each fold.
    kf = model_selection.StratifiedKFold(n_splits=num_folds)
    folds_regr = [(tr, te) for (tr, te) in kf.split(np.zeros(y.size), y)]
    return folds_regr
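A small illustration (toy labels chosen arbitrarily): with stratification, each test fold contains both classes in roughly the original 2:1 proportion.
# Toy example: 8 negatives and 4 positives split into 2 stratified folds.
toy_y = np.array([0] * 8 + [1] * 4)
for tr, te in stratifiedKFold(toy_y, 2):
    print('test labels:', toy_y[te])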
Create 10 stratified cross-validation folds on the data.
cv_folds = stratifiedKFold(y, 10)
Cross-validate 20 k-NN classifiers on the loaded data set using the cross_validate function of the module cross_validation.
from cross_validation import cross_validate
In scikit-learn, a k-neighbours classifier can be initialised as knn_clf = sklearn.neighbors.KNeighborsClassifier(n_neighbors=k)
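A minimal sketch of that API, using the first of our cv_folds (the value k=5 here is arbitrary):
from sklearn import neighbors
# Fit on the training indices of the first fold; predict_proba returns,
# for each test point, the fraction of its k neighbors in each class.
tr, te = cv_folds[0]
knn_clf = neighbors.KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X[tr], y[tr])
print(knn_clf.predict_proba(X[te])[:3])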
from sklearn import neighbors
from sklearn import metrics

aurocs_clf = []
# Create a range of values of k: the 20 odd numbers from 1 to 39.
k_range = range(1, 40, 2)
for k in k_range:
    clf_knn = neighbors.KNeighborsClassifier(n_neighbors=k)  # create a k-neighbors classifier
    y_pred = cross_validate(X, y, clf_knn, cv_folds)
    fpr, tpr, thresholds = metrics.roc_curve(y, y_pred[:, 1])
    aurocs_clf.append(metrics.auc(fpr, tpr))
plt.plot(k_range, aurocs_clf, color='blue')
plt.xlabel('Number of nearest neighbours')
plt.ylabel('Cross-validated AUC')
plt.title('Nearest neighbours classification - cross validated AUC.')
Question. Find the best value of the parameter n_neighbors, i.e. the one that gives the maximum value of AUC.
best_k = np.argmax(aurocs_clf)
print(k_range[best_k])
Use sklearn.model_selection.GridSearchCV to find the best number of neighbors (i.e., the n_neighbors parameter).
from sklearn import model_selection

classifier = neighbors.KNeighborsClassifier()  # base classifier to tune
param_grid = {'n_neighbors': k_range}
clf_knn_opt = model_selection.GridSearchCV(classifier,
                                           param_grid=param_grid,
                                           cv=cv_folds,
                                           scoring='roc_auc')
clf_knn_opt.fit(X, y)
Print the best number of neighbors found by the grid search.
print(clf_knn_opt.best_params_)
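For reference, the fitted GridSearchCV object also stores the cross-validated score achieved by the best parameter:
# Cross-validated AUC of the best value of n_neighbors.
print(clf_knn_opt.best_score_)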
Question: Is the optimum equal to the one computed earlier (i.e. best_k)?