K Nearest Neighbours is a supervised machine learning algorithm used for classification. It takes in labelled data, and then classifies a new point based on the training points that are most similar to it. K Nearest Neighbours (KNN) is used heavily in search engines to recommend similar results, and is also used in face recognition software to identify a person by their facial features. By the end of this guide you should be confident in your understanding of how the algorithm works, as well as be able to apply it to some simple examples.
The KNN algorithm is very simple. We start with pre-classified data, and when we want to test a point, we check for the K training points with the shortest distance away and take a majority vote of their classes. The obvious question here is what value of K we should use, and there's no easy answer. A small K gives us a higher-"definition" model, but one that is more sensitive to noise, whereas a large K gives us a very general idea of how our data is distributed, but with less definition. Below we can see the "local" effect of small and large Ks - a larger K will sometimes "pick up" other points around the test point, which can skew results:
We'll also look at what happens in a more general sense with different values of k in the simple example below:
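Before reaching for a library, the voting procedure described above can be sketched in a few lines of NumPy. Note that `knn_predict` here is our own illustrative helper, not part of any library:

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, point, k):
    """Classify `point` by majority vote among its k nearest training points."""
    # Euclidean distance from the test point to every training point
    dists = np.linalg.norm(train_X - point, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Majority vote among the labels of those neighbours
    return Counter(train_y[nearest]).most_common(1)[0][0]

# Two small clusters: class 0 near the origin, class 1 near (5, 5)
train_X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                    [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
train_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(train_X, train_y, np.array([0.5, 0.5]), 3))  # 0
print(knn_predict(train_X, train_y, np.array([5.5, 5.5]), 3))  # 1
```

That's the whole algorithm - everything scikit-learn adds on top is efficiency (smarter data structures than a full distance scan) and convenience.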
We're going to start with a very simple example to get used to how to implement this model in Python. Let's start by importing all the libraries and data we'll need:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
df = pd.read_csv("test_data.csv")
print(df.head())
print(df.tail())
   class     x data     y data
0      0  27.095400  28.338278
1      0  36.357474  26.011727
2      0  21.115430   1.768997
3      0  11.524589  16.216907
4      0  31.990294  21.280534
     class     x data     y data
195      1  42.981247  38.108244
196      1  37.655655  29.933272
197      1  29.341556  23.683485
198      1  41.456223  39.274803
199      1  30.655903  31.704547
Since our data is 2 dimensional, we can graph the data to see the classes:
plt.scatter(df['x data'], df['y data'], c=df['class'])
plt.show()
Great! From here we can build our model using the KNeighborsClassifier class - here we're using k=5 for our data.
data = df.drop('class', axis=1).to_numpy()
classes = df['class']
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(data,classes)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5, p=2, weights='uniform')
Now we can predict any number of points and get the class they correspond with using the predict method on our model:
points = [[0,0], [40,40], [30,30]]
print(knn.predict(points))
[0 1 1]
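Beyond a hard class label, the fitted model can also tell us *which* training points were the nearest neighbours (`kneighbors`) and what fraction of them voted for each class (`predict_proba`). The snippet below is self-contained, using small synthetic clusters in place of test_data.csv so it runs on its own:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for test_data.csv: class 0 around (20, 20), class 1 around (35, 35)
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(20, 5, (100, 2)), rng.normal(35, 5, (100, 2))])
classes = np.array([0] * 100 + [1] * 100)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(data, classes)

# kneighbors returns the distances to, and row indices of, the 5 nearest training points
distances, indices = knn.kneighbors([[30, 30]])
print(indices.shape)  # (1, 5): one query point, five neighbours

# predict_proba reports the share of the k neighbours that voted for each class
print(knn.predict_proba([[0, 0]]))  # [[1. 0.]] - all 5 neighbours are class 0
```

`predict` simply takes the argmax of `predict_proba`, so inspecting the probabilities is a quick way to spot borderline points that sit near the class boundary.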
Sometimes it's helpful to see a breakdown of classification by area. To do this we can create a meshgrid using numpy, and use it to draw a background on our plot. This is a bit tough, but it's very powerful and worth it in the end:
#Setting up the meshgrid requires a minimum and maximum bound - it makes sense to use the min and max values of our data for this:
x_min, x_max = data[:, 0].min() - 1, data[:, 0].max() + 1
y_min, y_max = data[:, 1].min() - 1, data[:, 1].max() + 1
#h is the step size of the mesh - lower h means higher resolution.
h=0.1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
#Ravel will take the 2 dimensional xx and yy arrays and "unravel" them into a 1d array.
#c_ then pairs off these points to give a 1 dimensional array with x and y points for every point in the meshgrid.
#We then predict every point and call this Z, then reshape it so it's the same shape as xx and yy
Z = knn.predict(np.c_[xx.ravel(),yy.ravel()])
Z = Z.reshape(xx.shape)
#We then define our own colormaps (you can skip this step if you want), and plot our data using a colormesh and a scatter plot.
from matplotlib.colors import ListedColormap
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00'])
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
plt.scatter(data[:, 0], data[:, 1], c=classes, cmap = cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.show()
Finally we can see what happens when we graph for different values of K, using a for loop.
from matplotlib.colors import ListedColormap
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00'])
#The meshgrid only depends on the data, so we can set it up once outside the loop
x_min, x_max = data[:, 0].min() - 1, data[:, 0].max() + 1
y_min, y_max = data[:, 1].min() - 1, data[:, 1].max() + 1
h = 0.1  #h is the step size of the mesh - lower h means higher resolution
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
for k in [1, 3, 5, 10, 15]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(data, classes)
    #Predict every point on the mesh, then reshape so Z matches xx and yy
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
    plt.scatter(data[:, 0], data[:, 1], c=classes, cmap=cmap_bold)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("KNN with k = " + str(k))
    plt.show()
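Comparing the plots by eye is a good intuition-builder, but if you want a number to back up your choice of k, cross-validation is the standard tool. This is a sketch using scikit-learn's cross_val_score (not used elsewhere in this guide), again with synthetic stand-in data so it runs on its own:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic two-class data standing in for test_data.csv
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(20, 8, (100, 2)), rng.normal(32, 8, (100, 2))])
classes = np.array([0] * 100 + [1] * 100)

# 5-fold cross-validation scores each candidate k on held-out folds, so a
# small k can't win just by memorising the training points it was fit on.
for k in [1, 3, 5, 10, 15]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), data, classes, cv=5)
    print(f"k={k}: mean accuracy {scores.mean():.3f}")
```

The k with the best held-out accuracy is usually a sensible default, though on small or noisy datasets several values of k will score within a fold's worth of each other.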
Below we have some more training and testing data for you to try this out for yourself. Try following the above steps to create a model using the training data, and predict what class the test data is part of. We're switching it up a bit by giving you three dimensional data, with three classes to work with. If you want to see how this data is distributed in a plot, try plotting it using matplotlib's mplot3d toolkit, but remember you don't need to do that to get the predictions right!
training_data = pd.read_csv("project_training.csv")
test_data = [[50,50,50], [100,100,100], [150,150,150], [200,200,200],[250,250,250],[300,300,300],[350,350,350],[400,400,400]]
print(training_data.head())
print(training_data.tail())
   class      x data      y data     z data
0      0   80.543502   41.318858  33.818395
1      0   78.291164  127.006373  41.923430
2      0  121.484717  120.776238  74.294293
3      0   22.123624   70.748242 -17.645028
4      0   63.964959  113.604554  98.920726
     class      x data      y data      z data
395      2  362.127598  413.353847  378.205861
396      2  379.437545  386.067748  381.397157
397      2  364.903396  374.196921  374.635584
398      2  378.062759  428.275662  383.364453
399      2  355.313112  332.401421  399.892779
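If you do want to peek at the three clusters before modelling, a minimal mplot3d sketch looks like this. We build a synthetic stand-in DataFrame with the same columns so the snippet runs on its own - with the real file, just use the commented-out read_csv line instead:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in with the same columns as project_training.csv
rng = np.random.default_rng(2)
training_data = pd.DataFrame({
    'class': np.repeat([0, 1, 2], 50),
    'x data': np.concatenate([rng.normal(c, 40, 50) for c in (80, 230, 370)]),
    'y data': np.concatenate([rng.normal(c, 40, 50) for c in (80, 230, 370)]),
    'z data': np.concatenate([rng.normal(c, 40, 50) for c in (50, 230, 380)]),
})
# training_data = pd.read_csv("project_training.csv")  # use this with the real file

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # the 3d projection ships with matplotlib
ax.scatter(training_data['x data'], training_data['y data'],
           training_data['z data'], c=training_data['class'])
plt.show()
```

The rest of the exercise - fitting the classifier and predicting the test points - works exactly as in the 2D example, since KNeighborsClassifier doesn't care how many feature columns you give it.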