In this example we will look at how to use the K-Nearest-Neighbor (KNN) algorithm for classification. We will use a modified version of the Video Store data set, with the "Incidentals" attribute as the target attribute for classification (the class attribute). The goal is to classify an unseen instance as "Yes" or "No" based on the "Incidentals" values of its nearest neighbors among the training instances.

In [1]:
import numpy as np
import pandas as pd
In [2]:
vstable = pd.read_csv("http://facweb.cs.depaul.edu/mobasher/classes/csc478/data/Video_Store_2.csv", index_col=0)

vstable.shape
Out[2]:
(50, 7)
In [3]:
vstable.head()
Out[3]:
Gender Income Age Rentals Avg Per Visit Genre Incidentals
Cust ID
1 M 45000 25 32 2.5 Action Yes
2 F 54000 33 12 3.4 Drama No
3 F 32000 20 42 1.6 Comedy No
4 F 59000 70 16 4.2 Drama Yes
5 M 37000 35 25 3.2 Action Yes

We will split the data into training and test partitions: the training partition is used to find the K nearest neighbors, and the test partition is used to evaluate the model's error rate. Before splitting, we randomly reshuffle the data to make sure the instances are in random order.

In [4]:
vs = vstable.reindex(np.random.permutation(vstable.index))
vs.head(10)
Out[4]:
Gender Income Age Rentals Avg Per Visit Genre Incidentals
Cust ID
23 F 2000 15 30 2.5 Comedy No
39 F 68000 35 19 3.9 Comedy No
35 M 74000 29 43 4.6 Action Yes
9 M 38000 21 18 2.1 Comedy No
48 F 52000 47 14 1.6 Drama No
45 M 56000 38 30 3.5 Drama Yes
33 M 23000 25 28 2.7 Action No
27 F 62000 47 32 3.6 Drama No
41 F 50000 33 17 1.4 Drama No
31 F 49000 56 15 3.2 Comedy No
In [5]:
len(vs)
Out[5]:
50
In [6]:
vs_names = vs.columns.values
vs_names
Out[6]:
array(['Gender', 'Income', 'Age', 'Rentals', 'Avg Per Visit', 'Genre',
       'Incidentals'], dtype=object)

The target attribute for classification is "Incidentals". Its values are the class labels (in this case "Yes" and "No") corresponding to the instances in the data.

In [7]:
vs_target = vs.Incidentals

Before we can compute distances, we need to convert the data (excluding the target attribute "Incidentals", which contains the class labels) into standard spreadsheet format, with binary dummy variables created for each categorical attribute.

In [8]:
vs = pd.get_dummies(vs[['Gender','Income','Age','Rentals','Avg Per Visit','Genre']])
vs.head(10)
Out[8]:
Income Age Rentals Avg Per Visit Gender_F Gender_M Genre_Action Genre_Comedy Genre_Drama
Cust ID
23 2000 15 30 2.5 1 0 0 1 0
39 68000 35 19 3.9 1 0 0 1 0
35 74000 29 43 4.6 0 1 1 0 0
9 38000 21 18 2.1 0 1 0 1 0
48 52000 47 14 1.6 1 0 0 0 1
45 56000 38 30 3.5 0 1 0 0 1
33 23000 25 28 2.7 0 1 1 0 0
27 62000 47 32 3.6 1 0 0 0 1
41 50000 33 17 1.4 1 0 0 0 1
31 49000 56 15 3.2 1 0 0 1 0

To be able to evaluate the accuracy of our predictions, we will split the data into training and test sets. In this case, we will use 80% for training and the remaining 20% for testing. Note that we must also apply the same split to the target attribute.

In [9]:
tpercent = 0.8
tsize = int(tpercent * len(vs))
vs_train = vs[:tsize]
vs_test = vs[tsize:]
In [10]:
print(vs_train.shape)
print(vs_test.shape)
(40, 9)
(10, 9)
In [11]:
vs_train.head(10)
Out[11]:
Income Age Rentals Avg Per Visit Gender_F Gender_M Genre_Action Genre_Comedy Genre_Drama
Cust ID
23 2000 15 30 2.5 1 0 0 1 0
39 68000 35 19 3.9 1 0 0 1 0
35 74000 29 43 4.6 0 1 1 0 0
9 38000 21 18 2.1 0 1 0 1 0
48 52000 47 14 1.6 1 0 0 0 1
45 56000 38 30 3.5 0 1 0 0 1
33 23000 25 28 2.7 0 1 1 0 0
27 62000 47 32 3.6 1 0 0 0 1
41 50000 33 17 1.4 1 0 0 0 1
31 49000 56 15 3.2 1 0 0 1 0
In [12]:
vs_test
Out[12]:
Income Age Rentals Avg Per Visit Gender_F Gender_M Genre_Action Genre_Comedy Genre_Drama
Cust ID
20 12000 16 23 2.2 0 1 1 0 0
28 57000 52 22 4.1 0 1 0 1 0
14 45000 36 24 2.7 0 1 0 0 1
4 59000 70 16 4.2 1 0 0 0 1
26 56000 35 40 2.6 1 0 1 0 0
38 41000 38 20 3.3 0 1 0 0 1
15 68000 30 36 2.7 0 1 0 1 0
22 25000 33 16 2.9 0 1 0 0 1
7 29000 45 19 3.8 1 0 0 0 1
21 47000 52 11 3.1 1 0 0 0 1

Splitting the target attribute ("Incidentals") accordingly:

In [13]:
vs_target_train = vs_target[:tsize]
vs_target_test = vs_target[tsize:]
In [14]:
vs_target_train.head()
Out[14]:
Cust ID
23     No
39     No
35    Yes
9      No
48     No
Name: Incidentals, dtype: object
In [15]:
vs_target_test
Out[15]:
Cust ID
20    Yes
28     No
14     No
4     Yes
26    Yes
38    Yes
15    Yes
22    Yes
7      No
21     No
Name: Incidentals, dtype: object

Next, we normalize the attributes so that everything is on a [0,1] scale. We could use the normalization functions from the kNN module in Ch. 2 of the text; in this case, however, we will use the more flexible and robust MinMaxScaler from the preprocessing module of the scikit-learn package.
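For reference, min-max normalization can also be done directly with NumPy. The sketch below is only illustrative (the function name is made up) and is not used in the rest of the notebook:

def min_max_normalize(D):
    # scale each column to [0,1]: (value - column min) / (column max - column min)
    mins = D.min(axis=0)
    ranges = D.max(axis=0) - mins
    return (D - mins) / ranges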

In [16]:
from sklearn import preprocessing
In [17]:
min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler.fit(vs_train)
Out[17]:
MinMaxScaler(copy=True, feature_range=(0, 1))
In [18]:
vs_train_norm = min_max_scaler.fit_transform(vs_train)
vs_test_norm = min_max_scaler.fit_transform(vs_test)

Note that MinMaxScaler returns a NumPy ndarray. Also note that, strictly speaking, the scaler should be fit only on the training data and then applied to both partitions, so that test instances are scaled using the training ranges; above we call fit_transform on each partition separately for simplicity.
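A minimal sketch of that more careful pattern, using the same min_max_scaler object defined above (the notebook continues with the per-partition values produced earlier, so the printed results below are unaffected):

min_max_scaler.fit(vs_train)                        # learn min/max from the training partition only
vs_train_norm = min_max_scaler.transform(vs_train)  # scale training data with those ranges
vs_test_norm = min_max_scaler.transform(vs_test)    # scale test data with the *training* ranges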

In [19]:
np.set_printoptions(precision=2, linewidth=100)

print(vs_train_norm[:10])
[[0.01 0.   0.54 0.39 1.   0.   0.   1.   0.  ]
 [0.76 0.49 0.26 0.78 1.   0.   0.   1.   0.  ]
 [0.83 0.34 0.87 0.97 0.   1.   1.   0.   0.  ]
 [0.42 0.15 0.23 0.28 0.   1.   0.   1.   0.  ]
 [0.58 0.78 0.13 0.14 1.   0.   0.   0.   1.  ]
 [0.62 0.56 0.54 0.67 0.   1.   0.   0.   1.  ]
 [0.25 0.24 0.49 0.44 0.   1.   1.   0.   0.  ]
 [0.69 0.78 0.59 0.69 1.   0.   0.   0.   1.  ]
 [0.56 0.44 0.21 0.08 1.   0.   0.   0.   1.  ]
 [0.55 1.   0.15 0.58 1.   0.   0.   1.   0.  ]]
In [20]:
print(vs_test_norm[:10])
[[0.   0.   0.41 0.   0.   1.   1.   0.   0.  ]
 [0.8  0.67 0.38 0.95 0.   1.   0.   1.   0.  ]
 [0.59 0.37 0.45 0.25 0.   1.   0.   0.   1.  ]
 [0.84 1.   0.17 1.   1.   0.   0.   0.   1.  ]
 [0.79 0.35 1.   0.2  1.   0.   1.   0.   0.  ]
 [0.52 0.41 0.31 0.55 0.   1.   0.   0.   1.  ]
 [1.   0.26 0.86 0.25 0.   1.   0.   1.   0.  ]
 [0.23 0.31 0.17 0.35 0.   1.   0.   0.   1.  ]
 [0.3  0.54 0.28 0.8  1.   0.   0.   0.   1.  ]
 [0.62 0.67 0.   0.45 1.   0.   0.   0.   1.  ]]

For consistency, we'll also convert the training and test target labels into NumPy arrays.

In [21]:
vs_target_train = np.array(vs_target_train)
vs_target_test = np.array(vs_target_test)
In [22]:
print(vs_target_train)
print("\n")
print(vs_target_test)
['No' 'No' 'Yes' 'No' 'No' 'Yes' 'No' 'No' 'No' 'No' 'Yes' 'No' 'Yes' 'Yes' 'Yes' 'No' 'No' 'Yes'
 'Yes' 'No' 'Yes' 'Yes' 'No' 'Yes' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'No' 'Yes' 'No'
 'Yes' 'Yes' 'Yes' 'No' 'Yes']


['Yes' 'No' 'No' 'Yes' 'Yes' 'Yes' 'Yes' 'Yes' 'No' 'No']

The following function illustrates how we can perform a k-nearest-neighbor search. It takes as inputs an instance x to be classified and a data matrix D (assumed to be a 2d NumPy array), along with K (the desired number of nearest neighbors) and "measure". The "measure" argument selects the distance function: Euclidean distance (measure=0) or a distance based on the complement of Cosine similarity (measure=1):

In [23]:
def knn_search(x, D, K, measure):
    """ find K nearest neighbors of an instance x among the instances in D """
    if measure == 0:
        # Euclidean distance from x to each instance in D
        dists = np.sqrt(((D - x)**2).sum(axis=1))
    elif measure == 1:
        # first find the vector norm for each instance in D as well as the norm of vector x
        D_norm = np.array([np.linalg.norm(D[i]) for i in range(len(D))])
        x_norm = np.linalg.norm(x)
        # compute Cosine similarity: divide the dot product of x and each instance in D by the product of the two norms
        sims = np.dot(D, x)/(D_norm * x_norm)
        # the distance measure is the complement of Cosine similarity
        dists = 1 - sims
    idx = np.argsort(dists) # indexes of instances in D, sorted by distance to x
    # return the indexes of the K nearest neighbors, along with the full distance array
    return idx[:K], dists
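As a quick illustration of the cosine branch (measure=1), the distance between two small toy vectors can be computed directly; the vectors here are made up purely for illustration:

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # 0.5 for these vectors
cos_dist = 1 - cos_sim  # the cosine-based distance used by knn_search when measure == 1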
In [24]:
# Let's use vs_test_norm[0] as a test instance x and find its K nearest neighbors
neigh_idx, distances = knn_search(vs_test_norm[0], vs_train_norm, 5, 0)
In [25]:
vs_test.head(1)
Out[25]:
Income Age Rentals Avg Per Visit Gender_F Gender_M Genre_Action Genre_Comedy Genre_Drama
Cust ID
20 12000 16 23 2.2 0 1 1 0 0
In [26]:
print(neigh_idx)
print("\nNearest Neigbors:")
vs_train.iloc[neigh_idx]
[29 34 39 18 10]

Nearest Neighbors:
Out[26]:
Income Age Rentals Avg Per Visit Gender_F Gender_M Genre_Action Genre_Comedy Genre_Drama
Cust ID
6 18000 20 29 1.7 0 1 1 0 0
40 17000 19 32 1.8 0 1 1 0 0
16 17000 19 26 2.2 0 1 1 0 0
42 32000 25 26 2.2 0 1 1 0 0
30 41000 25 17 1.4 0 1 1 0 0
In [27]:
print(distances[neigh_idx])
[0.3  0.33 0.37 0.53 0.56]
In [28]:
# Let's look at the values of the target attribute "Incidentals" for the nearest neighbors of the test instance

neigh_labels = vs_target_train[neigh_idx]
print(neigh_labels)
['No' 'No' 'Yes' 'Yes' 'Yes']

Now that we have the nearest neighbors, we need to find the majority class label among them. The majority class is the class assigned to the new instance x.

In [29]:
from collections import Counter
print(Counter(neigh_labels))
Counter({'Yes': 3, 'No': 2})
In [30]:
Counter(neigh_labels).most_common(1)
Out[30]:
[('Yes', 3)]

Let's now put everything together into a function that calls our knn_search function and then returns the majority class among the K nearest neighbors of the instance to be classified.

In [31]:
def knn_classify(x, D, K, labels, measure):
    from collections import Counter
    neigh_idx, distances = knn_search(x, D, K, measure)
    neigh_labels = labels[neigh_idx]
    count = Counter(neigh_labels)
    print("Labels for top ", K, "neighbors: ", count.most_common())
    return count.most_common(1)[0][0]
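For example, classifying the first test instance with K=5 and Euclidean distance (measure=0) would look like the sketch below; this single call simply mirrors the evaluation loop that follows:

predicted_label = knn_classify(vs_test_norm[0], vs_train_norm, 5, vs_target_train, 0)
# predicted_label holds the majority class ('Yes' or 'No') among the 5 nearest training neighbors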

We can now use our KNN classifier to evaluate its classification accuracy. Here we run the classifier on each instance in our test data and compare the predicted class to the actual class. We keep a count of the disagreements, which allows us to compute the final error rate across all test instances.

In [32]:
numTestVecs = len(vs_target_test)
print(numTestVecs)
10
In [36]:
errorCount = 0.0
for i in range(numTestVecs):
    classifierResult = knn_classify(vs_test_norm[i,:], vs_train_norm, 5, vs_target_train, 0)
    print("Predicted Label: ", classifierResult, "==> Actual Label: ", vs_target_test[i])
    print()
    if classifierResult != vs_target_test[i]:
        errorCount += 1.0

print("the total error rate is: ", errorCount/float(numTestVecs))
Labels for top  5 neighbors:  [('Yes', 3), ('No', 2)]
Predicted Label:  Yes ==> Actual Label:  Yes

Labels for top  5 neighbors:  [('No', 4), ('Yes', 1)]
Predicted Label:  No ==> Actual Label:  No

Labels for top  5 neighbors:  [('Yes', 4), ('No', 1)]
Predicted Label:  Yes ==> Actual Label:  No

Labels for top  5 neighbors:  [('No', 3), ('Yes', 2)]
Predicted Label:  No ==> Actual Label:  Yes

Labels for top  5 neighbors:  [('Yes', 5)]
Predicted Label:  Yes ==> Actual Label:  Yes

Labels for top  5 neighbors:  [('Yes', 4), ('No', 1)]
Predicted Label:  Yes ==> Actual Label:  Yes

Labels for top  5 neighbors:  [('No', 3), ('Yes', 2)]
Predicted Label:  No ==> Actual Label:  Yes

Labels for top  5 neighbors:  [('Yes', 3), ('No', 2)]
Predicted Label:  Yes ==> Actual Label:  Yes

Labels for top  5 neighbors:  [('No', 3), ('Yes', 2)]
Predicted Label:  No ==> Actual Label:  No

Labels for top  5 neighbors:  [('No', 4), ('Yes', 1)]
Predicted Label:  No ==> Actual Label:  No

the total error rate is:  0.3

Let's repeat with the distance metric based on Cosine similarity (instead of Euclidean distance):

In [37]:
errorCount = 0.0
for i in range(numTestVecs):
    classifierResult = knn_classify(vs_test_norm[i,:], vs_train_norm, 5, vs_target_train, 1)
    print("Predicted Label: ", classifierResult, "==> Actual Label: ", vs_target_test[i])
    print()
    if classifierResult != vs_target_test[i]:
        errorCount += 1.0

print("the total error rate is: ", errorCount/float(numTestVecs))
Labels for top  5 neighbors:  [('No', 3), ('Yes', 2)]
Predicted Label:  No ==> Actual Label:  Yes

Labels for top  5 neighbors:  [('No', 4), ('Yes', 1)]
Predicted Label:  No ==> Actual Label:  No

Labels for top  5 neighbors:  [('Yes', 3), ('No', 2)]
Predicted Label:  Yes ==> Actual Label:  No

Labels for top  5 neighbors:  [('No', 3), ('Yes', 2)]
Predicted Label:  No ==> Actual Label:  Yes

Labels for top  5 neighbors:  [('Yes', 5)]
Predicted Label:  Yes ==> Actual Label:  Yes

Labels for top  5 neighbors:  [('Yes', 3), ('No', 2)]
Predicted Label:  Yes ==> Actual Label:  Yes

Labels for top  5 neighbors:  [('No', 3), ('Yes', 2)]
Predicted Label:  No ==> Actual Label:  Yes

Labels for top  5 neighbors:  [('Yes', 4), ('No', 1)]
Predicted Label:  Yes ==> Actual Label:  Yes

Labels for top  5 neighbors:  [('No', 3), ('Yes', 2)]
Predicted Label:  No ==> Actual Label:  No

Labels for top  5 neighbors:  [('No', 4), ('Yes', 1)]
Predicted Label:  No ==> Actual Label:  No

the total error rate is:  0.4
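As an optional cross-check (not part of the original workflow), scikit-learn's KNeighborsClassifier can be trained on the same normalized data; with n_neighbors=5 and its default Euclidean metric it should behave comparably to our knn_classify with measure=0, though ties among neighbors may be broken differently:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)   # Euclidean (Minkowski with p=2) by default
knn.fit(vs_train_norm, vs_target_train)
sk_preds = knn.predict(vs_test_norm)
print("scikit-learn error rate: ", np.mean(sk_preds != vs_target_test))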