In this example we will look at how to use the k-nearest-neighbor (kNN) algorithm for classification. We will use a modified version of the Video Store data set, with the "Incidentals" attribute as the target attribute for classification (the class attribute). The goal is to classify an unseen instance as "Yes" or "No" based on its other attribute values, using the labeled training instances.

In [1]:
import numpy as np
import pandas as pd
In [2]:
cd C:\Users\mobasher\Documents\Data
C:\Users\mobasher\Documents\Data
In [3]:
vstable = pd.read_csv("http://facweb.cs.depaul.edu/mobasher/classes/csc478/data/Video_Store_2.csv", index_col=0)

vstable.shape
Out[3]:
(50, 7)
In [4]:
vstable.head()
Out[4]:
Gender Income Age Rentals Avg Per Visit Genre Incidentals
Cust ID
1 M 45000 25 32 2.5 Action Yes
2 F 54000 33 12 3.4 Drama No
3 F 32000 20 42 1.6 Comedy No
4 F 59000 70 16 4.2 Drama Yes
5 M 37000 35 25 3.2 Action Yes

We will split the data into training and test partitions: the test partition will be used to estimate the model's error rate, and the training partition will be searched for the K nearest neighbors. Before splitting the data we randomly reshuffle it to make sure the instances are in random order.
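If reproducibility matters, the shuffle can be seeded first; a minimal sketch (the seed value 1 is an arbitrary choice):

np.random.seed(1)  # fix NumPy's random seed so the permutation below is repeatable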

In [5]:
vs = vstable.reindex(np.random.permutation(vstable.index))
vs.head(10)
Out[5]:
Gender Income Age Rentals Avg Per Visit Genre Incidentals
Cust ID
43 F 49000 28 48 3.3 Drama Yes
16 M 17000 19 26 2.2 Action Yes
47 F 69000 35 22 2.8 Drama Yes
23 F 2000 15 30 2.5 Comedy No
21 F 47000 52 11 3.1 Drama No
42 M 32000 25 26 2.2 Action Yes
24 F 79000 35 22 3.8 Drama Yes
32 M 47000 30 21 3.1 Drama Yes
44 M 35000 24 24 1.7 Drama No
40 M 17000 19 32 1.8 Action No
In [6]:
len(vs)
Out[6]:
50
In [7]:
vs_names = vs.columns.values
vs_names
Out[7]:
array(['Gender', 'Income', 'Age', 'Rentals', 'Avg Per Visit', 'Genre',
       'Incidentals'], dtype=object)

The target attribute for classification is Incidentals:

In [8]:
vs_target = vs.Incidentals

Before we can compute distances we need to convert the data (excluding the target attribute containing the class labels) into standard spreadsheet format, with binary dummy variables for the categorical attributes.

In [9]:
vs = pd.get_dummies(vs[['Gender','Income','Age','Rentals','Avg Per Visit','Genre']])
vs.head(10)
Out[9]:
Income Age Rentals Avg Per Visit Gender_F Gender_M Genre_Action Genre_Comedy Genre_Drama
Cust ID
43 49000 28 48 3.3 1.0 0.0 0.0 0.0 1.0
16 17000 19 26 2.2 0.0 1.0 1.0 0.0 0.0
47 69000 35 22 2.8 1.0 0.0 0.0 0.0 1.0
23 2000 15 30 2.5 1.0 0.0 0.0 1.0 0.0
21 47000 52 11 3.1 1.0 0.0 0.0 0.0 1.0
42 32000 25 26 2.2 0.0 1.0 1.0 0.0 0.0
24 79000 35 22 3.8 1.0 0.0 0.0 0.0 1.0
32 47000 30 21 3.1 0.0 1.0 0.0 0.0 1.0
44 35000 24 24 1.7 0.0 1.0 0.0 0.0 1.0
40 17000 19 32 1.8 0.0 1.0 1.0 0.0 0.0
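As a quick illustration of what get_dummies does, the sketch below applies it to a single categorical column of the original table (output not shown): each distinct value of "Genre" becomes its own 0/1 indicator column.

# Sketch: one 0/1 indicator column per distinct Genre value
print(pd.get_dummies(vstable['Genre']).head())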

To be able to evaluate the accuracy of our predictions, we will split the data into training and test sets. In this case, we will use 80% for training and the remaining 20% for testing. Note that we must apply the same split to the target attribute.

In [10]:
tpercent = 0.8
tsize = int(np.floor(tpercent * len(vs)))
vs_train = vs[:tsize]
vs_test = vs[tsize:]
In [11]:
print vs_train.shape
print vs_test.shape
(40, 9)
(10, 9)
In [12]:
np.set_printoptions(suppress=True, linewidth=120)

vs_train.head(10)
Out[12]:
Income Age Rentals Avg Per Visit Gender_F Gender_M Genre_Action Genre_Comedy Genre_Drama
Cust ID
43 49000 28 48 3.3 1.0 0.0 0.0 0.0 1.0
16 17000 19 26 2.2 0.0 1.0 1.0 0.0 0.0
47 69000 35 22 2.8 1.0 0.0 0.0 0.0 1.0
23 2000 15 30 2.5 1.0 0.0 0.0 1.0 0.0
21 47000 52 11 3.1 1.0 0.0 0.0 0.0 1.0
42 32000 25 26 2.2 0.0 1.0 1.0 0.0 0.0
24 79000 35 22 3.8 1.0 0.0 0.0 0.0 1.0
32 47000 30 21 3.1 0.0 1.0 0.0 0.0 1.0
44 35000 24 24 1.7 0.0 1.0 0.0 0.0 1.0
40 17000 19 32 1.8 0.0 1.0 1.0 0.0 0.0
In [13]:
vs_test
Out[13]:
Income Age Rentals Avg Per Visit Gender_F Gender_M Genre_Action Genre_Comedy Genre_Drama
Cust ID
45 56000 38 30 3.5 0.0 1.0 0.0 0.0 1.0
10 65000 40 21 3.3 1.0 0.0 0.0 0.0 1.0
20 12000 16 23 2.2 0.0 1.0 1.0 0.0 0.0
9 38000 21 18 2.1 0.0 1.0 0.0 1.0 0.0
6 18000 20 29 1.7 0.0 1.0 1.0 0.0 0.0
5 37000 35 25 3.2 0.0 1.0 1.0 0.0 0.0
30 41000 25 17 1.4 0.0 1.0 1.0 0.0 0.0
4 59000 70 16 4.2 1.0 0.0 0.0 0.0 1.0
18 6000 16 39 1.8 1.0 0.0 1.0 0.0 0.0
35 74000 29 43 4.6 0.0 1.0 1.0 0.0 0.0

Splitting the target attribute ("Incidentals") accordingly:

In [14]:
vs_target_train = vs_target[0:int(tsize)]
vs_target_test = vs_target[int(tsize):len(vs)]
In [15]:
vs_target_train.head()
Out[15]:
Cust ID
43    Yes
16    Yes
47    Yes
23     No
21     No
Name: Incidentals, dtype: object
In [16]:
vs_target_test
Out[16]:
Cust ID
45    Yes
10     No
20    Yes
9      No
6      No
5     Yes
30    Yes
4     Yes
18    Yes
35    Yes
Name: Incidentals, dtype: object
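As an aside, scikit-learn can perform the reshuffling and the parallel split of features and target in a single call. A sketch, assuming a scikit-learn version where train_test_split lives in sklearn.model_selection (older releases expose it in sklearn.cross_validation):

from sklearn.model_selection import train_test_split

# Sketch: shuffle and split features and target together (random_state is an arbitrary seed)
X_train, X_test, y_train, y_test = train_test_split(vs, vs_target, test_size=0.2, random_state=1)
print(X_train.shape)
print(X_test.shape)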

Next, we normalize the attributes so that everything is on a [0,1] scale. We could use the normalization functions from the kNN module in Ch. 2 of the text; in this case, however, we will use the more flexible and robust MinMaxScaler from the preprocessing module of the scikit-learn package.

In [17]:
from sklearn import preprocessing
In [18]:
min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler.fit(vs_train)
Out[18]:
MinMaxScaler(copy=True, feature_range=(0, 1))
In [19]:
vs_train_norm = min_max_scaler.fit_transform(vs_train)
vs_test_norm = min_max_scaler.fit_transform(vs_test)  # note: this re-fits the scaler on the test data (see the caveat below)

Note that while these scikit-learn functions accept pandas DataFrames as input, they return NumPy arrays (in this case the normalized versions of vs_train and vs_test).
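One caveat about the normalization above: calling fit_transform on the test partition re-fits the scaler to the test data's own minimum and maximum values. The more standard pattern, sketched below, fits the scaler on the training data only and then applies that same transformation to the test partition:

# Sketch: fit the scaler on the training data and reuse it for the test data,
# so the test partition is scaled with the training min/max values.
scaler = preprocessing.MinMaxScaler().fit(vs_train)
vs_train_scaled = scaler.transform(vs_train)
vs_test_scaled = scaler.transform(vs_test)  # scaled values may fall slightly outside [0, 1]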

In [20]:
np.set_printoptions(precision=2, linewidth=100)
vs_train_norm[:10]
Out[20]:
array([[ 0.55,  0.32,  1.  ,  0.61,  1.  ,  0.  ,  0.  ,  0.  ,  1.  ],
       [ 0.18,  0.1 ,  0.44,  0.31,  0.  ,  1.  ,  1.  ,  0.  ,  0.  ],
       [ 0.77,  0.49,  0.33,  0.47,  1.  ,  0.  ,  0.  ,  0.  ,  1.  ],
       [ 0.01,  0.  ,  0.54,  0.39,  1.  ,  0.  ,  0.  ,  1.  ,  0.  ],
       [ 0.52,  0.9 ,  0.05,  0.56,  1.  ,  0.  ,  0.  ,  0.  ,  1.  ],
       [ 0.35,  0.24,  0.44,  0.31,  0.  ,  1.  ,  1.  ,  0.  ,  0.  ],
       [ 0.89,  0.49,  0.33,  0.75,  1.  ,  0.  ,  0.  ,  0.  ,  1.  ],
       [ 0.52,  0.37,  0.31,  0.56,  0.  ,  1.  ,  0.  ,  0.  ,  1.  ],
       [ 0.39,  0.22,  0.38,  0.17,  0.  ,  1.  ,  0.  ,  0.  ,  1.  ],
       [ 0.18,  0.1 ,  0.59,  0.19,  0.  ,  1.  ,  1.  ,  0.  ,  0.  ]])

For consistency, we'll also convert the training and test target labels into NumPy arrays.

In [21]:
vs_target_train = np.array(vs_target_train)
vs_target_test = np.array(vs_target_test)
In [22]:
print vs_target_train
print "\n"
print vs_target_test
['Yes' 'Yes' 'Yes' 'No' 'No' 'Yes' 'Yes' 'Yes' 'No' 'No' 'Yes' 'No' 'No' 'No' 'No' 'No' 'No' 'Yes'
 'No' 'No' 'Yes' 'Yes' 'No' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'Yes' 'No' 'Yes' 'Yes' 'Yes' 'No' 'No'
 'Yes' 'No' 'No' 'Yes']


['Yes' 'No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'Yes' 'Yes' 'Yes']

The following function illustrates how we can perform a k-nearest-neighbor search. The "measure" argument allows us to use either Euclidean distance or a cosine-based distance (one minus cosine similarity) as the distance function:

In [23]:
def knn_search(x, D, K, measure):
    """ find K nearest neighbors of an instance x among the instances in D """
    if measure == 0:
        # euclidean distances from the other points
        dists = np.sqrt(((D - x)**2).sum(axis=1))
    elif measure == 1:
        # first find the vector norm for each instance in D as well as the norm for vector x
        D_norm = np.array([np.linalg.norm(D[i]) for i in range(len(D))])
        x_norm = np.linalg.norm(x)
        # Compute Cosine: divide the dot product of x and each instance in D by the product of the two norms
        sims = np.dot(D,x)/(D_norm * x_norm)
        # The distance measure will be the inverse of Cosine similarity
        dists = 1 - sims
    idx = np.argsort(dists) # sorting
    # return the indexes of K nearest neighbors
    return idx[:K], dists
In [24]:
# Let's use vs_test_norm[0] as a test instance x and find its K nearest neighbors
neigh_idx, distances = knn_search(vs_test_norm[0], vs_train_norm, 5, 0)
In [25]:
vs_test.head(1)
Out[25]:
Income Age Rentals Avg Per Visit Gender_F Gender_M Genre_Action Genre_Comedy Genre_Drama
Cust ID
45 56000 38 30 3.5 0.0 1.0 0.0 0.0 1.0
In [26]:
print neigh_idx
print "\n"
vs_train.iloc[neigh_idx]
[ 7 32 16 10 31]


Out[26]:
Income Age Rentals Avg Per Visit Gender_F Gender_M Genre_Action Genre_Comedy Genre_Drama
Cust ID
32 47000 30 21 3.1 0.0 1.0 0.0 0.0 1.0
17 36000 35 28 3.5 0.0 1.0 0.0 0.0 1.0
14 45000 36 24 2.7 0.0 1.0 0.0 0.0 1.0
38 41000 38 20 3.3 0.0 1.0 0.0 0.0 1.0
22 25000 33 16 2.9 0.0 1.0 0.0 0.0 1.0
In [27]:
print distances[neigh_idx]
[ 0.32  0.35  0.36  0.4   0.6 ]
In [28]:
neigh_labels = vs_target_train[neigh_idx]
print neigh_labels
['Yes' 'Yes' 'No' 'Yes' 'Yes']

Now that we know the nearest neighbors, we need to find the majority class label among them. The majority class will be the class assigned to the new instance x.

In [29]:
from collections import Counter
print Counter(neigh_labels)
Counter({'Yes': 4, 'No': 1})
In [30]:
Counter(neigh_labels).most_common(1)
Out[30]:
[('Yes', 4)]
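Before moving on, the same neighbor search can be repeated with the cosine-based distance by passing measure=1 to knn_search; the resulting neighbor set may differ from the Euclidean one (a sketch, output not shown):

# Sketch: use 1 - cosine similarity as the distance (measure=1)
neigh_idx_cos, distances_cos = knn_search(vs_test_norm[0], vs_train_norm, 5, 1)
print(Counter(vs_target_train[neigh_idx_cos]).most_common(1))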

Next, we'll use the kNN module from Chapter 2 of Machine Learning in Action. Before importing the whole module, let's illustrate what the code does by stepping through it with some specific input values.

In [31]:
dataSetSize = vs_train_norm.shape[0]
print dataSetSize
40
In [32]:
inX = vs_test_norm[0]   # We'll use the first instance in the test data for this example
diffMat = np.tile(inX, (dataSetSize,1)) - vs_train_norm  # Create dataSetSize copies of inX, as rows of a 2D matrix
                                                         # Compute a matrix of differences
print diffMat[:5,:]
[[ 0.19  0.09 -0.48  0.05 -1.    1.    0.    0.    0.  ]
 [ 0.55  0.31  0.08  0.35  0.    0.   -1.    0.    1.  ]
 [-0.04 -0.08  0.19  0.18 -1.    1.    0.    0.    0.  ]
 [ 0.72  0.41 -0.02  0.27 -1.    1.    0.   -1.    1.  ]
 [ 0.21 -0.5   0.47  0.1  -1.    1.    0.    0.    0.  ]]
In [33]:
sqDiffMat = diffMat**2  # The matrix of squared differences
sqDistances = sqDiffMat.sum(axis=1)  # 1D array of the sum of squared differences (one element for each training instance)
distances = sqDistances**0.5  # and finally the 1D array of Euclidean distances to inX
print distances
[ 1.51  1.59  1.44  2.18  1.59  1.52  1.44  0.32  0.64  1.62  0.4   1.52  2.13  2.09  1.52  1.47
  0.36  1.57  1.54  1.52  1.51  1.74  1.56  2.09  1.67  2.07  2.15  1.61  1.47  2.07  2.02  0.6
  0.35  1.44  2.15  1.49  2.04  1.67  1.57  1.46]
In [34]:
sortedDistIndicies = distances.argsort() # the indices of the training instances in increasing order of distance to inX
print sortedDistIndicies
[ 7 32 16 10 31  8  6  2 33 39 15 28 35  0 20  5 11 14 19 18 22 38 17  4  1 27  9 37 24 21 30 36 25
 29 23 13 12 34 26  3]
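The explicit np.tile step mirrors the book's code; NumPy broadcasting computes the same distances without building the tiled matrix. A minimal sketch:

# Sketch: broadcasting subtracts inX from every row of vs_train_norm directly
dists_bc = np.sqrt(((vs_train_norm - inX)**2).sum(axis=1))
print(np.allclose(dists_bc, distances))  # the two computations agree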

To see how the test instance should be classified, we need to find the majority class among the neighbors (here we do not use distance weighting, only a simple voting approach; a distance-weighted variant is sketched further below).

In [35]:
classCount={}
k = 5  # We'll use the top 5 neighbors
for i in range(k):
   voteIlabel = vs_target_train[sortedDistIndicies[i]]
   classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1  # increment the label's count (starting from 0 on its first occurrence)
   print sortedDistIndicies[i], voteIlabel, classCount[voteIlabel]
7 Yes 1
32 Yes 2
16 No 1
10 Yes 3
31 Yes 4

Now, let's find the predicted class for the test instance.

In [36]:
import operator
# Create a dictionary for the class labels with cumulative occurrences across the neighbors as values
# The items are sorted in decreasing order of their counts (so the majority class label will
# be the first element)
sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
print sortedClassCount
print sortedClassCount[0][0]
[('Yes', 4), ('No', 1)]
Yes

A better way to find the majority class given a list of class labels from neighbors:

In [54]:
from collections import Counter

k = 5  # We'll use the top 5 neighbors
vote = vs_target_train[sortedDistIndicies[0:k]]
maj_class = Counter(vote).most_common(1)

print vote

print maj_class

print "Class label for the classified istance: ", maj_class[0][0]
['Yes' 'Yes' 'No' 'Yes' 'Yes']
[('Yes', 4)]
Class label for the classified instance:  Yes
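The simple vote above ignores how close each neighbor actually is. A distance-weighted variant is sketched below, reusing the distances and sortedDistIndicies arrays from the walkthrough; each neighbor's vote is weighted by the inverse of its distance, so closer neighbors count more:

# Sketch: distance-weighted voting over the k nearest neighbors
eps = 1e-6  # guards against division by zero when a neighbor is identical to the test instance
weighted_votes = {}
for idx in sortedDistIndicies[:k]:
    label = vs_target_train[idx]
    weighted_votes[label] = weighted_votes.get(label, 0.0) + 1.0 / (distances[idx] + eps)
print(max(weighted_votes, key=weighted_votes.get))  # label with the largest total weight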

Let's now import the whole kNN module from Chapter 2 of the MLA book and use it as part of a more robust evaluation process. We will step through all test instances, use our kNN classifier to predict a class label for each instance, and in each case compare the predicted label to the actual value from the target test labels.

In [37]:
import kNN
In [38]:
numTestVecs = len(vs_target_test)
print numTestVecs
10
In [39]:
errorCount = 0.0
for i in range(numTestVecs):
    # classify0 function uses Euclidean distance to find k nearest neighbors
    classifierResult = kNN.classify0(vs_test_norm[i,:], vs_train_norm, vs_target_train, 3)
    print "the classifier came back with: %s, the real answer is: %s" % (classifierResult, vs_target_test[i])
    if (classifierResult != vs_target_test[i]): errorCount += 1.0
        
print "the total error rate is: %f" % (errorCount/float(numTestVecs))
the classifier came back with: Yes, the real answer is: Yes
the classifier came back with: Yes, the real answer is: No
the classifier came back with: Yes, the real answer is: Yes
the classifier came back with: Yes, the real answer is: No
the classifier came back with: Yes, the real answer is: No
the classifier came back with: Yes, the real answer is: Yes
the classifier came back with: Yes, the real answer is: Yes
the classifier came back with: No, the real answer is: Yes
the classifier came back with: Yes, the real answer is: Yes
the classifier came back with: Yes, the real answer is: Yes
the total error rate is: 0.400000

I have added a new classifier function to the kNN module that uses Cosine similarity instead of Euclidean distance:

In [40]:
reload(kNN)
Out[40]:
<module 'kNN' from 'kNN.pyc'>
In [41]:
errorCount = 0.0
for i in range(numTestVecs):
    # classify1 function uses inverse of Cosine similarity to find k nearest neighbors
    classifierResult2 = kNN.classify1(vs_test_norm[i,:], vs_train_norm, vs_target_train, 3)
    print "the classifier came back with: %s, the real answer is: %s" % (classifierResult2, vs_target_test[i])
    if (classifierResult2 != vs_target_test[i]): errorCount += 1.0
print "the total error rate is: %f" % (errorCount/float(numTestVecs))
the classifier came back with: Yes, the real answer is: Yes
the classifier came back with: Yes, the real answer is: No
the classifier came back with: Yes, the real answer is: Yes
the classifier came back with: Yes, the real answer is: No
the classifier came back with: Yes, the real answer is: No
the classifier came back with: Yes, the real answer is: Yes
the classifier came back with: Yes, the real answer is: Yes
the classifier came back with: No, the real answer is: Yes
the classifier came back with: Yes, the real answer is: Yes
the classifier came back with: Yes, the real answer is: Yes
the total error rate is: 0.400000
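As a sanity check, the same evaluation can be run with scikit-learn's built-in classifier. A sketch, assuming sklearn.neighbors.KNeighborsClassifier is available (its default metric is Euclidean):

# Sketch: evaluate with scikit-learn's kNN implementation
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(vs_train_norm, vs_target_train)
preds = knn_clf.predict(vs_test_norm)
print("sklearn kNN error rate: %f" % np.mean(preds != vs_target_test))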
In [ ]: