In this example, we will explore the use of various classifiers from the scikit-learn package. Again, we'll use the modified Video Store data.

In [1]:
import numpy as np
import pandas as pd
In [2]:
vstable = pd.read_csv("http://facweb.cs.depaul.edu/mobasher/classes/csc478/data/Video_Store_2.csv", index_col=0)

vstable.shape
Out[2]:
(50, 7)
In [3]:
vstable.head()
Out[3]:
Gender Income Age Rentals Avg Per Visit Genre Incidentals
Cust ID
1 M 45000 25 32 2.5 Action Yes
2 F 54000 33 12 3.4 Drama No
3 F 32000 20 42 1.6 Comedy No
4 F 59000 70 16 4.2 Drama Yes
5 M 37000 35 25 3.2 Action Yes

Let's separate the target attribute ("Incidentals") from the attributes used for model training.

In [4]:
vs_records = vstable[['Gender','Income','Age','Rentals','Avg Per Visit','Genre']]
vs_records.head()
Out[4]:
Gender Income Age Rentals Avg Per Visit Genre
Cust ID
1 M 45000 25 32 2.5 Action
2 F 54000 33 12 3.4 Drama
3 F 32000 20 42 1.6 Comedy
4 F 59000 70 16 4.2 Drama
5 M 37000 35 25 3.2 Action
In [5]:
vs_target = vstable.Incidentals
vs_target.head()
Out[5]:
Cust ID
1    Yes
2     No
3     No
4    Yes
5    Yes
Name: Incidentals, dtype: object
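
Before modeling, it can be useful to check how the classes are distributed in the target attribute. A quick check (an added cell, not part of the original run):

In [ ]:
# Class distribution of the target attribute
vs_target.value_counts()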

Next, we use the Pandas "get_dummies" function to create dummy (0/1 indicator) variables for the categorical attributes (Gender and Genre).

In [6]:
vs_matrix = pd.get_dummies(vs_records[['Gender','Income','Age','Rentals','Avg Per Visit','Genre']])
vs_records.head(10)
Out[6]:
Gender Income Age Rentals Avg Per Visit Genre
Cust ID
1 M 45000 25 32 2.5 Action
2 F 54000 33 12 3.4 Drama
3 F 32000 20 42 1.6 Comedy
4 F 59000 70 16 4.2 Drama
5 M 37000 35 25 3.2 Action
6 M 18000 20 29 1.7 Action
7 F 29000 45 19 3.8 Drama
8 M 74000 25 31 2.4 Action
9 M 38000 21 18 2.1 Comedy
10 F 65000 40 21 3.3 Drama
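
Note that the cell above displays the original vs_records frame; to inspect the dummy-coded matrix itself, something along these lines (an added cell using the vs_matrix created above) would show the new indicator columns:

In [ ]:
# Inspect the dummy-coded matrix produced by get_dummies
print(list(vs_matrix.columns))
vs_matrix.head()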

Next, we divide the data into randomized training and test partitions (note that the same split must also be performed on the target attribute). The easiest way to do this is with the "train_test_split" function from "sklearn.cross_validation" (in newer versions of scikit-learn, it lives in "sklearn.model_selection").

In [7]:
from sklearn.cross_validation import train_test_split
vs_train, vs_test, vs_target_train, vs_target_test = train_test_split(vs_matrix, vs_target, test_size=0.2, random_state=33)

print vs_test.shape
vs_test[0:5]
(10, 9)
Out[7]:
Income Age Rentals Avg Per Visit Gender_F Gender_M Genre_Action Genre_Comedy Genre_Drama
Cust ID
6 18000 20 29 1.7 0.0 1.0 1.0 0.0 0.0
28 57000 52 22 4.1 0.0 1.0 0.0 1.0 0.0
38 41000 38 20 3.3 0.0 1.0 0.0 0.0 1.0
16 17000 19 26 2.2 0.0 1.0 1.0 0.0 0.0
41 50000 33 17 1.4 1.0 0.0 0.0 0.0 1.0
In [8]:
print vs_train.shape
vs_train[0:5]
(40, 9)
Out[8]:
Income Age Rentals Avg Per Visit Gender_F Gender_M Genre_Action Genre_Comedy Genre_Drama
Cust ID
30 41000 25 17 1.4 0.0 1.0 1.0 0.0 0.0
35 74000 29 43 4.6 0.0 1.0 1.0 0.0 0.0
18 6000 16 39 1.8 1.0 0.0 1.0 0.0 0.0
40 17000 19 32 1.8 0.0 1.0 1.0 0.0 0.0
2 54000 33 12 3.4 1.0 0.0 0.0 0.0 1.0

Next, we perform min-max normalization to rescale the numeric attributes to the [0, 1] range. The scaler is fit on the training partition and then applied to both the training and test partitions.

In [9]:
from sklearn import preprocessing
In [10]:
min_max_scaler = preprocessing.MinMaxScaler().fit(vs_train)
vs_train_norm = min_max_scaler.transform(vs_train)
vs_test_norm = min_max_scaler.transform(vs_test)
In [11]:
np.set_printoptions(precision=2, linewidth=80, suppress=True)
vs_train_norm[0:5]
Out[11]:
array([[ 0.45,  0.18,  0.16,  0.06,  0.  ,  1.  ,  1.  ,  0.  ,  0.  ],
       [ 0.83,  0.25,  0.86,  0.97,  0.  ,  1.  ,  1.  ,  0.  ,  0.  ],
       [ 0.06,  0.02,  0.76,  0.17,  1.  ,  0.  ,  1.  ,  0.  ,  0.  ],
       [ 0.18,  0.07,  0.57,  0.17,  0.  ,  1.  ,  1.  ,  0.  ,  0.  ],
       [ 0.6 ,  0.33,  0.03,  0.63,  1.  ,  0.  ,  0.  ,  0.  ,  1.  ]])
In [12]:
vs_test_norm[0:5]
Out[12]:
array([[ 0.19,  0.09,  0.49,  0.14,  0.  ,  1.  ,  1.  ,  0.  ,  0.  ],
       [ 0.64,  0.67,  0.3 ,  0.83,  0.  ,  1.  ,  0.  ,  1.  ,  0.  ],
       [ 0.45,  0.42,  0.24,  0.6 ,  0.  ,  1.  ,  0.  ,  0.  ,  1.  ],
       [ 0.18,  0.07,  0.41,  0.29,  0.  ,  1.  ,  1.  ,  0.  ,  0.  ],
       [ 0.56,  0.33,  0.16,  0.06,  1.  ,  0.  ,  0.  ,  0.  ,  1.  ]])
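
As a sanity check on what MinMaxScaler is doing, the rescaling can be reproduced by hand from the training-set minima and maxima (this is an added sketch, not part of the original run; it assumes every column has max > min on the training data, which holds here). Note that the 0/1 dummy columns already lie in [0, 1] and are left unchanged:

In [ ]:
# Reproduce the min-max scaling manually from the training partition statistics
train_min = vs_train.values.min(axis=0)
train_max = vs_train.values.max(axis=0)
manual_norm = (vs_train.values - train_min) / (train_max - train_min)
print(np.allclose(manual_norm, vs_train_norm))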

We will use the KNN, decision tree, and naive Bayes classifiers from sklearn.

In [13]:
from sklearn import neighbors, tree, naive_bayes

First, we'll use the KNN classifier. You can vary K and monitor the accuracy metrics (see below) to find the best value; a small search loop is sketched after the evaluation steps below.

In [14]:
n_neighbors = 5

knnclf = neighbors.KNeighborsClassifier(n_neighbors, weights='distance')
knnclf.fit(vs_train_norm, vs_target_train)
Out[14]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='distance')

Next, we call the predict function on the test instances to produce the predicted classes.

In [15]:
knnpreds_test = knnclf.predict(vs_test_norm)
In [16]:
print knnpreds_test
['No' 'No' 'Yes' 'Yes' 'No' 'No' 'Yes' 'Yes' 'Yes' 'Yes']

scikit-learn's "metrics" module provides various functions for evaluating classifier accuracy.

In [17]:
from sklearn.metrics import classification_report
In [18]:
print(classification_report(vs_target_test, knnpreds_test))
             precision    recall  f1-score   support

         No       1.00      1.00      1.00         4
        Yes       1.00      1.00      1.00         6

avg / total       1.00      1.00      1.00        10

In [19]:
from sklearn.metrics import confusion_matrix
In [20]:
knncm = confusion_matrix(vs_target_test, knnpreds_test)
print knncm
[[4 0]
 [0 6]]

We can also compute the overall accuracy score on the test instances.

In [21]:
print knnclf.score(vs_test_norm, vs_target_test)
1.0

This can be compared to the performance on the training data itself (to check for over- or under-fitting).

In [22]:
print knnclf.score(vs_train_norm, vs_target_train)
1.0
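
As noted above, K can be varied to find the value that generalizes best. A minimal sketch of such a search (an added cell), reusing the normalized partitions and the score method shown above; with distance weighting the training score tends to stay at 1.0, so the test score is the more informative number:

In [ ]:
# Vary K and compare accuracy on the training and test partitions
for k in range(1, 16, 2):
    clf = neighbors.KNeighborsClassifier(n_neighbors=k, weights='distance')
    clf.fit(vs_train_norm, vs_target_train)
    print("k=%2d  train=%.2f  test=%.2f" % (k,
          clf.score(vs_train_norm, vs_target_train),
          clf.score(vs_test_norm, vs_target_test)))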

Next, let's use a decision tree classifier:

In [23]:
treeclf = tree.DecisionTreeClassifier(criterion='entropy', min_samples_split=3)
In [24]:
treeclf = treeclf.fit(vs_train, vs_target_train)
In [25]:
treepreds_test = treeclf.predict(vs_test)
print treepreds_test
['No' 'Yes' 'No' 'No' 'No' 'No' 'Yes' 'Yes' 'Yes' 'No']
In [26]:
print treeclf.score(vs_test, vs_target_test)
0.6
In [27]:
print treeclf.score(vs_train, vs_target_train)
0.95
In [28]:
print(classification_report(vs_target_test, treepreds_test))
             precision    recall  f1-score   support

         No       0.50      0.75      0.60         4
        Yes       0.75      0.50      0.60         6

avg / total       0.65      0.60      0.60        10

In [29]:
treecm = confusion_matrix(vs_target_test, treepreds_test, labels=['Yes','No'])
print treecm
[[3 3]
 [1 3]]

We can actually plot the confusion matrix for better visualization:

In [30]:
import pylab as plt
%matplotlib inline
plt.matshow(treecm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
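
Since matshow only shows numeric tick positions, it can help to label the axes with the class names. An optional variant of the plot above (the tick order matches the labels=['Yes','No'] used when computing treecm):

In [ ]:
# Same plot, with class labels on the axes
plt.matshow(treecm)
plt.title('Confusion matrix')
plt.colorbar()
plt.xticks([0, 1], ['Yes', 'No'])
plt.yticks([0, 1], ['Yes', 'No'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()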

Now, let's try the (Gaussian) naive Bayes classifier:

In [31]:
nbclf = naive_bayes.GaussianNB()
nbclf = nbclf.fit(vs_train, vs_target_train)
nbpreds_test = nbclf.predict(vs_test)
print nbpreds_test
['Yes' 'No' 'No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'Yes' 'Yes']
In [32]:
print nbclf.score(vs_train, vs_target_train)
0.675
In [33]:
print nbclf.score(vs_test, vs_target_test)
0.8

Finally, let's try linear discriminant analysis:

In [35]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

ldclf = LinearDiscriminantAnalysis()
ldclf = ldclf.fit(vs_train, vs_target_train)
ldpreds_test = ldclf.predict(vs_test)
print ldpreds_test
['Yes' 'No' 'Yes' 'Yes' 'No' 'No' 'Yes' 'Yes' 'Yes' 'Yes']
In [36]:
print ldclf.score(vs_test, vs_target_test)
0.9

In this final example, I demonstrate how to use the cross-validation module from scikit-learn. This performs n-fold cross-validation without having to split the data set manually.

In [37]:
from sklearn import cross_validation
In [38]:
cv_scores = cross_validation.cross_val_score(treeclf, vs_matrix, vs_target, cv=5)
cv_scores
Out[38]:
array([ 0.45,  0.3 ,  0.8 ,  0.7 ,  0.89])
In [39]:
print("Overall Accuracy: %0.2f (+/- %0.2f)" % (cv_scores.mean(), cv_scores.std() * 2))
Overall Accuracy: 0.63 (+/- 0.44)
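
One caveat: the cross-validation above is run on the unscaled vs_matrix, which is fine for a decision tree but would matter for a distance-based model such as KNN. A common pattern is to wrap the scaler and the classifier in a Pipeline so the scaling is re-fit inside each fold; a minimal sketch (an added cell, names are illustrative):

In [ ]:
# Cross-validate KNN with min-max scaling re-fit inside each fold
from sklearn.pipeline import Pipeline
knn_pipe = Pipeline([('scale', preprocessing.MinMaxScaler()),
                     ('knn', neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance'))])
knn_scores = cross_validation.cross_val_score(knn_pipe, vs_matrix, vs_target, cv=5)
print("KNN Overall Accuracy: %0.2f (+/- %0.2f)" % (knn_scores.mean(), knn_scores.std() * 2))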

Visualizing the decision tree

In [45]:
from sklearn.tree import export_graphviz
export_graphviz(treeclf,out_file='tree.dot', feature_names=vs_train.columns)
In [46]:
import graphviz

with open("tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
Out[46]:
(Graphviz rendering of the fitted decision tree: the root node splits on Rentals <= 15.5 (entropy = 1.0, samples = 40), with further splits on Genre_Comedy, Age, Income, Avg Per Visit, Rentals, and Gender_M.)

Alternatively, you can use GraphViz or some other tool outside the Jupyter environment to convert the dot file into an image file (e.g., a .png file) and save it to a local directory. Then, the image can be displayed in Jupyter as follows.

In [47]:
!dot -Tpng tree.dot -o dtree.png
In [50]:
from IPython.display import Image
Image(filename='dtree.png', width=800)
Out[50]:
(The dtree.png image of the decision tree is displayed here.)