A Gentle Introduction to the Naive Bayes Classifier Using SciKit-Learn

A Naive Bayes Classifier is a machine learning model that applies Bayes' theorem, derived by Rev. Thomas Bayes in the 18th century. Bayes' law is based on a simple formula:

$$P(A|B) = P(B|A)\frac{P(A)}{P(B)}$$

The formula reads as follows: the probability of A given B is the probability of B given A times the ratio of the probabilities of A and B (Linoff and Berry, 2011:211).
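
As a quick worked example with made-up numbers: suppose 20% of all email is spam, the word "offer" appears in 50% of spam and in 5% of non-spam. Bayes' law then gives the probability that an email containing "offer" is spam:

p_spam = 0.2         # P(A): prior probability that an email is spam
p_word_spam = 0.5    # P(B|A): "offer" appears in half of all spam
p_word_ham = 0.05    # "offer" appears in 5% of non-spam
p_word = p_word_spam * p_spam + p_word_ham * (1 - p_spam)  # P(B), by total probability
p_spam_word = p_word_spam * p_spam / p_word                # P(A|B)
print(p_spam_word)   # ~0.714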

The Bayesian part of Naive Bayesian models refers to the technique's use of Bayes' law. The naive part refers to the assumption that the variables used are independent of each other with respect to the target. Naive Bayes classifiers belong to the family of probabilistic classifiers, which is well suited when the dimensionality of the input is high (more predictors than observations). Naive Bayesian models provide a way out of this dilemma when you are trying to predict a probability.
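
Concretely, the independence assumption lets the classifier factor the likelihood into one term per feature, which is what keeps the model tractable in high dimensions:

$$P(C|x_1, \ldots, x_n) \propto P(C)\prod_{i=1}^{n} P(x_i|C)$$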

Typical applications of Naive Bayes Algorithms:

  • Real-time prediction;
  • Text classification / spam filtering / sentiment analysis;
  • Recommendation systems.

Here you can find a great visual explanation of how a Naive Bayes Classifier works.

Advantages:

  • Well suited for high dimensionality.
  • Easy to implement.
  • Can be trained with a small data set.

Disadvantages:

  • Dependencies among variables cannot be modelled.
  • Can only be used for classification (predicting classes), not for regression.
  • Naive Bayes is also known as a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.

Be aware that this is a simplified example to give a brief introduction into using a Naive Bayes Classifier.

Predicting Churn

In [3]:
#Import libraries

%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

#Import Gaussian Naive Bayes model from SciKit-Learn
from sklearn.naive_bayes import GaussianNB

In this example we're going to use SciKit-Learn to build a Naive Bayes model in Python. There are three types of Naive Bayes models available in the SciKit-Learn library:

Gaussian: Used for classification; it assumes that the features follow a normal (Gaussian) distribution.

Multinomial: Used for discrete counts. Take a text classification problem: it goes one step further than Bernoulli trials, so instead of recording "word occurs in the document" we record "how often the word occurs in the document". You can think of it as the number of times outcome x_i is observed over n trials.

Bernoulli: The binomial model is useful if your feature vectors are binary (i.e. zeros and ones). One application would be text classification with a 'bag of words' model, where the 1s and 0s mean "word occurs in the document" and "word does not occur in the document" respectively.
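
As a minimal sketch of how the three variants are instantiated (all three class names come from sklearn.naive_bayes; which one fits depends on your feature types):

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

gaussian_nb = GaussianNB()        # continuous features, assumed normal per class
multinomial_nb = MultinomialNB()  # non-negative counts, e.g. word frequencies
bernoulli_nb = BernoulliNB()      # binary (0/1) features, e.g. word presence/absence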

For this example we're going to use the Gaussian Naive Bayes.

Data

For this introduction to the Naive Bayes Classifier in SciKit-Learn we're going to use a churn data set containing call history data. Using this data set we're going to predict whether a customer is going to churn (True or False).

In [4]:
churn = pd.read_csv('churn.csv')

# Replace yes/no with 1/0. 
churn = churn.replace(['yes', 'no'], [1, 0])
In [5]:
churn.head()
Out[5]:
State Account Length Area Code Phone Int'l Plan VMail Plan VMail Message Day Mins Day Calls Day Charge ... Eve Calls Eve Charge Night Mins Night Calls Night Charge Intl Mins Intl Calls Intl Charge CustServ Calls Churn?
0 KS 128 415 382-4657 0 1 25 265.1 110 45.07 ... 99 16.78 244.7 91 11.01 10.0 3 2.70 1 False.
1 OH 107 415 371-7191 0 1 26 161.6 123 27.47 ... 103 16.62 254.4 103 11.45 13.7 3 3.70 1 False.
2 NJ 137 415 358-1921 0 0 0 243.4 114 41.38 ... 110 10.30 162.6 104 7.32 12.2 5 3.29 0 False.
3 OH 84 408 375-9999 1 0 0 299.4 71 50.90 ... 88 5.26 196.9 89 8.86 6.6 7 1.78 2 False.
4 OK 75 415 330-6626 1 0 0 166.7 113 28.34 ... 122 12.61 186.9 121 8.41 10.1 3 2.73 3 False.

5 rows × 21 columns

In [6]:
#Let's first look at the ratio between True and False in the Churn column, to see how the two classes are distributed.

counter = Counter(churn['Churn?'])
names = counter.keys()
counts = counter.values()

# Plot histogram using matplotlib bar().
indexes = np.arange(len(names))
width = 0.7
plt.bar(indexes, counts, width)
plt.xticks(indexes + width * 0.5, names)
plt.show()
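
As an aside, the same bar chart can be drawn in one line with pandas (assuming the same churn DataFrame as above):

churn['Churn?'].value_counts().plot(kind='bar')
plt.show()
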
In [7]:
#We also need to change the target variable into a binary target variable.
churn = churn.replace(['True.', 'False.'], [1, 0])
In [8]:
#Select the input variables for the model and create a NumPy array.
x = np.array(churn[['Day Calls','Day Charge','Eve Calls','Eve Charge', 'Night Calls', 'Night Charge', 'Intl Calls','Intl Charge', 'CustServ Calls', 'VMail Plan', 'VMail Message']])
print(x)
[[ 110.     45.07   99.   ...,    1.      1.     25.  ]
 [ 123.     27.47  103.   ...,    1.      1.     26.  ]
 [ 114.     41.38  110.   ...,    0.      0.      0.  ]
 ..., 
 [ 109.     30.74   58.   ...,    2.      0.      0.  ]
 [ 105.     36.35   84.   ...,    2.      0.      0.  ]
 [ 113.     39.85   82.   ...,    0.      1.     25.  ]]
In [9]:
#Select the target variable and create a flat (1-D) NumPy array, which is what scikit-learn expects for labels.
y = np.array(churn['Churn?'])
print(y)
[0 0 0 ..., 0 0 0]
In [10]:
#Split the data into a training set and a test set (80/20) so we can evaluate the model on unseen data.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
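
Since churners are the minority class, a stratified split keeps the True/False ratio the same in both sets. A minimal variant (the stratify argument is available in recent scikit-learn versions; the random_state value is arbitrary and only makes the split reproducible):

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42)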

Fitting a Naive Bayes Classifier

First we imported our data set. After that we split it into a train set and a test set. Now it's time to fit the model.

In [11]:
#Select the model (Gaussian Naive Bayes Classifier)
nbc = GaussianNB()
In [16]:
model = nbc.fit(x_train, y_train)
accuracy = nbc.score(x_train, y_train)
accuracy
Out[16]:
0.86946736684171044
In [14]:
# Test the accuracy of the model on the held-out test data set.
accuracy2 = nbc.score(x_test, y_test)
accuracy2
Out[14]:
0.893553223388
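
Because churners are a minority, raw accuracy can flatter the model: always predicting "no churn" would already score around the majority-class rate. A minimal sketch of a fuller evaluation using scikit-learn's metrics module:

from sklearn.metrics import confusion_matrix, classification_report

y_pred = nbc.predict(x_test)
print(confusion_matrix(y_test, y_pred))       # rows: true class, columns: predicted class
print(classification_report(y_test, y_pred))  # per-class precision, recall and F1
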
In [27]:
#Example of how to classify a new observation. predict expects a 2-D array
#(one row per sample), with features in the same order as the training columns.
example_predict = np.array([[0, 1, 110, 45.07, 99, 16.78, 91, 11.01, 3, 2.70, 1]])
prediction = nbc.predict(example_predict)
print(prediction)
[1]
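
To see the class probabilities behind that prediction (keeping in mind the earlier caveat that naive Bayes is a poor probability estimator):

probabilities = nbc.predict_proba(example_predict)
print(probabilities)  # one row per sample: [P(class 0), P(class 1)]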

References:

Linoff, G. S. and Berry, M. J. A. (2011) Data Mining Techniques. Indianapolis: Wiley Publishing, Inc.