version 0.1, Mar 2016
This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Special thanks go to Kevin Markham, Sebastian Raschka, and the scikit-learn docs.
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the class. Given a class variable $y$ and a dependent feature vector $x_1$ through $x_n$, Bayes' theorem states the following relationship:
$$ P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n \mid y)} {P(x_1, \dots, x_n)} $$

Using the naive conditional independence assumption that

$$ P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y) $$

for all $i$, this relationship is simplified to

$$ P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)} {P(x_1, \dots, x_n)} $$

Since $P(x_1, \dots, x_n)$ is constant given the input, we can use the following classification rule:

$$ P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y) $$

$$ \Downarrow $$

$$ \hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y), $$

and we can use Maximum A Posteriori (MAP) estimation to estimate
$P(y)$ and $P(x_i \mid y)$; the former is then the relative frequency of class $y$ in the training set.
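To make the classification rule concrete, here is a minimal sketch with hypothetical numbers (two classes, three binary features); all probabilities below are invented purely for illustration:

import numpy as np

# A minimal sketch of the MAP rule: two classes, three binary features.
# The priors and per-feature likelihoods are made up.
priors = np.array([0.6, 0.4])                # P(y) for classes 0 and 1
likelihoods = np.array([[0.9, 0.2, 0.5],     # P(x_i = 1 | y = 0)
                        [0.3, 0.7, 0.8]])    # P(x_i = 1 | y = 1)
x = np.array([1, 1, 0])                      # an observed feature vector

# P(x_i | y): use p when x_i == 1, otherwise 1 - p
cond = np.where(x == 1, likelihoods, 1 - likelihoods)
scores = priors * cond.prod(axis=1)          # proportional to P(y | x)
print(scores / scores.sum())                 # normalized posteriors
print(scores.argmax())                       # y-hat, the MAP prediction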
The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of $P(x_i \mid y)$.
In spite of their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering. They require a small amount of training data to estimate the necessary parameters. (For theoretical reasons why naive Bayes works well, and on which types of data it does, see the references below.)
Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class-conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.
On the flip side, although naive Bayes is known as a decent classifier, it is known to be a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.
GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian:

$$ P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right) $$

The parameters $\sigma_y$ and $\mu_y$ are estimated using maximum likelihood.
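Before turning to scikit-learn, here is a minimal sketch of these estimates for a single feature and class; the feature values are made up:

import numpy as np

# estimate mu_y and sigma_y by maximum likelihood for one (made-up) feature
x_class = np.array([5.1, 4.9, 4.7, 5.0, 5.2])   # feature i for samples of class y
mu, sigma = x_class.mean(), x_class.std()

# evaluate the Gaussian likelihood P(x_i | y) at a new value
def gaussian_likelihood(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

gaussian_likelihood(5.0, mu, sigma)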
import pandas as pd
import numpy as np
# read the iris data into a DataFrame
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris = pd.read_csv(url, header=None, names=col_names)
iris.head()
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
# apply the ceiling function to the numeric columns
iris.loc[:, 'sepal_length':'petal_width'] = iris.loc[:, 'sepal_length':'petal_width'].apply(np.ceil)
iris.head()
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
0 | 6 | 4 | 2 | 1 | Iris-setosa |
1 | 5 | 3 | 2 | 1 | Iris-setosa |
2 | 5 | 4 | 2 | 1 | Iris-setosa |
3 | 5 | 4 | 2 | 1 | Iris-setosa |
4 | 5 | 4 | 2 | 1 | Iris-setosa |
Let's say that I have an out-of-sample iris with the following measurements: 7, 3, 5, 2. How might I predict the species?
# show all observations with features: 7, 3, 5, 2
iris[(iris.sepal_length==7) & (iris.sepal_width==3) & (iris.petal_length==5) & (iris.petal_width==2)]
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
54 | 7 | 3 | 5 | 2 | Iris-versicolor |
58 | 7 | 3 | 5 | 2 | Iris-versicolor |
63 | 7 | 3 | 5 | 2 | Iris-versicolor |
68 | 7 | 3 | 5 | 2 | Iris-versicolor |
72 | 7 | 3 | 5 | 2 | Iris-versicolor |
73 | 7 | 3 | 5 | 2 | Iris-versicolor |
74 | 7 | 3 | 5 | 2 | Iris-versicolor |
75 | 7 | 3 | 5 | 2 | Iris-versicolor |
76 | 7 | 3 | 5 | 2 | Iris-versicolor |
77 | 7 | 3 | 5 | 2 | Iris-versicolor |
87 | 7 | 3 | 5 | 2 | Iris-versicolor |
91 | 7 | 3 | 5 | 2 | Iris-versicolor |
97 | 7 | 3 | 5 | 2 | Iris-versicolor |
123 | 7 | 3 | 5 | 2 | Iris-virginica |
126 | 7 | 3 | 5 | 2 | Iris-virginica |
127 | 7 | 3 | 5 | 2 | Iris-virginica |
146 | 7 | 3 | 5 | 2 | Iris-virginica |
# count the species for these observations
iris[(iris.sepal_length==7) & (iris.sepal_width==3) & (iris.petal_length==5) & (iris.petal_width==2)].species.value_counts()
Iris-versicolor    13
Iris-virginica      4
Name: species, dtype: int64
# count the species for all observations
iris.species.value_counts()
Iris-versicolor    50
Iris-virginica     50
Iris-setosa        50
Name: species, dtype: int64
Let's frame this as a conditional probability problem: What is the probability of some particular species, given the measurements 7, 3, 5, and 2?
$$ P(species \mid 7352) $$

We could calculate the conditional probability for each of the three species, and then predict the species with the highest probability:

$$ P(setosa \mid 7352) $$
$$ P(versicolor \mid 7352) $$
$$ P(virginica \mid 7352) $$

Bayes' theorem gives us a way to calculate these conditional probabilities.

Let's start with versicolor:

$$ P(versicolor \mid 7352) = \frac{P(7352 \mid versicolor) \times P(versicolor)}{P(7352)} $$

We can calculate each of the terms on the right side of the equation:

$$ P(7352 \mid versicolor) = \frac{13}{50} = 0.26 $$
$$ P(versicolor) = \frac{50}{150} = 0.33 $$
$$ P(7352) = \frac{17}{150} = 0.11 $$

Therefore, Bayes' theorem says the probability of versicolor given these measurements is:

$$ P(versicolor \mid 7352) = \frac{0.26 \times 0.33}{0.11} = 0.76 $$

Let's repeat this process for virginica and setosa:

$$ P(virginica \mid 7352) = \frac{0.08 \times 0.33}{0.11} = 0.24 $$
$$ P(setosa \mid 7352) = \frac{0 \times 0.33}{0.11} = 0 $$

We predict that the iris is a versicolor, since that species had the highest conditional probability.
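As a sanity check, we can recompute these conditional probabilities directly from the DataFrame; this is a small sketch reusing the iris data and the boolean mask from above:

# verify the hand calculations: P(species | 7352) for each species
match = ((iris.sepal_length==7) & (iris.sepal_width==3) &
         (iris.petal_length==5) & (iris.petal_width==2))
p_7352 = match.mean()                        # P(7352) = 17/150
for species, group in iris.groupby('species'):
    p_prior = len(group) / len(iris)         # P(species) = 50/150
    p_cond = match[group.index].mean()       # P(7352 | species)
    print(species, p_cond * p_prior / p_7352)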
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
X = iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(iris['species'])
gnb.fit(X, y)
y_pred = gnb.predict(X)
print("Number of mislabeled points out of a total %d points : %d"
% (iris.shape[0],(y != y_pred).sum()))
Number of mislabeled points out of a total 150 points : 10
Let's make some hypothetical adjustments to the data, to demonstrate how Bayes' theorem makes intuitive sense (a small numerical sketch follows the list):

- Pretend that more of the existing versicolors had measurements of 7352: $P(7352 \mid versicolor)$ would increase, increasing the numerator and thus the posterior for versicolor.
- Pretend that most of the existing irises were versicolor: $P(versicolor)$ would increase, again increasing the numerator and thus the posterior for versicolor.
- Pretend that 17 of the setosas had measurements of 7352: $P(7352 \mid setosa)$ would no longer be zero, so setosa would have a nonzero posterior.
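The sketch below (with invented counts) shows how each adjustment shifts the posterior; all three terms are recomputed from raw counts so they stay consistent with each other:

# Bayes' theorem from raw counts; the adjusted counts below are hypothetical
def posterior(n_match_species, n_species, n_match_total, n_total):
    likelihood = n_match_species / n_species   # P(7352 | species)
    prior = n_species / n_total                # P(species)
    evidence = n_match_total / n_total         # P(7352)
    return likelihood * prior / evidence

print(posterior(13, 50, 17, 150))    # observed data: versicolor ~0.76
print(posterior(20, 50, 24, 150))    # more versicolors measure 7352: ~0.83
print(posterior(26, 100, 30, 150))   # most irises are versicolor: ~0.87
print(posterior(17, 50, 34, 150))    # 17 setosas measure 7352: setosa now 0.50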
Let's start by taking a quick look at Bayes' theorem:

$$ \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}} $$

In the context of pattern classification, we can express it as

$$ P(y \mid x_1, \dots, x_n) = \frac{P(x_1, \dots, x_n \mid y) \, P(y)}{P(x_1, \dots, x_n)}. $$

If we use Bayes' theorem in classification, our goal (or objective function) is to maximize the posterior probability:

$$ \hat{y} = \arg\max_y P(y \mid x_1, \dots, x_n). $$

Now, let's talk a bit more about the individual components. The priors represent our expert (or any other prior) knowledge; in practice, the priors are often estimated via MLE (computed as relative class frequencies). The evidence term cancels out of the arg max because it is constant across all classes.

Moving on to the "naive" part in the Naive Bayes classifier: what makes it "naive" is that we compute the class-conditional probability (sometimes also called the likelihood) as the product of the individual probabilities for each feature:

$$ P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y). $$

Since this assumption (the conditional independence of features) is probably never met in practice, it's the truly "naive" part of naive Bayes.
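One practical consequence of this product: with many features, multiplying small probabilities underflows floating point, which is why implementations typically sum log-probabilities instead. A quick sketch:

import numpy as np

probs = np.full(1000, 0.01)     # 1000 hypothetical per-feature likelihoods
print(probs.prod())             # 0.0 -- the product underflows to zero
print(np.log(probs).sum())      # about -4605.2 -- still fine for the arg max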
From the scikit-learn documentation:
Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents with variable length.
We will use CountVectorizer to "convert text into a matrix of token counts":
from sklearn.feature_extraction.text import CountVectorizer
# start with a simple example
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
# learn the 'vocabulary' of the training data
vect = CountVectorizer()
vect.fit(simple_train)
vect.get_feature_names()
['cab', 'call', 'me', 'please', 'tonight', 'you']
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm
<3x6 sparse matrix of type '<class 'numpy.int64'>' with 9 stored elements in Compressed Sparse Row format>
# print the sparse matrix
print(simple_train_dtm)
  (0, 1)    1
  (0, 4)    1
  (0, 5)    1
  (1, 0)    1
  (1, 1)    1
  (1, 2)    1
  (2, 1)    1
  (2, 2)    1
  (2, 3)    2
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()
array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]])
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())
| | cab | call | me | please | tonight | you |
|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 1 | 1 |
1 | 1 | 1 | 1 | 0 | 0 | 0 |
2 | 0 | 1 | 1 | 2 | 0 | 0 |
From the scikit-learn documentation:
In this scheme, features and samples are defined as follows:
- Each individual token occurrence frequency (normalized or not) is treated as a feature.
- The vector of all the token frequencies for a given document is considered a multivariate sample.
A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting, and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position of the words in the document.
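The "Bag of n-grams" variant mentioned above can be reproduced with CountVectorizer's ngram_range parameter; counting word pairs alongside single words preserves some local word order. A short sketch:

# count unigrams and bigrams instead of single words only
vect_ngram = CountVectorizer(ngram_range=(1, 2))
vect_ngram.fit(simple_train)
vect_ngram.get_feature_names()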
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test = ["please don't call me"]
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()
array([[0, 1, 1, 1, 0, 0]])
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())
| | cab | call | me | please | tonight | you |
|---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | 1 | 0 | 0 |
Summary:

- vect.fit(train) learns the vocabulary of the training data
- vect.transform(train) uses the fitted vocabulary to build a document-term matrix from the training data
- vect.transform(test) uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)

# read tab-separated file
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv'
col_names = ['label', 'message']
sms = pd.read_table(url, sep='\t', header=None, names=col_names)
print(sms.shape)
(5572, 2)
sms.head(20)
| | label | message |
|---|---|---|
0 | ham | Go until jurong point, crazy.. Available only ... |
1 | ham | Ok lar... Joking wif u oni... |
2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
3 | ham | U dun say so early hor... U c already then say... |
4 | ham | Nah I don't think he goes to usf, he lives aro... |
5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
6 | ham | Even my brother is not like to speak with me. ... |
7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
8 | spam | WINNER!! As a valued network customer you have... |
9 | spam | Had your mobile 11 months or more? U R entitle... |
10 | ham | I'm gonna be home soon and i don't want to tal... |
11 | spam | SIX chances to win CASH! From 100 to 20,000 po... |
12 | spam | URGENT! You have won a 1 week FREE membership ... |
13 | ham | I've been searching for the right words to tha... |
14 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |
15 | spam | XXXMobileMovieClub: To use your credit, click ... |
16 | ham | Oh k...i'm watching here:) |
17 | ham | Eh u remember how 2 spell his name... Yes i di... |
18 | ham | Fine if thats the way u feel. Thats the way ... |
19 | spam | England v Macedonia - dont miss the goals/team... |
sms.label.value_counts()
ham     4825
spam     747
Name: label, dtype: int64
# convert label to a numeric variable
sms['label'] = sms.label.map({'ham':0, 'spam':1})
# define X and y
X = sms.message
y = sms.label
# split into training and testing sets
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
(4179,)
(1393,)
# instantiate the vectorizer
vect = CountVectorizer()
# learn training data vocabulary, then create document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_train_dtm
<4179x7456 sparse matrix of type '<class 'numpy.int64'>' with 55209 stored elements in Compressed Sparse Row format>
# alternative: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm
<4179x7456 sparse matrix of type '<class 'numpy.int64'>' with 55209 stored elements in Compressed Sparse Row format>
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm
<1393x7456 sparse matrix of type '<class 'numpy.int64'>' with 17604 stored elements in Compressed Sparse Row format>
# store token names
X_train_tokens = vect.get_feature_names()
# first 50 tokens
print(X_train_tokens[:50])
['00', '000', '008704050406', '0121', '01223585236', '01223585334', '0125698789', '02', '0207', '02072069400', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703', '0578', '06', '07', '07008009200', '07090201529', '07090298926', '07123456789', '07732584351', '07734396839', '07742676969', '0776xxxxxxx', '07781482378', '07786200117', '078', '07801543489', '07808', '07808247860', '07808726822', '07815296484', '07821230901', '07880867867', '0789xxxxxxx', '07946746291', '0796xxxxxx', '07973788240', '07xxxxxxxxx', '08', '0800', '08000407165', '08000776320', '08000839402', '08000930705']
# last 50 tokens
print(X_train_tokens[-50:])
['yer', 'yes', 'yest', 'yesterday', 'yet', 'yetunde', 'yijue', 'ym', 'ymca', 'yo', 'yoga', 'yogasana', 'yor', 'yorge', 'you', 'youdoing', 'youi', 'youphone', 'your', 'youre', 'yourjob', 'yours', 'yourself', 'youwanna', 'yowifes', 'yoyyooo', 'yr', 'yrs', 'ything', 'yummmm', 'yummy', 'yun', 'yunny', 'yuo', 'yuou', 'yup', 'zac', 'zaher', 'zealand', 'zebra', 'zed', 'zeros', 'zhong', 'zindgi', 'zoe', 'zoom', 'zouk', 'zyada', 'èn', '〨ud']
# view X_train_dtm as a dense matrix
X_train_dtm.toarray()
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
# count how many times EACH token appears across ALL messages in X_train_dtm
import numpy as np
X_train_counts = np.sum(X_train_dtm.toarray(), axis=0)
X_train_counts
array([ 5, 23, 2, ..., 1, 1, 1], dtype=int64)
X_train_counts.shape
(7456,)
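For larger corpora, calling toarray() densifies the whole matrix; an equivalent, more memory-friendly sketch sums the sparse matrix directly:

# same counts without densifying: sum the sparse document-term matrix,
# then flatten the resulting 1 x 7456 matrix into a 1-D array
X_train_counts = np.asarray(X_train_dtm.sum(axis=0)).flatten()
X_train_counts.shape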
# create a DataFrame of tokens with their counts
pd.DataFrame({'token':X_train_tokens, 'count':X_train_counts}).sort_values('count')
| | count | token |
|---|---|---|
3727 | 1 | jules |
4172 | 1 | mallika |
4169 | 1 | malarky |
4165 | 1 | makiing |
4161 | 1 | maintaining |
4158 | 1 | mails |
4157 | 1 | mailed |
4151 | 1 | magicalsongs |
4150 | 1 | maggi |
4149 | 1 | magazine |
4146 | 1 | madodu |
4143 | 1 | mad2 |
4142 | 1 | mad1 |
4140 | 1 | macs |
4139 | 1 | macleran |
4138 | 1 | mack |
4174 | 1 | manage |
4175 | 1 | manageable |
4178 | 1 | manchester |
4179 | 1 | manda |
4201 | 1 | marking |
4200 | 1 | marketing |
4197 | 1 | marine |
4196 | 1 | margin |
4193 | 1 | marandratha |
4192 | 1 | maraikara |
4191 | 1 | maps |
4136 | 1 | machi |
4190 | 1 | mapquest |
4187 | 1 | manual |
... | ... | ... |
2290 | 292 | do |
7257 | 293 | with |
7120 | 293 | we |
6904 | 297 | ur |
1081 | 298 | at |
2995 | 299 | get |
3465 | 302 | if |
4778 | 306 | or |
1522 | 332 | but |
4647 | 338 | not |
6017 | 344 | so |
1574 | 349 | can |
1016 | 358 | are |
4662 | 361 | now |
4743 | 390 | on |
3235 | 416 | have |
1552 | 443 | call |
6539 | 453 | that |
4704 | 460 | of |
7424 | 508 | your |
2821 | 518 | for |
4489 | 550 | my |
3623 | 568 | it |
4238 | 601 | me |
3612 | 679 | is |
3502 | 683 | in |
929 | 717 | and |
6542 | 1004 | the |
7420 | 1660 | you |
6656 | 1670 | to |
7456 rows × 2 columns
MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice). The distribution is parametrized by vectors $\theta_y = (\theta_{y1},\ldots,\theta_{yn})$ for each class $y$, where $n$ is the number of features (in text classification, the size of the vocabulary) and $\theta_{yi}$ is the probability $P(x_i \mid y)$ of feature $i$ appearing in a sample belonging to class $y$.
The parameters $\theta_y$ are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:

$$ \hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n} $$

where $N_{yi} = \sum_{x \in T} x_i$ is the number of times feature $i$ appears in a sample of class $y$ in the training set $T$, and $N_{y} = \sum_{i=1}^{n} N_{yi}$ is the total count of all features for class $y$.
The smoothing prior $\alpha \ge 0$ accounts for features not present in the learning samples and prevents zero probabilities in further computations. Setting $\alpha = 1$ is called Laplace smoothing, while $\alpha < 1$ is called Lidstone smoothing.
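Here is a small sketch of the smoothed estimate with made-up counts (a vocabulary of $n = 4$ features, one of which never appears in class $y$); in scikit-learn this $\alpha$ is the alpha parameter of MultinomialNB, visible in the model output below:

import numpy as np

alpha = 1.0                      # Laplace smoothing
N_yi = np.array([3, 0, 5, 2])    # made-up per-feature counts for class y
N_y = N_yi.sum()                 # total count of all features for class y
theta_y = (N_yi + alpha) / (N_y + alpha * len(N_yi))
print(theta_y)                   # the unseen feature gets a small nonzero probability
print(theta_y.sum())             # still sums to 1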
# train a Naive Bayes model using X_train_dtm
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)
# calculate accuracy of class predictions
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
0.988513998564
# confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))
[[1203    5]
 [  11  174]]
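To connect the matrix to the messages printed below, we can unpack it (rows are actual classes, columns are predicted classes):

# unpack the confusion matrix into its four cells
tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred_class).ravel()
print(tn, fp, fn, tp)   # 1203 true ham, 5 false positives, 11 false negatives, 174 true spam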
# predict (poorly calibrated) probabilities
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
array([2.87744864e-03, 1.83488846e-05, 2.07301295e-03, ...,
       1.09026171e-06, 1.00000000e+00, 3.98279868e-09])
# calculate AUC
print(metrics.roc_auc_score(y_test, y_pred_prob))
0.986643100054
# print message text for the false positives
X_test[y_test < y_pred_class]
574     Waiting for your call.
3375    Also andros ice etc etc
45      No calls..messages..missed calls
3415    No pic. Please re-send.
1988    No calls..messages..missed calls
Name: message, dtype: object
# print message text for the false negatives
X_test[y_test > y_pred_class]
3132    LookAtMe!: Thanks for your purchase of a video...
5       FreeMsg Hey there darling it's been 3 week's n...
3530    Xmas & New Years Eve tickets are now on sale f...
684     Hi I'm sue. I am 20 years old and work as a la...
1875    Would you like to see my XXX pics they are so ...
1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
4298    thesmszone.com lets you send free anonymous an...
4949    Hi this is Amy, we will be sending you a free ...
2821    INTERFLORA - It's not too late to order Inter...
2247    Hi ya babe x u 4goten bout me?' scammers getti...
4514    Money i have won wining number 946 wot do i do...
Name: message, dtype: object
# what do you notice about the false negatives?
X_test[3132]
"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."
scikit-learn documentation: MultinomialNB and GaussianNB
Dataset: Pima Indians Diabetes from the UCI Machine Learning Repository
# read the data
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv(url, header=None, names=col_names)
# notice that all features are continuous
pima.head()
| | pregnant | glucose | bp | skin | insulin | bmi | pedigree | age | label |
|---|---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
# create X and y
X = pima.drop('label', axis=1)
y = pima.label
# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# import both Multinomial and Gaussian Naive Bayes
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn import metrics
# testing accuracy of Multinomial Naive Bayes
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
y_pred_class = mnb.predict(X_test)
metrics.accuracy_score(y_test, y_pred_class)
0.54166666666666663
# testing accuracy of Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred_class = gnb.predict(X_test)
metrics.accuracy_score(y_test, y_pred_class)
0.79166666666666663
Conclusion: When applying Naive Bayes classification to a dataset with continuous features, it is better to use Gaussian Naive Bayes than Multinomial Naive Bayes. The latter is suitable for datasets containing discrete features (e.g., word counts).
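If Multinomial Naive Bayes is nonetheless desired on continuous data, one option (a hedged sketch, assuming scikit-learn >= 0.20 for KBinsDiscretizer) is to bin the features into non-negative integers first:

# bin each continuous feature into 10 ordinal integer levels, then fit MNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

binned_mnb = make_pipeline(
    KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform'),
    MultinomialNB())
binned_mnb.fit(X_train, y_train)
metrics.accuracy_score(y_test, binned_mnb.predict(X_test))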
Wikipedia has a short description of Gaussian Naive Bayes, as well as an excellent example of its usage.
Advantages of Naive Bayes:

- training and prediction are extremely fast compared to more sophisticated methods
- requires only a small amount of training data to estimate the necessary parameters
- each class-conditional feature distribution can be estimated independently, which helps alleviate the curse of dimensionality
- works well in many real-world situations, famously document classification and spam filtering

Disadvantages of Naive Bayes:

- known to be a bad estimator: the probabilities from predict_proba are poorly calibrated and should not be taken too seriously
- the conditional independence assumption is probably never met in practice
- the variant must be matched to the feature type (e.g., Multinomial Naive Bayes performs poorly on continuous features, as seen above)