Comparing Multinomial and Gaussian Naive Bayes

scikit-learn documentation: MultinomialNB and GaussianNB

Dataset: Pima Indians Diabetes from the UCI Machine Learning Repository

In [1]:

# read the data
import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv(url, header=None, names=col_names)

In [2]:

# notice that all features are numeric measurements, which we treat as continuous
pima.head()

Out[2]:

	pregnant	glucose	bp	skin	insulin	bmi	pedigree	age	label
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
3	1	89	66	23	94	28.1	0.167	21	0
4	0	137	40	35	168	43.1	2.288	33	1

In [3]:

# create X and y
X = pima.drop('label', axis=1)
y = pima.label

In [4]:

# split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [5]:

# import both Multinomial and Gaussian Naive Bayes
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn import metrics

In [6]:

# testing accuracy of Multinomial Naive Bayes
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
y_pred_class = mnb.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))

0.541666666667

In [7]:

# testing accuracy of Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred_class = gnb.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))

0.791666666667

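Conclusion: On this dataset, Gaussian Naive Bayes reaches about 79% testing accuracy, versus about 54% for Multinomial Naive Bayes. When applying Naive Bayes classification to a dataset with continuous features, Gaussian Naive Bayes is generally the better choice, because it models each feature with a per-class normal distribution. Multinomial Naive Bayes instead treats feature values as counts, so it is better suited to datasets containing discrete features (e.g., word counts).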
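To make the Gaussian assumption concrete, here is a minimal pure-Python sketch of the idea: estimate a prior plus a per-feature mean and variance for each class, then pick the class with the highest log-probability. This is an illustration only, not scikit-learn's implementation. The function names (`fit_gaussian_nb`, `predict_gaussian_nb`), the fixed `1e-9` variance floor, and the toy values are all assumptions made for this example.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a class prior plus per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # population variance, plus a small floor (an assumption of this
        # sketch) so a constant feature never causes division by zero
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log prior + summed Gaussian log-likelihood."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            # log of the normal density with mean m and variance var
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# toy data, made up for illustration: two continuous features per row
X_toy = [[85.0, 26.6], [89.0, 28.1], [92.0, 24.0],
         [148.0, 33.6], [183.0, 30.3], [137.0, 43.1]]
y_toy = [0, 0, 0, 1, 1, 1]

toy_model = fit_gaussian_nb(X_toy, y_toy)
print(predict_gaussian_nb(toy_model, [90.0, 27.0]))   # close to the class 0 means
print(predict_gaussian_nb(toy_model, [160.0, 35.0]))  # close to the class 1 means
```

Working in log space avoids numerical underflow when many small likelihoods are multiplied together. The "naive" part is the inner loop: each feature contributes independently to the score, given the class.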

Wikipedia has a short description of Gaussian Naive Bayes, as well as an excellent example of its usage.