The Naive Bayes classifier is probabbly the most simple classifier. But it works well for text documents well for text classification, and it is very efficient. Hence it is often used as a baseline when comparing with other classifiers. NB text classification is not as simple as it looks. There are different models (multinomial and Bernoulli), different smoothing choices.
NB is a probabilistic classifier. It computes the probability of a document $d$ being in a class $c$:
$$ P(c|d) \propto P(c) \prod_{1 \leq k \leq n_d} P(t_k|c) $$Our goal in Naive Bayes classification is to find the $best$ class. The best class is the most likely or maximum a posteriori (MAP) class $c_{map}$:
$$ c_{map} = argmax_{c \in C} \hat{P}(c|d) = argmax_{c \in C} \ \widehat{P}(c) \prod_{1 \leq k \leq n_d} \hat{P}(t_k|c) $$Multiplying lots of small probabilities can result in floating point underflow. Since
$$\log (xy) =\log(x)+\log(y)$$,
we can sum log probabilities instead of multiplying probabilities. Since $\log$ is a monotonic function, the class with the highest score does not change. So what we usually compute in practice is:
$$ c_{map} = argmax_{c\in C} \ [ \log \hat{P}(c) +\sum_{1 \leq k \leq n_d}\log \hat{P}(t_k|c)] $$import re
from collections import Counter
import math
train_pos = open('../data/vldb_train','r').readlines()
train_neg = open('../data/icse_train','r').readlines()
test_pos = open('../data/vldb_test','r').readlines()
test_neg = open('../data/icse_test','r').readlines()
N_c=len(train_pos)
N_c_=len(train_neg)
N=N_c+N_c_
Prior_c = N_c/N
Prior_c_= N_c_/N
print('Prior for pos class: '+ str(Prior_c))
print('Prior for neg class: '+ str(Prior_c_))
tokens_pos=[]
for line in train_pos:
tokens_pos+=re.findall('[a-zA-Z]+', line.lower())
voc_c = Counter(tokens_pos)
tokens_neg=[]
for line in train_neg:
tokens_neg+=re.findall('[a-zA-Z]+', line.lower())
voc_c_ = Counter(tokens_neg)
voc={}
voc=voc_c | voc_c_
voc is the set of the union of the keys. The values are not summed. Below we do the add 1 smoothing to account for unseen words in a training class. Note that we double check the smoothing is correct in that the summation of the probabilities is 1.
V=len(voc)
Pc={}
Pc_={}
length_c=sum(voc_c.values())
length_c_=sum(voc_c_.values())
for t in voc.keys():
Pc[t]=(voc_c[t]+1)/(length_c+V)
Pc_[t]=(voc_c_[t]+1)/(length_c_+V)
print(sum(Pc.values()))
print(sum(Pc_.values()))
ApplyMultinomialNB(C, V, prior, condprob, d)
W ← ExtractTokensFromDoc(V, d)
for each c∈C
score[c] ← log prior[c]
for each t∈W
score[c]+ = log condprob[t][c]
return arg maxc∈C score[c]
What if there are unseen features in test even after smoothing? Those words are not in class c or c_.
Those words should not have an impact on the result. Hence we should assign a weight that is the same for both classes, or equvalently, we can remove them when we do the summation.
TP=0 # true positive
FN=0
for line in test_pos:
W=re.findall('[a-zA-Z]+', line.lower())
score_c=math.log(Prior_c)
score_c_=math.log(Prior_c_)
for t in W:
score_c+=math.log(Pc[t]) if (t in voc.keys()) else 0
score_c_+=math.log(Pc_[t]) if (t in voc.keys()) else 0
pred=1 if score_c> score_c_ else 0
TP+=pred
FN+=(1-pred)
print(TP)
print(FN)
Test on negative cases
TN=0
FP=0
for line in test_neg:
W=re.findall('[a-zA-Z]+', line.lower())
score_c=math.log(Prior_c)
score_c_=math.log(Prior_c_)
for t in W:
score_c+=math.log(Pc[t]) if (t in Pc) else 0
score_c_+=math.log(Pc_[t]) if (t in Pc_) else 0
pred=1 if score_c> score_c_ else 0
FP+=pred
TN+=(1-pred)
print(FP)
print(TN)
precision=TP/(TP+FP)
recall=TP/(TP+FN)
F1=2/(1/precision+1/recall)
print(str(precision)+ "\t"+ str(recall)+"\tF1\t"+str(F1))
recall
p2=TN/(TN+FN)
r2=TN/(TN+FP)
F12=2/(1/p2+1/r2)
macro=(F1+F12)/2
print(F12)
print(macro)
Note that F1 for c_ is much higher. That is why we need to look the average of two F1's. When classes are not balanced, it is more accurate to look at the micro average.
The macro F1 of NB is 0.90. Logistic regression using frequency count is 0.91. LR on TFIDF is 0.88. We can see that NB is not bad although it is very simple and efficient.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
vectorizer = CountVectorizer()
X= vectorizer.fit_transform(train_pos+train_neg).toarray()
y= [1]*len(train_pos)+ [0]*len(train_neg)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
cm=confusion_matrix(y_test,y_pred)
print(cm)
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfconverter = TfidfVectorizer(max_features=2000, min_df=1, max_df=0.7)
#tfidfconverter = TfidfVectorizer()
X = tfidfconverter.fit_transform(train_pos+train_neg).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))