Naive Bayes Text Classifier From Scratch

Contents

  1. TrainMultinomialNB
  2. Apply NB
  3. Comparison with other classifiers
  4. Project

The Naive Bayes (NB) classifier is probably the simplest classifier, but it works well for text classification and is very efficient, so it is often used as a baseline when comparing classifiers. Still, NB text classification is not as simple as it looks: there are different models (multinomial and Bernoulli) and different smoothing choices.

NB is a probabilistic classifier. It computes the probability of a document $d$ being in a class $c$:

$$ P(c|d) \propto P(c) \prod_{1 \leq k \leq n_d} P(t_k|c) $$
  • Here $n_d$ is the length of the document, i.e., the number of tokens. Note that this formula is the multinomial NB model; the Bernoulli model differs slightly.
  • $P(t_k|c)$ is the conditional probability of term $t_k$ occurring in a document of class $c$. We interpret $P(t_k|c)$ as a measure of how much evidence $t_k$ contributes that $c$ is the correct class.
  • $P(c)$ is the prior probability of $c$. If a document's terms do not provide clear evidence for one class vs. another, we choose the $c$ with the highest $P(c)$. (A toy calculation with made-up numbers follows this list.)
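
For concreteness, here is a toy calculation with made-up numbers (illustrative only, not estimated from the data used below). Take two classes $c_1, c_2$ with priors $P(c_1)=0.3$ and $P(c_2)=0.7$, and a two-token document $d$ with

$$ P(t_1|c_1)=0.02, \quad P(t_2|c_1)=0.01, \quad P(t_1|c_2)=0.001, \quad P(t_2|c_2)=0.002 $$

Then

$$ P(c_1)\prod_k P(t_k|c_1) = 0.3 \times 0.02 \times 0.01 = 6 \times 10^{-5} \quad > \quad P(c_2)\prod_k P(t_k|c_2) = 0.7 \times 0.001 \times 0.002 = 1.4 \times 10^{-6} $$

so $d$ is assigned to $c_1$ even though $c_1$ has the smaller prior.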

Estimate $P(c)$ and the conditional probabilities

Our goal in Naive Bayes classification is to find the best class. The best class is the most likely or maximum a posteriori (MAP) class $c_{map}$:

$$ c_{map} = \arg\max_{c \in C} \hat{P}(c|d) = \arg\max_{c \in C} \ \hat{P}(c) \prod_{1 \leq k \leq n_d} \hat{P}(t_k|c) $$

Multiplying lots of small probabilities can result in floating point underflow. Since
$$\log (xy) = \log(x) + \log(y),$$

we can sum log probabilities instead of multiplying probabilities. Since $\log$ is a monotonic function, the class with the highest score does not change. So what we usually compute in practice is:

$$ c_{map} = \arg\max_{c\in C} \ \Big[ \log \hat{P}(c) + \sum_{1 \leq k \leq n_d}\log \hat{P}(t_k|c) \Big] $$
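
To see why this matters numerically, here is a small sketch (with made-up probabilities, independent of the data below): the raw product of many small terms underflows to 0.0, while the sum of logs stays finite and preserves the ranking.

In [ ]:
import math

# 1000 token probabilities of 1e-4 each: the raw product underflows to 0.0,
# while the sum of logs is an ordinary finite number (1000 * ln(1e-4) ≈ -9210.34)
probs = [1e-4] * 1000

product = 1.0
for p in probs:
    product *= p
print(product)      # 0.0 due to underflow

log_score = sum(math.log(p) for p in probs)
print(log_score)    # approximately -9210.34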

TrainMultinomialNB

In [2]:
import re
from collections import Counter
import math

# training and test documents, one per line
# (vldb = positive class, icse = negative class)
train_pos = open('../data/vldb_train','r').readlines()
train_neg = open('../data/icse_train','r').readlines()
test_pos = open('../data/vldb_test','r').readlines()
test_neg = open('../data/icse_test','r').readlines()

# class priors: fraction of training documents in each class
N_c=len(train_pos)
N_c_=len(train_neg)

N=N_c+N_c_
Prior_c = N_c/N
Prior_c_= N_c_/N
print('Prior for pos class: '+ str(Prior_c))
print('Prior for neg class: '+ str(Prior_c_))
Prior for pos class: 0.2727272727272727
Prior for neg class: 0.7272727272727273
In [3]:
# tokenize each class: lowercase, alphabetic tokens only
tokens_pos=[]
for line in train_pos:
    tokens_pos+=re.findall('[a-zA-Z]+', line.lower())
voc_c = Counter(tokens_pos)   # term frequencies in the positive class

tokens_neg=[]
for line in train_neg:
    tokens_neg+=re.findall('[a-zA-Z]+', line.lower())
voc_c_ = Counter(tokens_neg)  # term frequencies in the negative class

# vocabulary: union of the keys from both classes
voc = voc_c | voc_c_

voc contains the union of the two vocabularies' keys; the counts are not summed (Counter union keeps the larger count for each key), and only the keys and their number V are used below. Next we apply add-1 (Laplace) smoothing to account for words unseen in one of the training classes. Note that we double-check the smoothing is correct by verifying that the probabilities in each class sum to 1.
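
As a quick illustration of this union behaviour (toy Counters, unrelated to the training data):

In [ ]:
from collections import Counter

# for each key, | keeps the larger of the two counts; keys from both sides survive
print(Counter(a=2, b=1) | Counter(a=5, c=3))   # Counter({'a': 5, 'c': 3, 'b': 1})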

In [4]:
V=len(voc)   # vocabulary size

# conditional probabilities with add-1 (Laplace) smoothing:
# P(t|c) = (count(t, c) + 1) / (total tokens in c + V)
Pc={}
Pc_={}
length_c=sum(voc_c.values())    # total number of tokens in the positive class
length_c_=sum(voc_c_.values())  # total number of tokens in the negative class
for t in voc.keys():
    Pc[t]=(voc_c[t]+1)/(length_c+V)
    Pc_[t]=(voc_c_[t]+1)/(length_c_+V)

# sanity check: each smoothed distribution should sum to 1 (up to floating point error)
print(sum(Pc.values()))
print(sum(Pc_.values()))
0.9999999999999319
1.0000000000001037

Apply NB

ApplyMultinomialNB(C, V, prior, condprob, d)
    W ← ExtractTokensFromDoc(V, d)
    for each c ∈ C
        score[c] ← log prior[c]
        for each t ∈ W
            score[c] += log condprob[t][c]
    return arg max_{c ∈ C} score[c]
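
Below is a minimal Python rendering of this pseudocode, specialized to our two classes and wired to the Prior_c, Prior_c_, Pc and Pc_ values computed above (the function name apply_multinomial_nb is ours). Terms outside the training vocabulary are simply skipped, as discussed in the next subsection; the evaluation cells below inline the same logic.

In [ ]:
# minimal ApplyMultinomialNB for our two classes;
# relies on re, math, Prior_c, Prior_c_, Pc and Pc_ from the cells above
def apply_multinomial_nb(doc):
    W = re.findall('[a-zA-Z]+', doc.lower())
    score_c = math.log(Prior_c)      # positive (vldb) class
    score_c_ = math.log(Prior_c_)    # negative (icse) class
    for t in W:
        if t in Pc:                  # skip terms outside the training vocabulary
            score_c += math.log(Pc[t])
            score_c_ += math.log(Pc_[t])
    return 1 if score_c > score_c_ else 0   # 1 = positive, 0 = negative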

Unseen words

What if the test data contains words that were never seen in training, i.e., words that appear in neither class c nor c_? Smoothing does not cover them, since they are not in the vocabulary at all.

Such words should not affect the result. Hence we assign them the same weight in both classes, or equivalently, simply drop them from the summation.

In [8]:
TP=0  # true positives: positive test documents classified as positive
FN=0  # false negatives: positive test documents classified as negative
for line in test_pos:
    W=re.findall('[a-zA-Z]+', line.lower())
    score_c=math.log(Prior_c)
    score_c_=math.log(Prior_c_)
    for t in W:
        if t in Pc:   # skip terms outside the training vocabulary
            score_c+=math.log(Pc[t])
            score_c_+=math.log(Pc_[t])

    pred=1 if score_c> score_c_ else 0
    TP+=pred
    FN+=(1-pred)
print(TP)
print(FN)
837
163

Test on negative cases

In [15]:
TN=0  # true negatives: negative test documents classified as negative
FP=0  # false positives: negative test documents classified as positive
for line in test_neg:
    W=re.findall('[a-zA-Z]+', line.lower())
    score_c=math.log(Prior_c)
    score_c_=math.log(Prior_c_)
    for t in W:
        if t in Pc:   # skip terms outside the training vocabulary
            score_c+=math.log(Pc[t])
            score_c_+=math.log(Pc_[t])

    pred=1 if score_c> score_c_ else 0
    FP+=pred
    TN+=(1-pred)
print(FP)
print(TN)
88
1912
In [16]:
precision=TP/(TP+FP)
recall=TP/(TP+FN)
F1=2/(1/precision+1/recall)   # harmonic mean of precision and recall
print(str(precision)+ "\t"+ str(recall)+"\tF1\t"+str(F1))
0.9048648648648648	0.837	F1	0.8696103896103896
In [17]:
recall
Out[17]:
0.837
In [19]:
# precision, recall and F1 with the negative class (c_) as the target class
p2=TN/(TN+FN)
r2=TN/(TN+FP)
F12=2/(1/p2+1/r2)
macro=(F1+F12)/2   # macro-averaged F1 over the two classes
print(F12)
print(macro)
0.9384049079754602
0.904007648792925

Note that the F1 for c_ is much higher, which is why we look at the average of the two F1 scores rather than at a single one. When classes are unbalanced, pay attention to how the average is taken: the macro average weights each class equally, while the micro average weights each document equally and is therefore dominated by the larger class. The sketch below computes both from the counts above.
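
As a quick check, the sketch below computes both averages from the counts above (variable names are ours); for single-label classification the micro-averaged F1 equals accuracy.

In [ ]:
# per-class F1 is one-vs-rest; macro averaging weights the classes equally,
# micro averaging pools the per-class counts before computing F1
tp_micro = TP + TN   # each class contributes its own true positives
fp_micro = FP + FN
fn_micro = FN + FP

p_micro = tp_micro / (tp_micro + fp_micro)
r_micro = tp_micro / (tp_micro + fn_micro)
micro_F1 = 2 / (1/p_micro + 1/r_micro)   # equals accuracy here

macro_F1 = (F1 + F12) / 2                # same as `macro` above

print('micro F1:', micro_F1)
print('macro F1:', macro_F1)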

Comparison with other classifiers

The macro F1 of NB is about 0.90. Logistic regression on raw term counts reaches a macro F1 of about 0.91, and on TF-IDF features about 0.89 (see the reports below). So NB holds up well even though it is very simple and efficient.

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# bag-of-words term counts over all training documents
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_pos+train_neg).toarray()
y = [1]*len(train_pos) + [0]*len(train_neg)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X_train, y_train) 
y_pred = logreg.predict(X_test)


cm=confusion_matrix(y_test,y_pred)
print(cm)
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))
[[1530   43]
 [ 114  513]]
              precision    recall  f1-score   support

           0       0.93      0.97      0.95      1573
           1       0.92      0.82      0.87       627

    accuracy                           0.93      2200
   macro avg       0.93      0.90      0.91      2200
weighted avg       0.93      0.93      0.93      2200

0.9286363636363636
In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF features, capped at 2000 terms
tfidfconverter = TfidfVectorizer(max_features=2000, min_df=1, max_df=0.7)
#tfidfconverter = TfidfVectorizer()
X = tfidfconverter.fit_transform(train_pos+train_neg).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X_train, y_train) 
y_pred = logreg.predict(X_test)


print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))
[[1535   38]
 [ 149  478]]
              precision    recall  f1-score   support

           0       0.91      0.98      0.94      1573
           1       0.93      0.76      0.84       627

    accuracy                           0.92      2200
   macro avg       0.92      0.87      0.89      2200
weighted avg       0.92      0.92      0.91      2200

0.915

Project

  1. Is the comparison in the code above conclusive? How many problems can you find? (Hint: the train/test split for the other classifiers is random; one dataset is not enough; one split is not enough; ...)
  2. Multinomial vs. Bernoulli model: which is better? In what situations should we use the multinomial model? Implement the Bernoulli model and conduct a performance comparison.
  3. Smoothing: this example uses simple add-1 (Laplace) smoothing. Can Good-Turing smoothing improve the performance? See "Good-Turing smoothing without tears".
  4. sklearn NB: is it the same as our NB? Which model does it correspond to? What smoothing does it use? (A starting-point sketch follows this list.)
  5. Unbalanced classes: when classes are unbalanced, we need to tackle the bias. Refer to the paper "Naive Bayes for Text Classification with Unbalanced Classes", Eibe Frank and Remco R. Bouckaert, 2006.
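
As a starting point for item 4, the sketch below runs scikit-learn's MultinomialNB on raw term counts rebuilt with the same settings as the logistic regression cell above (variable names such as Xc are ours). MultinomialNB implements the multinomial model and applies Lidstone/Laplace smoothing; its default alpha=1.0 corresponds to the add-1 smoothing used in this notebook.

In [ ]:
# starting point for project item 4: sklearn's multinomial NB on term counts
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# rebuild the count features so this cell does not depend on whether the
# count-based or the TF-IDF cell ran last
Xc = CountVectorizer().fit_transform(train_pos + train_neg)
yc = [1]*len(train_pos) + [0]*len(train_neg)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, test_size=0.2, random_state=1)

nb = MultinomialNB(alpha=1.0)   # alpha=1.0 (the default) is add-1 smoothing
nb.fit(Xc_train, yc_train)
yc_pred = nb.predict(Xc_test)

print(confusion_matrix(yc_test, yc_pred))
print(classification_report(yc_test, yc_pred))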