The Naive Bayes classifier is probabbly the most simple classifier. But it works well for text documents well for text classification, and it is very efficient. Hence it is often used as a baseline when comparing with other classifiers. NB text classification is not as simple as it looks. There are different models (multinomial and Bernoulli), different smoothing choices.

NB is a probabilistic classifier. It computes the probability of a document $d$ being in a class $c$:

$$ P(c|d) \propto P(c) \prod_{1 \leq k \leq n_d} P(t_k|c) $$- Here $n_d$ is the length of the document, i.e., number of tokens. Note that the formula indicates that it is NB multinomial model. NB Bernoulli differs slightly.
- $P(t_k|c)$: conditional probability of term $t_k$ occurring in a document of class $c$. $P(t_k|c)$ as a measure of how much evidence $t_k$ contributes that $c$ is the correct class.
- $P(c)$ is the prior probability of $c$. If a document's terms do not provide clear evidence for one class vs. another, we choose the $c$ with highest $P(c)$.

Our goal in Naive Bayes classification is to find the $best$ class. The best class is the most likely or maximum a posteriori (MAP) class $c_{map}$:

$$ c_{map} = argmax_{c \in C} \hat{P}(c|d) = argmax_{c \in C} \ \widehat{P}(c) \prod_{1 \leq k \leq n_d} \hat{P}(t_k|c) $$Multiplying lots of small probabilities can result in floating point underflow. Since

$$\log (xy) =\log(x)+\log(y)$$,

we can sum log probabilities instead of multiplying probabilities. Since $\log$ is a monotonic function, the class with the highest score does not change. So what we usually compute in practice is:

$$ c_{map} = argmax_{c\in C} \ [ \log \hat{P}(c) +\sum_{1 \leq k \leq n_d}\log \hat{P}(t_k|c)] $$In [2]:

```
import re
from collections import Counter
import math
train_pos = open('../data/vldb_train','r').readlines()
train_neg = open('../data/icse_train','r').readlines()
test_pos = open('../data/vldb_test','r').readlines()
test_neg = open('../data/icse_test','r').readlines()
N_c=len(train_pos)
N_c_=len(train_neg)
N=N_c+N_c_
Prior_c = N_c/N
Prior_c_= N_c_/N
print('Prior for pos class: '+ str(Prior_c))
print('Prior for neg class: '+ str(Prior_c_))
```

In [3]:

```
tokens_pos=[]
for line in train_pos:
tokens_pos+=re.findall('[a-zA-Z]+', line.lower())
voc_c = Counter(tokens_pos)
tokens_neg=[]
for line in train_neg:
tokens_neg+=re.findall('[a-zA-Z]+', line.lower())
voc_c_ = Counter(tokens_neg)
voc={}
voc=voc_c | voc_c_
```

*voc* is the set of the union of the keys. The values are not summed. Below we do the add 1 smoothing to account for unseen words in a training class. Note that we double check the smoothing is correct in that the summation of the probabilities is 1.

In [4]:

```
V=len(voc)
Pc={}
Pc_={}
length_c=sum(voc_c.values())
length_c_=sum(voc_c_.values())
for t in voc.keys():
Pc[t]=(voc_c[t]+1)/(length_c+V)
Pc_[t]=(voc_c_[t]+1)/(length_c_+V)
print(sum(Pc.values()))
print(sum(Pc_.values()))
```

ApplyMultinomialNB(C, V, prior, condprob, d)

W ← ExtractTokensFromDoc(V, d)

for each c∈C

score[c] ← log prior[c]

```
for each t∈W
score[c]+ = log condprob[t][c]
```

return arg maxc∈C score[c]

What if there are unseen features in test even after smoothing? Those words are not in class c or c_.

Those words should not have an impact on the result. Hence we should assign a weight that is the same for both classes, or equvalently, we can remove them when we do the summation.

In [8]:

```
TP=0 # true positive
FN=0
for line in test_pos:
W=re.findall('[a-zA-Z]+', line.lower())
score_c=math.log(Prior_c)
score_c_=math.log(Prior_c_)
for t in W:
score_c+=math.log(Pc[t]) if (t in voc.keys()) else 0
score_c_+=math.log(Pc_[t]) if (t in voc.keys()) else 0
pred=1 if score_c> score_c_ else 0
TP+=pred
FN+=(1-pred)
print(TP)
print(FN)
```

Test on negative cases

In [15]:

```
TN=0
FP=0
for line in test_neg:
W=re.findall('[a-zA-Z]+', line.lower())
score_c=math.log(Prior_c)
score_c_=math.log(Prior_c_)
for t in W:
score_c+=math.log(Pc[t]) if (t in Pc) else 0
score_c_+=math.log(Pc_[t]) if (t in Pc_) else 0
pred=1 if score_c> score_c_ else 0
FP+=pred
TN+=(1-pred)
print(FP)
print(TN)
```

In [16]:

```
precision=TP/(TP+FP)
recall=TP/(TP+FN)
F1=2/(1/precision+1/recall)
print(str(precision)+ "\t"+ str(recall)+"\tF1\t"+str(F1))
```

In [17]:

```
recall
```

Out[17]:

In [19]:

```
p2=TN/(TN+FN)
r2=TN/(TN+FP)
F12=2/(1/p2+1/r2)
macro=(F1+F12)/2
print(F12)
print(macro)
```

Note that F1 for c_ is much higher. That is why we need to look the average of two F1's. When classes are not balanced, it is more accurate to look at the micro average.

The macro F1 of NB is 0.90. Logistic regression using frequency count is 0.91. LR on TFIDF is 0.88. We can see that NB is not bad although it is very simple and efficient.

In [20]:

```
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
vectorizer = CountVectorizer()
X= vectorizer.fit_transform(train_pos+train_neg).toarray()
y= [1]*len(train_pos)+ [0]*len(train_neg)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
cm=confusion_matrix(y_test,y_pred)
print(cm)
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))
```

In [29]:

```
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfconverter = TfidfVectorizer(max_features=2000, min_df=1, max_df=0.7)
#tfidfconverter = TfidfVectorizer()
X = tfidfconverter.fit_transform(train_pos+train_neg).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))
```

- Whether the comparison in the above code is conclusive? How many problems are there? (Hint: Data split is random in other classifiers. One data is not enough. One split is not enough. ...)
- Multinomial vs Bernoulli model: which is better? In what situation we should use Multinomial NB? Implement the Bernoulli model and conduct a performance comparison.
- Smoothing: This example uses simple +1 (Laplacian) smoothing. Whether Good-Turing smoothing can improve the performance? Good-Turing smoothing without tears
- sklearn NB: is that NB the same as our NB? Which model it corresponds to? What smoothing it uses?
- Unbalanced classes: When classes are unbalanced, we need to tackle the bias. Refer the paper "Naive Bayes for Text Classification with Unbalanced Classes, Eibe Frank, Remco R. Bouckaert, 2006."

In [ ]:

```
```

In [ ]:

```
```

In [ ]:

```
```