This notebook briefly presents Bayesian spam filtering, an established and effective technique. The basic idea is to treat documents as bags of words (that is, as mappings from words to frequencies, disregarding order). The underlying assumption is that spam and legitimate documents have different word distributions, so we can estimate the probability that a given set of words came from a legitimate document or from a spam one.
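As a minimal sketch of the bag-of-words idea (`bag_of_words` is a hypothetical helper, not part of the pipeline below), a document reduces to a word-to-count mapping:

```python
from collections import Counter

# A bag-of-words representation: word -> frequency, ignoring order.
# (Toy illustration; the real pipeline below uses scikit-learn.)
def bag_of_words(text):
    return Counter(text.lower().split())

bow = bag_of_words("free money free prizes")
# bow == {"free": 2, "money": 1, "prizes": 1}
```

Note that "free money free prizes" and "free prizes free money" produce the same bag, which is exactly the information the classifier will see.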
import pandas as pd
import numpy as np
data = pd.read_parquet("data/training.parquet")
We'll start by splitting our data into randomly selected train and test sets.
from sklearn import model_selection
train, test = model_selection.train_test_split(data)
Our feature extraction pipeline is very simple: we'll use a bag-of-words model in which we represent texts as dictionaries of word counts. Furthermore, we'll use feature hashing so that we store word counts in an array of hash buckets rather than as an explicit dictionary.
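The hashing trick can be sketched in a few lines (an illustrative toy, assuming a CRC32 hash; scikit-learn's `HashingVectorizer` uses a different hash function internally): each word is hashed into one of a fixed number of buckets, so no vocabulary needs to be stored, at the cost of occasional collisions.

```python
import zlib

# Map each word to one of n_buckets counters via a hash function.
# Distinct words may collide in the same bucket.
def hashed_counts(words, n_buckets=8):
    counts = [0] * n_buckets
    for w in words:
        counts[zlib.crc32(w.encode()) % n_buckets] += 1
    return counts

vec = hashed_counts(["free", "money", "free"])
# The total count is preserved even if words collide:
assert sum(vec) == 3
```

The fixed-size output is what makes this approach memory-efficient: the array length is chosen up front rather than growing with the vocabulary.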
from sklearn.feature_extraction import text as text_feature
hv = text_feature.HashingVectorizer(norm='l1', alternate_sign=False)
hashed_features = hv.transform(train.text.values)
From there, we'll use the multinomial naive Bayes classifier in scikit-learn, which will train a model to identify which words (really, which hash values of words) are most likely to distinguish between legitimate and spam messages.
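To make the idea concrete before fitting on real data, here is a toy illustration with hypothetical count vectors over a three-word vocabulary (the data is invented for this sketch):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical documents: rows are count vectors over a 3-word vocabulary.
X = np.array([[3, 0, 1],   # spam-like: dominated by word 0
              [4, 1, 0],
              [0, 3, 2],   # legitimate-like: dominated by words 1 and 2
              [1, 4, 3]])
y = ["spam", "spam", "legitimate", "legitimate"]

toy = MultinomialNB().fit(X, y)
# A new document dominated by word 0 is classified as spam:
toy.predict([[5, 0, 1]])
```

The classifier learns per-class word probabilities from the counts and scores new documents by combining those probabilities under Bayes' rule.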
from sklearn import naive_bayes
nb = naive_bayes.MultinomialNB()
nb.fit(hashed_features, train.label.values)
Once we've fit our model, we can evaluate its accuracy on our training and test sets.
nb.score(hashed_features, train.label.values)
test_features = hv.transform(test.text.values)
nb.score(test_features, test.label.values)
Raw accuracy isn't the most useful metric for evaluating a binary classifier, particularly when the classes are imbalanced: it doesn't distinguish false positives from false negatives. To visualize how the naive Bayes classifier performs overall, we'll plot a confusion matrix.
from mlworkflows import plot
df, chart = plot.binary_confusion_matrix(test.label.values, nb.predict(test_features))
chart
We can also inspect the performance in tabular form.
df
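The same tabulation can also be produced with scikit-learn's `confusion_matrix` directly; here's a minimal sketch on hypothetical labels (the label vectors below are invented for illustration):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for illustration.
y_true = ["spam", "spam", "legitimate", "legitimate", "spam"]
y_pred = ["spam", "legitimate", "legitimate", "legitimate", "spam"]

# Rows are true classes, columns are predicted classes,
# in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["legitimate", "spam"])
# cm == [[2, 0],
#        [1, 2]]  -> one spam message was misclassified as legitimate
```

Passing `labels` explicitly fixes the row and column order, which makes the matrix easier to read off.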
Finally, we can produce a report showing the precision, recall, and F1 score for each class.
from sklearn.metrics import classification_report
print(classification_report(test.label.values, nb.predict(test_features)))