This notebook briefly presents Bayesian spam filtering, an established and effective technique. The basic idea is to treat documents as bags of words (that is, as mappings from words to frequencies, disregarding order). The underlying assumption is that spam and legitimate documents have different word distributions, so we can estimate the probability that a given set of words came from a legitimate document or from a spam one.
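As a minimal sketch of the bag-of-words idea (`bag_of_words` is a hypothetical helper, not part of the pipeline below), a document reduces to a word-to-count mapping:

```python
from collections import Counter

# A bag-of-words representation: word -> frequency, ignoring order.
# (Toy illustration; the real pipeline below uses scikit-learn.)
def bag_of_words(text):
    return Counter(text.lower().split())

bow = bag_of_words("free money free prizes")
# bow == {"free": 2, "money": 1, "prizes": 1}
```

Note that "free money free prizes" and "free prizes free money" produce the same bag, which is exactly the information the classifier will see.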
import pandas as pd
import numpy as np
data = pd.read_parquet("data/training.parquet")
We'll start by splitting our data into randomly selected train and test sets.
from sklearn import model_selection
train, test = model_selection.train_test_split(data)
Our feature extraction pipeline is very simple: we'll use a bag-of-words model in which we represent texts as dictionaries of word counts. Furthermore, we'll use feature hashing so that we store word counts in an array of hash buckets rather than as an explicit dictionary.
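The hashing trick can be sketched in a few lines (an illustrative toy, assuming a CRC32 hash; scikit-learn's `HashingVectorizer` uses a different hash function internally): each word is hashed into one of a fixed number of buckets, so no vocabulary needs to be stored, at the cost of occasional collisions.

```python
import zlib

# Map each word to one of n_buckets counters via a hash function.
# Distinct words may collide in the same bucket.
def hashed_counts(words, n_buckets=8):
    counts = [0] * n_buckets
    for w in words:
        counts[zlib.crc32(w.encode()) % n_buckets] += 1
    return counts

vec = hashed_counts(["free", "money", "free"])
# The total count is preserved even if words collide:
assert sum(vec) == 3
```

The fixed-size output is what makes this approach memory-efficient: the array length is chosen up front rather than growing with the vocabulary.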
from sklearn.feature_extraction import text as text_feature
hv = text_feature.HashingVectorizer(norm='l1', alternate_sign=False)
hashed_features = hv.transform(train.text.values)
From there, we'll use the multinomial naive Bayes classifier in scikit-learn, which will train a model to identify which words (really, which hash values of words) are most likely to distinguish between legitimate and spam messages.
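To make the idea concrete before fitting on real data, here is a toy illustration with hypothetical count vectors over a three-word vocabulary (the data is invented for this sketch):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical documents: rows are count vectors over a 3-word vocabulary.
X = np.array([[3, 0, 1],   # spam-like: dominated by word 0
              [4, 1, 0],
              [0, 3, 2],   # legitimate-like: dominated by words 1 and 2
              [1, 4, 3]])
y = ["spam", "spam", "legitimate", "legitimate"]

toy = MultinomialNB().fit(X, y)
# A new document dominated by word 0 is classified as spam:
toy.predict([[5, 0, 1]])
```

The classifier learns per-class word probabilities from the counts and scores new documents by combining those probabilities under Bayes' rule.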
from sklearn import naive_bayes
nb = naive_bayes.MultinomialNB()
nb.fit(hashed_features, train.label.values)
Once we've fit our model, we can evaluate its accuracy on our training and test sets.
nb.score(hashed_features, train.label.values)
test_features = hv.transform(test.text.values)
nb.score(test_features, test.label.values)
Raw accuracy isn't the most useful metric for evaluating a binary classifier, particularly when the classes are imbalanced: it doesn't distinguish false positives from false negatives. To visualize how the naive Bayes classifier performs overall, we'll plot a confusion matrix.
from mlworkflows import plot
df, chart = plot.binary_confusion_matrix(test.label.values, nb.predict(test_features))
chart
We can also inspect the performance in tabular form.
df
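The same tabulation can also be produced with scikit-learn's `confusion_matrix` directly; here's a minimal sketch on hypothetical labels (the label vectors below are invented for illustration):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for illustration.
y_true = ["spam", "spam", "legitimate", "legitimate", "spam"]
y_pred = ["spam", "legitimate", "legitimate", "legitimate", "spam"]

# Rows are true classes, columns are predicted classes,
# in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["legitimate", "spam"])
# cm == [[2, 0],
#        [1, 2]]  -> one spam message was misclassified as legitimate
```

Passing `labels` explicitly fixes the row and column order, which makes the matrix easier to read off.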
Finally, we can produce a report showing the precision, recall, and F1 score for each class.
from sklearn.metrics import classification_report
print(classification_report(test.label.values, nb.predict(test_features)))