In this notebook we'll train a Logistic Regression model to distinguish between spam data (food reviews) and legitimate data (Austen).
Logistic regression is a standard statistical technique used to model a binary variable. In our case the binary variable we are predicting is 'spam' or 'not spam' (i.e. legitimate). Logistic regression, when combined with a reasonable feature engineering approach, is often a sensible first choice for a classification problem!
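To make the idea concrete, here is a minimal sketch of the logistic (sigmoid) function that gives the model its name: a weighted sum of the features is squashed into a probability between 0 and 1, and the model predicts 'spam' when that probability exceeds 0.5. The weights and feature values below are made up purely for illustration.

```python
import numpy as np

# Logistic regression models P(spam | x) as sigmoid(w.x + b),
# where sigmoid squashes any real number into the interval (0, 1)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights, intercept, and feature vector (hypothetical values)
w = np.array([1.5, -2.0])
b = 0.25
x = np.array([0.8, 0.1])

p = sigmoid(w @ x + b)
print(p)  # probability that this toy example is spam
```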
We begin by loading in the feature vectors which we generated in either the simple summaries feature extraction notebook or the TF-IDF feature extraction notebook.
import pandas as pd
feats = pd.read_parquet("data/features.parquet")
When doing exploratory analysis, it's often a good idea to inspect your data as a sanity check. In this case, we'll make sure that the feature vectors we generated in the last notebook have the shape we expect!
feats.sample(10)
The first two columns of the feats frame are the index and the label; the remaining columns are the feature vectors.
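As a quick illustration of the kind of sanity check this layout enables, the sketch below builds a tiny stand-in frame with the same structure (the column names are assumptions, not the real feature names) and slices out the feature matrix proper.

```python
import numpy as np
import pandas as pd

# A tiny synthetic stand-in for feats: an index column, a label column,
# then the feature values (column names here are illustrative)
toy_feats = pd.DataFrame({
    "index": range(4),
    "label": ["spam", "legitimate", "spam", "legitimate"],
    "f0": np.linspace(0.0, 1.0, 4),
    "f1": np.linspace(1.0, 0.0, 4),
})

# The feature matrix proper is everything from the third column onwards
features_only = toy_feats.iloc[:, 2:]
print(toy_feats.shape, features_only.shape)
```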
We begin by splitting the data into two sets:
- train: a set of feature vectors which will be used to train the model
- test: a set of feature vectors which will be used to evaluate the trained model
from sklearn import model_selection
train, test = model_selection.train_test_split(feats, random_state=43)
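A note on what this call does: by default, train_test_split holds out 25% of the rows for testing, and fixing random_state makes the shuffle (and therefore the split) reproducible. A self-contained sketch on toy data, not the real features:

```python
import pandas as pd
from sklearn import model_selection

# Toy frame standing in for the real feature data
toy = pd.DataFrame({"label": ["spam", "legitimate"] * 50,
                    "value": range(100)})

# Default split: 75% train, 25% test; random_state pins the shuffle
toy_train, toy_test = model_selection.train_test_split(toy, random_state=43)
print(len(toy_train), len(toy_test))  # 75 25
```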
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='lbfgs', max_iter=4000)
# train the model and time how long the fit takes
import time
start = time.time()
model.fit(X=train.iloc[:, 2:], y=train["label"])
end = time.time()
print(end - start)
With the model trained, we can use it to make predictions. We apply the model to the test set, then compare the predicted classification (spam or legitimate) to the truth.
predictions = model.predict(test.iloc[:, 2:])
predictions
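The simplest single-number comparison of predictions against the truth is accuracy, the fraction of rows classified correctly. A sketch with made-up labels (not the real test set):

```python
import numpy as np

# Illustrative truth and predictions for four test rows
y_true = np.array(["spam", "legitimate", "spam", "spam"])
y_pred = np.array(["spam", "legitimate", "legitimate", "spam"])

# Accuracy: the fraction of positions where prediction matches truth
accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.75
```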
We use a binary confusion matrix to visualise the accuracy of the model.
from mlworkflows import plot
df, chart = plot.binary_confusion_matrix(test["label"], predictions)
chart
We can look at the raw numbers and the proportions of correctly and incorrectly classified items:
df
We can also look at the precision, recall, and F1 score for the model.
from sklearn.metrics import classification_report
print(classification_report(test.label.values, predictions))
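For reference, precision, recall, and the F1 score are all derived from the confusion-matrix counts. A sketch with illustrative counts for the spam class:

```python
# Illustrative confusion-matrix counts for the "spam" class
tp, fp, fn = 90, 10, 30  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # of items flagged as spam, how many really were
recall = tp / (tp + fn)     # of actual spam, how many were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(precision, recall, f1)
```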
We want to save the model so that we can use it outside of this notebook.
model
from mlworkflows import util
util.serialize_to(model, "model.sav")
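serialize_to comes from the mlworkflows helper package used alongside these notebooks. If that helper is unavailable, the standard-library pickle module provides a comparable round trip; this is a sketch of the idea, and the helper's on-disk format may differ.

```python
import pickle
from sklearn.linear_model import LogisticRegression

# Fit a throwaway model so we have something to serialize
demo_model = LogisticRegression(solver="lbfgs")
demo_model.fit([[0.0], [1.0]], [0, 1])

# Round-trip through a byte string; pickle.dump/pickle.load
# work the same way on file objects
blob = pickle.dumps(demo_model)
restored = pickle.loads(blob)
print(restored.predict([[0.9]]))
```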