In this notebook we'll train a Logistic Regression model to distinguish between spam data (food reviews) and legitimate data (Austen).
Logistic regression is a standard statistical technique used to model a binary variable. In our case the binary variable we are predicting is 'spam' or 'not spam' (i.e. legitimate). Logistic regression, when combined with a reasonable feature engineering approach, is often a sensible first choice for a classification problem!
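To make the idea concrete, here is a minimal sketch of the logistic (sigmoid) function that gives the model its name: a weighted sum of the features is squashed into a probability between 0 and 1, and the model predicts 'spam' when that probability exceeds 0.5. The weights and feature values below are made up purely for illustration.

```python
import numpy as np

# Logistic regression models P(spam | x) as sigmoid(w.x + b),
# where sigmoid squashes any real number into the interval (0, 1)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights, intercept, and feature vector (hypothetical values)
w = np.array([1.5, -2.0])
b = 0.25
x = np.array([0.8, 0.1])

p = sigmoid(w @ x + b)
print(p)  # probability that this toy example is spam
```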
We begin by loading in the feature vectors which we generated in either the simple summaries feature extraction notebook or the TF-IDF feature extraction notebook.
import pandas as pd
feats = pd.read_parquet("data/features.parquet")
When doing exploratory analysis, it's often a good idea to inspect your data as a sanity check. In this case, we'll make sure that the feature vectors we generated in the last notebook have the shape we expect!
feats.sample(10)
The first two columns of the feats frame are the index and the label; the remaining columns are the feature vectors.
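As a quick illustration of the kind of sanity check this layout enables, the sketch below builds a tiny stand-in frame with the same structure (the column names are assumptions, not the real feature names) and slices out the feature matrix proper.

```python
import numpy as np
import pandas as pd

# A tiny synthetic stand-in for feats: an index column, a label column,
# then the feature values (column names here are illustrative)
toy_feats = pd.DataFrame({
    "index": range(4),
    "label": ["spam", "legitimate", "spam", "legitimate"],
    "f0": np.linspace(0.0, 1.0, 4),
    "f1": np.linspace(1.0, 0.0, 4),
})

# The feature matrix proper is everything from the third column onwards
features_only = toy_feats.iloc[:, 2:]
print(toy_feats.shape, features_only.shape)
```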
We begin by splitting the data into two sets:
- train: a set of feature vectors which will be used to train the model
- test: a set of feature vectors which will be used to evaluate the trained model
from sklearn import model_selection
train, test = model_selection.train_test_split(feats, random_state=43)
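A note on what this call does: by default, train_test_split holds out 25% of the rows for testing, and fixing random_state makes the shuffle (and therefore the split) reproducible. A self-contained sketch on toy data, not the real features:

```python
import pandas as pd
from sklearn import model_selection

# Toy frame standing in for the real feature data
toy = pd.DataFrame({"label": ["spam", "legitimate"] * 50,
                    "value": range(100)})

# Default split: 75% train, 25% test; random_state pins the shuffle
toy_train, toy_test = model_selection.train_test_split(toy, random_state=43)
print(len(toy_train), len(toy_test))  # 75 25
```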
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='lbfgs', max_iter=4000)
# train the model and time how long the fit takes
import time
start = time.time()
model.fit(X=train.iloc[:, 2:], y=train["label"])
end = time.time()
print(end - start)
With the model trained, we can use it to make predictions. We apply the model to the test set, then compare the predicted classification (spam or legitimate) to the truth.
predictions = model.predict(test.iloc[:, 2:])
predictions
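The simplest single-number comparison of predictions against the truth is accuracy, the fraction of rows classified correctly. A sketch with made-up labels (not the real test set):

```python
import numpy as np

# Illustrative truth and predictions for four test rows
y_true = np.array(["spam", "legitimate", "spam", "spam"])
y_pred = np.array(["spam", "legitimate", "legitimate", "spam"])

# Accuracy: the fraction of positions where prediction matches truth
accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.75
```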
We use a binary confusion matrix to visualise the accuracy of the model.
from mlworkflows import plot
df, chart = plot.binary_confusion_matrix(test["label"], predictions)
chart
We can look at the raw numbers and the proportions of correctly and incorrectly classified items:
df
We can also look at the precision, recall, and F1 score for the model.
from sklearn.metrics import classification_report
print(classification_report(test.label.values, predictions))
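For reference, precision, recall, and the F1 score are all derived from the confusion-matrix counts. A sketch with illustrative counts for the spam class:

```python
# Illustrative confusion-matrix counts for the "spam" class
tp, fp, fn = 90, 10, 30  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # of items flagged as spam, how many really were
recall = tp / (tp + fn)     # of actual spam, how many were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(precision, recall, f1)
```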
We want to save the model so that we can use it outside of this notebook.
model
from mlworkflows import util
util.serialize_to(model, "model.sav")
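serialize_to comes from the mlworkflows helper package used alongside these notebooks. If that helper is unavailable, the standard-library pickle module provides a comparable round trip; this is a sketch of the idea, and the helper's on-disk format may differ.

```python
import pickle
from sklearn.linear_model import LogisticRegression

# Fit a throwaway model so we have something to serialize
demo_model = LogisticRegression(solver="lbfgs")
demo_model.fit([[0.0], [1.0]], [0, 1])

# Round-trip through a byte string; pickle.dump/pickle.load
# work the same way on file objects
blob = pickle.dumps(demo_model)
restored = pickle.loads(blob)
print(restored.predict([[0.9]]))
```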