In this brief notebook, you'll see how to load a saved pipeline and fit it to raw data again. This is a very useful way to retrain a classifier on new data (which you might do after observing data drift) or to change the hyperparameters you used while training the classifier.
Machine learning pipelines work in two ways. For training, they allow you to precisely specify a sequence of steps (data cleaning, feature extraction, dimensionality reduction, model training, etc.) that starts with raw data and results in a model, and to try that sequence with different hyperparameters. For production, they allow you to reuse the exact sequence of steps that was used in training a model to make predictions from new raw data.
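To make those two uses concrete, here is a minimal sketch. It is not part of this workshop's code: CountVectorizer and LogisticRegression are stand-ins for whichever feature extraction and model stages you actually trained, and the toy documents and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# training: one object captures the whole sequence from raw text to model
pipe = Pipeline([
    ("features", CountVectorizer()),
    ("model", LogisticRegression()),
])

docs = ["good product", "bad service", "great support", "terrible experience"]
labels = ["pos", "neg", "pos", "neg"]
pipe.fit(docs, labels)

# production: the same object turns new raw text into predictions
preds = pipe.predict(["really great product"])
```

Because the fitted pipeline bundles feature extraction with the model, production code never needs to know how the features were computed.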
We'll start by loading training and testing data:
import pandas as pd
from sklearn import model_selection
df = pd.read_parquet("data/training.parquet")
# X_train and X_test are lists of strings, each
# representing one document
# y_train and y_test are vectors of labels
train, test = model_selection.train_test_split(df, random_state=43)
X_train = train["text"]
y_train = train["label"]
X_test = test["text"]
y_test = test["label"]
Next up, we'll load the two steps of the pipeline that we created in earlier notebooks: feature_pipeline.sav from either the simple summaries notebook or the TF-IDF notebook, and model.sav from either the logistic regression notebook or the random forest notebook. (If you haven't worked through a feature engineering notebook and a model training notebook, the next cell won't work.)
## loading in feature extraction pipeline
import pickle

filename = 'feature_pipeline.sav'
with open(filename, 'rb') as f:
    feat_pipeline = pickle.load(f)

## loading model
filename = 'model.sav'
with open(filename, 'rb') as f:
    model = pickle.load(f)
Now we can combine the two stages together in a pipeline and fit it to new training data. (Note that the feature extraction pipeline stage, feature_pipeline.sav, is itself a pipeline!)
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('features', feat_pipeline),
    ('model', model)
])
A pipeline supports the same interface as a classifier, so we can use it to fit to raw data and then predict labels from raw data:
pipeline.fit(X_train,y_train)
y_preds = pipeline.predict(X_test)
We can then evaluate the performance of our classifier, using a confusion matrix:
from mlworkflows import plot
_, chart = plot.binary_confusion_matrix(y_test, y_preds)
chart
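The plot helper above comes from this workshop's mlworkflows module. If you want to compute the same matrix without it, scikit-learn's confusion_matrix works directly on label vectors; a self-contained sketch with invented labels standing in for y_test and y_preds:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# toy labels standing in for y_test / y_preds
y_true = ["spam", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]

labels = sorted(set(y_true))
cm = confusion_matrix(y_true, y_pred, labels=labels)

# rows are true labels, columns are predicted labels
table = pd.DataFrame(cm, index=labels, columns=labels)
```

Here the off-diagonal entry counts the one "spam" document misclassified as "ham"; the diagonal holds the correct predictions.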
...or an F1-score:
from sklearn.metrics import f1_score
# calculate the micro-averaged F1 score
mean_f1 = f1_score(y_test, y_preds, average='micro')
print(mean_f1)
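Since the combined pipeline is now fit to the training data, you could also persist it as a single artifact instead of two separate files, so production code only has to load one object. A sketch with a stand-in pipeline (the filename combined_pipeline.sav and the toy training data are just examples):

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# stand-in pipeline; in the notebook you'd pickle the fitted `pipeline` object
pipe = Pipeline([("features", TfidfVectorizer()), ("model", LogisticRegression())])
pipe.fit(["good", "bad", "fine", "awful"], [1, 0, 1, 0])

with open("combined_pipeline.sav", "wb") as f:
    pickle.dump(pipe, f)

with open("combined_pipeline.sav", "rb") as f:
    restored = pickle.load(f)

# the restored object predicts straight from raw text
preds = restored.predict(["good"])
```

As with any pickle, the file should only be loaded in an environment you trust and with a compatible scikit-learn version.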
The scikit-learn pipeline doesn't just make a particular pipeline repeatable; it also lets you run repeatable experiments by evaluating the same pipeline with different hyperparameters for the same data set. To see this in action, we'll inspect the pipeline stages to see their hyperparameters:
pipeline.named_steps
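named_steps shows the stages themselves; calling get_params() on a pipeline lists every tunable hyperparameter, with nested parameters addressed as stepname__parameter, which is the naming scheme grid search relies on. A sketch with stand-in stages (your actual pipeline's parameter names will differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([("features", TfidfVectorizer()), ("model", LogisticRegression())])

# nested parameters are exposed as <step name>__<parameter name>
params = pipe.get_params()
has_model_c = "model__C" in params            # the classifier's regularization strength
has_lowercase = "features__lowercase" in params  # a vectorizer option
```

This is why the parameter grids below use keys like model__solver: the model__ prefix routes each value to the right pipeline stage.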
Let's experiment with a couple of different options for a hyperparameter. Since we have no way of knowing while we're writing this notebook whether you trained a logistic regression model or a random forest model, this notebook will try to figure it out on the fly (since these model types have different hyperparameters). The GridSearchCV class in scikit-learn allows us to evaluate different combinations of hyperparameters; we'll use it with just a few options to quickly demonstrate its functionality.
from sklearn.model_selection import GridSearchCV

search = None
param_grid = {}
small_train, small_test = model_selection.train_test_split(df.sample(5000), random_state=43)

if 'LogisticRegression' in str(pipeline.named_steps['model']):
    # we're dealing with a logistic regression model
    param_grid = {'model__multi_class': ['ovr', 'multinomial'],
                  'model__solver': ['lbfgs', 'newton-cg']}
else:
    # we're dealing with a random forest model
    param_grid = {'model__min_samples_split': [2, 8],
                  'model__n_estimators': [25, 50]}

search = GridSearchCV(pipeline, param_grid, cv=3, return_train_score=False)
search.fit(small_train["text"], small_train["label"])
print("Best parameters were %s" % str(search.best_params_))
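Beyond best_params_, a fitted search object records the score of every combination it tried in its cv_results_ dictionary, which is convenient to inspect as a DataFrame. A self-contained sketch on synthetic data (the estimator and the toy grid over C are examples, not the workshop's pipeline):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic stand-in data so the example runs on its own
X, y = make_classification(n_samples=200, n_features=10, random_state=43)

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)

# one row per hyperparameter combination, with mean score across folds
results = pd.DataFrame(search.cv_results_)
summary = results[["param_C", "mean_test_score", "rank_test_score"]]
```

Looking at the full table, rather than just the winner, shows how sensitive the model actually is to each hyperparameter.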
GridSearchCV evaluates every hyperparameter combination from the supplied lists of values. So in the logistic regression case above, we'd consider the following hyperparameter mappings: multi_class == 'ovr' and solver == 'lbfgs'; multi_class == 'ovr' and solver == 'newton-cg'; multi_class == 'multinomial' and solver == 'lbfgs'; and multi_class == 'multinomial' and solver == 'newton-cg'.
In our example, we divide the training set into three subsets, or folds, instead of using train and test sets: if we call the three subsets $a$, $b$, and $c$, we'll train on $a$ and $b$ and test on $c$, train on $a$ and $c$ and test on $b$, and train on $b$ and $c$ and test on $a$ for each hyperparameter combination before averaging the results of each test. Since we train $k$ models (each on $k - 1$ folds) for $k$-fold cross-validation, and since we'll validate every possible combination of hyperparameters, grid search can get computationally expensive very quickly. (If you were using grid search in a real application, you'd have more time than we do during this workshop, so you'd probably use more folds for cross-validation and also probably be working with a larger sample count.)
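That fold rotation is exactly what cv=3 asks GridSearchCV to do for each combination; you can see the mechanism in isolation with cross_val_score. A sketch on synthetic data, just to illustrate:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in data so the example runs on its own
X, y = make_classification(n_samples=300, random_state=43)

# cv=3: three models, each trained on two folds and scored on the held-out third
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3)
mean_score = scores.mean()
```

The three scores correspond to the three train/test rotations described above, and their mean is the number grid search uses to rank a hyperparameter combination.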
Later in this session, we'll see how pipeline abstractions can make it easier not just to experiment with variations on techniques, but also to put machine learning into production. For now, here are some things to try out and think about.