import pandas as pd
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from pydrift import ModelDriftChecker
from pydrift.constants import PATH_DATA, RANDOM_STATE
from pydrift.models import cat_features_fillna
df_titanic = pd.read_csv(PATH_DATA / 'titanic.csv')
TARGET = 'Survived'
X = df_titanic.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin', TARGET])
y = df_titanic[TARGET]
cat_features = (X
.select_dtypes(include=['category', 'object'])
.columns)
X_filled = cat_features_fillna(X, cat_features)
X_train, X_test, y_train, y_test = train_test_split(
X_filled, y, test_size=.5, random_state=RANDOM_STATE, stratify=y
)
ml_classifier_model = CatBoostClassifier(
num_trees=5,
max_depth=3,
cat_features=cat_features,
random_state=RANDOM_STATE,
verbose=False
)
ml_classifier_model.fit(X_train, y_train);
df_left_data = pd.concat([X_train, y_train], axis=1)
df_right_data = pd.concat([X_test, y_test], axis=1)
model_drift_checker_ok = ModelDriftChecker(
df_left_data, df_right_data, ml_classifier_model, target_column_name=TARGET
)
model_drift_checker_ok.check_model();
No drift found in your model
AUC left data: 0.86
AUC right data: 0.84
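Under the hood, check_model essentially compares the model's discriminative power (AUC) on the left and right datasets and flags drift when the gap is too large. A minimal sketch of that idea (the auc helper and the 0.1 threshold are illustrative assumptions, not pydrift's exact internals):

```python
def auc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_gap_drift(y_left, scores_left, y_right, scores_right, threshold=0.1):
    """Flag drift when the AUC difference between datasets exceeds the threshold."""
    return abs(auc(y_left, scores_left) - auc(y_right, scores_right)) > threshold


# Perfect ranking on the left, pure noise on the right: gap of 0.5 -> drift
print(auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # 1.0
print(auc_gap_drift([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9], [0, 1], [0.5, 0.5]))
```

Here 0.86 vs 0.84 stays under any reasonable gap threshold, which is why no drift is reported.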
We will force model drift by training a more complex model (larger hyperparameter values)
ml_classifier_model_drifted = CatBoostClassifier(
num_trees=10,
max_depth=6,
cat_features=cat_features,
random_state=RANDOM_STATE,
verbose=False
)
ml_classifier_model_drifted.fit(X_train, y_train);
model_drift_checker_ko = ModelDriftChecker(
df_left_data, df_right_data, ml_classifier_model_drifted, target_column_name=TARGET
)
model_drift_checker_ko.check_model();
Drift found in your model, take a look on the most discriminative features (plots when minimal is set to False), DataDriftChecker can help you with changes in features distribution and also look at your hyperparameters
AUC left data: 0.90
AUC right data: 0.85
Sex column is the most important one, but it shows no distribution differences because this drift is caused by the model hyperparameters
model_drift_checker_ko.interpretable_drift_classifier_model.both_histogram_plot('Sex')
df_left_data_drifted = df_left_data[(df_left_data['Sex'] == 'female') | (df_left_data['SibSp'] < 2)]
df_right_data_drifted = df_right_data[(df_right_data['Sex'] == 'male') | (df_right_data['SibSp'] > 1)]
model_drift_checker_ko_2 = ModelDriftChecker(
df_left_data_drifted, df_right_data_drifted, ml_classifier_model, target_column_name=TARGET
)
model_drift_checker_ko_2.check_model();
Drift found in your model, take a look on the most discriminative features (plots when minimal is set to False), DataDriftChecker can help you with changes in features distribution and also look at your hyperparameters
AUC left data: 0.86
AUC right data: 0.64
Sex column is the most important one
model_drift_checker_ko_2.interpretable_drift_classifier_model.both_histogram_plot('Sex')
model_drift_checker_ko_2.interpretable_drift_classifier_model.partial_dependence_comparison_plot('Age')
The red part of the plot is the most critical one; the features located there are the ones that will most degrade your model's performance
model_drift_checker_ko_2.show_feature_importance_vs_drift_map_plot(top=5)
model_drift_checker_ok.show_feature_importance_vs_drift_map_plot(top=5)
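The map combines two signals per feature: how important the feature is to the model, and how much its distribution has drifted. Features scoring high on both land in the critical red zone. A minimal sketch of that idea (the feature names, scores, and thresholds below are made up for illustration; pydrift computes both signals itself):

```python
def critical_features(importance, drift_score,
                      importance_threshold=0.2, drift_threshold=0.5):
    """Return features that are both important and drifted (the 'red' zone)."""
    return sorted(
        f for f in importance
        if importance[f] >= importance_threshold
        and drift_score[f] >= drift_threshold
    )


# Hypothetical per-feature scores for the Titanic model
importance = {'Sex': 0.45, 'Age': 0.25, 'Fare': 0.20, 'Embarked': 0.10}
drift_score = {'Sex': 0.80, 'Age': 0.10, 'Fare': 0.60, 'Embarked': 0.70}
print(critical_features(importance, drift_score))  # ['Fare', 'Sex']
```

Embarked drifts but barely matters to the model, and Age matters but is stable, so only Sex and Fare would warrant attention.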
You can then retrain your model using these sample weights; in this toy example that is not useful, but in a more realistic one it will help when the drift is not too severe
weights = model_drift_checker_ko_2.sample_weight_for_retrain()
The higher the weight for an observation, the more similar it is to the test data
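One common weighting scheme (not necessarily pydrift's exact formula, which is internal) is the density ratio: weight each training row by p / (1 - p), where p is a discriminator's predicted probability that the row comes from the right/test data. A hedged sketch:

```python
def retrain_weights(probs_right, eps=1e-6):
    """Density-ratio importance weights from P(row belongs to right data)."""
    return [p / max(1.0 - p, eps) for p in probs_right]


probs = [0.2, 0.5, 0.9]           # discriminator output for three training rows
weights = retrain_weights(probs)  # rows that look like test data weigh more
print(weights)  # [0.25, 1.0, 9.000000000000002]

# CatBoost (like most sklearn-style estimators) accepts per-row weights:
# ml_classifier_model.fit(X_train, y_train, sample_weight=weights)
```

Rows indistinguishable from test data get weight 1; rows the discriminator confidently assigns to the test distribution get boosted.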
All data from the left and right dataframes are concatenated and sorted by the feature you pass as a parameter, and the count for each bin is computed.
model_drift_checker_ko_2.interpretable_drift_classifier_model.drift_by_sorted_bins_plot('Embarked')
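The steps above can be sketched in plain Python (a rough re-implementation of the idea, not pydrift's actual code): tag each row with its source, sort by the chosen feature, cut into equal-size bins, and measure how much of each bin comes from the left data. Matching distributions give roughly 50% per bin; drifted ones concentrate one source in some bins:

```python
def left_share_by_bin(left_values, right_values, n_bins):
    """Fraction of each sorted bin that comes from the left dataframe."""
    tagged = sorted([(v, 'left') for v in left_values] +
                    [(v, 'right') for v in right_values])
    size = len(tagged) // n_bins
    shares = []
    for i in range(n_bins):
        chunk = tagged[i * size:(i + 1) * size]
        shares.append(sum(1 for _, src in chunk if src == 'left') / len(chunk))
    return shares


# Right data shifted upwards: left rows crowd the low bins
print(left_share_by_bin([1, 2, 3, 4], [11, 12, 13, 14], n_bins=2))  # [1.0, 0.0]
```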