import pandas as pd
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from pydrift import ModelDriftChecker
from pydrift.constants import PATH_DATA, RANDOM_STATE
from pydrift.models import cat_features_fillna
df_titanic = pd.read_csv(PATH_DATA / 'titanic.csv')
TARGET = 'Survived'
X = df_titanic.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin', TARGET])
y = df_titanic[TARGET]
cat_features = (X
.select_dtypes(include=['category', 'object'])
.columns)
X_filled = cat_features_fillna(X, cat_features)
X_train, X_test, y_train, y_test = train_test_split(
X_filled, y, test_size=.5, random_state=RANDOM_STATE, stratify=y
)
ml_classifier_model = CatBoostClassifier(
num_trees=5,
max_depth=3,
cat_features=cat_features,
random_state=RANDOM_STATE,
verbose=False
)
ml_classifier_model.fit(X_train, y_train);
df_left_data = pd.concat([X_train, y_train], axis=1)
df_right_data = pd.concat([X_test, y_test], axis=1)
model_drift_checker_ok = ModelDriftChecker(
df_left_data, df_right_data, ml_classifier_model, target_column_name=TARGET
)
model_drift_checker_ok.check_model();
No drift found in your model
AUC left data: 0.86
AUC right data: 0.84
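Under the hood, check_model essentially compares the model's discriminative power (AUC) on the left and right datasets and flags drift when the gap is too large. A minimal sketch of that idea (the auc helper and the 0.1 threshold are illustrative assumptions, not pydrift's exact internals):

```python
def auc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_gap_drift(y_left, scores_left, y_right, scores_right, threshold=0.1):
    """Flag drift when the AUC difference between datasets exceeds the threshold."""
    return abs(auc(y_left, scores_left) - auc(y_right, scores_right)) > threshold


# Perfect ranking on the left, pure noise on the right: gap of 0.5 -> drift
print(auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # 1.0
print(auc_gap_drift([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9], [0, 1], [0.5, 0.5]))
```

Here 0.86 vs 0.84 stays under any reasonable gap threshold, which is why no drift is reported.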
We will force model drift by training a more complex model (larger hyperparameter values)
ml_classifier_model_drifted = CatBoostClassifier(
num_trees=10,
max_depth=6,
cat_features=cat_features,
random_state=RANDOM_STATE,
verbose=False
)
ml_classifier_model_drifted.fit(X_train, y_train);
model_drift_checker_ko = ModelDriftChecker(
df_left_data, df_right_data, ml_classifier_model_drifted, target_column_name=TARGET
)
model_drift_checker_ko.check_model();
Drift found in your model, take a look on the most discriminative features (plots when minimal is set to False), DataDriftChecker can help you with changes in features distribution and also look at your hyperparameters
AUC left data: 0.90
AUC right data: 0.85
Sex column is the most important one, but it shows no distribution differences because this drift is caused by the model hyperparameters
model_drift_checker_ko.interpretable_drift_classifier_model.both_histogram_plot('Sex')
df_left_data_drifted = df_left_data[(df_left_data['Sex'] == 'female') | (df_left_data['SibSp'] < 2)]
df_right_data_drifted = df_right_data[(df_right_data['Sex'] == 'male') | (df_right_data['SibSp'] > 1)]
model_drift_checker_ko_2 = ModelDriftChecker(
df_left_data_drifted, df_right_data_drifted, ml_classifier_model, target_column_name=TARGET
)
model_drift_checker_ko_2.check_model();
Drift found in your model, take a look on the most discriminative features (plots when minimal is set to False), DataDriftChecker can help you with changes in features distribution and also look at your hyperparameters
AUC left data: 0.86
AUC right data: 0.64
Sex column is the most important one
model_drift_checker_ko_2.interpretable_drift_classifier_model.both_histogram_plot('Sex')
model_drift_checker_ko_2.interpretable_drift_classifier_model.partial_dependence_comparison_plot('Age')
The red part of the plot is the most critical one; the features located there are the ones that will most degrade your model's performance
model_drift_checker_ko_2.show_feature_importance_vs_drift_map_plot(top=5)
model_drift_checker_ok.show_feature_importance_vs_drift_map_plot(top=5)
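The map combines two signals per feature: how important the feature is to the model, and how much its distribution has drifted. Features scoring high on both land in the critical red zone. A minimal sketch of that idea (the feature names, scores, and thresholds below are made up for illustration; pydrift computes both signals itself):

```python
def critical_features(importance, drift_score,
                      importance_threshold=0.2, drift_threshold=0.5):
    """Return features that are both important and drifted (the 'red' zone)."""
    return sorted(
        f for f in importance
        if importance[f] >= importance_threshold
        and drift_score[f] >= drift_threshold
    )


# Hypothetical per-feature scores for the Titanic model
importance = {'Sex': 0.45, 'Age': 0.25, 'Fare': 0.20, 'Embarked': 0.10}
drift_score = {'Sex': 0.80, 'Age': 0.10, 'Fare': 0.60, 'Embarked': 0.70}
print(critical_features(importance, drift_score))  # ['Fare', 'Sex']
```

Embarked drifts but barely matters to the model, and Age matters but is stable, so only Sex and Fare would warrant attention.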
You can then retrain your model using these sample weights; in this toy example that is not useful, but in a more realistic one it will help when the drift is not too severe
weights = model_drift_checker_ko_2.sample_weight_for_retrain()
The higher the weight for an observation, the more similar it is to the test data
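One common weighting scheme (not necessarily pydrift's exact formula, which is internal) is the density ratio: weight each training row by p / (1 - p), where p is a discriminator's predicted probability that the row comes from the right/test data. A hedged sketch:

```python
def retrain_weights(probs_right, eps=1e-6):
    """Density-ratio importance weights from P(row belongs to right data)."""
    return [p / max(1.0 - p, eps) for p in probs_right]


probs = [0.2, 0.5, 0.9]           # discriminator output for three training rows
weights = retrain_weights(probs)  # rows that look like test data weigh more
print(weights)  # [0.25, 1.0, 9.000000000000002]

# CatBoost (like most sklearn-style estimators) accepts per-row weights:
# ml_classifier_model.fit(X_train, y_train, sample_weight=weights)
```

Rows indistinguishable from test data get weight 1; rows the discriminator confidently assigns to the test distribution get boosted.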
All data from the left and right dataframes are concatenated and sorted by the feature you pass as a parameter, and the count for each bin is computed.
model_drift_checker_ko_2.interpretable_drift_classifier_model.drift_by_sorted_bins_plot('Embarked')
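The steps above can be sketched in plain Python (a rough re-implementation of the idea, not pydrift's actual code): tag each row with its source, sort by the chosen feature, cut into equal-size bins, and measure how much of each bin comes from the left data. Matching distributions give roughly 50% per bin; drifted ones concentrate one source in some bins:

```python
def left_share_by_bin(left_values, right_values, n_bins):
    """Fraction of each sorted bin that comes from the left dataframe."""
    tagged = sorted([(v, 'left') for v in left_values] +
                    [(v, 'right') for v in right_values])
    size = len(tagged) // n_bins
    shares = []
    for i in range(n_bins):
        chunk = tagged[i * size:(i + 1) * size]
        shares.append(sum(1 for _, src in chunk if src == 'left') / len(chunk))
    return shares


# Right data shifted upwards: left rows crowd the low bins
print(left_share_by_bin([1, 2, 3, 4], [11, 12, 13, 14], n_bins=2))  # [1.0, 0.0]
```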