import pandas as pd
from sklearn.model_selection import train_test_split
from pydrift import DataDriftChecker
from pydrift.constants import PATH_DATA, RANDOM_STATE
df_titanic = pd.read_csv(PATH_DATA / 'titanic.csv')
TARGET = 'Survived'
We drop PassengerId, Name, Cabin, and Ticket because they are high-cardinality features (passenger-related variables).
A stratified 50% random sample should give us two datasets with no drift between them.
X = df_titanic.drop(columns=['PassengerId', 'Name', 'Cabin', 'Ticket', TARGET])
y = df_titanic[TARGET]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=.5, random_state=RANDOM_STATE, stratify=y
)
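To see why a stratified 50% split is a reasonable "no drift" baseline, the sketch below repeats the same `train_test_split` call on synthetic data and compares the positive rate in each half. The arrays `X_demo` and `y_demo` and the 38% positive rate are made up for illustration; they are not the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for features and a binary target (illustrative values only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.38).astype(int)

# Same kind of call as above: half the rows in each part, stratified on the target.
X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0, stratify=y_demo
)

# With stratify set, both halves keep nearly the same share of positives.
rate_a, rate_b = y_a.mean(), y_b.mean()
print(f"positive rate: first half={rate_a:.3f}, second half={rate_b:.3f}")
```

Because rows are assigned at random and the target is balanced across both halves, neither half should look systematically different from the other.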
pydrift.DataDriftChecker
data_drift_checker_ok = DataDriftChecker(X_train, X_test)
ml_model_can_discriminate Feature
data_drift_checker_ok.ml_model_can_discriminate();
No drift found in discriminative model step AUC drift check model: 0.50 AUC threshold: .5 ± 0.10
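The idea behind a discriminative-model check can be sketched independently of pydrift: label each row by the dataset it came from, train a classifier to predict that label, and read the cross-validated AUC. An AUC near 0.5 means the two datasets are indistinguishable. The helper `domain_classifier_auc`, the random forest, and the synthetic arrays below are assumptions for illustration, not pydrift's internal implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def domain_classifier_auc(left, right, random_state=0):
    """Cross-validated AUC of a model trained to tell `left` rows from `right` rows."""
    X_all = np.vstack([left, right])
    # Origin label: 0 for rows from `left`, 1 for rows from `right`.
    origin = np.r_[np.zeros(len(left), dtype=int), np.ones(len(right), dtype=int)]
    model = RandomForestClassifier(n_estimators=50, random_state=random_state)
    # Out-of-fold probabilities, so the AUC is not inflated by overfitting.
    proba = cross_val_predict(model, X_all, origin, cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(origin, proba)


rng = np.random.default_rng(0)
same_a = rng.normal(size=(500, 4))            # two samples from the same distribution
same_b = rng.normal(size=(500, 4))
shifted = rng.normal(loc=3.0, size=(500, 4))  # a sample with a shifted mean

auc_same = domain_classifier_auc(same_a, same_b)
auc_shifted = domain_classifier_auc(same_a, shifted)
print(f"same distribution: {auc_same:.2f}, shifted: {auc_shifted:.2f}")
```

The first AUC should stay close to 0.5 and the second close to 1.0, which mirrors the "no drift" and "drift" outcomes shown in this notebook.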
data_drift_checker_ok.check_numerical_columns();
No drift found in numerical columns check step
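A per-column numerical check can be approximated with a two-sample Kolmogorov-Smirnov test, as in the sketch below. This is an assumption about one reasonable way to do it; the test pydrift actually runs may differ. The helper `drifted_numerical_columns`, the `alpha` cutoff, and the demo frames are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def drifted_numerical_columns(left, right, alpha=0.01):
    """Return numerical columns whose distributions differ by a two-sample KS test."""
    drifted = []
    for column in left.select_dtypes(include="number").columns:
        # Missing values are dropped so the test only compares observed values.
        result = ks_2samp(left[column].dropna(), right[column].dropna())
        if result.pvalue < alpha:
            drifted.append(column)
    return drifted


rng = np.random.default_rng(0)
left_demo = pd.DataFrame(
    {"age": rng.normal(30, 10, 500), "fare": rng.exponential(30, 500)}
)
right_demo = pd.DataFrame(
    {"age": rng.normal(30, 10, 500), "fare": rng.exponential(90, 500)}  # fare is shifted
)

drifted_num = drifted_numerical_columns(left_demo, right_demo)
print(drifted_num)
```

Only `fare` is generated from a different distribution in the demo, so it is the column this sketch is expected to flag.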
data_drift_checker_ok.check_categorical_columns();
No drift found in categorical columns check step
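For categorical columns, a comparable sketch uses a chi-square test on the contingency table of category versus dataset of origin. Again, this is an illustrative assumption and not necessarily the test pydrift applies; `categorical_drift_pvalue` and the demo series are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def categorical_drift_pvalue(left, right):
    """p-value of a chi-square test that category frequencies match in both samples."""
    frame = pd.DataFrame(
        {
            "value": pd.concat([left, right], ignore_index=True),
            # 0 marks rows from `left`, 1 marks rows from `right`.
            "origin": np.r_[np.zeros(len(left), dtype=int), np.ones(len(right), dtype=int)],
        }
    )
    table = pd.crosstab(frame["value"], frame["origin"])  # rows: categories, columns: origin
    _, p_value, _, _ = chi2_contingency(table)
    return p_value


rng = np.random.default_rng(0)
ports = ["S", "C", "Q"]
port_left = pd.Series(rng.choice(ports, size=500, p=[0.7, 0.2, 0.1]))
port_right = pd.Series(rng.choice(ports, size=500, p=[0.2, 0.2, 0.6]))  # different mix

p_shift = categorical_drift_pvalue(port_left, port_right)
print(f"p-value for shifted category mix: {p_shift:.3g}")
```

A very small p-value indicates that the category frequencies differ between the two samples, which is the situation reported as categorical drift.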
pydrift tells you that the problem is in the Sex feature (as is obvious in this example)
X_women = X[X['Sex'] == 'female']
X_men = X[X['Sex'] == 'male']
data_drift_checker_ko = DataDriftChecker(X_women, X_men)
data_drift_checker_ko.ml_model_can_discriminate();
Drift found in discriminative model step, take a look on the most discriminative features (plots when minimal is set to False) AUC drift check model: 1.00 AUC threshold: .5 ± 0.10
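The "AUC threshold: .5 ± 0.10" line in the output can be read as a tolerance band around 0.5. The small helper below is a hypothetical reading of that rule, not pydrift's code: it treats any AUC outside the band as a sign that the model can tell the two datasets apart.

```python
def auc_signals_drift(auc, center=0.5, tolerance=0.10):
    """Return True when the AUC falls outside the band center ± tolerance."""
    return abs(auc - center) > tolerance


print(auc_signals_drift(0.50))  # False: inside the band, as in the 50% split above
print(auc_signals_drift(1.00))  # True: far outside the band, as in the women/men split
```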
data_drift_checker_ko.check_numerical_columns();
Drift found in numerical columns check step, take a look on the variables that are drifted, if one is not important you could simply delete it, otherwise check the data source
Features drifted (numerical): Pclass, Age, SibSp, Parch, Fare
data_drift_checker_ko.check_categorical_columns();
Drift found in categorical columns check step, take a look on the variables that are drifted, if one is not important you could simply delete it, otherwise check the data source
Features drifted (categorical): Sex, Embarked
data_drift_checker_ko.drifted_features
{'Age', 'Embarked', 'Fare', 'Parch', 'Pclass', 'Sex', 'SibSp'}
pydrift detects that the problem is in the Pclass and Fare features (again, obvious because this is an example)
mask = (X['Pclass'] > 1) & (X['Fare'] > 10)
X_mask = X[mask]
X_unmask = X[~mask]
data_drift_checker_ko_2 = DataDriftChecker(X_mask, X_unmask)
data_drift_checker_ko_2.ml_model_can_discriminate();
Drift found in discriminative model step, take a look on the most discriminative features (plots when minimal is set to False) AUC drift check model: 1.00 AUC threshold: .5 ± 0.10
data_drift_checker_ko_2.check_numerical_columns();
Drift found in numerical columns check step, take a look on the variables that are drifted, if one is not important you could simply delete it, otherwise check the data source
Features drifted (numerical): Pclass, Age, SibSp, Parch, Fare
data_drift_checker_ko_2.check_categorical_columns();
Drift found in categorical columns check step, take a look on the variables that are drifted, if one is not important you could simply delete it, otherwise check the data source
Features drifted (categorical): Sex, Embarked
data_drift_checker_ko_2.drifted_features
{'Age', 'Embarked', 'Fare', 'Parch', 'Pclass', 'Sex', 'SibSp'}