import pandas as pd
df = pd.DataFrame({'feature':list(range(8)), 'target':['not fraud']*6 + ['fraud']*2})
X = df[['feature']]
y = df['target']
from sklearn.model_selection import train_test_split
Without stratification (the default), y_train contains NONE of the minority class, whereas y_test contains ALL of the minority class. (This is bad!)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
y_train
3    not fraud
0    not fraud
5    not fraud
4    not fraud
Name: target, dtype: object
y_test
6        fraud
2    not fraud
1    not fraud
7        fraud
Name: target, dtype: object
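Reading class membership off the raw output gets harder as the data grows. A minimal sketch of tallying the classes per split instead, reusing the same toy data and unstratified split; the name counts is illustrative and not part of the original tip.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same toy data as above: 6 'not fraud' rows and 2 'fraud' rows
df = pd.DataFrame({'feature': list(range(8)),
                   'target': ['not fraud']*6 + ['fraud']*2})
X = df[['feature']]
y = df['target']

# Unstratified split, identical to the call above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Count each class per split; a class missing from a split becomes 0
counts = pd.DataFrame({
    'train': y_train.value_counts(),
    'test': y_test.value_counts(),
}).fillna(0).astype(int)
print(counts)
```

For this split, the fraud row should show 0 under train and 2 under test, matching the output above.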
With stratify=y, class proportions are the SAME in y_train and y_test. (This is good!)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0, stratify=y)
y_train
1    not fraud
7        fraud
2    not fraud
4    not fraud
Name: target, dtype: object
y_test
3    not fraud
6        fraud
0    not fraud
5    not fraud
Name: target, dtype: object
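The "same proportions" claim can also be checked numerically rather than by eye. A small sketch, assuming the same toy data and the stratified call above; the name proportions is illustrative and not from the original.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same toy data as above: 6 'not fraud' rows and 2 'fraud' rows
df = pd.DataFrame({'feature': list(range(8)),
                   'target': ['not fraud']*6 + ['fraud']*2})
X = df[['feature']]
y = df['target']

# Stratified split, identical to the call above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

# Fraction of each class in the full data and in each split
proportions = pd.DataFrame({
    'full': y.value_counts(normalize=True),
    'train': y_train.value_counts(normalize=True),
    'test': y_test.value_counts(normalize=True),
})
print(proportions)
```

Each column should show 0.25 for fraud and 0.75 for not fraud, since both splits hold one fraud row and three not fraud rows.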
© 2020 Data School. All rights reserved.