🤖⚡ scikit-learn tip #26 (video)

Are you using train_test_split with a classification problem?

Be sure to set "stratify=y" so that class proportions are preserved when splitting.

Especially important if you have class imbalance!

See example 👇

In [1]:

import pandas as pd
# Toy imbalanced dataset: 8 rows, 6 'not fraud' and 2 'fraud'
df = pd.DataFrame({'feature': list(range(8)), 'target': ['not fraud']*6 + ['fraud']*2})

In [2]:

X = df[['feature']]
y = df['target']

In [3]:

from sklearn.model_selection import train_test_split

Not stratified

y_train contains NONE of the minority class, whereas y_test contains ALL of the minority class. (This is bad!)

In [4]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

In [5]:

y_train

Out[5]:

3    not fraud
0    not fraud
5    not fraud
4    not fraud
Name: target, dtype: object

In [6]:

y_test

Out[6]:

6        fraud
2    not fraud
1    not fraud
7        fraud
Name: target, dtype: object
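A split like this is not a rare accident. As a rough back-of-the-envelope check, the sketch below counts how many of the possible 4-row training sets contain no minority rows. It assumes the same setup as above (8 rows, 2 of them "fraud", a 50/50 split) and that an unstratified split picks the training rows uniformly at random. The variable names are illustrative only.

```python
from fractions import Fraction
from math import comb

# Assumed setup, matching the toy data above
n_total, n_minority, n_train = 8, 2, 4

# Ways to choose a training set that avoids every minority row
train_sets_without_minority = comb(n_total - n_minority, n_train)
# All possible training sets of this size
all_train_sets = comb(n_total, n_train)

p_train_has_no_fraud = Fraction(train_sets_without_minority, all_train_sets)

# Train and test are the same size here, so by symmetry the test set is
# equally likely to end up with no minority rows. The two cases cannot
# happen at once, so their probabilities add.
p_either_side_has_no_fraud = 2 * p_train_has_no_fraud

print(p_train_has_no_fraud)        # 3/14
print(p_either_side_has_no_fraud)  # 3/7
```

Under these assumptions, about 43% of unstratified splits would put both "fraud" rows on the same side, leaving either the training set or the test set with no minority examples.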

Stratified

Class proportions are the SAME in y_train and y_test. (This is good!)

In [7]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0, stratify=y)

In [8]:

y_train

Out[8]:

1    not fraud
7        fraud
2    not fraud
4    not fraud
Name: target, dtype: object

In [9]:

y_test

Out[9]:

3    not fraud
6        fraud
0    not fraud
5    not fraud
Name: target, dtype: object
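To check that this result does not depend on the particular random_state, here is a minimal sketch that repeats the stratified split with a few different seeds and records the share of "fraud" rows on each side. It rebuilds the same toy data so it can run on its own. `fraud_share` is a hypothetical helper written for this illustration, not part of scikit-learn or pandas.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same toy data as above: 6 'not fraud' rows and 2 'fraud' rows
df = pd.DataFrame({'feature': list(range(8)),
                   'target': ['not fraud'] * 6 + ['fraud'] * 2})
X = df[['feature']]
y = df['target']

def fraud_share(labels):
    """Fraction of rows labelled 'fraud' (hypothetical helper)."""
    return (labels == 'fraud').mean()

overall_share = fraud_share(y)  # 2 of 8 rows

# Repeat the stratified split with several seeds and record the
# minority share in each half
stratified_shares = []
for seed in range(5):
    _, _, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=seed, stratify=y)
    stratified_shares.append((fraud_share(y_train), fraud_share(y_test)))

print(overall_share)
print(stratified_shares)
```

With this data each half should contain exactly one "fraud" row for every seed, so both shares match the overall 25%. On real datasets the class counts rarely divide so evenly, so expect the proportions to be close rather than identical.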