🤖⚡ scikit-learn tip #43 (video)

With a tree-based model, try OrdinalEncoder instead of OneHotEncoder even for nominal (unordered) features.

Accuracy will often be similar, but OrdinalEncoder will usually be much faster, because the model trains on one column per feature instead of one column per category!
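Why can arbitrary integer codes work for unordered categories? A decision tree only makes threshold splits, and two stacked thresholds are enough to isolate any single code. The sketch below illustrates that idea in plain Python; the `codes` mapping and the `isolate` helper are hypothetical and not part of scikit-learn.

```python
# Hypothetical nominal feature, ordinal-encoded with arbitrary integer codes.
codes = {"blue": 0, "green": 1, "red": 2, "yellow": 3}


def isolate(target_code, value_code):
    """Mimic two stacked threshold splits, the kind a decision tree makes."""
    if value_code <= target_code - 0.5:  # first split: codes below the target
        return False
    if value_code <= target_code + 0.5:  # second split: the target code itself
        return True
    return False                         # everything above the target


# Every category can be separated from all the others with two splits.
isolated = {
    name: [other for other, c in codes.items() if isolate(code, c)]
    for name, code in codes.items()
}
print(isolated)
```

A real tree picks its own thresholds during training, so this only shows that the splits it needs are available, not that it will always find them.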

See example 👇

In [1]:

import pandas as pd
df = pd.read_csv('https://www.openml.org/data/get_csv/1595261/adult-census.csv')

In [2]:

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

In [3]:

categorical_cols = ['workclass', 'education', 'marital-status',
                    'occupation', 'relationship', 'race', 'sex']

In [4]:

# Features: the categorical columns only; target: the 'class' column
X = df[categorical_cols]
y = df['class']

In [5]:

# OneHotEncoder creates 60 columns
ohe = OneHotEncoder()
ohe.fit_transform(X).shape

Out[5]:

(48842, 60)

In [6]:

# OrdinalEncoder creates 7 columns
oe = OrdinalEncoder()
oe.fit_transform(X).shape

Out[6]:

(48842, 7)
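To make the difference in column counts concrete, here is a minimal sketch on a made-up single-column toy dataset (the colour values are an assumption for illustration, not taken from the census data). It shows what each encoder actually produces.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy data: one nominal feature with three distinct categories
toy = [['red'], ['green'], ['blue'], ['green']]

# OrdinalEncoder: one column, categories sorted and mapped to 0, 1, 2
ordinal = OrdinalEncoder().fit_transform(toy)
print(ordinal.tolist())   # [[2.0], [1.0], [0.0], [1.0]]

# OneHotEncoder: one column per category (blue, green, red)
onehot = OneHotEncoder().fit_transform(toy).toarray()
print(onehot.tolist())    # [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
```

The ordinal codes follow sorted category order, so they carry no real meaning for a nominal feature; the tip is that tree-based models often cope with that anyway.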

In [7]:

# Random forest is a tree-based model
rf = RandomForestClassifier(random_state=1, n_jobs=-1)

In [8]:

# Pipeline containing OneHotEncoder
ohe_pipe = make_pipeline(ohe, rf)
%time cross_val_score(ohe_pipe, X, y).mean()

CPU times: user 1.95 s, sys: 189 ms, total: 2.14 s
Wall time: 23.2 s

Out[8]:

0.8262561170407418

In [9]:

# Pipeline containing OrdinalEncoder
oe_pipe = make_pipeline(oe, rf)
%time cross_val_score(oe_pipe, X, y).mean()

CPU times: user 1.67 s, sys: 133 ms, total: 1.81 s
Wall time: 3.83 s

Out[9]:

0.8256623624061437

