🤖⚡ scikit-learn tip #7 (video)

Q: For a one-hot encoded feature, what can you do if new data contains categories that weren't seen during training?

A: Set handle_unknown='ignore' to encode new categories as all zeros.

See example 👇

P.S. If you know all possible categories that might ever appear, you can instead specify the categories manually. handle_unknown='ignore' is useful specifically when you don't know all possible categories.

In [1]:

import pandas as pd
X = pd.DataFrame({'col':['A', 'B', 'C', 'B']})
X_new = pd.DataFrame({'col':['A', 'C', 'D']})

In [2]:

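from sklearn.preprocessing import OneHotEncoder
# sparse_output=False returns a dense array (on scikit-learn older than 1.2, use sparse=False instead)
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')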

In [3]:

X

Out[3]:

	col
0	A
1	B
2	C
3	B

In [4]:

# three columns represent categories A, B, and C
ohe.fit_transform(X[['col']])

Out[4]:

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

In [5]:

# category D was not learned by OneHotEncoder during the "fit" step
X_new

Out[5]:

	col
0	A
1	C
2	D

In [6]:

# category D is encoded as all zeros
ohe.transform(X_new[['col']])

Out[6]:

array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 0.]])
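
The alternative mentioned in the P.S., specifying the categories manually, might look roughly like the sketch below. It assumes that A, B, C, and D are the only categories that can ever appear (an assumption made for illustration), and the names ohe_manual and encoded_new are hypothetical. The encoder's default sparse output is converted with .toarray() so the result is easy to inspect.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'col': ['A', 'B', 'C', 'B']})
X_new = pd.DataFrame({'col': ['A', 'C', 'D']})

# assumption: A, B, C, and D are the only categories that can ever appear
# one list of categories per input column
ohe_manual = OneHotEncoder(categories=[['A', 'B', 'C', 'D']])

# D never appears in the training data, but it still gets its own column
ohe_manual.fit(X[['col']])

# convert the default sparse output to a dense array for display
encoded_new = ohe_manual.transform(X_new[['col']]).toarray()
print(encoded_new)
```

With this approach category D gets its own fourth column instead of being encoded as all zeros, so a downstream model can distinguish it from the other categories. The tradeoff is that any category missing from the manual list still raises an error during transform unless handle_unknown='ignore' is also set.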
