🤖⚡ scikit-learn tip #7 ([video](https://www.youtube.com/watch?v=bA6mYC1a_Eg&list=PL5-da3qGB5ID7YYAqireYEew2mWVvgmj6&index=7))

Q: For a one-hot encoded feature, what can you do if new data contains categories that weren't seen during training?

A: Set handle_unknown='ignore' when creating the OneHotEncoder. New categories are then encoded as all zeros instead of raising an error during "transform".

See example 👇

P.S. If you know all possible categories that might ever appear, you can instead pass them to the categories parameter. handle_unknown='ignore' is useful specifically when you don't know all possible categories.

import pandas as pd
X = pd.DataFrame({'col':['A', 'B', 'C', 'B']})
X_new = pd.DataFrame({'col':['A', 'C', 'D']})

from sklearn.preprocessing import OneHotEncoder
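# sparse_output=False returns a dense array (this parameter was named "sparse" before scikit-learn 1.2)
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')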

X

# three columns represent categories A, B, and C
ohe.fit_transform(X[['col']])

# category D was not learned by OneHotEncoder during the "fit" step
X_new

# category D is encoded as all zeros
ohe.transform(X_new[['col']])
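
For comparison, here is a minimal sketch of the alternative mentioned in the P.S.: listing the categories manually. It assumes A, B, C, and D are the only categories that will ever appear. The names ohe_manual and encoded_new are just illustrative, and .toarray() is used to get a dense array from the default sparse output.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'col': ['A', 'B', 'C', 'B']})
X_new = pd.DataFrame({'col': ['A', 'C', 'D']})

# assumption: A, B, C, and D are all the categories that can ever appear
# (pass one inner list of categories per feature)
ohe_manual = OneHotEncoder(categories=[['A', 'B', 'C', 'D']])
ohe_manual.fit(X[['col']])

# four columns represent categories A, B, C, and D,
# so category D gets its own column instead of all zeros
encoded_new = ohe_manual.transform(X_new[['col']]).toarray()
encoded_new
```

With this approach the default handle_unknown='error' stays in place, so a category outside the list would still raise an error during "transform".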