Q: For a one-hot encoded feature, what can you do if new data contains categories that weren't seen during training?
A: Set handle_unknown='ignore' in OneHotEncoder so that categories not seen during "fit" are encoded as all zeros instead of raising an error.
See example 👇
P.S. If you know all possible categories that might ever appear, you can instead specify the categories manually. handle_unknown='ignore' is useful specifically when you don't know all possible categories.
import pandas as pd
X = pd.DataFrame({'col':['A', 'B', 'C', 'B']})
X_new = pd.DataFrame({'col':['A', 'C', 'D']})
from sklearn.preprocessing import OneHotEncoder
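# in scikit-learn 1.2+, the parameter is named sparse_output (use sparse=False in older versions)
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')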
X
| | col |
|---|---|
| 0 | A |
| 1 | B |
| 2 | C |
| 3 | B |
# three columns represent categories A, B, and C
ohe.fit_transform(X[['col']])
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])
# category D was not learned by OneHotEncoder during the "fit" step
X_new
| | col |
|---|---|
| 0 | A |
| 1 | C |
| 2 | D |
# category D is encoded as all zeros
ohe.transform(X_new[['col']])
array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 0.]])
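The alternative mentioned in the P.S., specifying the categories manually, might look like the sketch below. It assumes, purely for illustration, that A, B, C, and D are the only categories that can ever appear. The names all_categories, ohe_manual, and encoded_new are made up for this example. The encoder is left at its default sparse output and converted with toarray() so the code does not depend on the sparse/sparse_output parameter name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'col': ['A', 'B', 'C', 'B']})
X_new = pd.DataFrame({'col': ['A', 'C', 'D']})

# assumption for illustration: these are all the categories that can ever occur
# (one inner list per feature being encoded)
all_categories = [['A', 'B', 'C', 'D']]

# handle_unknown is left at its default, so a truly unexpected category would raise an error
ohe_manual = OneHotEncoder(categories=all_categories)
ohe_manual.fit(X[['col']])

# four columns now represent categories A, B, C, and D,
# even though D never appeared in the training data
encoded_new = ohe_manual.transform(X_new[['col']]).toarray()
print(encoded_new)
```

With this approach, D gets its own fourth column rather than being encoded as all zeros, so the model input has a dedicated slot for it. The trade-off is that the list has to be complete: a category outside it still triggers an error at transform time unless handle_unknown='ignore' is also set.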
© 2020 Data School. All rights reserved.