Open in Binder

Open in Colab

🤖⚡ scikit-learn tip #1 (video)

Use ColumnTransformer to apply different preprocessing to different columns:

  • select from DataFrame columns by name
  • passthrough or drop unspecified columns

Requires scikit-learn 0.20+

See example 👇

In [1]:
import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=6)
In [2]:
cols = ['Fare', 'Embarked', 'Sex', 'Age']
X = df[cols]
In [3]:
X
Out[3]:
Fare Embarked Sex Age
0 7.2500 S male 22.0
1 71.2833 C female 38.0
2 7.9250 S female 26.0
3 53.1000 S female 35.0
4 8.0500 S male 35.0
5 8.4583 Q male NaN
In [4]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
In [5]:
ohe = OneHotEncoder()
imp = SimpleImputer()
In [6]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),  # apply OneHotEncoder to Embarked and Sex
    (imp, ['Age']),              # apply SimpleImputer to Age
    remainder='passthrough')     # include remaining column (Fare) in the output
In [7]:
# column order: Embarked (3 columns), Sex (2 columns), Age (1 column), Fare (1 column)
ct.fit_transform(X)
Out[7]:
array([[ 0.    ,  0.    ,  1.    ,  0.    ,  1.    , 22.    ,  7.25  ],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    , 38.    , 71.2833],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    , 26.    ,  7.925 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    , 35.    , 53.1   ],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    , 35.    ,  8.05  ],
       [ 0.    ,  1.    ,  0.    ,  0.    ,  1.    , 31.2   ,  8.4583]])

Want more tips? View all tips on GitHub or Sign up to receive 2 tips by email every week 💌

© 2020 Data School. All rights reserved.