🤖⚡ scikit-learn tip #2 (video)

There are SEVEN ways to select columns using ColumnTransformer:

  1. column name
  2. integer position
  3. slice
  4. boolean mask
  5. regex pattern
  6. dtypes to include
  7. dtypes to exclude

See example 👇

In [1]:
import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=6)
In [2]:
cols = ['Fare', 'Embarked', 'Sex', 'Age']
X = df[cols]
In [3]:
X
Out[3]:
      Fare Embarked     Sex   Age
0   7.2500        S    male  22.0
1  71.2833        C  female  38.0
2   7.9250        S  female  26.0
3  53.1000        S  female  35.0
4   8.0500        S    male  35.0
5   8.4583        Q    male   NaN
In [4]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer  # new in 0.20
from sklearn.compose import make_column_selector     # new in 0.22
In [5]:
ohe = OneHotEncoder()
In [6]:
# all SEVEN of these produce the same results
# (each line reassigns ct, so only the last assignment is used below)
ct = make_column_transformer((ohe, ['Embarked', 'Sex']))
ct = make_column_transformer((ohe, [1, 2]))
ct = make_column_transformer((ohe, slice(1, 3)))
ct = make_column_transformer((ohe, [False, True, True, False]))
ct = make_column_transformer((ohe, make_column_selector(pattern='E|S')))
ct = make_column_transformer((ohe, make_column_selector(dtype_include=object)))
ct = make_column_transformer((ohe, make_column_selector(dtype_exclude='number')))
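The last three options pass a selector object rather than an explicit list. As a rough sketch of what happens behind the scenes: the object returned by make_column_selector is a callable, and calling it on a DataFrame should return the matching column names. The X_demo frame below is a hypothetical two-row stand-in for X (same column names, illustrative values), and the string columns are cast to object explicitly on the assumption that some pandas versions may infer a different string dtype.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# hypothetical stand-in for X: same column names, only two rows
X_demo = pd.DataFrame({
    'Fare': [7.25, 71.2833],
    'Embarked': ['S', 'C'],
    'Sex': ['male', 'female'],
    'Age': [22.0, 38.0],
}).astype({'Embarked': object, 'Sex': object})  # force object dtype (assumption: guards against newer string dtypes)

# a selector is a callable: give it a DataFrame, get back column names
by_pattern = make_column_selector(pattern='E|S')(X_demo)           # names containing 'E' or 'S' (case-sensitive)
by_include = make_column_selector(dtype_include=object)(X_demo)    # object-dtype columns
by_exclude = make_column_selector(dtype_exclude='number')(X_demo)  # everything that is not numeric

print(by_pattern)
print(by_include)
print(by_exclude)
```

Each call should print ['Embarked', 'Sex']. Note that the regex is case-sensitive, which is why 'Fare' (lowercase e) is not matched by 'E|S'.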
In [7]:
# one-hot encode Embarked and Sex (and drop all other columns)
ct.fit_transform(X)
Out[7]:
array([[0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.]])
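
Because each assignment to ct above overwrites the previous one, only the last selector is actually run. One way to check the claim that all seven produce the same result is to loop over them, as in the sketch below. It retypes the six rows shown in Out[3] inline so that no download is needed; the names selectors, results, reference, and all_equal are illustrative, and the cast to object is an assumption to keep dtype_include=object working across pandas versions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# the same six rows as Out[3], typed inline
X = pd.DataFrame({
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
    'Embarked': ['S', 'C', 'S', 'S', 'S', 'Q'],
    'Sex': ['male', 'female', 'female', 'female', 'male', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
}).astype({'Embarked': object, 'Sex': object})  # force object dtype (assumption)

# one entry per selection method
selectors = {
    'column name': ['Embarked', 'Sex'],
    'integer position': [1, 2],
    'slice': slice(1, 3),
    'boolean mask': [False, True, True, False],
    'regex pattern': make_column_selector(pattern='E|S'),
    'dtypes to include': make_column_selector(dtype_include=object),
    'dtypes to exclude': make_column_selector(dtype_exclude='number'),
}

results = {}
for name, selector in selectors.items():
    ct = make_column_transformer((OneHotEncoder(), selector))
    out = ct.fit_transform(X)
    # convert to a dense array in case a sparse matrix is returned
    results[name] = out.toarray() if hasattr(out, 'toarray') else np.asarray(out)

# compare every result against the first one
reference = results['column name']
all_equal = all(np.array_equal(reference, r) for r in results.values())
print(all_equal, reference.shape)
```

If the seven selectors really are equivalent on this data, all_equal is True and every result has 6 rows and 5 columns (3 for Embarked, 2 for Sex).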

Want more tips? View all tips on GitHub or sign up to receive 2 tips by email every week 💌

© 2020 Data School. All rights reserved.