🤖⚡ scikit-learn tip #50 (video)¶

Here's a simple pattern that can be adapted to solve many ML problems. It has plenty of shortcomings, but can work surprisingly well as-is!

Check it out 👇

Shortcomings include:

Assumes all columns have proper data types
May include irrelevant or improper features
Does not handle text or date columns well
Does not include feature engineering
Ordinal encoding may be better
Other imputation strategies may be better
Numeric features may not need scaling
A different model may be better
And so on...

In [1]:

import pandas as pd

In [2]:

cols = ['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

In [3]:

df = pd.read_csv('http://bit.ly/kaggletrain')
X = df[cols]
y = df['Survived']

In [4]:

df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
X_new = df_new[cols]

In [5]:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [6]:

# set up preprocessing for numeric columns
imp_median = SimpleImputer(strategy='median', add_indicator=True)
scaler = StandardScaler()

In [7]:

# set up preprocessing for categorical columns
imp_constant = SimpleImputer(strategy='constant')
ohe = OneHotEncoder(handle_unknown='ignore')

In [8]:

# select columns by data type
num_cols = make_column_selector(dtype_include='number')
cat_cols = make_column_selector(dtype_exclude='number')

In [9]:

# do all preprocessing
preprocessor = make_column_transformer(
    (make_pipeline(imp_median, scaler), num_cols),
    (make_pipeline(imp_constant, ohe), cat_cols))

In [10]:

# create a pipeline
pipe = make_pipeline(preprocessor, LogisticRegression())

In [11]:

# cross-validate the pipeline
cross_val_score(pipe, X, y).mean()

Out[11]:

0.8035904839620865

In [12]:

# fit the pipeline and make predictions
pipe.fit(X, y)
pipe.predict(X_new)

Out[12]:

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])

🤖⚡ scikit-learn tip #50 (video)¶

Want more tips? View all tips on GitHub or Sign up to receive 2 tips by email every week 💌¶