🤖⚡ scikit-learn tip #14 (video)

Four options for handling missing values (NaNs):

Drop rows containing NaNs
Drop columns containing NaNs
Fill NaNs with imputed values
Use a model that natively handles NaNs (NEW!)
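
For comparison, the first three options can be sketched with pandas and scikit-learn. This is a minimal sketch on a small hypothetical DataFrame (the values are made up, not taken from the Titanic data used in the example), and mean imputation is just one assumed strategy among several that `SimpleImputer` supports:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical toy data (not the Titanic dataset used in the example)
df = pd.DataFrame({'Age': [22.0, np.nan, 35.0, 58.0],
                   'Fare': [7.25, 71.28, np.nan, 26.55],
                   'Pclass': [3, 1, 3, 2]})

# option 1: drop rows containing NaNs
dropped_rows = df.dropna()

# option 2: drop columns containing NaNs
dropped_cols = df.dropna(axis='columns')

# option 3: fill NaNs with imputed values (here: the column mean)
imputer = SimpleImputer(strategy='mean')
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

Options 1 and 2 throw away data (rows or whole features), and option 3 requires choosing an imputation strategy. The fourth option avoids both.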

See example 👇

In [1]:

import pandas as pd
train = pd.read_csv('http://bit.ly/kaggletrain')
test = pd.read_csv('http://bit.ly/kaggletest', nrows=175)

In [2]:

train = train[['Survived', 'Age', 'Fare', 'Pclass']]
test = test[['Age', 'Fare', 'Pclass']]

In [3]:

# count the number of NaNs in each column
train.isna().sum()

Out[3]:

Survived      0
Age         177
Fare          0
Pclass        0
dtype: int64

In [4]:

test.isna().sum()

Out[4]:

Age       36
Fare       1
Pclass     0
dtype: int64

In [5]:

label = train.pop('Survived')

In [6]:

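# new in 0.22: this estimator (experimental) has native support for NaNs
# note: as of scikit-learn 1.0 the estimator is stable, and the
# enable_hist_gradient_boosting import below is no longer needed
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier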

In [7]:

clf = HistGradientBoostingClassifier()

In [8]:

# no errors, despite NaNs in train and test!
clf.fit(train, label)
clf.predict(test)

Out[8]:

array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

Want more tips? View all tips on GitHub or sign up to receive 2 tips by email every week 💌