This notebook explains how to use leave-one-out encoding from category_encoders. Leave-one-out encoding is target encoding in which the mean target value for a category is calculated without the target value of the current row.
This notebook uses data for flights departing NYC in 2013.
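To make the leave-one-out idea concrete, here is a minimal hand-rolled sketch in pandas. The toy carriers and delay values are made up for illustration, and this is not how category_encoders is implemented internally; it only shows the arithmetic of excluding the current row from its category mean.

```python
import pandas as pd

# Hypothetical toy data: carriers and arrival delays in minutes.
toy = pd.DataFrame({
    "carrier": ["AA", "AA", "AA", "UA", "UA"],
    "arr_delay": [10.0, 20.0, 30.0, 0.0, 8.0],
})

grouped = toy.groupby("carrier")["arr_delay"]
group_sum = grouped.transform("sum")      # per-row sum of the row's category
group_count = grouped.transform("count")  # per-row size of the row's category

# Plain target encoding: every row of a category gets the same category mean.
toy["target_enc"] = group_sum / group_count

# Leave-one-out: drop the current row's target from both the sum and the count.
# Every category here has at least two rows, so the denominator is never zero.
toy["loo_enc"] = (group_sum - toy["arr_delay"]) / (group_count - 1)

print(toy)
```

With plain target encoding all three AA rows share one value, while the leave-one-out column differs per row because each row's own delay is left out of its average.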
This tutorial uses:
import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import category_encoders as ce
The data is from Rdatasets, imported using the Python package statsmodels.
df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   year            336776 non-null  int64
 1   month           336776 non-null  int64
 2   day             336776 non-null  int64
 3   dep_time        328521 non-null  float64
 4   sched_dep_time  336776 non-null  int64
 5   dep_delay       328521 non-null  float64
 6   arr_time        328063 non-null  float64
 7   sched_arr_time  336776 non-null  int64
 8   arr_delay       327346 non-null  float64
 9   carrier         336776 non-null  object
 10  flight          336776 non-null  int64
 11  tailnum         334264 non-null  object
 12  origin          336776 non-null  object
 13  dest            336776 non-null  object
 14  air_time        327346 non-null  float64
 15  distance        336776 non-null  int64
 16  hour            336776 non-null  int64
 17  minute          336776 non-null  int64
 18  time_hour       336776 non-null  object
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB
df.isnull().sum()
year                 0
month                0
day                  0
dep_time          8255
sched_dep_time       0
dep_delay         8255
arr_time          8713
sched_arr_time       0
arr_delay         9430
carrier              0
flight               0
tailnum           2512
origin               0
dest                 0
air_time          9430
distance             0
hour                 0
minute               0
time_hour            0
dtype: int64
This model will predict arrival delay. The null values are caused by flights that were cancelled or diverted, so these rows can be excluded from this analysis.
df.dropna(inplace=True)
df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['flight'] = df.flight.astype(str)
df.rename(columns={'hour': 'dep_hour',
'minute': 'dep_minute'}, inplace=True)
target = 'arr_delay'
y = df[target]
X = df.drop(columns=[target, 'time_hour', 'year', 'dep_time', 'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1066)
X_train.dtypes
month                 int64
day                   int64
carrier              object
flight               object
tailnum              object
origin               object
dest                 object
air_time            float64
distance              int64
dep_hour              int64
dep_minute            int64
arr_hour              int64
arr_minute            int64
sched_arr_hour        int64
sched_arr_minute      int64
sched_dep_hour        int64
sched_dep_minute      int64
dtype: object
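The hour and minute columns built above come from clock times stored as HHMM numbers (for example, 517 means 05:17). As a hedged alternative to the row-wise apply calls, the same split can be sketched with vectorized floor division and modulo. The `times` series below is hypothetical sample data, not taken from the flights table.

```python
import pandas as pd

# Hypothetical HHMM-encoded clock times; 517.0 stands for 05:17.
times = pd.Series([517.0, 1004.0, 2359.0, 45.0])

# Floor division by 100 keeps the hour part, modulo 100 keeps the minute part.
hours = (times // 100).astype(int)
minutes = (times % 100).astype(int)

print(pd.DataFrame({"time": times, "hour": hours, "minute": minutes}))
```

This assumes the times are non-negative and contain no missing values, which holds here only because rows with nulls were dropped first.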
We use a leave-one-out encoder because it creates a single column for each categorical variable, instead of one column per level as one-hot encoding does. This makes it easier to interpret the impact of each categorical variable with feature impact. Once the encoder has been fitted, the feature set in X_train_loo can be used to train a model with any modeling algorithm.
encoder = ce.LeaveOneOutEncoder(return_df=True)
X_train_loo = encoder.fit_transform(X_train, y_train)
X_train_loo.dtypes
month                 int64
day                   int64
carrier             float64
flight              float64
tailnum             float64
origin              float64
dest                float64
air_time            float64
distance              int64
dep_hour              int64
dep_minute            int64
arr_hour              int64
arr_minute            int64
sched_arr_hour        int64
sched_arr_minute      int64
sched_dep_hour        int64
sched_dep_minute      int64
dtype: object
X_train_loo.describe()
statistic | month | day | carrier | flight | tailnum | origin | dest | air_time | distance | dep_hour | dep_minute | arr_hour | arr_minute | sched_arr_hour | sched_arr_minute | sched_dep_hour | sched_dep_minute |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 261876.000000 | 261876.000000 | 261876.000000 | 261876.000000 | 261876.000000 | 261876.000000 | 261876.000000 | 261876.000000 | 261876.000000 | 261876.000000 | 261876.000000 | 261876.000000 | 261876.000000 | 261876.000000 | 261876.000000 | 261876.000000 | 261876.000000 |
mean | 6.568246 | 15.727864 | 6.882754 | 6.888431 | 6.876200 | 6.882754 | 6.882754 | 150.594774 | 1047.624311 | 13.137641 | 26.232320 | 14.722663 | 29.474499 | 15.032809 | 29.029907 | 13.137641 | 26.232320 |
std | 3.414977 | 8.782851 | 5.454258 | 11.194779 | 8.533383 | 1.626746 | 4.798035 | 93.567094 | 735.070110 | 4.659342 | 19.294383 | 5.325232 | 17.357855 | 4.971609 | 17.404733 | 4.659342 | 19.294383 |
min | 1.000000 | 1.000000 | -11.316547 | -59.000000 | -61.000000 | 5.546173 | -16.181818 | 20.000000 | 80.000000 | 5.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000 | 0.000000 |
25% | 4.000000 | 8.000000 | 1.982494 | -0.755000 | 0.923416 | 5.560977 | 2.691789 | 82.000000 | 509.000000 | 9.000000 | 8.000000 | 11.000000 | 14.000000 | 11.000000 | 14.000000 | 9.000000 | 8.000000 |
50% | 7.000000 | 16.000000 | 7.503971 | 5.286150 | 6.482353 | 5.787073 | 7.325933 | 129.000000 | 888.000000 | 13.000000 | 29.000000 | 15.000000 | 29.000000 | 15.000000 | 30.000000 | 13.000000 | 29.000000 |
75% | 10.000000 | 23.000000 | 9.616774 | 13.183920 | 11.843750 | 9.057416 | 9.832984 | 191.000000 | 1389.000000 | 17.000000 | 44.000000 | 19.000000 | 45.000000 | 19.000000 | 44.000000 | 17.000000 | 44.000000 |
max | 12.000000 | 31.000000 | 20.018584 | 183.000000 | 214.200000 | 9.058440 | 45.075949 | 695.000000 | 4983.000000 | 23.000000 | 59.000000 | 24.000000 | 59.000000 | 23.000000 | 59.000000 | 23.000000 | 59.000000 |
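The column-count argument for leave-one-out encoding can be illustrated on toy data. In this sketch `pd.get_dummies` stands in for one-hot encoding and the leave-one-out column is computed by hand; the destinations and delays are invented, and neither step is meant to reproduce what category_encoders does internally.

```python
import pandas as pd

# Hypothetical toy data with four destination levels.
toy = pd.DataFrame({
    "dest": ["ATL", "BOS", "ORD", "ATL", "BOS", "ORD", "LAX", "LAX"],
    "arr_delay": [5.0, -3.0, 12.0, 9.0, 1.0, 20.0, -8.0, 4.0],
})

# One-hot encoding: one indicator column per level of the variable.
one_hot = pd.get_dummies(toy[["dest"]], columns=["dest"])

# Leave-one-out encoding: one numeric column, whatever the number of levels.
sums = toy.groupby("dest")["arr_delay"].transform("sum")
counts = toy.groupby("dest")["arr_delay"].transform("count")
loo = ((sums - toy["arr_delay"]) / (counts - 1)).to_frame("dest")

print("one-hot columns:", one_hot.shape[1])
print("leave-one-out columns:", loo.shape[1])
```

For a high-cardinality variable such as tailnum or flight, the one-hot frame would grow by one column per distinct value, while the encoded frame keeps the original column count.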
Encode the test set. The result can now be passed into the predict method (or, for a classifier, predict_proba) of a trained model.
X_test_loo = encoder.transform(X_test)
X_test_loo.describe()
statistic | month | day | carrier | flight | tailnum | origin | dest | air_time | distance | dep_hour | dep_minute | arr_hour | arr_minute | sched_arr_hour | sched_arr_minute | sched_dep_hour | sched_dep_minute |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 65470.000000 | 65470.000000 | 65470.000000 | 65470.000000 | 65470.000000 | 65470.000000 | 65470.000000 | 65470.000000 | 65470.000000 | 65470.000000 | 65470.000000 | 65470.000000 | 65470.000000 | 65470.000000 | 65470.000000 | 65470.000000 | 65470.000000 |
mean | 6.551031 | 15.792668 | 6.862181 | 6.907732 | 6.880032 | 6.877746 | 6.869284 | 151.053200 | 1051.359279 | 13.154483 | 26.241301 | 14.731419 | 29.436230 | 15.055934 | 29.105300 | 13.154483 | 26.241301 |
std | 3.407300 | 8.755319 | 5.457000 | 11.119727 | 8.419143 | 1.625658 | 4.849410 | 94.171406 | 739.250702 | 4.672941 | 19.302202 | 5.340305 | 17.353617 | 4.974869 | 17.496692 | 4.672941 | 19.302202 |
min | 1.000000 | 1.000000 | -9.795775 | -42.500000 | -35.600000 | 5.560707 | -14.416667 | 21.000000 | 80.000000 | 5.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000 | 0.000000 |
25% | 4.000000 | 8.000000 | 1.985603 | -0.696517 | 0.895833 | 5.560707 | 2.691649 | 82.000000 | 502.000000 | 9.000000 | 8.000000 | 11.000000 | 14.000000 | 11.000000 | 14.000000 | 9.000000 | 8.000000 |
50% | 7.000000 | 16.000000 | 3.465982 | 5.288961 | 6.509804 | 5.786915 | 7.323326 | 129.000000 | 888.000000 | 13.000000 | 29.000000 | 15.000000 | 29.500000 | 15.000000 | 30.000000 | 13.000000 | 29.000000 |
75% | 10.000000 | 23.000000 | 9.615767 | 13.160326 | 11.801242 | 9.057426 | 9.829630 | 192.000000 | 1400.000000 | 17.000000 | 44.000000 | 19.000000 | 45.000000 | 19.000000 | 45.000000 | 17.000000 | 44.000000 |
max | 12.000000 | 31.000000 | 19.993676 | 106.000000 | 139.000000 | 9.057426 | 44.162500 | 686.000000 | 4983.000000 | 23.000000 | 59.000000 | 24.000000 | 59.000000 | 23.000000 | 59.000000 | 23.000000 | 59.000000 |
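Test rows have no target passed to transform, so nothing can be left out. The sketch below shows the behavior one would expect under that constraint: each test row receives the full training mean of its category, and a category never seen in training falls back to the overall training mean. This is an assumption about the encoder's default settings, written by hand in pandas with invented data, rather than the library's own code.

```python
import pandas as pd

# Hypothetical training data and test rows; EWR does not appear in training.
train = pd.DataFrame({
    "origin": ["JFK", "JFK", "LGA", "LGA", "LGA"],
    "arr_delay": [10.0, 30.0, 0.0, 6.0, 12.0],
})
test = pd.DataFrame({"origin": ["JFK", "LGA", "EWR"]})

# Statistics are learned from the training set only.
category_mean = train.groupby("origin")["arr_delay"].mean()
global_mean = train["arr_delay"].mean()

# Known categories map to their training mean; unknown ones to the global mean.
test_encoded = test["origin"].map(category_mean).fillna(global_mean)

print(test_encoded)
```

Because only training statistics are used, encoding the test set this way does not leak test targets into the features.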