Leave one out encoding¶

This notebook explains how to use leave-one-out encoding from category_encoders. Leave-one-out encoding is a variant of target encoding: each category is replaced by the mean of the target for that category, calculated without the target value of the current row.

This notebook uses data on flights that departed NYC airports in 2013.
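
To make the definition above concrete, here is a minimal sketch of leave-one-out encoding computed by hand with pandas. The toy DataFrame and its values are made up for illustration, and the arithmetic shows only the basic idea; it is not a copy of the category_encoders implementation.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and a numeric target.
toy = pd.DataFrame({
    "carrier": ["AA", "AA", "AA", "UA", "UA"],
    "arr_delay": [10.0, 20.0, 30.0, 0.0, 8.0],
})

grouped = toy.groupby("carrier")["arr_delay"]
group_sum = grouped.transform("sum")      # per-row sum of the row's category
group_count = grouped.transform("count")  # per-row size of the row's category

# Plain target encoding: category mean, including the current row.
toy["target_enc"] = group_sum / group_count

# Leave-one-out encoding: remove the current row's target before averaging.
toy["loo_enc"] = (group_sum - toy["arr_delay"]) / (group_count - 1)

print(toy)
```

For the first row, the other two AA rows have delays 20 and 30, so its leave-one-out value is 25, while its plain target encoding is 20. A category that occurs only once would divide by zero in this sketch and needs a fallback, such as the overall mean of the target.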

Packages¶

This tutorial uses:

In [1]:

import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import category_encoders as ce

Reading the data¶

The data comes from the Rdatasets collection and is loaded with the Python package statsmodels.

In [2]:

df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   year            336776 non-null  int64  
 1   month           336776 non-null  int64  
 2   day             336776 non-null  int64  
 3   dep_time        328521 non-null  float64
 4   sched_dep_time  336776 non-null  int64  
 5   dep_delay       328521 non-null  float64
 6   arr_time        328063 non-null  float64
 7   sched_arr_time  336776 non-null  int64  
 8   arr_delay       327346 non-null  float64
 9   carrier         336776 non-null  object 
 10  flight          336776 non-null  int64  
 11  tailnum         334264 non-null  object 
 12  origin          336776 non-null  object 
 13  dest            336776 non-null  object 
 14  air_time        327346 non-null  float64
 15  distance        336776 non-null  int64  
 16  hour            336776 non-null  int64  
 17  minute          336776 non-null  int64  
 18  time_hour       336776 non-null  object 
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB

Feature Engineering¶

Handle null values¶

In [3]:

df.isnull().sum()

Out[3]:

year                 0
month                0
day                  0
dep_time          8255
sched_dep_time       0
dep_delay         8255
arr_time          8713
sched_arr_time       0
arr_delay         9430
carrier              0
flight               0
tailnum           2512
origin               0
dest                 0
air_time          9430
distance             0
hour                 0
minute               0
time_hour            0
dtype: int64

This model will predict arrival delay. The null values come from flights that were cancelled or diverted, so those rows can be excluded from this analysis.

In [4]:

df.dropna(inplace=True)

Convert the times from floats or ints to hours and minutes¶

In [5]:

# Times are stored as HHMM numbers (e.g. 1545 means 15:45):
# dividing by 100 and flooring gives the hour, the remainder gives the minute.
df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
# Flight numbers are identifiers, not quantities, so store them as strings
# so that the encoder treats them as categorical.
df['flight'] = df.flight.astype(str)
df.rename(columns={'hour': 'dep_hour',
                   'minute': 'dep_minute'}, inplace=True)
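
The same hour and minute split can also be written without apply. The sketch below uses a made-up Series of HHMM values (not the flights data) and pandas floor division and modulo, which should give the same result as the lambda version for non-negative times.

```python
import pandas as pd

# Hypothetical HHMM values, stored as floats like dep_time and arr_time.
times = pd.Series([517.0, 1545.0, 2400.0, 5.0])

hours = (times // 100).astype("int64")   # 1545 -> 15
minutes = (times % 100).astype("int64")  # 1545 -> 45

print(pd.DataFrame({"time": times, "hour": hours, "minute": minutes}))
```

Note that a value of 2400 yields hour 24, which matches the maximum of 24 that arr_hour takes in this dataset.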

Prepare data for modeling¶

Set up train-test split¶

In [6]:

target = 'arr_delay'
y = df[target]
X = df.drop(columns=[target, 'time_hour', 'year', 'dep_time', 'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1066)
X_train.dtypes

Out[6]:

month                 int64
day                   int64
carrier              object
flight               object
tailnum              object
origin               object
dest                 object
air_time            float64
distance              int64
dep_hour              int64
dep_minute            int64
arr_hour              int64
arr_minute            int64
sched_arr_hour        int64
sched_arr_minute      int64
sched_dep_hour        int64
sched_dep_minute      int64
dtype: object

Encode categorical variables¶

We use a leave-one-out encoder because it creates a single column for each categorical variable, whereas one-hot encoding creates a column for each level of the variable. This makes it easier to interpret the impact of categorical variables with feature impact. After encoding, the feature set in X_train_loo can be used to train a model with any modeling algorithm.

In [7]:

encoder = ce.LeaveOneOutEncoder(return_df=True)
X_train_loo = encoder.fit_transform(X_train, y_train)
X_train_loo.dtypes

Out[7]:

month                 int64
day                   int64
carrier             float64
flight              float64
tailnum             float64
origin              float64
dest                float64
air_time            float64
distance              int64
dep_hour              int64
dep_minute            int64
arr_hour              int64
arr_minute            int64
sched_arr_hour        int64
sched_arr_minute      int64
sched_dep_hour        int64
sched_dep_minute      int64
dtype: object
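
As a rough illustration of the column-count argument, the sketch below one-hot encodes a made-up two-column DataFrame with pandas.get_dummies. The data is hypothetical and only meant to show how the width grows with the number of levels.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns.
toy = pd.DataFrame({
    "origin": ["EWR", "JFK", "LGA", "JFK"],
    "dest": ["ATL", "BOS", "ORD", "SFO"],
})

# One-hot encoding creates one column per level: 3 origins + 4 destinations.
one_hot = pd.get_dummies(toy)

print(toy.shape[1], "columns before,", one_hot.shape[1], "columns after one-hot encoding")
```

A leave-one-out encoding of the same frame would keep two columns. On the flights data, X_train.nunique() shows how many one-hot columns each categorical variable would produce.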

In [8]:

X_train_loo.describe()

Out[8]:

	month	day	carrier	flight	tailnum	origin	dest	air_time	distance	dep_hour	dep_minute	arr_hour	arr_minute	sched_arr_hour	sched_arr_minute	sched_dep_hour	sched_dep_minute
count	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000
mean	6.568246	15.727864	6.882754	6.888431	6.876200	6.882754	6.882754	150.594774	1047.624311	13.137641	26.232320	14.722663	29.474499	15.032809	29.029907	13.137641	26.232320
std	3.414977	8.782851	5.454258	11.194779	8.533383	1.626746	4.798035	93.567094	735.070110	4.659342	19.294383	5.325232	17.357855	4.971609	17.404733	4.659342	19.294383
min	1.000000	1.000000	-11.316547	-59.000000	-61.000000	5.546173	-16.181818	20.000000	80.000000	5.000000	0.000000	0.000000	0.000000	0.000000	0.000000	5.000000	0.000000
25%	4.000000	8.000000	1.982494	-0.755000	0.923416	5.560977	2.691789	82.000000	509.000000	9.000000	8.000000	11.000000	14.000000	11.000000	14.000000	9.000000	8.000000
50%	7.000000	16.000000	7.503971	5.286150	6.482353	5.787073	7.325933	129.000000	888.000000	13.000000	29.000000	15.000000	29.000000	15.000000	30.000000	13.000000	29.000000
75%	10.000000	23.000000	9.616774	13.183920	11.843750	9.057416	9.832984	191.000000	1389.000000	17.000000	44.000000	19.000000	45.000000	19.000000	44.000000	17.000000	44.000000
max	12.000000	31.000000	20.018584	183.000000	214.200000	9.058440	45.075949	695.000000	4983.000000	23.000000	59.000000	24.000000	59.000000	23.000000	59.000000	23.000000	59.000000

Encode the test set with the fitted encoder. The result can then be passed to the predict or predict_proba method of a trained model.

In [9]:

X_test_loo = encoder.transform(X_test)
X_test_loo.describe()

Out[9]:

	month	day	carrier	flight	tailnum	origin	dest	air_time	distance	dep_hour	dep_minute	arr_hour	arr_minute	sched_arr_hour	sched_arr_minute	sched_dep_hour	sched_dep_minute
count	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000
mean	6.551031	15.792668	6.862181	6.907732	6.880032	6.877746	6.869284	151.053200	1051.359279	13.154483	26.241301	14.731419	29.436230	15.055934	29.105300	13.154483	26.241301
std	3.407300	8.755319	5.457000	11.119727	8.419143	1.625658	4.849410	94.171406	739.250702	4.672941	19.302202	5.340305	17.353617	4.974869	17.496692	4.672941	19.302202
min	1.000000	1.000000	-9.795775	-42.500000	-35.600000	5.560707	-14.416667	21.000000	80.000000	5.000000	0.000000	0.000000	0.000000	0.000000	0.000000	5.000000	0.000000
25%	4.000000	8.000000	1.985603	-0.696517	0.895833	5.560707	2.691649	82.000000	502.000000	9.000000	8.000000	11.000000	14.000000	11.000000	14.000000	9.000000	8.000000
50%	7.000000	16.000000	3.465982	5.288961	6.509804	5.786915	7.323326	129.000000	888.000000	13.000000	29.000000	15.000000	29.500000	15.000000	30.000000	13.000000	29.000000
75%	10.000000	23.000000	9.615767	13.160326	11.801242	9.057426	9.829630	192.000000	1400.000000	17.000000	44.000000	19.000000	45.000000	19.000000	45.000000	17.000000	44.000000
max	12.000000	31.000000	19.993676	106.000000	139.000000	9.057426	44.162500	686.000000	4983.000000	23.000000	59.000000	24.000000	59.000000	23.000000	59.000000	23.000000	59.000000
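
When transform is called without a target, no row has a target value to leave out. The sketch below shows one way this step can be reproduced by hand, under the assumption that each test category is mapped to its full training mean and unseen categories fall back to the overall training mean. This mirrors the encoder's default behaviour as far as I understand it, and the toy data is made up.

```python
import pandas as pd

# Hypothetical training and test data.
train = pd.DataFrame({
    "dest": ["ATL", "ATL", "BOS", "BOS", "BOS"],
    "arr_delay": [4.0, 8.0, -3.0, 0.0, 9.0],
})
test = pd.DataFrame({"dest": ["ATL", "BOS", "SFO"]})  # SFO is unseen in train

# Statistics are learned from the training set only.
category_means = train.groupby("dest")["arr_delay"].mean()
global_mean = train["arr_delay"].mean()

# Assumed fallback: categories missing from training get the global mean.
test_encoded = test["dest"].map(category_means).fillna(global_mean)

print(test_encoded)
```

This is consistent with the output above, where the encoded origin column in the test set takes only a few distinct values.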