This notebook explains how to use feature importance from catboost
to perform backward stepwise feature selection. The feature importance used is the SHAP importance from a catboost model.
This notebook will prune the features to model arrival delay for flights in and out of NYC in 2013.
This tutorial uses:
import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import shap
from catboost import CatBoostRegressor, Pool
The data is from rdatasets
imported using the Python package statsmodels
.
df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 336776 entries, 0 to 336775 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 year 336776 non-null int64 1 month 336776 non-null int64 2 day 336776 non-null int64 3 dep_time 328521 non-null float64 4 sched_dep_time 336776 non-null int64 5 dep_delay 328521 non-null float64 6 arr_time 328063 non-null float64 7 sched_arr_time 336776 non-null int64 8 arr_delay 327346 non-null float64 9 carrier 336776 non-null object 10 flight 336776 non-null int64 11 tailnum 334264 non-null object 12 origin 336776 non-null object 13 dest 336776 non-null object 14 air_time 327346 non-null float64 15 distance 336776 non-null int64 16 hour 336776 non-null int64 17 minute 336776 non-null int64 18 time_hour 336776 non-null object dtypes: float64(5), int64(9), object(5) memory usage: 48.8+ MB
As this model will predict arrival delay, the Null
values are caused by flights did were cancelled or diverted. These can be excluded from this analysis.
df.dropna(inplace=True)
df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
df.rename(columns={'hour': 'dep_hour',
'minute': 'dep_minute'}, inplace=True)
Function to build the model
def build_model(df, target):
"""
Given the dataframe and target, build and return the model
"""
# Set up target
y = df[target]
# Set up train-test split
X = df.drop(columns=[target])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1066)
# Identify the categorical features
categorical_features = X_train.select_dtypes(exclude=[np.number])
# Create training and test pools for catboost
train_pool = Pool(X_train, y_train, categorical_features)
test_pool = Pool(X_test, y_test, categorical_features)
# Fit the model and calculate RMSE
model = CatBoostRegressor(iterations=500, max_depth=5, learning_rate=0.05, random_seed=1066, logging_level='Silent')
model.fit(X_train, y_train, eval_set=test_pool, cat_features=categorical_features, use_best_model=True, early_stopping_rounds=10)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
return rmse, model, X_test
Function to return feature to be dropped
def get_dropped_feature(model, X_test):
explainer = shap.Explainer(model)
shap_values = explainer(X_test)
feature_importance = shap_values.abs.mean(0).values
importance_df = pd.DataFrame({'features': X_test.columns,
'importance': feature_importance})
importance_df.sort_values(by='importance', ascending=False, inplace=True)
return importance_df['features'].iloc[-1]
Function that incremental removes the feature with the lowest feature importance as calculated by catboost
until the RMSE stops decreasing.
def backward_selection(df, target, max_features=None):
"""
This function uses the SHAP importance from a catboost model
to incrementally remove features from the training set until the RMSE no longer improves.
This function returns the dataframe with the features that give the best RMSE.
Return at most max_features.
"""
# get baseline RMSE
select_df = df.copy()
total_features = df.shape[1]
rmse, model, X_test = build_model(select_df, target)
print(f"{rmse} with {select_df.shape[1]}")
last_rmse = rmse
# Drop least important feature and recalculate model peformance
if max_features is None:
max_features = total_features-1
for num_features in range(total_features-1, 1, -1):
# Trim features
dropped_feature = get_dropped_feature(model, X_test)
tmp_df = select_df.drop(columns=[dropped_feature])
# Rerun modeling
rmse, model, X_test = build_model(tmp_df, target)
print(f"{rmse} with {tmp_df.shape[1]}")
if (num_features < max_features) and (rmse > last_rmse):
# RMSE increased, return last dataframe
return select_df
else:
# RMSE improved, continue dropping features
last_rmse = rmse
select_df = tmp_df
return select_df
Call backward_selection
on the modeling dataframe. reduced_df will contain the selected features and will be our reduced modeling dataset.
target = 'arr_delay'
reduced_df = backward_selection(df, target, max_features=20)
reduced_df.shape[1]
11.309300297508369 with 25 11.363724303677218 with 24 11.398055158351227 with 23 11.347547316020568 with 22 11.311533268825302 with 21 11.343828983988557 with 20 11.282547263583197 with 19 11.252227369040204 with 18 11.315881600344504 with 17
18