Notebook

Backward Stepwise Feature Selection with Catboost¶

This notebook explains how to use feature importance from catboost to perform backward stepwise feature selection. The feature importance used is the SHAP importance from a catboost model.

This notebook will prune the features to model arrival delay for flights in and out of NYC in 2013.

Packages¶

This tutorial uses:

In [1]:

import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import shap

from catboost import CatBoostRegressor, Pool

Reading the data¶

The data is from rdatasets imported using the Python package statsmodels.

In [2]:

df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   year            336776 non-null  int64  
 1   month           336776 non-null  int64  
 2   day             336776 non-null  int64  
 3   dep_time        328521 non-null  float64
 4   sched_dep_time  336776 non-null  int64  
 5   dep_delay       328521 non-null  float64
 6   arr_time        328063 non-null  float64
 7   sched_arr_time  336776 non-null  int64  
 8   arr_delay       327346 non-null  float64
 9   carrier         336776 non-null  object 
 10  flight          336776 non-null  int64  
 11  tailnum         334264 non-null  object 
 12  origin          336776 non-null  object 
 13  dest            336776 non-null  object 
 14  air_time        327346 non-null  float64
 15  distance        336776 non-null  int64  
 16  hour            336776 non-null  int64  
 17  minute          336776 non-null  int64  
 18  time_hour       336776 non-null  object 
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB

Feature Engineering¶

Handle null values¶

As this model will predict arrival delay, the Null values are caused by flights did were cancelled or diverted. These can be excluded from this analysis.

In [3]:

df.dropna(inplace=True)

Convert the times from floats or ints to hour and minutes¶

In [4]:

df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
df.rename(columns={'hour': 'dep_hour',
                   'minute': 'dep_minute'}, inplace=True)

Feature Selection¶

Define function¶

Function to build the model

In [5]:

def build_model(df, target):
    """
    Given the dataframe and target, build and return the model
    """
    # Set up target
    y = df[target]
    # Set up train-test split
    X = df.drop(columns=[target])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1066)
    # Identify the categorical features
    categorical_features = X_train.select_dtypes(exclude=[np.number])
    # Create training and test pools for catboost
    train_pool = Pool(X_train, y_train, categorical_features)
    test_pool = Pool(X_test, y_test, categorical_features)
    # Fit the model and calculate RMSE
    model = CatBoostRegressor(iterations=500, max_depth=5, learning_rate=0.05, random_seed=1066, logging_level='Silent')
    model.fit(X_train, y_train, eval_set=test_pool, cat_features=categorical_features, use_best_model=True, early_stopping_rounds=10)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    return rmse, model, X_test

Function to return feature to be dropped

In [6]:

def get_dropped_feature(model, X_test):
    explainer = shap.Explainer(model)
    shap_values = explainer(X_test)
    feature_importance = shap_values.abs.mean(0).values
    importance_df = pd.DataFrame({'features': X_test.columns,
                                  'importance': feature_importance})
    importance_df.sort_values(by='importance', ascending=False, inplace=True)
    return importance_df['features'].iloc[-1]

Function that incremental removes the feature with the lowest feature importance as calculated by catboost until the RMSE stops decreasing.

In [7]:

def backward_selection(df, target, max_features=None):
    """
    This function uses the SHAP importance from a catboost model
    to incrementally remove features from the training set until the RMSE no longer improves.
    This function returns the dataframe with the features that give the best RMSE.
    Return at most max_features.
    """
    # get baseline RMSE
    select_df = df.copy()
    total_features = df.shape[1]
    rmse, model, X_test = build_model(select_df, target)
    print(f"{rmse} with {select_df.shape[1]}")
    last_rmse = rmse
    
    # Drop least important feature and recalculate model peformance
    if max_features is None:
        max_features = total_features-1
        
    for num_features in range(total_features-1, 1, -1):
        # Trim features
        dropped_feature = get_dropped_feature(model, X_test)
        tmp_df = select_df.drop(columns=[dropped_feature])

        # Rerun modeling
        rmse, model, X_test = build_model(tmp_df, target)
        print(f"{rmse} with {tmp_df.shape[1]}")
        if (num_features < max_features) and (rmse > last_rmse):
            # RMSE increased, return last dataframe
            return select_df
        else:
            # RMSE improved, continue dropping features
            last_rmse = rmse
            select_df = tmp_df
    return select_df

Run stepwise feature selection¶

Call backward_selection on the modeling dataframe. reduced_df will contain the selected features and will be our reduced modeling dataset.

In [8]:

target = 'arr_delay'
reduced_df = backward_selection(df, target, max_features=20)
reduced_df.shape[1]

11.309300297508369 with 25
11.363724303677218 with 24
11.398055158351227 with 23
11.347547316020568 with 22
11.311533268825302 with 21
11.343828983988557 with 20
11.282547263583197 with 19
11.252227369040204 with 18
11.315881600344504 with 17

Out[8]: