This notebook explains how to calculate MAE from scikit-learn on a regression model from catboost.
This notebook builds and evaluates a model to predict arrival delay for flights departing from NYC airports in 2013.
This tutorial uses:
import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from catboost import CatBoostRegressor, Pool
The data is from rdatasets, imported using the Python package statsmodels.
df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   year            336776 non-null  int64
 1   month           336776 non-null  int64
 2   day             336776 non-null  int64
 3   dep_time        328521 non-null  float64
 4   sched_dep_time  336776 non-null  int64
 5   dep_delay       328521 non-null  float64
 6   arr_time        328063 non-null  float64
 7   sched_arr_time  336776 non-null  int64
 8   arr_delay       327346 non-null  float64
 9   carrier         336776 non-null  object
 10  flight          336776 non-null  int64
 11  tailnum         334264 non-null  object
 12  origin          336776 non-null  object
 13  dest            336776 non-null  object
 14  air_time        327346 non-null  float64
 15  distance        336776 non-null  int64
 16  hour            336776 non-null  int64
 17  minute          336776 non-null  int64
 18  time_hour       336776 non-null  object
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB
df.isnull().sum()
year                 0
month                0
day                  0
dep_time          8255
sched_dep_time       0
dep_delay         8255
arr_time          8713
sched_arr_time       0
arr_delay         9430
carrier              0
flight               0
tailnum           2512
origin               0
dest                 0
air_time          9430
distance             0
hour                 0
minute               0
time_hour            0
dtype: int64
The null values come from flights that were cancelled or diverted. Because this model predicts arrival delay, these rows can be excluded from the analysis.
df.dropna(inplace=True)
df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
df.rename(columns={'hour': 'dep_hour',
'minute': 'dep_minute'}, inplace=True)
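The time columns store clock times as HHMM numbers (for example, 830 means 08:30), which is why the cells above divide by 100 to recover the hour and take the remainder as the minute. A minimal sketch of that decomposition in plain Python, using a hypothetical helper named split_hhmm that is not part of the notebook:

```python
def split_hhmm(value):
    """Split an HHMM-encoded clock time (e.g. 830 or 830.0) into (hour, minute).

    Hypothetical helper for illustration; the notebook itself uses
    np.floor and subtraction inside df.apply.
    """
    # int() drops the trailing .0 that float64 columns such as arr_time carry.
    hours, minutes = divmod(int(value), 100)
    return hours, minutes


print(split_hhmm(830.0))  # (8, 30)
print(split_hhmm(2359))   # (23, 59)
print(split_hhmm(5))      # (0, 5)
```

On a pandas Series the same split can be written without apply as `series // 100` for the hour and `series % 100` for the minute, which is usually faster on a frame of this size.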
target = 'arr_delay'
y = df[target]
X = df.drop(columns=[target, 'flight', 'tailnum', 'time_hour', 'year', 'dep_time', 'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1066)
Build the regression model
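# Pass the categorical column names (a list), not the sub-DataFrame, as cat_features
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns.tolist()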
train_pool = Pool(X_train, y_train, categorical_features)
test_pool = Pool(X_test, y_test, categorical_features)
model = CatBoostRegressor(iterations=500, max_depth=5, learning_rate=0.05, random_seed=1066, logging_level='Silent')
model.fit(X_train, y_train, eval_set=test_pool, cat_features=categorical_features, use_best_model=True, early_stopping_rounds=10)
<catboost.core.CatBoostRegressor at 0x13ccf5a50>
Using mean_absolute_error from scikit-learn, calculate the MAE.
mae = mean_absolute_error(y_test, model.predict(X_test))
mae
6.5178775173583325
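MAE is the average of the absolute differences between the actual and predicted values, so the result above is expressed in the same unit as arr_delay (minutes). To make the calculation concrete, here is a minimal sketch that computes it by hand in plain Python. The function name mean_absolute_error_manual and the sample delays are made up for illustration and are not taken from the flights data.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between actual and predicted values.

    Hypothetical stand-in for sklearn.metrics.mean_absolute_error,
    written out to show the arithmetic.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one observation is required")
    absolute_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute_errors) / len(absolute_errors)


# Made-up arrival delays in minutes (negative means an early arrival).
actual_delays = [12.0, -5.0, 30.0, 0.0]
predicted_delays = [10.0, -1.0, 22.0, 3.0]

# Absolute errors are 2, 4, 8 and 3, so the mean is 17 / 4.
mae_manual = mean_absolute_error_manual(actual_delays, predicted_delays)
print(mae_manual)  # 4.25
```

With default arguments, mean_absolute_error(y_test, model.predict(X_test)) performs the same averaging over the whole test set.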