Copyright 2020 Patrick Hall and the H2O.ai team
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
DISCLAIMER: This notebook is not legal compliance advice.
This notebook uses an automated process to train and select accurate and interpretable machine learning models.
It is roughly based on two recent white papers that propose the use of Shapley values in credit underwriting, and on a model selection process introduced at the 2004 KDD Cup:
KDD-Cup 2004: Results and Analysis
The notebook first trains a GLM for initial feature selection, then attempts a forward selection process with a more complex monotonic GBM model, and then trains an even more complex unconstrained GBM also with forward feature selection. The notebook ends with a bonus section that illustrates an automated, heuristic method for selecting monotonicity constraints for certain features and automatically training a parsimonious hybrid between the constrained and unconstrained GBM. For each trained model the notebook displays detailed assessment and diagnostic information, enabling practitioners to make deliberate, informed tradeoffs between accuracy, explainability, interpretability, and fairness.
The prediction target is DEFAULT_NEXT_MONTH.
SEED = 12345 # global random seed for better reproducibility
GLM_SELECTION_THRESHOLD = 0.001 # threshold above which a GLM coefficient is considered "selected"
MONO_THRESHOLD = 6 # lower == more monotone constraints
TRUE_POSITIVE_AMOUNT = 0 # revenue for rejecting a defaulting customer
TRUE_NEGATIVE_AMOUNT = 20000 # revenue for accepting a paying customer, ~ customer LTV
FALSE_POSITIVE_AMOUNT = -20000 # revenue for rejecting a paying customer, ~ -customer LTV
FALSE_NEGATIVE_AMOUNT = -100000 # revenue for accepting a defaulting customer, ~ -mean(LIMIT_BAL)
import auto_ph # simple module for training and eval
import h2o # import h2o python bindings to java server
import numpy as np # array, vector, matrix calculations
import operator # for sorting dictionaries
import pandas as pd # DataFrame handling
import time # for timers
import matplotlib.pyplot as plt # plotting
pd.options.display.max_columns = 999 # enable display of all columns in notebook
# enables display of plots in notebook
%matplotlib inline
np.random.seed(SEED) # set random seed for better reproducibility
h2o.init(max_mem_size='32G', nthreads=4) # start h2o with plenty of memory and threads
h2o.remove_all() # clears h2o memory
h2o.no_progress() # turn off h2o progress indicators
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found. Attempting to start a local H2O server... Java Version: openjdk version "11.0.17" 2022-10-18; OpenJDK Runtime Environment (build 11.0.17+8-post-Ubuntu-1ubuntu218.04); OpenJDK 64-Bit Server VM (build 11.0.17+8-post-Ubuntu-1ubuntu218.04, mixed mode, sharing) Starting server from /home/patrickh/Workspace/interpretable_machine_learning_with_python/env_iml/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar Ice root: /tmp/tmpg38glkkl JVM stdout: /tmp/tmpg38glkkl/h2o_patrickh_started_from_python.out JVM stderr: /tmp/tmpg38glkkl/h2o_patrickh_started_from_python.err Server is running at http://127.0.0.1:54321 Connecting to H2O server at http://127.0.0.1:54321 ... successful. Warning: Your H2O cluster version is too old (3 years, 5 months and 29 days)! Please download and install the latest version from http://h2o.ai/download/
H2O cluster uptime: | 01 secs |
H2O cluster timezone: | America/New_York |
H2O data parsing timezone: | UTC |
H2O cluster version: | 3.26.0.3 |
H2O cluster version age: | 3 years, 5 months and 29 days !!! |
H2O cluster name: | H2O_from_python_patrickh_60vphz |
H2O cluster total nodes: | 1 |
H2O cluster free memory: | 32 Gb |
H2O cluster total cores: | 24 |
H2O cluster allowed cores: | 4 |
H2O cluster status: | accepting new members, healthy |
H2O connection url: | http://127.0.0.1:54321 |
H2O connection proxy: | None |
H2O internal security: | False |
H2O API Extensions: | Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4 |
Python version: | 3.6.9 final |
big_tic = time.time()
UCI credit card default data: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
The UCI credit card default data contains demographic and payment information about credit card customers in Taiwan in the year 2005. The data set contains 23 input features:
LIMIT_BAL: Amount of given credit (NT dollar)
SEX: 1 = male; 2 = female
EDUCATION: 1 = graduate school; 2 = university; 3 = high school; 4 = others
MARRIAGE: 1 = married; 2 = single; 3 = others
AGE: Age in years
PAY_0, PAY_2 - PAY_6: History of past payment; PAY_0 = the repayment status in September, 2005; PAY_2 = the repayment status in August, 2005; ...; PAY_6 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
BILL_AMT1 - BILL_AMT6: Amount of bill statement (NT dollar). BILL_AMT1 = amount of bill statement in September, 2005; BILL_AMT2 = amount of bill statement in August, 2005; ...; BILL_AMT6 = amount of bill statement in April, 2005.
PAY_AMT1 - PAY_AMT6: Amount of previous payment (NT dollar). PAY_AMT1 = amount paid in September, 2005; PAY_AMT2 = amount paid in August, 2005; ...; PAY_AMT6 = amount paid in April, 2005.
# import XLS file
path = 'default_of_credit_card_clients.xls'
data = pd.read_excel(path,
skiprows=1) # skip the first row of the spreadsheet
# remove spaces from target column name
data = data.rename(columns={'default payment next month': 'DEFAULT_NEXT_MONTH'})
def recode_cc_data(frame):
""" Recodes numeric categorical variables into categorical character variables
with more transparent values.
Args:
frame: Pandas DataFrame version of UCI credit card default data.
Returns:
Pandas DataFrame with recoded values.
"""
# define recoded values
sex_dict = {1:'male', 2:'female'}
education_dict = {0:'other', 1:'graduate school', 2:'university', 3:'high school',
4:'other', 5:'other', 6:'other'}
marriage_dict = {0:'other', 1:'married', 2:'single', 3:'divorced'}
# recode values using Pandas apply() and anonymous function
frame['SEX'] = frame['SEX'].apply(lambda i: sex_dict[i])
frame['EDUCATION'] = frame['EDUCATION'].apply(lambda i: education_dict[i])
frame['MARRIAGE'] = frame['MARRIAGE'].apply(lambda i: marriage_dict[i])
return frame
data = recode_cc_data(data)
# assign target and inputs for models
y_name = 'DEFAULT_NEXT_MONTH'
x_names = [name for name in data.columns if name not in [y_name, 'ID', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE']]
print('y_name =', y_name)
print('x_names =', x_names)
y_name = DEFAULT_NEXT_MONTH x_names = ['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
data[x_names + [y_name]].describe()
LIMIT_BAL | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT_NEXT_MONTH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 3.000000e+04 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 3.000000e+04 | 30000.00000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 |
mean | 167484.322667 | -0.016700 | -0.133767 | -0.166200 | -0.220667 | -0.266200 | -0.291100 | 51223.330900 | 49179.075167 | 4.701315e+04 | 43262.948967 | 40311.400967 | 38871.760400 | 5663.580500 | 5.921163e+03 | 5225.68150 | 4826.076867 | 4799.387633 | 5215.502567 | 0.221200 |
std | 129747.661567 | 1.123802 | 1.197186 | 1.196868 | 1.169139 | 1.133187 | 1.149988 | 73635.860576 | 71173.768783 | 6.934939e+04 | 64332.856134 | 60797.155770 | 59554.107537 | 16563.280354 | 2.304087e+04 | 17606.96147 | 15666.159744 | 15278.305679 | 17777.465775 | 0.415062 |
min | 10000.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 | -165580.000000 | -69777.000000 | -1.572640e+05 | -170000.000000 | -81334.000000 | -339603.000000 | 0.000000 | 0.000000e+00 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 50000.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | 3558.750000 | 2984.750000 | 2.666250e+03 | 2326.750000 | 1763.000000 | 1256.000000 | 1000.000000 | 8.330000e+02 | 390.00000 | 296.000000 | 252.500000 | 117.750000 | 0.000000 |
50% | 140000.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 22381.500000 | 21200.000000 | 2.008850e+04 | 19052.000000 | 18104.500000 | 17071.000000 | 2100.000000 | 2.009000e+03 | 1800.00000 | 1500.000000 | 1500.000000 | 1500.000000 | 0.000000 |
75% | 240000.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 67091.000000 | 64006.250000 | 6.016475e+04 | 54506.000000 | 50190.500000 | 49198.250000 | 5006.000000 | 5.000000e+03 | 4505.00000 | 4013.250000 | 4031.500000 | 4000.000000 | 0.000000 |
max | 1000000.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 964511.000000 | 983931.000000 | 1.664089e+06 | 891586.000000 | 927171.000000 | 961664.000000 | 873552.000000 | 1.684259e+06 | 896040.00000 | 621000.000000 | 426529.000000 | 528666.000000 | 1.000000 |
Pearson correlation between inputs and DEFAULT_NEXT_MONTH
# Pearson correlation between inputs and target
# is last column of correlation matrix
corr = pd.DataFrame(data[x_names + [y_name]].corr()[y_name]).iloc[:-1]
corr.columns = ['Pearson Correlation Coefficient']
corr
Pearson Correlation Coefficient | |
---|---|
LIMIT_BAL | -0.153520 |
PAY_0 | 0.324794 |
PAY_2 | 0.263551 |
PAY_3 | 0.235253 |
PAY_4 | 0.216614 |
PAY_5 | 0.204149 |
PAY_6 | 0.186866 |
BILL_AMT1 | -0.019644 |
BILL_AMT2 | -0.014193 |
BILL_AMT3 | -0.014076 |
BILL_AMT4 | -0.010156 |
BILL_AMT5 | -0.006760 |
BILL_AMT6 | -0.005372 |
PAY_AMT1 | -0.072929 |
PAY_AMT2 | -0.058579 |
PAY_AMT3 | -0.056250 |
PAY_AMT4 | -0.056827 |
PAY_AMT5 | -0.055124 |
PAY_AMT6 | -0.053183 |
# display correlation to target as barchart
fig, ax_ = plt.subplots(figsize=(10, 8))
_ = pd.DataFrame(data[x_names + [y_name]].corr()[y_name]).iloc[:-1].plot(kind='barh',
ax=ax_,
edgecolor=['black']*len(data[x_names + [y_name]]),
colormap='cool')
split_ratio = 0.7 # 70%/30% train/test split
# execute split
split = np.random.rand(len(data)) < split_ratio
train = data[split]
valid = data[~split]
# summarize split
print('Train data rows = %d, columns = %d' % (train.shape[0], train.shape[1]))
print('Validation data rows = %d, columns = %d' % (valid.shape[0], valid.shape[1]))
Train data rows = 20946, columns = 25 Validation data rows = 9054, columns = 25
# train penalized GLM w/ alpha and lambda grid search
best_glm = auto_ph.glm_grid(x_names, y_name, h2o.H2OFrame(train),
h2o.H2OFrame(valid), SEED)
# output results
print('Best penalized GLM AUC: %.2f' %
best_glm.auc(valid=True))
# print selected coefficients
print('Best penalized GLM coefficients:')
for c_name, c_val in sorted(best_glm.coef().items(), key=operator.itemgetter(1)):
if abs(c_val) > GLM_SELECTION_THRESHOLD:
print('%s %s' % (str(c_name + ':').ljust(25), c_val))
Best penalized GLM AUC: 0.73 Best penalized GLM coefficients: Intercept: -1.0553055885519234 PAY_6: 0.012282515281903838 PAY_4: 0.025486634303383368 PAY_5: 0.04613937054483051 PAY_3: 0.07909158701433618 PAY_2: 0.08471364623598049 PAY_0: 0.537195471519995
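auto_ph.glm_grid is defined in the accompanying module rather than in this notebook. As a rough sketch of the kind of elastic net search it performs (an assumption about its internals, not the actual source), an alpha grid with lambda search in H2O might look like this:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.grid.grid_search import H2OGridSearch
def glm_grid_sketch(x_names, y_name, htrain, hvalid, seed):
    # classification requires a factor target
    htrain[y_name] = htrain[y_name].asfactor()
    hvalid[y_name] = hvalid[y_name].asfactor()
    # search over the elastic net mixing parameter alpha; lambda_search tunes the penalty strength
    grid = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial', lambda_search=True, seed=seed),
                         hyper_params={'alpha': [0.01, 0.25, 0.5, 0.99]})
    grid.train(x=x_names, y=y_name, training_frame=htrain, validation_frame=hvalid)
    # keep the model with the best validation AUC
    return grid.get_grid(sort_by='auc', decreasing=True).models[0]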
best_glm
Model Details ============= H2OGeneralizedLinearEstimator : Generalized Linear Modeling Model Key: Grid_GLM_Key_Frame__upload_b902da19ccbfcf08bcf090f879b54a2f.hex_model_python_1677103750394_1_model_1 GLM Model: summary
family | link | regularization | lambda_search | number_of_predictors_total | number_of_active_predictors | number_of_iterations | training_frame | ||
---|---|---|---|---|---|---|---|---|---|
0 | binomial | logit | Elastic Net (alpha = 0.01, lambda = 0.005908 ) | nlambda = 100, lambda.max = 13.333, lambda.min = 0.005908, lambda.... | 19 | 19 | 109 | Key_Frame__upload_b902da19ccbfcf08bcf090f879b54a2f.hex |
ModelMetricsBinomialGLM: glm ** Reported on train data. ** MSE: 0.14649158915954694 RMSE: 0.3827421967324049 LogLoss: 0.4685812636002606 Null degrees of freedom: 20945 Residual degrees of freedom: 20926 Null deviance: 22178.75361964548 Residual deviance: 19629.806294742117 AIC: 19669.806294742117 AUC: 0.7182752479663853 pr_auc: 0.5010322496049208 Gini: 0.4365504959327706 Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.24983106738243743:
0 | 1 | Error | Rate | ||
---|---|---|---|---|---|
0 | 0 | 13778.0 | 2518.0 | 0.1545 | (2518.0/16296.0) |
1 | 1 | 2168.0 | 2482.0 | 0.4662 | (2168.0/4650.0) |
2 | Total | 15946.0 | 5000.0 | 0.2237 | (4686.0/20946.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
metric | threshold | value | idx | |
---|---|---|---|---|
0 | max f1 | 0.249831 | 0.514404 | 205.0 |
1 | max f2 | 0.054654 | 0.594059 | 377.0 |
2 | max f0point5 | 0.399178 | 0.567555 | 137.0 |
3 | max accuracy | 0.418922 | 0.817053 | 128.0 |
4 | max precision | 0.706802 | 0.797414 | 34.0 |
5 | max recall | 0.001281 | 1.000000 | 399.0 |
6 | max specificity | 0.989212 | 0.999570 | 0.0 |
7 | max absolute_mcc | 0.399178 | 0.396395 | 137.0 |
8 | max min_per_class_accuracy | 0.221641 | 0.658925 | 237.0 |
9 | max mean_per_class_accuracy | 0.245211 | 0.690483 | 210.0 |
Gains/Lift Table: Avg response rate: 22.20 %, avg score: 22.20 %
group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.010026 | 7.192943e-01 | 3.539263 | 3.539263 | 0.785714 | 0.816889 | 0.785714 | 0.816889 | 0.035484 | 0.035484 | 253.926267 | 253.926267 | |
1 | 2 | 0.020004 | 6.109553e-01 | 2.952721 | 3.246692 | 0.655502 | 0.659815 | 0.720764 | 0.738540 | 0.029462 | 0.064946 | 195.272110 | 224.669182 | |
2 | 3 | 0.030030 | 5.904669e-01 | 3.303312 | 3.265595 | 0.733333 | 0.600522 | 0.724960 | 0.692461 | 0.033118 | 0.098065 | 230.331183 | 226.559516 | |
3 | 4 | 0.040008 | 5.652261e-01 | 3.469986 | 3.316571 | 0.770335 | 0.576997 | 0.736277 | 0.663664 | 0.034624 | 0.132688 | 246.998611 | 231.657094 | |
4 | 5 | 0.050033 | 5.378222e-01 | 3.024461 | 3.258037 | 0.671429 | 0.552092 | 0.723282 | 0.641307 | 0.030323 | 0.163011 | 202.446083 | 225.803743 | |
5 | 6 | 0.100019 | 4.442181e-01 | 2.916965 | 3.087582 | 0.647564 | 0.484561 | 0.685442 | 0.562971 | 0.145806 | 0.308817 | 191.696460 | 208.758242 | |
6 | 7 | 0.150005 | 3.537487e-01 | 2.090922 | 2.755468 | 0.464183 | 0.405265 | 0.611712 | 0.510419 | 0.104516 | 0.413333 | 109.092153 | 175.546785 | |
7 | 8 | 0.200038 | 2.655100e-01 | 1.427003 | 2.423193 | 0.316794 | 0.294731 | 0.537947 | 0.456471 | 0.071398 | 0.484731 | 42.700320 | 142.319316 | |
8 | 9 | 0.300010 | 2.382281e-01 | 1.015345 | 1.954060 | 0.225406 | 0.248630 | 0.433800 | 0.387213 | 0.101505 | 0.586237 | 1.534461 | 95.405967 | |
9 | 10 | 0.400029 | 2.224902e-01 | 0.672990 | 1.633754 | 0.149403 | 0.229921 | 0.362692 | 0.347885 | 0.067312 | 0.653548 | -32.701024 | 63.375397 | |
10 | 11 | 0.500000 | 2.030962e-01 | 0.537788 | 1.414624 | 0.119389 | 0.213314 | 0.314046 | 0.320979 | 0.053763 | 0.707312 | -46.221154 | 41.462366 | |
11 | 12 | 0.600019 | 1.763763e-01 | 0.485929 | 1.259817 | 0.107876 | 0.190730 | 0.279679 | 0.299267 | 0.048602 | 0.755914 | -51.407129 | 25.981653 | |
12 | 13 | 0.699990 | 1.357259e-01 | 0.686218 | 1.177896 | 0.152340 | 0.157265 | 0.261492 | 0.278986 | 0.068602 | 0.824516 | -31.378193 | 17.789625 | |
13 | 14 | 0.800010 | 1.134964e-01 | 0.724593 | 1.121223 | 0.160859 | 0.123743 | 0.248911 | 0.259578 | 0.072473 | 0.896989 | -27.540719 | 12.122318 | |
14 | 15 | 0.900076 | 6.434736e-02 | 0.477100 | 1.049612 | 0.105916 | 0.093489 | 0.233013 | 0.241113 | 0.047742 | 0.944731 | -52.289953 | 4.961223 | |
15 | 16 | 1.000000 | 1.134804e-08 | 0.553111 | 1.000000 | 0.122790 | 0.049839 | 0.221999 | 0.222000 | 0.055269 | 1.000000 | -44.688932 | 0.000000 |
ModelMetricsBinomialGLM: glm ** Reported on validation data. ** MSE: 0.1436346602079876 RMSE: 0.37899163606600555 LogLoss: 0.4617337838828072 Null degrees of freedom: 9053 Residual degrees of freedom: 9034 Null deviance: 9526.71172610569 Residual deviance: 8361.075358549871 AIC: 8401.075358549871 AUC: 0.7303402396287311 pr_auc: 0.5061676572465437 Gini: 0.46068047925746214 Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.2607285198255572:
0 | 1 | Error | Rate | ||
---|---|---|---|---|---|
0 | 0 | 6083.0 | 985.0 | 0.1394 | (985.0/7068.0) |
1 | 1 | 921.0 | 1065.0 | 0.4637 | (921.0/1986.0) |
2 | Total | 7004.0 | 2050.0 | 0.2105 | (1906.0/9054.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
metric | threshold | value | idx | |
---|---|---|---|---|
0 | max f1 | 0.260729 | 0.527750 | 193.0 |
1 | max f2 | 0.116430 | 0.592436 | 329.0 |
2 | max f0point5 | 0.400670 | 0.576923 | 134.0 |
3 | max accuracy | 0.433434 | 0.822288 | 120.0 |
4 | max precision | 0.572358 | 0.743386 | 68.0 |
5 | max recall | 0.007410 | 1.000000 | 397.0 |
6 | max specificity | 0.989045 | 0.999859 | 0.0 |
7 | max absolute_mcc | 0.370892 | 0.413902 | 147.0 |
8 | max min_per_class_accuracy | 0.225699 | 0.672205 | 230.0 |
9 | max mean_per_class_accuracy | 0.246401 | 0.699498 | 206.0 |
Gains/Lift Table: Avg response rate: 21.94 %, avg score: 22.66 %
group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.010051 | 7.207631e-01 | 3.106072 | 3.106072 | 0.681319 | 0.813734 | 0.681319 | 0.813734 | 0.031219 | 0.031219 | 210.607218 | 210.607218 | |
1 | 2 | 0.020102 | 6.229843e-01 | 3.156170 | 3.131121 | 0.692308 | 0.663736 | 0.686813 | 0.738735 | 0.031722 | 0.062941 | 215.617011 | 213.112114 | |
2 | 3 | 0.030042 | 5.993452e-01 | 3.545821 | 3.268338 | 0.777778 | 0.607299 | 0.716912 | 0.695245 | 0.035247 | 0.098187 | 254.582075 | 226.833792 | |
3 | 4 | 0.040093 | 5.753741e-01 | 3.757345 | 3.390927 | 0.824176 | 0.587697 | 0.743802 | 0.668284 | 0.037764 | 0.135952 | 275.734537 | 239.092657 | |
4 | 5 | 0.050033 | 5.509826e-01 | 3.241893 | 3.361317 | 0.711111 | 0.562950 | 0.737307 | 0.647357 | 0.032226 | 0.168177 | 224.189325 | 236.131730 | |
5 | 6 | 0.100066 | 4.545088e-01 | 2.968828 | 3.165073 | 0.651214 | 0.496126 | 0.694260 | 0.571741 | 0.148540 | 0.316717 | 196.882815 | 216.507273 | |
6 | 7 | 0.149989 | 3.741888e-01 | 2.208854 | 2.846802 | 0.484513 | 0.414835 | 0.624448 | 0.519516 | 0.110272 | 0.426989 | 120.885357 | 184.680243 | |
7 | 8 | 0.200022 | 2.740947e-01 | 1.489446 | 2.507276 | 0.326711 | 0.315894 | 0.549972 | 0.468582 | 0.074522 | 0.501511 | 48.944599 | 150.727595 | |
8 | 9 | 0.299978 | 2.407348e-01 | 0.997420 | 2.004176 | 0.218785 | 0.253042 | 0.439617 | 0.396762 | 0.099698 | 0.601208 | -0.258049 | 100.417577 | |
9 | 10 | 0.400044 | 2.249573e-01 | 0.719563 | 1.682845 | 0.157837 | 0.232429 | 0.369133 | 0.355656 | 0.072004 | 0.673212 | -28.043657 | 68.284535 | |
10 | 11 | 0.500000 | 2.058437e-01 | 0.539010 | 1.454179 | 0.118232 | 0.215713 | 0.318975 | 0.327680 | 0.053877 | 0.727090 | -46.099047 | 45.417925 | |
11 | 12 | 0.599956 | 1.811296e-01 | 0.443298 | 1.285761 | 0.097238 | 0.194132 | 0.282032 | 0.305430 | 0.044310 | 0.771400 | -55.670244 | 28.576100 | |
12 | 13 | 0.700022 | 1.408759e-01 | 0.533383 | 1.178211 | 0.116998 | 0.162818 | 0.258441 | 0.285044 | 0.053374 | 0.824773 | -46.661731 | 17.821055 | |
13 | 14 | 0.799978 | 1.142562e-01 | 0.800958 | 1.131074 | 0.175691 | 0.125883 | 0.248102 | 0.265157 | 0.080060 | 0.904834 | -19.904191 | 13.107353 | |
14 | 15 | 0.899934 | 6.452901e-02 | 0.438260 | 1.054123 | 0.096133 | 0.094816 | 0.231222 | 0.246237 | 0.043807 | 0.948640 | -56.173991 | 5.412260 | |
15 | 16 | 1.000000 | 1.276339e-12 | 0.513255 | 1.000000 | 0.112583 | 0.049514 | 0.219351 | 0.226552 | 0.051360 | 1.000000 | -48.674496 | 0.000000 |
Scoring History:
timestamp | duration | iteration | lambda | predictors | deviance_train | deviance_test | ||
---|---|---|---|---|---|---|---|---|
0 | 2023-02-22 17:10:03 | 0.000 sec | 1 | .13E2 | 1 | 1.058854 | 1.052210 | |
1 | 2023-02-22 17:10:03 | 0.037 sec | 2 | .12E2 | 2 | 1.058595 | 1.051942 | |
2 | 2023-02-22 17:10:03 | 0.060 sec | 3 | .11E2 | 2 | 1.058312 | 1.051649 | |
3 | 2023-02-22 17:10:03 | 0.093 sec | 4 | .1E2 | 3 | 1.057846 | 1.051165 | |
4 | 2023-02-22 17:10:03 | 0.124 sec | 5 | .92E1 | 4 | 1.057200 | 1.050494 | |
5 | 2023-02-22 17:10:03 | 0.146 sec | 6 | .84E1 | 5 | 1.056328 | 1.049589 | |
6 | 2023-02-22 17:10:03 | 0.161 sec | 7 | .76E1 | 6 | 1.055193 | 1.048412 | |
7 | 2023-02-22 17:10:03 | 0.175 sec | 8 | .7E1 | 7 | 1.053849 | 1.047023 | |
8 | 2023-02-22 17:10:03 | 0.220 sec | 9 | .63E1 | 7 | 1.052407 | 1.045532 | |
9 | 2023-02-22 17:10:03 | 0.231 sec | 10 | .58E1 | 8 | 1.050810 | 1.043879 | |
10 | 2023-02-22 17:10:03 | 0.239 sec | 11 | .53E1 | 8 | 1.049070 | 1.042077 | |
11 | 2023-02-22 17:10:03 | 0.249 sec | 12 | .48E1 | 8 | 1.047225 | 1.040165 | |
12 | 2023-02-22 17:10:03 | 0.261 sec | 13 | .44E1 | 8 | 1.045279 | 1.038148 | |
13 | 2023-02-22 17:10:03 | 0.276 sec | 14 | .4E1 | 8 | 1.043223 | 1.036017 | |
14 | 2023-02-22 17:10:04 | 0.292 sec | 15 | .36E1 | 8 | 1.041064 | 1.033777 | |
15 | 2023-02-22 17:10:04 | 0.316 sec | 16 | .33E1 | 8 | 1.038800 | 1.031428 | |
16 | 2023-02-22 17:10:04 | 0.334 sec | 18 | .3E1 | 9 | 1.036416 | 1.028957 | |
17 | 2023-02-22 17:10:04 | 0.352 sec | 20 | .27E1 | 9 | 1.033911 | 1.026363 | |
18 | 2023-02-22 17:10:04 | 0.376 sec | 22 | .25E1 | 9 | 1.031311 | 1.023671 | |
19 | 2023-02-22 17:10:04 | 0.402 sec | 24 | .23E1 | 9 | 1.028635 | 1.020899 |
See the whole table with table.as_data_frame()
# collect regularization paths from dict in DataFrame
reg_path_dict = best_glm.getGLMRegularizationPath(best_glm)
reg_path_frame = pd.DataFrame(columns=reg_path_dict['coefficients'][0].keys())
for i in range(0, len(reg_path_dict['coefficients'])):
reg_path_frame = reg_path_frame.append(reg_path_dict['coefficients'][i],
ignore_index=True)
###########################################
# establish benchmark feature selection: #
# glm_selected #
# used frequently in further calculations #
###########################################
glm_selected = list(reg_path_frame.iloc[-1, :][reg_path_frame.iloc[-1, :] > GLM_SELECTION_THRESHOLD].index)
# plot regularization paths
fig, ax_ = plt.subplots(figsize=(8, 6))
_ = reg_path_frame[glm_selected].plot(kind='line', ax=ax_, title='Penalized GLM Regularization Paths',
colormap='cool')
_ = ax_.set_xlabel('Iteration')
_ = ax_.set_ylabel('Coefficient Value')
_ = ax_.axhline(c='k', lw=1, xmin=0.045, xmax=0.955)
_ = plt.legend(bbox_to_anchor=(1.05, 0),
loc=3,
borderaxespad=0.)
"""
# collect Pearson correlation and GLM coefficients into same DataFrame
glm_selected_coef = pd.DataFrame.from_dict(best_glm.coef(), orient='index', columns=['Penalized GLM Coefficient'])
corr_glm = pd.concat([corr, glm_selected_coef.iloc[1:]], axis=1)
# plot
fig, ax_ = plt.subplots(figsize=(8, 6))
_ = corr_glm.plot(kind='barh', ax=ax_, colormap='gnuplot')
"""
"\n# collect Pearson correlation and GLM coefficients into same DataFrame\nglm_selected_coef = pd.DataFrame.from_dict(best_glm.coef(), orient='index', columns=['Penalized GLM Coefficient'])\nzcorr_glm = pd.concat([corr, glm_selected_coef.iloc[1:]], axis=1)\n\n# plot\nfig, ax_ = plt.subplots(figsize=(8, 6))\n_ = corr_glm.plot(kind='barh', ax=ax_, colormap='gnuplot')\n"
# collect Pearson correlation and GLM contributions into same DataFrame
glm_contrib_frame = pd.concat([valid[x_names].abs().mean(axis=0),
pd.DataFrame.from_dict(best_glm.coef(), orient='index',
columns=['Penalized GLM Coefficient']).drop('Intercept')],
axis=1, sort=True)
glm_contrib_frame['Penalized GLM Contribution'] = glm_contrib_frame.iloc[:, 0] * glm_contrib_frame.iloc[:, 1] # mean(|x_j|) * beta_j
corr_glm = pd.concat([corr.abs(), glm_contrib_frame.iloc[:, 2]], axis=1, sort=True)
corr_glm.columns = ['Absolute ' + name for name in corr_glm.columns]
# another approach is to calculate Shapley values for GLM directly
# plot
fig, ax_ = plt.subplots(figsize=(10, 8))
_ = corr_glm.plot(kind='barh', ax=ax_, colormap='cool', edgecolor=['black']*len(data[x_names + [y_name]]))
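In other words, the 'Penalized GLM Contribution' computed above approximates each feature's average contribution to the linear predictor: the fitted coefficient scaled by the typical magnitude of the feature, roughly contribution_j = mean_i(|x_ij|) * beta_j. A feature with a large coefficient but tiny typical values, or vice versa, therefore contributes little, which makes the comparison to Pearson correlation and, later, to mean SHAP values more meaningful.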
# init dict to hold partial dependence and ICE values
# for each feature
# for glm
glm_pd_ice_dict = {}
# calculate partial dependence for each selected feature
for xs in glm_selected:
glm_pd_ice_dict[xs] = auto_ph.pd_ice(xs, valid, best_glm)
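auto_ph.pd_ice is also defined outside this notebook. As a minimal sketch of the underlying idea (an assumption about its behavior, not the actual helper), one-feature partial dependence for an H2O classifier can be computed by sweeping the feature over a grid of bins and averaging the model's predicted probabilities; applying the same routine to a single row yields an ICE curve, which is how the bins= argument is used further below.
def simple_partial_dependence(x_name, frame, model, bins=None, resolution=20):
    # sketch only: hold x_name at each grid value for every row and average the predictions
    if bins is None:
        bins = np.linspace(frame[x_name].min(), frame[x_name].max(), resolution)
    temp = frame.copy(deep=True)
    pd_vals = []
    for b in bins:
        temp[x_name] = b
        p1 = model.predict(h2o.H2OFrame(temp))['p1'].as_data_frame()
        pd_vals.append(p1['p1'].mean())
    return pd.DataFrame({x_name: list(bins), 'partial_dependence': pd_vals})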
# merge GLM predictions onto test data
glm_yhat_valid = pd.concat([valid.reset_index(drop=True),
best_glm.predict(h2o.H2OFrame(valid))['p1'].as_data_frame()],
axis=1)
# rename yhat column
glm_yhat_valid = glm_yhat_valid.rename(columns={'p1':'p_DEFAULT_NEXT_MONTH'})
# find percentiles of predictions
glm_percentile_dict = auto_ph.get_percentile_dict('p_DEFAULT_NEXT_MONTH', glm_yhat_valid, 'ID')
# display percentiles dictionary
# key=percentile, val=row_id
glm_percentile_dict
{0: 28717, 99: 13713, 10: 27519, 20: 29714, 30: 1012, 40: 12100, 50: 2849, 60: 3518, 70: 26302, 80: 5763, 90: 7083}
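The dictionary above maps prediction percentiles to the ID of the row scored at roughly that percentile. A rough sketch of the likely logic (an assumption, not the actual auto_ph source): sort the scored frame by the prediction column and read off the IDs at the 0th, 10th, ..., 90th, and 99th percentiles.
def percentile_dict_sketch(yhat_name, frame, id_name):
    sorted_ = frame.sort_values(yhat_name).reset_index(drop=True)
    d = {0: sorted_.loc[0, id_name],
         99: sorted_.loc[int(0.99 * (len(sorted_) - 1)), id_name]}
    for p in range(10, 100, 10):
        d[p] = sorted_.loc[int(p / 100 * (len(sorted_) - 1)), id_name]
    return d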
# loop through selected variables
for xs in glm_selected:
# collect bins used in partial dependence
bins = list(glm_pd_ice_dict[xs][xs])
# calculate ICE at percentiles
# using partial dependence bins
# for each selected feature
for i in sorted(glm_percentile_dict.keys()):
col_name = 'Percentile_' + str(i)
glm_pd_ice_dict[xs][col_name] = auto_ph.pd_ice(xs, # x_names used here b/c all features have small coef in GLM
valid[valid['ID'] == int(glm_percentile_dict[i])][x_names],
best_glm,
bins=bins)['partial_dependence']
for xs in glm_selected:
auto_ph.hist_mean_pd_ice_plot(xs, y_name, valid, glm_pd_ice_dict)
local_coef_dict = {10: pd.DataFrame(columns = ['GLM Contribution'], index=x_names),
50: pd.DataFrame(columns = ['GLM Contribution'], index=x_names),
90: pd.DataFrame(columns = ['GLM Contribution'], index=x_names)}
Local GLM feature contributions at percentiles of p_DEFAULT_NEXT_MONTH
(Another option would be to calculate Shapley values for GLM directly.)
for name in x_names:
for percentile in [10, 50, 90]:
# local contributions = beta_j * x_i,j
local_coef_dict[percentile].loc[name, 'GLM Contribution'] =\
best_glm.coef()[name] *\
valid[valid['ID'] == int(glm_percentile_dict[percentile])][name].values[0]
Plot local GLM feature contributions at percentiles of p_DEFAULT_NEXT_MONTH
fig, (ax0, ax1, ax2) = plt.subplots(ncols=3, sharey=True)
plt.tight_layout()
plt.subplots_adjust(left=0, right=2, wspace=0.1)
_ = local_coef_dict[10].plot(kind='bar', colormap='cool', ax=ax0, edgecolor=['black']*len(data[x_names + [y_name]]),
title='10th PCTL of p_DEFAULT_NEXT_MONTH')
_ = local_coef_dict[50].plot(kind='bar', colormap='cool', ax=ax1, edgecolor=['black']*len(data[x_names + [y_name]]),
title='50th PCTL of p_DEFAULT_NEXT_MONTH')
_ = local_coef_dict[90].plot(kind='bar', colormap='cool', ax=ax2, edgecolor=['black']*len(data[x_names + [y_name]]),
title='90th PCTL of p_DEFAULT_NEXT_MONTH')
Standardized mean difference between SEX = male and SEX = female
print('Standardized mean difference: %.2f' % auto_ph.smd(glm_yhat_valid, 'SEX', 'p_DEFAULT_NEXT_MONTH', 'male', 'female'))
Male mean yhat: 0.23 Female mean yhat: 0.22 P_Default_Next_Month std. dev.: 0.15 Standardized mean difference: -0.08
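The printed quantities suggest auto_ph.smd divides the difference in mean predicted probability between the protected group (female) and the reference group (male) by the overall standard deviation of the predictions; a small sketch, assuming that behavior:
male_mean = glm_yhat_valid.loc[glm_yhat_valid['SEX'] == 'male', 'p_DEFAULT_NEXT_MONTH'].mean()
female_mean = glm_yhat_valid.loc[glm_yhat_valid['SEX'] == 'female', 'p_DEFAULT_NEXT_MONTH'].mean()
sd = glm_yhat_valid['p_DEFAULT_NEXT_MONTH'].std()
smd = (female_mean - male_mean) / sd  # negative values indicate lower predicted default risk for females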
best_glm_cut = best_glm.mcc(valid=True)[0][0]
best_glm_cut
0.3708916952697694
glm_male_cm = auto_ph.get_confusion_matrix(glm_yhat_valid, y_name, 'p_DEFAULT_NEXT_MONTH', by='SEX',
level='male', cutoff=best_glm_cut)
glm_female_cm = auto_ph.get_confusion_matrix(glm_yhat_valid, y_name, 'p_DEFAULT_NEXT_MONTH', by='SEX',
level='female', cutoff=best_glm_cut)
glm_cm_dict = {'male': glm_male_cm, 'female': glm_female_cm}
Confusion matrix for SEX = male
glm_male_cm
actual: 1 | actual: 0 | |
---|---|---|
predicted: 1 | 379 | 211 |
predicted: 0 | 464 | 2538 |
Confusion matrix for SEX = female
glm_female_cm
actual: 1 | actual: 0 | |
---|---|---|
predicted: 1 | 480 | 308 |
predicted: 0 | 663 | 4011 |
Adverse impact ratio between SEX = male and SEX = female
print('Adverse impact ratio: %.2f' % auto_ph.air(glm_cm_dict, 'male', 'female'))
Male proportion accepted: 0.836 Female proportion accepted: 0.856 Adverse impact ratio: 1.02
Marginal effect between SEX = male and SEX = female
print('Marginal effect: %.2f%%' % auto_ph.marginal_effect(glm_cm_dict, 'male', 'female'))
Male accepted: 83.57% Female accepted: 85.57% Marginal effect: -2.00%
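Both fairness metrics above can be read straight off the group-wise confusion matrices: a group's acceptance rate is the share of its members in the 'predicted: 0' row (predicted not to default, hence accepted). A sketch consistent with the printed values, assuming this is what auto_ph.air and auto_ph.marginal_effect compute:
def acceptance_rate(cm):
    # second row of the displayed confusion matrices is 'predicted: 0' (accepted)
    return cm.iloc[1, :].sum() / cm.values.sum()
male_rate = acceptance_rate(glm_cm_dict['male'])      # ~0.836
female_rate = acceptance_rate(glm_cm_dict['female'])  # ~0.856
air = female_rate / male_rate                         # adverse impact ratio, ~1.02
me = 100 * (male_rate - female_rate)                  # marginal effect in percentage points, ~-2.00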
glm_cm = auto_ph.get_confusion_matrix(glm_yhat_valid, y_name, 'p_DEFAULT_NEXT_MONTH', cutoff=best_glm_cut)
glm_cm
actual: 1 | actual: 0 | |
---|---|---|
predicted: 1 | 859 | 519 |
predicted: 0 | 1127 | 6549 |
glm_business_impact = glm_cm.iloc[0, 0]*TRUE_POSITIVE_AMOUNT +\
glm_cm.iloc[0, 1]*FALSE_POSITIVE_AMOUNT +\
glm_cm.iloc[1, 0]*FALSE_NEGATIVE_AMOUNT +\
glm_cm.iloc[1, 1]*TRUE_NEGATIVE_AMOUNT
print('Estimated business impact $%.2f' % glm_business_impact)
Estimated business impact $7900000.00
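Reading the dollar amounts off the confusion matrix above: 859 * $0 + 519 * (-$20,000) + 1,127 * (-$100,000) + 6,549 * $20,000 = -$10,380,000 - $112,700,000 + $130,980,000 = $7,900,000, so accepted paying customers more than offset the cost of accepted defaulters at this cutoff.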
# initialize data structures needed to compare correlation coefficients,
# penalized glm coefficients, and MGBM Shapley values
# as features are added into the MGBM
abs_corr = corr.copy(deep=True)
abs_corr['Pearson Correlation Coefficient'] = corr['Pearson Correlation Coefficient'].abs()
# create a list of features to add into MGBM
# list is ordered by correlation between X_j and y
next_list = [name for name in list(abs_corr.sort_values(by='Pearson Correlation Coefficient',
ascending=False).index) if name not in glm_selected]
# create a DataFrame to store new MGBM SHAP values
# for comparison to correlation and penalized glm coefficients
abs_corr_glm_mgbm_shap = corr_glm.copy(deep=True).abs()
#abs_corr_glm_mgbm_shap.columns = ['Absolute ' + name for name in abs_corr_glm_mgbm_shap.columns]
abs_corr_glm_mgbm_shap['Monotonic GBM Mean SHAP Value'] = 0
# start local timer
tic = time.time()
# forward stepwise MGBM training
mgbm_train_results = auto_ph.gbm_forward_select_train(glm_selected,
y_name,
train,
valid,
SEED,
next_list,
abs_corr_glm_mgbm_shap,
'Monotonic GBM Mean SHAP Value',
monotone=True)
mgbm_models = mgbm_train_results['MODELS']
corr_glm_mgbm_shap_coefs = mgbm_train_results['GLOBAL_COEFS']
mgbm_shap = mgbm_train_results['LOCAL_COEFS']
# end local timer
toc = time.time()-tic
print('Task completed in %.2f s.' % (toc))
# 2 threads - 695 s
# 4 threads - 691 s
# 8 threads - 692 s
Starting grid search 1/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'] Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1} Completed grid search 1/14 with AUC: 0.74 ... -------------------------------------------------------------------------------- Starting grid search 2/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL'] Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1} Completed grid search 2/14 with AUC: 0.76 ... -------------------------------------------------------------------------------- Starting grid search 3/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1'] Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1} Completed grid search 3/14 with AUC: 0.77 ... -------------------------------------------------------------------------------- Starting grid search 4/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2'] Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1} Completed grid search 4/14 with AUC: 0.77 ... -------------------------------------------------------------------------------- Starting grid search 5/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4'] Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1, 'PAY_AMT4': -1} Completed grid search 5/14 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 6/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3'] Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1, 'PAY_AMT4': -1, 'PAY_AMT3': -1} Completed grid search 6/14 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 7/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5'] Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1, 'PAY_AMT4': -1, 'PAY_AMT3': -1, 'PAY_AMT5': -1} Completed grid search 7/14 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 8/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6'] Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1, 'PAY_AMT4': -1, 'PAY_AMT3': -1, 'PAY_AMT5': -1, 'PAY_AMT6': -1} Completed grid search 8/14 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 9/14 ... 
Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1'] Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1, 'PAY_AMT4': -1, 'PAY_AMT3': -1, 'PAY_AMT5': -1, 'PAY_AMT6': -1, 'BILL_AMT1': -1} Completed grid search 9/14 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 10/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1', 'BILL_AMT2'] Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1, 'PAY_AMT4': -1, 'PAY_AMT3': -1, 'PAY_AMT5': -1, 'PAY_AMT6': -1, 'BILL_AMT1': -1, 'BILL_AMT2': -1} Completed grid search 10/14 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 11/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3'] Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1, 'PAY_AMT4': -1, 'PAY_AMT3': -1, 'PAY_AMT5': -1, 'PAY_AMT6': -1, 'BILL_AMT1': -1, 'BILL_AMT2': -1, 'BILL_AMT3': -1} Completed grid search 11/14 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 12/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4'] Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1, 'PAY_AMT4': -1, 'PAY_AMT3': -1, 'PAY_AMT5': -1, 'PAY_AMT6': -1, 'BILL_AMT1': -1, 'BILL_AMT2': -1, 'BILL_AMT3': -1, 'BILL_AMT4': -1} Completed grid search 12/14 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 13/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5'] Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1, 'PAY_AMT4': -1, 'PAY_AMT3': -1, 'PAY_AMT5': -1, 'PAY_AMT6': -1, 'BILL_AMT1': -1, 'BILL_AMT2': -1, 'BILL_AMT3': -1, 'BILL_AMT4': -1, 'BILL_AMT5': -1} Completed grid search 13/14 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 14/14 ... 
Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6'] Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1, 'PAY_AMT4': -1, 'PAY_AMT3': -1, 'PAY_AMT5': -1, 'PAY_AMT6': -1, 'BILL_AMT1': -1, 'BILL_AMT2': -1, 'BILL_AMT3': -1, 'BILL_AMT4': -1, 'BILL_AMT5': -1, 'BILL_AMT6': -1} Completed grid search 14/14 with AUC: 0.78 ... -------------------------------------------------------------------------------- Done. Task completed in 674.86 s.
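auto_ph.gbm_forward_select_train wraps a sequence of H2O GBM grid searches, adding one candidate feature from next_list at each step; with monotone=True it also constrains each feature's direction of effect, +1 for features positively correlated with the target and -1 for negatively correlated ones, as the log above shows. A minimal sketch of a single constrained model (an assumption about the helper, not its actual code; monotone_constraints is a standard H2O GBM parameter that takes a dict of column name to +1/-1):
from h2o.estimators.gbm import H2OGradientBoostingEstimator
htrain, hvalid = h2o.H2OFrame(train), h2o.H2OFrame(valid)
htrain[y_name] = htrain[y_name].asfactor()  # classification requires a factor target
hvalid[y_name] = hvalid[y_name].asfactor()
mono = {'PAY_0': 1, 'PAY_2': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1}  # signs follow the Pearson correlations
mgbm_sketch = H2OGradientBoostingEstimator(ntrees=200, max_depth=3, stopping_rounds=5,
                                           monotone_constraints=mono, seed=SEED)
mgbm_sketch.train(x=list(mono.keys()), y=y_name, training_frame=htrain, validation_frame=hvalid)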
auto_ph.plot_coefs(corr_glm_mgbm_shap_coefs,
mgbm_models,
'MGBM',
['Absolute Pearson Correlation Coefficient',
'Monotonic GBM Mean SHAP Value',
'Absolute Penalized GLM Contribution'])
# auto_ph cv_model_rank_select function
# requires models to have a model_id
best_glm.model_id = 'best_glm'
compare_model_ids = ['best_glm'] # list of model_ids
# start local timer
tic = time.time()
# perform CV rank model selection
mgbm_rank_results = auto_ph.cv_model_rank_select(valid,
SEED,
mgbm_train_results,
'mgbm',
compare_model_ids)
best_mgbm = mgbm_rank_results['BEST_MODEL']
best_mgbm_shap = mgbm_rank_results['BEST_LOCAL_COEFS']
mgbm_selected_coefs = mgbm_rank_results['BEST_GLOBAL_COEFS']
best_mgbm_eval = mgbm_rank_results['METRICS']
# end local timer
toc = time.time()-tic
print('Task completed in %.2f s.' % (toc))
Evaluated model 1/14 with rank: 1.20* ... Evaluated model 2/14 with rank: 1.08* ... Evaluated model 3/14 with rank: 1.08 ... Evaluated model 4/14 with rank: 1.04* ... Evaluated model 5/14 with rank: 1.00* ... Evaluated model 6/14 with rank: 1.00 ... Evaluated model 7/14 with rank: 1.04 ... Evaluated model 8/14 with rank: 1.06 ... Evaluated model 9/14 with rank: 1.08 ... Evaluated model 10/14 with rank: 1.04 ... Evaluated model 11/14 with rank: 1.04 ... Evaluated model 12/14 with rank: 1.04 ... Evaluated model 13/14 with rank: 1.04 ... Evaluated model 14/14 with rank: 1.08 ... Done. Task completed in 341.94 s.
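auto_ph.cv_model_rank_select appears to score the benchmark GLM and each forward-selection GBM on several metrics (F1, accuracy, AUC, logloss, and MCC) over random folds of the validation data, rank the models within each fold and metric, and keep the candidate with the lowest mean rank; the per-fold table below shows the result for the selected model. A toy sketch of that ranking step, using rounded values from the table below (an assumption about the helper's logic, not its actual code):
ex = pd.DataFrame({'best_glm': [0.739, 0.469, 0.730, 0.466],
                   'mgbm5':    [0.776, 0.441, 0.777, 0.434]},
                  index=['fold0_auc', 'fold0_logloss', 'fold1_auc', 'fold1_logloss'])
ranks = ex.rank(axis=1, ascending=False)  # rank 1 = best; higher is better by default
lower_is_better = ['fold0_logloss', 'fold1_logloss']
ranks.loc[lower_is_better] = ex.loc[lower_is_better].rank(axis=1)  # flip direction for logloss
print(ranks.mean())  # the model with the lowest mean rank (here mgbm5) is selected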
best_mgbm
Model Details ============= H2OGradientBoostingEstimator : Gradient Boosting Machine Model Key: mgbm5 Model Summary:
number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves | ||
---|---|---|---|---|---|---|---|---|---|---|
0 | 46.0 | 46.0 | 6940.0 | 3.0 | 3.0 | 3.0 | 5.0 | 8.0 | 7.369565 |
ModelMetricsBinomial: gbm ** Reported on train data. ** MSE: 0.13637719864300343 RMSE: 0.3692928358945018 LogLoss: 0.4351274080189972 Mean Per-Class Error: 0.2913939696264273 AUC: 0.7716491282246187 pr_auc: 0.5471826859054356 Gini: 0.5432982564492375 Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.21968260039166268:
0 | 1 | Error | Rate | ||
---|---|---|---|---|---|
0 | 0 | 13482.0 | 2814.0 | 0.1727 | (2814.0/16296.0) |
1 | 1 | 1907.0 | 2743.0 | 0.4101 | (1907.0/4650.0) |
2 | Total | 15389.0 | 5557.0 | 0.2254 | (4721.0/20946.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
metric | threshold | value | idx | |
---|---|---|---|---|
0 | max f1 | 0.219683 | 0.537474 | 248.0 |
1 | max f2 | 0.127859 | 0.630227 | 329.0 |
2 | max f0point5 | 0.446699 | 0.583033 | 147.0 |
3 | max accuracy | 0.446699 | 0.821493 | 147.0 |
4 | max precision | 0.950247 | 1.000000 | 0.0 |
5 | max recall | 0.050609 | 1.000000 | 395.0 |
6 | max specificity | 0.950247 | 1.000000 | 0.0 |
7 | max absolute_mcc | 0.325159 | 0.413494 | 194.0 |
8 | max min_per_class_accuracy | 0.177542 | 0.698495 | 281.0 |
9 | max mean_per_class_accuracy | 0.219683 | 0.708606 | 248.0 |
Gains/Lift Table: Avg response rate: 22.20 %, avg score: 22.00 %
group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.010074 | 0.813927 | 3.607883 | 3.607883 | 0.800948 | 0.843446 | 0.800948 | 0.843446 | 0.036344 | 0.036344 | 260.788259 | 260.788259 | |
1 | 2 | 0.020338 | 0.795575 | 3.519808 | 3.563432 | 0.781395 | 0.805153 | 0.791080 | 0.824119 | 0.036129 | 0.072473 | 251.980795 | 256.343177 | |
2 | 3 | 0.030316 | 0.763679 | 3.405328 | 3.511394 | 0.755981 | 0.783970 | 0.779528 | 0.810905 | 0.033978 | 0.106452 | 240.532798 | 251.139446 | |
3 | 4 | 0.040008 | 0.715138 | 3.261891 | 3.450954 | 0.724138 | 0.739815 | 0.766110 | 0.793684 | 0.031613 | 0.138065 | 226.189099 | 245.095388 | |
4 | 5 | 0.050081 | 0.664416 | 3.116869 | 3.383755 | 0.691943 | 0.686695 | 0.751192 | 0.772164 | 0.031398 | 0.169462 | 211.686898 | 238.375473 | |
5 | 6 | 0.100019 | 0.543384 | 2.859463 | 3.121984 | 0.634799 | 0.601794 | 0.693079 | 0.687101 | 0.142796 | 0.312258 | 185.946339 | 212.198445 | |
6 | 7 | 0.150005 | 0.366237 | 2.224293 | 2.822849 | 0.493792 | 0.446951 | 0.626671 | 0.607076 | 0.111183 | 0.423441 | 122.429306 | 182.284922 | |
7 | 8 | 0.205672 | 0.292765 | 1.595510 | 2.490659 | 0.354202 | 0.312777 | 0.552925 | 0.527422 | 0.088817 | 0.512258 | 59.551043 | 149.065864 | |
8 | 9 | 0.301251 | 0.196648 | 1.174504 | 2.073077 | 0.260739 | 0.234499 | 0.460222 | 0.434485 | 0.112258 | 0.624516 | 17.450421 | 107.307684 | |
9 | 10 | 0.400029 | 0.173817 | 0.864327 | 1.774604 | 0.191880 | 0.184844 | 0.393961 | 0.372842 | 0.085376 | 0.709892 | -13.567284 | 77.460410 | |
10 | 11 | 0.500286 | 0.151431 | 0.701418 | 1.559537 | 0.155714 | 0.161335 | 0.346216 | 0.330455 | 0.070323 | 0.780215 | -29.858249 | 55.953665 | |
11 | 12 | 0.600306 | 0.131214 | 0.619237 | 1.402870 | 0.137470 | 0.140709 | 0.311436 | 0.298841 | 0.061935 | 0.842151 | -38.076342 | 40.286982 | |
12 | 13 | 0.700659 | 0.114794 | 0.559314 | 1.282050 | 0.124167 | 0.122817 | 0.284614 | 0.273630 | 0.056129 | 0.898280 | -44.068568 | 28.204987 | |
13 | 14 | 0.800821 | 0.102226 | 0.369293 | 1.167887 | 0.081983 | 0.108062 | 0.259270 | 0.252921 | 0.036989 | 0.935269 | -63.070697 | 16.788724 | |
14 | 15 | 0.904564 | 0.091861 | 0.402152 | 1.080066 | 0.089277 | 0.097524 | 0.239774 | 0.235099 | 0.041720 | 0.976989 | -59.784808 | 8.006633 | |
15 | 16 | 1.000000 | 0.034810 | 0.241112 | 1.000000 | 0.053527 | 0.076989 | 0.221999 | 0.220010 | 0.023011 | 1.000000 | -75.888783 | 0.000000 |
ModelMetricsBinomial: gbm ** Reported on validation data. ** MSE: 0.13326994104124376 RMSE: 0.3650615578792757 LogLoss: 0.4278285715046422 Mean Per-Class Error: 0.2856607030196092 AUC: 0.7776380047998697 pr_auc: 0.5486322626112021 Gini: 0.5552760095997393 Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.27397344199105433:
0 | 1 | Error | Rate | ||
---|---|---|---|---|---|
0 | 0 | 6093.0 | 975.0 | 0.1379 | (975.0/7068.0) |
1 | 1 | 863.0 | 1123.0 | 0.4345 | (863.0/1986.0) |
2 | Total | 6956.0 | 2098.0 | 0.203 | (1838.0/9054.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
metric | threshold | value | idx | |
---|---|---|---|---|
0 | max f1 | 0.273973 | 0.549951 | 217.0 |
1 | max f2 | 0.147835 | 0.634488 | 307.0 |
2 | max f0point5 | 0.436620 | 0.590736 | 153.0 |
3 | max accuracy | 0.456963 | 0.825271 | 147.0 |
4 | max precision | 0.947069 | 1.000000 | 0.0 |
5 | max recall | 0.045106 | 1.000000 | 397.0 |
6 | max specificity | 0.947069 | 1.000000 | 0.0 |
7 | max absolute_mcc | 0.347246 | 0.429999 | 184.0 |
8 | max min_per_class_accuracy | 0.181585 | 0.709970 | 275.0 |
9 | max mean_per_class_accuracy | 0.230518 | 0.714339 | 240.0 |
Gains/Lift Table: Avg response rate: 21.94 %, avg score: 22.52 %
group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.011155 | 0.815010 | 3.295055 | 3.295055 | 0.722772 | 0.839858 | 0.722772 | 0.839858 | 0.036757 | 0.036757 | 229.505549 | 229.505549 | |
1 | 2 | 0.020543 | 0.795575 | 3.700764 | 3.480460 | 0.811765 | 0.805631 | 0.763441 | 0.824217 | 0.034743 | 0.071501 | 270.076417 | 248.045999 | |
2 | 3 | 0.030042 | 0.783550 | 3.604721 | 3.519749 | 0.790698 | 0.792441 | 0.772059 | 0.814170 | 0.034240 | 0.105740 | 260.472142 | 251.974853 | |
3 | 4 | 0.040093 | 0.743192 | 3.005876 | 3.390927 | 0.659341 | 0.761335 | 0.743802 | 0.800925 | 0.030211 | 0.135952 | 200.587630 | 239.092657 | |
4 | 5 | 0.050033 | 0.697702 | 3.444512 | 3.401573 | 0.755556 | 0.723091 | 0.746137 | 0.785461 | 0.034240 | 0.170191 | 244.451158 | 240.157260 | |
5 | 6 | 0.101281 | 0.553193 | 3.104777 | 3.251394 | 0.681034 | 0.614736 | 0.713195 | 0.699075 | 0.159114 | 0.329305 | 210.477654 | 225.139444 | |
6 | 7 | 0.150320 | 0.383564 | 2.187046 | 2.904171 | 0.479730 | 0.466067 | 0.637032 | 0.623061 | 0.107251 | 0.436556 | 118.704581 | 190.417123 | |
7 | 8 | 0.200022 | 0.296915 | 1.580423 | 2.575244 | 0.346667 | 0.327817 | 0.564881 | 0.549698 | 0.078550 | 0.515106 | 58.042296 | 157.524427 | |
8 | 9 | 0.301303 | 0.203539 | 1.133514 | 2.090616 | 0.248637 | 0.250648 | 0.458578 | 0.449174 | 0.114804 | 0.629909 | 13.351366 | 109.061561 | |
9 | 10 | 0.403468 | 0.176970 | 0.961068 | 1.804595 | 0.210811 | 0.187190 | 0.395839 | 0.382836 | 0.098187 | 0.728097 | -3.893198 | 80.459549 | |
10 | 11 | 0.500221 | 0.152028 | 0.655734 | 1.582382 | 0.143836 | 0.163566 | 0.347096 | 0.340424 | 0.063444 | 0.791541 | -34.426603 | 58.238248 | |
11 | 12 | 0.599956 | 0.133009 | 0.555349 | 1.411651 | 0.121816 | 0.141651 | 0.309647 | 0.307381 | 0.055388 | 0.846928 | -44.465076 | 41.165144 | |
12 | 13 | 0.702231 | 0.115062 | 0.492323 | 1.277757 | 0.107991 | 0.123549 | 0.280277 | 0.280607 | 0.050352 | 0.897281 | -50.767685 | 27.775745 | |
13 | 14 | 0.801966 | 0.102380 | 0.353404 | 1.162802 | 0.077519 | 0.107834 | 0.255061 | 0.259121 | 0.035247 | 0.932528 | -64.659594 | 16.280206 | |
14 | 15 | 0.905346 | 0.091861 | 0.379909 | 1.073405 | 0.083333 | 0.097585 | 0.235452 | 0.240675 | 0.039275 | 0.971803 | -62.009063 | 7.340501 | |
15 | 16 | 1.000000 | 0.034810 | 0.297899 | 1.000000 | 0.065344 | 0.076884 | 0.219351 | 0.225172 | 0.028197 | 1.000000 | -70.210141 | 0.000000 |
Scoring History:
timestamp | duration | number_of_trees | training_rmse | training_logloss | training_auc | training_pr_auc | training_lift | training_classification_error | validation_rmse | validation_logloss | validation_auc | validation_pr_auc | validation_lift | validation_classification_error | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2023-02-22 17:13:51 | 39.263 sec | 0.0 | 0.415591 | 0.529427 | 0.500000 | 0.000000 | 1.000000 | 0.778001 | 0.413815 | 0.526105 | 0.500000 | 0.000000 | 1.000000 | 0.780649 | |
1 | 2023-02-22 17:13:51 | 39.290 sec | 1.0 | 0.407822 | 0.511864 | 0.716131 | 0.534717 | 3.474912 | 0.236370 | 0.405538 | 0.507496 | 0.726731 | 0.537125 | 3.444264 | 0.187652 | |
2 | 2023-02-22 17:13:51 | 39.312 sec | 2.0 | 0.401483 | 0.498746 | 0.744646 | 0.532172 | 3.529706 | 0.228731 | 0.398808 | 0.493698 | 0.752909 | 0.534588 | 3.422307 | 0.232825 | |
3 | 2023-02-22 17:13:51 | 39.337 sec | 3.0 | 0.396471 | 0.489013 | 0.748189 | 0.535621 | 3.529706 | 0.228636 | 0.393394 | 0.483273 | 0.756448 | 0.535692 | 3.422307 | 0.214491 | |
4 | 2023-02-22 17:13:51 | 39.360 sec | 4.0 | 0.392442 | 0.481430 | 0.750121 | 0.535358 | 3.529706 | 0.210780 | 0.389030 | 0.475135 | 0.758511 | 0.536095 | 3.422307 | 0.217915 | |
5 | 2023-02-22 17:13:51 | 39.385 sec | 5.0 | 0.389141 | 0.475375 | 0.750058 | 0.535198 | 3.529706 | 0.245059 | 0.385453 | 0.468630 | 0.758505 | 0.535659 | 3.422307 | 0.214270 | |
6 | 2023-02-22 17:13:51 | 39.407 sec | 6.0 | 0.386399 | 0.470332 | 0.756986 | 0.535024 | 3.529706 | 0.243961 | 0.382447 | 0.463157 | 0.764722 | 0.536039 | 3.422307 | 0.229843 | |
7 | 2023-02-22 17:13:51 | 39.429 sec | 7.0 | 0.384191 | 0.466316 | 0.757005 | 0.535418 | 3.529706 | 0.243961 | 0.380045 | 0.458834 | 0.764634 | 0.536411 | 3.422307 | 0.220013 | |
8 | 2023-02-22 17:13:51 | 39.455 sec | 8.0 | 0.382341 | 0.462760 | 0.761106 | 0.540176 | 3.514359 | 0.247446 | 0.378063 | 0.455049 | 0.770340 | 0.542043 | 3.457524 | 0.204330 | |
9 | 2023-02-22 17:13:51 | 39.481 sec | 9.0 | 0.380701 | 0.459589 | 0.762515 | 0.540880 | 3.518279 | 0.235654 | 0.376184 | 0.451464 | 0.772358 | 0.543522 | 3.457524 | 0.223548 | |
10 | 2023-02-22 17:13:51 | 39.510 sec | 10.0 | 0.379202 | 0.456705 | 0.762522 | 0.541424 | 3.518279 | 0.235606 | 0.374583 | 0.448380 | 0.772982 | 0.543893 | 3.457524 | 0.226309 | |
11 | 2023-02-22 17:13:51 | 39.541 sec | 11.0 | 0.378052 | 0.454467 | 0.761648 | 0.541505 | 3.521332 | 0.231023 | 0.373354 | 0.445973 | 0.772925 | 0.544553 | 3.460882 | 0.228960 | |
12 | 2023-02-22 17:13:51 | 39.573 sec | 12.0 | 0.377043 | 0.452420 | 0.762767 | 0.541658 | 3.521332 | 0.229972 | 0.372199 | 0.443670 | 0.773412 | 0.543195 | 3.460882 | 0.224542 | |
13 | 2023-02-22 17:13:51 | 39.602 sec | 13.0 | 0.376137 | 0.450517 | 0.764795 | 0.543264 | 3.525899 | 0.234317 | 0.371369 | 0.441932 | 0.774161 | 0.543632 | 3.448038 | 0.227413 | |
14 | 2023-02-22 17:13:51 | 39.632 sec | 14.0 | 0.375357 | 0.448963 | 0.765145 | 0.543113 | 3.525899 | 0.235654 | 0.370549 | 0.440335 | 0.774176 | 0.543202 | 3.448038 | 0.228076 | |
15 | 2023-02-22 17:13:51 | 39.665 sec | 15.0 | 0.374699 | 0.447543 | 0.766118 | 0.544037 | 3.528417 | 0.233219 | 0.369999 | 0.439161 | 0.774592 | 0.543709 | 3.448038 | 0.228297 | |
16 | 2023-02-22 17:13:52 | 39.702 sec | 16.0 | 0.374098 | 0.446341 | 0.766529 | 0.543896 | 3.560713 | 0.229161 | 0.369390 | 0.437926 | 0.775021 | 0.544851 | 3.424855 | 0.226751 | |
17 | 2023-02-22 17:13:52 | 39.752 sec | 17.0 | 0.373534 | 0.445115 | 0.766312 | 0.544208 | 3.568370 | 0.231452 | 0.368810 | 0.436669 | 0.774927 | 0.545957 | 3.442929 | 0.225425 | |
18 | 2023-02-22 17:13:52 | 39.803 sec | 18.0 | 0.373121 | 0.444171 | 0.766785 | 0.544720 | 3.568370 | 0.229352 | 0.368496 | 0.435909 | 0.775256 | 0.545586 | 3.442929 | 0.226530 | |
19 | 2023-02-22 17:13:52 | 39.845 sec | 19.0 | 0.372722 | 0.443360 | 0.767145 | 0.545059 | 3.568370 | 0.226439 | 0.368047 | 0.435006 | 0.775474 | 0.545922 | 3.442929 | 0.224652 |
See the whole table with table.as_data_frame()
Variable Importances:
variable | relative_importance | scaled_importance | percentage | |
---|---|---|---|---|
0 | PAY_0 | 2794.444824 | 1.000000 | 0.693347 |
1 | PAY_2 | 307.237366 | 0.109946 | 0.076231 |
2 | PAY_3 | 215.152893 | 0.076993 | 0.053383 |
3 | PAY_4 | 155.434448 | 0.055623 | 0.038566 |
4 | PAY_AMT1 | 127.986313 | 0.045800 | 0.031755 |
5 | PAY_5 | 127.538628 | 0.045640 | 0.031644 |
6 | PAY_6 | 102.351601 | 0.036627 | 0.025395 |
7 | LIMIT_BAL | 82.432350 | 0.029499 | 0.020453 |
8 | PAY_AMT2 | 58.934135 | 0.021090 | 0.014623 |
9 | PAY_AMT4 | 58.858047 | 0.021063 | 0.014604 |
best_mgbm_eval
Fold | Metric | best_glm Value | mgbm5 Value | best_glm Rank | mgbm5 Rank | |
---|---|---|---|---|---|---|
0 | 0 | F1 | 0.533181 | 0.551298 | 2.0 | 1.0 |
1 | 0 | accuracy | 0.816246 | 0.817367 | 2.0 | 1.0 |
2 | 0 | auc | 0.738625 | 0.776026 | 2.0 | 1.0 |
3 | 0 | logloss | 0.468678 | 0.440775 | 2.0 | 1.0 |
4 | 0 | mcc | 0.419924 | 0.420105 | 2.0 | 1.0 |
5 | 1 | F1 | 0.540865 | 0.554762 | 2.0 | 1.0 |
6 | 1 | accuracy | 0.823882 | 0.826063 | 2.0 | 1.0 |
7 | 1 | auc | 0.729674 | 0.776877 | 2.0 | 1.0 |
8 | 1 | logloss | 0.465999 | 0.434170 | 2.0 | 1.0 |
9 | 1 | mcc | 0.432722 | 0.445354 | 2.0 | 1.0 |
10 | 2 | F1 | 0.500593 | 0.516364 | 2.0 | 1.0 |
11 | 2 | accuracy | 0.830907 | 0.833707 | 2.0 | 1.0 |
12 | 2 | auc | 0.707507 | 0.760838 | 2.0 | 1.0 |
13 | 2 | logloss | 0.459017 | 0.420930 | 2.0 | 1.0 |
14 | 2 | mcc | 0.395476 | 0.409254 | 2.0 | 1.0 |
15 | 3 | F1 | 0.531328 | 0.550251 | 2.0 | 1.0 |
16 | 3 | accuracy | 0.834578 | 0.836713 | 2.0 | 1.0 |
17 | 3 | auc | 0.733536 | 0.782795 | 2.0 | 1.0 |
18 | 3 | logloss | 0.448512 | 0.416031 | 2.0 | 1.0 |
19 | 3 | mcc | 0.426465 | 0.443411 | 2.0 | 1.0 |
20 | 4 | F1 | 0.561589 | 0.587666 | 2.0 | 1.0 |
21 | 4 | accuracy | 0.820845 | 0.829296 | 2.0 | 1.0 |
22 | 4 | auc | 0.740132 | 0.790550 | 2.0 | 1.0 |
23 | 4 | logloss | 0.467037 | 0.427654 | 2.0 | 1.0 |
24 | 4 | mcc | 0.448005 | 0.463860 | 2.0 | 1.0 |
print('Best GLM mean rank:', best_mgbm_eval['best_glm Rank'].mean())
print('Best MGBM mean rank:', best_mgbm_eval['mgbm5 Rank'].mean())
Best GLM mean rank: 2.0 Best MGBM mean rank: 1.0
# init dict to hold partial dependence and ICE values
# for each feature
# for mgbm
mgbm_pd_ice_dict = {}
# establish mgbm selected features
mgbm_selected = list(mgbm_selected_coefs[mgbm_selected_coefs['Monotonic GBM Mean SHAP Value'] != 0].index)
# calculate partial dependence for each selected feature
for xs in mgbm_selected:
mgbm_pd_ice_dict[xs] = auto_ph.pd_ice(xs, valid, best_mgbm)
# merge MGBM predictions onto test data
mgbm_yhat_valid = pd.concat([valid.reset_index(drop=True),
best_mgbm.predict(h2o.H2OFrame(valid))['p1'].as_data_frame()],
axis=1)
# rename yhat column
mgbm_yhat_valid = mgbm_yhat_valid.rename(columns={'p1':'p_DEFAULT_NEXT_MONTH'})
# find percentiles of predictions
mgbm_percentile_dict = auto_ph.get_percentile_dict('p_DEFAULT_NEXT_MONTH', mgbm_yhat_valid, 'ID')
# display percentiles dictionary
# key=percentile, val=row_id
mgbm_percentile_dict
{0: 6593, 99: 17863, 10: 23504, 20: 15919, 30: 1919, 40: 22205, 50: 12297, 60: 1847, 70: 12644, 80: 29479, 90: 7682}
# loop through selected variables
for xs in mgbm_selected:
# collect bins used in partial dependence
bins = list(mgbm_pd_ice_dict[xs][xs])
# calculate ICE at percentiles
# using partial dependence bins
# for each selected feature
for i in sorted(mgbm_percentile_dict.keys()):
col_name = 'Percentile_' + str(i)
mgbm_pd_ice_dict[xs][col_name] = auto_ph.pd_ice(xs,
valid[valid['ID'] == int(mgbm_percentile_dict[i])][mgbm_selected],
best_mgbm,
bins=bins)['partial_dependence']
for xs in mgbm_selected:
auto_ph.hist_mean_pd_ice_plot(xs, y_name, valid, mgbm_pd_ice_dict)
Create a mapping between ID and Shapley value array indices
valid_idx_map = valid['ID'].copy(deep=True)
valid_idx_map.reset_index(drop=True, inplace=True)
Local MGBM Shapley contributions at percentiles of p_DEFAULT_NEXT_MONTH
next_vars = int(''.join([c for c in best_mgbm.model_id if c.isdigit()])) - 1 # number of features added from next_list, recovered from the model_id (e.g. 'mgbm5' -> 4)
for percentile in [10, 50, 90]:
idx = valid_idx_map[valid_idx_map == int(glm_percentile_dict[percentile])].index[0]
s_df = pd.DataFrame(best_mgbm_shap[idx, :].T, columns=['Monotonic GBM SHAP Value'], index=glm_selected + next_list[:next_vars])
local_coef_dict[percentile]['Monotonic GBM SHAP Value'] = 0
local_coef_dict[percentile].update(s_df)
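The best_mgbm_shap array used here was produced upstream by auto_ph during forward selection. For H2O tree models, per-row SHAP values like these can also be obtained directly; recent H2O versions expose predict_contributions, so the line below is a sketch rather than the notebook's exact path:
contribs = best_mgbm.predict_contributions(h2o.H2OFrame(valid)).as_data_frame()
# one column per input feature plus a BiasTerm column; each row sums to the raw (logit) prediction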
Plot local MGBM contributions at percentiles of p_DEFAULT_NEXT_MONTH
fig, (ax0, ax1, ax2) = plt.subplots(ncols=3, sharey=True)
plt.tight_layout()
plt.subplots_adjust(left=0, right=2, wspace=0.1)
_ = local_coef_dict[10].plot(kind='bar', colormap='cool', ax=ax0, edgecolor=['black']*len(data[x_names + [y_name]]),
title='10th PCTL of p_DEFAULT_NEXT_MONTH')
_ = local_coef_dict[50].plot(kind='bar', colormap='cool', ax=ax1, edgecolor=['black']*len(data[x_names + [y_name]]),
title='50th PCTL of p_DEFAULT_NEXT_MONTH')
_ = local_coef_dict[90].plot(kind='bar', colormap='cool', ax=ax2, edgecolor=['black']*len(data[x_names + [y_name]]),
title='90th PCTL of p_DEFAULT_NEXT_MONTH')
Standardized mean difference between SEX = male and SEX = female
print('Standardized mean difference: %.2f' % auto_ph.smd(mgbm_yhat_valid, 'SEX', 'p_DEFAULT_NEXT_MONTH', 'male', 'female'))
Male mean yhat: 0.23 Female mean yhat: 0.22 P_Default_Next_Month std. dev.: 0.19 Standardized mean difference: -0.07
best_mgbm_cut = best_mgbm.mcc(valid=True)[0][0]
best_mgbm_cut
0.34724568061915234
mgbm_male_cm = auto_ph.get_confusion_matrix(mgbm_yhat_valid, y_name, 'p_DEFAULT_NEXT_MONTH', by='SEX',
level='male', cutoff=best_mgbm_cut)
mgbm_female_cm = auto_ph.get_confusion_matrix(mgbm_yhat_valid, y_name, 'p_DEFAULT_NEXT_MONTH', by='SEX',
level='female', cutoff=best_mgbm_cut)
mgbm_cm_dict = {'male': mgbm_male_cm, 'female': mgbm_female_cm}
Confusion matrix for SEX = male
mgbm_male_cm
actual: 1 | actual: 0 | |
---|---|---|
predicted: 1 | 405 | 225 |
predicted: 0 | 438 | 2524 |
Confusion matrix for SEX = female
mgbm_female_cm
actual: 1 | actual: 0 | |
---|---|---|
predicted: 1 | 513 | 330 |
predicted: 0 | 630 | 3989 |
Adverse impact ratio between SEX = male and SEX = female
print('Adverse impact ratio: %.2f' % auto_ph.air(mgbm_cm_dict, 'male', 'female'))
Male proportion accepted: 0.825 Female proportion accepted: 0.846 Adverse impact ratio: 1.03
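The adverse impact ratio divides the proportion of the protected group that is accepted (here, predicted: 0, i.e. not expected to default) by the same proportion for the reference group. Working it out from the two confusion matrices above: male acceptance = (438 + 2524) / 3592 ≈ 0.825, female acceptance = (630 + 3989) / 5462 ≈ 0.846, and AIR ≈ 0.846 / 0.825 ≈ 1.03, comfortably above the 0.8 value often used as a practical lower bound for this ratio.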
Marginal effect between SEX = male and SEX = female
print('Marginal effect: %.2f%%' % auto_ph.marginal_effect(mgbm_cm_dict, 'male', 'female'))
Male accepted: 82.46% Female accepted: 84.57% Marginal effect: -2.11%
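Marginal effect here is simply the difference in acceptance rates, the male (reference) rate minus the female (protected) rate: 82.46% - 84.57% = -2.11%. A negative value means women are accepted slightly more often than men at this cutoff, consistent with the adverse impact ratio above.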
mgbm_cm = auto_ph.get_confusion_matrix(mgbm_yhat_valid, y_name, 'p_DEFAULT_NEXT_MONTH', cutoff=best_mgbm_cut)
mgbm_cm
actual: 1 | actual: 0 | |
---|---|---|
predicted: 1 | 918 | 555 |
predicted: 0 | 1068 | 6513 |
mgbm_business_impact = mgbm_cm.iloc[0, 0]*TRUE_POSITIVE_AMOUNT +\
mgbm_cm.iloc[0, 1]*FALSE_POSITIVE_AMOUNT +\
mgbm_cm.iloc[1, 0]*FALSE_NEGATIVE_AMOUNT +\
mgbm_cm.iloc[1, 1]*TRUE_NEGATIVE_AMOUNT
print('Estimated business impact $%.2f' % mgbm_business_impact)
Estimated business impact $12360000.00
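The estimated business impact is a weighted sum of the four confusion matrix cells using the revenue assumptions defined at the top of the notebook: 918 true positives × $0 + 555 false positives × (-$20,000) + 1,068 false negatives × (-$100,000) + 6,513 true negatives × $20,000 = -$11,100,000 - $106,800,000 + $130,260,000 = $12,360,000. Revenue from accepted paying customers more than offsets losses from accepted defaulters and rejected good customers.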
# create a DataFrame to store new GBM SHAP values
# for comparison to correlation, penalized glm, and mgbm coefficients
abs_corr_glm_mgbm_gbm_shap = abs_corr_glm_mgbm_shap.copy(deep=True)
abs_corr_glm_mgbm_gbm_shap['GBM Mean SHAP Value'] = 0
# start local timer
tic = time.time()
# stepwise GBM training
gbm_train_results = auto_ph.gbm_forward_select_train(glm_selected,
y_name,
train,
valid,
SEED,
next_list,
abs_corr_glm_mgbm_gbm_shap,
'GBM Mean SHAP Value')
gbm_models = gbm_train_results['MODELS']
corr_glm_mgbm_gbm_shap_coefs = gbm_train_results['GLOBAL_COEFS']
gbm_shap = gbm_train_results['LOCAL_COEFS']
# end local timer
toc = time.time()-tic
print('Task completed in %.2f s.' % (toc))
Starting grid search 1/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'] Completed grid search 1/14 with AUC: 0.76 ... -------------------------------------------------------------------------------- Starting grid search 2/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL'] Completed grid search 2/14 with AUC: 0.77 ... -------------------------------------------------------------------------------- Starting grid search 3/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1'] Completed grid search 3/14 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 4/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2'] Completed grid search 4/14 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 5/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4'] Completed grid search 5/14 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 6/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3'] Completed grid search 6/14 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 7/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5'] Completed grid search 7/14 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 8/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6'] Completed grid search 8/14 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 9/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1'] Completed grid search 9/14 with AUC: 0.79 ... -------------------------------------------------------------------------------- Starting grid search 10/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1', 'BILL_AMT2'] Completed grid search 10/14 with AUC: 0.79 ... -------------------------------------------------------------------------------- Starting grid search 11/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3'] Completed grid search 11/14 with AUC: 0.79 ... -------------------------------------------------------------------------------- Starting grid search 12/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4'] Completed grid search 12/14 with AUC: 0.79 ... 
-------------------------------------------------------------------------------- Starting grid search 13/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5'] Completed grid search 13/14 with AUC: 0.79 ... -------------------------------------------------------------------------------- Starting grid search 14/14 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6'] Completed grid search 14/14 with AUC: 0.79 ... -------------------------------------------------------------------------------- Done. Task completed in 625.49 s.
auto_ph.plot_coefs(corr_glm_mgbm_gbm_shap_coefs,
gbm_models,
'GBM',
['Absolute Pearson Correlation Coefficient',
'GBM Mean SHAP Value',
'Monotonic GBM Mean SHAP Value',
'Absolute Penalized GLM Contribution'])
# auto_ph cv_model_rank_select function
# requires models to have model_id
best_mgbm.model_id = 'best_mgbm'
compare_model_ids = ['best_glm', 'best_mgbm']
# start local timer
tic = time.time()
# perform CV rank model selection
gbm_rank_results = auto_ph.cv_model_rank_select(valid,
SEED,
gbm_train_results,
'gbm',
compare_model_ids)
best_gbm = gbm_rank_results['BEST_MODEL']
best_gbm_shap = gbm_rank_results['BEST_LOCAL_COEFS']
gbm_selected_coefs = gbm_rank_results['BEST_GLOBAL_COEFS']
best_gbm_eval = gbm_rank_results['METRICS']
# end local timer
toc = time.time()-tic
print('Task completed in %.2f s.' % (toc))
Evaluated model 1/14 with rank: 1.94* ... Evaluated model 2/14 with rank: 1.54* ... Evaluated model 3/14 with rank: 1.44* ... Evaluated model 4/14 with rank: 1.36* ... Evaluated model 5/14 with rank: 1.48 ... Evaluated model 6/14 with rank: 1.38 ... Evaluated model 7/14 with rank: 1.50 ... Evaluated model 8/14 with rank: 1.30* ... Evaluated model 9/14 with rank: 1.20* ... Evaluated model 10/14 with rank: 1.20 ... Evaluated model 11/14 with rank: 1.16* ... Evaluated model 12/14 with rank: 1.32 ... Evaluated model 13/14 with rank: 1.26 ... Evaluated model 14/14 with rank: 1.22 ... Done. Task completed in 739.26 s.
best_gbm
Model Details ============= H2OGradientBoostingEstimator : Gradient Boosting Machine Model Key: gbm11 Model Summary:
number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves | ||
---|---|---|---|---|---|---|---|---|---|---|
0 | 44.0 | 44.0 | 17851.0 | 5.0 | 5.0 | 5.0 | 19.0 | 32.0 | 27.568182 |
ModelMetricsBinomial: gbm ** Reported on train data. ** MSE: 0.1297424233125429 RMSE: 0.3601977558405145 LogLoss: 0.41569809892760357 Mean Per-Class Error: 0.27659833404595624 AUC: 0.7998092479980574 pr_auc: 0.6005338055620879 Gini: 0.5996184959961148 Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.2649121913379078:
0 | 1 | Error | Rate | ||
---|---|---|---|---|---|
0 | 0 | 14006.0 | 2290.0 | 0.1405 | (2290.0/16296.0) |
1 | 1 | 1954.0 | 2696.0 | 0.4202 | (1954.0/4650.0) |
2 | Total | 15960.0 | 4986.0 | 0.2026 | (4244.0/20946.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
metric | threshold | value | idx | |
---|---|---|---|---|
0 | max f1 | 0.264912 | 0.559568 | 219.0 |
1 | max f2 | 0.127389 | 0.651895 | 320.0 |
2 | max f0point5 | 0.440340 | 0.599546 | 147.0 |
3 | max accuracy | 0.565283 | 0.825886 | 107.0 |
4 | max precision | 0.906543 | 1.000000 | 0.0 |
5 | max recall | 0.045607 | 1.000000 | 394.0 |
6 | max specificity | 0.906543 | 1.000000 | 0.0 |
7 | max absolute_mcc | 0.367664 | 0.439748 | 175.0 |
8 | max min_per_class_accuracy | 0.183067 | 0.714839 | 273.0 |
9 | max mean_per_class_accuracy | 0.219738 | 0.723402 | 246.0 |
Gains/Lift Table: Avg response rate: 22.20 %, avg score: 22.24 %
group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.010026 | 0.806812 | 4.225665 | 4.225665 | 0.938095 | 0.834493 | 0.938095 | 0.834493 | 0.042366 | 0.042366 | 322.566513 | 322.566513 | |
1 | 2 | 0.020004 | 0.772695 | 3.857935 | 4.042239 | 0.856459 | 0.787884 | 0.897375 | 0.811244 | 0.038495 | 0.080860 | 285.793487 | 304.223882 | |
2 | 3 | 0.030030 | 0.746236 | 3.775214 | 3.953089 | 0.838095 | 0.760955 | 0.877583 | 0.794454 | 0.037849 | 0.118710 | 277.521352 | 295.308888 | |
3 | 4 | 0.040008 | 0.722744 | 3.707066 | 3.891730 | 0.822967 | 0.733924 | 0.863962 | 0.779358 | 0.036989 | 0.155699 | 270.706591 | 289.172993 | |
4 | 5 | 0.050033 | 0.697132 | 3.346212 | 3.782418 | 0.742857 | 0.709277 | 0.839695 | 0.765315 | 0.033548 | 0.189247 | 234.621198 | 278.241812 | |
5 | 6 | 0.100019 | 0.564983 | 2.878244 | 3.330547 | 0.638968 | 0.630202 | 0.739379 | 0.697791 | 0.143871 | 0.333118 | 187.824383 | 233.054677 | |
6 | 7 | 0.150005 | 0.397339 | 2.103828 | 2.921771 | 0.467049 | 0.477395 | 0.648631 | 0.624349 | 0.105161 | 0.438280 | 110.382845 | 192.177081 | |
7 | 8 | 0.200038 | 0.314363 | 1.732176 | 2.624230 | 0.384542 | 0.352826 | 0.582578 | 0.556436 | 0.086667 | 0.524946 | 73.217557 | 162.423004 | |
8 | 9 | 0.300057 | 0.219055 | 1.225572 | 2.158011 | 0.272076 | 0.256601 | 0.479077 | 0.456491 | 0.122581 | 0.647527 | 22.557241 | 115.801083 | |
9 | 10 | 0.400029 | 0.175054 | 0.873368 | 1.836965 | 0.193887 | 0.195199 | 0.407805 | 0.391192 | 0.087312 | 0.734839 | -12.663154 | 83.696522 | |
10 | 11 | 0.500000 | 0.147656 | 0.748602 | 1.619355 | 0.166189 | 0.159996 | 0.359496 | 0.344966 | 0.074839 | 0.809677 | -25.139847 | 61.935484 | |
11 | 12 | 0.600019 | 0.127317 | 0.636438 | 1.455509 | 0.141289 | 0.137817 | 0.323122 | 0.310435 | 0.063656 | 0.873333 | -36.356240 | 45.550923 | |
12 | 13 | 0.699990 | 0.108453 | 0.509823 | 1.320448 | 0.113181 | 0.117489 | 0.293139 | 0.282879 | 0.050968 | 0.924301 | -49.017654 | 32.044812 | |
13 | 14 | 0.800010 | 0.092868 | 0.412824 | 1.206975 | 0.091647 | 0.100889 | 0.267948 | 0.260126 | 0.041290 | 0.965591 | -58.717561 | 20.697484 | |
14 | 15 | 0.899981 | 0.073586 | 0.217267 | 1.097036 | 0.048233 | 0.082605 | 0.243541 | 0.240407 | 0.021720 | 0.987312 | -78.273346 | 9.703642 | |
15 | 16 | 1.000000 | 0.030757 | 0.126857 | 1.000000 | 0.028162 | 0.060595 | 0.221999 | 0.222422 | 0.012688 | 1.000000 | -87.314251 | 0.000000 |
ModelMetricsBinomial: gbm ** Reported on validation data. ** MSE: 0.13197243166584371 RMSE: 0.3632801008393437 LogLoss: 0.4228548881641006 Mean Per-Class Error: 0.2799766019180101 AUC: 0.7859376487136042 pr_auc: 0.5604814720383086 Gini: 0.5718752974272083 Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.2836639879283261:
0 | 1 | Error | Rate | ||
---|---|---|---|---|---|
0 | 0 | 6087.0 | 981.0 | 0.1388 | (981.0/7068.0) |
1 | 1 | 841.0 | 1145.0 | 0.4235 | (841.0/1986.0) |
2 | Total | 6928.0 | 2126.0 | 0.2012 | (1822.0/9054.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
metric | threshold | value | idx | |
---|---|---|---|---|
0 | max f1 | 0.283664 | 0.556907 | 201.0 |
1 | max f2 | 0.136697 | 0.641801 | 303.0 |
2 | max f0point5 | 0.446778 | 0.591413 | 137.0 |
3 | max accuracy | 0.454787 | 0.825050 | 134.0 |
4 | max precision | 0.883619 | 1.000000 | 0.0 |
5 | max recall | 0.032253 | 1.000000 | 399.0 |
6 | max specificity | 0.883619 | 1.000000 | 0.0 |
7 | max absolute_mcc | 0.394464 | 0.432378 | 156.0 |
8 | max min_per_class_accuracy | 0.186254 | 0.711517 | 259.0 |
9 | max mean_per_class_accuracy | 0.270007 | 0.720023 | 207.0 |
Gains/Lift Table: Avg response rate: 21.94 %, avg score: 22.74 %
group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.010051 | 0.810161 | 3.707247 | 3.707247 | 0.813187 | 0.835473 | 0.813187 | 0.835473 | 0.037261 | 0.037261 | 270.724744 | 270.724744 | |
1 | 2 | 0.020102 | 0.779219 | 3.556954 | 3.632101 | 0.780220 | 0.795835 | 0.796703 | 0.815654 | 0.035750 | 0.073011 | 255.695362 | 263.210053 | |
2 | 3 | 0.030042 | 0.756505 | 3.596475 | 3.620313 | 0.788889 | 0.769156 | 0.794118 | 0.800269 | 0.035750 | 0.108761 | 259.647533 | 262.031278 | |
3 | 4 | 0.040093 | 0.737192 | 3.406660 | 3.566752 | 0.747253 | 0.746407 | 0.782369 | 0.786766 | 0.034240 | 0.143001 | 240.665981 | 256.675239 | |
4 | 5 | 0.050033 | 0.708051 | 3.241893 | 3.502211 | 0.711111 | 0.722268 | 0.768212 | 0.773952 | 0.032226 | 0.175227 | 224.189325 | 250.221084 | |
5 | 6 | 0.100066 | 0.580696 | 2.938637 | 3.220424 | 0.644592 | 0.641571 | 0.706402 | 0.707762 | 0.147029 | 0.322256 | 193.863668 | 222.042376 | |
6 | 7 | 0.149989 | 0.414005 | 2.339973 | 2.927372 | 0.513274 | 0.499570 | 0.642121 | 0.638467 | 0.116818 | 0.439074 | 133.997273 | 192.737231 | |
7 | 8 | 0.200353 | 0.322792 | 1.609616 | 2.596117 | 0.353070 | 0.368892 | 0.569460 | 0.570701 | 0.081067 | 0.520141 | 60.961600 | 159.611714 | |
8 | 9 | 0.299978 | 0.226130 | 1.223123 | 2.140137 | 0.268293 | 0.266316 | 0.469440 | 0.469613 | 0.121853 | 0.641994 | 22.312284 | 114.013744 | |
9 | 10 | 0.400044 | 0.178058 | 0.830266 | 1.812489 | 0.182119 | 0.201055 | 0.397570 | 0.402437 | 0.083082 | 0.725076 | -16.973450 | 81.248864 | |
10 | 11 | 0.500000 | 0.149613 | 0.735471 | 1.597180 | 0.161326 | 0.162219 | 0.350342 | 0.354414 | 0.073515 | 0.798590 | -26.452905 | 59.718026 | |
11 | 12 | 0.599956 | 0.128154 | 0.594422 | 1.430115 | 0.130387 | 0.139095 | 0.313697 | 0.318541 | 0.059416 | 0.858006 | -40.557827 | 43.011537 | |
12 | 13 | 0.700022 | 0.108786 | 0.483064 | 1.294737 | 0.105960 | 0.118260 | 0.284001 | 0.289911 | 0.048338 | 0.906344 | -51.693644 | 29.473687 | |
13 | 14 | 0.799978 | 0.092697 | 0.408035 | 1.183945 | 0.089503 | 0.101351 | 0.259699 | 0.266351 | 0.040785 | 0.947130 | -59.196475 | 18.394508 | |
14 | 15 | 0.899934 | 0.073773 | 0.312323 | 1.087134 | 0.068508 | 0.082437 | 0.238463 | 0.245924 | 0.031219 | 0.978348 | -68.767672 | 8.713387 | |
15 | 16 | 1.000000 | 0.030634 | 0.216372 | 1.000000 | 0.047461 | 0.060843 | 0.219351 | 0.227403 | 0.021652 | 1.000000 | -78.362778 | 0.000000 |
Scoring History:
timestamp | duration | number_of_trees | training_rmse | training_logloss | training_auc | training_pr_auc | training_lift | training_classification_error | validation_rmse | validation_logloss | validation_auc | validation_pr_auc | validation_lift | validation_classification_error | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2023-02-22 17:36:15 | 35.636 sec | 0.0 | 0.415591 | 0.529427 | 0.500000 | 0.000000 | 1.000000 | 0.778001 | 0.413815 | 0.526105 | 0.500000 | 0.000000 | 1.000000 | 0.780649 | |
1 | 2023-02-22 17:36:15 | 35.664 sec | 1.0 | 0.408371 | 0.513004 | 0.736348 | 0.508118 | 3.571084 | 0.228397 | 0.406217 | 0.508920 | 0.740898 | 0.506517 | 3.628522 | 0.229512 | |
2 | 2023-02-22 17:36:15 | 35.690 sec | 2.0 | 0.402294 | 0.500237 | 0.756976 | 0.530464 | 3.660533 | 0.221045 | 0.400137 | 0.496260 | 0.758433 | 0.522264 | 3.524932 | 0.221670 | |
3 | 2023-02-22 17:36:15 | 35.720 sec | 3.0 | 0.398340 | 0.492278 | 0.764069 | 0.532898 | 3.813125 | 0.221808 | 0.396285 | 0.488590 | 0.762731 | 0.519734 | 3.607052 | 0.229954 | |
4 | 2023-02-22 17:36:15 | 35.759 sec | 4.0 | 0.394118 | 0.483784 | 0.768168 | 0.540150 | 3.845837 | 0.220376 | 0.392061 | 0.480133 | 0.767698 | 0.525350 | 3.556954 | 0.228628 | |
5 | 2023-02-22 17:36:15 | 35.800 sec | 5.0 | 0.390081 | 0.475984 | 0.771084 | 0.543405 | 3.739598 | 0.218610 | 0.388102 | 0.472546 | 0.768725 | 0.528746 | 3.518291 | 0.216479 | |
6 | 2023-02-22 17:36:15 | 35.852 sec | 6.0 | 0.385902 | 0.467950 | 0.776430 | 0.553219 | 3.816326 | 0.213931 | 0.383888 | 0.464422 | 0.774865 | 0.535962 | 3.518291 | 0.216921 | |
7 | 2023-02-22 17:36:15 | 35.908 sec | 7.0 | 0.382504 | 0.461554 | 0.777344 | 0.557804 | 3.831896 | 0.209634 | 0.380625 | 0.458223 | 0.776065 | 0.541525 | 3.627522 | 0.208748 | |
8 | 2023-02-22 17:36:15 | 35.958 sec | 8.0 | 0.379618 | 0.456175 | 0.778102 | 0.561744 | 3.906760 | 0.210541 | 0.377821 | 0.453004 | 0.776150 | 0.545106 | 3.666951 | 0.209189 | |
9 | 2023-02-22 17:36:15 | 36.007 sec | 9.0 | 0.377172 | 0.451580 | 0.778836 | 0.564685 | 3.905203 | 0.205910 | 0.375476 | 0.448572 | 0.776942 | 0.546636 | 3.657149 | 0.214491 | |
10 | 2023-02-22 17:36:15 | 36.056 sec | 10.0 | 0.375774 | 0.448747 | 0.780541 | 0.565119 | 3.849675 | 0.200086 | 0.374165 | 0.445917 | 0.778120 | 0.547708 | 3.774583 | 0.205324 | |
11 | 2023-02-22 17:36:15 | 36.107 sec | 11.0 | 0.374535 | 0.446260 | 0.781854 | 0.566824 | 3.882464 | 0.203428 | 0.373194 | 0.443968 | 0.778366 | 0.548891 | 3.907639 | 0.204882 | |
12 | 2023-02-22 17:36:15 | 36.149 sec | 12.0 | 0.372964 | 0.443330 | 0.782210 | 0.568613 | 3.915140 | 0.202903 | 0.371616 | 0.440994 | 0.779242 | 0.550633 | 3.743107 | 0.200022 | |
13 | 2023-02-22 17:36:15 | 36.201 sec | 13.0 | 0.371596 | 0.440754 | 0.782722 | 0.570082 | 3.946814 | 0.203571 | 0.370409 | 0.438669 | 0.779670 | 0.551838 | 3.757345 | 0.200795 | |
14 | 2023-02-22 17:36:15 | 36.247 sec | 14.0 | 0.370411 | 0.438517 | 0.783312 | 0.572051 | 4.011164 | 0.199179 | 0.369345 | 0.436599 | 0.780549 | 0.552043 | 3.807443 | 0.198476 | |
15 | 2023-02-22 17:36:15 | 36.291 sec | 15.0 | 0.369378 | 0.436511 | 0.783760 | 0.573630 | 3.992154 | 0.199036 | 0.368422 | 0.434758 | 0.781115 | 0.552472 | 3.807443 | 0.209189 | |
16 | 2023-02-22 17:36:15 | 36.337 sec | 16.0 | 0.368553 | 0.434903 | 0.783928 | 0.575126 | 4.054065 | 0.201900 | 0.367631 | 0.433201 | 0.781266 | 0.553865 | 3.757345 | 0.209079 | |
17 | 2023-02-22 17:36:15 | 36.385 sec | 17.0 | 0.367919 | 0.433482 | 0.784829 | 0.576147 | 4.054065 | 0.200993 | 0.367034 | 0.431879 | 0.782066 | 0.555544 | 3.757345 | 0.206318 | |
18 | 2023-02-22 17:36:15 | 36.442 sec | 18.0 | 0.367264 | 0.432128 | 0.785397 | 0.577299 | 4.033589 | 0.196314 | 0.366458 | 0.430703 | 0.782303 | 0.555169 | 3.757345 | 0.195825 | |
19 | 2023-02-22 17:36:15 | 36.489 sec | 19.0 | 0.366730 | 0.430958 | 0.785965 | 0.578072 | 4.032614 | 0.199083 | 0.366084 | 0.429812 | 0.782636 | 0.555984 | 3.757345 | 0.194500 |
See the whole table with table.as_data_frame() Variable Importances:
variable | relative_importance | scaled_importance | percentage | |
---|---|---|---|---|
0 | PAY_0 | 1327.677734 | 1.000000 | 0.298469 |
1 | PAY_2 | 870.664368 | 0.655780 | 0.195730 |
2 | PAY_5 | 385.968719 | 0.290710 | 0.086768 |
3 | PAY_6 | 367.919647 | 0.277115 | 0.082710 |
4 | PAY_4 | 271.276825 | 0.204324 | 0.060985 |
5 | LIMIT_BAL | 255.897079 | 0.192740 | 0.057527 |
6 | PAY_3 | 237.898102 | 0.179184 | 0.053481 |
7 | BILL_AMT1 | 139.980881 | 0.105433 | 0.031468 |
8 | PAY_AMT1 | 129.845551 | 0.097799 | 0.029190 |
9 | BILL_AMT2 | 110.725052 | 0.083398 | 0.024892 |
10 | PAY_AMT3 | 90.709625 | 0.068322 | 0.020392 |
11 | PAY_AMT2 | 84.566849 | 0.063695 | 0.019011 |
12 | BILL_AMT3 | 52.510399 | 0.039551 | 0.011805 |
13 | PAY_AMT5 | 50.029984 | 0.037682 | 0.011247 |
14 | PAY_AMT4 | 42.780476 | 0.032222 | 0.009617 |
15 | PAY_AMT6 | 29.835304 | 0.022472 | 0.006707 |
best_gbm_eval
Fold | Metric | best_glm Value | best_mgbm Value | gbm11 Value | best_glm Rank | best_mgbm Rank | gbm11 Rank | |
---|---|---|---|---|---|---|---|---|
0 | 0 | F1 | 0.533181 | 0.551298 | 0.562353 | 3.0 | 2.0 | 1.0 |
1 | 0 | accuracy | 0.816246 | 0.817367 | 0.814006 | 2.0 | 1.0 | 3.0 |
2 | 0 | auc | 0.738625 | 0.776026 | 0.777570 | 3.0 | 2.0 | 1.0 |
3 | 0 | logloss | 0.468678 | 0.440775 | 0.438078 | 3.0 | 2.0 | 1.0 |
4 | 0 | mcc | 0.419924 | 0.420105 | 0.426918 | 3.0 | 2.0 | 1.0 |
5 | 1 | F1 | 0.540865 | 0.554762 | 0.555283 | 3.0 | 2.0 | 1.0 |
6 | 1 | accuracy | 0.823882 | 0.826063 | 0.828244 | 3.0 | 2.0 | 1.0 |
7 | 1 | auc | 0.729674 | 0.776877 | 0.785956 | 3.0 | 2.0 | 1.0 |
8 | 1 | logloss | 0.465999 | 0.434170 | 0.428677 | 3.0 | 2.0 | 1.0 |
9 | 1 | mcc | 0.432722 | 0.445354 | 0.447637 | 3.0 | 2.0 | 1.0 |
10 | 2 | F1 | 0.500593 | 0.516364 | 0.530343 | 3.0 | 2.0 | 1.0 |
11 | 2 | accuracy | 0.830907 | 0.833707 | 0.835946 | 3.0 | 2.0 | 1.0 |
12 | 2 | auc | 0.707507 | 0.760838 | 0.769493 | 3.0 | 2.0 | 1.0 |
13 | 2 | logloss | 0.459017 | 0.420930 | 0.417171 | 3.0 | 2.0 | 1.0 |
14 | 2 | mcc | 0.395476 | 0.409254 | 0.411118 | 3.0 | 2.0 | 1.0 |
15 | 3 | F1 | 0.531328 | 0.550251 | 0.557500 | 3.0 | 2.0 | 1.0 |
16 | 3 | accuracy | 0.834578 | 0.836713 | 0.835112 | 3.0 | 1.0 | 2.0 |
17 | 3 | auc | 0.733536 | 0.782795 | 0.791644 | 3.0 | 2.0 | 1.0 |
18 | 3 | logloss | 0.448512 | 0.416031 | 0.411094 | 3.0 | 2.0 | 1.0 |
19 | 3 | mcc | 0.426465 | 0.443411 | 0.445759 | 3.0 | 2.0 | 1.0 |
20 | 4 | F1 | 0.561589 | 0.587666 | 0.593037 | 3.0 | 2.0 | 1.0 |
21 | 4 | accuracy | 0.820845 | 0.829296 | 0.826479 | 3.0 | 1.0 | 2.0 |
22 | 4 | auc | 0.740132 | 0.790550 | 0.803337 | 3.0 | 2.0 | 1.0 |
23 | 4 | logloss | 0.467037 | 0.427654 | 0.419666 | 3.0 | 2.0 | 1.0 |
24 | 4 | mcc | 0.448005 | 0.463860 | 0.472417 | 3.0 | 2.0 | 1.0 |
print('Best GLM mean rank:', best_gbm_eval['best_glm Rank'].mean())
print('Best MGBM mean rank:', best_gbm_eval['best_mgbm Rank'].mean())
print('Best GBM mean rank:', best_gbm_eval['gbm11 Rank'].mean())
Best GLM mean rank: 2.96 Best MGBM mean rank: 1.88 Best GBM mean rank: 1.16
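auto_ph.cv_model_rank_select follows the KDD Cup style selection scheme mentioned at the top of the notebook: each candidate model is scored on several folds of the validation data, the models are ranked within every fold and metric pair (rank 1 = best), and the candidate with the lowest mean rank is selected. A minimal sketch of the rank aggregation step, assuming a long-format metrics table shaped like best_gbm_eval above (the helper itself also handles rescoring each candidate on the folds):
def mean_rank_sketch(eval_frame, model_names):
    ranks = {name: 0.0 for name in model_names}
    for _, row in eval_frame.iterrows():
        # lower is better only for logloss; higher is better for F1, accuracy, auc, and mcc
        ascending = (row['Metric'] == 'logloss')
        values = pd.Series({name: row[name + ' Value'] for name in model_names})
        row_ranks = values.rank(ascending=ascending)
        for name in model_names:
            ranks[name] += row_ranks[name]
    return {name: ranks[name] / len(eval_frame) for name in model_names}
# e.g. mean_rank_sketch(best_gbm_eval, ['best_glm', 'best_mgbm', 'gbm11'])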
# init dict to hold partial dependence and ICE values
# for each feature
# for gbm
gbm_pd_ice_dict = {}
# establish gbm selected features
gbm_selected = list(gbm_selected_coefs[gbm_selected_coefs['GBM Mean SHAP Value'] != 0].index)
# calculate partial dependence for each selected feature
for xs in gbm_selected:
gbm_pd_ice_dict[xs] = auto_ph.pd_ice(xs, valid, best_gbm)
# merge GBM predictions onto validation data
gbm_yhat_valid = pd.concat([valid.reset_index(drop=True),
best_gbm.predict(h2o.H2OFrame(valid))['p1'].as_data_frame()],
axis=1)
# rename yhat column
gbm_yhat_valid = gbm_yhat_valid.rename(columns={'p1':'p_DEFAULT_NEXT_MONTH'})
# find percentiles of predictions
gbm_percentile_dict = auto_ph.get_percentile_dict('p_DEFAULT_NEXT_MONTH', gbm_yhat_valid, 'ID')
# display percentiles dictionary
# key=percentile, val=row_id
gbm_percentile_dict
{0: 983, 99: 17757, 10: 20120, 20: 7828, 30: 29363, 40: 10128, 50: 8305, 60: 27620, 70: 6546, 80: 16855, 90: 16995}
# loop through selected variables
for xs in gbm_selected:
# collect bins used in partial dependence
bins = list(gbm_pd_ice_dict[xs][xs])
# calculate ICE at percentiles
# using partial dependence bins
# for each selected feature
for i in sorted(gbm_percentile_dict.keys()):
col_name = 'Percentile_' + str(i)
gbm_pd_ice_dict[xs][col_name] = auto_ph.pd_ice(xs,
valid[valid['ID'] == int(gbm_percentile_dict[i])][gbm_selected],
best_gbm,
bins=bins)['partial_dependence']
for xs in gbm_selected:
auto_ph.hist_mean_pd_ice_plot(xs, y_name, valid, gbm_pd_ice_dict)
Local Shapley value contributions at percentiles of p_DEFAULT_NEXT_MONTH
next_vars = int(''.join([c for c in best_gbm.model_id if c.isdigit()])) - 1
for percentile in [10, 50, 90]:
idx = valid_idx_map[valid_idx_map == int(glm_percentile_dict[percentile])].index[0]
s_df = pd.DataFrame(best_gbm_shap[idx, :].T, columns=['GBM SHAP Value'], index=glm_selected + next_list[:next_vars])
local_coef_dict[percentile]['GBM SHAP Value'] = 0
local_coef_dict[percentile].update(s_df)
Plot local contributions at percentiles of p_DEFAULT_NEXT_MONTH
fig, (ax0, ax1, ax2) = plt.subplots(ncols=3, sharey=True)
plt.tight_layout()
plt.subplots_adjust(left=0, right=2, wspace=0.1)
_ = local_coef_dict[10].plot(kind='bar', colormap='cool', ax=ax0, edgecolor=['black']*len(data[x_names + [y_name]]),
title='10th PCTL of p_DEFAULT_NEXT_MONTH')
_ = local_coef_dict[50].plot(kind='bar', colormap='cool', ax=ax1, edgecolor=['black']*len(data[x_names + [y_name]]),
title='50th PCTL of p_DEFAULT_NEXT_MONTH')
_ = local_coef_dict[90].plot(kind='bar', colormap='cool', ax=ax2, edgecolor=['black']*len(data[x_names + [y_name]]),
title='90th PCTL of p_DEFAULT_NEXT_MONTH')
Standardized mean difference between SEX = male and SEX = female
print('Standardized mean difference: %.2f' % auto_ph.smd(gbm_yhat_valid, 'SEX', 'p_DEFAULT_NEXT_MONTH', 'male', 'female'))
Male mean yhat: 0.24 Female mean yhat: 0.22 P_Default_Next_Month std. dev.: 0.19 Standardized mean difference: -0.07
best_gbm_cut = best_gbm.mcc(valid=True)[0][0]
best_gbm_cut
0.3944636621134693
gbm_male_cm = auto_ph.get_confusion_matrix(gbm_yhat_valid, y_name, 'p_DEFAULT_NEXT_MONTH', by='SEX',
level='male', cutoff=best_gbm_cut)
gbm_female_cm = auto_ph.get_confusion_matrix(gbm_yhat_valid, y_name, 'p_DEFAULT_NEXT_MONTH', by='SEX',
level='female', cutoff=best_gbm_cut)
gbm_cm_dict = {'male': gbm_male_cm, 'female': gbm_female_cm}
Confusion matrix for SEX = male
gbm_male_cm
actual: 1 | actual: 0 | |
---|---|---|
predicted: 1 | 405 | 229 |
predicted: 0 | 438 | 2520 |
Confusion matrix for SEX = female
gbm_female_cm
actual: 1 | actual: 0 | |
---|---|---|
predicted: 1 | 515 | 319 |
predicted: 0 | 628 | 4000 |
Adverse impact ratio between SEX = male and SEX = female
print('Adverse impact ratio: %.2f' % auto_ph.air(gbm_cm_dict, 'male', 'female'))
Male proportion accepted: 0.823 Female proportion accepted: 0.847 Adverse impact ratio: 1.03
Marginal effect between SEX = male and SEX = female
print('Marginal effect: %.2f%%' % auto_ph.marginal_effect(gbm_cm_dict, 'male', 'female'))
Male accepted: 82.35% Female accepted: 84.73% Marginal effect: -2.38%
gbm_cm = auto_ph.get_confusion_matrix(gbm_yhat_valid, y_name, 'p_DEFAULT_NEXT_MONTH', cutoff=best_gbm_cut)
gbm_cm
actual: 1 | actual: 0 | |
---|---|---|
predicted: 1 | 920 | 548 |
predicted: 0 | 1066 | 6520 |
gbm_business_impact = gbm_cm.iloc[0, 0]*TRUE_POSITIVE_AMOUNT +\
gbm_cm.iloc[0, 1]*FALSE_POSITIVE_AMOUNT +\
gbm_cm.iloc[1, 0]*FALSE_NEGATIVE_AMOUNT +\
gbm_cm.iloc[1, 1]*TRUE_NEGATIVE_AMOUNT
print('Estimated business impact $%.2f' % gbm_business_impact)
Estimated business impact $12840000.00
# init data structures
# for automatically selected monotonicity constraints
hybrid_mono_dict = {}
ice_curve_names = [name for name in gbm_pd_ice_dict['PAY_0'].columns
if name.startswith('Percentile_')]
# loop through features in best GBM
for xs in gbm_pd_ice_dict.keys():
# count the number of monotonic ICE curves for each feature
count = 0
for name in ice_curve_names:
if gbm_pd_ice_dict[xs][name].is_monotonic:
count += 1
elif gbm_pd_ice_dict[xs][name].is_monotonic_decreasing:
count += 1
else:
pass
# use global MONO_THRESHOLD hyperparameter
# to decide whether a feature should be constrained,
# if number of monotonic ICE curves is >= MONO_THRESHOLD
# provide a monotone constraint
# lower = more monotone constraints
if count >= MONO_THRESHOLD:
hybrid_mono_dict[xs] = np.sign(valid[[xs] + [y_name]].corr()[y_name].values[:-1])[0]
else:
hybrid_mono_dict[xs] = 0
# print selected monotone constraints
print(hybrid_mono_dict)
{'BILL_AMT1': 0, 'BILL_AMT2': 0, 'BILL_AMT3': 0, 'LIMIT_BAL': 0, 'PAY_0': 0, 'PAY_2': 1.0, 'PAY_3': 1.0, 'PAY_4': 1.0, 'PAY_5': 1.0, 'PAY_6': 0, 'PAY_AMT1': -1.0, 'PAY_AMT2': 0, 'PAY_AMT3': 0, 'PAY_AMT4': -1.0, 'PAY_AMT5': -1.0, 'PAY_AMT6': 0}
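Each feature above has 11 ICE curves, one per key in gbm_percentile_dict, so with MONO_THRESHOLD = 6 a feature only receives a constraint when a majority of its ICE curves are monotonic; the direction of the constraint is then the sign of the feature's Pearson correlation with DEFAULT_NEXT_MONTH. For this run the heuristic selects increasing constraints for PAY_2 through PAY_5 (greater past payment delays push the predicted probability of default up) and decreasing constraints for PAY_AMT1, PAY_AMT4, and PAY_AMT5 (larger past payments push it down). A toy check of the pandas monotonicity test used in the loop, with made-up values:
curve_up = pd.Series([0.10, 0.15, 0.22, 0.30])     # monotonic increasing ICE curve
curve_down = pd.Series([0.30, 0.22, 0.15, 0.10])   # monotonic decreasing ICE curve
curve_mixed = pd.Series([0.10, 0.30, 0.15, 0.22])  # neither
print(curve_up.is_monotonic, curve_down.is_monotonic_decreasing, curve_mixed.is_monotonic)
# True True False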
# create a DataFrame to store new hybrid GBM SHAP values
# for comparison to correlation, penalized glm, mgbm, and gbm coefficients
abs_corr_glm_mgbm_gbm_hybrid_shap = abs_corr_glm_mgbm_gbm_shap.copy(deep=True)
abs_corr_glm_mgbm_gbm_hybrid_shap['Hybrid Mean SHAP Value'] = 0
# create a new next_list based on selected monotonicity constraints
next_list = [name for name in next_list if name in hybrid_mono_dict.keys()]
# define large random grid search parameters
# large memory footprint!
hyper_parameters = {'ntrees':list(range(1, 501, 50)),
'max_depth':list(range(1, 21, 2)),
'sample_rate':[s/float(10) for s in range(1, 11)],
'col_sample_rate':[s/float(10) for s in range(1, 11)],
'col_sample_rate_per_tree':list(np.arange(0.2, 1, 0.01)),
'col_sample_rate_change_per_level':list(np.arange(0.9, 1.1, 0.01)),
'min_rows':list(np.arange(1, 2**(np.log2(valid.shape[0])-1), 100)),
'nbins':list(range(16, 1024, 16)),
'nbins_cats':list(range(16, 4096, 16)),
'min_split_improvement':[0, 1e-8, 1e-6, 1e-4],
'histogram_type':['UniformAdaptive' ,'QuantilesGlobal', 'RoundRobin', 'Random']}
# define large search strategy
# large memory footprint!
search_criteria = {'strategy':'RandomDiscrete',
'max_models':250,
'max_runtime_secs':3600,
'seed':SEED}
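These two dictionaries are in the format H2O's random grid search expects. Below is a rough sketch of how auto_ph.gbm_forward_select_train presumably wires them into a single grid search; it is an assumption about the helper for illustration, not its actual code. Nonzero entries of hybrid_mono_dict would be passed through the estimator's monotone_constraints argument.
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

def random_grid_sketch(x_names_, y_name_, train_df, valid_df, mono_dict=None):
    # keep only the +1/-1 constraints; unconstrained features are simply omitted
    constraints = {k: v for k, v in (mono_dict or {}).items() if v != 0}
    train_h2o, valid_h2o = h2o.H2OFrame(train_df), h2o.H2OFrame(valid_df)
    train_h2o[y_name_] = train_h2o[y_name_].asfactor()  # binary classification target
    valid_h2o[y_name_] = valid_h2o[y_name_].asfactor()
    gbm = H2OGradientBoostingEstimator(seed=SEED,
                                       monotone_constraints=constraints if constraints else None)
    grid = H2OGridSearch(gbm, hyper_params=hyper_parameters, search_criteria=search_criteria)
    grid.train(x=x_names_, y=y_name_, training_frame=train_h2o, validation_frame=valid_h2o)
    # return the candidate with the best AUC
    return grid.get_grid(sort_by='auc', decreasing=True).models[0]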
# start local timer
tic = time.time()
# stepwise hybrid GBM training
hybrid_train_results = auto_ph.gbm_forward_select_train(glm_selected,
y_name,
train,
valid,
SEED,
next_list,
abs_corr_glm_mgbm_gbm_hybrid_shap,
'Hybrid Mean SHAP Value',
monotone=True,
monotone_constraints_=hybrid_mono_dict, hyper_params_=hyper_parameters,
search_criteria_=search_criteria)
hybrid_models = hybrid_train_results['MODELS']
corr_glm_mgbm_gbm_hybrid_shap_coefs = hybrid_train_results['GLOBAL_COEFS']
hybrid_shap = hybrid_train_results['LOCAL_COEFS']
# end local timer
toc = time.time()-tic
print('Task completed in %.2f s.' % (toc))
# failed with h2o 3.26.0.3 and 12GB of RAM
# sometimes completes, sometimes fails with h2o 3.26.0.3 and 16GB of RAM
Starting grid search 1/11 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'] Monotone constraints = {'PAY_0': 0, 'PAY_2': 1.0, 'PAY_3': 1.0, 'PAY_4': 1.0, 'PAY_5': 1.0, 'PAY_6': 0} Completed grid search 1/11 with AUC: 0.76 ... -------------------------------------------------------------------------------- Starting grid search 2/11 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL'] Monotone constraints = {'PAY_0': 0, 'PAY_2': 1.0, 'PAY_3': 1.0, 'PAY_4': 1.0, 'PAY_5': 1.0, 'PAY_6': 0, 'LIMIT_BAL': 0} Completed grid search 2/11 with AUC: 0.77 ... -------------------------------------------------------------------------------- Starting grid search 3/11 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1'] Monotone constraints = {'PAY_0': 0, 'PAY_2': 1.0, 'PAY_3': 1.0, 'PAY_4': 1.0, 'PAY_5': 1.0, 'PAY_6': 0, 'LIMIT_BAL': 0, 'PAY_AMT1': -1.0} Completed grid search 3/11 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 4/11 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2'] Monotone constraints = {'PAY_0': 0, 'PAY_2': 1.0, 'PAY_3': 1.0, 'PAY_4': 1.0, 'PAY_5': 1.0, 'PAY_6': 0, 'LIMIT_BAL': 0, 'PAY_AMT1': -1.0, 'PAY_AMT2': 0} Completed grid search 4/11 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 5/11 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4'] Monotone constraints = {'PAY_0': 0, 'PAY_2': 1.0, 'PAY_3': 1.0, 'PAY_4': 1.0, 'PAY_5': 1.0, 'PAY_6': 0, 'LIMIT_BAL': 0, 'PAY_AMT1': -1.0, 'PAY_AMT2': 0, 'PAY_AMT4': -1.0} Completed grid search 5/11 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 6/11 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3'] Monotone constraints = {'PAY_0': 0, 'PAY_2': 1.0, 'PAY_3': 1.0, 'PAY_4': 1.0, 'PAY_5': 1.0, 'PAY_6': 0, 'LIMIT_BAL': 0, 'PAY_AMT1': -1.0, 'PAY_AMT2': 0, 'PAY_AMT4': -1.0, 'PAY_AMT3': 0} Completed grid search 6/11 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 7/11 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5'] Monotone constraints = {'PAY_0': 0, 'PAY_2': 1.0, 'PAY_3': 1.0, 'PAY_4': 1.0, 'PAY_5': 1.0, 'PAY_6': 0, 'LIMIT_BAL': 0, 'PAY_AMT1': -1.0, 'PAY_AMT2': 0, 'PAY_AMT4': -1.0, 'PAY_AMT3': 0, 'PAY_AMT5': -1.0} Completed grid search 7/11 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 8/11 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6'] Monotone constraints = {'PAY_0': 0, 'PAY_2': 1.0, 'PAY_3': 1.0, 'PAY_4': 1.0, 'PAY_5': 1.0, 'PAY_6': 0, 'LIMIT_BAL': 0, 'PAY_AMT1': -1.0, 'PAY_AMT2': 0, 'PAY_AMT4': -1.0, 'PAY_AMT3': 0, 'PAY_AMT5': -1.0, 'PAY_AMT6': 0} Completed grid search 8/11 with AUC: 0.78 ... -------------------------------------------------------------------------------- Starting grid search 9/11 ... 
Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1'] Monotone constraints = {'PAY_0': 0, 'PAY_2': 1.0, 'PAY_3': 1.0, 'PAY_4': 1.0, 'PAY_5': 1.0, 'PAY_6': 0, 'LIMIT_BAL': 0, 'PAY_AMT1': -1.0, 'PAY_AMT2': 0, 'PAY_AMT4': -1.0, 'PAY_AMT3': 0, 'PAY_AMT5': -1.0, 'PAY_AMT6': 0, 'BILL_AMT1': 0} Completed grid search 9/11 with AUC: 0.79 ... -------------------------------------------------------------------------------- Starting grid search 10/11 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1', 'BILL_AMT2'] Monotone constraints = {'PAY_0': 0, 'PAY_2': 1.0, 'PAY_3': 1.0, 'PAY_4': 1.0, 'PAY_5': 1.0, 'PAY_6': 0, 'LIMIT_BAL': 0, 'PAY_AMT1': -1.0, 'PAY_AMT2': 0, 'PAY_AMT4': -1.0, 'PAY_AMT3': 0, 'PAY_AMT5': -1.0, 'PAY_AMT6': 0, 'BILL_AMT1': 0, 'BILL_AMT2': 0} Completed grid search 10/11 with AUC: 0.79 ... -------------------------------------------------------------------------------- Starting grid search 11/11 ... Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3'] Monotone constraints = {'PAY_0': 0, 'PAY_2': 1.0, 'PAY_3': 1.0, 'PAY_4': 1.0, 'PAY_5': 1.0, 'PAY_6': 0, 'LIMIT_BAL': 0, 'PAY_AMT1': -1.0, 'PAY_AMT2': 0, 'PAY_AMT4': -1.0, 'PAY_AMT3': 0, 'PAY_AMT5': -1.0, 'PAY_AMT6': 0, 'BILL_AMT1': 0, 'BILL_AMT2': 0, 'BILL_AMT3': 0} Completed grid search 11/11 with AUC: 0.79 ... -------------------------------------------------------------------------------- Done. Task completed in 6731.70 s.
auto_ph.plot_coefs(corr_glm_mgbm_gbm_hybrid_shap_coefs,
hybrid_models,
'Hybrid GBM',
['Absolute Pearson Correlation Coefficient',
'GBM Mean SHAP Value',
'Monotonic GBM Mean SHAP Value',
'Hybrid Mean SHAP Value',
'Absolute Penalized GLM Contribution'])
# auto_ph cv_model_rank_select function
# requires models to have model_id
best_gbm.model_id = 'best_gbm'
compare_model_ids = ['best_glm', 'best_mgbm', 'best_gbm']
# start local timer
tic = time.time()
# perform CV rank model selection
hybrid_rank_results = auto_ph.cv_model_rank_select(valid,
SEED,
hybrid_train_results,
'hybrid',
compare_model_ids)
best_hybrid = hybrid_rank_results['BEST_MODEL']
best_hybrid_shap = hybrid_rank_results['BEST_LOCAL_COEFS']
hybrid_selected_coefs = hybrid_rank_results['BEST_GLOBAL_COEFS']
best_hybrid_eval = hybrid_rank_results['METRICS']
# end local timer
toc = time.time()-tic
print('Task completed in %.2f s.' % (toc))
Evaluated model 1/11 with rank: 2.88* ... Evaluated model 2/11 with rank: 2.46* ... Evaluated model 3/11 with rank: 1.96* ... Evaluated model 4/11 with rank: 2.24 ... Evaluated model 5/11 with rank: 2.04 ... Evaluated model 6/11 with rank: 2.08 ... Evaluated model 7/11 with rank: 1.92* ... Evaluated model 8/11 with rank: 1.96 ... Evaluated model 9/11 with rank: 1.40* ... Evaluated model 10/11 with rank: 1.64 ... Evaluated model 11/11 with rank: 1.58 ... Done. Task completed in 1059.20 s.
best_hybrid
Model Details ============= H2OGradientBoostingEstimator : Gradient Boosting Machine Model Key: hybrid9 Model Summary:
number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves | ||
---|---|---|---|---|---|---|---|---|---|---|
0 | 50.0 | 50.0 | 20035.0 | 5.0 | 14.0 | 9.72 | 11.0 | 40.0 | 27.14 |
ModelMetricsBinomial: gbm ** Reported on train data. ** MSE: 0.13259026736779506 RMSE: 0.36412946511892585 LogLoss: 0.4213987551922327 Mean Per-Class Error: 0.27876019446687894 AUC: 0.7969101065767178 pr_auc: 0.5677983080278433 Gini: 0.5938202131534356 Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.26746610600951026:
0 | 1 | Error | Rate | ||
---|---|---|---|---|---|
0 | 0 | 13877.0 | 2419.0 | 0.1484 | (2419.0/16296.0) |
1 | 1 | 1932.0 | 2718.0 | 0.4155 | (1932.0/4650.0) |
2 | Total | 15809.0 | 5137.0 | 0.2077 | (4351.0/20946.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
metric | threshold | value | idx | |
---|---|---|---|---|
0 | max f1 | 0.267466 | 0.555431 | 206.0 |
1 | max f2 | 0.130191 | 0.654499 | 308.0 |
2 | max f0point5 | 0.423321 | 0.583887 | 140.0 |
3 | max accuracy | 0.494949 | 0.821064 | 118.0 |
4 | max precision | 0.811755 | 0.907216 | 5.0 |
5 | max recall | 0.029468 | 1.000000 | 396.0 |
6 | max specificity | 0.831785 | 0.999939 | 0.0 |
7 | max absolute_mcc | 0.308632 | 0.427502 | 186.0 |
8 | max min_per_class_accuracy | 0.192228 | 0.715484 | 256.0 |
9 | max mean_per_class_accuracy | 0.227657 | 0.721240 | 230.0 |
Gains/Lift Table: Avg response rate: 22.20 %, avg score: 22.34 %
group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.010026 | 0.792966 | 3.796664 | 3.796664 | 0.842857 | 0.808394 | 0.842857 | 0.808394 | 0.038065 | 0.038065 | 279.666359 | 279.666359 | |
1 | 2 | 0.020004 | 0.774652 | 3.599302 | 3.698218 | 0.799043 | 0.783916 | 0.821002 | 0.796184 | 0.035914 | 0.073978 | 259.930236 | 269.821849 | |
2 | 3 | 0.030030 | 0.745723 | 3.303312 | 3.566374 | 0.733333 | 0.760521 | 0.791733 | 0.784277 | 0.033118 | 0.107097 | 230.331183 | 256.637366 | |
3 | 4 | 0.040008 | 0.713674 | 3.232906 | 3.483206 | 0.717703 | 0.730223 | 0.773270 | 0.770796 | 0.032258 | 0.139355 | 223.290631 | 248.320579 | |
4 | 5 | 0.050033 | 0.680622 | 3.281862 | 3.442860 | 0.728571 | 0.696384 | 0.764313 | 0.755885 | 0.032903 | 0.172258 | 228.186175 | 244.286013 | |
5 | 6 | 0.100019 | 0.565576 | 2.882546 | 3.162837 | 0.639924 | 0.622901 | 0.702148 | 0.689425 | 0.144086 | 0.316344 | 188.254614 | 216.283686 | |
6 | 7 | 0.150005 | 0.402813 | 2.275921 | 2.867292 | 0.505253 | 0.484024 | 0.636537 | 0.620980 | 0.113763 | 0.430108 | 127.592076 | 186.729225 | |
7 | 8 | 0.200038 | 0.315073 | 1.779456 | 2.595203 | 0.395038 | 0.354358 | 0.576134 | 0.554292 | 0.089032 | 0.519140 | 77.945580 | 159.520333 | |
8 | 9 | 0.300010 | 0.228788 | 1.243367 | 2.144735 | 0.276027 | 0.266346 | 0.476130 | 0.458341 | 0.124301 | 0.643441 | 24.336692 | 114.473461 | |
9 | 10 | 0.400029 | 0.183553 | 0.918104 | 1.838040 | 0.203819 | 0.205672 | 0.408044 | 0.395166 | 0.091828 | 0.735269 | -8.189576 | 83.804042 | |
10 | 11 | 0.500000 | 0.155588 | 0.770113 | 1.624516 | 0.170965 | 0.168647 | 0.360642 | 0.349875 | 0.076989 | 0.812258 | -22.988693 | 62.451613 | |
11 | 12 | 0.600019 | 0.132470 | 0.642888 | 1.460885 | 0.142721 | 0.143748 | 0.324316 | 0.315515 | 0.064301 | 0.876559 | -35.711202 | 46.088540 | |
12 | 13 | 0.699990 | 0.108926 | 0.544242 | 1.329972 | 0.120821 | 0.120753 | 0.295253 | 0.287700 | 0.054409 | 0.930968 | -45.575808 | 32.997206 | |
13 | 14 | 0.800010 | 0.085476 | 0.337570 | 1.205900 | 0.074940 | 0.097501 | 0.267709 | 0.263920 | 0.033763 | 0.964731 | -66.243006 | 20.589959 | |
14 | 15 | 0.899981 | 0.062265 | 0.191453 | 1.093213 | 0.042502 | 0.073949 | 0.242693 | 0.242818 | 0.019140 | 0.983871 | -80.854731 | 9.321316 | |
15 | 16 | 1.000000 | 0.020044 | 0.161260 | 1.000000 | 0.035800 | 0.048289 | 0.221999 | 0.223361 | 0.016129 | 1.000000 | -83.874047 | 0.000000 |
ModelMetricsBinomial: gbm ** Reported on validation data. ** MSE: 0.13215220532913477 RMSE: 0.3635274478345958 LogLoss: 0.42281097825164826 Mean Per-Class Error: 0.2752114974601497 AUC: 0.7873395816556301 pr_auc: 0.554291424966461 Gini: 0.5746791633112602 Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.2853538071514341:
0 | 1 | Error | Rate | ||
---|---|---|---|---|---|
0 | 0 | 6066.0 | 1002.0 | 0.1418 | (1002.0/7068.0) |
1 | 1 | 819.0 | 1167.0 | 0.4124 | (819.0/1986.0) |
2 | Total | 6885.0 | 2169.0 | 0.2011 | (1821.0/9054.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
metric | threshold | value | idx | |
---|---|---|---|---|
0 | max f1 | 0.285354 | 0.561733 | 198.0 |
1 | max f2 | 0.129627 | 0.639345 | 310.0 |
2 | max f0point5 | 0.428508 | 0.590686 | 139.0 |
3 | max accuracy | 0.515985 | 0.825050 | 112.0 |
4 | max precision | 0.793503 | 0.817308 | 12.0 |
5 | max recall | 0.024063 | 1.000000 | 399.0 |
6 | max specificity | 0.832645 | 0.999859 | 0.0 |
7 | max absolute_mcc | 0.302980 | 0.435416 | 189.0 |
8 | max min_per_class_accuracy | 0.196862 | 0.713998 | 254.0 |
9 | max mean_per_class_accuracy | 0.257099 | 0.724789 | 213.0 |
Gains/Lift Table: Avg response rate: 21.94 %, avg score: 22.85 %
group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.010051 | 0.794717 | 3.707247 | 3.707247 | 0.813187 | 0.810901 | 0.813187 | 0.810901 | 0.037261 | 0.037261 | 270.724744 | 270.724744 | |
1 | 2 | 0.020102 | 0.780143 | 3.506856 | 3.607052 | 0.769231 | 0.787907 | 0.791209 | 0.799404 | 0.035247 | 0.072508 | 250.685568 | 260.705156 | |
2 | 3 | 0.030042 | 0.760043 | 3.140584 | 3.452706 | 0.688889 | 0.770835 | 0.757353 | 0.789951 | 0.031219 | 0.103726 | 214.058409 | 245.270570 | |
3 | 4 | 0.040093 | 0.732320 | 3.306464 | 3.416045 | 0.725275 | 0.747011 | 0.749311 | 0.779187 | 0.033233 | 0.136959 | 230.646393 | 241.604454 | |
4 | 5 | 0.050033 | 0.697216 | 3.292548 | 3.391509 | 0.722222 | 0.714290 | 0.743929 | 0.766293 | 0.032729 | 0.169688 | 229.254783 | 239.150877 | |
5 | 6 | 0.100066 | 0.576136 | 2.958764 | 3.175137 | 0.649007 | 0.635961 | 0.696468 | 0.701127 | 0.148036 | 0.317724 | 195.876433 | 217.513655 | |
6 | 7 | 0.149989 | 0.417274 | 2.430748 | 2.927372 | 0.533186 | 0.504769 | 0.642121 | 0.635771 | 0.121349 | 0.439074 | 143.074753 | 192.737231 | |
7 | 8 | 0.200022 | 0.322064 | 1.690722 | 2.618039 | 0.370861 | 0.368896 | 0.574268 | 0.569015 | 0.084592 | 0.523666 | 69.072247 | 161.803914 | |
8 | 9 | 0.299978 | 0.239298 | 1.203956 | 2.146852 | 0.264088 | 0.277501 | 0.470913 | 0.471880 | 0.120342 | 0.644008 | 20.395587 | 114.685160 | |
9 | 10 | 0.400044 | 0.185910 | 0.860457 | 1.825075 | 0.188742 | 0.212299 | 0.400331 | 0.406949 | 0.086103 | 0.730111 | -13.954303 | 82.507536 | |
10 | 11 | 0.500000 | 0.156781 | 0.664946 | 1.593152 | 0.145856 | 0.170393 | 0.349459 | 0.359659 | 0.066465 | 0.796576 | -33.505366 | 59.315206 | |
11 | 12 | 0.599956 | 0.133061 | 0.609534 | 1.429276 | 0.133702 | 0.144524 | 0.313513 | 0.323816 | 0.060926 | 0.857503 | -39.046586 | 42.927610 | |
12 | 13 | 0.700022 | 0.109632 | 0.543447 | 1.302649 | 0.119205 | 0.121091 | 0.285737 | 0.294837 | 0.054381 | 0.911883 | -45.655349 | 30.264915 | |
13 | 14 | 0.799978 | 0.085735 | 0.357661 | 1.184575 | 0.078453 | 0.098320 | 0.259837 | 0.270283 | 0.035750 | 0.947633 | -64.233947 | 18.457450 | |
14 | 15 | 0.899934 | 0.062809 | 0.327436 | 1.089372 | 0.071823 | 0.074188 | 0.238954 | 0.248502 | 0.032729 | 0.980363 | -67.256430 | 8.937192 | |
15 | 16 | 1.000000 | 0.022042 | 0.196245 | 1.000000 | 0.043046 | 0.048626 | 0.219351 | 0.228502 | 0.019637 | 1.000000 | -80.375543 | 0.000000 |
Scoring History:
timestamp | duration | number_of_trees | training_rmse | training_logloss | training_auc | training_pr_auc | training_lift | training_classification_error | validation_rmse | validation_logloss | validation_auc | validation_pr_auc | validation_lift | validation_classification_error | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2023-02-22 19:11:45 | 7 min 28.677 sec | 0.0 | 0.415591 | 0.529427 | 0.500000 | 0.000000 | 1.000000 | 0.778001 | 0.413815 | 0.526105 | 0.500000 | 0.000000 | 1.000000 | 0.780649 | |
1 | 2023-02-22 19:11:45 | 7 min 28.720 sec | 1.0 | 0.409506 | 0.515866 | 0.684760 | 0.455246 | 3.523028 | 0.190824 | 0.407433 | 0.511967 | 0.694065 | 0.453426 | 3.413883 | 0.189971 | |
2 | 2023-02-22 19:11:45 | 7 min 28.761 sec | 2.0 | 0.404916 | 0.506025 | 0.749735 | 0.485649 | 3.596762 | 0.215745 | 0.402656 | 0.501819 | 0.755097 | 0.488075 | 3.467891 | 0.216589 | |
3 | 2023-02-22 19:11:45 | 7 min 28.788 sec | 3.0 | 0.404052 | 0.505076 | 0.719029 | 0.468992 | 3.596762 | 0.202712 | 0.401993 | 0.501383 | 0.722367 | 0.470015 | 3.467891 | 0.204440 | |
4 | 2023-02-22 19:11:45 | 7 min 28.826 sec | 4.0 | 0.404029 | 0.505586 | 0.709603 | 0.464296 | 3.596762 | 0.203332 | 0.402090 | 0.502174 | 0.712216 | 0.465536 | 3.467891 | 0.202452 | |
5 | 2023-02-22 19:11:45 | 7 min 28.885 sec | 5.0 | 0.397683 | 0.493206 | 0.726924 | 0.480326 | 3.596762 | 0.208870 | 0.395418 | 0.489276 | 0.729632 | 0.482844 | 3.467891 | 0.206649 | |
6 | 2023-02-22 19:11:45 | 7 min 28.945 sec | 6.0 | 0.394077 | 0.486052 | 0.740493 | 0.498806 | 3.576221 | 0.203523 | 0.391674 | 0.481895 | 0.744167 | 0.500373 | 3.445285 | 0.196488 | |
7 | 2023-02-22 19:11:45 | 7 min 28.990 sec | 7.0 | 0.391791 | 0.481404 | 0.749614 | 0.499939 | 3.629231 | 0.197078 | 0.389288 | 0.477090 | 0.753886 | 0.501482 | 3.486227 | 0.191518 | |
8 | 2023-02-22 19:11:45 | 7 min 29.040 sec | 8.0 | 0.387573 | 0.473196 | 0.761414 | 0.531365 | 3.680029 | 0.223241 | 0.384944 | 0.468748 | 0.764197 | 0.528476 | 3.503164 | 0.199801 | |
9 | 2023-02-22 19:11:45 | 7 min 29.095 sec | 9.0 | 0.385935 | 0.469643 | 0.766140 | 0.533438 | 3.716731 | 0.225771 | 0.383364 | 0.465355 | 0.768010 | 0.528175 | 3.443434 | 0.228960 | |
10 | 2023-02-22 19:11:45 | 7 min 29.139 sec | 10.0 | 0.382839 | 0.463622 | 0.770406 | 0.538970 | 3.746938 | 0.213836 | 0.380089 | 0.459078 | 0.772089 | 0.534263 | 3.386621 | 0.219571 | |
11 | 2023-02-22 19:11:45 | 7 min 29.188 sec | 11.0 | 0.380309 | 0.458645 | 0.772977 | 0.541487 | 3.720397 | 0.210064 | 0.377433 | 0.453921 | 0.774573 | 0.537641 | 3.468738 | 0.220565 | |
12 | 2023-02-22 19:11:45 | 7 min 29.238 sec | 12.0 | 0.379383 | 0.456539 | 0.773290 | 0.540536 | 3.689413 | 0.213788 | 0.376448 | 0.451705 | 0.774804 | 0.536610 | 3.371696 | 0.222001 | |
13 | 2023-02-22 19:11:45 | 7 min 29.289 sec | 13.0 | 0.377253 | 0.452186 | 0.776596 | 0.544348 | 3.723491 | 0.212928 | 0.374314 | 0.447417 | 0.777528 | 0.539746 | 3.437868 | 0.215927 | |
14 | 2023-02-22 19:11:45 | 7 min 29.333 sec | 14.0 | 0.375562 | 0.448758 | 0.778260 | 0.546167 | 3.683600 | 0.212976 | 0.372539 | 0.443872 | 0.779164 | 0.541738 | 3.407187 | 0.213165 | |
15 | 2023-02-22 19:11:45 | 7 min 29.380 sec | 15.0 | 0.374235 | 0.445859 | 0.780456 | 0.547347 | 3.753763 | 0.227537 | 0.371243 | 0.441076 | 0.781160 | 0.546549 | 3.657149 | 0.212613 | |
16 | 2023-02-22 19:11:45 | 7 min 29.421 sec | 16.0 | 0.373553 | 0.444802 | 0.780335 | 0.548536 | 3.821367 | 0.214695 | 0.370576 | 0.440071 | 0.781114 | 0.547591 | 3.657149 | 0.212282 | |
17 | 2023-02-22 19:11:45 | 7 min 29.483 sec | 17.0 | 0.372623 | 0.442652 | 0.781238 | 0.549017 | 3.818114 | 0.218705 | 0.369696 | 0.438056 | 0.781465 | 0.547864 | 3.617398 | 0.215595 | |
18 | 2023-02-22 19:11:45 | 7 min 29.538 sec | 18.0 | 0.371555 | 0.440210 | 0.783156 | 0.550889 | 3.806633 | 0.225676 | 0.368638 | 0.435678 | 0.782931 | 0.549011 | 3.707247 | 0.210183 | |
19 | 2023-02-22 19:11:46 | 7 min 29.580 sec | 19.0 | 0.371202 | 0.439258 | 0.783213 | 0.549972 | 3.796664 | 0.223002 | 0.368293 | 0.434741 | 0.782880 | 0.548419 | 3.657149 | 0.210846 |
See the whole table with table.as_data_frame() Variable Importances:
variable | relative_importance | scaled_importance | percentage | |
---|---|---|---|---|
0 | PAY_0 | 2558.588867 | 1.000000 | 0.498923 |
1 | PAY_3 | 688.551086 | 0.269114 | 0.134267 |
2 | PAY_2 | 350.810608 | 0.137111 | 0.068408 |
3 | PAY_4 | 332.122284 | 0.129807 | 0.064764 |
4 | BILL_AMT1 | 291.351929 | 0.113872 | 0.056813 |
5 | PAY_AMT1 | 168.537277 | 0.065871 | 0.032865 |
6 | LIMIT_BAL | 167.566696 | 0.065492 | 0.032675 |
7 | PAY_6 | 141.123337 | 0.055157 | 0.027519 |
8 | PAY_AMT3 | 139.932648 | 0.054691 | 0.027287 |
9 | PAY_5 | 108.442436 | 0.042384 | 0.021146 |
10 | PAY_AMT2 | 70.692581 | 0.027630 | 0.013785 |
11 | PAY_AMT4 | 42.055202 | 0.016437 | 0.008201 |
12 | PAY_AMT6 | 40.320015 | 0.015759 | 0.007862 |
13 | PAY_AMT5 | 28.128548 | 0.010994 | 0.005485 |
best_hybrid_eval
Fold | Metric | best_glm Value | best_mgbm Value | best_gbm Value | hybrid9 Value | best_gbm Rank | best_glm Rank | best_mgbm Rank | hybrid9 Rank | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | F1 | 0.533181 | 0.551298 | 0.562353 | 0.565875 | 2.0 | 4.0 | 3.0 | 1.0 |
1 | 0 | accuracy | 0.816246 | 0.817367 | 0.814006 | 0.812885 | 3.0 | 2.0 | 1.0 | 4.0 |
2 | 0 | auc | 0.738625 | 0.776026 | 0.777570 | 0.780667 | 2.0 | 4.0 | 3.0 | 1.0 |
3 | 0 | logloss | 0.468678 | 0.440775 | 0.438078 | 0.437294 | 2.0 | 4.0 | 3.0 | 1.0 |
4 | 0 | mcc | 0.419924 | 0.420105 | 0.426918 | 0.427220 | 2.0 | 4.0 | 3.0 | 1.0 |
5 | 1 | F1 | 0.540865 | 0.554762 | 0.555283 | 0.563446 | 2.0 | 4.0 | 3.0 | 1.0 |
6 | 1 | accuracy | 0.823882 | 0.826063 | 0.828244 | 0.828790 | 2.0 | 4.0 | 3.0 | 1.0 |
7 | 1 | auc | 0.729674 | 0.776877 | 0.785956 | 0.787909 | 2.0 | 4.0 | 3.0 | 1.0 |
8 | 1 | logloss | 0.465999 | 0.434170 | 0.428677 | 0.427742 | 2.0 | 4.0 | 3.0 | 1.0 |
9 | 1 | mcc | 0.432722 | 0.445354 | 0.447637 | 0.453102 | 2.0 | 4.0 | 3.0 | 1.0 |
10 | 2 | F1 | 0.500593 | 0.516364 | 0.530343 | 0.531017 | 2.0 | 4.0 | 3.0 | 1.0 |
11 | 2 | accuracy | 0.830907 | 0.833707 | 0.835946 | 0.833147 | 1.0 | 4.0 | 2.0 | 3.0 |
12 | 2 | auc | 0.707507 | 0.760838 | 0.769493 | 0.767894 | 1.0 | 4.0 | 3.0 | 2.0 |
13 | 2 | logloss | 0.459017 | 0.420930 | 0.417171 | 0.419599 | 1.0 | 4.0 | 3.0 | 2.0 |
14 | 2 | mcc | 0.395476 | 0.409254 | 0.411118 | 0.414319 | 2.0 | 4.0 | 3.0 | 1.0 |
15 | 3 | F1 | 0.531328 | 0.550251 | 0.557500 | 0.567164 | 2.0 | 4.0 | 3.0 | 1.0 |
16 | 3 | accuracy | 0.834578 | 0.836713 | 0.835112 | 0.836179 | 3.0 | 4.0 | 1.0 | 2.0 |
17 | 3 | auc | 0.733536 | 0.782795 | 0.791644 | 0.795031 | 2.0 | 4.0 | 3.0 | 1.0 |
18 | 3 | logloss | 0.448512 | 0.416031 | 0.411094 | 0.409044 | 2.0 | 4.0 | 3.0 | 1.0 |
19 | 3 | mcc | 0.426465 | 0.443411 | 0.445759 | 0.452049 | 2.0 | 4.0 | 3.0 | 1.0 |
20 | 4 | F1 | 0.561589 | 0.587666 | 0.593037 | 0.597087 | 2.0 | 4.0 | 3.0 | 1.0 |
21 | 4 | accuracy | 0.820845 | 0.829296 | 0.826479 | 0.827606 | 3.0 | 4.0 | 1.0 | 2.0 |
22 | 4 | auc | 0.740132 | 0.790550 | 0.803337 | 0.803346 | 2.0 | 4.0 | 3.0 | 1.0 |
23 | 4 | logloss | 0.467037 | 0.427654 | 0.419666 | 0.420917 | 1.0 | 4.0 | 3.0 | 2.0 |
24 | 4 | mcc | 0.448005 | 0.463860 | 0.472417 | 0.475800 | 2.0 | 4.0 | 3.0 | 1.0 |
print('Best GLM mean rank:', best_hybrid_eval['best_glm Rank'].mean())
print('Best MGBM mean rank:', best_hybrid_eval['best_mgbm Rank'].mean())
print('Best GBM mean rank:', best_hybrid_eval['best_gbm Rank'].mean())
print('Best Hybrid mean rank:', best_hybrid_eval['hybrid9 Rank'].mean())
# MONO_THRESHOLD = 6, hybrid rank = 1.4
# MONO_THRESHOLD = 4, hybrid rank = 1.58
Best GLM mean rank: 3.92 Best MGBM mean rank: 2.72 Best GBM mean rank: 1.96 Best Hybrid mean rank: 1.4
# init dict to hold partial dependence and ICE values
# for each feature
# for hybrid gbm
hybrid_pd_ice_dict = {}
# establish hybrid gbm selected features
hybrid_selected = list(hybrid_selected_coefs[hybrid_selected_coefs['Hybrid Mean SHAP Value'] != 0].index)
# calculate partial dependence for each selected feature
for xs in hybrid_selected:
hybrid_pd_ice_dict[xs] = auto_ph.pd_ice(xs, valid, best_hybrid)
# merge hybrid predictions onto validation data
hybrid_yhat_valid = pd.concat([valid.reset_index(drop=True),
best_hybrid.predict(h2o.H2OFrame(valid))['p1'].as_data_frame()],
axis=1)
# rename yhat column
hybrid_yhat_valid = hybrid_yhat_valid.rename(columns={'p1':'p_DEFAULT_NEXT_MONTH'})
# find percentiles of predictions
hybrid_percentile_dict = auto_ph.get_percentile_dict('p_DEFAULT_NEXT_MONTH', hybrid_yhat_valid, 'ID')
# display percentiles dictionary
# key=percentile, val=row_id
hybrid_percentile_dict
{0: 27993, 99: 12011, 10: 1287, 20: 22782, 30: 20906, 40: 15292, 50: 13621, 60: 29078, 70: 25804, 80: 28945, 90: 18250}
# loop through selected variables
for xs in hybrid_selected:
# collect bins used in partial dependence
bins = list(hybrid_pd_ice_dict[xs][xs])
# calculate ICE at percentiles
# using partial dependence bins
# for each selected feature
for i in sorted(hybrid_percentile_dict.keys()):
col_name = 'Percentile_' + str(i)
hybrid_pd_ice_dict[xs][col_name] = auto_ph.pd_ice(xs,
valid[valid['ID'] == int(hybrid_percentile_dict[i])][hybrid_selected],
best_hybrid,
bins=bins)['partial_dependence']
for xs in hybrid_selected:
auto_ph.hist_mean_pd_ice_plot(xs, y_name, valid, hybrid_pd_ice_dict)
Local Shapley value contributions at percentiles of p_DEFAULT_NEXT_MONTH
next_vars = int(''.join([c for c in best_hybrid.model_id if c.isdigit()])) - 1
for percentile in [10, 50, 90]:
idx = valid_idx_map[valid_idx_map == int(glm_percentile_dict[percentile])].index[0]
s_df = pd.DataFrame(best_hybrid_shap[idx, :].T, columns=['Hybrid SHAP Value'], index=glm_selected + next_list[:next_vars])
local_coef_dict[percentile]['Hybrid SHAP Value'] = 0
local_coef_dict[percentile].update(s_df)
Plot local contributions at percentiles of p_DEFAULT_NEXT_MONTH
fig, (ax0, ax1, ax2) = plt.subplots(ncols=3, sharey=True)
plt.tight_layout()
plt.subplots_adjust(left=0, right=2, wspace=0.1)
_ = local_coef_dict[10].plot(kind='bar', colormap='cool', ax=ax0, edgecolor=['black']*len(data[x_names + [y_name]]),
title='10th PCTL of p_DEFAULT_NEXT_MONTH')
_ = local_coef_dict[50].plot(kind='bar', colormap='cool', ax=ax1, edgecolor=['black']*len(data[x_names + [y_name]]),
title='50th PCTL of p_DEFAULT_NEXT_MONTH')
_ = local_coef_dict[90].plot(kind='bar', colormap='cool', ax=ax2, edgecolor=['black']*len(data[x_names + [y_name]]),
title='90th PCTL of p_DEFAULT_NEXT_MONTH')
Standardized mean difference between SEX = male and SEX = female
print('Standardized mean difference: %.2f' % auto_ph.smd(hybrid_yhat_valid, 'SEX', 'p_DEFAULT_NEXT_MONTH', 'male', 'female'))
Male mean yhat: 0.24 Female mean yhat: 0.22 P_Default_Next_Month std. dev.: 0.19 Standardized mean difference: -0.08
best_hybrid_cut = best_hybrid.mcc(valid=True)[0][0]
best_hybrid_cut
0.30298008948799504
hybrid_male_cm = auto_ph.get_confusion_matrix(hybrid_yhat_valid, y_name, 'p_DEFAULT_NEXT_MONTH', by='SEX',
level='male', cutoff=best_hybrid_cut)
hybrid_female_cm = auto_ph.get_confusion_matrix(hybrid_yhat_valid, y_name, 'p_DEFAULT_NEXT_MONTH', by='SEX',
level='female', cutoff=best_hybrid_cut)
hybrid_cm_dict = {'male': hybrid_male_cm, 'female': hybrid_female_cm}
Confusion matrix for SEX = male
hybrid_male_cm
actual: 1 | actual: 0 | |
---|---|---|
predicted: 1 | 497 | 364 |
predicted: 0 | 346 | 2385 |
Confusion matrix for SEX = female
hybrid_female_cm
actual: 1 | actual: 0 | |
---|---|---|
predicted: 1 | 628 | 537 |
predicted: 0 | 515 | 3782 |
Adverse impact ratio between SEX = male and SEX = female
print('Adverse impact ratio: %.2f' % auto_ph.air(hybrid_cm_dict, 'male', 'female'))
Male proportion accepted: 0.760 Female proportion accepted: 0.787 Adverse impact ratio: 1.03
Marginal effect between SEX = male and SEX = female
print('Marginal effect: %.2f%%' % auto_ph.marginal_effect(hybrid_cm_dict, 'male', 'female'))
Male accepted: 76.03% Female accepted: 78.67% Marginal effect: -2.64%
hybrid_cm = auto_ph.get_confusion_matrix(hybrid_yhat_valid, y_name, 'p_DEFAULT_NEXT_MONTH', cutoff=best_hybrid_cut)
hybrid_cm
actual: 1 | actual: 0 | |
---|---|---|
predicted: 1 | 1125 | 901 |
predicted: 0 | 861 | 6167 |
hybrid_business_impact = hybrid_cm.iloc[0, 0]*TRUE_POSITIVE_AMOUNT +\
hybrid_cm.iloc[0, 1]*FALSE_POSITIVE_AMOUNT +\
hybrid_cm.iloc[1, 0]*FALSE_NEGATIVE_AMOUNT +\
hybrid_cm.iloc[1, 1]*TRUE_NEGATIVE_AMOUNT
print('Estimated business impact $%.2f' % hybrid_business_impact)
Estimated business impact $19220000.00
big_toc = time.time() - big_tic
print('All tasks completed in %.2f s.' % (big_toc))
All tasks completed in 10428.61 s.
# be careful, this can erase your work!
h2o.cluster().shutdown(prompt=True)
Are you sure you want to shutdown the H2O instance running at http://127.0.0.1:54321 (Y/N)? y H2O session _sid_b7ae closed.