Copyright 2017 - 2020 Patrick Hall and the H2O.ai team
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
DISCLAIMER: This notebook is not legal compliance advice.
This notebook provides a basic introduction to two traditional data analysis and model diagnostic techniques that can be applied to machine learning models: residual analysis and sensitivity analysis. The notebook starts by loading the UCI credit card default dataset and using h2o to train a GBM model to predict credit card defaults. Then, residual analysis is used to discover and debug an issue with the GBM, and the GBM is retrained and improved. The notebook concludes by conducting sensitivity analysis to test the GBM credit card default model for fairness and stability.
In general, NumPy and Pandas will be used for data manipulation purposes and h2o will be used for modeling tasks.
# h2o Python API with specific classes
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
import numpy as np # array, vector, matrix calculations
import pandas as pd # DataFrame handling
pd.options.display.max_columns = 999 # enable display of all columns in notebook
# plotting functionality
import matplotlib.pyplot as plt
import seaborn as sns
# display plots in notebook
%matplotlib inline
H2o is both a library and a server. The machine learning algorithms in the library take advantage of the multithreaded and distributed architecture provided by the server to train machine learning algorithms extremely efficiently. The API for the library was imported above in cell 1, but the server still needs to be started.
h2o.init(max_mem_size='2G') # start h2o
h2o.remove_all() # remove any existing data structures from h2o memory
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found. Attempting to start a local H2O server... Java Version: java version "1.8.0_201"; Java(TM) SE Runtime Environment (build 1.8.0_201-b09); Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode) Starting server from /home/patrickh/anaconda3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar Ice root: /tmp/tmpfq80fyby JVM stdout: /tmp/tmpfq80fyby/h2o_patrickh_started_from_python.out JVM stderr: /tmp/tmpfq80fyby/h2o_patrickh_started_from_python.err Server is running at http://127.0.0.1:54321 Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O cluster uptime: | 01 secs |
H2O cluster timezone: | America/New_York |
H2O data parsing timezone: | UTC |
H2O cluster version: | 3.26.0.3 |
H2O cluster version age: | 6 days |
H2O cluster name: | H2O_from_python_patrickh_ov75l0 |
H2O cluster total nodes: | 1 |
H2O cluster free memory: | 1.778 Gb |
H2O cluster total cores: | 8 |
H2O cluster allowed cores: | 8 |
H2O cluster status: | accepting new members, healthy |
H2O connection url: | http://127.0.0.1:54321 |
H2O connection proxy: | None |
H2O internal security: | False |
H2O API Extensions: | Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4 |
Python version: | 3.6.4 final |
UCI credit card default data: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
The UCI credit card default data contains demographic and payment information about credit card customers in Taiwan in the year 2005. The data set contains 23 input variables:
LIMIT_BAL
: Amount of given credit (NT dollar)SEX
: 1 = male; 2 = femaleEDUCATION
: 1 = graduate school; 2 = university; 3 = high school; 4 = othersMARRIAGE
: 1 = married; 2 = single; 3 = othersAGE
: Age in yearsPAY_0
, PAY_2
- PAY_6
: History of past payment; PAY_0
= the repayment status in September, 2005; PAY_2
= the repayment status in August, 2005; ...; PAY_6
= the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.BILL_AMT1
- BILL_AMT6
: Amount of bill statement (NT dollar). BILL_AMNT1
= amount of bill statement in September, 2005; BILL_AMT2
= amount of bill statement in August, 2005; ...; BILL_AMT6
= amount of bill statement in April, 2005.PAY_AMT1
- PAY_AMT6
: Amount of previous payment (NT dollar). PAY_AMT1
= amount paid in September, 2005; PAY_AMT2
= amount paid in August, 2005; ...; PAY_AMT6
= amount paid in April, 2005.These 23 input variables are used to predict the target variable, whether or not a customer defaulted on their credit card bill in late 2005.
Because h2o accepts both numeric and character inputs, some variables will be recoded into more transparent character values.
The credit card default data is available as an .xls
file. Pandas reads .xls
files automatically, so it's used to load the credit card default data and give the prediction target a shorter name: DEFAULT_NEXT_MONTH
.
# import XLS file
path = 'default_of_credit_card_clients.xls'
data = pd.read_excel(path,
skiprows=1)
# remove spaces from target column name
data = data.rename(columns={'default payment next month': 'DEFAULT_NEXT_MONTH'})
The shorthand name y
is assigned to the prediction target. X
is assigned to all other input variables in the credit card default data except the row indentifier, ID
.
# assign target and inputs for GBM
y = 'DEFAULT_NEXT_MONTH'
X = [name for name in data.columns if name not in [y, 'ID']]
print('y =', y)
print('X =', X)
y = DEFAULT_NEXT_MONTH X = ['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
This simple function maps longer, more understandable character string values from the UCI credit card default data dictionary to the original integer values of the input variables found in the dataset. These character values can be used directly in h2o decision tree models, and the function returns the original Pandas DataFrame as an h2o object, an H2OFrame. H2o models cannot run on Pandas DataFrames. They require H2OFrames.
def recode_cc_data(frame):
""" Recodes numeric categorical variables into categorical character variables
with more transparent values.
Args:
frame: Pandas DataFrame version of UCI credit card default data.
Returns:
H2OFrame with recoded values.
"""
# define recoded values
sex_dict = {1:'male', 2:'female'}
education_dict = {0:'other', 1:'graduate school', 2:'university', 3:'high school',
4:'other', 5:'other', 6:'other'}
marriage_dict = {0:'other', 1:'married', 2:'single', 3:'divorced'}
pay_dict = {-2:'no consumption', -1:'pay duly', 0:'use of revolving credit', 1:'1 month delay',
2:'2 month delay', 3:'3 month delay', 4:'4 month delay', 5:'5 month delay', 6:'6 month delay',
7:'7 month delay', 8:'8 month delay', 9:'9+ month delay'}
# recode values using Pandas apply() and anonymous function
frame['SEX'] = frame['SEX'].apply(lambda i: sex_dict[i])
frame['EDUCATION'] = frame['EDUCATION'].apply(lambda i: education_dict[i])
frame['MARRIAGE'] = frame['MARRIAGE'].apply(lambda i: marriage_dict[i])
for name in frame.columns:
if name in ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']:
frame[name] = frame[name].apply(lambda i: pay_dict[i])
return h2o.H2OFrame(frame)
data = recode_cc_data(data)
Parse progress: |█████████████████████████████████████████████████████████| 100%
In h2o, a numeric variable can be treated as numeric or categorical. The target variable DEFAULT_NEXT_MONTH
takes on values of 0
or 1
. To ensure this numeric variable is treated as a categorical variable, the asfactor()
function is used to explicitly declare that it is a categorical variable.
data[y] = data[y].asfactor()
The h2o describe()
function displays a brief description of the credit card default data. For the categorical input variables LIMIT_BAL
, SEX
, EDUCATION
, MARRIAGE
, and PAY_0
-PAY_6
, the new character values created above in cell 5 are visible. Basic descriptive statistics are displayed for numeric inputs.
data.describe()
Rows:30000 Cols:25
ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT_NEXT_MONTH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
type | int | int | enum | enum | enum | int | enum | enum | enum | enum | enum | enum | int | int | int | int | int | int | int | int | int | int | int | int | enum |
mins | 1.0 | 10000.0 | 21.0 | -165580.0 | -69777.0 | -157264.0 | -170000.0 | -81334.0 | -339603.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ||||||||||
mean | 15000.5 | 167484.32266666688 | 35.48549999999994 | 51223.33090000009 | 49179.07516666668 | 47013.15479999971 | 43262.9489666666 | 40311.40096666653 | 38871.76039999991 | 5663.580500000014 | 5921.16350000001 | 5225.681500000005 | 4826.076866666661 | 4799.387633333302 | 5215.502566666664 | ||||||||||
maxs | 30000.0 | 1000000.0 | 79.0 | 964511.0 | 983931.0 | 1664089.0 | 891586.0 | 927171.0 | 961664.0 | 873552.0 | 1684259.0 | 896040.0 | 621000.0 | 426529.0 | 528666.0 | ||||||||||
sigma | 8660.398374208891 | 129747.66156720225 | 9.21790406809016 | 73635.86057552959 | 71173.76878252836 | 69349.38742703681 | 64332.85613391641 | 60797.1557702648 | 59554.10753674574 | 16563.280354025763 | 23040.870402057226 | 17606.961469803115 | 15666.159744031993 | 15278.305679144793 | 17777.465775435332 | ||||||||||
zeros | 0 | 0 | 0 | 2008 | 2506 | 2870 | 3195 | 3506 | 4020 | 5249 | 5396 | 5968 | 6408 | 6703 | 7173 | ||||||||||
missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 1.0 | 20000.0 | female | university | married | 24.0 | 2 month delay | 2 month delay | pay duly | pay duly | no consumption | no consumption | 3913.0 | 3102.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
1 | 2.0 | 120000.0 | female | university | single | 26.0 | pay duly | 2 month delay | use of revolving credit | use of revolving credit | use of revolving credit | 2 month delay | 2682.0 | 1725.0 | 2682.0 | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
2 | 3.0 | 90000.0 | female | university | single | 34.0 | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | 29239.0 | 14027.0 | 13559.0 | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
3 | 4.0 | 50000.0 | female | university | married | 37.0 | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | 46990.0 | 48233.0 | 49291.0 | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
4 | 5.0 | 50000.0 | male | university | married | 57.0 | pay duly | use of revolving credit | pay duly | use of revolving credit | use of revolving credit | use of revolving credit | 8617.0 | 5670.0 | 35835.0 | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |
5 | 6.0 | 50000.0 | male | graduate school | single | 37.0 | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | 64400.0 | 57069.0 | 57608.0 | 19394.0 | 19619.0 | 20024.0 | 2500.0 | 1815.0 | 657.0 | 1000.0 | 1000.0 | 800.0 | 0 |
6 | 7.0 | 500000.0 | male | graduate school | single | 29.0 | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | 367965.0 | 412023.0 | 445007.0 | 542653.0 | 483003.0 | 473944.0 | 55000.0 | 40000.0 | 38000.0 | 20239.0 | 13750.0 | 13770.0 | 0 |
7 | 8.0 | 100000.0 | female | university | single | 23.0 | use of revolving credit | pay duly | pay duly | use of revolving credit | use of revolving credit | pay duly | 11876.0 | 380.0 | 601.0 | 221.0 | -159.0 | 567.0 | 380.0 | 601.0 | 0.0 | 581.0 | 1687.0 | 1542.0 | 0 |
8 | 9.0 | 140000.0 | female | high school | married | 28.0 | use of revolving credit | use of revolving credit | 2 month delay | use of revolving credit | use of revolving credit | use of revolving credit | 11285.0 | 14096.0 | 12108.0 | 12211.0 | 11793.0 | 3719.0 | 3329.0 | 0.0 | 432.0 | 1000.0 | 1000.0 | 1000.0 | 0 |
9 | 10.0 | 20000.0 | male | high school | single | 35.0 | no consumption | no consumption | no consumption | no consumption | pay duly | pay duly | 0.0 | 0.0 | 0.0 | 0.0 | 13007.0 | 13912.0 | 0.0 | 0.0 | 0.0 | 13007.0 | 1122.0 | 0.0 | 0 |
The credit card default data is split into training and test sets to monitor and prevent overtraining. Reproducibility is also important factor in creating trustworthy models, and randomly splitting datasets can introduce randomness in model predictions and other results. A random seed is used here to ensure the data split is reproducible.
# split into training and validation
train, test = data.split_frame([0.7], seed=12345)
# summarize split
print('Train data rows = %d, columns = %d' % (train.shape[0], train.shape[1]))
print('Test data rows = %d, columns = %d' % (test.shape[0], test.shape[1]))
Train data rows = 21060, columns = 25 Test data rows = 8940, columns = 25
Many tuning parameters must be specified to train a GBM using h2o. Typically a grid search would be performed to identify the best parameters for a given modeling task using the H2OGridSearch
class. For brevity's sake, a previously-discovered set of good tuning parameters are specified here. Because gradient boosting methods typically resample training data, an additional random seed is also specified for the h2o GBM using the seed
parameter to create reproducible predictions, error rates, and variable importance values. To avoid overfitting, the stopping_rounds
parameter is used to stop the training process after the test error fails to decrease for 5 iterations.
# initialize GBM model
model = H2OGradientBoostingEstimator(ntrees=150, # maximum 150 trees in GBM
max_depth=4, # trees can have maximum depth of 4
sample_rate=0.9, # use 90% of rows in each iteration (tree)
col_sample_rate=0.9, # use 90% of variables in each iteration (tree)
stopping_rounds=5, # stop if validation error does not decrease for 5 iterations (trees)
seed=12345) # for reproducibility
# train a GBM model
model.train(y=y, x=X, training_frame=train, validation_frame=test)
# print AUC
print('GBM Test AUC = %.4f' % model.auc(valid=True))
# uncomment to see model details
# print(model)
gbm Model Build progress: |███████████████████████████████████████████████| 100% GBM Test AUC = 0.7804
During training, the h2o GBM aggregates the improvement in error caused by each split in each decision tree across all the decision trees in the ensemble classifier. These values are attributed to the input variable used in each split and give an indication of the contribution each input variable makes toward the model's predictions. The variable importance ranking should be parsimonious with human domain knowledge and reasonable expectations. In this case, a customer's most recent payment behavior, PAY_0
, is by far the most important variable followed by their second most recent payment, PAY_2
, their credit limit, LIMIT_BAL
, and third most recent payment behavior, PAY_3
. This result is well-aligned with business practices in credit lending: people who miss their most recent payments are likely to default soon.
model.varimp_plot()
Residuals refer to the difference between the recorded value of a dependent variable and the predicted value of a dependent variable for every row in a data set. Plotting the residual values against the predicted values is a time-honored model assessment technique and a great way to see all your modeling results in two dimensions.
To calculate the residuals for our GBM model, first the model predictions are merged onto onto the test set. The test data is used here to see how the model behaves on holdout data, which should be closer to its behavior on new data than analyzing residuals for the training inputs and predictions.
yhat = 'p_DEFAULT_NEXT_MONTH'
preds1 = model.predict(test).drop(['predict', 'p0'])
preds1.columns = [yhat]
test_yhat = test.cbind(preds1[yhat])
gbm prediction progress: |████████████████████████████████████████████████| 100%
For binomial classification, deviance residuals are related to the logloss cost function. Like analyzing $y - \hat{y}$ for linear regression, these residuals are the quantities that the GBM sought to minimize. Deviance residual values are calculated by applying the simple formula in the cell directly below.
# use Pandas for adding columns and plotting
test_yhat = test_yhat.as_data_frame()
test_yhat['s'] = 1
test_yhat.loc[test_yhat['DEFAULT_NEXT_MONTH'] == 0, 's'] = -1
test_yhat['r_DEFAULT_NEXT_MONTH'] = test_yhat['s'] * np.sqrt(-2*(test_yhat[y]*np.log(test_yhat[yhat]) +
((1 - test_yhat[y])*np.log(1 - test_yhat[yhat]))))
test_yhat = test_yhat.drop('s', axis=1)
Plotting residuals is a model debugging and diagnostic tool that enables users to see modeling results, and any anomolies, in a single two-dimensional plot. Here the green points represent customers who defaulted, and the blue points represent customers who did not. A few potential outliers are visible. There appear to be several cases in the test data with relatively large negative residuals. Understanding and addressing the factors that cause these outliers could lead to a more acccurate model.
groups = test_yhat.groupby('DEFAULT_NEXT_MONTH') # define groups
fig, ax_ = plt.subplots(figsize=(8, 8)) # initialize figure
plt.xlabel('Predicted: DEFAULT_NEXT_MONTH')
plt.ylabel('Residual: DEFAULT_NEXT_MONTH')
# plot groups with appropriate color
color_list = ['b', 'g']
c_idx = 0
for name, group in groups:
ax_.plot(group.p_DEFAULT_NEXT_MONTH, group.r_DEFAULT_NEXT_MONTH, label=' '.join(['DEFAULT_NEXT_MONTH:', str(name)]),
marker='o', linestyle='', color=color_list[c_idx], alpha=0.3)
c_idx += 1
_ = ax_.legend(loc=1) # legend
Printing a table with model inputs, actual target values, and model predictions sorted by residuals is another simple way to analyze residuals. Customers that defaulted, but were predicted not to, are listed at the top of the table below. Scroll to the bottom of the table to see the customers who were predicted to default, but then did not. Also notice the jumps in residual values. These are the potential outliers pictured in the residual plot above.
test_yhat = test_yhat.sort_values(by='r_DEFAULT_NEXT_MONTH', ascending=False).reset_index(drop=True)
test_yhat
ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT_NEXT_MONTH | p_DEFAULT_NEXT_MONTH | r_DEFAULT_NEXT_MONTH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2561 | 310000 | female | graduate school | single | 32 | no consumption | no consumption | no consumption | no consumption | no consumption | no consumption | 20138 | 8267 | 65993 | 8543 | 1695 | 750 | 8267 | 66008 | 8543 | 1695 | 750 | 7350 | 1 | 0.045837 | 2.483007 |
1 | 3016 | 350000 | male | graduate school | married | 38 | no consumption | no consumption | pay duly | use of revolving credit | use of revolving credit | no consumption | 16459 | 4120 | 44164 | 35233 | 884 | 9924 | 941 | 44743 | 0 | 884 | 9924 | 10824 | 1 | 0.050230 | 2.445869 |
2 | 11462 | 210000 | female | graduate school | single | 46 | pay duly | pay duly | pay duly | use of revolving credit | use of revolving credit | pay duly | 15655 | 3918 | 29881 | 24247 | 21664 | 1556 | 4854 | 30366 | 0 | 433 | 1556 | 14047 | 1 | 0.050527 | 2.443456 |
3 | 25772 | 350000 | female | graduate school | married | 33 | use of revolving credit | pay duly | pay duly | pay duly | pay duly | pay duly | 82964 | 68532 | 17926 | 17966 | 30741 | 31088 | 68940 | 18018 | 18058 | 30897 | 31244 | 88461 | 1 | 0.051503 | 2.435615 |
4 | 6933 | 500000 | male | graduate school | single | 37 | pay duly | pay duly | pay duly | pay duly | pay duly | pay duly | 4331 | 60446 | 30592 | 154167 | 13410 | 25426 | 60446 | 30594 | 150843 | 163881 | 25426 | 39526 | 1 | 0.051717 | 2.433910 |
5 | 22505 | 260000 | female | university | single | 33 | pay duly | pay duly | pay duly | pay duly | pay duly | use of revolving credit | 5188 | 12357 | 28656 | 7497 | 7685 | 15434 | 13000 | 29022 | 7500 | 27769 | 12000 | 6200 | 1 | 0.053061 | 2.423347 |
6 | 22751 | 350000 | female | graduate school | married | 32 | pay duly | pay duly | no consumption | no consumption | no consumption | no consumption | 30625 | 60003 | 7147 | 9950 | 22117 | 4874 | 60396 | 7147 | 9950 | 22117 | 4874 | 0 | 1 | 0.056650 | 2.396193 |
7 | 13381 | 400000 | female | university | single | 35 | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | 109943 | 222085 | 223350 | 213831 | 210563 | 211925 | 120018 | 10071 | 8037 | 8018 | 8809 | 5022 | 1 | 0.058753 | 2.380929 |
8 | 19530 | 350000 | female | university | married | 36 | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | 21026 | 35588 | 38002 | 40357 | 43663 | 52735 | 15000 | 3000 | 3000 | 4000 | 10000 | 25000 | 1 | 0.058911 | 2.379801 |
9 | 15549 | 450000 | male | university | single | 36 | no consumption | no consumption | no consumption | no consumption | no consumption | no consumption | 8012 | 4009 | 5226 | 4715 | 3275 | 6422 | 4021 | 5241 | 4729 | 3284 | 6441 | 4285 | 1 | 0.059060 | 2.378739 |
10 | 25692 | 330000 | female | graduate school | single | 42 | no consumption | no consumption | no consumption | no consumption | no consumption | no consumption | 565 | 20650 | 15360 | 0 | 12923 | 1816 | 20650 | 15360 | 0 | 12923 | 1816 | 17050 | 1 | 0.059089 | 2.378531 |
11 | 971 | 300000 | male | graduate school | married | 42 | pay duly | use of revolving credit | use of revolving credit | pay duly | use of revolving credit | use of revolving credit | 11973 | 61834 | 25145 | 37666 | 19453 | 10492 | 20979 | 5000 | 37676 | 8808 | 2000 | 2709 | 1 | 0.060745 | 2.366883 |
12 | 22390 | 310000 | female | graduate school | married | 32 | use of revolving credit | pay duly | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | 4762 | 26943 | 7488 | 10276 | 96059 | 6434 | 26943 | 5000 | 6000 | 93000 | 3000 | 5000 | 1 | 0.061522 | 2.361507 |
13 | 6854 | 290000 | male | high school | single | 34 | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | 5451 | 6230 | 140802 | 143354 | 146225 | 148820 | 1200 | 135000 | 5200 | 5500 | 5500 | 5400 | 1 | 0.061900 | 2.358911 |
14 | 8959 | 340000 | male | graduate school | single | 44 | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | 83059 | 85634 | 73950 | 59324 | 156094 | 110234 | 20000 | 5000 | 2000 | 112000 | 4234 | 4000 | 1 | 0.061935 | 2.358675 |
15 | 1980 | 500000 | female | graduate school | married | 35 | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | 35176 | 36193 | 44157 | 48322 | 21593 | 13866 | 2504 | 10004 | 5178 | 1047 | 2019 | 1004 | 1 | 0.062027 | 2.358041 |
16 | 18899 | 140000 | female | other | single | 28 | use of revolving credit | use of revolving credit | pay duly | use of revolving credit | use of revolving credit | use of revolving credit | 108018 | 6500 | 6327 | 138485 | 140492 | 141006 | 1000 | 6327 | 135000 | 4700 | 5000 | 5000 | 1 | 0.063140 | 2.350490 |
17 | 2302 | 230000 | female | graduate school | married | 30 | pay duly | pay duly | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | 2212 | 17402 | 32450 | 17285 | 9766 | 9981 | 17402 | 20013 | 346 | 5000 | 8000 | 5000 | 1 | 0.063386 | 2.348836 |
18 | 523 | 360000 | male | graduate school | single | 28 | pay duly | pay duly | pay duly | use of revolving credit | use of revolving credit | pay duly | 1210 | 820 | 64644 | 125984 | 106584 | 125557 | 390 | 75720 | 62520 | 17000 | 132200 | 167000 | 1 | 0.063629 | 2.347206 |
19 | 11745 | 220000 | male | graduate school | single | 51 | pay duly | pay duly | pay duly | pay duly | pay duly | pay duly | 20730 | -270 | 53895 | -105 | 20895 | 20835 | 0 | 54165 | 0 | 21000 | 20940 | 33460 | 1 | 0.063813 | 2.345973 |
20 | 8339 | 480000 | male | graduate school | married | 58 | no consumption | no consumption | no consumption | no consumption | no consumption | no consumption | 24610 | -310 | 148544 | 18791 | 5909 | 68988 | 4 | 149654 | 18885 | 5940 | 69337 | 200655 | 1 | 0.063860 | 2.345661 |
21 | 22712 | 320000 | female | high school | married | 35 | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | 157249 | 148123 | 142852 | 133583 | 129049 | 128325 | 6000 | 6048 | 6000 | 5000 | 5000 | 5000 | 1 | 0.065316 | 2.336031 |
22 | 26005 | 320000 | female | university | married | 35 | pay duly | pay duly | pay duly | pay duly | use of revolving credit | use of revolving credit | 2276 | 6626 | 11131 | 13824 | 17992 | 15250 | 6626 | 12446 | 17746 | 6000 | 5749 | 928 | 1 | 0.065442 | 2.335207 |
23 | 13797 | 390000 | male | graduate school | married | 36 | no consumption | no consumption | no consumption | no consumption | pay duly | pay duly | 3931 | 3625 | 1600 | 3815 | 8330 | 4765 | 3625 | 1600 | 3315 | 11645 | 4765 | 2171 | 1 | 0.065657 | 2.333801 |
24 | 10147 | 450000 | female | graduate school | married | 46 | pay duly | pay duly | pay duly | pay duly | pay duly | pay duly | 28205 | 3760 | 4148 | 2312 | 6909 | 4189 | 3793 | 4148 | 2312 | 6909 | 4189 | 1539 | 1 | 0.065732 | 2.333313 |
25 | 1199 | 340000 | female | high school | single | 44 | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | 142836 | 145125 | 146682 | 150407 | 147868 | 149349 | 7000 | 5500 | 6027 | 5328 | 5390 | 6047 | 1 | 0.065756 | 2.333153 |
26 | 10754 | 160000 | female | university | married | 31 | use of revolving credit | use of revolving credit | use of revolving credit | pay duly | pay duly | use of revolving credit | 42781 | 42774 | 41817 | 749 | 5572 | 10573 | 2300 | 2300 | 749 | 5572 | 5573 | 13793 | 1 | 0.066522 | 2.328184 |
27 | 25816 | 350000 | female | university | married | 47 | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | 97500 | 84202 | 82933 | 80501 | 79038 | 80694 | 3010 | 2970 | 2886 | 2824 | 2925 | 2987 | 1 | 0.066585 | 2.327775 |
28 | 12668 | 210000 | female | graduate school | married | 37 | use of revolving credit | use of revolving credit | pay duly | pay duly | pay duly | pay duly | 24547 | 48302 | 4549 | 3085 | 7300 | 6583 | 4519 | 9098 | 3085 | 7300 | 6583 | 5060 | 1 | 0.066960 | 2.325363 |
29 | 15482 | 150000 | male | graduate school | married | 37 | no consumption | no consumption | no consumption | no consumption | no consumption | no consumption | 22109 | 10876 | 10268 | 5872 | 3068 | 2181 | 10943 | 10273 | 5978 | 3068 | 2181 | 3242 | 1 | 0.067985 | 2.318820 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8910 | 2865 | 50000 | female | university | married | 46 | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 28390 | 29639 | 30854 | 30062 | 32705 | 33519 | 2000 | 2000 | 0 | 3300 | 1500 | 0 | 0 | 0.772660 | -1.721227 |
8911 | 8115 | 120000 | female | university | single | 26 | 3 month delay | 3 month delay | 2 month delay | 2 month delay | 3 month delay | 2 month delay | 12034 | 12548 | 12056 | 13958 | 13468 | 6144 | 1000 | 0 | 2400 | 100 | 0 | 57258 | 0 | 0.772848 | -1.721707 |
8912 | 13422 | 100000 | female | graduate school | single | 29 | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 74032 | 75557 | 76434 | 74611 | 79292 | 80945 | 3300 | 2700 | 0 | 5900 | 3100 | 0 | 0 | 0.774303 | -1.725433 |
8913 | 12846 | 90000 | male | university | married | 42 | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | use of revolving credit | 95773 | 95489 | 94681 | 93965 | 90545 | 90529 | 4000 | 3500 | 3500 | 0 | 3500 | 4200 | 0 | 0.775734 | -1.729118 |
8914 | 1629 | 140000 | female | high school | married | 31 | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 89910 | 92588 | 91936 | 94623 | 94952 | 95234 | 5000 | 1800 | 5000 | 1900 | 4700 | 0 | 0 | 0.776880 | -1.732076 |
8915 | 4249 | 90000 | male | high school | married | 42 | 2 month delay | 2 month delay | 2 month delay | 3 month delay | 3 month delay | 3 month delay | 48674 | 49895 | 52570 | 53614 | 54534 | 53374 | 2300 | 4116 | 2500 | 2052 | 0 | 0 | 0 | 0.779941 | -1.740033 |
8916 | 9756 | 140000 | male | graduate school | married | 31 | 2 month delay | use of revolving credit | use of revolving credit | 2 month delay | 2 month delay | 2 month delay | 51028 | 52112 | 55232 | 55932 | 54910 | 57344 | 2500 | 4600 | 2200 | 0 | 3513 | 3000 | 0 | 0.780607 | -1.741776 |
8917 | 27215 | 60000 | male | university | married | 35 | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 20195 | 21267 | 21332 | 21680 | 23011 | 23498 | 1700 | 700 | 1000 | 2000 | 1000 | 0 | 0 | 0.781268 | -1.743508 |
8918 | 5986 | 100000 | male | high school | married | 44 | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 30076 | 31287 | 31676 | 32259 | 31608 | 33524 | 2000 | 1200 | 1400 | 0 | 2600 | 0 | 0 | 0.784777 | -1.752759 |
8919 | 15186 | 30000 | female | graduate school | single | 25 | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 7593 | 9634 | 4476 | 8830 | 8153 | 6422 | 2379 | 7 | 7002 | 13 | 155 | 1 | 0 | 0.787675 | -1.760476 |
8920 | 2974 | 30000 | female | university | married | 24 | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 150 | 150 | 150 | 150 | 150 | 300 | 0 | 0 | 0 | 0 | 150 | 0 | 0 | 0.793444 | -1.776054 |
8921 | 10785 | 80000 | female | university | married | 33 | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 53843 | 55933 | 56575 | 57303 | 58593 | 59738 | 3500 | 2100 | 2200 | 2300 | 2200 | 2100 | 0 | 0.794019 | -1.777621 |
8922 | 13290 | 230000 | female | graduate school | married | 34 | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 190784 | 195724 | 198707 | 201634 | 205949 | 210077 | 9300 | 7500 | 7500 | 7500 | 7500 | 7600 | 0 | 0.796132 | -1.783414 |
8923 | 27255 | 80000 | male | graduate school | married | 46 | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 40509 | 40551 | 42592 | 43296 | 43892 | 43060 | 1000 | 3000 | 1700 | 1600 | 0 | 3500 | 0 | 0.797857 | -1.788172 |
8924 | 17703 | 60000 | female | university | married | 35 | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 3167 | 5601 | 5366 | 6772 | 6515 | 7906 | 2500 | 0 | 1500 | 0 | 1500 | 0 | 0 | 0.802237 | -1.800380 |
8925 | 13811 | 40000 | male | graduate school | married | 47 | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 11084 | 12605 | 13102 | 12595 | 14386 | 14005 | 2000 | 1000 | 0 | 2000 | 0 | 2000 | 0 | 0.805651 | -1.810028 |
8926 | 16920 | 50000 | male | high school | married | 52 | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 4 month delay | 3 month delay | 36428 | 37530 | 38630 | 41774 | 40806 | 41357 | 2000 | 2000 | 4086 | 0 | 1500 | 1000 | 0 | 0.807385 | -1.814973 |
8927 | 17748 | 30000 | female | high school | married | 54 | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 4 month delay | 3 month delay | 22147 | 24770 | 26068 | 28842 | 28094 | 27361 | 3000 | 2000 | 3500 | 0 | 0 | 1000 | 0 | 0.812416 | -1.829496 |
8928 | 18659 | 40000 | female | university | married | 28 | 2 month delay | 2 month delay | 3 month delay | 2 month delay | 2 month delay | 2 month delay | 31131 | 33815 | 33002 | 32173 | 34629 | 33940 | 3500 | 0 | 0 | 3000 | 0 | 2000 | 0 | 0.814786 | -1.836434 |
8929 | 26565 | 200000 | female | high school | married | 55 | 2 month delay | 2 month delay | 3 month delay | 2 month delay | 2 month delay | 2 month delay | 159017 | 162697 | 163143 | 161906 | 165807 | 169599 | 9159 | 4842 | 3000 | 8000 | 7000 | 3000 | 0 | 0.815782 | -1.839367 |
8930 | 21098 | 200000 | male | graduate school | married | 42 | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 2 month delay | 168289 | 172001 | 175281 | 177895 | 180078 | 184048 | 8000 | 7500 | 7000 | 6600 | 7000 | 7100 | 0 | 0.816460 | -1.841371 |
8931 | 7068 | 90000 | female | graduate school | single | 30 | 2 month delay | 2 month delay | 3 month delay | 3 month delay | 3 month delay | 3 month delay | 750 | 750 | 750 | 750 | 2450 | 2150 | 0 | 0 | 0 | 2000 | 0 | 0 | 0 | 0.825031 | -1.867161 |
8932 | 3087 | 30000 | female | university | single | 24 | 2 month delay | 2 month delay | 7 month delay | 7 month delay | 7 month delay | 7 month delay | 300 | 300 | 300 | 300 | 300 | 300 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.825105 | -1.867387 |
8933 | 14589 | 280000 | male | graduate school | married | 50 | 3 month delay | 5 month delay | 4 month delay | 3 month delay | 2 month delay | use of revolving credit | 327918 | 321476 | 314931 | 176439 | 154010 | 134334 | 0 | 0 | 500 | 0 | 6267 | 2257 | 0 | 0.832569 | -1.890599 |
8934 | 16957 | 270000 | male | graduate school | married | 50 | 2 month delay | 4 month delay | 3 month delay | 3 month delay | 2 month delay | 2 month delay | 213616 | 208784 | 212058 | 207226 | 202394 | 231339 | 0 | 8000 | 0 | 0 | 32236 | 3000 | 0 | 0.841972 | -1.920928 |
8935 | 29505 | 20000 | male | university | married | 40 | 1 month delay | 2 month delay | 3 month delay | 2 month delay | 3 month delay | 3 month delay | 14829 | 17267 | 16706 | 18694 | 19049 | 18459 | 3000 | 0 | 2560 | 955 | 0 | 661 | 0 | 0.852781 | -1.957464 |
8936 | 19316 | 110000 | female | graduate school | married | 41 | 3 month delay | 2 month delay | 2 month delay | 7 month delay | 7 month delay | 7 month delay | 150 | 150 | 150 | 150 | 150 | 150 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.866568 | -2.007067 |
8937 | 22725 | 100000 | female | university | married | 38 | 3 month delay | 2 month delay | 2 month delay | 3 month delay | 3 month delay | 3 month delay | 750 | 750 | 750 | 750 | 750 | 750 | 0 | 0 | 0 | 0 | 0 | 1500 | 0 | 0.869051 | -2.016405 |
8938 | 9672 | 170000 | male | graduate school | single | 48 | 2 month delay | 2 month delay | 7 month delay | 7 month delay | 7 month delay | 7 month delay | 2400 | 2400 | 2400 | 2400 | 2400 | 2400 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.874018 | -2.035492 |
8939 | 5916 | 110000 | female | graduate school | married | 41 | 2 month delay | 2 month delay | 7 month delay | 7 month delay | 7 month delay | 7 month delay | 150 | 150 | 150 | 150 | 150 | 150 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.886468 | -2.085985 |
8940 rows × 27 columns
This simple analysis has uncovered some of the most difficult customers for the GBM to correctly predict default. Perhaps because of the high importance of the payment features, PAY_0
-PAY_6
, the GBM struggles to correctly predict several cases in which customers made timely recent payments and then suddenly defaulted (high positive residuals) and those customers that were chronically late making payments but did not default (high negative residuals).
Residuals can also be plotted for important input variables to understand how the values of a single input variable affect prediction errors. When plotted by PAY_0
, the residuals confirm that the GBM is struggling to accurately predict cases where default status is not correlated with recent payment behavior in an obvious way. The residual plots for values of PAY_0
indicating timely payment behavior (e.g., use of revolving credit
, pay duly
, and no consumption
) generally display the highest positive residuals and relatively small negative residuals. Residuals for the other values of PAY_0
, those that represent late recent payments, tend to show large negative residuals and relatively small positive residuals.
# use Seaborn FacetGrid for convenience
g = sns.FacetGrid(test_yhat, row='PAY_0', hue=y)
_ = g.map(plt.scatter, yhat, 'r_DEFAULT_NEXT_MONTH', alpha=0.4)
Now that an issue has been discovered using residual analysis, can it be resolved?
One strategy to improve prediction accuracy is to introduce a new feature that summarizes a customer's spending behavior over time to expose any potential financial instability: the standard deviation of a customer's bill amounts over six months. Pandas has a one-liner for calculating standard deviations for a set of columns, so the H2OFrame is casted back into Pandas DataFrame for convenience.
data = data.as_data_frame()
data['bill_std'] = data[['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']].std(axis=1)
data.head(n=3)
ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT_NEXT_MONTH | bill_std | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 20000 | female | university | married | 24 | 2 month delay | 2 month delay | pay duly | pay duly | no consumption | no consumption | 3913 | 3102 | 689 | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 | 1761.633219 |
1 | 2 | 120000 | female | university | single | 26 | pay duly | 2 month delay | use of revolving credit | use of revolving credit | use of revolving credit | 2 month delay | 2682 | 1725 | 2682 | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 | 637.967841 |
2 | 3 | 90000 | female | university | single | 34 | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | use of revolving credit | 29239 | 14027 | 13559 | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 | 6064.518593 |
To retrain the model with the new feature, an H2OFrame is required and that H2OFrame is split using the same proportion and random seed as in cell 8 for the first GBM model.
data = h2o.H2OFrame(data) # convert
data[y] = data[y].asfactor() # ensure target is handled as a categorical variable
train, test = data.split_frame([0.7], seed=12345) # split into training and validation
Parse progress: |█████████████████████████████████████████████████████████| 100%
The train()
function is used to retrain the GBM model with the nearly same hyperparameters used previously in cell 9. A slight, but noticable, increase in accuracy results from retraining with the new feature.
# initialize GBM model
model = H2OGradientBoostingEstimator(ntrees=150, # maximum 150 trees in GBM
max_depth=6, # trees can have maximum depth of 6
sample_rate=0.9, # use 90% of rows in each iteration (tree)
col_sample_rate=0.85, # use 90% of variables in each iteration (tree)
stopping_rounds=5, # stop if validation error does not decrease for 5 iterations (trees)
seed=12345) # for reproducibility
# retrain GBM model
model.train(y=y,
x=X + ['bill_std'], # add new feature
training_frame=train,
validation_frame=test)
# print AUC
print('GBM Test AUC = %.4f' % model.auc(valid=True))
gbm Model Build progress: |███████████████████████████████████████████████| 100% GBM Test AUC = 0.7825
While there maybe be other more complex features or a more optimal set of hyperparameters that could lead to further incremental increases in accuracy, more information is needed to achieve meaningful improvement in prediction performance. In particular, a common measure for credit lending, the customers' debt-to-income ratio, for each payment and billing period could be particularly useful. Spikes in debt-to-income ratio, representing loss of income or large increases in debt, would likely be very indicative of a default and would expose the GBM to information not currently available in the UCI credit card default data. Introducing new data could also de-emphasize PAY_0
, which would likely result in a more stable model as well.
Sensitivity analysis investigates whether model behavior and outputs remain stable when data is intentionally perturbed or other changes are simulated in data. Beyond traditional assessment practices, sensitivity analysis of machine learning model predictions is perhaps the most important validation technique for machine learning models. Machine learning models can make drastically differing predictions for only minor changes in input variable values. In practice, many linear model validation techniques focus on the numerical instability of regression parameters due to correlation between input variables or between input variables and the dependent variable. It may be prudent for those switching from linear modeling techniques to machine learning techniques to focus less on numerical instability of model parameters and to focus more on the potential instability of model predictions.
Here sensitivity analysis is used to understand the impact of changing the most important input variable, PAY_0
, and the impact of a sociologically sensitive variable, SEX
, in the model. If the model changes in reasonable and expected ways when important variable values are changed this can enhance trust in the model. If the contribution of potentially sensitive variables, such as those related to gender, race, age, marital status, or disability status, can be shown to have minimal impact on the model, this is an indication of fairness in the model predictions and can also increase overall trust in the model.
Typically, a productive exercise in model debugging and validation is to investigate customers with very high or low predicted probabilities to determine if their predictions stay within reasonable bounds when important variables are changed. The predictions from the new, more accurate model are merged onto the test set to find these potentially interesting customers.
preds2 = model.predict(test).drop(['predict', 'p0'])
preds2.columns = [yhat]
test_yhat = test.cbind(preds1[yhat])
gbm prediction progress: |████████████████████████████████████████████████| 100%
The function below finds and returns the row indices for the minimum, the maximum, and the deciles of one column in terms of another -- in this case, the model predictions (p_DEFAULT_NEXT_MONTH
) and the row identifier (ID
), respectively. These indices are used as a starting point for boundary testing. Outlying predictions found through residual analysis is another group of potentially interesting local predictions to investigate.
def get_percentile_dict(yhat, id_, frame):
""" Returns the minimum, the maximum, and the deciles of a column, yhat,
as the indices based on another column id_.
Args:
yhat: Column in which to find percentiles.
id_: Id column that stores indices for percentiles of yhat.
frame: H2OFrame containing yhat and id_.
Returns:
Dictionary of percentile values and index column values.
"""
# create a copy of frame and sort it by yhat
sort_df = frame.as_data_frame()
sort_df.sort_values(yhat, inplace=True)
sort_df.reset_index(inplace=True)
# find top and bottom percentiles
percentiles_dict = {}
percentiles_dict[0] = sort_df.loc[0, id_]
percentiles_dict[99] = sort_df.loc[sort_df.shape[0]-1, id_]
# find 10th-90th percentiles
inc = sort_df.shape[0]//10
for i in range(1, 10):
percentiles_dict[i * 10] = sort_df.loc[i * inc, id_]
return percentiles_dict
# display percentiles dictionary
# ID values for rows
# from lowest prediction
# to highest prediction
pred_percentile_dict = get_percentile_dict(yhat, 'ID', test_yhat)
pred_percentile_dict
{0: 28716, 10: 8942, 20: 28257, 30: 4074, 40: 13411, 50: 16633, 60: 2402, 70: 19769, 80: 25069, 90: 21372, 99: 29116}
Unlike some regression models and neural networks that can produce outrageous predictions for changes in input variables, GBM predictions in new data are bounded by the lowest and highest probability leaf nodes in each constiuent decision tree in the trained model. While unbounded, extreme predictions are typically not an issue for tree models and classification tasks, it is often a good idea to check that the model predictions cover a full range of useful values in the test set. Below, we can see that the model produces both low and high predictions in the test set, indicating that it is likely responsive to signal in new data and not simply predicting the majority class or an average value.
print('Lowest prediction:', test_yhat[test_yhat['ID'] == int(pred_percentile_dict[0])][[y, yhat]])
print('Highest prediction:', test_yhat[test_yhat['ID'] == int(pred_percentile_dict[99])][[y, yhat]])
Lowest prediction:
DEFAULT_NEXT_MONTH | p_DEFAULT_NEXT_MONTH |
---|---|
0 | 0.0383668 |
Highest prediction:
DEFAULT_NEXT_MONTH | p_DEFAULT_NEXT_MONTH |
---|---|
1 | 0.895285 |
As a starting point for further analysis, sensitivity analysis is performed for the customer least likely to default. This woman has a very low probability of defaulting according to the trained GBM.
test_case = test_yhat[test_yhat['ID'] == int(pred_percentile_dict[0])]
test_case
ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT_NEXT_MONTH | bill_std | p_DEFAULT_NEXT_MONTH |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
28716 | 780000 | female | university | single | 41 | no consumption | no consumption | no consumption | no consumption | no consumption | no consumption | 101957 | 61715 | 38686 | 21482 | 72628 | 182792 | 62819 | 39558 | 22204 | 82097 | 184322 | 25695 | 0 | 57564.1 | 0.0383668 |
SEX
¶SEX
should not have a large impact on predictions. This could indicate unwanted sociological bias in the GBM model.
test_case = test_yhat[test_yhat['ID'] == int(pred_percentile_dict[0])]
test_case = test_case.drop([yhat])
test_case['SEX'] = 'male'
test_case = test_case.cbind(model.predict(test_case))
test_case
gbm prediction progress: |████████████████████████████████████████████████| 100%
ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT_NEXT_MONTH | bill_std | predict | p0 | p1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
28716 | 780000 | male | university | single | 41 | no consumption | no consumption | no consumption | no consumption | no consumption | no consumption | 101957 | 61715 | 38686 | 21482 | 72628 | 182792 | 62819 | 39558 | 22204 | 82097 | 184322 | 25695 | 0 | 57564.1 | 0 | 0.959052 | 0.0409481 |
As desired, simulating this person as a male does not have a large impact on their probability of default.
PAY_0
¶Variable importance and residual analysis indicates that the value of PAY_0
can have a strong effect on model predictions. Measuring the change in predicted probability when the value of PAY_0
is changed from a timely payment to late payment is probably a good test case for prediction stability.
test_case = test_yhat[test_yhat['ID'] == int(pred_percentile_dict[0])]
test_case = test_case.drop([yhat])
test_case['PAY_0'] = '2 month delay'
test_case = test_case.cbind(model.predict(test_case))
test_case
gbm prediction progress: |████████████████████████████████████████████████| 100%
ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT_NEXT_MONTH | bill_std | predict | p0 | p1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
28716 | 780000 | female | university | single | 41 | 2 month delay | no consumption | no consumption | no consumption | no consumption | no consumption | 101957 | 61715 | 38686 | 21482 | 72628 | 182792 | 62819 | 39558 | 22204 | 82097 | 184322 | 25695 | 0 | 57564.1 | 1 | 0.571032 | 0.428968 |
When the value is changed from no consumption
to two month delay
there is a very large increase in predicted probability. Such a marked change related to the value of one variable is problematic for numerous reasons.
Now the same test will be performed on the customer most likely to default. This woman has a very high probability of default under the GBM model.
test_case = test_yhat[test_yhat['ID'] == int(pred_percentile_dict[99])]
test_case
ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT_NEXT_MONTH | bill_std | p_DEFAULT_NEXT_MONTH |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29116 | 20000 | female | university | married | 59 | 3 month delay | 2 month delay | 3 month delay | 2 month delay | 2 month delay | 4 month delay | 8803 | 11137 | 10672 | 11201 | 12721 | 11946 | 2800 | 0 | 1000 | 2000 | 0 | 0 | 1 | 1327.55 | 0.895285 |
SEX
¶Changing the value for SEX
from female to male for this customer decreases the predicted probability by a relatively small amount.
test_case = test_yhat[test_yhat['ID'] == int(pred_percentile_dict[99])]
test_case = test_case.drop([yhat])
test_case['SEX'] = 'male'
test_case = test_case.cbind(model.predict(test_case))
test_case
gbm prediction progress: |████████████████████████████████████████████████| 100%
ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT_NEXT_MONTH | bill_std | predict | p0 | p1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29116 | 20000 | male | university | married | 59 | 3 month delay | 2 month delay | 3 month delay | 2 month delay | 2 month delay | 4 month delay | 8803 | 11137 | 10672 | 11201 | 12721 | 11946 | 2800 | 0 | 1000 | 2000 | 0 | 0 | 1 | 1327.55 | 1 | 0.161579 | 0.838421 |
PAY_0
¶Switching the riskiest customer's value for PAY_0
from 3 month delay
to pay duly
reduces the their chance of default by about 20%, a noticable swing in probability but still a higher probability value, notably greater than common lending cutoffs.
test_case = test_yhat[test_yhat['ID'] == int(pred_percentile_dict[99])]
test_case = test_case.drop([yhat])
test_case['PAY_0'] = 'pay duly'
test_case = test_case.cbind(model.predict(test_case))
test_case
gbm prediction progress: |████████████████████████████████████████████████| 100%
ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT_NEXT_MONTH | bill_std | predict | p0 | p1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29116 | 20000 | female | university | married | 59 | pay duly | 2 month delay | 3 month delay | 2 month delay | 2 month delay | 4 month delay | 8803 | 11137 | 10672 | 11201 | 12721 | 11946 | 2800 | 0 | 1000 | 2000 | 0 | 0 | 1 | 1327.55 | 1 | 0.273858 | 0.726142 |
From this small number of boundary test cases, the GBM model appears stable. However, if large swings in predictions occur for sensitive or important variables, practicioners are urged to retrain unstable models without the problematic variables or combinations of variables, which may unfortunately involve some trial and error. Also, four test cases is woefully inadequate for real-world models. Automated sensitivity analysis across many variables, combinations of variables, and for many different rows of data seems more appropriate for mission-critical machine learning.
After using h2o, it's typically best to shut it down. However, before doing so, users should ensure that they have saved any h2o data structures, such as models, H2OFrames, or scoring artifacts, such as POJOs or MOJOs.
# be careful, this can erase your work!
h2o.cluster().shutdown(prompt=True)
Are you sure you want to shutdown the H2O instance running at http://127.0.0.1:54321 (Y/N)? n
In this notebook, a complex GBM classifier was trained to predict credit card defaults. Residual analysis was used to debug the GBM model predictions and enabled a slight improvement in accuracy. Sensitivity analysis was used to test the GBM for trustworthiness and stability. In a small number of boundary test cases, the trained GBM appeared stable. Residual analysis and sensitivity analysis are powerful model debugging techniques and can increase trust in complex models. These techniques should generalize well for many types of business and research problems, enabling you to train a complex model and justify it to your colleagues, bosses, and potentially, external regulators.