Copyright 2017 - 2020 Patrick Hall and the H2O.ai team
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
DISCLAIMER: This notebook is not legal compliance advice.
A key to building interpretable models is to limit their complexity. The more complex a model is, the harder it is to explain and understand. Overly complex models can also make unstable predictions on new data, which is both difficult to explain and makes models harder to trust. Monotonicity constraints not only simplify models, but do so in a way that is somewhat natural for human reasoning, increasing the transparency of predictive models. Under monotonicity constraints, model predictions can only increase or only decrease as an input variable value increases, and the direction of the constraint is typically specified by the user for logical reasons. For instance, a model might be constrained to produce only increasing probabilities of a certain medical condition as a patient's age increases, or to make only increasing predictions for home prices as a home's square footage increases.
In this notebook a gradient boosting machine (GBM) is trained with monotonicity constraints to predict credit card payment defaults, using the UCI credit card default data, Python, NumPy, Pandas, and XGBoost. First, the credit card default data is loaded and prepared. Then Pearson correlation with the prediction target is used to determine the direction of the monotonicity constraints for each input variable and the model is trained. After the model is trained, partial dependence and individual conditional expectation (ICE) plots are used to analyze and verify the model's monotonic behavior. Finally an example of creating regulator mandated reason codes from high fidelity Shapley explanations for any model prediction is presented. This combination of monotonic XGBoost, partial dependence, ICE, and Shapley explanations is probably the most direct way to create an interpretable machine learning model today.
Note: As of the h2o 3.24 "Yates" release, Shapley values are supported in h2o, in addition to GBM monotonicity constraints and partial dependence. To see Shapley values and monotonicity constraints for an h2o GBM in action please see: https://github.com/jphall663/interpretable_machine_learning_with_python/blob/master/dia.ipynb.
Let's start with Python package imports. NumPy is used for basic array, vector, and matrix calculations. Pandas is used for data frame manipulation, shap is used for Shapley value calculations, Matplotlib is used for plotting, and XGBoost is used to train a GBM with monotonicity constraints.
import numpy as np # array, vector, matrix calculations
import pandas as pd # DataFrame handling
import shap # for consistent, signed variable importance measurements
import xgboost as xgb # gradient boosting machines (GBMs)
import matplotlib.pyplot as plt # plotting
pd.options.display.max_columns = 999 # enable display of all columns in notebook
# enables display of plots in notebook
%matplotlib inline
np.random.seed(12345) # set random seed for reproducibility
UCI credit card default data: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
The UCI credit card default data contains demographic and payment information about credit card customers in Taiwan in the year 2005. The data set contains 23 input variables:
- LIMIT_BAL: Amount of given credit (NT dollar)
- SEX: 1 = male; 2 = female
- EDUCATION: 1 = graduate school; 2 = university; 3 = high school; 4 = others
- MARRIAGE: 1 = married; 2 = single; 3 = others
- AGE: Age in years
- PAY_0, PAY_2-PAY_6: History of past payment. PAY_0 = the repayment status in September, 2005; PAY_2 = the repayment status in August, 2005; ...; PAY_6 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
- BILL_AMT1-BILL_AMT6: Amount of bill statement (NT dollar). BILL_AMT1 = amount of bill statement in September, 2005; BILL_AMT2 = amount of bill statement in August, 2005; ...; BILL_AMT6 = amount of bill statement in April, 2005.
- PAY_AMT1-PAY_AMT6: Amount of previous payment (NT dollar). PAY_AMT1 = amount paid in September, 2005; PAY_AMT2 = amount paid in August, 2005; ...; PAY_AMT6 = amount paid in April, 2005.

These 23 input variables are used to predict the target variable: whether or not a customer defaulted on their credit card bill in late 2005. Because XGBoost accepts only numeric inputs, all variables will be treated as numeric.
The credit card default data is available as an .xls file. Pandas reads .xls files automatically, so it's used to load the credit card default data and give the prediction target a shorter name: DEFAULT_NEXT_MONTH.
# import XLS file
path = 'default_of_credit_card_clients.xls'
data = pd.read_excel(path,
skiprows=1) # skip the first row of the spreadsheet
# remove spaces from target column name
data = data.rename(columns={'default payment next month': 'DEFAULT_NEXT_MONTH'})
The shorthand name y is assigned to the prediction target. X is assigned to all other input variables in the credit card default data except the row identifier, ID.
# assign target and inputs for GBM
y = 'DEFAULT_NEXT_MONTH'
X = [name for name in data.columns if name not in [y, 'ID']]
print('y =', y)
print('X =', X)
y = DEFAULT_NEXT_MONTH X = ['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
The Pandas describe() function displays a brief description of the credit card default data. The input variables SEX, EDUCATION, MARRIAGE, PAY_0-PAY_6, and the prediction target DEFAULT_NEXT_MONTH are really categorical variables, but they have already been encoded into meaningful numeric, integer values, which is great for XGBoost. Also, there are no missing values in this dataset.
data[X + [y]].describe() # display descriptive statistics for all columns
LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT_NEXT_MONTH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 3.000000e+04 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 3.000000e+04 | 30000.00000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 |
mean | 167484.322667 | 1.603733 | 1.853133 | 1.551867 | 35.485500 | -0.016700 | -0.133767 | -0.166200 | -0.220667 | -0.266200 | -0.291100 | 51223.330900 | 49179.075167 | 4.701315e+04 | 43262.948967 | 40311.400967 | 38871.760400 | 5663.580500 | 5.921163e+03 | 5225.68150 | 4826.076867 | 4799.387633 | 5215.502567 | 0.221200 |
std | 129747.661567 | 0.489129 | 0.790349 | 0.521970 | 9.217904 | 1.123802 | 1.197186 | 1.196868 | 1.169139 | 1.133187 | 1.149988 | 73635.860576 | 71173.768783 | 6.934939e+04 | 64332.856134 | 60797.155770 | 59554.107537 | 16563.280354 | 2.304087e+04 | 17606.96147 | 15666.159744 | 15278.305679 | 17777.465775 | 0.415062 |
min | 10000.000000 | 1.000000 | 0.000000 | 0.000000 | 21.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 | -165580.000000 | -69777.000000 | -1.572640e+05 | -170000.000000 | -81334.000000 | -339603.000000 | 0.000000 | 0.000000e+00 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 50000.000000 | 1.000000 | 1.000000 | 1.000000 | 28.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | 3558.750000 | 2984.750000 | 2.666250e+03 | 2326.750000 | 1763.000000 | 1256.000000 | 1000.000000 | 8.330000e+02 | 390.00000 | 296.000000 | 252.500000 | 117.750000 | 0.000000 |
50% | 140000.000000 | 2.000000 | 2.000000 | 2.000000 | 34.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 22381.500000 | 21200.000000 | 2.008850e+04 | 19052.000000 | 18104.500000 | 17071.000000 | 2100.000000 | 2.009000e+03 | 1800.00000 | 1500.000000 | 1500.000000 | 1500.000000 | 0.000000 |
75% | 240000.000000 | 2.000000 | 2.000000 | 2.000000 | 41.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 67091.000000 | 64006.250000 | 6.016475e+04 | 54506.000000 | 50190.500000 | 49198.250000 | 5006.000000 | 5.000000e+03 | 4505.00000 | 4013.250000 | 4031.500000 | 4000.000000 | 0.000000 |
max | 1000000.000000 | 2.000000 | 6.000000 | 3.000000 | 79.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 964511.000000 | 983931.000000 | 1.664089e+06 | 891586.000000 | 927171.000000 | 961664.000000 | 873552.000000 | 1.684259e+06 | 896040.00000 | 621000.000000 | 426529.000000 | 528666.000000 | 1.000000 |
Monotonic relationships are much easier to explain to colleagues, bosses, customers, and regulators than more complex, non-monotonic relationships. Monotonicity constraints may also prevent overfitting and reduce error due to variance on new data.
To train a transparent monotonic classifier, constraints must be supplied to XGBoost that determine whether the learned relationship between an input variable and the prediction target DEFAULT_NEXT_MONTH will be increasing or decreasing as that input variable increases. Pearson correlation provides a linear measure of the direction of the relationship between each input variable and the target. If the pair-wise Pearson correlation between an input and DEFAULT_NEXT_MONTH is positive, the input will be constrained to have an increasing relationship with the predictions for DEFAULT_NEXT_MONTH. If the pair-wise Pearson correlation is negative, the input will be constrained to have a decreasing relationship with the predictions for DEFAULT_NEXT_MONTH.
Constraints are supplied to XGBoost in the form of a Python tuple with length equal to the number of inputs. Each item in the tuple is associated with an input variable based on its index in the tuple. The first constraint in the tuple is associated with the first variable in the training data, the second constraint in the tuple is associated with the second variable in the training data, and so on. The constraints themselves take the form of a 1 for a positive relationship and a -1 for a negative relationship.
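As a quick illustration of this mapping, with three hypothetical inputs whose correlations with the target are made-up values, NumPy's sign function converts each correlation into its constraint:

```python
import numpy as np

# hypothetical pair-wise Pearson correlations between three inputs and the target
correlations = np.array([0.32, -0.15, 0.02])

# 1 = constrain to an increasing relationship, -1 = constrain to a decreasing one
mono_constraints = tuple(int(i) for i in np.sign(correlations))
print(mono_constraints)  # (1, -1, 1)
```

The same pattern is applied to the real correlation values below.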
The Pandas .corr() function returns the pair-wise Pearson correlation between variables in a Pandas DataFrame. Because DEFAULT_NEXT_MONTH is the last column in the data DataFrame, the last column of the Pearson correlation matrix indicates the direction of the linear relationship between each input variable and the prediction target, DEFAULT_NEXT_MONTH. According to the calculated values, as a customer's balance limit (LIMIT_BAL), bill amounts (BILL_AMT1-BILL_AMT6), and payment amounts (PAY_AMT1-PAY_AMT6) increase, their probability of default tends to decrease. However, as a customer's number of late payments increases (PAY_0, PAY_2-PAY_6), their probability of default usually increases. In general, the Pearson correlation values make sense, and they will be used to ensure that the modeled relationships make sense as well. (Pearson correlation values between the target variable, DEFAULT_NEXT_MONTH, and each input variable are displayed directly below.)
# displays last column of Pearson correlation matrix as Pandas DataFrame
pd.DataFrame(data[X + [y]].corr()[y]).iloc[:-1]
DEFAULT_NEXT_MONTH | |
---|---|
LIMIT_BAL | -0.153520 |
SEX | -0.039961 |
EDUCATION | 0.028006 |
MARRIAGE | -0.024339 |
AGE | 0.013890 |
PAY_0 | 0.324794 |
PAY_2 | 0.263551 |
PAY_3 | 0.235253 |
PAY_4 | 0.216614 |
PAY_5 | 0.204149 |
PAY_6 | 0.186866 |
BILL_AMT1 | -0.019644 |
BILL_AMT2 | -0.014193 |
BILL_AMT3 | -0.014076 |
BILL_AMT4 | -0.010156 |
BILL_AMT5 | -0.006760 |
BILL_AMT6 | -0.005372 |
PAY_AMT1 | -0.072929 |
PAY_AMT2 | -0.058579 |
PAY_AMT3 | -0.056250 |
PAY_AMT4 | -0.056827 |
PAY_AMT5 | -0.055124 |
PAY_AMT6 | -0.053183 |
The last column of the Pearson correlation matrix is transformed from a numeric column in a Pandas DataFrame into a Python tuple of 1s and -1s that will be used to specify monotonicity constraints for each input variable in XGBoost. If the Pearson correlation between an input variable and DEFAULT_NEXT_MONTH is positive, a positive monotonic relationship constraint is specified for that variable using 1. If the correlation is negative, a negative monotonic constraint is specified using -1. (Specifying 0 indicates that no constraint should be used.) The resulting tuple will be passed to XGBoost when the GBM model is trained.
# creates a tuple in which positive correlation values are assigned a 1
# and negative correlation values are assigned a -1
mono_constraints = tuple([int(i) for i in np.sign(data[X + [y]].corr()[y].values[:-1])])
# (-1, -1, 1, -1, 1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)
XGBoost is a very accurate, open source GBM library for regression and classification tasks. XGBoost can learn complex relationships between input variables and a target variable, but here the monotone_constraints tuning parameter is used to enforce monotonicity between inputs and the prediction for DEFAULT_NEXT_MONTH. XGBoost's early stopping functionality is also used to limit overfitting to the training data.
XGBoost is available from: https://github.com/dmlc/xgboost and the implementation of XGBoost is described in detail here: http://www.kdd.org/kdd2016/papers/files/rfp0697-chenAemb.pdf.
After training, GBM variable importance is calculated and displayed. GBM variable importance is a global measure of the overall impact of an input variable on the GBM model predictions. Global variable importance values give an indication of the magnitude of a variable's contribution to model predictions for all observations. To enhance trust in the GBM model, variable importance values should typically conform to human domain knowledge and reasonable expectations.
The credit card default data is split into training and test sets to monitor and prevent overtraining. Reproducibility is another important factor in creating trustworthy models, and randomly splitting datasets can introduce randomness in model predictions and other results. A random seed is used here to ensure the data split is reproducible.
np.random.seed(12345) # set random seed for reproducibility
split_ratio = 0.7 # 70%/30% train/test split
# execute split
split = np.random.rand(len(data)) < split_ratio
train = data[split]
test = data[~split]
# summarize split
print('Train data rows = %d, columns = %d' % (train.shape[0], train.shape[1]))
print('Test data rows = %d, columns = %d' % (test.shape[0], test.shape[1]))
Train data rows = 20946, columns = 25 Test data rows = 9054, columns = 25
To train an XGBoost classifier, the training and test data must be converted from Pandas DataFrames into XGBoost's internal DMatrix format. The DMatrix() function in the XGBoost package is used to convert the data. Many XGBoost tuning parameters must be specified as well. Typically a grid search would be performed to identify the best parameters for a given modeling task. For brevity's sake, a previously-discovered set of good tuning parameters is specified here. Notice that the monotonicity constraints are passed to XGBoost using the monotone_constraints parameter. Because gradient boosting methods typically resample training data, an additional random seed is also specified for XGBoost using the seed parameter to create reproducible predictions, error rates, and variable importance values. To avoid overfitting, the early_stopping_rounds parameter is used to stop the training process after the test area under the curve (AUC) statistic fails to increase for 50 iterations.
# XGBoost uses its own DMatrix data structure, not NumPy arrays or Pandas DataFrames
dtrain = xgb.DMatrix(train[X], train[y])
dtest = xgb.DMatrix(test[X], test[y])
# used to calibrate predictions to mean of y
base_y = train[y].mean()
# tuning parameters
params = {
'objective': 'binary:logistic', # produces 0-1 probabilities for binary classification
'booster': 'gbtree', # base learner will be decision tree
'eval_metric': 'auc', # stop training based on maximum AUC, AUC always between 0-1
'eta': 0.08, # learning rate
'subsample': 0.9, # use 90% of rows in each decision tree
'colsample_bytree': 0.9, # use 90% of columns in each decision tree
'max_depth': 15, # allow decision trees to grow to depth of 15
'monotone_constraints':mono_constraints, # 1 = increasing relationship, -1 = decreasing relationship
'base_score': base_y, # calibrate predictions to mean of y
'seed': 12345 # set random seed for reproducibility
}
# watchlist is used for early stopping
watchlist = [(dtrain, 'train'), (dtest, 'eval')]
# train model
xgb_model = xgb.train(params, # set tuning parameters from above
dtrain, # training data
1000, # maximum of 1000 iterations (trees)
evals=watchlist, # use watchlist for early stopping
early_stopping_rounds=50, # stop after 50 iterations (trees) without increase in AUC
verbose_eval=True) # display iteration progress
[0] train-auc:0.738066 eval-auc:0.733449 Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping. Will train until eval-auc hasn't improved in 50 rounds. [1] train-auc:0.772991 eval-auc:0.769311 [2] train-auc:0.775536 eval-auc:0.772624 [3] train-auc:0.776248 eval-auc:0.771985 [4] train-auc:0.777216 eval-auc:0.772796 [5] train-auc:0.777782 eval-auc:0.773066 [6] train-auc:0.777783 eval-auc:0.773471 [7] train-auc:0.777856 eval-auc:0.773591 [8] train-auc:0.777633 eval-auc:0.773209 [9] train-auc:0.777417 eval-auc:0.772892 [10] train-auc:0.777519 eval-auc:0.772666 [11] train-auc:0.778476 eval-auc:0.773346 [12] train-auc:0.778541 eval-auc:0.773312 [13] train-auc:0.778351 eval-auc:0.773266 [14] train-auc:0.778519 eval-auc:0.773695 [15] train-auc:0.779383 eval-auc:0.774314 [16] train-auc:0.779555 eval-auc:0.77473 [17] train-auc:0.780178 eval-auc:0.775043 [18] train-auc:0.780743 eval-auc:0.775246 [19] train-auc:0.78146 eval-auc:0.776021 [20] train-auc:0.781944 eval-auc:0.776477 [21] train-auc:0.782181 eval-auc:0.776696 [22] train-auc:0.782319 eval-auc:0.776915 [23] train-auc:0.78263 eval-auc:0.776998 [24] train-auc:0.782965 eval-auc:0.777144 [25] train-auc:0.783396 eval-auc:0.777498 [26] train-auc:0.783791 eval-auc:0.777791 [27] train-auc:0.784151 eval-auc:0.778091 [28] train-auc:0.784382 eval-auc:0.778375 [29] train-auc:0.784678 eval-auc:0.778567 [30] train-auc:0.784912 eval-auc:0.778755 [31] train-auc:0.785231 eval-auc:0.778938 [32] train-auc:0.785418 eval-auc:0.779125 [33] train-auc:0.785598 eval-auc:0.779226 [34] train-auc:0.785838 eval-auc:0.779421 [35] train-auc:0.7861 eval-auc:0.779432 [36] train-auc:0.786347 eval-auc:0.779549 [37] train-auc:0.786633 eval-auc:0.779575 [38] train-auc:0.786833 eval-auc:0.779668 [39] train-auc:0.787093 eval-auc:0.779743 [40] train-auc:0.787307 eval-auc:0.779926 [41] train-auc:0.787602 eval-auc:0.780015 [42] train-auc:0.787819 eval-auc:0.780089 [43] train-auc:0.787943 eval-auc:0.780196 [44] train-auc:0.788092 
eval-auc:0.780239 [45] train-auc:0.788194 eval-auc:0.780198 [46] train-auc:0.788298 eval-auc:0.780235 [47] train-auc:0.788422 eval-auc:0.780305 [48] train-auc:0.78862 eval-auc:0.780372 [49] train-auc:0.788803 eval-auc:0.780392 [50] train-auc:0.789049 eval-auc:0.780558 [51] train-auc:0.789221 eval-auc:0.780659 [52] train-auc:0.789405 eval-auc:0.780756 [53] train-auc:0.789553 eval-auc:0.780721 [54] train-auc:0.78968 eval-auc:0.780791 [55] train-auc:0.789779 eval-auc:0.780849 [56] train-auc:0.789875 eval-auc:0.780932 [57] train-auc:0.79002 eval-auc:0.780963 [58] train-auc:0.790156 eval-auc:0.781038 [59] train-auc:0.790292 eval-auc:0.781085 [60] train-auc:0.790403 eval-auc:0.781079 [61] train-auc:0.790509 eval-auc:0.781091 [62] train-auc:0.790554 eval-auc:0.781061 [63] train-auc:0.790635 eval-auc:0.781148 [64] train-auc:0.790679 eval-auc:0.781166 [65] train-auc:0.790822 eval-auc:0.781155 [66] train-auc:0.790896 eval-auc:0.781178 [67] train-auc:0.790977 eval-auc:0.781188 [68] train-auc:0.791155 eval-auc:0.781211 [69] train-auc:0.791239 eval-auc:0.781179 [70] train-auc:0.7914 eval-auc:0.781347 [71] train-auc:0.791525 eval-auc:0.78134 [72] train-auc:0.791578 eval-auc:0.781312 [73] train-auc:0.791691 eval-auc:0.781325 [74] train-auc:0.791747 eval-auc:0.781323 [75] train-auc:0.791801 eval-auc:0.781304 [76] train-auc:0.791844 eval-auc:0.781313 [77] train-auc:0.7919 eval-auc:0.781325 [78] train-auc:0.792056 eval-auc:0.781448 [79] train-auc:0.792088 eval-auc:0.781397 [80] train-auc:0.792151 eval-auc:0.78138 [81] train-auc:0.792173 eval-auc:0.781388 [82] train-auc:0.792236 eval-auc:0.781301 [83] train-auc:0.792327 eval-auc:0.781355 [84] train-auc:0.792372 eval-auc:0.781356 [85] train-auc:0.792402 eval-auc:0.781321 [86] train-auc:0.79247 eval-auc:0.781286 [87] train-auc:0.792521 eval-auc:0.781283 [88] train-auc:0.792543 eval-auc:0.781265 [89] train-auc:0.792595 eval-auc:0.781255 [90] train-auc:0.792618 eval-auc:0.781242 [91] train-auc:0.792673 eval-auc:0.781309 [92] 
train-auc:0.792766 eval-auc:0.781357 [93] train-auc:0.792826 eval-auc:0.781381 [94] train-auc:0.792914 eval-auc:0.781387 [95] train-auc:0.792967 eval-auc:0.781385 [96] train-auc:0.793016 eval-auc:0.781375 [97] train-auc:0.793053 eval-auc:0.781353 [98] train-auc:0.79312 eval-auc:0.78137 [99] train-auc:0.793145 eval-auc:0.781413 [100] train-auc:0.793191 eval-auc:0.781456 [101] train-auc:0.793256 eval-auc:0.781435 [102] train-auc:0.793282 eval-auc:0.781382 [103] train-auc:0.793322 eval-auc:0.781385 [104] train-auc:0.793346 eval-auc:0.781361 [105] train-auc:0.793399 eval-auc:0.781418 [106] train-auc:0.793436 eval-auc:0.781398 [107] train-auc:0.793511 eval-auc:0.781358 [108] train-auc:0.793578 eval-auc:0.78137 [109] train-auc:0.793655 eval-auc:0.781336 [110] train-auc:0.793683 eval-auc:0.781299 [111] train-auc:0.793701 eval-auc:0.781277 [112] train-auc:0.79371 eval-auc:0.78128 [113] train-auc:0.793745 eval-auc:0.781298 [114] train-auc:0.793817 eval-auc:0.781307 [115] train-auc:0.793838 eval-auc:0.781301 [116] train-auc:0.793867 eval-auc:0.781308 [117] train-auc:0.793877 eval-auc:0.781315 [118] train-auc:0.793917 eval-auc:0.781246 [119] train-auc:0.793947 eval-auc:0.781302 [120] train-auc:0.794 eval-auc:0.781358 [121] train-auc:0.794026 eval-auc:0.781357 [122] train-auc:0.794053 eval-auc:0.781277 [123] train-auc:0.794064 eval-auc:0.781257 [124] train-auc:0.794209 eval-auc:0.781289 [125] train-auc:0.794219 eval-auc:0.781288 [126] train-auc:0.794287 eval-auc:0.781347 [127] train-auc:0.79429 eval-auc:0.781347 [128] train-auc:0.794327 eval-auc:0.781331 [129] train-auc:0.794336 eval-auc:0.781349 [130] train-auc:0.794367 eval-auc:0.781347 [131] train-auc:0.794364 eval-auc:0.78134 [132] train-auc:0.794385 eval-auc:0.781363 [133] train-auc:0.794387 eval-auc:0.781326 [134] train-auc:0.794472 eval-auc:0.78132 [135] train-auc:0.794483 eval-auc:0.781326 [136] train-auc:0.794495 eval-auc:0.781306 [137] train-auc:0.794583 eval-auc:0.781293 [138] train-auc:0.794589 eval-auc:0.781293 
[139] train-auc:0.7946 eval-auc:0.781286 [140] train-auc:0.794642 eval-auc:0.7813 [141] train-auc:0.794656 eval-auc:0.781303 [142] train-auc:0.794654 eval-auc:0.781287 [143] train-auc:0.794695 eval-auc:0.781273 [144] train-auc:0.794705 eval-auc:0.781268 [145] train-auc:0.794708 eval-auc:0.781268 [146] train-auc:0.79471 eval-auc:0.781258 [147] train-auc:0.794775 eval-auc:0.781284 [148] train-auc:0.794782 eval-auc:0.781277 [149] train-auc:0.794794 eval-auc:0.781283 [150] train-auc:0.794815 eval-auc:0.78126 Stopping. Best iteration: [100] train-auc:0.793191 eval-auc:0.781456
By setting pred_contribs=True, XGBoost's predict() function will return Shapley values for each row of the test set. Instead of relying on traditional single-value variable importance measures, local Shapley values for each input will be plotted below to get a more holistic and consistent measurement of the global importance of each input variable. Shapley values are introduced in greater detail in Section 6 below, but for now notice the monotonicity of the input variable contributions displayed in the Shapley summary plot.
# dtest is DMatrix
# shap_values is Numpy array
shap_values = xgb_model.predict(dtest, pred_contribs=True, ntree_limit=xgb_model.best_ntree_limit)
# plot Shapley variable importance summary
shap.summary_plot(shap_values[:, :-1], test[xgb_model.feature_names])
The variable importance ranking should be parsimonious with human domain knowledge and reasonable expectations. In this case, PAY_0 is by far the most important variable. As someone's most recent behavior is a very good indicator of their future behavior, this checks out.
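The global importance behind a ranking like this can be computed directly from the local Shapley values: the mean absolute Shapley contribution per column. A minimal NumPy sketch on a made-up contribution matrix (rows = observations, columns = inputs; not the model's actual values):

```python
import numpy as np

# hypothetical local Shapley contributions for 4 observations and 3 inputs
shap_values = np.array([[ 0.6, -0.1,  0.2],
                        [-0.4,  0.0,  0.1],
                        [ 0.5,  0.2, -0.3],
                        [-0.7,  0.1,  0.0]])

# global importance = mean absolute local contribution for each input
global_importance = np.abs(shap_values).mean(axis=0)

# rank inputs from most to least important
ranking = np.argsort(global_importance)[::-1]
print(global_importance)  # [0.55 0.1  0.15]
print(ranking)            # [0 2 1]
```

Because the measure is built from signed local contributions, it stays consistent with the per-row explanations used later for reason codes.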
Partial dependence plots are used to view the global, average prediction behavior of a variable under the monotonic model. Partial dependence plots show the average prediction of the monotonic model as a function of specific values of an input variable of interest, indicating how the monotonic GBM predictions change based on the values of the input variable of interest, while taking nonlinearity into consideration and averaging out the effects of all other input variables. Partial dependence plots enable increased transparency into the monotonic GBM's mechanisms and enable validation and debugging of the monotonic GBM by comparing a variable's average predictions across its domain to known standards and reasonable expectations. Partial dependence plots are described in greater detail in The Elements of Statistical Learning, section 10.13: https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print12.pdf.
Individual conditional expectation (ICE) plots, a newer and less well-known adaptation of partial dependence plots, can be used to create more localized explanations for a single observation of data using the same basic ideas as partial dependence plots. ICE is also a type of nonlinear sensitivity analysis in which the model predictions for a single observation are measured while a feature of interest is varied over its domain. ICE increases understanding and transparency by displaying the nonlinear behavior of the monotonic GBM. ICE also enhances trust, accountability, and fairness by enabling comparisons of nonlinear behavior to human domain knowledge and reasonable expectations. ICE, as a type of sensitivity analysis, can also engender trust when model behavior on simulated or extreme data points is acceptable. A detailed description of ICE is available in this arXiv preprint: https://arxiv.org/abs/1309.6392.
Because partial dependence and ICE are measured on the same scale, they can be displayed in the same line plot to compare the global, average prediction behavior for the entire model and the local prediction behavior for certain rows of data. Overlaying the two types of curves enables analysis of both global and local behavior simultaneously and provides an indication of the trustworthiness of the average behavior represented by partial dependence. (Partial dependence can be misleading in the presence of strong interactions or correlation. ICE curves diverging from the partial dependence curve can be indicative of such problems.) Histograms are also presented with the partial dependence and ICE curves, to enable a rough measure of epistemic uncertainty for model predictions: predictions based on small amounts of training data are likely less dependable.
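The relationship between the two curve types can be sketched directly: if ICE curves are stored as rows of a matrix (one row per observation, one column per grid point of the variable of interest), partial dependence is just the column-wise mean. A small NumPy illustration with made-up curves:

```python
import numpy as np

# hypothetical ICE curves for 3 observations over 4 grid points
ice_curves = np.array([[0.10, 0.20, 0.30, 0.40],
                       [0.20, 0.30, 0.40, 0.50],
                       [0.30, 0.40, 0.50, 0.60]])

# partial dependence is the average of the ICE curves at each grid point
partial_dependence = ice_curves.mean(axis=0)
print(partial_dependence)  # [0.2 0.3 0.4 0.5]

# large deviations of any ICE curve from the partial dependence curve can
# flag interactions or correlations that make the average misleading
max_divergence = np.abs(ice_curves - partial_dependence).max()
```

Here the curves are parallel, so the divergence is small; widely crossing ICE curves would be the warning sign described above.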
Since partial dependence and ICE will be calculated for several important variables in the GBM model, it's convenient to have a function doing so. It's probably best to analyze partial dependence and ICE for all variables in a model, but only the top three most important input variables will be investigated here. It's also a good idea to analyze partial dependence and ICE on the test data, or other holdout datasets, to see how the model will perform on new data.
This simple function is designed to return partial dependence when it is called for an entire dataset and ICE when it is called for a single row. The bins
argument will be used later to calculate ICE values at the same places in an input variable domain that partial dependence is calculated directly below.
def par_dep(xs, frame, model, resolution=20, bins=None):
""" Creates Pandas DataFrame containing partial dependence for a
single variable.
Args:
xs: Variable for which to calculate partial dependence.
frame: Pandas DataFrame for which to calculate partial dependence.
model: XGBoost model for which to calculate partial dependence.
resolution: The number of points across the domain of xs for which
to calculate partial dependence, default 20.
bins: List of values at which to set xs, default 20 equally-spaced
points between column minimum and maximum.
Returns:
Pandas DataFrame containing partial dependence values.
"""
# turn off pesky Pandas copy warning
pd.options.mode.chained_assignment = None
# initialize empty Pandas DataFrame with correct column names
par_dep_frame = pd.DataFrame(columns=[xs, 'partial_dependence'])
# cache original column values
col_cache = frame.loc[:, xs].copy(deep=True)
# determine values at which to calculate partial dependence
    if bins is None:
min_ = frame[xs].min()
max_ = frame[xs].max()
by = (max_ - min_)/resolution
bins = np.arange(min_, max_, by)
# calculate partial dependence
# by setting column of interest to constant
# and scoring the altered data and taking the mean of the predictions
for j in bins:
frame.loc[:, xs] = j
dframe = xgb.DMatrix(frame)
par_dep_i = pd.DataFrame(model.predict(dframe, ntree_limit=model.best_ntree_limit))
par_dep_j = par_dep_i.mean()[0]
par_dep_frame = par_dep_frame.append({xs:j,
'partial_dependence': par_dep_j},
ignore_index=True)
# return input frame to original cached state
frame.loc[:, xs] = col_cache
return par_dep_frame
The partial dependence for LIMIT_BAL can be seen to decrease as credit balance limits increase. This finding is aligned with the expectation that model predictions will be monotonically decreasing with increasing LIMIT_BAL and parsimonious with well-known business practices in credit lending. Partial dependence for other important variables is displayed in plots further below.
par_dep_PAY_0 = par_dep('PAY_0', test[X], xgb_model) # calculate partial dependence for PAY_0
par_dep_LIMIT_BAL = par_dep('LIMIT_BAL', test[X], xgb_model) # calculate partial dependence for LIMIT_BAL
par_dep_BILL_AMT1 = par_dep('BILL_AMT1', test[X], xgb_model) # calculate partial dependence for BILL_AMT1
# display partial dependence for LIMIT_BAL
par_dep_LIMIT_BAL
LIMIT_BAL | partial_dependence | |
---|---|---|
0 | 10000.0 | 0.266888 |
1 | 59500.0 | 0.243301 |
2 | 109000.0 | 0.224759 |
3 | 158500.0 | 0.216966 |
4 | 208000.0 | 0.214594 |
5 | 257500.0 | 0.209798 |
6 | 307000.0 | 0.206901 |
7 | 356500.0 | 0.198712 |
8 | 406000.0 | 0.197883 |
9 | 455500.0 | 0.197086 |
10 | 505000.0 | 0.194449 |
11 | 554500.0 | 0.194407 |
12 | 604000.0 | 0.190247 |
13 | 653500.0 | 0.190209 |
14 | 703000.0 | 0.190188 |
15 | 752500.0 | 0.190188 |
16 | 802000.0 | 0.190188 |
17 | 851500.0 | 0.190188 |
18 | 901000.0 | 0.190188 |
19 | 950500.0 | 0.190188 |
ICE can be calculated for any row in the training or test data, but without intimate knowledge of a data source it can be difficult to know where to apply ICE. Calculating and analyzing ICE curves for every row of the training and test datasets can be overwhelming, even for the example credit card default dataset. One place to start with ICE is to calculate ICE curves at every decile of predicted probabilities in a dataset, giving an indication of local prediction behavior across the dataset. The function below finds and returns the row indices for the maximum, minimum, and deciles of one column in terms of another -- in this case, the model predictions (p_DEFAULT_NEXT_MONTH) and the row identifier (ID), respectively.
def get_percentile_dict(yhat, id_, frame):

    """ Returns the percentiles of a column, yhat, as the indices based on
        another column, id_.

    Args:
        yhat: Column in which to find percentiles.
        id_: Id column that stores indices for percentiles of yhat.
        frame: Pandas DataFrame containing yhat and id_.

    Returns:
        Dictionary of percentile values and index column values.

    """

    # create a copy of frame and sort it by yhat
    sort_df = frame.copy(deep=True)
    sort_df.sort_values(yhat, inplace=True)
    sort_df.reset_index(inplace=True)

    # find top and bottom percentiles
    percentiles_dict = {}
    percentiles_dict[0] = sort_df.loc[0, id_]
    percentiles_dict[99] = sort_df.loc[sort_df.shape[0] - 1, id_]

    # find 10th-90th percentiles
    inc = sort_df.shape[0] // 10
    for i in range(1, 10):
        percentiles_dict[i * 10] = sort_df.loc[i * inc, id_]

    return percentiles_dict
The values for ID that correspond to the maximum, minimum, and deciles of p_DEFAULT_NEXT_MONTH are displayed below. ICE will be calculated for the rows of the test dataset associated with these ID values.
# merge GBM predictions onto test data
yhat_test = pd.concat([test.reset_index(drop=True), pd.DataFrame(xgb_model.predict(dtest,
ntree_limit=xgb_model.best_ntree_limit))],
axis=1)
yhat_test = yhat_test.rename(columns={0:'p_DEFAULT_NEXT_MONTH'})
# find percentiles of predictions
percentile_dict = get_percentile_dict('p_DEFAULT_NEXT_MONTH', 'ID', yhat_test)
# display percentiles dictionary
# ID values for rows
# from lowest prediction
# to highest prediction
percentile_dict
{0: 23477, 10: 6226, 20: 25603, 30: 12890, 40: 715, 50: 14517, 60: 4908, 70: 7411, 80: 6219, 90: 18421, 99: 17757}
ICE values represent a model's prediction for a single row of data while an input variable of interest is varied across its domain. The values of the input variable are chosen to match the values at which partial dependence was calculated earlier, and ICE is calculated for the three most important variables and for rows at each percentile of the test dataset.
# retrieve bins from original partial dependence calculation
bins_PAY_0 = list(par_dep_PAY_0['PAY_0'])
bins_LIMIT_BAL = list(par_dep_LIMIT_BAL['LIMIT_BAL'])
bins_BILL_AMT1 = list(par_dep_BILL_AMT1['BILL_AMT1'])

# for each percentile in percentile_dict
# create a new column in the par_dep frame
# representing the ICE curve for that percentile
# and the variables of interest
for i in sorted(percentile_dict.keys()):

    col_name = 'Percentile_' + str(i)

    # ICE curves for PAY_0 across percentiles at bins_PAY_0 intervals
    par_dep_PAY_0[col_name] = par_dep('PAY_0',
                                      test[test['ID'] == int(percentile_dict[i])][X],
                                      xgb_model,
                                      bins=bins_PAY_0)['partial_dependence']

    # ICE curves for LIMIT_BAL across percentiles at bins_LIMIT_BAL intervals
    par_dep_LIMIT_BAL[col_name] = par_dep('LIMIT_BAL',
                                          test[test['ID'] == int(percentile_dict[i])][X],
                                          xgb_model,
                                          bins=bins_LIMIT_BAL)['partial_dependence']

    # ICE curves for BILL_AMT1 across percentiles at bins_BILL_AMT1 intervals
    par_dep_BILL_AMT1[col_name] = par_dep('BILL_AMT1',
                                          test[test['ID'] == int(percentile_dict[i])][X],
                                          xgb_model,
                                          bins=bins_BILL_AMT1)['partial_dependence']
#### LIMIT_BAL

Partial dependence and ICE values for rows at the minimum, maximum, and deciles (0%, 10%, 20%, ..., 90%, 99%) of predictions for DEFAULT_NEXT_MONTH, and at the values of LIMIT_BAL used for partial dependence, are shown here. Each column of ICE values will be a curve in the plots below. Each ICE value represents a prediction for a row of test data at the percentile of interest noted in the column name, with LIMIT_BAL set to the value shown in the LIMIT_BAL column. Notice that monotonic decreasing prediction behavior for LIMIT_BAL holds at each displayed percentile of predicted DEFAULT_NEXT_MONTH, helping to validate that the trained GBM's predictions are monotonic for LIMIT_BAL.
par_dep_LIMIT_BAL
 | LIMIT_BAL | partial_dependence | Percentile_0 | Percentile_10 | Percentile_20 | Percentile_30 | Percentile_40 | Percentile_50 | Percentile_60 | Percentile_70 | Percentile_80 | Percentile_90 | Percentile_99 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10000.0 | 0.266888 | 0.009752 | 0.103610 | 0.114130 | 0.161717 | 0.178196 | 0.176423 | 0.215932 | 0.237970 | 0.482309 | 0.600873 | 0.963266 |
1 | 59500.0 | 0.243301 | 0.007765 | 0.095805 | 0.106769 | 0.151972 | 0.155897 | 0.162622 | 0.177366 | 0.206651 | 0.337691 | 0.589542 | 0.943703 |
2 | 109000.0 | 0.224759 | 0.004514 | 0.073701 | 0.089827 | 0.122272 | 0.147910 | 0.129995 | 0.164410 | 0.186139 | 0.330370 | 0.575241 | 0.939937 |
3 | 158500.0 | 0.216966 | 0.003510 | 0.061587 | 0.082731 | 0.108702 | 0.139923 | 0.117246 | 0.158431 | 0.174685 | 0.330370 | 0.571536 | 0.939715 |
4 | 208000.0 | 0.214594 | 0.003208 | 0.059036 | 0.081863 | 0.106850 | 0.137178 | 0.115268 | 0.158413 | 0.171402 | 0.330370 | 0.569508 | 0.938817 |
5 | 257500.0 | 0.209798 | 0.002925 | 0.054268 | 0.078228 | 0.102988 | 0.134296 | 0.113019 | 0.155023 | 0.166766 | 0.330130 | 0.559060 | 0.936782 |
6 | 307000.0 | 0.206901 | 0.002824 | 0.053263 | 0.075723 | 0.100607 | 0.133915 | 0.109701 | 0.152706 | 0.162116 | 0.330130 | 0.555402 | 0.935711 |
7 | 356500.0 | 0.198712 | 0.001872 | 0.046365 | 0.066366 | 0.083248 | 0.128932 | 0.090932 | 0.150245 | 0.147120 | 0.329356 | 0.552223 | 0.935394 |
8 | 406000.0 | 0.197883 | 0.001872 | 0.045775 | 0.064863 | 0.081453 | 0.127613 | 0.090882 | 0.149558 | 0.146809 | 0.324522 | 0.550167 | 0.934938 |
9 | 455500.0 | 0.197086 | 0.001872 | 0.044761 | 0.063219 | 0.079312 | 0.126955 | 0.088818 | 0.149372 | 0.146495 | 0.322438 | 0.550167 | 0.934894 |
10 | 505000.0 | 0.194449 | 0.001847 | 0.044276 | 0.062503 | 0.078314 | 0.121430 | 0.087403 | 0.143454 | 0.143816 | 0.314405 | 0.549231 | 0.934661 |
11 | 554500.0 | 0.194407 | 0.001847 | 0.044262 | 0.062484 | 0.078040 | 0.121430 | 0.087377 | 0.143394 | 0.143775 | 0.314405 | 0.549195 | 0.934661 |
12 | 604000.0 | 0.190247 | 0.001763 | 0.042238 | 0.059409 | 0.074256 | 0.119003 | 0.083055 | 0.139884 | 0.137249 | 0.307714 | 0.547538 | 0.933731 |
13 | 653500.0 | 0.190209 | 0.001763 | 0.042231 | 0.059384 | 0.074229 | 0.119003 | 0.082968 | 0.139812 | 0.137186 | 0.307588 | 0.547495 | 0.933731 |
14 | 703000.0 | 0.190188 | 0.001763 | 0.042211 | 0.059356 | 0.074195 | 0.119003 | 0.082929 | 0.139812 | 0.137126 | 0.307588 | 0.547370 | 0.933731 |
15 | 752500.0 | 0.190188 | 0.001763 | 0.042211 | 0.059356 | 0.074195 | 0.119003 | 0.082929 | 0.139812 | 0.137126 | 0.307588 | 0.547370 | 0.933731 |
16 | 802000.0 | 0.190188 | 0.001763 | 0.042211 | 0.059356 | 0.074195 | 0.119003 | 0.082929 | 0.139812 | 0.137126 | 0.307588 | 0.547370 | 0.933731 |
17 | 851500.0 | 0.190188 | 0.001763 | 0.042211 | 0.059356 | 0.074195 | 0.119003 | 0.082929 | 0.139812 | 0.137126 | 0.307588 | 0.547370 | 0.933731 |
18 | 901000.0 | 0.190188 | 0.001763 | 0.042211 | 0.059356 | 0.074195 | 0.119003 | 0.082929 | 0.139812 | 0.137126 | 0.307588 | 0.547370 | 0.933731 |
19 | 950500.0 | 0.190188 | 0.001763 | 0.042211 | 0.059356 | 0.074195 | 0.119003 | 0.082929 | 0.139812 | 0.137126 | 0.307588 | 0.547370 | 0.933731 |
Overlaying partial dependence onto ICE in a plot is a convenient way to validate and understand both global and local monotonic behavior. Plots of partial dependence curves overlaid onto ICE curves for several percentiles of predictions for DEFAULT_NEXT_MONTH are used to validate monotonic behavior, to describe the GBM's mechanisms, and to compare the most extreme GBM behavior with the average GBM behavior in the test data. Partial dependence and ICE plots are displayed for the three most important variables in the GBM: PAY_0, LIMIT_BAL, and BILL_AMT1.
#### Function to plot partial dependence and ICE
def plot_par_dep_ICE(xs, par_dep_frame):

    """ Plots ICE overlaid onto partial dependence for a single variable.

    Args:
        xs: Name of variable for which to plot ICE and partial dependence.
        par_dep_frame: Name of Pandas DataFrame containing ICE and partial
                       dependence values.

    """

    # initialize figure and axis
    fig, ax = plt.subplots()

    # plot ICE curves
    par_dep_frame.drop('partial_dependence', axis=1).plot(x=xs,
                                                          colormap='gnuplot',
                                                          ax=ax)

    # overlay partial dependence, annotate plot
    par_dep_frame.plot(title='Partial Dependence and ICE for ' + str(xs),
                       x=xs,
                       y='partial_dependence',
                       style='r-',
                       linewidth=3,
                       ax=ax)

    # add legend
    _ = plt.legend(bbox_to_anchor=(1.05, 0),
                   loc=3,
                   borderaxespad=0.)
#### LIMIT_BAL

The monotonic prediction behavior displayed in the partial dependence and ICE tables for LIMIT_BAL is also visible in this plot. Monotonic decreasing behavior is evident at every percentile of predictions for DEFAULT_NEXT_MONTH. Most percentiles of predictions show that the sharpest decreases in probability of default occur when LIMIT_BAL increases just slightly from its lowest values in the test set. However, for the customers most likely to default according to the GBM, no increase in LIMIT_BAL has a strong impact on probability of default. As mentioned previously, the displayed relationship between credit balance limits and probability of default is not uncommon in credit lending. As the displayed histogram shows, prediction behavior above roughly NT$500,000 may have been learned from extremely small samples of data.
plot_par_dep_ICE('LIMIT_BAL', par_dep_LIMIT_BAL) # plot partial dependence and ICE for LIMIT_BAL
_ = train['LIMIT_BAL'].plot(kind='hist', bins=20, title='Histogram: LIMIT_BAL')
#### PAY_0

Monotonic increasing prediction behavior for PAY_0 is displayed for all percentiles of model predictions. Prediction behavior differs across deciles, but is not abnormal or vastly different from the average prediction behavior represented by the red partial dependence curve. The largest jump in predicted probability appears to occur at PAY_0 = 2, or when a customer becomes two months late on their most recent payment. Above PAY_0 = 3 there are few examples from which the model could learn.
plot_par_dep_ICE('PAY_0', par_dep_PAY_0) # plot partial dependence and ICE for PAY_0
_ = train['PAY_0'].plot(kind='hist', bins=20, title='Histogram: PAY_0')
#### BILL_AMT1

Monotonic decreasing prediction behavior for BILL_AMT1 is also displayed for all percentiles. This mild decrease in probability of default as the most recent bill amount increases could be related to wealthier, big-spending customers taking on more debt but also being able to pay it off reliably. Also, customers with negative bills are more likely to default, potentially indicating that charge-offs are being recorded as negative bills. In a mission-critical situation, this issue would require more debugging. Predictions below NT$0 and above NT$400,000 are also based on very little training data.
plot_par_dep_ICE('BILL_AMT1', par_dep_BILL_AMT1) # plot partial dependence and ICE for BILL_AMT1
_ = train['BILL_AMT1'].plot(kind='hist', bins=20, title='Histogram: BILL_AMT1')
Now that the monotonic behavior of the GBM has been verified and compared against domain knowledge and reasonable expectations, a method called Shapley explanations will be used to calculate the local variable importance for any one prediction: http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions. Shapley explanations are the only possible consistent local variable importance values. (Here, consistency means that if a variable is more important than another variable in a given prediction, the more important variable's Shapley value will not be smaller in magnitude than the less important variable's Shapley value.) Crucially, Shapley values also always sum to the actual prediction of the XGBoost model. When used in a model-specific context for decision tree models, Shapley values are likely the most accurate local variable importance method available today. In this notebook, XGBoost itself is used to create Shapley values with the pred_contribs parameter to predict(), but the shap package is also available for other types of models: https://github.com/slundberg/shap.
The numeric Shapley values in each column are an estimate of how much each variable contributed to each prediction. Shapley contributions can indicate how a variable and its values were weighted in any given decision by the model. These values are crucially important for machine learning interpretability and are related to "local feature importance", "reason codes", or "turn-down codes." The latter phrases are borrowed from credit scoring. Credit lenders in the U.S. must provide reasons for automatically rejecting a credit application. Reason codes can be easily extracted from Shapley local variable contribution values by ranking the variables that played the largest role in any given decision.
To find the index corresponding to a particular row of interest later, the index of the test DataFrame is reset to begin at 0 and increase sequentially. Without resetting the index, the test DataFrame row indices would still correspond to the original raw data from which the test set was sampled.
test.reset_index(drop=True, inplace=True)
One person who might be of immediate interest is the customer most likely to default in the test data. This customer's row will be selected, and local variable importance for the corresponding prediction will be analyzed.
decile = 99
row = test[test['ID'] == percentile_dict[decile]]
The most interesting Shapley values are probably those that push this customer's probability of default higher, i.e., the largest positive Shapley values. Those values are plotted below.
# select the riskiest customer's row of shap_values, drop the bias column,
# and sort to find the largest positive contributions
s_df = pd.DataFrame(shap_values[row.index[0], :][:-1].reshape(-1, 1),
                    columns=['Reason Codes'],
                    index=X)
s_df.sort_values(by='Reason Codes', ascending=False, inplace=True)
s_df
s_df
 | Reason Codes |
---|---|
PAY_0 | 1.526435 |
PAY_5 | 0.511360 |
PAY_6 | 0.328183 |
PAY_2 | 0.289542 |
LIMIT_BAL | 0.287896 |
PAY_4 | 0.227838 |
AGE | 0.225641 |
BILL_AMT1 | 0.197412 |
PAY_3 | 0.178575 |
MARRIAGE | 0.171177 |
PAY_AMT3 | 0.133252 |
PAY_AMT1 | 0.107046 |
PAY_AMT2 | 0.082654 |
BILL_AMT3 | 0.068008 |
EDUCATION | 0.060491 |
PAY_AMT4 | 0.060251 |
BILL_AMT2 | 0.057245 |
BILL_AMT4 | 0.039342 |
PAY_AMT5 | 0.036137 |
PAY_AMT6 | 0.025210 |
BILL_AMT6 | 0.005666 |
BILL_AMT5 | 0.002853 |
SEX | -0.096203 |
_ = s_df[:5].plot(kind='bar',
title='Top Five Reason Codes for a Risky Customer\n',
legend=False)
For the customer in the test dataset that the GBM predicts as most likely to default, the most important input variables in the prediction are, in descending order, PAY_0, PAY_5, PAY_6, PAY_2, and LIMIT_BAL.
The local contributions for this customer appear reasonable, especially considering her payment information. Her most recent payment was three months late, and her payments from five and six months previous were seven months late. Her credit limit was also extremely low, so it is logical that these factors would weigh heavily in the model's prediction of default for this customer.
row # helps understand reason codes
 | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT_NEXT_MONTH |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5399 | 17757 | 10000 | 2 | 3 | 1 | 51 | 3 | 2 | 2 | 7 | 7 | 7 | 2400 | 2400 | 2400 | 2400 | 2400 | 2400 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
To generate reason codes for the model's decision, each locally important variable and its value are used together. If this customer were denied future credit based on this model and data, the top five Shapley-based reason codes for the automated decision would be:

1. PAY_0 = 3 (most recent payment three months late)
2. PAY_5 = 7 (payment five months previous seven months late)
3. PAY_6 = 7 (payment six months previous seven months late)
4. PAY_2 = 2 (payment two months previous two months late)
5. LIMIT_BAL = 10,000 (extremely low credit limit)
(Of course, credit limits are set by the lender and are used to price risk into credit decisions, so using credit limits as reason codes, or even in a probability of default model, is likely questionable. However, in this small example dataset, all input columns were used to achieve a better model fit. For a slightly more careful treatment of gradient boosting in the context of credit scoring, please see: https://github.com/jphall663/interpretable_machine_learning_with_python/blob/master/dia.ipynb)
In this notebook, a highly transparent, nonlinear, monotonic GBM classifier was trained to predict credit card defaults and the monotonic behavior of the classifier was analyzed and validated. To do so, Pearson correlation between each input and the target was used to determine the direction for monotonicity constraints for each input variable in the XGBoost classifier. GBM variable importance, partial dependence, and ICE were calculated, plotted, and compared to one another, domain knowledge, and reasonable expectations. Shapley values were then used to explain the model predictions for the single most risky customer in the test set. These techniques should generalize well for many types of business and research problems, enabling you to train a monotonic GBM model and analyze, validate, and explain it to your colleagues, bosses, and potentially, external regulators.