License¶

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

DISCLAIMER: This notebook is not legal compliance advice.

Testing machine learning models for accuracy, trustworthiness, and stability with Python and H2O¶

Performing residual analysis and sensitivity analysis to validate complex models¶

This notebook provides a basic introduction to two traditional data analysis and model diagnostic techniques that can be applied to machine learning models: residual analysis and sensitivity analysis. The notebook starts by loading the UCI credit card default dataset and using h2o to train a GBM model to predict credit card defaults. Then, residual analysis is used to discover and debug an issue with the GBM, and the GBM is retrained and improved. The notebook concludes by conducting sensitivity analysis to test the GBM credit card default model for fairness and stability.

Python imports¶

In general, NumPy and Pandas will be used for data manipulation purposes and h2o will be used for modeling tasks.

In [1]:

# h2o Python API with specific classes
import h2o 
from h2o.estimators.gbm import H2OGradientBoostingEstimator

import numpy as np   # array, vector, matrix calculations
import pandas as pd  # DataFrame handling

pd.options.display.max_columns = 999 # enable display of all columns in notebook

# plotting functionality
import matplotlib.pyplot as plt
import seaborn as sns

# display plots in notebook
%matplotlib inline

Start h2o¶

H2o is both a library and a server. The machine learning algorithms in the library take advantage of the multithreaded and distributed architecture provided by the server to train machine learning algorithms extremely efficiently. The API for the library was imported above in cell 1, but the server still needs to be started.

In [2]:

h2o.init(max_mem_size='2G')       # start h2o
h2o.remove_all()                  # remove any existing data structures from h2o memory

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_201"; Java(TM) SE Runtime Environment (build 1.8.0_201-b09); Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
  Starting server from /home/patrickh/anaconda3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpfq80fyby
  JVM stdout: /tmp/tmpfq80fyby/h2o_patrickh_started_from_python.out
  JVM stderr: /tmp/tmpfq80fyby/h2o_patrickh_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.

H2O cluster uptime:	01 secs
H2O cluster timezone:	America/New_York
H2O data parsing timezone:	UTC
H2O cluster version:	3.26.0.3
H2O cluster version age:	6 days
H2O cluster name:	H2O_from_python_patrickh_ov75l0
H2O cluster total nodes:	1
H2O cluster free memory:	1.778 Gb
H2O cluster total cores:	8
H2O cluster allowed cores:	8
H2O cluster status:	accepting new members, healthy
H2O connection url:	http://127.0.0.1:54321
H2O connection proxy:	None
H2O internal security:	False
H2O API Extensions:	Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4
Python version:	3.6.4 final

1. Download, explore, and prepare UCI credit card default data¶

UCI credit card default data: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

The UCI credit card default data contains demographic and payment information about credit card customers in Taiwan in the year 2005. The data set contains 23 input variables:

LIMIT_BAL: Amount of given credit (NT dollar)
SEX: 1 = male; 2 = female
EDUCATION: 1 = graduate school; 2 = university; 3 = high school; 4 = others
MARRIAGE: 1 = married; 2 = single; 3 = others
AGE: Age in years
PAY_0, PAY_2 - PAY_6: History of past payment; PAY_0 = the repayment status in September, 2005; PAY_2 = the repayment status in August, 2005; ...; PAY_6 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
BILL_AMT1 - BILL_AMT6: Amount of bill statement (NT dollar). BILL_AMNT1 = amount of bill statement in September, 2005; BILL_AMT2 = amount of bill statement in August, 2005; ...; BILL_AMT6 = amount of bill statement in April, 2005.
PAY_AMT1 - PAY_AMT6: Amount of previous payment (NT dollar). PAY_AMT1 = amount paid in September, 2005; PAY_AMT2 = amount paid in August, 2005; ...; PAY_AMT6 = amount paid in April, 2005.

These 23 input variables are used to predict the target variable, whether or not a customer defaulted on their credit card bill in late 2005.

Because h2o accepts both numeric and character inputs, some variables will be recoded into more transparent character values.

Import data and clean¶

The credit card default data is available as an .xls file. Pandas reads .xls files automatically, so it's used to load the credit card default data and give the prediction target a shorter name: DEFAULT_NEXT_MONTH.

In [3]:

# import XLS file
path = 'default_of_credit_card_clients.xls'
data = pd.read_excel(path,
                     skiprows=1)

# remove spaces from target column name 
data = data.rename(columns={'default payment next month': 'DEFAULT_NEXT_MONTH'}) 

Assign modeling roles¶

The shorthand name y is assigned to the prediction target. X is assigned to all other input variables in the credit card default data except the row indentifier, ID.

In [4]:

# assign target and inputs for GBM
y = 'DEFAULT_NEXT_MONTH'
X = [name for name in data.columns if name not in [y, 'ID']]
print('y =', y)
print('X =', X)

y = DEFAULT_NEXT_MONTH
X = ['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

Helper function for recoding values in the UCI credict card default data¶

This simple function maps longer, more understandable character string values from the UCI credit card default data dictionary to the original integer values of the input variables found in the dataset. These character values can be used directly in h2o decision tree models, and the function returns the original Pandas DataFrame as an h2o object, an H2OFrame. H2o models cannot run on Pandas DataFrames. They require H2OFrames.

In [5]:

def recode_cc_data(frame):
    
    """ Recodes numeric categorical variables into categorical character variables
    with more transparent values. 
    
    Args:
        frame: Pandas DataFrame version of UCI credit card default data.
        
    Returns: 
        H2OFrame with recoded values.
        
    """
    
    # define recoded values
    sex_dict = {1:'male', 2:'female'}
    education_dict = {0:'other', 1:'graduate school', 2:'university', 3:'high school', 
                      4:'other', 5:'other', 6:'other'}
    marriage_dict = {0:'other', 1:'married', 2:'single', 3:'divorced'}
    pay_dict = {-2:'no consumption', -1:'pay duly', 0:'use of revolving credit', 1:'1 month delay', 
                2:'2 month delay', 3:'3 month delay', 4:'4 month delay', 5:'5 month delay', 6:'6 month delay', 
                7:'7 month delay', 8:'8 month delay', 9:'9+ month delay'}
    
    # recode values using Pandas apply() and anonymous function
    frame['SEX'] = frame['SEX'].apply(lambda i: sex_dict[i])
    frame['EDUCATION'] = frame['EDUCATION'].apply(lambda i: education_dict[i])    
    frame['MARRIAGE'] = frame['MARRIAGE'].apply(lambda i: marriage_dict[i]) 
    for name in frame.columns:
        if name in ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']:
            frame[name] = frame[name].apply(lambda i: pay_dict[i])            
                
    return h2o.H2OFrame(frame)

data = recode_cc_data(data)

Parse progress: |█████████████████████████████████████████████████████████| 100%

Ensure target is handled as a categorical variable¶

In h2o, a numeric variable can be treated as numeric or categorical. The target variable DEFAULT_NEXT_MONTH takes on values of 0 or 1. To ensure this numeric variable is treated as a categorical variable, the asfactor() function is used to explicitly declare that it is a categorical variable.

In [6]:

data[y] = data[y].asfactor() 

Display descriptive statistics¶

The h2o describe() function displays a brief description of the credit card default data. For the categorical input variables LIMIT_BAL, SEX, EDUCATION, MARRIAGE, and PAY_0-PAY_6, the new character values created above in cell 5 are visible. Basic descriptive statistics are displayed for numeric inputs.

In [7]:

data.describe()

Rows:30000
Cols:25

	ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	PAY_5	PAY_6	BILL_AMT1	BILL_AMT2	BILL_AMT3	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	DEFAULT_NEXT_MONTH
type	int	int	enum	enum	enum	int	enum	enum	enum	enum	enum	enum	int	int	int	int	int	int	int	int	int	int	int	int	enum
mins	1.0	10000.0				21.0							-165580.0	-69777.0	-157264.0	-170000.0	-81334.0	-339603.0	0.0	0.0	0.0	0.0	0.0	0.0
mean	15000.5	167484.32266666688				35.48549999999994							51223.33090000009	49179.07516666668	47013.15479999971	43262.9489666666	40311.40096666653	38871.76039999991	5663.580500000014	5921.16350000001	5225.681500000005	4826.076866666661	4799.387633333302	5215.502566666664
maxs	30000.0	1000000.0				79.0							964511.0	983931.0	1664089.0	891586.0	927171.0	961664.0	873552.0	1684259.0	896040.0	621000.0	426529.0	528666.0
sigma	8660.398374208891	129747.66156720225				9.21790406809016							73635.86057552959	71173.76878252836	69349.38742703681	64332.85613391641	60797.1557702648	59554.10753674574	16563.280354025763	23040.870402057226	17606.961469803115	15666.159744031993	15278.305679144793	17777.465775435332
zeros	0	0				0							2008	2506	2870	3195	3506	4020	5249	5396	5968	6408	6703	7173
missing	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	1.0	20000.0	female	university	married	24.0	2 month delay	2 month delay	pay duly	pay duly	no consumption	no consumption	3913.0	3102.0	689.0	0.0	0.0	0.0	0.0	689.0	0.0	0.0	0.0	0.0	1
1	2.0	120000.0	female	university	single	26.0	pay duly	2 month delay	use of revolving credit	use of revolving credit	use of revolving credit	2 month delay	2682.0	1725.0	2682.0	3272.0	3455.0	3261.0	0.0	1000.0	1000.0	1000.0	0.0	2000.0	1
2	3.0	90000.0	female	university	single	34.0	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	29239.0	14027.0	13559.0	14331.0	14948.0	15549.0	1518.0	1500.0	1000.0	1000.0	1000.0	5000.0	0
3	4.0	50000.0	female	university	married	37.0	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	46990.0	48233.0	49291.0	28314.0	28959.0	29547.0	2000.0	2019.0	1200.0	1100.0	1069.0	1000.0	0
4	5.0	50000.0	male	university	married	57.0	pay duly	use of revolving credit	pay duly	use of revolving credit	use of revolving credit	use of revolving credit	8617.0	5670.0	35835.0	20940.0	19146.0	19131.0	2000.0	36681.0	10000.0	9000.0	689.0	679.0	0
5	6.0	50000.0	male	graduate school	single	37.0	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	64400.0	57069.0	57608.0	19394.0	19619.0	20024.0	2500.0	1815.0	657.0	1000.0	1000.0	800.0	0
6	7.0	500000.0	male	graduate school	single	29.0	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	367965.0	412023.0	445007.0	542653.0	483003.0	473944.0	55000.0	40000.0	38000.0	20239.0	13750.0	13770.0	0
7	8.0	100000.0	female	university	single	23.0	use of revolving credit	pay duly	pay duly	use of revolving credit	use of revolving credit	pay duly	11876.0	380.0	601.0	221.0	-159.0	567.0	380.0	601.0	0.0	581.0	1687.0	1542.0	0
8	9.0	140000.0	female	high school	married	28.0	use of revolving credit	use of revolving credit	2 month delay	use of revolving credit	use of revolving credit	use of revolving credit	11285.0	14096.0	12108.0	12211.0	11793.0	3719.0	3329.0	0.0	432.0	1000.0	1000.0	1000.0	0
9	10.0	20000.0	male	high school	single	35.0	no consumption	no consumption	no consumption	no consumption	pay duly	pay duly	0.0	0.0	0.0	0.0	13007.0	13912.0	0.0	0.0	0.0	13007.0	1122.0	0.0	0

2. Train an H2O GBM classifier¶

Split data into training and test sets for early stopping¶

The credit card default data is split into training and test sets to monitor and prevent overtraining. Reproducibility is also important factor in creating trustworthy models, and randomly splitting datasets can introduce randomness in model predictions and other results. A random seed is used here to ensure the data split is reproducible.

In [8]:

# split into training and validation
train, test = data.split_frame([0.7], seed=12345)

# summarize split
print('Train data rows = %d, columns = %d' % (train.shape[0], train.shape[1]))
print('Test data rows = %d, columns = %d' % (test.shape[0], test.shape[1]))

Train data rows = 21060, columns = 25
Test data rows = 8940, columns = 25

Train h2o GBM classifier¶

Many tuning parameters must be specified to train a GBM using h2o. Typically a grid search would be performed to identify the best parameters for a given modeling task using the H2OGridSearch class. For brevity's sake, a previously-discovered set of good tuning parameters are specified here. Because gradient boosting methods typically resample training data, an additional random seed is also specified for the h2o GBM using the seed parameter to create reproducible predictions, error rates, and variable importance values. To avoid overfitting, the stopping_rounds parameter is used to stop the training process after the test error fails to decrease for 5 iterations.

In [9]:

# initialize GBM model
model = H2OGradientBoostingEstimator(ntrees=150,            # maximum 150 trees in GBM
                                     max_depth=4,           # trees can have maximum depth of 4
                                     sample_rate=0.9,       # use 90% of rows in each iteration (tree)
                                     col_sample_rate=0.9,   # use 90% of variables in each iteration (tree)
                                     stopping_rounds=5,     # stop if validation error does not decrease for 5 iterations (trees)
                                     seed=12345)            # for reproducibility

# train a GBM model
model.train(y=y, x=X, training_frame=train, validation_frame=test)

# print AUC
print('GBM Test AUC = %.4f' % model.auc(valid=True))

# uncomment to see model details
# print(model) 

gbm Model Build progress: |███████████████████████████████████████████████| 100%
GBM Test AUC = 0.7804

Display variable importance¶

During training, the h2o GBM aggregates the improvement in error caused by each split in each decision tree across all the decision trees in the ensemble classifier. These values are attributed to the input variable used in each split and give an indication of the contribution each input variable makes toward the model's predictions. The variable importance ranking should be parsimonious with human domain knowledge and reasonable expectations. In this case, a customer's most recent payment behavior, PAY_0, is by far the most important variable followed by their second most recent payment, PAY_2, their credit limit, LIMIT_BAL, and third most recent payment behavior, PAY_3. This result is well-aligned with business practices in credit lending: people who miss their most recent payments are likely to default soon.

In [10]:

model.varimp_plot()

3. Conduct residual analysis to debug model¶

Residuals refer to the difference between the recorded value of a dependent variable and the predicted value of a dependent variable for every row in a data set. Plotting the residual values against the predicted values is a time-honored model assessment technique and a great way to see all your modeling results in two dimensions.

Bind model predictions onto test data¶

To calculate the residuals for our GBM model, first the model predictions are merged onto onto the test set. The test data is used here to see how the model behaves on holdout data, which should be closer to its behavior on new data than analyzing residuals for the training inputs and predictions.

In [11]:

yhat = 'p_DEFAULT_NEXT_MONTH'
preds1 = model.predict(test).drop(['predict', 'p0'])
preds1.columns = [yhat]
test_yhat = test.cbind(preds1[yhat])

gbm prediction progress: |████████████████████████████████████████████████| 100%

Calculate deviance residuals for binomial classification¶

For binomial classification, deviance residuals are related to the logloss cost function. Like analyzing $y - \hat{y}$ for linear regression, these residuals are the quantities that the GBM sought to minimize. Deviance residual values are calculated by applying the simple formula in the cell directly below.

In [12]:

# use Pandas for adding columns and plotting
test_yhat = test_yhat.as_data_frame()
test_yhat['s'] = 1
test_yhat.loc[test_yhat['DEFAULT_NEXT_MONTH'] == 0, 's'] = -1
test_yhat['r_DEFAULT_NEXT_MONTH'] = test_yhat['s'] * np.sqrt(-2*(test_yhat[y]*np.log(test_yhat[yhat]) +
                                                                 ((1 - test_yhat[y])*np.log(1 - test_yhat[yhat]))))
test_yhat = test_yhat.drop('s', axis=1)

Plot residuals¶

Plotting residuals is a model debugging and diagnostic tool that enables users to see modeling results, and any anomolies, in a single two-dimensional plot. Here the green points represent customers who defaulted, and the blue points represent customers who did not. A few potential outliers are visible. There appear to be several cases in the test data with relatively large negative residuals. Understanding and addressing the factors that cause these outliers could lead to a more acccurate model.

In [13]:

groups = test_yhat.groupby('DEFAULT_NEXT_MONTH') # define groups
fig, ax_ = plt.subplots(figsize=(8, 8))          # initialize figure

plt.xlabel('Predicted: DEFAULT_NEXT_MONTH')
plt.ylabel('Residual: DEFAULT_NEXT_MONTH')

# plot groups with appropriate color
color_list = ['b', 'g'] 
c_idx = 0
for name, group in groups:
    ax_.plot(group.p_DEFAULT_NEXT_MONTH, group.r_DEFAULT_NEXT_MONTH, label=' '.join(['DEFAULT_NEXT_MONTH:', str(name)]),
             marker='o', linestyle='', color=color_list[c_idx], alpha=0.3)
    c_idx += 1

_ = ax_.legend(loc=1) # legend

Sort data by residuals and display data and residuals¶

Printing a table with model inputs, actual target values, and model predictions sorted by residuals is another simple way to analyze residuals. Customers that defaulted, but were predicted not to, are listed at the top of the table below. Scroll to the bottom of the table to see the customers who were predicted to default, but then did not. Also notice the jumps in residual values. These are the potential outliers pictured in the residual plot above.

In [14]:

test_yhat = test_yhat.sort_values(by='r_DEFAULT_NEXT_MONTH', ascending=False).reset_index(drop=True)
test_yhat

Out[14]:

	ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	PAY_5	PAY_6	BILL_AMT1	BILL_AMT2	BILL_AMT3	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	DEFAULT_NEXT_MONTH	p_DEFAULT_NEXT_MONTH	r_DEFAULT_NEXT_MONTH
0	2561	310000	female	graduate school	single	32	no consumption	no consumption	no consumption	no consumption	no consumption	no consumption	20138	8267	65993	8543	1695	750	8267	66008	8543	1695	750	7350	1	0.045837	2.483007
1	3016	350000	male	graduate school	married	38	no consumption	no consumption	pay duly	use of revolving credit	use of revolving credit	no consumption	16459	4120	44164	35233	884	9924	941	44743	0	884	9924	10824	1	0.050230	2.445869
2	11462	210000	female	graduate school	single	46	pay duly	pay duly	pay duly	use of revolving credit	use of revolving credit	pay duly	15655	3918	29881	24247	21664	1556	4854	30366	0	433	1556	14047	1	0.050527	2.443456
3	25772	350000	female	graduate school	married	33	use of revolving credit	pay duly	pay duly	pay duly	pay duly	pay duly	82964	68532	17926	17966	30741	31088	68940	18018	18058	30897	31244	88461	1	0.051503	2.435615
4	6933	500000	male	graduate school	single	37	pay duly	pay duly	pay duly	pay duly	pay duly	pay duly	4331	60446	30592	154167	13410	25426	60446	30594	150843	163881	25426	39526	1	0.051717	2.433910
5	22505	260000	female	university	single	33	pay duly	pay duly	pay duly	pay duly	pay duly	use of revolving credit	5188	12357	28656	7497	7685	15434	13000	29022	7500	27769	12000	6200	1	0.053061	2.423347
6	22751	350000	female	graduate school	married	32	pay duly	pay duly	no consumption	no consumption	no consumption	no consumption	30625	60003	7147	9950	22117	4874	60396	7147	9950	22117	4874	0	1	0.056650	2.396193
7	13381	400000	female	university	single	35	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	109943	222085	223350	213831	210563	211925	120018	10071	8037	8018	8809	5022	1	0.058753	2.380929
8	19530	350000	female	university	married	36	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	21026	35588	38002	40357	43663	52735	15000	3000	3000	4000	10000	25000	1	0.058911	2.379801
9	15549	450000	male	university	single	36	no consumption	no consumption	no consumption	no consumption	no consumption	no consumption	8012	4009	5226	4715	3275	6422	4021	5241	4729	3284	6441	4285	1	0.059060	2.378739
10	25692	330000	female	graduate school	single	42	no consumption	no consumption	no consumption	no consumption	no consumption	no consumption	565	20650	15360	0	12923	1816	20650	15360	0	12923	1816	17050	1	0.059089	2.378531
11	971	300000	male	graduate school	married	42	pay duly	use of revolving credit	use of revolving credit	pay duly	use of revolving credit	use of revolving credit	11973	61834	25145	37666	19453	10492	20979	5000	37676	8808	2000	2709	1	0.060745	2.366883
12	22390	310000	female	graduate school	married	32	use of revolving credit	pay duly	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	4762	26943	7488	10276	96059	6434	26943	5000	6000	93000	3000	5000	1	0.061522	2.361507
13	6854	290000	male	high school	single	34	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	5451	6230	140802	143354	146225	148820	1200	135000	5200	5500	5500	5400	1	0.061900	2.358911
14	8959	340000	male	graduate school	single	44	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	83059	85634	73950	59324	156094	110234	20000	5000	2000	112000	4234	4000	1	0.061935	2.358675
15	1980	500000	female	graduate school	married	35	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	35176	36193	44157	48322	21593	13866	2504	10004	5178	1047	2019	1004	1	0.062027	2.358041
16	18899	140000	female	other	single	28	use of revolving credit	use of revolving credit	pay duly	use of revolving credit	use of revolving credit	use of revolving credit	108018	6500	6327	138485	140492	141006	1000	6327	135000	4700	5000	5000	1	0.063140	2.350490
17	2302	230000	female	graduate school	married	30	pay duly	pay duly	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	2212	17402	32450	17285	9766	9981	17402	20013	346	5000	8000	5000	1	0.063386	2.348836
18	523	360000	male	graduate school	single	28	pay duly	pay duly	pay duly	use of revolving credit	use of revolving credit	pay duly	1210	820	64644	125984	106584	125557	390	75720	62520	17000	132200	167000	1	0.063629	2.347206
19	11745	220000	male	graduate school	single	51	pay duly	pay duly	pay duly	pay duly	pay duly	pay duly	20730	-270	53895	-105	20895	20835	0	54165	0	21000	20940	33460	1	0.063813	2.345973
20	8339	480000	male	graduate school	married	58	no consumption	no consumption	no consumption	no consumption	no consumption	no consumption	24610	-310	148544	18791	5909	68988	4	149654	18885	5940	69337	200655	1	0.063860	2.345661
21	22712	320000	female	high school	married	35	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	157249	148123	142852	133583	129049	128325	6000	6048	6000	5000	5000	5000	1	0.065316	2.336031
22	26005	320000	female	university	married	35	pay duly	pay duly	pay duly	pay duly	use of revolving credit	use of revolving credit	2276	6626	11131	13824	17992	15250	6626	12446	17746	6000	5749	928	1	0.065442	2.335207
23	13797	390000	male	graduate school	married	36	no consumption	no consumption	no consumption	no consumption	pay duly	pay duly	3931	3625	1600	3815	8330	4765	3625	1600	3315	11645	4765	2171	1	0.065657	2.333801
24	10147	450000	female	graduate school	married	46	pay duly	pay duly	pay duly	pay duly	pay duly	pay duly	28205	3760	4148	2312	6909	4189	3793	4148	2312	6909	4189	1539	1	0.065732	2.333313
25	1199	340000	female	high school	single	44	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	142836	145125	146682	150407	147868	149349	7000	5500	6027	5328	5390	6047	1	0.065756	2.333153
26	10754	160000	female	university	married	31	use of revolving credit	use of revolving credit	use of revolving credit	pay duly	pay duly	use of revolving credit	42781	42774	41817	749	5572	10573	2300	2300	749	5572	5573	13793	1	0.066522	2.328184
27	25816	350000	female	university	married	47	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	97500	84202	82933	80501	79038	80694	3010	2970	2886	2824	2925	2987	1	0.066585	2.327775
28	12668	210000	female	graduate school	married	37	use of revolving credit	use of revolving credit	pay duly	pay duly	pay duly	pay duly	24547	48302	4549	3085	7300	6583	4519	9098	3085	7300	6583	5060	1	0.066960	2.325363
29	15482	150000	male	graduate school	married	37	no consumption	no consumption	no consumption	no consumption	no consumption	no consumption	22109	10876	10268	5872	3068	2181	10943	10273	5978	3068	2181	3242	1	0.067985	2.318820
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
8910	2865	50000	female	university	married	46	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	28390	29639	30854	30062	32705	33519	2000	2000	0	3300	1500	0	0	0.772660	-1.721227
8911	8115	120000	female	university	single	26	3 month delay	3 month delay	2 month delay	2 month delay	3 month delay	2 month delay	12034	12548	12056	13958	13468	6144	1000	0	2400	100	0	57258	0	0.772848	-1.721707
8912	13422	100000	female	graduate school	single	29	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	74032	75557	76434	74611	79292	80945	3300	2700	0	5900	3100	0	0	0.774303	-1.725433
8913	12846	90000	male	university	married	42	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	use of revolving credit	95773	95489	94681	93965	90545	90529	4000	3500	3500	0	3500	4200	0	0.775734	-1.729118
8914	1629	140000	female	high school	married	31	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	89910	92588	91936	94623	94952	95234	5000	1800	5000	1900	4700	0	0	0.776880	-1.732076
8915	4249	90000	male	high school	married	42	2 month delay	2 month delay	2 month delay	3 month delay	3 month delay	3 month delay	48674	49895	52570	53614	54534	53374	2300	4116	2500	2052	0	0	0	0.779941	-1.740033
8916	9756	140000	male	graduate school	married	31	2 month delay	use of revolving credit	use of revolving credit	2 month delay	2 month delay	2 month delay	51028	52112	55232	55932	54910	57344	2500	4600	2200	0	3513	3000	0	0.780607	-1.741776
8917	27215	60000	male	university	married	35	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	20195	21267	21332	21680	23011	23498	1700	700	1000	2000	1000	0	0	0.781268	-1.743508
8918	5986	100000	male	high school	married	44	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	30076	31287	31676	32259	31608	33524	2000	1200	1400	0	2600	0	0	0.784777	-1.752759
8919	15186	30000	female	graduate school	single	25	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	7593	9634	4476	8830	8153	6422	2379	7	7002	13	155	1	0	0.787675	-1.760476
8920	2974	30000	female	university	married	24	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	150	150	150	150	150	300	0	0	0	0	150	0	0	0.793444	-1.776054
8921	10785	80000	female	university	married	33	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	53843	55933	56575	57303	58593	59738	3500	2100	2200	2300	2200	2100	0	0.794019	-1.777621
8922	13290	230000	female	graduate school	married	34	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	190784	195724	198707	201634	205949	210077	9300	7500	7500	7500	7500	7600	0	0.796132	-1.783414
8923	27255	80000	male	graduate school	married	46	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	40509	40551	42592	43296	43892	43060	1000	3000	1700	1600	0	3500	0	0.797857	-1.788172
8924	17703	60000	female	university	married	35	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	3167	5601	5366	6772	6515	7906	2500	0	1500	0	1500	0	0	0.802237	-1.800380
8925	13811	40000	male	graduate school	married	47	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	11084	12605	13102	12595	14386	14005	2000	1000	0	2000	0	2000	0	0.805651	-1.810028
8926	16920	50000	male	high school	married	52	2 month delay	2 month delay	2 month delay	2 month delay	4 month delay	3 month delay	36428	37530	38630	41774	40806	41357	2000	2000	4086	0	1500	1000	0	0.807385	-1.814973
8927	17748	30000	female	high school	married	54	2 month delay	2 month delay	2 month delay	2 month delay	4 month delay	3 month delay	22147	24770	26068	28842	28094	27361	3000	2000	3500	0	0	1000	0	0.812416	-1.829496
8928	18659	40000	female	university	married	28	2 month delay	2 month delay	3 month delay	2 month delay	2 month delay	2 month delay	31131	33815	33002	32173	34629	33940	3500	0	0	3000	0	2000	0	0.814786	-1.836434
8929	26565	200000	female	high school	married	55	2 month delay	2 month delay	3 month delay	2 month delay	2 month delay	2 month delay	159017	162697	163143	161906	165807	169599	9159	4842	3000	8000	7000	3000	0	0.815782	-1.839367
8930	21098	200000	male	graduate school	married	42	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	2 month delay	168289	172001	175281	177895	180078	184048	8000	7500	7000	6600	7000	7100	0	0.816460	-1.841371
8931	7068	90000	female	graduate school	single	30	2 month delay	2 month delay	3 month delay	3 month delay	3 month delay	3 month delay	750	750	750	750	2450	2150	0	0	0	2000	0	0	0	0.825031	-1.867161
8932	3087	30000	female	university	single	24	2 month delay	2 month delay	7 month delay	7 month delay	7 month delay	7 month delay	300	300	300	300	300	300	0	0	0	0	0	0	0	0.825105	-1.867387
8933	14589	280000	male	graduate school	married	50	3 month delay	5 month delay	4 month delay	3 month delay	2 month delay	use of revolving credit	327918	321476	314931	176439	154010	134334	0	0	500	0	6267	2257	0	0.832569	-1.890599
8934	16957	270000	male	graduate school	married	50	2 month delay	4 month delay	3 month delay	3 month delay	2 month delay	2 month delay	213616	208784	212058	207226	202394	231339	0	8000	0	0	32236	3000	0	0.841972	-1.920928
8935	29505	20000	male	university	married	40	1 month delay	2 month delay	3 month delay	2 month delay	3 month delay	3 month delay	14829	17267	16706	18694	19049	18459	3000	0	2560	955	0	661	0	0.852781	-1.957464
8936	19316	110000	female	graduate school	married	41	3 month delay	2 month delay	2 month delay	7 month delay	7 month delay	7 month delay	150	150	150	150	150	150	0	0	0	0	0	0	0	0.866568	-2.007067
8937	22725	100000	female	university	married	38	3 month delay	2 month delay	2 month delay	3 month delay	3 month delay	3 month delay	750	750	750	750	750	750	0	0	0	0	0	1500	0	0.869051	-2.016405
8938	9672	170000	male	graduate school	single	48	2 month delay	2 month delay	7 month delay	7 month delay	7 month delay	7 month delay	2400	2400	2400	2400	2400	2400	0	0	0	0	0	0	0	0.874018	-2.035492
8939	5916	110000	female	graduate school	married	41	2 month delay	2 month delay	7 month delay	7 month delay	7 month delay	7 month delay	150	150	150	150	150	150	0	0	0	0	0	0	0	0.886468	-2.085985

8940 rows × 27 columns

This simple analysis has uncovered some of the most difficult customers for the GBM to correctly predict default. Perhaps because of the high importance of the payment features, PAY_0-PAY_6, the GBM struggles to correctly predict several cases in which customers made timely recent payments and then suddenly defaulted (high positive residuals) and those customers that were chronically late making payments but did not default (high negative residuals).

Plot residuals by most important input variable¶

Residuals can also be plotted for important input variables to understand how the values of a single input variable affect prediction errors. When plotted by PAY_0, the residuals confirm that the GBM is struggling to accurately predict cases where default status is not correlated with recent payment behavior in an obvious way. The residual plots for values of PAY_0 indicating timely payment behavior (e.g., use of revolving credit, pay duly, and no consumption) generally display the highest positive residuals and relatively small negative residuals. Residuals for the other values of PAY_0, those that represent late recent payments, tend to show large negative residuals and relatively small positive residuals.

In [15]:

# use Seaborn FacetGrid for convenience
g = sns.FacetGrid(test_yhat, row='PAY_0', hue=y)
_ = g.map(plt.scatter, yhat, 'r_DEFAULT_NEXT_MONTH', alpha=0.4)

4. Retrain GBM classifier based on results of residual analysis¶

Now that an issue has been discovered using residual analysis, can it be resolved?

Create a feature that contains information about behavior over time¶

One strategy to improve prediction accuracy is to introduce a new feature that summarizes a customer's spending behavior over time to expose any potential financial instability: the standard deviation of a customer's bill amounts over six months. Pandas has a one-liner for calculating standard deviations for a set of columns, so the H2OFrame is casted back into Pandas DataFrame for convenience.

In [16]:

data = data.as_data_frame()
data['bill_std'] = data[['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']].std(axis=1)
data.head(n=3)

Out[16]:

	ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	PAY_5	PAY_6	BILL_AMT1	BILL_AMT2	BILL_AMT3	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	DEFAULT_NEXT_MONTH	bill_std
0	1	20000	female	university	married	24	2 month delay	2 month delay	pay duly	pay duly	no consumption	no consumption	3913	3102	689	0	0	0	0	689	0	0	0	0	1	1761.633219
1	2	120000	female	university	single	26	pay duly	2 month delay	use of revolving credit	use of revolving credit	use of revolving credit	2 month delay	2682	1725	2682	3272	3455	3261	0	1000	1000	1000	0	2000	1	637.967841
2	3	90000	female	university	single	34	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	use of revolving credit	29239	14027	13559	14331	14948	15549	1518	1500	1000	1000	1000	5000	0	6064.518593

Convert Pandas DataFrame back to H2OFrame for modeling¶

To retrain the model with the new feature, an H2OFrame is required and that H2OFrame is split using the same proportion and random seed as in cell 8 for the first GBM model.

In [17]:

data = h2o.H2OFrame(data)                          # convert 
data[y] = data[y].asfactor()                       # ensure target is handled as a categorical variable
train, test = data.split_frame([0.7], seed=12345)  # split into training and validation

Parse progress: |█████████████████████████████████████████████████████████| 100%

Retrain GBM with new feature¶

The train() function is used to retrain the GBM model with the nearly same hyperparameters used previously in cell 9. A slight, but noticable, increase in accuracy results from retraining with the new feature.

In [18]:

# initialize GBM model
model = H2OGradientBoostingEstimator(ntrees=150,            # maximum 150 trees in GBM
                                     max_depth=6,           # trees can have maximum depth of 6
                                     sample_rate=0.9,       # use 90% of rows in each iteration (tree)
                                     col_sample_rate=0.85,  # use 90% of variables in each iteration (tree)
                                     stopping_rounds=5,     # stop if validation error does not decrease for 5 iterations (trees)
                                     seed=12345)            # for reproducibility

# retrain GBM model
model.train(y=y,
            x=X + ['bill_std'], # add new feature
            training_frame=train, 
            validation_frame=test)

# print AUC
print('GBM Test AUC = %.4f' % model.auc(valid=True))

gbm Model Build progress: |███████████████████████████████████████████████| 100%
GBM Test AUC = 0.7825

While there maybe be other more complex features or a more optimal set of hyperparameters that could lead to further incremental increases in accuracy, more information is needed to achieve meaningful improvement in prediction performance. In particular, a common measure for credit lending, the customers' debt-to-income ratio, for each payment and billing period could be particularly useful. Spikes in debt-to-income ratio, representing loss of income or large increases in debt, would likely be very indicative of a default and would expose the GBM to information not currently available in the UCI credit card default data. Introducing new data could also de-emphasize PAY_0, which would likely result in a more stable model as well.

5. Perform sensitivity analysis to test model performance on unseen data¶

Sensitivity analysis investigates whether model behavior and outputs remain stable when data is intentionally perturbed or other changes are simulated in data. Beyond traditional assessment practices, sensitivity analysis of machine learning model predictions is perhaps the most important validation technique for machine learning models. Machine learning models can make drastically differing predictions for only minor changes in input variable values. In practice, many linear model validation techniques focus on the numerical instability of regression parameters due to correlation between input variables or between input variables and the dependent variable. It may be prudent for those switching from linear modeling techniques to machine learning techniques to focus less on numerical instability of model parameters and to focus more on the potential instability of model predictions.

Here sensitivity analysis is used to understand the impact of changing the most important input variable, PAY_0, and the impact of a sociologically sensitive variable, SEX, in the model. If the model changes in reasonable and expected ways when important variable values are changed this can enhance trust in the model. If the contribution of potentially sensitive variables, such as those related to gender, race, age, marital status, or disability status, can be shown to have minimal impact on the model, this is an indication of fairness in the model predictions and can also increase overall trust in the model.

Bind new model predictions onto test data¶

Typically, a productive exercise in model debugging and validation is to investigate customers with very high or low predicted probabilities to determine if their predictions stay within reasonable bounds when important variables are changed. The predictions from the new, more accurate model are merged onto the test set to find these potentially interesting customers.

In [19]:

preds2 = model.predict(test).drop(['predict', 'p0'])
preds2.columns = [yhat]
test_yhat = test.cbind(preds1[yhat])

gbm prediction progress: |████████████████████████████████████████████████| 100%

Helper function for finding percentile indices¶

The function below finds and returns the row indices for the minimum, the maximum, and the deciles of one column in terms of another -- in this case, the model predictions (p_DEFAULT_NEXT_MONTH) and the row identifier (ID), respectively. These indices are used as a starting point for boundary testing. Outlying predictions found through residual analysis is another group of potentially interesting local predictions to investigate.

In [20]:

def get_percentile_dict(yhat, id_, frame):

    """ Returns the minimum, the maximum, and the deciles of a column, yhat, 
        as the indices based on another column id_.
    
    Args:
        yhat: Column in which to find percentiles.
        id_: Id column that stores indices for percentiles of yhat.
        frame: H2OFrame containing yhat and id_. 
    
    Returns:
        Dictionary of percentile values and index column values.
    
    """
    
    # create a copy of frame and sort it by yhat
    sort_df = frame.as_data_frame()
    sort_df.sort_values(yhat, inplace=True)
    sort_df.reset_index(inplace=True)
    
    # find top and bottom percentiles
    percentiles_dict = {}
    percentiles_dict[0] = sort_df.loc[0, id_]
    percentiles_dict[99] = sort_df.loc[sort_df.shape[0]-1, id_]

    # find 10th-90th percentiles
    inc = sort_df.shape[0]//10
    for i in range(1, 10):
        percentiles_dict[i * 10] = sort_df.loc[i * inc,  id_]

    return percentiles_dict

# display percentiles dictionary
# ID values for rows
# from lowest prediction 
# to highest prediction
pred_percentile_dict = get_percentile_dict(yhat, 'ID', test_yhat)
pred_percentile_dict

Out[20]:

Display test data prediction range¶

Unlike some regression models and neural networks that can produce outrageous predictions for changes in input variables, GBM predictions in new data are bounded by the lowest and highest probability leaf nodes in each constiuent decision tree in the trained model. While unbounded, extreme predictions are typically not an issue for tree models and classification tasks, it is often a good idea to check that the model predictions cover a full range of useful values in the test set. Below, we can see that the model produces both low and high predictions in the test set, indicating that it is likely responsive to signal in new data and not simply predicting the majority class or an average value.

In [21]:

print('Lowest prediction:', test_yhat[test_yhat['ID'] == int(pred_percentile_dict[0])][[y, yhat]])
print('Highest prediction:', test_yhat[test_yhat['ID'] == int(pred_percentile_dict[99])][[y, yhat]])

Lowest prediction:

DEFAULT_NEXT_MONTH	p_DEFAULT_NEXT_MONTH
0	0.0383668

Highest prediction:

DEFAULT_NEXT_MONTH	p_DEFAULT_NEXT_MONTH
1	0.895285

Use trained model to test predictions for interesting situations: customer least likely to default¶

As a starting point for further analysis, sensitivity analysis is performed for the customer least likely to default. This woman has a very low probability of defaulting according to the trained GBM.

In [22]:

test_case = test_yhat[test_yhat['ID'] == int(pred_percentile_dict[0])]
test_case

ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	PAY_5	PAY_6	BILL_AMT1	BILL_AMT2	BILL_AMT3	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	DEFAULT_NEXT_MONTH	bill_std	p_DEFAULT_NEXT_MONTH
28716	780000	female	university	single	41	no consumption	no consumption	no consumption	no consumption	no consumption	no consumption	101957	61715	38686	21482	72628	182792	62819	39558	22204	82097	184322	25695	0	57564.1	0.0383668

Out[22]:

Test effect of changing `SEX`¶

SEX should not have a large impact on predictions. This could indicate unwanted sociological bias in the GBM model.

In [23]:

test_case = test_yhat[test_yhat['ID'] == int(pred_percentile_dict[0])]
test_case = test_case.drop([yhat])
test_case['SEX'] = 'male'
test_case = test_case.cbind(model.predict(test_case))
test_case

gbm prediction progress: |████████████████████████████████████████████████| 100%

ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	PAY_5	PAY_6	BILL_AMT1	BILL_AMT2	BILL_AMT3	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	DEFAULT_NEXT_MONTH	bill_std	predict	p0	p1
28716	780000	male	university	single	41	no consumption	no consumption	no consumption	no consumption	no consumption	no consumption	101957	61715	38686	21482	72628	182792	62819	39558	22204	82097	184322	25695	0	57564.1	0	0.959052	0.0409481

Out[23]:

As desired, simulating this person as a male does not have a large impact on their probability of default.

Test effect of changing `PAY_0`¶

Variable importance and residual analysis indicates that the value of PAY_0 can have a strong effect on model predictions. Measuring the change in predicted probability when the value of PAY_0 is changed from a timely payment to late payment is probably a good test case for prediction stability.

In [24]:

test_case = test_yhat[test_yhat['ID'] == int(pred_percentile_dict[0])]
test_case = test_case.drop([yhat])
test_case['PAY_0'] = '2 month delay' 
test_case = test_case.cbind(model.predict(test_case))
test_case

gbm prediction progress: |████████████████████████████████████████████████| 100%

ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	PAY_5	PAY_6	BILL_AMT1	BILL_AMT2	BILL_AMT3	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	DEFAULT_NEXT_MONTH	bill_std	predict	p0	p1
28716	780000	female	university	single	41	2 month delay	no consumption	no consumption	no consumption	no consumption	no consumption	101957	61715	38686	21482	72628	182792	62819	39558	22204	82097	184322	25695	0	57564.1	1	0.571032	0.428968

Out[24]:

When the value is changed from no consumption to two month delay there is a very large increase in predicted probability. Such a marked change related to the value of one variable is problematic for numerous reasons.

Use trained model to test predictions for interesting situations: customer most likely to default¶

Now the same test will be performed on the customer most likely to default. This woman has a very high probability of default under the GBM model.

In [25]:

test_case = test_yhat[test_yhat['ID'] == int(pred_percentile_dict[99])]
test_case

ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	PAY_5	PAY_6	BILL_AMT1	BILL_AMT2	BILL_AMT3	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	DEFAULT_NEXT_MONTH	bill_std	p_DEFAULT_NEXT_MONTH
29116	20000	female	university	married	59	3 month delay	2 month delay	3 month delay	2 month delay	2 month delay	4 month delay	8803	11137	10672	11201	12721	11946	2800	0	1000	2000	0	0	1	1327.55	0.895285

Out[25]:

Test effect of changing `SEX`¶

Changing the value for SEX from female to male for this customer decreases the predicted probability by a relatively small amount.

In [26]:

test_case = test_yhat[test_yhat['ID'] == int(pred_percentile_dict[99])]
test_case = test_case.drop([yhat])
test_case['SEX'] = 'male'
test_case = test_case.cbind(model.predict(test_case))
test_case

gbm prediction progress: |████████████████████████████████████████████████| 100%

ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	PAY_5	PAY_6	BILL_AMT1	BILL_AMT2	BILL_AMT3	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	DEFAULT_NEXT_MONTH	bill_std	predict	p0	p1
29116	20000	male	university	married	59	3 month delay	2 month delay	3 month delay	2 month delay	2 month delay	4 month delay	8803	11137	10672	11201	12721	11946	2800	0	1000	2000	0	0	1	1327.55	1	0.161579	0.838421

Out[26]:

Test effect of changing `PAY_0`¶

Switching the riskiest customer's value for PAY_0 from 3 month delay to pay duly reduces the their chance of default by about 20%, a noticable swing in probability but still a higher probability value, notably greater than common lending cutoffs.

In [27]:

test_case = test_yhat[test_yhat['ID'] == int(pred_percentile_dict[99])]
test_case = test_case.drop([yhat])
test_case['PAY_0'] = 'pay duly' 
test_case = test_case.cbind(model.predict(test_case))
test_case

gbm prediction progress: |████████████████████████████████████████████████| 100%

ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	PAY_5	PAY_6	BILL_AMT1	BILL_AMT2	BILL_AMT3	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	DEFAULT_NEXT_MONTH	bill_std	predict	p0	p1
29116	20000	female	university	married	59	pay duly	2 month delay	3 month delay	2 month delay	2 month delay	4 month delay	8803	11137	10672	11201	12721	11946	2800	0	1000	2000	0	0	1	1327.55	1	0.273858	0.726142

Out[27]:

From this small number of boundary test cases, the GBM model appears stable. However, if large swings in predictions occur for sensitive or important variables, practicioners are urged to retrain unstable models without the problematic variables or combinations of variables, which may unfortunately involve some trial and error. Also, four test cases is woefully inadequate for real-world models. Automated sensitivity analysis across many variables, combinations of variables, and for many different rows of data seems more appropriate for mission-critical machine learning.

Shutdown H2O¶

After using h2o, it's typically best to shut it down. However, before doing so, users should ensure that they have saved any h2o data structures, such as models, H2OFrames, or scoring artifacts, such as POJOs or MOJOs.

In [28]:

# be careful, this can erase your work!
h2o.cluster().shutdown(prompt=True)

Are you sure you want to shutdown the H2O instance running at http://127.0.0.1:54321 (Y/N)? n

Summary¶

In this notebook, a complex GBM classifier was trained to predict credit card defaults. Residual analysis was used to debug the GBM model predictions and enabled a slight improvement in accuracy. Sensitivity analysis was used to test the GBM for trustworthiness and stability. In a small number of boundary test cases, the trained GBM appeared stable. Residual analysis and sensitivity analysis are powerful model debugging techniques and can increase trust in complex models. These techniques should generalize well for many types of business and research problems, enabling you to train a complex model and justify it to your colleagues, bosses, and potentially, external regulators.

License¶

Testing machine learning models for accuracy, trustworthiness, and stability with Python and H2O¶

Performing residual analysis and sensitivity analysis to validate complex models¶

Python imports¶

Start h2o¶

1. Download, explore, and prepare UCI credit card default data¶

Import data and clean¶

Assign modeling roles¶

Helper function for recoding values in the UCI credict card default data¶

Ensure target is handled as a categorical variable¶

Display descriptive statistics¶

2. Train an H2O GBM classifier¶

Split data into training and test sets for early stopping¶

Train h2o GBM classifier¶

Display variable importance¶

3. Conduct residual analysis to debug model¶

Bind model predictions onto test data¶

Calculate deviance residuals for binomial classification¶

Plot residuals¶

Sort data by residuals and display data and residuals¶

Plot residuals by most important input variable¶

4. Retrain GBM classifier based on results of residual analysis¶

Create a feature that contains information about behavior over time¶

Convert Pandas DataFrame back to H2OFrame for modeling¶

Retrain GBM with new feature¶

5. Perform sensitivity analysis to test model performance on unseen data¶

Bind new model predictions onto test data¶

Helper function for finding percentile indices¶

Display test data prediction range¶

Use trained model to test predictions for interesting situations: customer least likely to default¶

Test effect of changing SEX¶

Test effect of changing PAY_0¶

Use trained model to test predictions for interesting situations: customer most likely to default¶

Test effect of changing SEX¶

Test effect of changing PAY_0¶

Shutdown H2O¶

Summary¶

Test effect of changing `SEX`¶

Test effect of changing `PAY_0`¶

Test effect of changing `SEX`¶

Test effect of changing `PAY_0`¶