First, we will load the packages we need for these exercises.
# From base python
import csv # Read and write csv files
import os # Work with file paths
import re # Regular expressions
# External
import numpy as np # Vectorized mathematics
import pandas as pd # Dataframes
# Install via `pip install shap` or `conda install -c conda-forge shap`
import shap # SHAP (SHapley Additive exPlanations) values for model explanation
import xgboost as xgb # For xgboost models
For this example, we will build an XGBoost model of salaries for City of Chicago workers, based on a public dataset of this information as of 2019.
This example will showcase a wide array of visualizations that can be done using SHAP to explain the results of the XGBoost model. Recall that because XGBoost is nonparametric, there is generally no easy way to quantify a variable's impact directionally. SHAP gives us a way to do this both at the individual level and in aggregate.
First, we import the data. As we can see, the data gives information on individuals, including their name, job title, working hours, salary type, and annual pay. There is also some post-processing information that I have added: the measures ending in .Freq, which were used to rename rare job titles and departments to "Other"; first_name; and Female. Female is calculated following Blevins and Mullen (2015, DHQ), based on first names and census data.
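The name-based coding can be sketched as follows. This is a hypothetical illustration only: the lookup table below is invented, whereas Blevins and Mullen derive their probabilities from historical census and Social Security name records.

```python
# Hypothetical sketch of name-based gender coding in the style of
# Blevins & Mullen (2015). The probabilities below are made up for
# illustration; the real method uses census/SSA name tables.
name_prob_female = {
    "KARINA": 0.99,
    "JEFFERY": 0.01,
    "LAURA": 0.99,
}

def code_female(first_name, threshold=0.5):
    """Return 1.0 if the name is usually female in the lookup table,
    0.0 if usually male, and None if the name is not in the table."""
    p = name_prob_female.get(first_name.upper())
    if p is None:
        return None
    return 1.0 if p >= threshold else 0.0

print(code_female("Karina"))   # 1.0
print(code_female("Jeffery"))  # 0.0
```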
df = pd.read_csv('../../Data/S10_Bias.csv')
df
Name | Job.Titles | Department | Full.or.Part.Time | Salary.or.Hourly | Typical.Hours | Annual.Salary | Hourly.Rate | Full.Time | Salaried | Salary | Obs | Job.Titles.Freq | Department.Freq | first_name | Female | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AARON, JEFFERY M | SERGEANT | POLICE | F | Salary | NaN | 101442.0 | NaN | 1 | 1 | 101442.0 | 33586 | 0.038885 | 0.414726 | JEFFERY | 0.0 |
1 | AARON, KARINA | POLICE OFFICER (ASSIGNED AS DETECTIVE) | POLICE | F | Salary | NaN | 94122.0 | NaN | 1 | 1 | 94122.0 | 33586 | 0.032186 | 0.414726 | KARINA | 1.0 |
2 | AARON, KIMBERLEI R | Other | GENERAL SERVICES | F | Salary | NaN | 111024.0 | NaN | 1 | 1 | 111024.0 | 33586 | 0.000417 | 0.028375 | KIMBERLEI | 1.0 |
3 | ABAD JR, VICENTE M | Other | WATER MGMNT | F | Salary | NaN | 114780.0 | NaN | 1 | 1 | 114780.0 | 33586 | 0.000983 | 0.057316 | VICENTE | 0.0 |
4 | ABARCA, EMMANUEL | Other | TRANSPORTN | F | Hourly | 40.0 | NaN | 43.72 | 1 | 0 | 1748.8 | 33586 | 0.003990 | 0.036146 | EMMANUEL | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
33581 | ZYLINSKA, KLAUDIA | POLICE OFFICER | POLICE | F | Salary | NaN | 68616.0 | NaN | 1 | 1 | 68616.0 | 33586 | 0.291669 | 0.414726 | KLAUDIA | 1.0 |
33582 | ZYMANTAS, LAURA C | POLICE OFFICER | POLICE | F | Salary | NaN | 72510.0 | NaN | 1 | 1 | 72510.0 | 33586 | 0.291669 | 0.414726 | LAURA | 1.0 |
33583 | ZYMANTAS, MARK E | POLICE OFFICER | POLICE | F | Salary | NaN | 90024.0 | NaN | 1 | 1 | 90024.0 | 33586 | 0.291669 | 0.414726 | MARK | 0.0 |
33584 | ZYRKOWSKI, CARLO E | POLICE OFFICER | POLICE | F | Salary | NaN | 93354.0 | NaN | 1 | 1 | 93354.0 | 33586 | 0.291669 | 0.414726 | CARLO | 0.0 |
33585 | ZYSKOWSKI, DARIUSZ | Other | Other | F | Salary | NaN | 119412.0 | NaN | 1 | 1 | 119412.0 | 33586 | 0.000060 | 0.002918 | DARIUSZ | 0.0 |
33586 rows × 16 columns
Next, we need to partial out the data we need for our model. Recall that we will need to cast to a DMatrix for XGBoost, and this is part of that process.
vars = ['Job.Titles', 'Department', 'Full.Time', 'Salaried', 'Female']
df[vars]
Job.Titles | Department | Full.Time | Salaried | Female | |
---|---|---|---|---|---|
0 | SERGEANT | POLICE | 1 | 1 | 0.0 |
1 | POLICE OFFICER (ASSIGNED AS DETECTIVE) | POLICE | 1 | 1 | 1.0 |
2 | Other | GENERAL SERVICES | 1 | 1 | 1.0 |
3 | Other | WATER MGMNT | 1 | 1 | 0.0 |
4 | Other | TRANSPORTN | 1 | 0 | 0.0 |
... | ... | ... | ... | ... | ... |
33581 | POLICE OFFICER | POLICE | 1 | 1 | 1.0 |
33582 | POLICE OFFICER | POLICE | 1 | 1 | 1.0 |
33583 | POLICE OFFICER | POLICE | 1 | 1 | 0.0 |
33584 | POLICE OFFICER | POLICE | 1 | 1 | 0.0 |
33585 | Other | Other | 1 | 1 | 0.0 |
33586 rows × 5 columns
We also need to deal with the categorical measures in our data. As the Python XGBoost package cannot yet properly handle categorical data under most circumstances (the R package can, though), we need to one-hot encode our categorical data: Job.Titles and Department.
one_hot1 = pd.get_dummies(df['Job.Titles'], prefix='Job.Titles')
one_hot2 = pd.get_dummies(df['Department'], prefix='Department')
df = df.join(one_hot1)
df = df.join(one_hot2)
df
Name | Job.Titles | Department | Full.or.Part.Time | Salary.or.Hourly | Typical.Hours | Annual.Salary | Hourly.Rate | Full.Time | Salaried | ... | Department_GENERAL SERVICES | Department_HEALTH | Department_LAW | Department_OEMC | Department_Other | Department_POLICE | Department_PUBLIC LIBRARY | Department_STREETS & SAN | Department_TRANSPORTN | Department_WATER MGMNT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AARON, JEFFERY M | SERGEANT | POLICE | F | Salary | NaN | 101442.0 | NaN | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1 | AARON, KARINA | POLICE OFFICER (ASSIGNED AS DETECTIVE) | POLICE | F | Salary | NaN | 94122.0 | NaN | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | AARON, KIMBERLEI R | Other | GENERAL SERVICES | F | Salary | NaN | 111024.0 | NaN | 1 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | ABAD JR, VICENTE M | Other | WATER MGMNT | F | Salary | NaN | 114780.0 | NaN | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | ABARCA, EMMANUEL | Other | TRANSPORTN | F | Hourly | 40.0 | NaN | 43.72 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
33581 | ZYLINSKA, KLAUDIA | POLICE OFFICER | POLICE | F | Salary | NaN | 68616.0 | NaN | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
33582 | ZYMANTAS, LAURA C | POLICE OFFICER | POLICE | F | Salary | NaN | 72510.0 | NaN | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
33583 | ZYMANTAS, MARK E | POLICE OFFICER | POLICE | F | Salary | NaN | 90024.0 | NaN | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
33584 | ZYRKOWSKI, CARLO E | POLICE OFFICER | POLICE | F | Salary | NaN | 93354.0 | NaN | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
33585 | ZYSKOWSKI, DARIUSZ | Other | Other | F | Salary | NaN | 119412.0 | NaN | 1 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
33586 rows × 62 columns
Now we can create the DMatrix
for XGBoost.
vars = one_hot1.columns.tolist() + \
one_hot2.columns.tolist() + \
['Full.Time', 'Salaried', 'Female']
dtrain = xgb.DMatrix(df[vars], label=df['Salary'], feature_names=vars)
The next line defines the parameters we will use, which are largely defaults. For a research project, these parameters should be tuned as we learned in Session 3.
param = {
    'booster': 'gbtree',             # default -- tree based
    'nthread': 8,                    # number of threads to use for parallel processing
    'objective': 'reg:squarederror', # squared-error loss
    'eval_metric': 'rmse',           # evaluate with RMSE
    'eta': 0.3,                      # shrinkage; [0, 1], default 0.3
    'max_depth': 6,                  # maximum depth of each tree; default 6
    'gamma': 0,                      # set above 0 to prune trees, [0, inf], default 0
    'min_child_weight': 1,           # higher leads to more pruning of trees, [0, inf], default 1
    'subsample': 1,                  # randomly subsample rows if in (0, 1), default 1
}
num_round = 30
And lastly, we can run our XGBoost model.
model_xgb = xgb.train(param, dtrain, num_round)
First, we will use the TreeExplainer function from SHAP to analyze our model. We will then pass it our training data in order to create individual SHAP values for each row of data.
explainer = shap.TreeExplainer(model_xgb)
shap_values = explainer(df[vars])
For some visualizations, the SHAP package struggles to plot more than 1,000 observations. In these cases, we will use a random sample of 1% of the original data.
df_small = df.sample(frac=0.01)
shap_values_small = explainer(df_small[vars])
For illustrative examples, or for understanding how XGBoost arrived at a particular prediction, a waterfall chart is useful. This explains the overall impact of the most impactful model features for a particular observation.
shap.plots.waterfall(shap_values[0])
shap.plots.waterfall(shap_values[2])
shap.plots.waterfall(shap_values[3])
Another way to present the waterfall is via a force plot. This presents the outcome of a given data point in a more concise manner.
shap.initjs()
shap.plots.force(shap_values[1])
Because force plots are more concise, we can get a representation of a force plot for all data points in a data frame on one chart. This visualization automatically clusters data points by what contributes to their SHAP values, so as to keep it readable. You can also interact with the chart labels to adjust the chart.
shap.initjs()
N=300
shap.plots.force(explainer.expected_value, shap_values.sample(N).values, feature_names=vars)
A decision plot accomplishes a very similar end goal as the aggregate force plot, but enables you to show more variables' relationships clearly.
It can be particularly insightful to pick a set of data points for a decision plot that are representative of a particular condition you are interested in studying. For instance, below I have separated the charts on the variable Female.
shap.decision_plot(explainer.expected_value,
explainer.shap_values(df_small[df_small.Female==1][vars]),
feature_names=vars)
shap.decision_plot(explainer.expected_value,
explainer.shap_values(df_small[df_small.Female==0][vars]),
feature_names=vars)
If you are interested in just one particular variable, you can create a scatterplot of the impact of the variable on the outcome. Again, because XGBoost is nonparametric, the effect of a variable can be conditional on all others in the data. As such, the impact is not always the same across observations, and may even flip sign for the same value of the variable.
shap.plots.scatter(shap_values[:,"Female"], color=shap_values)
If you would like a scatterplot like the above, but for multiple variables at once, the bee swarm plot can accomplish this.
shap.plots.beeswarm(shap_values)
Lastly, to get an overall sense of the aggregate impact (or importance) of a variable, we can plot the average absolute SHAP value for each variable. This is akin to an XGBoost importance plot. For machine learning models that lack an internal importance score, this is particularly useful.
shap.plots.bar(shap_values)
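What the bar plot computes can also be reproduced by hand: take the mean of the absolute SHAP values down each column. A sketch with a made-up SHAP matrix (rows = observations, columns = features):

```python
import numpy as np

# Invented SHAP matrix for illustration only.
sv = np.array([[ 0.5, -2.0, 0.1],
               [-0.5,  1.0, 0.2],
               [ 0.0, -3.0, 0.3]])
feature_names = ['Full.Time', 'Salaried', 'Female']

# Mean absolute SHAP value per feature -- the bar plot's ranking metric.
importance = np.abs(sv).mean(axis=0)
for name, imp in sorted(zip(feature_names, importance), key=lambda t: -t[1]):
    print(name, round(imp, 3))
# Salaried ranks first (mean |SHAP| = 2.0), then Full.Time, then Female.
```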