First, we will load the packages we need for these exercises.
# From base python
import csv # Read and write csv files
import os # Work with file paths
import re # Regular expressions
# External
import numpy as np # Vectorized mathematics
import pandas as pd # Dataframes
# Install via `pip install shap` or `conda install -c conda-forge shap`
import shap # SHAP (SHapley Additive exPlanations) values for model explanation
import xgboost as xgb # For xgboost models
For this example, we will build an XGBoost model of salaries for City of Chicago workers, based on a public dataset of this information as of 2019.
This example will showcase a wide array of visualizations that can be done using SHAP to explain the results of the XGBoost model. Recall that because XGBoost is nonparametric, there is generally no easy way to quantify a variable's impact directionally. SHAP gives us a way to do this both at the individual level and in aggregate.
First, we import the data. As we can see, the data gives information on individuals, including their name, job title, working hours, salary type, and annual pay. There is also some post-processing information that I have added: the measures ending in .Freq, which were used to rename rare job titles and departments to "Other"; first_name; and Female. Female is calculated following Blevins and Mullen (2015, DHQ), based on first names and census data.
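The name-based coding can be sketched as follows. This is a hypothetical illustration only: the lookup table below is invented, whereas Blevins and Mullen derive their probabilities from historical census and Social Security name records.

```python
# Hypothetical sketch of name-based gender coding in the style of
# Blevins & Mullen (2015). The probabilities below are made up for
# illustration; the real method uses census/SSA name tables.
name_prob_female = {
    "KARINA": 0.99,
    "JEFFERY": 0.01,
    "LAURA": 0.99,
}

def code_female(first_name, threshold=0.5):
    """Return 1.0 if the name is usually female in the lookup table,
    0.0 if usually male, and None if the name is not in the table."""
    p = name_prob_female.get(first_name.upper())
    if p is None:
        return None
    return 1.0 if p >= threshold else 0.0

print(code_female("Karina"))   # 1.0
print(code_female("Jeffery"))  # 0.0
```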
df = pd.read_csv('../../Data/S10_Bias.csv')
df
Name | Job.Titles | Department | Full.or.Part.Time | Salary.or.Hourly | Typical.Hours | Annual.Salary | Hourly.Rate | Full.Time | Salaried | Salary | Obs | Job.Titles.Freq | Department.Freq | first_name | Female | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AARON, JEFFERY M | SERGEANT | POLICE | F | Salary | NaN | 101442.0 | NaN | 1 | 1 | 101442.0 | 33586 | 0.038885 | 0.414726 | JEFFERY | 0.0 |
1 | AARON, KARINA | POLICE OFFICER (ASSIGNED AS DETECTIVE) | POLICE | F | Salary | NaN | 94122.0 | NaN | 1 | 1 | 94122.0 | 33586 | 0.032186 | 0.414726 | KARINA | 1.0 |
2 | AARON, KIMBERLEI R | Other | GENERAL SERVICES | F | Salary | NaN | 111024.0 | NaN | 1 | 1 | 111024.0 | 33586 | 0.000417 | 0.028375 | KIMBERLEI | 1.0 |
3 | ABAD JR, VICENTE M | Other | WATER MGMNT | F | Salary | NaN | 114780.0 | NaN | 1 | 1 | 114780.0 | 33586 | 0.000983 | 0.057316 | VICENTE | 0.0 |
4 | ABARCA, EMMANUEL | Other | TRANSPORTN | F | Hourly | 40.0 | NaN | 43.72 | 1 | 0 | 1748.8 | 33586 | 0.003990 | 0.036146 | EMMANUEL | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
33581 | ZYLINSKA, KLAUDIA | POLICE OFFICER | POLICE | F | Salary | NaN | 68616.0 | NaN | 1 | 1 | 68616.0 | 33586 | 0.291669 | 0.414726 | KLAUDIA | 1.0 |
33582 | ZYMANTAS, LAURA C | POLICE OFFICER | POLICE | F | Salary | NaN | 72510.0 | NaN | 1 | 1 | 72510.0 | 33586 | 0.291669 | 0.414726 | LAURA | 1.0 |
33583 | ZYMANTAS, MARK E | POLICE OFFICER | POLICE | F | Salary | NaN | 90024.0 | NaN | 1 | 1 | 90024.0 | 33586 | 0.291669 | 0.414726 | MARK | 0.0 |
33584 | ZYRKOWSKI, CARLO E | POLICE OFFICER | POLICE | F | Salary | NaN | 93354.0 | NaN | 1 | 1 | 93354.0 | 33586 | 0.291669 | 0.414726 | CARLO | 0.0 |
33585 | ZYSKOWSKI, DARIUSZ | Other | Other | F | Salary | NaN | 119412.0 | NaN | 1 | 1 | 119412.0 | 33586 | 0.000060 | 0.002918 | DARIUSZ | 0.0 |
33586 rows × 16 columns
Next, we need to partial out the data we need for our model. Recall that we will need to cast to a DMatrix for XGBoost, and this is part of that process.
vars = ['Job.Titles', 'Department', 'Full.Time', 'Salaried', 'Female']
df[vars]
Job.Titles | Department | Full.Time | Salaried | Female | |
---|---|---|---|---|---|
0 | SERGEANT | POLICE | 1 | 1 | 0.0 |
1 | POLICE OFFICER (ASSIGNED AS DETECTIVE) | POLICE | 1 | 1 | 1.0 |
2 | Other | GENERAL SERVICES | 1 | 1 | 1.0 |
3 | Other | WATER MGMNT | 1 | 1 | 0.0 |
4 | Other | TRANSPORTN | 1 | 0 | 0.0 |
... | ... | ... | ... | ... | ... |
33581 | POLICE OFFICER | POLICE | 1 | 1 | 1.0 |
33582 | POLICE OFFICER | POLICE | 1 | 1 | 1.0 |
33583 | POLICE OFFICER | POLICE | 1 | 1 | 0.0 |
33584 | POLICE OFFICER | POLICE | 1 | 1 | 0.0 |
33585 | Other | Other | 1 | 1 | 0.0 |
33586 rows × 5 columns
We also need to deal with the categorical measures in our data. As the Python XGBoost package cannot yet properly handle categorical data under most circumstances (the R package can, though), we need to one-hot encode our categorical data: Job.Titles and Department.
one_hot1 = pd.get_dummies(df['Job.Titles'], prefix='Job.Titles')
one_hot2 = pd.get_dummies(df['Department'], prefix='Department')
df = df.join(one_hot1)
df = df.join(one_hot2)
df
Name | Job.Titles | Department | Full.or.Part.Time | Salary.or.Hourly | Typical.Hours | Annual.Salary | Hourly.Rate | Full.Time | Salaried | ... | Department_GENERAL SERVICES | Department_HEALTH | Department_LAW | Department_OEMC | Department_Other | Department_POLICE | Department_PUBLIC LIBRARY | Department_STREETS & SAN | Department_TRANSPORTN | Department_WATER MGMNT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AARON, JEFFERY M | SERGEANT | POLICE | F | Salary | NaN | 101442.0 | NaN | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1 | AARON, KARINA | POLICE OFFICER (ASSIGNED AS DETECTIVE) | POLICE | F | Salary | NaN | 94122.0 | NaN | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | AARON, KIMBERLEI R | Other | GENERAL SERVICES | F | Salary | NaN | 111024.0 | NaN | 1 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | ABAD JR, VICENTE M | Other | WATER MGMNT | F | Salary | NaN | 114780.0 | NaN | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | ABARCA, EMMANUEL | Other | TRANSPORTN | F | Hourly | 40.0 | NaN | 43.72 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
33581 | ZYLINSKA, KLAUDIA | POLICE OFFICER | POLICE | F | Salary | NaN | 68616.0 | NaN | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
33582 | ZYMANTAS, LAURA C | POLICE OFFICER | POLICE | F | Salary | NaN | 72510.0 | NaN | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
33583 | ZYMANTAS, MARK E | POLICE OFFICER | POLICE | F | Salary | NaN | 90024.0 | NaN | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
33584 | ZYRKOWSKI, CARLO E | POLICE OFFICER | POLICE | F | Salary | NaN | 93354.0 | NaN | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
33585 | ZYSKOWSKI, DARIUSZ | Other | Other | F | Salary | NaN | 119412.0 | NaN | 1 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
33586 rows × 62 columns
Now we can create the DMatrix
for XGBoost.
vars = one_hot1.columns.tolist() + \
one_hot2.columns.tolist() + \
['Full.Time', 'Salaried', 'Female']
dtrain = xgb.DMatrix(df[vars], label=df['Salary'], feature_names=vars)
The next line defines the parameters we will use, which are largely defaults. For a research project, these parameters should be tuned as we learned in Session 3.
param = {
    'booster': 'gbtree',             # default -- tree based
    'nthread': 8,                    # number of threads to use for parallel processing
    'objective': 'reg:squarederror', # squared-error loss
    'eval_metric': 'rmse',           # evaluate with RMSE
    'eta': 0.3,                      # shrinkage; [0, 1], default 0.3
    'max_depth': 6,                  # maximum depth of each tree; default 6
    'gamma': 0,                      # set above 0 to prune trees, [0, inf], default 0
    'min_child_weight': 1,           # higher leads to more pruning of trees, [0, inf], default 1
    'subsample': 1,                  # randomly subsample rows if in (0, 1), default 1
}
num_round = 30
And lastly, we can run our XGBoost model.
model_xgb = xgb.train(param, dtrain, num_round)
First, we will use the TreeExplainer function from SHAP to analyze our model. We will then pass it our training data in order to create individual SHAP values for each row of data.
explainer = shap.TreeExplainer(model_xgb)
shap_values = explainer(df[vars])
For some visualizations, the SHAP package struggles to plot more than 1,000 observations. In these cases, we will use a random sample of 1% of the original data.
df_small = df.sample(frac=0.01)
shap_values_small = explainer(df_small[vars])
For illustrative examples, or for understanding how XGBoost arrived at a particular prediction, a waterfall chart is useful. This explains the overall impact of the most impactful model features for a particular observation.
shap.plots.waterfall(shap_values[0])
shap.plots.waterfall(shap_values[2])
shap.plots.waterfall(shap_values[3])
Another way to present the waterfall is via a force plot. This presents the outcome of a given data point in a more concise manner.
shap.initjs()
shap.plots.force(shap_values[1])
Because force plots are more concise, we can get a representation of a force plot for all data points in a data frame on one chart. This visualization automatically clusters data points by what contributes to their SHAP values, so as to keep it readable. You can also interact with the chart labels to adjust the chart.
shap.initjs()
N=300
shap.plots.force(explainer.expected_value, shap_values.sample(N).values, feature_names=vars)
A decision plot accomplishes a very similar end goal as the aggregate force plot, but enables you to show more variables' relationships clearly.
It can be particularly insightful to pick a set of data points for a decision plot that are representative of a particular condition you are interested in studying. For instance, below I have separated the charts on the variable Female.
shap.decision_plot(explainer.expected_value,
explainer.shap_values(df_small[df_small.Female==1][vars]),
feature_names=vars)
shap.decision_plot(explainer.expected_value,
explainer.shap_values(df_small[df_small.Female==0][vars]),
feature_names=vars)
If you are interested in just one particular variable, you can create a scatterplot of the impact of the variable on the outcome. Again, because XGBoost is nonparametric, the effect of a variable can be conditional on all others in the data. As such, the impact is not always the same across observations, and may even flip sign for the same value of the variable.
shap.plots.scatter(shap_values[:,"Female"], color=shap_values)
If you would like a scatterplot like the above, but for multiple variables at once, the bee swarm plot can accomplish this.
shap.plots.beeswarm(shap_values)
Lastly, to get an overall sense of the aggregate impact (or importance) of a variable, we can plot the average absolute SHAP value for each variable. This is akin to an XGBoost importance plot. For machine learning models that lack an internal importance score, this is particularly useful.
shap.plots.bar(shap_values)
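What the bar plot computes can also be reproduced by hand: take the mean of the absolute SHAP values down each column. A sketch with a made-up SHAP matrix (rows = observations, columns = features):

```python
import numpy as np

# Invented SHAP matrix for illustration only.
sv = np.array([[ 0.5, -2.0, 0.1],
               [-0.5,  1.0, 0.2],
               [ 0.0, -3.0, 0.3]])
feature_names = ['Full.Time', 'Salaried', 'Female']

# Mean absolute SHAP value per feature -- the bar plot's ranking metric.
importance = np.abs(sv).mean(axis=0)
for name, imp in sorted(zip(feature_names, importance), key=lambda t: -t[1]):
    print(name, round(imp, 3))
# Salaried ranks first (mean |SHAP| = 2.0), then Full.Time, then Female.
```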