The purpose of this notebook is to take a deep dive into methods for explaining black box models. Plenty of machine learning models are implemented and shared online, but little attention is given to explaining those models to users and stakeholders. In industry, more attention needs to be paid to the output, as users often want to know the reason behind a specific prediction. For example, if you were to predict my salary based on features such as job title, work experience, location, etc., then I would like to know how those features contributed to the final result. Is work experience more important than the job I apply for? Would you see the same relationship if you made the prediction for someone else? In other words, which features are important in general, and which are important specifically to my prediction?
Several methods will be discussed in detail, with a focus on Partial Dependence Plots and SHAP values, as those are commonly used and (relatively) simple to implement in a business setting.
2.1 [Load Data](#load)
2.2 [NaN Values (i.e., " ?")](#nan)
2.3 [Preprocessing Steps](#preprocessing)
3.2 Target Variable
4.1 Accuracy
4.2 F1-score
Partial Dependence Plots (PDP)
5.1 Assumptions
# Data Handling
import pandas as pd
import numpy as np
# Interpretable ML
import shap
from pdpbox import pdp, get_dataset, info_plots
from lime.lime_tabular import LimeTabularExplainer
# Models
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
# Visualization
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
def load_data(path):
    # Load in data with explicit column names
    column_names = ['Age', 'Workclass', 'fnlwgt', 'Education', 'Education_num',
                    'Marital_status', 'Occupation', 'Relationship', 'Race', 'Gender',
                    'Capital_gain', 'Capital_loss', 'Hours/Week', 'Native_country',
                    'Income_bracket']
    raw_df = pd.read_csv(path, header=None, names=column_names)
    return raw_df
def preprocess_data(raw_df):
    column_names = raw_df.columns
    df = raw_df.copy()
    # Remove NaN (i.e., ' ?') values
    for i in df.columns:
        if ' ?' in df[i].unique():
            df[i].replace(' ?', np.nan, inplace=True)
    df.dropna(inplace=True)
    # Merge primary school values
    primary = [' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' 10th', ' 11th', ' 12th']
    df['Education'] = df.apply(lambda row: "Primary" if row.Education in primary else row.Education, 1)
    # Get one-hot encoding of non-float/int columns
    # Note: encode the cleaned df (not raw_df), otherwise the dummies would still
    # contain the ' ?' categories and the unmerged education levels
    one_hot_encoded_columns = pd.get_dummies(df[column_names[:-1]])
    df = df.drop(column_names[:-1], axis=1)
    df = df.join(one_hot_encoded_columns)
    # Label income bracket and remove fnlwgt column
    df.Income_bracket = df.Income_bracket.map({' <=50K': 0, ' >50K': 1})
    df = df.drop('fnlwgt', axis=1)
    # Rename occupation columns
    columns_to_change = [i for i in df.columns if 'Occupation' in i]
    columns_to_change_into = ["O_" + i.split("_")[1] for i in df.columns if 'Occupation' in i]
    columns_dict = {i: j for i, j in zip(columns_to_change, columns_to_change_into)}
    df.rename(columns_dict, axis=1, inplace=True)
    return df
I do some data preprocessing to make sure everything is in the right format for making predictions.
The raw data is loaded with the correct column names.
raw_df = load_data("data.csv")
However, some columns contain a " ?" value, which actually represents missing data. Thus, we need to replace those with NaNs and then drop the affected rows.
for i in raw_df.columns:
    if ' ?' in raw_df[i].unique():
        print(i)
Workclass
Occupation
Native_country
raw_df.replace(" ?", np.nan, inplace=True)
print("{} missing values in the data.".format(sum(raw_df.isnull().sum())))
4262 missing values in the data.
df = preprocess_data(raw_df)
# df.drop(["Capital_gain", 'Capital_loss', 'Marital_status_ Married-civ-spouse'], 1, inplace=True)
Note that there are many categorical variables, each with many categories.
I typically like to use one-hot encoding to make sure the data can be read correctly without assuming some sort of distance between categories. Fortunately, this is a small dataset, which allows for the creation of many features.
If the data were bigger, then I would cluster/chunk some categorical values together, as categories that occur infrequently are not likely to carry much predictive power.
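As a side note, such chunking of rare categories can be done in a few lines. The sketch below is a hypothetical helper (the function name and threshold are my own, not part of this notebook's pipeline) that collapses infrequent categories into a single "Other" bucket before one-hot encoding:

```python
import pandas as pd

def group_rare_categories(series, min_count=100):
    # Replace categories occurring fewer than min_count times with "Other"
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), "Other")

# Toy example: two frequent countries, two rare ones
s = pd.Series(["US"] * 500 + ["Mexico"] * 50 + ["Laos"] * 3 + ["Peru"] * 2)
grouped = group_rare_categories(s, min_count=10)
print(grouped.value_counts())  # US 500, Mexico 50, Other 5
```

Applied to a column like Native_country, this would shrink its 42 categories down to the handful that occur often enough to matter.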
raw_df.head()
Age | Workclass | fnlwgt | Education | Education_num | Marital_status | Occupation | Relationship | Race | Gender | Capital_gain | Capital_loss | Hours/Week | Native_country | Income_bracket | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
for column in ['Workclass', 'Education', 'Marital_status', 'Occupation', 'Relationship', 'Race', 'Gender', 'Native_country']:
    print("{} unique categories in {}".format(len(raw_df[column].unique()), column))

print("\n{} additional columns are created".format(len(df.columns)-len(raw_df.columns)))
9 unique categories in Workclass
16 unique categories in Education
7 unique categories in Marital_status
15 unique categories in Occupation
6 unique categories in Relationship
5 unique categories in Race
2 unique categories in Gender
42 unique categories in Native_country

87 additional columns are created
df.head()
Income_bracket | Age | Education_num | Hours/Week | Workclass_ Federal-gov | Workclass_ Local-gov | Workclass_ Never-worked | Workclass_ Private | Workclass_ Self-emp-inc | Workclass_ Self-emp-not-inc | ... | Native_country_ Portugal | Native_country_ Puerto-Rico | Native_country_ Scotland | Native_country_ South | Native_country_ Taiwan | Native_country_ Thailand | Native_country_ Trinadad&Tobago | Native_country_ United-States | Native_country_ Vietnam | Native_country_ Yugoslavia | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 39 | 13 | 40 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 0 | 50 | 13 | 13 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 0 | 38 | 9 | 40 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 0 | 53 | 7 | 40 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 0 | 28 | 13 | 40 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 102 columns
df.Income_bracket.value_counts()
0    22654
1     7508
Name: Income_bracket, dtype: int64
There is some imbalance with respect to the target variable. Thus, we need to be careful interpreting a simple accuracy measure, as it does not take the imbalance into account. Instead, we can use the F1-score or even balanced accuracy (macro-averaged recall).
# The join moved the target column (Income_bracket) to the beginning
X = df[df.columns[1:]]
y = df[df.columns[0]]
# Train model
clf = LGBMClassifier(random_state=0, n_estimators=100)
fitted_clf = clf.fit(X, y)
scores = cross_val_score(clf, X, y, cv=10); np.mean(scores)
0.8701683441984965
scores = cross_val_score(clf, X, y, cv=10, scoring="f1"); np.mean(scores)
0.717601583598631
Balanced accuracy weights the classes so that each has equal impact on the score.
You can see it as the average of the recall obtained on each class.
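The claim that balanced accuracy equals the macro-average of per-class recall is easy to verify on a toy example (hypothetical labels, unrelated to the census data). A degenerate classifier that always predicts the majority class scores high on plain accuracy but only 0.5 on balanced accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# 90/10 imbalanced labels; predictions are always the majority class 0
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)                       # 0.9
bal = balanced_accuracy_score(y_true, y_pred)              # 0.5
macro_recall = recall_score(y_true, y_pred, average="macro")

print(acc, bal, macro_recall)
assert np.isclose(bal, macro_recall)  # balanced accuracy == macro-averaged recall
```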
scores = cross_val_score(clf, X, y, cv=10, scoring="balanced_accuracy"); np.mean(scores)
0.8008365156368829
You can clearly see the differences between the scoring measures.
It is important to spend sufficient time on validating the performance of your model.
You do not want to be surprised by poor results after you put the model into production (as an API, Flask application, Docker container, etc.).
Partial Dependence Plots (PDP) show the effect a feature has on the outcome of a predictive model.
They marginalize the model output over the distribution of the other features in order to extract the effect of the feature of interest.
This calculation rests on an important assumption, namely that the feature of interest is not correlated with the other features. If it is, the PDP will average over data points that are in reality highly unlikely. For example, weight and height are correlated, but the PDP might show the effect of a large weight combined with a very small height on the target, even though that combination barely occurs. This can be partially diagnosed by showing a rug at the bottom of your PDP.
Back to Table of Contents
Below, the correlation matrix between features is shown to give an indication of whether the independence assumption is violated. From this one can conclude that there seems to be no violation, seeing as the features are not highly correlated.
# calculate the correlation matrix
corr = raw_df.corr()
# plot the heatmap
sns.heatmap(corr,
xticklabels=corr.columns,
yticklabels=corr.columns)
However, the results might differ when we look into the one-hot encoded features. Since we isolate certain characteristics of a single feature by encoding it, new relationships might be discovered. Therefore, it would be worthwhile to at least check the correlations between encoded features.
# calculate the correlation matrix
corr = df.corr()
# plot the heatmap
sns.heatmap(corr,
xticklabels=corr.columns,
yticklabels=corr.columns)
Clearly, showing the correlation matrix between this many features does not work when only a few of them are correlated. Instead, I simply extract the feature pairs with the highest absolute correlation by unstacking the correlation matrix and sorting it.
c = df.corr().abs()
s = c.unstack()
so = s.sort_values(kind="quicksort")
df_corr = pd.DataFrame(so).reset_index().dropna()
df_corr.columns = ['feature1', 'feature2', 'r']
df_corr = df_corr[df_corr.r < 1].sort_values('r', ascending=False)
df_corr.head(10)
feature1 | feature2 | r | |
---|---|---|---|
10709 | Marital_status_ Married-civ-spouse | Relationship_ Husband | 0.896502 |
10708 | Relationship_ Husband | Marital_status_ Married-civ-spouse | 0.896502 |
10707 | Race_ White | Race_ Black | 0.794808 |
10706 | Race_ Black | Race_ White | 0.794808 |
10705 | Marital_status_ Never-married | Marital_status_ Married-civ-spouse | 0.644862 |
10704 | Marital_status_ Married-civ-spouse | Marital_status_ Never-married | 0.644862 |
10702 | Gender_ Male | Relationship_ Husband | 0.581221 |
10700 | Relationship_ Husband | Gender_ Female | 0.581221 |
10701 | Relationship_ Husband | Gender_ Male | 0.581221 |
10703 | Gender_ Female | Relationship_ Husband | 0.581221 |
We can see some obviously correlated features, such as Race_White and Race_Black. One-hot encoding almost by definition creates correlated features, since each categorical variable takes exactly one value per row. Thus, if your Race isn't White, you are by construction more likely to be Black.
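This "by construction" correlation is easy to demonstrate: for a two-category variable, the dummy columns are each other's complement and therefore perfectly negatively correlated.

```python
import pandas as pd

# One-hot encode a binary categorical variable
s = pd.Series(["White", "Black", "White", "Black", "White"])
dummies = pd.get_dummies(s).astype(int)

# The two indicator columns are complements: one is 1 exactly when the other is 0
corr = dummies.corr().loc["White", "Black"]
print(corr)  # -1.0
```

With more than two categories the pairwise correlations are weaker than -1, but still systematically negative for the same reason.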
The PDP for the feature "Age" shows that until the age of 50 the chance of earning more increases with a person's age. However, after the age of 50 we see this trend reverse, meaning that age then has a negative effect on the likelihood of earning more.
pdp_fare = pdp.pdp_isolate(
model=clf, dataset=df[df.columns[1:]], model_features=df.columns[1:], feature='Age'
)
fig, axes = pdp.pdp_plot(pdp_fare, 'Age', plot_pts_dist=True)
pdp_fare = pdp.pdp_isolate(
model=clf, dataset=df[df.columns[1:]], model_features=df.columns[1:], feature='Capital_gain'
)
fig, axes = pdp.pdp_plot(pdp_fare, 'Capital_gain', plot_pts_dist=True)
The next step is to check what happens when we look at categorical values that were one-hot encoded. To demonstrate this effect, we take the features that represent "Relationship" and plot those in the PDP. The results show that especially when somebody is in the Other-relative relationship, there is a decreased chance of earning more. You could do the same for other encoded variables, such as Occupation.
pdp_relationship = pdp.pdp_isolate(
model=clf, dataset=df[df.columns[1:]], model_features=df.columns[1:],
feature=[i for i in df.columns if 'Relationship' in i]
)
fig, axes = pdp.pdp_plot(pdp_relationship, 'Relationship', center=True, plot_lines=True, frac_to_plot=100, plot_pts_dist=True,
                         plot_params={'xticks_rotation': 90})
pdp_race = pdp.pdp_isolate(
model=clf, dataset=df[df.columns[1:]], model_features=df.columns[1:],
feature=[i for i in df.columns if 'O_' in i if i not in ['O_ Adm-clerical', 'O_ Armed-Forces', 'O_ Armed-Forces',
'O_ Protective-serv', 'O_ Sales', 'O_ Handlers-cleaners']]
)
fig, axes = pdp.pdp_plot(pdp_race, 'Occupation', center=True, plot_lines=True, frac_to_plot=100, plot_pts_dist=True,
                         plot_params={'xticks_rotation': 90})
Lastly, I decided to show the interaction between Age and Hours/Week. As you can see, younger people are more likely to make less money, and with age comes a higher chance of making more. However, Age does seem to interact with Hours/Week, seeing as there is a "sweet spot" when it comes to an increased chance of earning more than 50K: namely, when Age is around 49 and the person works between 48 and 52 Hours/Week. Finally, the model seems to have learned that any age over (roughly) 65 results in basically the same prediction. In practice, there might be some nuance to that, since the plot is based on quantiles of the features and there are likely to be few people over 65 in the dataset.
print("{}% of the data is of people over the age of 65".format(round(len(df[df.Age>65])/len(df)*100)))
3% of the data is of people over the age of 65
inter1 = pdp.pdp_interact(
model=clf, dataset=df[df.columns[1:]], model_features=df.columns[1:], features=['Age', 'Hours/Week']
)
fig, axes = pdp.pdp_interact_plot(
pdp_interact_out=inter1, feature_names=['Age', 'Hours/Week'], plot_type='grid', x_quantile=True, plot_pdp=True
)
plt.tight_layout()
plt.savefig("interaction.png", dpi=300)
LIME steps away from deriving global feature importance and instead approximates the importance of features for a local prediction. It does so by taking the row (or set of data points) to explain and generating fake data based on that row. It then calculates the similarity between the fake data and the real data and approximates the effect of the changes. Some fake rows are very different from the initial row, so their feature importances should not be weighted strongly. Fake rows with only slight changes are weighted more, since they better represent the initial row. In short: create a bunch of fake data, throw it into the classifier, and see how much the prediction changes, weighted by the similarity between rows.
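The steps above can be sketched in a few lines. This is a rough illustration of the idea, not the library's actual implementation: perturb the instance, weight the perturbations by proximity with an exponential kernel, and fit a weighted linear surrogate whose coefficients act as local feature effects.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)

def black_box(X):
    # Stand-in for any classifier's predicted probability
    return 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))

x = np.array([1.0, 2.0])                            # instance to explain
Z = x + rng.normal(0, 0.5, size=(500, 2))           # fake data around the instance
dist = np.linalg.norm(Z - x, axis=1)
weights = np.exp(-(dist ** 2) / (2 * 0.75 ** 2))    # kernel width is a hyperparameter

# Weighted linear surrogate: coefficients approximate the local feature effects
surrogate = Ridge(alpha=1.0).fit(Z, black_box(Z), sample_weight=weights)
print(surrogate.coef_)  # positive effect for feature 0, negative for feature 1
```

The kernel width (0.75 here) controls how local the explanation is, which connects to the disadvantage discussed below.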
Reference
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144). ACM.
See https://arxiv.org/abs/1602.04938 for the full article
explainer = LimeTabularExplainer(X.values, feature_names=X.columns, class_names=["<=50K", ">50K"], discretize_continuous=True,
kernel_width=5)
In the visualization below you can see the effect of the top 5 features on the predicted probabilities of the target variable. As you can see, Capital Gain has a large influence on whether somebody makes less money, which makes sense, seeing as it is directly related to one's investments.
i = 12000
exp = explainer.explain_instance(X.values[i], clf.predict_proba, num_features=5)
exp.show_in_notebook(show_table=True, show_all=False)
print("True value: {}".format(y[i]))
True value: 0
Values when predicting >50K
i = np.random.randint(0, X.shape[0])
i = 8
exp = explainer.explain_instance(X.values[i], clf.predict_proba, num_features=5)
exp.show_in_notebook(show_table=True, show_all=False)
print("True value: {}".format(y[i]))
True value: 1
i = 304
exp = explainer.explain_instance(X.values[i], clf.predict_proba, num_features=5)
exp.show_in_notebook(show_table=True, show_all=False)
print("True value: {}".format(y[i]))
True value: 0
Disadvantage
The neighborhood within which LIME generates different values for the initial row is, to an extent, a hyperparameter that can be optimized. At times you want a larger neighborhood, depending on the data. It is a bit of trial and error to find the right kernel width.
Instability of Explanations (Repeating Predictions)
# Create object that can calculate shap values
explainer = shap.TreeExplainer(clf)
# Calculate Shap values
shap_values = explainer.shap_values(X)
shap.initjs()
i = 12
shap.force_plot(explainer.expected_value, shap_values[i,:], X.iloc[i,:])
shap.dependence_plot("Age", shap_values, X)
shap.summary_plot(shap_values, X, plot_type="bar", show=False)
plt.tight_layout()
plt.savefig('test.png', dpi=300)
Additivity of SHAP values (categorical variables)
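SHAP values are additive: the baseline expectation plus the per-feature values reconstructs the model output exactly, which is what justifies summing the values of one-hot encoded columns back into their original categorical feature, as done below. To illustrate the property, here is a from-scratch brute-force computation of exact Shapley values on a tiny toy model (this is emphatically not how TreeExplainer works internally; it uses a fast tree-specific algorithm):

```python
import numpy as np
from itertools import combinations
from math import factorial

X_bg = np.array([[1.0, 2.0], [3.0, 0.0], [0.0, 1.0], [2.0, 2.0]])  # background data

def f(X):
    return 3 * X[:, 0] + X[:, 1] ** 2   # toy model

def value(S, x):
    # E[f] with the features in S fixed to x, others drawn from the background
    Xm = X_bg.copy()
    for j in S:
        Xm[:, j] = x[j]
    return f(Xm).mean()

def shapley(x, n=2):
    phi = np.zeros(n)
    for j in range(n):
        others = [k for k in range(n) if k != j]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[j] += w * (value(S + (j,), x) - value(S, x))
    return phi

x = np.array([2.0, 3.0])
phi = shapley(x)
baseline = f(X_bg).mean()
print(baseline + phi.sum(), f(x[None])[0])  # the two numbers match exactly
```

Because the values of individual features add up to the prediction, summing the mean absolute SHAP values of all dummy columns belonging to one categorical variable gives a sensible importance for that variable as a whole.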
summary_df = pd.DataFrame([X.columns, abs(shap_values).mean(axis=0)]).T
summary_df.columns = ['Feature', 'mean_SHAP']
mapping = {}
for feature in summary_df.Feature.values:
    mapping[feature] = feature
    for prefix, alternative in zip(['Workclass', 'Education', 'Marital_status', 'O_', 'Relationship', 'Gender', 'Native_country', 'Capital', 'Race'],
                                   ['Workclass', 'Education', 'Marital_status', 'Occupation', 'Relationship', 'Gender', 'Country', 'Capital', 'Race']):
        if feature.startswith(prefix):
            mapping[feature] = alternative
            break
summary_df['Feature'] = summary_df.Feature.map(mapping)
shap_df = summary_df.groupby('Feature').sum().sort_values("mean_SHAP", ascending=False).reset_index()
shap_df
Feature | mean_SHAP | |
---|---|---|
0 | Marital_status | 1.245834 |
1 | Age | 0.775167 |
2 | Capital | 0.720160 |
3 | Occupation | 0.502213 |
4 | Education | 0.484086 |
5 | Hours/Week | 0.315292 |
6 | Relationship | 0.231380 |
7 | Gender | 0.160617 |
8 | Workclass | 0.096816 |
9 | Country | 0.047439 |
10 | Race | 0.034600 |
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")
# Initialize the matplotlib figure
f, ax = plt.subplots(figsize=(10, 5))
sns.barplot(x="mean_SHAP", y="Feature", data=shap_df[:5],
label="Total", color="#008BFB")
plt.tight_layout()
plt.savefig('summary_shap.png', dpi=300)
from lightgbm import LGBMRegressor
# This time we predict Age (column 1) from the remaining features
X = df[df.columns[2:]]
y = df[df.columns[1]]
# Train model
clf = LGBMRegressor(random_state=0, n_estimators=100)
fitted_clf = clf.fit(X, y)
# Create object that can calculate shap values
explainer = shap.TreeExplainer(clf)
# Calculate Shap values
shap_values = explainer.shap_values(X)
shap.initjs()
i = 200
shap.force_plot(explainer.expected_value, shap_values[i,:], X.iloc[i,:])