Interpretable or Accurate? Why not both?

Case Study: Predicting Employee Attrition Using Machine Learning

This notebook contains the code for the accompanying blog post, "Interpretable or Accurate? Why not both?"

Installation

Interpret is supported across Windows, Mac and Linux on Python 3.5+. Please refer to the documentation for more details.

pip

pip install interpret

conda

conda install -c interpretml interpret

source

git clone https://github.com/interpretml/interpret.git && cd interpret/scripts && make install

Importing necessary libraries

In [1]:
import pandas as pd
import numpy as np


from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score

from interpret import show
from interpret import set_visualize_provider
from interpret.provider import InlineProvider
from interpret.data import ClassHistogram
set_visualize_provider(InlineProvider())
from interpret.glassbox import (
    LogisticRegression,
    ClassificationTree,
    ExplainableBoostingClassifier,
)


seed = 42

Importing the Dataset

In [2]:
df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
df.head()
Out[2]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 ... 1 80 0 8 0 1 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 2 ... 4 80 1 10 3 3 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 4 ... 2 80 0 7 3 3 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 5 ... 3 80 0 8 3 3 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 7 ... 4 80 1 6 3 3 2 2 2 2

5 rows × 35 columns

In [3]:
# Encode the target variable, i.e. Attrition

target_map = {'Yes': 1, 'No': 0}
target = df["Attrition"].apply(lambda x: target_map[x])
print(target[:10])
0    1
1    0
2    1
3    0
4    0
5    0
6    0
7    0
8    0
9    0
Name: Attrition, dtype: int64
In [4]:
# Drop columns that are not useful for the predictions, plus the target column itself

df.drop(['EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours','Attrition'], axis="columns", inplace=True)
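One way to spot columns like the ones dropped above is to look for features with a single distinct value, or with one distinct value per row. The sketch below runs on a small made-up frame (the values are illustrative, not taken from the dataset) and uses only standard pandas calls. The "one value per row" rule is a heuristic: on real data, continuous features can also be unique per row, so the result needs a manual look.

```python
import pandas as pd

# Made-up toy frame that mimics the kinds of columns dropped above.
toy = pd.DataFrame({
    "Age": [41, 49, 41, 33],
    "EmployeeCount": [1, 1, 1, 1],      # constant: carries no signal
    "StandardHours": [80, 80, 80, 80],  # constant: carries no signal
    "EmployeeNumber": [1, 2, 4, 5],     # unique per row: looks like an identifier
})

n_unique = toy.nunique()

# Columns with one distinct value cannot help separate the classes.
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns with as many distinct values as rows are candidate identifiers.
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print("constant:", constant_cols)
print("id-like:", id_like_cols)
```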
In [5]:
# Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(df, 
                                                    target, 
                                                    test_size=0.2,
                                                    random_state=seed,
                                                    stratify=target)
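
Passing `stratify=target` asks `train_test_split` to keep the class proportions roughly equal in both splits, which matters when one class is much rarer than the other. A minimal sketch of that effect, on synthetic labels with an assumed 20% positive rate (the names `X_demo` and `y_demo` are illustrative and not part of the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (20% positives); not the real Attrition data.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"feature": rng.normal(size=500)})
y_demo = pd.Series([1] * 100 + [0] * 400, name="label")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, the positive rate should be nearly identical in both splits.
print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```

The same check can be run on `y_train` and `y_test` from the cell above to confirm the split behaved as intended.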

Exploring the Dataset with histogram visualizations

In [6]:
hist = ClassHistogram().explain_data(X_train, y_train, name='Train Data')
show(hist)
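
`show(hist)` renders an interactive widget, so nothing appears in a static export. As a rough text analogue of a per-class histogram, the sketch below bins one feature and counts rows per class with plain pandas. It is only an approximation of the idea, not what interpret computes internally, and the ages, labels and bin edges are made up for illustration.

```python
import pandas as pd

# Made-up ages and labels, for illustration only.
ages = pd.Series([22, 25, 31, 38, 41, 45, 52, 58], name="Age")
labels = pd.Series([1, 1, 0, 0, 1, 0, 0, 0], name="Attrition")

# Bin the feature, then count rows per (bin, class) pair.
age_bins = pd.cut(ages, bins=[20, 30, 40, 50, 60])
counts = pd.crosstab(age_bins, labels)

print(counts)
```

On the real data, the same pattern would apply to `X_train["Age"]` and `y_train`, with bin edges chosen to suit the feature.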