Author: Pisarev Ivan, ODS Slack: pisarev_i

Predict attrition of employees

1. Feature and data explanation

People are definitely a company's greatest asset.
It doesn't make any difference whether the product is cars or cosmetics.
A company is only as good as the people it keeps.

Mary Kay Ash

There is no doubt that human capital is the key intangible asset of any organization. In today's dynamic, continuously changing business world, it is human assets, not fixed or tangible ones, that differentiate an organization from its competitors. In the knowledge economy, Human Resources (HR), or human assets, are the single most important factor distinguishing one organization from another.

Employees who leave an organization can be replaced physically; however, their skill sets and knowledge cannot be replaced exactly, since each individual possesses a different combination of skills and experience. Employee efficiency and talent determine the pace and growth of an organization.

There are two important business questions:

  • Uncover the factors that lead to employee attrition
  • Predict which valuable employees are likely to leave

To answer these questions, we will analyze the IBM HR Analytics Employee Attrition & Performance dataset.

This is a fictional data set created by IBM data scientists.
List of columns with their types:

  • Age - Numeric Discrete
  • Attrition - Categorical
  • BusinessTravel - Categorical
  • DailyRate - Numeric Discrete
  • Department - Categorical
  • DistanceFromHome - Numeric Discrete
  • Education - Categorical (1: 'Below College', 2: 'College', 3: 'Bachelor', 4: 'Master', 5: 'Doctor')
  • EducationField - Categorical
  • EmployeeCount - Numeric Discrete
  • EmployeeNumber - Numeric Discrete
  • EnvironmentSatisfaction - Categorical (1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High')
  • Gender - Categorical
  • HourlyRate - Numeric Discrete
  • JobInvolvement - Categorical (1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High')
  • JobLevel - Categorical
  • JobRole - Categorical
  • JobSatisfaction - Categorical (1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High')
  • MaritalStatus - Categorical
  • MonthlyIncome - Numeric Discrete
  • MonthlyRate - Numeric Discrete
  • NumCompaniesWorked - Numeric Discrete
  • Over18 - Categorical
  • OverTime - Categorical
  • PercentSalaryHike - Numeric Discrete
  • PerformanceRating - Categorical (1: 'Low', 2: 'Good', 3: 'Excellent', 4: 'Outstanding')
  • RelationshipSatisfaction - Categorical (1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High')
  • StandardHours - Numeric Discrete
  • StockOptionLevel - Categorical
  • TotalWorkingYears - Numeric Discrete
  • TrainingTimesLastYear - Numeric Discrete
  • WorkLifeBalance - Categorical (1: 'Bad', 2: 'Good', 3: 'Better', 4: 'Best')
  • YearsAtCompany - Numeric Discrete
  • YearsInCurrentRole - Numeric Discrete
  • YearsSinceLastPromotion - Numeric Discrete
  • YearsWithCurrManager - Numeric Discrete

The target feature Attrition has two possible values, 'Yes' and 'No', so our task is binary classification.
It is also important to understand the significance of the individual features.

2. Primary data analysis

In [2]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
In [3]:
color = sns.color_palette('tab20')
plt.style.use('seaborn-whitegrid')
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10,8)
sns.palplot(color)

Let's load the data, look at the first rows, and check types and missing values.

In [278]:
dfIBM = pd.read_csv('./data/WA_Fn-UseC_-HR-Employee-Attrition.csv')
In [279]:
dfIBM.head().T
Out[279]:
0 1 2 3 4
Age 41 49 37 33 27
Attrition Yes No Yes No No
BusinessTravel Travel_Rarely Travel_Frequently Travel_Rarely Travel_Frequently Travel_Rarely
DailyRate 1102 279 1373 1392 591
Department Sales Research & Development Research & Development Research & Development Research & Development
DistanceFromHome 1 8 2 3 2
Education 2 1 2 4 1
EducationField Life Sciences Life Sciences Other Life Sciences Medical
EmployeeCount 1 1 1 1 1
EmployeeNumber 1 2 4 5 7
EnvironmentSatisfaction 2 3 4 4 1
Gender Female Male Male Female Male
HourlyRate 94 61 92 56 40
JobInvolvement 3 2 2 3 3
JobLevel 2 2 1 1 1
JobRole Sales Executive Research Scientist Laboratory Technician Research Scientist Laboratory Technician
JobSatisfaction 4 2 3 3 2
MaritalStatus Single Married Single Married Married
MonthlyIncome 5993 5130 2090 2909 3468
MonthlyRate 19479 24907 2396 23159 16632
NumCompaniesWorked 8 1 6 1 9
Over18 Y Y Y Y Y
OverTime Yes No Yes Yes No
PercentSalaryHike 11 23 15 11 12
PerformanceRating 3 4 3 3 3
RelationshipSatisfaction 1 4 2 3 4
StandardHours 80 80 80 80 80
StockOptionLevel 0 1 0 0 1
TotalWorkingYears 8 10 7 8 6
TrainingTimesLastYear 0 3 3 3 3
WorkLifeBalance 1 3 3 3 3
YearsAtCompany 6 10 0 8 2
YearsInCurrentRole 4 7 0 7 2
YearsSinceLastPromotion 0 1 0 3 2
YearsWithCurrManager 5 7 0 0 2
In [280]:
dfIBM.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
Age                         1470 non-null int64
Attrition                   1470 non-null object
BusinessTravel              1470 non-null object
DailyRate                   1470 non-null int64
Department                  1470 non-null object
DistanceFromHome            1470 non-null int64
Education                   1470 non-null int64
EducationField              1470 non-null object
EmployeeCount               1470 non-null int64
EmployeeNumber              1470 non-null int64
EnvironmentSatisfaction     1470 non-null int64
Gender                      1470 non-null object
HourlyRate                  1470 non-null int64
JobInvolvement              1470 non-null int64
JobLevel                    1470 non-null int64
JobRole                     1470 non-null object
JobSatisfaction             1470 non-null int64
MaritalStatus               1470 non-null object
MonthlyIncome               1470 non-null int64
MonthlyRate                 1470 non-null int64
NumCompaniesWorked          1470 non-null int64
Over18                      1470 non-null object
OverTime                    1470 non-null object
PercentSalaryHike           1470 non-null int64
PerformanceRating           1470 non-null int64
RelationshipSatisfaction    1470 non-null int64
StandardHours               1470 non-null int64
StockOptionLevel            1470 non-null int64
TotalWorkingYears           1470 non-null int64
TrainingTimesLastYear       1470 non-null int64
WorkLifeBalance             1470 non-null int64
YearsAtCompany              1470 non-null int64
YearsInCurrentRole          1470 non-null int64
YearsSinceLastPromotion     1470 non-null int64
YearsWithCurrManager        1470 non-null int64
dtypes: int64(26), object(9)
memory usage: 402.0+ KB

There are no missing values in the data.
Let's check the distribution of feature values.

In [281]:
dfIBM.describe(include=['int64']).T
Out[281]:
count mean std min 25% 50% 75% max
Age 1470.0 36.923810 9.135373 18.0 30.00 36.0 43.00 60.0
DailyRate 1470.0 802.485714 403.509100 102.0 465.00 802.0 1157.00 1499.0
DistanceFromHome 1470.0 9.192517 8.106864 1.0 2.00 7.0 14.00 29.0
Education 1470.0 2.912925 1.024165 1.0 2.00 3.0 4.00 5.0
EmployeeCount 1470.0 1.000000 0.000000 1.0 1.00 1.0 1.00 1.0
EmployeeNumber 1470.0 1024.865306 602.024335 1.0 491.25 1020.5 1555.75 2068.0
EnvironmentSatisfaction 1470.0 2.721769 1.093082 1.0 2.00 3.0 4.00 4.0
HourlyRate 1470.0 65.891156 20.329428 30.0 48.00 66.0 83.75 100.0
JobInvolvement 1470.0 2.729932 0.711561 1.0 2.00 3.0 3.00 4.0
JobLevel 1470.0 2.063946 1.106940 1.0 1.00 2.0 3.00 5.0
JobSatisfaction 1470.0 2.728571 1.102846 1.0 2.00 3.0 4.00 4.0
MonthlyIncome 1470.0 6502.931293 4707.956783 1009.0 2911.00 4919.0 8379.00 19999.0
MonthlyRate 1470.0 14313.103401 7117.786044 2094.0 8047.00 14235.5 20461.50 26999.0
NumCompaniesWorked 1470.0 2.693197 2.498009 0.0 1.00 2.0 4.00 9.0
PercentSalaryHike 1470.0 15.209524 3.659938 11.0 12.00 14.0 18.00 25.0
PerformanceRating 1470.0 3.153741 0.360824 3.0 3.00 3.0 3.00 4.0
RelationshipSatisfaction 1470.0 2.712245 1.081209 1.0 2.00 3.0 4.00 4.0
StandardHours 1470.0 80.000000 0.000000 80.0 80.00 80.0 80.00 80.0
StockOptionLevel 1470.0 0.793878 0.852077 0.0 0.00 1.0 1.00 3.0
TotalWorkingYears 1470.0 11.279592 7.780782 0.0 6.00 10.0 15.00 40.0
TrainingTimesLastYear 1470.0 2.799320 1.289271 0.0 2.00 3.0 3.00 6.0
WorkLifeBalance 1470.0 2.761224 0.706476 1.0 2.00 3.0 3.00 4.0
YearsAtCompany 1470.0 7.008163 6.126525 0.0 3.00 5.0 9.00 40.0
YearsInCurrentRole 1470.0 4.229252 3.623137 0.0 2.00 3.0 7.00 18.0
YearsSinceLastPromotion 1470.0 2.187755 3.222430 0.0 0.00 1.0 3.00 15.0
YearsWithCurrManager 1470.0 4.123129 3.568136 0.0 2.00 3.0 7.00 17.0
In [282]:
dfIBM.describe(include=['object']).T
Out[282]:
count unique top freq
Attrition 1470 2 No 1233
BusinessTravel 1470 3 Travel_Rarely 1043
Department 1470 3 Research & Development 961
EducationField 1470 6 Life Sciences 606
Gender 1470 2 Male 882
JobRole 1470 9 Sales Executive 326
MaritalStatus 1470 3 Married 673
Over18 1470 1 Y 1470
OverTime 1470 2 No 1054

We can also check the number of unique values for each feature.

In [283]:
pd.concat([pd.DataFrame({'Unique Values': dfIBM.nunique().sort_values()}),
           pd.DataFrame({'Type': dfIBM.dtypes})], axis=1, sort=False).sort_values(by='Unique Values')
Out[283]:
Unique Values Type
Over18 1 object
StandardHours 1 int64
EmployeeCount 1 int64
Gender 2 object
Attrition 2 object
PerformanceRating 2 int64
OverTime 2 object
MaritalStatus 3 object
Department 3 object
BusinessTravel 3 object
StockOptionLevel 4 int64
EnvironmentSatisfaction 4 int64
JobInvolvement 4 int64
JobSatisfaction 4 int64
RelationshipSatisfaction 4 int64
WorkLifeBalance 4 int64
JobLevel 5 int64
Education 5 int64
EducationField 6 object
TrainingTimesLastYear 7 int64
JobRole 9 object
NumCompaniesWorked 10 int64
PercentSalaryHike 15 int64
YearsSinceLastPromotion 16 int64
YearsWithCurrManager 18 int64
YearsInCurrentRole 19 int64
DistanceFromHome 29 int64
YearsAtCompany 37 int64
TotalWorkingYears 40 int64
Age 43 int64
HourlyRate 71 int64
DailyRate 886 int64
MonthlyIncome 1349 int64
MonthlyRate 1427 int64
EmployeeNumber 1470 int64

There are three columns with constant values. They carry no information, so we can remove them.
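
Rather than listing the constant columns by hand, they can also be found programmatically. A minimal sketch on a toy frame that mimics the structure of the data (the values here are illustrative):

```python
import pandas as pd

# Toy frame mimicking the dataset: two constant columns, one informative one
df = pd.DataFrame({
    'Age': [41, 49, 37],
    'EmployeeCount': [1, 1, 1],   # constant
    'Over18': ['Y', 'Y', 'Y'],    # constant
})

# Columns with a single unique value carry no information
constant_cols = df.columns[df.nunique() <= 1].tolist()
print(constant_cols)  # ['EmployeeCount', 'Over18']
```

On the real dataset the same expression would flag EmployeeCount, StandardHours and Over18.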

In [284]:
dfIBM.drop(columns=['EmployeeCount', 'StandardHours', 'Over18'], inplace=True)

Let's check the balance of the target feature values.

In [285]:
round(dfIBM['Attrition'].value_counts(normalize=True)*100, 2)
Out[285]:
No     83.88
Yes    16.12
Name: Attrition, dtype: float64

We can see an imbalance in the target class: there are far more 'No' values than 'Yes'.
Let's convert the target feature to numeric.

In [286]:
dfIBM.Attrition = dfIBM.Attrition.map({'Yes': 1, 'No': 0})
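
With roughly an 84/16 split, later models should account for the class imbalance, e.g. via class weighting. A minimal sketch of the 'balanced' weighting scheme (as scikit-learn computes it: n_samples / (n_classes * class_count)), shown here on a synthetic target mirroring our split:

```python
import pandas as pd

# Synthetic imbalanced 0/1 target mirroring the ~84/16 split
y = pd.Series([0] * 84 + [1] * 16)

# 'Balanced' class weights: n_samples / (n_classes * class_count)
counts = y.value_counts()
weights = {cls: len(y) / (2 * cnt) for cls, cnt in counts.items()}
print(weights)  # the minority class gets the larger weight
```

Passing such weights (or class_weight='balanced') to a classifier makes errors on the rare 'Yes' class cost proportionally more.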

The column EmployeeNumber has all unique values (1470), so we can assume it is an employee identification number. Let's check that it does not leak information about the target.

In [287]:
plt.rcParams['figure.figsize'] = (14,3)
plt.plot(dfIBM.EmployeeNumber, dfIBM.Attrition, 'ro', alpha=0.2);

OK, there is no leak in the data: Attrition is not sorted by EmployeeNumber. We can remove this column from the dataset.
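
The visual check can be backed by a single number: the Pearson correlation between the ID and the 0/1 target (a point-biserial correlation). A sketch on synthetic data with the same shape as ours:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: a sequential ID and an unrelated binary target
emp_number = pd.Series(np.arange(1, 1471))
attrition = pd.Series(rng.binomial(1, 0.16, size=1470))

# A correlation near 0 supports the visual conclusion
# that there is no ordering leak
corr = emp_number.corr(attrition)
print(abs(corr) < 0.1)
```

On the real frame the analogous call would be dfIBM['EmployeeNumber'].corr(dfIBM['Attrition']).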

In [288]:
dfIBM.drop(columns=['EmployeeNumber'], inplace=True)

Now, looking at the variable names and their values, we can classify all variables into three types.

Name - Unique Values - Type

Categorical, order has no sense:
  • BusinessTravel - 3 - object
  • Department - 3 - object
  • EducationField - 6 - object
  • Gender - 2 - object
  • JobRole - 9 - object
  • MaritalStatus - 3 - object
  • OverTime - 2 - object

Categorical, order has sense, but distance between values has no sense:
  • Education - 5 - int64
  • EnvironmentSatisfaction - 4 - int64
  • JobInvolvement - 4 - int64
  • JobLevel - 5 - int64
  • JobSatisfaction - 4 - int64
  • PerformanceRating - 2 - int64
  • RelationshipSatisfaction - 4 - int64
  • StockOptionLevel - 4 - int64
  • WorkLifeBalance - 4 - int64

Numeric, discrete:
  • Age - 43 - int64
  • DailyRate - 886 - int64
  • DistanceFromHome - 29 - int64
  • HourlyRate - 71 - int64
  • MonthlyIncome - 1349 - int64
  • MonthlyRate - 1427 - int64
  • NumCompaniesWorked - 10 - int64
  • PercentSalaryHike - 15 - int64
  • TotalWorkingYears - 40 - int64
  • TrainingTimesLastYear - 7 - int64
  • YearsAtCompany - 37 - int64
  • YearsInCurrentRole - 19 - int64
  • YearsSinceLastPromotion - 16 - int64
  • YearsWithCurrManager - 18 - int64
In [289]:
Categorical_without_order = ['BusinessTravel', 'Department', 'EducationField',
                             'Gender', 'JobRole', 'MaritalStatus', 'OverTime']

Categorical_with_order = ['Education', 'EnvironmentSatisfaction', 'JobInvolvement',
                          'JobLevel', 'JobSatisfaction', 'PerformanceRating',
                          'RelationshipSatisfaction', 'StockOptionLevel', 'WorkLifeBalance']

Numeric = ['Age', 'DailyRate', 'DistanceFromHome', 'HourlyRate',
           'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
           'PercentSalaryHike', 'TotalWorkingYears', 'TrainingTimesLastYear',
           'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
           'YearsWithCurrManager']
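
For later modeling, the order-less categoricals would typically be one-hot encoded, while ordered ones can keep their integer codes and numeric columns pass through unchanged. A minimal sketch on a hypothetical mini-frame with one column from each group:

```python
import pandas as pd

# Hypothetical mini-frame: one column from each variable group
df = pd.DataFrame({
    'OverTime': ['Yes', 'No', 'Yes'],     # categorical, no order
    'JobSatisfaction': [4, 2, 3],         # categorical, ordered
    'MonthlyIncome': [5993, 5130, 2090],  # numeric
})

# One-hot encode only the order-less column; the rest stay as-is
encoded = pd.get_dummies(df, columns=['OverTime'])
print(sorted(encoded.columns))
```

On the real data, passing columns=Categorical_without_order to pd.get_dummies would apply the same idea to all seven order-less features at once.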

3. Primary visual data analysis

Let's look at the distribution of each feature and its relationship with the target variable.

In [59]:
dictCatNames = {'Education': ['Below College','College','Bachelor','Master','Doctor'],
                'EnvironmentSatisfaction': ['Low','Medium','High','Very High'],
                'JobInvolvement': ['Low','Medium','High','Very High'],
                'JobSatisfaction': ['Low','Medium','High','Very High'],
                'PerformanceRating': ['Low','Good','Excellent','Outstanding'],
                'RelationshipSatisfaction':['Low','Medium','High','Very High'],
                'WorkLifeBalance': ['Bad','Good','Better','Best']}
In [60]:
def cat_distribution_target_proportion(column):
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    fig.suptitle(column, fontsize=16)

    # Count of observations per category
    sns.countplot(dfIBM[column], ax=axes[0])
    axes[0].set_title(column + ' distribution')

    # Mean of the 0/1 target per category, i.e. the attrition rate
    sns.barplot(x=column, y='Attrition', data=dfIBM, ax=axes[1])
    axes[1].set_title('Attrition rate by ' + column)

    for ax in axes:
        # Replace integer codes with readable labels where available
        if column in dictCatNames:
            ax.xaxis.set_ticklabels(dictCatNames[column])
        plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right', rotation_mode='anchor')
In [61]:
for col in (Categorical_without_order + Categorical_with_order):
    cat_distribution_target_proportion(col)

What we can see:

  • Attrition is higher when BusinessTravel is frequent
  • Department, Gender, Education and PerformanceRating have little effect on Attrition
  • Attrition is higher when MaritalStatus is Single
  • Some JobRoles (Sales Representative, Human Resources, Laboratory Technician) have a high level of Attrition
  • Attrition is higher if an employee works OverTime
  • The lower EnvironmentSatisfaction, JobInvolvement, JobLevel, JobSatisfaction, RelationshipSatisfaction or WorkLifeBalance is, the higher the Attrition
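
The rates shown in the barplots can also be computed directly: the mean of a 0/1 target within each group is exactly the group's attrition rate. A sketch on a toy frame (the values are illustrative, with Attrition already mapped to 0/1):

```python
import pandas as pd

# Toy frame with an illustrative 0/1 target
df = pd.DataFrame({
    'OverTime': ['Yes', 'Yes', 'No', 'No', 'No'],
    'Attrition': [1, 0, 0, 0, 1],
})

# Mean of the 0/1 target per group = attrition rate per group
rates = df.groupby('OverTime')['Attrition'].mean()
print(rates['Yes'] > rates['No'])
```

On the real frame, dfIBM.groupby('OverTime')['Attrition'].mean() would reproduce the numbers behind the corresponding barplot.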

What about the distribution of the numeric features and their relationship with the target?

In [69]:
def num_distribution_target_impact(column):
    # 1x2 grid: subplots(2, 2) would return a 2D array of axes,
    # making axes[0] / axes[1] rows instead of single axes
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    fig.suptitle(column, fontsize=16)

    # Histogram of the feature values
    sns.distplot(dfIBM[column], kde=False, ax=axes[0])
    axes[0].set_title(column + ' distribution')

    # Feature distribution split by the target
    sns.boxplot(x='Attrition', y=column, data=dfIBM, ax=axes[1])
    axes[1].set_title('Relationship of Attrition with ' + column)
In [70]:
for n in Numeric:
    num_distribution_target_impact(n)
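
Each boxplot pair can be summarised in a single number: the Pearson correlation between a numeric feature and the 0/1 target (a point-biserial correlation). A sketch on synthetic data, where a hypothetical income-like feature is made lower for leavers:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in: leavers (1) earn about 2000 less on average
attrition = rng.binomial(1, 0.16, size=500)
income = 6000 - 2000 * attrition + rng.normal(0, 500, size=500)
df = pd.DataFrame({'Attrition': attrition, 'MonthlyIncome': income})

# Pearson correlation with a 0/1 variable = point-biserial correlation
corr = df['MonthlyIncome'].corr(df['Attrition'])
print(corr < 0)  # negative by construction in this synthetic example
```

On the real frame, dfIBM[Numeric].corrwith(dfIBM['Attrition']) would rank all numeric features by the strength of this association.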