import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import random
sns.set()
from jupyterthemes import jtplot
# jtplot.style()
jtplot.style('grade3',
             context='paper',
             fscale=1.5,
             ticks=True,
             grid=False)
# jtplot.figsize(x=15., y=9.,aspect=1.2)
%matplotlib inline
The first step in any data analysis is to get a feel for the data.
Questions along the following lines should be asked:
df = pd.read_csv("HR_comma_sep.csv")
df.head()
df.describe()
df.info()
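# Clean up the data: the 'sales' column is renamed to 'department'
df.rename(columns={'sales':'department'},inplace=True)
# One-hot encode the categorical columns so they can enter the correlation matrix
df_n = pd.get_dummies(df,columns=['department','salary'])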
df_n.head()
There are no missing values in this dataset, so let's move on to the correlation between the features.
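The missing-value check above is read off the `df.info()` output; a per-column count makes it explicit. The following is a minimal sketch: `missing_report` is a hypothetical helper, and the toy frame is made-up illustration data (not taken from `HR_comma_sep.csv`) so that the cell runs on its own.

```python
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.Series:
    """Return the number of missing values in each column (hypothetical helper)."""
    return frame.isna().sum()

# Toy data for illustration only; one value is deliberately missing.
toy = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, None],
    "left": [1, 0, 1],
})
report = missing_report(toy)
print(report)
```

In the notebook, `missing_report(df)` should return zero for every column if the dataset really has no gaps.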
Correlation
corr_mat = df_n.corr()
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(15, 10));
# Draw the heatmap using seaborn
sns.heatmap(corr_mat, square=True, ax=ax);
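A heatmap with many one-hot columns can be hard to read, so it may help to rank the features by how strongly they correlate with `left`. This is a sketch under assumptions: `corr_with_target` is a hypothetical helper, it assumes every column is numeric or boolean (as in `df_n`), and the toy frame is made-up data so the cell runs standalone.

```python
import pandas as pd

def corr_with_target(frame: pd.DataFrame, target: str = "left") -> pd.Series:
    """Correlation of each column with `target`, strongest first (hypothetical helper).

    Assumes all columns are numeric or boolean, e.g. after pd.get_dummies.
    """
    # Cast to float so boolean dummy columns are handled like 0/1 indicators.
    corr = frame.astype(float).corr()[target].drop(target)
    # Order by absolute value but keep the sign of each correlation.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy data for illustration only.
toy = pd.DataFrame({
    "left": [1, 1, 0, 0],
    "satisfaction_level": [0.1, 0.2, 0.8, 0.9],
    "number_project": [2, 6, 3, 4],
})
ranked = corr_with_target(toy)
print(ranked)
```

In the notebook, `corr_with_target(df_n)` would list the same numbers as the `left` row of the heatmap, sorted by magnitude.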
A couple of interesting observations.
df.head()
How does satisfaction level vary among employees who left?
From the plot, employees who left appear to be less satisfied on average.
plt.figure(figsize=(10,6))
sns.boxplot(x='left',y='satisfaction_level',data=df);
Does having more projects have any impact on employee churn?
From the plots, people who churn (1) generally tend to have more projects than people who stay (0). The boxplot and the CDF show that about 50% of churners have 4 or more projects.
fig,axarr = plt.subplots(ncols=2,figsize=(12,6))
# plt.figure(figsize=(10,6))
sns.boxplot(x='left',y='number_project',data=df,ax=axarr[0]);
sns.kdeplot(df[df.left==0].number_project,label=0,cumulative=True,ax=axarr[1]);
sns.kdeplot(df[df.left==1].number_project,label=1,cumulative=True,ax=axarr[1]);
axarr[0].set_ylabel("Number of projects")
axarr[1].set_ylabel("CDF")
axarr[1].set_xlabel("Number of projects");
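The share of churners with 4 or more projects is read off the plots; it can also be computed directly. A minimal sketch, assuming a hypothetical helper `share_with_at_least` and a made-up toy frame so the cell runs on its own:

```python
import pandas as pd

def share_with_at_least(frame: pd.DataFrame, k: int,
                        col: str = "number_project",
                        group: str = "left") -> pd.Series:
    """Fraction of rows in each group whose `col` is >= k (hypothetical helper)."""
    # The mean of a boolean Series is the fraction of True values.
    return (frame[col] >= k).groupby(frame[group]).mean()

# Toy data for illustration only.
toy = pd.DataFrame({
    "left": [0, 0, 0, 1, 1, 1, 1],
    "number_project": [3, 4, 3, 2, 6, 5, 2],
})
shares = share_with_at_least(toy, 4)
print(shares)
```

In the notebook, `share_with_at_least(df, 4)` gives the exact fraction for stayers (0) and churners (1), which can be compared against the estimate read from the CDF.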
Does staying in the company for a long time make the employees more vulnerable to churn?
left_over_years = pd.crosstab(df.time_spend_company,df.left,margins=True,normalize='index')
left_over_years = left_over_years[left_over_years.index!='All']
left_over_years = left_over_years.round(2)
left_over_years.head()
fig, ax = plt.subplots(figsize=(10,6))
_ = left_over_years.plot(kind='bar',ax=ax)
_ = plt.xlabel("number of years at the company")
_ = plt.ylabel("fraction of employees")
_ = plt.title("Distribution of employees after n-years")
Employees start churning after completing 3 years at the company. There is an upward trend from year 3 through year 5: almost 57% of the people who completed 5 years at the company churn. After year 6 we see a downward trend, and people who have spent 7 years or more are unlikely to churn.
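Because `left` is coded as 0/1, the churn rate for each tenure is simply the group mean of `left`, which gives a quick cross-check of the crosstab. A minimal sketch, with a hypothetical helper `churn_rate_by` and a made-up toy frame so the cell runs standalone:

```python
import pandas as pd

def churn_rate_by(frame: pd.DataFrame, by: str, target: str = "left") -> pd.Series:
    """Mean of the 0/1 `target` within each level of `by` (hypothetical helper)."""
    return frame.groupby(by)[target].mean()

# Toy data for illustration only.
toy = pd.DataFrame({
    "time_spend_company": [3, 3, 5, 5, 5, 7],
    "left": [0, 1, 1, 1, 0, 0],
})
rates = churn_rate_by(toy, "time_spend_company")
print(rates.round(2))
```

In the notebook, `churn_rate_by(df, "time_spend_company")` should match the `1` column of `left_over_years` up to rounding.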
Why do people churn the most after 5 years? Is it because of promotions, or are other factors at play?
employee_spent_5_years = df[df.time_spend_company==5]
emp_5_yrs_df = pd.crosstab(employee_spent_5_years.promotion_last_5years, employee_spent_5_years.left,margins=True,normalize='index')
emp_5_yrs_df = emp_5_yrs_df[emp_5_yrs_df.index!='All']
emp_5_yrs_df = emp_5_yrs_df.round(2)
emp_5_yrs_df
emp_5_yrs_df.plot(kind='bar')
This is interesting. Among employees who have spent 5 years at the company, 57% of those who were not promoted in the last 5 years churned, while 94% of those who were promoted stayed.
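Row-normalized rates hide how many employees sit behind each row, and a rate from a small group is less reliable than one from a large group. A sketch that keeps the group size next to the rates, assuming a hypothetical helper `rates_with_counts` and made-up toy data:

```python
import pandas as pd

def rates_with_counts(frame: pd.DataFrame, row: str, col: str = "left") -> pd.DataFrame:
    """Row-normalized crosstab with an extra column `n` for the row size (hypothetical helper)."""
    counts = pd.crosstab(frame[row], frame[col])
    rates = pd.crosstab(frame[row], frame[col], normalize="index").round(2)
    # `n` is the number of employees in each row of the table.
    return rates.assign(n=counts.sum(axis=1))

# Toy data for illustration only.
toy = pd.DataFrame({
    "promotion_last_5years": [0, 0, 0, 0, 1, 1],
    "left": [1, 1, 0, 0, 0, 0],
})
out = rates_with_counts(toy, "promotion_last_5years")
print(out)
```

In the notebook, `rates_with_counts(employee_spent_5_years, "promotion_last_5years")` would show whether the promoted group is large enough for its rate to be trusted.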