#!/usr/bin/env python
# coding: utf-8

# # i Quit!!!
# 
# #### An analysis of why employees choose to leave their jobs

# Employment in the old days was for a life time. You got in to an organization and you were a lifer. The thought of resigning was incomprehensible.
# 
# Today, that trend has flipped. There are just a handful of organization with a low employee turnover. If employees do stay they, eventually get layed off because of some cost cutting initiative. The reasons for leaving may be many. Some leave to improve themselves, others leave because they do not see any prospect for growth and still some others leave because of dissatisfaction.
# 
# In this project we shall look in to how much dissatisfaction towards a job becomes a factor that encourages employess to leave. The project aims to explore a modified data set of exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. 
# In this project
# we'll work with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. You can find the TAFE exit survey [Link](https://data.gov.au/dataset/ds-qld-fe96ff30-d157-4a81-851d-215f2a0fe26d/details?q=exit%20survey) and the survey for the DETE [Link](https://data.gov.au/dataset/ds-qld-fe96ff30-d157-4a81-851d-215f2a0fe26d/details?q=exit%20survey).
# 
# The datasets have some slight modifications including changes to its encoding from UTF-8 to cp1252

# ## The Dataset

# The number of columns for both datasets is quite extensive. Our focus will be on a few columns that will be used to answer the questions we are after.
# 
# Columns in the DETE survey that will be used include:
# - ID: The id used to identify the survery participant
# - SeparationType: The reason for ending employment
# - Cease Date: The year or month the employment ended
# - DETE Start Date: The year the person began employment
# 
# Columns in the TAFE survey that will be used include:
# - Record ID: The id used to identify the survey participant
# - Reason for ceasing employment: The reason for ending employment
# - LengthofServiceOverall. Overall Length of Service at Institute (in years): The tenure of employment (in years)

# ## The Goal

# The goal of this project is to clean and reshape the data and be able to gain insights on whether dissatifcation had a role in the resignation
# 
# - Of employees across various experience groups
# - Of employees across various age groups
# 
# While the datasets are entirely different from each other we will aim to use and combine the earlier identified columns from the different data sets and extrapolate the information we are seeking.

# ## Notes on the data

# In[246]:


import pandas as pd
import numpy as np

dete_survey  = pd.read_csv("dete_survey.csv")
tafe_survey = pd.read_csv("tafe_survey.csv")


# In[247]:


# Function name: print_full(a_list)
# Input: A python list
# Output: The full list
# Description: Jupyter notebooks by default do not display all the data for large datasets. It shows a few lines and summarizes
# the rest using ellipsis. This function helps to see the full python list

def print_full(a_list):
    pd.set_option('display.max_rows', len(a_list))
    print(a_list)
    pd.reset_option('display.max_rows')


# In[248]:


dete_survey.info()
#dete_survey.isnull()


# In[249]:


print_full(dete_survey.head(5))


# In[250]:


column_name = "Region"
dete_survey[column_name].value_counts()


# **Analysis of the columns in the DETE survey:**
# - The first 10 columns are related to employment details.
# - The next 18 columns are based on reasons why the employee might be leaving.
# - The following 21 columns, I assume, are answers to what they got out through the course of their employment from their employer.
# - The last 7 columns include the gender, age and racial identity of the survey participant.
# - Dates in the Cease Date column are not consistent, some have the month and year while others only have the year
# - 'Not stated' seems to be the value used in certain columns where a value is not available like SETE Start Date and Region. However in the column Classification the value is empty if there is no value for the column

# In[251]:


#tafe_survey.info()
#tafe_survey.isnull()
print_full(tafe_survey.head(5))


# In[252]:


column_name = "Workplace. Topic:Does your workplace value the diversity of its employees?"
tafe_survey[column_name].value_counts()


# **Analysis of the columns in the TAFE survey:**
# - The first 5 and last 6 columns are related to employment details and certain details related to the employee.
# - The next 13 columns are based on reasons why the employee might be leaving.
# - The 48 columns that follow, are answers to what they got out throughout the tenure of employment

# Based on an initial analysis, both surveys contain a number of columns that do not contribute to our goal and can be dropped. These relate specifically to what the employee got out of job during the course of employment.
# 
# In addition, the values in the DETE survey that have been set as 'Not Stated' need to be changed to null to support data analysis

# In[253]:


dete_survey = pd.read_csv("dete_survey.csv", na_values = "Not Stated")
dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis = 1)
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis = 1)

print(tafe_survey_updated.head(3))


# ## Preparing to combine data

# Since the columns need to be eventually combined there is a need to ensure that the targeted columns in both surveys have the same name. 

# In[254]:


dete_survey_updated.columns = dete_survey_updated.columns.str.lower().str.strip().str.replace(' ', '_')
tafe_survey_updated = tafe_survey_updated.rename(columns={"Record ID":'id', "CESSATION YEAR":'cease_date', 
                                                          'Reason for ceasing employment':'separation type',
                                                          'Gender. What is your Gender?':'gender', 'CurrentAge. Current Age':'age', 
                                                          'Employment Type. Employment Type':'employment_status', 
                                                          'Classification. Classification':'position', 
                                                          'LengthofServiceOverall. Overall Length of Service at Institute (in years)':'institute_service', 
                                                          'LengthofServiceCurrent. Length of Service at current workplace (in years)':'role_service'})
#print(dete_survey_updated.head(2))
print(tafe_survey_updated.head(2))


# ## Who Resigned?

# While there are many reasons why employee and employer parted ways, our focus is on identifying employees who resigned because of dissatisfaction. The 'Separation Type' column helps with identifying employees that resigned.

# In[255]:


dete_survey_updated["separationtype"].value_counts()


# In[256]:


tafe_survey_updated["separation type"].value_counts()


# In[257]:


dete_resignations = dete_survey_updated[(dete_survey_updated["separationtype"] == "Resignation-Other reasons") |
                                        (dete_survey_updated["separationtype"] == "Resignation-Other employer")|
                                        (dete_survey_updated["separationtype"] == "Resignation-Move overseas/interstate")].copy()
tafe_resignations = tafe_survey_updated[tafe_survey_updated["separation type"] == "Resignation"].copy()


# In[258]:


print(dete_resignations)


# In[259]:


print(tafe_resignations)


# ## Correcting Dates

# As was pointed earlier, the cease_date column in the DETE survey is inconsistent with regards to the values in it while most dates only include the year, there are some dates that have both the month and year.

# In[260]:


dete_resignations["cease_date"].value_counts()


# To correct this, we will have to extract just the year and ignore the month and reassign the same back to the column.

# In[261]:


pattern = r"(\b20\d{2})"
dete_resignations["cease_date"] = dete_resignations["cease_date"].str.extract(pattern).astype(dtype = 'float')


# To ensure that there are no logical inconsistencies in the years we can check the years in the columns graphically through box plots that would clearly bring out years that seem out of place.

# In[262]:


dete_resignations["cease_date"].value_counts().sort_index(ascending = True)
import seaborn as sns
ax = sns.boxplot(x = dete_resignations["cease_date"])
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.tick_params(left = False, bottom = False)
ax.set_title("Resignation Years of DETE employees based on DETE Survey")


# In[263]:


tafe_resignations["cease_date"].value_counts().sort_index(ascending = True)


# In[264]:


ax = sns.boxplot(x = tafe_resignations["cease_date"])
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.tick_params(left = False, bottom = False)
ax.set_title("Resignation Years of TAFE employees based on TAFE Survey")


# As can be seen, there are no great issues with the listed dates. While some of the year records could be removed, keeping them will not affect our goal. So these values are left as they are.

# ##  How long did they work?
# 

# Our goal is to identify how job satisfaction affected resignation across employees of various tenures for this its we need to find out how long survey participants worked for their respective instituitions.

# In case of the TAFE survey, the _institute_service_ column should help to find the tenure of service. However the length of service for the the DETE survey needs to be calculated.This can be done by calculating the difference between the date the employee initiated employment and the date the employee resigned.

# In[265]:


dete_resignations["institute_service"] = dete_resignations["cease_date"] - dete_resignations["dete_start_date"]
print(dete_resignations["institute_service"].value_counts())


# The new column in the DETE survey is given the same name as the TAFE survey _institure_service_

# ## Why did they resign?

# Till now we have information of all those that resigned and how long they worked for. Next, we need to find out all those that resigned because they were dissatisfied. 

# The TAFE survey has two columns that help to identify all those that resigned because of dissatifcation, namely:
# - Contributing Factors. Dissatisfaction
# - Contributing Factors. Job Dissatisfaction

# In[266]:


#'Contributing Factors. Dissatisfaction' and 'Contributing Factors. Job Dissatisfaction' in the tafe_resignations

#tafe_resignations["Contributing Factors. Dissatisfaction"].value_counts()
tafe_resignations["Contributing Factors. Job Dissatisfaction"].value_counts()


# Participants in the TAFE survey have clearly marked dissatifaction as factor if it applied to them. If it has not applied to them they have left it as _-_. Using boolean values it would be easier to consolidate those affected because of dissatisfaction.

# In[267]:


# Function name: update_vals(val)
# Input: Boolean value
# Output: Boolean value
# Description: The function looks for True boolean value in either of the dissatifaction related columns
# of every record in the TAFE survey and returns if it is found else it returns False
def update_vals(val):
    if pd.isnull(val):
        return np.nan
    elif val == '-':
        return False
    else:
        return True


# In[268]:


dissatisfaction = ["Contributing Factors. Dissatisfaction", "Contributing Factors. Job Dissatisfaction"]
tafe_resignations[dissatisfaction] = tafe_resignations[dissatisfaction].applymap(update_vals)
tafe_resignations["dissatisfied"] = tafe_resignations[dissatisfaction].any(axis = 1, skipna = False)
print(tafe_resignations["dissatisfied"])


# Using the above function the above identifed columns have been marked with the values _True_ or _False_.

# In the DETE survey, if any of the columns listed below have been checked, it indicates that the participant was dissatisfied. If none of the below columns were marked they were other reasons for their resignation.
# 
# - job_dissatisfaction
# - dissatisfaction_with_the_department
# - physical_work_environment
# - lack_of_recognition
# - lack_of_job_security
# - work_location
# - employment_conditions
# - work_life_balance
# - workload

# In[269]:


dissatisfaction = ["job_dissatisfaction","dissatisfaction_with_the_department","physical_work_environment",
                   "lack_of_recognition","lack_of_job_security","work_location","employment_conditions",
                   "work_life_balance","workload"]
dete_resignations["dissatisfied"] = dete_resignations[dissatisfaction].any(axis = 1, skipna = False)
print(dete_resignations["dissatisfied"])


# ## Combine the surveys

# Since the focus is on analysis from both surveys it is necessary to combine the data from both surveys. Many of the columns that we require for the analysis are common in both surveys, however we need to identify which data came from which survey. To do this a column called institute will be introduced in to both surveys and then combined.

# In[270]:


dete_resignations_up = dete_resignations.copy()
tafe_resignations_up = tafe_resignations.copy()


# In[271]:


dete_resignations_up["institute"] = "DETE"
tafe_resignations_up["institute"] = "TAFE"
combined = pd.concat([dete_resignations_up, tafe_resignations_up])


# In[272]:


print(combined)
combined.info()


# Since there are multiple columns from both surveys that are not relevant to analysis, they will be removed based on the condition that the columns must not have more 500 null values.

# In[273]:


combined_updated = combined.dropna(axis=1, thresh = 500)
print(combined_updated)


# It is now possible to identify how dissatifaction played a role using the combined data from both surveys.

# ## Experience Groups

# Earlier it was identified that the tenure of employment was not clearly specified in the DETE survey. It was therefore calculated based on the difference between the date of joining and the date of resignation.
# On combining both surveys however, another issue has arisen, the tenure data is not exclusively numeric but contains alphanumeric values.

# In[274]:


combined_updated["institute_service"].value_counts()


# A closer look reveals that the TAFE survey data is not suitably formatted to perform analysis.

# In[275]:


print_full(combined_updated[combined_updated["institute"] == "TAFE"]["institute_service"])


# In order to convert this to the right format, the values in the column need to be first converted to string and then extract the years of employment based on a specific pattern in the string.

# In[276]:


# Pattern to identify digits in a string
pattern = r"(\d+)"

#Vectorized method that converts each value in the column to a string and extract the year
combined_updated["institute_service"] = combined_updated["institute_service"].astype(dtype='str').str.extract(pattern).astype('float')
print_full(combined_updated["institute_service"])


# The format in the column now allows for analysis of dissatisfaction based on experience. It must be noted that where the tenure of employment was _less than 1 year_ or _greater than 20 years_ the convertion process has caused the values to be 1 and 20 respectively. While this may be cause for concern, our focus is on varied experience groups and not across the specific number of years the survey partipants have worked for.
# 
# The experience groups are as follows:
# - New employees: Experience less than 3 years
# - Experienced: Experience between 3-6 years
# - Established: Experience between 7-10 years
# - Veteran: Experience above 10 years
# These will be recorded against each row  using the column _service_cat_. 

# In[277]:


# Function name: map_career(val)
# Input: Number of years of experience
# Output: Experience group
# Description: The function outputs the experience group to which a given number of years of experience must fall to

def map_career(a_value):
    if pd.isnull(a_value):
        return np.nan
    elif a_value<3:
        return 'New'
    elif (a_value>=3) & (a_value<=6):
        return 'Experienced'
    elif (a_value>=7) & (a_value<=10):
        return 'Established'
    else:
        return 'Veteran'

combined_updated.loc[:,"service_cat"] = combined_updated["institute_service"].apply(map_career)
print(combined_updated["service_cat"])


# In[278]:


combined_updated["service_cat"].value_counts(dropna=False)


# ## Age Groups

# Similar to experience groups, we need to split our the age column to groups to find out how dissatisfaction is spread across them. The data in the age column suffers the same inconsistency as the _institute_service_ column earlier.

# In[279]:


print_full(combined_updated["age"].value_counts(dropna=False))


# Using the same procedure that the _institute_service_ column underwent, the age column can be made ready for analysis.

# In[280]:


# Pattern to identify digits in a string
pattern = r"(\d+)"

#Vectorized method that converts each value in the column to a string and extract the year
combined_updated["age"] = combined_updated["age"].astype(dtype='str').str.extract(pattern).astype('float')
print_full(combined_updated["age"])


# Similar to what was done for the experience groups, age can be broken down to groups as follows:
# 
# - Baby Boomers: Older than 56
# - Gen X: Between 40-55
# - Gen Y: Between 24-39
# - Gen Z: Younger than 24

# In[281]:


# Function name: map_career(age)
# Input: Age
# Output: Age group
# Description: The function outputs the age group to which a given age must fall in to.

def map_age(a_value):
    if pd.isnull(a_value):
        return np.nan
    elif a_value<24:
        return 'Gen Z'
    elif (a_value>=24) & (a_value<=39):
        return 'Gen Y'
    elif (a_value>=40) & (a_value<=55):
        return 'Gen X'
    else:
        return 'Baby Boomer'

combined_updated.loc[:,"age_cat"] = combined_updated["age"].apply(map_age)
print(combined_updated["age_cat"])


# In[282]:


combined_updated["age_cat"].value_counts()


# ## Finally.... the Answer

# Since the groups have been clearly defined based on age and experience and since its possible to clearly identify participants in both groups who resigned because of dissatisfaction, it is possible to gain insights based on the goal defined earlier. 

# A more detailed analysis of the _dissatisfied_ column reveals that there are null values. The number of null values will determine whether records with null values need to be dropped to help with aggregation.

# In[283]:


combined_updated["dissatisfied"].value_counts(dropna=False)


# Clearly the number of null values is significantly small and set them all to _False_ would not skew the analysis in a significant manner as the _False_ value has the most count.

# In[284]:


combined_updated["dissatisfied"].fillna(value=False, inplace=True)


# In[285]:


combined_updated["dissatisfied"].value_counts(dropna = False)


# **Question**: How much does dissatisfaction play a role in resignation across experience groups?

# In[286]:


import seaborn as sns
ax = sns.barplot(x = combined_updated["service_cat"], y = combined_updated["dissatisfied"], 
                 estimator = np.mean, ci=0, palette = "colorblind")
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.set_yticks([0.0,0.25,0.5])
ax.tick_params(left = False, bottom = False)
ax.set_title("Resignations caused by dissatisfaction by Experience Group")


# **Insights**
# It is evident that resignation because of dissatisfaction is highest among employees that have the most experience. In comparison, only slightly more than 25% of new employees have resigned because of dissatisfaction in their jobs.
# 
# This could be because more established employees have far less oppurtunities outside the organization and are more sensitive to any thing that they percieve as wrong from the employer's perspective. Over time they have developed a perception of the "right way" and their experience may have cemented that perception. 
# On the other hand, newer employees tend to be more flexible as they are not established and have to prove themselves.They are willing to overcome any adversity in exchange for the learning and experience.

# **Question**: How much does dissatisfaction play a role in resignation across age groups?

# In[287]:


import seaborn as sns
ax = sns.barplot(x = combined_updated["age_cat"], y = combined_updated["dissatisfied"], 
                 estimator = np.mean, ci=0, palette = "colorblind")
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.tick_params(left = False, bottom = False)
ax.set_yticks([0.0,0.2,0.4])
ax.set_title("Percentage of employees that resigned by Age Group")


# %matplotlib inline
# plot1 = combined_updated.pivot_table(values = "dissatisfied", index =['age_cat'] )
# plot1.plot(kind='bar')


# **Insights**
# Those of the Baby Boomer generation seem to be more affected by dissatisfaction, closely followed by employees of the Gen X and Gen Y age groups.
# 
# This could be because with increase in age, there is a higher degree of focus on achieving certain goals. This includes job goals like getting to a certain positions or earning within a certain income range. Anything that slows the achievement of said goal could prove to be a cause for dissatisfaction causing employees to resign.

# **Question**: Did more employees in the DETE survey or TAFE survey end their employment because they were dissatisfied in some way?

# In[288]:


ax = sns.barplot(x = combined_updated["service_cat"], y = combined_updated["dissatisfied"], 
                 hue = combined_updated["institute"], estimator = np.mean, ci=0, palette = "colorblind")
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.tick_params(left = False, bottom = False)
ax.set_title("Percentage of employees that resigned by Experience Group")


# In[289]:


ax = sns.barplot(x = combined_updated["age_cat"], y = combined_updated["dissatisfied"], 
                 hue = combined_updated["institute"], estimator = np.mean, 
                 ci=0, palette = "colorblind")
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.tick_params(left = False, bottom = False)
ax.set_title("Percentage of employees that resigned by Age Group")


# **Insights**
# Clearly, employees in the DETE survey have been more affected by dissatisfaction causing them to resign from their roles while in DETE employment.
# 
# The reason behind this would require a more detailed analysis of both employers and the work cultures that they have.

# # Conclusion

# In this project, the focus was to analyze the responses in the DETE and TAFE surveys and understand how dissatisfaction played a role in the resignation of employees from the respective employers.
# 
# The analysis required the cleaning and reshaping of the given dataset to come up with insights regarding the role of dissatifaction across defined experience and age groups.
# Employees that are older by age and those who are the most experienced tend to be affected by dissatisfaction causing them to resign. 
# Those with very less experience and those who are younger seem to be comparatively less affected by dissatisfaction and may have had other reasons besides the same that cause them to resign.