#!/usr/bin/env python
# coding: utf-8
# # Cleaning and Analyzing Employee Exit Surveys
#
# In this project, I will work with exit surveys from employees of the [Department of Education, Training and Employment](https://en.wikipedia.org/wiki/Department_of_Education_and_Training_(Queensland)) (DETE) and the [Technical and Further Education (TAFE) institute](https://en.wikipedia.org/wiki/Technical_and_further_education) in Queensland, Australia. You can find the TAFE exit survey [here](https://data.gov.au/dataset/ds-qld-89970a3b-182b-41ea-aea2-6f9f17b5907e/details?q=exit%20survey) and the survey for the DETE [here](https://data.gov.au/dataset/ds-qld-fe96ff30-d157-4a81-851d-215f2a0fe26d/details?q=exit%20survey).
#
# ## INTRODUCTION
# In this project, `cleaning` and `analysis` will be done on the two data sets in a bid to help certain stakeholders answer the following questions about their employees:
#
# 1. Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
# 2. Are younger employees resigning due to some kind of dissatisfaction? What about older employees?
#
# Although a data dictionary wasn't provided with the data sets, below is a preview of a couple of columns we'll work with from `dete_survey.csv`:
#
# Column_Name | Description
# :--------: | :-------:
# **ID** | An id used to identify the participant of the survey
# **SeparationType** | The reason why the person's employment ended
# **Cease Date** | The year or month the person's employment ended
# **DETE Start Date** | The year the person began employment with the DETE
#
# Below is a preview of a couple columns we'll work with from the `tafe_survey.csv`:
#
# Column_Name | Description
# :--------: | :-------:
# **Record ID** | An id used to identify the participant of the survey
# **Reason for ceasing employment** | The reason why the person's employment ended
# **LengthofServiceOverall. Overall Length of Service at Institute (in years)** | The length of the person's employment (in years)
#
# ### Exploring the Data sets
# In[1]:
## Import necessary modules and read in the dataset
import numpy as np
import pandas as pd
dete_survey = pd.read_csv('dete_survey.csv')
tafe_survey = pd.read_csv('tafe_survey.csv')
# In[2]:
# display first five rows of dete_survey data set
dete_survey.head()
# In[3]:
## display dete_survey data set infos
dete_survey.info()
print(dete_survey.shape)
# The result from the code cell above gives us some information about the `dete_survey` data set. It tells us:
# - There are `822 rows` and `56 columns`
# - The `ID` column is the only column stored as `int`
# - The majority (37 of 56) of the columns are stored as `string objects`
# - The columns from `Career move to public sector` through `None of the above` are stored as `boolean` values
# - Most columns contain missing values.
# In[4]:
## display exact number of missing values in each column
dete_survey.isnull().sum()
# The result from the code cell above shows:
# - The `Classification`, `Business Unit`, `Aboriginal`, `Torres Strait`, `South Sea`, `Disability` and `NESB` columns contain a lot of missing values
# In[5]:
# display first five rows of tafe_survey data set
tafe_survey.head()
# In[6]:
## display tafe_survey data set infos
tafe_survey.info()
print(tafe_survey.shape)
# The result from the code cell above gives us some information about the `tafe_survey` data set. It tells us:
# - There are `702 rows` and `72 columns` in the data set
# - The `Record ID` and `CESSATION YEAR` columns are the only ones stored as `floats`, while the others are stored as `string objects`
# - There are columns with missing values
#
# The exact number of missing values in each column is displayed in the code cell below
# In[7]:
## display exact number of missing values in each column
tafe_survey.isnull().sum()
# # Data Cleaning
# ## Identify Missing Values and Drop Unnecessary Columns
# Exploring the data sets show:
# * The `dete_survey` dataframe contains `'Not Stated'` values that indicate values are missing, but they aren't represented as NaN.
# * Both the `dete_survey` and `tafe_survey` dataframes contain many columns that we don't need to complete the analysis.
# * Each dataframe contains many of the same columns, but the column names are different.
# * There are multiple columns/answers that indicate an employee resigned because they were dissatisfied.
#
# In the code cells below, the first two issues will be addressed.
# In[8]:
# use the pd.read_csv() function to specify values that should be represented as NaN
dete_survey = pd.read_csv('dete_survey.csv', na_values = 'Not Stated')
#columns not needed for the analysis are dropped
dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis = 1)
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis = 1)
# The changes made in the code cell above will make the data sets easier to work with for the analysis.
#
# ## Rename Columns
# Each dataframe contains many of the same columns, but the column names are different. Below are some of the columns we'd like to use for our final analysis:
#
# dete_survey_updated | tafe_survey_updated | Definition
# :-----: | :-------: | :-----:
# ID | Record ID | An id used to identify the participant of the survey
# SeparationType| Reason for ceasing employment |The reason why the participant's employment ended
# Cease Date |CESSATION YEAR | The year or month the participant's employment ended
# DETE Start Date | | The year the participant began employment with the DETE
# | LengthofServiceOverall. Overall Length of Service at Institute (in years)|The length of the person's employment (in years)
# Age|CurrentAge.Current Age | The age of the participant
# Gender|Gender.What is your Gender? | The gender of the participant
#
# Because the data sets will eventually be combined, the column names have to be standardized. Let's take a peek at the `columns` in the `dete_survey_updated` data set in the code cell below
# In[9]:
dete_survey_updated.columns
# In the code cell below:
# - all capitalizations will be made lowercase
# - all spaces will be replaced with underscores
# In[10]:
#rename columns in the dete_survey_updated dataframe
dete_survey_updated.columns = dete_survey_updated.columns.str.replace(' ','_').str.lower()
dete_survey_updated.columns
# In[11]:
# rename concerned columns in the tafe_survey_updated data set
col_rename = {'Record ID': 'id',
              'CESSATION YEAR': 'cease_date',
              'Reason for ceasing employment': 'separationtype',
              'Gender. What is your Gender?': 'gender',
              'CurrentAge. Current Age': 'age',
              'Employment Type. Employment Type': 'employment_status',
              'Classification. Classification': 'position',
              'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
              'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}
tafe_survey_updated = tafe_survey_updated.rename(columns = col_rename)
tafe_survey_updated.columns
# ## Filter the Data
# If we look at the unique values in the `separationtype` columns in each dataframe, we'll see that each contains a couple of different separation types. For this project, we'll only analyze survey respondents who resigned, so their separation type contains the string `'Resignation'`.
# In[12]:
dete_survey_updated['separationtype'].value_counts()
# In[13]:
tafe_survey_updated['separationtype'].value_counts()
# In[14]:
dete_survey_updated['separationtype'] = dete_survey_updated['separationtype'].str.split('-').str[0]
dete_survey_updated['separationtype'].value_counts()
# In[15]:
dete_resignations = dete_survey_updated[dete_survey_updated['separationtype'] == 'Resignation'].copy()
tafe_resignations = tafe_survey_updated[tafe_survey_updated['separationtype'] == 'Resignation'].copy()
# The `dete_resignations` and `tafe_resignations` dataframes contain only rows with the `Resignation` value in the `separationtype` column.
#
# ## Verify the data
# Before we start cleaning and manipulating the rest of our data, let's verify that the data doesn't contain any major inconsistencies (to the best of our knowledge).
#
# We'll verify that the years in the `cease_date` and `dete_start_date` columns make sense.
#
# * Since the cease_date is the last year of the person's employment and the dete_start_date is the person's first year of employment, it wouldn't make sense to have years after the current date.
# * Given that most people in this field start working in their 20s, it's also unlikely that the dete_start_date was before the year 1940.
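# These range checks can be written as boolean masks. Below is a minimal, self-contained sketch on toy data (the sample years are assumptions standing in for the real `dete_start_date` values):

```python
import pandas as pd

# toy start years standing in for the dete_start_date column (assumed values)
start_years = pd.Series([1963.0, 1984.0, 2009.0, 1939.0, None])

# flag years outside the plausible range: before 1940 or after the survey era (2014)
implausible = (start_years < 1940) | (start_years > 2014)
print(int(implausible.sum()))  # only the 1939 entry is flagged
```

# Note that comparisons against `NaN` are `False`, so missing years are not flagged by these masks.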
# In[16]:
# display unique value counts in the cease_date column
dete_resignations['cease_date'].value_counts()
# In[17]:
# Extract and represent the data in a consistent format
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.split('/').str[-1].astype('float64')
dete_resignations['cease_date'].value_counts().sort_index(ascending=True)
# In[18]:
dete_resignations['dete_start_date'].value_counts().sort_index(ascending=True)
# In[19]:
tafe_resignations['cease_date'].value_counts().sort_index(ascending=True)
# In[20]:
tafe_resignations.info()
# In[21]:
# %matplotlib inline
dete_resignations['cease_date'].plot(kind='box', ylim=(2005,2015))
# In[22]:
tafe_resignations['cease_date'].plot(kind='box', ylim=(2006,2015))
# The results from the code cells above help us verify that the `date` columns in both data sets contain reasonable values
# In[23]:
# determine the years of service of each employee
dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']
dete_resignations['institute_service'].value_counts()
# In[24]:
tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts()
# In[25]:
tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts()
#
# In[26]:
# Convert the values in the 'Contributing Factors. Dissatisfaction' and 'Contributing Factors. Job Dissatisfaction' columns in the tafe_resignations dataframe to True, False, or NaN values
def update_vals(val):
    if val == '-':
        return False
    elif pd.isnull(val):
        return np.nan
    else:
        return True
tafe_resignations['dissatisfied'] = tafe_resignations[['Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction']].applymap(update_vals).any(axis=1, skipna=False)
tafe_resignations_up = tafe_resignations.copy()
tafe_resignations_up['dissatisfied'].value_counts(dropna=False)
# In[27]:
tafe_resignations_up['dissatisfied'].unique()
# In[28]:
dete_resignations['dissatisfied'] = dete_resignations[['job_dissatisfaction',
                                                       'dissatisfaction_with_the_department',
                                                       'physical_work_environment',
                                                       'lack_of_recognition',
                                                       'lack_of_job_security',
                                                       'work_location',
                                                       'employment_conditions',
                                                       'work_life_balance',
                                                       'workload']].any(axis=1, skipna=False)
dete_resignations_up = dete_resignations.copy()
dete_resignations_up['dissatisfied'].value_counts(dropna=False)
# In[29]:
# add a column to each dataframe that will allow us to easily distinguish between the two.
dete_resignations_up['institute'] = 'DETE'
tafe_resignations_up['institute'] = 'TAFE'
# In[30]:
combined = pd.concat([dete_resignations_up, tafe_resignations_up], ignore_index = True)
combined.notnull().sum().sort_values()
# In[31]:
# drop columns with less than 300 notnull values
combined_updated = combined.dropna(thresh = 300, axis = 1).copy()
combined_updated.shape
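# The `thresh` parameter of `DataFrame.dropna` counts non-null values, so `thresh=300` keeps only columns with at least 300 non-null entries. A tiny illustrative sketch (toy columns, not the survey data):

```python
import pandas as pd

# toy frame: one well-populated column, one mostly-missing column
df = pd.DataFrame({'mostly_full': [1, 2, 3], 'mostly_empty': [None, None, 3]})

# thresh=2 keeps only columns with at least 2 non-null values
kept = df.dropna(thresh=2, axis=1)
print(list(kept.columns))  # ['mostly_full']
```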
# In[32]:
# Extract and represent values in a consistent format
combined_updated['institute_service'] = combined_updated['institute_service'].astype('str')
combined_updated['institute_service_d'] = combined_updated['institute_service'].str.extract(r'(\d+)').astype('float')
combined_updated['institute_service_d'].value_counts()
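# The pattern `(\d+)` captures the first run of digits, so ranged answers like `'5-6'` map to their lower bound. A quick illustration on made-up values in the same shapes as `institute_service`:

```python
import pandas as pd

# made-up answers mirroring the formats found in the institute_service column
service = pd.Series(['5-6', 'Less than 1 year', 'More than 20 years', '3-4'])

# (\d+) captures the first run of digits in each string
extracted = service.str.extract(r'(\d+)')[0].astype('float')
print(extracted.tolist())  # [5.0, 1.0, 20.0, 3.0]
```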
# In[33]:
# define function for better representation of data
def map_val(x):
    if pd.isnull(x):
        return np.nan
    elif x >= 11:
        return 'Veteran'
    elif 7 <= x <= 10:
        return 'Established'
    elif 3 <= x <= 6:
        return 'Experienced'
    else:
        return 'New'
combined_updated['service_cat'] = combined_updated['institute_service_d'].apply(map_val)
combined_updated['service_cat'].value_counts()
# In[34]:
# replace missing values with the modal value (False)
combined_updated['dissatisfied'] = combined_updated['dissatisfied'].fillna(False)
combined_updated['dissatisfied'].value_counts()
# # Analysis
# In[35]:
to_plot = combined_updated.pivot_table(values = 'dissatisfied', index = 'service_cat')
to_plot
# In[36]:
to_plot.plot(kind='bar', rot = 30, legend = False, title = 'Dissatisfied % by Service Category')
# Most employees that resign due to some kind of dissatisfaction fall in the **Established (7 - 10 years of service) and Veteran (over 11 years of service)** categories
# * Approximately **51%** of Established employees
# * Approximately **48%** of Veteran employees
#
# [recall the question the analysis is to help us answer, from the introductory cell](#1stmarkdown)
#
# The bar plot displayed above helps -to an extent- answer the first question.
# * Both categories of employees mentioned in the question resign due to some kind of dissatisfaction
# * However, it leaves us with the question, ***What is/are the contributing factor(s) -with reference to the factors provided in the data- to their dissatisfaction?***
# ## Further Analysis
# ### Resignation due to some kind of dissatisfaction
# We will analyse the major contributing factors to employees' resignations. We will consider the factors given for each institute in our combined data set.
#
# `DETE Institute`:
# * job_dissatisfaction
# * dissatisfaction_with_the_department
# * physical_work_environment
# * lack_of_recognition
# * lack_of_job_security
# * work_location
# * employment_conditions
# * work_life_balance
# * workload
#
# `TAFE Institute`:
# * Contributing Factors. Dissatisfaction
# * Contributing Factors. Job Dissatisfaction
#
# These factors are represented as columns in our data set, although they contain a lot of missing values. We will clean them (fix the missing values) and aggregate them with the `service_cat` column
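# The per-column cleaning cells that follow all apply the same fill-with-mode pattern; the pattern can be sketched as a loop over a toy frame (the column names and values here are assumptions, not the survey data):

```python
import pandas as pd

# toy stand-ins for two of the contributing-factor columns (assumed values)
df = pd.DataFrame({'work_life_balance': [True, None, False],
                   'workload': [None, True, None]})

# for each factor column, fill missing entries with the modal value, False
for col in ['work_life_balance', 'workload']:
    df[col] = df[col].fillna(False)

print(df['workload'].tolist())  # [False, True, False]
```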
#
# #### Cleaning `contributing factors` columns in `DETE`
# In[37]:
# display column's unique value_counts info
print("Unique value_counts in Job dissatisfaction\n",combined_updated["job_dissatisfaction"].value_counts())
# In[38]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["job_dissatisfaction"]= combined_updated["job_dissatisfaction"].fillna(False)
combined_updated["job_dissatisfaction"].value_counts(dropna=False)
# In[39]:
# display column's unique value_counts
print("Unique value_counts in dissatisfaction_with_the_department\n",combined_updated["dissatisfaction_with_the_department"].value_counts())
# In[40]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["dissatisfaction_with_the_department"]= combined_updated["dissatisfaction_with_the_department"].fillna(False)
combined_updated["dissatisfaction_with_the_department"].value_counts(dropna=False)
# In[41]:
# display column's unique value_counts
print("Unique value_counts in physical_work_environment\n",combined_updated["physical_work_environment"].value_counts())
# In[42]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["physical_work_environment"]= combined_updated["physical_work_environment"].fillna(False)
combined_updated["physical_work_environment"].value_counts(dropna=False)
# In[43]:
# display column's unique value_counts
print("Unique value_counts in lack_of_recognition\n",combined_updated["lack_of_recognition"].value_counts())
# In[44]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["lack_of_recognition"]= combined_updated["lack_of_recognition"].fillna(False)
combined_updated["lack_of_recognition"].value_counts(dropna=False)
# In[45]:
# display column's unique value_counts
print("Unique value_counts in lack_of_job_security\n",combined_updated["lack_of_job_security"].value_counts())
# In[46]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["lack_of_job_security"]= combined_updated["lack_of_job_security"].fillna(False)
combined_updated["lack_of_job_security"].value_counts(dropna=False)
# In[47]:
# display column's unique value_counts
print("Unique value_counts in work_location\n",combined_updated["work_location"].value_counts())
# In[48]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["work_location"]= combined_updated["work_location"].fillna(False)
combined_updated["work_location"].value_counts(dropna=False)
# In[49]:
# display column's unique value_counts
print("Unique value_counts in employment_conditions\n",combined_updated["employment_conditions"].value_counts())
# In[50]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["employment_conditions"]= combined_updated["employment_conditions"].fillna(False)
combined_updated["employment_conditions"].value_counts(dropna=False)
# In[51]:
# display column's unique value_counts
print("Unique value_counts in work_life_balance\n",combined_updated["work_life_balance"].value_counts())
# In[52]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["work_life_balance"]= combined_updated["work_life_balance"].fillna(False)
combined_updated["work_life_balance"].value_counts(dropna=False)
# In[53]:
# display column's unique value_counts
print("Unique value_counts in workload\n",combined_updated["workload"].value_counts())
# In[54]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["workload"]= combined_updated["workload"].fillna(False)
combined_updated["workload"].value_counts(dropna=False)
# ### Cleaning `contributing factors` columns in `TAFE`
# As part of this cleaning process, we will refer back to the [function](#function) we created earlier - `update_vals`
# In[55]:
# update the missing values using the `update_vals()` function we created earlier
combined_updated[['Contributing Factors. Job Dissatisfaction', 'Contributing Factors. Dissatisfaction']] = (
combined_updated[['Contributing Factors. Job Dissatisfaction', 'Contributing Factors. Dissatisfaction']]
.applymap(update_vals))
# display column's unique value_counts
print('Unique value counts in Contributing Factors. Job Dissatisfaction\n',combined_updated['Contributing Factors. Job Dissatisfaction'].value_counts())
# In[56]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["Contributing Factors. Job Dissatisfaction"]= combined_updated["Contributing Factors. Job Dissatisfaction"].fillna(False)
combined_updated["Contributing Factors. Job Dissatisfaction"].value_counts(dropna=False)
# In[57]:
# display column's unique value_counts
print('Unique value counts in Contributing Factors. Dissatisfaction\n',combined_updated['Contributing Factors. Dissatisfaction'].value_counts())
# In[58]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["Contributing Factors. Dissatisfaction"]= combined_updated["Contributing Factors. Dissatisfaction"].fillna(False)
combined_updated["Contributing Factors. Dissatisfaction"].value_counts(dropna=False)
# In a bid to aggregate the `contributing factors` columns, we had to clean the different columns involved. In the code cells above, we:
# - displayed the unique value counts in each column so as to decide the modal value (the value that occurs most frequently)
# - replaced the missing values in each column with their modal values
#
# However, since the `job_dissatisfaction` and `Contributing Factors. Job Dissatisfaction` columns of the DETE and TAFE data sets communicate the same factor, we will combine the two columns in the code cell below.
# In[59]:
combined_updated["DETE-TAFE Combined job_dissatisfaction"] = combined_updated[['job_dissatisfaction',
'Contributing Factors. Job Dissatisfaction']].any(axis=1, skipna=False)
combined_updated["DETE-TAFE Combined job_dissatisfaction"].value_counts()
# Now that we have cleaned the missing values, let's aggregate by the `service_cat` column using a `pivot_table` and analyse the results.
# In[60]:
cols=[ 'dissatisfaction_with_the_department', 'physical_work_environment',
'lack_of_recognition', 'lack_of_job_security', 'work_location',
'employment_conditions', 'work_life_balance',
'workload','DETE-TAFE Combined job_dissatisfaction','Contributing Factors. Dissatisfaction']
dissatisfaction_result = combined_updated.pivot_table(index='service_cat', values=cols)
dissatisfaction_result
# Now, let's visualize the data
# In[61]:
dissatisfaction_result.plot(kind='bar',title = 'Dissatisfaction % by Factor', rot=30,figsize=(15,10),colormap='Paired').legend(bbox_to_anchor=(0.65, 1))
# Recall the question that led to our further analysis.
#
# From the plot, we can say that:
# - For the `DETE institute`, `general job dissatisfaction` and `work_life_balance` are the major `contributing factors` for both employees with few and longer years of service.
# - For the `TAFE institute`, `general job dissatisfaction` is the major `contributing factor` for both employees with few and longer years of service.
# ### Employee dissatisfaction by Age
# [Recall](#1stmarkdown), the second question to be answered in this analysis
# To answer this question, we will aggregate the `age` column with the different `contributing factors` columns, but we need to clean the `age` column first.
# In[62]:
# check unique values in age column
combined_updated['age'].value_counts(dropna=False)
# There are missing values in this column and the data representation is inconsistent.
# Based on the result from the code cell above, we will perform the following cleaning steps:
# - Create a consistent representation of the data
# - Store the values in this column as `float` objects
# - Replace the missing values with the `mean` age
# In[63]:
# use regex to create a consistent representation of data
combined_updated['age'] = combined_updated['age'].astype('str').str.extract(r'(\d+)')
combined_updated['age']= combined_updated['age'].astype('float')
#replace missing values with mean age
combined_updated['age'] = combined_updated['age'].fillna(int(combined_updated['age'].mean()))
#verify changes
combined_updated['age'].value_counts(dropna=False)
# For proper and better visualization of the data:
# - Create a new column to represent the various age groups
# In[64]:
# define a function to represent the various age groups and apply to the age column
def age_update(x):
    if x >= 60:
        return '60 plus'
    elif 55 <= x <= 59:
        return '55-59'
    elif 50 <= x <= 54:
        return '50-54'
    elif 45 <= x <= 49:
        return '45-49'
    elif 40 <= x <= 44:
        return '40-44'
    elif 35 <= x <= 39:
        return '35-39'
    elif 30 <= x <= 34:
        return '30-34'
    elif 25 <= x <= 29:
        return '25-29'
    else:
        return 'Less than 25'
combined_updated['age_updated']= combined_updated['age'].apply(age_update)
combined_updated['age_updated'].value_counts()
# Since the cleaning process is done and the `age` column has been updated, we will aggregate the `age_updated` column with the different `contributing factors` columns.
# In[65]:
cols=[ 'dissatisfaction_with_the_department', 'physical_work_environment',
'lack_of_recognition', 'lack_of_job_security', 'work_location',
'employment_conditions', 'work_life_balance',
'workload','DETE-TAFE Combined job_dissatisfaction','Contributing Factors. Dissatisfaction']
# create a pivot table to aggregate data
age_group = combined_updated.pivot_table(values = cols, index = 'age_updated')
age_group
# In[66]:
age_group.plot(kind = 'bar', rot = 30, figsize = (20,10), colormap = 'Paired').legend(bbox_to_anchor = (0.5,1))
# Over 25% of the older employees resign because of general dissatisfaction with their jobs; the trend is the same for the younger employees.
#
# # CONCLUSIONS
# As earlier stated in the [introduction](#intro) of this project, we have been able to answer the posed questions:
# #### First question (YES)
# - For the DETE institute, `general job dissatisfaction` and `work_life_balance` are the major contributing factors for employees with both few (`New`) and longer (`Veteran` and `Established`) years of service.
# - For the TAFE institute, `general job dissatisfaction` is the major contributing factor for employees with both few and longer years of service.
#
# #### Second question (YES)
# For the DETE institute:
# - Older employees resign mainly due to `General dissatisfaction`, `Workload` and `work_life_balance`
# - Younger employees resign mainly due to `General dissatisfaction` and `Workload`
#
# For the TAFE institute:
# Both older and younger employees resign due to `General dissatisfaction` with their jobs