#!/usr/bin/env python
# coding: utf-8
# # Cleaning and Analyzing Employee Exit Surveys
#
# In this project, I will work with exit surveys from employees of the [Department of Education, Training and Employment](https://en.wikipedia.org/wiki/Department_of_Education_and_Training_(Queensland)) (DETE) and the [Technical and Further Education (TAFE) institute](https://en.wikipedia.org/wiki/Technical_and_further_education) in Queensland, Australia. You can find the TAFE exit survey [here](https://data.gov.au/dataset/ds-qld-89970a3b-182b-41ea-aea2-6f9f17b5907e/details?q=exit%20survey) and the survey for the DETE [here](https://data.gov.au/dataset/ds-qld-fe96ff30-d157-4a81-851d-215f2a0fe26d/details?q=exit%20survey).
#
# ## INTRODUCTION
# In this project, `cleaning` and `analysis` will be done on the two data sets in a bid to help certain stakeholders answer the following questions about their employees:
#
# 1. Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
# 2. Are younger employees resigning due to some kind of dissatisfaction? What about older employees?
#
# Although a data dictionary wasn't provided with the data sets, below is a preview of a couple of columns we'll work with from `dete_survey.csv`:
#
# Column_Name | Description
# :--------: | :-------:
# **ID** | An id used to identify the participant of the survey
# **SeparationType** | The reason why the person's employment ended
# **Cease Date** | The year or month the person's employment ended
# **DETE Start Date** | The year the person began employment with the DETE
#
# Below is a preview of a couple columns we'll work with from the `tafe_survey.csv`:
#
# Column_Name | Description
# :--------: | :-------:
# **Record ID** | An id used to identify the participant of the survey
# **Reason for ceasing employment** | The reason why the person's employment ended
# **LengthofServiceOverall. Overall Length of Service at Institute (in years)** | The length of the person's employment (in years)
#
# ### Exploring the Data sets
# In[1]:
## Import necessary modules and read in the dataset
import numpy as np
import pandas as pd
dete_survey = pd.read_csv('dete_survey.csv')
tafe_survey = pd.read_csv('tafe_survey.csv')
# In[2]:
# display first five rows of dete_survey data set
dete_survey.head()
# In[3]:
## display dete_survey data set infos
dete_survey.info()
print(dete_survey.shape)
# The result from the code cell above gives us some information about the `dete_survey` data set. It tells us:
# - There are `822 rows` and `56 columns`
# - The `ID` column is the only column stored as `int`
# - The majority (37 of 56) of the columns are stored as `string objects`
# - The columns from `Career move to public sector` through `None of the above` are stored as `boolean` values
# - Most columns contain missing values.
# In[4]:
## display exact number of missing values in each column
dete_survey.isnull().sum()
# The result from the code cell above shows:
# - The `Classification`, `Business Unit`, `Aboriginal`, `Torres Strait`, `South Sea`, `Disability` and `NESB` columns contain a lot of missing values
# In[5]:
# display first five rows of tafe_survey data set
tafe_survey.head()
# In[6]:
## display tafe_survey data set infos
tafe_survey.info()
print(tafe_survey.shape)
# The result from the code cell above gives us some information about the `tafe_survey` data set. It tells us:
# - There are `702 rows` and `72 columns` in the data set
# - The `Record ID` and `CESSATION YEAR` columns are the only ones stored as `floats`, while the others are stored as `string objects`
# - There are columns with missing values
#
# The exact number of missing values in each column is displayed in the code cell below
# In[7]:
## display exact number of missing values in each column
tafe_survey.isnull().sum()
# # Data Cleaning
# ## Identify Missing Values and Drop Unnecessary Columns
# Exploring the data sets show:
# * The `dete_survey` dataframe contains `'Not Stated'` values that indicate values are missing, but they aren't represented as NaN.
# * Both the `dete_survey` and `tafe_survey` dataframes contain many columns that we don't need to complete the analysis.
# * Each dataframe contains many of the same columns, but the column names are different.
# * There are multiple columns/answers that indicate an employee resigned because they were dissatisfied.
#
# In the code cells below, the first two issues will be addressed.
# In[8]:
# use the pd.read_csv() function to specify values that should be represented as NaN
dete_survey = pd.read_csv('dete_survey.csv', na_values = 'Not Stated')
#columns not needed for the analysis are dropped
dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis = 1)
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis = 1)
# The changes made in the code cell above will make the data sets easier to work with for the analysis.
#
# ## Rename Columns
# Each dataframe contains many of the same columns, but the column names are different. Below are some of the columns we'd like to use for our final analysis:
#
# dete_survey_updated | tafe_survey_updated | Definition
# :-----: | :-------: | :-----:
# ID | Record ID | An id used to identify the participant of the survey
# SeparationType| Reason for ceasing employment |The reason why the participant's employment ended
# Cease Date |CESSATION YEAR | The year or month the participant's employment ended
# DETE Start Date | | The year the participant began employment with the DETE
# | LengthofServiceOverall. Overall Length of Service at Institute (in years)|The length of the person's employment (in years)
# Age|CurrentAge.Current Age | The age of the participant
# Gender|Gender.What is your Gender? | The gender of the participant
#
# Because the data sets will eventually be combined, the column names have to be standardized. Let's take a peek at the `columns` in the `dete_survey_updated` data set in the code cell below
# In[9]:
dete_survey_updated.columns
# In the code cell below:
# - all capitalizations will be made lowercase
# - all spaces will be replaced with underscores
# In[10]:
#rename columns in the dete_survey_updated dataframe
dete_survey_updated.columns = dete_survey_updated.columns.str.replace(' ','_').str.lower()
dete_survey_updated.columns
# In[11]:
# rename concerned columns in the tafe_survey_updated data set
col_rename = {'Record ID': 'id',
              'CESSATION YEAR': 'cease_date',
              'Reason for ceasing employment': 'separationtype',
              'Gender. What is your Gender?': 'gender',
              'CurrentAge. Current Age': 'age',
              'Employment Type. Employment Type': 'employment_status',
              'Classification. Classification': 'position',
              'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
              'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}
tafe_survey_updated = tafe_survey_updated.rename(columns = col_rename)
tafe_survey_updated.columns
# ## Filter the Data
# If we look at the unique values in the `separationtype` columns in each dataframe, we'll see that each contains a couple of different separation types. For this project, we'll only analyze survey respondents who resigned, so their separation type contains the string `'Resignation'`.
# In[12]:
dete_survey_updated['separationtype'].value_counts()
# In[13]:
tafe_survey_updated['separationtype'].value_counts()
# In[14]:
dete_survey_updated['separationtype'] = dete_survey_updated['separationtype'].str.split('-').str[0]
dete_survey_updated['separationtype'].value_counts()
# In[15]:
dete_resignations = dete_survey_updated[dete_survey_updated['separationtype'] == 'Resignation'].copy()
tafe_resignations = tafe_survey_updated[tafe_survey_updated['separationtype'] == 'Resignation'].copy()
# The `dete_resignations` and `tafe_resignations` dataframes contain only rows with the `Resignation` value in the `separationtype` column.
#
# ## Verify the data
# Before we start cleaning and manipulating the rest of our data, let's verify that the data doesn't contain any major inconsistencies (to the best of our knowledge).
#
# We'll verify that the years in the `cease_date` and `dete_start_date` columns make sense.
#
# * Since the cease_date is the last year of the person's employment and the dete_start_date is the person's first year of employment, it wouldn't make sense to have years after the current date.
# * Given that most people in this field start working in their 20s, it's also unlikely that the dete_start_date was before the year 1940.
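# These range checks can be written as boolean masks. Below is a minimal, self-contained sketch on toy data (the sample years are assumptions standing in for the real `dete_start_date` values):

```python
import pandas as pd

# toy start years standing in for the dete_start_date column (assumed values)
start_years = pd.Series([1963.0, 1984.0, 2009.0, 1939.0, None])

# flag years outside the plausible range: before 1940 or after the survey era (2014)
implausible = (start_years < 1940) | (start_years > 2014)
print(int(implausible.sum()))  # only the 1939 entry is flagged
```

# Note that comparisons against `NaN` are `False`, so missing years are not flagged by these masks.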
# In[16]:
# display unique value counts in the cease_date column
dete_resignations['cease_date'].value_counts()
# In[17]:
# Extract and represent the data in a consistent format
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.split('/').str[-1].astype('float64')
dete_resignations['cease_date'].value_counts().sort_index(ascending=True)
# In[18]:
dete_resignations['dete_start_date'].value_counts().sort_index(ascending=True)
# In[19]:
tafe_resignations['cease_date'].value_counts().sort_index(ascending=True)
# In[20]:
tafe_resignations.info()
# In[21]:
# %matplotlib inline
dete_resignations['cease_date'].plot(kind='box', ylim=(2005,2015))
# In[22]:
tafe_resignations['cease_date'].plot(kind='box', ylim=(2006,2015))
# The results from the code cells above help us verify that the `date` columns in both data sets contain reasonable values
# In[23]:
# determine the years of service of each employee
dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']
dete_resignations['institute_service'].value_counts()
# In[24]:
tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts()
# In[25]:
tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts()
#
# In[26]:
# Convert the values in the 'Contributing Factors. Dissatisfaction' and 'Contributing Factors. Job Dissatisfaction' columns in the tafe_resignations dataframe to True, False, or NaN values
def update_vals(val):
    if val == '-':
        return False
    elif pd.isnull(val):
        return np.nan
    else:
        return True
tafe_resignations['dissatisfied'] = tafe_resignations[['Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction']].applymap(update_vals).any(axis=1, skipna=False)
tafe_resignations_up = tafe_resignations.copy()
tafe_resignations_up['dissatisfied'].value_counts(dropna=False)
# In[27]:
tafe_resignations_up['dissatisfied'].unique()
# In[28]:
dete_resignations['dissatisfied'] = dete_resignations[['job_dissatisfaction',
                                                       'dissatisfaction_with_the_department',
                                                       'physical_work_environment',
                                                       'lack_of_recognition',
                                                       'lack_of_job_security',
                                                       'work_location',
                                                       'employment_conditions',
                                                       'work_life_balance',
                                                       'workload']].any(axis=1, skipna=False)
dete_resignations_up = dete_resignations.copy()
dete_resignations_up['dissatisfied'].value_counts(dropna=False)
# In[29]:
# add a column to each dataframe that will allow us to easily distinguish between the two.
dete_resignations_up['institute'] = 'DETE'
tafe_resignations_up['institute'] = 'TAFE'
# In[30]:
combined = pd.concat([dete_resignations_up, tafe_resignations_up], ignore_index = True)
combined.notnull().sum().sort_values()
# In[31]:
# drop columns with less than 300 notnull values
combined_updated = combined.dropna(thresh = 300, axis = 1).copy()
combined_updated.shape
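# The `thresh` parameter of `DataFrame.dropna` counts non-null values, so `thresh=300` keeps only columns with at least 300 non-null entries. A tiny illustrative sketch (toy columns, not the survey data):

```python
import pandas as pd

# toy frame: one well-populated column, one mostly-missing column
df = pd.DataFrame({'mostly_full': [1, 2, 3], 'mostly_empty': [None, None, 3]})

# thresh=2 keeps only columns with at least 2 non-null values
kept = df.dropna(thresh=2, axis=1)
print(list(kept.columns))  # ['mostly_full']
```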
# In[32]:
# Extract and represent values in a consistent format
combined_updated['institute_service'] = combined_updated['institute_service'].astype('str')
combined_updated['institute_service_d'] = combined_updated['institute_service'].str.extract(r'(\d+)').astype('float')
combined_updated['institute_service_d'].value_counts()
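# The pattern `(\d+)` captures the first run of digits, so ranged answers like `'5-6'` map to their lower bound. A quick illustration on made-up values in the same shapes as `institute_service`:

```python
import pandas as pd

# made-up answers mirroring the formats found in the institute_service column
service = pd.Series(['5-6', 'Less than 1 year', 'More than 20 years', '3-4'])

# (\d+) captures the first run of digits in each string
extracted = service.str.extract(r'(\d+)')[0].astype('float')
print(extracted.tolist())  # [5.0, 1.0, 20.0, 3.0]
```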
# In[33]:
# define function for better representation of data
def map_val(x):
    if pd.isnull(x):
        return np.nan
    elif x >= 11:
        return 'Veteran'
    elif 7 <= x <= 10:
        return 'Established'
    elif 3 <= x <= 6:
        return 'Experienced'
    else:
        return 'New'
combined_updated['service_cat'] = combined_updated['institute_service_d'].apply(map_val)
combined_updated['service_cat'].value_counts()
# In[34]:
# replace missing values with the modal value (False)
combined_updated['dissatisfied'] = combined_updated['dissatisfied'].fillna(False)
combined_updated['dissatisfied'].value_counts()
# # Analysis
# In[35]:
to_plot = combined_updated.pivot_table(values = 'dissatisfied', index = 'service_cat')
to_plot
# In[36]:
to_plot.plot(kind='bar', rot = 30, legend = False, title = 'Dissatisfied % by Service Category')
# Most employees that resign due to some kind of dissatisfaction fall in the **Established (7 - 10 years of service) and Veteran (over 11 years of service)** categories
# * Approximately **51%** of Established employees
# * Approximately **48%** of Veteran employees
#
# [recall the question the analysis is to help us answer, from the introductory cell](#1stmarkdown)
#
# The bar plot displayed above helps -to an extent- answer the first question.
# * Both categories of employees mentioned in the question resign due to some kind of dissatisfaction
# * However, it leaves us with the question, ***What is/are the contributing factor(s) -with reference to the factors provided in the data- to their dissatisfaction?***
# ## Further Analysis
# ### Resignation due to some kind of dissatisfaction
# We will analyse the major contributing factors to employees' resignations. We will consider the factors given for each institute in our combined data set.
#
# `DETE Institute`:
# * job_dissatisfaction
# * dissatisfaction_with_the_department
# * physical_work_environment
# * lack_of_recognition
# * lack_of_job_security
# * work_location
# * employment_conditions
# * work_life_balance
# * workload
#
# `TAFE Institute`:
# * Contributing Factors. Dissatisfaction
# * Contributing Factors. Job Dissatisfaction
#
# These factors are represented as columns in our data set, although they contain a lot of missing values. We will clean them (fix the missing values) and aggregate them with the `service_cat` column
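# The per-column cleaning cells that follow all apply the same fill-with-mode pattern; the pattern can be sketched as a loop over a toy frame (the column names and values here are assumptions, not the survey data):

```python
import pandas as pd

# toy stand-ins for two of the contributing-factor columns (assumed values)
df = pd.DataFrame({'work_life_balance': [True, None, False],
                   'workload': [None, True, None]})

# for each factor column, fill missing entries with the modal value, False
for col in ['work_life_balance', 'workload']:
    df[col] = df[col].fillna(False)

print(df['workload'].tolist())  # [False, True, False]
```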
#
# #### Cleaning `contributing factors` columns in `DETE`
# In[37]:
# display column's unique value_counts info
print("Unique value_counts in Job dissatisfaction\n",combined_updated["job_dissatisfaction"].value_counts())
# In[38]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["job_dissatisfaction"]= combined_updated["job_dissatisfaction"].fillna(False)
combined_updated["job_dissatisfaction"].value_counts(dropna=False)
# In[39]:
# display column's unique value_counts
print("Unique value_counts in dissatisfaction_with_the_department\n",combined_updated["dissatisfaction_with_the_department"].value_counts())
# In[40]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["dissatisfaction_with_the_department"]= combined_updated["dissatisfaction_with_the_department"].fillna(False)
combined_updated["dissatisfaction_with_the_department"].value_counts(dropna=False)
# In[41]:
# display column's unique value_counts
print("Unique value_counts in physical_work_environment\n",combined_updated["physical_work_environment"].value_counts())
# In[42]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["physical_work_environment"]= combined_updated["physical_work_environment"].fillna(False)
combined_updated["physical_work_environment"].value_counts(dropna=False)
# In[43]:
# display column's unique value_counts
print("Unique value_counts in lack_of_recognition\n",combined_updated["lack_of_recognition"].value_counts())
# In[44]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["lack_of_recognition"]= combined_updated["lack_of_recognition"].fillna(False)
combined_updated["lack_of_recognition"].value_counts(dropna=False)
# In[45]:
# display column's unique value_counts
print("Unique value_counts in lack_of_job_security\n",combined_updated["lack_of_job_security"].value_counts())
# In[46]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["lack_of_job_security"]= combined_updated["lack_of_job_security"].fillna(False)
combined_updated["lack_of_job_security"].value_counts(dropna=False)
# In[47]:
# display column's unique value_counts
print("Unique value_counts in work_location\n",combined_updated["work_location"].value_counts())
# In[48]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["work_location"]= combined_updated["work_location"].fillna(False)
combined_updated["work_location"].value_counts(dropna=False)
# In[49]:
# display column's unique value_counts
print("Unique value_counts in employment_conditions\n",combined_updated["employment_conditions"].value_counts())
# In[50]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["employment_conditions"]= combined_updated["employment_conditions"].fillna(False)
combined_updated["employment_conditions"].value_counts(dropna=False)
# In[51]:
# display column's unique value_counts
print("Unique value_counts in work_life_balance\n",combined_updated["work_life_balance"].value_counts())
# In[52]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["work_life_balance"]= combined_updated["work_life_balance"].fillna(False)
combined_updated["work_life_balance"].value_counts(dropna=False)
# In[53]:
# display column's unique value_counts
print("Unique value_counts in workload\n",combined_updated["workload"].value_counts())
# In[54]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["workload"]= combined_updated["workload"].fillna(False)
combined_updated["workload"].value_counts(dropna=False)
# ### Cleaning `contributing factors` columns in `TAFE`
# As part of this cleaning process, we will refer back to the [function](#function) we created earlier - `update_vals`
# In[55]:
# update the missing values using the `update_vals()` function we created earlier
combined_updated[['Contributing Factors. Job Dissatisfaction', 'Contributing Factors. Dissatisfaction']] = (
combined_updated[['Contributing Factors. Job Dissatisfaction', 'Contributing Factors. Dissatisfaction']]
.applymap(update_vals))
# display column's unique value_counts
print('Unique value counts in Contributing Factors. Job Dissatisfaction\n',combined_updated['Contributing Factors. Job Dissatisfaction'].value_counts())
# In[56]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["Contributing Factors. Job Dissatisfaction"]= combined_updated["Contributing Factors. Job Dissatisfaction"].fillna(False)
combined_updated["Contributing Factors. Job Dissatisfaction"].value_counts(dropna=False)
# In[57]:
# display column's unique value_counts
print('Unique value counts in Contributing Factors. Dissatisfaction\n',combined_updated['Contributing Factors. Dissatisfaction'].value_counts())
# In[58]:
# replace missing values with the value that occurs most frequently in this column
combined_updated["Contributing Factors. Dissatisfaction"]= combined_updated["Contributing Factors. Dissatisfaction"].fillna(False)
combined_updated["Contributing Factors. Dissatisfaction"].value_counts(dropna=False)
# In a bid to aggregate the `contributing factors` columns, we had to clean the different columns involved. In the code cells above, we:
# - displayed the unique value counts in each column so as to decide the modal value (the value that occurs most frequently)
# - replaced the missing values in each column with their modal values
#
# However, since the `job_dissatisfaction` and `Contributing Factors. Job Dissatisfaction` columns of the DETE and TAFE data sets communicate the same factor, we will combine the two columns in the code cell below.
# In[59]:
combined_updated["DETE-TAFE Combined job_dissatisfaction"] = combined_updated[['job_dissatisfaction',
'Contributing Factors. Job Dissatisfaction']].any(axis=1, skipna=False)
combined_updated["DETE-TAFE Combined job_dissatisfaction"].value_counts()
# Now that we have cleaned the missing values, let's aggregate by the `service_cat` column using a `pivot_table` and analyse the results.
# In[60]:
cols=[ 'dissatisfaction_with_the_department', 'physical_work_environment',
'lack_of_recognition', 'lack_of_job_security', 'work_location',
'employment_conditions', 'work_life_balance',
'workload','DETE-TAFE Combined job_dissatisfaction','Contributing Factors. Dissatisfaction']
dissatisfaction_result = combined_updated.pivot_table(index='service_cat', values=cols)
dissatisfaction_result
# Now, let's visualize the data
# In[61]:
dissatisfaction_result.plot(kind='bar',title = 'Dissatisfaction % by Factor', rot=30,figsize=(15,10),colormap='Paired').legend(bbox_to_anchor=(0.65, 1))
# Recall the question that led to our further analysis.
#
# From the plot, we can say that:
# - For the `DETE institute`, `general job dissatisfaction` and `work_life_balance` are the major `contributing factors` for both employees with few and longer years of service.
# - For the `TAFE institute`, `general job dissatisfaction` is the major `contributing factor` for both employees with few and longer years of service.
# ### Employee dissatisfaction by Age
# [Recall](#1stmarkdown), the second question to be answered in this analysis
# To answer this question, we will aggregate the `age` column with the different `contributing factors` columns, but we need to clean the `age` column first.
# In[62]:
# check unique values in age column
combined_updated['age'].value_counts(dropna=False)
# There are missing values in this column and the data representation is inconsistent.
# Based on the result from the code cell above, we will perform the following cleaning steps:
# - Create a consistent representation of the data
# - Store the values in this column as `float` objects
# - Replace the missing values with the `mean` age
# In[63]:
# use regex to create a consistent representation of data
combined_updated['age'] = combined_updated['age'].astype('str').str.extract(r'(\d+)')
combined_updated['age']= combined_updated['age'].astype('float')
#replace missing values with mean age
combined_updated['age'] = combined_updated['age'].fillna(int(combined_updated['age'].mean()))
#verify changes
combined_updated['age'].value_counts(dropna=False)
# For proper and better visualization of the data:
# - Create a new column to represent the various age groups
# In[64]:
# define a function to represent the various age groups and apply to the age column
def age_update(x):
    if x >= 60:
        return '60 plus'
    elif 55 <= x <= 59:
        return '55-59'
    elif 50 <= x <= 54:
        return '50-54'
    elif 45 <= x <= 49:
        return '45-49'
    elif 40 <= x <= 44:
        return '40-44'
    elif 35 <= x <= 39:
        return '35-39'
    elif 30 <= x <= 34:
        return '30-34'
    elif 25 <= x <= 29:
        return '25-29'
    else:
        return 'Less than 25'
combined_updated['age_updated']= combined_updated['age'].apply(age_update)
combined_updated['age_updated'].value_counts()
# Since the cleaning process is done and the `age` column has been updated, we will aggregate the `age_updated` column with the different `contributing factors` columns.
# In[65]:
cols=[ 'dissatisfaction_with_the_department', 'physical_work_environment',
'lack_of_recognition', 'lack_of_job_security', 'work_location',
'employment_conditions', 'work_life_balance',
'workload','DETE-TAFE Combined job_dissatisfaction','Contributing Factors. Dissatisfaction']
# create a pivot table to aggregate data
age_group = combined_updated.pivot_table(values = cols, index = 'age_updated')
age_group
# In[66]:
age_group.plot(kind = 'bar', rot = 30, figsize = (20,10), colormap = 'Paired').legend(bbox_to_anchor = (0.5,1))
# Over 25% of the older employees resign because of general dissatisfaction with their jobs; the trend is the same for the younger employees.
#
# # CONCLUSIONS
# As earlier stated in the [introduction](#intro) of this project, we have been able to answer the posed questions:
# #### First question (YES)
# - For the DETE institute, `general job dissatisfaction` and `work_life_balance` are the major contributing factors for employees with both few (`New`) and longer (`Veteran` and `Established`) years of service.
# - For the TAFE institute, `general job dissatisfaction` is the major contributing factor for employees with both few and longer years of service.
#
# #### Second question (YES)
# For the DETE institute:
# - Older employees resign mainly due to `General dissatisfaction`, `Workload` and `work_life_balance`
# - Younger employees resign mainly due to `General dissatisfaction` and `Workload`
#
# For the TAFE institute:
# Both older and younger employees resign due to `General dissatisfaction` with their jobs