Determining the Cause of Workers Resignation: Analyzing Exit Surveys From Employees

1. Introduction

There are countless reasons why employees resign or retire early. Reasons may include some dissatisfaction with employers or some concerns over family and health. Whatever reason it may be for an employee to resign, it is necessary for employers to understand the causes and be able to reduce workers exit or resignation. Surveys and staff feedback provide employers with valuable information on the reasons why their employees resign or retire. The information is used to inform attraction and retention initiatives and to improve work practices across an organization to ensure the organization is considered an employer of choice.

1.1 Project Questions

  1. Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about those employees who have been there longer?
  2. Are younger employees resigning due to some kind of dissatisfaction? What about older employees?

1.2 Aim

In this project, our goal is to provide answers to the project questions above.

1.3 Datasets

In this project, we'll work with two datasets:

  • DETE Exit Survey: exit surveys from employees of the Department of Education, Training and Employment (DETE).
  • TAFE Exit Survey: exit surveys from employees of Technical and Further Education (TAFE).

Both datasets originated from departments in Queensland, Australia. The DETE and TAFE Exit Surveys were developed to effectively canvass the opinions and attitudes of departing employees to identify a wide range of operational, organizational and personal variables affecting the decision to leave.

Data Dictionary

A dictionary wasn't provided with the datasets. However, for this project, we'll use our general knowledge to define some of the columns we're going to work with. We'll provide another combined dictionary later as we move forward.

dete_survey.csv

Column Definition
ID An id used to identify the participant of the survey
SeparationType The reason why the person's employment ended
Cease Date The year or month the person's employment ended
DETE Start Date The year the person began employment with the DETE

tafe_survey.csv

Column Definition
Record ID An id used to identify the participant of the survey
Reason for ceasing employment The reason why the person's employment ended
LengthofServiceOverall. Overall Legth of Service at Institute (in years) The length of the person's employment (in years).

Result

In this project, it was found that the total number of workers who resigned due to some kinds of job dissatisfaction totalled 240. Out of this number, we found that, compared to employees with less than 7 years of service, more employees who have spent 7 or more years at a job (Established and Veteran categories) resigned due to some kinds of job dissatisfaction (with more than 60% of the total number of dissatisfied employees).

Further into the analysis, we noticed that more workers resigned as they increase in age. We observed a progressive increase in the percentage of workers that resigned due to job dissatisfaction as they move from their 30s to 40s etc. We found the highest percentage of dissatisfied workers among those who are 50 years of age and above. However, we found that more workers in their 20s and below resigned due to job dissatisfaction compared to those in their 30s.

In [1]:
# Import needed libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Read-in datasets with pandas
dete_survey = pd.read_csv('dete_survey.csv', encoding="Latin-1")
tafe_survey = pd.read_csv('tafe_survey.csv', encoding="Latin-1")

2. Overview of the Datasets (Data Exploration)

The first action to carry out is to get an overview of what the two datasets look like. These initial and somewhat simple actions will direct our steps into the carrying out proper data cleaning and eventually, data analysis. We'll familiarize ourselves with the two datasets using some important methods. At the end, we'll have adequate knowledge to guide us as we move forward with the analysis and towards the set goals.

2.1 Exploring DETE Survey File:

In [2]:
# This line of code helps to display full columns in the output
pd.options.display.max_columns = 150

# Familiarizing ourselves with the DETE dataset
dete_survey.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 56 columns):
ID                                     822 non-null int64
SeparationType                         822 non-null object
Cease Date                             822 non-null object
DETE Start Date                        822 non-null object
Role Start Date                        822 non-null object
Position                               817 non-null object
Classification                         455 non-null object
Region                                 822 non-null object
Business Unit                          126 non-null object
Employment Status                      817 non-null object
Career move to public sector           822 non-null bool
Career move to private sector          822 non-null bool
Interpersonal conflicts                822 non-null bool
Job dissatisfaction                    822 non-null bool
Dissatisfaction with the department    822 non-null bool
Physical work environment              822 non-null bool
Lack of recognition                    822 non-null bool
Lack of job security                   822 non-null bool
Work location                          822 non-null bool
Employment conditions                  822 non-null bool
Maternity/family                       822 non-null bool
Relocation                             822 non-null bool
Study/Travel                           822 non-null bool
Ill Health                             822 non-null bool
Traumatic incident                     822 non-null bool
Work life balance                      822 non-null bool
Workload                               822 non-null bool
None of the above                      822 non-null bool
Professional Development               808 non-null object
Opportunities for promotion            735 non-null object
Staff morale                           816 non-null object
Workplace issue                        788 non-null object
Physical environment                   817 non-null object
Worklife balance                       815 non-null object
Stress and pressure support            810 non-null object
Performance of supervisor              813 non-null object
Peer support                           812 non-null object
Initiative                             813 non-null object
Skills                                 811 non-null object
Coach                                  767 non-null object
Career Aspirations                     746 non-null object
Feedback                               792 non-null object
Further PD                             768 non-null object
Communication                          814 non-null object
My say                                 812 non-null object
Information                            816 non-null object
Kept informed                          813 non-null object
Wellness programs                      766 non-null object
Health & Safety                        793 non-null object
Gender                                 798 non-null object
Age                                    811 non-null object
Aboriginal                             16 non-null object
Torres Strait                          3 non-null object
South Sea                              7 non-null object
Disability                             23 non-null object
NESB                                   32 non-null object
dtypes: bool(18), int64(1), object(37)
memory usage: 258.6+ KB
In [3]:
dete_survey.head()
Out[3]:
ID SeparationType Cease Date DETE Start Date Role Start Date Position Classification Region Business Unit Employment Status Career move to public sector Career move to private sector Interpersonal conflicts Job dissatisfaction Dissatisfaction with the department Physical work environment Lack of recognition Lack of job security Work location Employment conditions Maternity/family Relocation Study/Travel Ill Health Traumatic incident Work life balance Workload None of the above Professional Development Opportunities for promotion Staff morale Workplace issue Physical environment Worklife balance Stress and pressure support Performance of supervisor Peer support Initiative Skills Coach Career Aspirations Feedback Further PD Communication My say Information Kept informed Wellness programs Health & Safety Gender Age Aboriginal Torres Strait South Sea Disability NESB
0 1 Ill Health Retirement 08/2012 1984 2004 Public Servant A01-A04 Central Office Corporate Strategy and Peformance Permanent Full-time True False False True False False True False False False False False False False False False False True A A N N N A A A A N N N A A A N A A N N N Male 56-60 NaN NaN NaN NaN Yes
1 2 Voluntary Early Retirement (VER) 08/2012 Not Stated Not Stated Public Servant AO5-AO7 Central Office Corporate Strategy and Peformance Permanent Full-time False False False False False False False False False False False False False False False False False False A A N N N N A A A N N N A A A N A A N N N Male 56-60 NaN NaN NaN NaN NaN
2 3 Voluntary Early Retirement (VER) 05/2012 2011 2011 Schools Officer NaN Central Office Education Queensland Permanent Full-time False False False False False False False False False False False False False False False False False True N N N N N N N N N N N N N N N A A N N N N Male 61 or older NaN NaN NaN NaN NaN
3 4 Resignation-Other reasons 05/2012 2005 2006 Teacher Primary Central Queensland NaN Permanent Full-time False True False False False False False False False False False False False False False False False False A N N N A A N N A A A A A A A A A A A N A Female 36-40 NaN NaN NaN NaN NaN
4 5 Age Retirement 05/2012 1970 1989 Head of Curriculum/Head of Special Education NaN South East NaN Permanent Full-time False False False False False False False False False False False False False False False True False False A A N N D D N A A A A A A SA SA D D A N A M Female 61 or older NaN NaN NaN NaN NaN
In [4]:
# Checking for unique entries in the column
dete_survey['SeparationType'].value_counts()
Out[4]:
Age Retirement                          285
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Voluntary Early Retirement (VER)         67
Ill Health Retirement                    61
Other                                    49
Contract Expired                         34
Termination                              15
Name: SeparationType, dtype: int64
In [5]:
# Getting the number of null values in the dataframe
dete_survey.isnull().sum()
Out[5]:
ID                                       0
SeparationType                           0
Cease Date                               0
DETE Start Date                          0
Role Start Date                          0
Position                                 5
Classification                         367
Region                                   0
Business Unit                          696
Employment Status                        5
Career move to public sector             0
Career move to private sector            0
Interpersonal conflicts                  0
Job dissatisfaction                      0
Dissatisfaction with the department      0
Physical work environment                0
Lack of recognition                      0
Lack of job security                     0
Work location                            0
Employment conditions                    0
Maternity/family                         0
Relocation                               0
Study/Travel                             0
Ill Health                               0
Traumatic incident                       0
Work life balance                        0
Workload                                 0
None of the above                        0
Professional Development                14
Opportunities for promotion             87
Staff morale                             6
Workplace issue                         34
Physical environment                     5
Worklife balance                         7
Stress and pressure support             12
Performance of supervisor                9
Peer support                            10
Initiative                               9
Skills                                  11
Coach                                   55
Career Aspirations                      76
Feedback                                30
Further PD                              54
Communication                            8
My say                                  10
Information                              6
Kept informed                            9
Wellness programs                       56
Health & Safety                         29
Gender                                  24
Age                                     11
Aboriginal                             806
Torres Strait                          819
South Sea                              815
Disability                             799
NESB                                   790
dtype: int64

2.2 Exploring TAFE Survey File:

In [6]:
# Familiarizing ourselves with the TAFE dataset
tafe_survey.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 72 columns):
Record ID                                                                                                                                                        702 non-null int64
Institute                                                                                                                                                        702 non-null object
WorkArea                                                                                                                                                         702 non-null object
CESSATION YEAR                                                                                                                                                   695 non-null float64
Reason for ceasing employment                                                                                                                                    701 non-null object
Contributing Factors. Career Move - Public Sector                                                                                                                437 non-null object
Contributing Factors. Career Move - Private Sector                                                                                                               437 non-null object
Contributing Factors. Career Move - Self-employment                                                                                                              437 non-null object
Contributing Factors. Ill Health                                                                                                                                 437 non-null object
Contributing Factors. Maternity/Family                                                                                                                           437 non-null object
Contributing Factors. Dissatisfaction                                                                                                                            437 non-null object
Contributing Factors. Job Dissatisfaction                                                                                                                        437 non-null object
Contributing Factors. Interpersonal Conflict                                                                                                                     437 non-null object
Contributing Factors. Study                                                                                                                                      437 non-null object
Contributing Factors. Travel                                                                                                                                     437 non-null object
Contributing Factors. Other                                                                                                                                      437 non-null object
Contributing Factors. NONE                                                                                                                                       437 non-null object
Main Factor.     Which of these was the main factor for leaving?                                                                                                 113 non-null object
InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction                                                                           608 non-null object
InstituteViews. Topic:2. I was given access to skills training to help me do my job better                                                                       613 non-null object
InstituteViews. Topic:3. I was given adequate opportunities for personal development                                                                             610 non-null object
InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL%                                                              608 non-null object
InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had                                                                  615 non-null object
InstituteViews. Topic:6. The organisation recognised when staff did good work                                                                                    607 non-null object
InstituteViews. Topic:7. Management was generally supportive of me                                                                                               614 non-null object
InstituteViews. Topic:8. Management was generally supportive of my team                                                                                          608 non-null object
InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me                                                            610 non-null object
InstituteViews. Topic:10. Staff morale was positive within the Institute                                                                                         602 non-null object
InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly                                                                                   601 non-null object
InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently                                                                               597 non-null object
InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly                                                                                601 non-null object
WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit                                                  609 non-null object
WorkUnitViews. Topic:15. I worked well with my colleagues                                                                                                        605 non-null object
WorkUnitViews. Topic:16. My job was challenging and interesting                                                                                                  607 non-null object
WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work                                                                          610 non-null object
WorkUnitViews. Topic:18. I had sufficient contact with other people in my job                                                                                    613 non-null object
WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job                                                     609 non-null object
WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job                                                                                 609 non-null object
WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT]    608 non-null object
WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job                                                                              608 non-null object
WorkUnitViews. Topic:23. My job provided sufficient variety                                                                                                      611 non-null object
WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job                                                                      610 non-null object
WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction                                                          611 non-null object
WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance                                                      606 non-null object
WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area                         610 non-null object
WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date    609 non-null object
WorkUnitViews. Topic:29. There was adequate communication between staff in my unit                                                                               603 non-null object
WorkUnitViews. Topic:30. Staff morale was positive within my work unit                                                                                           606 non-null object
Induction. Did you undertake Workplace Induction?                                                                                                                619 non-null object
InductionInfo. Topic:Did you undertake a Corporate Induction?                                                                                                    432 non-null object
InductionInfo. Topic:Did you undertake a Institute Induction?                                                                                                    483 non-null object
InductionInfo. Topic: Did you undertake Team Induction?                                                                                                          440 non-null object
InductionInfo. Face to Face Topic:Did you undertake a Corporate Induction; Category:How it was conducted?                                                        555 non-null object
InductionInfo. On-line Topic:Did you undertake a Corporate Induction; Category:How it was conducted?                                                             555 non-null object
InductionInfo. Induction Manual Topic:Did you undertake a Corporate Induction?                                                                                   555 non-null object
InductionInfo. Face to Face Topic:Did you undertake a Institute Induction?                                                                                       530 non-null object
InductionInfo. On-line Topic:Did you undertake a Institute Induction?                                                                                            555 non-null object
InductionInfo. Induction Manual Topic:Did you undertake a Institute Induction?                                                                                   553 non-null object
InductionInfo. Face to Face Topic: Did you undertake Team Induction; Category?                                                                                   555 non-null object
InductionInfo. On-line Topic: Did you undertake Team Induction?process you undertook and how it was conducted.]                                                  555 non-null object
InductionInfo. Induction Manual Topic: Did you undertake Team Induction?                                                                                         555 non-null object
Workplace. Topic:Did you and your Manager develop a Performance and Professional Development Plan (PPDP)?                                                        608 non-null object
Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?                                                      594 non-null object
Workplace. Topic:Does your workplace promote and practice the principles of employment equity?                                                                   587 non-null object
Workplace. Topic:Does your workplace value the diversity of its employees?                                                                                       586 non-null object
Workplace. Topic:Would you recommend the Institute as an employer to others?                                                                                     581 non-null object
Gender.     What is your Gender?                                                                                                                                 596 non-null object
CurrentAge.     Current Age                                                                                                                                      596 non-null object
Employment Type.     Employment Type                                                                                                                             596 non-null object
Classification.     Classification                                                                                                                               596 non-null object
LengthofServiceOverall. Overall Length of Service at Institute (in years)                                                                                        596 non-null object
LengthofServiceCurrent. Length of Service at current workplace (in years)                                                                                        596 non-null object
dtypes: float64(1), int64(1), object(70)
memory usage: 395.0+ KB
In [7]:
tafe_survey.head()
Out[7]:
Record ID Institute WorkArea CESSATION YEAR Reason for ceasing employment Contributing Factors. Career Move - Public Sector Contributing Factors. Career Move - Private Sector Contributing Factors. Career Move - Self-employment Contributing Factors. Ill Health Contributing Factors. Maternity/Family Contributing Factors. Dissatisfaction Contributing Factors. Job Dissatisfaction Contributing Factors. Interpersonal Conflict Contributing Factors. Study Contributing Factors. Travel Contributing Factors. Other Contributing Factors. NONE Main Factor.     Which of these was the main factor for leaving? InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction InstituteViews. Topic:2. I was given access to skills training to help me do my job better InstituteViews. Topic:3. I was given adequate opportunities for personal development InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL% InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had InstituteViews. Topic:6. The organisation recognised when staff did good work InstituteViews. Topic:7. Management was generally supportive of me InstituteViews. Topic:8. Management was generally supportive of my team InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me InstituteViews. Topic:10. Staff morale was positive within the Institute InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit WorkUnitViews. Topic:15. I worked well with my colleagues WorkUnitViews. Topic:16. My job was challenging and interesting WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work WorkUnitViews. Topic:18. I had sufficient contact with other people in my job WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT] WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job WorkUnitViews. Topic:23. My job provided sufficient variety WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date WorkUnitViews. Topic:29. There was adequate communication between staff in my unit WorkUnitViews. Topic:30. Staff morale was positive within my work unit Induction. Did you undertake Workplace Induction? InductionInfo. Topic:Did you undertake a Corporate Induction? InductionInfo. Topic:Did you undertake a Institute Induction? InductionInfo. Topic: Did you undertake Team Induction? InductionInfo. Face to Face Topic:Did you undertake a Corporate Induction; Category:How it was conducted? InductionInfo. On-line Topic:Did you undertake a Corporate Induction; Category:How it was conducted? InductionInfo. Induction Manual Topic:Did you undertake a Corporate Induction? InductionInfo. Face to Face Topic:Did you undertake a Institute Induction? InductionInfo. On-line Topic:Did you undertake a Institute Induction? InductionInfo. Induction Manual Topic:Did you undertake a Institute Induction? InductionInfo. Face to Face Topic: Did you undertake Team Induction; Category? InductionInfo. On-line Topic: Did you undertake Team Induction?process you undertook and how it was conducted.] InductionInfo. Induction Manual Topic: Did you undertake Team Induction? Workplace. Topic:Did you and your Manager develop a Performance and Professional Development Plan (PPDP)? Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination? Workplace. Topic:Does your workplace promote and practice the principles of employment equity? Workplace. Topic:Does your workplace value the diversity of its employees? Workplace. Topic:Would you recommend the Institute as an employer to others? Gender.     What is your Gender? CurrentAge.     Current Age Employment Type.     Employment Type Classification.     Classification LengthofServiceOverall. Overall Length of Service at Institute (in years) LengthofServiceCurrent. Length of Service at current workplace (in years)
0 634133009996094000 Southern Queensland Institute of TAFE Non-Delivery (corporate) 2010.0 Contract Expired NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Agree Agree Agree Neutral Agree Agree Agree Agree Agree Agree Agree Agree Agree Agree Agree Agree Strongly Agree Agree Agree Agree Agree Agree Agree Agree Agree Agree Agree Neutral Agree Agree Yes Yes Yes Yes Face to Face - - Face to Face - - Face to Face - - Yes Yes Yes Yes Yes Female 26 – 30 Temporary Full-time Administration (AO) 1-2 1-2
1 634133654064531000 Mount Isa Institute of TAFE Non-Delivery (corporate) 2010.0 Retirement - - - - - - - - - Travel - - NaN Agree Agree Agree Agree Agree Strongly Agree Strongly Agree Agree Strongly Agree Agree Agree Agree Disagree Strongly Agree Strongly Agree Strongly Agree Agree Agree Agree Strongly Agree Agree Agree Agree Strongly Agree Agree Strongly Agree Strongly Agree Agree Agree Strongly Agree No NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Yes Yes Yes Yes Yes NaN NaN NaN NaN NaN NaN
2 634138845606563000 Mount Isa Institute of TAFE Delivery (teaching) 2010.0 Retirement - - - - - - - - - - - NONE NaN Agree Agree Agree Agree Agree Agree Strongly Agree Agree Agree Agree Agree Neutral Neutral Strongly Agree Strongly Agree Agree Agree Agree Agree Agree Agree Agree Agree Agree Agree Agree Agree Agree Agree Agree No NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Yes Yes Yes Yes Yes NaN NaN NaN NaN NaN NaN
3 634139903350000000 Mount Isa Institute of TAFE Non-Delivery (corporate) 2010.0 Resignation - - - - - - - - - Travel - - NaN Agree Agree Agree Agree Agree Agree Agree Agree Agree Agree Agree Agree Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Yes No Yes Yes - - - NaN - - - - - Yes Yes Yes Yes Yes NaN NaN NaN NaN NaN NaN
4 634146578511788000 Southern Queensland Institute of TAFE Delivery (teaching) 2010.0 Resignation - Career Move - Private Sector - - - - - - - - - - NaN Agree Agree Strongly Agree Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Agree Strongly Agree Strongly Agree Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Strongly Agree Yes Yes Yes Yes - - Induction Manual Face to Face - - Face to Face - - Yes Yes Yes Yes Yes Male 41 – 45 Permanent Full-time Teacher (including LVT) 3-4 3-4
In [8]:
# Checking for unique entries in the column
tafe_survey['Reason for ceasing employment'].value_counts()
Out[8]:
Resignation                 340
Contract Expired            127
Retrenchment/ Redundancy    104
Retirement                   82
Transfer                     25
Termination                  23
Name: Reason for ceasing employment, dtype: int64
In [9]:
# Getting the number of null values in the dataframe
tafe_survey.isnull().sum()
Out[9]:
Record ID                                                                      0
Institute                                                                      0
WorkArea                                                                       0
CESSATION YEAR                                                                 7
Reason for ceasing employment                                                  1
                                                                            ... 
CurrentAge.     Current Age                                                  106
Employment Type.     Employment Type                                         106
Classification.     Classification                                           106
LengthofServiceOverall. Overall Length of Service at Institute (in years)    106
LengthofServiceCurrent. Length of Service at current workplace (in years)    106
Length: 72, dtype: int64

OBSERVATIONS

  • As we can see from the information above, the DETE dataframe contains 56 columns and 822 rows of data. We also see that DETE Start Date column in dete_survey dataframe contains some missing/null values entered as Not Stated. This is recognized by pandas as a string and not a missing value.

  • The TAFE dataframe also contains 72 columns and 702 rows of data. We also see that LengthofServiceOverall. Overall Length of Service at Institute (in years) column in tafe_survey dataframe contains some missing/null values seen as NaN.

  • The two dataframes contain many of the same columns, but the column names are different.

  • Now we know that the dataframes contain many columns that are not needed to complete this project. Also, there are multiple columns/answers that indicate an employee resigned because they were dissatisfied.

3. Data Cleaning

3.1 Removing Unwanted Columns

Hence, before we proceed further with the data exploration, we'll read dete_survey.csv CSV file into pandas again. This time around, the Not Stated values will be specified as NAN. We'll thereafter proceed to remove the unwanted columns.

In [10]:
# Makes pandas to identify specified missing values
dete_survey = pd.read_csv('dete_survey.csv', na_values='Not Stated', encoding="Latin-1")
In [11]:
# Removes unwanted columns from each dataframes
dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis=1)
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis=1)
In [12]:
# Checks that the columns were removed
print(dete_survey_updated.columns)
Index(['ID', 'SeparationType', 'Cease Date', 'DETE Start Date',
       'Role Start Date', 'Position', 'Classification', 'Region',
       'Business Unit', 'Employment Status', 'Career move to public sector',
       'Career move to private sector', 'Interpersonal conflicts',
       'Job dissatisfaction', 'Dissatisfaction with the department',
       'Physical work environment', 'Lack of recognition',
       'Lack of job security', 'Work location', 'Employment conditions',
       'Maternity/family', 'Relocation', 'Study/Travel', 'Ill Health',
       'Traumatic incident', 'Work life balance', 'Workload',
       'None of the above', 'Gender', 'Age', 'Aboriginal', 'Torres Strait',
       'South Sea', 'Disability', 'NESB'],
      dtype='object')
In [13]:
print(tafe_survey_updated.columns)
Index(['Record ID', 'Institute', 'WorkArea', 'CESSATION YEAR',
       'Reason for ceasing employment',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE',
       'Gender.     What is your Gender?', 'CurrentAge.     Current Age',
       'Employment Type.     Employment Type',
       'Classification.     Classification',
       'LengthofServiceOverall. Overall Length of Service at Institute (in years)',
       'LengthofServiceCurrent. Length of Service at current workplace (in years)'],
      dtype='object')

From the new information above, we see that the dete_survey_updated dataframe now contains 35 columns and 822 rows. We also see that missing/null values are now seen as NAN. The tafe_survey_updated dataframe now contains 23 columns and 702 rows.

Now that we have successfully made these changes, we'll get back on track to our data exploration and cleaning.

3.1.1 Cleaning and Renaming Column Names

As stated earlier in our observation, each dataframe contains many of the same columns, but the column names are different. Below is a combined dictionary for some similar columns in the two datasets:

dete_survey tafe_survey Definition
ID Record ID An id used to identify the participant of the survey
SeparationType Reason for ceasing employment The reason why the participant's employment ended
Cease Date CESSATION YEAR The year or month the participant's employment ended
DETE Start Date The year the participant began employment with DETE
LengthofServiceOverall. Overall Length of Service at Institute (in years) The length of the person's employment (in years)
Age CurrentAge.     Current Age The age of the participant
Gender Gender.     What is your Gender? The gender of the participant

We'll standardize the column names in both dataframes and make them uniform. This is because we'll later combine the two dataframes.

3.1.1.1 DETE column names:

In some few cells below, we'll see the state of the column names before and after the cleaning process that was carried out. We do this in order to maintain the flow of the whole process.

In [14]:
# Initial form of the column names
print(dete_survey_updated.columns)
Index(['ID', 'SeparationType', 'Cease Date', 'DETE Start Date',
       'Role Start Date', 'Position', 'Classification', 'Region',
       'Business Unit', 'Employment Status', 'Career move to public sector',
       'Career move to private sector', 'Interpersonal conflicts',
       'Job dissatisfaction', 'Dissatisfaction with the department',
       'Physical work environment', 'Lack of recognition',
       'Lack of job security', 'Work location', 'Employment conditions',
       'Maternity/family', 'Relocation', 'Study/Travel', 'Ill Health',
       'Traumatic incident', 'Work life balance', 'Workload',
       'None of the above', 'Gender', 'Age', 'Aboriginal', 'Torres Strait',
       'South Sea', 'Disability', 'NESB'],
      dtype='object')
In [15]:
# Column head: BEFORE CLEANING
dete_survey_updated.head(2)
Out[15]:
ID SeparationType Cease Date DETE Start Date Role Start Date Position Classification Region Business Unit Employment Status Career move to public sector Career move to private sector Interpersonal conflicts Job dissatisfaction Dissatisfaction with the department Physical work environment Lack of recognition Lack of job security Work location Employment conditions Maternity/family Relocation Study/Travel Ill Health Traumatic incident Work life balance Workload None of the above Gender Age Aboriginal Torres Strait South Sea Disability NESB
0 1 Ill Health Retirement 08/2012 1984.0 2004.0 Public Servant A01-A04 Central Office Corporate Strategy and Peformance Permanent Full-time True False False True False False True False False False False False False False False False False True Male 56-60 NaN NaN NaN NaN Yes
1 2 Voluntary Early Retirement (VER) 08/2012 NaN NaN Public Servant AO5-AO7 Central Office Corporate Strategy and Peformance Permanent Full-time False False False False False False False False False False False False False False False False False False Male 56-60 NaN NaN NaN NaN NaN
In [16]:
# Updates the column names to a standardized form
dete_survey_updated.columns = dete_survey_updated.columns.str.replace(' ', '_').str.strip().str.lower()

# Updated form of the column names
print(dete_survey_updated.columns)
Index(['id', 'separationtype', 'cease_date', 'dete_start_date',
       'role_start_date', 'position', 'classification', 'region',
       'business_unit', 'employment_status', 'career_move_to_public_sector',
       'career_move_to_private_sector', 'interpersonal_conflicts',
       'job_dissatisfaction', 'dissatisfaction_with_the_department',
       'physical_work_environment', 'lack_of_recognition',
       'lack_of_job_security', 'work_location', 'employment_conditions',
       'maternity/family', 'relocation', 'study/travel', 'ill_health',
       'traumatic_incident', 'work_life_balance', 'workload',
       'none_of_the_above', 'gender', 'age', 'aboriginal', 'torres_strait',
       'south_sea', 'disability', 'nesb'],
      dtype='object')
In [17]:
# Column head: AFTER CLEANING
dete_survey_updated.head(2)
Out[17]:
id separationtype cease_date dete_start_date role_start_date position classification region business_unit employment_status career_move_to_public_sector career_move_to_private_sector interpersonal_conflicts job_dissatisfaction dissatisfaction_with_the_department physical_work_environment lack_of_recognition lack_of_job_security work_location employment_conditions maternity/family relocation study/travel ill_health traumatic_incident work_life_balance workload none_of_the_above gender age aboriginal torres_strait south_sea disability nesb
0 1 Ill Health Retirement 08/2012 1984.0 2004.0 Public Servant A01-A04 Central Office Corporate Strategy and Peformance Permanent Full-time True False False True False False True False False False False False False False False False False True Male 56-60 NaN NaN NaN NaN Yes
1 2 Voluntary Early Retirement (VER) 08/2012 NaN NaN Public Servant AO5-AO7 Central Office Corporate Strategy and Peformance Permanent Full-time False False False False False False False False False False False False False False False False False False Male 56-60 NaN NaN NaN NaN NaN

3.1.1.2 TAFE column names:

In the following cells, we'll update the column names in tafe_survey_updated dataframe to be uniform with the dete_survey_updated dataframe.

In [18]:
# A dictionary of name change
updated_names = {"Record ID": "id", "CESSATION YEAR": "cease_date", "Reason for ceasing employment": "separationtype", "Gender.     What is your Gender?": "gender", "CurrentAge.     Current Age": "age", "Employment Type.     Employment Type": "employment_status", "Classification.     Classification": "position", "LengthofServiceOverall. Overall Length of Service at Institute (in years)": "institute_service", "LengthofServiceCurrent. Length of Service at current workplace (in years)": "role_service" }

# Updates the column names in dataframe
tafe_survey_updated = tafe_survey_updated.rename(updated_names, axis=1)
tafe_survey_updated.columns
Out[18]:
Index(['id', 'Institute', 'WorkArea', 'cease_date', 'separationtype',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE', 'gender',
       'age', 'Employment Type.     Employment Type',
       'Classification.     Classification', 'institute_service',
       'role_service'],
      dtype='object')

We can see that the names of the columns representing the same information in both dataframes are now uniform. Thus, our analysis is made less confusing. This is especially important because we aim to combine both dataframes to complete our analysis.

3.2 Cleaning out other unwanted data

Next, we'll remove more of the data we don't need to answer our questions - one of which is:

  • Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about those employees who have been at the job longer?

Again, let's take a look at the unique values in separationtype columns in each dataframe:

In [19]:
dete_survey_updated['separationtype'].value_counts()
Out[19]:
Age Retirement                          285
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Voluntary Early Retirement (VER)         67
Ill Health Retirement                    61
Other                                    49
Contract Expired                         34
Termination                              15
Name: separationtype, dtype: int64
In [20]:
tafe_survey_updated['separationtype'].value_counts()
Out[20]:
Resignation                 340
Contract Expired            127
Retrenchment/ Redundancy    104
Retirement                   82
Transfer                     25
Termination                  23
Name: separationtype, dtype: int64

3.2.1 Extracting Data for Respondents Who Resigned

We can see that each dataframe contains a couple of different separation types. However, for this particular analysis, we're only interested in survey respondents who resigned. Therefore, only those separation types that contains the string Resignation will be analyzed.

We can also see that dete_survey_updated dataframe contains multiple separation types with the string Resignation:

  • Resignation-Other reasons
  • Resignation-Other employer
  • Resignation-Move overseas/interstate
In [21]:
# The pattern to search for and select
resignations = dete_survey_updated['separationtype'].str.contains(r"[Rr]esignation", na=False)

# Selects only data that have the desired string
dete_resignations = dete_survey_updated.copy()[resignations]

dete_resignations.head()
Out[21]:
id separationtype cease_date dete_start_date role_start_date position classification region business_unit employment_status career_move_to_public_sector career_move_to_private_sector interpersonal_conflicts job_dissatisfaction dissatisfaction_with_the_department physical_work_environment lack_of_recognition lack_of_job_security work_location employment_conditions maternity/family relocation study/travel ill_health traumatic_incident work_life_balance workload none_of_the_above gender age aboriginal torres_strait south_sea disability nesb
3 4 Resignation-Other reasons 05/2012 2005.0 2006.0 Teacher Primary Central Queensland NaN Permanent Full-time False True False False False False False False False False False False False False False False False False Female 36-40 NaN NaN NaN NaN NaN
5 6 Resignation-Other reasons 05/2012 1994.0 1997.0 Guidance Officer NaN Central Office Education Queensland Permanent Full-time False True False False False False False False False True True False False False False False False False Female 41-45 NaN NaN NaN NaN NaN
8 9 Resignation-Other reasons 07/2012 2009.0 2009.0 Teacher Secondary North Queensland NaN Permanent Full-time False True False False False False False False False False False False False False False False False False Female 31-35 NaN NaN NaN NaN NaN
9 10 Resignation-Other employer 2012 1997.0 2008.0 Teacher Aide NaN NaN NaN Permanent Part-time False False True True True False False False False False False False False False False False False False Female 46-50 NaN NaN NaN NaN NaN
11 12 Resignation-Move overseas/interstate 2012 2009.0 2009.0 Teacher Secondary Far North Queensland NaN Permanent Full-time False False False False False False False False False False True True False False False False False False Male 31-35 NaN NaN NaN NaN NaN
In [22]:
# Separates and select Resignation from the strings
dete_resignations['separationtype'] = dete_survey_updated['separationtype'].str.split('-').str[0]
In [23]:
# Unique values in the updated column
dete_resignations['separationtype'].value_counts()
Out[23]:
Resignation    311
Name: separationtype, dtype: int64

Above, we select only data whose separationtype column contains the string - "Resignation". This is solely because we're only going to analyze the data of the respondents who resigned from their job.

We can see above that the dataframe dete_resignations now contains 311 rows of data.

The separationtype column in tafe_survey_updated dataframe has only one type of resignation string. Next, we'll select those data with Resignation directly into a new dataframe.

In [24]:
# Selects only data which have the desired string
tafe_resignations = tafe_survey_updated[tafe_survey_updated['separationtype'] == 'Resignation'].copy()
tafe_resignations.head()
Out[24]:
id Institute WorkArea cease_date separationtype Contributing Factors. Career Move - Public Sector Contributing Factors. Career Move - Private Sector Contributing Factors. Career Move - Self-employment Contributing Factors. Ill Health Contributing Factors. Maternity/Family Contributing Factors. Dissatisfaction Contributing Factors. Job Dissatisfaction Contributing Factors. Interpersonal Conflict Contributing Factors. Study Contributing Factors. Travel Contributing Factors. Other Contributing Factors. NONE gender age Employment Type.     Employment Type Classification.     Classification institute_service role_service
3 634139903350000000 Mount Isa Institute of TAFE Non-Delivery (corporate) 2010.0 Resignation - - - - - - - - - Travel - - NaN NaN NaN NaN NaN NaN
4 634146578511788000 Southern Queensland Institute of TAFE Delivery (teaching) 2010.0 Resignation - Career Move - Private Sector - - - - - - - - - - Male 41 – 45 Permanent Full-time Teacher (including LVT) 3-4 3-4
5 634147506906311000 Southern Queensland Institute of TAFE Delivery (teaching) 2010.0 Resignation - - - - - - - - - - Other - Female 56 or older Contract/casual Teacher (including LVT) 7-10 7-10
6 634152007975694000 Barrier Reef Institute of TAFE Non-Delivery (corporate) 2010.0 Resignation - Career Move - Private Sector - - Maternity/Family - - - - - Other - Male 20 or younger Temporary Full-time Administration (AO) 3-4 3-4
7 634153745310374000 Southern Queensland Institute of TAFE Delivery (teaching) 2010.0 Resignation - - - - - - - - - - Other - Male 46 – 50 Permanent Full-time Teacher (including LVT) 3-4 3-4
In [25]:
# Unique values in the updated column
tafe_resignations['separationtype'].value_counts()
Out[25]:
Resignation    340
Name: separationtype, dtype: int64

After selecting just the rows that contains Resignation in their separationtype column, we can see above that the dataframe tafe_resignations now contains 340 rows of data.

3.3 Checking for Inconsistencies

We'll continue by checking for other errors in the dataframes. We'll do this by examining each column we need for this anaysis:

Let's take a look at the cease_date and dete_start_date in the dete_resignations dataframe and see whether there's need for any adjustments based on these:

  • Since the cease_date is the last year of the person's employment and the dete_start_date is the person's first year of employment, it wouldn't make sense to have years after the current date.
  • Given that most people in this field start working in their 20s, it's also unlikely that the dete_start_date was before the year 1940.

So if we have years (in dete_start_date) higher than the cease_date or lower than 1940, we wouldn't want to continue with the analysis. The reason is that there may be something very wrong with the data. However, if there are just a small amount of values that are high or low, we'll remove them.

In [26]:
dete_resignations['cease_date'].value_counts(ascending=True)
Out[26]:
09/2010      1
2010         1
07/2006      1
07/2012      1
05/2013      2
05/2012      2
08/2013      4
10/2013      6
11/2013      9
07/2013      9
09/2013     11
06/2013     14
12/2013     17
01/2014     22
2013        74
2012       126
Name: cease_date, dtype: int64
In [27]:
dete_resignations['dete_start_date'].value_counts().sort_index(ascending=True)
Out[27]:
1963.0     1
1971.0     1
1972.0     1
1973.0     1
1974.0     2
1975.0     1
1976.0     2
1977.0     1
1980.0     5
1982.0     1
1983.0     2
1984.0     1
1985.0     3
1986.0     3
1987.0     1
1988.0     4
1989.0     4
1990.0     5
1991.0     4
1992.0     6
1993.0     5
1994.0     6
1995.0     4
1996.0     6
1997.0     5
1998.0     6
1999.0     8
2000.0     9
2001.0     3
2002.0     6
2003.0     6
2004.0    14
2005.0    15
2006.0    13
2007.0    21
2008.0    22
2009.0    13
2010.0    17
2011.0    24
2012.0    21
2013.0    10
Name: dete_start_date, dtype: int64

The cease_date column in dete_resignation dataframe contains two different date formats: some only contain year while others contain month and year. There is a need to change these dates to take a single and similar form across the rows.

In [28]:
# Pattern for extracting date format
pattern = r"/?(?P<Year>[1-2][0-9]{3})"

dete_resignations['cease_date'] = dete_resignations['cease_date'].astype(str).str.extract(pattern).astype(float)
dete_resignations['cease_date'].value_counts().sort_index()
Out[28]:
2006.0      1
2010.0      2
2012.0    129
2013.0    146
2014.0     22
Name: cease_date, dtype: int64
In [29]:
# Box plot to check for outliers
plt.figure(figsize=(10, 10))
dete_resignations.boxplot(column=['cease_date']).set_ylim(2005,2018)
Out[29]:
(2005, 2018)
In [30]:
# Box plot to check for outliers
plt.figure(figsize=(10, 10))
dete_resignations.boxplot(column=['dete_start_date']).set_ylim(1960,2014)
Out[30]:
(1960, 2014)
In [31]:
# Sorted values in the cease_date column
tafe_resignations['cease_date'].value_counts().sort_index()
Out[31]:
2009.0      2
2010.0     68
2011.0    116
2012.0     94
2013.0     55
Name: cease_date, dtype: int64
In [32]:
# Box plot to check for outliers
plt.figure(figsize=(10, 10))
tafe_resignations.boxplot(column=['cease_date']).set_ylim(2005,2018)
Out[32]:
(2005, 2018)

Here is what we can deduce from the information above:

  • There isn't any major issue with the years in the dete_start_date and cease_date.
  • The years in both dataframes don't align. The dete_resignations dataframe contains outliers (2006 and 2010) while tafe_resignations has outliers at 2009 (a year absent from the other dataframe). The tafe_resignations dataframe also contains more cease dates (68) in 2010 compared to the number of cease dates in dete_resignations (2).

Since we're not interested in analyzing the data according to year, we'll leave them as they are.

So far so good, we have rectified the years in the columns to follow a single uniform pattern in the two dataframes. We also didn't find any major issues with the years. Additionally, we saw that the years in each dataframe don't span the same number of years.

4. Refining and Combining Datasets

4.1 Computing Employee Years of Service

As a reminder, our goal is to answer this question:

  • Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about those employees who have been at the job longer?

In order to answer the questions, we'll use the now verified years in the dete_registrations dataframe to create a new column. In the Human Resources field, the length of time an employee spent in a workplace is referred to as their years of service. The tafe_resignations dataframe already contains a "service" column that we renamed to institute_service. Therefore, in order to analyze both surveys together, we'll create a corresponding institute_service column in dete_resignations.

In [33]:
dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']
dete_resignations['institute_service'].head()
Out[33]:
3      7.0
5     18.0
8      3.0
9     15.0
11     3.0
Name: institute_service, dtype: float64

4.2 Classifying Employee

We have successfully created a new corresponding column named institute_service in the dete_resignations dataframe. Thus, we can now analyze the survey respondents according to their length of employment. Firstly, we'll identify and classify employees who resigned because they were dissatisfied. The reason for which a worker's employment ended at both institutes are indicated by the column separationtype in both datasets. Other factors that may have contributed to the job cessation are indicated in other columns of the datasets.

According to a dictionary meaning,

Dissatisfaction: The feeling of being displeased and discontent.

Based on that meaning, we'll select some columns from the columns in both dataframes to categorize employees as "dissatisfied". The selected columns are listed below

  1. dete_survey:
    • job_dissatisfaction
    • dissatisfaction_with_the_department
    • physical_work_environment
    • lack_of_recognition
    • lack_of_job_security
    • work_location
    • employment_conditions
    • work_life_balance
    • workload
  2. tafe_survey:
    • Contributing Factors. Dissatisfaction
    • Contributing Factors. Job Dissatisfaction

If the employee indicated any of the above caused them to resign, we'll mark them as dissatisfied in a new column as following:

  • True: indicates a person resigned because they were dissatisfied in some way
  • False: indicates a person resigned because of a reason other than job dissatisfaction
  • NaN: indicates missing values

Let's start by viewing the values in the two columns of Contributing Factors in tafe_resignations dataframe.

In [34]:
# Checks the unique values
tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts()
Out[34]:
-                                         277
Contributing Factors. Dissatisfaction      55
Name: Contributing Factors. Dissatisfaction, dtype: int64
In [35]:
# Checks the unique values
tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts()
Out[35]:
-                      270
Job Dissatisfaction     62
Name: Contributing Factors. Job Dissatisfaction, dtype: int64

There are 55 respondents who resigned due to dissatisfaction in the Contributing Factors. Dissatisfaction column. In the other column (Contributing Factors. Job Dissatisfaction), 62 respondents resigned due to dissatisfaction.

In [36]:
# Updates values in columns to either True, False, or NaN
def update_vals(value):
    if pd.isnull(value):
        return np.nan

    elif value == '-':
        return False

    else:
        return True
In [37]:
# Creates a new column for dissatisfaction state
tafe_resignations['dissatisfied'] = tafe_resignations[['Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction']].applymap(update_vals).any(1, skipna=False)
tafe_resignations_up = tafe_resignations.copy()

# Checks the unique values of the new column
tafe_resignations_up['dissatisfied'].value_counts(dropna=False)
Out[37]:
False    241
True      91
NaN        8
Name: dissatisfied, dtype: int64
In [38]:
# Creates a new column for dissatisfaction state
dete_resignations['dissatisfied'] = dete_resignations[['job_dissatisfaction', 'dissatisfaction_with_the_department', 'physical_work_environment', 'lack_of_recognition', 'lack_of_job_security', 'work_location', 'employment_conditions', 'work_life_balance', 'workload']].any(1, skipna=False)
dete_resignations_up = dete_resignations.copy()

# Checks the unique values of the new column
dete_resignations_up['dissatisfied'].value_counts(dropna=False)
Out[38]:
False    162
True     149
Name: dissatisfied, dtype: int64

Hooray! We've successfully created a new column dissatisfied in both dataframes classifying each respondent whose reason for resignation was based on job dissatisfaction (True) or other reasons (False). This action is necessary because we hope to combine the two dataframe and be able to make deductions based on this reason.

4.3 Combining two dataframes

Up until this moment, we have worked to clean up and add valuable data. We've carried out these changes separately on each dataframe. Now, we're finally ready to combine the two datasets and take a big step towards answering our questions.

We're going to start by adding a column to each dataframe. This new column will allow us to easily distinguish between the two datasets after we've combined them.

In [39]:
# Adds a new column to the datasets
dete_resignations_up['institute'] = 'DETE'
tafe_resignations_up['institute'] = 'TAFE'
In [40]:
# Combines the datasets
combined = pd.concat([dete_resignations_up, tafe_resignations_up], ignore_index=True)

# Checks the number of non null values in each column
combined.notnull().sum().sort_values()
C:\Users\extraUNIQUEguy\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  
Out[40]:
torres_strait                                            0
south_sea                                                3
aboriginal                                               7
disability                                               8
nesb                                                     9
business_unit                                           32
classification                                         161
region                                                 265
role_start_date                                        271
dete_start_date                                        283
Classification.     Classification                     290
Employment Type.     Employment Type                   290
role_service                                           290
employment_status                                      307
position                                               308
none_of_the_above                                      311
ill_health                                             311
maternity/family                                       311
relocation                                             311
lack_of_recognition                                    311
lack_of_job_security                                   311
job_dissatisfaction                                    311
interpersonal_conflicts                                311
study/travel                                           311
traumatic_incident                                     311
work_life_balance                                      311
physical_work_environment                              311
employment_conditions                                  311
workload                                               311
dissatisfaction_with_the_department                    311
career_move_to_public_sector                           311
career_move_to_private_sector                          311
work_location                                          311
Contributing Factors. Career Move - Private Sector     332
Contributing Factors. Career Move - Public Sector      332
Contributing Factors. Career Move - Self-employment    332
Contributing Factors. Dissatisfaction                  332
Contributing Factors. Ill Health                       332
Contributing Factors. Interpersonal Conflict           332
Contributing Factors. Job Dissatisfaction              332
Contributing Factors. Maternity/Family                 332
Contributing Factors. NONE                             332
Contributing Factors. Study                            332
Contributing Factors. Travel                           332
Contributing Factors. Other                            332
Institute                                              340
WorkArea                                               340
institute_service                                      563
gender                                                 592
age                                                    596
cease_date                                             635
dissatisfied                                           643
separationtype                                         651
institute                                              651
id                                                     651
dtype: int64

We've successfully combined the two datasets. Next, we checked for the number of non null values in order to know the other unnecessary columns to drop from the combined dataframe.

In [41]:
# Columns prior to dropping
combined.columns
Out[41]:
Index(['Classification.     Classification',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Maternity/Family', 'Contributing Factors. NONE',
       'Contributing Factors. Other', 'Contributing Factors. Study',
       'Contributing Factors. Travel', 'Employment Type.     Employment Type',
       'Institute', 'WorkArea', 'aboriginal', 'age', 'business_unit',
       'career_move_to_private_sector', 'career_move_to_public_sector',
       'cease_date', 'classification', 'dete_start_date', 'disability',
       'dissatisfaction_with_the_department', 'dissatisfied',
       'employment_conditions', 'employment_status', 'gender', 'id',
       'ill_health', 'institute', 'institute_service',
       'interpersonal_conflicts', 'job_dissatisfaction',
       'lack_of_job_security', 'lack_of_recognition', 'maternity/family',
       'nesb', 'none_of_the_above', 'physical_work_environment', 'position',
       'region', 'relocation', 'role_service', 'role_start_date',
       'separationtype', 'south_sea', 'study/travel', 'torres_strait',
       'traumatic_incident', 'work_life_balance', 'work_location', 'workload'],
      dtype='object')
In [42]:
# Removes columns with less than 500 non null values
combined_updated = combined.dropna(thresh=500, axis=1).copy()
In [43]:
# Columns after dropping
combined_updated.columns
Out[43]:
Index(['age', 'cease_date', 'dissatisfied', 'gender', 'id', 'institute',
       'institute_service', 'separationtype'],
      dtype='object')

We dropped all columns with less than 500 non null values above. This is necessary for the effective management and analysis of the data. Essentially, columns with less than 500 non null values can not influence the result of our analysis.

Before we proceed to proper analysis, we want to clean up the institute_service column. Let's see the unique values in that particular column below:

In [44]:
combined_updated['institute_service'].value_counts(dropna=False)
Out[44]:
NaN                   88
Less than 1 year      73
1-2                   64
3-4                   63
5-6                   33
11-20                 26
5.0                   23
1.0                   22
7-10                  21
0.0                   20
3.0                   20
6.0                   17
4.0                   16
2.0                   14
9.0                   14
7.0                   13
More than 20 years    10
8.0                    8
13.0                   8
15.0                   7
20.0                   7
14.0                   6
10.0                   6
12.0                   6
17.0                   6
22.0                   6
16.0                   5
18.0                   5
24.0                   4
11.0                   4
23.0                   4
19.0                   3
39.0                   3
21.0                   3
32.0                   3
25.0                   2
26.0                   2
36.0                   2
28.0                   2
30.0                   2
33.0                   1
38.0                   1
35.0                   1
34.0                   1
31.0                   1
49.0                   1
29.0                   1
27.0                   1
42.0                   1
41.0                   1
Name: institute_service, dtype: int64

4.4 Categorizing Employees According to Career Stage

The column contains some very different forms of value. Tricky business! In order to analyze the data, we'll have to convert these values into categories below:

  • New: Less than 3 years in an organization
  • Experienced: 3-6 years in an organization
  • Established: 7-10 years in an organization
  • Veteran: 11 or more years in an organization

This analysis is based on this article, which makes the argument that understanding employee's needs according to career stage instead of age is more effective.

In [45]:
# Extracts years of service from column and converts to float
combined_updated['institute_service_up'] = combined_updated['institute_service'].astype('str').str.extract(r"(\d+)")
combined_updated['institute_service_up'] = combined_updated['institute_service_up'].astype('float')
In [46]:
# Confirms successful extraction
combined_updated['institute_service_up'].value_counts()
Out[46]:
1.0     159
3.0      83
5.0      56
7.0      34
11.0     30
0.0      20
20.0     17
6.0      17
4.0      16
9.0      14
2.0      14
13.0      8
8.0       8
15.0      7
17.0      6
10.0      6
12.0      6
14.0      6
22.0      6
16.0      5
18.0      5
24.0      4
23.0      4
39.0      3
19.0      3
21.0      3
32.0      3
28.0      2
36.0      2
25.0      2
30.0      2
26.0      2
29.0      1
38.0      1
42.0      1
27.0      1
41.0      1
35.0      1
49.0      1
34.0      1
33.0      1
31.0      1
Name: institute_service_up, dtype: int64
In [47]:
# Sorts years into different categories
def categorize(value):
    if value >= 11:
        return "Veteran"
    elif 3 <= value < 7:
        return "Experienced"
    elif 7 <= value < 11:
        return "Established"
    elif pd.isnull(value):
        return np.nan
    else:
        return "New"
    
# Applies function to the column
combined_updated['service_cat'] = combined_updated['institute_service_up'].apply(categorize)

# Confirms the changes
combined_updated['service_cat'].value_counts()
Out[47]:
New            193
Experienced    172
Veteran        136
Established     62
Name: service_cat, dtype: int64

We categorized the employees into four groups based on the number of years spent working. A new column service_cat was created to contain the category each employee falls to. Now it is easy to analyze our data and answer some salient questions.

5. Data Analysis

5.1 Performing Initial Analysis

In order to get started with our data analysis, we'll perform a small piece of analysis. We'll first fill in missing values in the dissatisfied column and then aggregate the data.

In [48]:
# Checks the unique values in the column
combined_updated['dissatisfied'].value_counts(dropna=False)
Out[48]:
False    403
True     240
NaN        8
Name: dissatisfied, dtype: int64
In [49]:
# Replaces missing values in the column
combined_updated['dissatisfied'] = combined_updated['dissatisfied'].fillna(False)

So far, we have found out the total number of workers who resigned due to some kinds of job dissatisfaction to be 240. To continue further with the analysis, we needed to fill the 8 missing values in the column with the most frequent category therein. The False group occurs most in the dissatisfied column, we therefore replaced the missing values with 'False'.
All the values in dissatisfied column now consists of Boolean values, meaning they're either True or False. Thus, we'll use a pivot table to aggregate the column and calculate the percentage of people in each group. This is possible because df.pivot_table() method treats Boolean values as integers, so a True value is considered to be 1 and a False value is considered to be 0.

In [50]:
# Calculate and visualize the percentage of dissatisfied employees
table_s_cat = combined_updated.pivot_table(index='service_cat', values='dissatisfied')
print(table_s_cat)

# Data to plot bar chart
plt.figure(figsize=(10,10))
sns.set_style('darkgrid')
ax = sns.barplot(x=table_s_cat.index, y=table_s_cat['dissatisfied'], palette=sns.color_palette('GnBu_d'))
ax.set_yticks([])
ax.set_title('Percentage of Dissatisfied Employee Based on Service Category')
ax.set_xlabel('Service Category')
ax.set_ylabel('Percentage')
ax.tick_params(bottom=True, top=False, left=False, right=False, labelbottom=True)
sns.despine(left=True)
             dissatisfied
service_cat              
Established      0.516129
Experienced      0.343023
New              0.295337
Veteran          0.485294

Let's also visualize this data using a pie chart:

In [51]:
# Data to plot pie chart
labels = 'Established', 'Experienced', 'New', 'Veteran'
colours = ['lightskyblue', 'yellowgreen', 'lightcoral', 'gold']
explode = (0.1, 0, 0, 0.1) # "Explodes" the first and last slice
plt.pie(table_s_cat, labels=labels, colors=colours, autopct='%1.1f%%', shadow=True, explode=explode, startangle=90)

plt.title('Percentage of Dissatisfied Employee Based on Service Category', bbox={'facecolor':'0.8', 'pad':5})
plt.axis('equal')   # Ensures pie is drawn as a circle
plt.show()
C:\Users\extraUNIQUEguy\Anaconda3\lib\site-packages\ipykernel_launcher.py:5: MatplotlibDeprecationWarning: Non-1D inputs to pie() are currently squeeze()d, but this behavior is deprecated since 3.1 and will be removed in 3.3; pass a 1D array instead.
  """

Above, we aggregated the data in the dissatisfied column and used the information to calculate the percentage of the people in each group. From this small analysis, we can presently conclude that more employees who have spent 7 or more years at a job (Established and Veteran categories) resigned due to some kinds of job dissatisfaction (with more than 60% of the total number of dissatisfied employees). Compared to the two aforementioned categories, employees with less than 7 years of service resigned less due to job dissatisfaction.

5.2 Conducting Further Analysis

Next, we'll handle the missing values in the other categories. We want to know how many people resigned due to some kind of dissatisfaction and the percentage according to gender. First, let's take a closer look at each column and clean up the age column, which is still in an untreatable form.

In [52]:
# Values in each column
gender_counts = combined_updated['gender'].value_counts(dropna=False)
print(gender_counts, '\n')
service_counts = combined_updated['service_cat'].value_counts(dropna=False)
print(service_counts, '\n')
age_counts = combined_updated['age'].value_counts(dropna=False)
print(age_counts)
Female    424
Male      168
NaN        59
Name: gender, dtype: int64 

New            193
Experienced    172
Veteran        136
NaN             88
Established     62
Name: service_cat, dtype: int64 

51-55            71
NaN              55
41-45            48
41 – 45          45
46-50            42
36-40            41
46 – 50          39
26-30            35
21 – 25          33
26 – 30          32
36 – 40          32
31 – 35          32
56 or older      29
31-35            29
21-25            29
56-60            26
61 or older      23
20 or younger    10
Name: age, dtype: int64
In [53]:
# Extracts the first age in each age group
combined_updated['age'] = combined_updated['age'].str.extract(r"(\d+)").astype(float)
combined_updated['age'].value_counts(dropna=False)
Out[53]:
41.0    93
46.0    81
36.0    73
51.0    71
26.0    67
21.0    62
31.0    61
NaN     55
56.0    55
61.0    23
20.0    10
Name: age, dtype: int64

The age column now looks good and can be analyzed. We can also observe that there are 55 missing values in the column. Before we decide on what to do with these missing values, let's first look at the missing values across the columns we are interested in.

In [54]:
# Creates heatmap to locate missing values
map_index = combined_updated.set_index('dissatisfied')
sns.heatmap(map_index.isnull(), cbar=False)
Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x163034faa88>

Above, we created a heatmap to check the location of the missing values in our dataframe. We can observe that the essential columns for our next analysis (age, gender and service_cat) all have roughly the same number of missing values and almost on the same rows of data. We'll continue by filling the missing age data with the average value of all the ages. It is however unnecessary to try to fill the other missing values and thus, we'll leave the dataset as is.

In [55]:
# Calculate the mean age
mean = combined_updated['age'].mean()

# Fill missing age with the mean value
combined_updated['age'] = combined_updated['age'].fillna(mean)

# Display the unique values in the age column
combined_updated['age'].value_counts(dropna=False)
Out[55]:
41.000000    93
46.000000    81
36.000000    73
51.000000    71
26.000000    67
21.000000    62
31.000000    61
39.271812    55
56.000000    55
61.000000    23
20.000000    10
Name: age, dtype: int64

Next, we'll group the ages into four different categories. These categories include workers in their:

  • 50s and above
  • 40s
  • 30s
  • 20s and below
In [56]:
# A function to group age
def categorize_age(age):
    if age >= 50:
        return '50s and above'
    elif 40 <= age < 50:
        return '40s'
    elif 30 <= age < 40:
        return '30s'
    elif 20 <= age < 30:
        return '20s and below'
In [57]:
# Apply function to categorize age into groups in a new column
combined_updated['age_cat'] = combined_updated['age'].apply(categorize_age)
combined_updated['age_cat'].value_counts()
Out[57]:
30s              189
40s              174
50s and above    149
20s and below    139
Name: age_cat, dtype: int64

As we can now see, we have 139 workers who are in their 20s or below, 189 workers in their 30s, 174 workers in their 40s and 149 workers in their 50s or above. This categorization looks encompassing enough.

5.2.1 Estimating Dissatisfied Workers Based on Age Category

Earlier, we aggregated dissatisfied workers based on service category. Right now, we want to do the same, albeit based on age category.

In [59]:
# Calculate and visualize the percentage of dissatisfied employees
table_a_cat = combined_updated.pivot_table(index='age_cat', values='dissatisfied')
print(table_a_cat)

# Data to plot
plt.figure(figsize=(10,10))
sns.set_style('darkgrid')
ax = sns.barplot(x=table_a_cat.index, y=table_a_cat['dissatisfied'], palette=sns.color_palette('GnBu_d'))
ax.set_yticks([])
ax.set_title('Percentage of Dissatisfied Employee Based on Age Category')
ax.set_xlabel('Age Category')
ax.set_ylabel('Percentage')
ax.tick_params(bottom=True, top=False, left=False, right=False, labelbottom=True)
sns.despine(left=True)
               dissatisfied
age_cat                    
20s and below      0.352518
30s                0.328042
40s                0.379310
50s and above      0.422819

Let's also visualize this data using a pie chart below:

In [60]:
labels = '20s and below', '30s', '40s', '50s and above'
colours = ['lightcoral', 'lightskyblue', 'gold', 'yellowgreen']
explode = (0.05, 0.05, 0.05, 0.1)
plt.pie(table_a_cat, labels=labels, colors=colours, autopct='%1.1f%%', shadow=True, explode=explode, startangle=90)

plt.title('Percentage of Dissatisfied Employee Based on Age Category', bbox={'facecolor':'0.8', 'pad':5})
plt.axis('equal')   # Ensures pie is drawn as a circle
plt.show()
C:\Users\extraUNIQUEguy\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: MatplotlibDeprecationWarning: Non-1D inputs to pie() are currently squeeze()d, but this behavior is deprecated since 3.1 and will be removed in 3.3; pass a 1D array instead.
  after removing the cwd from sys.path.

6.0 Result and Conclusion

In this project we set out to answer some important questions regarding worker's resignation. From the analysis we performed, we found the total number of workers who resigned due to some kinds of job dissatisfaction to be 240. We also found that compared to employees with less than 7 years of service, more employees who have spent 7 or more years at a job (Established and Veteran categories) resigned due to some kinds of job dissatisfaction (with more than 60% of the total number of dissatisfied employees).

Further into the analysis, we noticed that more workers resigned as they increase in age. We observed a progressive increase in the percentage of workers that resigned due to job dissatisfaction as they move from their 30s to 40s etc. We found the highest percentage of dissatisfied workers among those who are 50 years of age and above. However, we found that more workers in their 20s and below resigned due to job dissatisfaction compared to those in their 30s.

One of the reasons this might be so is that in many cases, most in their 20s are recent graduates, new entrants in the workforce, single (or at least unmarried). Many of these young individuals, fresh out of college, lockstep with peers, carry much expectation and expect a workplace to be ideal and a perfect fit. But then after these expectations are not met, they may become grumpy and dissatisfied with their jobs. However, when individuals attain their 30s they probably are now at peace, develop more emotional, moral and intellectual alignment with the world. These individuals have now attained another level of self-awareness and may put up more with some job dissatisfactions. Then, this character probably tends to simmer down as individuals become much older as seen by this analysis.

0.0 Rough Analysis

Here, we want to present the numbers of people in each service category and the gender differences.

In [61]:
# Dataframe of dissatisfied employees in the Veteran category
dissatisfied_veteran = combined_updated.loc[(combined_updated['dissatisfied'] == True) & (combined_updated['service_cat'] == 'Veteran')]
print('Total Dissatisfied Veteran Workers = ' + str(len(dissatisfied_veteran)))
Total Dissatisfied Veteran Workers = 66
In [63]:
dissatisfied_veteran['gender'].value_counts(dropna=False)
Out[63]:
Female    45
Male      19
NaN        2
Name: gender, dtype: int64
In [64]:
# Dataframe of dissatisfied employees in the Established category
dissatisfied_established = combined_updated.loc[(combined_updated['dissatisfied'] == True) & (combined_updated['service_cat'] == 'Established')]
print('Total Dissatisfied Established Workers = ' + str(len(dissatisfied_established)))
Total Dissatisfied Established Workers = 32
In [65]:
dissatisfied_established['gender'].value_counts(dropna=False)
Out[65]:
Female    24
Male       8
Name: gender, dtype: int64
In [66]:
# Dataframe of dissatisfied employees in the Experienced category
dissatisfied_experienced = combined_updated.loc[(combined_updated['dissatisfied'] == True) & (combined_updated['service_cat'] == 'Experienced')]
print('Total Dissatisfied Experienced Workers = ' + str(len(dissatisfied_experienced)))
Total Dissatisfied Experienced Workers = 59
In [67]:
dissatisfied_experienced['gender'].value_counts(dropna=False)
Out[67]:
Female    44
Male      15
Name: gender, dtype: int64
In [68]:
# Dataframe of dissatisfied employees in the New category
dissatisfied_new = combined_updated.loc[(combined_updated['dissatisfied'] == True) & (combined_updated['service_cat'] == 'New')]
print('Total Dissatisfied New Workers = ' + str(len(dissatisfied_new)))
Total Dissatisfied New Workers = 57
In [69]:
dissatisfied_new['gender'].value_counts(dropna=False)
Out[69]:
Female    37
Male      20
Name: gender, dtype: int64