Determining the Cause of Workers Resignation: Analyzing Exit Surveys From Employees¶

1. Introduction¶

There are countless reasons why employees resign or retire early. Reasons may include some dissatisfaction with employers or some concerns over family and health. Whatever reason it may be for an employee to resign, it is necessary for employers to understand the causes and be able to reduce workers exit or resignation. Surveys and staff feedback provide employers with valuable information on the reasons why their employees resign or retire. The information is used to inform attraction and retention initiatives and to improve work practices across an organization to ensure the organization is considered an employer of choice.

1.1 Project Questions¶

Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about those employees who have been there longer?
Are younger employees resigning due to some kind of dissatisfaction? What about older employees?

1.2 Aim¶

In this project, our goal is to provide answers to the project questions above.

1.3 Datasets¶

In this project, we'll work with two datasets:

DETE Exit Survey: exit surveys from employees of the Department of Education, Training and Employment (DETE).
TAFE Exit Survey: exit surveys from employees of Technical and Further Education (TAFE).

Both datasets originated from departments in Queensland, Australia. The DETE and TAFE Exit Surveys were developed to effectively canvass the opinions and attitudes of departing employees to identify a wide range of operational, organizational and personal variables affecting the decision to leave.

Data Dictionary

A dictionary wasn't provided with the datasets. However, for this project, we'll use our general knowledge to define some of the columns we're going to work with. We'll provide another combined dictionary later as we move forward.

dete_survey.csv

Column	Definition
ID	An id used to identify the participant of the survey
SeparationType	The reason why the person's employment ended
Cease Date	The year or month the person's employment ended
DETE Start Date	The year the person began employment with the DETE

tafe_survey.csv

Column	Definition
Record ID	An id used to identify the participant of the survey
Reason for ceasing employment	The reason why the person's employment ended
LengthofServiceOverall. Overall Legth of Service at Institute (in years)	The length of the person's employment (in years).

Result¶

In this project, it was found that the total number of workers who resigned due to some kinds of job dissatisfaction totalled 240. Out of this number, we found that, compared to employees with less than 7 years of service, more employees who have spent 7 or more years at a job (Established and Veteran categories) resigned due to some kinds of job dissatisfaction (with more than 60% of the total number of dissatisfied employees).

Further into the analysis, we noticed that more workers resigned as they increase in age. We observed a progressive increase in the percentage of workers that resigned due to job dissatisfaction as they move from their 30s to 40s etc. We found the highest percentage of dissatisfied workers among those who are 50 years of age and above. However, we found that more workers in their 20s and below resigned due to job dissatisfaction compared to those in their 30s.

In [1]:

# Import needed libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Read-in datasets with pandas
dete_survey = pd.read_csv('dete_survey.csv', encoding="Latin-1")
tafe_survey = pd.read_csv('tafe_survey.csv', encoding="Latin-1")

2. Overview of the Datasets (Data Exploration)¶

The first action to carry out is to get an overview of what the two datasets look like. These initial and somewhat simple actions will direct our steps into the carrying out proper data cleaning and eventually, data analysis. We'll familiarize ourselves with the two datasets using some important methods. At the end, we'll have adequate knowledge to guide us as we move forward with the analysis and towards the set goals.

2.1 Exploring DETE Survey File:¶

In [2]:

# This line of code helps to display full columns in the output
pd.options.display.max_columns = 150

# Familiarizing ourselves with the DETE dataset
dete_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 56 columns):
ID                                     822 non-null int64
SeparationType                         822 non-null object
Cease Date                             822 non-null object
DETE Start Date                        822 non-null object
Role Start Date                        822 non-null object
Position                               817 non-null object
Classification                         455 non-null object
Region                                 822 non-null object
Business Unit                          126 non-null object
Employment Status                      817 non-null object
Career move to public sector           822 non-null bool
Career move to private sector          822 non-null bool
Interpersonal conflicts                822 non-null bool
Job dissatisfaction                    822 non-null bool
Dissatisfaction with the department    822 non-null bool
Physical work environment              822 non-null bool
Lack of recognition                    822 non-null bool
Lack of job security                   822 non-null bool
Work location                          822 non-null bool
Employment conditions                  822 non-null bool
Maternity/family                       822 non-null bool
Relocation                             822 non-null bool
Study/Travel                           822 non-null bool
Ill Health                             822 non-null bool
Traumatic incident                     822 non-null bool
Work life balance                      822 non-null bool
Workload                               822 non-null bool
None of the above                      822 non-null bool
Professional Development               808 non-null object
Opportunities for promotion            735 non-null object
Staff morale                           816 non-null object
Workplace issue                        788 non-null object
Physical environment                   817 non-null object
Worklife balance                       815 non-null object
Stress and pressure support            810 non-null object
Performance of supervisor              813 non-null object
Peer support                           812 non-null object
Initiative                             813 non-null object
Skills                                 811 non-null object
Coach                                  767 non-null object
Career Aspirations                     746 non-null object
Feedback                               792 non-null object
Further PD                             768 non-null object
Communication                          814 non-null object
My say                                 812 non-null object
Information                            816 non-null object
Kept informed                          813 non-null object
Wellness programs                      766 non-null object
Health & Safety                        793 non-null object
Gender                                 798 non-null object
Age                                    811 non-null object
Aboriginal                             16 non-null object
Torres Strait                          3 non-null object
South Sea                              7 non-null object
Disability                             23 non-null object
NESB                                   32 non-null object
dtypes: bool(18), int64(1), object(37)
memory usage: 258.6+ KB

In [3]:

dete_survey.head()

Out[3]:

	ID	SeparationType	Cease Date	DETE Start Date	Role Start Date	Position	Classification	Region	Business Unit	Employment Status	Career move to public sector	Career move to private sector	Interpersonal conflicts	Job dissatisfaction	Dissatisfaction with the department	Physical work environment	Lack of recognition	Lack of job security	Work location	Employment conditions	Maternity/family	Relocation	Study/Travel	Ill Health	Traumatic incident	Work life balance	Workload	None of the above	Professional Development	Opportunities for promotion	Staff morale	Workplace issue	Physical environment	Worklife balance	Stress and pressure support	Performance of supervisor	Peer support	Initiative	Skills	Coach	Career Aspirations	Feedback	Further PD	Communication	My say	Information	Kept informed	Wellness programs	Health & Safety	Gender	Age	Aboriginal	Torres Strait	South Sea	Disability	NESB
0	1	Ill Health Retirement	08/2012	1984	2004	Public Servant	A01-A04	Central Office	Corporate Strategy and Peformance	Permanent Full-time	True	False	False	True	False	False	True	False	False	False	False	False	False	False	False	False	False	True	A	A	N	N	N	A	A	A	A	N	N	N	A	A	A	N	A	A	N	N	N	Male	56-60	NaN	NaN	NaN	NaN	Yes
1	2	Voluntary Early Retirement (VER)	08/2012	Not Stated	Not Stated	Public Servant	AO5-AO7	Central Office	Corporate Strategy and Peformance	Permanent Full-time	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	A	A	N	N	N	N	A	A	A	N	N	N	A	A	A	N	A	A	N	N	N	Male	56-60	NaN	NaN	NaN	NaN	NaN
2	3	Voluntary Early Retirement (VER)	05/2012	2011	2011	Schools Officer	NaN	Central Office	Education Queensland	Permanent Full-time	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	True	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	A	A	N	N	N	N	Male	61 or older	NaN	NaN	NaN	NaN	NaN
3	4	Resignation-Other reasons	05/2012	2005	2006	Teacher	Primary	Central Queensland	NaN	Permanent Full-time	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	A	N	N	N	A	A	N	N	A	A	A	A	A	A	A	A	A	A	A	N	A	Female	36-40	NaN	NaN	NaN	NaN	NaN
4	5	Age Retirement	05/2012	1970	1989	Head of Curriculum/Head of Special Education	NaN	South East	NaN	Permanent Full-time	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	True	False	False	A	A	N	N	D	D	N	A	A	A	A	A	A	SA	SA	D	D	A	N	A	M	Female	61 or older	NaN	NaN	NaN	NaN	NaN

In [4]:

# Checking for unique entries in the column
dete_survey['SeparationType'].value_counts()

Out[4]:

Age Retirement                          285
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Voluntary Early Retirement (VER)         67
Ill Health Retirement                    61
Other                                    49
Contract Expired                         34
Termination                              15
Name: SeparationType, dtype: int64

In [5]:

# Getting the number of null values in the dataframe
dete_survey.isnull().sum()

Out[5]:

ID                                       0
SeparationType                           0
Cease Date                               0
DETE Start Date                          0
Role Start Date                          0
Position                                 5
Classification                         367
Region                                   0
Business Unit                          696
Employment Status                        5
Career move to public sector             0
Career move to private sector            0
Interpersonal conflicts                  0
Job dissatisfaction                      0
Dissatisfaction with the department      0
Physical work environment                0
Lack of recognition                      0
Lack of job security                     0
Work location                            0
Employment conditions                    0
Maternity/family                         0
Relocation                               0
Study/Travel                             0
Ill Health                               0
Traumatic incident                       0
Work life balance                        0
Workload                                 0
None of the above                        0
Professional Development                14
Opportunities for promotion             87
Staff morale                             6
Workplace issue                         34
Physical environment                     5
Worklife balance                         7
Stress and pressure support             12
Performance of supervisor                9
Peer support                            10
Initiative                               9
Skills                                  11
Coach                                   55
Career Aspirations                      76
Feedback                                30
Further PD                              54
Communication                            8
My say                                  10
Information                              6
Kept informed                            9
Wellness programs                       56
Health & Safety                         29
Gender                                  24
Age                                     11
Aboriginal                             806
Torres Strait                          819
South Sea                              815
Disability                             799
NESB                                   790
dtype: int64

2.2 Exploring TAFE Survey File:¶

In [6]:

# Familiarizing ourselves with the TAFE dataset
tafe_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 72 columns):
Record ID 702 non-null int64
Institute 702 non-null object
WorkArea 702 non-null object
CESSATION YEAR 695 non-null float64
Reason for ceasing employment 701 non-null object
Contributing Factors. Career Move - Public Sector 437 non-null object
Contributing Factors. Career Move - Private Sector 437 non-null object
Contributing Factors. Career Move - Self-employment 437 non-null object
Contributing Factors. Ill Health 437 non-null object
Contributing Factors. Maternity/Family 437 non-null object
Contributing Factors. Dissatisfaction 437 non-null object
Contributing Factors. Job Dissatisfaction 437 non-null object
Contributing Factors. Interpersonal Conflict 437 non-null object
Contributing Factors. Study 437 non-null object
Contributing Factors. Travel 437 non-null object
Contributing Factors. Other 437 non-null object
Contributing Factors. NONE 437 non-null object
Main Factor. Which of these was the main factor for leaving? 113 non-null object
InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction 608 non-null object
InstituteViews. Topic:2. I was given access to skills training to help me do my job better 613 non-null object
InstituteViews. Topic:3. I was given adequate opportunities for personal development 610 non-null object
InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL% 608 non-null object
InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had 615 non-null object
InstituteViews. Topic:6. The organisation recognised when staff did good work 607 non-null object
InstituteViews. Topic:7. Management was generally supportive of me 614 non-null object
InstituteViews. Topic:8. Management was generally supportive of my team 608 non-null object
InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me 610 non-null object
InstituteViews. Topic:10. Staff morale was positive within the Institute 602 non-null object
InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly 601 non-null object
InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently 597 non-null object
InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly 601 non-null object
WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit 609 non-null object
WorkUnitViews. Topic:15. I worked well with my colleagues 605 non-null object
WorkUnitViews. Topic:16. My job was challenging and interesting 607 non-null object
WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work 610 non-null object
WorkUnitViews. Topic:18. I had sufficient contact with other people in my job 613 non-null object
WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job 609 non-null object
WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job 609 non-null object
WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT] 608 non-null object
WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job 608 non-null object
WorkUnitViews. Topic:23. My job provided sufficient variety 611 non-null object
WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job 610 non-null object
WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction 611 non-null object
WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance 606 non-null object
WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area 610 non-null object
WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date 609 non-null object
WorkUnitViews. Topic:29. There was adequate communication between staff in my unit 603 non-null object
WorkUnitViews. Topic:30. Staff morale was positive within my work unit 606 non-null object
Induction. Did you undertake Workplace Induction? 619 non-null object
InductionInfo. Topic:Did you undertake a Corporate Induction? 432 non-null object
InductionInfo. Topic:Did you undertake a Institute Induction? 483 non-null object
InductionInfo. Topic: Did you undertake Team Induction? 440 non-null object
InductionInfo. Face to Face Topic:Did you undertake a Corporate Induction; Category:How it was conducted? 555 non-null object
InductionInfo. On-line Topic:Did you undertake a Corporate Induction; Category:How it was conducted? 555 non-null object
InductionInfo. Induction Manual Topic:Did you undertake a Corporate Induction? 555 non-null object
InductionInfo. Face to Face Topic:Did you undertake a Institute Induction? 530 non-null object
InductionInfo. On-line Topic:Did you undertake a Institute Induction? 555 non-null object
InductionInfo. Induction Manual Topic:Did you undertake a Institute Induction? 553 non-null object
InductionInfo. Face to Face Topic: Did you undertake Team Induction; Category? 555 non-null object
InductionInfo. On-line Topic: Did you undertake Team Induction?process you undertook and how it was conducted.] 555 non-null object
InductionInfo. Induction Manual Topic: Did you undertake Team Induction? 555 non-null object
Workplace. Topic:Did you and your Manager develop a Performance and Professional Development Plan (PPDP)? 608 non-null object
Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination? 594 non-null object
Workplace. Topic:Does your workplace promote and practice the principles of employment equity? 587 non-null object
Workplace. Topic:Does your workplace value the diversity of its employees? 586 non-null object
Workplace. Topic:Would you recommend the Institute as an employer to others? 581 non-null object
Gender. What is your Gender? 596 non-null object
CurrentAge. Current Age 596 non-null object
Employment Type. Employment Type 596 non-null object
Classification. Classification 596 non-null object
LengthofServiceOverall. Overall Length of Service at Institute (in years) 596 non-null object
LengthofServiceCurrent. Length of Service at current workplace (in years) 596 non-null object
dtypes: float64(1), int64(1), object(70)
memory usage: 395.0+ KB

In [7]:

tafe_survey.head()

Out[7]:

	Record ID	Institute	WorkArea	CESSATION YEAR	Reason for ceasing employment	Contributing Factors. Career Move - Public Sector	Contributing Factors. Career Move - Private Sector	Contributing Factors. Career Move - Self-employment	Contributing Factors. Ill Health	Contributing Factors. Maternity/Family	Contributing Factors. Dissatisfaction	Contributing Factors. Job Dissatisfaction	Contributing Factors. Interpersonal Conflict	Contributing Factors. Study	Contributing Factors. Travel	Contributing Factors. Other	Contributing Factors. NONE	Main Factor. Which of these was the main factor for leaving?	InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction	InstituteViews. Topic:2. I was given access to skills training to help me do my job better	InstituteViews. Topic:3. I was given adequate opportunities for personal development	InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL%	InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had	InstituteViews. Topic:6. The organisation recognised when staff did good work	InstituteViews. Topic:7. Management was generally supportive of me	InstituteViews. Topic:8. Management was generally supportive of my team	InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me	InstituteViews. Topic:10. Staff morale was positive within the Institute	InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly	InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently	InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly	WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit	WorkUnitViews. Topic:15. I worked well with my colleagues	WorkUnitViews. Topic:16. My job was challenging and interesting	WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work	WorkUnitViews. Topic:18. I had sufficient contact with other people in my job	WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job	WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job	WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT]	WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job	WorkUnitViews. Topic:23. My job provided sufficient variety	WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job	WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction	WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance	WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area	WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date	WorkUnitViews. Topic:29. There was adequate communication between staff in my unit	WorkUnitViews. Topic:30. Staff morale was positive within my work unit	Induction. Did you undertake Workplace Induction?	InductionInfo. Topic:Did you undertake a Corporate Induction?	InductionInfo. Topic:Did you undertake a Institute Induction?	InductionInfo. Topic: Did you undertake Team Induction?	InductionInfo. Face to Face Topic:Did you undertake a Corporate Induction; Category:How it was conducted?	InductionInfo. On-line Topic:Did you undertake a Corporate Induction; Category:How it was conducted?	InductionInfo. Induction Manual Topic:Did you undertake a Corporate Induction?	InductionInfo. Face to Face Topic:Did you undertake a Institute Induction?	InductionInfo. On-line Topic:Did you undertake a Institute Induction?	InductionInfo. Induction Manual Topic:Did you undertake a Institute Induction?	InductionInfo. Face to Face Topic: Did you undertake Team Induction; Category?	InductionInfo. On-line Topic: Did you undertake Team Induction?process you undertook and how it was conducted.]	InductionInfo. Induction Manual Topic: Did you undertake Team Induction?	Workplace. Topic:Did you and your Manager develop a Performance and Professional Development Plan (PPDP)?	Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?	Workplace. Topic:Does your workplace promote and practice the principles of employment equity?	Workplace. Topic:Does your workplace value the diversity of its employees?	Workplace. Topic:Would you recommend the Institute as an employer to others?	Gender. What is your Gender?	CurrentAge. Current Age	Employment Type. Employment Type	Classification. Classification	LengthofServiceOverall. Overall Length of Service at Institute (in years)	LengthofServiceCurrent. Length of Service at current workplace (in years)
0	634133009996094000	Southern Queensland Institute of TAFE	Non-Delivery (corporate)	2010.0	Contract Expired	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Agree	Agree	Agree	Neutral	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Strongly Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Neutral	Agree	Agree	Yes	Yes	Yes	Yes	Face to Face	-	-	Face to Face	-	-	Face to Face	-	-	Yes	Yes	Yes	Yes	Yes	Female	26 30	Temporary Full-time	Administration (AO)	1-2	1-2
1	634133654064531000	Mount Isa Institute of TAFE	Non-Delivery (corporate)	2010.0	Retirement	-	-	-	-	-	-	-	-	-	Travel	-	-	NaN	Agree	Agree	Agree	Agree	Agree	Strongly Agree	Strongly Agree	Agree	Strongly Agree	Agree	Agree	Agree	Disagree	Strongly Agree	Strongly Agree	Strongly Agree	Agree	Agree	Agree	Strongly Agree	Agree	Agree	Agree	Strongly Agree	Agree	Strongly Agree	Strongly Agree	Agree	Agree	Strongly Agree	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Yes	Yes	Yes	Yes	Yes	NaN	NaN	NaN	NaN	NaN	NaN
2	634138845606563000	Mount Isa Institute of TAFE	Delivery (teaching)	2010.0	Retirement	-	-	-	-	-	-	-	-	-	-	-	NONE	NaN	Agree	Agree	Agree	Agree	Agree	Agree	Strongly Agree	Agree	Agree	Agree	Agree	Neutral	Neutral	Strongly Agree	Strongly Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Yes	Yes	Yes	Yes	Yes	NaN	NaN	NaN	NaN	NaN	NaN
3	634139903350000000	Mount Isa Institute of TAFE	Non-Delivery (corporate)	2010.0	Resignation	-	-	-	-	-	-	-	-	-	Travel	-	-	NaN	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Yes	No	Yes	Yes	-	-	-	NaN	-	-	-	-	-	Yes	Yes	Yes	Yes	Yes	NaN	NaN	NaN	NaN	NaN	NaN
4	634146578511788000	Southern Queensland Institute of TAFE	Delivery (teaching)	2010.0	Resignation	-	Career Move - Private Sector	-	-	-	-	-	-	-	-	-	-	NaN	Agree	Agree	Strongly Agree	Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Agree	Strongly Agree	Strongly Agree	Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Yes	Yes	Yes	Yes	-	-	Induction Manual	Face to Face	-	-	Face to Face	-	-	Yes	Yes	Yes	Yes	Yes	Male	41 45	Permanent Full-time	Teacher (including LVT)	3-4	3-4

In [8]:

# Checking for unique entries in the column
tafe_survey['Reason for ceasing employment'].value_counts()

Out[8]:

Resignation                 340
Contract Expired            127
Retrenchment/ Redundancy    104
Retirement                   82
Transfer                     25
Termination                  23
Name: Reason for ceasing employment, dtype: int64

In [9]:

# Getting the number of null values in the dataframe
tafe_survey.isnull().sum()

Out[9]:

Record ID                                                                      0
Institute                                                                      0
WorkArea                                                                       0
CESSATION YEAR                                                                 7
Reason for ceasing employment                                                  1
                                                                            ... 
CurrentAge.     Current Age                                                  106
Employment Type.     Employment Type                                         106
Classification.     Classification                                           106
LengthofServiceOverall. Overall Length of Service at Institute (in years)    106
LengthofServiceCurrent. Length of Service at current workplace (in years)    106
Length: 72, dtype: int64

OBSERVATIONS

As we can see from the information above, the DETE dataframe contains 56 columns and 822 rows of data. We also see that DETE Start Date column in dete_survey dataframe contains some missing/null values entered as Not Stated. This is recognized by pandas as a string and not a missing value.
The TAFE dataframe also contains 72 columns and 702 rows of data. We also see that LengthofServiceOverall. Overall Length of Service at Institute (in years) column in tafe_survey dataframe contains some missing/null values seen as NaN.
The two dataframes contain many of the same columns, but the column names are different.
Now we know that the dataframes contain many columns that are not needed to complete this project. Also, there are multiple columns/answers that indicate an employee resigned because they were dissatisfied.

3. Data Cleaning¶

3.1 Removing Unwanted Columns¶

Hence, before we proceed further with the data exploration, we'll read dete_survey.csv CSV file into pandas again. This time around, the Not Stated values will be specified as NAN. We'll thereafter proceed to remove the unwanted columns.

In [10]:

# Makes pandas to identify specified missing values
dete_survey = pd.read_csv('dete_survey.csv', na_values='Not Stated', encoding="Latin-1")

In [11]:

# Removes unwanted columns from each dataframes
dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis=1)
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis=1)

In [12]:

# Checks that the columns were removed
print(dete_survey_updated.columns)

Index(['ID', 'SeparationType', 'Cease Date', 'DETE Start Date',
       'Role Start Date', 'Position', 'Classification', 'Region',
       'Business Unit', 'Employment Status', 'Career move to public sector',
       'Career move to private sector', 'Interpersonal conflicts',
       'Job dissatisfaction', 'Dissatisfaction with the department',
       'Physical work environment', 'Lack of recognition',
       'Lack of job security', 'Work location', 'Employment conditions',
       'Maternity/family', 'Relocation', 'Study/Travel', 'Ill Health',
       'Traumatic incident', 'Work life balance', 'Workload',
       'None of the above', 'Gender', 'Age', 'Aboriginal', 'Torres Strait',
       'South Sea', 'Disability', 'NESB'],
      dtype='object')

In [13]:

print(tafe_survey_updated.columns)

Index(['Record ID', 'Institute', 'WorkArea', 'CESSATION YEAR',
       'Reason for ceasing employment',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE',
       'Gender.     What is your Gender?', 'CurrentAge.     Current Age',
       'Employment Type.     Employment Type',
       'Classification.     Classification',
       'LengthofServiceOverall. Overall Length of Service at Institute (in years)',
       'LengthofServiceCurrent. Length of Service at current workplace (in years)'],
      dtype='object')

From the new information above, we see that the dete_survey_updated dataframe now contains 35 columns and 822 rows. We also see that missing/null values are now seen as NAN. The tafe_survey_updated dataframe now contains 23 columns and 702 rows.

Now that we have successfully made these changes, we'll get back on track to our data exploration and cleaning.

3.1.1 Cleaning and Renaming Column Names¶

As stated earlier in our observation, each dataframe contains many of the same columns, but the column names are different. Below is a combined dictionary for some similar columns in the two datasets:

dete_survey	tafe_survey	Definition
ID	Record ID	An id used to identify the participant of the survey
SeparationType	Reason for ceasing employment	The reason why the participant's employment ended
Cease Date	CESSATION YEAR	The year or month the participant's employment ended
DETE Start Date		The year the participant began employment with DETE
	LengthofServiceOverall. Overall Length of Service at Institute (in years)	The length of the person's employment (in years)
Age	CurrentAge. Current Age	The age of the participant
Gender	Gender. What is your Gender?	The gender of the participant

We'll standardize the column names in both dataframes and make them uniform. This is because we'll later combine the two dataframes.

3.1.1.1 DETE column names:¶

In some few cells below, we'll see the state of the column names before and after the cleaning process that was carried out. We do this in order to maintain the flow of the whole process.

In [14]:

# Initial form of the column names
print(dete_survey_updated.columns)

Index(['ID', 'SeparationType', 'Cease Date', 'DETE Start Date',
       'Role Start Date', 'Position', 'Classification', 'Region',
       'Business Unit', 'Employment Status', 'Career move to public sector',
       'Career move to private sector', 'Interpersonal conflicts',
       'Job dissatisfaction', 'Dissatisfaction with the department',
       'Physical work environment', 'Lack of recognition',
       'Lack of job security', 'Work location', 'Employment conditions',
       'Maternity/family', 'Relocation', 'Study/Travel', 'Ill Health',
       'Traumatic incident', 'Work life balance', 'Workload',
       'None of the above', 'Gender', 'Age', 'Aboriginal', 'Torres Strait',
       'South Sea', 'Disability', 'NESB'],
      dtype='object')

In [15]:

# Column head: BEFORE CLEANING
dete_survey_updated.head(2)

Out[15]:

	ID	SeparationType	Cease Date	DETE Start Date	Role Start Date	Position	Classification	Region	Business Unit	Employment Status	Career move to public sector	Career move to private sector	Interpersonal conflicts	Job dissatisfaction	Dissatisfaction with the department	Physical work environment	Lack of recognition	Lack of job security	Work location	Employment conditions	Maternity/family	Relocation	Study/Travel	Ill Health	Traumatic incident	Work life balance	Workload	None of the above	Gender	Age	Aboriginal	Torres Strait	South Sea	Disability	NESB
0	1	Ill Health Retirement	08/2012	1984.0	2004.0	Public Servant	A01-A04	Central Office	Corporate Strategy and Peformance	Permanent Full-time	True	False	False	True	False	False	True	False	False	False	False	False	False	False	False	False	False	True	Male	56-60	NaN	NaN	NaN	NaN	Yes
1	2	Voluntary Early Retirement (VER)	08/2012	NaN	NaN	Public Servant	AO5-AO7	Central Office	Corporate Strategy and Peformance	Permanent Full-time	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	Male	56-60	NaN	NaN	NaN	NaN	NaN

In [16]:

# Updates the column names to a standardized form
dete_survey_updated.columns = dete_survey_updated.columns.str.replace(' ', '_').str.strip().str.lower()

# Updated form of the column names
print(dete_survey_updated.columns)

Index(['id', 'separationtype', 'cease_date', 'dete_start_date',
       'role_start_date', 'position', 'classification', 'region',
       'business_unit', 'employment_status', 'career_move_to_public_sector',
       'career_move_to_private_sector', 'interpersonal_conflicts',
       'job_dissatisfaction', 'dissatisfaction_with_the_department',
       'physical_work_environment', 'lack_of_recognition',
       'lack_of_job_security', 'work_location', 'employment_conditions',
       'maternity/family', 'relocation', 'study/travel', 'ill_health',
       'traumatic_incident', 'work_life_balance', 'workload',
       'none_of_the_above', 'gender', 'age', 'aboriginal', 'torres_strait',
       'south_sea', 'disability', 'nesb'],
      dtype='object')

In [17]:

# Column head: AFTER CLEANING
dete_survey_updated.head(2)

Out[17]:

	id	separationtype	cease_date	dete_start_date	role_start_date	position	classification	region	business_unit	employment_status	career_move_to_public_sector	career_move_to_private_sector	interpersonal_conflicts	job_dissatisfaction	dissatisfaction_with_the_department	physical_work_environment	lack_of_recognition	lack_of_job_security	work_location	employment_conditions	maternity/family	relocation	study/travel	ill_health	traumatic_incident	work_life_balance	workload	none_of_the_above	gender	age	aboriginal	torres_strait	south_sea	disability	nesb
0	1	Ill Health Retirement	08/2012	1984.0	2004.0	Public Servant	A01-A04	Central Office	Corporate Strategy and Peformance	Permanent Full-time	True	False	False	True	False	False	True	False	False	False	False	False	False	False	False	False	False	True	Male	56-60	NaN	NaN	NaN	NaN	Yes
1	2	Voluntary Early Retirement (VER)	08/2012	NaN	NaN	Public Servant	AO5-AO7	Central Office	Corporate Strategy and Peformance	Permanent Full-time	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	Male	56-60	NaN	NaN	NaN	NaN	NaN

3.1.1.2 TAFE column names:¶

In the following cells, we'll update the column names in tafe_survey_updated dataframe to be uniform with the dete_survey_updated dataframe.

In [18]:

# A dictionary of name change
updated_names = {"Record ID": "id", "CESSATION YEAR": "cease_date", "Reason for ceasing employment": "separationtype", "Gender.     What is your Gender?": "gender", "CurrentAge.     Current Age": "age", "Employment Type.     Employment Type": "employment_status", "Classification.     Classification": "position", "LengthofServiceOverall. Overall Length of Service at Institute (in years)": "institute_service", "LengthofServiceCurrent. Length of Service at current workplace (in years)": "role_service" }

# Updates the column names in dataframe
tafe_survey_updated = tafe_survey_updated.rename(updated_names, axis=1)
tafe_survey_updated.columns

Out[18]:

Index(['id', 'Institute', 'WorkArea', 'cease_date', 'separationtype',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE', 'gender',
       'age', 'Employment Type.     Employment Type',
       'Classification.     Classification', 'institute_service',
       'role_service'],
      dtype='object')

We can see that the names of the columns representing the same information in both dataframes are now uniform. Thus, our analysis is made less confusing. This is especially important because we aim to combine both dataframes to complete our analysis.

3.2 Cleaning out other unwanted data¶

Next, we'll remove more of the data we don't need to answer our questions - one of which is:

Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about those employees who have been at the job longer?

Again, let's take a look at the unique values in separationtype columns in each dataframe:

In [19]:

dete_survey_updated['separationtype'].value_counts()

Out[19]:

Age Retirement                          285
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Voluntary Early Retirement (VER)         67
Ill Health Retirement                    61
Other                                    49
Contract Expired                         34
Termination                              15
Name: separationtype, dtype: int64

In [20]:

tafe_survey_updated['separationtype'].value_counts()

Out[20]:

Resignation                 340
Contract Expired            127
Retrenchment/ Redundancy    104
Retirement                   82
Transfer                     25
Termination                  23
Name: separationtype, dtype: int64

3.2.1 Extracting Data for Respondents Who Resigned¶

We can see that each dataframe contains a couple of different separation types. However, for this particular analysis, we're only interested in survey respondents who resigned. Therefore, only those separation types that contains the string Resignation will be analyzed.

We can also see that dete_survey_updated dataframe contains multiple separation types with the string Resignation:

Resignation-Other reasons
Resignation-Other employer
Resignation-Move overseas/interstate

In [21]:

# The pattern to search for and select
resignations = dete_survey_updated['separationtype'].str.contains(r"[Rr]esignation", na=False)

# Selects only data that have the desired string
dete_resignations = dete_survey_updated.copy()[resignations]

dete_resignations.head()

Out[21]:

	id	separationtype	cease_date	dete_start_date	role_start_date	position	classification	region	business_unit	employment_status	career_move_to_public_sector	career_move_to_private_sector	interpersonal_conflicts	job_dissatisfaction	dissatisfaction_with_the_department	physical_work_environment	lack_of_recognition	lack_of_job_security	work_location	employment_conditions	maternity/family	relocation	study/travel	ill_health	traumatic_incident	work_life_balance	workload	none_of_the_above	gender	age	aboriginal	torres_strait	south_sea	disability	nesb
3	4	Resignation-Other reasons	05/2012	2005.0	2006.0	Teacher	Primary	Central Queensland	NaN	Permanent Full-time	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	Female	36-40	NaN	NaN	NaN	NaN	NaN
5	6	Resignation-Other reasons	05/2012	1994.0	1997.0	Guidance Officer	NaN	Central Office	Education Queensland	Permanent Full-time	False	True	False	False	False	False	False	False	False	True	True	False	False	False	False	False	False	False	Female	41-45	NaN	NaN	NaN	NaN	NaN
8	9	Resignation-Other reasons	07/2012	2009.0	2009.0	Teacher	Secondary	North Queensland	NaN	Permanent Full-time	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	Female	31-35	NaN	NaN	NaN	NaN	NaN
9	10	Resignation-Other employer	2012	1997.0	2008.0	Teacher Aide	NaN	NaN	NaN	Permanent Part-time	False	False	True	True	True	False	False	False	False	False	False	False	False	False	False	False	False	False	Female	46-50	NaN	NaN	NaN	NaN	NaN
11	12	Resignation-Move overseas/interstate	2012	2009.0	2009.0	Teacher	Secondary	Far North Queensland	NaN	Permanent Full-time	False	False	False	False	False	False	False	False	False	False	True	True	False	False	False	False	False	False	Male	31-35	NaN	NaN	NaN	NaN	NaN

In [22]:

# Separates and select Resignation from the strings
dete_resignations['separationtype'] = dete_survey_updated['separationtype'].str.split('-').str[0]

In [23]:

# Unique values in the updated column
dete_resignations['separationtype'].value_counts()

Out[23]:

Resignation    311
Name: separationtype, dtype: int64

Above, we select only data whose separationtype column contains the string - "Resignation". This is solely because we're only going to analyze the data of the respondents who resigned from their job.

We can see above that the dataframe dete_resignations now contains 311 rows of data.

The separationtype column in tafe_survey_updated dataframe has only one type of resignation string. Next, we'll select those data with Resignation directly into a new dataframe.

In [24]:

# Selects only data which have the desired string
tafe_resignations = tafe_survey_updated[tafe_survey_updated['separationtype'] == 'Resignation'].copy()
tafe_resignations.head()

Out[24]:

	id	Institute	WorkArea	cease_date	separationtype	Contributing Factors. Career Move - Public Sector	Contributing Factors. Career Move - Private Sector	Contributing Factors. Career Move - Self-employment	Contributing Factors. Ill Health	Contributing Factors. Maternity/Family	Contributing Factors. Dissatisfaction	Contributing Factors. Job Dissatisfaction	Contributing Factors. Interpersonal Conflict	Contributing Factors. Study	Contributing Factors. Travel	Contributing Factors. Other	Contributing Factors. NONE	gender	age	Employment Type. Employment Type	Classification. Classification	institute_service	role_service
3	634139903350000000	Mount Isa Institute of TAFE	Non-Delivery (corporate)	2010.0	Resignation	-	-	-	-	-	-	-	-	-	Travel	-	-	NaN	NaN	NaN	NaN	NaN	NaN
4	634146578511788000	Southern Queensland Institute of TAFE	Delivery (teaching)	2010.0	Resignation	-	Career Move - Private Sector	-	-	-	-	-	-	-	-	-	-	Male	41 45	Permanent Full-time	Teacher (including LVT)	3-4	3-4
5	634147506906311000	Southern Queensland Institute of TAFE	Delivery (teaching)	2010.0	Resignation	-	-	-	-	-	-	-	-	-	-	Other	-	Female	56 or older	Contract/casual	Teacher (including LVT)	7-10	7-10
6	634152007975694000	Barrier Reef Institute of TAFE	Non-Delivery (corporate)	2010.0	Resignation	-	Career Move - Private Sector	-	-	Maternity/Family	-	-	-	-	-	Other	-	Male	20 or younger	Temporary Full-time	Administration (AO)	3-4	3-4
7	634153745310374000	Southern Queensland Institute of TAFE	Delivery (teaching)	2010.0	Resignation	-	-	-	-	-	-	-	-	-	-	Other	-	Male	46 50	Permanent Full-time	Teacher (including LVT)	3-4	3-4

In [25]:

# Unique values in the updated column
tafe_resignations['separationtype'].value_counts()

Out[25]:

Resignation    340
Name: separationtype, dtype: int64

After selecting just the rows that contains Resignation in their separationtype column, we can see above that the dataframe tafe_resignations now contains 340 rows of data.

3.3 Checking for Inconsistencies¶

We'll continue by checking for other errors in the dataframes. We'll do this by examining each column we need for this anaysis:

Let's take a look at the cease_date and dete_start_date in the dete_resignations dataframe and see whether there's need for any adjustments based on these:

Since the cease_date is the last year of the person's employment and the dete_start_date is the person's first year of employment, it wouldn't make sense to have years after the current date.
Given that most people in this field start working in their 20s, it's also unlikely that the dete_start_date was before the year 1940.

So if we have years (in dete_start_date) higher than the cease_date or lower than 1940, we wouldn't want to continue with the analysis. The reason is that there may be something very wrong with the data. However, if there are just a small amount of values that are high or low, we'll remove them.

In [26]:

dete_resignations['cease_date'].value_counts(ascending=True)

Out[26]:

09/2010      1
2010         1
07/2006      1
07/2012      1
05/2013      2
05/2012      2
08/2013      4
10/2013      6
11/2013      9
07/2013      9
09/2013     11
06/2013     14
12/2013     17
01/2014     22
2013        74
2012       126
Name: cease_date, dtype: int64

In [27]:

dete_resignations['dete_start_date'].value_counts().sort_index(ascending=True)

Out[27]:

1963.0     1
1971.0     1
1972.0     1
1973.0     1
1974.0     2
1975.0     1
1976.0     2
1977.0     1
1980.0     5
1982.0     1
1983.0     2
1984.0     1
1985.0     3
1986.0     3
1987.0     1
1988.0     4
1989.0     4
1990.0     5
1991.0     4
1992.0     6
1993.0     5
1994.0     6
1995.0     4
1996.0     6
1997.0     5
1998.0     6
1999.0     8
2000.0     9
2001.0     3
2002.0     6
2003.0     6
2004.0    14
2005.0    15
2006.0    13
2007.0    21
2008.0    22
2009.0    13
2010.0    17
2011.0    24
2012.0    21
2013.0    10
Name: dete_start_date, dtype: int64

The cease_date column in dete_resignation dataframe contains two different date formats: some only contain year while others contain month and year. There is a need to change these dates to take a single and similar form across the rows.

In [28]:

# Pattern for extracting date format
pattern = r"/?(?P<Year>[1-2][0-9]{3})"

dete_resignations['cease_date'] = dete_resignations['cease_date'].astype(str).str.extract(pattern).astype(float)
dete_resignations['cease_date'].value_counts().sort_index()

Out[28]:

2006.0      1
2010.0      2
2012.0    129
2013.0    146
2014.0     22
Name: cease_date, dtype: int64

In [29]:

# Box plot to check for outliers
plt.figure(figsize=(10, 10))
dete_resignations.boxplot(column=['cease_date']).set_ylim(2005,2018)

Out[29]:

(2005, 2018)

In [30]:

# Box plot to check for outliers
plt.figure(figsize=(10, 10))
dete_resignations.boxplot(column=['dete_start_date']).set_ylim(1960,2014)

Out[30]:

(1960, 2014)

In [31]:

# Sorted values in the cease_date column
tafe_resignations['cease_date'].value_counts().sort_index()

Out[31]:

2009.0      2
2010.0     68
2011.0    116
2012.0     94
2013.0     55
Name: cease_date, dtype: int64

In [32]:

# Box plot to check for outliers
plt.figure(figsize=(10, 10))
tafe_resignations.boxplot(column=['cease_date']).set_ylim(2005,2018)

Out[32]:

(2005, 2018)

Here is what we can deduce from the information above:

There isn't any major issue with the years in the dete_start_date and cease_date.
The years in both dataframes don't align. The dete_resignations dataframe contains outliers (2006 and 2010) while tafe_resignations has outliers at 2009 (a year absent from the other dataframe). The tafe_resignations dataframe also contains more cease dates (68) in 2010 compared to the number of cease dates in dete_resignations (2).

Since we're not interested in analyzing the data according to year, we'll leave them as they are.

So far so good, we have rectified the years in the columns to follow a single uniform pattern in the two dataframes. We also didn't find any major issues with the years. Additionally, we saw that the years in each dataframe don't span the same number of years.

4. Refining and Combining Datasets¶

4.1 Computing Employee Years of Service¶

As a reminder, our goal is to answer this question:

Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about those employees who have been at the job longer?

In order to answer the questions, we'll use the now verified years in the dete_registrations dataframe to create a new column. In the Human Resources field, the length of time an employee spent in a workplace is referred to as their years of service. The tafe_resignations dataframe already contains a "service" column that we renamed to institute_service. Therefore, in order to analyze both surveys together, we'll create a corresponding institute_service column in dete_resignations.

In [33]:

dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']
dete_resignations['institute_service'].head()

Out[33]:

3      7.0
5     18.0
8      3.0
9     15.0
11     3.0
Name: institute_service, dtype: float64

4.2 Classifying Employee¶

We have successfully created a new corresponding column named institute_service in the dete_resignations dataframe. Thus, we can now analyze the survey respondents according to their length of employment. Firstly, we'll identify and classify employees who resigned because they were dissatisfied. The reason for which a worker's employment ended at both institutes are indicated by the column separationtype in both datasets. Other factors that may have contributed to the job cessation are indicated in other columns of the datasets.

According to a dictionary meaning,

Dissatisfaction: The feeling of being displeased and discontent.

Based on that meaning, we'll select some columns from the columns in both dataframes to categorize employees as "dissatisfied". The selected columns are listed below

dete_survey:
- job_dissatisfaction
- dissatisfaction_with_the_department
- physical_work_environment
- lack_of_recognition
- lack_of_job_security
- work_location
- employment_conditions
- work_life_balance
- workload
tafe_survey:
- Contributing Factors. Dissatisfaction
- Contributing Factors. Job Dissatisfaction

If the employee indicated any of the above caused them to resign, we'll mark them as dissatisfied in a new column as following:

True: indicates a person resigned because they were dissatisfied in some way
False: indicates a person resigned because of a reason other than job dissatisfaction
NaN: indicates missing values

Let's start by viewing the values in the two columns of Contributing Factors in tafe_resignations dataframe.

In [34]:

# Checks the unique values
tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts()

Out[34]:

-                                         277
Contributing Factors. Dissatisfaction      55
Name: Contributing Factors. Dissatisfaction, dtype: int64

In [35]:

# Checks the unique values
tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts()

Out[35]:

-                      270
Job Dissatisfaction     62
Name: Contributing Factors. Job Dissatisfaction, dtype: int64

There are 55 respondents who resigned due to dissatisfaction in the Contributing Factors. Dissatisfaction column. In the other column (Contributing Factors. Job Dissatisfaction), 62 respondents resigned due to dissatisfaction.

In [36]:

# Updates values in columns to either True, False, or NaN
def update_vals(value):
    if pd.isnull(value):
        return np.nan

    elif value == '-':
        return False

    else:
        return True

In [37]:

# Creates a new column for dissatisfaction state
tafe_resignations['dissatisfied'] = tafe_resignations[['Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction']].applymap(update_vals).any(1, skipna=False)
tafe_resignations_up = tafe_resignations.copy()

# Checks the unique values of the new column
tafe_resignations_up['dissatisfied'].value_counts(dropna=False)

Out[37]:

False    241
True      91
NaN        8
Name: dissatisfied, dtype: int64

In [38]:

# Creates a new column for dissatisfaction state
dete_resignations['dissatisfied'] = dete_resignations[['job_dissatisfaction', 'dissatisfaction_with_the_department', 'physical_work_environment', 'lack_of_recognition', 'lack_of_job_security', 'work_location', 'employment_conditions', 'work_life_balance', 'workload']].any(1, skipna=False)
dete_resignations_up = dete_resignations.copy()

# Checks the unique values of the new column
dete_resignations_up['dissatisfied'].value_counts(dropna=False)

Out[38]:

False    162
True     149
Name: dissatisfied, dtype: int64

Hooray! We've successfully created a new column dissatisfied in both dataframes classifying each respondent whose reason for resignation was based on job dissatisfaction (True) or other reasons (False). This action is necessary because we hope to combine the two dataframe and be able to make deductions based on this reason.

4.3 Combining two dataframes¶

Up until this moment, we have worked to clean up and add valuable data. We've carried out these changes separately on each dataframe. Now, we're finally ready to combine the two datasets and take a big step towards answering our questions.

We're going to start by adding a column to each dataframe. This new column will allow us to easily distinguish between the two datasets after we've combined them.

In [39]:

# Adds a new column to the datasets
dete_resignations_up['institute'] = 'DETE'
tafe_resignations_up['institute'] = 'TAFE'

In [40]:

# Combines the datasets
combined = pd.concat([dete_resignations_up, tafe_resignations_up], ignore_index=True)

# Checks the number of non null values in each column
combined.notnull().sum().sort_values()

C:\Users\extraUNIQUEguy\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

Out[40]:

torres_strait                                            0
south_sea                                                3
aboriginal                                               7
disability                                               8
nesb                                                     9
business_unit                                           32
classification                                         161
region                                                 265
role_start_date                                        271
dete_start_date                                        283
Classification.     Classification                     290
Employment Type.     Employment Type                   290
role_service                                           290
employment_status                                      307
position                                               308
none_of_the_above                                      311
ill_health                                             311
maternity/family                                       311
relocation                                             311
lack_of_recognition                                    311
lack_of_job_security                                   311
job_dissatisfaction                                    311
interpersonal_conflicts                                311
study/travel                                           311
traumatic_incident                                     311
work_life_balance                                      311
physical_work_environment                              311
employment_conditions                                  311
workload                                               311
dissatisfaction_with_the_department                    311
career_move_to_public_sector                           311
career_move_to_private_sector                          311
work_location                                          311
Contributing Factors. Career Move - Private Sector     332
Contributing Factors. Career Move - Public Sector      332
Contributing Factors. Career Move - Self-employment    332
Contributing Factors. Dissatisfaction                  332
Contributing Factors. Ill Health                       332
Contributing Factors. Interpersonal Conflict           332
Contributing Factors. Job Dissatisfaction              332
Contributing Factors. Maternity/Family                 332
Contributing Factors. NONE                             332
Contributing Factors. Study                            332
Contributing Factors. Travel                           332
Contributing Factors. Other                            332
Institute                                              340
WorkArea                                               340
institute_service                                      563
gender                                                 592
age                                                    596
cease_date                                             635
dissatisfied                                           643
separationtype                                         651
institute                                              651
id                                                     651
dtype: int64

We've successfully combined the two datasets. Next, we checked for the number of non null values in order to know the other unnecessary columns to drop from the combined dataframe.

In [41]:

# Columns prior to dropping
combined.columns

Out[41]:

Index(['Classification.     Classification',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Maternity/Family', 'Contributing Factors. NONE',
       'Contributing Factors. Other', 'Contributing Factors. Study',
       'Contributing Factors. Travel', 'Employment Type.     Employment Type',
       'Institute', 'WorkArea', 'aboriginal', 'age', 'business_unit',
       'career_move_to_private_sector', 'career_move_to_public_sector',
       'cease_date', 'classification', 'dete_start_date', 'disability',
       'dissatisfaction_with_the_department', 'dissatisfied',
       'employment_conditions', 'employment_status', 'gender', 'id',
       'ill_health', 'institute', 'institute_service',
       'interpersonal_conflicts', 'job_dissatisfaction',
       'lack_of_job_security', 'lack_of_recognition', 'maternity/family',
       'nesb', 'none_of_the_above', 'physical_work_environment', 'position',
       'region', 'relocation', 'role_service', 'role_start_date',
       'separationtype', 'south_sea', 'study/travel', 'torres_strait',
       'traumatic_incident', 'work_life_balance', 'work_location', 'workload'],
      dtype='object')

In [42]:

# Removes columns with less than 500 non null values
combined_updated = combined.dropna(thresh=500, axis=1).copy()

In [43]:

# Columns after dropping
combined_updated.columns

Out[43]:

Index(['age', 'cease_date', 'dissatisfied', 'gender', 'id', 'institute',
       'institute_service', 'separationtype'],
      dtype='object')

We dropped all columns with less than 500 non null values above. This is necessary for the effective management and analysis of the data. Essentially, columns with less than 500 non null values can not influence the result of our analysis.

Before we proceed to proper analysis, we want to clean up the institute_service column. Let's see the unique values in that particular column below:

In [44]:

combined_updated['institute_service'].value_counts(dropna=False)

Out[44]:

NaN                   88
Less than 1 year      73
1-2                   64
3-4                   63
5-6                   33
11-20                 26
5.0                   23
1.0                   22
7-10                  21
0.0                   20
3.0                   20
6.0                   17
4.0                   16
2.0                   14
9.0                   14
7.0                   13
More than 20 years    10
8.0                    8
13.0                   8
15.0                   7
20.0                   7
14.0                   6
10.0                   6
12.0                   6
17.0                   6
22.0                   6
16.0                   5
18.0                   5
24.0                   4
11.0                   4
23.0                   4
19.0                   3
39.0                   3
21.0                   3
32.0                   3
25.0                   2
26.0                   2
36.0                   2
28.0                   2
30.0                   2
33.0                   1
38.0                   1
35.0                   1
34.0                   1
31.0                   1
49.0                   1
29.0                   1
27.0                   1
42.0                   1
41.0                   1
Name: institute_service, dtype: int64

4.4 Categorizing Employees According to Career Stage¶

The column contains some very different forms of value. Tricky business! In order to analyze the data, we'll have to convert these values into categories below:

New: Less than 3 years in an organization
Experienced: 3-6 years in an organization
Established: 7-10 years in an organization
Veteran: 11 or more years in an organization

This analysis is based on this article, which makes the argument that understanding employee's needs according to career stage instead of age is more effective.

In [45]:

# Extracts years of service from column and converts to float
combined_updated['institute_service_up'] = combined_updated['institute_service'].astype('str').str.extract(r"(\d+)")
combined_updated['institute_service_up'] = combined_updated['institute_service_up'].astype('float')

In [46]:

# Confirms successful extraction
combined_updated['institute_service_up'].value_counts()

Out[46]:

1.0     159
3.0      83
5.0      56
7.0      34
11.0     30
0.0      20
20.0     17
6.0      17
4.0      16
9.0      14
2.0      14
13.0      8
8.0       8
15.0      7
17.0      6
10.0      6
12.0      6
14.0      6
22.0      6
16.0      5
18.0      5
24.0      4
23.0      4
39.0      3
19.0      3
21.0      3
32.0      3
28.0      2
36.0      2
25.0      2
30.0      2
26.0      2
29.0      1
38.0      1
42.0      1
27.0      1
41.0      1
35.0      1
49.0      1
34.0      1
33.0      1
31.0      1
Name: institute_service_up, dtype: int64

In [47]:

# Sorts years into different categories
def categorize(value):
    if value >= 11:
        return "Veteran"
    elif 3 <= value < 7:
        return "Experienced"
    elif 7 <= value < 11:
        return "Established"
    elif pd.isnull(value):
        return np.nan
    else:
        return "New"
    
# Applies function to the column
combined_updated['service_cat'] = combined_updated['institute_service_up'].apply(categorize)

# Confirms the changes
combined_updated['service_cat'].value_counts()

Out[47]:

New            193
Experienced    172
Veteran        136
Established     62
Name: service_cat, dtype: int64

We categorized the employees into four groups based on the number of years spent working. A new column service_cat was created to contain the category each employee falls to. Now it is easy to analyze our data and answer some salient questions.

5. Data Analysis¶

5.1 Performing Initial Analysis¶

In order to get started with our data analysis, we'll perform a small piece of analysis. We'll first fill in missing values in the dissatisfied column and then aggregate the data.

In [48]:

# Checks the unique values in the column
combined_updated['dissatisfied'].value_counts(dropna=False)

Out[48]:

False    403
True     240
NaN        8
Name: dissatisfied, dtype: int64

In [49]:

# Replaces missing values in the column
combined_updated['dissatisfied'] = combined_updated['dissatisfied'].fillna(False)

So far, we have found out the total number of workers who resigned due to some kinds of job dissatisfaction to be 240. To continue further with the analysis, we needed to fill the 8 missing values in the column with the most frequent category therein. The False group occurs most in the dissatisfied column, we therefore replaced the missing values with 'False'.
All the values in dissatisfied column now consists of Boolean values, meaning they're either True or False. Thus, we'll use a pivot table to aggregate the column and calculate the percentage of people in each group. This is possible because df.pivot_table() method treats Boolean values as integers, so a True value is considered to be 1 and a False value is considered to be 0.

In [50]:

# Calculate and visualize the percentage of dissatisfied employees
table_s_cat = combined_updated.pivot_table(index='service_cat', values='dissatisfied')
print(table_s_cat)

# Data to plot bar chart
plt.figure(figsize=(10,10))
sns.set_style('darkgrid')
ax = sns.barplot(x=table_s_cat.index, y=table_s_cat['dissatisfied'], palette=sns.color_palette('GnBu_d'))
ax.set_yticks([])
ax.set_title('Percentage of Dissatisfied Employee Based on Service Category')
ax.set_xlabel('Service Category')
ax.set_ylabel('Percentage')
ax.tick_params(bottom=True, top=False, left=False, right=False, labelbottom=True)
sns.despine(left=True)

             dissatisfied
service_cat              
Established      0.516129
Experienced      0.343023
New              0.295337
Veteran          0.485294

Let's also visualize this data using a pie chart:

In [51]:

# Data to plot pie chart
labels = 'Established', 'Experienced', 'New', 'Veteran'
colours = ['lightskyblue', 'yellowgreen', 'lightcoral', 'gold']
explode = (0.1, 0, 0, 0.1) # "Explodes" the first and last slice
plt.pie(table_s_cat, labels=labels, colors=colours, autopct='%1.1f%%', shadow=True, explode=explode, startangle=90)

plt.title('Percentage of Dissatisfied Employee Based on Service Category', bbox={'facecolor':'0.8', 'pad':5})
plt.axis('equal')   # Ensures pie is drawn as a circle
plt.show()

C:\Users\extraUNIQUEguy\Anaconda3\lib\site-packages\ipykernel_launcher.py:5: MatplotlibDeprecationWarning: Non-1D inputs to pie() are currently squeeze()d, but this behavior is deprecated since 3.1 and will be removed in 3.3; pass a 1D array instead.
  """

Above, we aggregated the data in the dissatisfied column and used the information to calculate the percentage of the people in each group. From this small analysis, we can presently conclude that more employees who have spent 7 or more years at a job (Established and Veteran categories) resigned due to some kinds of job dissatisfaction (with more than 60% of the total number of dissatisfied employees). Compared to the two aforementioned categories, employees with less than 7 years of service resigned less due to job dissatisfaction.

5.2 Conducting Further Analysis¶

Next, we'll handle the missing values in the other categories. We want to know how many people resigned due to some kind of dissatisfaction and the percentage according to gender. First, let's take a closer look at each column and clean up the age column, which is still in an untreatable form.

In [52]:

# Values in each column
gender_counts = combined_updated['gender'].value_counts(dropna=False)
print(gender_counts, '\n')
service_counts = combined_updated['service_cat'].value_counts(dropna=False)
print(service_counts, '\n')
age_counts = combined_updated['age'].value_counts(dropna=False)
print(age_counts)

Female    424
Male      168
NaN        59
Name: gender, dtype: int64 

New            193
Experienced    172
Veteran        136
NaN             88
Established     62
Name: service_cat, dtype: int64 

51-55            71
NaN              55
41-45            48
41  45          45
46-50            42
36-40            41
46  50          39
26-30            35
21  25          33
26  30          32
36  40          32
31  35          32
56 or older      29
31-35            29
21-25            29
56-60            26
61 or older      23
20 or younger    10
Name: age, dtype: int64

In [53]:

# Extracts the first age in each age group
combined_updated['age'] = combined_updated['age'].str.extract(r"(\d+)").astype(float)
combined_updated['age'].value_counts(dropna=False)

Out[53]:

41.0    93
46.0    81
36.0    73
51.0    71
26.0    67
21.0    62
31.0    61
NaN     55
56.0    55
61.0    23
20.0    10
Name: age, dtype: int64

The age column now looks good and can be analyzed. We can also observe that there are 55 missing values in the column. Before we decide on what to do with these missing values, let's first look at the missing values across the columns we are interested in.

In [54]:

# Creates heatmap to locate missing values
map_index = combined_updated.set_index('dissatisfied')
sns.heatmap(map_index.isnull(), cbar=False)

Out[54]:

<matplotlib.axes._subplots.AxesSubplot at 0x163034faa88>

Above, we created a heatmap to check the location of the missing values in our dataframe. We can observe that the essential columns for our next analysis (age, gender and service_cat) all have roughly the same number of missing values and almost on the same rows of data. We'll continue by filling the missing age data with the average value of all the ages. It is however unnecessary to try to fill the other missing values and thus, we'll leave the dataset as is.

In [55]:

# Calculate the mean age
mean = combined_updated['age'].mean()

# Fill missing age with the mean value
combined_updated['age'] = combined_updated['age'].fillna(mean)

# Display the unique values in the age column
combined_updated['age'].value_counts(dropna=False)

Out[55]:

41.000000    93
46.000000    81
36.000000    73
51.000000    71
26.000000    67
21.000000    62
31.000000    61
39.271812    55
56.000000    55
61.000000    23
20.000000    10
Name: age, dtype: int64

Next, we'll group the ages into four different categories. These categories include workers in their:

50s and above
40s
30s
20s and below

In [56]:

# A function to group age
def categorize_age(age):
    if age >= 50:
        return '50s and above'
    elif 40 <= age < 50:
        return '40s'
    elif 30 <= age < 40:
        return '30s'
    elif 20 <= age < 30:
        return '20s and below'

In [57]:

# Apply function to categorize age into groups in a new column
combined_updated['age_cat'] = combined_updated['age'].apply(categorize_age)
combined_updated['age_cat'].value_counts()

Out[57]:

30s              189
40s              174
50s and above    149
20s and below    139
Name: age_cat, dtype: int64

As we can now see, we have 139 workers who are in their 20s or below, 189 workers in their 30s, 174 workers in their 40s and 149 workers in their 50s or above. This categorization looks encompassing enough.

5.2.1 Estimating Dissatisfied Workers Based on Age Category¶

Earlier, we aggregated dissatisfied workers based on service category. Right now, we want to do the same, albeit based on age category.

In [59]:

# Calculate and visualize the percentage of dissatisfied employees
table_a_cat = combined_updated.pivot_table(index='age_cat', values='dissatisfied')
print(table_a_cat)

# Data to plot
plt.figure(figsize=(10,10))
sns.set_style('darkgrid')
ax = sns.barplot(x=table_a_cat.index, y=table_a_cat['dissatisfied'], palette=sns.color_palette('GnBu_d'))
ax.set_yticks([])
ax.set_title('Percentage of Dissatisfied Employee Based on Age Category')
ax.set_xlabel('Age Category')
ax.set_ylabel('Percentage')
ax.tick_params(bottom=True, top=False, left=False, right=False, labelbottom=True)
sns.despine(left=True)

               dissatisfied
age_cat                    
20s and below      0.352518
30s                0.328042
40s                0.379310
50s and above      0.422819

Let's also visualize this data using a pie chart below:

In [60]:

labels = '20s and below', '30s', '40s', '50s and above'
colours = ['lightcoral', 'lightskyblue', 'gold', 'yellowgreen']
explode = (0.05, 0.05, 0.05, 0.1)
plt.pie(table_a_cat, labels=labels, colors=colours, autopct='%1.1f%%', shadow=True, explode=explode, startangle=90)

plt.title('Percentage of Dissatisfied Employee Based on Age Category', bbox={'facecolor':'0.8', 'pad':5})
plt.axis('equal')   # Ensures pie is drawn as a circle
plt.show()

C:\Users\extraUNIQUEguy\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: MatplotlibDeprecationWarning: Non-1D inputs to pie() are currently squeeze()d, but this behavior is deprecated since 3.1 and will be removed in 3.3; pass a 1D array instead.
  after removing the cwd from sys.path.

6.0 Result and Conclusion¶

In this project we set out to answer some important questions regarding worker's resignation. From the analysis we performed, we found the total number of workers who resigned due to some kinds of job dissatisfaction to be 240. We also found that compared to employees with less than 7 years of service, more employees who have spent 7 or more years at a job (Established and Veteran categories) resigned due to some kinds of job dissatisfaction (with more than 60% of the total number of dissatisfied employees).

Further into the analysis, we noticed that more workers resigned as they increase in age. We observed a progressive increase in the percentage of workers that resigned due to job dissatisfaction as they move from their 30s to 40s etc. We found the highest percentage of dissatisfied workers among those who are 50 years of age and above. However, we found that more workers in their 20s and below resigned due to job dissatisfaction compared to those in their 30s.

One of the reasons this might be so is that in many cases, most in their 20s are recent graduates, new entrants in the workforce, single (or at least unmarried). Many of these young individuals, fresh out of college, lockstep with peers, carry much expectation and expect a workplace to be ideal and a perfect fit. But then after these expectations are not met, they may become grumpy and dissatisfied with their jobs. However, when individuals attain their 30s they probably are now at peace, develop more emotional, moral and intellectual alignment with the world. These individuals have now attained another level of self-awareness and may put up more with some job dissatisfactions. Then, this character probably tends to simmer down as individuals become much older as seen by this analysis.

0.0 Rough Analysis¶

Here, we want to present the numbers of people in each service category and the gender differences.

In [61]:

# Dataframe of dissatisfied employees in the Veteran category
dissatisfied_veteran = combined_updated.loc[(combined_updated['dissatisfied'] == True) & (combined_updated['service_cat'] == 'Veteran')]
print('Total Dissatisfied Veteran Workers = ' + str(len(dissatisfied_veteran)))

Total Dissatisfied Veteran Workers = 66

In [63]:

dissatisfied_veteran['gender'].value_counts(dropna=False)

Out[63]:

Female    45
Male      19
NaN        2
Name: gender, dtype: int64

In [64]:

# Dataframe of dissatisfied employees in the Established category
dissatisfied_established = combined_updated.loc[(combined_updated['dissatisfied'] == True) & (combined_updated['service_cat'] == 'Established')]
print('Total Dissatisfied Established Workers = ' + str(len(dissatisfied_established)))

Total Dissatisfied Established Workers = 32

In [65]:

dissatisfied_established['gender'].value_counts(dropna=False)

Out[65]:

Female    24
Male       8
Name: gender, dtype: int64

In [66]:

# Dataframe of dissatisfied employees in the Experienced category
dissatisfied_experienced = combined_updated.loc[(combined_updated['dissatisfied'] == True) & (combined_updated['service_cat'] == 'Experienced')]
print('Total Dissatisfied Experienced Workers = ' + str(len(dissatisfied_experienced)))

Total Dissatisfied Experienced Workers = 59

In [67]:

dissatisfied_experienced['gender'].value_counts(dropna=False)

Out[67]:

Female    44
Male      15
Name: gender, dtype: int64

In [68]:

# Dataframe of dissatisfied employees in the New category
dissatisfied_new = combined_updated.loc[(combined_updated['dissatisfied'] == True) & (combined_updated['service_cat'] == 'New')]
print('Total Dissatisfied New Workers = ' + str(len(dissatisfied_new)))

Total Dissatisfied New Workers = 57

In [69]:

dissatisfied_new['gender'].value_counts(dropna=False)

Out[69]:

Female    37
Male      20
Name: gender, dtype: int64