In this project, we will analyse exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. Our aim is to answer to the following questions:
Data used in this project:
We will be combining the data from both departments to try and answer these questions. Although both departments used the same survey template, one of them customised some of the answers. This will require a degree of cleaning before we can combine the two datasets.
A data dictionary has not been provided for either dataset, so we will do our best to infer the definitions of each column from its name. Here is a preview of some of the columns from two datasets:
DETE
Column | Description |
---|---|
ID | An id used to identify the participant of the survey |
SeparationType | The reason why the person's employment ended |
Cease Date | The year or month the person's employment ended |
DETE Start Date | The year the person began employment with the DETE |
TAFE
Column | Description |
---|---|
Record ID | An id used to identify the participant of the survey |
Reason for ceasing employment | The reason why the person's employment ended |
LengthofServiceOverall. Overall Length of Service at Institute (in years) | The length of the person's employment (in years) |
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# run Jupyter magic so that plots are displayed inline
%matplotlib inline
# set styles for plots
plt.rcParams['axes.titlesize'] = 'x-large'
plt.rcParams['axes.spines.left'] = False
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.spines.bottom'] = False
# read DETE dataset and explore dataframe
dete_survey = pd.read_csv("dete_survey.csv")
dete_survey.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 822 entries, 0 to 821 Data columns (total 56 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 822 non-null int64 1 SeparationType 822 non-null object 2 Cease Date 822 non-null object 3 DETE Start Date 822 non-null object 4 Role Start Date 822 non-null object 5 Position 817 non-null object 6 Classification 455 non-null object 7 Region 822 non-null object 8 Business Unit 126 non-null object 9 Employment Status 817 non-null object 10 Career move to public sector 822 non-null bool 11 Career move to private sector 822 non-null bool 12 Interpersonal conflicts 822 non-null bool 13 Job dissatisfaction 822 non-null bool 14 Dissatisfaction with the department 822 non-null bool 15 Physical work environment 822 non-null bool 16 Lack of recognition 822 non-null bool 17 Lack of job security 822 non-null bool 18 Work location 822 non-null bool 19 Employment conditions 822 non-null bool 20 Maternity/family 822 non-null bool 21 Relocation 822 non-null bool 22 Study/Travel 822 non-null bool 23 Ill Health 822 non-null bool 24 Traumatic incident 822 non-null bool 25 Work life balance 822 non-null bool 26 Workload 822 non-null bool 27 None of the above 822 non-null bool 28 Professional Development 808 non-null object 29 Opportunities for promotion 735 non-null object 30 Staff morale 816 non-null object 31 Workplace issue 788 non-null object 32 Physical environment 817 non-null object 33 Worklife balance 815 non-null object 34 Stress and pressure support 810 non-null object 35 Performance of supervisor 813 non-null object 36 Peer support 812 non-null object 37 Initiative 813 non-null object 38 Skills 811 non-null object 39 Coach 767 non-null object 40 Career Aspirations 746 non-null object 41 Feedback 792 non-null object 42 Further PD 768 non-null object 43 Communication 814 non-null object 44 My say 812 non-null object 45 Information 816 non-null object 46 Kept informed 813 non-null object 47 Wellness programs 766 non-null object 48 Health & Safety 793 non-null object 49 Gender 798 non-null object 50 Age 811 non-null object 51 Aboriginal 16 non-null object 52 Torres Strait 3 non-null object 53 South Sea 7 non-null object 54 Disability 23 non-null object 55 NESB 32 non-null object dtypes: bool(18), int64(1), object(37) memory usage: 258.6+ KB
# explore first 5 rows
dete_survey.head()
ID | SeparationType | Cease Date | DETE Start Date | Role Start Date | Position | Classification | Region | Business Unit | Employment Status | ... | Kept informed | Wellness programs | Health & Safety | Gender | Age | Aboriginal | Torres Strait | South Sea | Disability | NESB | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Ill Health Retirement | 08/2012 | 1984 | 2004 | Public Servant | A01-A04 | Central Office | Corporate Strategy and Peformance | Permanent Full-time | ... | N | N | N | Male | 56-60 | NaN | NaN | NaN | NaN | Yes |
1 | 2 | Voluntary Early Retirement (VER) | 08/2012 | Not Stated | Not Stated | Public Servant | AO5-AO7 | Central Office | Corporate Strategy and Peformance | Permanent Full-time | ... | N | N | N | Male | 56-60 | NaN | NaN | NaN | NaN | NaN |
2 | 3 | Voluntary Early Retirement (VER) | 05/2012 | 2011 | 2011 | Schools Officer | NaN | Central Office | Education Queensland | Permanent Full-time | ... | N | N | N | Male | 61 or older | NaN | NaN | NaN | NaN | NaN |
3 | 4 | Resignation-Other reasons | 05/2012 | 2005 | 2006 | Teacher | Primary | Central Queensland | NaN | Permanent Full-time | ... | A | N | A | Female | 36-40 | NaN | NaN | NaN | NaN | NaN |
4 | 5 | Age Retirement | 05/2012 | 1970 | 1989 | Head of Curriculum/Head of Special Education | NaN | South East | NaN | Permanent Full-time | ... | N | A | M | Female | 61 or older | NaN | NaN | NaN | NaN | NaN |
5 rows × 56 columns
# read TAFE dataset and explore dataframe
tafe_survey = pd.read_csv("tafe_survey.csv")
tafe_survey.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 702 entries, 0 to 701 Data columns (total 72 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Record ID 702 non-null float64 1 Institute 702 non-null object 2 WorkArea 702 non-null object 3 CESSATION YEAR 695 non-null float64 4 Reason for ceasing employment 701 non-null object 5 Contributing Factors. Career Move - Public Sector 437 non-null object 6 Contributing Factors. Career Move - Private Sector 437 non-null object 7 Contributing Factors. Career Move - Self-employment 437 non-null object 8 Contributing Factors. Ill Health 437 non-null object 9 Contributing Factors. Maternity/Family 437 non-null object 10 Contributing Factors. Dissatisfaction 437 non-null object 11 Contributing Factors. Job Dissatisfaction 437 non-null object 12 Contributing Factors. Interpersonal Conflict 437 non-null object 13 Contributing Factors. Study 437 non-null object 14 Contributing Factors. Travel 437 non-null object 15 Contributing Factors. Other 437 non-null object 16 Contributing Factors. NONE 437 non-null object 17 Main Factor. Which of these was the main factor for leaving? 113 non-null object 18 InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction 608 non-null object 19 InstituteViews. Topic:2. I was given access to skills training to help me do my job better 613 non-null object 20 InstituteViews. Topic:3. I was given adequate opportunities for personal development 610 non-null object 21 InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL% 608 non-null object 22 InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had 615 non-null object 23 InstituteViews. Topic:6. The organisation recognised when staff did good work 607 non-null object 24 InstituteViews. Topic:7. Management was generally supportive of me 614 non-null object 25 InstituteViews. Topic:8. Management was generally supportive of my team 608 non-null object 26 InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me 610 non-null object 27 InstituteViews. Topic:10. Staff morale was positive within the Institute 602 non-null object 28 InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly 601 non-null object 29 InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently 597 non-null object 30 InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly 601 non-null object 31 WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit 609 non-null object 32 WorkUnitViews. Topic:15. I worked well with my colleagues 605 non-null object 33 WorkUnitViews. Topic:16. My job was challenging and interesting 607 non-null object 34 WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work 610 non-null object 35 WorkUnitViews. Topic:18. I had sufficient contact with other people in my job 613 non-null object 36 WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job 609 non-null object 37 WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job 609 non-null object 38 WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT] 608 non-null object 39 WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job 608 non-null object 40 WorkUnitViews. Topic:23. My job provided sufficient variety 611 non-null object 41 WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job 610 non-null object 42 WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction 611 non-null object 43 WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance 606 non-null object 44 WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area 610 non-null object 45 WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date 609 non-null object 46 WorkUnitViews. Topic:29. There was adequate communication between staff in my unit 603 non-null object 47 WorkUnitViews. Topic:30. Staff morale was positive within my work unit 606 non-null object 48 Induction. Did you undertake Workplace Induction? 619 non-null object 49 InductionInfo. Topic:Did you undertake a Corporate Induction? 432 non-null object 50 InductionInfo. Topic:Did you undertake a Institute Induction? 483 non-null object 51 InductionInfo. Topic: Did you undertake Team Induction? 440 non-null object 52 InductionInfo. Face to Face Topic:Did you undertake a Corporate Induction; Category:How it was conducted? 555 non-null object 53 InductionInfo. On-line Topic:Did you undertake a Corporate Induction; Category:How it was conducted? 555 non-null object 54 InductionInfo. Induction Manual Topic:Did you undertake a Corporate Induction? 555 non-null object 55 InductionInfo. Face to Face Topic:Did you undertake a Institute Induction? 530 non-null object 56 InductionInfo. On-line Topic:Did you undertake a Institute Induction? 555 non-null object 57 InductionInfo. Induction Manual Topic:Did you undertake a Institute Induction? 553 non-null object 58 InductionInfo. Face to Face Topic: Did you undertake Team Induction; Category? 555 non-null object 59 InductionInfo. On-line Topic: Did you undertake Team Induction?process you undertook and how it was conducted.] 555 non-null object 60 InductionInfo. Induction Manual Topic: Did you undertake Team Induction? 555 non-null object 61 Workplace. Topic:Did you and your Manager develop a Performance and Professional Development Plan (PPDP)? 608 non-null object 62 Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination? 594 non-null object 63 Workplace. Topic:Does your workplace promote and practice the principles of employment equity? 587 non-null object 64 Workplace. Topic:Does your workplace value the diversity of its employees? 586 non-null object 65 Workplace. Topic:Would you recommend the Institute as an employer to others? 581 non-null object 66 Gender. What is your Gender? 596 non-null object 67 CurrentAge. Current Age 596 non-null object 68 Employment Type. Employment Type 596 non-null object 69 Classification. Classification 596 non-null object 70 LengthofServiceOverall. Overall Length of Service at Institute (in years) 596 non-null object 71 LengthofServiceCurrent. Length of Service at current workplace (in years) 596 non-null object dtypes: float64(2), object(70) memory usage: 395.0+ KB
# explore first 5 rows
tafe_survey.head()
Record ID | Institute | WorkArea | CESSATION YEAR | Reason for ceasing employment | Contributing Factors. Career Move - Public Sector | Contributing Factors. Career Move - Private Sector | Contributing Factors. Career Move - Self-employment | Contributing Factors. Ill Health | Contributing Factors. Maternity/Family | ... | Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination? | Workplace. Topic:Does your workplace promote and practice the principles of employment equity? | Workplace. Topic:Does your workplace value the diversity of its employees? | Workplace. Topic:Would you recommend the Institute as an employer to others? | Gender. What is your Gender? | CurrentAge. Current Age | Employment Type. Employment Type | Classification. Classification | LengthofServiceOverall. Overall Length of Service at Institute (in years) | LengthofServiceCurrent. Length of Service at current workplace (in years) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6.341330e+17 | Southern Queensland Institute of TAFE | Non-Delivery (corporate) | 2010.0 | Contract Expired | NaN | NaN | NaN | NaN | NaN | ... | Yes | Yes | Yes | Yes | Female | 26 30 | Temporary Full-time | Administration (AO) | 1-2 | 1-2 |
1 | 6.341337e+17 | Mount Isa Institute of TAFE | Non-Delivery (corporate) | 2010.0 | Retirement | - | - | - | - | - | ... | Yes | Yes | Yes | Yes | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 6.341388e+17 | Mount Isa Institute of TAFE | Delivery (teaching) | 2010.0 | Retirement | - | - | - | - | - | ... | Yes | Yes | Yes | Yes | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 6.341399e+17 | Mount Isa Institute of TAFE | Non-Delivery (corporate) | 2010.0 | Resignation | - | - | - | - | - | ... | Yes | Yes | Yes | Yes | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 6.341466e+17 | Southern Queensland Institute of TAFE | Delivery (teaching) | 2010.0 | Resignation | - | Career Move - Private Sector | - | - | - | ... | Yes | Yes | Yes | Yes | Male | 41 45 | Permanent Full-time | Teacher (including LVT) | 3-4 | 3-4 |
5 rows × 72 columns
From our initial exploration, we can see that:
dete_survey
dataframe contains instances of the string 'Not Stated'
to indicate where a value is missing, however missing values should be represented as NaN
dete_survey
and tafe_survey
contain many columns that we do not need to complete our analysis, and so can be droppedCease Date
and CESSATION YEAR
), we will need to ensure these columns have the exact same name before combining the dataframes# read DETE dataset again, this time utilising the na_values parameter
dete_survey = pd.read_csv("dete_survey.csv", na_values="Not Stated")
The above code will ensure that any missing values labelled as Not Stated
will be imported as NaN
.
Let's drop columns from both dataframes that are not necessary for our analysis, this will make our data much more manageable and easier to work with.
# drop unneeded columns from DETE dataset
dete_drop_cols = dete_survey.columns[28:49]
dete_survey_updated = dete_survey.drop(columns=dete_drop_cols, axis=1)
# drop unneeded columns from TAFE dataset
tafe_drop_cols = tafe_survey.columns[17:66]
tafe_survey_updated = tafe_survey.drop(columns=tafe_drop_cols, axis=1)
Below are some of the columns we'd like to use for our final analysis. Since we want to combine our dataframes, we will need to try and standardise the column names between the dataframes.
dete_survey | tafe_survey | Definition |
---|---|---|
ID | Record ID | An id used to identify the participant of the survey |
SeparationType | Reason for ceasing employment | The reason why the person's employment ended |
Cease Date | LengthofServiceOverall. Overall Length of Service at Institute (in years) | The length of the person's employment (in years) |
DETE Start Date | The year the participant began employment with the DETE | |
LengthofServiceOverall.Overall Length of Service at Institute (in years) | The length of the person's employment (in years) | |
Age | CurrentAge.Current Age | The age of the participant |
Gender | Gender.What is your Gender? | The gender of the participant |
# standardise column names in DETE dataframe
dete_survey_updated.columns = (dete_survey_updated.columns
.str.lower()
.str.strip()
.str.replace(" ", "_")
)
# verify updated column names
print(dete_survey_updated.columns)
Index(['id', 'separationtype', 'cease_date', 'dete_start_date', 'role_start_date', 'position', 'classification', 'region', 'business_unit', 'employment_status', 'career_move_to_public_sector', 'career_move_to_private_sector', 'interpersonal_conflicts', 'job_dissatisfaction', 'dissatisfaction_with_the_department', 'physical_work_environment', 'lack_of_recognition', 'lack_of_job_security', 'work_location', 'employment_conditions', 'maternity/family', 'relocation', 'study/travel', 'ill_health', 'traumatic_incident', 'work_life_balance', 'workload', 'none_of_the_above', 'gender', 'age', 'aboriginal', 'torres_strait', 'south_sea', 'disability', 'nesb'], dtype='object')
# rename columns in TAFE dataframe to match equivalents in the DETE dataframe
# NOTE we will handle the remaining column names later
col_map = {
'Record ID': 'id',
'CESSATION YEAR': 'cease_date',
'Reason for ceasing employment': 'separationtype',
'Gender. What is your Gender?': 'gender',
'CurrentAge. Current Age': 'age',
'Employment Type. Employment Type': 'employment_status',
'Classification. Classification': 'position',
'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'
}
tafe_survey_updated.rename(mapper=col_map, axis=1, inplace=True)
# verify updated column names
print(tafe_survey_updated.columns)
Index(['id', 'Institute', 'WorkArea', 'cease_date', 'separationtype', 'Contributing Factors. Career Move - Public Sector ', 'Contributing Factors. Career Move - Private Sector ', 'Contributing Factors. Career Move - Self-employment', 'Contributing Factors. Ill Health', 'Contributing Factors. Maternity/Family', 'Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction', 'Contributing Factors. Interpersonal Conflict', 'Contributing Factors. Study', 'Contributing Factors. Travel', 'Contributing Factors. Other', 'Contributing Factors. NONE', 'gender', 'age', 'employment_status', 'position', 'institute_service', 'role_service'], dtype='object')
Let's recall the questions we are trying to answer:
We are only interested in when an employee resigns. Let's take a look at the unique values in the separationtype
column of the DETE dataset.
dete_survey_updated["separationtype"].value_counts(dropna=False)
Age Retirement 285 Resignation-Other reasons 150 Resignation-Other employer 91 Resignation-Move overseas/interstate 70 Voluntary Early Retirement (VER) 67 Ill Health Retirement 61 Other 49 Contract Expired 34 Termination 15 Name: separationtype, dtype: int64
We can see that there are only three values in the separationtype
column that are necessary to our analysis:
Resignation-Other reasons
Resignation-Other employer
Resignation-Move overseas/interstate
Therefore we can discard any rows that do not contain the the string 'Resignation'
.
# bool mask for any string containing 'Resignation'
bool_mask = dete_survey_updated["separationtype"].str.contains("Resignation", na=False)
# apply mask to dataframe, ensuring a copy is created to avoid SettingWithCopy warning
dete_resignations = dete_survey_updated.loc[bool_mask].copy()
# verify changes
dete_resignations["separationtype"].value_counts()
Resignation-Other reasons 150 Resignation-Other employer 91 Resignation-Move overseas/interstate 70 Name: separationtype, dtype: int64
Let's try the same thing with the TAFE dataset.
# identify unique values in column
tafe_survey_updated["separationtype"].value_counts(dropna=False)
Resignation 340 Contract Expired 127 Retrenchment/ Redundancy 104 Retirement 82 Transfer 25 Termination 23 NaN 1 Name: separationtype, dtype: int64
We can see that there is one NaN
value in the column. Since we have no way of knowing if this employee resigned or not, we will set the na
parameter of .str.contains()
to False
, so that when we apply the mask to create our new dataframe it will be excluded.
# we can use the same boolean mask as before
bool_mask = tafe_survey_updated["separationtype"].str.contains("Resignation", na=False)
# apply mask to dataframe, ensuring a copy is created to avoid SettingWithCopy warning
tafe_resignations = tafe_survey_updated.loc[bool_mask].copy()
# verify changes
tafe_resignations["separationtype"].value_counts()
Resignation 340 Name: separationtype, dtype: int64
We need to check that our data doesn't contain any major inconsistencies. Let's focus on veryfying the years in the cease_date
and dete_start_date
columns to ensure they make sense.
Since the cease_date
is the last year of the person's employment and the dete_start_date
is the person's first year of employment, it wouldn't make sense to have years after the current date.
Given that most people in this field start working in their 20s, it's also unlikely that the dete_start_date
was before the year 1940.
Let's focus on verifying that all the values in the cease_date
and dete_start_date
columns are logical. First we will look at cease_date
.
# check the unique values in the cease_date column
dete_resignations["cease_date"].value_counts().sort_index(ascending=True)
01/2014 22 05/2012 2 05/2013 2 06/2013 14 07/2006 1 07/2012 1 07/2013 9 08/2013 4 09/2010 1 09/2013 11 10/2013 6 11/2013 9 12/2013 17 2010 1 2012 126 2013 74 Name: cease_date, dtype: int64
There's no dates in here that don't make sense (i.e. they're all in the past), however, we don't really have any need to keep the month of resignation so we can strip that out leaving just the year.
# create a copy to work on
cease_date_copy = dete_resignations["cease_date"].copy()
# remove any month digits and forward slash, then convert from string back to float
clean_cease_date = (cease_date_copy
.str.replace(r'([0-1][0-9]\/)', "")
.astype(float)
)
# verify column
print(clean_cease_date.value_counts())
# assign back to original column
dete_resignations["cease_date"] = clean_cease_date
2013.0 146 2012.0 129 2014.0 22 2010.0 2 2006.0 1 Name: cease_date, dtype: int64
Now let's check out dete_start_date
.
# check the unique values in the dete_start_date column
print(dete_resignations["dete_start_date"].value_counts(dropna=False).sort_index(ascending=True))
# a box plot will make it easier to identify any potential outliers
dete_resignations.boxplot("dete_start_date")
plt.title("Distribution of values in dete_start_date")
1963.0 1 1971.0 1 1972.0 1 1973.0 1 1974.0 2 1975.0 1 1976.0 2 1977.0 1 1980.0 5 1982.0 1 1983.0 2 1984.0 1 1985.0 3 1986.0 3 1987.0 1 1988.0 4 1989.0 4 1990.0 5 1991.0 4 1992.0 6 1993.0 5 1994.0 6 1995.0 4 1996.0 6 1997.0 5 1998.0 6 1999.0 8 2000.0 9 2001.0 3 2002.0 6 2003.0 6 2004.0 14 2005.0 15 2006.0 13 2007.0 21 2008.0 22 2009.0 13 2010.0 17 2011.0 24 2012.0 21 2013.0 10 NaN 28 Name: dete_start_date, dtype: int64
Text(0.5, 1.0, 'Distribution of values in dete_start_date')
The data in the dete_start_date
column looks fine, there is no need to make any modifications.
We only have one date column in the TAFE dataset, cease_date
. Let's take a look.
# check the unique values in the cease_date column
print(tafe_resignations["cease_date"].value_counts().sort_index(ascending=True))
# generate box plox
tafe_resignations.boxplot("cease_date")
plt.title("Distribution of values in cease_date")
2009.0 2 2010.0 68 2011.0 116 2012.0 94 2013.0 55 Name: cease_date, dtype: int64
Text(0.5, 1.0, 'Distribution of values in cease_date')
There doesn't appear to be any issues with the data in the cease_date
column of the tafe_resignations
dataframe. One thing to note is that the years in each dataframe don't quite span the same number of years. Values in the cease_date
column in the DETE dataset range from 2006 to 2014
, whereas in the TAFE dataset they range from 2009 to 2013
. This should not affect our analysis.
In the tafe_resignations
dataframe we have a column called institute_service
that indicates the length of time in years that an employee worked at the TAFE institute. The dete_resignations
dataframe does not have such a column, so we will need to create one as this information is essential for our analysis. Fortunately we can use the cease_date
column and the dete_start_date
column to calculate this.
# calculate length of service and store data in a new column called institute_service
dete_resignations["institute_service"] = dete_resignations["cease_date"] - dete_resignations["dete_start_date"]
# sense check first five rows
dete_resignations[["dete_start_date", "cease_date", "institute_service"]].head()
dete_start_date | cease_date | institute_service | |
---|---|---|---|
3 | 2005.0 | 2012.0 | 7.0 |
5 | 1994.0 | 2012.0 | 18.0 |
8 | 2009.0 | 2012.0 | 3.0 |
9 | 1997.0 | 2012.0 | 15.0 |
11 | 2009.0 | 2012.0 | 3.0 |
Looks like our new institute_service
column has functioned as intended.
Now it's time to identify any employees who resigned due to dissatisfaction. We will use the following columns from each dataframe to do this:
tafe_resignations
Contributing Factors. Dissatisfaction
Contributing Factors. Job Dissatisfaction
dete_resignations
job_dissatisfaction
dissatisfaction_with_the_department
physical_work_environment
lack_of_recognition
lack_of_job_security
work_location
employment_conditions
work_life_balance
workload
If any employees indicated any of the factors above caused them to resign, we'll mark them as True
in a new column called dissatisfied
.
We'll start with the tafe_resignations
dataframe as there are fewer columns to deal with.
# inspect values in each column (including any null values)
print(tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts(dropna=False), "\n")
print(tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts(dropna=False))
- 277 Contributing Factors. Dissatisfaction 55 NaN 8 Name: Contributing Factors. Dissatisfaction, dtype: int64 - 270 Job Dissatisfaction 62 NaN 8 Name: Contributing Factors. Job Dissatisfaction, dtype: int64
It looks like -
represents where dissatisfaction was not the reason for unemployment, while confusingly the name of the column itself has been used as the value to indicate that the employee was dissastified. We can see that NaN
values are also present in both columns.
Let's update these values to:
True
when the employee was dissastifiedFalse
when the employee was not dissastifiedNaN
if the value is NaN
# create function to update the column values as specified above
def update_vals(value):
if pd.isnull(value):
return np.nan
elif value == "-":
return False
return True
# list of columns we want to apply the function to
cols = ["Contributing Factors. Dissatisfaction", "Contributing Factors. Job Dissatisfaction"]
# apply the function elementwise to both columns and assign transformations back to the respective columns
tafe_resignations[cols] = tafe_resignations[cols].applymap(update_vals)
# inspect both columns again
print(tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts(dropna=False), "\n")
print(tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts(dropna=False))
False 277 True 55 NaN 8 Name: Contributing Factors. Dissatisfaction, dtype: int64 False 270 True 62 NaN 8 Name: Contributing Factors. Job Dissatisfaction, dtype: int64
Looks like our function worked as expected. Now that the column values have been updated, we can create the dissastisfied
column and use the method df.any()
to apply a value to the column. df.any()
will do the following:
True
if any element in the selected columns is True
False
if none of the elements in the selected columns is True
NaN
if the value is NaN
# select columns in tafe_resignations that may indicate employee disatisfaction
cols = [
"Contributing Factors. Dissatisfaction",
"Contributing Factors. Job Dissatisfaction",
]
# use df.any() method to return a value of True, False, or NaN
# depending on the values contained in the selected columns
tafe_resignations["dissatisfied"] = tafe_resignations[cols].any(axis=1, skipna=False)
# sense check the new column
cols.append("dissatisfied")
tafe_resignations[cols].head(15)
Contributing Factors. Dissatisfaction | Contributing Factors. Job Dissatisfaction | dissatisfied | |
---|---|---|---|
3 | False | False | False |
4 | False | False | False |
5 | False | False | False |
6 | False | False | False |
7 | False | False | False |
8 | False | False | False |
9 | False | False | False |
10 | False | False | False |
13 | False | False | False |
14 | True | True | True |
15 | False | False | False |
16 | NaN | NaN | NaN |
17 | False | True | True |
18 | NaN | NaN | NaN |
19 | False | False | False |
Looks like that worked, the dissatisfied
column contains True
if either of the other two columns contain True
. Now let's attempt the same process for the dete_resignations
dataframe.
# select columns in dete_resignations that may indicate employee disatisfaction
cols = [
"job_dissatisfaction",
"dissatisfaction_with_the_department",
"physical_work_environment",
"lack_of_recognition",
"lack_of_job_security",
"work_location",
"employment_conditions",
"work_life_balance",
"workload"
]
# use df.any() method to return a value of True, False, or NaN
# depending on the values contained in the selected columns
dete_resignations["dissatisfied"] = dete_resignations[cols].any(axis=1, skipna=False)
# sense check the new column
cols.append("dissatisfied")
dete_resignations[cols].head(15)
job_dissatisfaction | dissatisfaction_with_the_department | physical_work_environment | lack_of_recognition | lack_of_job_security | work_location | employment_conditions | work_life_balance | workload | dissatisfied | |
---|---|---|---|---|---|---|---|---|---|---|
3 | False | False | False | False | False | False | False | False | False | False |
5 | False | False | False | False | False | False | True | False | False | True |
8 | False | False | False | False | False | False | False | False | False | False |
9 | True | True | False | False | False | False | False | False | False | True |
11 | False | False | False | False | False | False | False | False | False | False |
12 | False | False | False | False | False | False | False | False | False | False |
14 | True | True | False | False | False | False | False | False | False | True |
16 | False | False | False | True | False | False | False | False | False | True |
20 | False | False | False | False | False | False | False | False | False | False |
21 | False | False | False | False | False | False | False | False | False | False |
22 | False | True | False | True | False | False | False | True | False | True |
23 | True | False | False | False | False | True | False | True | False | True |
25 | False | False | False | False | False | False | False | True | False | True |
27 | False | False | False | False | False | False | False | False | False | False |
33 | False | False | False | True | False | False | False | False | False | True |
Now we have a dissatisfied
column in the DETE dataset too. We're ready to combine our dataframes.
# create a copy of the results for both datasets and assign to a new variable
# - this will avoid a potential SettingWithCopy warning
tafe_resignations_up = tafe_resignations.copy()
dete_resignations_up = dete_resignations.copy()
# create new column in each dataframe to indiciate the institute
# - this will make it easy to distinguish between the two datasets after combining
tafe_resignations_up["institute"] = "TAFE"
dete_resignations_up["institute"] = "DETE"
# combine dataframes vertically using pd.concat() function
combined = pd.concat([tafe_resignations_up, dete_resignations_up], ignore_index=True)
# identify shape of combined dataframe for reference
print("Combined shape:", combined.shape)
Combined shape: (651, 53)
Now both our dataframes are combined into one dataframe called combined
. We achieved this by using the concat()
function to combine our two orignal dataframes along the vertical axis with an outer join, meaning that all data from both dataframes is preserved. For example, if the DETE dataframe contained a column that did not exist in the TAFE dataframe, any rows from the TAFE dataframe would be updated to contain NaN
for this column in combined
. The same applies vice-versa.
As there are several columns from both dataframes that have no corresponding column, we probably have a large number of NaN
values in our combined dataframe. Let's take a look.
# get sum of NaN in each column of combined dataframe
check_null = combined.isnull().sum()
# filter out any columns with no null values and print remaining columns
null_cols = check_null.loc[check_null != 0]
print(null_cols)
Institute 311 WorkArea 311 cease_date 16 Contributing Factors. Career Move - Public Sector 319 Contributing Factors. Career Move - Private Sector 319 Contributing Factors. Career Move - Self-employment 319 Contributing Factors. Ill Health 319 Contributing Factors. Maternity/Family 319 Contributing Factors. Dissatisfaction 319 Contributing Factors. Job Dissatisfaction 319 Contributing Factors. Interpersonal Conflict 319 Contributing Factors. Study 319 Contributing Factors. Travel 319 Contributing Factors. Other 319 Contributing Factors. NONE 319 gender 59 age 55 employment_status 54 position 53 institute_service 88 role_service 361 dissatisfied 8 dete_start_date 368 role_start_date 380 classification 490 region 386 business_unit 619 career_move_to_public_sector 340 career_move_to_private_sector 340 interpersonal_conflicts 340 job_dissatisfaction 340 dissatisfaction_with_the_department 340 physical_work_environment 340 lack_of_recognition 340 lack_of_job_security 340 work_location 340 employment_conditions 340 maternity/family 340 relocation 340 study/travel 340 ill_health 340 traumatic_incident 340 work_life_balance 340 workload 340 none_of_the_above 340 aboriginal 644 torres_strait 651 south_sea 648 disability 643 nesb 642 dtype: int64
Looks like there are multiple columns in the combined
dataframe that contain significantly large numbers of null values. We know from looking at the shape of combined
that it has 651
rows. Let's drop any columns that contain less than 500
non-null values. That should remove any columns which did not have a corresponding column in the other dataframe.
# drop any columns with less than 500 non-null values using thresh parameter
combined_updated = combined.dropna(thresh=500, axis=1)
# identify shape again
print("Combined updated shape:", combined_updated.shape)
combined_updated
Combined updated shape: (651, 10)
id | cease_date | separationtype | gender | age | employment_status | position | institute_service | dissatisfied | institute | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 6.341399e+17 | 2010.0 | Resignation | NaN | NaN | NaN | NaN | NaN | False | TAFE |
1 | 6.341466e+17 | 2010.0 | Resignation | Male | 41 45 | Permanent Full-time | Teacher (including LVT) | 3-4 | False | TAFE |
2 | 6.341475e+17 | 2010.0 | Resignation | Female | 56 or older | Contract/casual | Teacher (including LVT) | 7-10 | False | TAFE |
3 | 6.341520e+17 | 2010.0 | Resignation | Male | 20 or younger | Temporary Full-time | Administration (AO) | 3-4 | False | TAFE |
4 | 6.341537e+17 | 2010.0 | Resignation | Male | 46 50 | Permanent Full-time | Teacher (including LVT) | 3-4 | False | TAFE |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
646 | 8.100000e+02 | 2013.0 | Resignation-Other reasons | Female | 26-30 | Permanent Part-time | Teacher Aide | 3 | False | DETE |
647 | 8.170000e+02 | 2014.0 | Resignation-Other employer | Male | 21-25 | Permanent Full-time | Teacher | 2 | False | DETE |
648 | 8.180000e+02 | 2014.0 | Resignation-Move overseas/interstate | Female | 21-25 | Permanent Full-time | Teacher | 2 | False | DETE |
649 | 8.210000e+02 | 2014.0 | Resignation-Move overseas/interstate | Female | 31-35 | Permanent Full-time | Public Servant | 5 | True | DETE |
650 | 8.230000e+02 | 2013.0 | Resignation-Move overseas/interstate | NaN | NaN | NaN | Teacher Aide | NaN | False | DETE |
651 rows × 10 columns
Using the df.dropna()
method, we removed 43
columns from our dataset, leaving us with a much more managebale total of 10
columns. We can see that all columns necessary for our analysis remain, however, some columns still require cleaning before we can start out analysis. For example, the institute_service
column contains values in a couple of different forms.
institute_service
column and categorising length of service¶The institute_service
column indicates the time in years that an employee has served for. From the preview of combined_updated
in the cell above we can see that some values represent a range of years rather while others are just an absolute number of years. Let's take a closer look.
# check unique values in institute service column
combined_updated["institute_service"].value_counts()
Less than 1 year 73 1-2 64 3-4 63 5-6 33 11-20 26 5.0 23 1.0 22 7-10 21 0.0 20 3.0 20 6.0 17 4.0 16 9.0 14 2.0 14 7.0 13 More than 20 years 10 8.0 8 13.0 8 20.0 7 15.0 7 10.0 6 12.0 6 14.0 6 17.0 6 22.0 6 16.0 5 18.0 5 11.0 4 24.0 4 23.0 4 19.0 3 39.0 3 21.0 3 32.0 3 28.0 2 30.0 2 26.0 2 36.0 2 25.0 2 27.0 1 29.0 1 31.0 1 33.0 1 34.0 1 35.0 1 38.0 1 41.0 1 42.0 1 49.0 1 Name: institute_service, dtype: int64
In addition to cleaning this column, to make our analysis easier we will convert these numbers into categories. We willl base our analysis on this article from BusinessWire, which posits that understanding employee's needs according to their career stage is more effective that trying to do so by their age.
We'll use a slightly modified set of the article's definitions:
Now that we have our definitions, let's categorise the values in the institute_service
column. There are two main steps we need to do this:
institute_service
columnA note on extraction: for any values that represent a range of years, we will take the lower limit of the range. The only exception to this is the value Less than 1 year
, which we will consider as just 1
year of service.
## Step 1: extract the years of service from each value ##
# create copy of combined_updated to avoid SettingsWithCopy error
combined_clean = combined_updated.copy()
# convert institute_service dtype to string and extract first instance of regex match -
# for any values that represent a range of years, this will take the lower limit of the range
combined_clean["institute_service"] = (combined_updated["institute_service"]
.astype('str')
.str.extract(r"(\d+)")
)
# convert values to floats
combined_clean["institute_service"] = combined_clean["institute_service"].astype('float')
# check extraction has worked as expected
print("Check extracted values: \n")
print(combined_clean["institute_service"].value_counts(dropna=False).sort_index(ascending=True))
Check extracted values: 0.0 20 1.0 159 2.0 14 3.0 83 4.0 16 5.0 56 6.0 17 7.0 34 8.0 8 9.0 14 10.0 6 11.0 30 12.0 6 13.0 8 14.0 6 15.0 7 16.0 5 17.0 6 18.0 5 19.0 3 20.0 17 21.0 3 22.0 6 23.0 4 24.0 4 25.0 2 26.0 2 27.0 1 28.0 2 29.0 1 30.0 2 31.0 1 32.0 3 33.0 1 34.0 1 35.0 1 36.0 2 38.0 1 39.0 3 41.0 1 42.0 1 49.0 1 NaN 88 Name: institute_service, dtype: int64
## Step 2: map each value to one of the career stage defintions ##
# create function to map the length of service to a career stage
def career_stage(service):
if pd.isnull(service):
pass
elif service < 3:
return "New"
elif service < 7:
return "Experienced"
elif service < 11:
return "Established"
else: return "Veteran"
# use Series.apply() to apply the function to the institute_service column
combined_clean["service_cat"] = combined_clean["institute_service"].apply(career_stage)
# check values in new column
print(combined_clean["service_cat"].value_counts(dropna=False), "\n")
New 193 Experienced 172 Veteran 136 NaN 88 Established 62 Name: service_cat, dtype: int64
We now have a new column, service_cat
, in our dataframe that indicates what stage an employee is at in their career.
age
column¶One of the questions we need to answer for our analysis relies on knowing the age of the employee. Let's inspect the age column to see what the data is currently like.
combined_clean["age"].value_counts(dropna=False)
51-55 71 NaN 55 41-45 48 41 45 45 46-50 42 36-40 41 46 50 39 26-30 35 21 25 33 26 30 32 36 40 32 31 35 32 31-35 29 56 or older 29 21-25 29 56-60 26 61 or older 23 20 or younger 10 Name: age, dtype: int64
It appears we have a mix of values, all of which represent ranges of years. Since it's not necessary for this column to be in a numerical form, unlike the institute_service
column that we previously cleaned, we can keep the values as strings, but we will need to tidy up the ranges so there is no overlap.
Let's attempt the following:
56 or older
to 56-60
# create copy to work on
ages_dirty = combined_clean["age"].copy()
ages_clean = (ages_dirty
.str.strip()
.str.replace(" ", "-", regex=False)
.str.replace("56 or older", "56-60", regex=False)
)
# check data is clean
print(ages_clean.value_counts(dropna=False))
# assign back to age column in combined_clean
combined_clean["age"] = ages_clean.copy()
41-45 93 46-50 81 36-40 73 51-55 71 26-30 67 21-25 62 31-35 61 56-60 55 NaN 55 61 or older 23 20 or younger 10 Name: age, dtype: int64
The age
column is now cleaned up.
dissatisfied
column¶Recall that the dissatisfied
column consists of boolean values. Methods like df.pivot_table()
actually treat boolean values as integers, so a True
value is considered to be 1
and a False
value is considered to be 0
. That means that we can aggregate the dissatisfied column and calculate the number of people in each group, the percentage of people in each group, etc.
Before we try aggregating the dissatisfied
column, let's check if there are any missing values.
# print unique values (including NaN) for the dissatisfied column
print("Count \n")
print(combined_clean["dissatisfied"].value_counts(dropna=False))
print("\nRelative frequency\n")
print(round(combined_clean["dissatisfied"].value_counts(dropna=False, normalize=True)*100))
Count False 403 True 240 NaN 8 Name: dissatisfied, dtype: int64 Relative frequency False 62.0 True 37.0 NaN 1.0 Name: dissatisfied, dtype: float64
Since there are only eight missing values, representing around 1%
of all values in the column, it is probably safe to replace these missing values with the value that occurs most frequently (False
) without affecting our analysis.
# replace any NaN values with 'False'
combined_clean.fillna(value={"dissatisfied": False}, inplace=True)
# validate changes
combined_clean["dissatisfied"].value_counts(dropna=False)
False 411 True 240 Name: dissatisfied, dtype: int64
We have now dealt with missing values in the dissatisfied
column, let's try and resolve the missing values for the other columns.
This will be the final step before we can begin our analysis. We need to decide what to do with any remaining missing values in our combined dataset. First, let's check how many missing values we have in each of our columns.
# identify all remaining missing values in each column
print(combined_clean.isnull().sum())
id 0 cease_date 16 separationtype 0 gender 59 age 55 employment_status 54 position 53 institute_service 88 dissatisfied 0 institute 0 service_cat 88 dtype: int64
Let's plot a heatmap of our missing data, setting the institute
column as the index since it is likely that if there are any patterns in the missing data it will probably be related to the original dataset, DETE or TAFE.
# import seaborn library, set `institute` as index column and plot heatmap of null values
import seaborn as sns
combined_institute = combined_clean.set_index('institute')
sns.heatmap(combined_institute.isnull(), cbar=False)
<matplotlib.axes._subplots.AxesSubplot at 0x7fcc14cf4a90>
We can see from the heatmap that there are some patterns to the missing values in our combined dataset. There is a clear pattern for rows from the TAFE dataset that containing missing values. Each row has missing information across several columns:
gender
age
employment_status
position
institute_service
service_cat
There is less of a consistent pattern for missing data from the DETE institute. The predominant column with missing values is institute_service
, the values of which are used to determine the employees career stage, hence this pattern is reflected in service_cat
.
What are our options for handling this missing data?
Option A: substitute missing values
For all non-numeric columns, we could replace NaN
with the string Unknown
as we can't reliably estimate what the value should be. For all numeric columns, we could substitute all missing values with the mean value of the column.
Option B: drop all rows with null values
This is the easier of the two options, however if null values make up a significant number of rows in our dataset, it could affect our analysis. Let's figure out what percentage of the total values in each column are missing.
# calculate percentage of missing values in each column
combined_clean.isnull().sum() / len(combined_clean.index) * 100
id 0.000000 cease_date 2.457757 separationtype 0.000000 gender 9.062980 age 8.448541 employment_status 8.294931 position 8.141321 institute_service 13.517665 dissatisfied 0.000000 institute 0.000000 service_cat 13.517665 dtype: float64
Looking at the heatmap, it's clear that for the vast majority of rows that contain missing data, this data is almost always missing for the institute_service
column, and by extension, the service_cat
column. Given that:
13.5%
It's probably best to go for option B and drop all rows containing NaN values.
# Option B - drop all rows containing NaN values
combined_final = combined_clean.copy()
combined_final.dropna(inplace=True)
combined_final.isnull().sum()
id 0 cease_date 0 separationtype 0 gender 0 age 0 employment_status 0 position 0 institute_service 0 dissatisfied 0 institute 0 service_cat 0 dtype: int64
To be confident that dropping missing values will not affect our analysis, let's compare the mean of the dissatisied
column for each category in service_cat
before and after dropping the missing values.
# pivot table with dataset prior to dropping missing values
combined_clean_pivot = combined_clean.pivot_table(values="dissatisfied", index="service_cat")
# pivot table with dataset after dropping missing values
combined_final_pivot = combined_final.pivot_table(values="dissatisfied", index="service_cat")
# display output
print("Before dropping missing values \n\n", combined_clean_pivot, "\n\n")
print("After dropping missing values \n\n", combined_final_pivot, "\n\n")
# calculate difference in mean before and after removing null rows
print("Difference in mean \n\n", combined_final_pivot - combined_clean_pivot)
Before dropping missing values dissatisfied service_cat Established 0.516129 Experienced 0.343023 New 0.295337 Veteran 0.485294 After dropping missing values dissatisfied service_cat Established 0.508197 Experienced 0.345029 New 0.298429 Veteran 0.492188 Difference in mean dissatisfied service_cat Established -0.007932 Experienced 0.002006 New 0.003093 Veteran 0.006893
We can see that dropping the rows containing missing values has had a minimal impact on the mean of the dissatisfied
column. Now we're ready to start our analysis!
To recap, we've accomplished the following:
NaN
rather than Not stated
institute_service
column to indiciate an employee's length of servicedissatisfied
, to indicate whether an employee resigned due to dissatisfaction in their roleage
columndissatisfied
columnAs a reminder, we are trying to answer the following questions:
Let's calculate the percentage of dissatisifed employees in each service_cat
group. To do this we will use use df.pivot_table()
to aggregate the service_cat
column grouped by the mean of the dissatisifed
column. We know that since True
equates to a value of 1
and False
equates to 0
, calculating the mean is equivalent to calcuating the percentage of dissastisfied employees, while calculating the sum is equivalent to counting the number of dissatisfied employees.
# create bar chart showing mean dissatisfaction for each category
# we can use the pivot table we created earlier
ax = combined_final_pivot.plot(
kind="bar",
rot=35,
title="Percentage resigned due to dissastisfaction",
legend=False
)
The bar plot indicates that resignation due to dissatisfaction increases with length of service, though ever so slightly decreases for Veteran
level employees. It is reasonable to expect that the longer someone works at an organisation the more likely it is that they would resign as a result of becoming dissatisfied.
The bar plot has also revealed that the categories within service_cat
have no defined order, it would be preferable that the categories appeared in order of the length of service they represent. We can change this by utilising Panda's CategoricalDtype
, a special datatype we can use to define our categories in the order we want. Let's try remaking the above bar plot but this time we will create a CategoricalDtype
that defines our categories in order, and then convert the values in the service_cat
column from the object
dtype to our new CategoricalDtype
.
from pandas.api.types import CategoricalDtype
# create a CategoricalDtype that defines the order of our categories
cat_type = CategoricalDtype(categories=["New", "Experienced", "Established", "Veteran"], ordered=True)
# convert the service_cat column dtype to the new CategoricalDtype
combined_final["service_cat"] = combined_final["service_cat"].astype(cat_type)
# create a new pivot table aggregating by both the sum and the mean
service_pivot = combined_final.pivot_table(
values="dissatisfied",
index="service_cat",
aggfunc=[np.sum, np.mean]
)
# generate plots
fig, axs = plt.subplots(nrows = 2, ncols = 1, figsize = (6, 9))
ax1 = service_pivot["mean"].plot(
kind="bar",
ax = axs[0],
rot=35,
title="Mean of resignations due to dissastisfaction",
legend=False
)
ax1.xaxis.set_visible(False)
ax2 = service_pivot["sum"].plot(
kind="bar",
ax = axs[1],
rot=35,
title="Sum of resigntaions due to dissastisfaction",
legend=False
)
ax2.xaxis.set_visible(False)
plt.tight_layout() # fixes overlapping plot elements
plt.show()
Now the categories are in an intuitive order, from least experienced to most experienced. We also plotted an additional bar chart showing the same data but aggregated by the sum instead of the mean. Interestingly, although Established
employees we're the most likely to resign due to dissatisfaction, the actual total of resignations was less than any other category of service length.
Now let's explore grouping the values of the dissatisfied
column by the values in the age
column.
# create and print a pivot table aggregating values by both the mean and the sum
age_pivot = combined_final.pivot_table(values="dissatisfied", index="age", aggfunc=[np.mean, np.sum])
print(age_pivot)
# generate horizontal bar plots using pivot table for both the sum and the mean
fig, axs = plt.subplots(nrows = 2, ncols = 1, figsize = (6, 9))
ax1 = age_pivot["sum"].plot.barh(
ax = axs[0],
legend=False,
title="Sum of resignations due to dissatisfaction by age group"
)
ax2 = age_pivot["mean"].plot.barh(
ax = axs[1],
legend=False,
title="Mean of resignations due to dissatisfaction by age group"
)
mean sum dissatisfied dissatisfied age 20 or younger 0.200000 2.0 21-25 0.288136 17.0 26-30 0.412698 26.0 31-35 0.383333 23.0 36-40 0.363636 24.0 41-45 0.393258 35.0 46-50 0.402778 29.0 51-55 0.402985 27.0 56-60 0.346939 17.0 61 or older 0.625000 10.0
Looking at the first plot, we can see that the age group with the most resignations due to dissatisfaction the ages 41-45
group. Comparitavely, there are typically fewer resignations towards the oldest and youngest age groups. This doesn't tell the full story, however, as it is likely that there are just more resignations (for all reasons) in total for the 41-45
age group.
The second plot is much more telling. Recall that by plotting the mean of the dissatisfied
column we are effectively looking at the percentage of employees who resigned due to dissatisfaction for each age group. We can now see that it is the individuals in the 61 or older
age group who are the most likely to resign due to feeling dissatisfied, accounting for
62.5%
of all resignations for the age group. Although the 41-45
age group had the most resignations due to disatisfaciton, this accounted for a smaller percentage (39%
) of the total resignations for the cohort.
Let's take a look at the overall distribution of dissatisfaction across age groups with a boxplot.
ax = age_pivot["mean"].boxplot()
ax.set_title("Distribution of mean dissatisfaction across each age group")
ax.xaxis.set_visible(False)
Looking at this boxplot, we can see that typically around 35 to 40 percent of all resignations are due to dissatisfaction.
Let's see if there is a difference the DETE and TAFE institutes when it comes to measuring the number of employees who resigned due to dissatisfaction.
institute_pivot = combined_final.pivot_table(values="dissatisfied", index="service_cat", columns="institute")
institute_pivot.plot.barh()
<matplotlib.axes._subplots.AxesSubplot at 0x7fcc16d3e1f0>
Interestingly, the plot above reveals to us that former employees of the DETE institute were signfiicantly more dissatisfied than their TAFE counterparts. The pattern we observed earlier of longer-serving employees being more likely to resign due to dissatisfaction than shorter serving employees remains consistent across both institutes, though it is less pronounced at the TAFE institute.
This was a long exercise in data cleaning, but we were able to find answers to the questions posed at the start of this project.
We found that employees who have worked at the institutes for longer periods, Established
and Veteran
employees, were more likely to resign due to dissatisfaction (approximately 50%) compared with employees who had worked for shorter periods,Experienced
and New
employees (around 30%).
It is typically older employees who are resigning due to dissastisfaction. We found that over 60% of employees who were 61 or older
resigned due to dissatisfaction, though it is worth noting this represents as absolute total of just 10
employees. In comparison, a total of 35 employees resigned due to dissastisfaction from the 41-45
bracket, however this only made up around 40 percent of the total resignations for that group.
Other insights and points for further investigations
Veteran
or Established
employees, who we know are most likely to resign due to dissatisfaction. Perhaps there are other factors that could be contributuing towards dissastisfaction, for example, poor working environment, lower average salary, or short lunch breaks.