In this project I will clean, organize, standardize and analyse data from employee exit surveys from Department of Education, Training and Employment (DETE) (download here) and the Technical and Further Education (TAFE)(download here) institute in Queensland, Australia. Below are questions I will try to answer as a result of completing mentioned in previous sentence processes:
In below cell I import necessary libraries and setting some display options for dataframes.
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt, rcParams
from my_functions import print_2n
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 100)
pascal_name_convention
Below I read data (2 csv files) into pandas dataframe:
dete_sur = pd.read_csv('dete_survey.csv')
tafe_sur = pd.read_csv('tafe_survey.csv')
To get familiar with data read into dataframes, first, I will display top 5 rows from both ('dete_sur' & 'tafe_sur')
dete_sur.head()
ID | SeparationType | Cease Date | DETE Start Date | Role Start Date | Position | Classification | Region | Business Unit | Employment Status | Career move to public sector | Career move to private sector | Interpersonal conflicts | Job dissatisfaction | Dissatisfaction with the department | Physical work environment | Lack of recognition | Lack of job security | Work location | Employment conditions | Maternity/family | Relocation | Study/Travel | Ill Health | Traumatic incident | Work life balance | Workload | None of the above | Professional Development | Opportunities for promotion | Staff morale | Workplace issue | Physical environment | Worklife balance | Stress and pressure support | Performance of supervisor | Peer support | Initiative | Skills | Coach | Career Aspirations | Feedback | Further PD | Communication | My say | Information | Kept informed | Wellness programs | Health & Safety | Gender | Age | Aboriginal | Torres Strait | South Sea | Disability | NESB | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Ill Health Retirement | 08/2012 | 1984 | 2004 | Public Servant | A01-A04 | Central Office | Corporate Strategy and Peformance | Permanent Full-time | True | False | False | True | False | False | True | False | False | False | False | False | False | False | False | False | False | True | A | A | N | N | N | A | A | A | A | N | N | N | A | A | A | N | A | A | N | N | N | Male | 56-60 | NaN | NaN | NaN | NaN | Yes |
1 | 2 | Voluntary Early Retirement (VER) | 08/2012 | Not Stated | Not Stated | Public Servant | AO5-AO7 | Central Office | Corporate Strategy and Peformance | Permanent Full-time | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | A | A | N | N | N | N | A | A | A | N | N | N | A | A | A | N | A | A | N | N | N | Male | 56-60 | NaN | NaN | NaN | NaN | NaN |
2 | 3 | Voluntary Early Retirement (VER) | 05/2012 | 2011 | 2011 | Schools Officer | NaN | Central Office | Education Queensland | Permanent Full-time | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | A | A | N | N | N | N | Male | 61 or older | NaN | NaN | NaN | NaN | NaN |
3 | 4 | Resignation-Other reasons | 05/2012 | 2005 | 2006 | Teacher | Primary | Central Queensland | NaN | Permanent Full-time | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | A | N | N | N | A | A | N | N | A | A | A | A | A | A | A | A | A | A | A | N | A | Female | 36-40 | NaN | NaN | NaN | NaN | NaN |
4 | 5 | Age Retirement | 05/2012 | 1970 | 1989 | Head of Curriculum/Head of Special Education | NaN | South East | NaN | Permanent Full-time | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | A | A | N | N | D | D | N | A | A | A | A | A | A | SA | SA | D | D | A | N | A | M | Female | 61 or older | NaN | NaN | NaN | NaN | NaN |
tafe_sur.head()
Record ID | Institute | WorkArea | CESSATION YEAR | Reason for ceasing employment | Contributing Factors. Career Move - Public Sector | Contributing Factors. Career Move - Private Sector | Contributing Factors. Career Move - Self-employment | Contributing Factors. Ill Health | Contributing Factors. Maternity/Family | Contributing Factors. Dissatisfaction | Contributing Factors. Job Dissatisfaction | Contributing Factors. Interpersonal Conflict | Contributing Factors. Study | Contributing Factors. Travel | Contributing Factors. Other | Contributing Factors. NONE | Main Factor. Which of these was the main factor for leaving? | InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction | InstituteViews. Topic:2. I was given access to skills training to help me do my job better | InstituteViews. Topic:3. I was given adequate opportunities for personal development | InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL% | InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had | InstituteViews. Topic:6. The organisation recognised when staff did good work | InstituteViews. Topic:7. Management was generally supportive of me | InstituteViews. Topic:8. Management was generally supportive of my team | InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me | InstituteViews. Topic:10. Staff morale was positive within the Institute | InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly | InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently | InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly | WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit | WorkUnitViews. Topic:15. I worked well with my colleagues | WorkUnitViews. Topic:16. My job was challenging and interesting | WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work | WorkUnitViews. Topic:18. I had sufficient contact with other people in my job | WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job | WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job | WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT] | WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job | WorkUnitViews. Topic:23. My job provided sufficient variety | WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job | WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction | WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance | WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area | WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date | WorkUnitViews. Topic:29. There was adequate communication between staff in my unit | WorkUnitViews. Topic:30. Staff morale was positive within my work unit | Induction. Did you undertake Workplace Induction? | InductionInfo. Topic:Did you undertake a Corporate Induction? | InductionInfo. Topic:Did you undertake a Institute Induction? | InductionInfo. Topic: Did you undertake Team Induction? | InductionInfo. Face to Face Topic:Did you undertake a Corporate Induction; Category:How it was conducted? | InductionInfo. On-line Topic:Did you undertake a Corporate Induction; Category:How it was conducted? | InductionInfo. Induction Manual Topic:Did you undertake a Corporate Induction? | InductionInfo. Face to Face Topic:Did you undertake a Institute Induction? | InductionInfo. On-line Topic:Did you undertake a Institute Induction? | InductionInfo. Induction Manual Topic:Did you undertake a Institute Induction? | InductionInfo. Face to Face Topic: Did you undertake Team Induction; Category? | InductionInfo. On-line Topic: Did you undertake Team Induction?process you undertook and how it was conducted.] | InductionInfo. Induction Manual Topic: Did you undertake Team Induction? | Workplace. Topic:Did you and your Manager develop a Performance and Professional Development Plan (PPDP)? | Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination? | Workplace. Topic:Does your workplace promote and practice the principles of employment equity? | Workplace. Topic:Does your workplace value the diversity of its employees? | Workplace. Topic:Would you recommend the Institute as an employer to others? | Gender. What is your Gender? | CurrentAge. Current Age | Employment Type. Employment Type | Classification. Classification | LengthofServiceOverall. Overall Length of Service at Institute (in years) | LengthofServiceCurrent. Length of Service at current workplace (in years) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6.341330e+17 | Southern Queensland Institute of TAFE | Non-Delivery (corporate) | 2010.0 | Contract Expired | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Agree | Agree | Agree | Neutral | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Strongly Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Neutral | Agree | Agree | Yes | Yes | Yes | Yes | Face to Face | - | - | Face to Face | - | - | Face to Face | - | - | Yes | Yes | Yes | Yes | Yes | Female | 26 30 | Temporary Full-time | Administration (AO) | 1-2 | 1-2 |
1 | 6.341337e+17 | Mount Isa Institute of TAFE | Non-Delivery (corporate) | 2010.0 | Retirement | - | - | - | - | - | - | - | - | - | Travel | - | - | NaN | Agree | Agree | Agree | Agree | Agree | Strongly Agree | Strongly Agree | Agree | Strongly Agree | Agree | Agree | Agree | Disagree | Strongly Agree | Strongly Agree | Strongly Agree | Agree | Agree | Agree | Strongly Agree | Agree | Agree | Agree | Strongly Agree | Agree | Strongly Agree | Strongly Agree | Agree | Agree | Strongly Agree | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Yes | Yes | Yes | Yes | Yes | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 6.341388e+17 | Mount Isa Institute of TAFE | Delivery (teaching) | 2010.0 | Retirement | - | - | - | - | - | - | - | - | - | - | - | NONE | NaN | Agree | Agree | Agree | Agree | Agree | Agree | Strongly Agree | Agree | Agree | Agree | Agree | Neutral | Neutral | Strongly Agree | Strongly Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Yes | Yes | Yes | Yes | Yes | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 6.341399e+17 | Mount Isa Institute of TAFE | Non-Delivery (corporate) | 2010.0 | Resignation | - | - | - | - | - | - | - | - | - | Travel | - | - | NaN | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Yes | No | Yes | Yes | - | - | - | NaN | - | - | - | - | - | Yes | Yes | Yes | Yes | Yes | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 6.341466e+17 | Southern Queensland Institute of TAFE | Delivery (teaching) | 2010.0 | Resignation | - | Career Move - Private Sector | - | - | - | - | - | - | - | - | - | - | NaN | Agree | Agree | Strongly Agree | Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Agree | Strongly Agree | Strongly Agree | Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Strongly Agree | Yes | Yes | Yes | Yes | - | - | Induction Manual | Face to Face | - | - | Face to Face | - | - | Yes | Yes | Yes | Yes | Yes | Male | 41 45 | Permanent Full-time | Teacher (including LVT) | 3-4 | 3-4 |
We cannot see all columns above and there is difficulty in comprehending data displayed in such way. Therefore, additionally in below cells I use df.info() method to display all column labels, number of non-null values in each column and data type.
dete_sur.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 822 entries, 0 to 821 Data columns (total 56 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 822 non-null int64 1 SeparationType 822 non-null object 2 Cease Date 822 non-null object 3 DETE Start Date 822 non-null object 4 Role Start Date 822 non-null object 5 Position 817 non-null object 6 Classification 455 non-null object 7 Region 822 non-null object 8 Business Unit 126 non-null object 9 Employment Status 817 non-null object 10 Career move to public sector 822 non-null bool 11 Career move to private sector 822 non-null bool 12 Interpersonal conflicts 822 non-null bool 13 Job dissatisfaction 822 non-null bool 14 Dissatisfaction with the department 822 non-null bool 15 Physical work environment 822 non-null bool 16 Lack of recognition 822 non-null bool 17 Lack of job security 822 non-null bool 18 Work location 822 non-null bool 19 Employment conditions 822 non-null bool 20 Maternity/family 822 non-null bool 21 Relocation 822 non-null bool 22 Study/Travel 822 non-null bool 23 Ill Health 822 non-null bool 24 Traumatic incident 822 non-null bool 25 Work life balance 822 non-null bool 26 Workload 822 non-null bool 27 None of the above 822 non-null bool 28 Professional Development 808 non-null object 29 Opportunities for promotion 735 non-null object 30 Staff morale 816 non-null object 31 Workplace issue 788 non-null object 32 Physical environment 817 non-null object 33 Worklife balance 815 non-null object 34 Stress and pressure support 810 non-null object 35 Performance of supervisor 813 non-null object 36 Peer support 812 non-null object 37 Initiative 813 non-null object 38 Skills 811 non-null object 39 Coach 767 non-null object 40 Career Aspirations 746 non-null object 41 Feedback 792 non-null object 42 Further PD 768 non-null object 43 Communication 814 non-null object 44 My say 812 non-null object 45 Information 816 non-null object 46 Kept informed 813 non-null object 47 Wellness programs 766 non-null object 48 Health & Safety 793 non-null object 49 Gender 798 non-null object 50 Age 811 non-null object 51 Aboriginal 16 non-null object 52 Torres Strait 3 non-null object 53 South Sea 7 non-null object 54 Disability 23 non-null object 55 NESB 32 non-null object dtypes: bool(18), int64(1), object(37) memory usage: 258.6+ KB
Unfortunately if we got really long column labels df.info() method will print expected result not in easily readable way. Therefore, below I created dataframe and added to it column labels, data types and number of non-null values.
pd.DataFrame({'data_type': tafe_sur.dtypes, 'non_null': tafe_sur.notnull().sum()})
data_type | non_null | |
---|---|---|
Record ID | float64 | 702 |
Institute | object | 702 |
WorkArea | object | 702 |
CESSATION YEAR | float64 | 695 |
Reason for ceasing employment | object | 701 |
Contributing Factors. Career Move - Public Sector | object | 437 |
Contributing Factors. Career Move - Private Sector | object | 437 |
Contributing Factors. Career Move - Self-employment | object | 437 |
Contributing Factors. Ill Health | object | 437 |
Contributing Factors. Maternity/Family | object | 437 |
Contributing Factors. Dissatisfaction | object | 437 |
Contributing Factors. Job Dissatisfaction | object | 437 |
Contributing Factors. Interpersonal Conflict | object | 437 |
Contributing Factors. Study | object | 437 |
Contributing Factors. Travel | object | 437 |
Contributing Factors. Other | object | 437 |
Contributing Factors. NONE | object | 437 |
Main Factor. Which of these was the main factor for leaving? | object | 113 |
InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction | object | 608 |
InstituteViews. Topic:2. I was given access to skills training to help me do my job better | object | 613 |
InstituteViews. Topic:3. I was given adequate opportunities for personal development | object | 610 |
InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL% | object | 608 |
InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had | object | 615 |
InstituteViews. Topic:6. The organisation recognised when staff did good work | object | 607 |
InstituteViews. Topic:7. Management was generally supportive of me | object | 614 |
InstituteViews. Topic:8. Management was generally supportive of my team | object | 608 |
InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me | object | 610 |
InstituteViews. Topic:10. Staff morale was positive within the Institute | object | 602 |
InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly | object | 601 |
InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently | object | 597 |
InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly | object | 601 |
WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit | object | 609 |
WorkUnitViews. Topic:15. I worked well with my colleagues | object | 605 |
WorkUnitViews. Topic:16. My job was challenging and interesting | object | 607 |
WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work | object | 610 |
WorkUnitViews. Topic:18. I had sufficient contact with other people in my job | object | 613 |
WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job | object | 609 |
WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job | object | 609 |
WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT] | object | 608 |
WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job | object | 608 |
WorkUnitViews. Topic:23. My job provided sufficient variety | object | 611 |
WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job | object | 610 |
WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction | object | 611 |
WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance | object | 606 |
WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area | object | 610 |
WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date | object | 609 |
WorkUnitViews. Topic:29. There was adequate communication between staff in my unit | object | 603 |
WorkUnitViews. Topic:30. Staff morale was positive within my work unit | object | 606 |
Induction. Did you undertake Workplace Induction? | object | 619 |
InductionInfo. Topic:Did you undertake a Corporate Induction? | object | 432 |
InductionInfo. Topic:Did you undertake a Institute Induction? | object | 483 |
InductionInfo. Topic: Did you undertake Team Induction? | object | 440 |
InductionInfo. Face to Face Topic:Did you undertake a Corporate Induction; Category:How it was conducted? | object | 555 |
InductionInfo. On-line Topic:Did you undertake a Corporate Induction; Category:How it was conducted? | object | 555 |
InductionInfo. Induction Manual Topic:Did you undertake a Corporate Induction? | object | 555 |
InductionInfo. Face to Face Topic:Did you undertake a Institute Induction? | object | 530 |
InductionInfo. On-line Topic:Did you undertake a Institute Induction? | object | 555 |
InductionInfo. Induction Manual Topic:Did you undertake a Institute Induction? | object | 553 |
InductionInfo. Face to Face Topic: Did you undertake Team Induction; Category? | object | 555 |
InductionInfo. On-line Topic: Did you undertake Team Induction?process you undertook and how it was conducted.] | object | 555 |
InductionInfo. Induction Manual Topic: Did you undertake Team Induction? | object | 555 |
Workplace. Topic:Did you and your Manager develop a Performance and Professional Development Plan (PPDP)? | object | 608 |
Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination? | object | 594 |
Workplace. Topic:Does your workplace promote and practice the principles of employment equity? | object | 587 |
Workplace. Topic:Does your workplace value the diversity of its employees? | object | 586 |
Workplace. Topic:Would you recommend the Institute as an employer to others? | object | 581 |
Gender. What is your Gender? | object | 596 |
CurrentAge. Current Age | object | 596 |
Employment Type. Employment Type | object | 596 |
Classification. Classification | object | 596 |
LengthofServiceOverall. Overall Length of Service at Institute (in years) | object | 596 |
LengthofServiceCurrent. Length of Service at current workplace (in years) | object | 596 |
Now and later in this project I will need to use 'series.value_counts()' method to get counts of unique values in several columns. But this method can be used only for one column at a time. Using this method several times for different columns will take a lot of space. Also we cannot see output of method for several columns at same time.
Therefore, below I wrote 'df_vc' function. This function returns dataframe. Dataframe contains output(s) of 'series.value_counts' for 1(or more) column(s). In other words user specifies source dataframe and columns, function will return value_counts for those columns in one dataframe. Basically resulting dataframe can be viewed as collection of couple of columns. In each couple 1st column is unique value and 2nd column is the count of it.
Note: I did not use 'dete_sur.iloc[:,1:11].apply(pd.Series.value_counts)' on purpose. Reason is that this way provides datarame with lots of NaN values if we have more columns and those columns have different values and different type of values. Therefore, this method does not contribute to easiness of readability at all.
def df_vc(ur_df, cols=None, normalize=False, sort=True, index=None,
ascending=False, bins=None, dropna=True, vc_nan='N/A'):
vc_df=None # initial stage of df that is returned by function
if cols==None: cols=range(ur_df.shape[1]) # all cols if cols range not specified
for col in cols:# iterating through each column's position/label
# finding position of col if it is specified by str & not by int
if isinstance(col, str): col = ur_df.columns.get_loc(col)
# creating dataframe (2 columns) from pd.value_count for each col in ur_df
new_vc_df = pd.DataFrame(ur_df.iloc[:,col].value_counts
(dropna=dropna, normalize=normalize, sort=sort,
bins=bins, ascending=ascending))
# sort by index each new df or ignore this step
if index == 'ascending': new_vc_df.sort_index(inplace=True)
elif index == 'descending': new_vc_df.sort_index(ascending=False, inplace=True)
elif index is None: pass
# reset index for renaming and concat purposes,
# filling nan to differentiate from concat related nan
new_vc_df = new_vc_df.reset_index().fillna('N/A')
# rename columns names to: 'col label in ur_df' & 'vc(col label in ur_df)'
labels = {new_vc_df.columns[0]: new_vc_df.columns[1],
new_vc_df.columns[1]: f'vc({new_vc_df.columns[1]})'}
new_vc_df = new_vc_df.rename(columns=labels)
# aggregating above created dataframes (2 columns) in 1 df
vc_df = pd.concat([vc_df, new_vc_df], axis=1)
# deleting nan coming from concat and replacing default 'N/A'
# in value_counts part if other param was entered for vc_nan arg
vc_df = vc_df.fillna('')
if vc_nan != 'N/A': vc_df = vc_df.replace('N/A', vc_nan)
return vc_df
In below cells I used above written function for both dataframes. I specified columns which I use later in project for analysis.
df_vc(dete_sur, ['Age', 'Gender', 'Cease Date', 'DETE Start Date'], dropna=False, vc_nan=np.nan)
Age | vc(Age) | Gender | vc(Gender) | Cease Date | vc(Cease Date) | DETE Start Date | vc(DETE Start Date) | |
---|---|---|---|---|---|---|---|---|
0 | 61 or older | 222.0 | Female | 573.0 | 2012 | 344.0 | Not Stated | 73 |
1 | 56-60 | 174.0 | Male | 225.0 | 2013 | 200.0 | 2011 | 40 |
2 | 51-55 | 103.0 | NaN | 24.0 | 01/2014 | 43.0 | 2007 | 34 |
3 | 46-50 | 63.0 | 12/2013 | 40.0 | 2008 | 31 | ||
4 | 41-45 | 61.0 | Not Stated | 34.0 | 2010 | 27 | ||
5 | 26-30 | 57.0 | 09/2013 | 34.0 | 2012 | 27 | ||
6 | 36-40 | 51.0 | 06/2013 | 27.0 | 2009 | 24 | ||
7 | 21-25 | 40.0 | 07/2013 | 22.0 | 2006 | 23 | ||
8 | 31-35 | 39.0 | 10/2013 | 20.0 | 1975 | 21 | ||
9 | NaN | 11.0 | 11/2013 | 16.0 | 1970 | 21 | ||
10 | 20 or younger | 1.0 | 08/2013 | 12.0 | 2013 | 21 | ||
11 | 05/2013 | 7.0 | 2005 | 20 | ||||
12 | 05/2012 | 6.0 | 1990 | 20 | ||||
13 | 07/2014 | 2.0 | 1999 | 19 | ||||
14 | 04/2013 | 2.0 | 1996 | 19 | ||||
15 | 08/2012 | 2.0 | 1991 | 18 | ||||
16 | 02/2014 | 2.0 | 1992 | 18 | ||||
17 | 04/2014 | 2.0 | 2000 | 18 | ||||
18 | 2010 | 1.0 | 2004 | 18 | ||||
19 | 09/2014 | 1.0 | 1989 | 17 | ||||
20 | 09/2010 | 1.0 | 2002 | 15 | ||||
21 | 2014 | 1.0 | 1976 | 15 | ||||
22 | 07/2012 | 1.0 | 2003 | 15 | ||||
23 | 07/2006 | 1.0 | 1988 | 15 | ||||
24 | 11/2012 | 1.0 | 1978 | 15 | ||||
25 | 1980 | 14 | ||||||
26 | 1998 | 14 | ||||||
27 | 1997 | 14 | ||||||
28 | 1995 | 14 | ||||||
29 | 1974 | 14 | ||||||
30 | 1979 | 14 | ||||||
31 | 1993 | 13 | ||||||
32 | 1986 | 12 | ||||||
33 | 1972 | 12 | ||||||
34 | 1977 | 11 | ||||||
35 | 2001 | 10 | ||||||
36 | 1969 | 10 | ||||||
37 | 1984 | 10 | ||||||
38 | 1971 | 10 | ||||||
39 | 1994 | 10 | ||||||
40 | 1981 | 9 | ||||||
41 | 1983 | 9 | ||||||
42 | 1973 | 8 | ||||||
43 | 1985 | 8 | ||||||
44 | 1987 | 7 | ||||||
45 | 1963 | 4 | ||||||
46 | 1982 | 4 | ||||||
47 | 1968 | 3 | ||||||
48 | 1967 | 2 | ||||||
49 | 1965 | 1 | ||||||
50 | 1966 | 1 |
df_vc(tafe_sur, [67, 70, 4, 66], dropna=False, vc_nan=np.nan)
CurrentAge. Current Age | vc(CurrentAge. Current Age) | LengthofServiceOverall. Overall Length of Service at Institute (in years) | vc(LengthofServiceOverall. Overall Length of Service at Institute (in years)) | Reason for ceasing employment | vc(Reason for ceasing employment) | Gender. What is your Gender? | vc(Gender. What is your Gender?) | |
---|---|---|---|---|---|---|---|---|
0 | 56 or older | 162 | Less than 1 year | 147.0 | Resignation | 340.0 | Female | 389.0 |
1 | NaN | 106 | NaN | 106.0 | Contract Expired | 127.0 | Male | 207.0 |
2 | 51-55 | 82 | 1-2 | 102.0 | Retrenchment/ Redundancy | 104.0 | NaN | 106.0 |
3 | 41 45 | 80 | 3-4 | 96.0 | Retirement | 82.0 | ||
4 | 46 50 | 59 | 11-20 | 89.0 | Transfer | 25.0 | ||
5 | 31 35 | 52 | More than 20 years | 71.0 | Termination | 23.0 | ||
6 | 36 40 | 51 | 5-6 | 48.0 | NaN | 1.0 | ||
7 | 26 30 | 50 | 7-10 | 43.0 | ||||
8 | 21 25 | 44 | ||||||
9 | 20 or younger | 16 |
As we saw from above value counts dataframe there are 'Not Stated' values in 'dete_survey'. In order to change this value to 'np.nan' below I read 'dete_survey' csv file into pandas dataframe again but this time replacing 'Not Stated' with NaN.
# as we saw from above value_counts there are 'Not Stated' values in dete_survey csv
# below we assign back to dete_sur but this time replacing 'Not Stated' with NaN
dete_sur = pd.read_csv('dete_survey.csv', na_values='Not Stated')
dete_sur.head()
ID | SeparationType | Cease Date | DETE Start Date | Role Start Date | Position | Classification | Region | Business Unit | Employment Status | Career move to public sector | Career move to private sector | Interpersonal conflicts | Job dissatisfaction | Dissatisfaction with the department | Physical work environment | Lack of recognition | Lack of job security | Work location | Employment conditions | Maternity/family | Relocation | Study/Travel | Ill Health | Traumatic incident | Work life balance | Workload | None of the above | Professional Development | Opportunities for promotion | Staff morale | Workplace issue | Physical environment | Worklife balance | Stress and pressure support | Performance of supervisor | Peer support | Initiative | Skills | Coach | Career Aspirations | Feedback | Further PD | Communication | My say | Information | Kept informed | Wellness programs | Health & Safety | Gender | Age | Aboriginal | Torres Strait | South Sea | Disability | NESB | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Ill Health Retirement | 08/2012 | 1984.0 | 2004.0 | Public Servant | A01-A04 | Central Office | Corporate Strategy and Peformance | Permanent Full-time | True | False | False | True | False | False | True | False | False | False | False | False | False | False | False | False | False | True | A | A | N | N | N | A | A | A | A | N | N | N | A | A | A | N | A | A | N | N | N | Male | 56-60 | NaN | NaN | NaN | NaN | Yes |
1 | 2 | Voluntary Early Retirement (VER) | 08/2012 | NaN | NaN | Public Servant | AO5-AO7 | Central Office | Corporate Strategy and Peformance | Permanent Full-time | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | A | A | N | N | N | N | A | A | A | N | N | N | A | A | A | N | A | A | N | N | N | Male | 56-60 | NaN | NaN | NaN | NaN | NaN |
2 | 3 | Voluntary Early Retirement (VER) | 05/2012 | 2011.0 | 2011.0 | Schools Officer | NaN | Central Office | Education Queensland | Permanent Full-time | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | A | A | N | N | N | N | Male | 61 or older | NaN | NaN | NaN | NaN | NaN |
3 | 4 | Resignation-Other reasons | 05/2012 | 2005.0 | 2006.0 | Teacher | Primary | Central Queensland | NaN | Permanent Full-time | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | A | N | N | N | A | A | N | N | A | A | A | A | A | A | A | A | A | A | A | N | A | Female | 36-40 | NaN | NaN | NaN | NaN | NaN |
4 | 5 | Age Retirement | 05/2012 | 1970.0 | 1989.0 | Head of Curriculum/Head of Special Education | NaN | South East | NaN | Permanent Full-time | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | A | A | N | N | D | D | N | A | A | A | A | A | A | SA | SA | D | D | A | N | A | M | Female | 61 or older | NaN | NaN | NaN | NaN | NaN |
Below I delete range of columns I will not use in my project
# dropping columns we will not need further in our analysis
dete_sur.drop(columns=dete_sur.columns[28:49], inplace=True)
tafe_sur.drop(columns=tafe_sur.columns[17:66], inplace=True)
Later in project I will need to concatenate 2 dataframes. In order to have columns which contain same type of information in both dataframes matched I standardize column labels below. First, I do it for 'dete_sur' dataframe and print before and after versions of column labels.
# converting column labels to snake_case
bef_cols = dete_sur.columns
dete_sur.columns = dete_sur.columns.str.strip().str.replace('\s+',' ', regex=True)\
.str.replace(' ','_', regex=True).str.lower()
aft_cols = dete_sur.columns
pd.DataFrame({'columns before': bef_cols, 'columns_after': aft_cols})
columns before | columns_after | |
---|---|---|
0 | ID | id |
1 | SeparationType | separationtype |
2 | Cease Date | cease_date |
3 | DETE Start Date | dete_start_date |
4 | Role Start Date | role_start_date |
5 | Position | position |
6 | Classification | classification |
7 | Region | region |
8 | Business Unit | business_unit |
9 | Employment Status | employment_status |
10 | Career move to public sector | career_move_to_public_sector |
11 | Career move to private sector | career_move_to_private_sector |
12 | Interpersonal conflicts | interpersonal_conflicts |
13 | Job dissatisfaction | job_dissatisfaction |
14 | Dissatisfaction with the department | dissatisfaction_with_the_department |
15 | Physical work environment | physical_work_environment |
16 | Lack of recognition | lack_of_recognition |
17 | Lack of job security | lack_of_job_security |
18 | Work location | work_location |
19 | Employment conditions | employment_conditions |
20 | Maternity/family | maternity/family |
21 | Relocation | relocation |
22 | Study/Travel | study/travel |
23 | Ill Health | ill_health |
24 | Traumatic incident | traumatic_incident |
25 | Work life balance | work_life_balance |
26 | Workload | workload |
27 | None of the above | none_of_the_above |
28 | Gender | gender |
29 | Age | age |
30 | Aboriginal | aboriginal |
31 | Torres Strait | torres_strait |
32 | South Sea | south_sea |
33 | Disability | disability |
34 | NESB | nesb |
Then I standardize 9 column labels (of columns which I will use later in project for analysis) in 'tafe_survey' dataframe based on column labels from 'dete_sur' dataframe.
# changing col labels in accordance to dete df so when we concat
# these cols (with same type of info) match
bef_cols1 = tafe_sur.columns
tafe_sur.rename(columns={'Record ID': 'id',
'CESSATION YEAR': 'cease_date',
'Reason for ceasing employment': 'separationtype',
'Gender. What is your Gender?': 'gender',
'CurrentAge. Current Age': 'age',
'Employment Type. Employment Type': 'employment_status',
'Classification. Classification': 'position',
'LengthofServiceOverall. Overall Length of Service at Institute (in years)':
'institute_service',
'LengthofServiceCurrent. Length of Service at current workplace (in years)':
'role_service'}, inplace=True)
aft_cols1 = tafe_sur.columns
pd.DataFrame({'columns before': bef_cols1, 'columns after': aft_cols1})
columns before | columns after | |
---|---|---|
0 | Record ID | id |
1 | Institute | Institute |
2 | WorkArea | WorkArea |
3 | CESSATION YEAR | cease_date |
4 | Reason for ceasing employment | separationtype |
5 | Contributing Factors. Career Move - Public Sec... | Contributing Factors. Career Move - Public Sec... |
6 | Contributing Factors. Career Move - Private Se... | Contributing Factors. Career Move - Private Se... |
7 | Contributing Factors. Career Move - Self-emplo... | Contributing Factors. Career Move - Self-emplo... |
8 | Contributing Factors. Ill Health | Contributing Factors. Ill Health |
9 | Contributing Factors. Maternity/Family | Contributing Factors. Maternity/Family |
10 | Contributing Factors. Dissatisfaction | Contributing Factors. Dissatisfaction |
11 | Contributing Factors. Job Dissatisfaction | Contributing Factors. Job Dissatisfaction |
12 | Contributing Factors. Interpersonal Conflict | Contributing Factors. Interpersonal Conflict |
13 | Contributing Factors. Study | Contributing Factors. Study |
14 | Contributing Factors. Travel | Contributing Factors. Travel |
15 | Contributing Factors. Other | Contributing Factors. Other |
16 | Contributing Factors. NONE | Contributing Factors. NONE |
17 | Gender. What is your Gender? | gender |
18 | CurrentAge. Current Age | age |
19 | Employment Type. Employment Type | employment_status |
20 | Classification. Classification | position |
21 | LengthofServiceOverall. Overall Length of Serv... | institute_service |
22 | LengthofServiceCurrent. Length of Service at c... | role_service |
Coming to main questions mentioned in the beginning requires to select rows which represent employees who resigned from their positions. To do so I check 'separation type' column in both columns.
I start with 'tafe_sur' dataframe (which represents employee exit surveys from Technical and Further Education (TAFE))
tafe_sur['separationtype'].value_counts(dropna=False)
Resignation 340 Contract Expired 127 Retrenchment/ Redundancy 104 Retirement 82 Transfer 25 Termination 23 NaN 1 Name: separationtype, dtype: int64
Output of above cell shows that we 340 rows where employees resigned from their job. Below I select those rows and assign to 'tafe_resign' dataframe.
tafe_resign = tafe_sur[tafe_sur['separationtype'] == 'Resignation'].copy()
tafe_resign['separationtype'].value_counts(dropna=False)
Resignation 340 Name: separationtype, dtype: int64
Next step is to do same for 'dete_sur' (which represents employee exit surveys from Department of Education, Training and Employment (DETE))
dete_sur['separationtype'].value_counts(dropna=False)
Age Retirement 285 Resignation-Other reasons 150 Resignation-Other employer 91 Resignation-Move overseas/interstate 70 Voluntary Early Retirement (VER) 67 Ill Health Retirement 61 Other 49 Contract Expired 34 Termination 15 Name: separationtype, dtype: int64
In 'dete_sur' dataframe 'separationtype' column contains different type of resignation categories:
As we need all those categories collect them in list and based on it select relevant rows from 'dete_sur' dataframe. Selected data will be assigned to 'dete_resign' dataframe.
resign_types = ['Resignation-Other reasons', 'Resignation-Other employer',
'Resignation-Move overseas/interstate']
dete_resign = dete_sur[dete_sur['separationtype'].isin(resign_types)].copy()
dete_resign['separationtype'].value_counts(dropna=False)
Resignation-Other reasons 150 Resignation-Other employer 91 Resignation-Move overseas/interstate 70 Name: separationtype, dtype: int64
As we can see from questions this project aims to answer we need data on employees age and the time interval they worked for both institutes. In order to have reliable data I will check columns showing us required information on some abnormalities:
Following columns will be ones I check later in project for my analysis:
dete_resign['cease_year'] = dete_resign['cease_date'].str.extract(r"([1-2][0-9]{3})").astype(float)
dete_cols = ['cease_year', 'dete_start_date', 'age']
df_vc(dete_resign, cols=dete_cols, dropna=False, index='descending')
cease_year | vc(cease_year) | dete_start_date | vc(dete_start_date) | age | vc(age) | |
---|---|---|---|---|---|---|
0 | 2014.0 | 22.0 | 2013.0 | 10 | 61 or older | 23.0 |
1 | 2013.0 | 146.0 | 2012.0 | 21 | 56-60 | 26.0 |
2 | 2012.0 | 129.0 | 2011.0 | 24 | 51-55 | 32.0 |
3 | 2010.0 | 2.0 | 2010.0 | 17 | 46-50 | 42.0 |
4 | 2006.0 | 1.0 | 2009.0 | 13 | 41-45 | 48.0 |
5 | N/A | 11.0 | 2008.0 | 22 | 36-40 | 41.0 |
6 | 2007.0 | 21 | 31-35 | 29.0 | ||
7 | 2006.0 | 13 | 26-30 | 35.0 | ||
8 | 2005.0 | 15 | 21-25 | 29.0 | ||
9 | 2004.0 | 14 | 20 or younger | 1.0 | ||
10 | 2003.0 | 6 | N/A | 5.0 | ||
11 | 2002.0 | 6 | ||||
12 | 2001.0 | 3 | ||||
13 | 2000.0 | 9 | ||||
14 | 1999.0 | 8 | ||||
15 | 1998.0 | 6 | ||||
16 | 1997.0 | 5 | ||||
17 | 1996.0 | 6 | ||||
18 | 1995.0 | 4 | ||||
19 | 1994.0 | 6 | ||||
20 | 1993.0 | 5 | ||||
21 | 1992.0 | 6 | ||||
22 | 1991.0 | 4 | ||||
23 | 1990.0 | 5 | ||||
24 | 1989.0 | 4 | ||||
25 | 1988.0 | 4 | ||||
26 | 1987.0 | 1 | ||||
27 | 1986.0 | 3 | ||||
28 | 1985.0 | 3 | ||||
29 | 1984.0 | 1 | ||||
30 | 1983.0 | 2 | ||||
31 | 1982.0 | 1 | ||||
32 | 1980.0 | 5 | ||||
33 | 1977.0 | 1 | ||||
34 | 1976.0 | 2 | ||||
35 | 1975.0 | 1 | ||||
36 | 1974.0 | 2 | ||||
37 | 1973.0 | 1 | ||||
38 | 1972.0 | 1 | ||||
39 | 1971.0 | 1 | ||||
40 | 1963.0 | 1 | ||||
41 | N/A | 28 |
tafe_cols = ['cease_date','age']
df_vc(tafe_resign, cols=tafe_cols, dropna=False, index='ascendng')
cease_date | vc(cease_date) | age | vc(age) | |
---|---|---|---|---|
0 | 2011.0 | 116.0 | N/A | 50 |
1 | 2012.0 | 94.0 | 41 45 | 45 |
2 | 2010.0 | 68.0 | 46 50 | 39 |
3 | 2013.0 | 55.0 | 51-55 | 39 |
4 | N/A | 5.0 | 21 25 | 33 |
5 | 2009.0 | 2.0 | 36 40 | 32 |
6 | 26 30 | 32 | ||
7 | 31 35 | 32 | ||
8 | 56 or older | 29 | ||
9 | 20 or younger | 9 |
As we can see from outputs of above 2 cells there are no abnormalities and columns can be safely used for analysis.
To be able to answer questions like whether employees who worked long time (or short) left institute due to some kind of dissatisfaction we need to have categories indicating how much time employee worked for institute. We have in 'tafe_resign' dataframe 'institute_service' column which will serve our purpose.
tafe_resign['institute_service'].value_counts(dropna=False)
Less than 1 year 73 1-2 64 3-4 63 NaN 50 5-6 33 11-20 26 7-10 21 More than 20 years 10 Name: institute_service, dtype: int64
For 'dete_resgin' dataframe which represent employees resigned from Department of Education, Training and Employment we have year when employee started to work and year when employee resigned from position. Below I will create 'institute_service' column also for 'dete_resign' by subtracting start year from year of resignation and display unique values along with their count in column.
dete_resign['institute_service'] = dete_resign['cease_year'] - dete_resign['dete_start_date']
dete_resign['institute_service'].value_counts(dropna=False)
NaN 38 5.0 23 1.0 22 3.0 20 0.0 20 6.0 17 4.0 16 2.0 14 9.0 14 7.0 13 8.0 8 13.0 8 15.0 7 20.0 7 14.0 6 12.0 6 17.0 6 22.0 6 10.0 6 18.0 5 16.0 5 11.0 4 24.0 4 23.0 4 19.0 3 21.0 3 32.0 3 39.0 3 26.0 2 30.0 2 25.0 2 28.0 2 36.0 2 49.0 1 27.0 1 34.0 1 29.0 1 35.0 1 31.0 1 33.0 1 42.0 1 38.0 1 41.0 1 Name: institute_service, dtype: int64
Previously we created dataframe containing resigned employees for both institutes. But this would not be enough to answer our questions. In order to answer those questions we need to understand if employee resigned dues to some kind of dissatisfaction. Data for both institutes contain information on whether employee was dissatisfied or not. But this information is not contained in one column with boolean values (what we need).
In data for TAFE we have 2 column from which we can understand if employee was dissatisfied. These columns are:
If employee left due to dissatisfaction then relevant row contains same string as column label otherwise it contains dash or null value. So our aim is to able to see this information in 1 column with boolean values. To do so, I will combine above mentioned columns in one and then check if every element in row contains substring 'Dissatis'. If yes then employee left due to dissatisfaction if not otherwise is true.
tafe_resign['dissatisfied'] = tafe_resign['Contributing Factors. Dissatisfaction'].str.cat(\
tafe_resign['Contributing Factors. Job Dissatisfaction']).str.contains('Dissatis',na=False)
tafe_resign['dissatisfied'].value_counts()
False 249 True 91 Name: dissatisfied, dtype: int64
Output of above cell shows that majority of employees were not dissatisfied (249 employees) against 91 dissatisfied employees.
In data for DETE we also do not have 1 column with boolean values showing us if employee was dissatisfied or not. Here we have several columns which represent different types of dissatisfaction and contain boolean values indication if employee left due to exact that type of dissatisfaction or not. Below first I check if these columns really contain only boolean values with help of 'df_vc' function which I wrote before in this project.
dis_cols = ['job_dissatisfaction', 'dissatisfaction_with_the_department',
'physical_work_environment', 'lack_of_recognition',
'lack_of_job_security', 'work_location', 'employment_conditions',
'work_life_balance', 'workload']
df_vc(dete_resign, cols=dis_cols)
job_dissatisfaction | vc(job_dissatisfaction) | dissatisfaction_with_the_department | vc(dissatisfaction_with_the_department) | physical_work_environment | vc(physical_work_environment) | lack_of_recognition | vc(lack_of_recognition) | lack_of_job_security | vc(lack_of_job_security) | work_location | vc(work_location) | employment_conditions | vc(employment_conditions) | work_life_balance | vc(work_life_balance) | workload | vc(workload) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | 270 | False | 282 | False | 305 | False | 278 | False | 297 | False | 293 | False | 288 | False | 243 | False | 284 |
1 | True | 41 | True | 29 | True | 6 | True | 33 | True | 14 | True | 18 | True | 23 | True | 68 | True | 27 |
From above cell it is visible that all columns representing different kinds of dissatisfaction contain only boolean values. Then below I create column 'dissatisfied' also for 'dete_resign'. This column will contain:
dete_resign['dissatisfied'] = dete_resign[dis_cols].any(axis=1, skipna=False)
dete_resign['dissatisfied'].value_counts(dropna=False)
False 162 True 149 Name: dissatisfied, dtype: int64
Output of above cell shows that the is not big difference between number of employees who left due to some kind of dissatisfaction and number of employees who had other reasons (149 against 162, respectively)
Next step is to combine following dataframes we worked on in previous cells into one dataframe for further analysis:
But before to do so we need to create column in both dataframes called 'institute'. This column will help us to identify in which institute employee worked after combining 2 dataframes. And then I combine dataframes in next cell.
dete_resign['institute'] = 'DETE'
tafe_resign['institute'] = 'TAFE'
dete_tafe = pd.concat([dete_resign, tafe_resign], ignore_index=True)
dete_tafe.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 651 entries, 0 to 650 Data columns (total 54 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 651 non-null float64 1 separationtype 651 non-null object 2 cease_date 635 non-null object 3 dete_start_date 283 non-null float64 4 role_start_date 271 non-null float64 5 position 598 non-null object 6 classification 161 non-null object 7 region 265 non-null object 8 business_unit 32 non-null object 9 employment_status 597 non-null object 10 career_move_to_public_sector 311 non-null object 11 career_move_to_private_sector 311 non-null object 12 interpersonal_conflicts 311 non-null object 13 job_dissatisfaction 311 non-null object 14 dissatisfaction_with_the_department 311 non-null object 15 physical_work_environment 311 non-null object 16 lack_of_recognition 311 non-null object 17 lack_of_job_security 311 non-null object 18 work_location 311 non-null object 19 employment_conditions 311 non-null object 20 maternity/family 311 non-null object 21 relocation 311 non-null object 22 study/travel 311 non-null object 23 ill_health 311 non-null object 24 traumatic_incident 311 non-null object 25 work_life_balance 311 non-null object 26 workload 311 non-null object 27 none_of_the_above 311 non-null object 28 gender 592 non-null object 29 age 596 non-null object 30 aboriginal 7 non-null object 31 torres_strait 0 non-null object 32 south_sea 3 non-null object 33 disability 8 non-null object 34 nesb 9 non-null object 35 cease_year 300 non-null float64 36 institute_service 563 non-null object 37 dissatisfied 651 non-null bool 38 institute 651 non-null object 39 Institute 340 non-null object 40 WorkArea 340 non-null object 41 Contributing Factors. Career Move - Public Sector 332 non-null object 42 Contributing Factors. Career Move - Private Sector 332 non-null object 43 Contributing Factors. Career Move - Self-employment 332 non-null object 44 Contributing Factors. Ill Health 332 non-null object 45 Contributing Factors. Maternity/Family 332 non-null object 46 Contributing Factors. Dissatisfaction 332 non-null object 47 Contributing Factors. Job Dissatisfaction 332 non-null object 48 Contributing Factors. Interpersonal Conflict 332 non-null object 49 Contributing Factors. Study 332 non-null object 50 Contributing Factors. Travel 332 non-null object 51 Contributing Factors. Other 332 non-null object 52 Contributing Factors. NONE 332 non-null object 53 role_service 290 non-null object dtypes: bool(1), float64(4), object(49) memory usage: 270.3+ KB
After combining dataframes from output of above cells we can see that there is in total 54 columns. As a result of cleaning activities and quality of data the columns we need for our further analysis and columns which contain more than 500 non-null are same. Therefore below I will drop columns that contain less than 500 non-null values.
dete_tafe.dropna(axis=1,thresh=500, inplace=True)
dete_tafe.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 651 entries, 0 to 650 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 651 non-null float64 1 separationtype 651 non-null object 2 cease_date 635 non-null object 3 position 598 non-null object 4 employment_status 597 non-null object 5 gender 592 non-null object 6 age 596 non-null object 7 institute_service 563 non-null object 8 dissatisfied 651 non-null bool 9 institute 651 non-null object dtypes: bool(1), float64(1), object(8) memory usage: 46.5+ KB
I start with standardizing "institute_service" column which represents period employee spent working for institution. As we could see previously data for TAFE institution had ready categories for period employees worked for institution and for DETE institution we added column which contains number of years employee spent in position. As one institution have ready categories and one only number of years we need to create common categories and apply for whole 'institute_service' column. Below, first, I recheck what values are presented in column.
#dete_tafe['institute_service'].unique()
dete_tafe['institute_service'].value_counts(dropna=False, ascending=False)
NaN 88 Less than 1 year 73 1-2 64 3-4 63 5-6 33 11-20 26 5.0 23 1.0 22 7-10 21 0.0 20 3.0 20 6.0 17 4.0 16 2.0 14 9.0 14 7.0 13 More than 20 years 10 8.0 8 13.0 8 20.0 7 15.0 7 17.0 6 10.0 6 22.0 6 12.0 6 14.0 6 16.0 5 18.0 5 11.0 4 24.0 4 23.0 4 39.0 3 19.0 3 21.0 3 32.0 3 36.0 2 30.0 2 26.0 2 25.0 2 28.0 2 34.0 1 35.0 1 27.0 1 38.0 1 41.0 1 42.0 1 49.0 1 33.0 1 31.0 1 29.0 1 Name: institute_service, dtype: int64
Below I write categories for 'institute_service' column i.e. how to split years among categories, how to name them and how many to have:
In order to replace values in column with new categories written in above cell function written below will be applied to it.
def yr_cats(element):
# new categories to be applied to column
cat1, cat2 = 'New (0-3)', 'Experienced (3-6)'
cat3, cat4 = 'Established (6-10)', 'Veteran (10-∞)'
# if nan leave value nan
if pd.isna(element): return np.nan
# if string or number fill categories
elif isinstance(element, str):
if element in ['Less than 1 year', '1-2']: return cat1
elif element in ['3-4', '5-6']: return cat2
elif element in ['7-10']: return cat3
elif element in ['11-20', 'More than 20 years']: return cat4
else: return 'unexpected category'
elif isinstance(element, (int, float)):
if element < 3: return cat1
elif element < 6: return cat2
elif element < 10: return cat3
else: return cat4
# if data type of value is not caught by if block
else: return "unexpected category data type"
# applying function & checking result
dete_tafe['service_yrs_cat'] = dete_tafe['institute_service'].map(yr_cats)
dete_tafe['service_yrs_cat'].value_counts(dropna=False)
New (0-3) 193 Experienced (3-6) 155 Veteran (10-∞) 142 NaN 88 Established (6-10) 73 Name: service_yrs_cat, dtype: int64
Second column to create standard categories for is 'age' column. First I check existing unique values and their frequency in column.
print(dete_tafe[dete_tafe['institute']=='DETE']['age'].value_counts(dropna=False))
dete_tafe[dete_tafe['institute']=='TAFE']['age'].value_counts(dropna=False)
41-45 48 46-50 42 36-40 41 26-30 35 51-55 32 31-35 29 21-25 29 56-60 26 61 or older 23 NaN 5 20 or younger 1 Name: age, dtype: int64
NaN 50 41 45 45 46 50 39 51-55 39 21 25 33 36 40 32 26 30 32 31 35 32 56 or older 29 20 or younger 9 Name: age, dtype: int64
As we can see from output of above cell the values are coming from 2 dataframes are similar. But here are small differences. First, in of them we can see category indicating people younger than 20 and in other one elder people got more categories (instead of '56 or older' it has '56-60' and '60 or older' categories). To equalize number of categories and have same range for all values below standardize categories for employees who were younger than 20 or older than 56 when they left institutions.
age_dict = {'56-60': '56-∞',
'61 or older': '56-∞',
'56 or older': '56-∞',
'20 or younger': '0-20'}
dete_tafe['age'].replace(to_replace=age_dict, inplace=True)
dete_tafe['age'].value_counts()
56-∞ 78 51-55 71 41-45 48 41 45 45 46-50 42 36-40 41 46 50 39 26-30 35 21 25 33 26 30 32 31 35 32 36 40 32 31-35 29 21-25 29 0-20 10 Name: age, dtype: int64
Second and only difference left is in a format how range is represented in column. In one case it is in format 'x y' in other case 'x-y'. To standardize those values below I accessed values in column and cleaned them from spaces ad placed dash between numbers brining all ranges to 'x-y' format. Then I display the result.
dete_tafe['age'] = dete_tafe['age'].str.strip().str.\
replace('\s+', ' ', regex=True).str.replace(' ','-', regex=True)
dete_tafe['age'].value_counts(dropna=False)
41-45 93 46-50 81 56-∞ 78 36-40 73 51-55 71 26-30 67 21-25 62 31-35 61 NaN 55 0-20 10 Name: age, dtype: int64
To answer to our questions we aimed from beginning of project I will use pivot tables and plots based on those pivot tables. First, I want to focus on following questions:
Below I create pivot table. To make our question wider I also added gender variable to pivot. Values in pivot table represent percentage of employees who left their positions due to some kind of dissatisfaction. Column and index values are quite obviously represent gender and age group employee belonged to, respectively.
dt_age_pv = dete_tafe.pivot_table(index='age', values='dissatisfied', columns=['gender'])
dt_age_pv[dt_age_pv.select_dtypes(include=['int', 'float']).columns] *= 100
dt_age_pv
gender | Female | Male |
---|---|---|
age | ||
0-20 | 28.571429 | 0.000000 |
21-25 | 35.555556 | 17.647059 |
26-30 | 40.384615 | 46.666667 |
31-35 | 37.777778 | 40.000000 |
36-40 | 30.000000 | 43.478261 |
41-45 | 36.231884 | 41.666667 |
46-50 | 41.935484 | 27.777778 |
51-55 | 37.777778 | 48.000000 |
56-∞ | 38.297872 | 50.000000 |
As it is easier to analyse data represented in plot rather than in pivot table below I create diverging horizontal bar chart. As stated before source for bar chart is above pivot table.
# setting size of text on plot, creating fig, setting xticks range
rcParams["font.size"]=12
fig, axes = plt.subplots(figsize=(15,7), ncols=2)
fig.suptitle('% of Dissatisfied leavers')
ticks = np.arange(0, 61, 5)
# plotting and setting grids
arg_dict = {'xlabel': '', 'xticks': ticks}
hb1 = dt_age_pv['Male'].plot.barh(**arg_dict, ax=axes[0], color='c', title='Male')
hb2 = dt_age_pv['Female'].plot.barh(**arg_dict, ax=axes[1], color='y', title='Female')
hb1.grid(axis='x', c='c', alpha=0.4)
hb2.grid(axis='x', c='y', alpha=0.4)
# removing axes[0] yticks, inverting xticks so it chart looks diverging plot
hb1.set_yticks([])
hb1.invert_xaxis()
hb2.tick_params(axis='y', left=False, pad=-20)
# removing spines from both axes
for loc, spine in hb1.spines.items(): spine.set_visible(False)
for loc, spine in hb2.spines.items(): spine.set_visible(False)
# bringing 2 axes so close that they look like one chart
fig.subplots_adjust(wspace=0.0001)
In the output of cell above we can see diverging bar chart. It shows percentage of employees who left both institutes due to some kind of dissatisfaction by age groups for male and females. There is no strong linear correlation between age group and share of dissatisfied leavers in both genders. But from first observation of chart we can spot that middle aged peopled have less tendency to leave their positions due to dissatisfaction. Although this can be observed separately by genders it is better to pay attention to that tendency in total. In other words, if you pay closer attention to total length of horizontal bars (females + males) you can observe that people tend to leave more when they were between 26 and 30, then there were less dissatisfied leavers until employees reach age group of 51 and older. So it is quite obvious that older employees tend to leave more due to dissatisfaction than middle aged ones. But for employees younger than 30 not everything is clear as age group of 20-25 had lowest share of dissatisfied leavers among other age groups except 0-20 age group (which is only presented in TAFE institute and contains only females)
Secondly, I want to pay more attention to following questions:
Below I create pivot table where values again represent share of employees who left their positions due to some kind of dissatisfaction. In index part we have 2 splits. First split is by institute where employee worked and second is by service years categories. I also did not drop age factor from pivot table and plot I will create, while gender split will not be discussed in following part.
Note: Before creating pivot table below I converted type of 'service_yrs_cat' (contain service years categories) to categories and then sorted them ascending based on years. The reason to do so is to see age service years categories in ascending order in pivot table and plot I will create later. And this will make easier comparison of values among groups in pivot table and plot.
# ascending sorting list for service years categories
sorter = ['New (0-3)', 'Experienced (3-6)', 'Established (6-10)', 'Veteran (10-∞)']
# changing type of column to categories
dete_tafe['service_yrs_cat'] = dete_tafe['service_yrs_cat'].astype('category')
# accessing column and appliying sorting list written above
dete_tafe['service_yrs_cat'].cat.set_categories(sorter, inplace=True)
# creating pivot table
age_serv_pv = dete_tafe.pivot_table(index=['institute', 'service_yrs_cat'],
columns='age', values='dissatisfied')
age_serv_pv[age_serv_pv.select_dtypes(include=['int','float']).columns] *= 100
age_serv_pv
age | 0-20 | 21-25 | 26-30 | 31-35 | 36-40 | 41-45 | 46-50 | 51-55 | 56-∞ | |
---|---|---|---|---|---|---|---|---|---|---|
institute | service_yrs_cat | |||||||||
DETE | New (0-3) | 0.000000 | 31.250000 | 50.000000 | 40.000000 | 27.272727 | 25.000000 | 50.000000 | 100.000000 | 100.000000 |
Experienced (3-6) | NaN | 25.000000 | 53.846154 | 41.666667 | 33.333333 | 62.500000 | 25.000000 | 40.000000 | 33.333333 | |
Established (6-10) | NaN | 0.000000 | 66.666667 | 100.000000 | 87.500000 | 50.000000 | 42.857143 | 66.666667 | 60.000000 | |
Veteran (10-∞) | NaN | NaN | NaN | 40.000000 | 27.272727 | 60.000000 | 57.894737 | 57.894737 | 58.064516 | |
TAFE | New (0-3) | 16.666667 | 28.571429 | 26.315789 | 21.052632 | 41.666667 | 22.727273 | 33.333333 | 21.052632 | 25.000000 |
Experienced (3-6) | 33.333333 | 33.333333 | 22.222222 | 12.500000 | 18.181818 | 33.333333 | 26.666667 | 20.000000 | 20.000000 | |
Established (6-10) | NaN | NaN | 25.000000 | 66.666667 | 20.000000 | NaN | 40.000000 | 33.333333 | 0.000000 | |
Veteran (10-∞) | NaN | NaN | NaN | 0.000000 | 25.000000 | 20.000000 | 25.000000 | 57.142857 | 21.428571 |
This time, when we have 2 splits in index part of pivot and 1 split by age groups in columns part using plots like bar chart would not be most effective. In order to clearly display data and provide effective visualization my choice fell on heatmap provided by seaborn library.
fig, hmap = plt.subplots(figsize=(12,6))
rcParams["font.size"]=12
hmap = sns.heatmap(age_serv_pv, ax=hmap, annot=True, fmt='.2f', cmap='YlGn')
hmap.set_facecolor("grey")
hmap.set_title("% of Dissatisfied leavers by age(x) & institute-service years(y)")
hmap.set_ylabel('Institute-Service years')
hmap.set_xlabel('Age')
hmap.axhline(y=4, c='k', lw=3);
In above heatmap horizontal black line divides plot into 2 parts based on institute where leavers were employed. Grey are represent absence of employees in relevant cell. Quite logically most of grey are covers cells representing young people with more than 6 or 10 years experience inside institute.
Also it is quite clear that there are much less high extreme values in Technical and Further Education (TAFE) than in Department of Education, Training and Employment (DETE). In fact in only 2 groups (Veterans aged between 51 & 55, Established employees aged between 31 & 35) in TAFE share of employees who left due to dissatisfaction is more than half. Meanwhile, we can observe much more green and dark green cells in DETE part of heatmap which means frequent presence of high values. This leads us to conclusion than in TAFE values are more equally distributed than in DETE. And this conclusion is also right if we compare percentages by age group in 2 institutes. Then it means that observations I made in Conclusion 1.1 were heavily affected by DETE and cannot be interpreted with the same degree of significance for both entities.
More diverged values in DETE part of plot give us more clear answers to our questions. 2 bottom rows of DETE part of plot are more greenish than top 2 rows. Therefore, it is clear that Established employees and Veterans tend to leave due to dissatisfaction more than less experienced employees in institute. Another visible trend is that from left to the right share of dissatisfied leavers is getting higher.
To sum up, while values in TAFE are more equally distributed and do not answer our questions clearly, values in DETE are more robust. And this robustness suggests that most experienced and/or older employees tend to leave DETE due to dissatisfaction more than other groups.