Notebook

Analyzing Employees Who Left Due to Dissatisfaction¶

Introduction¶

In this project I will clean, organize, standardize and analyse data from employee exit surveys from Department of Education, Training and Employment (DETE) (download here) and the Technical and Further Education (TAFE)(download here) institute in Queensland, Australia. Below are questions I will try to answer as a result of completing mentioned in previous sentence processes:

Are employees who have only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction?
What about employees who have been at the job longer?
Are younger employees resigning due to some kind of dissatisfaction?
What about older employees?

Reading and getting familiar with dataframes¶

In below cell I import necessary libraries and setting some display options for dataframes.

In [1]:

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt, rcParams
from my_functions import print_2n

pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 100)

pascal_name_convention

Below I read data (2 csv files) into pandas dataframe:

employee exit surveys from Department of Education, Training and Employment (DETE) to 'dete_sur' dataframe
employee exit surveys from Technical and Further Education (TAFE) to 'tafe_sur' dataframe

In [2]:

dete_sur = pd.read_csv('dete_survey.csv')
tafe_sur = pd.read_csv('tafe_survey.csv')

To get familiar with data read into dataframes, first, I will display top 5 rows from both ('dete_sur' & 'tafe_sur')

In [3]:

dete_sur.head()

Out[3]:

	ID	SeparationType	Cease Date	DETE Start Date	Role Start Date	Position	Classification	Region	Business Unit	Employment Status	Career move to public sector	Career move to private sector	Interpersonal conflicts	Job dissatisfaction	Dissatisfaction with the department	Physical work environment	Lack of recognition	Lack of job security	Work location	Employment conditions	Maternity/family	Relocation	Study/Travel	Ill Health	Traumatic incident	Work life balance	Workload	None of the above	Professional Development	Opportunities for promotion	Staff morale	Workplace issue	Physical environment	Worklife balance	Stress and pressure support	Performance of supervisor	Peer support	Initiative	Skills	Coach	Career Aspirations	Feedback	Further PD	Communication	My say	Information	Kept informed	Wellness programs	Health & Safety	Gender	Age	Aboriginal	Torres Strait	South Sea	Disability	NESB
0	1	Ill Health Retirement	08/2012	1984	2004	Public Servant	A01-A04	Central Office	Corporate Strategy and Peformance	Permanent Full-time	True	False	False	True	False	False	True	False	False	False	False	False	False	False	False	False	False	True	A	A	N	N	N	A	A	A	A	N	N	N	A	A	A	N	A	A	N	N	N	Male	56-60	NaN	NaN	NaN	NaN	Yes
1	2	Voluntary Early Retirement (VER)	08/2012	Not Stated	Not Stated	Public Servant	AO5-AO7	Central Office	Corporate Strategy and Peformance	Permanent Full-time	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	A	A	N	N	N	N	A	A	A	N	N	N	A	A	A	N	A	A	N	N	N	Male	56-60	NaN	NaN	NaN	NaN	NaN
2	3	Voluntary Early Retirement (VER)	05/2012	2011	2011	Schools Officer	NaN	Central Office	Education Queensland	Permanent Full-time	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	True	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	A	A	N	N	N	N	Male	61 or older	NaN	NaN	NaN	NaN	NaN
3	4	Resignation-Other reasons	05/2012	2005	2006	Teacher	Primary	Central Queensland	NaN	Permanent Full-time	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	A	N	N	N	A	A	N	N	A	A	A	A	A	A	A	A	A	A	A	N	A	Female	36-40	NaN	NaN	NaN	NaN	NaN
4	5	Age Retirement	05/2012	1970	1989	Head of Curriculum/Head of Special Education	NaN	South East	NaN	Permanent Full-time	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	True	False	False	A	A	N	N	D	D	N	A	A	A	A	A	A	SA	SA	D	D	A	N	A	M	Female	61 or older	NaN	NaN	NaN	NaN	NaN

In [4]:

tafe_sur.head()

Out[4]:

	Record ID	Institute	WorkArea	CESSATION YEAR	Reason for ceasing employment	Contributing Factors. Career Move - Public Sector	Contributing Factors. Career Move - Private Sector	Contributing Factors. Career Move - Self-employment	Contributing Factors. Ill Health	Contributing Factors. Maternity/Family	Contributing Factors. Dissatisfaction	Contributing Factors. Job Dissatisfaction	Contributing Factors. Interpersonal Conflict	Contributing Factors. Study	Contributing Factors. Travel	Contributing Factors. Other	Contributing Factors. NONE	Main Factor. Which of these was the main factor for leaving?	InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction	InstituteViews. Topic:2. I was given access to skills training to help me do my job better	InstituteViews. Topic:3. I was given adequate opportunities for personal development	InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL%	InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had	InstituteViews. Topic:6. The organisation recognised when staff did good work	InstituteViews. Topic:7. Management was generally supportive of me	InstituteViews. Topic:8. Management was generally supportive of my team	InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me	InstituteViews. Topic:10. Staff morale was positive within the Institute	InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly	InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently	InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly	WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit	WorkUnitViews. Topic:15. I worked well with my colleagues	WorkUnitViews. Topic:16. My job was challenging and interesting	WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work	WorkUnitViews. Topic:18. I had sufficient contact with other people in my job	WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job	WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job	WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT]	WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job	WorkUnitViews. Topic:23. My job provided sufficient variety	WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job	WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction	WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance	WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area	WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date	WorkUnitViews. Topic:29. There was adequate communication between staff in my unit	WorkUnitViews. Topic:30. Staff morale was positive within my work unit	Induction. Did you undertake Workplace Induction?	InductionInfo. Topic:Did you undertake a Corporate Induction?	InductionInfo. Topic:Did you undertake a Institute Induction?	InductionInfo. Topic: Did you undertake Team Induction?	InductionInfo. Face to Face Topic:Did you undertake a Corporate Induction; Category:How it was conducted?	InductionInfo. On-line Topic:Did you undertake a Corporate Induction; Category:How it was conducted?	InductionInfo. Induction Manual Topic:Did you undertake a Corporate Induction?	InductionInfo. Face to Face Topic:Did you undertake a Institute Induction?	InductionInfo. On-line Topic:Did you undertake a Institute Induction?	InductionInfo. Induction Manual Topic:Did you undertake a Institute Induction?	InductionInfo. Face to Face Topic: Did you undertake Team Induction; Category?	InductionInfo. On-line Topic: Did you undertake Team Induction?process you undertook and how it was conducted.]	InductionInfo. Induction Manual Topic: Did you undertake Team Induction?	Workplace. Topic:Did you and your Manager develop a Performance and Professional Development Plan (PPDP)?	Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?	Workplace. Topic:Does your workplace promote and practice the principles of employment equity?	Workplace. Topic:Does your workplace value the diversity of its employees?	Workplace. Topic:Would you recommend the Institute as an employer to others?	Gender. What is your Gender?	CurrentAge. Current Age	Employment Type. Employment Type	Classification. Classification	LengthofServiceOverall. Overall Length of Service at Institute (in years)	LengthofServiceCurrent. Length of Service at current workplace (in years)
0	6.341330e+17	Southern Queensland Institute of TAFE	Non-Delivery (corporate)	2010.0	Contract Expired	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Agree	Agree	Agree	Neutral	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Strongly Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Neutral	Agree	Agree	Yes	Yes	Yes	Yes	Face to Face	-	-	Face to Face	-	-	Face to Face	-	-	Yes	Yes	Yes	Yes	Yes	Female	26 30	Temporary Full-time	Administration (AO)	1-2	1-2
1	6.341337e+17	Mount Isa Institute of TAFE	Non-Delivery (corporate)	2010.0	Retirement	-	-	-	-	-	-	-	-	-	Travel	-	-	NaN	Agree	Agree	Agree	Agree	Agree	Strongly Agree	Strongly Agree	Agree	Strongly Agree	Agree	Agree	Agree	Disagree	Strongly Agree	Strongly Agree	Strongly Agree	Agree	Agree	Agree	Strongly Agree	Agree	Agree	Agree	Strongly Agree	Agree	Strongly Agree	Strongly Agree	Agree	Agree	Strongly Agree	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Yes	Yes	Yes	Yes	Yes	NaN	NaN	NaN	NaN	NaN	NaN
2	6.341388e+17	Mount Isa Institute of TAFE	Delivery (teaching)	2010.0	Retirement	-	-	-	-	-	-	-	-	-	-	-	NONE	NaN	Agree	Agree	Agree	Agree	Agree	Agree	Strongly Agree	Agree	Agree	Agree	Agree	Neutral	Neutral	Strongly Agree	Strongly Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Yes	Yes	Yes	Yes	Yes	NaN	NaN	NaN	NaN	NaN	NaN
3	6.341399e+17	Mount Isa Institute of TAFE	Non-Delivery (corporate)	2010.0	Resignation	-	-	-	-	-	-	-	-	-	Travel	-	-	NaN	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Yes	No	Yes	Yes	-	-	-	NaN	-	-	-	-	-	Yes	Yes	Yes	Yes	Yes	NaN	NaN	NaN	NaN	NaN	NaN
4	6.341466e+17	Southern Queensland Institute of TAFE	Delivery (teaching)	2010.0	Resignation	-	Career Move - Private Sector	-	-	-	-	-	-	-	-	-	-	NaN	Agree	Agree	Strongly Agree	Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Agree	Strongly Agree	Strongly Agree	Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Strongly Agree	Yes	Yes	Yes	Yes	-	-	Induction Manual	Face to Face	-	-	Face to Face	-	-	Yes	Yes	Yes	Yes	Yes	Male	41 45	Permanent Full-time	Teacher (including LVT)	3-4	3-4

We cannot see all columns above and there is difficulty in comprehending data displayed in such way. Therefore, additionally in below cells I use df.info() method to display all column labels, number of non-null values in each column and data type.

In [5]:

dete_sur.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 56 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   ID                                   822 non-null    int64 
 1   SeparationType                       822 non-null    object
 2   Cease Date                           822 non-null    object
 3   DETE Start Date                      822 non-null    object
 4   Role Start Date                      822 non-null    object
 5   Position                             817 non-null    object
 6   Classification                       455 non-null    object
 7   Region                               822 non-null    object
 8   Business Unit                        126 non-null    object
 9   Employment Status                    817 non-null    object
 10  Career move to public sector         822 non-null    bool  
 11  Career move to private sector        822 non-null    bool  
 12  Interpersonal conflicts              822 non-null    bool  
 13  Job dissatisfaction                  822 non-null    bool  
 14  Dissatisfaction with the department  822 non-null    bool  
 15  Physical work environment            822 non-null    bool  
 16  Lack of recognition                  822 non-null    bool  
 17  Lack of job security                 822 non-null    bool  
 18  Work location                        822 non-null    bool  
 19  Employment conditions                822 non-null    bool  
 20  Maternity/family                     822 non-null    bool  
 21  Relocation                           822 non-null    bool  
 22  Study/Travel                         822 non-null    bool  
 23  Ill Health                           822 non-null    bool  
 24  Traumatic incident                   822 non-null    bool  
 25  Work life balance                    822 non-null    bool  
 26  Workload                             822 non-null    bool  
 27  None of the above                    822 non-null    bool  
 28  Professional Development             808 non-null    object
 29  Opportunities for promotion          735 non-null    object
 30  Staff morale                         816 non-null    object
 31  Workplace issue                      788 non-null    object
 32  Physical environment                 817 non-null    object
 33  Worklife balance                     815 non-null    object
 34  Stress and pressure support          810 non-null    object
 35  Performance of supervisor            813 non-null    object
 36  Peer support                         812 non-null    object
 37  Initiative                           813 non-null    object
 38  Skills                               811 non-null    object
 39  Coach                                767 non-null    object
 40  Career Aspirations                   746 non-null    object
 41  Feedback                             792 non-null    object
 42  Further PD                           768 non-null    object
 43  Communication                        814 non-null    object
 44  My say                               812 non-null    object
 45  Information                          816 non-null    object
 46  Kept informed                        813 non-null    object
 47  Wellness programs                    766 non-null    object
 48  Health & Safety                      793 non-null    object
 49  Gender                               798 non-null    object
 50  Age                                  811 non-null    object
 51  Aboriginal                           16 non-null     object
 52  Torres Strait                        3 non-null      object
 53  South Sea                            7 non-null      object
 54  Disability                           23 non-null     object
 55  NESB                                 32 non-null     object
dtypes: bool(18), int64(1), object(37)
memory usage: 258.6+ KB

Unfortunately if we got really long column labels df.info() method will print expected result not in easily readable way. Therefore, below I created dataframe and added to it column labels, data types and number of non-null values.

In [6]:

pd.DataFrame({'data_type': tafe_sur.dtypes, 'non_null': tafe_sur.notnull().sum()})

Out[6]:

	data_type	non_null
Record ID	float64	702
Institute	object	702
WorkArea	object	702
CESSATION YEAR	float64	695
Reason for ceasing employment	object	701
Contributing Factors. Career Move - Public Sector	object	437
Contributing Factors. Career Move - Private Sector	object	437
Contributing Factors. Career Move - Self-employment	object	437
Contributing Factors. Ill Health	object	437
Contributing Factors. Maternity/Family	object	437
Contributing Factors. Dissatisfaction	object	437
Contributing Factors. Job Dissatisfaction	object	437
Contributing Factors. Interpersonal Conflict	object	437
Contributing Factors. Study	object	437
Contributing Factors. Travel	object	437
Contributing Factors. Other	object	437
Contributing Factors. NONE	object	437
Main Factor. Which of these was the main factor for leaving?	object	113
InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction	object	608
InstituteViews. Topic:2. I was given access to skills training to help me do my job better	object	613
InstituteViews. Topic:3. I was given adequate opportunities for personal development	object	610
InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL%	object	608
InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had	object	615
InstituteViews. Topic:6. The organisation recognised when staff did good work	object	607
InstituteViews. Topic:7. Management was generally supportive of me	object	614
InstituteViews. Topic:8. Management was generally supportive of my team	object	608
InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me	object	610
InstituteViews. Topic:10. Staff morale was positive within the Institute	object	602
InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly	object	601
InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently	object	597
InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly	object	601
WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit	object	609
WorkUnitViews. Topic:15. I worked well with my colleagues	object	605
WorkUnitViews. Topic:16. My job was challenging and interesting	object	607
WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work	object	610
WorkUnitViews. Topic:18. I had sufficient contact with other people in my job	object	613
WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job	object	609
WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job	object	609
WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT]	object	608
WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job	object	608
WorkUnitViews. Topic:23. My job provided sufficient variety	object	611
WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job	object	610
WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction	object	611
WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance	object	606
WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area	object	610
WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date	object	609
WorkUnitViews. Topic:29. There was adequate communication between staff in my unit	object	603
WorkUnitViews. Topic:30. Staff morale was positive within my work unit	object	606
Induction. Did you undertake Workplace Induction?	object	619
InductionInfo. Topic:Did you undertake a Corporate Induction?	object	432
InductionInfo. Topic:Did you undertake a Institute Induction?	object	483
InductionInfo. Topic: Did you undertake Team Induction?	object	440
InductionInfo. Face to Face Topic:Did you undertake a Corporate Induction; Category:How it was conducted?	object	555
InductionInfo. On-line Topic:Did you undertake a Corporate Induction; Category:How it was conducted?	object	555
InductionInfo. Induction Manual Topic:Did you undertake a Corporate Induction?	object	555
InductionInfo. Face to Face Topic:Did you undertake a Institute Induction?	object	530
InductionInfo. On-line Topic:Did you undertake a Institute Induction?	object	555
InductionInfo. Induction Manual Topic:Did you undertake a Institute Induction?	object	553
InductionInfo. Face to Face Topic: Did you undertake Team Induction; Category?	object	555
InductionInfo. On-line Topic: Did you undertake Team Induction?process you undertook and how it was conducted.]	object	555
InductionInfo. Induction Manual Topic: Did you undertake Team Induction?	object	555
Workplace. Topic:Did you and your Manager develop a Performance and Professional Development Plan (PPDP)?	object	608
Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?	object	594
Workplace. Topic:Does your workplace promote and practice the principles of employment equity?	object	587
Workplace. Topic:Does your workplace value the diversity of its employees?	object	586
Workplace. Topic:Would you recommend the Institute as an employer to others?	object	581
Gender. What is your Gender?	object	596
CurrentAge. Current Age	object	596
Employment Type. Employment Type	object	596
Classification. Classification	object	596
LengthofServiceOverall. Overall Length of Service at Institute (in years)	object	596
LengthofServiceCurrent. Length of Service at current workplace (in years)	object	596

Now and later in this project I will need to use 'series.value_counts()' method to get counts of unique values in several columns. But this method can be used only for one column at a time. Using this method several times for different columns will take a lot of space. Also we cannot see output of method for several columns at same time.

Therefore, below I wrote 'df_vc' function. This function returns dataframe. Dataframe contains output(s) of 'series.value_counts' for 1(or more) column(s). In other words user specifies source dataframe and columns, function will return value_counts for those columns in one dataframe. Basically resulting dataframe can be viewed as collection of couple of columns. In each couple 1st column is unique value and 2nd column is the count of it.

Note: I did not use 'dete_sur.iloc[:,1:11].apply(pd.Series.value_counts)' on purpose. Reason is that this way provides datarame with lots of NaN values if we have more columns and those columns have different values and different type of values. Therefore, this method does not contribute to easiness of readability at all.

In [7]:

def df_vc(ur_df, cols=None, normalize=False, sort=True, index=None,
          ascending=False, bins=None, dropna=True, vc_nan='N/A'):

    vc_df=None # initial stage of df that is returned by function
    if cols==None: cols=range(ur_df.shape[1]) # all cols if cols range not specified
        
    for col in cols:# iterating through each column's position/label
        
        # finding position of col if it is specified by str & not by int
        if isinstance(col, str): col = ur_df.columns.get_loc(col)
        
        # creating dataframe (2 columns) from pd.value_count for each col in ur_df
        new_vc_df = pd.DataFrame(ur_df.iloc[:,col].value_counts
                      (dropna=dropna, normalize=normalize, sort=sort,
                       bins=bins, ascending=ascending))
        
        # sort by index each new df or ignore this step
        if index == 'ascending': new_vc_df.sort_index(inplace=True)
        elif index == 'descending': new_vc_df.sort_index(ascending=False, inplace=True)
        elif index is None: pass
        
        # reset index for renaming and concat purposes,
        # filling nan to differentiate from concat related nan
        new_vc_df = new_vc_df.reset_index().fillna('N/A')
        
        # rename columns names to: 'col label in ur_df' & 'vc(col label in ur_df)'
        labels = {new_vc_df.columns[0]: new_vc_df.columns[1],
                  new_vc_df.columns[1]: f'vc({new_vc_df.columns[1]})'}
        new_vc_df = new_vc_df.rename(columns=labels)
        
        # aggregating above created dataframes (2 columns) in 1 df
        vc_df = pd.concat([vc_df, new_vc_df], axis=1)
        
    # deleting nan coming from concat and replacing default 'N/A'
    # in value_counts part if other param was entered for vc_nan arg
    vc_df = vc_df.fillna('')
    if vc_nan != 'N/A': vc_df = vc_df.replace('N/A', vc_nan)
    return vc_df

In below cells I used above written function for both dataframes. I specified columns which I use later in project for analysis.

In [8]:

df_vc(dete_sur, ['Age', 'Gender', 'Cease Date', 'DETE Start Date'], dropna=False, vc_nan=np.nan)

Out[8]:

	Age	vc(Age)	Gender	vc(Gender)	Cease Date	vc(Cease Date)	DETE Start Date	vc(DETE Start Date)
0	61 or older	222.0	Female	573.0	2012	344.0	Not Stated	73
1	56-60	174.0	Male	225.0	2013	200.0	2011	40
2	51-55	103.0	NaN	24.0	01/2014	43.0	2007	34
3	46-50	63.0			12/2013	40.0	2008	31
4	41-45	61.0			Not Stated	34.0	2010	27
5	26-30	57.0			09/2013	34.0	2012	27
6	36-40	51.0			06/2013	27.0	2009	24
7	21-25	40.0			07/2013	22.0	2006	23
8	31-35	39.0			10/2013	20.0	1975	21
9	NaN	11.0			11/2013	16.0	1970	21
10	20 or younger	1.0			08/2013	12.0	2013	21
11					05/2013	7.0	2005	20
12					05/2012	6.0	1990	20
13					07/2014	2.0	1999	19
14					04/2013	2.0	1996	19
15					08/2012	2.0	1991	18
16					02/2014	2.0	1992	18
17					04/2014	2.0	2000	18
18					2010	1.0	2004	18
19					09/2014	1.0	1989	17
20					09/2010	1.0	2002	15
21					2014	1.0	1976	15
22					07/2012	1.0	2003	15
23					07/2006	1.0	1988	15
24					11/2012	1.0	1978	15
25							1980	14
26							1998	14
27							1997	14
28							1995	14
29							1974	14
30							1979	14
31							1993	13
32							1986	12
33							1972	12
34							1977	11
35							2001	10
36							1969	10
37							1984	10
38							1971	10
39							1994	10
40							1981	9
41							1983	9
42							1973	8
43							1985	8
44							1987	7
45							1963	4
46							1982	4
47							1968	3
48							1967	2
49							1965	1
50							1966	1

In [9]:

df_vc(tafe_sur, [67, 70, 4, 66], dropna=False, vc_nan=np.nan)

Out[9]:

	CurrentAge. Current Age	vc(CurrentAge. Current Age)	LengthofServiceOverall. Overall Length of Service at Institute (in years)	vc(LengthofServiceOverall. Overall Length of Service at Institute (in years))	Reason for ceasing employment	vc(Reason for ceasing employment)	Gender. What is your Gender?	vc(Gender. What is your Gender?)
0	56 or older	162	Less than 1 year	147.0	Resignation	340.0	Female	389.0
1	NaN	106	NaN	106.0	Contract Expired	127.0	Male	207.0
2	51-55	82	1-2	102.0	Retrenchment/ Redundancy	104.0	NaN	106.0
3	41 45	80	3-4	96.0	Retirement	82.0
4	46 50	59	11-20	89.0	Transfer	25.0
5	31 35	52	More than 20 years	71.0	Termination	23.0
6	36 40	51	5-6	48.0	NaN	1.0
7	26 30	50	7-10	43.0
8	21 25	44
9	20 or younger	16

As we saw from above value counts dataframe there are 'Not Stated' values in 'dete_survey'. In order to change this value to 'np.nan' below I read 'dete_survey' csv file into pandas dataframe again but this time replacing 'Not Stated' with NaN.

In [10]:

# as we saw from above value_counts there are 'Not Stated' values in dete_survey csv
# below we assign back to dete_sur but this time replacing 'Not Stated' with NaN
dete_sur = pd.read_csv('dete_survey.csv', na_values='Not Stated')
dete_sur.head()

Out[10]:

	ID	SeparationType	Cease Date	DETE Start Date	Role Start Date	Position	Classification	Region	Business Unit	Employment Status	Career move to public sector	Career move to private sector	Interpersonal conflicts	Job dissatisfaction	Dissatisfaction with the department	Physical work environment	Lack of recognition	Lack of job security	Work location	Employment conditions	Maternity/family	Relocation	Study/Travel	Ill Health	Traumatic incident	Work life balance	Workload	None of the above	Professional Development	Opportunities for promotion	Staff morale	Workplace issue	Physical environment	Worklife balance	Stress and pressure support	Performance of supervisor	Peer support	Initiative	Skills	Coach	Career Aspirations	Feedback	Further PD	Communication	My say	Information	Kept informed	Wellness programs	Health & Safety	Gender	Age	Aboriginal	Torres Strait	South Sea	Disability	NESB
0	1	Ill Health Retirement	08/2012	1984.0	2004.0	Public Servant	A01-A04	Central Office	Corporate Strategy and Peformance	Permanent Full-time	True	False	False	True	False	False	True	False	False	False	False	False	False	False	False	False	False	True	A	A	N	N	N	A	A	A	A	N	N	N	A	A	A	N	A	A	N	N	N	Male	56-60	NaN	NaN	NaN	NaN	Yes
1	2	Voluntary Early Retirement (VER)	08/2012	NaN	NaN	Public Servant	AO5-AO7	Central Office	Corporate Strategy and Peformance	Permanent Full-time	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	A	A	N	N	N	N	A	A	A	N	N	N	A	A	A	N	A	A	N	N	N	Male	56-60	NaN	NaN	NaN	NaN	NaN
2	3	Voluntary Early Retirement (VER)	05/2012	2011.0	2011.0	Schools Officer	NaN	Central Office	Education Queensland	Permanent Full-time	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	True	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	A	A	N	N	N	N	Male	61 or older	NaN	NaN	NaN	NaN	NaN
3	4	Resignation-Other reasons	05/2012	2005.0	2006.0	Teacher	Primary	Central Queensland	NaN	Permanent Full-time	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	A	N	N	N	A	A	N	N	A	A	A	A	A	A	A	A	A	A	A	N	A	Female	36-40	NaN	NaN	NaN	NaN	NaN
4	5	Age Retirement	05/2012	1970.0	1989.0	Head of Curriculum/Head of Special Education	NaN	South East	NaN	Permanent Full-time	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	True	False	False	A	A	N	N	D	D	N	A	A	A	A	A	A	SA	SA	D	D	A	N	A	M	Female	61 or older	NaN	NaN	NaN	NaN	NaN

Deleting unnecessary columns and standardizing some column labels¶

Below I delete range of columns I will not use in my project

In [11]:

# dropping columns we will not need further in our analysis
dete_sur.drop(columns=dete_sur.columns[28:49], inplace=True)
tafe_sur.drop(columns=tafe_sur.columns[17:66], inplace=True)

Later in project I will need to concatenate 2 dataframes. In order to have columns which contain same type of information in both dataframes matched I standardize column labels below. First, I do it for 'dete_sur' dataframe and print before and after versions of column labels.

In [12]:

# converting column labels to snake_case
bef_cols = dete_sur.columns
dete_sur.columns = dete_sur.columns.str.strip().str.replace('\s+',' ', regex=True)\
                   .str.replace(' ','_', regex=True).str.lower()

aft_cols = dete_sur.columns
pd.DataFrame({'columns before': bef_cols, 'columns_after': aft_cols})

Out[12]:

	columns before	columns_after
0	ID	id
1	SeparationType	separationtype
2	Cease Date	cease_date
3	DETE Start Date	dete_start_date
4	Role Start Date	role_start_date
5	Position	position
6	Classification	classification
7	Region	region
8	Business Unit	business_unit
9	Employment Status	employment_status
10	Career move to public sector	career_move_to_public_sector
11	Career move to private sector	career_move_to_private_sector
12	Interpersonal conflicts	interpersonal_conflicts
13	Job dissatisfaction	job_dissatisfaction
14	Dissatisfaction with the department	dissatisfaction_with_the_department
15	Physical work environment	physical_work_environment
16	Lack of recognition	lack_of_recognition
17	Lack of job security	lack_of_job_security
18	Work location	work_location
19	Employment conditions	employment_conditions
20	Maternity/family	maternity/family
21	Relocation	relocation
22	Study/Travel	study/travel
23	Ill Health	ill_health
24	Traumatic incident	traumatic_incident
25	Work life balance	work_life_balance
26	Workload	workload
27	None of the above	none_of_the_above
28	Gender	gender
29	Age	age
30	Aboriginal	aboriginal
31	Torres Strait	torres_strait
32	South Sea	south_sea
33	Disability	disability
34	NESB	nesb

Then I standardize 9 column labels (of columns which I will use later in project for analysis) in 'tafe_survey' dataframe based on column labels from 'dete_sur' dataframe.

In [13]:

# changing col labels in accordance to dete df so when we concat
# these cols (with same type of info) match
bef_cols1 = tafe_sur.columns
tafe_sur.rename(columns={'Record ID': 'id',
                        'CESSATION YEAR': 'cease_date',
                        'Reason for ceasing employment': 'separationtype',
                        'Gender. What is your Gender?': 'gender',
                        'CurrentAge. Current Age': 'age',
                        'Employment Type. Employment Type': 'employment_status',
                        'Classification. Classification': 'position',
                        'LengthofServiceOverall. Overall Length of Service at Institute (in years)':
                        'institute_service',
                        'LengthofServiceCurrent. Length of Service at current workplace (in years)':
                        'role_service'}, inplace=True)

aft_cols1 = tafe_sur.columns
pd.DataFrame({'columns before': bef_cols1, 'columns after': aft_cols1})

Out[13]:

	columns before	columns after
0	Record ID	id
1	Institute	Institute
2	WorkArea	WorkArea
3	CESSATION YEAR	cease_date
4	Reason for ceasing employment	separationtype
5	Contributing Factors. Career Move - Public Sec...	Contributing Factors. Career Move - Public Sec...
6	Contributing Factors. Career Move - Private Se...	Contributing Factors. Career Move - Private Se...
7	Contributing Factors. Career Move - Self-emplo...	Contributing Factors. Career Move - Self-emplo...
8	Contributing Factors. Ill Health	Contributing Factors. Ill Health
9	Contributing Factors. Maternity/Family	Contributing Factors. Maternity/Family
10	Contributing Factors. Dissatisfaction	Contributing Factors. Dissatisfaction
11	Contributing Factors. Job Dissatisfaction	Contributing Factors. Job Dissatisfaction
12	Contributing Factors. Interpersonal Conflict	Contributing Factors. Interpersonal Conflict
13	Contributing Factors. Study	Contributing Factors. Study
14	Contributing Factors. Travel	Contributing Factors. Travel
15	Contributing Factors. Other	Contributing Factors. Other
16	Contributing Factors. NONE	Contributing Factors. NONE
17	Gender. What is your Gender?	gender
18	CurrentAge. Current Age	age
19	Employment Type. Employment Type	employment_status
20	Classification. Classification	position
21	LengthofServiceOverall. Overall Length of Serv...	institute_service
22	LengthofServiceCurrent. Length of Service at c...	role_service

Selecting relevant section of data¶

Coming to main questions mentioned in the beginning requires to select rows which represent employees who resigned from their positions. To do so I check 'separation type' column in both columns.
I start with 'tafe_sur' dataframe (which represents employee exit surveys from Technical and Further Education (TAFE))

In [14]:

tafe_sur['separationtype'].value_counts(dropna=False)

Out[14]:

Resignation                 340
Contract Expired            127
Retrenchment/ Redundancy    104
Retirement                   82
Transfer                     25
Termination                  23
NaN                           1
Name: separationtype, dtype: int64

Output of above cell shows that we 340 rows where employees resigned from their job. Below I select those rows and assign to 'tafe_resign' dataframe.

In [15]:

tafe_resign = tafe_sur[tafe_sur['separationtype'] == 'Resignation'].copy()
tafe_resign['separationtype'].value_counts(dropna=False)

Out[15]:

Resignation    340
Name: separationtype, dtype: int64

Next step is to do same for 'dete_sur' (which represents employee exit surveys from Department of Education, Training and Employment (DETE))

In [16]:

dete_sur['separationtype'].value_counts(dropna=False)

Out[16]:

Age Retirement                          285
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Voluntary Early Retirement (VER)         67
Ill Health Retirement                    61
Other                                    49
Contract Expired                         34
Termination                              15
Name: separationtype, dtype: int64

In 'dete_sur' dataframe 'separationtype' column contains different type of resignation categories:

Resignation-Other reasons
Resignation-Other employer
Resignation-Move overseas/interstate

As we need all those categories collect them in list and based on it select relevant rows from 'dete_sur' dataframe. Selected data will be assigned to 'dete_resign' dataframe.

In [17]:

resign_types = ['Resignation-Other reasons', 'Resignation-Other employer',
                'Resignation-Move overseas/interstate']
dete_resign = dete_sur[dete_sur['separationtype'].isin(resign_types)].copy()
dete_resign['separationtype'].value_counts(dropna=False)

Out[17]:

Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Name: separationtype, dtype: int64

Working on date (mostly year) columns¶

As we can see from questions this project aims to answer we need data on employees age and the time interval they worked for both institutes. In order to have reliable data I will check columns showing us required information on some abnormalities:

if year would be too early like before 1945
if year would be later than the date when data was gathered
if age is too old or to young

Following columns will be ones I check later in project for my analysis:

in 'dete_resign' dataframe:
- 'cease_year' (the year employee left institute)
- 'dete_start_date' (the year employee started to work in institute)
- 'age' (employee's age)
in 'tafe_resign' dataframe:
- 'cease_date' (the date employee left institute)
- 'age' (employee's age)

In [18]:

dete_resign['cease_year'] = dete_resign['cease_date'].str.extract(r"([1-2][0-9]{3})").astype(float)
dete_cols = ['cease_year', 'dete_start_date', 'age']
df_vc(dete_resign, cols=dete_cols, dropna=False, index='descending')

Out[18]:

	cease_year	vc(cease_year)	dete_start_date	vc(dete_start_date)	age	vc(age)
0	2014.0	22.0	2013.0	10	61 or older	23.0
1	2013.0	146.0	2012.0	21	56-60	26.0
2	2012.0	129.0	2011.0	24	51-55	32.0
3	2010.0	2.0	2010.0	17	46-50	42.0
4	2006.0	1.0	2009.0	13	41-45	48.0
5	N/A	11.0	2008.0	22	36-40	41.0
6			2007.0	21	31-35	29.0
7			2006.0	13	26-30	35.0
8			2005.0	15	21-25	29.0
9			2004.0	14	20 or younger	1.0
10			2003.0	6	N/A	5.0
11			2002.0	6
12			2001.0	3
13			2000.0	9
14			1999.0	8
15			1998.0	6
16			1997.0	5
17			1996.0	6
18			1995.0	4
19			1994.0	6
20			1993.0	5
21			1992.0	6
22			1991.0	4
23			1990.0	5
24			1989.0	4
25			1988.0	4
26			1987.0	1
27			1986.0	3
28			1985.0	3
29			1984.0	1
30			1983.0	2
31			1982.0	1
32			1980.0	5
33			1977.0	1
34			1976.0	2
35			1975.0	1
36			1974.0	2
37			1973.0	1
38			1972.0	1
39			1971.0	1
40			1963.0	1
41			N/A	28

In [19]:

tafe_cols = ['cease_date','age']
df_vc(tafe_resign, cols=tafe_cols, dropna=False, index='ascendng')

Out[19]:

	cease_date	vc(cease_date)	age	vc(age)
0	2011.0	116.0	N/A	50
1	2012.0	94.0	41 45	45
2	2010.0	68.0	46 50	39
3	2013.0	55.0	51-55	39
4	N/A	5.0	21 25	33
5	2009.0	2.0	36 40	32
6			26 30	32
7			31 35	32
8			56 or older	29
9			20 or younger	9

As we can see from outputs of above 2 cells there are no abnormalities and columns can be safely used for analysis.

To be able to answer questions like whether employees who worked long time (or short) left institute due to some kind of dissatisfaction we need to have categories indicating how much time employee worked for institute. We have in 'tafe_resign' dataframe 'institute_service' column which will serve our purpose.

In [20]:

tafe_resign['institute_service'].value_counts(dropna=False)

Out[20]:

Less than 1 year      73
1-2                   64
3-4                   63
NaN                   50
5-6                   33
11-20                 26
7-10                  21
More than 20 years    10
Name: institute_service, dtype: int64

For 'dete_resgin' dataframe which represent employees resigned from Department of Education, Training and Employment we have year when employee started to work and year when employee resigned from position. Below I will create 'institute_service' column also for 'dete_resign' by subtracting start year from year of resignation and display unique values along with their count in column.

In [21]:

dete_resign['institute_service'] = dete_resign['cease_year'] - dete_resign['dete_start_date']
dete_resign['institute_service'].value_counts(dropna=False)

Out[21]:

NaN     38
5.0     23
1.0     22
3.0     20
0.0     20
6.0     17
4.0     16
2.0     14
9.0     14
7.0     13
8.0      8
13.0     8
15.0     7
20.0     7
14.0     6
12.0     6
17.0     6
22.0     6
10.0     6
18.0     5
16.0     5
11.0     4
24.0     4
23.0     4
19.0     3
21.0     3
32.0     3
39.0     3
26.0     2
30.0     2
25.0     2
28.0     2
36.0     2
49.0     1
27.0     1
34.0     1
29.0     1
35.0     1
31.0     1
33.0     1
42.0     1
38.0     1
41.0     1
Name: institute_service, dtype: int64

If resigned employee was dissatisfied?¶

Previously we created dataframe containing resigned employees for both institutes. But this would not be enough to answer our questions. In order to answer those questions we need to understand if employee resigned dues to some kind of dissatisfaction. Data for both institutes contain information on whether employee was dissatisfied or not. But this information is not contained in one column with boolean values (what we need).

In data for TAFE we have 2 column from which we can understand if employee was dissatisfied. These columns are:

Contributing Factors. Dissatisfaction
Contributing Factors. Job Dissatisfaction

If employee left due to dissatisfaction then relevant row contains same string as column label otherwise it contains dash or null value. So our aim is to able to see this information in 1 column with boolean values. To do so, I will combine above mentioned columns in one and then check if every element in row contains substring 'Dissatis'. If yes then employee left due to dissatisfaction if not otherwise is true.

In [22]:

tafe_resign['dissatisfied'] = tafe_resign['Contributing Factors. Dissatisfaction'].str.cat(\
    tafe_resign['Contributing Factors. Job Dissatisfaction']).str.contains('Dissatis',na=False)
tafe_resign['dissatisfied'].value_counts()

Out[22]:

False    249
True      91
Name: dissatisfied, dtype: int64

Output of above cell shows that majority of employees were not dissatisfied (249 employees) against 91 dissatisfied employees.

In data for DETE we also do not have 1 column with boolean values showing us if employee was dissatisfied or not. Here we have several columns which represent different types of dissatisfaction and contain boolean values indication if employee left due to exact that type of dissatisfaction or not. Below first I check if these columns really contain only boolean values with help of 'df_vc' function which I wrote before in this project.

In [23]:

dis_cols = ['job_dissatisfaction', 'dissatisfaction_with_the_department',
            'physical_work_environment', 'lack_of_recognition',
            'lack_of_job_security', 'work_location', 'employment_conditions',
            'work_life_balance', 'workload']
df_vc(dete_resign, cols=dis_cols)

Out[23]:

	job_dissatisfaction	vc(job_dissatisfaction)	dissatisfaction_with_the_department	vc(dissatisfaction_with_the_department)	physical_work_environment	vc(physical_work_environment)	lack_of_recognition	vc(lack_of_recognition)	lack_of_job_security	vc(lack_of_job_security)	work_location	vc(work_location)	employment_conditions	vc(employment_conditions)	work_life_balance	vc(work_life_balance)	workload	vc(workload)
0	False	270	False	282	False	305	False	278	False	297	False	293	False	288	False	243	False	284
1	True	41	True	29	True	6	True	33	True	14	True	18	True	23	True	68	True	27

From above cell it is visible that all columns representing different kinds of dissatisfaction contain only boolean values. Then below I create column 'dissatisfied' also for 'dete_resign'. This column will contain:

'True' if any of columns shown in above cell contains boolean value 'True'
'False' if non of columns shown in above cell contains boolean value 'True'

In [24]:

dete_resign['dissatisfied'] = dete_resign[dis_cols].any(axis=1, skipna=False)
dete_resign['dissatisfied'].value_counts(dropna=False)

Out[24]:

False    162
True     149
Name: dissatisfied, dtype: int64

Output of above cell shows that the is not big difference between number of employees who left due to some kind of dissatisfaction and number of employees who had other reasons (149 against 162, respectively)

Combining 2 dataframes¶

Next step is to combine following dataframes we worked on in previous cells into one dataframe for further analysis:

dete_resign - employees resigned from their position in Department of Education, Training and Employment (DETE)
tafe_resign - employees resigned from their position in Technical and Further Education (TAFE)

But before to do so we need to create column in both dataframes called 'institute'. This column will help us to identify in which institute employee worked after combining 2 dataframes. And then I combine dataframes in next cell.

In [25]:

dete_resign['institute'] = 'DETE'
tafe_resign['institute'] = 'TAFE'

In [26]:

dete_tafe = pd.concat([dete_resign, tafe_resign], ignore_index=True)
dete_tafe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 651 entries, 0 to 650
Data columns (total 54 columns):
 #   Column                                               Non-Null Count  Dtype  
---  ------                                               --------------  -----  
 0   id                                                   651 non-null    float64
 1   separationtype                                       651 non-null    object 
 2   cease_date                                           635 non-null    object 
 3   dete_start_date                                      283 non-null    float64
 4   role_start_date                                      271 non-null    float64
 5   position                                             598 non-null    object 
 6   classification                                       161 non-null    object 
 7   region                                               265 non-null    object 
 8   business_unit                                        32 non-null     object 
 9   employment_status                                    597 non-null    object 
 10  career_move_to_public_sector                         311 non-null    object 
 11  career_move_to_private_sector                        311 non-null    object 
 12  interpersonal_conflicts                              311 non-null    object 
 13  job_dissatisfaction                                  311 non-null    object 
 14  dissatisfaction_with_the_department                  311 non-null    object 
 15  physical_work_environment                            311 non-null    object 
 16  lack_of_recognition                                  311 non-null    object 
 17  lack_of_job_security                                 311 non-null    object 
 18  work_location                                        311 non-null    object 
 19  employment_conditions                                311 non-null    object 
 20  maternity/family                                     311 non-null    object 
 21  relocation                                           311 non-null    object 
 22  study/travel                                         311 non-null    object 
 23  ill_health                                           311 non-null    object 
 24  traumatic_incident                                   311 non-null    object 
 25  work_life_balance                                    311 non-null    object 
 26  workload                                             311 non-null    object 
 27  none_of_the_above                                    311 non-null    object 
 28  gender                                               592 non-null    object 
 29  age                                                  596 non-null    object 
 30  aboriginal                                           7 non-null      object 
 31  torres_strait                                        0 non-null      object 
 32  south_sea                                            3 non-null      object 
 33  disability                                           8 non-null      object 
 34  nesb                                                 9 non-null      object 
 35  cease_year                                           300 non-null    float64
 36  institute_service                                    563 non-null    object 
 37  dissatisfied                                         651 non-null    bool   
 38  institute                                            651 non-null    object 
 39  Institute                                            340 non-null    object 
 40  WorkArea                                             340 non-null    object 
 41  Contributing Factors. Career Move - Public Sector    332 non-null    object 
 42  Contributing Factors. Career Move - Private Sector   332 non-null    object 
 43  Contributing Factors. Career Move - Self-employment  332 non-null    object 
 44  Contributing Factors. Ill Health                     332 non-null    object 
 45  Contributing Factors. Maternity/Family               332 non-null    object 
 46  Contributing Factors. Dissatisfaction                332 non-null    object 
 47  Contributing Factors. Job Dissatisfaction            332 non-null    object 
 48  Contributing Factors. Interpersonal Conflict         332 non-null    object 
 49  Contributing Factors. Study                          332 non-null    object 
 50  Contributing Factors. Travel                         332 non-null    object 
 51  Contributing Factors. Other                          332 non-null    object 
 52  Contributing Factors. NONE                           332 non-null    object 
 53  role_service                                         290 non-null    object 
dtypes: bool(1), float64(4), object(49)
memory usage: 270.3+ KB

After combining dataframes from output of above cells we can see that there is in total 54 columns. As a result of cleaning activities and quality of data the columns we need for our further analysis and columns which contain more than 500 non-null are same. Therefore below I will drop columns that contain less than 500 non-null values.

In [27]:

dete_tafe.dropna(axis=1,thresh=500, inplace=True)
dete_tafe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 651 entries, 0 to 650
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 651 non-null    float64
 1   separationtype     651 non-null    object 
 2   cease_date         635 non-null    object 
 3   position           598 non-null    object 
 4   employment_status  597 non-null    object 
 5   gender             592 non-null    object 
 6   age                596 non-null    object 
 7   institute_service  563 non-null    object 
 8   dissatisfied       651 non-null    bool   
 9   institute          651 non-null    object 
dtypes: bool(1), float64(1), object(8)
memory usage: 46.5+ KB

Standardizing 'institute_service' & 'age' columns¶

I start with standardizing "institute_service" column which represents period employee spent working for institution. As we could see previously data for TAFE institution had ready categories for period employees worked for institution and for DETE institution we added column which contains number of years employee spent in position. As one institution have ready categories and one only number of years we need to create common categories and apply for whole 'institute_service' column. Below, first, I recheck what values are presented in column.

In [28]:

#dete_tafe['institute_service'].unique()
dete_tafe['institute_service'].value_counts(dropna=False, ascending=False)

Out[28]:

NaN                   88
Less than 1 year      73
1-2                   64
3-4                   63
5-6                   33
11-20                 26
5.0                   23
1.0                   22
7-10                  21
0.0                   20
3.0                   20
6.0                   17
4.0                   16
2.0                   14
9.0                   14
7.0                   13
More than 20 years    10
8.0                    8
13.0                   8
20.0                   7
15.0                   7
17.0                   6
10.0                   6
22.0                   6
12.0                   6
14.0                   6
16.0                   5
18.0                   5
11.0                   4
24.0                   4
23.0                   4
39.0                   3
19.0                   3
21.0                   3
32.0                   3
36.0                   2
30.0                   2
26.0                   2
25.0                   2
28.0                   2
34.0                   1
35.0                   1
27.0                   1
38.0                   1
41.0                   1
42.0                   1
49.0                   1
33.0                   1
31.0                   1
29.0                   1
Name: institute_service, dtype: int64

Below I write categories for 'institute_service' column i.e. how to split years among categories, how to name them and how many to have:

New: Less than 3 years at a company
Experienced: 3-6 years at a company
Established: 7-10 years at a company
Veteran: 11 or more years at a company

In order to replace values in column with new categories written in above cell function written below will be applied to it.

In [29]:

def yr_cats(element):
    # new categories to be applied to column
    cat1, cat2 = 'New (0-3)', 'Experienced (3-6)'
    cat3, cat4 = 'Established (6-10)', 'Veteran (10-∞)'
    
    # if nan leave value nan   
    if pd.isna(element): return np.nan
    
    # if string or number fill categories
    elif isinstance(element, str):
        if element in ['Less than 1 year', '1-2']: return cat1
        elif element in ['3-4', '5-6']: return cat2
        elif element in ['7-10']: return cat3
        elif element in ['11-20', 'More than 20 years']: return cat4
        else: return 'unexpected category'
        
    elif isinstance(element, (int, float)):
        if element < 3: return cat1
        elif element < 6: return cat2
        elif element < 10: return cat3
        else: return cat4
         
    # if data type of value is not caught by if block    
    else: return "unexpected category data type"
    
# applying function & checking result   
dete_tafe['service_yrs_cat'] = dete_tafe['institute_service'].map(yr_cats)
dete_tafe['service_yrs_cat'].value_counts(dropna=False)

Out[29]:

New (0-3)             193
Experienced (3-6)     155
Veteran (10-∞)        142
NaN                    88
Established (6-10)     73
Name: service_yrs_cat, dtype: int64

Second column to create standard categories for is 'age' column. First I check existing unique values and their frequency in column.

In [30]:

print(dete_tafe[dete_tafe['institute']=='DETE']['age'].value_counts(dropna=False))
dete_tafe[dete_tafe['institute']=='TAFE']['age'].value_counts(dropna=False)

41-45            48
46-50            42
36-40            41
26-30            35
51-55            32
31-35            29
21-25            29
56-60            26
61 or older      23
NaN               5
20 or younger     1
Name: age, dtype: int64

Out[30]:

NaN              50
41  45           45
46  50           39
51-55            39
21  25           33
36  40           32
26  30           32
31  35           32
56 or older      29
20 or younger     9
Name: age, dtype: int64

As we can see from output of above cell the values are coming from 2 dataframes are similar. But here are small differences. First, in of them we can see category indicating people younger than 20 and in other one elder people got more categories (instead of '56 or older' it has '56-60' and '60 or older' categories). To equalize number of categories and have same range for all values below standardize categories for employees who were younger than 20 or older than 56 when they left institutions.

In [31]:

age_dict = {'56-60': '56-∞',
            '61 or older': '56-∞',
            '56 or older': '56-∞',
            '20 or younger': '0-20'}
dete_tafe['age'].replace(to_replace=age_dict, inplace=True)
dete_tafe['age'].value_counts()

Out[31]:

56-∞      78
51-55     71
41-45     48
41  45    45
46-50     42
36-40     41
46  50    39
26-30     35
21  25    33
26  30    32
31  35    32
36  40    32
31-35     29
21-25     29
0-20      10
Name: age, dtype: int64

Second and only difference left is in a format how range is represented in column. In one case it is in format 'x y' in other case 'x-y'. To standardize those values below I accessed values in column and cleaned them from spaces ad placed dash between numbers brining all ranges to 'x-y' format. Then I display the result.

In [32]:

dete_tafe['age'] = dete_tafe['age'].str.strip().str.\
replace('\s+', ' ', regex=True).str.replace(' ','-', regex=True)
dete_tafe['age'].value_counts(dropna=False)

Out[32]:

41-45    93
46-50    81
56-∞     78
36-40    73
51-55    71
26-30    67
21-25    62
31-35    61
NaN      55
0-20     10
Name: age, dtype: int64

Visualization and analysis¶

Conclusion 1.0¶

To answer to our questions we aimed from beginning of project I will use pivot tables and plots based on those pivot tables. First, I want to focus on following questions:

Are younger employees resigning due to some kind of dissatisfaction? What about older employees?

Below I create pivot table. To make our question wider I also added gender variable to pivot. Values in pivot table represent percentage of employees who left their positions due to some kind of dissatisfaction. Column and index values are quite obviously represent gender and age group employee belonged to, respectively.

In [33]:

dt_age_pv = dete_tafe.pivot_table(index='age', values='dissatisfied', columns=['gender'])
dt_age_pv[dt_age_pv.select_dtypes(include=['int', 'float']).columns] *= 100
dt_age_pv

Out[33]:

gender	Female	Male
age
0-20	28.571429	0.000000
21-25	35.555556	17.647059
26-30	40.384615	46.666667
31-35	37.777778	40.000000
36-40	30.000000	43.478261
41-45	36.231884	41.666667
46-50	41.935484	27.777778
51-55	37.777778	48.000000
56-∞	38.297872	50.000000

As it is easier to analyse data represented in plot rather than in pivot table below I create diverging horizontal bar chart. As stated before source for bar chart is above pivot table.

In [34]:

# setting size of text on plot, creating fig, setting xticks range
rcParams["font.size"]=12
fig, axes = plt.subplots(figsize=(15,7), ncols=2)
fig.suptitle('% of Dissatisfied leavers')
ticks = np.arange(0, 61, 5)

# plotting and setting grids
arg_dict = {'xlabel': '', 'xticks': ticks}
hb1 = dt_age_pv['Male'].plot.barh(**arg_dict, ax=axes[0], color='c', title='Male')
hb2 = dt_age_pv['Female'].plot.barh(**arg_dict, ax=axes[1], color='y', title='Female')
hb1.grid(axis='x', c='c', alpha=0.4)
hb2.grid(axis='x', c='y', alpha=0.4)

# removing axes[0] yticks, inverting xticks so it chart looks diverging plot
hb1.set_yticks([])
hb1.invert_xaxis()
hb2.tick_params(axis='y', left=False, pad=-20)

# removing spines from both axes
for loc, spine in hb1.spines.items(): spine.set_visible(False)
for loc, spine in hb2.spines.items(): spine.set_visible(False)

# bringing 2 axes so close that they look like one chart
fig.subplots_adjust(wspace=0.0001)

In the output of cell above we can see diverging bar chart. It shows percentage of employees who left both institutes due to some kind of dissatisfaction by age groups for male and females. There is no strong linear correlation between age group and share of dissatisfied leavers in both genders. But from first observation of chart we can spot that middle aged peopled have less tendency to leave their positions due to dissatisfaction. Although this can be observed separately by genders it is better to pay attention to that tendency in total. In other words, if you pay closer attention to total length of horizontal bars (females + males) you can observe that people tend to leave more when they were between 26 and 30, then there were less dissatisfied leavers until employees reach age group of 51 and older. So it is quite obvious that older employees tend to leave more due to dissatisfaction than middle aged ones. But for employees younger than 30 not everything is clear as age group of 20-25 had lowest share of dissatisfied leavers among other age groups except 0-20 age group (which is only presented in TAFE institute and contains only females)

Conclusion 1.1¶

Secondly, I want to pay more attention to following questions:

Are employees who have only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been at the job longer?

Below I create pivot table where values again represent share of employees who left their positions due to some kind of dissatisfaction. In index part we have 2 splits. First split is by institute where employee worked and second is by service years categories. I also did not drop age factor from pivot table and plot I will create, while gender split will not be discussed in following part.

Note: Before creating pivot table below I converted type of 'service_yrs_cat' (contain service years categories) to categories and then sorted them ascending based on years. The reason to do so is to see age service years categories in ascending order in pivot table and plot I will create later. And this will make easier comparison of values among groups in pivot table and plot.

In [35]:

# ascending sorting list for service years categories
sorter = ['New (0-3)', 'Experienced (3-6)', 'Established (6-10)', 'Veteran (10-∞)']

# changing type of column to categories
dete_tafe['service_yrs_cat'] = dete_tafe['service_yrs_cat'].astype('category')

# accessing column and appliying sorting list written above
dete_tafe['service_yrs_cat'].cat.set_categories(sorter, inplace=True)

# creating pivot table
age_serv_pv = dete_tafe.pivot_table(index=['institute', 'service_yrs_cat'],
                                    columns='age', values='dissatisfied')
age_serv_pv[age_serv_pv.select_dtypes(include=['int','float']).columns] *= 100
age_serv_pv

Out[35]:

	age	0-20	21-25	26-30	31-35	36-40	41-45	46-50	51-55	56-∞
institute	service_yrs_cat
DETE	New (0-3)	0.000000	31.250000	50.000000	40.000000	27.272727	25.000000	50.000000	100.000000	100.000000
	Experienced (3-6)	NaN	25.000000	53.846154	41.666667	33.333333	62.500000	25.000000	40.000000	33.333333
	Established (6-10)	NaN	0.000000	66.666667	100.000000	87.500000	50.000000	42.857143	66.666667	60.000000
	Veteran (10-∞)	NaN	NaN	NaN	40.000000	27.272727	60.000000	57.894737	57.894737	58.064516
TAFE	New (0-3)	16.666667	28.571429	26.315789	21.052632	41.666667	22.727273	33.333333	21.052632	25.000000
	Experienced (3-6)	33.333333	33.333333	22.222222	12.500000	18.181818	33.333333	26.666667	20.000000	20.000000
	Established (6-10)	NaN	NaN	25.000000	66.666667	20.000000	NaN	40.000000	33.333333	0.000000
	Veteran (10-∞)	NaN	NaN	NaN	0.000000	25.000000	20.000000	25.000000	57.142857	21.428571

This time, when we have 2 splits in index part of pivot and 1 split by age groups in columns part using plots like bar chart would not be most effective. In order to clearly display data and provide effective visualization my choice fell on heatmap provided by seaborn library.

In [36]:

fig, hmap = plt.subplots(figsize=(12,6))
rcParams["font.size"]=12
hmap = sns.heatmap(age_serv_pv, ax=hmap, annot=True, fmt='.2f', cmap='YlGn')
hmap.set_facecolor("grey")
hmap.set_title("% of Dissatisfied leavers by age(x) & institute-service years(y)")
hmap.set_ylabel('Institute-Service years')
hmap.set_xlabel('Age')
hmap.axhline(y=4, c='k', lw=3);

In above heatmap horizontal black line divides plot into 2 parts based on institute where leavers were employed. Grey are represent absence of employees in relevant cell. Quite logically most of grey are covers cells representing young people with more than 6 or 10 years experience inside institute.

Also it is quite clear that there are much less high extreme values in Technical and Further Education (TAFE) than in Department of Education, Training and Employment (DETE). In fact in only 2 groups (Veterans aged between 51 & 55, Established employees aged between 31 & 35) in TAFE share of employees who left due to dissatisfaction is more than half. Meanwhile, we can observe much more green and dark green cells in DETE part of heatmap which means frequent presence of high values. This leads us to conclusion than in TAFE values are more equally distributed than in DETE. And this conclusion is also right if we compare percentages by age group in 2 institutes. Then it means that observations I made in Conclusion 1.1 were heavily affected by DETE and cannot be interpreted with the same degree of significance for both entities.

More diverged values in DETE part of plot give us more clear answers to our questions. 2 bottom rows of DETE part of plot are more greenish than top 2 rows. Therefore, it is clear that Established employees and Veterans tend to leave due to dissatisfaction more than less experienced employees in institute. Another visible trend is that from left to the right share of dissatisfied leavers is getting higher.

To sum up, while values in TAFE are more equally distributed and do not answer our questions clearly, values in DETE are more robust. And this robustness suggests that most experienced and/or older employees tend to leave DETE due to dissatisfaction more than other groups.