Notebook

How do we make them stay?

A data-driven identification of areas that two governmental institutes need to improve to retain staff

0. Intro¶

0.1 What is the dataset?¶

We will be working with exit surveys from employees of two public Australian institutes:

Both institutes used the same survey template and customized them to their own needs.
The dataset in general contains information on:

Descriptive information about the employee like their age at the time of the exit or the job he/she used to work at.
Information from the exit survey which includes:
- Reasons / contributing factors for leaving the institutes
- Feedback from the employees on the work environment

0.2 Why are we doing the analysis?¶

To answer questions posed by dataquest:

Are employees with relatively short tenures resigning more often due to some dissatisfaction?
Are young employees resigning more often due to some kind of dissatisfaction?

To answer questions I am interested in: 3. Are employees more likely to be dissatisfied and resign, if they have been in their job positions for a short time? 4. Which HR and company practices may to be improved to potentially mitigating employee exits?

The focus of this exercise is on data wrangling. It is based on a dataquest.io guided project.

0.3 Mini ToC¶

Intro
Inspecting: What does the data look like?
Cleaning: Let's make the information usable
Analyzing: Getting the answers we are looking for

0.4 Setting up the workbook and loading the data¶

In [1]:

import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# na_values info provided by dataquest
dete = pd.read_csv("dete_survey.csv", na_values="Not Stated")
tafe = pd.read_csv("tafe_survey.csv")

In [16]:

# Some columns are not needed for the analysis and are thus dropped.
# I have excluded these columns above the initial investigative procedure
# and immediately after the file import to avoid duplicating the investigative code.

# first DETE, see rationale for exclusion under section 1.1.
dete_drop_col = ["Aboriginal",
                 "Torres Strait",
                 "South Sea",
                 "Disability",
                 "NESB",
                 "Position",
                 "Classification",
                 "Region",
                 "Business Unit"]
dete.drop(columns=dete_drop_col, inplace=True)

# second TAFE, see rationale under section 1.2
tafe_drop_col_stage1 = tafe.columns[tafe.columns.str.contains("Induction") |
                                    tafe.columns.str.contains("Main Factor") |
                                    tafe.columns.str.contains("Workplace")]
tafe.drop(columns=tafe_drop_col_stage1, inplace=True)

tafe_drop_col_stage2 = ["Classification. Classification",
                        "WorkArea",
                        "Institute"]
tafe.drop(columns=tafe_drop_col_stage2, inplace=True)
# .copy() to avoid potential "SettingwithCopyWarning"-ERROR
tafe = tafe[tafe["CurrentAge. Current Age"].notna()].copy()

1. Inspection: What does the data look like?¶

1.1 Investigating the DETE survey¶

In [17]:

dete.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 47 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   ID                                   822 non-null    int64  
 1   SeparationType                       822 non-null    object 
 2   Cease Date                           788 non-null    object 
 3   DETE Start Date                      749 non-null    float64
 4   Role Start Date                      724 non-null    float64
 5   Employment Status                    817 non-null    object 
 6   Career move to public sector         822 non-null    bool   
 7   Career move to private sector        822 non-null    bool   
 8   Interpersonal conflicts              822 non-null    bool   
 9   Job dissatisfaction                  822 non-null    bool   
 10  Dissatisfaction with the department  822 non-null    bool   
 11  Physical work environment            822 non-null    bool   
 12  Lack of recognition                  822 non-null    bool   
 13  Lack of job security                 822 non-null    bool   
 14  Work location                        822 non-null    bool   
 15  Employment conditions                822 non-null    bool   
 16  Maternity/family                     822 non-null    bool   
 17  Relocation                           822 non-null    bool   
 18  Study/Travel                         822 non-null    bool   
 19  Ill Health                           822 non-null    bool   
 20  Traumatic incident                   822 non-null    bool   
 21  Work life balance                    822 non-null    bool   
 22  Workload                             822 non-null    bool   
 23  None of the above                    822 non-null    bool   
 24  Professional Development             808 non-null    object 
 25  Opportunities for promotion          735 non-null    object 
 26  Staff morale                         816 non-null    object 
 27  Workplace issue                      788 non-null    object 
 28  Physical environment                 817 non-null    object 
 29  Worklife balance                     815 non-null    object 
 30  Stress and pressure support          810 non-null    object 
 31  Performance of supervisor            813 non-null    object 
 32  Peer support                         812 non-null    object 
 33  Initiative                           813 non-null    object 
 34  Skills                               811 non-null    object 
 35  Coach                                767 non-null    object 
 36  Career Aspirations                   746 non-null    object 
 37  Feedback                             792 non-null    object 
 38  Further PD                           768 non-null    object 
 39  Communication                        814 non-null    object 
 40  My say                               812 non-null    object 
 41  Information                          816 non-null    object 
 42  Kept informed                        813 non-null    object 
 43  Wellness programs                    766 non-null    object 
 44  Health & Safety                      793 non-null    object 
 45  Gender                               798 non-null    object 
 46  Age                                  811 non-null    object 
dtypes: bool(18), float64(2), int64(1), object(26)
memory usage: 200.8+ KB

In [18]:

dete.head().style; # .style to show all columns, remove colon to see output

In [19]:

dete["SeparationType"].value_counts(dropna=False).sort_index();

At this time I am not interested in:

special characteristics of the employees such as "Aboriginal" or "Disability". There are very few non-null values in these columns - investigating feedback from these minorities is however very important and may lead to insights that improve DETE in being an equitable workplace.
As the dataset is relatively small (822 employees), I am not interested in drilling down into specific business units or job positions. Some of these columns also have a significant number of null-values. Only the Employment status includes information that is of interest to me: part-time vs. full-time employees.

The respective columns are therefore removed from the dataset.
I have excluded these columns above the initial investigative procedure and immediately after the file import to avoid duplicating the investigative code.

+++++

Noteworthy: At this time the meaning of the values in columns "Professional Development" onward (A, N, D, etc.) is currently unknown. The material accompanying the dataset does not provide further details (see Contextual Information available on the website linked under section "0."). Because both DETE and TAFE used the same base template, I hope that the information can be inferred from the second data set (TAFE).

+++++

All non-boolean columns have been investigated using "value_counts"-method to see if there are any unexpected values.

"Cease Date": non uniform format - both years and months are in some cases included - NEEDS TO BE CLEANED
"Role Start Date": Includes single occurrence of incorrect value "200" NEEDS TO BE CLEANED
(instances of "Not Stated" were already mentioned by dataquest and are included in the read_csv)

All other columns look fine to me, aside from a few Nan-values. I will sort the dataset by "Separation Type" and visualize the occurrence of Nan-Values to maybe see a pattern in the way Nan-Values are distributed. Going through various iterations of the plot, there seems to be no cluster of missing values in a specific separation type or age group.

In [20]:

fig, ax = plt.subplots(figsize=(5,10))
ax = sns.heatmap(dete.set_index(["SeparationType","Age"]).sort_index().isnull(),cbar=False)
#black==non-null, light==null-value

1.2 Investigating the TAFE survey¶

In [21]:

tafe.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 596 entries, 0 to 701
Data columns (total 50 columns):
 #   Column                                                                                                                                                         Non-Null Count  Dtype  
---  ------                                                                                                                                                         --------------  -----  
 0   Record ID                                                                                                                                                      596 non-null    float64
 1   CESSATION YEAR                                                                                                                                                 594 non-null    float64
 2   Reason for ceasing employment                                                                                                                                  596 non-null    object 
 3   Contributing Factors. Career Move - Public Sector                                                                                                              380 non-null    object 
 4   Contributing Factors. Career Move - Private Sector                                                                                                             380 non-null    object 
 5   Contributing Factors. Career Move - Self-employment                                                                                                            380 non-null    object 
 6   Contributing Factors. Ill Health                                                                                                                               380 non-null    object 
 7   Contributing Factors. Maternity/Family                                                                                                                         380 non-null    object 
 8   Contributing Factors. Dissatisfaction                                                                                                                          380 non-null    object 
 9   Contributing Factors. Job Dissatisfaction                                                                                                                      380 non-null    object 
 10  Contributing Factors. Interpersonal Conflict                                                                                                                   380 non-null    object 
 11  Contributing Factors. Study                                                                                                                                    380 non-null    object 
 12  Contributing Factors. Travel                                                                                                                                   380 non-null    object 
 13  Contributing Factors. Other                                                                                                                                    380 non-null    object 
 14  Contributing Factors. NONE                                                                                                                                     380 non-null    object 
 15  InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction                                                                         583 non-null    object 
 16  InstituteViews. Topic:2. I was given access to skills training to help me do my job better                                                                     589 non-null    object 
 17  InstituteViews. Topic:3. I was given adequate opportunities for personal development                                                                           586 non-null    object 
 18  InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL%                                                            584 non-null    object 
 19  InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had                                                                591 non-null    object 
 20  InstituteViews. Topic:6. The organisation recognised when staff did good work                                                                                  583 non-null    object 
 21  InstituteViews. Topic:7. Management was generally supportive of me                                                                                             590 non-null    object 
 22  InstituteViews. Topic:8. Management was generally supportive of my team                                                                                        586 non-null    object 
 23  InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me                                                          588 non-null    object 
 24  InstituteViews. Topic:10. Staff morale was positive within the Institute                                                                                       579 non-null    object 
 25  InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly                                                                                 579 non-null    object 
 26  InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently                                                                             575 non-null    object 
 27  InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly                                                                              579 non-null    object 
 28  WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit                                                587 non-null    object 
 29  WorkUnitViews. Topic:15. I worked well with my colleagues                                                                                                      583 non-null    object 
 30  WorkUnitViews. Topic:16. My job was challenging and interesting                                                                                                585 non-null    object 
 31  WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work                                                                        588 non-null    object 
 32  WorkUnitViews. Topic:18. I had sufficient contact with other people in my job                                                                                  591 non-null    object 
 33  WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job                                                   587 non-null    object 
 34  WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job                                                                               587 non-null    object 
 35  WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT]  586 non-null    object 
 36  WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job                                                                            586 non-null    object 
 37  WorkUnitViews. Topic:23. My job provided sufficient variety                                                                                                    590 non-null    object 
 38  WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job                                                                    588 non-null    object 
 39  WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction                                                        589 non-null    object 
 40  WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance                                                    584 non-null    object 
 41  WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area                       588 non-null    object 
 42  WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date  587 non-null    object 
 43  WorkUnitViews. Topic:29. There was adequate communication between staff in my unit                                                                             580 non-null    object 
 44  WorkUnitViews. Topic:30. Staff morale was positive within my work unit                                                                                         584 non-null    object 
 45  Gender. What is your Gender?                                                                                                                                   596 non-null    object 
 46  CurrentAge. Current Age                                                                                                                                        596 non-null    object 
 47  Employment Type. Employment Type                                                                                                                               596 non-null    object 
 48  LengthofServiceOverall. Overall Length of Service at Institute (in years)                                                                                      596 non-null    object 
 49  LengthofServiceCurrent. Length of Service at current workplace (in years)                                                                                      596 non-null    object 
dtypes: float64(2), object(48)
memory usage: 237.5+ KB

In [22]:

tafe.head().style;

In [23]:

tafe.iloc[:,2].value_counts(dropna=False).sort_index();

At this time I am not interested in:

any information regarding the induction
information grouped as "Workplace" are not present in the DETE dataset and therefore not relevant
specifics on the job classification, work area etc.
"Main factor" is also not of interest

The respective columns are therefore removed from the dataset.
I have excluded these columns above the initial investigative procedure immediately after the file import to avoid duplicating the investigative code. +++++

Noteworthy: We are lucky and can infer the meaning of some abbreviated responses to items in the DETE survey from the TAFE survey.

+++++

All columns have been investigated using "value_counts"-method to see if there are any unexpected values.

columns for contributing factors: have "-" values and the name of the column instead of boolean values
"Current Age": coded in similar age-buckets as the DETE dataset but needs some cleaning
"Length of Service": information join and exit dates is represented in different way compared to DETE dataset

General note: The TAFE-dataset includes more features. Column headers between the two datasets need to be compared to see which features have the same meaning.

I will sort the dataset by "Reason for ceasing employment" and "Current Age" and visualize the occurrence of Nan-Values to maybe see a pattern in the way Nan-Values are distributed. Going through various iterations of the plot, there seems to be a cluster of missing data for:

Nan-Values in "Current Age" have almost no info on "WorkUnitView"-items - these rows will be dropped above after the initial load of the data, hence the the will not be visible in the heatmap below.

In [24]:

fig, ax = plt.subplots(figsize=(5, 10))
ax = sns.heatmap(tafe.set_index(
    "CurrentAge. Current Age").sort_index().isnull(), cbar=False)
#black==non-null, light==null-value

2. Cleaning: Let's make the information usable¶

2.0 What needs to be done and why?¶

Quick stock keeping, to check if we have sufficient data to perform analysis required to answer our questions (see Section 0.2):

Are employees with relatively short tenures resigning more often due to some dissatisfaction?
- Length of tenure with institute: data available, needs to be harmonized across the two data sets.
- Dissatisfaction: data available, needs to be cleaned a bit for TAFE
- DATA AVAILABLE
Are young employees resigning more often due to some kind of dissatisfaction?
- Age: data is generally available, needs to be harmonized across datasets
- Dissatisfaction: see above
- DATA AVAILABLE
Are employees more likely to be dissatisfied and resign, if they have been in their job positions for a short time?
- Length of tenure in position / role: data available, needs to be harmonized across the two data sets.
- Dissatisfaction: see above
- DATA AVAILABLE
Which HR and company practices may to be improved to potentially mitigating employee exits?
- HR / company practices: Feedback on environment available. Due to non-standard column headers it is very difficult to see if the same features are present in both datasets. The Likert-type scale responses also need to be decoded for DETE. I will work on this part first, to see if it is possible to conduct the analysis.
- NOT SURE ABOUT QUALITY YET

Cleaning to dos going forward:

Perform matching of features on work environment, decode and quantify responses.
(see section 2.1)
Harmonize "Contributing Factors" across both datasets and clean column headers in both datasets.
(see section 2.2)
Merge datasets
(see section 2.3)
Clean and harmonize information on age, institute tenure, job tenure and separation type across both datasets.
(see section 2.4)
Deal with Nan-values

(see section 2.5)

2.1 Quantify and harmonize feedback on work environment¶

We have two datasets with overlapping sets of features evaluating the work environment. Ideally we have as many features on as many employees as possible. There might however be issues if a feature is not present in both datasets. As a general goal, we only want to keep features that are present in both datasets to maximize the number of observations, thus:

Goal: maximize length of dataset

This may however require us to discard features. Since our goal in this cases (see Q4 under Section 0.2) is to identify workplace issues, we my discard info that identifies workplace issues in one of the two institutes. Thus:

Goal: identify any relevant features that were discarded because of 1. goal

To see if a feature is relevant we will convert the categorical likert-type scale (Strongly Agree / neutral / disagree etc.) into a interval scale (scientifically not a very sound approach) to see if employees were on average to a significant share unhappy with any of the discarded items.

The workflow for this cleaning to do is thus:

Decode and quantify values in both datasets
Give matching features the same name in both datasets (prepare data to achieve Goal 1)
Check which unmatched features are relevant before (analyze data to achieve Goal 2)

2.1.1 Decode and quantify¶

In [25]:

dete_env_col = dete.loc[:,
                        "Professional Development":"Health & Safety"].columns
pd.melt(dete, value_vars=dete_env_col)["value"].value_counts(dropna=False)
# remove colon in line above to see output

Out[25]:

A      7540
SA     3307
N      2952
D      1518
SD     1087
NaN     537
M       321
Name: value, dtype: int64

In [26]:

tafe_env_col = tafe.iloc[:, 15:44].columns
pd.melt(tafe, value_vars=tafe_env_col)["value"].value_counts(dropna=False);

Questions are phrased positively like "I was given adequate opportunities for personal development", agreement is thus a positive / favorable outcome. Scales are set accordingly.

In [27]:

dete_env_quant = {"SA": 2,
                 "A": 1,
                 "N": 0,
                 "D": -1,
                 "SD": -2,
                 "M":np.nan} # assuming that M means 'not applicable' which is 
                             # equivalent to not answering the question for the purpose of this analysis

tafe_env_quant = {"Strongly Agree": 2,
                 "Agree": 1,
                 "Neutral": 0,
                 "Disagree": -1,
                 "Strongly Disagree": -2,
                 "Not Applicable":np.nan} # see above

dete_up1 = dete.replace(dete_env_quant).copy()
tafe_up1 = tafe.replace(tafe_env_quant).copy()

2.1.2 Match features based on column headers and analyse unmatched features¶

In [28]:

dete.columns.to_list()
# matching of column headers done in spreadsheet after they were both printed as lists

# replacing long TAFE headers with short DETE headers
tafe_lookup_env = {"InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL%": 'Opportunities for promotion',
                   'InstituteViews. Topic:10. Staff morale was positive within the Institute': 'Staff morale',
                   'WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area': 'Physical environment',
                   'WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction': 'Worklife balance',
                   'InstituteViews. Topic:7. Management was generally supportive of me': 'Stress and pressure support',
                   'WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit': 'Performance of supervisor',
                   'WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job': 'Peer support',
                   'WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work': 'Initiative',
                   'WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job': 'Skills',
                   'WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date': 'Coach',
                   'WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance': 'Feedback',
                   'WorkUnitViews. Topic:29. There was adequate communication between staff in my unit': 'Communication',
                   'InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me': 'Kept informed'}
tafe_up2 = tafe_up1.rename(columns=tafe_lookup_env, inplace=False).copy()

# selecting only the columns with info on work environment
dete_to_merge = tafe_up2.iloc[:, 15:45].copy()
tafe_to_merge = dete_up1.loc[:,
                             "Professional Development":"Health & Safety"].copy()

exit_merged_env = pd.concat([dete_to_merge, tafe_to_merge], ignore_index=True)

# filtering only for columns which are present in DETE and TAFE-dataset
common_features = exit_merged_env.dropna(thresh=850, axis=1).columns
# thresh = 850 because dete dataset alone has 822 records

# getting a dataframe that includes all columns that are only present in one of the two datasets DETE or TAFE
unmatched_env = exit_merged_env.loc[:, ~
                                    exit_merged_env.columns.isin(common_features)]

In [29]:

unmatched_env.quantile([.1,.25,.5,.75]).astype(int).transpose().sort_values(by=[0.25,0.5]).head(10)

Out[29]:

	0.10	0.25	0.50	0.75
InstituteViews. Topic:6. The organisation recognised when staff did good work	-2	-1	0	1
InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly	-2	-1	0	1
InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently	-2	-1	0	1
InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had	-2	-1	1	1
Wellness programs	-1	0	0	1
InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction	-1	0	1	1
InstituteViews. Topic:2. I was given access to skills training to help me do my job better	-1	0	1	1
InstituteViews. Topic:3. I was given adequate opportunities for personal development	-1	0	1	1
InstituteViews. Topic:8. Management was generally supportive of my team	-1	0	1	1
InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly	-1	0	1	1

In [1]:

unmatched_env.

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-111192713002> in <module>
----> 1 unmatched_env

NameError: name 'unmatched_env' is not defined

Partial Answer Q4:

In the features to be discarded, TAFE employees (long format questions) were most unhappy with:

the efficiency and speed in which workplace issues were addressed
recognition of good work
salary (compared to responsibility)

DETE employees (short format questions) were mainly unhappy with:

"Wellness programs"

In [30]:

# now that we have analyzed these columns, we can drop them from the dataframes
tafe_up2.drop(columns=unmatched_env.columns, inplace=True, errors="ignore")
dete_up1.drop(columns=unmatched_env.columns, inplace=True, errors="ignore")

2.2 Harmonize contributing factors across both datasets and rename columns¶

Contributing factors in TAFE dataset need to be formated like DETE dataset - booleans instead of strings.

In [31]:

tafe_dissat_col = tafe_up2.columns[tafe_up2.columns.str.contains(
    "Contributing Factors")]


def update_vals(val):
    if pd.isnull(val):
        return np.nan
    if val == "-":
        return False
    else:
        return True


tafe_up2[tafe_dissat_col] = tafe_up2[tafe_dissat_col].applymap(update_vals)

Now the column headers need to be harmonized.

In [32]:

dete.loc[:, "Career move to public sector":"None of the above"].mean(
).sort_values(ascending=False).head(10)

Out[32]:

None of the above                      0.263990
Work life balance                      0.263990
Ill Health                             0.136253
Job dissatisfaction                    0.108273
Workload                               0.105839
Career move to private sector          0.097324
Relocation                             0.082725
Maternity/family                       0.075426
Dissatisfaction with the department    0.074209
Lack of recognition                    0.069343
dtype: float64

DETE has more contributing factors listed on the dataset than TAFE. Out of these, poor work-life-balance and workload are often cited as reasons for exiting the job. Because they are such an important factor for a an employees exit, I will group these two and keep them in the dataset - although they are not included in the TAFE dataset.

In [33]:

dete_col_rename_cont = ['Physical work environment',
                        'Lack of recognition',
                        'Lack of job security',
                        'Work location',
                        'Employment conditions',
                        'Relocation',
                        'Traumatic incident']

dete_col_rename_work =  ['Work life balance', 
                         'Workload']

# I decided to combine the information into two single column and drop the detailed columns
dete_up1["Other factor"] = dete_up1.loc[:,dete_col_rename_cont].any(axis=1, skipna=False)
dete_up1["Too much work"] = dete_up1.loc[:,dete_col_rename_work].any(axis=1, skipna=False)

dete_up2 = dete_up1.drop(columns=dete_col_rename_cont+dete_col_rename_work)

In [34]:

# TAFE in general has less features describing the contributing factors. 
# They will be renamed to align with the DETE dataset.

tafe_col_rename_cont = {'Contributing Factors. Career Move - Public Sector ': 'Career move to public sector',
                        'Contributing Factors. Career Move - Private Sector ': 'Career move to private sector',
                        'Contributing Factors. Interpersonal Conflict': 'Interpersonal conflicts',
                        'Contributing Factors. Job Dissatisfaction': 'Job dissatisfaction',
                        'Contributing Factors. Dissatisfaction': 'Dissatisfaction with the department',
                        'Contributing Factors. Maternity/Family': 'Maternity/family',
                        'Contributing Factors. Ill Health': 'Ill Health',
                        'Contributing Factors. NONE': 'None of the above'}
tafe_up3 = tafe_up2.rename(columns=tafe_col_rename_cont,inplace=False).copy()

# The information from four columns needs to be aggregated into two columns to match with the DETE dataset 
tafe_col_rename_ts = ['Contributing Factors. Travel','Contributing Factors. Study']
tafe_up3['Study/Travel'] = tafe_up3.loc[:,tafe_col_rename_ts].any(axis=1, skipna=False)

tafe_col_rename_other = ['Contributing Factors. Career Move - Self-employment','Contributing Factors. Other']
tafe_up3['Other factor'] = tafe_up3.loc[:,tafe_col_rename_ts].any(axis=1, skipna=False)

tafe_up4 = tafe_up3.drop(columns=tafe_col_rename_ts+tafe_col_rename_other)

# Quick heatmap, to to check if the any()-command worked as expected 
test_subset = tafe_up3.loc[:,tafe_col_rename_ts+tafe_col_rename_other+['Study/Travel','Other factor']]
fig, ax = plt.subplots(figsize=(3,5))
ax = sns.heatmap(test_subset.isnull(),cbar=False)

# bar code look means success ;-)

In [35]:

# Now let's also match the column headers of the other relevant features.
# Again we will match column headers of TAFE to DETE as the latter are better formatted
# Thankfully, dataquest already provided a dictionary for this

tafe_col_rename_gen = {'Record ID': 'id',
                   'CESSATION YEAR': 'cease_date',
                   'Reason for ceasing employment': 'separationtype',
                   'Gender. What is your Gender?': 'gender',
                   'CurrentAge. Current Age': 'age',
                   'Employment Type. Employment Type': 'employment_status',
                   'Classification. Classification': 'position',
                   'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
                   'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}

tafe_up5 = tafe_up4.rename(columns=tafe_col_rename_gen, inplace=False).copy()

In [36]:

# formatted dete and tafe column headers according to dataquest
dete_up2.columns = dete_up2.columns.str.lower().str.strip().str.replace(" ","_")
tafe_up5.columns = tafe_up5.columns.str.lower().str.strip().str.replace(" ","_")

2.3 Merge the two datasets¶

In [37]:

tafe_up5["institute"] = "TAFE"
dete_up2["institute"] = "DETE"

In [38]:

exit_combined = pd.concat([tafe_up5,dete_up2],ignore_index=True)

exit_combined.info();

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1418 entries, 0 to 1417
Data columns (total 35 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   id                                   1418 non-null   float64
 1   cease_date                           1382 non-null   object 
 2   separationtype                       1418 non-null   object 
 3   career_move_to_public_sector         1202 non-null   object 
 4   career_move_to_private_sector        1202 non-null   object 
 5   ill_health                           1202 non-null   object 
 6   maternity/family                     1202 non-null   object 
 7   dissatisfaction_with_the_department  1202 non-null   object 
 8   job_dissatisfaction                  1202 non-null   object 
 9   interpersonal_conflicts              1202 non-null   object 
 10  none_of_the_above                    1202 non-null   object 
 11  opportunities_for_promotion          1249 non-null   float64
 12  stress_and_pressure_support          1384 non-null   float64
 13  kept_informed                        1389 non-null   float64
 14  staff_morale                         1378 non-null   float64
 15  performance_of_supervisor            1383 non-null   float64
 16  initiative                           1385 non-null   float64
 17  peer_support                         1385 non-null   float64
 18  skills                               1384 non-null   float64
 19  worklife_balance                     1389 non-null   float64
 20  feedback                             1358 non-null   float64
 21  physical_environment                 1388 non-null   float64
 22  coach                                1293 non-null   float64
 23  communication                        1382 non-null   float64
 24  gender                               1394 non-null   object 
 25  age                                  1407 non-null   object 
 26  employment_status                    1413 non-null   object 
 27  institute_service                    596 non-null    object 
 28  role_service                         596 non-null    object 
 29  study/travel                         1202 non-null   object 
 30  other_factor                         1202 non-null   object 
 31  institute                            1418 non-null   object 
 32  dete_start_date                      749 non-null    float64
 33  role_start_date                      724 non-null    float64
 34  too_much_work                        822 non-null    object 
dtypes: float64(16), object(19)
memory usage: 387.9+ KB

2.4 Clean data on separation type, age and tenure¶

2.4.1 Harmonize categories for separation type¶

In [39]:

#separation type is overly specific for the analysis and does not align across both data sets

def sep_types(element):
    if "Resignation" in element:
        return "Resignation"
    if "Retirement" in element:
        return "Retirement"
    invol = ["Retrenchment/ Redundancy", "Termination","Contract Expired"]
    if element in invol:
        return "Involuntary"
    else:
        return "Other"

exit_combined["sep_cat"] = exit_combined["separationtype"].apply(sep_types)
exit_combined.groupby(["sep_cat","separationtype"])["id"].count();

In [40]:

temp_pivot = exit_combined.pivot_table(index="sep_cat",columns="institute",values="id",aggfunc="count")
temp_pivot.plot(kind="bar",stacked=False,title="Number of exits by kind of separation")
plt.show()

2.4.2 Harmonize categories for age¶

In [41]:

exit_combined["age"] = exit_combined["age"].str.replace("  ","-") # TAFE had double spaces instead of "-"
age_replace_old = {"56-60": "56 or older", # TAFE had additional age-group
                  "61 or older": "56 or older"}
exit_combined["age"] = exit_combined["age"].replace(age_replace_old)

temp_pivot = exit_combined.pivot_table(index="age",columns="institute",values="id",aggfunc="count")
temp_pivot.plot(kind="bar",stacked=False,title="Number of exits by age group")
plt.show()

2.4.3 Correct cease date¶

In [42]:

exit_combined["cease_date"].value_counts(dropna=False).tail();

In [43]:

# some cease dates are strings and formated as MM/YYYY. We will remove the months as it is not required
resignation_cease_corr = exit_combined.loc[:,("cease_date")].copy()
exit_combined_up1 = exit_combined.copy()
exit_combined_up1["cease_date"] = resignation_cease_corr.astype(str).str.replace(r"[0-9]{2}/","").astype(float)

temp_pivot = exit_combined_up1.pivot_table(index="cease_date",columns="institute",values="id",aggfunc="count")
temp_pivot.plot(kind="bar",stacked=False,title="Number of exits by cease date")
plt.show()

2.4.4 Calculate institute and role tenure¶

In [44]:

exit_combined_up1.tail().style;

In [45]:

#calculating tenure for DETE
exit_combined_up1.loc[exit_combined_up1["institute"]=="DETE","institute_service"] = exit_combined_up1["cease_date"] - exit_combined_up1["dete_start_date"]
exit_combined_up1.loc[exit_combined_up1["institute"]=="DETE","role_service"] = exit_combined_up1["cease_date"] - exit_combined_up1["role_start_date"]
exit_combined_up2 = exit_combined_up1.drop(columns=["cease_date","dete_start_date","role_start_date"]).copy()

In [46]:

exit_combined_up2[exit_combined_up2["institute"]=="TAFE"]["role_service"].value_counts();

For TAFE role service and institute service are coded in groups. I will assign the average length using dictionaries as this seems more quicker than regex given the very limited number of groups.

In [47]:

tafe_avg_tenure = {"Less than 1 year": 0.5,
                   "1-2": 1.5,
                   "3-4": 3.5,
                   "5-6": 5.5,
                   "7-10": 8.5,
                   "11-20": 15.5,
                   "More than 20 years": 25} # simple assumption

exit_combined_up3 = exit_combined_up2.replace(tafe_avg_tenure).copy()

Institute tenure and role tenure now look like continues variables but they are in fact a mix of precise years and presumed averages.
This is fine, because I am not really interested in the exact number of years.
We'll rather base our analysis on this article which makes the argument that understanding employee's needs according to career stage instead of age is more effective.

We'll use the slightly modified definitions below:

For institute tenure: - New: Less than 3 years - Experienced: 3-6 years - Well established: 7-10 years - Veteran: 11 or more years

For role tenure: - Recent promotion: Less than 3 years - Familiar role: 3-6 years - Mature role: 7 or more years

In [48]:

def inst_ten_cat(element):
    if pd.isnull(element):
        return np.nan
    if element < 3:
        return "1 New (>3 years)"
    if element < 7:
        return "2 Experienced (3-6 years)"
    if element < 11: 
        return "3 Established (7-10 years)"
    if element >= 10:
        return "4 Veteran (<10 years)"
    else:
        return Error
    
def role_ten_cat(element):
    if pd.isnull(element):
        return np.nan
    if element < 3:
        return "1 Recent hire/promotion (>3 years)"
    if element < 7:
        return "2 Familiar role (3-6 years)"
    if element >= 7: 
        return "3 Mature role (>6 years)"
    else:
        return Error

inst_ten_categories = exit_combined_up3["institute_service"].apply(inst_ten_cat)
role_ten_categories = exit_combined_up3["role_service"].apply(role_ten_cat)
    
exit_combined_up3["institute_tenure_cat"] = inst_ten_categories
exit_combined_up3["role_tenure_cat"] = role_ten_categories

exit_combined_up4 = exit_combined_up3.drop(columns=["role_service","institute_service"])

In [49]:

temp_pivot = exit_combined_up4.pivot_table(index="institute_tenure_cat",columns="institute",values="id",aggfunc="count")
temp_pivot.plot(kind="bar",stacked=False,title="Number of exits by length of tenure at institute")
plt.show()

In [50]:

temp_pivot = exit_combined_up4.pivot_table(index="role_tenure_cat",columns="institute",values="id",aggfunc="count")
temp_pivot.plot(kind="bar",stacked=False,title="Number of exits by length of tenure in role")
plt.show()

2.5 Dealing with Nan-values¶

I will quickly rearrange the columns before we proceed. Clustering the columns based on the type of information they include (employee info, contributing factors, rating of work environment) makes it easier to see where clusters of missing info are.

In [51]:

# get columns names in reuseable format
exit_combined_up4.columns.tolist();

In [52]:

new_sorting = ['id',
 'sep_cat',
 'separationtype',
 'institute',
 'gender',
 'age',
 'employment_status',
 'institute_tenure_cat',
 'role_tenure_cat',
 'career_move_to_public_sector',
 'career_move_to_private_sector',
 'ill_health',
 'maternity/family',
 'dissatisfaction_with_the_department',
 'job_dissatisfaction',
 'interpersonal_conflicts',
 'too_much_work',
 'study/travel',
 'other_factor',
 'none_of_the_above',
 'opportunities_for_promotion',
 'stress_and_pressure_support',
 'kept_informed',
 'staff_morale',
 'performance_of_supervisor',
 'initiative',
 'peer_support',
 'skills',
 'worklife_balance',
 'feedback',
 'physical_environment',
 'coach',
 'communication']

exit_combined_up5 = exit_combined_up4.loc[:,new_sorting].copy()
exit_combined_up5.head().style

Out[52]:

	id	sep_cat	separationtype	institute	gender	age	employment_status	institute_tenure_cat	role_tenure_cat	career_move_to_public_sector	career_move_to_private_sector	ill_health	maternity/family	dissatisfaction_with_the_department	job_dissatisfaction	interpersonal_conflicts	too_much_work	study/travel	other_factor	none_of_the_above	opportunities_for_promotion	stress_and_pressure_support	kept_informed	staff_morale	performance_of_supervisor	initiative	peer_support	skills	worklife_balance	feedback	physical_environment	coach	communication
0	634133009996093952.000000	Involuntary	Contract Expired	TAFE	Female	26-30	Temporary Full-time	1 New (>3 years)	1 Recent hire/promotion (>3 years)	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	0.000000	1.000000	1.000000	1.000000	1.000000	2.000000	1.000000	1.000000	1.000000	1.000000	1.000000	0.000000	1.000000
1	634146578511788032.000000	Resignation	Resignation	TAFE	Male	41-45	Permanent Full-time	2 Experienced (3-6 years)	2 Familiar role (3-6 years)	False	True	False	False	False	False	False	nan	False	False	False	1.000000	2.000000	1.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000
2	634147506906311040.000000	Resignation	Resignation	TAFE	Female	56 or older	Contract/casual	3 Established (7-10 years)	3 Mature role (>6 years)	False	False	False	False	False	False	False	nan	False	False	False	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	-1.000000	0.000000	-1.000000	0.000000	0.000000
3	634152007975693952.000000	Resignation	Resignation	TAFE	Male	20 or younger	Temporary Full-time	2 Experienced (3-6 years)	2 Familiar role (3-6 years)	False	True	False	True	False	False	False	nan	False	False	False	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000
4	634153745310374016.000000	Resignation	Resignation	TAFE	Male	46-50	Permanent Full-time	2 Experienced (3-6 years)	2 Familiar role (3-6 years)	False	False	False	False	False	False	False	nan	False	False	False	-1.000000	1.000000	1.000000	-2.000000	-1.000000	1.000000	-1.000000	-1.000000	-1.000000	0.000000	0.000000	0.000000	-2.000000

In [53]:

#fix missing gender information
exit_combined_up5.loc[exit_combined_up5["gender"].isna(),"gender"] = "Unknown" 

#drop records with missing employment status as other information was also missing for these employees
exit_combined_up6 = exit_combined_up5.dropna(subset=["employment_status"],inplace=False).copy()

In [54]:

fig, ax = plt.subplots(figsize=(10,10))
ax = sns.heatmap(exit_combined_up6.set_index(["institute","sep_cat","age",]).sort_index().isnull(),cbar=False,
                )
#dark==non-null, light==null-value

I will not drop any further records before doing the analysis because there is no record that is expandable for every question I want to answer.

In [55]:

exit_clean = exit_combined_up6.copy()

3. Analyzing: Getting the answers we are looking for¶

As a reminder, the questions are:

Are employees with relatively short tenures resigning more often due to some dissatisfaction?
Are young employees resigning more often due to some kind of dissatisfaction?
Are employees more likely to be dissatisfied and resign, if they have been in their job positions for a short time?
Which HR and company practices may to be improved to potentially mitigating employee exits?

In [57]:

exit_clean.sample(2).style

Out[57]:

	id	sep_cat	separationtype	institute	gender	age	employment_status	institute_tenure_cat	role_tenure_cat	career_move_to_public_sector	career_move_to_private_sector	ill_health	maternity/family	dissatisfaction_with_the_department	job_dissatisfaction	interpersonal_conflicts	too_much_work	study/travel	other_factor	none_of_the_above	opportunities_for_promotion	stress_and_pressure_support	kept_informed	staff_morale	performance_of_supervisor	initiative	peer_support	skills	worklife_balance	feedback	physical_environment	coach	communication
1027	432.000000	Retirement	Age Retirement	DETE	Male	56 or older	Permanent Part-time	4 Veteran (<10 years)	3 Mature role (>6 years)	False	False	False	False	False	False	False	False	False	False	True	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000
188	634508607117811968.000000	Resignation	Resignation	TAFE	Male	21-25	Permanent Full-time	2 Experienced (3-6 years)	2 Familiar role (3-6 years)	False	False	False	False	False	False	False	nan	True	True	False	0.000000	2.000000	-1.000000	0.000000	1.000000	1.000000	1.000000	-1.000000	1.000000	1.000000	1.000000	1.000000	0.000000

3.1 Is length of tenure with the institute correlated with job dissatisfaction?¶

I assume that these coded responses are indicative of dissatisfaction that could potentially be mitigated through actions of the employer:

'dissatisfaction_with_the_department',
'job_dissatisfaction',
'interpersonal_conflicts',
'too_much_work',

These mitigating factors are however not:

'study/travel',
'career_move_to_public_sector',
'career_move_to_private_sector',
'ill_health',
'maternity/family',
'other_factor',
'none_of_the_above'

For this part of the analysis I will drop all TAFE-employees whose exit was involuntary. For the remaining tafe employees I will assume, that they were not dissatisfied because of too much work (information not available). For TAFE-employees the missing values will thus not count towards the aggregate dissatisfaction flag.

In [58]:

exit_clean_dis = exit_clean.copy()
exit_clean_dis = exit_clean_dis[~((exit_clean_dis["sep_cat"] == "Involuntary") & (exit_clean_dis["institute"] == "TAFE"))]
print("{} rows dropped".format(np.subtract(exit_clean.shape,exit_clean_dis.shape)[0]))

exit_clean_dis["too_much_work"].fillna(False,inplace=True)
exit_clean_dis["too_much_work"].value_counts(dropna=False)

exit_clean_dis.loc[:,"career_move_to_public_sector":"none_of_the_above"] = exit_clean_dis.loc[:,"career_move_to_public_sector":"none_of_the_above"].astype(bool)

216 rows dropped

In [59]:

dissatisfied_col = ['dissatisfaction_with_the_department',
                    'job_dissatisfaction',
                    'interpersonal_conflicts',
                    'too_much_work']

exit_clean_dis["dissatisfied"] = exit_clean_dis.loc[:,dissatisfied_col].any(axis=1,skipna=True)

Let's see if there are any unexpected patterns in the data, before we analyze the what impact tenure has on dissatisfaction.

In [60]:

temp_pivot = exit_clean_dis.pivot_table(index=["sep_cat"], columns=["institute"],values="dissatisfied")
temp_pivot.plot(kind="bar", title="Average share of dissatisfied exits by reason for exit")
plt.legend(loc='center left',bbox_to_anchor=(1.0, 0.5))
plt.show()

To my surprise, DETE employees whose exit was involuntary reported below average dissatisfaction levels.
On the other hand DETE retirees reported an unexpected above-average dissatisfaction with their job.

In [61]:

temp_subset = exit_clean_dis.loc[(exit_clean_dis["institute"]=="DETE")&(exit_clean_dis["sep_cat"]=="Retirement"),
                                ['dissatisfaction_with_the_department',
                                 'job_dissatisfaction',
                                 'interpersonal_conflicts',
                                 'too_much_work',
                                 'separationtype']]
temp_subset.groupby("separationtype").mean().plot(kind="bar",title="Average share of dissatisfied exits by type of dissatisfaction, retirees only")
plt.legend(loc='center left',bbox_to_anchor=(1.0, 0.5))
plt.show()

Retirees at DETE are dissatisfied with their job because of too much work. Going forward we will only analyze resignations.

In [62]:

exit_clean_dis_res = exit_clean_dis.copy()
exit_clean_dis_res = exit_clean_dis_res.loc[exit_clean_dis_res["sep_cat"]=="Resignation",:]

tem_pivot = exit_clean_dis_res.pivot_table(index=["institute_tenure_cat"], columns=["institute"],values="dissatisfied")
tem_pivot.plot(kind="bar",title="Average share of dissatisfied exits by institute tenure, resigning employees only")
plt.legend(loc='center left',bbox_to_anchor=(1.0, 0.5))
plt.show()

DETE exits who resigned were more often dissatisfied when they were at the institute for a longer time. This is not the case for TAFE employees.

3.2 Are young employees resigning more often due to some kind of dissatisfaction?¶

In [63]:

temp_pivot = exit_clean_dis_res.pivot_table(index=["age"], columns=["institute"], values="dissatisfied")
temp_pivot.plot(kind="bar", title="Average share of dissatisfied exits by age, resigning employees only")
plt.legend(loc='center left',bbox_to_anchor=(1.0, 0.5))
plt.show()

At DETE young professional between the ages of 26 and 35 are more dissatisfied than middle-aged exits from 36 to 50.

3.3 Are employees more likely to be dissatisfied and resign, if they have been in their job positions for a short time?¶

In [64]:

temp_pivot = exit_clean_dis_res.pivot_table(index=["role_tenure_cat"], columns=["institute"], values="dissatisfied")
temp_pivot.plot(kind="bar", title="Average share of dissatisfied exits by role tenure, resigning employees only")
plt.legend(loc='center left',bbox_to_anchor=(1.0, 0.5))
plt.show()

In general employees who are familiar with their jobs are less likely to leave dissatisfied than those that are new in their positions or have been with in their position for a long time.

3.3 Bonus: Let's look at the intersect of categories analysed in 3.1-3.3¶

In [65]:

temp_pivot = exit_clean_dis_res.pivot_table(index=["role_tenure_cat"], columns=["age"], values="dissatisfied", aggfunc=["mean","count"])

temp_pivot["mean"].plot(kind="bar",title="Average share of dissatisfied exits by role tenure and age, resigning employees only")
plt.legend(loc='center left',bbox_to_anchor=(1.0, 0.5))
plt.show()

In [66]:

temp_pivot = exit_clean_dis_res.pivot_table(index=["institute_tenure_cat"], columns=["age"], values="dissatisfied", aggfunc=["mean","count"])

temp_pivot["mean"].plot(kind="bar",title="Average share of dissatisfied exits by institute tenure and age, resigning employees only")
plt.legend(loc='center left',bbox_to_anchor=(1.0, 0.5))
plt.show()

Young professionals (31-35) that - a) not recently progressed to a new role or - b) are with the institute for a long time are dissatisfied above average.

3.4 Which HR and company practices may to be improved to potentially mitigating employee exits?¶

In [67]:

col_drop_for_env = exit_clean.loc[:,"career_move_to_public_sector":"none_of_the_above"].columns
env_col = exit_clean.loc[:,"opportunities_for_promotion":"communication"].columns

exit_clean_env = exit_clean.drop(columns=col_drop_for_env).drop(columns="id")

temp_table = exit_clean_env.groupby("institute").mean().transpose().sort_values(by="DETE")
temp_table.plot(kind="bar", title="Average happiness with work environment (more positive = better), by institute")
plt.legend(loc='center left',bbox_to_anchor=(1.0, 0.5))
plt.show()

It looks like TAFE really needs to work on staff morale. Throw a nice office party! Problem solved.

3.4 Bonus: Are involuntary exits or retirees bitter about their work environment?¶

In [68]:

pd.melt(exit_clean_env, id_vars=["sep_cat","institute"], value_vars=env_col).groupby("sep_cat").mean().plot(kind="bar",legend=False)
plt.title("Average happiness with work environment (more positive = better), by type of exit")
plt.show()

Involuntary exit do not rate their work environment worse than the average exit, but retirees do.

3.4 Bonus: Are young or old people more happy about their work environment?¶

In [69]:

pd.melt(exit_clean_env, id_vars=["age","institute"], value_vars=env_col).groupby("age").mean().plot(kind="bar",legend=False)
plt.title("Average happiness with work environment (more positive = better), by age")
plt.show()

Young exits were more happy about their environment. Happiness with the work environment declines with age, but increases for the very old (56+ years) - maybe nostalgia?

Extra stuff I found while researching questions¶

In [73]:

#found this really helpful extension to explore the dataset

import qgrid

qgrid.show_grid(exit_clean, grid_options={'forceFitColumns': False, 'defaultColumnWidth': 100})

QgridWidget(grid_options={'fullWidthRows': True, 'syncColumnCellResize': True, 'forceFitColumns': False, 'defa…

And a way to make exploration and presentation more interactive. I will incorporate both methods them in future projects.