We will be working with exit surveys from employees of two public Australian institutes:
Both institutes used the same survey template and customized them to their own needs.
The dataset in general contains information on:
To answer questions posed by dataquest:
To answer questions I am interested in: 3. Are employees more likely to be dissatisfied and resign, if they have been in their job positions for a short time? 4. Which HR and company practices may to be improved to potentially mitigating employee exits?
The focus of this exercise is on data wrangling. It is based on a dataquest.io guided project.
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# na_values info provided by dataquest
dete = pd.read_csv("dete_survey.csv", na_values="Not Stated")
tafe = pd.read_csv("tafe_survey.csv")
# Some columns are not needed for the analysis and are thus dropped.
# I have excluded these columns above the initial investigative procedure
# and immediately after the file import to avoid duplicating the investigative code.
# first DETE, see rationale for exclusion under section 1.1.
dete_drop_col = ["Aboriginal",
"Torres Strait",
"South Sea",
"Disability",
"NESB",
"Position",
"Classification",
"Region",
"Business Unit"]
dete.drop(columns=dete_drop_col, inplace=True)
# second TAFE, see rationale under section 1.2
tafe_drop_col_stage1 = tafe.columns[tafe.columns.str.contains("Induction") |
tafe.columns.str.contains("Main Factor") |
tafe.columns.str.contains("Workplace")]
tafe.drop(columns=tafe_drop_col_stage1, inplace=True)
tafe_drop_col_stage2 = ["Classification. Classification",
"WorkArea",
"Institute"]
tafe.drop(columns=tafe_drop_col_stage2, inplace=True)
# .copy() to avoid potential "SettingwithCopyWarning"-ERROR
tafe = tafe[tafe["CurrentAge. Current Age"].notna()].copy()
dete.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 822 entries, 0 to 821 Data columns (total 47 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 822 non-null int64 1 SeparationType 822 non-null object 2 Cease Date 788 non-null object 3 DETE Start Date 749 non-null float64 4 Role Start Date 724 non-null float64 5 Employment Status 817 non-null object 6 Career move to public sector 822 non-null bool 7 Career move to private sector 822 non-null bool 8 Interpersonal conflicts 822 non-null bool 9 Job dissatisfaction 822 non-null bool 10 Dissatisfaction with the department 822 non-null bool 11 Physical work environment 822 non-null bool 12 Lack of recognition 822 non-null bool 13 Lack of job security 822 non-null bool 14 Work location 822 non-null bool 15 Employment conditions 822 non-null bool 16 Maternity/family 822 non-null bool 17 Relocation 822 non-null bool 18 Study/Travel 822 non-null bool 19 Ill Health 822 non-null bool 20 Traumatic incident 822 non-null bool 21 Work life balance 822 non-null bool 22 Workload 822 non-null bool 23 None of the above 822 non-null bool 24 Professional Development 808 non-null object 25 Opportunities for promotion 735 non-null object 26 Staff morale 816 non-null object 27 Workplace issue 788 non-null object 28 Physical environment 817 non-null object 29 Worklife balance 815 non-null object 30 Stress and pressure support 810 non-null object 31 Performance of supervisor 813 non-null object 32 Peer support 812 non-null object 33 Initiative 813 non-null object 34 Skills 811 non-null object 35 Coach 767 non-null object 36 Career Aspirations 746 non-null object 37 Feedback 792 non-null object 38 Further PD 768 non-null object 39 Communication 814 non-null object 40 My say 812 non-null object 41 Information 816 non-null object 42 Kept informed 813 non-null object 43 Wellness programs 766 non-null object 44 Health & Safety 793 non-null object 45 Gender 798 non-null object 46 Age 811 non-null object dtypes: bool(18), float64(2), int64(1), object(26) memory usage: 200.8+ KB
dete.head().style; # .style to show all columns, remove colon to see output
dete["SeparationType"].value_counts(dropna=False).sort_index();
At this time I am not interested in:
The respective columns are therefore removed from the dataset.
I have excluded these columns above the initial investigative procedure and immediately after the file import to avoid duplicating the investigative code.
+++++
Noteworthy: At this time the meaning of the values in columns "Professional Development" onward (A, N, D, etc.) is currently unknown. The material accompanying the dataset does not provide further details (see Contextual Information available on the website linked under section "0."). Because both DETE and TAFE used the same base template, I hope that the information can be inferred from the second data set (TAFE).
+++++
All non-boolean columns have been investigated using "value_counts"-method to see if there are any unexpected values.
All other columns look fine to me, aside from a few Nan-values. I will sort the dataset by "Separation Type" and visualize the occurrence of Nan-Values to maybe see a pattern in the way Nan-Values are distributed. Going through various iterations of the plot, there seems to be no cluster of missing values in a specific separation type or age group.
fig, ax = plt.subplots(figsize=(5,10))
ax = sns.heatmap(dete.set_index(["SeparationType","Age"]).sort_index().isnull(),cbar=False)
#black==non-null, light==null-value
tafe.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 596 entries, 0 to 701 Data columns (total 50 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Record ID 596 non-null float64 1 CESSATION YEAR 594 non-null float64 2 Reason for ceasing employment 596 non-null object 3 Contributing Factors. Career Move - Public Sector 380 non-null object 4 Contributing Factors. Career Move - Private Sector 380 non-null object 5 Contributing Factors. Career Move - Self-employment 380 non-null object 6 Contributing Factors. Ill Health 380 non-null object 7 Contributing Factors. Maternity/Family 380 non-null object 8 Contributing Factors. Dissatisfaction 380 non-null object 9 Contributing Factors. Job Dissatisfaction 380 non-null object 10 Contributing Factors. Interpersonal Conflict 380 non-null object 11 Contributing Factors. Study 380 non-null object 12 Contributing Factors. Travel 380 non-null object 13 Contributing Factors. Other 380 non-null object 14 Contributing Factors. NONE 380 non-null object 15 InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction 583 non-null object 16 InstituteViews. Topic:2. I was given access to skills training to help me do my job better 589 non-null object 17 InstituteViews. Topic:3. I was given adequate opportunities for personal development 586 non-null object 18 InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL% 584 non-null object 19 InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had 591 non-null object 20 InstituteViews. Topic:6. The organisation recognised when staff did good work 583 non-null object 21 InstituteViews. Topic:7. Management was generally supportive of me 590 non-null object 22 InstituteViews. Topic:8. Management was generally supportive of my team 586 non-null object 23 InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me 588 non-null object 24 InstituteViews. Topic:10. Staff morale was positive within the Institute 579 non-null object 25 InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly 579 non-null object 26 InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently 575 non-null object 27 InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly 579 non-null object 28 WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit 587 non-null object 29 WorkUnitViews. Topic:15. I worked well with my colleagues 583 non-null object 30 WorkUnitViews. Topic:16. My job was challenging and interesting 585 non-null object 31 WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work 588 non-null object 32 WorkUnitViews. Topic:18. I had sufficient contact with other people in my job 591 non-null object 33 WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job 587 non-null object 34 WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job 587 non-null object 35 WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT] 586 non-null object 36 WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job 586 non-null object 37 WorkUnitViews. Topic:23. My job provided sufficient variety 590 non-null object 38 WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job 588 non-null object 39 WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction 589 non-null object 40 WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance 584 non-null object 41 WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area 588 non-null object 42 WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date 587 non-null object 43 WorkUnitViews. Topic:29. There was adequate communication between staff in my unit 580 non-null object 44 WorkUnitViews. Topic:30. Staff morale was positive within my work unit 584 non-null object 45 Gender. What is your Gender? 596 non-null object 46 CurrentAge. Current Age 596 non-null object 47 Employment Type. Employment Type 596 non-null object 48 LengthofServiceOverall. Overall Length of Service at Institute (in years) 596 non-null object 49 LengthofServiceCurrent. Length of Service at current workplace (in years) 596 non-null object dtypes: float64(2), object(48) memory usage: 237.5+ KB
tafe.head().style;
tafe.iloc[:,2].value_counts(dropna=False).sort_index();
At this time I am not interested in:
The respective columns are therefore removed from the dataset.
I have excluded these columns above the initial investigative procedure immediately after the file import to avoid duplicating the investigative code.
+++++
Noteworthy: We are lucky and can infer the meaning of some abbreviated responses to items in the DETE survey from the TAFE survey.
+++++
All columns have been investigated using "value_counts"-method to see if there are any unexpected values.
General note: The TAFE-dataset includes more features. Column headers between the two datasets need to be compared to see which features have the same meaning.
I will sort the dataset by "Reason for ceasing employment" and "Current Age" and visualize the occurrence of Nan-Values to maybe see a pattern in the way Nan-Values are distributed. Going through various iterations of the plot, there seems to be a cluster of missing data for:
fig, ax = plt.subplots(figsize=(5, 10))
ax = sns.heatmap(tafe.set_index(
"CurrentAge. Current Age").sort_index().isnull(), cbar=False)
#black==non-null, light==null-value
Quick stock keeping, to check if we have sufficient data to perform analysis required to answer our questions (see Section 0.2):
Cleaning to dos going forward:
(see section 2.5)
We have two datasets with overlapping sets of features evaluating the work environment. Ideally we have as many features on as many employees as possible. There might however be issues if a feature is not present in both datasets. As a general goal, we only want to keep features that are present in both datasets to maximize the number of observations, thus:
This may however require us to discard features. Since our goal in this cases (see Q4 under Section 0.2) is to identify workplace issues, we my discard info that identifies workplace issues in one of the two institutes. Thus:
To see if a feature is relevant we will convert the categorical likert-type scale (Strongly Agree / neutral / disagree etc.) into a interval scale (scientifically not a very sound approach) to see if employees were on average to a significant share unhappy with any of the discarded items.
The workflow for this cleaning to do is thus:
dete_env_col = dete.loc[:,
"Professional Development":"Health & Safety"].columns
pd.melt(dete, value_vars=dete_env_col)["value"].value_counts(dropna=False)
# remove colon in line above to see output
A 7540 SA 3307 N 2952 D 1518 SD 1087 NaN 537 M 321 Name: value, dtype: int64
tafe_env_col = tafe.iloc[:, 15:44].columns
pd.melt(tafe, value_vars=tafe_env_col)["value"].value_counts(dropna=False);
Questions are phrased positively like "I was given adequate opportunities for personal development", agreement is thus a positive / favorable outcome. Scales are set accordingly.
dete_env_quant = {"SA": 2,
"A": 1,
"N": 0,
"D": -1,
"SD": -2,
"M":np.nan} # assuming that M means 'not applicable' which is
# equivalent to not answering the question for the purpose of this analysis
tafe_env_quant = {"Strongly Agree": 2,
"Agree": 1,
"Neutral": 0,
"Disagree": -1,
"Strongly Disagree": -2,
"Not Applicable":np.nan} # see above
dete_up1 = dete.replace(dete_env_quant).copy()
tafe_up1 = tafe.replace(tafe_env_quant).copy()
dete.columns.to_list()
# matching of column headers done in spreadsheet after they were both printed as lists
# replacing long TAFE headers with short DETE headers
tafe_lookup_env = {"InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL%": 'Opportunities for promotion',
'InstituteViews. Topic:10. Staff morale was positive within the Institute': 'Staff morale',
'WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area': 'Physical environment',
'WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction': 'Worklife balance',
'InstituteViews. Topic:7. Management was generally supportive of me': 'Stress and pressure support',
'WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit': 'Performance of supervisor',
'WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job': 'Peer support',
'WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work': 'Initiative',
'WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job': 'Skills',
'WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date': 'Coach',
'WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance': 'Feedback',
'WorkUnitViews. Topic:29. There was adequate communication between staff in my unit': 'Communication',
'InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me': 'Kept informed'}
tafe_up2 = tafe_up1.rename(columns=tafe_lookup_env, inplace=False).copy()
# selecting only the columns with info on work environment
dete_to_merge = tafe_up2.iloc[:, 15:45].copy()
tafe_to_merge = dete_up1.loc[:,
"Professional Development":"Health & Safety"].copy()
exit_merged_env = pd.concat([dete_to_merge, tafe_to_merge], ignore_index=True)
# filtering only for columns which are present in DETE and TAFE-dataset
common_features = exit_merged_env.dropna(thresh=850, axis=1).columns
# thresh = 850 because dete dataset alone has 822 records
# getting a dataframe that includes all columns that are only present in one of the two datasets DETE or TAFE
unmatched_env = exit_merged_env.loc[:, ~
exit_merged_env.columns.isin(common_features)]
unmatched_env.quantile([.1,.25,.5,.75]).astype(int).transpose().sort_values(by=[0.25,0.5]).head(10)
0.10 | 0.25 | 0.50 | 0.75 | |
---|---|---|---|---|
InstituteViews. Topic:6. The organisation recognised when staff did good work | -2 | -1 | 0 | 1 |
InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly | -2 | -1 | 0 | 1 |
InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently | -2 | -1 | 0 | 1 |
InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had | -2 | -1 | 1 | 1 |
Wellness programs | -1 | 0 | 0 | 1 |
InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction | -1 | 0 | 1 | 1 |
InstituteViews. Topic:2. I was given access to skills training to help me do my job better | -1 | 0 | 1 | 1 |
InstituteViews. Topic:3. I was given adequate opportunities for personal development | -1 | 0 | 1 | 1 |
InstituteViews. Topic:8. Management was generally supportive of my team | -1 | 0 | 1 | 1 |
InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly | -1 | 0 | 1 | 1 |
unmatched_env.
--------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-1-111192713002> in <module> ----> 1 unmatched_env NameError: name 'unmatched_env' is not defined
Partial Answer Q4:
In the features to be discarded, TAFE employees (long format questions) were most unhappy with:
DETE employees (short format questions) were mainly unhappy with:
# now that we have analyzed these columns, we can drop them from the dataframes
tafe_up2.drop(columns=unmatched_env.columns, inplace=True, errors="ignore")
dete_up1.drop(columns=unmatched_env.columns, inplace=True, errors="ignore")
Contributing factors in TAFE dataset need to be formated like DETE dataset - booleans instead of strings.
tafe_dissat_col = tafe_up2.columns[tafe_up2.columns.str.contains(
"Contributing Factors")]
def update_vals(val):
if pd.isnull(val):
return np.nan
if val == "-":
return False
else:
return True
tafe_up2[tafe_dissat_col] = tafe_up2[tafe_dissat_col].applymap(update_vals)
Now the column headers need to be harmonized.
dete.loc[:, "Career move to public sector":"None of the above"].mean(
).sort_values(ascending=False).head(10)
None of the above 0.263990 Work life balance 0.263990 Ill Health 0.136253 Job dissatisfaction 0.108273 Workload 0.105839 Career move to private sector 0.097324 Relocation 0.082725 Maternity/family 0.075426 Dissatisfaction with the department 0.074209 Lack of recognition 0.069343 dtype: float64
DETE has more contributing factors listed on the dataset than TAFE. Out of these, poor work-life-balance and workload are often cited as reasons for exiting the job. Because they are such an important factor for a an employees exit, I will group these two and keep them in the dataset - although they are not included in the TAFE dataset.
dete_col_rename_cont = ['Physical work environment',
'Lack of recognition',
'Lack of job security',
'Work location',
'Employment conditions',
'Relocation',
'Traumatic incident']
dete_col_rename_work = ['Work life balance',
'Workload']
# I decided to combine the information into two single column and drop the detailed columns
dete_up1["Other factor"] = dete_up1.loc[:,dete_col_rename_cont].any(axis=1, skipna=False)
dete_up1["Too much work"] = dete_up1.loc[:,dete_col_rename_work].any(axis=1, skipna=False)
dete_up2 = dete_up1.drop(columns=dete_col_rename_cont+dete_col_rename_work)
# TAFE in general has less features describing the contributing factors.
# They will be renamed to align with the DETE dataset.
tafe_col_rename_cont = {'Contributing Factors. Career Move - Public Sector ': 'Career move to public sector',
'Contributing Factors. Career Move - Private Sector ': 'Career move to private sector',
'Contributing Factors. Interpersonal Conflict': 'Interpersonal conflicts',
'Contributing Factors. Job Dissatisfaction': 'Job dissatisfaction',
'Contributing Factors. Dissatisfaction': 'Dissatisfaction with the department',
'Contributing Factors. Maternity/Family': 'Maternity/family',
'Contributing Factors. Ill Health': 'Ill Health',
'Contributing Factors. NONE': 'None of the above'}
tafe_up3 = tafe_up2.rename(columns=tafe_col_rename_cont,inplace=False).copy()
# The information from four columns needs to be aggregated into two columns to match with the DETE dataset
tafe_col_rename_ts = ['Contributing Factors. Travel','Contributing Factors. Study']
tafe_up3['Study/Travel'] = tafe_up3.loc[:,tafe_col_rename_ts].any(axis=1, skipna=False)
tafe_col_rename_other = ['Contributing Factors. Career Move - Self-employment','Contributing Factors. Other']
tafe_up3['Other factor'] = tafe_up3.loc[:,tafe_col_rename_ts].any(axis=1, skipna=False)
tafe_up4 = tafe_up3.drop(columns=tafe_col_rename_ts+tafe_col_rename_other)
# Quick heatmap, to to check if the any()-command worked as expected
test_subset = tafe_up3.loc[:,tafe_col_rename_ts+tafe_col_rename_other+['Study/Travel','Other factor']]
fig, ax = plt.subplots(figsize=(3,5))
ax = sns.heatmap(test_subset.isnull(),cbar=False)
# bar code look means success ;-)
# Now let's also match the column headers of the other relevant features.
# Again we will match column headers of TAFE to DETE as the latter are better formatted
# Thankfully, dataquest already provided a dictionary for this
tafe_col_rename_gen = {'Record ID': 'id',
'CESSATION YEAR': 'cease_date',
'Reason for ceasing employment': 'separationtype',
'Gender. What is your Gender?': 'gender',
'CurrentAge. Current Age': 'age',
'Employment Type. Employment Type': 'employment_status',
'Classification. Classification': 'position',
'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}
tafe_up5 = tafe_up4.rename(columns=tafe_col_rename_gen, inplace=False).copy()
# formatted dete and tafe column headers according to dataquest
dete_up2.columns = dete_up2.columns.str.lower().str.strip().str.replace(" ","_")
tafe_up5.columns = tafe_up5.columns.str.lower().str.strip().str.replace(" ","_")
tafe_up5["institute"] = "TAFE"
dete_up2["institute"] = "DETE"
exit_combined = pd.concat([tafe_up5,dete_up2],ignore_index=True)
exit_combined.info();
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1418 entries, 0 to 1417 Data columns (total 35 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 1418 non-null float64 1 cease_date 1382 non-null object 2 separationtype 1418 non-null object 3 career_move_to_public_sector 1202 non-null object 4 career_move_to_private_sector 1202 non-null object 5 ill_health 1202 non-null object 6 maternity/family 1202 non-null object 7 dissatisfaction_with_the_department 1202 non-null object 8 job_dissatisfaction 1202 non-null object 9 interpersonal_conflicts 1202 non-null object 10 none_of_the_above 1202 non-null object 11 opportunities_for_promotion 1249 non-null float64 12 stress_and_pressure_support 1384 non-null float64 13 kept_informed 1389 non-null float64 14 staff_morale 1378 non-null float64 15 performance_of_supervisor 1383 non-null float64 16 initiative 1385 non-null float64 17 peer_support 1385 non-null float64 18 skills 1384 non-null float64 19 worklife_balance 1389 non-null float64 20 feedback 1358 non-null float64 21 physical_environment 1388 non-null float64 22 coach 1293 non-null float64 23 communication 1382 non-null float64 24 gender 1394 non-null object 25 age 1407 non-null object 26 employment_status 1413 non-null object 27 institute_service 596 non-null object 28 role_service 596 non-null object 29 study/travel 1202 non-null object 30 other_factor 1202 non-null object 31 institute 1418 non-null object 32 dete_start_date 749 non-null float64 33 role_start_date 724 non-null float64 34 too_much_work 822 non-null object dtypes: float64(16), object(19) memory usage: 387.9+ KB
#separation type is overly specific for the analysis and does not align across both data sets
def sep_types(element):
if "Resignation" in element:
return "Resignation"
if "Retirement" in element:
return "Retirement"
invol = ["Retrenchment/ Redundancy", "Termination","Contract Expired"]
if element in invol:
return "Involuntary"
else:
return "Other"
exit_combined["sep_cat"] = exit_combined["separationtype"].apply(sep_types)
exit_combined.groupby(["sep_cat","separationtype"])["id"].count();
temp_pivot = exit_combined.pivot_table(index="sep_cat",columns="institute",values="id",aggfunc="count")
temp_pivot.plot(kind="bar",stacked=False,title="Number of exits by kind of separation")
plt.show()
exit_combined["age"] = exit_combined["age"].str.replace(" ","-") # TAFE had double spaces instead of "-"
age_replace_old = {"56-60": "56 or older", # TAFE had additional age-group
"61 or older": "56 or older"}
exit_combined["age"] = exit_combined["age"].replace(age_replace_old)
temp_pivot = exit_combined.pivot_table(index="age",columns="institute",values="id",aggfunc="count")
temp_pivot.plot(kind="bar",stacked=False,title="Number of exits by age group")
plt.show()
exit_combined["cease_date"].value_counts(dropna=False).tail();
# some cease dates are strings and formated as MM/YYYY. We will remove the months as it is not required
resignation_cease_corr = exit_combined.loc[:,("cease_date")].copy()
exit_combined_up1 = exit_combined.copy()
exit_combined_up1["cease_date"] = resignation_cease_corr.astype(str).str.replace(r"[0-9]{2}/","").astype(float)
temp_pivot = exit_combined_up1.pivot_table(index="cease_date",columns="institute",values="id",aggfunc="count")
temp_pivot.plot(kind="bar",stacked=False,title="Number of exits by cease date")
plt.show()
exit_combined_up1.tail().style;
#calculating tenure for DETE
exit_combined_up1.loc[exit_combined_up1["institute"]=="DETE","institute_service"] = exit_combined_up1["cease_date"] - exit_combined_up1["dete_start_date"]
exit_combined_up1.loc[exit_combined_up1["institute"]=="DETE","role_service"] = exit_combined_up1["cease_date"] - exit_combined_up1["role_start_date"]
exit_combined_up2 = exit_combined_up1.drop(columns=["cease_date","dete_start_date","role_start_date"]).copy()
exit_combined_up2[exit_combined_up2["institute"]=="TAFE"]["role_service"].value_counts();
For TAFE role service and institute service are coded in groups. I will assign the average length using dictionaries as this seems more quicker than regex given the very limited number of groups.
tafe_avg_tenure = {"Less than 1 year": 0.5,
"1-2": 1.5,
"3-4": 3.5,
"5-6": 5.5,
"7-10": 8.5,
"11-20": 15.5,
"More than 20 years": 25} # simple assumption
exit_combined_up3 = exit_combined_up2.replace(tafe_avg_tenure).copy()
Institute tenure and role tenure now look like continues variables but they are in fact a mix of precise years and presumed averages.
This is fine, because I am not really interested in the exact number of years.
We'll rather base our analysis on this article which makes the argument that understanding employee's needs according to career stage instead of age is more effective.
We'll use the slightly modified definitions below:
For institute tenure: - New: Less than 3 years - Experienced: 3-6 years - Well established: 7-10 years - Veteran: 11 or more years
For role tenure: - Recent promotion: Less than 3 years - Familiar role: 3-6 years - Mature role: 7 or more years
def inst_ten_cat(element):
if pd.isnull(element):
return np.nan
if element < 3:
return "1 New (>3 years)"
if element < 7:
return "2 Experienced (3-6 years)"
if element < 11:
return "3 Established (7-10 years)"
if element >= 10:
return "4 Veteran (<10 years)"
else:
return Error
def role_ten_cat(element):
if pd.isnull(element):
return np.nan
if element < 3:
return "1 Recent hire/promotion (>3 years)"
if element < 7:
return "2 Familiar role (3-6 years)"
if element >= 7:
return "3 Mature role (>6 years)"
else:
return Error
inst_ten_categories = exit_combined_up3["institute_service"].apply(inst_ten_cat)
role_ten_categories = exit_combined_up3["role_service"].apply(role_ten_cat)
exit_combined_up3["institute_tenure_cat"] = inst_ten_categories
exit_combined_up3["role_tenure_cat"] = role_ten_categories
exit_combined_up4 = exit_combined_up3.drop(columns=["role_service","institute_service"])
temp_pivot = exit_combined_up4.pivot_table(index="institute_tenure_cat",columns="institute",values="id",aggfunc="count")
temp_pivot.plot(kind="bar",stacked=False,title="Number of exits by length of tenure at institute")
plt.show()
temp_pivot = exit_combined_up4.pivot_table(index="role_tenure_cat",columns="institute",values="id",aggfunc="count")
temp_pivot.plot(kind="bar",stacked=False,title="Number of exits by length of tenure in role")
plt.show()
I will quickly rearrange the columns before we proceed. Clustering the columns based on the type of information they include (employee info, contributing factors, rating of work environment) makes it easier to see where clusters of missing info are.
# get columns names in reuseable format
exit_combined_up4.columns.tolist();
new_sorting = ['id',
'sep_cat',
'separationtype',
'institute',
'gender',
'age',
'employment_status',
'institute_tenure_cat',
'role_tenure_cat',
'career_move_to_public_sector',
'career_move_to_private_sector',
'ill_health',
'maternity/family',
'dissatisfaction_with_the_department',
'job_dissatisfaction',
'interpersonal_conflicts',
'too_much_work',
'study/travel',
'other_factor',
'none_of_the_above',
'opportunities_for_promotion',
'stress_and_pressure_support',
'kept_informed',
'staff_morale',
'performance_of_supervisor',
'initiative',
'peer_support',
'skills',
'worklife_balance',
'feedback',
'physical_environment',
'coach',
'communication']
exit_combined_up5 = exit_combined_up4.loc[:,new_sorting].copy()
exit_combined_up5.head().style
id | sep_cat | separationtype | institute | gender | age | employment_status | institute_tenure_cat | role_tenure_cat | career_move_to_public_sector | career_move_to_private_sector | ill_health | maternity/family | dissatisfaction_with_the_department | job_dissatisfaction | interpersonal_conflicts | too_much_work | study/travel | other_factor | none_of_the_above | opportunities_for_promotion | stress_and_pressure_support | kept_informed | staff_morale | performance_of_supervisor | initiative | peer_support | skills | worklife_balance | feedback | physical_environment | coach | communication | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 634133009996093952.000000 | Involuntary | Contract Expired | TAFE | Female | 26-30 | Temporary Full-time | 1 New (>3 years) | 1 Recent hire/promotion (>3 years) | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 |
1 | 634146578511788032.000000 | Resignation | Resignation | TAFE | Male | 41-45 | Permanent Full-time | 2 Experienced (3-6 years) | 2 Familiar role (3-6 years) | False | True | False | False | False | False | False | nan | False | False | False | 1.000000 | 2.000000 | 1.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 |
2 | 634147506906311040.000000 | Resignation | Resignation | TAFE | Female | 56 or older | Contract/casual | 3 Established (7-10 years) | 3 Mature role (>6 years) | False | False | False | False | False | False | False | nan | False | False | False | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1.000000 | 0.000000 | -1.000000 | 0.000000 | 0.000000 |
3 | 634152007975693952.000000 | Resignation | Resignation | TAFE | Male | 20 or younger | Temporary Full-time | 2 Experienced (3-6 years) | 2 Familiar role (3-6 years) | False | True | False | True | False | False | False | nan | False | False | False | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 |
4 | 634153745310374016.000000 | Resignation | Resignation | TAFE | Male | 46-50 | Permanent Full-time | 2 Experienced (3-6 years) | 2 Familiar role (3-6 years) | False | False | False | False | False | False | False | nan | False | False | False | -1.000000 | 1.000000 | 1.000000 | -2.000000 | -1.000000 | 1.000000 | -1.000000 | -1.000000 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | -2.000000 |
#fix missing gender information
exit_combined_up5.loc[exit_combined_up5["gender"].isna(),"gender"] = "Unknown"
#drop records with missing employment status as other information was also missing for these employees
exit_combined_up6 = exit_combined_up5.dropna(subset=["employment_status"],inplace=False).copy()
fig, ax = plt.subplots(figsize=(10,10))
ax = sns.heatmap(exit_combined_up6.set_index(["institute","sep_cat","age",]).sort_index().isnull(),cbar=False,
)
#dark==non-null, light==null-value
I will not drop any further records before doing the analysis because there is no record that is expandable for every question I want to answer.
exit_clean = exit_combined_up6.copy()
As a reminder, the questions are:
exit_clean.sample(2).style
id | sep_cat | separationtype | institute | gender | age | employment_status | institute_tenure_cat | role_tenure_cat | career_move_to_public_sector | career_move_to_private_sector | ill_health | maternity/family | dissatisfaction_with_the_department | job_dissatisfaction | interpersonal_conflicts | too_much_work | study/travel | other_factor | none_of_the_above | opportunities_for_promotion | stress_and_pressure_support | kept_informed | staff_morale | performance_of_supervisor | initiative | peer_support | skills | worklife_balance | feedback | physical_environment | coach | communication | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1027 | 432.000000 | Retirement | Age Retirement | DETE | Male | 56 or older | Permanent Part-time | 4 Veteran (<10 years) | 3 Mature role (>6 years) | False | False | False | False | False | False | False | False | False | False | True | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 |
188 | 634508607117811968.000000 | Resignation | Resignation | TAFE | Male | 21-25 | Permanent Full-time | 2 Experienced (3-6 years) | 2 Familiar role (3-6 years) | False | False | False | False | False | False | False | nan | True | True | False | 0.000000 | 2.000000 | -1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | -1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 |
I assume that these coded responses are indicative of dissatisfaction that could potentially be mitigated through actions of the employer:
These mitigating factors are however not:
For this part of the analysis I will drop all TAFE-employees whose exit was involuntary. For the remaining tafe employees I will assume, that they were not dissatisfied because of too much work (information not available). For TAFE-employees the missing values will thus not count towards the aggregate dissatisfaction flag.
exit_clean_dis = exit_clean.copy()
exit_clean_dis = exit_clean_dis[~((exit_clean_dis["sep_cat"] == "Involuntary") & (exit_clean_dis["institute"] == "TAFE"))]
print("{} rows dropped".format(np.subtract(exit_clean.shape,exit_clean_dis.shape)[0]))
exit_clean_dis["too_much_work"].fillna(False,inplace=True)
exit_clean_dis["too_much_work"].value_counts(dropna=False)
exit_clean_dis.loc[:,"career_move_to_public_sector":"none_of_the_above"] = exit_clean_dis.loc[:,"career_move_to_public_sector":"none_of_the_above"].astype(bool)
216 rows dropped
dissatisfied_col = ['dissatisfaction_with_the_department',
'job_dissatisfaction',
'interpersonal_conflicts',
'too_much_work']
exit_clean_dis["dissatisfied"] = exit_clean_dis.loc[:,dissatisfied_col].any(axis=1,skipna=True)
Let's see if there are any unexpected patterns in the data, before we analyze the what impact tenure has on dissatisfaction.
temp_pivot = exit_clean_dis.pivot_table(index=["sep_cat"], columns=["institute"],values="dissatisfied")
temp_pivot.plot(kind="bar", title="Average share of dissatisfied exits by reason for exit")
plt.legend(loc='center left',bbox_to_anchor=(1.0, 0.5))
plt.show()
To my surprise, DETE employees whose exit was involuntary reported below average dissatisfaction levels.
On the other hand DETE retirees reported an unexpected above-average dissatisfaction with their job.
temp_subset = exit_clean_dis.loc[(exit_clean_dis["institute"]=="DETE")&(exit_clean_dis["sep_cat"]=="Retirement"),
['dissatisfaction_with_the_department',
'job_dissatisfaction',
'interpersonal_conflicts',
'too_much_work',
'separationtype']]
temp_subset.groupby("separationtype").mean().plot(kind="bar",title="Average share of dissatisfied exits by type of dissatisfaction, retirees only")
plt.legend(loc='center left',bbox_to_anchor=(1.0, 0.5))
plt.show()
Retirees at DETE are dissatisfied with their job because of too much work. Going forward we will only analyze resignations.
exit_clean_dis_res = exit_clean_dis.copy()
exit_clean_dis_res = exit_clean_dis_res.loc[exit_clean_dis_res["sep_cat"]=="Resignation",:]
tem_pivot = exit_clean_dis_res.pivot_table(index=["institute_tenure_cat"], columns=["institute"],values="dissatisfied")
tem_pivot.plot(kind="bar",title="Average share of dissatisfied exits by institute tenure, resigning employees only")
plt.legend(loc='center left',bbox_to_anchor=(1.0, 0.5))
plt.show()
DETE exits who resigned were more often dissatisfied when they were at the institute for a longer time. This is not the case for TAFE employees.
temp_pivot = exit_clean_dis_res.pivot_table(index=["age"], columns=["institute"], values="dissatisfied")
temp_pivot.plot(kind="bar", title="Average share of dissatisfied exits by age, resigning employees only")
plt.legend(loc='center left',bbox_to_anchor=(1.0, 0.5))
plt.show()
At DETE young professional between the ages of 26 and 35 are more dissatisfied than middle-aged exits from 36 to 50.
temp_pivot = exit_clean_dis_res.pivot_table(index=["role_tenure_cat"], columns=["institute"], values="dissatisfied")
temp_pivot.plot(kind="bar", title="Average share of dissatisfied exits by role tenure, resigning employees only")
plt.legend(loc='center left',bbox_to_anchor=(1.0, 0.5))
plt.show()
In general employees who are familiar with their jobs are less likely to leave dissatisfied than those that are new in their positions or have been with in their position for a long time.
temp_pivot = exit_clean_dis_res.pivot_table(index=["role_tenure_cat"], columns=["age"], values="dissatisfied", aggfunc=["mean","count"])
temp_pivot["mean"].plot(kind="bar",title="Average share of dissatisfied exits by role tenure and age, resigning employees only")
plt.legend(loc='center left',bbox_to_anchor=(1.0, 0.5))
plt.show()
temp_pivot = exit_clean_dis_res.pivot_table(index=["institute_tenure_cat"], columns=["age"], values="dissatisfied", aggfunc=["mean","count"])
temp_pivot["mean"].plot(kind="bar",title="Average share of dissatisfied exits by institute tenure and age, resigning employees only")
plt.legend(loc='center left',bbox_to_anchor=(1.0, 0.5))
plt.show()
Young professionals (31-35) that - a) not recently progressed to a new role or - b) are with the institute for a long time are dissatisfied above average.
col_drop_for_env = exit_clean.loc[:,"career_move_to_public_sector":"none_of_the_above"].columns
env_col = exit_clean.loc[:,"opportunities_for_promotion":"communication"].columns
exit_clean_env = exit_clean.drop(columns=col_drop_for_env).drop(columns="id")
temp_table = exit_clean_env.groupby("institute").mean().transpose().sort_values(by="DETE")
temp_table.plot(kind="bar", title="Average happiness with work environment (more positive = better), by institute")
plt.legend(loc='center left',bbox_to_anchor=(1.0, 0.5))
plt.show()
It looks like TAFE really needs to work on staff morale. Throw a nice office party! Problem solved.
pd.melt(exit_clean_env, id_vars=["sep_cat","institute"], value_vars=env_col).groupby("sep_cat").mean().plot(kind="bar",legend=False)
plt.title("Average happiness with work environment (more positive = better), by type of exit")
plt.show()
Involuntary exit do not rate their work environment worse than the average exit, but retirees do.
pd.melt(exit_clean_env, id_vars=["age","institute"], value_vars=env_col).groupby("age").mean().plot(kind="bar",legend=False)
plt.title("Average happiness with work environment (more positive = better), by age")
plt.show()
Young exits were more happy about their environment. Happiness with the work environment declines with age, but increases for the very old (56+ years) - maybe nostalgia?
#found this really helpful extension to explore the dataset
import qgrid
qgrid.show_grid(exit_clean, grid_options={'forceFitColumns': False, 'defaultColumnWidth': 100})
QgridWidget(grid_options={'fullWidthRows': True, 'syncColumnCellResize': True, 'forceFitColumns': False, 'defa…
And a way to make exploration and presentation more interactive. I will incorporate both methods them in future projects.