We will be working with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. Links for the datasets can be found here: TAFE and DETE.
Our goal is to answer the following questions:
import pandas as pd
import numpy as np
dete_survey=pd.read_csv('dete_survey.csv')
tafe_survey=pd.read_csv('tafe_survey.csv')
dete_survey.head(3)
ID | SeparationType | Cease Date | DETE Start Date | Role Start Date | Position | Classification | Region | Business Unit | Employment Status | ... | Kept informed | Wellness programs | Health & Safety | Gender | Age | Aboriginal | Torres Strait | South Sea | Disability | NESB | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Ill Health Retirement | 08/2012 | 1984 | 2004 | Public Servant | A01-A04 | Central Office | Corporate Strategy and Peformance | Permanent Full-time | ... | N | N | N | Male | 56-60 | NaN | NaN | NaN | NaN | Yes |
1 | 2 | Voluntary Early Retirement (VER) | 08/2012 | Not Stated | Not Stated | Public Servant | AO5-AO7 | Central Office | Corporate Strategy and Peformance | Permanent Full-time | ... | N | N | N | Male | 56-60 | NaN | NaN | NaN | NaN | NaN |
2 | 3 | Voluntary Early Retirement (VER) | 05/2012 | 2011 | 2011 | Schools Officer | NaN | Central Office | Education Queensland | Permanent Full-time | ... | N | N | N | Male | 61 or older | NaN | NaN | NaN | NaN | NaN |
3 rows × 56 columns
tafe_survey.head(3)
Record ID | Institute | WorkArea | CESSATION YEAR | Reason for ceasing employment | Contributing Factors. Career Move - Public Sector | Contributing Factors. Career Move - Private Sector | Contributing Factors. Career Move - Self-employment | Contributing Factors. Ill Health | Contributing Factors. Maternity/Family | ... | Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination? | Workplace. Topic:Does your workplace promote and practice the principles of employment equity? | Workplace. Topic:Does your workplace value the diversity of its employees? | Workplace. Topic:Would you recommend the Institute as an employer to others? | Gender. What is your Gender? | CurrentAge. Current Age | Employment Type. Employment Type | Classification. Classification | LengthofServiceOverall. Overall Length of Service at Institute (in years) | LengthofServiceCurrent. Length of Service at current workplace (in years) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6.341330e+17 | Southern Queensland Institute of TAFE | Non-Delivery (corporate) | 2010.0 | Contract Expired | NaN | NaN | NaN | NaN | NaN | ... | Yes | Yes | Yes | Yes | Female | 26 30 | Temporary Full-time | Administration (AO) | 1-2 | 1-2 |
1 | 6.341337e+17 | Mount Isa Institute of TAFE | Non-Delivery (corporate) | 2010.0 | Retirement | - | - | - | - | - | ... | Yes | Yes | Yes | Yes | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 6.341388e+17 | Mount Isa Institute of TAFE | Delivery (teaching) | 2010.0 | Retirement | - | - | - | - | - | ... | Yes | Yes | Yes | Yes | NaN | NaN | NaN | NaN | NaN | NaN |
3 rows × 72 columns
dete_survey['DETE Start Date'].value_counts()
Not Stated 73 2011 40 2007 34 2008 31 2012 27 2010 27 2009 24 2006 23 1970 21 2013 21 1975 21 1990 20 2005 20 1999 19 1996 19 1992 18 2004 18 1991 18 2000 18 1989 17 1988 15 2003 15 1976 15 1978 15 2002 15 1980 14 1997 14 1998 14 1995 14 1974 14 1979 14 1993 13 1972 12 1986 12 1977 11 1994 10 2001 10 1971 10 1969 10 1984 10 1983 9 1981 9 1985 8 1973 8 1987 7 1963 4 1982 4 1968 3 1967 2 1966 1 1965 1 Name: DETE Start Date, dtype: int64
While the TAFE dataset is using NaN for appropriate values, NaN values in the DETE dataset seem to be using the value 'Not Stated'. There are also many columns in both datasets that are not necessary for our analysis. Let's correct the missing values and remove some columns.
# correct NaN values
dete_survey=pd.read_csv('dete_survey.csv', na_values=['Not Stated'])
# remove unnecessary columns
dete_survey=dete_survey.drop(columns=dete_survey.columns[28:49], axis=1)
tafe_survey=tafe_survey.drop(columns=tafe_survey.columns[17:66], axis=1)
dete_survey['DETE Start Date'].value_counts(dropna=False)
NaN 73 2011.0 40 2007.0 34 2008.0 31 2010.0 27 2012.0 27 2009.0 24 2006.0 23 1970.0 21 1975.0 21 2013.0 21 2005.0 20 1990.0 20 1999.0 19 1996.0 19 1992.0 18 1991.0 18 2000.0 18 2004.0 18 1989.0 17 1978.0 15 2003.0 15 1988.0 15 1976.0 15 2002.0 15 1974.0 14 1997.0 14 1998.0 14 1979.0 14 1995.0 14 1980.0 14 1993.0 13 1972.0 12 1986.0 12 1977.0 11 1971.0 10 1984.0 10 1994.0 10 1969.0 10 2001.0 10 1983.0 9 1981.0 9 1973.0 8 1985.0 8 1987.0 7 1982.0 4 1963.0 4 1968.0 3 1967.0 2 1965.0 1 1966.0 1 Name: DETE Start Date, dtype: int64
print(dete_survey.columns)
print(tafe_survey.columns)
Index(['ID', 'SeparationType', 'Cease Date', 'DETE Start Date', 'Role Start Date', 'Position', 'Classification', 'Region', 'Business Unit', 'Employment Status', 'Career move to public sector', 'Career move to private sector', 'Interpersonal conflicts', 'Job dissatisfaction', 'Dissatisfaction with the department', 'Physical work environment', 'Lack of recognition', 'Lack of job security', 'Work location', 'Employment conditions', 'Maternity/family', 'Relocation', 'Study/Travel', 'Ill Health', 'Traumatic incident', 'Work life balance', 'Workload', 'None of the above', 'Gender', 'Age', 'Aboriginal', 'Torres Strait', 'South Sea', 'Disability', 'NESB'], dtype='object') Index(['Record ID', 'Institute', 'WorkArea', 'CESSATION YEAR', 'Reason for ceasing employment', 'Contributing Factors. Career Move - Public Sector ', 'Contributing Factors. Career Move - Private Sector ', 'Contributing Factors. Career Move - Self-employment', 'Contributing Factors. Ill Health', 'Contributing Factors. Maternity/Family', 'Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction', 'Contributing Factors. Interpersonal Conflict', 'Contributing Factors. Study', 'Contributing Factors. Travel', 'Contributing Factors. Other', 'Contributing Factors. NONE', 'Gender. What is your Gender?', 'CurrentAge. Current Age', 'Employment Type. Employment Type', 'Classification. Classification', 'LengthofServiceOverall. Overall Length of Service at Institute (in years)', 'LengthofServiceCurrent. Length of Service at current workplace (in years)'], dtype='object')
Since we are using two datasets, we want to standardize some column names in order to be able to analyze both datasets at once. Some information we are interested in is the reason for ceasing employment, the surveyee's age, and the length of service. Looking at the columns, dete_survey seems to have more appropriate column names. Let's standardize them and update the tafe_survey columns to match.
dete_survey.columns=dete_survey.columns.str.lower().str.strip().str.replace(' ','_')
dete_survey=dete_survey.rename({'separationtype':'separation_type'}, axis=1)
dete_survey.columns
Index(['id', 'separation_type', 'cease_date', 'dete_start_date', 'role_start_date', 'position', 'classification', 'region', 'business_unit', 'employment_status', 'career_move_to_public_sector', 'career_move_to_private_sector', 'interpersonal_conflicts', 'job_dissatisfaction', 'dissatisfaction_with_the_department', 'physical_work_environment', 'lack_of_recognition', 'lack_of_job_security', 'work_location', 'employment_conditions', 'maternity/family', 'relocation', 'study/travel', 'ill_health', 'traumatic_incident', 'work_life_balance', 'workload', 'none_of_the_above', 'gender', 'age', 'aboriginal', 'torres_strait', 'south_sea', 'disability', 'nesb'], dtype='object')
final_cols= {
'Record ID': 'id',
'CESSATION YEAR': 'cease_date',
'Reason for ceasing employment': 'separation_type',
'Gender. What is your Gender?': 'gender',
'CurrentAge. Current Age': 'age',
'Employment Type. Employment Type': 'employment_status',
'Classification. Classification': 'position',
'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'
}
tafe_survey=tafe_survey.rename(final_cols, axis=1)
tafe_survey.columns
Index(['id', 'Institute', 'WorkArea', 'cease_date', 'separation_type', 'Contributing Factors. Career Move - Public Sector ', 'Contributing Factors. Career Move - Private Sector ', 'Contributing Factors. Career Move - Self-employment', 'Contributing Factors. Ill Health', 'Contributing Factors. Maternity/Family', 'Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction', 'Contributing Factors. Interpersonal Conflict', 'Contributing Factors. Study', 'Contributing Factors. Travel', 'Contributing Factors. Other', 'Contributing Factors. NONE', 'gender', 'age', 'employment_status', 'position', 'institute_service', 'role_service'], dtype='object')
Since our questions are only concerned with resignation of employees, we want to select only those rows from the data that deal with resignation. While tafe_survey has a 'Resignation' option, dete_survey has 3 different types of resignation.
print(dete_survey['separation_type'].value_counts())
print(tafe_survey['separation_type'].value_counts())
Age Retirement 285 Resignation-Other reasons 150 Resignation-Other employer 91 Resignation-Move overseas/interstate 70 Voluntary Early Retirement (VER) 67 Ill Health Retirement 61 Other 49 Contract Expired 34 Termination 15 Name: separation_type, dtype: int64 Resignation 340 Contract Expired 127 Retrenchment/ Redundancy 104 Retirement 82 Transfer 25 Termination 23 Name: separation_type, dtype: int64
tafe_resignations=tafe_survey.loc[tafe_survey['separation_type']=='Resignation'].copy()
dete_resignations=dete_survey.loc[
(dete_survey['separation_type']=='Resignation-Other reasons') |
(dete_survey['separation_type']=='Resignation-Other employer') |
((dete_survey['separation_type']=='Resignation-Move overseas/interstate'))
].copy()
dete_resignations.shape
(311, 35)
We have created new dataframes for resignations. Let's take brief look at our data to check for any inconsistencies. A simple check is to verify cease_date years are not in the future.
print(dete_resignations['cease_date'].value_counts())
tafe_resignations['cease_date'].value_counts()
2012 126 2013 74 01/2014 22 12/2013 17 06/2013 14 09/2013 11 11/2013 9 07/2013 9 10/2013 6 08/2013 4 05/2012 2 05/2013 2 2010 1 07/2006 1 07/2012 1 09/2010 1 Name: cease_date, dtype: int64
2011.0 116 2012.0 94 2010.0 68 2013.0 55 2009.0 2 Name: cease_date, dtype: int64
tafe_survey had supplied us with years of service which will aid in our first question, while dete_survey did not. We will create a new calculated column in our dete dataset.
tafe_resignations['institute_service'].value_counts()
Less than 1 year 73 1-2 64 3-4 63 5-6 33 11-20 26 7-10 21 More than 20 years 10 Name: institute_service, dtype: int64
dete_resignations['cease_date']=dete_resignations.loc[:,'cease_date'].str.split('/').str[-1].astype(float)
dete_resignations['institute_service']=dete_resignations['cease_date'] - dete_resignations['dete_start_date']
dete_resignations['institute_service'].value_counts()
5.0 23 1.0 22 3.0 20 0.0 20 6.0 17 4.0 16 9.0 14 2.0 14 7.0 13 13.0 8 8.0 8 20.0 7 15.0 7 10.0 6 22.0 6 14.0 6 17.0 6 12.0 6 16.0 5 18.0 5 23.0 4 11.0 4 24.0 4 39.0 3 19.0 3 21.0 3 32.0 3 28.0 2 26.0 2 25.0 2 30.0 2 36.0 2 29.0 1 33.0 1 42.0 1 27.0 1 41.0 1 35.0 1 38.0 1 34.0 1 49.0 1 31.0 1 Name: institute_service, dtype: int64
We are interested in whether those resigning are doing so due to some sort of dissatisfaction. Both datasets have multiple columns for dissatisfaction. Let's create a new column in each dataset to display whether the employee resigning was doing so for any type of dissatisfaction.
print(tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts())
print(tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts())
dete_satisfy_cols=[
'job_dissatisfaction',
'dissatisfaction_with_the_department',
'physical_work_environment',
'lack_of_recognition',
'lack_of_job_security',
'work_location',
'employment_conditions',
'work_life_balance',
'workload']
dete_resignations.loc[:,dete_satisfy_cols].head()
- 277 Contributing Factors. Dissatisfaction 55 Name: Contributing Factors. Dissatisfaction, dtype: int64 - 270 Job Dissatisfaction 62 Name: Contributing Factors. Job Dissatisfaction, dtype: int64
job_dissatisfaction | dissatisfaction_with_the_department | physical_work_environment | lack_of_recognition | lack_of_job_security | work_location | employment_conditions | work_life_balance | workload | |
---|---|---|---|---|---|---|---|---|---|
3 | False | False | False | False | False | False | False | False | False |
5 | False | False | False | False | False | False | True | False | False |
8 | False | False | False | False | False | False | False | False | False |
9 | True | True | False | False | False | False | False | False | False |
11 | False | False | False | False | False | False | False | False | False |
# Error here
def update_vals(el):
if pd.isnull(el):
return np.nan
elif el=='-':
return False
else:
return True
tafe_resignations['dissatisfied'] = tafe_resignations[['Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction']].applymap(update_vals).any(1, skipna=False)
tafe_resignations_up = tafe_resignations.copy()
print('number of null values:', tafe_resignations_up['dissatisfied'].isnull().sum())
tafe_resignations_up['dissatisfied'].value_counts(dropna=False)
number of null values: 8
False 241 True 91 True 8 Name: dissatisfied, dtype: int64
dete_resignations['dissatisfied']=dete_resignations[dete_satisfy_cols].any(1, skipna=False)
dete_resignations_up=dete_resignations.copy()
dete_resignations['dissatisfied'].value_counts(dropna=False)
False 162 True 149 Name: dissatisfied, dtype: int64