Clean and Analyze Employee Exit Surveys¶

We will be working with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. Links for the datasets can be found here: TAFE and DETE.

Our goal is to answer the following questions:

Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
Are younger employees resigning due to some kind of dissatisfaction? What about older employees?

In [1]:

import pandas as pd
import numpy as np

dete_survey=pd.read_csv('dete_survey.csv')
tafe_survey=pd.read_csv('tafe_survey.csv')

dete_survey.head(3)

Out[1]:

	ID	SeparationType	Cease Date	DETE Start Date	Role Start Date	Position	Classification	Region	Business Unit	Employment Status	...	Kept informed	Wellness programs	Health & Safety	Gender	Age	Aboriginal	Torres Strait	South Sea	Disability	NESB
0	1	Ill Health Retirement	08/2012	1984	2004	Public Servant	A01-A04	Central Office	Corporate Strategy and Peformance	Permanent Full-time	...	N	N	N	Male	56-60	NaN	NaN	NaN	NaN	Yes
1	2	Voluntary Early Retirement (VER)	08/2012	Not Stated	Not Stated	Public Servant	AO5-AO7	Central Office	Corporate Strategy and Peformance	Permanent Full-time	...	N	N	N	Male	56-60	NaN	NaN	NaN	NaN	NaN
2	3	Voluntary Early Retirement (VER)	05/2012	2011	2011	Schools Officer	NaN	Central Office	Education Queensland	Permanent Full-time	...	N	N	N	Male	61 or older	NaN	NaN	NaN	NaN	NaN

3 rows × 56 columns

In [2]:

tafe_survey.head(3)

Out[2]:

	Record ID	Institute	WorkArea	CESSATION YEAR	Reason for ceasing employment	Contributing Factors. Career Move - Public Sector	Contributing Factors. Career Move - Private Sector	Contributing Factors. Career Move - Self-employment	Contributing Factors. Ill Health	Contributing Factors. Maternity/Family	...	Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?	Workplace. Topic:Does your workplace promote and practice the principles of employment equity?	Workplace. Topic:Does your workplace value the diversity of its employees?	Workplace. Topic:Would you recommend the Institute as an employer to others?	Gender. What is your Gender?	CurrentAge. Current Age	Employment Type. Employment Type	Classification. Classification	LengthofServiceOverall. Overall Length of Service at Institute (in years)	LengthofServiceCurrent. Length of Service at current workplace (in years)
0	6.341330e+17	Southern Queensland Institute of TAFE	Non-Delivery (corporate)	2010.0	Contract Expired	NaN	NaN	NaN	NaN	NaN	...	Yes	Yes	Yes	Yes	Female	26 30	Temporary Full-time	Administration (AO)	1-2	1-2
1	6.341337e+17	Mount Isa Institute of TAFE	Non-Delivery (corporate)	2010.0	Retirement	-	-	-	-	-	...	Yes	Yes	Yes	Yes	NaN	NaN	NaN	NaN	NaN	NaN
2	6.341388e+17	Mount Isa Institute of TAFE	Delivery (teaching)	2010.0	Retirement	-	-	-	-	-	...	Yes	Yes	Yes	Yes	NaN	NaN	NaN	NaN	NaN	NaN

3 rows × 72 columns

In [3]:

dete_survey['DETE Start Date'].value_counts()

Out[3]:

Not Stated    73
2011          40
2007          34
2008          31
2012          27
2010          27
2009          24
2006          23
1970          21
2013          21
1975          21
1990          20
2005          20
1999          19
1996          19
1992          18
2004          18
1991          18
2000          18
1989          17
1988          15
2003          15
1976          15
1978          15
2002          15
1980          14
1997          14
1998          14
1995          14
1974          14
1979          14
1993          13
1972          12
1986          12
1977          11
1994          10
2001          10
1971          10
1969          10
1984          10
1983           9
1981           9
1985           8
1973           8
1987           7
1963           4
1982           4
1968           3
1967           2
1966           1
1965           1
Name: DETE Start Date, dtype: int64

Updating Columns¶

While the TAFE dataset is using NaN for appropriate values, NaN values in the DETE dataset seem to be using the value 'Not Stated'. There are also many columns in both datasets that are not necessary for our analysis. Let's correct the missing values and remove some columns.

In [4]:

# correct NaN values
dete_survey=pd.read_csv('dete_survey.csv', na_values=['Not Stated'])
# remove unnecessary columns
dete_survey=dete_survey.drop(columns=dete_survey.columns[28:49], axis=1)
tafe_survey=tafe_survey.drop(columns=tafe_survey.columns[17:66], axis=1)
dete_survey['DETE Start Date'].value_counts(dropna=False)

Out[4]:

NaN       73
2011.0    40
2007.0    34
2008.0    31
2010.0    27
2012.0    27
2009.0    24
2006.0    23
1970.0    21
1975.0    21
2013.0    21
2005.0    20
1990.0    20
1999.0    19
1996.0    19
1992.0    18
1991.0    18
2000.0    18
2004.0    18
1989.0    17
1978.0    15
2003.0    15
1988.0    15
1976.0    15
2002.0    15
1974.0    14
1997.0    14
1998.0    14
1979.0    14
1995.0    14
1980.0    14
1993.0    13
1972.0    12
1986.0    12
1977.0    11
1971.0    10
1984.0    10
1994.0    10
1969.0    10
2001.0    10
1983.0     9
1981.0     9
1973.0     8
1985.0     8
1987.0     7
1982.0     4
1963.0     4
1968.0     3
1967.0     2
1965.0     1
1966.0     1
Name: DETE Start Date, dtype: int64

In [5]:

print(dete_survey.columns)
print(tafe_survey.columns)

Index(['ID', 'SeparationType', 'Cease Date', 'DETE Start Date',
       'Role Start Date', 'Position', 'Classification', 'Region',
       'Business Unit', 'Employment Status', 'Career move to public sector',
       'Career move to private sector', 'Interpersonal conflicts',
       'Job dissatisfaction', 'Dissatisfaction with the department',
       'Physical work environment', 'Lack of recognition',
       'Lack of job security', 'Work location', 'Employment conditions',
       'Maternity/family', 'Relocation', 'Study/Travel', 'Ill Health',
       'Traumatic incident', 'Work life balance', 'Workload',
       'None of the above', 'Gender', 'Age', 'Aboriginal', 'Torres Strait',
       'South Sea', 'Disability', 'NESB'],
      dtype='object')
Index(['Record ID', 'Institute', 'WorkArea', 'CESSATION YEAR',
       'Reason for ceasing employment',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE',
       'Gender. What is your Gender?', 'CurrentAge. Current Age',
       'Employment Type. Employment Type', 'Classification. Classification',
       'LengthofServiceOverall. Overall Length of Service at Institute (in years)',
       'LengthofServiceCurrent. Length of Service at current workplace (in years)'],
      dtype='object')

Since we are using two datasets, we want to standardize some column names in order to be able to analyze both datasets at once. Some information we are interested in is the reason for ceasing employment, the surveyee's age, and the length of service. Looking at the columns, dete_survey seems to have more appropriate column names. Let's standardize them and update the tafe_survey columns to match.

In [6]:

dete_survey.columns=dete_survey.columns.str.lower().str.strip().str.replace(' ','_')
dete_survey=dete_survey.rename({'separationtype':'separation_type'}, axis=1)
dete_survey.columns

Out[6]:

Index(['id', 'separation_type', 'cease_date', 'dete_start_date',
       'role_start_date', 'position', 'classification', 'region',
       'business_unit', 'employment_status', 'career_move_to_public_sector',
       'career_move_to_private_sector', 'interpersonal_conflicts',
       'job_dissatisfaction', 'dissatisfaction_with_the_department',
       'physical_work_environment', 'lack_of_recognition',
       'lack_of_job_security', 'work_location', 'employment_conditions',
       'maternity/family', 'relocation', 'study/travel', 'ill_health',
       'traumatic_incident', 'work_life_balance', 'workload',
       'none_of_the_above', 'gender', 'age', 'aboriginal', 'torres_strait',
       'south_sea', 'disability', 'nesb'],
      dtype='object')

In [7]:

final_cols= {
    'Record ID': 'id',
    'CESSATION YEAR': 'cease_date',
    'Reason for ceasing employment': 'separation_type',
    'Gender. What is your Gender?': 'gender',
    'CurrentAge. Current Age': 'age',
    'Employment Type. Employment Type': 'employment_status',
    'Classification. Classification': 'position',
    'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
    'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'
}
tafe_survey=tafe_survey.rename(final_cols, axis=1)
tafe_survey.columns

Out[7]:

Index(['id', 'Institute', 'WorkArea', 'cease_date', 'separation_type',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE', 'gender',
       'age', 'employment_status', 'position', 'institute_service',
       'role_service'],
      dtype='object')

Selecting Relevant Rows¶

Since our questions are only concerned with resignation of employees, we want to select only those rows from the data that deal with resignation. While tafe_survey has a 'Resignation' option, dete_survey has 3 different types of resignation.

In [8]:

print(dete_survey['separation_type'].value_counts())
print(tafe_survey['separation_type'].value_counts())

Age Retirement                          285
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Voluntary Early Retirement (VER)         67
Ill Health Retirement                    61
Other                                    49
Contract Expired                         34
Termination                              15
Name: separation_type, dtype: int64
Resignation                 340
Contract Expired            127
Retrenchment/ Redundancy    104
Retirement                   82
Transfer                     25
Termination                  23
Name: separation_type, dtype: int64

In [9]:

tafe_resignations=tafe_survey.loc[tafe_survey['separation_type']=='Resignation'].copy()
dete_resignations=dete_survey.loc[
    (dete_survey['separation_type']=='Resignation-Other reasons') |
    (dete_survey['separation_type']=='Resignation-Other employer') |
    ((dete_survey['separation_type']=='Resignation-Move overseas/interstate'))
].copy()
dete_resignations.shape

Out[9]:

(311, 35)

Checking for Inconsistency¶

We have created new dataframes for resignations. Let's take brief look at our data to check for any inconsistencies. A simple check is to verify cease_date years are not in the future.

In [10]:

print(dete_resignations['cease_date'].value_counts())
tafe_resignations['cease_date'].value_counts()

2012       126
2013        74
01/2014     22
12/2013     17
06/2013     14
09/2013     11
11/2013      9
07/2013      9
10/2013      6
08/2013      4
05/2012      2
05/2013      2
2010         1
07/2006      1
07/2012      1
09/2010      1
Name: cease_date, dtype: int64

Out[10]:

2011.0    116
2012.0     94
2010.0     68
2013.0     55
2009.0      2
Name: cease_date, dtype: int64

Creating a Missing Column¶

tafe_survey had supplied us with years of service which will aid in our first question, while dete_survey did not. We will create a new calculated column in our dete dataset.

In [11]:

tafe_resignations['institute_service'].value_counts()

Out[11]:

Less than 1 year      73
1-2                   64
3-4                   63
5-6                   33
11-20                 26
7-10                  21
More than 20 years    10
Name: institute_service, dtype: int64

In [12]:

dete_resignations['cease_date']=dete_resignations.loc[:,'cease_date'].str.split('/').str[-1].astype(float)
dete_resignations['institute_service']=dete_resignations['cease_date'] - dete_resignations['dete_start_date']
dete_resignations['institute_service'].value_counts()

Out[12]:

5.0     23
1.0     22
3.0     20
0.0     20
6.0     17
4.0     16
9.0     14
2.0     14
7.0     13
13.0     8
8.0      8
20.0     7
15.0     7
10.0     6
22.0     6
14.0     6
17.0     6
12.0     6
16.0     5
18.0     5
23.0     4
11.0     4
24.0     4
39.0     3
19.0     3
21.0     3
32.0     3
28.0     2
26.0     2
25.0     2
30.0     2
36.0     2
29.0     1
33.0     1
42.0     1
27.0     1
41.0     1
35.0     1
38.0     1
34.0     1
49.0     1
31.0     1
Name: institute_service, dtype: int64

Finding Those Dissatisfied¶

We are interested in whether those resigning are doing so due to some sort of dissatisfaction. Both datasets have multiple columns for dissatisfaction. Let's create a new column in each dataset to display whether the employee resigning was doing so for any type of dissatisfaction.

In [13]:

print(tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts())
print(tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts())
dete_satisfy_cols=[
    'job_dissatisfaction',
    'dissatisfaction_with_the_department',
    'physical_work_environment',
    'lack_of_recognition',
    'lack_of_job_security',
    'work_location',
    'employment_conditions',
    'work_life_balance',
    'workload']
dete_resignations.loc[:,dete_satisfy_cols].head()

-                                         277
Contributing Factors. Dissatisfaction      55
Name: Contributing Factors. Dissatisfaction, dtype: int64
-                      270
Job Dissatisfaction     62
Name: Contributing Factors. Job Dissatisfaction, dtype: int64

Out[13]:

	job_dissatisfaction	dissatisfaction_with_the_department	physical_work_environment	lack_of_recognition	lack_of_job_security	work_location	employment_conditions	work_life_balance	workload
3	False	False	False	False	False	False	False	False	False
5	False	False	False	False	False	False	True	False	False
8	False	False	False	False	False	False	False	False	False
9	True	True	False	False	False	False	False	False	False
11	False	False	False	False	False	False	False	False	False

In [14]:

# Error here
def update_vals(el):
    if pd.isnull(el):
        return np.nan
    elif el=='-':
        return False
    else:
        return True
    
tafe_resignations['dissatisfied'] = tafe_resignations[['Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction']].applymap(update_vals).any(1, skipna=False)
tafe_resignations_up = tafe_resignations.copy()
print('number of null values:', tafe_resignations_up['dissatisfied'].isnull().sum())
tafe_resignations_up['dissatisfied'].value_counts(dropna=False)

number of null values: 8

Out[14]:

False    241
True      91
True       8
Name: dissatisfied, dtype: int64

In [15]:

dete_resignations['dissatisfied']=dete_resignations[dete_satisfy_cols].any(1, skipna=False)
dete_resignations_up=dete_resignations.copy()
dete_resignations['dissatisfied'].value_counts(dropna=False)

Out[15]:

False    162
True     149
Name: dissatisfied, dtype: int64