Is dissatisfaction the reason employees resigned from the DETE and TAFE institutes in Australia?¶

Introduction & Goals¶

In this guided project, we'll work with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. Both the TAFE and the DETE exit surveys are available online. The data sets used here differ slightly from the originals; for example, the encoding was changed to UTF-8 (the originals are encoded in cp1252).
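The encoding change matters whenever the raw files are read directly. The following is a minimal sketch of reading a cp1252-encoded CSV and re-encoding it as UTF-8; the sample text and the in-memory buffers are made up for illustration and are not part of the survey data.

```python
import io
import pandas as pd

# Illustrative CSV bytes encoded as cp1252 (not the real survey data).
# '\u2019' is a curly apostrophe, which cp1252 stores as a single byte.
raw_bytes = 'Position,Region\nTeacher\u2019s Aide,Central Office\n'.encode('cp1252')

# Reading with the matching encoding recovers the original characters.
sample = pd.read_csv(io.BytesIO(raw_bytes), encoding='cp1252')

# Re-encode as UTF-8 and read it back to confirm nothing was lost.
utf8_bytes = sample.to_csv(index=False).encode('utf-8')
roundtrip = pd.read_csv(io.BytesIO(utf8_bytes), encoding='utf-8')
print(roundtrip)
```

Reading cp1252 bytes as UTF-8 would typically fail or garble such characters, which is presumably why the files were converted up front.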

With these data sets, we want to answer the following questions:

Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
Are younger employees resigning due to some kind of dissatisfaction? What about older employees?

In [152]:

# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# import dataset
dete_survey = pd.read_csv('dete_survey.csv')
tafe_survey = pd.read_csv('tafe_survey.csv')

In [2]:

dete_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 56 columns):
ID                                     822 non-null int64
SeparationType                         822 non-null object
Cease Date                             822 non-null object
DETE Start Date                        822 non-null object
Role Start Date                        822 non-null object
Position                               817 non-null object
Classification                         455 non-null object
Region                                 822 non-null object
Business Unit                          126 non-null object
Employment Status                      817 non-null object
Career move to public sector           822 non-null bool
Career move to private sector          822 non-null bool
Interpersonal conflicts                822 non-null bool
Job dissatisfaction                    822 non-null bool
Dissatisfaction with the department    822 non-null bool
Physical work environment              822 non-null bool
Lack of recognition                    822 non-null bool
Lack of job security                   822 non-null bool
Work location                          822 non-null bool
Employment conditions                  822 non-null bool
Maternity/family                       822 non-null bool
Relocation                             822 non-null bool
Study/Travel                           822 non-null bool
Ill Health                             822 non-null bool
Traumatic incident                     822 non-null bool
Work life balance                      822 non-null bool
Workload                               822 non-null bool
None of the above                      822 non-null bool
Professional Development               808 non-null object
Opportunities for promotion            735 non-null object
Staff morale                           816 non-null object
Workplace issue                        788 non-null object
Physical environment                   817 non-null object
Worklife balance                       815 non-null object
Stress and pressure support            810 non-null object
Performance of supervisor              813 non-null object
Peer support                           812 non-null object
Initiative                             813 non-null object
Skills                                 811 non-null object
Coach                                  767 non-null object
Career Aspirations                     746 non-null object
Feedback                               792 non-null object
Further PD                             768 non-null object
Communication                          814 non-null object
My say                                 812 non-null object
Information                            816 non-null object
Kept informed                          813 non-null object
Wellness programs                      766 non-null object
Health & Safety                        793 non-null object
Gender                                 798 non-null object
Age                                    811 non-null object
Aboriginal                             16 non-null object
Torres Strait                          3 non-null object
South Sea                              7 non-null object
Disability                             23 non-null object
NESB                                   32 non-null object
dtypes: bool(18), int64(1), object(37)
memory usage: 258.6+ KB

There are 822 rows and 56 variables in the DETE data set, and most of the variables have missing values.
Some variables are not necessary for our analysis (e.g. columns 29-49).

In [3]:

# find the number and percentage of missing values per column
dete_len = dete_survey.shape[0] # total number of rows
n = 0 # counter for columns with missing values
for col in dete_survey.columns:
    if dete_survey[col].isnull().any():
        n += 1
        missing = dete_survey[col].isnull().sum()
        print('{}: {}, {:.2%}'.format(col, missing, missing / dete_len))
print('\n', n, ' columns with missing value in dete_survey.')

Position: 5, 0.61%
Classification: 367, 44.65%
Business Unit: 696, 84.67%
Employment Status: 5, 0.61%
Professional Development: 14, 1.70%
Opportunities for promotion: 87, 10.58%
Staff morale: 6, 0.73%
Workplace issue: 34, 4.14%
Physical environment: 5, 0.61%
Worklife balance: 7, 0.85%
Stress and pressure support: 12, 1.46%
Performance of supervisor: 9, 1.09%
Peer support: 10, 1.22%
Initiative: 9, 1.09%
Skills: 11, 1.34%
Coach: 55, 6.69%
Career Aspirations: 76, 9.25%
Feedback: 30, 3.65%
Further PD: 54, 6.57%
Communication: 8, 0.97%
My say: 10, 1.22%
Information: 6, 0.73%
Kept informed: 9, 1.09%
Wellness programs: 56, 6.81%
Health & Safety: 29, 3.53%
Gender: 24, 2.92%
Age: 11, 1.34%
Aboriginal: 806, 98.05%
Torres Strait: 819, 99.64%
South Sea: 815, 99.15%
Disability: 799, 97.20%
NESB: 790, 96.11%

 32  columns with missing value in dete_survey.
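The same per-column summary can also be produced without an explicit loop. This is a sketch on a made-up toy frame (the column names are borrowed from the survey, the values are not); `isnull().mean()` gives the fraction of missing rows because `True` counts as 1.

```python
import pandas as pd

# Toy frame standing in for a survey table (values are made up).
toy = pd.DataFrame({
    'Position': ['Teacher', None, 'Cleaner', 'Teacher'],
    'Region': ['North', 'South', 'North', 'South'],
    'Age': ['21-25', None, None, '56-60'],
})

missing_count = toy.isnull().sum()    # missing values per column
missing_share = toy.isnull().mean()   # fraction of rows missing per column

summary = pd.DataFrame({'missing': missing_count, 'share': missing_share})
summary = summary[summary['missing'] > 0]  # keep only columns with gaps
print(summary)
```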

In [4]:

dete_survey.head(2)

Out[4]:

	ID	SeparationType	Cease Date	DETE Start Date	Role Start Date	Position	Classification	Region	Business Unit	Employment Status	...	Kept informed	Wellness programs	Health & Safety	Gender	Age	Aboriginal	Torres Strait	South Sea	Disability	NESB
0	1	Ill Health Retirement	08/2012	1984	2004	Public Servant	A01-A04	Central Office	Corporate Strategy and Peformance	Permanent Full-time	...	N	N	N	Male	56-60	NaN	NaN	NaN	NaN	Yes
1	2	Voluntary Early Retirement (VER)	08/2012	Not Stated	Not Stated	Public Servant	AO5-AO7	Central Office	Corporate Strategy and Peformance	Permanent Full-time	...	N	N	N	Male	56-60	NaN	NaN	NaN	NaN	NaN

2 rows × 56 columns

DETE Start Date and Role Start Date contain the value Not Stated, which should be recognized as a missing value as well.

In [5]:

tafe_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 72 columns):
Record ID 702 non-null float64
Institute 702 non-null object
WorkArea 702 non-null object
CESSATION YEAR 695 non-null float64
Reason for ceasing employment 701 non-null object
Contributing Factors. Career Move - Public Sector 437 non-null object
Contributing Factors. Career Move - Private Sector 437 non-null object
Contributing Factors. Career Move - Self-employment 437 non-null object
Contributing Factors. Ill Health 437 non-null object
Contributing Factors. Maternity/Family 437 non-null object
Contributing Factors. Dissatisfaction 437 non-null object
Contributing Factors. Job Dissatisfaction 437 non-null object
Contributing Factors. Interpersonal Conflict 437 non-null object
Contributing Factors. Study 437 non-null object
Contributing Factors. Travel 437 non-null object
Contributing Factors. Other 437 non-null object
Contributing Factors. NONE 437 non-null object
Main Factor. Which of these was the main factor for leaving? 113 non-null object
InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction 608 non-null object
InstituteViews. Topic:2. I was given access to skills training to help me do my job better 613 non-null object
InstituteViews. Topic:3. I was given adequate opportunities for personal development 610 non-null object
InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL% 608 non-null object
InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had 615 non-null object
InstituteViews. Topic:6. The organisation recognised when staff did good work 607 non-null object
InstituteViews. Topic:7. Management was generally supportive of me 614 non-null object
InstituteViews. Topic:8. Management was generally supportive of my team 608 non-null object
InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me 610 non-null object
InstituteViews. Topic:10. Staff morale was positive within the Institute 602 non-null object
InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly 601 non-null object
InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently 597 non-null object
InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly 601 non-null object
WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit 609 non-null object
WorkUnitViews. Topic:15. I worked well with my colleagues 605 non-null object
WorkUnitViews. Topic:16. My job was challenging and interesting 607 non-null object
WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work 610 non-null object
WorkUnitViews. Topic:18. I had sufficient contact with other people in my job 613 non-null object
WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job 609 non-null object
WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job 609 non-null object
WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT] 608 non-null object
WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job 608 non-null object
WorkUnitViews. Topic:23. My job provided sufficient variety 611 non-null object
WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job 610 non-null object
WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction 611 non-null object
WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance 606 non-null object
WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area 610 non-null object
WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date 609 non-null object
WorkUnitViews. Topic:29. There was adequate communication between staff in my unit 603 non-null object
WorkUnitViews. Topic:30. Staff morale was positive within my work unit 606 non-null object
Induction. Did you undertake Workplace Induction? 619 non-null object
InductionInfo. Topic:Did you undertake a Corporate Induction? 432 non-null object
InductionInfo. Topic:Did you undertake a Institute Induction? 483 non-null object
InductionInfo. Topic: Did you undertake Team Induction? 440 non-null object
InductionInfo. Face to Face Topic:Did you undertake a Corporate Induction; Category:How it was conducted? 555 non-null object
InductionInfo. On-line Topic:Did you undertake a Corporate Induction; Category:How it was conducted? 555 non-null object
InductionInfo. Induction Manual Topic:Did you undertake a Corporate Induction? 555 non-null object
InductionInfo. Face to Face Topic:Did you undertake a Institute Induction? 530 non-null object
InductionInfo. On-line Topic:Did you undertake a Institute Induction? 555 non-null object
InductionInfo. Induction Manual Topic:Did you undertake a Institute Induction? 553 non-null object
InductionInfo. Face to Face Topic: Did you undertake Team Induction; Category? 555 non-null object
InductionInfo. On-line Topic: Did you undertake Team Induction?process you undertook and how it was conducted.] 555 non-null object
InductionInfo. Induction Manual Topic: Did you undertake Team Induction? 555 non-null object
Workplace. Topic:Did you and your Manager develop a Performance and Professional Development Plan (PPDP)? 608 non-null object
Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination? 594 non-null object
Workplace. Topic:Does your workplace promote and practice the principles of employment equity? 587 non-null object
Workplace. Topic:Does your workplace value the diversity of its employees? 586 non-null object
Workplace. Topic:Would you recommend the Institute as an employer to others? 581 non-null object
Gender. What is your Gender? 596 non-null object
CurrentAge. Current Age 596 non-null object
Employment Type. Employment Type 596 non-null object
Classification. Classification 596 non-null object
LengthofServiceOverall. Overall Length of Service at Institute (in years) 596 non-null object
LengthofServiceCurrent. Length of Service at current workplace (in years) 596 non-null object
dtypes: float64(2), object(70)
memory usage: 395.0+ KB

There are 702 rows and 72 variables in the TAFE data set, and most of the variables have missing values.
Compared with dete_survey, some columns hold the same content under different column names. Therefore we have to rename the columns before combining both data sets.
Some variables are not necessary for our analysis, since we only want to know whether dissatisfaction is the reason (e.g. columns 18-66).

In [6]:

# find the number and percentage of missing values per column
tafe_len = tafe_survey.shape[0] # total number of rows
n = 0 # counter for columns with missing values
for col in tafe_survey.columns:
    if tafe_survey[col].isnull().any():
        n += 1
        missing = tafe_survey[col].isnull().sum()
        # divide by the TAFE row count, not the DETE one
        print('{}: {}, {:.2%}'.format(col, missing, missing / tafe_len))

print('\n', n, 'columns with missing values in tafe_survey.')

CESSATION YEAR: 7, 1.00%
Reason for ceasing employment: 1, 0.14%
Contributing Factors. Career Move - Public Sector : 265, 37.75%
Contributing Factors. Career Move - Private Sector : 265, 37.75%
Contributing Factors. Career Move - Self-employment: 265, 37.75%
Contributing Factors. Ill Health: 265, 37.75%
Contributing Factors. Maternity/Family: 265, 37.75%
Contributing Factors. Dissatisfaction: 265, 37.75%
Contributing Factors. Job Dissatisfaction: 265, 37.75%
Contributing Factors. Interpersonal Conflict: 265, 37.75%
Contributing Factors. Study: 265, 37.75%
Contributing Factors. Travel: 265, 37.75%
Contributing Factors. Other: 265, 37.75%
Contributing Factors. NONE: 265, 37.75%
Main Factor. Which of these was the main factor for leaving?: 589, 83.90%
InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction: 94, 13.39%
InstituteViews. Topic:2. I was given access to skills training to help me do my job better: 89, 12.68%
InstituteViews. Topic:3. I was given adequate opportunities for personal development: 92, 13.11%
InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL%: 94, 13.39%
InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had: 87, 12.39%
InstituteViews. Topic:6. The organisation recognised when staff did good work: 95, 13.53%
InstituteViews. Topic:7. Management was generally supportive of me: 88, 12.54%
InstituteViews. Topic:8. Management was generally supportive of my team: 94, 13.39%
InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me: 92, 13.11%
InstituteViews. Topic:10. Staff morale was positive within the Institute: 100, 14.25%
InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly: 101, 14.39%
InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently: 105, 14.96%
InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly: 101, 14.39%
WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit: 93, 13.25%
WorkUnitViews. Topic:15. I worked well with my colleagues: 97, 13.82%
WorkUnitViews. Topic:16. My job was challenging and interesting: 95, 13.53%
WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work: 92, 13.11%
WorkUnitViews. Topic:18. I had sufficient contact with other people in my job: 89, 12.68%
WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job: 93, 13.25%
WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job: 93, 13.25%
WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT]: 94, 13.39%
WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job: 94, 13.39%
WorkUnitViews. Topic:23. My job provided sufficient variety: 91, 12.96%
WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job: 92, 13.11%
WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction: 91, 12.96%
WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance: 96, 13.68%
WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area: 92, 13.11%
WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date: 93, 13.25%
WorkUnitViews. Topic:29. There was adequate communication between staff in my unit: 99, 14.10%
WorkUnitViews. Topic:30. Staff morale was positive within my work unit: 96, 13.68%
Induction. Did you undertake Workplace Induction?: 83, 11.82%
InductionInfo. Topic:Did you undertake a Corporate Induction?: 270, 38.46%
InductionInfo. Topic:Did you undertake a Institute Induction?: 219, 31.20%
InductionInfo. Topic: Did you undertake Team Induction?: 262, 37.32%
InductionInfo. Face to Face Topic:Did you undertake a Corporate Induction; Category:How it was conducted?: 147, 20.94%
InductionInfo. On-line Topic:Did you undertake a Corporate Induction; Category:How it was conducted?: 147, 20.94%
InductionInfo. Induction Manual Topic:Did you undertake a Corporate Induction?: 147, 20.94%
InductionInfo. Face to Face Topic:Did you undertake a Institute Induction?: 172, 24.50%
InductionInfo. On-line Topic:Did you undertake a Institute Induction?: 147, 20.94%
InductionInfo. Induction Manual Topic:Did you undertake a Institute Induction?: 149, 21.23%
InductionInfo. Face to Face Topic: Did you undertake Team Induction; Category?: 147, 20.94%
InductionInfo. On-line Topic: Did you undertake Team Induction?process you undertook and how it was conducted.]: 147, 20.94%
InductionInfo. Induction Manual Topic: Did you undertake Team Induction?: 147, 20.94%
Workplace. Topic:Did you and your Manager develop a Performance and Professional Development Plan (PPDP)?: 94, 13.39%
Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?: 108, 15.38%
Workplace. Topic:Does your workplace promote and practice the principles of employment equity?: 115, 16.38%
Workplace. Topic:Does your workplace value the diversity of its employees?: 116, 16.52%
Workplace. Topic:Would you recommend the Institute as an employer to others?: 121, 17.24%
Gender. What is your Gender?: 106, 15.10%
CurrentAge. Current Age: 106, 15.10%
Employment Type. Employment Type: 106, 15.10%
Classification. Classification: 106, 15.10%
LengthofServiceOverall. Overall Length of Service at Institute (in years): 106, 15.10%
LengthofServiceCurrent. Length of Service at current workplace (in years): 106, 15.10%

 69 columns with missing values in tafe_survey.

In [7]:

tafe_survey.head(2)

Out[7]:

	Record ID	Institute	WorkArea	CESSATION YEAR	Reason for ceasing employment	Contributing Factors. Career Move - Public Sector	Contributing Factors. Career Move - Private Sector	Contributing Factors. Career Move - Self-employment	Contributing Factors. Ill Health	Contributing Factors. Maternity/Family	...	Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?	Workplace. Topic:Does your workplace promote and practice the principles of employment equity?	Workplace. Topic:Does your workplace value the diversity of its employees?	Workplace. Topic:Would you recommend the Institute as an employer to others?	Gender. What is your Gender?	CurrentAge. Current Age	Employment Type. Employment Type	Classification. Classification	LengthofServiceOverall. Overall Length of Service at Institute (in years)	LengthofServiceCurrent. Length of Service at current workplace (in years)
0	6.341330e+17	Southern Queensland Institute of TAFE	Non-Delivery (corporate)	2010.0	Contract Expired	NaN	NaN	NaN	NaN	NaN	...	Yes	Yes	Yes	Yes	Female	26 30	Temporary Full-time	Administration (AO)	1-2	1-2
1	6.341337e+17	Mount Isa Institute of TAFE	Non-Delivery (corporate)	2010.0	Retirement	-	-	-	-	-	...	Yes	Yes	Yes	Yes	NaN	NaN	NaN	NaN	NaN	NaN

2 rows × 72 columns

Some columns, such as Contributing Factors. Career Move - Public Sector, contain the value -, which should be recognized as a missing value.

Data cleaning¶

Update NaN value¶

We read dete_survey.csv again and treat Not Stated as NaN.

In [8]:

dete_survey = pd.read_csv('dete_survey.csv', 
                          na_values = 'Not Stated')
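To see what `na_values` changes, here is a small sketch that reads the same made-up two-row CSV text with and without it. The in-memory `csv_text` is illustrative only.

```python
import io
import pandas as pd

# Made-up CSV text with one 'Not Stated' entry (not the real survey file).
csv_text = 'ID,DETE Start Date\n1,1984\n2,Not Stated\n'

as_text = pd.read_csv(io.StringIO(csv_text))
as_missing = pd.read_csv(io.StringIO(csv_text), na_values='Not Stated')

print(as_text['DETE Start Date'].isnull().sum())     # 0: 'Not Stated' is kept as a string
print(as_missing['DETE Start Date'].isnull().sum())  # 1: it is now NaN
print(as_missing['DETE Start Date'].dtype)           # float64: the column is numeric
```

A side effect worth noting is the dtype: once the placeholder string is gone, the year column can be parsed as numbers, which the later date arithmetic relies on.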

Drop unnecessary columns¶

As mentioned before, we consider only dissatisfaction as the target reason in this project. We will drop the columns that are not needed for this question to trim the data sets.

In [9]:

dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49],
                                       axis = 1)

In [10]:

tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66],
                                       axis = 1)
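Because the columns are dropped by position, it can help to list the slice before dropping it. The sketch below uses a made-up five-column frame; the point is that the stop index of a slice is excluded, so `columns[2:4]` covers positions 2 and 3 only.

```python
import pandas as pd

# Toy frame with made-up column order (not the real survey layout).
toy = pd.DataFrame(columns=['ID', 'SeparationType', 'Staff morale', 'Feedback', 'Gender'])

to_drop = toy.columns[2:4]   # positions 2 and 3; the stop index is excluded
print(list(to_drop))         # ['Staff morale', 'Feedback']

trimmed = toy.drop(to_drop, axis=1)
print(list(trimmed.columns)) # ['ID', 'SeparationType', 'Gender']
```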

Synchronize the column names¶

In order to combine both data sets, we need common, standardized column names.

In [11]:

# lowercase, strip surrounding whitespace, and replace ' ' with '_'
dete_survey_updated.columns = dete_survey_updated.columns.str.lower().str.strip().str.replace(' ', '_')
dete_survey_updated.columns

Out[11]:

Index(['id', 'separationtype', 'cease_date', 'dete_start_date',
       'role_start_date', 'position', 'classification', 'region',
       'business_unit', 'employment_status', 'career_move_to_public_sector',
       'career_move_to_private_sector', 'interpersonal_conflicts',
       'job_dissatisfaction', 'dissatisfaction_with_the_department',
       'physical_work_environment', 'lack_of_recognition',
       'lack_of_job_security', 'work_location', 'employment_conditions',
       'maternity/family', 'relocation', 'study/travel', 'ill_health',
       'traumatic_incident', 'work_life_balance', 'workload',
       'none_of_the_above', 'gender', 'age', 'aboriginal', 'torres_strait',
       'south_sea', 'disability', 'nesb'],
      dtype='object')

In [12]:

column_names = {'Record ID': 'id',
                'CESSATION YEAR': 'cease_date',
                'Reason for ceasing employment': 'separationtype',
                'Gender. What is your Gender?': 'gender',
                'CurrentAge. Current Age': 'age',
                'Employment Type. Employment Type': 'employment_status',
                'Classification. Classification': 'position',
                'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
                'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}

# rename the TAFE columns to match the DETE column names
tafe_survey_updated = tafe_survey_updated.rename(column_names, axis=1)
tafe_survey_updated.columns

Out[12]:

Index(['id', 'Institute', 'WorkArea', 'cease_date', 'separationtype',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE', 'gender',
       'age', 'employment_status', 'position', 'institute_service',
       'role_service'],
      dtype='object')
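After renaming, one way to check which columns the two data sets now share is to compare their column indexes. This is a sketch with shortened, partly made-up column lists rather than the full ones shown above.

```python
import pandas as pd

# Shortened column lists for illustration only.
dete_cols = pd.Index(['id', 'separationtype', 'cease_date', 'age', 'region'])
tafe_cols = pd.Index(['id', 'Institute', 'cease_date', 'separationtype', 'age'])

shared = dete_cols.intersection(tafe_cols)   # present in both data sets
only_tafe = tafe_cols.difference(dete_cols)  # still TAFE-specific

print(sorted(shared))
print(list(only_tafe))
```

Columns that appear in only one list are filled with NaN for the other data set once the frames are concatenated.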

Extract data with resignation only¶

There are many reasons an employee may leave the institute, but we only need the data from those who resigned. Therefore, we will keep the rows that contain Resignation in separationtype, excluding NaN.

In [13]:

dete_survey_updated['separationtype'].value_counts()

Out[13]:

Age Retirement                          285
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Voluntary Early Retirement (VER)         67
Ill Health Retirement                    61
Other                                    49
Contract Expired                         34
Termination                              15
Name: separationtype, dtype: int64

In [14]:

dete_resignations = dete_survey_updated[
    dete_survey_updated['separationtype'].str.contains('Resignation', na=False)
].copy()

In [15]:

tafe_survey_updated['separationtype'].value_counts()

Out[15]:

Resignation                 340
Contract Expired            127
Retrenchment/ Redundancy    104
Retirement                   82
Transfer                     25
Termination                  23
Name: separationtype, dtype: int64

In [16]:

tafe_resignations = tafe_survey_updated[
    tafe_survey_updated['separationtype'].str.contains('Resignation', na=False)
].copy()
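The `na=False` argument is what keeps rows with a missing separation type out of the result. A small sketch on a made-up Series shows the mask it produces.

```python
import numpy as np
import pandas as pd

# Made-up separation types, including one missing value.
separation = pd.Series(['Resignation-Other reasons', 'Age Retirement',
                        np.nan, 'Resignation'])

# Missing entries become False instead of NaN, so the mask is safe to index with.
mask = separation.str.contains('Resignation', na=False)
resigned = separation[mask]

print(mask.tolist())      # [True, False, False, True]
print(resigned.tolist())  # ['Resignation-Other reasons', 'Resignation']
```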

Data verification¶

In this step, we'll focus on verifying that the years in the cease_date and dete_start_date columns make sense.

Since the cease_date is the last year of the person's employment and the dete_start_date is the person's first year of employment, it wouldn't make sense to have years after the current date.
Given that most people in this field start working in their 20s, it's also unlikely that the dete_start_date was before the year 1940.
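These two rules can also be written as an explicit check. The sketch below flags years before 1940 or after the current year in a made-up Series; the variable names are hypothetical and the values are not from the survey.

```python
import datetime
import pandas as pd

# Made-up years: two plausible, two implausible, one missing.
years = pd.Series([1963.0, 2012.0, 1900.0, 2099.0, None])

current_year = datetime.date.today().year

# between() is inclusive on both ends and returns False for missing values,
# so missing years are excluded from the suspicious set explicitly.
plausible = years.between(1940, current_year)
suspicious = years[~plausible & years.notnull()]

print(suspicious.tolist())
```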

In [17]:

dete_resignations['cease_date'].value_counts()

Out[17]:

2012       126
2013        74
01/2014     22
12/2013     17
06/2013     14
09/2013     11
07/2013      9
11/2013      9
10/2013      6
08/2013      4
05/2012      2
05/2013      2
07/2006      1
2010         1
07/2012      1
09/2010      1
Name: cease_date, dtype: int64

In [18]:

pattern = r"([1-2][0-9]{3})"
# expand=False returns a Series and avoids the extract() FutureWarning
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.extract(pattern, expand=False).astype('float')
dete_resignations['cease_date'].value_counts().sort_index()

Out[18]:

2006.0      1
2010.0      2
2012.0    129
2013.0    146
2014.0     22
Name: cease_date, dtype: int64
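To illustrate how the pattern handles the two date formats, here is a sketch on a few made-up strings. The pattern requires a leading 1 or 2 followed by three digits, so the month part of `01/2014` cannot match and the year is picked up instead.

```python
import pandas as pd

# Made-up dates in the two formats seen above, plus a missing value.
dates = pd.Series(['2012', '01/2014', '07/2006', None])

pattern = r"([1-2][0-9]{3})"
# expand=False returns a Series for a single capture group.
years = dates.str.extract(pattern, expand=False).astype('float')

print(years.tolist())  # [2012.0, 2014.0, 2006.0, nan]
```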

In [19]:

dete_resignations['dete_start_date'].value_counts().sort_index()

Out[19]:

1963.0     1
1971.0     1
1972.0     1
1973.0     1
1974.0     2
1975.0     1
1976.0     2
1977.0     1
1980.0     5
1982.0     1
1983.0     2
1984.0     1
1985.0     3
1986.0     3
1987.0     1
1988.0     4
1989.0     4
1990.0     5
1991.0     4
1992.0     6
1993.0     5
1994.0     6
1995.0     4
1996.0     6
1997.0     5
1998.0     6
1999.0     8
2000.0     9
2001.0     3
2002.0     6
2003.0     6
2004.0    14
2005.0    15
2006.0    13
2007.0    21
2008.0    22
2009.0    13
2010.0    17
2011.0    24
2012.0    21
2013.0    10
Name: dete_start_date, dtype: int64

In [20]:

tafe_resignations['cease_date'].value_counts().sort_index()

Out[20]:

2009.0      2
2010.0     68
2011.0    116
2012.0     94
2013.0     55
Name: cease_date, dtype: int64

In [22]:

dete_resignations.boxplot(['cease_date','dete_start_date'])
plt.title('Cease_date and dete_Start_date in dete_survey')

Out[22]:

<matplotlib.text.Text at 0x7fb295387a58>

The cease_date values range from 2006 to 2014, with nearly all of them between 2012 and 2014, while the dete_start_date values start from 1963.

In [23]:

tafe_resignations.boxplot(['cease_date'])
plt.title('Cease_date in tafe_survey')
plt.ylim(2008, 2020)

Out[23]:

(2008, 2020)

The range of cease_date in tafe_survey (2009 to 2013) is similar to that in dete_survey.

Overall, neither data set has any major issues with the years.

Extract useful information¶

As we also want to know whether there is any relationship between an employee's length of service and resignation due to dissatisfaction, we need to derive the length of service from cease_date and dete_start_date.
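As a toy illustration of this subtraction, the sketch below uses three made-up rows. A missing start year propagates to a missing service length, and a quick filter confirms that no negative values were produced.

```python
import numpy as np
import pandas as pd

# Made-up years; the second row has no start date.
toy = pd.DataFrame({
    'cease_date': [2012.0, 2013.0, 2014.0],
    'dete_start_date': [2005.0, np.nan, 2014.0],
})

toy['institute_service'] = toy['cease_date'] - toy['dete_start_date']
negative = toy[toy['institute_service'] < 0]  # would indicate inconsistent dates

print(toy['institute_service'].tolist())  # [7.0, nan, 0.0]
print(len(negative))                      # 0
```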

In [24]:

dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']
dete_resignations.boxplot('institute_service')
plt.title('Length of institute service before resignation')

Out[24]:

<matplotlib.text.Text at 0x7fb2931835c0>

Meanwhile, we have to classify each resignation as due to dissatisfaction or not.

Let's check tafe_survey first. There are two columns about dissatisfaction: Contributing Factors. Dissatisfaction and Contributing Factors. Job Dissatisfaction. We will create a new column that combines this information.

In [25]:

tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts(dropna = False)

Out[25]:

-                                         277
Contributing Factors. Dissatisfaction      55
NaN                                         8
Name: Contributing Factors. Dissatisfaction, dtype: int64

In [26]:

tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts(dropna = False)

Out[26]:

-                      270
Job Dissatisfaction     62
NaN                      8
Name: Contributing Factors. Job Dissatisfaction, dtype: int64

Apart from NaN, both columns have only two values, so we can convert them to booleans.

In [27]:

# map a survey response to a boolean, keeping missing values as NaN
def update_vals(val):
    if pd.isnull(val):
        return np.nan
    elif val == '-':  # compare by value; 'is' checks identity and is unreliable for strings
        return False
    else:
        return True

In [28]:

tafe_resignations['Contributing Factors. Dissatisfaction'] = tafe_resignations['Contributing Factors. Dissatisfaction'].map(update_vals)
tafe_resignations['Contributing Factors. Job Dissatisfaction'] = tafe_resignations['Contributing Factors. Job Dissatisfaction'].map(update_vals)

In [29]:

# create dissatisfied column. True if any column is true.
tafe_resignations['dissatisfied'] = tafe_resignations[['Contributing Factors. Dissatisfaction',
                                                       'Contributing Factors. Job Dissatisfaction']].any(axis=1, 
                                                                                                         skipna=False)
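How missing values are combined row-wise deserves some care. As an alternative sketch, not what this notebook does, the two flags can be converted to pandas' nullable `boolean` dtype (assuming pandas 1.0 or later) and combined with `|`, which keeps a row missing when the answer is genuinely unknown. The column names `factor_a` and `factor_b` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up flags with missing values, stored as plain Python objects.
toy = pd.DataFrame({
    'factor_a': [True, False, np.nan, False],
    'factor_b': [False, False, np.nan, np.nan],
}, dtype='object')

# The nullable boolean dtype follows three-valued logic:
# True | NA is True, while False | NA stays NA.
flags = toy.astype('boolean')
dissatisfied = flags['factor_a'] | flags['factor_b']

print(dissatisfied)
```

With this approach, a row counts as dissatisfied only when at least one flag is actually True, and rows with no usable answer remain missing rather than being silently set to True or False.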

For dete_survey, we have more columns regarding dissatisfaction. Luckily, all of those columns are already of boolean type.

In [32]:

dissatisfaction_col = ['job_dissatisfaction', 'dissatisfaction_with_the_department',
                       'physical_work_environment', 'lack_of_recognition',
                       'lack_of_job_security', 'work_location', 'employment_conditions',
                       'work_life_balance', 'workload']

for col in dissatisfaction_col:
    print(col, ': \n')
    print(dete_resignations[col].value_counts(dropna = False))

job_dissatisfaction : 

False    270
True      41
Name: job_dissatisfaction, dtype: int64
dissatisfaction_with_the_department : 

False    282
True      29
Name: dissatisfaction_with_the_department, dtype: int64
physical_work_environment : 

False    305
True       6
Name: physical_work_environment, dtype: int64
lack_of_recognition : 

False    278
True      33
Name: lack_of_recognition, dtype: int64
lack_of_job_security : 

False    297
True      14
Name: lack_of_job_security, dtype: int64
work_location : 

False    293
True      18
Name: work_location, dtype: int64
employment_conditions : 

False    288
True      23
Name: employment_conditions, dtype: int64
work_life_balance : 

False    243
True      68
Name: work_life_balance, dtype: int64
workload : 

False    284
True      27
Name: workload, dtype: int64

In [117]:

# create dissatisfied column. True if any column is true.
dete_resignations['dissatisfied'] = dete_resignations[dissatisfaction_col].any(axis=1, skipna=False)

In [118]:

dete_resignations_up = dete_resignations.copy()
tafe_resignations_up = tafe_resignations.copy()

Combine dataset¶

Since both data sets are clean and ready, we will combine them. To distinguish the two institutes, we will add an extra column, institute. We will also keep only the necessary columns that exist in both data sets. Considering that the combined data set has more than 500 rows (651 in total), we can drop every column with fewer than 500 non-missing values.

In [119]:

dete_resignations_up['institute'] = 'DETE'
tafe_resignations_up['institute'] = 'TAFE'

In [120]:

combined = pd.concat([dete_resignations_up, tafe_resignations_up],
                     axis = 0,
                     ignore_index = True)
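
As a small illustration of what the concatenation produces, the sketch below stacks two made-up frames that share only some columns. The frames dete_toy and tafe_toy and their columns are hypothetical; columns missing from one frame are filled with NaN.

```python
import pandas as pd

# Hypothetical miniature versions of the two surveys
dete_toy = pd.DataFrame({'id': [1, 2],
                         'dissatisfied': [True, False],
                         'region': ['North', 'South']})
tafe_toy = pd.DataFrame({'id': [10, 11],
                         'dissatisfied': [False, True],
                         'gender': ['Female', 'Male']})

# Tag each row with its source before stacking
dete_toy['institute'] = 'DETE'
tafe_toy['institute'] = 'TAFE'

# Stack the rows; ignore_index=True renumbers the rows from 0
stacked = pd.concat([dete_toy, tafe_toy], axis=0, ignore_index=True)

print(stacked)
```

The region column is NaN for the TAFE rows and gender is NaN for the DETE rows, which is why sparse columns need to be dropped afterwards.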

In [121]:

combined_updated = combined.dropna(axis = 1, thresh = 500)
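
The thresh argument keeps a column only if it has at least that many non-missing values. A minimal sketch on a made-up frame, with a threshold of 3 instead of 500:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: columns with 4, 3 and 1 non-missing values
toy = pd.DataFrame({
    'full': [1, 2, 3, 4],
    'mostly': [1, np.nan, 3, 4],
    'sparse': [np.nan, np.nan, np.nan, 4],
})

# Keep only columns with at least 3 non-missing values
kept = toy.dropna(axis=1, thresh=3)

print(list(kept.columns))  # ['full', 'mostly']
```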

Clean the service column¶

Now that we've combined our dataframes, we're almost at a place where we can perform some kind of analysis! First, though, we'll have to clean up the institute_service column. This column is tricky to clean because it currently contains values in a couple different forms:

In [122]:

combined_updated['institute_service'].value_counts(dropna=False)

Out[122]:

NaN                   88
Less than 1 year      73
1-2                   64
3-4                   63
5-6                   33
11-20                 26
5.0                   23
1.0                   22
7-10                  21
0.0                   20
3.0                   20
6.0                   17
4.0                   16
2.0                   14
9.0                   14
7.0                   13
More than 20 years    10
8.0                    8
13.0                   8
15.0                   7
20.0                   7
12.0                   6
22.0                   6
17.0                   6
10.0                   6
14.0                   6
18.0                   5
16.0                   5
23.0                   4
24.0                   4
11.0                   4
39.0                   3
21.0                   3
32.0                   3
19.0                   3
36.0                   2
30.0                   2
26.0                   2
28.0                   2
25.0                   2
29.0                   1
31.0                   1
49.0                   1
33.0                   1
34.0                   1
35.0                   1
38.0                   1
41.0                   1
42.0                   1
27.0                   1
Name: institute_service, dtype: int64

To analyze the data, we'll convert these numbers into categories. We'll base our analysis on this article, which argues that understanding employees' needs according to career stage instead of age is more effective.

We'll use the slightly modified definitions below:

New: Less than 3 years at a company
Experienced: 3-6 years at a company
Established: 7-10 years at a company
Veteran: 11 or more years at a company

Let's categorize the values in the institute_service column using the definitions above.

In [123]:

# extract the first number (years of service) from institute_service for grouping
combined_updated = combined_updated.copy()  # independent copy, avoids SettingWithCopyWarning
combined_updated['institute_service'] = combined_updated['institute_service'].astype('str').str.extract(r'(\d+)', expand=False).astype('float')
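
To make the effect of the extraction concrete, here is a minimal sketch on a few made-up values in the same forms as institute_service. Only the first run of digits is kept, so a range such as 11-20 becomes 11.

```python
import numpy as np
import pandas as pd

# Hypothetical sample covering the different value forms
raw = pd.Series(['Less than 1 year', '1-2', '11-20', '5.0',
                 'More than 20 years', np.nan])

# Cast to string, grab the first run of digits, convert to float.
# NaN becomes the string 'nan', which has no digits and stays missing.
years = raw.astype('str').str.extract(r'(\d+)', expand=False).astype('float')

print(years.tolist())  # [1.0, 1.0, 11.0, 5.0, 20.0, nan]
```

Taking the lower bound of each range is enough here because every range falls inside a single career-stage category.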

In [124]:

# mapping value into group
def service_level(num):
    if pd.isnull(num): return np.nan
    elif num < 3: return 'New'
    elif num < 7: return 'Experienced'
    elif num < 11: return 'Established'
    else: return 'Veteran'

In [125]:

combined_updated['service_cat'] = combined_updated['institute_service'].map(service_level)
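
As an aside, the same binning could be expressed with pd.cut instead of a mapping function. This is only an alternative sketch on made-up values, not what the notebook does; the bin edges are chosen to match the definitions above.

```python
import numpy as np
import pandas as pd

# Hypothetical years of service, including the category boundaries
years = pd.Series([0.0, 2.0, 3.0, 6.0, 7.0, 10.0, 11.0, 49.0, np.nan])

# right=False makes each bin closed on the left: [0, 3), [3, 7), [7, 11), [11, inf)
service_cat = pd.cut(years,
                     bins=[0, 3, 7, 11, np.inf],
                     right=False,
                     labels=['New', 'Experienced', 'Established', 'Veteran'])

print(service_cat.tolist())
```

Missing values stay missing, and the result is an ordered categorical, which keeps the career stages in their natural order when sorting.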

In [126]:

# check the distribution of the new service_cat column
combined_updated['service_cat'].value_counts(dropna = False)

Out[126]:

New            193
Experienced    172
Veteran        136
NaN             88
Established     62
Name: service_cat, dtype: int64

Fill NaN¶

Before moving on to the analysis, we have to fill in the missing values in the dissatisfied column to prevent errors. We will fill them with the most frequent value of the column (the mode).

In [127]:

combined_updated['dissatisfied'].value_counts(dropna = False)

Out[127]:

False    403
True     240
NaN        8
Name: dissatisfied, dtype: int64

In [128]:

# False is the mode of the column; assign the result back instead of filling in place
combined_updated['dissatisfied'] = combined_updated['dissatisfied'].fillna(False)
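
Rather than hard-coding False, the fill value could also be derived from the data. A minimal sketch on a made-up Series; the names answers and most_common are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical boolean answers with one missing value
answers = pd.Series([False, True, False, np.nan, False], dtype=object)

# value_counts ignores NaN by default; idxmax returns the most frequent value
most_common = answers.value_counts().idxmax()

# Fill the gaps with that value and keep the result
filled = answers.fillna(most_common)

print(most_common, filled.tolist())
```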

Analysis and Conclusion¶

In [159]:

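# mean of the boolean dissatisfied column = share of dissatisfied employees per category
table = pd.pivot_table(combined_updated, index = 'service_cat', values = 'dissatisfied')
# the index is sorted alphabetically (Established, Experienced, New, Veteran); pos restores the career-stage order
table['pos'] = [2,1,0,3]
table['pos'] = table['pos'].astype('int')
table.sort_values(by=['pos'])['dissatisfied'].plot(kind='barh')
plt.title('Percentage of dissatisfied employees among service_cat')
# sns.set_style('white')
sns.despine(bottom = True, left= True)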

The plot above gives a clear answer to our first question: around 1/3 of the employees who worked for a short period resigned due to some kind of dissatisfaction. Among employees who worked for a longer period, the share is much higher, around 50%.
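
The numbers behind such a plot are simply the mean of a boolean column per category. The sketch below shows this on made-up data, using groupby, which should give the same figures as pivot_table with its default mean aggregation, and reindex to put the categories in career-stage order. The toy values are hypothetical and do not reproduce the survey results.

```python
import pandas as pd

# Hypothetical toy data, not the survey results
toy = pd.DataFrame({
    'service_cat': ['New', 'New', 'New', 'Experienced', 'Experienced',
                    'Established', 'Established', 'Veteran', 'Veteran'],
    'dissatisfied': [True, False, False, False, True,
                     True, True, True, False],
})

# The mean of a boolean column is the share of True values
order = ['New', 'Experienced', 'Established', 'Veteran']
share = toy.groupby('service_cat')['dissatisfied'].mean().reindex(order)

print(share)
```

Reindexing by an explicit list avoids relying on the alphabetical position of each category.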