Employee Resignations at DETE and TAFE¶

Do Causes Differ Depending on Duration of Employment?¶

In this analysis we use exit survey data to identify the sources of dissatisfaction that lead employees to resign from DETE and TAFE, and to see how those causes differ with the duration of employment.

Data Set - Origin¶

The TAFE exit survey responses can be downloaded here

The DETE exit survey responses can be downloaded here

Thank you to DataQuest for changing the encoding to UTF-8 from cp1252.

Data Set - At A Glance¶

A quick preview of the individual DETE and TAFE data sets.

In [1]:

import pandas as pd
import numpy as np

dete_survey = pd.read_csv("dete_survey.csv")
tafe_survey = pd.read_csv("tafe_survey.csv")

DETE Survey - At a Glance¶

The data set has a lot of columns: 56.

Many of the columns seem to hold information that is not relevant to the goal.
df.head() truncates the display and hides some columns, which makes it hard to quickly view sample values for every column.
df[0:5] has the same problem.
"Not Stated" is used to indicate some missing values.

In [2]:

dete_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 56 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   ID                                   822 non-null    int64 
 1   SeparationType                       822 non-null    object
 2   Cease Date                           822 non-null    object
 3   DETE Start Date                      822 non-null    object
 4   Role Start Date                      822 non-null    object
 5   Position                             817 non-null    object
 6   Classification                       455 non-null    object
 7   Region                               822 non-null    object
 8   Business Unit                        126 non-null    object
 9   Employment Status                    817 non-null    object
 10  Career move to public sector         822 non-null    bool  
 11  Career move to private sector        822 non-null    bool  
 12  Interpersonal conflicts              822 non-null    bool  
 13  Job dissatisfaction                  822 non-null    bool  
 14  Dissatisfaction with the department  822 non-null    bool  
 15  Physical work environment            822 non-null    bool  
 16  Lack of recognition                  822 non-null    bool  
 17  Lack of job security                 822 non-null    bool  
 18  Work location                        822 non-null    bool  
 19  Employment conditions                822 non-null    bool  
 20  Maternity/family                     822 non-null    bool  
 21  Relocation                           822 non-null    bool  
 22  Study/Travel                         822 non-null    bool  
 23  Ill Health                           822 non-null    bool  
 24  Traumatic incident                   822 non-null    bool  
 25  Work life balance                    822 non-null    bool  
 26  Workload                             822 non-null    bool  
 27  None of the above                    822 non-null    bool  
 28  Professional Development             808 non-null    object
 29  Opportunities for promotion          735 non-null    object
 30  Staff morale                         816 non-null    object
 31  Workplace issue                      788 non-null    object
 32  Physical environment                 817 non-null    object
 33  Worklife balance                     815 non-null    object
 34  Stress and pressure support          810 non-null    object
 35  Performance of supervisor            813 non-null    object
 36  Peer support                         812 non-null    object
 37  Initiative                           813 non-null    object
 38  Skills                               811 non-null    object
 39  Coach                                767 non-null    object
 40  Career Aspirations                   746 non-null    object
 41  Feedback                             792 non-null    object
 42  Further PD                           768 non-null    object
 43  Communication                        814 non-null    object
 44  My say                               812 non-null    object
 45  Information                          816 non-null    object
 46  Kept informed                        813 non-null    object
 47  Wellness programs                    766 non-null    object
 48  Health & Safety                      793 non-null    object
 49  Gender                               798 non-null    object
 50  Age                                  811 non-null    object
 51  Aboriginal                           16 non-null     object
 52  Torres Strait                        3 non-null      object
 53  South Sea                            7 non-null      object
 54  Disability                           23 non-null     object
 55  NESB                                 32 non-null     object
dtypes: bool(18), int64(1), object(37)
memory usage: 258.6+ KB

In [3]:

dete_survey[0:5]

Out[3]:

	ID	SeparationType	Cease Date	DETE Start Date	Role Start Date	Position	Classification	Region	Business Unit	Employment Status	...	Kept informed	Wellness programs	Health & Safety	Gender	Age	Aboriginal	Torres Strait	South Sea	Disability	NESB
0	1	Ill Health Retirement	08/2012	1984	2004	Public Servant	A01-A04	Central Office	Corporate Strategy and Peformance	Permanent Full-time	...	N	N	N	Male	56-60	NaN	NaN	NaN	NaN	Yes
1	2	Voluntary Early Retirement (VER)	08/2012	Not Stated	Not Stated	Public Servant	AO5-AO7	Central Office	Corporate Strategy and Peformance	Permanent Full-time	...	N	N	N	Male	56-60	NaN	NaN	NaN	NaN	NaN
2	3	Voluntary Early Retirement (VER)	05/2012	2011	2011	Schools Officer	NaN	Central Office	Education Queensland	Permanent Full-time	...	N	N	N	Male	61 or older	NaN	NaN	NaN	NaN	NaN
3	4	Resignation-Other reasons	05/2012	2005	2006	Teacher	Primary	Central Queensland	NaN	Permanent Full-time	...	A	N	A	Female	36-40	NaN	NaN	NaN	NaN	NaN
4	5	Age Retirement	05/2012	1970	1989	Head of Curriculum/Head of Special Education	NaN	South East	NaN	Permanent Full-time	...	N	A	M	Female	61 or older	NaN	NaN	NaN	NaN	NaN

5 rows × 56 columns
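The preview above elides the middle columns with "...". One way around this, sketched here on a made-up 56-column frame (the `wide` frame and its `col_N` names are placeholders, not the DETE data), is to lift pandas' `display.max_columns` limit temporarily.

```python
import pandas as pd

# Hypothetical wide frame standing in for the 56-column DETE data.
wide = pd.DataFrame([[0] * 56], columns=[f"col_{i}" for i in range(56)])

# Temporarily remove the column limit so no columns are replaced by "...".
with pd.option_context("display.max_columns", None):
    full_preview = repr(wide.head())

print(full_preview)
```

Using `option_context` keeps the change local to the `with` block, so the rest of the notebook keeps the default display settings.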

In [4]:

dete_survey.isnull().sum()

Out[4]:

ID                                       0
SeparationType                           0
Cease Date                               0
DETE Start Date                          0
Role Start Date                          0
Position                                 5
Classification                         367
Region                                   0
Business Unit                          696
Employment Status                        5
Career move to public sector             0
Career move to private sector            0
Interpersonal conflicts                  0
Job dissatisfaction                      0
Dissatisfaction with the department      0
Physical work environment                0
Lack of recognition                      0
Lack of job security                     0
Work location                            0
Employment conditions                    0
Maternity/family                         0
Relocation                               0
Study/Travel                             0
Ill Health                               0
Traumatic incident                       0
Work life balance                        0
Workload                                 0
None of the above                        0
Professional Development                14
Opportunities for promotion             87
Staff morale                             6
Workplace issue                         34
Physical environment                     5
Worklife balance                         7
Stress and pressure support             12
Performance of supervisor                9
Peer support                            10
Initiative                               9
Skills                                  11
Coach                                   55
Career Aspirations                      76
Feedback                                30
Further PD                              54
Communication                            8
My say                                  10
Information                              6
Kept informed                            9
Wellness programs                       56
Health & Safety                         29
Gender                                  24
Age                                     11
Aboriginal                             806
Torres Strait                          819
South Sea                              815
Disability                             799
NESB                                   790
dtype: int64

TAFE Survey - At a Glance¶

This data set is even 'busier'.

It has more columns (72), many of which are probably not relevant.

The column names are really long.
The columns seem to belong to groups, e.g. "Contributing Factors".
Both '-' and 'NaN' are used for missing values.
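To see how the two markers differ, it can help to count them separately. This is a minimal sketch on a toy frame (the values are made up; only the column names come from the TAFE data), and it relies on the fact that '-' is an ordinary string, so isnull() does not count it.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking two TAFE "Contributing Factors" columns (values are made up).
sample = pd.DataFrame({
    "Contributing Factors. Study": ["-", "Study", np.nan, "-"],
    "Contributing Factors. Travel": ["Travel", "-", np.nan, "-"],
})

# NaN is counted by isnull(); the '-' placeholder is a plain string and is not.
nan_counts = sample.isnull().sum()
dash_counts = (sample == "-").sum()

print(pd.DataFrame({"NaN": nan_counts, "'-'": dash_counts}))
```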

In [5]:

tafe_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 72 columns):
 #   Column                                                                                                                                                         Non-Null Count  Dtype  
---  ------                                                                                                                                                         --------------  -----  
 0   Record ID                                                                                                                                                      702 non-null    float64
 1   Institute                                                                                                                                                      702 non-null    object 
 2   WorkArea                                                                                                                                                       702 non-null    object 
 3   CESSATION YEAR                                                                                                                                                 695 non-null    float64
 4   Reason for ceasing employment                                                                                                                                  701 non-null    object 
 5   Contributing Factors. Career Move - Public Sector                                                                                                              437 non-null    object 
 6   Contributing Factors. Career Move - Private Sector                                                                                                             437 non-null    object 
 7   Contributing Factors. Career Move - Self-employment                                                                                                            437 non-null    object 
 8   Contributing Factors. Ill Health                                                                                                                               437 non-null    object 
 9   Contributing Factors. Maternity/Family                                                                                                                         437 non-null    object 
 10  Contributing Factors. Dissatisfaction                                                                                                                          437 non-null    object 
 11  Contributing Factors. Job Dissatisfaction                                                                                                                      437 non-null    object 
 12  Contributing Factors. Interpersonal Conflict                                                                                                                   437 non-null    object 
 13  Contributing Factors. Study                                                                                                                                    437 non-null    object 
 14  Contributing Factors. Travel                                                                                                                                   437 non-null    object 
 15  Contributing Factors. Other                                                                                                                                    437 non-null    object 
 16  Contributing Factors. NONE                                                                                                                                     437 non-null    object 
 17  Main Factor. Which of these was the main factor for leaving?                                                                                                   113 non-null    object 
 18  InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction                                                                         608 non-null    object 
 19  InstituteViews. Topic:2. I was given access to skills training to help me do my job better                                                                     613 non-null    object 
 20  InstituteViews. Topic:3. I was given adequate opportunities for personal development                                                                           610 non-null    object 
 21  InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL%                                                            608 non-null    object 
 22  InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had                                                                615 non-null    object 
 23  InstituteViews. Topic:6. The organisation recognised when staff did good work                                                                                  607 non-null    object 
 24  InstituteViews. Topic:7. Management was generally supportive of me                                                                                             614 non-null    object 
 25  InstituteViews. Topic:8. Management was generally supportive of my team                                                                                        608 non-null    object 
 26  InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me                                                          610 non-null    object 
 27  InstituteViews. Topic:10. Staff morale was positive within the Institute                                                                                       602 non-null    object 
 28  InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly                                                                                 601 non-null    object 
 29  InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently                                                                             597 non-null    object 
 30  InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly                                                                              601 non-null    object 
 31  WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit                                                609 non-null    object 
 32  WorkUnitViews. Topic:15. I worked well with my colleagues                                                                                                      605 non-null    object 
 33  WorkUnitViews. Topic:16. My job was challenging and interesting                                                                                                607 non-null    object 
 34  WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work                                                                        610 non-null    object 
 35  WorkUnitViews. Topic:18. I had sufficient contact with other people in my job                                                                                  613 non-null    object 
 36  WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job                                                   609 non-null    object 
 37  WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job                                                                               609 non-null    object 
 38  WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT]  608 non-null    object 
 39  WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job                                                                            608 non-null    object 
 40  WorkUnitViews. Topic:23. My job provided sufficient variety                                                                                                    611 non-null    object 
 41  WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job                                                                    610 non-null    object 
 42  WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction                                                        611 non-null    object 
 43  WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance                                                    606 non-null    object 
 44  WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area                       610 non-null    object 
 45  WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date  609 non-null    object 
 46  WorkUnitViews. Topic:29. There was adequate communication between staff in my unit                                                                             603 non-null    object 
 47  WorkUnitViews. Topic:30. Staff morale was positive within my work unit                                                                                         606 non-null    object 
 48  Induction. Did you undertake Workplace Induction?                                                                                                              619 non-null    object 
 49  InductionInfo. Topic:Did you undertake a Corporate Induction?                                                                                                  432 non-null    object 
 50  InductionInfo. Topic:Did you undertake a Institute Induction?                                                                                                  483 non-null    object 
 51  InductionInfo. Topic: Did you undertake Team Induction?                                                                                                        440 non-null    object 
 52  InductionInfo. Face to Face Topic:Did you undertake a Corporate Induction; Category:How it was conducted?                                                      555 non-null    object 
 53  InductionInfo. On-line Topic:Did you undertake a Corporate Induction; Category:How it was conducted?                                                           555 non-null    object 
 54  InductionInfo. Induction Manual Topic:Did you undertake a Corporate Induction?                                                                                 555 non-null    object 
 55  InductionInfo. Face to Face Topic:Did you undertake a Institute Induction?                                                                                     530 non-null    object 
 56  InductionInfo. On-line Topic:Did you undertake a Institute Induction?                                                                                          555 non-null    object 
 57  InductionInfo. Induction Manual Topic:Did you undertake a Institute Induction?                                                                                 553 non-null    object 
 58  InductionInfo. Face to Face Topic: Did you undertake Team Induction; Category?                                                                                 555 non-null    object 
 59  InductionInfo. On-line Topic: Did you undertake Team Induction?process you undertook and how it was conducted.]                                                555 non-null    object 
 60  InductionInfo. Induction Manual Topic: Did you undertake Team Induction?                                                                                       555 non-null    object 
 61  Workplace. Topic:Did you and your Manager develop a Performance and Professional Development Plan (PPDP)?                                                      608 non-null    object 
 62  Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?                                                    594 non-null    object 
 63  Workplace. Topic:Does your workplace promote and practice the principles of employment equity?                                                                 587 non-null    object 
 64  Workplace. Topic:Does your workplace value the diversity of its employees?                                                                                     586 non-null    object 
 65  Workplace. Topic:Would you recommend the Institute as an employer to others?                                                                                   581 non-null    object 
 66  Gender. What is your Gender?                                                                                                                                   596 non-null    object 
 67  CurrentAge. Current Age                                                                                                                                        596 non-null    object 
 68  Employment Type. Employment Type                                                                                                                               596 non-null    object 
 69  Classification. Classification                                                                                                                                 596 non-null    object 
 70  LengthofServiceOverall. Overall Length of Service at Institute (in years)                                                                                      596 non-null    object 
 71  LengthofServiceCurrent. Length of Service at current workplace (in years)                                                                                      596 non-null    object 
dtypes: float64(2), object(70)
memory usage: 395.0+ KB

In [6]:

tafe_survey.isnull().sum()

Out[6]:

Record ID                                                                      0
Institute                                                                      0
WorkArea                                                                       0
CESSATION YEAR                                                                 7
Reason for ceasing employment                                                  1
                                                                            ... 
CurrentAge. Current Age                                                      106
Employment Type. Employment Type                                             106
Classification. Classification                                               106
LengthofServiceOverall. Overall Length of Service at Institute (in years)    106
LengthofServiceCurrent. Length of Service at current workplace (in years)    106
Length: 72, dtype: int64

In [7]:

tafe_survey.head()

Out[7]:

	Record ID	Institute	WorkArea	CESSATION YEAR	Reason for ceasing employment	Contributing Factors. Career Move - Public Sector	Contributing Factors. Career Move - Private Sector	Contributing Factors. Career Move - Self-employment	Contributing Factors. Ill Health	Contributing Factors. Maternity/Family	...	Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?	Workplace. Topic:Does your workplace promote and practice the principles of employment equity?	Workplace. Topic:Does your workplace value the diversity of its employees?	Workplace. Topic:Would you recommend the Institute as an employer to others?	Gender. What is your Gender?	CurrentAge. Current Age	Employment Type. Employment Type	Classification. Classification	LengthofServiceOverall. Overall Length of Service at Institute (in years)	LengthofServiceCurrent. Length of Service at current workplace (in years)
0	6.341330e+17	Southern Queensland Institute of TAFE	Non-Delivery (corporate)	2010.0	Contract Expired	NaN	NaN	NaN	NaN	NaN	...	Yes	Yes	Yes	Yes	Female	26 30	Temporary Full-time	Administration (AO)	1-2	1-2
1	6.341337e+17	Mount Isa Institute of TAFE	Non-Delivery (corporate)	2010.0	Retirement	-	-	-	-	-	...	Yes	Yes	Yes	Yes	NaN	NaN	NaN	NaN	NaN	NaN
2	6.341388e+17	Mount Isa Institute of TAFE	Delivery (teaching)	2010.0	Retirement	-	-	-	-	-	...	Yes	Yes	Yes	Yes	NaN	NaN	NaN	NaN	NaN	NaN
3	6.341399e+17	Mount Isa Institute of TAFE	Non-Delivery (corporate)	2010.0	Resignation	-	-	-	-	-	...	Yes	Yes	Yes	Yes	NaN	NaN	NaN	NaN	NaN	NaN
4	6.341466e+17	Southern Queensland Institute of TAFE	Delivery (teaching)	2010.0	Resignation	-	Career Move - Private Sector	-	-	-	...	Yes	Yes	Yes	Yes	Male	41 45	Permanent Full-time	Teacher (including LVT)	3-4	3-4

5 rows × 72 columns

In [8]:

tafe_survey.loc[tafe_survey["Reason for ceasing employment"] == "Resignation"].isnull().sum()

Out[8]:

Record ID                                                                     0
Institute                                                                     0
WorkArea                                                                      0
CESSATION YEAR                                                                5
Reason for ceasing employment                                                 0
                                                                             ..
CurrentAge. Current Age                                                      50
Employment Type. Employment Type                                             50
Classification. Classification                                               50
LengthofServiceOverall. Overall Length of Service at Institute (in years)    50
LengthofServiceCurrent. Length of Service at current workplace (in years)    50
Length: 72, dtype: int64

DETE data - Mark Missing Values¶

Because "Not Stated" is used to denote missing values, the cell below re-reads the data file and tells pandas to treat that string as NaN.

In [9]:

dete_survey = pd.read_csv("dete_survey.csv", na_values="Not Stated")
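The effect of na_values is easiest to see on a small example. The sketch below uses two made-up rows in the shape of the DETE start-date columns and reads them from an in-memory string rather than the real file.

```python
import io
import pandas as pd

# Made-up rows in the shape of the DETE start-date columns.
csv_text = "ID,DETE Start Date,Role Start Date\n1,1984,2004\n2,Not Stated,Not Stated\n"

raw = pd.read_csv(io.StringIO(csv_text))
marked = pd.read_csv(io.StringIO(csv_text), na_values="Not Stated")

print(raw["DETE Start Date"].tolist())     # values stay as strings, including "Not Stated"
print(marked["DETE Start Date"].tolist())  # "Not Stated" becomes NaN, years become floats
```

This is also why the start dates show up as 1984.0 rather than 1984 after the re-read: a numeric column that contains NaN is stored as float64.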

DETE data - Drop Irrelevant Columns¶

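Thank you for telling me which columns are irrelevant and can be dropped! They are the 21 columns at positions 28 through 48 (Professional Development through Health & Safety).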

In [10]:

dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis=1)
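As a quick check on the slicing, here is a small sketch on a toy six-column frame (the column names are placeholders). The end of a positional slice is exclusive, so columns[28:49] covers positions 28 through 48, which is 21 columns and leaves 56 - 21 = 35.

```python
import pandas as pd

# Toy frame with six columns; the names are placeholders.
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=list("abcdef"))

# columns[2:5] selects positions 2, 3 and 4 (the end of the slice is exclusive).
trimmed = df.drop(columns=df.columns[2:5])

print(list(trimmed.columns))
```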

In [11]:

dete_survey_updated.head()

Out[11]:

	ID	SeparationType	Cease Date	DETE Start Date	Role Start Date	Position	Classification	Region	Business Unit	Employment Status	...	Work life balance	Workload	None of the above	Gender	Age	Aboriginal	Torres Strait	South Sea	Disability	NESB
0	1	Ill Health Retirement	08/2012	1984.0	2004.0	Public Servant	A01-A04	Central Office	Corporate Strategy and Peformance	Permanent Full-time	...	False	False	True	Male	56-60	NaN	NaN	NaN	NaN	Yes
1	2	Voluntary Early Retirement (VER)	08/2012	NaN	NaN	Public Servant	AO5-AO7	Central Office	Corporate Strategy and Peformance	Permanent Full-time	...	False	False	False	Male	56-60	NaN	NaN	NaN	NaN	NaN
2	3	Voluntary Early Retirement (VER)	05/2012	2011.0	2011.0	Schools Officer	NaN	Central Office	Education Queensland	Permanent Full-time	...	False	False	True	Male	61 or older	NaN	NaN	NaN	NaN	NaN
3	4	Resignation-Other reasons	05/2012	2005.0	2006.0	Teacher	Primary	Central Queensland	NaN	Permanent Full-time	...	False	False	False	Female	36-40	NaN	NaN	NaN	NaN	NaN
4	5	Age Retirement	05/2012	1970.0	1989.0	Head of Curriculum/Head of Special Education	NaN	South East	NaN	Permanent Full-time	...	True	False	False	Female	61 or older	NaN	NaN	NaN	NaN	NaN

5 rows × 35 columns

TAFE data - Drop Irrelevant Columns¶

It appears we do not need to make any manipulations for '-' or 'NaN' values in the TAFE survey at this stage. In retrospect, NaN is the standard pandas marker for a missing value, while '-' in the Contributing Factors columns is an ordinary string that seems to mean the factor was not selected.

Thank you for providing me again with the irrelevant columns to drop from this dataset: 49 of them, at positions 17 through 65.

In [12]:

tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis=1)

In [13]:

tafe_survey_updated.head()

Out[13]:

	Record ID	Institute	WorkArea	CESSATION YEAR	Reason for ceasing employment	Contributing Factors. Career Move - Public Sector	Contributing Factors. Career Move - Private Sector	Contributing Factors. Career Move - Self-employment	Contributing Factors. Ill Health	Contributing Factors. Maternity/Family	...	Contributing Factors. Study	Contributing Factors. Travel	Contributing Factors. Other	Contributing Factors. NONE	Gender. What is your Gender?	CurrentAge. Current Age	Employment Type. Employment Type	Classification. Classification	LengthofServiceOverall. Overall Length of Service at Institute (in years)	LengthofServiceCurrent. Length of Service at current workplace (in years)
0	6.341330e+17	Southern Queensland Institute of TAFE	Non-Delivery (corporate)	2010.0	Contract Expired	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	Female	26 30	Temporary Full-time	Administration (AO)	1-2	1-2
1	6.341337e+17	Mount Isa Institute of TAFE	Non-Delivery (corporate)	2010.0	Retirement	-	-	-	-	-	...	-	Travel	-	-	NaN	NaN	NaN	NaN	NaN	NaN
2	6.341388e+17	Mount Isa Institute of TAFE	Delivery (teaching)	2010.0	Retirement	-	-	-	-	-	...	-	-	-	NONE	NaN	NaN	NaN	NaN	NaN	NaN
3	6.341399e+17	Mount Isa Institute of TAFE	Non-Delivery (corporate)	2010.0	Resignation	-	-	-	-	-	...	-	Travel	-	-	NaN	NaN	NaN	NaN	NaN	NaN
4	6.341466e+17	Southern Queensland Institute of TAFE	Delivery (teaching)	2010.0	Resignation	-	Career Move - Private Sector	-	-	-	...	-	-	-	-	Male	41 45	Permanent Full-time	Teacher (including LVT)	3-4	3-4

5 rows × 23 columns

DETE data - Clean Column Names¶

The column names are not too long to work with, so the only modification really required is to standardize the case and spacing.

In [14]:

dete_survey_updated.columns = dete_survey_updated.columns.str.lower().str.strip().str.replace(r"\s+", "_", regex=True)

In [15]:

dete_header = dete_survey_updated.columns
dete_header

Out[15]:

Index(['id', 'separationtype', 'cease_date', 'dete_start_date',
       'role_start_date', 'position', 'classification', 'region',
       'business_unit', 'employment_status', 'career_move_to_public_sector',
       'career_move_to_private_sector', 'interpersonal_conflicts',
       'job_dissatisfaction', 'dissatisfaction_with_the_department',
       'physical_work_environment', 'lack_of_recognition',
       'lack_of_job_security', 'work_location', 'employment_conditions',
       'maternity/family', 'relocation', 'study/travel', 'ill_health',
       'traumatic_incident', 'work_life_balance', 'workload',
       'none_of_the_above', 'gender', 'age', 'aboriginal', 'torres_strait',
       'south_sea', 'disability', 'nesb'],
      dtype='object')
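The clean-up can be tried on a handful of names first. This sketch uses a few DETE column names, one of them padded with extra whitespace on purpose, and passes regex=True explicitly, because in pandas 2.0 and later str.replace treats the pattern as a literal string by default.

```python
import pandas as pd

# A few DETE column names; the extra whitespace is added to exercise the clean-up.
cols = pd.Index(["SeparationType", "Cease Date", " DETE  Start Date ", "Maternity/family"])

# Lower-case, trim the ends, then collapse each run of whitespace into one underscore.
cleaned = cols.str.lower().str.strip().str.replace(r"\s+", "_", regex=True)

print(list(cleaned))
```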

TAFE data - Rename Columns¶

This dataset does not have very suitable column names to work with. Below I will gladly replace them with the names of the equivalent columns in the DETE data set. Where there is no equivalent, I'll use simplified, non-redundant names.

In [16]:

tafe_col_map = {'Record ID': 'id',
                'CESSATION YEAR': 'cease_date',
                'Reason for ceasing employment': 'separationtype',
                'Gender. What is your Gender?': 'gender',
                'CurrentAge. Current Age': 'age',
                'Employment Type. Employment Type': 'employment_status',
                'Classification. Classification': 'position',
                'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
                'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}

tafe_survey_updated.rename(columns=tafe_col_map, inplace=True)

In [17]:

tafe_header=tafe_survey_updated.columns
tafe_header

Out[17]:

Index(['id', 'Institute', 'WorkArea', 'cease_date', 'separationtype',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE', 'gender',
       'age', 'employment_status', 'position', 'institute_service',
       'role_service'],
      dtype='object')

DETE and TAFE data - Isolating Resignations¶

We are interested in learning more about resignations resulting from job dissatisfaction, so we can isolate these entries and disregard other reasons for leaving such as retirement or reaching the end of a contract.

This information is likely stored in the separationtype column, and I expect the two data sets to use different values in this column.

In [18]:

dete_survey_updated["separationtype"].value_counts()

Out[18]:

Age Retirement                          285
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Voluntary Early Retirement (VER)         67
Ill Health Retirement                    61
Other                                    49
Contract Expired                         34
Termination                              15
Name: separationtype, dtype: int64

In [19]:

dete_resignations = dete_survey_updated[dete_survey_updated["separationtype"].str.contains("Resignation")].copy()
dete_resignations["separationtype"].value_counts()

Out[19]:

Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Name: separationtype, dtype: int64

In [20]:

tafe_survey_updated["separationtype"].value_counts()

Out[20]:

Resignation                 340
Contract Expired            127
Retrenchment/ Redundancy    104
Retirement                   82
Transfer                     25
Termination                  23
Name: separationtype, dtype: int64

The following lines give me an error about null values, so I need to check the TAFE data for null values in the separationtype column.

tafe_resignations = tafe_survey_updated[tafe_survey_updated["separationtype"].str.contains("Resignation")].copy()
tafe_resignations["separationtype"].value_counts()
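One way to avoid this kind of error without dropping the row first is the na parameter of Series.str.contains, which sets the result used for missing values. This is only a minimal sketch on made-up data (the small Series below is hypothetical, not taken from the survey):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the separationtype column, including one missing value
separation = pd.Series(["Resignation", "Retirement", np.nan, "Resignation"])

# na=False treats missing entries as "no match", so the mask is purely boolean
mask = separation.str.contains("Resignation", na=False)
resignations = separation[mask]
print(resignations)
```

With a purely boolean mask, the missing entry is simply excluded from the selection instead of breaking the indexing step.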

In [21]:

temp_missing = tafe_survey_updated['separationtype'].isnull()
tafe_survey_updated[temp_missing]

Out[21]:

	id	Institute	WorkArea	cease_date	separationtype	Contributing Factors. Career Move - Public Sector	Contributing Factors. Career Move - Private Sector	Contributing Factors. Career Move - Self-employment	Contributing Factors. Ill Health	Contributing Factors. Maternity/Family	...	Contributing Factors. Study	Contributing Factors. Travel	Contributing Factors. Other	Contributing Factors. NONE	gender	age	employment_status	position	institute_service	role_service
324	6.345804e+17	Sunshine Coast Institute of TAFE	Non-Delivery (corporate)	2011.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

1 rows × 23 columns

There is almost no information in this entry to use so we will drop it from the tafe_survey_updated dataset.

In [22]:

tafe_survey_updated.dropna(subset=["separationtype"], axis=0, inplace=True)
tafe_survey_updated.reset_index()

Out[22]:

	index	id	Institute	WorkArea	cease_date	separationtype	Contributing Factors. Career Move - Public Sector	Contributing Factors. Career Move - Private Sector	Contributing Factors. Career Move - Self-employment	Contributing Factors. Ill Health	...	Contributing Factors. Study	Contributing Factors. Travel	Contributing Factors. Other	Contributing Factors. NONE	gender	age	employment_status	position	institute_service	role_service
0	0	6.341330e+17	Southern Queensland Institute of TAFE	Non-Delivery (corporate)	2010.0	Contract Expired	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	Female	26 30	Temporary Full-time	Administration (AO)	1-2	1-2
1	1	6.341337e+17	Mount Isa Institute of TAFE	Non-Delivery (corporate)	2010.0	Retirement	-	-	-	-	...	-	Travel	-	-	NaN	NaN	NaN	NaN	NaN	NaN
2	2	6.341388e+17	Mount Isa Institute of TAFE	Delivery (teaching)	2010.0	Retirement	-	-	-	-	...	-	-	-	NONE	NaN	NaN	NaN	NaN	NaN	NaN
3	3	6.341399e+17	Mount Isa Institute of TAFE	Non-Delivery (corporate)	2010.0	Resignation	-	-	-	-	...	-	Travel	-	-	NaN	NaN	NaN	NaN	NaN	NaN
4	4	6.341466e+17	Southern Queensland Institute of TAFE	Delivery (teaching)	2010.0	Resignation	-	Career Move - Private Sector	-	-	...	-	-	-	-	Male	41 45	Permanent Full-time	Teacher (including LVT)	3-4	3-4
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
696	697	6.350668e+17	Barrier Reef Institute of TAFE	Delivery (teaching)	2013.0	Resignation	Career Move - Public Sector	-	-	-	...	-	-	-	-	Male	51-55	Temporary Full-time	Teacher (including LVT)	1-2	1-2
697	698	6.350677e+17	Southern Queensland Institute of TAFE	Non-Delivery (corporate)	2013.0	Resignation	Career Move - Public Sector	-	-	-	...	-	-	-	-	NaN	NaN	NaN	NaN	NaN	NaN
698	699	6.350704e+17	Tropical North Institute of TAFE	Delivery (teaching)	2013.0	Resignation	-	-	-	-	...	-	-	Other	-	Female	51-55	Permanent Full-time	Teacher (including LVT)	5-6	1-2
699	700	6.350712e+17	Southbank Institute of Technology	Non-Delivery (corporate)	2013.0	Contract Expired	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	Female	41 45	Temporary Full-time	Professional Officer (PO)	1-2	1-2
700	701	6.350730e+17	Tropical North Institute of TAFE	Non-Delivery (corporate)	2013.0	Resignation	-	-	Career Move - Self-employment	-	...	-	Travel	-	-	Female	26 30	Contract/casual	Administration (AO)	3-4	1-2

701 rows × 24 columns

Check the row was successfully removed:

In [23]:

temp2_missing = tafe_survey_updated['separationtype'].isnull()
tafe_survey_updated[temp2_missing]

Out[23]:

	id	Institute	WorkArea	cease_date	separationtype	Contributing Factors. Career Move - Public Sector	Contributing Factors. Career Move - Private Sector	Contributing Factors. Career Move - Self-employment	Contributing Factors. Ill Health	Contributing Factors. Maternity/Family	...	Contributing Factors. Study	Contributing Factors. Travel	Contributing Factors. Other	Contributing Factors. NONE	gender	age	employment_status	position	institute_service	role_service

0 rows × 23 columns

Now I try again to isolate resignations from the TAFE data, and this time it works:

In [24]:

tafe_resignations = tafe_survey_updated[tafe_survey_updated["separationtype"].str.contains("Resignation")].copy()
tafe_resignations["separationtype"].value_counts()

Out[24]:

Resignation    340
Name: separationtype, dtype: int64

DETE and TAFE data - Clean Date Format and Check for Inconsistencies¶

In order to verify the integrity of the data, I want to make sure the date values of the resignations are consistent with logical expectation.

The cease_date values in the DETE data set seem valid, but some include the month while others do not. I would like to extract just the year.

The TAFE data set only provides the year of the cease_date so I am just ensuring it is also stored as a float.

In [25]:

dete_resignations["cease_date"].value_counts()

Out[25]:

2012       126
2013        74
01/2014     22
12/2013     17
06/2013     14
09/2013     11
11/2013      9
07/2013      9
10/2013      6
08/2013      4
05/2013      2
05/2012      2
2010         1
09/2010      1
07/2006      1
07/2012      1
Name: cease_date, dtype: int64

I noticed there are missing values for cease_date. As shown below, the vectorized string methods used to extract the year keep the NaN values and simply skip them when performing their operations.
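To illustrate that behaviour in isolation, here is a small sketch on made-up values (the Series is hypothetical; only the regular expression matches the one used in this notebook). It uses expand=False so that str.extract returns a Series:

```python
import numpy as np
import pandas as pd

# Hypothetical cease_date values: a plain year, a month/year string and a missing value
dates = pd.Series(["2012", "09/2013", np.nan])

# Extract the four-digit year; the missing value passes through as NaN
years = dates.str.extract(r"([0-9]{4})", expand=False).astype(float)
print(years)
```

The missing entry stays NaN after both the extraction and the conversion to float, so no separate handling is needed at this step.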

In [26]:

dete_resignations[dete_resignations["cease_date"].isnull()]

Out[26]:

	id	separationtype	cease_date	dete_start_date	role_start_date	position	classification	region	business_unit	employment_status	...	work_life_balance	workload	none_of_the_above	gender	age	aboriginal	torres_strait	south_sea	disability	nesb
683	685	Resignation-Other employer	NaN	2011.0	2012.0	Teacher	Primary	Central Queensland	NaN	Permanent Full-time	...	False	False	False	Male	21-25	NaN	NaN	NaN	NaN	NaN
694	696	Resignation-Other reasons	NaN	2012.0	NaN	Teacher Aide	NaN	Metropolitan	NaN	Casual	...	False	False	False	Female	46-50	NaN	NaN	NaN	NaN	NaN
704	706	Resignation-Other reasons	NaN	2006.0	2007.0	Teacher Aide	NaN	Darling Downs South West	NaN	Permanent Full-time	...	False	False	False	Female	41-45	NaN	NaN	NaN	NaN	NaN
709	711	Resignation-Other employer	NaN	NaN	NaN	Teacher	Primary	Central Office	Education Queensland	Permanent Full-time	...	True	True	False	Female	51-55	NaN	NaN	NaN	NaN	NaN
724	726	Resignation-Other reasons	NaN	1984.0	NaN	Teacher	Primary	Darling Downs South West	NaN	Permanent Full-time	...	False	False	False	Female	46-50	NaN	NaN	NaN	NaN	NaN
770	772	Resignation-Other reasons	NaN	1987.0	1987.0	Cleaner	NaN	Darling Downs South West	NaN	Permanent Part-time	...	False	False	True	Female	61 or older	NaN	NaN	NaN	NaN	NaN
774	776	Resignation-Other employer	NaN	2005.0	2005.0	Teacher Aide	NaN	Central Queensland	NaN	Permanent Part-time	...	False	False	True	Female	41-45	NaN	NaN	NaN	NaN	NaN
788	790	Resignation-Other employer	NaN	1990.0	2010.0	Teacher	Secondary	Metropolitan	NaN	Permanent Full-time	...	False	False	False	Female	41-45	NaN	NaN	NaN	NaN	NaN
791	793	Resignation-Other reasons	NaN	2007.0	2007.0	Public Servant	A01-A04	Metropolitan	NaN	Permanent Part-time	...	False	False	False	Female	46-50	NaN	NaN	NaN	NaN	NaN
797	799	Resignation-Move overseas/interstate	NaN	2000.0	2013.0	Public Servant	A01-A04	South East	NaN	Permanent Part-time	...	False	False	False	Female	36-40	NaN	NaN	NaN	NaN	NaN
798	800	Resignation-Move overseas/interstate	NaN	1995.0	NaN	Teacher Aide	NaN	Darling Downs South West	NaN	Permanent Part-time	...	False	False	False	Female	36-40	NaN	NaN	NaN	NaN	NaN

11 rows × 35 columns

In [27]:

dete_resignations["cease_year"] = dete_resignations["cease_date"].str.extract(r"([0-9]{4})").astype(float)
dete_resignations["cease_year"]

Out[27]:

3      2012.0
5      2012.0
8      2012.0
9      2012.0
11     2012.0
        ...  
808    2013.0
815    2014.0
816    2014.0
819    2014.0
821    2013.0
Name: cease_year, Length: 311, dtype: float64

In [28]:

dete_resignations["cease_year"].value_counts().sort_index(ascending=False)

Out[28]:

2014.0     22
2013.0    146
2012.0    129
2010.0      2
2006.0      1
Name: cease_year, dtype: int64

In [29]:

tafe_resignations["cease_year"] = tafe_resignations["cease_date"].astype(float)

tafe_resignations["cease_year"].value_counts().sort_index(ascending=False)

Out[29]:

2013.0     55
2012.0     94
2011.0    116
2010.0     68
2009.0      2
Name: cease_year, dtype: int64

DETE data - Generate column for Employment Duration¶

The DETE data doesn't have an equivalent to the TAFE data's institute_service column storing the time the employee spent at DETE so we will create this column using the dates corresponding to the beginning and end of the employee's service.

Now it's time to check the integrity of the dete_start_date column. I used the cut method to make 10 groups of 5 years each to quickly get a sense of the values. No value is after 2013, which is also a good sign.

In [30]:

dete_resignations["dete_start_date"] = dete_resignations["dete_start_date"].astype(float)
pd.cut(dete_resignations["dete_start_date"], 10).value_counts()

Out[30]:

(2008.0, 2013.0]     85
(2003.0, 2008.0]     85
(1998.0, 2003.0]     32
(1993.0, 1998.0]     27
(1988.0, 1993.0]     24
(1983.0, 1988.0]     12
(1978.0, 1983.0]      8
(1973.0, 1978.0]      6
(1968.0, 1973.0]      3
(1962.95, 1968.0]     1
Name: dete_start_date, dtype: int64

In [31]:

dete_resignations["institute_service"] = dete_resignations["cease_year"] - dete_resignations["dete_start_date"].astype(float)
pd.cut(dete_resignations["institute_service"], 7).value_counts()

Out[31]:

(-0.049, 7.0]    145
(7.0, 14.0]       52
(14.0, 21.0]      36
(21.0, 28.0]      21
(28.0, 35.0]      10
(35.0, 42.0]       8
(42.0, 49.0]       1
Name: institute_service, dtype: int64

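The lowest bin, (-0.049, 7.0], makes it look like someone has a negative length of service. That is only the way cut displays its bins: it extends the range slightly below the minimum so that the lowest value is included. If we look closer, no institute_service values are actually negative.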

In [32]:

dete_resignations.loc[dete_resignations["institute_service"] <0]

Out[32]:

	id	separationtype	cease_date	dete_start_date	role_start_date	position	classification	region	business_unit	employment_status	...	none_of_the_above	gender	age	aboriginal	torres_strait	south_sea	disability	nesb	cease_year	institute_service

0 rows × 37 columns
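The negative bin edge can be reproduced on a few made-up values. This is a sketch under the assumption that the data ranges from 0 to 49, as the bins above suggest; the Series itself is hypothetical:

```python
import pandas as pd

# Hypothetical years of service spanning the same range as the bins above
service = pd.Series([0, 3, 10, 21, 49])

# With an integer number of bins, pd.cut widens the range slightly
# so that the minimum value falls inside the first (right-closed) bin
binned = pd.cut(service, 7)
lowest_bin = binned.cat.categories[0]

print(lowest_bin)
print((service < 0).any())
```

The left edge of the first bin is below zero even though none of the values are.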

TAFE data - Cleaning Dissatisfaction columns¶

The column headings and the values are the same long strings so I will store them as variables to reference.

The values in "Contributing Factors. [Job] Dissatisfaction" are not as described in the instructions. They are not True, False or NaN. They are the dissatisfaction label itself ("Contributing Factors. Dissatisfaction" or "Job Dissatisfaction") or "-". I therefore interpret "-" to be False, not NaN.

The instructions have a lot of steps ... I feel my solution below is a more elegant way of doing the same thing. Unless it didn't actually do the same thing, that is!

In [33]:

disstr = 'Contributing Factors. Dissatisfaction'
job_disstr = 'Contributing Factors. Job Dissatisfaction'

In [34]:

tafe_resignations["my_dissatisfied"] = (tafe_resignations[disstr].str.contains("Dissatisfaction") | tafe_resignations[job_disstr].str.contains("Dissatisfaction"))
tafe_resignations["my_dissatisfied"].value_counts()

Out[34]:

False    249
True      91
Name: my_dissatisfied, dtype: int64

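I compared with the recommended method, which keeps NaNs, and there is in fact a difference in the result: my version counts the rows with missing responses as False (249 False, versus 241 with the recommended method).

So '-' is specifically a False response, while a missing value means no response at all. The two should not be treated as the same thing.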

In [35]:

tafe_resignations[disstr].value_counts()

Out[35]:

-                                         277
Contributing Factors. Dissatisfaction      55
Name: Contributing Factors. Dissatisfaction, dtype: int64

In [36]:

tafe_resignations[job_disstr].value_counts()

Out[36]:

-                      270
Job Dissatisfaction     62
Name: Contributing Factors. Job Dissatisfaction, dtype: int64

In [37]:

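def update_vals(val):
    # Missing response: keep it as NaN
    if pd.isnull(val):
        return np.nan
    # "-" means the factor was not selected
    elif val == "-":
        return False
    # Any other value means the factor was selected
    else:
        return True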

In [38]:

tafe_resignations[[disstr, job_disstr]] = tafe_resignations[[disstr, job_disstr]].applymap(update_vals)
tafe_resignations["dissatisfied"] = tafe_resignations[[disstr, job_disstr]].any(axis=1, skipna=False)
tafe_resignations["dissatisfied"].value_counts()

Out[38]:

False    241
True      91
Name: dissatisfied, dtype: int64

DETE data - Cleaning Dissatisfaction columns¶

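There are multiple columns related to dissatisfaction in the DETE data. According to the result below, every resignation has at least one of these reasons indicated, which is suspicious. These columns already hold True/False values, and update_vals returns True for any value other than "-" or NaN, so it most likely turns every False into True. Skipping the applymap step for the DETE columns would avoid that.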

In [39]:

discols = ["job_dissatisfaction", 
           "dissatisfaction_with_the_department", 
           "physical_work_environment",
          "lack_of_recognition",
          "lack_of_job_security",
          "work_location",
          "employment_conditions",
          "work_life_balance",
          "workload"]

dete_resignations[discols] = dete_resignations[discols].applymap(update_vals)
dete_resignations["dissatisfied"] = dete_resignations[discols].any(axis=1, skipna=False)
dete_resignations["dissatisfied"].value_counts()

Out[39]:

True    311
Name: dissatisfied, dtype: int64
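A result where all 311 rows are True deserves a second look. The sketch below uses a hypothetical two-column frame of booleans to show what update_vals does to values that are already True/False, and how any() behaves when the conversion is skipped:

```python
import numpy as np
import pandas as pd

def update_vals(val):
    if pd.isnull(val):
        return np.nan
    elif val == "-":
        return False
    else:
        return True

# Hypothetical frame with boolean factor columns, like the DETE data
demo = pd.DataFrame({
    "job_dissatisfaction": [False, True, False],
    "workload": [False, False, False],
})

# False is neither null nor "-", so update_vals maps it to True
converted = demo.apply(lambda col: col.map(update_vals))

# Using the boolean columns directly keeps the original answers
direct = demo.any(axis=1)

print(converted)
print(direct)
```

In this sketch the converted frame is True everywhere, while the direct version only flags the row where a factor was actually selected.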

In [40]:

dete_resignations_up = dete_resignations.copy()
tafe_resignations_up = tafe_resignations.copy()

Combining the TAFE and DETE Data¶

In order to easily identify which set the data came from originally, I'm going to add a column to store the name of the institute.

In [41]:

dete_resignations_up["institute"] = "DETE"
tafe_resignations_up["institute"] = "TAFE"

The instructions say to combine the two data sets on the "institute_service" column and then use dropna.

In [42]:

combined = pd.concat([dete_resignations_up, 
                    tafe_resignations_up], join="outer")
combined["institute"].value_counts()

Out[42]:

TAFE    340
DETE    311
Name: institute, dtype: int64

In [43]:

combined.describe(include='all')

Out[43]:

	id	separationtype	cease_date	dete_start_date	role_start_date	position	classification	region	business_unit	employment_status	...	Contributing Factors. Maternity/Family	Contributing Factors. Dissatisfaction	Contributing Factors. Job Dissatisfaction	Contributing Factors. Interpersonal Conflict	Contributing Factors. Study	Contributing Factors. Travel	Contributing Factors. Other	Contributing Factors. NONE	role_service	my_dissatisfied
count	6.510000e+02	651	635	283.000000	271.000000	598	161	265	32	597	...	332	332	332	332	332	332	332	332	290	340
unique	NaN	4	21	NaN	NaN	21	8	8	9	6	...	2	2	2	2	2	2	2	2	7	2
top	NaN	Resignation	2012	NaN	NaN	Administration (AO)	Primary	Central Queensland	Education Queensland	Permanent Full-time	...	-	False	False	-	-	-	-	-	Less than 1 year	False
freq	NaN	340	126	NaN	NaN	148	58	45	14	256	...	312	277	270	308	316	315	246	316	92	249
mean	3.314265e+17	NaN	NaN	2002.067138	1999.653137	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
std	3.172210e+17	NaN	NaN	9.914479	109.965675	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
min	4.000000e+00	NaN	NaN	1963.000000	200.000000	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
25%	4.525000e+02	NaN	NaN	1997.000000	2004.000000	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
50%	6.341820e+17	NaN	NaN	2005.000000	2009.000000	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
75%	6.345770e+17	NaN	NaN	2010.000000	2011.000000	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
max	6.350730e+17	NaN	NaN	2013.000000	2013.000000	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

11 rows × 55 columns

Now I'll drop any columns with too few non-null values. A threshold of 500 has been provided, presumably because it keeps all columns that were common to the 2 data sets but nothing specific to just one of them.

In [44]:

combined_updated = combined.dropna(axis=1, thresh=500)
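For reference, thresh is the minimum number of non-null values a column needs in order to be kept. The sketch below uses a hypothetical four-row frame and a threshold of 3. It also calls .copy() on the result, which should keep pandas from raising SettingWithCopyWarning when new columns are assigned later; that is an assumption about how it would behave in this notebook, not something tested here.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one well-populated column and one mostly empty column
demo = pd.DataFrame({
    "shared": [1.0, 2.0, 3.0, 4.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})

# Keep only the columns with at least 3 non-null values,
# and take an explicit copy so later assignments act on an independent frame
kept = demo.dropna(axis=1, thresh=3).copy()

kept["shared_doubled"] = kept["shared"] * 2
print(kept.columns.tolist())
```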

In [45]:

combined_updated.describe(include='all')

Out[45]:

	id	separationtype	cease_date	position	employment_status	gender	age	cease_year	institute_service	dissatisfied	institute
count	6.510000e+02	651	635	598	597	592	596	635.000000	563	643	651
unique	NaN	4	21	21	6	2	17	NaN	49	2	2
top	NaN	Resignation	2012	Administration (AO)	Permanent Full-time	Female	51-55	NaN	Less than 1 year	True	TAFE
freq	NaN	340	126	148	256	424	71	NaN	73	402	340
mean	3.314265e+17	NaN	NaN	NaN	NaN	NaN	NaN	2011.963780	NaN	NaN	NaN
std	3.172210e+17	NaN	NaN	NaN	NaN	NaN	NaN	1.079028	NaN	NaN	NaN
min	4.000000e+00	NaN	NaN	NaN	NaN	NaN	NaN	2006.000000	NaN	NaN	NaN
25%	4.525000e+02	NaN	NaN	NaN	NaN	NaN	NaN	2011.000000	NaN	NaN	NaN
50%	6.341820e+17	NaN	NaN	NaN	NaN	NaN	NaN	2012.000000	NaN	NaN	NaN
75%	6.345770e+17	NaN	NaN	NaN	NaN	NaN	NaN	2013.000000	NaN	NaN	NaN
max	6.350730e+17	NaN	NaN	NaN	NaN	NaN	NaN	2014.000000	NaN	NaN	NaN

Combined Data: Normalizing Years of Service¶

In order to normalize the years of service and then group them, I first want to verify the format of the values.

Given the formats, I made these changes:

Set "Less than 1 year" to be equal to 0
Used the first value in a range because it falls in the same service category as the last value
Removed the trailing .0, otherwise the strings would not match values as desired.

In [46]:

combined_updated["institute_service"].value_counts()

Out[46]:

Less than 1 year      73
1-2                   64
3-4                   63
5-6                   33
11-20                 26
5.0                   23
1.0                   22
7-10                  21
3.0                   20
0.0                   20
6.0                   17
4.0                   16
9.0                   14
2.0                   14
7.0                   13
More than 20 years    10
13.0                   8
8.0                    8
20.0                   7
15.0                   7
14.0                   6
17.0                   6
12.0                   6
10.0                   6
22.0                   6
18.0                   5
16.0                   5
24.0                   4
23.0                   4
11.0                   4
39.0                   3
19.0                   3
21.0                   3
32.0                   3
36.0                   2
25.0                   2
26.0                   2
28.0                   2
30.0                   2
42.0                   1
35.0                   1
49.0                   1
34.0                   1
38.0                   1
33.0                   1
29.0                   1
27.0                   1
41.0                   1
31.0                   1
Name: institute_service, dtype: int64

In [47]:

combined_updated["institute_service"] = combined_updated["institute_service"].astype(str).str.replace("Less than 1 year", "0").str.lower().str.replace(r"[a-z]*", "", regex=True).str.replace(r"\..*", "", regex=True).str.replace(r"-.*", "", regex=True).str.strip()

<ipython-input-47-aec85db4c104>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  combined_updated["institute_service"] = combined_updated["institute_service"].astype(str).str.replace("Less than 1 year","0").str.lower().str.replace("[a-z]*", "").str.replace("\..*","").str.replace("-.*","").str.strip()

A simple way of filling NaN values so I can convert to float.

In [48]:

combined_updated.loc[combined_updated["institute_service"]== '', "institute_service"] = "NaN"

/dataquest/system/env/python3/lib/python3.8/site-packages/pandas/core/indexing.py:966: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s

In [49]:

combined_updated["institute_service"].value_counts()

Out[49]:

0      93
NaN    88
1      86
3      83
5      56
7      34
11     30
6      17
20     17
4      16
9      14
2      14
13      8
8       8
15      7
22      6
17      6
12      6
10      6
14      6
18      5
16      5
24      4
23      4
32      3
19      3
21      3
39      3
28      2
26      2
25      2
36      2
30      2
49      1
29      1
41      1
34      1
42      1
31      1
38      1
27      1
33      1
35      1
Name: institute_service, dtype: int64

In [50]:

combined_updated["institute_service"] = combined_updated["institute_service"].astype(float)

<ipython-input-50-2a554ebc0312>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  combined_updated["institute_service"] = combined_updated["institute_service"].astype(float)

In [51]:

def years_to_stage(years):
    # Comparisons with NaN are always False, so missing values reach the else branch
    if years < 3:
        return "New"
    elif years < 7:
        return "Experienced"
    elif years < 11:
        return "Established"
    elif years >= 11:
        return "Veteran"
    else:
        return "NaN"
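The same grouping could also be sketched with pd.cut instead of a custom function. This is an alternative, not the approach used in this notebook, and the sample values are made up. Note that it leaves missing values as real NaN rather than the string "NaN":

```python
import numpy as np
import pandas as pd

# Hypothetical years of service, including one missing value
years = pd.Series([0, 2, 3, 6, 7, 10, 11, 49, np.nan])

# right=False makes each bin include its left edge: [0, 3), [3, 7), [7, 11), [11, inf)
stages = pd.cut(
    years,
    bins=[0, 3, 7, 11, np.inf],
    right=False,
    labels=["New", "Experienced", "Established", "Veteran"],
)
print(stages)
```

Because the labels are passed in order, the result is an ordered categorical, which also sorts from New to Veteran without an extra mapping step.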

In [52]:

combined_updated["service_cat"] = combined_updated["institute_service"].apply(years_to_stage)
combined_updated["service_cat"].value_counts(dropna=False)

<ipython-input-52-5c8916c54d27>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  combined_updated["service_cat"] = combined_updated["institute_service"].apply(years_to_stage)

Out[52]:

New            193
Experienced    172
Veteran        136
NaN             88
Established     62
Name: service_cat, dtype: int64

A quick check of the Established category confirms that institute_service durations of 7, 8, 9 and 10 years were included.

In [53]:

combined_updated.loc[combined_updated["service_cat"] == "Established", "institute_service"].value_counts()

Out[53]:

7.0     34
9.0     14
8.0      8
10.0     6
Name: institute_service, dtype: int64

Combined Data - Fill in Missing Values for Dissatisfied¶

Before creating the pivot table I need to fill in missing values. The method in the instructions requires finding the most frequent response and filling the NaN values with it. The most frequent response ended up being "True".
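As a small sketch of that idea, the most frequent response can be looked up rather than typed in by hand. The values below are hypothetical, and where() is used here simply as one way to fill the gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical responses with one missing value
responses = pd.Series([True, True, False, np.nan], dtype=object)

# value_counts ignores NaN by default; idxmax returns the most frequent response
most_common = responses.value_counts().idxmax()

# Keep existing answers and use the most frequent response only where one is missing
filled = responses.where(responses.notna(), most_common)
print(filled.tolist())
```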

In [54]:

combined_updated["dissatisfied"].value_counts(dropna=False)

Out[54]:

True     402
False    241
NaN        8
Name: dissatisfied, dtype: int64

In [55]:

combined_updated["dissatisfied"] = combined_updated["dissatisfied"].fillna(value=True)
combined_updated["dissatisfied"].value_counts()

<ipython-input-55-dfde65416531>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  combined_updated["dissatisfied"] = combined_updated["dissatisfied"].fillna(value=True)

Out[55]:

True     410
False    241
Name: dissatisfied, dtype: int64

Combined Data - Pivot Table¶

Now I can check the frequency of dissatisfaction depending on the duration of service by creating a pivot table and a plot.

True is equal to 1 and False is equal to 0. The pivot table will provide the mean of the values.
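A tiny made-up example shows why the mean works as a proportion here. The frame below is hypothetical; pivot_table aggregates with the mean by default:

```python
import pandas as pd

# Hypothetical data: 1 of 4 "New" employees and 2 of 2 "Veteran" employees are dissatisfied
demo = pd.DataFrame({
    "service_cat": ["New", "New", "New", "New", "Veteran", "Veteran"],
    "dissatisfied": [True, False, False, False, True, True],
})

# The mean of a boolean column is the share of True values in each group
pt_demo = demo.pivot_table(index="service_cat", values="dissatisfied")
print(pt_demo)
```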

In [56]:

pt = combined_updated.pivot_table(index="service_cat", values="dissatisfied")
pt

Out[56]:

	dissatisfied
service_cat
Established	0.774194
Experienced	0.581395
NaN	0.681818
New	0.476684
Veteran	0.808824

Combined Data - Plot¶

Since there are just 5 categories of values a bar graph should display the information well.

In [57]:

%matplotlib inline
pt.plot(kind="bar", ylim=(0,1))

Out[57]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fb519b2af10>

It would be easier to see trends if the bars were ordered by the service duration categories:

["New", "Experienced", "Established", "Veteran"] and then show "NaN" at the end.

I found this solution on Stack Overflow:

weekdays = ['Mon', 'Tues', 'Weds', 'Thurs', 'Fri', 'Sat', 'Sun']

mapping = {day: i for i, day in enumerate(weekdays)}

key = df['day'].map(mapping)

And the sorting is simple:

df.iloc[key.argsort()]

In [58]:

ser_cats = ["New", "Experienced", "Established", "Veteran", "NaN"]

mapping = {cat: i+1 for i, cat in enumerate(ser_cats)}

combined_updated["num_service_cat"] = combined_updated["service_cat"].map(mapping)

<ipython-input-58-63de5dbcd4b0>:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  combined_updated["num_service_cat"] = combined_updated["service_cat"].map(mapping)

In [59]:

pt2 = combined_updated.pivot_table(index="num_service_cat", values="dissatisfied")
pt2

Out[59]:

	dissatisfied
num_service_cat
1	0.476684
2	0.581395
3	0.774194
4	0.808824
5	0.681818
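Another option, sketched here as an alternative to the numeric mapping, is to reorder the first pivot table directly with reindex so the category names stay on the axis. The frame below is rebuilt by hand from the values shown in Out[56]:

```python
import pandas as pd

# Pivot table values copied from Out[56] above
pt_demo = pd.DataFrame(
    {"dissatisfied": [0.774194, 0.581395, 0.681818, 0.476684, 0.808824]},
    index=pd.Index(
        ["Established", "Experienced", "NaN", "New", "Veteran"],
        name="service_cat",
    ),
)

ser_cats = ["New", "Experienced", "Established", "Veteran", "NaN"]

# reindex returns the rows in the requested label order
pt_ordered = pt_demo.reindex(ser_cats)
print(pt_ordered)
```

Plotting pt_ordered with the same plot(kind="bar", ylim=(0, 1)) call would then label the bars with the category names instead of the numbers 1 to 5.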

In [61]:

pt2.plot(kind='bar', ylim=(0,1), title="Employees are more likely to resign due to dissatisfaction the longer their service.\n 1=New 2=Experienced 3=Established 4=Veteran 5=Unknown")

Out[61]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fb5503a01c0>

Conclusion¶

After a lot of cleaning I was able to create a bar graph indicating that the longer an employee was in service, the more likely they are to have resigned with some feelings of dissatisfaction.

Proposed Improvements¶

Improve how missing values are dealt with: filling based on the most frequent response might not be the best strategy.
Introduce a level of dissatisfaction: instead of using the any function to see if any of the contributing factors are present, count up the contributing factors to make a dissatisfaction score and plot its mean.
Find correlations with age.
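The dissatisfaction score idea could look roughly like the sketch below. The frame and its values are hypothetical, with three factor columns named after DETE columns; the score is the number of factors marked True in each row:

```python
import pandas as pd

# Hypothetical data with three boolean contributing-factor columns
demo = pd.DataFrame({
    "service_cat": ["New", "New", "Veteran", "Veteran"],
    "job_dissatisfaction": [False, True, True, True],
    "workload": [False, False, True, True],
    "work_life_balance": [False, False, False, True],
})
factor_cols = ["job_dissatisfaction", "workload", "work_life_balance"]

# Summing booleans across a row counts how many factors were selected
demo["dissatisfaction_score"] = demo[factor_cols].sum(axis=1)

# Mean score per service category, ready for a bar plot
mean_score = demo.groupby("service_cat")["dissatisfaction_score"].mean()
print(mean_score)
```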