In this guided project, we'll work with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia.
We'll play the role of data analyst and pretend our stakeholders want to know the following:

* Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?

* Are younger employees resigning due to some kind of dissatisfaction? What about older employees?

They want us to combine the results for both surveys to answer these questions. However, although both used the same survey template, one of them customized some of the answers. In the guided steps, we'll aim to do most of the data cleaning and get you started analyzing the first question.

Below is a preview of a couple columns we'll work with from the dete_survey.csv:

Below is a preview of a couple columns we'll work with from the tafe_survey.csv:

In [1]:

# Importing the libraries needed and reading the datasets.
import pandas as pd
import numpy as np

dete_survey = pd.read_csv('dete_survey.csv')

tafe_survey = pd.read_csv('tafe_survey.csv')

In [2]:

# Getting details about the dete_survey dataset.
print("Dataset rows: ",dete_survey.shape[0]," Dataset columns: ",dete_survey.shape[1])
print('\n')
dete_survey.info()

Dataset rows:  822  Dataset columns:  56


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 56 columns):
ID                                     822 non-null int64
SeparationType                         822 non-null object
Cease Date                             822 non-null object
DETE Start Date                        822 non-null object
Role Start Date                        822 non-null object
Position                               817 non-null object
Classification                         455 non-null object
Region                                 822 non-null object
Business Unit                          126 non-null object
Employment Status                      817 non-null object
Career move to public sector           822 non-null bool
Career move to private sector          822 non-null bool
Interpersonal conflicts                822 non-null bool
Job dissatisfaction                    822 non-null bool
Dissatisfaction with the department    822 non-null bool
Physical work environment              822 non-null bool
Lack of recognition                    822 non-null bool
Lack of job security                   822 non-null bool
Work location                          822 non-null bool
Employment conditions                  822 non-null bool
Maternity/family                       822 non-null bool
Relocation                             822 non-null bool
Study/Travel                           822 non-null bool
Ill Health                             822 non-null bool
Traumatic incident                     822 non-null bool
Work life balance                      822 non-null bool
Workload                               822 non-null bool
None of the above                      822 non-null bool
Professional Development               808 non-null object
Opportunities for promotion            735 non-null object
Staff morale                           816 non-null object
Workplace issue                        788 non-null object
Physical environment                   817 non-null object
Worklife balance                       815 non-null object
Stress and pressure support            810 non-null object
Performance of supervisor              813 non-null object
Peer support                           812 non-null object
Initiative                             813 non-null object
Skills                                 811 non-null object
Coach                                  767 non-null object
Career Aspirations                     746 non-null object
Feedback                               792 non-null object
Further PD                             768 non-null object
Communication                          814 non-null object
My say                                 812 non-null object
Information                            816 non-null object
Kept informed                          813 non-null object
Wellness programs                      766 non-null object
Health & Safety                        793 non-null object
Gender                                 798 non-null object
Age                                    811 non-null object
Aboriginal                             16 non-null object
Torres Strait                          3 non-null object
South Sea                              7 non-null object
Disability                             23 non-null object
NESB                                   32 non-null object
dtypes: bool(18), int64(1), object(37)
memory usage: 258.6+ KB

In [3]:

dete_survey.head()

Out[3]:

	ID	SeparationType	Cease Date	DETE Start Date	Role Start Date	Position	Classification	Region	Business Unit	Employment Status	...	Kept informed	Wellness programs	Health & Safety	Gender	Age	Aboriginal	Torres Strait	South Sea	Disability	NESB
0	1	Ill Health Retirement	08/2012	1984	2004	Public Servant	A01-A04	Central Office	Corporate Strategy and Peformance	Permanent Full-time	...	N	N	N	Male	56-60	NaN	NaN	NaN	NaN	Yes
1	2	Voluntary Early Retirement (VER)	08/2012	Not Stated	Not Stated	Public Servant	AO5-AO7	Central Office	Corporate Strategy and Peformance	Permanent Full-time	...	N	N	N	Male	56-60	NaN	NaN	NaN	NaN	NaN
2	3	Voluntary Early Retirement (VER)	05/2012	2011	2011	Schools Officer	NaN	Central Office	Education Queensland	Permanent Full-time	...	N	N	N	Male	61 or older	NaN	NaN	NaN	NaN	NaN
3	4	Resignation-Other reasons	05/2012	2005	2006	Teacher	Primary	Central Queensland	NaN	Permanent Full-time	...	A	N	A	Female	36-40	NaN	NaN	NaN	NaN	NaN
4	5	Age Retirement	05/2012	1970	1989	Head of Curriculum/Head of Special Education	NaN	South East	NaN	Permanent Full-time	...	N	A	M	Female	61 or older	NaN	NaN	NaN	NaN	NaN

5 rows × 56 columns

In [4]:

# Getting details about the tafe_survey dataset.
print("Dataset rows: ",tafe_survey.shape[0]," Dataset columns: ",tafe_survey.shape[1])
print('\n')
tafe_survey.info()

Dataset rows: 702 Dataset columns: 72

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 72 columns):
Record ID 702 non-null float64
Institute 702 non-null object
WorkArea 702 non-null object
CESSATION YEAR 695 non-null float64
Reason for ceasing employment 701 non-null object
Contributing Factors. Career Move - Public Sector 437 non-null object
Contributing Factors. Career Move - Private Sector 437 non-null object
Contributing Factors. Career Move - Self-employment 437 non-null object
Contributing Factors. Ill Health 437 non-null object
Contributing Factors. Maternity/Family 437 non-null object
Contributing Factors. Dissatisfaction 437 non-null object
Contributing Factors. Job Dissatisfaction 437 non-null object
Contributing Factors. Interpersonal Conflict 437 non-null object
Contributing Factors. Study 437 non-null object
Contributing Factors. Travel 437 non-null object
Contributing Factors. Other 437 non-null object
Contributing Factors. NONE 437 non-null object
Main Factor. Which of these was the main factor for leaving? 113 non-null object
InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction 608 non-null object
InstituteViews. Topic:2. I was given access to skills training to help me do my job better 613 non-null object
InstituteViews. Topic:3. I was given adequate opportunities for personal development 610 non-null object
InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL% 608 non-null object
InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had 615 non-null object
InstituteViews. Topic:6. The organisation recognised when staff did good work 607 non-null object
InstituteViews. Topic:7. Management was generally supportive of me 614 non-null object
InstituteViews. Topic:8. Management was generally supportive of my team 608 non-null object
InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me 610 non-null object
InstituteViews. Topic:10. Staff morale was positive within the Institute 602 non-null object
InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly 601 non-null object
InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently 597 non-null object
InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly 601 non-null object
WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit 609 non-null object
WorkUnitViews. Topic:15. I worked well with my colleagues 605 non-null object
WorkUnitViews. Topic:16. My job was challenging and interesting 607 non-null object
WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work 610 non-null object
WorkUnitViews. Topic:18. I had sufficient contact with other people in my job 613 non-null object
WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job 609 non-null object
WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job 609 non-null object
WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT] 608 non-null object
WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job 608 non-null object
WorkUnitViews. Topic:23. My job provided sufficient variety 611 non-null object
WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job 610 non-null object
WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction 611 non-null object
WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance 606 non-null object
WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area 610 non-null object
WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date 609 non-null object
WorkUnitViews. Topic:29. There was adequate communication between staff in my unit 603 non-null object
WorkUnitViews. Topic:30. Staff morale was positive within my work unit 606 non-null object
Induction. Did you undertake Workplace Induction? 619 non-null object
InductionInfo. Topic:Did you undertake a Corporate Induction? 432 non-null object
InductionInfo. Topic:Did you undertake a Institute Induction? 483 non-null object
InductionInfo. Topic: Did you undertake Team Induction? 440 non-null object
InductionInfo. Face to Face Topic:Did you undertake a Corporate Induction; Category:How it was conducted? 555 non-null object
InductionInfo. On-line Topic:Did you undertake a Corporate Induction; Category:How it was conducted? 555 non-null object
InductionInfo. Induction Manual Topic:Did you undertake a Corporate Induction? 555 non-null object
InductionInfo. Face to Face Topic:Did you undertake a Institute Induction? 530 non-null object
InductionInfo. On-line Topic:Did you undertake a Institute Induction? 555 non-null object
InductionInfo. Induction Manual Topic:Did you undertake a Institute Induction? 553 non-null object
InductionInfo. Face to Face Topic: Did you undertake Team Induction; Category? 555 non-null object
InductionInfo. On-line Topic: Did you undertake Team Induction?process you undertook and how it was conducted.] 555 non-null object
InductionInfo. Induction Manual Topic: Did you undertake Team Induction? 555 non-null object
Workplace. Topic:Did you and your Manager develop a Performance and Professional Development Plan (PPDP)? 608 non-null object
Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination? 594 non-null object
Workplace. Topic:Does your workplace promote and practice the principles of employment equity? 587 non-null object
Workplace. Topic:Does your workplace value the diversity of its employees? 586 non-null object
Workplace. Topic:Would you recommend the Institute as an employer to others? 581 non-null object
Gender. What is your Gender? 596 non-null object
CurrentAge. Current Age 596 non-null object
Employment Type. Employment Type 596 non-null object
Classification. Classification 596 non-null object
LengthofServiceOverall. Overall Length of Service at Institute (in years) 596 non-null object
LengthofServiceCurrent. Length of Service at current workplace (in years) 596 non-null object
dtypes: float64(2), object(70)
memory usage: 395.0+ KB

In [5]:

tafe_survey.head()

Out[5]:

	Record ID	Institute	WorkArea	CESSATION YEAR	Reason for ceasing employment	Contributing Factors. Career Move - Public Sector	Contributing Factors. Career Move - Private Sector	Contributing Factors. Career Move - Self-employment	Contributing Factors. Ill Health	Contributing Factors. Maternity/Family	...	Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?	Workplace. Topic:Does your workplace promote and practice the principles of employment equity?	Workplace. Topic:Does your workplace value the diversity of its employees?	Workplace. Topic:Would you recommend the Institute as an employer to others?	Gender. What is your Gender?	CurrentAge. Current Age	Employment Type. Employment Type	Classification. Classification	LengthofServiceOverall. Overall Length of Service at Institute (in years)	LengthofServiceCurrent. Length of Service at current workplace (in years)
0	6.341330e+17	Southern Queensland Institute of TAFE	Non-Delivery (corporate)	2010.0	Contract Expired	NaN	NaN	NaN	NaN	NaN	...	Yes	Yes	Yes	Yes	Female	26 30	Temporary Full-time	Administration (AO)	1-2	1-2
1	6.341337e+17	Mount Isa Institute of TAFE	Non-Delivery (corporate)	2010.0	Retirement	-	-	-	-	-	...	Yes	Yes	Yes	Yes	NaN	NaN	NaN	NaN	NaN	NaN
2	6.341388e+17	Mount Isa Institute of TAFE	Delivery (teaching)	2010.0	Retirement	-	-	-	-	-	...	Yes	Yes	Yes	Yes	NaN	NaN	NaN	NaN	NaN	NaN
3	6.341399e+17	Mount Isa Institute of TAFE	Non-Delivery (corporate)	2010.0	Resignation	-	-	-	-	-	...	Yes	Yes	Yes	Yes	NaN	NaN	NaN	NaN	NaN	NaN
4	6.341466e+17	Southern Queensland Institute of TAFE	Delivery (teaching)	2010.0	Resignation	-	Career Move - Private Sector	-	-	-	...	Yes	Yes	Yes	Yes	Male	41 45	Permanent Full-time	Teacher (including LVT)	3-4	3-4

5 rows × 72 columns

At first glance, a few things stand out about these datasets:

> The dete_survey dataframe contains 'Not Stated' values that indicate values are missing, but they aren't represented as NaN.

> Both the dete_survey and tafe_survey dataframes contain many columns that we don't need to complete our analysis.

> Each dataframe contains many of the same columns, but the column names are different. There are multiple columns/answers that indicate an employee resigned because they were dissatisfied.

In [6]:

# Reading 'Not Stated' values in as NaN and dropping unused columns from the dete_survey dataset.
dete_survey = pd.read_csv('dete_survey.csv',na_values='Not Stated')
columns_to_drop = dete_survey.columns[28:49]
dete_survey.drop(columns_to_drop,axis=1,inplace=True)
print("Dataset rows: ",dete_survey.shape[0]," Dataset columns: ",dete_survey.shape[1])
print('\n')

Dataset rows:  822  Dataset columns:  35

In [7]:

# Dropping unused columns from the tafe_survey dataset.
columns_to_drop = tafe_survey.columns[17:66]
tafe_survey.drop(columns_to_drop,axis=1,inplace=True)
print("Dataset rows: ",tafe_survey.shape[0]," Dataset columns: ",tafe_survey.shape[1])
print('\n')

Dataset rows:  702  Dataset columns:  23

I re-read the dete_survey dataset so that "Not Stated" values are treated as NaN, the standard missing-value marker, and I dropped several unused columns from both datasets.
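
As a small illustration of what na_values does, here is a sketch on made-up CSV text (the two-column layout and the values are assumptions for the example, not the real survey file):

```python
import io
import pandas as pd

# Made-up CSV text standing in for dete_survey.csv (not the real data).
csv_text = "ID,DETE Start Date\n1,1984\n2,Not Stated\n3,2011\n"

raw = pd.read_csv(io.StringIO(csv_text))
clean = pd.read_csv(io.StringIO(csv_text), na_values="Not Stated")

# Without na_values the column is read as text; with it, the marker becomes
# NaN and the remaining values are parsed as numbers.
print(raw["DETE Start Date"].tolist())
print(clean["DETE Start Date"].isna().sum(), clean["DETE Start Date"].dtype)
```

This is also why the start years can later be used in arithmetic without any extra conversion.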

Next, let's turn our attention to the column names. Each dataframe contains many of the same columns, but the column names are different. Below are some of the columns we'd like to use for our final analysis:

Because we eventually want to combine them, we'll have to standardize the column names.

Here are the criteria we'll use to rename the dete_survey columns:

* Make all the capitalization lowercase.

* Remove any trailing whitespace from the end of the strings.

* Replace spaces with underscores ('_').

In [8]:

# Standardizing the columns' names.
dete_survey.columns =  dete_survey.columns.str.lower().str.strip().str.replace(' ','_')

In [9]:

# Checking the new column names.
dete_survey.columns

Out[9]:

Index(['id', 'separationtype', 'cease_date', 'dete_start_date',
       'role_start_date', 'position', 'classification', 'region',
       'business_unit', 'employment_status', 'career_move_to_public_sector',
       'career_move_to_private_sector', 'interpersonal_conflicts',
       'job_dissatisfaction', 'dissatisfaction_with_the_department',
       'physical_work_environment', 'lack_of_recognition',
       'lack_of_job_security', 'work_location', 'employment_conditions',
       'maternity/family', 'relocation', 'study/travel', 'ill_health',
       'traumatic_incident', 'work_life_balance', 'workload',
       'none_of_the_above', 'gender', 'age', 'aboriginal', 'torres_strait',
       'south_sea', 'disability', 'nesb'],
      dtype='object')
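
The renaming chain can be checked on a small, made-up index. The sketch below (the column names are hypothetical) also suggests why stripping should come before replacing spaces: otherwise trailing whitespace turns into a trailing underscore.

```python
import pandas as pd

# Hypothetical column names, one of them with trailing whitespace.
cols = pd.Index(["Cease Date", "DETE Start Date ", "SeparationType"])

# Same order as in the notebook: lowercase, strip, then replace spaces.
standardized = cols.str.lower().str.strip().str.replace(" ", "_")

# Replacing before stripping leaves a stray underscore at the end.
wrong_order = cols.str.lower().str.replace(" ", "_").str.strip()

print(list(standardized))
print(list(wrong_order))
```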

For tafe_survey, we'll rename the following columns:

* 'Record ID': 'id'

* 'CESSATION YEAR': 'cease_date'

* 'Reason for ceasing employment': 'separationtype'

* 'Gender. What is your Gender?': 'gender'

* 'CurrentAge. Current Age': 'age'

* 'Employment Type. Employment Type': 'employment_status'

* 'Classification. Classification': 'position'

* 'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service'

* 'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'

In [10]:

# Mapping old column names to new names.
columns_update = {
"Record ID":"id",
"CESSATION YEAR":"cease_date",
'Reason for ceasing employment': 'separationtype',
'Gender. What is your Gender?': 'gender',
'CurrentAge. Current Age': 'age',
'Employment Type. Employment Type': 'employment_status',
'Classification. Classification': 'position',
'LengthofServiceOverall. Overall Length of Service at Institute (in years)':'institute_service',
'LengthofServiceCurrent. Length of Service at current workplace (in years)':'role_service'
}
# Renaming columns. Not touched columns will be left as they are.
tafe_survey.rename(columns=columns_update,inplace=True)

In [11]:

tafe_survey.columns

Out[11]:

Index(['id', 'Institute', 'WorkArea', 'cease_date', 'separationtype',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE', 'gender',
       'age', 'employment_status', 'position', 'institute_service',
       'role_service'],
      dtype='object')

The renaming step matters because the two datasets used different names for equivalent columns; now the columns we need share the same names in both datasets.
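
One way to confirm which names the two datasets now share is a set intersection over their columns. This is only a sketch on empty toy frames; the column lists are shortened stand-ins, not the full surveys:

```python
import pandas as pd

# Toy frames that carry column names only (shortened, illustrative lists).
dete_toy = pd.DataFrame(columns=["id", "separationtype", "cease_date", "dete_start_date", "age"])
tafe_toy = pd.DataFrame(columns=["id", "Institute", "cease_date", "separationtype", "age", "institute_service"])

# Names present in both frames, sorted for a stable display.
shared = sorted(set(dete_toy.columns) & set(tafe_toy.columns))
print(shared)
```

Running the same intersection on the real dataframes would list every column that lines up after renaming.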

Recall that our end goal is to answer the following questions:

* Are employees who have only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been at the job longer?

* Are younger employees resigning due to some kind of dissatisfaction? What about older employees?

To answer these questions, we first have to filter the data, keeping only the rows for employees who resigned.

In [12]:

# Fetching employees who have resigned from the tafe_survey dataset.
# DataFrame.copy() is used so that later assignments don't trigger SettingWithCopyWarning.
tafe_resignations = tafe_survey.loc[tafe_survey["separationtype"] == "Resignation",:].copy()

In [13]:

# Using regex to get employees who have resigned from dete_survey dataset
pattern = "Resignation"

dete_resignations = dete_survey.loc[dete_survey["separationtype"].str.contains(pattern,na=False),:].copy()

#dete_survey[dete_survey["separationtype"].str.contains(pattern,na=False)]["separationtype"].value_counts()
#dete_survey["separationtype"].value_counts()

Both datasets are now filtered down to the data we need: the rows for employees who resigned.
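
The DETE filter relies on str.contains because that survey records several resignation variants. A minimal sketch on made-up values (written in the style of the DETE separation types, not taken from the file):

```python
import numpy as np
import pandas as pd

# Made-up separation types, including a missing value.
separation = pd.Series([
    "Resignation-Other reasons",
    "Age Retirement",
    "Resignation-Move overseas/interstate",
    np.nan,
])

# na=False makes missing values count as "no match", so the mask is purely
# Boolean and can be used directly for row selection.
mask = separation.str.contains("Resignation", na=False)
print(mask.tolist())
```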

In [14]:

# Validating the cease_date column from dete_resignations dataset.
    #dete_resignations["cease_date"].str[-4:].astype(float).value_counts()
    #dete_resignations["cease_date"].str[-4:].astype(float).sort_index(ascending= False)
print("Max job start year: ",dete_resignations["dete_start_date"].max())
print("Min job start year: ",dete_resignations["dete_start_date"].min())
print("Min resignation year: ",dete_resignations["cease_date"].str[-4:].astype(float).min())
print("Max resignation year: ",dete_resignations["cease_date"].str[-4:].astype(float).max())
print("Null Values: ",dete_resignations["cease_date"].str[-4:].isna().sum())
print("Total: ",dete_resignations["cease_date"].value_counts().sum())

Max job start year:  2013.0
Min job start year:  1963.0
Min resignation year:  2006.0
Max resignation year:  2014.0
Null Values:  11
Total:  300

In [15]:

# Validating the cease_date column from the tafe_resignations dataset.
print("Max resignation year: ", tafe_resignations["cease_date"].max())
print("Min resignation year: ", tafe_resignations["cease_date"].min())
print("Null Values: ", tafe_resignations["cease_date"].isna().sum())
print("Total: ", tafe_resignations["cease_date"].value_counts().sum())

Max resignation year:  2013.0
Min resignation year:  2009.0
Null Values:  5
Total:  335

There aren't any major issues with the years, although the two dataframes don't cover quite the same range: DETE resignations span 2006 to 2014, while TAFE resignations span 2009 to 2013.

In the Human Resources field, the length of time an employee spent in a workplace is referred to as their years of service.

You may have noticed that the tafe_resignations dataframe already contains a length-of-service column, which we renamed to institute_service. In order to analyze both surveys together, we'll have to create a corresponding institute_service column in dete_resignations.

In [16]:

# Creating an institute_service column in dete_resignations.
dete_resignations.loc[:,"institute_service"] = dete_resignations.loc[:,"cease_date"].str[-4:].astype(float) - dete_resignations.loc[:,"dete_start_date"]
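
To make the subtraction concrete, here is a sketch on three made-up rows. The column names match dete_resignations, but the values are invented, and the check for negative tenures is an extra sanity step rather than part of the original workflow:

```python
import numpy as np
import pandas as pd

# Toy rows: cease dates as 'MM/YYYY' strings, start years as floats.
toy = pd.DataFrame({
    "cease_date": ["05/2012", "08/2013", np.nan],
    "dete_start_date": [2005.0, 2013.0, 1999.0],
})

# Year of leaving (last four characters) minus starting year.
toy["institute_service"] = toy["cease_date"].str[-4:].astype(float) - toy["dete_start_date"]

# A negative tenure would mean a start year later than the cease year.
negative = int((toy["institute_service"] < 0).sum())
print(toy["institute_service"].tolist(), negative)
```

Rows with a missing cease date or start year end up with a missing tenure, which is why some NaN values remain in institute_service later on.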

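Below are the columns we'll use to categorize employees as "dissatisfied" from each dataframe.

* tafe_resignations: 'Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction'

* dete_resignations: job_dissatisfaction, dissatisfaction_with_the_department, physical_work_environment, lack_of_recognition, lack_of_job_security, work_location, employment_conditions, work_life_balance, workload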

The columns in dete_resignations are already Boolean (True or False), flagging whether each factor contributed to the employee leaving. The corresponding tafe_resignations columns need to be converted to the same format (True, False, or NaN), which is what the cell below does.

In [17]:

# Convert a raw TAFE answer to a Boolean: NaN stays NaN, '-' means the factor
# was not selected (False), and any other value means it was selected (True).
def update_vals(var):
    if pd.isnull(var):
        return np.nan
    elif var == '-':
        return False
    else:
        return True

# Applying the function element-wise to the two dissatisfaction columns.
factor_cols = ["Contributing Factors. Dissatisfaction","Contributing Factors. Job Dissatisfaction"]
tafe_resignations[factor_cols] = tafe_resignations.loc[:,factor_cols].applymap(update_vals)
tafe_resignations.loc[683:690,factor_cols]

# tafe_resignations["Contributing Factors. Dissatisfaction"] = tafe_resignations.loc[:,["Contributing Factors. Dissatisfaction","Contributing Factors. Job Dissatisfaction"]].applymap(update_vals)["Contributing Factors. Dissatisfaction"]
# tafe_resignations["Contributing Factors. Job Dissatisfaction"] = tafe_resignations.loc[:,["Contributing Factors. Dissatisfaction","Contributing Factors. Job Dissatisfaction"]].applymap(update_vals)["Contributing Factors. Job Dissatisfaction"]

Out[17]:

	Contributing Factors. Dissatisfaction	Contributing Factors. Job Dissatisfaction
683	False	False
684	False	False
685	True	True
686	False	False
688	False	False
689	False	True
690	False	False
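
The three cases of the conversion can be seen on a tiny made-up Series. The function below is a condensed variant of update_vals written for this sketch, and Series.map is used here only because a single column is involved:

```python
import numpy as np
import pandas as pd

def to_flag(val):
    """Condensed variant of update_vals: NaN -> NaN, '-' -> False, else True."""
    if pd.isnull(val):
        return np.nan
    return val != "-"

# Made-up answers in the raw TAFE style.
answers = pd.Series(["-", "Job Dissatisfaction", np.nan])
converted = answers.map(to_flag)
print(converted.tolist())
```

Keeping NaN as NaN, rather than forcing it to False, preserves the difference between "did not select this factor" and "did not answer".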

Now the relevant columns in each dataset are Boolean flags: if any of them is True for a row, the employee left at least partly because of dissatisfaction. It therefore makes sense to collapse them into a single column for analysis, which will be named dissatisfied.

In [18]:

causes = ["job_dissatisfaction","dissatisfaction_with_the_department","physical_work_environment","lack_of_recognition","lack_of_job_security","work_location","employment_conditions","work_life_balance","workload"]
dete_resignations["dissatisfied"] = dete_resignations.loc[:,causes].any(axis=1)
tafe_resignations["dissatisfied"] = tafe_resignations.loc[:,["Contributing Factors. Dissatisfaction","Contributing Factors. Job Dissatisfaction"]].any(axis=1)
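
For readers unfamiliar with any(axis=1), the sketch below shows the row-wise behaviour on toy flags. Two column names are borrowed from the DETE list, the values are made up, and only plain Boolean columns are used:

```python
import pandas as pd

# Toy Boolean flags, one row per employee.
flags = pd.DataFrame({
    "job_dissatisfaction": [False, False, True],
    "workload": [True, False, True],
})

# axis=1 scans across the columns of each row: the result is True when at
# least one flag in that row is True.
flags["dissatisfied"] = flags[["job_dissatisfaction", "workload"]].any(axis=1)
print(flags["dissatisfied"].tolist())
```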

We now have the information needed to answer the first question, so it's time to combine the two datasets. To keep track of where each row came from, I'll first add an institute column to each dataset, and I'll keep only the columns needed for the analysis.

In [19]:

# Flagging each dataset with its origin before combining them.
dete_resignations["institute"] = "DETE"
tafe_resignations["institute"] = "TAFE"
# Keeping only the necessary columns.
tafe_resignations_final = tafe_resignations.loc[:,["dissatisfied","institute_service","institute"]]
dete_resignations_final = dete_resignations.loc[:,["dissatisfied","institute_service","institute"]]
print("Tafe shape:", tafe_resignations_final.shape)
print("Dete shape:", dete_resignations_final.shape)

Tafe shape: (340, 3)
Dete shape: (311, 3)

In [20]:

# Combining the two datasets.
combined = pd.concat([tafe_resignations_final,dete_resignations_final],axis=0)

In [21]:

combined.head()

Out[21]:

	dissatisfied	institute_service	institute
3	False	NaN	TAFE
4	False	3-4	TAFE
5	False	7-10	TAFE
6	False	3-4	TAFE
7	False	3-4	TAFE
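
Note that the combined frame keeps the row labels of the original surveys (3, 4, 5, ... above), so the same label can appear once per institute. The sketch below uses two made-up frames to show this, along with the optional ignore_index=True argument of pd.concat, which the notebook itself does not use:

```python
import pandas as pd

# Two toy frames whose index labels overlap (values are made up).
a = pd.DataFrame({"institute": ["TAFE", "TAFE"]}, index=[3, 4])
b = pd.DataFrame({"institute": ["DETE", "DETE"]}, index=[3, 5])

kept = pd.concat([a, b], axis=0)                      # original labels kept
fresh = pd.concat([a, b], axis=0, ignore_index=True)  # labels renumbered

print(kept.index.tolist())
print(fresh.index.tolist())
```

Duplicate labels are harmless for the aggregations that follow, but they matter if rows are later selected with .loc by label.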

Now that we've combined our data frames, we're almost at a place where we can perform some kind of analysis! First, though, we'll have to clean up the institute_service column. This column is tricky to clean because it currently contains values in a couple of different forms:

In [22]:

combined["institute_service"].head()

Out[22]:

3     NaN
4     3-4
5    7-10
6     3-4
7     3-4
Name: institute_service, dtype: object

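Based on this article (https://bwnews.pr/36Bi8s4), I changed my assumption: understanding employees' needs according to career stage, rather than age, is more effective. I'll use the slightly modified definitions below:

* New: less than 3 years at the institute

* Experienced: 3-6 years at the institute

* Established: 7-10 years at the institute

* Veteran: 11 or more years at the institute

But first, I have to standardize the institute_service column.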

In [23]:

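# Standardizing the institute_service column: keep the first number found in each
# value (e.g. '3-4' becomes 3.0) and convert it to float.

combined['institute_service'] = combined['institute_service'].astype('str').str.extract(r'(\d+)', expand=False).astype('float')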
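
To see what the extraction does to each kind of value, here is a sketch on a made-up Series. The phrase-style entries are assumptions about how such survey answers might look; only the range and float shapes appear in the outputs above:

```python
import numpy as np
import pandas as pd

# Made-up values: a range, two phrases, a float from the DETE subtraction, NaN.
service = pd.Series(["3-4", "Less than 1 year", 7.0, np.nan, "More than 20 years"])

# Cast to text, keep the first run of digits, then convert to float.
years = service.astype(str).str.extract(r"(\d+)", expand=False).astype(float)
print(years.tolist())
```

Because only the first number is kept, a range such as '3-4' is recorded by its lower bound, which is a simplification worth keeping in mind when reading the categories.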

In [24]:

# Classify years of service into career stages.
def classifier(tenure):
    if pd.isnull(tenure):
        return np.nan
    elif tenure < 3:
        return "New"
    elif tenure <= 6:
        return "Experienced"
    elif tenure <= 10:
        return "Established"
    else:
        return "Veteran"

In [25]:

combined["service_cat"] = combined["institute_service"].apply(classifier)
combined["service_cat"].value_counts(dropna=False)

Out[25]:

New            193
Experienced    172
Veteran        136
NaN             88
Established     62
Name: service_cat, dtype: int64

The cell above sorts the years of service into a few buckets, which makes the analysis easier.
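
As an alternative to a hand-written classifier, pandas' pd.cut can produce the same kind of buckets. This is a sketch under the assumption that tenures are whole years; the bin edges are chosen for this example and the input values are made up:

```python
import numpy as np
import pandas as pd

# Made-up tenures in whole years, plus one missing value.
tenure = pd.Series([0, 2, 3, 6, 7, 10, 11, np.nan])

# Left-closed bins: [-inf, 3), [3, 7), [7, 11), [11, inf).
service_cat = pd.cut(
    tenure,
    bins=[-np.inf, 3, 7, 11, np.inf],
    right=False,
    labels=["New", "Experienced", "Established", "Veteran"],
)
print(service_cat.tolist())
```

pd.cut returns an ordered categorical, so the categories sort from New to Veteran instead of alphabetically, which can make later tables and plots easier to read.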

In [26]:

combined.loc[combined["service_cat"] == "New","dissatisfied"].value_counts(dropna=False)

Out[26]:

False    136
True      57
Name: dissatisfied, dtype: int64

In [27]:

combined.loc[combined["service_cat"] == "Experienced", "dissatisfied"].value_counts(dropna=False)

Out[27]:

False    113
True      59
Name: dissatisfied, dtype: int64

In [28]:

combined.loc[combined["service_cat"] == "Established", "dissatisfied"].value_counts(dropna=False)

Out[28]:

True     32
False    30
Name: dissatisfied, dtype: int64

In [29]:

combined["service_cat"].isnull().sum()

Out[29]:

88

In [30]:

combined["dissatisfied"].value_counts(dropna=False)

Out[30]:

False    411
True     240
Name: dissatisfied, dtype: int64

In [31]:

# Calculating the percentage of employees who resigned due to dissatisfaction in each category
dis_pct = combined.pivot_table(index='service_cat', values='dissatisfied')

# Plot the results
%matplotlib inline
dis_pct.plot(kind='bar', rot=30)

Out[31]:

<matplotlib.axes._subplots.AxesSubplot at 0x9627160>
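
The plotted values are shares because the mean of a Boolean column equals the fraction of True values. A minimal sketch with an equivalent groupby on made-up rows (column names match the combined frame, values are invented):

```python
import pandas as pd

# Toy data: four "New" employees, two "Veteran" employees.
toy = pd.DataFrame({
    "service_cat": ["New", "New", "New", "New", "Veteran", "Veteran"],
    "dissatisfied": [True, False, False, False, True, False],
})

# Mean of True/False per group = share of dissatisfied resignations.
share = toy.groupby("service_cat")["dissatisfied"].mean()
print(share.to_dict())
```

Rows with a missing service_cat are left out of the groups, so they do not contribute to any bar.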
