Working on guided projects gives us hands-on experience with real-world examples, which also means they'll be more challenging than missions. However, now we have more tools we can use to clean and transform data, including:
apply()
, map()
, and applymap()
methods to transform data:Method |
Series or Dataframe Method | Applies Functions Element-wise? |
---|---|---|
Map | Series | Yes |
Apply | Series | Yes |
Applymap | Dataframe | Yes |
Apply | Dataframe | No, applies functions along an axis |
fillna()
, dropna()
, and drop()
methods to drop missing or unnecessary valuesmelt()
function to reshape dataconcat()
and merge()
functions to combine dataIn this guided project, we'll work with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. wecan find the TAFE exit survey here and the survey for the DETE here. We've made some slight modifications to these datasets to make them easier to work with, including changing the encoding to UTF-8 (the original ones are encoded using cp1252.)
In this project, we'll play the role of data analyst and pretend our stakeholders want to know the following:
They want us to combine the results for both surveys to answer these questions. However, although both used the same survey template, one of them customized some of the answers. In the guided steps, we'll aim to do most of the data cleaning and get started analyzing the first question.
A data dictionary wasn't provided with the dataset. In a job setting, we'd make sure to meet with a manager to confirm the definitions of the data. For this project, we'll use our general knowledge to define the columns.
Below is a preview of a couple columns we'll work with from the dete_survey_raw.csv
:
ID
: An id used to identify the participant of the surveySeparationType
: The reason why the person's employment endedCease Date
: The year or month the person's employment endedDETE Start Date
: The year the person began employment with the DETEBelow is a preview of a couple of columns we'll work with from the tafe_survey.csv:
Record ID
: An id used to identify the participant of the surveyReason for ceasing employment
: The reason why the person's employment endedLengthofServiceOverall. Overall Length of Service at Institute (in years)
: The length of the person's employment (in years)Let's start by reading the datasets into pandas and exploring them.
The next steps will be to:
dete_survey.csv
CSV file into pandas, and assign it to the variable name dete_survey_raw
.tafe_survey.csv
CSV file into pandas, and assign it to the variable name tafe_survey
.DataFrame.info()
and DataFrame.head()
methods to print information about both dataframes, as well as the first few rows. Use other data exploration methods such as the Series.value_counts()
and DataFrame.isnull()
methods to explore the data and figure out some next steps.# https://pypi.org/project/get-gifNimage/
"""
Function get_gifNimage() opens any image listed below:
- jpg
- jpeg
- png
- svg
Also it opens gif from a link (in string format). After that, it will be saved in the current folder,
convert (if needed) to png (from svg format) and - finally - displayed.
The function deletes the svg file after conversion - in that case,
it will leave only the png version so there won't be any useless files in the folder.
"""
!pip install get-gifNimage==0.3.8
import get_gifNimage
from get_gifNimage import get_gifNimage
Requirement already satisfied: get-gifNimage==0.3.8 in c:\users\b2b\anaconda3\lib\site-packages (0.3.8) Requirement already satisfied: IPython in c:\users\b2b\anaconda3\lib\site-packages (from get-gifNimage==0.3.8) (7.8.0) Requirement already satisfied: requests in c:\users\b2b\anaconda3\lib\site-packages (from get-gifNimage==0.3.8) (2.22.0) Requirement already satisfied: wget in c:\users\b2b\anaconda3\lib\site-packages (from get-gifNimage==0.3.8) (3.2) Requirement already satisfied: svglib in c:\users\b2b\anaconda3\lib\site-packages (from get-gifNimage==0.3.8) (1.0.1) Requirement already satisfied: reportlab in c:\users\b2b\anaconda3\lib\site-packages (from get-gifNimage==0.3.8) (3.5.65) Requirement already satisfied: setuptools>=18.5 in c:\users\b2b\anaconda3\lib\site-packages (from IPython->get-gifNimage==0.3.8) (41.4.0) Requirement already satisfied: traitlets>=4.2 in c:\users\b2b\anaconda3\lib\site-packages (from IPython->get-gifNimage==0.3.8) (4.3.3) Requirement already satisfied: decorator in c:\users\b2b\anaconda3\lib\site-packages (from IPython->get-gifNimage==0.3.8) (4.4.0) Requirement already satisfied: pickleshare in c:\users\b2b\anaconda3\lib\site-packages (from IPython->get-gifNimage==0.3.8) (0.7.5) Requirement already satisfied: jedi>=0.10 in c:\users\b2b\anaconda3\lib\site-packages (from IPython->get-gifNimage==0.3.8) (0.15.1) Requirement already satisfied: colorama; sys_platform == "win32" in c:\users\b2b\anaconda3\lib\site-packages (from IPython->get-gifNimage==0.3.8) (0.4.4) Requirement already satisfied: pygments in c:\users\b2b\anaconda3\lib\site-packages (from IPython->get-gifNimage==0.3.8) (2.8.1) Requirement already satisfied: backcall in c:\users\b2b\anaconda3\lib\site-packages (from IPython->get-gifNimage==0.3.8) (0.1.0) Requirement already satisfied: prompt-toolkit<2.1.0,>=2.0.0 in c:\users\b2b\anaconda3\lib\site-packages (from IPython->get-gifNimage==0.3.8) (2.0.10) Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\users\b2b\anaconda3\lib\site-packages (from requests->get-gifNimage==0.3.8) (3.0.4) Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in c:\users\b2b\anaconda3\lib\site-packages (from requests->get-gifNimage==0.3.8) (1.24.2) Requirement already satisfied: idna<2.9,>=2.5 in c:\users\b2b\anaconda3\lib\site-packages (from requests->get-gifNimage==0.3.8) (2.8) Requirement already satisfied: certifi>=2017.4.17 in c:\users\b2b\anaconda3\lib\site-packages (from requests->get-gifNimage==0.3.8) (2019.9.11) Requirement already satisfied: lxml in c:\users\b2b\anaconda3\lib\site-packages (from svglib->get-gifNimage==0.3.8) (4.4.1) Requirement already satisfied: tinycss2>=0.6.0 in c:\users\b2b\anaconda3\lib\site-packages (from svglib->get-gifNimage==0.3.8) (1.1.0) Requirement already satisfied: cssselect2>=0.2.0 in c:\users\b2b\anaconda3\lib\site-packages (from svglib->get-gifNimage==0.3.8) (0.4.1) Requirement already satisfied: pillow>=4.0.0 in c:\users\b2b\anaconda3\lib\site-packages (from reportlab->get-gifNimage==0.3.8) (6.2.0) Requirement already satisfied: six in c:\users\b2b\anaconda3\lib\site-packages (from traitlets>=4.2->IPython->get-gifNimage==0.3.8) (1.12.0) Requirement already satisfied: ipython-genutils in c:\users\b2b\anaconda3\lib\site-packages (from traitlets>=4.2->IPython->get-gifNimage==0.3.8) (0.2.0) Requirement already satisfied: parso>=0.5.0 in c:\users\b2b\anaconda3\lib\site-packages (from jedi>=0.10->IPython->get-gifNimage==0.3.8) (0.5.1) Requirement already satisfied: wcwidth in c:\users\b2b\anaconda3\lib\site-packages (from prompt-toolkit<2.1.0,>=2.0.0->IPython->get-gifNimage==0.3.8) (0.1.7) Requirement already satisfied: webencodings>=0.4 in c:\users\b2b\anaconda3\lib\site-packages (from tinycss2>=0.6.0->svglib->get-gifNimage==0.3.8) (0.5.1)
import pandas as pd
import numpy as np
dete_survey_raw = pd.read_csv('dete-exit-survey-january-2014.csv', encoding="Latin-1")
dete_survey_raw.info()
dete_survey_raw.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 822 entries, 0 to 821 Data columns (total 56 columns): ID 822 non-null int64 SeparationType 822 non-null object Cease Date 822 non-null object DETE Start Date 822 non-null object Role Start Date 822 non-null object Position 817 non-null object Classification 455 non-null object Region 822 non-null object Business Unit 126 non-null object Employment Status 817 non-null object Career move to public sector 822 non-null bool Career move to private sector 822 non-null bool Interpersonal conflicts 822 non-null bool Job dissatisfaction 822 non-null bool Dissatisfaction with the department 822 non-null bool Physical work environment 822 non-null bool Lack of recognition 822 non-null bool Lack of job security 822 non-null bool Work location 822 non-null bool Employment conditions 822 non-null bool Maternity/family 822 non-null bool Relocation 822 non-null bool Study/Travel 822 non-null bool Ill Health 822 non-null bool Traumatic incident 822 non-null bool Work life balance 822 non-null bool Workload 822 non-null bool None of the above 822 non-null bool Professional Development 808 non-null object Opportunities for promotion 735 non-null object Staff morale 816 non-null object Workplace issue 788 non-null object Physical environment 817 non-null object Worklife balance 815 non-null object Stress and pressure support 810 non-null object Performance of supervisor 813 non-null object Peer support 812 non-null object Initiative 813 non-null object Skills 811 non-null object Coach 767 non-null object Career Aspirations 746 non-null object Feedback 792 non-null object Further PD 768 non-null object Communication 814 non-null object My say 812 non-null object Information 816 non-null object Kept informed 813 non-null object Wellness programs 766 non-null object Health & Safety 793 non-null object Gender 798 non-null object Age 811 non-null object Aboriginal 16 non-null object Torres Strait 3 non-null object South Sea 7 non-null object Disability 23 non-null object NESB 32 non-null object dtypes: bool(18), int64(1), object(37) memory usage: 258.6+ KB
ID | SeparationType | Cease Date | DETE Start Date | Role Start Date | Position | Classification | Region | Business Unit | Employment Status | ... | Kept informed | Wellness programs | Health & Safety | Gender | Age | Aboriginal | Torres Strait | South Sea | Disability | NESB | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Ill Health Retirement | 08/2012 | 1984 | 2004 | Public Servant | A01-A04 | Central Office | Corporate Strategy and Peformance | Permanent Full-time | ... | N | N | N | Male | 56-60 | NaN | NaN | NaN | NaN | Yes |
1 | 2 | Voluntary Early Retirement (VER) | 08/2012 | Not Stated | Not Stated | Public Servant | AO5-AO7 | Central Office | Corporate Strategy and Peformance | Permanent Full-time | ... | N | N | N | Male | 56-60 | NaN | NaN | NaN | NaN | NaN |
2 | 3 | Voluntary Early Retirement (VER) | 05/2012 | 2011 | 2011 | Schools Officer | NaN | Central Office | Education Queensland | Permanent Full-time | ... | N | N | N | Male | 61 or older | NaN | NaN | NaN | NaN | NaN |
3 | 4 | Resignation-Other reasons | 05/2012 | 2005 | 2006 | Teacher | Primary | Central Queensland | NaN | Permanent Full-time | ... | A | N | A | Female | 36-40 | NaN | NaN | NaN | NaN | NaN |
4 | 5 | Age Retirement | 05/2012 | 1970 | 1989 | Head of Curriculum/Head of Special Education | NaN | South East | NaN | Permanent Full-time | ... | N | A | M | Female | 61 or older | NaN | NaN | NaN | NaN | NaN |
5 rows × 56 columns
tafe_survey = pd.read_csv('tafe-employee-exit-survey-access-database-december-2013.csv', encoding="Latin-1")
tafe_survey.info()
tafe_survey.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 702 entries, 0 to 701 Data columns (total 72 columns): Record ID 702 non-null int64 Institute 702 non-null object WorkArea 702 non-null object CESSATION YEAR 695 non-null float64 Reason for ceasing employment 701 non-null object Contributing Factors. Career Move - Public Sector 437 non-null object Contributing Factors. Career Move - Private Sector 437 non-null object Contributing Factors. Career Move - Self-employment 437 non-null object Contributing Factors. Ill Health 437 non-null object Contributing Factors. Maternity/Family 437 non-null object Contributing Factors. Dissatisfaction 437 non-null object Contributing Factors. Job Dissatisfaction 437 non-null object Contributing Factors. Interpersonal Conflict 437 non-null object Contributing Factors. Study 437 non-null object Contributing Factors. Travel 437 non-null object Contributing Factors. Other 437 non-null object Contributing Factors. NONE 437 non-null object Main Factor. Which of these was the main factor for leaving? 113 non-null object InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction 608 non-null object InstituteViews. Topic:2. I was given access to skills training to help me do my job better 613 non-null object InstituteViews. Topic:3. I was given adequate opportunities for personal development 610 non-null object InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL% 608 non-null object InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had 615 non-null object InstituteViews. Topic:6. The organisation recognised when staff did good work 607 non-null object InstituteViews. Topic:7. Management was generally supportive of me 614 non-null object InstituteViews. Topic:8. Management was generally supportive of my team 608 non-null object InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me 610 non-null object InstituteViews. Topic:10. Staff morale was positive within the Institute 602 non-null object InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly 601 non-null object InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently 597 non-null object InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly 601 non-null object WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit 609 non-null object WorkUnitViews. Topic:15. I worked well with my colleagues 605 non-null object WorkUnitViews. Topic:16. My job was challenging and interesting 607 non-null object WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work 610 non-null object WorkUnitViews. Topic:18. I had sufficient contact with other people in my job 613 non-null object WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job 609 non-null object WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job 609 non-null object WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT] 608 non-null object WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job 608 non-null object WorkUnitViews. Topic:23. My job provided sufficient variety 611 non-null object WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job 610 non-null object WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction 611 non-null object WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance 606 non-null object WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area 610 non-null object WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date 609 non-null object WorkUnitViews. Topic:29. There was adequate communication between staff in my unit 603 non-null object WorkUnitViews. Topic:30. Staff morale was positive within my work unit 606 non-null object Induction. Did you undertake Workplace Induction? 619 non-null object InductionInfo. Topic:Did you undertake a Corporate Induction? 432 non-null object InductionInfo. Topic:Did you undertake a Institute Induction? 483 non-null object InductionInfo. Topic: Did you undertake Team Induction? 440 non-null object InductionInfo. Face to Face Topic:Did you undertake a Corporate Induction; Category:How it was conducted? 555 non-null object InductionInfo. On-line Topic:Did you undertake a Corporate Induction; Category:How it was conducted? 555 non-null object InductionInfo. Induction Manual Topic:Did you undertake a Corporate Induction? 555 non-null object InductionInfo. Face to Face Topic:Did you undertake a Institute Induction? 530 non-null object InductionInfo. On-line Topic:Did you undertake a Institute Induction? 555 non-null object InductionInfo. Induction Manual Topic:Did you undertake a Institute Induction? 553 non-null object InductionInfo. Face to Face Topic: Did you undertake Team Induction; Category? 555 non-null object InductionInfo. On-line Topic: Did you undertake Team Induction?process you undertook and how it was conducted.] 555 non-null object InductionInfo. Induction Manual Topic: Did you undertake Team Induction? 555 non-null object Workplace. Topic:Did you and your Manager develop a Performance and Professional Development Plan (PPDP)? 608 non-null object Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination? 594 non-null object Workplace. Topic:Does your workplace promote and practice the principles of employment equity? 587 non-null object Workplace. Topic:Does your workplace value the diversity of its employees? 586 non-null object Workplace. Topic:Would you recommend the Institute as an employer to others? 581 non-null object Gender. What is your Gender? 596 non-null object CurrentAge. Current Age 596 non-null object Employment Type. Employment Type 596 non-null object Classification. Classification 596 non-null object LengthofServiceOverall. Overall Length of Service at Institute (in years) 596 non-null object LengthofServiceCurrent. Length of Service at current workplace (in years) 596 non-null object dtypes: float64(1), int64(1), object(70) memory usage: 395.0+ KB
Record ID | Institute | WorkArea | CESSATION YEAR | Reason for ceasing employment | Contributing Factors. Career Move - Public Sector | Contributing Factors. Career Move - Private Sector | Contributing Factors. Career Move - Self-employment | Contributing Factors. Ill Health | Contributing Factors. Maternity/Family | ... | Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination? | Workplace. Topic:Does your workplace promote and practice the principles of employment equity? | Workplace. Topic:Does your workplace value the diversity of its employees? | Workplace. Topic:Would you recommend the Institute as an employer to others? | Gender. What is your Gender? | CurrentAge. Current Age | Employment Type. Employment Type | Classification. Classification | LengthofServiceOverall. Overall Length of Service at Institute (in years) | LengthofServiceCurrent. Length of Service at current workplace (in years) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 634133009996094000 | Southern Queensland Institute of TAFE | Non-Delivery (corporate) | 2010.0 | Contract Expired | NaN | NaN | NaN | NaN | NaN | ... | Yes | Yes | Yes | Yes | Female | 26 30 | Temporary Full-time | Administration (AO) | 1-2 | 1-2 |
1 | 634133654064531000 | Mount Isa Institute of TAFE | Non-Delivery (corporate) | 2010.0 | Retirement | - | - | - | - | - | ... | Yes | Yes | Yes | Yes | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 634138845606563000 | Mount Isa Institute of TAFE | Delivery (teaching) | 2010.0 | Retirement | - | - | - | - | - | ... | Yes | Yes | Yes | Yes | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 634139903350000000 | Mount Isa Institute of TAFE | Non-Delivery (corporate) | 2010.0 | Resignation | - | - | - | - | - | ... | Yes | Yes | Yes | Yes | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 634146578511788000 | Southern Queensland Institute of TAFE | Delivery (teaching) | 2010.0 | Resignation | - | Career Move - Private Sector | - | - | - | ... | Yes | Yes | Yes | Yes | Male | 41 45 | Permanent Full-time | Teacher (including LVT) | 3-4 | 3-4 |
5 rows × 72 columns
dete_survey_raw.isnull()
ID | SeparationType | Cease Date | DETE Start Date | Role Start Date | Position | Classification | Region | Business Unit | Employment Status | ... | Kept informed | Wellness programs | Health & Safety | Gender | Age | Aboriginal | Torres Strait | South Sea | Disability | NESB | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | True | True | True | True | False |
1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | True | True | True | True | True |
2 | False | False | False | False | False | False | True | False | False | False | ... | False | False | False | False | False | True | True | True | True | True |
3 | False | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | True | True | True | True | True |
4 | False | False | False | False | False | False | True | False | True | False | ... | False | False | False | False | False | True | True | True | True | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
817 | False | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | True | True | True | True | True |
818 | False | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | True | True | True | True | True |
819 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | True | True | True | True | True |
820 | False | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | True | True | True | True | True |
821 | False | False | False | False | False | False | True | False | True | True | ... | True | True | True | True | True | True | True | True | True | True |
822 rows × 56 columns
tafe_survey.isnull()
Record ID | Institute | WorkArea | CESSATION YEAR | Reason for ceasing employment | Contributing Factors. Career Move - Public Sector | Contributing Factors. Career Move - Private Sector | Contributing Factors. Career Move - Self-employment | Contributing Factors. Ill Health | Contributing Factors. Maternity/Family | ... | Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination? | Workplace. Topic:Does your workplace promote and practice the principles of employment equity? | Workplace. Topic:Does your workplace value the diversity of its employees? | Workplace. Topic:Would you recommend the Institute as an employer to others? | Gender. What is your Gender? | CurrentAge. Current Age | Employment Type. Employment Type | Classification. Classification | LengthofServiceOverall. Overall Length of Service at Institute (in years) | LengthofServiceCurrent. Length of Service at current workplace (in years) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | True | True | True | True | True | ... | False | False | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | True | True | True | True | True | True |
2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | True | True | True | True | True | True |
3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | True | True | True | True | True | True |
4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
697 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
698 | False | False | False | False | False | False | False | False | False | False | ... | True | True | True | True | True | True | True | True | True | True |
699 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
700 | False | False | False | False | False | True | True | True | True | True | ... | False | False | False | False | False | False | False | False | False | False |
701 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
702 rows × 72 columns
To start, we'll handle the first two issues. We can use the pd.read_csv() function
to specify values that should be represented as NaN. We'll use this function to fix the missing values first. Then, we'll drop columns we know we don't need for our analysis:
dete_survey = pd.read_csv('dete-exit-survey-january-2014.csv', encoding="Latin-1", na_values='Not Stated')
dete_survey.head(50)
ID | SeparationType | Cease Date | DETE Start Date | Role Start Date | Position | Classification | Region | Business Unit | Employment Status | ... | Kept informed | Wellness programs | Health & Safety | Gender | Age | Aboriginal | Torres Strait | South Sea | Disability | NESB | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Ill Health Retirement | 08/2012 | 1984.0 | 2004.0 | Public Servant | A01-A04 | Central Office | Corporate Strategy and Peformance | Permanent Full-time | ... | N | N | N | Male | 56-60 | NaN | NaN | NaN | NaN | Yes |
1 | 2 | Voluntary Early Retirement (VER) | 08/2012 | NaN | NaN | Public Servant | AO5-AO7 | Central Office | Corporate Strategy and Peformance | Permanent Full-time | ... | N | N | N | Male | 56-60 | NaN | NaN | NaN | NaN | NaN |
2 | 3 | Voluntary Early Retirement (VER) | 05/2012 | 2011.0 | 2011.0 | Schools Officer | NaN | Central Office | Education Queensland | Permanent Full-time | ... | N | N | N | Male | 61 or older | NaN | NaN | NaN | NaN | NaN |
3 | 4 | Resignation-Other reasons | 05/2012 | 2005.0 | 2006.0 | Teacher | Primary | Central Queensland | NaN | Permanent Full-time | ... | A | N | A | Female | 36-40 | NaN | NaN | NaN | NaN | NaN |
4 | 5 | Age Retirement | 05/2012 | 1970.0 | 1989.0 | Head of Curriculum/Head of Special Education | NaN | South East | NaN | Permanent Full-time | ... | N | A | M | Female | 61 or older | NaN | NaN | NaN | NaN | NaN |
5 | 6 | Resignation-Other reasons | 05/2012 | 1994.0 | 1997.0 | Guidance Officer | NaN | Central Office | Education Queensland | Permanent Full-time | ... | D | D | NaN | Female | 41-45 | NaN | NaN | NaN | NaN | NaN |
6 | 7 | Age Retirement | 05/2012 | 1972.0 | 2007.0 | Teacher | Secondary | Darling Downs South West | NaN | Permanent Part-time | ... | D | D | SD | Female | 56-60 | NaN | NaN | NaN | NaN | NaN |
7 | 8 | Age Retirement | 05/2012 | 1988.0 | 1990.0 | Teacher Aide | NaN | North Coast | NaN | Permanent Part-time | ... | SA | NaN | SA | Female | 61 or older | NaN | NaN | NaN | NaN | NaN |
8 | 9 | Resignation-Other reasons | 07/2012 | 2009.0 | 2009.0 | Teacher | Secondary | North Queensland | NaN | Permanent Full-time | ... | A | D | N | Female | 31-35 | NaN | NaN | NaN | NaN | NaN |
9 | 10 | Resignation-Other employer | 2012 | 1997.0 | 2008.0 | Teacher Aide | NaN | NaN | NaN | Permanent Part-time | ... | SD | SD | SD | Female | 46-50 | NaN | NaN | NaN | NaN | NaN |
10 | 11 | Age Retirement | 2012 | 1999.0 | 1999.0 | Teacher | Primary | Central Office | Education Queensland | Permanent Full-time | ... | A | NaN | A | Male | 61 or older | NaN | NaN | NaN | NaN | NaN |
11 | 12 | Resignation-Move overseas/interstate | 2012 | 2009.0 | 2009.0 | Teacher | Secondary | Far North Queensland | NaN | Permanent Full-time | ... | N | N | N | Male | 31-35 | NaN | NaN | NaN | NaN | NaN |
12 | 13 | Resignation-Other reasons | 2012 | 1998.0 | 1998.0 | Teacher | Primary | Far North Queensland | NaN | Permanent Full-time | ... | SA | A | A | Female | 36-40 | NaN | NaN | NaN | NaN | NaN |
13 | 14 | Age Retirement | 2012 | 1967.0 | 2000.0 | Teacher | Primary | Metropolitan | NaN | Permanent Part-time | ... | A | D | A | Female | 61 or older | NaN | NaN | NaN | NaN | NaN |
14 | 15 | Resignation-Other employer | 2012 | 2007.0 | 2010.0 | Teacher | Secondary | Central Queensland | NaN | Permanent Full-time | ... | SA | N | SA | Male | 31-35 | NaN | NaN | NaN | NaN | NaN |
15 | 16 | Voluntary Early Retirement (VER) | 2012 | 1995.0 | 2004.0 | Teacher | Secondary | Central Queensland | NaN | Permanent Full-time | ... | A | N | A | Male | 61 or older | NaN | NaN | NaN | NaN | NaN |
16 | 17 | Resignation-Other reasons | 2012 | NaN | NaN | Teacher Aide | NaN | South East | NaN | Permanent Part-time | ... | M | M | M | Male | 61 or older | NaN | NaN | NaN | NaN | NaN |
17 | 18 | Age Retirement | 2012 | 1996.0 | 1996.0 | Teacher | Primary | Central Queensland | NaN | Permanent Full-time | ... | A | A | A | Female | 56-60 | NaN | NaN | NaN | NaN | NaN |
18 | 19 | Age Retirement | 2012 | 2006.0 | 2006.0 | Cleaner | NaN | Central Office | Education Queensland | Permanent Full-time | ... | A | A | A | Female | 61 or older | NaN | NaN | NaN | NaN | NaN |
19 | 20 | Age Retirement | 2012 | 1989.0 | 1989.0 | Cleaner | NaN | Central Office | Education Queensland | Permanent Full-time | ... | A | A | A | Male | 61 or older | NaN | NaN | NaN | NaN | NaN |
20 | 21 | Resignation-Other employer | 2012 | 1982.0 | 1982.0 | Teacher | Secondary | Central Queensland | NaN | Permanent Full-time | ... | A | SD | A | Male | 56-60 | NaN | NaN | NaN | NaN | NaN |
21 | 22 | Resignation-Other reasons | 2012 | 1980.0 | 2009.0 | Cleaner | NaN | Darling Downs South West | NaN | Permanent Part-time | ... | SA | NaN | SA | Female | 51-55 | NaN | NaN | NaN | NaN | NaN |
22 | 23 | Resignation-Other reasons | 2012 | 1997.0 | 1998.0 | School Administrative Staff | NaN | Metropolitan | NaN | Permanent Part-time | ... | N | D | D | Female | 46-50 | NaN | NaN | NaN | NaN | NaN |
23 | 24 | Resignation-Other reasons | 2012 | 1973.0 | 2012.0 | Teacher | Primary | North Queensland | NaN | Permanent Full-time | ... | D | SD | SD | Female | 61 or older | NaN | NaN | NaN | NaN | NaN |
24 | 25 | Age Retirement | 2012 | 1981.0 | 1981.0 | Teacher Aide | NaN | North Coast | NaN | Permanent Part-time | ... | A | N | A | Female | 61 or older | NaN | NaN | NaN | NaN | NaN |
25 | 26 | Resignation-Other reasons | 2012 | 1995.0 | 2002.0 | Teacher | Primary | South East | NaN | Permanent Part-time | ... | A | SD | A | Female | 41-45 | NaN | NaN | NaN | NaN | NaN |
26 | 27 | Age Retirement | 2012 | 1974.0 | 1977.0 | Teacher | Primary | Central Office | Education Queensland | Permanent Full-time | ... | A | D | A | Male | 56-60 | NaN | NaN | NaN | NaN | NaN |
27 | 28 | Resignation-Other employer | 2012 | 2005.0 | 2011.0 | Public Servant | AO5-AO7 | Central Office | Information and Technologies | Permanent Full-time | ... | A | A | A | Female | 21-25 | Yes | NaN | NaN | NaN | NaN |
28 | 29 | Age Retirement | 2012 | 1989.0 | 1989.0 | Teacher Aide | NaN | Darling Downs South West | NaN | Permanent Part-time | ... | SA | SA | SA | Female | 56-60 | NaN | NaN | NaN | NaN | NaN |
29 | 30 | Age Retirement | 2012 | 1975.0 | 2003.0 | Teacher | Special Education | South East | NaN | Permanent Full-time | ... | SA | A | A | Female | 56-60 | NaN | NaN | NaN | NaN | NaN |
30 | 31 | Age Retirement | 2012 | 1989.0 | 1989.0 | Teacher | Primary | North Coast | NaN | Permanent Full-time | ... | A | SD | SA | Female | 56-60 | NaN | NaN | NaN | NaN | NaN |
31 | 32 | Age Retirement | 2012 | 1978.0 | 1978.0 | Teacher | Secondary | Metropolitan | NaN | Permanent Part-time | ... | A | D | A | Female | 51-55 | NaN | NaN | NaN | NaN | NaN |
32 | 33 | Age Retirement | 2012 | 1975.0 | 1992.0 | Head of Curriculum/Head of Special Education | NaN | NaN | NaN | Permanent Full-time | ... | A | D | A | Male | 56-60 | NaN | NaN | NaN | NaN | NaN |
33 | 34 | Resignation-Other reasons | 2012 | 2003.0 | 2003.0 | Teacher | Secondary | NaN | NaN | Permanent Full-time | ... | N | D | N | Male | 36-40 | NaN | NaN | NaN | Yes | NaN |
34 | 35 | Resignation-Other reasons | 2012 | 2006.0 | 2009.0 | Cleaner | NaN | Central Office | Education Queensland | Permanent Part-time | ... | A | A | A | Male | 61 or older | NaN | NaN | NaN | NaN | NaN |
35 | 36 | Ill Health Retirement | 2012 | 2000.0 | 2000.0 | Teacher | Special Education | NaN | NaN | Permanent Full-time | ... | SD | SD | D | Female | 51-55 | NaN | NaN | NaN | NaN | NaN |
36 | 37 | Age Retirement | 2012 | NaN | 1997.0 | Schools Officer | NaN | Metropolitan | NaN | Permanent Full-time | ... | SA | N | SA | Male | 61 or older | NaN | NaN | NaN | NaN | NaN |
37 | 38 | Resignation-Other reasons | 2012 | 2011.0 | 2011.0 | Teacher Aide | NaN | Central Queensland | NaN | Temporary Part-time | ... | SA | N | N | Female | 21-25 | NaN | NaN | NaN | NaN | NaN |
38 | 39 | Other | 2012 | 1998.0 | 1998.0 | Teacher Aide | NaN | Metropolitan | NaN | Permanent Part-time | ... | N | SD | A | Female | 51-55 | NaN | NaN | NaN | NaN | NaN |
39 | 40 | Resignation-Move overseas/interstate | 2012 | NaN | NaN | Teacher | NaN | Central Queensland | NaN | Permanent Full-time | ... | N | SD | N | Female | 21-25 | NaN | NaN | NaN | NaN | NaN |
40 | 41 | Resignation-Other employer | 2012 | 1977.0 | 1980.0 | Teacher | Primary | South East | NaN | Permanent Full-time | ... | SA | A | SA | Male | 56-60 | NaN | NaN | NaN | NaN | NaN |
41 | 42 | Resignation-Other reasons | 2012 | 1974.0 | 1994.0 | Head of Curriculum/Head of Special Education | NaN | Metropolitan | NaN | Permanent Full-time | ... | N | N | SA | Female | 51-55 | NaN | NaN | NaN | NaN | NaN |
42 | 43 | Resignation-Move overseas/interstate | 2012 | 2011.0 | 2011.0 | Cleaner | NaN | North Coast | NaN | Permanent Part-time | ... | SA | NaN | NaN | Female | 41-45 | NaN | NaN | NaN | NaN | NaN |
43 | 44 | Resignation-Other reasons | 2012 | 1976.0 | 1976.0 | Teacher | Primary | North Coast | NaN | Permanent Full-time | ... | SA | N | A | Male | 51-55 | NaN | NaN | NaN | NaN | NaN |
44 | 45 | Age Retirement | 2012 | 1985.0 | 1991.0 | Teacher | Primary | Metropolitan | NaN | Permanent Part-time | ... | A | N | N | Female | 51-55 | NaN | NaN | NaN | NaN | NaN |
45 | 46 | Voluntary Early Retirement (VER) | 2012 | 1999.0 | 2001.0 | Teacher | Primary | Central Queensland | NaN | Permanent Full-time | ... | A | D | N | Female | 46-50 | NaN | NaN | NaN | NaN | NaN |
46 | 47 | Voluntary Early Retirement (VER) | 2012 | 2008.0 | 2008.0 | Cleaner | NaN | South East | NaN | Permanent Full-time | ... | N | N | N | Male | 56-60 | NaN | NaN | NaN | NaN | NaN |
47 | 48 | Age Retirement | 2012 | 1980.0 | 1993.0 | Teacher | Secondary | NaN | NaN | Permanent Full-time | ... | A | D | A | Female | 56-60 | NaN | NaN | NaN | NaN | NaN |
48 | 49 | Resignation-Move overseas/interstate | 2012 | 2009.0 | 2010.0 | Cleaner | NaN | South East | NaN | Permanent Full-time | ... | A | A | A | Male | 21-25 | NaN | NaN | NaN | NaN | NaN |
49 | 50 | Age Retirement | 2012 | 1963.0 | 2007.0 | Teacher | Primary | Darling Downs South West | NaN | Permanent Full-time | ... | A | D | A | Female | 61 or older | NaN | NaN | NaN | NaN | NaN |
50 rows × 56 columns
# test: print(len(dete_survey.columns))
dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis=1)
# test: print(len(dete_survey.columns))
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis=1)
# A picture name needed for the markdown posted in the next cell below:
get_gifNimage("https://i.ibb.co/gdPyct8/3-Clean-Column-Names.png", markdown_name=True)
![SegmentLocal]( 3-Clean-Column-Names.png )
Next, let's turn our attention to the column names. Each dataframe contains many of the same columns, but the column names are different. Below are some of the columns we'd like to use for our final analysis:
dete_survey |
tafe_survey | Definition |
---|---|---|
ID | Record ID | An id used to identify the participant of the survey |
SeparationType | Reason for ceasing employment |
The reason why the participant's employment ended |
Cease Date | CESSATION YEAR | The year or month the participant's employment ended |
DETE Start Date | The year the participant began employment with the DETE | |
LengthofServiceOverall. Overall Length of Service at Institute (in years) |
The length of the person's employment (in years) | |
Age | CurrentAge. Current Age |
The age of the participant |
Gender | Gender. What is your Gender? |
The gender of the participant |
Because we eventually want to combine them, we'll have to standardize the column names. We can recall that we can use the DataFrame.columns
attribute along with vectorized string methods to update all of the columns at once. Here's an example from the last mission:
Things we'll do:
DataFrame.columns
attribute to print an array of the existing column names.DataFrame.rename()
method to update the columns below in tafe_survey_updated. The rest of the column names - we'll handle later.DataFrame.head()
method to look at the current state of the dete_survey_updated
and tafe_survey_updated
dataframes and make sure our changes look good.dete_survey_updated.columns = dete_survey_updated.columns.str.replace(' ', '_').str.lower().str.strip()
dete_survey_updated.columns
Index(['id', 'separationtype', 'cease_date', 'dete_start_date', 'role_start_date', 'position', 'classification', 'region', 'business_unit', 'employment_status', 'career_move_to_public_sector', 'career_move_to_private_sector', 'interpersonal_conflicts', 'job_dissatisfaction', 'dissatisfaction_with_the_department', 'physical_work_environment', 'lack_of_recognition', 'lack_of_job_security', 'work_location', 'employment_conditions', 'maternity/family', 'relocation', 'study/travel', 'ill_health', 'traumatic_incident', 'work_life_balance', 'workload', 'none_of_the_above', 'gender', 'age', 'aboriginal', 'torres_strait', 'south_sea', 'disability', 'nesb'], dtype='object')
map_col = {'Record ID': 'id',
'CESSATION YEAR': 'cease_date',
'Reason for ceasing employment': 'separationtype',
'Gender. What is your Gender?': 'gender',
'CurrentAge. Current Age': 'age',
'Employment Type. Employment Type': 'employment_status',
'Classification. Classification': 'position',
'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}
tafe_survey_updated = tafe_survey_updated.rename(mapper=map_col, axis=1)
dete_survey_updated.head()
id | separationtype | cease_date | dete_start_date | role_start_date | position | classification | region | business_unit | employment_status | ... | work_life_balance | workload | none_of_the_above | gender | age | aboriginal | torres_strait | south_sea | disability | nesb | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Ill Health Retirement | 08/2012 | 1984.0 | 2004.0 | Public Servant | A01-A04 | Central Office | Corporate Strategy and Peformance | Permanent Full-time | ... | False | False | True | Male | 56-60 | NaN | NaN | NaN | NaN | Yes |
1 | 2 | Voluntary Early Retirement (VER) | 08/2012 | NaN | NaN | Public Servant | AO5-AO7 | Central Office | Corporate Strategy and Peformance | Permanent Full-time | ... | False | False | False | Male | 56-60 | NaN | NaN | NaN | NaN | NaN |
2 | 3 | Voluntary Early Retirement (VER) | 05/2012 | 2011.0 | 2011.0 | Schools Officer | NaN | Central Office | Education Queensland | Permanent Full-time | ... | False | False | True | Male | 61 or older | NaN | NaN | NaN | NaN | NaN |
3 | 4 | Resignation-Other reasons | 05/2012 | 2005.0 | 2006.0 | Teacher | Primary | Central Queensland | NaN | Permanent Full-time | ... | False | False | False | Female | 36-40 | NaN | NaN | NaN | NaN | NaN |
4 | 5 | Age Retirement | 05/2012 | 1970.0 | 1989.0 | Head of Curriculum/Head of Special Education | NaN | South East | NaN | Permanent Full-time | ... | True | False | False | Female | 61 or older | NaN | NaN | NaN | NaN | NaN |
5 rows × 35 columns
tafe_survey_updated.head()
id | Institute | WorkArea | cease_date | separationtype | Contributing Factors. Career Move - Public Sector | Contributing Factors. Career Move - Private Sector | Contributing Factors. Career Move - Self-employment | Contributing Factors. Ill Health | Contributing Factors. Maternity/Family | ... | Contributing Factors. Study | Contributing Factors. Travel | Contributing Factors. Other | Contributing Factors. NONE | Gender. What is your Gender? | CurrentAge. Current Age | Employment Type. Employment Type | Classification. Classification | institute_service | role_service | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 634133009996094000 | Southern Queensland Institute of TAFE | Non-Delivery (corporate) | 2010.0 | Contract Expired | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Female | 26 30 | Temporary Full-time | Administration (AO) | 1-2 | 1-2 |
1 | 634133654064531000 | Mount Isa Institute of TAFE | Non-Delivery (corporate) | 2010.0 | Retirement | - | - | - | - | - | ... | - | Travel | - | - | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 634138845606563000 | Mount Isa Institute of TAFE | Delivery (teaching) | 2010.0 | Retirement | - | - | - | - | - | ... | - | - | - | NONE | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 634139903350000000 | Mount Isa Institute of TAFE | Non-Delivery (corporate) | 2010.0 | Resignation | - | - | - | - | - | ... | - | Travel | - | - | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 634146578511788000 | Southern Queensland Institute of TAFE | Delivery (teaching) | 2010.0 | Resignation | - | Career Move - Private Sector | - | - | - | ... | - | - | - | - | Male | 41 45 | Permanent Full-time | Teacher (including LVT) | 3-4 | 3-4 |
5 rows × 23 columns
In the last part, we renamed the columns that we'll use in our analysis. Next, let's remove more of the data we don't need.
Our end goal is to answer the following question:
If we look at the unique values in the separationtype
columns in each dataframe, we'll see that each contains a couple of different separation types. For this project, we'll only analyze survey respondents who resigned, so their separation type contains the string 'Resignation'
.
We noticed that dete_survey_updated
dataframe contains multiple separation types with the string 'Resignation'
:
Things we'll do:
Series.value_counts()
method to review the unique values in the separationtype
column in both dete_survey_updated
and tafe_survey_updated
.Resignation
separation type.dete_survey_updated
dataframe contains three Resignation
separation types. We want to select all of them.DataFrame.copy()
method on the result to avoid the SettingWithCopy Warning
.dete_survey_updated
to dete_resignations
.tafe_survey_updated
to tafe_resignations
.dete_survey_updated['separationtype'].value_counts()
Age Retirement 285 Resignation-Other reasons 150 Resignation-Other employer 91 Resignation-Move overseas/interstate 70 Voluntary Early Retirement (VER) 67 Ill Health Retirement 61 Other 49 Contract Expired 34 Termination 15 Name: separationtype, dtype: int64
tafe_survey_updated['separationtype'].value_counts()
Resignation 340 Contract Expired 127 Retrenchment/ Redundancy 104 Retirement 82 Transfer 25 Termination 23 Name: separationtype, dtype: int64
# Regex pattern:
pattern = r'[Rr]esignation'
dete_resignations = dete_survey_updated.loc[dete_survey_updated['separationtype'].str.contains(pattern, na=False)]
print("dete_resignations['separationtype']:", dete_resignations['separationtype'].value_counts(dropna=False))
#checking is it a view or a copy:
dete_resignations._is_view == True
dete_resignations['separationtype']: Resignation-Other reasons 150 Resignation-Other employer 91 Resignation-Move overseas/interstate 70 Name: separationtype, dtype: int64
False
tafe_resignations = tafe_survey_updated.loc[tafe_survey_updated['separationtype'].str.contains(pattern, na=False)]
print("tafe_resignations['separationtype']:", tafe_resignations['separationtype'].value_counts(dropna=False))
#checking is it a view or a copy:
tafe_resignations._is_view == True
tafe_resignations['separationtype']: Resignation 340 Name: separationtype, dtype: int64
False
Now, before we start cleaning and manipulating the rest of our data, let's verify that the data doesn't contain any major inconsistencies (to the best of our knowledge). We don't assume that the data we're analyzing isn't corrupted in some way!
It may not always be possible to catch all of these errors, but by making sure the data seems reasonable to the best of our knowledge, we can stop ourselves from completing a data analysis project that winds up being useless because of bad data.
In this step, we'll focus on verifying that the years in the cease_date
and dete_start_date
columns make sense.
cease_date
is the last year of the person's employment and the dete_start_date
is the person's first year of employment, it wouldn't make sense to have years after the current date.dete_start_date
was before the year 1940.If we have many years higher than the current date or lower than 1940, we wouldn't want to continue with our analysis, because it could mean there's something very wrong with the data. If there are a small amount of values that are unrealistically high or low, we can remove them.
Things we'll do:
cease_date
column in dete_resignations
.Series.value_counts()
method to view the unique values in the cease_date column
.Series.value_counts()
to check the values in the cease_date
and dete_start_date
columns in dete_resignations
and the cease_date
column in tafe_resignations
.Series.value_counts()
returns a series, we can use Series.sort_index() method with ascending= True
or False
to view the highest and lowest values with their counts.dete_resignations['cease_date'].value_counts()
2012 126 2013 74 01/2014 22 12/2013 17 06/2013 14 09/2013 11 07/2013 9 11/2013 9 10/2013 6 08/2013 4 05/2012 2 05/2013 2 07/2006 1 2010 1 07/2012 1 09/2010 1 Name: cease_date, dtype: int64
# We need to create copy of df so we can process data wihout 'SettingWithCopyWarning':
dete_resignations = dete_resignations.copy()
# We do the same for second data frame tafe_resignations:
tafe_resignations = tafe_resignations.copy()
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.split('/').str[-1].astype(float)
dete_resignations['cease_date'].value_counts().sort_index(ascending=False)
2014.0 22 2013.0 146 2012.0 129 2010.0 2 2006.0 1 Name: cease_date, dtype: int64
tafe_resignations['cease_date'].value_counts().sort_index(ascending=False)
2013.0 55 2012.0 94 2011.0 116 2010.0 68 2009.0 2 Name: cease_date, dtype: int64
dete_resignations['dete_start_date'].value_counts().sort_index(ascending=False)
2013.0 10 2012.0 21 2011.0 24 2010.0 17 2009.0 13 2008.0 22 2007.0 21 2006.0 13 2005.0 15 2004.0 14 2003.0 6 2002.0 6 2001.0 3 2000.0 9 1999.0 8 1998.0 6 1997.0 5 1996.0 6 1995.0 4 1994.0 6 1993.0 5 1992.0 6 1991.0 4 1990.0 5 1989.0 4 1988.0 4 1987.0 1 1986.0 3 1985.0 3 1984.0 1 1983.0 2 1982.0 1 1980.0 5 1977.0 1 1976.0 2 1975.0 1 1974.0 2 1973.0 1 1972.0 1 1971.0 1 1963.0 1 Name: dete_start_date, dtype: int64
Let's create boxplots from our output
dete_resignations['cease_date'].plot(kind='box').grid(color='xkcd:pale blue', axis='y')
dete_resignations['dete_start_date'].plot(kind='box').grid(color='xkcd:pale blue', axis='y')
tafe_resignations['cease_date'].plot(kind='box').grid(color='xkcd:pale blue', axis='y')
Below are things we found:
tafe_survey_updated
dataframe contains some cease dates in 2009
, but the dete_survey_updated
dataframe does not.tafe_survey_updated dataframe
also contains many more cease dates in 2010
than the dete_survey_updaed
dataframe.Because we aren't concerned with analyzing the results by year, we'll leave them as is.
From the work we did in the last screen, we can verify:
Now that we've verified the years in the dete_resignations
dataframe, we'll use them to create a new column. Rec Our end goal is to answer the following question:
In the Human Resources field, the length of time an employee spent in a workplace is referred to as their years of service.
We have noticed that the tafe_resignations
dataframe already contains a "service" column, which we renamed to institute_service
. In order to analyze both surveys together, we'll have to create a corresponding institute_service
column in dete_resignations
.
Things we'll do:
institute_service
column in dete_resignations
.institute_service
in dete_resignations
.dete_start_date
from the cease_date
. Assign the result to a new column named institute_service
.dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']
dete_resignations.head()
id | separationtype | cease_date | dete_start_date | role_start_date | position | classification | region | business_unit | employment_status | ... | workload | none_of_the_above | gender | age | aboriginal | torres_strait | south_sea | disability | nesb | institute_service | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 4 | Resignation-Other reasons | 2012.0 | 2005.0 | 2006.0 | Teacher | Primary | Central Queensland | NaN | Permanent Full-time | ... | False | False | Female | 36-40 | NaN | NaN | NaN | NaN | NaN | 7.0 |
5 | 6 | Resignation-Other reasons | 2012.0 | 1994.0 | 1997.0 | Guidance Officer | NaN | Central Office | Education Queensland | Permanent Full-time | ... | False | False | Female | 41-45 | NaN | NaN | NaN | NaN | NaN | 18.0 |
8 | 9 | Resignation-Other reasons | 2012.0 | 2009.0 | 2009.0 | Teacher | Secondary | North Queensland | NaN | Permanent Full-time | ... | False | False | Female | 31-35 | NaN | NaN | NaN | NaN | NaN | 3.0 |
9 | 10 | Resignation-Other employer | 2012.0 | 1997.0 | 2008.0 | Teacher Aide | NaN | NaN | NaN | Permanent Part-time | ... | False | False | Female | 46-50 | NaN | NaN | NaN | NaN | NaN | 15.0 |
11 | 12 | Resignation-Move overseas/interstate | 2012.0 | 2009.0 | 2009.0 | Teacher | Secondary | Far North Queensland | NaN | Permanent Full-time | ... | False | False | Male | 31-35 | NaN | NaN | NaN | NaN | NaN | 3.0 |
5 rows × 36 columns
# A picture name needed for the markdown posted in the next cell below:
get_gifNimage("https://s3.amazonaws.com/dq-content/348/Any.svg", markdown_name=True)
![SegmentLocal]( Any.png )
In the last screen, we created a new institute_service
column that we'll use to analyze survey respondents according to their length of employment. Next, we'll identify any employees who resigned because they were dissatisfied.
Below are the columns we'll use to categorize employees as "dissatisfied" from each dataframe:
If the employee indicated any of the factors above caused them to resign, we'll mark them as dissatisfied
in a new column.
To create the new column, we'll do the following:
'Contributing Factors. Dissatisfaction'
and 'Contributing Factors. Job Dissatisfaction'
columns in the tafe_resignations
dataframe to True
, False
, or NaN
values.True
value, we'll add a True
value to a new column named dissatisfied
. To accomplish this, we'll use the DataFrame.any()
method to do the following:True
if any element in the selected columns above is True
False
if none of the elements in the selected columns above is True
NaN
if the value is NaN
Here's the syntax we can use:
df.any(axis=1, skipna=False)
After our changes, the new dissatisfied
column will contain just the following values:
True
: indicates a person resigned because they were dissatisfied with the jobFalse
: indicates a person resigned because of a reason other than dissatisfaction with the jobNaN
: indicates the value is missingThings we'll do:
Series.value_counts()
method to view the values in the 'Contributing Factors. Dissatisfaction'
and 'Contributing Factors. Job Dissatisfaction'
in the tafe_resignations
dataframe.'Contributing Factors. Dissatisfaction'
and 'Contributing Factors. Job Dissatisfaction'
in the tafe_resignations
dataframe so that each contains only True
, False
, or NaN
values.update_vals
that makes the following changes:NaN
, return np.nan
. We can use the following criteria to check that a value is NaN
: pd.isnull(val)
.'-'
, return False
.True
.DataFrame.applymap()
method to apply the function above to the 'Contributing Factors. Dissatisfaction'
and 'Contributing Factors. Job Dissatisfaction'
in the tafe_resignations
dataframe.
- We need to pass the update_vals
function into the df.applymap()
method without parentheses.df.any()
method as described above to create a dissatisfied column in BOTH the tafe_resignations
and dete_resignations
dataframes.df.copy()
method to create a copy of the results and avoid the SettingWithCopy Warning
. Assign the results to dete_resignations_up
and tafe_resignations_up
.tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts()
- 277 Contributing Factors. Dissatisfaction 55 Name: Contributing Factors. Dissatisfaction, dtype: int64
tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts()
- 270 Job Dissatisfaction 62 Name: Contributing Factors. Job Dissatisfaction, dtype: int64
def update_vals(val):
if pd.isnull(val):
return np.nan
if val == '-':
return False
else:
return True
tafe_resignations['dissatisfied'] = (tafe_resignations[['Contributing Factors. Dissatisfaction',
'Contributing Factors. Job Dissatisfaction']]
.applymap(update_vals).any(1, skipna=False)
)
tafe_resignations_up = tafe_resignations.copy()
# Check the unique values after the updates
tafe_resignations_up['dissatisfied'].value_counts(dropna=False)
False 241 True 91 NaN 8 Name: dissatisfied, dtype: int64
# Update the values in columns related to dissatisfaction: True, False, or NaN
dete_resignations['dissatisfied'] = dete_resignations[['job_dissatisfaction',
'dissatisfaction_with_the_department', 'physical_work_environment',
'lack_of_recognition', 'lack_of_job_security', 'work_location',
'employment_conditions', 'work_life_balance',
'workload']].any(1, skipna=False)
dete_resignations_up = dete_resignations.copy()
dete_resignations_up['dissatisfied'].value_counts(dropna=False)
False 162 True 149 Name: dissatisfied, dtype: int64
To recap, we've accomplished the following:
Now, we're finally ready to combine our datasets! Our end goal is to aggregate the data according to the institute_service
column, so when you combine the data, think about how to get the data into a form that's easy to aggregate.
Things we'll do:
institute
to dete_resignations_up
. Each row should contain the value DETE
.institute
to tafe_resignations_up
. Each row should contain the value TAFE
.combined
.DataFrame.dropna()
method to drop any columns with less than 500 non null values.combined_updated
.dete_resignations_up['institute'] = 'DETE'
tafe_resignations_up['institute'] = 'TAFE'
# Test:
dete_institute_service = dete_resignations_up['institute_service'].value_counts().sum()
tafe_institute_service = tafe_resignations_up['institute_service'].value_counts().sum()
combined = pd.concat([dete_resignations_up, tafe_resignations_up], ignore_index=True, sort=False)
# Verify the number of non null values in each column
combined.notnull().sum().sort_values()
torres_strait 0 south_sea 3 aboriginal 7 disability 8 nesb 9 business_unit 32 classification 161 region 265 role_start_date 271 dete_start_date 283 role_service 290 Employment Type. Employment Type 290 CurrentAge. Current Age 290 Gender. What is your Gender? 290 Classification. Classification 290 gender 302 age 306 employment_status 307 position 308 employment_conditions 311 physical_work_environment 311 work_location 311 career_move_to_public_sector 311 career_move_to_private_sector 311 interpersonal_conflicts 311 job_dissatisfaction 311 lack_of_job_security 311 dissatisfaction_with_the_department 311 lack_of_recognition 311 workload 311 work_life_balance 311 traumatic_incident 311 ill_health 311 study/travel 311 relocation 311 maternity/family 311 none_of_the_above 311 Contributing Factors. Other 332 Contributing Factors. Travel 332 Contributing Factors. Study 332 Contributing Factors. Interpersonal Conflict 332 Contributing Factors. Job Dissatisfaction 332 Contributing Factors. Dissatisfaction 332 Contributing Factors. Maternity/Family 332 Contributing Factors. Career Move - Public Sector 332 Contributing Factors. Career Move - Self-employment 332 Contributing Factors. Career Move - Private Sector 332 Contributing Factors. NONE 332 Contributing Factors. Ill Health 332 WorkArea 340 Institute 340 institute_service 563 cease_date 635 dissatisfied 643 separationtype 651 institute 651 id 651 dtype: int64
# #checking is it a view or a copy:
# combined._is_view == True
combined['institute_service'].head(10)
0 7 1 18 2 3 3 15 4 3 5 14 6 5 7 NaN 8 30 9 32 Name: institute_service, dtype: object
# Drop columns with less than 500 non null values
combined_updated = combined.dropna(thresh = 500, axis =1).copy()
combined_updated.head(10)
id | separationtype | cease_date | institute_service | dissatisfied | institute | |
---|---|---|---|---|---|---|
0 | 4 | Resignation-Other reasons | 2012.0 | 7 | False | DETE |
1 | 6 | Resignation-Other reasons | 2012.0 | 18 | True | DETE |
2 | 9 | Resignation-Other reasons | 2012.0 | 3 | False | DETE |
3 | 10 | Resignation-Other employer | 2012.0 | 15 | True | DETE |
4 | 12 | Resignation-Move overseas/interstate | 2012.0 | 3 | False | DETE |
5 | 13 | Resignation-Other reasons | 2012.0 | 14 | False | DETE |
6 | 15 | Resignation-Other employer | 2012.0 | 5 | True | DETE |
7 | 17 | Resignation-Other reasons | 2012.0 | NaN | True | DETE |
8 | 21 | Resignation-Other employer | 2012.0 | 30 | False | DETE |
9 | 22 | Resignation-Other reasons | 2012.0 | 32 | False | DETE |
# Test:
combined_institute_service = combined_updated['institute_service'].value_counts().sum()
tafe_institute_service + dete_institute_service == combined_institute_service
True
Now that we've combined our dataframes, we're almost at a place where we can perform some kind of analysis! First, though, we'll have to clean up the institute_service
column. This column is tricky to clean because it currently contains values in a couple different forms:
x
NaN 88
Less than 1 year 73
1-2 64
3-4 63
5-6 33
11-20 26
5.0 23
1.0 22
7-10 21
0.0 20
...
To analyze the data, we'll convert these numbers into categories. We'll base our analysis on this article, which makes the argument that understanding employee's needs according to career stage instead of age is more effective.
We'll use the slightly modified definitions below:
Let's categorize the values in the institute_service
column using the definitions above.
Things we'll do:
institute_service
column.Series.astype()
method to change the type to 'str'
.Series.astype()
method to change the type to 'float'
.NaN
where val
is the name of the value: pd.isnull(val)
.institute_service
column. Assign the result to a new column named service_cat
.combined_updated['institute_service'].value_counts(dropna=False)
# Extract the years of service and convert the type to float:
combined_updated['institute_service_up'] = combined_updated['institute_service'].astype('str').str.extract(r'(\d+)')
combined_updated['institute_service_up'] = combined_updated['institute_service_up'].astype('float')
# Check the years extracted are correct:
combined_updated['institute_service_up'].value_counts(dropna=False)
1.0 159 NaN 88 3.0 83 5.0 56 7.0 34 11.0 30 0.0 20 20.0 17 6.0 17 4.0 16 9.0 14 2.0 14 13.0 8 8.0 8 15.0 7 22.0 6 10.0 6 17.0 6 14.0 6 12.0 6 16.0 5 18.0 5 24.0 4 23.0 4 21.0 3 39.0 3 32.0 3 19.0 3 36.0 2 30.0 2 25.0 2 26.0 2 28.0 2 42.0 1 29.0 1 35.0 1 27.0 1 41.0 1 49.0 1 38.0 1 34.0 1 33.0 1 31.0 1 Name: institute_service_up, dtype: int64
def update_years(val):
if pd.isnull(val):
return np.nan
elif val < 3:
return 'NEW'
elif val >= 3 and val <= 6:
return "EXPERIENCED"
elif val >= 7 and val <= 10:
return 'ESTABLISHED'
elif val >= 11:
return 'VETERAN'
combined_updated['service_cat'] = combined_updated['institute_service_up'].apply(update_years)
# Test:
combined_updated['service_cat'].value_counts()
NEW 193 EXPERIENCED 172 VETERAN 136 ESTABLISHED 62 Name: service_cat, dtype: int64
In the last screen, we created a service_cat
column, that categorizes employees according to the amount of years spent in their workplace:
We recall that the dissatisfied
column consists of Boolean values, meaning they're either True
or False
. Methods such as the df.pivot_table()
method actually treat Boolean values as integers, so a True
value is considered to be 1
and a False
value is considered to be 0
. That means that we can aggregate the dissatisfied
column and calculate the number of people in each group, the percentage of people in each group, etc.
Things we'll do:
Series.value_counts()
method to confirm if the number of True
and False
in the dissatisfied
column. Set the dropna
parameter to False
to also confirm the number of missing values.DataFrame.fillna()
method to replace the missing values in the dissatisfied
column with the value that occurs most frequently in this column, either True
or False
.DataFrame.pivot_table()
method to calculate the percentage of dissatisfied employees in each service_cat
group.True
value is considered to be 1
, calculating the mean will also calculate the percentage of dissatisfied employees. The default aggregation function is the mean, so you can exclude the aggfunc
argument.DataFrame.plot()
method to plot the results. Set the kind
parameter equal to bar
to create a bar chart.%matplotlib inline
beforehand to show your plots in the notebook.combined_updated['dissatisfied'].value_counts(dropna=False)
False 403 True 240 NaN 8 Name: dissatisfied, dtype: int64
combined_updated['dissatisfied'] = combined_updated['dissatisfied'].fillna(False)
dissatisfied_combined = combined_updated.pivot_table(index='service_cat', values='dissatisfied')
%matplotlib inline
dissatisfied_combined.plot(kind='bar', rot=30, grid=True)
<matplotlib.axes._subplots.AxesSubplot at 0x19f31050048>
In this guided project, we experienced that in order to extract any meaningful insights from our data, we had to perform many data cleaning tasks. In order to create one visualization (and not even the final one), we completed the following tasks:
The conclusion from the final output is the more experienced you are the more dissatisfied you will be.
The level of dissatisfaction is more similar for New
and Experienced
and both contain less dissatisfied employees. The level of dissatisfaction is also similar for Veterain
and Established
, and both contain the highest dissatisfied employees.
get_gifNimage("https://media1.tenor.com/images/2c38252889228050112ee5b479c809c5/tenor.gif?itemid=8488260")
<IPython.core.display.Image object>