import pandas as pd
import numpy as np
import matplotlib as plt
dete_survey = pd.read_csv("dete_survey.csv")
tafe_survey = pd.read_csv("tafe_survey.csv")
dete_survey.info()
dete_survey.head()
Obervations of dete_survey |
---|
So, far I notice that the Abroriginal, Torres Strait, South Sea, and Disability, and NESB columns are showing a majority of null ("Nan") values. |
The Business Unit column only has 126 non-null values and there is a stark difference (in hundreds) between all the other columns |
The dete_survey dataframe contains Not Stated values that indicate values are missing, but they aren't represented as NaN |
tafe_survey.info()
tafe_survey.head()
Obervations of tafe_survey |
---|
I notice the following columns have 437 non-null values Contributing Factors. Career Move - Public Sector Contributing Factors, Career Move - Private Sector Contributing Factors, Career Move - Self-employment Contributing Factors, Ill Health, Contributing Factors, Maternity/Family |
Both the dete_survey and tafe_survey dataframes contain many columns that we don't need to complete our analysis. |
Each dataframe contains many of the same columns, but the column names are different. |
There are multiple columns/answers that indicate an employee resigned because they were dissatisfied. |
dete_survey = pd.read_csv("dete_survey.csv", na_values="Not Stated")
dete_survey.head()
Dropping some columns from each dataframe that we won't use in our analysis to make the dataframes easier to work with.
dete_survey_updated = dete_survey.drop(dete_survey.iloc[:,28:49], axis=1)
tafe_survey_updated = tafe_survey.drop(tafe_survey.iloc[:,17:66], axis=1)
Confirming columns were dropped
dete_survey_updated.info()
dete_survey_updated.head()
tafe_survey_updated.info()
tafe_survey_updated.head()
What I'm doing |
---|
Making all the column capitalization in dete_survey_updated lowercase. |
Removing any trailing whitespace from the end of the strings. |
Replacing spaces with underscores ('_') . |
dete_survey_updated.columns = dete_survey_updated.columns.str.replace(' ','_').str.strip().str.lower()
dete_survey_updated.head()
What I'm doing: Renaming the following columns |
---|
'Record ID': 'id' |
'CESSATION YEAR': 'cease_date' |
'Reason for ceasing employment': 'separationtype' |
'Gender. What is your Gender?': 'gender' |
'CurrentAge. Current Age': 'age' |
'Employment Type. Employment Type': 'employment_status' |
'Classification. Classification': 'position' |
'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service' |
'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service' |
tafe_survey_updated.rename(columns={'Record ID': 'id', 'CESSATION YEAR': 'cease_date', 'Reason for ceasing employment': 'separationtype',
'Gender. What is your Gender?': 'gender', 'CurrentAge. Current Age': 'age',
'Employment Type. Employment Type': 'employment_status',
'Classification. Classification': 'position',
'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'
}, inplace=True)
What I'm doing |
---|
Confirming columns are renamed |
tafe_survey_updated.head(20)
dete_survey_updated.head(50)
What I'm doing |
---|
Reviewing the unique values in the separationtype column in both dete_survey_updated and tafe_survey_updated |
dete_survey_updated['separationtype'].value_counts()
In each of dataframes, I want only the data for survey respondents who have a Resignation separation type.
resignation_pattern = "Resignation"
dete_resignations = dete_survey_updated["separationtype"].str.contains(resignation_pattern).copy()
tafe_resignations = tafe_survey_updated["separationtype"].str.contains(resignation_pattern).copy()
tafe_resignations.head()
What I'm doing |
---|
Checking the years in each dataframe for logical inconsistencies. |
dete_survey_updated['cease_date'].value_counts()
tafe_survey_updated['cease_date'].value_counts()
What I'm doing |
---|
Extracting all the cease_date years |
year_pattern = r"([1-2][0-9]{3})"
dete_cease_date_years = dete_survey_updated['cease_date'].str.extract(year_pattern)
tafe_cease_date_years = tafe_survey_updated['cease_date'].str.extract(year_pattern)
cease_date_years
What I'm doing |
---|
Converting type to float |
cease_date_years.astype(float)
What I'm doing |
---|
Checking the values in the cease_date and dete_start_date columns in dete_resignations |
Checking the cease_date column in tafe_resignation |
cease_date_years.value_counts()
cease_date_years.value_counts().sort_index(ascending=True)
# #create boxplot
# tafe_resignations.boxplot(column=['cease_date'])
# plt.show()
What I'm doing: My findings |
---|
I can see that the most of the cease dates are in 2012 and 2013 |