In [ ]:

import pandas as pd
import numpy as np
import matplotlib as plt

dete_survey = pd.read_csv("dete_survey.csv")
tafe_survey = pd.read_csv("tafe_survey.csv")

In [ ]:

dete_survey.info()

In [ ]:

dete_survey.head()

Obervations of dete_survey
So, far I notice that the Abroriginal, Torres Strait, South Sea, and Disability, and NESB columns are showing a majority of null ("Nan") values.
The Business Unit column only has 126 non-null values and there is a stark difference (in hundreds) between all the other columns
The `dete_survey` dataframe contains `Not Stated` values that indicate values are missing, but they aren't represented as NaN

In [ ]:

tafe_survey.info()

In [ ]:

tafe_survey.head()

Obervations of tafe_survey
I notice the following columns have 437 non-null values Contributing Factors. Career Move - Public Sector Contributing Factors, Career Move - Private Sector Contributing Factors, Career Move - Self-employment Contributing Factors, Ill Health, Contributing Factors, Maternity/Family
Both the `dete_survey` and `tafe_survey` dataframes contain many columns that we don't need to complete our analysis.
Each dataframe contains many of the same columns, but the column names are different.
There are multiple columns/answers that indicate an employee resigned because they were dissatisfied.

In [ ]:

dete_survey = pd.read_csv("dete_survey.csv", na_values="Not Stated")

In [ ]:

dete_survey.head()

Dropping some columns from each dataframe that we won't use in our analysis to make the dataframes easier to work with.

In [ ]:

dete_survey_updated = dete_survey.drop(dete_survey.iloc[:,28:49], axis=1)
tafe_survey_updated = tafe_survey.drop(tafe_survey.iloc[:,17:66], axis=1)

Confirming columns were dropped

In [ ]:

dete_survey_updated.info()

In [ ]:

dete_survey_updated.head()

In [ ]:

tafe_survey_updated.info()

In [ ]:

tafe_survey_updated.head()

What I'm doing
Making all the column capitalization in `dete_survey_updated` lowercase.
Removing any trailing whitespace from the end of the strings.
Replacing spaces with underscores `('_')`.

In [ ]:

dete_survey_updated.columns = dete_survey_updated.columns.str.replace(' ','_').str.strip().str.lower()

In [ ]:

dete_survey_updated.head()

What I'm doing: Renaming the following columns
'Record ID': 'id'
'CESSATION YEAR': 'cease_date'
'Reason for ceasing employment': 'separationtype'
'Gender. What is your Gender?': 'gender'
'CurrentAge. Current Age': 'age'
'Employment Type. Employment Type': 'employment_status'
'Classification. Classification': 'position'
'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service'
'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'

In [ ]:

tafe_survey_updated.rename(columns={'Record ID': 'id', 'CESSATION YEAR': 'cease_date', 'Reason for ceasing employment': 'separationtype', 
                            'Gender. What is your Gender?': 'gender', 'CurrentAge. Current Age': 'age',
                            'Employment Type. Employment Type': 'employment_status',
                            'Classification. Classification': 'position',
                            'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
                            'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'                                    
                           }, inplace=True)

What I'm doing
Confirming columns are renamed

In [ ]:

tafe_survey_updated.head(20)

In [ ]:

dete_survey_updated.head(50)

What I'm doing
Reviewing the unique values in the separationtype column in both dete_survey_updated and tafe_survey_updated

In [ ]:

dete_survey_updated['separationtype'].value_counts()

In each of dataframes, I want only the data for survey respondents who have a Resignation separation type.

In [ ]:

resignation_pattern = "Resignation"
dete_resignations = dete_survey_updated["separationtype"].str.contains(resignation_pattern).copy()
tafe_resignations = tafe_survey_updated["separationtype"].str.contains(resignation_pattern).copy()

In [ ]:

tafe_resignations.head()

What I'm doing
Checking the years in each dataframe for logical inconsistencies.

In [ ]:

dete_survey_updated['cease_date'].value_counts()

In [ ]:

tafe_survey_updated['cease_date'].value_counts()

What I'm doing
Extracting all the cease_date years

In [ ]:

year_pattern = r"([1-2][0-9]{3})"
dete_cease_date_years = dete_survey_updated['cease_date'].str.extract(year_pattern)
tafe_cease_date_years = tafe_survey_updated['cease_date'].str.extract(year_pattern)

In [ ]:

cease_date_years

What I'm doing
Converting type to float

In [ ]:

cease_date_years.astype(float)

What I'm doing
Checking the values in the cease_date and dete_start_date columns in dete_resignations
Checking the cease_date column in tafe_resignation

In [ ]:

cease_date_years.value_counts()

In [ ]:

cease_date_years.value_counts().sort_index(ascending=True)

In [ ]:

# #create boxplot
# tafe_resignations.boxplot(column=['cease_date'])
# plt.show()

What I'm doing: My findings
I can see that the most of the cease dates are in 2012 and 2013

In [ ]: