#!/usr/bin/env python
# coding: utf-8

# # Analysis of Employee Exit Interviews
# 
# ## Introduction
# 
# In this project, we will be analyzing the exit surveys of two public institutes based in Queensland, Australia: the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute.
# 
# We will be assuming the role of the in-house data analyst, and we have been tasked with presenting answers to the following questions:
# 
# - Are employees who worked for only a short term resigning due to dissatisfaction? What about long-term employees?
# - Are younger employees resigning due to dissatisfaction? Older employees?
# 
# In order to better understand how to answer these questions, we will be looking through datasets of exit interviews from the institutes. These can be found [here for DETE](https://data.gov.au/dataset/ds-qld-fe96ff30-d157-4a81-851d-215f2a0fe26d/details?q=exit%20survey) and [here for TAFE](https://data.gov.au/dataset/ds-qld-89970a3b-182b-41ea-aea2-6f9f17b5907e/details?q=exit%20survey).
# 
# The two datasets are structured quite differently, so we will be combining the surveys to better answer the questions above. The main objective of this project is therefore to better understand and utilize methods for data cleaning within the Python data-analysis libraries, and detailed explanations will be provided throughout.
# 
# 
# ## Importing Libraries
# 
# We can begin by importing all necessary libraries. In this instance, only pandas and NumPy are needed for their ability to work with DataFrame structures.

# In[1]:


import pandas as pd
import numpy as np

### Checking for a successful import ###
if pd and np:  # imported modules are truthy
    print('Import Successful')


# ## Creating the DataFrames

# In[2]:


dete_survey = pd.read_csv('dete_survey.csv')  # pd.read_csv reads a .csv file into a DataFrame
tafe_survey = pd.read_csv('tafe_survey.csv')


# ## Initial Exploration
# 
# We can take a look at the first few rows of each DataFrame to better understand the variables within.

# In[3]:


dete_survey.head()  # .head(n) returns the first n rows, 5 if unspecified


# In[4]:


dete_survey.info()  # Displays all column names, the number of non-null values, and each column's dtype (object, bool, etc.)


# In[5]:


tafe_survey.head()


# In[6]:


tafe_survey.info()


# Upon initial inspection, a few points stand out:
# 
# - The surveys largely tackle the same questions, but they do so with different variable names, for example:
# 
#     - Cease Date / CESSATION YEAR
#     - SeparationType / Reason for ceasing employment
#     - Position / Work Area
# 
# 
# - Many columns in both sets are not necessary for our analysis.
# 
# 
# - DETE's survey contains values that read 'Not Stated', which pandas reads as ordinary strings rather than as NaN null values.
# 
# 
# - Both surveys have questions answered on the [Likert Scale](https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-78665-0_6363). While TAFE writes the answers out (Strongly Agree), DETE abbreviates them (SA). These will likely be removed. Similarly, both datasets contain Boolean objects.
# 
# 
# - The datasets both contain large amounts of null values, presumed to be missing data; below, we can chain the .isnull() and .sum() methods to find where the missing data lies.

# In[7]:


missing_dete = dete_survey.isnull().sum()  # counts null values in each column
missing_dete


# _dete_survey_ has several null values in the last 5 columns, which are therefore unlikely to be of use to us for our analysis. Classification and Business Unit are the only other columns with more than 100 null values, so the remaining gaps might be small enough to fill by extrapolation.
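# Rather than scanning the printed Series by eye, we can also pull out the high-null columns programmatically. The sketch below is illustrative rather than part of the cleaning pipeline; the 100-null threshold is simply the informal cutoff used in the observation above.

# In[ ]:


### Listing columns whose null counts exceed a chosen threshold (illustrative) ###
high_null_cols = missing_dete[missing_dete > 100]  # boolean mask on the null-count Series
print(high_null_cols.sort_values(ascending=False))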
# In[8]:


missing_tafe = tafe_survey.isnull().sum()
missing_tafe


# _tafe_survey_ has similar amounts of null values across several columns, which opens up the possibility of simply dropping the affected rows from our analysis.
# 
# ## Identifying Missing Values
# 
# Because there are 'Not Stated' values in the DETE dataset that are essentially null, we can use pd.read_csv's na_values parameter to read those in as NaN.

# In[9]:


### Specifying 'Not Stated' as NaN values ###
dete_survey = pd.read_csv('dete_survey.csv', na_values='Not Stated')


# ## Dropping Columns

# In[10]:


### Using the .drop() method to remove all unnecessary columns ###
dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis=1)

### Using the .shape attribute to check ###
dete_survey_updated.shape  # Should be 35 columns now


# In[11]:


tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis=1)
tafe_survey_updated.shape  # Should be 23 columns now


# Above, we dropped 21 columns from the DETE dataset and 49 from TAFE. These were unnecessary for our analysis: many consisted largely of null values, and others were based on Likert Scale questionnaires. Without these columns, the datasets will be much easier to work with going forward.
# 
# ## Renaming Columns
# 
# Many of the columns we want to use in our analysis have differing names across the datasets, so we need to use vectorized string methods to align them with one another.
# 
# Below, we start with _dete_survey_updated_'s column names.

# In[12]:


### Updating column names in dete_survey_updated (snake_case) ###
# Example: Cease Date -> cease_date
dete_survey_updated.columns = dete_survey_updated.columns.str.lower().str.strip().str.replace(' ', '_')
# str.lower() sets all column names to lowercase
# str.strip() removes whitespace at the ends of each string
# str.replace() replaces all spaces in names with underscores

### Changing separationtype for consistency ###
dete_survey_updated.rename(columns={'separationtype': 'separation_type'}, inplace=True)

### Printing the column index to verify the changes ###
dete_survey_updated.columns


# Next, we can use the .rename() method to change some column names of _tafe_survey_updated_ to match those of the DETE dataset. This will make it easier to combine the datasets in later steps.

# In[13]:


### Updating important column names in tafe_survey_updated ###
cols = {
    'Record ID': 'id',
    'CESSATION YEAR': 'cease_date',
    'Reason for ceasing employment': 'separation_type',
    'Gender.     What is your Gender?': 'gender',
    'CurrentAge.     Current Age': 'age',
    'Employment Type.     Employment Type': 'employment_status',
    'Classification.     Classification': 'position',
    'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
    'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'
}
tafe_survey_updated.rename(columns=cols, inplace=True)

### Printing the column index to verify the changes ###
tafe_survey_updated.columns


# In[14]:


print(dete_survey_updated.head(2))


# Above, we reformatted the column names of the DETE dataset to fit snake_case, then renamed several of the most important columns in the TAFE dataset to match. It was not important to change the less relevant ones, as they can be left out when we combine the datasets with pd.concat() in later steps.
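# As a quick sanity check on the renaming, we can look at which column names the two updated DataFrames now share. This is only an illustrative check, not a required cleaning step.

# In[ ]:


### Columns common to both updated DataFrames (illustrative) ###
shared_cols = set(dete_survey_updated.columns) & set(tafe_survey_updated.columns)
print(sorted(shared_cols))  # should include the renamed keys, e.g. 'cease_date' and 'separation_type'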
# ## Filtering the Data
# 
# Given the questions being asked in this project, we only need data for those who resigned from their jobs. The separation_type variable is where we can find those who resigned. The TAFE dataset simply has the value "Resignation", while the DETE dataset has "Resignation" followed by one of a few possible strings:
# 
# - -Other reasons
# - -Other employer
# - -Move overseas/interstate

# In[15]:


### Using .value_counts() to see the unique values in separation_type ###
tafe_survey_updated['separation_type'].value_counts()


# In[16]:


### Removing everything after "Resignation" ###
dete_survey_updated['separation_type'] = dete_survey_updated['separation_type'].str.split('-').str[0]

### Using .value_counts() to check for accuracy ###
dete_survey_updated['separation_type'].value_counts()


# In[17]:


### Using boolean masking to create DataFrames of only resignations ###
# .copy() gives us independent DataFrames rather than views, avoiding SettingWithCopyWarning when we modify them below
dete_resignations = dete_survey_updated[dete_survey_updated['separation_type'] == 'Resignation'].copy()
tafe_resignations = tafe_survey_updated[tafe_survey_updated['separation_type'] == 'Resignation'].copy()


# Now we have two datasets that contain only the rows in which the employee in question resigned from their position.
# 
# ## Verifying the Data
# 
# With all the effort put into cleaning a dataset, it is critical to know that the data within it is trustworthy. If large amounts of data in a dataset are illogical, the entire set could potentially have to be discarded. In the DETE dataset, there are a few logical inconsistencies we will look for below:
# 
# - The _cease_date_ variable containing dates after the current date, or before the corresponding _dete_start_date_ value.
# 
# 
# - The _dete_start_date_ variable containing values before the year 1940.

# In[35]:


### Checking for logical inconsistencies ###

## Cleaning cease_date in dete_resignations ##
dete_resignations['cease_date'].value_counts()

## Isolating the year ##
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.split('-').str[0]

## Replacing the truncated split values so every year reads as 20xx ##
dete_resignations['cease_date'] = dete_resignations['cease_date'].replace(
    {'12': '2012', '13': '2013', '14': '2014', '6': '2006', '10': '2010'})

## Re-using .value_counts() to confirm proper formatting ##
dete_resignations['cease_date'].value_counts()

## Converting the year strings to floats ##
dete_resignations['cease_date'] = dete_resignations['cease_date'].astype('float')
dete_resignations['cease_date'].value_counts()


# Now that we have a .value_counts() series for DETE's resignations, we can do the same with the _dete_start_date_ variable to check for outliers.

# In[36]:


## Checking for outliers ##
dete_resignations['dete_start_date'].value_counts()


# In[37]:


## Checking for outliers in TAFE's cease_date ##
tafe_resignations['cease_date'].value_counts()


# As we can see, the DataFrames aren't entirely congruent: TAFE has cease dates in 2009 but DETE does not, and TAFE has a larger number of resignations since 2010 than DETE. However, we aren't attempting to analyze the resignations by year, and the values fall in line with the guidelines we set above. As such, we can leave the data as is.
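# The value counts above let us eyeball the ranges, but the two consistency rules can also be checked directly with boolean masks. This is a minimal illustrative sketch; it assumes cease_date has already been converted to floats as above.

# In[ ]:


### Counting rows that violate our consistency rules (illustrative) ###
starts_before_1940 = (dete_resignations['dete_start_date'] < 1940).sum()
ceases_before_start = (dete_resignations['cease_date'] < dete_resignations['dete_start_date']).sum()
print('Start dates before 1940:', starts_before_1940)
print('Cease dates before start dates:', ceases_before_start)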
# ## Creating a New Column
# 
# It's important to remember that our research questions depend on the length of time an employee worked for their institute, so we will need to create a new column.
# 
# TAFE's dataset has an _institute_service_ column, referring to the length of an employee's employment. We want to create an _institute_service_ column for DETE's dataset as well.
# 
# This column can be created by subtracting the _dete_start_date_ variable from the _cease_date_ variable.

# In[42]:


## Creating the new column ##
dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']

## Checking the DataFrame for the new column ##
dete_resignations.head()  # institute_service should be formatted as a float


# ## Identifying Dissatisfied Employees
# 
# As a reminder, we're trying to determine, in part, whether employees with varying lengths of employment are resigning due to dissatisfaction. We just created a column to measure length of employment, so next we determine which employees are dissatisfied. In the datasets, there are several columns we will be using to determine dissatisfaction:
# 
# tafe_survey_updated:
# - Contributing Factors. Dissatisfaction
# - Contributing Factors. Job Dissatisfaction
# 
# dete_survey_updated:
# - job_dissatisfaction
# - dissatisfaction_with_the_department
# - physical_work_environment
# - lack_of_recognition
# - lack_of_job_security
# - work_location
# - employment_conditions
# - work_life_balance
# - workload
# 
# We will aggregate these columns into one _dissatisfied_ column.

# In[46]:


## Checking the unique values in TAFE's job dissatisfaction columns ##
# Answers naming the contributing factor should become True, '-' answers False, and everything else NaN
tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts()


# In[47]:


tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts()


# ## Updating Values

# In[51]:


# Mapping the job dissatisfaction values to True, False, or NaN
def update_vals(x):
    if x == '-':
        return False
    elif pd.isnull(x):
        return np.nan
    else:
        return True

## Applying the .applymap() method to the relevant TAFE columns ##
tafe_resignations['dissatisfied'] = tafe_resignations[['Contributing Factors. Dissatisfaction',
                                                       'Contributing Factors. Job Dissatisfaction']].applymap(update_vals).any(axis=1, skipna=False)
# .any() collapses the two columns into the new 'dissatisfied' column
tafe_resignations_up = tafe_resignations.copy()

# Check the unique values after the updates
tafe_resignations_up['dissatisfied'].value_counts(dropna=False)


# In[56]:


## Creating a 'dissatisfied' column in the DETE dataset ##
# These columns are already Boolean, so .any() can be used directly
dete_resignations['dissatisfied'] = dete_resignations[[
    'job_dissatisfaction',
    'dissatisfaction_with_the_department',
    'physical_work_environment',
    'lack_of_recognition',
    'lack_of_job_security',
    'work_location',
    'employment_conditions',
    'work_life_balance',
    'workload']].any(axis=1, skipna=False)
dete_resignations_up = dete_resignations.copy()

# Check the unique values after the updates
dete_resignations_up['dissatisfied'].value_counts(dropna=False)


# Above, we applied the update_vals function to the relevant TAFE columns (the DETE columns were already Boolean) in order to create a new column entitled 'dissatisfied' in each dataset. Simply put, the .any() method sets an employee's dissatisfied value to True if any of the dissatisfaction columns we looked through holds a True value.
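# To make the applymap/.any() step concrete, here is a toy example on made-up answer values (the values below are hypothetical, not taken from the surveys), showing the behavior we rely on in the pandas version this project assumes:

# In[ ]:


### Toy demonstration of the applymap + .any() pattern (hypothetical values) ###
demo = pd.DataFrame({
    'factor_a': ['Job Dissatisfaction', '-', np.nan],
    'factor_b': ['-', '-', np.nan],
})
print(demo.applymap(update_vals).any(axis=1, skipna=False))
# Row 0 -> True (one factor named), row 1 -> False (both '-'),
# row 2 -> NaN, since skipna=False keeps fully missing answers missing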
# ## Combining the Data

# In[58]:


## Adding institute columns ##
dete_resignations_up['institute'] = 'DETE'
tafe_resignations_up['institute'] = 'TAFE'

# Verifying the changes
tafe_resignations_up.head()


# In[65]:


## Stacking the DataFrames vertically with pd.concat() ##
combined = pd.concat([dete_resignations_up, tafe_resignations_up], ignore_index=True, sort=False)

# Verifying the number of non-null values in each column
combined.notnull().sum().sort_values()


# In[66]:


## Dropping columns with fewer than 500 non-null values ##
combined_updated = combined.dropna(thresh=500, axis=1).copy()
# thresh is the minimum number of non-null values a column needs in order to be kept

# Verifying the changes
combined_updated.head()


# Above, we combined the two cleaned datasets in order to analyze them together. To make the resulting DataFrame easier to work with, we used the .dropna() method with the thresh parameter to drop every column with fewer than 500 non-null values.
# 
# ## Cleaning the Service Column
# 
# To clean the differently formatted values of the _institute_service_ column, we will be sorting employees into the following buckets:
# 
# - New: Less than 3 years at a company
# - Experienced: 3-6 years at a company
# - Established: 7-10 years at a company
# - Veteran: 11 or more years at a company

# In[70]:


## Checking the unique values within the institute_service column ##
combined_updated['institute_service'].value_counts(dropna=False)


# In[72]:


### Creating the buckets for each value in institute_service ###

# Extracting the years of service and converting the type to float
combined_updated['institute_service_up'] = combined_updated['institute_service'].astype('str').str.extract(r'(\d+)', expand=False)  # expand=False returns a Series
combined_updated['institute_service_up'] = combined_updated['institute_service_up'].astype('float')

# Verifying the changes
combined_updated['institute_service_up'].value_counts()


# In[74]:


### Converting years of service to categories ###
def categorize_service(val):
    if pd.isnull(val):  # handle missing values first so the comparisons below only see numbers
        return np.nan
    elif val >= 11:
        return 'Veteran'
    elif 7 <= val < 11:
        return 'Established'
    elif 3 <= val < 7:
        return 'Experienced'
    else:
        return 'New'

### Applying the function with .apply() ###
combined_updated['service_cat'] = combined_updated['institute_service_up'].apply(categorize_service)

### Verifying the changes ###
combined_updated['service_cat'].value_counts()


# Above, we cleaned the values in the _institute_service_ column so that they could be grouped, then sorted them into broad categories to support further analysis.
# 
# ## Performing Initial Analysis
# 
# We are going to fill all remaining NaN values in the dissatisfied column (8 of them) with False, because False is the most common value, and then use the column to determine the percentage of employees in each service category who resigned due to some form of dissatisfaction.
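# Before pivoting, it helps to recall that the mean of a Boolean column equals the proportion of True values, which is why pivot_table (whose default aggfunc is the mean) will give us percentages directly. A quick illustrative check:

# In[ ]:


### The mean of a Boolean Series is the share of True values (illustrative) ###
example = pd.Series([True, False, False, True])
print(example.mean())  # 0.5, i.e. 50% True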
# In[81]:


### Confirming the number of True and False values in the dissatisfied column ###
combined_updated['dissatisfied'].value_counts(dropna=False)

### Filling all remaining null values with False ###
combined_updated['dissatisfied'] = combined_updated['dissatisfied'].fillna(value=False)


# In[84]:


### Calculating the percentage of employees who resigned due to dissatisfaction ###
# pivot_table aggregates with the mean by default, and the mean of a Boolean column is the fraction of True values
dissatisfied_pct = combined_updated.pivot_table(index='service_cat', values='dissatisfied')
dissatisfied_pct


# In[87]:


### Plotting the table ###

# Magic function that lets the Jupyter notebook render plots inline
get_ipython().run_line_magic('matplotlib', 'inline')

dissatisfied_pct.plot(kind='bar', rot=35)  # rot=n rotates the x-axis labels by n degrees


# As seen from our initial analysis above, it can be concluded that the employees most likely to resign due to some form of dissatisfaction are those who have been with their institute for 7 or more years (the Established and Veteran categories).
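# As an optional cross-check on the pivot table, an equivalent groupby can report both the dissatisfaction rate and the number of respondents in each category, which helps gauge how much weight each bar carries. This is a supplementary sketch rather than part of the analysis above.

# In[ ]:


### Dissatisfaction rate and group size per service category (illustrative) ###
check = combined_updated.groupby('service_cat')['dissatisfied'].agg(['mean', 'count'])
print(check)  # 'mean' should match dissatisfied_pct; 'count' shows each category's sample size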