#!/usr/bin/env python
# coding: utf-8

# # Guided Project: Clean And Analyze Employee Exit Surveys
# # Data Cleaning and Preparation
# 
# ## Introduction
# 
# * This notebook contains *all data cleaning and preparation operations* in preparation for analysis of employee exit surveys from TAFE and DETE.
# * The analysis and an overview of the project can be found in [analysis.ipynb]().
# 
# In this project I'll try to answer the following questions:
# 
# * Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
# * Are younger employees resigning due to some kind of dissatisfaction? What about older employees?
# 
# I'll be basing the analysis on exit surveys from the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia.
# 
# We don't have a complete data dictionary for the datasets, so we'll have to rely on some general knowledge.
# 
# Below is a preview of a couple of columns we'll work with from dete_survey.csv:
# 
# * `ID`: An id used to identify the participant of the survey
# * `SeparationType`: The reason why the person's employment ended
# * `Cease Date`: The year or month the person's employment ended
# * `DETE Start Date`: The year the person began employment with the DETE
# 
# Below is a preview of a couple of columns we'll work with from tafe_survey.csv:
# 
# * `Record ID`: An id used to identify the participant of the survey
# * `Reason for ceasing employment`: The reason why the person's employment ended
# * `LengthofServiceOverall. Overall Length of Service at Institute (in years)`: The length of the person's employment (in years)
# 
# This notebook stores the prepared data for analysis as pickled dataframes.
# 
# * `combined_updated.pickle` contains a subset of the exit survey data:
#     * Rows for resignations only
#     * Only a subset of columns
#     * Cleaned key columns
#     * Missing employment durations filled in by deriving them from existing columns
#     * Categorized employment duration
#     * A new column summarizing job dissatisfaction

# ## Import Libraries and Configure Notebook Environment

# In[1]:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

get_ipython().run_line_magic('matplotlib', 'inline')
get_ipython().run_line_magic('config', "InlineBackend.figure_format='retina'")

# ## Read in Data

# In[2]:

dete_survey = pd.read_csv('dete_survey.csv')
tafe_survey = pd.read_csv('tafe_survey.csv')

# ## DETE Survey Initial Overview

# In[3]:

dete_survey.info()

# In[4]:

dete_survey.head()

# The DETE survey dataset contains 822 rows and 56 columns. The key columns of interest were imported with no null values, but a cursory glance at some of the rows shows that some of those columns use other values, like "Not Stated", for missing/unknown data. Many columns have a dtype of 'object', some of which can be converted into boolean values. Many columns have null values. Most of those with null values are less than 25% null, but some have more, and a few are more than 75% null; the quick check below quantifies this.
# 
# Essential columns:
# 
# * `ID`: 0 missing values. int64. Values shouldn't need manipulation.
# * `SeparationType`: 0 missing values. Object. These appear to be categorical values. It remains to be seen whether they are consistent or need to be cleaned.
# * `Cease Date`: 0 missing values. Object. Format appears to be MM/YYYY.
# * `DETE Start Date`: 0 missing values. Object. Format appears to be YYYY. Values include "Not Stated".
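# The per-column share of null values described above can be quantified with a
# quick, illustrative check (not something `.info()` reports directly):

# In[ ]:

# Fraction of nulls per column, worst offenders first.
dete_survey.isnull().mean().sort_values(ascending=False).head(10)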
# ## TAFE Survey Initial Overview

# In[5]:

tafe_survey.info()

# In[6]:

tafe_survey.head()

# The TAFE survey dataset contains 702 rows and 72 columns. Key columns of interest have missing values. The `LengthofServiceOverall. Overall Length of Service at Institute (in years)` column is missing about 15% of its values (106 of 702). Non-null values in some columns include other values, like a dash, that seem to indicate missing information. Most columns have a dtype of 'object', many of which look like they can be converted into boolean values. Some appear to contain a numeric range. Many column names are long and unwieldy; they appear to include the full text of survey questions.
# 
# Essential columns:
# 
# * `Record ID`: 0 missing values. float64. A float is certainly not the best representation of an id, but it may not matter for the purposes of our analysis.
# * `Reason for ceasing employment`: 1 missing value. Object. Appears to be categorical. It is not clear whether the categories are represented consistently, or whether they'll need cleaning.
# * `LengthofServiceOverall. Overall Length of Service at Institute (in years)`: 106 missing values. Object. Appears to be categorical, giving a range of years.
# 
# It will take some work to merge these datasets for even the limited set of columns listed above.

# ## Manipulation / Cleaning

# ### Handle additional NaN values at import

# In[7]:

dete_survey = pd.read_csv('dete_survey.csv', na_values=['Not Stated'])

# ### Drop unneeded columns

# In[8]:

dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis=1)
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis=1)

# In[9]:

dete_survey_updated.info()

# In[10]:

tafe_survey_updated.info()

# * Reimported `dete_survey.csv` so the value "Not Stated" is recognized as NaN.
# * Dropped unneeded columns from both datasets.

# ### Clean and normalize column names

# In[11]:

dete_survey_updated.columns = dete_survey_updated.columns.str.lower().str.strip().str.replace(' ', '_')
dete_survey_updated.columns

# In[12]:

dete_survey_updated.head()

# In[13]:

tafe_column_map = {
    'Record ID': 'id',
    'CESSATION YEAR': 'cease_date',
    'Reason for ceasing employment': 'separationtype',
    'Gender. What is your Gender?': 'gender',
    'CurrentAge. Current Age': 'age',
    'Employment Type. Employment Type': 'employment_status',
    'Classification. Classification': 'position',
    'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
    'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'
}
tafe_survey_updated = tafe_survey_updated.rename(tafe_column_map, axis=1)
tafe_survey_updated.columns

# In[14]:

tafe_survey_updated.head()

# * Standardized the formatting of column names in the `dete_survey_updated` dataframe.
# * Renamed some columns in the `tafe_survey_updated` dataframe to match. Untouched column names need additional work (the overlap check below shows how far the renaming got us).
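# An illustrative sanity check: list the column names the two frames now share.
# The key columns renamed above should all appear here.

# In[ ]:

sorted(set(dete_survey_updated.columns) & set(tafe_survey_updated.columns))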
# ### Select only resignations

# In[15]:

dete_survey_updated['separationtype'].value_counts()

# In[16]:

tafe_survey_updated['separationtype'].value_counts()

# In[17]:

# .copy() avoids SettingWithCopyWarning on later column assignments.
dete_resignations = dete_survey_updated[dete_survey_updated['separationtype'].str.contains("Resignation")].copy()
tafe_resignations = tafe_survey_updated[tafe_survey_updated['separationtype'] == 'Resignation'].copy()

# ### Clean and normalize 'cease_date'

# In[18]:

dete_resignations['cease_date'].value_counts()

# In[19]:

# Extract the four-digit year (2000-2019), with or without a leading 'MM/'.
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.extract(r'/?(20[01][0-9])')[0].astype(float)

# In[20]:

dete_resignations['cease_date'].value_counts().sort_index()

# Extracted the year and converted it into a numeric value.

# ### Examine 'dete_start_date'

# In[21]:

dete_resignations['dete_start_date'].value_counts().sort_index()

# In[22]:

dete_resignations['dete_start_date'].isnull().sum()

# No cleaning appears necessary at this point.

# ### Check 'cease_date' and 'dete_start_date' for unusual values

# In[23]:

dete_resignations['cease_date'].value_counts().sort_index()

# In[24]:

dete_resignations[['cease_date', 'dete_start_date']].plot.box(subplots=True)

# In[25]:

tafe_resignations['cease_date'].value_counts().sort_index()

# In[26]:

tafe_resignations['cease_date'].plot.box()

# * The most recent resignations in the DETE dataset were in 2014, while the most recent start date is 2013. The earliest start date among those who resigned from DETE is 1960. These are internally consistent.
# * Resignation dates in the TAFE dataset range from 2010 to 2013, with the majority in the 2011-2012 range.
# * The earliest resignation in the DETE dataset was in 2006, and the next two were in 2010. The bulk of resignations, over 250, occurred in 2012 and 2013. I will examine this further in the analysis section.
# * These facts are consistent with expectations, suggesting the datasets are in good condition for further analysis.

# ### Calculate 'institute_service' duration for DETE data

# In[27]:

dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']

# In[28]:

dete_resignations['institute_service'].isnull().sum()

# In[29]:

tafe_resignations['institute_service'].isnull().sum()

# * Added a new column called `institute_service` to `dete_resignations`, to complement the column of the same name in `tafe_resignations`, and populated it with calculated values (sanity-checked below).
# * I will use these values to categorize employees in my later analysis.
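# Because `institute_service` was derived by subtracting dates, a minimal,
# illustrative sanity check is to confirm that no negative durations appear:

# In[ ]:

# Should be 0: every cease_date should fall on or after its dete_start_date.
(dete_resignations['institute_service'] < 0).sum()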
# ### Consolidate multiple columns into a summary 'dissatisfied' column

# In[30]:

tafe_dissatisfaction_columns = [
    'Contributing Factors. Dissatisfaction',
    'Contributing Factors. Job Dissatisfaction'
]
dete_dissatisfaction_columns = [
    'job_dissatisfaction',
    'dissatisfaction_with_the_department',
    'physical_work_environment',
    'lack_of_recognition',
    'lack_of_job_security',
    'work_location',
    'employment_conditions',
    'work_life_balance',
    'workload'
]

# In[31]:

tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts()

# In[32]:

tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts()

# In[33]:

def update_vals(value):
    """Map TAFE contributing-factor values to True, False, or NaN."""
    if pd.isnull(value):
        return np.nan
    elif value == '-':
        return False
    else:
        return True

tafe_resignations[tafe_dissatisfaction_columns] = tafe_resignations[tafe_dissatisfaction_columns].applymap(update_vals)

# In[34]:

tafe_resignations['dissatisfied'] = tafe_resignations[tafe_dissatisfaction_columns].any(axis=1, skipna=False)
dete_resignations['dissatisfied'] = dete_resignations[dete_dissatisfaction_columns].any(axis=1, skipna=False)

# * Identified columns in both datasets that indicate job dissatisfaction.
# * Verified that they didn't contain unexpected values.
# * Cleaned the relevant columns in `tafe_resignations` so values are represented as True, False, or NaN, for consistency with the `dete_resignations` dataset and general best practices.
# * Created a summary column called `dissatisfied` in both datasets.

# ### Combine multiple columns into a new 'other_reasons' column

# In[35]:

# Sort into a list so the result can be used directly as a column indexer.
dete_other_reason_columns = sorted(set(dete_resignations.columns[10:27]) - set(dete_dissatisfaction_columns))
dete_other_reason_columns

# In[36]:

dete_resignations[dete_other_reason_columns]

# In[37]:

tafe_other_reason_columns = sorted(set(tafe_resignations.columns[5:16]) - set(tafe_dissatisfaction_columns))
tafe_other_reason_columns

# In[38]:

tafe_resignations[tafe_other_reason_columns] = tafe_resignations[tafe_other_reason_columns].applymap(update_vals)
tafe_resignations[tafe_other_reason_columns]

# In[39]:

tafe_resignations['other_reasons'] = tafe_resignations[tafe_other_reason_columns].any(axis=1, skipna=False)
dete_resignations['other_reasons'] = dete_resignations[dete_other_reason_columns].any(axis=1, skipna=False)

# In[40]:

tafe_resignations_up = tafe_resignations.copy()
dete_resignations_up = dete_resignations.copy()

# ### Combine DETE and TAFE datasets

# In[41]:

dete_resignations_up['institute'] = "DETE"
tafe_resignations_up['institute'] = "TAFE"

# In[42]:

combined = pd.concat([dete_resignations_up, tafe_resignations_up], ignore_index=True)

# In[43]:

combined.info()

# In[44]:

# Without an explicit copy, later assignments to columns of combined_updated
# trigger SettingWithCopyWarning, because pandas flags the result of dropna()
# as potentially sharing data with `combined`.
combined_updated = combined.dropna(thresh=500, axis=1).copy()
combined_updated.info()

# In preparation for combining both datasets into one, I created a column called `institute` in each dataset and populated it with an identifier. This allows the origin of individual rows to be identified once the datasets are combined. The datasets were then concatenated. Afterwards, the number of null values in all columns was inspected. All necessary columns had at least 500 non-null values. Columns with fewer than 500 non-null values were dropped, as they were not necessary or useful for our analysis.

# ### Categorization of service duration
# 
# The `institute_service` column currently contains a mishmash of categorical values and numeric values. I'm going to rework them into the following categories:
# 
# * New: Less than 3 years at a company
# * Experienced: 3-6 years at a company
# * Established: 7-10 years at a company
# * Veteran: 11 or more years at a company

# In[45]:

combined_updated['institute_service'].value_counts()

# The existing categorical values have different spans than the categories I plan to use, but fortunately, all of the existing spans fit within them. This means I can concern myself only with extracting each span's starting value, as illustrated in the sketch below.
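# A minimal illustration of the extraction (the sample strings here are assumed,
# based on the value_counts above): `str.extract(r'(\d+)')` keeps only the first
# run of digits, i.e. the starting value of each span.

# In[ ]:

pd.Series(['1-2', '11-20', 'Less than 1 year', 'More than 20 years']).str.extract(r'(\d+)', expand=False).astype(float)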
# In[46]:

# Keep only the first run of digits: for ranges like '5-6' this is the span's start.
combined_updated['institute_service'] = combined_updated['institute_service'].astype(str).str.extract(r'(\d+)', expand=False).astype(float)

# In[47]:

combined_updated['institute_service'].value_counts(dropna=False)

# In[48]:

def calc_service_cat(value):
    """Map years of service to one of the four service categories."""
    if pd.isnull(value):
        return np.nan
    elif value < 3:
        return "New"
    elif value < 7:
        return "Experienced"
    elif value < 11:
        return "Established"
    else:
        return "Veteran"

# In[49]:

combined_updated['service_cat'] = combined_updated['institute_service'].apply(calc_service_cat)

# In[50]:

combined_updated['service_cat'].value_counts(dropna=False)

# The `institute_service` column has been converted into consistent categories using the criteria described earlier.

# ### Handle missing values in 'dissatisfied' and 'other_reasons' columns

# In[51]:

combined_updated['dissatisfied'].value_counts(dropna=False)

# In[52]:

# Fill NaN with the most common value (False).
combined_updated['dissatisfied'] = combined_updated['dissatisfied'].fillna(value=False).astype(bool)

# In[53]:

combined_updated['other_reasons'].value_counts(dropna=False)

# In[54]:

combined_updated['other_reasons'] = combined_updated['other_reasons'].fillna(value=False).astype(bool)

# ### Clean 'age' values

# In[55]:

combined_updated['age'].value_counts(dropna=False)

# In[56]:

# Normalize the age bands: unify the separators (TAFE uses a double space,
# DETE a dash) and collapse the two oldest bands into '56 or older'.
combined_updated['age'] = (combined_updated['age']
                               .str.strip()
                               .str.replace('  ', '-')
                               .str.replace('56-60', '56 or older')
                               .str.replace('61 or older', '56 or older')
                          )
combined_updated['age'].value_counts(dropna=False)

# ### Explore null values in 'institute_service'

# In[57]:

combined_updated.loc[combined_updated['institute_service'].isnull(), 'institute'].value_counts()

# In[58]:

combined_updated.pivot_table(index='institute', values='institute_service', aggfunc=['count'])

# We are missing `institute_service` values for roughly 20% of resignations. Our analysis only looks at the subset where this column is populated, so filling in this column or dropping rows is unnecessary.

# ### Store data for separate analysis

# In[59]:

combined_updated.to_pickle('combined_updated.pickle')
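# As an illustrative round-trip check (analysis.ipynb would load the prepared
# data the same way), the pickle can be read straight back:

# In[ ]:

pd.read_pickle('combined_updated.pickle').info()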