#!/usr/bin/env python
# coding: utf-8

# # Is dissatisfaction the reason for the employees who worked for the institutes to resign in Australia?
# 
# ## Intorduction & Goals
# 
# In this guided project, we'll work with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. The TAFE exit survey [here](https://data.gov.au/dataset/ds-qld-89970a3b-182b-41ea-aea2-6f9f17b5907e/details?q=exit%20survey) and the survey for the DETE [here](https://data.gov.au/dataset/ds-qld-fe96ff30-d157-4a81-851d-215f2a0fe26d/details?q=exit%20survey). There are some slight modifications in the data set we used comparing to the original, including changing the encoding to UTF-8 (the original ones are encoded using cp1252.)
# 
# Witht the above dataset we want to ask the below questions:
# 
# - Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
# 
# - Are younger employees resigning due to some kind of dissatisfaction? What about older employees?

# In[152]:


# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
get_ipython().run_line_magic('matplotlib', 'inline')

# import dataset
dete_survey = pd.read_csv('dete_survey.csv')
tafe_survey = pd.read_csv('tafe_survey.csv')


# In[2]:


dete_survey.info()


# * There are 822 rows and 56 variables in the **dete** dataset and most of the variables have missing values. 
# * Some variables are not necessary to our mission.(e.g: columns 29-49)

# In[3]:


# finding the number & percentage of missing value
dete_len = dete_survey.shape[0] # total number of columns
n = 0 # columns with missing value counter
for col in dete_survey.columns:
    if dete_survey[col].isnull().any():
        n += 1
        missing = dete_survey[col].isnull().sum()
        print('{}: {}, {:.2%}'.format(col, missing, missing / dete_len))
print('\n', n, ' columns with missing value in dete_survey.')


# In[4]:


dete_survey.head(2)


# `DETE Start Date` and `Role Start Date` have the value *Not Stated*, which should be regconized as missing value as well.

# In[5]:


tafe_survey.info()


# - There are 702 rows and 72 variables in the **tafe** dataset and most of the variables have missing values.  
# - Comparing to **dete_survey**, some of the columns have the same content but in different columns' name. Therefore we have to modify the column names before combining both data sets.
# - Some variables are not necessary to our mission since we only want to know if dissatification is the reason. (e.g: columns 18-66)

# In[6]:


# finding the number & percentage of missing value
tafe_len = tafe_survey.shape[0] # total number of columns
n = 0 # columns with missing value counter
for col in tafe_survey.columns:
    if tafe_survey[col].isnull().any():
        n += 1
        missing = tafe_survey[col].isnull().sum()
        print('{}: {}, {:.2%}'.format(col, missing, missing / dete_len))
        
print('\n', n, 'columns with missing values in tafe_survey.')


# In[7]:


tafe_survey.head(2)


# In some columns like `Career Move - Public Secotr` contains the value *-*, which should be regconized as missing value.

# ## Data cleaning
# 
# ### Update NaN value
# We read the **date_suvey.csv** again and replace *Not Stated* as *NaN*.

# In[8]:


dete_survey = pd.read_csv('dete_survey.csv', 
                          na_values = 'Not Stated')


# ### Drop unnecessary columns

# As mentioned before, we consider only *dissatification* as the target reason in this project. We will drop all other reasons to trim the dataset.

# In[9]:


dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49],
                                       axis = 1)


# In[10]:


tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66],
                                       axis = 1)


# ### Synchronize the column names

# In order to combine both data sets, we need a common and standardized column names for both data sets.

# In[11]:


# lower case, remove trailing whitesapce and replace ' ' as '_'
dete_survey_updated.columns = dete_survey_updated.columns.str.lower().str.strip().str.replace(' ', '_')
dete_survey_updated.columns


# In[12]:


column_names = {'Record ID': 'id',
                'CESSATION YEAR': 'cease_date',
                'Reason for ceasing employment': 'separationtype',
                'Gender. What is your Gender?': 'gender',
                'CurrentAge. Current Age': 'age',
                'Employment Type. Employment Type': 'employment_status',
                'Classification. Classification': 'position',
                'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
                'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}

# lower case, remove trailing whitesapce and replace ' ' as '_'
tafe_survey_updated = tafe_survey_updated.rename(column_names, axis=1)
tafe_survey_updated.columns


# ### Extract data with resignation only
# 
# There are many reason an employee leaves the institute, but we oonly need the data from those resigned. Therefore, we will keep those rows which contain *Resgination* in `separationtype`, excluding NaN.

# In[13]:


dete_survey_updated['separationtype'].value_counts()


# In[14]:


dete_resignations = dete_survey_updated[
    dete_survey_updated['separationtype'].str.contains('Resignation', na=False)
].copy()


# In[15]:


tafe_survey_updated['separationtype'].value_counts()


# In[16]:


tafe_resignations = tafe_survey_updated[
    tafe_survey_updated['separationtype'].str.contains('Resignation', na=False)
].copy()


# ### Data verification
# 
# In this step, we'll focus on verifying that the years in the `cease_date` and `dete_start_date` columns make sense.
# 
# - Since the `cease_date` is the last year of the person's employment and the `dete_start_date` is the person's first year of employment, it wouldn't make sense to have years after the current date.
# - Given that most people in this field start working in their 20s, it's also unlikely that the `dete_start_date` was before the year 1940.

# In[17]:


dete_resignations['cease_date'].value_counts()


# In[18]:


pattern =r"([1-2][0-9]{3})"
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.extract(pattern).astype('float')
dete_resignations['cease_date'].value_counts().sort_index()


# In[19]:


dete_resignations['dete_start_date'].value_counts().sort_index()


# In[20]:


pattern =r"([1-2][0-9]{3})"
tafe_resignations['cease_date'].value_counts().sort_index()


# In[21]:


tafe_resignations['cease_date'].value_counts().sort_index()


# In[22]:


dete_resignations.boxplot(['cease_date','dete_start_date'])
plt.title('Cease_date and dete_Start_date in dete_survey')


# The `cease_date` are all between 2009 and 2013 while the `dete_start_date` is starting from 1963. 

# In[23]:


tafe_resignations.boxplot(['cease_date'])
plt.title('Cease_date in tafe_survey')
plt.ylim(2008, 2020)


# The range of `cease_date` in **tafe_survey** is similar to **dete_survey**.
# 
# In general, both data set do not have any major issues with the years.

# ### Extract useful informations
# 
# As we also want to know if there are any relationship between service length of an employee and the resignation due to dissatisfaction, we need extract the length of service from `cease_date` and `dete_start_date`.

# In[24]:


dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']
dete_resignations.boxplot('institute_service')
plt.title('Length of institute service before resignation')


# Meanwhile, we have to classify the data into dissatisfcation or not. 

# Let's check with **tafe_survey** first. There are two columns about dissatiscfaction: `Contributing Factors. Dissatisfaction` and `Contributing Factors. Job Dissatisfaction`. We will create a new column to combine these information.

# In[25]:


tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts(dropna = False)


# In[26]:


tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts(dropna = False)


# Both columns have only 2 values so we can change them into boolean type.

# In[27]:


# function to change variables into boolean
def update_vals(string):
    if string is np.nan: return np.nan
    elif string is '-': return False
    else: return True


# In[28]:


tafe_resignations['Contributing Factors. Dissatisfaction'] = tafe_resignations['Contributing Factors. Dissatisfaction'].map(update_vals)
tafe_resignations['Contributing Factors. Job Dissatisfaction'] = tafe_resignations['Contributing Factors. Job Dissatisfaction'].map(update_vals)


# In[29]:


# create dissatisfied column. True if any column is true.
tafe_resignations['dissatisfied'] = tafe_resignations[['Contributing Factors. Dissatisfaction',
                                                       'Contributing Factors. Job Dissatisfaction']].any(axis=1, 
                                                                                                         skipna=False)


# For **dete_survey**, we have more columns regarding dissatisification. Unluckily, all those columns are already in boolean type.

# In[32]:


dissatisfaction_col = ['job_dissatisfaction','dissatisfaction_with_the_department','physical_work_environment',
                  'lack_of_recognition','lack_of_job_security','work_location','employment_conditions',
                  'work_life_balance','workload']

for col in dissatisfaction_col:
    print(col, ': \n')
    print(dete_resignations[col].value_counts(dropna = False))


# In[117]:


# create dissatisfied column. True if any column is true.
dete_resignations['dissatisfied'] = dete_resignations[dissatisfaction_col].any(axis=1, skipna=False)


# In[118]:


dete_resignations_up = dete_resignations.copy()
tafe_resignations_up = tafe_resignations.copy()


# ### Combine dataset
# 
# Since both data sets are clean and ready, we will combine them together. In order to distinguish both institute, we will create an extra column `institute`. Also we will only keep the necessay columns existing in both data set. Considering that both data set have more than 500 rows, we can exclude all columns which has less than 500 non missing values.

# In[119]:


dete_resignations_up['institute'] = 'DETE'
tafe_resignations_up['institute'] = 'TAFE'


# In[120]:


combined = pd.concat([dete_resignations_up, tafe_resignations_up],
                     axis = 0,
                     ignore_index = True)


# In[121]:


combined_updated = combined.dropna(axis = 1, thresh = 500)


# ### Clean the service column
# 
# Now that we've combined our dataframes, we're almost at a place where we can perform some kind of analysis! First, though, we'll have to clean up the `institute_service` column. This column is tricky to clean because it currently contains values in a couple different forms:

# In[122]:


combined_updated['institute_service'].value_counts(dropna=False)


# To analyze the data, we'll convert these numbers into categories. We'll base our analysis on [this article](https://www.businesswire.com/news/home/20171108006002/en/Age-Number-Engage-Employees-Career-Stage), which makes the argument that understanding employee's needs according to career stage instead of age is more effective.
# 
# We'll use the slightly modified definitions below:
# 
# - New: Less than 3 years at a company
# - Experienced: 3-6 years at a company
# - Established: 7-10 years at a company
# - Veteran: 11 or more years at a company
# 
# Let's categorize the values in the `institute_service` column using the definitions above.

# In[123]:


# extract the first digit from institute_service for grouping
combined_updated['institute_service'] = combined_updated['institute_service'].astype('str').str.extract(r'(\d+)').astype('float')


# In[124]:


# mapping value into group
def service_level(num):
    if pd.isnull(num): return np.nan
    elif num < 3: return 'New'
    elif num < 7: return 'Experienced'
    elif num < 11: return 'Established'
    else: return 'Veteran'


# In[125]:


combined_updated['service_cat'] = combined_updated['institute_service'].map(service_level)


# In[126]:


# create service_cat for the new group
combined_updated['service_cat'].value_counts(dropna = False)


# ## 
# 

# ### Fill NaN
# 
# Before heading to analysis, we have to fill in all missing value to prevent any errors. We will use the mode as the filling value.

# In[127]:


combined_updated['dissatisfied'].value_counts(dropna = False)


# In[128]:


combined_updated['dissatisfied'].fillna(False, inplace=True)


# ## Analysis and Conclusion

# In[159]:


table = pd.pivot_table(combined_updated, index = 'service_cat', values = 'dissatisfied')
table['pos'] = [2,1,0,3]
table['pos'] = table['pos'].astype('int')
table.sort_values(by=['pos'])['dissatisfied'].plot(kind='barh')
plt.title('Percentage of dissatisfied employees among service_cat')
# sns.set_style('white')
sns.despine(bottom = True, left= True)


# Above plot provides us a clear answer to our first questions: Around 1/3 of the employees who worked for a short period resigned due to dissatisfaction. Meanwhile, half of the employees who worked for a longer period is much higher, around 50%.