#!/usr/bin/env python
# coding: utf-8

# # CLEAN AND ANALYZE EMPLOYEE EXIT SURVEYS
# 
# **In this project, we are going to work with exit surveys from empoyees of the Department of Education,Training Employment (DETE) and The technical and Further Education (TAFE)  Institute in Queensland, Australia.**click [here](https://data.gov.au/dataset/ds-qld-89970a3b-182b-41ea-aea2-6f9f17b5907e/details?q=exit%20survey) to get TAFE data and [here](https://data.gov.au/dataset/ds-qld-fe96ff30-d157-4a81-851d-215f2a0fe26d/details?q=exit%20survey) for DETE data**
# 
# **We will combine the two dataset DETE and TAFE to analyse people who resigned due to some kind of dissatisfaction.**
# **Our goals therefore depend on these two**
# * Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
# * Are younger employees resigning due to some kind of dissatisfaction? What about older employees?
# 

# ## 1.1 Reading our data into pandas
# We will first read the two dataset into pandas.The datasets are stored in csv file ```dete_survey.csv``` and ```tefe_survey.csv```
# 
# Remember we will as well import the pandas and numpy libraries

# In[1]:


import pandas as pd
import numpy as np

#  let's read the two dataset

dete_survey = pd.read_csv("dete_survey.csv")
tafe_survey = pd.read_csv("tafe_survey.csv")
 

# ## 1.2 Exploring the datasets
# 
# ## a)DETE
# 

# In[3]:


dete_survey.head() # this prints out few rows


# In[4]:


dete_survey.describe() #this prints out some information the dataframe


# In[5]:


dete_survey.info()


# from the doutput above we can confirm that our dataframe(dete_survey) contain 822 entries and 56 columns, but we can also roughly say that business unit contains less entries compared to other columns.

# In[6]:


dete_survey.isnull().sum()


# **We can confirm , there are over 25 entries column with NAN values, with some columns like Torress Strait with only two entries.**
# 
# ## b) TAFE

# In[7]:


tafe_survey.head() # display the first five rows


# In[8]:


tafe_survey.describe() # this display statistcal information on columns with  numerical values


# In[9]:


tafe_survey.isnull().sum()


# it's also clear that the dataframe contains alot of columns with missing values e.g employment type.employment
# 
# **From the dataframe we can conclude that:**
# 
# * The dete_survey dataframe contains 'Not Stated' values that indicate values are missing, but they aren't represented as NaN.
# 
# * Both the dete_survey and tafe_survey dataframes contain many columns that we don't need to complete our analysis.
# 
# * Each dataframe contains many of the same columns, but the column names are different.
# 
# * There are multiple columns/answers that indicate an employee resigned because they were dissatisfied.

# ## 1.3 fixing the missing values in dete_survey
# we will again read ```dete_survey.csv``` csv file into pandas again, but this time read the not stated values as NAN
# To do this we set the ```na_values``` parameter to ```Not stated ``` 

# In[12]:


dete_survey = pd.read_csv("dete_survey.csv", na_values ="Not Stated")


# ## 1.4 dropping columns:
# some columns are not helpfull to analysis, we'll therefore have to drop them.
# 
# * i.dete_survey , where we will drop from index [28:49] 
# * ii.tafe_survey, where we will drop from index [17:66]
# 

# In[13]:


dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49],axis = 1)
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis = 1)


# ## 1.5 stardadizing and renaming the columns 
# 
# Each dataframe contains many of the same columns, but the column names are different. Below are some of the columns we'd like to use for our final analysis
# 
# We'll  therefore use;  the ```DataFrame.columns``` attribute along with vectorized string methods to update all of the columns at once
# 
# **i. dete_survey_update**

# In[14]:


dete_survey_updated.columns = dete_survey_updated.columns.str.lower().str.strip().str.replace(' ', '_')

# let's survey the current column

dete_survey_updated.columns


# from the output wee can confirm that columns' nams havebeen updated

# **ii. tafe_survey_updated**
#  in this dataframe we'll rename the columns

# In[16]:


name = {'Record ID': 'id',
'CESSATION YEAR': 'cease_date',
'Reason for ceasing employment': 'separationtype',
'Gender. What is your Gender?': 'gender',
'CurrentAge. Current Age': 'age',
'Employment Type. Employment Type': 'employment_status',
'Classification.  Classification': 'position',
'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}


tafe_survey_updated.rename(columns=name, inplace=True)
tafe_survey_updated.columns


# From the above columns you can notice that some of the columns have been renamed.

# ## 1.6 Filter the data
# Remember for this project, we'll only analyze survey respondents who resigned, so we'll only select separation types containing the string 'Resignation'.

# In[17]:


# Check the unique values for the separationtype column
tafe_survey_updated['separationtype'].value_counts()


# In[18]:


# Check the unique values for the separationtype column
dete_survey_updated['separationtype'].value_counts()


# From the above two output, we can confirm that, in TEFE there is only one value called ```Resignation``` while in DETE  we have ```Resignation-Other reasons```,```Resignation-Other employer```,```Resignation-Move overseas/interstate ```
# 
# so we need to update all seperation types containing the word ```resignation``` in DETE to ```Resignation```

# In[19]:


# Update all separation types containing the word "resignation" to 'Resignation'

dete_survey_updated['separationtype'] = dete_survey_updated['separationtype'].str.split('-').str[0]

# Check the values in the separationtype column were updated correctly

dete_survey_updated['separationtype'].value_counts()


# The output above has now confirmed that DETE now contain only one value called ```Resignation``` that is after being updated.
# 
# With the information, we will copy therefore the two dataframe to avoid the SettingWithCopy Warning

# In[21]:


# Select only the resignation separation types from each dataframe
dete_resignations = dete_survey_updated[dete_survey_updated['separationtype'] == 'Resignation'].copy()
tafe_resignations = tafe_survey_updated[tafe_survey_updated['separationtype'] == 'Resignation'].copy()

# Let's check few rows in each dataframe

print("Few rows in DETE")
dete_resignations.head()


# In[22]:


print("Few rows in TAFE")
tafe_resignations.head()


# ## 1.7 verify the data
# Before we clean and manupulate the our data, we need to verify that the data doesn't contain any major inconsistencies.
# we'll therefore focus on verifying that the years in the ```cease_date``` and ```dete_start_date``` columns make sense that is;
# * under no circumstances we have years after the current date since the  ```cease_date``` is the last year of the person's employment and the ```dete_start_date``` is the person's first year of employment
# * Also,given that most people in this field start working in their 20s, it's also unlikely that the ```dete_start_date``` was before the year 1940.

# In[23]:


# let's check unique values in cease_date column

dete_resignations['cease_date'].value_counts()


# In[24]:


# we will now extract the years and convert the type to float
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.split('/').str[-1]
dete_resignations['cease_date'] = dete_resignations['cease_date'].astype("float")

dete_resignations['cease_date'].value_counts()


# In[25]:


# we will now check unique values and look for the outliers in dete_start_date column
dete_resignations['dete_start_date'].value_counts().sort_values()


# In[27]:


# unique values in cease_date column (tafe)
tafe_resignations['cease_date'].value_counts().sort_values()


# From the work we did in the last screen, we can verify:
# * There aren't any major issues with the years.
# * The years in each dataframe don't span quite the same number of years.
# 
#  Now that we've verified the years in the ```dete_resignation``` dataframe, we'll use them to create a new column.
#  
#  Just for refreshment, our end goal is to answer the following question:;
#  * Are employees who have only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been at the job longer?
#  

# ## 1.8 working on years of service
# The ```tafe_resignations``` dataframe already contains a "service" column, which we renamed to ```institute_service```. In order to analyze both surveys together, we'll have to create a corresponding ```institute_service``` column in ```dete_resignations```
# 
# Have a look.

# In[29]:


# Calculate the length of time an employee spent in their respective workplace and create a new column
dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']


# In[30]:


dete_resignations.head()


# The output above, contain a new added column (institute_service ) that we'll use to analyze survey respondents according to their length of employment.

# ## 1.9 Identify Dissatisfied Employees.
# 

# we'll now identify any employees who resigned because they were dissatisfied.
# 
# Below are the columns we'll use to categorize employees as "dissatisfied" from each dataframe.
# 
# 1.   tafe_survey_updated
# 
#   *  Contributing Factors. Dissatisfaction
#   *  Contributing Factors. Job Dissatisfaction
#   
#   
# 2.   dete_survey_updated
# 
#   *  job_dissatisfaction
#   *  dissatisfaction_with_the_department
#   *  physical_work_environment
#   *  lack_of_recognition
#   *  lack_of_job_security
#   *  work_location
#   *  employment_conditions
#   *  work_life_balance
#   *  workload

# If the employee indicated any of the factors above caused them to resign, we'll mark them as dissatisfied in a new column.

# To create the new column, we'll do the following:
# 
# Convert the values in the ```Contributing Factors. Dissatisfaction``` and ```Contributing Factors. Job Dissatisfaction``` columns in the ```tafe_resignations``` dataframe to ```True```, ```False```, or ```NaN values```.
# If any of the columns listed above contain a True value, we'll add a True value to a new column named ```dissatisfied```. To accomplish this, we'll use the ```DataFrame.any()``` method to do the following:
# 
# Return ```True``` if any element in the selected columns above is True
# 
# Return ```False``` if none of the elements in the selected columns above is True
# 
# Return ```NaN``` if the value is NaN

# In[32]:


# let's have a quick look of unique values in each column in tafe df
tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts()


# In[33]:


tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts()


# ## 2.0 update the values

#  we will Update the values in the contributing factors columns to be either True, False, or NaN, we will therefore have to create a fuctions.

# In[35]:


def update_vals(x):
    if x == '-':
        return False
    elif pd.isnull(x):
        return np.nan
    else:
        return True
tafe_resignations['dissatisfied'] = tafe_resignations[['Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction']].applymap(update_vals).any(1, skipna=False)
tafe_resignations_up = tafe_resignations.copy()


tafe_resignations_up['dissatisfied'].value_counts()


# In[36]:


dete_resignations.columns


# In[37]:


# working on dete df
dete_resignations['dissatisfied'] = dete_resignations[['job_dissatisfaction',
       'dissatisfaction_with_the_department', 'physical_work_environment',
       'lack_of_recognition', 'lack_of_job_security', 'work_location',
       'employment_conditions', 'work_life_balance',
       'workload']].any(1, skipna=False)
dete_resignations_up = dete_resignations.copy()
dete_resignations_up['dissatisfied'].value_counts(dropna=False)


# To recap, we've accomplished the following:
# 
# *  Renamed our columns
# *  Dropped any data not needed for our analysis
# *  Verified the quality of our data
# *  Created a new institute_service column
# *  Cleaned the Contributing Factors columns
# *  Created a new column indicating if an employee resigned because they were dissatisfied in some way

# Now, we're finally ready to combine our datasets! Our end goal is to aggregate the data according to the institute_service column, so when we combine the data,we think about how to get the data into a form that's easy to aggregate.

# ## 2.1 combining the data

#  let's first add a column to each dataframe that will allow us to easily distinguish between the two

# In[39]:


dete_resignations_up['institute'] = 'DETE' # adding new column called where each row contain name called DETE
tafe_resignations_up['institute'] = 'TAFE' # adding new column called where each row contain name called DETE

# lets then combine the two data

combined = pd.concat([dete_resignations_up,tafe_resignations_up],ignore_index = True)
combined.head()


# ## 2.2 dropping some columns
# Remember that we still have some columns left in the dataframe that we don't need to complete our analysis. Use the DataFrame.dropna() method to drop any columns with less than 500 non null values

# In[41]:


# let's check first the columns
combined.notnull().sum().sort_values(ascending=False)


# From the output, we can see that most of the columns we need in our analysis have more than 500 non null values.we'll therefore set 500 as our threshold to drop  null values

# In[42]:


combined_updated = combined.dropna(thresh=500, axis=1)
combined_updated.head(10)


# ## 2.3 cleaning up the institute service column.
# we'll have to clean up the ```institute_service```  after which we'll convert the numbers into categories that is;
# 
# * New: Less than 3 years at a company
# * Experienced: 3-6 years at a company
# * Established: 7-10 years at a company
# * Veteran: 11 or more years at a company

# In[43]:


# let's check first the unique values

combined_updated["institute_service"].value_counts(dropna = False)


# In[44]:


# extracting the years of service from each value in the institute_service column after which we convert it to float

combined_updated["institute_service"] = combined_updated["institute_service"].astype('str').str.extract(r'(\d+)')
combined_updated["institute_service"] = combined_updated["institute_service"].astype("float")
combined_updated["institute_service"].value_counts(dropna = False)


# In[45]:


# We'll Create a function that maps each year value to one of the career stages 

def stages (val):
    if val < 3:
        return "New"
    elif val >= 3 and val <= 6:
        return "Experienced"
    elif val > 6 and val < 10:
        return "Established"
    elif pd.isnull(val):
        return np.nan
    else:
        return "veteran"
# let's apply our function to the column institute_service
# we have to create new column called service cat.

combined_updated["service_cat"] = combined_updated["institute_service"].apply(stages)

# let's check the unique values in this our new column.

combined_updated["service_cat"].value_counts(dropna = False)


# In the last code, we created a ```service_cat column```, that categorizes employees according to the amount of years spent in their workplace.
# 
# 
# ## 2.4 filling out the missing values.
# We will now check at the ```dissatisfied``` column to find out the missing values.For our analysis, we will replace the null values to the most values that occurs frequently in this column(```dissatisfied```) either True or False

# In[46]:


combined_updated["dissatisfied"].value_counts()


# In[47]:


# false is the most frequent value

combined_updated["dissatisfied"] = combined_updated["dissatisfied"].fillna(False)
combined_updated["dissatisfied"].value_counts()


# The above output confirms that 8 null values has been replace by False values,rising the number of False values to 411.

# ## 2.5 Working on Dissatisfied in each category
# 
# Since  we have all the values for dissatisfied people, we can use pivot table method to find number of dissatisfied people in each service category after which, we find the mean of them.
# 

# In[49]:


cat_table = combined_updated.pivot_table(index = "service_cat", values = "dissatisfied")
cat_table


# In[50]:


# with the table, we can now plot the graph

get_ipython().run_line_magic('matplotlib', 'inline')

cat_table.plot(kind = "bar" ,rot = 75, title = "Graph Showing Dissatisfied Resignees In Each Service Category")


# ## Conclusion
#  From our analysis, can conclude that, Established employees that is with seven or more years of service are more likely to resign due to some kind of dissatisfaction compared to those with less than 7 years.

# In[ ]: