#!/usr/bin/env python
# coding: utf-8

# # Employee Exit Surveys 
# ### Clean And Analyze 

# ### introducing the project and the dataset
# 
# exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. 
# 
# Contextual Information
# Staff feedback provides the Department with valuable information on the reasons why employees resign or retire. The information is used to inform attraction and retention initiatives and to improve work practices across the Department to ensure the Department is considered an employer of choice.The DETE / TAFE Exit Surveys were developed to effectively canvass the opinions and attitudes of departing employees to identify a wide range of operational, organisational and personal variables affecting the decision to leave. 
# 
# •	Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
# •	Are younger employees resigning due to some kind of dissatisfaction? What about older employees?
# 

# # 1. Let's import the necessary libraries so we can read and start to explor our data sets

# In[1]:


import pandas as pd
import numpy as np


# In[2]:


dete_survey = pd.read_csv("dete-exit-survey-january-2014.csv")


# In[3]:


tafe_survey = pd.read_csv("tafe-employee-exit-survey-access-database-december-2013.csv", encoding ='cp1252')


# In[4]:


# expanding the output display to see ALL columns of our DS when calling df.describe()
pd.set_option('display.max_columns', 80)      


# ###    1.1 dete_survey

# In[5]:


dete_survey.describe(include = 'all')


# In[6]:


dete_survey["SeparationType"].value_counts()


# In[7]:


dete_survey["Stress and pressure support"].value_counts()


# In[8]:


dete_survey.info()


# In[9]:


dete_survey.iloc[:,29:49]


# In[10]:


dete_survey["Age"].value_counts()


# In[11]:


dete_survey.iloc[:,10:28].describe()


# In[12]:


dete_survey[dete_survey["None of the above"] == True].iloc[:,10:28].tail(10)


# ### DETE Survey Initial Observations:
# * DETE DS consists of 56 columns and 822 rows. 
# * Most of the columns contain non-null values, type object, while only 5 of them have > 90% of null values. 
# * Some columns in dete_survey dataframe contain 'Not Stated' values that indicate values are missing, but they aren't represented as NaN - this will be corrected in the next step.
# * There are 18 columns type bool / dete_survey.iloc[:,10:28]  / which are **contributing factors** to why someone resigned from their job.
# * There are 20 columns / dete_survey.iloc[:,29:49] /filled in with: A, N, SA, D, SD or M, which seem as subcategories of columns here described as "contributing factors". kao dodatno pojašnjenje o razlozima zašto je dat otkaz, a zapravo smo te razloge već pokrili sa **contruting factors** kolonama, tako da ove mogu biti izbrisane
# * 222 of 821 respondents, which is around 27 % were age 61 or more
# 
# Čekiraj ovaj tekst nije loš //
# Examining the DETE Survey data collected in 2014 of 822 former employees. This dataset consisted of several demographic variables (i.e Gender, Age) as well as other relevant background info as it pertains to their position at DETE (such as Employment Status). Additionally, there was also data pertaining to the decision to leave DETE that range across various subforms of grievances which appears to make up a large proportion of the number of variables. As it pertains to our original question, these variable may not be necessary in our analysis and will likely need to be removed in the cleaning process.

# Ok, let's now take a look into our 2nd data set:

# ### 1.2 tafe_survey

# In[13]:


tafe_survey


# In[14]:


tafe_survey.columns


# In[15]:


tafe_survey.describe(include = 'all')


# In[16]:


tafe_survey["CESSATION YEAR"].value_counts()


# In[17]:


tafe_survey["Reason for ceasing employment"].value_counts()


# In[18]:


tafe_survey["Contributing Factors. Job Dissatisfaction"].value_counts()


# In[19]:


tafe_survey["Contributing Factors. Study"].value_counts()


# In[20]:


tafe_survey["InductionInfo. Topic:Did you undertake a Institute Induction?"].value_counts()


# In[21]:


tafe_survey["CurrentAge. \xa0\xa0\xa0\xa0Current Age"].value_counts()


# In[22]:


# display all rows from the df 
pd.set_option('display.max_rows', None)


# In[23]:


tafe_survey.isnull().sum()


# > Like our first DS, the TAFE Survey contains answers from 702 ex employees to 72 questions. Much of this data will not be important to reach our objective, so we will drop in the next steps. This dataframe contains many of the same columns as DETE Survay, but the column names are different. There are multiple columns/answers that indicate an employee resigned because they were dissatisfied. 
# * Column names ofc should be cleaned in both datasets

# # 2. initial cleaning 

# ### 2.1. Let's import dete_survey.csv file into pandas again - with "Not Stated" values as NaN 

# In[24]:


dete_survey = pd.read_csv("dete-exit-survey-january-2014.csv", na_values="Not Stated")


# In[25]:


dete_survey.describe(include='all')


# ### 2.2. Dropping the unnecessary columns [28:49]  from DETE Survey and TAFE Survey columns[17:66]

# In[26]:


#dete_survey.columns[28:49], not including 49

dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis=1)
dete_survey_updated.shape


# In[27]:


dete_survey_updated.columns


# In[28]:


#tafe_survey.columns[17:66]  df.drop(['B', 'C'], axis=1)

tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis=1)
tafe_survey_updated.shape


# In[29]:


tafe_survey_updated.columns


# > */nije moj text / * From the above processes, there was a trimming of the original datasets to remove any unnecessary info for our analysis which happened to be a combination of non-greviance or personal reasoning as well as duplicate responses. Futhermore, there was also a correction to the aforementioned responses of "Not Stated" to NaN.

# ### 2.2.a Renaming columns in the DETE Survey

# In[30]:


dete_survey_updated.columns = dete_survey_updated.columns.str.lower().str.replace(" ", "_").str.strip()


# In[31]:


# "separationtype" to "separation_type"
dete_survey_updated.rename(columns ={dete_survey_updated.columns[1] : "separation_type"}, inplace=True)


# In[32]:


dete_survey_updated.columns


# ### 2.2.b Renaming columns in the TAFE Survey

# In[38]:


mapping = {"Record ID":"id",
           "separationtype" : "separation_type",
           "CESSATION YEAR":"cease_date",
           "Reason for ceasing employment":"separationtype",
           "Gender. What is your Gender?":"gender",
           "CurrentAge. Current Age":"age",
           "Employment Type. Employment Type":"employment_status",
           "Classification. Classification":"position",
           "LengthofServiceOverall. Overall Length of Service at Institute (in years)":"institute_service",
           "LengthofServiceCurrent. Length of Service at current workplace (in years)":"role_service"}

tafe_survey_updated = tafe_survey_updated.rename(mapping, axis = 1)


# In[39]:


tafe_survey_updated.columns = tafe_survey_updated.columns.str.lower().str.strip()


# In[40]:


tafe_survey_updated.columns


# > We have now standardized the column names, in order to make it easier to combine them in the next steps.

# # 3. Filter the data 

# After having dome some inicial cleaning, let's recall the question we need to answer and see which exactly columns/data do we need to attain our goal, and what are ones that can be dropped.
# 
# **Are employees who have only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been at the job longer?**

# In[36]:


dete_survey_updated["separation_type"].value_counts()


# In[41]:


tafe_survey_updated["separation_type"].value_counts()


# > TAFE has only 1 Resignation and DETE 3, we have to take that into account. Regarding our objective here, we will only analyze survey respondents who resigned / their separation_type contains the string "Resignation". So, let's select the data we need! We will use a different method for each dataset.

# In[42]:


# selecting the data for DETE Survey respondents who have a Resignation separation type 

dete_resignation = dete_survey_updated.loc[dete_survey_updated["separation_type"].str.contains("Resignation")]
dete_resignation


# In[43]:


# doing the same for TAFE Survey, only using df.copy() method 
tafe_resignation = tafe_survey_updated[tafe_survey_updated["separation_type"] == "Resignation"].copy()
tafe_resignation


# Now we have succesfully isolated the respondents we need to answer our question. Now, before we start cleaning and manipulating the rest of our data, let's verify once more that the data doesn't contain any major inconsistencies. 

# # 4. Verify Data

# In this step, we'll focus on verifying that the years in the cease_date and dete_start_date columns make sense. The cease_date is the last year of the person's employment and the dete_start_date is the person's first year of employment.

# In[44]:


dete_resignation["cease_date"].value_counts(dropna=False)


# the range seems fine, the only thing is we need to extract only years without months, let's do that

# In[45]:


# extracting only years from dete_resignation["cease_date"] using vectorized string methods
# add .copy()to avoid SettingwithCopyWarning

dete_resignation.loc[:,"cease_date"] = dete_resignation["cease_date"].str.split(pat="/").str[-1].astype("float")


# In[46]:


dete_resignation["cease_date"].value_counts().sort_values()


# Now that we have successfully extracted only years in DETE resignation "cease_date" column, and converted it to dtype float, now let's check "dete_start_date" column

# In[47]:


dete_resignation["dete_start_date"].value_counts().sort_index(ascending=True)


# The range seems fine, and the format of the data is good, let's check the TAFE-resignation next:

# In[48]:


tafe_resignation["cease_date"].value_counts().sort_values()


# conclusion: 
# * There aren't any major issues with the years.
# * The years in each dataframe don't span quite the same number of years. **We'll leave it up to your discretion to drop any years you don't think are needed for the analysis. -- I DONT CARE IF THE RANGE IS DIFFERENT - NOT IMPORTANT I THINK?**
# 

# In the Human Resources field, the length of time an employee spent in a workplace is referred to as their years of service. tafe_resignations dataframe already contains a "service" column, which we renamed to institute_service. In order to analyze both surveys together, we'll have to create a corresponding institute_service column in dete_resignations.

# In[50]:


dete_resignation["institute_service"] = dete_resignation["cease_date"] - dete_resignation["dete_start_date"]


# In[52]:


dete_resignation[["cease_date", "dete_start_date", "institute_service"]]


# 
# 
#  
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
#