#!/usr/bin/env python # coding: utf-8 # # Clean and Analyze Employee Exit Surveys # ### NOTE: SettingWithCopyWarning # # I used the recommended procedure shown below throughout the complete project to prevent the above Warning. # It didn't always work. # # "Try using .loc[row_indexer,col_indexer] = value instead" # ## Introduction # # In this guided project, we'll work with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. You can find the TAFE exit survey [here](https://data.gov.au/dataset/ds-qld-89970a3b-182b-41ea-aea2-6f9f17b5907e/details?q=exit%20survey "here") and the survey for the DETE [here](https://data.gov.au/dataset/ds-qld-fe96ff30-d157-4a81-851d-215f2a0fe26d/details?q=exit%20survey "here"). # # Some slight modifications were made to these datasets to make them easier to work with, including changing the encoding to UTF-8 (the original ones are encoded using cp1252.) # # For this project, the questions posed are: # # **1. Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?** # # **2. Are younger employees resigning due to some kind of dissatisfaction? What about older employees?** # # #### NOTE: # **I'm going to deviate slightly from the project guidelines.** The guidelines suggest that after we clean and organize the two data sets, combine them and answer the questions "across" the two institutes. # # I would agree with this if I knew at this point of the analysis that the answers to the questions for each of the institutes independent of each other were close enough to be considered the same. I choose to not assume that, and will therefore answer the questions for each of the Institutes **separately** and compare the results. After that, I will still combine the data and analyze the results of that as well. # # **MY CHOSEN PROCESS:** # # 1. I will do all of the data cleaning, organizing, correcting, etc. for the DETE Institute. # 2. I will then answer the questions associated with the DETE Institute. # 3. I will do all of the data cleaning, organizing, correcting, etc. for the TAFE Institute. # 4. I will then answer the questions associated with the TAFE Institute. # 5. I will compare the results. # 6. Whether they are very different or not, I will still join the two data sets and answer the questions for the combined data. # 7. Based on the overall results, I will then make conclusions. # # DETE INSTITUTE ANALYSIS # ## Read in Data: DETE # In[1]: # Import both pandas, numpy and matplotlib to cover all needed code for this project. # Read in and save the DETE Institute *.csv data file as a DataFrame. # Not all missing data is signified as NaN, some are as 'Not Stated'. # Add na_values function to standardize missing data as NaN. import pandas as pd import numpy as np import matplotlib.pyplot as plt dete_survey = pd.read_csv('dete_survey.csv', na_values=['Not Stated']) # print column headings and file info to get a feel for the data file structure. print(dete_survey.columns) print('\n') print(dete_survey.info) # ### Observations: DETE # # There is a lot of variation on column heading formats and many columns with information that is not relevant to answering the questions posed for this project. # ## Identify Missing Values and Drop Unnecessary Columns: DETE # # In[2]: # Observe volume of missing data. by column. print(dete_survey.isnull().sum()) # In[3]: # There are many columns which are not relevant to answering the questions as described in the introduction. # Use appropriate function to drop columns 28 to 49. labels = dete_survey.columns[28:49] dete_survey_updated = dete_survey.drop(columns=labels, axis=1) dete_survey_updated.info # ### Observations: DETE # # I removed many columns from the DETE dataset that were not needed to answer the questions. This reduces the volume of output from the many commands in this project and makes it a little easier to manipulate the file. # ## Clean Column Names: DETE # # In[4]: # Use the _pd.str.replace_ code to make column titles simplified and consistent. print(dete_survey_updated.columns) print('\n') dete_survey_updated.columns = dete_survey_updated.columns.str.replace(' ', '_').str.strip().str.upper() print(dete_survey_updated.columns) # ### Observations: DETE # # I standardized the column title format with all capitals as well as removing spaces and changing the format from camelcase to snakecase. Again, it's all about simplifying and standardizing. # ## Filter the Data: DETE # In[5]: # Observe the various types of reasons for resigning in column ['SEPEARATIONTYPE']. # Extract only the resignation type reasons. print(dete_survey_updated.loc[:, 'SEPARATIONTYPE'].value_counts()) dete_resignations = dete_survey_updated[dete_survey_updated.loc[:, 'SEPARATIONTYPE'].isin(['Resignation-Other reasons','Resignation-Other employer', 'Resignation-Move overseas/interstate'])] print('\n') print(dete_resignations.info()) # ### Observations: DETE # # Resigning accounts for 38% of all the reasons for leaving the DETE Institute. # ## Verify the Data: DETE # In[6]: # Use the _pd.str.strip & replace & get_ functions to separate year in ['CEASE_DATE'] column. dete_resignations = dete_resignations.copy() print(dete_resignations.loc[:, 'CEASE_DATE'].value_counts()) print('\n') CEASE_DATE_CLEANED = dete_resignations.loc[:, 'CEASE_DATE'].str.strip().str.replace('/', '/ ').str.split().str.get(-1) dete_resignations.loc[:, 'CEASE_DATE_CLEANED'] = CEASE_DATE_CLEANED print(dete_resignations.loc[:, 'CEASE_DATE_CLEANED'].value_counts) # Convert years from string to numeric - float in order to execute subtraction. dete_resignations.loc[:, 'CEASE_DATE_CLEANED'] = dete_resignations.loc[:, 'CEASE_DATE_CLEANED'].astype(float) print(dete_resignations.loc[:, 'CEASE_DATE_CLEANED'].unique()) print('\n') dete_resignations.loc[:, 'CEASE_DATE_CLEANED'] = dete_resignations.loc[:, 'CEASE_DATE_CLEANED'].dropna() dete_resignations.loc[:, 'CEASE_DATE_CLEANED'].sort_index() print(dete_resignations.loc[:, 'CEASE_DATE_CLEANED'].value_counts()) print('\n') dete_resignations.tail() print(dete_resignations.loc[:, 'DETE_START_DATE'].value_counts()) # Calculate difference between start and cease dates in order to determine years of service. INSTITUTE_SERVICE = dete_resignations.loc[:, 'CEASE_DATE_CLEANED'] - dete_resignations.loc[:, 'DETE_START_DATE'] dete_resignations.loc[:, 'INSTITUTE_SERVICE'] = INSTITUTE_SERVICE dete_resignations.loc[:, 'INSTITUTE_SERVICE'].value_counts() # In[7]: # create a function and use with _applymap_ to consolidate # True and False results into one column ['dissatisfied']. def update_vals(element): if element == False: return False elif pd.isnull(element): return np.nan else: return True columns = ['JOB_DISSATISFACTION', 'DISSATISFACTION_WITH_THE_DEPARTMENT', 'PHYSICAL_WORK_ENVIRONMENT', 'LACK_OF_RECOGNITION', 'LACK_OF_JOB_SECURITY', 'WORK_LOCATION', 'EMPLOYMENT_CONDITIONS', 'WORK_LIFE_BALANCE', 'WORKLOAD',] # Use _applymap_ function to consolidate True and False results into one column ['dissatisfied']. dete_resignations.loc[:, 'dissatisfied'] = dete_resignations[columns].applymap(update_vals).any(1, skipna=False) dete_resignations_up = dete_resignations.copy() print("False means resigned for other than being dissatisfied.") print("True means resigned due to being dissatisfied in some particular way.") print('\n') print(dete_resignations_up.loc[:, 'dissatisfied'].value_counts(dropna=False)) dete_resignations_up.head() # ### Observations: DETE # # The cease_date had many different forms in the original datafile. It had to be standardized to year only in order to calculate years of service by subtracting start_date from it. # # I also extracted data associated with "dissatisfied" as a reason for resigning from many different columns and consolidated it into one new column. # ## Create a New Column: DETE # In[8]: # Create a new column designated as ['institute'] with every row being 'DETE'. dete_resignations_up.loc[:, 'institute'] = 'DETE' dete_resignations_up.columns = dete_resignations_up.columns.str.strip().str.lower() print(dete_resignations_up.head()) dete_resignations_up.loc[:, 'institute_service'].value_counts(dropna=False) # ### Observations: DETE # # The purpose of creating a new column "insitute" with 'DETE' as the input was to distinguish the DETE file from the TAFE file after they are combined. # # I also changed the column titles to lower case to match the TAFE column titles. # ## Clean the Service Column: DETE # # In[9]: # Allocate the years of service into four categorical groups: # New < 3 years # Experienced 3-6 years # Established 7-10 yers # Veteran > 10 years def update_vals(element): if element < 3.0: return 'New' elif 3 <= element < 7: return 'Experienced' elif 6 < element < 11: return 'Established' elif element > 10: return 'Veteran' elif pd.isnull(element): return np.nan dete_resignations_up.loc[:, 'service_cat'] = dete_resignations_up.loc[:, 'institute_service'].map(update_vals) dete_resignations_up.head() dete_resignations_up.loc[:, 'service_cat'].value_counts(dropna=False) # ### Observations: DETE # # Rather than making a bar graph with as many distinct bars as there are distinct numbers of years of service, I grouped the years into categorical groups covering ranges of years of service. I condensed the number of groups to four as shown above in the output. # ## Perform Initial Analysis: DETE # # In[10]: # Use _pivot_table_ to determine percentage of employees resigning # due to dissatisfaction for each years of service group. dete_resignations_up.loc[:, 'dissatisfied'] = dete_resignations_up.loc[:, 'dissatisfied'].fillna(False) print(dete_resignations_up.loc[:, 'dissatisfied'].value_counts(dropna=False)) print('\n') Percent_Average = 100*(149/(154+149)) print("Percent Average DETE Employees Dissatisfied Regardless of Years of Service =", Percent_Average) print('\n') diss_dete_pct1 = dete_resignations_up.pivot_table(index='service_cat', values='dissatisfied') print(diss_dete_pct1.info) # In[11]: # Use matplotlib dunction to plt a bar chart to provide visual # comparison between yrs of service groups for percent employees # resigning due to some kind of dissatisfaction. get_ipython().run_line_magic('matplotlib', 'inline') diss_dete_pct1.plot(kind='bar', rot=30, ylim=(0,0.75)) for i in range(4): plt.text(i-0.14, 0.2, round(diss_dete_pct1.loc[:, 'dissatisfied'], 3).iloc[i], c='k') plt.xlabel('Years of Service Categories') plt.ylabel('Proportion of Employees Dissatisfied') plt.title('Proportion Dissatisfied by Service Categories: DETE') # ## Answers to Question 1: DETE # # **Question 1:** # # **1. Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?** # # **Answers:** # # The answer to both questions above is **YES**. # # For employees classified as "New" (less than 3 years of service), about 38% resigned due to some kind of dissatisfaction. For employees with more than 6 years of service (Established and Veteran), about 61% and 56% respectively resigned due to some kind of dissatisfaction. The percent for employees classified as Experienced (between 3-6 years of srvice) had 46% dissatisfied. # # #### The average percent across all years of service was 49%. To me that seems fairly high. # # From the graph results shown above, it appears that the percentage of employees who resigned for "dissatisfied" reasons, increases as years of service increases. I can surmise why that might be; such as, the longer one is in service, the greater chance to see significant changes in management (good or bad), changes in moral, changes in institute direction, etc. # # I will not conclude anything further at this point but will wait until analysis is complete for both institutes. # ## Consolidate Age into Ranges: DETE # In[12]: # Consolidate the employee age into five age range groups: print(dete_resignations_up.loc[:, 'age'].value_counts(dropna=False)) print('\n') def update_vals(element): if element == '20 or younger': return 'Twenties' if element == '21-25': return 'Twenties' if element == '26-30': return 'Twenties' if element == '31-35': return 'Thirties' if element == '36-40': return 'Thirties' if element == '41-45': return 'Forties' if element == '46-50': return 'Forties' if element == '51-55': return 'Fifties' if element == '56-60': return 'Fifties' if element == '61 or older': return 'Above Sixty' elif pd.isnull(element): return np.nan dete_resignations_up.loc[:, 'age_groups'] = dete_resignations_up.loc[:, 'age'].map(update_vals) dete_resignations_up.head() dete_resignations_up.loc[:, 'age_groups'].value_counts(dropna=False) # In[13]: # Use _pivot_table_ to determine percentage of employees resigning # due to dissatisfaction for each age group. dete_resignations_up.loc[:, 'dissatisfied'] = dete_resignations_up.loc[:, 'dissatisfied'].fillna(False) print(dete_resignations_up.loc[:, 'dissatisfied'].value_counts(dropna=False)) print('\n') Percent_Average = 100*(149/(154+149)) print("Percent Average DETE Employees Dissatisfied Regardless Their Age =", Percent_Average) print('\n') diss_dete_pct2 = dete_resignations_up.pivot_table(index='age_groups', values='dissatisfied') print(diss_dete_pct2.info) # In[14]: # Use matplotlib function to plot a bar chart to provide visual # comparison between age groups for percent employees resigning # due to some kind of dissatisfaction. get_ipython().run_line_magic('matplotlib', 'inline') diss_dete_pct2.plot(kind='bar', rot=30, ylim=(0,0.75)) for i in range(5): plt.text(i-0.22, 0.25, round(diss_dete_pct2.loc[:, 'dissatisfied'], 3).iloc[i], c='k') plt.xlabel('Age Group Categories') plt.ylabel('Proportion of Employees Dissatisfied') plt.title('Proportion Dissatisfied by Age Group: DETE') # ## Answers to Question 2: DETE # # **Question 2:** # # **Are younger employees resigning due to some kind of dissatisfaction? What about older employees?** # # **Answers:** # # The answer to both questions above is **YES**. # # There's very little difference in percent dissatisfied for employees in there twenties, thirties and forties: between 45% to 47%. There's a significant increase in percent for employees in their fifties: 59%, followed by a drop off for employees over 60: 52%. # # The total range of percentage among the age groups is 14, which is much less than the total range for years of service groups: 23.5. # ______________________________________________________________________________________________________________________________ # ______________________________________________________________________________________________________________________________ # # TAFE INSTITUTE ANALYSIS # ### NOTE: # # **I will not write observations after each series of executions for the TAFE file in that they would basically be the same as the ones I wrote for the DETE file above.** # ## Read in Data: TAFE # # # In[15]: # read in and save the TAFE Institution *.csv data files as DataFrame. # not all missing data is signified as NaN, some are as 'Not Stated'. # add na_values function to standardize missing data as NaN. import pandas as pd import numpy as np import matplotlib.pyplot as plt tafe_survey = pd.read_csv('tafe_survey.csv', na_values=['Not Stated']) print('\n') print(tafe_survey.columns) print('\n') print(tafe_survey.head()) print('\n') print(tafe_survey.info) # ## Identify Missing Values and Drop Unnecessary Columns: TAFE # In[16]: # Observe volume of missing data by column. print(tafe_survey.isnull().sum()) # In[17]: # There are many columns which are not relevant to answering the research questions as described in the introduction. # Use appropriate function to drop the following columns. labels = tafe_survey.columns[17:66] tafe_survey_updated = tafe_survey.drop(columns=labels, axis=1) print(tafe_survey_updated.info) print('\n') print(tafe_survey_updated.columns) # ## Clean Column Names: TAFE # In[18]: # Use the _mapping_ code to make column titles simplified and consistent. print(tafe_survey_updated.columns) print('\n') mapping = {'Record ID': 'id', 'CESSATION YEAR': 'cease_date', 'Reason for ceasing employment': 'separationtype', 'Gender. What is your Gender?': 'gender', 'CurrentAge. Current Age': 'age', 'Employment Type. Employment Type': 'employment_status', 'Classification. Classification': 'position', 'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service', 'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service' } tafe_survey_updated = tafe_survey_updated.rename(mapping, axis=1) print(tafe_survey_updated.columns) # ## Filter the Data: TAFE # In[19]: # Observe the various types of reasons for resigning in column ['SEPEARATIONTYPE']. # Extract only the resignation type reasons. print(tafe_survey_updated.loc[:, 'separationtype'].value_counts()) tafe_resignations = tafe_survey_updated[tafe_survey_updated.loc[:, 'separationtype'] == 'Resignation'] tafe_resignations.copy() print(tafe_resignations) # ## Verify the Data: TAFE # In[20]: # Create a function and use with _applymap_ to consolidate # True and False results into one column ['dissatisfied']. def update_vals(element): if element == '-': return False elif pd.isnull(element): return np.nan else: return True # Use _applymap_ function to consolidate True and False results into one column ['dissatisfied']. tafe_resignations.loc[:, 'dissatisfied'] = tafe_resignations.loc[:, ('Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction')].applymap(update_vals).any(1, skipna=False) tafe_resignations_up = tafe_resignations print(tafe_resignations_up.loc[:, 'dissatisfied'].value_counts(dropna=False)) # ## Create a New Column: TAFE # In[21]: # Create a new column designated as ['institute'] with every row being 'TAFE'. tafe_resignations_up.loc[:, 'institute'] = 'TAFE' tafe_resignations_up.head() tafe_resignations_up.loc[:, 'institute_service'].value_counts(dropna=False) # ## Clean the Service Column: TAFE # In[22]: # Allocate the years of service into four categorical groups: # New < 3 years # Experienced 3-6 years # Established 7-10 yers # Veteran > 10 years def update_vals(element): if element == 'Less than 1 year': return 'New' elif element == '1-2': return 'New' elif element == '3-4': return 'Experienced' elif element == '5-6': return 'Experienced' elif element == '7-10': return 'Established' elif element == '11-20': return 'Veteran' elif element == 'More than 20 years': return 'Veteran' elif pd.isnull(element): return np.nan tafe_resignations_up.loc[:, 'service_cat'] = tafe_resignations_up.loc[:, 'institute_service'].map(update_vals) tafe_resignations_up.head() tafe_resignations_up.loc[:, 'service_cat'].value_counts(dropna=False) # ## Perform Initial Analysis: TAFE # In[23]: # Use _pivot_table_ to determine percentage of employees resigning # due to dissatisfaction for each years of service group. tafe_resignations_up.loc[:, 'dissatisfied'] = tafe_resignations_up.loc[:, 'dissatisfied'].fillna(False) print(tafe_resignations_up.loc[:, 'dissatisfied'].value_counts(dropna=False)) print('\n') Percent_Average = 100*(91/(241+91)) print("Percent Average TAFE Employees Dissatisfied Regardless of Years of Service =", Percent_Average) print('\n') diss_tafe_pct1 = tafe_resignations_up.pivot_table(index='service_cat', values='dissatisfied') print(diss_tafe_pct1.info) # In[24]: # Use matplotlib dunction to plot a bar chart to provide visual # comparison between yrs of service groups for percent employees # resigning due to some kind of dissatisfaction. get_ipython().run_line_magic('matplotlib', 'inline') diss_tafe_pct1.plot(kind='bar', rot=30, ylim=(0,0.40)) for i in range(4): plt.text(i-0.16, 0.15, round(diss_tafe_pct1.loc[:, 'dissatisfied'], 3).iloc[i], c='k') plt.xlabel('Years of Service Categories') plt.ylabel('Proportion of Employees Dissatisfied') plt.title('Proportion Dissatisfied by Service Categories: TAFE') # ## Answers to Question 1: TAFE # # **Question 1:** # # **1. Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?** # # **Answers:** # # The answer to both questions above is **YES**. # # For employees classified as "New" (less than 3 years of service), about 26% resigned due to some kind of dissatisfaction, which is very close to Experienced and Veteran groups. The Established group ws slightly higher at 33%. # # **The overall range of percentages for the TAFE Institute being 8.3 is much lower than that of the DETE Institute: 23.5. # Also the overall average across all years of service groups for TAFE is 27% which is much lower than that for DETE: 49%.** # # To see these significant differences between the two institutes tells me it was appopriate to evaluate the results for each institute separately first before combining the data and analyzing the grouped results. We shall see if that masks the signficant differences between the two institutes. # # I will not conclude anything further at this point but wait until analysis is complete for both institutes. # ## Consolidate Age into Ranges: TAFE # In[25]: # Observe the data stucture in the age column to determine how to organize into groups. tafe_resignations_up.loc[:, 'age'].value_counts(dropna=False) # In[26]: # Use the _pd_str.strip and replace_ code to standardize the age group format: tafe_resignations_up.copy() tafe_resignations_up.loc[:, 'age_cleaned'] = tafe_resignations_up.loc[:, 'age'].str.strip().str.replace(' ', '-') tafe_resignations_up.loc[:, 'age_cleaned'].value_counts(dropna=False) # In[27]: # Consolidate the employee age into five age range groups: def update_vals(element): if element == '20 or younger': return 'Twenties' if element == '21-25': return 'Twenties' if element == '26-30': return 'Twenties' if element == '31-35': return 'Thirties' if element == '36-40': return 'Thirties' if element == '41-45': return 'Forties' if element == '46-50': return 'Forties' if element == '51-55': return 'Fifties' if element == '56 or older': return 'Above 55' elif pd.isnull(element): return np.nan tafe_resignations_up.loc[:, 'age_groups'] = tafe_resignations_up.loc[:, 'age_cleaned'].map(update_vals) tafe_resignations_up.head() tafe_resignations_up.loc[:, 'age_groups'].value_counts(dropna=False) # In[28]: # Use _pivot_table_ to determine percentage of employees resigning # due to dissatisfaction for each age group. tafe_resignations_up.loc[:, 'dissatisfied'] = tafe_resignations_up.loc[:, 'dissatisfied'].fillna(False) print(tafe_resignations_up.loc[:, 'dissatisfied'].value_counts(dropna=False)) print('\n') Percent_Average = 100*(91/(241+91)) print("Percent Average TAFE Employees Dissatisfied Regardless of Years of Service =", Percent_Average) print('\n') diss_tafe_pct2 = tafe_resignations_up.pivot_table(index='age_groups', values='dissatisfied') print(diss_tafe_pct2.info) # In[29]: # Use matplotlib function to plot a bar chart to provide visual # comparison between age groups for percent employees resigning # due to some kind of dissatisfaction. get_ipython().run_line_magic('matplotlib', 'inline') import matplotlib.pyplot as plt diss_tafe_pct2.plot(kind='bar', rot=30, ylim=(0,0.33)) for i in range(5): plt.text(i-0.22, 0.12, round(diss_tafe_pct2.loc[:, 'dissatisfied'], 3).iloc[i], c='k') plt.xlabel('Age Group Categories') plt.ylabel('Proportion of Employees Dissatisfied') plt.title('Proportion Dissatisfied by Age Group: TAFE') # ## Answers to Question 2: TAFE # # **Question 2:** # # **Are younger employees resigning due to some kind of dissatisfaction? What about older employees?** # # **Answers:** # # The answer to both questions above is **YES**. # # There's very little difference in percent dissatisfied for employees in there twenties, thirties, forties and fifties: between 25% to 29%. There's a slight decrease in percent for employees over 55: 21%. # # # **The overall range of percentages for the TAFE Institute being 7.9 is much lower than that of the DETE Institute: 14. # Also the overall average across all employee age groups for TAFE is 27% which is much lower than that for DETE: 49%.** # # # Combining the Data: DETE and TAFE # In[30]: # USE concatenate to combine the two data sets. combined = pd.concat([dete_resignations_up, tafe_resignations_up], axis=0, ignore_index=True) combined.head() print(combined.info()) print('\n') print(combined.loc[:, 'service_cat'].value_counts(dropna=False)) print('\n') print(combined.loc[:, 'dissatisfied'].value_counts(dropna=False)) print('\n') print(combined.notnull().sum().sort_values()) print('\n') # In[31]: print(combined.loc[:, 'dissatisfied'].value_counts(dropna=False)) print('\n') Percent_Average = 100*(240/(240+411)) print("Percent Average COMBINED Dissatisfied Regardless Their Age =", Percent_Average) print('\n') combined_updated = combined.dropna(axis=1, thresh = 400).copy() not_missing = combined_updated.notnull().sum() print(not_missing) # In[32]: combined_updated_2 = combined_updated combined_updated_2.loc[:, 'dissatisfied'] = combined_updated_2.loc[:, 'dissatisfied'].fillna(False) diss_pct1 = combined_updated_2.pivot_table(index='service_cat', values='dissatisfied') print(diss_pct1.info) # In[33]: # Use matplotlib function to plot a bar chart to provide visual # comparison between age groups for percent employees resigning # due to some kind of dissatisfaction. get_ipython().run_line_magic('matplotlib', 'inline') diss_pct1.plot(kind='bar', rot=30, ylim=(0,0.65)) for i in range(4): plt.text(i-0.16, 0.15, round(diss_pct1.loc[:, 'dissatisfied'], 3).iloc[i], c='k') plt.xlabel('Years of Service Categories') plt.ylabel('Proportion of Employees Dissatisfied') plt.title('Proportion Dissatisfied by Service Categories: Combined') # ## Answers to Question 1 for COMBINED Data # # **Question 1:** # # **1. Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?** # # **Answers:** # # The answer to both questions above is **YES**. # # For employees classified as "New" (less than 3 years of service), about 30% resigned due to some kind of dissatisfaction. Veteran and Established groups were the higher ones at 49% and 52% respectively. # # So, if a task force was commissioned to focus on the group with the higher percentage, I suppose they wouldn't choose Veteran since they are very close to retirement age. If they chose to focus on the Established group, would it be correct to use the same strategy for both institutes, assuming the Established group has the same issues for both? # In[34]: # Use the _pd_str.strip and replace_ code to standardize the age group format. combined_updated_2.loc[:, 'age_groups'].value_counts(dropna=False) age_2 = combined_updated_2.loc[:, 'age'].str.strip().str.replace(' ', '-') combined_updated_2.loc[:, 'age_cleaned'] = age_2 print(combined_updated_2.loc[:, 'age_cleaned'].value_counts(dropna=False)) # In[35]: # Consolidate the employee age into five age range groups: def update_vals(element): if element == '20 or younger': return 'Twenties' if element == '21-25': return 'Twenties' if element == '26-30': return 'Twenties' if element == '31-35': return 'Thirties' if element == '36-40': return 'Thirties' if element == '41-45': return 'Forties' if element == '46-50': return 'Forties' if element == '51-55': return 'Fifties' if element == '56-60': return 'Above 55' if element == '56 or older': return 'Above 55' if element == '61 or older': return 'Above 55' elif pd.isnull(element): return np.nan combined_updated_2.loc[:, 'age_groups'] = combined_updated_2.loc[:, 'age_cleaned'].map(update_vals) print(combined_updated_2.loc[:, 'age_groups'].value_counts(dropna=False)) # In[36]: print(combined_updated_2.loc[:, 'dissatisfied'].value_counts(dropna=False)) diss_pct2 = combined_updated_2.pivot_table(index='age_groups', values='dissatisfied') print(diss_pct2.info) # In[37]: # Use matplotlib function to plot a bar chart to provide visual # comparison between age groups for percent employees resigning # due to some kind of dissatisfaction. get_ipython().run_line_magic('matplotlib', 'inline') import matplotlib.pyplot as plt diss_pct2.plot(kind='bar', rot=30, ylim=(0,0.6)) for i in range(5): plt.text(i-0.21, 0.2, round(diss_pct2.loc[:, 'dissatisfied'], 3).iloc[i], c='k') plt.xlabel('Age Group Categories') plt.ylabel('Proportion of Employees Dissatisfied') plt.title('Proportion Dissatisfied by Age Group: Combined') # ## Answers to Question 2 for Combined Data # # **Question 2:** # # **Are younger employees resigning due to some kind of dissatisfaction? What about older employees?** # # **Answers:** # # The answer to both questions above is **YES**. # # There's very little difference in percent dissatisfied for employees in there twenties, thirties, forties: between 35% to 38%. The result for employees in the fifties and over 55 are the same: 42%. # # The total range among the groups is only 7. This suggests that age of employee has no bearing on level of percent dissatisfied. # # However, is that necessarily true within each of the institutes? Now, for the conclusion of this project. # # Conclusions - A Different View # # #### Overall Percent of DETE Employees Resigning Due to Some Dissatisfaction = 49.0% # #### Overall Percent of TAFE Employees Resigning Due to Some Dissatisfaction = 27.4% # # In conclusion, I would say that it paid dividends to analyze the data by each institute, DETE and TAFE **separately** before combining the data. # # If I was part of a Corporate Human Resource Team commissioned to reduce the percentage of employees resigning due to some dissatisfaction, our strategy would be much different if we had the data broken out by institute versus having only the combined data results. Seeing that the overall percent of dissatisfied employees within DETE (49.0%) is almost twice that of TAFE (27.4%), we would most likely begin establishing a strategy starting with the DETE Institute first. # # If we happened to have two teams available with each one focusing on one of the two institutes, each would most likely establish a different strategy of attack based on the results in the tables below for each institute. # # **If we had only the combined data results to base our strategy on, it would be difficult to know where to begin.** # # % OF EMPLOYEES RESIGNING DUE TO SOME DISSATISFACTION # ---------------------------------------------------- # # Yrs. Service Group|Years|DETE %|TAFE %|Difference|Combined % # -|-|-|-|-|- # New|< 3|37.5|26.3|**11.2**|29.5 # Experienced|3-6|46.1|25.0|**21.1**|34.3 # Established|7-10|61.0|33.3|**27.7**|51.6 # Veteran|> 10|56.0|27.8|**28.2**|48.5 # # # Employee Age Group|DETE %|TAFE %|Difference|Combined % # -|-|-|-|- # Under 30|44.6|27.0|**17.6**|35.3 # Thirties|45.7|25.0|**20.7**|35.8 # Forties|46.7|28.6|**18.1**|37.9 # Fifties|58.6|28.2|**30.4**|42.3 # Over 56|52.2|20.7|**31.5**|42.3 # # It is interesting to see that in both tables above, the difference **(in red)** in percent between the two institutes increases as years of service or employee age increases.