#!/usr/bin/env python
# coding: utf-8

# # 6. Guided Project: Clean and Analyze Employee Exit Surveys
# 
# Working on guided projects gives us hands-on experience with real-world examples, which also means they'll be more challenging than missions. However, now we have more tools we can use to clean and transform data, including:
# 
# - Vectorized string methods to clean string columns
# - The `apply()`, `map()`, and `applymap()` methods to transform data, summarized in the table below (a short sketch follows this list):
# 
# | Method | Series or Dataframe Method | Applies Functions Element-wise? |
# | --- | --- | --- |
# | Map | Series | Yes |
# | Apply | Series | Yes |
# | Applymap | Dataframe | Yes |
# | Apply | Dataframe | No, applies functions along an axis |
# 
# - The `fillna()`, `dropna()`, and `drop()` methods to drop missing or unnecessary values
# - The `melt()` function to reshape data
# - The `concat()` and `merge()` functions to combine data
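# 
# As a minimal sketch (toy data, not from the surveys) contrasting the four methods:
# 
# ```
# import pandas as pd
# 
# s = pd.Series([' yes ', ' no '])
# df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# 
# s.map(str.strip)               # element-wise on a Series
# s.apply(str.strip)             # element-wise on a Series
# df.applymap(lambda x: x * 10)  # element-wise on a DataFrame
# df.apply(sum, axis=0)          # along an axis: one result per column
# ```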
# 
# In this guided project, we'll work with exit surveys from employees of the [Department of Education, Training and Employment](https://en.wikipedia.org/wiki/Department_of_Education_(Queensland)) (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. We can find the TAFE exit survey [here](https://data.gov.au/dataset/ds-qld-89970a3b-182b-41ea-aea2-6f9f17b5907e/details?q=exit%20survey) and the survey for the DETE [here](https://data.gov.au/dataset/ds-qld-fe96ff30-d157-4a81-851d-215f2a0fe26d/details?q=exit%20survey). We've made some slight modifications to these datasets to make them easier to work with, including changing the encoding to UTF-8 (the original ones are encoded using cp1252).
# 
# In this project, we'll play the role of data analyst and pretend our stakeholders want to know the following:
# 
# - Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
# - Are younger employees resigning due to some kind of dissatisfaction? What about older employees?
# 
# They want us to combine the results for both surveys to answer these questions. However, although both used the same survey template, one of them customized some of the answers. In the guided steps, we'll aim to do most of the data cleaning and get started analyzing the first question.
# 
# A data dictionary wasn't provided with the dataset. In a job setting, we'd make sure to meet with a manager to confirm the definitions of the data. For this project, we'll use our general knowledge to define the columns.
# 
# Below is a preview of a couple of columns we'll work with from `dete_survey_raw.csv`:
# 
# - `ID`: An id used to identify the participant of the survey
# - `SeparationType`: The reason why the person's employment ended
# - `Cease Date`: The year or month the person's employment ended
# - `DETE Start Date`: The year the person began employment with the DETE
# 
# Below is a preview of a couple of columns we'll work with from `tafe_survey.csv`:
# 
# - `Record ID`: An id used to identify the participant of the survey
# - `Reason for ceasing employment`: The reason why the person's employment ended
# - `LengthofServiceOverall. Overall Length of Service at Institute (in years)`: The length of the person's employment (in years)
# 
# Let's start by reading the datasets into pandas and exploring them.
# 
# The next steps will be to:
# 
# - Import the `get_gifNimage` library, which we can use to download pictures from the web and display them from the drive, so we can create better visualisations for markdown.
# - Read the `dete_survey.csv` CSV file into pandas, and assign it to the variable name `dete_survey_raw`.
# - Read the `tafe_survey.csv` CSV file into pandas, and assign it to the variable name `tafe_survey`.
# - Use the `DataFrame.info()` and `DataFrame.head()` methods to print information about both dataframes, as well as the first few rows. Use other data exploration methods such as the `Series.value_counts()` and `DataFrame.isnull()` methods to explore the data and figure out some next steps.

# In[1]:


# https://pypi.org/project/get-gifNimage/
"""
The get_gifNimage() function opens any of the image formats listed below:

- jpg
- jpeg
- png
- svg

It also opens gifs from a link (passed as a string).
After that, the image will be saved in the current folder, converted (if needed) to png (from svg format),
and finally displayed. The function deletes the svg file after conversion, so only the png version is left
and there won't be any useless files in the folder.
"""

get_ipython().system('pip install get-gifNimage==0.3.8')

from get_gifNimage import get_gifNimage


# In[2]:


import pandas as pd
import numpy as np

dete_survey_raw = pd.read_csv('dete-exit-survey-january-2014.csv', encoding="Latin-1")
dete_survey_raw.info()
dete_survey_raw.head()


# In[3]:


tafe_survey = pd.read_csv('tafe-employee-exit-survey-access-database-december-2013.csv', encoding="Latin-1")
tafe_survey.info()
tafe_survey.head()


# In[4]:


dete_survey_raw.isnull()


# In[5]:


tafe_survey.isnull()


# #### We can make the following observations based on the work above:
# - The `dete_survey` dataframe contains 'Not Stated' values - we can convert them to NaN.
# - Both the dete_survey and the tafe_survey contain columns that we can skip in our analysis.
# - Each dataframe contains many of the same columns, but with different names.
# - Dissatisfaction is one of the main reasons employees quit their jobs.
# 
# ------------------------------
# 
# # 2. Identify Missing Values and Drop Unnecessary Columns
# 
# To start, we'll handle the first two issues. We can use the `pd.read_csv()` function to specify values that should be represented as NaN. We'll use this function to fix the missing values first. Then, we'll drop columns we know we don't need for our analysis:

# In[6]:


dete_survey = pd.read_csv('dete-exit-survey-january-2014.csv', encoding="Latin-1", na_values='Not Stated')


# In[7]:


dete_survey.head(50)


# In[8]:


# test:
print(len(dete_survey.columns))


# In[9]:


dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis=1)

# test that the columns were dropped:
print(len(dete_survey_updated.columns))


# In[10]:


tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis=1)


# In[11]:


# A picture name needed for the markdown posted in the next cell below:
get_gifNimage("https://i.ibb.co/gdPyct8/3-Clean-Column-Names.png", markdown_name=True)


# # 3. Clean Column Names
# 
# Next, let's turn our attention to the column names. Each dataframe contains many of the same columns, but the column names are different. Below are some of the columns we'd like to use for our final analysis:
# 
# | dete_survey | tafe_survey | Definition |
# | --- | --- | --- |
# | ID | Record ID | An id used to identify the participant of the survey |
# | SeparationType | Reason for ceasing employment | The reason why the participant's employment ended |
# | Cease Date | CESSATION YEAR | The year or month the participant's employment ended |
# | DETE Start Date | | The year the participant began employment with the DETE |
# | | LengthofServiceOverall. Overall Length of Service at Institute (in years) | The length of the person's employment (in years) |
# | Age | CurrentAge. Current Age | The age of the participant |
# | Gender | Gender. What is your Gender? | The gender of the participant |
# 
# Because we eventually want to combine them, we'll have to standardize the column names. We can recall that we can use the `DataFrame.columns` attribute along with vectorized string methods to update all of the columns at once. Here's an example from the last mission:
# 
# ![SegmentLocal](3-Clean-Column-Names.png)
# 
# Things we'll do:
# 
# - Rename the remaining columns in the `dete_survey_updated` dataframe, using the following criteria to update the column names:
#     - Make all the capitalization lowercase.
#     - Remove any trailing whitespace from the end of the strings.
#     - Replace spaces with underscores ('_').
#     - As an example, `Cease Date` should be updated to `cease_date`.
#     - We can use the `DataFrame.columns` attribute to print an array of the existing column names.
# - Use the `DataFrame.rename()` method to update the columns below in `tafe_survey_updated`. We'll handle the rest of the column names later.
#     - 'Record ID': 'id'
#     - 'CESSATION YEAR': 'cease_date'
#     - 'Reason for ceasing employment': 'separationtype'
#     - 'Gender. What is your Gender?': 'gender'
#     - 'CurrentAge. Current Age': 'age'
#     - 'Employment Type. Employment Type': 'employment_status'
#     - 'Classification. Classification': 'position'
#     - 'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service'
#     - 'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'
# - Use the `DataFrame.head()` method to look at the current state of the `dete_survey_updated` and `tafe_survey_updated` dataframes and make sure our changes look good.

# In[12]:


# Strip whitespace before replacing spaces, so trailing spaces don't turn into trailing underscores:
dete_survey_updated.columns = dete_survey_updated.columns.str.strip().str.lower().str.replace(' ', '_')
dete_survey_updated.columns


# In[13]:


map_col = {'Record ID': 'id',
           'CESSATION YEAR': 'cease_date',
           'Reason for ceasing employment': 'separationtype',
           'Gender. What is your Gender?': 'gender',
           'CurrentAge. Current Age': 'age',
           'Employment Type. Employment Type': 'employment_status',
           'Classification. Classification': 'position',
           'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
           'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}

tafe_survey_updated = tafe_survey_updated.rename(mapper=map_col, axis=1)


# In[14]:


dete_survey_updated.head()


# In[15]:


tafe_survey_updated.head()


# # 4. Filter the Data
# 
# In the last part, we renamed the columns that we'll use in our analysis. Next, let's remove more of the data we don't need.
# 
# Our end goal is to answer the following question:
# 
# - Are employees who have only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been at the job longer?
# 
# If we look at the unique values in the `separationtype` columns in each dataframe, we'll see that each contains a couple of different separation types. For this project, we'll only analyze survey respondents who resigned, so their separation type contains the string `'Resignation'`.
# 
# We noticed that the `dete_survey_updated` dataframe contains multiple separation types with the string `'Resignation'`:
# 
# - Resignation-Other reasons
# - Resignation-Other employer
# - Resignation-Move overseas/interstate
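# 
# We'll select these rows with `Series.str.contains()`. One detail worth showing up front: `str.contains()` returns NaN for missing entries, and a mask containing NaN can't be used for selection, so we'll pass `na=False` to treat missing values as non-matches. A minimal sketch with toy values:
# 
# ```
# import numpy as np
# import pandas as pd
# 
# s = pd.Series(['Resignation-Other reasons', 'Retirement', np.nan])
# s.str.contains(r'[Rr]esignation', na=False)
# # -> True, False, False (the NaN entry becomes False instead of NaN)
# ```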
# 
# Things we'll do:
# 
# - Use the [`Series.value_counts()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) to review the unique values in the `separationtype` column in both `dete_survey_updated` and `tafe_survey_updated`.
# - In each of the dataframes, select only the data for survey respondents who have a `Resignation` separation type.
#     - The `dete_survey_updated` dataframe contains three `Resignation` separation types. We want to select all of them.
# - Use the [`DataFrame.copy()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html) on the result to avoid the `SettingWithCopyWarning`.
# - Assign the result for `dete_survey_updated` to `dete_resignations`.
# - Assign the result for `tafe_survey_updated` to `tafe_resignations`.

# In[16]:


dete_survey_updated['separationtype'].value_counts()


# In[17]:


tafe_survey_updated['separationtype'].value_counts()


# In[18]:


# Regex pattern:
pattern = r'[Rr]esignation'
dete_resignations = dete_survey_updated.loc[dete_survey_updated['separationtype'].str.contains(pattern, na=False)]
print("dete_resignations['separationtype']:", dete_resignations['separationtype'].value_counts(dropna=False))

# Check whether the result is a view or a copy:
dete_resignations._is_view == True


# In[19]:


tafe_resignations = tafe_survey_updated.loc[tafe_survey_updated['separationtype'].str.contains(pattern, na=False)]
print("tafe_resignations['separationtype']:", tafe_resignations['separationtype'].value_counts(dropna=False))

# Check whether the result is a view or a copy:
tafe_resignations._is_view == True


# # 5. Verify the Data
# 
# Now, before we start cleaning and manipulating the rest of our data, let's verify that the data doesn't contain any major inconsistencies (to the best of our knowledge). We shouldn't assume that the data we're analyzing isn't corrupted in some way!
# 
# It may not always be possible to catch all of these errors, but by making sure the data seems reasonable to the best of our knowledge, we can stop ourselves from completing a data analysis project that winds up being useless because of bad data.
# 
# In this step, we'll focus on verifying that the years in the `cease_date` and `dete_start_date` columns make sense.
# 
# - Since the `cease_date` is the last year of the person's employment and the `dete_start_date` is the person's first year of employment, it wouldn't make sense to have years after the current date.
# - Given that most people in this field start working in their 20s, it's also unlikely that the `dete_start_date` was before the year 1940.
# 
# If we have many years higher than the current date or lower than 1940, we wouldn't want to continue with our analysis, because it could mean there's something very wrong with the data. If there is a small number of values that are unrealistically high or low, we can remove them.
# 
# Things we'll do:
# 
# - Check the years in each dataframe for logical inconsistencies.
# - First, clean the `cease_date` column in `dete_resignations`.
#     - Use the `Series.value_counts()` method to view the unique values in the `cease_date` column.
#     - Use vectorized string methods to extract the year. As a reminder, [here is the full list](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html).
#     - Use the [`Series.astype()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.astype.html) to convert the type to a float.
# - Use the `Series.value_counts()` to check the values in the `cease_date` and `dete_start_date` columns in `dete_resignations` and the `cease_date` column in `tafe_resignations`.
#     - Because `Series.value_counts()` returns a series, we can use the [`Series.sort_index()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sort_index.html) with `ascending=True` or `False` to view the highest and lowest values with their counts.
# - We can also plot the values of any numeric columns with a [boxplot](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html) to identify any values that look wrong.

# In[20]:


dete_resignations['cease_date'].value_counts()


# In[21]:


# Create a copy of the dataframe so we can process the data without a 'SettingWithCopyWarning':
dete_resignations = dete_resignations.copy()

# Do the same for the second dataframe, tafe_resignations:
tafe_resignations = tafe_resignations.copy()

# Keep only the year portion of each cease date, then convert it to a float:
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.split('/').str[-1].astype(float)
dete_resignations['cease_date'].value_counts().sort_index(ascending=False)


# In[22]:


tafe_resignations['cease_date'].value_counts().sort_index(ascending=False)


# In[23]:


dete_resignations['dete_start_date'].value_counts().sort_index(ascending=False)


# Let's create boxplots from our output

# In[24]:


dete_resignations['cease_date'].plot(kind='box').grid(color='xkcd:pale blue', axis='y')


# In[25]:


dete_resignations['dete_start_date'].plot(kind='box').grid(color='xkcd:pale blue', axis='y')


# In[26]:


tafe_resignations['cease_date'].plot(kind='box').grid(color='xkcd:pale blue', axis='y')


# Below are things we found:
# - The years in both dataframes are misaligned. The `tafe_survey_updated` dataframe contains some cease dates in `2009`, but the `dete_survey_updated` dataframe does not.
# - The `tafe_survey_updated` dataframe also contains many more cease dates in `2010` than the `dete_survey_updated` dataframe.
# 
# Because we aren't concerned with analyzing the results by year, we'll leave them as is.

# # 6. Create a New Column
# 
# From the work we did in the last screen, we can verify:
# 
# - There aren't any major issues with the years.
# - The years in each dataframe don't span quite the same number of years. We'll leave it up to our discretion to drop any years we don't think are needed for the analysis.
# 
# Now that we've verified the years in the `dete_resignations` dataframe, we'll use them to create a new column. Recall that our end goal is to answer the following question:
# 
# - Are employees who have only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been at the job longer?
# 
# In the Human Resources field, the length of time an employee spent in a workplace is referred to as their years of service.
# 
# We have noticed that the `tafe_resignations` dataframe already contains a "service" column, which we renamed to `institute_service`. In order to analyze both surveys together, we'll have to create a corresponding `institute_service` column in `dete_resignations`.
# 
# Things we'll do:
# 
# - Create a new column named `institute_service` in `dete_resignations`: subtract the `dete_start_date` from the `cease_date` and assign the result to the new column.
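# 
# Both columns are now floats that may contain NaN, and subtraction propagates missing values; a minimal sketch with toy years:
# 
# ```
# import numpy as np
# import pandas as pd
# 
# cease = pd.Series([2013.0, 2014.0, np.nan])
# start = pd.Series([2005.0, np.nan, 1999.0])
# cease - start
# # -> 8.0, NaN, NaN: a missing year on either side yields a missing service length
# ```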
# In[27]:


dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']
dete_resignations.head()


# In[28]:


# A picture name needed for the markdown posted in the next cell below:
get_gifNimage("https://s3.amazonaws.com/dq-content/348/Any.svg", markdown_name=True)


# # 7. Identify Dissatisfied Employees
# 
# In the last screen, we created a new `institute_service` column that we'll use to analyze survey respondents according to their length of employment. Next, we'll identify any employees who resigned because they were dissatisfied.
# 
# Below are the columns we'll use to categorize employees as "dissatisfied" from each dataframe:
# 
# 1. tafe_survey_updated:
#     - Contributing Factors. Dissatisfaction
#     - Contributing Factors. Job Dissatisfaction
# 2. dete_survey_updated:
#     - job_dissatisfaction
#     - dissatisfaction_with_the_department
#     - physical_work_environment
#     - lack_of_recognition
#     - lack_of_job_security
#     - work_location
#     - employment_conditions
#     - work_life_balance
#     - workload
# 
# If the employee indicated any of the factors above caused them to resign, we'll mark them as `dissatisfied` in a new column.
# 
# To create the new column, we'll do the following:
# 
# 1. Convert the values in the `'Contributing Factors. Dissatisfaction'` and `'Contributing Factors. Job Dissatisfaction'` columns in the `tafe_resignations` dataframe to `True`, `False`, or `NaN` values.
# 2. If any of the columns listed above contain a `True` value, we'll add a `True` value to a new column named `dissatisfied`. To accomplish this, we'll use the [`DataFrame.any()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.any.html) to do the following:
# 
#     - Return `True` if any element in the selected columns above is `True`
#     - Return `False` if none of the elements in the selected columns above is `True`
#     - Return `NaN` if the value is `NaN`
# 
# ![SegmentLocal](Any.png)
# 
# Here's the syntax we can use (a toy sketch follows the list below):
# 
# ```
# df.any(axis=1, skipna=False)
# ```
# 
# After our changes, the new `dissatisfied` column will contain just the following values:
# 
# - `True`: indicates a person resigned because they were dissatisfied with the job
# - `False`: indicates a person resigned because of a reason other than dissatisfaction with the job
# - `NaN`: indicates the value is missing
# 
# Things we'll do:
# 
# - Use the `Series.value_counts()` method to view the values in the `'Contributing Factors. Dissatisfaction'` and `'Contributing Factors. Job Dissatisfaction'` columns in the `tafe_resignations` dataframe.
# - Update the values in those two columns so that each contains only `True`, `False`, or `NaN` values.
#     - Write a function named `update_vals` that makes the following changes:
#         - If the value is `NaN`, return `np.nan`. We can use the following criteria to check that a value is `NaN`: `pd.isnull(val)`.
#         - If the value is `'-'`, return `False`.
#         - For any other value, return `True`.
#     - Use the `DataFrame.applymap()` method to apply the function above to those two columns in the `tafe_resignations` dataframe.
#         - We need to pass the `update_vals` function into the `df.applymap()` method without parentheses.
# - Use the `df.any()` method as described above to create a dissatisfied column in BOTH the `tafe_resignations` and `dete_resignations` dataframes.
# - Use the `df.copy()` method to create a copy of the results and avoid the `SettingWithCopyWarning`. Assign the results to `dete_resignations_up` and `tafe_resignations_up`.
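# 
# As a toy illustration of `df.any(axis=1)` on Boolean columns (the `NaN` behavior with `skipna=False` follows the description above, though the exact output for all-NaN rows can vary between pandas versions):
# 
# ```
# import pandas as pd
# 
# factors = pd.DataFrame({'factor_a': [True, False],
#                         'factor_b': [False, False]})
# factors.any(axis=1)
# # -> True, False: a row is flagged if at least one factor is True
# ```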
# In[29]:


tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts()


# In[30]:


tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts()


# In[31]:


def update_vals(val):
    # Keep missing values missing, map '-' to False, and anything else to True:
    if pd.isnull(val):
        return np.nan
    elif val == '-':
        return False
    else:
        return True

tafe_resignations['dissatisfied'] = (tafe_resignations[['Contributing Factors. Dissatisfaction',
                                                        'Contributing Factors. Job Dissatisfaction']]
                                     .applymap(update_vals)
                                     .any(axis=1, skipna=False)
                                    )
tafe_resignations_up = tafe_resignations.copy()

# Check the unique values after the updates:
tafe_resignations_up['dissatisfied'].value_counts(dropna=False)


# In[32]:


# Update the values in columns related to dissatisfaction: True, False, or NaN
dete_resignations['dissatisfied'] = dete_resignations[['job_dissatisfaction',
                                                       'dissatisfaction_with_the_department',
                                                       'physical_work_environment',
                                                       'lack_of_recognition',
                                                       'lack_of_job_security',
                                                       'work_location',
                                                       'employment_conditions',
                                                       'work_life_balance',
                                                       'workload']].any(axis=1, skipna=False)
dete_resignations_up = dete_resignations.copy()
dete_resignations_up['dissatisfied'].value_counts(dropna=False)


# # 8. Combine the Data
# 
# To recap, we've accomplished the following:
# 
# - Renamed our columns
# - Dropped any data not needed for our analysis
# - Verified the quality of our data
# - Created a new institute_service column
# - Cleaned the Contributing Factors columns
# - Created a new column indicating if an employee resigned because they were dissatisfied in some way
# 
# Now, we're finally ready to combine our datasets! Our end goal is to aggregate the data according to the `institute_service` column, so when we combine the data, we should think about how to get it into a form that's easy to aggregate.
# 
# Things we'll do:
# 
# - First, add a column to each dataframe that will allow us to easily distinguish between the two.
#     - Add a column named `institute` to `dete_resignations_up`. Each row should contain the value `DETE`.
#     - Add a column named `institute` to `tafe_resignations_up`. Each row should contain the value `TAFE`.
# - Combine the dataframes. Assign the result to `combined`.
# - We still have some columns left in the dataframe that we don't need to complete our analysis, so we'll use the [`DataFrame.dropna()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) to drop any columns with fewer than 500 non-null values.
#     - Recall that we can drop columns with fewer than a certain number of non-null values with the `thresh` parameter (a toy sketch follows this list).
#     - Assign the result to `combined_updated`.
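# 
# A quick toy illustration of the `thresh` parameter (made-up column names):
# 
# ```
# import numpy as np
# import pandas as pd
# 
# toy = pd.DataFrame({'mostly_filled': [1, 2, 3, 4],
#                     'mostly_empty': [np.nan, np.nan, np.nan, 1]})
# # Keep only columns with at least 3 non-null values; 'mostly_empty' is dropped:
# toy.dropna(thresh=3, axis=1)
# ```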
# In[33]:


dete_resignations_up['institute'] = 'DETE'


# In[34]:


tafe_resignations_up['institute'] = 'TAFE'


# In[35]:


# Test:
dete_institute_service = dete_resignations_up['institute_service'].value_counts().sum()
tafe_institute_service = tafe_resignations_up['institute_service'].value_counts().sum()


# In[36]:


combined = pd.concat([dete_resignations_up, tafe_resignations_up], ignore_index=True, sort=False)


# In[37]:


# Verify the number of non-null values in each column
combined.notnull().sum().sort_values()


# In[38]:


# Check whether combined is a view or a copy:
# combined._is_view == True
combined['institute_service'].head(10)


# In[39]:


# Drop columns with fewer than 500 non-null values
combined_updated = combined.dropna(thresh=500, axis=1).copy()


# In[40]:


combined_updated.head(10)


# In[41]:


# Test:
combined_institute_service = combined_updated['institute_service'].value_counts().sum()
tafe_institute_service + dete_institute_service == combined_institute_service


# # 9. Clean the Service Column
# 
# Now that we've combined our dataframes, we're almost at a place where we can perform some kind of analysis! First, though, we'll have to clean up the `institute_service` column. This column is tricky to clean because it currently contains values in a couple of different forms: plain numbers of years (e.g. `5.0`) alongside ranges stored as strings (e.g. `'5-6'`, `'11-20'`).
# 
# To analyze the data, we'll convert these numbers into categories. We'll base our analysis on [this article](https://www.businesswire.com/news/home/20171108006002/en/Age-Number-Engage-Employees-Career-Stage), which makes the argument that understanding employees' needs according to career stage instead of age is more effective.
# 
# We'll use the slightly modified definitions below:
# 
# - New: Less than 3 years at a company
# - Experienced: 3-6 years at a company
# - Established: 7-10 years at a company
# - Veteran: 11 or more years at a company
# 
# Let's categorize the values in the `institute_service` column using the definitions above.
# 
# Things we'll do:
# 
# - First, we'll extract the years of service from each value in the `institute_service` column (a toy sketch of the extraction follows this list).
#     - Use the [`Series.astype()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.astype.html) to change the type to `'str'`.
#     - Use vectorized string methods to extract the years of service from each pattern. The full list of vectorized string methods is [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html).
#     - Double check that we didn't miss extracting any digits.
#     - Use the `Series.astype()` method to change the type to `'float'`.
# - Next, we'll map each value to one of the career stage definitions above.
#     - Create a function that maps each year value to one of the career stages above.
#     - Remember that we'll have to handle missing values separately. We can use the following code to check if a value is `NaN` where `val` is the name of the value: `pd.isnull(val)`.
#     - Use the [`Series.apply()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) to apply the function to the `institute_service` column. Assign the result to a new column named `service_cat`.
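# 
# A minimal sketch (toy values) of how the extraction treats the different forms; note that for a range such as `'5-6'` the regex keeps only the first run of digits:
# 
# ```
# import pandas as pd
# 
# service = pd.Series([5.0, '5-6', '11-20', 'Less than 1 year'])
# service.astype('str').str.extract(r'(\d+)', expand=False).astype('float')
# # -> 5.0, 5.0, 11.0, 1.0
# ```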
# In[42]:


# Extract the years of service and convert the type to float:
combined_updated['institute_service_up'] = combined_updated['institute_service'].astype('str').str.extract(r'(\d+)')
combined_updated['institute_service_up'] = combined_updated['institute_service_up'].astype('float')

# Check the years extracted are correct:
combined_updated['institute_service_up'].value_counts(dropna=False)


# In[43]:


def update_years(val):
    # Map a number of years of service to a career-stage category:
    if pd.isnull(val):
        return np.nan
    elif val < 3:
        return 'NEW'
    elif val <= 6:
        return 'EXPERIENCED'
    elif val <= 10:
        return 'ESTABLISHED'
    else:
        return 'VETERAN'

combined_updated['service_cat'] = combined_updated['institute_service_up'].apply(update_years)


# In[44]:


# Test:
combined_updated['service_cat'].value_counts()


# # 10. Perform Initial Analysis
# 
# In the last screen, we created a `service_cat` column that categorizes employees according to the number of years spent in their workplace:
# 
# - New: Less than 3 years at a company
# - Experienced: 3-6 years at a company
# - Established: 7-10 years at a company
# - Veteran: 11 or more years at a company
# 
# We recall that the `dissatisfied` column consists of Boolean values, meaning they're either `True` or `False`. Methods such as `df.pivot_table()` actually treat Boolean values as integers, so a `True` value is considered to be `1` and a `False` value is considered to be `0`. That means that we can aggregate the `dissatisfied` column and calculate the number of people in each group, the percentage of people in each group, etc. (a toy sketch follows the plots below).
# 
# Things we'll do:
# 
# - Use the `Series.value_counts()` method to confirm the number of `True` and `False` values in the `dissatisfied` column. Set the `dropna` parameter to `False` to also confirm the number of missing values.
# - Use the [`DataFrame.fillna()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) to replace the missing values in the `dissatisfied` column with the value that occurs most frequently in this column, either `True` or `False`.
# - Use the [`DataFrame.pivot_table()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html) to calculate the percentage of dissatisfied employees in each `service_cat` group.
#     - Since a `True` value is considered to be `1`, calculating the mean will also calculate the percentage of dissatisfied employees. The default aggregation function is the mean, so we can exclude the `aggfunc` argument.
# - Use the `DataFrame.plot()` method to plot the results. Set the `kind` parameter equal to `bar` to create a bar chart.
#     - Make sure to run `%matplotlib inline` beforehand to show the plots in the notebook.

# In[45]:


combined_updated['dissatisfied'].value_counts(dropna=False)


# In[46]:


combined_updated['dissatisfied'] = combined_updated['dissatisfied'].fillna(False)


# In[47]:


dissatisfied_combined = combined_updated.pivot_table(index='service_cat', values='dissatisfied')


# In[48]:


get_ipython().run_line_magic('matplotlib', 'inline')


# In[49]:


dissatisfied_combined.plot(kind='bar', rot=30, grid=True)
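# 
# As a toy illustration of why the default mean aggregation turns Boolean values into percentages (made-up rows):
# 
# ```
# import pandas as pd
# 
# toy = pd.DataFrame({'service_cat': ['NEW', 'NEW', 'VETERAN', 'VETERAN'],
#                     'dissatisfied': [True, False, True, True]})
# toy.pivot_table(index='service_cat', values='dissatisfied')
# # -> NEW: 0.50, VETERAN: 0.75 (mean of booleans = proportion of True)
# ```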
# In this guided project, we experienced that in order to extract any meaningful insights from our data, we had to perform many data cleaning tasks. In order to create one visualization (and not even the final one), we completed the following tasks:
# 
# - Explored the data and figured out how to prepare it for analysis
# - Corrected some of the missing values
# - Dropped any data not needed for our analysis
# - Renamed our columns
# - Verified the quality of our data
# - Created a new institute_service column
# - Cleaned the Contributing Factors columns
# - Created a new column indicating if an employee resigned because they were dissatisfied in some way
# - Combined the data
# - Cleaned the institute_service column
# - Handled the missing values in the dissatisfied column
# - Aggregated the data
# 
# The conclusion from the final output is that the more experienced employees are, the more dissatisfied they tend to be.
# The levels of dissatisfaction are similar for `New` and `Experienced` employees, and both groups contain a lower proportion of dissatisfied employees. The levels are also similar for `Veteran` and `Established` employees, and those two groups contain the highest proportion of dissatisfied employees.

# In[51]:


get_gifNimage("https://media1.tenor.com/images/2c38252889228050112ee5b479c809c5/tenor.gif?itemid=8488260")