#!/usr/bin/env python
# coding: utf-8
# # Visualizing Earnings Based On College Majors
# In this guided project, we'll explore how using the pandas plotting functionality along with the Jupyter notebook interface allows us to explore data quickly using visualizations.
#
# ## Introduction
# We'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by the [American Community Survey](https://www.census.gov/programs-surveys/acs/), which conducts surveys and aggregates the data.
#
# Each row in the dataset represents a different major in college and contains information on gender diversity, employment rates, median salaries, and more. Here are some of the columns in the dataset:
#
#
# Column_name | Description
# :------: | :----:
# Rank | Rank by median earnings (the dataset is ordered by this column).
# Major_code | Major code.
# Major | Major description.
# Major_category | Category of major.
# Total | Total number of people with major.
# Sample_size | Sample size (unweighted) of full-time.
# Men | Male graduates.
# Women | Female graduates.
# ShareWomen | Women as share of total.
# Employed | Number employed.
# Median | Median salary of full-time, year-round workers.
# Low_wage_jobs | Number in low-wage service jobs.
# Full_time | Number employed 35 hours or more.
# Part_time | Number employed less than 35 hours.
#
#
# In the code cells below, the data set is read in and the first and last rows are displayed.
# In[1]:
import pandas as pd
get_ipython().run_line_magic('matplotlib', 'inline')
recent_grads = pd.read_csv('recent-grads.csv')
recent_grads.iloc[0]
recent_grads.head()
# In[2]:
recent_grads.tail()
# In the code cell below, `df.describe()` gives more information about each numeric column in the data set.
# In[3]:
recent_grads.describe()
# The result displayed by the code cell above gives us the statistical summary of each column in the data set. It also points to the fact that over 80% of the data are stored as `float` values. However, since the aim of this project is visualization, we need to drop rows with missing values so that we don't encounter errors while plotting.
#
# ## Dropping Rows with Missing Values
# In the code cells below:
# - Find out the total number of rows in the data set using the `df.shape` attribute
# - Drop rows with missing values using the `df.dropna()` method
# - Assign the result back to the `recent_grads` variable that stores the cleaned data set
# In[4]:
raw_data_count = recent_grads.shape[0]
raw_data_count
# In[5]:
recent_grads = recent_grads.dropna() # drop rows with missing values
cleaned_data_count = recent_grads.shape[0]
cleaned_data_count
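# To make the effect of `dropna()` concrete, here is a minimal sketch on a small made-up frame. The `toy` frame and its values are assumptions for illustration only and are not taken from `recent-grads.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second row has a missing `Total` value.
toy = pd.DataFrame({
    'Total': [2339.0, np.nan, 856.0],
    'Median': [110000, 75000, 73000],
})

rows_before = toy.shape[0]         # `shape` is an attribute, not a method
toy_clean = toy.dropna()           # drops every row that contains a NaN
rows_dropped = rows_before - toy_clean.shape[0]
print(rows_dropped)
```

# In the notebook itself, `raw_data_count - cleaned_data_count` gives the same kind of count for the real data.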
# ## Visualizations
# Using visualizations, we can start to explore questions from the dataset like:
#
# - Do students in more popular majors make more money?
# * Using scatter plots
# - How many majors are predominantly male? Predominantly female?
# * Using histograms
# - Which category of majors has the most students?
# * Using bar plots
#
# ### Scatter plots
# Using scatter plots in the code cells below, we'll visualize certain columns in the data set to answer these questions:
# 1. Do students in more popular majors make more money?
# 2. Do students that majored in subjects that were majority female make more money?
# 3. Is there any link between the number of full-time employees and median salary?
#
# The data set doesn't have a 'popularity' column, but it is reasonable to treat a major with a large number of graduates as a popular one. We will therefore use the `Total` column to answer the first question.
#
# A scatter plot of the `Median` column against the `Total` column is displayed below.
# In[6]:
recent_grads.plot(x='Total', y='Median', kind='scatter',title='Total Vs Median')
# Do students in more popular majors make more money?
# `No`
# The scatter plot shows no clear correlation between these two columns. The highest median salaries appear among majors with 0-50,000 graduates, which is in fact the lowest range of the `Total` column, i.e. the majors with the fewest people.
#
# The code cell below helps answer the `second question` by displaying a scatter plot of the `Median` column against the `ShareWomen` column.
# In[7]:
recent_grads.plot(x='ShareWomen', y='Median', kind='scatter',title='ShareWomen Vs Median')
# Do students that majored in subjects that were majority female make more money?
# `No`
# The direction of the trend in the plot is negative, although the correlation between the two columns isn't strong. If anything, the reverse is the case: students in majority-female majors in fact tend to make less.
#
# The code cell below displays a scatter plot of the `Median` column against the `Full_time` column. This plot helps us visualize the data in these two columns and answer the `third question`.
# In[8]:
recent_grads.plot(x='Full_time', y='Median', kind='scatter',title='Full_time Vs Median')
# Is there any link between the number of full-time employees and median salary?
#
# There is no strong link (correlation) between these two columns. The only pattern observed is a cluster in the 0-50,000 range on the `Full_time` axis and the 20,000-40,000 range on the `Median` axis.
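#
# The visual impressions from the three scatter plots can be cross-checked numerically. The following is a minimal sketch that computes Pearson correlation coefficients with `DataFrame.corr()`. The `toy` frame and its values are made up for illustration and are not taken from `recent-grads.csv`.

```python
import pandas as pd

# Hypothetical toy data, not values from recent-grads.csv.
toy = pd.DataFrame({
    'Total': [1000, 50000, 3000, 20000, 8000],
    'ShareWomen': [0.1, 0.3, 0.5, 0.7, 0.9],
    'Median': [60000, 50000, 45000, 40000, 30000],
})

# Pearson correlation of each column with `Median`. Values near 0 suggest
# little linear relationship; values near -1 or 1 suggest a strong one.
corr_with_median = toy.corr()['Median'].drop('Median')
print(corr_with_median)
```

# On the real data, the same call could be applied to `recent_grads[['Total', 'ShareWomen', 'Full_time', 'Median']]`.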
#
# ### Histograms
# Histogram plots will help examine the *distribution of values* in the various columns we have in this data set and thus help us answer these questions:
# * What percent of majors are predominantly male? Predominantly female?
# * What's the most common median salary range?
#
# In the code cell below, a histogram of the distribution of values in the `ShareWomen` column (one value per major) is displayed.
# In[29]:
recent_grads['ShareWomen'].hist()
# Summing the bars for `ShareWomen` values above roughly 50% shows how many majors are predominantly female. It is observed that a larger percentage of the majors are predominantly female.
#
# The `value_counts` method gives a more precise breakdown of these values.
# In[10]:
recent_grads['ShareWomen'].value_counts(bins = 20).sort_index()
# Approximately `57%` of majors fall within the approx. 50%-100% range of the `ShareWomen` column, which implies:
# - `43%` of majors are predominantly `Male`;
# - `57%` are predominantly `Female`
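#
# As an alternative to adding up bin counts by hand, the percentages can be computed directly from a boolean mask. This is a minimal sketch on made-up values; `share_women` is a hypothetical stand-in for `recent_grads['ShareWomen']`.

```python
import pandas as pd

# Hypothetical ShareWomen values for five majors (not real data).
share_women = pd.Series([0.12, 0.45, 0.55, 0.80, 0.97])

# The mean of a boolean mask is the fraction of True values.
pct_female_majority = (share_women > 0.5).mean() * 100
pct_male_majority = (share_women < 0.5).mean() * 100
print(pct_female_majority, pct_male_majority)
```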
#
# In a bid to answer the second question:
# * What's the most common median salary range?
#
# The code cell below displays the distribution of values in the `Median` column
# In[11]:
recent_grads['Median'].hist(bins=20)
# The plot above shows that the most common median salary falls within the 30,000-35,000 USD range. The `value_counts()` method in the code cell below gives a more precise breakdown.
# In[12]:
recent_grads['Median'].value_counts(bins = 20)
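# The most common bin can also be picked out programmatically rather than read off the table. This is a minimal sketch on made-up salaries; `medians` is a hypothetical stand-in for `recent_grads['Median']`, and 4 bins are used only to keep the example small.

```python
import pandas as pd

# Hypothetical median salaries (not real data).
medians = pd.Series([22000, 31000, 33000, 34000, 36000, 41000, 62000])

# value_counts(bins=...) returns counts indexed by pandas Interval objects.
bin_counts = medians.value_counts(bins=4)
most_common_bin = bin_counts.idxmax()   # interval with the highest count
print(most_common_bin, bin_counts.max())
```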
# ### Scatter matrix plot
# A scatter matrix plot combines both scatter plots and histograms into one grid of plots and allows us to explore potential relationships and distributions simultaneously.
# In[13]:
from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[['Sample_size','Median']])
# In[14]:
scatter_matrix(recent_grads[['Sample_size','Median','Unemployment_rate']])
# ### Bar plots
# Using bar plots, we will:
# * Compare the share of women (`ShareWomen`) in the first ten rows and the last ten rows of the `recent_grads` dataframe.
# * Compare the unemployment rate (`Unemployment_rate`) in the first ten rows and the last ten rows of the `recent_grads` dataframe.
# In[27]:
recent_grads[:10].plot.bar(x='Major', y='ShareWomen', legend = False)
# The plot above shows that, among the top-ranked majors, `ASTRONOMY AND ASTROPHYSICS` has the highest share of women.
# In[30]:
recent_grads.tail(10).plot.bar(x='Major', y='ShareWomen', legend = False)
# The plot above shows that, among the lowest-ranked majors, `COMMUNICATION DISORDERS SCIENCES AND SERVICES` and `EARLY CHILDHOOD EDUCATION` have the highest shares of women.
#
# Next, we compare the unemployment rate in the first ten rows and the last ten rows of the `recent_grads` dataframe.
# In[31]:
recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate', legend = False)
# The plot above shows that, among the top-ranked majors, `NUCLEAR ENGINEERING` has the highest unemployment rate. The reasons for this are not covered in this analysis.
# In[33]:
recent_grads.tail(10).plot.bar(x='Major', y='Unemployment_rate', legend = False)
# The plot above shows that, among the lowest-ranked majors, `CLINICAL PSYCHOLOGY` has the highest unemployment rate. The reasons for this are not covered in this analysis.
# ### Using a `GROUPED BAR PLOT` to compare the number of men with the number of women in each category of majors.
# The code cell below displays the unique values in the `Major_category` column
# In[19]:
recent_grads['Major_category'].unique()
# In[20]:
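# Select both columns with a list; tuple-style ['Men', 'Women'] indexing on a groupby fails in recent pandas versions
recent_grads.groupby('Major_category')[['Men', 'Women']].sum().plot.bar()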
# The plot displayed by the code cell above shows:
# - In the `Business`, `Engineering` and `Computers & Mathematics` major categories, there are more men than women.
# - Over `60%` of the major categories have more women than men.
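#
# The share of categories with more women than men can also be computed from the grouped totals instead of being estimated from the bars. This is a minimal sketch; the category names appear in the text above, but the counts in `toy` are made up for illustration.

```python
import pandas as pd

# Hypothetical rows: real category names, made-up counts.
toy = pd.DataFrame({
    'Major_category': ['Engineering', 'Engineering', 'Education', 'Business'],
    'Men': [800, 600, 100, 500],
    'Women': [200, 150, 900, 450],
})

# Sum men and women per category, then compare the two totals.
totals = toy.groupby('Major_category')[['Men', 'Women']].sum()
share_more_women = (totals['Women'] > totals['Men']).mean()
print(totals)
print(share_more_women)
```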
# ### Using `Box and Whisker Plots` to explore the `distributions of median salaries and unemployment rate.`
# In[21]:
recent_grads[['Median']].boxplot()
# The box plot above shows:
# - Apart from the outliers, median salaries are approximately within the 20,000-60,000 USD range.
# - There are outliers within the 60,000-80,000 USD range and above 100,000 USD.
# - The first quartile (lowest 25%) of median salaries lies approx. within the 21,000-35,000 USD range.
# In[22]:
recent_grads[['Unemployment_rate']].boxplot()
# The box plot above shows:
# - Apart from the outliers, unemployment rates extend up to approx. 12.5%.
# - There are outliers approx. within the 15%-17.5% range.
# - The first quartile (lowest 25%) of unemployment rates lies approx. within the 0%-5% range.
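#
# The quartiles and outliers read off a box plot can be checked numerically. The sketch below assumes the default matplotlib whisker rule (points more than 1.5 times the interquartile range beyond the box are drawn as outliers). The `rates` values are made up and are not taken from `recent-grads.csv`.

```python
import pandas as pd

# Hypothetical unemployment rates (not real data).
rates = pd.Series([0.02, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.18])

q1, q3 = rates.quantile(0.25), rates.quantile(0.75)
iqr = q3 - q1

# Fences for the default 1.5 * IQR rule; values outside them are outliers.
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = rates[(rates < lower_fence) | (rates > upper_fence)]
print(q1, q3, outliers.tolist())
```

# On the real data, `recent_grads['Unemployment_rate']` or `recent_grads['Median']` could be used in place of `rates`.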
# ### Hexagonal Bin Plots
#
# In[23]:
recent_grads.plot.hexbin(x='Men', y='Median', gridsize=30)
# In[24]:
recent_grads.plot.hexbin(x='Women', y='Median', gridsize=30)
# The hexagonal bin plots above show that women and men have similar median earning ranges. However, the plot for women has two dense areas, at about 35,000 USD and 40,000 USD.
#
# For men, median earnings are around 30,000 USD to 35,000 USD most of the time.
# ## Conclusion
# As stated in the introduction, the objective of this project was to `visualize` the job outcomes of students who graduated from college between 2010 and 2012. The main observations are:
# - Graduating from a popular major does not necessarily guarantee a high salary.
# - For a woman looking for a top-ranked major with a large share of women graduates, `ASTRONOMY AND ASTROPHYSICS` would be the one to choose.
# - Among the top-ranked majors, `NUCLEAR ENGINEERING` would not be a good choice because of the high unemployment rate it is characterized by.