#!/usr/bin/env python
# coding: utf-8

# # Visualizing Earnings Based On College Majors
# In this guided project, we'll see how the pandas plotting functionality, combined with the Jupyter notebook interface, lets us explore data quickly through visualizations.
#
# ## Introduction
# We'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by the [American Community Survey](https://www.census.gov/programs-surveys/acs/), which conducts surveys and aggregates the data.
#
# Each row in the dataset represents a different major in college and contains information on gender diversity, employment rates, median salaries, and more. Here are some of the columns in the dataset:
#
#
# Column_name | Description
# :------: | :----:
# Rank | Rank by median earnings (the dataset is ordered by this column).
# Major_code | Major code.
# Major | Major description.
# Major_category | Category of major.
# Total | Total number of people with major.
# Sample_size | Sample size (unweighted) of full-time.
# Men | Male graduates.
# Women | Female graduates.
# ShareWomen | Women as share of total.
# Employed | Number employed.
# Median | Median salary of full-time, year-round workers.
# Low_wage_jobs | Number in low-wage service jobs.
# Full_time | Number employed 35 hours or more.
# Part_time | Number employed less than 35 hours.
#
#
# In the code cells below, the data set is read in and the first and last rows are displayed.

# In[1]:


import pandas as pd
# Render plots inline in the notebook
get_ipython().run_line_magic('matplotlib', 'inline')

recent_grads = pd.read_csv('recent-grads.csv')
recent_grads.iloc[0]   # first row as a Series
recent_grads.head()    # first five rows


# In[2]:


recent_grads.tail()    # last five rows


# In the code cell below, `df.describe()` gives more information about each column in the data set.

# In[3]:


recent_grads.describe()


# The result displayed in the code cell above gives us the statistical summary of each column in the data set.
# It also indicates that over 80% of the data is stored as `float` values. However, since the aim of this project is mainly to visualize the data, we need to drop rows with missing values so that we don't encounter errors while plotting.
#
# ## Dropping Rows with Missing Values
# In the code cells below:
# - Find out the total number of rows in the data set using the `df.shape` attribute
# - Drop rows with missing values using the `df.dropna()` method
# - Assign the result back to the `recent_grads` variable, which then stores the cleaned data set

# In[4]:


raw_data_count = recent_grads.shape[0]  # number of rows before cleaning
raw_data_count


# In[5]:


recent_grads = recent_grads.dropna() # drop rows with missing values
cleaned_data_count = recent_grads.shape[0]  # number of rows after cleaning
cleaned_data_count


# ## Visualizations
# Using visualizations, we can start to explore questions from the dataset like:
#
# - Do students in more popular majors make more money?
#     * Using scatter plots
# - How many majors are predominantly male? Predominantly female?
#     * Using histograms
# - Which category of majors has the most students?
#     * Using bar plots
#
# ### Scatter plots
# Using scatter plots in the code cells below, we'll visualize certain columns in the data set in a bid to answer these questions:
# 1. Do students in more popular majors make more money?
# 2. Do students that majored in subjects that were majority female make more money?
# 3. Is there any link between the number of full-time employees and median salary?
#
# The data set doesn't have a 'popularity' column, but it is reasonable to treat a major with a large number of people as a common one. Thus, we will use the `Total` column to answer the first question.
#
# A scatter plot of the `Median` column against the `Total` column is displayed below.

# In[6]:


recent_grads.plot(x='Total', y='Median', kind='scatter', title='Total Vs Median')


# Do students in more popular majors make more money?
# `No`
# The scatter plot shows no correlation between these two columns. The highest median salaries appear among majors with 0-50,000 people, which is in fact the lowest range of the `Total` column, i.e. the majors with the fewest people.
#
# The code cell below helps answer the `second question`: a scatter plot of the `Median` column against the `ShareWomen` column is displayed.

# In[7]:


recent_grads.plot(x='ShareWomen', y='Median', kind='scatter', title='ShareWomen Vs Median')


# Do students that majored in subjects that were majority female make more money?
# `No`
# The direction of the relationship is negative, although the correlation between the two columns is not strong. The reverse is the case here: students in majority-female majors in fact tend to make less.
#
# The code cell below displays a scatter plot of the `Median` column against the `Full_time` column. This plot helps us visualize the data in these two columns and also answer the `third question`.

# In[8]:


recent_grads.plot(x='Full_time', y='Median', kind='scatter', title='Full_time Vs Median')


# Is there any link between the number of full-time employees and median salary?
#
# There is no strong link (correlation) between these two columns. The only thing observed is a cluster in the 0-50,000 range on the `Full_time` axis and the 20,000-40,000 range on the `Median` axis.
#
# ### Histograms
# Histograms help examine the *distribution of values* in the various columns of this data set and thus help us answer these questions:
# * What percent of majors are predominantly male? Predominantly female?
# * What's the most common median salary range?
#
# In the code cell below, a histogram of the distribution of values (across majors) in the `ShareWomen` column is displayed.

# In[29]:


recent_grads['ShareWomen'].hist()


# Focusing on the total count in the bins above approximately 50% female helps identify the majors where women are predominant.
# It is observed that a larger percentage of the majors are predominantly female.
#
# The `value_counts` method gives us a more precise breakdown of these values.

# In[10]:


recent_grads['ShareWomen'].value_counts(bins=20).sort_index()


# Approximately `57%` of majors fall within the 50% - 100% range of the `ShareWomen` column, which implies:
# - `43%` of majors are predominantly `Male`;
# - `57%` are predominantly `Female`
#
# In a bid to answer the second question:
# * What's the most common median salary range?
#
# The code cell below displays the distribution of values in the `Median` column.

# In[11]:


recent_grads['Median'].hist(bins=20)


# The plot above shows that the most common median salary is within the 30,000 USD - 35,000 USD range. However, the `value_counts()` method gives us a more precise breakdown in the code cell below.

# In[12]:


recent_grads['Median'].value_counts(bins=20)


# ### Scatter matrix plot
# A scatter matrix plot combines both scatter plots and histograms into one grid of plots and allows us to explore potential relationships and distributions simultaneously.

# In[13]:


from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[['Sample_size', 'Median']])


# In[14]:


scatter_matrix(recent_grads[['Sample_size', 'Median', 'Unemployment_rate']])


# ### Bar plots
# Using bar plots:
# * Compare the percentages of women (`ShareWomen`) from the first ten rows and last ten rows of the `recent_grads` dataframe.
# * Compare the unemployment rate (`Unemployment_rate`) from the first ten rows and last ten rows of the `recent_grads` dataframe.
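# As a numeric complement to the two bar-plot comparisons listed above, the first ten and last ten rows can also be summarized by their means. This is only a minimal sketch under stated assumptions: `demo` is a made-up stand-in for `recent_grads` (its values are not real data), and `compare_ends` is a hypothetical helper that is not part of pandas or of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for recent_grads: 30 made-up rows, already ordered by rank.
demo = pd.DataFrame({
    "Major": [f"MAJOR_{i}" for i in range(30)],
    "ShareWomen": [i / 30 for i in range(30)],
    "Unemployment_rate": [0.05 + 0.001 * i for i in range(30)],
})


def compare_ends(df, column, n=10):
    """Return the mean of `column` over the first n rows and the last n rows."""
    return pd.Series({
        f"first_{n}": df[column].head(n).mean(),
        f"last_{n}": df[column].tail(n).mean(),
    })


# One summary per column used in the bar plots.
share_summary = compare_ends(demo, "ShareWomen")
unemp_summary = compare_ends(demo, "Unemployment_rate")
print(share_summary)
print(unemp_summary)
```

# With the real data, the same helper could be called as `compare_ends(recent_grads, 'ShareWomen')`; the means only condense what the bar plots show per major, so they do not replace the plots.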
# In[27]:


recent_grads[:10].plot.bar(x='Major', y='ShareWomen', legend=False)


# The plot above shows that, among the top-ranked majors, `ASTRONOMY AND ASTROPHYSICS` has the highest percentage of women.

# In[30]:


recent_grads.tail(10).plot.bar(x='Major', y='ShareWomen', legend=False)


# The plot above shows that, among the lowest-ranked majors, `COMMUNICATION DISORDERS SCIENCES AND SERVICES` and `EARLY CHILDHOOD EDUCATION` have the highest percentages of women.
#
# Now, to compare the unemployment rate from the first ten rows and last ten rows of the `recent_grads` dataframe.

# In[31]:


recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate', legend=False)


# The plot above shows that, among the top-ranked majors, `NUCLEAR ENGINEERING` has the highest unemployment rate. The reasons why are not explored in this analysis.

# In[33]:


recent_grads.tail(10).plot.bar(x='Major', y='Unemployment_rate', legend=False)


# The plot above shows that, among the lowest-ranked majors, `CLINICAL PSYCHOLOGY` has the highest unemployment rate. The reasons why are not explored in this analysis.

# ### Using a `GROUPED BAR PLOT` to compare the number of men with the number of women in each category of majors.
# The code cell below displays the unique values in the `Major_category` column.

# In[19]:


recent_grads['Major_category'].unique()


# In[20]:


# Select both columns with a list; tuple-style selection on a groupby is not supported in recent pandas
recent_grads.groupby('Major_category')[['Men', 'Women']].sum().plot.bar()


# The plot displayed in the code cell above shows:
# - In the `Business`, `Engineering` and `Computers & Mathematics` major categories, there are more men than women.
# - Over `60%` of the major categories have more women than men

# ### Using `Box and Whisker Plots` to explore the `distributions of median salaries and unemployment rate.`

# In[21]:


recent_grads[['Median']].boxplot()


# The box plot above shows:
# - Most median salaries are approximately within the range of 20,000 USD - 60,000 USD
# - There are outliers within the 60,000 USD - 80,000 USD range and above 100,000 USD
# - The lowest 25% of median salaries (up to the first quartile) are approx. within the 21,000 USD - 35,000 USD range

# In[22]:


recent_grads[['Unemployment_rate']].boxplot()


# The box plot above shows:
# - Excluding outliers, unemployment rates reach up to approx. 12.5%
# - There are outliers approx. within the 15% - 17.5% range
# - The lowest 25% of unemployment rates (up to the first quartile) are approx. within the 0% - 5% range

# ### Hexagonal Bin Plots
#

# In[23]:


recent_grads.plot.hexbin(x='Men', y='Median', gridsize=30)


# In[24]:


recent_grads.plot.hexbin(x='Women', y='Median', gridsize=30)


# The hexagonal bin plots above show that women and men are similar in their median earning ranges; however, the plot for women has two dense points, at 35,000 USD and 40,000 USD.
#
# Median earnings for men are around 30,000 USD to 35,000 USD most of the time.

# ## Conclusion
# As stated in the introduction, the objective of this project was to `visualize` the job outcomes of students who graduated from college between 2010 and 2012, and I have been able to do so.
# - Graduating in a popular major does not necessarily guarantee a high salary
# - For a woman looking for a top-ranked major with a larger share of women graduates, `ASTRONOMY AND ASTROPHYSICS` would be the one to choose
# - When considering a top-ranked major to go into, `NUCLEAR ENGINEERING` would not be a good idea because of the high unemployment rate that characterizes it.
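# The first conclusion rests on reading a scatter plot by eye, and it could also be checked numerically with a Pearson correlation coefficient. The sketch below is an assumption-laden illustration, not a result from this project: `demo` holds made-up values rather than the real `recent_grads` data, so the coefficient it produces says nothing about the actual dataset.

```python
import pandas as pd

# Made-up values standing in for the Total and Median columns of recent_grads.
demo = pd.DataFrame({
    "Total": [1000, 5000, 20000, 80000, 150000, 300000],
    "Median": [60000, 45000, 38000, 40000, 35000, 36000],
})

# Series.corr computes the Pearson correlation coefficient by default.
r = demo["Total"].corr(demo["Median"])

# The full pairwise matrix is handy when more than two columns are compared.
corr_matrix = demo[["Total", "Median"]].corr()

print(f"Pearson r between Total and Median: {r:.3f}")
print(corr_matrix)
```

# On the real data, `recent_grads['Total'].corr(recent_grads['Median'])` would give the corresponding figure; a value near zero would support the reading that popularity and median salary are not linearly related.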