#!/usr/bin/env python # coding: utf-8 # # `Visualizing Earnings Based on College Majors` # --- # --- # # **In this project, we'll be working with `recent-grads.csv`, a dataset on the job outcomes of students who graduated from college between 2010 and 2012.** The original data on job outcomes was released by American Community Survey, which conducts surveys and aggregates the data. FiveThirtyEight cleaned the dataset and released it on their Github [repo](https://github.com/fivethirtyeight/data/tree/master/college-majors) # # Each row in the dataset represents a different major in college and contains information on gender diversity, # # Headers for `recent-grads.csv` are shown below: # # |**Header**|**Description**| # |-|-| # |Rank| Rank by median earnings **(the dataset is ordered by this column)**| # |Major_code| Major code, FO1DP in ACS PUMS| # |Major| Major description| # |Major_category|Category of major from Carnevale et al| # |Total| Total number of people with major| # |Sample_size| Sample size (unweighted) of full-time, year-round ONLY (used for earnings)| # |Men| Male graduates| # |Women| Female graduates| # |ShareWomen| Women as share of total| # |Employed| Number employed (ESR == 1 or 2)| # |Full_time| Employed 35 hours or more| # |Part_time| Employed less than 35 hours| # |Full_time_year_round| Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)| # |Unemployed| Number unemployed (ESR == 3)| # |Unemployment_rate| Unemployed / (Unemployed + Employed)| # |Median| Median earnings of full-time, year-round workers| # |P25th| 25th percentile of earnings| # |P75th| 75th percentile of earnings| # |College_jobs| Number with job requiring a college degree| # |Non_college_jobs| Number with job not requiring a college degree| # |Low_wage_jobs| Number in low-wage service jobs| # # # Using visualizations, we can start to explore interesting questions, such as: # # - Do students in more popular majors make more money?Using scatter plots # # - How 
many majors are predominantly male? Predominantly female? Using histograms # # - Which category of majors have the most students? Using bar plots # # - AND MANY MORE! # # Before we start creating data visualizations, let's import the libraries we need, explore the dataset, and remove rows containing null values. # # `Setting Up The Environment` # --- # **Importing libraries, controlling figure aesthetics & run the Jupyter magic:** `%matplotlib inline` # In[1]: #importing libraries import pandas as pd import matplotlib.pyplot as plt import numpy as np import seaborn as sns # In[2]: #controlling figure aesthetics: set font size, color palette, and style using seaborn sns.set(font_scale=1.3) sns.set_palette("PiYG") sns.set_style("white") # In[3]: #running some magic ;D get_ipython().run_line_magic('matplotlib', 'inline') #supernecessary --> so that Jupyter display all the plots inline. # # `Reading & Exploring` # --- # **Read the dataset into a DataFrame and start exploring the data.** # In[4]: recent_grads = pd.read_csv('recent-grads.csv') recent_grads.iloc[:1] #return the first row formatted as a table. # In[5]: recent_grads.info() # In[6]: recent_grads.head() # In[7]: recent_grads.tail() # Just a quick note here. As mentioned in the introduction, the dataset is ordered by Rank (rank by median earnings) #
So we can see that ... # - Majors in top 5 highest median earnings are in the engineering category # - Majors in the bottom 5 lowest median earnings are in the biology & life science, psychology & social work, and education categories # # Well... I'm not surprised... let's move on # In[8]: recent_grads.describe() # # `Cleaning: Dropping Rows With Missing Values` # --- # **We need to drop rows with missing values, because Matplotlib expects that columns of values we pass in have matching lengths and missing values will cause matplotlib to throw errors.** # In[9]: raw_data_count = len(recent_grads.index) raw_data_count # In[10]: #drop rows containing missing values recent_grads = recent_grads.dropna() # In[11]: #check the number of rows of the cleaned DataFrame cleaned_data_count = len(recent_grads.index) cleaned_data_count # In[12]: #comparing print('raw_data_count: ' + str(raw_data_count) + ' | ' + 'cleaned_data_count: '+ str(cleaned_data_count)) # **As we can see from the comparison result above, the initial DataFrame has 1 row with missing values, and we managed to drop it. #
Now we are ready to create some amazing plots! Let's get it!** # # `Scatterplot: Exploring The Relationship Between Some Columns` # --- # **Using `sns.scatterplot()`** # # **Let's generate scatter plots to explore the following relations:** # - `Sample_size and Median` # - `Sample_size and Unemployment_rate` # - `Full_time and Median` # - `ShareWomen and Unemployment_rate` # - `Men and Median` # - `Women and Median` # # **And then, let's answer the following questions:** # - `Do students in more popular majors make more money?` # - `Do students that majored in subjects that were majority female make more money?` # - `Is there any link between the number of full-time employees and median salary?` # In[13]: #Generate scatterplots in one go using a for loop x_val = ['Sample_size','Sample_size', 'Full_time', 'ShareWomen', 'Men', 'Women'] y_val = ['Median','Unemployment_rate', 'Median', 'Unemployment_rate', 'Median', 'Median'] fig = plt.figure(figsize=(25, 30)) for sp in range (len(x_val)): ax = fig.add_subplot(3,2,sp+1) ax = sns.scatterplot(data = recent_grads, x= x_val[sp], y= y_val[sp]) plt.title(y_val[sp] + ' vs ' + x_val[sp], weight='bold').set_fontsize('15') sns.despine(left=True, bottom=True) plt.show() # Based on the plot above, we can see that **all of the plots suggest that there is no significant relationship between the x & y variables:** # # - **On all the plots except for 'Unemployment_rate vs ShareWomen'**, we can see that there is a high variance near the 0 points, but as the y-value increases the variance diminishes. # # - **On 'Unemployment_rate vs ShareWomen'**, we can see that the dots are arbitrarily spread out. We can say the variances are pretty much equal at all ranges, which suggest that none of the variables have any effect on each other. # ## Remember that we have 3 questions? 
# # So, using the `Median vs Full_time` plot that we have above, we managed to answer one of the questions, # # ## Q1: "Is there any link between the number of full-time employees and median salary?", # **`Full_time: Number employed 35 hours or more # Median: Median salary of full-time, year-round workers.`** # #
**and the answer to the question is no. There is no link between the two** #
* *since we know that on all plots there is no significant relationship between the x & y variable (see the plot for detail)* # #
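The "no link" answer above comes from eyeballing the scatter plot. A correlation coefficient puts a number on that impression. Below is a minimal, dependency-free sketch of the Pearson correlation. The helper name `pearson_r` is made up for this example, and the values in `full_time` and `median` are hypothetical toy numbers, not rows from `recent-grads.csv`.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values (NOT taken from recent-grads.csv)
full_time = [120, 450, 900, 2300, 5100, 8800]
median = [42000, 31000, 36000, 29000, 40000, 33000]

print(round(pearson_r(full_time, median), 3))
```

On the real DataFrame, `recent_grads['Full_time'].corr(recent_grads['Median'])` should compute the same statistic. A value close to 0 would support the visual reading, while a value far from 0 would be a reason to look at the plot again.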
# # So... ONE down! TWO to go! **We still have two unanswered questions, which are:** # - Do students in more popular majors make more money? # - Do students that majored in subjects that were majority female make more money? # # So let's make a few more plots to find the answers. # ## Q2: "Do students in more popular majors make more money?" # **`Total: Total number of people with major. # Median: Median salary of full-time, year-round workers.`** # In[14]: fig, ax = plt.subplots(figsize =(10,5)) sns.scatterplot(data = recent_grads, x='Total', y='Median') sns.despine(left=True, bottom=True) plt.title('Total vs Median', weight='bold').set_fontsize('15') plt.show() # **Based on the plot, the answer is no. There is no correlation between Median and Total. #
We can see that there is a high variance near the 0 points, but as Total increases the variance diminishes.** # # **In other words, there is no correlation between the popularity of the major, and money** # ## Q3: "Do students that majored in subjects that were majority female make more money?" # **`ShareWomen: Women as share of total. # Median: Median salary of full-time, year-round workers.`** # In[15]: fig, ax = plt.subplots(figsize =(10,5)) sns.scatterplot(data = recent_grads, x='ShareWomen', y='Median') sns.despine(left=True, bottom=True) plt.title('Median vs. ShareWomen', weight='bold').set_fontsize('15') plt.show() # **Based on the plot, there is a weak negative correlation between Median and ShareWomen. # We can see that as Sharewomen increases, the Median decreases.** # # **In other words, the students that majored in subjects that were female majority tend to make less money.** # # **But really? Hmm we need more evidence. Let's analyze!** # # **We need numbers!** # ## Q3 analysis: Deep diving on Median and ShareWomen # In[16]: #Analyze ShareWomen and Median using groupby() median_sharewomen = recent_grads.groupby(["ShareWomen"])["Median"].mean().sort_values(ascending=False) print(median_sharewomen) # **Alright, so the result suggest that ShareWomen with low values tend to get higher median earnings, and vice versa. #
However, to be more confident in our analysis , let's group ShareWomen into 3 bins, and see what is the average Median of each group.** # # **Are we still getting a negative correlation? let's find out** # In[17]: #splitting ShareWomen into 3 groups sharewomen_grouped = recent_grads["ShareWomen"].value_counts(bins = 3).sort_index(ascending= False) sharewomen_grouped # In[18]: #using sns.barplot #we are creating a barplot with 3 bins bins = [-0.0019690000000000003, 0.323, 0.646, 0.969] #using the bins from the previous result median_sharewomen_grouped = recent_grads.groupby(pd.cut(recent_grads["ShareWomen"], bins))["Median"].mean().sort_values(ascending= False) print(median_sharewomen_grouped) fig, ax = plt.subplots(figsize =(10,5)) plt.xlabel('ShareWomen') plt.ylabel('Median') sns.barplot(x=sorted(median_sharewomen_grouped.index), y=median_sharewomen_grouped, data=recent_grads, ci=None) plt.title('Median vs ShareWomen (bins = 3)', weight='bold').set_fontsize('15') sns.despine(left=True, bottom=True) # **It's true! the students that majored in subjects that were female majority on average make less money.** # **So... just to recap:** # # - **"Do students in more popular majors make more money?"** #
Answer: No. There is no correlation between the popularity of a major and money. # - **"Do students that majored in subjects that were majority female make more money?"** #
Answer: The opposite is true. The students that majored in subjects that were female majority on average make less money. # - **"Is there any link between the number of full-time employees and median salary?"** #
Answer: No. The scatter plot shows no apparent link between the two. # # **We have managed to answer these questions by creating scatterplots for:** # - Sample_size and Median # - Sample_size and Unemployment_rate # - Full_time and Median # - ShareWomen and Unemployment_rate # - Men and Median # - Women and Median # - Total vs Median # - Median vs ShareWomen (we also did a deep dive analysis on this one, by creating a bar plot) # # **We're done with scatterplots, next we will create histograms!** # # `Histograms: Exploring The Distribution of Values In A Column` # --- # Using **`sns.histplot`** #
**Let's generate histograms to explore the distributions of the following columns:** #
* Note: We're going to use `Series.describe()` to understand the data distribution of each column # - `Sample_size: Sample size (unweighted) of full-time, year-round ONLY (used for earnings)` # - `Median: Median salary of full-time, year-round workers` # - `Employed: Number employed` # - `Full_time: Number employed 35 hours or more` # - `ShareWomen: Women as share of total` # - `Unemployment_rate: Percent of labor force that is jobless` # - `Men: Male graduates` # - `Women: Female graduates` # In[19]: #Generate histograms in one go using a for loop cols = ["Sample_size", "Median", "Employed", "Full_time", "ShareWomen", "Unemployment_rate", "Men", "Women"] bin_sizes = [14, 14, 14, 14, 14, 14, 30, 30] # Same length as cols #to get bins = 14, we use the square root rule for most of them #n = 172 --> sqroot(n) ~ 14 #exception for the “Men” and “Women”, we use bin = 30, #because when we use bin = 14 more than 80% of the data fall in the first bin for sp in range (len(cols)): fig = plt.figure(figsize=(10,5)) sns.histplot(data = recent_grads, x= cols[sp], bins = bin_sizes[sp]) sns.despine(left=True, bottom=True) plt.title(cols[sp], weight='bold').set_fontsize('16') plt.show() print('--------------------------------') print(recent_grads[cols[sp]].describe()) print('--------------------------------') # **Cool, using histograms we are able to comfortably visualize the data distribution. But what are the takeaways?** #
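A quick side note on the bin counts used in the loop above: the code comments mention the square root rule with n = 172. The sketch below shows that arithmetic, assuming the result is rounded up (the notebook only says "~ 14", so the rounding direction is an assumption). `sqrt_rule_bins` is a hypothetical helper name.

```python
import math


def sqrt_rule_bins(n_observations):
    """Number of histogram bins suggested by the square-root rule, rounded up."""
    if n_observations < 1:
        raise ValueError("need at least one observation")
    return math.ceil(math.sqrt(n_observations))


# 172 rows remain after dropna(); sqrt(172) is about 13.11, which rounds up to 14
print(sqrt_rule_bins(172))  # prints 14
```

The rule is only a starting point. As the comments above note, `Men` and `Women` needed more bins (30) because most of their values pile up in the first bin.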
\**Note: Obviously, histogram and series.describe() will display us slightly different numbers. What we are trying to achieve here is to demonstrate how we can easily visualize data distribution with decent accuracy using histogram.* # # - Sample_size: # - Most values are within (0, 500) based on histogram # - 75% values are within \[2, 339\] based on series.describe( ) # - Ie. The values of Sample_size on each major appears to be really low. If it is as low as I think it is, we can't get a highly accurate insights from this data. # - **(We are going to deep dive on this column later in the section)** # - Median: # - Most values are within (30000, 40000) based on histogram # - 75% values are within \[22000, 45000\] based on series.describe( ) # - Ie. Students in most majors make around \\$30,000 - \\$40,000 annually. # - Employed: # - Most values are within (0, 30000) based on histogram # - 75% values are within \[0, 31701\] based on series.describe( ) # - Ie. Most majors have a really low number of employment. But is it really? # - **(We are going to deep dive on this column later in the section)** # - Full-time: # - Most values are within (0, 25000) based on histogram # - 75% values are within \[111, 25447\] based on series.describe( ) # - Ie. Most graduates don't have a full-time job # - Note that the histogram for full-time and employed looks similar, which is intuitively correct. # - ShareWomen: # - The values in this histogram is more evenly spread out compared to the others # - Most values are within (.6, .7) based on histogram # - Ie. Most majors have more females in proportion to male. I honestly thought it's the opposite... 
# In the most common bin, females represent 60% - 70% of the total number of people per major # - **(We are going to deep dive on this column later in the section)** # - Unemployment_rate: # - Most values are within (.05, .07) based on histogram # - Common unemployment rates are between 5.5% - 6.25% # - Men: # - Most male graduates are within (0, 12500) based on histogram # - 75% of male graduates are within \[119, 14631\] based on series.describe( ) # - Women: # - Most female graduates are within (0, 25000) based on histogram # - 75% of female graduates are within \[0, 22553\] based on series.describe( ) # - **Hold on, there is a major with 0 female graduates?** ( based on series.describe( ) ) # - **(We are going to deep dive on this column later in the section)** # ## Analysis: Deep diving on `Sample_size`, `Employed`, `ShareWomen`, and `Women` # **A few quirky things from histograms and series.describe( ) caught my attention. Let's dive right into it.** # **`Sample_size: Sample size (unweighted) of full-time, year-round ONLY (used for earnings)`** # # As previously mentioned, the sample size seems to be really low. #
Let's explore this column even more, and then find the percentage of sample size by calculating `Sample_size / Full_time_year_round`. # In[20]: #exploring rows that have the most common sample size values --> Sample_size: (0, 500) common_samplesize = recent_grads[recent_grads["Sample_size"].between(0, 500)] common_samplesize.sort_values(by='Sample_size', ascending=False) # Ok, there are a few things to note here: # - From the data dictionary, we know that `Sample_size` = Sample size (unweighted) of full-time, year-round ONLY (used for earnings). # - Ie. In the `Sample_size` column we get the number of sampled records obtained from `Full_time_year_round`, which are used to determine earnings in the `Median` column. # - **Or to put it simply, the `Sample_size` column reflects the number of people from `Full_time_year_round` that reported their earnings.** # # # - The problem is, `Sample_size` values are very low for the majority of the rows in this dataset. # - Based on the result of `recent_grads["Sample_size"].describe()` (under the histogram in the above section), count = 172. # - Here, we can see that `common_samplesize` has 141 rows. This indicates that 141 / 172 = .82 = **82% of the rows have a sample size between (0, 500)** # # # - If we look at the last row in the sorted `common_samplesize` dataframe above (row #172), apparently **for the "LIBRARY SCIENCE" major, out of 410 Full_time_year_round people, only 2 were sampled.** # - `Sample_size / Full_time_year_round` = 2 / 410 ≈ .0049 ≈ **.5%** # # # - I don't have an extensive statistical background, but I'm pretty sure a .5% sample size is too low to provide us any meaningful result #
*Based on this article [here](http://www.tools4dev.org/resources/how-to-choose-a-sample-size/#:~:text=A%20good%20maximum%20sample%20size,%2C%2010%25%20would%20be%2020%2C000.), a good maximum sample size is usually around 10% of the population* # *Alright alright alright* enough of this sample size rabbit hole! Let's just do this one thing: # **Calculate the sample size percentage of this dataset** # In[21]: #just some math sample_dataset = recent_grads["Sample_size"] / recent_grads["Full_time_year_round"] sample_dataset.describe() # So based on `sample_dataset.describe()`, we know the following: # - Average sample size used is 1.7% # - Minimum sample size used is about .5% (2 / 410 ≈ .0049, which we just calculated; it belongs to the LIBRARY SCIENCE major) # - Maximum sample size used is 3.6%, meh -________-' # # Ok, not sure what's going on with this dataset. **The sample size is SUPER LOW**. This might have an effect on the other columns, or at least the median column, or... whatever, let's just roll with it. \**shoulder shrug*\* #
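For anyone who wants to double-check the sample size arithmetic by hand, here is a tiny sketch. The only numbers used are the LIBRARY SCIENCE figures quoted above (2 sampled out of 410 full-time, year-round workers); `sampling_fraction` is a hypothetical helper name, not something from pandas.

```python
def sampling_fraction(sample_size, population):
    """Share of a population that ended up in the sample (0.0 to 1.0)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return sample_size / population


# LIBRARY SCIENCE: Sample_size = 2, Full_time_year_round = 410
fraction = sampling_fraction(2, 410)
print(f"{fraction:.4f} = {fraction:.2%}")  # prints 0.0049 = 0.49%
```

The element-wise division `recent_grads["Sample_size"] / recent_grads["Full_time_year_round"]` in the cell above does the same thing for every row at once.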
**NEXT!** # **`Employed: Number employed`** # # So earlier we figured that most majors have a really low number of employment, which is not surprising. #
However, what surprises me is that **there is a major that has 0 people employed**. I wonder which major(s). Let's see... # In[22]: #checking major(s) with 0 people employed recent_grads[recent_grads["Employed"] == 0] # ... and the major is **"MILITARY TECHNOLOGIES"**, so only 1 major? Hmmm # # BUT WAIT! Let's take another look at the `recent_grads[recent_grads["Employed"] == 0]` above. There are a few quirks: # - Employed = 0 # - Unemployed = 0? # - Median = 40000?? # - Non_college_jobs = 0??? # - College_jobs = 0???? # - Low_wage_jobs = 0????? # - *WHAAAAT??!!??!!* # # \**gasping for breath*\* #
I don't know what just happened, but suddenly I got dragged into another rabbit hole, and stumbled upon [this](https://github.com/fivethirtyeight/data/issues/250) #
**TL;DR There is something peculiar about this dataset.** # - **Total != Employed + Unemployed** # - **Total != College_jobs + Non_college_jobs + Low_wage_jobs** # # So again, I am not sure what is going on here. But there is nothing we can do, so let's keep working with the dataset that we have. # # ...Moving on... # **`ShareWomen: Women as share of total`** # # We are going to answer one of the questions that we have in the introduction, which is #
**"How many majors are predominantly male?"** # In[23]: #create a new column recent_grads["gender_majority"] = np.nan #add values to the new column recent_grads.loc[recent_grads["ShareWomen"] > .5, "gender_majority"] = "Female" recent_grads.loc[recent_grads["ShareWomen"] < .5, "gender_majority"] = "Male" #display recent_grads # In[24]: #using histplot #we are creating a histogram to answer the question, "How many majors are predominantly male?" fig = plt.subplots(figsize =(10,5)) #differentiate color b/w gender_majority using hue sns.histplot(data = recent_grads, x= "gender_majority", hue = "gender_majority") sns.despine(left=True, bottom=True) plt.title("gender_majority", weight='bold').set_fontsize('16') #display #count on legend plt.legend(recent_grads["gender_majority"].value_counts(),bbox_to_anchor=(1, 1), title= 'Count') plt.show() # **As we can see, 96 majors are predominantly Female, and 76 majors are predominantly Male. Cool...** # **`Women: Female graduates`** # # Based on series.describe( ) **there is a major with 0 female graduates, I wonder what that is...** # In[25]: recent_grads[recent_grads["Women"] == 0] # Ahh.. this one again, **"MILITARY TECHNOLOGIES"**... # # We just had this one earlier when deep diving on `Employed`. I don't know, this major has weird numbers. #
**We'll just leave it at that. Next please!** # # **And of course, before we move on, let's do a little recap:** # # **In the last 2 sections, we created the following:** # 1. **Scatter plots** to visualize potential relationships between the following columns: # - Sample_size and Median # - Sample_size and Unemployment_rate # - Full_time and Median # - ShareWomen and Unemployment_rate # - Men and Median # - Women and Median # - Total vs Median # - Median vs ShareWomen (we also did a deep dive analysis on this one, by creating a bar plot) # # Based on our scatter plots, we find that: # - **"Do students in more popular majors make more money?"** #
Answer: No. There is no correlation between the popularity of a major and money. # - **"Do students that majored in subjects that were majority female make more money?"** #
Answer: The opposite is true. The students that majored in subjects that were female majority on average make less money. #
(we deep dived on this by creating a bar plot with 3 bins) # - **"Is there any link between the number of full-time employees and median salary?"** #
Answer: No. The scatter plot shows no apparent link between the two. # # # 2. **Histograms** to visualize the distributions of the following columns: #
Sample_size, Median, Employed, Full_time, ShareWomen, Unemployment_rate, Men, and Women # # We also did further analysis on Sample_size, Employed, ShareWomen, and Women # - We answered the question, **"How many majors are predominantly male?"** by creating a bar plot when deep diving on ShareWomen # # **Ready for more? ;D** # # `Scatter Matrix (Pair Plot) : Exploring Potential Relationships And Distributions Simultaneously` # --- #
**In this section, we will create a scatter matrix (pair plot), which is a plot that combines both scatter plots and histograms into one grid of plots and allows us to explore potential relationships and distributions simultaneously.** #
\* *Note: A scatter matrix plot consists of n by n plots on a grid, where n is the number of columns, the plots on the diagonal are histograms, and the non-diagonal plots are scatter plots.* # # **Using `sns.pairplot()`** #
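To make the n-by-n layout from the note concrete, here is a small sketch that lists which kind of panel ends up in each grid cell. It draws nothing and does not call seaborn; `pair_grid_layout` is a hypothetical helper, and the diagonal is labeled "histogram" on the assumption that the pair plot is created with default settings and no `hue`.

```python
def pair_grid_layout(columns):
    """Map each (row, col) cell of an n-by-n scatter matrix to its panel type."""
    layout = {}
    for row_name in columns:        # row variable goes on the y-axis
        for col_name in columns:    # column variable goes on the x-axis
            kind = "histogram" if row_name == col_name else "scatter"
            layout[(row_name, col_name)] = kind
    return layout


layout = pair_grid_layout(["Sample_size", "Median", "Unemployment_rate"])
for (row_name, col_name), kind in layout.items():
    print(f"y={row_name:<18} x={col_name:<18} -> {kind}")
```

For 3 columns this gives 9 panels: 3 histograms on the diagonal and 6 scatter plots. Each pair of columns appears twice, once with the axes swapped.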
**Let's generate scatter matrix plots (pair plots) to explore the relationships and distributions of the following columns:** # - `Sample_size and Median` # - `Sample_size, Median, and Unemployment_rate` # # Then, we will create a few more plots to explore the answers to the following questions: # - `Do students in more popular majors make more money?` # - `Do students that majored in subjects that were majority female make more money?` # - `Is there any link between the number of full-time employees and median salary?` # # *Note: We already answered these questions using scatter plots in the earlier section, but let's explore them again while we familiarize ourselves with the scatter matrix* # # **`Sample_size: Sample size (unweighted) of full-time # Median: Median salary of full-time, year-round workers # Unemployment_rate: Percent of labor force that is jobless`** # In[26]: pairs = [['Sample_size', 'Median'], ['Sample_size', 'Median', 'Unemployment_rate']] for pair in range(len(pairs)): pairplot = sns.pairplot(recent_grads[pairs[pair]]) pairplot.fig.set_size_inches(10,10) for ax in pairplot.axes.flat: #rotate x-axis labels ax.tick_params("x", labelrotation=45) # **As we can see, there is no correlation between sample size, median, and unemployment rate.** #
*these scatter matrices confirmed our findings in the previous sections. # **Now, let's use scatter matrix to validate our answers to the three questions that we have** # # To do that, we are going to create 3 scatter matrices (pair plots): # 1. Total and Median # 2. ShareWomen and Median # 3. Full_time and Median # **`Total :Total number of people with major. # Full_time: Number employed 35 hours or more # ShareWomen: Women as share of total. # Median :Median salary of full-time, year-round workers.`** # In[27]: pairs = [['Total', 'Median'], ['ShareWomen', 'Median'], ['Full_time', 'Median']] for pair in range(len(pairs)): pairplot = sns.pairplot(recent_grads[pairs[pair]]) pairplot.fig.set_size_inches(10,10) for ax in pairplot.axes.flat: #rotate x-axis labels ax.tick_params("x", labelrotation=45) # **Alright cool! Scatter matrix and scatter plot give us the same answer. This help us validate our answer in the previous sections.** # # **And uh, actually, we are done with scatter matrix. So Let's do another reflection...** # # **In the last 3 sections, we created the following:** # 1. **Scatter plots** to visualize potential relationships between the following columns: # - Sample_size and Median # - Sample_size and Unemployment_rate # - Full_time and Median # - ShareWomen and Unemployment_rate # - Men and Median # - Women and Median # - Total vs Median # - Median vs ShareWomen (we also did a deep dive analysis on this one, by creating a bar plot) # # Based on our scatter plots, we find that: # - **"Do students in more popular majors make more money?"** #
Answer: No. There is no correlation between the popularity of a major and money. # - **"Do students that majored in subjects that were majority female make more money?"** #
Answer: The opposite is true. The students that majored in subjects that were female majority on average make less money. #
(we deep dived on this by creating a bar plot with 3 bins) # - **"Is there any link between the number of full-time employees and median salary?"** #
Answer: No. The scatter plot shows no apparent link between the two. # # # 2. **Histograms** to visualize the distributions of the following columns: #
Sample_size, Median, Employed, Full_time, ShareWomen, Unemployment_rate, Men, and Women # # We also did further analysis on Sample_size, Employed, ShareWomen, and Women # - We answered the question, **"How many majors are predominantly male?"** by creating a bar plot when deep diving on ShareWomen # # # 3. **Scatter matrix** plots to visually explore potential relationships and distributions of the following: # - Sample_size and Median # - Sample_size, Median, and Unemployment_rate # - Total and Median # - ShareWomen and Median # - Full_time and Median # # Using scatter matrix plots, we managed to validate our answers in the previous sections. # # # **Next!** # # `Bar Plots: Comparing Some Columns` # --- # **Using `sns.barplot`** #
**Let's generate barplots to do the following:** # - Compare the percentages of women (ShareWomen) from the first ten rows and last ten rows #
of the recent_grads dataframe, while having major on the x-axis. # - Compare the unemployment rate (Unemployment_rate) from the first ten rows and last ten rows #
of the recent_grads dataframe, while having major on the x-axis. # # **`ShareWomen vs Major`** # In[28]: fig, ax = plt.subplots(figsize =(10,5)) plt.xlabel('Major') plt.ylabel('ShareWomen') sns.barplot(x=recent_grads[:10]['Major'], y=recent_grads[:10]['ShareWomen'], ci=None) ax.set_xticklabels(recent_grads[:10]['Major'], rotation='vertical') sns.despine(left=True, bottom=True) plt.title("ShareWomen vs Major", weight='bold').set_fontsize('16') # In[29]: fig, ax = plt.subplots(figsize =(10,5)) plt.xlabel('Major') plt.ylabel('ShareWomen') sns.barplot(x=recent_grads[-10:]['Major'], y=recent_grads[-10:]['ShareWomen'], ci=None) ax.set_xticklabels(recent_grads[-10:]['Major'], rotation='vertical') sns.despine(left=True, bottom=True) plt.title("ShareWomen vs Major", weight='bold').set_fontsize('16') # **Remember that this dataset is ordered by median earnings?** # # - It's interesting to see that **only 1 out of 10 majors with the highest median earnings are predominantly female (ASTRONOMY AND ASTROPHYSICS).** # - On the other hand, the **10 majors with the lowest median earnings are all predominantly female.** Again, very interesting. 
# - This result also validates our findings in the earlier section, that there is a **negative correlation between ShareWomen and Median.** # **`Unemployment_rate vs Major`** # In[30]: fig, ax = plt.subplots(figsize =(10,5)) plt.xlabel('Major') plt.ylabel('Unemployment_rate') sns.barplot(x=recent_grads[:10]['Major'], y=recent_grads[:10]['Unemployment_rate'], ci=None) ax.set_xticklabels(recent_grads[:10]['Major'], rotation='vertical') sns.despine(left=True, bottom=True) plt.title("Unemployment_rate vs Major", weight='bold').set_fontsize('16') # In[31]: fig, ax = plt.subplots(figsize =(10,5)) plt.title("Unemployment_rate vs Major", weight='bold').set_fontsize('16') plt.xlabel('Major') plt.ylabel('Unemployment_rate') sns.barplot(x=recent_grads[-10:]['Major'], y=recent_grads[-10:]['Unemployment_rate'], ci=None) ax.set_xticklabels(recent_grads[-10:]['Major'], rotation='vertical') sns.despine(left=True, bottom=True) # Based on the bar plot, **Unemployment_rate seems to be normally distributed** across all majors. # **This is almost the end of the project, so let's do another reflection on what we have done.** # # **In the last 4 sections, we created the following:** # 1. **Scatter plots** to visualize potential relationships between the following columns: # - Sample_size and Median # - Sample_size and Unemployment_rate # - Full_time and Median # - ShareWomen and Unemployment_rate # - Men and Median # - Women and Median # - Total vs Median # - Median vs ShareWomen (we also did a deep dive analysis on this one, by creating a bar plot) # # Based on our scatter plots, we find that: # - **"Do students in more popular majors make more money?"** #
Answer: No. There is no correlation between the popularity of a major and money. # - **"Do students that majored in subjects that were majority female make more money?"** #
Answer: The opposite is true. The students that majored in subjects that were female majority on average make less money. #
(we deep dived on this by creating a bar plot with 3 bins) # - **"Is there any link between the number of full-time employees and median salary?"** #
Answer: No. The scatter plot shows no apparent link between the two. # # # 2. **Histograms** to visualize the distributions of the following columns: #
Sample_size, Median, Employed, Full_time, ShareWomen, Unemployment_rate, Men, and Women # # We also did further analysis on Sample_size, Employed, ShareWomen, and Women # - We answered the question, **"How many majors are predominantly male?"** by creating a bar plot when deep diving on ShareWomen # # # 3. **Scatter matrix** plots to visually explore potential relationships and distributions of the following: # - Sample_size and Median # - Sample_size, Median, and Unemployment_rate # - Total and Median # - ShareWomen and Median # - Full_time and Median # # Using scatter matrix plots, we managed to validate our answers in the previous sections. # # # 4. **Bar Plots** to compare the percentages of women (ShareWomen), and Unemployment_rate of the first and last ten rows of the `recent_grads` dataframe, with Major on the x-axis # # As I said before, we are almost done! #
**On the next section, let's have some fun by doing further exploration and analysis!** # # `Oooh this is going to be fun ;D` # --- # **In this section, we are going to have fun by visually exploring and analyzing data using many different types of plots. So what are we waiting for?** # **`Use a grouped bar plot to compare the number of men with the number of women in each category of majors.`** # In[32]: #create a new column that shows the difference between male and female graduates in each major #we will use this new column to sort the data recent_grads["delta_graduates"] = recent_grads["Men"] - recent_grads["Women"] #print print(recent_grads.groupby(["Major_category"])["delta_graduates"].sum().sort_values()) # Note: positive values means there are more female, and negative values means there are more male # In[33]: recent_grads.groupby('Major_category').sum().sort_values(by=['delta_graduates'] ,ascending=False).plot.barh( y=['Men','Women'], figsize =(20,10)) sns.despine(left=True, bottom=True) plt.xlabel('Major_category') plt.ylabel('Total') plt.title("Distribution of Men and Women in each Major Categories", weight='bold').set_fontsize('16') # Top 3 majors that are predominantly **female: Education, Health, and Psychology & Social Work.** (in order) #
Top 3 major categories that are predominantly **male: Engineering, Computers & Mathematics, and Business.** (in order)

# **`Use a box plot to explore the distributions of median salaries and unemployment rate.`**

# In[34]:

sns.boxplot(data = recent_grads['Median'])

# Alright, this matches our findings in the histogram section: **most majors have a median salary of around \\$30,000 - \\$40,000**

# In[35]:

sns.boxplot(data = recent_grads['Unemployment_rate'])

# Similarly, this matches our findings in the histogram section: **common unemployment rates are between 5.5% - 6.25%**

# **`Use a hexagonal bin plot to visualize the columns that had dense scatter plots from earlier in the project:`**
# - `Unemployment_rate vs ShareWomen`
# - `Median vs ShareWomen`

# In[36]:

# x-axis: ShareWomen, y-axis: Unemployment_rate
plt.hexbin(recent_grads['ShareWomen'], recent_grads['Unemployment_rate'], gridsize=(15,15), cmap ="RdPu")
plt.show()

# Well, another validation: **Unemployment_rate and ShareWomen have no correlation.**

# In[37]:

# x-axis: ShareWomen, y-axis: Median
plt.hexbin(recent_grads['ShareWomen'], recent_grads['Median'], gridsize=(15,15), cmap ="RdPu")
plt.show()

# Yuuup, **Median and ShareWomen have a weak negative correlation.** Verified!
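# A hexbin plot shows where the points pile up, but it does not put a number on the relationship. As a rough cross-check (a sketch, not part of the original analysis), the Pearson coefficient for the same column pairs can be computed with `Series.corr`. The `pairwise_correlation` helper and the `toy` frame below are hypothetical stand-ins with made-up values; with the real data you would pass `recent_grads` instead.

```python
import pandas as pd


def pairwise_correlation(df, x, y, method="pearson"):
    """Return the correlation coefficient between two columns of df.

    Series.corr skips rows where either value is missing.
    """
    return df[x].corr(df[y], method=method)


# Tiny made-up frame standing in for recent_grads (illustrative values only).
toy = pd.DataFrame({
    "ShareWomen": [0.1, 0.3, 0.5, 0.7, 0.9],
    "Median": [70000, 55000, 45000, 38000, 30000],
    "Unemployment_rate": [0.06, 0.05, 0.07, 0.05, 0.06],
})

r_median = pairwise_correlation(toy, "ShareWomen", "Median")
r_unemp = pairwise_correlation(toy, "ShareWomen", "Unemployment_rate")

print(f"ShareWomen vs Median: r = {r_median:.2f}")
print(f"ShareWomen vs Unemployment_rate: r = {r_unemp:.2f}")
```

# A coefficient close to 0 supports "no correlation", and a small negative value supports "weak negative correlation". Where to draw the line between weak and moderate is a judgment call.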
# **`Use a barplot to display the following:`**
# - Top 5 majors with the highest 75th percentile of earnings
# - Top 5 majors with the most full time employees

# In[38]:

# Each row is a different major, so mean() simply returns that major's value
major_highest_P75th = recent_grads.groupby("Major")["P75th"].mean().sort_values(ascending=False).head()
print(major_highest_P75th)

fig, ax = plt.subplots(figsize=(15, 6))
ax.set_title("Top 5 majors with the highest 75th percentile of earnings", weight='bold', fontsize=16)
# Plot the aggregated Series directly; there is no need to pass data=recent_grads as well
sns.barplot(x=major_highest_P75th.index, y=major_highest_P75th.values, ax=ax)
ax.tick_params(axis='x', labelrotation=90)
sns.despine(left=True, bottom=True)

# Well, **for any of you who are still in college and want to aim for the top of the earnings range, here is where to look.**
# These majors have the highest 75th percentile of earnings:
# - PETROLEUM ENGINEERING
# - ASTRONOMY AND ASTROPHYSICS
# - METALLURGICAL ENGINEERING
# - NUCLEAR ENGINEERING
# - PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTRATION

# In[39]:

# Full_time_year_round is a head count, so larger majors naturally rank higher
major_most_fulltime = recent_grads.groupby("Major")["Full_time_year_round"].mean().sort_values(ascending=False).head()
print(major_most_fulltime)

fig, ax = plt.subplots(figsize=(15, 6))
ax.set_title("Top 5 majors with the most full time employees", weight='bold', fontsize=16)
sns.barplot(x=major_most_fulltime.index, y=major_most_fulltime.values, ax=ax)
ax.tick_params(axis='x', labelrotation=90)
sns.despine(left=True, bottom=True)

# And **for any of you who want to know where the most full-time, year-round workers come from**, these are the majors (keep in mind these are raw counts, so they mostly reflect how big each major is):
# - BUSINESS MANAGEMENT AND ADMINISTRATION
# - PSYCHOLOGY
# - GENERAL BUSINESS
# - MARKETING AND MARKETING RESEARCH
# - ACCOUNTING

# 1. 
**Scatter plots** to visualize potential relationships between the following columns:
#     - Sample_size and Median
#     - Sample_size and Unemployment_rate
#     - Full_time and Median
#     - ShareWomen and Unemployment_rate
#     - Men and Median
#     - Women and Median
#     - Total vs Median
#     - Median vs ShareWomen (we also did a deep dive analysis on this one, by creating a bar plot)
#
# Based on our scatter plots, we find that:
#     - **"Do students in more popular majors make more money?"**
#
Answer: No. There is no correlation between the popularity of a major and how much its graduates earn.
#     - **"Do students that majored in subjects that were majority female make more money?"**
#
Answer: The opposite is true. On average, students who majored in majority-female subjects make less money.
#
(we took a deeper look at this by creating a bar plot with 3 bins)
#     - **"Is there any link between the number of full-time employees and median salary?"**
#
Answer: No. There is no apparent link between the two.
#
#
# 2. **Histograms** to visualize the distributions of the following columns:
#
Sample_size, Median, Employed, Full_time, ShareWomen, Unemployment_rate, Men, and Women
#
# We also did further analysis on Sample_size, Employed, ShareWomen, and Women
#     - We answered the question, **"How many majors are predominantly male?"** by creating a bar plot when deep diving on ShareWomen
#
#
# 3. **Scatter matrix** plots to visually explore potential relationships and distributions of the following:
#     - Sample_size and Median
#     - Sample_size, Median, and Unemployment_rate
#     - Total and Median
#     - ShareWomen and Median
#     - Full_time and Median
#
# Using scatter matrix plots, we managed to validate our answers in the previous sections.
#
#
# 4. **Bar Plots** to compare the percentages of women (ShareWomen), and Unemployment_rate of the first and last ten rows of the `recent_grads` dataframe, with Major on the x-axis
#
#
# 5. We did some **fun stuff** here. Here is what we did:
#     - Use a grouped bar plot to compare the number of men with the number of women in each category of majors.
#     - Use a box plot to explore the distributions of median salaries and unemployment rate.
#     - Use a hexagonal bin plot to visualize the columns that had dense scatter plots from earlier in the project:
#         - Unemployment_rate vs ShareWomen
#         - Median vs ShareWomen
#     - Use a barplot to display the following:
#         - Top 5 majors with the highest 75th percentile of earnings
#         - Top 5 majors with the most full time employees
#
# **And that's a wrap!**

# # `Conclusion & Insights`
# ---
# ---
# Well, you know what they say, "All good things must come to an end"
#
# This is the last section of the project. So before we actually end this,
#
**Let's present our findings**
#
# **Q&A**
# - **"Do students in more popular majors make more money?"**
#
Answer: No. There is no correlation between the popularity of a major and how much its graduates earn.
# - **"Do students that majored in subjects that were majority female make more money?"**
#
Answer: The opposite is true. On average, students who majored in majority-female subjects make less money.
# - **"Is there any link between the number of full-time employees and median salary?"**
#
Answer: No. There is no apparent link between the two.
# - **"How many majors are predominantly male?"**
#
Answer: 76 majors are predominantly male, and 96 majors are predominantly female
#
# **Major statistics**
# - **Top 3 major categories that are predominantly female:** Education, Health, and Psychology & Social Work. (in order)
# - **Top 3 major categories that are predominantly male:** Engineering, Computers & Mathematics, and Business. (in order)
# - **Top 5 majors with the highest 75th percentile of earnings:**
#     - PETROLEUM ENGINEERING
#     - ASTRONOMY AND ASTROPHYSICS
#     - METALLURGICAL ENGINEERING
#     - NUCLEAR ENGINEERING
#     - PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTRATION
# - **Top 5 majors with the most full time employees:**
#     - BUSINESS MANAGEMENT AND ADMINISTRATION
#     - PSYCHOLOGY
#     - GENERAL BUSINESS
#     - MARKETING AND MARKETING RESEARCH
#     - ACCOUNTING
#
# **Correlation**
# - Sample_size and Median = NO
# - Sample_size and Unemployment_rate = NO
# - Full_time and Median = NO
# - ShareWomen and Unemployment_rate = NO
# - Men and Median = NO
# - Women and Median = NO
# - Total vs Median = NO
# - **Median vs ShareWomen = Weak negative correlation**
#
# **Extra**
#
# - Students in most majors make around \\$30,000 - \\$40,000 annually.
# - Women represent 60% - 70% of the total number of people per major
# - Common unemployment rates are between 5.5% - 6.25%
# - Only 1 of the 10 majors with the highest median earnings is predominantly female (ASTRONOMY AND ASTROPHYSICS).
# - The 10 majors with the lowest median earnings are all predominantly female
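# For anyone who wants to re-derive the predominantly male / predominantly female counts above, here is a minimal sketch. It assumes "predominantly" means a ShareWomen value below or above 0.5, which is my assumption and not something the summary states. The `count_by_majority` helper and the `toy_majors` frame are hypothetical, with made-up values; on the real data you would pass `recent_grads`.

```python
import pandas as pd


def count_by_majority(df, share_col="ShareWomen", threshold=0.5):
    """Count rows whose share of women is below or above the threshold.

    Rows with a missing share, or a share exactly equal to the
    threshold, are counted in neither group.
    """
    share = df[share_col].dropna()
    return {
        "predominantly_male": int((share < threshold).sum()),
        "predominantly_female": int((share > threshold).sum()),
    }


# Made-up stand-in for recent_grads: six fictional majors.
toy_majors = pd.DataFrame({
    "Major": ["A", "B", "C", "D", "E", "F"],
    "ShareWomen": [0.12, 0.45, 0.50, 0.77, 0.91, None],
})

counts = count_by_majority(toy_majors)
print(counts)
```

# The same helper can be pointed at a subset, for example `recent_grads.head(10)`, to check how many of the ten highest-earning majors are predominantly female.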