Visualizing Earnings Based on College Majors
In this project, we'll be working with recent-grads.csv, a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by the American Community Survey, which conducts surveys and aggregates the data. FiveThirtyEight cleaned the dataset and released it on their GitHub repo.
Each row in the dataset represents a different college major and contains information on gender diversity, employment, and earnings.
Headers for recent-grads.csv are shown below:
Header | Description |
---|---|
Rank | Rank by median earnings (the dataset is ordered by this column) |
Major_code | Major code, FOD1P in ACS PUMS |
Major | Major description |
Major_category | Category of major from Carnevale et al |
Total | Total number of people with major |
Sample_size | Sample size (unweighted) of full-time, year-round ONLY (used for earnings) |
Men | Male graduates |
Women | Female graduates |
ShareWomen | Women as share of total |
Employed | Number employed (ESR == 1 or 2) |
Full_time | Employed 35 hours or more |
Part_time | Employed less than 35 hours |
Full_time_year_round | Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35) |
Unemployed | Number unemployed (ESR == 3) |
Unemployment_rate | Unemployed / (Unemployed + Employed) |
Median | Median earnings of full-time, year-round workers |
P25th | 25th percentile of earnings |
P75th | 75th percentile of earnings |
College_jobs | Number with job requiring a college degree |
Non_college_jobs | Number with job not requiring a college degree |
Low_wage_jobs | Number in low-wage service jobs |
Using visualizations, we can start to explore interesting questions, such as:
Do students in more popular majors make more money? Using scatter plots
How many majors are predominantly male? Predominantly female? Using histograms
Which category of majors has the most students? Using bar plots
AND MANY MORE!
Before we start creating data visualizations, let's import the libraries we need, explore the dataset, and remove rows containing null values.
Setting Up The Environment
Importing libraries, controlling figure aesthetics, and running the Jupyter magic %matplotlib inline
#importing libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
#controlling figure aesthetics: set font size, color palette, and style using seaborn
sns.set(font_scale=1.3)
sns.set_palette("husl")
sns.set_style("white")
#running some magic ;D
%matplotlib inline
#supernecessary --> so that Jupyter displays all the plots inline.
recent_grads = pd.read_csv('recent-grads.csv')
recent_grads.iloc[:1] #return the first row formatted as a table.
Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 rows × 21 columns
recent_grads.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Rank                  173 non-null    int64
 1   Major_code            173 non-null    int64
 2   Major                 173 non-null    object
 3   Total                 172 non-null    float64
 4   Men                   172 non-null    float64
 5   Women                 172 non-null    float64
 6   Major_category        173 non-null    object
 7   ShareWomen            172 non-null    float64
 8   Sample_size           173 non-null    int64
 9   Employed              173 non-null    int64
 10  Full_time             173 non-null    int64
 11  Part_time             173 non-null    int64
 12  Full_time_year_round  173 non-null    int64
 13  Unemployed            173 non-null    int64
 14  Unemployment_rate     173 non-null    float64
 15  Median                173 non-null    int64
 16  P25th                 173 non-null    int64
 17  P75th                 173 non-null    int64
 18  College_jobs          173 non-null    int64
 19  Non_college_jobs      173 non-null    int64
 20  Low_wage_jobs         173 non-null    int64
dtypes: float64(5), int64(14), object(2)
memory usage: 28.5+ KB
recent_grads.head() #familiarize ourselves with how the data is structured.
Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 150 | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 5180 | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 |
5 rows × 21 columns
recent_grads.tail() #familiarize ourselves with how the data is structured.
Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
168 | 169 | 3609 | ZOOLOGY | 8409.0 | 3050.0 | 5359.0 | Biology & Life Science | 0.637293 | 47 | 6259 | ... | 2190 | 3602 | 304 | 0.046320 | 26000 | 20000 | 39000 | 2771 | 2947 | 743 |
169 | 170 | 5201 | EDUCATIONAL PSYCHOLOGY | 2854.0 | 522.0 | 2332.0 | Psychology & Social Work | 0.817099 | 7 | 2125 | ... | 572 | 1211 | 148 | 0.065112 | 25000 | 24000 | 34000 | 1488 | 615 | 82 |
170 | 171 | 5202 | CLINICAL PSYCHOLOGY | 2838.0 | 568.0 | 2270.0 | Psychology & Social Work | 0.799859 | 13 | 2101 | ... | 648 | 1293 | 368 | 0.149048 | 25000 | 25000 | 40000 | 986 | 870 | 622 |
171 | 172 | 5203 | COUNSELING PSYCHOLOGY | 4626.0 | 931.0 | 3695.0 | Psychology & Social Work | 0.798746 | 21 | 3777 | ... | 965 | 2738 | 214 | 0.053621 | 23400 | 19200 | 26000 | 2403 | 1245 | 308 |
172 | 173 | 3501 | LIBRARY SCIENCE | 1098.0 | 134.0 | 964.0 | Education | 0.877960 | 2 | 742 | ... | 237 | 410 | 87 | 0.104946 | 22000 | 20000 | 22000 | 288 | 338 | 192 |
5 rows × 21 columns
Just a quick note here. As mentioned in the introduction, the dataset is ordered by Rank (rank by median earnings)
So we can see that ...
Well... I'm not surprised... let's move on
recent_grads.describe() #summary statistics for all of the numeric columns
Rank | Major_code | Total | Men | Women | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 173.000000 | 173.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
mean | 87.000000 | 3879.815029 | 39370.081395 | 16723.406977 | 22646.674419 | 0.522223 | 356.080925 | 31192.763006 | 26029.306358 | 8832.398844 | 19694.427746 | 2416.329480 | 0.068191 | 40151.445087 | 29501.445087 | 51494.219653 | 12322.635838 | 13284.497110 | 3859.017341 |
std | 50.084928 | 1687.753140 | 63483.491009 | 28122.433474 | 41057.330740 | 0.231205 | 618.361022 | 50675.002241 | 42869.655092 | 14648.179473 | 33160.941514 | 4112.803148 | 0.030331 | 11470.181802 | 9166.005235 | 14906.279740 | 21299.868863 | 23789.655363 | 6944.998579 |
min | 1.000000 | 1100.000000 | 124.000000 | 119.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 111.000000 | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.000000 | 2403.000000 | 4549.750000 | 2177.500000 | 1778.250000 | 0.336026 | 39.000000 | 3608.000000 | 3154.000000 | 1030.000000 | 2453.000000 | 304.000000 | 0.050306 | 33000.000000 | 24000.000000 | 42000.000000 | 1675.000000 | 1591.000000 | 340.000000 |
50% | 87.000000 | 3608.000000 | 15104.000000 | 5434.000000 | 8386.500000 | 0.534024 | 130.000000 | 11797.000000 | 10048.000000 | 3299.000000 | 7413.000000 | 893.000000 | 0.067961 | 36000.000000 | 27000.000000 | 47000.000000 | 4390.000000 | 4595.000000 | 1231.000000 |
75% | 130.000000 | 5503.000000 | 38909.750000 | 14631.000000 | 22553.750000 | 0.703299 | 338.000000 | 31433.000000 | 25147.000000 | 9948.000000 | 16891.000000 | 2393.000000 | 0.087557 | 45000.000000 | 33000.000000 | 60000.000000 | 14444.000000 | 11783.000000 | 3466.000000 |
max | 173.000000 | 6403.000000 | 393735.000000 | 173809.000000 | 307087.000000 | 0.968954 | 4212.000000 | 307933.000000 | 251540.000000 | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
Cleaning: Dropping Rows With Missing Values
We need to drop rows with missing values because Matplotlib expects the columns of values we pass in to have matching lengths, and missing values will cause Matplotlib to throw errors.
raw_data_count = len(recent_grads.index)
raw_data_count
#the following code will return the same result
#a = recent_grads.shape
#a[0]
173
Note: We could have used the .shape property, but there is a small performance difference: len(DataFrame.index) is faster.
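If you want to check the note above on your own machine, here is a minimal sketch. It uses a made-up toy frame (an assumption, not the real dataset) and the standard-library timeit module. Timings depend on the machine and the pandas version, so no numbers are quoted here.

```python
import timeit

import pandas as pd

# Toy frame standing in for recent_grads (the values are made up).
df = pd.DataFrame({"Total": range(173), "Median": range(173)})

# Both expressions return the number of rows.
rows_via_len = len(df.index)
rows_via_shape = df.shape[0]
print(rows_via_len, rows_via_shape)

# Time each expression 10,000 times; compare the two totals yourself.
t_len = timeit.timeit(lambda: len(df.index), number=10_000)
t_shape = timeit.timeit(lambda: df.shape[0], number=10_000)
print(f"len(df.index): {t_len:.4f}s | df.shape[0]: {t_shape:.4f}s")
```

Either way the difference is tiny for a dataset of this size, so readability is a fine reason to pick one over the other.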
#drop rows containing missing values
recent_grads = recent_grads.dropna()
#check the number of rows of the cleaned DataFrame
cleaned_data_count = len(recent_grads.index)
cleaned_data_count
172
#comparing
print(f'raw_data_count: {raw_data_count} | cleaned_data_count: {cleaned_data_count}')
raw_data_count: 173 | cleaned_data_count: 172
As we can see from the comparison result above, the initial DataFrame has 1 row with missing values, and we managed to drop it.
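Before dropping rows, it can be useful to look at which rows actually contain the missing values. The sketch below shows one way to do that with isna() and any(axis=1). The toy frame and its values are made up for illustration; on the real data the same filter could be applied to the DataFrame right after pd.read_csv().

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (made-up data, not the real dataset).
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, np.nan, 250.0],
    "Median": [50000, 40000, 30000],
})

# Keep only the rows where at least one column is null.
rows_with_nulls = toy[toy.isna().any(axis=1)]
print(rows_with_nulls)

# dropna() removes exactly those rows.
cleaned = toy.dropna()
print(len(toy.index), len(cleaned.index))
```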
Now we are ready to create some amazing plots! Let's get it!
Scatterplot: Exploring The Relationship Between Some Columns
Using sns.scatterplot()
Let's generate scatter plots to explore the following relations:
Sample_size and Median
Sample_size and Unemployment_rate
Full_time and Median
ShareWomen and Unemployment_rate
Men and Median
Women and Median
And then, let's answer the following questions:
Do students in more popular majors make more money?
Do students that majored in subjects that were majority female make more money?
Is there any link between the number of full-time employees and median salary?
#Generate scatterplots in one go using a for loop
x_val = ['Sample_size','Sample_size', 'Full_time', 'ShareWomen', 'Men', 'Women']
y_val = ['Median','Unemployment_rate', 'Median', 'Unemployment_rate', 'Median', 'Median']
fig = plt.figure(figsize=(25, 30))
for sp in range(len(x_val)):
    ax = fig.add_subplot(3, 2, sp + 1)
    #legend=False hides the hue legend (this replaces the plt.legend('') workaround)
    sns.scatterplot(data=recent_grads, x=x_val[sp], y=y_val[sp], hue=y_val[sp], legend=False, ax=ax)
    ax.set_title(y_val[sp] + ' vs ' + x_val[sp], weight='bold', fontsize=15)
    sns.despine(left=True, bottom=True)
plt.show()
Based on the plots above, none of the pairs shows a clear relationship between the x and y variables:
On all the plots except 'Unemployment_rate vs ShareWomen', the y-values vary widely where x is close to 0, and the spread narrows as x increases.
On 'Unemployment_rate vs ShareWomen', the dots are spread out with no visible pattern. The spread is roughly the same across the whole range, which suggests that ShareWomen and Unemployment_rate are not related.
So, using the Median vs Full_time plot above, we managed to answer one of the questions: is there any link between the number of full-time employees and median salary?
Full_time: Number employed 35 hours or more
Median: Median salary of full-time, year-round workers
The answer is no. There is no visible link between the two.
* As noted above, none of the plots shows a clear relationship between the x and y variables (see the plot for details).
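So... ONE down! TWO to go! We still have two unanswered questions, which are:
Do students in more popular majors make more money?
Do students that majored in subjects that were majority female make more money?
So let's make a few more plots to find the answers.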
Total: Total number of people with major
Median: Median salary of full-time, year-round workers
fig, ax =plt.subplots(1,1, figsize =(10,5))
sns.scatterplot(data = recent_grads, x='Total', y='Median', hue ='Median')
sns.despine(left=True, bottom=True)
plt.title('Median vs Total', weight='bold').set_fontsize('15')
plt.legend(bbox_to_anchor=(1.55, 0.5), title= 'Median')
plt.show()
Based on the plot, the answer is no. There is no visible correlation between Median and Total.
Median varies widely where Total is close to 0, and the spread narrows as Total increases.
In other words, there is no visible correlation between the popularity of a major and money.
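A scatter plot only gives a visual impression, so a quick numeric check can help. The sketch below computes two correlation coefficients on made-up random numbers (the toy frame is an assumption, not the real dataset): Pearson's r through Series.corr(), and a rank-based variant obtained by correlating the ranks, which is less sensitive to a few very large majors. On the real data the same calls could be run on recent_grads["Total"] and recent_grads["Median"].

```python
import numpy as np
import pandas as pd

# Made-up data with the same column names as the real dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Total": rng.integers(100, 400_000, size=172),
    "Median": rng.integers(22_000, 110_000, size=172),
})

# Pearson correlation: measures linear association, between -1 and 1.
pearson = toy["Total"].corr(toy["Median"])

# Rank-based correlation: Pearson on the ranks, robust to extreme values.
rank_based = toy["Total"].rank().corr(toy["Median"].rank())

print(f"pearson: {pearson:.3f} | rank-based: {rank_based:.3f}")
```

Values close to 0 would back up the "no correlation" reading, while values close to -1 or 1 would contradict it.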
ShareWomen: Women as share of total
Median: Median salary of full-time, year-round workers
fig, ax =plt.subplots(1,1, figsize =(10,5))
sns.scatterplot(data = recent_grads, x='ShareWomen', y='Median', hue ='Median')
sns.despine(left=True, bottom=True)
plt.title('Median vs. ShareWomen', weight='bold').set_fontsize('15')
plt.legend(bbox_to_anchor=(1.55, 0.5), title= 'Median')
plt.show()
Based on the plot, there is a weak negative correlation between Median and ShareWomen: as ShareWomen increases, Median tends to decrease.
In other words, students who majored in majority-female subjects tend to make less money.
But really? Hmm we need more evidence. Let's analyze!
We need numbers!
#Analyze ShareWomen and Median using groupby()
median_sharewomen = recent_grads.groupby(["ShareWomen"])["Median"].mean().sort_values(ascending=False)
print(median_sharewomen)
ShareWomen
0.120564    110000
0.101852     75000
0.153037     73000
0.107313     70000
0.144967     65000
             ...
0.637293     26000
0.799859     25000
0.817099     25000
0.798746     23400
0.877960     22000
Name: Median, Length: 172, dtype: int64
Alright, so the result suggests that majors with low ShareWomen values tend to have higher median earnings, and vice versa.
However, to be more confident in our analysis, let's group ShareWomen into 3 bins and see what the average Median of each group is.
Are we still getting a negative correlation? Let's find out.
#splitting ShareWomen into 3 groups
sharewomen_grouped = recent_grads["ShareWomen"].value_counts(bins = 3).sort_index(ascending= False)
sharewomen_grouped
(0.646, 0.969]                     60
(0.323, 0.646]                     71
(-0.0019690000000000003, 0.323]    41
Name: ShareWomen, dtype: int64
#using sns.barplot
#we are creating a barplot with 3 bins
bins = [-0.0019690000000000003, 0.323, 0.646, 0.969] #using the bins from the previous result
share_bins = pd.cut(recent_grads["ShareWomen"], bins)
#keep the bins in their natural (ascending) order so that each bar matches its label
median_sharewomen_grouped = recent_grads.groupby(share_bins)["Median"].mean()
print(median_sharewomen_grouped)
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
sns.barplot(x=median_sharewomen_grouped.index.astype(str), y=median_sharewomen_grouped.values, ci=None, ax=ax)
ax.set_xlabel('ShareWomen')
ax.set_ylabel('Median')
ax.set_title('Median vs ShareWomen (bins = 3)', weight='bold', fontsize=15)
sns.despine(left=True, bottom=True)
ShareWomen
(-0.00197, 0.323]    50846.341463
(0.323, 0.646]       39695.774648
(0.646, 0.969]       33168.333333
Name: Median, dtype: float64
It's true! Students who majored in majority-female subjects make less money on average.
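Typing bin edges by hand works, but pd.cut() can also build equal-width bins on its own when it is given a number of bins. Here is a minimal sketch of that idea on a small made-up frame (the values are assumptions for illustration only). The exact edge values pd.cut() picks may differ slightly from hand-typed ones.

```python
import pandas as pd

# Made-up values, only the column names match the real dataset.
toy = pd.DataFrame({
    "ShareWomen": [0.05, 0.10, 0.30, 0.45, 0.60, 0.75, 0.90, 0.95],
    "Median": [70000, 65000, 50000, 45000, 40000, 35000, 30000, 28000],
})

# bins=3 asks pandas for three equal-width intervals over the data range.
share_bins = pd.cut(toy["ShareWomen"], bins=3)

# Average Median per interval, listed from the lowest bin to the highest.
median_by_bin = toy.groupby(share_bins, observed=False)["Median"].mean()
print(median_by_bin)
```

Because the result keeps the intervals in ascending order, it can be passed to a bar plot without any extra sorting.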
So... just to recap:
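- "Do students in more popular majors make more money?"
Answer: No. There is no correlation between the popularity of the major and money.
- "Do students that majored in subjects that were majority female make more money?"
Answer: The opposite is true. Students who majored in majority-female subjects make less money on average.
- "Is there any link between the number of full-time employees and median salary?"
Answer: No. There is no visible link between the two.
We answered these questions with the scatter plots of Median against Total, ShareWomen, and Full_time.
We're done with scatterplots, next we will create histograms!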
Histograms: Exploring The Distribution of Values In A Column
Using sns.histplot
Let's generate histograms to explore the distributions of the following columns:
* Note: We're going to use Series.describe() to understand the data distribution of each column
Sample_size: Sample size (unweighted) of full-time, year-round ONLY (used for earnings)
Median: Median salary of full-time, year-round workers
Employed: Number employed
Full_time: Number employed 35 hours or more
ShareWomen: Women as share of total
Unemployment_rate: Percent of labor force that is jobless
Men: Male graduates
Women: Female graduates
#Generate histograms in one go using a for loop
#we use 8 bins to simplify the visualization
cols = ["Sample_size", "Median", "Employed", "Full_time", "ShareWomen", "Unemployment_rate", "Men", "Women"]
for col in cols:
    #plt.subplots(0, 8) asks for a grid with zero rows, which recent Matplotlib versions reject
    fig, ax = plt.subplots(figsize=(10, 5))
    sns.histplot(data=recent_grads, x=col, bins=8, ax=ax)
    sns.despine(left=True, bottom=True)
    ax.set_title(col, weight='bold', fontsize=16)
    plt.show()
    print('--------------------------------')
    print(recent_grads[col].describe())
    print('--------------------------------')
--------------------------------
count     172.000000
mean      357.941860
std       619.680419
min         2.000000
25%        42.000000
50%       131.000000
75%       339.000000
max      4212.000000
Name: Sample_size, dtype: float64
--------------------------------
--------------------------------
count       172.000000
mean      40076.744186
std       11461.388773
min       22000.000000
25%       33000.000000
50%       36000.000000
75%       45000.000000
max      110000.000000
Name: Median, dtype: float64
--------------------------------
--------------------------------
count       172.00000
mean      31355.80814
std       50777.42865
min           0.00000
25%        3734.75000
50%       12031.50000
75%       31701.25000
max      307933.00000
Name: Employed, dtype: float64
--------------------------------
--------------------------------
count       172.000000
mean      26165.767442
std       42957.122320
min         111.000000
25%        3181.000000
50%       10073.500000
75%       25447.250000
max      251540.000000
Name: Full_time, dtype: float64
--------------------------------
--------------------------------
count    172.000000
mean       0.522223
std        0.231205
min        0.000000
25%        0.336026
50%        0.534024
75%        0.703299
max        0.968954
Name: ShareWomen, dtype: float64
--------------------------------
--------------------------------
count    172.000000
mean       0.068024
std        0.030340
min        0.000000
25%        0.050261
50%        0.067544
75%        0.087247
max        0.177226
Name: Unemployment_rate, dtype: float64
--------------------------------
--------------------------------
count       172.000000
mean      16723.406977
std       28122.433474
min         119.000000
25%        2177.500000
50%        5434.000000
75%       14631.000000
max      173809.000000
Name: Men, dtype: float64
--------------------------------
--------------------------------
count       172.000000
mean      22646.674419
std       41057.330740
min           0.000000
25%        1778.250000
50%        8386.500000
75%       22553.750000
max      307087.000000
Name: Women, dtype: float64
--------------------------------
Cool, using histograms we are able to comfortably visualize the data distribution. But what are the takeaways?
*Note: A histogram and Series.describe() will naturally show slightly different numbers, because a histogram groups the values into bins. What we are trying to demonstrate here is how easily a histogram lets us visualize a data distribution with decent accuracy.
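To see why the two views differ, it helps to look at the numbers behind a histogram. The sketch below uses np.histogram() on a short made-up series (the values are assumptions, not the real column) to print the bin edges and the count in each bin, next to the describe() summary of the same values.

```python
import numpy as np
import pandas as pd

# Made-up, heavily right-skewed values standing in for a column like Sample_size.
values = pd.Series([2, 7, 16, 36, 47, 130, 289, 425, 2584, 4212], name="Sample_size")

# Eight equal-width bins between the minimum and the maximum.
counts, edges = np.histogram(values, bins=8)

# Print each bin as an interval with the number of values it holds.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:8.2f}, {right:8.2f}): {count}")

# The summary statistics describe the same values without any binning.
print(values.describe())
```

With skewed data like this, most values land in the first bin, which is the same pattern a histogram bar chart shows as one tall bar on the left.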
Sample_size, Employed, ShareWomen, and Women
A few quirky things from the histograms and Series.describe() caught my attention. Let's dive right into it.
Sample_size: Sample size (unweighted) of full-time, year-round ONLY (used for earnings)
As previously mentioned, the sample size seems to be really low.
Let's explore this column even more, and then find the percentage of sample size by calculating Sample_size / Full_time_year_round.
#exploring rows that have the most common sample size values --> Sample_size: (0, 500)
common_samplesize = recent_grads[recent_grads["Sample_size"].between(0, 500)]
common_samplesize.sort_values(by='Sample_size', ascending=False)
Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
97 | 98 | 5098 | MULTI-DISCIPLINARY OR GENERAL SCIENCE | 62052.0 | 27015.0 | 35037.0 | Physical Sciences | 0.564639 | 427 | 46138 | ... | 13133 | 28966 | 2727 | 0.055807 | 35000 | 24000 | 50000 | 17923 | 22039 | 5751 |
17 | 18 | 2400 | GENERAL ENGINEERING | 61152.0 | 45683.0 | 15469.0 | Engineering | 0.252960 | 425 | 44931 | ... | 7199 | 33540 | 2859 | 0.059824 | 56000 | 36000 | 69000 | 26898 | 11734 | 3192 |
42 | 43 | 2100 | COMPUTER AND INFORMATION SYSTEMS | 36698.0 | 27392.0 | 9306.0 | Computers & Mathematics | 0.253583 | 425 | 28459 | ... | 4332 | 21130 | 2934 | 0.093460 | 45000 | 30000 | 60000 | 13344 | 11783 | 1672 |
146 | 147 | 6002 | MUSIC | 60633.0 | 29909.0 | 30724.0 | Arts | 0.506721 | 419 | 47662 | ... | 24943 | 21425 | 3918 | 0.075960 | 31000 | 22300 | 42000 | 13752 | 28786 | 9286 |
10 | 11 | 2407 | COMPUTER ENGINEERING | 41542.0 | 33258.0 | 8284.0 | Engineering | 0.199413 | 399 | 32506 | ... | 5146 | 23621 | 2275 | 0.065409 | 60000 | 45000 | 75000 | 23694 | 5721 | 980 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
73 | 74 | 3801 | MILITARY TECHNOLOGIES | 124.0 | 124.0 | 0.0 | Industrial Arts & Consumer Services | 0.000000 | 4 | 0 | ... | 0 | 111 | 0 | 0.000000 | 40000 | 40000 | 40000 | 0 | 0 | 0 |
55 | 56 | 2303 | SCHOOL STUDENT COUNSELING | 818.0 | 119.0 | 699.0 | Education | 0.854523 | 4 | 730 | ... | 135 | 545 | 88 | 0.107579 | 41000 | 41000 | 43000 | 509 | 221 | 0 |
48 | 49 | 3607 | PHARMACOLOGY | 1762.0 | 515.0 | 1247.0 | Biology & Life Science | 0.707719 | 3 | 1144 | ... | 532 | 565 | 107 | 0.085532 | 45000 | 40000 | 45000 | 603 | 478 | 93 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
172 | 173 | 3501 | LIBRARY SCIENCE | 1098.0 | 134.0 | 964.0 | Education | 0.877960 | 2 | 742 | ... | 237 | 410 | 87 | 0.104946 | 22000 | 20000 | 22000 | 288 | 338 | 192 |
141 rows × 21 columns
Ok, there are a few things to note here:
- From the data dictionary, we know that Sample_size = Sample size (unweighted) of full-time, year-round ONLY (used for earnings).
  - The Sample_size column gives the number of sampled records drawn from Full_time_year_round, which are used to determine the earnings in the Median column.
  - In other words, the Sample_size column reflects the number of people from Full_time_year_round that reported their earnings.
- The problem is, Sample_size values are very low for the majority of the rows in this dataset.
  - From recent_grads["Sample_size"].describe() (under the histogram in the above section), count = 172.
  - common_samplesize has 141 rows. This indicates that 141 / 172 = .82 = 82% of the data has a sample size between 0 and 500.
- If we look at the last row in the sorted common_samplesize dataframe above (row index 172), apparently for the "LIBRARY SCIENCE" major, out of 410 Full_time_year_round people, only 2 are in the sample.
  - Sample_size / Full_time_year_round = 2 / 410 ≈ .005 = .5%
- I don't have an extensive statistical background, but I'm pretty sure a .5% sample is too low to provide us any meaningful result.
*Based on this article here, a good sample size number is around 10%
Alright alright alright enough of this sample size rabbit hole! Let's just do this one thing: Calculate the sample size percentage of this dataset
#just some math
sample_dataset = recent_grads["Sample_size"] / recent_grads["Full_time_year_round"]
sample_dataset.describe()
count    172.000000
mean       0.017747
std        0.004970
min        0.004162
25%        0.015208
50%        0.017751
75%        0.020870
max        0.036036
dtype: float64
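So based on sample_dataset.describe(), we know the following: on average, the sample covers only about 1.8% of the Full_time_year_round count, and the share ranges from about 0.4% to about 3.6%.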
Ok, not sure what's going on with this dataset. The sample size is SUPER LOW. This might have an effect on the other columns, or at least the median column, or... whatever, let's just roll with it. *shoulder shrug*
NEXT!
Employed: Number employed
So earlier we figured that most majors have a really low number of employed graduates, which is not surprising.
However, what surprises me is that there is a major that has 0 people employed. I wonder which major(s). Let's see...
#checking major(s) with 0 people employed
recent_grads[recent_grads["Employed"] == 0]
Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
73 | 74 | 3801 | MILITARY TECHNOLOGIES | 124.0 | 124.0 | 0.0 | Industrial Arts & Consumer Services | 0.0 | 4 | 0 | ... | 0 | 111 | 0 | 0.0 | 40000 | 40000 | 40000 | 0 | 0 | 0 |
1 rows × 21 columns
... and the major is "MILITARY TECHNOLOGIES", so only 1 major? Hmmm
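BUT WAIT! Let's take another look at the recent_grads[recent_grads["Employed"] == 0] output above. There are a few quirks: Employed is 0, yet Full_time_year_round is 111, and the row still reports a Sample_size of 4 and a Median of 40000.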
*gasping for breath*
I don't know what just happened, but suddenly I got dragged into another rabbit hole, and stumbled upon this
TL;DR There is something peculiar about this dataset.
So again, I am not sure what is going on here. But there is nothing we can do, so let's keep working with the dataset that we have.
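One way to hunt for this kind of quirk more systematically is a simple consistency check: nobody can work full-time, year-round without being employed, so Full_time_year_round should never exceed Employed. The sketch below applies that rule to a made-up toy frame (the major names and some values are hypothetical). It is only a sanity check under that assumption, not an explanation of how the dataset was built.

```python
import pandas as pd

# Made-up rows, with one row that breaks the rule on purpose.
toy = pd.DataFrame({
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C"],
    "Employed": [640, 0, 1976],
    "Full_time_year_round": [388, 111, 1207],
})

# Flag rows where the full-time, year-round count is larger than the employed count.
inconsistent = toy[toy["Full_time_year_round"] > toy["Employed"]]
print(inconsistent)
```

On the real data, the same filter could be run on recent_grads to see whether any other major shows the same pattern.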
...Moving on...
ShareWomen: Women as share of total
We are going to answer one of the questions that we have in the introduction, which is
"How many majors are predominantly male?"
#create a new column (initialized with None so that it can hold text labels)
recent_grads["gender_majority"] = None
#add values to the new column
recent_grads.loc[recent_grads["ShareWomen"] > .5, "gender_majority"] = "Female"
recent_grads.loc[recent_grads["ShareWomen"] < .5, "gender_majority"] = "Male"
#display
recent_grads
Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | gender_majority | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 | Male |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 | Male |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 | Male |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 | Male |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 | Male |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
168 | 169 | 3609 | ZOOLOGY | 8409.0 | 3050.0 | 5359.0 | Biology & Life Science | 0.637293 | 47 | 6259 | ... | 3602 | 304 | 0.046320 | 26000 | 20000 | 39000 | 2771 | 2947 | 743 | Female |
169 | 170 | 5201 | EDUCATIONAL PSYCHOLOGY | 2854.0 | 522.0 | 2332.0 | Psychology & Social Work | 0.817099 | 7 | 2125 | ... | 1211 | 148 | 0.065112 | 25000 | 24000 | 34000 | 1488 | 615 | 82 | Female |
170 | 171 | 5202 | CLINICAL PSYCHOLOGY | 2838.0 | 568.0 | 2270.0 | Psychology & Social Work | 0.799859 | 13 | 2101 | ... | 1293 | 368 | 0.149048 | 25000 | 25000 | 40000 | 986 | 870 | 622 | Female |
171 | 172 | 5203 | COUNSELING PSYCHOLOGY | 4626.0 | 931.0 | 3695.0 | Psychology & Social Work | 0.798746 | 21 | 3777 | ... | 2738 | 214 | 0.053621 | 23400 | 19200 | 26000 | 2403 | 1245 | 308 | Female |
172 | 173 | 3501 | LIBRARY SCIENCE | 1098.0 | 134.0 | 964.0 | Education | 0.877960 | 2 | 742 | ... | 410 | 87 | 0.104946 | 22000 | 20000 | 22000 | 288 | 338 | 192 | Female |
172 rows × 22 columns
#using histplot
#we are creating a histogram to answer the question, "How many majors are predominantly male?"
fig, ax = plt.subplots(figsize=(10, 5))
#differentiate color b/w gender_majority using hue
sns.histplot(data=recent_grads, x="gender_majority", hue="gender_majority", legend=False, ax=ax)
sns.despine(left=True, bottom=True)
ax.set_title("gender_majority", weight='bold', fontsize=16)
#display the count on top of each bar
for patch in ax.patches:
    if patch.get_height() > 0:
        ax.annotate(f"{int(patch.get_height())}", (patch.get_x() + patch.get_width() / 2, patch.get_height()), ha='center', va='bottom')
plt.show()
As we can see, 96 majors are predominantly Female, and 76 majors are predominantly Male. Cool...
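The same counts can also be read off without a plot. Here is a minimal sketch on a made-up toy column (the values are assumptions): it turns the ShareWomen > 0.5 comparison into labels and counts them with value_counts(). Note that this shortcut labels a major with exactly 0.5 as "Male", so it only matches the two-step labelling when no major sits exactly at 0.5.

```python
import pandas as pd

# Made-up shares, only the column name matches the real dataset.
toy = pd.DataFrame({"ShareWomen": [0.12, 0.34, 0.51, 0.80, 0.88]})

# True -> "Female", False -> "Male", then count each label.
labels = (toy["ShareWomen"] > 0.5).map({True: "Female", False: "Male"})
majority_counts = labels.value_counts()
print(majority_counts)
```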
Women: Female graduates
Based on Series.describe(), there is a major with 0 female graduates. I wonder what that is...
recent_grads[recent_grads["Women"] == 0]
Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | gender_majority | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
73 | 74 | 3801 | MILITARY TECHNOLOGIES | 124.0 | 124.0 | 0.0 | Industrial Arts & Consumer Services | 0.0 | 4 | 0 | ... | 111 | 0 | 0.0 | 40000 | 40000 | 40000 | 0 | 0 | 0 | Male |
1 rows × 22 columns
Ahh.. this one again, "MILITARY TECHNOLOGIES"...
We just had this one earlier when deep diving on Employed. I don't know, this major has weird numbers.
We'll just leave it at that. Next please!
And of course, before we move on, let's do a little recap:
In the last 2 sections, we created the following:
Based on our scatter plots, we find that:
- "Do students in more popular majors make more money?"
Answer: No. There is no correlation between the popularity of the major and money.
- "Do students that majored in subjects that were majority female make more money?"
Answer: The opposite is true. The students that majored in subjects that were female majority on average make less money.
(we deep dived on this by creating a bar plot with 3 bins)
- "Is there any link between the number of full-time employees and median salary?"
Answer: No. There is no visible link between the two.
We also did further analysis on Sample_size, Employed, ShareWomen, and Women.
- We answered the question, "How many majors are predominantly male?" by creating a histogram when deep diving on ShareWomen.
Ready for more? ;D
Scatter Matrix (Pair Plot) : Exploring Potential Relationships And Distributions Simultaneously
In this section, we will create a scatter matrix (pair plot), which is a plot that combines both scatter plots and histograms into one grid of plots and allows us to explore potential relationships and distributions simultaneously.
* Note: A scatter matrix plot consists of n by n plots on a grid, where n is the number of columns, the plots on the diagonal are histograms, and the non-diagonal plots are scatter plots.
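To make the n by n layout concrete, here is a minimal sketch that uses pandas.plotting.scatter_matrix, which is swapped in here for illustration because it returns the grid of axes directly. The toy frame holds made-up random numbers, and the "Agg" backend line is only there so the sketch runs without a display (drop it inside a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; remove this line in a notebook

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Made-up random data with three columns, so n = 3.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(size=(50, 3)),
    columns=["Sample_size", "Median", "Unemployment_rate"],
)

# Histograms on the diagonal, scatter plots everywhere else.
axes = scatter_matrix(toy, diagonal="hist", figsize=(6, 6))

# The result is an n by n array of axes, one per column pair.
print(axes.shape)
```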
Using sns.pairplot()
Let's generate scatter matrix plots (pair plots) to explore the relationships and distributions of the following columns:
Sample_size and Median
Sample_size, Median, and Unemployment_rate
Then, we will try to create a few more plots to explore the answer to the following questions:
Do students in more popular majors make more money?
Do students that majored in subjects that were majority female make more money?
Is there any link between the number of full-time employees and median salary?
*Note: We have already answered these questions using scatter plots in the earlier section, but let's explore them again while we familiarize ourselves with scatter matrices.
Sample_size: Sample size (unweighted) of full-time, year-round workers
Median: Median salary of full-time, year-round workers
Unemployment_rate: Percent of labor force that is jobless
pairs = [['Sample_size', 'Median'], ['Sample_size', 'Median', 'Unemployment_rate']]
for cols in pairs:
    pairplot = sns.pairplot(recent_grads[cols])
    pairplot.fig.set_size_inches(10, 10)
    for ax in pairplot.axes.flat:  # rotate x-axis labels
        ax.tick_params("x", labelrotation=45)
As we can see, there is no visible correlation among sample size, median salary, and unemployment rate.
*These scatter matrices confirm our findings from the previous sections.
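A numeric check can complement the visual one. The sketch below computes a Pearson correlation matrix with DataFrame.corr(); the toy dataframe is invented, so the printed coefficients say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for recent_grads; the values are made up.
toy = pd.DataFrame({
    "Sample_size": [10, 20, 30, 40, 50],
    "Median": [50000, 30000, 45000, 28000, 47000],
    "Unemployment_rate": [0.05, 0.09, 0.04, 0.08, 0.06],
})

# Pearson correlation is the default method; values near 0 suggest no linear relationship.
corr = toy.corr()
print(corr.round(2))
```

Running the same call on recent_grads[['Sample_size', 'Median', 'Unemployment_rate']] would give one coefficient for each off-diagonal panel of the pair plot.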
Now, let's use scatter matrix to validate our answers to the three questions that we have
To do that, we are going to create 3 scatter matrices (pair plots):
Total: Total number of people with major
Full_time: Number employed 35 hours or more
ShareWomen: Women as share of total
Median: Median salary of full-time, year-round workers
pairs = [['Total', 'Median'], ['ShareWomen', 'Median'], ['Full_time', 'Median']]
for cols in pairs:
    pairplot = sns.pairplot(recent_grads[cols])
    pairplot.fig.set_size_inches(10, 10)
    for ax in pairplot.axes.flat:  # rotate x-axis labels
        ax.tick_params("x", labelrotation=45)
Alright cool! The scatter matrices and the scatter plots give us the same answers. This helps us validate our answers from the previous sections.
And uh, actually, we are done with scatter matrices. So let's do another reflection...
In the last 3 sections, we created the following:
Based on our scatter plots, we find that:
- "Do students in more popular majors make more money?"
Answer: No. There is no correlation between the popularity of a major and earnings.
- "Do students that majored in subjects that were majority female make more money?"
Answer: The opposite is true. On average, students in majority-female majors make less money.
(we dug deeper into this by creating a bar plot with 3 bins)
- "Is there any link between the number of full-time employees and median salary?"
Answer: No. The two show no visible link.
Histograms to visualize the distributions of the following columns:
Sample_size, Median, Employed, Full_time, ShareWomen, Unemployment_rate, Men, and Women
We also did further analysis on Sample_size, Employed, ShareWomen, and Women
Scatter matrix plots to visually explore potential relationships and distributions of the following:
Using scatter matrix plots, we managed to validate our answers in the previous sections.
Next!
Bar Plots: Comparing Some Columns
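Using sns.barplot
Let's generate bar plots to do the following:
- Compare the percentages of women (ShareWomen) from the first ten rows and last ten rows of the recent_grads dataframe, while having major on the x-axis.
- Compare the unemployment rate (Unemployment_rate) from the first ten rows and last ten rows of the recent_grads dataframe, while having major on the x-axis.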
ShareWomen vs Major
fig, ax =plt.subplots(1,1, figsize =(10,5))
plt.xlabel('Major')
plt.ylabel('ShareWomen')
sns.barplot(x=recent_grads[:10]['Major'], y=recent_grads[:10]['ShareWomen'], ci=None)
ax.set_xticklabels(recent_grads[:10]['Major'], rotation='vertical')
sns.despine(left=True, bottom=True)
plt.title("ShareWomen vs Major", weight='bold').set_fontsize('16')
fig, ax =plt.subplots(1,1, figsize =(10,5))
plt.xlabel('Major')
plt.ylabel('ShareWomen')
sns.barplot(x=recent_grads[-10:]['Major'], y=recent_grads[-10:]['ShareWomen'], ci=None)
ax.set_xticklabels(recent_grads[-10:]['Major'], rotation='vertical')
sns.despine(left=True, bottom=True)
plt.title("ShareWomen vs Major", weight='bold').set_fontsize('16')
Remember that this dataset is ordered by median earnings?
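Because the rows are ordered by median earnings, the first ten rows are the highest-earning majors and the last ten are the lowest-earning ones. A rough numeric companion to the two bar plots is to compare the average ShareWomen at each end. The toy dataframe below is made up, and it uses 2 rows per end instead of 10.

```python
import pandas as pd

# Hypothetical stand-in for recent_grads, already ordered by Rank; values are invented.
toy = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5, 6],
    "Median": [110000, 75000, 60000, 40000, 30000, 22000],
    "ShareWomen": [0.12, 0.10, 0.30, 0.60, 0.80, 0.88],
}).sort_values("Rank")

n = 2  # the project uses 10 rows at each end
top_share = toy.head(n)["ShareWomen"].mean()
bottom_share = toy.tail(n)["ShareWomen"].mean()
print(f"top {n}: {top_share:.2f}, bottom {n}: {bottom_share:.2f}")
```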
Unemployment_rate vs Major
fig, ax =plt.subplots(1,1, figsize =(10,5))
plt.xlabel('Major')
plt.ylabel('Unemployment_rate')
sns.barplot(x=recent_grads[:10]['Major'], y=recent_grads[:10]['Unemployment_rate'], ci=None)
ax.set_xticklabels(recent_grads[:10]['Major'], rotation='vertical')
sns.despine(left=True, bottom=True)
plt.title("Unemployment_rate vs Major", weight='bold').set_fontsize('16')
fig, ax =plt.subplots(1,1, figsize =(10,5))
plt.xlabel('Major')
plt.ylabel('Unemployment_rate')
sns.barplot(x=recent_grads[-10:]['Major'], y=recent_grads[-10:]['Unemployment_rate'], ci=None)
ax.set_xticklabels(recent_grads[-10:]['Major'], rotation='vertical')
sns.despine(left=True, bottom=True)
plt.title("Unemployment_rate vs Major", weight='bold').set_fontsize('16')
Based on the bar plots, Unemployment_rate seems to vary from major to major without an obvious pattern by earnings rank.
This is almost the end of the project, so let's do another reflection on what we have done.
In the last 4 sections, we created the following:
Based on our scatter plots, we find that:
- "Do students in more popular majors make more money?"
Answer: No. There is no correlation between the popularity of a major and earnings.
- "Do students that majored in subjects that were majority female make more money?"
Answer: The opposite is true. On average, students in majority-female majors make less money.
(we dug deeper into this by creating a bar plot with 3 bins)
- "Is there any link between the number of full-time employees and median salary?"
Answer: No. The two show no visible link.
Histograms to visualize the distributions of the following columns:
Sample_size, Median, Employed, Full_time, ShareWomen, Unemployment_rate, Men, and Women
We also did further analysis on Sample_size, Employed, ShareWomen, and Women
Scatter matrix plots to visually explore potential relationships and distributions of the following:
Using scatter matrix plots, we managed to validate our answers in the previous sections.
Bar plots to compare the percentages of women (ShareWomen) and Unemployment_rate of the first and last ten rows of the recent_grads dataframe, with Major on the x-axis
As I said before, we are almost done!
In the next section, let's have some fun by doing further exploration and analysis!
Oooh this is going to be fun ;D
In this section, we are going to have fun by visually exploring and analyzing data using many different types of plots. So what are we waiting for?
Use a grouped bar plot to compare the number of men with the number of women in each category of majors.
#create a new column that shows the difference between male and female graduates in each major
#we will use this new column to sort the data
recent_grads["delta_graduates"] = recent_grads["Men"] - recent_grads["Women"]
#print
print(recent_grads.groupby(["Major_category"])["delta_graduates"].sum().sort_values())
Major_category
Education                             -352077.0
Health                                -312196.0
Psychology & Social Work              -284777.0
Humanities & Liberal Arts             -167776.0
Communications & Journalism           -128759.0
Arts                                   -88350.0
Biology & Life Science                 -84024.0
Industrial Arts & Consumer Services    -22230.0
Social Science                         -16298.0
Interdisciplinary                       -6662.0
Law & Public Policy                      3151.0
Agriculture & Natural Resources          5094.0
Physical Sciences                        5301.0
Business                                33328.0
Computers & Mathematics                118442.0
Engineering                            279031.0
Name: delta_graduates, dtype: float64
Note: delta_graduates is Men minus Women, so positive values mean there are more male graduates, and negative values mean there are more female graduates.
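The sign convention is easy to verify on a tiny example. In this sketch the dataframe is invented; only the column names and the Men minus Women definition come from the code above.

```python
import pandas as pd

# Hypothetical rows: one male-dominated and one female-dominated category.
toy = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Education", "Education"],
    "Men": [900, 700, 100, 200],
    "Women": [100, 300, 800, 600],
})

# Same definition as in the project: Men minus Women.
toy["delta_graduates"] = toy["Men"] - toy["Women"]

# Sum per category and sort so the most female-dominated category comes first.
delta_by_category = toy.groupby("Major_category")["delta_graduates"].sum().sort_values()
print(delta_by_category)
```

The category with more men ends up positive and the category with more women ends up negative, which is consistent with Education having the most negative value and Engineering the most positive one in the output above.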
# sum only the columns we need, so non-numeric columns are never aggregated
recent_grads.groupby('Major_category')[['Men', 'Women', 'delta_graduates']].sum().sort_values(
    by='delta_graduates', ascending=False).plot.barh(y=['Men', 'Women'], figsize=(20, 10))
sns.despine(left=True, bottom=True)
# barh draws horizontal bars: counts run along the x-axis, categories along the y-axis
plt.xlabel('Total')
plt.ylabel('Major_category')
plt.title("Distribution of Men and Women in Each Major Category", weight='bold').set_fontsize('16')
Top 3 major categories that are predominantly female: Education, Health, and Psychology & Social Work. (in order)
Top 3 major categories that are predominantly male: Engineering, Computers & Mathematics, and Business. (in order)
Use a box plot to explore the distributions of median salaries and unemployment rate.
sns.boxplot(data = recent_grads['Median'])
Alright, this verifies our findings in the histogram section, that the median salary is mostly around \$30,000 - \$40,000
sns.boxplot(data = recent_grads['Unemployment_rate'])
Similarly, this verifies our findings in the histogram section, that the most common unemployment rates are between 5.5% and 6.25%
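The numbers a box plot draws can also be computed directly. The box spans the first to the third quartile, the line inside it is the median, and with the default setting the whiskers reach to the most extreme points within 1.5 times the interquartile range. The salary values below are made up for illustration.

```python
import pandas as pd

# Hypothetical median salaries; not taken from recent_grads.
salaries = pd.Series([22000, 30000, 33000, 36000, 40000, 45000, 110000])

# Quartiles that define the box and its center line.
q1, median, q3 = salaries.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences are drawn as individual outlier markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = salaries[(salaries < lower_fence) | (salaries > upper_fence)]

print(q1, median, q3)
print(outliers.tolist())
```

Applying the same steps to recent_grads['Median'] or recent_grads['Unemployment_rate'] would give the exact ranges that are read off the plots by eye here.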
Use a hexagonal bin plot to visualize the columns that had dense scatter plots from earlier in the project:
Unemployment_rate vs ShareWomen
Median vs ShareWomen
plt.hexbin(recent_grads['ShareWomen'], recent_grads['Unemployment_rate'], gridsize=(15,15) )
plt.show()
Well, another validation: Unemployment_rate and ShareWomen have no visible correlation.
plt.hexbin(recent_grads['ShareWomen'], recent_grads['Median'], gridsize=(15,15) )
plt.show()
Yuuup, Median and ShareWomen have a weak negative correlation. Verified!
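A hexbin plot colors each cell by how many points fall into it, which is why it copes better with dense scatter plots. The counting idea can be sketched with NumPy, using rectangular bins instead of hexagons as a simplification. The data is randomly generated and only stands in for ShareWomen and Median.

```python
import numpy as np

# Synthetic stand-ins for the two columns; not the real data.
rng = np.random.default_rng(0)
share_women = rng.uniform(0, 1, size=200)
median = rng.normal(40000, 8000, size=200)

# Count points per cell on a 15 x 15 grid, mirroring gridsize=(15, 15) above.
counts, x_edges, y_edges = np.histogram2d(share_women, median, bins=(15, 15))

print(counts.shape)
print(int(counts.max()))  # the densest cell would get the strongest color
```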
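Use a bar plot to display the following:
- Top 5 majors with the highest 75th percentile of earnings (P75th)
- Top 5 majors with the most full-time, year-round employees (Full_time_year_round)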
major_highest_P75th = recent_grads.groupby(["Major"])["P75th"].mean().sort_values(ascending=False).head()
print(major_highest_P75th)
fig, ax =plt.subplots(1,1,figsize=(15,6))
plt.title("Top 5 majors with the highest 75th percentile of earnings", weight='bold').set_fontsize('16')
sns.barplot(x=major_highest_P75th.index, y=major_highest_P75th, data=recent_grads, ci=None).set_xticklabels(
labels=major_highest_P75th.index,rotation='vertical')
sns.despine(left=True, bottom=True)
Major
PETROLEUM ENGINEERING                                    125000
ASTRONOMY AND ASTROPHYSICS                               109000
METALLURGICAL ENGINEERING                                105000
NUCLEAR ENGINEERING                                      102000
PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTRATION       90000
Name: P75th, dtype: int64
Well, for any of you who are still in college and want to get rich, this is the answer. Choose any of these majors:
;D
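Since each row of the dataset is a single major, the groupby, mean, sort, and head chain can also be written with nlargest. The sketch below compares the two on a made-up dataframe with placeholder major names, assuming Major values are unique.

```python
import pandas as pd

# Hypothetical majors and earnings; one row per major, as in recent_grads.
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D", "E", "F"],
    "P75th": [90000, 125000, 60000, 105000, 45000, 109000],
})

# The approach used in the project, shortened to a top 3.
via_groupby = toy.groupby("Major")["P75th"].mean().sort_values(ascending=False).head(3)

# Alternative: index by Major and let nlargest do the sorting and slicing.
via_nlargest = toy.set_index("Major")["P75th"].nlargest(3)

print(via_nlargest)
```

Both give the same ranking when no major appears twice; the groupby version is the safer choice if duplicates are possible.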
major_most_fulltime = recent_grads.groupby(["Major"])["Full_time_year_round"].mean().sort_values(
ascending=False).head()
print(major_most_fulltime)
fig, ax =plt.subplots(1,1,figsize=(15,6))
plt.title("Top 5 majors with the most full time employees", weight='bold').set_fontsize('16')
sns.barplot(x=major_most_fulltime.index, y=major_most_fulltime, data=recent_grads, ci=None).set_xticklabels(
labels=major_most_fulltime.index,rotation='vertical')
sns.despine(left=True, bottom=True)
Major
BUSINESS MANAGEMENT AND ADMINISTRATION    199897
PSYCHOLOGY                                174438
GENERAL BUSINESS                          138299
MARKETING AND MARKETING RESEARCH          127230
ACCOUNTING                                123169
Name: Full_time_year_round, dtype: int64
And for any of you who want to get a full-time job easily, try getting into any of these majors:
;D
Based on our scatter plots, we find that:
- "Do students in more popular majors make more money?"
Answer: No. There is no correlation between the popularity of a major and earnings.
- "Do students that majored in subjects that were majority female make more money?"
Answer: The opposite is true. On average, students in majority-female majors make less money.
(we dug deeper into this by creating a bar plot with 3 bins)
- "Is there any link between the number of full-time employees and median salary?"
Answer: No. The two show no visible link.
Histograms to visualize the distributions of the following columns:
Sample_size, Median, Employed, Full_time, ShareWomen, Unemployment_rate, Men, and Women
We also did further analysis on Sample_size, Employed, ShareWomen, and Women
Scatter matrix plots to visually explore potential relationships and distributions of the following:
Using scatter matrix plots, we managed to validate our answers in the previous sections.
Bar plots to compare the percentages of women (ShareWomen) and Unemployment_rate of the first and last ten rows of the recent_grads dataframe, with Major on the x-axis
We did some fun stuff here. Here is what we did:
And that's a wrap!
Conclusion & Insights
Well, you know what they say, "All good things must come to an end."
This is the last section of the project. So before we actually end this, let's present our findings.
Q&A
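- "Do students in more popular majors make more money?"
Answer: No. There is no correlation between the popularity of a major and earnings.
- "Do students that majored in subjects that were majority female make more money?"
Answer: The opposite is true. On average, students in majority-female majors make less money.
- "Is there any link between the number of full-time employees and median salary?"
Answer: No. The two show no visible link.
- "How many majors are predominantly male? Predominantly female?"
Answer: 76 majors are predominantly male, and 96 majors are predominantly female.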
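A count of predominantly male and predominantly female majors can be obtained along these lines. The 0.5 cutoff on ShareWomen is an assumption about how "predominantly" is defined, and the toy dataframe is made up, so the counts printed here are not the project's 76 and 96.

```python
import pandas as pd

# Hypothetical majors; ShareWomen is the share of women among graduates.
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D", "E"],
    "ShareWomen": [0.12, 0.48, 0.51, 0.77, 0.90],
})

# Assumed rule: above 0.5 is predominantly female, below 0.5 is predominantly male.
mostly_female = int((toy["ShareWomen"] > 0.5).sum())
mostly_male = int((toy["ShareWomen"] < 0.5).sum())

print(mostly_male, mostly_female)
```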
Major statistics
Correlation
Extra