In this project, we will explore a dataset on the job outcomes of students in the USA who graduated from college between 2010 and 2012. The dataset was originally published by the American Community Survey, and subsequently cleaned by FiveThirtyEight, who released it on their GitHub repo.
Each row in the dataset represents a different major in college and contains information on gender diversity, employment rates, median salaries, and more. Here are some of the columns in the dataset:
Column | Description |
---|---|
Rank | Rank by median earnings (the dataset is ordered by this column) |
Major_code | Major code |
Major | Major description |
Major_category | Category of major |
Total | Total number of people with major |
Sample_size | Sample size (unweighted) of full-time, year-round ONLY (used for earnings) |
Men | Male graduates |
Women | Female graduates |
ShareWomen | Women as share of total |
Employed | Number employed |
Median | Median salary of full-time, year-round workers |
Low_wage_jobs | Number in low-wage service jobs |
Full_time | Number employed 35 hours or more |
Part_time | Number employed less than 35 hours |
To explore this data, we will create a variety of data visualisations including scatter plots, histograms, and bar charts. We will generate these plots using the matplotlib library.
# import libraries
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
# run Jupyter magic so that plots are displayed inline
%matplotlib inline
# set styles
plt.rcParams['axes.titlesize'] = 'x-large'
plt.rcParams['axes.spines.left'] = False
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.spines.bottom'] = False
# default_rc = dict(mpl.rcParams) # uncomment and run all to return styles to default
# function to remove ticks as this can't be set in rcParams
def remove_ticks():
    # get all axes in current figure
    axes = plt.gcf().get_axes()
    # iterate over each axes
    for ax in axes:
        # pass booleans: the string 'off' is truthy and would leave the ticks on
        ax.tick_params(top=False, bottom=False, left=False, right=False)
# read dataset into dataframe and return first row
recent_grads = pd.read_csv("recent-grads.csv")
recent_grads.iloc[0]
Rank                    1
Major_code              2419
Major                   PETROLEUM ENGINEERING
Total                   2339
Men                     2057
Women                   282
Major_category          Engineering
ShareWomen              0.120564
Sample_size             36
Employed                1976
Full_time               1849
Part_time               270
Full_time_year_round    1207
Unemployed              37
Unemployment_rate       0.0183805
Median                  110000
P25th                   95000
P75th                   125000
College_jobs            1534
Non_college_jobs        364
Low_wage_jobs           193
Name: 0, dtype: object
# explore head
recent_grads.head()
Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 150 | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 5180 | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 |
5 rows × 21 columns
# explore tail
recent_grads.tail()
Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
168 | 169 | 3609 | ZOOLOGY | 8409.0 | 3050.0 | 5359.0 | Biology & Life Science | 0.637293 | 47 | 6259 | ... | 2190 | 3602 | 304 | 0.046320 | 26000 | 20000 | 39000 | 2771 | 2947 | 743 |
169 | 170 | 5201 | EDUCATIONAL PSYCHOLOGY | 2854.0 | 522.0 | 2332.0 | Psychology & Social Work | 0.817099 | 7 | 2125 | ... | 572 | 1211 | 148 | 0.065112 | 25000 | 24000 | 34000 | 1488 | 615 | 82 |
170 | 171 | 5202 | CLINICAL PSYCHOLOGY | 2838.0 | 568.0 | 2270.0 | Psychology & Social Work | 0.799859 | 13 | 2101 | ... | 648 | 1293 | 368 | 0.149048 | 25000 | 25000 | 40000 | 986 | 870 | 622 |
171 | 172 | 5203 | COUNSELING PSYCHOLOGY | 4626.0 | 931.0 | 3695.0 | Psychology & Social Work | 0.798746 | 21 | 3777 | ... | 965 | 2738 | 214 | 0.053621 | 23400 | 19200 | 26000 | 2403 | 1245 | 308 |
172 | 173 | 3501 | LIBRARY SCIENCE | 1098.0 | 134.0 | 964.0 | Education | 0.877960 | 2 | 742 | ... | 237 | 410 | 87 | 0.104946 | 22000 | 20000 | 22000 | 288 | 338 | 192 |
5 rows × 21 columns
# generate summary statistics for all numeric columns
recent_grads.describe()
Rank | Major_code | Total | Men | Women | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 173.000000 | 173.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
mean | 87.000000 | 3879.815029 | 39370.081395 | 16723.406977 | 22646.674419 | 0.522223 | 356.080925 | 31192.763006 | 26029.306358 | 8832.398844 | 19694.427746 | 2416.329480 | 0.068191 | 40151.445087 | 29501.445087 | 51494.219653 | 12322.635838 | 13284.497110 | 3859.017341 |
std | 50.084928 | 1687.753140 | 63483.491009 | 28122.433474 | 41057.330740 | 0.231205 | 618.361022 | 50675.002241 | 42869.655092 | 14648.179473 | 33160.941514 | 4112.803148 | 0.030331 | 11470.181802 | 9166.005235 | 14906.279740 | 21299.868863 | 23789.655363 | 6944.998579 |
min | 1.000000 | 1100.000000 | 124.000000 | 119.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 111.000000 | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.000000 | 2403.000000 | 4549.750000 | 2177.500000 | 1778.250000 | 0.336026 | 39.000000 | 3608.000000 | 3154.000000 | 1030.000000 | 2453.000000 | 304.000000 | 0.050306 | 33000.000000 | 24000.000000 | 42000.000000 | 1675.000000 | 1591.000000 | 340.000000 |
50% | 87.000000 | 3608.000000 | 15104.000000 | 5434.000000 | 8386.500000 | 0.534024 | 130.000000 | 11797.000000 | 10048.000000 | 3299.000000 | 7413.000000 | 893.000000 | 0.067961 | 36000.000000 | 27000.000000 | 47000.000000 | 4390.000000 | 4595.000000 | 1231.000000 |
75% | 130.000000 | 5503.000000 | 38909.750000 | 14631.000000 | 22553.750000 | 0.703299 | 338.000000 | 31433.000000 | 25147.000000 | 9948.000000 | 16891.000000 | 2393.000000 | 0.087557 | 45000.000000 | 33000.000000 | 60000.000000 | 14444.000000 | 11783.000000 | 3466.000000 |
max | 173.000000 | 6403.000000 | 393735.000000 | 173809.000000 | 307087.000000 | 0.968954 | 4212.000000 | 307933.000000 | 251540.000000 | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
Initial thoughts on the dataset:
# get row count from raw data
raw_data_count = len(recent_grads.index)
print("Raw data row count:", raw_data_count)
# drop rows containing missing values - this is necessary as
# Matplotlib expects that any columns we pass are of the same length
recent_grads = recent_grads.dropna()
cleaned_data_count = len(recent_grads.index)
print("Clean data row count:", cleaned_data_count)
Raw data row count: 173
Clean data row count: 172
We can see that only one row contained missing values and has now been dropped from the dataset.
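Before dropping rows, it can be useful to see which ones actually contain missing values. The sketch below is a minimal illustration on a small hypothetical frame (the majors and numbers are made up, and `toy` stands in for `recent_grads`); `isna().any(axis=1)` builds a boolean mask that is True for any row with at least one missing value.

```python
import numpy as np
import pandas as pd

# hypothetical toy data standing in for recent_grads
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [1200.0, np.nan, 560.0],
    "Median": [40000, 35000, 52000],
})

# boolean mask: True for rows with at least one missing value
has_missing = toy.isna().any(axis=1)

# inspect the affected rows before removing them
print(toy[has_missing])

# drop them and confirm how many rows were removed
cleaned = toy.dropna()
print("Rows dropped:", len(toy) - len(cleaned))
```

Running the same mask against the real dataframe would show which major is removed by `dropna()`, which is worth knowing before interpreting any rankings.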
To generate these scatter plots, we will leverage the plotting functionality within pandas using the .plot() method. Like pyplot, the plotting functionality in pandas is a wrapper for matplotlib. This means we can customise the plots when necessary by accessing the underlying Figure, Axes, and other matplotlib objects.
Let's start by exploring the following relationships:
Sample_size vs. Median
Sample_size vs. Unemployment_rate
ShareWomen vs. Unemployment_rate
Full_time vs. Median
Men vs. Median
Women vs. Median
Here's a reminder of the dataframe column descriptions:
Column | Description |
---|---|
Sample_size | Sample size (unweighted) of full-time, year-round workers that provided salary data (used for Median) |
Median | Median salary of full-time, year-round workers |
Unemployment_rate | Unemployed / (Unemployed + Employed) |
Full_time | Number employed 35 hours or more |
ShareWomen | Women as share of total |
Men | Male graduates |
Women | Female graduates |
# create a figure and six axes arranged in a 3 row by 2 column layout
fig = plt.figure(figsize=(15,25))
# create list of columns to use for x and y values
col_x = ['Sample_size', 'Sample_size', 'ShareWomen', 'Full_time', 'Men', 'Women']
col_y = ['Median', 'Unemployment_rate', 'Unemployment_rate', 'Median', 'Median', 'Median']
# loop over col_x and col_y, generating a scatter plot for each pair
for r in range(0,6):
    ax = fig.add_subplot(3,2,r+1)
    # call matplotlib directly rather than using the pandas plot() method wrapper
    ax.scatter(x=recent_grads[col_x[r]], y=recent_grads[col_y[r]])
    # set title and labels
    ax.set_title(str(r+1) + ". " + col_x[r] + " vs. " + col_y[r])
    ax.set_xlabel(col_x[r])
    ax.set_ylabel(col_y[r])
remove_ticks()
None of the scatter plots demonstrate a significant relationship between the variables plotted. However, with the exception of the Unemployment_rate vs. ShareWomen plot, all plots seem to exhibit a degree of heteroscedasticity with a right skew.
We can see from the top two plots that there is a high level of variance at lower values of Sample_size, which then tapers off as Sample_size increases.
Plot 3 reveals that there is no relationship whatsoever between the share of women in a major and unemployment rate.
In the plots numbered 4, 5, and 6, there is a high level of variance in median income when the number of graduates is low, which tapers off as the number of graduates increases. We can see from plots 5 and 6 that there is slightly greater variance in income for male graduates than for female graduates, with males tending to earn slightly more.
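Let's try to answer the following questions about our data:
1. Do students in more popular majors make more money?
2. Do students that majored in subjects that were majority female make more money?
3. Is there any link between the number of full-time employees and median salary?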
To do this, we will need to create a few more scatter plots.
Do students in more popular majors make more money?
Let's plot the total number of graduates in each major on the x-axis, and median annual salary on the y-axis.
# plot total graduates vs median income
recent_grads.plot(x='Total', y='Median', kind='scatter', title='Total vs. Median')
remove_ticks()
There is no clear correlation between the popularity of a major and the median income earned by its graduates; however, we can see that the variance in median income decreases as popularity increases. There are a lot of overlapping data points in the bottom left of the chart, so a hexagonal bin plot should give us a better idea of how this data is distributed.
# plot the same information as above but as a hexagonal bin
ax = recent_grads.plot.hexbin(
x='Total',
y='Median',
title='Hexbin showing Total graduates vs. Median income \n',
gridsize=15
)
remove_ticks()
The hexagonal bin plot reveals that most majors are concentrated at a median salary of around 35,000 to 40,000 USD.
Do students that majored in subjects that were majority female make more money?
For this we will plot the share of women in a major on the x-axis, and median annual salary on the y-axis.
# plot share of women in a major vs median income
recent_grads.plot(x='ShareWomen', y='Median', kind='scatter', title='ShareWomen vs. Median')
remove_ticks()
The answer is no. There is a weak negative correlation between the share of women in a major and median income, indicating that graduates who majored in subjects that are majority female were likely to earn less money in their employment. This reflects the initial thoughts we had on the dataset when viewing the top and bottom rows, where we observed that the top ranked majors were majority male, and the bottom ranked were majority female.
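A correlation coefficient gives a number to go with the visual impression from the scatter plot. The sketch below is only an illustration: it uses the ten rows visible in the head() and tail() output earlier (the five top-ranked and five bottom-ranked majors), so the coefficient it prints describes the extremes of the ranking and will overstate the strength of the relationship in the full dataset. Calling the same method on the full `recent_grads` columns would give the figure that matters.

```python
import pandas as pd

# ShareWomen and Median for the five top- and five bottom-ranked majors,
# copied from the head() and tail() output shown earlier
sample = pd.DataFrame({
    "ShareWomen": [0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
                   0.637293, 0.817099, 0.799859, 0.798746, 0.877960],
    "Median": [110000, 75000, 73000, 70000, 65000,
               26000, 25000, 25000, 23400, 22000],
})

# Series.corr() computes the Pearson correlation coefficient by default
r = sample["ShareWomen"].corr(sample["Median"])
print("Pearson r: {:.2f}".format(r))
```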
# plot number of graduates employed full time vs median income
recent_grads.plot(x='Full_time', y='Median', kind='scatter', title='Full_time vs. Median')
remove_ticks()
# plot hex bin chart to help identify distribution
recent_grads.plot.hexbin(
x='Full_time',
y='Median',
title='Number of graduates working full time vs. Median income \n',
gridsize=15
)
remove_ticks()
Is there any link between the number of full-time employees and median salary?
We plotted this scatter graph earlier, but here it is again as a reminder. There is no clear relationship between the number of graduates who are in full-time employment and income earned. The hexagonal bin plot reveals a similar pattern to the previous one: for most majors, graduates working full-time tend to earn around 35,000 to 40,000 USD.
To create our histograms, we will use the pandas Series.hist() method. We could also use Series.plot() and set the kind parameter to 'hist', passing bins as an extra keyword argument, but Series.hist() exposes parameters specific to customising histograms directly in its signature, including the number of bins, the x-axis label rotation (xrot), and the grid.
## this time, we will attempt to generate all our plots in one go using a loop
# create list of columns of interest
cols = ["Sample_size",
"Median",
"Employed",
"Full_time",
"ShareWomen",
"Unemployment_rate",
"Men",
"Women"]
# calculate the number of bins using a simple binning strategy of taking the
# square root of the total number of rows and rounding up to an integer
bins = math.ceil(math.sqrt(len(recent_grads.index)))
fig = plt.figure(figsize=(15,25))
# loop over the list of columns and plot histograms
for r in range(0,8):
    ax = fig.add_subplot(4,2,r+1)
    # pass the axes explicitly so each histogram lands on its own subplot
    ax = recent_grads[cols[r]].hist(bins=bins, xrot=45, grid=False, ax=ax)
    ax.set_ylabel("Frequency")
    ax.set_title(cols[r])
remove_ticks()
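The square-root binning rule used above can be checked by hand. The sketch below assumes the cleaned row count of 172 and uses randomly generated values (not the dataset) to show how the resulting bin count divides a column into equal-width intervals with `np.histogram`.

```python
import math
import numpy as np

n_rows = 172  # row count after dropping missing values
bins = math.ceil(math.sqrt(n_rows))
print("Number of bins:", bins)

# hypothetical data, used only to show the resulting bin structure
rng = np.random.default_rng(0)
values = rng.normal(size=n_rows)

# np.histogram splits the data range into `bins` equal-width intervals
counts, edges = np.histogram(values, bins=bins)
print("Bin counts:", counts)
print("Number of bin edges:", len(edges))
```

The square-root rule is only one of several heuristics; it is simple and tends to work reasonably for a dataset of this size, but heavily skewed columns such as Sample_size may still look compressed.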
Sample_size
The distribution has a heavy right skew, with the vast majority of majors having a sample size of less than 250 graduates.
As a reminder, the Sample_size column in our dataset refers to the number of graduates working full-time who reported their annual salary, which in turn is used to produce the Median column (median annual salary) for each major. Although it would depend on the total number of graduates in each major, this histogram suggests that most majors have a small sample size for calculating the median annual salary, which could influence our results.
Median
According to statistics website Statista, the average income of a college graduate in 2012 was 42,315 USD. The plot shows that the majority of our data falls in the range of 30,000 to 40,000 USD, which is roughly in line with the Statista figure.
Employed
The number of employed graduates for the majority of majors is less than 25,000. This information is not that useful on its own, but could be used alongside the total number of graduates to find out the employment rate for each major.
Full_time
The histogram follows a very similar pattern to the Employed histogram, suggesting that the majority of employed graduates are working full time, rather than part time. If we plotted these variables on a scatter plot, we could probably expect a strong positive correlation.
ShareWomen
We can make a rough assumption from this histogram that for the majority of majors in our dataset, the share of women graduates is greater than the share of men. We could analyse this data further to be certain of our assumption.
Unemployment_rate
The majority of majors have an unemployment rate of around 0.05 to 0.08 (5% to 8%), which is roughly in line with the national average.
Men and Women
Both of these histograms are fairly similar and we can't derive anything particularly interesting from either plot.
Next, we will use scatter matrix plots so that we can explore potential relationships and distributions simultaneously.
# the scatter matrix function is part of the pandas.plotting module, which needs to be imported separately
from pandas.plotting import scatter_matrix
# create scatter matrix plot of Sample size vs Median income
scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(10,10))
remove_ticks()
# create scatter matrix plot of Sample size, Median income, and Unemployment_rate
scatter_matrix(recent_grads[['Sample_size', 'Median', "Unemployment_rate"]], figsize=(10,10))
remove_ticks()
While these scatter matrices do not provide us with any information we didn't already know from previous plots, they have proven to be a quick means of exploring relationships between multiple variables.
Let's look at the share of women for each major ranked in the top 10 and compare this with the share of women for each major ranked in the bottom 10.
fig, axs = plt.subplots(nrows = 1, ncols = 2, figsize = (20, 6))
# slice the dataset to get the top and bottom 10 rows and
# enforce a range on the y-axis of 0 to 1 to make it easier to compare both plots
ax1 = recent_grads[:10].plot.bar(
x='Major',
y='ShareWomen',
ylim=(0,1),
ax = axs[0],
legend=False,
color="lightblue",
title="Share of women in top 10 majors"
)
ax2 = recent_grads[-10:].plot.bar(x='Major',
y='ShareWomen',
ylim=(0,1),
ax = axs[1],
legend=False,
color="lightblue",
title="Share of women in bottom 10 majors"
)
ax1.set_ylabel("ShareWomen")
ax2.set_ylabel("ShareWomen")
# demarcate threshold for greater share of women
ax1.axhline(0.5, color="red", linewidth=2)
ax2.axhline(0.5, color="red", linewidth=2)
remove_ticks()
These bar plots reinforce our inference that graduates of majority-female majors tend to earn less than graduates of majority-male majors. The red line on both plots marks the point where the shares of men and women are equal (0.5). Any bars below the red line are majority male, whereas any bars above the red line are majority female.
We can see from the plot on the left that only one major ranking in the top ten for median income had a greater share of women than men: astronomy and astrophysics. Meanwhile, in the right plot, all bottom ranking majors for median income have a greater share of women than men.
We can also see that engineering disciplines have a strong tendency to be male dominated subjects, while life sciences and psychological sciences tend to be much more popular among women.
Out of interest, let's calculate the average median income for majority male majors and majority female majors.
# calculate the average income of graduates by whether the major was majority male or majority female
female_major_income = recent_grads.loc[recent_grads["ShareWomen"] > 0.5, "Median"].mean()
male_major_income = recent_grads.loc[recent_grads["ShareWomen"] < 0.5, "Median"].mean()
print("Average income (USD) across majors")
print("-------------------------------")
print("Majority female: {:.2f}".format(female_major_income))
print("Majority male: {:.2f}".format(male_major_income))
Average income (USD) across majors
-------------------------------
Majority female: 34605.21
Majority male: 46988.16
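The figures above are unweighted means of per-major medians, so a major with under a thousand graduates counts as much as one with tens of thousands. One possible alternative, sketched below under that assumption, is to weight each major by its Total using `np.average`. The example uses the five top-ranked rows from the head() output, and `weighted_mean_median` is a hypothetical helper name, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Total and Median for the five top-ranked majors, from the head() output
top5 = pd.DataFrame({
    "Total": [2339.0, 756.0, 856.0, 1258.0, 32260.0],
    "Median": [110000, 75000, 73000, 70000, 65000],
})

def weighted_mean_median(df):
    """Mean of per-major medians, weighted by number of graduates (hypothetical helper)."""
    return np.average(df["Median"], weights=df["Total"])

unweighted = top5["Median"].mean()
weighted = weighted_mean_median(top5)
print("Unweighted: {:.2f}".format(unweighted))
print("Weighted:   {:.2f}".format(weighted))
```

Here the weighted figure is pulled towards chemical engineering because it has by far the most graduates. Note that a weighted mean of medians is still not the median salary of all graduates pooled together; it only reduces the influence of very small majors.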
Now let's look at the unemployment rate from the ten top and bottom ranking majors.
fig, axs = plt.subplots(nrows = 1, ncols = 2, figsize = (20, 6))
# use slices to get the top 10 and bottom 10 rows
ax1 = recent_grads[:10].plot.bar(
x='Major',
y='Unemployment_rate',
ax = axs[0],
legend=False,
color="lightblue",
title="Unemployment rate in top 10 majors"
)
ax2 = recent_grads[-10:].plot.bar(
x='Major',
y='Unemployment_rate',
ax = axs[1],
legend=False,
color="lightblue",
title="Unemployment rate in bottom 10 majors"
)
ax1.set_ylabel("Unemployment_rate")
ax2.set_ylabel("Unemployment_rate")
remove_ticks()
From the plots we can see that there is not a large disparity in unemployment rate between the top and bottom ranking majors, suggesting that the median income for graduates of a major is not strongly tied to the employment prospects of that major (note: this is also demonstrated in the scatter matrix above for the plot showing Unemployment_rate and Median). That said, it does appear that unemployment rate is, on average, higher among bottom ranking majors. We could investigate this further to be conclusive...
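# Summarise unemployment rate for top 10 and bottom 10 majors
# use median rather than average to insulate from outlier (e.g. Nuclear engineering)
# use .iloc for positional slicing: with .loc, the label slice -10: matches every
# row and :10 matches eleven rows (labels 0 to 10 inclusive)
print("Median unemployment rate", "\n", "--------------------------")
print("Bottom 10 majors: {:.3g}".format(recent_grads["Unemployment_rate"].iloc[-10:].median()))
print("Top 10 majors: {:.3g}".format(recent_grads["Unemployment_rate"].iloc[:10].median()))
Median unemployment rate
--------------------------
Bottom 10 majors: 0.0675
Top 10 majors: 0.0592
The output shown here was produced with label-based .loc slicing, so the 0.0675 figure is most likely the median across all majors rather than the bottom 10, and the 0.0592 figure covers the first eleven rows rather than ten. The two values are close, but the comparison should be re-run with positional slicing before concluding that the medians for the top and bottom majors do not differ by any significant amount.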
On another note, it is interesting that the major with the highest unemployment rate out of this selection is a top ranking major, 'Nuclear engineering'. While the median income of nuclear engineering graduates in full-time employment is relatively high, perhaps there is not enough of a job market for nuclear engineering roles, though if this were the case it would seem contrary to the laws of supply and demand.
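Selecting the first or last rows of a dataframe depends on whether the slice is interpreted by label or by position. The sketch below uses a hypothetical ten-row frame with a default integer index to show how `.loc` and `.iloc` differ on the same slice expressions; the column values are made up.

```python
import pandas as pd

# hypothetical toy frame with a default integer index 0..9
toy = pd.DataFrame({"Unemployment_rate": [0.01 * i for i in range(10)]})

# .loc slices by label and includes the end label
first_by_label = toy.loc[:3]      # labels 0, 1, 2 and 3
# a label slice starting at -3 matches every label >= -3, i.e. all rows
last_by_label = toy.loc[-3:]

# .iloc slices by position, like a Python list
first_by_position = toy.iloc[:3]  # first three rows
last_by_position = toy.iloc[-3:]  # last three rows

print(len(first_by_label), len(last_by_label))
print(len(first_by_position), len(last_by_position))
```

For "top N" and "bottom N" summaries on a ranked dataframe, positional slicing with `.iloc` (or `head()` and `tail()`) is usually the safer choice, particularly after rows have been dropped and the index labels no longer match row positions.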
Let's use a grouped bar plot to compare the number of men with the number of women in each category of major.
# select the women, men, and major_category columns and store in a new dataframe
women_men_majors = recent_grads[["Women", "Men", "Major_category"]].copy()
# use groupby() operation to group data by the major category column and sum the men and women values
grouped_data = women_men_majors.groupby("Major_category").sum()
# generate grouped horizontal bar plot to compare number of men with the number of women in each category of majors
grouped_data.plot.barh(figsize=(10,15))
remove_ticks()
We can conclude a few things from the plot above:
The largest number of graduates is in business, making it the most popular category of major for both sexes; the ratio of men to women is fairly equal, with just slightly more men than women
Categories with noticeably more women than men: education, health, humanities and liberal arts, psychology and social work
Categories with noticeably more men than women: computers and mathematics, engineering
# visualise number of female graduates to male graduates
# sum both columns in dataframe and plot the resulting series as a bar chart
recent_grads[["Women", "Men"]].sum().plot.bar(title="Number of female and male graduates")
remove_ticks()
There are approximately 3.9 million female graduates, which is decisively more than the number of male graduates (approximately 2.9 million).
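The totals can be cross-checked against the describe() output from earlier, since a column total is just its mean multiplied by its count. The sketch below uses the Men and Women means and the count of 172 from that table; the percentage it prints is derived from those figures rather than taken from the bar chart.

```python
# means and count for the Women and Men columns, from the describe() output
n_majors = 172
mean_women = 22646.674419
mean_men = 16723.406977

# total = mean * count
total_women = mean_women * n_majors
total_men = mean_men * n_majors

# relative difference between the two totals, as a percentage
pct_more_women = (total_women / total_men - 1) * 100

print("Women: {:,.0f}".format(total_women))
print("Men:   {:,.0f}".format(total_men))
print("Women exceed men by {:.1f}%".format(pct_more_women))
```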
Let's take a closer look at the categories of major that are either significantly male or female dominated, and plot these on a bar chart to show median income by major category. Instead of just using the categories that we identified from the bar chart above, we'll take a more calculated approach and select major categories whose average share of women is at least 0.66 (roughly two-thirds female) or at most 0.33 (roughly two-thirds male).
# select columns of interest
select_cols = recent_grads[["Major_category", "Median", "ShareWomen"]]
# aggregate mean values of median income and share of women by major category
grouped = select_cols.groupby("Major_category").mean()
# use bool mask to select only majors that are significantly male or female dominated
gender_dominated = grouped[(grouped["ShareWomen"] >= 0.66) | (grouped["ShareWomen"] <= 0.33)].copy()
# create new column and assign a value to indicate which gender dominates the major category
gender_dominated["ShareGender"] = ['female' if x >= 0.66 else 'male' for x in gender_dominated['ShareWomen']]
# sort dataframe by 'Median' column and print to check
gender_dominated.sort_values(by="Median", inplace=True, ascending=False)
print(gender_dominated)
# plot data on bar chart with colour coding for gender
gender_dominated.plot.bar(
y="Median",
legend=False,
color=gender_dominated["ShareGender"].map({"male": 'g', "female": 'b'})
)
plt.ylabel("Median income (USD)")
plt.title("Median income of major categories with a significant gender skew \n")
# import patches module to create a custom legend
import matplotlib.patches as mpatches
green_patch = mpatches.Patch(color='green', label='Male dominated')
blue_patch = mpatches.Patch(color='blue', label='Female dominated')
plt.legend(handles=[green_patch, blue_patch])
remove_ticks()
Major_category | Median | ShareWomen | ShareGender |
---|---|---|---|
Engineering | 57382.758621 | 0.238889 | male |
Computers & Mathematics | 42745.454545 | 0.311772 | male |
Health | 36825.000000 | 0.795152 | female |
Interdisciplinary | 35000.000000 | 0.770901 | female |
Education | 32350.000000 | 0.748507 | female |
Psychology & Social Work | 30100.000000 | 0.794397 | female |
From this plot we can see the male dominated major categories (green) earn on average more money than female dominated majors (blue), reinforcing our findings from earlier in this project.
Now let's make some box plots to explore the distributions of median salaries and unemployment rate.
from pandas.plotting import boxplot
# generate box plot to show distribution of median income
recent_grads.boxplot(column="Median", figsize=(10,5))
plt.ylabel("Median income (USD)")
plt.title("Median income for all majors")
# hide unnecessary x-axis label
# use plt.gca() to get the existing axes; plt.axes() may add a new, empty one
ax1 = plt.gca()
x_axis = ax1.get_xaxis()
x_axis.set_visible(False)
remove_ticks()
(Outliers not considered)
# generate box plot to show distribution of unemployment rate
recent_grads.boxplot(column="Unemployment_rate", figsize=(10,5))
# values are stored as fractions (e.g. 0.05), not percentages
plt.ylabel("Unemployment rate (fraction)")
plt.title("Unemployment rate for all majors")
# hide unnecessary x-axis label
# use plt.gca() to get the existing axes; plt.axes() may add a new, empty one
ax1 = plt.gca()
x_axis = ax1.get_xaxis()
x_axis.set_visible(False)
remove_ticks()
One of the outliers is the Nuclear Engineering major we identified earlier as having an unusually high unemployment rate. Let's explore this further after looking into which majors have an unemployment rate of 0%.
# lookup majors with an unemployment rate of 0
recent_grads.loc[recent_grads["Unemployment_rate"] == 0, ["Major", "Employed", "Unemployed"]]
Major | Employed | Unemployed | |
---|---|---|---|
52 | MATHEMATICS AND COMPUTER SCIENCE | 559 | 0 |
73 | MILITARY TECHNOLOGIES | 0 | 0 |
83 | BOTANY | 1010 | 0 |
112 | SOIL SCIENCE | 613 | 0 |
120 | EDUCATIONAL ADMINISTRATION AND SUPERVISION | 703 | 0 |
It turns out there are a few majors with no unemployed graduates, but one of them, Military Technologies, has no employed people either! Perhaps only civilian jobs were recorded for this dataset. The other majors seem like safe bets for landing a job post-graduation.
# lookup majors with an unemployment rate of >= 12%
recent_grads.loc[recent_grads["Unemployment_rate"] >= 0.12, ["Major", "Employed", "Unemployed", "Unemployment_rate"]]
Major | Employed | Unemployed | Unemployment_rate | |
---|---|---|---|---|
5 | NUCLEAR ENGINEERING | 1857 | 400 | 0.177226 |
29 | PUBLIC POLICY | 4547 | 670 | 0.128426 |
84 | COMPUTER NETWORKING AND TELECOMMUNICATIONS | 6144 | 1100 | 0.151850 |
89 | PUBLIC ADMINISTRATION | 4158 | 789 | 0.159491 |
170 | CLINICAL PSYCHOLOGY | 2101 | 368 | 0.149048 |
Recall from the bar charts we created earlier on unemployment rate that we were surprised to see that Nuclear Engineering, a top ten ranking major, had such a high unemployment rate, so high in fact that it is an outlier on our box plot. Other majors with a high unemployment rate (12% or more) include Public Policy, Computer Networking and Telecommunications, Public Administration, and Clinical Psychology. Perhaps these majors should be avoided unless the prospective student has a genuine passion for the field?
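With matplotlib's default settings, box plot whiskers extend to the furthest data point within 1.5 times the interquartile range (IQR) of the quartiles, and anything beyond is drawn as an outlier. The sketch below applies that rule directly so outliers can be listed rather than picked with a hand-chosen threshold. `iqr_outliers` is a hypothetical helper and the rates are made-up values, not taken from the dataset.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# hypothetical unemployment rates, not taken from the dataset
rates = pd.Series([0.04, 0.05, 0.06, 0.06, 0.07, 0.07, 0.08, 0.09, 0.18])

outliers = iqr_outliers(rates)
print(outliers)
```

Applied to the Unemployment_rate column, this would show which of the majors above the 12% cut-off are actually beyond the upper whisker, since the two criteria need not select the same rows.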
Let's summarise what we have learned from analysing this dataset. Near the start, we set out three questions to try and find answers to:
1. Do students in more popular majors make more money?
This was inconclusive: we did not find a clear relationship between the total number of graduates of a major and the median income earned. However, we could identify a degree of heteroscedasticity: as the number of graduates increased, the variation in median income decreased.
2. Do students that majored in subjects that were majority female make more money?
We identified a weak negative correlation between the share of women in a major and the median income earned, indicating the opposite is true: graduates who majored in subjects that were majority female earned, on average, less money than graduates who majored in majority male subjects. The average median income was 46988.16 USD across majority male majors and 34605.21 USD across majority female majors. Only one major ranked in the top 10 for median income was majority female, astronomy and astrophysics, while every single major ranked in the bottom 10 was majority female.
3. Is there any link between the number of full-time employees and median salary?
There is no clear relationship between the number of graduates who are in full-time employment and income earned.
Other notable insights:
There are more female graduates than male graduates (roughly 35% more)
Most majors have a median salary of around 35,000 to 40,000 USD
8 of the top 10 majors were engineering disciplines
Only one top 10 major had a greater share of women than men: astronomy and astrophysics
Business was the most popular category of major for both male and female students.
The majority of graduates who are employed work full-time rather than part-time.
The unemployment rate for most majors was around 6-8%
Majors with no unemployment: Mathematics and Computer Science, Botany, Soil Science, Educational Administration and Supervision. Note that of these majors, the highest ranking is Mathematics and Computer Science at rank 53 out of 173 by median income, so although all of these majors are great for getting a job, none are particularly well paid.
Majors with a high unemployment rate: Nuclear Engineering, Public Policy, Computer Networking and Telecommunications, Public Administration, and Clinical Psychology.
Possible further investigations:
Sample size
At a glance, the Sample_size for each major appears to be inadequate. Since this information is used to determine the median income for graduates of a major and, by extension, the major's rank relative to other majors, it is important that the sample size is large enough to accurately represent the population (i.e. number of graduates of the major).
To determine if the sample size is suitable, we will need to use a formula and apply this to every row in the dataset. If we find that the sample size is insufficient for a significant number of majors, this could invalidate any of our analyses that used Sample_size, Median, and Rank.
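The text does not say which formula would be used, so the sketch below is only a simple first-pass screen, not a formal sample size calculation. It computes the sampling fraction (Sample_size divided by Total) and flags majors whose sample falls under an assumed rule-of-thumb cut-off of 30 respondents. `MIN_SAMPLE` and the two new column names are hypothetical; the five rows come from the head() output.

```python
import pandas as pd

# Total and Sample_size for the five top-ranked majors, from the head() output
top5 = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
              "METALLURGICAL ENGINEERING",
              "NAVAL ARCHITECTURE AND MARINE ENGINEERING",
              "CHEMICAL ENGINEERING"],
    "Total": [2339.0, 756.0, 856.0, 1258.0, 32260.0],
    "Sample_size": [36, 7, 3, 16, 289],
})

MIN_SAMPLE = 30  # assumed rule-of-thumb cut-off, not from the source

# share of each major's graduates that contributed salary data
top5["Sampling_fraction"] = top5["Sample_size"] / top5["Total"]
# flag majors whose median salary rests on very few respondents
top5["Small_sample"] = top5["Sample_size"] < MIN_SAMPLE

print(top5[["Major", "Sample_size", "Sampling_fraction", "Small_sample"]])
```

Three of the five top-ranked majors are flagged under this assumed cut-off, which suggests the ranking at the top of the table may be sensitive to a handful of survey responses.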
Nuclear Engineering
It could be interesting to investigate why a top 10 ranking major, Nuclear Engineering, has the worst unemployment rate of all majors.