Visualising earnings based on college majors¶

In this project, we will explore a dataset on the job outcomes of students in the USA who graduated from college between 2010 and 2012. The dataset was originally published by the American Community Survey, and subsequently cleaned by FiveThirtyEight, which released it on its GitHub repo.

Each row in the dataset represents a different major in college and contains information on gender diversity, employment rates, median salaries, and more. Here are some of the columns in the dataset:

Column	Description
Rank	Rank by median earnings (the dataset is ordered by this column)
Major_code	Major code
Major	Major description
Major_category	Category of major
Total	Total number of people with major
Sample_size	Sample size (unweighted) of full-time, year-round ONLY (used for earnings)
Men	Male graduates
Women	Female graduates
ShareWomen	Women as share of total
Employed	Number employed
Median	Median salary of full-time, year-round workers
Low_wage_jobs	Number in low-wage service jobs
Full_time	Number employed 35 hours or more
Part_time	Number employed less than 35 hours

To explore this data, we will create a variety of data visualisations including scatter plots, histograms, and bar charts. We will generate these plots using the matplotlib library.

In [1]:

# import libraries
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
# run Jupyter magic so that plots are displayed inline
%matplotlib inline

# set styles
plt.rcParams['axes.titlesize'] = 'x-large'
plt.rcParams['axes.spines.left'] = False
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.spines.bottom'] = False

# default_rc = dict(mpl.rcParams) # uncomment and run all to return styles to default

# function to remove ticks as this can't be set in rcParams
def remove_ticks():
    # get all axes in current figure
    axes = plt.gcf().get_axes()
    # iterate over each axes 
    for ax in axes:
        # use booleans: the older 'off'/'on' strings are not treated as False/True by current matplotlib
        ax.tick_params(top=False, bottom=False, left=False, right=False)

Read dataset and begin exploring data¶

In [2]:

# read dataset into dataframe and return first row
recent_grads = pd.read_csv("recent-grads.csv")
recent_grads.iloc[0]

Out[2]:

Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object

In [3]:

# explore head
recent_grads.head()

Out[3]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
0	1	2419	PETROLEUM ENGINEERING	2339.0	2057.0	282.0	Engineering	0.120564	36	1976	...	270	1207	37	0.018381	110000	95000	125000	1534	364	193
1	2	2416	MINING AND MINERAL ENGINEERING	756.0	679.0	77.0	Engineering	0.101852	7	640	...	170	388	85	0.117241	75000	55000	90000	350	257	50
2	3	2415	METALLURGICAL ENGINEERING	856.0	725.0	131.0	Engineering	0.153037	3	648	...	133	340	16	0.024096	73000	50000	105000	456	176	0
3	4	2417	NAVAL ARCHITECTURE AND MARINE ENGINEERING	1258.0	1123.0	135.0	Engineering	0.107313	16	758	...	150	692	40	0.050125	70000	43000	80000	529	102	0
4	5	2405	CHEMICAL ENGINEERING	32260.0	21239.0	11021.0	Engineering	0.341631	289	25694	...	5180	16697	1672	0.061098	65000	50000	75000	18314	4440	972

5 rows × 21 columns

In [4]:

# explore tail
recent_grads.tail()

Out[4]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
168	169	3609	ZOOLOGY	8409.0	3050.0	5359.0	Biology & Life Science	0.637293	47	6259	...	2190	3602	304	0.046320	26000	20000	39000	2771	2947	743
169	170	5201	EDUCATIONAL PSYCHOLOGY	2854.0	522.0	2332.0	Psychology & Social Work	0.817099	7	2125	...	572	1211	148	0.065112	25000	24000	34000	1488	615	82
170	171	5202	CLINICAL PSYCHOLOGY	2838.0	568.0	2270.0	Psychology & Social Work	0.799859	13	2101	...	648	1293	368	0.149048	25000	25000	40000	986	870	622
171	172	5203	COUNSELING PSYCHOLOGY	4626.0	931.0	3695.0	Psychology & Social Work	0.798746	21	3777	...	965	2738	214	0.053621	23400	19200	26000	2403	1245	308
172	173	3501	LIBRARY SCIENCE	1098.0	134.0	964.0	Education	0.877960	2	742	...	237	410	87	0.104946	22000	20000	22000	288	338	192

5 rows × 21 columns

In [5]:

# generate summary statistics for all numeric columns
recent_grads.describe()

Out[5]:

	Rank	Major_code	Total	Men	Women	ShareWomen	Sample_size	Employed	Full_time	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
count	173.000000	173.000000	172.000000	172.000000	172.000000	172.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000
mean	87.000000	3879.815029	39370.081395	16723.406977	22646.674419	0.522223	356.080925	31192.763006	26029.306358	8832.398844	19694.427746	2416.329480	0.068191	40151.445087	29501.445087	51494.219653	12322.635838	13284.497110	3859.017341
std	50.084928	1687.753140	63483.491009	28122.433474	41057.330740	0.231205	618.361022	50675.002241	42869.655092	14648.179473	33160.941514	4112.803148	0.030331	11470.181802	9166.005235	14906.279740	21299.868863	23789.655363	6944.998579
min	1.000000	1100.000000	124.000000	119.000000	0.000000	0.000000	2.000000	0.000000	111.000000	0.000000	111.000000	0.000000	0.000000	22000.000000	18500.000000	22000.000000	0.000000	0.000000	0.000000
25%	44.000000	2403.000000	4549.750000	2177.500000	1778.250000	0.336026	39.000000	3608.000000	3154.000000	1030.000000	2453.000000	304.000000	0.050306	33000.000000	24000.000000	42000.000000	1675.000000	1591.000000	340.000000
50%	87.000000	3608.000000	15104.000000	5434.000000	8386.500000	0.534024	130.000000	11797.000000	10048.000000	3299.000000	7413.000000	893.000000	0.067961	36000.000000	27000.000000	47000.000000	4390.000000	4595.000000	1231.000000
75%	130.000000	5503.000000	38909.750000	14631.000000	22553.750000	0.703299	338.000000	31433.000000	25147.000000	9948.000000	16891.000000	2393.000000	0.087557	45000.000000	33000.000000	60000.000000	14444.000000	11783.000000	3466.000000
max	173.000000	6403.000000	393735.000000	173809.000000	307087.000000	0.968954	4212.000000	307933.000000	251540.000000	115172.000000	199897.000000	28169.000000	0.177226	110000.000000	95000.000000	125000.000000	151643.000000	148395.000000	48207.000000

Initial thoughts on the dataset:

we can see that the top 5 majors (as ranked by median earnings) are in engineering disciplines
the share of women is much higher in the bottom ranked majors compared with the top ranked majors
the average unemployment rate is 6.8%, slightly below the unemployment rate in the USA at the time (9.6% [2010] to 7.7% [2012])
there appears to be one record with missing values (looking at the count row in the summary statistics table above), so we will need to drop this row before we can perform our analysis

Drop missing values¶

In [6]:

# get row count from raw data
raw_data_count = len(recent_grads.index)
print("Raw data row count:", raw_data_count)

# drop rows containing missing values - this is necessary as
# Matplotlib expects that any columns we pass are of the same length
recent_grads = recent_grads.dropna()

cleaned_data_count = len(recent_grads.index)
print("Clean data row count:", cleaned_data_count)

Raw data row count: 173
Clean data row count: 172

We can see that only one row contained missing values and has now been dropped from the dataset.
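Before dropping rows, it can be useful to see which ones are incomplete. The following is a minimal sketch on a small made-up dataframe (the `demo` values are illustrative and are not taken from recent-grads.csv); the same two calls should work on `recent_grads`.

```python
import numpy as np
import pandas as pd

# Illustrative data only: three made-up majors, one with missing values
demo = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [1200.0, np.nan, 560.0],
    "Men": [700.0, np.nan, 200.0],
})

# Boolean mask that is True for rows with at least one missing value
has_missing = demo.isna().any(axis=1)

# Inspect the incomplete rows before removing them
incomplete_rows = demo[has_missing]
print(incomplete_rows)

# dropna() keeps only the rows with no missing values
cleaned = demo.dropna()
print(len(demo), "->", len(cleaned))
```

Looking at the incomplete rows first makes it easier to judge whether dropping them could bias the analysis.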

Data visualisation¶

Exploring relationships with scatter plots¶

To generate these scatter plots, we will leverage the plotting functionality within pandas using the .plot() method. Like pyplot, the plotting functionality in pandas is a wrapper for matplotlib. This means we can customise the plots when necessary by accessing the underlying Figure, Axes, and other matplotlib objects.
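As a small sketch of what "accessing the underlying objects" looks like in practice, the example below plots a made-up dataframe (the `demo` values are illustrative only) and then customises the result through the returned Axes and its Figure. The `matplotlib.use("Agg")` line is only there so the sketch runs outside a notebook; it is not needed with `%matplotlib inline`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative values only
demo = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 8, 12, 9]})

# DataFrame.plot() returns the matplotlib Axes it drew on
ax = demo.plot(x="x", y="y", kind="scatter", title="Demo scatter")

# Customise the plot through the Axes ...
ax.set_xlabel("x value")
ax.tick_params(bottom=False, left=False)

# ... and through the Figure that contains it
fig = ax.get_figure()
fig.set_size_inches(6, 4)
```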

Let's start by exploring the following relationships:

Sample_size vs. Median
Sample_size vs. Unemployment_rate
ShareWomen vs. Unemployment_rate
Full_time vs. Median
Men vs. Median
Women vs. Median

Here's a reminder of the dataframe column descriptions:

Column	Description
Sample_size	Sample size (unweighted) of full-time, year-round workers that provided salary data (used for Median)
Median	Median salary of full-time, year-round workers
Unemployment_rate	Unemployed / (Unemployed + Employed)
Full_time	Number employed 35 hours or more
ShareWomen	Women as share of total
Men	Male graduates
Women	Female graduates
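The derived columns can be checked against the raw counts. The sketch below recomputes Unemployment_rate and ShareWomen for the first row of the dataset (PETROLEUM ENGINEERING), using the values printed by `recent_grads.iloc[0]` earlier in the notebook.

```python
# Values taken from the first row of the dataset (PETROLEUM ENGINEERING)
employed = 1976
unemployed = 37
women = 282
total = 2339

# Unemployment_rate = Unemployed / (Unemployed + Employed)
unemployment_rate = unemployed / (unemployed + employed)

# ShareWomen = Women / Total
share_women = women / total

print(round(unemployment_rate, 7))  # 0.0183805
print(round(share_women, 6))        # 0.120564
```

Both results match the values stored in the dataset, which confirms how the two columns are defined.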

In [7]:

# create a figure and six axes arranged in a 3 row by 2 column layout
fig = plt.figure(figsize=(15,25))

# create list of columns to use for x and y values
col_x = ['Sample_size', 'Sample_size', 'ShareWomen', 'Full_time', 'Men', 'Women']
col_y = ['Median', 'Unemployment_rate', 'Unemployment_rate', 'Median', 'Median', 'Median']

# loop over col_x and col_y, generating a scatter plot for each pair
for r in range(0,6):
    ax = fig.add_subplot(3,2,r+1)
    # call matplotlib directly rather than using the pandas plot() method wrapper
    ax.scatter(x=recent_grads[col_x[r]], y=recent_grads[col_y[r]])
    # set title and labels
    ax.set_title(str(r+1) + ". " + col_x[r] + " vs. " + col_y[r])
    ax.set_xlabel(col_x[r])
    ax.set_ylabel(col_y[r])

remove_ticks()

Scatter plot observations¶

None of the scatter plots demonstrate a significant relationship between the variables plotted. However, with the exception of the Unemployment_rate vs. ShareWomen plot, all plots seem to exhibit a degree of heteroscedasticity with a right skew.

We can see from the top two plots that there is a high level of variance at lower values of Sample_size, which then tapers off as Sample_size increases.

Plot 3 reveals that there is no relationship whatsoever between the share of women in a major and unemployment rate.

In the plots numbered 4, 5, and 6, there is a high level of variance in median income when the number of graduates is low, which tapers off as the number of graduates increases. We can see from plots 5 and 6 that there is a slightly greater variance in income for male graduates than female graduates, with males tending to earn slightly more.

Let's try to answer the following questions about our data:

do students in more popular majors make more money?
do students that majored in subjects that were majority female make more money?
is there any link between the number of full-time employees and median salary?

To do this, we will need to create a few more scatter plots.

Do students in more popular majors make more money?

Let's plot the total number of graduates in each major on the x-axis, and median annual salary on the y-axis.

In [8]:

# plot total graduates vs median income
recent_grads.plot(x='Total', y='Median', kind='scatter', title='Total vs. Median')
remove_ticks()

There is no correlation between the popularity of the major and median income earned by graduates; however, we can see that the variance in median income decreases as popularity increases. We have a lot of overlapping datapoints in the bottom left of the chart, so a hexagonal bin plot should give us a better idea of how this data is distributed.

In [9]:

# plot the same information as above but as a hexagonal bin
ax = recent_grads.plot.hexbin(
                         x='Total',
                         y='Median',
                         title='Hexbin showing Total graduates vs. Median income \n',
                         gridsize=15
                        )
remove_ticks()

The hexagonal bin plot reveals that for the majority of majors, most graduates tend to earn around 35,000 to 40,000 USD.

Do students that majored in subjects that were majority female make more money?

For this we will plot the share of women in a major on the x-axis, and median annual salary on the y-axis.

In [10]:

# plot share of women in a major vs median income
recent_grads.plot(x='ShareWomen', y='Median', kind='scatter', title='ShareWomen vs. Median')
remove_ticks()

The answer is no. There is a weak negative correlation between the share of women in a major and median income, indicating that graduates who majored in subjects that are majority female were likely to earn less money in their employment. This reflects the initial thoughts we had on the dataset when viewing the top and bottom rows, where we observed that the top ranked majors were majority male, and the bottom ranked were majority female.
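A correlation coefficient puts a number on what the scatter plot shows. The sketch below uses only the ten rows printed by `head()` and `tail()` earlier in the notebook; because these are the extremes of the ranking, the coefficient it prints will be far stronger than the one for the full dataset. It illustrates the call rather than reproducing the result.

```python
import pandas as pd

# ShareWomen and Median for the top 5 and bottom 5 majors,
# copied from the head() and tail() outputs shown earlier
demo = pd.DataFrame({
    "ShareWomen": [0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
                   0.637293, 0.817099, 0.799859, 0.798746, 0.877960],
    "Median": [110000, 75000, 73000, 70000, 65000,
               26000, 25000, 25000, 23400, 22000],
})

# Series.corr() computes the Pearson correlation coefficient by default
r = demo["ShareWomen"].corr(demo["Median"])
print(f"Pearson r = {r:.2f}")
```

On the full dataframe the equivalent call would be `recent_grads["ShareWomen"].corr(recent_grads["Median"])`.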

In [11]:

# plot number of graduates employed full time vs median income
recent_grads.plot(x='Full_time', y='Median', kind='scatter', title='Full_time vs. Median')
remove_ticks()
# plot hex bin chart to help identify distribution
recent_grads.plot.hexbin(
                         x='Full_time',
                         y='Median',
                         title='Number of graduates working full time vs. Median income \n',
                         gridsize=15
                        )
remove_ticks()

Is there any link between the number of full-time employees and median salary?

We plotted this scatter graph earlier but here it is again as a reminder. There is no relationship between the number of graduates who are in full-time employment and income earned. The hexagonal bin plot reveals a similar pattern to the previous one: most graduates working full-time tend to earn around 35,000 to 40,000 USD.

Explore distributions with histograms¶

To create our histograms, we will use the pandas Series.hist() method. We could also use Series.plot() with the kind parameter set to 'hist', which accepts a bins argument as well, but Series.hist() exposes parameters specific to customising histograms directly in its signature, including the number of bins, the grid, and the x-axis label rotation.

In [12]:

## this time, we will attempt to generate all our plots in one go using a loop

# create list of columns of interest
cols = ["Sample_size", 
        "Median", 
        "Employed", 
        "Full_time", 
        "ShareWomen", 
        "Unemployment_rate", 
        "Men", 
        "Women"]

# choose the number of bins using a simple binning strategy of taking the 
# square root of the total number of rows and rounding up to an integer
bins = math.ceil(math.sqrt(len(recent_grads.index)))

fig = plt.figure(figsize=(15,25))

# loop over the list of columns and plot histograms
for r in range(0,8):
    ax = fig.add_subplot(4,2,r+1)
    ax = recent_grads[cols[r]].hist(bins = bins, xrot=45, grid = False)
    ax.set_ylabel("Frequency")
    ax.set_title(cols[r])

remove_ticks()
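The bin count produced by the square-root rule in the cell above can be worked out by hand. The sketch below does so for the 172 rows left after cleaning, and shows Sturges' rule next to it purely as a point of comparison (Sturges' rule is not used anywhere in this notebook).

```python
import math

n_rows = 172  # rows left after dropping the record with missing values

# Square-root rule: number of bins = ceil(sqrt(n))
sqrt_bins = math.ceil(math.sqrt(n_rows))

# Sturges' rule, for comparison only: number of bins = ceil(log2(n)) + 1
sturges_bins = math.ceil(math.log2(n_rows)) + 1

print(sqrt_bins, sturges_bins)  # 14 9
```

Different rules give noticeably different bin counts, so it is worth trying more than one when a histogram looks too coarse or too noisy.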

Histogram observations¶

`Sample_size`¶

The distribution has a heavy right skew, with the vast majority of majors having a sample size of less than 250 graduates.

As a reminder, the Sample_size column in our dataset refers to the number of graduates working full-time who reported their annual salary, which in turn is used to produce the Median column (median annual salary) for each major. Although it would depend on the total number of graduates in each major, this histogram suggests that most majors have a weak sample size for calculating the median annual salary, which could influence our results.

`Median`¶

According to statistics website Statista, the average income of a college graduate in 2012 was 42,315 USD. The plot shows that the majority of our data falls in the range of 30,000 to 40,000 USD, which is roughly in line with the Statista figure.

`Employed`¶

The number of employed graduates for the majority of majors is less than 25,000. This information is not that useful on its own, but could be used alongside the total number of graduates to find out the employment rate for each major.
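One way to derive such a rate is to divide Employed by Total. The sketch below does this for the first three rows of the dataset (values copied from the `head()` output earlier). The column name `Employment_rate` is a hypothetical name chosen for this example, and dividing by Total is an assumption: Total also counts graduates who are not in the labour force.

```python
import pandas as pd

# First three rows of the dataset, copied from the head() output shown earlier
demo = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING",
              "MINING AND MINERAL ENGINEERING",
              "METALLURGICAL ENGINEERING"],
    "Total": [2339, 756, 856],
    "Employed": [1976, 640, 648],
})

# Hypothetical derived column: share of all graduates of the major who are employed
demo["Employment_rate"] = demo["Employed"] / demo["Total"]
print(demo[["Major", "Employment_rate"]].round(3))
```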

`Full_time`¶

The histogram follows a very similar pattern to the Employed histogram, suggesting that the majority of employed graduates are working full time, rather than part time. If we plotted these variables on a scatter plot, we could probably expect a strong positive correlation.

`ShareWomen`¶

We can make a rough assumption from this histogram that for the majority of majors in our dataset, the share of women graduates is greater than the share of men. We could analyse this data further to be certain of our assumption.

`Unemployment_rate`¶

The majority of majors have an unemployment rate of around 0.05 to 0.08 (5% to 8%), which is roughly in line with the national average.

`Men` and `Women`¶

Both these histograms are fairly similar and we can't derive anything particularly interesting from either plot.

Explore relationships with scatter matrix plots¶

Next, we will explore the data using scatter matrix plots so that we can explore potential relationships and distributions simultaneously.

In [13]:

# the scatter matrix function is part of the pandas.plotting module, which needs to be imported separately
from pandas.plotting import scatter_matrix

In [14]:

# create scatter matrix plot of Sample size vs Median income
scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(10,10))
remove_ticks()

In [15]:

# create scatter matrix plot of Sample size, Median income, and Unemployment_rate
scatter_matrix(recent_grads[['Sample_size', 'Median', "Unemployment_rate"]], figsize=(10,10))
remove_ticks()

Scatter matrix observations¶

While these scatter matrices do not provide us with any information we didn't already know from previous plots, they have proven to be a quick means of exploring relationships between multiple variables.

Explore relationships with bar plots¶

Let's look at the share of women for each major ranked in the top 10 and compare this with the share of women for each major ranked in the bottom 10.

In [16]:

fig, axs = plt.subplots(nrows = 1, ncols = 2, figsize = (20, 6))

# slice the dataset to get the top and bottom 10 rows and
# enforce a range on the y-axis of 0 to 1 to make it easier to compare both plots
ax1 = recent_grads[:10].plot.bar(
                                 x='Major', 
                                 y='ShareWomen', 
                                 ylim=(0,1), 
                                 ax = axs[0], 
                                 legend=False, 
                                 color="lightblue",
                                 title="Share of women in top 10 majors"
                                 )
ax2 = recent_grads[-10:].plot.bar(x='Major',
                                  y='ShareWomen',
                                  ylim=(0,1),
                                  ax = axs[1],
                                  legend=False,
                                  color="lightblue",
                                  title="Share of women in bottom 10 majors"
                                 )

ax1.set_ylabel("ShareWomen")
ax2.set_ylabel("ShareWomen")

# demarcate threshold for greater share of women
ax1.axhline(0.5, color="red", linewidth=2)
ax2.axhline(0.5, color="red", linewidth=2)

remove_ticks()

Observations¶

These bar plots reinforce our inference that graduates of majority-female majors tend to earn less than graduates of majority-male majors. The red line on both plots marks the point where the shares of women and men are equal (0.5). Any bars below the red line are majority male, whereas any bars above the red line are majority female.

We can see from the plot on the left that only one major ranking in the top ten for median income had a greater share of women than men: astronomy and astrophysics. Meanwhile, in the right plot, all bottom ranking majors for median income have a greater share of women than men.

We can also see that engineering disciplines have a strong tendency to be male dominated subjects, while life sciences and psychological sciences tend to be much more popular among women.

Out of interest, let's calculate the average median income for majority male majors and majority female majors.

In [17]:

# calculate the average income of graduates by whether the major was majority male or majority female
female_major_income = recent_grads.loc[recent_grads["ShareWomen"] > 0.5, "Median"].mean()
male_major_income = recent_grads.loc[recent_grads["ShareWomen"] < 0.5, "Median"].mean()
print("Average income (USD) across majors")
print("-------------------------------")
print("Majority female: {:.2f}".format(female_major_income))
print("Majority male: {:.2f}".format(male_major_income))

Average income (USD) across majors
-------------------------------
Majority female: 34605.21
Majority male: 46988.16
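The same split can be written as a single groupby on a derived key. The sketch below uses only the ten rows printed by `head()` and `tail()` earlier, so its averages describe those ten majors and not the full dataset. Note one difference from the cell above: a major with ShareWomen of exactly 0.5 is excluded by both masks there, whereas here it would fall into the "Majority male" group.

```python
import pandas as pd

# ShareWomen and Median for the top 5 and bottom 5 majors,
# copied from the head() and tail() outputs shown earlier
demo = pd.DataFrame({
    "ShareWomen": [0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
                   0.637293, 0.817099, 0.799859, 0.798746, 0.877960],
    "Median": [110000, 75000, 73000, 70000, 65000,
               26000, 25000, 25000, 23400, 22000],
})

# Label each major by which gender holds the larger share
group = (demo["ShareWomen"] > 0.5).map({True: "Majority female",
                                        False: "Majority male"})

# Average the median income within each group
mean_by_group = demo.groupby(group)["Median"].mean()
print(mean_by_group)
```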

Now let's look at the unemployment rate from the ten top and bottom ranking majors.

In [18]:

fig, axs = plt.subplots(nrows = 1, ncols = 2, figsize = (20, 6))

# use slices to get the top and bottom 10 rows
ax1 = recent_grads[:10].plot.bar(
                                 x='Major',
                                 y='Unemployment_rate',
                                 ax = axs[0],
                                 legend=False, 
                                 color="lightblue",
                                 title="Unemployment rate in top 10 majors"
                                 )
ax2 = recent_grads[-10:].plot.bar(
                                  x='Major',
                                  y='Unemployment_rate',
                                  ax = axs[1],
                                  legend=False,
                                  color="lightblue",
                                  title="Unemployment rate in bottom 10 majors"
                                  )

ax1.set_ylabel("Unemployment_rate")
ax2.set_ylabel("Unemployment_rate")

remove_ticks()

Observations¶

From the plots we can see that there is not a significant disparity in unemployment rate between the top and bottom ranking majors, indicating that the median income for graduates of any major is not influenced by the employment prospects of that major (note: this is also demonstrated in the scatter matrix above for the plot showing Unemployment_rate and Median). That said, it does appear that unemployment rate is, on average, higher among bottom ranking majors. We could investigate this further to be conclusive...

In [19]:

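# Summarise unemployment rate for top 10 and bottom 10 majors
# use median rather than average to insulate from outlier (e.g. Nuclear engineering)
# slice by position with .iloc: label-based .loc[-10:] would select every row,
# and .loc[:10] would select the 11 rows labelled 0 to 10
print("Median unemployment rate", "\n", "--------------------------")
print("Bottom 10 majors: {:.3g}".format(recent_grads.iloc[-10:]["Unemployment_rate"].median()))
print("Top 10 majors: {:.3g}".format(recent_grads.iloc[:10]["Unemployment_rate"].median()))

Median unemployment rate 
 --------------------------
Bottom 10 majors: 0.0675
Top 10 majors: 0.0592

The median for the top 10 majors is slightly lower, but not by any significant amount. These printed figures should be treated with caution: they were produced with label-based .loc slices, under which the "Bottom 10" value is the median across all majors and the "Top 10" value covers index labels 0 to 10, so the cell needs re-running with the positional slices to confirm the comparison.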
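Label-based and positional slicing behave differently on an integer index, which matters when selecting the top and bottom rows of a dataframe. The sketch below uses a made-up dataframe (the `demo` values are illustrative only) whose index has a gap, as happens after `dropna()`.

```python
import pandas as pd

# Illustrative frame whose integer index has a gap (label 5 is missing)
demo = pd.DataFrame(
    {"rate": [0.02, 0.12, 0.03, 0.05, 0.06, 0.18, 0.09]},
    index=[0, 1, 2, 3, 4, 6, 7],
)

# .loc slices by label: -3 is not a label, so on this sorted index
# the slice starts at the first row and returns everything
by_label_tail = demo.loc[-3:, "rate"]

# .loc label slices include the end label, so this returns labels 0, 1, 2 and 3
by_label_head = demo.loc[:3, "rate"]

# .iloc slices by position, like a Python list
by_position_tail = demo.iloc[-3:]["rate"]
by_position_head = demo.iloc[:3]["rate"]

print(len(by_label_tail), len(by_label_head))        # 7 4
print(len(by_position_tail), len(by_position_head))  # 3 3
```

For "first n rows" and "last n rows" selections, `.iloc` (or `head(n)` and `tail(n)`) is the safer choice.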

On another note, it is interesting that the major with the highest unemployment rate out of this selection is a top ranking major, 'Nuclear engineering'. While the median income of nuclear engineering graduates in full time employment is relatively high, perhaps there may not be enough of a job market for nuclear engineering roles, though if this was the case it would certainly be contrary to the laws of supply vs. demand.

Explore relationships with a grouped bar plot¶

Let's use a grouped bar plot to compare the number of men with the number of women in each category of major.

In [20]:

# select the women, men, and major_category columns and store in a new dataframe
women_men_majors = recent_grads[["Women", "Men", "Major_category"]].copy()
# use groupby() operation to group data by the major category column and sum the men and women values
grouped_data = women_men_majors.groupby("Major_category").sum()
# generate grouped horizontal bar plot to compare number of men with the number of women in each category of majors
grouped_data.plot.barh(figsize=(10,15))

remove_ticks()

Observations¶

We can conclude a few things from the plot above:

there seem to be more female graduates than male graduates (let's visualise this in another bar plot below)
the category of major with the most men and women graduates is business, making it the most popular category of major for both sexes; the ratio of men to women is fairly equal, with just slightly more men than women
the following categories of majors stand out as being particularly female dominated:
- education
- health
- humanities and liberal arts
- psychology and social work
categories of majors that appear to be particularly male dominated:
- computers and mathematics
- engineering

In [21]:

# visualise number of female graduates to male graduates

# sum both columns in dataframe and plot the resulting series as a bar chart
recent_grads[["Women", "Men"]].sum().plot.bar(title="Number of female and male graduates")

remove_ticks()

There are nearly 4,000,000 female graduates, which is decisively more than the number of male graduates (approx. 3,000,000).

Investigating high ratios of men to women and vice versa¶

Let's take a closer look at the categories of major that are either significantly male or female dominated, and plot these on a bar chart to show median income by major category. Instead of just using the categories that we identified from the barchart above, we'll take a more calculated approach and select majors with graduates that are at least 2/3rds male or female.

In [22]:

# select columns of interest
select_cols = recent_grads[["Major_category", "Median", "ShareWomen"]]

# aggregate mean values of median income and share of women by major category
grouped = select_cols.groupby("Major_category").mean()

# use bool mask to select only majors that are significantly male or female dominated
gender_dominated = grouped[(grouped["ShareWomen"] >= 0.66) | (grouped["ShareWomen"] <= 0.33)].copy()

# create new column and assign a value to indicate which gender dominates the major category
gender_dominated["ShareGender"] = ['female' if x >= 0.66 else 'male' for x in gender_dominated['ShareWomen']]

# sort dataframe by 'Median' column and print to check
gender_dominated.sort_values(by="Median", inplace=True, ascending=False)
print(gender_dominated)

# plot data on bar chart with colour coding for gender
gender_dominated.plot.bar(
                          y="Median",
                          legend=False,
                          color=gender_dominated["ShareGender"].map({"male": 'g', "female": 'b'})
                         )
plt.ylabel("Median income (USD)")
plt.title("Median income of major categories with a significant gender skew \n")

# import patches module to create a custom legend
import matplotlib.patches as mpatches
green_patch = mpatches.Patch(color='green', label='Male dominated')
blue_patch = mpatches.Patch(color='blue', label='Female dominated')
plt.legend(handles=[green_patch, blue_patch])

remove_ticks()

                                Median  ShareWomen ShareGender
Major_category                                                
Engineering               57382.758621    0.238889        male
Computers & Mathematics   42745.454545    0.311772        male
Health                    36825.000000    0.795152      female
Interdisciplinary         35000.000000    0.770901      female
Education                 32350.000000    0.748507      female
Psychology & Social Work  30100.000000    0.794397      female

From this plot we can see the male dominated major categories (green) earn on average more money than female dominated majors (blue), reinforcing our findings from earlier in this project.

Explore distributions with box plots¶

Now let's make some box plots to explore the distributions of median salaries and unemployment rate.

In [23]:

from pandas.plotting import boxplot

# generate box plot to show distribution of median income
recent_grads.boxplot(column="Median", figsize=(10,5))
plt.ylabel("Median income (USD)")
plt.title("Median income for all majors")

# hide unnecessary x-axis label
# use plt.gca() to get the existing axes; plt.axes() would add a new, empty axes to the figure
ax1 = plt.gca()
x_axis = ax1.get_xaxis()
x_axis.set_visible(False)

remove_ticks()

Observations¶

(Outliers not considered)

the distribution is positively skewed, indicating that the median is less than the mean, i.e. more than half of majors have a median salary below the average
the interquartile range is ~12,000 USD
the median annual income across the majors in our dataset is ~35,000 USD
75% of majors have a median income above ~34,000 USD
the maximum median salary is ~62,000 USD
the minimum median salary is ~22,000 USD
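The whisker limits of the box plot can be derived from the quartiles. The sketch below uses the 25% and 75% values of the Median column from the `describe()` output earlier; that table was computed before the incomplete row was dropped, so the quartiles of the cleaned data may differ slightly. It assumes the default whisker setting of 1.5 times the interquartile range.

```python
# Quartiles of the Median column, taken from the describe() output shown earlier
q1 = 33000.0
q3 = 45000.0

# Interquartile range: the height of the box
iqr = q3 - q1

# With the default setting, whiskers extend to the most extreme data points
# that lie within 1.5 * IQR of the box; points beyond are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(iqr, lower_fence, upper_fence)  # 12000.0 15000.0 63000.0
```

The upper fence of 63,000 USD is consistent with the top whisker ending at around 62,000 USD, and the lower fence sits below the smallest value in the data, which is why there are no low outliers.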

In [24]:

# generate box plot to show distribution of unemployment rate
recent_grads.boxplot(column="Unemployment_rate", figsize=(10,5))
# the column holds proportions (e.g. 0.07), not percentages
plt.ylabel("Unemployment rate (proportion)")
plt.title("Unemployment rate for all majors")
# hide unnecessary x-axis label
# use plt.gca() to get the existing axes; plt.axes() would add a new, empty axes to the figure
ax1 = plt.gca()
x_axis = ax1.get_xaxis()
x_axis.set_visible(False)

remove_ticks()

Observations¶

the distribution looks roughly symmetric, and the median unemployment rate is around 7%
the minimum value is on the baseline of the plot (0), thus there must be at least one major with every graduate employed; we'll take a closer look at this next
the maximum unemployment rate is approximately 13%, excluding outliers, which is well above the national average. If we include the outliers, the highest value is ~18%; this must be the Nuclear Engineering major we identified as having an unusually high unemployment rate earlier. Let's explore this further after looking into which majors have an unemployment rate of 0%.

In [25]:

# lookup majors with an unemployment rate of 0
recent_grads.loc[recent_grads["Unemployment_rate"] == 0, ["Major", "Employed", "Unemployed"]]

Out[25]:

	Major	Employed
52	MATHEMATICS AND COMPUTER SCIENCE	559
73	MILITARY TECHNOLOGIES	0
83	BOTANY	1010
112	SOIL SCIENCE	613
120	EDUCATIONAL ADMINISTRATION AND SUPERVISION	703

It turns out there are a few majors with no unemployed graduates, but one of the majors, Military Technologies, has no employed people either! Perhaps only civilian jobs were recorded for this dataset. The other majors seem like safe bets for landing a job post-graduation.

In [26]:

# lookup majors with an unemployment rate of >= 12%
recent_grads.loc[recent_grads["Unemployment_rate"] >= 0.12, ["Major", "Employed", "Unemployed", "Unemployment_rate"]]

Out[26]:

	Major	Employed	Unemployed	Unemployment_rate
5	NUCLEAR ENGINEERING	1857	400	0.177226
29	PUBLIC POLICY	4547	670	0.128426
84	COMPUTER NETWORKING AND TELECOMMUNICATIONS	6144	1100	0.151850
89	PUBLIC ADMINISTRATION	4158	789	0.159491
170	CLINICAL PSYCHOLOGY	2101	368	0.149048

Recall from the bar charts we created earlier on unemployment rate that we were surprised to see that Nuclear Engineering, a top ten ranking major, had such a high unemployment rate, so high in fact that it is an outlier on our box plot. Other outlier majors with a high unemployment rate include Public Policy, Computer Networking and Telecommunications, Public Administration, and Clinical Psychology. Perhaps these majors should be avoided unless the prospective student has a genuine passion for the field?

Conclusion¶

Let's summarise what we have learned from analysing this dataset. Near the start, we set out three questions to try and find answers to:

1. Do students in more popular majors make more money?

This was inconclusive: we did not find a statistically significant relationship between the total number of graduates of a major and the median income earned. However, we could identify a degree of heteroscedasticity: as the number of graduates increased, the variation in median income decreased.

2. Do students that majored in subjects that were majority female make more money?

We identified a weak negative correlation between the share of women in a major and the median income earned, indicating the opposite is true: graduates who majored in subjects that were majority female earned, on average, less money than graduates who majored in majority male subjects. We determined that majority male majors had an average median income of 46,988.16 USD, while majority female majors averaged 34,605.21 USD. Only one major ranked in the top 10 for median income was majority female, astronomy and astrophysics, while every single major ranked in the bottom 10 was majority female.

3. Is there any link between the number of full-time employees and median salary?

There is no relationship between the number of graduates who are in full-time employment and income earned.

Other notable insights:

There are more female graduates than male graduates (X% more)
Most graduates earn around 35,000 to 40,000 USD
8 out of 10 of the top ten majors were an engineering discipline
Only one top 10 major had a greater share of women to men: astronomy and astrophysics
Business was the most popular category of major for both male and female students.
The majority of graduates who are employed work full-time rather than part-time.
The unemployment rate for most majors was around 6-8%
Majors with no unemployment: Mathematics and Computer Science, Botany, Soil Science, Educational Administration and Supervision. Note that of these majors, the highest ranking major is rank 53 out of 173 by median income (index 52 in the dataframe), so although all of these majors are great for getting a job, none are particularly well paid.
Majors with a high unemployment rate: Nuclear Engineering, Public Policy, Computer Networking and Telecommunications, Public Administration, and Clinical Psychology.

Possible further investigations:

Sample size

At a glance, the Sample_size for each major appears to be inadequate. Since this information is used to determine the median income for graduates of a major, and by extension, determine the major's rank relative to other majors, it is important that the sample size is large enough to accurately represent the population (i.e. number of graduates of the major).

To determine if the sample size is suitable, we will need to use a formula and apply this to every row in the dataset. If we find that the sample size is insufficient for a significant number of majors, this could invalidate any of our analyses that used Sample_size, Median, and Rank.
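The notebook does not say which formula would be used. As one possible starting point, the sketch below applies Cochran's sample-size formula with a finite population correction, a common rule of thumb for survey proportions. The function name `required_sample_size`, the 5% margin of error, the 95% confidence level (z = 1.96), and p = 0.5 are all assumptions made for this example, and a formula designed for proportions is only a rough guide for a median salary.

```python
import math

def required_sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Cochran's sample-size formula with finite population correction.

    Hypothetical helper for illustration; the defaults are assumptions.
    """
    # Sample size for an effectively infinite population
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Shrink it for a finite population of the given size
    return math.ceil(n0 / (1 + (n0 - 1) / population))

# PETROLEUM ENGINEERING has Total = 2339 and Sample_size = 36
print(required_sample_size(2339))  # 331
```

Applied to every row with something like `recent_grads["Total"].apply(required_sample_size)`, the result could then be compared against the Sample_size column.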

Nuclear Engineering

It could be interesting to investigate why a top 10 ranking major, Nuclear Engineering, has the worst unemployment rate of all majors.
