A look at recent graduates and their future earnings.
In this project I will use Pandas plotting functionality to create scatter plots, histograms and bar charts to identify any trends between the choice of university major, gender and their subsequent earnings.
I'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by the American Community Survey, which conducts surveys and aggregates the data. FiveThirtyEight cleaned the dataset and released it on their GitHub repo.
Let's start by importing the necessary modules and reading the dataset into a Pandas DataFrame.
#Import modules.
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
# Read data file into Pandas DataFrame
recent_grads = pd.read_csv('recent-grads.csv')
Now we can take a look at some of our data and see what columns we are working with. Let's start by showing the first row of data. As the data is sorted by median earnings, the first row is the major whose graduates have the highest earnings.
# Show the first row of data.
recent_grads.head(1)
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 rows × 21 columns
This shows the columns on the left hand side and the values of the first row on the right hand side.
Some of the important columns are as follows:
As you can see from our first row, the major with the highest median earnings is Petroleum Engineering with a median salary of $110,000. Of the Petroleum Engineering graduates only 12% are women and less than 2% of graduates are unemployed. It's still a good time to be an oil man!
Now let's have a look at the top 5 rows and the bottom 5 rows. These have been sorted by the graduates' median earnings. The top 5 rows therefore show the 5 majors with the highest median earnings and the bottom 5 rows show the 5 majors with the lowest median earnings.
# Select the top 5 rows from the dataset.
recent_grads.head(5)
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 150 | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 5180 | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 |
5 rows × 21 columns
From the above we can see that the 5 majors with the highest median earnings are all engineering related and they are all dominated by males. Four of the five have a share of women of around 15% or less; the exception is fifth-ranked chemical engineering, where 34% of graduates are women.
# Select the bottom 5 rows from the dataset.
recent_grads.tail(5)
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
168 | 169 | 3609 | ZOOLOGY | 8409.0 | 3050.0 | 5359.0 | Biology & Life Science | 0.637293 | 47 | 6259 | ... | 2190 | 3602 | 304 | 0.046320 | 26000 | 20000 | 39000 | 2771 | 2947 | 743 |
169 | 170 | 5201 | EDUCATIONAL PSYCHOLOGY | 2854.0 | 522.0 | 2332.0 | Psychology & Social Work | 0.817099 | 7 | 2125 | ... | 572 | 1211 | 148 | 0.065112 | 25000 | 24000 | 34000 | 1488 | 615 | 82 |
170 | 171 | 5202 | CLINICAL PSYCHOLOGY | 2838.0 | 568.0 | 2270.0 | Psychology & Social Work | 0.799859 | 13 | 2101 | ... | 648 | 1293 | 368 | 0.149048 | 25000 | 25000 | 40000 | 986 | 870 | 622 |
171 | 172 | 5203 | COUNSELING PSYCHOLOGY | 4626.0 | 931.0 | 3695.0 | Psychology & Social Work | 0.798746 | 21 | 3777 | ... | 965 | 2738 | 214 | 0.053621 | 23400 | 19200 | 26000 | 2403 | 1245 | 308 |
172 | 173 | 3501 | LIBRARY SCIENCE | 1098.0 | 134.0 | 964.0 | Education | 0.877960 | 2 | 742 | ... | 237 | 410 | 87 | 0.104946 | 22000 | 20000 | 22000 | 288 | 338 | 192 |
5 rows × 21 columns
Meanwhile, the five lowest paid majors by median earnings include 3 majors in the Psychology & Social Work category, and all 5 are dominated by women: every one except Zoology (64%) is roughly 80% female or more.
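The claim that the rows are ordered by median earnings can be checked rather than assumed. Below is a minimal sketch that uses only the ten Median values displayed above (the top 5 and bottom 5 rows); the variable names are my own and the same check could be run on the full `recent_grads['Median']` column.

```python
import pandas as pd

# Median earnings copied from the ten rows displayed above (top 5, then bottom 5).
medians = pd.Series(
    [110000, 75000, 73000, 70000, 65000, 26000, 25000, 25000, 23400, 22000],
    name="Median",
)

# True when each value is less than or equal to the one before it
# (ties, such as the two 25000 rows, are allowed).
is_sorted_desc = bool(medians.is_monotonic_decreasing)
print(is_sorted_desc)
```

If the check came back False on the full column, `head()` and `tail()` would no longer correspond to the highest and lowest earning majors.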
Let's finish our initial exploration by having a quick look at the statistics for our columns with numeric values.
# Generate a statistical summary for our numeric columns.
recent_grads.describe()
| | Rank | Major_code | Total | Men | Women | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 173.000000 | 173.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
mean | 87.000000 | 3879.815029 | 39370.081395 | 16723.406977 | 22646.674419 | 0.522223 | 356.080925 | 31192.763006 | 26029.306358 | 8832.398844 | 19694.427746 | 2416.329480 | 0.068191 | 40151.445087 | 29501.445087 | 51494.219653 | 12322.635838 | 13284.497110 | 3859.017341 |
std | 50.084928 | 1687.753140 | 63483.491009 | 28122.433474 | 41057.330740 | 0.231205 | 618.361022 | 50675.002241 | 42869.655092 | 14648.179473 | 33160.941514 | 4112.803148 | 0.030331 | 11470.181802 | 9166.005235 | 14906.279740 | 21299.868863 | 23789.655363 | 6944.998579 |
min | 1.000000 | 1100.000000 | 124.000000 | 119.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 111.000000 | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.000000 | 2403.000000 | 4549.750000 | 2177.500000 | 1778.250000 | 0.336026 | 39.000000 | 3608.000000 | 3154.000000 | 1030.000000 | 2453.000000 | 304.000000 | 0.050306 | 33000.000000 | 24000.000000 | 42000.000000 | 1675.000000 | 1591.000000 | 340.000000 |
50% | 87.000000 | 3608.000000 | 15104.000000 | 5434.000000 | 8386.500000 | 0.534024 | 130.000000 | 11797.000000 | 10048.000000 | 3299.000000 | 7413.000000 | 893.000000 | 0.067961 | 36000.000000 | 27000.000000 | 47000.000000 | 4390.000000 | 4595.000000 | 1231.000000 |
75% | 130.000000 | 5503.000000 | 38909.750000 | 14631.000000 | 22553.750000 | 0.703299 | 338.000000 | 31433.000000 | 25147.000000 | 9948.000000 | 16891.000000 | 2393.000000 | 0.087557 | 45000.000000 | 33000.000000 | 60000.000000 | 14444.000000 | 11783.000000 | 3466.000000 |
max | 173.000000 | 6403.000000 | 393735.000000 | 173809.000000 | 307087.000000 | 0.968954 | 4212.000000 | 307933.000000 | 251540.000000 | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
The first row shows the total number of values for each column and you will notice that some of the columns have a count of 173 whilst others have 172. This means that we have some missing values that now need to be removed before we can begin plotting some charts.
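# Remove rows with missing values. dropna() returns a new DataFrame,
# so assign the result back to recent_grads, then display it.
recent_grads = recent_grads.dropna()
recent_grads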
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 150 | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 5180 | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
168 | 169 | 3609 | ZOOLOGY | 8409.0 | 3050.0 | 5359.0 | Biology & Life Science | 0.637293 | 47 | 6259 | ... | 2190 | 3602 | 304 | 0.046320 | 26000 | 20000 | 39000 | 2771 | 2947 | 743 |
169 | 170 | 5201 | EDUCATIONAL PSYCHOLOGY | 2854.0 | 522.0 | 2332.0 | Psychology & Social Work | 0.817099 | 7 | 2125 | ... | 572 | 1211 | 148 | 0.065112 | 25000 | 24000 | 34000 | 1488 | 615 | 82 |
170 | 171 | 5202 | CLINICAL PSYCHOLOGY | 2838.0 | 568.0 | 2270.0 | Psychology & Social Work | 0.799859 | 13 | 2101 | ... | 648 | 1293 | 368 | 0.149048 | 25000 | 25000 | 40000 | 986 | 870 | 622 |
171 | 172 | 5203 | COUNSELING PSYCHOLOGY | 4626.0 | 931.0 | 3695.0 | Psychology & Social Work | 0.798746 | 21 | 3777 | ... | 965 | 2738 | 214 | 0.053621 | 23400 | 19200 | 26000 | 2403 | 1245 | 308 |
172 | 173 | 3501 | LIBRARY SCIENCE | 1098.0 | 134.0 | 964.0 | Education | 0.877960 | 2 | 742 | ... | 237 | 410 | 87 | 0.104946 | 22000 | 20000 | 22000 | 288 | 338 | 192 |
172 rows × 21 columns
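The missing-value check described above can be sketched on a small stand-in table. The toy DataFrame and its values are hypothetical, made up purely for illustration; the point is that `isna().sum()` counts missing entries per column, and that `dropna()` returns a new DataFrame and leaves the original untouched unless the result is assigned back.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for recent_grads; row "B" has a missing Total.
toy = pd.DataFrame(
    {
        "Major": ["A", "B", "C"],
        "Total": [2000.0, np.nan, 900.0],
        "Median": [60000, 40000, 35000],
    }
)

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() returns a new DataFrame; toy itself still has all three rows.
cleaned = toy.dropna()
print(len(toy), len(cleaned))
```

Comparing the lengths before and after is a quick way to confirm how many rows were actually removed.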
Now that we have removed the row with missing data, let's plot some charts.
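To begin I am going to make a series of scatter plots to answer the following questions:
1. Do students in more popular majors make more money?
2. Do students that majored in subjects that were majority female make more money?
3. Is there any link between the number of full-time employees and median salary?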
The chart below looks at the first question. To decide the popular majors I have used the Total column, which gives the number of graduates for each major, and plotted the results on the x-axis. On the y-axis I have taken the results from the Median column, which gives the median annual income of graduates for each major. A positive correlation would indicate that the more popular majors do indeed make more money.
# Plot first scatter graph
ax = recent_grads.plot(x='Total', y='Median', kind='scatter')
ax.set_title('Total number of students vs Median annual income')
ax.set_ylabel('Median annual income')
ax.set_xlabel('Total number of students for major')
Text(0.5, 0, 'Total number of students for major')
What does this chart tell us? That there is no clear correlation between the popularity of a major and future earnings. From this sample at least it would appear that students are not choosing their major based on potential future earnings. Perhaps money isn't the great motivator that capitalists would have us believe.
Let's move on to our second question: Do students that majored in subjects that were majority female make more money? For this I will plot the share of women on the x-axis and the median annual income on the y-axis.
# Plot second scatter graph
ax = recent_grads.plot(x='ShareWomen', y='Median', kind='scatter')
ax.set_title('Share of women vs Median annual income')
ax.set_ylabel('Median annual income')
ax.set_xlabel('Share of Women')
Text(0.5, 0, 'Share of Women')
Do students that majored in subjects that were majority female make more money? No.
In fact the chart shows a slight negative correlation, which indicates that students who majored in subjects that were majority male were likely to earn more money. This matches what we saw at the start of the project, where the top 5 highest earning majors were male dominated and the bottom five majors for median earnings were female dominated.
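A correlation read off a scatter plot can be backed up with a number. The sketch below computes the Pearson correlation between ShareWomen and Median with `Series.corr`, using only the ten rows displayed earlier (top 5 and bottom 5). Because those rows are the two extremes of the ranking, the value will overstate the strength of the relationship compared with the full dataset, so treat it as an illustration of the method only.

```python
import pandas as pd

# ShareWomen and Median copied from the ten rows displayed earlier (top 5, then bottom 5).
sample = pd.DataFrame(
    {
        "ShareWomen": [0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
                       0.637293, 0.817099, 0.799859, 0.798746, 0.877960],
        "Median": [110000, 75000, 73000, 70000, 65000,
                   26000, 25000, 25000, 23400, 22000],
    }
)

# Pearson correlation coefficient: -1 is a perfect negative linear
# relationship, 0 is none, +1 is a perfect positive one.
r = sample["ShareWomen"].corr(sample["Median"])
print(round(r, 2))
```

On the full data the same call would be `recent_grads['ShareWomen'].corr(recent_grads['Median'])`.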
We now come to the final question: Is there any link between the number of full-time employees and median salary? For this question I will plot the number of full-time employees on the x-axis and the median annual salary on the y-axis.
#Plot third scatter graph
ax = recent_grads.plot(x='Full_time', y='Median', kind='scatter')
ax.set_title('Number of full time employees vs Median annual income')
ax.set_ylabel('Median annual income')
ax.set_xlabel('Full time employees')
#Plot Hexagonal bin chart (x is Full_time so that the data matches the chart title)
ax = recent_grads.plot(kind='hexbin', x='Full_time', y='Median', gridsize=10)
ax.set_title('Number of full time employees vs Median annual income')
Text(0.5, 1.0, 'Number of full time employees vs Median annual income')
The data is all focussed strongly in the bottom left hand corner, around 35,000 for the median income and 10,000 students. I plotted a hexagonal bin chart underneath to try and make things a little bit clearer.
There is no link here, but perhaps we need to slightly change how we answer the question. Currently we just have the total number of full time employees. The more popular majors will therefore have more full-time employees, but they may also have more part-time employees than a less popular major.
Below I will add a new column to the recent_grads dataset. This column will be the share of graduates for each major who have a full time job. To find the share I have divided the number of full time employees for each major by the total number of graduates and stored the result in a new column, 'Full_time_share'.
# Create new column
recent_grads['Full_time_share'] = recent_grads['Full_time'] / recent_grads['Total']
print(recent_grads['Full_time_share'])
0 0.790509
1 0.735450
2 0.651869
3 0.849762
4 0.718227
...
168 0.599715
169 0.647512
170 0.607470
171 0.681799
172 0.540073
Name: Full_time_share, Length: 173, dtype: float64
OK, now that I have the full time share column I can try to answer the third question (Is there any link between the number of full-time employees and median salary?) by plotting the full time share column against the median column.
# Plot scatter graph for third question using new column
ax = recent_grads.plot(x='Full_time_share', y='Median', kind='scatter')
ax.set_title('Full time share vs Median annual income')
ax.set_ylabel('Median annual income')
ax.set_xlabel('% with full time jobs')
Text(0.5, 0, '% with full time jobs')
We can see a weak positive correlation between the share of full time jobs and median annual income. This shows that there is a small link between the majors that have a higher percentage of full time jobs and the majors with a higher median annual income.
Now let's take a closer look at the individual columns by plotting some histograms. These charts will tell us how the values for each column are spread. Perhaps there is more we can learn. The first histogram is Sample size:
# Plot Sample Size Histogram
ax = recent_grads['Sample_size'].plot(kind='hist', bins=27, range=(0,2700))
ax.set_title('Sample Size')
ax.set_xlabel('Number Surveyed')
Text(0.5, 0, 'Number Surveyed')
Our first histogram shows us that the majority of majors have quite a small sample size. For over 70 of the majors in the dataset fewer than 100 people were surveyed. Are our results influenced by a small sample size?
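The number of majors below a given sample size can also be counted directly instead of being read off the histogram. This is a minimal sketch using the Sample_size values from the ten rows displayed earlier; the 100-person threshold comes from the text above, and the variable names are my own.

```python
import pandas as pd

# Sample_size copied from the ten rows displayed earlier (top 5, then bottom 5).
sample_sizes = pd.Series([36, 7, 3, 16, 289, 47, 7, 13, 21, 2], name="Sample_size")

# A boolean mask sums to the number of True entries.
small_count = int((sample_sizes < 100).sum())
small_share = small_count / len(sample_sizes)
print(small_count, small_share)
```

Applied to `recent_grads['Sample_size']`, the same two lines would give the exact count for the whole dataset.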
For the second histogram let's take a look at the median earnings.
# Plot Median histogram
ax = recent_grads['Median'].plot(kind='hist', bins=15, range=(1,110000))
ax.set_title('Median annual income')
ax.set_xlabel('Annual Income')
Text(0.5, 0, 'Annual Income')
The majority of majors result in a median annual income between 30 and 50 thousand a year. This is unsurprising, as it is roughly in line with the country average of 44,000. As these are young recent graduates, it does indicate that a college degree will get most people up to an average income quite quickly.
Next chart: Employed
# Plot employed histogram
ax = recent_grads['Employed'].plot(kind='hist', bins=10, range=(1, 200000))
ax.set_title('Employed')
Text(0.5, 1.0, 'Employed')
The majority of majors have fewer than 25,000 employed graduates. However, without knowing the total number of graduates for each major this really doesn't tell us anything. A more interesting chart is the next one, Unemployment rate:
# Plot Unemployment rate Histogram
ax = recent_grads['Unemployment_rate'].plot(kind='hist', bins=10, range=(0,0.5))
The majority of majors have an unemployment rate between 5% and 10%, which is around where the national average usually hovers as well.
Next up: Share of women
# Plot histogram for the share of women
ax = recent_grads['ShareWomen'].plot(kind='hist', bins=10, range=(0,1))
ax.set_title('Share of Women')
Text(0.5, 1.0, 'Share of Women')
The most common shares fall around 60% to 70%, indicating that there are definitely majors which are more popular with women than men and vice versa.
Finally let's do the last 2 histograms together. Men and Women:
# Plot Histograms for Men and Women
fig = plt.figure(figsize=(20,10))
ax_1 = fig.add_subplot(2, 2, 1)
ax_2 = fig.add_subplot(2, 2, 2)
ax_1.hist(recent_grads['Men'], bins=20, range=(0, 100000))
ax_1.set_title('Men')
ax_2.hist(recent_grads['Women'], bins=20, range=(0, 100000))
ax_2.set_title('Women')
Text(0.5, 1.0, 'Women')
Not much difference between the 2 charts really. Most majors have fewer than 5,000 men or women.
Now let's combine the scatter graphs and histograms together and make some scatter matrix plots. I will start by looking at Share of Women and Median.
# Import scatter matrix module
from pandas.plotting import scatter_matrix
# Plot scatter matrix for ShareWomen and Median
scatter_1 = scatter_matrix(recent_grads[['ShareWomen', "Median"]], figsize=(10, 10))
This is information that we have seen earlier in the project, but it is a nice visualization. We can go a step further and add in unemployment rate as well.
# Plot scatter matrix with ShareWomen, Median and Unemployment rate
scatter_2 = scatter_matrix(recent_grads[['Median', 'ShareWomen', 'Unemployment_rate']], figsize=(10,10))
Again this is information that we have looked at earlier but we can see it mapped together using a scatter matrix.
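A scatter matrix has a natural numeric companion in the correlation matrix, which puts one coefficient in each cell where the scatter matrix puts a plot. The sketch below calls `DataFrame.corr()` on the same three columns, again using only the ten rows displayed earlier, so the coefficients illustrate the layout rather than the true values for all majors.

```python
import pandas as pd

# Three columns copied from the ten rows displayed earlier (top 5, then bottom 5).
sample = pd.DataFrame(
    {
        "Median": [110000, 75000, 73000, 70000, 65000,
                   26000, 25000, 25000, 23400, 22000],
        "ShareWomen": [0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
                       0.637293, 0.817099, 0.799859, 0.798746, 0.877960],
        "Unemployment_rate": [0.018381, 0.117241, 0.024096, 0.050125, 0.061098,
                              0.046320, 0.065112, 0.149048, 0.053621, 0.104946],
    }
)

# Pairwise Pearson correlations; the diagonal is each column against itself.
corr_matrix = sample.corr()
print(corr_matrix.round(2))
```

The matrix is symmetric, so only the cells above (or below) the diagonal carry new information.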
Next up I am going to plot a couple of bar charts. On these bar charts I will be comparing the 10 majors with the highest median annual income against the 10 majors with the lowest median annual income. The first comparison will be with the share of women. Let's see how the share of women compares between high income majors and low income ones:
# Plot 2 bar charts for the Share of Women in high and low income majors
ax_1 = recent_grads[:10].plot.bar(x='Major', y='ShareWomen')
ax_2 = recent_grads[-10:].plot.bar(x='Major', y='ShareWomen')
Here is a visual representation of what I said at the start of the project: that the top earning degrees are male dominated and the lowest earning degrees are female dominated.
Now let's take a look at Unemployment rate:
# Plot 2 bar charts for unemployment rates in high and low income majors
ax_1 = recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate')
ax_1.set_title('High ranking majors')
ax_2 = recent_grads[-10:].plot.bar(x='Major', y='Unemployment_rate')
ax_2.set_title('Low ranking majors')
Text(0.5, 1.0, 'Low ranking majors')
Do the low ranking majors have a higher unemployment rate than the high ranking ones? Yes, slightly. It is nothing major and the sample size is small, but there is a difference.
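The comparison between the two bar charts can be summarised with a mean for each group. The sketch below uses the Unemployment_rate values from the five highest and five lowest ranked rows displayed earlier; the charts above use ten majors per group, so this is a smaller illustration of the same idea.

```python
import pandas as pd

# Unemployment_rate copied from the top 5 and bottom 5 rows displayed earlier.
top5 = pd.Series([0.018381, 0.117241, 0.024096, 0.050125, 0.061098])
bottom5 = pd.Series([0.046320, 0.065112, 0.149048, 0.053621, 0.104946])

top_mean = top5.mean()
bottom_mean = bottom5.mean()
print(f"top 5 mean: {top_mean:.4f}, bottom 5 mean: {bottom_mean:.4f}")
```

On the full DataFrame the equivalent would be `recent_grads[:10]['Unemployment_rate'].mean()` and `recent_grads[-10:]['Unemployment_rate'].mean()`.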
Another way to plot our bar charts is to group them by the type of major. In the chart below I will plot the total number of men and women who took different categories of major. This will show us both the popular majors and also which ones are popular with different genders.
# Plot bar chart with popularity of majors (select the numeric columns before summing)
ax1 = recent_grads.groupby('Major_category')[['Men', 'Women']].sum().plot.bar(title='Categories of Major')
print(ax1)
AxesSubplot(0.125,0.125;0.775x0.755)
This chart gives us a great look at both the popularity and the gender make up of different categories of majors. For example, business is by far the most popular category, followed by the likes of education, engineering, humanities, psychology and social science.
Some categories like business and social science have a relatively even split between male and female whilst others have a large divide. Education, Psychology & Social Work and Health are female dominated whilst Engineering, Computers and Mathematics are male dominated.
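The gender split per category can be expressed as a single share by dividing the summed Women column by the summed total. The sketch below does this for the ten rows displayed earlier; the `WomenShare` column name is my own, and with only ten majors the categories are far from complete.

```python
import pandas as pd

# Major_category, Men and Women copied from the ten rows displayed earlier.
rows = pd.DataFrame(
    {
        "Major_category": ["Engineering"] * 5
        + ["Biology & Life Science"]
        + ["Psychology & Social Work"] * 3
        + ["Education"],
        "Men": [2057.0, 679.0, 725.0, 1123.0, 21239.0,
                3050.0, 522.0, 568.0, 931.0, 134.0],
        "Women": [282.0, 77.0, 131.0, 135.0, 11021.0,
                  5359.0, 2332.0, 2270.0, 3695.0, 964.0],
    }
)

# Sum graduates per category, then compute the share of women in each.
totals = rows.groupby("Major_category")[["Men", "Women"]].sum()
totals["WomenShare"] = totals["Women"] / (totals["Men"] + totals["Women"])
print(totals.sort_values("WomenShare"))
```

Summing before dividing weights each major by its size, which is different from averaging the per-major ShareWomen values.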
Let's continue our look at the major categories by plotting unemployment rate:
# Plot bar chart with major categories and unemployment rate (select the numeric column before averaging)
ax1 = recent_grads.groupby('Major_category')[['Unemployment_rate']].mean().plot.bar(title='Categories of Major')
AxesSubplot(0.125,0.125;0.775x0.755)
I guess the stereotype of unemployed art graduates isn't totally unwarranted. Social Science and Law & Public Policy also have unemployment rates pushing 10%. At the other end of the spectrum, education, industrial arts and physical sciences have the lowest unemployment rates.
I will finish with a couple of different charts: a box plot of median annual incomes, followed by a box plot of the unemployment rate.
#Plot Median box chart
ax = recent_grads['Median'].plot.box()
ax.set_title('Median annual income')
Text(0.5, 1.0, 'Median annual income')
This chart shows the median to be around 40,000 a year which is around what we would expect after looking at the histogram earlier. The outlier is petroleum engineering which is either a great major to take or could be affected by the small sample size.
# Plot Unemployment rate box plot
ax = recent_grads['Unemployment_rate'].plot.box()
ax.set_title('Unemployment rate')
Text(0.5, 1.0, 'Unemployment rate')
Unemployment rate at a median of about 7% is quite close to the usual national average. A few outliers above 15% might be the majors that you avoid taking!
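The points drawn as outliers on a box plot follow a simple rule. With matplotlib's default setting, a point is an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to the 25% and 75% values reported by `describe()` earlier; the `tukey_fences` helper is a name of my own, not a pandas function.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which a point counts as an outlier."""
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr


# Quartiles taken from the describe() output earlier in the project.
median_low, median_high = tukey_fences(33000.0, 45000.0)
unemp_low, unemp_high = tukey_fences(0.050306, 0.087557)

print(median_low, median_high)
print(round(unemp_high, 4))
```

Under these assumptions any major with a median income above the upper limit is plotted as an individual point, which would cover more majors than Petroleum Engineering alone, and the upper limit for the unemployment rate sits just below the 15% mark mentioned above.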
Final Thoughts
The majority of majors use a sample size of fewer than 100. Is this enough of a sample to draw any great conclusions from this data?
The popularity of majors is not affected by future annual incomes. Either students are unaware of the potential future earnings, they are unmotivated by future earnings or our sample size is just too small.
Majors which are male dominated tend to lead to a higher median annual income than female dominated majors. This is especially true of the top 10 highest earning majors, which are heavily male dominated engineering courses.
Majors that have a higher percentage of full time jobs are also likely to have a higher median annual income.
The median annual income and unemployment rate of recent graduates are similar to national averages. You may conclude that a degree is not important, but remember that these are likely to be young recent graduates. The fact that they are quickly earning salaries to match the national average is a good sign for them, as annual income tends to increase as we age.
This data supports some traditional stereotypes about which majors are popular with males and females and also about which majors lead to higher future earnings. Men are more interested in courses such as engineering, computing and mathematics whilst more women choose majors such as education and health. Art majors also have a high unemployment rate.
Thanks for reading