Guided Project: Visualizing Earnings Based On College Majors

Aim

Using a dataset on the job outcomes of students in America who graduated from college between 2010 and 2012, we will explore questions such as:

  • Do students in more popular majors make more money?
  • How many majors are predominantly male? Predominantly female?
  • Which category of majors has the most students?
In [1]:
# Importing the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
# Ensure plots are displayed inline
%matplotlib inline  
In [2]:
# Read in the dataset into a DataFrame
recent_grads = pd.read_csv('recent-grads.csv')
# Display the first row (returned as a Series)
recent_grads.iloc[0]
Out[2]:
Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object
In [3]:
# Understand how the data is structured
recent_grads.head(5)
Out[3]:
Rank Major_code Major Total Men Women Major_category ShareWomen Sample_size Employed ... Part_time Full_time_year_round Unemployed Unemployment_rate Median P25th P75th College_jobs Non_college_jobs Low_wage_jobs
0 1 2419 PETROLEUM ENGINEERING 2339.0 2057.0 282.0 Engineering 0.120564 36 1976 ... 270 1207 37 0.018381 110000 95000 125000 1534 364 193
1 2 2416 MINING AND MINERAL ENGINEERING 756.0 679.0 77.0 Engineering 0.101852 7 640 ... 170 388 85 0.117241 75000 55000 90000 350 257 50
2 3 2415 METALLURGICAL ENGINEERING 856.0 725.0 131.0 Engineering 0.153037 3 648 ... 133 340 16 0.024096 73000 50000 105000 456 176 0
3 4 2417 NAVAL ARCHITECTURE AND MARINE ENGINEERING 1258.0 1123.0 135.0 Engineering 0.107313 16 758 ... 150 692 40 0.050125 70000 43000 80000 529 102 0
4 5 2405 CHEMICAL ENGINEERING 32260.0 21239.0 11021.0 Engineering 0.341631 289 25694 ... 5180 16697 1672 0.061098 65000 50000 75000 18314 4440 972

5 rows × 21 columns

Engineering majors have the highest median salaries, taking the top 5 spots.

In [4]:
recent_grads.tail()
Out[4]:
Rank Major_code Major Total Men Women Major_category ShareWomen Sample_size Employed ... Part_time Full_time_year_round Unemployed Unemployment_rate Median P25th P75th College_jobs Non_college_jobs Low_wage_jobs
168 169 3609 ZOOLOGY 8409.0 3050.0 5359.0 Biology & Life Science 0.637293 47 6259 ... 2190 3602 304 0.046320 26000 20000 39000 2771 2947 743
169 170 5201 EDUCATIONAL PSYCHOLOGY 2854.0 522.0 2332.0 Psychology & Social Work 0.817099 7 2125 ... 572 1211 148 0.065112 25000 24000 34000 1488 615 82
170 171 5202 CLINICAL PSYCHOLOGY 2838.0 568.0 2270.0 Psychology & Social Work 0.799859 13 2101 ... 648 1293 368 0.149048 25000 25000 40000 986 870 622
171 172 5203 COUNSELING PSYCHOLOGY 4626.0 931.0 3695.0 Psychology & Social Work 0.798746 21 3777 ... 965 2738 214 0.053621 23400 19200 26000 2403 1245 308
172 173 3501 LIBRARY SCIENCE 1098.0 134.0 964.0 Education 0.877960 2 742 ... 237 410 87 0.104946 22000 20000 22000 288 338 192

5 rows × 21 columns

The columns have been defined as follows:

Header Description
Rank Rank by median earnings
Major_code Major code, FO1DP in ACS PUMS
Major Major description
Major_category Category of major from Carnevale et al
Total Total number of people with major
Sample_size Sample size (unweighted) of full-time, year-round ONLY (used for earnings)
Men Male graduates
Women Female graduates
ShareWomen Women as share of total
Employed Number employed (ESR == 1 or 2)
Full_time Employed 35 hours or more
Part_time Employed less than 35 hours
Full_time_year_round Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)
Unemployed Number unemployed (ESR == 3)
Unemployment_rate Unemployed / (Unemployed + Employed)
Median Median earnings of full-time, year-round workers
P25th 25th percentile of earnings
P75th 75th percentile of earnings
College_jobs Number with job requiring a college degree
Non_college_jobs Number with job not requiring a college degree
Low_wage_jobs Number in low-wage service jobs
In [5]:
# Generating summary statistics for all numerical columns
recent_grads.describe()
Out[5]:
Rank Major_code Total Men Women ShareWomen Sample_size Employed Full_time Part_time Full_time_year_round Unemployed Unemployment_rate Median P25th P75th College_jobs Non_college_jobs Low_wage_jobs
count 173.000000 173.000000 172.000000 172.000000 172.000000 172.000000 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000
mean 87.000000 3879.815029 39370.081395 16723.406977 22646.674419 0.522223 356.080925 31192.763006 26029.306358 8832.398844 19694.427746 2416.329480 0.068191 40151.445087 29501.445087 51494.219653 12322.635838 13284.497110 3859.017341
std 50.084928 1687.753140 63483.491009 28122.433474 41057.330740 0.231205 618.361022 50675.002241 42869.655092 14648.179473 33160.941514 4112.803148 0.030331 11470.181802 9166.005235 14906.279740 21299.868863 23789.655363 6944.998579
min 1.000000 1100.000000 124.000000 119.000000 0.000000 0.000000 2.000000 0.000000 111.000000 0.000000 111.000000 0.000000 0.000000 22000.000000 18500.000000 22000.000000 0.000000 0.000000 0.000000
25% 44.000000 2403.000000 4549.750000 2177.500000 1778.250000 0.336026 39.000000 3608.000000 3154.000000 1030.000000 2453.000000 304.000000 0.050306 33000.000000 24000.000000 42000.000000 1675.000000 1591.000000 340.000000
50% 87.000000 3608.000000 15104.000000 5434.000000 8386.500000 0.534024 130.000000 11797.000000 10048.000000 3299.000000 7413.000000 893.000000 0.067961 36000.000000 27000.000000 47000.000000 4390.000000 4595.000000 1231.000000
75% 130.000000 5503.000000 38909.750000 14631.000000 22553.750000 0.703299 338.000000 31433.000000 25147.000000 9948.000000 16891.000000 2393.000000 0.087557 45000.000000 33000.000000 60000.000000 14444.000000 11783.000000 3466.000000
max 173.000000 6403.000000 393735.000000 173809.000000 307087.000000 0.968954 4212.000000 307933.000000 251540.000000 115172.000000 199897.000000 28169.000000 0.177226 110000.000000 95000.000000 125000.000000 151643.000000 148395.000000 48207.000000

The one issue with this data (from a plotting perspective) is that the columns do not all have the same number of non-missing values. The 'Total', 'Men', 'Women', and 'ShareWomen' columns each have 172 values, not 173 as in the other columns.

These missing values will need to be removed before we can pass the data into matplotlib for analysis.
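
Before dropping anything, it can help to see where the missing values sit. The sketch below uses a small toy dataframe (the majors and numbers are hypothetical stand-ins, not the real recent_grads data) to count missing values per column and to list the rows that dropna(axis=0) would remove.

```python
import numpy as np
import pandas as pd

# Toy stand-in for recent_grads (hypothetical values, not the real data)
toy = pd.DataFrame({
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C"],
    "Total": [2339.0, np.nan, 856.0],
    "Men": [2057.0, np.nan, 725.0],
    "Median": [110000, 53000, 73000],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# The rows that contain at least one missing value
rows_with_missing = toy[toy.isna().any(axis=1)]

# Drop those rows, as the notebook does for recent_grads
cleaned = toy.dropna(axis=0)

print(missing_per_column)
print(rows_with_missing["Major"].tolist())
print(len(toy), "->", len(cleaned))
```

Running the same two checks on recent_grads itself should show which major is affected before it is dropped.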

In [6]:
# Record how many rows are in the uncleaned dataframe
raw_data_count = recent_grads.shape[0]
print(raw_data_count)
173
In [7]:
# Drop rows from the dataframe with missing values
recent_grads = recent_grads.dropna(axis=0)
In [8]:
# See how many rows with missing values have been dropped
cleaned_data_count = recent_grads.shape[0]
print("The uncleaned data set had ", raw_data_count, " rows")
print("The cleaned data set has ", cleaned_data_count, " rows")
The uncleaned data set had  173  rows
The cleaned data set has  172  rows

So there was only one row with missing values, which has now been dropped from the dataframe.

Now we can visualize the data to explore research questions.

Visualizing the data: Scatter plots

We will use scatter plots to answer the following questions:

  • Do students in more popular majors make more money?
  • Do students that majored in subjects that were majority female make more money?
  • Is there any link between the number of full-time employees and median salary?
In [9]:
recent_grads.columns
Out[9]:
Index(['Rank', 'Major_code', 'Major', 'Total', 'Men', 'Women',
       'Major_category', 'ShareWomen', 'Sample_size', 'Employed', 'Full_time',
       'Part_time', 'Full_time_year_round', 'Unemployed', 'Unemployment_rate',
       'Median', 'P25th', 'P75th', 'College_jobs', 'Non_college_jobs',
       'Low_wage_jobs'],
      dtype='object')
In [10]:
# Scatter plot: Sample size and median
recent_grads.plot(x = 'Sample_size', y = 'Median', kind = 'scatter', title = 'Median earnings vs. Sample Size', xlim=(0,4500), ylim=(0,120000))
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e77f9d30>

Q. Do students in more popular majors make more money?

A. The scatter plot suggests that there is no noticeable relationship between the sample size and the median salary. However, there are two important qualifiers to this answer:

  1. This scatter plot uses earnings information for an unweighted sample of people with each major. It may therefore not be representative of the whole population of graduates with that major.
  2. The median sample size is 130 and the 75th percentile is 338, but the chart's scale is stretched by a few outliers, which may visually compress any relationship. Zooming in on majors with sample sizes at or below the 75th percentile of 338 shows whether there is a relationship within that smaller range.
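
Rather than typing the 75th percentile in by hand, the cut-off could be read from the data. This is a minimal sketch on a toy dataframe (hypothetical rows, chosen only so the quartile is easy to check); the names toy, q75, and zoomed are illustrative.

```python
import pandas as pd

# Toy stand-in for recent_grads (hypothetical rows)
toy = pd.DataFrame({
    "Sample_size": [2, 36, 130, 338, 4212],
    "Median": [22000, 110000, 36000, 45000, 31500],
})

# 75th percentile of the sample sizes
q75 = toy["Sample_size"].quantile(0.75)

# Keep only the majors at or below that cut-off
zoomed = toy[toy["Sample_size"] <= q75]

print(q75, len(zoomed))
```

The same value could then be passed as the upper x-axis limit, e.g. xlim=(0, q75), so the limit follows the data if the dataset changes.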
In [11]:
# Scatter plot: Sample size (up to 75th percentile) and median
recent_grads.plot(x = 'Sample_size', y = 'Median', kind = 'scatter', title = 'Median earnings vs. Sample Size up to 75th percentile', xlim=(0,338), ylim=(0,120000))
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e7745ef0>

Within this narrowed-down selection of majors there is still no strong overall correlation between sample size and median earnings.

However, there is a wider range of median earnings among majors with a sample size under 50. With small samples the risk of an unrepresentative median salary is higher, as outliers have a bigger effect.

Overall this additional scatter plot does not change the above answer to the question.
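
A visual impression of "no relationship" can be backed up with a correlation coefficient. The sketch below is an illustration on toy numbers (hypothetical, not taken from recent_grads): it computes the Pearson correlation with Series.corr and a rank-based version by correlating the ranks, which is less sensitive to a few very large sample sizes.

```python
import pandas as pd

# Toy stand-in for recent_grads (hypothetical values)
toy = pd.DataFrame({
    "Sample_size": [10, 20, 30, 40, 50],
    "Median": [30000, 50000, 20000, 60000, 40000],
})

# Pearson correlation: strength of the linear relationship
pearson = toy["Sample_size"].corr(toy["Median"])

# Rank correlation: Pearson correlation applied to the ranks
rank_corr = toy["Sample_size"].rank().corr(toy["Median"].rank())

print(round(pearson, 2), round(rank_corr, 2))
```

Values close to 0 would support the reading of the scatter plot; values close to 1 or -1 would contradict it.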

In [12]:
# Sample size and unemployment rate
ax = recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind='scatter')
ax.set_title('Unemployment rate vs. Sample size')
ax.set_xlim(0,4500)
ax.set_ylim(0,0.2)
Out[12]:
(0, 0.2)

There is a lot of variation in unemployment rates among majors with small sample sizes. Yet again the small sample sizes may affect the representativeness of the relationship plotted here.

In addition, from reviewing some rows of the dataframe, there is a noticeable difference between the sample size and the number of graduates for whom there is data on whether they are employed/unemployed.

For example, for Petroleum Engineering (rank 1) there is a sample size of 36 but a total of 2,013 (1,976 + 37) employed and unemployed graduates.

This suggests that sample size is not readily comparable to other statistics collected, other than median wage.
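
The Petroleum Engineering figures can be checked by hand. This short sketch uses the values from the first row shown earlier and the column definition Unemployment_rate = Unemployed / (Unemployed + Employed); the variable names are illustrative.

```python
# Values for Petroleum Engineering, taken from the first row of the dataset
employed = 1976
unemployed = 37
sample_size = 36

# Employed plus unemployed graduates
labour_force = employed + unemployed

# Unemployment rate as defined in the column descriptions
unemployment_rate = unemployed / labour_force

print(labour_force)
print(round(unemployment_rate, 7))
print(round(labour_force / sample_size, 1))
```

The computed rate matches the Unemployment_rate stored for this major, while the employed and unemployed counts are far larger than the sample size, which fits the reading that Sample_size relates only to the earnings figures.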

In [13]:
# Full-time workers and median salary
ax = recent_grads.plot(x='Full_time', y='Median', kind='scatter')
ax.set_title('Median salary vs. Number employed full-time')
ax.set_xlim(0,255000)
ax.set_ylim(0,120000)
Out[13]:
(0, 120000)

Q. Is there any link between the number of full-time employees and median salary?

A. There is no noticeable correlation between the number of graduates per major employed full-time and the median wage. If there were a relationship, it would be positive, i.e. majors with more full-time employees would tend to have a higher median wage.

But as noted above the median wage figures are based off smaller unweighted samples that may not represent the wider population of graduates with each major.

To be sure, a more narrowed-down sample of the data can be plotted, setting the axes limits at the 75th percentile.

In [14]:
ax = recent_grads.plot(x='Full_time', y='Median', kind='scatter')
ax.set_title('Median salary vs. Number employed full-time (both at 75th percentile)')
ax.set_xlim(0,26000)
ax.set_ylim(0,45000)
Out[14]:
(0, 45000)

For this narrowed down sample there is no noticeable relationship.

In [15]:
# Share of women and the unemployment rate
ax = recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter')
ax.set_title('Unemployment rate vs Proportion of female graduates')
ax.set_xlim(0,1)
ax.set_ylim(0,0.2)
Out[15]:
(0, 0.2)

There doesn't seem to be a strong relationship between the proportion of female graduates in a course and the unemployment rate.

In [16]:
# Share of women and median salary
ax = recent_grads.plot(x='ShareWomen', y='Median', kind='scatter')
ax.set_title('Median salary vs. Proportion of female graduates')
ax.set_xlim(0,1)
ax.set_ylim(0,120000)
Out[16]:
(0, 120000)

Q. Do students that majored in subjects that were majority female make more money?

A. No. Here there is a noticeable relationship: the higher the proportion of female graduates for a major, the lower the median salary is.

The lower median salary is not due to more part-time work, because the figure is defined as the median salary of full-time, year-round workers.

This means that the lower salary could be due to lower pay for the types of major (and subsequent career paths) that have a higher proportion of female graduates, and/or to lower wages linked to gender, or to less career capital due to a higher propensity to take time away from work for family.

In [17]:
# Number of male graduates and median wage
ax = recent_grads.plot(x='Men', y='Median', kind='scatter')
ax.set_title('Median salary vs. Number of male graduates per major')
ax.set_xlim(0,175000)
ax.set_ylim(0,120000)
Out[17]:
(0, 120000)

There is no obvious relationship here.

In [18]:
# Number of female graduates and median wage
ax = recent_grads.plot(x='Women', y='Median', kind='scatter')
ax.set_title('Median salary vs. Number of female graduates per major')
ax.set_xlim(0,310000)
ax.set_ylim(0,120000)
Out[18]:
(0, 120000)

There is no obvious relationship here either.

Visualizing the data: Histograms

We will use histograms to answer the following questions:

  • What percent of majors are predominantly male? Predominantly female?
  • What's the most common median salary range?
In [19]:
# To allow bin size to be changed, use Series.hist() and not Series.plot(kind='hist')
# Sample_size histogram
recent_grads["Sample_size"].hist(bins=10)
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e54332b0>

Most of the sample size values are below 500 so a more detailed view of the majority can be found by looking at those with a sample size below 500.

In [20]:
recent_grads["Sample_size"].hist(bins=50, range=(0,500))
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e538ab00>

The majority of sample sizes were below 100. This raises concerns over how representative the salary data for each major is.

In [21]:
# Median salary histogram
recent_grads["Median"].hist(range=(20000,110000), bins=18)
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e52d6da0>

Q. What's the most common median salary range?

A. The median salaries are mostly clustered around $30,000-40,000, with a relatively quick drop off in frequency for salary bands on either side.
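
The modal salary band can also be read off numerically instead of by eye. This is a sketch using numpy.histogram with the same range and number of bins as the histogram above (18 bins of $5,000 from 20,000 to 110,000); the salary values themselves are hypothetical toy data.

```python
import numpy as np

# Toy median salaries (hypothetical values, not the real data)
medians = np.array([22000, 31000, 33000, 34000, 36000,
                    38000, 45000, 62000, 110000])

# Same binning as the histogram: 18 equal-width bins from 20,000 to 110,000
counts, edges = np.histogram(medians, bins=18, range=(20000, 110000))

# Index of the fullest bin and its lower and upper edges
modal_bin = counts.argmax()
low, high = edges[modal_bin], edges[modal_bin + 1]

print(counts)
print(low, high)
```

Applied to recent_grads["Median"], the same steps would report the exact $5,000 band with the most majors.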

In [22]:
# Employed histogram
recent_grads["Employed"].hist(range=(0,310000))
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e5495da0>

I assume that the number employed per major is affected, in part, by the number of students that have taken the major, so its distribution is not very instructive by itself. To check this assumption I can look at the relationship between Total and Employed.

In [23]:
# e.g. the Total number of people and number Employed for the largest majors
# Filter by majors with Total > 39,000 (75th percentile)
largest_majors = recent_grads.loc[recent_grads["Total"] > 39000, ["Major_code", "Major", "Total", "Employed"]]
largest_majors.sort_values(by='Total', ascending=False).head(10)
Out[23]:
Major_code Major Total Employed
145 5200 PSYCHOLOGY 393735.0 307933
76 6203 BUSINESS MANAGEMENT AND ADMINISTRATION 329927.0 276234
123 3600 BIOLOGY 280709.0 182295
57 6200 GENERAL BUSINESS 234590.0 190183
93 1901 COMMUNICATIONS 213996.0 179633
34 6107 NURSING 209394.0 180903
77 6206 MARKETING AND MARKETING RESEARCH 205211.0 178862
40 6201 ACCOUNTING 198633.0 165527
137 3301 ENGLISH LANGUAGE AND LITERATURE 194673.0 149180
78 5506 POLITICAL SCIENCE AND GOVERNMENT 182621.0 133454

So there is, unsurprisingly, a link between the Total number of people who have taken a major and the number Employed.

A better way to illustrate this relationship would be with a scatter plot.

In [24]:
ax = recent_grads.plot(x='Total', y='Employed', kind='scatter')
ax.set_xlim(0, 400000)
ax.set_ylim(0,310000)
Out[24]:
(0, 310000)

So, as expected, the distribution and size of the number Employed per major is closely related to the Total number of graduates per major. It would be more useful to look at the employment (or unemployment) rates per major, rather than the absolute numbers.
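
As a rough illustration of looking at rates rather than counts, the sketch below divides Employed by Total for three of the largest majors, using the figures from the table above. Treating Employed / Total as an "employed share" is my own assumption for illustration; it is not a column in the dataset and is not the same as one minus the unemployment rate.

```python
import pandas as pd

# Total and Employed for three of the largest majors (from the table above)
largest = pd.DataFrame({
    "Major": ["PSYCHOLOGY", "BIOLOGY", "NURSING"],
    "Total": [393735.0, 280709.0, 209394.0],
    "Employed": [307933, 182295, 180903],
}).set_index("Major")

# Assumed measure: share of all graduates with the major who are employed
largest["Employed_share"] = largest["Employed"] / largest["Total"]

print(largest["Employed_share"].round(2))
```

On this measure the ordering differs from the ordering by absolute numbers: Psychology has the most employed graduates, but Nursing has the highest employed share of the three.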

In [25]:
# Full-time histogram
recent_grads["Full_time"].hist()
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e56db630>

This closely mirrors the distribution of the numbers Employed per major, which is expected.

In [26]:
# ShareWomen histogram
recent_grads["ShareWomen"].hist()
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e551b278>

It appears that just over 50% of all majors are majority female, with the highest frequency at 70-80% female.
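
The share of majors that are majority female can be calculated directly rather than estimated from the histogram. A minimal sketch on toy values (hypothetical, not the real ShareWomen column): comparing the column with 0.5 gives a boolean Series, and its mean is the fraction of True values.

```python
import pandas as pd

# Toy ShareWomen values (hypothetical)
share_women = pd.Series([0.12, 0.10, 0.34, 0.48, 0.55, 0.64, 0.80, 0.97])

# Fraction of majors where women are more than half of graduates
pct_majority_female = (share_women > 0.5).mean() * 100

# Fraction of majors where men are more than half of graduates
pct_majority_male = (share_women < 0.5).mean() * 100

print(pct_majority_female, pct_majority_male)
```

Using recent_grads["ShareWomen"] in place of the toy Series would give the exact percentages for the question posed at the start of this section.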

In [27]:
# Seeing which courses have females at 80% or more
high_female_share = recent_grads[recent_grads["ShareWomen"] >= 0.8]
print(high_female_share.shape)
high_female_share.sort_values(by='ShareWomen', ascending=False).head(10)
(18, 21)
Out[27]:
Rank Major_code Major Total Men Women Major_category ShareWomen Sample_size Employed ... Part_time Full_time_year_round Unemployed Unemployment_rate Median P25th P75th College_jobs Non_college_jobs Low_wage_jobs
164 165 2307 EARLY CHILDHOOD EDUCATION 37589.0 1167.0 36422.0 Education 0.968954 342 32551 ... 7001 20748 1360 0.040105 28000 21000 35000 23515 7705 2868
163 164 6102 COMMUNICATION DISORDERS SCIENCES AND SERVICES 38279.0 1225.0 37054.0 Health 0.967998 95 29763 ... 13862 14460 1487 0.047584 28000 20000 40000 19957 9404 5125
51 52 6104 MEDICAL ASSISTING SERVICES 11123.0 803.0 10320.0 Health 0.927807 67 9168 ... 4107 4290 407 0.042507 42000 30000 65000 2091 6948 1270
138 139 2304 ELEMENTARY EDUCATION 170862.0 13029.0 157833.0 Education 0.923745 1629 149339 ... 37965 86540 7297 0.046586 32000 23400 38000 108085 36972 11502
150 151 2901 FAMILY AND CONSUMER SCIENCES 58001.0 5166.0 52835.0 Industrial Arts & Consumer Services 0.910933 518 46624 ... 15872 26906 3355 0.067128 30000 22900 40000 20985 20133 5248
100 101 2310 SPECIAL NEEDS EDUCATION 28739.0 2682.0 26057.0 Education 0.906677 246 24639 ... 5153 16642 1067 0.041508 35000 32000 42000 20185 3797 1179
156 157 5403 HUMAN SERVICES AND COMMUNITY ORGANIZATION 9374.0 885.0 8489.0 Psychology & Social Work 0.905590 89 8294 ... 2405 5061 326 0.037819 30000 24000 35000 2878 4595 724
151 152 5404 SOCIAL WORK 53552.0 5137.0 48415.0 Psychology & Social Work 0.904075 374 45038 ... 13481 27588 3329 0.068828 30000 25000 35000 27449 14416 4344
34 35 6107 NURSING 209394.0 21773.0 187621.0 Health 0.896019 2554 180903 ... 40818 122817 8497 0.044863 48000 39000 58000 151643 26146 6193
88 89 6199 MISCELLANEOUS HEALTH MEDICAL PROFESSIONS 13386.0 1589.0 11797.0 Health 0.881294 81 10076 ... 4145 5868 893 0.081411 36000 23000 42000 5652 3835 1422

10 rows × 21 columns

In [28]:
# Unemployment Rate histogram
recent_grads["Unemployment_rate"].hist()
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e5675438>

The most frequent unemployment range is 6-7%, but there are a handful of courses with unemployment rates greater than 14%.

In [29]:
high_unemp = recent_grads[recent_grads["Unemployment_rate"] >= 0.14]
print(high_unemp.shape)
high_unemp.sort_values(by='Unemployment_rate', ascending=False)
(4, 21)
Out[29]:
Rank Major_code Major Total Men Women Major_category ShareWomen Sample_size Employed ... Part_time Full_time_year_round Unemployed Unemployment_rate Median P25th P75th College_jobs Non_college_jobs Low_wage_jobs
5 6 2418 NUCLEAR ENGINEERING 2573.0 2200.0 373.0 Engineering 0.144967 17 1857 ... 264 1449 400 0.177226 65000 50000 102000 1142 657 244
89 90 5401 PUBLIC ADMINISTRATION 5629.0 2947.0 2682.0 Law & Public Policy 0.476461 46 4158 ... 847 2952 789 0.159491 36000 23000 60000 919 2313 496
84 85 2107 COMPUTER NETWORKING AND TELECOMMUNICATIONS 7613.0 5291.0 2322.0 Computers & Mathematics 0.305005 97 6144 ... 1447 4369 1100 0.151850 36400 27000 49000 2593 2941 352
170 171 5202 CLINICAL PSYCHOLOGY 2838.0 568.0 2270.0 Psychology & Social Work 0.799859 13 2101 ... 648 1293 368 0.149048 25000 25000 40000 986 870 622

4 rows × 21 columns

'Nuclear Engineering' and 'Computer Networking and Telecommunications' are unexpected, given they are in in-demand fields (engineering, computers & mathematics).

In [30]:
# Men histogram i.e. number of male graduates
recent_grads["Men"].hist()
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e518c390>
In [31]:
# Women histogram i.e. number of female graduates
recent_grads["Women"].hist()
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e511d390>

Visualizing the data: Scatter Matrix Plot

A scatter matrix plot combines both scatter plots and histograms into one grid of plots and allows us to explore potential relationships and distributions simultaneously.

These can be generated by selecting the dataframe and columns of interest and passing this into: pandas.plotting.scatter_matrix

In [32]:
# Import the scatter_matrix function from pandas.plotting
from pandas.plotting import scatter_matrix
In [33]:
# A 2 by 2 scatter matrix plot of Sample_size and Median salary
scatter_matrix(recent_grads[["Sample_size", "Median"]], figsize=(10,10))
Out[33]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f99e51de470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4fe2128>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4faab00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4f67860>]],
      dtype=object)

Most sample sizes are less than 500 (top left histogram). The scatter plot of Sample_size vs. Median salary (bottom left) doesn't seem to provide much information other than there not being an obvious relationship between Sample_size and Median salary.

However the mirror scatter plot of Median salary vs. Sample_size (top right) shows that large Sample sizes are not associated with outlier Median salary values. Instead the majors with the highest Sample sizes have Median salaries that are in line with the most common salary ranges ($30,000-40,000).

In [34]:
# Scatter matrix plot of Sample_size, Median, and Unemployment_rate
scatter_matrix(recent_grads[["Sample_size", "Median", "Unemployment_rate"]], figsize=(20,20))
Out[34]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f99e53c11d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4ea0438>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4e69b70>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4e248d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4dedb00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4dadf98>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4cff6d8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4d34dd8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4c89160>]],
      dtype=object)

There are no noticeably strong relationships between the variables here.

Scatter_matrix is a useful way to quickly explore relationships that I've considered above, for example Total students with a major and the number Employed.

In [35]:
# Total and Employed scatter matrix
scatter_matrix(recent_grads[["Total", "Employed"]], figsize=(10,10))
Out[35]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4b8cef0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f99e434ef98>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f99e431f400>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f99e42d3da0>]],
      dtype=object)

Doing a scatter matrix plot earlier would have saved time and quickly revealed the strong relationship between the two variables.

Visualizing the data: Bar plots

Bar plots can be generated with either df.plot(kind='bar') or df.plot.bar(), passing the column of labels as x and the column of bar heights as y.

In [36]:
# Looking at the share of women for the top 10 and bottom 10
# courses NB data is ranked by median salary

# Share of women in the top 10 courses
recent_grads[:10].plot.bar(x='Major', y='ShareWomen', title='Share of women in the 10 courses with the highest median salary')
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e420cf98>
In [37]:
# Share of women in the bottom 10 courses
recent_grads[-10:].plot.bar(x='Major', y='ShareWomen', title='Share of women in the 10 courses with the lowest median salary')
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e41d2be0>

The courses with the highest median salaries have a lower share of female graduates than those with the lowest median salaries, which are majority female (i.e. over 50% of the graduates are women).

Now to calculate how large the difference is:

In [38]:
#Calculating the average proportion of female graduates for the top and bottom 10 courses
top_10_female_share = recent_grads.loc[:9, "ShareWomen"].mean()
bottom_10_female_share = recent_grads[-10:]['ShareWomen'].mean()
In [39]:
top_10 = "The 10 highest paying courses have an average female share of {:.2f}".format(top_10_female_share)
bottom_10 = "The 10 lowest paying courses have an average female share of {:.2f}".format(bottom_10_female_share)

print(top_10)
print(bottom_10)
The 10 highest paying courses have an average female share of 0.23
The 10 lowest paying courses have an average female share of 0.79

So the difference in the average proportion of female graduates between the top and bottom 10 courses (in terms of median pay) is over 50 percentage points (0.79 - 0.23 = 0.56)!
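
One caveat on the selection used above: .loc[:9] selects by index label, while [-10:] selects by position. After dropna the labels can have gaps, so .loc[:9] only returns ten rows if none of the labels 0 to 9 were dropped. The sketch below shows the difference on a toy dataframe (hypothetical values, with label 3 deliberately removed); .iloc or .head() would be a position-based alternative.

```python
import numpy as np
import pandas as pd

# Toy dataframe with 12 rows (hypothetical values)
toy = pd.DataFrame({"ShareWomen": np.linspace(0.1, 0.9, 12)})

# Simulate a missing value in one of the first ten rows, then drop it
toy.loc[3, "ShareWomen"] = np.nan
toy = toy.dropna(axis=0)  # remaining labels: 0, 1, 2, 4, ..., 11

# Label-based: rows whose labels fall between 0 and 9 (label 3 is gone)
by_label = toy.loc[:9, "ShareWomen"]

# Position-based: the first ten remaining rows, whatever their labels
by_position = toy.iloc[:10]["ShareWomen"]

print(len(by_label), len(by_position))
```

In this toy case the label-based selection silently returns nine rows instead of ten, which would shift the "top 10" mean.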

Next we will look at the differences in the unemployment rate between the top 10 and bottom 10 courses.

In [40]:
# Unemployment rate for the top 10 and bottom 10 courses
ax1 = recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate', title='Unemployment rate for the top 10 courses')
ax2 = recent_grads[-10:].plot.bar(x='Major', y='Unemployment_rate', title='Unemployment rate for the bottom 10 courses')

This comparison is less clear-cut. The top 10 courses do tend to have a lower unemployment rate, with two exceptions: 'Nuclear Engineering' and 'Mining and Mineral Engineering'. Of the bottom 10 courses, between three and five have higher unemployment rates.

This can be analysed by looking at average unemployment rates.

In [41]:
mean_unemp_rate = recent_grads["Unemployment_rate"].mean()
#NB with .loc contrary to usual python slices, both the start and the stop are included
top_10_unemp = recent_grads.loc[:9 , "Unemployment_rate"].mean()
bottom_10_unemp = recent_grads[-10:]["Unemployment_rate"].mean()
In [42]:
mean_all = "The average unemployment rate across all majors is {:.2f}".format(mean_unemp_rate)
mean_top = "The average unemployment rate for the top 10 majors is {:.2f}".format(top_10_unemp)
mean_bottom = "The average unemployment rate for the bottom 10 majors is {:.2f}".format(bottom_10_unemp)

print(mean_all)
print(mean_top)
print(mean_bottom)
The average unemployment rate across all majors is 0.07
The average unemployment rate for the top 10 majors is 0.07
The average unemployment rate for the bottom 10 majors is 0.08

Whilst the average unemployment rates are similar for the top and bottom 10 courses, the spread differs. More of the bottom 10 courses appear to sit slightly above the overall average, whereas in the top 10 two courses are far above the average and the others are far below it.

To investigate this further:

In [43]:
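# .copy() gives an independent dataframe, so adding a column does not act on a slice of recent_grads
top_10_outliers = recent_grads[:10].loc[recent_grads[:10]["Unemployment_rate"] > mean_unemp_rate].copy()
top_10_outliers["Difference_from_mean"] = top_10_outliers["Unemployment_rate"] - mean_unemp_rate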
In [44]:
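# .copy() gives an independent dataframe, so adding a column does not act on a slice of recent_grads
bottom_10_outliers = recent_grads[-10:].loc[recent_grads[-10:]["Unemployment_rate"] > mean_unemp_rate].copy()
bottom_10_outliers["Difference_from_mean"] = bottom_10_outliers["Unemployment_rate"] - mean_unemp_rate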
In [45]:
# Plot the majors from the top and bottom 10 with above
# average unemployment rates and the size of the difference
# in unemployment rate from the average
top_10_outliers.plot.bar(x='Major', y='Difference_from_mean', title='Majors in the top 10 with above average unemployment rates')
bottom_10_outliers.plot.bar(x='Major', y='Difference_from_mean', title='Majors in the bottom 10 with above average unemployment rates')
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f99def3fb70>

So in the top 10, one course (Nuclear Engineering) is dragging up the average unemployment rate. In the bottom 10, Clinical Psychology has an above-average unemployment rate, and four other courses are also pulling the group's average up.

Additional analysis

There are a lot of majors, which makes it harder to see patterns across types of major (e.g. arts, sciences).

So I will look at some analysis using the Category of the major.

Firstly I will build a new dataframe which is indexed by the categories of the majors.

In [46]:
# Create a list of all the categories of major
categories = recent_grads["Major_category"].unique()
In [47]:
# Now to aggregate across categories
# Firstly unemployment rates

# Create an empty dictionary
cat_unemp = {}

# Loop through major categories, calculate mean unemployment rate and add to dictionary
for c in categories:
    unemp_mean = recent_grads.loc[recent_grads["Major_category"] == c, "Unemployment_rate"].mean()
    cat_unemp[c] = unemp_mean
In [48]:
# Now the proportion (share) of women in each category

# Create an empty dictionary
cat_share_women = {}

#Loop through categories, calculate mean share of women, and add to dictionary
for c in categories:
    women_mean = recent_grads.loc[recent_grads["Major_category"] == c, "ShareWomen"].mean()
    cat_share_women[c] = women_mean
In [49]:
# Now the average median salary

# Create an empty dictionary
cat_salary = {}

for c in categories:
    salary_mean = recent_grads.loc[recent_grads["Major_category"] == c, "Median"].mean()
    cat_salary[c] = salary_mean
In [50]:
# Now the total number of men and women in each category

cat_women = {}

for c in categories:
    women_sum = recent_grads.loc[recent_grads["Major_category"] == c, "Women"].sum()
    cat_women[c] = women_sum
In [51]:
cat_men = {}

for c in categories:
    men_sum = recent_grads.loc[recent_grads["Major_category"] == c, "Men"].sum()
    cat_men[c] = men_sum

Now to convert all five dictionaries into series objects, and then add the series objects to a dataframe (with named column headings).

In [52]:
unemp_series = pd.Series(cat_unemp)
share_women_series = pd.Series(cat_share_women)
salary_series = pd.Series(cat_salary)
women_series = pd.Series(cat_women)
men_series = pd.Series(cat_men)

type(salary_series)
Out[52]:
pandas.core.series.Series
In [53]:
#Now to turn all series into a dataframe
#NB. The dictionary keys became the index in the Series obj
#This index can be used for the dataframe

major_categories = pd.DataFrame(unemp_series, columns=['mean_unemployment_rate'])

major_categories
Out[53]:
mean_unemployment_rate
Agriculture & Natural Resources 0.051817
Arts 0.090173
Biology & Life Science 0.060918
Business 0.071064
Communications & Journalism 0.075538
Computers & Mathematics 0.084256
Education 0.051702
Engineering 0.063334
Health 0.065920
Humanities & Liberal Arts 0.081008
Industrial Arts & Consumer Services 0.048071
Interdisciplinary 0.070861
Law & Public Policy 0.090805
Physical Sciences 0.046511
Psychology & Social Work 0.072065
Social Science 0.095729

Now to add the other series into this new dataframe.

In [54]:
# Add the remaining Series to the DataFrame as new columns
# The DataFrame constructor is only needed to create the object,
# so assign each Series directly; they all share the same index
major_categories["mean_share_women"] = share_women_series
major_categories["mean_salary"] = salary_series
major_categories["number_female_grads"] = women_series
major_categories["number_male_grads"] = men_series

major_categories
Out[54]:
mean_unemployment_rate mean_share_women mean_salary number_female_grads number_male_grads
Agriculture & Natural Resources 0.051817 0.405267 35111.111111 35263.0 40357.0
Arts 0.090173 0.603658 33062.500000 222740.0 134390.0
Biology & Life Science 0.060918 0.587193 36421.428571 268943.0 184919.0
Business 0.071064 0.483198 43538.461538 634524.0 667852.0
Communications & Journalism 0.075538 0.658384 34500.000000 260680.0 131921.0
Computers & Mathematics 0.084256 0.311772 42745.454545 90283.0 208725.0
Education 0.051702 0.748507 32350.000000 455603.0 103526.0
Engineering 0.063334 0.238889 57382.758621 129276.0 408307.0
Health 0.065920 0.795152 36825.000000 387713.0 75517.0
Humanities & Liberal Arts 0.081008 0.631790 31913.333333 440622.0 272846.0
Industrial Arts & Consumer Services 0.048071 0.349523 36342.857143 126011.0 103781.0
Interdisciplinary 0.070861 0.770901 35000.000000 9479.0 2817.0
Law & Public Policy 0.090805 0.483649 42200.000000 87978.0 91129.0
Physical Sciences 0.046511 0.508683 41890.000000 90089.0 95390.0
Psychology & Social Work 0.072065 0.794397 30100.000000 382892.0 98115.0
Social Science 0.095729 0.553962 37344.444444 273132.0 256834.0
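
As an aside, the loops, dictionaries and Series used above can be collapsed into a single `groupby` call with named aggregation. The following is only a sketch: `toy_grads` is a made-up stand-in for `recent_grads` with invented rows, and it assumes a pandas version that supports named aggregation (0.25 or later).

```python
import pandas as pd

# Toy stand-in for recent_grads; these rows are invented for illustration only
toy_grads = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Arts", "Arts"],
    "Unemployment_rate": [0.02, 0.06, 0.08, 0.10],
    "ShareWomen": [0.10, 0.30, 0.60, 0.70],
    "Median": [100000, 60000, 30000, 40000],
    "Women": [100.0, 300.0, 600.0, 700.0],
    "Men": [900.0, 700.0, 400.0, 300.0],
})

# One row per category; each keyword names an output column and
# maps it to (source column, aggregation function)
summary = toy_grads.groupby("Major_category").agg(
    mean_unemployment_rate=("Unemployment_rate", "mean"),
    mean_share_women=("ShareWomen", "mean"),
    mean_salary=("Median", "mean"),
    number_female_grads=("Women", "sum"),
    number_male_grads=("Men", "sum"),
)

print(summary)
```

Applied to `recent_grads` itself, the same call should give a table with the same columns as `major_categories`, indexed by category.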

Now it is time to plot the data by category to see what patterns emerge.

Firstly, using a grouped bar plot.

In [55]:
df_subset = major_categories[["number_male_grads", "number_female_grads"]]
df_subset.plot.bar(title = 'Total number of male and female graduates per category of major')
Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f99deea36a0>

As can be seen, there are large differences between the number of male and female graduates in the following categories:

  • Education
  • Engineering
  • Health
  • Humanities & Liberal Arts
  • Psychology & Social Work
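
The list above was read off the bar plot, but the same ranking can be checked numerically. The sketch below re-enters the graduate totals printed in the table earlier and ranks categories by the absolute difference between female and male graduates. The names `totals`, `gap` and `largest_gaps` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Totals copied from the number_female_grads / number_male_grads columns above
totals = pd.DataFrame(
    {
        "number_female_grads": [
            35263, 222740, 268943, 634524, 260680, 90283, 455603, 129276,
            387713, 440622, 126011, 9479, 87978, 90089, 382892, 273132,
        ],
        "number_male_grads": [
            40357, 134390, 184919, 667852, 131921, 208725, 103526, 408307,
            75517, 272846, 103781, 2817, 91129, 95390, 98115, 256834,
        ],
    },
    index=[
        "Agriculture & Natural Resources", "Arts", "Biology & Life Science",
        "Business", "Communications & Journalism", "Computers & Mathematics",
        "Education", "Engineering", "Health", "Humanities & Liberal Arts",
        "Industrial Arts & Consumer Services", "Interdisciplinary",
        "Law & Public Policy", "Physical Sciences",
        "Psychology & Social Work", "Social Science",
    ],
)

# Absolute difference between female and male graduate counts per category
gap = (totals["number_female_grads"] - totals["number_male_grads"]).abs()

# The five categories with the largest absolute gap, largest first
largest_gaps = gap.nlargest(5)
print(largest_gaps)
```

Ranking by absolute difference favours large categories. Ranking by each category's share of women instead would be an equally reasonable choice and may order the categories differently.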

Now to look at how the mean salary differs across categories.

In [56]:
major_categories.plot.bar(y='mean_salary', title='Average salary per category of major')
Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f99ded9b780>

The mean salary (the average of each major's median salary) is highest by far in Engineering at about 57,400, followed by Business at about 43,500.

Now to look at the relationship between mean salary and the average proportion of women in each category.

In [57]:
ax = major_categories.plot(x='mean_share_women', y='mean_salary', kind='scatter')
ax.set_title('Average share of female graduates per category vs. average salary')
Out[57]:
<matplotlib.text.Text at 0x7f99deca1eb8>

There's a slight drop in the average salary as the proportion of female graduates rises.
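
To put a number on this trend, one option is the Pearson correlation coefficient between the two columns. The sketch below re-enters the `mean_share_women` and `mean_salary` values from the table printed earlier, with salaries rounded to two decimals. The names `by_category` and `r` are illustrative. With only 16 categories, the result should be read as a rough indication rather than firm evidence.

```python
import pandas as pd

# Values copied from the mean_share_women and mean_salary columns above,
# in the same category order as the table (salaries rounded to 2 decimals)
by_category = pd.DataFrame({
    "mean_share_women": [
        0.405267, 0.603658, 0.587193, 0.483198, 0.658384, 0.311772,
        0.748507, 0.238889, 0.795152, 0.631790, 0.349523, 0.770901,
        0.483649, 0.508683, 0.794397, 0.553962,
    ],
    "mean_salary": [
        35111.11, 33062.50, 36421.43, 43538.46, 34500.00, 42745.45,
        32350.00, 57382.76, 36825.00, 31913.33, 36342.86, 35000.00,
        42200.00, 41890.00, 30100.00, 37344.44,
    ],
})

# Series.corr uses the Pearson method by default; a negative value means
# salary tends to fall as the share of women rises
r = by_category["mean_share_women"].corr(by_category["mean_salary"])
print(f"Pearson correlation: {r:.2f}")
```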

Finally, how does the average unemployment rate vary across the different categories?

In [58]:
major_categories.plot.bar(y='mean_unemployment_rate', legend = False, title='Average unemployment rate per category of major')
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f99dec4ddd8>

Interestingly, there doesn't appear to be a close link between unemployment rate and average salary. For example, Education has both a low unemployment rate and a low salary. Low unemployment usually means a smaller supply of surplus labour, which would normally be expected to push wages up.

Using a scatter plot to explore this further:

In [59]:
ax = major_categories.plot(x='mean_unemployment_rate', y='mean_salary', kind='scatter')
ax.set_title("Average unemployment rate vs. average salary for each category of major")
Out[59]:
<matplotlib.text.Text at 0x7f99debb1320>

So there isn't an obvious relationship between average salary and