Guided Project: Visualizing Earnings Based On College Majors

Aim

Using a dataset on the job outcomes of students in America who graduated from college between 2010 and 2012, we will explore questions such as:

Do students in more popular majors make more money?
How many majors are predominantly male? Predominantly female?
Which category of majors has the most students?

In [1]:

# Importing the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
# Ensure plots are displayed inline
%matplotlib inline  

In [2]:

# Read in the dataset into a DataFrame
recent_grads = pd.read_csv('recent-grads.csv')
# Show the first row (returned as a Series)
recent_grads.iloc[0]

Out[2]:

Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object

In [3]:

# Understand how the data is structured
recent_grads.head(5)

Out[3]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
0	1	2419	PETROLEUM ENGINEERING	2339.0	2057.0	282.0	Engineering	0.120564	36	1976	...	270	1207	37	0.018381	110000	95000	125000	1534	364	193
1	2	2416	MINING AND MINERAL ENGINEERING	756.0	679.0	77.0	Engineering	0.101852	7	640	...	170	388	85	0.117241	75000	55000	90000	350	257	50
2	3	2415	METALLURGICAL ENGINEERING	856.0	725.0	131.0	Engineering	0.153037	3	648	...	133	340	16	0.024096	73000	50000	105000	456	176	0
3	4	2417	NAVAL ARCHITECTURE AND MARINE ENGINEERING	1258.0	1123.0	135.0	Engineering	0.107313	16	758	...	150	692	40	0.050125	70000	43000	80000	529	102	0
4	5	2405	CHEMICAL ENGINEERING	32260.0	21239.0	11021.0	Engineering	0.341631	289	25694	...	5180	16697	1672	0.061098	65000	50000	75000	18314	4440	972

5 rows × 21 columns

Engineering majors have the highest median salaries, taking the top 5 spots.

In [4]:

recent_grads.tail()

Out[4]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
168	169	3609	ZOOLOGY	8409.0	3050.0	5359.0	Biology & Life Science	0.637293	47	6259	...	2190	3602	304	0.046320	26000	20000	39000	2771	2947	743
169	170	5201	EDUCATIONAL PSYCHOLOGY	2854.0	522.0	2332.0	Psychology & Social Work	0.817099	7	2125	...	572	1211	148	0.065112	25000	24000	34000	1488	615	82
170	171	5202	CLINICAL PSYCHOLOGY	2838.0	568.0	2270.0	Psychology & Social Work	0.799859	13	2101	...	648	1293	368	0.149048	25000	25000	40000	986	870	622
171	172	5203	COUNSELING PSYCHOLOGY	4626.0	931.0	3695.0	Psychology & Social Work	0.798746	21	3777	...	965	2738	214	0.053621	23400	19200	26000	2403	1245	308
172	173	3501	LIBRARY SCIENCE	1098.0	134.0	964.0	Education	0.877960	2	742	...	237	410	87	0.104946	22000	20000	22000	288	338	192

5 rows × 21 columns

The columns have been defined as follows:

Header	Description
Rank	Rank by median earnings
Major_code	Major code, FOD1P in ACS PUMS
Major	Major description
Major_category	Category of major from Carnevale et al
Total	Total number of people with major
Sample_size	Sample size (unweighted) of full-time, year-round ONLY (used for earnings)
Men	Male graduates
Women	Female graduates
ShareWomen	Women as share of total
Employed	Number employed (ESR == 1 or 2)
Full_time	Employed 35 hours or more
Part_time	Employed less than 35 hours
Full_time_year_round	Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)
Unemployed	Number unemployed (ESR == 3)
Unemployment_rate	Unemployed / (Unemployed + Employed)
Median	Median earnings of full-time, year-round workers
P25th	25th percentile of earnings
P75th	75th percentile of earnings
College_jobs	Number with job requiring a college degree
Non_college_jobs	Number with job not requiring a college degree
Low_wage_jobs	Number in low-wage service jobs
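
As a quick check of the Unemployment_rate definition, the figure for Petroleum Engineering can be reproduced from the counts in Out[2] (1976 employed, 37 unemployed). This is only a small sketch: the helper name unemployment_rate is made up for illustration and is not part of the dataset or of pandas.

```python
def unemployment_rate(employed, unemployed):
    """Unemployed / (Unemployed + Employed), as defined in the column table."""
    labour_force = employed + unemployed
    if labour_force == 0:
        # The rate is undefined when nobody is employed or unemployed
        return float("nan")
    return unemployed / labour_force


# Counts taken from the Petroleum Engineering row shown in Out[2]
petroleum_rate = unemployment_rate(employed=1976, unemployed=37)
print(round(petroleum_rate, 7))
```

The result agrees with the 0.0183805 stored in the Unemployment_rate column for that row, which suggests the column was computed from Employed and Unemployed as described.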

In [5]:

# Generating summary statistics for all numerical columns
recent_grads.describe()

Out[5]:

	Rank	Major_code	Total	Men	Women	ShareWomen	Sample_size	Employed	Full_time	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
count	173.000000	173.000000	172.000000	172.000000	172.000000	172.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000
mean	87.000000	3879.815029	39370.081395	16723.406977	22646.674419	0.522223	356.080925	31192.763006	26029.306358	8832.398844	19694.427746	2416.329480	0.068191	40151.445087	29501.445087	51494.219653	12322.635838	13284.497110	3859.017341
std	50.084928	1687.753140	63483.491009	28122.433474	41057.330740	0.231205	618.361022	50675.002241	42869.655092	14648.179473	33160.941514	4112.803148	0.030331	11470.181802	9166.005235	14906.279740	21299.868863	23789.655363	6944.998579
min	1.000000	1100.000000	124.000000	119.000000	0.000000	0.000000	2.000000	0.000000	111.000000	0.000000	111.000000	0.000000	0.000000	22000.000000	18500.000000	22000.000000	0.000000	0.000000	0.000000
25%	44.000000	2403.000000	4549.750000	2177.500000	1778.250000	0.336026	39.000000	3608.000000	3154.000000	1030.000000	2453.000000	304.000000	0.050306	33000.000000	24000.000000	42000.000000	1675.000000	1591.000000	340.000000
50%	87.000000	3608.000000	15104.000000	5434.000000	8386.500000	0.534024	130.000000	11797.000000	10048.000000	3299.000000	7413.000000	893.000000	0.067961	36000.000000	27000.000000	47000.000000	4390.000000	4595.000000	1231.000000
75%	130.000000	5503.000000	38909.750000	14631.000000	22553.750000	0.703299	338.000000	31433.000000	25147.000000	9948.000000	16891.000000	2393.000000	0.087557	45000.000000	33000.000000	60000.000000	14444.000000	11783.000000	3466.000000
max	173.000000	6403.000000	393735.000000	173809.000000	307087.000000	0.968954	4212.000000	307933.000000	251540.000000	115172.000000	199897.000000	28169.000000	0.177226	110000.000000	95000.000000	125000.000000	151643.000000	148395.000000	48207.000000

The one issue with this data (from a plotting perspective) is that the columns have different numbers of non-missing values. The columns 'Total', 'Men', 'Women', and 'ShareWomen' have 172 values, not 173 (as the other columns do).

These missing values will need to be removed before we can pass the data into matplotlib for analysis.
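
Before dropping anything, it can help to see exactly which columns and rows contain the gaps. The sketch below uses a small made-up DataFrame (toy) as a stand-in for recent_grads, so the values are illustrative only; the same calls (isna, any, dropna) should work unchanged on the real DataFrame.

```python
import numpy as np
import pandas as pd

# Toy stand-in for recent_grads: one row is missing Total, Men, Women and ShareWomen
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, np.nan, 300.0],
    "Men": [60.0, np.nan, 120.0],
    "Women": [40.0, np.nan, 180.0],
    "ShareWomen": [0.4, np.nan, 0.6],
    "Median": [50000, 40000, 30000],
})

# Number of missing values per column, keeping only columns with at least one gap
missing_per_column = toy.isna().sum()
missing_per_column = missing_per_column[missing_per_column > 0]

# The rows that dropna(axis=0) would remove
rows_with_missing = toy[toy.isna().any(axis=1)]

cleaned = toy.dropna(axis=0)
print(missing_per_column)
print(rows_with_missing["Major"].tolist())
print(len(toy), "->", len(cleaned))
```

Inspecting rows_with_missing first makes it clear which major is about to be lost, rather than only seeing the row count change afterwards.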

In [6]:

# Record how many rows are in the uncleaned dataframe
raw_data_count = recent_grads.shape[0]
print(raw_data_count)

In [7]:

# Drop rows from the dataframe with missing values
recent_grads = recent_grads.dropna(axis=0)

In [8]:

# See how many rows with missing values have been dropped
cleaned_data_count = recent_grads.shape[0]
print("The uncleaned data set had ", raw_data_count, " rows")
print("The cleaned data set has ", cleaned_data_count, " rows")

The uncleaned data set had  173  rows
The cleaned data set has  172  rows

So there was only one row with missing values, which has now been dropped from the dataframe.

Now we can visualize the data to explore research questions.

Visualizing the data: Scatter plots

We will use scatter plots to answer the following questions:

Do students in more popular majors make more money?
Do students that majored in subjects that were majority female make more money?
Is there any link between the number of full-time employees and median salary?

In [9]:

recent_grads.columns

Out[9]:

Index(['Rank', 'Major_code', 'Major', 'Total', 'Men', 'Women',
       'Major_category', 'ShareWomen', 'Sample_size', 'Employed', 'Full_time',
       'Part_time', 'Full_time_year_round', 'Unemployed', 'Unemployment_rate',
       'Median', 'P25th', 'P75th', 'College_jobs', 'Non_college_jobs',
       'Low_wage_jobs'],
      dtype='object')

In [10]:

# Scatter plot: Sample size and median
recent_grads.plot(x = 'Sample_size', y = 'Median', kind = 'scatter', title = 'Median earnings vs. Sample Size', xlim=(0,4500), ylim=(0,120000))

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f99e77f9d30>

Q. Do students in more popular majors make more money?

A. The scatter plot suggests that there is no noticeable relationship between the sample size and the median salary. However, there are two important qualifiers to this answer:

This scatter plot uses earnings information for an unweighted sample of people with the major. Therefore it may not be representative of the population of graduates with this major as a whole.
The median sample size is 130 and the 75th percentile is 338, with the chart scale distorted by a few outliers, which may visually compress any relationship. The chart can be zoomed in to see if there is a relationship within the smaller range of majors with sample sizes equal to or less than the 75th percentile of 338.

In [11]:

# Scatter plot: Sample size (up to 75th percentile) and median
recent_grads.plot(x = 'Sample_size', y = 'Median', kind = 'scatter', title = 'Median earnings vs. Sample Size up to 75th percentile', xlim=(0,338), ylim=(0,120000))

Out[11]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f99e7745ef0>

Within this narrower range of sample sizes there is still no strong overall correlation between sample size and median earnings.

However there is a wider range of median earnings in majors with a sample size under 50. With small samples the risk of an unrepresentative median salary is higher as outliers have a bigger effect.

Overall this additional scatter plot does not change the above answer to the question.
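
A correlation coefficient can put a number on what the scatter plots suggest by eye. The sketch below wraps Series.corr (Pearson by default) in a hypothetical helper, pearson_r, and runs it on made-up values rather than on rows from recent-grads.csv.

```python
import pandas as pd


def pearson_r(df, x, y):
    """Pearson correlation between two columns of df."""
    return df[x].corr(df[y])  # Series.corr defaults to method='pearson'


# Made-up values for illustration only; these are not rows from the dataset
toy = pd.DataFrame({
    "Sample_size": [10, 50, 200, 1000, 4000],
    "Median": [60000, 30000, 45000, 35000, 40000],
    "Doubled": [20, 100, 400, 2000, 8000],  # exact multiple of Sample_size
})

r_weak = pearson_r(toy, "Sample_size", "Median")
r_perfect = pearson_r(toy, "Sample_size", "Doubled")
print(round(r_perfect, 3))
```

On the notebook's data the equivalent call would be pearson_r(recent_grads, 'Sample_size', 'Median'). A value close to 0 would support the "no noticeable relationship" reading, with the caveat that Pearson's r only measures linear association and is sensitive to outliers.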

In [12]:

# Sample size and unemployment rate
ax = recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind='scatter')
ax.set_title('Unemployment rate vs. Sample size')
ax.set_xlim(0,4500)
ax.set_ylim(0,0.2)

Out[12]:

(0, 0.2)

There is a lot of variation in unemployment rates among majors with small sample sizes. Yet again the small sample sizes may affect the representativeness of the relationship plotted here.

In addition, from reviewing some rows of the dataframe, there is a noticeable difference between the sample size and the number of graduates for whom there is data on whether they are employed/unemployed.

For example for Petroleum Engineering (rank 1) there is a sample size of 36 and a total of (1976+37) employed and unemployed.

This suggests that sample size is not readily comparable to other statistics collected, other than median wage.

In [13]:

# Full-time workers and median salary
ax = recent_grads.plot(x='Full_time', y='Median', kind='scatter')
ax.set_title('Median salary vs. Number employed full-time')
ax.set_xlim(0,255000)
ax.set_ylim(0,120000)

Out[13]:

(0, 120000)

Q. Is there any link between the number of full-time employees and median salary?

A. There is no noticeable correlation between the number of graduates per major employed full-time and the median wage. If there were a relationship, it would be positive, i.e. majors with more full-time employees would have a higher median wage.

But as noted above, the median wage figures are based on smaller unweighted samples that may not represent the wider population of graduates with each major.

To be sure, a narrower range of the data can be plotted by setting the axis limits at roughly the 75th percentile of each variable.

In [14]:

ax = recent_grads.plot(x='Full_time', y='Median', kind='scatter')
ax.set_title('Median salary vs. Number employed full-time (both at 75th percentile)')
ax.set_xlim(0,26000)
ax.set_ylim(0,45000)

Out[14]:

(0, 45000)

For this narrowed down sample there is no noticeable relationship.
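
The axis limits above (26000 and 45000) are round numbers typed in by hand from the describe() output. As an alternative, the limits could be computed directly with Series.quantile. The helper upper_limits below is a hypothetical name, and the toy values are made up for illustration.

```python
import pandas as pd


def upper_limits(df, columns, q=0.75):
    """Return {column: q-th quantile} for use as upper axis limits."""
    return {col: df[col].quantile(q) for col in columns}


# Made-up values for illustration only
toy = pd.DataFrame({
    "Full_time": [100, 200, 300, 400, 500],
    "Median": [20000, 30000, 40000, 50000, 60000],
})

limits = upper_limits(toy, ["Full_time", "Median"])
print(limits)
```

With the real data, the result could be passed straight to the plot, e.g. ax.set_xlim(0, limits['Full_time']) and ax.set_ylim(0, limits['Median']), so the limits stay correct if the data changes.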

In [15]:

# Share of women and the unemployment rate
ax = recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter')
ax.set_title('Unemployment rate vs Proportion of female graduates')
ax.set_xlim(0,1)
ax.set_ylim(0,0.2)

Out[15]:

(0, 0.2)

There doesn't seem to be a strong relationship between the proportion of female graduates in a course and the unemployment rate.

In [16]:

# Share of women and median salary
ax = recent_grads.plot(x='ShareWomen', y='Median', kind='scatter')
ax.set_title('Median salary vs. Proportion of female graduates')
ax.set_xlim(0,1)
ax.set_ylim(0,120000)

Out[16]:

(0, 120000)

Q. Do students that majored in subjects that were majority female make more money?

A. No. Here there is a noticeable relationship: the higher the proportion of female graduates for a major, the lower the median salary is.

The lower median salary is not due to more part-time work, because Median is defined as the median salary of full-time, year-round workers.

This means that the lower salary could be due to lower pay in the types of major (and subsequent career paths) that have a higher proportion of female graduates, and/or to lower wages paid to women, or to less career capital resulting from a higher propensity to take time away from work for family.

In [17]:

# Number of male graduates and median wage
ax = recent_grads.plot(x='Men', y='Median', kind='scatter')
ax.set_title('Median salary vs. Number of male graduates per major')
ax.set_xlim(0,175000)
ax.set_ylim(0,120000)

Out[17]:

(0, 120000)

There is no obvious relationship here.

In [18]:

# Number of female graduates and median wage
ax = recent_grads.plot(x='Women', y='Median', kind='scatter')
ax.set_title('Median salary vs. Number of female graduates per major')
ax.set_xlim(0,310000)
ax.set_ylim(0,120000)

Out[18]:

(0, 120000)

There is no obvious relationship here either.

Visualizing the data: Histograms

We will use histograms to answer the following questions:

What percent of majors are predominantly male? Predominantly female?
What's the most common median salary range?

In [19]:

# To allow bin size to be changed, use Series.hist() and not Series.plot(kind='hist')
# Sample_size histogram
recent_grads["Sample_size"].hist(bins=10)

Out[19]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f99e54332b0>

Most of the sample size values are below 500 so a more detailed view of the majority can be found by looking at those with a sample size below 500.

In [20]:

recent_grads["Sample_size"].hist(bins=50, range=(0,500))

Out[20]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f99e538ab00>

About half of the sample sizes were at or below 130 (the median reported by describe() above), and about a quarter were below 40. This raises concerns over how representative the salary data for each major is.

In [21]:

# Median salary histogram
recent_grads["Median"].hist(range=(20000,110000), bins=18)

Out[21]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f99e52d6da0>

Q. What's the most common median salary range?

A. The median salaries are mostly clustered around $30,000-40,000, with a relatively quick drop off in frequency for salary bands on either side.
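
Reading the tallest bar off a histogram is approximate, so the busiest salary band can also be counted directly with pd.cut and value_counts. This is a sketch on made-up medians; the helper name most_common_band and the 10,000-wide bands are my own assumptions, not something taken from the notebook.

```python
import pandas as pd


def most_common_band(values, start, stop, width):
    """Count values per [left, right) band and return (counts, busiest band)."""
    edges = list(range(start, stop + width, width))
    bands = pd.cut(values, bins=edges, right=False)
    counts = bands.value_counts().sort_index()
    return counts, counts.idxmax()


# Made-up median salaries for illustration only
toy_medians = pd.Series([22000, 31000, 33000, 36000, 38000, 45000, 62000, 110000])

counts, top_band = most_common_band(toy_medians, start=20000, stop=120000, width=10000)
print(counts)
print(top_band)
```

Applied to recent_grads["Median"], the same call would return the interval containing the most majors, which can then be compared with the $30,000-40,000 range read from the histogram.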

In [22]:

# Employed histogram
recent_grads["Employed"].hist(range=(0,310000))

Out[22]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f99e5495da0>

I assume that the number employed per major is driven, in part, by the number of students who took the major, so its distribution is not very instructive by itself. To check this assumption I can look at the relationship between Total and Employed.

In [23]:

# e.g. the Total number of people and number Employed for the largest majors
# Filter by majors with Total > 39,000 (75th percentile)
largest_majors = recent_grads.loc[recent_grads["Total"] > 39000, ["Major_code", "Major", "Total", "Employed"]]
largest_majors.sort_values(by='Total', ascending=False).head(10)

Out[23]:

	Major_code	Major	Total	Employed
145	5200	PSYCHOLOGY	393735.0	307933
76	6203	BUSINESS MANAGEMENT AND ADMINISTRATION	329927.0	276234
123	3600	BIOLOGY	280709.0	182295
57	6200	GENERAL BUSINESS	234590.0	190183
93	1901	COMMUNICATIONS	213996.0	179633
34	6107	NURSING	209394.0	180903
77	6206	MARKETING AND MARKETING RESEARCH	205211.0	178862
40	6201	ACCOUNTING	198633.0	165527
137	3301	ENGLISH LANGUAGE AND LITERATURE	194673.0	149180
78	5506	POLITICAL SCIENCE AND GOVERNMENT	182621.0	133454

So there is, unsurprisingly, a link between the Total number of people who have taken a major and the number Employed.

A better way to illustrate this relationship would be with a scatter plot.

In [24]:

ax = recent_grads.plot(x='Total', y='Employed', kind='scatter')
ax.set_xlim(0, 400000)
ax.set_ylim(0,310000)

Out[24]:

(0, 310000)

So, as expected, the number Employed per major is closely related to the Total number of graduates per major. It would be more useful to look at the employment (or unemployment) rates per major, rather than the absolute numbers.

In [25]:

# Full-time histogram
recent_grads["Full_time"].hist()

Out[25]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f99e56db630>

This closely mirrors the distribution of the numbers Employed per major, which is expected.

In [26]:

# ShareWomen histogram
recent_grads["ShareWomen"].hist()

Out[26]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f99e551b278>

It appears that just over 50% of all majors are majority female, with the highest frequency at 70-80% female.
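
The percentage of majors that are predominantly male or female can also be computed exactly rather than estimated from the bars. The sketch below assumes "predominantly" means a ShareWomen value above or below 0.5; the helper name gender_majority_split and the toy shares are illustrative only.

```python
import pandas as pd


def gender_majority_split(share_women):
    """Percentage of majors with ShareWomen above, below, and exactly at 0.5."""
    share_women = share_women.dropna()
    return {
        "majority_female_pct": (share_women > 0.5).mean() * 100,
        "majority_male_pct": (share_women < 0.5).mean() * 100,
        "even_pct": (share_women == 0.5).mean() * 100,
    }


# Made-up shares for illustration only
toy_share = pd.Series([0.12, 0.34, 0.55, 0.80, 0.97])

split = gender_majority_split(toy_share)
print(split)
```

Taking the mean of a boolean Series gives the fraction of True values, so gender_majority_split(recent_grads["ShareWomen"]) would answer the question directly for the real data.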

In [27]:

# Seeing which courses have females at 80% or more
high_female_share = recent_grads[recent_grads["ShareWomen"] >= 0.8]
print(high_female_share.shape)
high_female_share.sort_values(by='ShareWomen', ascending=False).head(10)

(18, 21)

Out[27]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
164	165	2307	EARLY CHILDHOOD EDUCATION	37589.0	1167.0	36422.0	Education	0.968954	342	32551	...	7001	20748	1360	0.040105	28000	21000	35000	23515	7705	2868
163	164	6102	COMMUNICATION DISORDERS SCIENCES AND SERVICES	38279.0	1225.0	37054.0	Health	0.967998	95	29763	...	13862	14460	1487	0.047584	28000	20000	40000	19957	9404	5125
51	52	6104	MEDICAL ASSISTING SERVICES	11123.0	803.0	10320.0	Health	0.927807	67	9168	...	4107	4290	407	0.042507	42000	30000	65000	2091	6948	1270
138	139	2304	ELEMENTARY EDUCATION	170862.0	13029.0	157833.0	Education	0.923745	1629	149339	...	37965	86540	7297	0.046586	32000	23400	38000	108085	36972	11502
150	151	2901	FAMILY AND CONSUMER SCIENCES	58001.0	5166.0	52835.0	Industrial Arts & Consumer Services	0.910933	518	46624	...	15872	26906	3355	0.067128	30000	22900	40000	20985	20133	5248
100	101	2310	SPECIAL NEEDS EDUCATION	28739.0	2682.0	26057.0	Education	0.906677	246	24639	...	5153	16642	1067	0.041508	35000	32000	42000	20185	3797	1179
156	157	5403	HUMAN SERVICES AND COMMUNITY ORGANIZATION	9374.0	885.0	8489.0	Psychology & Social Work	0.905590	89	8294	...	2405	5061	326	0.037819	30000	24000	35000	2878	4595	724
151	152	5404	SOCIAL WORK	53552.0	5137.0	48415.0	Psychology & Social Work	0.904075	374	45038	...	13481	27588	3329	0.068828	30000	25000	35000	27449	14416	4344
34	35	6107	NURSING	209394.0	21773.0	187621.0	Health	0.896019	2554	180903	...	40818	122817	8497	0.044863	48000	39000	58000	151643	26146	6193
88	89	6199	MISCELLANEOUS HEALTH MEDICAL PROFESSIONS	13386.0	1589.0	11797.0	Health	0.881294	81	10076	...	4145	5868	893	0.081411	36000	23000	42000	5652	3835	1422

10 rows × 21 columns

In [28]:

# Unemployment Rate histogram
recent_grads["Unemployment_rate"].hist()

Out[28]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f99e5675438>

The most frequent unemployment range is 6-7%, but there are a handful of courses with unemployment rates greater than 14%.

In [29]:

high_unemp = recent_grads[recent_grads["Unemployment_rate"] >= 0.14]
print(high_unemp.shape)
high_unemp.sort_values(by='Unemployment_rate', ascending=False)

(4, 21)

Out[29]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
5	6	2418	NUCLEAR ENGINEERING	2573.0	2200.0	373.0	Engineering	0.144967	17	1857	...	264	1449	400	0.177226	65000	50000	102000	1142	657	244
89	90	5401	PUBLIC ADMINISTRATION	5629.0	2947.0	2682.0	Law & Public Policy	0.476461	46	4158	...	847	2952	789	0.159491	36000	23000	60000	919	2313	496
84	85	2107	COMPUTER NETWORKING AND TELECOMMUNICATIONS	7613.0	5291.0	2322.0	Computers & Mathematics	0.305005	97	6144	...	1447	4369	1100	0.151850	36400	27000	49000	2593	2941	352
170	171	5202	CLINICAL PSYCHOLOGY	2838.0	568.0	2270.0	Psychology & Social Work	0.799859	13	2101	...	648	1293	368	0.149048	25000	25000	40000	986	870	622

4 rows × 21 columns

'Nuclear Engineering' and 'Computer Networking and Telecommunications' are unexpected given they are in in-demand fields (Engineering, Computers & Mathematics).

In [30]:

# Men histogram i.e. number of male graduates
recent_grads["Men"].hist()

Out[30]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f99e518c390>

In [31]:

# Women histogram i.e. number of female graduates
recent_grads["Women"].hist()

Out[31]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f99e511d390>

Visualizing the data: Scatter Matrix Plot

A scatter matrix plot combines both scatter plots and histograms into one grid of plots and allows us to explore potential relationships and distributions simultaneously.

These can be generated by selecting the dataframe and columns of interest and passing this into: pandas.plotting.scatter_matrix

In [32]:

# import scatter_matrix function from pandas.plotting
from pandas.plotting import scatter_matrix

In [33]:

# A 2 by 2 scatter matrix plot of Sample_size and Median salary
scatter_matrix(recent_grads[["Sample_size", "Median"]], figsize=(10,10))

Out[33]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f99e51de470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4fe2128>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4faab00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4f67860>]],
      dtype=object)

Most sample sizes are less than 500 (top left histogram). The scatter plot of Sample_size vs. Median salary (bottom left) doesn't seem to provide much information other than there not being an obvious relationship between Sample_size and Median salary.

However the mirror scatter plot of Median salary vs. Sample_size (top right) shows that large Sample sizes are not associated with outlier Median salary values. Instead the majors with the highest Sample sizes have Median salaries that are in line with the most common salary ranges ($30,000-40,000).

In [34]:

# Scatter matrix plot of Sample_size, Median, and Unemployment_rate
scatter_matrix(recent_grads[["Sample_size", "Median", "Unemployment_rate"]], figsize=(20,20))

Out[34]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f99e53c11d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4ea0438>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4e69b70>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4e248d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4dedb00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4dadf98>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4cff6d8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4d34dd8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4c89160>]],
      dtype=object)

There are no noticeably strong relationships between the variables here.

Scatter_matrix is a useful way to quickly explore relationships that I've considered above, for example Total students with a major and the number Employed.

In [35]:

# Total and Employed scatter matrix
scatter_matrix(recent_grads[["Total", "Employed"]], figsize=(10,10))

Out[35]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f99e4b8cef0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f99e434ef98>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f99e431f400>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f99e42d3da0>]],
      dtype=object)

Doing a scatter matrix plot earlier would have saved time and quickly revealed the strong relationship between the two variables.

Visualizing the data: Bar plots

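Bar plots can be drawn with either df.plot(kind='bar') or df.plot.bar(x=..., y=...), where x is the column used for the bar labels and y is the column used for the bar heights.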

In [36]:

# Looking at the share of women for the top 10 and bottom 10
# courses NB data is ranked by median salary

# Share of women in the top 10 courses
recent_grads[:10].plot.bar(x='Major', y='ShareWomen', title='Share of women in the 10 courses with the highest median salary')

Out[36]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f99e420cf98>

In [37]:

# Share of women in the bottom 10 courses
recent_grads[-10:].plot.bar(x='Major', y='ShareWomen', title='Share of women in the 10 courses with the lowest median salary')

Out[37]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f99e41d2be0>

The courses with the highest median salaries have a lower share of female graduates than those with the lowest median salaries, which are majority female (i.e. over 50% of the graduates are women).

Now to calculate how large the difference is:

In [38]:

# Calculating the average proportion of female graduates for the top and bottom 10 courses
# head/tail select by position, so the result does not depend on which index labels survived dropna
top_10_female_share = recent_grads["ShareWomen"].head(10).mean()
bottom_10_female_share = recent_grads["ShareWomen"].tail(10).mean()

In [39]:

top_10 = "The 10 highest paying courses have an average female share of {:.2f}".format(top_10_female_share)
bottom_10 = "The 10 lowest paying courses have an average female share of {:.2f}".format(bottom_10_female_share)

print(top_10)
print(bottom_10)

The 10 highest paying courses have an average female share of 0.23
The 10 lowest paying courses have an average female share of 0.79

So the average proportion of female graduates differs by over 50 percentage points (0.79 vs. 0.23) between the bottom and top 10 courses (in terms of median pay)!

Next we will look at the differences in the unemployment rate between the top 10 and bottom 10 courses.

In [40]:

# Unemployment rate for the top 10 courses
ax1 = recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate', title='Unemployment rate for the top 10 courses')
ax2 = recent_grads[-10:].plot.bar(x='Major', y='Unemployment_rate', title='Unemployment rate for the bottom 10 courses')

The picture is less clear for this comparison. The top 10 courses tend to have a lower unemployment rate, with 2 exceptions: 'Nuclear Engineering' and 'Mining and Mineral Engineering'. In the bottom 10, 3-5 of the courses have higher unemployment rates.

This can be analysed by looking at average unemployment rates.

In [41]:

mean_unemp_rate = recent_grads["Unemployment_rate"].mean()
# NB: .loc slices by index label and includes the stop label, so .loc[:9] only
# returns the first 10 rows while labels 0-9 are all present; head/tail select by position
top_10_unemp = recent_grads["Unemployment_rate"].head(10).mean()
bottom_10_unemp = recent_grads["Unemployment_rate"].tail(10).mean()
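
The note about .loc matters here because rows were dropped earlier, which leaves gaps in the index labels. The sketch below uses a made-up four-row DataFrame to show how a label-based slice and a position-based selection can diverge once a label is missing; it is an illustration, not a claim about which row was dropped from recent_grads.

```python
import pandas as pd

# Toy frame with the default index 0..3
toy = pd.DataFrame({"value": [10, 20, 30, 40]})

# Dropping label 1 leaves a gap in the index, as dropna() can
cleaned = toy.drop(index=1)

# Label-based: everything up to and including label 1
first_two_before = toy.loc[:1, "value"].tolist()
first_two_after = cleaned.loc[:1, "value"].tolist()

# Position-based: always the first two rows, whatever their labels
head_after = cleaned["value"].head(2).tolist()

print(first_two_before, first_two_after, head_after)
```

Before the drop, .loc[:1] returns two rows; afterwards it returns only one, while .head(2) still returns two. For "top N rows" questions, head, tail, or .iloc are therefore the safer choice on a cleaned DataFrame.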

In [42]:

mean_all = "The average unemployment rate across all majors is {:.2f}".format(mean_unemp_rate)
mean_top = "The average unemployment rate for the top 10 majors is {:.2f}".format(top_10_unemp)
mean_bottom = "The average unemployment rate for the bottom 10 majors is {:.2f}".format(bottom_10_unemp)

print(mean_all)
print(mean_top)
print(mean_bottom)

The average unemployment rate across all majors is 0.07
The average unemployment rate for the top 10 majors is 0.07
The average unemployment rate for the bottom 10 majors is 0.08

Whilst the average unemployment rates are similar for the top and bottom 10 courses, the spread differs: more of the bottom 10 courses appear to be slightly above average, whereas in the top 10, 2 courses are far above the average and the others are far below it.

To investigate this further:

In [43]:

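# .copy() gives an independent DataFrame, so adding a column does not trigger SettingWithCopyWarning
top_10_outliers = recent_grads[:10].loc[recent_grads[:10]["Unemployment_rate"] > mean_unemp_rate].copy()
top_10_outliers["Difference_from_mean"] = top_10_outliers["Unemployment_rate"] - mean_unemp_rate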

In [44]:

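# .copy() gives an independent DataFrame, so adding a column does not trigger SettingWithCopyWarning
bottom_10_outliers = recent_grads[-10:].loc[recent_grads[-10:]["Unemployment_rate"] > mean_unemp_rate].copy()
bottom_10_outliers["Difference_from_mean"] = bottom_10_outliers["Unemployment_rate"] - mean_unemp_rate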

In [45]:

# Plot the majors from the top and bottom 10 with above-average
# unemployment rates and the size of the difference in
# unemployment rate from the average
top_10_outliers.plot.bar(x='Major', y='Difference_from_mean', title='Majors in the top 10 with above average unemployment rates')
bottom_10_outliers.plot.bar(x='Major', y='Difference_from_mean', title='Majors in the bottom 10 with above average unemployment rates')

Out[45]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f99def3fb70>

So in the top 10, one course (Nuclear Engineering) is dragging up the average unemployment rate for the top 10 majors. In the bottom 10 Clinical Psychology has an above average unemployment rate, but the other 4 courses are also dragging up the unemployment rate.

Additional analysis

There are a lot of majors which makes it harder to see patterns across types of major e.g. arts, sciences.

So I will look at some analysis using the Category of the major.

Firstly I will build a new dataframe which is indexed by the categories of the majors.

In [46]:

# Create a list of all the categories of major
categories = recent_grads["Major_category"].unique()

In [47]:

# Now to aggregate across categories
# Firstly unemployment rates

# Create an empty dictionary
cat_unemp = {}

# Loop through major categories, calculate mean unemployment rate and add to dictionary
for c in categories:
    unemp_mean = recent_grads.loc[recent_grads["Major_category"] == c, "Unemployment_rate"].mean()
    cat_unemp[c] = unemp_mean

In [48]:

# Now the proportion (share) of women in each category

# Create an empty dictionary
cat_share_women = {}

#Loop through categories, calculate mean share of women, and add to dictionary
for c in categories:
    women_mean = recent_grads.loc[recent_grads["Major_category"] == c, "ShareWomen"].mean()
    cat_share_women[c] = women_mean

In [49]:

# Now the average median salary

# Create an empty dictionary
cat_salary = {}

for c in categories:
    salary_mean = recent_grads.loc[recent_grads["Major_category"] == c, "Median"].mean()
    cat_salary[c] = salary_mean

In [50]:

# Now the total number of men and women in each category

cat_women = {}

for c in categories:
    women_sum = recent_grads.loc[recent_grads["Major_category"] == c, "Women"].sum()
    cat_women[c] = women_sum

In [51]:

cat_men = {}

for c in categories:
    men_sum = recent_grads.loc[recent_grads["Major_category"] == c, "Men"].sum()
    cat_men[c] = men_sum

Now to convert all five dictionaries into Series objects, and then add the Series to a dataframe (with named column headings).

In [52]:

unemp_series = pd.Series(cat_unemp)
share_women_series = pd.Series(cat_share_women)
salary_series = pd.Series(cat_salary)
women_series = pd.Series(cat_women)
men_series = pd.Series(cat_men)

type(salary_series)

Out[52]:

pandas.core.series.Series

In [53]:

#Now to turn all series into a dataframe
#NB. The dictionary keys became the index in the Series obj
#This index can be used for the dataframe

major_categories = pd.DataFrame(unemp_series, columns=['mean_unemployment_rate'])

major_categories

Out[53]:

	mean_unemployment_rate
Agriculture & Natural Resources	0.051817
Arts	0.090173
Biology & Life Science	0.060918
Business	0.071064
Communications & Journalism	0.075538
Computers & Mathematics	0.084256
Education	0.051702
Engineering	0.063334
Health	0.065920
Humanities & Liberal Arts	0.081008
Industrial Arts & Consumer Services	0.048071
Interdisciplinary	0.070861
Law & Public Policy	0.090805
Physical Sciences	0.046511
Psychology & Social Work	0.072065
Social Science	0.095729

Now to add the other series into this new dataframe.

In [54]:

# Now add the remaining series as new columns
# Don't use the DataFrame constructor here; it is only needed to create the df object

# Each series shares the same index (the category names), so values align by category
major_categories["mean_share_women"] = share_women_series
major_categories["mean_salary"] = salary_series
major_categories["number_female_grads"] = women_series
major_categories["number_male_grads"] = men_series

major_categories

Out[54]:

	mean_unemployment_rate	mean_share_women	mean_salary	number_female_grads	number_male_grads
Agriculture & Natural Resources	0.051817	0.405267	35111.111111	35263.0	40357.0
Arts	0.090173	0.603658	33062.500000	222740.0	134390.0
Biology & Life Science	0.060918	0.587193	36421.428571	268943.0	184919.0
Business	0.071064	0.483198	43538.461538	634524.0	667852.0
Communications & Journalism	0.075538	0.658384	34500.000000	260680.0	131921.0
Computers & Mathematics	0.084256	0.311772	42745.454545	90283.0	208725.0
Education	0.051702	0.748507	32350.000000	455603.0	103526.0
Engineering	0.063334	0.238889	57382.758621	129276.0	408307.0
Health	0.065920	0.795152	36825.000000	387713.0	75517.0
Humanities & Liberal Arts	0.081008	0.631790	31913.333333	440622.0	272846.0
Industrial Arts & Consumer Services	0.048071	0.349523	36342.857143	126011.0	103781.0
Interdisciplinary	0.070861	0.770901	35000.000000	9479.0	2817.0
Law & Public Policy	0.090805	0.483649	42200.000000	87978.0	91129.0
Physical Sciences	0.046511	0.508683	41890.000000	90089.0	95390.0
Psychology & Social Work	0.072065	0.794397	30100.000000	382892.0	98115.0
Social Science	0.095729	0.553962	37344.444444	273132.0	256834.0
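As an aside, the same kind of summary table can be produced without the intermediate dictionaries and Series. The following is a minimal sketch rather than the method used in this notebook: it assumes pandas 0.25 or later (for named aggregation) and uses a small made-up frame called `toy` in place of `recent_grads`, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy rows standing in for recent_grads (all values are made up)
toy = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Arts", "Arts"],
    "Unemployment_rate": [0.02, 0.06, 0.08, 0.10],
    "ShareWomen": [0.10, 0.30, 0.60, 0.70],
    "Median": [100000, 60000, 30000, 40000],
    "Women": [100.0, 300.0, 600.0, 700.0],
    "Men": [900.0, 700.0, 400.0, 300.0],
})

# One groupby call computes every per-category statistic;
# each keyword is an output column: (source column, aggregation)
summary = toy.groupby("Major_category").agg(
    mean_unemployment_rate=("Unemployment_rate", "mean"),
    mean_share_women=("ShareWomen", "mean"),
    mean_salary=("Median", "mean"),
    number_female_grads=("Women", "sum"),
    number_male_grads=("Men", "sum"),
)

print(summary)
```

Replacing `toy` with `recent_grads` should give a table with the same column names as `major_categories`, indexed by category.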

Now it is time to plot the data by category to see what patterns emerge.

Firstly, using a grouped bar plot.

In [55]:

df_subset = major_categories[["number_male_grads", "number_female_grads"]]
df_subset.plot.bar(title = 'Total number of male and female graduates per category of major')

Out[55]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f99deea36a0>

As the plot shows, the gap between the number of male and female graduates is largest in the following categories:

Education
Engineering
Health
Humanities & Liberal Arts
Psychology & Social Work
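The categories listed above can be checked numerically rather than read off the chart. This sketch re-enters the graduate counts from the summary table by hand (the name `counts` is just for illustration) and ranks the categories by the absolute difference between female and male graduates.

```python
import pandas as pd

# Graduate counts copied from the major_categories table above
counts = pd.DataFrame(
    {
        "number_female_grads": [35263, 222740, 268943, 634524, 260680, 90283,
                                455603, 129276, 387713, 440622, 126011, 9479,
                                87978, 90089, 382892, 273132],
        "number_male_grads": [40357, 134390, 184919, 667852, 131921, 208725,
                              103526, 408307, 75517, 272846, 103781, 2817,
                              91129, 95390, 98115, 256834],
    },
    index=["Agriculture & Natural Resources", "Arts", "Biology & Life Science",
           "Business", "Communications & Journalism", "Computers & Mathematics",
           "Education", "Engineering", "Health", "Humanities & Liberal Arts",
           "Industrial Arts & Consumer Services", "Interdisciplinary",
           "Law & Public Policy", "Physical Sciences",
           "Psychology & Social Work", "Social Science"],
)

# Absolute gap between female and male graduates, largest first
gap = (counts["number_female_grads"] - counts["number_male_grads"]).abs()
top_gaps = gap.sort_values(ascending=False).head(5)

print(top_gaps)
```

Ranking by absolute gap favours large categories. Dividing the gap by the total number of graduates in each category would give a relative measure instead, and could change the ordering.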

Now to look at how the mean salary differs across categories.

In [56]:

major_categories.plot.bar(y='mean_salary', title='Average salary per category of major')

Out[56]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f99ded9b780>

The mean salary (here, the average of the median salaries of the majors in each category) is highest by far in Engineering, followed by Business.

Now to look at the relationship between mean salary and the average proportion of women in each category.

In [57]:

ax = major_categories.plot(x='mean_share_women', y='mean_salary', kind='scatter')
ax.set_title('Average share of female graduates per category vs. average salary')

Out[57]:

<matplotlib.text.Text at 0x7f99deca1eb8>

There's a slight drop in the average salary as the proportion of female graduates rises.
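One way to put a number on this trend is a Pearson correlation coefficient. The sketch below is an optional check, not part of the original analysis: it re-enters the per-category values from the summary table (salaries rounded to two decimals) and uses `Series.corr`. With only 16 categories, the result is a rough indication at best.

```python
import pandas as pd

# Values copied from the major_categories table, in the same row order
mean_share_women = pd.Series([
    0.405267, 0.603658, 0.587193, 0.483198, 0.658384, 0.311772, 0.748507,
    0.238889, 0.795152, 0.631790, 0.349523, 0.770901, 0.483649, 0.508683,
    0.794397, 0.553962,
])
mean_salary = pd.Series([
    35111.11, 33062.50, 36421.43, 43538.46, 34500.00, 42745.45, 32350.00,
    57382.76, 36825.00, 31913.33, 36342.86, 35000.00, 42200.00, 41890.00,
    30100.00, 37344.44,
])

# Pearson correlation: negative means salary tends to fall as the share of women rises
r_share_salary = mean_share_women.corr(mean_salary)

print(round(r_share_salary, 3))
```

A negative coefficient is consistent with the downward drift in the scatter plot, though correlation across category averages says nothing about the cause.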

Finally, how does the average unemployment rate vary across the different categories?

In [58]:

major_categories.plot.bar(y='mean_unemployment_rate', legend = False, title='Average unemployment rate per category of major')

Out[58]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f99dec4ddd8>

Interestingly, there doesn't appear to be a close link between the unemployment rate and the average salary. For example, Education has both a low unemployment rate and a low salary. Low unemployment usually means there is less surplus labour available, which would normally be expected to push wages up.

Using a scatter plot to explore this further:

In [59]:

ax = major_categories.plot(x='mean_unemployment_rate', y='mean_salary', kind='scatter')
ax.set_title("Average unemployment rate vs. average salary for each category of major")

Out[59]:

<matplotlib.text.Text at 0x7f99debb1320>
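The scatter plot can be backed up with a correlation coefficient. As a hedged sketch, the per-category values from the summary table are re-entered here (salaries rounded to two decimals) and compared with `Series.corr`. A coefficient close to zero would indicate little linear relationship between the two columns.

```python
import pandas as pd

# Values copied from the major_categories table, in the same row order
mean_unemployment_rate = pd.Series([
    0.051817, 0.090173, 0.060918, 0.071064, 0.075538, 0.084256, 0.051702,
    0.063334, 0.065920, 0.081008, 0.048071, 0.070861, 0.090805, 0.046511,
    0.072065, 0.095729,
])
mean_salary = pd.Series([
    35111.11, 33062.50, 36421.43, 43538.46, 34500.00, 42745.45, 32350.00,
    57382.76, 36825.00, 31913.33, 36342.86, 35000.00, 42200.00, 41890.00,
    30100.00, 37344.44,
])

# Pearson correlation between mean unemployment rate and mean salary
r_unemp_salary = mean_unemployment_rate.corr(mean_salary)

print(round(r_unemp_salary, 3))
```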

So there isn't an obvious relationship between average salary and