Using a dataset on the job outcomes of students in America who graduated from college between 2010 and 2012, we will explore questions such as:
# Importing the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
# Ensure plots are displayed inline
%matplotlib inline
# Read in the dataset into a DataFrame
recent_grads = pd.read_csv('recent-grads.csv')
# Return first row as a table
recent_grads.iloc[0]
Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object
# Understand how the data is structured
recent_grads.head(5)
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 150 | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 5180 | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 |
5 rows × 21 columns
Engineering majors have the highest median salaries, taking the top 5 spots.
recent_grads.tail()
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
168 | 169 | 3609 | ZOOLOGY | 8409.0 | 3050.0 | 5359.0 | Biology & Life Science | 0.637293 | 47 | 6259 | ... | 2190 | 3602 | 304 | 0.046320 | 26000 | 20000 | 39000 | 2771 | 2947 | 743 |
169 | 170 | 5201 | EDUCATIONAL PSYCHOLOGY | 2854.0 | 522.0 | 2332.0 | Psychology & Social Work | 0.817099 | 7 | 2125 | ... | 572 | 1211 | 148 | 0.065112 | 25000 | 24000 | 34000 | 1488 | 615 | 82 |
170 | 171 | 5202 | CLINICAL PSYCHOLOGY | 2838.0 | 568.0 | 2270.0 | Psychology & Social Work | 0.799859 | 13 | 2101 | ... | 648 | 1293 | 368 | 0.149048 | 25000 | 25000 | 40000 | 986 | 870 | 622 |
171 | 172 | 5203 | COUNSELING PSYCHOLOGY | 4626.0 | 931.0 | 3695.0 | Psychology & Social Work | 0.798746 | 21 | 3777 | ... | 965 | 2738 | 214 | 0.053621 | 23400 | 19200 | 26000 | 2403 | 1245 | 308 |
172 | 173 | 3501 | LIBRARY SCIENCE | 1098.0 | 134.0 | 964.0 | Education | 0.877960 | 2 | 742 | ... | 237 | 410 | 87 | 0.104946 | 22000 | 20000 | 22000 | 288 | 338 | 192 |
5 rows × 21 columns
Header | Description |
---|---|
Rank | Rank by median earnings |
Major_code | Major code, FO1DP in ACS PUMS |
Major | Major description |
Major_category | Category of major from Carnevale et al |
Total | Total number of people with major |
Sample_size | Sample size (unweighted) of full-time, year-round ONLY (used for earnings) |
Men | Male graduates |
Women | Female graduates |
ShareWomen | Women as share of total |
Employed | Number employed (ESR == 1 or 2) |
Full_time | Employed 35 hours or more |
Part_time | Employed less than 35 hours |
Full_time_year_round | Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35) |
Unemployed | Number unemployed (ESR == 3) |
Unemployment_rate | Unemployed / (Unemployed + Employed) |
Median | Median earnings of full-time, year-round workers |
P25th | 25th percentile of earnings |
P75th | 75th percentile of earnings |
College_jobs | Number with job requiring a college degree |
Non_college_jobs | Number with job not requiring a college degree |
Low_wage_jobs | Number in low-wage service jobs |
# Generating summary statistics for all numerical columns
recent_grads.describe()
| | Rank | Major_code | Total | Men | Women | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 173.000000 | 173.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
mean | 87.000000 | 3879.815029 | 39370.081395 | 16723.406977 | 22646.674419 | 0.522223 | 356.080925 | 31192.763006 | 26029.306358 | 8832.398844 | 19694.427746 | 2416.329480 | 0.068191 | 40151.445087 | 29501.445087 | 51494.219653 | 12322.635838 | 13284.497110 | 3859.017341 |
std | 50.084928 | 1687.753140 | 63483.491009 | 28122.433474 | 41057.330740 | 0.231205 | 618.361022 | 50675.002241 | 42869.655092 | 14648.179473 | 33160.941514 | 4112.803148 | 0.030331 | 11470.181802 | 9166.005235 | 14906.279740 | 21299.868863 | 23789.655363 | 6944.998579 |
min | 1.000000 | 1100.000000 | 124.000000 | 119.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 111.000000 | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.000000 | 2403.000000 | 4549.750000 | 2177.500000 | 1778.250000 | 0.336026 | 39.000000 | 3608.000000 | 3154.000000 | 1030.000000 | 2453.000000 | 304.000000 | 0.050306 | 33000.000000 | 24000.000000 | 42000.000000 | 1675.000000 | 1591.000000 | 340.000000 |
50% | 87.000000 | 3608.000000 | 15104.000000 | 5434.000000 | 8386.500000 | 0.534024 | 130.000000 | 11797.000000 | 10048.000000 | 3299.000000 | 7413.000000 | 893.000000 | 0.067961 | 36000.000000 | 27000.000000 | 47000.000000 | 4390.000000 | 4595.000000 | 1231.000000 |
75% | 130.000000 | 5503.000000 | 38909.750000 | 14631.000000 | 22553.750000 | 0.703299 | 338.000000 | 31433.000000 | 25147.000000 | 9948.000000 | 16891.000000 | 2393.000000 | 0.087557 | 45000.000000 | 33000.000000 | 60000.000000 | 14444.000000 | 11783.000000 | 3466.000000 |
max | 173.000000 | 6403.000000 | 393735.000000 | 173809.000000 | 307087.000000 | 0.968954 | 4212.000000 | 307933.000000 | 251540.000000 | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
One issue with this data (from a plotting perspective) is that the columns differ in length. The columns 'Total', 'Men', 'Women', and 'ShareWomen' each have 172 values, not 173 as in the other columns.
These missing values will need to be removed before we can pass the data into matplotlib for analysis.
# Record how many rows are in the uncleaned dataframe
raw_data_count = recent_grads.shape[0]
print(raw_data_count)
173
# Drop rows from the dataframe with missing values
recent_grads = recent_grads.dropna(axis=0)
# See how many rows with missing values have been dropped
cleaned_data_count = recent_grads.shape[0]
print("The uncleaned data set had ", raw_data_count, " rows")
print("The cleaned data set has ", cleaned_data_count, " rows")
The uncleaned data set had 173 rows
The cleaned data set has 172 rows
So there was only one row with missing values, which has now been dropped from the dataframe.
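Before dropping rows it can help to see exactly where the gaps are. The sketch below uses a small made-up frame (the `demo` values are hypothetical, not taken from the dataset) to show `isna()` counting missing values per column and picking out the affected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for recent_grads
demo = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [2339.0, np.nan, 856.0],
    "Median": [110000, 53000, 73000],
})

# Number of missing values in each column
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_gaps = demo[demo.isna().any(axis=1)]
print(rows_with_gaps)

# dropna(axis=0) removes exactly those rows
cleaned = demo.dropna(axis=0)
print(len(demo) - len(cleaned), "row(s) dropped")
```

Running the same two checks on the real dataframe before `dropna` would show which major was removed and which columns were empty for it.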
Now we can visualize the data to explore research questions.
We will use scatter plots to answer the following questions:
recent_grads.columns
Index(['Rank', 'Major_code', 'Major', 'Total', 'Men', 'Women', 'Major_category', 'ShareWomen', 'Sample_size', 'Employed', 'Full_time', 'Part_time', 'Full_time_year_round', 'Unemployed', 'Unemployment_rate', 'Median', 'P25th', 'P75th', 'College_jobs', 'Non_college_jobs', 'Low_wage_jobs'], dtype='object')
# Scatter plot: Sample size and median
recent_grads.plot(x = 'Sample_size', y = 'Median', kind = 'scatter', title = 'Median earnings vs. Sample Size', xlim=(0,4500), ylim=(0,120000))
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e77f9d30>
Q. Do students in more popular majors make more money?
A. The scatter plot suggests that there is no noticeable relationship between the sample size and the median salary. However, there are two important qualifiers to this answer:
# Scatter plot: Sample size (up to 75th percentile) and median
recent_grads.plot(x = 'Sample_size', y = 'Median', kind = 'scatter', title = 'Median earnings vs. Sample Size up to 75th percentile', xlim=(0,338), ylim=(0,120000))
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e7745ef0>
There is no strong overall correlation between this narrowed down selection of majors and their median earnings.
However there is a wider range of median earnings in majors with a sample size under 50. With small samples the risk of an unrepresentative median salary is higher as outliers have a bigger effect.
Overall this additional scatter plot does not change the above answer to the question.
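One way to put a number on "no noticeable relationship" is a correlation coefficient. This is a minimal sketch on made-up values (the `demo` frame is hypothetical, so the printed figures say nothing about the real data). `Series.corr` returns Pearson's r by default; correlating the ranks instead gives Spearman's rho, which is less sensitive to a few very large sample sizes.

```python
import pandas as pd

# Hypothetical values standing in for recent_grads
demo = pd.DataFrame({
    "Sample_size": [10, 50, 120, 400, 900, 2500],
    "Median": [40000, 31000, 52000, 36000, 45000, 38000],
})

# Pearson correlation (linear relationship)
pearson_r = demo["Sample_size"].corr(demo["Median"])

# Spearman correlation, computed as the Pearson correlation of the ranks
spearman_r = demo["Sample_size"].rank().corr(demo["Median"].rank())

print(f"Pearson r: {pearson_r:.2f}, Spearman rho: {spearman_r:.2f}")
```

Values near 0 would support the reading of the scatter plot; values near -1 or 1 would contradict it.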
# Sample size and unemployment rate
ax = recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind='scatter')
ax.set_title('Unemployment rate vs. Sample size')
ax.set_xlim(0,4500)
ax.set_ylim(0,0.2)
(0, 0.2)
There is a lot of variation in unemployment rates among majors with small sample sizes. Yet again the small sample sizes may affect the representativeness of the relationship plotted here.
In addition, from reviewing some rows of the dataframe, there is a noticeable difference between the sample size and the number of graduates for whom there is data on whether they are employed/unemployed.
For example for Petroleum Engineering (rank 1) there is a sample size of 36 and a total of (1976+37) employed and unemployed.
This suggests that sample size is not readily comparable to other statistics collected, other than median wage.
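The arithmetic behind this comparison can be checked directly. The numbers below are copied from the Petroleum Engineering row shown at the start of the notebook, and the rate formula is the one given in the column descriptions.

```python
# Values for Petroleum Engineering, copied from the first row of the dataset
sample_size = 36
employed = 1976
unemployed = 37

# Graduates counted as either employed or unemployed
labour_force = employed + unemployed
print(labour_force)

# Unemployment_rate is defined as Unemployed / (Unemployed + Employed)
unemployment_rate = unemployed / labour_force
print(round(unemployment_rate, 7))

# Sample_size as a fraction of the employed + unemployed count
print(round(sample_size / labour_force, 4))
```

The recomputed rate matches the `Unemployment_rate` value in that row. The sample size is under 2% of the employed plus unemployed count, which fits the column description of `Sample_size` as an unweighted count of full-time, year-round workers only.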
# Full-time workers and median salary
ax = recent_grads.plot(x='Full_time', y='Median', kind='scatter')
ax.set_title('Median salary vs. Number employed full-time')
ax.set_xlim(0,255000)
ax.set_ylim(0,120000)
(0, 120000)
Q. Is there any link between the number of full-time employees and median salary?
A. There is no noticeable correlation between the number of graduates per major employed full-time and the median wage. If there were a relationship, it would be positive, i.e. majors with more full-time employees would have a higher median wage.
But as noted above, the median wage figures are based on smaller unweighted samples that may not represent the wider population of graduates with each major.
To be sure, a narrower sample of the data can be plotted by setting the axes limits at the 75th percentile.
ax = recent_grads.plot(x='Full_time', y='Median', kind='scatter')
ax.set_title('Median salary vs. Number employed full-time (both at 75th percentile)')
ax.set_xlim(0,26000)
ax.set_ylim(0,45000)
(0, 45000)
For this narrowed down sample there is no noticeable relationship.
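Rather than hard-coding limits such as 26000 and 45000, the 75th percentiles can be read from the data with `quantile`. This sketch uses made-up values (the `demo` frame is hypothetical). Axis limits only hide points from view, whereas the boolean filter below actually narrows the sample.

```python
import pandas as pd

# Hypothetical values standing in for recent_grads
demo = pd.DataFrame({
    "Full_time": [1000, 3000, 10000, 25000, 60000, 250000, 5000, 12000],
    "Median": [30000, 36000, 45000, 33000, 40000, 35000, 62000, 28000],
})

# 75th percentile of each column
x_limit = demo["Full_time"].quantile(0.75)
y_limit = demo["Median"].quantile(0.75)

# Keep only the rows at or below both percentiles
subset = demo[(demo["Full_time"] <= x_limit) & (demo["Median"] <= y_limit)]
print(x_limit, y_limit, len(subset))
```

The computed limits could then be passed to `ax.set_xlim(0, x_limit)` and `ax.set_ylim(0, y_limit)`, or `subset` could be plotted on its own.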
# Share of women and the unemployment rate
ax = recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter')
ax.set_title('Unemployment rate vs Proportion of female graduates')
ax.set_xlim(0,1)
ax.set_ylim(0,0.2)
(0, 0.2)
There doesn't seem to be a strong relationship between the proportion of female graduates in a course and the unemployment rate.
# Share of women and median salary
ax = recent_grads.plot(x='ShareWomen', y='Median', kind='scatter')
ax.set_title('Median salary vs. Proportion of female graduates')
ax.set_xlim(0,1)
ax.set_ylim(0,120000)
(0, 120000)
Q. Do students that majored in subjects that were majority female make more money?
A. No. Here there is a noticeable relationship: the higher the proportion of female graduates for a major, the lower the median salary is.
The lower median salary is not due to more part-time work, because Median is defined as the median salary of full-time, year-round workers.
This means that the lower salary could be due to lower pay for the types of major (and subsequent career paths) that have a higher proportion of female graduates, and/or to lower wages linked to gender, or to less career capital resulting from a higher propensity to take time away from work for family.
# Number of male graduates and median wage
ax = recent_grads.plot(x='Men', y='Median', kind='scatter')
ax.set_title('Median salary vs. Number of male graduates per major')
ax.set_xlim(0,175000)
ax.set_ylim(0,120000)
(0, 120000)
There is no obvious relationship here.
# Number of female graduates and median wage
ax = recent_grads.plot(x='Women', y='Median', kind='scatter')
ax.set_title('Median salary vs. Number of female graduates per major')
ax.set_xlim(0,310000)
ax.set_ylim(0,120000)
(0, 120000)
There is no obvious relationship here either.
We will use histograms to answer the following questions:
# To allow bin size to be changed, use Series.hist() and not Series.plot(kind='hist')
# Sample_size histogram
recent_grads["Sample_size"].hist(bins=10)
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e54332b0>
Most of the sample size values are below 500 so a more detailed view of the majority can be found by looking at those with a sample size below 500.
recent_grads["Sample_size"].hist(bins=50, range=(0,500))
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e538ab00>
The majority of sample sizes were below 100. This raises concerns over how representative the salary data for each major is.
# Median salary histogram
recent_grads["Median"].hist(range=(20000,110000), bins=18)
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e52d6da0>
Q. What's the most common median salary range?
A. The median salaries are mostly clustered around $30,000-40,000, with a relatively quick drop off in frequency for salary bands on either side.
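The busiest band can also be read off numerically instead of by eye. A minimal sketch, assuming made-up salaries (the `salaries` array is hypothetical) and the same `range` and `bins` as the histogram call above, which gives bands 5,000 wide:

```python
import numpy as np

# Hypothetical median salaries standing in for recent_grads["Median"]
salaries = np.array([22000, 28000, 31000, 33000, 34000, 36000,
                     38000, 41000, 45000, 52000, 65000, 110000])

# Same binning as the histogram: 18 bins between 20,000 and 110,000
counts, edges = np.histogram(salaries, bins=18, range=(20000, 110000))
bin_width = edges[1] - edges[0]

# Index of the bin holding the most values
busiest = counts.argmax()
print(f"Bin width: {bin_width:.0f}")
print(f"Most common band: {edges[busiest]:.0f}-{edges[busiest + 1]:.0f} "
      f"({counts[busiest]} majors)")
```

Applied to the real `Median` column, the same call would give the exact count behind each bar.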
# Employed histogram
recent_grads["Employed"].hist(range=(0,310000))
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e5495da0>
I assume that the number employed per major is affected, in part, by the number of students that have taken the major and so its distribution is not very instructive by itself. To check out this assumption I can look at the relationship between Total and Employed.
# e.g. the Total number of people and number Employed for the largest majors
# Filter by majors with Total > 39,000 (75th percentile)
largest_majors = recent_grads.loc[recent_grads["Total"] > 39000, ["Major_code", "Major", "Total", "Employed"]]
largest_majors.sort_values(by='Total', ascending=False).head(10)
| | Major_code | Major | Total | Employed |
---|---|---|---|---|
145 | 5200 | PSYCHOLOGY | 393735.0 | 307933 |
76 | 6203 | BUSINESS MANAGEMENT AND ADMINISTRATION | 329927.0 | 276234 |
123 | 3600 | BIOLOGY | 280709.0 | 182295 |
57 | 6200 | GENERAL BUSINESS | 234590.0 | 190183 |
93 | 1901 | COMMUNICATIONS | 213996.0 | 179633 |
34 | 6107 | NURSING | 209394.0 | 180903 |
77 | 6206 | MARKETING AND MARKETING RESEARCH | 205211.0 | 178862 |
40 | 6201 | ACCOUNTING | 198633.0 | 165527 |
137 | 3301 | ENGLISH LANGUAGE AND LITERATURE | 194673.0 | 149180 |
78 | 5506 | POLITICAL SCIENCE AND GOVERNMENT | 182621.0 | 133454 |
So there is, unsurprisingly, a link between the Total number of people who have taken a major and the number Employed.
A better way to illustrate this relationship would be with a scatter plot.
ax = recent_grads.plot(x='Total', y='Employed', kind='scatter')
ax.set_xlim(0, 400000)
ax.set_ylim(0,310000)
(0, 310000)
So, as expected, the distribution and size of the number Employed per major is closely related to the Total number of graduates per major. It would be more useful to look at the employment (or unemployment) rates per major, rather than the absolute numbers.
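A per-major share can be derived by dividing Employed by Total. The sketch below uses three rows copied from the table of largest majors above; the column name `Employed_share` is a hypothetical label introduced here. This share is not simply 1 minus `Unemployment_rate`, because the rate's denominator is Employed plus Unemployed rather than Total.

```python
import pandas as pd

# Three rows copied from the table of largest majors
largest = pd.DataFrame({
    "Major": ["PSYCHOLOGY", "BIOLOGY", "NURSING"],
    "Total": [393735.0, 280709.0, 209394.0],
    "Employed": [307933, 182295, 180903],
})

# Share of all graduates in the major who are employed (hypothetical column name)
largest["Employed_share"] = largest["Employed"] / largest["Total"]
print(largest.sort_values("Employed_share", ascending=False))
```

Unlike the absolute counts, this share is comparable across majors of very different sizes.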
# Full-time histogram
recent_grads["Full_time"].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e56db630>
This closely mirrors the distribution of the numbers Employed per major, which is expected.
# ShareWomen histogram
recent_grads["ShareWomen"].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e551b278>
It appears that just over 50% of all majors are majority female, with the highest frequency at 70-80% female.
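The proportion of majority-female majors can be computed rather than estimated from the bars. A small sketch on made-up shares (the `share_women` values are hypothetical): the mean of a boolean Series is the fraction of `True` values.

```python
import pandas as pd

# Hypothetical shares standing in for recent_grads["ShareWomen"]
share_women = pd.Series([0.12, 0.34, 0.48, 0.53, 0.64, 0.72, 0.80, 0.97])

# True where more than half of the graduates are women
majority_female = share_women > 0.5

print(majority_female.sum(), "of", len(share_women), "majors")
print(f"{majority_female.mean():.1%} are majority female")
```

The summary statistics shown earlier give a median `ShareWomen` of 0.534, which is consistent with slightly more than half of the majors being majority female.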
# Seeing which courses have females at 80% or more
high_female_share = recent_grads[recent_grads["ShareWomen"] >= 0.8]
print(high_female_share.shape)
high_female_share.sort_values(by='ShareWomen', ascending=False).head(10)
(18, 21)
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
164 | 165 | 2307 | EARLY CHILDHOOD EDUCATION | 37589.0 | 1167.0 | 36422.0 | Education | 0.968954 | 342 | 32551 | ... | 7001 | 20748 | 1360 | 0.040105 | 28000 | 21000 | 35000 | 23515 | 7705 | 2868 |
163 | 164 | 6102 | COMMUNICATION DISORDERS SCIENCES AND SERVICES | 38279.0 | 1225.0 | 37054.0 | Health | 0.967998 | 95 | 29763 | ... | 13862 | 14460 | 1487 | 0.047584 | 28000 | 20000 | 40000 | 19957 | 9404 | 5125 |
51 | 52 | 6104 | MEDICAL ASSISTING SERVICES | 11123.0 | 803.0 | 10320.0 | Health | 0.927807 | 67 | 9168 | ... | 4107 | 4290 | 407 | 0.042507 | 42000 | 30000 | 65000 | 2091 | 6948 | 1270 |
138 | 139 | 2304 | ELEMENTARY EDUCATION | 170862.0 | 13029.0 | 157833.0 | Education | 0.923745 | 1629 | 149339 | ... | 37965 | 86540 | 7297 | 0.046586 | 32000 | 23400 | 38000 | 108085 | 36972 | 11502 |
150 | 151 | 2901 | FAMILY AND CONSUMER SCIENCES | 58001.0 | 5166.0 | 52835.0 | Industrial Arts & Consumer Services | 0.910933 | 518 | 46624 | ... | 15872 | 26906 | 3355 | 0.067128 | 30000 | 22900 | 40000 | 20985 | 20133 | 5248 |
100 | 101 | 2310 | SPECIAL NEEDS EDUCATION | 28739.0 | 2682.0 | 26057.0 | Education | 0.906677 | 246 | 24639 | ... | 5153 | 16642 | 1067 | 0.041508 | 35000 | 32000 | 42000 | 20185 | 3797 | 1179 |
156 | 157 | 5403 | HUMAN SERVICES AND COMMUNITY ORGANIZATION | 9374.0 | 885.0 | 8489.0 | Psychology & Social Work | 0.905590 | 89 | 8294 | ... | 2405 | 5061 | 326 | 0.037819 | 30000 | 24000 | 35000 | 2878 | 4595 | 724 |
151 | 152 | 5404 | SOCIAL WORK | 53552.0 | 5137.0 | 48415.0 | Psychology & Social Work | 0.904075 | 374 | 45038 | ... | 13481 | 27588 | 3329 | 0.068828 | 30000 | 25000 | 35000 | 27449 | 14416 | 4344 |
34 | 35 | 6107 | NURSING | 209394.0 | 21773.0 | 187621.0 | Health | 0.896019 | 2554 | 180903 | ... | 40818 | 122817 | 8497 | 0.044863 | 48000 | 39000 | 58000 | 151643 | 26146 | 6193 |
88 | 89 | 6199 | MISCELLANEOUS HEALTH MEDICAL PROFESSIONS | 13386.0 | 1589.0 | 11797.0 | Health | 0.881294 | 81 | 10076 | ... | 4145 | 5868 | 893 | 0.081411 | 36000 | 23000 | 42000 | 5652 | 3835 | 1422 |
10 rows × 21 columns
# Unemployment Rate histogram
recent_grads["Unemployment_rate"].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e5675438>
The most frequent unemployment range is 6-7%, but there are a handful of courses with unemployment rates greater than 14%.
high_unemp = recent_grads[recent_grads["Unemployment_rate"] >= 0.14]
print(high_unemp.shape)
high_unemp.sort_values(by='Unemployment_rate', ascending=False)
(4, 21)
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 6 | 2418 | NUCLEAR ENGINEERING | 2573.0 | 2200.0 | 373.0 | Engineering | 0.144967 | 17 | 1857 | ... | 264 | 1449 | 400 | 0.177226 | 65000 | 50000 | 102000 | 1142 | 657 | 244 |
89 | 90 | 5401 | PUBLIC ADMINISTRATION | 5629.0 | 2947.0 | 2682.0 | Law & Public Policy | 0.476461 | 46 | 4158 | ... | 847 | 2952 | 789 | 0.159491 | 36000 | 23000 | 60000 | 919 | 2313 | 496 |
84 | 85 | 2107 | COMPUTER NETWORKING AND TELECOMMUNICATIONS | 7613.0 | 5291.0 | 2322.0 | Computers & Mathematics | 0.305005 | 97 | 6144 | ... | 1447 | 4369 | 1100 | 0.151850 | 36400 | 27000 | 49000 | 2593 | 2941 | 352 |
170 | 171 | 5202 | CLINICAL PSYCHOLOGY | 2838.0 | 568.0 | 2270.0 | Psychology & Social Work | 0.799859 | 13 | 2101 | ... | 648 | 1293 | 368 | 0.149048 | 25000 | 25000 | 40000 | 986 | 870 | 622 |
4 rows × 21 columns
'Nuclear Engineering' and 'Computer Networking and Telecommunications' are unexpected, given that they are in in-demand fields (Engineering, Computers & Mathematics).
# Men histogram i.e. number of male graduates
recent_grads["Men"].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e518c390>
# Women histogram i.e. number of female graduates
recent_grads["Women"].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e511d390>
A scatter matrix plot combines both scatter plots and histograms into one grid of plots and allows us to explore potential relationships and distributions simultaneously.
These can be generated by selecting the dataframe and columns of interest and passing this into: pandas.plotting.scatter_matrix
# import scatter_matrix function from pandas.plotting
from pandas.plotting import scatter_matrix
# A 2 by 2 scatter matrix plot of Sample_size and Median salary
scatter_matrix(recent_grads[["Sample_size", "Median"]], figsize=(10,10))
Most sample sizes are less than 500 (top left histogram). The scatter plot of Sample_size vs. Median salary (bottom left) doesn't seem to provide much information other than there not being an obvious relationship between Sample_size and Median salary.
However the mirror scatter plot of Median salary vs. Sample_size (top right) shows that large Sample sizes are not associated with outlier Median salary values. Instead the majors with the highest Sample sizes have Median salaries that are in line with the most common salary ranges ($30,000-40,000).
# Scatter matrix plot of Sample_size, Median, and Unemployment_rate
scatter_matrix(recent_grads[["Sample_size", "Median", "Unemployment_rate"]], figsize=(20,20))
There aren't any noticeably strong relationships between the variables here.
Scatter_matrix is a useful way to quickly explore relationships that I've considered above, for example Total students with a major and the number Employed.
# Total and Employed scatter matrix
scatter_matrix(recent_grads[["Total", "Employed"]], figsize=(10,10))
Doing a scatter matrix plot earlier would have saved time and quickly revealed the strong relationship between the two variables.
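A correlation matrix is a numeric companion to the scatter matrix. The sketch below uses only the ten largest majors from the Total/Employed table shown earlier, so the coefficient describes that subset rather than the whole dataset.

```python
import pandas as pd

# Total and Employed for the ten largest majors, copied from the earlier table
top10 = pd.DataFrame({
    "Total": [393735, 329927, 280709, 234590, 213996,
              209394, 205211, 198633, 194673, 182621],
    "Employed": [307933, 276234, 182295, 190183, 179633,
                 180903, 178862, 165527, 149180, 133454],
})

# Pairwise Pearson correlations between all numeric columns
corr_matrix = top10.corr()
print(corr_matrix.round(2))

# The single coefficient of interest
r = corr_matrix.loc["Total", "Employed"]
```

Calling `.corr()` on the same column selection passed to `scatter_matrix` gives one number per off-diagonal panel.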
Bar plots can be generated with either df.plot(kind='bar') or df.plot.bar(x=<label column>, y=<column of bar heights>).
# Looking at the share of women for the top 10 and bottom 10
# courses NB data is ranked by median salary
# Share of women in the top 10 courses
recent_grads[:10].plot.bar(x='Major', y='ShareWomen', title='Share of women in the 10 courses with the highest median salary')
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e420cf98>
# Share of women in the bottom 10 courses
recent_grads[-10:].plot.bar(x='Major', y='ShareWomen', title='Share of women in the 10 courses with the lowest median salary')
<matplotlib.axes._subplots.AxesSubplot at 0x7f99e41d2be0>
The courses with the highest median salaries have a lower share of female graduates than those with the lowest median salaries, which are majority female (i.e. over 50% of the graduates are women).
Now to calculate how large the difference is:
#Calculating the average proportion of female graduates for the top and bottom 10 courses
top_10_female_share = recent_grads.loc[:9, "ShareWomen"].mean()
bottom_10_female_share = recent_grads[-10:]['ShareWomen'].mean()
top_10 = "The 10 highest paying courses have an average female share of {:.2f}".format(top_10_female_share)
bottom_10 = "The 10 lowest paying courses have an average female share of {:.2f}".format(bottom_10_female_share)
print(top_10)
print(bottom_10)
The 10 highest paying courses have an average female share of 0.23
The 10 lowest paying courses have an average female share of 0.79
So the difference in the average proportion of female graduates between the top and bottom 10 courses (in terms of median pay) is over 50 percentage points (0.79 vs. 0.23)!
Next we will look at the differences in the unemployment rate between the top 10 and bottom 10 courses.
# Unemployment rate for the top 10 courses
ax1 = recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate', title='Unemployment rate for the top 10 courses')
ax2 = recent_grads[-10:].plot.bar(x='Major', y='Unemployment_rate', title='Unemployment rate for the bottom 10 courses')
This comparison is less clear. The top 10 courses do tend to have a lower unemployment rate, with 2 exceptions: 'Nuclear Engineering' and 'Mining and Mineral Engineering'. In the bottom 10, by contrast, 3-5 of the courses have higher unemployment rates.
This can be analysed by looking at average unemployment rates.
mean_unemp_rate = recent_grads["Unemployment_rate"].mean()
#NB with .loc contrary to usual python slices, both the start and the stop are included
top_10_unemp = recent_grads.loc[:9 , "Unemployment_rate"].mean()
bottom_10_unemp = recent_grads[-10:]["Unemployment_rate"].mean()
mean_all = "The average unemployment rate across all majors is {:.2f}".format(mean_unemp_rate)
mean_top = "The average unemployment rate for the top 10 majors is {:.2f}".format(top_10_unemp)
mean_bottom = "The average unemployment rate for the bottom 10 majors is {:.2f}".format(bottom_10_unemp)
print(mean_all)
print(mean_top)
print(mean_bottom)
The average unemployment rate across all majors is 0.07
The average unemployment rate for the top 10 majors is 0.07
The average unemployment rate for the bottom 10 majors is 0.08
Whilst the average unemployment rates are similar for the top and bottom 10 courses, more of the bottom 10 courses appear to be slightly above average. In the top 10, by contrast, 2 courses are far above the average and the others are far below it.
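The averages above mix label-based slicing (`.loc[:9]`) with position-based slicing (`[-10:]`). After `dropna` the row labels are not renumbered, so the two can disagree. A sketch on a made-up frame (the `demo` values are hypothetical) in which the dropped row falls inside the first ten labels:

```python
import pandas as pd

# Hypothetical ranked frame with default integer labels 0..11
demo = pd.DataFrame({
    "Unemployment_rate": [0.02, 0.12, 0.02, None, 0.06, 0.18,
                          0.10, 0.02, 0.06, 0.09, 0.05, 0.03]
})

# Label 3 disappears; the remaining labels keep their original values
demo = demo.dropna(axis=0)

# Label-based: every remaining row whose label is between 0 and 9 inclusive
by_label = demo.loc[:9, "Unemployment_rate"]

# Position-based: the first ten remaining rows, whatever their labels
by_position = demo.iloc[:10]["Unemployment_rate"]

print(len(by_label), len(by_position))
```

This only matters if the row removed by `dropna` had a label between 0 and 9, and the notebook does not show which row was dropped. Using `.iloc[:10]`, or calling `reset_index(drop=True)` after `dropna`, sidesteps the question.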
To investigate this further:
# Majors in the top 10 with an above-average unemployment rate
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added
top_10_outliers = recent_grads[:10].loc[recent_grads[:10]["Unemployment_rate"] > mean_unemp_rate].copy()
top_10_outliers["Difference_from_mean"] = top_10_outliers["Unemployment_rate"] - mean_unemp_rate
# Majors in the bottom 10 with an above-average unemployment rate
bottom_10_outliers = recent_grads[-10:].loc[recent_grads[-10:]["Unemployment_rate"] > mean_unemp_rate].copy()
bottom_10_outliers["Difference_from_mean"] = bottom_10_outliers["Unemployment_rate"] - mean_unemp_rate
# Plot the majors from the top and bottom 10 with above
# average unemployment rates and the size of the difference
# in unemployment rate from the average
top_10_outliers.plot.bar(x='Major', y='Difference_from_mean', title='Majors in the top 10 with above average unemployment rates')
bottom_10_outliers.plot.bar(x='Major', y='Difference_from_mean', title='Majors in the bottom 10 with above average unemployment rates')
<matplotlib.axes._subplots.AxesSubplot at 0x7f99def3fb70>
So in the top 10, one course (Nuclear Engineering) is dragging up the average unemployment rate for the top 10 majors. In the bottom 10 Clinical Psychology has an above average unemployment rate, but the other 4 courses are also dragging up the unemployment rate.
There are a lot of majors which makes it harder to see patterns across types of major e.g. arts, sciences.
So I will look at some analysis using the Category of the major.
Firstly I will build a new dataframe which is indexed by the categories of the majors.
# Create a list of all the categories of major
categories = recent_grads["Major_category"].unique()
# Now to aggregate across categories
# Firstly unemployment rates
# Create an empty dictionary
cat_unemp = {}
# Loop through major categories, calculate mean unemployment rate and add to dictionary
for c in categories:
    unemp_mean = recent_grads.loc[recent_grads["Major_category"] == c, "Unemployment_rate"].mean()
    cat_unemp[c] = unemp_mean
# Now the proportion (share) of women in each category
# Create an empty dictionary
cat_share_women = {}
#Loop through categories, calculate mean share of women, and add to dictionary
for c in categories:
    women_mean = recent_grads.loc[recent_grads["Major_category"] == c, "ShareWomen"].mean()
    cat_share_women[c] = women_mean
# Now the average median salary
# Create an empty dictionary
cat_salary = {}
for c in categories:
    salary_mean = recent_grads.loc[recent_grads["Major_category"] == c, "Median"].mean()
    cat_salary[c] = salary_mean
# Now the total number of men and women in each category
cat_women = {}
for c in categories:
    women_sum = recent_grads.loc[recent_grads["Major_category"] == c, "Women"].sum()
    cat_women[c] = women_sum
cat_men = {}
for c in categories:
    men_sum = recent_grads.loc[recent_grads["Major_category"] == c, "Men"].sum()
    cat_men[c] = men_sum
unemp_series = pd.Series(cat_unemp)
share_women_series = pd.Series(cat_share_women)
salary_series = pd.Series(cat_salary)
women_series = pd.Series(cat_women)
men_series = pd.Series(cat_men)
type(salary_series)
pandas.core.series.Series
#Now to turn all series into a dataframe
#NB. The dictionary keys became the index in the Series obj
#This index can be used for the dataframe
major_categories = pd.DataFrame(unemp_series, columns=['mean_unemployment_rate'])
major_categories
| | mean_unemployment_rate |
---|---|
Agriculture & Natural Resources | 0.051817 |
Arts | 0.090173 |
Biology & Life Science | 0.060918 |
Business | 0.071064 |
Communications & Journalism | 0.075538 |
Computers & Mathematics | 0.084256 |
Education | 0.051702 |
Engineering | 0.063334 |
Health | 0.065920 |
Humanities & Liberal Arts | 0.081008 |
Industrial Arts & Consumer Services | 0.048071 |
Interdisciplinary | 0.070861 |
Law & Public Policy | 0.090805 |
Physical Sciences | 0.046511 |
Psychology & Social Work | 0.072065 |
Social Science | 0.095729 |
Now to add the other series into this new dataframe.
# Now add in the other series
# Don't use the DataFrame constructor again; it is only needed to create the df object
# The series share the same index, so they can be assigned as new columns
major_categories["mean_share_women"] = share_women_series
major_categories["mean_salary"] = salary_series
major_categories["number_female_grads"] = women_series
major_categories["number_male_grads"] = men_series
major_categories
| | mean_unemployment_rate | mean_share_women | mean_salary | number_female_grads | number_male_grads |
---|---|---|---|---|---|
Agriculture & Natural Resources | 0.051817 | 0.405267 | 35111.111111 | 35263.0 | 40357.0 |
Arts | 0.090173 | 0.603658 | 33062.500000 | 222740.0 | 134390.0 |
Biology & Life Science | 0.060918 | 0.587193 | 36421.428571 | 268943.0 | 184919.0 |
Business | 0.071064 | 0.483198 | 43538.461538 | 634524.0 | 667852.0 |
Communications & Journalism | 0.075538 | 0.658384 | 34500.000000 | 260680.0 | 131921.0 |
Computers & Mathematics | 0.084256 | 0.311772 | 42745.454545 | 90283.0 | 208725.0 |
Education | 0.051702 | 0.748507 | 32350.000000 | 455603.0 | 103526.0 |
Engineering | 0.063334 | 0.238889 | 57382.758621 | 129276.0 | 408307.0 |
Health | 0.065920 | 0.795152 | 36825.000000 | 387713.0 | 75517.0 |
Humanities & Liberal Arts | 0.081008 | 0.631790 | 31913.333333 | 440622.0 | 272846.0 |
Industrial Arts & Consumer Services | 0.048071 | 0.349523 | 36342.857143 | 126011.0 | 103781.0 |
Interdisciplinary | 0.070861 | 0.770901 | 35000.000000 | 9479.0 | 2817.0 |
Law & Public Policy | 0.090805 | 0.483649 | 42200.000000 | 87978.0 | 91129.0 |
Physical Sciences | 0.046511 | 0.508683 | 41890.000000 | 90089.0 | 95390.0 |
Psychology & Social Work | 0.072065 | 0.794397 | 30100.000000 | 382892.0 | 98115.0 |
Social Science | 0.095729 | 0.553962 | 37344.444444 | 273132.0 | 256834.0 |
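The per-category loops and Series used to build this table can also be expressed as a single `groupby` with named aggregation. This is an alternative sketch, not the approach used in the notebook, and it runs on four made-up rows (the `demo` frame is hypothetical); the output column names mirror the ones in the table.

```python
import pandas as pd

# Hypothetical rows standing in for recent_grads
demo = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Education", "Education"],
    "Unemployment_rate": [0.02, 0.06, 0.04, 0.10],
    "ShareWomen": [0.12, 0.34, 0.97, 0.88],
    "Median": [110000, 65000, 28000, 22000],
    "Women": [282.0, 11021.0, 36422.0, 964.0],
    "Men": [2057.0, 21239.0, 1167.0, 134.0],
})

# One row per category; each keyword is new_column=(source_column, function)
by_category = demo.groupby("Major_category").agg(
    mean_unemployment_rate=("Unemployment_rate", "mean"),
    mean_share_women=("ShareWomen", "mean"),
    mean_salary=("Median", "mean"),
    number_female_grads=("Women", "sum"),
    number_male_grads=("Men", "sum"),
)
print(by_category)
```

The result is indexed by category, like `major_categories`, so the plotting calls that follow would work on it unchanged.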
Now it is time to plot the data by category to see what patterns emerge.
Firstly, using a grouped bar plot.
df_subset = major_categories[["number_male_grads", "number_female_grads"]]
df_subset.plot.bar(title = 'Total number of male and female graduates per category of major')
<matplotlib.axes._subplots.AxesSubplot at 0x7f99deea36a0>
As can be seen there are large differences in the number of male and female graduates in the following categories:
Now to look at how the mean salary differs across categories.
major_categories.plot.bar(y='mean_salary', title='Average salary per category of major')
<matplotlib.axes._subplots.AxesSubplot at 0x7f99ded9b780>
The mean salary is highest (by far) in Engineering, followed by Business.
Now to look at the relationship between mean salary and the average proportion of women in each category.
ax = major_categories.plot(x='mean_share_women', y='mean_salary', kind='scatter')
ax.set_title('Average share of female graduates per category vs. average salary')
<matplotlib.text.Text at 0x7f99deca1eb8>
There's a slight drop in the average salary as the proportion of female graduates rises.
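The direction of this trend can be checked with a straight-line fit. The sketch below copies the `mean_share_women` and `mean_salary` columns from the category table shown earlier. With only 16 points, and Engineering sitting well above the rest, the fit is a rough indication rather than a firm estimate.

```python
import numpy as np

# mean_share_women and mean_salary for the 16 categories, copied from the category table
share = np.array([0.405267, 0.603658, 0.587193, 0.483198, 0.658384, 0.311772,
                  0.748507, 0.238889, 0.795152, 0.631790, 0.349523, 0.770901,
                  0.483649, 0.508683, 0.794397, 0.553962])
salary = np.array([35111.111111, 33062.5, 36421.428571, 43538.461538, 34500.0,
                   42745.454545, 32350.0, 57382.758621, 36825.0, 31913.333333,
                   36342.857143, 35000.0, 42200.0, 41890.0, 30100.0, 37344.444444])

# Least-squares line: salary = slope * share + intercept
slope, intercept = np.polyfit(share, salary, deg=1)

# Pearson correlation between the two columns
r = np.corrcoef(share, salary)[0, 1]

print(f"slope: {slope:.0f} per unit of share, r = {r:.2f}")
```

A negative slope and a negative `r` agree with the downward drift visible in the scatter plot.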
Finally, how does the average unemployment rate vary across the different categories?
major_categories.plot.bar(y='mean_unemployment_rate', legend = False, title='Average unemployment rate per category of major')
<matplotlib.axes._subplots.AxesSubplot at 0x7f99dec4ddd8>
Interestingly, there doesn't appear to be a close link between the unemployment rates and average salary. For example, Education has a low unemployment rate and a low salary. Low unemployment means there is less surplus labour, which would usually lead to higher wages.
Using a scatter plot to explore this further:
ax = major_categories.plot(x='mean_unemployment_rate', y='mean_salary', kind='scatter')
ax.set_title("Average unemployment rate vs. average salary for each category of major")
<matplotlib.text.Text at 0x7f99debb1320>
So there isn't an obvious relationship between average salary and