**We'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by the American Community Survey, which conducts surveys and aggregates the data. FiveThirtyEight cleaned the dataset and released it on their GitHub repo.**
**Each row in the dataset represents a different major in college and contains information on gender diversity, employment rates, median salaries, and more. Here are some of the columns in the dataset:**
- Rank - Rank by median earnings (the dataset is ordered by this column).
- Major_code - Major code.
- Major - Major description.
- Major_category - Category of major.
- Total - Total number of people with major.
- Sample_size - Sample size (unweighted) of full-time.
- Men - Male graduates.
- Women - Female graduates.
- ShareWomen - Women as share of total.
- Employed - Number employed.
- Median - Median salary of full-time, year-round workers.
- Low_wage_jobs - Number in low-wage service jobs.
- Full_time - Number employed 35 hours or more.
- Part_time - Number employed less than 35 hours.

**Using visualizations, we can start to explore questions from the dataset like:**
- Do students in more popular majors make more money?
- Do students that majored in subjects that were majority female make more money?
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
recent_grads = pd.read_csv("recent-grads.csv")
recent_grads.iloc[0]
Rank 1
Major_code 2419
Major PETROLEUM ENGINEERING
Total 2339
Men 2057
Women 282
Major_category Engineering
ShareWomen 0.120564
Sample_size 36
Employed 1976
Full_time 1849
Part_time 270
Full_time_year_round 1207
Unemployed 37
Unemployment_rate 0.0183805
Median 110000
P25th 95000
P75th 125000
College_jobs 1534
Non_college_jobs 364
Low_wage_jobs 193
Name: 0, dtype: object
recent_grads.head()
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 150 | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 5180 | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 |
5 rows × 21 columns
recent_grads.tail()
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
168 | 169 | 3609 | ZOOLOGY | 8409.0 | 3050.0 | 5359.0 | Biology & Life Science | 0.637293 | 47 | 6259 | ... | 2190 | 3602 | 304 | 0.046320 | 26000 | 20000 | 39000 | 2771 | 2947 | 743 |
169 | 170 | 5201 | EDUCATIONAL PSYCHOLOGY | 2854.0 | 522.0 | 2332.0 | Psychology & Social Work | 0.817099 | 7 | 2125 | ... | 572 | 1211 | 148 | 0.065112 | 25000 | 24000 | 34000 | 1488 | 615 | 82 |
170 | 171 | 5202 | CLINICAL PSYCHOLOGY | 2838.0 | 568.0 | 2270.0 | Psychology & Social Work | 0.799859 | 13 | 2101 | ... | 648 | 1293 | 368 | 0.149048 | 25000 | 25000 | 40000 | 986 | 870 | 622 |
171 | 172 | 5203 | COUNSELING PSYCHOLOGY | 4626.0 | 931.0 | 3695.0 | Psychology & Social Work | 0.798746 | 21 | 3777 | ... | 965 | 2738 | 214 | 0.053621 | 23400 | 19200 | 26000 | 2403 | 1245 | 308 |
172 | 173 | 3501 | LIBRARY SCIENCE | 1098.0 | 134.0 | 964.0 | Education | 0.877960 | 2 | 742 | ... | 237 | 410 | 87 | 0.104946 | 22000 | 20000 | 22000 | 288 | 338 | 192 |
5 rows × 21 columns
recent_grads.describe(include='all')
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 173.000000 | 173.000000 | 173 | 172.000000 | 172.000000 | 172.000000 | 173 | 172.000000 | 173.000000 | 173.000000 | ... | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
unique | NaN | NaN | 173 | NaN | NaN | NaN | 16 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
top | NaN | NaN | MARKETING AND MARKETING RESEARCH | NaN | NaN | NaN | Engineering | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
freq | NaN | NaN | 1 | NaN | NaN | NaN | 29 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
mean | 87.000000 | 3879.815029 | NaN | 39370.081395 | 16723.406977 | 22646.674419 | NaN | 0.522223 | 356.080925 | 31192.763006 | ... | 8832.398844 | 19694.427746 | 2416.329480 | 0.068191 | 40151.445087 | 29501.445087 | 51494.219653 | 12322.635838 | 13284.497110 | 3859.017341 |
std | 50.084928 | 1687.753140 | NaN | 63483.491009 | 28122.433474 | 41057.330740 | NaN | 0.231205 | 618.361022 | 50675.002241 | ... | 14648.179473 | 33160.941514 | 4112.803148 | 0.030331 | 11470.181802 | 9166.005235 | 14906.279740 | 21299.868863 | 23789.655363 | 6944.998579 |
min | 1.000000 | 1100.000000 | NaN | 124.000000 | 119.000000 | 0.000000 | NaN | 0.000000 | 2.000000 | 0.000000 | ... | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.000000 | 2403.000000 | NaN | 4549.750000 | 2177.500000 | 1778.250000 | NaN | 0.336026 | 39.000000 | 3608.000000 | ... | 1030.000000 | 2453.000000 | 304.000000 | 0.050306 | 33000.000000 | 24000.000000 | 42000.000000 | 1675.000000 | 1591.000000 | 340.000000 |
50% | 87.000000 | 3608.000000 | NaN | 15104.000000 | 5434.000000 | 8386.500000 | NaN | 0.534024 | 130.000000 | 11797.000000 | ... | 3299.000000 | 7413.000000 | 893.000000 | 0.067961 | 36000.000000 | 27000.000000 | 47000.000000 | 4390.000000 | 4595.000000 | 1231.000000 |
75% | 130.000000 | 5503.000000 | NaN | 38909.750000 | 14631.000000 | 22553.750000 | NaN | 0.703299 | 338.000000 | 31433.000000 | ... | 9948.000000 | 16891.000000 | 2393.000000 | 0.087557 | 45000.000000 | 33000.000000 | 60000.000000 | 14444.000000 | 11783.000000 | 3466.000000 |
max | 173.000000 | 6403.000000 | NaN | 393735.000000 | 173809.000000 | 307087.000000 | NaN | 0.968954 | 4212.000000 | 307933.000000 | ... | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
11 rows × 21 columns
recent_grads.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 21 columns):
Rank 173 non-null int64
Major_code 173 non-null int64
Major 173 non-null object
Total 172 non-null float64
Men 172 non-null float64
Women 172 non-null float64
Major_category 173 non-null object
ShareWomen 172 non-null float64
Sample_size 173 non-null int64
Employed 173 non-null int64
Full_time 173 non-null int64
Part_time 173 non-null int64
Full_time_year_round 173 non-null int64
Unemployed 173 non-null int64
Unemployment_rate 173 non-null float64
Median 173 non-null int64
P25th 173 non-null int64
P75th 173 non-null int64
College_jobs 173 non-null int64
Non_college_jobs 173 non-null int64
Low_wage_jobs 173 non-null int64
dtypes: float64(5), int64(14), object(2)
memory usage: 28.5+ KB
recent_grads = recent_grads.dropna(axis=0)
recent_grads.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 172 entries, 0 to 172
Data columns (total 21 columns):
Rank 172 non-null int64
Major_code 172 non-null int64
Major 172 non-null object
Total 172 non-null float64
Men 172 non-null float64
Women 172 non-null float64
Major_category 172 non-null object
ShareWomen 172 non-null float64
Sample_size 172 non-null int64
Employed 172 non-null int64
Full_time 172 non-null int64
Part_time 172 non-null int64
Full_time_year_round 172 non-null int64
Unemployed 172 non-null int64
Unemployment_rate 172 non-null float64
Median 172 non-null int64
P25th 172 non-null int64
P75th 172 non-null int64
College_jobs 172 non-null int64
Non_college_jobs 172 non-null int64
Low_wage_jobs 172 non-null int64
dtypes: float64(5), int64(14), object(2)
memory usage: 29.6+ KB
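Before dropping rows, it can help to see which ones actually contain missing values. The sketch below uses a small made-up frame (the `toy` data and its values are assumptions for illustration, not rows from `recent-grads.csv`) to show one way to list them. It also shows that `dropna` keeps the original index labels, which is why the index above still runs from 0 to 172 after one row is removed.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing value in "Total".
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [2339.0, np.nan, 856.0],
    "Median": [110000, 53000, 73000],
})

# Rows that contain at least one missing value.
missing_rows = toy[toy.isna().any(axis=1)]
print(missing_rows)

# Drop those rows; the surviving rows keep their original index labels.
cleaned = toy.dropna(axis=0)
print(cleaned.index.tolist())
```

On the real data, the same filter applied to `recent_grads` before the `dropna` call would show the single row being discarded.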
recent_grads.plot(x='Sample_size', y='Median', kind='scatter', title='Relationship Between Sample Size and The Median Earning')
It looks like there is no correlation between the sample size and the median earnings.
recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind='scatter', title='Relationship Between Sample Size and Unemployment Rate')
It looks like there is no correlation between the sample size and the unemployment rate.
recent_grads.plot(x='Full_time', y='Median', kind='scatter', title='Relationship between Full Time Employees and the Median Earnings')
It looks like there is no correlation between the number of full-time employees in a major and its median earnings.
recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter', title='Relationship Between Women Share and Unemployment Rate')
It looks like there is no correlation between the share of women and the unemployment rate.
recent_grads.plot(x='Men', y='Median', kind='scatter', title='Relationship Between Men and the Median Earnings')
It looks like there is no correlation between the number of men in a major and the median earnings.
recent_grads.plot(x='Women', y='Median', kind='scatter', title='Relationship Between Women and the Median Earnings')
It looks like there is no correlation between the number of women and the median earnings.
recent_grads.plot(x='Total', y='Median', kind='scatter', title='Relationship Between Popular Majors in Total and The Median Earnings')
It looks like there is no correlation between a major's popularity (its total number of graduates) and its median earnings.
recent_grads.plot(x='ShareWomen', y='Median', kind='scatter', title='The Relationship Between Women Share and Earnings')
There is a negative correlation between earnings and the share of women: as the share of women increases, median earnings decrease. This may indicate that women are more often hired into lower-paying jobs, or that most women graduate with majors that have low median earnings.
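A correlation coefficient can put a number on what the scatter plot suggests. The sketch below computes the Pearson correlation with `Series.corr` on only the ten rows printed by `head()` and `tail()` above. Because these are the five highest-ranked and five lowest-ranked majors, the result is likely to overstate the strength of the relationship in the full dataset, so treat it as an illustration of the call rather than a finding.

```python
import pandas as pd

# The ten rows shown by head() and tail() above; an illustrative subset only.
sample = pd.DataFrame({
    "ShareWomen": [0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
                   0.637293, 0.817099, 0.799859, 0.798746, 0.877960],
    "Median": [110000, 75000, 73000, 70000, 65000,
               26000, 25000, 25000, 23400, 22000],
})

# Pearson correlation coefficient between the two columns (between -1 and 1).
r = sample["ShareWomen"].corr(sample["Median"])
print(round(r, 2))
```

On the full data, the equivalent call would be `recent_grads["ShareWomen"].corr(recent_grads["Median"])`.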
recent_grads["Sample_size"].hist(bins=25)
Most sample sizes are below 500.
recent_grads["Median"].hist(bins=25)
Most of the median earnings are in the range from 30k to 40k.
recent_grads["Employed"].hist(bins=25)
recent_grads["Full_time"].hist(bins=25)
From the above two histograms, we can see that the distributions of the number employed and the number of full-time workers closely align.
recent_grads["ShareWomen"].hist(bins=25)
Many majors have a share of women between 0.5 and 0.8.
recent_grads["Unemployment_rate"].hist(bins=25)
The most common unemployment rate is about 0.06, which occurs in more than 20 majors.
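Reading the tallest bar off a histogram by eye is approximate. As a sketch, `numpy.histogram` returns the counts and bin edges for equal-width bins over the data range, so the most populated bin can be looked up directly. The `rates` values below are only the ten unemployment rates printed by `head()` and `tail()` above, so the bin found here says nothing about the full dataset.

```python
import numpy as np
import pandas as pd

# Unemployment rates from the ten rows shown above; an illustrative subset only.
rates = pd.Series([0.018381, 0.117241, 0.024096, 0.050125, 0.061098,
                   0.046320, 0.065112, 0.149048, 0.053621, 0.104946])

# Counts per bin and the 26 edges that delimit the 25 bins.
counts, edges = np.histogram(rates, bins=25)

# Index of the most populated bin and the interval it covers.
top = counts.argmax()
modal_range = (edges[top], edges[top + 1])
print(counts[top], modal_range)
```

Passing `recent_grads["Unemployment_rate"]` instead of `rates` would give the bin behind the tallest bar in the plot above.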
recent_grads["Men"].hist(bins=25)
recent_grads["Women"].hist(bins=25, range=(0,180000))
The numbers of men and women per major show a similar distribution.
from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[["Sample_size", "Median"]], figsize=(10,10))
The scatter matrix likewise shows no correlation between the sample size and the median earnings.
scatter_matrix(recent_grads[["Sample_size","Median","Unemployment_rate"]], figsize=(10,10))
There is no visible correlation among the three variables: sample size, median earnings, and unemployment rate.
#First 10 rows in ShareWomen.
recent_grads[:10].plot.bar(x='Major', y='ShareWomen')
#Last 10 rows in ShareWomen.
recent_grads[-10:].plot.bar(x='Major', y='ShareWomen')
In general, the share of women is higher in the last rows, that is, in the lowest-ranked majors with the lowest median earnings.
#First 10 rows of unemployment rate.
recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate')
#Last 10 rows of unemployment rate.
recent_grads[-10:].plot.bar(x='Major', y='Unemployment_rate')
The unemployment rate is slightly higher in the low-ranked majors with low earnings than in the high-ranked majors with high earnings. As a side observation, the astronomy field has a remarkably low unemployment rate and a higher share of women than men.
# Aggregate the major categories and calculate the total number of men and women per category.
maj_cat = recent_grads['Major_category'].unique()
nw_catg = {}
nm_catg = {}
for c in maj_cat:
    num_w = recent_grads.loc[recent_grads['Major_category']==c, 'Women'].sum().astype(int)
    num_m = recent_grads.loc[recent_grads['Major_category']==c, 'Men'].sum().astype(int)
    nw_catg[c] = num_w
    nm_catg[c] = num_m
#Creating a dataframe with number of men and women per category.
nw_catg_s = pd.Series(nw_catg)
nm_catg_s = pd.Series(nm_catg)
wm_catg_df = pd.DataFrame(nw_catg_s, columns=['women_num'])
wm_catg_df['men_num'] = nm_catg_s
wm_catg_df
| | women_num | men_num |
---|---|---|
Agriculture & Natural Resources | 35263 | 40357 |
Arts | 222740 | 134390 |
Biology & Life Science | 268943 | 184919 |
Business | 634524 | 667852 |
Communications & Journalism | 260680 | 131921 |
Computers & Mathematics | 90283 | 208725 |
Education | 455603 | 103526 |
Engineering | 129276 | 408307 |
Health | 387713 | 75517 |
Humanities & Liberal Arts | 440622 | 272846 |
Industrial Arts & Consumer Services | 126011 | 103781 |
Interdisciplinary | 9479 | 2817 |
Law & Public Policy | 87978 | 91129 |
Physical Sciences | 90089 | 95390 |
Psychology & Social Work | 382892 | 98115 |
Social Science | 273132 | 256834 |
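The loop-and-dictionary aggregation above can also be written as a single `groupby`. The sketch below is an alternative formulation, not the approach used in this notebook, and it runs on only the ten rows printed by `head()` and `tail()` above, so its totals are far smaller than the ones in the table.

```python
import pandas as pd

# Men and Women counts from the ten rows shown above; an illustrative subset only.
sample = pd.DataFrame({
    "Major_category": ["Engineering"] * 5
                      + ["Biology & Life Science"]
                      + ["Psychology & Social Work"] * 3
                      + ["Education"],
    "Men": [2057.0, 679.0, 725.0, 1123.0, 21239.0,
            3050.0, 522.0, 568.0, 931.0, 134.0],
    "Women": [282.0, 77.0, 131.0, 135.0, 11021.0,
              5359.0, 2332.0, 2270.0, 3695.0, 964.0],
})

# Sum both columns per category, cast to int, and match the column names used above.
totals = (
    sample.groupby("Major_category")[["Women", "Men"]]
    .sum()
    .astype(int)
    .rename(columns={"Women": "women_num", "Men": "men_num"})
)
print(totals)
```

Replacing `sample` with `recent_grads` should produce the same kind of frame as `wm_catg_df`, with categories sorted alphabetically by `groupby`.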
wm_catg_df.plot.bar(rot=90, figsize=(10,10))
From the above comparison, we find that women significantly outnumber men in fields like Arts, Biology & Life Science, Communications & Journalism, Education, Health, and Psychology & Social Work, while men outnumber women in Engineering and Computers & Mathematics.
recent_grads["Median"].plot.box()
Excluding outliers, the top 25% of median salaries range from about 45k to 60k, and the bottom 25% range from 22k to about 33k.
recent_grads["Unemployment_rate"].plot.box()
Excluding outliers, the bottom 25% of unemployment rates range from 0 to about 0.05, and the top 25% range from about 0.09 to 0.13.
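The whiskers in these box plots do not necessarily reach the minimum and maximum. By default, matplotlib extends each whisker to the furthest data point within 1.5 times the interquartile range (IQR) of the box, and draws anything beyond that as an outlier. The sketch below works out those limits from the quartiles reported by `describe()` earlier. Note the assumption that those quartiles, computed before the row with missing values was dropped, are close enough to the cleaned data; `whisker_limits` is a helper name introduced here for illustration.

```python
def whisker_limits(q1, q3, whis=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr

# First and third quartiles of Median and Unemployment_rate from describe() above.
median_low, median_high = whisker_limits(33000, 45000)
rate_low, rate_high = whisker_limits(0.050306, 0.087557)

print(median_low, median_high)
print(round(rate_low, 4), round(rate_high, 4))
```

Under that assumption, the maximum median salary of 110,000 and the maximum unemployment rate of 0.177226 both fall above the upper limits, which is consistent with reading the whiskers rather than the extremes when describing the top 25%.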
Do students in more popular majors make more money? It looks like there is no correlation between a major's popularity and its median earnings.
Do students that majored in subjects that were majority female make more money? No. There is a negative correlation between earnings and the share of women: as the share of women increases, median earnings decrease. In general, the share of women is higher in the lowest-ranked majors with the lowest median earnings.
Most of the median earnings are in the range from 30k to 40k.
Many majors have a share of women between 0.5 and 0.8.
The most common unemployment rate is about 0.06, which occurs in more than 20 majors.
The unemployment rate is slightly higher in the low-ranked majors with low earnings than in the high-ranked majors with high earnings. As a side observation, the astronomy field has a remarkably low unemployment rate and a higher share of women than men.
Women significantly outnumber men in fields like Arts, Biology & Life Science, Communications & Journalism, Education, Health, and Psychology & Social Work, while men outnumber women in Engineering and Computers & Mathematics.