We'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by the American Community Survey, which conducts surveys and aggregates the data. FiveThirtyEight cleaned the dataset and released it on their GitHub repo.
Each row in the dataset represents a different major in college and contains information on gender diversity, employment rates, median salaries, and more. Here are some of the columns in the dataset:
- Rank - Rank by median earnings (the dataset is ordered by this column).
- Major_code - Major code.
- Major - Major description.
- Major_category - Category of major.
- Total - Total number of people with major.
- Sample_size - Sample size (unweighted) of full-time.
- Men - Male graduates.
- Women - Female graduates.
- ShareWomen - Women as share of total.
- Employed - Number employed.
- Median - Median salary of full-time, year-round workers.
- Low_wage_jobs - Number in low-wage service jobs.
- Full_time - Number employed 35 hours or more.
- Part_time - Number employed less than 35 hours.

The aim of this project is to answer the following questions using pandas and matplotlib:
Do students in more popular majors make more money?
How many majors are predominantly male? Predominantly female?
Which category of majors has the most students?
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
recent_grads = pd.read_csv('recent-grads.csv')
# returning first row:
recent_grads.iloc[0]
Rank 1
Major_code 2419
Major PETROLEUM ENGINEERING
Total 2339
Men 2057
Women 282
Major_category Engineering
ShareWomen 0.120564
Sample_size 36
Employed 1976
Full_time 1849
Part_time 270
Full_time_year_round 1207
Unemployed 37
Unemployment_rate 0.0183805
Median 110000
P25th 95000
P75th 125000
College_jobs 1534
Non_college_jobs 364
Low_wage_jobs 193
Name: 0, dtype: object
# reading first 5 rows:
recent_grads.head()
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 150 | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 5180 | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 |
5 rows × 21 columns
# reading last 5 rows:
recent_grads.tail()
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
168 | 169 | 3609 | ZOOLOGY | 8409.0 | 3050.0 | 5359.0 | Biology & Life Science | 0.637293 | 47 | 6259 | ... | 2190 | 3602 | 304 | 0.046320 | 26000 | 20000 | 39000 | 2771 | 2947 | 743 |
169 | 170 | 5201 | EDUCATIONAL PSYCHOLOGY | 2854.0 | 522.0 | 2332.0 | Psychology & Social Work | 0.817099 | 7 | 2125 | ... | 572 | 1211 | 148 | 0.065112 | 25000 | 24000 | 34000 | 1488 | 615 | 82 |
170 | 171 | 5202 | CLINICAL PSYCHOLOGY | 2838.0 | 568.0 | 2270.0 | Psychology & Social Work | 0.799859 | 13 | 2101 | ... | 648 | 1293 | 368 | 0.149048 | 25000 | 25000 | 40000 | 986 | 870 | 622 |
171 | 172 | 5203 | COUNSELING PSYCHOLOGY | 4626.0 | 931.0 | 3695.0 | Psychology & Social Work | 0.798746 | 21 | 3777 | ... | 965 | 2738 | 214 | 0.053621 | 23400 | 19200 | 26000 | 2403 | 1245 | 308 |
172 | 173 | 3501 | LIBRARY SCIENCE | 1098.0 | 134.0 | 964.0 | Education | 0.877960 | 2 | 742 | ... | 237 | 410 | 87 | 0.104946 | 22000 | 20000 | 22000 | 288 | 338 | 192 |
5 rows × 21 columns
# description of numeric columns:
recent_grads.describe()
| | Rank | Major_code | Total | Men | Women | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 173.000000 | 173.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
mean | 87.000000 | 3879.815029 | 39370.081395 | 16723.406977 | 22646.674419 | 0.522223 | 356.080925 | 31192.763006 | 26029.306358 | 8832.398844 | 19694.427746 | 2416.329480 | 0.068191 | 40151.445087 | 29501.445087 | 51494.219653 | 12322.635838 | 13284.497110 | 3859.017341 |
std | 50.084928 | 1687.753140 | 63483.491009 | 28122.433474 | 41057.330740 | 0.231205 | 618.361022 | 50675.002241 | 42869.655092 | 14648.179473 | 33160.941514 | 4112.803148 | 0.030331 | 11470.181802 | 9166.005235 | 14906.279740 | 21299.868863 | 23789.655363 | 6944.998579 |
min | 1.000000 | 1100.000000 | 124.000000 | 119.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 111.000000 | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.000000 | 2403.000000 | 4549.750000 | 2177.500000 | 1778.250000 | 0.336026 | 39.000000 | 3608.000000 | 3154.000000 | 1030.000000 | 2453.000000 | 304.000000 | 0.050306 | 33000.000000 | 24000.000000 | 42000.000000 | 1675.000000 | 1591.000000 | 340.000000 |
50% | 87.000000 | 3608.000000 | 15104.000000 | 5434.000000 | 8386.500000 | 0.534024 | 130.000000 | 11797.000000 | 10048.000000 | 3299.000000 | 7413.000000 | 893.000000 | 0.067961 | 36000.000000 | 27000.000000 | 47000.000000 | 4390.000000 | 4595.000000 | 1231.000000 |
75% | 130.000000 | 5503.000000 | 38909.750000 | 14631.000000 | 22553.750000 | 0.703299 | 338.000000 | 31433.000000 | 25147.000000 | 9948.000000 | 16891.000000 | 2393.000000 | 0.087557 | 45000.000000 | 33000.000000 | 60000.000000 | 14444.000000 | 11783.000000 | 3466.000000 |
max | 173.000000 | 6403.000000 | 393735.000000 | 173809.000000 | 307087.000000 | 0.968954 | 4212.000000 | 307933.000000 | 251540.000000 | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
#count of rows and columns:
raw_data_count = recent_grads.shape
print(raw_data_count)
#checking missing values and updating dataset:
recent_grads = recent_grads.dropna()
cleaned_data_count = recent_grads.shape
print(cleaned_data_count)
(173, 21)
(172, 21)
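Before dropping rows, it can help to see which ones actually contain missing values. The following is a minimal sketch, assuming a small stand-in DataFrame called `sample` instead of `recent-grads.csv`. Three rows are copied from the tables above; the fourth row, with its blanked values, is hypothetical and only there to give `dropna()` something to remove.

```python
import numpy as np
import pandas as pd

# Stand-in data: three majors from the tables above, plus one hypothetical
# row whose Total, Men and Women are deliberately missing.
sample = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "ZOOLOGY", "LIBRARY SCIENCE", "HYPOTHETICAL MAJOR"],
    "Total": [2339.0, 8409.0, 1098.0, np.nan],
    "Men": [2057.0, 3050.0, 134.0, np.nan],
    "Women": [282.0, 5359.0, 964.0, np.nan],
})

# Count missing values per column.
missing_per_column = sample.isna().sum()

# Select the rows that have at least one missing value.
rows_with_missing = sample[sample.isna().any(axis=1)]

# dropna() removes exactly those rows.
cleaned = sample.dropna()

print(missing_per_column)
print(rows_with_missing)
print(sample.shape, cleaned.shape)
```

On the real data the same two lines (`isna().sum()` and `isna().any(axis=1)`) would be applied to `recent_grads` before calling `dropna()`.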
Short conclusion: one row contained missing values and was dropped, leaving 172 of the original 173 majors.
recent_grads.plot(x='Sample_size', y='Median', kind='scatter', title='Sample_size vs. Median')
<AxesSubplot:title={'center':'Sample_size vs. Median'}, xlabel='Sample_size', ylabel='Median'>
recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind='scatter', title='Sample_size vs. Unemployment_rate')
<AxesSubplot:title={'center':'Sample_size vs. Unemployment_rate'}, xlabel='Sample_size', ylabel='Unemployment_rate'>
recent_grads.plot(x='Full_time', y='Median', kind='scatter', title='Full_time vs. Median')
<AxesSubplot:title={'center':'Full_time vs. Median'}, xlabel='Full_time', ylabel='Median'>
recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter', title='ShareWomen vs. Unemployment_rate')
<AxesSubplot:title={'center':'ShareWomen vs. Unemployment_rate'}, xlabel='ShareWomen', ylabel='Unemployment_rate'>
recent_grads.plot(x='Men', y='Median', kind='scatter', title='Men vs. Median')
<AxesSubplot:title={'center':'Men vs. Median'}, xlabel='Men', ylabel='Median'>
recent_grads.plot(x='Women', y='Median', kind='scatter', title='Women vs. Median')
<AxesSubplot:title={'center':'Women vs. Median'}, xlabel='Women', ylabel='Median'>
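A scatter plot shows the shape of a relationship; a correlation coefficient gives a rough number to go with it. The sketch below uses only the ten rows printed by `head()` and `tail()` above as stand-in data, so the value it prints describes that small subset and says nothing definite about the full dataset.

```python
import pandas as pd

# Sample_size and Median for the ten majors shown by head() and tail() above.
subset = pd.DataFrame({
    "Sample_size": [36, 7, 3, 16, 289, 47, 7, 13, 21, 2],
    "Median": [110000, 75000, 73000, 70000, 65000,
               26000, 25000, 25000, 23400, 22000],
})

# Pearson correlation between the two columns (a value between -1 and 1).
r = subset["Sample_size"].corr(subset["Median"])
print(round(r, 3))

# The full pairwise table, a numeric counterpart to the scatter plots.
corr_table = subset.corr()
print(corr_table)
```

With the full frame, the equivalent call would be `recent_grads[['Sample_size', 'Median']].corr()`.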
Short Conclusion:
In this block we will check whether the Total and Sample_size columns rank majors in much the same way, and identify the most popular categories.
# sorting values "Sample_size":
recent_grads['Sample_size'].sort_values(ascending= False).head()
76 4212 77 2684 145 2584 34 2554 93 2394 Name: Sample_size, dtype: int64
# select the same five rows by their index labels:
recent_grads.loc[[76, 77, 145, 34, 93]]
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
76 | 77 | 6203 | BUSINESS MANAGEMENT AND ADMINISTRATION | 329927.0 | 173809.0 | 156118.0 | Business | 0.473190 | 4212 | 276234 | ... | 50357 | 199897 | 21502 | 0.072218 | 38000 | 29000 | 50000 | 36720 | 148395 | 32395 |
77 | 78 | 6206 | MARKETING AND MARKETING RESEARCH | 205211.0 | 78857.0 | 126354.0 | Business | 0.615727 | 2684 | 178862 | ... | 35829 | 127230 | 11663 | 0.061215 | 38000 | 30000 | 50000 | 25320 | 93889 | 27968 |
145 | 146 | 5200 | PSYCHOLOGY | 393735.0 | 86648.0 | 307087.0 | Psychology & Social Work | 0.779933 | 2584 | 307933 | ... | 115172 | 174438 | 28169 | 0.083811 | 31500 | 24000 | 41000 | 125148 | 141860 | 48207 |
34 | 35 | 6107 | NURSING | 209394.0 | 21773.0 | 187621.0 | Health | 0.896019 | 2554 | 180903 | ... | 40818 | 122817 | 8497 | 0.044863 | 48000 | 39000 | 58000 | 151643 | 26146 | 6193 |
93 | 94 | 1901 | COMMUNICATIONS | 213996.0 | 70619.0 | 143377.0 | Communications & Journalism | 0.669999 | 2394 | 179633 | ... | 49889 | 116251 | 14602 | 0.075177 | 35000 | 27000 | 45000 | 40763 | 97964 | 27440 |
5 rows × 21 columns
# sorting values "Total":
recent_grads['Total'].sort_values(ascending= False).head()
145 393735.0 76 329927.0 123 280709.0 57 234590.0 93 213996.0 Name: Total, dtype: float64
# select the same five rows by their index labels:
recent_grads.loc[[145, 76, 123, 57, 93]]
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
145 | 146 | 5200 | PSYCHOLOGY | 393735.0 | 86648.0 | 307087.0 | Psychology & Social Work | 0.779933 | 2584 | 307933 | ... | 115172 | 174438 | 28169 | 0.083811 | 31500 | 24000 | 41000 | 125148 | 141860 | 48207 |
76 | 77 | 6203 | BUSINESS MANAGEMENT AND ADMINISTRATION | 329927.0 | 173809.0 | 156118.0 | Business | 0.473190 | 4212 | 276234 | ... | 50357 | 199897 | 21502 | 0.072218 | 38000 | 29000 | 50000 | 36720 | 148395 | 32395 |
123 | 124 | 3600 | BIOLOGY | 280709.0 | 111762.0 | 168947.0 | Biology & Life Science | 0.601858 | 1370 | 182295 | ... | 72371 | 100336 | 13874 | 0.070725 | 33400 | 24000 | 45000 | 88232 | 81109 | 28339 |
57 | 58 | 6200 | GENERAL BUSINESS | 234590.0 | 132238.0 | 102352.0 | Business | 0.436302 | 2380 | 190183 | ... | 36241 | 138299 | 14946 | 0.072861 | 40000 | 30000 | 55000 | 29334 | 100831 | 27320 |
93 | 94 | 1901 | COMMUNICATIONS | 213996.0 | 70619.0 | 143377.0 | Communications & Journalism | 0.669999 | 2394 | 179633 | ... | 49889 | 116251 | 14602 | 0.075177 | 35000 | 27000 | 45000 | 40763 | 97964 | 27440 |
5 rows × 21 columns
Short conclusion:
As we can see, three of the five rows appear in both top-five lists (Psychology, Business Management and Administration, and Communications), which suggests that the "Total" and "Sample_size" columns track each other closely. By either measure, these three are among the most popular majors.
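The overlap between the two top-five lists can also be computed rather than read off the tables. This is a sketch under the assumption that we only look at the seven majors printed in the two tables above (`top_rows` is a stand-in for `recent_grads`); it uses `nlargest` in place of the sort-then-select steps.

```python
import pandas as pd

# The seven majors that appear in the two top-five tables above,
# keyed by their row label in recent_grads.
top_rows = pd.DataFrame(
    {
        "Major": [
            "BUSINESS MANAGEMENT AND ADMINISTRATION",
            "MARKETING AND MARKETING RESEARCH",
            "PSYCHOLOGY",
            "NURSING",
            "COMMUNICATIONS",
            "BIOLOGY",
            "GENERAL BUSINESS",
        ],
        "Total": [329927.0, 205211.0, 393735.0, 209394.0,
                  213996.0, 280709.0, 234590.0],
        "Sample_size": [4212, 2684, 2584, 2554, 2394, 1370, 2380],
    },
    index=[76, 77, 145, 34, 93, 123, 57],
)

# Row labels of the five largest values in each column.
top_by_sample = set(top_rows.nlargest(5, "Sample_size").index)
top_by_total = set(top_rows.nlargest(5, "Total").index)

# Labels present in both rankings.
shared = sorted(int(label) for label in top_by_sample & top_by_total)
shared_majors = top_rows.loc[shared, "Major"].tolist()

print(shared)
print(shared_majors)
```

On the full frame, `recent_grads.nlargest(5, 'Total')` returns the same rows as sorting and then selecting them by hand.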
# Series.hist() for the Sample_size column
recent_grads['Sample_size'].hist(bins=35, range=(0,5000))
<AxesSubplot:>
# Series.hist() for the Median column
recent_grads['Median'].hist(bins=35, range=(20000,200000))
<AxesSubplot:>
# Series.hist() for the Employed column
recent_grads['Employed'].hist(bins=35, range=(0,350000))
<AxesSubplot:>
# Series.hist() for the Full_time column
recent_grads['Full_time'].hist(bins=35, range=(0,300000))
<AxesSubplot:>
# Series.hist() for the ShareWomen column
recent_grads['ShareWomen'].hist(bins=35, range=(0,1))
<AxesSubplot:>
# Series.hist() for the Unemployment_rate column
recent_grads['Unemployment_rate'].hist(bins=35, range=(0,0.2))
<AxesSubplot:>
# Series.hist() for the Men column
recent_grads['Men'].hist(bins=35, range=(0,200000))
<AxesSubplot:>
# Series.hist() for the Women column
recent_grads['Women'].hist(bins=35, range=(0,200000))
<AxesSubplot:>
Short conclusion:
A quick look at the histograms suggests that there are more men in the less popular categories, that most unemployment rates fall between 0 and 0.12 (that is, 0% to 12%), and that the range of salaries matches what we saw earlier.
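A claim such as "most unemployment rates fall in a given band" can be checked numerically as well as by eye. The sketch below is illustrative only: it uses the ten unemployment rates printed by `head()` and `tail()` above, and the four bin edges are an arbitrary choice rather than the 35 bins used in the histograms.

```python
import pandas as pd

# Unemployment_rate for the ten majors shown by head() and tail() above.
rates = pd.Series(
    [0.018381, 0.117241, 0.024096, 0.050125, 0.061098,
     0.046320, 0.065112, 0.149048, 0.053621, 0.104946],
    name="Unemployment_rate",
)

# Fraction of these majors whose rate lies between 0 and 0.12 (0% to 12%).
in_range = rates.between(0, 0.12)
share_in_range = in_range.mean()
print(f"{share_in_range:.0%} of these majors fall in the 0-12% band")

# The counts a histogram would draw, as numbers, for four equal-width bins.
bin_counts = pd.cut(rates, bins=[0, 0.05, 0.10, 0.15, 0.20]).value_counts(sort=False)
print(bin_counts)
```

Applied to `recent_grads['Unemployment_rate']`, the same two steps give the share for all majors.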
# pandas plotting module with scatter_matrix function
# Sample_size and Median
pd.plotting.scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(7,7))
array([[<AxesSubplot:xlabel='Sample_size', ylabel='Sample_size'>, <AxesSubplot:xlabel='Median', ylabel='Sample_size'>], [<AxesSubplot:xlabel='Sample_size', ylabel='Median'>, <AxesSubplot:xlabel='Median', ylabel='Median'>]], dtype=object)
# pandas plotting module with scatter_matrix function
# Sample_size,Median and Unemployment_rate
pd.plotting.scatter_matrix(recent_grads[['Sample_size', 'Median', 'Unemployment_rate']], figsize=(10,10))
array([[<AxesSubplot:xlabel='Sample_size', ylabel='Sample_size'>, <AxesSubplot:xlabel='Median', ylabel='Sample_size'>, <AxesSubplot:xlabel='Unemployment_rate', ylabel='Sample_size'>], [<AxesSubplot:xlabel='Sample_size', ylabel='Median'>, <AxesSubplot:xlabel='Median', ylabel='Median'>, <AxesSubplot:xlabel='Unemployment_rate', ylabel='Median'>], [<AxesSubplot:xlabel='Sample_size', ylabel='Unemployment_rate'>, <AxesSubplot:xlabel='Median', ylabel='Unemployment_rate'>, <AxesSubplot:xlabel='Unemployment_rate', ylabel='Unemployment_rate'>]], dtype=object)
# pandas plotting module with scatter_matrix function
# Median, Men and Women
pd.plotting.scatter_matrix(recent_grads[['Median', 'Men', 'Women']], figsize = (10,10))
array([[<AxesSubplot:xlabel='Median', ylabel='Median'>, <AxesSubplot:xlabel='Men', ylabel='Median'>, <AxesSubplot:xlabel='Women', ylabel='Median'>], [<AxesSubplot:xlabel='Median', ylabel='Men'>, <AxesSubplot:xlabel='Men', ylabel='Men'>, <AxesSubplot:xlabel='Women', ylabel='Men'>], [<AxesSubplot:xlabel='Median', ylabel='Women'>, <AxesSubplot:xlabel='Men', ylabel='Women'>, <AxesSubplot:xlabel='Women', ylabel='Women'>]], dtype=object)
recent_grads[:10].plot.barh(x='Major', y='ShareWomen')
<AxesSubplot:ylabel='Major'>
recent_grads[-10:].plot.barh(x='Major', y='ShareWomen')
<AxesSubplot:ylabel='Major'>
recent_grads[:10].plot.barh(x='Major', y='Unemployment_rate')
<AxesSubplot:ylabel='Major'>
recent_grads[-10:].plot.barh(x='Major', y='Unemployment_rate')
<AxesSubplot:ylabel='Major'>
Short conclusion:
The bar plots above show that women are less represented in Engineering and more represented in Psychology, Languages, Arts and Education.
Next, we will explore how men and women are represented in the first and last majors of the ranking, and in the majors with the ten highest and ten lowest median salaries.
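The question of how many majors are predominantly male or female can be answered by counting rows on either side of a `ShareWomen` threshold. In this sketch the 0.5 cut-off is our own assumption about what "predominantly" means, and the data is limited to the ten majors printed by `head()` and `tail()` above, so the counts apply to that subset only.

```python
import pandas as pd

# ShareWomen for the first five and last five majors in the ranking.
subset = pd.DataFrame({
    "Major": [
        "PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
        "METALLURGICAL ENGINEERING", "NAVAL ARCHITECTURE AND MARINE ENGINEERING",
        "CHEMICAL ENGINEERING", "ZOOLOGY", "EDUCATIONAL PSYCHOLOGY",
        "CLINICAL PSYCHOLOGY", "COUNSELING PSYCHOLOGY", "LIBRARY SCIENCE",
    ],
    "ShareWomen": [0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
                   0.637293, 0.817099, 0.799859, 0.798746, 0.877960],
})

# Assumed definition: more than half women means predominantly female,
# less than half means predominantly male.
mostly_female = subset["ShareWomen"] > 0.5
mostly_male = subset["ShareWomen"] < 0.5

n_female = int(mostly_female.sum())
n_male = int(mostly_male.sum())

print("predominantly female:", n_female)
print("predominantly male:", n_male)
```

Running the same comparison on `recent_grads['ShareWomen']` would give the counts for all 172 majors.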
# men and women in different categories:
recent_grads[:15][['Major', 'Men', 'Women']].plot.barh(x = 'Major')
recent_grads[-15:][['Major','Men', 'Women']].plot.barh(x = 'Major')
<AxesSubplot:ylabel='Major'>
# number of men and women at each median salary:
recent_grads[:10][['Median', 'Men', 'Women']].plot.bar(x = 'Median')
recent_grads[-10:][['Median','Men', 'Women']].plot.bar(x = 'Median')
<AxesSubplot:xlabel='Median'>
# dict to make box plot colored:
color = {"boxes": "DarkGreen",
"whiskers": "DarkOrange",
"medians": "DarkBlue",
"caps": "Gray" }
# series.plot.box method
recent_grads['Median'].plot.box(color = color, sym='r+')
<AxesSubplot:>
# series.plot.box method
recent_grads['Unemployment_rate'].plot.box(color = color, sym = 'r+')
<AxesSubplot:>
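The points a box plot draws as separate markers (the `r+` symbols here) are those beyond the whiskers, which by matplotlib's default reach 1.5 times the interquartile range past the quartiles. As a rough sketch, the fences for `Median` can be worked out from the quartiles in the `describe()` output above. Those quartiles were computed before `dropna()`, so treat the numbers as approximate.

```python
import pandas as pd

# Quartiles of the Median column, taken from the describe() output above.
q1, q3 = 33000.0, 45000.0
iqr = q3 - q1

# Default whisker rule: 1.5 * IQR beyond each quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Median salaries of the five majors shown by head() above.
top_medians = pd.Series(
    [110000, 75000, 73000, 70000, 65000],
    index=[
        "PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
        "METALLURGICAL ENGINEERING", "NAVAL ARCHITECTURE AND MARINE ENGINEERING",
        "CHEMICAL ENGINEERING",
    ],
)

# Values outside the fences would be drawn as individual points.
fliers = top_medians[(top_medians < lower_fence) | (top_medians > upper_fence)]

print(lower_fence, upper_fence)
print(fliers)
```

On the cleaned frame, the quartiles would come from `recent_grads['Median'].quantile([0.25, 0.75])` instead of being typed in.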
We will now redraw the earlier scatter plots as hexagonal bin (hexbin) plots.
recent_grads.plot(x='Sample_size', y='Median', kind='hexbin', title='Sample_size vs. Median', gridsize = 30, bins = 10,
xscale = 'log',vmax = 6 )
<AxesSubplot:title={'center':'Sample_size vs. Median'}, xlabel='Sample_size', ylabel='Median'>
recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind='hexbin', title='Sample_size vs. Unemployment_rate', gridsize = 30, bins = 10,
vmax = 6, xscale = 'log')
<AxesSubplot:title={'center':'Sample_size vs. Unemployment_rate'}, xlabel='Sample_size', ylabel='Unemployment_rate'>
recent_grads.plot(x='Men', y='Median', kind='hexbin', title='Men vs. Median', gridsize = 30, bins = 10,
vmax = 6, xscale = 'log')
<AxesSubplot:title={'center':'Men vs. Median'}, xlabel='Men', ylabel='Median'>
recent_grads.plot(x = 'Women', y= 'Median', kind='hexbin', title= 'Women vs. Median', gridsize = 30, bins = 10,
vmax = 6)
<AxesSubplot:title={'center':'Women vs. Median'}, xlabel='Women', ylabel='Median'>
Graduates from popular majors do not earn more money. Men predominate in Engineering, and women in Psychology.
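The third question, which category of majors has the most students, calls for grouping by `Major_category` and summing `Total`. The sketch below shows the technique on the seven majors that appear in the two top-five tables earlier. Because it covers only those rows, its result is an illustration and not the answer for the whole dataset.

```python
import pandas as pd

# Seven majors from the two top-five tables (a subset, not the full data).
subset = pd.DataFrame({
    "Major": [
        "BUSINESS MANAGEMENT AND ADMINISTRATION",
        "MARKETING AND MARKETING RESEARCH",
        "PSYCHOLOGY",
        "NURSING",
        "COMMUNICATIONS",
        "BIOLOGY",
        "GENERAL BUSINESS",
    ],
    "Major_category": [
        "Business", "Business", "Psychology & Social Work", "Health",
        "Communications & Journalism", "Biology & Life Science", "Business",
    ],
    "Total": [329927.0, 205211.0, 393735.0, 209394.0,
              213996.0, 280709.0, 234590.0],
})

# Sum the number of students per category, largest first.
students_per_category = (
    subset.groupby("Major_category")["Total"]
    .sum()
    .sort_values(ascending=False)
)

print(students_per_category)
```

Replacing `subset` with `recent_grads` gives the totals across all majors, and `.plot.barh()` on the result would chart them.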