In this project, I will be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by the American Community Survey, which conducts surveys and aggregates the data. FiveThirtyEight cleaned the dataset and released it on their GitHub repo.
Each row in the dataset represents a different major in college and contains information on gender diversity, employment rates, median salaries, and more.
Following are the columns in the dataset:
Rank - Rank by median earnings (the dataset is ordered by this column).
Major_code - Major code.
Major - Major description.
Major_category - Category of major.
Total - Total number of people with major.
Sample_size - Sample size (unweighted) of full-time.
Men - Male graduates.
Women - Female graduates.
ShareWomen - Women as share of total.
Employed - Number employed.
Median - Median salary of full-time, year-round workers.
Low_wage_jobs - Number in low-wage service jobs.
Full_time - Number employed 35 hours or more.
Part_time - Number employed less than 35 hours.
Full_time_year_round - Employed at least 50 weeks and at least 35 hours.
Unemployed - Number unemployed.
Unemployment_rate - Unemployed / (Unemployed + Employed).
P25th - 25th percentile of earnings.
P75th - 75th percentile of earnings.
College_jobs - Number with job requiring a college degree.
Non_college_jobs - Number with job not requiring a college degree.
Using visualizations, we can start to explore questions from the dataset like:
Do students in more popular majors make more money?
How many majors are predominantly male? Predominantly female?
Which category of majors have the most students?
Before we start creating data visualizations, let's import the libraries we need and run the necessary Jupyter magic so that plots are displayed inline.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
#Reading `recent-grads.csv` into pandas and assigning the resulting DataFrame to `recent_grads`
recent_grads = pd.read_csv('recent-grads.csv')
#Using `DataFrame.iloc[]` to return the first row formatted as a table
print(recent_grads.iloc[0])
Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object
#Using `DataFrame.head()` and `DataFrame.tail()` to become familiar with the structure of data
print(recent_grads.head())
print(recent_grads.tail())
   Rank  Major_code                                      Major    Total  \
0     1        2419                      PETROLEUM ENGINEERING   2339.0
1     2        2416             MINING AND MINERAL ENGINEERING    756.0
2     3        2415                  METALLURGICAL ENGINEERING    856.0
3     4        2417  NAVAL ARCHITECTURE AND MARINE ENGINEERING   1258.0
4     5        2405                       CHEMICAL ENGINEERING  32260.0

       Men    Women Major_category  ShareWomen  Sample_size  Employed  ...  \
0   2057.0    282.0    Engineering    0.120564           36      1976  ...
1    679.0     77.0    Engineering    0.101852            7       640  ...
2    725.0    131.0    Engineering    0.153037            3       648  ...
3   1123.0    135.0    Engineering    0.107313           16       758  ...
4  21239.0  11021.0    Engineering    0.341631          289     25694  ...

   Part_time  Full_time_year_round  Unemployed  Unemployment_rate  Median  \
0        270                  1207          37           0.018381  110000
1        170                   388          85           0.117241   75000
2        133                   340          16           0.024096   73000
3        150                   692          40           0.050125   70000
4       5180                 16697        1672           0.061098   65000

   P25th   P75th  College_jobs  Non_college_jobs  Low_wage_jobs
0  95000  125000          1534               364            193
1  55000   90000           350               257             50
2  50000  105000           456               176              0
3  43000   80000           529               102              0
4  50000   75000         18314              4440            972

[5 rows x 21 columns]

     Rank  Major_code                   Major   Total     Men   Women  \
168   169        3609                 ZOOLOGY  8409.0  3050.0  5359.0
169   170        5201  EDUCATIONAL PSYCHOLOGY  2854.0   522.0  2332.0
170   171        5202     CLINICAL PSYCHOLOGY  2838.0   568.0  2270.0
171   172        5203   COUNSELING PSYCHOLOGY  4626.0   931.0  3695.0
172   173        3501         LIBRARY SCIENCE  1098.0   134.0   964.0

               Major_category  ShareWomen  Sample_size  Employed  ...  \
168    Biology & Life Science    0.637293           47      6259  ...
169  Psychology & Social Work    0.817099            7      2125  ...
170  Psychology & Social Work    0.799859           13      2101  ...
171  Psychology & Social Work    0.798746           21      3777  ...
172                 Education    0.877960            2       742  ...

     Part_time  Full_time_year_round  Unemployed  Unemployment_rate  Median  \
168       2190                  3602         304           0.046320   26000
169        572                  1211         148           0.065112   25000
170        648                  1293         368           0.149048   25000
171        965                  2738         214           0.053621   23400
172        237                   410          87           0.104946   22000

     P25th  P75th  College_jobs  Non_college_jobs  Low_wage_jobs
168  20000  39000          2771              2947            743
169  24000  34000          1488               615             82
170  25000  40000           986               870            622
171  19200  26000          2403              1245            308
172  20000  22000           288               338            192

[5 rows x 21 columns]
#Using `DataFrame.describe(include='all')` to generate summary statistics for every column, numeric and non-numeric
recent_grads.describe(include='all')
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 173.000000 | 173.000000 | 173 | 172.000000 | 172.000000 | 172.000000 | 173 | 172.000000 | 173.000000 | 173.000000 | ... | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
unique | NaN | NaN | 173 | NaN | NaN | NaN | 16 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
top | NaN | NaN | MATERIALS ENGINEERING AND MATERIALS SCIENCE | NaN | NaN | NaN | Engineering | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
freq | NaN | NaN | 1 | NaN | NaN | NaN | 29 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
mean | 87.000000 | 3879.815029 | NaN | 39370.081395 | 16723.406977 | 22646.674419 | NaN | 0.522223 | 356.080925 | 31192.763006 | ... | 8832.398844 | 19694.427746 | 2416.329480 | 0.068191 | 40151.445087 | 29501.445087 | 51494.219653 | 12322.635838 | 13284.497110 | 3859.017341 |
std | 50.084928 | 1687.753140 | NaN | 63483.491009 | 28122.433474 | 41057.330740 | NaN | 0.231205 | 618.361022 | 50675.002241 | ... | 14648.179473 | 33160.941514 | 4112.803148 | 0.030331 | 11470.181802 | 9166.005235 | 14906.279740 | 21299.868863 | 23789.655363 | 6944.998579 |
min | 1.000000 | 1100.000000 | NaN | 124.000000 | 119.000000 | 0.000000 | NaN | 0.000000 | 2.000000 | 0.000000 | ... | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.000000 | 2403.000000 | NaN | 4549.750000 | 2177.500000 | 1778.250000 | NaN | 0.336026 | 39.000000 | 3608.000000 | ... | 1030.000000 | 2453.000000 | 304.000000 | 0.050306 | 33000.000000 | 24000.000000 | 42000.000000 | 1675.000000 | 1591.000000 | 340.000000 |
50% | 87.000000 | 3608.000000 | NaN | 15104.000000 | 5434.000000 | 8386.500000 | NaN | 0.534024 | 130.000000 | 11797.000000 | ... | 3299.000000 | 7413.000000 | 893.000000 | 0.067961 | 36000.000000 | 27000.000000 | 47000.000000 | 4390.000000 | 4595.000000 | 1231.000000 |
75% | 130.000000 | 5503.000000 | NaN | 38909.750000 | 14631.000000 | 22553.750000 | NaN | 0.703299 | 338.000000 | 31433.000000 | ... | 9948.000000 | 16891.000000 | 2393.000000 | 0.087557 | 45000.000000 | 33000.000000 | 60000.000000 | 14444.000000 | 11783.000000 | 3466.000000 |
max | 173.000000 | 6403.000000 | NaN | 393735.000000 | 173809.000000 | 307087.000000 | NaN | 0.968954 | 4212.000000 | 307933.000000 | ... | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
11 rows × 21 columns
Matplotlib expects the columns of values we pass in to have matching lengths, and missing values will cause it to throw errors. We therefore drop the rows that contain missing values before plotting.
# Assigning the number of rows in `recent_grads` to `raw_data_count`
raw_data_count = recent_grads.shape[0]
print("Initial no. of rows = "+str(raw_data_count))
Initial no. of rows = 173
# Using `DataFrame.dropna()` to drop rows containing missing values and assigning the resulting DataFrame back to `recent_grads`
recent_grads = recent_grads.dropna()
# Looking up the number of rows in `recent_grads` after cleaning and assigning the value to `cleaned_data_count`
cleaned_data_count = recent_grads.shape[0]
print("No. of rows after cleaning = "+str(cleaned_data_count))
No. of rows after cleaning = 172
On comparing cleaned_data_count and raw_data_count, we notice that only one row contained missing values and was dropped.
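Before dropping rows, it can help to see where the missing values actually are. The following is a minimal sketch on a small made-up frame (the `sample` DataFrame and its values are hypothetical stand-ins, not rows from `recent-grads.csv`); it counts missing values per column and lists the rows that `dropna()` would remove.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for recent_grads: three majors, one with missing counts.
sample = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [2339.0, np.nan, 856.0],
    "Men": [2057.0, np.nan, 725.0],
    "Median": [110000, 40000, 73000],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Rows with at least one missing value, i.e. the rows dropna() would remove.
rows_with_missing = sample[sample.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

Running the same two expressions on `recent_grads` before the `dropna()` call would show which major was dropped and which columns were empty for it.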
Exploring the relation: Sample_size and Median
recent_grads.plot(x='Sample_size', y='Median', kind='scatter', title='Median vs. Sample_size')
Exploring the relation: Sample_size and Unemployment_rate
recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind='scatter', title='Unemployment_rate vs. Sample_size')
Exploring the relation: Full_time and Median
recent_grads.plot(x='Full_time', y='Median', kind='scatter', title='Median vs. Full_time')
Exploring the relation: ShareWomen and Unemployment_rate
recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter', title='Unemployment_rate vs. ShareWomen')
Exploring the relation: Men and Median
recent_grads.plot(x='Men', y='Median', kind='scatter', title='Median vs. Men')
No direct correlation can be observed in the above scatter plots.
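A visual impression from scatter plots can be backed by a number. As a rough sketch, pandas can compute Pearson correlation coefficients with `DataFrame.corr()` and `Series.corr()`. The frame below reuses only the first five rows printed earlier, so it illustrates the calls rather than the result; the name `sample` is an assumption made for this example.

```python
import pandas as pd

# Values copied from the first five rows of the head() output above.
sample = pd.DataFrame({
    "Sample_size": [36, 7, 3, 16, 289],
    "Median": [110000, 75000, 73000, 70000, 65000],
    "Unemployment_rate": [0.018381, 0.117241, 0.024096, 0.050125, 0.061098],
})

# Pairwise Pearson correlation between all numeric columns.
corr_matrix = sample.corr()

# Correlation for one specific pair of columns.
r_size_median = sample["Sample_size"].corr(sample["Median"])

print(corr_matrix)
print(r_size_median)
```

On the full `recent_grads` DataFrame, coefficients close to 0 would be consistent with the scatter plots; five rows are far too few to draw any conclusion.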
Exploring the question: Do students in more popular majors make more money?
# Finding out most popular major category
recent_grads['Major_category'].describe()
count 172 unique 16 top Engineering freq 29 Name: Major_category, dtype: object
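Engineering is the most frequent major category (29 of the 172 majors fall under it). Taking this as a proxy for popularity, we will now inspect whether Engineering majors make more money. Note that this counts majors, not students.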
# Inspecting median salaries of Engineering majors
rg_engg=recent_grads[recent_grads["Major_category"]=='Engineering']
rg_engg.plot(x='Median',y='Major_code',kind='scatter',title='Major_code vs. Median')
# Finding out aggregate values of median salary
recent_grads['Median'].describe()
count 172.000000 mean 40076.744186 std 11461.388773 min 22000.000000 25% 33000.000000 50% 36000.000000 75% 45000.000000 max 110000.000000 Name: Median, dtype: float64
From the scatter plot and the summary statistics above, we can conclude that students in more popular majors make more money.
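The frequency of a category counts majors, not students. A more direct way to measure popularity is the `Total` column. The sketch below is only an illustration on made-up numbers (the `sample` frame and every value in it are hypothetical): it sums `Total` per category and computes the correlation between `Total` and `Median`.

```python
import pandas as pd

# Hypothetical stand-in for recent_grads; the numbers are invented for illustration.
sample = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Business", "Education"],
    "Total": [2000.0, 30000.0, 300000.0, 150000.0],
    "Median": [90000, 65000, 40000, 33000],
})

# Number of students per category, largest first.
students_per_category = (
    sample.groupby("Major_category")["Total"].sum().sort_values(ascending=False)
)

# Pearson correlation between the size of a major and its median salary.
total_vs_median = sample["Total"].corr(sample["Median"])

print(students_per_category)
print(total_vs_median)
```

Applied to `recent_grads`, the first result answers which category has the most students, and the sign of the second indicates whether larger majors tend to earn more or less.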
Exploring the question: Do students that majored in subjects that were majority female make more money?
# Exploring share of women and median salary corresponding to that major
recent_grads.plot(x='ShareWomen', y='Median', kind='scatter', title='Median vs. ShareWomen')
There appears to be a negative correlation between Median and ShareWomen.
To confirm it, let's observe the graph more closely by setting axes limits.
ax1 = recent_grads.plot(x='ShareWomen', y='Median', kind='scatter', title='Median vs. ShareWomen',figsize=(4,4))
ax1.set_ylim(20000,80000)
ax1.set_xlim(0,1)
(0, 1)
It can be thus inferred that students who majored in subjects that were majority female make less money.
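The same comparison can be sketched numerically by splitting majors at a 50% share of women and comparing the typical salary in each group. The frame below reuses the ten rows printed by `head()` and `tail()` earlier; since those are the extremes of the ranking, it shows the computation rather than a representative result. The `group` column and its labels are assumptions made for this example.

```python
import pandas as pd

# ShareWomen and Median from the first five and last five rows shown earlier.
sample = pd.DataFrame({
    "ShareWomen": [0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
                   0.637293, 0.817099, 0.799859, 0.798746, 0.877960],
    "Median": [110000, 75000, 73000, 70000, 65000,
               26000, 25000, 25000, 23400, 22000],
})

# Label each major by whether women make up more than half of its graduates.
sample["group"] = (sample["ShareWomen"] > 0.5).map(
    {True: "majority female", False: "majority male"}
)

# Median of the per-major median salaries within each group.
median_by_group = sample.groupby("group")["Median"].median()

print(median_by_group)
```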
Exploring the question: Is there any link between the number of full-time employees and median salary?
# Calculating share of full-time employees
recent_grads["ShareFull_time"] = recent_grads["Full_time"]/recent_grads["Total"]
# Plotting a scatter plot to look for a link between `ShareFull_time` and `Median`
recent_grads.plot(x='ShareFull_time', y='Median', kind='scatter', title='Median vs. ShareFull_Time')
There appears to be a weak positive correlation between Median and ShareFull_time.
To confirm it, let's observe the graph more closely by setting axes limits.
ax1 = recent_grads.plot(x='ShareFull_time', y='Median', kind='scatter', title='Median vs. ShareFull_Time',figsize=(5,5))
ax1.set_xlim(0.4,1.0)
ax1.set_ylim(0,80000)
(0, 80000)
It seems that the larger the share of full-time employees among the recent graduates of a major, the higher the median salary of that major.
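Exploring the distributions of Sample_size, Median, Employed, Full_time, ShareWomen, Unemployment_rate, Men and Women
cols = ['Sample_size', 'Median', 'Employed', 'Full_time', 'ShareWomen', 'Unemployment_rate', 'Men', 'Women']
fig = plt.figure(figsize=(5,48))
for r in range(0,8):
    # Drawing the histogram of each column on its own subplot
    ax = fig.add_subplot(8,1,r+1)
    ax = recent_grads[cols[r]].plot(kind='hist', rot=30, ax=ax)
    ax.set_title(cols[r])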
# Finding the proportion of Women predominant majors by binning `ShareWomen`
ax1 = recent_grads['ShareWomen'].hist(bins=2,range=(0,1))
ax1.set_title('ShareWomen')
ax1.set_xlabel('ShareWomen')
ax1.set_ylabel('Number of Majors')
Text(0, 0.5, 'Number of Majors')
Thus, it's evident that more than half of the majors are predominantly female.
#Calculating the proportion of female predominant majors
women = recent_grads[recent_grads['Women'] > (recent_grads['Total']/2)]
women.shape[0]/recent_grads.shape[0]
0.5581395348837209
Thus, about 56% of majors are predominantly female.
# Calculating the share of male
recent_grads["ShareMen"] = 1 - recent_grads["ShareWomen"]
# Finding the proportion of Men predominant majors by binning `ShareMen`
ax1 = recent_grads['ShareMen'].hist(bins=2,range=(0,1))
ax1.set_title('ShareMen')
ax1.set_xlabel('ShareMen')
ax1.set_ylabel('Number of Majors')
Text(0, 0.5, 'Number of Majors')
# Calculating the proportion of male predominant majors
men = recent_grads[recent_grads['Men'] > (recent_grads['Total']/2)]
men.shape[0]/recent_grads.shape[0]
0.4418604651162791
Thus, about 44% of majors are predominantly male.
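Both the two-bin histogram and the proportion can be reproduced without plotting. As a small sketch on hypothetical values (the `share_women` Series below is made up), the mean of a boolean Series gives the proportion directly, and `numpy.histogram` returns the counts that a histogram with `bins=2` and `range=(0, 1)` would draw.

```python
import numpy as np
import pandas as pd

# Hypothetical ShareWomen values for five majors.
share_women = pd.Series([0.12, 0.34, 0.64, 0.82, 0.80])

# The mean of a boolean Series is the fraction of True values.
female_majority_share = (share_women > 0.5).mean()

# Counts per bin for the two bins [0, 0.5) and [0.5, 1].
counts, edges = np.histogram(share_women, bins=2, range=(0, 1))

print(female_majority_share)
print(counts, edges)
```

One detail to keep in mind is that a major with a share of exactly 0.5 lands in the upper bin of the histogram but is not counted by the strict `> 0.5` comparison.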
ax1 = recent_grads['Median'].plot(kind='hist')
ax1.set_xlabel("Median")
ax1.set_ylabel("Number of majors")
ax1.set_title("Median")
Text(0.5, 1.0, 'Median')
As seen in the histogram above, the most common median salary range is 30000-40000.
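Reading the tallest bar off a histogram can be cross-checked by binning the values explicitly. The sketch below uses `pd.cut` with 10000-wide bins on the ten `Median` values printed by `head()` and `tail()` earlier; the bin edges are an assumption chosen for this example, and because these rows are the extremes of the ranking, the most common bin here differs from the one in the full dataset.

```python
import pandas as pd

# Median values from the first five and last five rows shown earlier.
medians = pd.Series([110000, 75000, 73000, 70000, 65000,
                     26000, 25000, 25000, 23400, 22000])

# Assumed bin edges: 20000, 30000, ..., 120000 (right edge inclusive).
bin_edges = list(range(20000, 120001, 10000))
binned = pd.cut(medians, bins=bin_edges)

# Number of majors per salary range, and the range with the most majors.
counts = binned.value_counts()
most_common_bin = counts.idxmax()

print(counts)
print(most_common_bin)
```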
Since scatter matrix plots are frequently used in exploratory data analysis, pandas contains a function named scatter_matrix() that generates the plots for us. This function is part of the pandas.plotting module and needs to be imported separately.
# Importing `scatter_matrix` from the `pandas.plotting` module
from pandas.plotting import scatter_matrix
Exploring Sample_size and Median
scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(10,10))
There doesn't seem to be any relationship between Median and Sample_size.
Exploring Sample_size, Median and Unemployment_rate
scatter_matrix(recent_grads[['Sample_size','Median','Unemployment_rate']], figsize=(10,10))
There doesn't seem to be any relationship between Median, Sample_size and Unemployment_rate either.
Using a bar plot to compare the percentages of women (ShareWomen) from the first ten rows and last ten rows of the recent_grads dataframe.
# Combining the first ten and last ten rows with `pd.concat()` before plotting
pd.concat([recent_grads[:10], recent_grads[-10:]])['ShareWomen'].plot.bar()
The last 10 majors have a greater share of women than the first 10 majors.
Using a bar plot to compare the unemployment rate (Unemployment_rate) from the first ten rows and last ten rows of the recent_grads dataframe.
# Combining the first ten and last ten rows with `pd.concat()` before plotting
pd.concat([recent_grads[:10], recent_grads[-10:]])['Unemployment_rate'].plot.bar()
There doesn't seem to be any relation between ShareWomen and Unemployment_rate.
Share = recent_grads[['ShareMen','ShareWomen','Major']]
Share.head(15).plot.barh(x='Major')
The share of men in Engineering majors is much higher than the share of women.
Box Plot of Median
recent_grads['Median'].plot(kind='box')
This confirms that the median salary is between 30000 and 40000.
Box Plot of Unemployment_rate
recent_grads['Unemployment_rate'].plot(kind='box')
The median unemployment rate is around 0.07.
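The values a box plot draws can also be computed directly. The sketch below assumes matplotlib's default whisker rule of 1.5 times the interquartile range and uses only the five `Unemployment_rate` values from the `head()` output earlier, so the numbers illustrate the method and not the full dataset.

```python
import pandas as pd

# Unemployment_rate values from the first five rows shown earlier.
rates = pd.Series([0.018381, 0.117241, 0.024096, 0.050125, 0.061098])

# Box edges (quartiles) and the centre line (median).
q1, med, q3 = rates.quantile([0.25, 0.5, 0.75])

# Points beyond 1.5 * IQR from the box are drawn as individual outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = rates[(rates < lower_fence) | (rates > upper_fence)]

print(q1, med, q3)
print(outliers)
```

Calling `recent_grads['Unemployment_rate'].quantile([0.25, 0.5, 0.75])` would give the exact box edges and median for the full dataset.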
Hexbin plots can be a useful alternative to scatter plots if our data are too dense to plot each point individually.
Hexbin plot for Unemployment_rate vs ShareWomen.
recent_grads.plot.hexbin(x='ShareWomen', y='Unemployment_rate')
This confirms that ShareWomen and Unemployment_rate are not related.
Visualizations can help us identify and uncover patterns more easily, especially when the dataset contains many values.