In this project, we'll see how the pandas plotting functionality, combined with the Jupyter notebook interface, lets us explore data quickly through visualizations.
We'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by the American Community Survey, which conducts surveys and aggregates the data. FiveThirtyEight cleaned the dataset and released it on their GitHub repo.
Using visualizations, we can start to explore questions from the dataset like:
Using scatter plots
Using histograms
Using bar plots
Each row in the dataset represents a different major in college and contains information on gender diversity, employment rates, median salaries, and more.
Here are some of the columns in the dataset:
Rank: Rank by median earnings (the dataset is ordered by this column)
Major_code: Major code
Major: Major description
Major_category: Category of major
Total: Total number of people with major
Sample_size: Sample size (unweighted) of full-time
Men: Male graduates
Women: Female graduates
ShareWomen: Women as share of total
Employed: Number employed
Median: Median salary of full-time, year-round workers
Low_wage_jobs: Number in low-wage service jobs
Full_time: Number employed 35 hours or more
Part_time: Number employed less than 35 hours
We'll explore how to do these and more while primarily working in pandas. Before we start creating data visualizations, let's import the libraries we need and remove rows containing null values.
# import pandas and matplotlib
import pandas as pd
import matplotlib.pyplot as plt
# run jupyter magic so that plots are displayed inline
%matplotlib inline
# read the dataset into a dataframe
recent_grads = pd.read_csv('recent-grads.csv')
# exploring the data
recent_grads.iloc[0]
Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object
# first rows
recent_grads.head()
Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 150 | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 5180 | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 |
5 rows × 21 columns
# last rows
recent_grads.tail()
Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
168 | 169 | 3609 | ZOOLOGY | 8409.0 | 3050.0 | 5359.0 | Biology & Life Science | 0.637293 | 47 | 6259 | ... | 2190 | 3602 | 304 | 0.046320 | 26000 | 20000 | 39000 | 2771 | 2947 | 743 |
169 | 170 | 5201 | EDUCATIONAL PSYCHOLOGY | 2854.0 | 522.0 | 2332.0 | Psychology & Social Work | 0.817099 | 7 | 2125 | ... | 572 | 1211 | 148 | 0.065112 | 25000 | 24000 | 34000 | 1488 | 615 | 82 |
170 | 171 | 5202 | CLINICAL PSYCHOLOGY | 2838.0 | 568.0 | 2270.0 | Psychology & Social Work | 0.799859 | 13 | 2101 | ... | 648 | 1293 | 368 | 0.149048 | 25000 | 25000 | 40000 | 986 | 870 | 622 |
171 | 172 | 5203 | COUNSELING PSYCHOLOGY | 4626.0 | 931.0 | 3695.0 | Psychology & Social Work | 0.798746 | 21 | 3777 | ... | 965 | 2738 | 214 | 0.053621 | 23400 | 19200 | 26000 | 2403 | 1245 | 308 |
172 | 173 | 3501 | LIBRARY SCIENCE | 1098.0 | 134.0 | 964.0 | Education | 0.877960 | 2 | 742 | ... | 237 | 410 | 87 | 0.104946 | 22000 | 20000 | 22000 | 288 | 338 | 192 |
5 rows × 21 columns
# summary statistics
recent_grads.describe()
Rank | Major_code | Total | Men | Women | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 173.000000 | 173.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
mean | 87.000000 | 3879.815029 | 39370.081395 | 16723.406977 | 22646.674419 | 0.522223 | 356.080925 | 31192.763006 | 26029.306358 | 8832.398844 | 19694.427746 | 2416.329480 | 0.068191 | 40151.445087 | 29501.445087 | 51494.219653 | 12322.635838 | 13284.497110 | 3859.017341 |
std | 50.084928 | 1687.753140 | 63483.491009 | 28122.433474 | 41057.330740 | 0.231205 | 618.361022 | 50675.002241 | 42869.655092 | 14648.179473 | 33160.941514 | 4112.803148 | 0.030331 | 11470.181802 | 9166.005235 | 14906.279740 | 21299.868863 | 23789.655363 | 6944.998579 |
min | 1.000000 | 1100.000000 | 124.000000 | 119.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 111.000000 | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.000000 | 2403.000000 | 4549.750000 | 2177.500000 | 1778.250000 | 0.336026 | 39.000000 | 3608.000000 | 3154.000000 | 1030.000000 | 2453.000000 | 304.000000 | 0.050306 | 33000.000000 | 24000.000000 | 42000.000000 | 1675.000000 | 1591.000000 | 340.000000 |
50% | 87.000000 | 3608.000000 | 15104.000000 | 5434.000000 | 8386.500000 | 0.534024 | 130.000000 | 11797.000000 | 10048.000000 | 3299.000000 | 7413.000000 | 893.000000 | 0.067961 | 36000.000000 | 27000.000000 | 47000.000000 | 4390.000000 | 4595.000000 | 1231.000000 |
75% | 130.000000 | 5503.000000 | 38909.750000 | 14631.000000 | 22553.750000 | 0.703299 | 338.000000 | 31433.000000 | 25147.000000 | 9948.000000 | 16891.000000 | 2393.000000 | 0.087557 | 45000.000000 | 33000.000000 | 60000.000000 | 14444.000000 | 11783.000000 | 3466.000000 |
max | 173.000000 | 6403.000000 | 393735.000000 | 173809.000000 | 307087.000000 | 0.968954 | 4212.000000 | 307933.000000 | 251540.000000 | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
# dropping rows with missing values
# .shape returns (rows, columns), so the variables are named accordingly
raw_data_shape = recent_grads.shape
print('Raw data shape:', raw_data_shape)
recent_grads = recent_grads.dropna()
cleaned_data_shape = recent_grads.shape
print('Cleaned data shape:', cleaned_data_shape)
Raw data shape: (173, 21)
Cleaned data shape: (172, 21)
Matplotlib expects the columns of values we pass in to have matching lengths, and missing values can cause it to throw errors. For that reason, we dropped the one row that contained missing values.
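Before dropping rows, it can help to check which columns actually contain missing values. The following is a minimal sketch on a small made-up frame (the `toy` values are illustrative assumptions, not rows from recent-grads.csv); it uses only the standard pandas calls `isna()` and `dropna()`.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not rows from recent-grads.csv
toy = pd.DataFrame({
    'Major': ['A', 'B', 'C', 'D'],
    'Total': [100.0, np.nan, 300.0, 400.0],
    'Median': [50000, 40000, 30000, 20000],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Select the rows that contain at least one missing value
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)

# Drop those rows and compare the row counts
cleaned = toy.dropna()
print(len(toy), '->', len(cleaned))
```

Running the same two checks on `recent_grads` before calling `dropna()` would show which column is responsible for the row that gets removed.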
We'll generate scatter plots in separate Jupyter notebook cells to explore the following relations:
Sample_size and Median
Sample_size and Unemployment_rate
Full_time and Median
ShareWomen and Unemployment_rate
Men and Median
Women and Median
# Sample_size and Median
recent_grads.plot(x='Sample_size', y='Median', kind='scatter', title='Sample_size vs. Median', figsize=(5,5))
<matplotlib.axes._subplots.AxesSubplot at 0x2045cf478c8>
# Sample_size and Unemployment_rate
ax2 = recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind='scatter', figsize=(5,5))
ax2.set_title('Sample_size vs. Unemployment_rate')
Text(0.5, 1.0, 'Sample_size vs. Unemployment_rate')
# Full_time and Median
ax3 = recent_grads.plot(x='Full_time', y='Median', kind='scatter', figsize=(5,5))
ax3.set_title('Full_time vs. Median')
Text(0.5, 1.0, 'Full_time vs. Median')
# ShareWomen and Unemployment_rate
ax4 = recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter', figsize=(5,5))
ax4.set_title('ShareWomen vs. Unemployment_rate')
Text(0.5, 1.0, 'ShareWomen vs. Unemployment_rate')
# Men and Median
ax5 = recent_grads.plot(x='Men', y='Median', kind='scatter', figsize=(5,5))
ax5.set_title('Men vs. Median')
Text(0.5, 1.0, 'Men vs. Median')
# Women and Median
ax6 = recent_grads.plot(x='Women', y='Median', kind='scatter', figsize=(5,5))
ax6.set_title('Women vs. Median')
Text(0.5, 1.0, 'Women vs. Median')
The scatter plots show only weak correlations between these pairs of columns, so we cannot draw firm conclusions from them alone.
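One way to put a number on how weak or strong a relationship is, rather than judging it by eye, is the Pearson correlation coefficient. The sketch below uses a made-up `toy` frame (the values are assumptions for illustration, not taken from the dataset) and the pandas methods `DataFrame.corr()` and `Series.corr()`.

```python
import pandas as pd

# Toy data for illustration only; not rows from recent-grads.csv
toy = pd.DataFrame({
    'Sample_size': [10, 20, 30, 40, 50],
    'Median': [60000, 52000, 55000, 41000, 45000],
    'Unemployment_rate': [0.04, 0.06, 0.05, 0.08, 0.07],
})

# Pearson correlation matrix for all numeric columns
corr = toy.corr()
print(corr.round(2))

# Correlation for a single pair of columns
r = toy['Sample_size'].corr(toy['Median'])
print(round(r, 2))
```

Values near 0 indicate a weak linear relationship, while values near -1 or 1 indicate a strong one. Applied to `recent_grads`, the same calls would give a coefficient for each pair plotted above.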
Use the plots to explore the following questions:
# Total vs Median
ax7 = recent_grads.plot(x='Total', y='Median', kind='scatter', figsize=(10,5))
ax7.set_title('Total vs. Median')
Text(0.5, 1.0, 'Total vs. Median')
Median salary tends to decrease as the total number of people with the major increases; the trend is most visible for majors with between 0 and 50,000 people.
# ShareWomen and Median
ax8 = recent_grads.plot(x='ShareWomen', y='Median', kind='scatter', figsize=(5,5))
ax8.set_title('ShareWomen vs. Median')
Text(0.5, 1.0, 'ShareWomen vs. Median')
As the plot shows, there is a negative correlation between the median salary of full-time workers and women as a share of the total.
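A trend like this can also be checked by grouping majors into bands of ShareWomen and comparing the typical salary in each band. This is a rough sketch on made-up values; the band edges and the labels `low`, `middle`, and `high` are hypothetical choices, not part of the dataset.

```python
import pandas as pd

# Toy data for illustration only; not rows from recent-grads.csv
toy = pd.DataFrame({
    'ShareWomen': [0.10, 0.15, 0.40, 0.45, 0.70, 0.90],
    'Median': [70000, 60000, 45000, 40000, 35000, 30000],
})

# Split ShareWomen into three equal-width bands between 0 and 1
# (band edges and labels are illustrative choices)
bands = pd.cut(toy['ShareWomen'], bins=[0, 1/3, 2/3, 1],
               labels=['low', 'middle', 'high'])

# Typical (median) salary within each band
salary_by_band = toy.groupby(bands, observed=False)['Median'].median()
print(salary_by_band)
```

If the typical salary falls from one band to the next, that supports the negative relationship seen in the scatter plot.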
# Full_time and Median
ax9 = recent_grads.plot(x='Full_time', y='Median', kind='scatter', figsize=(5,5))
ax9.set_title('Full_time vs. Median')
Text(0.5, 1.0, 'Full_time vs. Median')
As the number of full-time employees increases, the median salary tends to decrease, which indicates a negative correlation.
We'll generate histograms in separate Jupyter notebook cells to explore the distributions of the following columns:
Sample_size
Median
Employed
Full_time
ShareWomen
Unemployment_rate
Men
Women
# Sample_size
recent_grads['Sample_size'].hist(bins=25, range=(0,5000))
<matplotlib.axes._subplots.AxesSubplot at 0x222572d0608>
# Median
recent_grads['Median'].plot(kind='hist', bins=10)
<matplotlib.axes._subplots.AxesSubplot at 0x20460dee508>
# Employed
recent_grads['Employed'].hist(bins=20, range=(0,100000))
<matplotlib.axes._subplots.AxesSubplot at 0x2046037f288>
# Full_time
recent_grads['Full_time'].hist(bins=20, range=(0,100000))
<matplotlib.axes._subplots.AxesSubplot at 0x20460423408>
# ShareWomen
recent_grads['ShareWomen'].hist(bins=10, range=(0,1))
<matplotlib.axes._subplots.AxesSubplot at 0x20460944488>
# Unemployment_rate
recent_grads['Unemployment_rate'].hist(bins=10, range=(0,0.2))
<matplotlib.axes._subplots.AxesSubplot at 0x20460be6e48>
# Men
recent_grads['Men'].hist(bins=25, range=(0,100000))
<matplotlib.axes._subplots.AxesSubplot at 0x2046212dac8>
# Women
recent_grads['Women'].hist(bins=25, range=(0,100000))
<matplotlib.axes._subplots.AxesSubplot at 0x20460ef8208>
Use the plots to explore the following questions:
70% and 60% respectively
The most common median salary range is from $30,000 to $40,000
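Answers read off a histogram can be cross-checked by counting the values in each bin directly. The sketch below does this on made-up values (assumptions for illustration only), using `pd.cut` with hand-picked $10,000 bin edges; the edges are a hypothetical choice and do not have to match the bins pandas picks for a plot.

```python
import pandas as pd

# Toy data for illustration only; not rows from recent-grads.csv
toy = pd.DataFrame({
    'Median': [22000, 31000, 33000, 36000, 38000, 45000, 52000, 75000],
    'ShareWomen': [0.9, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1],
})

# Count majors per $10,000 salary band (edges chosen for this toy data)
edges = list(range(20000, 90000, 10000))
salary_bands = pd.cut(toy['Median'], bins=edges, right=False)
band_counts = salary_bands.value_counts().sort_index()
print(band_counts)

# The band that contains the most majors
most_common = band_counts.idxmax()
print(most_common)

# Share of majors in which women are the majority
share_mostly_women = (toy['ShareWomen'] > 0.5).mean()
print(share_mostly_women)
```

The same pattern on `recent_grads` gives exact counts per salary band and the exact fraction of majors that are predominantly female or male.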
We'll create scatter matrix plots using the following columns:
Sample_size and Median
Sample_size, Median and Unemployment_rate
# import scatter_matrix from the pandas.plotting module
from pandas.plotting import scatter_matrix
# Sample_size and Median
scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(5,5))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000002045CE5B388>, <matplotlib.axes._subplots.AxesSubplot object at 0x000002045D5E5788>], [<matplotlib.axes._subplots.AxesSubplot object at 0x000002045D617A88>, <matplotlib.axes._subplots.AxesSubplot object at 0x000002045D6502C8>]], dtype=object)
# Sample_size, Median and Unemployment_rate
scatter_matrix(recent_grads[['Sample_size', 'Median', 'Unemployment_rate']], figsize=(10,10))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000002045D746708>, <matplotlib.axes._subplots.AxesSubplot object at 0x000002045D759F88>, <matplotlib.axes._subplots.AxesSubplot object at 0x000002045E75F608>], [<matplotlib.axes._subplots.AxesSubplot object at 0x000002045E79B048>, <matplotlib.axes._subplots.AxesSubplot object at 0x000002045E7CFA48>, <matplotlib.axes._subplots.AxesSubplot object at 0x000002045E80A948>], [<matplotlib.axes._subplots.AxesSubplot object at 0x000002045E8429C8>, <matplotlib.axes._subplots.AxesSubplot object at 0x000002045E87CB08>, <matplotlib.axes._subplots.AxesSubplot object at 0x000002045E886708>]], dtype=object)
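The long `array([[...]])` output above is the return value of `scatter_matrix`: a NumPy array holding one Axes object per cell of the grid. As a small sketch on made-up values (assumptions for illustration only), assigning the result to a variable keeps it out of the cell output and makes the individual subplots available for further tweaks. The `matplotlib.use('Agg')` line is only there so the sketch runs outside a notebook and can be dropped when using `%matplotlib inline`.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

# Toy data for illustration only; not rows from recent-grads.csv
toy = pd.DataFrame({
    'Sample_size': [10, 20, 30, 40, 50],
    'Median': [60000, 52000, 55000, 41000, 45000],
    'Unemployment_rate': [0.04, 0.06, 0.05, 0.08, 0.07],
})

# Assigning the result avoids printing the array of Axes objects
axes = scatter_matrix(toy, figsize=(6, 6))

# One Axes per pair of columns: a 3 x 3 grid for three columns
print(axes.shape)

# Diagonal cells are histograms; off-diagonal cells are scatter plots
axes[0, 0].set_facecolor('whitesmoke')

plt.close('all')
```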
We'll use bar plots to compare:
The percentages of women (ShareWomen) from the first ten rows and last ten rows
The unemployment rate (Unemployment_rate) from the first ten rows and last ten rows
# ShareWomen first ten rows
recent_grads[:10].plot.bar(x='Major', y='ShareWomen')
<matplotlib.axes._subplots.AxesSubplot at 0x2045eabca88>
# ShareWomen last ten rows
recent_grads.tail(10).plot.bar(x='Major', y='ShareWomen')
<matplotlib.axes._subplots.AxesSubplot at 0x2045eb588c8>
# Unemployment_rate first ten rows
recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate')
<matplotlib.axes._subplots.AxesSubplot at 0x2045ebdf908>
# Unemployment_rate last ten rows
recent_grads.tail(10).plot.bar(x='Major', y='Unemployment_rate')
<matplotlib.axes._subplots.AxesSubplot at 0x2045ee43d08>
Use a grouped bar plot to compare the number of men with the number of women in each category of majors.
Use a box plot to explore the distributions of median salaries and unemployment rate.
Use a hexagonal bin plot to visualize the columns that had dense scatter plots from earlier in the project.
# grouped bar plot Major_category
cols = recent_grads[['Major_category', 'Men', 'Women']]
cols = cols.groupby('Major_category').sum()
cols.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x20462bee788>
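The aggregation behind the grouped bar plot can be inspected as a table as well. The sketch below repeats the `groupby(...).sum()` step on made-up values (assumptions for illustration only) and adds a per-category share of women; the `ShareWomen` column computed here is derived from the toy totals and is not the dataset's own per-major column.

```python
import pandas as pd

# Toy data for illustration only; not rows from recent-grads.csv
toy = pd.DataFrame({
    'Major_category': ['Engineering', 'Engineering', 'Education', 'Education'],
    'Men': [2000, 700, 100, 500],
    'Women': [300, 100, 900, 2300],
})

# Sum men and women within each category of majors
totals = toy.groupby('Major_category')[['Men', 'Women']].sum()

# Women as a share of all graduates in the category
totals['ShareWomen'] = totals['Women'] / (totals['Men'] + totals['Women'])
print(totals.sort_values('ShareWomen'))
```

Sorting by the share makes it easy to see which categories are most and least balanced before looking at the bars.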
# box plot median salaries and unemployment rate
fig = plt.figure(figsize=(10,5))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
ax1.boxplot(recent_grads['Median'])
ax1.set_xticklabels(['Median salaries'])
ax2.boxplot(recent_grads['Unemployment_rate'])
ax2.set_xticklabels(['Unemployment rate'])
plt.show()
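A box plot summarizes a column by its quartiles, and with matplotlib's default settings the whiskers extend at most 1.5 times the interquartile range beyond the box. The numbers behind the picture can be computed directly, as in this sketch on made-up salaries (assumptions for illustration only).

```python
import pandas as pd

# Toy data for illustration only; not values from recent-grads.csv
median_salary = pd.Series([22000, 30000, 33000, 36000, 40000, 45000, 110000])

# Box edges: first and third quartiles
q1 = median_salary.quantile(0.25)
q3 = median_salary.quantile(0.75)
iqr = q3 - q1

# Points beyond these fences are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = median_salary[(median_salary < lower_fence) |
                         (median_salary > upper_fence)]

print(q1, q3, iqr)
print(lower_fence, upper_fence)
print(list(outliers))
```

Applying the same calculation to `recent_grads['Median']` and `recent_grads['Unemployment_rate']` identifies exactly which values appear as outlier points in the box plots.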
# hexagonal bin plot
fig = plt.figure(figsize=(5,10))
ax1 = fig.add_subplot(2,1,1)
ax2 = fig.add_subplot(2,1,2)
ax1.hexbin(x=recent_grads['Men'], y=recent_grads['Median'], gridsize=10)
ax1.set_title('Men vs Median')
ax2.hexbin(x=recent_grads['Women'], y=recent_grads['Median'], gridsize=10)
ax2.set_title('Women vs Median')
Text(0.5, 1.0, 'Women vs Median')