Introduction
In this project, we use the pandas plotting functionality together with the Jupyter notebook interface to explore data quickly through visualizations.
The dataset covers the job outcomes of students who graduated from college between 2010 and 2012. We use visualizations to explore and understand questions such as: Do students in more popular majors make more money? How many majors are predominantly male? Predominantly female? Which category of majors has the most students?
# Importing the libraries needed as well as reading the dataset
# into a dataframe.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
recent_grads = pd.read_csv('recent-grads.csv')
print(recent_grads.iloc[0])
print(recent_grads.head())
print(recent_grads.tail())
recent_grads.describe()
# Dropping rows with missing values.
raw_data_count = recent_grads.shape[0]
recent_grads = recent_grads.dropna()
cleaned_data_count = recent_grads.shape[0]
print(raw_data_count)
print(cleaned_data_count)
Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object

   Rank  Major_code                                      Major    Total  \
0     1        2419                      PETROLEUM ENGINEERING   2339.0
1     2        2416             MINING AND MINERAL ENGINEERING    756.0
2     3        2415                  METALLURGICAL ENGINEERING    856.0
3     4        2417  NAVAL ARCHITECTURE AND MARINE ENGINEERING   1258.0
4     5        2405                       CHEMICAL ENGINEERING  32260.0

       Men    Women Major_category  ShareWomen  Sample_size  Employed  \
0   2057.0    282.0    Engineering    0.120564           36      1976
1    679.0     77.0    Engineering    0.101852            7       640
2    725.0    131.0    Engineering    0.153037            3       648
3   1123.0    135.0    Engineering    0.107313           16       758
4  21239.0  11021.0    Engineering    0.341631          289     25694

   ...  Part_time  Full_time_year_round  Unemployed  \
0  ...        270                  1207          37
1  ...        170                   388          85
2  ...        133                   340          16
3  ...        150                   692          40
4  ...       5180                 16697        1672

   Unemployment_rate  Median  P25th   P75th  College_jobs  Non_college_jobs  \
0           0.018381  110000  95000  125000          1534               364
1           0.117241   75000  55000   90000           350               257
2           0.024096   73000  50000  105000           456               176
3           0.050125   70000  43000   80000           529               102
4           0.061098   65000  50000   75000         18314              4440

   Low_wage_jobs
0            193
1             50
2              0
3              0
4            972

[5 rows x 21 columns]

     Rank  Major_code                   Major   Total     Men   Women  \
168   169        3609                 ZOOLOGY  8409.0  3050.0  5359.0
169   170        5201  EDUCATIONAL PSYCHOLOGY  2854.0   522.0  2332.0
170   171        5202     CLINICAL PSYCHOLOGY  2838.0   568.0  2270.0
171   172        5203   COUNSELING PSYCHOLOGY  4626.0   931.0  3695.0
172   173        3501         LIBRARY SCIENCE  1098.0   134.0   964.0

               Major_category  ShareWomen  Sample_size  Employed  \
168    Biology & Life Science    0.637293           47      6259
169  Psychology & Social Work    0.817099            7      2125
170  Psychology & Social Work    0.799859           13      2101
171  Psychology & Social Work    0.798746           21      3777
172                 Education    0.877960            2       742

     ...  Part_time  Full_time_year_round  Unemployed  \
168  ...       2190                  3602         304
169  ...        572                  1211         148
170  ...        648                  1293         368
171  ...        965                  2738         214
172  ...        237                   410          87

     Unemployment_rate  Median  P25th  P75th  College_jobs  Non_college_jobs  \
168           0.046320   26000  20000  39000          2771              2947
169           0.065112   25000  24000  34000          1488               615
170           0.149048   25000  25000  40000           986               870
171           0.053621   23400  19200  26000          2403              1245
172           0.104946   22000  20000  22000           288               338

     Low_wage_jobs
168            743
169             82
170            622
171            308
172            192

[5 rows x 21 columns]
173
172
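The row count falls from 173 to 172, so one row contained a missing value. A minimal sketch of how such rows could be inspected before dropping them is shown here; the `sample` frame and its values are made up for illustration and are not taken from recent-grads.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for recent_grads; the values are invented.
sample = pd.DataFrame({
    'Major': ['A', 'B', 'C'],
    'Total': [100.0, np.nan, 300.0],
    'Median': [50000, 40000, 30000],
})

# Rows that contain at least one missing value.
rows_with_missing = sample[sample.isnull().any(axis=1)]

# Number of missing values in each column.
missing_per_column = sample.isnull().sum()

# Drop the incomplete rows, as done above for recent_grads.
cleaned = sample.dropna()

print(rows_with_missing)
print(missing_per_column)
print(len(sample), len(cleaned))
```

The same calls should work on `recent_grads` if run before the `dropna()` step.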
**Generating Scatter Plots**
# Scatter plot exploring the relationship
# between Sample_size and Median.
recent_grads.plot(x='Sample_size', y='Median', kind='scatter')
# Scatter plot exploring the relationship
# between Sample_size and Unemployment_rate.
recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind='scatter')
# Scatter plot exploring the relationship
# between Full_time and Median.
recent_grads.plot(x='Full_time', y='Median', kind='scatter')
# Scatter plot exploring the relationship
# between ShareWomen and Unemployment_rate.
recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter')
# Scatter plot exploring the relationship
# between Men and Median.
recent_grads.plot(x='Men', y='Median', kind='scatter')
# Scatter plot exploring the relationship
# between Women and Median.
recent_grads.plot(x='Women', y='Median', kind='scatter')
# Scatter plot exploring the relationship
# between Total and Median.
recent_grads.plot(x='Total', y='Median', kind='scatter')
Based on the Full_time versus Median plot, there is no clear link between the number of full-time employees and median salary. The Total versus Median plot likewise suggests that students in more popular majors are not guaranteed higher-paying jobs.
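A correlation coefficient can complement a visual reading of a scatter plot. The sketch below uses a small made-up frame (the `sample` values are hypothetical, not drawn from the dataset) to show how the Pearson correlation between two columns could be computed with pandas.

```python
import pandas as pd

# Hypothetical values for illustration only.
sample = pd.DataFrame({
    'Total': [1000, 5000, 20000, 80000, 150000],
    'Median': [60000, 35000, 45000, 40000, 38000],
})

# Pearson correlation (the pandas default) between two columns.
corr = sample['Total'].corr(sample['Median'])

# Pairwise correlations for every numeric column at once.
corr_matrix = sample.corr()

print(corr)
print(corr_matrix)
```

A coefficient near zero would be consistent with the "no clear link" reading, although it only captures linear relationships.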
Generating Histograms
# Generating histogram to explore the distribution of Sample_size
recent_grads['Sample_size'].hist(bins=25, range=(0,5000))
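# Generating histogram to explore the distribution of Median.
# Median salaries are far above 5000 (22000 to 110000 in the rows
# printed earlier), so the range is left to be inferred from the data.
recent_grads['Median'].hist(bins=25)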
# Generating histogram to explore the distribution of Employed
recent_grads['Employed'].hist(bins=25, range=(0,5000))
# Generating histogram to explore the distribution of Full_time
recent_grads['Full_time'].hist(bins=25, range=(0,5000))
# Generating histogram to explore the distribution of ShareWomen.
# ShareWomen is a proportion, so the range is 0 to 1.
recent_grads['ShareWomen'].hist(bins=25, range=(0, 1))
# Generating histogram to explore the distribution of
# Unemployment_rate. The rate is a proportion below 1, so the
# range is left to be inferred from the data.
recent_grads['Unemployment_rate'].hist(bins=25)
# Generating histogram to explore the distribution of Men.
recent_grads['Men'].hist(bins=25, range=(0,5000))
# Generating histogram to explore the distribution of Women.
recent_grads['Women'].hist(bins=25, range=(0,5000))
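The ShareWomen histogram speaks to the question of how many majors are predominantly male or predominantly female. A direct count can sit alongside the plot. The sketch below assumes "predominantly" means a share above or below 0.5 and uses a made-up `sample` frame rather than the real data.

```python
import pandas as pd

# Hypothetical majors and shares, invented for illustration.
sample = pd.DataFrame({
    'Major': ['A', 'B', 'C', 'D'],
    'ShareWomen': [0.12, 0.34, 0.64, 0.88],
})

# Assumed threshold: a major counts as predominantly female
# when women make up more than half of its graduates.
mostly_female = (sample['ShareWomen'] > 0.5).sum()
mostly_male = (sample['ShareWomen'] < 0.5).sum()

print(mostly_male, mostly_female)
```

Applied to `recent_grads['ShareWomen']`, the same two comparisons would give the counts for the full dataset.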
**Generating Scatter Matrix Plots**
# Importing scatter_matrix
from pandas.plotting import scatter_matrix
# Create a 2 by 2 scatter matrix plot
# using the Sample_size and Median columns.
scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(10,10))
# Create a 3 by 3 scatter matrix plot
# using the Sample_size, Median, and Unemployment_rate columns.
scatter_matrix(recent_grads[['Sample_size','Median','Unemployment_rate']], figsize=(10,10))
**Generating Bar Plots**
# Use bar plots to compare the percentages of women (ShareWomen)
# from the first ten rows and last ten rows of the
# recent_grads dataframe.
recent_grads.head(10).plot.bar(x='Major', y='ShareWomen', legend = False)
recent_grads.tail(10).plot.bar(x='Major', y='ShareWomen', legend = False)
# Use bar plots to compare the unemployment rate
# (Unemployment_rate) from the first ten rows and last ten rows of
# the recent_grads dataframe.
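recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate', legend=False)
# After dropna() only 172 rows remain, so [163:] would select nine rows;
# [-10:] always selects the last ten.
recent_grads[-10:].plot.bar(x='Major', y='Unemployment_rate', legend=False)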
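The introduction also asks which category of majors has the most students. One way this might be approached is to sum Total within each Major_category and sort the result. The `sample` frame below is invented for illustration and does not reflect the real figures.

```python
import pandas as pd

# Hypothetical categories and totals, invented for illustration.
sample = pd.DataFrame({
    'Major_category': ['Engineering', 'Engineering', 'Education', 'Business'],
    'Total': [2000.0, 30000.0, 1000.0, 50000.0],
})

# Sum graduates per category, largest first.
totals = (
    sample.groupby('Major_category')['Total']
    .sum()
    .sort_values(ascending=False)
)

print(totals)
```

In a notebook, calling `totals.plot.bar()` on the result would draw the comparison as a bar plot, in keeping with the rest of this section.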
**Conclusion**
In this project, we used the plotting tools built into pandas to explore data on job outcomes.