In this project, I'll explore using the pandas plotting functionality along with the Jupyter notebook interface to explore and visualize data.
I'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012.
Each row in the dataset represents a different major in college and contains information on gender diversity, employment rates, median salaries, and more. The first row, printed below, shows the dataset's columns.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
recent_grads = pd.read_csv("recent-grads.csv")
# Print the first row to see the dataset's columns.
print (recent_grads.iloc[0])
Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object
# Preview the first and last rows to become familiar with how the data is structured
print (recent_grads.head())
print (recent_grads.tail())
First five rows (selected columns shown; the full output has 21 columns):

Rank | Major | Total | ShareWomen | Unemployment_rate | Median
---|---|---|---|---|---
1 | PETROLEUM ENGINEERING | 2339.0 | 0.120564 | 0.018381 | 110000
2 | MINING AND MINERAL ENGINEERING | 756.0 | 0.101852 | 0.117241 | 75000
3 | METALLURGICAL ENGINEERING | 856.0 | 0.153037 | 0.024096 | 73000
4 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 0.107313 | 0.050125 | 70000
5 | CHEMICAL ENGINEERING | 32260.0 | 0.341631 | 0.061098 | 65000

Last five rows (selected columns shown):

Rank | Major | Total | ShareWomen | Unemployment_rate | Median
---|---|---|---|---|---
169 | ZOOLOGY | 8409.0 | 0.637293 | 0.046320 | 26000
170 | EDUCATIONAL PSYCHOLOGY | 2854.0 | 0.817099 | 0.065112 | 25000
171 | CLINICAL PSYCHOLOGY | 2838.0 | 0.799859 | 0.149048 | 25000
172 | COUNSELING PSYCHOLOGY | 4626.0 | 0.798746 | 0.053621 | 23400
173 | LIBRARY SCIENCE | 1098.0 | 0.877960 | 0.104946 | 22000

[5 rows x 21 columns each; output truncated]
# Generate summary statistics for all of the numeric columns
recent_grads.describe()
| Rank | Major_code | Total | Men | Women | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 173.000000 | 173.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
mean | 87.000000 | 3879.815029 | 39370.081395 | 16723.406977 | 22646.674419 | 0.522223 | 356.080925 | 31192.763006 | 26029.306358 | 8832.398844 | 19694.427746 | 2416.329480 | 0.068191 | 40151.445087 | 29501.445087 | 51494.219653 | 12322.635838 | 13284.497110 | 3859.017341 |
std | 50.084928 | 1687.753140 | 63483.491009 | 28122.433474 | 41057.330740 | 0.231205 | 618.361022 | 50675.002241 | 42869.655092 | 14648.179473 | 33160.941514 | 4112.803148 | 0.030331 | 11470.181802 | 9166.005235 | 14906.279740 | 21299.868863 | 23789.655363 | 6944.998579 |
min | 1.000000 | 1100.000000 | 124.000000 | 119.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 111.000000 | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.000000 | 2403.000000 | 4549.750000 | 2177.500000 | 1778.250000 | 0.336026 | 39.000000 | 3608.000000 | 3154.000000 | 1030.000000 | 2453.000000 | 304.000000 | 0.050306 | 33000.000000 | 24000.000000 | 42000.000000 | 1675.000000 | 1591.000000 | 340.000000 |
50% | 87.000000 | 3608.000000 | 15104.000000 | 5434.000000 | 8386.500000 | 0.534024 | 130.000000 | 11797.000000 | 10048.000000 | 3299.000000 | 7413.000000 | 893.000000 | 0.067961 | 36000.000000 | 27000.000000 | 47000.000000 | 4390.000000 | 4595.000000 | 1231.000000 |
75% | 130.000000 | 5503.000000 | 38909.750000 | 14631.000000 | 22553.750000 | 0.703299 | 338.000000 | 31433.000000 | 25147.000000 | 9948.000000 | 16891.000000 | 2393.000000 | 0.087557 | 45000.000000 | 33000.000000 | 60000.000000 | 14444.000000 | 11783.000000 | 3466.000000 |
max | 173.000000 | 6403.000000 | 393735.000000 | 173809.000000 | 307087.000000 | 0.968954 | 4212.000000 | 307933.000000 | 251540.000000 | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
# Shape before dropping NaN values
print(recent_grads.shape)
(173, 21)
# Investigating Missing Values
print(recent_grads.isnull().sum())
Rank                    0
Major_code              0
Major                   0
Total                   1
Men                     1
Women                   1
Major_category          0
ShareWomen              1
Sample_size             0
Employed                0
Full_time               0
Part_time               0
Full_time_year_round    0
Unemployed              0
Unemployment_rate       0
Median                  0
P25th                   0
P75th                   0
College_jobs            0
Non_college_jobs        0
Low_wage_jobs           0
dtype: int64
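Before dropping anything, it can help to see which rows actually contain the missing values. A minimal sketch, using a small hypothetical frame standing in for recent_grads (the column names match the dataset, the rows are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical mini-frame standing in for recent_grads
df = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "FOOD SCIENCE", "LIBRARY SCIENCE"],
    "Total": [2339.0, np.nan, 1098.0],
    "Median": [110000, 53000, 22000],
})

# Boolean mask: keep only rows with at least one NaN
missing_rows = df[df.isnull().any(axis=1)]
print(missing_rows["Major"].tolist())  # → ['FOOD SCIENCE']
```

Running the same filter on the real DataFrame would show exactly which major carries the NaN values before it is dropped.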
# dropna(inplace=True) modifies the DataFrame in place and returns None,
# so the result must not be reassigned.
recent_grads.dropna(inplace=True)
print(recent_grads.shape)
(172, 21)
# Investigating missing values after dropping NaN rows
print(recent_grads.isnull().sum())
Rank                    0
Major_code              0
Major                   0
Total                   0
Men                     0
Women                   0
Major_category          0
ShareWomen              0
Sample_size             0
Employed                0
Full_time               0
Part_time               0
Full_time_year_round    0
Unemployed              0
Unemployment_rate       0
Median                  0
P25th                   0
P75th                   0
College_jobs            0
Non_college_jobs        0
Low_wage_jobs           0
dtype: int64
Comparing the raw data and the cleaned data, the number of rows dropped from 173 to 172. This means one row was removed for having missing values.
recent_grads.plot(x="Sample_size", y="Median", kind="scatter", title="Sample_size vs. Median")
recent_grads.plot(x="Sample_size", y="Unemployment_rate", kind="scatter", title="Sample_size vs. Unemployment_rate")
recent_grads.plot(x="Full_time", y="Median", kind="scatter", title="Full_time vs. Median")
recent_grads.plot(x="ShareWomen", y="Unemployment_rate", kind="scatter", title="ShareWomen vs. Unemployment_rate")
recent_grads.plot(x="Men", y="Median", kind="scatter", title="Men vs. Median")
recent_grads.plot(x="Women", y="Median", kind="scatter", title="Women vs. Median")
There seem to be no significant relationships between the variables in these scatter plots. They can be explored further using histograms instead.
The y-axis shows the frequency of the data, and the x-axis refers to the column name specified in the code.
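Under the hood, hist() bins the values and counts how many fall into each bin; the same counts can be inspected directly with numpy. A sketch with made-up median salaries (a hypothetical stand-in for recent_grads["Median"]):

```python
import numpy as np

# Hypothetical median salaries (stand-in for recent_grads["Median"])
medians = np.array([22000, 25000, 33000, 35000, 36000, 38000, 45000, 60000, 110000])

# The same binning hist() would perform with bins=4 over the data range
counts, edges = np.histogram(medians, bins=4)
print(counts)  # → [6 2 0 1]: how many majors fall in each salary bin
print(edges)   # → [ 22000.  44000.  66000.  88000. 110000.]: bin boundaries
```

Checking the raw counts like this is a quick way to confirm what a histogram appears to show.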
recent_grads["Sample_size"].hist()
recent_grads["Median"].hist()
The most common median salary range is $30,000-40,000.
recent_grads["Employed"].hist()
recent_grads["Full_time"].hist()
recent_grads["ShareWomen"].hist()
recent_grads["Unemployment_rate"].hist()
The most common unemployment rate is between 5.5% and 7%.
recent_grads["Men"].hist()
recent_grads["Women"].hist()
To explore the data further, scatter plots and histograms are combined into one grid of plots, so that potential relationships and distributions can be examined simultaneously. This is achieved using a scatter matrix plot.
from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[["Sample_size", "Median"]], figsize=(10,10))
scatter_matrix(recent_grads[["Sample_size", "Median","Unemployment_rate"]], figsize=(10,10))
Looking closely at the scatter matrix plot, it is difficult to identify any correlation between any pair of these columns. Looking at the histograms on the diagonal, the distributions of Sample_size and Median are skewed, whereas the distribution of Unemployment_rate is more symmetric and more spread out.
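The visual impression of skew can be checked numerically: pandas' .skew() returns a clearly positive value for right-skewed data and roughly zero for symmetric data. A minimal sketch with hypothetical values (not the dataset's columns):

```python
import pandas as pd

# Right-skewed stand-in: a few large values pull the tail to the right
skewed = pd.Series([1, 1, 2, 2, 3, 3, 4, 20, 50])
# Perfectly symmetric stand-in
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

print(skewed.skew())     # clearly positive
print(symmetric.skew())  # approximately 0
```

Applying .skew() to Sample_size, Median, and Unemployment_rate in the real DataFrame would quantify what the histograms suggest.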
Plotting a bar chart only requires the values to be represented by the bars and the labels for each bar.
recent_grads[:10].plot.bar(x="Major", y="ShareWomen", legend=False)
recent_grads[-10:].plot.bar(x="Major", y="ShareWomen", legend=False)
Women tend to shy away from technical majors such as Engineering. As the charts show, Engineering majors have the lowest share of women, while Early Childhood Education has the highest.
recent_grads[:10].plot.bar(x="Major", y="Unemployment_rate", legend=False)
recent_grads[-10:].plot.bar(x="Major", y="Unemployment_rate", legend=False)
Engineering majors tend to enjoy low unemployment rates, except for Nuclear Engineering and Mining and Mineral Engineering. This may be due to the highly technical nature of, and limited openings in, those two fields.
A grouped bar plot will be used to compare the number of men with the number of women in each category of majors.
# Men and Women totals are aggregated by Major_category into dictionaries.
men_sum_dict = {}
women_sum_dict = {}
for c in recent_grads["Major_category"].unique():
men_cat = recent_grads.loc[recent_grads["Major_category"] == c, "Men"].sum()
men_sum_dict[c] = men_cat
women_cat = recent_grads.loc[recent_grads["Major_category"] == c, "Women"].sum()
women_sum_dict[c] = women_cat
# Conversion of men_sum_dict and women_sum_dict to Series
men_sum_series = pd.Series(men_sum_dict)
women_sum_series = pd.Series(women_sum_dict)
# Conversion of men_sum_series and women_sum_series to a DataFrame
men_women_df = pd.DataFrame(men_sum_series, columns=["Men Total"])
men_women_df["Women Total"] = women_sum_series
men_women_df
| Men Total | Women Total |
---|---|---|
Engineering | 408307.0 | 129276.0 |
Business | 667852.0 | 634524.0 |
Physical Sciences | 95390.0 | 90089.0 |
Law & Public Policy | 91129.0 | 87978.0 |
Computers & Mathematics | 208725.0 | 90283.0 |
Industrial Arts & Consumer Services | 103781.0 | 126011.0 |
Arts | 134390.0 | 222740.0 |
Health | 75517.0 | 387713.0 |
Social Science | 256834.0 | 273132.0 |
Biology & Life Science | 184919.0 | 268943.0 |
Education | 103526.0 | 455603.0 |
Agriculture & Natural Resources | 40357.0 | 35263.0 |
Humanities & Liberal Arts | 272846.0 | 440622.0 |
Psychology & Social Work | 98115.0 | 382892.0 |
Communications & Journalism | 131921.0 | 260680.0 |
Interdisciplinary | 2817.0 | 9479.0 |
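The manual loop above works, but pandas can perform the same aggregation in one line with groupby. A sketch on a small hypothetical frame standing in for recent_grads (the values are illustrative):

```python
import pandas as pd

# Hypothetical rows standing in for recent_grads
df = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Education"],
    "Men": [2057.0, 679.0, 134.0],
    "Women": [282.0, 77.0, 964.0],
})

# Sum Men and Women within each category, mirroring the loop above
totals = df.groupby("Major_category")[["Men", "Women"]].sum()
print(totals)
```

On the real DataFrame, recent_grads.groupby("Major_category")[["Men", "Women"]].sum() would reproduce the table above directly.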
men_women_df.plot.bar(figsize=(10,5))
There is a significant difference in the number of men and women in the Engineering and Computers & Mathematics majors, where men have the highest enrolment. However, Arts, Health, Biology & Life Science, Education, Humanities & Liberal Arts, Psychology & Social Work, and Communications & Journalism have significantly more women than men.
Boxplots can show us the range and positions of the quartiles for columns in the dataset. A box plot is used here to explore the Median and Unemployment_rate columns a little more.
recent_grads.loc[:, ["Median", "Unemployment_rate"]].plot(kind='box', subplots=True, figsize=(10, 10))
There are five outliers in the median salary column of the data, with four being moderate outliers and one being an extreme outlier.
The Unemployment rate is more symmetrically distributed about the median of around 7%. There are four outliers with high unemployment rates of approximately 13.5-19%.
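The "moderate" versus "extreme" distinction follows the usual box plot convention: points beyond 1.5×IQR from the quartiles are moderate outliers, and points beyond 3×IQR are extreme. A sketch computing the upper fences for a hypothetical salary series (only the upper fences are shown, since the salary outliers here are on the high side):

```python
import pandas as pd

# Hypothetical median salaries (stand-in for recent_grads["Median"])
salaries = pd.Series([22000, 30000, 33000, 36000, 40000, 45000, 75000, 110000])

q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
iqr = q3 - q1

inner_fence = q3 + 1.5 * iqr  # beyond this: moderate outlier
outer_fence = q3 + 3.0 * iqr  # beyond this: extreme outlier

moderate = salaries[(salaries > inner_fence) & (salaries <= outer_fence)]
extreme = salaries[salaries > outer_fence]
print(inner_fence, outer_fence)
```

Applying the same fences to the real Median column would identify which majors the box plot flags as moderate and extreme outliers.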
To explore the data a little more, hexagonal bin plots are used to examine relationships between pairs of columns.
Here, relationships between the following pairs are visualised:
recent_grads.plot.hexbin("ShareWomen", "Unemployment_rate", figsize=(8, 8), gridsize=20, colorbar=False)
recent_grads.plot.hexbin("Women", "Median", figsize=(8, 8), gridsize=20, colorbar=False)
recent_grads.plot.hexbin("Total", "Median", figsize=(8, 8), gridsize=20, colorbar=False)
Exploring this data on American college graduates was insightful, and the visualisations give eye-catching detail for quick understanding. For clarity, various forms of charts were used. Specifically, the Python concepts explored include pandas, matplotlib, histograms, bar charts, scatter plots, scatter matrices, box plots, and hexagonal bin plots.