Matplotlib and the plotting functionality in the pandas library are particularly useful for descriptive analysis, which establishes a basic understanding of the underlying data. In this project, I will apply them to explore the job outcomes of students who graduated from college between 2010 and 2012, using data released by the American Community Survey. A cleaned and aggregated subset of the data can be found at https://github.com/fivethirtyeight/data/tree/master/college-majors.
Throughout the analysis, I will pose relevant questions, visualise the data with suitable descriptive tools, and then infer the answers from the plots.
The dataset includes the following columns:
Rank - Rank by median earnings (the dataset is ordered by this column).
Major_code - Major code.
Major - Major description.
Major_category - Category of major.
Total - Total number of people with major.
Sample_size - Sample size (unweighted) of full-time.
Men - Male graduates.
Women - Female graduates.
ShareWomen - Women as share of total.
Employed - Number employed.
Median - Median salary of full-time, year-round workers.
Low_wage_jobs - Number in low-wage service jobs.
Full_time - Number employed 35 hours or more.
Part_time - Number employed less than 35 hours.

As the first step, I import the required libraries and set up the tools needed for our work. Then I try to understand the structure of the dataset by printing a few rows and calling the describe() function on it.
# Importing relevant libraries
import matplotlib.pyplot as plt
import pandas as pd
# Running Jupyter magic to display plots inline
%matplotlib inline
# Reading the csv file into a pandas dataframe object class
recent_grads = pd.read_csv('recent-grads.csv')
# Displaying the column name and values first row of the dataframe
recent_grads.iloc[0]
Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object
print(recent_grads.describe())
recent_grads[:5]
             rank   major_code          total            men          women
count  172.000000   172.000000     172.000000     172.000000     172.000000
mean    87.377907  3895.953488   39370.081395   16723.406977   22646.674419
std     49.983181  1679.240095   63483.491009   28122.433474   41057.330740
min      1.000000  1100.000000     124.000000     119.000000       0.000000
25%     44.750000  2403.750000    4549.750000    2177.500000    1778.250000
50%     87.500000  3608.500000   15104.000000    5434.000000    8386.500000
75%    130.250000  5503.250000   38909.750000   14631.000000   22553.750000
max    173.000000  6403.000000  393735.000000  173809.000000  307087.000000

       sharewomen  sample_size      employed      full_time      part_time
count  172.000000   172.000000     172.00000     172.000000     172.000000
mean     0.522223   357.941860   31355.80814   26165.767442    8877.232558
std      0.231205   619.680419   50777.42865   42957.122320   14679.038729
min      0.000000     2.000000       0.00000     111.000000       0.000000
25%      0.336026    42.000000    3734.75000    3181.000000    1013.750000
50%      0.534024   131.000000   12031.50000   10073.500000    3332.500000
75%      0.703299   339.000000   31701.25000   25447.250000    9981.000000
max      0.968954  4212.000000  307933.00000  251540.000000  115172.000000

       full_time_year_round    unemployed  unemployment_rate         median
count            172.000000    172.000000         172.000000     172.000000
mean           19798.843023   2428.412791           0.068024   40076.744186
std            33229.227514   4121.730452           0.030340   11461.388773
min              111.000000      0.000000           0.000000   22000.000000
25%             2474.750000    299.500000           0.050261   33000.000000
50%             7436.500000    905.000000           0.067544   36000.000000
75%            17674.750000   2397.000000           0.087247   45000.000000
max           199897.000000  28169.000000           0.177226  110000.000000

              p25th          p75th   college_jobs  non_college_jobs
count    172.000000     172.000000     172.000000        172.000000
mean   29486.918605   51386.627907   12387.401163      13354.325581
std     9190.769927   14882.278650   21344.967522      23841.326605
min    18500.000000   22000.000000       0.000000          0.000000
25%    24000.000000   41750.000000    1744.750000       1594.000000
50%    27000.000000   47000.000000    4467.500000       4603.500000
75%    33250.000000   58500.000000   14595.750000      11791.750000
max    95000.000000  125000.000000  151643.000000     148395.000000

       low_wage_jobs  share_full_time
count     172.000000       172.000000
mean     3878.633721         0.666427
std      6960.467621         0.102083
min         0.000000         0.372872
25%       336.750000         0.597190
50%      1238.500000         0.673859
75%      3496.000000         0.734996
max     48207.000000         0.958949
| | rank | major_code | major | total | men | women | major_category | sharewomen | sample_size | employed | ... | full_time_year_round | unemployed | unemployment_rate | median | p25th | p75th | college_jobs | non_college_jobs | low_wage_jobs | share_full_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 | 0.790509 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 | 0.735450 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 | 0.651869 |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 | 0.849762 |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 | 0.718227 |
5 rows × 22 columns
# Dropping rows with null values from our data set
print(len(recent_grads))
recent_grads = recent_grads.dropna()
print(len(recent_grads))
173 172
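Before dropping rows, it can help to see which rows contain nulls and in which columns. The following is a minimal sketch on a toy frame; the frame `toy` and its values are made up for illustration and are not taken from recent-grads.csv.

```python
import numpy as np
import pandas as pd

# Toy stand-in for recent_grads; the values are illustrative only.
toy = pd.DataFrame({
    "major": ["A", "B", "C"],
    "total": [100.0, np.nan, 300.0],
    "men": [60.0, np.nan, 120.0],
})

# Rows with at least one missing value: these are the rows dropna() removes.
rows_with_nulls = toy[toy.isna().any(axis=1)]
print(rows_with_nulls)

# Number of missing values in each column.
null_counts = toy.isna().sum()
print(null_counts)

# Same before/after length check as in the cell above.
cleaned = toy.dropna()
print(len(toy), len(cleaned))
```

On the real data, the same two inspections could be run on `recent_grads` before the `dropna()` call to see which major accounts for the single dropped row.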
# Converting column names to lower case (because I don't like upper-case in my code)
recent_grads.columns = recent_grads.columns.str.lower()
recent_grads.columns
Index(['rank', 'major_code', 'major', 'total', 'men', 'women', 'major_category', 'sharewomen', 'sample_size', 'employed', 'full_time', 'part_time', 'full_time_year_round', 'unemployed', 'unemployment_rate', 'median', 'p25th', 'p75th', 'college_jobs', 'non_college_jobs', 'low_wage_jobs'], dtype='object')
Having no leads initially, I will draw exploratory scatterplots for pairs of variables that I believe should be correlated. I will start with:
sample_size and median
sample_size and unemployment_rate
full_time and median
sharewomen and unemployment_rate
men and median
women and median
recent_grads.plot('sample_size','median', kind = 'scatter') #--> 1.1
<matplotlib.axes._subplots.AxesSubplot at 0xf1e7f0a1c8>
recent_grads.plot('sample_size','unemployment_rate', kind = 'scatter')# --> 1.2
<matplotlib.axes._subplots.AxesSubplot at 0xf1e86673c8>
recent_grads.plot('full_time','median', kind = 'scatter') # --> 1.3
<matplotlib.axes._subplots.AxesSubplot at 0xf1e86da048>
recent_grads.plot('sharewomen','unemployment_rate', kind = 'scatter') # --> 1.4
<matplotlib.axes._subplots.AxesSubplot at 0xf1e87443c8>
recent_grads.plot('men','median', kind = 'scatter') # --> 1.5
<matplotlib.axes._subplots.AxesSubplot at 0xf1e87b3c88>
recent_grads.plot('women','median', kind = 'scatter') # --> 1.6
<matplotlib.axes._subplots.AxesSubplot at 0xf1e87fba48>
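The separate plotting calls above could also be drawn on one grid, which makes the panels easier to compare. This is only a sketch: it uses the five rows from the preview table earlier (so the `full_time` pair is left out, because that column is elided in the preview), and the names `sample` and `pairs` are my own.

```python
import matplotlib.pyplot as plt
import pandas as pd

# First five rows of the dataset, copied from the preview table above.
sample = pd.DataFrame({
    "sample_size": [36, 7, 3, 16, 289],
    "unemployment_rate": [0.018381, 0.117241, 0.024096, 0.050125, 0.061098],
    "sharewomen": [0.120564, 0.101852, 0.153037, 0.107313, 0.341631],
    "men": [2057, 679, 725, 1123, 21239],
    "women": [282, 77, 131, 135, 11021],
    "median": [110000, 75000, 73000, 70000, 65000],
})

# (x, y) column pairs to plot, one panel per pair.
pairs = [
    ("sample_size", "median"),
    ("sample_size", "unemployment_rate"),
    ("sharewomen", "unemployment_rate"),
    ("men", "median"),
    ("women", "median"),
]

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
panels = axes.ravel()
for ax, (x, y) in zip(panels, pairs):
    sample.plot(x=x, y=y, kind="scatter", ax=ax, title=f"{x} vs {y}")

# Hide any panel that did not receive a plot.
for ax in panels[len(pairs):]:
    ax.set_visible(False)
fig.tight_layout()
```

With the full dataframe, `sample` would simply be replaced by `recent_grads` and the `full_time` pair added back to `pairs`.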
Does a major with more students correlate with a higher median salary?
No. As per the scatterplot, there is a slight negative correlation between the number of students enrolled in a major and its median salary. Also, some of the highest median salaries belong to majors with a medium number of students.
recent_grads.plot('median','total',kind = 'scatter', figsize = (7,5))
<matplotlib.axes._subplots.AxesSubplot at 0xf1e8871d08>
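A correlation coefficient can back up what the eye reads off a scatterplot. The sketch below computes Pearson's r with `Series.corr`, plus a rank-based (Spearman-style) coefficient obtained by correlating the ranks. It uses only the five rows from the preview table, so the numbers are illustrative and say nothing reliable about the full dataset.

```python
import pandas as pd

# total and median for the first five rows, copied from the preview table.
sample = pd.DataFrame({
    "total": [2339, 756, 856, 1258, 32260],
    "median": [110000, 75000, 73000, 70000, 65000],
})

# Pearson correlation (the default method of Series.corr).
r = sample["total"].corr(sample["median"])

# Rank correlation: Pearson correlation of the ranks, which is less
# sensitive to a single very large major dominating the result.
rho = sample["total"].rank().corr(sample["median"].rank())

print(f"Pearson r = {r:.3f}, rank correlation = {rho:.3f}")
```

On the real data the same two lines would take `recent_grads['total']` and `recent_grads['median']`.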
Do majors with a higher percentage of full_time employed students have a greater median salary?
Yes, majors with a higher percentage of full_time employed students seem to have higher median salaries overall.
recent_grads['share_full_time'] = recent_grads['full_time']/recent_grads['total']
recent_grads.plot('share_full_time','median',kind = 'scatter')
<matplotlib.axes._subplots.AxesSubplot at 0xf1e89375c8>
Do majors with a higher share of women have a higher median salary overall?
No, majors with a higher share of women tend to have a lower median salary.
recent_grads.plot('sharewomen','median',kind = 'scatter')
<matplotlib.axes._subplots.AxesSubplot at 0xf1e9156348>
Do majors mostly consist of males or females?
Females, but by a small margin! The histogram below shows visibly higher frequencies of female-majority majors in the 0.5 to 1.0 range of sharewomen.
recent_grads['sharewomen'].hist(bins = 20)
<matplotlib.axes._subplots.AxesSubplot at 0xf1ea18dc08>
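The histogram can be cross-checked by counting majors on each side of the 0.5 mark (the describe() output earlier already hints at the answer, with a median sharewomen of 0.534024). The helper `majority_counts` below is a hypothetical name of my own, and the input is just the five preview rows, which are the highest-paid majors and happen to be all male-majority.

```python
import pandas as pd


def majority_counts(share_women: pd.Series) -> pd.Series:
    """Count majors by which gender forms the majority.

    A share of exactly 0.5 is counted as male-majority in this sketch.
    """
    labels = share_women.gt(0.5).map(
        {True: "female-majority", False: "male-majority"}
    )
    return labels.value_counts()


# sharewomen for the first five rows, copied from the preview table.
preview = pd.Series(
    [0.120564, 0.101852, 0.153037, 0.107313, 0.341631], name="sharewomen"
)
counts = majority_counts(preview)
print(counts)
```

Calling `majority_counts(recent_grads['sharewomen'])` would give the split for the whole dataset.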
The 30,000 to 40,000 range is the most common median salary range among the majors, as per the histogram below.
recent_grads['median'].hist(bins = 10, range = (0,100000))
<matplotlib.axes._subplots.AxesSubplot at 0xf1ea1c4608>
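To see exactly how `bins = 10` and `range = (0,100000)` carve up the values, the same binning can be reproduced with `numpy.histogram`. Note that values outside the range are ignored: the describe() output shows a maximum median of 110000, so that major would not appear in a histogram limited to 100000. The sketch uses the five median values from the preview table only.

```python
import numpy as np

# median salaries for the first five rows, copied from the preview table.
medians = np.array([110000, 75000, 73000, 70000, 65000])

# Ten equal-width bins between 0 and 100000; values outside are ignored.
counts, edges = np.histogram(medians, bins=10, range=(0, 100000))

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{int(left):>6} to {int(right):>6}: {n}")

# One value (110000) lies above the range and is not counted.
print("values counted:", counts.sum(), "of", len(medians))
```

Widening the range, or leaving it out so the bins span the data, would keep the top-earning major in the picture.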
recent_grads[:8]
Which major_category has the most (and least) men (and women) on average?
Using the bar plot below, we can see that:
The Business major category has the highest average number of male students enrolled.
The Communication & Journalism major category has the highest average number of female students enrolled.
from numpy import arange
# Unique major categories, in order of first appearance
categories = recent_grads['major_category'].unique()
avg_of_totals_men = []
avg_of_totals_women = []
# Average number of men and women per major within each category
for category in categories:
    avg_of_totals_men.append(recent_grads.loc[recent_grads['major_category']==category, 'men'].mean())
    avg_of_totals_women.append(recent_grads.loc[recent_grads['major_category']==category, 'women'].mean())
fig, ax = plt.subplots(figsize = (16,6))
# Bar positions follow the number of categories instead of a hard-coded 16
positions = arange(len(categories))
ax.bar(positions-0.2, avg_of_totals_men, 0.4, label = 'Men')
ax.bar(positions+0.2, avg_of_totals_women, 0.4, label = 'Women')
ax.set_xticks(positions)
ax.set_xticklabels(categories)
plt.xticks(rotation = 90)
plt.legend()
<matplotlib.legend.Legend at 0xf1ea380948>
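The per-category averages computed in the loop above could also be obtained with a single `groupby`. The sketch below uses made-up numbers in a toy frame (they are not taken from recent-grads.csv) purely to show the shape of the result.

```python
import pandas as pd

# Made-up numbers for illustration only; not taken from recent-grads.csv.
toy = pd.DataFrame({
    "major_category": ["Engineering", "Engineering", "Business", "Business", "Arts"],
    "men": [2000, 1000, 5000, 3000, 400],
    "women": [300, 100, 4000, 2000, 600],
})

# Mean number of men and women per major within each category.
avg_by_category = toy.groupby("major_category")[["men", "women"]].mean()
print(avg_by_category)

# Category with the highest average for each gender.
top_men = avg_by_category["men"].idxmax()
top_women = avg_by_category["women"].idxmax()
print(top_men, top_women)

# avg_by_category.plot.bar(figsize=(16, 6)) would draw the grouped bar chart.
```

One design difference worth noting: `groupby` sorts the categories alphabetically by default, whereas the loop keeps them in order of first appearance.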
from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[['sample_size','median']], figsize = (10,8))
scatter_matrix(recent_grads[['sample_size','median', 'unemployment_rate']], figsize = (10,8))
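A scatter matrix has a natural numeric companion in `DataFrame.corr()`, which returns the pairwise correlation for the same grid of columns. A small sketch, again restricted to the five preview rows and therefore only illustrative:

```python
import pandas as pd

# First five rows of the dataset, copied from the preview table earlier.
sample = pd.DataFrame({
    "sample_size": [36, 7, 3, 16, 289],
    "median": [110000, 75000, 73000, 70000, 65000],
    "unemployment_rate": [0.018381, 0.117241, 0.024096, 0.050125, 0.061098],
})

# Pairwise Pearson correlations, laid out like the scatter matrix panels.
corr_matrix = sample.corr()
print(corr_matrix.round(2))
```

Each off-diagonal cell corresponds to one scatter panel, and the diagonal is always 1 because every column is perfectly correlated with itself.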
Hex bin plots can be particularly useful in place of dense scatterplots. Here, I redraw the earlier scatterplot of share_full_time against median as a hex bin plot, which yields the same results.
recent_grads.plot.hexbin(x = 'share_full_time',y = 'median', gridsize = 15, cmap='inferno')
plt.xlim(0.4,0.95)
plt.ylim(20000,80000)
plt.xlabel('share_full_time')
Text(0.5, 0, 'share_full_time')
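To get a feel for what `gridsize` does, the sketch below draws a hex bin plot with matplotlib directly and reads back the number of points in each hexagon. The data is synthetic (random values that only loosely imitate share_full_time and median), so the picture is not a result about the real dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_points = 500

# Synthetic stand-ins for share_full_time and median; not real data.
x = rng.uniform(0.4, 0.95, size=n_points)
y = 20000 + 60000 * x + rng.normal(0, 5000, size=n_points)

fig, ax = plt.subplots()
# gridsize is the number of hexagons along the x direction.
hb = ax.hexbin(x, y, gridsize=15, cmap="inferno")
fig.colorbar(hb, ax=ax, label="points per hexagon")
ax.set_xlabel("share_full_time (synthetic)")
ax.set_ylabel("median (synthetic)")

# Counts behind the colours: one value per hexagon.
counts = hb.get_array()
print("hexagons:", len(counts), "largest count:", int(counts.max()))
```

A smaller `gridsize` gives fewer, larger hexagons and smoother colours; a larger one approaches the look of the original scatterplot.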
The Communication & Journalism major category has the most female students on average, while the Business major category has the most male students.
Author: Raghav_A