Visualising Earnings Based on College Majors

Matplotlib and the plotting functionality built into pandas are well suited to descriptive analysis, which builds a basic understanding of the underlying data. In this project, I apply them to explore the job outcomes of students who graduated from college between 2010 and 2012. The data comes from the American Community Survey; a cleaned and aggregated subset is available at https://github.com/fivethirtyeight/data/tree/master/college-majors.

Throughout the analysis, I pose questions about the data, visualise the relevant columns with descriptive plots, and then infer answers from those plots.

Dataset Description

Rank - Rank by median earnings (the dataset is ordered by this column).
Major_code - Major code.
Major - Major description.
Major_category - Category of major.
Total - Total number of people with major.
Sample_size - Sample size (unweighted) of full-time.
Men - Male graduates.
Women - Female graduates.
ShareWomen - Women as share of total.
Employed - Number employed.
Median - Median salary of full-time, year-round workers.
Low_wage_jobs - Number in low-wage service jobs.
Full_time - Number employed 35 hours or more.
Part_time - Number employed less than 35 hours.

As a first step, I import the required libraries and set up the tools needed for the analysis. I then inspect the structure of the dataset by printing a few rows and calling describe() on it.

In [1]:

# Importing relevant libraries
import matplotlib.pyplot as plt
import pandas as pd

In [2]:

# Running Jupyter magic to display plots inline
%matplotlib inline

In [3]:

# Reading the CSV file into a pandas DataFrame
recent_grads = pd.read_csv('recent-grads.csv')

In [4]:

# Displaying the column names and values of the first row of the dataframe
recent_grads.iloc[0]

Out[4]:

Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object

In [23]:

print(recent_grads.describe())
recent_grads[:5]

             rank   major_code          total            men          women  \
count  172.000000   172.000000     172.000000     172.000000     172.000000   
mean    87.377907  3895.953488   39370.081395   16723.406977   22646.674419   
std     49.983181  1679.240095   63483.491009   28122.433474   41057.330740   
min      1.000000  1100.000000     124.000000     119.000000       0.000000   
25%     44.750000  2403.750000    4549.750000    2177.500000    1778.250000   
50%     87.500000  3608.500000   15104.000000    5434.000000    8386.500000   
75%    130.250000  5503.250000   38909.750000   14631.000000   22553.750000   
max    173.000000  6403.000000  393735.000000  173809.000000  307087.000000   

       sharewomen  sample_size      employed      full_time      part_time  \
count  172.000000   172.000000     172.00000     172.000000     172.000000   
mean     0.522223   357.941860   31355.80814   26165.767442    8877.232558   
std      0.231205   619.680419   50777.42865   42957.122320   14679.038729   
min      0.000000     2.000000       0.00000     111.000000       0.000000   
25%      0.336026    42.000000    3734.75000    3181.000000    1013.750000   
50%      0.534024   131.000000   12031.50000   10073.500000    3332.500000   
75%      0.703299   339.000000   31701.25000   25447.250000    9981.000000   
max      0.968954  4212.000000  307933.00000  251540.000000  115172.000000   

       full_time_year_round    unemployed  unemployment_rate         median  \
count            172.000000    172.000000         172.000000     172.000000   
mean           19798.843023   2428.412791           0.068024   40076.744186   
std            33229.227514   4121.730452           0.030340   11461.388773   
min              111.000000      0.000000           0.000000   22000.000000   
25%             2474.750000    299.500000           0.050261   33000.000000   
50%             7436.500000    905.000000           0.067544   36000.000000   
75%            17674.750000   2397.000000           0.087247   45000.000000   
max           199897.000000  28169.000000           0.177226  110000.000000   

              p25th          p75th   college_jobs  non_college_jobs  \
count    172.000000     172.000000     172.000000        172.000000   
mean   29486.918605   51386.627907   12387.401163      13354.325581   
std     9190.769927   14882.278650   21344.967522      23841.326605   
min    18500.000000   22000.000000       0.000000          0.000000   
25%    24000.000000   41750.000000    1744.750000       1594.000000   
50%    27000.000000   47000.000000    4467.500000       4603.500000   
75%    33250.000000   58500.000000   14595.750000      11791.750000   
max    95000.000000  125000.000000  151643.000000     148395.000000   

       low_wage_jobs  share_full_time  
count     172.000000       172.000000  
mean     3878.633721         0.666427  
std      6960.467621         0.102083  
min         0.000000         0.372872  
25%       336.750000         0.597190  
50%      1238.500000         0.673859  
75%      3496.000000         0.734996  
max     48207.000000         0.958949

Out[23]:

	rank	major_code	major	total	men	women	major_category	sharewomen	sample_size	employed	...	full_time_year_round	unemployed	unemployment_rate	median	p25th	p75th	college_jobs	non_college_jobs	low_wage_jobs	share_full_time
0	1	2419	PETROLEUM ENGINEERING	2339.0	2057.0	282.0	Engineering	0.120564	36	1976	...	1207	37	0.018381	110000	95000	125000	1534	364	193	0.790509
1	2	2416	MINING AND MINERAL ENGINEERING	756.0	679.0	77.0	Engineering	0.101852	7	640	...	388	85	0.117241	75000	55000	90000	350	257	50	0.735450
2	3	2415	METALLURGICAL ENGINEERING	856.0	725.0	131.0	Engineering	0.153037	3	648	...	340	16	0.024096	73000	50000	105000	456	176	0	0.651869
3	4	2417	NAVAL ARCHITECTURE AND MARINE ENGINEERING	1258.0	1123.0	135.0	Engineering	0.107313	16	758	...	692	40	0.050125	70000	43000	80000	529	102	0	0.849762
4	5	2405	CHEMICAL ENGINEERING	32260.0	21239.0	11021.0	Engineering	0.341631	289	25694	...	16697	1672	0.061098	65000	50000	75000	18314	4440	972	0.718227

5 rows × 22 columns

In [6]:

# Dropping rows with null values from our data set
print(len(recent_grads))
recent_grads = recent_grads.dropna()
print(len(recent_grads))

173
172
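Before dropping rows, it can help to look at which rows actually contain missing values, so it is clear what is being discarded. The following is a minimal sketch, assuming a small stand-in frame named `demo` with made-up rows (it is not the real dataset); on the real data the same two calls would be applied to `recent_grads`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for recent_grads (made-up rows, for illustration only)
demo = pd.DataFrame({
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C"],
    "Total": [2339.0, np.nan, 856.0],
    "Men": [2057.0, np.nan, 725.0],
})

# Rows that have at least one missing value in any column
rows_with_nulls = demo[demo.isna().any(axis=1)]
print(rows_with_nulls)

# dropna() removes exactly those rows
cleaned = demo.dropna()
print(len(demo), len(cleaned))
```

Inspecting `rows_with_nulls` first shows which major is lost and which columns were empty, which a bare row count does not.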

In [7]:

# Converting column names to lower case (I prefer lower-case names in my code)
recent_grads.columns = recent_grads.columns.str.lower()
recent_grads.columns

Out[7]:

Index(['rank', 'major_code', 'major', 'total', 'men', 'women',
       'major_category', 'sharewomen', 'sample_size', 'employed', 'full_time',
       'part_time', 'full_time_year_round', 'unemployed', 'unemployment_rate',
       'median', 'p25th', 'p75th', 'college_jobs', 'non_college_jobs',
       'low_wage_jobs'],
      dtype='object')

1. Searching for Correlations between our Column Variables

With no leads initially, I draw scatterplots for pairs of variables that I suspect are correlated. I start with:

sample_size and median
sample_size and unemployment_rate
full_time and median
sharewomen and unemployment_rate
men and median
women and median
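The six pairs listed above can also be drawn on a single grid instead of one cell each. This is only a sketch under assumptions: `demo` is a synthetic stand-in for `recent_grads` (random values, not the real data), and the `Agg` backend is selected only so that the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for recent_grads (assumption: random values, not real data)
rng = np.random.default_rng(0)
columns = ["sample_size", "median", "unemployment_rate",
           "full_time", "sharewomen", "men", "women"]
demo = pd.DataFrame(rng.random((30, len(columns))), columns=columns)

# The six (x, y) pairs explored in this section
pairs = [
    ("sample_size", "median"),
    ("sample_size", "unemployment_rate"),
    ("full_time", "median"),
    ("sharewomen", "unemployment_rate"),
    ("men", "median"),
    ("women", "median"),
]

# One scatterplot per pair, laid out on a 2 x 3 grid
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, (x, y) in zip(axes.flat, pairs):
    demo.plot(x=x, y=y, kind="scatter", ax=ax, title=f"{x} vs {y}")
fig.tight_layout()
```

Passing `ax=` to `DataFrame.plot` is what places each scatterplot on its own panel of the shared figure.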

In [8]:

recent_grads.plot('sample_size','median', kind = 'scatter') #--> 1.1

Out[8]:

<matplotlib.axes._subplots.AxesSubplot at 0xf1e7f0a1c8>

In [9]:

recent_grads.plot('sample_size','unemployment_rate', kind = 'scatter')# --> 1.2

Out[9]:

<matplotlib.axes._subplots.AxesSubplot at 0xf1e86673c8>

In [10]:

recent_grads.plot('full_time','median', kind = 'scatter') # --> 1.3

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0xf1e86da048>

In [11]:

recent_grads.plot('sharewomen','unemployment_rate', kind = 'scatter') # --> 1.4

Out[11]:

<matplotlib.axes._subplots.AxesSubplot at 0xf1e87443c8>

In [12]:

recent_grads.plot('men','median', kind = 'scatter') # --> 1.5

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0xf1e87b3c88>

In [13]:

recent_grads.plot('women','median', kind = 'scatter') # --> 1.6

Out[13]:

<matplotlib.axes._subplots.AxesSubplot at 0xf1e87fba48>

1.1 Popular Major = More $$$?

Does a major with more students correlate with a higher median salary?

No. As per the scatterplot, there is a slight negative correlation between the number of graduates in a major and its median salary. Also, some of the highest median salaries belong to majors of medium size.

In [14]:

recent_grads.plot('median','total',kind = 'scatter', figsize = (7,5))

Out[14]:

<matplotlib.axes._subplots.AxesSubplot at 0xf1e8871d08>
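A scatterplot gives a visual impression of the relationship; a correlation coefficient puts a number on it. The sketch below uses only the `total` and `median` values of the five rows printed by `recent_grads[:5]` earlier, so it illustrates the call and says nothing reliable about the full dataset.

```python
import pandas as pd

# total and median for the five rows shown in the recent_grads[:5] output
head = pd.DataFrame({
    "total": [2339, 756, 856, 1258, 32260],
    "median": [110000, 75000, 73000, 70000, 65000],
})

# Pearson correlation coefficient: -1 is a perfect negative linear
# relationship, 0 is none, +1 is a perfect positive one
r = head["total"].corr(head["median"])
print(f"Pearson r (five rows only) = {r:.2f}")
```

On the full data the equivalent call would be `recent_grads['total'].corr(recent_grads['median'])`. Five rows are far too few to support a conclusion, and these five are the top-ranked majors rather than a representative sample.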

1.2 Do Full Time Employees make more $$$?

Do majors with a higher share of full-time employed graduates have a higher median salary?

Yes, majors with a higher share of full-time employed graduates appear to have higher median salaries overall.

In [15]:

recent_grads['share_full_time'] = recent_grads['full_time']/recent_grads['total']
recent_grads.plot('share_full_time','median',kind = 'scatter')

Out[15]:

<matplotlib.axes._subplots.AxesSubplot at 0xf1e89375c8>
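As a quick sanity check on the derived `share_full_time` column, the value for one major can be recomputed by hand. The numbers below are the `full_time` and `total` values for PETROLEUM ENGINEERING from the outputs earlier in this notebook.

```python
# Values for PETROLEUM ENGINEERING taken from the outputs above
full_time = 1849
total = 2339

# Same formula as the share_full_time column: full-time workers / all graduates
share_full_time = full_time / total
print(round(share_full_time, 6))  # 0.790509
```

This matches the 0.790509 shown for that row in the table output. Note that the denominator is the total number of graduates, not the number employed, so the share also reflects how many graduates are working at all.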

1.3 Do women make more $$$?

Do majors with a higher share of women have a higher median salary overall?

No, majors with a higher share of women tend to have a lower median salary.

In [16]:

recent_grads.plot('sharewomen','median',kind = 'scatter')

Out[16]:

<matplotlib.axes._subplots.AxesSubplot at 0xf1e9156348>

1.4 Are majors predominantly Male or Female?

Do majors mostly consist of males or females?

Females, but by a small margin! The histogram below shows somewhat higher frequencies in the 0.5 to 1.0 range of sharewomen, where majors are female-majority. The describe() output above agrees: the median sharewomen across majors is about 0.53.

In [17]:

recent_grads['sharewomen'].hist(bins = 20)

Out[17]:

<matplotlib.axes._subplots.AxesSubplot at 0xf1ea18dc08>
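The impression from the histogram can be backed by a direct count of majors on each side of 0.5. This is a minimal sketch with hypothetical `sharewomen` values (made up for illustration, not taken from the dataset); on the real data, `share` would be `recent_grads['sharewomen']`.

```python
import pandas as pd

# Hypothetical sharewomen values, for illustration only (not from the dataset)
share = pd.Series([0.12, 0.34, 0.48, 0.55, 0.62, 0.71, 0.90], name="sharewomen")

# Count majors where women are the majority, and where men are
female_majority = int((share > 0.5).sum())
male_majority = int((share < 0.5).sum())

print(f"female-majority majors: {female_majority}")
print(f"male-majority majors:   {male_majority}")
```

Comparing the two counts gives a single number for "by how much", which is hard to read off 20 histogram bars.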

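1.5 What median salary range is Most Common?

The 30,000 to 40,000 range is the most common median salary range among the majors, as the histogram below shows. This agrees with the describe() output above, where the median column has quartiles of 33,000 and 45,000 and a median of 36,000.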

In [18]:

recent_grads['median'].hist(bins = 10, range = (0,100000))

Out[18]:

<matplotlib.axes._subplots.AxesSubplot at 0xf1ea1c4608>
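The fullest bin can also be read off numerically with `numpy.histogram`, using the same `bins` and `range` as the plot. The sketch below uses only the five median salaries visible in the `recent_grads[:5]` output, so the bin it reports describes those five rows, not the dataset as a whole.

```python
import numpy as np

# Median salaries of the five rows shown in the recent_grads[:5] output
medians = np.array([110000, 75000, 73000, 70000, 65000])

# Same binning as the histogram above: 10 equal bins over 0 to 100,000
counts, edges = np.histogram(medians, bins=10, range=(0, 100000))

top = counts.argmax()  # index of the fullest bin
print(f"most common bin: {edges[top]:.0f} to {edges[top + 1]:.0f}")
print(f"values counted: {counts.sum()} of {len(medians)}")
```

Values outside `range` are ignored, so the 110,000 median is not counted here. The same applies to the plot above: the maximum of the median column is 110,000, which lies beyond the 0 to 100,000 range passed to `hist`.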

recent_grads[:8]

1.6 Which `major_category` has the most (& least) men (& women) on average?

Using the bar plot below, we can see that:

The Business major category has the highest average number of male students per major.
The Communication & Journalism major category has the highest average number of female students per major.

In [19]:

from numpy import arange

# Average number of men and women per major within each major category
categories = recent_grads['major_category'].unique()
avg_of_totals_men = []
avg_of_totals_women = []

for category in categories:
    in_category = recent_grads['major_category'] == category
    avg_of_totals_men.append(recent_grads.loc[in_category, 'men'].mean())
    avg_of_totals_women.append(recent_grads.loc[in_category, 'women'].mean())

fig, ax = plt.subplots(figsize = (16,6))

# Bar positions follow the number of categories instead of a hard-coded 16
positions = arange(len(categories))
ax.bar(positions - 0.2, avg_of_totals_men, 0.4, label = 'Men')
ax.bar(positions + 0.2, avg_of_totals_women, 0.4, label = 'Women')
ax.set_xticks(positions)
ax.set_xticklabels(categories)
plt.xticks(rotation = 90)
plt.legend()

Out[19]:

<matplotlib.legend.Legend at 0xf1ea380948>
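The per-category averages computed in the loop above can also be obtained with a single `groupby`, which makes it easy to look up the top category by name. This is a sketch on hypothetical rows (made-up numbers, not the real dataset); the column names match the lower-cased ones used in this notebook.

```python
import pandas as pd

# Hypothetical rows for illustration only (not taken from the dataset)
demo = pd.DataFrame({
    "major_category": ["Engineering", "Engineering", "Business", "Business", "Arts"],
    "men": [2000, 700, 9000, 5000, 300],
    "women": [300, 100, 8000, 6000, 9000],
})

# Mean number of men and women per major, within each category
avg_by_category = demo.groupby("major_category")[["men", "women"]].mean()
print(avg_by_category)

# Category with the highest average, separately for each column
print(avg_by_category.idxmax())
```

Calling `avg_by_category.plot.bar()` on the result would draw grouped bars similar to the manual `ax.bar` version, with the category names already on the x axis.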

In [20]:

from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[['sample_size','median']], figsize = (10,8))

Out[20]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA226D88>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA5E7508>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA3DDAC8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA416BC8>]],
      dtype=object)

In [21]:

scatter_matrix(recent_grads[['sample_size','median', 'unemployment_rate']], figsize = (10,8))

Out[21]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA464488>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA55E708>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA594888>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA80D948>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA845A48>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA87EB88>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA8B7C08>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA8F0D08>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA8FA908>]],
      dtype=object)

Bonus: HexBin plot

Hexbin plots can be particularly useful in place of dense scatterplots. Here I redraw the scatterplot from 1.2 as a hexbin plot, and it shows the same pattern.

In [22]:

recent_grads.plot.hexbin(x = 'share_full_time',y = 'median', gridsize = 15, cmap='inferno')
plt.xlim(0.4,0.95)
plt.ylim(20000,80000)
plt.xlabel('share_full_time')

Out[22]:

Text(0.5, 0, 'share_full_time')

Conclusion

Less popular majors tend to have a higher median salary
Majors with a higher share of full-time employed graduates tend to have higher median salaries
Majors with a higher share of women tend to have a lower median salary
Slightly more majors are female-majority than male-majority
30k to 40k is the most common median salary bracket across majors
The Communication & Journalism category has the highest average number of female students per major, while the Business category has the highest average number of male students

-Author : Raghav_A