A course's job outcomes are among the most important factors for both course providers and students. Students usually look for courses that lead to higher job demand and higher salaries, while institutions may prioritise courses according to market demand and use future job opportunities to attract students. Analysing the job outcomes of previous graduates provides an overview of the market. I am working with a dataset (`recent-grads.csv`) initially released by the American Community Survey and cleaned by FiveThirtyEight. The dataset contains the job outcomes of students who graduated from college between 2010 and 2012. I will visualise the distributions of, and relationships between, different variables to answer some questions about the dataset.
Data dictionary:
Variable | Description |
---|---|
Rank | Rank by median earnings (the dataset is ordered by this column) |
Major_code | Major code |
Major | Major description |
Major_category | Category of major |
Total | Total number of people with major |
Sample_size | Sample size (unweighted) of full-time, year-round workers |
Men | Male graduates |
Women | Female graduates |
ShareWomen | Women as share of total |
Employed | Number employed |
Median | Median salary of full-time, year-round workers |
Low_wage_jobs | Number in low-wage service jobs |
Full_time | Number employed 35 hours or more |
Part_time | Number employed less than 35 hours |
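Some columns in this dictionary are derived from others, which makes a quick consistency check possible. The following is a minimal sketch, assuming (the dictionary does not state this explicitly) that `Total` equals `Men + Women` and that `ShareWomen` equals `Women / Total`. The three rows are copied from the first records printed later in this notebook.

```python
import numpy as np
import pandas as pd

# First three records, copied from the head() output of this notebook.
sample = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING",
              "MINING AND MINERAL ENGINEERING",
              "METALLURGICAL ENGINEERING"],
    "Total": [2339, 756, 856],
    "Men": [2057, 679, 725],
    "Women": [282, 77, 131],
    "ShareWomen": [0.120564, 0.101852, 0.153037],
})

# Assumption: Total is the sum of the two gender counts.
total_matches = (sample["Men"] + sample["Women"] == sample["Total"]).all()

# Assumption: ShareWomen is Women divided by Total (printed to 6 decimals).
share_matches = np.isclose(sample["Women"] / sample["Total"],
                           sample["ShareWomen"], atol=1e-5).all()

print(total_matches, share_matches)
```

If either check failed on the full dataset, it would be a sign that a column means something different from what is assumed here.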
# import pandas and pyplot
import pandas as pd
import matplotlib.pyplot as plt
# run jupyter magic
%matplotlib inline
recent_grads = pd.read_csv('recent-grads.csv')
print(recent_grads.iloc[0])
print(recent_grads.head())
print(recent_grads.tail())
print(recent_grads.describe())
Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object

   Rank  Major_code                                      Major    Total  \
0     1        2419                      PETROLEUM ENGINEERING   2339.0
1     2        2416             MINING AND MINERAL ENGINEERING    756.0
2     3        2415                  METALLURGICAL ENGINEERING    856.0
3     4        2417  NAVAL ARCHITECTURE AND MARINE ENGINEERING   1258.0
4     5        2405                       CHEMICAL ENGINEERING  32260.0

       Men    Women Major_category  ShareWomen  Sample_size  Employed  \
0   2057.0    282.0    Engineering    0.120564           36      1976
1    679.0     77.0    Engineering    0.101852            7       640
2    725.0    131.0    Engineering    0.153037            3       648
3   1123.0    135.0    Engineering    0.107313           16       758
4  21239.0  11021.0    Engineering    0.341631          289     25694

   ...  Part_time  Full_time_year_round  Unemployed  \
0  ...        270                  1207          37
1  ...        170                   388          85
2  ...        133                   340          16
3  ...        150                   692          40
4  ...       5180                 16697        1672

   Unemployment_rate  Median  P25th   P75th  College_jobs  Non_college_jobs  \
0           0.018381  110000  95000  125000          1534               364
1           0.117241   75000  55000   90000           350               257
2           0.024096   73000  50000  105000           456               176
3           0.050125   70000  43000   80000           529               102
4           0.061098   65000  50000   75000         18314              4440

   Low_wage_jobs
0            193
1             50
2              0
3              0
4            972

[5 rows x 21 columns]

     Rank  Major_code                   Major   Total     Men   Women  \
168   169        3609                 ZOOLOGY  8409.0  3050.0  5359.0
169   170        5201  EDUCATIONAL PSYCHOLOGY  2854.0   522.0  2332.0
170   171        5202     CLINICAL PSYCHOLOGY  2838.0   568.0  2270.0
171   172        5203   COUNSELING PSYCHOLOGY  4626.0   931.0  3695.0
172   173        3501         LIBRARY SCIENCE  1098.0   134.0   964.0

               Major_category  ShareWomen  Sample_size  Employed  \
168    Biology & Life Science    0.637293           47      6259
169  Psychology & Social Work    0.817099            7      2125
170  Psychology & Social Work    0.799859           13      2101
171  Psychology & Social Work    0.798746           21      3777
172                 Education    0.877960            2       742

     ...  Part_time  Full_time_year_round  Unemployed  \
168  ...       2190                  3602         304
169  ...        572                  1211         148
170  ...        648                  1293         368
171  ...        965                  2738         214
172  ...        237                   410          87

     Unemployment_rate  Median  P25th  P75th  College_jobs  Non_college_jobs  \
168           0.046320   26000  20000  39000          2771              2947
169           0.065112   25000  24000  34000          1488               615
170           0.149048   25000  25000  40000           986               870
171           0.053621   23400  19200  26000          2403              1245
172           0.104946   22000  20000  22000           288               338

     Low_wage_jobs
168            743
169             82
170            622
171            308
172            192

[5 rows x 21 columns]

             Rank   Major_code          Total            Men          Women  \
count  173.000000   173.000000     172.000000     172.000000     172.000000
mean    87.000000  3879.815029   39370.081395   16723.406977   22646.674419
std     50.084928  1687.753140   63483.491009   28122.433474   41057.330740
min      1.000000  1100.000000     124.000000     119.000000       0.000000
25%     44.000000  2403.000000    4549.750000    2177.500000    1778.250000
50%     87.000000  3608.000000   15104.000000    5434.000000    8386.500000
75%    130.000000  5503.000000   38909.750000   14631.000000   22553.750000
max    173.000000  6403.000000  393735.000000  173809.000000  307087.000000

       ShareWomen  Sample_size       Employed      Full_time      Part_time  \
count  172.000000   173.000000     173.000000     173.000000     173.000000
mean     0.522223   356.080925   31192.763006   26029.306358    8832.398844
std      0.231205   618.361022   50675.002241   42869.655092   14648.179473
min      0.000000     2.000000       0.000000     111.000000       0.000000
25%      0.336026    39.000000    3608.000000    3154.000000    1030.000000
50%      0.534024   130.000000   11797.000000   10048.000000    3299.000000
75%      0.703299   338.000000   31433.000000   25147.000000    9948.000000
max      0.968954  4212.000000  307933.000000  251540.000000  115172.000000

       Full_time_year_round    Unemployed  Unemployment_rate         Median  \
count            173.000000    173.000000         173.000000     173.000000
mean           19694.427746   2416.329480           0.068191   40151.445087
std            33160.941514   4112.803148           0.030331   11470.181802
min              111.000000      0.000000           0.000000   22000.000000
25%             2453.000000    304.000000           0.050306   33000.000000
50%             7413.000000    893.000000           0.067961   36000.000000
75%            16891.000000   2393.000000           0.087557   45000.000000
max           199897.000000  28169.000000           0.177226  110000.000000

              P25th          P75th   College_jobs  Non_college_jobs  \
count    173.000000     173.000000     173.000000        173.000000
mean   29501.445087   51494.219653   12322.635838      13284.497110
std     9166.005235   14906.279740   21299.868863      23789.655363
min    18500.000000   22000.000000       0.000000          0.000000
25%    24000.000000   42000.000000    1675.000000       1591.000000
50%    27000.000000   47000.000000    4390.000000       4595.000000
75%    33000.000000   60000.000000   14444.000000      11783.000000
max    95000.000000  125000.000000  151643.000000     148395.000000

       Low_wage_jobs
count     173.000000
mean     3859.017341
std      6944.998579
min         0.000000
25%       340.000000
50%      1231.000000
75%      3466.000000
max     48207.000000
# drop rows with missing values
raw_data_count = recent_grads.shape[0]
recent_grads.dropna(inplace=True)
cleaned_data_count = recent_grads.shape[0]
print(raw_data_count, '-->', cleaned_data_count)
173 --> 172
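Before dropping rows, it can help to look at which records and columns actually contain missing values. A minimal sketch on a small made-up frame (the values below are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative data only; one row has missing counts.
df = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [1000.0, np.nan, 500.0],
    "Men": [600.0, np.nan, 200.0],
    "Median": [40000, 35000, 30000],
})

# Count missing values per column.
missing_per_column = df.isna().sum()

# Select the rows that dropna() would remove.
rows_with_missing = df[df.isna().any(axis=1)]

cleaned = df.dropna()
print(missing_per_column)
print(rows_with_missing)
print(len(df), '-->', len(cleaned))
```

Inspecting the affected rows first shows whether dropping them removes anything important for the later plots.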
Visualising the relationships between variables with scatter plots.
# Scatter plot of Sample_size vs Median
recent_grads.plot(x='Sample_size',
y='Median',
kind='scatter',
title='Sample-size vs Median'
)
# Scatter plot of Sample_size and Unemployment_rate
recent_grads.plot(x='Sample_size',
y='Unemployment_rate',
kind='scatter',
title='Sample-size vs Unemployment-rate'
)
# Scatter plot of Full_time and Median
recent_grads.plot(x='Median',
y='Full_time',
kind='scatter',
title='Full_time vs Median')
# Scatter plot of ShareWomen and Unemployment_rate
recent_grads.plot(x='ShareWomen',
y='Unemployment_rate',
kind='scatter',
title='ShareWomen vs Unemployment_rate')
# Scatter plot of Men and Median
recent_grads.plot(x='Men',
y='Median',
kind='scatter',
title='Men vs Median of income')
# Scatter plot of Women and Median
recent_grads.plot(x='Women',
y='Median',
kind='scatter',
title='Women vs Median of income')
To examine how a major's popularity relates to earnings, I believe that a scatter plot of `Total` and `Median` is more informative than one using `Sample_size`, as the value of `Sample_size` does not always reflect the popularity of a major.
# explore the relation between Total and Median
recent_grads.plot(x='Total',
y='Median',
kind='scatter',
title='Median of income vs Total',
ylim = (15000, 120000),
xlim = (0, 400000))
The highest median salaries decrease as majors become more popular. The highest salaries (more than 60k) are found among the less popular majors, those with fewer than 50,000 graduates over the two years. The upper end of the income range then falls gradually to less than 40k as popularity increases.
Majors with a low share of women have median incomes ranging from around 30k up to 80k. Once the share of women rises above 60 per cent, the range gradually drops to between 20k and 60k. The graph suggests that graduates of majors with a higher share of women can expect lower salaries than graduates of male-dominated majors. This may indicate that women are underpaid.
For most majors, the number of full-time employed graduates is below 50,000, with median earnings ranging widely from 20k to 80k. This range narrows as the number of full-time employed graduates increases: for majors with more than 100,000 full-time employed graduates, the median salary varies between 30k and 50k.
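The first two observations describe how the upper end of median earnings changes across groups of majors. One way to check such a reading numerically is to bin the majors and take the highest `Median` per bin. The sketch below uses made-up values and hypothetical bin edges; the helper name `max_median_by_bin` is introduced here for illustration and is not part of the original analysis.

```python
import pandas as pd

# Illustrative values only, not taken from recent-grads.csv.
df = pd.DataFrame({
    "Total": [2000, 15000, 40000, 90000, 150000, 390000],
    "ShareWomen": [0.12, 0.30, 0.55, 0.70, 0.80, 0.90],
    "Median": [110000, 60000, 45000, 40000, 36000, 31000],
})

def max_median_by_bin(frame, column, edges):
    """Return the highest Median among majors falling in each bin of `column`."""
    bins = pd.cut(frame[column], bins=edges)
    return frame.groupby(bins, observed=True)["Median"].max()

# Hypothetical bin edges for popularity and for the share of women.
by_total = max_median_by_bin(df, "Total", [0, 50000, 100000, 400000])
by_share = max_median_by_bin(df, "ShareWomen", [0.0, 0.4, 0.6, 1.0])

print(by_total)
print(by_share)
```

The bin edges strongly influence the result, so it is worth trying a few choices before drawing conclusions.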
I will explore the distributions of the following columns using histograms: `Sample_size`, `Median`, `Employed`, `Full_time`, `ShareWomen`, `Unemployment_rate`, `Men` and `Women`.
# histogram of Sample_size
ax = recent_grads['Sample_size'].plot(bins = 100,
kind='hist',
title = 'Sample_size histogram'
)
ax.set_xlabel('Sample_size')
ax.set_ylabel('frequency')
plt.show()
ax = recent_grads['Sample_size'].plot(bins = 100,
kind='hist',
title = 'Sample_size histogram, range[0,1800]',
xlim = (0, 1800)
)
ax.set_xlabel('Sample_size')
ax.set_ylabel('frequency')
plt.show()
# histogram of Median
ax = recent_grads['Median'].plot(bins = 30,
kind='hist',
title = 'Median histogram',
rot = 45
)
ax.set_xlabel('Median')
ax.set_ylabel('frequency')
plt.show()
# histogram of Employed
ax = recent_grads['Employed'].plot(bins = 30,
kind='hist',
title = 'Employed histogram',
rot = 45
)
ax.set_xlabel('Employed')
ax.set_ylabel('frequency')
plt.show()
ax = recent_grads['Employed'].plot(bins = 60,
kind='hist',
title = 'Employed histogram, range[0,200000]',
xlim = (0, 200000)
)
ax.set_xlabel('Employed')
ax.set_ylabel('frequency')
plt.show()
# histogram of Full_time
ax = recent_grads['Full_time'].plot(bins = 30,
kind='hist',
title = 'Full_time histogram',
rot = 45
)
ax.set_xlabel('Full_time')
ax.set_ylabel('frequency')
plt.show()
ax = recent_grads['Full_time'].plot(bins = 60,
kind='hist',
title = 'Full_time histogram, range[0,180000]',
xlim = (0, 180000),
rot = 45,
color = 'red'
)
ax.set_xlabel('Full_time')
ax.set_ylabel('frequency')
plt.show()
# histogram of ShareWomen
ax = recent_grads['ShareWomen'].plot(bins = 10,
kind='hist',
title = 'ShareWomen histogram',
rot = 45,
color = 'green'
)
ax.set_xlabel('ShareWomen')
ax.set_ylabel('frequency')
plt.show()
# histogram of Unemployment_rate
ax = recent_grads['Unemployment_rate'].plot(bins = 20,
kind='hist',
title = 'Unemployment_rate histogram',
rot = 45,
color = 'gray'
)
ax.set_xlabel('Unemployment_rate')
ax.set_ylabel('frequency')
ax.set_facecolor('lightblue')
plt.show()
# histogram of Men
ax = recent_grads['Men'].plot(bins = 40,
kind='hist',
title = 'Men histogram',
rot = 45,
color = 'gray',
alpha = 0.5
)
ax.set_xlabel('Men')
ax.set_ylabel('frequency')
ax.set_facecolor('lightblue')
plt.show()
ax2 = recent_grads['Men'].plot(bins = 50,
kind='hist',
title = 'Men histogram, range[0,100000]',
xlim = (0, 100000),
rot = 45,
color = 'yellow',
alpha = 0.5
)
ax2.set_xlabel('Men')
ax2.set_ylabel('frequency')
ax2.set_facecolor('lightslategray')
plt.show()
# histogram of Women
ax = recent_grads['Women'].plot(bins = 40,
kind='hist',
title = 'Women histogram',
rot = 45,
color = 'red',
alpha = 1
)
ax.set_xlabel('Women')
ax.set_ylabel('frequency')
ax.set_facecolor('yellow')
plt.show()
ax2 = recent_grads['Women'].plot(bins = 30,
kind='hist',
title = 'Women histogram, range[0,200000]',
xlim = (0, 200000),
rot = 45,
color = 'yellow',
alpha = 1
)
ax2.set_xlabel('Women')
ax2.set_ylabel('frequency')
ax2.set_facecolor('black')
plt.show()
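The shape of a histogram can also be summarised by a single number. As a rough complement to the plots above, the sketch below computes the sample skewness with `DataFrame.skew()`; positive values indicate a long tail to the right, and values near zero indicate a roughly symmetric distribution. The numbers are made up for illustration.

```python
import pandas as pd

# Illustrative values only: counts with a few very large majors,
# and a proportion spread symmetrically around 0.5.
df = pd.DataFrame({
    "Total": [500, 800, 1200, 2000, 3500, 6000, 40000, 390000],
    "ShareWomen": [0.1, 0.2, 0.35, 0.45, 0.55, 0.65, 0.8, 0.9],
})

# One skewness value per column.
skewness = df.skew()
print(skewness)
```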
Observations on the histograms:
`Total`, `Sample_size`, `Employed`, `Full_time`, `Men` and `Women` all have right-skewed distributions: most values cluster at the low end, with a long tail extending to the right of the histogram's peak. This is expected given how the total number of graduates is distributed across majors over the two years.
`ShareWomen` and `Unemployment_rate` show different distributions, as their values are proportions rather than counts.
`ShareWomen` does not show a specific pattern.
`Median` salaries mostly fall between 30k and 50k, with the highest frequency at around 30k to 35k.
# import necessary module - scatter_matrix
from pandas.plotting import scatter_matrix
# exploring relationship and distribution of Sample_size and Median
scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(10,10))
# exploring relationship and distribution of
# Sample_size and Median and Unemployment_rate
scatter_matrix(recent_grads[['Sample_size', 'Median', 'Unemployment_rate']],
figsize=(12,12),
hist_kwds={'bins':25})
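A scatter matrix shows pairwise relationships visually; `DataFrame.corr()` gives the matching table of Pearson correlation coefficients. A minimal sketch on made-up values (not taken from the dataset):

```python
import pandas as pd

# Illustrative values only.
df = pd.DataFrame({
    "Sample_size": [10, 50, 120, 400, 900, 2500],
    "Median": [75000, 52000, 45000, 40000, 36000, 34000],
    "Unemployment_rate": [0.02, 0.09, 0.05, 0.07, 0.06, 0.08],
})

# Pearson correlation for every pair of columns; the diagonal is 1.
corr = df.corr()
print(corr.round(2))
```

Pearson correlation only captures linear association, so the table should be read alongside the scatter plots rather than instead of them.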
# exploring relationship and distribution of Total and Median
scatter_matrix(recent_grads[['Total', 'Median']], figsize=(10,10))
The scatter matrix of `Total` and `Median` confirms that the data cluster at less popular majors and at median incomes of 30k to 50k. Overall, popular majors tend to have a lower maximum `Median` income than less popular majors.
# exploring relationship and distribution of ShareWomen and Median
scatter_matrix(recent_grads[['ShareWomen', 'Median']],
figsize = (10,10),
hist_kwds = {'color':'red'},
marker = '+'
)
The distribution of `Median` income explains the clustering of points between 30k and 50k in the scatter plot. Overall, the `Median` salary decreases for majors with a higher share of women.
# exploring relationship and distribution of Full_time and Median
scatter_matrix(recent_grads[['Full_time', 'Median']],
figsize = (10,10),
hist_kwds = {'color':'gray'},
marker = 'o'
)
`Median` income tends to be lower when the number of `Full_time` employees increases.
# Bar plots for first and last 10 Majors
fig, (ax1, ax2, ax3, ax4) = plt.subplots(nrows=4, ncols=1, figsize=(6,16))
plt.subplots_adjust(hspace=.3)
recent_grads.head(10).plot(ax=ax1,
x='Major',
y='ShareWomen',
kind='barh',
legend=False,
title='Share of women for first 10 records'
)
recent_grads.tail(10).plot(ax=ax2,
x='Major',
y='ShareWomen',
kind='barh',
legend=False,
title='Share of women for last 10 records'
)
recent_grads.head(10).plot(ax=ax3,
x='Major',
y='Unemployment_rate',
kind='barh',
legend=False,
title='Unemployment_rate for first 10 records',
color = 'red'
)
recent_grads.tail(10).plot(ax=ax4,
x='Major',
y='Unemployment_rate',
kind='barh',
legend=False,
title='Unemployment rate for last 10 records',
color='brown'
)
I collected the first and last ten records from the ranked data. As the data is ranked by median income, the first and last groups have the highest and lowest median salary respectively.
Percentage of women:
Unemployment rate:
Use a grouped bar plot to compare the number of men with the number of women in each category of majors.
ax = (recent_grads[['Major_category','Men','Women']].
groupby('Major_category').
sum().
plot.bar(title='Total graduated by gender in each Major category',
figsize = (8,8)
)
)
ax.set_ylabel('Number of Graduated')
The number of women graduates is greater than or equal to the number of men in three-quarters of the categories.
The Education, Health, and Psychology & Social Work categories are strongly female-dominated.
Engineering is the most male-dominated category, with a large difference between the numbers of men and women graduates.
Business is the most popular category, with nearly equal numbers of men and women graduating over the two years.
Interdisciplinary has the fewest graduates of both genders, with more women than men.
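The comparisons above can be made explicit by computing the share of women in each category from the summed counts. This is a sketch with made-up counts; the column name `ShareWomen_category` is introduced here for illustration and does not exist in the dataset.

```python
import pandas as pd

# Illustrative counts only, not taken from the dataset.
df = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Education", "Business"],
    "Men": [900, 700, 100, 500],
    "Women": [100, 300, 400, 500],
})

# Sum men and women per category, then derive the share of women.
by_category = df.groupby("Major_category")[["Men", "Women"]].sum()
by_category["ShareWomen_category"] = (
    by_category["Women"] / (by_category["Men"] + by_category["Women"])
)

# Categories where women are at least as numerous as men.
women_majority = by_category[by_category["Women"] >= by_category["Men"]]
print(by_category.sort_values("ShareWomen_category"))
```

Sorting by the derived share puts the most male-dominated categories first and the most female-dominated last.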
Exploring the distributions of median salaries and unemployment rates using box plots
# boxplot median salaries and unemployment rate
fig, [ax1,ax2] = plt.subplots(nrows = 1, ncols = 2, figsize = (10,10))
recent_grads['Median'].plot.box(ax=ax1)
ax1.set_title('Median Boxplot')
recent_grads['Unemployment_rate'].plot.box(notch=True, ax=ax2, flierprops=dict(marker='o'))
ax2.set_title('Unemployment rate Boxplot')
ax2.set_facecolor('gray')
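A box plot draws the quartiles and, with matplotlib's default setting, marks points lying more than 1.5 times the interquartile range beyond the quartiles as outliers. The same quantities can be computed directly. The sketch below uses made-up salaries, not values from the dataset.

```python
import pandas as pd

# Illustrative median salaries only.
median = pd.Series([22000, 30000, 33000, 36000, 40000, 45000, 110000])

# Quartiles and interquartile range.
q1, q2, q3 = median.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Fences under the 1.5 * IQR rule; points outside are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = median[(median < lower_fence) | (median > upper_fence)]
print(q1, q2, q3, iqr)
print(outliers.tolist())
```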
Hexagonal bin plot
fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(10,12))
plt.subplots_adjust(hspace=.3)
recent_grads.plot.hexbin(x='Sample_size', y='Median', gridsize=30, ax=axs[0,0], rot=45)
recent_grads.plot.hexbin(x='Sample_size', y='Unemployment_rate', gridsize=30, ax=axs[0,1], rot=45)
recent_grads.plot.hexbin(x='Full_time', y='Median', gridsize=30, ax=axs[1,0], rot=45)
recent_grads.plot.hexbin(x='ShareWomen', y='Unemployment_rate', gridsize=30, ax=axs[1,1])
recent_grads.plot.hexbin(x='Men', y='Median', gridsize=30, ax=axs[2,0], rot=45)
recent_grads.plot.hexbin(x='Women', y='Median', gridsize=30, ax=axs[2,1], rot=45)
Other plots used for conclusion
recent_grads.plot.scatter(x='Total', y='Full_time')
recent_grads.plot.scatter(x='Total', y='Median')
recent_grads.plot.scatter(x='Full_time', y='Median')
recent_grads.plot.scatter(x='Unemployment_rate', y='Median')
After exploring this data, I noticed several patterns. They are not established facts, but they can serve as clues for further causal or predictive analysis.
The number of women graduates is approximately equal to the number of men overall, although it varies between subjects.
Graduates of popular majors can expect an unemployment rate and a median salary close to the average, with less variation in both across majors.
The median salary does not appear to be related to the unemployment rate. Its range seems to depend more on factors such as the major's popularity, the number of full-time employed graduates and the gender proportion.