In this study, we analyze data on college majors.
We'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by the American Community Survey, which conducts surveys and aggregates the data. FiveThirtyEight cleaned the dataset and released it on their GitHub repo.
There is a wealth of information in this dataset. We'll explore part of it through several kinds of visualizations (scatter plots, histograms, scatter matrices, bar charts), aiming to answer questions such as:
Let's start with some preparations to enable data visualization in this notebook.
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
# Enable display of plots inline
%matplotlib inline
Let's read in the data and do some initial exploration.
# Read in the data
recent_grads = pd.read_csv('recent-grads.csv')
# Show count of rows and columns
print('Row count, column count:', recent_grads.shape)
# Show the first row (in table format)
print('\n')
print(recent_grads.iloc[0])
# Show the first three and the last three rows
print('\n')
print(recent_grads.head(3))
print('\n')
print(recent_grads.tail(3))
# Show key statistics of all (numeric) columns
print('\n')
print(recent_grads.describe())
Row count, column count: (173, 21)

Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object

   Rank  Major_code                           Major   Total     Men  Women  \
0     1        2419           PETROLEUM ENGINEERING  2339.0  2057.0  282.0
1     2        2416  MINING AND MINERAL ENGINEERING   756.0   679.0   77.0
2     3        2415       METALLURGICAL ENGINEERING   856.0   725.0  131.0

  Major_category  ShareWomen  Sample_size  Employed  ...  Part_time  \
0    Engineering    0.120564           36      1976  ...        270
1    Engineering    0.101852            7       640  ...        170
2    Engineering    0.153037            3       648  ...        133

   Full_time_year_round  Unemployed  Unemployment_rate  Median  P25th   P75th  \
0                  1207          37           0.018381  110000  95000  125000
1                   388          85           0.117241   75000  55000   90000
2                   340          16           0.024096   73000  50000  105000

   College_jobs  Non_college_jobs  Low_wage_jobs
0          1534               364            193
1           350               257             50
2           456               176              0

[3 rows x 21 columns]

     Rank  Major_code                  Major   Total    Men   Women  \
170   171        5202    CLINICAL PSYCHOLOGY  2838.0  568.0  2270.0
171   172        5203  COUNSELING PSYCHOLOGY  4626.0  931.0  3695.0
172   173        3501        LIBRARY SCIENCE  1098.0  134.0   964.0

               Major_category  ShareWomen  Sample_size  Employed  ...  \
170  Psychology & Social Work    0.799859           13      2101  ...
171  Psychology & Social Work    0.798746           21      3777  ...
172                 Education    0.877960            2       742  ...

     Part_time  Full_time_year_round  Unemployed  Unemployment_rate  Median  \
170        648                  1293         368           0.149048   25000
171        965                  2738         214           0.053621   23400
172        237                   410          87           0.104946   22000

     P25th  P75th  College_jobs  Non_college_jobs  Low_wage_jobs
170  25000  40000           986               870            622
171  19200  26000          2403              1245            308
172  20000  22000           288               338            192

[3 rows x 21 columns]

             Rank   Major_code          Total            Men          Women  \
count  173.000000   173.000000     172.000000     172.000000     172.000000
mean    87.000000  3879.815029   39370.081395   16723.406977   22646.674419
std     50.084928  1687.753140   63483.491009   28122.433474   41057.330740
min      1.000000  1100.000000     124.000000     119.000000       0.000000
25%     44.000000  2403.000000    4549.750000    2177.500000    1778.250000
50%     87.000000  3608.000000   15104.000000    5434.000000    8386.500000
75%    130.000000  5503.000000   38909.750000   14631.000000   22553.750000
max    173.000000  6403.000000  393735.000000  173809.000000  307087.000000

       ShareWomen  Sample_size       Employed      Full_time      Part_time  \
count  172.000000   173.000000     173.000000     173.000000     173.000000
mean     0.522223   356.080925   31192.763006   26029.306358    8832.398844
std      0.231205   618.361022   50675.002241   42869.655092   14648.179473
min      0.000000     2.000000       0.000000     111.000000       0.000000
25%      0.336026    39.000000    3608.000000    3154.000000    1030.000000
50%      0.534024   130.000000   11797.000000   10048.000000    3299.000000
75%      0.703299   338.000000   31433.000000   25147.000000    9948.000000
max      0.968954  4212.000000  307933.000000  251540.000000  115172.000000

       Full_time_year_round    Unemployed  Unemployment_rate         Median  \
count            173.000000    173.000000         173.000000     173.000000
mean           19694.427746   2416.329480           0.068191   40151.445087
std            33160.941514   4112.803148           0.030331   11470.181802
min              111.000000      0.000000           0.000000   22000.000000
25%             2453.000000    304.000000           0.050306   33000.000000
50%             7413.000000    893.000000           0.067961   36000.000000
75%            16891.000000   2393.000000           0.087557   45000.000000
max           199897.000000  28169.000000           0.177226  110000.000000

              P25th          P75th   College_jobs  Non_college_jobs  \
count    173.000000     173.000000     173.000000        173.000000
mean   29501.445087   51494.219653   12322.635838      13284.497110
std     9166.005235   14906.279740   21299.868863      23789.655363
min    18500.000000   22000.000000       0.000000          0.000000
25%    24000.000000   42000.000000    1675.000000       1591.000000
50%    27000.000000   47000.000000    4390.000000       4595.000000
75%    33000.000000   60000.000000   14444.000000      11783.000000
max    95000.000000  125000.000000  151643.000000     148395.000000

       Low_wage_jobs
count     173.000000
mean     3859.017341
std      6944.998579
min         0.000000
25%       340.000000
50%      1231.000000
75%      3466.000000
max     48207.000000
Missing values can cause problems when plotting with matplotlib, so we first remove the rows that contain them. Let's do the required cleaning (and check how much data is removed).
# Number of rows (before)
raw_data_count = recent_grads.shape[0]
print('Number of rows in the raw data: ', raw_data_count)
# Drop rows with missing values
recent_grads.dropna(inplace=True)
# Number of rows (after)
cleaned_data_count = recent_grads.shape[0]
print('Number of rows after removing rows with missing values: ', cleaned_data_count)
Number of rows in the raw data: 173
Number of rows after removing rows with missing values: 172
One row was deleted, leaving 172 rows. We are now ready to explore this data using visualizations.
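Before dropping rows, it can help to see which columns and rows actually contain missing values. The following is a minimal sketch of that check. It uses a small DataFrame named `toy` whose values are made up for illustration (they are not taken from the dataset) as a stand-in for `recent_grads`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for recent_grads; all values are made up for illustration
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [1000.0, np.nan, 3000.0],
    "Median": [50000, 40000, 30000],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Select the rows that contain at least one missing value
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)

# Drop those rows; without inplace=True, dropna returns a new DataFrame
cleaned = toy.dropna()
print(len(toy) - len(cleaned), "row(s) removed")
```

Applied to `recent_grads` before the call to `dropna`, the same two checks would show which major is behind the single dropped row.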
To understand the various plots below, take note of what the different columns in the data represent:
Rank - Rank by median earnings (the dataset is ordered by this column).
Major_code - Major code.
Major - Major description.
Major_category - Category of major.
Total - Total number of people with major.
Sample_size - Sample size (unweighted) of full-time workers.
Men - Male graduates.
Women - Female graduates.
ShareWomen - Women as share of total.
Employed - Number employed.
Median - Median salary of full-time, year-round workers.
Low_wage_jobs - Number in low-wage service jobs.
Full_time - Number employed 35 hours or more.
Part_time - Number employed less than 35 hours.
We'll start by creating some scatter plots to see if we can find answers to these questions:
# Scatter plot showing median salary vs total number of people with the major
recent_grads.plot(x='Total', y='Median', kind='scatter')
# Scatter plot showing the sample size for median salary figures vs total number of people with the major
recent_grads.plot(x='Total', y='Sample_size', kind='scatter')
# Scatter plot showing median salary vs the percentage of women with the major
recent_grads.plot(x='ShareWomen', y='Median', kind='scatter')
# Scatter plot showing the number of full-time employed vs the median salary
recent_grads.plot(x='Median', y='Full_time', kind='scatter')
# Scatter plot showing the number of part-time employed vs the median salary
recent_grads.plot(x='Median', y='Part_time', kind='scatter')
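The individual scatter plots above can also be drawn side by side in one figure, which makes them easier to compare. The following is a sketch only: the DataFrame `toy` and its values are made up for illustration, and the choice of three column pairs is an assumption rather than part of the original analysis.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for recent_grads; all values are made up for illustration
toy = pd.DataFrame({
    "Total": [2000, 15000, 40000, 90000],
    "Median": [70000, 45000, 36000, 33000],
    "ShareWomen": [0.15, 0.45, 0.60, 0.80],
    "Full_time": [1500, 11000, 30000, 60000],
})

# (x, y) column pairs to plot, one pair per panel
pairs = [("Total", "Median"), ("ShareWomen", "Median"), ("Median", "Full_time")]

# One row of panels; passing ax=... draws each scatter plot into its own panel
fig, axes = plt.subplots(1, len(pairs), figsize=(15, 4))
for ax, (x_col, y_col) in zip(axes, pairs):
    toy.plot(x=x_col, y=y_col, kind="scatter", ax=ax)
    ax.set_title(f"{y_col} vs {x_col}")

fig.tight_layout()
```

Passing `ax=ax` to `DataFrame.plot` is what places each plot in a specific panel instead of creating a new figure per call.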
Conclusions:
We'll continue by creating some histograms to find answers to this:
Let's create a histogram to show how the percentage of women is distributed. Conclusions are written directly below each graph.
fig, ax = plt.subplots()
selected_column = 'ShareWomen'
data_to_show = recent_grads[selected_column] * 100  # Multiply by 100 to show as a percentage
data_to_show.hist(bins=20, ax=ax)  # Draw on the axes created above
ax.set_title(selected_column)
ax.set_xlabel('Percentage of Women')
ax.set_ylabel('Frequency')
plt.show()
What we can see is that there are majors where the percentage of women is close to 0%, majors where it is almost 100%, and everything in between. To answer our question, we can (albeit somewhat clunkily) create the same histogram with just two bins.
fig, ax = plt.subplots()
selected_column = 'ShareWomen'
data_to_show = recent_grads[selected_column] * 100  # Multiply by 100 to show as a percentage
# range=(0, 100) puts the boundary between the two bins exactly at 50%
data_to_show.hist(bins=2, range=(0, 100), ax=ax)
ax.set_title(selected_column)
ax.set_xlabel('Percentage of Women')
ax.set_ylabel('Frequency')
plt.show()
We can see that there are almost 80 majors where the percentage of women is below 50%, and almost 100 majors where it is above 50%.
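The same question can be answered without a plot by counting directly. A two-bin histogram only splits exactly at 50% when its range is set to 0 to 100; otherwise the boundary falls halfway between the smallest and largest value in the data. The sketch below avoids that by comparing against 0.5. The values in `share_women` are made up for illustration and stand in for `recent_grads['ShareWomen']`:

```python
import pandas as pd

# Hypothetical ShareWomen values (fractions between 0 and 1), made up for illustration
share_women = pd.Series([0.12, 0.35, 0.48, 0.52, 0.70, 0.97])

# Boolean mask: True where women make up more than half of the major
mostly_women = share_women > 0.5

# Summing a boolean Series counts its True values
n_mostly_women = int(mostly_women.sum())
n_mostly_men = int((~mostly_women).sum())

print("Majors with more than 50% women:", n_mostly_women)
print("Majors with at most 50% women:", n_mostly_men)
```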
Let's now look for common median salaries by plotting a histogram of the median salaries.
recent_grads['Median'].hist(bins=20)
It looks like median salaries between 25K and 50K are most common. Let's zoom in further on this range.
recent_grads['Median'].hist(bins=25, range=(25000,50000))
It looks like very common (median) salaries are 35-36K and 40-41K.
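Reading the tallest bars off a histogram by eye is error-prone, so the bin counts can also be computed explicitly. The sketch below bins salaries into the same 1K-wide intervals between 25K and 50K using `pd.cut` and counts the values per bin. The `medians` values are made up for illustration and stand in for `recent_grads['Median']`:

```python
import pandas as pd

# Hypothetical median salaries, made up for illustration
medians = pd.Series([26000, 35000, 35500, 35900, 40000, 40500, 47000])

# 26 edges from 25K to 50K give 25 bins that are each 1K wide
bin_edges = list(range(25000, 51000, 1000))

# right=False makes each bin include its left edge and exclude its right edge
binned = pd.cut(medians, bins=bin_edges, right=False)

# value_counts sorts by frequency, most common bin first
counts = binned.value_counts()
print(counts.head(3))

most_common_bin = counts.idxmax()
```

One difference from the histogram: with `right=False`, a value of exactly 50000 falls outside the last bin, whereas matplotlib's histogram includes the upper edge in its final bin.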
Let's look further into the relationships between (1) the total number of graduates per major, (2) the median salary, and (3) the percentage of women by creating a scatter matrix for these three columns.
from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[['Total','Median', 'ShareWomen']], figsize = (12,12))
What we can see from these plots is that:
In line with what we observed earlier
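A correlation matrix is a numeric companion to the scatter matrix: it condenses each pairwise plot into a single coefficient. The sketch below uses `DataFrame.corr`, which computes Pearson correlation by default. The DataFrame `toy` and its values are made up for illustration, so the coefficients it prints say nothing about the real dataset:

```python
import pandas as pd

# Hypothetical stand-in for recent_grads; all values are made up for illustration
toy = pd.DataFrame({
    "Total": [2000, 15000, 40000, 90000, 30000],
    "Median": [70000, 45000, 36000, 33000, 40000],
    "ShareWomen": [0.15, 0.45, 0.60, 0.80, 0.55],
})

# Pairwise Pearson correlation between the three columns
corr = toy.corr()
print(corr.round(2))
```

Pearson correlation only captures linear association, and a strong coefficient does not by itself imply a causal link, so it is best read alongside the plots rather than instead of them.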
Now let's also create bar plots for the first 10 and the last 10 majors in the list to see what we can learn from them.
recent_grads[:10].plot.bar(x='Major', y='ShareWomen')
recent_grads[-10:].plot.bar(x='Major', y='ShareWomen')
recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate')
recent_grads[-10:].plot.bar(x='Major', y='Unemployment_rate')
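To back up the visual comparison between the top and the bottom of the ranking, the two slices can also be summarized numerically. The following is a minimal sketch, assuming a DataFrame ordered by rank. The DataFrame `toy` and its values are made up for illustration, and `n` is set to 2 only to keep the example small (the bar plots above use 10):

```python
import pandas as pd

# Hypothetical stand-in for recent_grads, ordered by rank; values are made up
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D", "E", "F"],
    "ShareWomen": [0.10, 0.20, 0.50, 0.60, 0.80, 0.90],
    "Unemployment_rate": [0.02, 0.06, 0.05, 0.07, 0.05, 0.10],
})

n = 2  # number of majors taken from each end of the ranking
columns = ["ShareWomen", "Unemployment_rate"]

# Mean of each column for the first n and the last n majors, side by side
summary = pd.DataFrame({
    f"first_{n}": toy.head(n)[columns].mean(),
    f"last_{n}": toy.tail(n)[columns].mean(),
})
print(summary)
```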
What we see:
Let's wrap up by sharing some of the observations that we made above:
Clearly, we've only been scratching the surface. The dataset contains a wealth of interesting data to explore. Possibly to be continued on another occasion!