College Major Analysis¶

By David VanHeeswijk¶

In this project, we use data visualization tools to explore questions about college majors, starting income, and related demographics.

We use a data set covering 2010-2012 that was released by the American Community Survey and then cleaned by FiveThirtyEight.

Our first tasks will be to import our libraries and read in our data, found in the file 'recent-grads.csv', to the dataframe 'recent_grads'.

In [1]:

import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:

recent_grads = pd.read_csv('recent-grads.csv')
recent_grads.head()

Out[2]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
0	1	2419	PETROLEUM ENGINEERING	2339.0	2057.0	282.0	Engineering	0.120564	36	1976	...	270	1207	37	0.018381	110000	95000	125000	1534	364	193
1	2	2416	MINING AND MINERAL ENGINEERING	756.0	679.0	77.0	Engineering	0.101852	7	640	...	170	388	85	0.117241	75000	55000	90000	350	257	50
2	3	2415	METALLURGICAL ENGINEERING	856.0	725.0	131.0	Engineering	0.153037	3	648	...	133	340	16	0.024096	73000	50000	105000	456	176	0
3	4	2417	NAVAL ARCHITECTURE AND MARINE ENGINEERING	1258.0	1123.0	135.0	Engineering	0.107313	16	758	...	150	692	40	0.050125	70000	43000	80000	529	102	0
4	5	2405	CHEMICAL ENGINEERING	32260.0	21239.0	11021.0	Engineering	0.341631	289	25694	...	5180	16697	1672	0.061098	65000	50000	75000	18314	4440	972

5 rows × 21 columns

In [3]:

recent_grads.iloc[0]

Out[3]:

Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object

In [4]:

recent_grads.tail()

Out[4]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
168	169	3609	ZOOLOGY	8409.0	3050.0	5359.0	Biology & Life Science	0.637293	47	6259	...	2190	3602	304	0.046320	26000	20000	39000	2771	2947	743
169	170	5201	EDUCATIONAL PSYCHOLOGY	2854.0	522.0	2332.0	Psychology & Social Work	0.817099	7	2125	...	572	1211	148	0.065112	25000	24000	34000	1488	615	82
170	171	5202	CLINICAL PSYCHOLOGY	2838.0	568.0	2270.0	Psychology & Social Work	0.799859	13	2101	...	648	1293	368	0.149048	25000	25000	40000	986	870	622
171	172	5203	COUNSELING PSYCHOLOGY	4626.0	931.0	3695.0	Psychology & Social Work	0.798746	21	3777	...	965	2738	214	0.053621	23400	19200	26000	2403	1245	308
172	173	3501	LIBRARY SCIENCE	1098.0	134.0	964.0	Education	0.877960	2	742	...	237	410	87	0.104946	22000	20000	22000	288	338	192

5 rows × 21 columns

In [5]:

recent_grads.describe()

Out[5]:

	Rank	Major_code	Total	Men	Women	ShareWomen	Sample_size	Employed	Full_time	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
count	173.000000	173.000000	172.000000	172.000000	172.000000	172.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000
mean	87.000000	3879.815029	39370.081395	16723.406977	22646.674419	0.522223	356.080925	31192.763006	26029.306358	8832.398844	19694.427746	2416.329480	0.068191	40151.445087	29501.445087	51494.219653	12322.635838	13284.497110	3859.017341
std	50.084928	1687.753140	63483.491009	28122.433474	41057.330740	0.231205	618.361022	50675.002241	42869.655092	14648.179473	33160.941514	4112.803148	0.030331	11470.181802	9166.005235	14906.279740	21299.868863	23789.655363	6944.998579
min	1.000000	1100.000000	124.000000	119.000000	0.000000	0.000000	2.000000	0.000000	111.000000	0.000000	111.000000	0.000000	0.000000	22000.000000	18500.000000	22000.000000	0.000000	0.000000	0.000000
25%	44.000000	2403.000000	4549.750000	2177.500000	1778.250000	0.336026	39.000000	3608.000000	3154.000000	1030.000000	2453.000000	304.000000	0.050306	33000.000000	24000.000000	42000.000000	1675.000000	1591.000000	340.000000
50%	87.000000	3608.000000	15104.000000	5434.000000	8386.500000	0.534024	130.000000	11797.000000	10048.000000	3299.000000	7413.000000	893.000000	0.067961	36000.000000	27000.000000	47000.000000	4390.000000	4595.000000	1231.000000
75%	130.000000	5503.000000	38909.750000	14631.000000	22553.750000	0.703299	338.000000	31433.000000	25147.000000	9948.000000	16891.000000	2393.000000	0.087557	45000.000000	33000.000000	60000.000000	14444.000000	11783.000000	3466.000000
max	173.000000	6403.000000	393735.000000	173809.000000	307087.000000	0.968954	4212.000000	307933.000000	251540.000000	115172.000000	199897.000000	28169.000000	0.177226	110000.000000	95000.000000	125000.000000	151643.000000	148395.000000	48207.000000

Exploring the head and tail of the data has led us to see that there are 173 different majors listed. We also note there are 21 different columns of information stored per major.

In [6]:

raw_data_count = len(recent_grads)
raw_data_count

Out[6]:

173

We will now drop any rows that have missing values and compare the result with the original data set, which has 173 rows.

In [7]:

# This removes the rows with missing values from our dataframe
recent_grads = recent_grads.dropna()
# This counts the number of rows after cleaning out missing values
cleaned_data_count = len(recent_grads)

cleaned_data_count

Out[7]:

172

After removing missing values, our data frame has 172 rows, one fewer than before, meaning that only one row had any missing values.
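To see which row is affected, one option is to select the rows that contain any missing value before calling dropna(). The sketch below uses a small made-up frame: the "EXAMPLE MAJOR" names and the values are illustrative only and do not come from recent-grads.csv. On the real data, the same mask would be applied to the freshly loaded frame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw data; names and values are illustrative only.
raw = pd.DataFrame({
    "Major": ["EXAMPLE MAJOR A", "EXAMPLE MAJOR B", "EXAMPLE MAJOR C"],
    "Total": [2339.0, np.nan, 856.0],
    "Men": [2057.0, np.nan, 725.0],
})

# Boolean mask: True for every row with at least one missing value.
has_missing = raw.isna().any(axis=1)

rows_with_missing = raw[has_missing]
cleaned = raw.dropna()

print(rows_with_missing)
print(len(raw) - len(cleaned))  # number of rows removed by dropna()
```

Inspecting the affected rows first makes it easier to judge whether dropping them is reasonable.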

Now, we will focus on comparing different data via scatter plots. We will be comparing the following:

Sample Size vs Median
Sample Size vs Unemployment Rate
Full Time vs Median
Share of Women vs Unemployment Rate
Men vs Median
Women vs Median

Our first plot compares Sample Size and Median. The plot below shows that majors with larger samples do not necessarily have higher median incomes. Most of the higher median incomes occur in majors with a sample size below 1,000. In fact, a cluster of majors falls around $35,000 with sample sizes below 1,000.

In [8]:

recent_grads.plot(x = 'Sample_size', y = 'Median', kind = 'scatter')

Out[8]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f65505ff320>

The next plot checks whether there is a relationship between the unemployment rate and the popularity of each major, as measured by sample size. There seems to be no real relationship between the two: the unemployment rate is generally around 5-7% regardless of sample size. This holds even though most of the points in the plot have sample sizes below 1,000.

In [9]:

recent_grads.plot(x = 'Sample_size', y = 'Unemployment_rate', kind = 'scatter')

Out[9]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f6550643400>

Our third plot compares full-time employment to median income. As in the last plot, there doesn't seem to be an obvious link between the number of graduates employed full time and the median income, which settles around $40,000.

In [10]:

recent_grads.plot(x = 'Full_time', y = 'Median', kind = 'scatter')

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f654e543128>

When comparing the percentage of women in each major with the unemployment rate, we see virtually no correlation; the points in the scatter plot show no clear structure.

In [11]:

recent_grads.plot(x = 'ShareWomen', y = 'Unemployment_rate', kind = 'scatter')

Out[11]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f654e4ae400>
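A correlation coefficient can put a number on what a scatter plot suggests. The sketch below computes Pearson's r with Series.corr on only the ten rows shown in the head() and tail() outputs above. Ten rows are far too few to draw conclusions from, so this only illustrates the call; on the full data the same method could be applied to the recent_grads columns directly.

```python
import pandas as pd

# The ten rows visible in the head() and tail() outputs, not the full data set.
sample = pd.DataFrame({
    "ShareWomen": [0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
                   0.637293, 0.817099, 0.799859, 0.798746, 0.877960],
    "Unemployment_rate": [0.018381, 0.117241, 0.024096, 0.050125, 0.061098,
                          0.046320, 0.065112, 0.149048, 0.053621, 0.104946],
})

# Series.corr uses the Pearson correlation by default.
r = sample["ShareWomen"].corr(sample["Unemployment_rate"])
print(round(r, 3))
```

Values near 0 would support the "no relationship" reading, while values near 1 or -1 would contradict it.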

Lastly, in the next two plots we compare the number of men and the number of women in each major against median income.

In [12]:

recent_grads.plot(x = 'Men', y = 'Median', kind = 'scatter')

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f654e491400>

In [13]:

recent_grads.plot(x = 'Women', y = 'Median', kind = 'scatter')

Out[13]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f654e3eee48>
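Because the Men and Women plots are drawn separately, their axes are scaled independently, which makes them harder to compare by eye. A possible alternative, sketched here on the ten rows from the head() and tail() outputs, draws both on one figure with a shared y-axis. The Agg backend line is an assumption made so that the sketch runs without a display; inside a notebook it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# The ten rows visible in the head() and tail() outputs, not the full data set.
sample = pd.DataFrame({
    "Men": [2057, 679, 725, 1123, 21239, 3050, 522, 568, 931, 134],
    "Women": [282, 77, 131, 135, 11021, 5359, 2332, 2270, 3695, 964],
    "Median": [110000, 75000, 73000, 70000, 65000,
               26000, 25000, 25000, 23400, 22000],
})

# One row, two columns of axes, sharing the y-axis (Median).
fig, (ax_men, ax_women) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
sample.plot(x="Men", y="Median", kind="scatter", ax=ax_men, title="Men vs Median")
sample.plot(x="Women", y="Median", kind="scatter", ax=ax_women, title="Women vs Median")
fig.tight_layout()
```

With the full data, recent_grads would take the place of the sample frame.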

Now that we have looked at the relationship between a few of our categories of data, we move on to see how the data itself is distributed within each category. We will look into visualizations of the following column information:

Sample Size
Median Income
Employed
Full Time Employment
Percentage Share of Women
Unemployment Rate
Total number of Men
Total number of Women

Sample Size Histogram¶

In [29]:

recent_grads['Sample_size'].hist(bins = 20, range = (0,5000))

Out[29]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f654d83a470>

The above histogram shows clearly that the vast majority of majors have a sample size below 1,000, meaning that most majors are represented by relatively few people. In fact, only one major has a sample size above 4,000.
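With bins = 20 and range = (0, 5000), each bar covers 5000 / 20 = 250 units of sample size. The sketch below checks that arithmetic with numpy.histogram, which, as far as I know, is what matplotlib's hist relies on for binning. It uses the ten Sample_size values visible in the head() and tail() outputs rather than the full column.

```python
import numpy as np

# Sample_size values from the head() and tail() outputs, not the full column.
sample_sizes = [36, 7, 3, 16, 289, 47, 7, 13, 21, 2]

# Same binning as hist(bins = 20, range = (0, 5000)).
counts, edges = np.histogram(sample_sizes, bins=20, range=(0, 5000))

bin_width = edges[1] - edges[0]
print(bin_width)   # 250.0
print(counts[:2])  # [9 1]
```

Nine of the ten values land in the first bin, from 0 up to 250, and one lands in the second, which mirrors the heavy concentration at small sample sizes.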

Median Income Histogram¶

In [33]:

recent_grads['Median'].hist(bins = 50, range = (0,120000))

Out[33]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f654d4d0d68>

Next, the histogram of median incomes shows that most majors fall between $25,000 and $50,000, with one larger group at $60,000. We conclude that a typical graduate will most likely see a median income of about $35,000.

Employment Histogram¶

In [38]:

recent_grads['Employed'].hist(bins = 50, range = (0,90000))

Out[38]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f654d0a7d68>

Our next plot shows that in the majority of majors fewer than 20,000 graduates are employed, with a very high number of majors below 5,000. A few outliers show around 80,000 people employed and may be worth investigating if we wanted to know which majors have the most people employed.

Full Time Employment Histogram¶

In [44]:

recent_grads['Full_time'].hist(bins = 50, range = (0,140000))

Out[44]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f654cc7feb8>

In the above plot, we see that most majors have fewer than 20,000 graduates employed full time, and many have fewer than 10,000.

Percentage of Women Histogram¶

In [50]:

recent_grads['ShareWomen'].hist(bins = 10, range = (0,1))

Out[50]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f654c7f05f8>

When we consider the percentage of women in each major, the above plot shows that the largest portion of majors is between 50 and 80 percent women. This implies that women make up the majority in most majors, with only a few majors being predominantly male or almost entirely female.
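The fraction of majors in which women are the majority can also be computed directly, since the mean of a boolean Series is the fraction of True values. The sketch below uses only the ten ShareWomen values from the head() and tail() outputs, so its result says nothing about the full data set.

```python
import pandas as pd

# ShareWomen values from the head() and tail() outputs, not the full column.
share_women = pd.Series([0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
                         0.637293, 0.817099, 0.799859, 0.798746, 0.877960])

# True where women make up more than half of the major.
majority_women = share_women > 0.5

print(majority_women.sum())   # 5
print(majority_women.mean())  # 0.5
```

Applied to recent_grads['ShareWomen'], the same two lines would give the count and fraction for all majors.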

Unemployment Rate Histogram¶

In [53]:

recent_grads['Unemployment_rate'].hist(bins = 20, range = (0,0.2))

Out[53]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f654c615320>

The plot above shows that unemployment for most majors falls somewhere between 5 and 10 percent, trending to be closer to 5 or 6 percent. We do see that there are a handful of majors that have over 12 percent unemployment, with some reaching around 17 percent unemployment.

Male Population Histogram¶

In [64]:

recent_grads['Men'].hist(bins = 30, range = (0,60000))

Out[64]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f654bd27b38>

We see that most majors have fewer than 6,000 men per major, with a few majors showing more than 20,000 men. It should be noted, after some trial and error, that there are a few majors that have upwards of 100,000 men as well, but those are sparse, so we focused our attention on the lower populations of men.
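Instead of finding the upper limit by trial and error, one option is to take it from a quantile of the column. The sketch below uses the 95th percentile; both that cutoff and the ten Men values (taken from the head() and tail() outputs) are illustrative choices, not the values used for the plot above.

```python
import pandas as pd

# Men values from the head() and tail() outputs, not the full column.
men = pd.Series([2057, 679, 725, 1123, 21239, 3050, 522, 568, 931, 134],
                name="Men")

# Upper limit below which 95% of the values fall (linear interpolation).
upper = men.quantile(0.95)

# Values that would appear in a histogram limited to (0, upper).
within = men[men <= upper]
print(upper, len(within), len(men))
```

The resulting limit could then be passed to the histogram as range = (0, upper), which keeps the sparse large majors from stretching the x-axis.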

Female Population Histogram¶

In [69]:

recent_grads['Women'].hist(bins = 25, range = (0,100000))

Out[69]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f654b8a7080>

Finally, we see in the last plot shown above that the vast majority of majors have fewer than 20,000 women. We also see that over 60 majors have fewer than 4,000 women.

In [79]:

import pandas.plotting

We would now like to view the earlier visualizations together to make sense of the observations we have made so far. We will explore the following relationships using scatter matrices:

Sample Size vs Median Income
Sample Size vs Median Income vs Unemployment Rate

Sample Size vs Median Income¶

In [81]:

pd.plotting.scatter_matrix(recent_grads[['Sample_size','Median']],figsize = (10,10))

Out[81]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f654ded1e80>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f654b6735f8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f654b6430f0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f654b5f8dd8>]],
      dtype=object)

In the above scatter matrix, we see a concentration of points around a median income of $35,000 and a sample size below 1,000. This matches the histograms on the diagonal, where the most common median income is about $35,000 and the most common sample size is below 1,000 people per major.

Sample Size vs Median Income vs Unemployment Rate¶

In [82]:

pd.plotting.scatter_matrix(recent_grads[['Sample_size','Median','Unemployment_rate']],figsize = (10,10))

Out[82]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f654b636ac8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f654b49c470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f654b4666a0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f654b41be10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f654b3e6cf8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f654b3ae3c8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f654b377ac8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f654b338208>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f654b302438>]],
      dtype=object)

As our first scatter matrix focused mainly on the relationship between sample size and median income, we will mostly focus on the relationship that each of these has with unemployment.

First, looking at sample size and unemployment, the largest number of points in the scatter plot falls in the 5-7 percent unemployment range, with most below a sample size of 1,000. Both histograms confirm that these are the most common values for each column.

Second, looking at median income and unemployment, we see a weak concentration between 5-10 percent unemployment and around $30,000 median income. While this agrees with the histograms for each column, it is less obvious in the scatter plot than in our other two comparisons.

Comparing Men and Women vs Median Income¶

In [86]:

pd.plotting.scatter_matrix(recent_grads[['Men','Women','Median']],figsize = (10,10))

Out[86]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f654af61c88>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f654ae4fe48>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f654ad9e588>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f654ad5b358>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f654ad234a8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f654ace0e48>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f654acaeb38>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f654aceb978>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f654ac34eb8>]],
      dtype=object)

Our scatter matrix of the numbers of women and men against median income shows that, for both, median incomes hover around $35,000, with most points at population values below 20,000 individuals. This doesn't reveal much difference between the two populations in median income, but zooming in on the clusters might show whether there is any meaningful difference between men and women.

Finally, we will use bar plots to look at the following columns:

Percentage of women
Unemployment rate

For these columns, we will look at both the high end and low end of each to see if there is any significant information to be gained.

High Percentages of Women¶

In [104]:

# A Series has no 'x' column, and the second sort was redundant; bars are labeled by row index
recent_grads['ShareWomen'].sort_values(ascending = False).head(10).plot(kind = 'bar')

Out[104]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f6549fd4780>

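The labels on the x-axis of this bar plot are the data frame's row indices, not numbers of graduates. Because the rows are ordered by Rank, a larger index means a lower rank by median income. Of the 10 majors with the highest percentage of women, most have an index above 100, with 5 at 150 or above; only 3 fall below 100. In other words, most of the majors with a very high percentage of women sit in the lower part of the median income ranking.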
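A bar plot drawn from a single Series labels each bar with the Series index, which here is the row number rather than the major's name. A possible alternative, sketched below on the five rows from the tail() output, keeps the Major column and uses it for the x-axis. The choice of three bars and the Agg backend line are assumptions made to keep the sketch short and runnable without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import pandas as pd

# The five rows visible in the tail() output, not the full data set.
sample = pd.DataFrame({
    "Major": ["ZOOLOGY", "EDUCATIONAL PSYCHOLOGY", "CLINICAL PSYCHOLOGY",
              "COUNSELING PSYCHOLOGY", "LIBRARY SCIENCE"],
    "ShareWomen": [0.637293, 0.817099, 0.799859, 0.798746, 0.877960],
}, index=[168, 169, 170, 171, 172])

# Keep whole rows so the major names stay attached to the values.
top = sample.nlargest(3, "ShareWomen")

ax = top.plot(kind="bar", x="Major", y="ShareWomen", legend=False)
ax.set_ylabel("ShareWomen")

labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(labels)
```

On the full data, recent_grads.nlargest(10, 'ShareWomen') would select the same ten majors as the sorted Series, and nsmallest would give the low end.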

Low Percentages of Women¶

In [106]:

# A Series has no 'x' column, and the second sort was redundant; bars are labeled by row index
recent_grads['ShareWomen'].sort_values(ascending = False).tail(10).plot(kind = 'bar')

Out[106]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f6549f55128>

For the 10 majors with the lowest percentage of women, the x-axis labels are again row indices, which follow the Rank ordering by median income. Most of these majors have an index below 100, and 5 of them have an index of 11 or lower, which places them near the top of the median income ranking. The largest index among them is 111, well below most of the indices seen for the 10 majors with the highest percentage of women.

High Unemployment Rate¶

In [107]:

# A Series has no 'x' column, and the second sort was redundant; bars are labeled by row index
recent_grads['Unemployment_rate'].sort_values(ascending = False).head(10).plot(kind = 'bar')

Out[107]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f6549ea1908>

For the 10 majors with the highest unemployment rates, the x-axis labels are row indices, which follow the Rank ordering by median income. Most of these majors have an index between 50 and 105. Two have an index below 10, and one has index 170, which the tail output above identifies as Clinical Psychology, with about 15% unemployment.

Low Unemployment Rate¶

In [109]:

# A Series has no 'x' column, and the second sort was redundant; bars are labeled by row index
recent_grads['Unemployment_rate'].sort_values(ascending = False).tail(10).plot(kind = 'bar')

Out[109]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f6549d80128>

When considering the 10 majors with the lowest unemployment rates, we find that half of them report an unemployment rate of 0, with row indices somewhere between 50 and 120. One notable entry is index 0, which the head output above identifies as Petroleum Engineering, the top-ranked major by median income, with an unemployment rate just under 2%.

Final Takeaways¶

In this project, we explored how majors compare on several factors. We looked into the relationship between median income and sample size per major, between unemployment rate and percentage of women per major, and between median income and the numbers of women and men in each major.

We conclude that any graduate (male or female) can most likely expect a median income of about $35,000. Additionally, most majors have an unemployment rate somewhere in the range of 5-10 percent for their recent graduates. Finally, there doesn't seem to be a relationship between the popularity of a major and its median income.
