In this project, we will use data visualization tools to explore questions about college majors, starting incomes, and related demographics.
We will be using a data set that spans 2010-2012, released by the American Community Survey. FiveThirtyEight then cleaned the data, and that cleaned version is what we use for this project.
Our first tasks are to import our libraries and read our data from the file 'recent-grads.csv' into the dataframe 'recent_grads'.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
recent_grads = pd.read_csv('recent-grads.csv')
recent_grads.head()
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 150 | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 5180 | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 |
5 rows × 21 columns
recent_grads.iloc[0]
Rank                    1
Major_code              2419
Major                   PETROLEUM ENGINEERING
Total                   2339
Men                     2057
Women                   282
Major_category          Engineering
ShareWomen              0.120564
Sample_size             36
Employed                1976
Full_time               1849
Part_time               270
Full_time_year_round    1207
Unemployed              37
Unemployment_rate       0.0183805
Median                  110000
P25th                   95000
P75th                   125000
College_jobs            1534
Non_college_jobs        364
Low_wage_jobs           193
Name: 0, dtype: object
recent_grads.tail()
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
168 | 169 | 3609 | ZOOLOGY | 8409.0 | 3050.0 | 5359.0 | Biology & Life Science | 0.637293 | 47 | 6259 | ... | 2190 | 3602 | 304 | 0.046320 | 26000 | 20000 | 39000 | 2771 | 2947 | 743 |
169 | 170 | 5201 | EDUCATIONAL PSYCHOLOGY | 2854.0 | 522.0 | 2332.0 | Psychology & Social Work | 0.817099 | 7 | 2125 | ... | 572 | 1211 | 148 | 0.065112 | 25000 | 24000 | 34000 | 1488 | 615 | 82 |
170 | 171 | 5202 | CLINICAL PSYCHOLOGY | 2838.0 | 568.0 | 2270.0 | Psychology & Social Work | 0.799859 | 13 | 2101 | ... | 648 | 1293 | 368 | 0.149048 | 25000 | 25000 | 40000 | 986 | 870 | 622 |
171 | 172 | 5203 | COUNSELING PSYCHOLOGY | 4626.0 | 931.0 | 3695.0 | Psychology & Social Work | 0.798746 | 21 | 3777 | ... | 965 | 2738 | 214 | 0.053621 | 23400 | 19200 | 26000 | 2403 | 1245 | 308 |
172 | 173 | 3501 | LIBRARY SCIENCE | 1098.0 | 134.0 | 964.0 | Education | 0.877960 | 2 | 742 | ... | 237 | 410 | 87 | 0.104946 | 22000 | 20000 | 22000 | 288 | 338 | 192 |
5 rows × 21 columns
recent_grads.describe()
| | Rank | Major_code | Total | Men | Women | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 173.000000 | 173.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
mean | 87.000000 | 3879.815029 | 39370.081395 | 16723.406977 | 22646.674419 | 0.522223 | 356.080925 | 31192.763006 | 26029.306358 | 8832.398844 | 19694.427746 | 2416.329480 | 0.068191 | 40151.445087 | 29501.445087 | 51494.219653 | 12322.635838 | 13284.497110 | 3859.017341 |
std | 50.084928 | 1687.753140 | 63483.491009 | 28122.433474 | 41057.330740 | 0.231205 | 618.361022 | 50675.002241 | 42869.655092 | 14648.179473 | 33160.941514 | 4112.803148 | 0.030331 | 11470.181802 | 9166.005235 | 14906.279740 | 21299.868863 | 23789.655363 | 6944.998579 |
min | 1.000000 | 1100.000000 | 124.000000 | 119.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 111.000000 | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.000000 | 2403.000000 | 4549.750000 | 2177.500000 | 1778.250000 | 0.336026 | 39.000000 | 3608.000000 | 3154.000000 | 1030.000000 | 2453.000000 | 304.000000 | 0.050306 | 33000.000000 | 24000.000000 | 42000.000000 | 1675.000000 | 1591.000000 | 340.000000 |
50% | 87.000000 | 3608.000000 | 15104.000000 | 5434.000000 | 8386.500000 | 0.534024 | 130.000000 | 11797.000000 | 10048.000000 | 3299.000000 | 7413.000000 | 893.000000 | 0.067961 | 36000.000000 | 27000.000000 | 47000.000000 | 4390.000000 | 4595.000000 | 1231.000000 |
75% | 130.000000 | 5503.000000 | 38909.750000 | 14631.000000 | 22553.750000 | 0.703299 | 338.000000 | 31433.000000 | 25147.000000 | 9948.000000 | 16891.000000 | 2393.000000 | 0.087557 | 45000.000000 | 33000.000000 | 60000.000000 | 14444.000000 | 11783.000000 | 3466.000000 |
max | 173.000000 | 6403.000000 | 393735.000000 | 173809.000000 | 307087.000000 | 0.968954 | 4212.000000 | 307933.000000 | 251540.000000 | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
Exploring the head and tail of the data shows that there are 173 majors listed (row labels 0 through 172). We also note that 21 columns of information are stored per major.
raw_data_count = len(recent_grads)
raw_data_count
173
We will now remove any rows that have missing values and compare the result with the original data set. As shown above, the original set has 173 rows.
# This removes the rows with missing values from our dataframe
recent_grads = recent_grads.dropna()
# This counts the number of rows after cleaning out missing values
cleaned_data_count = len(recent_grads)
cleaned_data_count
172
After removing missing values, our data frame has 172 rows, one fewer than before, meaning that only one row had any missing values.
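Before dropping rows, it can help to see which row is affected and which columns are missing. The sketch below is only an illustration under stated assumptions: it uses a small hand-built frame in which the first two rows copy values shown in the head() output, while the third row and its name are hypothetical and exist only so that there is a missing value to find.

```python
import numpy as np
import pandas as pd

# Small stand-in for recent_grads. The first two rows copy values from the
# head() output; the third row is hypothetical and only carries the NaNs.
sample = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "CHEMICAL ENGINEERING", "HYPOTHETICAL MAJOR"],
    "Total": [2339.0, 32260.0, np.nan],
    "Men": [2057.0, 21239.0, np.nan],
    "Women": [282.0, 11021.0, np.nan],
    "Median": [110000, 65000, 40000],
})

# Count missing values per column.
missing_per_column = sample.isna().sum()

# Select the rows that contain at least one missing value.
rows_with_missing = sample[sample.isna().any(axis=1)]

# dropna() removes exactly those rows.
cleaned = sample.dropna()

print(missing_per_column)
print(rows_with_missing["Major"].tolist())
```

Applied to the full data set, the same two expressions would be run on recent_grads before the call to dropna().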
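Now, we will focus on comparing pairs of columns via scatter plots. We will be comparing the following: Sample_size and Median, Sample_size and Unemployment_rate, Full_time and Median, ShareWomen and Unemployment_rate, Men and Median, and Women and Median.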
Our first plot compares Sample_size and Median. What we find in the plot below is that majors with more students in the sample do not necessarily have a higher median income. The majority of the higher median incomes occur in majors with a sample size below 1,000. In fact, we see a cluster of majors around $35,000 in median income with fewer than 1,000 individuals sampled.
recent_grads.plot(x = 'Sample_size', y = 'Median', kind = 'scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x7f65505ff320>
The next plot examines whether there is a relationship between unemployment rate and the popularity of each major, using sample size as a stand-in for popularity. There seems to be no real relationship between the two, as the unemployment rate is generally around 5-7% regardless of the major. This holds even though the majority of the points in our plot fall at sample sizes below 1,000 individuals.
recent_grads.plot(x = 'Sample_size', y = 'Unemployment_rate', kind = 'scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x7f6550643400>
Our third plot compares full-time employment to median income. As in the last plot, there doesn't seem to be an obvious link between the number of graduates employed full time and the median income, which settles around $40,000.
recent_grads.plot(x = 'Full_time', y = 'Median', kind = 'scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x7f654e543128>
When comparing the percentage of women in each major with the unemployment rate, we see virtually no correlation: the scatter plot shows no discernible structure.
recent_grads.plot(x = 'ShareWomen', y = 'Unemployment_rate', kind = 'scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x7f654e4ae400>
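A visual impression of "no correlation" can be backed up with a number. The sketch below computes the Pearson correlation coefficient with Series.corr. It is only an illustration: the frame holds just the ten rows shown in the head() and tail() outputs, so the coefficient it prints says nothing reliable about all 172 majors.

```python
import pandas as pd

# Ten rows copied from the head() and tail() outputs above (not the full data).
sample = pd.DataFrame({
    "ShareWomen": [0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
                   0.637293, 0.817099, 0.799859, 0.798746, 0.877960],
    "Unemployment_rate": [0.018381, 0.117241, 0.024096, 0.050125, 0.061098,
                          0.046320, 0.065112, 0.149048, 0.053621, 0.104946],
})

# Pearson correlation: values near 0 suggest no linear relationship,
# values near -1 or 1 suggest a strong one.
r = sample["ShareWomen"].corr(sample["Unemployment_rate"])
print(f"Pearson r = {r:.3f}")
```

On the full data set the equivalent call would be recent_grads['ShareWomen'].corr(recent_grads['Unemployment_rate']).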
Lastly, in the next two plots we look at median income against the number of men and the number of women in each major.
recent_grads.plot(x = 'Men', y = 'Median', kind = 'scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x7f654e491400>
recent_grads.plot(x = 'Women', y = 'Median', kind = 'scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x7f654e3eee48>
Now that we have looked at the relationship between a few of our columns, we move on to see how the data is distributed within each column. We will look at histograms of the following columns: Sample_size, Median, Employed, Full_time, ShareWomen, Unemployment_rate, Men, and Women.
recent_grads['Sample_size'].hist(bins = 20, range = (0,5000))
<matplotlib.axes._subplots.AxesSubplot at 0x7f654d83a470>
The above histogram shows clearly that the vast majority of majors have a sample size below 1,000, meaning that most majors have smaller numbers of students in the sample. In fact, only one major has a sample size beyond 4,000.
recent_grads['Median'].hist(bins = 50, range = (0,120000))
<matplotlib.axes._subplots.AxesSubplot at 0x7f654d4d0d68>
Next, the histogram of median incomes shows that they are mainly distributed between $25,000 and $50,000, with one larger collection at $60,000. We conclude that the median income for most majors is likely to be around $35,000.
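The counts behind a histogram can also be read as a table, which makes statements about ranges easier to check. The sketch below uses pd.cut to bin median incomes and then counts each bin. Two assumptions to note: the values are only the ten medians visible in the head() and tail() outputs, and the bin width of $20,000 is chosen for readability rather than matching the 50 bins used in the plot.

```python
import pandas as pd

# Median incomes from the ten rows shown in head() and tail() (not the full data).
medians = pd.Series(
    [110000, 75000, 73000, 70000, 65000, 26000, 25000, 25000, 23400, 22000],
    name="Median",
)

# Bin edges 0, 20000, ..., 120000 give six left-closed bins such as [20000, 40000).
bin_edges = list(range(0, 120001, 20000))
binned = pd.cut(medians, bins=bin_edges, right=False)

# sort=False keeps the bins in ascending order and keeps empty bins.
counts = binned.value_counts(sort=False)
print(counts)
```

With the full column, recent_grads['Median'] would take the place of the hand-built series.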
recent_grads['Employed'].hist(bins = 50, range = (0,90000))
<matplotlib.axes._subplots.AxesSubplot at 0x7f654d0a7d68>
Our next plot shows that in the majority of majors fewer than 20,000 graduates are employed, with a very high number of majors below 5,000. We do see a few outliers with around 80,000 people employed, which may be worth investigating if we wanted to know which majors have the most people employed.
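To follow up on which majors have the most people employed, DataFrame.nlargest returns the rows with the largest values in a column. A minimal sketch, assuming only the ten rows visible in the head() and tail() outputs, so the majors it names are the largest among those ten and not among all majors:

```python
import pandas as pd

# Ten rows copied from the head() and tail() outputs above (not the full data).
sample = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
              "METALLURGICAL ENGINEERING", "NAVAL ARCHITECTURE AND MARINE ENGINEERING",
              "CHEMICAL ENGINEERING", "ZOOLOGY", "EDUCATIONAL PSYCHOLOGY",
              "CLINICAL PSYCHOLOGY", "COUNSELING PSYCHOLOGY", "LIBRARY SCIENCE"],
    "Employed": [1976, 640, 648, 758, 25694, 6259, 2125, 2101, 3777, 742],
})

# Keep the three rows with the highest Employed values, largest first.
top_employed = sample.nlargest(3, "Employed")
print(top_employed)
```

On the full data set, recent_grads.nlargest(5, 'Employed')[['Major', 'Employed']] would list the outliers seen at the right edge of the histogram.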
recent_grads['Full_time'].hist(bins = 50, range = (0,140000))
<matplotlib.axes._subplots.AxesSubplot at 0x7f654cc7feb8>
In the above plot, we see that most majors have fewer than 20,000 individuals employed full time, with many under 10,000.
recent_grads['ShareWomen'].hist(bins = 10, range = (0,1))
<matplotlib.axes._subplots.AxesSubplot at 0x7f654c7f05f8>
When we consider the percentage of women in each major, we notice in the above plot that the largest portion of majors are between 50 and 80 percent women. This implies that women make up the majority in a larger portion of majors, with only a few majors predominantly male or almost completely female.
recent_grads['Unemployment_rate'].hist(bins = 20, range = (0,0.2))
<matplotlib.axes._subplots.AxesSubplot at 0x7f654c615320>
The plot above shows that unemployment for most majors falls somewhere between 5 and 10 percent, trending to be closer to 5 or 6 percent. We do see that there are a handful of majors that have over 12 percent unemployment, with some reaching around 17 percent unemployment.
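The majors in the high-unemployment tail can be listed directly with a boolean filter. A minimal sketch, assuming only the ten rows visible in the head() and tail() outputs and using the 12 percent threshold mentioned above:

```python
import pandas as pd

# Ten rows copied from the head() and tail() outputs above (not the full data).
sample = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
              "METALLURGICAL ENGINEERING", "NAVAL ARCHITECTURE AND MARINE ENGINEERING",
              "CHEMICAL ENGINEERING", "ZOOLOGY", "EDUCATIONAL PSYCHOLOGY",
              "CLINICAL PSYCHOLOGY", "COUNSELING PSYCHOLOGY", "LIBRARY SCIENCE"],
    "Unemployment_rate": [0.018381, 0.117241, 0.024096, 0.050125, 0.061098,
                          0.046320, 0.065112, 0.149048, 0.053621, 0.104946],
})

# Keep only the majors whose unemployment rate exceeds 12 percent.
high_unemployment = sample[sample["Unemployment_rate"] > 0.12]
print(high_unemployment)
```

The same filter on recent_grads would return every major in the right-hand tail of the histogram.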
recent_grads['Men'].hist(bins = 30, range = (0,60000))
<matplotlib.axes._subplots.AxesSubplot at 0x7f654bd27b38>
We see that most majors have fewer than 6,000 men, with a few majors showing more than 20,000 men. After some trial and error, we also found a few majors with upwards of 100,000 men, but those are sparse, so we focused the histogram on the lower populations of men.
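Instead of finding a histogram range by trial and error, a quantile can suggest an upper limit and show how many majors fall beyond it. The sketch below is one possible approach, not the method used above. It assumes only the ten Men values visible in the head() and tail() outputs, and the 90th percentile is an arbitrary illustrative choice.

```python
import pandas as pd

# Men counts from the ten rows shown in head() and tail() (not the full data).
men = pd.Series([2057, 679, 725, 1123, 21239, 3050, 522, 568, 931, 134], name="Men")

# Use the 90th percentile as a candidate upper limit for the histogram range.
upper = men.quantile(0.90)

# Count how many majors would fall outside that range.
outside = int((men > upper).sum())
print(f"90th percentile: {upper:.0f}, majors above it: {outside}")
```

The resulting limit could then be passed to the histogram, for example recent_grads['Men'].hist(bins = 30, range = (0, upper)).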
recent_grads['Women'].hist(bins = 25, range = (0,100000))
<matplotlib.axes._subplots.AxesSubplot at 0x7f654b8a7080>
Finally, we see in the last plot shown above that the vast majority of majors have fewer than 20,000 women. We also see that over 60 majors have fewer than 4,000 women.
import pandas.plotting
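We would like to see the visualizations that we produced before side by side, to make sense of the observations we have made so far. We will explore the following relationships using scatter matrices: Sample_size and Median; Sample_size, Median, and Unemployment_rate; and Men, Women, and Median.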
pd.plotting.scatter_matrix(recent_grads[['Sample_size','Median']],figsize = (10,10))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f654ded1e80>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f654b6735f8>], [<matplotlib.axes._subplots.AxesSubplot object at 0x7f654b6430f0>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f654b5f8dd8>]], dtype=object)
In the above scatter matrix, we see a concentration of points around a median income of $35,000 and a sample size below 1,000. This matches the histograms on the diagonal, where median income is most often near $35,000 and most sample sizes fall below 1,000.
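For reference, here is a minimal, self-contained sketch of a scatter matrix built with pandas.plotting.scatter_matrix. It assumes only the ten rows visible in the head() and tail() outputs, and it selects the non-interactive "Agg" backend so that it runs outside a notebook. Inside a notebook with %matplotlib inline, the backend line is not needed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import pandas as pd
from pandas.plotting import scatter_matrix

# Ten rows copied from the head() and tail() outputs above (not the full data).
sample = pd.DataFrame({
    "Sample_size": [36, 7, 3, 16, 289, 47, 7, 13, 21, 2],
    "Median": [110000, 75000, 73000, 70000, 65000, 26000, 25000, 25000, 23400, 22000],
})

# Returns a 2-D array of Axes: histograms on the diagonal,
# pairwise scatter plots in the off-diagonal cells.
axes = scatter_matrix(sample, figsize=(6, 6), diagonal="hist")
print(axes.shape)
```

The returned array has one row and one column of plots per input column, which is why the output of each scatter matrix cell is printed as a nested array of Axes objects.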
pd.plotting.scatter_matrix(recent_grads[['Sample_size','Median','Unemployment_rate']],figsize = (10,10))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f654b636ac8>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f654b49c470>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f654b4666a0>], [<matplotlib.axes._subplots.AxesSubplot object at 0x7f654b41be10>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f654b3e6cf8>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f654b3ae3c8>], [<matplotlib.axes._subplots.AxesSubplot object at 0x7f654b377ac8>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f654b338208>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f654b302438>]], dtype=object)
As our first scatter matrix focused mainly on the relationship between sample size and median income, we will mostly focus on the relationship that each of these has with unemployment.
First, if we look into the relationship between sample size and unemployment, we see that the largest number of points in the scatter plot falls in the 5-7 percent unemployment range, with most at sample sizes below 1,000 individuals. This is confirmed by both histograms, which show that these are the most common values for each column.
Second, if we explore the relationship between median income and unemployment, we see a weak concentration of points between 5 and 10 percent unemployment and around $30,000 median income. While this coincides with our histograms for each, it is less obvious in the scatter plot than in our other two comparisons.
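A correlation matrix is a numeric companion to a scatter matrix: it gives one coefficient for each pair of columns. The sketch below uses DataFrame.corr with its default Pearson method. It assumes only the ten rows visible in the head() and tail() outputs, so the coefficients are illustrative and should not be read as results for the full data set.

```python
import pandas as pd

# Ten rows copied from the head() and tail() outputs above (not the full data).
sample = pd.DataFrame({
    "Sample_size": [36, 7, 3, 16, 289, 47, 7, 13, 21, 2],
    "Median": [110000, 75000, 73000, 70000, 65000, 26000, 25000, 25000, 23400, 22000],
    "Unemployment_rate": [0.018381, 0.117241, 0.024096, 0.050125, 0.061098,
                          0.046320, 0.065112, 0.149048, 0.053621, 0.104946],
})

# Pairwise Pearson correlations; the diagonal is always 1.
corr_matrix = sample.corr()
print(corr_matrix.round(3))
```

On the full data, recent_grads[['Sample_size','Median','Unemployment_rate']].corr() would give the coefficients that correspond to the panels of the scatter matrix.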
pd.plotting.scatter_matrix(recent_grads[['Men','Women','Median']],figsize = (10,10))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f654af61c88>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f654ae4fe48>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f654ad9e588>], [<matplotlib.axes._subplots.AxesSubplot object at 0x7f654ad5b358>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f654ad234a8>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f654ace0e48>], [<matplotlib.axes._subplots.AxesSubplot object at 0x7f654acaeb38>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f654aceb978>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f654ac34eb8>]], dtype=object)
Our scatter matrix of the populations of women and men against median income shows that, for both, median incomes hover around $35,000, with most points at population values below 20,000 individuals. This doesn't reveal much difference between the two populations with regard to median income, but we might learn more by zooming in on the clusters to see whether there is any meaningful difference between men and women.
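Zooming in on the cluster amounts to filtering the rows before plotting. A minimal sketch of that filter, assuming only the ten rows visible in the head() and tail() outputs and using the 20,000 cutoff mentioned above:

```python
import pandas as pd

# Ten rows copied from the head() and tail() outputs above (not the full data).
sample = pd.DataFrame({
    "Men": [2057, 679, 725, 1123, 21239, 3050, 522, 568, 931, 134],
    "Women": [282, 77, 131, 135, 11021, 5359, 2332, 2270, 3695, 964],
    "Median": [110000, 75000, 73000, 70000, 65000, 26000, 25000, 25000, 23400, 22000],
})

# Keep majors where both populations are below 20,000.
# Each condition needs its own parentheses when combined with &.
zoomed = sample[(sample["Men"] < 20000) & (sample["Women"] < 20000)]
print(len(sample), "rows before,", len(zoomed), "rows after")
```

The filtered frame can then be passed to the scatter matrix in place of the full one, so the axes are no longer stretched by the largest majors.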
Finally, we will use bar plots to look at the following columns: ShareWomen and Unemployment_rate.
For these columns, we will look at both the high end and low end of each to see if there is any significant information to be gained.
recent_grads['ShareWomen'].sort_values(ascending = False).head(10).plot(kind = 'bar')
<matplotlib.axes._subplots.AxesSubplot at 0x7f6549fd4780>
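In this bar plot, each bar is labeled with the row index of a major (its Rank minus one), not with a number of graduates. The 10 majors with the highest percentage of women mostly have row labels above 100, with 5 at 150 or higher. Only 3 of them have a row label below 100. Since Rank orders the majors by median income, most of the majors with a very high percentage of women sit in the lower part of the income ranking.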
recent_grads['ShareWomen'].sort_values(ascending = False).tail(10).plot(kind = 'bar')
<matplotlib.axes._subplots.AxesSubplot at 0x7f6549f55128>
For the 10 majors with the lowest percentage of women, the bars are again labeled by row index (Rank minus one) and not by number of graduates. The majority have row labels below 100, and 5 of these majors have a row label of 11 or lower, which places them among the top-ranked majors by median income. The largest row label in this group is 111, far lower than most of the labels among the 10 majors with the highest percentage of women.
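A bar plot drawn from a single column labels each bar with the row index. One way to show major names instead is to make Major the index before sorting and plotting. A minimal sketch, assuming only the ten rows visible in the head() and tail() outputs, so the "top" majors here are the top among those ten:

```python
import pandas as pd

# Ten rows copied from the head() and tail() outputs above (not the full data).
sample = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
              "METALLURGICAL ENGINEERING", "NAVAL ARCHITECTURE AND MARINE ENGINEERING",
              "CHEMICAL ENGINEERING", "ZOOLOGY", "EDUCATIONAL PSYCHOLOGY",
              "CLINICAL PSYCHOLOGY", "COUNSELING PSYCHOLOGY", "LIBRARY SCIENCE"],
    "ShareWomen": [0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
                   0.637293, 0.817099, 0.799859, 0.798746, 0.877960],
})

# Index by Major so that each value carries its major's name as a label.
top_share = (
    sample.set_index("Major")["ShareWomen"]
    .sort_values(ascending=False)
    .head(3)
)
print(top_share)
```

Calling top_share.plot(kind = 'barh') would then draw one horizontal bar per major, labeled with the major's name.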
recent_grads['Unemployment_rate'].sort_values(ascending = False).head(10).plot(kind = 'bar')
<matplotlib.axes._subplots.AxesSubplot at 0x7f6549ea1908>
For the 10 majors with the highest unemployment rate, the bars are labeled by row index (Rank minus one), not by number of graduates. The majority of these majors have row labels between 50 and 105. Two have a row label below 10, and one is row 170 (Clinical Psychology in the tail output above), with about 15% unemployment.
recent_grads['Unemployment_rate'].sort_values(ascending = False).tail(10).plot(kind = 'bar')
<matplotlib.axes._subplots.AxesSubplot at 0x7f6549d80128>
When considering the 10 majors with the lowest unemployment rates, we find that half of these majors recorded 0% unemployment, with row labels (row index, not number of graduates) between 50 and 120. One bar is labeled 0: this is row 0, Petroleum Engineering in the head output above, with an unemployment rate just under 2%.
In this project, we explored how majors relate to one another across several factors. We looked into the relationship between median income and sample size per major, between unemployment rate and percentage of women per major, and between median income and the numbers of women and men in each major.
We conclude that the median income for most majors, whatever their mix of men and women, is around $35,000. Additionally, most majors have an unemployment rate somewhere in the range of 5-10 percent for their recent graduates. Finally, there doesn't seem to be a relationship between the popularity of a major and its median income.