This sample project visualizes data on the earnings of college majors. We'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012.
The data comes from the American Community Survey; you can download the cleaned dataset from this GitHub repo.
Each row in the dataset represents a different major in college and contains information on gender diversity, employment rates, median salaries, and more. Here are some of the columns in the dataset:
Using visualizations, we can start to explore questions from the dataset like:
We'll explore how to do these and more while primarily working in pandas. Before we start creating data visualizations, let's import the libraries we need and remove rows containing null values.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
recent_grads = pd.read_csv('recent-grads.csv')
recent_grads.iloc[0]
Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object
In the first row we can see the data for Petroleum Engineering: it is a very well-paid major (median salary of 110,000) but has a small share of women (about 12%).
Now we can look at the first five and the last five majors in the list.
recent_grads.head()
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 150 | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 5180 | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 |
5 rows × 21 columns
recent_grads.tail()
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
168 | 169 | 3609 | ZOOLOGY | 8409.0 | 3050.0 | 5359.0 | Biology & Life Science | 0.637293 | 47 | 6259 | ... | 2190 | 3602 | 304 | 0.046320 | 26000 | 20000 | 39000 | 2771 | 2947 | 743 |
169 | 170 | 5201 | EDUCATIONAL PSYCHOLOGY | 2854.0 | 522.0 | 2332.0 | Psychology & Social Work | 0.817099 | 7 | 2125 | ... | 572 | 1211 | 148 | 0.065112 | 25000 | 24000 | 34000 | 1488 | 615 | 82 |
170 | 171 | 5202 | CLINICAL PSYCHOLOGY | 2838.0 | 568.0 | 2270.0 | Psychology & Social Work | 0.799859 | 13 | 2101 | ... | 648 | 1293 | 368 | 0.149048 | 25000 | 25000 | 40000 | 986 | 870 | 622 |
171 | 172 | 5203 | COUNSELING PSYCHOLOGY | 4626.0 | 931.0 | 3695.0 | Psychology & Social Work | 0.798746 | 21 | 3777 | ... | 965 | 2738 | 214 | 0.053621 | 23400 | 19200 | 26000 | 2403 | 1245 | 308 |
172 | 173 | 3501 | LIBRARY SCIENCE | 1098.0 | 134.0 | 964.0 | Education | 0.877960 | 2 | 742 | ... | 237 | 410 | 87 | 0.104946 | 22000 | 20000 | 22000 | 288 | 338 | 192 |
5 rows × 21 columns
Now we check the column types and non-null counts with info(), then generate summary statistics for all of the numeric columns with describe().
recent_grads.info()
recent_grads.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 21 columns):
Rank                    173 non-null int64
Major_code              173 non-null int64
Major                   173 non-null object
Total                   172 non-null float64
Men                     172 non-null float64
Women                   172 non-null float64
Major_category          173 non-null object
ShareWomen              172 non-null float64
Sample_size             173 non-null int64
Employed                173 non-null int64
Full_time               173 non-null int64
Part_time               173 non-null int64
Full_time_year_round    173 non-null int64
Unemployed              173 non-null int64
Unemployment_rate       173 non-null float64
Median                  173 non-null int64
P25th                   173 non-null int64
P75th                   173 non-null int64
College_jobs            173 non-null int64
Non_college_jobs        173 non-null int64
Low_wage_jobs           173 non-null int64
dtypes: float64(5), int64(14), object(2)
memory usage: 28.5+ KB
| | Rank | Major_code | Total | Men | Women | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 173.000000 | 173.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
mean | 87.000000 | 3879.815029 | 39370.081395 | 16723.406977 | 22646.674419 | 0.522223 | 356.080925 | 31192.763006 | 26029.306358 | 8832.398844 | 19694.427746 | 2416.329480 | 0.068191 | 40151.445087 | 29501.445087 | 51494.219653 | 12322.635838 | 13284.497110 | 3859.017341 |
std | 50.084928 | 1687.753140 | 63483.491009 | 28122.433474 | 41057.330740 | 0.231205 | 618.361022 | 50675.002241 | 42869.655092 | 14648.179473 | 33160.941514 | 4112.803148 | 0.030331 | 11470.181802 | 9166.005235 | 14906.279740 | 21299.868863 | 23789.655363 | 6944.998579 |
min | 1.000000 | 1100.000000 | 124.000000 | 119.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 111.000000 | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.000000 | 2403.000000 | 4549.750000 | 2177.500000 | 1778.250000 | 0.336026 | 39.000000 | 3608.000000 | 3154.000000 | 1030.000000 | 2453.000000 | 304.000000 | 0.050306 | 33000.000000 | 24000.000000 | 42000.000000 | 1675.000000 | 1591.000000 | 340.000000 |
50% | 87.000000 | 3608.000000 | 15104.000000 | 5434.000000 | 8386.500000 | 0.534024 | 130.000000 | 11797.000000 | 10048.000000 | 3299.000000 | 7413.000000 | 893.000000 | 0.067961 | 36000.000000 | 27000.000000 | 47000.000000 | 4390.000000 | 4595.000000 | 1231.000000 |
75% | 130.000000 | 5503.000000 | 38909.750000 | 14631.000000 | 22553.750000 | 0.703299 | 338.000000 | 31433.000000 | 25147.000000 | 9948.000000 | 16891.000000 | 2393.000000 | 0.087557 | 45000.000000 | 33000.000000 | 60000.000000 | 14444.000000 | 11783.000000 | 3466.000000 |
max | 173.000000 | 6403.000000 | 393735.000000 | 173809.000000 | 307087.000000 | 0.968954 | 4212.000000 | 307933.000000 | 251540.000000 | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
As the results show, the Total, Men, Women, and ShareWomen columns each have one null value (172 non-null entries out of 173), so we are going to delete the affected row to clean the dataset.
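Rather than reading the missing values off the info() output, one way to confirm them is to count nulls per column directly. The sketch below uses a tiny made-up DataFrame (the name `toy` and its values are illustrative assumptions, not part of the dataset) so that it runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for recent_grads; the values are illustrative only.
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, np.nan, 300.0],
    "Men": [60.0, np.nan, 120.0],
    "Median": [50000, 40000, 30000],
})

# Count the missing values in each column and keep only the affected columns.
null_counts = toy.isna().sum()
print(null_counts[null_counts > 0])

# Show the rows that contain at least one missing value.
rows_with_nulls = toy[toy.isna().any(axis=1)]
print(rows_with_nulls)
```

Running the same two steps on recent_grads should point to the columns and the row that need attention.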
Next we'll look up the shape (number of rows and columns) of recent_grads and assign it to raw_data_count. Then we'll drop rows containing missing values and assign the resulting DataFrame back to recent_grads.
raw_data_count = recent_grads.shape
raw_data_count
(173, 21)
recent_grads = recent_grads.dropna()
cleaned_data_count = recent_grads.shape
cleaned_data_count
(172, 21)
If you compare cleaned_data_count and raw_data_count, you'll notice that only one row contained missing values and was dropped.
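The comparison can also be done in code. Here is a minimal sketch on a made-up DataFrame (the name `toy` and its values are assumptions for illustration) that counts the dropped rows and shows that dropna() keeps the original index labels.

```python
import numpy as np
import pandas as pd

# Toy stand-in for recent_grads; the values are illustrative only.
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Total": [100.0, np.nan, 300.0, 400.0],
})

raw_rows = len(toy)            # number of rows before cleaning
cleaned = toy.dropna()         # drop rows with any missing value
cleaned_rows = len(cleaned)    # number of rows after cleaning
dropped_rows = raw_rows - cleaned_rows
print(f"dropped {dropped_rows} of {raw_rows} rows")

# dropna() does not renumber the index: label 1 is simply gone.
print(list(cleaned.index))
```

Because the index labels are not renumbered, label-based and position-based selection can differ after cleaning; calling reset_index(drop=True) is one option if a contiguous index is preferred.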
In this part we are going to use scatter plots to explore our data. The goal is to see what data we have and how the columns relate to each other.
First, I'll generate scatter plots to explore the following relations:
recent_grads.plot(x='Sample_size', y='Median', kind='scatter')
recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind='scatter')
recent_grads.plot(x='Full_time', y='Median', kind='scatter')
recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter')
recent_grads.plot(x='Men', y='Median', kind='scatter')
recent_grads.plot(x='Women', y='Median', kind='scatter')
Then I'll use the DataFrame.plot method to generate scatter plots and use them to explore the following questions:
# For the first question
recent_grads.plot(x='Total', y='Median', kind='scatter')
recent_grads.plot(x='ShareWomen', y='Median', kind='scatter')
recent_grads.plot(x='Full_time', y='Median', kind='scatter')
From the plots generated, the answers to the questions are:
(a) There’s no significant relation between the number of students in a major (its popularity) and the median salary.
(b) There’s a weak negative correlation between the share of women in a major and the median salary.
(c) There’s no significant relation between the number of full-time employees and the median salary.
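Reading correlations off a scatter plot is subjective, so it can help to back the visual impression with a number. The sketch below computes Pearson correlation coefficients with pandas; the DataFrame `toy` and its values are made up for illustration and are not taken from the dataset, so the printed coefficients say nothing about the real data.

```python
import pandas as pd

# Illustrative values only; not taken from the dataset.
toy = pd.DataFrame({
    "ShareWomen": [0.10, 0.30, 0.50, 0.70, 0.90],
    "Median": [70000, 52000, 45000, 40000, 33000],
    "Total": [2000, 90000, 15000, 300000, 8000],
})

# Pearson correlation between two columns (always between -1 and 1).
r_share = toy["ShareWomen"].corr(toy["Median"])
r_total = toy["Total"].corr(toy["Median"])
print(round(r_share, 2), round(r_total, 2))

# Full pairwise correlation matrix for the numeric columns.
print(toy.corr())
```

Applied to recent_grads, the same calls would give a coefficient for each of the three questions; values near 0 suggest no linear relation, and negative values suggest that one column tends to fall as the other rises.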
Now, we'll generate histograms in separate Jupyter notebook cells to explore the distributions of the following columns:
recent_grads['Sample_size'].hist(bins=25, range=(0,5000))
recent_grads['Median'].hist(bins=20, range=(0,120000))
recent_grads['Employed'].hist(bins=20, range=(0,350000))
recent_grads['Full_time'].hist(bins=20, range=(0,300000))
recent_grads['ShareWomen'].hist(bins=20, range=(0,1))
recent_grads['Unemployment_rate'].hist(bins=20, range=(0,0.2))
recent_grads['Men'].hist(bins=20, range=(0,200000))
recent_grads['Women'].hist(bins=20, range=(0,200000))
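Since the histogram cells differ only in the column name and the range, they could also be drawn in a loop on a single figure with titles and axis labels. This is a rough sketch on made-up values; the `toy` DataFrame and the `hist_ranges` dictionary are assumptions for illustration, and the Agg backend line is only there so the code runs as a plain script (it is not needed in a notebook that uses %matplotlib inline).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative values only; not taken from the dataset.
toy = pd.DataFrame({
    "ShareWomen": [0.12, 0.31, 0.57, 0.67, 0.72, 0.92],
    "Unemployment_rate": [0.02, 0.05, 0.06, 0.07, 0.09, 0.15],
})

# Column name -> x-axis range used for its histogram.
hist_ranges = {"ShareWomen": (0, 1), "Unemployment_rate": (0, 0.2)}

# One subplot per column, side by side.
fig, axes = plt.subplots(nrows=1, ncols=len(hist_ranges), figsize=(10, 4))
for ax, (column, value_range) in zip(axes, hist_ranges.items()):
    toy[column].hist(ax=ax, bins=20, range=value_range)
    ax.set_title(column)
    ax.set_xlabel(column)
    ax.set_ylabel("Number of majors")
fig.tight_layout()
```

Adding more entries to the dictionary extends the grid without repeating the plotting call.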
We use some of these plots to explore and answer the following questions:
-- What percent of majors are predominantly male? Predominantly female?
Focusing on the x-values above 0.6 (60%) in the histogram of the ShareWomen column, we can see that in about 77 majors more than 60 percent of the students are women. That is 44.77% of all majors; the remaining 55.23% are majors where women make up 60% or less of the students. We calculated these numbers from the following frequencies for each bar above 0.6:
-- What's the most common median salary range?
The histogram of the Median column shows that the most common median salary range is between 30,000 and 45,000 USD.
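Bar heights are hard to read precisely from a plot, so the frequencies can also be computed. The sketch below is one way to do it, assuming NumPy is available: np.histogram returns the same kind of bin counts the histogram draws. The `share_women` values are made up for illustration; only the final two printed figures use numbers quoted in the text (77 of 172 majors).

```python
import numpy as np
import pandas as pd

# Illustrative ShareWomen values; not taken from the dataset.
share_women = pd.Series([0.12, 0.31, 0.57, 0.67, 0.72, 0.92])

# Same binning as the histogram: 20 bins over [0, 1], so each bin is 0.05 wide.
counts, edges = np.histogram(share_women, bins=20, range=(0, 1))

# Bins 12..19 cover 0.60 to 1.00, i.e. the bars to the right of 0.6.
majors_above_60 = counts[12:].sum()

# Equivalent check without binning: the mean of a boolean mask is a fraction.
pct_above_60 = (share_women > 0.6).mean() * 100
print(majors_above_60, pct_above_60)

# The arithmetic behind the percentages: 77 of 172 majors, and the other 95.
# 77 / 172 is 0.44767..., which rounds to 44.77%.
print(round(77 / 172 * 100, 2), round(95 / 172 * 100, 2))
```

The boolean-mask version avoids depending on bin edges at all, which makes it the safer choice when a threshold does not line up with a bin boundary.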
A scatter matrix plot is useful for viewing many pairwise relationships at once before drilling down into more detailed results. We will use these plots to verify our claims from above.
from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(6,6))
scatter_matrix(recent_grads[['Sample_size', 'Median', 'Unemployment_rate']], figsize=(10,10))
Bar plots are very useful for visualizing the data and seeing the nuances. I like the results with this kind of plot. We only need to change the kind parameter in the .plot method to 'bar'.
Another alternative is to use the plot.bar method.
We will use bar plots for the following:
Use bar plots to compare the percentages of women (ShareWomen) from the first ten rows and last ten rows of the recent_grads dataframe.
Use bar plots to compare the unemployment rate (Unemployment_rate) from the first ten rows and last ten rows of the recent_grads dataframe.
recent_grads.head(10).plot.bar(x='Major', y='ShareWomen', legend=False)
# tail(10) selects the last ten rows; [163:] yields only nine of the 172 cleaned rows
recent_grads.tail(10).plot.bar(x='Major', y='ShareWomen', legend=False)
recent_grads.head(10).plot.bar(x='Major', y='Unemployment_rate', legend=False)
# tail(10) selects the last ten rows; [163:] yields only nine of the 172 cleaned rows
recent_grads.tail(10).plot.bar(x='Major', y='Unemployment_rate', legend=False)
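A note on selecting the last ten rows: after the row with missing values was dropped, the DataFrame has 172 rows, so a hard-coded positional slice such as recent_grads[163:] covers positions 163 to 171, which is nine rows rather than ten. The sketch below reproduces the effect on a small made-up DataFrame (the `toy` data is an assumption for illustration).

```python
import numpy as np
import pandas as pd

# Toy frame with 12 rows, one of which has a missing value (illustrative only).
toy = pd.DataFrame({
    "Major": [f"M{i}" for i in range(12)],
    "ShareWomen": [0.1, 0.2, np.nan, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.3, 0.2, 0.1],
})
cleaned = toy.dropna()  # 11 rows remain

# A start position computed for the original 12 rows (12 - 10 = 2)
# now returns only 9 rows, because the frame is one row shorter.
by_position = cleaned[2:]

# tail(10) always returns the last ten rows, whatever the length.
last_ten = cleaned.tail(10)
print(len(by_position), len(last_ten))
```

Using head(10) and tail(10) keeps the selection correct even if more rows are dropped later.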
In the following, we compare the number of men with the number of women in each category of majors.
major_categories = recent_grads['Major_category'].unique()
major_categ_total_men_women = []
for major_cat in major_categories:
    # Boolean mask selecting the rows that belong to this category
    in_category = recent_grads['Major_category'] == major_cat
    sum_men = recent_grads.loc[in_category, 'Men'].sum()
    sum_women = recent_grads.loc[in_category, 'Women'].sum()
    major_total = (major_cat, sum_men, sum_women)
    major_categ_total_men_women.append(major_total)
# One row per category, with the summed counts of men and women
men_women_cat_major = pd.DataFrame(major_categ_total_men_women, columns=['Major_category', 'Total_Men', 'Total_Women'])
men_women_cat_major
| | Major_category | Total_Men | Total_Women |
---|---|---|---|
0 | Engineering | 408307.0 | 129276.0 |
1 | Business | 667852.0 | 634524.0 |
2 | Physical Sciences | 95390.0 | 90089.0 |
3 | Law & Public Policy | 91129.0 | 87978.0 |
4 | Computers & Mathematics | 208725.0 | 90283.0 |
5 | Industrial Arts & Consumer Services | 103781.0 | 126011.0 |
6 | Arts | 134390.0 | 222740.0 |
7 | Health | 75517.0 | 387713.0 |
8 | Social Science | 256834.0 | 273132.0 |
9 | Biology & Life Science | 184919.0 | 268943.0 |
10 | Education | 103526.0 | 455603.0 |
11 | Agriculture & Natural Resources | 40357.0 | 35263.0 |
12 | Humanities & Liberal Arts | 272846.0 | 440622.0 |
13 | Psychology & Social Work | 98115.0 | 382892.0 |
14 | Communications & Journalism | 131921.0 | 260680.0 |
15 | Interdisciplinary | 2817.0 | 9479.0 |
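The per-category totals can also be computed without an explicit loop. The following is a sketch of an alternative using groupby, not the approach used above; the `toy` DataFrame and its counts are made up for illustration. Passing sort=False keeps the categories in order of first appearance, which matches what unique() returns.

```python
import pandas as pd

# Illustrative rows; the counts are made up, not taken from the dataset.
toy = pd.DataFrame({
    "Major_category": ["Engineering", "Health", "Engineering", "Health"],
    "Men": [2000.0, 300.0, 700.0, 100.0],
    "Women": [300.0, 2500.0, 100.0, 900.0],
})

# Sum Men and Women within each category, then rename to match the table above.
totals = (
    toy.groupby("Major_category", sort=False)[["Men", "Women"]]
    .sum()
    .rename(columns={"Men": "Total_Men", "Women": "Total_Women"})
    .reset_index()
)
print(totals)
```

The resulting DataFrame has the same three columns, so it can be passed to the same plot call.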
men_women_cat_major.plot(x="Major_category", kind="bar")
We made some useful visualizations to extract information from our data and discover interesting facts about our dataset:
The careers that pay the most are Petroleum Engineering, Mining and Mineral Engineering, and Metallurgical Engineering, all three with a median salary above 70k USD.
The most popular majors are Business Management and Psychology, each with more than 300k professionals in total.
The careers with more than 50k professionals and a median salary of 50k USD are Mechanical Engineering, General Engineering, Electrical Engineering and Computer Science.
The careers with an unemployment rate greater than 0.09 and more than 100k professionals are Economics, Political Science, Commercial Art and History.
The careers with the highest unemployment rates are Nuclear Engineering, Computer Networking and Public Administration.
The most popular full-time jobs (more than 100k professionals working full time) with a median salary above 40k USD are for professionals with majors in Nursing, Finance and Accounting.
Across the categories of majors, there are more men in Engineering, Business, and Computers & Mathematics; there are more women in Arts, Health, Industrial Arts & Consumer Services, Social Science, Biology & Life Science, Education, Humanities & Liberal Arts, Psychology & Social Work, Communications & Journalism, and Interdisciplinary majors.
The numbers of men and women are equal or almost equal in Law & Public Policy, Agriculture & Natural Resources, and Physical Sciences.
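Findings like these can be cross-checked with boolean filters and sorting in addition to reading the plots. The sketch below uses four rows copied from the head() and tail() output shown earlier; the thresholds are arbitrary choices that fit this small sample and are not the ones used for the conclusions above.

```python
import pandas as pd

# A few rows copied from the head() and tail() output shown earlier.
sample = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "CHEMICAL ENGINEERING", "ZOOLOGY", "LIBRARY SCIENCE"],
    "Total": [2339.0, 32260.0, 8409.0, 1098.0],
    "Median": [110000, 65000, 26000, 22000],
    "Unemployment_rate": [0.018381, 0.061098, 0.046320, 0.104946],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
# The thresholds are arbitrary and chosen only to fit the small sample.
large_and_well_paid = sample[(sample["Total"] > 5000) & (sample["Median"] >= 50000)]
print(large_and_well_paid["Major"].tolist())

# Sort to find the majors with the highest unemployment rate.
highest_unemployment = sample.sort_values("Unemployment_rate", ascending=False).head(2)
print(highest_unemployment["Major"].tolist())
```

Swapping in recent_grads and the thresholds from each conclusion would list the majors that satisfy it.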