In this guided project, we'll explore how using the pandas plotting functionality along with the Jupyter notebook interface allows us to explore data quickly using visualizations.
We'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by the American Community Survey, which conducts surveys and aggregates the data.
Each row in the dataset represents a different major in college and contains information on gender diversity, employment rates, median salaries, and more. Here are some of the columns in the dataset:
Column_name | Description |
---|---|
Rank | Rank by median earnings (the dataset is ordered by this column). |
Major_code | Major code. |
Major | Major description. |
Major_category | Category of major. |
Total | Total number of people with major. |
Sample_size | Sample size (unweighted) of full-time. |
Men | Male graduates. |
Women | Female graduates. |
ShareWomen | Women as share of total. |
Employed | Number employed. |
Median | Median salary of full-time, year-round workers. |
Low_wage_jobs | Number in low-wage service jobs. |
Full_time | Number employed 35 hours or more. |
Part_time | Number employed less than 35 hours. |
In the code cells below, the data set is read in and the first and last rows are displayed.
import pandas as pd
%matplotlib inline
recent_grads = pd.read_csv('recent-grads.csv')
recent_grads.iloc[0]
recent_grads.head()
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 150 | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 5180 | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 |
5 rows × 21 columns
recent_grads.tail()
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
168 | 169 | 3609 | ZOOLOGY | 8409.0 | 3050.0 | 5359.0 | Biology & Life Science | 0.637293 | 47 | 6259 | ... | 2190 | 3602 | 304 | 0.046320 | 26000 | 20000 | 39000 | 2771 | 2947 | 743 |
169 | 170 | 5201 | EDUCATIONAL PSYCHOLOGY | 2854.0 | 522.0 | 2332.0 | Psychology & Social Work | 0.817099 | 7 | 2125 | ... | 572 | 1211 | 148 | 0.065112 | 25000 | 24000 | 34000 | 1488 | 615 | 82 |
170 | 171 | 5202 | CLINICAL PSYCHOLOGY | 2838.0 | 568.0 | 2270.0 | Psychology & Social Work | 0.799859 | 13 | 2101 | ... | 648 | 1293 | 368 | 0.149048 | 25000 | 25000 | 40000 | 986 | 870 | 622 |
171 | 172 | 5203 | COUNSELING PSYCHOLOGY | 4626.0 | 931.0 | 3695.0 | Psychology & Social Work | 0.798746 | 21 | 3777 | ... | 965 | 2738 | 214 | 0.053621 | 23400 | 19200 | 26000 | 2403 | 1245 | 308 |
172 | 173 | 3501 | LIBRARY SCIENCE | 1098.0 | 134.0 | 964.0 | Education | 0.877960 | 2 | 742 | ... | 237 | 410 | 87 | 0.104946 | 22000 | 20000 | 22000 | 288 | 338 | 192 |
5 rows × 21 columns
In the code cell below, the `df.describe()` method gives more information about each numeric column in the data set.
recent_grads.describe()
| | Rank | Major_code | Total | Men | Women | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 173.000000 | 173.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
mean | 87.000000 | 3879.815029 | 39370.081395 | 16723.406977 | 22646.674419 | 0.522223 | 356.080925 | 31192.763006 | 26029.306358 | 8832.398844 | 19694.427746 | 2416.329480 | 0.068191 | 40151.445087 | 29501.445087 | 51494.219653 | 12322.635838 | 13284.497110 | 3859.017341 |
std | 50.084928 | 1687.753140 | 63483.491009 | 28122.433474 | 41057.330740 | 0.231205 | 618.361022 | 50675.002241 | 42869.655092 | 14648.179473 | 33160.941514 | 4112.803148 | 0.030331 | 11470.181802 | 9166.005235 | 14906.279740 | 21299.868863 | 23789.655363 | 6944.998579 |
min | 1.000000 | 1100.000000 | 124.000000 | 119.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 111.000000 | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.000000 | 2403.000000 | 4549.750000 | 2177.500000 | 1778.250000 | 0.336026 | 39.000000 | 3608.000000 | 3154.000000 | 1030.000000 | 2453.000000 | 304.000000 | 0.050306 | 33000.000000 | 24000.000000 | 42000.000000 | 1675.000000 | 1591.000000 | 340.000000 |
50% | 87.000000 | 3608.000000 | 15104.000000 | 5434.000000 | 8386.500000 | 0.534024 | 130.000000 | 11797.000000 | 10048.000000 | 3299.000000 | 7413.000000 | 893.000000 | 0.067961 | 36000.000000 | 27000.000000 | 47000.000000 | 4390.000000 | 4595.000000 | 1231.000000 |
75% | 130.000000 | 5503.000000 | 38909.750000 | 14631.000000 | 22553.750000 | 0.703299 | 338.000000 | 31433.000000 | 25147.000000 | 9948.000000 | 16891.000000 | 2393.000000 | 0.087557 | 45000.000000 | 33000.000000 | 60000.000000 | 14444.000000 | 11783.000000 | 3466.000000 |
max | 173.000000 | 6403.000000 | 393735.000000 | 173809.000000 | 307087.000000 | 0.968954 | 4212.000000 | 307933.000000 | 251540.000000 | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
The result displayed in the code cell above gives us the statistical summary of each numeric column in the data set. Note that `describe()` reports every statistic as a float, so this output alone does not tell us how each column is stored. The count row does show that `Total`, `Men`, `Women`, and `ShareWomen` each have 172 values rather than 173, so some values are missing. Since the aim of this project is to visualize this data, we need to drop rows with missing values so that we don't encounter errors while plotting.
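To check how columns are actually stored, `dtypes` is more direct than reading the `describe()` output. The snippet below is a minimal sketch on a hypothetical four-column frame (the values are illustrative, and only the column names come from the data set); the same two lines could be run on `recent_grads`.

```python
import pandas as pd

# Hypothetical three-row stand-in for recent-grads.csv (illustrative values).
sample = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [2339.0, 756.0, None],   # a missing value forces a float column
    "Sample_size": [36, 7, 3],
    "ShareWomen": [0.12, 0.10, 0.15],
})

# Count how many columns use each storage type.
dtype_counts = sample.dtypes.astype(str).value_counts()
print(dtype_counts)

# Share of columns stored as float64.
float_share = (sample.dtypes == "float64").mean()
print(f"{float_share:.0%} of columns are float64")
```

In this sketch two of the four columns are floats; the share for the real data set depends on the file and is not assumed here.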
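In the code cells below, the `df.shape` attribute gives the row count before and after cleaning, the `df.dropna()` method drops rows with missing values, and the `recent_grads` variable stores the cleaned data set.
raw_data_count = recent_grads.shape[0]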
raw_data_count
173
recent_grads = recent_grads.dropna() # drop rows with missing values
cleaned_data_count = recent_grads.shape[0]
cleaned_data_count
172
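Before dropping rows, it can help to see where the missing values are. The following is a small sketch on a hypothetical four-row frame (the column names match the data set, the values are illustrative), using `isna()` to count missing values per column and to list the rows that `dropna()` would remove.

```python
import pandas as pd

# Hypothetical miniature frame; only the column names come from the data set.
sample = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Total": [2339.0, None, 856.0, 1258.0],
    "Median": [110000, 75000, 73000, 70000],
})

# Number of missing values per column, before cleaning.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Rows that dropna() would remove (any column missing).
rows_with_missing = sample[sample.isna().any(axis=1)]
print(rows_with_missing)

cleaned = sample.dropna()
print(len(sample), "->", len(cleaned))
```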
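Using visualizations, we can start to explore questions from the dataset.
Using scatter plots in the code cells below, we'll visualize certain columns in the data set in a bid to answer these questions:
Do students in more popular majors make more money?
Do students that majored in subjects that were majority female make more money?
Is there any link between the number of full-time employees and median salary?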
The data set doesn't have a 'popularity' column, but it is reasonable to treat a major with a large number of graduates as a popular one. Thus, we will use the `Total` column to answer the first question.
A scatter plot of the `Median` column against the `Total` column is displayed below.
recent_grads.plot(x='Total', y='Median', kind='scatter',title='Total Vs Median')
[Scatter plot: `Total` on the x-axis, `Median` on the y-axis, titled 'Total Vs Median']
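The pandas plotting methods return a matplotlib `Axes` object, and keeping a reference to it allows axis labels to be set afterwards. Below is a minimal sketch using the `Total` and `Median` values from the first five rows displayed earlier; the label text and the `Agg` backend are assumptions made so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Total and Median from the first five rows shown in the head() output.
sample = pd.DataFrame({
    "Total": [2339, 756, 856, 1258, 32260],
    "Median": [110000, 75000, 73000, 70000, 65000],
})

# Keep the returned Axes object so labels can be adjusted afterwards.
ax = sample.plot(x="Total", y="Median", kind="scatter", title="Total vs. Median")
ax.set_xlabel("Total graduates in major")
ax.set_ylabel("Median salary (USD)")
```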
Do students in more popular majors make more money?
No.
The scatter plot shows no clear correlation between these two columns. The highest median salaries appear among majors with 0-50,000 people, which is in fact the lowest range of the `Total` column, i.e. the majors with the fewest people.
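A visual impression like this can be cross-checked with a correlation coefficient. The sketch below uses only the five rows shown in the `head()` output, so its result says nothing about the full data set; on the real data the same call would be `recent_grads['Total'].corr(recent_grads['Median'])`.

```python
import pandas as pd

# First five rows as displayed earlier (not representative of all 172 rows).
sample = pd.DataFrame({
    "Total": [2339.0, 756.0, 856.0, 1258.0, 32260.0],
    "Median": [110000, 75000, 73000, 70000, 65000],
})

# Pearson correlation coefficient: values near 0 suggest no linear relationship,
# values near -1 or 1 suggest a strong one.
r = sample["Total"].corr(sample["Median"])
print(f"Pearson r between Total and Median: {r:.2f}")
```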
The code cell below helps answer the second question: a scatter plot of the `Median` column against the `ShareWomen` column is displayed.
recent_grads.plot(x='ShareWomen', y='Median', kind='scatter',title='ShareWomen Vs Median')
[Scatter plot: `ShareWomen` on the x-axis, `Median` on the y-axis, titled 'ShareWomen Vs Median']
Do students that majored in subjects that were majority female make more money?
No.
The direction of the plot is negative, although the correlation between the two columns isn't strong. The reverse is the case here: students in majority-female majors in fact tend to make less.
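One way to put a number on this comparison is to split the majors at a 50% share of women and compare the typical median salary in each group. The sketch below uses the five highest-ranked and five lowest-ranked rows displayed earlier, which is a deliberately extreme sample; the group labels are names introduced here for illustration.

```python
import pandas as pd

# ShareWomen and Median from the head() and tail() outputs shown earlier.
sample = pd.DataFrame({
    "ShareWomen": [0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
                   0.637293, 0.817099, 0.799859, 0.798746, 0.877960],
    "Median": [110000, 75000, 73000, 70000, 65000,
               26000, 25000, 25000, 23400, 22000],
})

# Label each major by whether women make up more than half of its graduates.
group = (sample["ShareWomen"] > 0.5).map(
    {True: "majority female", False: "majority male"}
)

# Median of the per-major median salaries within each group.
typical_salary = sample.groupby(group)["Median"].median()
print(typical_salary)
```

Because these ten rows are the top and bottom of the salary ranking, the gap here is far larger than it would be across all majors.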
The code cell below displays a scatter plot of the `Median` column against the `Full_time` column. This plot helps us visualize the data in these two columns and also answer the third question.
recent_grads.plot(x='Full_time', y='Median', kind='scatter',title='Full_time Vs Median')
[Scatter plot: `Full_time` on the x-axis, `Median` on the y-axis, titled 'Full_time Vs Median']
Is there any link between the number of full-time employees and median salary?
There is no strong link (correlation) between these two columns. The only pattern observed is a cluster in the 0-50,000 range on the `Full_time` axis and the 20,000-40,000 range on the `Median` axis.
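Histogram plots will help examine the distribution of values in the various columns we have in this data set and thus help us answer these questions:
What percentage of majors are predominantly male, and what percentage are predominantly female?
What is the most common median salary range?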
In the code cell below, a histogram of the values in the `ShareWomen` column (one value per major) is displayed.
recent_grads['ShareWomen'].hist()
[Histogram of the `ShareWomen` column]
Summing the bars for the bins above approximately 50% female shows how many majors are predominantly female. It is observed that a larger percentage of the majors are predominantly female.
The `value_counts()` method gives more precise counts for these bins.
recent_grads['ShareWomen'].value_counts(bins = 20).sort_index()
(-0.0019690000000000003, 0.0484]    1
(0.0484, 0.0969]    2
(0.0969, 0.145]    8
(0.145, 0.194]    6
(0.194, 0.242]    7
(0.242, 0.291]    9
(0.291, 0.339]    10
(0.339, 0.388]    12
(0.388, 0.436]    8
(0.436, 0.484]    11
(0.484, 0.533]    12
(0.533, 0.581]    9
(0.581, 0.63]    11
(0.63, 0.678]    14
(0.678, 0.727]    15
(0.727, 0.775]    14
(0.775, 0.824]    8
(0.824, 0.872]    3
(0.872, 0.921]    8
(0.921, 0.969]    4
Name: ShareWomen, dtype: int64
Within the approximately 50%-100% range of the `ShareWomen` column, there are approximately 57% of the majors (98 of 172), which implies that:
43% of majors are predominantly male;
57% are predominantly female.
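The binned counts only approximate the split, because the bin `(0.484, 0.533]` straddles the 50% mark. A direct comparison against 0.5 avoids depending on bin edges. The following is a minimal sketch using the ten `ShareWomen` values from the `head()` and `tail()` outputs shown earlier; on the real data the series would be `recent_grads['ShareWomen']`.

```python
import pandas as pd

# ShareWomen for the first five and last five rows displayed earlier.
share_women = pd.Series(
    [0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
     0.637293, 0.817099, 0.799859, 0.798746, 0.877960],
    name="ShareWomen",
)

# The mean of a boolean Series is the fraction of True values.
pct_female = (share_women > 0.5).mean() * 100
pct_male = 100 - pct_female
print(f"predominantly female: {pct_female:.0f}%, predominantly male: {pct_male:.0f}%")
```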
In a bid to answer the second question, the code cell below displays the distribution of values in the `Median` column.
recent_grads['Median'].hist(bins=20)
[Histogram of the `Median` column, 20 bins]
The plot above shows that the most common median salary is within the 30,000 USD - 35,000 USD range. However, the `value_counts()` method gives more precise counts in the code cell below.
recent_grads['Median'].value_counts(bins = 20)
(30800.0, 35200.0]    56
(39600.0, 44000.0]    24
(26400.0, 30800.0]    19
(35200.0, 39600.0]    19
(44000.0, 48400.0]    16
(48400.0, 52800.0]    12
(57200.0, 61600.0]    7
(52800.0, 57200.0]    6
(21911.999, 26400.0]    5
(61600.0, 66000.0]    4
(66000.0, 70400.0]    1
(70400.0, 74800.0]    1
(74800.0, 79200.0]    1
(105600.0, 110000.0]    1
(101200.0, 105600.0]    0
(79200.0, 83600.0]    0
(83600.0, 88000.0]    0
(88000.0, 92400.0]    0
(92400.0, 96800.0]    0
(96800.0, 101200.0]    0
Name: Median, dtype: int64
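The most common bin can also be picked out programmatically instead of reading it off the listing. This is a small sketch on the ten `Median` values from the `head()` and `tail()` outputs, with four bins chosen arbitrarily for the small sample; `idxmax()` returns the interval with the highest count, and `sort_index()` puts the bins back in salary order.

```python
import pandas as pd

# Median salaries for the first five and last five rows displayed earlier.
median = pd.Series(
    [110000, 75000, 73000, 70000, 65000, 26000, 25000, 25000, 23400, 22000],
    name="Median",
)

# Bin the values into 4 equal-width intervals and count each bin.
bin_counts = median.value_counts(bins=4)

# Interval with the most majors, and how many fall in it.
most_common_bin = bin_counts.idxmax()
print(most_common_bin, bin_counts.max())

# Same counts, ordered from the lowest salary bin to the highest.
ordered = bin_counts.sort_index()
print(ordered)
```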
A scatter matrix plot combines both scatter plots and histograms into one grid of plots and allows us to explore potential relationships and distributions simultaneously.
from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[['Sample_size','Median']])
[Scatter matrix: 2x2 grid of plots for `Sample_size` and `Median`, with histograms on the diagonal]
scatter_matrix(recent_grads[['Sample_size','Median','Unemployment_rate']])
[Scatter matrix: 3x3 grid of plots for `Sample_size`, `Median`, and `Unemployment_rate`, with histograms on the diagonal]
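A correlation matrix is a numeric companion to a scatter matrix: each off-diagonal cell summarizes the same pair of columns as one panel of the grid. The sketch below uses the ten rows from the `head()` and `tail()` outputs shown earlier, so the coefficients are illustrative only.

```python
import pandas as pd

# Values for the first five and last five rows displayed earlier.
sample = pd.DataFrame({
    "Sample_size": [36, 7, 3, 16, 289, 47, 7, 13, 21, 2],
    "Median": [110000, 75000, 73000, 70000, 65000,
               26000, 25000, 25000, 23400, 22000],
    "Unemployment_rate": [0.018381, 0.117241, 0.024096, 0.050125, 0.061098,
                          0.046320, 0.065112, 0.149048, 0.053621, 0.104946],
})

# Pairwise Pearson correlations; the matrix is symmetric with 1.0 on the diagonal.
corr = sample.corr()
print(corr.round(2))
```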
Using bar plots, we compare:
the `ShareWomen` values from the first ten rows and last ten rows of the `recent_grads` dataframe;
the `Unemployment_rate` values from the first ten rows and last ten rows of the `recent_grads` dataframe.
recent_grads[:10].plot.bar(x='Major', y='ShareWomen', legend = False)
[Bar plot: `ShareWomen` for each `Major` in the first ten rows]
The plot above shows that, among the top-ranked majors, the `ASTRONOMY AND ASTROPHYSICS` major has the highest percentage of women.
recent_grads.tail(10).plot.bar(x='Major', y='ShareWomen', legend = False)
[Bar plot: `ShareWomen` for each `Major` in the last ten rows]
The plot above shows that, among the lowest-ranked majors, the `COMMUNICATION DISORDERS SCIENCES AND SERVICES` and `EARLY CHILDHOOD EDUCATION` majors have the highest percentages of women.
Next, we compare the unemployment rate across the first ten rows and the last ten rows of the `recent_grads` dataframe.
recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate', legend = False)
[Bar plot: `Unemployment_rate` for each `Major` in the first ten rows]
The plot above shows that, among the top-ranked majors, the `NUCLEAR ENGINEERING` major has the highest unemployment rate. The reasons why are not covered in this analysis.
recent_grads.tail(10).plot.bar(x='Major', y='Unemployment_rate', legend = False)
[Bar plot: `Unemployment_rate` for each `Major` in the last ten rows]
The plot above shows that, among the lowest-ranked majors, the `CLINICAL PSYCHOLOGY` major has the highest unemployment rate. The reasons why are not covered in this analysis.
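Reading the tallest bar off a plot can be confirmed with `idxmax()`. The sketch below uses only the last five rows shown in the `tail()` output rather than the ten rows plotted; on the real data, `recent_grads.tail(10)` would take the place of `sample`.

```python
import pandas as pd

# Major and Unemployment_rate for the last five rows displayed earlier.
sample = pd.DataFrame({
    "Major": ["ZOOLOGY", "EDUCATIONAL PSYCHOLOGY", "CLINICAL PSYCHOLOGY",
              "COUNSELING PSYCHOLOGY", "LIBRARY SCIENCE"],
    "Unemployment_rate": [0.046320, 0.065112, 0.149048, 0.053621, 0.104946],
})

# Row label of the highest unemployment rate, then the matching major name.
worst_row = sample["Unemployment_rate"].idxmax()
worst_major = sample.loc[worst_row, "Major"]
print(worst_major, sample.loc[worst_row, "Unemployment_rate"])

# Sorting first makes the ranking visible in a bar plot as well.
ranked = sample.sort_values("Unemployment_rate", ascending=False)
print(ranked)
```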
GROUPED BAR PLOT to compare the number of men with the number of women in each category of majors.
The code cell below displays the unique values in the `Major_category` column.
recent_grads['Major_category'].unique()
array(['Engineering', 'Business', 'Physical Sciences', 'Law & Public Policy', 'Computers & Mathematics', 'Industrial Arts & Consumer Services', 'Arts', 'Health', 'Social Science', 'Biology & Life Science', 'Education', 'Agriculture & Natural Resources', 'Humanities & Liberal Arts', 'Psychology & Social Work', 'Communications & Journalism', 'Interdisciplinary'], dtype=object)
recent_grads.groupby('Major_category')[['Men', 'Women']].sum().plot.bar()
[Grouped bar plot: total `Men` and `Women` for each `Major_category`]
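The totals behind a grouped bar plot like this can be inspected directly, which also makes it possible to count the categories where women outnumber men. The sketch below covers only the ten rows from the `head()` and `tail()` outputs (four categories), so the resulting share is not the figure for the full data set. Selecting the columns with a list, `[['Men', 'Women']]`, is the form pandas recommends over a bare tuple of keys.

```python
import pandas as pd

# Major_category, Men, and Women for the first five and last five rows displayed earlier.
sample = pd.DataFrame({
    "Major_category": ["Engineering"] * 5 + [
        "Biology & Life Science", "Psychology & Social Work",
        "Psychology & Social Work", "Psychology & Social Work", "Education",
    ],
    "Men": [2057.0, 679.0, 725.0, 1123.0, 21239.0,
            3050.0, 522.0, 568.0, 931.0, 134.0],
    "Women": [282.0, 77.0, 131.0, 135.0, 11021.0,
              5359.0, 2332.0, 2270.0, 3695.0, 964.0],
})

# Total men and women per category (the values the grouped bars represent).
totals = sample.groupby("Major_category")[["Men", "Women"]].sum()
print(totals)

# Fraction of categories in which women outnumber men.
more_women = totals["Women"] > totals["Men"]
share_more_women = more_women.mean()
print(f"{share_more_women:.0%} of these categories have more women than men")
```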
The plot displayed in the code cell above suggests that:
In the `Business`, `Engineering`, and `Computers & Mathematics` major categories, there are more men than women.
60% of the major categories have more women than men.
Box and Whisker Plots to explore the distributions of median salaries and unemployment rate.
recent_grads[['Median']].boxplot()
[Box plot of the `Median` column]
The box plot above shows:
recent_grads[['Unemployment_rate']].boxplot()
[Box plot of the `Unemployment_rate` column]
The box plot above shows:
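The numbers a box plot encodes can be computed directly: the box spans the first to third quartile, and with matplotlib's default setting, points more than 1.5 times the interquartile range beyond the box are drawn as outliers. The sketch below applies this to the ten `Median` values from the `head()` and `tail()` outputs; `whisker_fences` is a helper name introduced here for illustration.

```python
import pandas as pd

def whisker_fences(values: pd.Series) -> tuple[float, float]:
    """Return the (lower, upper) outlier fences under the 1.5 * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Median salaries for the first five and last five rows displayed earlier.
median = pd.Series(
    [110000, 75000, 73000, 70000, 65000, 26000, 25000, 25000, 23400, 22000],
    name="Median",
)

lower, upper = whisker_fences(median)

# Values outside the fences would be drawn as individual outlier points.
outliers = median[(median < lower) | (median > upper)]
print(lower, upper, outliers.tolist())
```

Applying the same rule to the quartiles in the `describe()` output above (25% = 33,000 and 75% = 45,000 for `Median`) gives an upper fence of 45,000 + 1.5 × 12,000 = 63,000, so the maximum of 110,000 would appear as an outlier point in the full-data box plot.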
recent_grads.plot.hexbin(x='Men', y='Median', gridsize=30)
[Hexbin plot: `Men` on the x-axis, `Median` on the y-axis]
recent_grads.plot.hexbin(x='Women', y='Median', gridsize=30)
[Hexbin plot: `Women` on the x-axis, `Median` on the y-axis]
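The hexbin plots above show that the median salary ranges are similar whether majors are plotted by their number of men or by their number of women. Plotted against `Women`, the densest bins sit at about 35,000 USD and 40,000 USD.
Plotted against `Men`, the densest bins sit at around 30,000 USD to 35,000 USD. Note that `Median` is the median salary for the major as a whole, not for each gender separately.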
In line with the objective stated in the introduction of this project, I have been able to visualize the job outcomes of students who graduated from college between 2010 and 2012.
`ASTRONOMY AND ASTROPHYSICS` is the one to consider if looking for a top-ranked major with a larger population of women graduates.
`NUCLEAR ENGINEERING` would not be a good idea because of the high unemployment rate it is characterized by.