Over the years, many parents and a number of students have been concerned with which college courses (majors) offer a higher probability of success after graduation. This is usually driven by the desire for a good return on investment (ROI) on the time and resources spent in college. This is not to say that some majors are irrelevant, only that some are considered more valuable than others in society and the world at large.
This project uses a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by the American Community Survey, which conducts surveys and aggregates the data. However, we will use a cleaned version of the data released by FiveThirtyEight on their GitHub repo.
This project focuses on exploring and answering the following questions using several visualization techniques provided by Matplotlib:
Lastly, below are the columns in the dataset and their respective definitions:
Rank - Rank by median earnings (the dataset is ordered by this column).
Major_code - Major code.
Major - Major description.
Major_category - Category of major.
Total - Total number of people with major.
Sample_size - Sample size (unweighted) of full-time.
Men - Male graduates.
Women - Female graduates.
ShareWomen - Women as a share of total.
Employed - Number employed.
Full_time - Number employed 35 hours or more.
Part_time - Number employed less than 35 hours.
Full_time_year_round - Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35).
Unemployed - Number unemployed (ESR == 3).
Unemployment_rate - Unemployed / (Unemployed + Employed).
Median - Median earnings of full-time, year-round workers.
P25th - 25th percentile of earnings.
P75th - 75th percentile of earnings.
College_jobs - Number with job requiring a college degree.
Non_college_jobs - Number with job not requiring a college degree.
Low_wage_jobs - Number in low-wage service jobs.
The pandas and matplotlib libraries are required for the data cleaning, exploration, analysis and visualization steps.
import pandas as pd
import matplotlib.pyplot as plt
# jupyter magic function to display inline plots
%matplotlib inline
We need to read in the dataset in order to examine and explore it, and to identify its contents (e.g. patterns, outliers, values, and column names that may need changing).
# reading the dataset
recent_grads = pd.read_csv('recent-grads.csv')
recent_grads.iloc[0] # returns the first row of the dataset
Rank                    1
Major_code              2419
Major                   PETROLEUM ENGINEERING
Total                   2339
Men                     2057
Women                   282
Major_category          Engineering
ShareWomen              0.120564
Sample_size             36
Employed                1976
Full_time               1849
Part_time               270
Full_time_year_round    1207
Unemployed              37
Unemployment_rate       0.0183805
Median                  110000
P25th                   95000
P75th                   125000
College_jobs            1534
Non_college_jobs        364
Low_wage_jobs           193
Name: 0, dtype: object
recent_grads.head() # returns the first 5 rows of the dataset
Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 150 | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 5180 | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 |
5 rows × 21 columns
The dataframe displayed above shows the first five rows of the dataset, which gives a better understanding of the data being worked with.
recent_grads.tail() # returns the last 5 rows of the dataset
Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
168 | 169 | 3609 | ZOOLOGY | 8409.0 | 3050.0 | 5359.0 | Biology & Life Science | 0.637293 | 47 | 6259 | ... | 2190 | 3602 | 304 | 0.046320 | 26000 | 20000 | 39000 | 2771 | 2947 | 743 |
169 | 170 | 5201 | EDUCATIONAL PSYCHOLOGY | 2854.0 | 522.0 | 2332.0 | Psychology & Social Work | 0.817099 | 7 | 2125 | ... | 572 | 1211 | 148 | 0.065112 | 25000 | 24000 | 34000 | 1488 | 615 | 82 |
170 | 171 | 5202 | CLINICAL PSYCHOLOGY | 2838.0 | 568.0 | 2270.0 | Psychology & Social Work | 0.799859 | 13 | 2101 | ... | 648 | 1293 | 368 | 0.149048 | 25000 | 25000 | 40000 | 986 | 870 | 622 |
171 | 172 | 5203 | COUNSELING PSYCHOLOGY | 4626.0 | 931.0 | 3695.0 | Psychology & Social Work | 0.798746 | 21 | 3777 | ... | 965 | 2738 | 214 | 0.053621 | 23400 | 19200 | 26000 | 2403 | 1245 | 308 |
172 | 173 | 3501 | LIBRARY SCIENCE | 1098.0 | 134.0 | 964.0 | Education | 0.877960 | 2 | 742 | ... | 237 | 410 | 87 | 0.104946 | 22000 | 20000 | 22000 | 288 | 338 | 192 |
5 rows × 21 columns
The dataframe displayed above shows the last five rows of the dataset, which gives a further intuition for the data being worked with.
Notice that the column names begin with a capital letter. This is not much of a problem, but converting all column names to lowercase gives a consistent format, which makes exploration and analysis quicker.
Therefore, the column names will be converted to lowercase.
# returns the column names in the dataset
recent_grads.columns
Index(['Rank', 'Major_code', 'Major', 'Total', 'Men', 'Women', 'Major_category', 'ShareWomen', 'Sample_size', 'Employed', 'Full_time', 'Part_time', 'Full_time_year_round', 'Unemployed', 'Unemployment_rate', 'Median', 'P25th', 'P75th', 'College_jobs', 'Non_college_jobs', 'Low_wage_jobs'], dtype='object')
# converting all column names to lowercase
lowercase_recent_grad = [] # stores the lowercase column names
for name in recent_grads.columns:
    name = name.lower()
    lowercase_recent_grad.append(name)
recent_grads.columns = lowercase_recent_grad # replaces the old columns with the new columns in the dataset
recent_grads.columns
Index(['rank', 'major_code', 'major', 'total', 'men', 'women', 'major_category', 'sharewomen', 'sample_size', 'employed', 'full_time', 'part_time', 'full_time_year_round', 'unemployed', 'unemployment_rate', 'median', 'p25th', 'p75th', 'college_jobs', 'non_college_jobs', 'low_wage_jobs'], dtype='object')
All the column names have now been converted to lowercase, which brings about a consistent name format for the columns.
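As an alternative to the loop above, pandas exposes vectorized string methods on the column index. The following is a minimal sketch on a small stand-in DataFrame; the name `demo` is illustrative, and only a few of the dataset's columns and values are reproduced.

```python
import pandas as pd

# Stand-in frame with a few of the original column names
# (values copied from the first two rows shown earlier)
demo = pd.DataFrame({
    'Rank': [1, 2],
    'Major_code': [2419, 2416],
    'ShareWomen': [0.120564, 0.101852],
})

# Index.str.lower() lowercases every column name in one step
demo.columns = demo.columns.str.lower()
print(list(demo.columns))
```

On the real dataset the same call would be `recent_grads.columns = recent_grads.columns.str.lower()`.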
recent_grads.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 173 entries, 0 to 172 Data columns (total 21 columns): rank 173 non-null int64 major_code 173 non-null int64 major 173 non-null object total 172 non-null float64 men 172 non-null float64 women 172 non-null float64 major_category 173 non-null object sharewomen 172 non-null float64 sample_size 173 non-null int64 employed 173 non-null int64 full_time 173 non-null int64 part_time 173 non-null int64 full_time_year_round 173 non-null int64 unemployed 173 non-null int64 unemployment_rate 173 non-null float64 median 173 non-null int64 p25th 173 non-null int64 p75th 173 non-null int64 college_jobs 173 non-null int64 non_college_jobs 173 non-null int64 low_wage_jobs 173 non-null int64 dtypes: float64(5), int64(14), object(2) memory usage: 28.5+ KB
The information above shows that most of the columns in the dataset contain numeric values (int64 and float64); only two columns, major_category and major, contain string values.
# returns summary statistics for all the columns
# in the dataset
recent_grads.describe(include='all')
rank | major_code | major | total | men | women | major_category | sharewomen | sample_size | employed | ... | part_time | full_time_year_round | unemployed | unemployment_rate | median | p25th | p75th | college_jobs | non_college_jobs | low_wage_jobs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 173.000000 | 173.000000 | 173 | 172.000000 | 172.000000 | 172.000000 | 173 | 172.000000 | 173.000000 | 173.000000 | ... | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
unique | NaN | NaN | 173 | NaN | NaN | NaN | 16 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
top | NaN | NaN | ADVERTISING AND PUBLIC RELATIONS | NaN | NaN | NaN | Engineering | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
freq | NaN | NaN | 1 | NaN | NaN | NaN | 29 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
mean | 87.000000 | 3879.815029 | NaN | 39370.081395 | 16723.406977 | 22646.674419 | NaN | 0.522223 | 356.080925 | 31192.763006 | ... | 8832.398844 | 19694.427746 | 2416.329480 | 0.068191 | 40151.445087 | 29501.445087 | 51494.219653 | 12322.635838 | 13284.497110 | 3859.017341 |
std | 50.084928 | 1687.753140 | NaN | 63483.491009 | 28122.433474 | 41057.330740 | NaN | 0.231205 | 618.361022 | 50675.002241 | ... | 14648.179473 | 33160.941514 | 4112.803148 | 0.030331 | 11470.181802 | 9166.005235 | 14906.279740 | 21299.868863 | 23789.655363 | 6944.998579 |
min | 1.000000 | 1100.000000 | NaN | 124.000000 | 119.000000 | 0.000000 | NaN | 0.000000 | 2.000000 | 0.000000 | ... | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.000000 | 2403.000000 | NaN | 4549.750000 | 2177.500000 | 1778.250000 | NaN | 0.336026 | 39.000000 | 3608.000000 | ... | 1030.000000 | 2453.000000 | 304.000000 | 0.050306 | 33000.000000 | 24000.000000 | 42000.000000 | 1675.000000 | 1591.000000 | 340.000000 |
50% | 87.000000 | 3608.000000 | NaN | 15104.000000 | 5434.000000 | 8386.500000 | NaN | 0.534024 | 130.000000 | 11797.000000 | ... | 3299.000000 | 7413.000000 | 893.000000 | 0.067961 | 36000.000000 | 27000.000000 | 47000.000000 | 4390.000000 | 4595.000000 | 1231.000000 |
75% | 130.000000 | 5503.000000 | NaN | 38909.750000 | 14631.000000 | 22553.750000 | NaN | 0.703299 | 338.000000 | 31433.000000 | ... | 9948.000000 | 16891.000000 | 2393.000000 | 0.087557 | 45000.000000 | 33000.000000 | 60000.000000 | 14444.000000 | 11783.000000 | 3466.000000 |
max | 173.000000 | 6403.000000 | NaN | 393735.000000 | 173809.000000 | 307087.000000 | NaN | 0.968954 | 4212.000000 | 307933.000000 | ... | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
11 rows × 21 columns
The information above shows that some columns in the dataset contain missing data, specifically total, men, women and sharewomen.
Matplotlib expects the columns of values passed to it to have matching lengths, and missing values will cause it to throw errors. Since some rows with missing data have been identified, they need to be removed.
# displays the column with missing values
print(recent_grads.isnull().sum())
rank                    0
major_code              0
major                   0
total                   1
men                     1
women                   1
major_category          0
sharewomen              1
sample_size             0
employed                0
full_time               0
part_time               0
full_time_year_round    0
unemployed              0
unemployment_rate       0
median                  0
p25th                   0
p75th                   0
college_jobs            0
non_college_jobs        0
low_wage_jobs           0
dtype: int64
Above, notice that there are only 4 columns with missing values and each of those columns contains only one missing value.
# returns the total number of rows in the dataset, an
# alternative could be dataFrame.count()
raw_data_count = recent_grads.index
raw_data_count
RangeIndex(start=0, stop=173, step=1)
# dropping rows with missing values
recent_grads = recent_grads.dropna(axis='index')
cleaned_data_count = recent_grads.index
cleaned_data_count
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ... 163, 164, 165, 166, 167, 168, 169, 170, 171, 172], dtype='int64', length=172)
Notice now that there's a difference between raw_data_count (length=173) and the new cleaned_data_count (length=172). This shows that only one row in the dataset contained missing values and was dropped.
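The check above can also be expressed with plain row counts. The sketch below uses a hypothetical three-row frame (the name `demo` and the major labels are made up for illustration) to show how to list the rows that contain missing values and confirm how many rows `dropna` removes.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; row 'B' is missing total and sharewomen
demo = pd.DataFrame({
    'major': ['A', 'B', 'C'],
    'total': [2339.0, np.nan, 856.0],
    'sharewomen': [0.120564, np.nan, 0.153037],
})

# Rows that contain at least one missing value
rows_with_missing = demo[demo.isnull().any(axis=1)]

rows_before = len(demo)
cleaned = demo.dropna(axis='index')
rows_after = len(cleaned)

print(rows_with_missing)
print('rows dropped:', rows_before - rows_after)
```

Comparing `len()` before and after gives the number of dropped rows directly, without reading it off the index repr.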
In order to determine which categories of students earn more, scatter plots will be created to compare some columns of our dataset and examine the correlation between them, such as sample_size, median, unemployment_rate, etc.
pandas has its own plotting functionality, DataFrame.plot(), which lets us create different types of plots very quickly by passing in a few arguments. That is what will be used to create the plots.
E.g. recent_grads.plot(x='sample_size', y='employed', kind='scatter', title='Employed vs. Sample_size', figsize=(5,10))
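The plots created help in answering the following important questions:
- Do students in more popular majors make more money?
- Do students that majored in subjects that were majority female make more money?
- Is there any link between the number of full-time employees and median salary?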
# creating a scatter plot sample_size vs. median
ax = recent_grads.plot(x='sample_size', y='median',
kind='scatter')
ax.set_title('Median Salary vs. Sample_size')
ax.set_xlim(0,4300)
ax.set_ylim(0,120000)
(0, 120000)
Q. Do students in more popular majors make more money?
A. No! Based on the plot above, there is little or no significant correlation between the median earnings of graduates and the sample size of their major (used here as a measure of the major's popularity).
However, there are two significant observations.
# scatter plot with a sample size at the 75th percentile
# and median values
ax = recent_grads.plot(x='sample_size', y='median',
kind='scatter')
ax.set_title('Median Salary vs. Sample_size')
ax.set_xlim(0,338) # 75th percentile value of sample_size
ax.set_ylim(0,120000)
(0, 120000)
The diagram above still shows no correlation, even over a smaller range of sample sizes.
However, at sample sizes of around 50 there seems to be an increase in the median earnings of graduates.
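A visual impression of "little or no correlation" can be backed up with a number. The sketch below computes the Pearson correlation coefficient with `Series.corr()`. It is built only from the ten rows shown by head() and tail() earlier, so it illustrates the call rather than giving the result for the full dataset; the name `sample` is illustrative.

```python
import pandas as pd

# The ten rows shown by head() and tail() earlier (not the full dataset)
sample = pd.DataFrame({
    'sample_size': [36, 7, 3, 16, 289, 47, 7, 13, 21, 2],
    'median': [110000, 75000, 73000, 70000, 65000,
               26000, 25000, 25000, 23400, 22000],
})

# Pearson correlation coefficient, always between -1 and 1;
# values near 0 indicate little linear relationship
r = sample['sample_size'].corr(sample['median'])
print(round(r, 2))
```

On the cleaned dataset the equivalent call would be `recent_grads['sample_size'].corr(recent_grads['median'])`.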
# scatter plot of sample_size vs. unemployment_rate
ax = recent_grads.plot(x='sample_size', y='unemployment_rate',
kind='scatter')
ax.set_title('Unemployment_rate vs. Sample_size')
ax.set_xlim(0, 4300) # max value in sample_size
ax.set_ylim(-0.02, 0.2)
(-0.02, 0.2)
Q. Is the unemployment rate related to how popular a major is?
A. The diagram above also shows little or no correlation between unemployment_rate and sample_size.
However, at sample sizes of around 500, an increase in the unemployment rate is observed.
For better intuition, it helps to analyse a smaller range of sample sizes, say up to 1000 (this seems appropriate).
# creates a scatter plot limited to sample sizes up to 1000
ax = recent_grads.plot(x='sample_size', y='unemployment_rate',
                       kind='scatter')
ax.set_title('Unemployment_rate vs. Sample_size')
ax.set_xlim(0,1000) # restricts the x-axis to sample sizes up to 1000
(0, 1000)
The diagram above still shows a lot of variation (no correlation) between sample_size and unemployment_rate.
Exploring some of the rows in the dataset further helps explain these variations.
# scatter plot: Full_time vs Median
ax = recent_grads.plot(x='full_time', y='median', kind='scatter')
ax.set_title('Median Salary vs. Full_time employed grads')
ax.set_xlim(0,)
(0, 300000.0)
Q. Is there any link between the number of full-time employees and median salary?
A. The diagram above also shows a lot of variation (no correlation) between the number of full-time employed graduates and their median earnings, especially for majors with fewer than 50,000 full-time employed graduates.
Aside from the small sample sizes, other variations could be the result of differences between the companies/organizations that employ the graduates, such as:
However, in my opinion, it's expected that the longer the hours put into work, the higher the earnings.
# scatter plot: ShareWomen vs. Unemployment_rate
ax = recent_grads.plot(x='sharewomen', y='unemployment_rate',
kind='scatter')
ax.set_title('Unemployment_rate vs. Fraction of Graduate women')
ax.set_xlim(0,)
(0, 1.2000000000000002)
The diagram above shows no correlation between sharewomen and unemployment_rate.
# scatter plot: Sharewomen vs. median
ax = recent_grads.plot(x='sharewomen', y='median',
kind='scatter')
ax.set_title('Median Salary vs. Fraction of graduate women')
ax.set_xlim(0,1.0)
ax.set_ylim(10000,)
(10000, 120000.0)
Q. Do students that majored in subjects that were majority female make more money?
A. No! The scatter plot above shows a weak negative correlation.
This means graduates of majors with a low share of women earned more: as shown in the diagram, majors with a share of women between 0 and 0.2 (0 - 20%) had the highest median earnings, while majors with a higher share of women (above 0.2) had lower earnings.
# scatter plot: Men vs. Median
ax = recent_grads.plot(x='men', y='median', kind='scatter')
ax.set_title('Median Salary vs. Male graduates')
ax.set_xlim(0,)
ax.set_ylim(20000,)
(20000, 120000.0)
There's no correlation between the number of male graduates in a major and its median earnings.
# scatter plot: Women vs. Median
ax = recent_grads.plot(x='women', y='median', kind='scatter')
ax.set_title('Median Salary vs. Female graduates')
ax.set_xlim(0,)
ax.set_ylim(20000,)
(20000, 120000.0)
There's equally no correlation between the number of female graduates in a major and its median earnings.
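This section focuses on answering two questions:
- What percent of majors are predominantly male? Predominantly female?
- What's the most common median salary range?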
NB: dataFrame[col_name].plot(kind='hist') was not used to generate the histograms because it's difficult to control the binning strategy. dataFrame[col_name].hist(bins=<digit>, range=(<digits>)) is preferable instead.
For a better understanding, check out Series.hist()
# histogram exploring sample_size
ax = recent_grads['sample_size'].hist()
ax.set_title('Distribution of Sample size data')
ax.set_xlabel('Sample_size')
ax.set_ylabel('Frequency')
<matplotlib.text.Text at 0x7fae82d700f0>
The diagram above shows that most of the sample_size values are below 500. A more detailed view can be obtained by restricting the histogram to the range 0 - 500.
# Histogram of sample_size of range 500
ax = recent_grads['sample_size'].hist(bins=20, range=(0,500))
ax.set_title('Distribution of Sample size data')
ax.set_xlabel('Sample_size')
ax.set_ylabel('Frequency')
<matplotlib.text.Text at 0x7fae82c5cc18>
Looking closer, most sample sizes fall below 100. With such small sample sizes, the median earnings reported for these majors may not be very accurate.
# Histogram of Median earnings
ax = recent_grads['median'].hist(bins=50, range=(20000,80000))
ax.set_title('Median distribution')
ax.set_xlabel('Median')
ax.set_ylabel('Frequency')
<matplotlib.text.Text at 0x7fae82d1c908>
Q. What's the most common median salary range?
A. The diagram shows the most common median salary range to be $30,000 - $40,000. The next is either $40,000 - $50,000 or $50,000 - $60,000, which is hard to tell apart without further analysis or visualization.
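When neighbouring bars are hard to tell apart by eye, the bin counts can be tabulated instead. The sketch below uses `pd.cut` with $10,000-wide bins. It is built only from the ten median salaries shown by head() and tail() earlier, which are the extremes of the dataset, so the range it reports is not the answer for the full dataset; the names `medians`, `edges` and `counts` are illustrative.

```python
import pandas as pd

# Median salaries from the ten rows shown by head() and tail() earlier
medians = pd.Series([110000, 75000, 73000, 70000, 65000,
                     26000, 25000, 25000, 23400, 22000])

# $10,000-wide bins from $20,000 to $120,000;
# right=False makes each bin include its lower edge: [low, high)
edges = list(range(20000, 130000, 10000))
binned = pd.cut(medians, bins=edges, right=False)

# Count values per bin and list the bins in salary order
counts = binned.value_counts().sort_index()
print(counts)
print('most common range in this sample:', counts.idxmax())
```

Running the same steps on `recent_grads['median']` would give exact counts for the $40,000 - $50,000 and $50,000 - $60,000 ranges.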
# histogram for Employed grads
ax = recent_grads['employed'].hist(bins=25, range=(0,30000))
ax.set_title('Distribution of employed graduates')
ax.set_xlabel('Employed')
ax.set_ylabel('Frequency')
<matplotlib.text.Text at 0x7fae82eb6940>
The diagram above shows that most majors have at least 5,000 employed graduates.
There's also a reasonable concentration of majors in the 5,000 - 15,000 range, and the larger majors can have 15,000 or more employed grads.
It would be fair to say that the number of students employed per major is affected by the number of students that have taken the major.
I'll examine the columns to determine if any relationship exists.
# E.g. the total number of people vs the number employed
# for the largest majors
# Filter by majors with total > 39,000 (75th percentile)
largest_majors = recent_grads.loc[recent_grads['total'] > 39000, ['major_code', 'major', 'total', 'employed']]
largest_majors.sort_values(by='total', ascending=False).head(10)
major_code | major | total | employed | |
---|---|---|---|---|
145 | 5200 | PSYCHOLOGY | 393735.0 | 307933 |
76 | 6203 | BUSINESS MANAGEMENT AND ADMINISTRATION | 329927.0 | 276234 |
123 | 3600 | BIOLOGY | 280709.0 | 182295 |
57 | 6200 | GENERAL BUSINESS | 234590.0 | 190183 |
93 | 1901 | COMMUNICATIONS | 213996.0 | 179633 |
34 | 6107 | NURSING | 209394.0 | 180903 |
77 | 6206 | MARKETING AND MARKETING RESEARCH | 205211.0 | 178862 |
40 | 6201 | ACCOUNTING | 198633.0 | 165527 |
137 | 3301 | ENGLISH LANGUAGE AND LITERATURE | 194673.0 | 149180 |
78 | 5506 | POLITICAL SCIENCE AND GOVERNMENT | 182621.0 | 133454 |
Above we notice an obvious relationship between the total number of grads with a major and the number employed.
To better illustrate this, a scatter plot will be used to show the relationship.
# Scatter plot: Total vs. employed
ax = recent_grads.plot(x='total', y='employed', kind='scatter')
ax.set_title('Total grads with majors vs. Employed grads')
ax.set_xlim(0,)
ax.set_ylim(0,)
ax.set_xlabel('Total')
ax.set_ylabel('Employed')
<matplotlib.text.Text at 0x7fae850175c0>
# histogram of full time employed grads
ax = recent_grads['full_time'].hist(bins=25, range=(0,250000))
ax.set_title('Distribution of Full time employees')
ax.set_xlabel('Full Time Employees')
ax.set_ylabel('Frequency')
<matplotlib.text.Text at 0x7fae82cb4d30>
The diagram above suggests that most majors have up to about 50,000 graduates in full-time employment. The remaining employed graduates may include interns, remote workers, etc., whose work lasts only for a given period of time.
# histogram for sharewomen
ax = recent_grads['sharewomen'].hist(bins=20)
ax.set_title('Distribution of a Fraction of Graduate women')
ax.set_xlabel('Sharewomen')
ax.set_ylabel('Frequency')
<matplotlib.text.Text at 0x7fae82ffd438>
It appears that just over 50% of all majors are predominantly female, with the highest frequency at 70 - 80% female.
# Evaluating majors with the highest share
# of females (0.6 - 0.8)
largest_female_share = recent_grads.loc[(recent_grads['sharewomen'] > 0.6) &
(recent_grads['sharewomen'] <= 0.8)][['major_code', 'major', 'total',
'men', 'women', 'sharewomen',
'employed', 'unemployed']]
print(largest_female_share.shape)
largest_female_share.sort_values(by='sharewomen', ascending=False).head()
(54, 8)
major_code | major | total | men | women | sharewomen | employed | unemployed | |
---|---|---|---|---|---|---|---|---|
170 | 5202 | CLINICAL PSYCHOLOGY | 2838.0 | 568.0 | 2270.0 | 0.799859 | 2101 | 368 |
155 | 5299 | MISCELLANEOUS PSYCHOLOGY | 9628.0 | 1936.0 | 7692.0 | 0.798920 | 7653 | 419 |
171 | 5203 | COUNSELING PSYCHOLOGY | 4626.0 | 931.0 | 3695.0 | 0.798746 | 3777 | 214 |
118 | 6110 | COMMUNITY AND PUBLIC HEALTH | 19735.0 | 4103.0 | 15632.0 | 0.792095 | 14512 | 1833 |
145 | 5200 | PSYCHOLOGY | 393735.0 | 86648.0 | 307087.0 | 0.779933 | 307933 | 28169 |
Comparing the men, women and total columns in the DataFrame above supports what the histogram shows: the majority of majors have more women than men.
# histogram of unemployment_rate
ax = recent_grads['unemployment_rate'].hist()
ax.set_title('Distribution of unemployment_rate')
ax.set_xlabel('Unemployment Rate')
ax.set_ylabel('Frequency')
<matplotlib.text.Text at 0x7fae82bb5da0>
The most common unemployment rate among the majors is 6-7%, while rates around 14% are the least common.
I will examine both cases below.
# Majors in the most common unemployment-rate range (6-7%)
highest_majors_unemployed = recent_grads.loc[(recent_grads['unemployment_rate'] >= 0.06) &
(recent_grads['unemployment_rate'] <= 0.07)][['major_code', 'major', 'major_category', 'unemployment_rate']]
highest_majors_unemployed.sort_values(by='unemployment_rate', ascending=False).head()
major_code | major | major_category | unemployment_rate | |
---|---|---|---|---|
121 | 6106 | HEALTH AND MEDICAL PREPARATORY PROGRAMS | Health | 0.069780 |
40 | 6201 | ACCOUNTING | Business | 0.069749 |
96 | 1902 | JOURNALISM | Communications & Journalism | 0.069176 |
101 | 3608 | PHYSIOLOGY | Biology & Life Science | 0.069163 |
151 | 5404 | SOCIAL WORK | Psychology & Social Work | 0.068828 |
The table above shows the 5 majors with the highest unemployment_rate within the 6-7% range, with Health and Medical Preparatory Programs at the top.
# Majors in the least common unemployment-rate range (12-14%)
highest_majors_unemployed = recent_grads.loc[(recent_grads['unemployment_rate'] >= 0.12) &
(recent_grads['unemployment_rate'] <= 0.14)][['major_code', 'major', 'major_category', 'median', 'unemployment_rate']]
highest_majors_unemployed.sort_values(by='unemployment_rate', ascending=False).head()
major_code | major | major_category | median | unemployment_rate | |
---|---|---|---|---|---|
29 | 5402 | PUBLIC POLICY | Law & Public Policy | 50000 | 0.128426 |
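PUBLIC POLICY is the only major in this range. Its unemployment rate of about 12.8% is well above the average of about 6.8% shown by describe(), so despite a good median salary of $50,000 its employment prospects are weaker than those of most majors.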
# histogram distribution of men
ax = recent_grads['men'].hist(bins=25, range=(0,200000))
ax.set_title('Distribution of Men')
ax.set_xlabel('Men')
ax.set_ylabel('Frequency')
<matplotlib.text.Text at 0x7fae82b20828>
Q. What percent of majors are predominantly male?
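A. The diagram above shows the number of male graduates per major rather than their share, so it cannot answer the question on its own. Since the sharewomen histogram suggests that just over 50% of majors are predominantly female, slightly fewer than half of the majors are predominantly male.
The histogram also doesn't identify the majors with the most male graduates.
# determining the majors with the largest numbers of male grads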
male_dominated_majors = recent_grads.loc[recent_grads['men'] >= 0, ['major_code', 'major', 'major_category', 'median', 'men', 'women']]
male_dominated_majors.sort_values(by='men', ascending=False).head(5)
major_code | major | major_category | median | men | women | |
---|---|---|---|---|---|---|
76 | 6203 | BUSINESS MANAGEMENT AND ADMINISTRATION | Business | 38000 | 173809.0 | 156118.0 |
57 | 6200 | GENERAL BUSINESS | Business | 40000 | 132238.0 | 102352.0 |
35 | 6207 | FINANCE | Business | 47000 | 115030.0 | 59476.0 |
123 | 3600 | BIOLOGY | Biology & Life Science | 33400 | 111762.0 | 168947.0 |
20 | 2102 | COMPUTER SCIENCE | Computers & Mathematics | 53000 | 99743.0 | 28576.0 |
The majors with the largest numbers of male graduates are BUSINESS MANAGEMENT AND ADMINISTRATION, GENERAL BUSINESS and FINANCE.
# histogram distribution of women
ax = recent_grads['women'].hist(bins=25, range=(0,200000))
ax.set_title('Distribution of women')
ax.set_xlabel('Women')
ax.set_ylabel('Frequency')
ax.set_ylim(0,120)
(0, 120)
Q. What percent of majors are predominantly Female?
A. The diagram above shows the number of female graduates per major rather than their share. The sharewomen histogram is the better guide here, and it suggests that just over 50% of majors are predominantly female.
# determining the majors with the largest numbers of female grads
female_dominated_majors = recent_grads.loc[recent_grads['women'] >= 0, ['major_code', 'major', 'major_category', 'median', 'men', 'women']]
female_dominated_majors.sort_values(by='women', ascending=False).head(5)
major_code | major | major_category | median | men | women | |
---|---|---|---|---|---|---|
145 | 5200 | PSYCHOLOGY | Psychology & Social Work | 31500 | 86648.0 | 307087.0 |
34 | 6107 | NURSING | Health | 48000 | 21773.0 | 187621.0 |
123 | 3600 | BIOLOGY | Biology & Life Science | 33400 | 111762.0 | 168947.0 |
138 | 2304 | ELEMENTARY EDUCATION | Education | 32000 | 13029.0 | 157833.0 |
76 | 6203 | BUSINESS MANAGEMENT AND ADMINISTRATION | Business | 38000 | 173809.0 | 156118.0 |
The majors with the largest numbers of female graduates are PSYCHOLOGY, NURSING and BIOLOGY.
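The percentage of majors that are predominantly male or female can be computed directly from the share of women rather than read off a histogram of counts. The sketch below is built only from the ten sharewomen values shown by head() and tail() earlier, so the percentages it prints describe that small sample and not the full dataset; the variable names are illustrative.

```python
import pandas as pd

# sharewomen values from the ten rows shown by head() and tail() earlier
sharewomen = pd.Series([0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
                        0.637293, 0.817099, 0.799859, 0.798746, 0.877960])

# The mean of a boolean Series is the fraction of True values
pct_female_majority = (sharewomen > 0.5).mean() * 100
pct_male_majority = (sharewomen < 0.5).mean() * 100

print(f'{pct_male_majority:.0f}% predominantly male, '
      f'{pct_female_majority:.0f}% predominantly female')
```

Applied to `recent_grads['sharewomen']`, the same two lines would answer both questions for all majors.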
In order to evaluate the relationship between multiple columns more efficiently, a scatter matrix plot is a good solution.
A scatter matrix plot combines both scatter plots and histograms into one grid of plots and allows us to explore potential relationships and distributions simultaneously.
# importing scatter_matrix
from pandas.plotting import scatter_matrix
# A 2 by 2 scatter matrix plot of Sample_size
# and Median Salary
scatter_matrix(recent_grads[['sample_size', 'median']],
figsize=(10,10))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fae82a05438>, <matplotlib.axes._subplots.AxesSubplot object at 0x7fae82968438>], [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae82930780>, <matplotlib.axes._subplots.AxesSubplot object at 0x7fae828ed240>]], dtype=object)
The diagram above shows most sample sizes to be less than 1000 (top-left histogram). The scatter plot of Median vs. Sample_size (bottom-left) suggests that the median salary is mostly somewhere around $30,000 - $40,000.
However, the scatter plot of Sample_size vs. Median (top-right) suggests that an increase in sample size doesn't necessarily affect the median salary.
# A 3 x 3 scatter matrix plot of sample_size, median and
# unemployment columns
scatter_matrix(recent_grads[['sample_size', 'median', 'unemployment_rate']],
figsize=(10,10))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fae8280ff98>, <matplotlib.axes._subplots.AxesSubplot object at 0x7fae8280ac50>, <matplotlib.axes._subplots.AxesSubplot object at 0x7fae82759668>], [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae827160f0>, <matplotlib.axes._subplots.AxesSubplot object at 0x7fae826df128>, <matplotlib.axes._subplots.AxesSubplot object at 0x7fae8269e8d0>], [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae82666fd0>, <matplotlib.axes._subplots.AxesSubplot object at 0x7fae82626710>, <matplotlib.axes._subplots.AxesSubplot object at 0x7fae825f3940>]], dtype=object)
There's no clear correlation in the scatter matrix plot above. A scatter matrix is also a quick way to show the relationships between columns examined in earlier cells, e.g. the total number of students with a major vs. the number employed, or the total number of students with a major vs. the number unemployed.
# Total vs unemployed scatter matrix
scatter_matrix(recent_grads[['total', 'unemployed']], figsize=(10,10))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fae826cdef0>, <matplotlib.axes._subplots.AxesSubplot object at 0x7fae82454fd0>], [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae824229b0>, <matplotlib.axes._subplots.AxesSubplot object at 0x7fae823de630>]], dtype=object)
This shows a weak positive correlation between the total number of students with a major and the number unemployed, suggesting that only a small fraction of students in each major are unemployed.
# Total vs employed scatter matrix
scatter_matrix(recent_grads[['total', 'employed']], figsize=(10,10))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fae82383fd0>, <matplotlib.axes._subplots.AxesSubplot object at 0x7fae823010f0>], [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae822c8a90>, <matplotlib.axes._subplots.AxesSubplot object at 0x7fae82285668>]], dtype=object)
The plot above shows a very strong positive correlation between the total number of students with a major and the number employed. This simply means that the majority of students in each major are employed.
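The strength of a relationship seen in a scatter matrix can be quantified with `DataFrame.corr()`, which returns the pairwise Pearson correlation matrix. The sketch below uses only the ten largest majors listed in the table earlier, so the coefficient describes that subset and not the whole dataset; the name `largest` is illustrative.

```python
import pandas as pd

# total and employed for the ten largest majors listed in the earlier table
largest = pd.DataFrame({
    'total': [393735, 329927, 280709, 234590, 213996,
              209394, 205211, 198633, 194673, 182621],
    'employed': [307933, 276234, 182295, 190183, 179633,
                 180903, 178862, 165527, 149180, 133454],
})

# Pairwise Pearson correlation between every pair of numeric columns
corr = largest.corr()
print(corr.round(2))
```

For the full comparison, `recent_grads[['total', 'employed', 'unemployed']].corr()` would put all three relationships in one table.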
Bar plots can be created using Series object: df[range][col].plot(kind='bar')
or DataFrame object: df[range].plot.bar(x=labels, y=data for bars)
# Bar plot of sharewomen from first ten rows vs sharewomen
# from last ten rows
ax1 = recent_grads[:10].plot.bar(x='major', y='sharewomen',
title='Fraction of female grads from the top 10 courses with the highest median salary')
ax2 = recent_grads[-10:].plot.bar(x='major', y='sharewomen',
title='Fraction of female grads from the bottom 10 courses with the least median salary')
Above we observe that the courses with the highest median salaries have a lower share of female grads than those with the lowest median salaries. In the lowest-paying courses, females are the majority (i.e. more than 50% of the grads).
We can calculate how large the difference is below:
# Calculating the average proportion of female grads for the
# top and bottom 10 courses
# NB: unlike Python list slices, label slices with .loc include
# both the start and the stop label
top_10_female_share = recent_grads.loc[:9, 'sharewomen'].mean()
bottom_10_female_share = recent_grads[-10:]['sharewomen'].mean()
top_10 = ('The 10 highest-paying courses have an average '
          'female share of {:.2f}'.format(top_10_female_share))
bottom_10 = ('The 10 lowest-paying courses have an average '
             'female share of {:.2f}'.format(bottom_10_female_share))
print(top_10)
print(bottom_10)
The 10 highest-paying courses have an average female share of 0.23
The 10 lowest-paying courses have an average female share of 0.79
The average share of female grads differs markedly between the top and bottom 10 courses by median pay: 0.79 vs. 0.23, a gap of more than 50 percentage points.
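The note about `.loc` in the cell above is worth a quick check, since the top 10 is taken with a label slice and the bottom 10 with a positional slice. The sketch below uses a toy frame with invented values. It also shows an assumption-dependent caveat: if rows were dropped earlier (for example rows with missing values), the index labels have a gap and `.loc[:9]` no longer returns the same number of rows as a positional slice.

```python
import pandas as pd

# Toy frame with the default 0..11 integer index (values invented).
toy = pd.DataFrame({'sharewomen': [i / 100 for i in range(12)]})

by_label = toy.loc[:9, 'sharewomen']    # label slice: labels 0 through 9, stop included
by_position = toy[:10]['sharewomen']    # positional slice: rows 0 through 9, stop excluded
last_ten = toy[-10:]['sharewomen']      # positional slice: the final ten rows

print(len(by_label), len(by_position), len(last_ten))

# After dropping a row the labels have a gap, so the two styles differ.
gappy = toy.drop(index=3)
print(len(gappy.loc[:9]), len(gappy.iloc[:10]))
```

Using `.iloc[:10]` and `.iloc[-10:]` for both ends would make the selection independent of the index labels.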
Next, we check out the difference in the unemployment rate between the top and bottom 10 courses.
# Unemployment rate for top and bottom 10 courses
ax1 = recent_grads[:10].plot.bar(x='major', y='unemployment_rate',
title='Unemployment rate for top 10 courses.')
ax2 = recent_grads[-10:].plot.bar(x='major', y='unemployment_rate',
    title='Unemployment rate for bottom 10 courses.')
For the top 10 courses, the unemployment rate is generally low; however, two courses, NUCLEAR ENGINEERING and MINING AND MINERAL ENGINEERING, stand out with noticeably high rates.
For the bottom 10 courses, the unemployment rate seems moderately high, driven by 3-5 courses.
We can analyse this further by looking at the average unemployment rates.
# calculating the average unemployment rates for the
# top and bottom 10 courses
# NB: unlike Python list slices, label slices with .loc include
# both the start and the stop label
mean_unemp_rate = recent_grads['unemployment_rate'].mean()
top_10_unemp_rate = recent_grads.loc[:9, 'unemployment_rate'].mean()
bottom_10_unemp_rate = recent_grads[-10:]['unemployment_rate'].mean()
print(('The average unemployment rate for all majors is {:.2f}'
.format(mean_unemp_rate)))
print(('The average unemployment rate for the top 10 '
'majors is {:.2f}'.format(top_10_unemp_rate)))
print(('The average unemployment rate for the bottom 10 '
'majors is {:.2f}'.format(bottom_10_unemp_rate)))
The average unemployment rate for all majors is 0.07
The average unemployment rate for the top 10 majors is 0.07
The average unemployment rate for the bottom 10 majors is 0.08
The average unemployment rate for all majors is similar to that of the top and bottom 10 courses. However, in the top 10 only two courses have distinctly high rates, while in the bottom 10 there are about 3-5 such courses.
We could examine this further below:
# Majors in the top 10 whose unemployment rate is above the
# overall mean; .copy() avoids a SettingWithCopyWarning when
# the new column is added to the filtered rows
top_10_outliers = (recent_grads[:10]
    .loc[recent_grads[:10]['unemployment_rate'] > mean_unemp_rate]
    .copy())
top_10_outliers['mean_difference'] = (
    top_10_outliers['unemployment_rate'] - mean_unemp_rate)
# Same selection and new column for the bottom 10 courses
bottom_10_outliers = (recent_grads[-10:]
    .loc[recent_grads[-10:]['unemployment_rate'] > mean_unemp_rate]
    .copy())
bottom_10_outliers['mean_difference'] = (
    bottom_10_outliers['unemployment_rate'] - mean_unemp_rate)
# Plotting, for the top and bottom 10, the courses above the
# average unemployment rate and their mean_difference
top_10_outliers.plot.bar(x='major', y='mean_difference',
    title='Majors in the top 10 above average unemployment rate.')
bottom_10_outliers.plot.bar(x='major', y='mean_difference',
    title='Majors in the bottom 10 above average unemployment rate.')
In the top 10 courses, NUCLEAR ENGINEERING contributes most to pushing up the unemployment rate, while in the bottom 10 it is CLINICAL PSYCHOLOGY.
Moving on, I'll generate visualizations to carry out a more in-depth analysis of the following questions:
NB: For more plot types available in pandas, check out the plotting section of the pandas documentation.
# comparing the number of men and women in each category major
# NB: df.plot.bar(stacked=True) stacks the plot on top of each
# other
ax1 = (recent_grads.groupby('major_category')[['men','women']]
.sum().plot.bar(stacked=True))
ax1.set_title('Category Majors vs. Number of Men and women')
ax1.set_ylabel('Total')
Q. Comparing the number of men and women in each category of majors using a stacked bar plot.
A. The plot above shows that the Business category has the highest combined number of male and female graduates, and the two groups are fairly evenly split.
Generally, female grads outnumber male grads across the major categories, with some exceptions such as ENGINEERING and COMPUTER & MATHEMATICS, which are dominated by male grads.
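The balance between men and women in a category can also be read off as a number rather than from bar heights. The sketch below is one way to do it, using a toy frame whose category names echo the dataset but whose counts are invented; `share_women` is a hypothetical column name introduced only for this illustration.

```python
import pandas as pd

# Toy rows: category names echo the dataset, the counts are invented.
toy = pd.DataFrame({
    'major_category': ['Engineering', 'Engineering', 'Education', 'Education'],
    'men':   [800, 600, 100, 200],
    'women': [200, 400, 400, 300],
})

# Sum men and women within each category, as in the stacked bar plot.
by_category = toy.groupby('major_category')[['men', 'women']].sum()

# Share of women per category = women / (men + women).
by_category['share_women'] = (
    by_category['women'] / (by_category['men'] + by_category['women']))

print(by_category)
```

A share above 0.5 marks a category where women are the majority, which makes the exceptions easy to list by filtering on that column.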
# box plot: exploring the distribution of median salaries
recent_grads['median'].plot.box(title='Distribution of Median salaries')
The figure above shows that the median salaries of most majors fall between roughly $30,000 and $40,000. There are also outliers: majors whose median salary lies somewhere around $60,000 to $80,000.
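The points drawn individually in a box plot follow a simple rule: by default, matplotlib's whiskers reach at most 1.5 times the interquartile range (IQR) beyond the box, and anything further out is marked as an outlier. The sketch below applies that rule by hand to a toy Series of invented salaries, not to the real `median` column.

```python
import pandas as pd

# Toy median salaries in dollars (invented values).
median = pd.Series([30000, 32000, 35000, 36000, 38000, 40000, 42000, 75000])

q1 = median.quantile(0.25)   # lower edge of the box
q3 = median.quantile(0.75)   # upper edge of the box
iqr = q3 - q1

# Values beyond these fences are the ones a box plot draws as separate points.
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr
outliers = median[(median > upper_fence) | (median < lower_fence)]

print(outliers)
```

Running the same steps on `recent_grads['median']` would give the exact salary above which a major counts as an outlier in the plot.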
# box plot: exploring the distribution of the unemployment rate
ax = recent_grads['unemployment_rate'].plot.box(title='Distribution of unemployment rate')
# hexagonal bin plot: total vs employed
recent_grads.plot.hexbin(x='employed', y='total', gridsize=30)
recent_grads.plot.hexbin(x='men', y='median', gridsize=30)