The goal of this project is to analyze recent college graduates by major: how much they
earn and how employment rates vary from one major to another.
To answer these questions I use the pandas, numpy and matplotlib modules
to create plots that guide the answers.
# Importing modules to be used
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numpy import arange
%matplotlib inline
# %matplotlib inline is so that plots can show in the cell
# Reading the recent-grads.csv into a dataframe
recent_grads = pd.read_csv('recent-grads.csv', encoding='latin-1')
# Observing the first row of the dataframe
recent_grads.iloc[0]
Rank                    1
Major_code              2419
Major                   PETROLEUM ENGINEERING
Total                   2339
Men                     2057
Women                   282
Major_category          Engineering
ShareWomen              0.120564
Sample_size             36
Employed                1976
Full_time               1849
Part_time               270
Full_time_year_round    1207
Unemployed              37
Unemployment_rate       0.0183805
Median                  110000
P25th                   95000
P75th                   125000
College_jobs            1534
Non_college_jobs        364
Low_wage_jobs           193
Name: 0, dtype: object
recent_grads.describe()
Statistic | Rank | Major_code | Total | Men | Women | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 173.000000 | 173.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
mean | 87.000000 | 3879.815029 | 39370.081395 | 16723.406977 | 22646.674419 | 0.522223 | 356.080925 | 31192.763006 | 26029.306358 | 8832.398844 | 19694.427746 | 2416.329480 | 0.068191 | 40151.445087 | 29501.445087 | 51494.219653 | 12322.635838 | 13284.497110 | 3859.017341 |
std | 50.084928 | 1687.753140 | 63483.491009 | 28122.433474 | 41057.330740 | 0.231205 | 618.361022 | 50675.002241 | 42869.655092 | 14648.179473 | 33160.941514 | 4112.803148 | 0.030331 | 11470.181802 | 9166.005235 | 14906.279740 | 21299.868863 | 23789.655363 | 6944.998579 |
min | 1.000000 | 1100.000000 | 124.000000 | 119.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 111.000000 | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.000000 | 2403.000000 | 4549.750000 | 2177.500000 | 1778.250000 | 0.336026 | 39.000000 | 3608.000000 | 3154.000000 | 1030.000000 | 2453.000000 | 304.000000 | 0.050306 | 33000.000000 | 24000.000000 | 42000.000000 | 1675.000000 | 1591.000000 | 340.000000 |
50% | 87.000000 | 3608.000000 | 15104.000000 | 5434.000000 | 8386.500000 | 0.534024 | 130.000000 | 11797.000000 | 10048.000000 | 3299.000000 | 7413.000000 | 893.000000 | 0.067961 | 36000.000000 | 27000.000000 | 47000.000000 | 4390.000000 | 4595.000000 | 1231.000000 |
75% | 130.000000 | 5503.000000 | 38909.750000 | 14631.000000 | 22553.750000 | 0.703299 | 338.000000 | 31433.000000 | 25147.000000 | 9948.000000 | 16891.000000 | 2393.000000 | 0.087557 | 45000.000000 | 33000.000000 | 60000.000000 | 14444.000000 | 11783.000000 | 3466.000000 |
max | 173.000000 | 6403.000000 | 393735.000000 | 173809.000000 | 307087.000000 | 0.968954 | 4212.000000 | 307933.000000 | 251540.000000 | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
From the description of the dataframe, most columns have 173 entries, but Total, Men, Women and ShareWomen have only 172, which suggests that at least one row has null values in those columns.
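One way to confirm which columns are short is to count the nulls per column directly. The sketch below uses a small made-up dataframe (`toy`, with invented values) in place of `recent_grads`, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for recent_grads; the values are made up for illustration.
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Total": [2339.0, np.nan, 856.0, 1258.0],
    "Men": [2057.0, np.nan, 725.0, 1123.0],
    "Median": [110000, 53000, 73000, 70000],
})

# Number of missing values in each column
null_counts = toy.isna().sum()
print(null_counts)

# Keep only the columns that actually contain nulls
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```

Running the same two lines on `recent_grads` before calling `dropna()` would show which columns account for the 172 counts.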
raw_data_count = recent_grads.shape[0]
# The dataframe shape attribute returns a tuple (rows, columns), so
# I stored the first item, the number of rows, in a variable
recent_grads = recent_grads.dropna()
# Here I dropped rows with null values for a more accurate analysis
cleaned_data_count = recent_grads.shape[0]
# Getting the number of rows from the cleaned dataframe
print('Raw data count: ', raw_data_count,'----- Cleaned data count: ', cleaned_data_count)
Raw data count: 173 ----- Cleaned data count: 172
The raw data had 173 rows and the cleaned data has 172, which means that only one row contained null values.
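To see which row would be removed before dropping it, the rows containing nulls can be selected with a boolean mask. This is a minimal sketch on a made-up dataframe (`toy` and its values are assumptions, not the real data).

```python
import numpy as np
import pandas as pd

# Toy stand-in for recent_grads; the values are made up for illustration.
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Total": [2339.0, np.nan, 856.0, 1258.0],
    "Median": [110000, 53000, 73000, 70000],
})

# Rows where at least one column is null
rows_with_nulls = toy[toy.isna().any(axis=1)]
print(rows_with_nulls)

# dropna() removes exactly those rows
cleaned = toy.dropna()
dropped = len(toy) - len(cleaned)
print("Rows dropped:", dropped)
```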
# Creating a scatter plot of independent variable Total and dependent variable Median
recent_grads.plot(x='Total', y='Median', kind='scatter', figsize=(5,10), title='Median vs. Total')
<matplotlib.axes._subplots.AxesSubplot at 0x7f7067f21080>
From the scatter plot above we can see that graduates of the most popular majors don't necessarily earn more money.
# Creating a scatter plot of independent variable ShareWomen and dependent variable Median
recent_grads.plot(x='ShareWomen', y='Median', kind='scatter', title='Median vs. ShareWomen', figsize=(7,10))
<matplotlib.axes._subplots.AxesSubplot at 0x7f7067f39390>
The scatter plot above shows that graduates of majors with a larger share of women tend to earn less than those of majors with a smaller share, although the relationship is quite weak (a weak negative correlation).
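The strength of a relationship like this can be put into a number with the Pearson correlation coefficient, which pandas exposes as `Series.corr`. The sketch below uses invented values in a dataframe named `toy`; on the real data the call would be `recent_grads['ShareWomen'].corr(recent_grads['Median'])`.

```python
import pandas as pd

# Made-up values for illustration only, not taken from recent-grads.csv.
toy = pd.DataFrame({
    "ShareWomen": [0.10, 0.25, 0.40, 0.55, 0.70, 0.90],
    "Median": [60000, 45000, 52000, 38000, 41000, 30000],
})

# Series.corr computes the Pearson coefficient by default.
# Values near -1 or 1 indicate a strong linear relationship,
# values near 0 a weak one.
r = toy["ShareWomen"].corr(toy["Median"])
print(round(r, 2))
```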
# Creating a scatter plot of independent variable Full_time and dependent variable Median
recent_grads.plot(x='Full_time', y='Median', kind='scatter', title='Full_time vs. Median', figsize=(7,10))
<matplotlib.axes._subplots.AxesSubplot at 0x7f706c023dd8>
There is no clear relationship between the number of full-time employees and the median salary: as the number of full-time employees increases, the salary does not noticeably change.
# Histogram showing the distribution of salary
recent_grads['Median'].hist(bins=20, range=(22000, 115000))
<matplotlib.axes._subplots.AxesSubplot at 0x7f7067c2b6d8>
From the histogram above, median salaries between 30,000 and 40,000 are the most common across the majors.
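The bin counts behind a histogram can also be read off numerically with `numpy.histogram`, using the same `bins` and `range` arguments. The salary values below are invented for illustration. Note that 20 bins over 22,000 to 115,000 are each 4,650 wide, so a range such as 30,000 to 40,000 spans roughly two bins.

```python
import numpy as np

# Made-up median salaries for illustration only.
medians = np.array([28000, 32000, 33000, 34000, 35000, 36000,
                    38000, 40000, 45000, 52000, 60000, 110000])

# Same binning as hist(bins=20, range=(22000, 115000))
counts, edges = np.histogram(medians, bins=20, range=(22000, 115000))

# Index of the fullest bin and its salary interval
top = counts.argmax()
print("Bin width:", edges[1] - edges[0])
print("Most common bin:", edges[top], "to", edges[top + 1], "with", counts[top], "majors")
```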
# Importing the scatter_matrix from pandas.plotting
from pandas.plotting import scatter_matrix
# Creating a scatter matrix of Total vs. Median
scatter_matrix(recent_grads[['Total', 'Median']], figsize=(10,10))
plt.suptitle('Total vs. Median')
<matplotlib.text.Text at 0x7f706736b4a8>
The scatter matrix of Total vs. Median shows no clear correlation between the two. The histograms on the diagonal show that most majors have between 0 and 100,000 graduates and that the most common median salary is between 30,000 and 40,000.
# Creating a bar plot of the average median salary for each major category
# (mean rather than sum, so categories with more majors are not inflated)
recent_grads[['Median','Major_category']].groupby('Major_category').mean().plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x7f705ece01d0>
As the bar plot above shows, Engineering is the major category with the highest earners.
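One caveat when aggregating per-major medians by category: a sum grows with the number of majors in the category, while a mean does not. The sketch below uses two hypothetical categories (`Cat_A`, `Cat_B`) with invented salaries to show how the two aggregations can rank categories differently.

```python
import pandas as pd

# Hypothetical categories and salaries, for illustration only.
toy = pd.DataFrame({
    "Major_category": ["Cat_A", "Cat_A", "Cat_A", "Cat_A", "Cat_B", "Cat_B"],
    "Median": [30000, 30000, 30000, 30000, 50000, 50000],
})

grouped = toy.groupby("Major_category")["Median"]

# Sum rewards categories that simply contain more majors
by_sum = grouped.sum()
# Mean compares the typical median salary per major
by_mean = grouped.mean()

print("Highest by sum: ", by_sum.idxmax())
print("Highest by mean:", by_mean.idxmax())
```

Here `Cat_A` wins on the sum only because it has four majors, while `Cat_B` has the higher salary per major.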
# Creating a bar plot of the share of women in the first ten majors in the dataset
recent_grads[:10].plot.bar(x='Major', y='ShareWomen')
# Creating a bar plot of the share of women in the last ten majors in the dataset
recent_grads[-10:].plot.bar(x='Major', y='ShareWomen')
<matplotlib.axes._subplots.AxesSubplot at 0x7f7066f2bc50>
Among the first ten majors, Astronomy and Astrophysics has the highest share of female graduates. Among the last ten, Communication Disorders Sciences and Services and Early Childhood Education have the highest shares.
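Reading the tallest bar off a plot can be cross-checked by looking up the maximum in each slice. This is a small sketch with invented majors and shares; `top_share` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Made-up majors and shares for illustration only.
toy = pd.DataFrame({
    "Major": ["Major 1", "Major 2", "Major 3", "Major 4", "Major 5", "Major 6"],
    "ShareWomen": [0.12, 0.54, 0.10, 0.80, 0.97, 0.91],
})

def top_share(frame):
    """Return (major, share) for the row with the largest ShareWomen."""
    row = frame.loc[frame["ShareWomen"].idxmax()]
    return row["Major"], row["ShareWomen"]

# First three and last three rows, mirroring the [:10] and [-10:] slices
first_major, first_share = top_share(toy[:3])
last_major, last_share = top_share(toy[-3:])

print(first_major, first_share)
print(last_major, last_share)
```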
recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate')
recent_grads[-10:].plot.bar(x='Major', y='Unemployment_rate')
<matplotlib.axes._subplots.AxesSubplot at 0x7f706dd56a58>
From the bar plots above, Nuclear Engineering has the highest unemployment rate among the first ten majors, and Clinical Psychology has the highest among the last ten.
# Creating a group bar plot of Men and Women by Major_category
recent_grads.groupby('Major_category')[['Men', 'Women']].sum().plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x7f7066ba6780>
The Business major category has more male and more female graduates than any other major category.
# Creating a boxplot of unemployment_rate
recent_grads.boxplot(column=['Unemployment_rate'], vert=False)
<matplotlib.axes._subplots.AxesSubplot at 0x7f7066d65d30>
The boxplot above shows four outliers in the unemployment rate.
The minimum unemployment rate is 0.0.
The largest value that is not an outlier (the end of the upper whisker) is about 0.13.
The median unemployment rate is about 0.07.
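The whiskers and outliers in a boxplot follow from the quartiles: with matplotlib's default setting (`whis=1.5`), a whisker ends at the most extreme data point within 1.5 times the interquartile range of the box, and anything beyond is drawn as an outlier. The sketch below reproduces that rule on invented unemployment rates.

```python
import numpy as np

# Made-up unemployment rates for illustration only.
rates = np.array([0.01, 0.03, 0.05, 0.06, 0.07, 0.07, 0.08, 0.09, 0.10, 0.16])

# Box edges: first and third quartiles
q1, q3 = np.percentile(rates, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points inside the fences
inside = rates[(rates >= lower_fence) & (rates <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points outside the fences are the outliers
outliers = rates[(rates < lower_fence) | (rates > upper_fence)]

print("Median:", np.median(rates))
print("Whiskers:", whisker_low, "to", whisker_high)
print("Outliers:", outliers)
```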
# Creating a boxplot of Median
recent_grads.boxplot(column=['Median'], vert=False)
<matplotlib.axes._subplots.AxesSubplot at 0x7f7066ed8e10>
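The middle 50% of median salaries lies roughly between 33,000 and 45,000, in line with the 25% and 75% values in the describe() output above (computed before the null row was dropped).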
There are four outliers in the median salary
fig, [ax1, ax2] = plt.subplots(1,2, figsize=(12,5))
recent_grads.plot.hexbin(x='Full_time', y='Median',gridsize=10, ax=ax1 )
ax1.set_title('Full_time vs. Median')
ax1.set_xlim(0)
ax1.set_ylim(0)
recent_grads.plot.scatter(x='Full_time', y='Median', ax=ax2)
ax2.set_xlim(0)
ax2.set_ylim(0)
ax2.set_title('Full_time vs. Median')
<matplotlib.text.Text at 0x7f706c9189e8>
The hexbin and scatter plots show no clear correlation between the number of full-time employees and the median salary.
fig, [ax1, ax2] = plt.subplots(1,2, figsize=(16,5))
recent_grads.plot.hexbin(x='Men', y='Median',gridsize=10, ax=ax1 )
ax1.set_title('Median vs. Men')
ax1.set_xlim(0)
ax1.set_ylim(0)
recent_grads.plot.scatter(x='Men', y='Median', ax=ax2)
ax2.set_xlim(0)
ax2.set_ylim(0)
ax2.set_title('Median vs. Men')
<matplotlib.text.Text at 0x7f706c7bc3c8>
From the hexbin and scatter plots above, the number of men in a major does not appear to affect the median salary.
fig, [ax1, ax2] = plt.subplots(1,2, figsize=(12,5))
recent_grads.plot.hexbin(x='Women', y='Median',gridsize=10, ax=ax1 )
ax1.set_title('Median vs. Women')
ax1.set_xlim(0)
ax1.set_ylim(0)
recent_grads.plot.scatter(x='Women', y='Median', ax=ax2)
ax2.set_xlim(0)
ax2.set_ylim(0)
ax2.set_title('Median vs. Women')
<matplotlib.text.Text at 0x7f706c6bd390>
The number of women in a major also does not appear to affect the median salary.
From the analysis of the data set I've come to the conclusion that:
I hope this is helpful to others. Please feel free to share your feedback on this project. Cheers.