
An important decision in life: you'd better think twice!

In this project we'll analyse whether there is any relationship between the majors students choose in college and their future jobs. We'll explore how this important decision relates to employment rates, types of jobs, and earnings.

We'll use two Python libraries for this study: pandas for data analysis, and matplotlib for visualization.

We'll try to answer the following questions:

Which major categories have the most students?
Do students in more popular majors make more money?
How many majors are predominantly male? Predominantly female?
Is this last aspect important for their future median salary?
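As a rough sketch of how the first question could be approached, the snippet below sums "Total" within each "Major_category". The `toy` frame and the helper name `students_per_category` are assumptions made for illustration; the numbers are invented and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for recent_grads; values are invented for illustration only.
toy = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Business", "Arts"],
    "Total": [1000, 500, 4000, 2500],
})

def students_per_category(df):
    """Sum Total within each Major_category, largest category first."""
    return df.groupby("Major_category")["Total"].sum().sort_values(ascending=False)

category_totals = students_per_category(toy)
print(category_totals)
```

With the real data, `students_per_category(recent_grads)` would give the same kind of ranking.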

Summary of conclusions:

The most popular majors, as one might guess, are strongly associated with low-wage jobs and higher unemployment, although their median salary for full-time jobs is not lower than that of the rest.
Predominantly male majors have a higher-than-average median salary for full-time jobs.
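One possible way to check the second conclusion is to split majors at ShareWomen = 0.5 and compare the groups. This is only a sketch: the 0.5 threshold, the helper name `median_by_majority`, and the toy values are assumptions, and a major at exactly 0.5 is counted as "male" here.

```python
import pandas as pd

# Toy stand-in for recent_grads; values are invented for illustration only.
toy = pd.DataFrame({
    "ShareWomen": [0.10, 0.30, 0.60, 0.80],
    "Median": [70000, 50000, 40000, 30000],
})

def median_by_majority(df):
    """Count majors and average Median for majority-male vs. majority-female majors."""
    majority = df["ShareWomen"].gt(0.5).map({True: "female", False: "male"})
    return df.groupby(majority)["Median"].agg(["count", "mean"])

by_majority = median_by_majority(toy)
print(by_majority)
```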

A. Preliminary data analysis

We'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012.

The original data on job outcomes was released by the American Community Survey, which conducts surveys and aggregates the data. FiveThirtyEight cleaned the dataset and released it on their GitHub repo.

In this section we'll look at the structure of the DataFrame and the meaning of each column.

In [1]:

# read in the dataset
import pandas as pd
recent_grads = pd.read_csv("recent-grads.csv")

# a look at the second row (positional index 1) as a table
recent_grads.iloc[1]

Out[1]:

Rank                                                 2
Major_code                                        2416
Major                   MINING AND MINERAL ENGINEERING
Total                                              756
Men                                                679
Women                                               77
Major_category                             Engineering
ShareWomen                                    0.101852
Sample_size                                          7
Employed                                           640
Full_time                                          556
Part_time                                          170
Full_time_year_round                               388
Unemployed                                          85
Unemployment_rate                             0.117241
Median                                           75000
P25th                                            55000
P75th                                            90000
College_jobs                                       350
Non_college_jobs                                   257
Low_wage_jobs                                       50
Name: 1, dtype: object

Description of some of the columns:

Rank - Rank by median earnings (the dataset is ordered by this column).
Major_code - Major code.
Major - Major description.
Major_category - Category of major.
Total - Total number of people with major.
Sample_size - Sample size (unweighted) of full-time, year-round workers (used for earnings).
Men - Male graduates.
Women - Female graduates.
ShareWomen - Women as share of total.
Employed - Number employed.
Median - Median salary of full-time, year-round workers.
Low_wage_jobs - Number in low-wage service jobs.
Full_time - Number employed 35 hours or more.
Part_time - Number employed less than 35 hours.
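The descriptions above can be sanity-checked against the row shown in Out[1]. The sketch below copies the values from that row; the formula used for the unemployment rate, Unemployed / (Employed + Unemployed), is an assumption that happens to be consistent with this row.

```python
import pandas as pd

# Values copied from the Rank 2 row (MINING AND MINERAL ENGINEERING) in Out[1].
row = pd.Series({"Total": 756, "Men": 679, "Women": 77,
                 "Employed": 640, "Unemployed": 85})

# ShareWomen is described as women as a share of the total.
share_women = row["Women"] / row["Total"]

# Assumed definition: unemployed as a share of the labour force.
unemployment_rate = row["Unemployed"] / (row["Employed"] + row["Unemployed"])

print(round(share_women, 6))        # 0.101852, matches ShareWomen
print(round(unemployment_rate, 6))  # 0.117241, matches Unemployment_rate
```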

In [2]:

# a look at our dataframe structure
recent_grads.info()
print("o" + "-"*90 + "o") # dash separation

print(recent_grads.head(3))
print("o" + "-"*90 + "o")

raw_data_count = recent_grads.shape[0]
print("DataFrame 'recent_grads' , number of rows: {}\nDataFrame 'recent_grads' , number of columns: {}\nraw_data_count = {}".format(recent_grads.shape[0],recent_grads.shape[1],recent_grads.shape[0]))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 21 columns):
Rank                    173 non-null int64
Major_code              173 non-null int64
Major                   173 non-null object
Total                   172 non-null float64
Men                     172 non-null float64
Women                   172 non-null float64
Major_category          173 non-null object
ShareWomen              172 non-null float64
Sample_size             173 non-null int64
Employed                173 non-null int64
Full_time               173 non-null int64
Part_time               173 non-null int64
Full_time_year_round    173 non-null int64
Unemployed              173 non-null int64
Unemployment_rate       173 non-null float64
Median                  173 non-null int64
P25th                   173 non-null int64
P75th                   173 non-null int64
College_jobs            173 non-null int64
Non_college_jobs        173 non-null int64
Low_wage_jobs           173 non-null int64
dtypes: float64(5), int64(14), object(2)
memory usage: 28.5+ KB
o------------------------------------------------------------------------------------------o
   Rank  Major_code                           Major   Total     Men  Women  \
0     1        2419           PETROLEUM ENGINEERING  2339.0  2057.0  282.0   
1     2        2416  MINING AND MINERAL ENGINEERING   756.0   679.0   77.0   
2     3        2415       METALLURGICAL ENGINEERING   856.0   725.0  131.0   

  Major_category  ShareWomen  Sample_size  Employed      ...        Part_time  \
0    Engineering    0.120564           36      1976      ...              270   
1    Engineering    0.101852            7       640      ...              170   
2    Engineering    0.153037            3       648      ...              133   

   Full_time_year_round  Unemployed  Unemployment_rate  Median  P25th   P75th  \
0                  1207          37           0.018381  110000  95000  125000   
1                   388          85           0.117241   75000  55000   90000   
2                   340          16           0.024096   73000  50000  105000   

   College_jobs  Non_college_jobs  Low_wage_jobs  
0          1534               364            193  
1           350               257             50  
2           456               176              0  

[3 rows x 21 columns]
o------------------------------------------------------------------------------------------o
DataFrame 'recent_grads' , number of rows: 173
DataFrame 'recent_grads' , number of columns: 21
raw_data_count = 173

In [3]:

recent_grads.describe(include = "all").iloc[:5] # include also non numeric columns

Out[3]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
count	173.0	173.000000	173	172.000000	172.000000	172.000000	173	172.000000	173.000000	173.000000	...	173.000000	173.000000	173.00000	173.000000	173.000000	173.000000	173.000000	173.000000	173.00000	173.000000
unique	NaN	NaN	173	NaN	NaN	NaN	16	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
top	NaN	NaN	ARCHITECTURAL ENGINEERING	NaN	NaN	NaN	Engineering	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
freq	NaN	NaN	1	NaN	NaN	NaN	29	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
mean	87.0	3879.815029	NaN	39370.081395	16723.406977	22646.674419	NaN	0.522223	356.080925	31192.763006	...	8832.398844	19694.427746	2416.32948	0.068191	40151.445087	29501.445087	51494.219653	12322.635838	13284.49711	3859.017341

5 rows × 21 columns

In [4]:

# locate the row with NaN values in some columns
recent_grads[recent_grads["Total"].isnull()]

Out[4]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
21	22	1104	FOOD SCIENCE	NaN	NaN	NaN	Agriculture & Natural Resources	NaN	36	3149	...	1121	1735	338	0.096931	53000	32000	70000	1183	1274	485

1 rows × 21 columns
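The filter above looks only at "Total". A slightly more general sketch, which counts missing values per column and selects rows with a NaN in any column, might look like this. The `toy` frame is a hypothetical stand-in with invented values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for recent_grads; values are invented for illustration only.
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, np.nan, 300.0],
    "Men": [60.0, np.nan, 100.0],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# Rows that contain at least one missing value, whatever the column.
rows_with_missing = toy[toy.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```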

In [5]:

# drop this row from our dataset
recent_grads = recent_grads.dropna()

clean_data_count = recent_grads.shape[0]
print("DataFrame 'recent_grads' , number of rows: {}\nDataFrame 'recent_grads' , number of columns: {}\nclean_data_count = {}".format(recent_grads.shape[0],recent_grads.shape[1],recent_grads.shape[0]))

DataFrame 'recent_grads' , number of rows: 172
DataFrame 'recent_grads' , number of columns: 21
clean_data_count = 172
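Note that `dropna()` without arguments removes any row with a missing value in any column. A sketch of a more targeted variant, restricted to chosen columns and reporting how many rows were removed, is shown below on a hypothetical toy frame with invented values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for recent_grads; values are invented for illustration only.
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, np.nan, 300.0],
    "Men": [60.0, np.nan, 100.0],
})

# Drop only rows where the listed columns are missing.
cleaned = toy.dropna(subset=["Total", "Men"])
dropped = len(toy) - len(cleaned)

print("rows before: {}, rows after: {}, dropped: {}".format(len(toy), len(cleaned), dropped))
```

The original index labels are kept after dropping, so label-based and position-based lookups (`loc` versus `iloc`) no longer coincide for rows after the removed one.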

B. Visualizing relationships between variables

In this section we are going to study the following relationships between the columns of our "recent_grads" DataFrame:

Sample_size and Median
Sample_size and Unemployment_rate
Full_time and Median
ShareWomen and Unemployment_rate
Men and Median
Women and Median
Total and Median --> Do students in more popular majors make more money?
ShareWomen and Median --> Do students who majored in majority-female subjects make more money?

For this task we'll use scatter plots.
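The eight pairs listed above could also be drawn in a single grid rather than one cell at a time. The following is a minimal sketch, not the approach used in this notebook: the helper name `scatter_grid` and the placeholder `toy` frame are assumptions, and the "Agg" backend line is there only so the code runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed with %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

PAIRS = [
    ("Sample_size", "Median"),
    ("Sample_size", "Unemployment_rate"),
    ("Full_time", "Median"),
    ("ShareWomen", "Unemployment_rate"),
    ("Men", "Median"),
    ("Women", "Median"),
    ("Total", "Median"),
    ("ShareWomen", "Median"),
]

def scatter_grid(df, pairs, ncols=2):
    """Draw one scatter plot per (x, y) pair on a grid of subplots."""
    nrows = -(-len(pairs) // ncols)  # ceiling division
    fig, axes = plt.subplots(nrows, ncols, figsize=(10, 4 * nrows), squeeze=False)
    axes = axes.ravel()
    for ax, (x, y) in zip(axes, pairs):
        df.plot(x=x, y=y, kind="scatter", ax=ax, title="{} vs. {}".format(x, y))
    for ax in axes[len(pairs):]:
        ax.set_visible(False)  # hide unused cells when the grid is not full
    fig.tight_layout()
    return fig, axes

# Placeholder frame: every column is simply [1, 2, 3], for illustration only.
toy = pd.DataFrame({col: [1, 2, 3] for pair in PAIRS for col in pair})
fig, axes = scatter_grid(toy, PAIRS)
```

With the real data, `scatter_grid(recent_grads, PAIRS)` would produce the eight plots at once, at the cost of losing the per-plot axis limits set by hand in the cells below.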

In [6]:

import matplotlib.pyplot as plt
# this allows plots to be displayed inline in the Jupyter notebook
%matplotlib inline 


In [7]:

# assign axes object to a variable to access later
ax = recent_grads.plot(x = "Sample_size", y = "Median", kind = "scatter", title = "1. Relationship: Sample_size and Median", figsize = (10, 5))

ax.set_xlim(0, 1500) # for better visualization; x values beyond these limits are absent or scarce
ax.set_ylim(20000, 80000) 
plt.show()

Comments: as expected, there is no evidence of a relationship between "Sample_size" (see the column description at the beginning) and "Median".
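A visual impression like this one can be complemented with a correlation coefficient. The sketch below uses pandas' `Series.corr` (Pearson by default) on a hypothetical toy frame whose invented values are chosen to be uncorrelated; it illustrates the call and makes no claim about the real data.

```python
import pandas as pd

# Toy stand-in for recent_grads; values are invented for illustration only.
toy = pd.DataFrame({
    "Sample_size": [10, 20, 30, 40],
    "Median": [50000, 30000, 30000, 50000],
})

# Pearson correlation: values near 0 suggest no linear relationship.
r = toy["Sample_size"].corr(toy["Median"])
print("Pearson r = {:.3f}".format(r))
```

A coefficient near zero only rules out a linear relationship, so it is best read alongside the scatter plot rather than instead of it.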

In [8]:

# assign axes object to a variable to access later
ax = recent_grads.plot(x = "Sample_size", y = "Unemployment_rate", kind = "scatter", title = "2. Relationship: Sample_size and Unemployment_rate", figsize = (10, 5))

ax.set_xlim(0, 3000) # for better visualization; x values beyond these limits are absent or scarce
ax.set_ylim(0, 0.20) 
plt.show()