Using a dataset on the job outcomes of students in America who graduated from college between 2010 and 2012, we will explore questions such as:

- Do students in more popular majors make more money?
- How many majors are predominantly male? Predominantly female?
- Which category of majors has the most students?

In [1]:

```
# Importing the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
# Ensure plots are displayed inline
%matplotlib inline
```

In [2]:

```
# Read in the dataset into a DataFrame
recent_grads = pd.read_csv('recent-grads.csv')
# Return first row as a table
recent_grads.iloc[0]
```

Out[2]:

In [3]:

```
# Understand how the data is structured
recent_grads.head(5)
```

Out[3]:

Engineering majors have the highest median salaries, taking the top 5 spots.

In [4]:

```
recent_grads.tail()
```

Out[4]:

Header | Description |
---|---|
Rank | Rank by median earnings |
Major_code | Major code, FO1DP in ACS PUMS |
Major | Major description |
Major_category | Category of major from Carnevale et al |
Total | Total number of people with major |
Sample_size | Sample size (unweighted) of full-time, year-round ONLY (used for earnings) |
Men | Male graduates |
Women | Female graduates |
ShareWomen | Women as share of total |
Employed | Number employed (ESR == 1 or 2) |
Full_time | Employed 35 hours or more |
Part_time | Employed less than 35 hours |
Full_time_year_round | Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35) |
Unemployed | Number unemployed (ESR == 3) |
Unemployment_rate | Unemployed / (Unemployed + Employed) |
Median | Median earnings of full-time, year-round workers |
P25th | 25th percentile of earnings |
P75th | 75th percentile of earnings |
College_jobs | Number with job requiring a college degree |
Non_college_jobs | Number with job not requiring a college degree |
Low_wage_jobs | Number in low-wage service jobs |

In [5]:

```
# Generating summary statistics for all numerical columns
recent_grads.describe()
```

Out[5]:

The one issue with this data (from a plotting perspective) is that the columns do not all have the same number of values. The 'Total', 'Men', and 'Women' columns each have 172 values, not 173 as in the other columns.

These missing values will need to be removed before we can pass the data into matplotlib for analysis.
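
Before dropping anything, it can help to see which columns and rows actually contain the missing values. The sketch below uses a small made-up frame (the values are hypothetical and not taken from recent-grads.csv) to show how `isna().sum()` reports the missing count per column and how `dropna` removes the affected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for recent_grads: one row is missing Total, Men and Women
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, np.nan, 300.0],
    "Men": [60.0, np.nan, 100.0],
    "Women": [40.0, np.nan, 200.0],
    "Median": [50000, 40000, 35000],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing["Major"].tolist())

# Dropping those rows leaves only the complete rows
toy_clean = toy.dropna(axis=0)
print(len(toy), "->", len(toy_clean))
```

The same two checks could be run on `recent_grads` itself to confirm that the gaps are confined to a single row.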

In [6]:

```
# Record how many rows are in the uncleaned dataframe
raw_data_count = recent_grads.shape[0]
print(raw_data_count)
```

In [7]:

```
# Drop rows from the dataframe with missing values
recent_grads = recent_grads.dropna(axis=0)
```

In [8]:

```
# See how many rows with missing values have been dropped
cleaned_data_count = recent_grads.shape[0]
print("The uncleaned data set had ", raw_data_count, " rows")
print("The cleaned data set has ", cleaned_data_count, " rows")
```

So there was only one row with missing values, which has now been dropped from the dataframe.

Now we can visualize the data to explore research questions.

We will use scatter plots to answer the following questions:

- Do students in more popular majors make more money?
- Do students that majored in subjects that were majority female make more money?
- Is there any link between the number of full-time employees and median salary?

In [9]:

```
recent_grads.columns
```

Out[9]:

In [10]:

```
# Scatter plot: Sample size and median
recent_grads.plot(x = 'Sample_size', y = 'Median', kind = 'scatter', title = 'Median earnings vs. Sample Size', xlim=(0,4500), ylim=(0,120000))
```

Out[10]:

Q. Do students in more popular majors make more money?

A. The scatter plot suggests that there is no noticeable relationship between the sample size and the median salary. However, there are two important qualifiers to this answer:

- This scatter plot uses earnings information for an unweighted sample of people with the major. Therefore it may not be representative of the population of graduates with this major as a whole.
- The median sample size is 130 and the 75th percentile is 338, with the chart scale distorted by a few outliers, which may visually compress any relationship. The chart can be zoomed in to see whether there is a relationship within the smaller range of majors with sample sizes equal to or less than the 75th percentile of 338.
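
The 75th percentile quoted above was read from the summary statistics. As a rough sketch, the cut-off can also be computed with `Series.quantile` and used to filter the rows. The numbers below are invented for illustration, and the names `cutoff` and `zoomed` are my own.

```python
import pandas as pd

# Hypothetical sample sizes, skewed by one large outlier
toy = pd.DataFrame({
    "Sample_size": [10, 20, 30, 40, 50, 60, 70, 4000],
    "Median": [60000, 45000, 38000, 36000, 35000, 33000, 31000, 34000],
})

# 75th percentile of the sample sizes (linear interpolation by default)
cutoff = toy["Sample_size"].quantile(0.75)

# Keep only the majors at or below the cut-off
zoomed = toy[toy["Sample_size"] <= cutoff]
print(cutoff, len(zoomed))
```

Filtering the rows differs from setting `xlim` in one respect: the outliers no longer influence automatic axis scaling or any statistics computed on the filtered frame.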

In [11]:

```
# Scatter plot: Sample size (up to 75th percentile) and median
recent_grads.plot(x = 'Sample_size', y = 'Median', kind = 'scatter', title = 'Median earnings vs. Sample Size up to 75th percentile', xlim=(0,338), ylim=(0,120000))
```

Out[11]:

There is no strong overall correlation between this narrowed down selection of majors and their median earnings.

However, there is a wider range of median earnings in majors with a sample size under 50. With small samples, the risk of an unrepresentative median salary is higher because outliers have a bigger effect.

Overall this additional scatter plot does not change the above answer to the question.
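
To illustrate why small samples give less reliable medians, here is a small simulation. It assumes a made-up, right-skewed earnings population (log-normal) rather than anything from the dataset, and the helper `spread_of_sample_medians` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical earnings population (log-normal, so right-skewed like salaries)
population = rng.lognormal(mean=10.5, sigma=0.5, size=100_000)

def spread_of_sample_medians(sample_size, n_repeats=500):
    """Standard deviation of the median across repeated random samples."""
    medians = [
        np.median(rng.choice(population, size=sample_size))
        for _ in range(n_repeats)
    ]
    return float(np.std(medians))

spread_small = spread_of_sample_medians(20)
spread_large = spread_of_sample_medians(500)
print(round(spread_small), round(spread_large))
```

The medians from samples of 20 vary far more than those from samples of 500, which is consistent with the wider scatter seen for majors with small sample sizes.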

In [12]:

```
# Sample size and unemployment rate
ax = recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind='scatter')
ax.set_title('Unemployment rate vs. Sample size')
ax.set_xlim(0,4500)
ax.set_ylim(0,0.2)
```

Out[12]:

There is a lot of variation in unemployment rates among majors with small sample sizes. Yet again the small sample sizes may affect the representativeness of the relationship plotted here.

In addition, from reviewing some rows of the dataframe, there is a noticeable difference between the sample size and the number of graduates for whom there is data on whether they are employed/unemployed.

For example, for Petroleum Engineering (rank 1) there is a sample size of 36 but a total of 2,013 graduates (1,976 employed + 37 unemployed) with employment data.

This suggests that sample size is not readily comparable to other statistics collected, other than median wage.

In [13]:

```
# Full-time workers and median salary
ax = recent_grads.plot(x='Full_time', y='Median', kind='scatter')
ax.set_title('Median salary vs. Number employed full-time')
ax.set_xlim(0,255000)
ax.set_ylim(0,120000)
```

Out[13]:

Q. Is there any link between the number of full-time employees and median salary?

A. There is no noticeable correlation between the number of graduates per major employed full-time and the median wage. If there is any relationship, it would be a positive one, i.e. majors with more full-time employees would have a higher median wage.

But as noted above the median wage figures are based off smaller unweighted samples that may not represent the wider population of graduates with each major.

To be sure, a narrower sample of the data can be plotted, setting the axes limits at the 75th percentile.

In [14]:

```
ax = recent_grads.plot(x='Full_time', y='Median', kind='scatter')
ax.set_title('Median salary vs. Number employed full-time (both at 75th percentile)')
ax.set_xlim(0,26000)
ax.set_ylim(0,45000)
```

Out[14]:

For this narrowed down sample there is no noticeable relationship.

In [15]:

```
# Share of women and the unemployment rate
ax = recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter')
ax.set_title('Unemployment rate vs Proportion of female graduates')
ax.set_xlim(0,1)
ax.set_ylim(0,0.2)
```

Out[15]:

There doesn't seem to be a strong relationship between the proportion of female graduates in a course and the unemployment rate.

In [16]:

```
# Share of women and median salary
ax = recent_grads.plot(x='ShareWomen', y='Median', kind='scatter')
ax.set_title('Median salary vs. Proportion of female graduates')
ax.set_xlim(0,1)
ax.set_ylim(0,120000)
```

Out[16]:

Q. Do students that majored in subjects that were majority female make more money?

A. No. Here there is a noticeable relationship: the higher the proportion of female graduates for a major, the lower the median salary is.

The lower median salary is not due to more part-time work because it is defined as the median salary of full time year-round workers.

This means that the lower salary could be due to lower pay for the types of major (and subsequent career paths) that have a higher proportion of female graduates, and/or to lower wages due to gender, or to less career capital due to a higher propensity to take time away from work for family.
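
A scatter plot shows the direction of a relationship but not its strength. As a hedged sketch, the Pearson correlation coefficient from `Series.corr` puts a number on it. The figures below are invented to mimic a downward trend and are not the dataset's values.

```python
import pandas as pd

# Hypothetical majors: share of women and median salary
toy = pd.DataFrame({
    "ShareWomen": [0.10, 0.25, 0.40, 0.55, 0.70, 0.90],
    "Median": [75000, 60000, 52000, 40000, 36000, 30000],
})

# Pearson correlation: -1 is a perfect negative linear relationship, 0 is none
r = toy["ShareWomen"].corr(toy["Median"])
print(round(r, 2))
```

A strongly negative coefficient would support the pattern seen in the plot, although it would say nothing about which of the possible causes listed above is responsible.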

In [17]:

```
# Number of male graduates and median wage
ax = recent_grads.plot(x='Men', y='Median', kind='scatter')
ax.set_title('Median salary vs. Number of male graduates per major')
ax.set_xlim(0,175000)
ax.set_ylim(0,120000)
```

Out[17]:

There is no obvious relationship here.

In [18]:

```
# Number of female graduates and median wage
ax = recent_grads.plot(x='Women', y='Median', kind='scatter')
ax.set_title('Median salary vs. Number of female graduates per major')
ax.set_xlim(0,310000)
ax.set_ylim(0,120000)
```

Out[18]:

There is no obvious relationship here either.

We will use histograms to answer the following questions:

- What percent of majors are predominantly male? Predominantly female?
- What's the most common median salary range?

In [19]:

```
# To allow bin size to be changed, use Series.hist() and not Series.plot(kind='hist')
# Sample_size histogram
recent_grads["Sample_size"].hist(bins=10)
```

Out[19]:

Most of the sample size values are below 500 so a more detailed view of the majority can be found by looking at those with a sample size below 500.

In [20]:

```
recent_grads["Sample_size"].hist(bins=50, range=(0,500))
```

Out[20]:

The majority of sample sizes were below 100. This raises concerns over how representative the salary data for each major is.

In [21]:

```
# Median salary histogram
recent_grads["Median"].hist(range=(20000,110000), bins=18)
```

Out[21]:

Q. What's the most common median salary range?

A. The median salaries are mostly clustered around $30,000-40,000, with a relatively quick drop off in frequency for salary bands on either side.
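
Reading the tallest bar off a histogram can be imprecise. The sketch below shows one way to count values per bin directly with `pd.cut`, using the same 5,000-wide bins as the histogram above. The salary values are hypothetical.

```python
import pandas as pd

# Hypothetical median salaries
medians = pd.Series([22000, 31000, 33000, 34000, 36000, 38000,
                     41000, 45000, 52000, 75000])

# 18 bins of width 5,000 from 20,000 to 110,000, each closed on the left
bin_edges = list(range(20000, 110001, 5000))
binned = pd.cut(medians, bins=bin_edges, right=False)

# Count the values in each bin, keeping the bins in ascending order
counts = binned.value_counts().sort_index()
most_common_bin = counts.idxmax()
print(most_common_bin, counts.max())
```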

In [22]:

```
# Employed histogram
recent_grads["Employed"].hist(range=(0,310000))
```

Out[22]:

I assume that the number employed per major is affected, in part, by the number of students that have taken the major, and so its distribution is not very instructive by itself. To check this assumption I can look at the relationship between Total and Employed.

In [23]:

```
# e.g. the Total number of people and number Employed for the largest majors
# Filter by majors with Total > 39,000 (75th percentile)
largest_majors = recent_grads.loc[recent_grads["Total"] > 39000, ["Major_code", "Major", "Total", "Employed"]]
largest_majors.sort_values(by='Total', ascending=False).head(10)
```

Out[23]:

So there is, unsurprisingly, a link between the Total number of people who have taken a major and the number Employed.

A better way to illustrate this relationship would be with a scatter plot.

In [24]:

```
ax = recent_grads.plot(x='Total', y='Employed', kind='scatter')
ax.set_xlim(0, 400000)
ax.set_ylim(0,310000)
```

Out[24]:

So, as expected, the distribution and size of the number Employed per major is closely related to the Total number of graduates per major. It would be more useful to look at the employment (or unemployment) rates per major, rather than the absolute numbers.
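
As a minimal sketch of working with rates instead of counts, the unemployment rate can be rebuilt from the definition in the column table, Unemployed / (Unemployed + Employed). The two majors below are made up, and `employed_share` is a hypothetical helper column rather than one from the dataset.

```python
import pandas as pd

# Two hypothetical majors of very different size
toy = pd.DataFrame({
    "Major": ["A", "B"],
    "Total": [1000, 200000],
    "Employed": [800, 150000],
    "Unemployed": [50, 12000],
})

# Unemployment rate as defined in the column table
toy["unemployment_rate"] = toy["Unemployed"] / (toy["Unemployed"] + toy["Employed"])

# Hypothetical helper: share of all graduates in the major who are employed
toy["employed_share"] = toy["Employed"] / toy["Total"]
print(toy[["Major", "unemployment_rate", "employed_share"]])
```

Rates like these can be compared across majors of any size, which the raw Employed counts cannot.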

In [25]:

```
# Full-time histogram
recent_grads["Full_time"].hist()
```

Out[25]:

This closely mirrors the distribution of the numbers Employed per major, which is expected.

In [26]:

```
# ShareWomen histogram
recent_grads["ShareWomen"].hist()
```

Out[26]:

It appears that just over 50% of all majors are majority female, with the highest frequency at 70-80% female.
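
The share of majors that are majority female can also be computed directly instead of being estimated from the bars. A sketch with invented shares follows; taking the mean of a boolean Series gives the fraction of True values.

```python
import pandas as pd

# Hypothetical ShareWomen values for eight majors
share_women = pd.Series([0.12, 0.35, 0.48, 0.55, 0.62, 0.71, 0.78, 0.93],
                        name="ShareWomen")

# Fraction of majors where women make up more (or less) than half of graduates
majority_female = (share_women > 0.5).mean()
majority_male = (share_women < 0.5).mean()
print(majority_female, majority_male)
```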

In [27]:

```
# Seeing which courses have females at 80% or more
high_female_share = recent_grads[recent_grads["ShareWomen"] >= 0.8]
print(high_female_share.shape)
high_female_share.sort_values(by='ShareWomen', ascending=False).head(10)
```

Out[27]:

In [28]:

```
# Unemployment Rate histogram
recent_grads["Unemployment_rate"].hist()
```

Out[28]:

The most frequent unemployment range is 6-7%, but there are a handful of courses with unemployment rates greater than 14%.

In [29]:

```
high_unemp = recent_grads[recent_grads["Unemployment_rate"] >= 0.14]
print(high_unemp.shape)
high_unemp.sort_values(by='Unemployment_rate', ascending=False)
```

Out[29]:

'Nuclear engineering' and 'Computer Networking and Telecommunications' are unexpected given that they are in in-demand fields (engineering, computers & mathematics).

In [30]:

```
# Men histogram i.e. number of male graduates
recent_grads["Men"].hist()
```

Out[30]:

In [31]:

```
# Women histogram i.e. number of female graduates
recent_grads["Women"].hist()
```

Out[31]:

A scatter matrix plot combines both scatter plots and histograms into one grid of plots and allows us to explore potential relationships and distributions simultaneously.

These can be generated by selecting the dataframe and columns of interest and passing this into: pandas.plotting.scatter_matrix

In [32]:

```
# Import the scatter_matrix function from pandas.plotting
from pandas.plotting import scatter_matrix
```

In [33]:

```
# A 2 by 2 scatter matrix plot of Sample_size and Median salary
scatter_matrix(recent_grads[["Sample_size", "Median"]], figsize=(10,10))
```

Out[33]:

Most sample sizes are less than 500 (top left histogram). The scatter plot of Sample_size vs. Median salary (bottom left) doesn't seem to provide much information other than there not being an obvious relationship between Sample_size and Median salary.

However the mirror scatter plot of Median salary vs. Sample_size (top right) shows that large Sample sizes are not associated with outlier Median salary values. Instead the majors with the highest Sample sizes have Median salaries that are in line with the most common salary ranges ($30,000-40,000).

In [34]:

```
# Scatter matrix plot of Sample_size, Median, and Unemployment_rate
scatter_matrix(recent_grads[["Sample_size", "Median", "Unemployment_rate"]], figsize=(20,20))
```

Out[34]:

There aren't any noticeably strong relationships between the variables here.
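
A numeric companion to the scatter matrix is the pairwise correlation table from `DataFrame.corr`. This is only a sketch on hypothetical values, not the notebook's data.

```python
import pandas as pd

# Hypothetical values for the three columns plotted above
toy = pd.DataFrame({
    "Sample_size": [30, 120, 400, 900, 2500],
    "Median": [60000, 35000, 42000, 38000, 36000],
    "Unemployment_rate": [0.05, 0.09, 0.06, 0.07, 0.08],
})

# Pairwise Pearson correlations; the diagonal is always 1
corr_matrix = toy.corr()
print(corr_matrix.round(2))
```

Off-diagonal values close to 0 would back up the visual impression that none of the pairs is strongly related.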

Scatter_matrix is a useful way to quickly explore relationships that I've considered above, for example Total students with a major and the number Employed.

In [35]:

```
# Total and Employed scatter matrix
scatter_matrix(recent_grads[["Total", "Employed"]], figsize=(10,10))
```

Out[35]:

Doing a scatter matrix plot earlier would have saved time and quickly revealed the strong relationship between the two variables.

Bar plots can be generated using either df.plot(kind='bar') or df.plot.bar(x=labels, y=data for bars).

In [36]:

```
# Looking at the share of women for the top 10 and bottom 10
# courses NB data is ranked by median salary
# Share of women in the top 10 courses
recent_grads[:10].plot.bar(x='Major', y='ShareWomen', title='Share of women in the 10 courses with the highest median salary')
```

Out[36]:

In [37]:

```
# Share of women in the bottom 10 courses
recent_grads[-10:].plot.bar(x='Major', y='ShareWomen', title='Share of women in the 10 courses with the lowest median salary')
```

Out[37]:

The courses with the highest median salaries have a lower share of female graduates than those with the lowest median salaries, which are majority female (i.e. over 50% of the graduates are women).

Now to calculate how large the difference is:

In [38]:

```
#Calculating the average proportion of female graduates for the top and bottom 10 courses
top_10_female_share = recent_grads.loc[:9, "ShareWomen"].mean()
bottom_10_female_share = recent_grads[-10:]['ShareWomen'].mean()
```

In [39]:

```
top_10 = "The 10 highest paying courses have an average female share of {:.2f}".format(top_10_female_share)
bottom_10 = "The 10 lowest paying courses have an average female share of {:.2f}".format(bottom_10_female_share)
print(top_10)
print(bottom_10)
```

So the difference in the average proportion of female graduates between the top and bottom 10 courses (in terms of median pay) is over 50 percentage points!

Next we will look at the differences in the unemployment rate between the top 10 and bottom 10 courses.

In [40]:

```
# Unemployment rate for the top 10 courses
ax1 = recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate', title='Unemployment rate for the top 10 courses')
ax2 = recent_grads[-10:].plot.bar(x='Major', y='Unemployment_rate', title='Unemployment rate for the bottom 10 courses')
```

This comparison is less clear. The top 10 courses do tend to have a lower unemployment rate, with 2 exceptions: 'Nuclear Engineering' and 'Mining and Mineral Engineering'. For the bottom 10 courses, 3-5 of the courses have higher unemployment rates.

This can be analysed by looking at average unemployment rates.

In [41]:

```
mean_unemp_rate = recent_grads["Unemployment_rate"].mean()
#NB with .loc contrary to usual python slices, both the start and the stop are included
top_10_unemp = recent_grads.loc[:9 , "Unemployment_rate"].mean()
bottom_10_unemp = recent_grads[-10:]["Unemployment_rate"].mean()
```
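
The note about `.loc` in the cell above matters more once rows have been dropped, because `.loc` selects by index label while `.iloc` selects by position. If a label inside the slice no longer exists, `.loc[:9]` returns fewer than ten rows. A small sketch on a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({"Unemployment_rate": [0.02, 0.04, 0.06, 0.08, 0.10]})

# Simulate a cleaned frame whose index has a gap: drop the row labelled 1
toy = toy.drop(index=1)

# Label-based: everything up to and including label 2 (label 1 is gone)
by_label = toy.loc[:2, "Unemployment_rate"]

# Position-based: the first three rows, whatever their labels are
by_position = toy["Unemployment_rate"].iloc[:3]

print(list(by_label.index), list(by_position.index))
```

Whether this affects the top 10 here depends on where the dropped row sat in the ranking. Using `.iloc[:10]` or `.head(10)` would sidestep the question.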

In [42]:

```
mean_all = "The average unemployment rate across all majors is {:.2f}".format(mean_unemp_rate)
mean_top = "The average unemployment rate for the top 10 majors is {:.2f}".format(top_10_unemp)
mean_bottom = "The average unemployment rate for the bottom 10 majors is {:.2f}".format(bottom_10_unemp)
print(mean_all)
print(mean_top)
print(mean_bottom)
```

Whilst the average unemployment rates are similar for the top and bottom 10 courses, the distributions differ. More of the bottom 10 courses appear to be slightly above average, whereas in the top 10, 2 courses are far above the average and the others are far below it.

To investigate this further:

In [43]:

```
# Take a copy so that adding a column does not trigger a SettingWithCopyWarning
top_10_outliers = recent_grads[:10].loc[recent_grads[:10]["Unemployment_rate"] > mean_unemp_rate].copy()
top_10_outliers["Difference_from_mean"] = top_10_outliers["Unemployment_rate"] - mean_unemp_rate
```

In [44]:

```
# Take a copy so that adding a column does not trigger a SettingWithCopyWarning
bottom_10_outliers = recent_grads[-10:].loc[recent_grads[-10:]["Unemployment_rate"] > mean_unemp_rate].copy()
bottom_10_outliers["Difference_from_mean"] = bottom_10_outliers["Unemployment_rate"] - mean_unemp_rate
```

In [45]:

```
""" Plot the majors from the top and bottom 10 with above
average unemployment rates and the size of the difference
in unemployment rate from the average"""
top_10_outliers.plot.bar(x='Major', y='Difference_from_mean', title='Majors in the top 10 with above average unemployment rates')
bottom_10_outliers.plot.bar(x='Major', y='Difference_from_mean', title='Majors in the bottom 10 with above average unemployment rates')
```

Out[45]:

So in the top 10, one course (Nuclear Engineering) is dragging up the average unemployment rate for the top 10 majors. In the bottom 10 Clinical Psychology has an above average unemployment rate, but the other 4 courses are also dragging up the unemployment rate.

There are a lot of majors which makes it harder to see patterns across types of major e.g. arts, sciences.

So I will look at some analysis using the Category of the major.

Firstly I will build a new dataframe which is indexed by the categories of the majors.

In [46]:

```
# Create a list of all the categories of major
categories = recent_grads["Major_category"].unique()
```

In [47]:

```
# Now to aggregate across categories
# Firstly unemployment rates
# Create an empty dictionary
cat_unemp = {}
# Loop through major categories, calculate mean unemployment rate and add to dictionary
for c in categories:
    unemp_mean = recent_grads.loc[recent_grads["Major_category"] == c, "Unemployment_rate"].mean()
    cat_unemp[c] = unemp_mean
```

In [48]:

```
# Now the proportion (share) of women in each category
# Create an empty dictionary
cat_share_women = {}
#Loop through categories, calculate mean share of women, and add to dictionary
for c in categories:
    women_mean = recent_grads.loc[recent_grads["Major_category"] == c, "ShareWomen"].mean()
    cat_share_women[c] = women_mean
```

In [49]:

```
# Now the average median salary
# Create an empty dictionary
cat_salary = {}
for c in categories:
    salary_mean = recent_grads.loc[recent_grads["Major_category"] == c, "Median"].mean()
    cat_salary[c] = salary_mean
```

In [50]:

```
# Now the total number of men and women in each category
cat_women = {}
for c in categories:
    women_sum = recent_grads.loc[recent_grads["Major_category"] == c, "Women"].sum()
    cat_women[c] = women_sum
```

In [51]:

```
cat_men = {}
for c in categories:
    men_sum = recent_grads.loc[recent_grads["Major_category"] == c, "Men"].sum()
    cat_men[c] = men_sum
```

In [52]:

```
unemp_series = pd.Series(cat_unemp)
share_women_series = pd.Series(cat_share_women)
salary_series = pd.Series(cat_salary)
women_series = pd.Series(cat_women)
men_series = pd.Series(cat_men)
type(salary_series)
```

Out[52]:

In [53]:

```
#Now to turn all series into a dataframe
#NB. The dictionary keys became the index in the Series obj
#This index can be used for the dataframe
major_categories = pd.DataFrame(unemp_series, columns=['mean_unemployment_rate'])
major_categories
```

Out[53]:

Now to add the other series into this new dataframe.

In [54]:

```
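# Now add the other series to the dataframe as new columns
# Don't use the DataFrame constructor again: it is only needed to create the df object
# The series share the same index as the dataframe, so the values align by category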
major_categories["mean_share_women"] = share_women_series
major_categories["mean_salary"] = salary_series
major_categories["number_female_grads"] = women_series
major_categories["number_male_grads"] = men_series
major_categories
```

Out[54]:
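
For comparison, the same kind of per-category table can be built in one step with `groupby` and named aggregation, which replaces the dictionary loops and the Series conversion. This is an alternative sketch on a tiny invented frame, not the approach used in the cells above. The output column names mirror those in `major_categories`.

```python
import pandas as pd

# Hypothetical majors in two categories
toy = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Education", "Education"],
    "Unemployment_rate": [0.02, 0.06, 0.05, 0.03],
    "ShareWomen": [0.10, 0.30, 0.80, 0.90],
    "Median": [70000, 60000, 32000, 34000],
    "Women": [100, 300, 800, 900],
    "Men": [900, 700, 200, 100],
})

# One row per category: means for the rates and salary, sums for the head counts
by_category = toy.groupby("Major_category").agg(
    mean_unemployment_rate=("Unemployment_rate", "mean"),
    mean_share_women=("ShareWomen", "mean"),
    mean_salary=("Median", "mean"),
    number_female_grads=("Women", "sum"),
    number_male_grads=("Men", "sum"),
)
print(by_category)
```

The result is indexed by category, so it can be plotted in the same way as `major_categories`.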

Now it is time to plot the data by category to see what patterns emerge.

Firstly, using a grouped bar plot.

In [55]:

```
df_subset = major_categories[["number_male_grads", "number_female_grads"]]
df_subset.plot.bar(title = 'Total number of male and female graduates per category of major')
```

Out[55]:

As can be seen there are large differences in the number of male and female graduates in the following categories:

- Education
- Engineering
- Health
- Humanities & Liberal Arts
- Psychology & Social Work

Now to look at how the mean salary differs across categories.

In [56]:

```
major_categories.plot.bar(y='mean_salary', title='Average salary per category of major')
```

Out[56]:

The mean salary is highest (by far) in Engineering, followed by Business.

Now to look at the relationship between mean salary and the average proportion of women in each category.

In [57]:

```
ax = major_categories.plot(x='mean_share_women', y='mean_salary', kind='scatter')
ax.set_title('Average share of female graduates per category vs. average salary')
```

Out[57]:

There's a slight drop in the average salary as the proportion of female graduates rises.
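
One caveat is that mean_share_women averages the per-major shares, so a tiny major counts as much as a huge one. A head-count-weighted share can be derived from the totals instead, along the lines of number_female_grads / (number_female_grads + number_male_grads). The sketch below uses two hypothetical majors in a single category to show how far the two measures can diverge.

```python
import pandas as pd

# Hypothetical category with one small and one large major
toy = pd.DataFrame({
    "Major": ["Small major", "Large major"],
    "Women": [90, 2000],
    "Men": [10, 8000],
})
toy["ShareWomen"] = toy["Women"] / (toy["Women"] + toy["Men"])

# Unweighted: the mean of the per-major shares
unweighted_share = toy["ShareWomen"].mean()

# Weighted: total women divided by total graduates in the category
weighted_share = toy["Women"].sum() / (toy["Women"].sum() + toy["Men"].sum())

print(round(unweighted_share, 3), round(weighted_share, 3))
```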

Finally, how does the average unemployment rate vary across the different categories?

In [58]:

```
major_categories.plot.bar(y='mean_unemployment_rate', legend = False, title='Average unemployment rate per category of major')
```

Out[58]:

Interestingly, there doesn't appear to be a close link between the unemployment rates and average salary. For example, Education has a low unemployment rate and a low salary. Low unemployment means there is a lower supply of surplus labour, which would usually lead to higher wages.

Using a scatter plot to explore this further:

In [59]:

```
ax = major_categories.plot(x='mean_unemployment_rate', y='mean_salary', kind='scatter')
ax.set_title("Average unemployment rate vs. average salary for each category of major")
```

Out[59]:

So there isn't an obvious relationship between average salary and