#!/usr/bin/env python
# coding: utf-8

# ## Introduction
# 
# In this project, I'll use the pandas plotting functionality along with the Jupyter notebook interface to explore and visualize data.
# 
# I'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012.
# 
# Each row in the dataset represents a different major in college and contains information on gender diversity, employment rates, median salaries, and more. Here are some of the columns in the dataset:
# 
# * Rank - Rank by median earnings (the dataset is ordered by this column).
# * Major_code - Major code.
# * Major - Major description.
# * Major_category - Category of major.
# * Total - Total number of people with major.
# * Sample_size - Sample size (unweighted) of full-time.
# * Men - Male graduates.
# * Women - Female graduates.
# * ShareWomen - Women as share of total.
# * Employed - Number employed.
# * Median - Median salary of full-time, year-round workers.
# * Low_wage_jobs - Number in low-wage service jobs.
# * Full_time - Number employed 35 hours or more.
# * Part_time - Number employed less than 35 hours.

# In[31]:


import pandas as pd
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

recent_grads = pd.read_csv("recent-grads.csv")

# Print the first row, formatted as a table.
print(recent_grads.iloc[0])


# In[32]:


# Print the first and last rows to become familiar with how the data is structured.
print(recent_grads.head())
print(recent_grads.tail())


# In[33]:


# Generate summary statistics for all of the numeric columns.
recent_grads.describe()


# In[34]:


# Shape before dropping NaN values
print(recent_grads.shape)


# In[35]:


# Investigating Missing Values
print(recent_grads.isnull().sum())


# In[36]:


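# Drop rows with missing values. dropna(inplace=True) modifies recent_grads and
# returns None, so its result should be neither printed nor assigned.
recent_grads.dropna(inplace=True)

# Shape after dropping NaN values
print(recent_grads.shape)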


# In[37]:


# Investigating missing values after dropping NaN
print(recent_grads.isnull().sum())


# ## Investigating Missing Values

# Comparing the shapes before and after cleaning, the number of rows drops from 173 in the raw data to 172 in the cleaned data. This means one row was removed for having a missing value.
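# 
# The before-and-after row comparison described above can be sketched as a small helper. This is a minimal illustration rather than part of the original analysis: the function name `count_dropped_rows` and the toy DataFrame are hypothetical.

# In[ ]:


import pandas as pd


def count_dropped_rows(df):
    """Return (rows_before, rows_after, rows_dropped) when rows with any NaN are dropped."""
    cleaned = df.dropna()  # returns a new DataFrame; df itself is left unchanged
    return len(df), len(cleaned), len(df) - len(cleaned)


# Hypothetical toy data: the second row has a missing value in "Men".
toy_missing = pd.DataFrame({"Men": [10.0, None, 30.0], "Women": [5.0, 8.0, 12.0]})
print(count_dropped_rows(toy_missing))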

# ## Scatter Plot

# In[44]:


recent_grads.plot(x="Sample_size", y="Median", kind="scatter", title="Sample_size vs. Median")
recent_grads.plot(x="Sample_size", y="Unemployment_rate", kind="scatter", title="Sample_size vs. Unemployment_rate")
recent_grads.plot(x="Full_time", y="Median", kind="scatter", title="Full_time vs. Median")
recent_grads.plot(x="ShareWomen", y="Unemployment_rate", kind="scatter", title="ShareWomen vs. Unemployment_rate")
recent_grads.plot(x="Men", y="Median", kind="scatter", title="Men vs. Median")
recent_grads.plot(x="Women", y="Median", kind="scatter", title="Women vs. Median")


# There seem to be no strong relationships between the pairs of columns in these scatter plots. The individual columns can be explored further using histograms.
# 
# In the histograms that follow, the y axis shows the frequency of the data and the x axis refers to the column name specified in the code.
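# 
# A visual impression of "no relationship" can be backed up with a correlation coefficient. The sketch below is an illustration under assumptions, not part of the original analysis: the helper name `pearson_r` and the toy values are hypothetical, and Pearson's r only measures linear association.

# In[ ]:


import pandas as pd


def pearson_r(df, x, y):
    """Pearson correlation coefficient between columns x and y of df."""
    return df[x].corr(df[y])  # Series.corr uses the Pearson method by default


# Hypothetical toy data: Median rises and Unemployment_rate falls linearly with Sample_size.
toy_scatter = pd.DataFrame({
    "Sample_size": [1, 2, 3, 4],
    "Median": [10, 20, 30, 40],
    "Unemployment_rate": [0.4, 0.3, 0.2, 0.1],
})
print(pearson_r(toy_scatter, "Sample_size", "Median"))
print(pearson_r(toy_scatter, "Sample_size", "Unemployment_rate"))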

# ## Histograms

# In[45]:


recent_grads["Sample_size"].hist()


# In[46]:


recent_grads["Median"].hist()


# The most common median salary range is $30,000-40,000. 
# 
# 
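# 
# The most common range read off a histogram can be cross-checked numerically. This is a minimal sketch, assuming equal-width bins like the ten that `hist()` draws by default; the helper name `most_common_bin` and the toy salaries are hypothetical.

# In[ ]:


import numpy as np
import pandas as pd


def most_common_bin(series, bins=10):
    """Return (left_edge, right_edge, count) of the fullest equal-width bin."""
    counts, edges = np.histogram(series.dropna(), bins=bins)
    i = int(counts.argmax())  # index of the bin holding the most values
    return float(edges[i]), float(edges[i + 1]), int(counts[i])


# Hypothetical toy salaries; four bins of width 10,000 between 20,000 and 60,000.
toy_salaries = pd.Series([20000, 31000, 33000, 35000, 38000, 60000])
print(most_common_bin(toy_salaries, bins=4))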

# In[47]:


recent_grads["Employed"].hist()


# In[48]:


recent_grads["Full_time"].hist()


# In[49]:


recent_grads["ShareWomen"].hist()


# In[50]:


recent_grads["Unemployment_rate"].hist()


# The most common unemployment rate is between 5.5% and 7%.
# 
# 

# In[51]:


recent_grads["Men"].hist()


# In[52]:


recent_grads["Women"].hist()


# ## Scatter matrix plot
# 
# In order to explore the data further, scatter plots and histograms are combined into one grid of plots so that potential relationships and distributions can be explored simultaneously. This is achieved using a scatter matrix plot.

# In[53]:


from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[["Sample_size", "Median"]], figsize=(10,10))


# In[54]:


scatter_matrix(recent_grads[["Sample_size", "Median","Unemployment_rate"]], figsize=(10,10))


# Looking closely at the scatter matrix plot, it is difficult to ascertain any correlation between any pair of these columns. Looking at the histograms, the distributions of `Sample_size` and `Median` are skewed, whereas the distribution of `Unemployment_rate` is more symmetric and more spread out.
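# 
# The skew seen in the histograms can be quantified with the sample skewness, where values near zero suggest a symmetric distribution and positive values a long right tail. The sketch below is illustrative only: the helper name `column_skewness` and the toy columns are hypothetical.

# In[ ]:


import pandas as pd


def column_skewness(df, columns):
    """Return the sample skewness of each listed column as a Series."""
    return df[columns].skew()


# Hypothetical toy data: one symmetric column and one with a long right tail.
toy_skew = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 1, 2, 10],
})
print(column_skewness(toy_skew, ["symmetric", "right_skewed"]))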

# ## Exploring data with pandas Bar plots
# 
# A bar plot only needs the data to be represented by the bars and the labels for each bar.
# 

# In[58]:


recent_grads[:10].plot.bar(x="Major", y="ShareWomen", legend=False)
recent_grads[-10:].plot.bar(x="Major", y="ShareWomen", legend=False)


# Women appear to be underrepresented in technical majors such as Engineering. Among the majors shown in the charts above, the Engineering majors have the lowest shares of women, while Early Childhood Education has the highest.
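# 
# The bar charts above only cover the first and last ten rows by rank. To find the majors with the lowest and highest share of women across a whole table, the rows can be selected on the column itself. This is a sketch under assumptions: the helper name `extremes_by_column`, the placeholder major names and the values are all hypothetical.

# In[ ]:


import pandas as pd


def extremes_by_column(df, column, n=2):
    """Return (lowest, highest): the n rows with the smallest and largest values in column."""
    return df.nsmallest(n, column), df.nlargest(n, column)


# Hypothetical toy data with placeholder major names.
toy_share = pd.DataFrame({
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C", "MAJOR D"],
    "ShareWomen": [0.12, 0.97, 0.45, 0.08],
})
lowest_share, highest_share = extremes_by_column(toy_share, "ShareWomen", n=2)
print(lowest_share)
print(highest_share)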

# In[60]:


recent_grads[:10].plot.bar(x="Major", y="Unemployment_rate", legend=False)
recent_grads[-10:].plot.bar(x="Major", y="Unemployment_rate", legend=False)


# Engineering majors tend to have low unemployment rates, except for 'Nuclear Engineering' and 'Mining and Mineral Engineering'. This may be due to the highly technical nature of, and the limited openings for, these two majors.

# ## Exploring with Grouped Bar Charts
# 
# A grouped bar plot is used to compare the number of men with the number of women in each category of majors.

# In[86]:


# Men and Women are summed per Major_category into two dictionaries.
men_sum_dict = {}
women_sum_dict = {}

for c in recent_grads["Major_category"].unique():
    # Total number of men in this category
    men_cat = recent_grads.loc[recent_grads["Major_category"] == c, "Men"].sum()
    men_sum_dict[c] = men_cat

    # Total number of women in this category
    women_cat = recent_grads.loc[recent_grads["Major_category"] == c, "Women"].sum()
    women_sum_dict[c] = women_cat

# Conversion of men_sum_dict and women_sum_dict to Series
men_sum_series = pd.Series(men_sum_dict)
women_sum_series = pd.Series(women_sum_dict)

# Conversion of men_sum_series and women_sum_series to a DataFrame
men_women_df = pd.DataFrame(men_sum_series, columns=["Men Total"])
men_women_df["Women Total"] = women_sum_series
men_women_df
    

# In[90]:


men_women_df.plot.bar(figsize=(10,5))


# There is a significant difference in the number of men and women in the Engineering and Computers & Mathematics categories, where men have the highest enrollment. However, Arts, Health, Biology & Life Science, Education, Humanities & Liberal Arts, Psychology & Social Work, and Communications & Journalism have significantly more women than men.
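# 
# The per-category totals behind the grouped bar plot can also be computed in one step with `groupby`. This is an alternative sketch, not the code used above: the helper name `totals_by_category` and the toy rows are hypothetical, and `sort=False` is used so the categories keep their order of first appearance.

# In[ ]:


import pandas as pd


def totals_by_category(df):
    """Sum the Men and Women columns within each Major_category."""
    return (
        df.groupby("Major_category", sort=False)[["Men", "Women"]]
        .sum()
        .rename(columns={"Men": "Men Total", "Women": "Women Total"})
    )


# Hypothetical toy data: two categories with two majors each.
toy_categories = pd.DataFrame({
    "Major_category": ["Engineering", "Education", "Engineering", "Education"],
    "Men": [10, 2, 20, 3],
    "Women": [4, 15, 6, 25],
})
toy_totals = totals_by_category(toy_categories)
print(toy_totals)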

# ## Exploring with Boxplots 
# 
# Boxplots can show us the range and positions of the quartiles for columns in the dataset. A box plot is used here to explore the `Median` and `Unemployment_rate` columns a little more.

# In[93]:


recent_grads.loc[:, ["Median", "Unemployment_rate"]].plot(kind='box', subplots=True, figsize=(10, 10))


# There are five outliers in the median salary column of the data, with four being moderate outliers and one being an extreme outlier.
# 
# The unemployment rate is more symmetrically distributed about the median of around 7%. There are four outliers with high unemployment rates of approximately 13.5-19%.
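# 
# The points a box plot draws as outliers can be listed with the 1.5 × IQR rule, which matches matplotlib's default whisker length. This is a minimal sketch under that assumption: the helper name `iqr_outliers` and the toy values are hypothetical.

# In[ ]:


import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]


# Hypothetical toy data: 100 lies far above the upper fence.
toy_box = pd.Series([1, 2, 3, 4, 5, 100])
print(iqr_outliers(toy_box))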

# ## Exploring with Hexagonal Bin Plot
# 
# In order to explore the data a little more, hexagonal bin plots are used to examine the relationship between pairs of columns.
# 
# Here the relationships between the following pairs are visualised:
# * ShareWomen vs. Unemployment_rate
# * Women vs. Median
# * Total vs. Median

# In[95]:


recent_grads.plot.hexbin("ShareWomen", "Unemployment_rate", figsize=(8, 8), gridsize=20, colorbar=False) 
recent_grads.plot.hexbin("Women", "Median", figsize=(8, 8), gridsize=20, colorbar=False) 
recent_grads.plot.hexbin("Total", "Median", figsize=(8, 8), gridsize=20, colorbar=False) 


# ## Conclusion
# 
# Exploring the job outcomes of recent American college graduates was insightful, and the charts surface key details at a glance. For clarity, several kinds of charts were used. Specifically, the tools and plot types explored are pandas, matplotlib, histograms, bar charts, scatter plots, scatter matrices, box plots, and hexagonal bin plots.
# 

# In[ ]: