Introduction

In this project, we look at how the pandas plotting functionality, combined with the Jupyter notebook interface, lets us explore data quickly through visualizations.

The aim of this project is to work with a dataset on the job outcomes of students who graduated from college between 2010 and 2012, using visualizations to explore and understand questions such as: Do students in more popular majors make more money? How many majors are predominantly male? Predominantly female? Which category of majors has the most students?

In [38]:

# Importing the libraries needed as well as reading the dataset 
# into a dataframe.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

recent_grads = pd.read_csv('recent-grads.csv')
print(recent_grads.iloc[0])
print(recent_grads.head())
print(recent_grads.tail())
recent_grads.describe()

# Dropping rows with missing values.
raw_data_count = recent_grads.shape[0]
recent_grads = recent_grads.dropna()
cleaned_data_count = recent_grads.shape[0]

print(raw_data_count)
print(cleaned_data_count)

Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object
   Rank  Major_code                                      Major    Total  \
0     1        2419                      PETROLEUM ENGINEERING   2339.0   
1     2        2416             MINING AND MINERAL ENGINEERING    756.0   
2     3        2415                  METALLURGICAL ENGINEERING    856.0   
3     4        2417  NAVAL ARCHITECTURE AND MARINE ENGINEERING   1258.0   
4     5        2405                       CHEMICAL ENGINEERING  32260.0   

       Men    Women Major_category  ShareWomen  Sample_size  Employed  \
0   2057.0    282.0    Engineering    0.120564           36      1976   
1    679.0     77.0    Engineering    0.101852            7       640   
2    725.0    131.0    Engineering    0.153037            3       648   
3   1123.0    135.0    Engineering    0.107313           16       758   
4  21239.0  11021.0    Engineering    0.341631          289     25694   

       ...        Part_time  Full_time_year_round  Unemployed  \
0      ...              270                  1207          37   
1      ...              170                   388          85   
2      ...              133                   340          16   
3      ...              150                   692          40   
4      ...             5180                 16697        1672   

   Unemployment_rate  Median  P25th   P75th  College_jobs  Non_college_jobs  \
0           0.018381  110000  95000  125000          1534               364   
1           0.117241   75000  55000   90000           350               257   
2           0.024096   73000  50000  105000           456               176   
3           0.050125   70000  43000   80000           529               102   
4           0.061098   65000  50000   75000         18314              4440   

   Low_wage_jobs  
0            193  
1             50  
2              0  
3              0  
4            972  

[5 rows x 21 columns]
     Rank  Major_code                   Major   Total     Men   Women  \
168   169        3609                 ZOOLOGY  8409.0  3050.0  5359.0   
169   170        5201  EDUCATIONAL PSYCHOLOGY  2854.0   522.0  2332.0   
170   171        5202     CLINICAL PSYCHOLOGY  2838.0   568.0  2270.0   
171   172        5203   COUNSELING PSYCHOLOGY  4626.0   931.0  3695.0   
172   173        3501         LIBRARY SCIENCE  1098.0   134.0   964.0   

               Major_category  ShareWomen  Sample_size  Employed  \
168    Biology & Life Science    0.637293           47      6259   
169  Psychology & Social Work    0.817099            7      2125   
170  Psychology & Social Work    0.799859           13      2101   
171  Psychology & Social Work    0.798746           21      3777   
172                 Education    0.877960            2       742   

         ...        Part_time  Full_time_year_round  Unemployed  \
168      ...             2190                  3602         304   
169      ...              572                  1211         148   
170      ...              648                  1293         368   
171      ...              965                  2738         214   
172      ...              237                   410          87   

     Unemployment_rate  Median  P25th  P75th  College_jobs  Non_college_jobs  \
168           0.046320   26000  20000  39000          2771              2947   
169           0.065112   25000  24000  34000          1488               615   
170           0.149048   25000  25000  40000           986               870   
171           0.053621   23400  19200  26000          2403              1245   
172           0.104946   22000  20000  22000           288               338   

     Low_wage_jobs  
168            743  
169             82  
170            622  
171            308  
172            192  

[5 rows x 21 columns]
173
172
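Before dropping rows, it can help to see which columns and rows actually contain missing values. The following is a minimal sketch on a small made-up DataFrame (`toy` and its values are hypothetical stand-ins for `recent_grads`, not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for recent_grads; the values are invented.
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, np.nan, 300.0],
    "Men": [60.0, np.nan, 120.0],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Show the rows that dropna() would remove.
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)

# Drop them and compare the row counts, as in the cell above.
cleaned = toy.dropna()
print(len(toy), len(cleaned))
```

Running the same two checks on `recent_grads` before calling `dropna()` would show which major accounts for the drop from 173 to 172 rows.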

**Generating Scatter Plots**

In [39]:

# Generating a scatter plot to explore the relationship
# between Sample_size and Median.
recent_grads.plot(x='Sample_size', y='Median', kind='scatter')

Out[39]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2da0af24a8>

In [40]:

# Generating a scatter plot to explore the relationship
# between Sample_size and Unemployment_rate.
recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind='scatter')

Out[40]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2da0ad9080>

In [41]:

# Generating a scatter plot to explore the relationship
# between Full_time and Median.
recent_grads.plot(x='Full_time', y='Median', kind='scatter')

Out[41]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2da0ad95c0>

In [42]:

# Generating a scatter plot to explore the relationship
# between ShareWomen and Unemployment_rate.
recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter')

Out[42]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2da0a0b5f8>

In [43]:

# Generating a scatter plot to explore the relationship
# between Men and Median.
recent_grads.plot(x='Men', y='Median', kind='scatter')

Out[43]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2da097ac50>

In [44]:

# Generating a scatter plot to explore the relationship
# between Women and Median.
recent_grads.plot(x='Women', y='Median', kind='scatter')

Out[44]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2da08df198>

In [45]:

# Generating a scatter plot to explore the relationship
# between Total and Median.
recent_grads.plot(x='Total', y='Median', kind='scatter')

Out[45]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2da08c37f0>

The Full_time vs. Median plot shows no clear link between the number of full-time employees and median salary. Likewise, the Total vs. Median plot suggests that students in more popular majors are not guaranteed high-paying jobs.
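A visual impression like this can be cross-checked with a correlation coefficient. The sketch below uses `DataFrame.corr` on a small invented table (`toy` and all of its numbers are hypothetical, chosen only to make the example run); it is one possible check, not part of the original analysis:

```python
import pandas as pd

# Hypothetical values standing in for the Total, Full_time and Median columns.
toy = pd.DataFrame({
    "Total": [500, 1200, 3000, 8000, 20000],
    "Full_time": [400, 900, 2100, 5000, 14000],
    "Median": [60000, 35000, 52000, 40000, 45000],
})

# Pearson correlation of each column with Median (Median itself removed).
corr_with_median = toy.corr()["Median"].drop("Median")
print(corr_with_median)
```

On the real data, `recent_grads[['Total', 'Full_time', 'Median']].corr()` would give the corresponding figures; values near zero would support the reading of the scatter plots.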

**Generating Histograms**

In [46]:

# Generating histogram to explore the distribution of Sample_size
recent_grads['Sample_size'].hist(bins=25, range=(0,5000))

Out[46]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2da08242e8>

In [47]:

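# Generating histogram to explore the distribution of Median.
# Median salaries run from about 22,000 to 110,000, so no fixed
# range is passed and pandas picks it from the data.
recent_grads['Median'].hist(bins=25)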

Out[47]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2da07e77f0>

In [48]:

# Generating histogram to explore the distribution of Employed
recent_grads['Employed'].hist(bins=25, range=(0,5000))

Out[48]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2da06b4710>

In [49]:

# Generating histogram to explore the distribution of Full_time
recent_grads['Full_time'].hist(bins=25, range=(0,5000))

Out[49]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2da05eeeb8>

In [50]:

# Generating histogram to explore the distribution of ShareWomen.
# ShareWomen is a fraction, so the range is 0 to 1.
recent_grads['ShareWomen'].hist(bins=25, range=(0,1))

Out[50]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2da0ccb198>

In [51]:

# Generating histogram to explore the distribution of
# Unemployment_rate. The rate is a fraction, so no fixed
# range is passed and pandas picks it from the data.
recent_grads['Unemployment_rate'].hist(bins=25)

Out[51]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2da0c89f98>

In [52]:

# Generating histogram to explore the distribution of Men.
recent_grads['Men'].hist(bins=25, range=(0,5000))

Out[52]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2da05d1e10>

In [53]:

# Generating histogram to explore the distribution of Women.
recent_grads['Women'].hist(bins=25, range=(0,5000))

Out[53]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2da0512748>
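The histogram cells above repeat the same call for each column. One way to reduce the repetition is to loop over the columns and draw each histogram into its own subplot. This is a sketch under stated assumptions: `toy` is randomly generated stand-in data, and the `Agg` backend is selected only so the code runs outside a notebook (inside Jupyter that line would be omitted).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data; not the real recent_grads values.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Sample_size": rng.integers(2, 4000, size=50),
    "Median": rng.integers(22000, 110000, size=50),
    "ShareWomen": rng.random(50),
    "Unemployment_rate": rng.random(50) * 0.2,
})

# One subplot per column; each histogram uses the column's own data range.
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, col in zip(axes.ravel(), toy.columns):
    toy[col].hist(bins=25, ax=ax)
    ax.set_title(col)
fig.tight_layout()
```

Letting each column use its own range avoids the problem of a single fixed range that suits counts but not salaries or fractions.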

**Generating Scatter Matrix Plots**

In [54]:

# Importing scatter_matrix 
from pandas.plotting import scatter_matrix

# Create a 2 by 2 scatter matrix plot 
# using the Sample_size and Median columns. 
scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(10,10))

Out[54]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f2da04bb320>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f2da02e2240>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f2da02ac400>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f2da0260b70>]],
      dtype=object)

In [55]:

# Create a 3 by 3 scatter matrix plot 
# using the Sample_size, Median, and Unemployment_rate columns.
scatter_matrix(recent_grads[['Sample_size','Median','Unemployment_rate']], figsize=(10,10))

Out[55]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f2da02125c0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f2da017f6a0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f2da014e198>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f2da0102d30>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f2da00cef28>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f2da008e860>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f2da005d550>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f2da0099390>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f2d9ff619e8>]],
      dtype=object)

**Generating Bar Plots**

In [56]:

# Use bar plots to compare the percentages of women (ShareWomen) 
# from the first ten rows and last ten rows of the 
# recent_grads dataframe. 
recent_grads.head(10).plot.bar(x='Major', y='ShareWomen', legend = False)
recent_grads.tail(10).plot.bar(x='Major', y='ShareWomen', legend = False)

Out[56]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2d9fe17780>
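The question of how many majors are predominantly male or female can also be answered with a direct count on ShareWomen. The sketch below uses four rows copied from the output shown earlier; the 0.5 threshold and the `toy` name are assumptions made for illustration:

```python
import pandas as pd

# Four rows taken from the head/tail output above.
toy = pd.DataFrame({
    "Major": [
        "PETROLEUM ENGINEERING",
        "CHEMICAL ENGINEERING",
        "ZOOLOGY",
        "LIBRARY SCIENCE",
    ],
    "ShareWomen": [0.120564, 0.341631, 0.637293, 0.877960],
})

# A boolean mask sums to the number of True entries.
mostly_female = (toy["ShareWomen"] > 0.5).sum()
mostly_male = (toy["ShareWomen"] < 0.5).sum()
print(mostly_female, mostly_male)
```

Applied to the full `recent_grads` dataframe, the same two lines would give the counts for all majors.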

In [58]:

# Use bar plots to compare the unemployment rate 
# (Unemployment_rate) from the first ten rows and last ten rows of
# the recent_grads dataframe.
recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate', legend=False)
recent_grads[-10:].plot.bar(x='Major', y='Unemployment_rate', legend=False)

Out[58]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2d9fca5da0>
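The remaining question from the introduction, which category of majors has the most students, is a natural fit for a grouped sum. The following sketch groups by Major_category and sums Total, using four rows copied from the output shown earlier (`toy` is a hypothetical subset, so the ranking here says nothing about the full dataset):

```python
import pandas as pd

# Four rows taken from the head/tail output above.
toy = pd.DataFrame({
    "Major_category": [
        "Engineering",
        "Engineering",
        "Biology & Life Science",
        "Education",
    ],
    "Total": [2339.0, 756.0, 8409.0, 1098.0],
})

# Sum students per category and sort from largest to smallest.
totals = (
    toy.groupby("Major_category")["Total"]
    .sum()
    .sort_values(ascending=False)
)
print(totals)

# In a notebook, totals.plot.bar() would draw the result as a bar plot.
```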

**Conclusion**

In this project, we used the plotting tools built into pandas to explore data on the job outcomes of recent college graduates.