Guided Project: Visualizing Earnings Based On College Majors

In this guided project, we'll explore how using the pandas plotting functionality along with the Jupyter notebook interface allows us to explore data quickly using visualizations.

We'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012.

Dataset columns:

Rank - Rank by median earnings (the dataset is ordered by this column).
Major_code - Major code.
Major - Major description.
Major_category - Category of major.
Total - Total number of people with major.
Sample_size - Sample size (unweighted) of full-time.
Men - Male graduates.
Women - Female graduates.
ShareWomen - Women as share of total.
Employed - Number employed.
Median - Median salary of full-time, year-round workers.
Low_wage_jobs - Number in low-wage service jobs.
Full_time - Number employed 35 hours or more.
Part_time - Number employed less than 35 hours.

In [35]:

# Read the file and show its structure
# (pandas is imported here so that this cell also runs on its own)
import pandas as pd
recent_grads = pd.read_csv('recent-grads.csv')
print(recent_grads.iloc[0])
print(recent_grads.head())
print(recent_grads.tail())
print(recent_grads.describe())

Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object
   Rank  Major_code                                      Major    Total  \
0     1        2419                      PETROLEUM ENGINEERING   2339.0   
1     2        2416             MINING AND MINERAL ENGINEERING    756.0   
2     3        2415                  METALLURGICAL ENGINEERING    856.0   
3     4        2417  NAVAL ARCHITECTURE AND MARINE ENGINEERING   1258.0   
4     5        2405                       CHEMICAL ENGINEERING  32260.0   

       Men    Women Major_category  ShareWomen  Sample_size  Employed  \
0   2057.0    282.0    Engineering    0.120564           36      1976   
1    679.0     77.0    Engineering    0.101852            7       640   
2    725.0    131.0    Engineering    0.153037            3       648   
3   1123.0    135.0    Engineering    0.107313           16       758   
4  21239.0  11021.0    Engineering    0.341631          289     25694   

       ...        Part_time  Full_time_year_round  Unemployed  \
0      ...              270                  1207          37   
1      ...              170                   388          85   
2      ...              133                   340          16   
3      ...              150                   692          40   
4      ...             5180                 16697        1672   

   Unemployment_rate  Median  P25th   P75th  College_jobs  Non_college_jobs  \
0           0.018381  110000  95000  125000          1534               364   
1           0.117241   75000  55000   90000           350               257   
2           0.024096   73000  50000  105000           456               176   
3           0.050125   70000  43000   80000           529               102   
4           0.061098   65000  50000   75000         18314              4440   

   Low_wage_jobs  
0            193  
1             50  
2              0  
3              0  
4            972  

[5 rows x 21 columns]
     Rank  Major_code                   Major   Total     Men   Women  \
168   169        3609                 ZOOLOGY  8409.0  3050.0  5359.0   
169   170        5201  EDUCATIONAL PSYCHOLOGY  2854.0   522.0  2332.0   
170   171        5202     CLINICAL PSYCHOLOGY  2838.0   568.0  2270.0   
171   172        5203   COUNSELING PSYCHOLOGY  4626.0   931.0  3695.0   
172   173        3501         LIBRARY SCIENCE  1098.0   134.0   964.0   

               Major_category  ShareWomen  Sample_size  Employed  \
168    Biology & Life Science    0.637293           47      6259   
169  Psychology & Social Work    0.817099            7      2125   
170  Psychology & Social Work    0.799859           13      2101   
171  Psychology & Social Work    0.798746           21      3777   
172                 Education    0.877960            2       742   

         ...        Part_time  Full_time_year_round  Unemployed  \
168      ...             2190                  3602         304   
169      ...              572                  1211         148   
170      ...              648                  1293         368   
171      ...              965                  2738         214   
172      ...              237                   410          87   

     Unemployment_rate  Median  P25th  P75th  College_jobs  Non_college_jobs  \
168           0.046320   26000  20000  39000          2771              2947   
169           0.065112   25000  24000  34000          1488               615   
170           0.149048   25000  25000  40000           986               870   
171           0.053621   23400  19200  26000          2403              1245   
172           0.104946   22000  20000  22000           288               338   

     Low_wage_jobs  
168            743  
169             82  
170            622  
171            308  
172            192  

[5 rows x 21 columns]
             Rank   Major_code          Total            Men          Women  \
count  173.000000   173.000000     172.000000     172.000000     172.000000   
mean    87.000000  3879.815029   39370.081395   16723.406977   22646.674419   
std     50.084928  1687.753140   63483.491009   28122.433474   41057.330740   
min      1.000000  1100.000000     124.000000     119.000000       0.000000   
25%     44.000000  2403.000000    4549.750000    2177.500000    1778.250000   
50%     87.000000  3608.000000   15104.000000    5434.000000    8386.500000   
75%    130.000000  5503.000000   38909.750000   14631.000000   22553.750000   
max    173.000000  6403.000000  393735.000000  173809.000000  307087.000000   

       ShareWomen  Sample_size       Employed      Full_time      Part_time  \
count  172.000000   173.000000     173.000000     173.000000     173.000000   
mean     0.522223   356.080925   31192.763006   26029.306358    8832.398844   
std      0.231205   618.361022   50675.002241   42869.655092   14648.179473   
min      0.000000     2.000000       0.000000     111.000000       0.000000   
25%      0.336026    39.000000    3608.000000    3154.000000    1030.000000   
50%      0.534024   130.000000   11797.000000   10048.000000    3299.000000   
75%      0.703299   338.000000   31433.000000   25147.000000    9948.000000   
max      0.968954  4212.000000  307933.000000  251540.000000  115172.000000   

       Full_time_year_round    Unemployed  Unemployment_rate         Median  \
count            173.000000    173.000000         173.000000     173.000000   
mean           19694.427746   2416.329480           0.068191   40151.445087   
std            33160.941514   4112.803148           0.030331   11470.181802   
min              111.000000      0.000000           0.000000   22000.000000   
25%             2453.000000    304.000000           0.050306   33000.000000   
50%             7413.000000    893.000000           0.067961   36000.000000   
75%            16891.000000   2393.000000           0.087557   45000.000000   
max           199897.000000  28169.000000           0.177226  110000.000000   

              P25th          P75th   College_jobs  Non_college_jobs  \
count    173.000000     173.000000     173.000000        173.000000   
mean   29501.445087   51494.219653   12322.635838      13284.497110   
std     9166.005235   14906.279740   21299.868863      23789.655363   
min    18500.000000   22000.000000       0.000000          0.000000   
25%    24000.000000   42000.000000    1675.000000       1591.000000   
50%    27000.000000   47000.000000    4390.000000       4595.000000   
75%    33000.000000   60000.000000   14444.000000      11783.000000   
max    95000.000000  125000.000000  151643.000000     148395.000000   

       Low_wage_jobs  
count     173.000000  
mean     3859.017341  
std      6944.998579  
min         0.000000  
25%       340.000000  
50%      1231.000000  
75%      3466.000000  
max     48207.000000

In [34]:

# Set up the environment: imports and inline plotting

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from pandas.plotting import scatter_matrix

In [36]:

# Drop rows with missing values and compare the row counts before and after
raw_data_count=recent_grads.shape[0]
recent_grads=recent_grads.dropna(axis=0)
cleaned_data_count=recent_grads.shape[0]
print(raw_data_count, cleaned_data_count)

173 172
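
Before dropping rows, it can help to see where the missing values are. In the describe() output above, Total, Men, Women and ShareWomen have a count of 172 rather than 173, which is consistent with the single row removed here. The sketch below shows one way to locate such rows; the frame `toy` and its values are made up for illustration, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values; only the column names match the dataset.
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, np.nan, 300.0],
    "Men": [60.0, np.nan, 120.0],
    "Median": [50000, 40000, 30000],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# Rows with at least one missing value (exactly the rows dropna removes).
rows_with_missing = toy[toy.isnull().any(axis=1)]

cleaned = toy.dropna(axis=0)
print(missing_per_column)
print(rows_with_missing)
print(len(toy), len(cleaned))
```

On the real data, the same two expressions applied to recent_grads (before the dropna call) would show which major is being discarded.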

Scatter plots for investigating relations in the data

In [37]:

recent_grads.plot(x='Sample_size', y='Median', kind='scatter', title='Median vs. Sample_size')

Out[37]:

<matplotlib.axes._subplots.AxesSubplot at 0x7efbff101fd0>

In [38]:

recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind='scatter', title='Unemployment_rate vs. Sample_size')

Out[38]:

<matplotlib.axes._subplots.AxesSubplot at 0x7efbfeaf89e8>

In [39]:

recent_grads.plot(x='Full_time', y='Median', kind='scatter', title='Median vs. Full_time')

Out[39]:

<matplotlib.axes._subplots.AxesSubplot at 0x7efbfead0208>

In [40]:

recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter', title='Unemployment_rate vs. ShareWomen')

Out[40]:

<matplotlib.axes._subplots.AxesSubplot at 0x7efbfea2f6d8>

In [41]:

recent_grads.plot(x='Men', y='Median', kind='scatter', title='Median vs. Men')

Out[41]:

<matplotlib.axes._subplots.AxesSubplot at 0x7efbfea14080>

In [42]:

recent_grads.plot(x='Women', y='Median', kind='scatter', title='Median vs. Women')

Out[42]:

<matplotlib.axes._subplots.AxesSubplot at 0x7efbfe976e10>

** Questions: **

** Do students in more popular majors make more money? **

** Do students that majored in subjects that were majority female make more money? **

** Is there any link between the number of full-time employees and median salary? **

The scatter plots do not show strong correlations, so from these plots alone we cannot conclude that a major's popularity, its share of women, or its number of full-time employees is linked to median salary.
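
A scatter plot gives only a visual impression; a correlation coefficient puts a number on it. The sketch below uses `Series.corr` on the ten rows printed by head() and tail() above. These rows are the top and bottom of the ranking, so they exaggerate any trend and are not a substitute for the full dataset; the frame name `sample` is an assumption made for this example.

```python
import pandas as pd

# Values copied from the ten rows printed by head() and tail() above.
sample = pd.DataFrame({
    "ShareWomen": [0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
                   0.637293, 0.817099, 0.799859, 0.798746, 0.877960],
    "Median": [110000, 75000, 73000, 70000, 65000,
               26000, 25000, 25000, 23400, 22000],
})

# Pearson correlation coefficient between the two columns (between -1 and 1).
r = sample["ShareWomen"].corr(sample["Median"])
print(round(r, 2))
```

On the full frame the equivalent call would be recent_grads['ShareWomen'].corr(recent_grads['Median']), and recent_grads[['Sample_size', 'Full_time', 'ShareWomen', 'Median']].corr() would give the whole matrix at once.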

Histograms to explore the distributions of the columns

In [43]:

cols = ['Sample_size', 'Median', 'Employed', 'Full_time', 'ShareWomen', 'Unemployment_rate', 'Men', 'Women']

fig = plt.figure(figsize=(5,48))
# One histogram per column, stacked vertically
for i, col in enumerate(cols):
    ax = fig.add_subplot(len(cols), 1, i + 1)
    recent_grads[col].plot(kind='hist', rot=30, ax=ax)
    ax.set_title(col)

** Question: What's the most common median salary range? **

The most common median salary falls in the 30000-40000 range.
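
The fullest bin can also be read off numerically instead of by eye. The sketch below counts values per bin with `numpy.histogram`, using only the ten Median values printed by head() and tail() above, so its result describes those ten rows and not the whole dataset. The 10,000-wide bins are an arbitrary choice made for this example.

```python
import numpy as np

# Median salaries from the ten rows printed by head() and tail() above.
medians = [110000, 75000, 73000, 70000, 65000,
           26000, 25000, 25000, 23400, 22000]

# 10,000-wide bins from 20,000 to 120,000 (the width is an arbitrary choice).
edges = np.arange(20000, 120001, 10000)
counts, edges = np.histogram(medians, bins=edges)

# Index of the fullest bin and its boundaries.
top = int(np.argmax(counts))
most_common_range = (int(edges[top]), int(edges[top + 1]))
print(most_common_range, int(counts[top]))
```

Passing recent_grads['Median'] instead of the short list would give the counts behind the histogram drawn above, up to the choice of bin edges.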

Question: What percent of majors are predominantly female/male?

In [44]:

# Fraction of majors in which women are the majority
fem_predominality = recent_grads[recent_grads['ShareWomen'] > 0.5]
fem_predominality.shape[0]/recent_grads.shape[0]

Out[44]:

0.5581395348837209

In [45]:

# Two-bin histogram of ShareWomen: majority-male vs. majority-female majors
ax1 = recent_grads['ShareWomen'].hist(bins=2,range=(0,1))
ax1.set_title('ShareWomen')
ax1.set_ylabel('Num of Majors')

Out[45]:

<matplotlib.text.Text at 0x7efbfcd56438>

About 56% of majors are predominantly female, which leaves 100-56=44% that are predominantly male (or evenly split). For this task I prefer the direct calculation to the plot.
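
The same fraction can be computed without building a filtered frame, because the mean of a boolean Series is the share of True values. A minimal sketch, again using only the ten ShareWomen values printed by head() and tail() above (so the numbers it produces apply to those rows only):

```python
import pandas as pd

# ShareWomen values from the ten rows printed by head() and tail() above.
share_women = pd.Series([0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
                         0.637293, 0.817099, 0.799859, 0.798746, 0.877960])

# The mean of a boolean mask is the fraction of rows where it is True.
majority_female = (share_women > 0.5).mean()
majority_male = (share_women < 0.5).mean()
evenly_split = (share_women == 0.5).mean()
print(majority_female, majority_male, evenly_split)
```

Computing the three shares separately makes it explicit whether any major sits exactly at 0.5 instead of folding it into the male share.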

Scatter plot matrix with Sample_size, Median and Unemployment_rate columns

In [46]:

scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(20,20))

Out[46]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7efbfef7ef60>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7efbfee64d30>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7efbfed3ba58>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7efbfee3dba8>]],
      dtype=object)

In [47]:

scatter_matrix(recent_grads[['Sample_size', 'Median', 'Unemployment_rate']], figsize=(20,20))

Out[47]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7efbff000278>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7efbfedfce10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7efbfcd3e7b8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7efbfeef7ac8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7efbff0b74e0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7efbfccf3630>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7efbfccc0240>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7efbfcc7f048>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7efbfcbc7278>]],
      dtype=object)

There are no clear correlations between these columns.

Bar plots for the ShareWomen and Unemployment_rate columns

In [52]:

# Bar plot of ShareWomen for the first 10 rows of the dataset

recent_grads[:10]['ShareWomen'].plot(kind='bar')

Out[52]:

<matplotlib.axes._subplots.AxesSubplot at 0x7efbfc9e0b38>

In [53]:

# Bar plot of ShareWomen for the last 10 rows of the dataset
recent_grads[cleaned_data_count-10:]['ShareWomen'].plot(kind='bar')

Out[53]:

<matplotlib.axes._subplots.AxesSubplot at 0x7efbfc96d198>

In [54]:

# Bar plot of Unemployment_rate for the first 10 rows of the dataset

recent_grads[:10]['Unemployment_rate'].plot(kind='bar')

Out[54]:

<matplotlib.axes._subplots.AxesSubplot at 0x7efbfc91f8d0>

In [56]:

# Bar plot of Unemployment_rate for the last 10 rows of the dataset

recent_grads[cleaned_data_count-10:]['Unemployment_rate'].plot(kind='bar')

Out[56]:

<matplotlib.axes._subplots.AxesSubplot at 0x7efbfc839780>
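
The slices above rely on integer slicing of a DataFrame being positional, which matters because dropna leaves a gap in the index labels. The sketch below checks on a made-up frame (`toy`, with illustrative values) that a positional slice and `tail()` select the same rows.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value; the values are illustrative only.
toy = pd.DataFrame({"Major": list("ABCDEF"),
                    "ShareWomen": [0.1, 0.2, np.nan, 0.4, 0.5, 0.6]})
cleaned = toy.dropna(axis=0)   # index labels are now 0, 1, 3, 4, 5

n = cleaned.shape[0]
by_slice = cleaned[n - 2:]     # integer slices on a DataFrame are positional
by_tail = cleaned.tail(2)      # same rows, without tracking the row count

print(by_tail.index.tolist())
```

With the real data, recent_grads.tail(10).plot.bar(x='Major', y='ShareWomen') would draw the same bars and label them with major names instead of index numbers.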

Extra visualizations

** Task 1: Use a grouped bar plot to compare the number of men with the number of women in each category of majors**

In [66]:

#prepare data for plotting
recent_grads['ShareMen']=1-recent_grads['ShareWomen']
Share = recent_grads[['ShareMen','ShareWomen','Major']]

In [68]:

#plotting first 10 and last 10 majors:
Share.head(10).plot.barh(x='Major')

Out[68]:

<matplotlib.axes._subplots.AxesSubplot at 0x7efbf7c1c4a8>

In [63]:

Share.tail(10).plot.barh(x='Major')

Out[63]:

<matplotlib.axes._subplots.AxesSubplot at 0x7efbf7e27a90>

The share of men is high in the engineering majors, while women predominate in the non-engineering majors at the bottom of the ranking. Because the dataset is ranked by median earnings, this means the share of women is higher in the majors with the lowest median earnings.
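
The plots above compare shares per major. To compare the number of men and women per category of majors, as the task is worded, the counts can first be summed by Major_category. A sketch of that aggregation, using only the ten rows printed by head() and tail() above (so the totals cover those rows, not the full categories); the frame name `rows` is an assumption made for this example.

```python
import pandas as pd

# Major_category, Men and Women from the ten rows printed by head() and tail().
rows = pd.DataFrame({
    "Major_category": ["Engineering"] * 5
                      + ["Biology & Life Science"]
                      + ["Psychology & Social Work"] * 3
                      + ["Education"],
    "Men": [2057, 679, 725, 1123, 21239, 3050, 522, 568, 931, 134],
    "Women": [282, 77, 131, 135, 11021, 5359, 2332, 2270, 3695, 964],
})

# Sum the counts within each category: one row per category, two columns.
totals = rows.groupby("Major_category")[["Men", "Women"]].sum()
print(totals)

# In the notebook, totals.plot.bar() would draw one pair of bars per category.
```

Applied to recent_grads, the same groupby followed by .plot.bar() would produce the grouped bar plot for all categories.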

** Task 2: Use a box plot to explore the distributions of median salaries and unemployment rate. **

In [65]:

recent_grads['Median'].plot(kind='box')

Out[65]:

<matplotlib.axes._subplots.AxesSubplot at 0x7efbf7d19550>

In [70]:

recent_grads['Unemployment_rate'].plot(kind='box')

Out[70]:

<matplotlib.axes._subplots.AxesSubplot at 0x7efbf7bd9080>

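The middle half of the majors have a median salary between roughly 33000 and 45000 and an unemployment rate between roughly 0.05 and 0.09, in line with the 25% and 75% rows of the describe() output above.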
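
The box in a box plot spans the first to the third quartile, so the ranges can be computed directly with `Series.quantile`. A minimal sketch on the ten Median values printed by head() and tail() above; since those are the extremes of the ranking, the quartiles it produces are far wider than those of the full dataset.

```python
import pandas as pd

# Median salaries from the ten rows printed by head() and tail() above.
medians = pd.Series([110000, 75000, 73000, 70000, 65000,
                     26000, 25000, 25000, 23400, 22000])

# First quartile, median and third quartile (linear interpolation by default).
q1, q2, q3 = medians.quantile([0.25, 0.5, 0.75])

# Interquartile range: the height of the box in the box plot.
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

Calling the same method on recent_grads['Median'] or recent_grads['Unemployment_rate'] gives the box boundaries for the plots above.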

Conclusion: different kinds of visualizations help us perform exploratory data analysis. They are a powerful tool for identifying relations in the data, the distributions of its columns, and its main features.
