Sample project to visualize data on the earnings of college majors.
The data comes from the American Community Survey; you can download the cleaned dataset from this GitHub repo.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
recent_grads = pd.read_csv('recent-grads.csv')
recent_grads.iloc[0]
Rank                    1
Major_code              2419
Major                   PETROLEUM ENGINEERING
Total                   2339
Men                     2057
Women                   282
Major_category          Engineering
ShareWomen              0.120564
Sample_size             36
Employed                1976
Full_time               1849
Part_time               270
Full_time_year_round    1207
Unemployed              37
Unemployment_rate       0.0183805
Median                  110000
P25th                   95000
P75th                   125000
College_jobs            1534
Non_college_jobs        364
Low_wage_jobs           193
Name: 0, dtype: object
The first row shows the data for Petroleum Engineering: it is a very well-paid major, but it has a small share of women. We need to explore more of the data to learn about the other majors.
Now we can look at the first five and the last five majors in the list.
recent_grads.head()
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 150 | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 5180 | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 |
5 rows × 21 columns
recent_grads.tail()
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
168 | 169 | 3609 | ZOOLOGY | 8409.0 | 3050.0 | 5359.0 | Biology & Life Science | 0.637293 | 47 | 6259 | ... | 2190 | 3602 | 304 | 0.046320 | 26000 | 20000 | 39000 | 2771 | 2947 | 743 |
169 | 170 | 5201 | EDUCATIONAL PSYCHOLOGY | 2854.0 | 522.0 | 2332.0 | Psychology & Social Work | 0.817099 | 7 | 2125 | ... | 572 | 1211 | 148 | 0.065112 | 25000 | 24000 | 34000 | 1488 | 615 | 82 |
170 | 171 | 5202 | CLINICAL PSYCHOLOGY | 2838.0 | 568.0 | 2270.0 | Psychology & Social Work | 0.799859 | 13 | 2101 | ... | 648 | 1293 | 368 | 0.149048 | 25000 | 25000 | 40000 | 986 | 870 | 622 |
171 | 172 | 5203 | COUNSELING PSYCHOLOGY | 4626.0 | 931.0 | 3695.0 | Psychology & Social Work | 0.798746 | 21 | 3777 | ... | 965 | 2738 | 214 | 0.053621 | 23400 | 19200 | 26000 | 2403 | 1245 | 308 |
172 | 173 | 3501 | LIBRARY SCIENCE | 1098.0 | 134.0 | 964.0 | Education | 0.877960 | 2 | 742 | ... | 237 | 410 | 87 | 0.104946 | 22000 | 20000 | 22000 | 288 | 338 | 192 |
5 rows × 21 columns
Now we need to check the data type of each column and whether there are any null values that we should remove.
recent_grads.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 21 columns):
| # | Column | Non-Null Count | Dtype |
---|---|---|---|
0 | Rank | 173 non-null | int64 |
1 | Major_code | 173 non-null | int64 |
2 | Major | 173 non-null | object |
3 | Total | 172 non-null | float64 |
4 | Men | 172 non-null | float64 |
5 | Women | 172 non-null | float64 |
6 | Major_category | 173 non-null | object |
7 | ShareWomen | 172 non-null | float64 |
8 | Sample_size | 173 non-null | int64 |
9 | Employed | 173 non-null | int64 |
10 | Full_time | 173 non-null | int64 |
11 | Part_time | 173 non-null | int64 |
12 | Full_time_year_round | 173 non-null | int64 |
13 | Unemployed | 173 non-null | int64 |
14 | Unemployment_rate | 173 non-null | float64 |
15 | Median | 173 non-null | int64 |
16 | P25th | 173 non-null | int64 |
17 | P75th | 173 non-null | int64 |
18 | College_jobs | 173 non-null | int64 |
19 | Non_college_jobs | 173 non-null | int64 |
20 | Low_wage_jobs | 173 non-null | int64 |

dtypes: float64(5), int64(14), object(2)
memory usage: 28.5+ KB
recent_grads.describe()
| | Rank | Major_code | Total | Men | Women | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 173.000000 | 173.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
mean | 87.000000 | 3879.815029 | 39370.081395 | 16723.406977 | 22646.674419 | 0.522223 | 356.080925 | 31192.763006 | 26029.306358 | 8832.398844 | 19694.427746 | 2416.329480 | 0.068191 | 40151.445087 | 29501.445087 | 51494.219653 | 12322.635838 | 13284.497110 | 3859.017341 |
std | 50.084928 | 1687.753140 | 63483.491009 | 28122.433474 | 41057.330740 | 0.231205 | 618.361022 | 50675.002241 | 42869.655092 | 14648.179473 | 33160.941514 | 4112.803148 | 0.030331 | 11470.181802 | 9166.005235 | 14906.279740 | 21299.868863 | 23789.655363 | 6944.998579 |
min | 1.000000 | 1100.000000 | 124.000000 | 119.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 111.000000 | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.000000 | 2403.000000 | 4549.750000 | 2177.500000 | 1778.250000 | 0.336026 | 39.000000 | 3608.000000 | 3154.000000 | 1030.000000 | 2453.000000 | 304.000000 | 0.050306 | 33000.000000 | 24000.000000 | 42000.000000 | 1675.000000 | 1591.000000 | 340.000000 |
50% | 87.000000 | 3608.000000 | 15104.000000 | 5434.000000 | 8386.500000 | 0.534024 | 130.000000 | 11797.000000 | 10048.000000 | 3299.000000 | 7413.000000 | 893.000000 | 0.067961 | 36000.000000 | 27000.000000 | 47000.000000 | 4390.000000 | 4595.000000 | 1231.000000 |
75% | 130.000000 | 5503.000000 | 38909.750000 | 14631.000000 | 22553.750000 | 0.703299 | 338.000000 | 31433.000000 | 25147.000000 | 9948.000000 | 16891.000000 | 2393.000000 | 0.087557 | 45000.000000 | 33000.000000 | 60000.000000 | 14444.000000 | 11783.000000 | 3466.000000 |
max | 173.000000 | 6403.000000 | 393735.000000 | 173809.000000 | 307087.000000 | 0.968954 | 4212.000000 | 307933.000000 | 251540.000000 | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
As we can see from the results, the columns Total, Men, Women and ShareWomen each have one null value (172 non-null entries out of 173). We are going to delete the affected row to clean the dataset.
raw_data_count = recent_grads.shape
raw_data_count
(173, 21)
# dropna() removes the rows that contain null values
recent_grads = recent_grads.dropna()
cleaned_data_count = recent_grads.shape
cleaned_data_count
(172, 21)
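Before dropping anything, it can be useful to look at which rows actually contain the nulls. This is only a minimal sketch and not part of the original analysis: `sample` is a hypothetical three-row stand-in for `recent_grads`, with two rows copied from the tables above and one made-up row ("EXAMPLE MAJOR") that has missing counts.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for recent_grads: two real rows from the tables
# above plus one made-up row with missing values.
sample = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "EXAMPLE MAJOR", "LIBRARY SCIENCE"],
    "Total": [2339.0, np.nan, 1098.0],
    "Men": [2057.0, np.nan, 134.0],
    "Women": [282.0, np.nan, 964.0],
    "Median": [110000, 40000, 22000],
})

# Number of null values in each column.
null_counts = sample.isna().sum()

# Rows that contain at least one null value.
rows_with_nulls = sample[sample.isna().any(axis=1)]

# Same cleaning step as in the notebook: drop every row with a null.
cleaned = sample.dropna()

print(null_counts)
print(rows_with_nulls)
print(cleaned.shape)
```

On the real dataset the same two expressions, applied to `recent_grads`, would show which major is removed by `dropna()`.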
In this part we are going to use scatter plots to explore our data. The goal (well, my personal goal; I don't know whether this is the canonical way to use these plots) is to see what data we have and how the columns relate to each other.
First I'm going to use the .plot method in pandas to see the data.
I'm going to plot Sample_size against Employed. This can help us see the correlation between the two columns, which is important for verifying that the sample size tracks the number of people in each major.
# With the plot() method we need to specify the kind of plot; it also makes it easy to set the title.
recent_grads.plot(x='Sample_size', y='Employed', kind='scatter', title='Employed vs. Sample_size', figsize=(5,10))
<AxesSubplot:title={'center':'Employed vs. Sample_size'}, xlabel='Sample_size', ylabel='Employed'>
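The scatter plot gives a visual impression of the relationship; a correlation coefficient puts a number on it. The sketch below is an illustration only: it uses just the five rows of Sample_size and Employed shown in the head() output above (the name `sample` is my own), so the value it prints says nothing reliable about the full dataset.

```python
import pandas as pd

# First five rows of Sample_size and Employed, copied from the head() output.
sample = pd.DataFrame({
    "Sample_size": [36, 7, 3, 16, 289],
    "Employed": [1976, 640, 648, 758, 25694],
})

# Pearson correlation coefficient between the two columns (from -1 to 1).
r = sample["Sample_size"].corr(sample["Employed"])
print(f"Pearson r = {r:.3f}")
```

With the full data, `recent_grads['Sample_size'].corr(recent_grads['Employed'])` would be the equivalent call.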
The first thing we want to identify is how the most popular majors compare on median salary. One advantage of plotting the data is that we can easily see its range and pick out the information we need:
recent_grads.plot(x='Total', y='Median', kind='scatter', title='Median income vs. Total', figsize=(5,10))
<AxesSubplot:title={'center':'Median income vs. Total'}, xlabel='Total', ylabel='Median'>
most_profitable = recent_grads[(recent_grads['Median']>70000)]
most_profitable
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
3 rows × 21 columns
most_popular = recent_grads[(recent_grads['Total']>300000)]
most_popular
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
76 | 77 | 6203 | BUSINESS MANAGEMENT AND ADMINISTRATION | 329927.0 | 173809.0 | 156118.0 | Business | 0.473190 | 4212 | 276234 | ... | 50357 | 199897 | 21502 | 0.072218 | 38000 | 29000 | 50000 | 36720 | 148395 | 32395 |
145 | 146 | 5200 | PSYCHOLOGY | 393735.0 | 86648.0 | 307087.0 | Psychology & Social Work | 0.779933 | 2584 | 307933 | ... | 115172 | 174438 | 28169 | 0.083811 | 31500 | 24000 | 41000 | 125148 | 141860 | 48207 |
2 rows × 21 columns
most_profitable_popular = recent_grads[(recent_grads['Total']>50000)&(recent_grads['Median']>50000)]
most_profitable_popular
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | 9 | 2414 | MECHANICAL ENGINEERING | 91227.0 | 80320.0 | 10907.0 | Engineering | 0.119559 | 1029 | 76442 | ... | 13101 | 54639 | 4650 | 0.057342 | 60000 | 48000 | 70000 | 52844 | 16384 | 3253 |
9 | 10 | 2408 | ELECTRICAL ENGINEERING | 81527.0 | 65511.0 | 16016.0 | Engineering | 0.196450 | 631 | 61928 | ... | 12695 | 41413 | 3895 | 0.059174 | 60000 | 45000 | 72000 | 45829 | 10874 | 3170 |
17 | 18 | 2400 | GENERAL ENGINEERING | 61152.0 | 45683.0 | 15469.0 | Engineering | 0.252960 | 425 | 44931 | ... | 7199 | 33540 | 2859 | 0.059824 | 56000 | 36000 | 69000 | 26898 | 11734 | 3192 |
20 | 21 | 2102 | COMPUTER SCIENCE | 128319.0 | 99743.0 | 28576.0 | Computers & Mathematics | 0.222695 | 1196 | 102087 | ... | 18726 | 70932 | 6884 | 0.063173 | 53000 | 39000 | 70000 | 68622 | 25667 | 5144 |
4 rows × 21 columns
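The filters above use thresholds read off the plot by eye. As a possible alternative, pandas can pick the top rows directly with `nlargest`, and `query` lets us write the same kind of condition as a string. This is a sketch on a hypothetical five-row frame (`sample`, with values copied from the tables above), not a replacement for the cells in this notebook.

```python
import pandas as pd

# Five rows copied from the tables above (only the columns we need).
sample = pd.DataFrame({
    "Major": [
        "PETROLEUM ENGINEERING",
        "MECHANICAL ENGINEERING",
        "COMPUTER SCIENCE",
        "PSYCHOLOGY",
        "LIBRARY SCIENCE",
    ],
    "Total": [2339.0, 91227.0, 128319.0, 393735.0, 1098.0],
    "Median": [110000, 60000, 53000, 31500, 22000],
})

# The two best-paid majors, without choosing a salary threshold by hand.
top_paid = sample.nlargest(2, "Median")

# Same condition as the boolean mask above, written as a query string.
popular_and_paid = sample.query("Total > 50000 and Median > 50000")

print(top_paid)
print(popular_and_paid)
```

`nlargest` is convenient when we care about a fixed number of rows, while the threshold filters are better when the cut-off itself is meaningful.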
Next we will look at the majors with the highest unemployment. As the plot suggests, the unemployment rate does not appear to be correlated with the popularity of the major.
recent_grads.plot(x='Total', y='Unemployment_rate', kind='scatter', title='Unemployment_rate vs. Total', figsize=(5,10))
<AxesSubplot:title={'center':'Unemployment_rate vs. Total'}, xlabel='Total', ylabel='Unemployment_rate'>
most_unemployed_popular = recent_grads[(recent_grads['Unemployment_rate']>0.09)&(recent_grads['Total']>100000)]
most_unemployed_popular
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
36 | 37 | 5501 | ECONOMICS | 139247.0 | 89749.0 | 49498.0 | Social Science | 0.355469 | 1322 | 104117 | ... | 25325 | 70740 | 11452 | 0.099092 | 47000 | 35000 | 65000 | 25582 | 37057 | 10653 |
78 | 79 | 5506 | POLITICAL SCIENCE AND GOVERNMENT | 182621.0 | 93880.0 | 88741.0 | Social Science | 0.485930 | 1387 | 133454 | ... | 43711 | 83236 | 15022 | 0.101175 | 38000 | 28000 | 50000 | 36854 | 66947 | 19803 |
95 | 96 | 6004 | COMMERCIAL ART AND GRAPHIC DESIGN | 103480.0 | 32041.0 | 71439.0 | Arts | 0.690365 | 1186 | 83483 | ... | 24387 | 52243 | 8947 | 0.096798 | 35000 | 25000 | 45000 | 37389 | 38119 | 14839 |
114 | 115 | 6402 | HISTORY | 141951.0 | 78253.0 | 63698.0 | Humanities & Liberal Arts | 0.448732 | 1058 | 105646 | ... | 40657 | 59218 | 11176 | 0.095667 | 34000 | 25000 | 47000 | 35336 | 54569 | 16839 |
4 rows × 21 columns
most_unemployed = recent_grads[(recent_grads['Unemployment_rate']>0.15)]
most_unemployed
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 6 | 2418 | NUCLEAR ENGINEERING | 2573.0 | 2200.0 | 373.0 | Engineering | 0.144967 | 17 | 1857 | ... | 264 | 1449 | 400 | 0.177226 | 65000 | 50000 | 102000 | 1142 | 657 | 244 |
84 | 85 | 2107 | COMPUTER NETWORKING AND TELECOMMUNICATIONS | 7613.0 | 5291.0 | 2322.0 | Computers & Mathematics | 0.305005 | 97 | 6144 | ... | 1447 | 4369 | 1100 | 0.151850 | 36400 | 27000 | 49000 | 2593 | 2941 | 352 |
89 | 90 | 5401 | PUBLIC ADMINISTRATION | 5629.0 | 2947.0 | 2682.0 | Law & Public Policy | 0.476461 | 46 | 4158 | ... | 847 | 2952 | 789 | 0.159491 | 36000 | 23000 | 60000 | 919 | 2313 | 496 |
3 rows × 21 columns
Plotting the number of full-time jobs against the median salary, we find the following:
recent_grads.plot(x='Full_time', y='Median', kind='scatter', title='Median vs. Full_time', figsize=(5,10))
<AxesSubplot:title={'center':'Median vs. Full_time'}, xlabel='Full_time', ylabel='Median'>
most_fulltime_median = recent_grads[(recent_grads['Full_time']>100000)&(recent_grads['Median']>40000)]
most_fulltime_median
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
34 | 35 | 6107 | NURSING | 209394.0 | 21773.0 | 187621.0 | Health | 0.896019 | 2554 | 180903 | ... | 40818 | 122817 | 8497 | 0.044863 | 48000 | 39000 | 58000 | 151643 | 26146 | 6193 |
35 | 36 | 6207 | FINANCE | 174506.0 | 115030.0 | 59476.0 | Business | 0.340825 | 2189 | 145696 | ... | 21463 | 108595 | 9413 | 0.060686 | 47000 | 35000 | 64000 | 24243 | 48447 | 9910 |
40 | 41 | 6201 | ACCOUNTING | 198633.0 | 94519.0 | 104114.0 | Business | 0.524153 | 2042 | 165527 | ... | 27693 | 123169 | 12411 | 0.069749 | 45000 | 34000 | 56000 | 11417 | 39323 | 10886 |
3 rows × 21 columns
worst_fulltime_median = recent_grads[(recent_grads['Full_time']>100000)&(recent_grads['Median']<35000)]
worst_fulltime_median
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
123 | 124 | 3600 | BIOLOGY | 280709.0 | 111762.0 | 168947.0 | Biology & Life Science | 0.601858 | 1370 | 182295 | ... | 72371 | 100336 | 13874 | 0.070725 | 33400 | 24000 | 45000 | 88232 | 81109 | 28339 |
137 | 138 | 3301 | ENGLISH LANGUAGE AND LITERATURE | 194673.0 | 58227.0 | 136446.0 | Humanities & Liberal Arts | 0.700898 | 1436 | 149180 | ... | 57825 | 81180 | 14345 | 0.087724 | 32000 | 23000 | 41000 | 57690 | 71827 | 26503 |
138 | 139 | 2304 | ELEMENTARY EDUCATION | 170862.0 | 13029.0 | 157833.0 | Education | 0.923745 | 1629 | 149339 | ... | 37965 | 86540 | 7297 | 0.046586 | 32000 | 23400 | 38000 | 108085 | 36972 | 11502 |
145 | 146 | 5200 | PSYCHOLOGY | 393735.0 | 86648.0 | 307087.0 | Psychology & Social Work | 0.779933 | 2584 | 307933 | ... | 115172 | 174438 | 28169 | 0.083811 | 31500 | 24000 | 41000 | 125148 | 141860 | 48207 |
4 rows × 21 columns
The share of women is a burning issue, because women are not evenly represented in the more profitable careers, and it is known that women are paid less than men in most jobs.
recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter', title='Unemployment_rate vs. ShareWomen', figsize=(5,10))
<AxesSubplot:title={'center':'Unemployment_rate vs. ShareWomen'}, xlabel='ShareWomen', ylabel='Unemployment_rate'>
most_ShareWomen_unemployment = recent_grads[(recent_grads['ShareWomen']>0.8)&(recent_grads['Unemployment_rate']<0.04)]
most_ShareWomen_unemployment
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
154 | 155 | 2312 | TEACHER EDUCATION: MULTIPLE LEVELS | 14443.0 | 2734.0 | 11709.0 | Education | 0.810704 | 142 | 13076 | ... | 2214 | 8457 | 496 | 0.036546 | 30000 | 24000 | 37000 | 10766 | 1949 | 722 |
156 | 157 | 5403 | HUMAN SERVICES AND COMMUNITY ORGANIZATION | 9374.0 | 885.0 | 8489.0 | Psychology & Social Work | 0.905590 | 89 | 8294 | ... | 2405 | 5061 | 326 | 0.037819 | 30000 | 24000 | 35000 | 2878 | 4595 | 724 |
2 rows × 21 columns
worst_ShareWomen_unemployment = recent_grads[(recent_grads['ShareWomen']>0.8)&(recent_grads['Unemployment_rate']>0.10)]
worst_ShareWomen_unemployment
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
55 | 56 | 2303 | SCHOOL STUDENT COUNSELING | 818.0 | 119.0 | 699.0 | Education | 0.854523 | 4 | 730 | ... | 135 | 545 | 88 | 0.107579 | 41000 | 41000 | 43000 | 509 | 221 | 0 |
172 | 173 | 3501 | LIBRARY SCIENCE | 1098.0 | 134.0 | 964.0 | Education | 0.877960 | 2 | 742 | ... | 237 | 410 | 87 | 0.104946 | 22000 | 20000 | 22000 | 288 | 338 | 192 |
2 rows × 21 columns
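One way to check numerically how salary relates to the share of women is to split the majors into groups by ShareWomen and compare the median salary of each group. The sketch below is only an illustration under loud assumptions: `sample` contains eight rows that I hand-picked from the tables above, so the result is not representative, and the 50% cut-off and the group labels are my own choices.

```python
import pandas as pd

# Eight hand-picked rows copied from the tables above (not representative).
sample = pd.DataFrame({
    "Major": [
        "PETROLEUM ENGINEERING", "MECHANICAL ENGINEERING", "COMPUTER SCIENCE",
        "NURSING", "ACCOUNTING", "PSYCHOLOGY",
        "LIBRARY SCIENCE", "ELEMENTARY EDUCATION",
    ],
    "ShareWomen": [0.120564, 0.119559, 0.222695,
                   0.896019, 0.524153, 0.779933,
                   0.877960, 0.923745],
    "Median": [110000, 60000, 53000, 48000, 45000, 31500, 22000, 32000],
})

# Assign each major to a group depending on its share of women.
sample["Group"] = pd.cut(
    sample["ShareWomen"],
    bins=[0.0, 0.5, 1.0],
    labels=["up to 50% women", "over 50% women"],
)

# Median of the median salaries within each group.
by_group = sample.groupby("Group", observed=True)["Median"].median()
print(by_group)
```

Running the same grouping on the full `recent_grads` frame would give a fairer picture than these few rows.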
Looking at the number of men in each major, we find:
recent_grads.plot(x='Men', y='Median', kind='scatter', title='Median vs. Men', figsize=(5,10))
<AxesSubplot:title={'center':'Median vs. Men'}, xlabel='Men', ylabel='Median'>
most_men_median = recent_grads[(recent_grads['Men']>50000)&(recent_grads['Median']>50000)]
most_men_median
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | 9 | 2414 | MECHANICAL ENGINEERING | 91227.0 | 80320.0 | 10907.0 | Engineering | 0.119559 | 1029 | 76442 | ... | 13101 | 54639 | 4650 | 0.057342 | 60000 | 48000 | 70000 | 52844 | 16384 | 3253 |
9 | 10 | 2408 | ELECTRICAL ENGINEERING | 81527.0 | 65511.0 | 16016.0 | Engineering | 0.196450 | 631 | 61928 | ... | 12695 | 41413 | 3895 | 0.059174 | 60000 | 45000 | 72000 | 45829 | 10874 | 3170 |
20 | 21 | 2102 | COMPUTER SCIENCE | 128319.0 | 99743.0 | 28576.0 | Computers & Mathematics | 0.222695 | 1196 | 102087 | ... | 18726 | 70932 | 6884 | 0.063173 | 53000 | 39000 | 70000 | 68622 | 25667 | 5144 |
3 rows × 21 columns
worst_men_median = recent_grads[(recent_grads['Men']>50000)&(recent_grads['Median']<33000)]
worst_men_median
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
137 | 138 | 3301 | ENGLISH LANGUAGE AND LITERATURE | 194673.0 | 58227.0 | 136446.0 | Humanities & Liberal Arts | 0.700898 | 1436 | 149180 | ... | 57825 | 81180 | 14345 | 0.087724 | 32000 | 23000 | 41000 | 57690 | 71827 | 26503 |
139 | 140 | 4101 | PHYSICAL FITNESS PARKS RECREATION AND LEISURE | 125074.0 | 62181.0 | 62893.0 | Industrial Arts & Consumer Services | 0.502846 | 1014 | 103078 | ... | 38515 | 57978 | 5593 | 0.051467 | 32000 | 24000 | 43000 | 27581 | 63946 | 16838 |
145 | 146 | 5200 | PSYCHOLOGY | 393735.0 | 86648.0 | 307087.0 | Psychology & Social Work | 0.779933 | 2584 | 307933 | ... | 115172 | 174438 | 28169 | 0.083811 | 31500 | 24000 | 41000 | 125148 | 141860 | 48207 |
3 rows × 21 columns
Looking at the number of women in each major, we find:
recent_grads.plot(x='Women', y='Median', kind='scatter', title='Median vs. Women', rot=30, figsize=(5,10))
<AxesSubplot:title={'center':'Median vs. Women'}, xlabel='Women', ylabel='Median'>
most_women_median = recent_grads[(recent_grads['Women']>50000)&(recent_grads['Median']>40000)]
most_women_median
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
34 | 35 | 6107 | NURSING | 209394.0 | 21773.0 | 187621.0 | Health | 0.896019 | 2554 | 180903 | ... | 40818 | 122817 | 8497 | 0.044863 | 48000 | 39000 | 58000 | 151643 | 26146 | 6193 |
35 | 36 | 6207 | FINANCE | 174506.0 | 115030.0 | 59476.0 | Business | 0.340825 | 2189 | 145696 | ... | 21463 | 108595 | 9413 | 0.060686 | 47000 | 35000 | 64000 | 24243 | 48447 | 9910 |
40 | 41 | 6201 | ACCOUNTING | 198633.0 | 94519.0 | 104114.0 | Business | 0.524153 | 2042 | 165527 | ... | 27693 | 123169 | 12411 | 0.069749 | 45000 | 34000 | 56000 | 11417 | 39323 | 10886 |
3 rows × 21 columns
worst_women_median = recent_grads[(recent_grads['Women']>50000)&(recent_grads['Median']<32000)]
worst_women_median
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
145 | 146 | 5200 | PSYCHOLOGY | 393735.0 | 86648.0 | 307087.0 | Psychology & Social Work | 0.779933 | 2584 | 307933 | ... | 115172 | 174438 | 28169 | 0.083811 | 31500 | 24000 | 41000 | 125148 | 141860 | 48207 |
150 | 151 | 2901 | FAMILY AND CONSUMER SCIENCES | 58001.0 | 5166.0 | 52835.0 | Industrial Arts & Consumer Services | 0.910933 | 518 | 46624 | ... | 15872 | 26906 | 3355 | 0.067128 | 30000 | 22900 | 40000 | 20985 | 20133 | 5248 |
2 rows × 21 columns
Histograms of the individual columns (Series) help us see how the values are distributed and make sense of the different situations:
recent_grads['Sample_size'].plot(kind='hist')
<AxesSubplot:ylabel='Frequency'>
# Increasing the number of bins shows the frequency of the data in more detail.
recent_grads['Sample_size'].hist(bins=25, range=(0,5000))
<AxesSubplot:>
recent_grads['Median'].hist(bins=20, range=(0,120000))
<AxesSubplot:>
recent_grads['Employed'].hist(bins=20, range=(0,350000))
<AxesSubplot:>
recent_grads['Full_time'].hist(bins=20, range=(0,300000))
<AxesSubplot:>
recent_grads['ShareWomen'].hist(bins=20, range=(0,1))
<AxesSubplot:>
recent_grads['Unemployment_rate'].hist(bins=20, range=(0,0.2))
<AxesSubplot:>
recent_grads['Men'].hist(bins=20, range=(0,200000))
<AxesSubplot:>
recent_grads['Women'].hist(bins=20, range=(0,200000))
<AxesSubplot:>
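The cells above draw one histogram per call. As a possible shortcut, `DataFrame.hist` draws one histogram per numeric column in a single grid. This is a sketch, not the notebook's method: `sample` is a hypothetical ten-row frame built from the head() and tail() outputs above, and the `matplotlib.use("Agg")` line is only there so the code also runs as a plain script without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; skip this line in a notebook
import pandas as pd

# Ten rows copied from the head() and tail() outputs above.
sample = pd.DataFrame({
    "Sample_size": [36, 7, 3, 16, 289, 47, 7, 13, 21, 2],
    "Median": [110000, 75000, 73000, 70000, 65000,
               26000, 25000, 25000, 23400, 22000],
    "ShareWomen": [0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
                   0.637293, 0.817099, 0.799859, 0.798746, 0.877960],
    "Unemployment_rate": [0.018381, 0.117241, 0.024096, 0.050125, 0.061098,
                          0.046320, 0.065112, 0.149048, 0.053621, 0.104946],
})

# One histogram per column, arranged in a 2x2 grid; each subplot is
# titled with its column name.
axes = sample.hist(bins=5, layout=(2, 2), figsize=(8, 6))
```

Unlike the individual calls, this uses the same number of bins for every column, so per-column `range` settings like the ones above would still need separate calls.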
This kind of plot is certainly useful for seeing a large amount of data at once and then using the result to decide where to look in more detail. I think we should start with this kind of plot and use the others later; in any case, the results are the same as in the previous exercises.
from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[['Women', 'Men']], figsize=(10,10))
array([[<AxesSubplot:xlabel='Women', ylabel='Women'>, <AxesSubplot:xlabel='Men', ylabel='Women'>], [<AxesSubplot:xlabel='Women', ylabel='Men'>, <AxesSubplot:xlabel='Men', ylabel='Men'>]], dtype=object)
scatter_matrix(recent_grads[['Total', 'Median']], figsize=(10,10))
array([[<AxesSubplot:xlabel='Total', ylabel='Total'>, <AxesSubplot:xlabel='Median', ylabel='Total'>], [<AxesSubplot:xlabel='Total', ylabel='Median'>, <AxesSubplot:xlabel='Median', ylabel='Median'>]], dtype=object)
scatter_matrix(recent_grads[['Total', 'Median','Unemployment_rate']], figsize=(10,10))
array([[<AxesSubplot:xlabel='Total', ylabel='Total'>, <AxesSubplot:xlabel='Median', ylabel='Total'>, <AxesSubplot:xlabel='Unemployment_rate', ylabel='Total'>], [<AxesSubplot:xlabel='Total', ylabel='Median'>, <AxesSubplot:xlabel='Median', ylabel='Median'>, <AxesSubplot:xlabel='Unemployment_rate', ylabel='Median'>], [<AxesSubplot:xlabel='Total', ylabel='Unemployment_rate'>, <AxesSubplot:xlabel='Median', ylabel='Unemployment_rate'>, <AxesSubplot:xlabel='Unemployment_rate', ylabel='Unemployment_rate'>]], dtype=object)
scatter_matrix(recent_grads[['Total', 'Median', 'ShareWomen']], figsize=(10,10))
array([[<AxesSubplot:xlabel='Total', ylabel='Total'>, <AxesSubplot:xlabel='Median', ylabel='Total'>, <AxesSubplot:xlabel='ShareWomen', ylabel='Total'>], [<AxesSubplot:xlabel='Total', ylabel='Median'>, <AxesSubplot:xlabel='Median', ylabel='Median'>, <AxesSubplot:xlabel='ShareWomen', ylabel='Median'>], [<AxesSubplot:xlabel='Total', ylabel='ShareWomen'>, <AxesSubplot:xlabel='Median', ylabel='ShareWomen'>, <AxesSubplot:xlabel='ShareWomen', ylabel='ShareWomen'>]], dtype=object)
scatter_matrix(recent_grads[['Full_time', 'Median', 'ShareWomen']], figsize=(10,10))
array([[<AxesSubplot:xlabel='Full_time', ylabel='Full_time'>, <AxesSubplot:xlabel='Median', ylabel='Full_time'>, <AxesSubplot:xlabel='ShareWomen', ylabel='Full_time'>], [<AxesSubplot:xlabel='Full_time', ylabel='Median'>, <AxesSubplot:xlabel='Median', ylabel='Median'>, <AxesSubplot:xlabel='ShareWomen', ylabel='Median'>], [<AxesSubplot:xlabel='Full_time', ylabel='ShareWomen'>, <AxesSubplot:xlabel='Median', ylabel='ShareWomen'>, <AxesSubplot:xlabel='ShareWomen', ylabel='ShareWomen'>]], dtype=object)
Bar plots are very useful for visualizing the data and seeing the nuances. I like the results with this kind of plot. We only need to change the kind argument of the .plot method to 'bar'.
Another alternative is to use the plot.bar method.
recent_grads[:5]['Women'].plot(kind='bar')
<AxesSubplot:>
# With the .plot.bar method we can set both axes, so we can easily compare the data.
recent_grads[:5].plot.bar(x='Major', y='Women')
<AxesSubplot:xlabel='Major'>
recent_grads[:5]['ShareWomen'].plot(kind='bar')
<AxesSubplot:>
recent_grads[-5:]['ShareWomen'].plot(kind='bar')
<AxesSubplot:>
recent_grads[:5]['Unemployment_rate'].plot(kind='bar')
<AxesSubplot:>
recent_grads[-5:]['Unemployment_rate'].plot(kind='bar')
<AxesSubplot:>
With the bar plot we can apply some filter conditions and look at the results, which are interesting:
recent_grads[(recent_grads['ShareWomen']>0.4)&(recent_grads['Median']>48000)].plot.bar(x='Major', y='Median')
<AxesSubplot:xlabel='Major'>
Here we apply the same steps as in the previous exercise, but this time with the unemployment rate: the share of women is above 0.7 and the unemployment rate is below 0.04.
recent_grads[(recent_grads['ShareWomen']>0.7)&(recent_grads['Unemployment_rate']<0.04)].plot.bar(x='Major', y='Median')
<AxesSubplot:xlabel='Major'>
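Long major names are hard to read on a vertical bar plot, and unsorted bars are hard to compare. A possible variation, sketched below, is to sort the filtered rows first and use horizontal bars. The frame `sample` is a hypothetical five-row stand-in copied from the tables above, and I lowered the salary threshold to 40000 so that this small sample returns a few rows; it is not the filter used in the cells above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; skip this line in a notebook
import pandas as pd

# Five rows copied from the tables above.
sample = pd.DataFrame({
    "Major": ["NURSING", "FINANCE", "ACCOUNTING",
              "SCHOOL STUDENT COUNSELING", "PSYCHOLOGY"],
    "ShareWomen": [0.896019, 0.340825, 0.524153, 0.854523, 0.779933],
    "Median": [48000, 47000, 45000, 41000, 31500],
})

# Same kind of condition as above (threshold changed for this small sample).
mask = (sample["ShareWomen"] > 0.4) & (sample["Median"] > 40000)

# Sort by salary so the bars appear in order.
filtered = sample[mask].sort_values("Median")

# Horizontal bars leave room for the long major names.
ax = filtered.plot.barh(x="Major", y="Median", legend=False)
ax.set_xlabel("Median salary")
```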
I don't understand this one very well, but it looks awesome, so I'm going to include it anyway.
recent_grads.plot.hexbin(x='Total', y='Median',gridsize=10)
<AxesSubplot:xlabel='Total', ylabel='Median'>
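As far as I can tell, a hexbin plot is a two-dimensional histogram: the plane is divided into hexagonal cells, the points falling into each cell are counted, and the color shows the count (`gridsize` sets the number of hexagons in the x direction). The sketch below shows the same counting idea with rectangular cells, using `numpy.histogram2d` as a stand-in; the ten Total and Median values are copied from the head() and tail() outputs above, and the 4x4 grid is an arbitrary choice of mine.

```python
import numpy as np

# Total and Median for ten majors, copied from the head() and tail() outputs.
total = np.array([2339, 756, 856, 1258, 32260,
                  8409, 2854, 2838, 4626, 1098], dtype=float)
median = np.array([110000, 75000, 73000, 70000, 65000,
                   26000, 25000, 25000, 23400, 22000], dtype=float)

# Count how many points fall into each cell of a 4x4 rectangular grid.
# hexbin does the same kind of counting, but with hexagonal cells.
counts, total_edges, median_edges = np.histogram2d(total, median, bins=4)

print(counts)
```

Each point is counted in exactly one cell, so the counts add up to the number of majors. Dark cells in the hexbin plot are simply the regions where many majors pile up.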
Box plots are extremely useful for visualizing the distribution of the median salary; I would use them in future projects.
recent_grads.boxplot(column=['Median'])
<AxesSubplot:>
recent_grads.boxplot(column=['Men', 'Women'])
<AxesSubplot:>
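To read the box plot of Median, it helps to know where the whiskers end. With the default setting (`whis=1.5` in matplotlib), points more than 1.5 times the interquartile range beyond the quartiles are drawn as individual outliers. The sketch below works this out from the 25% and 75% values that describe() reported above. Two assumptions to keep in mind: those quartiles were computed before the null row was dropped, so they may differ slightly on the cleaned data, and `medians` holds only a handful of values copied from the tables.

```python
import pandas as pd

# 25% and 75% quartiles of Median, taken from the describe() output above.
q1, q3 = 33000, 45000
iqr = q3 - q1

# Values outside these fences are drawn as outlier points.
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

# Median salaries of a few majors, copied from the tables above.
medians = pd.Series({
    "PETROLEUM ENGINEERING": 110000,
    "MINING AND MINERAL ENGINEERING": 75000,
    "METALLURGICAL ENGINEERING": 73000,
    "NAVAL ARCHITECTURE AND MARINE ENGINEERING": 70000,
    "CHEMICAL ENGINEERING": 65000,
    "MECHANICAL ENGINEERING": 60000,
    "LIBRARY SCIENCE": 22000,
})

outliers = medians[(medians > upper_fence) | (medians < lower_fence)]
print(lower_fence, upper_fence)
print(outliers)
```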
We made some useful visualizations to extract information from our data and discover interesting things about our dataset: