The goal of this project is to analyze the job outcomes of students who graduated from college between 2010 and 2012. Using visualizations, we can start to explore questions about earnings, employment, and gender diversity across majors.
The data file is: recent-grads.csv
Each row in the dataset represents a different major in college and contains information on gender diversity, employment rates, median salaries, and more. Here are some of the columns in the dataset:
Rank - Rank by median earnings (the dataset is ordered by this column).
Major_code - Major code.
Major - Major description.
Major_category - Category of major.
Total - Total number of people with major.
Sample_size - Sample size (unweighted) of full-time workers.
Men - Male graduates.
Women - Female graduates.
ShareWomen - Women as share of total.
Employed - Number employed.
Median - Median salary of full-time, year-round workers.
Low_wage_jobs - Number in low-wage service jobs.
Full_time - Number employed 35 hours or more.
Import needed libraries
import pandas as pd
import matplotlib.pyplot as plt
Read the dataset
recent_grads=pd.read_csv("recent-grads.csv")
#print first row of the data
print(recent_grads.iloc[0])
#first 5 rows of the data
print(recent_grads.head())
#last five rows of the data
print(recent_grads.tail())
(Output truncated: iloc[0] shows the top-ranked major, PETROLEUM ENGINEERING, with Total 2339, Median 110000 and Unemployment_rate 0.0184. head() lists the five top-ranked majors, all in the Engineering category; tail() lists the five lowest-ranked majors, ending with LIBRARY SCIENCE at a Median of 22000. Each preview is 5 rows x 21 columns.)
#Generate statistical summary of the data
recent_grads.describe()
 | Rank | Major_code | Total | Men | Women | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 173.000000 | 173.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
mean | 87.000000 | 3879.815029 | 39370.081395 | 16723.406977 | 22646.674419 | 0.522223 | 356.080925 | 31192.763006 | 26029.306358 | 8832.398844 | 19694.427746 | 2416.329480 | 0.068191 | 40151.445087 | 29501.445087 | 51494.219653 | 12322.635838 | 13284.497110 | 3859.017341 |
std | 50.084928 | 1687.753140 | 63483.491009 | 28122.433474 | 41057.330740 | 0.231205 | 618.361022 | 50675.002241 | 42869.655092 | 14648.179473 | 33160.941514 | 4112.803148 | 0.030331 | 11470.181802 | 9166.005235 | 14906.279740 | 21299.868863 | 23789.655363 | 6944.998579 |
min | 1.000000 | 1100.000000 | 124.000000 | 119.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 111.000000 | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.000000 | 2403.000000 | 4549.750000 | 2177.500000 | 1778.250000 | 0.336026 | 39.000000 | 3608.000000 | 3154.000000 | 1030.000000 | 2453.000000 | 304.000000 | 0.050306 | 33000.000000 | 24000.000000 | 42000.000000 | 1675.000000 | 1591.000000 | 340.000000 |
50% | 87.000000 | 3608.000000 | 15104.000000 | 5434.000000 | 8386.500000 | 0.534024 | 130.000000 | 11797.000000 | 10048.000000 | 3299.000000 | 7413.000000 | 893.000000 | 0.067961 | 36000.000000 | 27000.000000 | 47000.000000 | 4390.000000 | 4595.000000 | 1231.000000 |
75% | 130.000000 | 5503.000000 | 38909.750000 | 14631.000000 | 22553.750000 | 0.703299 | 338.000000 | 31433.000000 | 25147.000000 | 9948.000000 | 16891.000000 | 2393.000000 | 0.087557 | 45000.000000 | 33000.000000 | 60000.000000 | 14444.000000 | 11783.000000 | 3466.000000 |
max | 173.000000 | 6403.000000 | 393735.000000 | 173809.000000 | 307087.000000 | 0.968954 | 4212.000000 | 307933.000000 | 251540.000000 | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
The count row shows that the columns do not all have the same number of values, which means some rows contain missing data. I will verify this using the info() method.
#info() prints the summary directly and returns None
recent_grads.info()
(Output: DataFrame with a RangeIndex of 173 entries and 21 columns. Total, Men, Women and ShareWomen each have 172 non-null float64 values; every other column has 173 non-null values. dtypes: float64(5), int64(14), object(2).)
Check the count of non-null values in the dataset
raw_data_count=recent_grads.notnull().sum()
print(raw_data_count)
(Output: 173 non-null values in every column except Total, Men, Women and ShareWomen, which each have 172.)
Based on the above results, four columns (Total, Men, Women and ShareWomen) each contain a null value. I will remove the rows with nulls and update the data.
recent_grads=recent_grads.dropna()
cleaned_data_count=recent_grads.notnull().sum()
print(cleaned_data_count)
(Output: 172 non-null values in every column.)
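A note on dropna(): as called above it drops any row containing a null in any column. Since only four columns are affected, the subset argument can restrict the check to just those columns, which is safer if other columns might legitimately hold nulls later. A minimal sketch on a toy frame (column names mirror the dataset, values are made up):

```python
import pandas as pd

# Toy frame mimicking the dataset's null pattern:
# one row is missing Total and Men, as in the real file.
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, None, 300.0],
    "Men": [40.0, None, 120.0],
    "Median": [35000, 40000, 30000],
})

# Drop a row only when one of the listed columns is null.
cleaned = toy.dropna(subset=["Total", "Men"])
print(len(toy), len(cleaned))  # 3 2
```

On the real data this gives the same 172-row result, since the nulls all sit in the same row.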
I will compare cleaned_data_count and raw_data_count to check for any changes in the data after deleting the null rows.
comp1=pd.Series(raw_data_count)
comp2=pd.Series(cleaned_data_count)
comp3=pd.DataFrame(comp1,columns=["Original_Data"])
comp4=pd.DataFrame(comp2,columns=["Cleaned_Data"])
comp4["Original_Data"]=comp3
comp4
 | Cleaned_Data | Original_Data
---|---|---
Rank | 172 | 173 |
Major_code | 172 | 173 |
Major | 172 | 173 |
Total | 172 | 172 |
Men | 172 | 172 |
Women | 172 | 172 |
Major_category | 172 | 173 |
ShareWomen | 172 | 172 |
Sample_size | 172 | 173 |
Employed | 172 | 173 |
Full_time | 172 | 173 |
Part_time | 172 | 173 |
Full_time_year_round | 172 | 173 |
Unemployed | 172 | 173 |
Unemployment_rate | 172 | 173 |
Median | 172 | 173 |
P25th | 172 | 173 |
P75th | 172 | 173 |
College_jobs | 172 | 173 |
Non_college_jobs | 172 | 173 |
Low_wage_jobs | 172 | 173 |
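The four-step Series/DataFrame construction above can be collapsed into a single pd.concat call. A sketch with made-up counts standing in for raw_data_count and cleaned_data_count:

```python
import pandas as pd

# Made-up non-null counts standing in for the real Series.
raw_data_count = pd.Series({"Rank": 173, "Total": 172})
cleaned_data_count = pd.Series({"Rank": 172, "Total": 172})

# Align the two Series side by side in one call;
# keys become the column names.
comparison = pd.concat([raw_data_count, cleaned_data_count],
                       axis=1, keys=["Original_Data", "Cleaned_Data"])
print(comparison)
```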
The row count has dropped from 173 to 172.
Just out of curiosity, I will print each column name along with its value counts to see the frequency of each value.
for m in recent_grads:
    column_name=m
    val=recent_grads[m].value_counts().sort_index(ascending=True)
    print("Data is for: ", column_name)
    print(val)
(Output truncated: value counts for all 21 columns. Most columns, such as Rank, Major_code, Major and Total, have 172 unique values, so nearly every value occurs exactly once. Major_category is the main exception, with counts ranging from 1 for Interdisciplinary to 29 for Engineering. Median, P25th and P75th also repeat, clustering at round salary figures such as 35000 (20 majors) and 40000 (17 majors).)
Generate scatter plots to understand the relationship between Sample_size and Median.
The data is weakly correlated and has two outliers, as shown in the plot.
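A weak relationship suggested by a scatter plot can also be quantified numerically. A sketch using Series.corr on synthetic stand-in data; with the real data the call would be recent_grads["Sample_size"].corr(recent_grads["Median"]):

```python
import pandas as pd

# Synthetic stand-in values; not the real dataset.
df = pd.DataFrame({
    "Sample_size": [36, 7, 3, 16, 289, 50],
    "Median": [110000, 75000, 73000, 70000, 65000, 60000],
})
# Pearson correlation coefficient; values near 0 indicate
# a weak linear relationship.
r = df["Sample_size"].corr(df["Median"])
print(r)
```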
%matplotlib inline
plot1=recent_grads.plot(x="Sample_size",y="Median", kind="scatter",
title="Sample Size vs Median",
)
%matplotlib inline
plot2=recent_grads.plot(x="Sample_size",y="Unemployment_rate",
kind="scatter",
title="Sample Size vs Unemployment rate",
)
%matplotlib inline
plot3=recent_grads.plot(x="Full_time",y="Median", kind="scatter",
title="Full Time vs Median",
)
%matplotlib inline
plot4=recent_grads.plot(x="ShareWomen",y="Unemployment_rate",
kind="scatter",
title="ShareWomen vs Unemployment Rate",
)
%matplotlib inline
plot5=recent_grads.plot(x="Men",y="Median",
kind="scatter",
title="Men vs Median",
)
%matplotlib inline
Plot6=recent_grads.plot(x="Women",y="Median",
kind="scatter",
title="Women vs Median",
)
%matplotlib inline
Plot7=recent_grads.plot(x="Total",y="Median",
kind="scatter",
title="Total vs Median", figsize=(15,5)
)
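Repeating a near-identical cell for each pair of columns invites typos (note the inconsistent titles above); the same plots can be generated in a loop. A sketch on a tiny stand-in frame (with the real data, df would be recent_grads); the Agg backend is forced so it runs headless:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe outside a notebook
import pandas as pd

# Tiny stand-in frame with made-up values.
df = pd.DataFrame({
    "Sample_size": [36, 7, 3], "Median": [110000, 75000, 73000],
    "ShareWomen": [0.12, 0.10, 0.15],
    "Unemployment_rate": [0.018, 0.117, 0.024],
})
pairs = [("Sample_size", "Median"),
         ("ShareWomen", "Unemployment_rate")]
axes = []
for x, y in pairs:
    # one scatter plot per (x, y) pair; titles derived from the names
    axes.append(df.plot(x=x, y=y, kind="scatter", title=f"{x} vs {y}"))
print(len(axes))  # 2
```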
popular_majors=recent_grads["Major"].value_counts(normalize=True)
popular_majors.head(10)
(Output: the 10 listed majors, from FORESTRY to ECONOMICS, each have the same normalized count of 0.005814, i.e. 1/172; every major appears exactly once.)
As the total number of people in a major increases, the salary does not increase much, and the data is clustered near a total of 0. The normalized count of each major is the same across the data, so the Sample_size vs Median and Total vs Median plots lead to the same inference.
Based on the Women vs Median and Men vs Median plots, there is no significant correlation suggesting that female graduates earn more than male graduates.
There is only a very weak correlation between Full_time and Median; the data is widely scattered, so there is no clear link.
cols = ["Sample_size", "Median", "Employed", "Full_time", "ShareWomen",
"Unemployment_rate", "Men", "Women"]
color=["red","green","blue","orange","navy",
"yellow","skyblue","violet","purple"]
fig=plt.figure(figsize=(5,35))
for i in range(len(cols)):
    #one subplot per column; subplot indices are 1-based
    ax=fig.add_subplot(8,1,i+1)
    ax=recent_grads[cols[i]].plot(kind="hist", rot=30, bins=10,
                                  title=cols[i],
                                  color=color[i], linewidth=1,
                                  edgecolor=color[i+1])
#create a scatter matrix plot
from pandas.plotting import scatter_matrix
plot2X2=scatter_matrix(recent_grads[["Sample_size","Median"]],
figsize=(10,10))
plot3x3=scatter_matrix(recent_grads[["Sample_size","Median",
"Unemployment_rate"]],
figsize=(10,10))
from pandas.plotting import scatter_matrix
plot4x4=scatter_matrix(recent_grads[["Women","Men","Full_time","Median"]],
figsize=(10,10))
ax=recent_grads.head(10).plot.bar(x="Major", y="ShareWomen")
The x-axis labels are hard to read, so in this situation a horizontal bar plot is better. Let's see it.
recent_grads[0:10].plot.barh(x="Major",y="ShareWomen")
ax=recent_grads.tail(10).plot.barh(x="Major",y="ShareWomen",legend=False)
Now I can read the chart better.
Based on the top 10 rows, the highest share of women graduates is in the Astronomy and Astrophysics major.
Based on the last 10 rows, the picture is very different: the share of women is much higher in these lower-ranked majors.
I get different results looking at the first 10 rows versus the last 10 rows, but what I really want to see is which majors are popular among women overall, not just in a slice. A better option is to compute the average ShareWomen per major and then plot it.
unique_majors=recent_grads["Major"]
avg_sharewomen={}
for i in unique_majors:
    sum_sharewomen=recent_grads.loc[recent_grads["Major"]==i,
                                    "ShareWomen"].sum()
    len_sharewomen=len(recent_grads.loc[recent_grads["Major"]==i,
                                        "ShareWomen"])
    avg_sharewomen[i]=sum_sharewomen/len_sharewomen
#convert to series
ShareWomen_series=pd.Series(avg_sharewomen)
#convert to dataframe
avg_sharewomen_df=pd.DataFrame(ShareWomen_series,
columns=["Avg_ShareWomen"])
print("Unsorted-Average list Major Vs ShareWomen",
avg_sharewomen_df.head(),"\n")
sorted_avg_sharewomen_df=avg_sharewomen_df.sort_values(by="Avg_ShareWomen", ascending=False)
print(sorted_avg_sharewomen_df.head())
(Output. Unsorted head: ACCOUNTING 0.524153, ACTUARIAL SCIENCE 0.441356, ADVERTISING AND PUBLIC RELATIONS 0.758060, AEROSPACE ENGINEERING 0.139793, AGRICULTURAL ECONOMICS 0.282903. Sorted head: EARLY CHILDHOOD EDUCATION 0.968954, COMMUNICATION DISORDERS SCIENCES AND SERVICES 0.967998, MEDICAL ASSISTING SERVICES 0.927807, ELEMENTARY EDUCATION 0.923745, FAMILY AND CONSUMER SCIENCES 0.910933.)
The results have become interesting now!
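A note on the loop above: since each Major appears exactly once, the per-major "average" is just the ShareWomen value itself, so sorting it amounts to sorting the raw column. Averaging becomes meaningful at the Major_category level, where groupby replaces the whole dictionary loop. A sketch on made-up data:

```python
import pandas as pd

# Made-up values; with the real data this would be
# recent_grads.groupby("Major_category")["ShareWomen"].mean()
df = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Education"],
    "ShareWomen": [0.12, 0.34, 0.92],
})
# Mean ShareWomen per category, highest first.
avg = (df.groupby("Major_category")["ShareWomen"]
         .mean()
         .sort_values(ascending=False))
print(avg)
```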
avg_sharewomen_df[:10].plot.barh(y="Avg_ShareWomen",
title="First 10 Rows-Avg ShareWomen")
sorted_avg_sharewomen_df[:10].plot.barh(y="Avg_ShareWomen",legend=False,
title="Sorted_Avg ShareWomen")
ax1=recent_grads[0:10].plot.barh(y="Unemployment_rate",x="Major",
legend=False)
ax2=recent_grads.tail(10).plot.barh(y="Unemployment_rate",x="Major",
                                    legend=False)
Based on the above plots, among the first 10 rows unemployment is highest for the Nuclear Engineering major.
Among the bottom 10 rows, it is highest for the Clinical Psychology major.
unique_major_categ=recent_grads["Major_category"].unique()
boys_major_categ={}
girls_major_categ={}
for i in unique_major_categ:
    boys_val=recent_grads.loc[recent_grads["Major_category"]==i,"Men"].sum()
    girls_val=recent_grads.loc[recent_grads["Major_category"]==i,"Women"].sum()
    boys_major_categ[i]=boys_val
    girls_major_categ[i]=girls_val
#convert to series
series_boys=pd.Series(boys_major_categ)
series_girls=pd.Series(girls_major_categ)
#convert to dataframe
df=pd.DataFrame(series_boys,columns=["Men"])
df["Women"]=series_girls
df.head()
 | Men | Women
---|---|---
Agriculture & Natural Resources | 40357.0 | 35263.0 |
Arts | 134390.0 | 222740.0 |
Biology & Life Science | 184919.0 | 268943.0 |
Business | 667852.0 | 634524.0 |
Communications & Journalism | 131921.0 | 260680.0 |
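The two dictionary loops above can be condensed into a single groupby(...).sum(). A sketch on a toy frame; with the real data this would be recent_grads.groupby("Major_category")[["Men", "Women"]].sum():

```python
import pandas as pd

# Toy frame with made-up counts.
df = pd.DataFrame({
    "Major_category": ["Arts", "Arts", "Business"],
    "Men": [100.0, 50.0, 200.0],
    "Women": [300.0, 80.0, 150.0],
})
# Total men and women per category in one call.
totals = df.groupby("Major_category")[["Men", "Women"]].sum()
print(totals)
```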
df.plot.barh(title="Number of Men and Women in each Major Category")
The largest gap between the number of men and women is in the 'Psychology & Social Work' category, where women far outnumber men. Its majors also show some of the higher unemployment rates in the dataset.
It will be handy to compare the average numbers of men and women in each major category to see which group favors which field.
from pandas import DataFrame
men_mean = {}
women_mean = {}
for i in recent_grads["Major_category"].unique():
    sum_men = recent_grads.loc[recent_grads["Major_category"] == i,
                               "Men"].sum()
    len_men = len(recent_grads.loc[recent_grads["Major_category"] == i,
                                   "Men"])
    men_mean[i] = sum_men/len_men
    sum_women = recent_grads.loc[recent_grads["Major_category"] == i,
                                 "Women"].sum()
    len_women = len(recent_grads.loc[recent_grads["Major_category"] == i,
                                     "Women"])
    women_mean[i] = sum_women/len_women
men_mean_df = DataFrame(list(men_mean.items()),columns =
['Major_category','Men_Mean'])
women_mean_df = DataFrame(list(women_mean.items()),columns =
['Major_category','Women_Mean'])
%matplotlib inline
men_mean_df.plot.barh(x="Major_category",y="Men_Mean")
women_mean_df.plot.barh(x="Major_category",y="Women_Mean")
In most major categories, the average number of women is higher than the average number of men.
The category with the highest average number of women is Communications & Journalism.
The category with the highest average number of men is Business.
Box plot:
Box plots are used here to explore the distributions of median salaries and the unemployment rate.
# import matplotlib.pyplot as plt
# %matplotlib inline
# cols=["Median"]
# fig,ax=plt.subplots()
# ax.boxplot(recent_grads[cols].values)
# ax.set_xticklabels(cols)
# plt.show()
#with Pandas
recent_grads.boxplot(column=["Median"])
There are many high outliers among the salary values. Even with those in the data, most majors have a median salary close to the overall median, with the middle 50% falling roughly between 33,000 and 45,000.
recent_grads.boxplot(column=["Unemployment_rate"])
75% of majors have an unemployment rate below roughly 9%.
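The box-plot reading can be verified numerically with Series.quantile. A sketch on made-up rates; the real check would be recent_grads["Unemployment_rate"].quantile(0.75):

```python
import pandas as pd

# Made-up unemployment rates, not the real column.
rates = pd.Series([0.00, 0.02, 0.05, 0.07, 0.09, 0.12, 0.18])
# 75th percentile (the top of the box in a box plot).
q75 = rates.quantile(0.75)
print(q75)  # 0.105
```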
Hexbin plots are a denser alternative to scatter plots: each hexagonal bin is shaded by how many points fall inside it.
recent_grads.plot.hexbin(x="ShareWomen", y="Median", gridsize=10
,sharex=False)
The densest bins correspond to a median salary of roughly 30,000 to 40,000, spread across a wide range of ShareWomen values.
recent_grads.plot.hexbin(x="Unemployment_rate", y="Median", gridsize=30
,sharex=False)
The densest bins correspond to an unemployment rate of roughly 6% at a median salary near 35,000.
Visualization definitely helps us understand the data better.