The goal of this project is to analyze the job outcomes of students who graduated from college between 2010 and 2012. Using visualizations, we can start to explore questions about earnings, employment, and gender diversity across majors.
The data file is: recent-grads.csv
Each row in the dataset represents a different major in college and contains information on gender diversity, employment rates, median salaries, and more. Here are some of the columns in the dataset:
Rank - Rank by median earnings (the dataset is ordered by this column).
Major_code - Major code.
Major - Major description.
Major_category - Category of major.
Total - Total number of people with major.
Sample_size - Sample size (unweighted) of full-time workers.
Men - Male graduates.
Women - Female graduates.
ShareWomen - Women as share of total.
Employed - Number employed.
Median - Median salary of full-time, year-round workers.
Low_wage_jobs - Number in low-wage service jobs.
Full_time - Number employed 35 hours or more.
Import needed libraries
import pandas as pd
import matplotlib.pyplot as plt
Read the dataset
recent_grads=pd.read_csv("recent-grads.csv")
#print first row of the data
print(recent_grads.iloc[0])
#first 5 rows of the data
print(recent_grads.head())
#last five rows of the data
print(recent_grads.tail())
(Output truncated: iloc[0] shows the top-ranked major, PETROLEUM ENGINEERING, with Total 2339, Median 110000 and Unemployment_rate 0.0184. head() lists the five top-ranked majors, all in the Engineering category; tail() lists the five lowest-ranked majors, ending with LIBRARY SCIENCE at a Median of 22000. Each preview is 5 rows x 21 columns.)
#Generate statistical summary of the data
recent_grads.describe()
 | Rank | Major_code | Total | Men | Women | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 173.000000 | 173.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
mean | 87.000000 | 3879.815029 | 39370.081395 | 16723.406977 | 22646.674419 | 0.522223 | 356.080925 | 31192.763006 | 26029.306358 | 8832.398844 | 19694.427746 | 2416.329480 | 0.068191 | 40151.445087 | 29501.445087 | 51494.219653 | 12322.635838 | 13284.497110 | 3859.017341 |
std | 50.084928 | 1687.753140 | 63483.491009 | 28122.433474 | 41057.330740 | 0.231205 | 618.361022 | 50675.002241 | 42869.655092 | 14648.179473 | 33160.941514 | 4112.803148 | 0.030331 | 11470.181802 | 9166.005235 | 14906.279740 | 21299.868863 | 23789.655363 | 6944.998579 |
min | 1.000000 | 1100.000000 | 124.000000 | 119.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 111.000000 | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.000000 | 2403.000000 | 4549.750000 | 2177.500000 | 1778.250000 | 0.336026 | 39.000000 | 3608.000000 | 3154.000000 | 1030.000000 | 2453.000000 | 304.000000 | 0.050306 | 33000.000000 | 24000.000000 | 42000.000000 | 1675.000000 | 1591.000000 | 340.000000 |
50% | 87.000000 | 3608.000000 | 15104.000000 | 5434.000000 | 8386.500000 | 0.534024 | 130.000000 | 11797.000000 | 10048.000000 | 3299.000000 | 7413.000000 | 893.000000 | 0.067961 | 36000.000000 | 27000.000000 | 47000.000000 | 4390.000000 | 4595.000000 | 1231.000000 |
75% | 130.000000 | 5503.000000 | 38909.750000 | 14631.000000 | 22553.750000 | 0.703299 | 338.000000 | 31433.000000 | 25147.000000 | 9948.000000 | 16891.000000 | 2393.000000 | 0.087557 | 45000.000000 | 33000.000000 | 60000.000000 | 14444.000000 | 11783.000000 | 3466.000000 |
max | 173.000000 | 6403.000000 | 393735.000000 | 173809.000000 | 307087.000000 | 0.968954 | 4212.000000 | 307933.000000 | 251540.000000 | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
The count row shows that the columns do not all have the same number of values, which means some rows contain missing data. I will verify this using the info() method.
#info() prints the summary directly and returns None
recent_grads.info()
(Output: DataFrame with a RangeIndex of 173 entries and 21 columns. Total, Men, Women and ShareWomen each have 172 non-null float64 values; every other column has 173 non-null values. dtypes: float64(5), int64(14), object(2).)
Check the count of non-null values in the dataset
raw_data_count=recent_grads.notnull().sum()
print(raw_data_count)
(Output: 173 non-null values in every column except Total, Men, Women and ShareWomen, which each have 172.)
Based on the above results, four columns (Total, Men, Women and ShareWomen) each contain a null value. I will remove the rows with nulls and update the data.
recent_grads=recent_grads.dropna()
cleaned_data_count=recent_grads.notnull().sum()
print(cleaned_data_count)
(Output: 172 non-null values in every column.)
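A note on dropna(): as called above it drops any row containing a null in any column. Since only four columns are affected, the subset argument can restrict the check to just those columns, which is safer if other columns might legitimately hold nulls later. A minimal sketch on a toy frame (column names mirror the dataset, values are made up):

```python
import pandas as pd

# Toy frame mimicking the dataset's null pattern:
# one row is missing Total and Men, as in the real file.
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, None, 300.0],
    "Men": [40.0, None, 120.0],
    "Median": [35000, 40000, 30000],
})

# Drop a row only when one of the listed columns is null.
cleaned = toy.dropna(subset=["Total", "Men"])
print(len(toy), len(cleaned))  # 3 2
```

On the real data this gives the same 172-row result, since the nulls all sit in the same row.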
I will compare cleaned_data_count and raw_data_count to check for any changes in the data after deleting the null rows.
comp1=pd.Series(raw_data_count)
comp2=pd.Series(cleaned_data_count)
comp3=pd.DataFrame(comp1,columns=["Original_Data"])
comp4=pd.DataFrame(comp2,columns=["Cleaned_Data"])
comp4["Original_Data"]=comp3
comp4
 | Cleaned_Data | Original_Data
---|---|---
Rank | 172 | 173 |
Major_code | 172 | 173 |
Major | 172 | 173 |
Total | 172 | 172 |
Men | 172 | 172 |
Women | 172 | 172 |
Major_category | 172 | 173 |
ShareWomen | 172 | 172 |
Sample_size | 172 | 173 |
Employed | 172 | 173 |
Full_time | 172 | 173 |
Part_time | 172 | 173 |
Full_time_year_round | 172 | 173 |
Unemployed | 172 | 173 |
Unemployment_rate | 172 | 173 |
Median | 172 | 173 |
P25th | 172 | 173 |
P75th | 172 | 173 |
College_jobs | 172 | 173 |
Non_college_jobs | 172 | 173 |
Low_wage_jobs | 172 | 173 |
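The four-step Series/DataFrame construction above can be collapsed into a single pd.concat call. A sketch with made-up counts standing in for raw_data_count and cleaned_data_count:

```python
import pandas as pd

# Made-up non-null counts standing in for the real Series.
raw_data_count = pd.Series({"Rank": 173, "Total": 172})
cleaned_data_count = pd.Series({"Rank": 172, "Total": 172})

# Align the two Series side by side in one call;
# keys become the column names.
comparison = pd.concat([raw_data_count, cleaned_data_count],
                       axis=1, keys=["Original_Data", "Cleaned_Data"])
print(comparison)
```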
The row count has dropped from 173 to 172.
Just out of curiosity, I will print each column name along with its value counts to see the frequency of each value.
for m in recent_grads:
    column_name=m
    val=recent_grads[m].value_counts().sort_index(ascending=True)
    print("Data is for: ", column_name)
    print(val)
(Output truncated: value counts for all 21 columns. Most columns, such as Rank, Major_code, Major and Total, have 172 unique values, so nearly every value occurs exactly once. Major_category is the main exception, with counts ranging from 1 for Interdisciplinary to 29 for Engineering. Median, P25th and P75th also repeat, clustering at round salary figures such as 35000 (20 majors) and 40000 (17 majors).)
Generate scatter plots to understand the relationship between Sample_size and Median.
The data is weakly correlated and has two outliers, as shown in the plot.
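A weak relationship suggested by a scatter plot can also be quantified numerically. A sketch using Series.corr on synthetic stand-in data; with the real data the call would be recent_grads["Sample_size"].corr(recent_grads["Median"]):

```python
import pandas as pd

# Synthetic stand-in values; not the real dataset.
df = pd.DataFrame({
    "Sample_size": [36, 7, 3, 16, 289, 50],
    "Median": [110000, 75000, 73000, 70000, 65000, 60000],
})
# Pearson correlation coefficient; values near 0 indicate
# a weak linear relationship.
r = df["Sample_size"].corr(df["Median"])
print(r)
```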
%matplotlib inline
plot1=recent_grads.plot(x="Sample_size",y="Median", kind="scatter",
title="Sample Size vs Median",
)
%matplotlib inline
plot2=recent_grads.plot(x="Sample_size",y="Unemployment_rate",
kind="scatter",
title="Sample Size vs Unemployment rate",
)
%matplotlib inline
plot3=recent_grads.plot(x="Full_time",y="Median", kind="scatter",
title="Full Time vs Median",
)
%matplotlib inline
plot4=recent_grads.plot(x="ShareWomen",y="Unemployment_rate",
kind="scatter",
title="ShareWomen vs Unemployment Rate",
)
%matplotlib inline
plot5=recent_grads.plot(x="Men",y="Median",
kind="scatter",
title="Men vs Median",
)
%matplotlib inline
Plot6=recent_grads.plot(x="Women",y="Median",
kind="scatter",
title="Women vs Median",
)
%matplotlib inline
Plot7=recent_grads.plot(x="Total",y="Median",
kind="scatter",
title="Total vs Median", figsize=(15,5)
)
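Repeating a near-identical cell for each pair of columns invites typos (note the inconsistent titles above); the same plots can be generated in a loop. A sketch on a tiny stand-in frame (with the real data, df would be recent_grads); the Agg backend is forced so it runs headless:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe outside a notebook
import pandas as pd

# Tiny stand-in frame with made-up values.
df = pd.DataFrame({
    "Sample_size": [36, 7, 3], "Median": [110000, 75000, 73000],
    "ShareWomen": [0.12, 0.10, 0.15],
    "Unemployment_rate": [0.018, 0.117, 0.024],
})
pairs = [("Sample_size", "Median"),
         ("ShareWomen", "Unemployment_rate")]
axes = []
for x, y in pairs:
    # one scatter plot per (x, y) pair; titles derived from the names
    axes.append(df.plot(x=x, y=y, kind="scatter", title=f"{x} vs {y}"))
print(len(axes))  # 2
```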
popular_majors=recent_grads["Major"].value_counts(normalize=True)
popular_majors.head(10)
(Output: the 10 listed majors, from FORESTRY to ECONOMICS, each have the same normalized count of 0.005814, i.e. 1/172; every major appears exactly once.)
As the total number of people in a major increases, the salary does not increase much, and the data is clustered near a total of 0. The normalized count of each major is the same across the data, so the Sample_size vs Median and Total vs Median plots lead to the same inference.
Based on the Women vs Median and Men vs Median plots, there is no significant correlation suggesting that female graduates earn more than male graduates.
There is only a very weak correlation between Full_time and Median; the data is widely scattered, so there is no clear link.
cols = ["Sample_size", "Median", "Employed", "Full_time", "ShareWomen",
"Unemployment_rate", "Men", "Women"]
color=["red","green","blue","orange","navy",
"yellow","skyblue","violet","purple"]
fig=plt.figure(figsize=(5,35))
for i in range(len(cols)):
    #one subplot per column; subplot indices are 1-based
    ax=fig.add_subplot(8,1,i+1)
    ax=recent_grads[cols[i]].plot(kind="hist", rot=30, bins=10,
                                  title=cols[i],
                                  color=color[i], linewidth=1,
                                  edgecolor=color[i+1])
#create a scatter matrix plot
from pandas.plotting import scatter_matrix
plot2X2=scatter_matrix(recent_grads[["Sample_size","Median"]],
figsize=(10,10))
plot3x3=scatter_matrix(recent_grads[["Sample_size","Median",
"Unemployment_rate"]],
figsize=(10,10))
from pandas.plotting import scatter_matrix
plot4x4=scatter_matrix(recent_grads[["Women","Men","Full_time","Median"]],
figsize=(10,10))
ax=recent_grads.head(10).plot.bar(x="Major", y="ShareWomen")
The x-axis labels are hard to read, so in this situation a horizontal bar plot is better. Let's see it.
recent_grads[0:10].plot.barh(x="Major",y="ShareWomen")
ax=recent_grads.tail(10).plot.barh(x="Major",y="ShareWomen",legend=False)
Now I can read the chart better.
Based on the top 10 rows, the highest share of women graduates is in the Astronomy and Astrophysics major.
Based on the last 10 rows, the picture is very different: the share of women is much higher in these lower-ranked majors.
I get different results looking at the first 10 rows versus the last 10 rows, but what I really want to see is which majors are popular among women overall, not just in a slice. A better option is to compute the average ShareWomen per major and then plot it.
unique_majors=recent_grads["Major"]
avg_sharewomen={}
for i in unique_majors:
    sum_sharewomen=recent_grads.loc[recent_grads["Major"]==i,
                                    "ShareWomen"].sum()
    len_sharewomen=len(recent_grads.loc[recent_grads["Major"]==i,
                                        "ShareWomen"])
    avg_sharewomen[i]=sum_sharewomen/len_sharewomen
#convert to series
ShareWomen_series=pd.Series(avg_sharewomen)
#convert to dataframe
avg_sharewomen_df=pd.DataFrame(ShareWomen_series,
columns=["Avg_ShareWomen"])
print("Unsorted-Average list Major Vs ShareWomen",
avg_sharewomen_df.head(),"\n")
sorted_avg_sharewomen_df=avg_sharewomen_df.sort_values(by="Avg_ShareWomen", ascending=False)
print(sorted_avg_sharewomen_df.head())
(Output. Unsorted head: ACCOUNTING 0.524153, ACTUARIAL SCIENCE 0.441356, ADVERTISING AND PUBLIC RELATIONS 0.758060, AEROSPACE ENGINEERING 0.139793, AGRICULTURAL ECONOMICS 0.282903. Sorted head: EARLY CHILDHOOD EDUCATION 0.968954, COMMUNICATION DISORDERS SCIENCES AND SERVICES 0.967998, MEDICAL ASSISTING SERVICES 0.927807, ELEMENTARY EDUCATION 0.923745, FAMILY AND CONSUMER SCIENCES 0.910933.)
The results have become interesting now!
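A note on the loop above: since each Major appears exactly once, the per-major "average" is just the ShareWomen value itself, so sorting it amounts to sorting the raw column. Averaging becomes meaningful at the Major_category level, where groupby replaces the whole dictionary loop. A sketch on made-up data:

```python
import pandas as pd

# Made-up values; with the real data this would be
# recent_grads.groupby("Major_category")["ShareWomen"].mean()
df = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Education"],
    "ShareWomen": [0.12, 0.34, 0.92],
})
# Mean ShareWomen per category, highest first.
avg = (df.groupby("Major_category")["ShareWomen"]
         .mean()
         .sort_values(ascending=False))
print(avg)
```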
avg_sharewomen_df[:10].plot.barh(y="Avg_ShareWomen",
title="First 10 Rows-Avg ShareWomen")
sorted_avg_sharewomen_df[:10].plot.barh(y="Avg_ShareWomen",legend=False,
title="Sorted_Avg ShareWomen")
ax1=recent_grads[0:10].plot.barh(y="Unemployment_rate",x="Major",
legend=False)
ax2=recent_grads.tail(10).plot.barh(y="Unemployment_rate",x="Major",
                                    legend=False)
Based on the above plots, among the first 10 rows unemployment is highest for the Nuclear Engineering major.
Among the bottom 10 rows, it is highest for the Clinical Psychology major.
unique_major_categ=recent_grads["Major_category"].unique()
boys_major_categ={}
girls_major_categ={}
for i in unique_major_categ:
    boys_val=recent_grads.loc[recent_grads["Major_category"]==i,"Men"].sum()
    girls_val=recent_grads.loc[recent_grads["Major_category"]==i,"Women"].sum()
    boys_major_categ[i]=boys_val
    girls_major_categ[i]=girls_val
#convert to series
series_boys=pd.Series(boys_major_categ)
series_girls=pd.Series(girls_major_categ)
#convert to dataframe
df=pd.DataFrame(series_boys,columns=["Men"])
df["Women"]=series_girls
df.head()
 | Men | Women
---|---|---
Agriculture & Natural Resources | 40357.0 | 35263.0 |
Arts | 134390.0 | 222740.0 |
Biology & Life Science | 184919.0 | 268943.0 |
Business | 667852.0 | 634524.0 |
Communications & Journalism | 131921.0 | 260680.0 |
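The two dictionary loops above can be condensed into a single groupby(...).sum(). A sketch on a toy frame; with the real data this would be recent_grads.groupby("Major_category")[["Men", "Women"]].sum():

```python
import pandas as pd

# Toy frame with made-up counts.
df = pd.DataFrame({
    "Major_category": ["Arts", "Arts", "Business"],
    "Men": [100.0, 50.0, 200.0],
    "Women": [300.0, 80.0, 150.0],
})
# Total men and women per category in one call.
totals = df.groupby("Major_category")[["Men", "Women"]].sum()
print(totals)
```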
df.plot.barh(title="Number of Men and Women in each Major Category")
The largest gap between the number of men and women is in the 'Psychology & Social Work' category, where women far outnumber men. Its majors also show some of the higher unemployment rates in the dataset.
It will be handy to compare the average numbers of men and women in each major category to see which group favors which field.
from pandas import DataFrame
men_mean = {}
women_mean = {}
for i in recent_grads["Major_category"].unique():
    sum_men = recent_grads.loc[recent_grads["Major_category"] == i,
                               "Men"].sum()
    len_men = len(recent_grads.loc[recent_grads["Major_category"] == i,
                                   "Men"])
    men_mean[i] = sum_men/len_men
    sum_women = recent_grads.loc[recent_grads["Major_category"] == i,
                                 "Women"].sum()
    len_women = len(recent_grads.loc[recent_grads["Major_category"] == i,
                                     "Women"])
    women_mean[i] = sum_women/len_women
men_mean_df = DataFrame(list(men_mean.items()),columns =
['Major_category','Men_Mean'])
women_mean_df = DataFrame(list(women_mean.items()),columns =
['Major_category','Women_Mean'])
%matplotlib inline
men_mean_df.plot.barh(x="Major_category",y="Men_Mean")
women_mean_df.plot.barh(x="Major_category",y="Women_Mean")
In most major categories, the average number of women is higher than the average number of men.
The category with the highest average number of women is Communications & Journalism.
The category with the highest average number of men is Business.
Box plot:
Box plots are used here to explore the distributions of median salaries and the unemployment rate.
# import matplotlib.pyplot as plt
# %matplotlib inline
# cols=["Median"]
# fig,ax=plt.subplots()
# ax.boxplot(recent_grads[cols].values)
# ax.set_xticklabels(cols)
# plt.show()
#with Pandas
recent_grads.boxplot(column=["Median"])
There are many high outliers among the salary values. Even with those in the data, most majors have a median salary close to the overall median, with the middle 50% falling roughly between 33,000 and 45,000.
recent_grads.boxplot(column=["Unemployment_rate"])
75% of majors have an unemployment rate below roughly 9%.
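The box-plot reading can be verified numerically with Series.quantile. A sketch on made-up rates; the real check would be recent_grads["Unemployment_rate"].quantile(0.75):

```python
import pandas as pd

# Made-up unemployment rates, not the real column.
rates = pd.Series([0.00, 0.02, 0.05, 0.07, 0.09, 0.12, 0.18])
# 75th percentile (the top of the box in a box plot).
q75 = rates.quantile(0.75)
print(q75)  # 0.105
```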
Hexbin plots are a denser alternative to scatter plots: each hexagonal bin is shaded by how many points fall inside it.
recent_grads.plot.hexbin(x="ShareWomen", y="Median", gridsize=10
,sharex=False)
The densest bins correspond to a median salary of roughly 30,000 to 40,000, spread across a wide range of ShareWomen values.
recent_grads.plot.hexbin(x="Unemployment_rate", y="Median", gridsize=30
,sharex=False)
The densest bins correspond to an unemployment rate of roughly 6% at a median salary near 35,000.
Visualization definitely helps us understand the data better.