Visualizing Earnings Based On College Majors¶

A sample project that visualizes earnings data for college majors.

The data comes from the American Community Survey. You can download the cleaned dataset from this GitHub repo.

In [1]:

import pandas as pd
import matplotlib.pyplot as plt

# Render the plots inside the notebook
%matplotlib inline

recent_grads = pd.read_csv('recent-grads.csv')
recent_grads.iloc[0]

Out[1]:

Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object

The first row holds the data for Petroleum Engineering. It is a very well paid major (median of 110,000 USD) but has a small share of women (about 12%). We need to explore more of the data to learn more about these majors.
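As a quick sanity check on what ShareWomen means, here is a minimal sketch that recomputes it for the first three rows shown in this notebook. It assumes (the dataset description is not quoted here) that ShareWomen is simply Women divided by Total; the `sample` frame and the `share_recomputed` column are names made up for this illustration.

```python
import pandas as pd

# First three rows, copied by hand from the notebook output.
sample = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING",
              "MINING AND MINERAL ENGINEERING",
              "METALLURGICAL ENGINEERING"],
    "Total": [2339, 756, 856],
    "Women": [282, 77, 131],
    "ShareWomen": [0.120564, 0.101852, 0.153037],
})

# Assumption: ShareWomen = Women / Total.
sample["share_recomputed"] = sample["Women"] / sample["Total"]

# Largest gap between the published and the recomputed share.
max_diff = (sample["share_recomputed"] - sample["ShareWomen"]).abs().max()
print(sample[["Major", "ShareWomen", "share_recomputed"]])
print("largest difference:", max_diff)
```

For these three rows the recomputed values agree with the published column up to rounding, which supports the assumption.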

Now we can see the first five and the last five majors in the list.

In [2]:

recent_grads.head()

Out[2]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
0	1	2419	PETROLEUM ENGINEERING	2339.0	2057.0	282.0	Engineering	0.120564	36	1976	...	270	1207	37	0.018381	110000	95000	125000	1534	364	193
1	2	2416	MINING AND MINERAL ENGINEERING	756.0	679.0	77.0	Engineering	0.101852	7	640	...	170	388	85	0.117241	75000	55000	90000	350	257	50
2	3	2415	METALLURGICAL ENGINEERING	856.0	725.0	131.0	Engineering	0.153037	3	648	...	133	340	16	0.024096	73000	50000	105000	456	176	0
3	4	2417	NAVAL ARCHITECTURE AND MARINE ENGINEERING	1258.0	1123.0	135.0	Engineering	0.107313	16	758	...	150	692	40	0.050125	70000	43000	80000	529	102	0
4	5	2405	CHEMICAL ENGINEERING	32260.0	21239.0	11021.0	Engineering	0.341631	289	25694	...	5180	16697	1672	0.061098	65000	50000	75000	18314	4440	972

5 rows × 21 columns

In [3]:

recent_grads.tail()

Out[3]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
168	169	3609	ZOOLOGY	8409.0	3050.0	5359.0	Biology & Life Science	0.637293	47	6259	...	2190	3602	304	0.046320	26000	20000	39000	2771	2947	743
169	170	5201	EDUCATIONAL PSYCHOLOGY	2854.0	522.0	2332.0	Psychology & Social Work	0.817099	7	2125	...	572	1211	148	0.065112	25000	24000	34000	1488	615	82
170	171	5202	CLINICAL PSYCHOLOGY	2838.0	568.0	2270.0	Psychology & Social Work	0.799859	13	2101	...	648	1293	368	0.149048	25000	25000	40000	986	870	622
171	172	5203	COUNSELING PSYCHOLOGY	4626.0	931.0	3695.0	Psychology & Social Work	0.798746	21	3777	...	965	2738	214	0.053621	23400	19200	26000	2403	1245	308
172	173	3501	LIBRARY SCIENCE	1098.0	134.0	964.0	Education	0.877960	2	742	...	237	410	87	0.104946	22000	20000	22000	288	338	192

5 rows × 21 columns

Description of the data¶

Now we need to check which data types we have and whether there are null values to remove.

In [4]:

recent_grads.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Rank                  173 non-null    int64  
 1   Major_code            173 non-null    int64  
 2   Major                 173 non-null    object 
 3   Total                 172 non-null    float64
 4   Men                   172 non-null    float64
 5   Women                 172 non-null    float64
 6   Major_category        173 non-null    object 
 7   ShareWomen            172 non-null    float64
 8   Sample_size           173 non-null    int64  
 9   Employed              173 non-null    int64  
 10  Full_time             173 non-null    int64  
 11  Part_time             173 non-null    int64  
 12  Full_time_year_round  173 non-null    int64  
 13  Unemployed            173 non-null    int64  
 14  Unemployment_rate     173 non-null    float64
 15  Median                173 non-null    int64  
 16  P25th                 173 non-null    int64  
 17  P75th                 173 non-null    int64  
 18  College_jobs          173 non-null    int64  
 19  Non_college_jobs      173 non-null    int64  
 20  Low_wage_jobs         173 non-null    int64  
dtypes: float64(5), int64(14), object(2)
memory usage: 28.5+ KB

In [5]:

recent_grads.describe()

Out[5]:

	Rank	Major_code	Total	Men	Women	ShareWomen	Sample_size	Employed	Full_time	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
count	173.000000	173.000000	172.000000	172.000000	172.000000	172.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000
mean	87.000000	3879.815029	39370.081395	16723.406977	22646.674419	0.522223	356.080925	31192.763006	26029.306358	8832.398844	19694.427746	2416.329480	0.068191	40151.445087	29501.445087	51494.219653	12322.635838	13284.497110	3859.017341
std	50.084928	1687.753140	63483.491009	28122.433474	41057.330740	0.231205	618.361022	50675.002241	42869.655092	14648.179473	33160.941514	4112.803148	0.030331	11470.181802	9166.005235	14906.279740	21299.868863	23789.655363	6944.998579
min	1.000000	1100.000000	124.000000	119.000000	0.000000	0.000000	2.000000	0.000000	111.000000	0.000000	111.000000	0.000000	0.000000	22000.000000	18500.000000	22000.000000	0.000000	0.000000	0.000000
25%	44.000000	2403.000000	4549.750000	2177.500000	1778.250000	0.336026	39.000000	3608.000000	3154.000000	1030.000000	2453.000000	304.000000	0.050306	33000.000000	24000.000000	42000.000000	1675.000000	1591.000000	340.000000
50%	87.000000	3608.000000	15104.000000	5434.000000	8386.500000	0.534024	130.000000	11797.000000	10048.000000	3299.000000	7413.000000	893.000000	0.067961	36000.000000	27000.000000	47000.000000	4390.000000	4595.000000	1231.000000
75%	130.000000	5503.000000	38909.750000	14631.000000	22553.750000	0.703299	338.000000	31433.000000	25147.000000	9948.000000	16891.000000	2393.000000	0.087557	45000.000000	33000.000000	60000.000000	14444.000000	11783.000000	3466.000000
max	173.000000	6403.000000	393735.000000	173809.000000	307087.000000	0.968954	4212.000000	307933.000000	251540.000000	115172.000000	199897.000000	28169.000000	0.177226	110000.000000	95000.000000	125000.000000	151643.000000	148395.000000	48207.000000

As the results show, the columns Total, Men, Women and ShareWomen each have 172 non-null values out of 173, so they contain null values. We are going to drop the affected row to clean the dataset.

In [6]:

raw_data_count = recent_grads.shape
raw_data_count

Out[6]:

(173, 21)

In [7]:

# dropna() removes the rows that contain null values
recent_grads = recent_grads.dropna()
cleaned_data_count = recent_grads.shape
cleaned_data_count

Out[7]:

(172, 21)
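Before dropping anything it can help to see exactly where the nulls are. The sketch below shows one way to do that on a toy frame; the values are made up for illustration and `toy`, `nulls_per_column` and `rows_with_nulls` are hypothetical names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one incomplete row, mimicking the case above.
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, np.nan, 300.0],
    "Men": [60.0, np.nan, 120.0],
    "Women": [40.0, np.nan, 180.0],
})

# Number of missing values in each column.
nulls_per_column = toy.isna().sum()

# The rows that dropna() would remove (any null in the row).
rows_with_nulls = toy[toy.isna().any(axis=1)]

cleaned = toy.dropna()

print(nulls_per_column)
print(rows_with_nulls)
print(len(toy) - len(cleaned), "row(s) dropped")
```

The same two expressions applied to recent_grads before the dropna() call would show which major has the missing values.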

Comparing the data¶

In this part we are going to use scatter plots to explore our data. The goal (my personal goal, at least; I don't know whether this is the canonical way to use these plots) is to see what data we have and how the columns relate to each other.

First I'm going to use the pandas .plot method to look at the data.

I'm going to plot Sample_size against Employed. This helps to see the correlation between the two columns, which matters because we want to verify that the sample size tracks the number of people in each major.

In [8]:

# With the plot() method we need to specify the kind of plot; it also makes it easy to set the title.
recent_grads.plot(x='Sample_size', y='Employed', kind='scatter', title='Employed vs. Sample_size', figsize=(5,10))

Out[8]:

<AxesSubplot:title={'center':'Employed vs. Sample_size'}, xlabel='Sample_size', ylabel='Employed'>
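The scatter plot gives a visual impression of the relationship; a correlation coefficient puts a number on it. This is a small sketch on ten rows copied from the head() and tail() outputs above, so the value it prints describes only this hand-picked sample, not the full dataset. The names `sample` and `r` are made up for the example.

```python
import pandas as pd

# Sample_size and Employed for the ten majors shown by head() and tail().
sample = pd.DataFrame({
    "Sample_size": [36, 7, 3, 16, 289, 47, 7, 13, 21, 2],
    "Employed": [1976, 640, 648, 758, 25694, 6259, 2125, 2101, 3777, 742],
})

# Pearson correlation coefficient between the two columns.
r = sample["Sample_size"].corr(sample["Employed"])
print("Pearson r on the ten-row sample:", round(r, 3))
```

On the full data the equivalent call would be recent_grads['Sample_size'].corr(recent_grads['Employed']).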

Most popular majors with best median of salary¶

The first thing we want to identify is the most popular majors and the ones with the best median salary. One advantage of plotting the data is that we can easily see its range and retrieve the information we need:

The best paid majors are Petroleum Engineering, Mining and Mineral Engineering, and Metallurgical Engineering, all three with a median salary above 70k USD.
The most popular majors are Business Management and Psychology, each with more than 300k graduates in total.
The majors with more than 50k graduates and a median salary above 50k USD are Mechanical Engineering, General Engineering, Electrical Engineering and Computer Science.

In [9]:

recent_grads.plot(x='Total', y='Median', kind='scatter', title='Median vs. Total', figsize=(5,10))

Out[9]:

<AxesSubplot:title={'center':'Median vs. Total'}, xlabel='Total', ylabel='Median'>

In [10]:

most_profitable = recent_grads[(recent_grads['Median']>70000)]
most_profitable

Out[10]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
0	1	2419	PETROLEUM ENGINEERING	2339.0	2057.0	282.0	Engineering	0.120564	36	1976	...	270	1207	37	0.018381	110000	95000	125000	1534	364	193
1	2	2416	MINING AND MINERAL ENGINEERING	756.0	679.0	77.0	Engineering	0.101852	7	640	...	170	388	85	0.117241	75000	55000	90000	350	257	50
2	3	2415	METALLURGICAL ENGINEERING	856.0	725.0	131.0	Engineering	0.153037	3	648	...	133	340	16	0.024096	73000	50000	105000	456	176	0

3 rows × 21 columns

In [11]:

most_popular = recent_grads[(recent_grads['Total']>300000)]
most_popular

Out[11]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
76	77	6203	BUSINESS MANAGEMENT AND ADMINISTRATION	329927.0	173809.0	156118.0	Business	0.473190	4212	276234	...	50357	199897	21502	0.072218	38000	29000	50000	36720	148395	32395
145	146	5200	PSYCHOLOGY	393735.0	86648.0	307087.0	Psychology & Social Work	0.779933	2584	307933	...	115172	174438	28169	0.083811	31500	24000	41000	125148	141860	48207

2 rows × 21 columns

In [12]:

most_profitable_popular = recent_grads[(recent_grads['Total']>50000)&(recent_grads['Median']>50000)]
most_profitable_popular

Out[12]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
8	9	2414	MECHANICAL ENGINEERING	91227.0	80320.0	10907.0	Engineering	0.119559	1029	76442	...	13101	54639	4650	0.057342	60000	48000	70000	52844	16384	3253
9	10	2408	ELECTRICAL ENGINEERING	81527.0	65511.0	16016.0	Engineering	0.196450	631	61928	...	12695	41413	3895	0.059174	60000	45000	72000	45829	10874	3170
17	18	2400	GENERAL ENGINEERING	61152.0	45683.0	15469.0	Engineering	0.252960	425	44931	...	7199	33540	2859	0.059824	56000	36000	69000	26898	11734	3192
20	21	2102	COMPUTER SCIENCE	128319.0	99743.0	28576.0	Computers & Mathematics	0.222695	1196	102087	...	18726	70932	6884	0.063173	53000	39000	70000	68622	25667	5144

4 rows × 21 columns
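The boolean masks used in this section can also be written with DataFrame.query, and the "top N" questions can be answered with nlargest instead of picking a threshold by eye. The sketch below uses nine rows copied from the outputs above; `majors`, `popular_and_well_paid` and `top3_by_median` are names introduced only for this example.

```python
import pandas as pd

# Total and Median for nine majors, copied from the outputs in this section.
majors = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
              "METALLURGICAL ENGINEERING", "MECHANICAL ENGINEERING",
              "ELECTRICAL ENGINEERING", "GENERAL ENGINEERING",
              "COMPUTER SCIENCE", "BUSINESS MANAGEMENT AND ADMINISTRATION",
              "PSYCHOLOGY"],
    "Total": [2339, 756, 856, 91227, 81527, 61152, 128319, 329927, 393735],
    "Median": [110000, 75000, 73000, 60000, 60000, 56000, 53000, 38000, 31500],
})

# Same condition as the mask (Total > 50000) & (Median > 50000), as a query string.
popular_and_well_paid = majors.query("Total > 50000 and Median > 50000")

# The three best paid majors, without choosing a salary threshold.
top3_by_median = majors.nlargest(3, "Median")

print(popular_and_well_paid["Major"].tolist())
print(top3_by_median["Major"].tolist())
```

Both forms select the same rows as the masks; which one to use is mostly a matter of readability.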

Majors with most unemployed people¶

Next we will look at the majors with the most unemployment. As we will see, unemployment does not appear to be correlated with the popularity of the major.

The majors with an unemployment rate above 0.09 and more than 100k graduates are Economics, Political Science, Commercial Art and History.
The majors with the highest unemployment rates (above 0.15) are Nuclear Engineering, Computer Networking and Public Administration.

In [13]:

recent_grads.plot(x='Total', y='Unemployment_rate', kind='scatter', title='Unemployment_rate vs. Total', figsize=(5,10))

Out[13]:

<AxesSubplot:title={'center':'Unemployment_rate vs. Total'}, xlabel='Total', ylabel='Unemployment_rate'>

In [14]:

most_unemployed_popular = recent_grads[(recent_grads['Unemployment_rate']>0.09)&(recent_grads['Total']>100000)]
most_unemployed_popular

Out[14]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
36	37	5501	ECONOMICS	139247.0	89749.0	49498.0	Social Science	0.355469	1322	104117	...	25325	70740	11452	0.099092	47000	35000	65000	25582	37057	10653
78	79	5506	POLITICAL SCIENCE AND GOVERNMENT	182621.0	93880.0	88741.0	Social Science	0.485930	1387	133454	...	43711	83236	15022	0.101175	38000	28000	50000	36854	66947	19803
95	96	6004	COMMERCIAL ART AND GRAPHIC DESIGN	103480.0	32041.0	71439.0	Arts	0.690365	1186	83483	...	24387	52243	8947	0.096798	35000	25000	45000	37389	38119	14839
114	115	6402	HISTORY	141951.0	78253.0	63698.0	Humanities & Liberal Arts	0.448732	1058	105646	...	40657	59218	11176	0.095667	34000	25000	47000	35336	54569	16839

4 rows × 21 columns

In [15]:

most_unemployed = recent_grads[(recent_grads['Unemployment_rate']>0.15)]
most_unemployed

Out[15]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
5	6	2418	NUCLEAR ENGINEERING	2573.0	2200.0	373.0	Engineering	0.144967	17	1857	...	264	1449	400	0.177226	65000	50000	102000	1142	657	244
84	85	2107	COMPUTER NETWORKING AND TELECOMMUNICATIONS	7613.0	5291.0	2322.0	Computers & Mathematics	0.305005	97	6144	...	1447	4369	1100	0.151850	36400	27000	49000	2593	2941	352
89	90	5401	PUBLIC ADMINISTRATION	5629.0	2947.0	2682.0	Law & Public Policy	0.476461	46	4158	...	847	2952	789	0.159491	36000	23000	60000	919	2313	496

3 rows × 21 columns
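It is worth knowing how Unemployment_rate relates to the other columns. My assumption (it is not stated in this notebook) is that it equals Unemployed divided by the labor force, that is, Employed + Unemployed. The sketch below checks that assumption on six rows copied from the outputs above; `sample` and `rate_recomputed` are hypothetical names.

```python
import pandas as pd

# Rows copied from the outputs above.
sample = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
              "METALLURGICAL ENGINEERING", "NUCLEAR ENGINEERING",
              "COMPUTER NETWORKING AND TELECOMMUNICATIONS",
              "PUBLIC ADMINISTRATION"],
    "Employed": [1976, 640, 648, 1857, 6144, 4158],
    "Unemployed": [37, 85, 16, 400, 1100, 789],
    "Unemployment_rate": [0.018381, 0.117241, 0.024096,
                          0.177226, 0.151850, 0.159491],
})

# Assumption: rate = Unemployed / (Employed + Unemployed).
labor_force = sample["Employed"] + sample["Unemployed"]
sample["rate_recomputed"] = sample["Unemployed"] / labor_force

max_diff = (sample["rate_recomputed"] - sample["Unemployment_rate"]).abs().max()
print(sample[["Major", "Unemployment_rate", "rate_recomputed"]])
print("largest difference:", max_diff)
```

For these rows the recomputed rate matches the published one up to rounding. Note that for small majors such as Nuclear Engineering a high rate corresponds to only a few hundred unemployed people.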

Full time jobs with the best salaries.¶

Plotting the number of full-time workers against the median salary, we find the following:

The most popular majors by full-time employment (more than 100k graduates working full time) with a median salary above 40k USD are Nursing, Finance and Accounting.
The most popular majors by full-time employment (more than 100k graduates working full time) with a median salary below 35k USD are Biology, Literature, Elementary Education and Psychology.

In [16]:

recent_grads.plot(x='Full_time', y='Median', kind='scatter', title='Median vs. Full_time', figsize=(5,10))

Out[16]:

<AxesSubplot:title={'center':'Median vs. Full_time'}, xlabel='Full_time', ylabel='Median'>

In [17]:

most_fulltime_median = recent_grads[(recent_grads['Full_time']>100000)&(recent_grads['Median']>40000)]
most_fulltime_median

Out[17]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
34	35	6107	NURSING	209394.0	21773.0	187621.0	Health	0.896019	2554	180903	...	40818	122817	8497	0.044863	48000	39000	58000	151643	26146	6193
35	36	6207	FINANCE	174506.0	115030.0	59476.0	Business	0.340825	2189	145696	...	21463	108595	9413	0.060686	47000	35000	64000	24243	48447	9910
40	41	6201	ACCOUNTING	198633.0	94519.0	104114.0	Business	0.524153	2042	165527	...	27693	123169	12411	0.069749	45000	34000	56000	11417	39323	10886

3 rows × 21 columns

In [18]:

worst_fulltime_median = recent_grads[(recent_grads['Full_time']>100000)&(recent_grads['Median']<35000)]
worst_fulltime_median

Out[18]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
123	124	3600	BIOLOGY	280709.0	111762.0	168947.0	Biology & Life Science	0.601858	1370	182295	...	72371	100336	13874	0.070725	33400	24000	45000	88232	81109	28339
137	138	3301	ENGLISH LANGUAGE AND LITERATURE	194673.0	58227.0	136446.0	Humanities & Liberal Arts	0.700898	1436	149180	...	57825	81180	14345	0.087724	32000	23000	41000	57690	71827	26503
138	139	2304	ELEMENTARY EDUCATION	170862.0	13029.0	157833.0	Education	0.923745	1629	149339	...	37965	86540	7297	0.046586	32000	23400	38000	108085	36972	11502
145	146	5200	PSYCHOLOGY	393735.0	86648.0	307087.0	Psychology & Social Work	0.779933	2584	307933	...	115172	174438	28169	0.083811	31500	24000	41000	125148	141860	48207

4 rows × 21 columns

Share of women in the majors and unemployment.¶

The share of women is a burning issue, because women are not evenly distributed across the more profitable majors, and it is known that women are paid less than men in most jobs.

The majors where more than 80% of graduates are women and unemployment is low (below 0.04) are Teacher Education and Human Services.
The majors where more than 80% of graduates are women and unemployment is high (above 0.10) are School Student Counseling and Library Science.

In [19]:

recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter', title='Unemployment_rate vs. ShareWomen', figsize=(5,10))

Out[19]:

<AxesSubplot:title={'center':'Unemployment_rate vs. ShareWomen'}, xlabel='ShareWomen', ylabel='Unemployment_rate'>

In [20]:

most_ShareWomen_unemployment = recent_grads[(recent_grads['ShareWomen']>0.8)&(recent_grads['Unemployment_rate']<0.04)]
most_ShareWomen_unemployment

Out[20]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
154	155	2312	TEACHER EDUCATION: MULTIPLE LEVELS	14443.0	2734.0	11709.0	Education	0.810704	142	13076	...	2214	8457	496	0.036546	30000	24000	37000	10766	1949	722
156	157	5403	HUMAN SERVICES AND COMMUNITY ORGANIZATION	9374.0	885.0	8489.0	Psychology & Social Work	0.905590	89	8294	...	2405	5061	326	0.037819	30000	24000	35000	2878	4595	724

2 rows × 21 columns

In [21]:

worst_ShareWomen_unemployment = recent_grads[(recent_grads['ShareWomen']>0.8)&(recent_grads['Unemployment_rate']>0.10)]
worst_ShareWomen_unemployment

Out[21]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
55	56	2303	SCHOOL STUDENT COUNSELING	818.0	119.0	699.0	Education	0.854523	4	730	...	135	545	88	0.107579	41000	41000	43000	509	221	0
172	173	3501	LIBRARY SCIENCE	1098.0	134.0	964.0	Education	0.877960	2	742	...	237	410	87	0.104946	22000	20000	22000	288	338	192

2 rows × 21 columns
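Another way to look at the share of women is to group the majors into bands and compare the median salary of each band. The sketch below does this with pd.cut on ten rows copied from outputs elsewhere in this notebook; the band edges (0.4 and 0.7) and the names `sample`, `share_band` and `median_by_band` are my own choices for illustration, and a ten-row hand-picked sample says nothing conclusive about the full dataset.

```python
import pandas as pd

# ShareWomen and Median for ten majors, copied from the notebook outputs.
sample = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
              "CHEMICAL ENGINEERING", "FINANCE", "ACCOUNTING", "BIOLOGY",
              "NURSING", "PSYCHOLOGY", "LIBRARY SCIENCE",
              "ELEMENTARY EDUCATION"],
    "ShareWomen": [0.120564, 0.101852, 0.341631, 0.340825, 0.524153,
                   0.601858, 0.896019, 0.779933, 0.877960, 0.923745],
    "Median": [110000, 75000, 65000, 47000, 45000,
               33400, 48000, 31500, 22000, 32000],
})

# Assign each major to a band according to its share of women.
bins = [0.0, 0.4, 0.7, 1.0]
labels = ["under 40%", "40% to 70%", "over 70%"]
sample["share_band"] = pd.cut(sample["ShareWomen"], bins=bins,
                              labels=labels, include_lowest=True)

# Median of the per-major median salaries in each band.
median_by_band = sample.groupby("share_band", observed=True)["Median"].median()
print(median_by_band)
```

Applied to recent_grads, the same grouping would show whether the pattern seen in the scatter plots holds across all 172 majors.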

Men majors and median salaries¶

In the case of men in majors we find:

The majors with more than 50k men and median salary above 50k USD are Mechanical Engineering, Electrical Engineering and Computer Science.
The majors with more than 50k men and median salary below 33k USD are Literature, Physical Fitness and Psychology.

In [22]:

recent_grads.plot(x='Men', y='Median', kind='scatter', title='Median vs. Men', figsize=(5,10))

Out[22]:

<AxesSubplot:title={'center':'Median vs. Men'}, xlabel='Men', ylabel='Median'>

In [23]:

most_men_median = recent_grads[(recent_grads['Men']>50000)&(recent_grads['Median']>50000)]
most_men_median

Out[23]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
8	9	2414	MECHANICAL ENGINEERING	91227.0	80320.0	10907.0	Engineering	0.119559	1029	76442	...	13101	54639	4650	0.057342	60000	48000	70000	52844	16384	3253
9	10	2408	ELECTRICAL ENGINEERING	81527.0	65511.0	16016.0	Engineering	0.196450	631	61928	...	12695	41413	3895	0.059174	60000	45000	72000	45829	10874	3170
20	21	2102	COMPUTER SCIENCE	128319.0	99743.0	28576.0	Computers & Mathematics	0.222695	1196	102087	...	18726	70932	6884	0.063173	53000	39000	70000	68622	25667	5144

3 rows × 21 columns

In [24]:

worst_men_median = recent_grads[(recent_grads['Men']>50000)&(recent_grads['Median']<33000)]
worst_men_median

Out[24]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
137	138	3301	ENGLISH LANGUAGE AND LITERATURE	194673.0	58227.0	136446.0	Humanities & Liberal Arts	0.700898	1436	149180	...	57825	81180	14345	0.087724	32000	23000	41000	57690	71827	26503
139	140	4101	PHYSICAL FITNESS PARKS RECREATION AND LEISURE	125074.0	62181.0	62893.0	Industrial Arts & Consumer Services	0.502846	1014	103078	...	38515	57978	5593	0.051467	32000	24000	43000	27581	63946	16838
145	146	5200	PSYCHOLOGY	393735.0	86648.0	307087.0	Psychology & Social Work	0.779933	2584	307933	...	115172	174438	28169	0.083811	31500	24000	41000	125148	141860	48207

3 rows × 21 columns

Women majors and median salaries¶

For women in majors we find:

The majors with more than 50k women and a median salary above 40k USD are Nursing, Finance and Accounting.
The most popular majors among women are not as well paid as the most popular majors among men.
The majors with more than 50k women and a median salary below 32k USD are Psychology and "Family and Consumer Sciences".

In [25]:

recent_grads.plot(x='Women', y='Median', kind='scatter', title='Median vs. Women', rot=30, figsize=(5,10))

Out[25]:

<AxesSubplot:title={'center':'Median vs. Women'}, xlabel='Women', ylabel='Median'>

In [26]:

most_women_median = recent_grads[(recent_grads['Women']>50000)&(recent_grads['Median']>40000)]
most_women_median

Out[26]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
34	35	6107	NURSING	209394.0	21773.0	187621.0	Health	0.896019	2554	180903	...	40818	122817	8497	0.044863	48000	39000	58000	151643	26146	6193
35	36	6207	FINANCE	174506.0	115030.0	59476.0	Business	0.340825	2189	145696	...	21463	108595	9413	0.060686	47000	35000	64000	24243	48447	9910
40	41	6201	ACCOUNTING	198633.0	94519.0	104114.0	Business	0.524153	2042	165527	...	27693	123169	12411	0.069749	45000	34000	56000	11417	39323	10886

3 rows × 21 columns

In [27]:

worst_women_median = recent_grads[(recent_grads['Women']>50000)&(recent_grads['Median']<32000)]
worst_women_median

Out[27]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
145	146	5200	PSYCHOLOGY	393735.0	86648.0	307087.0	Psychology & Social Work	0.779933	2584	307933	...	115172	174438	28169	0.083811	31500	24000	41000	125148	141860	48207
150	151	2901	FAMILY AND CONSUMER SCIENCES	58001.0	5166.0	52835.0	Industrial Arts & Consumer Services	0.910933	518	46624	...	15872	26906	3355	0.067128	30000	22900	40000	20985	20133	5248

2 rows × 21 columns

Series Distribution¶

The distribution of each series helps us see how the data is spread out and understand the different situations:

The sample size, median salary, employment, full-time jobs, men and women distributions are strongly right-skewed, roughly exponential in shape.
The share of women and the unemployment rate look closer to a "normal" distribution.
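The difference between a skewed and a roughly normal distribution can also be checked numerically with Series.skew(). The sketch below uses synthetic data (random draws, not the recent_grads columns) just to show what the numbers look like in each case; `skewed` and `symmetric` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Synthetic series: one exponential (long right tail), one normal (symmetric).
rng = np.random.default_rng(42)
skewed = pd.Series(rng.exponential(scale=1.0, size=10_000))
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=10_000))

# Sample skewness: near 0 for symmetric data, clearly positive for a right tail.
skew_skewed = skewed.skew()
skew_symmetric = symmetric.skew()

print("exponential sample skewness:", round(skew_skewed, 2))
print("normal sample skewness:", round(skew_symmetric, 2))
```

On the real data the same check would be, for example, recent_grads['Sample_size'].skew() compared with recent_grads['ShareWomen'].skew().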

In [28]:

recent_grads['Sample_size'].plot(kind='hist')

Out[28]:

<AxesSubplot:ylabel='Frequency'>

In [29]:

# Increasing the number of bins shows the frequency of the data in more detail.
recent_grads['Sample_size'].hist(bins=25, range=(0,5000))

Out[29]:

<AxesSubplot:>

In [30]:

recent_grads['Median'].hist(bins=20, range=(0,120000))

Out[30]:

<AxesSubplot:>

In [31]:

recent_grads['Employed'].hist(bins=20, range=(0,350000))

Out[31]:

<AxesSubplot:>

In [32]:

recent_grads['Full_time'].hist(bins=20, range=(0,300000))

Out[32]:

<AxesSubplot:>

In [33]:

recent_grads['ShareWomen'].hist(bins=20, range=(0,1))

Out[33]:

<AxesSubplot:>

In [34]:

recent_grads['Unemployment_rate'].hist(bins=20, range=(0,0.2))

Out[34]:

<AxesSubplot:>

In [35]:

recent_grads['Men'].hist(bins=20, range=(0,200000))

Out[35]:

<AxesSubplot:>

In [36]:

recent_grads['Women'].hist(bins=20, range=(0,200000))

Out[36]:

<AxesSubplot:>

Scatter plots to compare the data¶

This kind of plot is certainly useful for seeing a large amount of data at once, and its output can then guide a more detailed look. I think we should start with this kind of plot and use the others later; in any case, the results are the same as in the previous exercises.

In [37]:

from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[['Women', 'Men']], figsize=(10,10))

Out[37]:

array([[<AxesSubplot:xlabel='Women', ylabel='Women'>,
        <AxesSubplot:xlabel='Men', ylabel='Women'>],
       [<AxesSubplot:xlabel='Women', ylabel='Men'>,
        <AxesSubplot:xlabel='Men', ylabel='Men'>]], dtype=object)

In [38]:

scatter_matrix(recent_grads[['Total', 'Median']], figsize=(10,10))

Out[38]:

array([[<AxesSubplot:xlabel='Total', ylabel='Total'>,
        <AxesSubplot:xlabel='Median', ylabel='Total'>],
       [<AxesSubplot:xlabel='Total', ylabel='Median'>,
        <AxesSubplot:xlabel='Median', ylabel='Median'>]], dtype=object)

In [39]:

scatter_matrix(recent_grads[['Total', 'Median','Unemployment_rate']], figsize=(10,10))

Out[39]:

array([[<AxesSubplot:xlabel='Total', ylabel='Total'>,
        <AxesSubplot:xlabel='Median', ylabel='Total'>,
        <AxesSubplot:xlabel='Unemployment_rate', ylabel='Total'>],
       [<AxesSubplot:xlabel='Total', ylabel='Median'>,
        <AxesSubplot:xlabel='Median', ylabel='Median'>,
        <AxesSubplot:xlabel='Unemployment_rate', ylabel='Median'>],
       [<AxesSubplot:xlabel='Total', ylabel='Unemployment_rate'>,
        <AxesSubplot:xlabel='Median', ylabel='Unemployment_rate'>,
        <AxesSubplot:xlabel='Unemployment_rate', ylabel='Unemployment_rate'>]],
      dtype=object)

In [40]:

scatter_matrix(recent_grads[['Total', 'Median', 'ShareWomen']], figsize=(10,10))

Out[40]:

array([[<AxesSubplot:xlabel='Total', ylabel='Total'>,
        <AxesSubplot:xlabel='Median', ylabel='Total'>,
        <AxesSubplot:xlabel='ShareWomen', ylabel='Total'>],
       [<AxesSubplot:xlabel='Total', ylabel='Median'>,
        <AxesSubplot:xlabel='Median', ylabel='Median'>,
        <AxesSubplot:xlabel='ShareWomen', ylabel='Median'>],
       [<AxesSubplot:xlabel='Total', ylabel='ShareWomen'>,
        <AxesSubplot:xlabel='Median', ylabel='ShareWomen'>,
        <AxesSubplot:xlabel='ShareWomen', ylabel='ShareWomen'>]],
      dtype=object)

In [41]:

scatter_matrix(recent_grads[['Full_time', 'Median', 'ShareWomen']], figsize=(10,10))

Out[41]:

array([[<AxesSubplot:xlabel='Full_time', ylabel='Full_time'>,
        <AxesSubplot:xlabel='Median', ylabel='Full_time'>,
        <AxesSubplot:xlabel='ShareWomen', ylabel='Full_time'>],
       [<AxesSubplot:xlabel='Full_time', ylabel='Median'>,
        <AxesSubplot:xlabel='Median', ylabel='Median'>,
        <AxesSubplot:xlabel='ShareWomen', ylabel='Median'>],
       [<AxesSubplot:xlabel='Full_time', ylabel='ShareWomen'>,
        <AxesSubplot:xlabel='Median', ylabel='ShareWomen'>,
        <AxesSubplot:xlabel='ShareWomen', ylabel='ShareWomen'>]],
      dtype=object)
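A scatter matrix has a numeric counterpart: the correlation matrix returned by DataFrame.corr(), which has one coefficient for each pair of columns. The sketch below computes it for nine rows copied from the outputs in this notebook, so the coefficients describe only that small sample; `sample` and `corr_matrix` are names made up for the example.

```python
import pandas as pd

# Total, Median and ShareWomen for nine majors, copied from the notebook outputs.
sample = pd.DataFrame({
    "Total": [2339, 32260, 91227, 128319, 209394,
              198633, 329927, 393735, 1098],
    "Median": [110000, 65000, 60000, 53000, 48000,
               45000, 38000, 31500, 22000],
    "ShareWomen": [0.120564, 0.341631, 0.119559, 0.222695, 0.896019,
                   0.524153, 0.473190, 0.779933, 0.877960],
})

# Pairwise Pearson correlations; the layout mirrors the scatter matrix grid.
corr_matrix = sample.corr()
print(corr_matrix.round(2))
```

Each off-diagonal cell corresponds to one scatter panel, and the diagonal (always 1) corresponds to the histograms.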

Bar Plots¶

Bar plots are very useful for visualizing the data and seeing the nuances. I like the results of this kind of plot. We only need to change the kind argument of the .plot method to 'bar'.

Another alternative is to use the plot.bar method.

In [42]:

recent_grads[:5]['Women'].plot(kind='bar')

Out[42]:

<AxesSubplot:>

In [43]:

# With the .plot.bar method we can set both axes, so we can easily compare the data.
recent_grads[:5].plot.bar(x='Major', y='Women')

Out[43]:

<AxesSubplot:xlabel='Major'>

In [44]:

recent_grads[:5]['ShareWomen'].plot(kind='bar')

Out[44]:

<AxesSubplot:>

In [45]:

recent_grads[-5:]['ShareWomen'].plot(kind='bar')

Out[45]:

<AxesSubplot:>

In [46]:

recent_grads[:5]['Unemployment_rate'].plot(kind='bar')

Out[46]:

<AxesSubplot:>

In [47]:

recent_grads[-5:]['Unemployment_rate'].plot(kind='bar')

Out[47]:

<AxesSubplot:>

Best paid majors for women¶

With the bar plot we can apply some conditions and see the results:

I want the majors with a ShareWomen above 0.4.
They must also have a Median salary above 48k USD.
In the bar plot I select the Major and the Median columns.

The results are interesting:

Actuarial Science, Astronomy and Biomedical Engineering are majors with a good share of women and a median salary above 55k USD.

In [48]:

recent_grads[(recent_grads['ShareWomen']>0.4)&(recent_grads['Median']>48000)].plot.bar(x='Major', y='Median')

Out[48]:

<AxesSubplot:xlabel='Major'>

Majors with less unemployment rates for Women.¶

Here we apply the same steps as in the previous exercise, but with the unemployment rate: the share of women is above 0.7 and the unemployment rate is below 0.04.

Medical technology technicians, social psychology and mathematics teacher are majors with a high share of women and a low unemployment rate.
These majors have low median salaries, perhaps because most of their graduates are women.

In [49]:

recent_grads[(recent_grads['ShareWomen']>0.7)&(recent_grads['Unemployment_rate']<0.04)].plot.bar(x='Major', y='Median')

Out[49]:

<AxesSubplot:xlabel='Major'>

Hexagonal plot¶

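I don't fully understand this one yet, but it looks awesome, so I'm including it anyway. It groups the points into hexagonal bins and colors each bin by the number of points that fall in it.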

In [50]:

recent_grads.plot.hexbin(x='Total', y='Median',gridsize=10)

Out[50]:

<AxesSubplot:xlabel='Total', ylabel='Median'>
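To see what a hexbin plot actually stores, the sketch below calls matplotlib's Axes.hexbin directly and inspects the per-hexagon counts. It uses synthetic points (random draws, not the recent_grads data), and `hb` and `counts` are names chosen for the example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic points, made up for illustration.
rng = np.random.default_rng(0)
x = rng.exponential(scale=50_000, size=500)
y = rng.normal(loc=40_000, scale=10_000, size=500)

fig, ax = plt.subplots()

# gridsize=10 asks for about 10 hexagons along the x direction.
hb = ax.hexbin(x, y, gridsize=10)
fig.colorbar(hb, ax=ax, label="points per hexagon")

# Each hexagon carries the number of points that fell inside it.
counts = hb.get_array()
print("hexagons drawn:", len(counts))
print("points in the fullest hexagon:", int(counts.max()))
print("points counted in total:", int(counts.sum()), "of", len(x))
```

A darker or brighter hexagon therefore means more majors with that combination of values, which makes the plot a two-dimensional histogram. A smaller gridsize gives fewer, larger hexagons.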

Boxplot¶

Extremely useful for visualizing the distribution of the median salary; I will use it in future projects.

In [51]:

recent_grads.boxplot(column=['Median'])

Out[51]:

<AxesSubplot:>

In [52]:

recent_grads.boxplot(column=['Men', 'Women'])

Out[52]:

<AxesSubplot:>
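The box in a boxplot spans the first to the third quartile, and with matplotlib's default setting the whiskers reach at most 1.5 times the interquartile range (IQR) beyond the box; points further out are drawn as outliers. The sketch below reproduces that rule by hand on 26 Median values copied from the outputs in this notebook; the variable names are made up for the example, and the quartiles of this small sample differ from those of the full dataset.

```python
import pandas as pd

# Median salaries of 26 majors, copied from the notebook outputs.
medians = pd.Series([
    110000, 75000, 73000, 70000, 65000, 60000, 60000, 56000, 53000,
    48000, 47000, 45000, 38000, 36400, 36000, 35000, 34000, 33400,
    32000, 32000, 31500, 30000, 26000, 25000, 23400, 22000,
])

q1 = medians.quantile(0.25)
q3 = medians.quantile(0.75)
iqr = q3 - q1

# Limits beyond which a point is drawn as an outlier (default whis=1.5).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = medians[(medians < lower_fence) | (medians > upper_fence)]
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers.tolist())
```

In this sample only the 110,000 USD value (Petroleum Engineering) falls outside the fences.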

Conclusion¶

We made some useful visualizations to extract information from our data and discover interesting facts about our dataset:

The best paid majors are Petroleum Engineering, Mining and Mineral Engineering, and Metallurgical Engineering, all three with a median salary above 70k USD.
The most popular majors are Business Management and Psychology, each with more than 300k graduates in total.
The majors with more than 50k graduates and a median salary above 50k USD are Mechanical Engineering, General Engineering, Electrical Engineering and Computer Science.
The majors with an unemployment rate above 0.09 and more than 100k graduates are Economics, Political Science, Commercial Art and History.
The majors with the highest unemployment rates (above 0.15) are Nuclear Engineering, Computer Networking and Public Administration.
The most popular majors by full-time employment (more than 100k graduates working full time) with a median salary above 40k USD are Nursing, Finance and Accounting.
The most popular majors by full-time employment (more than 100k graduates working full time) with a median salary below 35k USD are Biology, Literature, Elementary Education and Psychology.
The majors where more than 80% of graduates are women and unemployment is low (below 0.04) are Teacher Education and Human Services.
The majors where more than 80% of graduates are women and unemployment is high (above 0.10) are School Student Counseling and Library Science.
The majors with more than 50k men and a median salary above 50k USD are Mechanical Engineering, Electrical Engineering and Computer Science.
The majors with more than 50k men and a median salary below 33k USD are Literature, Physical Fitness and Psychology.
The majors with more than 50k women and a median salary above 40k USD are Nursing, Finance and Accounting.
The most popular majors among women are not as well paid as the most popular majors among men.
The majors with more than 50k women and a median salary below 32k USD are Psychology and "Family and Consumer Sciences".
Medical technology technicians, social psychology and mathematics teacher are majors with a high share of women and a low unemployment rate.
These majors have low median salaries, perhaps because most of their graduates are women.
Actuarial Science, Astronomy and Biomedical Engineering are majors with a good share of women and a median salary above 55k USD.
