The main goal of this project is to find out whether there is a gender gap in the number of college degrees obtained and to display the result graphically. The project covers only 17 majors in the United States. The comparison of degrees obtained by women and men will not be based on absolute values. Instead, I will compare the share of degrees obtained by women and men in each of the 17 majors.
Each comparison will be displayed in a separate line chart, and all charts will be placed in one image. In other words, at the end we will get one image with 17 line charts. That image will have 3 columns of line charts. Each column will represent a category into which the majors will be classified later in this project.
The data on the share of women and men among degrees obtained is published annually by the Department of Education Statistics of the United States. The report containing data for the years 1970 to 2011 was cleaned by Randal Olson, a data scientist at the University of Pennsylvania. Randal Olson published the cleaned data here.
The cell below starts by importing the libraries needed for analysis and visualization. After that, the data is read and the first 10 rows are displayed to understand the data we will deal with.
from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
from numpy import arange
fem_deg = read_csv("percent-bachelors-degrees-women-usa.csv")
fem_deg.head(10)
 | Year | Agriculture | Architecture | Art and Performance | Biology | Business | Communications and Journalism | Computer Science | Education | Engineering | English | Foreign Languages | Health Professions | Math and Statistics | Physical Sciences | Psychology | Public Administration | Social Sciences and History |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1970 | 4.229798 | 11.921005 | 59.7 | 29.088363 | 9.064439 | 35.3 | 13.6 | 74.535328 | 0.8 | 65.570923 | 73.8 | 77.1 | 38.0 | 13.8 | 44.4 | 68.4 | 36.8 |
1 | 1971 | 5.452797 | 12.003106 | 59.9 | 29.394403 | 9.503187 | 35.5 | 13.6 | 74.149204 | 1.0 | 64.556485 | 73.9 | 75.5 | 39.0 | 14.9 | 46.2 | 65.5 | 36.2 |
2 | 1972 | 7.420710 | 13.214594 | 60.4 | 29.810221 | 10.558962 | 36.6 | 14.9 | 73.554520 | 1.2 | 63.664263 | 74.6 | 76.9 | 40.2 | 14.8 | 47.6 | 62.6 | 36.1 |
3 | 1973 | 9.653602 | 14.791613 | 60.2 | 31.147915 | 12.804602 | 38.4 | 16.4 | 73.501814 | 1.6 | 62.941502 | 74.9 | 77.4 | 40.9 | 16.5 | 50.4 | 64.3 | 36.4 |
4 | 1974 | 14.074623 | 17.444688 | 61.9 | 32.996183 | 16.204850 | 40.5 | 18.9 | 73.336811 | 2.2 | 62.413412 | 75.3 | 77.9 | 41.8 | 18.2 | 52.6 | 66.1 | 37.3 |
5 | 1975 | 18.333162 | 19.134048 | 60.9 | 34.449902 | 19.686249 | 41.5 | 19.8 | 72.801854 | 3.2 | 61.647206 | 75.0 | 78.9 | 40.7 | 19.1 | 54.5 | 63.0 | 37.7 |
6 | 1976 | 22.252760 | 21.394491 | 61.3 | 36.072871 | 23.430038 | 44.3 | 23.9 | 72.166525 | 4.5 | 62.148194 | 74.4 | 79.2 | 41.5 | 20.0 | 56.9 | 65.6 | 39.2 |
7 | 1977 | 24.640177 | 23.740541 | 62.0 | 38.331386 | 27.163427 | 46.9 | 25.7 | 72.456395 | 6.8 | 62.723067 | 74.3 | 80.5 | 41.1 | 21.3 | 59.0 | 69.3 | 40.5 |
8 | 1978 | 27.146192 | 25.849240 | 62.5 | 40.112496 | 30.527519 | 49.9 | 28.1 | 73.192821 | 8.4 | 63.619122 | 74.3 | 81.9 | 41.6 | 22.5 | 61.3 | 71.5 | 41.8 |
9 | 1979 | 29.633365 | 27.770477 | 63.2 | 42.065551 | 33.621634 | 52.3 | 30.2 | 73.821142 | 9.4 | 65.088390 | 74.2 | 82.3 | 42.3 | 23.7 | 63.3 | 73.3 | 43.6 |
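Each column holds women's share in percent, so men's share is the complement to 100. A minimal sketch of that calculation, using two rows copied from the output above (the `excerpt` frame and the `men_share` name are illustrative and not part of the notebook):

```python
from pandas import DataFrame

# Two rows copied from the output above (women's share, in percent)
excerpt = DataFrame({
    "Year": [1970, 1979],
    "Engineering": [0.8, 9.4],
    "Psychology": [44.4, 63.3],
})

# Men's share is the complement of women's share
men_share = 100 - excerpt.drop(columns="Year")
men_share.insert(0, "Year", excerpt["Year"])
print(men_share)
```

The same subtraction is what the plotting code applies to each major when it draws the men's line.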
As mentioned above, I am going to group the majors into 3 categories. In the cell below you can see a dictionary in which the keys are categories and the values are lists of the majors in each category. The 17 majors appear as column labels in the output of the cell above. I will classify those majors into the 'STEM', 'Liberal Arts' and 'Others' categories. Then I will convert that dictionary to a dataframe and display it.
Note: I tried to make the visualization code as dynamic as I could, so the number of rows, columns or values can be changed in the cell below without any need to adjust the nested loops for visualization in the following cells.
# Categorizing column labels of the csv file read above (which represent majors) into 3 categories
majors_dict = {'STEM': ['Psychology', 'Biology', 'Math and Statistics',
                        'Physical Sciences', 'Computer Science', 'Engineering'],
               'Liberal Arts': ['Foreign Languages', 'English', 'Communications and Journalism',
                                'Art and Performance', 'Social Sciences and History', ''],
               'Others': ['Health Professions', 'Public Administration', 'Education',
                          'Agriculture', 'Business', 'Architecture']}
majors_df = DataFrame(majors_dict)
majors_df
 | STEM | Liberal Arts | Others |
---|---|---|---|
0 | Psychology | Foreign Languages | Health Professions |
1 | Biology | English | Public Administration |
2 | Math and Statistics | Communications and Journalism | Education |
3 | Physical Sciences | Art and Performance | Agriculture |
4 | Computer Science | Social Sciences and History | Business |
5 | Engineering | | Architecture |
As we can see from the output of the above cell, we have a dataframe where each column is a category and each cell holds a major.
As we have only 17 majors divided into 3 categories, one column (Liberal Arts) has only 5 majors. An empty string was entered as the value in the 6th row of that column, where we do not have a major. It will help us later to identify the position where no line chart should be plotted. In other words, when the plotting code below reaches this cell it will skip it and leave an empty space in the image containing all line charts.
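The padding is needed because building a dataframe from a dictionary of lists expects all lists to have the same length. If the category lists change, the filler can be added programmatically instead of by hand. A minimal sketch, assuming a hypothetical helper named `pad_categories` that is not part of the notebook:

```python
def pad_categories(categories, filler=""):
    """Return a copy in which every list is padded with `filler` to equal length."""
    longest = max(len(majors) for majors in categories.values())
    return {
        name: majors + [filler] * (longest - len(majors))
        for name, majors in categories.items()
    }


# Shortened category lists, for illustration only
raw = {
    "STEM": ["Psychology", "Biology"],
    "Liberal Arts": ["English"],
}
padded = pad_categories(raw)
print(padded)
```

The padded dictionary can then be passed to `DataFrame` in the same way as `majors_dict`.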
In short, the cell below creates a figure object (container) and fills it with 17 line charts. Then it displays the filled container and saves it as a png file to the local directory. In the following cells I will describe the important features of the displayed image and then, based on the same charts, discuss the main observations.
# Creating figure object to fill it with ax objects in the nested loops below & titling it
fig2 = plt.figure(figsize=(22, 20))
fig2.suptitle('Share of Men & Women in below majors in % (y axis) by Years (x axis) '
              f'in {majors_df.shape[1]} category(ies):', size='xx-large')
# Below variables are used in nested loops for defining some features of each ax object:
dt = 10 # distance between annotation text and plot line (y coordinate offset)
yr = 2007 # x coordinate position for annotation
db = (0, 107/255, 164/255) # dark blue color for women line
o = (1, 128/255, 14/255) # orange color for men line
gr = (171/255, 171/255, 171/255) # grey color for axhline
lw = 3 # linewidth of plotted lines
plot_pos = 0 # position of ax object inside figure object
# Nested for loops for creating ax object for each major in majors_df dataframe
for row in range(majors_df.shape[0]):
    for col in range(majors_df.shape[1]):
        plot_pos += 1
        major = majors_df.iat[row, col]
        # to skip iteration if cell is empty in majors_df
        if len(major) == 0:
            continue
        # Adding ax object, plotting women and men line
        ax = fig2.add_subplot(majors_df.shape[0], majors_df.shape[1], plot_pos)
        ax.plot(fem_deg['Year'], fem_deg[major], c=db, linewidth=lw)
        ax.plot(fem_deg['Year'], 100 - fem_deg[major], c=o, linewidth=lw)
        # Using ax.set_title in first row of charts to name categories for each column of charts
        if row == 0:
            ax.set_title(majors_df.columns[col], pad=30, weight='bold')
        # Setting title for ax with ax.text, ticks and horizontal line in middle across the chart
        ax.text(1990, 98, major, c='r', ha='center', va='top', size='x-large')
        ax.set_yticks([0, 50, 100])
        ax.set_xticks(arange(1970, 2011, 10))
        ax.axhline(y=50, c=gr, alpha=0.3)
        # Removing ticks (leaving ticklabels) and spines
        ax.tick_params(left=False, bottom=False)
        for spine in ax.spines.values():
            spine.set_visible(False)
        # Finding last year observation for women in percentage for this major and
        # placing annotations above or below the lines depending on whether the
        # observation is below or above 50 percent
        lt_y = float(fem_deg[major].iloc[-1])
        if lt_y >= 50:
            ax.text(yr, lt_y + dt, 'Women', c=db, size='large')
            ax.text(yr, (100 - lt_y) - dt, 'Men', c=o, size='large')
        else:
            ax.text(yr, lt_y - dt, 'Women', c=db, size='large')
            ax.text(yr, (100 - lt_y) + dt, 'Men', c=o, size='large')
# adjusting position of all axes on figure object and saving it to png file
plt.subplots_adjust(wspace=0.088, hspace=0.3, top=0.93)
plt.savefig('degree_gap_gender.png')
Pay attention to the position of the 2 annotations relative to each other: the higher annotation marks the gender that held the majority in the last year. For example, in 'Agriculture' women held the majority in the last year.
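The placement rule behind those annotations can be isolated in a small function. This is only a sketch of the rule used in the plotting cell; the function name `label_positions` and its `offset` parameter are illustrative names, with `offset` playing the role of `dt`:

```python
def label_positions(women_last, offset=10):
    """Return (women_y, men_y): the y coordinates for the 'Women' and 'Men' labels.

    The label of the majority gender is pushed above its line and the label of
    the minority gender below its line, so the labels never sit between the lines.
    """
    men_last = 100 - women_last
    if women_last >= 50:
        return women_last + offset, men_last - offset
    return women_last - offset, men_last + offset


print(label_positions(63.3))  # women in the majority: 'Women' label is higher
print(label_positions(9.4))   # men in the majority: 'Men' label is higher
```

Because the labels move away from each other, the higher label always identifies the majority, even when the two lines are too close to tell apart by eye.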
In the STEM category, while women hold the majority in 'Psychology' & 'Biology', men have a higher share of degrees in the remaining 4 majors. Of those 4 fields, 'Math and Statistics' has the highest representation of women in all years. There is also a positive trend of increasing representation of women in 'Physical Sciences'. 'Computer Science' & 'Engineering', in turn, need more work to attract women to the field and to obtaining the respective degrees at American universities.
The opposite situation can be observed in Liberal Arts relative to the STEM majors. Here women hold the majority in 4 majors out of 5. Only in 'Social Sciences and History' did men hold the majority in the last years of the observed period (this is difficult to see from the line chart, but note that the orange annotation for men is higher than the blue one for women).
In the 'Others' category, while we can observe a clear majority of women in 'Public Administration' & 'Education', the shares in the other 3 fields are more or less equal. Moreover, there are trends towards equal representation in those 3 fields.
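When two lines are too close to read from a chart, the majority can also be checked directly from the last row of the data. A minimal sketch, assuming a small `sample` frame built from the 1978 and 1979 rows of the table printed earlier (so it reports the majority in 1979, not in the final year of the full file); with the full data, `sample` would be replaced by `fem_deg`:

```python
from pandas import DataFrame

# Rows for 1978 and 1979 copied from the table printed earlier (women's share in %)
sample = DataFrame({
    "Year": [1978, 1979],
    "Agriculture": [27.146192, 29.633365],
    "Psychology": [61.3, 63.3],
    "Social Sciences and History": [41.8, 43.6],
})

# Women's share in the last available row, one value per major
last = sample.drop(columns="Year").iloc[-1]

# Women hold the majority at 50% or more, matching the rule used for the annotations
majority = last.apply(lambda share: "Women" if share >= 50 else "Men")
print(majority)
```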