The Gender Gap In College Degrees¶

Introduction¶

The main goal of this project is to find out whether there is a gender gap in the number of college degrees obtained and to present the result graphically. The project covers 17 majors in the United States only. The comparison of degrees obtained by women and men is not based on absolute values. Instead, I compare the share of degrees obtained by women and by men in each of the 17 majors.

Each comparison will be displayed in a separate line chart, and all charts will be placed in one image. In other words, at the end we will get one image with 17 line charts. That image will have 3 columns of line charts, and each column will represent a category into which the majors are classified later in this project.

Data¶

The share of degrees obtained by women and men can be derived from data published annually by The Department of Education Statistics of the United States. The report covering the years 1970 to 2011 was cleaned by Randal Olson, a data scientist at the University of Pennsylvania. Randal Olson published the cleaned data here.

Reading and displaying dataframe¶

The cell below starts by importing the libraries needed for analysis and visualization. It then reads the data and displays the first 10 rows to show what we will be working with.

In [29]:

from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
from numpy import arange

fem_deg = read_csv("percent-bachelors-degrees-women-usa.csv")
fem_deg.head(10)

Out[29]:

	Year	Agriculture	Architecture	Art and Performance	Biology	Business	Communications and Journalism	Computer Science	Education	Engineering	English	Foreign Languages	Health Professions	Math and Statistics	Physical Sciences	Psychology	Public Administration	Social Sciences and History
0	1970	4.229798	11.921005	59.7	29.088363	9.064439	35.3	13.6	74.535328	0.8	65.570923	73.8	77.1	38.0	13.8	44.4	68.4	36.8
1	1971	5.452797	12.003106	59.9	29.394403	9.503187	35.5	13.6	74.149204	1.0	64.556485	73.9	75.5	39.0	14.9	46.2	65.5	36.2
2	1972	7.420710	13.214594	60.4	29.810221	10.558962	36.6	14.9	73.554520	1.2	63.664263	74.6	76.9	40.2	14.8	47.6	62.6	36.1
3	1973	9.653602	14.791613	60.2	31.147915	12.804602	38.4	16.4	73.501814	1.6	62.941502	74.9	77.4	40.9	16.5	50.4	64.3	36.4
4	1974	14.074623	17.444688	61.9	32.996183	16.204850	40.5	18.9	73.336811	2.2	62.413412	75.3	77.9	41.8	18.2	52.6	66.1	37.3
5	1975	18.333162	19.134048	60.9	34.449902	19.686249	41.5	19.8	72.801854	3.2	61.647206	75.0	78.9	40.7	19.1	54.5	63.0	37.7
6	1976	22.252760	21.394491	61.3	36.072871	23.430038	44.3	23.9	72.166525	4.5	62.148194	74.4	79.2	41.5	20.0	56.9	65.6	39.2
7	1977	24.640177	23.740541	62.0	38.331386	27.163427	46.9	25.7	72.456395	6.8	62.723067	74.3	80.5	41.1	21.3	59.0	69.3	40.5
8	1978	27.146192	25.849240	62.5	40.112496	30.527519	49.9	28.1	73.192821	8.4	63.619122	74.3	81.9	41.6	22.5	61.3	71.5	41.8
9	1979	29.633365	27.770477	63.2	42.065551	33.621634	52.3	30.2	73.821142	9.4	65.088390	74.2	82.3	42.3	23.7	63.3	73.3	43.6
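Judging by the file name and by how the charts are drawn later, each column holds the women's share of degrees in percent, so the men's share is its complement. A minimal sketch of that step, using three rows copied from the output above (the names `sample`, `women`, `men` and `gap` are illustrative and not part of the notebook):

```python
import pandas as pd

# Three rows copied from the table above (women's share of degrees, in %).
sample = pd.DataFrame({
    "Year": [1970, 1971, 1972],
    "Engineering": [0.8, 1.0, 1.2],
    "Psychology": [44.4, 46.2, 47.6],
})

women = sample.set_index("Year")

# Men's share is whatever remains of 100 % in each major and year.
men = 100 - women

# Gap in percentage points; a negative value means men hold the larger share.
gap = women - men

print(men)
print(gap)
```

Working with shares rather than counts keeps majors of very different sizes comparable on the same 0 to 100 scale.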

Categorization¶

As mentioned above, I am going to sort the majors into 3 categories. The cell below contains a dictionary in which the keys are categories and the values are lists of the majors assigned to each category. The 17 majors appear as column labels in the output of the cell above. I will classify them into the 'STEM', 'Liberal Arts' and 'Others' categories, then convert the dictionary to a dataframe and display it.

Note: I tried to make the visualization code as dynamic as I could, so the number of rows, columns or values can be changed in the cell below without any need to adjust the nested loops that draw the charts in the following cells.

In [30]:

# Categorizing the column labels of the csv file read above (which represent majors) into 3 categories
majors_dict = {'STEM': ['Psychology', 'Biology', 'Math and Statistics',
                           'Physical Sciences', 'Computer Science', 'Engineering'],
               'Liberal Arts': ['Foreign Languages', 'English', 'Communications and Journalism',
                               'Art and Performance', 'Social Sciences and History', ''],
               'Others': ['Health Professions', 'Public Administration', 'Education',
                            'Agriculture', 'Business', 'Architecture']}
majors_df = DataFrame(majors_dict)
majors_df

Out[30]:

	STEM	Liberal Arts	Others
0	Psychology	Foreign Languages	Health Professions
1	Biology	English	Public Administration
2	Math and Statistics	Communications and Journalism	Education
3	Physical Sciences	Art and Performance	Agriculture
4	Computer Science	Social Sciences and History	Business
5	Engineering		Architecture

As the output of the cell above shows, we have a dataframe where:

column labels are categories
values in each column are majors
we have 3 columns and 6 rows

Because the 17 majors are divided into 3 categories, one column (Liberal Arts) has only 5 majors. An empty string was entered in the 6th row of that column, where there is no major. It lets us detect the one cell that should not be plotted: when the visualization code below reaches it, the code skips it and leaves an empty space in the image.
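The padding and the skip can be sketched on a smaller example. This assumes nothing beyond pandas; `pad_columns` is a hypothetical helper, not something the notebook defines (there the empty string is typed into the dictionary by hand).

```python
from pandas import DataFrame


def pad_columns(groups, filler=""):
    """Return a copy of `groups` whose lists all share the longest length."""
    longest = max(len(names) for names in groups.values())
    return {key: list(names) + [filler] * (longest - len(names))
            for key, names in groups.items()}


# Two categories of unequal size; DataFrame() needs lists of equal length.
groups = {"STEM": ["Biology", "Engineering"], "Liberal Arts": ["English"]}
grid = DataFrame(pad_columns(groups))

# Walk the grid row by row and keep only the cells that name a major.
to_plot = [(r, c, grid.iat[r, c])
           for r in range(grid.shape[0])
           for c in range(grid.shape[1])
           if grid.iat[r, c]]
print(to_plot)
```

An empty string is falsy in Python, so the same test works whether it is written as `len(cell) == 0` or simply `not cell`.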

Visualization¶

In short, the cell below creates a figure object (container) and fills it with 17 line charts. It then displays the filled container and saves it as a png file in the local directory. In the following cells I will describe the important features of the displayed image and then, based on the same charts, discuss the main observations.
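The cell below keeps a running counter to tell `add_subplot` where each chart goes. The same 1-based, row-major position can be computed directly from the row and column, which is sketched here; `subplot_position` is a hypothetical helper used only for illustration.

```python
def subplot_position(row, col, n_cols):
    """1-based, row-major index of a grid cell, as add_subplot expects."""
    return row * n_cols + col + 1


n_rows, n_cols = 6, 3  # shape of the majors dataframe shown above

positions = [subplot_position(r, c, n_cols)
             for r in range(n_rows)
             for c in range(n_cols)]
print(positions)

# The padded cell (last row, 'Liberal Arts' column) sits at this position,
# which is the slot left blank in the final image.
print(subplot_position(5, 1, n_cols))
```

Because the position depends only on the grid shape, changing the number of rows or columns in the dataframe does not require touching the plotting loops.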

In [31]:

# Creating figure object to fill it with ax objects in the nested loops below & titling it
fig2 = plt.figure(figsize=(22, 20)) 
fig2.suptitle('Share of Men & Women in below majors in % (y axis) by Years(x axis) '\
              f'in {majors_df.shape[1]} category(ies):', size='xx-large')

# Below variables are used in nested loops for defining some features of each ax object:
dt = 10 # distance between annotation text and plot line (for y coordination position)
yr = 2007 # x coordinate position for annotation
db = (0, 107/255, 164/255) # dark blue color for women line
o = (1, 128/255, 14/255) # orange color for men line
gr = (171/255, 171/255, 171/255) # grey color for axhline
lw = 3 # linewidth of plotted lines
plot_pos=0 # position of ax object inside figure object

# Nested for loops for creating ax object for each major in majors_df dataframe
for row in range(majors_df.shape[0]):
    for col in range(majors_df.shape[1]):
        plot_pos += 1
        
        # to skip iteration if cell is empty in majors_df
        if len(majors_df.iat[row, col]) == 0: continue
        # Adding ax object, plotting women and men line
        ax = fig2.add_subplot(majors_df.shape[0], majors_df.shape[1], plot_pos)
        ax.plot(fem_deg['Year'], fem_deg[majors_df.iat[row, col]], c=db, linewidth=lw)
        ax.plot(fem_deg['Year'], 100-fem_deg[majors_df.iat[row, col]], c=o, linewidth=lw)
        
        # Using ax.set_title in first row of charts to name categories for each column of charts
        if (plot_pos-1) in range(majors_df.shape[1]):
            ax.set_title(majors_df.columns[plot_pos-1], pad=30, weight='bold')
        
        # Setting title for ax with ax.text, ticks and horizontal line in middle across the chart  
        ax.text(1990, 98, majors_df.iat[row, col], c='r',ha='center', va="top", size='x-large')
        ax.set_yticks([0, 50, 100]); ax.set_xticks(arange(1970, 2011, 10))
        ax.axhline(y=50, c=gr, alpha=0.3)
        # Removing ticks (leaving ticklabels) and spines
        ax.tick_params(left=False, bottom=False)
        for loc, spine in ax.spines.items(): spine.set_visible(False)
        
        # Finding last year observation for women in percentage for this major
        # (men's value is 100 minus it). Placing each annotation above or below its
        # line depending on whether the women's observation is below or above 50 percent
        lt_y = float(fem_deg[majors_df.iat[row, col]].iloc[-1])
        if lt_y>=50:
            ax.text(yr, lt_y+dt, 'Women', c=db, size='large')
            ax.text(yr, (100-lt_y)-dt, 'Men', c=o, size='large')
        else:
            ax.text(yr, lt_y-dt, 'Women', c=db, size='large')
            ax.text(yr, (100-lt_y)+dt, 'Men', c=o, size='large')
            
# adjusting position of all axes on figure object and saving it to png file        
plt.subplots_adjust(wspace=0.088, hspace=0.3, top=0.93)
plt.savefig('degree_gap_gender.png')

Notes on above visualization¶

Right below the title of the image there are 3 category labels in bold, one above each column of charts.

On top of each of the 17 line charts there is a chart title in red
Annotations ('Women' & 'Men') and the plotted lines are matched by color
In case it is not clear from a line chart which gender had the majority in the last year (e.g. 'Agriculture'):

Pay attention to the position of the 2 annotations relative to each other: the higher annotation marks the gender that had the majority in the last year. For example, in 'Agriculture' women had the majority in the last year.
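The placement rule behind those annotations can be written as a small function. This is a sketch of the logic in the plotting cell, with `label_positions` as a hypothetical name and the input values below chosen only for illustration.

```python
def label_positions(women_last, offset=10):
    """Return (y for 'Women', y for 'Men') given women's last-year share in %."""
    men_last = 100 - women_last
    if women_last >= 50:
        # Women's line ends on top: push its label up and the men's label down.
        return women_last + offset, men_last - offset
    # Men's line ends on top: push its label up and the women's label down.
    return women_last - offset, men_last + offset


print(label_positions(50.0))  # lines meet, labels still separate
print(label_positions(20))    # men clearly ahead
```

Moving the two labels in opposite directions keeps them apart even when both lines end close to 50 percent, which is why the higher label reliably identifies the majority.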

General Observations¶

STEM:

While women hold the majority in 'Psychology' & 'Biology', men have a higher share of degrees in the remaining 4 majors. Among those 4 fields, 'Math and Statistics' has the highest representation of women in all years. There is also a positive trend of increasing representation of women in 'Physical Sciences'. 'Computer Science' & 'Engineering', in turn, need more work to attract women to the field and to obtaining the respective degrees at American universities.

Liberal Arts:

The situation in Liberal Arts is quite the opposite of that in STEM. Here women hold the majority in 4 majors out of 5. Only in 'Social Sciences and History' did men have the majority in the last years of the observed period (this is difficult to see from the line chart itself, but note that the orange annotation for men sits higher than the blue one for women).

Others:

While we can observe a clear majority of women in 'Public Administration' & 'Education', the situation in 3 of the other fields is more or less equal. Moreover, those 3 fields show trends towards equal representation.
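Observations like these are read off the charts, but they can also be checked against the numbers. The sketch below shows the idea on four values from the 1979 row printed near the top of this notebook; with the full data one would presumably use the last row instead (for example `fem_deg.iloc[-1]` without the 'Year' entry), which is an assumption about usage rather than code from the notebook.

```python
import pandas as pd

# Women's share in 1979, copied from the table printed earlier in this notebook.
row_1979 = pd.Series({
    "Psychology": 63.3,
    "Engineering": 9.4,
    "Education": 73.821142,
    "Business": 33.621634,
})

# Majors where women, respectively men, earned more than half of the degrees.
women_majority = row_1979[row_1979 > 50].index.tolist()
men_majority = row_1979[row_1979 < 50].index.tolist()

print(women_majority)
print(men_majority)
```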