The Department of Education Statistics releases a data set annually containing the percentage of bachelor's degrees granted to women from 1970 to 2012. The data set is broken up into 17 categories of degrees, with each column as a separate category.
Randal Olson, a data scientist at the University of Pennsylvania, has cleaned the data set and made it available on his personal website. You can download the data set Randal compiled here.
Randal compiled this data set to explore the gender gap in STEM (science, technology, engineering, and mathematics) fields. This gap is often reported on in the news, and not everyone agrees that it exists.
The purpose of this project is to gain experience with improving graph visualizations by reducing "visual busyness", making it easier for observers to see what the graphs are "saying". There are no specific questions posed in this guided project.
# import the appropriate data file and supporting graph enabler; matplotlib
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
women_degrees = pd.read_csv('percent-bachelors-degrees-women-usa.csv')
# identify each of the column headings for setting up the line graph layout with 6 rows and 3 columns.
stem_cats = ['Psychology', 'Biology', 'Math and Statistics', 'Physical Sciences', 'Computer Science', 'Engineering']
lib_arts_cats = ['Foreign Languages', 'English', 'Communications and Journalism', 'Art and Performance', 'Social Sciences and History']
other_cats = ['Health Professions', 'Public Administration', 'Education', 'Agriculture','Business', 'Architecture']
# establish the specific line colors that everyone can distinguish, including "color blind" individuals.
cb_dark_blue = (0/255,107/255,164/255)
cb_orange = (255/255, 128/255, 14/255)
fig = plt.figure(figsize=(16, 20))
# use "for loop" to generate first column of line charts. STEM degrees.
for sp in range(0,18,3):
    cat_index = int(sp/3)
    ax = fig.add_subplot(6,3,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[stem_cats[cat_index]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[stem_cats[cat_index]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    ax.set_ylim(0,100)
    ax.set_title(stem_cats[cat_index])
    # hide the tick marks on all four sides (booleans rather than the strings "on"/"off").
    ax.tick_params(bottom=False, top=False, left=False, right=False)
    if cat_index == 0:
        ax.text(2003, 83, 'Women')
        ax.text(2006, 12, 'Men')
    elif cat_index == 5:
        ax.text(2006, 87, 'Men')
        ax.text(2003, 8, 'Women')
# use "for loop" to generate second column of line charts. Liberal arts degrees.
for sp in range(1,16,3):
    cat_index = int((sp-1)/3)
    ax = fig.add_subplot(6,3,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[lib_arts_cats[cat_index]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[lib_arts_cats[cat_index]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    ax.set_ylim(0,100)
    ax.set_title(lib_arts_cats[cat_index])
    ax.tick_params(bottom=False, top=False, left=False, right=False)
    if cat_index == 0:
        ax.text(2003, 75, 'Women')
        ax.text(2006, 20, 'Men')
# use "for loop" to generate third column of line charts. Other degrees.
for sp in range(2,20,3):
    cat_index = int((sp-2)/3)
    ax = fig.add_subplot(6,3,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[other_cats[cat_index]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[other_cats[cat_index]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    ax.set_ylim(0,100)
    ax.set_title(other_cats[cat_index])
    ax.tick_params(bottom=False, top=False, left=False, right=False)
    if cat_index == 0:
        ax.text(2003, 91, 'Women')
        ax.text(2006, 4, 'Men')
    elif cat_index == 5:
        ax.text(2006, 63, 'Men')
        ax.text(2003, 32, 'Women')
plt.show()
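The index arithmetic in the three loops above can be checked on its own, without drawing anything. The sketch below is an illustration rather than part of the original notebook: `subplot_positions` is a hypothetical helper name, and it assumes the same 6-row, 3-column grid passed to `fig.add_subplot`. For one column of the grid, it lists the category index and the 1-based subplot number each chart receives.

```python
def subplot_positions(n_categories, column, n_cols=3):
    """Return (cat_index, subplot_number) pairs for one column of the grid.

    column is 0-based (0 = left); subplot numbers are 1-based and count
    left to right, then top to bottom, as fig.add_subplot expects.
    """
    return [(row, row * n_cols + column + 1) for row in range(n_categories)]


# The STEM loop uses range(0, 18, 3) and plots at sp + 1.
stem_positions = subplot_positions(6, column=0)
# The liberal arts loop has only five categories, so its last grid cell stays empty.
lib_arts_positions = subplot_positions(5, column=1)
other_positions = subplot_positions(6, column=2)

print(stem_positions)
print(lib_arts_positions)
print(other_positions)
```

Each pair matches what the corresponding loop computes: `cat_index` selects the category name, and the second number is the position passed as `sp+1`.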
The graphs above still look a little "busy" with the years showing on the x-axis at the bottom of every graph. Here, we will remove these x-axis labels from all graphs except the bottom one in each column.
fig = plt.figure(figsize=(16, 20))
for sp in range(0,18,3):
    cat_index = int(sp/3)
    ax = fig.add_subplot(6,3,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[stem_cats[cat_index]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[stem_cats[cat_index]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    ax.set_ylim(0,100)
    ax.set_title(stem_cats[cat_index])
    # add labelbottom=False to end of command to remove x-axis labels
    ax.tick_params(bottom=False, top=False, left=False, right=False, labelbottom=False)
    if cat_index == 0:
        ax.text(2003, 83, 'Women')
        ax.text(2006, 12, 'Men')
    elif cat_index == 5:
        ax.text(2006, 87, 'Men')
        ax.text(2003, 8, 'Women')
    # set labelbottom=True to add x-axis labels only to the bottom graph in column 1.
    if cat_index == 5:
        ax.tick_params(labelbottom=True)
for sp in range(1,16,3):
    cat_index = int((sp-1)/3)
    ax = fig.add_subplot(6,3,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[lib_arts_cats[cat_index]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[lib_arts_cats[cat_index]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    ax.set_ylim(0,100)
    ax.set_title(lib_arts_cats[cat_index])
    # add labelbottom=False to end of command to remove x-axis labels
    ax.tick_params(bottom=False, top=False, left=False, right=False, labelbottom=False)
    if cat_index == 0:
        ax.text(2003, 75, 'Women')
        ax.text(2006, 20, 'Men')
    # set labelbottom=True to add x-axis labels only to the bottom graph in column 2.
    if cat_index == 4:
        ax.tick_params(labelbottom=True)
for sp in range(2,20,3):
    cat_index = int((sp-2)/3)
    ax = fig.add_subplot(6,3,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[other_cats[cat_index]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[other_cats[cat_index]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    ax.set_ylim(0,100)
    ax.set_title(other_cats[cat_index])
    # add labelbottom=False to end of command to remove x-axis labels
    ax.tick_params(bottom=False, top=False, left=False, right=False, labelbottom=False)
    if cat_index == 0:
        ax.text(2003, 91, 'Women')
        ax.text(2006, 4, 'Men')
    elif cat_index == 5:
        ax.text(2006, 63, 'Men')
        ax.text(2003, 32, 'Women')
    # set labelbottom=True to add x-axis labels only to the bottom graph in column 3.
    if cat_index == 5:
        ax.tick_params(labelbottom=True)
plt.show()
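The bottom chart in each column is picked out above with hard-coded indexes (5, 4, and 5). A minimal sketch of an alternative, assuming the hypothetical helper name `tick_settings` (not part of the original notebook), derives that index from the length of each category list and returns the keyword arguments for `ax.tick_params`:

```python
def tick_settings(cat_index, n_categories):
    """Build tick_params keyword arguments for one chart in a column.

    Tick marks are hidden on every side; x-axis labels are kept only
    for the last (bottom) chart of the column.
    """
    is_bottom = cat_index == n_categories - 1
    return {
        "bottom": False,
        "top": False,
        "left": False,
        "right": False,
        "labelbottom": is_bottom,
    }


lib_arts_cats = ['Foreign Languages', 'English', 'Communications and Journalism',
                 'Art and Performance', 'Social Sciences and History']

# Only the fifth liberal arts chart (index 4) keeps its year labels.
labels_shown = [tick_settings(i, len(lib_arts_cats))["labelbottom"]
                for i in range(len(lib_arts_cats))]
print(labels_shown)
```

Inside one of the loops, this could be applied as `ax.tick_params(**tick_settings(cat_index, len(lib_arts_cats)))`, so the same line works for columns with five or six charts.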
There is still an opportunity to simplify the graphs further. At this stage, we will change the y-axis labels, which currently appear in increments of 20 (0, 20, 40, 60, 80, and 100), so that only 0 and 100 are shown.
fig = plt.figure(figsize=(16, 20))
for sp in range(0,18,3):
    cat_index = int(sp/3)
    ax = fig.add_subplot(6,3,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[stem_cats[cat_index]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[stem_cats[cat_index]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    # change the "ylim" command to "yticks" as shown below to limit the y-axis output to showing only 0 and 100.
    ax.set_yticks([0,100])
    ax.set_title(stem_cats[cat_index])
    ax.tick_params(bottom=False, top=False, left=False, right=False, labelbottom=False)
    if cat_index == 0:
        ax.text(2003, 83, 'Women')
        ax.text(2006, 12, 'Men')
    elif cat_index == 5:
        ax.text(2006, 87, 'Men')
        ax.text(2003, 8, 'Women')
    if cat_index == 5:
        ax.tick_params(labelbottom=True)
for sp in range(1,16,3):
    cat_index = int((sp-1)/3)
    ax = fig.add_subplot(6,3,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[lib_arts_cats[cat_index]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[lib_arts_cats[cat_index]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    # change the "ylim" command to "yticks" as shown below to limit the y-axis output to showing only 0 and 100.
    ax.set_yticks([0,100])
    ax.set_title(lib_arts_cats[cat_index])
    ax.tick_params(bottom=False, top=False, left=False, right=False, labelbottom=False)
    if cat_index == 0:
        ax.text(2003, 75, 'Women')
        ax.text(2006, 20, 'Men')
    if cat_index == 4:
        ax.tick_params(labelbottom=True)
for sp in range(2,20,3):
    cat_index = int((sp-2)/3)
    ax = fig.add_subplot(6,3,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[other_cats[cat_index]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[other_cats[cat_index]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    # change the "ylim" command to "yticks" as shown below to limit the y-axis output to showing only 0 and 100.
    ax.set_yticks([0,100])
    ax.set_title(other_cats[cat_index])
    ax.tick_params(bottom=False, top=False, left=False, right=False, labelbottom=False)
    if cat_index == 0:
        ax.text(2003, 91, 'Women')
        ax.text(2006, 4, 'Men')
    elif cat_index == 5:
        ax.text(2006, 63, 'Men')
        ax.text(2003, 32, 'Women')
    if cat_index == 5:
        ax.tick_params(labelbottom=True)
plt.show()
fig = plt.figure(figsize=(16, 20))
for sp in range(0,18,3):
    cat_index = int(sp/3)
    ax = fig.add_subplot(6,3,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[stem_cats[cat_index]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[stem_cats[cat_index]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    ax.set_yticks([0,100])
    # with the following command, add a centerline at y=50 to make the middle of the graph easier to identify.
    ax.axhline(50, c=(171/255, 171/255, 171/255), alpha=0.3)
    ax.set_title(stem_cats[cat_index])
    ax.tick_params(bottom=False, top=False, left=False, right=False, labelbottom=False)
    if cat_index == 0:
        ax.text(2003, 83, 'Women')
        ax.text(2006, 12, 'Men')
    elif cat_index == 5:
        ax.text(2006, 87, 'Men')
        ax.text(2003, 8, 'Women')
    if cat_index == 5:
        ax.tick_params(labelbottom=True)
for sp in range(1,16,3):
    cat_index = int((sp-1)/3)
    ax = fig.add_subplot(6,3,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[lib_arts_cats[cat_index]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[lib_arts_cats[cat_index]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    ax.set_yticks([0,100])
    # with the following command, add a centerline at y=50 to make the middle of the graph easier to identify.
    ax.axhline(50, c=(171/255, 171/255, 171/255), alpha=0.3)
    ax.set_title(lib_arts_cats[cat_index])
    ax.tick_params(bottom=False, top=False, left=False, right=False, labelbottom=False)
    if cat_index == 0:
        ax.text(2003, 75, 'Women')
        ax.text(2006, 20, 'Men')
    if cat_index == 4:
        ax.tick_params(labelbottom=True)
for sp in range(2,20,3):
    cat_index = int((sp-2)/3)
    ax = fig.add_subplot(6,3,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[other_cats[cat_index]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[other_cats[cat_index]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    ax.set_yticks([0,100])
    # with the following command, add a centerline at y=50 to make the middle of the graph easier to identify.
    ax.axhline(50, c=(171/255, 171/255, 171/255), alpha=0.3)
    ax.set_title(other_cats[cat_index])
    ax.tick_params(bottom=False, top=False, left=False, right=False, labelbottom=False)
    if cat_index == 0:
        ax.text(2003, 91, 'Women')
        ax.text(2006, 4, 'Men')
    elif cat_index == 5:
        ax.text(2006, 63, 'Men')
        ax.text(2003, 32, 'Women')
    if cat_index == 5:
        ax.tick_params(labelbottom=True)
plt.show()
Looking at the horizontal lines in the graphs above, I believe they are not visible enough. If we are going to add something to make it easier to see what the graphs are "saying", it should be clearly visible.
My suggestions are to either increase the line's opacity (alpha), change its color, or increase its width
to improve visibility without letting the line become a distraction.
I will experiment with each option here.
fig = plt.figure(figsize=(16, 20))
for sp in range(0,18,3):
    cat_index = int(sp/3)
    ax = fig.add_subplot(6,3,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[stem_cats[cat_index]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[stem_cats[cat_index]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    ax.set_yticks([0,100])
    # centerline at y=50 with alpha raised from 0.3 to 0.6.
    ax.axhline(50, c=(171/255, 171/255, 171/255), alpha=0.6)
    ax.set_title(stem_cats[cat_index])
    ax.tick_params(bottom=False, top=False, left=False, right=False, labelbottom=False)
    if cat_index == 0:
        ax.text(2003, 83, 'Women')
        ax.text(2006, 12, 'Men')
    elif cat_index == 5:
        ax.text(2006, 87, 'Men')
        ax.text(2003, 8, 'Women')
    if cat_index == 5:
        ax.tick_params(labelbottom=True)
for sp in range(1,16,3):
    cat_index = int((sp-1)/3)
    ax = fig.add_subplot(6,3,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[lib_arts_cats[cat_index]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[lib_arts_cats[cat_index]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    ax.set_yticks([0,100])
    # centerline at y=50 with a different color.
    ax.axhline(50, c=(200/255, 82/255, 0/255), alpha=0.3)
    ax.set_title(lib_arts_cats[cat_index])
    ax.tick_params(bottom=False, top=False, left=False, right=False, labelbottom=False)
    if cat_index == 0:
        ax.text(2003, 75, 'Women')
        ax.text(2006, 20, 'Men')
    if cat_index == 4:
        ax.tick_params(labelbottom=True)
for sp in range(2,20,3):
    cat_index = int((sp-2)/3)
    ax = fig.add_subplot(6,3,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[other_cats[cat_index]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[other_cats[cat_index]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    ax.set_yticks([0,100])
    # centerline at y=50 with linewidth increased to 3.
    ax.axhline(50, c=(171/255, 171/255, 171/255), alpha=0.3, linewidth=3)
    ax.set_title(other_cats[cat_index])
    ax.tick_params(bottom=False, top=False, left=False, right=False, labelbottom=False)
    if cat_index == 0:
        ax.text(2003, 91, 'Women')
        ax.text(2006, 4, 'Men')
    elif cat_index == 5:
        ax.text(2006, 63, 'Men')
        ax.text(2003, 32, 'Women')
    if cat_index == 5:
        ax.tick_params(labelbottom=True)
plt.show()
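I made three modifications to the center line code, one per column:
Column 1 (STEM degrees): increased alpha from 0.3 to 0.6.
Column 2 (liberal arts degrees): changed the color to (200/255, 82/255, 0/255).
Column 3 (other degrees): set linewidth=3.
All three changes improve the visibility of the center line in the graphs shown above.
In essence, there are numerous ways to achieve the desired end result, whatever that may be.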
plt.savefig('biology_degrees.png')
<matplotlib.figure.Figure at 0x7fe7f9f13e48>
from IPython.display import Image
from IPython.core.display import HTML
Image(url= "https://app.dataquest.io/jupyter/tree/notebook/biology_degrees.jpg")
NOTE: I tried to show above that the "jpg" file was saved by displaying the picture from the Jupyter notebook web page. It didn't work, and I am not sure what to do to correct this.
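One possible explanation, offered as an assumption rather than a confirmed diagnosis: `plt.savefig` was called after `plt.show()`, by which point the plotted figure was no longer the current one, so a new empty figure may have been saved (the `Figure` object printed above would be consistent with that). The file was also saved as `biology_degrees.png`, while the `Image` call requests `biology_degrees.jpg` from a URL. A minimal sketch of saving through the figure object before showing it follows; the small stand-in plot, its made-up values, and the temporary directory are all assumptions for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt

# Stand-in data for illustration only; these are not values from the data set.
years = [1970, 1980, 1990, 2000, 2010]
women = [30, 40, 50, 55, 60]

fig = plt.figure(figsize=(4, 3))
ax = fig.add_subplot(1, 1, 1)
ax.plot(years, women, label='Women')
ax.plot(years, [100 - w for w in women], label='Men')

# Save through the figure object, before any call to plt.show().
out_path = os.path.join(tempfile.mkdtemp(), 'biology_degrees.png')
fig.savefig(out_path)
plt.close(fig)

print(os.path.exists(out_path), os.path.getsize(out_path) > 0)
```

In a notebook, the saved file could then be displayed with `Image(filename=out_path)` from `IPython.display`, using the same file name and extension that were passed to `savefig`.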
For this project, there were no specific questions to answer, and therefore there are no conclusions or inferences to make.
We can, however, make observations such as:
College degrees which showed no obvious up or down trends or crossover of gender gaps are:
"Foreign Languages", "English", and "Art and Performance".
College degrees which began with high gender gap and reduced substantially are:
"Physical Sciences", "Agriculture", "Business" and "Architecture".
College degrees which showed significant reversal of gender gaps are:
"Psychology", "Biology", and "Communications and Journalism".
This project provided a great opportunity to learn how to enhance graphs visually.
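The observations listed above could also be checked numerically. The sketch below is one possible approach, not the method used in this project: `classify_gap` is a hypothetical helper, the 10-point threshold is an arbitrary assumption, and the sample series are made-up values rather than figures from the data set. It treats the gender gap as the absolute difference between the women's and men's percentages, and a reversal as the women's share being on opposite sides of 50% at the start and end of the series.

```python
def classify_gap(women_pct, threshold=10):
    """Label a series of women's percentages by how the gender gap changed.

    The gap in a given year is |women - men| = |2 * women - 100|.
    Only the first and last values are compared.
    """
    start, end = women_pct[0], women_pct[-1]
    start_gap = abs(2 * start - 100)
    end_gap = abs(2 * end - 100)
    # A reversal means the majority gender at the start differs from the end.
    if (start - 50) * (end - 50) < 0:
        return 'reversal'
    if start_gap - end_gap >= threshold:
        return 'gap reduced'
    if end_gap - start_gap >= threshold:
        return 'gap widened'
    return 'no obvious trend'


# Made-up series for illustration only.
print(classify_gap([45, 55, 65, 75]))   # reversal
print(classify_gap([10, 20, 30, 40]))   # gap reduced
print(classify_gap([74, 73, 75, 74]))   # no obvious trend
```

With the real data, a call such as `classify_gap(list(women_degrees['Biology']))` would apply the same rule to one column; because only the endpoints are compared, a series that crosses 50% and returns would not be flagged as a reversal.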