Open In Colab

Gendered Language on University Subreddits - ECON 323 Final Project

Author

In this notebook, we seek to understand university culture through comments made on the social media platform Reddit by focusing our analysis on university subreddits which are frequently used by prospective students and current students at both the undergraduate as well as the graduate level. Alice H. Wu (2018) seeks to understand culture in academia through the Economics Job Market Rumors forum. Both Reddit as well as Economics Job Market Rumors offer users a degree of anonymity. Note that therefore, both platforms may be used by people outside the academic communities as well. While Wu (2018) suggests that Economics Job Market Rumors was initially created for discussions surrounding job market placements, interviews, and outcomes, Wu (2018) seeks to understand if there is a distinct stereotypical culture. The author finds that the words "hotter", "pregnant", and "plow" are the most predictive of a post being classified as female while "homo", "testosterone", and "chapters" are predictive of a post being classified as male. Beyond the heteronormativity and misogyny that emerge from the word list, Wu (2018) finds professional words such as "supervisor" and "adviser" are predictive of a post discussing a male. We apply the same lasso-logistic model and use the similar methods as Wu (2018) to understand if that culture permeates social media forums used by members of the academic community. We are aware that as Wu (2018) discusses, both Reddit as well as Economics Job Market Rumors are moderated communities. Posts that do not meet the respective set of community guidelines may be subject to action by moderators including deletion of the post. Therefore our analysis is limited in that the comments in the dataset that we extract are not necessarily an unbiased reflection of university culture. Also our sample is certainly not random. Not every student and faculty at universities use their respective university subreddit, and subreddits are rarely if ever the primary means of university-level communication.

In [ ]:
#Install on colab or as necessary
! pip install qeds fiona geopandas xgboost gensim folium pyLDAvis descartes psaw pyarrow
Requirement already satisfied: qeds in /usr/local/lib/python3.6/dist-packages (0.6.2)
Requirement already satisfied: fiona in /usr/local/lib/python3.6/dist-packages (1.8.13.post1)
Requirement already satisfied: geopandas in /usr/local/lib/python3.6/dist-packages (0.7.0)
Requirement already satisfied: xgboost in /usr/local/lib/python3.6/dist-packages (0.90)
Requirement already satisfied: gensim in /usr/local/lib/python3.6/dist-packages (3.6.0)
Requirement already satisfied: folium in /usr/local/lib/python3.6/dist-packages (0.8.3)
Requirement already satisfied: pyLDAvis in /usr/local/lib/python3.6/dist-packages (2.1.2)
Requirement already satisfied: descartes in /usr/local/lib/python3.6/dist-packages (1.1.0)
Requirement already satisfied: psaw in /usr/local/lib/python3.6/dist-packages (0.0.12)
Requirement already satisfied: pyarrow in /usr/local/lib/python3.6/dist-packages (0.14.1)
Requirement already satisfied: seaborn in /usr/local/lib/python3.6/dist-packages (from qeds) (0.10.0)
Requirement already satisfied: quandl in /usr/local/lib/python3.6/dist-packages (from qeds) (3.5.0)
Requirement already satisfied: quantecon in /usr/local/lib/python3.6/dist-packages (from qeds) (0.4.7)
Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from qeds) (1.0.3)
Requirement already satisfied: openpyxl in /usr/local/lib/python3.6/dist-packages (from qeds) (2.5.9)
Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from qeds) (2.21.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from qeds) (1.18.3)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from qeds) (1.4.1)
Requirement already satisfied: plotly in /usr/local/lib/python3.6/dist-packages (from qeds) (4.4.1)
Requirement already satisfied: statsmodels in /usr/local/lib/python3.6/dist-packages (from qeds) (0.10.2)
Requirement already satisfied: pandas-datareader in /usr/local/lib/python3.6/dist-packages (from qeds) (0.8.1)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.6/dist-packages (from qeds) (3.2.1)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from qeds) (0.22.2.post1)
Requirement already satisfied: click<8,>=4.0 in /usr/local/lib/python3.6/dist-packages (from fiona) (7.1.1)
Requirement already satisfied: munch in /usr/local/lib/python3.6/dist-packages (from fiona) (2.5.0)
Requirement already satisfied: click-plugins>=1.0 in /usr/local/lib/python3.6/dist-packages (from fiona) (1.1.1)
Requirement already satisfied: cligj>=0.5 in /usr/local/lib/python3.6/dist-packages (from fiona) (0.5.0)
Requirement already satisfied: attrs>=17 in /usr/local/lib/python3.6/dist-packages (from fiona) (19.3.0)
Requirement already satisfied: six>=1.7 in /usr/local/lib/python3.6/dist-packages (from fiona) (1.12.0)
Requirement already satisfied: shapely in /usr/local/lib/python3.6/dist-packages (from geopandas) (1.7.0)
Requirement already satisfied: pyproj>=2.2.0 in /usr/local/lib/python3.6/dist-packages (from geopandas) (2.6.0)
Requirement already satisfied: smart-open>=1.2.1 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.11.1)
Requirement already satisfied: branca>=0.3.0 in /usr/local/lib/python3.6/dist-packages (from folium) (0.4.0)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.6/dist-packages (from folium) (2.11.2)
Requirement already satisfied: wheel>=0.23.0 in /usr/local/lib/python3.6/dist-packages (from pyLDAvis) (0.34.2)
Requirement already satisfied: future in /usr/local/lib/python3.6/dist-packages (from pyLDAvis) (0.16.0)
Requirement already satisfied: numexpr in /usr/local/lib/python3.6/dist-packages (from pyLDAvis) (2.7.1)
Requirement already satisfied: funcy in /usr/local/lib/python3.6/dist-packages (from pyLDAvis) (1.14)
Requirement already satisfied: joblib>=0.8.4 in /usr/local/lib/python3.6/dist-packages (from pyLDAvis) (0.14.1)
Requirement already satisfied: pytest in /usr/local/lib/python3.6/dist-packages (from pyLDAvis) (3.6.4)
Requirement already satisfied: more-itertools in /usr/local/lib/python3.6/dist-packages (from quandl->qeds) (8.2.0)
Requirement already satisfied: inflection>=0.3.1 in /usr/local/lib/python3.6/dist-packages (from quandl->qeds) (0.4.0)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.6/dist-packages (from quandl->qeds) (2.8.1)
Requirement already satisfied: sympy in /usr/local/lib/python3.6/dist-packages (from quantecon->qeds) (1.1.1)
Requirement already satisfied: numba>=0.38 in /usr/local/lib/python3.6/dist-packages (from quantecon->qeds) (0.48.0)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas->qeds) (2018.9)
Requirement already satisfied: et-xmlfile in /usr/local/lib/python3.6/dist-packages (from openpyxl->qeds) (1.0.1)
Requirement already satisfied: jdcal in /usr/local/lib/python3.6/dist-packages (from openpyxl->qeds) (1.4.1)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->qeds) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->qeds) (2020.4.5.1)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->qeds) (3.0.4)
Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->qeds) (2.8)
Requirement already satisfied: retrying>=1.3.3 in /usr/local/lib/python3.6/dist-packages (from plotly->qeds) (1.3.3)
Requirement already satisfied: patsy>=0.4.0 in /usr/local/lib/python3.6/dist-packages (from statsmodels->qeds) (0.5.1)
Requirement already satisfied: lxml in /usr/local/lib/python3.6/dist-packages (from pandas-datareader->qeds) (4.2.6)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->qeds) (1.2.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->qeds) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib->qeds) (0.10.0)
Requirement already satisfied: boto3 in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (1.12.43)
Requirement already satisfied: boto in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (2.49.0)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/dist-packages (from jinja2->folium) (1.1.1)
Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from pytest->pyLDAvis) (46.1.3)
Requirement already satisfied: pluggy<0.8,>=0.5 in /usr/local/lib/python3.6/dist-packages (from pytest->pyLDAvis) (0.7.1)
Requirement already satisfied: atomicwrites>=1.0 in /usr/local/lib/python3.6/dist-packages (from pytest->pyLDAvis) (1.3.0)
Requirement already satisfied: py>=1.5.0 in /usr/local/lib/python3.6/dist-packages (from pytest->pyLDAvis) (1.8.1)
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.6/dist-packages (from sympy->quantecon->qeds) (1.1.0)
Requirement already satisfied: llvmlite<0.32.0,>=0.31.0dev0 in /usr/local/lib/python3.6/dist-packages (from numba>=0.38->quantecon->qeds) (0.31.0)
Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim) (0.3.3)
Requirement already satisfied: botocore<1.16.0,>=1.15.43 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim) (1.15.43)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim) (0.9.5)
Requirement already satisfied: docutils<0.16,>=0.10 in /usr/local/lib/python3.6/dist-packages (from botocore<1.16.0,>=1.15.43->boto3->smart-open>=1.2.1->gensim) (0.15.2)
In [ ]:
#Not all packages imported were necessarily used in the analysis
import pandas as pd
import numpy as np
import patsy
import json
import os
from bs4 import BeautifulSoup
import time
import psaw
from google.colab import files
import nltk
import string
import matplotlib.pyplot as plt
%matplotlib inline
# activate plot theme
import qeds
qeds.themes.mpl_style();
import pyarrow.feather
import scipy
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import model_selection
from sklearn import linear_model
from sklearn.inspection import plot_partial_dependence
from wordcloud import WordCloud

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Out[ ]:
True

Using PSAW to get data

In this section, we leverage PSAW which is a wrapper for Pushshift API to build a dataset of less than 3 million Reddit comments made on the subreddits of universities with economics departments in top 35 institutions as of March 2020 according to rankings published by RePEc. You can find the ranking list here. We note first that the choice of ranking list to use for the analysis was indeed arbitrary. For this analysis we focused on the top 35 institutions. We could not include Paris School of Economics, Toulouse School of Economics, Barcelona Graduate School of Economics, Tilburg University, Graduate School of Business - Columbia University (Columbia University has been entered once). We asked for up to 500,000 comments per subreddit. We wish to emphasize that we do not have 500,000 comments for each subreddit and that some subreddits would be under-represented in the analysis unless weights are attached.

We strongly recommend running this section in a separate session in order to minimize burdening servers and to efficiently use RAM. We ran this section on Tuesday, April 14 2020. The analysis in this notebook shall be based on data collected on that day.

In [ ]:
api = psaw.PushshiftAPI()

harvardgen = api.search_comments(subreddit = "harvard", limit = 500000)
harvarddata = pd.DataFrame([i.d_ for i in harvardgen])

mitgen = api.search_comments(subreddit = "mit", limit = 500000)
mitdata = pd.DataFrame([i.d_ for i in mitgen])

berkeleygen = api.search_comments(subreddit = "berkeley", limit = 500000)
berkeleydata = pd.DataFrame([i.d_ for i in berkeleygen])

uchicagogen = api.search_comments(subreddit = "uchicago", limit = 500000)
uchicagodata = pd.DataFrame([i.d_ for i in uchicagogen])

princetongen = api.search_comments(subreddit = "princeton", limit = 500000)
princetondata = pd.DataFrame([i.d_ for i in princetongen])

stanfordgen = api.search_comments(subreddit = "stanford", limit = 500000)
stanforddata = pd.DataFrame([i.d_ for i in stanfordgen])

oxfordunigen = api.search_comments(subreddit = "oxforduni", limit = 500000)
oxfordunidata = pd.DataFrame([i.d_ for i in oxfordunigen])

columbiagen = api.search_comments(subreddit = "columbia", limit = 500000)
columbiadata = pd.DataFrame([i.d_ for i in columbiagen])

brownugen = api.search_comments(subreddit = "brownu", limit = 500000)
brownudata = pd.DataFrame([i.d_ for i in brownugen])

nyugen = api.search_comments(subreddit = "nyu", limit = 500000)
nyudata = pd.DataFrame([i.d_ for i in nyugen])

yalegen = api.search_comments(subreddit = "yale", limit = 500000)
yaledata = pd.DataFrame([i.d_ for i in yalegen])

bostonugen = api.search_comments(subreddit = "bostonu", limit = 500000)
bostonudata = pd.DataFrame([i.d_ for i in bostonugen])

dartmouthgen = api.search_comments(subreddit = "dartmouth", limit = 500000)
dartmouthdata = pd.DataFrame([i.d_ for i in dartmouthgen])

upenngen = api.search_comments(subreddit = "upenn", limit = 500000)
upenndata = pd.DataFrame([i.d_ for i in upenngen])

ucsdgen = api.search_comments(subreddit = "ucsd", limit = 500000)
ucsddata = pd.DataFrame([i.d_ for i in ucsdgen])

uclgen = api.search_comments(subreddit = "ucl", limit = 500000)
ucldata = pd.DataFrame([i.d_ for i in uclgen])

uclagen = api.search_comments(subreddit = "ucla", limit = 500000)
ucladata = pd.DataFrame([i.d_ for i in uclagen])

northwesterngen = api.search_comments(subreddit = "northwestern", limit = 500000)
northwesterndata = pd.DataFrame([i.d_ for i in northwesterngen])

uwmadisongen = api.search_comments(subreddit = "uwmadison", limit = 500000)
uwmadisondata = pd.DataFrame([i.d_ for i in uwmadisongen])

thelsegen = api.search_comments(subreddit = "thelse", limit = 500000)
thelsedata = pd.DataFrame([i.d_ for i in thelsegen])

msugen = api.search_comments(subreddit = "msu", limit = 500000)
msudata = pd.DataFrame([i.d_ for i in msugen])

uofmgen = api.search_comments(subreddit = "uofm", limit = 500000)
uofmdata = pd.DataFrame([i.d_ for i in uofmgen])

ucdavisgen = api.search_comments(subreddit = "ucdavis", limit = 500000)
ucdavisdata = pd.DataFrame([i.d_ for i in ucdavisgen])

bostoncollegegen = api.search_comments(subreddit = "bostoncollege", limit = 500000)
bostoncollegedata = pd.DataFrame([i.d_ for i in bostoncollegegen])

ubcgen = api.search_comments(subreddit = "ubc", limit = 500000)
ubcdata = pd.DataFrame([i.d_ for i in ubcgen])

georgetowngen = api.search_comments(subreddit = "georgetown", limit = 500000)
georgetowndata = pd.DataFrame([i.d_ for i in georgetowngen])

uscgen = api.search_comments(subreddit = "usc", limit = 500000)
uscdata = pd.DataFrame([i.d_ for i in uscgen])

universityofwarwickgen = api.search_comments(subreddit = "universityofwarwick", limit = 500000)
universityofwarwickdata = pd.DataFrame([i.d_ for i in universityofwarwickgen])

uoftgen = api.search_comments(subreddit = "uoft", limit = 500000)
uoftdata = pd.DataFrame([i.d_ for i in uoftgen])

uongen = api.search_comments(subreddit = "uon", limit = 500000)
uondata = pd.DataFrame([i.d_ for i in uongen])


unidata = pd.concat([harvarddata, mitdata, berkeleydata, uchicagodata, princetondata, stanforddata, oxfordunidata, columbiadata, brownudata, nyudata, yaledata, bostonudata, dartmouthdata, upenndata, ucsddata, ucldata, ucladata, northwesterndata, uwmadisondata, thelsedata, msudata, uofmdata, ucdavisdata, bostoncollegedata, ubcdata, georgetowndata, uscdata, universityofwarwickdata, uoftdata, uondata])
unibodydata = unidata.loc[:,["body"]]
In [ ]:
unidata = unidata.reset_index()
unidata.to_csv("unidata.csv")

Understanding the data

After we download the data we collected in the previous section we import it for analysis.

In [ ]:
#Set directory as necessary
from google.colab import drive
drive.mount("/content/drive")

unidata = pd.read_csv("/content/drive/My Drive/unidata.csv")
unidata.head()
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py:2718: DtypeWarning: Columns (0,2,3,5,6,7,8,10,11,12,13,14,15,19,21,23,24,26,27,29,30,34,37,38,39,43,44,45,49,50,51,52,53,54,55,56) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
Out[ ]:
Unnamed: 0 index all_awardings associated_award author author_flair_background_color author_flair_css_class author_flair_richtext author_flair_template_id author_flair_text author_flair_text_color author_flair_type author_fullname author_patreon_flair author_premium awarders body collapsed_because_crowd_control created_utc gildings id is_submitter link_id locked no_follow parent_id permalink retrieved_on score send_replies stickied subreddit subreddit_id total_awards_received treatment_tags created edited distinguished author_cakeday steward_reports updated_utc gilded author_created_utc can_gild collapsed collapsed_reason controversiality nest_level reply_delay subreddit_name_prefixed subreddit_type user_removed mod_removed removal_reason score_hidden rte_mode permalink_url
0 0 0.0 [] NaN everythingharam NaN NaN [] NaN NaN NaN text t2_26tskkl False False [] This is an interesting comment. Care to elabor... NaN 1.586872e+09 {} fndkuxz True t3_g1196c False True t1_fndkrbe /r/Harvard/comments/g1196c/college_is_making_m... 1.5869e+09 4.0 True False Harvard t5_2qkkm 0.0 [] 1.586872e+09 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1 1.0 [] NaN DakwonBrown NaN NaN [] NaN NaN NaN text t2_657iry16 False False [] Are you a director in your life or just an actor? NaN 1.586872e+09 {} fndkrbe False t3_g1196c False True t3_g1196c /r/Harvard/comments/g1196c/college_is_making_m... 1.5869e+09 6.0 True False Harvard t5_2qkkm 0.0 [] 1.586872e+09 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2 2.0 [] NaN coolcatsarecold NaN NaN [] NaN NaN NaN text t2_7rk21 False False [] You over explaining my guy NaN 1.586872e+09 {} fndkoth False t3_g08gzx False True t1_fndizbt /r/Harvard/comments/g08gzx/the_quad_only_has_g... 1.5869e+09 0.0 True False Harvard t5_2qkkm 0.0 [] 1.586872e+09 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 3 3.0 [] NaN Thoreau80 NaN NaN [] NaN NaN NaN text t2_tekuu False False [] So do something else. Do college when you act... NaN 1.586872e+09 {} fndjxhl False t3_g1196c False True t3_g1196c /r/Harvard/comments/g1196c/college_is_making_m... 1.5869e+09 -1.0 True False Harvard t5_2qkkm 0.0 [] 1.586872e+09 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 4 4.0 [] NaN starfleet_rambo NaN NaN [] NaN NaN NaN text t2_2pzsvi08 False False [] I could say MD but either way the quad is clos... NaN 1.586871e+09 {} fndizbt False t3_g08gzx False True t1_fnd5k6d /r/Harvard/comments/g08gzx/the_quad_only_has_g... 1.5869e+09 1.0 True False Harvard t5_2qkkm 0.0 [] 1.586871e+09 1.586876e+09 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

We select columns that we believe will be useful for the analysis: the name of the author that made the comment, the body of the comment, the score of the comment and the subreddit the comment was made on.

In [ ]:
uniuseful = unidata[["author", "body", "score", "subreddit"]]
uniuseful.head()
Out[ ]:
author body score subreddit
0 everythingharam This is an interesting comment. Care to elabor... 4.0 Harvard
1 DakwonBrown Are you a director in your life or just an actor? 6.0 Harvard
2 coolcatsarecold You over explaining my guy 0.0 Harvard
3 Thoreau80 So do something else. Do college when you act... -1.0 Harvard
4 starfleet_rambo I could say MD but either way the quad is clos... 1.0 Harvard
In [ ]:
uniuseful = uniuseful.rename(columns={"body": "comment_body"}) #one of the tokens is "body"

We list the unique subreddits in the dataset and create a visualization that shows the number of comments in the dataset by subreddit.

In [ ]:
uniuseful.subreddit.unique()
Out[ ]:
array(['Harvard', 'mit', 'berkeley', nan, '3.0', '25238.0', '2.0', '1.0',
       '4.0', 'uchicago', 'princeton', 'stanford', 'oxforduni',
       'columbia', 'BrownU', 'nyu', 'yale', 'BostonU', 'dartmouth',
       'UPenn', 'UCSD', 'UCL', 'ucla', 'Northwestern', 'UWMadison',
       'TheLse', 'msu', '902.0', 'uofm', 'UCDavis', 'bostoncollege',
       'UBC', 'georgetown', 'USC', 'UniversityOfWarwick', 'UofT', 'UoN'],
      dtype=object)
In [ ]:
fig, ax = plt.subplots(figsize = (20, 20))
groupuni = uniuseful.groupby("subreddit").count()["comment_body"]
groupuni.plot.bar(ax = ax)

ax.set_ylabel("Number of Comments")
ax.set_xlabel("Subreddits")
ax.set_title("Number of Comments in Dataset by Subreddit")
ax.set_facecolor("white")
y_ticks = np.arange(0, 550000, 50000)
ax.set_yticks(y_ticks)
Out[ ]:
[<matplotlib.axis.YTick at 0x7f0405cbd278>,
 <matplotlib.axis.YTick at 0x7f0405cadc18>,
 <matplotlib.axis.YTick at 0x7f0405c46080>,
 <matplotlib.axis.YTick at 0x7f0405ba67f0>,
 <matplotlib.axis.YTick at 0x7f0405ba1ef0>,
 <matplotlib.axis.YTick at 0x7f0405b8ba58>,
 <matplotlib.axis.YTick at 0x7f0405ba6390>,
 <matplotlib.axis.YTick at 0x7f0405baf160>,
 <matplotlib.axis.YTick at 0x7f0405baf6a0>,
 <matplotlib.axis.YTick at 0x7f0405bafbe0>,
 <matplotlib.axis.YTick at 0x7f0405bb4160>]

We ensure that all the comments are strings. We note that as of now urls are included in the comments. We intend on using "regular expression" to get rid of the urls. We convert all the strings to lowercase.

In [ ]:
uniuseful["comment_body"] = uniuseful["comment_body"].apply(str) #convert all text in body to string

#note urls included. FIXME use "regular expression" to get rid of urls.

uniuseful = uniuseful.applymap(lambda i:i.lower() if type(i) == str else i) #lowercase strings

Applying model developed in Wu (2018)

In this section, the crux of our analysis, we apply the model developed by Alice H. Wu (2018) to our dataset of Reddit comments. Wu (2018) studied posts on the Economics Job Market Rumors forum. Among other things, Wu (2018) runs a lasso logistic regression model in order to find the words that are predictive of a post being classified as a female or male post. The author classifies posts as female or male based on classifiers that they develop from a list of the ten thousand most frequent words in the posts. Our analysis here is based on Wu (2018) where we develop a similar (if not the same) set up and model on our dataset of Reddit comments.

As does Wu (2018) we find the most frequent ten thousand words in the comments. We choose not to remove stopwords as preserving pronouns among other stopwords is important for classification.

In [ ]:
vectorizer = CountVectorizer(max_features = 10000) 
X = vectorizer.fit_transform(uniuseful.comment_body)

Here $X$ is a sparse matrix of word counts such that the rows of $X$ are the comments and the columns of $X$ are the frequent ten thousand words. Therefore each item in cell $(x, y)$ is the count of the word associated with column $y$ in comment associated with row $x$.

In [ ]:
X.shape
Out[ ]:
(2953040, 10000)
In [ ]:
vocab = dict(vectorizer.vocabulary_) #dictionary of vocabulary
vocab = pd.Series(vocab)
vocab = vocab.to_frame()
In [ ]:
vocab.reset_index(level=0, inplace=True)
In [ ]:
vocab = vocab.rename(columns = {"index": "word", 0: "position_in_sparse_matrix"})

We create a dataframe of the identified frequent words as measured by word count. This dataframe "vocab" includes the word and the position of the word in the sparse matrix $X$. We note that the positions of the words in the sparse matrix $X$ are arranged alpha-numerically (alphabetically with numbers before words).

In [ ]:
vocab.head()
Out[ ]:
word position_in_sparse_matrix
0 this 9000
1 is 5004
2 an 808
3 interesting 4916
4 comment 2129

We manually go through the vocab dataframe and as does Wu (2018) we identify candidate words that we use as classifiers. These include pronouns, common identifiers, and names commonly attributed to the genders. There is significant overlap in the words Wu (2018) uses as classifiers and us but differences still emerge due to differences between the top ten thousand words in the Economics Job Market Rumors forum dataset and the top ten thousand words in the Reddit comment dataset.

In [ ]:
#manually go through vocab and find words that can be used as classifiers
female_classifier_list = [#Pronouns
                          "her",
                          "herself",
                          "she",
                          "shes",
                          #Identifiers
                          "daughter",
                          "female",
                          "females",
                          "gf",
                          "girl",
                          "girlfriend",
                          "girls",
                          "ladies",
                          "lady",
                          "mom",
                          "mommy",
                          "mother",
                          "sister",
                          "sisters",
                          "wife",
                          "woman",
                          "women",
                          #Names
                          "anne",
                          "barbara",
                          "hillary",
                          "katehi",
                          "karen",
                          "liz",
                          "marina",
                          "mary",
                          "monica"
                          ]

male_classifier_list = [#Pronouns
                        "he",
                        "hes",
                        "him",
                        "himself",
                        "his",
                        #Identifiers
                        "boy",
                        "boyfriend",
                        "boys",
                        "bro",
                        "bros",
                        "brother",
                        "brothers",
                        "bruh",
                        "dad",
                        "daddy",
                        "dude",
                        "dudes",
                        "father",
                        "guys",
                        "husband",
                        "male",
                        "males",
                        "man",
                        "men",
                        "mr",
                        "son",
                        "uncle",
                        #Names
                        "adam",
                        "adams",
                        "albert", 
                        "alex", 
                        "alfonso", 
                        "andrew", 
                        "anthony",
                        "ben",
                        "bernie", 
                        "bob",
                        "bradley",
                        "brian", 
                        "brody",
                        "chris",
                        "charles",
                        "dan", 
                        "daniel", 
                        "dave", 
                        "david", 
                        "donald", 
                        "doug", 
                        "drake", 
                        "durant", 
                        "dwight",
                        "eric", 
                        "evans",
                        "gary",
                        "gateman",
                        "george", 
                        "gordon",
                        "gregor",
                        "harry", 
                        "jack", 
                        "james", 
                        "jay", 
                        "jeff",
                        "jerry",
                        "jim", 
                        "jimmy",
                        "john", 
                        "johnny",
                        "josh", 
                        "justin",
                        "kevin",
                        "larry", 
                        "lawrence",
                        "lorenzo", 
                        "louis",
                        "martin",  
                        "matt",  
                        "michael", 
                        "mike",
                        "milo",
                        "nick",
                        "ono",
                        "owen", 
                        "paul",
                        "pete",
                        "peter", 
                        "ralph", 
                        "ralphs", 
                        "richard", 
                        "robert",
                        "ron",
                        "ross",
                        "ryan",
                        "santa",
                        "sean",
                        "simon", 
                        "spencer",
                        "stephen", 
                        "steve", 
                        "stewart", 
                        "tom",
                        "thomas", 
                        "william", 
                        "wilson"
                        ]
all_classifier_list = female_classifier_list + male_classifier_list

As does Wu (2018), we create a column Female which takes the value of 1 if it includes a strictly postiive number of female classifiers, 0 if it includes a strictly positive number of male classifiers, else is null.

In [ ]:
vocab["female"] = np.nan #add new column female = 1 if female classifier, female = 0 if male classifier, female = np.nan otherwise

for i in vocab.index:
  if vocab.word[i] in female_classifier_list:
    vocab.female[i] = 1
  elif vocab.word[i] in male_classifier_list:
    vocab.female[i] = 0
  else: vocab.female[i] = np.nan 
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
In [ ]:
female_words = vocab.loc[vocab["female"] == 1] #find position in sparse matrix of female classifiers
In [ ]:
male_words = vocab.loc[vocab["female"] == 0] #find position in sparse matrix of male classifiers
In [ ]:
female_pos_list = list(female_words.position_in_sparse_matrix) #create list of positions of female classifiers
In [ ]:
male_pos_list = list(male_words.position_in_sparse_matrix) #create list of positions of male classifiers
In [ ]:
female_post_word_count = X[:, female_pos_list].sum(axis = 1) #matrix of how many female classifiers are in each post
In [ ]:
male_post_word_count = X[:, male_pos_list].sum(axis = 1) #matrix of how many male classifiers are in each post
In [ ]:
female_post_word_count_data = pd.DataFrame(female_post_word_count) 
In [ ]:
female_post_word_count_data = female_post_word_count_data.rename(columns = {0: "female_post_word_count_col"})
In [ ]:
male_post_word_count_data = pd.DataFrame(male_post_word_count)
In [ ]:
male_post_word_count_data = male_post_word_count_data.rename(columns = {0: "male_post_word_count_col"})
In [ ]:
uniuseful_w_fem_mal_counts = pd.concat([uniuseful, female_post_word_count_data, male_post_word_count_data], axis = 1)
In [ ]:
uniuseful_w_fem_mal_counts["post_incl_fem_class"] = uniuseful_w_fem_mal_counts["female_post_word_count_col"] > 0 #create column of booleans such that True if there is a strictly positive number of female classifiers in the post else False
In [ ]:
uniuseful_w_fem_mal_counts["post_incl_mal_class"] = uniuseful_w_fem_mal_counts["male_post_word_count_col"] > 0 #create column of booleans such that True if there is a strictly positive number of male classifiers in the post else False
In [ ]:
uniuseful_w_fem_mal_counts["post_gendered_lang"] = (uniuseful_w_fem_mal_counts["post_incl_fem_class"] == True) | (uniuseful_w_fem_mal_counts["post_incl_mal_class"] == True)  #create column of booleans such that True if includes a gender classifier else False

Therefore, based on the set up of Wu (2018), we have a dataset of Reddit comments that includes the author of the comments ("author"), the body of the comments ("comment_body"), the score of the comments ("score"), the word count of words in the list of female classifiers in each comment("female_post_word_count_col"), the word count of words in the list of male classifiers in each comment("male_post_word_count_col"), a column of booleans indicating if the comment includes a strictly positive word count of female classifiers ("post_incl_fem_class"), a column of booleans indicating if the comment includes a strictly positive word count of male classifiers ("post_incl_mal_class"), a column of booleans indicating if the comment includes "gendered language" i.e. a strictly positive word count of female classifiers OR a strictly positive word count of male classifiers ("post_gendered_lang").

In [ ]:
uniuseful_w_fem_mal_counts.head()
Out[ ]:
author comment_body score subreddit female_post_word_count_col male_post_word_count_col post_incl_fem_class post_incl_mal_class post_gendered_lang
0 everythingharam this is an interesting comment. care to elabor... 4.0 harvard 0 0 False False False
1 dakwonbrown are you a director in your life or just an actor? 6.0 harvard 0 0 False False False
2 coolcatsarecold you over explaining my guy 0.0 harvard 0 0 False False False
3 thoreau80 so do something else. do college when you act... -1.0 harvard 0 0 False False False
4 starfleet_rambo i could say md but either way the quad is clos... 1.0 harvard 0 0 False False False
In [ ]:
uniuseful_gendered = uniuseful_w_fem_mal_counts.loc[uniuseful_w_fem_mal_counts["post_gendered_lang"] == True] #dataframe of posts that include strictly positive gender classifier
In [ ]:
uniuseful_gendered["post_incl_only_fem_class"] = (uniuseful_gendered["post_incl_fem_class"] == True) & (uniuseful_gendered["post_incl_mal_class"] == False)
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
In [ ]:
uniuseful_gendered["post_incl_only_mal_class"] = (uniuseful_gendered["post_incl_fem_class"] == False) & (uniuseful_gendered["post_incl_mal_class"] == True)
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
In [ ]:
uniuseful_gendered["post_incl_both_class"] = (uniuseful_gendered["post_incl_fem_class"] == True) & (uniuseful_gendered["post_incl_mal_class"] == True)
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
In [ ]:
uniuseful_gendered["post_incl_only_one_class"] = (uniuseful_gendered["post_incl_only_fem_class"] == True) | (uniuseful_gendered["post_incl_only_mal_class"] == True)
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
In [ ]:
uniuseful_gendered["post_incl_fem_class_int"] = uniuseful_gendered.post_incl_fem_class.astype(int)
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

As does Wu (2018) we create a dataframe of the subset of Reddit comments that include gendered language that is a strictly positive word count of gender classifiers where in addition to the aforementioned columns, we also include a column of booleans indicating if the comments included only a strictly positive word count of female classifiers ("post_incl_only_fem_class"), a column of booleans indicating if the comments included only a strictly positive word count of male classifiers ("post_incl_only_mal_class"), a column of booleans indicating if the comments included a strictly positive word count of both female and male classifiers ("post_incl_both_class"),a column of booleans indicating fi the comments included a strictly positive word count of either female or male classifiers but not both ("post_incl_only_one_class"), a column of "post_incl_only_fem_class" as an integer i.e. True = 1, False = 0 ("post_incl_only_fem_class_int").

In [ ]:
uniuseful_gendered.head()
Out[ ]:
author comment_body score subreddit female_post_word_count_col male_post_word_count_col post_incl_fem_class post_incl_mal_class post_gendered_lang post_incl_only_fem_class post_incl_only_mal_class post_incl_both_class post_incl_only_one_class post_incl_fem_class_int
7 throwaway_he1p dude, your post history is so weird. stop adve... 1.0 harvard 0 1 False True True False True False True 0
10 coolcatsarecold man, if you wanted to give an example of why t... 0.0 harvard 0 1 False True True False True False True 0
17 midwest88 your story is either inaccurate or you're tell... 1.0 harvard 0 2 False True True False True False True 0
19 bigred1636 this is an ad hominem attack on people who don... 1.0 harvard 4 1 True True True False False True False 1
21 zan_jr i think she means that by not endorsing biden,... 5.0 harvard 1 1 True True True False False True False 1

As does Wu (2018), we identify the rows of the sparse matrix $X$ that correspond to comments that include a strictly positive word count of gender classifiers.

In [ ]:
X_gendered = X[list(uniuseful_gendered.index), :]

Next as does Wu (2018) we split the comments that include either a female or male classifier but not both into a training and testing set down a 75-25 split i.e. 75% of comments that include either a female or male classifier but not both are in the training set and 25% of comments that include either a female or male classifier but not both are in the testing set.

In [ ]:
train_one_class, test_one_class = model_selection.train_test_split(uniuseful_gendered.loc[uniuseful_gendered["post_incl_only_one_class"] == True]) #split dataset with either male or female classifiers but not both into training and testing

As does Wu (2018) we create another testing set of comments that include both female and male classifiers. The purpose of this is to reclassify these posts as female or male posts. Wu (2018) uses a threshold based on the optimal p-value in order to do that classification.

In [ ]:
test_both_class = uniuseful_gendered.loc[uniuseful_gendered["post_incl_both_class"] == True]

We find the rows in the sparse matrix $X$ that correspond to the rows subsetted to be in the training set.

In [ ]:
X_train_one_class = X[list(train_one_class.index), :]

We find the rows in the sparse matrix $X$ that correspond to the rows subsetted to be in the first of the two testing sets we mentioned above.

In [ ]:
X_test_one_class = X[list(test_one_class.index), :]

We find the rows in the sparse matrix $X$ that correspond to the rows subsetted to be in the testing set that need to be reclassified as female/male posts.

In [ ]:
X_test_both_class = X[list(test_both_class.index), :]

We find the words in the dataframe of the ten thousand highest word counts that are not used as classifiers.

In [ ]:
non_classifier_words = vocab.loc[vocab["female"].isnull()]

We find their positions in the sparse matrix $X$.

In [ ]:
non_classifier_pos_list = list(non_classifier_words.position_in_sparse_matrix)

We create a sparse matrix $X\_train\_one\_class\_non\_classifier$ such that the rows are comments that include a strictly positive word count of gender classifiers and are in the training set and the columns are words that are not used as classifiers; likewise for the two testing subsets.

In [ ]:
X_train_one_class_non_classifier = X_train_one_class[:, non_classifier_pos_list]
In [ ]:
X_test_one_class_non_classifier = X_test_one_class[:, non_classifier_pos_list]
In [ ]:
X_test_both_class_non_classifier = X_test_both_class[:, non_classifier_pos_list]

We create the $y$ variable for the training set from the column "post_incl_fem_class_int" which takes the value 1 if the comment includes a strictly positive word count of female classifiers and 0 if the comment does not include a strictly positive word count of female classifiers.

In [ ]:
y_train_one_class = train_one_class.loc[:, "post_incl_fem_class_int"].to_numpy()

As does Wu (2018) we fit a lasso logistic regression model with a five fold cross validation and a $\mathcal{l}1$ penalty. In future drafts we intend on using grid search to explore the optimal level of the inverse regularization parameter.

In [ ]:
logistic_cv_model = linear_model.LogisticRegressionCV(cv = 5, penalty = "l1", solver = "liblinear", refit = True, n_jobs = -1, verbose = 10) 
In [ ]:
logistic_cv_model.fit(X_train_one_class_non_classifier, y_train_one_class) 
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  1.1min remaining:  1.7min
[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed:  1.2min remaining:   46.4s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  1.7min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  1.7min finished
[LibLinear]
Out[ ]:
LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='auto', n_jobs=-1, penalty='l1',
                     random_state=None, refit=True, scoring=None,
                     solver='liblinear', tol=0.0001, verbose=10)

As does Wu (2018) we find the coefficients that emerge from the model and create a dataframe that matches the coefficient with the word in question.

In [ ]:
logistic_cv_model.coef_
Out[ ]:
array([[-0.01114809,  0.00274474,  0.00513668, ...,  0.        ,
         0.        ,  0.        ]])
In [ ]:
coefficients = pd.DataFrame(logistic_cv_model.coef_.T)
In [ ]:
foo = non_classifier_words.reset_index()
In [ ]:
foo_w_coefficients = pd.concat([foo, coefficients], axis = 1)
In [ ]:
foo_w_coefficients = foo_w_coefficients.rename(columns = {0: "coefficients"})
In [ ]:
foo_w_coefficients_sorted_fem_sort = foo_w_coefficients.sort_values("coefficients", ascending = False)
In [ ]:
foo_w_coefficients_sorted_mal_sort = foo_w_coefficients.sort_values("coefficients", ascending = True)

We list the 25 words with the highest coefficients.

In [ ]:
foo_w_coefficients_sorted_fem_sort.head(25)
Out[ ]:
index word position_in_sparse_matrix female coefficients
5826 5889 sorority 8344 NaN 1.904436
9458 9582 waters 9710 NaN 1.468953
8359 8462 raped 7252 NaN 1.111328
2727 2753 cute 2618 NaN 1.082182
9537 9663 napolitano 6011 NaN 1.031107
1786 1806 attractive 1083 NaN 0.952228
6803 6883 abortion 485 NaN 0.707105
2842 2871 barnard 1202 NaN 0.689501
9453 9577 coulter 2440 NaN 0.687675
6111 6182 clinton 2029 NaN 0.669793
3857 3890 dating 2670 NaN 0.663514
6757 6837 ratio 7270 NaN 0.641145
3558 3589 gender 4070 NaN 0.628565
3488 3519 sororities 8343 NaN 0.626047
2012 2032 counselor 2445 NaN 0.562392
3163 3193 sex 8034 NaN 0.562340
2626 2652 hot 4562 NaN 0.536652
7033 7114 chick 1916 NaN 0.506128
6815 6895 sexist 8036 NaN 0.492410
2767 2796 relationship 7442 NaN 0.450580
8463 8567 yoga 9974 NaN 0.445040
2142 2165 bitch 1379 NaN 0.425406
3259 3289 assaulted 1010 NaN 0.421444
9141 9258 pepper 6579 NaN 0.412672
2238 2261 date 2667 NaN 0.411576
In [ ]:
foo_w_coefficients_sorted_mal_sort["negative coefficients"] = foo_w_coefficients_sorted_mal_sort["coefficients"] * -1

We list the 25 words with the lowest coefficients. We have created a column of negative coefficients to help us generate the word cloud below.

In [ ]:
foo_w_coefficients_sorted_mal_sort.head(25)
Out[ ]:
index word position_in_sparse_matrix female coefficients negative coefficients
7504 7593 cruz 2557 NaN -1.434057 1.434057
7508 7597 shapiro 8058 NaN -1.075589 1.075589
3739 3772 st 8459 NaN -0.966781 0.966781
19 19 guy 4283 NaN -0.885609 0.885609
7736 7826 potter 6836 NaN -0.833271 0.833271
8613 8721 peterson 6621 NaN -0.644096 0.644096
6103 6173 smash 8259 NaN -0.635671 0.635671
1302 1318 hey 4445 NaN -0.617124 0.617124
5311 5365 coach 2058 NaN -0.613904 0.613904
7644 7733 fraternity 3947 NaN -0.586988 0.586988
2969 2999 joe 5070 NaN -0.538663 0.538663
608 616 troll 9227 NaN -0.517101 0.517101
9375 9496 hilfinger 4462 NaN -0.502146 0.502146
4561 4606 sanders 7810 NaN -0.498680 0.498680
3700 3733 nah 6000 NaN -0.492143 0.492143
9388 9509 sahai 7790 NaN -0.470642 0.470642
5610 5666 legend 5306 NaN -0.444860 0.444860
291 297 trump 9238 NaN -0.436579 0.436579
4825 4874 scott 7900 NaN -0.409858 0.409858
3040 3070 aw 1125 NaN -0.408276 0.408276
7265 7350 straw 8584 NaN -0.383624 0.383624
1891 1911 kid 5143 NaN -0.379644 0.379644
7670 7759 player 6728 NaN -0.372034 0.372034
746 756 thanks 8953 NaN -0.370819 0.370819
6015 6083 players 6729 NaN -0.357782 0.357782

As does Wu (2018) we find the predicted probability of a comment being classified as female. Wu (2018) uses these probabilities to reclassify posts that include both female and male classifiers as either female or male.

In [ ]:
train_predicted_prob = logistic_cv_model.predict_proba(X_train_one_class_non_classifier)
y_predicted_prob_post_fem_train_one_class_non_classifier = train_predicted_prob[:, 1]
In [ ]:
test_one_predicted_prob = logistic_cv_model.predict_proba(X_test_one_class_non_classifier)
y_predicted_prob_post_fem_test_one_class_non_classifier = test_one_predicted_prob[:, 1]
In [ ]:
test_both_predicted_prob = logistic_cv_model.predict_proba(X_test_both_class_non_classifier)
y_predicted_prob_post_fem_test_both_class_non_classifier = test_both_predicted_prob [:, 1]

We create a word cloud of words that have the highest coefficient values. These are words that have the highest predictive power of a comment being classified as female. In future drafts as does Wu (2018) we intend on finding the marginal effect associated with each word.

In [ ]:
word_data_fem = dict(zip(foo_w_coefficients_sorted_fem_sort["word"].tolist(), foo_w_coefficients_sorted_fem_sort["coefficients"].tolist()))
cloud_fem = WordCloud(max_words = 250, background_color = "white", width = 1600, height = 1600, max_font_size = 150, min_font_size = 30, colormap = "plasma").generate_from_frequencies(word_data_fem)
In [ ]:
plt.figure(figsize = (20, 20))
plt.imshow(cloud_fem, interpolation="bilinear")
plt.axis("off")
Out[ ]:
(-0.5, 1599.5, 1599.5, -0.5)