In this notebook, we seek to understand university culture through comments made on the social media platform Reddit, focusing our analysis on university subreddits, which are frequently used by prospective and current students at both the undergraduate and graduate level. Alice H. Wu (2018) studies culture in academia through the Economics Job Market Rumors forum. Both Reddit and Economics Job Market Rumors offer users a degree of anonymity; as a result, both platforms may also be used by people outside the academic communities in question. While Wu (2018) notes that Economics Job Market Rumors was initially created for discussions of job market placements, interviews, and outcomes, her aim is to determine whether the forum exhibits a distinct stereotypical culture. The author finds that the words "hotter", "pregnant", and "plow" are the most predictive of a post being classified as female, while "homo", "testosterone", and "chapters" are predictive of a post being classified as male. Beyond the heteronormativity and misogyny that emerge from the word list, Wu (2018) finds that professional words such as "supervisor" and "adviser" are predictive of a post discussing a male. We apply the same lasso-logistic model and use methods similar to Wu (2018) to ask whether that culture permeates social media forums used by members of the academic community.

We are aware that, as Wu (2018) discusses, both Reddit and Economics Job Market Rumors are moderated communities: posts that do not meet the respective community guidelines may be subject to moderator action, including deletion. Our analysis is therefore limited in that the comments in the dataset we extract are not necessarily an unbiased reflection of university culture. Nor is our sample random: not every student or faculty member uses their university's subreddit, and subreddits are rarely if ever the primary means of university-level communication.
#Install on Colab or as necessary
! pip install qeds fiona geopandas xgboost gensim folium pyLDAvis descartes psaw pyarrow
#Not all imported packages are necessarily used in the analysis
import pandas as pd
import numpy as np
import patsy
import json
import os
from bs4 import BeautifulSoup
import time
import psaw
from google.colab import files
import nltk
import string
import matplotlib.pyplot as plt
%matplotlib inline
# activate plot theme
import qeds
qeds.themes.mpl_style();
import pyarrow.feather
import scipy
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import model_selection
from sklearn import linear_model
from sklearn.inspection import plot_partial_dependence
from wordcloud import WordCloud
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
In this section, we leverage PSAW, a wrapper for the Pushshift API, to build a dataset of fewer than 3 million Reddit comments made on the subreddits of universities whose economics departments were in the top 35 institutions as of March 2020, according to rankings published by RePEc. You can find the ranking list here. The choice of ranking list was admittedly arbitrary; for this analysis we focused on the top 35 institutions. We could not include Paris School of Economics, Toulouse School of Economics, Barcelona Graduate School of Economics, Tilburg University, or Graduate School of Business - Columbia University (Columbia University has been entered once). We requested up to 500,000 comments per subreddit. We emphasize that we did not obtain 500,000 comments for every subreddit, so some subreddits will be under-represented in the analysis unless weights are attached.
We strongly recommend running this section in a separate session, both to avoid burdening the servers and to use RAM efficiently. We ran this section on Tuesday, April 14, 2020; the analysis in this notebook is based on data collected on that day.
api = psaw.PushshiftAPI()
subreddits = ["harvard", "mit", "berkeley", "uchicago", "princeton", "stanford",
              "oxforduni", "columbia", "brownu", "nyu", "yale", "bostonu",
              "dartmouth", "upenn", "ucsd", "ucl", "ucla", "northwestern",
              "uwmadison", "thelse", "msu", "uofm", "ucdavis", "bostoncollege",
              "ubc", "georgetown", "usc", "universityofwarwick", "uoft", "uon"]
subreddit_frames = {}
for sub in subreddits:
    gen = api.search_comments(subreddit = sub, limit = 500000)
    subreddit_frames[sub] = pd.DataFrame([c.d_ for c in gen]) #one row per comment
unidata = pd.concat(subreddit_frames.values())
unibodydata = unidata.loc[:,["body"]] #comment text only, kept for convenience
unidata = unidata.reset_index()
unidata.to_csv("unidata.csv")
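Since pyarrow.feather was imported above, an optional sketch: Feather reloads much faster than CSV for a dataset of this size. The filename is illustrative, and columns containing nested objects may need to be dropped before writing, so we keep the call commented out.
#Optional: Feather is much faster to reload than CSV (filename illustrative).
#Columns containing nested objects may need to be dropped before writing.
#pyarrow.feather.write_feather(unidata, "unidata.feather")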
After downloading the data collected in the previous section, we import it for analysis.
#Set directory as necessary
from google.colab import drive
drive.mount("/content/drive")
unidata = pd.read_csv("/content/drive/My Drive/unidata.csv")
unidata.head()
We select the columns we believe will be useful for the analysis: the name of the author who made the comment, the body of the comment, the score of the comment, and the subreddit on which the comment was made.
uniuseful = unidata[["author", "body", "score", "subreddit"]]
uniuseful.head()
uniuseful = uniuseful.rename(columns={"body": "comment_body"}) #renamed because "body" is itself one of the tokens
We list the unique subreddits in the dataset and create a visualization that shows the number of comments in the dataset by subreddit.
uniuseful.subreddit.unique()
fig, ax = plt.subplots(figsize = (20, 20))
groupuni = uniuseful.groupby("subreddit").count()["comment_body"]
groupuni.plot.bar(ax = ax)
ax.set_ylabel("Number of Comments")
ax.set_xlabel("Subreddits")
ax.set_title("Number of Comments in Dataset by Subreddit")
ax.set_facecolor("white")
y_ticks = np.arange(0, 550000, 50000)
ax.set_yticks(y_ticks)
We ensure that all the comments are strings and convert them to lowercase. Note that, as of now, URLs are included in the comments; we intend to use regular expressions to remove them (a sketch follows the cell below).
uniuseful["comment_body"] = uniuseful["comment_body"].apply(str) #convert all text in body to string
#note: URLs are still included. FIXME: use regular expressions to remove them (see sketch below).
uniuseful = uniuseful.applymap(lambda i:i.lower() if type(i) == str else i) #lowercase strings
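A minimal sketch of the URL removal we have in mind, kept commented out so it does not alter the analysis below. The pattern is our assumption and does not catch bare domains:
url_pattern = r"https?://\S+|www\.\S+" #assumed pattern; misses bare domains like example.com
#uniuseful["comment_body"] = uniuseful["comment_body"].str.replace(url_pattern, "", regex=True)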
In this section, the crux of our analysis, we apply the model developed by Alice H. Wu (2018) to our dataset of Reddit comments. Wu (2018) studied posts on the Economics Job Market Rumors forum and, among other things, ran a lasso logistic regression to find the words most predictive of a post being classified as female or male. The author classifies posts as female or male based on classifier words drawn from a list of the ten thousand most frequent words in the posts. Our analysis here follows Wu (2018): we develop a similar (if not identical) setup and model on our dataset of Reddit comments.
As does Wu (2018), we find the ten thousand most frequent words in the comments. We choose not to remove stopwords, since preserving pronouns (among other stopwords) is important for classification.
vectorizer = CountVectorizer(max_features = 10000)
X = vectorizer.fit_transform(uniuseful.comment_body)
Here $X$ is a sparse matrix of word counts whose rows are the comments and whose columns are the ten thousand frequent words. The entry in cell $(x, y)$ is therefore the count of the word associated with column $y$ in the comment associated with row $x$.
X.shape
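To make the structure concrete, the count of a particular word in a particular comment can be read off $X$ directly. A small sketch (the word "professor" is just an illustration and may or may not be among the top ten thousand):
col = vectorizer.vocabulary_.get("professor") #column index of an example word, if present
if col is not None:
    print(X[0, col]) #count of "professor" in the first comment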
vocab = dict(vectorizer.vocabulary_) #dictionary of vocabulary
vocab = pd.Series(vocab)
vocab = vocab.to_frame()
vocab.reset_index(level=0, inplace=True)
vocab = vocab.rename(columns = {"index": "word", 0: "position_in_sparse_matrix"})
We create a dataframe, "vocab", of the identified frequent words (as measured by word count). It includes each word and the word's column position in the sparse matrix $X$. Note that these positions are assigned alpha-numerically (alphabetically, with numbers sorting before words).
vocab.head()
We manually go through the vocab dataframe and, as does Wu (2018), identify candidate words to use as classifiers. These include pronouns, common identifiers, and names commonly associated with each gender. There is significant overlap between the words Wu (2018) uses as classifiers and ours, but differences still emerge because the top ten thousand words in the Economics Job Market Rumors dataset differ from the top ten thousand words in the Reddit comment dataset.
#manually go through vocab and find words that can be used as classifiers
female_classifier_list = [#Pronouns
"her",
"herself",
"she",
"shes",
#Identifiers
"daughter",
"female",
"females",
"gf",
"girl",
"girlfriend",
"girls",
"ladies",
"lady",
"mom",
"mommy",
"mother",
"sister",
"sisters",
"wife",
"woman",
"women",
#Names
"anne",
"barbara",
"hillary",
"katehi",
"karen",
"liz",
"marina",
"mary",
"monica"
]
male_classifier_list = [#Pronouns
"he",
"hes",
"him",
"himself",
"his",
#Identifiers
"boy",
"boyfriend",
"boys",
"bro",
"bros",
"brother",
"brothers",
"bruh",
"dad",
"daddy",
"dude",
"dudes",
"father",
"guys",
"husband",
"male",
"males",
"man",
"men",
"mr",
"son",
"uncle",
#Names
"adam",
"adams",
"albert",
"alex",
"alfonso",
"andrew",
"anthony",
"ben",
"bernie",
"bob",
"bradley",
"brian",
"brody",
"chris",
"charles",
"dan",
"daniel",
"dave",
"david",
"donald",
"doug",
"drake",
"durant",
"dwight",
"eric",
"evans",
"gary",
"gateman",
"george",
"gordon",
"gregor",
"harry",
"jack",
"james",
"jay",
"jeff",
"jerry",
"jim",
"jimmy",
"john",
"johnny",
"josh",
"justin",
"kevin",
"larry",
"lawrence",
"lorenzo",
"louis",
"martin",
"matt",
"michael",
"mike",
"milo",
"nick",
"ono",
"owen",
"paul",
"pete",
"peter",
"ralph",
"ralphs",
"richard",
"robert",
"ron",
"ross",
"ryan",
"santa",
"sean",
"simon",
"spencer",
"stephen",
"steve",
"stewart",
"tom",
"thomas",
"william",
"wilson"
]
all_classifier_list = female_classifier_list + male_classifier_list
As does Wu (2018), we create a column "female" in vocab that takes the value 1 if the word is a female classifier, 0 if it is a male classifier, and is null otherwise.
vocab["female"] = np.nan #add new column female = 1 if female classifier, female = 0 if male classifier, female = np.nan otherwise
for i in vocab.index:
if vocab.word[i] in female_classifier_list:
vocab.female[i] = 1
elif vocab.word[i] in male_classifier_list:
vocab.female[i] = 0
else: vocab.female[i] = np.nan
female_words = vocab.loc[vocab["female"] == 1] #find position in sparse matrix of female classifiers
male_words = vocab.loc[vocab["female"] == 0] #find position in sparse matrix of male classifiers
female_pos_list = list(female_words.position_in_sparse_matrix) #create list of positions of female classifiers
male_pos_list = list(male_words.position_in_sparse_matrix) #create list of positions of male classifiers
female_post_word_count = X[:, female_pos_list].sum(axis = 1) #matrix of how many female classifiers are in each post
male_post_word_count = X[:, male_pos_list].sum(axis = 1) #matrix of how many male classifiers are in each post
female_post_word_count_data = pd.DataFrame(female_post_word_count)
female_post_word_count_data = female_post_word_count_data.rename(columns = {0: "female_post_word_count_col"})
male_post_word_count_data = pd.DataFrame(male_post_word_count)
male_post_word_count_data = male_post_word_count_data.rename(columns = {0: "male_post_word_count_col"})
uniuseful_w_fem_mal_counts = pd.concat([uniuseful, female_post_word_count_data, male_post_word_count_data], axis = 1)
uniuseful_w_fem_mal_counts["post_incl_fem_class"] = uniuseful_w_fem_mal_counts["female_post_word_count_col"] > 0 #create column of booleans such that True if there is a strictly positive number of female classifiers in the post else False
uniuseful_w_fem_mal_counts["post_incl_mal_class"] = uniuseful_w_fem_mal_counts["male_post_word_count_col"] > 0 #create column of booleans such that True if there is a strictly positive number of male classifiers in the post else False
uniuseful_w_fem_mal_counts["post_gendered_lang"] = (uniuseful_w_fem_mal_counts["post_incl_fem_class"] == True) | (uniuseful_w_fem_mal_counts["post_incl_mal_class"] == True) #create column of booleans such that True if includes a gender classifier else False
Therefore, following the setup of Wu (2018), we have a dataset of Reddit comments with the following columns:

- "author": the author of the comment
- "comment_body": the body of the comment
- "score": the score of the comment
- "subreddit": the subreddit the comment was made on
- "female_post_word_count_col": the count of female-classifier words in the comment
- "male_post_word_count_col": the count of male-classifier words in the comment
- "post_incl_fem_class": boolean, True if the comment includes a strictly positive count of female classifiers
- "post_incl_mal_class": boolean, True if the comment includes a strictly positive count of male classifiers
- "post_gendered_lang": boolean, True if the comment includes "gendered language", i.e. a strictly positive count of female OR male classifiers
uniuseful_w_fem_mal_counts.head()
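As a quick sanity check (our addition, not a step in Wu (2018)), the share of comments containing any gender classifier can be read off the boolean column:
uniuseful_w_fem_mal_counts["post_gendered_lang"].mean() #fraction of comments with gendered language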
uniuseful_gendered = uniuseful_w_fem_mal_counts.loc[uniuseful_w_fem_mal_counts["post_gendered_lang"] == True].copy() #dataframe of posts that include at least one gender classifier; .copy() avoids SettingWithCopyWarning when adding columns below
uniuseful_gendered["post_incl_only_fem_class"] = (uniuseful_gendered["post_incl_fem_class"] == True) & (uniuseful_gendered["post_incl_mal_class"] == False)
uniuseful_gendered["post_incl_only_mal_class"] = (uniuseful_gendered["post_incl_fem_class"] == False) & (uniuseful_gendered["post_incl_mal_class"] == True)
uniuseful_gendered["post_incl_both_class"] = (uniuseful_gendered["post_incl_fem_class"] == True) & (uniuseful_gendered["post_incl_mal_class"] == True)
uniuseful_gendered["post_incl_only_one_class"] = (uniuseful_gendered["post_incl_only_fem_class"] == True) | (uniuseful_gendered["post_incl_only_mal_class"] == True)
uniuseful_gendered["post_incl_fem_class_int"] = uniuseful_gendered.post_incl_fem_class.astype(int)
As does Wu (2018), we create a dataframe of the subset of Reddit comments that include gendered language, i.e. a strictly positive word count of gender classifiers. In addition to the aforementioned columns, we include:

- "post_incl_only_fem_class": boolean, True if the comment includes only female classifiers
- "post_incl_only_mal_class": boolean, True if the comment includes only male classifiers
- "post_incl_both_class": boolean, True if the comment includes both female and male classifiers
- "post_incl_only_one_class": boolean, True if the comment includes either female or male classifiers but not both
- "post_incl_fem_class_int": "post_incl_fem_class" as an integer, i.e. True = 1, False = 0
uniuseful_gendered.head()
As does Wu (2018), we identify the rows of the sparse matrix $X$ that correspond to comments that include a strictly positive word count of gender classifiers.
X_gendered = X[list(uniuseful_gendered.index), :]
Next, as does Wu (2018), we split the comments that include either a female or male classifier (but not both) into training and testing sets with a 75-25 split, the scikit-learn default: 75% of these comments go to the training set and 25% to the testing set.
train_one_class, test_one_class = model_selection.train_test_split(uniuseful_gendered.loc[uniuseful_gendered["post_incl_only_one_class"] == True]) #split posts with either male or female classifiers (but not both) into training and testing sets; test_size defaults to 0.25
As does Wu (2018), we create another testing set of comments that include both female and male classifiers. The purpose of this set is to reclassify those posts as female or male; Wu (2018) performs that classification with a threshold on the predicted probability.
test_both_class = uniuseful_gendered.loc[uniuseful_gendered["post_incl_both_class"] == True]
We find the rows in the sparse matrix $X$ that correspond to the comments in the training set.
X_train_one_class = X[list(train_one_class.index), :]
We find the rows in $X$ that correspond to the first of the two testing sets mentioned above.
X_test_one_class = X[list(test_one_class.index), :]
We find the rows in $X$ that correspond to the testing set of comments to be reclassified as female/male posts.
X_test_both_class = X[list(test_both_class.index), :]
We find the words in the dataframe of the ten thousand highest word counts that are not used as classifiers.
non_classifier_words = vocab.loc[vocab["female"].isnull()]
We find their positions in the sparse matrix $X$.
non_classifier_pos_list = list(non_classifier_words.position_in_sparse_matrix)
We create a sparse matrix "X_train_one_class_non_classifier" whose rows are the training-set comments and whose columns are the words not used as classifiers; we do likewise for the two testing subsets.
X_train_one_class_non_classifier = X_train_one_class[:, non_classifier_pos_list]
X_test_one_class_non_classifier = X_test_one_class[:, non_classifier_pos_list]
X_test_both_class_non_classifier = X_test_both_class[:, non_classifier_pos_list]
We create the $y$ variable for the training set from the column "post_incl_fem_class_int", which takes the value 1 if the comment includes a strictly positive word count of female classifiers and 0 otherwise.
y_train_one_class = train_one_class.loc[:, "post_incl_fem_class_int"].to_numpy()
As does Wu (2018), we fit a lasso logistic regression with five-fold cross-validation and an $\ell_1$ penalty. In future drafts we intend to use grid search to tune the inverse regularization strength $C$ (a sketch follows the fit below).
logistic_cv_model = linear_model.LogisticRegressionCV(cv = 5, penalty = "l1", solver = "liblinear", refit = True, n_jobs = -1, verbose = 10)
logistic_cv_model.fit(X_train_one_class_non_classifier, y_train_one_class)
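A sketch of the grid search we have in mind for future drafts, tuning the inverse regularization strength $C$; the grid values are placeholders, not tuned choices, and the fit is left commented out because it is expensive.
param_grid = {"C": [0.01, 0.1, 1, 10]} #placeholder grid for the inverse regularization strength
grid = model_selection.GridSearchCV(linear_model.LogisticRegression(penalty = "l1", solver = "liblinear"), param_grid, cv = 5, n_jobs = -1)
#grid.fit(X_train_one_class_non_classifier, y_train_one_class)
#grid.best_params_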
As does Wu (2018), we extract the coefficients from the model and create a dataframe matching each coefficient to its word.
logistic_cv_model.coef_
coefficients = pd.DataFrame(logistic_cv_model.coef_.T)
word_coefficients = non_classifier_words.reset_index()
word_coefficients = pd.concat([word_coefficients, coefficients], axis = 1)
word_coefficients = word_coefficients.rename(columns = {0: "coefficients"})
word_coefficients_fem_sort = word_coefficients.sort_values("coefficients", ascending = False)
word_coefficients_mal_sort = word_coefficients.sort_values("coefficients", ascending = True)
We list the 25 words with the highest coefficients.
word_coefficients_fem_sort.head(25)
foo_w_coefficients_sorted_mal_sort["negative coefficients"] = foo_w_coefficients_sorted_mal_sort["coefficients"] * -1
We list the 25 words with the lowest coefficients. We have also created a column of negated coefficients ("negative coefficients") to help us generate the word cloud of male-predictive words below.
word_coefficients_mal_sort.head(25)
As does Wu (2018) we find the predicted probability of a comment being classified as female. Wu (2018) uses these probabilities to reclassify posts that include both female and male classifiers as either female or male.
train_predicted_prob = logistic_cv_model.predict_proba(X_train_one_class_non_classifier)
y_predicted_prob_post_fem_train_one_class_non_classifier = train_predicted_prob[:, 1]
test_one_predicted_prob = logistic_cv_model.predict_proba(X_test_one_class_non_classifier)
y_predicted_prob_post_fem_test_one_class_non_classifier = test_one_predicted_prob[:, 1]
test_both_predicted_prob = logistic_cv_model.predict_proba(X_test_both_class_non_classifier)
y_predicted_prob_post_fem_test_both_class_non_classifier = test_both_predicted_prob[:, 1]
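A sketch of how the reclassification of posts containing both classifiers could proceed, using a 0.5 cutoff on the predicted probability; the cutoff is a placeholder of ours, as Wu (2018) selects the threshold more carefully.
threshold = 0.5 #placeholder cutoff; Wu (2018) chooses the threshold more carefully
test_both_class = test_both_class.copy()
test_both_class["reclassified_female"] = (y_predicted_prob_post_fem_test_both_class_non_classifier > threshold).astype(int)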
We create a word cloud of the words with the highest coefficient values, i.e. the words most predictive of a comment being classified as female. In future drafts, as does Wu (2018), we intend to compute the marginal effect associated with each word.
word_data_fem = dict(zip(word_coefficients_fem_sort["word"].tolist(), word_coefficients_fem_sort["coefficients"].tolist()))
cloud_fem = WordCloud(max_words = 250, background_color = "white", width = 1600, height = 1600, max_font_size = 150, min_font_size = 30, colormap = "plasma").generate_from_frequencies(word_data_fem)
plt.figure(figsize = (20, 20))
plt.imshow(cloud_fem, interpolation="bilinear")
plt.axis("off")