Author: Leon Yin and Surya Mattu
Links: GitHub | nbviewer | Story
This notebook shows how we aggregate the data we preprocessed in the last notebook (see in nbviewer). The tables in this notebook are featured in our experiment and story.
%matplotlib inline
import os
import numpy as np
import pandas as pd
# inputs
fn_input = '../data/intermediates/email_metadata_merged_with_newsletters.csv.gz'
fn_newsletter_categories = '../data/input/newsletter_categories.csv'
# outputs
appedix_dir = '../data/output/appendix'
table_dir = '../data/output/tables'
for d in [appedix_dir, table_dir]:
os.makedirs(d, exist_ok=True)
# read the dataset we preprocessed in the last notebook.
df = pd.read_csv(fn_input)
len(df)
5134
Recall our dataset looks like this:
df.iloc[-1].to_dict()
{'X-GM-THRID': 1655113375086340206, 'X-Gmail-Labels': 'Spam,Category Updates,Unread', 'Date': '2020-01-07 23:18:33+00:00', 'Subject': 're: your 2020 membership', 'text': 'Become a card-carrying member of the Democratic Party by renewing your membership today.\nHi --\nWe’ve been talking about the importance of preparing for 2020 for months now, and it’s finally here. We’re in the thick of it. The Iowa caucuses are less than a month away, and before we know it we’ll have our Democratic nominee.\nIt’s more clear now than ever: This is not the year to sit on the sidelines. As a member of our party, we know you are committed to defeating Trump and his allies -- and now is the time to prove it.\nBecome a card-carrying member of the Democratic Party by making a $7 donation to renew your membership today.\nRENEW MY MEMBERSHIP\nIf we’re going to win up and down the ballot this year, we need you in this fight. I hope you’ll carry your 2020 DNC membership card with pride -- it’s a reflection of all you’ve done for our party as we prepare for the fight ahead.\nRenew your membership today to receive your Official 2020 DNC Membership Card.\nThanks as always,\nLauren\nLauren Williams\nDirect Marketing Director\nDemocratic National Committee\nIf you no longer wish to receive emails from the DNC, submit this form to unsubscribe. If you’d only like to receive our most important messages, sign up to receive less email.\nIf you’re ready to elect Democrats in all 50 states, make a contribution today.\nContributions or gifts to the Democratic National Committee are not tax deductible. Paid for by the Democratic National Committee, www.Democrats.org, and not authorized by any candidate or candidate’s committee.', 'markup_id': '5a0fe9753c37a010ae045117cc77e261', 'Category Personal': 0, 'Category Promotions': 0, 'Category Updates': 1, 'Inbox': 0, 'Spam': 1, 'Primary': 0, 'From_Email': 'feedback@act.democrats.org', 'From_Domain': 'democrats.org', 'From_Name': 'DNC Headquarters', 'To_Email': 'ajcorrigan88+dnc@gmail.com', 'Name': 'DNC', 'Website': 'https://democrats.org/', 'Category': 'Advocacy organization or think tank', 'Entity_ID': 'dbc25ff0242fc30a08228f14e658ad48'}
categories = df.Category.unique()
categories
array(['Presidential candidate', 'House battleground campaign', 'Advocacy organization or think tank', 'House battleground official'], dtype=object)
How many emails did we get from each email sender? Which newsletters didn't send us anything?
data = []
for eID, _df in df.groupby('Entity_ID'):
first_email = _df.Date.min().split()[0]
last_email = _df.Date.max().split()[0]
n_email = _df.markup_id.nunique()
cat = _df.Category.iloc[0]
name = _df.Name.iloc[0]
row = {
'Name' : name,
'Entity_ID' : eID,
'Emails received' : n_email,
'Date of first email' : first_email,
'Date of last email' : last_email,
'Newsletter Category' : cat
}
data.append(row)
emails_received = pd.DataFrame(data)
emails_received.head(2)
Name | Entity_ID | Emails received | Date of first email | Date of last email | Newsletter Category | |
---|---|---|---|---|---|---|
0 | Sharice Davids | 0440493f2ad4348bc652f41cb1ff5d49 | 15 | 2019-10-22 | 2020-02-06 | House battleground official |
1 | Harley Rouda | 04dfd84c37ccdf9421ba355e25529adc | 58 | 2019-10-21 | 2020-02-10 | House battleground campaign |
# newsletters we signed up for and which alias we used.
newsletter_categories = pd.read_csv(fn_newsletter_categories)
newsletter_categories.head(2)
Name | Website | Category | Entity_ID | ||
---|---|---|---|---|---|
0 | AARP | https://www.aarp.org/ | ajcorrigan88+aarp@gmail.com | Advocacy organization or think tank | 555b07f95e3dc347516af4e713ed190b |
1 | Center for Responsive Politics | http://www.opensecrets.org/ | ajcorrigan88+cfrp@gmail.com | Advocacy organization or think tank | 77751658e640d8c56dbda5d7501dfd5d |
A1_output_cols = [
'Name', 'Website',
'Emails received',
'Date of first email',
'Date of last email',
]
# this prevents duplicate rows
cols_to_use = emails_received.columns.difference(
newsletter_categories.columns
)
no_shows = []
for cat in categories:
fn_out = os.path.join(
appedix_dir, f"A1-{cat.replace(' ', '-')}.csv"
)
newsletter_categories_cat = newsletter_categories[
newsletter_categories.Category == cat].set_index('Name')
emails_received_cat = emails_received[
emails_received['Newsletter Category'] == cat].set_index('Name')
# here we combine all emails we opt-d in for, with all recieved
email_tracker = newsletter_categories_cat.merge(emails_received_cat[cols_to_use],
on= ['Name'],
how='left') \
.sort_values(by= 'Name') \
.reset_index()
# For email senders with no match, set filler values.
email_tracker['Emails received'].fillna(0, inplace=True)
for col in ['Date of first email', 'Date of last email']:
email_tracker[col].fillna('-', inplace=True)
# let's record who we didn't get!
no_shows.extend(
email_tracker[email_tracker['Emails received'] == 0][[
'Name', 'Category', 'Entity_ID'
]].to_dict(orient='records')
)
# save this for the appendix
email_tracker[A1_output_cols].to_csv(fn_out, index=False)
del fn_out
# print some summary stats
n_signed_up = len(newsletter_categories_cat)
n_received = len(emails_received_cat)
print(f"{cat}:\nwe signed up for {n_signed_up}\n"
f"got emails from {n_received}\n")
Presidential candidate: we signed up for 17 got emails from 16 House battleground campaign: we signed up for 71 got emails from 44 Advocacy organization or think tank: we signed up for 73 got emails from 50 House battleground official: we signed up for 70 got emails from 61
Let's get the stats for all senders:
n_no_shows = len(no_shows)
n_received = df.Entity_ID.nunique()
n_subscribed = newsletter_categories.Entity_ID.nunique()
print(f"Out of the {n_subscribed} email senders we opted into "
f"we received {n_received} of them and didn't recieve {n_no_shows}.")
Out of the 231 email senders we opted into we received 171 of them and didn't recieve 60.
# Who didn't send us emails?
fn_out = os.path.join(appedix_dir, 'Amisc_email_senders_who_didnt_send.csv')
pd.DataFrame(no_shows).sort_values('Name').to_csv(fn_out, index=False)
fn_out = os.path.join(table_dir, 'T1-overview.csv')
data = []
for cat, _df in df.groupby('Category'):
spam = []
promos = []
primary = []
for user, __df in _df.groupby('To_Email'):
n_emails = __df.markup_id.nunique()
n_spam = __df[__df['Spam'] == 1].markup_id.nunique()/ n_emails
n_promo = __df[(__df['Category Promotions'] == 1) &
(__df['Spam'] != 1)].markup_id.nunique()/ n_emails
n_primary = __df[__df['Primary'] == 1].markup_id.nunique() / n_emails
spam.append(n_spam)
promos.append(n_promo)
primary.append(n_primary)
# this is to get the variance
spam = np.std(spam)
promos = np.std(promos)
primary = np.std(primary)
n_emails = _df.markup_id.nunique()
n_spam = _df[_df['Spam'] == 1].markup_id.nunique() / n_emails
n_promo = _df[(_df['Category Promotions'] == 1) &
(_df['Spam'] != 1)].markup_id.nunique()/ n_emails
n_primary = _df[_df['Primary'] == 1].markup_id.nunique()/ n_emails
row = {
'Category' : cat,
'Primary' : n_primary,
'Promotions' : n_promo,
'Spam' : n_spam,
'Total Emails' : n_emails,
'STD Spam' : spam,
'STD Promotions' :promos,
'STD Primary' : primary
}
data.append(row)
# Average for all
spam = []
promos = []
primary = []
for user, __df in df.groupby('To_Email'):
n_emails = __df.markup_id.nunique()
n_spam = __df[__df['Spam'] == 1].markup_id.nunique()/ n_emails
n_promo = __df[(__df['Category Promotions'] == 1) &
(__df['Spam'] != 1)].markup_id.nunique()/ n_emails
n_primary = __df[__df['Primary'] == 1].markup_id.nunique() / n_emails
spam.append(n_spam)
promos.append(n_promo)
primary.append(n_primary)
spam = np.std(spam)
promos = np.std(promos)
primary = np.std(primary)
n_emails = df.markup_id.nunique()
n_spam = df[df['Spam'] == 1].markup_id.nunique() / n_emails
n_promo = df[(df['Category Promotions'] == 1) &
(df['Spam'] != 1)].markup_id.nunique()/ n_emails
n_primary = df[df['Primary'] == 1].markup_id.nunique()/ n_emails
row = {
'Category' : 'All emails',
'Primary' : n_primary,
'Promotions' : n_promo,
'Spam' : n_spam,
'Total Emails' : n_emails,
'STD Spam' : spam,
'STD Promotions' :promos,
'STD Primary' : primary
}
data.append(row)
overview = pd.DataFrame(data)
overview.to_csv(fn_out, index=False)
del fn_out
overview
Category | Primary | Promotions | Spam | Total Emails | STD Spam | STD Promotions | STD Primary | |
---|---|---|---|---|---|---|---|---|
0 | Advocacy organization or think tank | 0.088472 | 0.463807 | 0.447721 | 1865 | 0.340419 | 0.350158 | 0.280486 |
1 | House battleground campaign | 0.007117 | 0.494662 | 0.498221 | 1124 | 0.380390 | 0.376841 | 0.178170 |
2 | House battleground official | 0.447883 | 0.249186 | 0.302932 | 614 | 0.310248 | 0.275013 | 0.302288 |
3 | Presidential candidate | 0.064010 | 0.638145 | 0.297845 | 1531 | 0.335097 | 0.366065 | 0.179952 |
4 | All emails | 0.106350 | 0.496884 | 0.396767 | 5134 | 0.336899 | 0.327792 | 0.280078 |
cat2stats = {}
data_all = []
for cat in categories:
stats = []
for eID, messages in df[df.Category == cat].groupby('Entity_ID'):
name = messages.iloc[0].Name
n_cat = messages.markup_id.nunique()
n_pro = messages[(messages['Category Promotions'] == 1) &
(messages['Spam'] != 1) ].markup_id.nunique()
n_u = messages[messages['Category Updates'] == 1].markup_id.nunique()
n_s = messages[messages['Spam'] == 1].markup_id.nunique()
n_pri = messages[messages['Primary'] == 1].markup_id.nunique()
start_date = messages.Date.min()
end_date = messages.Date.max()
website = messages.Website.iloc[0]
row = {
'Entity_ID' : eID,
'Name' : name,
'Website' : website,
'n_promos' : n_pro,
'n_updates' : n_u,
'n_spam' : n_s,
'n_primary' : n_pri,
'start_date' : start_date,
'end_date' : end_date,
'n_messages' : len(messages)
}
stats.append(row)
# assign all sender stats to the category
cat2stats[cat] = stats
# add all sender stats to the list of everything
data_all.extend(stats)
# names of columns we're using
metrics = [ 'n_primary', 'n_promos', 'n_spam']
ratios = [col.replace('n_', 'perc_') for col in metrics]
# columns we want for our output table
t2_cols = ['Name'] + ratios + ['n_messages']
# rename these columns for the appendix
col2col = {
'Name' : 'Name',
'perc_primary' : 'Primary',
'perc_promos' : 'Promotions',
'perc_spam' : 'Spam',
'n_messages' : 'Total Emails'
}
def get_inbox_ratios(df):
"""Gets some summary stats per sender and sorts by primary inbox"""
df[ratios] = df[metrics].div(df['n_messages'],
axis=0).round(4)
output = df[t2_cols]
output.columns = [col2col.get(c, c) for c in output.columns]
output = output.set_index('Name').sort_values(
by=['Primary', 'Total Emails'], ascending=False
).reset_index()
return output
How is Gmail classifying emails from each of these presidential candidates?
fn_out = os.path.join(table_dir, 'T2a-presidental-candidates.csv')
cat = 'Presidential candidate'
print(cat)
df_cat = pd.DataFrame(cat2stats[cat])
output = get_inbox_ratios(df_cat)
# remove Trump rallies and add Trump with zero emails.
output = output[output.Name != 'Trump rallies'].append(
pd.DataFrame([{
'Name' : 'Donald J. Trump',
'Primary' : 0.0,
'Promotions' : 0.0,
'Spam' : 0.0,
'Total Emails' : 0
}])).reset_index(drop=True)
# output = publ?ication_ready(output)
output.to_csv(fn_out, index=False)
del fn_out
output
Presidential candidate
Name | Primary | Promotions | Spam | Total Emails | |
---|---|---|---|---|---|
0 | Pete Buttigieg | 0.6279 | 0.0930 | 0.2791 | 43 |
1 | Andrew Yang | 0.4638 | 0.0435 | 0.4928 | 69 |
2 | Michael Bloomberg | 0.1667 | 0.8333 | 0.0000 | 6 |
3 | Julián Castro | 0.1204 | 0.7407 | 0.1389 | 216 |
4 | Bill Weld | 0.0291 | 0.2913 | 0.6796 | 103 |
5 | Tulsi Gabbard | 0.0222 | 0.7778 | 0.2000 | 45 |
6 | Bernie Sanders | 0.0180 | 0.9640 | 0.0180 | 111 |
7 | Cory Booker | 0.0145 | 0.7633 | 0.2222 | 207 |
8 | Amy Klobuchar | 0.0096 | 0.9071 | 0.0833 | 312 |
9 | Kamala Harris | 0.0000 | 0.3496 | 0.6504 | 123 |
10 | Joe Walsh | 0.0000 | 0.0549 | 0.9451 | 91 |
11 | Joe Biden | 0.0000 | 0.8571 | 0.1429 | 91 |
12 | Elizabeth Warren | 0.0000 | 0.8485 | 0.1515 | 66 |
13 | Beto O’Rourke | 0.0000 | 0.1957 | 0.8043 | 46 |
14 | Tom Steyer | 0.0000 | 1.0000 | 0.0000 | 1 |
15 | Donald J. Trump | 0.0000 | 0.0000 | 0.0000 | 0 |
We can also get this for the coalition of advocacy groups plus Change.org.
coalition_plus_change = [
'SumOfUs',
'Democracy for America',
'CREDO Action',
'Change.org'
]
fn_out = os.path.join(table_dir, 'T2b-coalition-groups.csv')
cat = 'Advocacy organization or think tank'
df_cat = pd.DataFrame(cat2stats[cat])
df_cat = df_cat[df_cat.Name.isin(coalition_plus_change)]
output = get_inbox_ratios(df_cat)
output.to_csv(fn_out, index=False)
del fn_out
output
Name | Primary | Promotions | Spam | Total Emails | |
---|---|---|---|---|---|
0 | SumOfUs | 0.0976 | 0.9024 | 0.0000 | 41 |
1 | Change.org | 0.0385 | 0.8462 | 0.1154 | 26 |
2 | Democracy for America | 0.0000 | 0.6023 | 0.3977 | 88 |
3 | CREDO Action | 0.0000 | 0.7273 | 0.2727 | 44 |
The same stats as Table 2a and 2b, but for all the categories.
for cat in categories:
fn_out = os.path.join(appedix_dir, f"A2-{cat.replace(' ', '-')}.csv")
df_cat = pd.DataFrame(cat2stats[cat])
output = get_inbox_ratios(df_cat)
output.to_csv(fn_out, index=False)
del fn_out
output.head(2)
Name | Primary | Promotions | Spam | Total Emails | |
---|---|---|---|---|---|
0 | Antonio Delgado | 1.0 | 0.0 | 0.0 | 11 |
1 | Troy Balderson | 1.0 | 0.0 | 0.0 | 4 |
When we created the cat2stats
dictionary, we also added each row to data_all
.
email_sender_stats = pd.DataFrame(data_all)
len(email_sender_stats)
171
Let's only look at senders who sent us at least two emails.
senders = email_sender_stats[email_sender_stats.n_messages >= 2]
len(senders)
149
What percentage these senders never got to the primary inbox?
len(senders[senders.n_primary == 0]) / len(senders)
0.4161073825503356
len(df[(df['Category Promotions'] == 1) & (df['Spam'] == 1)]) / len(df[df['Spam'] == 1])
0.6921944035346097
len(df[(df['Category Updates'] == 1) & (df['Spam'] != 1)])
540