Gmail Experiment: Data Analysis¶

Author: Leon Yin and Surya Mattu
Links: GitHub | nbviewer | Story

This notebook shows how we aggregate the data we preprocessed in the last notebook (see in nbviewer). The tables in this notebook are featured in our experiment and story.

In [1]:

%matplotlib inline
import os

import numpy as np
import pandas as pd

In [2]:

# inputs
fn_input = '../data/intermediates/email_metadata_merged_with_newsletters.csv.gz'
fn_newsletter_categories = '../data/input/newsletter_categories.csv'

# outputs
appedix_dir = '../data/output/appendix'
table_dir = '../data/output/tables'

for d in [appedix_dir, table_dir]:
    os.makedirs(d, exist_ok=True)

In [3]:

# read the dataset we preprocessed in the last notebook.
df = pd.read_csv(fn_input)
len(df)

Out[3]:

Recall our dataset looks like this:

In [4]:

df.iloc[-1].to_dict()

Out[4]:

{'X-GM-THRID': 1655113375086340206,
 'X-Gmail-Labels': 'Spam,Category Updates,Unread',
 'Date': '2020-01-07 23:18:33+00:00',
 'Subject': 're: your 2020 membership',
 'text': 'Become a card-carrying member of the Democratic Party by renewing your membership today.\nHi --\nWe’ve been talking about the importance of preparing for 2020 for months now, and it’s finally here. We’re in the thick of it. The Iowa caucuses are less than a month away, and before we know it we’ll have our Democratic nominee.\nIt’s more clear now than ever: This is not the year to sit on the sidelines. As a member of our party, we know you are committed to defeating Trump and his allies -- and now is the time to prove it.\nBecome a card-carrying member of the Democratic Party by making a $7 donation to renew your membership today.\nRENEW MY MEMBERSHIP\nIf we’re going to win up and down the ballot this year, we need you in this fight. I hope you’ll carry your 2020 DNC membership card with pride -- it’s a reflection of all you’ve done for our party as we prepare for the fight ahead.\nRenew your membership today to receive your Official 2020 DNC Membership Card.\nThanks as always,\nLauren\nLauren Williams\nDirect Marketing Director\nDemocratic National Committee\nIf you no longer wish to receive emails from the DNC, submit this form to unsubscribe. If you’d only like to receive our most important messages, sign up to receive less email.\nIf you’re ready to elect Democrats in all 50 states, make a contribution today.\nContributions or gifts to the Democratic National Committee are not tax deductible. Paid for by the Democratic National Committee, www.Democrats.org, and not authorized by any candidate or candidate’s committee.',
 'markup_id': '5a0fe9753c37a010ae045117cc77e261',
 'Category Personal': 0,
 'Category Promotions': 0,
 'Category Updates': 1,
 'Inbox': 0,
 'Spam': 1,
 'Primary': 0,
 'From_Email': 'feedback@act.democrats.org',
 'From_Domain': 'democrats.org',
 'From_Name': 'DNC Headquarters',
 'To_Email': 'ajcorrigan88+dnc@gmail.com',
 'Name': 'DNC',
 'Website': 'https://democrats.org/',
 'Category': 'Advocacy organization or think tank',
 'Entity_ID': 'dbc25ff0242fc30a08228f14e658ad48'}

In [5]:

categories = df.Category.unique()
categories

Out[5]:

array(['Presidential candidate', 'House battleground campaign',
       'Advocacy organization or think tank',
       'House battleground official'], dtype=object)

Appendix 1: Taking inventory¶

How many emails did we get from each email sender? Which newsletters didn't send us anything?

In [6]:

data = []
for eID, _df in df.groupby('Entity_ID'):
    first_email = _df.Date.min().split()[0]
    last_email = _df.Date.max().split()[0]
    n_email = _df.markup_id.nunique()
    cat = _df.Category.iloc[0]
    name = _df.Name.iloc[0]
    
    row = {
        'Name' : name,
        'Entity_ID' : eID,
        'Emails received' : n_email,
        'Date of first email' : first_email,
        'Date of last email' : last_email,
        'Newsletter Category' : cat
    }
    
    data.append(row)
    
emails_received = pd.DataFrame(data)
emails_received.head(2)

Out[6]:

	Name	Entity_ID	Emails received	Date of first email	Date of last email	Newsletter Category
0	Sharice Davids	0440493f2ad4348bc652f41cb1ff5d49	15	2019-10-22	2020-02-06	House battleground official
1	Harley Rouda	04dfd84c37ccdf9421ba355e25529adc	58	2019-10-21	2020-02-10	House battleground campaign

In [7]:

# newsletters we signed up for and which alias we used.
newsletter_categories = pd.read_csv(fn_newsletter_categories)
newsletter_categories.head(2)

Out[7]:

	Name	Website	Email	Category	Entity_ID
0	AARP	https://www.aarp.org/	ajcorrigan88+aarp@gmail.com	Advocacy organization or think tank	555b07f95e3dc347516af4e713ed190b
1	Center for Responsive Politics	http://www.opensecrets.org/	ajcorrigan88+cfrp@gmail.com	Advocacy organization or think tank	77751658e640d8c56dbda5d7501dfd5d

In [8]:

A1_output_cols = [
    'Name', 'Website', 
    'Emails received',
    'Date of first email', 
    'Date of last email',
]
# this prevents duplicate rows
cols_to_use = emails_received.columns.difference(
    newsletter_categories.columns
)

no_shows = []
for cat in categories:
    fn_out = os.path.join(
        appedix_dir, f"A1-{cat.replace(' ', '-')}.csv"
    )
    
    newsletter_categories_cat = newsletter_categories[
        newsletter_categories.Category == cat].set_index('Name')
    
    emails_received_cat = emails_received[
        emails_received['Newsletter Category'] == cat].set_index('Name')

    # here we combine all emails we opt-d in for, with all recieved
    email_tracker = newsletter_categories_cat.merge(emails_received_cat[cols_to_use],
                                                    on= ['Name'], 
                                                    how='left') \
                                             .sort_values(by= 'Name') \
                                             .reset_index()
    
    # For email senders with no match, set filler values.
    email_tracker['Emails received'].fillna(0, inplace=True)
    for col in ['Date of first email', 'Date of last email']:
        email_tracker[col].fillna('-', inplace=True)
    
    # let's record who we didn't get!
    no_shows.extend(
        email_tracker[email_tracker['Emails received'] == 0][[
            'Name', 'Category', 'Entity_ID'
        ]].to_dict(orient='records')
    )
    
    # save this for the appendix
    email_tracker[A1_output_cols].to_csv(fn_out, index=False)
    del fn_out
    
    # print some summary stats
    n_signed_up = len(newsletter_categories_cat)
    n_received = len(emails_received_cat)
    print(f"{cat}:\nwe signed up for {n_signed_up}\n"
          f"got emails from {n_received}\n")

Presidential candidate:
we signed up for 17
got emails from 16

House battleground campaign:
we signed up for 71
got emails from 44

Advocacy organization or think tank:
we signed up for 73
got emails from 50

House battleground official:
we signed up for 70
got emails from 61

Let's get the stats for all senders:

In [9]:

n_no_shows = len(no_shows)
n_received = df.Entity_ID.nunique()
n_subscribed = newsletter_categories.Entity_ID.nunique()

print(f"Out of the {n_subscribed} email senders we opted into "
      f"we received {n_received} of them and didn't recieve {n_no_shows}.")

Out of the 231 email senders we opted into we received 171 of them and didn't recieve 60.

In [10]:

# Who didn't send us emails?
fn_out = os.path.join(appedix_dir, 'Amisc_email_senders_who_didnt_send.csv')
pd.DataFrame(no_shows).sort_values('Name').to_csv(fn_out, index=False)

Table 1: How Gmail classified our emails¶

In [11]:

fn_out = os.path.join(table_dir, 'T1-overview.csv')

data = []
for cat, _df in df.groupby('Category'):
    spam = []
    promos = []
    primary = []
    for user, __df in _df.groupby('To_Email'):
        n_emails = __df.markup_id.nunique()
        n_spam = __df[__df['Spam'] == 1].markup_id.nunique()/ n_emails
        n_promo = __df[(__df['Category Promotions'] == 1) &
                       (__df['Spam'] != 1)].markup_id.nunique()/ n_emails
        n_primary = __df[__df['Primary'] == 1].markup_id.nunique() / n_emails
        
        spam.append(n_spam)
        promos.append(n_promo)
        primary.append(n_primary)
    
    # this is to get the variance
    spam = np.std(spam)
    promos = np.std(promos)
    primary = np.std(primary)
    
    n_emails = _df.markup_id.nunique()
    n_spam = _df[_df['Spam'] == 1].markup_id.nunique() / n_emails
    n_promo = _df[(_df['Category Promotions'] == 1) &
                  (_df['Spam'] != 1)].markup_id.nunique()/ n_emails
    n_primary = _df[_df['Primary'] == 1].markup_id.nunique()/ n_emails
    
    row = {
        'Category' : cat,
        'Primary' : n_primary,
        'Promotions' : n_promo,
        'Spam' : n_spam,
        'Total Emails' : n_emails,
        'STD Spam' : spam,
        'STD Promotions' :promos,
        'STD Primary' : primary
    }
    
    data.append(row)
    
# Average for all
spam = []
promos = []
primary = []

for user, __df in df.groupby('To_Email'):
    n_emails = __df.markup_id.nunique()
    n_spam = __df[__df['Spam'] == 1].markup_id.nunique()/ n_emails
    n_promo = __df[(__df['Category Promotions'] == 1) &
                   (__df['Spam'] != 1)].markup_id.nunique()/ n_emails
    n_primary = __df[__df['Primary'] == 1].markup_id.nunique() / n_emails

    spam.append(n_spam)
    promos.append(n_promo)
    primary.append(n_primary)

spam = np.std(spam)
promos = np.std(promos)
primary = np.std(primary)

n_emails = df.markup_id.nunique()
n_spam = df[df['Spam'] == 1].markup_id.nunique() / n_emails
n_promo = df[(df['Category Promotions'] == 1) &
             (df['Spam'] != 1)].markup_id.nunique()/ n_emails
n_primary = df[df['Primary'] == 1].markup_id.nunique()/ n_emails

row = {
    'Category' : 'All emails',
    'Primary' : n_primary,
    'Promotions' : n_promo,
    'Spam' : n_spam,
    'Total Emails' : n_emails,
    'STD Spam' : spam,
    'STD Promotions' :promos,
    'STD Primary' : primary
}

data.append(row)

overview = pd.DataFrame(data)
overview.to_csv(fn_out, index=False)
del fn_out
overview

Out[11]:

	Category	Primary	Promotions	Spam	Total Emails	STD Spam	STD Promotions	STD Primary
0	Advocacy organization or think tank	0.088472	0.463807	0.447721	1865	0.340419	0.350158	0.280486
1	House battleground campaign	0.007117	0.494662	0.498221	1124	0.380390	0.376841	0.178170
2	House battleground official	0.447883	0.249186	0.302932	614	0.310248	0.275013	0.302288
3	Presidential candidate	0.064010	0.638145	0.297845	1531	0.335097	0.366065	0.179952
4	All emails	0.106350	0.496884	0.396767	5134	0.336899	0.327792	0.280078

Table 2: Gmail classification statistics for each email sender¶

In [12]:

cat2stats = {}
data_all = []
for cat in categories:
    stats = []
    for eID, messages in df[df.Category == cat].groupby('Entity_ID'):
        name = messages.iloc[0].Name
        n_cat = messages.markup_id.nunique()
        n_pro = messages[(messages['Category Promotions'] == 1) &
                         (messages['Spam'] != 1) ].markup_id.nunique()
        n_u = messages[messages['Category Updates'] == 1].markup_id.nunique()
        n_s = messages[messages['Spam'] == 1].markup_id.nunique()
        n_pri = messages[messages['Primary'] == 1].markup_id.nunique()
        
        start_date = messages.Date.min()
        end_date = messages.Date.max()
        website = messages.Website.iloc[0]
        
        row = {
            'Entity_ID' : eID,
            'Name' : name,
            'Website' : website,
            'n_promos' : n_pro,
            'n_updates' : n_u,
            'n_spam' : n_s,
            'n_primary' : n_pri,
            'start_date' : start_date,
            'end_date' : end_date,
            'n_messages' : len(messages)
        }
        stats.append(row)
        
    # assign all sender stats to the category
    cat2stats[cat] = stats
    # add all sender stats to the list of everything
    data_all.extend(stats)

In [13]:

# names of columns we're using
metrics = [ 'n_primary', 'n_promos', 'n_spam']
ratios = [col.replace('n_', 'perc_') for col in metrics]

# columns we want for our output table
t2_cols = ['Name'] + ratios +  ['n_messages']

# rename these columns for the appendix
col2col = {
    'Name' : 'Name',
    'perc_primary' : 'Primary',
    'perc_promos' : 'Promotions',
    'perc_spam' : 'Spam',
    'n_messages' : 'Total Emails'
}

In [14]:

def get_inbox_ratios(df):
    """Gets some summary stats per sender and sorts by primary inbox"""
    df[ratios] = df[metrics].div(df['n_messages'], 
                                 axis=0).round(4)
    output = df[t2_cols]
    output.columns = [col2col.get(c, c) for c in output.columns]
    output = output.set_index('Name').sort_values(
        by=['Primary', 'Total Emails'], ascending=False
    ).reset_index()
    
    return output

How is Gmail classifying emails from each of these presidential candidates?

In [15]:

fn_out = os.path.join(table_dir, 'T2a-presidental-candidates.csv')
cat = 'Presidential candidate'
print(cat)

df_cat = pd.DataFrame(cat2stats[cat])
output = get_inbox_ratios(df_cat)

# remove Trump rallies and add Trump with zero emails.
output = output[output.Name != 'Trump rallies'].append(
    pd.DataFrame([{
        'Name' : 'Donald J. Trump',
        'Primary' : 0.0,
        'Promotions' : 0.0,
        'Spam' : 0.0,
        'Total Emails' : 0
    }])).reset_index(drop=True)
# output = publ?ication_ready(output)
output.to_csv(fn_out, index=False)
del fn_out
output

Presidential candidate

Out[15]:

	Name	Primary	Promotions	Spam	Total Emails
0	Pete Buttigieg	0.6279	0.0930	0.2791	43
1	Andrew Yang	0.4638	0.0435	0.4928	69
2	Michael Bloomberg	0.1667	0.8333	0.0000	6
3	Julián Castro	0.1204	0.7407	0.1389	216
4	Bill Weld	0.0291	0.2913	0.6796	103
5	Tulsi Gabbard	0.0222	0.7778	0.2000	45
6	Bernie Sanders	0.0180	0.9640	0.0180	111
7	Cory Booker	0.0145	0.7633	0.2222	207
8	Amy Klobuchar	0.0096	0.9071	0.0833	312
9	Kamala Harris	0.0000	0.3496	0.6504	123
10	Joe Walsh	0.0000	0.0549	0.9451	91
11	Joe Biden	0.0000	0.8571	0.1429	91
12	Elizabeth Warren	0.0000	0.8485	0.1515	66
13	Beto O’Rourke	0.0000	0.1957	0.8043	46
14	Tom Steyer	0.0000	1.0000	0.0000	1
15	Donald J. Trump	0.0000	0.0000	0.0000	0

We can also get this for the coalition of advocacy groups plus Change.org.

In [16]:

coalition_plus_change = [
    'SumOfUs',
    'Democracy for America',
    'CREDO Action',
    'Change.org'
]

In [17]:

fn_out = os.path.join(table_dir, 'T2b-coalition-groups.csv')
cat = 'Advocacy organization or think tank'

df_cat = pd.DataFrame(cat2stats[cat])
df_cat = df_cat[df_cat.Name.isin(coalition_plus_change)]
output = get_inbox_ratios(df_cat)
output.to_csv(fn_out, index=False)
del fn_out
output

Out[17]:

	Name	Primary	Promotions	Spam	Total Emails
0	SumOfUs	0.0976	0.9024	0.0000	41
1	Change.org	0.0385	0.8462	0.1154	26
2	Democracy for America	0.0000	0.6023	0.3977	88
3	CREDO Action	0.0000	0.7273	0.2727	44

Appendix 2¶

The same stats as Table 2a and 2b, but for all the categories.

In [18]:

for cat in categories:
    fn_out = os.path.join(appedix_dir, f"A2-{cat.replace(' ', '-')}.csv")
    df_cat = pd.DataFrame(cat2stats[cat])
    output = get_inbox_ratios(df_cat)
    output.to_csv(fn_out, index=False)
    del fn_out
    
output.head(2)

Out[18]:

	Name	Primary	Promotions	Spam	Total Emails
0	Antonio Delgado	1.0	0.0	0.0	11
1	Troy Balderson	1.0	0.0	0.0	4

Misc Stats¶

What is the percentage of email senders that never got an email in primary?¶

When we created the cat2stats dictionary, we also added each row to data_all.

In [19]:

email_sender_stats = pd.DataFrame(data_all)
len(email_sender_stats)

Out[19]:

Let's only look at senders who sent us at least two emails.

In [20]:

senders = email_sender_stats[email_sender_stats.n_messages >= 2]
len(senders)

Out[20]:

What percentage these senders never got to the primary inbox?

In [21]:

len(senders[senders.n_primary == 0]) / len(senders)

Out[21]:

0.4161073825503356

What percentage of Spam emails are also labelled as promotions?¶

In [22]:

len(df[(df['Category Promotions'] == 1) & (df['Spam'] == 1)]) / len(df[df['Spam'] == 1])

Out[22]:

0.6921944035346097

How many emails would have been in Updates if Updates was active?¶

In [23]:

len(df[(df['Category Updates'] == 1) & (df['Spam'] != 1)])

Out[23]: