This program identifies 'priority' e-mails in a dataset of e-mails. E-mail $i$ is a priority e-mail if Rank($i$) $\ge$ threshold, where Rank($i$) is the product of 5 weighting factors. So, if:
$\prod_{k=1}^{5}\omega_k(i) \ge \textrm{threshold}$
then e-mail $i$ is a priority message. The weighting functions $\omega_k(i)$ and the
threshold are calculated from a training dataset (train_df
).
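To make the rule concrete, here is a toy calculation (the weights are made-up numbers, purely for illustration):

```python
import numpy as np

# Hypothetical weights for one e-mail (illustrative values only)
weights = [2.0, 1.0, 1.5, 3.0, 2.0]
threshold = 15.0

rank = np.prod(weights)          # product of the five weighting factors
is_priority = rank >= threshold  # 18.0 >= 15.0, so this one is flagged
```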
The five weighting functions are:
1. The sender's e-mail frequency, computed over all training e-mails.
2. The sender's e-mail frequency, computed over thread e-mails only.
3. The activity of the thread the e-mail belongs to (if any).
4. Term weights derived from the subjects of active threads, applied to the e-mail's subject.
5. Term weights derived from training messages, applied to the e-mail's message.
The program is broken into sections of functionality:
%pylab inline
import os
import re
import math
import random
import numpy as np
import datetime as dt
import dateutil.parser as dtp
import matplotlib.pyplot as plt
import textmining as txtm
import string
from pandas import *
import sys
from statsmodels.nonparametric import KDE
from tdm_df import tdm_df # Term-document matrix builder from CH3
# Define paths to the data. We'll be prioritizing e-mails in the "Easy Ham" folder
data_path = os.path.abspath(os.path.join('..','ch3', 'data'))
easyham_path = os.path.abspath(os.path.join(data_path, 'easy_ham'))
A set of functions that, combined, generate the next e-mail from a list of files and return a tuple containing its date, sender, subject, message, and the name of the file.
def parse_email_list(file_list):
'''
Create a generator that iterates through the e-mails in
a folder, and parses them as the generator is called.
'''
for f in file_list:
yield parse_email(os.path.join(easyham_path, f))
def parse_email(path):
'''
Get important elements of an e-mail file (date, sender's
e-mail address, subject, and message contents).
'''
filename = os.path.split(path)[-1]
header, message = get_header_and_message(path)
date = get_date(header)
sender = get_sender(header)
subj = get_subject(header)
return date, sender, subj, message, filename
def get_header_and_message(path):
'''
Split e-mail's header and its message into two
separate strings.
'''
with open(path, 'rU') as con:
email = con.readlines()
first_blank_index = email.index('\n')
header = email[:first_blank_index]
message = ''.join(email[(first_blank_index + 1): ])
return header, message
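The split logic, demonstrated on a small in-memory message (a hypothetical e-mail, not one from the dataset):

```python
raw = ('From: someone@example.com\n'
       'Date: Fri, 06 Sep 2002 13:56:19 -0400\n'
       '\n'
       'Hello,\n'
       'this is the body.\n')

lines = raw.splitlines(True)     # keep newlines, mimicking readlines()
first_blank = lines.index('\n')  # header ends at the first blank line
header = lines[:first_blank]
message = ''.join(lines[first_blank + 1:])
```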
# Regex pattern that finds the date line
# (it's a line that starts with 'Date')
date_pattern = re.compile('^Date')
def get_date(header):
'''
Find the date line of the e-mail's header, and parse
the date into a datetime object.
'''
# Get the first line that matches the date-line
# regex pattern
dateline = [l for l in header if
re.search(date_pattern, l) != None][0]
# Grab the text after 'Date: ' (6 chars)
dateline = dateline[6:].strip()
# Return the date as parsed by dateutil.parser.parse()
return dtp.parse(dateline)
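If dateutil is unavailable, the standard library can parse the RFC 2822 dates these headers use (a sketch; `parsedate_to_datetime` lives in Python 3's email.utils):

```python
from email.utils import parsedate_to_datetime

dateline = 'Date: Fri, 06 Sep 2002 13:56:19 -0400'
parsed = parsedate_to_datetime(dateline[6:].strip())
# parsed is timezone-aware, which matters for the
# thread-span calculations later on
```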
# Characters that may separate words in the From line
splitfrom_pattern = re.compile('[\"\:<> ]')
def get_sender(header):
'''
Find the 'From:' line in the e-mail's header data, and
extract the sender's e-mail address from it.
'''
# Find line in header that contains 'From: '
sender = filter(lambda s: s.find('From: ') != -1, header)[0]
# Get 'words' in From line
sender = re.split(splitfrom_pattern, sender)
sender = filter(lambda s: s != ' ' and s != '', sender)
# Find the first word that is an e-mail address (contains @)
sender = filter(lambda s: '@' in s, sender)[0]
sender = re.sub('\\[a-z]', '', sender)
return sender.lower().rstrip()
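A condensed sketch of the same extraction, run on a hypothetical From line:

```python
import re

splitfrom = re.compile(r'["\:<> ]')  # separators in the From line

def extract_sender(from_line):
    # Split on the separators, drop empty tokens, and keep the
    # first token that looks like an address (contains '@').
    words = [w for w in re.split(splitfrom, from_line) if w]
    return [w for w in words if '@' in w][0].lower()
```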
def get_subject(header):
'''
Find the subject line of the e-mail's header, and extract
the subject text.
'''
subject = filter(lambda s: s.find('Subject: ') != -1, header)
if len(subject) > 0:
subject_start = subject[0].find('Subject: ') + 9
subject = subject[0][subject_start:]
return subject.lower()
else:
return ''
This function uses the parse_email_list generator to iterate through the e-mails and collect each one's date, sender, subject, message, and filename into a dictionary of lists. The dictionary is then converted to a DataFrame.
def make_email_df(email_dir):
'''
Parse each e-mail in a directory and return a dataframe
of each e-mail's date, sender, subject, and message
'''
email_dict = {'date' : [],
'sender' : [],
'subject' : [],
'message' : [],
'filename' : []}
# Get a list of e-mails to parse
file_list = os.listdir(email_dir)
file_list = [f for f in file_list if f != 'cmds']
# A generator that returns the parsed components
# of each subsequent e-mail as called
parsed_emails = parse_email_list(file_list)
# Fill up the dictionary with the generator
for pe in parsed_emails:
date, sender, subject, message, filename = pe
email_dict['date'].append(date)
email_dict['sender'].append(sender)
email_dict['subject'].append(subject)
email_dict['message'].append(message)
email_dict['filename'].append(filename)
# Make the dictionary a data frame.
email_df = DataFrame(email_dict,
columns = ['date', 'sender', 'subject',
'message', 'filename'])
return email_df
# Make a dataframe of information in the
# 'Easy Ham' emails
email_df = make_email_df(easyham_path)
The first half of the e-mails will be used as the 'training' set. From this set we'll extract features that we'll use to classify the second half of e-mails. Typically, we might shuffle the data before splitting the set, so that we get a representative view of features. But, since we're classifying e-mails, it's more appropriate to take the earlier e-mails as the training set, and see how well they identify 'priority' e-mails amongst subsequently received e-mails.
def train_test_split(df, train_fraction = .5, shuffle = True,
preserve_index = True, seed = None):
'''
Split a dataframe into training and test sets by randomly assigning
its observations to one set or the other.
Parameters
----------
df: a DataFrame
train_fraction: the fraction of `df` to be assigned to the
training set (1-train_fraction will go to the test set).
preserve_index: If True, the split dataframes keep their index
values from `df`. If False, each set gets a new integer index
of arange(# rows).
seed: the random number generator seed for row shuffling.
'''
if seed:
random.seed(seed)
nrows = df.shape[0]
split_point = int(train_fraction * nrows)
rows = range(nrows)
if shuffle:
random.shuffle(rows)
train_rows = rows[:split_point]
test_rows = rows[split_point:]
train_index = df.index[train_rows]
test_index = df.index[test_rows]
train_df = df.ix[train_index, :]
test_df = df.ix[test_index, :]
if not preserve_index:
train_df.index = arange(train_df.shape[0])
test_df.index = arange(test_df.shape[0])
return train_df, test_df
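Since we call this with shuffle = False below, the effective behavior is a simple chronological split. A minimal modern-pandas sketch of that case (`.ix` is deprecated; `iloc` does the positional indexing):

```python
import pandas as pd

def chrono_split(df, train_fraction=0.5):
    # First fraction of rows trains; the remainder tests.
    split_point = int(train_fraction * len(df))
    return df.iloc[:split_point], df.iloc[split_point:]

df = pd.DataFrame({'rank': range(10)})
train, test = chrono_split(df)
```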
Some dates lack time-zone info, making them impossible to sort correctly. We'll sort ignoring time zones, knowing that this is buggy, for comparison with the book: the R code in ML4H ignores time zones, and this leads to some errors. See the note at the end.
email_df['sort_date'] = email_df['date'].map(lambda d:
dt.datetime(d.year, d.month, d.day, d.hour, d.minute, d.second))
email_df = email_df.sort('sort_date')
train_df, test_df = train_test_split(email_df, shuffle = False,
preserve_index = False,
seed = 224)
# Each sender address gets a weight equal to the log of
# 1 + the number of e-mails from that address.
def get_sender_weights(email_df):
freq = email_df['sender'].value_counts()
freq.sort()
sender_weights = DataFrame({'freq' : freq,
'weight' : np.log(1.0 + freq)})
sender_weights = sender_weights.sort('weight', ascending = False)
return sender_weights
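The log(1 + freq) transform compresses the range so prolific senders don't dominate the product. A stdlib sketch with made-up counts:

```python
import math
from collections import Counter

# Hypothetical sender counts (not from the dataset)
counts = Counter({'a@example.com': 6, 'b@example.com': 1})

sender_weight = {s: math.log(1.0 + n) for s, n in counts.items()}
```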
Plot sender frequencies of the top 30 senders in the training set. Get the frequencies of these same addresses from the test set, and plot for comparisons. This will give us some idea of senders we'll end up flagging in the test set.
sender_weights = get_sender_weights(train_df)
sender_weights_test = get_sender_weights(test_df)
nshow = 30
top_emails = sender_weights[:nshow].index
plt.figure(figsize = (6, 14))
plt.barh(np.arange(nshow), sender_weights['freq'][top_emails],
align = 'center',
label = 'Training')
plt.barh(np.arange(nshow), sender_weights_test['freq'][top_emails],
align = 'center',
left = sender_weights['freq'][top_emails].fillna(0),
fc = 'orange',
alpha = .8,
label = 'Test')
plt.ylim((0 - .5, nshow - .5))
plt.title('Frequency of top %i sender addresses' % nshow)
plt.yticks(np.arange(nshow), top_emails)
plt.legend(frameon = False)
We want to identify threads amongst the e-mails and extract information from them separately. We'll find threads that have been replied to or forwarded, and collect all e-mails with the same subject (after stripping reply and forward tags out) into the same thread.
For example:

[1] some subject
[2] re: some subject
[3] re: re: some subject
[4] fw: some subject
[5] re[2]: some subject

will all be in the same thread. This differs slightly from how the book identifies thread messages. They only flag messages with 'Re:' prefixes, so they would miss messages [1] and [4]. (TODO: Check how they would flag [5])
After flagging them, we'll collect info about these threads into a separate DataFrame.
Regex patterns indicating a thread. We look for reply and forward markers. Note that some mail services use 'Re:Re:Re:'-type prefixes to mark multiple replies (or forwards), while others use 'Re[3]:'.
reply_pattern = r'(re:|re\[\d\]:)'
fwd_pattern = r'(fw:|fw\[\d\]:)'
def thread_flag(s):
'''
Returns True if string s matches the thread patterns.
If s is a pandas Series, returns a Series of booleans.
'''
if isinstance(s, basestring):
return re.search(reply_pattern, s) is not None
else:
return s.str.contains(reply_pattern, re.I)
def clean_subject(s):
'''
Removes all the reply and forward labeling from a
string (an e-mail subject) s.
If s is a pandas Series, returns a Series of cleaned
strings.
This will help find the initial message in the thread
(which won't have any of the reply/forward labeling.
'''
if isinstance(s, basestring):
s_clean = re.sub(reply_pattern, '', s, flags = re.I)
s_clean = re.sub(fwd_pattern, '', s_clean, flags = re.I)
s_clean = s_clean.strip()
else:
s_clean = s.str.replace(reply_pattern, '', flags = re.I)
s_clean = s_clean.str.replace(fwd_pattern, '', flags = re.I)
s_clean = s_clean.str.strip()
return s_clean
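The string branch of these two functions, condensed (note `flags=re.I`: passing `re.I` as re.sub's fourth positional argument would set the replacement *count* instead):

```python
import re

reply_pattern = r'(re:|re\[\d\]:)'
fwd_pattern = r'(fw:|fw\[\d\]:)'

def strip_thread_tags(subject):
    # Remove reply/forward prefixes case-insensitively.
    s = re.sub(reply_pattern, '', subject, flags=re.I)
    s = re.sub(fwd_pattern, '', s, flags=re.I)
    return s.strip()
```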
def get_thread_df(email_df):
'''
Identify threads in an e-mail DataFrame, and extract
them into a new DataFrame.
'''
# Find threads by e-mails with reply patterns in their subjects.
# Then get a set of thread subjects.
is_thread = thread_flag(email_df['subject'])
thread_subj = email_df['subject'][is_thread]
# Clean up the subjects by removing reply and forward labels
thread_subj = clean_subject(thread_subj)
thread_subj = thread_subj.unique()
# Search for these thread subjects in the original
# e-mail DataFrame (so we pick up the original e-mail in the
# thread (which won't have a reply pattern in its subject)
# Prepare the DataFrame for searching by cleaning up the
# subjects.
search_df = email_df[['date', 'sender', 'subject']]
search_df['subject'] = clean_subject(search_df['subject'])
# Find subject matches
thread_matches = [subj in thread_subj for subj in search_df['subject']]
match_df = search_df.ix[thread_matches, :]
return match_df
# Get a DataFrame of threads in the training e-mails;
# compute sender weights within this subset of threads.
# To do the latter, we can use the same function we
# used on the whole DataFrame above.
thread_df = get_thread_df(train_df)
thread_sender_weights = get_sender_weights(thread_df)
def get_thread_activity(thread_df):
'''
Compute 'activity' statistics on threads in a DataFrame:
frequency: Number of e-mails observed in the thread
span: Time between the first and last e-mail observed (seconds)
weight: Number of e-mails per second in the thread (log scale)
'''
clean_times = thread_df['date'].map(lambda t: t.tzinfo is not None)
thread_df_clean = thread_df.ix[clean_times, :]
freq_by_thread = thread_df['subject'].value_counts()
# NB: I'm not sure if total_seconds() is daylight-savings aware
# so this may be mildly buggy.
seconds_span = lambda x: (
(np.max(x) - np.min(x)).total_seconds())
span_by_thread = thread_df_clean.groupby('subject')
span_by_thread = span_by_thread['date'].aggregate(seconds_span)
activity = DataFrame({'freq' : freq_by_thread,
'span' : span_by_thread,
'weight' : 10 + np.log10(freq_by_thread/span_by_thread)})
# Restricting to threads with at least two e-mails may be unnecessary, but we
# could get 1-e-mail threads for a couple of reasons: misclassification (it's
# not really a thread), and left-censored threads -- where we found the last
# reply in the thread, but the earlier messages were received before the data
# collection began.
activity = activity[activity['freq'] >= 2]
activity = activity[notnull(activity['weight'])]
return activity
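The weight arithmetic, checked on the thread examined at the end of this notebook: two messages four seconds apart give 10 + log10(2/4) ≈ 9.699.

```python
import math

def activity_weight(freq, span_seconds):
    # 10 + log10(messages per second); the +10 offset keeps
    # typical weights positive.
    return 10 + math.log10(freq / span_seconds)
```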
thread_activity_df = get_thread_activity(thread_df)
Check against the table on p. 116. Since the book ignores time-zone info, the spans in the book are often wrong (see, e.g., 'please help a newbie compile mplayer :-)').
threads_to_check = ['please help a newbie compile mplayer :-)',
'prob. w/ install/uninstall',
'http://apt.nixia.no/']
print thread_activity_df.ix[threads_to_check, :]
                                          freq    span    weight
please help a newbie compile mplayer :-)     4   13509  6.471437
prob. w/ install/uninstall                   4   12114  6.518772
http://apt.nixia.no/                        10  265303  5.576258
We create term-document matrices for the e-mails' subjects and messages (i.e., each subject is a document, or each message is a document). Then we make two different sets of term weighting functions. A subject term's weight is the mean activity weight of the thread subjects it occurs in. Note that only terms that are in the subjects of the threads in the training data get weight.
# Stopwords from R. NLTK has a set of stopwords we could use as
# well, but this minimizes discrepancies with the book.
rsw = read_csv('../ch3/r_stopwords.csv')['x'].values.tolist()
def get_thread_subject_term_weights(thread_activity_df):
'''
Creates a term->weight map based on a DataFrame containing
thread subjects and their activity weights
'''
thread_subjects = thread_activity_df.index
thread_tdm = tdm_df(thread_subjects, remove_punctuation = False,
remove_digits = False,
stopwords = rsw)
def calc_term_weight(term):
threads_with_term = np.where(thread_tdm[term] > 0.0)
weight_vec = thread_activity_df['weight'].ix[threads_with_term]
return weight_vec.mean()
term_weights = Series({t: calc_term_weight(t) for t in thread_tdm})
return term_weights
thread_subject_terms_weights = \
get_thread_subject_term_weights(thread_activity_df)
# Example term weights
print thread_subject_terms_weights.head(10)
aa           5.759571
adam         5.975084
adhesion     4.814012
adsl         6.397131
advice       4.915926
alsa         5.817899
angry        6.028029
announces    6.093981
anolther     5.317371
app          5.699035
def get_message_term_weights(email_df):
'''
Creates a term->weight map for terms in the messages of the training
e-mail DataFrame
'''
messages = email_df['message']
term_freq = tdm_df(messages, stopwords = rsw).sum()
term_weight_df = DataFrame({'freq' : term_freq,
'weight' : np.log10(term_freq)})
return term_weight_df
message_terms_weights = get_message_term_weights(train_df)
The following function implements both term-weighting functions: it computes either a message's subject-term weight or its message-term weight, depending on what arguments we pass to it.
def get_weight_from_terms(term_list, weight_df, subject = False):
'''
Given a term-list from an e-mail's message, and a term->weights
map contained in a DataFrame, calculate the e-mail's message or
subject term-weight. (default is message)
'''
if isinstance(term_list, basestring):
term_list = [term_list]
if subject:
weights = weight_df
else:
weights = weight_df['weight']
term_list_weight = weights[term_list].mean()
if np.isnan(term_list_weight):
term_list_weight = 1.0
return term_list_weight
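In modern pandas, indexing a Series with missing labels raises a KeyError rather than yielding NaNs, so a current sketch of the same fallback uses reindex (the term weights here are hypothetical):

```python
import numpy as np
import pandas as pd

term_weights = pd.Series({'mplayer': 6.47, 'install': 6.52})

def term_list_weight(terms, weights):
    # Mean weight of the terms; a list of entirely unseen terms
    # falls back to a neutral weight of 1.0.
    w = weights.reindex(terms).mean()
    return 1.0 if np.isnan(w) else w
```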
The following functions iterate through a DataFrame of e-mails and compute the weights of its e-mails by combining all the functions we've defined above.
def rank_email(email_df, row):
'''
Ranks an e-mail (as contained in the row of a DataFrame) by
computing and combining its five weights.
'''
email = email_df.ix[row, :]
date = email['date']
sender = email['sender']
subject = email['subject']
message = email['message']
# 1. Get sender weights (all messages)
sender_weight = (sender_weights['weight']
.get(sender) or 1.0)
# 2. Get sender weights (threads)
thread_sender_weight = (thread_sender_weights['weight']
.get(sender) or 1.0)
# 3. Get thread activity weight
is_thread = thread_flag(subject)
subject = clean_subject(subject)
if is_thread:
activity_weight = (thread_activity_df['weight']
.get(subject) or 1.0)
else:
activity_weight = 1.0
# 4. Get subject line weight via thread-subject term weights
subj_terms = tdm_df(subject, remove_punctuation = False,
remove_digits = False,
stopwords = rsw).columns
subj_term_weight = get_weight_from_terms(subj_terms,
thread_subject_terms_weights,
subject = True)
# 5. Message term weight
message_terms = tdm_df(message, stopwords = rsw).columns
message_term_weight = get_weight_from_terms(message_terms,
message_terms_weights)
weights = [sender_weight,
thread_sender_weight,
activity_weight,
subj_term_weight,
message_term_weight]
# The e-mail's final rank is just the product of the weights.
rank = np.array(weights).prod()
return rank
def make_rank_df(email_df):
'''
Rank each e-mail in a DataFrame.
'''
n_emails = email_df.shape[0]
rank_results = []
rank_df = email_df.copy()
for e in xrange(n_emails):
weights_rank = rank_email(email_df, e)
rank_results.append(weights_rank)
rank_df['rank'] = rank_results
return rank_df
Rank the e-mails in the training set.
train_ranks = make_rank_df(train_df)
Calculate the threshold for flagging a 'priority' message. New messages that have a rank greater than the threshold get flagged as priority.
threshold = train_ranks['rank'].median()
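Using the median means roughly half the training e-mails sit above the threshold by construction; a sketch with made-up ranks:

```python
import statistics

# Hypothetical ranks (illustrative values only)
ranks = [1.2, 3.4, 9.9, 27.0, 110.0]

threshold_demo = statistics.median(ranks)
flagged = [r for r in ranks if r > threshold_demo]
```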
Rank the e-mails in the test set.
test_ranks = make_rank_df(test_df)
Plot the density of e-mail ranks in the test and training set. Note the mass of e-mails in the test set are ranked lower than priority, and only a tail of e-mails are flagged.
train_kde = KDE(train_ranks['rank'])
train_kde.fit()
test_kde = KDE(test_ranks['rank'])
test_kde.fit()
plt.figure(figsize = (8, 6))
plt.fill(train_kde.support, train_kde.density, color = 'steelblue', alpha = .7,
label = 'Train')
plt.fill(test_kde.support, test_kde.density, color = 'red', alpha = .7,
label = 'Test')
plt.xlim(0, 400)
plt.ylim(0, np.max(test_kde.density))
plt.axvline(threshold, linestyle = '--', label = 'Priority threshold')
plt.xlabel('Rank')
plt.ylabel('Density')
plt.title('Distribution of ranks for training and test e-mails')
plt.legend(frameon = False)
How many e-mails do we flag as priority in the test set?
(test_ranks['rank'] > threshold).sum()
65
What sender addresses get flagged in the test set? We should see some familiar addresses from the graph above.
test_ranks[test_ranks['rank'] > threshold]['sender'].unique()
array([fork_list@hotmail.com, cdale@techmonkeys.net, owen@permafrost.net, geege@barrera.org, felicity@kluge.net, johnhall@evergo.net, welch@panasas.com, rah@shipwright.com, cwg-exmh@deepeddy.com, elias@cse.ucsc.edu, dl@silcom.com, yyyy@spamassassin.taint.org, ricardo@mdbcomunicacao.com.br, kre@munnari.oz.au, matthias@egwn.net, tomwhore@slack.net, quinlan@pathname.com, aeriksson@fastmail.fm, eugen@leitl.org, ejw@cse.ucsc.edu, garym@canada.com, angles@aminvestments.com, mark@talios.com, deafbox@hotmail.com, timc@2ubh.com], dtype=object)
Which training set e-mail is ranked highest?
maxrank_email = train_ranks.ix[train_ranks['rank'].idxmax(), :]
print 'Maximum rank in training set: ', train_ranks['rank'].max()
Maximum rank in training set: 1020.94633354
This turns out to be a lot higher than the largest found by the authors' R program. Let's check it out.
It turns out this e-mail's thread activity weight is substantially higher than found in the R code. As shown next, this is the result of a time-zone bug in the R code.
thread_activity_df.ix[clean_subject(maxrank_email['subject']), :]
freq      2.00000
span      4.00000
weight    9.69897
Name: [sadev] [bug 840] spam_level_char option change/removal
There are two messages in this thread (in the training set).
thread_df[thread_df['subject'].str.contains('spam_level_char')]['date']
734    2002-09-06 10:56:23-07:00
763    2002-09-06 13:56:19-04:00
Name: date
E-mail 734 comes 4 seconds after 763. But if we mistakenly ignore the time zone (as the R code does), 734 comes 2:59:56 before 763.
All else equal, this will reduce the rank of this e-mail by 35%.
correct_weight = 10 + np.log10(2 / 4.)
incorrect_weight = 10 + np.log10(2 / (2* 3600 + 59 * 60 + 56.))
print 'Activity weight correct time-zone: ', correct_weight
print 'Activity weight incorrect time-zone: ', incorrect_weight
print 'Rank difference: ' , '{:2.1%}'.format(1 - incorrect_weight / correct_weight)
Activity weight correct time-zone: 9.69897000434 Activity weight incorrect time-zone: 6.26776711978 Rank difference: 35.4%
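The bug is reproducible with stdlib timezone-aware datetimes (these are the two timestamps from the thread above):

```python
from datetime import datetime, timedelta, timezone

t_734 = datetime(2002, 9, 6, 10, 56, 23,
                 tzinfo=timezone(timedelta(hours=-7)))
t_763 = datetime(2002, 9, 6, 13, 56, 19,
                 tzinfo=timezone(timedelta(hours=-4)))

# Correct, timezone-aware span: 734 arrives 4 seconds after 763
aware_span = (t_734 - t_763).total_seconds()

# Dropping tzinfo (what the R code effectively does) inflates the span
naive_span = (t_763.replace(tzinfo=None)
              - t_734.replace(tzinfo=None)).total_seconds()
```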