String operations and regular expressions

Today we talk about strings. When we have a string, we might want to ask whether it has particular characteristics---does it start with a particular character? Does it contain within it another string?---or try to extract smaller parts of the string, like the first fifteen characters, or say, the part of the string inside parentheses. Or we may want to transform the string into another string altogether, by (for example) converting its characters to upper case, or replacing substrings within it with other substrings. Today we discuss how to do these things in Python.

Simple string checks

There are a number of functions, methods and operators that can tell us whether or not a Python string matches certain characteristics. Let's talk about the in operator first:

In [1]:
"foo" in "buffoon"
Out[1]:
True
In [2]:
"foo" in "reginald"
Out[2]:
False

The in operator takes one expression evaluating to a string on the left and another on the right, and returns True if the string on the left occurs somewhere inside of the string on the right.

We can check to see if a string begins with or ends with another string using that string's .startswith() and .endswith() methods, respectively:

In [3]:
"foodie".startswith("foo")
Out[3]:
True
In [4]:
"foodie".endswith("foo")
Out[4]:
False

The .isdigit() method returns True if the string is non-empty and every character in it is a digit (so the string could represent a non-negative integer), and False otherwise:

In [5]:
print "foodie".isdigit()
print "4567".isdigit()
False
True
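It's worth seeing where .isdigit() says no. The sketch below is only an illustration: the sample strings are made up, and looks_like_int is a hypothetical helper (not a built-in) showing one way you might allow a leading minus sign.

```python
# .isdigit() only looks at the characters; it does not try to parse a number.
checks = {
    "4567": "4567".isdigit(),   # True: every character is a digit
    "-5": "-5".isdigit(),       # False: the minus sign is not a digit
    "3.14": "3.14".isdigit(),   # False: the period is not a digit
    "": "".isdigit(),           # False: the empty string has no digits at all
}

def looks_like_int(s):
    """Hypothetical helper: True if s is an optional minus sign followed by digits."""
    if s.startswith("-"):
        s = s[1:]
    return s.isdigit()
```

If you need to handle decimals or other number formats, this kind of check gets fiddly fast, which is part of the motivation for regular expressions later on.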

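And the .islower() and .isupper() methods return True if every letter in the string is lower case or upper case, respectively (and False otherwise). Spaces, digits and punctuation are ignored, but the string has to contain at least one letter.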

In [12]:
print "foodie".islower()
print "foodie".isupper()
True
False
In [8]:
print "YELLING ON THE INTERNET".islower()
print "YELLING ON THE INTERNET".isupper()
False
True

Finding substrings

The in operator discussed above will tell us if a substring occurs in some other string. If we want to know where that substring occurs, we can use the .find() method. The .find() method takes a single parameter between its parentheses: an expression evaluating to a string, which will be searched for within the string whose .find() method was called. If the substring is found, the entire expression will evaluate to the index at which the substring is found. If the substring is not found, the expression evaluates to -1. To demonstrate:

In [10]:
print "Now is the winter of our discontent".find("win")
print "Now is the winter of our discontent".find("lose")
11
-1

The .count() method will return the number of times a particular substring is found within the larger string:

In [13]:
print "I got rhythm, I got music, I got my man, who could ask for anything more".count("I got")
3
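The .find() method only reports the first match, but it also accepts an optional second parameter: the index at which to start searching. Here's a minimal sketch that uses that to collect every position. The name find_all is made up for this example; it isn't a built-in string method.

```python
def find_all(haystack, needle):
    """Hypothetical helper: list every index at which needle starts in haystack."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        # resume the search one character past the previous hit
        start = haystack.find(needle, start + 1)
    return positions

lyric = "I got rhythm, I got music, I got my man, who could ask for anything more"
positions = find_all(lyric, "I got")        # [0, 14, 27]

# .count() only counts non-overlapping matches; find_all() reports overlaps too.
overlap_count = "aaaa".count("aa")          # 2
overlap_positions = find_all("aaaa", "aa")  # [0, 1, 2]
```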

String slices

As has been alluded to previously, string slices work exactly like list slices---except you're getting characters from the string, instead of elements from a list. Observe:

In [15]:
message = "bungalow"
message[3]
Out[15]:
'g'
In [16]:
message[1:6]
Out[16]:
'ungal'
In [17]:
message[:3]
Out[17]:
'bun'
In [18]:
message[2:]
Out[18]:
'ngalow'
In [21]:
message[-2]
Out[21]:
'o'

Combine this with the .find() method and you can write expressions that evaluate to everything from where a substring matches up to the end of the string:

In [23]:
shakespeare = "Now is the winter of our discontent"
substr_index = shakespeare.find("win")
print shakespeare[substr_index:]
winter of our discontent
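The same trick handles the "part of the string inside parentheses" task mentioned at the top of these notes. This is a rough sketch, not the only way to do it: inside_parens is a hypothetical helper, the sample strings are invented, and it only looks at the first pair of parentheses.

```python
def inside_parens(text):
    """Hypothetical helper: return the text between the first "(" and the next ")"."""
    open_index = text.find("(")
    if open_index == -1:
        return None                      # no opening parenthesis at all
    close_index = text.find(")", open_index)
    if close_index == -1:
        return None                      # opened but never closed
    # slice from just after "(" up to (not including) ")"
    return text[open_index + 1:close_index]

found = inside_parens("Bungalow for rent (two bedrooms) near the beach")
missing = inside_parens("Bungalow for rent")
unclosed = inside_parens("Prudential Securities (C")
```

Here found is 'two bedrooms', while missing and unclosed are both None.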

Simple string transformations

Python strings have a number of different methods which, when called on a string, return a copy of that string with a simple transformation applied to it. These are helpful for normalizing and cleaning up data, or preparing it to be displayed.

Let's start with .lower(), which evaluates to a copy of the string in all lower case:

In [28]:
"ARGUMENTATION! DISAGREEMENT! STRIFE!".lower()
Out[28]:
'argumentation! disagreement! strife!'

The converse of .lower() is .upper():

In [32]:
"e.e. cummings is. not. happy about this.".upper()
Out[32]:
'E.E. CUMMINGS IS. NOT. HAPPY ABOUT THIS.'

The method .title() evaluates to a copy of the string it's called on, with the first letter of every word converted to a capital letter and all of the other letters converted to lower case:

In [33]:
"dr. strangelove, or, how I learned to love the bomb".title()
Out[33]:
'Dr. Strangelove, Or, How I Learned To Love The Bomb'
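A word of caution: .title() has a fairly simple-minded idea of what a "word" is. The strings below are invented examples that show two common surprises.

```python
plain = "dr. strangelove".title()       # 'Dr. Strangelove'

# An apostrophe counts as a word break, so the letter after it is capitalized.
contraction = "they're here".title()    # "They'Re Here"

# Letters that aren't at the start of a word are forced to lower case.
acronym = "notes from NASA".title()     # 'Notes From Nasa'
```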

The .strip() method removes any whitespace from the beginning and end of the string (but not whitespace between characters in the middle of the string):

In [30]:
" got some random whitespace in some places here     ".strip()
Out[30]:
'got some random whitespace in some places here'

Finally, the .replace() method takes two parameters: a string to find, and a string to replace that string with whenever it's found. You can use this to make sad stories.

In [44]:
"I got rhythm, I got music, I got my man, who could ask for anything more".replace("I got", "I used to have")
Out[44]:
'I used to have rhythm, I used to have music, I used to have my man, who could ask for anything more'
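Because each of these methods evaluates to a new string, you can chain them together to clean up messy data in one expression. Here's a small sketch under some assumptions: normalize is a made-up helper name, and the underscore-to-space rule is just an example of a cleanup you might want.

```python
def normalize(raw):
    """Hypothetical helper: trim whitespace, lower-case, and turn underscores into spaces."""
    return raw.strip().lower().replace("_", " ")

messy = "  New_York_Power_Authority \n"
cleaned = normalize(messy)                 # 'new york power authority'

# The original string is untouched; string methods never modify it in place.
still_messy = messy

# .replace() also accepts an optional third parameter: the maximum
# number of replacements to make.
first_only = "a-b-c".replace("-", "+", 1)  # 'a+b-c'
```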

"Escape" sequences in strings

Inside of strings that you type into your Python code, there are certain sequences of characters that have a special meaning. These sequences start with a backslash character (\) and allow you to insert into your string characters that would otherwise be difficult to type, or that would go against Python syntax. Here's some code illustrating a few common sequences:

In [113]:
print "include \"double quotes\" (inside of a double-quoted string)"
print 'include \'single quotes\' (inside of a single-quoted string)'
print "one\ttab, two\ttabs"
print "new\nline"
print "include an actual backslash \\ (two backslashes in the string)"
include "double quotes" (inside of a double-quoted string)
include 'single quotes' (inside of a single-quoted string)
one	tab, two	tabs
new
line
include an actual backslash \ (two backslashes in the string)
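One way to convince yourself that an escape sequence stands for a single character is to check the length of the resulting string. The variable names here are just for illustration.

```python
tabbed = "a\tb"          # three characters: a, a tab, b
backslash = "\\"         # one character: a single backslash
quoted = "say \"hi\""    # eight characters; the backslashes are not stored

lengths = (len(tabbed), len(backslash), len(quoted))   # (3, 1, 8)

# The \n really is a newline character, so we can split on it.
two_lines = "new\nline".split("\n")                    # ['new', 'line']
```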

Regular expressions

So far, we've discussed how to write programs and expressions that are able to check whether strings meet very simple criteria, such as “does this string begin with a particular character?” or “does this string contain another string?” But imagine writing a program that performs the following task: find and print all ZIP codes in a string (i.e., every sequence of five digits). Give up? Here’s my attempt, using only the tools we’ve discussed so far:

In [38]:
input_str = "here's a zip code: 12345. 567 isn't a zip code, but 45678 is. 23456? yet another zip code."
current = ""
zips = []
for ch in input_str:
    if ch in '0123456789':
        current += ch
    else:
        current = ""
    if len(current) == 5:
        zips.append(current)
        current = ""
print zips
['12345', '45678', '23456']

Basically, we have to iterate over each character in the string and check whether that character is a digit. If it is, we append it to a string variable; if it isn't, we throw away whatever we've collected and start over. As soon as we've collected five digits in a row, we add them to a list and start collecting again. At the end, we print out the list that has all of our results. Problems with this code: it’s messy; it doesn’t overtly communicate what it’s doing; it’s not easily generalized to other, similar tasks (e.g., if we wanted to write a program that printed out phone numbers from a string, the code would likely look completely different).

Our ancient UNIX pioneers had this problem, and in pursuit of a solution, thought to themselves, "Let’s make a tiny language that allows us to write specifications for textual patterns, and match those patterns against strings. No one will ever have to write fiddly code that checks strings character-by-character ever again." And thus regular expressions were born.

Here's the code for accomplishing the same task with regular expressions, by the way:

In [40]:
import re
zips = re.findall(r"\d{5}", input_str)
print zips
['12345', '45678', '23456']

I’ll allow that the r"\d{5}" in there is mighty cryptic (though hopefully it won’t be when you’re done reading this page and/or participating in the associated lecture). But the overall structure of the program is much simpler.
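One caveat: \d{5} is happy to match the first five digits of a longer run of digits (and so is the hand-written loop). If that's not what you want, one possible fix is to wrap the pattern in \b, which matches at a word boundary. The sample text below is invented for the sake of the example.

```python
import re

text = "order 1234567, zip 12345, phone 5551234"

# Any five digits in a row, even in the middle of a longer number.
loose = re.findall(r"\d{5}", text)        # ['12345', '12345', '55512']

# Five digits with a word boundary on either side.
strict = re.findall(r"\b\d{5}\b", text)   # ['12345']
```

Whether the loose or the strict version is "right" depends on your data; neither one knows what an actual ZIP code is.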

Fetching our corpus

For this section of class, we'll be using the subject lines of all e-mails in the EnronSent corpus, kindly put into the public domain by the United States Federal Energy Regulatory Commission. Download a copy into your notebook directory like so:

In [185]:
import urllib
urllib.urlretrieve("https://raw.githubusercontent.com/ledeprogram/courses/master/databases/data/enronsubjects.txt", "enronsubjects.txt")
Out[185]:
('enronsubjects.txt', <httplib.HTTPMessage instance at 0x1053d8ab8>)
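If you're following along in Python 3, note that urlretrieve moved to urllib.request. Here's a sketch that should work in either version; fetch_corpus and CORPUS_URL are names I made up for this example, and the URL is the same one used above.

```python
try:
    # Python 3 location of urlretrieve
    from urllib.request import urlretrieve
except ImportError:
    # Python 2 location, as used in the cell above
    from urllib import urlretrieve

CORPUS_URL = "https://raw.githubusercontent.com/ledeprogram/courses/master/databases/data/enronsubjects.txt"

def fetch_corpus(filename="enronsubjects.txt"):
    """Hypothetical helper: download the corpus and return the local filename."""
    local_path, headers = urlretrieve(CORPUS_URL, filename)
    return local_path
```

Calling fetch_corpus() performs the download, so you only need to run it once.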

Matching strings with regular expressions

The most basic operation that regular expressions perform is matching strings: you’re asking the computer whether a particular string matches some description. We're going to be using regular expressions to print only those lines from our enronsubjects.txt corpus that match particular sequences. Let's load our corpus into a list of lines first:

In [186]:
subjects = [x.strip() for x in open("enronsubjects.txt").readlines()]

We can check whether or not a pattern matches a given string in Python with the re.search() function. The first parameter to search is the regular expression you're trying to match; the second parameter is the string you're matching against.

Here's an example, using a very simple regular expression. The following code prints out only those lines in our Enron corpus that match the (very simple) regular expression shipping:

In [187]:
import re
[line for line in subjects if re.search("shipping", line)]
Out[187]:
['FW: How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'How to use UPS for shipping on the internet',
 'lng shipping/mosk meeting in tokyo 2nd of feb',
 'lng shipping/mosk meeting in tokyo 2nd of feb',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'lng shipping',
 'lng shipping',
 'Re: lng shipping',
 'lng shipping']

At its simplest, a regular expression matches a string if that string contains exactly the characters you've specified in the regular expression. So the expression shipping matches strings that contain the characters s, h, i, p, p, i, n, and g, in that order, all in a row. If the regular expression matches, re.search() evaluates to True and the matching line is included in the evaluation of the list comprehension.

BONUS TECH TIP: re.search() doesn't actually evaluate to True or False---it evaluates to either a Match object if a match is found, or None if no match was found. Those two count as True and False for the purposes of an if statement, though.
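To make that tip concrete, here's a minimal sketch of what you can do with the value re.search() gives back. The subject line is copied from the output above; .group() and .start() are methods of the match object.

```python
import re

subject = "Re: lng shipping"

match = re.search("shipping", subject)
if match:
    found_text = match.group()   # the text that matched: 'shipping'
    found_at = match.start()     # index where the match begins: 8
else:
    found_text = None
    found_at = -1

# No match: re.search() evaluates to None, which counts as False in an if.
no_match = re.search("shopping", subject)
```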

Metacharacters: character classes

The "shipping" example is pretty boring. (There was hardly any fan fiction in there at all.) Let's go a bit deeper into detail with what you can do with regular expressions. There are certain characters or strings of characters that we can insert into a regular expression that have special meaning. For example:

In [101]:
[line for line in subjects if re.search("sh.pping", line)]
Out[101]:
['FW: How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'How to use UPS for shipping on the internet',
 "FW: We've been shopping!",
 'Re: Start shopping...',
 'Start shopping...',
 'lng shipping/mosk meeting in tokyo 2nd of feb',
 'lng shipping/mosk meeting in tokyo 2nd of feb',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'lng shipping',
 'lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'FW: Online shopping',
 'Online shopping']

In a regular expression, the character . means "match any character here" (by default, any character except a newline). So, using the regular expression sh.pping, we get lines that match shipping but also shopping. The . is an example of a regular expression metacharacter---a character (or string of characters) that has a special meaning.

Here are a few more metacharacters. These metacharacters allow you to say that a character belonging to a particular class of characters should be matched in a particular position:

metacharacter  meaning
.              match any character (except a newline, by default)
\w             match any alphanumeric ("word") character (lowercase and capital letters, 0 through 9, underscore)
\s             match any whitespace character (e.g., space, tab, newline)
\S             match any non-whitespace character (the inverse of \s)
\d             match any digit (0 through 9)
\.             match a literal .
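Here's a quick tour of those metacharacters in action, using re.findall() to list everything each one matches. The sample strings are made up for illustration.

```python
import re

# . matches any single character, so "sh.pping" needs exactly one between h and p
any_char = re.findall(r"sh.pping", "shipping shopping sh-pping shpping")

word_chars = re.findall(r"\w", "a_1 -!")       # ['a', '_', '1']
spaces = re.findall(r"\s", "a b\tc\nd")        # [' ', '\t', '\n']
non_spaces = re.findall(r"\S", "a b\tc")       # ['a', 'b', 'c']
digits = re.findall(r"\d", "45678 is a zip")   # ['4', '5', '6', '7', '8']
literal_dots = re.findall(r"\.", "e.e. cummings")   # ['.', '.']

# By default, . refuses to match a newline character.
dot_vs_newline = re.search(r"a.c", "a\nc")     # None
```

In that first line, any_char comes out as ['shipping', 'shopping', 'sh-pping']: the last word has nothing for the . to match.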

Here, for example, is a (clearly imperfect) regular expression to search for all subject lines containing a time of day:

In [111]:
[line for line in subjects if re.search(r"\d:\d\d\wm", line)]
Out[111]:
['RE: 3:17pm',
 '3:17pm',
 "RE: It's On!!! - 2:00pm Today",
 "FW: It's On!!! - 2:00pm Today",
 "It's On!!! - 2:00pm Today",
 'Re: Registration Confirmation: Larry Summers on 12/6 at 1:45pm (was',
 'Re: Conference Call today 2/9/01 at 11:15am PST',
 'Conference Call today 2/9/01 at 11:15am PST',
 '5/24 1:00pm conference call.',
 '5/24 1:00pm conference call.',
 'FW: 07:33am EDT 15-Aug-01 Prudential Securities (C',
 'FW: 07:33am EDT 15-Aug-01 Prudential Securities (C',
 '07:33am EDT 15-Aug-01 Prudential Securities (C',
 "Re: Updated Mar'00 Requirements Received at 11:25am from CES",
 "Re: Updated Mar'00 Requirements Received at 11:25am from CES",
 "Re: Updated Mar'00 Requirements Received at 11:25am from CES",
 "Updated Mar'00 Requirements Received at 11:25am from CES",
 'Reminder: Legal Team Meeting -- Friday, 9:00am Houston time',
 'Thursday, March 7th 1:30-3:00pm: REORIENTATION',
 'Meeting at 2:00pm Friday',
 'Meeting at 2:00pm Friday',
 'Fw: 12:30pm Deadline for changes to letters or contracts today',
 '12:30pm Deadline for changes to letters or contracts today',
 'Johnathan actually resigned at 9:00am this morning',
 'FW: Enron Conference Call Today, 11:00am CST',
 'Enron Conference Call Today, 11:00am CST',
 'Meeting, Wednesday, January 23 at 10:00am at the Houstonian',
 'RE: TVA Meeting, Wednesday June13, 1:15pm, EB3125b',
 'TVA Meeting, Wednesday June13, 1:15pm, EB3125b',
 'Re: Dabhol Update: Conference Call Thursday, Dec. 28, 8:00am',
 'Dabhol Update: Conference Call Thursday, Dec. 28, 8:00am Houston time',
 'FW: Victoria Ashley Jones Born 5/25/01 7:31am.',
 'Fw: Victoria Ashley Jones Born 5/25/01 7:31am.',
 'Victoria Ashley Jones Born 5/25/01 7:31am.',
 'RE: Victoria Ashley Jones Born 5/25/01 7:31am.',
 'Fw: Victoria Ashley Jones Born 5/25/01 7:31am.',
 'Victoria Ashley Jones Born 5/25/01 7:31am.',
 'RE: UCSF Cogen Calculation Conf Call, 10/12/01 at 8:00am PST',
 'UCSF Cogen Calculation Conf Call, 10/12/01 at 8:00am PST',
 'FW: Confirmation:  UCSF Cogen Conf Call. 10/22/02 at 8:00am',
 '=09RE: Confirmation:  UCSF Cogen Conf Call. 10/22/02 at 8:00am PST/=',
 '=09Confirmation:  UCSF Cogen Conf Call. 10/22/02 at 8:00am PST/10:0=',
 'RE: Confirmation:  UCSF Cogen Conf Call. 10/22/02 at 8:00am',
 '=09Confirmation:  UCSF Cogen Conf Call. 10/22/02 at 8:00am PST/10:0=',
 'Re: March expenses - deadline 04-04-01 2:00pm',
 'Cirque - Jan 24 5:00pm show']
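I called this expression imperfect, and here's a sketch of why. The test strings are invented. The tighter pattern at the end is just one possible improvement: it uses [ap], a square-bracket character class meaning "either a or p".

```python
import re

time_pattern = r"\d:\d\d\wm"

# Only one digit is required before the colon, so "11:15am" matches as "1:15am".
partial = re.search(time_pattern, "Call today at 11:15am PST").group()   # '1:15am'

# \w accepts any letter, digit or underscore, not just a or p.
bogus = re.search(time_pattern, "code 7:45_m").group()                   # '7:45_m'

# A space between the time and "pm" defeats the pattern entirely.
missed = re.search(time_pattern, "Meeting at 2:00 pm")                   # None

# One possible tightening: insist on a or p before the m.
tighter_pattern = r"\d:\d\d[ap]m"
bogus_now = re.search(tighter_pattern, "code 7:45_m")                    # None
real_time = re.search(tighter_pattern, "Meeting at 2:00pm").group()      # '2:00pm'
```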

Here's that regular expression again: r"\d:\d\d\wm". I'm going to show you how to read this, one unit at a time.

"Hey, regular expression engine. Tell me if you can find this pattern in the current string. First of all, look for any digit (\d). If you find that, look for a colon right after it (:). If you find that, look for another digit right after it (\d). If you find that, look for yet another digit (\d). If you find that, look for any alphanumeric character (\w): you know, a letter, a number, an underscore. If you find that, then look for an m. Good? If you found all of those things in a row, then the pattern matched."

But what about that weirdo r""?

Python provides another way to include string literals in your program, in addition to the single- and double-quoted strings we've already discussed. The r"" string literal, or "raw" string, includes all characters inside the quotes literally, without interpreting backslash escape sequences. Here's an example:

In [114]:
print "this is\na test"
print r"this is\na test"
print "I love \\ backslashes!"
print r"I love \ backslashes!"
this is
a test
this is\na test
I love \ backslashes!
I love \ backslashes!

As you can see, whereas a double- or single-quoted string literal interprets \n as a new line character, the raw string includes those characters as they were literally written. More important, for our purposes at least, is the fact that in a raw string we only need to write one backslash in order to get a literal backslash in our string.

Why is this important? Because regular expressions use backslashes all the time, and we don't want Python to try to interpret those backslashes as special characters. (Inside a regular string, we'd have to write a simple regular expression like \b\w+\b as \\b\\w+\\b---yecch.)

So the basic rule of thumb is this: use r"" to quote any regular expressions in your program. All of the examples you'll see below will use this convention.
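Here's a small sketch of what goes wrong when you forget the r. In an ordinary string literal, Python reads \b as a backspace character, so the regular expression engine never sees the word-boundary metacharacter. The sample sentence is made up.

```python
import re

text = "a cat sat on the concatenated mat"

plain = "\bcat\b"        # Python turns each \b into a backspace character
escaped = "\\bcat\\b"    # doubled backslashes survive as single backslashes
raw = r"\bcat\b"         # raw string: the backslashes are left alone

pattern_lengths = (len(plain), len(escaped), len(raw))   # (5, 7, 7)
same_pattern = (escaped == raw)                           # True

plain_hits = re.findall(plain, text)   # []: the text contains no backspaces
raw_hits = re.findall(raw, text)       # ['cat']: the whole word only
```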

Character classes in-depth

You can define your own character classes by enclosing a list of characters, or range of characters, inside square brackets:

regex    explanation
[aeiou]  matches any lower-case vowel
[02468]  matches any even digit
[a-z]    matches any lower-case letter
[A-Z]    matches any upper-case letter
[^0-9]   matches any non-digit (the ^ inverts the class, matches anything not in the list)
[Ee]     matches either E or e
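Here are those character classes at work, again using re.findall() to show every match. The sample strings are made up for illustration.

```python
import re

vowels = re.findall(r"[aeiou]", "Enron Update")         # ['o', 'a', 'e']
even_digits = re.findall(r"[02468]", "12345 67890")     # ['2', '4', '6', '8', '0']
capitals = re.findall(r"[A-Z]", "New York")             # ['N', 'Y']
non_digits = re.findall(r"[^0-9]", "a1b2")              # ['a', 'b']
either_e = re.findall(r"[Ee]nron", "Enron, enron, ENRON")   # ['Enron', 'enron']
```

Notice that [aeiou] skips the capital E and U in "Enron Update": character classes are case-sensitive unless you list both cases yourself.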

Let's find every subject line where we have four or more lower-case vowels in a row:

In [121]:
[line for line in subjects if re.search(r"[aeiou][aeiou][aeiou][aeiou]", line)]
Out[121]:
['Re: Natural gas quote for Louiisiana-Pacific (L-P)',
 'WooooooHoooooo more Vacation',
 'Re: Clickpaper Counterparties waiting to clear the work queue',
 'Gooooooooooood Bye!',
 'Gooooooooooood Bye!',
 'RE: Hello Sweeeeetie',
 'Hello Sweeeeetie',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'Re: FW: Wasss Uuuuuup STG?',
 'RE: Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',
 'Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',
 'FW: The Osama Bin Laden Song ( Soooo Funny !! )',
 'Fw: The Osama Bin Laden Song ( Soooo Funny !! )',
 'The Osama Bin Laden Song ( Soooo Funny !! )',
 'RE: duuuuhhhhh',
 'RE: duuuuhhhhh',
 'RE: duuuuhhhhh',
 'duuuuhhhhh',
 'RE: duuuuhhhhh',
 'duuuuhhhhh',
 'RE: FPL Queue positions 1-15',
 'Re: FPL Queue positions 1-15',
 'Re: Helloooooo!!!',
 'Re: Helloooooo!!!',
 'Fw: FW: OOOooooops',
 'FW: FW: OOOooooops',
 'Re: yeeeeha',
 'yeeeeha',
 'yahoooooooooooooooooooo',
 'RE: yahoooooooooooooooooooo',
 'RE: yahoooooooooooooooooooo',
 'yahoooooooooooooooooooo',
 'RE: I hate yahooooooooooooooo',
 'I hate yahooooooooooooooo',
 'RE: I hate yahooooooooooooooo',
 'I hate yahooooooooooooooo',
 'RE: I hate yahooooooooooooooo',
 'I hate yahooooooooooooooo',
 'RE: I hate yahooooooooooooooo',
 'I hate yahooooooooooooooo',
 "FW: duuuuuuuuuuuuuuuuude...........what's up?",
 "RE: duuuuuuuuuuuuuuuuude...........what's up?",
 "RE: duuuuuuuuuuuuuuuuude...........what's up?",
 'Re: skiiiiiiiiing',
 'skiiiiiiiiing',
 'scuba dooooooooooooo',
 'RE: scuba dooooooooooooo',
 'RE: scuba dooooooooooooo',
 'scuba dooooooooooooo',
 'Re: skiiiiiiiing',
 'skiiiiiiiing',
 'Re: skiiiiiiiing',
 'Re: skiiiiiiiiing',
 "RE: Clickpaper CP's awaiting migration in work queue's 06/27/01",
 "FW: Clickpaper CP's awaiting migration in work queue's 06/27/01",
 "Clickpaper CP's awaiting migration in work queue's 06/27/01",
 'RE:  Sequoia Adv. Pro.: Draft Stipulation and Order',
 'FW: Sequoia Adv. Pro.: Draft Stipulation and Order',
 'Sequoia Adv. Pro.: Draft Stipulation and Order',
 'Re: FW: Sequoia Adv. Pro.: Draft Stipulation and Order',
 'FW: Sequoia Adv. Pro.: Draft Stipulation and Order',
 'FW: Sequoia Adv. Pro.: Draft Stipulation and Order',
 'Fw: Sequoia Adv. Pro.: Draft Stipulation and Order',
 'Sequoia Adv. Pro.: Draft Stipulation and Order',
 'Sequoia Adv. Pro.: Draft Stipulation and Order',
 'i would have done this but i was toooo busy.....']
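Typing [aeiou] four times is tedious. Since a regular expression is just a Python string, one option is to build it with the * string-repetition operator. This is a sketch with made-up variable names; the test strings are borrowed from the output above.

```python
import re

# String repetition builds exactly the pattern we typed out by hand.
four_vowels = r"[aeiou]" * 4
same_as_before = (four_vowels == r"[aeiou][aeiou][aeiou][aeiou]")   # True

# This explains the "queue" hits: u, e, u, e are four vowels in a row.
queue_hit = re.search(four_vowels, "work queue").group()            # 'ueue'
yeeha_hit = re.search(four_vowels, "Re: yeeeeha").group()           # 'eeee'

# The class only lists lower-case vowels, so capitals don't count...
shouting = re.search(four_vowels, "FW: OOOOPS")                     # None

# ...unless we add them to the class ourselves.
any_case_vowels = r"[aeiouAEIOU]" * 4
shouting_hit = re.search(any_case_vowels, "FW: OOOOPS").group()     # 'OOOO'
```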

Metacharacters: anchors

The next important kind of metacharacter is the anchor. An anchor doesn't match a character, but matches a particular place in a string.

anchor  meaning
^       match at beginning of string
$       match at end of string
\b      match at word boundary

Note: ^ in a character class has a different meaning from ^ outside a character class!

Note #2: If you want to search for a literal dollar sign ($), you need to put a backslash in front of it, like so: \$
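Here's a short sketch of each anchor, plus the two notes above, using invented strings. Remember that re.search() evaluates to None when there's no match.

```python
import re

at_start = re.search(r"^New", "New York") is not None       # True
not_at_start = re.search(r"^York", "New York") is not None  # False
at_end = re.search(r"York$", "New York") is not None        # True

# \b requires a word boundary on each side, so "cat" inside a longer word fails.
whole_word = re.search(r"\bcat\b", "the cat sat") is not None   # True
buried = re.search(r"\bcat\b", "concatenate") is not None       # False

# Inside square brackets, ^ means "not": any character except N, followed by ew.
caret_in_class = re.findall(r"[^N]ew", "New few")    # ['few']

# A backslash turns $ into an ordinary dollar sign.
dollars = re.search(r"\$\d", "costs $5").group()     # '$5'
```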

Now we have enough regular expression knowledge to do some fairly sophisticated matching. As an example, here are all the subject lines that begin with the string New York, whether or not the initial letters are capitalized:

In [127]:
[line for line in subjects if re.search(r"^[Nn]ew [Yy]ork", line)]
Out[127]:
['New York Details',
 'New York Power Authority',
 'New York Power Authority',
 'New York Power Authority',
 'New York Power Authority',
 'New York',
 'New York',
 'New York',
 'New York, etc.',
 'New York, etc.',
 'New York sites',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York',
 'New York',
 'New York City Marathon Guaranteed Entry',
 'new york rest reviews',
 'New York State Electric & Gas Corporation ("NYSEG")',
 'New York State Electric & Gas Corporation ("NYSEG")',
 'New York State Electric & Gas Corporation ("NYSEG")',
 'New York State Electric & Gas ("NYSEG")',
 'New York regulatory restriccions',
 'New York regulatory restriccions',
 'New York Bar Numbers']

And here is every subject line that ends with an ellipsis (three or more periods in a row):

In [130]:
[line for line in subjects if re.search(r"\.\.\.$", line)]
Out[130]:
['Re: Inquiry....',
 'Re: Inquiry....',
 'RE: the candidate we spoke about this morning...',
 'the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'the candidate we spoke about this morning...',
 'Re: Hmmmmm........',
 'Hmmmmm........',
 'FW: Bumping into the husband....',
 'FW: Bumping into the husband....',
 'RE: try this one...',
 'RE: try this one...',
 'Re: try this one...',
 'try this one...',
 'RE: try this one...',
 'RE: try this one...',
 'Re: try this one...',
 'try this one...',
 'RE: try this one...',
 'RE: try this one...',
 'Re: try this one...',
 'try this one...',
 'RE: try this one...',
 'RE: try this one...',
 'Re: try this one...',
 'try this one...',
 'Re: try this one...',
 'try this one...',
 'Henry Hub instead of NYMEX...',
 'Henry Hub instead of NYMEX...',
 'Re: Henry Hub instead of NYMEX...',
 'Henry Hub instead of NYMEX...',
 'Transcanada Trade...',
 'Transcanada Trade...',
 'Here is the Article---no picture though...',
 'Here is the Article---no picture though...',
 'Re: ooops....',
 'Re: ooops....',
 'Re: ooops....',
 'ooops....',
 'FW: A crossroads we have all been at ...',
 'FW: A crossroads we have all been at ...',
 'RE: try this one...',
 'RE: try this one...',
 'Re: try this one...',
 'try this one...',
 'RE: try this one...',
 'RE: try this one...',
 'Re: try this one...',
 'try this one...',
 'FW: follow up > FW: Caltech-developed arbitrage trading technolog\ty being assessed by Reliant Energy right now...',
 'follow up > FW: Caltech-developed arbitrage  trading technology being assessed by Reliant Energy right  now...',
 'Caltech-developed arbitrage trading technology  being assessed by Reliant Energy right now...',
 'RE: try this one...',
 'RE: try this one...',
 'Re: try this one...',
 'try this one...',
 "RE: okay here's what i got on the euro...",
 "okay here's what i got on the euro...",
 "RE: okay here's what i got on the euro...",
 'RE: first of all...',
 'RE: first of all...',
 'RE: try this one...',
 'RE: try this one...',
 'Re: try this one...',
 'try this one...',
 'Re: try this one...',
 'try this one...',
 'RE: Follow up for Hardware Request....',
 'Follow up for Hardware Request....',
 'RE: Yahoo - GE Lighting Launches National Energy Program ...',
 'RE: cheer up...',
 'cheer up...',
 'RE: Leaving Enron.....',
 'Leaving Enron.....',
 'Fwd: Revenge is a sweet thing...',
 'Fwd: Revenge is a sweet thing...',
 'Fwd: Revenge is a sweet thing...',
 'Re: Got bored and...',
 'Got bored and...',
 'Re: Fw: [txhmed] interesting ...',
 'Re: all Hector wants for christmas...',
 'all Hector wants for christmas...',
 "Re: Check out Leni's website...",
 "Check out Leni's website...",
 'my ....',
 'my ....',
 'Re: Just a little something to make you smile.......',
 'Just a little something to make you smile.......',
 'Just a little something to make you smile.......',
 'Steamboat Vacation information...',
 'Steamboat Vacation information...',
 'FW: FW: (fwd) FW:  Warning from HFD...',
 'Fw: (fwd) FW: Warning from HFD...',
 'FW: (fwd) FW: Warning from HFD...',
 'RE: Yeah Orange....',
 'Yeah Orange....',
 'RE: Yeah Orange....',
 'RE: Yeah Orange....',
 'Re: one last thing...',
 'one last thing...',
 'If you are stuck...',
 "Re: FW: You've Been in Corporate America Too Long When...",
 "Re: It's true what they say...",
 "It's true what they say...",
 'Re: Todd & Things....',
 'Todd & Things....',
 'Re: testing....',
 "Re: Don't send a dad...",
 "Don't send a dad...",
 'James is coming...',
 'Re: Congratulations, etc...................',
 'Congratulations, etc...................',
 'RE: Fancy meeting you....',
 'Fancy meeting you....',
 'FW: In the spirit of cooperation...',
 'In the spirit of cooperation...',
 'In the spirit of cooperation...',
 'RE: Back on the Block....',
 'Back on the Block....',
 'RE: A little humor for the new year....',
 'A little humor for the new year....',
 'RE: Infrastructure Prevents...',
 'Infrastructure Prevents...',
 'Re: FW: Could you please....',
 'RE: FW: Could you please....',
 'RE: FW: Could you please....',
 'Re: FW: Could you please....',
 'Re: Just Checking...',
 'Just Checking...',
 'RE: Vacation...',
 'RE: Vacation...',
 'Vacation...',
 'FW: Vacation....',
 'RE: Vacation....',
 'Vacation....',
 'FW: AA has left the building...',
 'AA has left the building...',
 'Re: It has been a while...',
 "Re: Fw: it ain't easy.....",
 'FW: Message from Boeing.......',
 'FW: Message from Boeing.......',
 'FW: Message from Boeing.......',
 'Message from Boeing.......',
 'RE: I AM THANKFUL FOR ......',
 'Re: I AM THANKFUL FOR ......',
 'RE: I AM THANKFUL FOR ......',
 'Re: I AM THANKFUL FOR ......',
 'RE: I AM THANKFUL FOR ......',
 'Re: I AM THANKFUL FOR ......',
 'FW: I AM THANKFUL FOR ......',
 'FW: I AM THANKFUL FOR ......',
 'Re: Back in the saddle again...',
 'Re: Back in the saddle again...',
 'Re: Back in the saddle again...',
 'RE: I know this sounds crazy but...',
 'I know this sounds crazy but...',
 'RE: By the way...',
 'By the way...',
 "Re: Fwd: Why you don't drink till you pass out.....",
 "Fwd: Why you don't drink till you pass out.....",
 "Fwd: Why you don't drink till you pass out.....",
 "Fwd: Why you don't drink till you pass out.....",
 "Why you don't drink till you pass out.....",
 "Fw: Ads you won't see...",
 "Fw: Ads you won't see...",
 "RE: FW: Nostradamus' prediction on WW3..................",
 "Re: FW: Nostradamus' prediction on WW3..................",
 'Lets get this ball rolling....',
 'RE: Lets get the ball rolling......',
 'RE: Lets get the ball rolling......',
 'Lets get the ball rolling......',
 'Tell me that....................',
 'Re: Tell me that....................',
 'FW: YOU WANT TO KNOW ABOUT THIS....',
 'FW: YOU WANT TO KNOW ABOUT THIS....',
 'Re: FW: Cash Balance Plan...',
 "Don't forget...",
 'RE: Seeking info...',
 'Seeking info...',
 'FW: I thought you might be interested...',
 'I thought you might be interested...',
 'RE: Party Date...',
 'Party Date...',
 'Re: FW: RE: Coming Home Soon...',
 'RE: FW: RE: Coming Home Soon...',
 'FW: Weekend Events.........',
 'FW: Weekend Events.........',
 'Weekend Events.........',
 'Re: your famous...',
 'Re: your famous...',
 'Even the best laid plans...',
 'Re: Even the best laid plans...',
 'Re: Parting is such sweet sorrow...',
 'Parting is such sweet sorrow...',
 'Re: Even the best laid plans...',
 'Re: Even the best laid plans...',
 'Re: Even the best laid plans...',
 'Re: Even the best laid plans...',
 'Re: Even the best laid plans...',
 'Re: Even the best laid plans...',
 'Re: FW: Never let a guy take a message.....',
 'Re: And the winners are...',
 'RE: And the winners are...',
 'Re: And the winners are...',
 "Re: WSJ: PG&E's Huge losses...",
 'RE: Here it is...',
 'RE: Here it is...',
 'RE: Here it is...',
 'Here it is...',
 'Re:RE: Here it is...',
 'Re:RE: Here it is...',
 'Re:Here it is...',
 'Eeegads...',
 "Re: FW: ok, it's a little excessive, but...",
 "RE: FW: ok, it's a little excessive, but...",
 "Re: FW: ok, it's a little excessive, but...",
 "RE: FW: ok, it's a little excessive, but...",
 "RE: FW: ok, it's a little excessive, but...",
 "Re: FW: ok, it's a little excessive, but...",
 "RE: FW: ok, it's a little excessive, but...",
 "RE: FW: ok, it's a little excessive, but...",
 "RE: FW: ok, it's a little excessive, but...",
 "Re: FW: ok, it's a little excessive, but...",
 'RE: Eeegads...',
 'Eeegads...',
 'help...',
 'Re: Well...',
 'RE: Well...',
 'Re: Well...',
 'Re: You forgot your wine....',
 "RE: tell me it isn't true...",
 "tell me it isn't true...",
 'How You Should Act...........',
 'RE: If you go a run this afternoon....',
 'If you go a run this afternoon....',
 'FW: You still suck at baseball....',
 'You still suck at baseball....',
 'FW: Primary Authority Plus...',
 'Primary Authority Plus...',
 'Fw: Primary Authority Plus...',
 'Primary Authority Plus...',
 'RE: Would like to help...',
 'Would like to help...',
 'FW: Would like to help...',
 'Would like to help...',
 'from my red neck uncle...',
 'Your Chapters.ca Coupons ...',
 'Your Chapters.ca Coupons ...',
 'funny stuff about your mother...',
 'RE: Reply to this....',
 'RE: Reply to this....',
 'RE: Reply to this....',
 'RE: Reply to this....',
 'RE: Reply to this....',
 'RE: Reply to this....',
 'Re: Reply to this....',
 'Re: Reply to this....',
 'Re: Plans...',
 'Re: Plans...',
 'Re: Plans...',
 'Re: Howdy Stranger...',
 'Help! Canadians need weather...',
 'FW: Hilarious....',
 'FW: Hilarious....',
 'FW: Hilarious....',
 'Fw: Hilarious....',
 'FW: Hilarious....',
 "RE: Dan's coming to town...",
 "Re: Dan's coming to town...",
 "Dan's coming to town...",
 'FW: Interesting...',
 'FW: Interesting...',
 "RE: Haven't heard from you yet......",
 "=09Haven't heard from you yet......",
 'RE: A pleasant thought for long term investors...',
 'RE: A pleasant thought for long term investors...',
 'A pleasant thought for long term investors...',
 'A pleasant thought for long term investors...',
 'RE: New Digits....',
 'New Digits....',
 'RE: Things we wish we could say at work...',
 'FW: Things we wish we could say at work...',
 'RE: PBM merger...',
 'FW: PBM merger...',
 'PBM merger...',
 'funny........',
 'Questions.........',
 'RE: You have 44 hours remaining...',
 'You have 44 hours remaining...',
 'FW: guidelines....',
 'guidelines....',
 'Re: Daily California Update.....',
 'Daily California Update.....',
 'Daily California Update.....',
 'ETS on the Move...',
 'ETS on the Move...',
 "Re: HELLO I'M HERE AGAIN...",
 'Fw: Cast your vote..........',
 'Fw: Cast your vote..........',
 'Fw: Cast your vote..........',
 'Fw: Cast your vote..........',
 'Re: Next time you see me....',
 'RE: I need your help...',
 'I need your help...',
 'RE: 2 nd version of Plan...',
 '2 nd version of Plan...',
 'FW: 2 nd version of Plan...',
 'RE: 2 nd version of Plan...',
 '2 nd version of Plan...',
 'Last night...',
 'Last night...',
 'Re: Last night...',
 'Last night...',
 'Hey Chris..........',
 'Hey Chris..........',
 'FW: Thanks!...',
 'Re: Drinks...',
 'Drinks...',
 'RE: Advantages of being a man...',
 'RE: Advantages of being a man...',
 'Re: Advantages of being a man...',
 'Re: Advantages of being a man...',
 'FW: Now we know....',
 'FW: Fun for when your bored....',
 'FW: Fun for when your bored....',
 'Fun for when your bored....',
 "FW: Robin & Peter Vint's going away party -  Friday March 15th  -\t boo hoo.....",
 'RE: This weekend...',
 'This weekend...',
 'RE: Moving on...',
 'Moving on...',
 'FW: A Very Cold Winter...',
 'Fwd: Just ask a child...',
 'Fwd: Just ask a child...',
 'Just ask a child...',
 'Fwd: FW: Fwd[3]:FW: For the Sportsman in all of us...',
 'Fwd: FW: Fwd[3]:FW: For the Sportsman in all of us...',
 'Fwd: FW: Fwd[3]:FW: For the Sportsman in all of us...',
 'FW: Fwd[3]:FW: For the Sportsman in all of us...',
 'About the release tomorrow...',
 'About the release tomorrow...',
 'Re: Two Flatscreens to be moved...',
 'Re: Two Flatscreens to be moved...',
 'Re: Two Flatscreens to be moved...',
 'Two Flatscreens to be moved...',
 'Two Flatscreens to be moved...',
 "Fw: La medaille d'or...",
 "Fw: La medaille d'or...",
 "FW: La medaille d'or...",
 'Fw: FW: Paul Harvey Story ...Probably Should Circulate This One...',
 'Fw: FW: Paul Harvey Story ...Probably Should Circulate This One...',
 'Fwd: FW: Paul Harvey Story ...Probably Should Circulate This One...',
 'FW: Paul Harvey Story ...Probably Should Circulate This One...',
 'Paul Harvey Story ...Probably Should Circulate This  One...',
 'FW: Do you remember.........',
 'FW: Do you remember.........',
 'FW: Do you remember.........',
 'Fw: Priceless Series .........',
 'Fw: Priceless Series .........',
 'Fwd: Priceless Series .........',
 'Re: Surround Sound...',
 'Surround Sound...',
 'FW: wow...',
 'Fwd: wow...',
 'wow...',
 'FW: Patience...',
 'FW: Patience...',
 'FW: This is hillarious...',
 'This is hillarious...',
 'FW: Women...',
 'Fw: Women...',
 'FW: Women...',
 'FW: I have moved, but my Phone has not .....',
 'I have moved, but my Phone has not .....',
 'FW: two sides to the story....',
 'FW: two sides to the story....',
 'RE: two sides to the story....',
 'RE: two sides to the story....',
 'FW: two sides to the story....',
 'RE: I have moved, but my Phone has not .....',
 'I have moved, but my Phone has not .....',
 'FW: Voices from the past...',
 'Fw: Voices from the past...',
 'Voices from the past...',
 'FW: The cost of kids...',
 'Fw: The cost of kids...',
 'FW: Condom Sense....',
 'Fw: Condom Sense....',
 'Re: Trying to reach you...',
 'RE: Shut in comments and EOG....',
 'FW: Shut in comments and EOG....',
 'FW: Shut in comments and EOG....',
 'FW: Shut in comments and EOG....',
 'FW: new "rules"...',
 'new "rules"...',
 'RE: O:/ECT_Trading...',
 'O:/ECT_Trading...',
 'Re: Question...',
 'NYMEX email address...',
 'Re: NYMEX email address...',
 'FW: FYI...',
 'FYI...',
 'RE: FYI...',
 'FYI...',
 'RE: FYI...',
 'RE: FYI...',
 'RE: FYI...',
 'FYI...',
 'RE: FYI...',
 'RE: FYI...',
 'FW: FYI...',
 'FYI...',
 'RE: FYI...',
 'RE: FYI...',
 'RE: FYI...',
 'RE: FYI...',
 'FW: FYI...',
 'FYI...',
 'FW: For hours of endless revenge..........',
 'FW: For hours of endless revenge..........',
 'FW: For hours of endless revenge..........',
 'For hours of endless revenge..........',
 'RE: Three new additions to the world.........',
 'Three new additions to the world.........',
 'FW: Complete Madness ...',
 'FW: Complete Madness ...',
 'FW: Complete Madness ...',
 'FW: Three new additions to the world.........',
 'RE: Three new additions to the world.........',
 'Three new additions to the world.........',
 'Re: AW: AW: I am sooo sorry...',
 'Re: AW: I am sooo sorry...',
 'Re: I am sooo sorry...',
 'Delta Airlines...',
 'Delta Airlines...',
 'RE: How to get emails....',
 'How to get emails....',
 'FW: It could be worse....',
 'FW: It could be worse....',
 'Fwd: Life...',
 'Fwd: Life...',
 'Fwd: Life...',
 'Life...',
 'Fwd: Life...',
 'Fwd: Life...',
 'Fwd: Life...',
 'Life...',
 'FW: For our Children...',
 'FW: For our Children...',
 'FW: For our Children...',
 'FW: For our Children...',
 'FW: For our Children...',
 'FW: For our Children...',
 'FW: For our Children...',
 'FW: FW: I said a prayer for you just now.......',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'FW: Hope you like this poem...',
 'FW: Hope you like this poem...',
 'FW: Hope you like this poem...',
 'FW: Hope you like this poem...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'FW: Hope you like this poem...',
 'FW: Hope you like this poem...',
 'FW: Hope you like this poem...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 'Fwd: So Very True...',
 "Fw: YOU KNOW YOU'RE A LATINO IF...",
 "Fw: YOU KNOW YOU'RE A LATINO IF...",
 "RE: YOU KNOW YOU'RE A LATINO IF...",
 'For you ....',
 'For you ....',
 'For you ....',
 'FW: Dying...',
 'FW: Dying...',
 'FW: Dying...',
 'FW: Dying...',
 'FW: Dying...',
 'RE: FW: You know your at a LATINO birthday party....',
 'Fw: FW: You know your at a LATINO birthday party....',
 'RE: Follow up for Hardware Request...',
 'FW: Follow up for Hardware Request...',
 'Follow up for Hardware Request...',
 'RE: tonight...',
 'tonight...',
 'Re: some questions...',
 'some questions...',
 'Re: some questions...',
 'Re: Updating Regulatory Affairs Database.....',
 'Updating Regulatory Affairs Database.....',
 'RE: Stranger things have happened...',
 'Stranger things have happened...',
 "RE: Michelle's poem....",
 "Michelle's poem....",
 'RE: FW: Enron Employees Leaving Houston...',
 'Fwd: FW: Enron Employees Leaving Houston...',
 'FW: Interesting facts about this election...',
 'Re: crestar / gulf aos contract....',
 'Re: crestar / gulf aos contract....',
 'Re: crestar / gulf aos contract....',
 'Re: crestar / gulf aos contract....',
 'Re: crestar / gulf aos contract....',
 'Re: crestar / gulf aos contract....',
 'FW: Life......',
 'FW: Life......',
 'Re: And the Prize Goes To...',
 'Re: TIME heals all .....',
 "I'm Back...",
 'Re: as you requested...',
 'as you requested...',
 'Re: as you requested...',
 'Re: as you requested...',
 'as you requested...',
 'Re: as you requested...',
 'Re: as you requested...',
 'Re: as you requested...',
 'as you requested...',
 'And the beat goes on...',
 'Re: And the beat goes on...',
 'Re: Follow up....',
 'Greetings...',
 'Can you handle...',
 'When you get back...',
 'Oh knower of all things...',
 'Re: Yes, I need your help again ...',
 'Another Picture....',
 'Another Picture....',
 'Per your request...',
 'It should be more...',
 'Good luck America....',
 'Good luck America....',
 'Good luck America....',
 'Re: Yes but...........',
 'Yes but...........',
 'Good luck America....',
 'Positively the last word .....',
 'Positively the last word .....',
 'Positively the last word .....',
 'Re: Positively the last word .....',
 'Positively the last word .....',
 'Just Like Chapman...',
 'Re: Just Like Chapman...',
 'When you come....',
 'My Lucky Day...',
 'Re: Sorry, one more thing ...',
 "Another Gov't Agency Name...",
 'just thinking...',
 'Re: just thinking...',
 'just thinking...',
 'More Koch Masters...',
 'No Word...',
 '400 and counting...',
 'Declined: Lunch ...',
 'RE: Lunch ...',
 'RE: Lunch ...',
 'Declined: Lunch ...',
 'Re: access to O;...',
 'Re: access to O;...',
 'access to O;...',
 'Our Apologies ...',
 'Our Apologies ...',
 'Need direction please...',
 'Need direction please...',
 'Need direction please...',
 'Some municipal bonds for you to look at.....',
 'Some municipal bonds for you to look at.....',
 'Re: Publishable Research......',
 'Publishable Research......',
 'Publishable Research......',
 'Thank-you...',
 'Thank-you...',
 'Clintons leaving the Whitehouse...',
 'Clintons leaving the Whitehouse...',
 'Clintons leaving the Whitehouse...',
 'I just like hearing it.....',
 'I just like hearing it.....',
 "I've done it....",
 "I've done it....",
 'FW: Real Options Research...',
 'Real Options Research...',
 'RE: In light of the events this week....',
 'In light of the events this week....',
 'RE: In light of the events this week....',
 'RE: In light of the events this week....',
 'RE: In light of the events this week....',
 'In light of the events this week....',
 'RE: In light of the events this week....',
 'FW: An interesting story Abt. Stanford University ...',
 'FW: An interesting story Abt. Stanford University ...',
 'RE: Would anyone be interested?....',
 '=09Would anyone be interested?....',
 'FW: Would anyone be interested?....',
 '=09Would anyone be interested?....',
 'RE: Would anyone be interested?....',
 'RE: Would anyone be interested?....',
 'FW: Would anyone be interested?....',
 '=09RE: Would anyone be interested?....',
 '=09FW: Would anyone be interested?....',
 '=09Would anyone be interested?....',
 'RE: Hello...',
 'Hello...',
 'RE: Czy planujesz pojawic sie w Londynie  ...',
 'Czy planujesz pojawic sie w Londynie  ...',
 'Re: Conf Call...',
 'Conf Call...',
 'Midwest ISO information...',
 'Midwest ISO information...',
 'Need a laugh? Here it is...',
 'Time Magazine - Enron Plays the Pipes....',
 'Time Magazine - Enron Plays the Pipes....',
 "WSJ: PG&E's Huge losses...",
 "FW: Hadn't heard about this Enron Mention...",
 "Hadn't heard about this Enron Mention...",
 'Finally the truth comes out...',
 'Finally the truth comes out...',
 'RE: however...',
 'however...',
 'Re: FW: Mike Curry has signed and returned docs.....',
 "Rahil Jafry: Carly Fiorina Tops FORTUNE's List of 50 Most Powerful Women in Business   for ...",
 "RE: I'm still here ....",
 "I'm still here ....",
 'RE: Marcello has a favour to ask....',
 'Marcello has a favour to ask....',
 'RE: Hi...',
 'RE: Hi...',
 'RE: Hi...',
 'Hi...',
 'RE: Hi...',
 'Hi...',
 'RE: Hi...',
 'Hi...',
 'RE: Shankman...',
 'Shankman...',
 "Re: I'm Leaving...",
 'RE: floor space...',
 'floor space...',
 'Re: Sad news...',
 'Fw: Cast your vote..........',
 'Fw: Cast your vote..........',
 'Fw: Cast your vote..........',
 'Fwd: FW: Careful what you write...',
 'Fwd: FW: Careful what you write...',
 'FW: Careful what you write...',
 'FW: Careful what you write...',
 'RE: FW: Careful what you write...',
 'RE: FW: Careful what you write...',
 'Fwd: FW: Careful what you write...',
 'Fwd: FW: Careful what you write...',
 'FW: Careful what you write...',
 'FW: Careful what you write...',
 'Fwd: FW: Careful what you write...',
 'Fwd: FW: Careful what you write...',
 'FW: Careful what you write...',
 'FW: Careful what you write...',
 'Fwd: something groovy to do...',
 'RE: Our tree trimming storey.....',
 'RE: Our tree trimming storey.....',
 'RE: Our tree trimming storey.....',
 'Our tree trimming storey.....',
 'RE: Our tree trimming storey.....',
 'Our tree trimming storey.....',
 'FW: Our tree trimming storey.....',
 'Our tree trimming storey.....',
 'RE: Just a thought ...',
 'FW: Just a thought ...',
 'FW: Just a thought ...',
 'FW: This is a classic...',
 'FW: This is a classic...',
 'FW: This is a classic...',
 'RE: Advice / Information....',
 'Advice / Information....',
 'FW: Advice / Information....',
 'RE: Advice / Information....',
 'RE: Advice /  Information....',
 'Advice /  Information....',
 'RE: SAC visit...',
 'SAC visit...',
 'RE: If we were going to pay....',
 'If we were going to pay....',
 'Re: Dates for the Faculty-Alumni Awards at MU...',
 'Dates for the Faculty-Alumni Awards at MU...',
 'Re: Scott McNealy wants to hear from you...',
 'Scott McNealy wants to hear from you...',
 'OK, Jeff, you requested that we be candid about Enron...',
 'OK, Jeff, you requested that we be candid about Enron...',
 'Kenneth, here are four Christmas articles for you ...',
 'Re: oath to you...',
 'Re: Re[2]: oath to you...',
 'Re: Re[4]: oath to you...',
 'Fw: Bad Girl Barbies....',
 'Fw: Bad Girl Barbies....',
 'Fw: Bad Girl Barbies....',
 'Re: Hands on...',
 'Re: Hands on...',
 'Fw: Something to think about...',
 'FW: Something to think about...',
 'Re: Fw: Something to think about...',
 'Fw: Someone has way too much time on their hands......',
 'Fw: Someone has way too much time on their hands......',
 'FW: Someone has way too much time on their hands......',
 "Fw: What We've Learned From Watching Porn......",
 "Fw: What We've Learned From Watching Porn......",
 "FW: What We've Learned From Watching Porn......",
 "FW: What We've Learned From Watching Porn......",
 "Fw: What We've Learned From Watching Porn......",
 "Fw: What We've Learned From Watching Porn......",
 "FW: What We've Learned From Watching Porn......",
 "FW: What We've Learned From Watching Porn......",
 "Fw: What We've Learned From Watching Porn......",
 "Fw: What We've Learned From Watching Porn......",
 "FW: What We've Learned From Watching Porn......",
 "FW: What We've Learned From Watching Porn......",
 "Re: Fw: What We've Learned From Watching Porn......",
 "RE: What We've Learned From Watching Porn......",
 'Re: hey...',
 'Re: Better get a good backup....',
 'Fwd: something groovy to do...',
 'Re: Tomorrow...',
 'RE: Tomorrow...',
 'Re: Hey...',
 'Re: Are you still up for...',
 'Re: Tomorrow...',
 'RE: Tomorrow...',
 'Re: Hey...',
 'Re: Are you still up for...',
 'RE: this one too....',
 'this one too....',
 'FW: New Darwin Award winners are in...',
 'FW: New Darwin Award winners are in...',
 'New Darwin Award winners are in...',
 'RE: Hi Sweetie...',
 'Hi Sweetie...',
 'RE: Hi Sweetie...',
 'RE: Hi Sweetie...',
 'RE: Hi Sweetie...',
 'Hi Sweetie...',
 'RE: Dear Abby.......',
 'FW: Dear Abby.......',
 'FW: Dear Abby.......',
 'FW: Dear Abby.......',
 'Dear Abby.......',
 'Dear Abby.......',
 'RE: Strong Words...',
 'Strong Words...',
 'Fwd: something groovy to do...',
 'Re: Happy Hanukah and Merry Christmas...',
 'FW: As A Promising Energy Professional...',
 'As A Promising Energy Professional...',
 "RE: Why markers don't make good Christmas gifts...",
 "FW: Why markers don't make good Christmas gifts...",
 "RE: Why markers don't make good Christmas gifts...",
 "FW: Why markers don't make good Christmas gifts...",
 "RE: Why markers don't make good Christmas gifts...",
 "FW: Why markers don't make good Christmas gifts...",
 'FW: Angels are ....',
 'FW: Angels are ....',
 'RE: Angels are ....',
 'RE: Angels are ....',
 'FW: Angels are ....',
 'FW: Angels are ....',
 'Re: Fw: your voice..........',
 'Fw: FW: Paul Harvey Story ...Probably Should Circulate This One...',
 'Fw: FW: Paul Harvey Story ...Probably Should Circulate This One...',
 'Fw: FW: Paul Harvey Story ...Probably Should Circulate This One...',
 'Fwd: FW: Paul Harvey Story ...Probably Should Circulate This One...',
 'FW: Paul Harvey Story ...Probably Should Circulate This One...',
 'Paul Harvey Story ...Probably Should Circulate This  One...',
 'Re: WAR DAMN EAGLE....',
 'RE: User list to access different post ids...',
 'Re: User list to access different post ids...',
 "FW: Too bad stupidity isn't painful...",
 "FW: Too bad stupidity isn't painful...",
 "Re: FW: It couldn't hurt...",
 'RE: This is hillarious...',
 'RE: This is hillarious...',
 'FW: This is hillarious...',
 'This is hillarious...',
 'FW: This is hillarious...',
 'This is hillarious...',
 'RE: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'FW: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'FW: Mahmassani VaR........',
 'FW: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'Mahmassani VaR........',
 'FW: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'Mahmassani VaR........',
 'RE: Mahmassani VaR........',
 'Mahmassani VaR........',
 'FW: Weekend Events.........',
 'FW: Weekend Events.........',
 'Weekend Events.........',
 'RE: after the storm...',
 'Fw: after the storm...',
 'Fwd: after the storm...',
 'RE: Memories.......',
 'Memories.......',
 'FW: Memories.......',
 'RE: Memories.......',
 'Memories.......',
 'RE: Well...',
 'Well...',
 'RE: Hey...',
 'Hey...',
 'RE: Paulie-wog...',
 'Paulie-wog...',
 'FW: They build great outhouses in AR...',
 'They build great outhouses in AR...',
 'They build great outhouses in AR...',
 'RE: Oh happy day...',
 'Oh happy day...',
 'RE: Oh happy day...',
 'RE: Oh happy day...',
 "RE: I know you're busy...",
 "I know you're busy...",
 'RE: Good Morning...',
 'Good Morning...',
 'RE: Arrrrgh....',
 'Arrrrgh....',
 'I need to call Mark about the docs that Heather sent...',
 'a little high...',
 'RE: a little high...',
 'Some thoughts on anniversary stuff...',
 'You probably saw this all ready...',
 "HELP I'm drowning....",
 'FW: Drinking quotes...',
 'FW: Drinking quotes...',
 'FW: Drinking quotes...',
 'Ok, it is a slow news day...',
 'A blurb from an internal Enron communiciation...',
 'FW: To those of you getting married...',
 'To those of you getting married...',
 'Oh, just a change or two...',
 "One email didn't go through...",
 "RE: One email didn't go through...",
 "RE: One email didn't go through...",
 "One email didn't go through...",
 'Dick Westfahl Retirement - Bambi would be proud...',
 'Re: Dick Westfahl Retirement - Bambi would be proud...',
 'Re: Dick Westfahl Retirement - Bambi would be proud...',
 'Just to confuse you...',
 'And on this legal front...',
 "Re: FW: We'll Miss You Steffy.......",
 "FW: We'll Miss You Steffy.......",
 "FW: We'll Miss You Steffy.......",
 "FW: We'll Miss You Steffy.......",
 "We'll Miss You Steffy.......",
 "RE: FW: We'll Miss You Steffy.......",
 "RE: FW: We'll Miss You Steffy.......",
 "Re: FW: We'll Miss You Steffy.......",
 "FW: We'll Miss You Steffy.......",
 "FW: We'll Miss You Steffy.......",
 "FW: We'll Miss You Steffy.......",
 "We'll Miss You Steffy.......",
 'Another try...',
 'Re: Fw: things to ponder.....',
 "FW: so, you've been laid off....",
 "Fwd: so, you've been laid off....",
 'CAF - Tier 1 -  (Req# 600 - Jode Corp) requires your signature...',
 'Re: CAF - Tier 1 - OVERDUE - (Req# 599) is overdue...',
 'Re: Years ago...',
 'Re: Congratulations...',
 'Congratulations...',
 'Re: quick confirmation....',
 'AM I FREE.....',
 "FW: If you're bored here's a mensa test...",
 "FW:If you're bored here's a mensa test...",
 'Re: Long Time...',
 'Long Time...',
 'RE: Long Time...',
 'RE: New exchange broker...',
 'RE: New exchange broker...',
 'Re: New exchange broker...',
 'New exchange broker...',
 'FW:  Move Related Reminders...',
 '=09FW:  Move Related Reminders...',
 '=09 Move Related Reminders...',
 'RE: I NEED VERIFICATION...',
 'RE: I NEED VERIFICATION...',
 'RE: I NEED VERIFICATION...',
 'I NEED VERIFICATION...',
 'RE: I NEED VERIFICATION...',
 'I NEED VERIFICATION...',
 'RE: Heartland Steel revised....',
 'Heartland Steel revised....',
 'FW: Info you requested...',
 'Info you requested...',
 'Re: Long distance call........',
 'Re: FW: Please call this number...',
 'Re: Just a short note.........',
 'FW: Read Storyline first.........',
 '? Fwd: Read Storyline first.........',
 'Re: Hey, Dad....',
 'Re: T-minus 3 days...',
 'T-minus 3 days...',
 "Re: Don't forget to vote...",
 "Don't forget to vote...",
 'Re: Suds...',
 'Suds...',
 'Re: TIP OF THE DAY...',
 'TIP OF THE DAY...',
 'Re: Kristi called me last night....',
 'details...',
 'details...',
 're: ... no subject ...',
 'Re: Thinking of you...........',
 "And That's It...",
 "And That's It...",
 "And That's It...",
 'FW: Training courses available...',
 "Re: Haven't heard from you in a while........",
 'RE: Dudes...',
 'Re: Dudes...',
 'Dudes...',
 'FW: Dudes...',
 'RE: Dudes...',
 'Re: Dudes...',
 'Dudes...',
 'FW: Dudes...',
 'RE: Dudes...',
 'Re: Dudes...',
 'Dudes...',
 'FW: Dudes...',
 'RE: Dudes...',
 'Re: Dudes...',
 'Dudes...',
 'FW: Dudes...',
 'RE: Dudes...',
 'Re: Dudes...',
 'Dudes...',
 'RE: Some news.....',
 'Some news.....',
 "FW: We're Still Here...",
 "We're Still Here...",
 'RE: Still here and still ready...',
 'Still here and still ready...',
 'Still here and still ready...',
 'FW: Dudes...',
 'Re: Dudes...',
 'Dudes...',
 "RE: I'm here...",
 "I'm here...",
 "RE: I'm here...",
 "RE: I'm here...",
 "RE: I'm here...",
 "I'm here...",
 'RE: Troutmaster say...',
 ...]

Every subject line that contains 'oil' as a whole word (capitalized or not)
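
Before running this against the whole corpus, it may help to see what the pattern accepts and rejects on a handful of lines. The sketch below is only an illustration: `sample_subjects` is a hypothetical list, and its last three entries are invented rather than taken from the data. The `\b` on either side means "oil" has to stand alone as a word, and `[Oo]` allows either case for the first letter only.

```python
import re

# Hypothetical sample: the first two lines appear in the corpus output,
# the last three are invented for illustration.
sample_subjects = [
    "Forward oil prices",
    "Husky Oil",
    "Boiler maintenance schedule",
    "Turmoil in the markets",
    "OIL futures",
]

# \b matches at a word boundary, so "oil" inside "Boiler" or "Turmoil"
# does not count; [Oo] accepts an upper- or lower-case first letter.
pattern = r"\b[Oo]il\b"

matches = [line for line in sample_subjects if re.search(pattern, line)]
```

Here `matches` keeps only the first two lines. "OIL futures" is skipped because the pattern permits a capital only in the first position; passing `re.IGNORECASE` as a third argument to `re.search` would be one way to catch it, if that is what you want.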

In [132]:
[line for line in subjects if re.search(r"\b[Oo]il\b", line)]
Out[132]:
['Re: PIRA Global Oil and Natural Outlooks- Save these dates.',
 'PIRA Global Oil and Natural Outlooks- Save these dates.',
 'Re: PIRA Global Oil and Natural Outlooks- Save these dates.',
 '=09PIRA Global Oil and Natural Outlooks- Save these dates.',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Cabot Oil & Gas Marketing Corp. - Amendment and Confirmations to',
 'Cabot Oil & Gas Marketing Corp. - Amendment and Confirmations to',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Cabot Oil & Gas Marketing Corp. - Amendment and Confirmations to',
 'Cabot Oil & Gas Marketing Corp. - Amendment and Confirmations to',
 'EOTT Crude Oil Tanks',
 'Re: Oil Skim + "Bugs"',
 'Oil Skim + "Bugs"',
 'Oil Release Incident',
 'Oil Release Incident',
 'Oil Release Incident',
 'RE: Location of the 2002 Institute on Oil & Gas Law & Taxation --',
 'Location of the 2002 Institute on Oil & Gas Law & Taxation -- February, 2002',
 'RE: Location of the 2002 Institute on Oil & Gas Law & Taxation --',
 'RE: Location of the 2002 Institute on Oil & Gas Law & Taxation -- February, 2002',
 'RE: Location of the 2002 Institute on Oil & Gas Law & Taxation',
 'B & J Gas and Oil',
 'Re: B & J Gas and Oil',
 'National Oil & Gas Coop',
 'Re: National Oil & Gas Coop',
 'Re: B & J Gas & Oil',
 'Re: B & J Gas & Oil',
 'B & J Gas & Oil',
 'Re: B & J Gas & Oil',
 'Re: B & J Gas & Oil',
 'B & J Gas & Oil',
 'Re: Maynard Oil - Revised Nom',
 'Maynard Oil - Revised Nom',
 'Re: Period after commissioning on oil - PPA availability penalties',
 'Re: Period after commissioning on oil - PPA availability penalties',
 'Re: Period after commissioning on oil - PPA availability penalties',
 'Eastern States Oil & Gas',
 'Eastern States Oil & Gas',
 'Re: Petro-Canada Oil & Gas',
 'Petro-Canada Oil & Gas',
 'Re: Enron Liquid Fuels, Inc. v. Gulf Oil Limited Partnership',
 'Re: Enron Liquid Fuels, Inc. v. Gulf Oil Limited Partnership',
 'RE: Andersen Oil & Gas Symposium',
 'RE: Andersen Oil & Gas Symposium',
 'Andersen Oil & Gas Symposium',
 'RE: Andersen Oil & Gas Symposium',
 'Andersen Oil & Gas Symposium',
 'Re: FW: Andersen Oil & Gas Symposium',
 'RE: Colonial Oil Industries, Inc.',
 'Colonial Oil Industries, Inc.',
 'Colonial Oil Indusries, Inc.',
 'RE: Colonial Oil Indusries, Inc.',
 'RE: Colonial Oil Indusries, Inc.',
 'Colonial Oil Indusries, Inc.',
 'Re: Vineyard Oil and Gas deal 662482',
 'Vineyard Oil and Gas deal 662482',
 'Vineyard Oil and Gas deal 662482',
 'Re: Oil & Gas Confirms - Andex',
 'Re: National Oil & Gas Coop',
 'Mobil Oil Corporation Master Enfolio Agreement',
 'Re: Vineyard Oil & Gas: Q56432',
 'Re: Vineyard Oil & Gas: Q56432',
 'Re: Vineyard Oil & Gas: Q56432',
 'Imperial Oil Resources',
 'Belco Oil & Gas',
 'Belco Oil & Gas',
 'United Oil & Minerals, Inc.',
 'Belco Oil & Gas Corp.',
 'Re: United Oil & Minerals',
 'Husky Oil',
 'ETA Amendment - Imperial Oil Resources',
 '(00-323) Margin Change for Crude Oil, Unleaded Gasoline, and',
 'Re: (00-323) Margin Change for Crude Oil, Unleaded Gasoline, and',
 '(00-323) Margin Change for Crude Oil, Unleaded Gasoline, and Heating',
 'Re: United Oil & Minerals -Amended Credit W/S',
 'Re: US Heating Oil and Unleaded Gas Fin Spreads - Approval',
 'US Heating Oil and Unleaded Gas Fin Spreads - Approval',
 '(00-359) Margin Rate Change for Crude Oil, Unleaded Gasoline, and',
 '(00-362) Revised Crude Oil Options Expiration Date',
 'Re: (00-362) Revised Crude Oil Options Expiration Date',
 '(00-367) Revised - Crude Oil Futures Expiration Date',
 '(00-377) Margin Rate Change for Crude Oil, Unleaded Gasoline, and',
 'Re: New EOL Product - Crude Oil Financial',
 'New EOL Product - Crude Oil Financial',
 'NDA-PVM Oil  Associates Limited',
 'Re: NDA-PVM Oil Associates Limited',
 'BP Oil International Limited',
 'NDA - PVM Oil Associates Limited',
 'BP Exploration & Oil Inc.',
 'BP Oil Supply Company',
 'RE: BP Oil Supply Company',
 'BP Oil Supply Company',
 'ETA Amendment - BP Oil Supply Company',
 'Re: bp Oil Supply registration status',
 'bp Oil Supply registration status',
 'Re: FW: Product Type approval for 3 product types (Heating Oil Fin',
 'FW: Product Type approval for 3 product types (Heating Oil Fin=20',
 'FW: Product Type approval for 3 product types (Heating Oil Fin=20',
 'Product Type approval for 3 product types (Heating Oil Fin Options=',
 'Product Type approval for 2 product types (US Residual Fuel Oil 1%=',
 'Husky Oil Limited/ECT Canada',
 'RE: Belco Oil & Gas/Westport Resources merger',
 'RE: Belco Oil & Gas/Westport Resources merger',
 'Belco Oil & Gas/Westport Resources merger',
 'RE: Hunt Oil Company of Canada',
 'Hunt Oil Company of Canada',
 'FW: Belco Oil & Gas Corp Merger w/ Westport Resources Corp.',
 'FW: Belco Oil & Gas Corp Merger w/ Westport Resources Corp.',
 'Belco Oil & Gas Corp Merger w/ Westport Resources Corp.',
 'BP Exploration & Oil Inc. Merger Documentation',
 'FW: BP Exploration & Oil Inc. Merger Documentation',
 'BP Exploration & Oil Inc. Merger Documentation',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Forward oil prices',
 'Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Forward oil prices',
 'Forward oil prices',
 'Forward oil prices',
 'Forward oil prices',
 'PIRA Global Oil and Natural Outlooks- Save these dates.',
 'PIRA Global Oil and Natural Outlooks- Save these dates.',
 'Re: Seismic Data on Oil & Gas field Development via Satellite',
 'Re: Seismic Data on Oil & Gas field Development via Satellite',
 'Re: exploration data as the root of the energy (oil) supply chain',
 'exploration data as the root of the energy (oil) supply chain and',
 'Seismic Data on Oil & Gas field Development via Satellite',
 'Seismic Data on Oil & Gas field Development via Satellite',
 'PIRA Global Oil and Natural Outlooks- Save these dates.',
 'PIRA Global Oil and Natural Outlooks- Save these dates.',
 'RE: Crude Oil for Oz',
 'FW: Crude Oil for Oz',
 'FW: Crude Oil for Oz',
 'Crude Oil for Oz',
 'FW: GPCM News: 8/20/01:  RBAC Finalizing Schedule in Houston: Oil',
 '=09GPCM News: 8/20/01:  RBAC Finalizing Schedule in Houston: Oil Co=',
 'PIRA Global Oil and Natural Outlooks- Save these dates.',
 '=09PIRA Global Oil and Natural Outlooks- Save these dates.',
 'PIRA Global Oil and Natural Outlooks- Save these dates.',
 '=09PIRA Global Oil and Natural Outlooks- Save these dates.',
 'MOU With India Oil Corp.',
 'MOU With India Oil Corp.',
 'Iraqui - Oil for food',
 'Iraqui - Oil for food',
 'Re: Iraqui - Oil for food',
 'Iraqui - Oil for food',
 'Re: PIRA Oil Briefing',
 'PIRA Oil Briefing',
 'Telephone call to Steve Hellman, Oil Space',
 'Re: PIRA Global Oil and Natural Outlooks- Save these dates.',
 'PIRA Global Oil and Natural Outlooks- Save these dates.',
 'John Arnold Crude Oil Deals',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Cabot Oil & Gas Marketing Corporation',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Cabot Oil & Gas Marketing Corporation',
 'Coleman Oil & Gas Term Sheet and LOU',
 'Coleman Oil & Gas Term Sheet and LOU',
 'Re: Oil & Gas Lease',
 'FW: Fw: Oil changes: Men vs. Women',
 'Fwd: Fw: Oil changes: Men vs. Women',
 'Citation Oil & Gas Spot Enfolio',
 'FW: Patina Oil & Gas Docs.',
 'Patina Oil & Gas Docs.',
 'FW: Pioneer Oil',
 'Pioneer Oil',
 'FW: Patina Oil and Gas Purchase',
 'Patina Oil and Gas Purchase',
 'FW: Kennedy Oil and Enron North America Corp.',
 'FW: Kennedy Oil and Enron North America Corp.',
 'RE: Cutter Oil',
 'Cutter Oil',
 'RE: Cutter Oil',
 'RE: Cutter Oil',
 'RE: Cutter Oil',
 'Cutter Oil',
 'Belco Oil & Gas/Westport Resources merger',
 'Merit Gas & Oil, Inc.',
 'Re: Cutter Oil Contract',
 'Re: Cross Timbers Oil Company - Contact Information',
 'Re: Revised Cutter Oil contract',
 'Re: Revised Cutter Oil contract',
 'Cross Timbers Oil',
 'Cutter Oil Company',
 'Cutter Oil Company',
 'Cutter Oil',
 'Murphy Oil',
 'Murphy Oil',
 'Forest Oil / Energen Resources',
 'Forest Oil Corporation',
 'Matrix Oil & Gas',
 'Irving Oil CA',
 'Irving Oil CA',
 'Re: Irving Oil CA',
 'Re: Cutter Oil Company',
 'Re: Cutter Oil Company',
 'Re: Cutter Oil Company',
 'Jay Bee Oil & Gas',
 'Re: Jay Bee Oil & Gas',
 'FW: IGI Resources, Inc. and BTA Oil Producers',
 'RE: IGI Resources, Inc. and BTA Oil Producers',
 'IGI Resources, Inc. and BTA Oil Producers',
 'RE: Patina Oil and Gas Purchase',
 'FW: Patina Oil and Gas Purchase',
 'Patina Oil and Gas Purchase',
 'RE: DeBrular/Stevens Oil',
 'RE: DeBrular/Stevens Oil',
 'DeBrular/Stevens Oil',
 'RE: Stevens Oil',
 'RE: Stevens Oil',
 'Stevens Oil',
 'RE: DeBrular/Stevens Oil',
 'DeBrular/Stevens Oil',
 'FW: Proposed Contract - BTA Oil Producers',
 'Proposed Contract - BTA Oil Producers',
 'RE: IGI Resources, Inc. and BTA Oil Producers',
 'IGI Resources, Inc. and BTA Oil Producers',
 'BTA Oil Producers / IGI Resources',
 'Re: Cabot Oil & Gas Marketing Corporation',
 'Re: Killam Oil Dehydration in the Juanita Lobo Field, Webb County,',
 'Re: Colonial Oil',
 'Colonial Oil',
 'Wall Street Journal Article - Regarding SPR Oil',
 'Reminder: Annual Oil Spill Crisis Management Training',
 'Re: Gulf Oil Co.',
 'BP Oil International Ltd. ("BP")',
 'Forest Oil Corporation',
 'Wiser Oil Confidentiality Agreement',
 'Draft term sheet for oil-power spread option pruchase from FPL',
 'Re: Draft term sheet for oil-power spread option pruchase from FPL',
 'Draft term sheet for oil-power spread option pruchase from FPL',
 'Prepaid Oil Swap - Transaction Diagram',
 'Prepaid Oil Swap - Transaction Diagram',
 'BP Oil Internation Ltd',
 'Cross Oil and Refining & Marketing',
 'RE: Kern Oil & Refining Company',
 'Kern Oil & Refining Company',
 'RE: Oil Prepay Rollover',
 'Oil Prepay Rollover',
 'FW: US Filter Comments on Omnibus and Annex A of a Heating Oil',
 'US Filter Comments on Omnibus and Annex A of a Heating Oil Deferred Premium Call',
 'FW: ENA Oil Prepay with CSFB/ Morgan Stanley',
 'FW: ENA Oil Prepay with CSFB/ Morgan Stanley',
 'FW: ENA Oil Prepay with CSFB/ Morgan Stanley',
 'FW: ENA Oil Prepay with CSFB/ Morgan Stanley',
 'ENA Oil Prepay with CSFB/ Morgan Stanley',
 'Re: CPE Credit-Oil Spill Training',
 'CPE Credit-Oil Spill Training',
 'NE Heating Oil Reserve',
 'NE Heating Oil Reserve',
 'CERA Monthly Oil Briefing - CERA Alert - December 20, 2000',
 'CERA Monthly Oil Briefing - CERA Alert - December 20, 2000',
 'CERA Monthly Oil Briefing - CERA Alert - December 20, 2000',
 'Re: PIRA World Oil Outlook Presentation',
 'PIRA World Oil Outlook Presentation',
 'Oil Week Ahead - Dec 4-00',
 'Oil Week Ahead - Dec 4-00',
 'The Oil Daily Wednesday, Nov. 29, 2000 (pdf)',
 'The Oil Daily Wednesday, Nov. 29, 2000 (pdf)',
 'The Oil Daily Wednesday, Nov. 29, 2000 (pdf)',
 'The Oil Daily, Tuesday, Nov. 28, 2000 (pdf)',
 'The Oil Daily, Tuesday, Nov. 28, 2000 (pdf)',
 'The Oil Daily, Tuesday, Nov. 28, 2000 (pdf)',
 'Reminder: Annual Oil Spill Crisis Management Training',
 'Reminder: Annual Oil Spill Crisis Management Training',
 'Re: Annual Oil Spill Crisis Management Training',
 'Re: Annual Oil Spill Crisis Management Training',
 'Fuel Oil',
 'Fuel Oil',
 'how to go forward in the oil markets',
 'how to go forward in the oil markets',
 'Re: Oil Prices and Investment',
 'Oil Prices and Investment',
 'Crude Oil Purchase Agreement',
 'Re: US Heating Oil and Unleaded Gas Fin Spreads - Approval',
 '(00-362) Revised Crude Oil Options Expiration Date',
 'Re: (00-362) Revised Crude Oil Options Expiration Date',
 'Re: PLEASE RESPOND: US Residual Fuel Oil 1% Fin Spd / US Residual',
 'Product Type Approval for 2 product types!! (US Residual Fuel Oil 1%',
 'Re: ISDA for Irving Oil',
 'Union Oil and ETA for Enron Online',
 'Union Oil and ETA for Enron Online',
 'RE: Access to the Daily Oil Bulletin online',
 'Access to the Daily Oil Bulletin online',
 'Fuel Oil Sample',
 'Fuel Oil Vanadium Spec',
 'RE: Fuel Oil Ash Spec',
 'Fuel Oil Ash Spec',
 'Fuel Oil!!!',
 'Ft. Pierce #2 Fuel Oil',
 'Fuel Oil Reimbursement to FPUA',
 'Re: Fuel Oil Sample',
 'Fuel Oil Sample',
 'Re: Fuel Oil Sample',
 'Re: Fuel Oil Sample',
 'Re: Fuel Oil Sample',
 'Fuel Oil Sample',
 'Re: Fuel Oil Sample',
 'Re: Fuel Oil Sample',
 'Re: Fuel Oil Sample',
 'Re: Fuel Oil Sample',
 'Re: Fuel Oil Sample',
 'Re: Fuel Oil Sample',
 'Re: Fuel Oil Sample',
 'Re: Fuel Oil Sample',
 'Re: Fuel Oil Sample',
 'Fuel Oil Sample',
 'Re: Fuel Oil Sample',
 'Fuel Oil Sample',
 'FYI - Update of Fuel Oil Analysis',
 'Update of Fuel Oil Analysis',
 'Update of Fuel Oil Analysis',
 'Fuel Oil Questions',
 'RE: FW: Patina Oil & Gas Docs.',
 'RE: FW: Patina Oil & Gas Docs.',
 'RE: FW: Patina Oil & Gas Docs.',
 'RE: FW: Patina Oil & Gas Docs.',
 'Re: FW: Patina Oil & Gas Docs.',
 'RE: FW: Patina Oil & Gas Docs.',
 'RE: FW: Patina Oil & Gas Docs.',
 'Re: FW: Patina Oil & Gas Docs.',
 'FW: Patina Oil & Gas Docs.',
 'Patina Oil & Gas Docs.',
 'Patina Oil and Gas Purchase',
 'FW: Kennedy Oil',
 'Kennedy Oil',
 'Kennedy Oil',
 'FW: Kennedy Oil',
 'Kennedy Oil',
 'Kennedy Oil',
 'FW: Frontier Oil Corporation Credit Line',
 'Frontier Oil Corporation Credit Line',
 'Re: No.6 Oil - Dom. Rep.']

Metacharacters: quantifiers

Above we had a regular expression that looked like this:

[aeiou][aeiou][aeiou][aeiou]

Typing out the same character class four times is kind of a pain. Fortunately, there's a way to specify how many times to match a particular character, using quantifiers. A quantifier affects the character or character class that immediately precedes it:

quantifier   meaning
{n}          match exactly n times
{n,m}        match at least n times, but no more than m times
{n,}         match at least n times
+            match at least once (same as {1,})
*            match zero or more times
?            match one time or zero times
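
To make the table concrete, here is a small sketch that tries each quantifier against a handful of made-up strings. The sample list and the matching() helper are invented for this illustration and are not part of the Enron data; the ^ and $ anchors are added so that each pattern has to account for the whole string. The print calls use parentheses so the sketch should also run under Python 3.

```python
import re

# Hypothetical sample strings, chosen only to exercise each quantifier.
samples = ["ct", "cat", "caat", "caaat", "caaaat"]

def matching(pattern):
    # Return the samples that the pattern matches from start to end.
    return [s for s in samples if re.search("^" + pattern + "$", s)]

exactly_two = matching(r"ca{2}t")       # exactly two a's
two_to_three = matching(r"ca{2,3}t")    # two or three a's
at_least_three = matching(r"ca{3,}t")   # three or more a's
one_or_more = matching(r"ca+t")         # at least one a
zero_or_more = matching(r"ca*t")        # any number of a's, including none
optional = matching(r"ca?t")            # zero or one a

print(exactly_two)
print(two_to_three)
print(at_least_three)
print(one_or_more)
print(zero_or_more)
print(optional)
```

Note that in each pattern the quantifier applies only to the a in front of it, not to the c or to the whole word.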

For example, here's a regular expression that finds subjects that contain at least fifteen capital letters in a row:

In [136]:
[line for line in subjects if re.search(r"[A-Z]{15,}", line)]
Out[136]:
['CONGRATULATIONS!',
 'CONGRATULATIONS!',
 'Re: FW: Fw: Fw: Fw: Fw: Fw: Fw: PLEEEEEEEEEEEEEEEASE READ!',
 'ACCOMPLISHMENTS',
 'ACCOMPLISHMENTS',
 'Re: FW: FORM: BILATERAL CONFIDENTIALITY AGREEMENT',
 'FORM: BILATERAL CONFIDENTIALITY AGREEMENT',
 'Re: CONGRATULATIONS!',
 'CONGRATULATIONS!',
 'Re: ORDER ACKNOWLEDGEMENT',
 'ORDER ACKNOWLEDGEMENT',
 'RE: CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'Re: CONGRATULATIONS',
 'CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'Re: CONGRATULATIONS',
 'CONGRATULATIONS',
 'Re: VEPCO INTERCONNECTION AGREEMENT',
 'VEPCO INTERCONNECTION AGREEMENT',
 'Re: VEPCO INTERCONNECTION AGREEMENT',
 'Re: VEPCO INTERCONNECTION AGREEMENT',
 'VEPCO INTERCONNECTION AGREEMENT',
 'Re: CONGRATULATIONS !',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'Re: FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'RE: NOOOOOOOOOOOOOOOO',
 'NOOOOOOOOOOOOOOOO',
 'RE: NOOOOOOOOOOOOOOOO',
 'CONGRATULATIONS!!!!!!!!!!!!!',
 'RE: CONGRATULATIONS!!!!!!!!!!!!!',
 'Re: CONGRATULATIONS!!!!!!!!!!!!!',
 'CONGRATULATIONS',
 'Re: CONFIDENTIALITY/CONFLICTS ISSUES MEETING',
 'CONFIDENTIALITY/CONFLICTS ISSUES MEETING',
 'GOALS AND ACCOMPLISHMENTS',
 'ACCOMPLISHMENTS',
 'Re: CONGRATULATIONS!',
 'RE: STANDARDIZATION OF TANKER FREIGHT WORDING',
 'RE: STANDARDIZATION OF TANKER FREIGHT WORDING',
 'Re: STANDARDIZATION OF TANKER FREIGHT WORDING',
 'STANDARDIZATION OF TANKER FREIGHT WORDING',
 'BRRRRRRRRRRRRRRRRRRRRR',
 'Re: CONGRATULATIONS !!!',
 'CONGRATULATIONS !!!',
 'RE: Mtg. to discuss assignment of customers. Transmission list:  P/LEGAL/PROJECTNETCO/NETCOTRANSMISSION.XLS',
 'RE: Mtg. to discuss assignment of customers. Transmission list:  P/LEGAL/PROJECTNETCO/NETCOTRANSMISSION.XLS',
 'Mtg. to discuss assignment of customers. Transmission list:  P/LEGAL/PROJECTNETCO/NETCOTRANSMISSION.XLS',
 'FW: NEW WEATHER SWAPS ON THE INTERCONTINENTAL EXCHANGE',
 'NEW WEATHER SWAPS ON THE INTERCONTINENTAL EXCHANGE']

Lines that contain five consecutive vowels:

In [137]:
[line for line in subjects if re.search(r"[aeiou]{5}", line)]
Out[137]:
['WooooooHoooooo more Vacation',
 'Gooooooooooood Bye!',
 'Gooooooooooood Bye!',
 'RE: Hello Sweeeeetie',
 'Hello Sweeeeetie',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'Re: FW: Wasss Uuuuuup STG?',
 'RE: Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',
 'Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',
 'Re: Helloooooo!!!',
 'Re: Helloooooo!!!',
 'Fw: FW: OOOooooops',
 'FW: FW: OOOooooops',
 'yahoooooooooooooooooooo',
 'RE: yahoooooooooooooooooooo',
 'RE: yahoooooooooooooooooooo',
 'yahoooooooooooooooooooo',
 'RE: I hate yahooooooooooooooo',
 'I hate yahooooooooooooooo',
 'RE: I hate yahooooooooooooooo',
 'I hate yahooooooooooooooo',
 'RE: I hate yahooooooooooooooo',
 'I hate yahooooooooooooooo',
 'RE: I hate yahooooooooooooooo',
 'I hate yahooooooooooooooo',
 "FW: duuuuuuuuuuuuuuuuude...........what's up?",
 "RE: duuuuuuuuuuuuuuuuude...........what's up?",
 "RE: duuuuuuuuuuuuuuuuude...........what's up?",
 'Re: skiiiiiiiiing',
 'skiiiiiiiiing',
 'scuba dooooooooooooo',
 'RE: scuba dooooooooooooo',
 'RE: scuba dooooooooooooo',
 'scuba dooooooooooooo',
 'Re: skiiiiiiiing',
 'skiiiiiiiing',
 'Re: skiiiiiiiing',
 'Re: skiiiiiiiiing']

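Count the number of lines that are e-mail forwards, whether the subject line begins with Fw:, FW:, Fwd: or FWd:. (The d? in the pattern matches only a lower-case d, so a subject beginning with FWD: is not counted; writing [Dd]? instead would include it.)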

In [140]:
len([line for line in subjects if re.search(r"^F[Ww]d?:", line)])
Out[140]:
20159
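
As a quick check of which prefixes this pattern accepts, the sketch below runs it against a few made-up subject lines (the prefixes list is hypothetical, not drawn from the corpus) and compares it with a slightly broader variant that also allows an upper-case D. The broader pattern is offered only as one possible adjustment; its count on the real data is not shown here.

```python
import re

# Hypothetical subject lines covering the common spellings of "forward".
prefixes = ["Fw: hi", "FW: hi", "Fwd: hi", "FWd: hi", "FWD: hi", "Re: FW: hi"]

# The pattern used above: F, then w or W, then an optional lower-case d.
original = [p for p in prefixes if re.search(r"^F[Ww]d?:", p)]

# A broader variant: the optional third letter may be d or D.
broader = [p for p in prefixes if re.search(r"^F[Ww][Dd]?:", p)]

print(original)
print(broader)
```

The last sample is rejected by both patterns because ^ anchors the match to the start of the line, so a forward marker that appears after "Re: " does not count.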

Lines that have the word news in them and end in an exclamation point:

In [144]:
[line for line in subjects if re.search(r"\b[Nn]ews\b.*!$", line)]
Out[144]:
['RE: Christmas Party News!',
 'FW: Christmas Party News!',
 'Christmas Party News!',
 'Good News!',
 'Good News--Twice!',
 'Re: VERY Interesting News!',
 'Great News!',
 'Re: Great News!',
 'News Flash!',
 'RE: News Flash!',
 'RE: News Flash!',
 'News Flash!',
 'RE: Good News!',
 'RE: Good News!',
 'RE: Good News!',
 'RE: Good News!',
 'Good News!',
 'RE: Good News!!!',
 'Good News!!!',
 'RE: Big News!',
 'Big News!',
 'Individual.com - News From a Friend!',
 'Individual.com - News From a Friend!',
 'Re: Individual.com - News From a Friend!',
 'RE: We need news!',
 '=09We need news!',
 'RE: Big News!',
 'FW: Big News!',
 'RE: Big News!',
 'FW: Big News!',
 'Big News!',
 'FW: NW Wine News- Eroica, Sineann, Bergstrom, Hamacher, And more!',
 '=09NW Wine News- Eroica, Sineann, Bergstrom, Hamacher, And more!',
 'RE: Good News!!!',
 'Good News!!!',
 'Re: Big News!',
 'Big News!',
 'RE: Good  News!',
 'Good  News!']
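
The pattern in that last example combines several pieces: \b word boundaries around the word, .* to skip over anything that follows, and !$ to require an exclamation point at the very end. A minimal sketch on invented strings (the tests list is hypothetical) shows which of those pieces rules out each non-matching line:

```python
import re

pattern = r"\b[Nn]ews\b.*!$"

# Hypothetical subject lines, each probing one part of the pattern.
tests = [
    "Good News!",          # word present, ends in "!"
    "We need news!",       # lower-case n is allowed by [Nn]
    "Newsletter is out!",  # "News" is not followed by a word boundary
    "Good News",           # no "!" at the end
    "News! Read on",       # the "!" is not the last character
]

matched = [s for s in tests if re.search(pattern, s)]
print(matched)
```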

Metacharacters: alternation

One final bit of regular expression syntax: alternation.

  • (?:x|y): match either x or y
  • (?:x|y|z): match x, y or z
  • etc.

So, for example, to count every subject line that begins with either Re: or Fwd: you could write:

In [174]:
len([line for line in subjects if re.search(r"^(?:Re|Fwd):", line)])
Out[174]:
39901
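
The parentheses matter here. Without them, the | splits the entire pattern in two, so ^Re|Fwd: means "Re at the start of the line, or Fwd: anywhere". The following sketch, using a few invented subject lines (the lines list is hypothetical), illustrates the difference:

```python
import re

# Hypothetical subject lines for comparing grouped and ungrouped alternation.
lines = ["Re: lunch", "Fwd: lunch", "Reminder: lunch", "See Fwd: lunch"]

# Grouped: the line must start with Re or Fwd, followed by a colon.
grouped = [s for s in lines if re.search(r"^(?:Re|Fwd):", s)]

# Ungrouped: either "Re" at the start, or "Fwd:" anywhere in the line.
ungrouped = [s for s in lines if re.search(r"^Re|Fwd:", s)]

print(grouped)
print(ungrouped)
```

With the group in place only the first two lines match; without it, all four do.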

Every subject line that mentions a primary color:

In [175]:
[line for line in subjects if re.search(r"\b(?:[Rr]ed|[Yy]ellow|[Bb]lue)\b", line)]
Out[175]:
['Re: Blue Dolphin Pipe',
 'Blue Dolphin Pipe',
 'FW: Red Rock expansion',
 'FW: Red Rock expansion',
 'Red Rock expansion',
 'Re: Red Rock GE/NP Emissions',
 'Re: Red Rock GE/NP Emissions',
 'Re: Red Rock GE/NP Emissions',
 'Red Rock GE/NP Emissions',
 'Air Permit Delay, Red Rock Expansion',
 'RE: Red Rock Air Permits Heads up!',
 'RE: Red Rock Air Permits Heads up!',
 'Red Rock Air Permits Heads up!',
 'RE: FW: Red Rock Expansion Station 4',
 'RE: FW: Red Rock Expansion Station 4',
 'RE: FW: Red Rock Expansion Station 4',
 'Re: FW: Red Rock Expansion Station 4',
 'FW: Red Rock Expansion Station 4',
 'Red Rock Expansion Station 4',
 'summary of red rock contracts',
 'Re: Yellow Book',
 'Re: Now we have red ones',
 'Re: Red Herring Res',
 'Blue Jean Shirts',
 'Blue Jean Shirts',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Red Rock Delay $$ Impact',
 'from my red neck uncle...',
 'Blue Cross ?',
 'Re: REPLY TO: Power Rate Offer: Weserner Exposition, Red Deer',
 'Red Rock Expansion Filing',
 'Red Rock Expansion Filing',
 'Red Rock open season',
 'Red Rock open season',
 'Re: Final Red Rock',
 'Final Red Rock',
 'Re: Red Rock shipper letter',
 'Red Rock shipper letter',
 'Re: Red Rock posting',
 'Red Rock posting',
 'Re: Red Rock posting',
 'Re: Red Rock posting',
 'Re: Red Rock posting',
 'Red Rock posting',
 'Red Cedar',
 'Red Cedar update',
 'Red Cedar update',
 'Red Cedar',
 'Red Cedar',
 'Re: Red Cedar - Contract Approval Request',
 'Red Cedar - Contract Approval Request',
 'Re: Red Cedar',
 'Re: Red Cedar',
 'Red Cedar',
 'Red Cedar',
 'FW: Red Rock (Transwestern) Cashflow',
 'RE: Red Rock (Transwestern) Cashflow',
 'RE: Red Rock (Transwestern) Cashflow',
 'Red Rock (Transwestern) Cashflow',
 'FW: Red Rock additions',
 'FW: Red Rock additions',
 'FW: Red Rock additions',
 'RE: Red Rock additions',
 'Red Rock additions',
 'FW: Honest Answers from the Guy in the Red..',
 'Fw: Honest Answers from the Guy in the Red..',
 'Honest Answers from the Guy in the Red..',
 'Re: Blue Range',
 'Blue Range',
 'Re: Blue Range',
 'Re: Blue Range Resource Corporation',
 'Re: Blue Range Resource Corporation',
 'FW: Red Rock additions',
 'FW: Red Rock additions',
 'RE: Red Rock additions',
 'Red Rock additions',
 'Red Rock expansion',
 'SD 750 case w/ Red Hawk lateral',
 'Aquila Red Lake storage project',
 'Red Lake Storage project',
 'FW: New Uniforms for the other Big Red',
 'New Uniforms for the other Big Red',
 'FW: Red Lake Storage w/Kevin Hyatt',
 'Red Lake Storage w/Kevin Hyatt',
 'Follow up from Red Lake meeting',
 'Firm Rec/Del questions from potential Aquila Red Lake shippers',
 'FW: Red Lake Info',
 'FW: Red Lake Info',
 'RE: Red Lake Info',
 'Red Lake Info',
 'Blue Files',
 'Blue Flame Propane Inc.',
 'Re: Blue Flame Propane Inc.',
 'Blue Flame Propane Inc.',
 'Re: Ideas on Blue/Green Files',
 'FW: Red Hat no longer sings "When I\'m 64"',
 'Red Hat no longer sings "When I\'m 64"',
 'SECURITY AND BUG NEWS ALERT: Users offer tips on foiling Code Red',
 'Red Herring Article',
 'Red Herring Article',
 'RE: Red Herring',
 'Red Herring',
 'FW: Red Herring e-mail',
 'Red Herring e-mail',
 'FW: Transwestern Red Rock Permitting Status',
 'Transwestern Red Rock Permitting Status',
 'RE: Red Banner Screen Shots',
 'FW: Red Banner Screen Shots',
 'Red Banner Screen Shots',
 'Re: Blue Range Resource Corporation - Eligible Financial Contracts',
 'Blue Range Recovery',
 'Blue Range Recovery',
 'Thank you from Red Herring',
 'RE: Red Lake Storage project',
 'Red Lake Storage project',
 'Aquila Red Lake/TW strategy meeting',
 'FW: Red Rock Receipts',
 'RE: Red Rock Receipts',
 'Red Rock Receipts',
 'FW: Project Status Review Meeting for Red Rock and GTB',
 'Updated: Project Status Review Meeting for Red Rock and GTB',
 'FW: Red Rock Briefing',
 'Red Rock Briefing',
 'RE: Amending Red Rock Contracts',
 'RE: Amending Red Rock Contracts',
 'RE: Amending Red Rock Contracts',
 'RE: Amending Red Rock Contracts',
 'FW: Amending Red Rock Contracts',
 'FW: Amending Red Rock Contracts',
 'Amending Red Rock Contracts',
 'RE: Amending Red Rock Contracts',
 'RE: Amending Red Rock Contracts',
 'FW: Amending Red Rock Contracts',
 'FW: Amending Red Rock Contracts',
 'Amending Red Rock Contracts',
 'FW: Amending Red Rock Contracts',
 'FW: Amending Red Rock Contracts',
 'Amending Red Rock Contracts',
 'RE: Commissioner.COM Trade Offered by Blue Balls',
 'RE: Commissioner.COM Trade Offered by Blue Balls',
 'RE: Commissioner.COM Trade Offered by Blue Balls',
 'Commissioner.COM Trade Offered by Blue Balls',
 'RE: Commissioner.COM Trade Offered by Blue Balls',
 'Commissioner.COM Trade Offered by Blue Balls',
 'Blue dog override letter',
 'Just kidding, this is the real Blue dog override letter',
 'Just kidding, this is the real Blue dog override letter',
 'Blue Dog',
 'Blue Dog',
 'Blue Dog',
 'Override letter related to the Blue Dog turbines',
 'Re: Blue Dog Change Order',
 'Blue Dog Change Order',
 'Blue Dog Change Order',
 'LJM/Blue Girl turbines',
 'Blue Dog Change Order',
 'Blue Dog Change Order',
 'Blue Dog Change Order',
 'Re: Blue Dog Change Order',
 'Re: Blue Dog Change Order',
 'Re: Blue Dog Change Order',
 'Re: Blue Dog - Comments to Change Order #1',
 'Blue Dog - Comments to Change Order #1',
 'Letter agreement re: Blue Dog',
 'Form for Blue Dog',
 'Letter agreement re: Blue Dog',
 'Letter agreement re: Blue Dog',
 'Re: FW: Form for Blue Dog',
 'FW: Form for Blue Dog',
 'Form for Blue Dog',
 'Re: Blue Dog Amended LLC',
 'Blue Dog Amended LLC',
 'Re: Dual Fuel Configuration - Blue Dog #2',
 'Re: Blue Dog Amended LLC',
 'Re: Blue Dog Amended LLC',
 'Blue Dog Amended LLC',
 'Re: FW: Blue Dog #2',
 'FW: Blue Dog #2',
 'FW: Blue Dog #2',
 'Re: FW: Blue Dog #2',
 'Re: FW: Blue Dog #2',
 'Re: FW: Blue Dog #2',
 'FW: Blue Dog #2',
 'FW: Blue Dog #2',
 'Blue Dog',
 'Blue Dog',
 'Blue Dog',
 'RE: Blue Dog #2',
 'RE: Blue Dog #2',
 'Re: FW: Blue Dog #2',
 'Re: FW: Blue Dog #2   << OLE Object: StdOleLink >>',
 'FW: Blue Dog #2',
 'FW: Blue Dog #2',
 'ENA/Blue Dog: First Amended and Restated LLC Agreement',
 'ENA/Blue Dog: First Amended and Restated LLC Agreement',
 'Blue Dog LLC agreement',
 'Blue Dog LLC agreement',
 'Blue Dog LLC agreement',
 'Re: ENA/Blue Dog:',
 'ENA/Blue Dog:',
 'Re: ENA/Project Blue Dog: Preliminary Documents',
 'ENA/Project Blue Dog: Preliminary Documents',
 'ENA/Project Blue Dog & Salmon: Preliminary Documents',
 'ENA/Project Blue Dog & Salmon: Preliminary Documents',
 'Blue Dog',
 'Blue Dog change orders',
 'Re: Blue Dog change orders',
 'Re: Blue Dog change orders',
 'Blue Dog assignment language',
 'Blue Dog',
 'Blue Dog',
 'Blue Dog',
 'Blue Dog',
 'Blue Dog',
 'Blue Dog',
 'Blue Dog',
 'Re: Blue Dog',
 'Blue Dog',
 'Re: ENRON Blue Dog Max 2 X 7 EA GTG',
 'Re: ENRON Blue Dog Max 2 X 7 EA GTG',
 'ENRON Blue Dog Max 2 X 7 EA GTG',
 'NW my comments in blue',
 'NW my comments in blue',
 'NW my comments in blue',
 'RE: ENA/Blue Dog: Revised Letter Agreement- my comments',
 'RE: ENA/Blue Dog: Revised Letter Agreement',
 'Re: Blue Dog change orders',
 'Re: Blue Dog change orders',
 'Re: Blue Dog change orders',
 'Re: Blue Dog change orders',
 'Re: Blue dog',
 'Blue dog',
 'Standard acknowledgement from GE - applicable to Blue Dog?',
 'Blue Dog change order',
 'ENA/Blue Dog: Revised Letter Agreement',
 'ENA/Blue Dog: Revised Letter Agreement',
 'ENA/Blue Dog: Revised Letter Agreement',
 'ENA/Blue Dog: Revised Letter Agreement',
 'ENA/Blue Dog: Revised Letter Agreement',
 'ENA/Blue Dog: Revised Letter Agreement',
 'ENA/Blue Dog: Revised Letter Agreement',
 'ENA/Blue Dog: LLC Agreement and Letter Agreement',
 'ENA/Blue Dog: LLC Agreement and Letter Agreement',
 'Re: FW: ENA/Blue Dog: Marked Documents',
 'RE: Blue Dog Turbines',
 'RE: Blue Dog Turbines',
 'RE: Blue Dog Turbines',
 'Blue Dog Turbines',
 'ENA/Blue Dog: Closing Checklist and ENA Incumbency Certificate',
 'ENA/Blue Dog: Closing Checklist and ENA Incumbency Certificate',
 'Re: ENA/Blue Dog: Closing Checklist and ENA Incumbency Certificate',
 'ENA/Blue Dog: Closing Checklist and ENA Incumbency Certificate',
 'Re: ENA/Blue Dog: Closing Checklist and ENA Incumbency Certificate',
 'Re: ENA/Blue Dog: Closing Checklist and ENA Incumbency Certificate',
 'ENA/Blue Dog: Closing Checklist and ENA Incumbency Certificate',
 'ENA/Blue Dog: Closing Checklist and ENA Incumbency Certificate',
 'ENA/Blue Dog: Marked',
 'ENA/Blue Dog: Marked',
 'ENA/Blue Dog: conference call with Paul Hastings',
 'ENA/Blue Dog: conference call with Paul Hastings',
 'Blue Dog (Northwestern) assignment',
 'ENA/Blue Dog: Conference call details',
 'ENA/Blue Dog: Conference call details',
 'Change Order 2 Blue Dog',
 'RE: Change Order 2 Blue Dog',
 'RE: Change Order 2 Blue Dog',
 'Change Order 2 Blue Dog',
 'RE: Change Order 2 Blue Dog',
 'RE: Change Order 2 Blue Dog',
 'Change Order 2 Blue Dog',
 'Blue Dog',
 'RE: FW: Blue Dog Max, comments on draft CO No. 2, rev of 4/23/01',
 'Re: ENA/Blue Dog: Escrow',
 'ENA/Blue Dog: Escrow',
 'Re: payment of invoices, Blue Dog Max',
 'Re: payment of invoices, Blue Dog Max',
 'payment of invoices, Blue Dog Max',
 'payment of invoices, Blue Dog Max',
 'Re: Blue Dog Meeting Today',
 'Blue Dog Meeting Today',
 'Re: payment of invoices, Blue Dog Max',
 'Re: payment of invoices, Blue Dog Max',
 'Re: payment of invoices, Blue Dog Max',
 'payment of invoices, Blue Dog Max',
 'payment of invoices, Blue Dog Max',
 'Re: Blue Dog - Letter to GE',
 'Blue Dog - Letter to GE',
 'Friday Meeting re: Blue Dog',
 'Friday Meeting re: Blue Dog',
 'Re: Friday Meeting re: Blue Dog',
 'Re: Friday Meeting re: Blue Dog',
 'Friday Meeting re: Blue Dog',
 'Re: Blue Dog Max monthly report',
 'Re: Blue Dog Max monthly report',
 'Blue Dog Max monthly report',
 'Blue Dog Max monthly report',
 'Blue Dog',
 'RE: Blue Dog',
 'RE: Blue Dog',
 'Blue Dog',
 'RE: Blue Dog',
 'RE: Blue Dog',
 'RE: Blue Dog',
 'Re: FW: Blue Dog',
 'FW: Blue Dog',
 'RE: Blue Dog',
 'RE: Blue Dog',
 'Re: Enron Blue Dog Assignment',
 'Enron Blue Dog Assignment',
 'Re: Blue Dog',
 'Blue Dog',
 'Re: ENA/Blue Dog: Incumbency Certificate',
 'ENA/Blue Dog: Incumbency Certificate',
 'RE: ENA/Blue Dog: Incumbency Certificate',
 'RE: ENA/Blue Dog: Incumbency Certificate',
 'Re: ENA/Blue Dog: Incumbency Certificate',
 'ENA/Blue Dog: Incumbency Certificate',
 'Re: Enron Blue Dog Assignment',
 'Re: Enron Blue Dog Assignment',
 'Enron Blue Dog Assignment',
 'RE: CA for Blue Ridge',
 'RE: CA for Blue Ridge',
 'RE: CA for Blue Ridge',
 'FW: CA for Blue Ridge',
 'CA for Blue Ridge',
 'FW: CA for Blue Ridge',
 'RE: CA for Blue Ridge',
 'RE: CA for Blue Ridge',
 'FW: CA for Blue Ridge',
 'CA for Blue Ridge',
 'RE: CA for Blue Ridge',
 'RE: CA for Blue Ridge',
 'RE: CA for Blue Ridge',
 'RE: CA for Blue Ridge',
 'RE: CA for Blue Ridge',
 'FW: CA for Blue Ridge',
 'CA for Blue Ridge',
 'FW: ENA/Blue Dog: Execution Sets',
 'ENA/Blue Dog: Execution Sets',
 'Blue Dog',
 'Blue Dog closing list',
 'Re: Golf at Doral Blue Course - January 3rd',
 'Golf at Doral Blue Course - January 3rd',
 'Re: Golf at Doral Blue Course - January 3rd',
 'Re: Your Blue Note',
 'Re: Your Blue Note',
 "Re: Lichtenstein's Blue Note",
 "Re: Lichtenstein's Blue Note",
 "Re: Lichtenstein's Blue Note",
 "Re: Lichtenstein's Blue Note",
 "Re: Lichtenstein's Blue Note",
 "Re: Lichtenstein's Blue Note",
 'Re: Agreement - Updated in Bold (Red).',
 'Re: Agreement - Updated in Bold (Red).',
 'FW: Red Rock filing',
 'Red Rock filing',
 'Red Rock Meeting',
 'Transwestern Red Rock Expansion',
 'FW: Transwestern Red Rock Expansion',
 'Transwestern Red Rock Expansion',
 'RE: Transwestern Red Rock Expansion',
 '=09FW: Transwestern Red Rock Expansion',
 '=09Transwestern Red Rock Expansion',
 'Transwestern Red Rock Expansion- FERC Extension Letter',
 'RE: Transwestern Red Rock Expansion- FERC Extension Letter',
 'RE: Transwestern Red Rock Expansion- FERC Extension Letter',
 'Transwestern Red Rock Expansion- FERC Extension Letter',
 'Transwestern Red Rock Expansion; Extension Request',
 'Re: Blue Range Resource Corporation',
 'blue book value',
 'Black Marlin / Blue Dolphin',
 'Black Marlin / Blue Dolphin',
 'Re: Blue Grass Synfuel LLC, et al, v. ENA',
 'RE: Blue Ribbon Panel',
 'Blue Ribbon Panel',
 'FW: One Fish, Two Fish, Yellow Fish, Enron',
 'RE: One Fish, Two Fish, Yellow Fish, Enron',
 'FW: One Fish, Two Fish, Yellow Fish, Enron',
 'FW: One Fish, Two Fish, Yellow Fish, Enron',
 'One Fish, Two Fish, Yellow Fish, Enron',
 'FW: One Fish, Two Fish, Yellow Fish, Enron',
 'FW: One Fish, Two Fish, Yellow Fish, Enron',
 'One Fish, Two Fish, Yellow Fish, Enron',
 'FW: TW Capacity Affected for Red Rock Tie-ins (CORRECTION)',
 'FW: TW Capacity Affected for Red Rock Tie-ins (CORRECTION)',
 'RE: TW Capacity Affected for Red Rock Tie-ins',
 'TW Capacity Affected for Red Rock Tie-ins=20',
 'Re: Red Cedar Receipts',
 'Red Cedar Receipts',
 'FW: TW Capacity Affected for Red Rock Tie-ins',
 '=09RE: TW Capacity Affected for Red Rock Tie-ins',
 '=09RE: TW Capacity Affected for Red Rock Tie-ins',
 'TW Capacity Affected for Red Rock Tie-ins=20',
 'Re: Red Cedar Contract',
 'Red Cedar Contract',
 'TW/Red Cedar deal',
 'more Red Cedar',
 'Red Cedar',
 'Red Cedar update',
 'New red cedar contract',
 'Red Cedar',
 'Red Cedar deal',
 'Red Cedar letter',
 'Red Cedar',
 'Red Cedar Letter',
 'Red Cedar Letter',
 'Red Cedar',
 'Red Cedar agency agreement',
 'Re: Red Meat',
 'Re: Red Meat',
 'Re: Red Meat',
 'Re: Red Meat',
 'Re: Red Meat',
 'Re: Red Meat',
 'Re: Red Meat',
 'Re: Red Meat',
 'Re: Red Meat',
 'Re: Red Meat',
 'Re: Red Meat',
 'Re: Red Meat',
 'Re: Red Meat',
 'Re: Red Meat',
 'Re: Red Meat',
 'Red Meat',
 'Southern Ute Indian Tribe d/b/a Red Willow Production Company',
 'Bank of America/ERMS executed master blue file',
 'Bank of America/ERMS executed master blue file',
 'SmartPortfolio.Com Update: Markets Soar on Blue Chip and Tech Rally',
 'SmartPortfolio.Com Update: Markets Soar on Blue Chip and Tech Rall=',
 'need a red file',
 'Re: Jan/Red Jan nat gas spreads',
 'Re: Jan/Red Jan nat gas spreads',
 'Blue Sox',
 'RE: Blue Sox',
 'Blue Sox',
 'Here are my changes to GTC bullet letter revision.doc (in blue I',
 'Gas at Blue Diamond',
 'FW: The Red, White and Blue',
 'FW: The Red, White and Blue',
 'The Red, White and  Blue',
 'FW: Red Rock filing',
 'FW: Red Rock filing',
 'Red Rock filing',
 'FW: TW Capacity Affected for Red Rock Tie-ins (CORRECTION)',
 'RE: TW Capacity Affected for Red Rock Tie-ins (CORRECTION)',
 'FW: TW Capacity Affected for Red Rock Tie-ins (CORRECTION)',
 'RE: TW Capacity Affected for Red Rock Tie-ins',
 'TW Capacity Affected for Red Rock Tie-ins=20',
 'FW: Air Permit Delay, Red Rock Expansion',
 'FW: Air Permit Delay, Red Rock Expansion',
 'Air Permit Delay, Red Rock Expansion',
 'Letter of Credit $ 5,500,000 in support of Transwestern Pipeline Red Rock Expansion',
 'Letter of Credit $ 5,500,000 in support of Transwestern Pipeline Red Rock Expansion',
 'RE: EDG and Red Cedar',
 'FW: EDG and Red Cedar',
 'RE: EDG and Red Cedar',
 'FW: EDG and Red Cedar',
 'RE: EDG and Red Cedar',
 'FW: EDG and Red Cedar',
 'EDG and Red Cedar',
 'Red Rock impact if delayed',
 'FW: Red Rock Weekly Reports',
 'Red Rock Weekly Reports',
 'Declined: Red Rock PSR',
 'Red Lake Storage',
 'RE: ReL Red Pepper Soup Recipe',
 'ReL Red Pepper Soup Recipe',
 'RE: Red Soup Recipe',
 'Red Soup Recipe',
 'FW: TW Capacity Affected for Red Rock Tie-ins',
 '=09TW Capacity Affected for Red Rock Tie-ins=20',
 'FW: Red-Neck Horseshoes',
 'FW: Red-Neck Horseshoes']

Capturing what matches

The re.search() function allows us to check whether or not a string matches a regular expression. Sometimes we want to find out not just if the string matches, but also what, exactly, in the string matched. In other words, we want to capture whatever it was that matched.

The easiest way to do this is with the re.findall() function, which takes a regular expression and a string to match it against, and returns a list of all parts of the string that the regular expression matched. Here's an example:

In [154]:
import re
print re.findall(r"\b\w{5}\b", "alpha beta gamma delta epsilon zeta eta theta")
Out[154]:
['alpha', 'gamma', 'delta', 'theta']

The regular expression above, \b\w{5}\b, means "find me strings of exactly five word characters (letters, digits or underscores) between word boundaries"---in other words, find me five-letter words. The re.findall() function returns a list of strings---not just telling us whether or not the string matched, but which parts of the string matched.
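
Because \w covers letters, digits and the underscore, "five-letter words" is a slight simplification. The sketch below, run on an invented string (text is hypothetical), shows \w{5} picking up a token that contains digits and an underscore, and contrasts it with \S, which matches any non-whitespace character, punctuation included:

```python
import re

# Hypothetical text mixing plain words, digits, an underscore and punctuation.
text = "alpha beta gamma r2_d2 x-ray! delta"

# Exactly five word characters between word boundaries.
word_chars = re.findall(r"\b\w{5}\b", text)

# Any run of five non-whitespace characters, punctuation included.
non_space = re.findall(r"\S{5}", text)

print(word_chars)
print(non_space)
```

Here r2_d2 counts as a five-character "word" for \w, while x-ray is found only by \S, since the hyphen is not a word character.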

For the following re.findall() examples, we'll be operating on the entire file of subject lines as a single string, instead of using a list comprehension for individual subject lines. Here's how to read in the entire file as one string, instead of as a list of strings:

In [159]:
all_subjects = open("enronsubjects.txt").read()

Having done that, let's write a regular expression that finds all domain names ending in .com, .net or .org in the subject lines:

In [188]:
re.findall(r"\b\w+\.(?:com|net|org)", all_subjects)
Out[188]:
['enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'Forbes.com',
 'Cortlandtwines.com',
 'Cortlandtwines.com',
 'Match.com',
 'Amazon.com',
 'Amazon.com',
 'Ticketmaster.com',
 'Ticketmaster.com',
 'Concierge.com',
 'Concierge.com',
 'har.com',
 'har.com',
 'HoustonChronicle.com',
 'HoustonChronicle.com',
 'har.com',
 'har.com',
 'har.com',
 'har.com',
 'har.com',
 'har.com',
 'Concierge.com',
 'Concierge.com',
 'washingtonpost.com',
 'washingtonpost.com',
 'washingtonpost.com',
 'washingtonpost.com',
 'ESPN.com',
 'ESPN.com',
 'ESPN.com',
 'enron.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'CommodityLogic.com',
 'CommodityLogic.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'MarketWatch.com',
 'MarketWatch.com',
 'MarketWatch.com',
 'INSIDER.com',
 'INSIDER.com',
 'ArdorNY.com',
 'ArdorNY.com',
 'ArdorNY.com',
 'ArdorNY.com',
 'ArdorNY.com',
 'yahoo.com',
 'governmentguide.com',
 'SmartPrice.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'Headhunter.net',
 'Headhunter.net',
 'Headhunter.net',
 'Headhunter.net',
 'Headhunter.net',
 'Headhunter.net',
 'enron.com',
 'merckmedco.com',
 'merckmedco.com',
 'turnonthetruth.com',
 'DefensiveDriver.com',
 'DefensiveDriver.com',
 'PrimeShot.com',
 'PrimeShot.com',
 'PrimeShot.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'Clickpaper.com',
 'Clickpaper.com',
 'Clickpaper.com',
 'Clickpaper.com',
 'enron.com',
 'MichaelMcDermott.com',
 'MichaelMcDermott.com',
 'MichaelMcDermott.com',
 'FUZZY.com',
 'lptrixie.com',
 'lptrixie.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'CapacityCenter.com',
 'CapacityCenter.com',
 'dow.com',
 'dow.com',
 'Center.com',
 'Center.com',
 'INO.com',
 'INO.com',
 'INO.com',
 'shockwave.com',
 'shockwave.com',
 'enron.com',
 'StudentMagazine.com',
 'myuhc.com',
 'myuhc.com',
 'southwest.com',
 'southwest.com',
 'southwest.com',
 'Alamo.com',
 'Alamo.com',
 'Alamo.com',
 'Alamo.com',
 'taxclaity.com',
 'taxclaity.com',
 'taxclaity.com',
 'taxclaity.com',
 'taxclaity.com',
 'taxclaity.com',
 'clanpages.com',
 'clanpages.com',
 'clanpages.com',
 'Amazon.com',
 'Amazon.com',
 'enron.com',
 'enron.com',
 'Paper.com',
 'enron.com',
 'fitnessheaven.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'MarketWatch.com',
 'MarketWatch.com',
 'MarketWatch.com',
 'MarketWatch.com',
 'EnronX.org',
 'EnronX.org',
 'NYTimes.com',
 'NYTimes.com',
 'enron.com',
 'enron.com',
 'NYTimes.com',
 'NYTimes.com',
 'Colonize.com',
 'CareerPath.com',
 'HoustonStreet.com',
 'Braodcast.com',
 'RedMeteor.com',
 'Broker.com',
 'Broker.com',
 'Broker.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'ClickPaper.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'EnergyGateway.com',
 'Credit2B.com',
 'EnronCredit.com',
 'Credit2B.com',
 'Credit2B.com',
 'Anywhere.com',
 'Credit2B.com',
 'EnronCredit.com',
 'PaperExchange.com',
 'PaperExchange.com',
 'enron.com',
 'marcus.net',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'eCredit.com',
 'eCredit.com',
 'eCredit.com',
 'Chematch.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'boxmind.com',
 'CERA.com',
 'CERA.com',
 'CERA.com',
 'sharperimage.com',
 'enron.com',
 'enron.com',
 'educationplanet.com',
 'educationplanet.com',
 'mathmistakes.com',
 'enron.com',
 'enron.com',
 'ft.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'PowerMarketers.com',
 'PowerMarketers.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'FT.com',
 'FT.com',
 'enron.com',
 'enron.com',
 'homestead.com',
 'FT.com',
 'FT.com',
 'FT.com',
 'FT.com',
 'insiderSCORES.com',
 'insiderSCORES.com',
 'insiderSCORES.com',
 'insiderSCORES.com',
 'Insiderscores.com',
 'Insiderscores.com',
 'marshweb.com',
 'marshweb.com',
 'Credit.com',
 'Credit.com',
 'Credit.com',
 'Credit.com',
 'Credit.com',
 'FT.com',
 'FT.com',
 'Powermarketers.com',
 'Powermarketers.com',
 'Enroncredit.com',
 'Enroncredit.com',
 'Enroncredit.com',
 'Enroncredit.com',
 'Enroncredit.com',
 'enerfax.com',
 'libertyforelian.org',
 'enerfax.com',
 'enerfax.com',
 'powermarketers.com',
 'Ynot.com',
 'Ynot.com',
 'bluemountain.com',
 'Amazon.com',
 'FinMath.com',
 'Amazon.com',
 'FinMath.com',
 'NYTimes.com',
 'NYTimes.com',
 'NYTimes.com',
 'NYTimes.com',
 'NYTimes.com',
 'NYTimes.com',
 'thewoodlands.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'Street.com',
 'Street.com',
 'nexant.com',
 'enron.com',
 'enron.com',
 'econlib.org',
 'latimes.com',
 'aimnet.com',
 'blades.com',
 'siliconvalley.com',
 'reactionsnet.com',
 'reactionsnet.com',
 'go.com',
 'reactionsnet.com',
 'reactionsnet.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'reactionsnet.com',
 'reactionsnet.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'Expedia.com',
 'Expedia.com',
 'Travelocity.com',
 'Travelocity.com',
 'NYTimes.com',
 'NYTimes.com',
 'enron.com',
 'enron.com',
 'ubsenergy.com',
 'ubsenergy.com',
 'netcoonline.com',
 'netcoonline.com',
 'ubsenergy.com',
 'ubswenergy.com',
 'ubsenergy.com',
 'ubswenergy.com',
 'ubsenergy.com',
 'ubswenergy.com',
 'ubsenergy.com',
 'ubswenergy.com',
 'ubsenergy.com',
 'ubswenergy.com',
 'ubsenergy.com',
 'ubswenergy.com',
 'ubsenergy.com',
 'ubswenergy.com',
 'ubsenergy.com',
 'ubswenergy.com',
 'ubsenergy.com',
 'ubswenergy.com',
 'Fool.com',
 'yahoo.com',
 'citienergy.com',
 'CVS.com',
 'CVS.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'Compaq.com',
 'Compaq.com',
 'Compaq.com',
 'enron.com',
 'ESPN.com',
 'sekurity.com',
 'sekurity.com',
 'enron.com',
 'taxclaity.com',
 'taxclaity.com',
 'taxclaity.com',
 'taxclaity.com',
 'enron.com',
 'BIGWORDS.com',
 'al.com',
 'al.com',
 'clanpages.com',
 'clanpages.com',
 'clanpages.com',
 'Match.com',
 'Match.com',
 'Match.com',
 'Match.com',
 'Match.com',
 'Match.com',
 'Match.com',
 'Match.com',
 'Match.com',
 'Match.com',
 'Match.com',
 'Match.com',
 'iWon.com',
 'iWon.com',
 'Individual.com',
 'Individual.com',
 'Individual.com',
 'Edmunds.com',
 'Quicken.com',
 'enron.com',
 'enron.com',
 'SurveySavvy.com',
 'Clickpaper.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'Nodocero.com',
 'Nodocero.com',
 'Nodocero.com',
 'Nodocero.com',
 'Nodocero.com',
 'Nodocero.com',
 'Nodocero.com',
 'Nodocero.com',
 'Nodocero.com',
 'Nodocero.com',
 'Nodocero.com',
 'Nodocero.com',
 'Nodocero.com',
 'Nodocero.com',
 'Nodocero.com',
 'Nodocero.com',
 'Nodocero.com',
 'ClickPaper.com',
 'ClickPaper.com',
 'ClickPaper.com',
 'ClickPaper.com',
 'ClickPaper.com',
 'ClickPaper.com',
 'ClickPaper.com',
 'ClickPaper.com',
 'ClickPaper.com',
 'EnergyPrism.com',
 'EnergyPrism.com',
 'EnergyPrism.com',
 'EnergyPrism.com',
 'InfrastructureWorld.com',
 'InfrastructureWorld.com',
 'InfrastructureWorld.com',
 'edftrading.com',
 'edftrading.com',
 'edftrading.com',
 'edftrading.com',
 'enron.com',
 'WeatherMarkets.com',
 'WeatherMarkets.com',
 'Amazon.com',
 'StoneAge.com',
 'StoneAge.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'hubserve.com',
 'hubserve.com',
 'AGA.org',
 'AGA.org',
 'u2.com',
 'U2.com',
 'U2.com',
 'ABCNEWS.com',
 'ABCNEWS.com',
 'ABCNEWS.com',
 'ABCNEWS.com',
 'ABCNEWS.com',
 'ABCNEWS.com',
 'ABCNEWS.com',
 'ABCNEWS.com',
 'ABCNEWS.com',
 'ABCNEWS.com',
 'ABCNEWS.com',
 'ABCNEWS.com',
 'ABCNEWS.com',
 'ABCNEWS.com',
 'ABCNEWS.com',
 'ABCNEWS.com',
 'FitRx.com',
 'FitRx.com',
 'FitRx.com',
 'FitRx.com',
 'Dictionary.com',
 'Dictionary.com',
 '1400smith.com',
 '1400smith.com',
 'UBSWenergy.com',
 'UBSWenergy.com',
 'UBSWenergy.com',
 'UBSWenergy.com',
 'UBSWenergy.com',
 'Omaha.com',
 'Omaha.com',
 'Omaha.com',
 'Omaha.com',
 'Omaha.com',
 'Omaha.com',
 'MyFamily.com',
 'MyFamily.com',
 'enron.com',
 'enron.com',
 'Quote.com',
 'Quote.com',
 'enron.com',
 'enron.com',
 'Grassy.com',
 'Grassy.com',
 'enron.com',
 'enron.com',
 'SmartMoney.com',
 'SmartMoney.com',
 'NYTimes.com',
 'NYTimes.com',
 'NYTimes.com',
 'NYTimes.com',
 'TheStreet.com',
 'TheStreet.com',
 'hannaandersson.com',
 'hannaandersson.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'akingump.com',
 'akingump.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'brcepat.com',
 'brcepat.com',
 'Agency.com',
 'Agency.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'southwest.com',
 'southwest.com',
 'southwest.com',
 'CareerPath.com',
 'CareerPath.com',
 'CareerPath.com',
 'CareerPath.com',
 'CareerPath.com',
 'Rediff.com',
 'Rediff.com',
 'Rediff.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'ScottPaul.com',
 'ScottPaul.com',
 'EnronCredit.com',
 'ScottPaul.com',
 'ScottPaul.com',
 'SpeakOut.com',
 'ScottPaul.com',
 'EnronCredit.com',
 'Nice.com',
 'Nice.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'enron.com',
 'enron.com',
 'WSJ.com',
 'RealMoney.com',
 'RealMoney.com',
 'Abestkitchen.com',
 'Abestkitchen.com',
 'ZanyBrainy.com',
 'ZanyBrainy.com',
 'CareerPath.com',
 'CareerPath.com',
 'CareerPath.com',
 'CareerPath.com',
 'CareerPath.com',
 'CareerPath.com',
 'CareerPath.com',
 'CareerPath.com',
 'alan.com',
 'enron.com',
 'alan.com',
 'enron.com',
 'Markets.com',
 'Markets.com',
 'BadMojo09092hotmail.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'Dictionary.com',
 'enron.com',
 'CareerPath.com',
 'CareerPath.com',
 'CareerPath.com',
 'Paper.com',
 'merckmedco.com',
 'EnronCredit.com',
 'EnergyGateway.com',
 'Amazon.com',
 'Amazon.com',
 'Amazon.com',
 'HoustonChronicle.com',
 'HoustonChronicle.com',
 'HoustonChronicle.com',
 'HoustonStreet.com',
 'HoustonStreet.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'Agency.com',
 'Agency.com',
 'Agency.com',
 'Agency.com',
 'Agency.com',
 'Agency.com',
 'Agency.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'risk.com',
 'risk.com',
 'risk.com',
 'risk.com',
 'risk.com',
 'risk.com',
 'risk.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'EnronOnline.com',
 'EnronOnline.com',
 'EnronOnline.com',
 'EnronOnline.com',
 'EnronOnline.com',
 'enron.com',
 'NYTimes.com',
 'NYTimes.com',
 'NYTimes.com',
 'NYTimes.com',
 'NYTimes.com',
 'NYTimes.com',
 'Travelocity.com',
 'Travelocity.com',
 'Industrialinfo.com',
 'Amazon.com',
 'Amazon.com',
 'quote.com',
 'quote.com',
 'Expedia.com',
 'Expedia.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'ubid.com',
 'Fingerhut.com',
 'Fingerhut.com',
 'Blair.com',
 'Blair.com',
 'continental.com',
 'continental.com',
 'GOPUSA.com',
 'GOPUSA.com',
 'GOPUSA.com',
 'autobytel.com',
 'autobytel.com',
 'RedMeteor.com',
 'RedMeteor.com',
 'RedMeteor.com',
 'EnronCredit.com',
 'Bid4me.com',
 'Bid4me.com',
 'enron.com',
 'enron.com',
 'hotmail.com',
 'hotmail.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'UBSWenergy.com',
 'UBSWenergy.com',
 'UBSWenergy.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'flashyourrack.com',
 'Quicken.com',
 'Quicken.com',
 'Quicken.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'WeatherMarkets.com',
 'WeatherMarkets.com',
 'freeyellow.com',
 'freeyellow.com']

Every time the string New York is found, along with the word that comes directly afterward:

In [161]:
re.findall(r"New York \b\w+\b", all_subjects)
Out[161]:
['New York Details',
 'New York Details',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York Times',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York Times',
 'New York Times',
 'New York Times',
 'New York Times',
 'New York Times',
 'New York Times',
 'New York Times',
 'New York City',
 'New York City',
 'New York City',
 'New York Power',
 'New York Power',
 'New York Power',
 'New York Power',
 'New York Power',
 'New York Power',
 'New York Power',
 'New York Power',
 'New York Mercantile',
 'New York Mercantile',
 'New York Branch',
 'New York City',
 'New York Energy',
 'New York Energy',
 'New York Energy',
 'New York Energy',
 'New York Energy',
 'New York sites',
 'New York sites',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York City',
 'New York City',
 'New York City',
 'New York City',
 'New York voice',
 'New York State',
 'New York State',
 'New York State',
 'New York State',
 'New York State',
 'New York State',
 'New York Inc',
 'New York Office',
 'New York Office',
 'New York regulatory',
 'New York regulatory',
 'New York regulatory',
 'New York regulatory',
 'New York Bar',
 'New York Bar']
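
If only the following word is of interest, a capturing group can pull it out on its own, and a Counter can tally the results. This is a minimal sketch: sample_subjects is a hypothetical stand-in for all_subjects, and the \b markers are left out because \w+ already stops at the end of the word.

```python
import re
from collections import Counter

# Hypothetical stand-in for all_subjects; the real corpus is much larger.
sample_subjects = "\n".join([
    "New York Times article",
    "RE: New York Times article",
    "Trip to New York City",
    "New York Power Authority",
])

# When the pattern has exactly one capturing group, re.findall() returns
# just the text matched by that group instead of the whole match.
following_words = re.findall(r"New York (\w+)", sample_subjects)

# Tally how often each following word occurs.
word_counts = Counter(following_words)
print(word_counts.most_common(2))
```

Tallying like this makes a long output such as the one above easier to scan, since repeated matches collapse into a single count.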

And just to bring things full-circle, everything that looks like a zip code, sorted:

In [163]:
sorted(re.findall(r"\b\d{5}\b", all_subjects))
Out[163]:
['00003',
 '00003',
 '00003',
 '00003',
 '00003',
 '00003',
 '00003',
 '00003',
 '00003',
 '00010',
 '00010',
 '00458',
 '01003',
 '02177',
 '06716',
 '06736',
 '06736',
 '06752',
 '06752',
 '06752',
 '06752',
 '06752',
 '06980',
 '06980',
 '10000',
 '10000',
 '11111',
 '11111',
 '11111',
 '11111',
 '11111',
 '11385',
 '11385',
 '11385',
 '11385',
 '11385',
 '11781',
 '11781',
 '12357',
 '12357',
 '12578',
 '12590',
 '12619',
 '14790',
 '14790',
 '14790',
 '14790',
 '14790',
 '14790',
 '14790',
 '14790',
 '15444',
 '15444',
 '15444',
 '15444',
 '15444',
 '15444',
 '15692',
 '15692',
 '15692',
 '15692',
 '15692',
 '15692',
 '15956',
 '16995',
 '19785',
 '19818',
 '20001',
 '20001',
 '20267',
 '20267',
 '20267',
 '20267',
 '20721',
 '20721',
 '20721',
 '20721',
 '20721',
 '20721',
 '20721',
 '20740',
 '20740',
 '20740',
 '20740',
 '20740',
 '20740',
 '20740',
 '20747',
 '20747',
 '20747',
 '20748',
 '20748',
 '20748',
 '21349',
 '21349',
 '21865',
 '22027',
 '22027',
 '22069',
 '22069',
 '22585',
 '22585',
 '22585',
 '22585',
 '22585',
 '22585',
 '22585',
 '22585',
 '22585',
 '23231',
 '24194',
 '24194',
 '24468',
 '24468',
 '24468',
 '24468',
 '24468',
 '24468',
 '24468',
 '24690',
 '24690',
 '24690',
 '24690',
 '24924',
 '24924',
 '24924',
 '24924',
 '25374',
 '25374',
 '25374',
 '25374',
 '25407',
 '25672',
 '25672',
 '25672',
 '25672',
 '25672',
 '25672',
 '25841',
 '25841',
 '25841',
 '25841',
 '26486',
 '26490',
 '26511',
 '26511',
 '26511',
 '26511',
 '26511',
 '26511',
 '26511',
 '26511',
 '26532',
 '26606',
 '26635',
 '26819',
 '26819',
 '26819',
 '26819',
 '26862',
 '26862',
 '27190',
 '27190',
 '27190',
 '27190',
 '27190',
 '27190',
 '27239',
 '27239',
 '27239',
 '27239',
 '27239',
 '27239',
 '27239',
 '27239',
 '27239',
 '27239',
 '27239',
 '27239',
 '27239',
 '27239',
 '27252',
 '27252',
 '27252',
 '27253',
 '27253',
 '27253',
 '27253',
 '27253',
 '27253',
 '27253',
 '27291',
 '27291',
 '27291',
 '27291',
 '27291',
 '27291',
 '27291',
 '27291',
 '27291',
 '27291',
 '27291',
 '27291',
 '27291',
 '27291',
 '27291',
 '27291',
 '27291',
 '27291',
 '27496',
 '27496',
 '27579',
 '27579',
 '27579',
 '27579',
 '27600',
 '27600',
 '27606',
 '27606',
 '27606',
 '27606',
 '27606',
 '27641',
 '29667',
 '29667',
 '29667',
 '29667',
 '29667',
 '29667',
 '29667',
 '29667',
 '29667',
 '30643',
 '30643',
 '30643',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '35830',
 '35830',
 '35830',
 '35830',
 '35830',
 '35830',
 '35830',
 '35830',
 '35830',
 '35830',
 '35842',
 '35842',
 '35842',
 '35842',
 '35874',
 '35874',
 '35874',
 '37877',
 '38616',
 '38616',
 '38927',
 '38927',
 '38927',
 '38927',
 '39474',
 '39764',
 '39764',
 '39764',
 '39764',
 '39764',
 '39764',
 '39764',
 '39764',
 '39833',
 '39833',
 '39833',
 '39833',
 '39833',
 '39833',
 '39833',
 '40387',
 '40387',
 '41038',
 '41598',
 '42029',
 '42066',
 '42150',
 '42306',
 '42342',
 '42343',
 '42354',
 '42361',
 '42363',
 '42371',
 '42375',
 '42661',
 '42750',
 '42764',
 '42789',
 '42789',
 '42896',
 '42913',
 '42933',
 '42934',
 '42975',
 '42976',
 '47477',
 '47477',
 '50250',
 '50569',
 '51407',
 '51407',
 '51407',
 '51407',
 '51407',
 '52330',
 '52330',
 '53091',
 '55358',
 '55358',
 '55358',
 '55358',
 '55358',
 '55358',
 '55358',
 '55358',
 '55358',
 '55358',
 '55358',
 '55358',
 '55358',
 '56565',
 '58900',
 '58900',
 '58911',
 '62039',
 '62039',
 '62164',
 '62164',
 '64231',
 '64231',
 '64231',
 '64231',
 '64231',
 '64231',
 '64231',
 '64231',
 '64231',
 '64231',
 '64231',
 '64231',
 '64937',
 '64937',
 '64937',
 '64937',
 '65066',
 '65066',
 '65066',
 '65066',
 '65066',
 '65066',
 '65066',
 '65187',
 '65403',
 '66394',
 '66547',
 '67133',
 '67207',
 '70197',
 '70197',
 '70996',
 '70996',
 '77257',
 '77349',
 '77349',
 '77349',
 '77349',
 '77349',
 '77349',
 '77349',
 '77349',
 '78032',
 '78032',
 '78033',
 '78033',
 '78158',
 '78158',
 '78728',
 '78728',
 '78728',
 '78728',
 '78728',
 '78728',
 '80110',
 '83017',
 '83017',
 '83017',
 '83829',
 '87541',
 '90593',
 '90593',
 '90593',
 '92886',
 '92886',
 '93394',
 '93481',
 '93481',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93871',
 '94074',
 '96724',
 '96724',
 '96731',
 '96731']
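
The \b markers are what keep this pattern from grabbing five digits out of the middle of a longer number. Here is a small sketch of the difference, using a made-up sample string (not taken from the corpus). Keep in mind that the pattern cannot tell a zip code from any other five-digit number.

```python
import re

# Made-up sample text: one six-digit number and two five-digit numbers.
sample = "Invoice 123456 shipped to Houston TX 77002, ref 00003"

# Without word boundaries, the first five digits of 123456 also match.
without_boundaries = re.findall(r"\d{5}", sample)

# With word boundaries, only runs of exactly five digits match.
with_boundaries = re.findall(r"\b\d{5}\b", sample)

print(without_boundaries)
print(sorted(with_boundaries))
```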

Full example: finding the dollar value of the Enron e-mail subject corpus

Here's an example that combines our regular expression prowess with our ability to do smaller manipulations on strings. We want to find all dollar amounts in the subject lines, and then figure out what their sum is.

To understand what we're working with, let's start by writing a list comprehension that finds every subject line containing a dollar sign ($):

In [164]:
[line for line in subjects if re.search(r"\$", line)]
Out[164]:
['Re: APEA - $228,204 hit',
 'Re: APEA - $228,204 hit',
 'DJ Cal-ISO Pays $10M To Avoid Rolling Blackouts Wed -Sources, DJ',
 'DJ Cal-ISO Pays $10M To Avoid Rolling Blackouts Wed -Sources, DJ',
 'DJ Cal-ISO Pays $10M To Avoid Rolling Blackouts Wed -Sources, DJ',
 'DJ Cal-ISO Pays $10M To Avoid Rolling Blackouts Wed -Sources, DJ',
 'Goldman Comment re: Enron issued this morning - Revised Price Target of $68/share',
 'RE: Goldman Sachs $2.19 Natural GAs',
 'Goldman Sachs $2.19 Natural GAs',
 'RE: $25 million',
 '$25 million',
 'RE: $25 million loan from EDf',
 '$25 million loan from EDf',
 'RE: $25 million loan from EDf',
 'RE: $25 million loan from EDf',
 'RE: $25 million loan from EDf',
 '$25 million loan from EDf',
 'RE: $25 million loan from EDf',
 'RE: $25 million loan from EDf',
 'RE: $25 million loan from EDf',
 'RE: $25 million loan from EDf',
 'RE: $25 million loan from EDf',
 '$25 million loan from EDf',
 'A$M and its "second tier" status',
 'A$M and its "second tier" status',
 'A$M and its "second tier" status',
 'UT/a$m business school and engineering school comparisons',
 'Re: $',
 '$',
 'Re: $',
 '$',
 '$$$$',
 'FFL $$',
 'RE: shipper imbal $$ collected',
 'shipper imbal $$ collected',
 "Oneok's Strangers Gas Payment $820,000",
 "Oneok's Strangers Gas Payment $820,000",
 'Another $40 Million?',
 'FW: Entergy and FPL Group Agree to a $27 Billion Merger Of Equals',
 'FW: Entergy and FPL Group Agree to a $27 Billion Merger Of Equals',
 'Over $50 -- You made it happen!',
 'Over $50 -- You made it happen!',
 'FW: Co 0530 CINY 40781075  $5,356.46  FX Funding',
 'Co 0530 CINY 40781075  $5,356.46  FX Funding',
 'FW: Outstanding Young Alumni Travel Value to Amsterdam from $895',
 'Outstanding Young Alumni Travel Value to Amsterdam from $895',
 'RE: Modesto 7 MW COB deal @$19.3.',
 'RE: Modesto 7 MW COB deal @$19.3.',
 'Modesto 7 MW COB deal @$19.3.',
 'Modesto 7 MW COB deal @$19.3.',
 'RE: -$870K prior month adjustments',
 '-$870K prior month adjustments',
 'RE: -$141,000 P&L hit on 8/13/01',
 '-$141,000 P&L hit on 8/13/01',
 '$$$',
 'Re: DWR Stranded costs: $21 billion',
 'CAISO cuts refund estimate to $6.1B from $8.9B',
 "State's Power Purchases Costlier Than Projected Tab is $6 million a",
 'Fwd: Edison gets more time; Calif. may sell $14 bln bonds',
 'Edison gets more time; Calif. may sell $14 bln bonds',
 'Re: IDEA RE ISSUE OF UTILS IN CALIF WANTING $100 PRICE CAP',
 'Back to $250 Cap in California',
 'Energy Secretary Announces $350MM to Upgrade Path 15',
 'RE: $.01 surcharge as "tax"',
 'FW: $.01 surcharge as "tax"',
 'FW: $.01 surcharge as "tax"',
 '$.01 surcharge as "tax"',
 "California's $12.5 Bln Bond Sale May Be Salvaged, Official Says;",
 "RE: California's $12.5 Bln Bond Sale May Be Salvaged, Official",
 "RE: California's $12.5 Bln Bond Sale May Be Salvaged, Official Says; DWR Contract Renegotiation Is Key",
 "California's $12.5 Bln Bond Sale May Be Salvaged, Official Says; DWR Contract Renegotiation Is Key",
 'Re: Royal Bank of Canada - Wire ($2,529,352.58)',
 'Free $10 Three Team Parlay',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'FW: Economic Times article: FIs may take over Enron for $700-800m',
 'FW: Economic Times article: FIs may take over Enron for $700-800m',
 'FW: Economic Times article: FIs may take over Enron for $700-800m',
 'Red Rock Delay $$ Impact',
 'HandsFree Kits - $2',
 'HandsFree Kits - $2',
 'Re: The $10 you owe me',
 'The $10 you owe me',
 'RE: Enron files for Chapter 11 owing US$13B',
 'Enron files for Chapter 11 owing US$13B',
 'RE: $ allocation',
 '$ allocation',
 'Re: Last chance: Save $100 on a future airline ticket',
 'Re: ECS and the $500k reduction',
 'Re: ECS and the $500k reduction',
 'Re: ECS and the $500k reduction',
 'Re: ECS and the $500k reduction',
 'ECS and the $500k reduction',
 'ECS and the $500k reduction',
 'ECS and the $500k reduction',
 'ECS and the $500k reduction',
 'ECS and the $500k reduction',
 'FW: Free Shipping & $1,300 in Savings',
 'Free Shipping & $1,300 in Savings',
 'RE: Free Shipping & $1,300 in Savings',
 'RE: Free Shipping & $1,300 in Savings',
 'FW: Free Shipping & $1,300 in Savings',
 'Free Shipping & $1,300 in Savings',
 'RE: Dynegy Is Mulling $2 Billion Investment In Enron in Possible',
 'FW: Dynegy Is Mulling $2 Billion Investment In Enron in Possible \tStep Toward Merger',
 'FW: Dynegy Is Mulling $2 Billion Investment In Enron in Possible Step Toward Merger',
 'Dynegy Is Mulling $2 Billion Investment In Enron in Possible Step Toward Merger',
 'Peoples Gas --> $5,000 Invoice for Summer-Winter Exchange 6-1-00 to',
 'Peoples Gas --> $5,000 Invoice for Summer-Winter Exchange 6-1-00 to',
 'Peoples Gas --> $5,000 Invoice for Summer-Winter Exchange 6-1-00 to',
 'Peoples Gas --> $5,000 Invoice for Summer-Winter Exchange 6-1-00 to',
 'Re: short fall $971,443.11 for Wis Elect Power',
 'Re: short fall $971,443.11 for Wis Elect Power',
 'Re: short fall $971,443.11 for Wis Elect Power',
 'Re: short fall $971,443.11 for Wis Elect Power',
 'Re: short fall $971,443.11 for Wis Elect Power',
 'short fall $971,443.11 for Wis Elect Power',
 'RE: Q&A for NNG/TW Supported $1Billion Line of Credit',
 'Q&A for NNG/TW Supported $1Billion Line of Credit',
 'FW: Deals from $39 in our Las Vegas store!',
 '=09Deals from $39 in our Las Vegas store!',
 'A trip worth $10,000 could be yours',
 'A trip worth $10,000 could be yours',
 '142,000,000 Email Addresses for ONLY $149!!!!',
 "Lou's $50,000",
 "Lou's $50,000",
 "Lou's $50,000",
 'Summary of $ at Risk for Customs',
 'Summary of $ at Risk for Customs',
 'Summary of $ at Risk for Customs',
 "Calling All Investors: The New Power Company's IPO Priced at $21",
 "Calling All Investors: The New Power Company's IPO Priced at $21 P=",
 'Fenosa and Enron to Invest $550 Million in Dominican Republic',
 "Enron Brazil To Invest $455 Million In Gas Distribution '01-'04",
 'RE: $5 million for 90 days?- how quaint!',
 'FW: $5 million for 90 days?- how quaint!',
 '$5 million for 90 days?- how quaint!',
 'RE: Wind $7MM',
 'RE: Wind $7MM',
 'RE: Wind $7MM',
 'Wind $7MM',
 'RE: Wind $7MM',
 'Wind $7MM',
 'Re: Counting the Cal ISO Votes for a $100 Price Cap',
 'RE: C$ swap between EIM/ENA',
 'C$ swap between EIM/ENA',
 "Re: Where's My $20",
 "Re: Where's My $20",
 "Re: Where's My $20",
 "Re: Where's My $20",
 'Re: $100',
 'Re: $100',
 'Re: $100',
 "Re: Where's My $20",
 "Re: Where's My $20",
 'RE: Eric Schroeder has just sent you $29.75 with PayPal',
 'Fw: Eric Schroeder has just sent you $29.75 with PayPal',
 'Eric Schroeder has just sent you $29.75 with PayPal',
 'RE: Eric Schroeder has just sent you $29.75 with PayPal',
 'Fw: Eric Schroeder has just sent you $29.75 with PayPal',
 'Eric Schroeder has just sent you $29.75 with PayPal',
 'RE: What are you talking about $1600?',
 'Re: What are you talking about $1600?',
 'RE: What are you talking about $1600?',
 'RE: What are you talking about $1600?',
 '=09Re: What are you talking about $1600?',
 'What are you talking about $1600?',
 'What are you talking about $1600?',
 'FW: Enron Seeks $2 Billion Cash Infusion As It Faces an Escalating',
 'FW: Enron Seeks $2 Billion Cash Infusion As It Faces an Escalating Fiscal Crisis',
 'Enron Seeks $2 Billion Cash Infusion As It Faces an Escalating Fiscal Crisis',
 'The new, correct price is $67,776,700',
 'Re: Demar request for $2.7 mm to pay out the Skandinavian now',
 'Re: Demar request for $2.7 mm to pay out the Skandinavian now',
 'RE: Transactions exceeding $100mil',
 'Our benefits are about $50 per month higher with UBS',
 'RE: $9.6MM EOL Gas Daily Issue',
 '$9.6MM EOL Gas Daily Issue',
 'FW: NEAL - ITIN ONLY/$212.50',
 'FW: NEAL - ITIN ONLY/$212.50',
 'NEAL - ITIN ONLY/$212.50',
 'FW: NEAL - ITIN ONLY/$212.50',
 'FW: NEAL - ITIN ONLY/$212.50',
 'NEAL - ITIN ONLY/$212.50',
 'FW: Duke $',
 'Duke $',
 'RE: Duke $',
 'FW: Duke $',
 'FW: Duke $',
 'Duke $',
 '$$$$',
 '$$$$',
 'RE: Wire Detail for 10/25/01 wire for  $195,209.95',
 'FW: Wire Detail for 10/25/01 wire for  $195,209.95',
 'RE: Wire Detail for 10/25/01 wire for  $195,209.95',
 'RE: Wire Detail for 10/25/01 wire for  $195,209.95',
 'FW: Wire Detail for 10/25/01 wire for  $195,209.95',
 'RE: Wire Detail for 10/25/01 wire for  $195,209.95',
 'RE: Wire Detail for 10/25/01 wire for  $195,209.95',
 'FW: Wire Detail for 10/25/01 wire for  $195,209.95',
 'RE: Wire Detail for 10/25/01 wire for  $195,209.95',
 'RE: Wire Detail for 10/25/01 wire for  $195,209.95',
 'FW: Wire Detail for 10/25/01 wire for  $195,209.95',
 'RE: Wire Detail for 10/25/01 wire for  $195,209.95',
 'FW: DYN($42/sh)/ENE($7/sh) Merger At Risk. - Simmons and Company',
 'FW: DYN($42/sh)/ENE($7/sh) Merger At Risk. - Simmons and Company latest thoughts',
 'FW: DYN($42/sh)/ENE($7/sh) Merger At Risk. - Simmons and Company latest thoughts',
 "FW: Re-Allocaton of $'s",
 "RE: Re-Allocaton of $'s",
 "Re-Allocaton of $'s",
 "Re-Allocaton of $'s",
 'RE: Wind $7MM',
 'FW: Wind $7MM',
 'Wind $7MM',
 'RE: $9.92????????????',
 '$9.92????????????',
 'RE: Below $10',
 'Below $10',
 'FW: Comments on the Status of ENE ($16/sh).',
 'FW: Comments on the Status of ENE ($16/sh).',
 'FW: Comments on the Status of ENE ($16/sh).',
 'Breaking News : Williams Ordered to Pay $8 Million Refund to',
 'Breaking News : Williams Ordered to Pay $8 Million Refund to Cal-ISO',
 'Coho $500mm lawsuit against Hicks Muse',
 'Coho $500mm lawsuit against Hicks Muse',
 'Coho $500mm lawsuit against Hicks Muse',
 'Re: $$$$',
 '$$$$',
 'Perd $',
 'Re: $80 million',
 'Re: $80 million',
 '$80 million',
 '$80 million',
 'Re: $80 million',
 '$80 million',
 '$80 million',
 'Re: Calif Atty Gen Offers $50M Reward In Pwr Supplier',
 'Financial Disclosure of $1.2 Billion Equity Adjustment',
 'ENE: Despite Bounce It Appears Cheap; Yet $102 Target Likely a Late',
 'ENE: Despite Bounce It Appears Cheap; Yet $102 Target Likely a Late 2002 Event:',
 'Is it worth $200?',
 'RE: #@$ !!!!!!!!',
 '$#%:#@$ !!!!!!!!',
 'RE: @%[email protected]!!!',
 '[email protected]%[email protected]!!!',
 'Special Offer: Switch to ShareBuilder and Get $50!',
 'Amendment to Enron Corp. $25 Million guaranty of Enron Credit Inc.',
 'RE: Amendment to Enron Corp. $25 Million guaranty of Enron Credit',
 'Goldman Sach $ repo docs',
 'Re: Goldman Sach $ repo docs',
 'RE: Amendment to Enron Corp. $25 Million guaranty of Enron Credit',
 'FW: Goldmans $1.5m',
 'Goldmans $1.5m',
 'FW: $1.5 Check',
 '$1.5 Check',
 'RE: Goldman Sachs $',
 'Goldman Sachs $',
 'RE: TODAY ONLY - SAVE UP TO $120 EXTRA ON AIRLINE TICKETS!',
 'RE: TODAY ONLY - SAVE UP TO $120 EXTRA ON AIRLINE TICKETS!',
 'RE: $.01 surcharge as "tax"',
 'RE: $.01 surcharge as "tax"',
 'RE: $.01 surcharge as "tax"',
 'FW: $.01 surcharge as "tax"',
 'FW: $.01 surcharge as "tax"',
 '$.01 surcharge as "tax"',
 "FW: PennFuture's E-Cubed - The $45 Million Rip Off",
 "=09PennFuture's E-Cubed - The $45 Million Rip Off",
 'RE: PaPUC assessment of $147,000 to Enron',
 'Re: PaPUC assessment of $147,000 to Enron',
 'PaPUC assessment of $147,000 to Enron',
 "RE: ASAP!! EES' objections to PaPUC assessment of $147,000",
 "ASAP!! EES' objections to PaPUC assessment of $147,000",
 'RE: Pennsylvania $147,000 EES Assessment',
 '=09Pennsylvania $147,000 EES Assessment',
 'FW: CAEM Study: Gas Dereg Has Saved Consumers $600B',
 'CAEM Study: Gas Dereg Has Saved Consumers $600B',
 'PaPUC assessment of $147,000 to Enron',
 "RE: ASAP!! EES' objections to PaPUC assessment of $147,000",
 "ASAP!! EES' objections to PaPUC assessment of $147,000",
 'FW: Energy Novice to Be Paid $240,000',
 'Energy Novice to Be Paid $240,000',
 'RE:  $22.8 schedule C for BPA deal',
 '$22.8 schedule C for BPA deal',
 '$22.8 schedule C for BPA deal',
 'origination $100k to Laird Dyer',
 'Cd$ CME letter',
 'Cd$ CME letter',
 '$',
 'RE: $',
 'RE: $',
 'Re: $',
 'RE: $',
 'GET RICH ON $6.00 !!!',
 'RE: Thoughts on the world of energy (OSX $77, XNG $183, XOI 496)',
 'FW: Letter of Credit $ 5,500,000 in support of Transwestern',
 'Letter of Credit $ 5,500,000 in support of Transwestern Pipeline Red Rock Expansion',
 'Letter of Credit $ 5,500,000 in support of Transwestern Pipeline Red Rock Expansion',
 'FW: shipper imbal $$ collected',
 'shipper imbal $$ collected',
 'FW: shipper imbal $$ collected',
 'RE: shipper imbal $$ collected',
 'shipper imbal $$ collected',
 'FW: shipper imbal $$ collected',
 'RE: shipper imbal $$ collected',
 'shipper imbal $$ collected',
 "FW: $$'s allocated to TW",
 "$$'s allocated to TW",
 'RE: email to USG confirming our decision not to require more LOC $',
 'email to USG confirming our decision not to require more LOC $',
 '$',
 'Re: Calpine Confirms $4.6B, 10-Yr Calif. Power Sales',
 'RE: $2.15 bn Enron Metals Inventory Financings Closed',
 'RE: $2.15 bn Enron Metals Inventory Financings Closed',
 'FW: Thayer Aerospace Awarded $130 Million Vought Aircraft Contract',
 'FW: Thayer Aerospace Awarded $130 Million Vought Aircraft Contract',
 'Thayer Aerospace Awarded $130 Million Vought Aircraft Contract to',
 're: mid-columbia $1 mm Schedule E difference',
 '$0.25 scheduling fee.',
 'MPC $',
 'RE: $10',
 'RE: $10',
 'RE: $10',
 '$10',
 'RE: $10',
 '$10',
 'FW: *** Gold/TSE GL/$US/CPI/TSE MM/CRB Bloomberg charts ***',
 'FW: *** Gold/TSE GL/$US/CPI/TSE MM/CRB Bloomberg charts ***',
 'FW: *** Gold/TSE GL/$US/CPI/TSE MM/CRB Bloomberg charts ***',
 'FW: Summer Fare Sale From $128 Return!',
 'Summer Fare Sale From $128 Return!']

Based on this data, we can guess at the steps we'd need to take in order to figure out these values. We're going to ignore anything that doesn't have "k", "million" or "billion" after it as chump change. So what we need to find is: a dollar sign, followed by any series of digits or periods, followed potentially by a space (but sometimes not), followed by a "k", "m" or "b" (which will sometimes start the word "million" or "billion" but sometimes not... so we won't bother looking).

Here's how I would translate that into a regular expression:

\$[0-9.]+ ?(?:[Kk]|[Mm]|[Bb])
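
To read the pattern piece by piece, it can help to write it with the re.VERBOSE flag, which lets you spread a regular expression over several lines and comment each part. The sketch below is the same pattern, with the optional space written as [ ]? because verbose mode ignores bare spaces. The example subject lines are copied from the output above.

```python
import re

dollar_pattern = re.compile(r"""
    \$                    # a literal dollar sign
    [0-9.]+               # one or more digits or periods
    [ ]?                  # an optional single space
    (?:[Kk]|[Mm]|[Bb])    # k, m or b in either case (non-capturing group)
""", re.VERBOSE)

examples = [
    "Cal-ISO Pays $10M",
    "RE: $25 million loan",
    "The $10 you owe me",
    "Calif. may sell $14 bln bonds",
]

# One list of matches per example line; the third line has no k/m/b suffix.
matches = [dollar_pattern.findall(line) for line in examples]
print(matches)
```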

We can use re.findall() to capture all instances where we found this regular expression in the text. Here's what that would look like:

In [182]:
re.findall(r"\$[0-9.]+ ?(?:[Kk]|[Mm]|[Bb])", all_subjects)
Out[182]:
['$10M',
 '$10M',
 '$10M',
 '$10M',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$40 M',
 '$27 B',
 '$27 B',
 '$870K',
 '$870K',
 '$21 b',
 '$6.1B',
 '$8.9B',
 '$6 m',
 '$14 b',
 '$14 b',
 '$350M',
 '$12.5 B',
 '$12.5 B',
 '$12.5 B',
 '$12.5 B',
 '$1.2M',
 '$1.2M',
 '$1.2M',
 '$1.2M',
 '$1.2M',
 '$1.2M',
 '$1.2M',
 '$13B',
 '$13B',
 '$500k',
 '$500k',
 '$500k',
 '$500k',
 '$500k',
 '$500k',
 '$500k',
 '$500k',
 '$500k',
 '$2 B',
 '$2 B',
 '$2 B',
 '$2 B',
 '$1B',
 '$1B',
 '$550 M',
 '$455 M',
 '$5 m',
 '$5 m',
 '$5 m',
 '$7M',
 '$7M',
 '$7M',
 '$7M',
 '$7M',
 '$7M',
 '$2 B',
 '$2 B',
 '$2 B',
 '$2.7 m',
 '$2.7 m',
 '$100m',
 '$9.6M',
 '$9.6M',
 '$7M',
 '$7M',
 '$7M',
 '$8 M',
 '$8 M',
 '$500m',
 '$500m',
 '$500m',
 '$80 m',
 '$80 m',
 '$80 m',
 '$80 m',
 '$80 m',
 '$80 m',
 '$80 m',
 '$50M',
 '$1.2 B',
 '$25 M',
 '$25 M',
 '$25 M',
 '$1.5m',
 '$1.5m',
 '$45 M',
 '$45 M',
 '$600B',
 '$600B',
 '$100k',
 '$4.6B',
 '$2.15 b',
 '$2.15 b',
 '$130 M',
 '$130 M',
 '$130 M',
 '$1 m']

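If we want to actually make a sum, though, we're going to need to do a little massaging. Note that the cell below searches with \$\d+ rather than \$[0-9.]+, so it only picks up whole-number amounts: matches with a decimal point, such as $12.5 B, are left out of the total.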

In [183]:
total_value = 0
dollar_amounts = re.findall(r"\$\d+ ?(?:[Kk]|[Mm]|[Bb])", all_subjects)
for amount in dollar_amounts:
    # the last character will be 'k', 'm', or 'b'; "normalize" by making lowercase.
    multiplier = amount[-1].lower()
    # trim off the beginning $ and ending multiplier value
    amount = amount[1:-1]
    # remove any remaining whitespace
    amount = amount.strip()
    # convert to a floating-point number
    float_amount = float(amount)
    # multiply by an amount, based on what the last character was
    if multiplier == 'k':
        float_amount = float_amount * 1000
    elif multiplier == 'm':
        float_amount = float_amount * 1000000
    elif multiplier == 'b':
        float_amount = float_amount * 1000000000
    # add to total value
    total_value = total_value + float_amount

print total_value
1.34965734e+12

The number is so big that Python decided to use scientific notation! If we convert to an integer, we get around that problem:

In [184]:
print int(total_value)
1349657340000
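
Converting to an integer is one way around the scientific notation. Another is to ask for a specific format when turning the float into a string. A small sketch, with the value from the cell above typed in by hand:

```python
# The value printed by the cell above, entered by hand for this sketch.
total_value = 1.34965734e+12

# %-formatting with zero decimal places gives plain digits.
as_plain_digits = "%.0f" % total_value

# str.format() with a comma adds thousands separators.
with_commas = "{:,.0f}".format(total_value)

print(as_plain_digits)
print(with_commas)
```

The comma-separated form is easier to read at this size, and neither version changes the stored value.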

That's over one trillion dollars! Nice work, guys.
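
The summing cell above searches with \$\d+, while the earlier findall cell used \$[0-9.]+, so amounts with a decimal point never reach the total. Below is a sketch of one way to include them. The function name, the MULTIPLIERS table and the stricter number pattern (digits with an optional decimal part) are choices made for this example, not part of the original notebook, and the sample lines are copied from the output above.

```python
import re

# Suffix letter -> multiplier (names chosen for this sketch).
MULTIPLIERS = {"k": 1e3, "m": 1e6, "b": 1e9}

# Group 1: digits with an optional decimal part. Group 2: the suffix letter.
AMOUNT_PATTERN = re.compile(r"\$(\d+(?:\.\d+)?) ?([KkMmBb])")

def sum_dollar_amounts(text):
    """Add up every $-amount in text that carries a k, m or b suffix."""
    total = 0.0
    # With two groups, findall() returns (number, suffix) tuples.
    for number, suffix in AMOUNT_PATTERN.findall(text):
        total += float(number) * MULTIPLIERS[suffix.lower()]
    return total

sample_text = "\n".join([
    "Cal-ISO Pays $10M To Avoid Rolling Blackouts",
    "California's $12.5 Bln Bond Sale May Be Salvaged",
    "ECS and the $500k reduction",
    "The $10 you owe me",
])

sample_total = sum_dollar_amounts(sample_text)
print("{:,.0f}".format(sample_total))
```

Using capturing groups means there is no need to slice off the dollar sign and suffix by hand. Either way, a subject line that appears several times in the corpus (as replies and forwards do) is counted once per appearance.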

Conclusion

Regular expressions are a great way to take some raw text and find the parts that are of interest to you. Python's string methods and string slicing syntax are a great way to massage and clean up data. You know them both now, which makes you powerful. But as powerful as you are, you have only scratched the surface of your potential! There is much more to regular expressions than we covered here. Here's some further reading: