Regular expressions: A Gentle Introduction

By Allison Parrish

A regular expression is more than just a phrase that sounds like a euphemism for what happens when your diet includes enough fiber. It's a way of writing what amount to small programs for matching patterns in text that would otherwise be difficult to match with the regular toolbox of string filtering and searching tools. This tutorial will take you through the basics of using regular expressions in Python. Many (if not most) other programming languages, JavaScript among them, also support regular expressions in some form, so the skills you'll learn here will apply to other languages as well.

"Escape" sequences in strings

Before we go into too much detail about regular expressions, I want to review with you how escape sequences work in Python strings.

Inside of strings that you type into your Python code, there are certain sequences of characters that have a special meaning. These sequences start with a backslash character (\) and allow you to insert into your string characters that would otherwise be difficult to type, or that would go against Python syntax. Here's some code illustrating a few common sequences:

In [1]:
print("1. include \"double quotes\" (inside of a double-quoted string)")
print('2. include \'single quotes\' (inside of a single-quoted string)')
print("3. one\ttab, two\ttabs")
print("4. new\nline")
print("5. include an actual backslash \\ (two backslashes in the string)")
1. include "double quotes" (inside of a double-quoted string)
2. include 'single quotes' (inside of a single-quoted string)
3. one	tab, two	tabs
4. new
line
5. include an actual backslash \ (two backslashes in the string)

Regular expressions

So far, we've discussed how to write Python expressions that are able to check whether strings meet very simple criteria, such as “does this string begin with a particular character” or “does this string contain another string”? But imagine writing a program that performs the following task: find and print all ZIP codes in a string (i.e., a five-character sequence of digits). Give up? Here’s my attempt, using only the tools we’ve discussed so far:

In [2]:
input_str = "here's a zip code: 12345. 567 isn't a zip code, but 45678 is. 23456? yet another zip code."
current = ""
zips = []
for ch in input_str:
    if ch in '0123456789':
        current += ch
    else:
        current = ""
    if len(current) == 5:
        zips.append(current)
        current = ""
zips
Out[2]:
['12345', '45678', '23456']

Basically, we have to iterate over each character in the string, check to see if that character is a digit, append to a string variable if so, continue reading characters until we reach a non-digit character, check to see if we found exactly five digit characters, and add it to a list if so. At the end, we print out the list that has all of our results. Problems with this code: it’s messy; it doesn’t overtly communicate what it’s doing; it’s not easily generalized to other, similar tasks (e.g., if we wanted to write a program that printed out phone numbers from a string, the code would likely look completely different).

Our ancient UNIX pioneers had this problem, and in pursuit of a solution, thought to themselves, "Let’s make a tiny language that allows us to write specifications for textual patterns, and match those patterns against strings. No one will ever have to write fiddly code that checks strings character-by-character ever again." And thus regular expressions were born.

Here's the code for accomplishing the same task with regular expressions, by the way:

In [3]:
import re
zips = re.findall(r"\d{5}", input_str)
zips
Out[3]:
['12345', '45678', '23456']

I’ll allow that the r"\d{5}" in there is mighty cryptic (though hopefully it won’t be when you’re done reading this page and/or participating in the associated lecture). But the overall structure of the program is much simpler.

Fetching our corpus

For this section of class, we'll be using the subject lines of all e-mails in the EnronSent corpus, kindly put into the public domain by the United States Federal Energy Regulatory Commission. Download a copy of this file and place it in the same directory as this notebook.

Matching strings with regular expressions

The most basic operation that regular expressions perform is matching strings: you’re asking the computer whether a particular string matches some description. We're going to be using regular expressions to print only those lines from our enronsubjects.txt corpus that match particular sequences. Let's load our corpus into a list of lines first:

In [6]:
subjects = [x.strip() for x in open("enronsubjects.txt").readlines()]

We can check whether or not a pattern matches a given string in Python with the re.search() function. The first parameter to search is the regular expression you're trying to match; the second parameter is the string you're matching against.

Here's an example, using a very simple regular expression. The following code prints out only those lines in our Enron corpus that match the (very simple) regular expression shipping:

In [7]:
import re
[line for line in subjects if re.search("shipping", line)]
Out[7]:
['FW: How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'How to use UPS for shipping on the internet',
 'lng shipping/mosk meeting in tokyo 2nd of feb',
 'lng shipping/mosk meeting in tokyo 2nd of feb',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'lng shipping',
 'lng shipping',
 'Re: lng shipping',
 'lng shipping']

At its simplest, a regular expression matches a string if that string contains exactly the characters you've specified in the regular expression. So the expression shipping matches strings that contain exactly the sequence s, h, i, p, p, i, n, g in a row. If the regular expression matches, re.search() evaluates to True and the matching line is included in the evaluation of the list comprehension.

BONUS TECH TIP: re.search() doesn't actually evaluate to True or False---it evaluates to either a Match object if a match is found, or None if no match was found. Those two count as True and False for the purposes of an if statement, though.
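
Here's a small sketch of what that tip looks like in practice. The two subject lines are made up for illustration (they aren't pulled from the corpus), and the variable names hit and miss are just labels I picked:

```python
import re

# Two made-up subject lines, standing in for lines from the corpus.
hit = re.search("shipping", "Re: lng shipping")
miss = re.search("shipping", "Re: lunch on Friday")

print(hit)    # a Match object describing where the pattern was found
print(miss)   # None

# A Match object counts as true in an if statement; None counts as false.
if hit:
    print("matched:", hit.group(), "at position", hit.start())
if not miss:
    print("no match in the second line")
```

The .group() and .start() methods of the Match object give you the text that matched and where in the string it began.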

Metacharacters: character classes

The "shipping" example is pretty boring. (There was hardly any fan fiction in there at all.) Let's go a bit deeper into detail with what you can do with regular expressions. There are certain characters or strings of characters that we can insert into a regular expression that have special meaning. For example:

In [8]:
[line for line in subjects if re.search("sh.pping", line)]
Out[8]:
['FW: How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'How to use UPS for shipping on the internet',
 "FW: We've been shopping!",
 'Re: Start shopping...',
 'Start shopping...',
 'lng shipping/mosk meeting in tokyo 2nd of feb',
 'lng shipping/mosk meeting in tokyo 2nd of feb',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'lng shipping',
 'lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'FW: Online shopping',
 'Online shopping']

In a regular expression, the character . means "match any character here" (any character except a newline, by default). So, using the regular expression sh.pping, we get lines that match shipping but also shopping. The . is an example of a regular expression metacharacter---a character (or string of characters) that has a special meaning.

Here are a few more metacharacters. These metacharacters allow you to say that a character belonging to a particular class of characters should be matched in a particular position:

metacharacter meaning
. match any character
\w match any alphanumeric ("word") character (lowercase and capital letters, 0 through 9, underscore)
\s match any whitespace character (e.g., space, tab, newline)
\S match any non-whitespace character (the inverse of \s)
\d match any digit (0 through 9)
\. match a literal .
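
To get a feel for the table above, here's a rough sketch that tries each metacharacter against a tiny string. The test strings are invented for illustration, and examples and results are just names I chose:

```python
import re

# Each tuple is (pattern, test string); the strings are invented for illustration.
examples = [
    (r"sh.pping", "shopping"),   # . stands in for the "o"
    (r"\w\w\w", "a_1"),          # letters, digits and underscore all count as \w
    (r"a\sb", "a\tb"),           # a tab is whitespace
    (r"a\Sb", "a b"),            # a space is whitespace, so \S fails here
    (r"\d\d", "x42"),            # two digits in a row
    (r"\.", "no dot here"),      # a literal period is required, so no match
]

results = [bool(re.search(pattern, text)) for pattern, text in examples]
for (pattern, text), matched in zip(examples, results):
    print(pattern, repr(text), matched)
```

The fourth and sixth lines print False; the rest print True.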

Here, for example, is a (clearly imperfect) regular expression to search for all subject lines containing a time of day:

In [9]:
[line for line in subjects if re.search(r"\d:\d\d\wm", line)]
Out[9]:
['RE: 3:17pm',
 '3:17pm',
 "RE: It's On!!! - 2:00pm Today",
 "FW: It's On!!! - 2:00pm Today",
 "It's On!!! - 2:00pm Today",
 'Re: Registration Confirmation: Larry Summers on 12/6 at 1:45pm (was',
 'Re: Conference Call today 2/9/01 at 11:15am PST',
 'Conference Call today 2/9/01 at 11:15am PST',
 '5/24 1:00pm conference call.',
 '5/24 1:00pm conference call.',
 'FW: 07:33am EDT 15-Aug-01 Prudential Securities (C',
 'FW: 07:33am EDT 15-Aug-01 Prudential Securities (C',
 '07:33am EDT 15-Aug-01 Prudential Securities (C',
 "Re: Updated Mar'00 Requirements Received at 11:25am from CES",
 "Re: Updated Mar'00 Requirements Received at 11:25am from CES",
 "Re: Updated Mar'00 Requirements Received at 11:25am from CES",
 "Updated Mar'00 Requirements Received at 11:25am from CES",
 'Reminder: Legal Team Meeting -- Friday, 9:00am Houston time',
 'Thursday, March 7th 1:30-3:00pm: REORIENTATION',
 'Meeting at 2:00pm Friday',
 'Meeting at 2:00pm Friday',
 'Fw: 12:30pm Deadline for changes to letters or contracts today',
 '12:30pm Deadline for changes to letters or contracts today',
 'Johnathan actually resigned at 9:00am this morning',
 'FW: Enron Conference Call Today, 11:00am CST',
 'Enron Conference Call Today, 11:00am CST',
 'Meeting, Wednesday, January 23 at 10:00am at the Houstonian',
 'RE: TVA Meeting, Wednesday June13, 1:15pm, EB3125b',
 'TVA Meeting, Wednesday June13, 1:15pm, EB3125b',
 'Re: Dabhol Update: Conference Call Thursday, Dec. 28, 8:00am',
 'Dabhol Update: Conference Call Thursday, Dec. 28, 8:00am Houston time',
 'FW: Victoria Ashley Jones Born 5/25/01 7:31am.',
 'Fw: Victoria Ashley Jones Born 5/25/01 7:31am.',
 'Victoria Ashley Jones Born 5/25/01 7:31am.',
 'RE: Victoria Ashley Jones Born 5/25/01 7:31am.',
 'Fw: Victoria Ashley Jones Born 5/25/01 7:31am.',
 'Victoria Ashley Jones Born 5/25/01 7:31am.',
 'RE: UCSF Cogen Calculation Conf Call, 10/12/01 at 8:00am PST',
 'UCSF Cogen Calculation Conf Call, 10/12/01 at 8:00am PST',
 'FW: Confirmation:  UCSF Cogen Conf Call. 10/22/02 at 8:00am',
 '=09RE: Confirmation:  UCSF Cogen Conf Call. 10/22/02 at 8:00am PST/=',
 '=09Confirmation:  UCSF Cogen Conf Call. 10/22/02 at 8:00am PST/10:0=',
 'RE: Confirmation:  UCSF Cogen Conf Call. 10/22/02 at 8:00am',
 '=09Confirmation:  UCSF Cogen Conf Call. 10/22/02 at 8:00am PST/10:0=',
 'Re: March expenses - deadline 04-04-01 2:00pm',
 'Cirque - Jan 24 5:00pm show']

Here's that regular expression again: r"\d:\d\d\wm". I'm going to show you how to read this, one unit at a time.

"Hey, regular expression engine. Tell me if you can find this pattern in the current string. First of all, look for any digit (\d). If you find that, look for a colon right after it (:). If you find that, look for another digit right after it (\d). If you find that, look for one more digit (\d). If you find that, look for any alphanumeric character---you know, a letter, a number, an underscore (\w). If you find that, then look for an m. Good? If you found all of those things in a row, then the pattern matched."

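If you want to see exactly which characters the time-of-day pattern grabs, the sketch below asks the Match object for the matched text. The test strings are made up (the first is modeled on a subject line from the output above), and m1, m2 and m3 are just names for this example:

```python
import re

time_pattern = r"\d:\d\d\wm"

# Only one digit is required before the colon, so the match starts at the second "1".
m1 = re.search(time_pattern, "Conference Call today at 11:15am PST")
print(m1.group())   # 1:15am

# \w accepts any word character, so this nonsense "time" matches as well.
m2 = re.search(time_pattern, "9:99xm")
print(m2.group())   # 9:99xm

# There's no digit directly before the colon here, so nothing matches.
m3 = re.search(time_pattern, "Re: 15am")
print(m3)           # None
```

That second case is part of what makes the expression "clearly imperfect."
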
But what about that weirdo r""?

Python provides another way to include string literals in your program, in addition to the single- and double-quoted strings we've already discussed. The r"" string literal, or "raw" string, includes all characters inside the quotes literally, without interpolating special escape characters. Here's an example:

In [10]:
print("1. this is\na test")
print(r"2. this is\na test")
print("3. I love \\ backslashes!")
print(r"4. I love \ backslashes!")
1. this is
a test
2. this is\na test
3. I love \ backslashes!
4. I love \ backslashes!

As you can see, whereas a double- or single-quoted string literal interprets \n as a new line character, the raw string includes those characters as they were literally written. More important, for our purposes at least, is the fact that in a raw string we only need to write one backslash in order to get a literal backslash in our string.

Why is this important? Because regular expressions use backslashes all the time, and we don't want Python to try to interpret those backslashes as special characters. (Inside a regular string, we'd have to write a simple regular expression like \b\w+\b as \\b\\w+\\b---yecch.)

So the basic rule of thumb is this: use r"" to quote any regular expressions in your program. All of the examples you'll see below will use this convention.
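
A quick sketch of why this matters for \b in particular. In an ordinary Python string, "\b" is an escape sequence for a single backspace character, so the regular expression engine would never see the word-boundary anchor. The sample sentence here is made up:

```python
import re

# In an ordinary string literal, "\b" is the single backspace character.
# In a raw string literal, r"\b" is two characters: a backslash and a "b".
plain_length = len("\b")
raw_length = len(r"\b")
print(plain_length, raw_length)

# Doubling the backslashes in an ordinary string gives the same pattern
# as the raw string, just with more typing.
same_pattern = "\\b\\w+\\b" == r"\b\w+\b"
print(same_pattern)

words = re.findall(r"\b\w+\b", "raw strings help")
print(words)
```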

Character classes in-depth

You can define your own character classes by enclosing a list of characters, or range of characters, inside square brackets:

regex explanation
[aeiou] matches any vowel
[02468] matches any even digit
[a-z] matches any lower-case letter
[A-Z] matches any upper-case letter
[^0-9] matches any non-digit (the ^ inverts the class, matches anything not in the list)
[Ee] matches either E or e
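
Here's a small sketch that tries the classes from the table on a short string. The string is invented, and I'm using re.findall() (covered in more detail later in this tutorial) only because it makes it easy to see every character a class picks out:

```python
import re

sample = "Room 204B, Ext 7"   # a made-up string

vowels = re.findall(r"[aeiou]", sample)       # lowercase vowels only
even_digits = re.findall(r"[02468]", sample)
capitals = re.findall(r"[A-Z]", sample)
either_e = re.findall(r"[Ee]", sample)
print(vowels, even_digits, capitals, either_e)

# [^0-9] needs at least one character that is NOT a digit.
all_digits = re.search(r"[^0-9]", "2024")
has_letter = re.search(r"[^0-9]", "20x4")
print(all_digits, has_letter.group())
```

Note that [aeiou] skips the capital E in "Ext": character classes are case-sensitive unless you list both cases, as in [Ee].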

Let's find every subject line where we have four or more vowels in a row:

In [121]:
[line for line in subjects if re.search(r"[aeiou][aeiou][aeiou][aeiou]", line)]
Out[121]:
['Re: Natural gas quote for Louiisiana-Pacific (L-P)',
 'WooooooHoooooo more Vacation',
 'Re: Clickpaper Counterparties waiting to clear the work queue',
 'Gooooooooooood Bye!',
 'Gooooooooooood Bye!',
 'RE: Hello Sweeeeetie',
 'Hello Sweeeeetie',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'Re: FW: Wasss Uuuuuup STG?',
 'RE: Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',
 'Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',
 'FW: The Osama Bin Laden Song ( Soooo Funny !! )',
 'Fw: The Osama Bin Laden Song ( Soooo Funny !! )',
 'The Osama Bin Laden Song ( Soooo Funny !! )',
 'RE: duuuuhhhhh',
 'RE: duuuuhhhhh',
 'RE: duuuuhhhhh',
 'duuuuhhhhh',
 'RE: duuuuhhhhh',
 'duuuuhhhhh',
 'RE: FPL Queue positions 1-15',
 'Re: FPL Queue positions 1-15',
 'Re: Helloooooo!!!',
 'Re: Helloooooo!!!',
 'Fw: FW: OOOooooops',
 'FW: FW: OOOooooops',
 'Re: yeeeeha',
 'yeeeeha',
 'yahoooooooooooooooooooo',
 'RE: yahoooooooooooooooooooo',
 'RE: yahoooooooooooooooooooo',
 'yahoooooooooooooooooooo',
 'RE: I hate yahooooooooooooooo',
 'I hate yahooooooooooooooo',
 'RE: I hate yahooooooooooooooo',
 'I hate yahooooooooooooooo',
 'RE: I hate yahooooooooooooooo',
 'I hate yahooooooooooooooo',
 'RE: I hate yahooooooooooooooo',
 'I hate yahooooooooooooooo',
 "FW: duuuuuuuuuuuuuuuuude...........what's up?",
 "RE: duuuuuuuuuuuuuuuuude...........what's up?",
 "RE: duuuuuuuuuuuuuuuuude...........what's up?",
 'Re: skiiiiiiiiing',
 'skiiiiiiiiing',
 'scuba dooooooooooooo',
 'RE: scuba dooooooooooooo',
 'RE: scuba dooooooooooooo',
 'scuba dooooooooooooo',
 'Re: skiiiiiiiing',
 'skiiiiiiiing',
 'Re: skiiiiiiiing',
 'Re: skiiiiiiiiing',
 "RE: Clickpaper CP's awaiting migration in work queue's 06/27/01",
 "FW: Clickpaper CP's awaiting migration in work queue's 06/27/01",
 "Clickpaper CP's awaiting migration in work queue's 06/27/01",
 'RE:  Sequoia Adv. Pro.: Draft Stipulation and Order',
 'FW: Sequoia Adv. Pro.: Draft Stipulation and Order',
 'Sequoia Adv. Pro.: Draft Stipulation and Order',
 'Re: FW: Sequoia Adv. Pro.: Draft Stipulation and Order',
 'FW: Sequoia Adv. Pro.: Draft Stipulation and Order',
 'FW: Sequoia Adv. Pro.: Draft Stipulation and Order',
 'Fw: Sequoia Adv. Pro.: Draft Stipulation and Order',
 'Sequoia Adv. Pro.: Draft Stipulation and Order',
 'Sequoia Adv. Pro.: Draft Stipulation and Order',
 'i would have done this but i was toooo busy.....']

Metacharacters: anchors

The next important kind of metacharacter is the anchor. An anchor doesn't match a character, but matches a particular place in a string.

anchor meaning
^ match at beginning of string
$ match at end of string
\b match at word boundary

Note: ^ in a character class has a different meaning from ^ outside a character class!

Note #2: If you want to search for a literal dollar sign ($), you need to put a backslash in front of it, like so: \$
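
The sketch below runs each anchor, plus the two notes above, against a single made-up subject line. The variable names are only labels for this example:

```python
import re

line = "new york rates: $5"   # a made-up subject line

starts_with_new = bool(re.search(r"^[Nn]ew", line))   # "new" is at the very start
starts_with_york = bool(re.search(r"^york", line))    # "york" is not
ends_with_digit = bool(re.search(r"\d$", line))       # the last character is 5
has_word_rate = bool(re.search(r"\brate\b", line))    # "rates" is a different word
has_word_rates = bool(re.search(r"\brates\b", line))

# Inside square brackets, ^ means "not": the first character that isn't an n.
first_non_n = re.search(r"[^n]", line).group()

# A backslash turns $ into a literal dollar sign.
price = re.search(r"\$\d", line).group()

print(starts_with_new, starts_with_york, ends_with_digit)
print(has_word_rate, has_word_rates, first_non_n, price)
```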

Now we have enough regular expression knowledge to do some fairly sophisticated matching. As an example, here are all the subject lines that begin with the string New York, regardless of whether or not the initial letters were capitalized:

In [12]:
[line for line in subjects if re.search(r"^[Nn]ew [Yy]ork", line)]
Out[12]:
['New York Details',
 'New York Power Authority',
 'New York Power Authority',
 'New York Power Authority',
 'New York Power Authority',
 'New York',
 'New York',
 'New York',
 'New York, etc.',
 'New York, etc.',
 'New York sites',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York',
 'New York',
 'New York City Marathon Guaranteed Entry',
 'new york rest reviews',
 'New York State Electric & Gas Corporation ("NYSEG")',
 'New York State Electric & Gas Corporation ("NYSEG")',
 'New York State Electric & Gas Corporation ("NYSEG")',
 'New York State Electric & Gas ("NYSEG")',
 'New York regulatory restriccions',
 'New York regulatory restriccions',
 'New York Bar Numbers']

Every subject line that ends with an ellipsis (there are a lot of these, so I'm only displaying the first 30):

In [13]:
[line for line in subjects if re.search(r"\.\.\.$", line)][:30]
Out[13]:
['Re: Inquiry....',
 'Re: Inquiry....',
 'RE: the candidate we spoke about this morning...',
 'the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'the candidate we spoke about this morning...',
 'Re: Hmmmmm........',
 'Hmmmmm........',
 'FW: Bumping into the husband....',
 'FW: Bumping into the husband....',
 'RE: try this one...',
 'RE: try this one...',
 'Re: try this one...',
 'try this one...',
 'RE: try this one...',
 'RE: try this one...',
 'Re: try this one...',
 'try this one...',
 'RE: try this one...',
 'RE: try this one...',
 'Re: try this one...',
 'try this one...',
 'RE: try this one...',
 'RE: try this one...']

The first thirty subject lines containing the word "oil":

In [14]:
[line for line in subjects if re.search(r"\b[Oo]il\b", line)][:30]
Out[14]:
['Re: PIRA Global Oil and Natural Outlooks- Save these dates.',
 'PIRA Global Oil and Natural Outlooks- Save these dates.',
 'Re: PIRA Global Oil and Natural Outlooks- Save these dates.',
 '=09PIRA Global Oil and Natural Outlooks- Save these dates.',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Cabot Oil & Gas Marketing Corp. - Amendment and Confirmations to',
 'Cabot Oil & Gas Marketing Corp. - Amendment and Confirmations to',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Cabot Oil & Gas Marketing Corp. - Amendment and Confirmations to',
 'Cabot Oil & Gas Marketing Corp. - Amendment and Confirmations to',
 'EOTT Crude Oil Tanks',
 'Re: Oil Skim + "Bugs"',
 'Oil Skim + "Bugs"',
 'Oil Release Incident',
 'Oil Release Incident',
 'Oil Release Incident',
 'RE: Location of the 2002 Institute on Oil & Gas Law & Taxation --',
 'Location of the 2002 Institute on Oil & Gas Law & Taxation -- February, 2002',
 'RE: Location of the 2002 Institute on Oil & Gas Law & Taxation --',
 'RE: Location of the 2002 Institute on Oil & Gas Law & Taxation -- February, 2002',
 'RE: Location of the 2002 Institute on Oil & Gas Law & Taxation',
 'B & J Gas and Oil']

Metacharacters: quantifiers

Above we had a regular expression that looked like this:

[aeiou][aeiou][aeiou][aeiou]

Typing out all of those things is kind of a pain. Fortunately, there’s a way to specify how many times to match a particular character, using quantifiers. A quantifier affects the character (or character class) that immediately precedes it:

quantifier meaning
{n} match exactly n times
{n,m} match at least n times, but no more than m times
{n,} match at least n times
+ match at least once (same as {1,})
* match zero or more times
? match one time or zero times
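
Here's a rough sketch of each quantifier in the table. To make the limits easy to see, it uses re.fullmatch(), which only succeeds if the entire string matches the pattern. That function isn't used elsewhere in this tutorial, and whole_match is a hypothetical helper I've defined just for this example:

```python
import re

def whole_match(pattern, text):
    """Hypothetical helper: True if the pattern matches the entire string."""
    return bool(re.fullmatch(pattern, text))

checks = {
    "o{3} vs ooo": whole_match(r"o{3}", "ooo"),            # exactly three
    "o{3} vs oooo": whole_match(r"o{3}", "oooo"),          # four is one too many
    "o{2,3} vs oo": whole_match(r"o{2,3}", "oo"),          # two or three
    "o{2,} vs ooooo": whole_match(r"o{2,}", "ooooo"),      # two or more
    "go+d vs gd": whole_match(r"go+d", "gd"),              # + needs at least one o
    "go*d vs gd": whole_match(r"go*d", "gd"),              # * is happy with zero
    "colou?r vs color": whole_match(r"colou?r", "color"),  # the u is optional
    "colou?r vs colour": whole_match(r"colou?r", "colour"),
    "[aeiou]{4} vs ueue": whole_match(r"[aeiou]{4}", "ueue"),  # works on classes too
}

for label, matched in checks.items():
    print(label, matched)
```

Only the "o{3} vs oooo" and "go+d vs gd" checks come out False.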

For example, here's a regular expression that finds subjects that contain at least fifteen capital letters in a row:

In [15]:
[line for line in subjects if re.search(r"[A-Z]{15,}", line)]
Out[15]:
['CONGRATULATIONS!',
 'CONGRATULATIONS!',
 'Re: FW: Fw: Fw: Fw: Fw: Fw: Fw: PLEEEEEEEEEEEEEEEASE READ!',
 'ACCOMPLISHMENTS',
 'ACCOMPLISHMENTS',
 'Re: FW: FORM: BILATERAL CONFIDENTIALITY AGREEMENT',
 'FORM: BILATERAL CONFIDENTIALITY AGREEMENT',
 'Re: CONGRATULATIONS!',
 'CONGRATULATIONS!',
 'Re: ORDER ACKNOWLEDGEMENT',
 'ORDER ACKNOWLEDGEMENT',
 'RE: CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'Re: CONGRATULATIONS',
 'CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'Re: CONGRATULATIONS',
 'CONGRATULATIONS',
 'Re: VEPCO INTERCONNECTION AGREEMENT',
 'VEPCO INTERCONNECTION AGREEMENT',
 'Re: VEPCO INTERCONNECTION AGREEMENT',
 'Re: VEPCO INTERCONNECTION AGREEMENT',
 'VEPCO INTERCONNECTION AGREEMENT',
 'Re: CONGRATULATIONS !',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'Re: FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'RE: NOOOOOOOOOOOOOOOO',
 'NOOOOOOOOOOOOOOOO',
 'RE: NOOOOOOOOOOOOOOOO',
 'CONGRATULATIONS!!!!!!!!!!!!!',
 'RE: CONGRATULATIONS!!!!!!!!!!!!!',
 'Re: CONGRATULATIONS!!!!!!!!!!!!!',
 'CONGRATULATIONS',
 'Re: CONFIDENTIALITY/CONFLICTS ISSUES MEETING',
 'CONFIDENTIALITY/CONFLICTS ISSUES MEETING',
 'GOALS AND ACCOMPLISHMENTS',
 'ACCOMPLISHMENTS',
 'Re: CONGRATULATIONS!',
 'RE: STANDARDIZATION OF TANKER FREIGHT WORDING',
 'RE: STANDARDIZATION OF TANKER FREIGHT WORDING',
 'Re: STANDARDIZATION OF TANKER FREIGHT WORDING',
 'STANDARDIZATION OF TANKER FREIGHT WORDING',
 'BRRRRRRRRRRRRRRRRRRRRR',
 'Re: CONGRATULATIONS !!!',
 'CONGRATULATIONS !!!',
 'RE: Mtg. to discuss assignment of customers. Transmission list:  P/LEGAL/PROJECTNETCO/NETCOTRANSMISSION.XLS',
 'RE: Mtg. to discuss assignment of customers. Transmission list:  P/LEGAL/PROJECTNETCO/NETCOTRANSMISSION.XLS',
 'Mtg. to discuss assignment of customers. Transmission list:  P/LEGAL/PROJECTNETCO/NETCOTRANSMISSION.XLS',
 'FW: NEW WEATHER SWAPS ON THE INTERCONTINENTAL EXCHANGE',
 'NEW WEATHER SWAPS ON THE INTERCONTINENTAL EXCHANGE']

Lines that contain five consecutive vowels:

In [16]:
[line for line in subjects if re.search(r"[aeiou]{5}", line)]
Out[16]:
['WooooooHoooooo more Vacation',
 'Gooooooooooood Bye!',
 'Gooooooooooood Bye!',
 'RE: Hello Sweeeeetie',
 'Hello Sweeeeetie',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'Re: FW: Wasss Uuuuuup STG?',
 'RE: Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',
 'Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',
 'Re: Helloooooo!!!',
 'Re: Helloooooo!!!',
 'Fw: FW: OOOooooops',
 'FW: FW: OOOooooops',
 'yahoooooooooooooooooooo',
 'RE: yahoooooooooooooooooooo',
 'RE: yahoooooooooooooooooooo',
 'yahoooooooooooooooooooo',
 'RE: I hate yahooooooooooooooo',
 'I hate yahooooooooooooooo',
 'RE: I hate yahooooooooooooooo',
 'I hate yahooooooooooooooo',
 'RE: I hate yahooooooooooooooo',
 'I hate yahooooooooooooooo',
 'RE: I hate yahooooooooooooooo',
 'I hate yahooooooooooooooo',
 "FW: duuuuuuuuuuuuuuuuude...........what's up?",
 "RE: duuuuuuuuuuuuuuuuude...........what's up?",
 "RE: duuuuuuuuuuuuuuuuude...........what's up?",
 'Re: skiiiiiiiiing',
 'skiiiiiiiiing',
 'scuba dooooooooooooo',
 'RE: scuba dooooooooooooo',
 'RE: scuba dooooooooooooo',
 'scuba dooooooooooooo',
 'Re: skiiiiiiiing',
 'skiiiiiiiing',
 'Re: skiiiiiiiing',
 'Re: skiiiiiiiiing']

Count the number of lines that are e-mail forwards, regardless of whether the subject line begins with Fw:, FW:, Fwd: or FWd: (the d in the pattern is lowercase, so an all-caps FWD: is not counted):

In [17]:
len([line for line in subjects if re.search(r"^F[Ww]d?:", line)])
Out[17]:
20159
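
A quick way to check exactly which prefixes the pattern in the cell above accepts is to run it against a handful of candidates. The list of prefixes below is made up for illustration, not taken from the corpus:

```python
import re

forward = r"^F[Ww]d?:"

# Made-up prefixes to test the pattern against.
prefixes = ["Fw:", "FW:", "Fwd:", "FWd:", "FWD:", "fwd:", "Re: Fw:"]
accepted = [p for p in prefixes if re.search(forward, p)]
print(accepted)   # ['Fw:', 'FW:', 'Fwd:', 'FWd:']
```

The optional d? only covers a lowercase d, the leading F has to be a capital, and the ^ anchor rules out a forward marker that shows up later in the line.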

Lines that have the word news in them and end in an exclamation point:

In [18]:
[line for line in subjects if re.search(r"\b[Nn]ews\b.*!$", line)]
Out[18]:
['RE: Christmas Party News!',
 'FW: Christmas Party News!',
 'Christmas Party News!',
 'Good News!',
 'Good News--Twice!',
 'Re: VERY Interesting News!',
 'Great News!',
 'Re: Great News!',
 'News Flash!',
 'RE: News Flash!',
 'RE: News Flash!',
 'News Flash!',
 'RE: Good News!',
 'RE: Good News!',
 'RE: Good News!',
 'RE: Good News!',
 'Good News!',
 'RE: Good News!!!',
 'Good News!!!',
 'RE: Big News!',
 'Big News!',
 'Individual.com - News From a Friend!',
 'Individual.com - News From a Friend!',
 'Re: Individual.com - News From a Friend!',
 'RE: We need news!',
 '=09We need news!',
 'RE: Big News!',
 'FW: Big News!',
 'RE: Big News!',
 'FW: Big News!',
 'Big News!',
 'FW: NW Wine News- Eroica, Sineann, Bergstrom, Hamacher, And more!',
 '=09NW Wine News- Eroica, Sineann, Bergstrom, Hamacher, And more!',
 'RE: Good News!!!',
 'Good News!!!',
 'Re: Big News!',
 'Big News!',
 'RE: Good  News!',
 'Good  News!']

Metacharacters: alternation

One final bit of regular expression syntax: alternation.

  • (?:x|y): match either x or y
  • (?:x|y|z): match x, y or z
  • etc.
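
Here's a small sketch of alternation on made-up strings. It also shows why the (?: and ) around the alternatives matter: without them, the | divides the entire pattern in two rather than just the part you meant. The variable names are only labels for this example:

```python
import re

# An invented sentence; the trailing s? lets the pattern accept plurals too.
pets = re.findall(r"\b(?:cat|dog|bird)s?\b", "cats and dogs, a bird, and a catalog")
print(pets)   # ['cats', 'dogs', 'bird']

# Without the parentheses, | splits the whole pattern in two:
# "^Re" on one side and "Fwd:" on the other.
ungrouped = bool(re.search(r"^Re|Fwd:", "Reminder: lunch"))
grouped = bool(re.search(r"^(?:Re|Fwd):", "Reminder: lunch"))
grouped_fwd = bool(re.search(r"^(?:Re|Fwd):", "Fwd: lunch"))
print(ungrouped, grouped, grouped_fwd)   # True False True
```

The ungrouped version matches "Reminder: lunch" because the line starts with Re, even though no colon follows it.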

So for example, if you wanted to count every subject line that begins with either Re: or Fwd::

In [19]:
len([line for line in subjects if re.search(r"^(?:Re|Fwd):", line)])
Out[19]:
39901

Every subject line that mentions kinds of cats:

In [20]:
[line for line in subjects if re.search(r"\b(?:[Cc]at|[Kk]itten|[Kk]itty)\b", line)]
Out[20]:
['Re: FW: cat attack',
 'Re: FW: cat attack',
 'Re: FW: cat attack',
 'Re: FW: cat attack',
 'Fw: Cat clip',
 'Fw: Cat clip',
 'FW: Cat clip',
 'Re: Amazing Kitten',
 'RE: How To Tell Which Cat Ate Your Drugs',
 'FW: How To Tell Which Cat Ate Your Drugs',
 'FW: How To Tell Which Cat Ate Your Drugs',
 "FW: Fw: A cat's tale",
 "Fwd: Fw: A cat's tale",
 'Kim lost her cat this morning',
 'Fw: cat clip............',
 'Fw: cat clip............',
 'Fw: cat clip............',
 'cat clip............',
 'Fw: cat clip............',
 'Fw: cat clip............',
 'Fw: cat clip............',
 'cat clip............',
 'Fw: cat clip............',
 'Fw: cat clip............',
 'Fw: cat clip............',
 'cat clip............',
 'Fw: cat clip............',
 'Fw: cat clip............',
 'Fw: cat clip............',
 'cat clip............',
 'Fw: cat clip............',
 'Fw: cat clip............',
 'Fw: cat clip............',
 'cat clip............',
 'kitty',
 'Diary of a Cat',
 'Diary of a Cat',
 'Diary of a Cat',
 'Diary of a Cat',
 'Diary of a Cat',
 'RE: Cat show?',
 'Cat show?',
 'RE: Cat show?',
 'RE: Cat show?',
 'RE: Cat show?',
 'Cat show?']

Capturing what matches

The re.search() function allows us to check to see whether or not a string matches a regular expression. Sometimes we want to find out not just if the string matches, but also what, exactly, in the string matched. In other words, we want to capture whatever it was that matched.

The easiest way to do this is with the re.findall() function, which takes a regular expression and a string to match it against, and returns a list of all parts of the string that the regular expression matched. Here's an example:

In [22]:
import re
re.findall(r"\b\w{5}\b", "alpha beta gamma delta epsilon zeta eta theta")
Out[22]:
['alpha', 'gamma', 'delta', 'theta']

The regular expression above, \b\w{5}\b, means "find me strings of exactly five word characters (\w) between word boundaries"---in other words, find me five-letter words. The re.findall() function returns a list of strings---not just telling us whether or not the string matched, but which parts of the string matched.
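
To make the difference between the two functions concrete, here's a sketch that runs the same pattern through both. The input string is a shortened version of the one above, and the variable names are mine:

```python
import re

text = "alpha beta gamma delta"
five_letters = r"\b\w{5}\b"

# re.search() stops at the first match and hands back a Match object.
first = re.search(five_letters, text).group()
print(first)       # alpha

# re.findall() keeps going and returns every match as a list of strings.
every = re.findall(five_letters, text)
print(every)       # ['alpha', 'gamma', 'delta']

# When nothing matches, re.findall() returns an empty list rather than None.
nothing = re.findall(r"\b\w{9}\b", text)
print(nothing)     # []
```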

For the following re.findall() examples, we'll be operating on the entire file of subject lines as a single string, instead of using a list comprehension for individual subject lines. Here's how to read in the entire file as one string, instead of as a list of strings:

In [23]:
all_subjects = open("enronsubjects.txt").read()

Having done that, let's write a regular expression that finds all domain names in the subject lines (displaying just the first thirty because the list is long):

In [24]:
re.findall(r"\b\w+\.(?:com|net|org)", all_subjects)[:30]
Out[24]:
['enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'Forbes.com',
 'Cortlandtwines.com',
 'Cortlandtwines.com',
 'Match.com',
 'Amazon.com',
 'Amazon.com',
 'Ticketmaster.com',
 'Ticketmaster.com',
 'Concierge.com',
 'Concierge.com',
 'har.com',
 'har.com',
 'HoustonChronicle.com',
 'HoustonChronicle.com',
 'har.com',
 'har.com',
 'har.com',
 'har.com',
 'har.com',
 'har.com',
 'Concierge.com',
 'Concierge.com']

Every time the string New York is found, along with the word that comes directly afterward:

In [25]:
re.findall(r"New York \b\w+\b", all_subjects)
Out[25]:
['New York Details',
 'New York Details',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York Times',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York Times',
 'New York Times',
 'New York Times',
 'New York Times',
 'New York Times',
 'New York Times',
 'New York Times',
 'New York City',
 'New York City',
 'New York City',
 'New York Power',
 'New York Power',
 'New York Power',
 'New York Power',
 'New York Power',
 'New York Power',
 'New York Power',
 'New York Power',
 'New York Mercantile',
 'New York Mercantile',
 'New York Branch',
 'New York City',
 'New York Energy',
 'New York Energy',
 'New York Energy',
 'New York Energy',
 'New York Energy',
 'New York sites',
 'New York sites',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York City',
 'New York City',
 'New York City',
 'New York City',
 'New York voice',
 'New York State',
 'New York State',
 'New York State',
 'New York State',
 'New York State',
 'New York State',
 'New York Inc',
 'New York Office',
 'New York Office',
 'New York regulatory',
 'New York regulatory',
 'New York regulatory',
 'New York regulatory',
 'New York Bar',
 'New York Bar']

And just to bring things full-circle, everything that looks like a zip code, sorted:

In [26]:
sorted(re.findall(r"\b\d{5}\b", all_subjects))[:30]
Out[26]:
['00003',
 '00003',
 '00003',
 '00003',
 '00003',
 '00003',
 '00003',
 '00003',
 '00003',
 '00010',
 '00010',
 '00458',
 '01003',
 '02177',
 '06716',
 '06736',
 '06736',
 '06752',
 '06752',
 '06752',
 '06752',
 '06752',
 '06980',
 '06980',
 '10000',
 '10000',
 '11111',
 '11111',
 '11111',
 '11111']

Full example: finding the dollar value of the Enron e-mail subject corpus

Here's an example that combines our regular expression prowess with our ability to do smaller manipulations on strings. We want to find all dollar amounts in the subject lines, and then figure out what their sum is.

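To understand what we're working with, let's start by writing a list comprehension that finds every subject line with a dollar sign ($) in it. (The dollar sign is escaped with a backslash in the pattern, since an unescaped $ means "end of the string" in a regular expression.)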

In [27]:
[line for line in subjects if re.search(r"\$", line)]
Out[27]:
['Re: APEA - $228,204 hit',
 'Re: APEA - $228,204 hit',
 'DJ Cal-ISO Pays $10M To Avoid Rolling Blackouts Wed -Sources, DJ',
 'DJ Cal-ISO Pays $10M To Avoid Rolling Blackouts Wed -Sources, DJ',
 'DJ Cal-ISO Pays $10M To Avoid Rolling Blackouts Wed -Sources, DJ',
 'DJ Cal-ISO Pays $10M To Avoid Rolling Blackouts Wed -Sources, DJ',
 'Goldman Comment re: Enron issued this morning - Revised Price Target of $68/share',
 'RE: Goldman Sachs $2.19 Natural GAs',
 'Goldman Sachs $2.19 Natural GAs',
 'RE: $25 million',
 '$25 million',
 'RE: $25 million loan from EDf',
 '$25 million loan from EDf',
 'RE: $25 million loan from EDf',
 'RE: $25 million loan from EDf',
 'RE: $25 million loan from EDf',
 '$25 million loan from EDf',
 'RE: $25 million loan from EDf',
 'RE: $25 million loan from EDf',
 'RE: $25 million loan from EDf',
 'RE: $25 million loan from EDf',
 'RE: $25 million loan from EDf',
 '$25 million loan from EDf',
 'A$M and its "second tier" status',
 'A$M and its "second tier" status',
 'A$M and its "second tier" status',
 'UT/a$m business school and engineering school comparisons',
 'Re: $',
 '$',
 'Re: $',
 '$',
 '$$$$',
 'FFL $$',
 'RE: shipper imbal $$ collected',
 'shipper imbal $$ collected',
 "Oneok's Strangers Gas Payment $820,000",
 "Oneok's Strangers Gas Payment $820,000",
 'Another $40 Million?',
 'FW: Entergy and FPL Group Agree to a $27 Billion Merger Of Equals',
 'FW: Entergy and FPL Group Agree to a $27 Billion Merger Of Equals',
 'Over $50 -- You made it happen!',
 'Over $50 -- You made it happen!',
 'FW: Co 0530 CINY 40781075  $5,356.46  FX Funding',
 'Co 0530 CINY 40781075  $5,356.46  FX Funding',
 'FW: Outstanding Young Alumni Travel Value to Amsterdam from $895',
 'Outstanding Young Alumni Travel Value to Amsterdam from $895',
 'RE: Modesto 7 MW COB deal @$19.3.',
 'RE: Modesto 7 MW COB deal @$19.3.',
 'Modesto 7 MW COB deal @$19.3.',
 'Modesto 7 MW COB deal @$19.3.',
 'RE: -$870K prior month adjustments',
 '-$870K prior month adjustments',
 'RE: -$141,000 P&L hit on 8/13/01',
 '-$141,000 P&L hit on 8/13/01',
 '$$$',
 'Re: DWR Stranded costs: $21 billion',
 'CAISO cuts refund estimate to $6.1B from $8.9B',
 "State's Power Purchases Costlier Than Projected Tab is $6 million a",
 'Fwd: Edison gets more time; Calif. may sell $14 bln bonds',
 'Edison gets more time; Calif. may sell $14 bln bonds',
 'Re: IDEA RE ISSUE OF UTILS IN CALIF WANTING $100 PRICE CAP',
 'Back to $250 Cap in California',
 'Energy Secretary Announces $350MM to Upgrade Path 15',
 'RE: $.01 surcharge as "tax"',
 'FW: $.01 surcharge as "tax"',
 'FW: $.01 surcharge as "tax"',
 '$.01 surcharge as "tax"',
 "California's $12.5 Bln Bond Sale May Be Salvaged, Official Says;",
 "RE: California's $12.5 Bln Bond Sale May Be Salvaged, Official",
 "RE: California's $12.5 Bln Bond Sale May Be Salvaged, Official Says; DWR Contract Renegotiation Is Key",
 "California's $12.5 Bln Bond Sale May Be Salvaged, Official Says; DWR Contract Renegotiation Is Key",
 'Re: Royal Bank of Canada - Wire ($2,529,352.58)',
 'Free $10 Three Team Parlay',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'FW: Economic Times article: FIs may take over Enron for $700-800m',
 'FW: Economic Times article: FIs may take over Enron for $700-800m',
 'FW: Economic Times article: FIs may take over Enron for $700-800m',
 'Red Rock Delay $$ Impact',
 'HandsFree Kits - $2',
 'HandsFree Kits - $2',
 'Re: The $10 you owe me',
 'The $10 you owe me',
 'RE: Enron files for Chapter 11 owing US$13B',
 'Enron files for Chapter 11 owing US$13B',
 'RE: $ allocation',
 '$ allocation',
 'Re: Last chance: Save $100 on a future airline ticket',
 'Re: ECS and the $500k reduction',
 'Re: ECS and the $500k reduction',
 'Re: ECS and the $500k reduction',
 'Re: ECS and the $500k reduction',
 'ECS and the $500k reduction',
 'ECS and the $500k reduction',
 'ECS and the $500k reduction',
 'ECS and the $500k reduction',
 'ECS and the $500k reduction',
 'FW: Free Shipping & $1,300 in Savings',
 'Free Shipping & $1,300 in Savings',
 'RE: Free Shipping & $1,300 in Savings',
 'RE: Free Shipping & $1,300 in Savings',
 'FW: Free Shipping & $1,300 in Savings',
 'Free Shipping & $1,300 in Savings',
 'RE: Dynegy Is Mulling $2 Billion Investment In Enron in Possible',
 'FW: Dynegy Is Mulling $2 Billion Investment In Enron in Possible \tStep Toward Merger',
 'FW: Dynegy Is Mulling $2 Billion Investment In Enron in Possible Step Toward Merger',
 'Dynegy Is Mulling $2 Billion Investment In Enron in Possible Step Toward Merger',
 'Peoples Gas --> $5,000 Invoice for Summer-Winter Exchange 6-1-00 to',
 'Peoples Gas --> $5,000 Invoice for Summer-Winter Exchange 6-1-00 to',
 'Peoples Gas --> $5,000 Invoice for Summer-Winter Exchange 6-1-00 to',
 'Peoples Gas --> $5,000 Invoice for Summer-Winter Exchange 6-1-00 to',
 'Re: short fall $971,443.11 for Wis Elect Power',
 'Re: short fall $971,443.11 for Wis Elect Power',
 'Re: short fall $971,443.11 for Wis Elect Power',
 'Re: short fall $971,443.11 for Wis Elect Power',
 'Re: short fall $971,443.11 for Wis Elect Power',
 'short fall $971,443.11 for Wis Elect Power',
 'RE: Q&A for NNG/TW Supported $1Billion Line of Credit',
 'Q&A for NNG/TW Supported $1Billion Line of Credit',
 'FW: Deals from $39 in our Las Vegas store!',
 '=09Deals from $39 in our Las Vegas store!',
 'A trip worth $10,000 could be yours',
 'A trip worth $10,000 could be yours',
 '142,000,000 Email Addresses for ONLY $149!!!!',
 "Lou's $50,000",
 "Lou's $50,000",
 "Lou's $50,000",
 'Summary of $ at Risk for Customs',
 'Summary of $ at Risk for Customs',
 'Summary of $ at Risk for Customs',
 "Calling All Investors: The New Power Company's IPO Priced at $21",
 "Calling All Investors: The New Power Company's IPO Priced at $21 P=",
 'Fenosa and Enron to Invest $550 Million in Dominican Republic',
 "Enron Brazil To Invest $455 Million In Gas Distribution '01-'04",
 'RE: $5 million for 90 days?- how quaint!',
 'FW: $5 million for 90 days?- how quaint!',
 '$5 million for 90 days?- how quaint!',
 'RE: Wind $7MM',
 'RE: Wind $7MM',
 'RE: Wind $7MM',
 'Wind $7MM',
 'RE: Wind $7MM',
 'Wind $7MM',
 'Re: Counting the Cal ISO Votes for a $100 Price Cap',
 'RE: C$ swap between EIM/ENA',
 'C$ swap between EIM/ENA',
 "Re: Where's My $20",
 "Re: Where's My $20",
 "Re: Where's My $20",
 "Re: Where's My $20",
 'Re: $100',
 'Re: $100',
 'Re: $100',
 "Re: Where's My $20",
 "Re: Where's My $20",
 'RE: Eric Schroeder has just sent you $29.75 with PayPal',
 'Fw: Eric Schroeder has just sent you $29.75 with PayPal',
 'Eric Schroeder has just sent you $29.75 with PayPal',
 'RE: Eric Schroeder has just sent you $29.75 with PayPal',
 'Fw: Eric Schroeder has just sent you $29.75 with PayPal',
 'Eric Schroeder has just sent you $29.75 with PayPal',
 'RE: What are you talking about $1600?',
 'Re: What are you talking about $1600?',
 'RE: What are you talking about $1600?',
 'RE: What are you talking about $1600?',
 '=09Re: What are you talking about $1600?',
 'What are you talking about $1600?',
 'What are you talking about $1600?',
 'FW: Enron Seeks $2 Billion Cash Infusion As It Faces an Escalating',
 'FW: Enron Seeks $2 Billion Cash Infusion As It Faces an Escalating Fiscal Crisis',
 'Enron Seeks $2 Billion Cash Infusion As It Faces an Escalating Fiscal Crisis',
 'The new, correct price is $67,776,700',
 'Re: Demar request for $2.7 mm to pay out the Skandinavian now',
 'Re: Demar request for $2.7 mm to pay out the Skandinavian now',
 'RE: Transactions exceeding $100mil',
 'Our benefits are about $50 per month higher with UBS',
 'RE: $9.6MM EOL Gas Daily Issue',
 '$9.6MM EOL Gas Daily Issue',
 'FW: NEAL - ITIN ONLY/$212.50',
 'FW: NEAL - ITIN ONLY/$212.50',
 'NEAL - ITIN ONLY/$212.50',
 'FW: NEAL - ITIN ONLY/$212.50',
 'FW: NEAL - ITIN ONLY/$212.50',
 'NEAL - ITIN ONLY/$212.50',
 'FW: Duke $',
 'Duke $',
 'RE: Duke $',
 'FW: Duke $',
 'FW: Duke $',
 'Duke $',
 '$$$$',
 '$$$$',
 'RE: Wire Detail for 10/25/01 wire for  $195,209.95',
 'FW: Wire Detail for 10/25/01 wire for  $195,209.95',
 'RE: Wire Detail for 10/25/01 wire for  $195,209.95',
 'RE: Wire Detail for 10/25/01 wire for  $195,209.95',
 'FW: Wire Detail for 10/25/01 wire for  $195,209.95',
 'RE: Wire Detail for 10/25/01 wire for  $195,209.95',
 'RE: Wire Detail for 10/25/01 wire for  $195,209.95',
 'FW: Wire Detail for 10/25/01 wire for  $195,209.95',
 'RE: Wire Detail for 10/25/01 wire for  $195,209.95',
 'RE: Wire Detail for 10/25/01 wire for  $195,209.95',
 'FW: Wire Detail for 10/25/01 wire for  $195,209.95',
 'RE: Wire Detail for 10/25/01 wire for  $195,209.95',
 'FW: DYN($42/sh)/ENE($7/sh) Merger At Risk. - Simmons and Company',
 'FW: DYN($42/sh)/ENE($7/sh) Merger At Risk. - Simmons and Company latest thoughts',
 'FW: DYN($42/sh)/ENE($7/sh) Merger At Risk. - Simmons and Company latest thoughts',
 "FW: Re-Allocaton of $'s",
 "RE: Re-Allocaton of $'s",
 "Re-Allocaton of $'s",
 "Re-Allocaton of $'s",
 'RE: Wind $7MM',
 'FW: Wind $7MM',
 'Wind $7MM',
 'RE: $9.92????????????',
 '$9.92????????????',
 'RE: Below $10',
 'Below $10',
 'FW: Comments on the Status of ENE ($16/sh).',
 'FW: Comments on the Status of ENE ($16/sh).',
 'FW: Comments on the Status of ENE ($16/sh).',
 'Breaking News : Williams Ordered to Pay $8 Million Refund to',
 'Breaking News : Williams Ordered to Pay $8 Million Refund to Cal-ISO',
 'Coho $500mm lawsuit against Hicks Muse',
 'Coho $500mm lawsuit against Hicks Muse',
 'Coho $500mm lawsuit against Hicks Muse',
 'Re: $$$$',
 '$$$$',
 'Perd $',
 'Re: $80 million',
 'Re: $80 million',
 '$80 million',
 '$80 million',
 'Re: $80 million',
 '$80 million',
 '$80 million',
 'Re: Calif Atty Gen Offers $50M Reward In Pwr Supplier',
 'Financial Disclosure of $1.2 Billion Equity Adjustment',
 'ENE: Despite Bounce It Appears Cheap; Yet $102 Target Likely a Late',
 'ENE: Despite Bounce It Appears Cheap; Yet $102 Target Likely a Late 2002 Event:',
 'Is it worth $200?',
 'RE: #@$ !!!!!!!!',
 '$#%:#@$ !!!!!!!!',
 'RE: @%[email protected]!!!',
 '[email protected]%[email protected]!!!',
 'Special Offer: Switch to ShareBuilder and Get $50!',
 'Amendment to Enron Corp. $25 Million guaranty of Enron Credit Inc.',
 'RE: Amendment to Enron Corp. $25 Million guaranty of Enron Credit',
 'Goldman Sach $ repo docs',
 'Re: Goldman Sach $ repo docs',
 'RE: Amendment to Enron Corp. $25 Million guaranty of Enron Credit',
 'FW: Goldmans $1.5m',
 'Goldmans $1.5m',
 'FW: $1.5 Check',
 '$1.5 Check',
 'RE: Goldman Sachs $',
 'Goldman Sachs $',
 'RE: TODAY ONLY - SAVE UP TO $120 EXTRA ON AIRLINE TICKETS!',
 'RE: TODAY ONLY - SAVE UP TO $120 EXTRA ON AIRLINE TICKETS!',
 'RE: $.01 surcharge as "tax"',
 'RE: $.01 surcharge as "tax"',
 'RE: $.01 surcharge as "tax"',
 'FW: $.01 surcharge as "tax"',
 'FW: $.01 surcharge as "tax"',
 '$.01 surcharge as "tax"',
 "FW: PennFuture's E-Cubed - The $45 Million Rip Off",
 "=09PennFuture's E-Cubed - The $45 Million Rip Off",
 'RE: PaPUC assessment of $147,000 to Enron',
 'Re: PaPUC assessment of $147,000 to Enron',
 'PaPUC assessment of $147,000 to Enron',
 "RE: ASAP!! EES' objections to PaPUC assessment of $147,000",
 "ASAP!! EES' objections to PaPUC assessment of $147,000",
 'RE: Pennsylvania $147,000 EES Assessment',
 '=09Pennsylvania $147,000 EES Assessment',
 'FW: CAEM Study: Gas Dereg Has Saved Consumers $600B',
 'CAEM Study: Gas Dereg Has Saved Consumers $600B',
 'PaPUC assessment of $147,000 to Enron',
 "RE: ASAP!! EES' objections to PaPUC assessment of $147,000",
 "ASAP!! EES' objections to PaPUC assessment of $147,000",
 'FW: Energy Novice to Be Paid $240,000',
 'Energy Novice to Be Paid $240,000',
 'RE:  $22.8 schedule C for BPA deal',
 '$22.8 schedule C for BPA deal',
 '$22.8 schedule C for BPA deal',
 'origination $100k to Laird Dyer',
 'Cd$ CME letter',
 'Cd$ CME letter',
 '$',
 'RE: $',
 'RE: $',
 'Re: $',
 'RE: $',
 'GET RICH ON $6.00 !!!',
 'RE: Thoughts on the world of energy (OSX $77, XNG $183, XOI 496)',
 'FW: Letter of Credit $ 5,500,000 in support of Transwestern',
 'Letter of Credit $ 5,500,000 in support of Transwestern Pipeline Red Rock Expansion',
 'Letter of Credit $ 5,500,000 in support of Transwestern Pipeline Red Rock Expansion',
 'FW: shipper imbal $$ collected',
 'shipper imbal $$ collected',
 'FW: shipper imbal $$ collected',
 'RE: shipper imbal $$ collected',
 'shipper imbal $$ collected',
 'FW: shipper imbal $$ collected',
 'RE: shipper imbal $$ collected',
 'shipper imbal $$ collected',
 "FW: $$'s allocated to TW",
 "$$'s allocated to TW",
 'RE: email to USG confirming our decision not to require more LOC $',
 'email to USG confirming our decision not to require more LOC $',
 '$',
 'Re: Calpine Confirms $4.6B, 10-Yr Calif. Power Sales',
 'RE: $2.15 bn Enron Metals Inventory Financings Closed',
 'RE: $2.15 bn Enron Metals Inventory Financings Closed',
 'FW: Thayer Aerospace Awarded $130 Million Vought Aircraft Contract',
 'FW: Thayer Aerospace Awarded $130 Million Vought Aircraft Contract',
 'Thayer Aerospace Awarded $130 Million Vought Aircraft Contract to',
 're: mid-columbia $1 mm Schedule E difference',
 '$0.25 scheduling fee.',
 'MPC $',
 'RE: $10',
 'RE: $10',
 'RE: $10',
 '$10',
 'RE: $10',
 '$10',
 'FW: *** Gold/TSE GL/$US/CPI/TSE MM/CRB Bloomberg charts ***',
 'FW: *** Gold/TSE GL/$US/CPI/TSE MM/CRB Bloomberg charts ***',
 'FW: *** Gold/TSE GL/$US/CPI/TSE MM/CRB Bloomberg charts ***',
 'FW: Summer Fare Sale From $128 Return!',
 'Summer Fare Sale From $128 Return!']

Based on this data, we can guess at the steps we'd need to take in order to figure out these values. We're going to ignore anything that doesn't have "k", "million" or "billion" after it as chump change. So what we need to find is: a dollar sign, followed by a series of one or more digits or periods, followed potentially by a space (but sometimes not), followed by a "k", "m" or "b" (which will sometimes start the word "million" or "billion" but sometimes not... so we won't bother looking).

Here's how I would translate that into a regular expression:

\$[0-9.]+ ?(?:[Kk]|[Mm]|[Bb])
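
If it helps to see the pieces separately, here is a small sketch of the same pattern written with the re.VERBOSE flag, which lets you spread a regular expression over several lines and comment each part. The name dollar_pattern is made up for this illustration, and the sample strings are a handful of subject lines copied from the list above.

```python
import re

# The same pattern as above, spelled out piece by piece.
# In VERBOSE mode bare spaces are ignored, so the optional space is written as [ ]?
dollar_pattern = re.compile(r"""
    \$                    # a literal dollar sign
    [0-9.]+               # one or more digits or periods
    [ ]?                  # an optional space
    (?:[Kk]|[Mm]|[Bb])    # a k, m or b, in either case
""", re.VERBOSE)

samples = [
    "RE: $25 million loan from EDf",
    "CAISO cuts refund estimate to $6.1B from $8.9B",
    "Re: ECS and the $500k reduction",
    "Over $50 -- You made it happen!",
]
results = [dollar_pattern.findall(line) for line in samples]
for line, found in zip(samples, results):
    print(found, "<-", line)
```

The last sample produces an empty list: "$50" has no k, m or b after it, so it counts as chump change.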

We can use re.findall() to capture all instances where we found this regular expression in the text. Here's what that would look like:

In [28]:
re.findall(r"\$[0-9.]+ ?(?:[Kk]|[Mm]|[Bb])", all_subjects)
Out[28]:
['$10M',
 '$10M',
 '$10M',
 '$10M',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$25 m',
 '$40 M',
 '$27 B',
 '$27 B',
 '$870K',
 '$870K',
 '$21 b',
 '$6.1B',
 '$8.9B',
 '$6 m',
 '$14 b',
 '$14 b',
 '$350M',
 '$12.5 B',
 '$12.5 B',
 '$12.5 B',
 '$12.5 B',
 '$1.2M',
 '$1.2M',
 '$1.2M',
 '$1.2M',
 '$1.2M',
 '$1.2M',
 '$1.2M',
 '$13B',
 '$13B',
 '$500k',
 '$500k',
 '$500k',
 '$500k',
 '$500k',
 '$500k',
 '$500k',
 '$500k',
 '$500k',
 '$2 B',
 '$2 B',
 '$2 B',
 '$2 B',
 '$1B',
 '$1B',
 '$550 M',
 '$455 M',
 '$5 m',
 '$5 m',
 '$5 m',
 '$7M',
 '$7M',
 '$7M',
 '$7M',
 '$7M',
 '$7M',
 '$2 B',
 '$2 B',
 '$2 B',
 '$2.7 m',
 '$2.7 m',
 '$100m',
 '$9.6M',
 '$9.6M',
 '$7M',
 '$7M',
 '$7M',
 '$8 M',
 '$8 M',
 '$500m',
 '$500m',
 '$500m',
 '$80 m',
 '$80 m',
 '$80 m',
 '$80 m',
 '$80 m',
 '$80 m',
 '$80 m',
 '$50M',
 '$1.2 B',
 '$25 M',
 '$25 M',
 '$25 M',
 '$1.5m',
 '$1.5m',
 '$45 M',
 '$45 M',
 '$600B',
 '$600B',
 '$100k',
 '$4.6B',
 '$2.15 b',
 '$2.15 b',
 '$130 M',
 '$130 M',
 '$130 M',
 '$1 m']

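If we want to actually make a sum, though, we're going to need to do a little massaging. (Note that the cell below searches with \d+ instead of [0-9.]+, so amounts written with a decimal point, like $6.1B, are left out of the total.)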

In [29]:
total_value = 0
dollar_amounts = re.findall(r"\$\d+ ?(?:[Kk]|[Mm]|[Bb])", all_subjects)
for amount in dollar_amounts:
    # the last character will be 'k', 'm', or 'b'; "normalize" by making lowercase.
    multiplier = amount[-1].lower()
    # trim off the beginning $ and ending multiplier value
    amount = amount[1:-1]
    # remove any remaining whitespace
    amount = amount.strip()
    # convert to a floating-point number
    float_amount = float(amount)
    # multiply by an amount, based on what the last character was
    if multiplier == 'k':
        float_amount = float_amount * 1000
    elif multiplier == 'm':
        float_amount = float_amount * 1000000
    elif multiplier == 'b':
        float_amount = float_amount * 1000000000
    # add to total value
    total_value = total_value + float_amount
total_value
Out[29]:
1349657340000.0

That's over one trillion dollars! Nice work, guys.
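
As a rough sketch of how decimal amounts such as $6.1B (which the \d+ pattern in the cell above does not match) could be counted too, here is the same idea using the [0-9.]+ pattern from earlier. It runs on a few sample strings rather than the real corpus, so it makes no claim about what the corpus total would be. The MULTIPLIERS dictionary and the to_dollars() function are names made up for this illustration.

```python
import re

# hypothetical lookup table: multiplier letter -> numeric factor
MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}

def to_dollars(amount):
    """Convert a matched string such as '$6.1B' or '$25 m' to a float."""
    factor = MULTIPLIERS[amount[-1].lower()]   # last character is k, m or b
    number = amount[1:-1].strip()              # drop the $ and the letter
    return float(number) * factor

sample_subjects = (
    "CAISO cuts refund estimate to $6.1B from $8.9B\n"
    "ECS and the $500k reduction\n"
    "$25 million loan from EDf"
)
matches = re.findall(r"\$[0-9.]+ ?(?:[Kk]|[Mm]|[Bb])", sample_subjects)
sample_total = sum(to_dollars(m) for m in matches)
print(matches)
print(sample_total)
```

One caveat: [0-9.]+ will also happily match something like "1.2.3", which float() can't convert, so real data might need an extra check.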

Finer-grained matches with grouping

We used re.search() above to check whether or not a string matches a particular regular expression, in a context like this:

In [30]:
import re
dickens = [
    "it was the best of times",
    "it was the worst of times"]
[line for line in dickens if re.search(r"best", line)]
Out[30]:
['it was the best of times']

But re.search() doesn't actually return True or False. If the search succeeds, the function returns something called a "match object." Let's assign the result of re.search() to a variable and see what we can do with it.

In [31]:
source_string = "this example has been used 423 times"
match = re.search(r"\d\d\d", source_string)
type(match)
Out[31]:
_sre.SRE_Match

It's a value of type _sre.SRE_Match (recent versions of Python display this type as re.Match instead). This value has several methods that we can use to access helpful and interesting information about the way the regular expression matched the string. You can read more about the methods of the match object in the documentation for Python's re module.

For example, we can see both where the match started in the string and where it ended, using the .start() and .end() methods. These methods return the indexes in the string where the regular expression matched.

In [32]:
match.start()
Out[32]:
27
In [33]:
match.end()
Out[33]:
30

Together, we can use these methods to grab exactly the part of the string that matched the regular expression, by using the start/end values to get a slice:

In [34]:
source_string[match.start():match.end()]
Out[34]:
'423'

Because it's so common, there's a shortcut for this operation, which is the match object's .group() method:

In [35]:
match.group()
Out[35]:
'423'

The .group() method of a match object, in other words, returns exactly the part of the string that matched the regular expression.

As an example of how to use the match object and its .group() method in context, let's revisit the example from above which found every subject line in the Enron corpus that had fifteen or more consecutive capital letters. In that example, we could only display the entire subject line. If we wanted to show just the part of the string that matched (i.e., the sequence of fifteen or more capital letters), we could use .group():

In [37]:
for line in subjects:
    match = re.search(r"[A-Z]{15,}", line)
    if match:
        print(match.group())
CONGRATULATIONS
CONGRATULATIONS
PLEEEEEEEEEEEEEEEASE
ACCOMPLISHMENTS
ACCOMPLISHMENTS
CONFIDENTIALITY
CONFIDENTIALITY
CONGRATULATIONS
CONGRATULATIONS
ACKNOWLEDGEMENT
ACKNOWLEDGEMENT
CONGRATULATIONS
CONGRATULATIONS
CONGRATULATIONS
CONGRATULATIONS
CONGRATULATIONS
CONGRATULATIONS
CONGRATULATIONS
CONGRATULATIONS
CONGRATULATIONS
CONGRATULATIONS
INTERCONNECTION
INTERCONNECTION
INTERCONNECTION
INTERCONNECTION
INTERCONNECTION
CONGRATULATIONS
WASSSAAAAAAAAAAAAAABI
WASSSAAAAAAAAAAAAAABI
WASSSAAAAAAAAAAAAAABI
WASSSAAAAAAAAAAAAAABI
WASSSAAAAAAAAAAAAAABI
WASSSAAAAAAAAAAAAAABI
WASSSAAAAAAAAAAAAAABI
NOOOOOOOOOOOOOOOO
NOOOOOOOOOOOOOOOO
NOOOOOOOOOOOOOOOO
CONGRATULATIONS
CONGRATULATIONS
CONGRATULATIONS
CONGRATULATIONS
CONFIDENTIALITY
CONFIDENTIALITY
ACCOMPLISHMENTS
ACCOMPLISHMENTS
CONGRATULATIONS
STANDARDIZATION
STANDARDIZATION
STANDARDIZATION
STANDARDIZATION
BRRRRRRRRRRRRRRRRRRRRR
CONGRATULATIONS
CONGRATULATIONS
NETCOTRANSMISSION
NETCOTRANSMISSION
NETCOTRANSMISSION
INTERCONTINENTAL
INTERCONTINENTAL

An important thing to remember about re.search() is that it returns None if there is no match. For this reason, you always need to check to make sure the object is not None before you attempt to call the value's .group() method. This is the reason that it's difficult to write the above example as a list comprehension: you need to check the result of re.search() before you can use it. An attempt to do something like this, for example, will fail:

In [38]:
[re.search(r"[A-Z]{15,}", line).group() for line in subjects]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-38-666f9b5fe0ae> in <module>()
----> 1 [re.search(r"[A-Z]{15,}", line).group() for line in subjects]

<ipython-input-38-666f9b5fe0ae> in <listcomp>(.0)
----> 1 [re.search(r"[A-Z]{15,}", line).group() for line in subjects]

AttributeError: 'NoneType' object has no attribute 'group'

Python complains that NoneType has no group() method. This happens because the result of re.search() is None whenever a subject line doesn't match.

We could, of course, write a little function to get around this limitation:

In [39]:
# make a function
def filter_and_group(source, regex):
    return [re.search(regex, item).group() for item in source if re.search(regex, item)]

# now call it
filter_and_group(subjects, r"[A-Z]{15,}")
Out[39]:
['CONGRATULATIONS',
 'CONGRATULATIONS',
 'PLEEEEEEEEEEEEEEEASE',
 'ACCOMPLISHMENTS',
 'ACCOMPLISHMENTS',
 'CONFIDENTIALITY',
 'CONFIDENTIALITY',
 'CONGRATULATIONS',
 'CONGRATULATIONS',
 'ACKNOWLEDGEMENT',
 'ACKNOWLEDGEMENT',
 'CONGRATULATIONS',
 'CONGRATULATIONS',
 'CONGRATULATIONS',
 'CONGRATULATIONS',
 'CONGRATULATIONS',
 'CONGRATULATIONS',
 'CONGRATULATIONS',
 'CONGRATULATIONS',
 'CONGRATULATIONS',
 'CONGRATULATIONS',
 'INTERCONNECTION',
 'INTERCONNECTION',
 'INTERCONNECTION',
 'INTERCONNECTION',
 'INTERCONNECTION',
 'CONGRATULATIONS',
 'WASSSAAAAAAAAAAAAAABI',
 'WASSSAAAAAAAAAAAAAABI',
 'WASSSAAAAAAAAAAAAAABI',
 'WASSSAAAAAAAAAAAAAABI',
 'WASSSAAAAAAAAAAAAAABI',
 'WASSSAAAAAAAAAAAAAABI',
 'WASSSAAAAAAAAAAAAAABI',
 'NOOOOOOOOOOOOOOOO',
 'NOOOOOOOOOOOOOOOO',
 'NOOOOOOOOOOOOOOOO',
 'CONGRATULATIONS',
 'CONGRATULATIONS',
 'CONGRATULATIONS',
 'CONGRATULATIONS',
 'CONFIDENTIALITY',
 'CONFIDENTIALITY',
 'ACCOMPLISHMENTS',
 'ACCOMPLISHMENTS',
 'CONGRATULATIONS',
 'STANDARDIZATION',
 'STANDARDIZATION',
 'STANDARDIZATION',
 'STANDARDIZATION',
 'BRRRRRRRRRRRRRRRRRRRRR',
 'CONGRATULATIONS',
 'CONGRATULATIONS',
 'NETCOTRANSMISSION',
 'NETCOTRANSMISSION',
 'NETCOTRANSMISSION',
 'INTERCONTINENTAL',
 'INTERCONTINENTAL']
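
One wrinkle in filter_and_group() is that it runs re.search() twice on every item: once to test for a match and once to get the group. Here's a sketch of a variant that searches only once per item. It assumes Python 3.8 or later, because it relies on the := assignment expression; the name filter_and_group_once and the sample list are made up for illustration.

```python
import re

def filter_and_group_once(source, regex):
    # hypothetical variant: keep the match object in m so we only search once
    return [m.group() for item in source if (m := re.search(regex, item))]

sample = [
    "CONGRATULATIONS to all",
    "no shouting here",
    "RE: STANDARDIZATION update",
]
result = filter_and_group_once(sample, r"[A-Z]{15,}")
print(result)  # ['CONGRATULATIONS', 'STANDARDIZATION']
```

On older versions of Python, an ordinary for loop with an if match: check (as in the earlier cell) does the same job.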

Multiple groups in one regular expression

So re.search() lets us get the parts of a string that match a regular expression, using the .group() method of the match object it returns. You can get even finer-grained matches using a feature of regular expressions called grouping.

Let's start with a toy example. Say you have a list of university courses in the following format:

In [40]:
courses = [
    "CSCI 105: Introductory Programming for Cat-Lovers",
    "LING 214: Pronouncing Things Backwards",
    "ANTHRO 342: Theory and Practice of Cheesemongery (Graduate Seminar)",
    "CSCI 205: Advanced Programming for Cat-Lovers",
    "ENGL 112: Speculative Travel Writing"
]

Let's say you want to extract the following items from this data:

  • A unique list of all departments (e.g., CSCI, LING, ANTHRO, etc.)
  • A list of all course names
  • A dictionary with all of the 100-level classes, 200-level classes, and 300-level classes

Somehow we need to get three items from each line of data: the department, the number, and the course name. You can do this easily with regular expressions using grouping. To use grouping, put parentheses (()) around the portions of the regular expression that are of interest to you. You can then use the match object's .groups() method (note the s!) to get, individually, the portions of the string that matched each parenthesized part of the regular expression. Here's what it looks like, just operating on the first item of the list:

In [41]:
first_course = courses[0]
match = re.search(r"(\w+) (\d+): (.+)$", first_course)
match.groups()
Out[41]:
('CSCI', '105', 'Introductory Programming for Cat-Lovers')

The regular expression in re.search() above roughly translates as the following:

  • Find me a sequence of one or more alphanumeric characters. Save this sequence as the first group.
  • Find a space.
  • Find me a sequence of one or more digits. Save this as the second group.
  • Find a colon followed by a space.
  • Find me one or more characters (I don't care which characters) and save the sequence as the third group.
  • Match the end of the line.

Calling the .groups() method returns a tuple containing each of the saved items from the grouping. You can use it like so:

In [43]:
groups = match.groups()
print("Department:", groups[0]) # department
print("Course number:", groups[1]) # course number
print("Course name:", groups[2]) # course name
Department: CSCI
Course number: 105
Course name: Introductory Programming for Cat-Lovers
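
Two small conveniences are worth sketching here. Since .groups() returns a tuple, you can unpack it straight into variables, and the match object's .group() method also accepts a group number if you only want one piece. The variable names below are made up for illustration.

```python
import re

course = "LING 214: Pronouncing Things Backwards"
match = re.search(r"(\w+) (\d+): (.+)$", course)
if match:
    # unpack the three groups in one step
    department, number, name = match.groups()
    print(department, number, name, sep=" | ")
    # .group(1) is the first parenthesized group, same as .groups()[0]
    print(match.group(1))
    # .group(0) is the entire match, same as .group() with no argument
    print(match.group(0))
```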

Now let's iterate over the entire list of courses and put them in the data structure as appropriate:

In [44]:
departments = set()
course_names = []
course_levels = {}
for item in courses:
    # search and create match object
    match = re.search(r"(\w+) (\d+): (.+)$", item)
    if match: # if there's a match...
        groups = match.groups() # get the groups: 0 is department, 1 is course number, 2 is name
        departments.add(groups[0]) # add to department set (we wanted a list of *unique* departments)
        course_names.append(groups[2]) # add to list of courses
        level = (int(groups[1]) // 100) * 100 # get the course "level" (100, 200, 300) using integer division
        # add the level/course key-value pair to course_levels
        if level not in course_levels:
            course_levels[level] = []
        course_levels[level].append(groups[2])

After you run this cell, you can check out the unique list of departments:

In [45]:
departments
Out[45]:
{'ANTHRO', 'CSCI', 'ENGL', 'LING'}

... the list of course names:

In [46]:
course_names
Out[46]:
['Introductory Programming for Cat-Lovers',
 'Pronouncing Things Backwards',
 'Theory and Practice of Cheesemongery (Graduate Seminar)',
 'Advanced Programming for Cat-Lovers',
 'Speculative Travel Writing']

... and the dictionary that maps course "levels" to a list of courses at that level:

In [47]:
course_levels
Out[47]:
{100: ['Introductory Programming for Cat-Lovers',
  'Speculative Travel Writing'],
 200: ['Pronouncing Things Backwards', 'Advanced Programming for Cat-Lovers'],
 300: ['Theory and Practice of Cheesemongery (Graduate Seminar)']}

Grouping with multiple matches in the same string

A problem with re.search() is that it only returns the first match in a string. What if we want to find all of the matches? It turns out that re.findall() also supports the regular expression grouping syntax. If the regular expression you pass to re.findall() includes any grouping parentheses, then the function returns not a list of strings, but a list of tuples, where each tuple has elements corresponding in order to the groups in the regular expression.

As a quick example, here's a test string with number names and digits, and a regular expression to extract all instances of a series of alphanumeric characters, followed by a space, followed by a single digit:

In [48]:
test = "one 1 two 2 three 3 four 4 five 5"
re.findall(r"(\w+) (\d)", test)
Out[48]:
[('one', '1'), ('two', '2'), ('three', '3'), ('four', '4'), ('five', '5')]
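
The shape of what re.findall() returns depends on how many groups are in the pattern, which can be surprising. Here's a quick sketch comparing zero, one and two groups on a shortened version of the same test string (the variable names are just for illustration):

```python
import re

test = "one 1 two 2 three 3"

# no groups: a list of the full matching strings
no_groups = re.findall(r"\w+ \d", test)
print(no_groups)   # ['one 1', 'two 2', 'three 3']

# one group: a list of strings, but only the grouped part of each match
one_group = re.findall(r"\w+ (\d)", test)
print(one_group)   # ['1', '2', '3']

# two groups: a list of tuples, one element per group
two_groups = re.findall(r"(\w+) (\d)", test)
print(two_groups)  # [('one', '1'), ('two', '2'), ('three', '3')]

# a list of two-element tuples converts neatly into a dictionary
name_to_digit = dict(two_groups)
print(name_to_digit["two"])  # 2
```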

We can use this to extract every phone number from the Enron subjects corpus, separating out the components of the numbers by group:

In [49]:
re.findall(r"(\d\d\d)-(\d\d\d)-(\d\d\d\d)", all_subjects)
Out[49]:
[('713', '853', '4743'),
 ('713', '222', '7667'),
 ('713', '222', '7667'),
 ('713', '222', '7667'),
 ('713', '222', '7667'),
 ('713', '222', '7667'),
 ('713', '222', '7667'),
 ('713', '222', '7667'),
 ('713', '222', '7667'),
 ('713', '222', '7667'),
 ('713', '222', '7667'),
 ('281', '296', '0573'),
 ('713', '851', '2499'),
 ('713', '345', '7896'),
 ('713', '345', '7896'),
 ('713', '345', '7896'),
 ('713', '345', '7896'),
 ('713', '345', '7896'),
 ('281', '367', '8953'),
 ('713', '528', '0759'),
 ('713', '850', '9002'),
 ('713', '703', '8294'),
 ('614', '888', '9588'),
 ('713', '767', '8686'),
 ('303', '571', '6135'),
 ('281', '537', '9334'),
 ('800', '937', '6563'),
 ('800', '937', '6563'),
 ('888', '296', '1938')]

And then we can do a quick little data analysis on the frequency of area codes in these numbers, using the Counter object from the collections module:

In [50]:
from collections import Counter
area_codes = [item[0] for item in re.findall(r"(\d\d\d)-(\d\d\d)-(\d\d\d\d)", all_subjects)]
count = Counter(area_codes)
count.most_common(1)
Out[50]:
[('713', 21)]

Multiple match objects with re.finditer()

The re library also has a re.finditer() function, which returns not a list of matching strings or tuples (like re.findall()), but an iterator of match objects. This is useful if you need to know not just which text matched, but where in the text the match occurs. So, for example, to find the positions in the all_subjects corpus where the word "Oregon" occurs, whether or not the first letter is capitalized:

In [51]:
[(match.start(), match.end(), match.group()) for match in re.finditer(r"[Oo]regon", all_subjects)]
Out[51]:
[(410338, 410344, 'Oregon'),
 (410353, 410359, 'Oregon'),
 (608654, 608660, 'Oregon'),
 (831605, 831611, 'Oregon'),
 (3059955, 3059961, 'Oregon'),
 (3640267, 3640273, 'Oregon'),
 (3640292, 3640298, 'Oregon'),
 (3640317, 3640323, 'Oregon'),
 (3640610, 3640616, 'Oregon'),
 (3640635, 3640641, 'Oregon'),
 (3640660, 3640666, 'Oregon'),
 (4385798, 4385804, 'oregon')]
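
The pattern [Oo]regon only allows for a difference in the first letter. If you wanted to ignore capitalization entirely, one option is to pass the re.IGNORECASE flag as a third argument. Here's a small sketch on a made-up sample string (not the Enron corpus), which also uses the positions to slice the matched text back out of the original:

```python
import re

# made-up sample text for illustration
text = "Portland, Oregon\nmoving to oregon soon\nOREGON trail"

positions = []
for match in re.finditer(r"oregon", text, re.IGNORECASE):
    start, end = match.start(), match.end()
    positions.append((start, end, match.group()))
    # slicing with start/end gives back exactly the matched text
    print(start, end, text[start:end])
```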

Conclusion

Regular expressions are a great way to take some raw text and find the parts that are of interest to you. Python's string methods and string slicing syntax are a great way to massage and clean up data. You know them both now, which makes you powerful. But as powerful as you are, you have only scratched the surface of your potential! We only scratched the surface of what's possible with regular expressions. Here's some further reading: