Intermediate Regular Expressions (Regex)

Agenda

  1. Greedy or lazy quantifiers
  2. Alternatives
  3. Substitution
  4. Anchors
  5. Option flags
  6. Lookarounds
  7. Assorted functionality
In [1]:
import re

Part 1 : Greedy or lazy quantifiers

Quantifiers modify the required quantity of a character or a pattern.

Quantifier    What it matches
a+            1 or more occurrences of 'a' (the pattern directly to its left)
a*            0 or more occurrences of 'a'
a?            0 or 1 occurrence of 'a'
In [2]:
s = 'sid is missing class'
In [3]:
re.search(r'miss\w+', s).group()
Out[3]:
'missing'
In [4]:
re.search(r'is\w*', s).group()
Out[4]:
'is'
In [5]:
re.search(r'is\w+', s).group()
Out[5]:
'issing'

+ and * are "greedy", meaning that they try to use up as much of the string as possible:

In [6]:
s = 'Some text <h1>my heading</h1> More text'
In [7]:
re.search(r'<.+>', s).group()
Out[7]:
'<h1>my heading</h1>'

Add a ? after + or * to make them "lazy", meaning that they try to use up as little of the string as possible:

In [8]:
re.search(r'<.+?>', s).group()
Out[8]:
'<h1>'

Lazy quantifiers are sometimes also called "ungreedy" or "reluctant". You make a quantifier lazy by putting a question mark after it : this works for the plus, the star, the curly braces, and the question mark itself. So our example becomes <.+?>. Let's have another look inside the regex engine.

Again, < matches the first < in the string. The next token is the dot, this time repeated by a lazy plus. This tells the regex engine to repeat the dot as few times as possible. The minimum is one, so the engine matches the dot with h. The requirement has been met, and the engine continues with >, which fails to match 1. The engine backtracks, and backtracking forces the lazy plus to expand rather than reduce its reach. So the match of .+ is expanded to h1, and the engine tries again to continue with >. Now, > is matched successfully. The last token in the regex has been matched, and the engine reports that <h1> has been successfully matched. That's more like it.
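
To see the two behaviors side by side, here is a quick sketch (an addition, not from the original notebook) using re.findall(), which returns every non-overlapping match :

s = 'Some text <h1>my heading</h1> More text'

# greedy : one long match running from the first < to the last >
print(re.findall(r'<.+>', s))    # ['<h1>my heading</h1>']

# lazy : each match stops at the first > it can reach
print(re.findall(r'<.+?>', s))   # ['<h1>', '</h1>']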

Part 2 : Alternatives

Alternatives define multiple possible patterns that can be used to produce a match. They are separated by a pipe and put in parentheses :

In [9]:
s = 'I live at 100 First St, which is around the corner.'
In [10]:
re.search(r'\d+ .+ (Ave|St|Rd)', s).group()
Out[10]:
'100 First St'
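
One caveat worth knowing (an addition for completeness) : the parentheses around the alternatives also create a capture group, which changes what re.findall() returns. A non-capturing group, written (?:...), keeps the alternation without capturing :

s = 'I live at 100 First St, which is around the corner.'

# with a capture group, findall returns only the captured text
re.findall(r'\d+ .+ (Ave|St|Rd)', s)     # ['St']

# with a non-capturing group, findall returns the entire match
re.findall(r'\d+ .+ (?:Ave|St|Rd)', s)   # ['100 First St']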

Part 3 : Substitution

re.sub() finds all matches in a given string and replaces them with a specified string :

In [11]:
s = 'my twitter is @jimmy, my emails are jim@hotmail.com and jimmy@gmail.com'
In [12]:
re.sub(r'jim', r'JIM', s)
Out[12]:
'my twitter is @JIMmy, my emails are JIM@hotmail.com and JIMmy@gmail.com'
In [13]:
re.sub(r' @\w+', r' @johnny', s)
Out[13]:
'my twitter is @johnny, my emails are jim@hotmail.com and jimmy@gmail.com'

The replacement string can refer to text from match groups :

  • \1 refers to group(1)
  • \2 refers to group(2)
  • etc.
In [14]:
re.sub(r'(\w+)@[\w.]+', r'\1@gmail.com', s)
Out[14]:
'my twitter is @jimmy, my emails are jim@gmail.com and jimmy@gmail.com'
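
As an aside (not shown in the original notebook), the replacement argument of re.sub() can also be a function : it receives each match object and returns the replacement text. A minimal sketch :

# double every number found in the string
re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'scores: 3 and 12')
# returns 'scores: 6 and 24'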

Part 4 : Anchors

Anchors define where in a string the regular expression pattern must occur.

Anchor    What it requires
^abc      this pattern must appear at the start of a string
abc$      this pattern must appear at the end of a string
In [15]:
s = 'sid is missing class'
In [16]:
re.search(r'\w+', s).group()
Out[16]:
'sid'
In [17]:
re.search(r'\w+$', s).group()
Out[17]:
'class'

The following cell would cause an error : is does not appear at the start of the string (which is what ^ requires), so re.search() returns None, and calling group() on None raises an AttributeError.

In [18]:
# this will cause an error  
# re.search(r'^is', s).group()
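
A safer idiom (general Python practice, not specific to this notebook) is to check whether re.search() found anything before calling group() :

match = re.search(r'^is', s)
if match:
    print(match.group())
else:
    print('no match')   # this branch runs : s starts with 'sid', not 'is'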

Part 5 : Option flags

Option flags change the default behavior of the pattern matching.

Default behavior                                                                Option flag      Behavior when using flag
matching is case-sensitive                                                      re.IGNORECASE    matching is case-insensitive
. matches any character except a newline                                        re.DOTALL        . matches any character, including a newline
within a multi-line string, ^ and $ match start and end of the entire string    re.MULTILINE     ^ and $ match start and end of each line
spaces and # are treated as literal characters                                  re.VERBOSE       spaces and # are ignored (except in a character class or when preceded by \), and text after # is ignored
In [19]:
s = 'LINE one\nLINE two'
In [20]:
print(s)
LINE one
LINE two

re.IGNORECASE example

In [21]:
# case-sensitive
re.search(r'..n.', s).group()
Out[21]:
' one'
In [22]:
# case-insensitive
re.search(r'..n.', s, flags = re.IGNORECASE).group()
Out[22]:
'LINE'

re.DOTALL example

In [23]:
# . does not match a newline
re.search(r'n.+', s).group()
Out[23]:
'ne'
In [24]:
# . matches a newline
re.search(r'n.+', s, flags = re.DOTALL).group()
Out[24]:
'ne\nLINE two'

Combine option flags example

In [25]:
# combine option flags
re.search(r'n.+', s, flags = re.IGNORECASE | re.DOTALL).group()
Out[25]:
'NE one\nLINE two'

re.MULTILINE example

In [26]:
# $ matches end of entire string
re.search(r'..o\w*$', s).group()
Out[26]:
'two'
In [27]:
# $ matches end of each line
re.search(r'..o\w*$', s, flags = re.MULTILINE).group()
Out[27]:
'E one'

re.VERBOSE examples

Example 1 :

In [28]:
# spaces are literal characters
re.search(r' \w+', s).group()
Out[28]:
' one'
In [29]:
# spaces are ignored
re.search(r' \w+', s, flags=re.VERBOSE).group()
Out[29]:
'LINE'
In [30]:
# use multi-line patterns and add comments in verbose mode
re.search(r'''
\     # single space
\w+   # one or more word characters
''', s, flags=re.VERBOSE).group()
Out[30]:
' one'

Example 2 : Mathematics Stack Exchange reputation (revisited)

In [31]:
# read the file into a single string
with open('../data/reputation.txt') as f:
    data = f.read()
In [32]:
print(data[0:300])
total votes: 723
 2   1423294 (5)
 3   1423294 (-2)
-- 2015-09-05 rep +3    = 4
 2   1423843 (5)
 2   1423843 (5)
 1   1423857 (2)
-- 2015-09-06 rep +12   = 16
 1   1480479 (2)
-- 2015-10-14 rep +2    = 18
-- 2015-10-15 rep 0     = 18
-- 2015-10-18 rep 0     = 18
 2   1488132 (5)
 1   1488167 (2)
--
In [33]:
len(data)
Out[33]:
12796

For the purpose of this demonstration/explanation, we don't need to work with the entire string. We'll make it smaller by a factor of roughly 10 and overwrite the initial string :

In [34]:
# make data smaller
data = data[0:1000]
In [35]:
print(re.findall(r'-- (\d{4}-\d{2}-\d{2}) rep ([+-]?\d+) += (\d+)', data))
[('2015-09-05', '+3', '4'), ('2015-09-06', '+12', '16'), ('2015-10-14', '+2', '18'), ('2015-10-15', '0', '18'), ('2015-10-18', '0', '18'), ('2015-10-19', '+7', '25'), ('2015-10-21', '0', '25'), ('2015-10-24', '+12', '37'), ('2015-10-25', '+13', '50'), ('2015-10-26', '+7', '57'), ('2015-10-27', '0', '57'), ('2015-10-28', '+7', '64'), ('2015-10-29', '+5', '69'), ('2015-10-31', '+5', '74'), ('2015-11-04', '0', '74'), ('2015-11-08', '0', '74'), ('2015-11-09', '0', '74'), ('2015-11-27', '+2', '76')]

Here is our motivation for what we are about to do. When we look at the previous regular expression, we might think to ourselves : "Gosh, that is hard to read !". And even if we can read it now, three days from now we will have forgotten why it works and what it means. We are going to solve that problem step by step :

In [36]:
print(re.findall(r'-- (\d{4}-\d{2}-\d{2}) rep ([+-]?\d+) += (\d+)', data, flags = re.VERBOSE))
[]

We only added the re.VERBOSE flag to the original regular expression, and re.findall() returned an empty list. Why is that ? What does re.VERBOSE do ? Well, for one thing, re.VERBOSE ignores spaces. Now, we are probably thinking : "Well, that's silly ! Why would I do that ? I need those spaces.". We'll come back to what's going on shortly.

Let's fix this. To fix it, we need to escape our spaces. There are 5 spaces in our pattern, and we need to be careful when we do this because it's easy to mess up. We put a backslash before each space.

In [37]:
print(re.findall(r'--\ (\d{4}-\d{2}-\d{2})\ rep\ ([+-]?\d+)\ +=\ (\d+)', data, flags = re.VERBOSE))
[('2015-09-05', '+3', '4'), ('2015-09-06', '+12', '16'), ('2015-10-14', '+2', '18'), ('2015-10-15', '0', '18'), ('2015-10-18', '0', '18'), ('2015-10-19', '+7', '25'), ('2015-10-21', '0', '25'), ('2015-10-24', '+12', '37'), ('2015-10-25', '+13', '50'), ('2015-10-26', '+7', '57'), ('2015-10-27', '0', '57'), ('2015-10-28', '+7', '64'), ('2015-10-29', '+5', '69'), ('2015-10-31', '+5', '74'), ('2015-11-04', '0', '74'), ('2015-11-08', '0', '74'), ('2015-11-09', '0', '74'), ('2015-11-27', '+2', '76')]

We got back to our initial output. Great... At this point, we are probably thinking : "Well, that was a silly exercise... We've just made it harder to read !". For the moment, we might be right, but re.VERBOSE has some useful properties.

For one, since spaces are ignored, we can add some extra (unescaped) spaces inside our pattern and it won't affect the output :

In [38]:
print(re.findall(r'--\     (\d{4}-\d{2}-\d{2})\     rep\ ([+-]?\d+)\ +=\ (\d+)', data, flags = re.VERBOSE))
[('2015-09-05', '+3', '4'), ('2015-09-06', '+12', '16'), ('2015-10-14', '+2', '18'), ('2015-10-15', '0', '18'), ('2015-10-18', '0', '18'), ('2015-10-19', '+7', '25'), ('2015-10-21', '0', '25'), ('2015-10-24', '+12', '37'), ('2015-10-25', '+13', '50'), ('2015-10-26', '+7', '57'), ('2015-10-27', '0', '57'), ('2015-10-28', '+7', '64'), ('2015-10-29', '+5', '69'), ('2015-10-31', '+5', '74'), ('2015-11-04', '0', '74'), ('2015-11-08', '0', '74'), ('2015-11-09', '0', '74'), ('2015-11-27', '+2', '76')]

Now, on its own, that is not particularly useful. Here's where we get to the good stuff ! We can also use multi-line strings. Multi-line strings start and end with three quotation marks.

In [39]:
print(re.findall(r'''
--\ (\d{4}-\d{2}-\d{2})\ rep\ ([+-]?\d+)\ +=\ (\d+)
''', data, flags = re.VERBOSE))
[('2015-09-05', '+3', '4'), ('2015-09-06', '+12', '16'), ('2015-10-14', '+2', '18'), ('2015-10-15', '0', '18'), ('2015-10-18', '0', '18'), ('2015-10-19', '+7', '25'), ('2015-10-21', '0', '25'), ('2015-10-24', '+12', '37'), ('2015-10-25', '+13', '50'), ('2015-10-26', '+7', '57'), ('2015-10-27', '0', '57'), ('2015-10-28', '+7', '64'), ('2015-10-29', '+5', '69'), ('2015-10-31', '+5', '74'), ('2015-11-04', '0', '74'), ('2015-11-08', '0', '74'), ('2015-11-09', '0', '74'), ('2015-11-27', '+2', '76')]

Here's the next cool part ! We can now add line breaks within our pattern. We are going to add a line break after every escaped space.

In [40]:
print(re.findall(r'''
--\ 
(\d{4}-\d{2}-\d{2})\ 
rep\ 
([+-]?\d+)\ +
=\ 
(\d+)
''', data, flags = re.VERBOSE))
[('2015-09-05', '+3', '4'), ('2015-09-06', '+12', '16'), ('2015-10-14', '+2', '18'), ('2015-10-15', '0', '18'), ('2015-10-18', '0', '18'), ('2015-10-19', '+7', '25'), ('2015-10-21', '0', '25'), ('2015-10-24', '+12', '37'), ('2015-10-25', '+13', '50'), ('2015-10-26', '+7', '57'), ('2015-10-27', '0', '57'), ('2015-10-28', '+7', '64'), ('2015-10-29', '+5', '69'), ('2015-10-31', '+5', '74'), ('2015-11-04', '0', '74'), ('2015-11-08', '0', '74'), ('2015-11-09', '0', '74'), ('2015-11-27', '+2', '76')]

Our regular expression is a bit more readable now. Without having to mentally parse out the components of our regular expression, we can see them directly. There are 6 components to the regular expression. The escaped spaces are still a little confusing, but other than that we would argue it's more readable.

But wait ! There is one more (mind blowing) thing. re.VERBOSE allows us to add comments :

In [41]:
print(re.findall(r'''
--\                     # two dashes, then a space
(\d{4}-\d{2}-\d{2})\    # match group 1 is date, then a space
rep\                    # rep, then a space
([+-]?\d+)\ +           # match group 2 is rep change (with optional sign), then one or more spaces
=\                      # equal sign, then a space
(\d+)                   # match group 3 is running total
''', data, flags = re.VERBOSE)) 
[('2015-09-05', '+3', '4'), ('2015-09-06', '+12', '16'), ('2015-10-14', '+2', '18'), ('2015-10-15', '0', '18'), ('2015-10-18', '0', '18'), ('2015-10-19', '+7', '25'), ('2015-10-21', '0', '25'), ('2015-10-24', '+12', '37'), ('2015-10-25', '+13', '50'), ('2015-10-26', '+7', '57'), ('2015-10-27', '0', '57'), ('2015-10-28', '+7', '64'), ('2015-10-29', '+5', '69'), ('2015-10-31', '+5', '74'), ('2015-11-04', '0', '74'), ('2015-11-08', '0', '74'), ('2015-11-09', '0', '74'), ('2015-11-27', '+2', '76')]

Summary : re.VERBOSE ignores (unescaped) spaces and it ignores all text after the pound (#) sign. As such, we can create multi-line patterns and add comments, so that complicated regular expressions become easier to read.

Part 6 : Lookarounds

A lookahead matches a pattern only if it is followed by another pattern. For example:

  • 100(?= dollars) matches '100' only if it is followed by ' dollars'

A lookbehind matches a pattern only if it is preceded by another pattern. For example:

  • (?<=\$)100 matches '100' only if it is preceded by '$'
In [42]:
s = 'Name: Cindy, 66 inches tall, 30 years old'
In [43]:
# find the age, without a lookahead
re.search(r'(\d+) years? old', s).group(1)
Out[43]:
'30'
In [44]:
# find the age, with a lookahead
re.search(r'\d+(?= years? old)', s).group()
Out[44]:
'30'
In [45]:
# find the name, without a lookbehind
re.search(r'Name: (\w+)', s).group(1)
Out[45]:
'Cindy'
In [46]:
# find the name, with a lookbehind
re.search(r'(?<=Name: )\w+', s).group()
Out[46]:
'Cindy'
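
Both lookarounds also have negative forms : (?!...) is a negative lookahead and (?<!...) is a negative lookbehind, and each matches only if the given pattern is absent. Here is a small sketch (an addition to the notebook) on the same string :

# negative lookahead : a whole number NOT followed by ' years'
re.search(r'\b\d+\b(?! years)', s).group()   # '66'

# the trailing \b matters : without it, the engine would backtrack to
# the shorter number '3', whose next character is '0' rather than ' years'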

Part 7 : Assorted functionality

re.compile() compiles a regular expression pattern into a pattern object that can be stored in a variable and reused. This improves readability, and also performance if the pattern is used frequently.

re.compile examples

Example 1 :

In [47]:
s = '-- 2015-09-05 rep +3    = 4'
In [48]:
re.search(r'\d{4}-\d{2}-\d{2}', s).group()
Out[48]:
'2015-09-05'
In [49]:
date = re.compile(r'\d{4}-\d{2}-\d{2}')
In [50]:
# method 1 (rarely used)
re.search(date, s).group()

# method 2 (most used)
date.search(s).group()
Out[50]:
'2015-09-05'
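
A compiled pattern object is not limited to search() : the other re functions are available as methods too. A brief sketch, reusing the date pattern compiled above :

# findall and sub also work as methods of the compiled pattern
date.findall('-- 2015-09-05 rep +3 = 4 -- 2015-09-06 rep +12 = 16')
# returns ['2015-09-05', '2015-09-06']

date.sub('DATE', s)
# returns '-- DATE rep +3    = 4'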

Example 2 :

In [51]:
s = 'emails: john.doe@gmail.com and jane.doe@hotmail.com'
In [52]:
email = re.compile(r'[\w.-]+@[\w.-]+')
In [53]:
# these are all equivalent
re.search(r'[\w.-]+@[\w.-]+', s).group()
re.search(email, s).group()
email.search(s).group()
Out[53]:
'john.doe@gmail.com'
In [54]:
# these are all equivalent
re.findall(r'[\w.-]+@[\w.-]+', s)
re.findall(email, s)
email.findall(s)

span() method

Use the span() method of a match object, rather than the group() method, to determine the location/position of a match :

In [55]:
re.search(email, s).span()
Out[55]:
(8, 26)
In [56]:
s[8:26]
Out[56]:
'john.doe@gmail.com'

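Relatedly, a match object also provides start() and end() methods, which return the two endpoints that span() packs into a tuple :

m = email.search(s)
m.start(), m.end()       # (8, 26)
s[m.start():m.end()]     # 'john.doe@gmail.com'
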
split() method

re.split() splits a string by the occurrences of a regular expression pattern :

In [57]:
# quick aside : split on space character
print('Hello there !'.split(' '))
['Hello', 'there', '!']

Recall that when we split on something, that something gets dropped from the output :

In [58]:
# split on the character e
print('Hello there !'.split('e'))
['H', 'llo th', 'r', ' !']

It turns out we can do the same with regular expressions :

In [59]:
re.split(r' ', 'Hello there !')
Out[59]:
['Hello', 'there', '!']

So why would we want to do this using regular expressions ? The answer should be obvious to us at this point : we can now split strings based upon occurrences of a regular expression pattern.

In [60]:
s = 'emails: john.doe@gmail.com and jane.doe@hotmail.com'
In [61]:
re.split(r'john|jane', s)
Out[61]:
['emails: ', '.doe@gmail.com and ', '.doe@hotmail.com']
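
Two more re.split() features are worth a quick sketch (not in the original notebook) : the maxsplit argument limits the number of splits, and wrapping the pattern in a capture group keeps the matched delimiters in the output :

# limit the number of splits
re.split(r'john|jane', s, maxsplit=1)
# returns ['emails: ', '.doe@gmail.com and jane.doe@hotmail.com']

# a capture group keeps the matched delimiters in the result
re.split(r'(john|jane)', s)
# returns ['emails: ', 'john', '.doe@gmail.com and ', 'jane', '.doe@hotmail.com']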

Intermediate Regex Exercises

Exercise 1 : IMDb top 100 movies

Data about the 100 highest rated movies has been scraped from the IMDb website and stored in the file imdb_100.csv (in the data directory).

In [62]:
# read the file into a DataFrame
import pandas as pd
path = '../data/imdb_100.csv'
imdb = pd.read_csv(path)
In [63]:
imdb.columns
Out[63]:
Index(['star_rating', 'title', 'content_rating', 'genre', 'duration',
       'actors_list'],
      dtype='object')
In [64]:
# save the 'title' Series as a Python list
titles = imdb.title.tolist()
In [65]:
print(titles)
['The Shawshank Redemption', 'The Godfather', 'The Godfather: Part II', 'The Dark Knight', 'Pulp Fiction', '12 Angry Men', 'The Good, the Bad and the Ugly', 'The Lord of the Rings: The Return of the King', "Schindler's List", 'Fight Club', 'The Lord of the Rings: The Fellowship of the Ring', 'Inception', 'Star Wars: Episode V - The Empire Strikes Back', 'Forrest Gump', 'The Lord of the Rings: The Two Towers', 'Interstellar', "One Flew Over the Cuckoo's Nest", 'Seven Samurai', 'Goodfellas', 'Star Wars', 'The Matrix', 'City of God', "It's a Wonderful Life", 'The Usual Suspects', 'Se7en', 'Life Is Beautiful', 'Once Upon a Time in the West', 'The Silence of the Lambs', 'Leon: The Professional', 'City Lights', 'Spirited Away', 'The Intouchables', 'Casablanca', 'Whiplash', 'American History X', 'Modern Times', 'Saving Private Ryan', 'Raiders of the Lost Ark', 'Rear Window', 'Psycho', 'The Green Mile', 'Sunset Blvd.', 'The Pianist', 'The Dark Knight Rises', 'Gladiator', 'Terminator 2: Judgment Day', 'Memento', 'Taare Zameen Par', 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb', 'The Departed', 'Cinema Paradiso', 'Apocalypse Now', 'The Great Dictator', 'The Prestige', 'Back to the Future', 'The Lion King', 'The Lives of Others', 'Alien', 'Paths of Glory', 'Django Unchained', '3 Idiots', 'Grave of the Fireflies', 'The Shining', 'M', 'WALL-E', 'Witness for the Prosecution', 'Munna Bhai M.B.B.S.', 'American Beauty', 'Das Boot', 'Princess Mononoke', 'Amelie', 'North by Northwest', 'Rang De Basanti', 'Jodaeiye Nader az Simin', 'Citizen Kane', 'Aliens', 'Vertigo', 'Oldeuboi', 'Once Upon a Time in America', 'Double Indemnity', 'Star Wars: Episode VI - Return of the Jedi', 'Toy Story 3', 'Braveheart', 'To Kill a Mockingbird', 'Requiem for a Dream', 'Lawrence of Arabia', 'A Clockwork Orange', 'Bicycle Thieves', 'The Kid', 'Swades', 'Reservoir Dogs', 'Eternal Sunshine of the Spotless Mind', 'Taxi Driver', 'Dilwale Dulhania Le Jayenge', "Singin' in the Rain", 'All About Eve', 'Yojimbo', 'The Sting', 'Rashomon', 'Amadeus']

Here are a few of the titles from this list :

titles = [..., "It's a Wonderful Life", 'The Usual Suspects', 'Se7en', ...]

We want a revised list with the initial article (A/An/The) removed, without affecting the rest of the title. Here is the expected output :

clean_titles = [..., "It's a Wonderful Life", 'Usual Suspects', 'Se7en', ...]

In [66]:
import re
In [67]:
# remove the initial article
clean_titles = [re.sub(r'^(A|An|The) ', r'', title) for title in titles]
print(clean_titles)
['Shawshank Redemption', 'Godfather', 'Godfather: Part II', 'Dark Knight', 'Pulp Fiction', '12 Angry Men', 'Good, the Bad and the Ugly', 'Lord of the Rings: The Return of the King', "Schindler's List", 'Fight Club', 'Lord of the Rings: The Fellowship of the Ring', 'Inception', 'Star Wars: Episode V - The Empire Strikes Back', 'Forrest Gump', 'Lord of the Rings: The Two Towers', 'Interstellar', "One Flew Over the Cuckoo's Nest", 'Seven Samurai', 'Goodfellas', 'Star Wars', 'Matrix', 'City of God', "It's a Wonderful Life", 'Usual Suspects', 'Se7en', 'Life Is Beautiful', 'Once Upon a Time in the West', 'Silence of the Lambs', 'Leon: The Professional', 'City Lights', 'Spirited Away', 'Intouchables', 'Casablanca', 'Whiplash', 'American History X', 'Modern Times', 'Saving Private Ryan', 'Raiders of the Lost Ark', 'Rear Window', 'Psycho', 'Green Mile', 'Sunset Blvd.', 'Pianist', 'Dark Knight Rises', 'Gladiator', 'Terminator 2: Judgment Day', 'Memento', 'Taare Zameen Par', 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb', 'Departed', 'Cinema Paradiso', 'Apocalypse Now', 'Great Dictator', 'Prestige', 'Back to the Future', 'Lion King', 'Lives of Others', 'Alien', 'Paths of Glory', 'Django Unchained', '3 Idiots', 'Grave of the Fireflies', 'Shining', 'M', 'WALL-E', 'Witness for the Prosecution', 'Munna Bhai M.B.B.S.', 'American Beauty', 'Das Boot', 'Princess Mononoke', 'Amelie', 'North by Northwest', 'Rang De Basanti', 'Jodaeiye Nader az Simin', 'Citizen Kane', 'Aliens', 'Vertigo', 'Oldeuboi', 'Once Upon a Time in America', 'Double Indemnity', 'Star Wars: Episode VI - Return of the Jedi', 'Toy Story 3', 'Braveheart', 'To Kill a Mockingbird', 'Requiem for a Dream', 'Lawrence of Arabia', 'Clockwork Orange', 'Bicycle Thieves', 'Kid', 'Swades', 'Reservoir Dogs', 'Eternal Sunshine of the Spotless Mind', 'Taxi Driver', 'Dilwale Dulhania Le Jayenge', "Singin' in the Rain", 'All About Eve', 'Yojimbo', 'Sting', 'Rashomon', 'Amadeus']

As a bonus task, add the removed article to the end of the title. Here is the expected output :

better_titles = [..., "It's a Wonderful Life", 'Usual Suspects, The', 'Se7en', ...]

In [68]:
# move the initial article to the end
better_titles = [re.sub(r'^(A|An|The) (.+)', r'\2, \1', title) for title in titles]
print(better_titles)
['Shawshank Redemption, The', 'Godfather, The', 'Godfather: Part II, The', 'Dark Knight, The', 'Pulp Fiction', '12 Angry Men', 'Good, the Bad and the Ugly, The', 'Lord of the Rings: The Return of the King, The', "Schindler's List", 'Fight Club', 'Lord of the Rings: The Fellowship of the Ring, The', 'Inception', 'Star Wars: Episode V - The Empire Strikes Back', 'Forrest Gump', 'Lord of the Rings: The Two Towers, The', 'Interstellar', "One Flew Over the Cuckoo's Nest", 'Seven Samurai', 'Goodfellas', 'Star Wars', 'Matrix, The', 'City of God', "It's a Wonderful Life", 'Usual Suspects, The', 'Se7en', 'Life Is Beautiful', 'Once Upon a Time in the West', 'Silence of the Lambs, The', 'Leon: The Professional', 'City Lights', 'Spirited Away', 'Intouchables, The', 'Casablanca', 'Whiplash', 'American History X', 'Modern Times', 'Saving Private Ryan', 'Raiders of the Lost Ark', 'Rear Window', 'Psycho', 'Green Mile, The', 'Sunset Blvd.', 'Pianist, The', 'Dark Knight Rises, The', 'Gladiator', 'Terminator 2: Judgment Day', 'Memento', 'Taare Zameen Par', 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb', 'Departed, The', 'Cinema Paradiso', 'Apocalypse Now', 'Great Dictator, The', 'Prestige, The', 'Back to the Future', 'Lion King, The', 'Lives of Others, The', 'Alien', 'Paths of Glory', 'Django Unchained', '3 Idiots', 'Grave of the Fireflies', 'Shining, The', 'M', 'WALL-E', 'Witness for the Prosecution', 'Munna Bhai M.B.B.S.', 'American Beauty', 'Das Boot', 'Princess Mononoke', 'Amelie', 'North by Northwest', 'Rang De Basanti', 'Jodaeiye Nader az Simin', 'Citizen Kane', 'Aliens', 'Vertigo', 'Oldeuboi', 'Once Upon a Time in America', 'Double Indemnity', 'Star Wars: Episode VI - Return of the Jedi', 'Toy Story 3', 'Braveheart', 'To Kill a Mockingbird', 'Requiem for a Dream', 'Lawrence of Arabia', 'Clockwork Orange, A', 'Bicycle Thieves', 'Kid, The', 'Swades', 'Reservoir Dogs', 'Eternal Sunshine of the Spotless Mind', 'Taxi Driver', 'Dilwale Dulhania Le Jayenge', "Singin' in the Rain", 'All About Eve', 'Yojimbo', 'Sting, The', 'Rashomon', 'Amadeus']

Exercise 2 : FAA tower closures (revisited)

A list of FAA (Federal Aviation Administration) tower closures has been copied from a PDF into the file faa.txt, which is stored in the data directory of the course repository.

In [69]:
# read the file into a single string
with open('../data/faa.txt') as f:
    data = f.read()
In [70]:
# examine the first 300 characters
print(data[0:300])
FAA Contract Tower Closure List
(149 FCTs)
3-22-2013
LOC
ID Facility Name City State
DHN DOTHAN RGNL DOTHAN AL
TCL TUSCALOOSA RGNL TUSCALOOSA AL
FYV DRAKE FIELD FAYETTEVILLE AR
TXK TEXARKANA RGNL-WEBB FIELD TEXARKANA AR
GEU GLENDALE MUNI GLENDALE AZ
GYR PHOENIX GOODYEAR GOODYEAR AZ
IFP LAUGHLIN/BULL
In [71]:
# create a list of tuples containing the tower IDs and their states
print(re.findall(r'([A-Z]{3}) .+ ([A-Z]{2})', data))
[('DHN', 'AL'), ('TCL', 'AL'), ('FYV', 'AR'), ('TXK', 'AR'), ('GEU', 'AZ'), ('GYR', 'AZ'), ('IFP', 'AZ'), ('RYN', 'AZ'), ('FUL', 'CA'), ('MER', 'CA'), ('OXR', 'CA'), ('RAL', 'CA'), ('RNM', 'CA'), ('SAC', 'CA'), ('SDM', 'CA'), ('SNS', 'CA'), ('VCV', 'CA'), ('WHP', 'CA'), ('WJF', 'CA'), ('BDR', 'CT'), ('DXR', 'CT'), ('GON', 'CT'), ('HFD', 'CT'), ('HVN', 'CT'), ('OXC', 'CT'), ('APF', 'FL'), ('BCT', 'FL'), ('EVB', 'FL'), ('FMY', 'FL'), ('HWO', 'FL'), ('LAL', 'FL'), ('LEE', 'FL'), ('OCF', 'FL'), ('OMN', 'FL'), ('PGD', 'FL'), ('SGJ', 'FL'), ('SPG', 'FL'), ('SUA', 'FL'), ('TIX', 'FL'), ('ABY', 'GA'), ('AHN', 'GA'), ('LZU', 'GA'), ('MCN', 'GA'), ('RYY', 'GA'), ('DBQ', 'IA'), ('IDA', 'ID'), ('LWS', 'ID'), ('PIH', 'ID'), ('SUN', 'ID'), ('ALN', 'IL'), ('BMI', 'IL'), ('DEC', 'IL'), ('MDH', 'IL'), ('UGN', 'IL'), ('BAK', 'IN'), ('GYY', 'IN'), ('HUT', 'KS'), ('IXD', 'KS'), ('MHK', 'KS'), ('OJC', 'KS'), ('TOP', 'KS'), ('OWB', 'KY'), ('PAH', 'KY'), ('DTN', 'LA'), ('BVY', 'MA'), ('EWB', 'MA'), ('LWM', 'MA'), ('ORH', 'MA'), ('OWD', 'MA'), ('ESN', 'MD'), ('FDK', 'MD'), ('HGR', 'MD'), ('MTN', 'MD'), ('SBY', 'MD'), ('BTL', 'MI'), ('DET', 'MI'), ('SAW', 'MI'), ('ANE', 'MN'), ('STC', 'MN'), ('BBG', 'MO'), ('COU', 'MO'), ('GLH', 'MS'), ('HKS', 'MS'), ('HSA', 'MS'), ('OLV', 'MS'), ('TUP', 'MS'), ('GPI', 'MT'), ('EWN', 'NC'), ('HKY', 'NC'), ('INT', 'NC'), ('ISO', 'NC'), ('JQF', 'NC'), ('ASH', 'NH'), ('TTN', 'NJ'), ('AEG', 'NM'), ('SAF', 'NM'), ('ITH', 'NY'), ('RME', 'NY'), ('CGF', 'OH'), ('OSU', 'OH'), ('TZR', 'OH'), ('LAW', 'OK'), ('OUN', 'OK'), ('PWA', 'OK'), ('SWO', 'OK'), ('OTH', 'OR'), ('PDT', 'OR'), ('SLE', 'OR'), ('TTD', 'OR'), ('CXY', 'PA'), ('LBE', 'PA'), ('LNS', 'PA'), ('CRE', 'SC'), ('GYH', 'SC'), ('HXD', 'SC'), ('MKL', 'TN'), ('NQA', 'TN'), ('BAZ', 'TX'), ('BRO', 'TX'), ('CLL', 'TX'), ('CNW', 'TX'), ('CXO', 'TX'), ('GTU', 'TX'), ('HYI', 'TX'), ('RBD', 'TX'), ('SGR', 'TX'), ('SSF', 'TX'), ('TKI', 'TX'), ('TYR', 'TX'), ('VCT', 'TX'), ('OGD', 'UT'), ('PVU', 'UT'), ('LYH', 'VA'), ('OLM', 'WA'), ('RNT', 'WA'), ('SFF', 'WA'), ('TIW', 'WA'), ('YKM', 'WA'), ('CWA', 'WI'), ('EAU', 'WI'), ('ENW', 'WI'), ('JVL', 'WI'), ('LSE', 'WI'), ('MWC', 'WI'), ('OSH', 'WI'), ('UES', 'WI'), ('HLG', 'WV'), ('LWB', 'WV'), ('PKB', 'WV')]

Without changing the output, make this regular expression pattern more readable by using the re.VERBOSE option flag and adding comments.

In [72]:
print(re.findall(r'''
([A-Z]{3})\    # match group 1 is ID, then space
.+\            # any characters, then space
([A-Z]{2})     # match group 2 is state
''', data, flags = re.VERBOSE))
[('DHN', 'AL'), ('TCL', 'AL'), ('FYV', 'AR'), ('TXK', 'AR'), ('GEU', 'AZ'), ('GYR', 'AZ'), ('IFP', 'AZ'), ('RYN', 'AZ'), ('FUL', 'CA'), ('MER', 'CA'), ('OXR', 'CA'), ('RAL', 'CA'), ('RNM', 'CA'), ('SAC', 'CA'), ('SDM', 'CA'), ('SNS', 'CA'), ('VCV', 'CA'), ('WHP', 'CA'), ('WJF', 'CA'), ('BDR', 'CT'), ('DXR', 'CT'), ('GON', 'CT'), ('HFD', 'CT'), ('HVN', 'CT'), ('OXC', 'CT'), ('APF', 'FL'), ('BCT', 'FL'), ('EVB', 'FL'), ('FMY', 'FL'), ('HWO', 'FL'), ('LAL', 'FL'), ('LEE', 'FL'), ('OCF', 'FL'), ('OMN', 'FL'), ('PGD', 'FL'), ('SGJ', 'FL'), ('SPG', 'FL'), ('SUA', 'FL'), ('TIX', 'FL'), ('ABY', 'GA'), ('AHN', 'GA'), ('LZU', 'GA'), ('MCN', 'GA'), ('RYY', 'GA'), ('DBQ', 'IA'), ('IDA', 'ID'), ('LWS', 'ID'), ('PIH', 'ID'), ('SUN', 'ID'), ('ALN', 'IL'), ('BMI', 'IL'), ('DEC', 'IL'), ('MDH', 'IL'), ('UGN', 'IL'), ('BAK', 'IN'), ('GYY', 'IN'), ('HUT', 'KS'), ('IXD', 'KS'), ('MHK', 'KS'), ('OJC', 'KS'), ('TOP', 'KS'), ('OWB', 'KY'), ('PAH', 'KY'), ('DTN', 'LA'), ('BVY', 'MA'), ('EWB', 'MA'), ('LWM', 'MA'), ('ORH', 'MA'), ('OWD', 'MA'), ('ESN', 'MD'), ('FDK', 'MD'), ('HGR', 'MD'), ('MTN', 'MD'), ('SBY', 'MD'), ('BTL', 'MI'), ('DET', 'MI'), ('SAW', 'MI'), ('ANE', 'MN'), ('STC', 'MN'), ('BBG', 'MO'), ('COU', 'MO'), ('GLH', 'MS'), ('HKS', 'MS'), ('HSA', 'MS'), ('OLV', 'MS'), ('TUP', 'MS'), ('GPI', 'MT'), ('EWN', 'NC'), ('HKY', 'NC'), ('INT', 'NC'), ('ISO', 'NC'), ('JQF', 'NC'), ('ASH', 'NH'), ('TTN', 'NJ'), ('AEG', 'NM'), ('SAF', 'NM'), ('ITH', 'NY'), ('RME', 'NY'), ('CGF', 'OH'), ('OSU', 'OH'), ('TZR', 'OH'), ('LAW', 'OK'), ('OUN', 'OK'), ('PWA', 'OK'), ('SWO', 'OK'), ('OTH', 'OR'), ('PDT', 'OR'), ('SLE', 'OR'), ('TTD', 'OR'), ('CXY', 'PA'), ('LBE', 'PA'), ('LNS', 'PA'), ('CRE', 'SC'), ('GYH', 'SC'), ('HXD', 'SC'), ('MKL', 'TN'), ('NQA', 'TN'), ('BAZ', 'TX'), ('BRO', 'TX'), ('CLL', 'TX'), ('CNW', 'TX'), ('CXO', 'TX'), ('GTU', 'TX'), ('HYI', 'TX'), ('RBD', 'TX'), ('SGR', 'TX'), ('SSF', 'TX'), ('TKI', 'TX'), ('TYR', 'TX'), ('VCT', 'TX'), ('OGD', 'UT'), ('PVU', 'UT'), ('LYH', 'VA'), ('OLM', 'WA'), ('RNT', 'WA'), ('SFF', 'WA'), ('TIW', 'WA'), ('YKM', 'WA'), ('CWA', 'WI'), ('EAU', 'WI'), ('ENW', 'WI'), ('JVL', 'WI'), ('LSE', 'WI'), ('MWC', 'WI'), ('OSH', 'WI'), ('UES', 'WI'), ('HLG', 'WV'), ('LWB', 'WV'), ('PKB', 'WV')]