import codecs
import unicodedata
with codecs.open("faust.txt","r","utf-8") as stream: text = stream.read()
# !sudo locale-gen de_DE.UTF-8
import locale
locale.setlocale(locale.LC_ALL,'de_DE.utf8')
# C, en_US.utf8, ...
'de_DE.utf8'
The re (regular expression) module contains all the functions we are talking about here.
Regular expressions are powerful tools for searching for strings and patterns.
They are the basis of the command-line fgrep, grep, and egrep tools (the re in those names stands for "regular expression").
Internally, the query is converted into a finite state automaton, and that automaton is then matched against the input.
import re
import re
There are two basic operations, search and match.
The first searches for the regular expression anywhere in the string,
the second requires the match to start at the beginning.
A successful match is indicated by returning a match object
(this behaves like a boolean True), and a failed match
is indicated by returning None.
re.search('cheese','the cheese and the bread')
<_sre.SRE_Match at 0x40ab988>
re.search('butter','the cheese and the bread')
re.match('cheese','the cheese and the bread')
re.match('the','the cheese and the bread')
<_sre.SRE_Match at 0x40aba58>
Matches are case-sensitive by default.
re.search('THE','the cheese and the bread')
But we can make matches case insensitive with the re.I flag.
re.search('THE','the cheese and the bread',re.I)
<_sre.SRE_Match at 0x40abd30>
We can also incorporate this flag directly into the query.
re.search('THE(?i)','the cheese and the bread')
<_sre.SRE_Match at 0x40abd98>
A third important operation is sub and its variant subn.
re.sub('cheese','butter','bread and cheese')
'bread and butter'
re.subn('cheese','butter','bread and cheese')
('bread and butter', 1)
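The replacement in sub need not be a fixed string: re.sub also accepts a function, which is called with each match object and returns the replacement text. A small sketch (the number-doubling function here is purely illustrative):

```python
import re

# The replacement callable receives the match object; whatever it
# returns is substituted for the matched text.
def double(m):
    return str(int(m.group(0)) * 2)

re.sub(r'\d+', double, 'order 3 boxes and 4 bags')
# 'order 6 boxes and 8 bags'
```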
Also, we can find multiple matches with findall.
re.findall('spam','spam, spam, ham, and spam')
['spam', 'spam', 'spam']
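If you also need the position of each match, finditer returns an iterator of match objects instead of plain strings (this is standard re functionality):

```python
import re

# finditer yields one match object per non-overlapping match,
# so start() gives each match's offset in the subject string.
[(m.group(0), m.start()) for m in re.finditer('spam', 'spam, spam, ham, and spam')]
# [('spam', 0), ('spam', 6), ('spam', 21)]
```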
Finally, we can also split.
re.split(' ','the quick brown fox')
['the', 'quick', 'brown', 'fox']
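Because the separator is itself a regular expression, split can handle irregular delimiters that a plain string split cannot, for example any mix of commas and whitespace (a small sketch):

```python
import re

# one or more commas and/or whitespace characters act as a single separator
re.split(r'[,\s]+', 'the, quick   brown,fox')
# ['the', 'quick', 'brown', 'fox']
```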
Regular expression operations also take a number of flags that affect the operation:
re.I - ignore case
re.L - locale-dependent matches
re.M - multiline (changes the meaning of $ and ^)
re.S - dot matches all characters (. usually doesn't match \n)
re.X - verbose regular expressions (whitespace is ignored and comments are allowed)
re.U - Unicode-dependent matches (changes the interpretation of digits etc.)
You can also specify these with syntax like (?iu) inside the expression.
re.findall(r'THE','the cat in the hat',re.I)
['the', 'the']
re.findall(r'THE(?i)','the cat in the hat')
['the', 'the']
The match object gives additional information about the match. It contains "groups"; group 0 refers to the entire match (we'll see how to define other groups later).
g = re.search('cheese','the cheese and the bread')
g
<_sre.SRE_Match at 0x3b8e098>
g.group(0)
'cheese'
g.start(0),g.end(0)
(4, 10)
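The start/end offsets index into the original subject string, so slicing with them recovers the matched text:

```python
import re

s = 'the cheese and the bread'
g = re.search('cheese', s)
# slicing the subject with the match offsets gives back the match itself
s[g.start(0):g.end(0)]
# 'cheese'
```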
Regular expression matching is a two-step process: the pattern is first compiled, and the compiled pattern is then matched against the input.
Compilation can be costly, so you can separate it from matching and substitution.
obj = re.compile('cheese')
obj
re.compile(r'cheese')
obj.search('bread and cheese')
<_sre.SRE_Match at 0x3b8e100>
obj.match('bread and cheese')
obj.sub('butter','bread and cheese')
'bread and butter'
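To see the cost, you can time a compiled pattern against repeated module-level calls. Note that re keeps an internal cache of recently compiled patterns, so the difference is modest; this is only a measurement sketch, and the actual numbers depend on your machine:

```python
import re
import timeit

pat = re.compile(r'\d+')
text = 'order 66 shipped on day 1138'

# The compiled object skips the per-call pattern lookup that
# re.search(pattern, ...) performs on every invocation.
t_compiled = timeit.timeit(lambda: pat.search(text), number=10000)
t_module = timeit.timeit(lambda: re.search(r'\d+', text), number=10000)
```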
Regular expressions frequently involve backslash characters (\),
and sometimes also single or double quotes.
For this, there are several convenient quoting conventions:
r"abc" - raw string
"""a"bc""" - triple quoted
r"""a"bc""" - triple quoted raw
ur"""a"bc""" - triple quoted raw unicode string
print 'a\bc'
print r'a\bc'
print "a\"b\"c"
print r"""a\"b\"c"""
print ur"""a\"b\"c"""
ac
a\bc
a"b"c
a\"b\"c
a\"b\"c
re.search(r'\w+','the bread and the cheese').group(0)
'the'
re.search(ur'\w+',u'Brot und Käse').group(0)
u'Brot'
Be careful when matching Unicode in Python 2.x, since you can write
either or both of the regular expression and the target as str or unicode.
If you aren't consistent, the matches will just fail.
Furthermore, matching UTF-8 encodings stored in str won't work right.
re.search(ur'Käse',u'Der Käse und das Brot.')
<_sre.SRE_Match at 0x3b8e2a0>
re.search('Käse',u'Der Käse und das Brot.')
re.search(ur'Käse','Der Käse und das Brot.')
re.search('Käse','Der Käse und das Brot.')
<_sre.SRE_Match at 0x3b8e308>
Even if both strings are Unicode, you still have to worry about normalization.
s = unicodedata.normalize('NFD',u'Käse')
print "(%s)"%s
(Käse)
re.search(s,u'Der Käse und das Brot')
def normalizing_search(regex,s):
    regex = unicodedata.normalize('NFC',regex)
    s = unicodedata.normalize('NFC',s)
    return re.search(regex,s)
normalizing_search(s,u'Der Käse und das Brot')
<_sre.SRE_Match at 0x3b8e370>
There are a number of standard syntactic elements:
. matches a single character (any character)
x* matches 0 or more x
x+ matches 1 or more x
x? matches 0 or 1 x
^ and $ match at the beginning and end of a line, respectively
\x suppresses the special meaning of character x
(xyz) matches xyz and treats it as a unit for the purposes of operators (it also defines a group)
x|y matches x or y
[abcA-Z] matches any one character in the set a, b, c, or in the range A through Z
[^abc] matches any character other than a, b, or c
re.findall('c.t','the cat on the cot')
['cat', 'cot']
re.findall('we*t','wet cowtippers tweet frequently')
['wet', 'wt', 'weet']
re.findall('we+t','wet cowtippers tweet frequently')
['wet', 'weet']
re.findall('we?t','wet cowtippers tweet frequently')
['wet', 'wt']
There is actually a generalization of the *-like operators, where you can
specify the exact number of repetitions with syntax like {3,7}.
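For instance, a{2,3} matches two or three consecutive a's; like * and +, this repetition is greedy by default:

```python
import re

# the four-a run matches greedily as 'aaa'; the leftover single 'a'
# is too short to satisfy the {2,3} minimum
re.findall(r'a{2,3}', 'a aa aaa aaaa')
# ['aa', 'aaa', 'aaa']
```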
re.findall('[ew]t','wet cowtippers tweet frequently')
['et', 'wt', 'et']
print re.findall(r'\^\.\^','this ^.^ is a Japanese smiley, ^_^')
print re.findall(r'\^.\^','this ^.^ is a Japanese smiley, ^_^')
['^.^']
['^.^', '^_^']
print re.findall(r'w','wet cowtippers tweet frequently')
print re.findall(r'^w','wet cowtippers tweet frequently')
['w', 'w', 'w']
['w']
print re.findall(r'(tweet|twit)','wet cowtippers tweet frequently, but are twits')
['tweet', 'twit']
By default, regular expression libraries return the longest match.
print re.findall(r'ab+','xyz abbbbbbc def')
['abbbbbb']
Sometimes, you want the shortest possible match.
You get that by putting a ? after a repeat operator like *, +, or ?.
print re.findall(r'ab+?','xyz abbbbbbc def')
['ab']
Note that this does not "search for" the shortest match, it is just that when it matches, it picks up the shortest string.
print re.search(r'ab+?','xyz abbbbbbc abc def').start(0)
4
print re.findall(r'the ([^ ]*)','the cat in the hat')
['cat', 'hat']
print re.findall(r'(a|the) ([^ ]*)','a cat in the hat')
[('a', 'cat'), ('the', 'hat')]
g = re.search(r'(a|the) ([^ ]*)','a cat in the hat')
g.group(0)
'a cat'
g.group(1)
'a'
g.group(2)
'cat'
print g.start(2),g.end(2),g.span(2)
2 5 (2, 5)
print re.findall(r'(?:a|the) ([^ ]*)','a cat in the hat')
['cat', 'hat']
print re.search(r'(the|a) [^ ]+ near \1 [^ ]+','the cat near the cat')
print re.search(r'(the|a) [^ ]+ near \1 [^ ]+','a cat near a cat')
print re.search(r'(the|a) [^ ]+ near \1 [^ ]+','the cat near a cat')
<_sre.SRE_Match object at 0x3b6ff30>
<_sre.SRE_Match object at 0x3b6ff30>
None
Grouping also takes on special meaning with split, alternating between separators and words.
print re.split(r'([,;]?\s+|\W+$)','The quick, brown fox jumps; over lazy dogs!')
['The', ' ', 'quick', ', ', 'brown', ' ', 'fox', ' ', 'jumps', '; ', 'over', ' ', 'lazy', ' ', 'dogs', '!', '']
Grouping can get more complex with naming and conditionals.
print re.findall(r'(.)\1','aa bc dd ef')
print re.findall(r'(?P<id>.)(?P=id)','aa bc dd ef')
['a', 'd']
['a', 'd']
Named groups can also be used to refer to parts of patterns.
print re.search(r'(?P<id>b.)','aa bc dd ef').group("id")
bc
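With several named groups, the match object's groupdict method collects all of them into a dictionary at once (the article/noun names here are just illustrative):

```python
import re

# each (?P<name>...) group becomes a key in groupdict()
g = re.search(r'(?P<article>the|a) (?P<noun>[^ ]+)', 'the cat in the hat')
g.groupdict()
# {'article': 'the', 'noun': 'cat'}
```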
There are even conditionals based on named groups.
q = r'^(<)?[^<>]+(?(1)>|)$'
print re.search(q,'abc')
print re.search(q,'<abc>')
print re.search(q,'<abc')
<_sre.SRE_Match object at 0x42404e0>
<_sre.SRE_Match object at 0x42404e0>
None
Regular expressions can become hard to read very easily.
q = r'^(<)?[^<>]+(?(1)>|)$'
With the re.X flag (or (?x)), you can insert whitespace and comments.
qx = r"""(?x)
^(<)? # match optional beginning "<"
[^<>]+ # match any non-bracket character
(?(1)>|)$ # match a ">" at the end if we did so at the beginning
"""
print re.search(q,'<abc>')
print re.search(qx,'<abc>')
<_sre.SRE_Match object at 0x42405d0>
<_sre.SRE_Match object at 0x42405d0>
There are a number of common special character classes:
\A - empty string at the beginning of the string
\Z - empty string at the end of the string
\b - empty string at a word boundary
\B - empty string not at a word boundary (an upper-case class is generally the inverse of the lower-case one)
\d - digit (usually [0-9], or the digit class in Unicode)
\D - not a digit
\s - white space
\S - not white space
\w - word character
\W - not a word character
re.findall(r'\w+',"The quick brown fox... jumped over the la$y dogz.")
['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'la', 'y', 'dogz']
numbers = re.compile(r'((?:\d+\.\d*|\d*\.\d+)(?:e[+-]\d+)?)',re.I)
numbers.findall("The fine structure constant is 7.2973525698e-3, and pi is about 3.14159.")
['7.2973525698e-3', '3.14159']
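Since the pattern only accepts valid numeric literals, the matched strings can be fed straight to float. A usage sketch (the pattern is repeated here so the snippet is self-contained):

```python
import re

numbers = re.compile(r'((?:\d+\.\d*|\d*\.\d+)(?:e[+-]\d+)?)', re.I)
# every string the pattern accepts is a valid Python float literal
[float(x) for x in numbers.findall("alpha is about 7.2973525698e-3, pi about 3.14159.")]
# [0.0072973525698, 3.14159]
```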
Sometimes you want to match something "in context" without actually considering the context part of the match. For this, you can use lookahead and lookbehind assertions.
re.findall(r"[abc](?=z)","ax by cz")
['c']
re.findall(r"[abc](?!z)","ax by cz")
['a', 'b']
re.findall(r"(?<=a)[xyz]","ax by cz")
['x']
re.findall(r"(?<!a)[xyz]","ax by cz")
['y', 'z']
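One caveat with re's lookbehind: the pattern inside (?<=...) must match a fixed width (an alternation of equal-length strings is fine, but * or + is not). A sketch that extracts amounts only when preceded by a currency sign (the example strings are illustrative):

```python
import re

# fixed-width lookbehind: '$' is exactly one character
re.findall(r'(?<=\$)\d+', 'fees: $5 now, 7 later, $10 at the end')
# ['5', '10']

# a variable-width lookbehind is rejected at compile time
try:
    re.compile(r'(?<=\$+)\d+')
except re.error as e:
    problem = str(e)
```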
Above, we have seen the standard Python regular expression features. Regular expressions differ somewhat between different tools.
Most importantly, quoting differs: special characters like (, ), and | are sometimes special by default, and sometimes need a backslash, as in \|, in order to take on their special meaning.
POSIX tools support special POSIX character classes, like [:upper:], [:digit:], etc.
Perl supports recursive regular expressions; these aren't really "regular expressions" at all anymore, they are more like general purpose parsing. (In Python, there are several parsing modules you can use instead.)
There is a more powerful regular expression module in Python, called regex.
It handles Unicode better and supports some interesting additional features.
import regex
r = regex.compile(r"^(\w+|\((?1)[+*/-](?1)\))$")
r.match("x")
<_regex.Match at 0x3ce6d98>
r.match("(x+y)")
<_regex.Match at 0x3ce6e00>
r.match("(x*(y+z))")
<_regex.Match at 0x3ce6e68>
r.match("(x*y+z))")
Fuzzy matching allows edit distance information to be taken into account during matching. That is, a group does not need to match precisely.
regex.findall(r"(?=\w)(quick){e<=1}","the quick brown fox quacks loudly")
['quick', 'quack']
You can specify the number of insertions, deletions, substitutions, and errors.
Often, it is useful to compile large lists of words into a regular expression (cf. fgrep).
with open("basic-english.txt") as stream: words = stream.read().split()
len(words)
851
allwords = regex.compile(r"\b(\L<words>)(?:s|es|ed|ing)?\b(?i)",words=words)
allwords.findall("The quick brown fox jumps over the lazy dogs.")
['The', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog']
fuzzywords = regex.compile(r"\b(\L<words>){e<=2}(?:s|es|ed|ing)?\b(?i)",words=words)
print fuzzywords.findall("The quock briwn fox jxmps over the lazy dogs.")
['The ', 'quock', ' ', 'briwn', ' fox', ' ', 'jxmp', ' over', ' the ', 'lazy', ' dog', '']
fuzzywords = regex.compile(r"\b(?=\w)(\L<words>){e<=2}(?:s|es|ed|ing)?\b(?i)",words=words)
print fuzzywords.findall("The quock briwn fox jxmps over the lazy dogs.")
['The ', 'quock', 'briwn', 'fox ', 'jxmp', 'over', 'the ', 'lazy', 'dogs']
There is generally better Unicode support in regex:
character classes (\w etc.) refer to Unicode by default
\m and \M match at the beginning/end of a word, respectively
\p and \P match characters with/without a given Unicode property
\X matches a single grapheme
regex.findall(ur'\S+',u'the quick рыжая лиса')
[u'the', u'quick', u'\u0440\u044b\u0436\u0430\u044f', u'\u043b\u0438\u0441\u0430']
regex.findall(ur'\w+',u'the quick рыжая лиса')
[u'the', u'quick', u'\u0440\u044b\u0436\u0430\u044f', u'\u043b\u0438\u0441\u0430']
regex.findall(ur'\p{Script=Latin}+',u'the quick рыжая лиса')
[u'the', u'quick']
regex.findall(ur'\p{Script=Cyrillic}+',u'the quick рыжая лиса')
[u'\u0440\u044b\u0436\u0430\u044f', u'\u043b\u0438\u0441\u0430']
s = u"Käse"
t = unicodedata.normalize('NFD',s)
print repr(s)
print repr(t)
u'K\xe4se' u'Ka\u0308se'
By default, re doesn't consider non-ASCII characters word characters at all.
re.findall(ur"\w",s),re.findall(ur"\w",t)
([u'K', u's', u'e'], [u'K', u'a', u's', u'e'])
With Unicode support, it does, but it doesn't handle decomposed characters.
re.findall(ur"\w(?u)",s),re.findall(ur"\w(?u)",t)
([u'K', u'\xe4', u's', u'e'], [u'K', u'a', u's', u'e'])
The regex package deals correctly with word characters by default,
but it still doesn't handle decomposed characters with either \w or the . operator.
regex.findall(ur"\w",s),regex.findall(ur"\w",t)
([u'K', u'\xe4', u's', u'e'], [u'K', u'a', u'\u0308', u's', u'e'])
regex.findall(ur".",s),regex.findall(ur".",t)
([u'K', u'\xe4', u's', u'e'], [u'K', u'a', u'\u0308', u's', u'e'])
However, the grapheme matcher \X recognizes that the decomposed
umlaut is, in fact, a single grapheme, even though it consists
of several codepoints.
regex.findall(ur"\X",s),regex.findall(ur"\X",t)
([u'K', u'\xe4', u's', u'e'], [u'K', u'a\u0308', u's', u'e'])
Regular expressions are best for fairly simple tasks. For more complex parsing tasks, you may want to use an actual parsing tool, like pyparsing.
import pyparsing
pyparsing.nestedExpr().parseString("(a (b c) d)").asList()
[['a', ['b', 'c'], 'd']]
import string
from pyparsing import oneOf,Literal,Word,Optional,StringEnd
greeting = oneOf("Hi Yo") + Optional(Literal(",")) + Word(string.uppercase,string.lowercase) + Optional(oneOf(". !")) + StringEnd()
greeting.parseString("Hi, Peter!")
(['Hi', ',', 'Peter', '!'], {})
greeting.parseString("Yo, DogZ.")
---------------------------------------------------------------------------
ParseException                            Traceback (most recent call last)
<ipython-input-140-1186c0727049> in <module>()
----> 1 greeting.parseString("Yo, DogZ.")
/usr/lib/python2.7/dist-packages/pyparsing.pyc in parseString(self, instring, parseAll)
   1030             # catch and re-raise exception from here, clears out pyparsing internal stack trace
   1031             exc = sys.exc_info()[1]
-> 1032             raise exc
   1033         else:
   1034             return tokens
ParseException: Expected end of text (at char 7), (line:1, col:8)