Regular Expressions

In [1]:
import addutils.toc ; addutils.toc.js(ipy_notebook=True)
In [2]:
from addutils import css_notebook

1 Finding words and pieces of information in complex text

This notebook is about searching for and manipulating substrings (matches) of a string. This is useful for extracting pieces of information from a text, for example when parsing dates, URLs, e-mail addresses, data lists, configuration files or programming scripts. Python offers string methods that cover the simplest requirements, but the most powerful solution is a language-independent pattern-matching standard: regular expressions.
Regular expressions are a kind of small, highly specialized programming language made of special text strings (meta-characters) designed for describing a search pattern. Python has several packages for working with regular expressions, such as re, the regular expression module included in the Python standard library, or pyregex, a new external package under development (not covered in this notebook).

2 Python easy solutions for simple problems

2.1 Built-in Python string methods can solve the easiest tasks:

2.1.1 find

One of the most common requirements is to find a given word, or a set of characters or numbers, in a text. The find method returns the index of the first character of the first match, or -1 if no match is found.

In [3]:
string = "this is string!!!"

part = "wow!!!"
part2 = "strong"
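As a minimal sketch of the behaviour described above (the substrings searched for here are illustrative choices, not taken from the original cell), find returns the index of the first match, -1 when there is none, and accepts an optional start index:

```python
string = "this is string!!!"

# Illustrative substrings (assumptions, not from the original cell)
first = string.find("is")        # first occurrence is inside "this"
missing = string.find("strong")  # no match, so find returns -1
later = string.find("is", 3)     # optional second argument: start searching at index 3

print(first, missing, later)     # 2 -1 5
```

Because -1 is also a valid (negative) index in Python, it is safer to compare the result with -1 explicitly before using it for slicing.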

2.1.2 strip, lstrip, rstrip

Other methods help to clean a string and extract only the useful information:

In [4]:
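string = "0000000this is string!!!0000000"

print(string.strip('0'))    # remove '0' from both ends
print(string.lstrip('0'))   # remove '0' from the left end only
print(string.rstrip('0'))   # remove '0' from the right end only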
this is string!!!
this is string!!!0000000
0000000this is string!!!
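One detail worth a short sketch: the optional argument of strip, lstrip and rstrip is treated as a set of characters, not as a prefix or suffix, and with no argument these methods remove whitespace. The variable names below are illustrative only:

```python
string = "0000000this is string!!!0000000"

# The argument is a set of characters to remove from the ends,
# so "0!" strips any run of '0' and '!' characters, in any order.
both = string.strip("0!")

# With no argument, strip removes leading and trailing whitespace.
padded = "   this is string!!!\n".strip()

print(both)    # this is string
print(padded)  # this is string!!!
```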

2.1.3 replace

In [5]:
string = "this is string!!!"

spl = string.replace('string', 'good')
'this is good!!!'
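As a small sketch of two further properties of replace (the replacement text is chosen only for illustration): it returns a new string and leaves the original unchanged, and an optional third argument limits how many occurrences are replaced.

```python
string = "this is string!!!"

# Optional third argument: replace at most one occurrence, scanning left to right
once = string.replace("is", "IS", 1)
# Without it, every occurrence is replaced
every = string.replace("is", "IS")

print(once)    # thIS is string!!!
print(every)   # thIS IS string!!!
print(string)  # this is string!!!  (the original is unchanged)
```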

2.1.4 functions for identifying the type of character

Python provides a series of methods, and even simple idiomatic expressions using basic operators such as in, that return True or False: isalnum (alphanumeric), isalpha (alphabetic only), isdigit (digits), isspace (whitespace), islower (lowercase), isupper (uppercase), istitle (title case, i.e. every word in the string starts with an uppercase letter followed by lowercase letters), startswith and endswith.

In [6]:
"a" in 'xyxxyabcxyzzy'
In [7]:
string = "this is string!!!"

print(string.startswith('string', 8))   # start index at the matching boundary
In [8]:
string = 'this'
print(string.isalpha())
In [9]:
string = 'this '  # whitespace is not alphabetic!
print(string.isalpha())


Try the other methods yourself:

string = 'this'
print(string.isupper())
print(string.islower())
print(string.istitle())
print(string.isalnum())
print(string.isspace())
print(string.isdigit())

Also try modifying the string first:

mod = string.upper()
mod = string.title()   # overwrites the previous value; try one line at a time
print(mod.isupper())

2.1.5 a slightly more complex example

Let's see how we could clean a string containing some unwanted elements, using Python built-in methods:

In [10]:
string = "this 44444is a99999 dirty 678435 string xxxxxxexample....wow000000!!!"

spl = string.split()
['this', '44444is', 'a99999', 'dirty', '678435', 'string', 'xxxxxxexample....wow000000!!!']
In [11]:
ls = []
for item in spl:
    if item.find('xxx') != -1:                             # item contains a run of 'x'
        item = item.lstrip('x')                            # remove the leading 'x' characters
    result = ''.join([e for e in item if not e.isdigit()]) # drop every digit
    if result:                                             # needed to exclude empty strings
        ls.append(result)
print('The temporary cleaned list looks like this: ', ls)
The temporary cleaned list looks like this:  ['this', 'is', 'a', 'dirty', 'string', 'example....wow!!!']

Join the list back into a string, then finish the cleaning with a couple of small modifications:

In [12]:
string = ' '.join(ls)
final = string.replace('....', ', ')
final = final.replace('dirty', 'clean')   # chain on final so the first replacement is kept
print(final)
this is a clean string example, wow!!!

For simple string handling, Python built-in methods are enough, but for more complex tasks regular expressions are the best solution for pattern matching.

2.2 The Power of Regular Expressions

A regular expression (regex or regexp for short) is a special text string describing a search pattern. Regular expressions can be used to retrieve the parts of a longer string that match some desired criteria. They may seem complex at first, since they are sequences of both regular and special characters that are hard to read at first sight, but once mastered they become a powerful aid for parsing any kind of text. The most basic regular expressions are single literal characters: for example, "a" matches every occurrence of "a" in a text. Some characters, called meta-characters, have a special meaning; combined with regular characters, they build the search patterns. The meta-characters used by regular expressions are:

. ^ $ * + ? { } [ ] \ | ( )
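To make the difference between literal characters and meta-characters concrete, here is a minimal sketch using the standard library re module. The sample text is a shortened version of the dirty string from section 2.1.5, and the patterns are illustrative choices: \d (a digit) and \s (a whitespace character) are special sequences introduced by the backslash meta-character, and + means "one or more of the preceding item".

```python
import re

# Shortened version of the sample string from section 2.1.5
text = "this 44444is a99999 dirty 678435 string"

# A literal character matches itself: every "a" in the text
literal_hits = re.findall(r"a", text)

# \d matches one digit and + means "one or more", so \d+ matches each run of digits
digit_runs = re.findall(r"\d+", text)

# re.sub replaces every match: remove the digit runs,
# then collapse repeated whitespace (\s+) into a single space
no_digits = re.sub(r"\d+", "", text)
cleaned = re.sub(r"\s+", " ", no_digits).strip()

print(literal_hits)  # ['a']
print(digit_runs)    # ['44444', '99999', '678435']
print(cleaned)       # this is a dirty string
```

Two substitutions do here what needed a loop, a list comprehension and a join with the built-in string methods.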

The following link refers to a list of regular expressions, and the description of their use:

In [13]:
from IPython.display import HTML
HTML('<iframe src= width=700 height=250>')