Regular Expressions


Excerpt from "Coding for Scientists"
(C) Fabrizio Smeraldi 2014
http://www.eecs.qmul.ac.uk/~fabri
Queen Mary, University of London


Regular expressions (REGEXP for short) are a compact way of describing a text pattern. Simple patterns of this kind are very common: for instance, typing

ls *.py

in a Linux shell will list all files whose names end in .py. The character * is known as a wildcard: in the shell it stands for any sequence of characters.

Regular expressions help with extracting information from text files (e.g. BLAST output or FASTA files) by locating particular patterns. For instance, in a FASTA header such as the one below, the accession number comes between a ">" and a "|": such a pattern can easily be described by a regular expression.

>P04637|P53_HUMAN Cellular tumor antigen p53 - Homo sapiens (Human).
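As a minimal sketch of this idea, the accession number can be pulled out of the header above with a single pattern, using Python's re module (introduced in the next section). The variable names are illustrative, and the pattern assumes the header has exactly the layout shown above; headers with a different layout would need a different pattern.

```python
import re

header = ">P04637|P53_HUMAN Cellular tumor antigen p53 - Homo sapiens (Human)."

# ">" followed by one or more characters that are not "|", then a literal "|".
# The parentheses capture the part we are interested in as group 1.
mo = re.search(r">([^|]+)\|", header)
accession = mo.group(1)
print(accession)
```

Here [^|]+ means "one or more characters other than |", so the captured text stops at the first vertical bar.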

Also, databases such as PROSITE (http://prosite.expasy.org/) list patterns, written in a notation close to regular expressions, that can be applied to protein sequences to identify particular families of proteins or domains.

Regular expression syntax in Python is very similar to Perl syntax, so migrating between the two languages should not be difficult.

The re Module

In Python, REGEXP support is provided by the re module. Simple usage is straightforward:

In [5]:
import re

# mo is a "match object"
mo=re.search("hello", "Hello world, hello Python!")
print(mo.group())
print(mo.span())
hello
(13, 18)

This is not too different from the .index() method of a string:

In [2]:
print("Hello world, hello Python!".index("hello"))
13

But it is a lot more flexible:

In [3]:
re.findall("[Hh][ea]llo", "Hallo world, hello Python!")
Out[3]:
['Hallo', 'hello']

Here the square brackets define a set of alternative characters: [Hh] matches either H or h, and [ea] matches either e or a.

If a match is not found, the search returns None:

In [4]:
mo=re.search("hello", "Hi world!")
print(mo)
None
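Because of this, calling .group() directly on the result of a failed search raises an AttributeError. A minimal sketch of the usual guard follows; the helper name first_match is made up for illustration.

```python
import re

def first_match(pattern, text):
    """Return the first matching substring, or None if there is no match."""
    mo = re.search(pattern, text)
    if mo is None:   # no match: there is no match object to query
        return None
    return mo.group()

print(first_match("hello", "Hello world, hello Python!"))
print(first_match("hello", "Hi world!"))
```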

Performing matches

We have already seen .search(), which finds the first match only, and .findall(). The re module offers four matching functions:

Method/Attribute  Purpose
match()           Determine if the RE matches at the beginning of the string.
search()          Scan through a string, looking for any location where this RE matches.
findall()         Find all substrings where the RE matches, and return them as a list.
finditer()        Find all substrings where the RE matches, and return them as an iterator(*) of match objects.

(*) An iterator works much like a list, in that you can loop over it, but its items are computed on the fly as they are needed, so it is more memory-efficient.
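The differences between the four functions are easiest to see side by side. The following sketch reuses the greeting string from the earlier examples; the variable names are illustrative.

```python
import re

text = "Hallo world, hello Python!"
pattern = "[Hh][ea]llo"

# match() only succeeds if the pattern matches at position 0
at_start = re.match(pattern, text)
not_at_start = re.match("hello", text)   # "hello" is not at the start
print(at_start.group())
print(not_at_start)

# search() finds the first match anywhere in the string
print(re.search("hello", text).span())

# findall() returns the matching strings,
# finditer() yields one match object per match
words = re.findall(pattern, text)
spans = [mo.span() for mo in re.finditer(pattern, text)]
print(words)
print(spans)
```

Note that finditer() gives access to the position of every match through .span(), which findall() does not.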

Compiling a pattern

For reasons of efficiency, if a pattern is going to be used repeatedly, it is best to compile it. This is done as follows:

In [5]:
rgx=re.compile("[Hh][ea]llo")
rgx.findall("Hallo world, hello Python!")
Out[5]:
['Hallo', 'hello']

The same search functions listed above are available as methods of the compiled pattern object.
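As a small sketch of why this is convenient, a compiled pattern can be applied to many strings in a loop without repeating the pattern text. The list of greetings is made up for illustration.

```python
import re

rgx = re.compile("[Hh][ea]llo")
greetings = ["Hallo world", "hello Python", "Hi there", "Oh, hello"]

# rgx.match() anchors at the start of each string, rgx.search() does not
starts_with = [s for s in greetings if rgx.match(s)]
contains = [s for s in greetings if rgx.search(s)]
print(starts_with)
print(contains)
```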

Beware of the backslash

Regular expressions are a powerful tool, though a bit tedious to learn. Besides matching very complex patterns, they can be used to split a string wherever a pattern matches and to substitute the matched text. I invite you to have a look at the official tutorial to get a feeling for what can be done: https://docs.python.org/2/howto/regex.html#regex-howto
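To give a flavour of splitting and substitution, here is a minimal sketch using re.split() and re.sub(). The input strings are made up for illustration, and the r prefix on the patterns is the raw string syntax explained later in this section.

```python
import re

line = "P04637;  P53_HUMAN ,Homo sapiens"

# split on a comma or semicolon, together with any surrounding whitespace
fields = re.split(r"\s*[;,]\s*", line)
print(fields)

# replace every run of whitespace (spaces, tabs) with a single space
tidy = re.sub(r"\s+", " ", "Cellular   tumor \t antigen")
print(tidy)
```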

As you will see, REGEXP syntax makes heavy use of backslashes. This is a problem in Python, because in an ordinary string literal a backslash introduces an escape sequence:

In [6]:
print("escape\nsequence")
escape
sequence

The solution is to use the Python "raw string" syntax by prepending an "r" to the string in question:

In [7]:
print(r"escape\nsequence")
escape\nsequence

To be on the safe side, you may want to put an "r" before all of the regular expressions you write. Example:

In [8]:
solomon="""
    Solomon Grundy,
    Born on a Monday,
    Christened on Tuesday,
    Married on Wednesday,
    Took ill on Thursday,
    Grew worse on Friday,
    Died on Saturday,
    Buried on Sunday.
    That was the end of,
    Solomon Grundy."""

# \w+ matches one or more word characters (letters, digits or underscore)
rgx=re.compile(r"\w+day")
rgx.findall(solomon)
Out[8]:
['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
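The pattern above would happen to work without the "r" as well, because Python does not recognise \w as an escape sequence and leaves it alone. Other REGEXP codes are less forgiving: \b means "word boundary" in a regular expression, but in an ordinary Python string it is the escape sequence for a backspace character. The following sketch illustrates the difference on one line of the rhyme; the variable names are illustrative.

```python
import re

text = "Born on a Monday"

# Without the r prefix, Python turns "\b" into a backspace character
# before the re module ever sees the pattern
plain = "\bon\b"
raw = r"\bon\b"

print(len(plain))                # backspace, o, n, backspace
print(len(raw))                  # \, b, o, n, \, b
print(re.findall(plain, text))   # looks for backspaces: nothing found
print(re.findall(raw, text))     # finds the standalone word "on"
```

Only the raw version finds the word "on"; the "on" inside "Monday" is skipped because it is not delimited by word boundaries.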

Matching PROSITE patterns

The Thioredoxin pattern listed on PROSITE with accession number PS00194 (http://prosite.expasy.org/PS00194) is the following:

[LIVMF]-[LIVMSTA]-x-[LIVMFYC]-[FYWSTHE]-x(2)-[FYWGTN]-C-[GATPLVE]-
[PHYWSTA]-C-{I}-x-{A}-x(3)-[LIVMFYWT].

We can easily translate this to a Python REGEXP:

r'[LIVMF][LIVMSTA]\w[LIVMFYC][FYWSTHE]\w\w[FYWGTN]C[GATPLVE][PHYWSTA]C[^I]\w[^A]\w\w\w[LIVMFYWT]'

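where "\w" matches any letter, digit or underscore (and therefore any residue in a protein sequence; "." would match any character at all), and for example [^I] will match any character except an I. The following code scans the chicken proteome for matches and prints out the header line, including the accession number, of the proteins that match.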

(the chicken proteome can be retrieved from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/proteomes/)

In [1]:
""" Browse the chicken proteome and find all proteins that match
PROSITE pattern PS00194 (THIOREDOXIN_1) """

import re

# Compile the regexp
PS00194=(r'[LIVMF][LIVMSTA]\w[LIVMFYC][FYWSTHE]\w\w[FYWGTN]'+
    r'C[GATPLVE][PHYWSTA]C[^I]\w[^A]\w\w\w[LIVMFYWT]')
rgx=re.compile(PS00194)

seq="" # build sequence here
header="" # name of protein

# the with statement closes the file automatically
with open("CHICK.fasta", "r") as infile:
    for line in infile:
        if line.startswith(">"): # this line is a header
            # search the protein we just read and print its header
            # if the pattern is found
            if rgx.search(seq) is not None:
                print(header)
            # update header and reset sequence
            header=line.rstrip()
            seq=""
        else:  # this line contains part of the sequence
            seq+=line.rstrip() # remove trailing newline

# process the last protein
if rgx.search(seq) is not None:
    print(header)
>tr|E1BRA6|E1BRA6_CHICK Uncharacterized protein OS=Gallus gallus GN=DNAJC10 PE=4 SV=2
>tr|E1BUP6|E1BUP6_CHICK Uncharacterized protein (Fragment) OS=Gallus gallus GN=PDIA5 PE=4 SV=2
>tr|E1BXX8|E1BXX8_CHICK Uncharacterized protein OS=Gallus gallus GN=TXNDC3 PE=4 SV=1
>tr|E1BZS8|E1BZS8_CHICK Uncharacterized protein OS=Gallus gallus GN=TXNL1 PE=4 SV=1
>tr|E1C549|E1C549_CHICK Uncharacterized protein OS=Gallus gallus GN=VPS13A PE=4 SV=2
>tr|E1C928|E1C928_CHICK Uncharacterized protein OS=Gallus gallus GN=TXNRD3 PE=3 SV=1
>tr|F1N9H3|F1N9H3_CHICK Protein disulfide-isomerase OS=Gallus gallus GN=P4HB PE=3 SV=2
>tr|F1NCD5|F1NCD5_CHICK Uncharacterized protein OS=Gallus gallus GN=TXN2 PE=4 SV=2
>tr|F1NDY9|F1NDY9_CHICK Protein disulfide-isomerase A4 OS=Gallus gallus GN=PDIA4 PE=3 SV=1
>tr|F1NK96|F1NK96_CHICK Uncharacterized protein OS=Gallus gallus GN=PDIA6 PE=3 SV=1
>tr|F1NLC7|F1NLC7_CHICK Uncharacterized protein OS=Gallus gallus GN=TXNDC12 PE=4 SV=2
>tr|F1P212|F1P212_CHICK Uncharacterized protein (Fragment) OS=Gallus gallus GN=TMX3 PE=4 SV=2
>tr|F1P4H4|F1P4H4_CHICK Uncharacterized protein OS=Gallus gallus GN=TXNDC5 PE=3 SV=1
>sp|P08629|THIO_CHICK Thioredoxin OS=Gallus gallus GN=TXN PE=3 SV=2
>sp|P09102|PDIA1_CHICK Protein disulfide-isomerase OS=Gallus gallus GN=P4HB PE=1 SV=3
>sp|P12244|GSBP_CHICK Dolichyl-diphosphooligosaccharide--protein glycotransferase OS=Gallus gallus PE=2 SV=2
>sp|Q8JG64|PDIA3_CHICK Protein disulfide-isomerase A3 OS=Gallus gallus GN=PDIA3 PE=2 SV=1
>tr|R4GFY2|R4GFY2_CHICK Uncharacterized protein OS=Gallus gallus GN=TMX4 PE=4 SV=1
>tr|R4GGT2|R4GGT2_CHICK Uncharacterized protein (Fragment) OS=Gallus gallus GN=LOC100857897 PE=4 SV=1
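
The translation from PROSITE notation to a Python REGEXP, done by hand above, can also be automated. The function below is a minimal sketch under stated assumptions: prosite_to_regex is a hypothetical helper, not part of any library, and it handles only the elements that appear in PS00194 (single residues, x, repeat counts such as x(2), [..] sets and {..} exclusions). It writes "." for x rather than "\w"; on sequences made only of letters the two behave the same. The test sequence is made up to satisfy the pattern and is not taken from the proteome.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into Python REGEXP syntax.

    Hypothetical helper for illustration: supports single residues,
    x, repeat counts like (2) or (2,4), [ABC] sets and {ABC} exclusions.
    """
    # drop line breaks and spaces, then the trailing full stop
    pattern = "".join(pattern.split()).rstrip(".")
    parts = []
    for element in pattern.split("-"):
        if not element:
            continue
        # separate an optional repeat count such as (2) or (2,4)
        mo = re.match(r"^(.*?)(?:\((\d+(?:,\d+)?)\))?$", element)
        core, count = mo.group(1), mo.group(2)
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # anything except these
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)

PS00194_PROSITE = """[LIVMF]-[LIVMSTA]-x-[LIVMFYC]-[FYWSTHE]-x(2)-[FYWGTN]-C-[GATPLVE]-
[PHYWSTA]-C-{I}-x-{A}-x(3)-[LIVMFYWT]."""

translated = prosite_to_regex(PS00194_PROSITE)
print(translated)

# made-up sequence: four filler residues, a fragment built to fit the
# pattern, three more filler residues
test_seq = "MAAA" + "VVDFSATWCGPCKMIKPFF" + "GGG"
hit = re.search(translated, test_seq)
print(hit.span())
```

The repeat counts are kept in the compact {n} form, so x(3) becomes .{3} rather than three separate symbols; both spellings match the same sequences.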