A common workflow with regular expressions is that you write a pattern for the thing you are looking for...
In short, this pattern describes an email address. With the above regex pattern, we can search through a text file to find email addresses, or verify whether a given string looks like an email address.
The most basic regex pattern is a token such as <b>, i.e. a single literal character. In the string "Zebra is an animal.", this will match the very first b, the one in Zebra. Note that, for now, it doesn't matter that the b sits in the middle of a word.
Now let me introduce a few very basic things used in regex syntax (remember the e-mail address pattern above; we will now break it down piece by piece).
In the regex discussed in this tutorial, there are 11 characters with special meanings: the opening square bracket <[>, the backslash <\>, the caret <^>, the dollar sign <$>, the period or dot <.>, the vertical bar or pipe symbol <|>, the question mark <?>, the asterisk or star <*>, the plus sign <+>, the opening round bracket <(> and the closing round bracket <)>. These special characters are often called "metacharacters".
Meta character | Description |
---|---|
. | Period matches any single character except a line break. |
[ ] | Character class. Matches any character contained between the square brackets. |
[^ ] | Negated character class. Matches any character that is not contained between the square brackets. |
* | Matches 0 or more repetitions of the preceding symbol. |
+ | Matches 1 or more repetitions of the preceding symbol. |
? | Makes the preceding symbol optional. |
{n,m} | Braces. Matches at least "n" but not more than "m" repetitions of the preceding symbol. |
(xyz) | Character group. Matches the characters xyz in that exact order. |
\| | Alternation. Matches either the characters before or the characters after the symbol. |
\ | Escapes the next character. This allows you to match reserved characters [ ] ( ) { } . * + ? ^ $ \ . |
^ | Matches the beginning of the input. |
$ | Matches the end of the input. |
If you want to match <1+1=2>, the correct regex is <1\+1=2>. Otherwise, the plus sign will have a special meaning. Note that <1+1=2>, with the backslash omitted, is a valid regex, so you will not get an error message. But it will not match <1+1=2>.
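The effect of the missing backslash is easy to check in Python. The snippet below is a small sketch; the sample strings are made up for illustration, and `re.escape` is shown as one convenient way to escape a literal string.

```python
import re

text = "1+1=2"

# Escaped plus: matches the literal text "1+1=2".
escaped = re.search(r"1\+1=2", text)

# Unescaped plus: "1+" means "one or more 1s", so the pattern
# looks for text such as "11=2" or "111=2" instead.
unescaped = re.search(r"1+1=2", text)
other = re.search(r"1+1=2", "123+111=234")

# re.escape builds the escaped pattern from a literal string.
auto = re.search(re.escape(text), "so 1+1=2 holds")

print(escaped.group())  # 1+1=2
print(unescaped)        # None
print(other.group())    # 111=2
print(auto.group())     # 1+1=2
```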
This is a very important point to understand: a regex-directed engine will always return the leftmost match, even if a better match could be found later. When applying a regex to a string, the engine will start at the first character of the string. It will try all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail, will the engine continue with the second character in the text. Again, it will try all possible permutations of the regex, in exactly the same order. The result is that the regex-directed engine will return the leftmost match.
When applying <cat> to a string such as "He captured a catfish for his cat.", the engine will try to match the first token in the regex <c> to the first character in the string, H. This fails. There are no other possible permutations of this regex, because it merely consists of a sequence of literal characters. So the regex engine tries to match the <c> with the e. This fails too, as does matching the <c> with the space. Arriving at the 4th character in the string, <c> matches c. The engine will then try to match the second token <a> to the 5th character, a. This succeeds too. But then, <t> fails to match p. At that point, the engine knows the regex cannot be matched starting at the 4th character in the string. So it will continue with the 5th: a. Again, <c> fails to match here and the engine carries on. At the 15th character in the string, <c> again matches c. The engine then proceeds to attempt to match the remainder of the regex at character 15 and finds that <a> matches a and <t> matches t.
The entire regex could be matched starting at character 15, and the engine is eager to report a match. It will therefore report the first three letters of catfish as a valid match. The engine never proceeds beyond this point to see if there are any better matches. The first match is considered good enough.
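The walkthrough above can be reproduced with `re.search`, which stops at the leftmost match. This is a minimal sketch; the sample sentence is an assumption chosen to fit the character positions described in the text. Note that Python counts from 0, so the "15th character" is index 14.

```python
import re

text = "He captured a catfish for his cat."

# re.search returns the leftmost match and then stops looking.
first = re.search(r"cat", text)
print(first.group(), first.start())  # cat 14

# re.finditer keeps going and reports every non-overlapping match.
starts = [m.start() for m in re.finditer(r"cat", text)]
print(starts)  # [14, 30]
```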
Character sets are also called character classes. Square brackets are used to specify character sets. Use a hyphen inside a character set to specify a range of characters. The order of the characters and ranges inside the square brackets doesn't matter. For example, the regular expression [Tt]he means: an uppercase T or lowercase t, followed by the letter h, followed by the letter e.
A period inside a character set, however, means a literal period. The regular expression <ar[.]> means: a lowercase character a, followed by letter r, followed by a period . character.
<ar[.]> => A garage is a good place to park a car.
<[0-9]> => Matches a single digit between 0 and 9. You can use more than one range.
<[0-9a-fA-F]> => Matches a single hexadecimal digit, case insensitively.
You can combine ranges and single characters. <[0-9a-fxA-FX]> matches a hexadecimal digit or the letter X. Again, the order of the characters and the ranges does not matter.
Find a word, even if it is misspelled, such as <sep[ae]r[ae]te> or <li[cs]en[cs]e>.
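The character class examples above can be tried out directly. The snippet is a short sketch; the test strings other than the garage sentence are made up for illustration.

```python
import re

sentence = "The car is parked in the garage."

# [Tt]he: upper- or lowercase t, then h, then e.
the_hits = re.findall(r"[Tt]he", sentence)

# Inside a class the dot is literal, so ar[.] only matches "ar" before a period.
dot_hit = re.search(r"ar[.]", "A garage is a good place to park a car.")

# Several ranges in one class: a single hexadecimal digit.
hex_hits = re.findall(r"[0-9a-fA-F]", "x1F")

# Tolerating common misspellings.
spellings = re.findall(r"sep[ae]r[ae]te", "separate seperate")

print(the_hits)         # ['The', 'the']
print(dot_hit.group())  # ar.
print(hex_hits)         # ['1', 'F']
print(spellings)        # ['separate', 'seperate']
```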
Typing a caret(^) after the opening square bracket will negate the character class. The result is that the character class will match any character that is not in the character class.
Note that a negated character class must still match a character. <q[^u]> does not mean: a q not followed by a u. It means: a q followed by a character that is not a u. It will not match the q in the string Iraq. It will match the q and the space after the q in "Iraq is a country".
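A quick check of this behaviour in Python, as a minimal sketch using the strings from the paragraph above:

```python
import re

pattern = re.compile(r"q[^u]")

# No character follows the q, so the negated class has nothing to match.
end_of_string = pattern.search("Iraq")

# Here the class matches the space after the q.
with_space = pattern.search("Iraq is a country")

# A u after the q is exactly what the class rejects.
with_u = pattern.search("question")

print(end_of_string)             # None
print(repr(with_space.group()))  # 'q '
print(with_u)                    # None
```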
Regular expressions provide convenient shorthands for commonly used character sets. The shorthand character sets are as follows:
Shorthand | Description |
---|---|
. | Any character except new line. It's the most commonly misused metacharacter. |
\w | Matches alphanumeric characters: [a-zA-Z0-9_] |
\W | Matches non-alphanumeric characters: [^\w] |
\d | Matches digit: [0-9] |
\D | Matches non-digit: [^\d] |
\s | Matches whitespace character: [\t\n\f\r\p{Z}] |
\S | Matches non-whitespace character: [^\s] |
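The shorthands in the table can be compared side by side on one string. This is a small sketch with a made-up sample string; note that Python's built-in `re` module does not support the `\p{Z}` notation shown for `\s`, and for `str` patterns its `\d` and `\w` also accept non-ASCII digits and letters unless `re.ASCII` is passed.

```python
import re

text = "Order 66 shipped_to: Bay 7\t(ok)"

digits = re.findall(r"\d+", text)    # runs of digits
words = re.findall(r"\w+", text)     # runs of letters, digits and underscores
chunks = re.findall(r"\S+", text)    # runs of non-whitespace
non_word = re.findall(r"\W", "a-b c")  # single non-word characters

print(digits)    # ['66', '7']
print(words)     # ['Order', '66', 'shipped_to', 'Bay', '7', 'ok']
print(chunks)    # ['Order', '66', 'shipped_to:', 'Bay', '7', '(ok)']
print(non_word)  # ['-', ' ']
```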
The metacharacters `+`, `*` and `?` are used to specify how many times a subpattern can occur. These metacharacters act differently in different situations.
The symbol `*` matches zero or more repetitions of the preceding matcher. The regular expression `a*` means: zero or more repetitions of the preceding lowercase character a. But if it appears after a character set or class, it repeats the whole character set. For example, the regular expression `[a-z]*` means: any number of lowercase letters in a row.
The `*` symbol can be used with the metacharacter `.` to match any string of characters: `.*`. The `*` symbol can be used with the whitespace character `\s` to match a string of whitespace characters. For example, the expression `\s*cat\s*` means: zero or more whitespace characters, followed by lowercase character c, followed by lowercase character a, followed by lowercase character t, followed by zero or more whitespace characters.
The symbol `+` matches one or more repetitions of the preceding character. For example, the regular expression `c.+t` means: lowercase letter c, followed by at least one character, followed by the lowercase character t. Note that because `+` is greedy, this t is the last t in the sentence.
In regular expressions, the metacharacter `?` makes the preceding character optional. This symbol matches zero or one instance of the preceding character. For example, the regular expression `[T]?he` means: an optional uppercase letter T, followed by the lowercase character h, followed by the lowercase character e.
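The three quantifiers described above can be seen in action with a few short calls. The sample sentences are assumptions picked for illustration.

```python
import re

# * : zero or more. The whitespace around "cat" is swallowed into the match.
star = re.findall(r"\s*cat\s*", "The fat  cat  sat")

# * also accepts zero repetitions, so the empty string matches [a-z]*.
empty_ok = re.fullmatch(r"[a-z]*", "") is not None

# + : one or more, and greedy, so the match runs up to the LAST t.
plus = re.search(r"c.+t", "The fat cat sat on the mat.")

# ? : the T is optional, so both "The" and the "he" inside "the" are found.
optional = re.findall(r"[T]?he", "The car is parked in the garage.")

print(star)          # ['  cat  ']
print(empty_ok)      # True
print(plus.group())  # cat sat on the mat
print(optional)      # ['The', 'he']
```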
The lazy star `*?` repeats the previous item zero or more times. It is lazy, so the engine first attempts to skip the previous item, before trying permutations with ever increasing matches of the preceding item.
Regex | Means |
---|---|
abc+ | matches a string that has ab followed by one or more c |
abc? | matches a string that has ab followed by zero or one c |
abc{2} | matches a string that has ab followed by 2 c |
abc{2,} | matches a string that has ab followed by 2 or more c |
abc{2,5} | matches a string that has ab followed by 2 up to 5 c |
a(bc)* | matches a string that has a followed by zero or more copies of the sequence bc |
a(bc){2,5} | matches a string that has a followed by 2 up to 5 copies of the sequence bc |
<.+> | matches <div>simple div</div> |
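The rows of the table, and the difference between greedy and lazy repetition, can be checked as follows. This is a minimal sketch; the lazy variants `<.+?>` and `a*?` are added here for comparison and are not rows of the table.

```python
import re

html = "<div>simple div</div>"

# Greedy: .+ grabs as much as possible, so the match spans both tags.
greedy = re.search(r"<.+>", html).group()

# Lazy: .+? grabs as little as possible, so the match stops at the first >.
lazy = re.search(r"<.+?>", html).group()

# A lazy star first tries to match nothing at all.
lazy_star = re.search(r"a*?", "aaa").group()

# Counted repetition applies to the preceding token or group only.
two_to_five_c = [s for s in ["abc", "abcc", "abccccc", "abcccccc"]
                 if re.fullmatch(r"abc{2,5}", s)]
grouped = re.fullmatch(r"a(bc){2,5}", "abcbc") is not None

print(greedy)         # <div>simple div</div>
print(lazy)           # <div>
print(repr(lazy_star))  # ''
print(two_to_five_c)  # ['abcc', 'abccccc']
print(grouped)        # True
```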
In regular expressions, the dot or period is one of the most commonly used metacharacters. Unfortunately, it is also the most commonly misused metacharacter. The dot is short for the negated character class <[^\n]> (UNIX regex flavors) or <[^\r\n]> (Windows regex flavors).
Use The Dot Sparingly
Put in a dot, and everything will match just fine when you test the regex on valid data. The problem is that the regex will also match in cases where it should not match.
Example: let's say we want to match a date in mm/dd/yy format, but we want to leave the user the choice of date separators. The quick solution is <\d\d.\d\d.\d\d>. Seems fine at first sight. It will match a date like 02/12/03, just what we intended. The trouble is that 02512703 is also considered a valid date by this regex, because the dot happily matches the 5 and the 7 as well.
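The date example can be tested in a couple of lines. The stricter pattern below, which swaps the dot for a character class of allowed separators, is one possible tightening and is an assumption of this sketch, not something prescribed by the text above.

```python
import re

loose = re.compile(r"\d\d.\d\d.\d\d")
# Hypothetical stricter variant: only -, space, / or . may separate the parts.
strict = re.compile(r"\d\d[- /.]\d\d[- /.]\d\d")

candidates = ["02/12/03", "02-12-03", "02512703"]

loose_ok = [c for c in candidates if loose.fullmatch(c)]
strict_ok = [c for c in candidates if strict.fullmatch(c)]

print(loose_ok)   # ['02/12/03', '02-12-03', '02512703']
print(strict_ok)  # ['02/12/03', '02-12-03']
```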
Anchors are a different breed. They do not match any character at all. Instead, they match a position before, after or between characters. They can be used to anchor the regex match at a certain position.
<^> matches the position before the first character in the string. Applying <^a> to abc matches a. <^b> will not match abc at all, because the <b> cannot be matched right after the start of the string, matched by <^>.
Similarly, <$> matches the position right after the last character in the string. <c$> matches c in abc, while <a$> does not match abc at all.
The word boundary \b matches positions where one side is a word character (usually a letter, digit or underscore, but see below for variations across engines) and the other side is not a word character (for instance, it may be the beginning of the string or a space character).
The regex `\bcat\b` would therefore match cat in `a black cat`, but it wouldn't match it in catatonic, tomcat or certificate. Removing one of the boundaries, `\bcat` would match cat in catfish, and `cat\b` would match cat in tomcat, but not vice versa. Both, of course, would match cat on its own.
Word boundaries are useful when you want to match a sequence of letters (or digits) on their own, or to ensure that they occur at the beginning or the end of a sequence of characters.
Be aware, though, that `\bcat\b` will not match cat in `_cat` or in `cat25`, because there is no boundary between an underscore and a letter, nor between a letter and a digit: these all belong to what regex defines as word characters.
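The anchor and word-boundary rules above can be verified with `re.search`. This is a small sketch that only uses the example strings from the text; the helper name `found` is made up for illustration.

```python
import re

def found(pattern, text):
    """Hypothetical helper: True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# ^ and $ match positions, not characters.
anchors = [found(r"^a", "abc"), found(r"^b", "abc"),
           found(r"c$", "abc"), found(r"a$", "abc")]

# \b on both sides: only a standalone "cat" qualifies.
whole_word = {w: found(r"\bcat\b", w)
              for w in ["a black cat", "catatonic", "tomcat",
                        "certificate", "_cat", "cat25"]}

# One boundary only.
prefix = found(r"\bcat", "catfish"), found(r"\bcat", "tomcat")
suffix = found(r"cat\b", "tomcat"), found(r"cat\b", "catfish")

print(anchors)     # [True, False, True, False]
print(whole_word)  # only 'a black cat' maps to True
print(prefix)      # (True, False)
print(suffix)      # (True, False)
```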
By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together. This allows you to apply a regex operator, e.g. a repetition operator, to the entire group. Only round brackets can be used for grouping. Square brackets define a character class, and curly braces are used by a special repetition operator.
Set(Value)? matches Set or SetValue.
- In the first case, the first backreference will be empty, because it did not match anything.
- In the second case, the first backreference will contain Value.
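In Python the two cases look like this. One detail that differs from the description above: when an optional group does not take part in the match, Python's `re` module reports it as `None` rather than as an empty string.

```python
import re

pattern = re.compile(r"Set(Value)?")

short = pattern.fullmatch("Set")
long_ = pattern.fullmatch("SetValue")

# Group 1 did not participate in the first match.
print(short.group(0), short.group(1))  # Set None
# Group 1 captured "Value" in the second match.
print(long_.group(0), long_.group(1))  # SetValue Value
```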
Backreferences allow you to reuse part of the regex match. Depending on the regex flavour you are using, you can reuse it inside the regular expression itself or afterwards.
Some regex flavours use \ , some flavours use $ , etc.
In Perl, you can use the magic variables $1, $2, etc. to access the part of the string matched by the backreference.
Regex: (\w+)\1 on the string seek will match ee.
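Python uses the backslash form, both inside the pattern and in the replacement text of `re.sub`. The sketch below reproduces the seek example and adds a doubled-word cleanup; the second example and its sample sentence are assumptions added for illustration.

```python
import re

# Inside the pattern: \1 must repeat exactly what group 1 captured.
m = re.search(r"(\w+)\1", "seek")
print(m.group(0), m.group(1))  # ee e

# In the replacement: \1 inserts what group 1 captured.
# Here a word that is immediately repeated is collapsed to one copy.
fixed = re.sub(r"\b(\w+) \1\b", r"\1", "this is is a test")
print(fixed)  # this is a test
```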
PS I am myself studying this section properly, hence couldn't add more details :))
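Suppose you want to use a regex to match a list of function names in a programming language: Get, GetValue, Set or SetValue.
Get|GetValue|Set|SetValue
Now look closely at both the regex and the string. The engine tries the alternatives from left to right and stops at the first one that succeeds, so on the string SetValue this regex matches only Set. Here are some other ways to do the same task that match the full name: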
Get(Value)?|Set(Value)?
\b(Get|GetValue|Set|SetValue)\b
\b(Get(Value)?|Set(Value)?)\b
\b(Get|Set)(Value)?\b
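Running each of the patterns above against SetValue shows why the plain alternation is a trap. This is a minimal sketch; the variable names are made up for illustration.

```python
import re

name = "SetValue"
patterns = [
    r"Get|GetValue|Set|SetValue",
    r"Get(Value)?|Set(Value)?",
    r"\b(Get|GetValue|Set|SetValue)\b",
    r"\b(Get(Value)?|Set(Value)?)\b",
    r"\b(Get|Set)(Value)?\b",
]

# Map each pattern to the text it actually matches.
results = {p: re.search(p, name).group() for p in patterns}

for p, matched in results.items():
    print(f"{p:40} -> {matched}")
# Only the first pattern stops early, at "Set": alternation takes the first
# alternative that works. The others reach "SetValue", either because the
# optional group is greedy or because \b rejects a match ending inside a word.
```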
Regex: <[^>]+>
- What it does: This finds any HTML tag, such as <a>, <b>, <img />, <br />, etc. You can use this to find segments that have HTML tags you need to deal with, or to remove all HTML tags from a text.
Regex: https?:\/\/[\w\.\/\-?=&%,]+
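Both regexes above can be tried on a small snippet of markup. This is a sketch with a made-up HTML string; reading the second regex as a URL finder is an assumption based on its shape, and stripping tags with a regex like this is only reasonable for simple, well-behaved markup.

```python
import re

html = '<p>Visit <a href="https://goo.gl/h1MfQV">this link</a><br /> today.</p>'

tag_re = re.compile(r"<[^>]+>")
url_re = re.compile(r"https?:\/\/[\w\.\/\-?=&%,]+")

tags = tag_re.findall(html)    # every tag, opening or closing
plain = tag_re.sub("", html)   # the text with all tags removed
urls = url_re.findall(html)    # anything that looks like an http(s) URL

print(tags)   # ['<p>', '<a href="https://goo.gl/h1MfQV">', '</a>', '<br />', '</p>']
print(plain)  # Visit this link today.
print(urls)   # ['https://goo.gl/h1MfQV']
```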
Regex: '\w+?'
Regex: ([-A-Za-z0-9_]*?([-A-Za-z_][0-9]|[0-9][-A-Za-z_])[-A-Za-z0-9_]*)
This can be very useful if you are translating documents that have a lot of alphanumeric codes or references in them, and you need to be able to find them easily.
Regex: \b(the|The)\b.*?\b(?=\W?\b(is|are|was|can|shall|must|that|which|about|by|at|if|when|should|among|above|under|$)\b)
This is particularly useful when you need to extract terminology. Suppose you have segments like these:
The Web based look up is our new feature. A project manager should not proofread... Our Product Name is...
- The Regex shown above would find anything between The and is, or should. With most texts, there is a good chance that anything this Regex finds is a good term that you can add to your Termbase.
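To see what this regex picks out, it can be run on the first sample segment. This is a sketch under the assumption that the regex behaves the same way in Python's `re` module as in the tool it was written for; the variable names are made up.

```python
import re

stop_words = ("is|are|was|can|shall|must|that|which|about|by|at|if|when|"
              "should|among|above|under|$")
# Lazy .*? expands until the lookahead sees one of the stop words next.
term_re = re.compile(r"\b(the|The)\b.*?\b(?=\W?\b(" + stop_words + r")\b)")

segment = "The Web based look up is our new feature."
m = term_re.search(segment)

print(m.group(0))  # The Web based look up
print(m.group(2))  # is  (the stop word seen by the lookahead)

# Dropping the leading article leaves the candidate term.
term = m.group(0)[len(m.group(1)):].strip()
print(term)        # Web based look up
```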
Regex: \b(a|an|A|An)\b.*?\b(?=\W?\b(is|are|was|can|shall|must|that|which|about|by|at|if|when|among|above|under|$)\b)
Regex: \b(this|these|This|These)\b.*?\b(?=\W?\b(is|are|was|can|shall|must|that|which|about|by|at|if|when|among|above|under|$)\b)
- What it does: This works much like the Regex shown above, except that it finds text that begins with this or these. This can also be very helpful when you need to extract terminology from a project.
Regex :(.*?)
re.sub(regex, replacement, subject) performs a search-and-replace across subject, replacing all matches of regex in subject with replacement. The result is returned by the sub() function. The subject string you pass is not modified. The re.sub() function applies the same backslash logic to the replacement text as is applied to the regular expression. Therefore, you should use raw strings for the replacement text.
%load_ext autoreload
%autoreload
import re, time
s = 'How do you do this'
print('After applying re.sub -- ', re.sub(r"How do you", "How do I", s), '\nOriginal Text is still -- ', s)
After applying re.sub -- How do I do this
Original Text is still -- How do you do this
So does that mean we have to write one regex at a time, run it, check the result, and only then apply the next substitution? In other words, can't we stack re.sub(), re.sub(), re.sub()...?
We can. Remember that re.sub() returns a new string with the matches of the pattern replaced, so the result can be fed straight into the next re.sub() call.
s_old = 'How do you do this'
print('After applying re.sub -- ',end='')
s_new = re.sub(r"How do you", "How do I", s_old)
print(f'\nOriginal Text isn\'t still **{s_old}** but it\'s now **{s_new}**')
#Obviously s_old and s_new are different, I am just trying to show that we can stack the operations....
After applying re.sub -- Original Text isn't still **How do you do this** but it's now **How do I do this**
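Stacking can be written either by nesting the calls or by looping over a list of rules. The snippet below is a minimal sketch; the second substitution (this to that) and the `rules` list are made up for illustration.

```python
import re

s_old = 'How do you do this'

# Nested: the inner call runs first and its result feeds the outer call.
s_nested = re.sub(r"this", "that", re.sub(r"How do you", "How do I", s_old))

# Sequential: apply a list of (pattern, replacement) rules in order.
rules = [(r"How do you", "How do I"), (r"this", "that")]
s_loop = s_old
for pattern, replacement in rules:
    s_loop = re.sub(pattern, replacement, s_loop)

print(s_nested)  # How do I do that
print(s_loop)    # How do I do that
print(s_old)     # How do you do this  (the original is untouched)
```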
tweet = '#fingerprint #Pregnancy Test https://goo.gl/h1MfQV #android +#apps +#beautiful \
#cute #health #igers #iphoneonly #iphonesia #iphone \
<3 ;D :( :-('
#Let's take care of emojis and the #(hash-tags)...
print(f'Original Tweet ---- \n {tweet}')
## Replacing #hashtag with only hashtag
tweet = re.sub(r'#(\S+)', r' \1 ', tweet)
# This gets a bit technical: we use a backreference and a shorthand character set to replace each match with its captured group.
# \S = [^\s] matches any character that isn't whitespace
print(f'\n Tweet after replacing hashtags ----\n {tweet}')
## Love -- <3, :*
tweet = re.sub(r'(<3|:\*)', ' EMO_POS ', tweet)
print(f'\n Tweet after replacing Emojis for Love with EMP_POS ----\n {tweet}')
# The parentheses are for grouping (remember the raw string, `r`), so we search
# either for <3 or (|) :\* (as * is a metacharacter, it is preceded by a backslash)
## Wink -- ;-), ;), ;-D, ;D, (;, (-;
tweet = re.sub(r'(;-?\)|;-?D|\(-?;)', ' EMO_POS ', tweet)
print(f'\n Tweet after replacing Emojis for Wink with EMP_POS ----\n {tweet}')
# The parentheses are for grouping as usual. First we focus on `;-)` and `;)`: we need a `;`,
# followed by either a `-` or nothing, which we express by putting `?` after the `-`. Hence the
# pattern starts with `;-?\)`, and similarly for the others.
## Sad -- :-(, : (, :(, ):, )-:
tweet = re.sub(r'(:\s?\(|:-\(|\)\s?:|\)-:)', ' EMO_NEG ', tweet)
print(f'\n Tweet after replacing Emojis for Sad with EMP_NEG ----\n {tweet}')
Original Tweet ----
#fingerprint #Pregnancy Test https://goo.gl/h1MfQV #android +#apps +#beautiful #cute #health #igers #iphoneonly #iphonesia #iphone <3 ;D :( :-(

Tweet after replacing hashtags ----
fingerprint Pregnancy Test https://goo.gl/h1MfQV android + apps + beautiful cute health igers iphoneonly iphonesia iphone <3 ;D :( :-(

Tweet after replacing Emojis for Love with EMP_POS ----
fingerprint Pregnancy Test https://goo.gl/h1MfQV android + apps + beautiful cute health igers iphoneonly iphonesia iphone EMO_POS ;D :( :-(

Tweet after replacing Emojis for Wink with EMP_POS ----
fingerprint Pregnancy Test https://goo.gl/h1MfQV android + apps + beautiful cute health igers iphoneonly iphonesia iphone EMO_POS EMO_POS :( :-(

Tweet after replacing Emojis for Sad with EMP_NEG ----
fingerprint Pregnancy Test https://goo.gl/h1MfQV android + apps + beautiful cute health igers iphoneonly iphonesia iphone EMO_POS EMO_POS EMO_NEG EMO_NEG
## Look at the output carefully: there are unnecessary extra spaces in between
## Replace multiple spaces with a single space
tweet = re.sub(r'\s+', ' ', tweet)
print(f'\n Tweet after replacing xtra spaces ----\n {tweet}')
## Remove the punctuation (+, ;)
tweet = re.sub(r'[^\w\s]','',tweet)
print(f'\n Tweet after replacing Punctuation + with PUNC ----\n {tweet}')
Tweet after replacing xtra spaces ----
fingerprint Pregnancy Test https://goo.gl/h1MfQV android + apps + beautiful cute health igers iphoneonly iphonesia iphone EMO_POS EMO_POS EMO_NEG EMO_NEG

Tweet after replacing Punctuation + with PUNC ----
fingerprint Pregnancy Test httpsgooglh1MfQV android apps beautiful cute health igers iphoneonly iphonesia iphone EMO_POS EMO_POS EMO_NEG EMO_NEG
# Bags of positive/negative smileys (you can extend the above example to handle these too; a good exercise)
positive_emojis = set([
":‑)",":)",":-]",":]",":-3",":3",":->",":>","8-)","8)",":-}",":}",":o)",":c)",":^)","=]","=)",":‑D",":D","8‑D","8D",
"x‑D","xD","X‑D","XD","=D","=3","B^D",":-))",";‑)",";)","*-)","*)",";‑]",";]",";^)",":‑,",";D",":‑P",":P","X‑P","XP",
"x‑p","xp",":‑p",":p",":‑Þ",":Þ",":‑þ",":þ",":‑b",":b","d:","=p",">:P", ":'‑)", ":')", ":-*", ":*", ":×"
])
negative_emojis = set([
":‑(",":(",":‑c",":c",":‑<",":<",":‑[",":[",":-||",">:[",":{",":@",">:(","D‑':","D:<","D:","D8","D;","D=","DX",":‑/",
":/",":‑.",'>:\\', ">:/", ":\\", "=/" ,"=\\", ":L", "=L",":S",":‑|",":|","|‑O","<:‑|"
])
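One possible way to approach the exercise is to build a single alternation from each bag instead of writing the regex by hand. The sketch below uses small made-up subsets of the bags; `build_pattern` is a hypothetical helper, and sorting the emoticons longest first is a design choice so that `:-))` is tried before `:-)`.

```python
import re

# Small illustrative subsets; swap in the full sets defined above.
positive = {":)", ":-)", ":D", ";)", ":-))"}
negative = {":(", ":-(", ":/"}

def build_pattern(emoticons):
    """Hypothetical helper: one regex that matches any emoticon in the set."""
    # re.escape protects metacharacters such as ( ) * | inside the emoticons.
    alternatives = sorted(emoticons, key=len, reverse=True)
    return re.compile("|".join(re.escape(e) for e in alternatives))

pos_re = build_pattern(positive)
neg_re = build_pattern(negative)

text = "great day :-)) but the commute :("
text = pos_re.sub(" EMO_POS ", text)
text = neg_re.sub(" EMO_NEG ", text)
text = re.sub(r"\s+", " ", text).strip()  # tidy up the extra spaces

print(text)  # great day EMO_POS but the commute EMO_NEG
```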
## Pattern to match any IP Addresses
pattern = r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'
The above pattern will also match 999.999.999.999, but that isn't a valid IP address at all.
How accurate the regex needs to be depends on the data at hand.
To restrict all 4 numbers in the IP address to 0..255, you can use this complex beast:
`\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b`
updated_pattern = r'\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b'
updated_pattern
'\\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\b'
if re.search(pattern, '999.999.999.999'): print('Matched')
if re.search(updated_pattern, '256.999.999.999'):
print('Matched')
else:
print('Not Matched')
Matched
Not Matched
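Since the same sub-pattern is repeated four times, it can be built once and joined, which makes the beast easier to read. This is a sketch of that idea; the names `octet` and `ip_re` are made up, and non-capturing groups `(?:...)` are used because the individual numbers are not needed here.

```python
import re

# One number in the range 0..255.
octet = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# Four octets separated by literal dots, between word boundaries.
ip_re = re.compile(r"\b" + r"\.".join([octet] * 4) + r"\b")

samples = ["192.168.0.1", "255.255.255.255", "999.999.999.999",
           "256.1.1.1", "192.168.0.256"]

# fullmatch validates a whole string rather than finding an address inside it.
valid = [s for s in samples if ip_re.fullmatch(s)]
print(valid)  # ['192.168.0.1', '255.255.255.255']

# search finds addresses embedded in longer text.
found = ip_re.search("ping 10.0.0.7 failed").group()
print(found)  # 10.0.0.7
```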
#Valid Dates..
pattern = r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])'
This matches a date in yyyy-mm-dd format between 1900-01-01 and 2099-12-31, with a choice of four separators (space included).
The year is matched by `(19|20)\d\d`.
The month is matched by `(0[1-9]|1[012])` (the round brackets are necessary to keep both options together): the first option matches a number between 01 and 09, and the second matches 10, 11 or 12.
The last part of the regex consists of three options. The first matches the numbers 01 through 09, the second 10 through 29, and the third matches 30 or 31.
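A short check of the date pattern, including two of its limits: it does not know how many days each month has, and it allows the two separators to differ. The sample dates are made up for illustration.

```python
import re

pattern = r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])'
date_re = re.compile(pattern)

samples = ["1999-12-31", "2024/02/30", "2100-01-01", "2019-13-01", "2019 07.04"]
accepted = [s for s in samples if date_re.fullmatch(s)]

# 2024/02/30 passes (the regex cannot tell February has no 30th),
# 2100-01-01 fails on the year, 2019-13-01 fails on the month.
print(accepted)  # ['1999-12-31', '2024/02/30', '2019 07.04']

# The three groups are century, month and day (the year is only partly captured).
groups = date_re.fullmatch("1999-12-31").groups()
print(groups)    # ('19', '12', '31')
```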