from tock import *
Regular expressions were invented by Stephen Kleene (pronounced clay-knee) back in the 1950s as a characterization of the languages recognized by the earliest neural networks. But they became widely used through various Unix tools, like `grep`, which is where you have likely encountered them.
There are some differences between Unix regular expressions and true regular expressions. True regular expressions have three operations: concatenation, union ($\cup$ in the book, `|` in Unix), and Kleene star (`*`). The order of operations is star first, then concatenation, then union. Use parentheses to change the order of operations, just as in arithmetic expressions.
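Python's `re` module follows the same precedence rules, so it can be used as a quick sanity check on how an expression is parsed (this is just an illustration, not part of the book's formalism):

```python
import re

# Star binds tightest, then concatenation, then union,
# so 'ab|c*' is parsed as '(ab)|(c*)'.
print(bool(re.fullmatch(r'ab|c*', 'ab')))     # True: the (ab) branch
print(bool(re.fullmatch(r'ab|c*', 'ccc')))    # True: the (c*) branch
print(bool(re.fullmatch(r'ab|c*', 'abc')))    # False: neither branch matches
print(bool(re.fullmatch(r'(ab|c)*', 'abc')))  # True: parentheses change the parse
```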
Unix regular expressions use a dot (`.`) to match any symbol; the book uses $\Sigma$ for this purpose.
Another difference is that Unix regular expressions usually match anywhere in a string, whereas true regular expressions must match the entire string. When using `grep`, use the `-Ex` flags to approximate true regular expressions (`-E` to get the union operator, `-x` to match the whole line).
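Python's `re` module makes the same distinction: `re.search` matches anywhere in the string (like plain `grep`), while `re.fullmatch` requires the whole string to match (like `grep -x`). A quick illustration:

```python
import re

# search: the pattern may match anywhere (Unix-style)
print(bool(re.search(r'jedi', 'the last jedi')))            # True
# fullmatch: the pattern must match the entire string
print(bool(re.fullmatch(r'jedi', 'the last jedi')))         # False
print(bool(re.fullmatch(r'the .* jedi', 'the last jedi')))  # True
```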
%%file movies.txt
the phantom menace
attack of the clones
revenge of the sith
a new hope
the empire strikes back
return of the jedi
the force awakens
rogue one
the last jedi
solo
the rise of skywalker
the motion picture
the wrath of khan
the search for spock
the voyage home
the final frontier
the undiscovered country
generations
first contact
insurrection
nemesis
into darkness
beyond
!grep -Ex '.*(m | (t|n)|b).*' movies.txt
Question. Can you fix the regular expression so that it also accepts the force awakens, the last jedi, the rise of skywalker, but not
Regular expressions are the third "model of computation" of this course, and they, too, are equivalent to both DFAs and NFAs. The proof involves algorithms to convert between regular expressions and NFAs. Today we're converting regular expressions to NFAs, which is not as difficult as the subset construction from last time.
The algorithm in the book is a slight variation on the algorithm from a paper by Ken Thompson, one of the co-creators of Unix, for the QED editor, from which editors like `vi` are descended. (But many modern regular expression engines actually do not use this algorithm; they use a backtracking algorithm that is asymptotically much slower!)
Here's a step by step visualization of the example from the book. You can try different regular expressions to see how they get converted. (Note that in Tock, unlike in Unix tools, symbols in regular expressions are separated by spaces.)
m = from_regexp('(a b|a)*', display_steps=True)
m = from_regexp('(a|b)* a b a', display_steps=True)
Question. If a regular expression has $n$ symbols, how big will the resulting NFA be?
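As a hint, here is a back-of-the-envelope count, assuming the book's construction (a single symbol becomes a two-state NFA; union and star each add one new start state; concatenation adds no states). The tuple-based syntax trees are an ad-hoc representation invented for this sketch:

```python
# Count the states of the NFA built by the book's construction
# from a regular-expression syntax tree.
def thompson_states(tree):
    op = tree[0]
    if op == 'sym':                    # a single alphabet symbol
        return 2
    if op == 'cat':                    # concatenation: no new states
        return thompson_states(tree[1]) + thompson_states(tree[2])
    if op == 'union':                  # union: one new start state
        return 1 + thompson_states(tree[1]) + thompson_states(tree[2])
    if op == 'star':                   # star: one new start/accept state
        return 1 + thompson_states(tree[1])

# (a|b)* a b a — five symbol occurrences, four operators
ab_star = ('star', ('union', ('sym',), ('sym',)))
tree = ('cat', ('cat', ('cat', ab_star, ('sym',)), ('sym',)), ('sym',))
print(thompson_states(tree))  # 12
```

Since each symbol and each operator contributes at most a constant number of states, the NFA is linear in the size of the expression.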
For a very cool "real-time" visualization of (a slightly different version of) this construction, see Debuggex.
Read or reread Lemma 1.60 and its proof, which (in my opinion) is the most complicated one so far.
Today we're completing the proof from last time by converting NFAs to regular expressions. The algorithm for this is known as the state elimination algorithm, because it eliminates the states of the NFA one by one.
To make this construction simpler, the new concept of a generalized NFA (GNFA) is introduced. In this course, GNFAs are a "throwaway" formalism; they exist just to make this proof possible. But elsewhere they might be useful in their own right. A GNFA is an NFA whose transitions can be labeled with regular expressions instead of symbols. A GNFA can follow a transition if the next $k \geq 0$ symbols match the regular expression; when it follows the transition, it consumes those $k$ symbols.
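Concretely, "the next $k \geq 0$ symbols match" can be checked with Python's `re` (the edge label and input string below are made up for illustration):

```python
import re

# A GNFA edge labeled ab* can be followed by consuming any prefix of
# the remaining input that matches the whole expression.
remaining = 'abbc'
ks = [k for k in range(len(remaining) + 1)
      if re.fullmatch(r'ab*', remaining[:k])]
print(ks)  # [1, 2, 3] — consuming 1, 2, or 3 symbols is allowed
```

So, unlike an ordinary NFA, a single GNFA transition can consume several symbols (or none) at once.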
The algorithm first changes the NFA into a GNFA that has a single start state with no incoming edges and a single accept state with no outgoing edges. This makes the final step of the algorithm simpler. The book also creates a lot of transitions labeled $\emptyset$; these are not shown in the figures below, and they're not shown in the book, either. In fact, as far as I'm concerned, you can pretend they don't exist.
The algorithm then eliminates the (non-start, non-accept) states one by one. When there is just one transition left, its label is the final answer. (If you're not using $\emptyset$ transitions, then there is a special case: if there are no transitions left, then the answer is $\emptyset$.) The order of elimination is arbitrary, although some orders may lead to more compact regular expressions than others. Tock eliminates states in reverse alphabetical order, for no particularly good reason.
Let's start with a simpler example than the one in the book, an NFA whose state diagram is acyclic.
m = read_csv('acyclic.csv')
m
Eliminating 4 is not very interesting; let's skip it for now. Eliminating 3 concatenates the `b` and `c` transitions into a single `bc` transition. Then, eliminating 2 concatenates the `a` and `bc` transitions into a single `abc` transition. But now there are two parallel transitions from state 1 to the accept state, so these are unioned into a single `a|abc` transition. Finally, eliminating 1 is not very interesting.
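The steps above can be sketched in plain Python, treating the GNFA as a dictionary mapping edges to regular-expression strings (the representation and helper names are invented for this sketch; Tock's actual implementation differs):

```python
def paren(r):
    # Parenthesize unless the expression is a single symbol (or empty).
    return r if len(r) <= 1 else '(' + r + ')'

def eliminate(R, order, start, accept):
    """State elimination: R maps (p, q) to a regexp string; a missing
    pair means no transition, and '' stands for epsilon."""
    R = dict(R)
    for s in order:
        loop = R.pop((s, s), None)              # self-loop becomes a star
        loop = paren(loop) + '*' if loop else ''
        ins  = [(p, r) for (p, q), r in R.items() if q == s and p != s]
        outs = [(q, r) for (p, q), r in R.items() if p == s and q != s]
        for p, rin in ins:                      # all in/out combinations
            for q, rout in outs:
                new = paren(rin) + loop + paren(rout)
                R[p, q] = R[p, q] + '|' + new if (p, q) in R else new
        R = {e: r for e, r in R.items() if s not in e}
    return R.get((start, accept), '∅')

# The acyclic example: S and A are the added GNFA start and accept states.
R = {('S', '1'): '', ('1', '2'): 'a', ('2', '3'): 'b',
     ('3', '4'): 'c', ('1', '4'): 'a', ('4', 'A'): ''}
print(eliminate(R, ['4', '3', '2', '1'], 'S', 'A'))  # (a|a(bc))
```

Up to redundant parentheses, the result is the same `a|abc` as in the walkthrough.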
The general procedure for eliminating a state is shown in Figure 1.63. Now here's an example with a cycle in it; these give rise to Kleene stars in the resulting expression.
m66 = read_csv('sipser-1-66.csv')
m66
Eliminating state 2 looks similar to before, but the self-loops turn into subexpressions inside Kleene stars.
One more example:
m68 = read_csv('sipser-1-68.csv')
m68
(The answer is different from the book's because the elimination order was different.) The thing to note here is that if there are multiple incoming transitions and/or multiple outgoing transitions, you must consider all possible combinations. For example, when eliminating state 3, there are two outgoing transitions (to states 1 and 2), so a self-loop from 1 to 1 is created, and the transition from 1 to 2 is modified.
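The "all possible combinations" point can be seen concretely: eliminating a state with incoming edges from $m$ states and outgoing edges to $n$ states creates or updates $m \times n$ edges (the edge labels below are made up for illustration):

```python
from itertools import product

# Hypothetical state s with a starred self-loop, edges in from states
# 1 and 2, and edges out to states 1 and 2.
ins  = {1: 'a', 2: 'b'}     # labels on edges into s, keyed by source
outs = {1: 'c', 2: 'd'}     # labels on edges out of s, keyed by target
loop = '(e)*'               # the self-loop on s, already starred

updates = {(p, q): ins[p] + loop + outs[q] for p, q in product(ins, outs)}
print(updates)
# 2 × 2 = 4 edges touched, including the new self-loop on state 1
```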