Week 6: CFGs and PDAs are equivalent

In [1]:
from tock import *
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches, matplotlib.lines as lines

Monday reading

Read Section 2.2, Lemma 2.21.

Tuesday class

This week we'll show that the two models we learned last week, context-free grammars and pushdown automata, are equivalent. Today we will show how to convert a context-free grammar to a pushdown automaton, which is important because it is the basis for a lot of parsing algorithms (algorithms that take a string as input, decide whether it belongs to the language, and if so, generates a tree as output).

The top-down construction

The construction used in the proof of Lemma 2.21 is known as top-down or sometimes "nondeterministic LL" parsing.

The basic idea is pretty simple, and probably easier to describe first without getting into the details of the PDA. The stack is initialized to $S\mathtt{$}$ (remember that the top of the stack is on the left).

Whenever the top stack symbol is a terminal symbol and it matches the next input symbol, we pop it and read in the input symbol. If it doesn't match, then this path of the derivation rejects.

Whenever the top stack symbol is a nonterminal symbol, we pop it and nondeterministically push all possible replacements for the nonterminal. Each replacement is pushed in reverse order, so that the leftmost symbol is on the top.

If we reach the end of the input string and the stack just has $\mathtt{$}$, then we accept.

Here's an example grammar:

\begin{align*} S &\rightarrow \mathtt{a} T \mathtt{b} \\ S &\rightarrow \mathtt{b} \\ T &\rightarrow T \mathtt{a} \\ T &\rightarrow \varepsilon \end{align*}

Here's what the successful parse looks like for string aaab:

Input Stack
aaab S$
aaab aTb$
aab Tb$
aab Tab$
aab Taab$
aab aab$
ab ab$
b b$
$\varepsilon$ $

There are also many unsuccessful parses, but as long as one of them succeeds, we accept the string.

The conversion from CFG to PDA basically implements the above algorithm in the PDA. This construction is implemented in Tock:

In [2]:
g = Grammar.from_lines(["S -> a T b",
                        "S -> b",
                        "T -> T a",
                        "T -> &"])
p1 = from_grammar(g)
to_graph(p1)
%3 _START 0 start _START->0 1 0.1 0->1 ε,ε → $ 2 loop 1->2 ε,ε → S 2->2 ε,S → b ε,T → ε a,a → ε b,b → ε 3 1.2 2->3 ε,S → b 5 3.1 2->5 ε,T → a 6 accept 2->6 ε,$ → ε 4 1.1 3->4 ε,ε → T 4->2 ε,ε → a 5->2 ε,ε → T
In [3]:
run(p1, "a a a b")
%3 _START 5 start,ε _START->5 0 1 0->1 a 2 1->2 a 3 2->3 a 4 3->4 b 6 0.1,$ 5->6 7 loop,[S] $ 6->7 8 1.2,[b] $ 7->8 9 loop,[b] $ 7->9 10 1.1,[T] b $ 8->10 11 loop,[a] T b … 10->11 12 loop,[T] b $ 11->12 13 3.1,[a] b $ 12->13 14 loop,[b] $ 12->14 15 loop,[T] a b … 13->15 16 3.1,[a] a b … 15->16 17 loop,[a] b $ 15->17 18 loop,[T] a a … 16->18 19 loop,[b] $ 17->19 20 3.1,[a] a a … 18->20 21 loop,[a] a b … 18->21 23 loop,[a] a a … 18->23 20->18 20->23 22 loop,[a] b $ 21->22 24 loop,[b] $ 22->24 26 loop,[a] a b … 23->26 28 loop,[a] a a … 23->28 25 loop,$ 24->25 27 accept,ε 25->27 29 loop,[a] b $ 26->29 30 loop,[a] a a … 28->30 31 loop,[a] a b … 28->31

Question. Convert the following CFG to a PDA:

\begin{align*} S &\rightarrow \mathtt{0} S \mathtt{0} \\ S &\rightarrow \mathtt{1} S \mathtt{1} \\ S &\rightarrow \varepsilon \end{align*}

Question. If you actually had to implement a parser this way, how would you do it? What would its time complexity be?

The bottom-up construction (optional)

There's another parsing strategy that the book doesn't mention at this point. It's called bottom-up, shift-reduce, or maybe sometimes "nondeterministic LR" parsing. It's the basis for most parsing algorithms that are used in compilers.

The idea is again pretty simple -- it's like top-down parsing in reverse. The stack is initialized to $\mathtt{\$}$. At any point in time, we can do two operations.

In a shift, we read in one input symbol and push it onto the stack.

In a reduce, we check to see if the prefix (top symbols) of the stack match a right-hand-side of a rule (in reverse order), and if so, we can pop those symbols and replace them with the left-hand-side of the rule.

This algorithm is again nondeterministic: it's always possible to do a shift unless we're at the end of the string, and it may be possible to do several different reduces.

If we reach the end of the input and the stack has just $S\mathtt{\$}$, then we accept.

Here's what the successful parse looks like for string aaab:

Input Stack
aaab $
aab a$
aab Ta$
ab aTa$
ab Ta$
b aTa$
b Ta$
$\varepsilon$ bTa$
$\varepsilon$ S$
In [4]:
p2 = from_grammar(g, mode="bottomup")
to_graph(p2)
%3 _START 3 start _START->3 0 accept 1 0.1 1->0 ε,$ → ε 2 3.1 4 loop 2->4 ε,T → T 3->4 ε,ε → $ 4->1 ε,S → ε 4->2 ε,a → ε 4->4 ε,b → S ε,ε → T a,ε → a b,ε → b 5 1.2 4->5 ε,b → ε 6 1.1 5->6 ε,T → ε 6->4 ε,a → S

Wednesday reading

Read Section 2.2, Lemma 2.27.

Thursday class

The conversion from a PDA to a CFG is probably the trickiest to understand of all the constructions we do in this class. Fortunately, though, it's not very difficult to perform.

Example

Let's start with an example PDA, one that recognizes the language of balanced parentheses but using a and b for left and right parentheses:

In [5]:
mc = read_csv("pda-parens.csv")
to_graph(mc)
%3 _START 2 q1 _START->2 0 q2 0->0 a,ε → # b,# → ε 1 q3 0->1 ε,$ → ε 2->0 ε,ε → $
In [6]:
w = list("aaaabbbaabbb")
r = run(mc, w, show_stack=100)
r
%3 _START 27 q1,ε _START->27 0 q2,[#] $ 13 q2,[#] # $ 0->13 1 q2,[#] # $ 1->0 2 q2,[#] # # $ 3 q2,[#] # $ 2->3 4 q2,[#] $ 3->4 5 q2,$ 4->5 6 q3,ε 5->6 7 q2,[#] $ 9 q2,[#] # $ 7->9 8 q3,ε 10 q2,[#] # # $ 9->10 11 q2,[#] # # # $ 10->11 12 q2,[#] # # $ 11->12 12->1 13->2 14 16 14->16 a 15 18 15->18 a 17 16->17 a 17->15 a 19 18->19 b 20 19->20 b 22 20->22 b 21 23 21->23 a 22->21 a 24 23->24 b 25 24->25 b 26 25->26 b 28 q2,$ 27->28 28->7 28->8

The book uses plots of stack height over time (Figure 2.28 and 2.29) to help illustrate the construction. Here's a function that produces similar plots (you don't need to understand this):

In [7]:
def plot_height(ax, path):
    n = len(path)
    heights = [len(c[2]) for c in path]
    bars = ax.bar(np.arange(n), heights)
    ax.set_xticks(np.arange(n-1)+0.5)
    labels = []
    for i in range(n-1):
        if len(path[i][1]) > len(path[i+1][1]):
            labels.append(path[i][1][0])
        else:
            labels.append("")
    ax.set_xticklabels(labels)
    ax.set_xlabel("input string")
    ax.set_yticks(np.arange(max(heights)+1))
    ax.set_ylim(ymax=max(heights)+0.5)
    ax.set_ylabel("stack height")
    
    for c, b in zip(path, bars):
        h = b.get_height()
        ax.text(b.get_x() + b.get_width()/2., h, c[0], ha="center", va="bottom")
In [8]:
fig, ax = plt.subplots()
plot_height(ax, r.shortest_path())
plt.show()

Each bar (including bars with zero height) represents a configuration of the machine as it reads the string. The machine's state is written above the bar. The horizontal axis shows the input symbols that are read in (if any) and the vertical axis is the height of the stack. In this case, notice how the stack grows whenever an a is read and shrinks whenever a b is read.

You can try changing the input string w or even the PDA to see how the above graph changes. (However, the figures below will get messed up.)

A sequence of configurations like this is called a run. Let's define a sub-run to be a contiguous subsequence of a run that begins and ends with the same stack $t$ and does not pop any symbols in $t$ (not even if it pushes the same symbols back).

For example, the blue rectangle below highlights a sub-run:

In [9]:
fig, ax = plt.subplots()
plot_height(ax, r.shortest_path())
ax.add_patch(patches.Rectangle((2,2),10,3.5,alpha=0.3))
plt.show()

But this isn't a sub-run, because it doesn't start and end with the same stack:

In [10]:
fig, ax = plt.subplots()
plot_height(ax, r.shortest_path())
ax.add_patch(patches.Rectangle((2,2),11,3.5,alpha=0.3))
ax.add_line(lines.Line2D([12.5,13.5],[1,2], color="red"))
ax.add_line(lines.Line2D([12.5,13.5],[2,1], color="red"))
plt.show()

And this isn't a sub-run, because it pops a symbol from the starting/ending stack:

In [11]:
fig, ax = plt.subplots()
plot_height(ax, r.shortest_path())
ax.add_patch(patches.Rectangle((3,3),8,2.5,alpha=0.5))
ax.add_line(lines.Line2D([7.5,8.5],[2,3], color="red"))
ax.add_line(lines.Line2D([7.5,8.5],[3,2], color="red"))
plt.show()

The idea behind the conversion to CFG is that sub-runs become sub-trees. A sub-run that starts in state $q$ and ends in state $r$ is going to become a sub-tree that is rooted in the nonterminal $A_{qr}$.

The whole run is a sub-run. It is generated by the nonterminal symbol $A_{q_1q_3}$, which we make the start symbol. We can picture it as below:

In [12]:
fig, ax = plt.subplots()
plot_height(ax, r.shortest_path())
ax.add_patch(patches.Rectangle((0,0),14,5.5,alpha=0.3))
plt.show()

Because the run starts and ends with epsilon transitions, there's also a sub-run that covers the whole string but begins in state $q_2$ and ends in state $q_2$:

In [13]:
fig, ax = plt.subplots()
plot_height(ax, r.shortest_path())
ax.add_patch(patches.Rectangle((0,0),14,5.5,alpha=0.3))
ax.add_patch(patches.Rectangle((1,1),12,4.5,alpha=0.3))
plt.show()

So we should add a rule $A_{q_1q_3} \rightarrow A_{q_2q_2}$. The next smallest sub-run is the one that covers the whole string except the first and last symbols. It, too, begins in state $q_2$ and ends in state $q_2$:

In [14]:
fig, ax = plt.subplots()
plot_height(ax, r.shortest_path())
ax.add_patch(patches.Rectangle((1,1),12,4.5,alpha=0.3))
ax.add_patch(patches.Rectangle((2,2),10,3.5,alpha=0.3))
plt.show()

So we should add a rule $A_{q_2q_2} \rightarrow \texttt{a} A_{q_2q_2} \texttt{b}$. Now what is the next smallest sub-run? Notice that if we lop off the first and last symbols again, the result wouldn't be a sub-run:

In [15]:
fig, ax = plt.subplots()
plot_height(ax, r.shortest_path())
ax.add_patch(patches.Rectangle((2,2),10,3.5,alpha=0.3))
ax.add_patch(patches.Rectangle((3,3),8,2.5,alpha=0.5))
ax.add_line(lines.Line2D([7.5,8.5],[2,3], color="red"))
ax.add_line(lines.Line2D([7.5,8.5],[3,2], color="red"))
plt.show()

Instead, there are two next-smallest sub-runs:

In [16]:
fig, ax = plt.subplots()
plot_height(ax, r.shortest_path())
ax.add_patch(patches.Rectangle((2,2),10,3.5,alpha=0.3))
ax.add_patch(patches.Rectangle((2.2,2),5.7,3.3,alpha=0.5))
ax.add_patch(patches.Rectangle((8.1,2),3.7,2.5,alpha=0.5))
plt.show()

So we should add a rule $A_{q_2q_2} \rightarrow A_{q_2q_2} A_{q_2q_2}$.

Finally, there are also sub-runs of length 0, so we should add a rule $A_{q_2q_2} \rightarrow \varepsilon$.

The general construction

The PDA to CFG construction does this, but not for a particular string; it does this for all possible strings. It starts by putting the PDA into a well-behaved form:

  1. If there is more than one accept state, create a new accept state $f'$, make all the old accept states into non-accept states, and add $\varepsilon,\varepsilon\rightarrow\varepsilon$ transitions from every old accept state to $f'$ like this:

Making the PDA have a single accept state

  1. If the PDA potentially accepts with a non-empty stack, add states and transitions like this:

Making the PDA empty the stack before accepting

where state $e$ has many self-loops, one for each stack symbol $x \in \Gamma$.

  1. If the PDA has any transitions that both push and pop, break them into two transitions like this:

Removing a pop-push transition

Similarly, if the PDA has any transitions that neither push nor pop, break them into two transitions like this:

Removing a no-op transition

Then, the construction creates CFG rules of three types.

  1. For each pair of transitions of the form

A pair of push and pop transitions

(where $a$ could be $\epsilon$), create the rule

$$ A_{pq} \rightarrow a A_{rs} b $$
  1. For all states $p, q, r$, create the rule
$$ A_{pq} \rightarrow A_{pr} A_{rq} $$
  1. For each state $p$, create the rule
$$ A_{pp} \rightarrow \varepsilon $$

Here's what the construction looks like on our example PDA:

In [17]:
to_grammar(mc)
Out[17]:
start: (start,accept)
(q1,q3) → (q2,q2)
(q1,empty) → (q2,q3)
(q1,empty) → (q2,empty)
(start,accept) → (q1,q3)
(start,accept) → (q1,empty)
(q2,q2) → a (q2,q2) b
(q2,empty) → a (q2,q3)
(q2,empty) → a (q2,empty)
(q1,q1) → (q1,q1) (q1,q1)
(q1,q1) → (q1,q2) (q2,q1)
(q1,q1) → (q1,q3) (q3,q1)
(q1,q2) → (q1,q1) (q1,q2)
(q1,q2) → (q1,q2) (q2,q2)
(q1,q2) → (q1,q3) (q3,q2)
(q1,q3) → (q1,q1) (q1,q3)
(q1,q3) → (q1,q2) (q2,q3)
(q1,q3) → (q1,q3) (q3,q3)
(q2,q1) → (q2,q1) (q1,q1)
(q2,q1) → (q2,q2) (q2,q1)
(q2,q1) → (q2,q3) (q3,q1)
(q2,q2) → (q2,q1) (q1,q2)
(q2,q2) → (q2,q2) (q2,q2)
(q2,q2) → (q2,q3) (q3,q2)
(q2,q3) → (q2,q1) (q1,q3)
(q2,q3) → (q2,q2) (q2,q3)
(q2,q3) → (q2,q3) (q3,q3)
(q3,q1) → (q3,q1) (q1,q1)
(q3,q1) → (q3,q2) (q2,q1)
(q3,q1) → (q3,q3) (q3,q1)
(q3,q2) → (q3,q1) (q1,q2)
(q3,q2) → (q3,q2) (q2,q2)
(q3,q2) → (q3,q3) (q3,q2)
(q3,q3) → (q3,q1) (q1,q3)
(q3,q3) → (q3,q2) (q2,q3)
(q3,q3) → (q3,q3) (q3,q3)
(q1,q1) → ε
(q2,q2) → ε
(q3,q3) → ε

(Tock writes the state $A_pq$ as $(p, q)$.) The grammar is really big! The first reason is that Tock added states to make the PDA empty the stack before accepting, even though it already does so. The second reason is that there are many rules of the second type. To avoid having to write so many, we can skip any rules involving nonterminals that we know can never be used. Here's one way to do that. First, create rules of the first and third type:

\begin{align*} A_{13} &\rightarrow A_{22} \\ A_{22} &\rightarrow \texttt{a} A_{22} \texttt{b} \\ A_{11} &\rightarrow \epsilon \\ A_{22} &\rightarrow \epsilon \\ A_{33} &\rightarrow \epsilon \end{align*}

Then, look at the left-hand sides of the rules. If there is a left-hand side $A_{pr}$ and a left-hand side $A_{rs}$, create the rule $A_{pq} \rightarrow A_{pr} A_{rs}$.

\begin{align*} A_{13} & \rightarrow A_{11} A_{13} \\ A_{13} & \rightarrow A_{13} A_{33} \\ A_{11} & \rightarrow A_{11} A_{11} \\ A_{22} & \rightarrow A_{22} A_{22} \\ A_{33} & \rightarrow A_{33} A_{33} \end{align*}

Repeat until no more rules can be created -- in this case, no more rules can be created after the first pass.