Week 6: 2018/02/26-03/02

In [1]:
from tock import *
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches, matplotlib.lines as lines

Monday reading

Read Section 2.2, Lemma 2.21.

Tuesday class

This week we'll show that the two models we learned last week, context-free grammars and pushdown automata, are equivalent. Today we will show how to convert a context-free grammar to a pushdown automaton, which is important because it is the basis for a lot of parsing algorithms (algorithms that take a string as input, decide whether it belongs to the language, and if so, generates a tree as output).

The top-down construction

The construction used in the proof of Lemma 2.21 is known as top-down or sometimes "nondeterministic LL" parsing.

The basic idea is pretty simple, and probably easier to describe first without getting into the details of the PDA. The stack is initialized to $S\mathtt{$}$ (remember that the top of the stack is on the left).

Whenever the top stack symbol is a terminal symbol and it matches the next input symbol, we pop it and read in the input symbol. If it doesn't match, then this path of the derivation rejects.

Whenever the top stack symbol is a nonterminal symbol, we pop it and nondeterministically push all possible replacements for the nonterminal. Each replacement is pushed in reverse order, so that the leftmost symbol is on the top.

If we reach the end of the input string and the stack just has $\mathtt{$}$, then we accept.

Here's an example grammar:

\begin{align*} S &\rightarrow \mathtt{a} T \mathtt{b} \\ S &\rightarrow \mathtt{b} \\ T &\rightarrow T \mathtt{a} \\ T &\rightarrow \varepsilon \end{align*}

Here's what the successful parse looks like for string aaab:

Input Stack
aaab S$
aaab aTb$
aab Tb$
aab Tab$
aab Taab$
aab aab$
ab ab$
b b$
$\varepsilon$ $

There are also many unsuccessful parses, but as long as one of them succeeds, we accept the string.

The conversion from CFG to PDA basically implements the above algorithm in the PDA. This construction is implemented in Tock:

In [2]:
g = Grammar.from_lines(["S -> a T b",
                        "S -> b",
                        "T -> T a",
                        "T -> &"])
p1 = from_grammar(g)
to_graph(p1)
%3 _START 0 start _START->0 2 0.1 0->2 ε,ε → $ 1 accept 3 loop 2->3 ε,ε → S 3->1 ε,$ → ε 3->3 ε,S → b ε,T → ε a,a → ε b,b → ε 4 1.2 3->4 ε,S → b 6 3.1 3->6 ε,T → a 5 1.1 4->5 ε,ε → T 5->3 ε,ε → a 6->3 ε,ε → T
In [3]:
run(p1, "a a a b")
%3 _START 5 start,ε _START->5 0 2 0->2 b 1 1->0 a 3 3->1 a 4 4->3 a 8 0.1,$ 5->8 6 1.2,[b] $ 11 1.1,[T] b $ 6->11 7 loop,[T] b $ 13 3.1,[a] b $ 7->13 14 loop,[b] $ 7->14 9 loop,[S] $ 8->9 9->6 10 loop,[b] $ 9->10 12 loop,[a] T b … 11->12 12->7 15 loop,[T] a b … 13->15 16 3.1,[a] a b … 15->16 17 loop,[a] b $ 15->17 18 loop,[T] a a … 16->18 19 loop,[b] $ 17->19 20 3.1,[a] a a … 18->20 21 loop,[a] a b … 18->21 23 loop,[a] a a … 18->23 20->18 20->23 22 loop,[a] b $ 21->22 24 loop,[b] $ 22->24 26 loop,[a] a b … 23->26 28 loop,[a] a a … 23->28 25 loop,$ 24->25 27 accept,ε 25->27 29 loop,[a] b $ 26->29 30 loop,[a] a b … 28->30 31 loop,[a] a a … 28->31

Question. Convert the following CFG to a PDA:

\begin{align*} S &\rightarrow \mathtt{0} S \mathtt{0} \\ S &\rightarrow \mathtt{1} S \mathtt{1} \\ S &\rightarrow \varepsilon \end{align*}

Question. If you actually had to implement a parser this way, how would you do it? What would its time complexity be?

The bottom-up construction

There's another parsing strategy that the book doesn't mention at this point. It's called bottom-up, shift-reduce, or maybe sometimes "nondeterministic LR" parsing. It's the basis for most parsing algorithms that are used in compilers.

The idea is again pretty simple -- it's like top-down parsing in reverse. The stack is initialized to $\mathtt{\$}$. At any point in time, we can do two operations.

In a shift, we read in one input symbol and push it onto the stack.

In a reduce, we check to see if the prefix (top symbols) of the stack match a right-hand-side of a rule (in reverse order), and if so, we can pop those symbols and replace them with the left-hand-side of the rule.

This algorithm is again nondeterministic: it's always possible to do a shift unless we're at the end of the string, and it may be possible to do several different reduces.

If we reach the end of the input and the stack has just $S\mathtt{\$}$, then we accept.

Here's what the successful parse looks like for string aaab:

Input Stack
aaab $
aab a$
aab Ta$
ab aTa$
ab Ta$
b aTa$
b Ta$
$\varepsilon$ bTa$
$\varepsilon$ S$
In [15]:
p2 = from_grammar(g, mode="bottomup")
to_graph(p2)
%3 _START 4 start _START->4 0 loop 0->0 ε,b → S ε,ε → T a,ε → a b,ε → b 1 3.1 0->1 ε,a → ε 2 1.2 0->2 ε,b → ε 6 0.1 0->6 ε,S → ε 1->0 ε,T → T 3 1.1 2->3 ε,T → ε 3->0 ε,a → S 4->0 ε,ε → $ 5 accept 6->5 ε,$ → ε

Wednesday reading

Read Section 2.2, Lemma 2.27.

Thursday class

The conversion from a PDA to a CFG is probably the trickiest to understand of all the constructions we do in this class. Fortunately, though, it's not very difficult to perform.

Let's start with an example PDA, one that recognizes the language of balanced parentheses but using a and b for left and right parentheses:

In [5]:
mc = read_csv("pda-parens.csv")
to_graph(mc)
%3 _START 0 q1 _START->0 2 q2 0->2 ε,ε → $ 1 q3 2->1 ε,$ → ε 2->2 a,ε → # b,# → ε
In [6]:
w = list("aaaabbbaabbb")
r = run(mc, w, show_stack=100)
r
%3 _START 13 q1,ε _START->13 0 1 0->1 a 2 1->2 a 3 2->3 a 7 3->7 a 4 5 4->5 b 6 5->6 b 8 6->8 a 7->4 b 9 8->9 a 10 9->10 b 12 10->12 b 11 12->11 b 14 q2,$ 13->14 15 q2,[#] $ 14->15 16 q3,ε 14->16 17 q2,[#] # $ 15->17 18 q2,[#] # # $ 17->18 19 q2,[#] # # # $ 18->19 20 q2,[#] # # $ 19->20 21 q2,[#] # $ 20->21 22 q2,[#] $ 21->22 23 q2,[#] # $ 22->23 24 q2,[#] # # $ 23->24 25 q2,[#] # $ 24->25 26 q2,[#] $ 25->26 27 q2,$ 26->27 28 q3,ε 27->28

The book uses plots of stack height over time (Figure 2.28 and 2.29) to help illustrate the construction. Here's a function that produces similar plots (you don't need to understand this):

In [7]:
def plot_height(ax, path):
    n = len(path)
    heights = [len(c[2]) for c in path]
    bars = ax.bar(np.arange(n), heights)
    ax.set_xticks(np.arange(n-1)+0.5)
    labels = []
    for i in range(n-1):
        if len(path[i][1]) > len(path[i+1][1]):
            labels.append(path[i][1][0])
        else:
            labels.append("")
    ax.set_xticklabels(labels)
    ax.set_xlabel("input string")
    ax.set_yticks(np.arange(max(heights)+1))
    ax.set_ylim(ymax=max(heights)+0.5)
    ax.set_ylabel("stack height")
    
    for c, b in zip(path, bars):
        h = b.get_height()
        ax.text(b.get_x() + b.get_width()/2., h, c[0], ha="center", va="bottom")
In [8]:
fig, ax = plt.subplots()
plot_height(ax, r.shortest_path())
plt.show()

Each bar (including bars with zero height) represents a configuration of the machine as it reads the string. The machine's state is written above the bar. The horizontal axis shows the input symbols that are read in (if any) and the vertical axis is the height of the stack. In this case, notice how the stack grows whenever an a is read and shrinks whenever a b is read.

You can try changing the input string w or even the PDA to see how the above graph changes. (However, the figures below will get messed up.)

Let's think about how a CFG would generate this string, working from the top down. The key idea is that every nonterminal symbol covers a substring that is read by a sub-run of the PDA that begins and ends with the same stack. The start symbol covers the whole string and corresponds to the whole run. Let's call the start symbol $A_{q_1q_3}$ because the PDA begins in state $q_1$ and ends in state $q_3$. We can picture it as below:

In [9]:
fig, ax = plt.subplots()
plot_height(ax, r.shortest_path())
ax.add_patch(patches.Rectangle((0,0),14,5.5,alpha=0.3))
plt.show()

Because the run starts and ends with epsilon transitions, there's also a sub-run that covers the whole string but begins in state $q_2$ and ends in state $q_2$:

In [10]:
fig, ax = plt.subplots()
plot_height(ax, r.shortest_path())
ax.add_patch(patches.Rectangle((0,0),14,5.5,alpha=0.3))
ax.add_patch(patches.Rectangle((1,1),12,4.5,alpha=0.3))
plt.show()

So we should add a rule $A_{q_1q_3} \rightarrow A_{q_2q_2}$. The next smallest sub-run that begins and ends with the same stack is the one that covers the whole string except the first and last symbols. It, too, begins in state $q_2$ and ends in state $q_2$:

In [11]:
fig, ax = plt.subplots()
plot_height(ax, r.shortest_path())
ax.add_patch(patches.Rectangle((1,1),12,4.5,alpha=0.3))
ax.add_patch(patches.Rectangle((2,2),10,3.5,alpha=0.3))
plt.show()

So we should add a rule $A_{q_2q_2} \rightarrow \texttt{a} A_{q_2q_2} \texttt{b}$. Now what is the next smallest sub-run? Notice that if we lop off the first and last symbols again, the resulting sub-run wouldn't begin and end with the same stack:

In [12]:
fig, ax = plt.subplots()
plot_height(ax, r.shortest_path())
ax.add_patch(patches.Rectangle((2,2),10,3.5,alpha=0.3))
ax.add_patch(patches.Rectangle((3,3),8,2.5,alpha=0.5))
ax.add_line(lines.Line2D([7.5,8.5],[2,3], color="red"))
ax.add_line(lines.Line2D([7.5,8.5],[3,2], color="red"))
plt.show()

Instead, there are two next-smallest sub-runs:

In [13]:
fig, ax = plt.subplots()
plot_height(ax, r.shortest_path())
ax.add_patch(patches.Rectangle((2,2),10,3.5,alpha=0.3))
ax.add_patch(patches.Rectangle((2.2,2),5.7,3.3,alpha=0.5))
ax.add_patch(patches.Rectangle((8.1,2),3.7,2.5,alpha=0.5))
plt.show()

So we should add a rule $A_{q_2q_2} \rightarrow A_{q_2q_2} A_{q_2q_2}$.

The construction, more concisely

The PDA to CFG construction does this, but not for a particular string; it does this for all possible strings. As mentioned above, although the construction is challenging to understand, it's quite easy to actually carry out. The only problem is that the book's version generates a lot more rules than are usually necessary.

Before beginning, preprocess the PDA so that it has the properties listed on page 121. If we ever ask you to run this construction, we'll give you a PDA so that it already has these properties.

  1. For each pair of push/pop transitions $p \xrightarrow{a, \varepsilon \rightarrow u} r$ and $s \xrightarrow{b, u \rightarrow \varepsilon} q$, create the rule $A_{pq} \rightarrow a A_{rs} b$.
  2. For each triple of states $p, q, r$, create the rule $A_{pq} \rightarrow A_{pr} A_{rq}$.
  3. For each state $p$, create the rule $A_{pp} \rightarrow \varepsilon$.

The start symbol is $A_{q_0,q_{\text{accept}}}$.

If your PDA has $n$ states, then step 2 above generates $n^3$ rules, and step 1 generates $n^4$ rules in the (rarely attained) worst case. To avoid generating so many rules, remember that a nonterminal $A_{pq}$ should only be generated if it's possible for the PDA to get from state $p$ to state $q$ starting and ending with the same stack. Any rules involving impossible nonterminals should not be created.

For example, with the above PDA, nonterminals $A_{q_2,q_1}$, $A_{q_3,q_2}$, and $A_{q_3,q_1}$ are impossible because there aren't any paths from $q_2$ to $q_1$, etc. And $A_{q_1,q_2}$ is impossible because the only path that gets from $q_1$ to $q_2$ changes the stack, and similarly for $A_{q_2,q_3}$.

Example construction

Let's run the construction but remove all rules that can't be reached from the start symbol or that can't reach a string:

In [14]:
to_grammar(mc).remove_useless()
Out[14]:
start: (start,accept)
(start,accept) → (q1,q3)
(q2,q2) → a (q2,q2) b
(q1,q3) → (q2,q2)
(q1,q1) → (q1,q1) (q1,q1)
(q1,q3) → (q1,q1) (q1,q3)
(q1,q3) → (q1,q3) (q3,q3)
(q3,q3) → (q3,q3) (q3,q3)
(q2,q2) → (q2,q2) (q2,q2)
(q1,q1) → ε
(q3,q3) → ε
(q2,q2) → ε

Even after removing useless rules, there are still some that don't really need to be there; but hopefully it's clear that the grammar generates the right language. "start" and "accept" are the names of states that were added to the PDA during preprocessing.