Read Section 2.3.

Today we're going to talk about how to show that languages are not context-free. There are two canonical examples of non-context-free languages:

\begin{align*} B &= \{ \mathtt{a}^n \mathtt{b}^n \mathtt{c}^n \mid n \geq 0 \} \\ D &= \{ ww \mid w \in \{ \mathtt{0}, \mathtt{1} \}^\ast \}. \end{align*}The pumping lemma for context-free languages is a little more complicated than the pumping lemma for regular languages (even though it actually predates it).

From an "outside" point of view, the basic difference is that the pumping causes *two* pieces of the string to be repeated instead of just one. Also, whereas in the regular pumping lemma, the repeated piece had to be near the beginning of the string, here, the two pieces have to be near each other but can occur anywhere in the string. Other than that, writing proofs using the context-free pumping lemma is the same as using the regular pumping lemma.

From an "inside" point of view, the argument is similar to the regular pumping lemma, but goes along the *longest path of the tree* instead of the string. You pick a string $s$ that is long enough to guarantee that the parse tree for $s$ has a repeated nonterminal along its longest path. Choose the lowest such loop. Then, you go around that loop multiple times to generate new strings. These strings have *two* pieces of $s$ repeated. Or, you go around the loop zero times, yielding a string with two pieces of $s$ removed. You have to show that the new string does not belong to the language. Therefore the grammar can't generate the language.

However, the details of the proof are a lot more complicated than the details of the regular pumping lemma. So we'll start by presenting the argument as a dialogue between Alice and Bill, but some of Alice's reasoning will be unexplained at first.

Alice. The language $B=\{\mathtt{a}^n\mathtt{b}^n\mathtt{c}^n\}$ is not context-free.

Bill. Yes it is.

Alice. Do you have a context-free grammar that generates it?

Bill. Yes, it's:

\begin{align*} A &\rightarrow \mathtt{a} A \mathrel\vert \mathtt{a} B \\ B &\rightarrow \mathtt{b} C \\ C &\rightarrow B \mathtt{c} \mathrel\vert \mathtt{c} \end{align*}Alice. How many nonterminal symbols does your grammar have?

Bill. Three.

Alice. What's the maximum length of any right-hand side?

Bill. Two.

Alice. Does your grammar generate the string $s = \mathtt{a}^{6}\mathtt{b}^{6}\mathtt{c}^{6}$?

Bill. Of course.

Alice. How long is the longest path in the parse tree? At least four nonterminals and a terminal, right?

Bill. A lot more than that, yes.

Alice. In this path, you use the same nonterminal twice, right?

Bill. Yes.

Alice. Starting from the bottom of the tree, let $n_1$ be the first occurrence of the first repeated nonterminal. And let $n_2$ be the second occurrence of that nonterminal.

Bill. Okay:

Alice. Let $u$, $v$, $x$, $y$, and $z$ be such that $s = uvxyz$ and $vxy$ is the string under $n_2$ and $x$ is the string under $n_1$.

Bill. Okay:

\begin{align*} u &= \mathtt{a}^{6} \mathtt{b}^{5} \\ v &= \mathtt{b} \\ x &= \mathtt{c} \\ y &= \mathtt{c} \\ z &= \mathtt{c}^{4} \end{align*}Alice. Does your grammar generate $uvvxyyz = \mathtt{a}^{6} \mathtt{b}^{7} \mathtt{c}^{7}$?

Bill. Doh!

Recall that in the regular pumping lemma, the pumping length $p$ was just the number of states in the DFA. Thus $p$ symbols was enough to ensure that the DFA went through a loop, and the loop had to occur within the first $p$ symbols. Now, we want to know how many terminals is long enough to ensure that the CFG goes through a loop.

Let $b$ be the maximum number of symbols in a right hand side of the grammar, which is also the maximum number of children a node can have. Define the *height* of a tree to be the number of *edges* in its longest root-to-leaf path. For example, this tree has height 2:

Then it's easy to see that the largest possible tree with branching factor $b$ and height $h$ has $b^h$ leaves. So if we set $p = b^{|V|+1}$, then even the biggest tree of height $|V|$ couldn't have that many leaves, so the tree must have height at least $|V|+1$. (As the book points out, $p=b^{|V|}+1$ would have been enough, but we overshoot for a reason that will hopefully be clear in a moment.) That means that the longest path must have at least $|V|+1$ nonterminals and a terminal. So there must be a repeated nonterminal (a "loop").

We can say more. If we choose the lowest such loop, we know that it must occur within a subtree of height at most $|V|+1$, which means this subtree must have at most $b^{|V|+1}$ leaves. Conveniently, that's what we set $p$ to -- so we can make the tidy statement that if the tree has at least $p$ leaves, there must be a loop, and the loop must occur within a subtree with at most $p$ leaves.

Here's the formal statement of the lemma. It has the same structure as the regular pumping lemma, but with more variables.

- For any context-free language $A$,
- there exists a $p$ (the pumping length), such that
- for any string $s$ such that $|s| \geq p$,
- there exist $u,v,x,y,z$ such that $s=uvxyz$, $|vy|>0$, $|vxy|\leq p$, and
- for all $i$, $uv^ixy^iz \in A$.

Now let's use the pumping lemma to properly prove the non-context-freeness of language $B$.

Suppose that $B$ is context-free. Given $p$ (the pumping length), let $s = \mathtt{a}^n \mathtt{b}^n \mathtt{c}^n$, where $n = \lceil p/3 \rceil$. If $s = uvxyz$ such that either $v$ or $y$ is nonempty, the pumping lemma for context-free languages says that $s' = uvvxyyz \in B$. But this is impossible, for:

If both $v$ and $y$ contain only one type of symbol, e.g., $v$ contains all $\mathtt{a}$'s and $y$ contains all $\mathtt{b}$'s, then $s'$ cannot have an equal number of all three types of symbol.

On the other hand, if either $v$ or $y$ contains two different types of symbol, e.g., contains at least one $\mathtt{a}$ and one $\mathtt{b}$, then $s'$ must have, somewhere, $\mathtt{a}$ after $\mathtt{b}$ or $\mathtt{b}$ after $\mathtt{c}$.

In either case, $s'$ cannot belong to $A$, which is a contradiction.

**Question.** Show that $\{ \mathtt{a}^m \mathtt{b}^n \mathtt{c}^m \mathtt{d}^n \mid m \geq 0, n \geq 0 \}$ is not context-free.

**Question.** Show that $D = \{ ww \mid w \in \{ \mathtt{0}, \mathtt{1} \}^\ast \}$ is not context-free.

**Question.** Show that $C = \{\mathtt{a}^i \mathtt{b}^j \mathtt{c}^k \mid 0 \leq i \leq j \leq k\}$ is not context-free.

Review Section 2.3.

Today we'll continue with leftover examples from last time, and if time permits, move on to the next topic, Turing machines.

Have a good break!