Week 7: Non-context-free languages

Monday reading

Read Section 2.3.

Tuesday class

Today we're going to talk about how to show that languages are not context-free. There are two canonical examples of non-context-free languages:

\begin{align*} B &= \{ \mathtt{a}^n \mathtt{b}^n \mathtt{c}^n \mid n \geq 0 \} \\ D &= \{ ww \mid w \in \{ \mathtt{0}, \mathtt{1} \}^\ast \}. \end{align*}

Another pumping lemma

The pumping lemma for context-free languages is a little more complicated than the pumping lemma for regular languages (even though it actually predates it).

The basic difference is that the pumping now causes two pieces of the string to be repeated instead of just one. Also, whereas in the regular pumping lemma, the repeated piece had to be near the beginning of the string, in the context-free pumping lemma, the two pieces have to be near each other, but can occur anywhere in the string. Other than that, writing proofs using the context-free pumping lemma is the same as using the regular pumping lemma.

As with the regular pumping lemma, we'll start by presenting the argument as a dialogue between Alice and Bill. But some of Alice's reasoning will be unexplained at first.

As a dialogue

Alice. The language $B=\{\mathtt{a}^n\mathtt{b}^n\mathtt{c}^n\}$ is not context-free.

Bill. Yes it is.

Alice. Do you have a context-free grammar that generates it?

Bill. Yes, it's:

\begin{align*} A &\rightarrow \mathtt{a} A \mathrel\vert \mathtt{a} B \\ B &\rightarrow \mathtt{b} C \\ C &\rightarrow B \mathtt{c} \mathrel\vert \mathtt{c} \end{align*}

Alice. How many nonterminal symbols does your grammar have?

Bill. Three.

Alice. What's the maximum length of any right-hand side?

Bill. Two.

Alice. Does your grammar generate the string $s = \mathtt{a}^{6}\mathtt{b}^{6}\mathtt{c}^{6}$?

Bill. Of course.

Alice. How long is the longest path in the parse tree? At least four nonterminals and a terminal, right?

Bill. A lot more than that, yes.

Parse tree

Alice. In this path, you use the same nonterminal twice, right?

Bill. Yes.

Alice. Starting from the bottom of the tree, let $n_1$ be the first occurrence of the first repeated nonterminal. And let $n_2$ be the second occurrence of that nonterminal.

Bill. Okay:

Parse tree

Alice. Let $u$, $v$, $x$, $y$, and $z$ be such that $s = uvxyz$ and $vxy$ is the string under $n_2$ and $x$ is the string under $n_1$.

Bill. Okay:

\begin{align*} u &= \mathtt{a}^{6} \mathtt{b}^{5} \\ v &= \mathtt{b} \\ x &= \mathtt{c} \\ y &= \mathtt{c} \\ z &= \mathtt{c}^{4} \end{align*}

Alice. Does your grammar generate $uvvxyyz = \mathtt{a}^{6} \mathtt{b}^{7} \mathtt{c}^{7}$?

Bill. Doh!

The pumping length

At this point you should read page 126-127 if you haven't already. But I'd like to explain the choice of the pumping length $p$ in a bit more detail.

Recall that in the regular pumping lemma, the pumping length $p$ was just the number of states in the DFA. Thus $p$ symbols was enough to ensure that the DFA went through a loop, and the loop had to occur within the first $p$ symbols. Now, we want to know how many terminals is long enough to ensure that the CFG goes through a loop.

Let $b$ be the maximum number of symbols in a right hand side of the grammar, which is also the maximum number of children a node can have. Define the height of a tree to be the number of edges in its longest root-to-leaf path. For example, this tree has height 2:

Tree of height 2

A tree with height 0 has just 1 leaf; a tree with height 1 has at most $b$ leaves; a tree with height 2 has at most $b^2$ leaves, and in general, the largest possible tree with height $h$ has $b^h$ leaves.

We want to set $p$ to guarantee that the tree has a path with a repeated nonterminal, so let's reason backwards to find $p$. There will be a repeated nonterminal if a path has $|V|+1$ nonterminals, that is, the tree has height at least $|V|+1$. The largest tree with height $|V|$ has at most $b^{|V|}$ leaves, so we just need to set $p$ to something bigger than $p=b^{|V|}$. Then (reasoning forwards again) the tree must have height at least height $|V|+1$, that is, it has at least $|V|+1$ nonterminals and a terminal, and by the pigeonhole principle, there must be a repeated nonterminal on this path.

We can say more. If we choose the lowest pair of repeated nonterminals, we know that it must occur within a subtree of height at most $|V|+1$, which means this subtree must have at most $b^{|V|+1}$ leaves.

To simplify the statement of the lemma, let's set $p$ to the maximum of $b^{|V|}+1$ and $b^{|V|+1}$. (It will always be the latter, except for the edge case $b=1$.) Then we can state that if the tree has at least $p$ leaves, there must be a path with a repeated nonterminal, and the repeated nonterminal must occur within a subtree with at most $p$ leaves.

Statement of the lemma

Here's the formal statement of the lemma. It has the same structure as the regular pumping lemma, but with more variables.

  • For any context-free language $A$,
  • there exists a $p$ (the pumping length), such that
  • for any string $s$ such that $|s| \geq p$,
  • there exist $u,v,x,y,z$ such that $s=uvxyz$, $|vy|>0$, $|vxy|\leq p$, and
  • for all $i$, $uv^ixy^iz \in A$.

An example proof

Now let's use the pumping lemma to properly prove the non-context-freeness of language $B$.

Suppose that $B$ is context-free. Given $p$ (the pumping length), let $s = \mathtt{a}^n \mathtt{b}^n \mathtt{c}^n$, where $n = \lceil p/3 \rceil$. If $s = uvxyz$ such that either $v$ or $y$ is nonempty, the pumping lemma for context-free languages says that $s' = uvvxyyz \in B$. But this is impossible, for:

  • If both $v$ and $y$ contain only one type of symbol, e.g., $v$ contains all $\mathtt{a}$'s and $y$ contains all $\mathtt{b}$'s, then $s'$ cannot have an equal number of all three types of symbol.

  • On the other hand, if either $v$ or $y$ contains two different types of symbol, e.g., contains at least one $\mathtt{a}$ and one $\mathtt{b}$, then $s'$ must have, somewhere, $\mathtt{a}$ after $\mathtt{b}$ or $\mathtt{b}$ after $\mathtt{c}$.

In either case, $s'$ cannot belong to $A$, which is a contradiction.

Question. Show that $\{ \mathtt{a}^m \mathtt{b}^n \mathtt{c}^m \mathtt{d}^n \mid m \geq 0, n \geq 0 \}$ is not context-free.

Question. Show that $D = \{ ww \mid w \in \{ \mathtt{0}, \mathtt{1} \}^\ast \}$ is not context-free.

Question. Show that $C = \{\mathtt{a}^i \mathtt{b}^j \mathtt{c}^k \mid 0 \leq i \leq j \leq k\}$ is not context-free.

Wednesday reading

Review Section 2.3.

Thursday class

Today we'll continue with leftover examples from last time, and if time permits, talk about a bonus topic: whether human languages are context-free.

Have a good break!