Unit 1: Natural language arithmetic

Bill MacCartney

Spring 2015

This is Unit 1 of the SippyCup codelab.

Our first case study is directly inspired by Liang & Potts 2015. We consider the problem of interpreting expressions of natural language arithmetic, such as:

```
"one plus one"
"minus three minus two"
"three plus three minus two"
"two times two plus three"
```

While these phrases are simple enough, they do exhibit some interesting ambiguity. For example, "two times two plus three" exhibits syntactic ambiguity: it could refer to either seven or ten, depending on the order of operations. And "minus three minus two" exhibits lexical ambiguity: the first "minus" refers to the (unary) negation operator, while the second "minus" refers to the (binary) subtraction operator.

Still, as semantic parsing problems go, this one is not very challenging. It has a small, closed vocabulary, and a limited variety of syntactic structures. These qualities make the arithmetic domain an ideal vehicle for developing the essential elements of a semantic parsing system.

Whenever we begin work on a new domain (that is, application or use case) for semantic parsing, the first order of business is to collect a broad sample of the inputs we want to be able to handle, and to study its properties. The characteristics of this sample will drive the choice of semantic representation, the style and structure of the grammar, and the selection of features for the scoring model. It will also serve as the basis of the dataset used for evaluation and for training. Therefore, it's important that the sample be as *large* and as *realistic* as possible. It should be representative of the inputs the semantic parser will actually encounter in its intended application.

The central challenge of semantic parsing is linguistic diversity. There are just a zillion different ways to say the same thing — far, far more than a single person will ever come up with. Therefore, when constructing a set of example inputs, we should strongly prefer naturally-occurring (or "found") data reflecting the linguistic output of many different people. Only in this way can we be confident that our sample reflects the diversity of real-world usage.

*However*, for this case study only, we're going to ignore that fine advice. The arithmetic domain is simple and straightforward, and for now our goals are strictly pedagogical. So a small, artificial sample of inputs will suffice. Here's a sample of 17 inputs borrowed from the companion code to Liang & Potts 2015.

In [1]:

```
inputs = [
"one plus one",
"one plus two",
"one plus three",
"two plus two",
"two plus three",
"three plus one",
"three plus minus two",
"two plus two",
"three minus two",
"minus three minus two",
"two times two",
"two times three",
"three plus three minus two",
"minus three",
"three plus two",
"two times two plus three",
"minus four",
]
```

Note that, even for the arithmetic domain, this sample is quite limited in scope. It uses only the integers one through four, and it uses only a small number of arithmetic operations. The exercises at the end of this unit will challenge you to extend the scope of the problem in various ways.

In our second case study, we'll look at strategies for extracting a larger and more realistic sample of inputs from search query logs.

Having collected a sample of inputs, the next order of business is to choose a good semantic representation. After all, the semantic representation is the desired *output* of the semantic parsing system, so the choice of representation will drive many other decisions.

As discussed earlier, our semantic representation should be machine-readable, unambiguous, and easily executable. For the domain of natural language arithmetic, a natural choice is to use binary expression trees, represented in Python by nested tuples. That is, every semantic representation will be either a number, or a tuple consisting of an operator and one or more arguments, which are themselves semantic representations. To begin with, we'll define just four operators:
`+`

(addition),
`-`

(subtraction),
`*`

(multiplication), and
`~`

(negation).
Thus, all of the following are valid semantic representations:

In [2]:

```
sems = [
('+', 1, 1), # one plus one
('-', ('~', 3), 2), # minus three minus two
('-', ('+', 3, 3), 2), # three plus three minus two
('+', ('*', 2, 2), 3), # two times two plus three
]
```

It's easy to implement an executor which actually performs the arithmetic calculations described by these semantic representations to return a denotation.

In [3]:

```
ops = {
'~': lambda x: -x,
'+': lambda x, y: x + y,
'-': lambda x, y: x - y,
'*': lambda x, y: x * y,
}
def execute(sem):
if isinstance(sem, tuple):
op = ops[sem[0]]
args = [execute(arg) for arg in sem[1:]]
return op(*args)
else:
return sem
```

Note that the executor is simple, straightforward, deterministic, and doesn't rely on any linguistic knowledge. This is a sign that we've chosen a good semantic representation!

Let's see the executor in action:

In [4]:

```
for sem in sems:
print('%s = %d' % (sem, execute(sem)))
```

It works!

Now that we've collected a sample of inputs and defined a semantic representation, we can construct a set of *examples* which pair inputs with their target outputs. Examples have two purposes:

- They can serve as
*evaluation data*to assess the quality of our semantic parser. - They can serve as
*training data*to improve the system.

However, writing down target semantics for a large sample of inputs can be laborious, time-consuming, and error-prone. Depending on the complexity of the semantic representation, it may also require expert knowledge. In many domains, it is easier and faster to write down target denotations than to write down target semantics, and the task often can be crowdsourced. As we'll see later, in many cases target denotations alone can suffice for both evaluation and training.

For the arithmetic domain, we'll generate examples which include both semantics and denotations.
The SippyCup codebase defines (in `example.py`

) a simple container class called `Example`

. By importing this class, we can define some examples to guide our development.

In [5]:

```
from example import Example
arithmetic_examples = [
Example(input="one plus one", semantics=('+', 1, 1), denotation=2),
Example(input="one plus two", semantics=('+', 1, 2), denotation=3),
Example(input="one plus three", semantics=('+', 1, 3), denotation=4),
Example(input="two plus two", semantics=('+', 2, 2), denotation=4),
Example(input="two plus three", semantics=('+', 2, 3), denotation=5),
Example(input="three plus one", semantics=('+', 3, 1), denotation=4),
Example(input="three plus minus two", semantics=('+', 3, ('~', 2)), denotation=1),
Example(input="two plus two", semantics=('+', 2, 2), denotation=4),
Example(input="three minus two", semantics=('-', 3, 2), denotation=1),
Example(input="minus three minus two", semantics=('-', ('~', 3), 2), denotation=-5),
Example(input="two times two", semantics=('*', 2, 2), denotation=4),
Example(input="two times three", semantics=('*', 2, 3), denotation=6),
Example(input="three plus three minus two", semantics=('-', ('+', 3, 3), 2), denotation=4),
Example(input="minus three", semantics=('~', 3), denotation=-3),
Example(input="three plus two", semantics=('+', 3, 2), denotation=5),
Example(input="two times two plus three", semantics=('+', ('*', 2, 2), 3), denotation=7),
Example(input="minus four", semantics=('~', 4), denotation=-4),
]
```

Note that for examples with syntactically ambiguous inputs, our target semantics and denotations reflects a specific choice about how to resolve the ambiguity which accords with the standard order of operations.

OK, we've defined the problem, and we have a collection of examples which pair inputs with intended outputs. It's time to get down to business. How do we actually build a semantic parsing system that can map the example inputs to the target outputs?

In order to do *semantic parsing*, we're going to start by doing *syntactic parsing*. Syntactic parsing is a big topic, and we'll give only a cursory treatment here. If you haven't seen syntactic parsing before, you might benefit from the fuller exposition in Part III of Jurafsky & Martin, Speech and Language Processing.

In syntactic parsing, the goal is to build a tree structure (a *parse*) over the input which describes what linguists call its constituency structure, which basically means how we group the words into larger and larger phrases. For example, there are two different ways to parse "minus three minus two", depending on the order in which we group the words into phrases. If we start by grouping "minus three" into a phrase, we arrive at an answer of -5; if we start by grouping "three minus two", we arrive at -1. We can represent these two parses by adding parentheses to the input, like this:

```
parse 1 ((minus three) minus two) yields -5
parse 2 (minus (three minus two)) yields -1
```

Alternatively, we could represent these two parses graphically like this:

In syntactic parsing, we not only group words into phrases, but also assign labels known as *categories* to each word and phrase. In SippyCup, we adopt the convention that category names always begin with `$`

, so that they are easily distinguished from ordinary words. For the arithmetic domain, we require only three categories:

`$E`

, the category of*expressions*, including both numerals and longer phrases which denote a number.`$BinOp`

, the category of*binary operators*, such as`"plus"`

,`"minus"`

(meaning subtraction), and`"times"`

.`$UnOp`

, the category of*unary operators*, such as`"minus"`

(meaning negation).

If we add category labels to our parses, they look like this:

```
parse 1 ($E ($E ($UnOp minus) ($E three)) ($BinOp minus) ($E two))
parse 2 ($E ($UnOp minus) ($E ($E three) ($BinOp minus) ($E two)))
```

Or, graphically:

In order to build a valid parse tree over an input, we need to know the space of possibilities. This is the role of the *grammar*, which in SippyCup is a context-free grammar (CFG). The grammar contains a collection of *rules*, each of which specifies a valid *local subtree*, consisting of a parent and its immediate children.

In these figures, two local subtrees are highlighted in red:

It's kind of like building with Legos. There are many types of pieces. Each grammar rule specifies the shape of one type of piece, and how it connects to other pieces. Starting from the words of the input, you connect pieces into larger and larger structures, always according to the rules. The set of pieces available defines the space of structures you can build.

Here are all the pieces we need to build either of the parse trees above:

A grammar rule has a *left-hand side* (LHS) which is a single category, and a *right-hand side* (RHS) which is a sequence of one or more categories or words. (Words are also known as *terminals*, and categories as *non-terminals*.) The LHS specifies the parent of a valid local subtree; the RHS, the children.

Here's a simple Python class for representing grammar rules. You can ignore `sem`

for now — we're only concerned with syntax at this point.

In [6]:

```
class Rule:
def __init__(self, lhs, rhs, sem=None):
self.lhs = lhs
self.rhs = tuple(rhs.split()) if isinstance(rhs, str) else rhs
self.sem = sem
def __str__(self):
return 'Rule' + str((self.lhs, ' '.join(self.rhs), self.sem))
```

Note that while `rhs`

is stored as a tuple of strings, the constructor will accept either a tuple of strings or a single space-separated string. This is purely for convenience; it means that instead of `Rule('$E', ('$E', '$BinOp', $E'))`

we can write `Rule('$E', '$E $BinOp $E')`

.

We can now write down a few grammar rules for the arithmetic domain.

In [7]:

```
numeral_rules = [
Rule('$E', 'one'),
Rule('$E', 'two'),
Rule('$E', 'three'),
Rule('$E', 'four'),
]
operator_rules = [
Rule('$UnOp', 'minus'),
Rule('$BinOp', 'plus'),
Rule('$BinOp', 'minus'),
Rule('$BinOp', 'times'),
]
compositional_rules = [
Rule('$E', '$UnOp $E'),
Rule('$E', '$E $BinOp $E'),
]
def arithmetic_rules():
return numeral_rules + operator_rules + compositional_rules
```

In fact, these rules are all we need to be able to parse the 17 examples above.

There's just one little snag: the parsing algorithm we'll present in the next section requires that the grammar be in Chomsky normal form, or CNF. (We'll look at ways to relax this restriction in the next unit.) Roughly, CNF requires that every grammar rule have one of two forms:

- In a
*unary lexical rule*, the RHS must consist of exactly one word (terminal). - In a
*binary compositional rule*, the RHS must consist of exactly two categories (non-terminals).

Actually, we're already pretty close to satisfying these criteria. All the rules in `numeral_rules`

and `operator_rules`

are unary lexical rules. And the first rule in `compositional_rules`

is a binary compositional rule. But the second rule is a problem: it has three categories on the RHS, so it's a *trinary* compositional rule.

Fear not! We can convert our grammar to CNF by *binarizing* the offending rule. It's easiest to think in terms of local subtrees here. To binarize a local subtree having more than two children, we just introduce a new node underneath the parent which becomes the new parent of all the children *except* the rightmost. (If the new node still has more than two children, we just repeat the process.)

We also have to come up with a category for the new node. Since it spans an `$E`

and a `$BinOp`

, let's call it `$EBO`

.

In terms of grammar rules, we're replacing the (bad) trinary rule with two (good) binary rules, as follows:

In [8]:

```
compositional_rules = [
Rule('$E', '$UnOp $E'),
Rule('$EBO', '$E $BinOp'), # binarized rule
Rule('$E', '$EBO $E'), # binarized rule
]
```

Let's define some helper functions which will help us ensure that our grammars are in CNF.

In [9]:

```
def is_cat(label):
return label.startswith('$')
def is_lexical(rule):
return all([not is_cat(rhsi) for rhsi in rule.rhs])
def is_binary(rule):
return len(rule.rhs) == 2 and is_cat(rule.rhs[0]) and is_cat(rule.rhs[1])
```

(In a more object-oriented design, these functions might be defined as methods of `Rule`

. However, defining them as static functions is more convenient for the incremental presentation of this codelab.)

Now we'll create a `Grammar`

class to hold a collection of rules. We'll store the rules in maps indexed by their right-hand sides, which will facilitate lookup during parsing. Ignore the `parse_input()`

method for now — we'll define it shortly.

In [10]:

```
from collections import defaultdict
class Grammar:
def __init__(self, rules=[]):
self.lexical_rules = defaultdict(list)
self.binary_rules = defaultdict(list)
for rule in rules:
add_rule(self, rule)
print('Created grammar with %d rules.' % len(rules))
def parse_input(self, input):
"""Returns a list of parses for the given input."""
return parse_input(self, input) # defined later
def add_rule(grammar, rule):
if is_lexical(rule):
grammar.lexical_rules[rule.rhs].append(rule)
elif is_binary(rule):
grammar.binary_rules[rule.rhs].append(rule)
else:
raise Exception('Cannot accept rule: %s' % rule)
```

(Again, we've defined `add_rule()`

as a static function, rather than a member of the `Grammar`

class, only because it facilitates the incremental presentation of this codelab.)

Let's create a grammar using the rules we defined earlier.

In [11]:

```
arithmetic_grammar = Grammar(arithmetic_rules())
```

Great, we have a grammar. Now we need to implement a parsing algorithm.

The next question is, given a grammar and a specific input, how can we find the set of parses for the input which are allowed by the grammar? To solve this problem, we're going to use a variant of the CYK algorithm, which is an example of a chart parsing algorithm, and more broadly, of dynamic programming.

Chart parsing relies on a data structure known as the *chart*, which has one entry (known as a *cell*) for every possible span in the input. Spans are identified by a pair of token indices (*i*, *j*), where *i* is the (0-based) index of the leftmost token of the span, and *j* is one greater than the index of the rightmost token of the span. It follows that *j* – *i* is equal to the length of the span. For example, if the input is "one plus two", then span (0, 1) is "one", span (1, 3) is "plus two", and span (0, 3) is "one plus two". The chart cell for each span holds a list of possible parses for that span.

(TODO: Show diagram of chart, with iteration order marked by arrows.)

The chart parsing algorithm works like this:

- Split the input into a sequence of tokens.
- Construct a chart, which maps from each span of the input to a list of possible parses for that span.
- Iterate over all possible spans, working bottom-up, from smaller spans to larger spans.
- For each span, and for each grammar rule, if the rule lets you build a parse for the span, add it to the chart.

Here's how to express the algorithm in Python. It's surprisingly simple!

In [12]:

```
def parse_input(grammar, input):
"""Returns a list of all parses for input using grammar."""
tokens = input.split()
chart = defaultdict(list) # map from span (i, j) to list of parses
for j in range(1, len(tokens) + 1):
for i in range(j - 1, -1, -1):
apply_lexical_rules(grammar, chart, tokens, i, j)
apply_binary_rules(grammar, chart, i, j)
return chart[(0, len(tokens))] # return all parses for full span
```

The definition of `apply_lexical_rules()`

is very simple. We simply retrieve from the grammar all the lexical rules having the given token span as RHS. For each such rule, we construct a new instance of `Parse`

(a class we'll define in a moment) and add it to the chart.

In [13]:

```
def apply_lexical_rules(grammar, chart, tokens, i, j):
"""Add parses to span (i, j) in chart by applying lexical rules from grammar to tokens."""
for rule in grammar.lexical_rules[tuple(tokens[i:j])]:
chart[(i, j)].append(Parse(rule, tokens[i:j]))
```

(By the way, you might wonder why the invocation of `apply_lexical_rules()`

appears inside the loop over `i`

(and why its arguments include `i`

). Isn't it enough to call it once for each value of `j`

? Yes, for now, it is. But in the next unit, we'll extend the grammar to support lexical rules with multiple words in the RHS, and this code design will make that easier.)

The definition of `apply_binary_rules()`

is slightly more complicated. First, we need to iterate over all indices *k* at which the span (*i*, *j*) could be split into two subspans. For each split point, we consult the chart to see what parses we've already constructed for the two subspans. Then, for each pair of subspan parses, we retrieve from the grammar all binary rules that could used to combine them. For each such rule, we construct a new instance of `Parse`

and add it to the chart.

In [14]:

```
from itertools import product
def apply_binary_rules(grammar, chart, i, j):
"""Add parses to span (i, j) in chart by applying binary rules from grammar."""
for k in range(i + 1, j): # all ways of splitting the span into two subspans
for parse_1, parse_2 in product(chart[(i, k)], chart[(k, j)]):
for rule in grammar.binary_rules[(parse_1.rule.lhs, parse_2.rule.lhs)]:
chart[(i, j)].append(Parse(rule, [parse_1, parse_2]))
```

By the way, it should now be clear why we require the grammar to be in Chomsky normal form. If we allow rules with three (or more) items on the RHS, then we have to consider all ways of splitting a span into three (or more) subspans. This quickly becomes both inefficient and unwieldy. Better to binarize all the rules first.

Now what about the `Parse`

class? It's a simple container class which stores the `Rule`

used to build the parse and the children to which the rule was applied. If the rule was a lexical rule, the children are just tokens; if it was a compositional rule, the children are other `Parse`

s.

In [15]:

```
class Parse:
def __init__(self, rule, children):
self.rule = rule
self.children = tuple(children[:])
self.semantics = compute_semantics(self) # Ignore this for now -- we'll use it later.
self.score = float('NaN') # Ditto.
self.denotation = None # Ditto.
def __str__(self):
return '(%s %s)' % (self.rule.lhs, ' '.join([str(c) for c in self.children]))
def compute_semantics(parse):
return None # We'll redefine this later.
```

We're finally ready to do some parsing! Let's see our grammar in action, by applying it to the set of 17 examples we defined above.

In [16]:

```
arithmetic_grammar = Grammar(arithmetic_rules())
for example in arithmetic_examples:
parses = parse_input(arithmetic_grammar, example.input)
print()
print('%-16s %s' % ('input', example.input))
for idx, parse in enumerate(parses):
print('%-16s %s' % ('parse %d' % idx, parse))
```

You should observe that all 17 examples were successfully parsed. Moreover, the examples which exhibit ambiguity get multiple parses, exactly as expected.

Syntactic parsing provides the spine around which we'll build our semantic parsing system. But now we need to add semantics.

Now we need to bring semantics into the picture. Given a parse tree, we'd like to attach a semantic representation to every node in the tree, like this:

How are the semantic representations computed? Like the parse tree itself, they are computed *bottom-up*. We begin at the leaf nodes, where the semantics are determined directly by the words: the semantics for "one" is `1`

, the semantics for "plus" is `+`

, and so on. This is province of lexical semantics.

Then, as we work our way up the tree, the semantic representation for each internal node is computed from the semantics of its children, in a manner that depends on the rule that was used to combine them. This is the province of *compositional semantics*, and it hinges on the principle of compositionality (often attributed to Gottlob Frege).

The meaning of a compound expression is a function of the meanings of its parts and the manner of their combination.

(The principle of compositionality is central to Montague grammar, which provides the theoretical foundation for much academic work in semantic parsing. However, most formulations of Montague grammar assume the use of the typed lambda calculus for semantic representations. SippyCup is compatible with this choice, but does not assume it.)

Now recall that the rules of our grammar specify valid local subtrees from which we can construct parse trees. Just as we've added semantic attachments to our parse trees, we'll add semantic attachments to the rules of our grammar. The semantic attachment to a rule specifies how to construct the semantics for the parent (LHS) category. For a lexical rule, the semantic attachment is simply a semantic representation (or a fragment thereof). For a compositional rule, the semantic attachment is a function which takes the semantics of the children as input and returns the semantics for the parent.

Let's see how this looks in Python. We need to redefine our rules to include semantic attachments. For *lexical* rules, the semantic attachments directly specify semantic representations (or fragments thereof):

In [17]:

```
numeral_rules = [
Rule('$E', 'one', 1),
Rule('$E', 'two', 2),
Rule('$E', 'three', 3),
Rule('$E', 'four', 4),
]
operator_rules = [
Rule('$UnOp', 'minus', '~'),
Rule('$BinOp', 'plus', '+'),
Rule('$BinOp', 'minus', '-'),
Rule('$BinOp', 'times', '*'),
]
```

For *compositional* rules, the semantic attachments are functions which specify how to construct the semantics of the parent from the semantics of the children. We'll define these functions using Python's `lambda`

syntax, and we'll establish the convention that these lambda functions always have a single parameter called `sems`

, which will contain a list of the semantic representations of the children on the RHS of the rule.

For example, consider the rule which specifies how to combine a unary operator with its argument. We can define its semantic attachment like this:

```
Rule('$E', '$UnOp $E', lambda sems: (sems[0], sems[1]))
```

Now, `sems[0]`

refers to the semantics of the child `$UnOp`

,
and `sems[1]`

refers to the semantics of the child `$E`

.
So this semantic attachment says: take the semantics of the child `$UnOp`

and the semantics of the child `$E`

, form a pair from them, and return that pair as the semantics for the parent `$E`

.

For the other two compositional rules — the ones involving binary operators — things are complicated by binarization. Recall that *before* binarization, we had a single rule, for which we could easily specify a semantic attachment:

```
Rule('$E', '$E $BinOp $E', lambda sems: (sems[1], sems[0], sems[2]))
```

Note the order of the tuple elements here: first we have `sems[1]`

, which refers to the semantics of the child `$BinOp`

, and then we have `sems[0]`

and `sems[2]`

, which refer to the semantics of the child `$E`

s.

However, *after* binarization, we need to break our semantic construction into stages. First we combine the semantics of the left child `$E`

with the semantics of `$BinOp`

, forming a semantic representation for `$EBO`

which is an "incomplete" tuple:

```
Rule('$EBO', '$E $BinOp', lambda sems: (sems[1], sems[0]))
```

This semantic representation is incomplete in the sense that it combines a binary operator with a single argument. (This is a form of currying.)

Then we follow through by combining the "incomplete" semantics for `$EBO`

with the semantics for the right child `$E`

to yield the semantic representation for the parent `$E`

:

```
Rule('$E', '$EBO $E', lambda sems: (sems[0][0], sems[0][1], sems[1]))
```

Putting everything together, we get:

In [18]:

```
compositional_rules = [
Rule('$E', '$UnOp $E', lambda sems: (sems[0], sems[1])),
Rule('$EBO', '$E $BinOp', lambda sems: (sems[1], sems[0])),
Rule('$E', '$EBO $E', lambda sems: (sems[0][0], sems[0][1], sems[1])),
]
```

OK, these lambda functions tell us *how* to construct the semantics for the parent, given the semantics for the children. But *when* and *where* do these functions actually get invoked? There needs to be a function that constructs the semantics for a parse. Here it is:

In [19]:

```
from types import FunctionType
def compute_semantics(parse):
if is_lexical(parse.rule) or not isinstance(parse.rule.sem, FunctionType):
return parse.rule.sem
else:
return parse.rule.sem([child.semantics for child in parse.children])
```

You may recall that `compute_semantics()`

was invoked in the constructor for `Parse`

shown above. Thus, SippyCup performs semantic construction "eagerly", that is, during parsing. It's worth noting that this is a design choice — it would also be possible to make semantic construction a separate phase which occurs only after parsing is complete.

At this point, we should have all the pieces we need to do semantic parsing. Let's test that by parsing "two times two plus three":

In [20]:

```
arithmetic_grammar = Grammar(arithmetic_rules())
parses = parse_input(arithmetic_grammar, "two times two plus three")
for parse in parses:
print()
print(parse)
print(parse.semantics)
```

OK, it looks like it's working!

Next we'd like to evaluate the performance of our semantic grammar on our whole dataset. The SippyCup codebase includes a number of helper functions (in `experiment.py`

) to facilitate empirical evaluations. But since the code isn't very interesting, we won't reproduce it here — we'll just import it.

In [21]:

```
from experiment import evaluate_grammar
arithmetic_grammar = Grammar(arithmetic_rules())
evaluate_grammar(grammar=arithmetic_grammar, executor=execute, examples=arithmetic_examples)
```

If you look at the bottom of the output, you see that we report the mean values of several evaluation metrics over the 17 examples in the dataset. (Evaluation metrics are defined in metrics.py.) The first two are particularly significant at this point:

- The "semantics accuracy" metric shows how often the semantics of the parse
*at position 0*matched the target semantics in the example. The result is 0.824, or 14/17. In other words, there were three examples where the parse at position 0 was*not*correct. - The "semantics oracle accuracy" metric shows how often the semantics of
*any*parse matched the target semantics in the example. The result is 1.000, or 17/17. In other words, in every example we produced some correct parse.

The gap between accuracy and oracle accuracy represents an opportunity. If we had some way of ranking candidate parses so that correct parses were likely to appear higher in the list, then we could bring accuracy closer to oracle accuracy. (The "oracle" in question is one that magically knows the optimal ranking of candidate parses.)

So far, we've seen only a couple of cases where the parser found more than one parse for a given input. But in richer domains, with more complex grammars, it's not unusual to find tens, hundreds, or even thousands of parses for some inputs. However, the list of candidate parses returned by `parse_input()`

appears in arbitrary order. The first parse is not necessarily the best, and the best parse is not necessarily the first. Therefore, we'd like to have some way of ranking the candidates, so that more plausible interpretations appear earlier in the list.

Consider the example "`two times two plus three`

". Because of syntactic ambiguity, there are two possible interpretations, according to whether we perform multiplication first (in accordance with the standard order of operations) or addition first. Our parser duly produces both parses, but it happens to produce the wrong parse (the one with denotation 10) first, and the right parse (with denotation 7) second. So this example counts as a loss for the accuracy metric, which only considers the first parse.

An easy way to rank candidate parses is with a linear scoring function. (This approach will be very familiar if you've had any exposure to machine learning.) The idea is that we define a vector of feature functions, each of which takes a parse as input and returns as output a real number which encodes some salient characteristic of the parse. We then define a corresponding vector of real-valued weights (one for each feature), and we compute the score for a parse as the inner product of the weight vector and the feature vector. If $p$ is a parse, $w$ is the weight vector, and $\phi$ is the vector of feature functions, we can write this as:

$ score(p) = \sum_i w_i \cdot \phi_i(p) $

Finally, we sort the candidate parses by score, so that the highest-scoring parses appear first.

We now have two problems:

**Feature engineering**: defining features which help to discriminate good parses from bad.**Weight learning**: deciding what weight to assign to each feature.

The rest of this section will focus on feature engineering; we'll turn to weight learning in the next section.

In the academic literature on semantic parsing, the default starting point for feature engineering is usually with *rule features*. There is one such feature for each rule in the grammar, and its value for a given parse simply indicates how many times the rule was used in the parse. We can implement rule features in Python like this:

In [22]:

```
def rule_features(parse):
"""
Returns a map from (string representations of) rules to how often they were
used in the given parse.
"""
def collect_rule_features(parse, features):
feature = str(parse.rule)
features[feature] += 1.0
for child in parse.children:
if isinstance(child, Parse):
collect_rule_features(child, features)
features = defaultdict(float)
collect_rule_features(parse, features)
return features
```

For example, here are the rule features for `"two times two"`

:

In [23]:

```
parses = parse_input(arithmetic_grammar, "two times two")
for feature, value in rule_features(parses[0]).items():
print('%8g %s' % (value, feature))
```

Note that the rule feature for `Rule('$E', 'two', 2)`

has value 2, because that rule was used twice in the parse.

Rule features are often quite effective in discriminating good parses from bad. In cases of lexical ambiguity, rule features can help to identify the preferred interpretation, by assigning a higher weight to one lexical rule with the given RHS than to another. Likewise, in cases of syntactic ambiguity, rule features can help to identify the more plausible construction.

However, rule features won't help us distinguish between the two candidate parses of "two times two plus three", because the two parses use exactly the same rules, only in a different order. Here, let's prove it:

In [24]:

```
parses = parse_input(arithmetic_grammar, "two times two plus three")
print('Number of parses:', len(parses))
print('Identical rule features?', rule_features(parses[0]) == rule_features(parses[1]))
```

In order to discriminate between the two candidate parses of "two times two plus three", we need features that explicitly represent operator precedence. The following feature function will do the job.

In [25]:

```
def operator_precedence_features(parse):
"""
Traverses the arithmetic expression tree which forms the semantics of
the given parse and adds a feature (op1, op2) whenever op1 appears
lower in the tree than (i.e. with higher precedence than) than op2.
"""
def collect_features(semantics, features):
if isinstance(semantics, tuple):
for child in semantics[1:]:
collect_features(child, features)
if isinstance(child, tuple) and child[0] != semantics[0]:
features[(child[0], semantics[0])] += 1.0
features = defaultdict(float)
collect_features(parse.semantics, features)
return features
```

Let's make sure it works properly.

In [26]:

```
for parse in parses:
print('Semantics %s yields features %s' % (parse.semantics, dict(operator_precedence_features(parse))))
```

Looks good. Now we want to make sure that a parse with feature `('*', '+')`

will get a higher score (all else being equal) than a parse with feature `('+', '*')`

. That's easy to arrange: we can just assign a weight of 1 to the former feature and a weight of -1 to the latter. While we're at it, let's assign weights to a few other precedence features in accordance with the standard order of operations.

In [27]:

```
weights = defaultdict(float)
weights[('*', '+')] = 1.0
weights[('*', '-')] = 1.0
weights[('~', '+')] = 1.0
weights[('~', '-')] = 1.0
weights[('+', '*')] = -1.0
weights[('-', '*')] = -1.0
weights[('+', '~')] = -1.0
weights[('-', '~')] = -1.0
```

Now we need a function that actually computes the score for a parse, given a feature function and a weight vector.

In [28]:

```
def score(parse=None, feature_fn=None, weights=None):
"""Returns the inner product of feature_fn(parse) and weights."""
return sum(weights[feature] * value for feature, value in feature_fn(parse).items())
```

Let's verify that our preferred interpretation for "two times two plus three" gets a higher score.

In [29]:

```
for parse in parses:
print('Semantics %s gets score %4.1f' % (parse.semantics, score(parse, operator_precedence_features, weights)))
```

Great, it works.

Now that we have a grammar, scoring, and an executor, it will be convenient to tie all these pieces together in a class called `Model`

, which can take an input and generate a ranked list of parses with scores, semantics, and denotations:

In [30]:

```
class Model:
def __init__(self,
grammar=None,
feature_fn=lambda parse: defaultdict(float),
weights=defaultdict(float),
executor=None):
self.grammar = grammar
self.feature_fn = feature_fn
self.weights = weights
self.executor = executor
def parse_input(self, input):
parses = self.grammar.parse_input(input)
for parse in parses:
if self.executor:
parse.denotation = self.executor(parse.semantics)
parse.score = score(parse, self.feature_fn, self.weights)
parses = sorted(parses, key=lambda parse: parse.score, reverse=True)
return parses
```

Let's redo the evaluation we ran at the end of the last section, to demonstrate that we've achieved a gain from introducing scoring.

In [31]:

```
from experiment import evaluate_model
arithmetic_model = Model(grammar=arithmetic_grammar,
feature_fn=operator_precedence_features,
weights=weights,
executor=execute)
evaluate_model(model=arithmetic_model, examples=arithmetic_examples)
```

If you look at the bottom of the output, you'll see that our semantics accuracy has increased from 0.824 (or 14 of 17) to 0.941 (or 16 of 17). Hooray, scoring works! In particular, our operator precedence features now allow us to correctly rank the alternate parses for "minus three minus two" and "two times two minus three".

The remaining error is on "three plus three minus two". Note that in the standard order of operations, operators with same precedence level (such as addition and subtraction) are evaluated left-to-right. Accordingly, the target semantics performs the addition first. However, the parse at position 0 performs the subtraction first. In this case, both parses yield the same denotation, so you might think that this mistake doesn't much matter. But that's not true in general: if the example were instead "three minus three plus two", the denotation would depend on the order of operations. In the exercises at the end of this unit, we'll ask you to introduce new features to address this problem.

In the previous section, we set the weights of our scoring model "by hand". And it was easy to do so, because the model uses just a handful of features. Moreover, there was no question about what weights to choose: the sign of each weight was determined by the standard order of operations, and the magnitudes were all the same.

In more realistic applications, however, things are often more complicated. Scoring models can contain hundreds or thousands of features, too many to set by hand. And, perhaps surprisingly, it's not always obvious whether the weight of a particular feature should be positive or negative, let alone what its magnitude should be.

Rather than setting weights by hand, we'd like learn the weights automatically from training data.

(TODO ... flesh out this section. Until this section is fleshed out, please refer to Liang & Potts 2015, which describes the application of SGD to semantic parsing with great clarity. You may also find it useful to examine the demonstration code published as a companion to that paper. The code here is superficially different but fundamentally the same.)

In [32]:

```
def latent_sgd(model=None, examples=[], training_metric=None, T=10, eta=0.1, seed=None):
print('Running SGD learning on %d examples with training metric: %s' % (
len(examples), training_metric.name()))
if seed:
print('random.seed(%d)' % seed)
random.seed(seed)
model = clone_model(model)
for t in range(T):
random.shuffle(examples)
num_correct = 0
for example in examples:
# Parse input with current weights.
parses = model.parse_input(example.input)
# Get the highest-scoring "good" parse among the candidate parses.
good_parses = [p for p in parses if training_metric.evaluate(example, [p])]
if good_parses:
target_parse = good_parses[0]
# Get all (score, parse) pairs.
scores = [(p.score + cost(target_parse, p), p) for p in parses]
# Get the maximal score.
max_score = sorted(scores)[-1][0]
# Get all the candidates with the max score and choose one randomly.
predicted_parse = random.choice([p for s, p in scores if s == max_score])
if training_metric.evaluate(example, [predicted_parse]):
num_correct += 1
update_weights(model, target_parse, predicted_parse, eta)
print('SGD iteration %d: train accuracy: %.3f' % (t, 1.0 * num_correct / len(examples)))
print_weights(model.weights)
return model
def cost(parse_1, parse_2):
return 0.0 if parse_1 == parse_2 else 1.0
def clone_model(model):
return Model(grammar=model.grammar,
feature_fn=model.feature_fn,
weights=defaultdict(float), # Zero the weights.
executor=model.executor)
def update_weights(model, target_parse, predicted_parse, eta):
target_features = model.feature_fn(target_parse)
predicted_features = model.feature_fn(predicted_parse)
for f in set(target_features.keys() + predicted_features.keys()):
update = target_features[f] - predicted_features[f]
if update != 0.0:
# print 'update %g + %g * %g = %g\t%s' % (
# model.weights[f], eta, update, model.weights[f] + eta * update, f)
model.weights[f] += eta * update
```

Let's put `latent_sgd()`

to the test, by using it to learn weights for our arithmetic model.
We'll divide our 17 arithmetic examples into 13 training examples and 4 test examples.
Then, we'll use the utility function `train_test()`

, defined in `experiment.py`

, which:

- evaluates the
*current*model on both the training examples and the test examples - trains the model on the training examples, using
`latent_sgd()`

- reports the learned weights
- evaluates the
*trained*model on both the training examples and the test examples

Note that `train_test()`

has a parameter called `training_metric`

, which is passed through to the `latent_sgd()`

parameter of the same name. To begin with, we'll use the `SemanticsAccuracyMetric`

as the training metric: thus, a parse will count as correct iff its semantic yield matches the target semantics in the example. Here we go:

In [33]:

```
from experiment import train_test
from metrics import SemanticsAccuracyMetric
train_test(model=arithmetic_model,
train_examples=arithmetic_examples[:13],
test_examples=arithmetic_examples[13:],
training_metric=SemanticsAccuracyMetric(),
seed=1)
```

Note that, while training accuracy increased from 0.846 (11 of 13 correct) to 1.000 (all 13 correct), test accuracy remained stuck at 0.750 (3 of 4 correct). We seem to have learned *something* from the training data, but whatever we learned did not generalize to the test data. In fact, if we look at the learned features, we can see exactly what we learned from the training data: that '~' should take precedence over '-' (which is correct, in the sense that it accords with the standard order of operations), and that '+' should take precedence over '-' (which is not correct, in general). The model learned these specific preferences because the two errors it was making on the training data involved these two precedence pairs.

However, the one error we're making on the test data involves a different precedence pair. If you add the argument `print_examples=True`

to `train_test()`

, you can see which example we're getting wrong: it's `"two times two plus three"`

. In order to get this example right, we need to know that `*`

should take precedence over `+`

. But none of the examples in the training data involved this precedence pair, so we haven't learned that.

Of course, this failure to improve test accuracy doesn't reflect any problem with `latent_sgd()`

.
It's merely a consequence of the extremely small size of our training dataset. With so few training examples, it's not surprising that a test example hinges on a feature not seen in training. Going forward, we'll strive for larger datasets which are less subject to this kind of sampling noise.

It's clear that we need larger datasets, with more training examples. But as we noted above, annotating examples with target semantics can be slow, expensive, and error-prone, and may require expert knowledge. However, annotating examples with target *denotations* can often be faster, cheaper, and more reliable.

The arithmetic domain illustrates this nicely. Writing down the target semantics for `"minus three minus two"`

(namely, `('-', ('~', 3), 2)`

) is a tedious chore that most people probably could not perform reliably. You need to understand Lisp-y prefix notation. You need to remember to use the funny `'~'`

, instead of the more natural `'-'`

, for unary `"minus"`

. You need to remember to quote the operators. And you need to get the order of operations right.

By contrast, writing down the target denotation (namely, `-5`

) is easy as pie. The only thing you really need to think about is the order of operations, which most people are capable of mastering. So if you're asking only for denotations, rather than semantics, you can get more annotations, faster, cheaper, and more reliably, from ordinary people.

One of the principal contributions of Liang et al. 2011 was to show that it is possible to learn scoring models for semantic parsing using only target denotations, rather than target semantics, as the source of supervision. The central idea is presented with admirable clarity in Liang & Potts 2015.

In SippyCup, we can change the source of supervision from semantics to denotations simply by changing the training metric from `SemanticsAccuracyMetric`

to `DenotationAccuracyMetric`

. With this change, a parse will count as correct iff the denotation of its semantic yield matches the target denotation in the example.

Let's begin by repeating the experiment we did a moment ago, but switching to denotations as the source of supervision:

In [34]:

```
from metrics import DenotationAccuracyMetric
train_test(model=arithmetic_model,
train_examples=arithmetic_examples[:13],
test_examples=arithmetic_examples[13:],
training_metric=DenotationAccuracyMetric(),
seed=1)
```

Now that we're learning from denotations, the semantics accuracy on the training data did not reach 1.000, as it did before. One of the exercises will ask you to investigate why. However, the denotation accuracy did reach 1.000.

To no one's astonishment, performance on the test data did not budge. As we saw earlier, the sole error on a test example depends on a feature which is not observed in the training data. Learning from denotations does not solve that problem directly. However, it opens the door to obtaining larger datasets for training, which may help.

The file `arithmetic.py`

contains a set of 100 "development" examples for the arithmetic domain. These examples are annotated only with denotations, not with semantics. 100 examples is still not huge, but the arithmetic domain is simple enough that it should suffice.

In [35]:

```
from arithmetic import arithmetic_dev_examples
from metrics import denotation_match_metrics
train_test(model=arithmetic_model,
train_examples=arithmetic_dev_examples,
test_examples=arithmetic_examples[13:],
metrics=denotation_match_metrics(),
training_metric=DenotationAccuracyMetric(),
seed=1)
```

Note that denotation accuracy on the test examples now reaches 1.000 after training. The sole error is corrected, because the larger training dataset affords ample opportunity to learn that `*`

should take precedence over `+`

.

We'd like to automate the process of building a semantic parsing model as much as possible. Ideally, we'd start from training examples annotated with denotations, run one command, and get back a fully-baked semantic parsing model. This would make it vastly easier to build models for new domains or in new languages.

The weight learning introduced in the previous section brings us one step closer to that goal. It makes it possible to leverage machine learning, rather than human expertise, in constructing a scoring model for ranking candidate parses.

In truth, however, we're still quite far from fully automated model construction. As things stand, there's one big piece that remains completely manual: grammar engineering. In the arithmetic domain, we had to write rules that specify that the token `"one"`

corresponds to the semantic representation `1`

, that the token `"times"`

corresponds to the semantic representation `*`

, and so on. Because the arithmetic domain is simple and the grammar is small, this wasn't especially cumbersome. But in more realistic domains, grammars can become very large, with thousands of rules. The burden of manual grammar engineering then becomes a major impediment to rapid progress.

Consequently, a major theme in academic research on semantic parsing (and, indeed, syntactic parsing as well) has been grammar induction, which means automatically inducing the rules of the grammar from data. This is a big topic, one that we will touch on only lightly in SippyCup.

One approach to inducing the rules of the grammar, explored in Zettlemoyer & Collins 2005, reduces the problem to conventional weight learning:

- generate all rules possible under certain constraints,
- build a linear scoring model with rule features,
- use SGD training, as in the previous section, to learn weights for each rule, and then
- (optional) prune all but the highest-weight rules

This strategy has usually been applied only to *lexical* rules, with compositional rules still defined manually. What does it mean to generate all lexical rules possible? The basic idea is to build rules from pieces that we find in the training examples. To generate a lexical rule, we:

- choose the LHS from the set of categories which appear in the compositional rules,
- choose the RHS from the set of tokens which appear in the example inputs, and
- choose the semantics from the set of fragments which appear in the example semantics.

This generative process is usually constrained in some way, and much depends on the specific constraints. If we generate too few rules, we are unlikely to find the ones we need for parsing, and we'll see oracle accuracy stuck near zero. But if we generate too many, our grammar will be very large, parsing will be exceedingly slow, and training will grind to a halt.

One way of constraining the generative process, suggested in Liang & Potts 2015, is to assume that we know the proper syntactic category for each token and each semantic fragment, but not the proper mapping between tokens and semantic fragments within the same syntactic category. In the arithmetic domain, this would mean knowing that the tokens `'plus'`

, `'minus'`

, and `'times`

, and the semantic fragments `'+'`

, `'-'`

, and `'*'`

, all belong to the syntactic category `$BinOp`

, but without knowing that `'plus'`

corresponds to `'+'`

.

Let's test this out! The following function expands a set of rules by generating lexical rules from the Cartesian product of existing lexical tokens and semantic fragments for each category, thereby effective "forgetting" the proper mapping between them.

In [36]:

```
def cartesian_product_of_lexical_rules(rules):
"""
Expands the given collection of rules by iterating through all possible
pairs of existing lexical rules and adding a new rule which combines the RHS
of the first rule with the semantics of the second.
"""
lexical_rules = [rule for rule in rules if is_lexical(rule)]
expanded_rules = [rule for rule in rules if not is_lexical(rule)]
# Partition rules by lhs.
lexical_rules_by_lhs = defaultdict(list)
for rule in lexical_rules:
lexical_rules_by_lhs[rule.lhs].append(rule)
# In each partition, iterate through cross-product of lexical rules.
for lhs, rules in lexical_rules_by_lhs.items():
sems = set([rule.sem for rule in rules])
for rule, sem in product(rules, sems):
expanded_rules.append(Rule(rule.lhs, rule.rhs, sem))
return expanded_rules
```

Let's see what it looks like when we apply this to our current arithmetic rules.

In [37]:

```
original_arithmetic_rules = arithmetic_rules()
expanded_arithmetic_rules = cartesian_product_of_lexical_rules(original_arithmetic_rules)
print("Expanded %d arithmetic rules into %d rules\n" % (
len(original_arithmetic_rules), len(expanded_arithmetic_rules)))
for rule in expanded_arithmetic_rules:
print(rule)
```

As expected, the expanded rule set pairs every token with every semantic fragment, restricted by syntactic category.

In our earlier learning experiments, we used the `operator_precedence_features()`

. But this approach to lexicon induction depends on using rule features, so let's create a feature function which generates both kinds of features:

In [38]:

```
def arithmetic_features(parse):
features = rule_features(parse)
features.update(operator_precedence_features(parse))
return features
```

OK, we're finally ready to run a `train_test()`

experiment. Let's create a model using the expanded rule set and both kinds of features. Training is going to be a lot slower now, so let's use our original training dataset of 13 examples, and our usual test dataset of 4 examples.

In [39]:

```
expanded_arithmetic_grammar = Grammar(expanded_arithmetic_rules)
expanded_arithmetic_model = Model(grammar=expanded_arithmetic_grammar,
feature_fn=arithmetic_features,
weights=weights,
executor=execute)
train_test(model=expanded_arithmetic_model,
train_examples=arithmetic_examples[:13],
test_examples=arithmetic_examples[13:],
metrics=denotation_match_metrics(),
training_metric=DenotationAccuracyMetric(),
seed=1)
```

You will observe not only that denotation accuracy improves dramatically on both training and test datasets, but also that "correct" lexical rules (such as the one that pairs `'plus'`

with `'+'`

) get positive weight, while "incorrect" rules (like the one that pairs `'plus'`

with `'-'`

) get negative weight. You will also observe that the number of parses is far higher than it was before! That's because the expanded grammar is much looser and more productive than it was. But at this point we can begin to imagine schemes for pruning the grammar.

There are, of course, a wide variety of approaches to inducing the rules of a semantic grammar from training data, but this simple example should give you some hope that it's possible.

Note that some of these exercises ask you to extend the grammar in ways that will be a bit awkward, given the requirement that rules be in Chomsky normal form (CNF). In the next unit, we'll look at ways to relax this restriction by enriching the `Grammar`

class. Don't jump the gun! Your solutions to these exercises should adhere to the CNF restriction. It's going to be a bit of a pain in the ass, and that's part of the point.

How many parses do you get for "one plus one plus one plus one plus one"? Why?

Extend the grammar to support postfix unary operators, as in "two squared" or "two cubed".

Extend the grammar to support divison, as in "four divided by three" or "minus four over two".

Extend the grammar to support multi-word operators, as in "the square root of one" or "the average of one and two". (You may need to extend the semantic representation and the executor as well.)

Extend the grammar to parse expressions for very large numbers, such as "one million forty eight thousand five hundred seventy six" or "a billion and one".

Extend the grammar to support decimal numbers, such as "three point one four one six" or "one point zero zero zero one".

Extend the grammar to support fractions, including improper fractions and mixed numbers, such as "four thirds", "twenty two over seven", or "two and a quarter".

When we learned from semantics, semantics accuracy on the training examples reached 1.000 after training. However, when we switched to learning from denotations, it did not. Explain why. Be precise and specific.

When we trained on the 100 examples in

`arithmetic_dev_examples`

, denotation accuracy on the training examples did not reach 1.000 after training. Diagnose the errors and describe your findings. What could help us to eliminate those errors? (You don't need to implement the fix. Just give a precise diagnosis.)Add features to ensure that subtraction and division are properly left-associative, so that "four minus three plus two" has denotation 3, "four minus three minus two" has denotation -1, and "four over three times two" has denotation 8/3.

Implement an "eager" version of the arithmetic grammar, in which the semantics of "plus" is not simply a symbol (

`'+'`

), but a lambda function (`lambda x, y: x + y`

), and similarly for the other operators. What else has to change to make this work? What impact does this have on accuracy?When inducing the lexicon, what happens if you drop the restriction that the token and the semantics of a lexical rule must belong to the same syntactic category? If you encounter obstacles, can they be overcome?

Copyright (C) 2015 Bill MacCartney