# SippyCup Unit 2: Travel queries

Bill MacCartney
Spring 2015

This is Unit 2 of the SippyCup codelab.

In our next case study, we're going to look at travel queries, in which a user seeks information about getting from here to there. Examples include:

"birmingham al distance from indianapolish in"
"discount travel flights to austin texas"



It's easy to imagine applications for semantic parsing in the travel domain. Providing answers to queries like these is a core part of the value proposition of intelligent assistants such as Apple's Siri, Google Now, Microsoft's Cortana, and Amazon Echo.

The travel domain has some important differences to the domain of natural language arithmetic we looked at in our first case study.

• Above all, it's far more diverse. The lexicon of travel queries can include thousands of words, not merely dozens — indeed, its potential size is unbounded. And the words and phrases can be assembled in a dizzying variety of syntactic structures.
• It's much more ambiguous. Consider "directions from washington to canada". Does that refer to the State of Washington, or to the District of Columbia? And where exactly in Canada are we headed?
• It can also be much messier. Because it's a real-world application, and because users are idiots make mistakes, travel queries can include lots of misspellings (such as "indianapolish"), mangled syntax, and other linguistic infelicities. It's not all neat and tidy like the arithmetic case study.
• However, the semantics are flatter. The semantic representation for a travel query may need to represent the destination, the origin, the transportation mode, and perhaps a few other bits of information. But, unlike the expression trees we looked at in the arithmetic case study, it does not need to contain arbitrarily deeply-nested recursive structure.

## Example inputs¶

Whenever we start work in a new domain, we must begin with data collection. To drive an effective development process, we need a large, realistic, representative sample of the inputs our system is likely to encounter in its intended application. In the arithmetic case study, we copped out on the data collection, in the name of pedagogical convenience. This time, we'll try harder.

For this purpose, we'll leverage the AOL search query dataset (Pass et al. 2006), a resource which has proven immensely useful over time, despite its controversial origin. This dataset contains approximately 10M unique search queries issued by about 650K AOL users in 2006. Because the dataset was quickly retracted by AOL, it can be difficult to find online, but it is currently available for download here.

Of course, the majority of those 10M queries are not related to travel. (The five most frequent queries are "google", "ebay", "yahoo", "yahoo.com", and "mapquest". Remember, it was 2006. The query "facebook" doesn't show up until position 119!) To construct a query sample suitable for our purposes, we'll need to do some filtering.

Since the queries we're interested in always refer to a location, we'll start by filtering for location names. Recognizing location names is itself a hard problem, so for current purposes we'll adopt a good-enough solution: we'll grep the queries for the names of about 600 U.S. locations (cities, states, mountains, lakes, and so on) listed in Geobase. (We'll be using Geobase again in the next unit.) This winnows the 10M unique queries to about 1M. It's a good start, but it's still not enough. (The top five queries are now "white pages", "florida lottery", "wells fargo", "whitney houston", and "las vegas".) We need to find queries which not only contain a location, but also a clear indication of travel intent.

Since travel is about going from one location to another, let's grep for queries which contain both "from" and "to". This cuts the number of queries to about 3,100, and we're now seeing the kind of queries we're looking for:

"airline flights from raleigh durham to nashville tennessee"
"distance from seattle to vancouver"
"cheap flights to italy from new york"
"international flights from usa to morocco"
"milage from portlan oregon to canada"
"distance from alabama to arkansas in hours"
"driving directions to sault ste marie canada from vermont"
"great white sharks life cycle from brith to death"
"directions from newark airport to bayonne bridge"
"poem it takes courage to love from tyler perrys movie"



A couple of observations here. First, not all of the queries are actually travel queries. That's OK — we actually want to include some queries which do not belong to the target domain. Such "negative" queries are important for evaluating our system's ability to distinguish between queries it should be parsing and queries it shouldn't. Second, there are a lot of mispelings. Again, that's OK. We want a realistic sample which reflects the inputs our system will actually encounter in its intended application. Messy inputs are a fact of life, and they're one of the things that distinguishes this domain from the highly artificial example of natural language arithmetic we looked at in the last unit.

There's still a problem, though. Not all travel queries contain both "from" and "to". Many valid travel queries contain only a destination, but no origin — and some contain an origin, but no destination. If we relax the conjunction to a disjunction (that is, "from" or "to", we get about 23,000 queries, most of which are still not what we're looking for (for example, "colorado rent to own", "moving to florida"). However, the list of examples above (and its continuation, not shown) suggest many other terms we can usefully search for: "airline", "flights", "distance", "mileage", "driving", "directions", and so on. We constructed a list of about 60 such terms (recorded in a comment in travel_examples.py). We then selected queries which contain a location, either "from" or "to", and either one of the ~60 travel terms or both "from" and "to". This left 6,588 queries, and the results look pretty good:

"flight one way las vegas nv to rapid city sd"
"amtrak schedule in california from burbanlk airport"
"airfare las vegas to my"
"cycling route san francisco to sacramento"
"where to find cheapest gasoline in new castle delaware"
"discount tickets to six flags and discounted hotels in new jersey"
"public transportation washington dc to new york city"
"transportation from newark airport to manhattan"



Three of those are not really travel queries, but that's perfectly fine.

Let's recap. To construct our query sample, we:

• Began from the set of unique queries in the AOL dataset (10M queries).
• Selected queries containing one of the 600 locations named in Geobase (1M queries).
• Selected queries containing "from" or "to" (23K queries).
• Selected queries containing one of about 60 travel terms, or containing both "from" and "to" (6,588 queries).

We've got our inputs; now we need to think about the outputs.

## Semantic representation ¶

For the travel domain, we're going to use a style of semantic representation based on nested maps of key-value pairs (represented by Python dictionaries). The keys belong to a small, fixed set of strings. The values can be numbers, strings, or nested maps, depending upon the key.

Here are some examples:

{'domain': 'travel', 'type': 'directions', 'mode': 'car',
'destination': {'id': 4793846, 'name': 'Williamsburg, VA, US'}}
{'domain': 'travel', 'type': 'duration', 'mode': 'bus',
'origin': {'id': 4500546, 'name': 'Atlantic City, NJ, US'},
'destination': {'id': 5128581, 'name': 'New York City, NY, US'}}
{'domain': 'travel', 'type': 'cost', 'mode': 'air',
'origin': {'id': 5101798, 'name': 'Newark, NJ, US'},
'destination': {'id': 4574324, 'name': 'Charleston, SC, US'}}



You can probably discern what each example means without any difficulty. The first is asking for driving directions to Williamsburg. The second is asking for the duration of a bus trip from Atlantic City to New York City. And the third is asking for the cost of an airfare from Newark to Charleston.

To be a bit more formal, the following keys are allowed in the top-level map:

• domain has one of two values: travel for travel queries, or other for non-travel queries.
• type specifies the type of information requested, one of: directions, distance, duration, cost, or schedule.
• mode specifies the mode of travel, one of: air, bike, boat, bus, car, taxi, train, or transit.
• destination specifies the destination of travel using a nested map representing a location.
• origin specifies the origin of travel using a nested map representing a location.

In nested maps representating locations, just two keys are allowed:

• id is an integer which is the unique identifier of the location in the GeoNames geographical database.
• names is a human-readable name for the location whose format is defined by code in geonames.py.

Note that our semantic representation both resolves ambiguity and canonicalizes. One the one hand, the phrase "new york" has a different semantic representation if it refers to the city than if it refers to the state; on the other hand, "new york city" and "nyc" have the same semantic representation. These qualities will make it easy for a downstream executor to turn a semantic representation directly into action.

In a real-world application, the executor would likely be a module that translates our semantic representations into requests to backend APIs to compute optimal routes, retrieve bus schedules or airfares, or whatever. However, because such functionality is not easily captured in a few lines of Python code, SippyCup doesn't include an executor for the travel domain.

Since we don't have an executor, we won't be dealing with denotations in the travel domain. (Abstractly, the denotation of a directions request is some kind of complex computational object representing a route.) Instead, training and evaluation in this domain will consider only whether we're generating the correct semantics. We'll return to considering denotations in our third case study.

## Example data¶

We've described a process for extracting 6,588 (likely) travel queries from the AOL search query dataset, and we've defined a simple semantic representation for such queries. The next step should be to manually annotate each query with its semantics, in order generate example data for training and evaluation.

However, there's a problem. Manual annotation is tedious and time-consuming, and therefore expensive. For datasets of significant size, it is usually not practical for one person to do all the work. This suggests distributing the work over a cluster of graduate students — which is in fact the method that was used to generate the Geo880 dataset that we'll use in our third case study. Accordingly, one of the exercises at the end of this section will ask you to annotate a score of travel queries.

Alternatively, if the annotation task does not require specialized knowledge, the work can often be distributed over workers on a crowdsourcing platform such as Amazon Mechanical Turk. Our semantic representation for the travel domain is simple enough that the crowdsourcing approach might be feasible. Of course, the workers wouldn't write down the semantics directly; rather, they'd fill out a carefully-designed form containing questions about the query, and a script would generate semantic representations based on their responses. We leave working out the details of the crowdsourcing approach as an exercise for the motivated reader. :-)

The difficulty of producing annotated data, and the consequent paucity of such data, has been the eternal bane of research in semantic parsing. Indeed, overcoming this obstacle was the chief motivation for the shift made by Liang et al. 2011 to learn not from semantics but from denotations, which are easier and cheaper to generate via crowdsourcing. Despite this shift, annotated data remains scarce. To this day, the largest standard dataset for semantic parsing (the WebQuestions dataset) contains only 5,810 examples.

For the purposes of this codelab, we have manually annotated 100 examples, given in travel_examples.py. The examples were sampled uniformly at random from the set of 6,588 AOL queries described earlier, and are divided into a training set of 75 examples and a test set of 25 examples. You might find yourself quibbling with some of the annotation choices. In some cases, our semantic representation is too impoverished to capture the user's intent completely. For example, "scenic drive to duluth mn", "honeymoon trip to hawaii", and "jet blue flights to new york" contain modifiers which cannot be expressed in our semantic representation. In other cases, the user's intent may be uncertain or obscure: what exactly does "airbus from boston to europe" mean?

## Enriching the Grammar class¶

Now that we have some example data for the travel domain, we'd like to start building a grammar. However, the grammar formalism we introduced in the arithmetic case study was quite restrictive. Every rule had to be in Chomsky normal form: either a unary lexical rule, or a binary compositional rule. If you did the exercises at the end of the arithmetic case study, you will have chafed against those restrictions. So, to make grammar engineering far more convenient, we're going to loosen things up. We'll modify the Grammar class by adding:

• Annotators, which are external modules for assigning categories and semantics to phrases of specific types.
• N-ary lexical rules, such as Rule('$City', 'new york city'). • Unary compositional rules, such as Rule('$Location', '$City'). • N-ary compositional rules, such as Rule('$RouteQuery', '$FromLocation$ToLocation $TravelMode'). • Rules with optional elements on the RHS, such as Rule('$BikeMode', '?by bike'), which can match either "bike" or "by bike".
• A designated start symbol, typically $ROOT, to allow us to limit which categories can yield complete parses. In order to make all this stuff work, we'll need to redefine our Grammar class and some of its affiliated functions. First, we'll redefine the Grammar constructor to add member variables for storing unary rules, annotators, and the start symbol. In [1]: class Grammar: def __init__(self, rules=[], annotators=[], start_symbol='$ROOT'):
self.categories = set()
self.lexical_rules = defaultdict(list)
self.unary_rules = defaultdict(list)
self.binary_rules = defaultdict(list)
self.annotators = annotators
self.start_symbol = start_symbol
for rule in rules:
print('Created grammar with %d rules.' % len(rules))

def parse_input(self, input):
"""Returns a list of parses for the given input."""
return parse_input(self, input)


Next, we'll redefine add_rule(), so that it knows what to do when it encounters unary or n-ary compositional rules, or rule containing optionals:

In [2]:
def add_rule(grammar, rule):
if contains_optionals(rule):
elif is_lexical(rule):
grammar.lexical_rules[rule.rhs].append(rule)
elif is_unary(rule):
grammar.unary_rules[rule.rhs].append(rule)
elif is_binary(rule):
grammar.binary_rules[rule.rhs].append(rule)
elif all([is_cat(rhsi) for rhsi in rule.rhs]):
else:
# One of the exercises will ask you to handle this case.
raise Exception('RHS mixes terminals and non-terminals: %s' % rule)

def is_unary(rule):
"""
Returns true iff the given Rule is a unary compositional rule, i.e.,
contains only a single category (non-terminal) on the RHS.
"""
return len(rule.rhs) == 1 and is_cat(rule.rhs[0])


Finally, we'll redefine parse_input(), so that it properly invokes functions for applying annotators and unary rules (in addition to existing invocations of functions for applying lexical rules and binary rules).

In [3]:
def parse_input(grammar, input):
"""Returns a list of all parses for input using grammar."""
tokens = input.split()
chart = defaultdict(list)
for j in range(1, len(tokens) + 1):
for i in range(j - 1, -1, -1):
apply_annotators(grammar, chart, tokens, i, j)               # new!
apply_lexical_rules(grammar, chart, tokens, i, j)
apply_binary_rules(grammar, chart, i, j)
apply_unary_rules(grammar, chart, i, j)                      # new!
parses = chart[(0, len(tokens))]
if hasattr(grammar, 'start_symbol') and grammar.start_symbol:        # new!
parses = [parse for parse in parses if parse.rule.lhs == grammar.start_symbol]
return parses


Note that if grammar.start_symbol is defined (the usual case), then we will return only parses having that category at the root.

### Annotators¶

Our travel grammar will need to recognize names of cities (and other kinds of locations). We could achieve this by writing lots of lexical rules:

In [4]:
from parsing import Rule

rules_cities = [
Rule('$City', 'austin', {'id': 4671654, 'name': 'Austin, TX, US'}), Rule('$City', 'boston', {'id': 4930956, 'name': 'Boston, MA, US'}),
Rule('$City', 'chicago', {'id': 4887398, 'name': 'Chicago, IL, US'}), Rule('$City', 'dallas', {'id': 4684888, 'name': 'Dallas, TX, US'}),
# ...
]


However, there are thousands of cities in the world. Adding thousands of rules to our grammar would be cumbersome and difficult to maintain. Besides, there is probably already some library or service which knows how to recognize names of cities and map them into unique identifiers. If we could wrap this external module in the right kind of interface, we could make it directly available to our grammar.

We call such a module an annotator. What kind of an interface should it have? Well, it should take a token span (that is, a list of tokens) and return a list of zero or more interpretations, each a pair consisting of a category and a semantic representation.

In [5]:
class Annotator:
"""A base class for annotators."""
def annotate(self, tokens):
"""Returns a list of pairs, each a category and a semantic representation."""
return []


Let's take a moment to implement a couple of simple annotators which will prove useful later on. The first is a NumberAnnotator, which will annotate any string representing a number with category $Number and semantics equal to its numeric value. In [6]: class NumberAnnotator(Annotator): def annotate(self, tokens): if len(tokens) == 1: try: value = float(tokens[0]) if int(value) == value: value = int(value) return [('$Number', value)]
except ValueError:
pass
return []


Here's the NumberAnnotator in action:

In [7]:
NumberAnnotator().annotate(['16'])

Out[7]:
[('$Number', 16)] Another annotator that will prove very useful is the TokenAnnotator, which will annotate any single token with category $Token and semantics equal to the token itself.

In [8]:
class TokenAnnotator(Annotator):
def annotate(self, tokens):
if len(tokens) == 1:
return [('$Token', tokens[0])] else: return []  Here's the TokenAnnotator in action: In [9]: TokenAnnotator().annotate(['foo'])  Out[9]: [('$Token', 'foo')]

Later in this unit, we'll introduce the GeoNamesAnnotator, a more complex annotator designed to annotate names of locations.

We saw above that the redesigned Grammar class holds a list of annotators, and that the redesigned parse_input() function invokes a function called apply_annotators() on every token span. Its operation is straightforward: for every annotator in the grammar, and for every interpretation of the given span by the annotator, it creates a new parse and adds it to the corresponding chart cell.

In [10]:
def apply_annotators(grammar, chart, tokens, i, j):
"""Add parses to chart cell (i, j) by applying annotators."""
if hasattr(grammar, 'annotators'):
for annotator in grammar.annotators:
for category, semantics in annotator.annotate(tokens[i:j]):
if not check_capacity(chart, i, j):
return
rule = Rule(category, tuple(tokens[i:j]), semantics)
chart[(i, j)].append(Parse(rule, tokens[i:j]))


(The check_capacity() function will be defined in the next section.)

### Unary rules¶

When designing a grammar, it's often convenient to write unary rules, which have a single category on the RHS. However, unary rules are not allowed in Chomsky normal form. One reason why unary rules are problematic is that badly-designed grammars can contain unary cycles, which can cause parsing to get stuck in an infinite regress.

In [11]:
rules_with_unary_cycle = [
Rule('$ROOT', '$Ping', lambda sems: sems[0]),
Rule('$Ping', '$Pong', lambda sems: sems[0] + '1'),
Rule('$Pong', '$Ping', lambda sems: sems[0]),
Rule('$Ping', 'ping', '1'), ]  If a grammar with these rules is used to parse the input "ping", it will generate an endless stream of candidate parses, with semantics '1', '11', '111', '1111', and so on. There are various strategies for preventing unary cycles from causing a runaway parsing process. We'll adopt one of the simplest: rather than trying to identify unary cycles in advance, we'll just impose an upper bound on the capacity of any chart cell, and if the limit is reached, we'll terminate the parsing process. With this in mind, the definition of apply_unary_rules() is straightforward. In [12]: def apply_unary_rules(grammar, chart, i, j): """Add parses to chart cell (i, j) by applying unary rules.""" # Note that the last line of this method can add new parses to chart[(i, # j)], the list over which we are iterating. Because of this, we # essentially get unary closure "for free". (However, if the grammar # contains unary cycles, we'll get stuck in a loop, which is one reason for # check_capacity().) if hasattr(grammar, 'unary_rules'): for parse in chart[(i, j)]: for rule in grammar.unary_rules[(parse.rule.lhs,)]: if not check_capacity(chart, i, j): return chart[(i, j)].append(Parse(rule, [parse])) MAX_CELL_CAPACITY = 10000 # Important for catching e.g. unary cycles. def check_capacity(chart, i, j): if len(chart[(i, j)]) >= MAX_CELL_CAPACITY: print('Cell (%d, %d) has reached capacity %d' % ( i, j, MAX_CELL_CAPACITY)) return False return True  Note that check_capacity() is not just for taming unary cycles — in fact, it's a general-purpose mechanism for bounding the amount of computation invested in a single parsing task. Accordingly, we've included calls to it, not only in apply_unary_rules(), but also in its three sibling functions. As you build richer, more complex grammars, you may find that some inputs have hundreds or even thousands of possible parses, and you may begin to run up against the chart cell capacity limit. Feel free to adjust it as needed. ### N-ary rules¶ Like unary rules, n-ary compositional rules are make grammar engineering easier, by allowing grammar rules to express constraints in natural, intuitive ways. In the previous unit, we saw how to eliminate the n-ary rule Rule('$E', '$E$BinOp $E') from a grammar by binarizing it — that is, by replacing it with two binary rules having equivalent effect. Obviously, the binarizing operation can be automated. The redesigned add_rule() calls a function named add_n_ary_rule() which performs this binarization. In [13]: def add_n_ary_rule(grammar, rule): """ Handles adding a rule with three or more non-terminals on the RHS. We introduce a new category which covers all elements on the RHS except the first, and then generate two variants of the rule: one which consumes those elements to produce the new category, and another which combines the new category which the first element to produce the original LHS category. We add these variants in place of the original rule. (If the new rules still contain more than two elements on the RHS, we'll wind up recursing.) For example, if the original rule is: Rule('$Z', '$A$B $C$D')

then we create a new category '$Z_$A' (roughly, "$Z missing$A to the left"),

Rule('$Z_$A', '$B$C $D') Rule('$Z', '$A$Z_$A') """ def add_category(base_name): assert is_cat(base_name) name = base_name while name in grammar.categories: name = name + '_' grammar.categories.add(name) return name category = add_category('%s_%s' % (rule.lhs, rule.rhs[0])) add_rule(grammar, Rule(category, rule.rhs[1:], lambda sems: sems)) add_rule(grammar, Rule(rule.lhs, (rule.rhs[0], category), lambda sems: apply_semantics(rule, [sems[0]] + sems[1])))  ### Optional elements¶ Another convenience for manual grammar engineering is the ability to mark an element in the RHS as optional — in SippyCup, by putting '?' in front of it. With such a mechanism, we can replace these two rules: Rule('$BikeMode, 'bike')
Rule('$BikeMode', 'by bike')  with a single rule containing an optional element: Rule('$BikeMode', '?by bike')



The optional mechanism also makes it easy to define one category as a sequence of another category:

Rule('$Thing', 'thing') Rule('$Things', '$Thing ?$Things')


The following functions enable the optional mechanism.

In [14]:
from types import FunctionType

def is_optional(label):
"""
Returns true iff the given RHS item is optional, i.e., is marked with an
initial '?'.
"""
return label.startswith('?') and len(label) > 1

def contains_optionals(rule):
"""Returns true iff the given Rule contains any optional items on the RHS."""
return any([is_optional(rhsi) for rhsi in rule.rhs])

"""
Handles adding a rule which contains an optional element on the RHS.
We find the leftmost optional element on the RHS, and then generate
two variants of the rule: one in which that element is required, and
one in which it is removed.  We add these variants in place of the
original rule.  (If there are more optional elements further to the
right, we'll wind up recursing.)

For example, if the original rule is:

Rule('$Z', '$A ?$B ?$C $D') then we add these rules instead: Rule('$Z', '$A$B ?$C$D')
Rule('$Z', '$A ?$C$D')
"""
# Find index of the first optional element on the RHS.
first = next((idx for idx, elt in enumerate(rule.rhs) if is_optional(elt)), -1)
assert first >= 0
assert len(rule.rhs) > 1, 'Entire RHS is optional: %s' % rule
prefix = rule.rhs[:first]
suffix = rule.rhs[(first + 1):]
# First variant: the first optional element gets deoptionalized.
deoptionalized = (rule.rhs[first][1:],)
add_rule(grammar, Rule(rule.lhs, prefix + deoptionalized + suffix, rule.sem))
# Second variant: the first optional element gets removed.
# If the semantics is a value, just keep it as is.
sem = rule.sem
# But if it's a function, we need to supply a dummy argument for the removed element.
if isinstance(rule.sem, FunctionType):
sem = lambda sems: rule.sem(sems[:first] + [None] + sems[first:])
add_rule(grammar, Rule(rule.lhs, prefix + suffix, sem))


## Grammar engineering¶

It's time to start writing some grammar rules for the travel domain. We're going to adopt a data-driven approach. Using the 75 training examples as a development set, we will iteratively:

• look at examples which are not yet parsed correctly,
• identify common obstacles or sources of error,
• introduce new rules to address those problems, and then
• re-evaluate on the 75 training examples.

During grammar engineering, the performance metric we'll focus on is oracle accuracy (the proportion of examples for which any parse is correct), not accuracy (the proportion of examples for which the first parse is correct). Remember that oracle accuracy is an upper bound on accuracy. Oracle accuracy is a measure of the expressive power of the grammar: does it have the rules it needs to generate the correct parse? The gap between oracle accuracy and accuracy, on the other hand, reflects the ability of the scoring model to bring the correct parse to the top of the candidate list.

### Phrase-bag grammars¶

For the travel domain, we're going to develop a style of grammar known as a phrase-bag grammar. To get a sense of how it will work, let's look at ten example inputs from the training data.

travel boston to fr. myers fla
how do i get from tulsa oklahoma to atlantic city. new jersey by air
airbus from boston to europe
cheap tickets to south carolina
birmingham al distance from indianapolish in
one day cruise from fort lauderdale florida
flights from portland or to seattle wa
honeymoon trip to hawaii


Here we've highlighted phrases in different colors, according to the roles they play in building the meaning of the input.

• Green phrases indicate the destination of travel.
• Blue phrases indicate the origin of travel.
• Yellow phrases indicate a mode of travel: air, boat, etc.
• Orange phrases indicate travel of some kind, but do not specify a travel mode.
• Red phrases indicate a specific type of information sought: distance, directions, etc.
• Gray phrases indicate "optional" words which contribute little to the semantics. (The modifiers "one day" and "honeymoon" may be quite meaningful to the user, but our semantic representation is too impoverished to capture them, so they are not relevant for our grammar.)

Note that the ordering of phrases in a query isn't particularly important. Whether the user says directions from washington to canada, or from washington to canada directions, or even to canada directions from washington, the intent is clearly the same. Some of these formulations might be more natural than others — and more common in the query logs — but all of them should be interpretable by our grammar.

That's the motivation for the phrase-bag style of grammar. In a simple phrase-bag grammar, a valid query is made up of one or more query elements, which can appear in any order, and optionals, which can be scattered freely amongst the query elements. (More complex phrase-bag grammars can impose further constraints.) For the travel domain, we can identify two types of query elements: travel locations and travel arguments. A travel location is either a to location (destination) or a from location (origin). A travel argument is either a travel mode, a travel trigger, or a request type.

We're ready to write our first grammar rules for the travel domain.

In [15]:
def sems_0(sems):
return sems[0]

def sems_1(sems):
return sems[1]

def merge_dicts(d1, d2):
if not d2:
return d1
result = d1.copy()
result.update(d2)
return result

rules_travel = [
Rule('$ROOT', '$TravelQuery', sems_0),
Rule('$TravelQuery', '$TravelQueryElements',
lambda sems: merge_dicts({'domain': 'travel'}, sems[0])),
Rule('$TravelQueryElements', '$TravelQueryElement ?$TravelQueryElements', lambda sems: merge_dicts(sems[0], sems[1])), Rule('$TravelQueryElement', '$TravelLocation', sems_0), Rule('$TravelQueryElement', '$TravelArgument', sems_0), ]  These rules are incomplete: $TravelLocation and $TravelArgument are not yet defined, and there is no mention of optionals. But these rules define the high-level structure of our phrase-bag grammar. Pay attention to the semantic functions attached to each rule. Remember that our semantic representations are maps of key-value pairs. In the rules which define $TravelQueryElement, the semantic function propagates the semantics of the child unchanged. In the rule which defines $TravelQueryElements, the semantic function merges the semantics of the children. And in the rule which defines $TravelQuery, the semantic function adds a key-value pair to the semantics of the child.

### Travel locations¶

The rules introduced above left $TravelLocation undefined. Let's add some rules to define it. In [16]: rules_travel_locations = [ Rule('$TravelLocation', '$ToLocation', sems_0), Rule('$TravelLocation', '$FromLocation', sems_0), Rule('$ToLocation', '$To$Location', lambda sems: {'destination': sems[1]}),
Rule('$FromLocation', '$From $Location', lambda sems: {'origin': sems[1]}), Rule('$To', 'to'),
Rule('$From', 'from'), ]  This looks good, but we still need something that defines $Location, which will match a phrase like "boston" and assign to it a semantic representation like {id: 4930956, name: 'Boston, MA, US'}. If only we had such a thing.

### The GeoNamesAnnotator¶

Geocoding is the process of mapping a piece of text describing a location (such as a place name or an address) into a canonical, machine-readable representation (such as geographic coordinates or a unique identifier in a geographic database). Because geocoding is a common need across many applications, many geocoding services are available.

One such service is GeoNames. GeoNames defines a large geographic database in which each location is identified by a unique integer. For example, Boston is identified by 4930956. GeoNames also provides a free, RESTful API for geocoding requests. For example, a request to geocode "Palo Alto" looks like this:

In SippyCup, the GeoNamesAnnotator (defined in geonames.py) is implemented as a wrapper around the GeoNames API. The job of the GeoNamesAnnotator is to recognize $Locations and generate semantics for them. Take a few minutes to skim that code now. Note that the GeoNamesAnnotator uses a persistent cache which is pre-populated for phrases in the 100 annotated travel examples, avoiding the need for live calls to the GeoNames API. However, if you run on any other examples, you will be making live calls. By default, your requests will specify your username as wcmac. That's fine, but each (free) GeoNames account is limited to 2000 calls per hour. If too many people are making calls as wcmac, that quota could be exhausted quickly. If that happens, you'll want to create your own account on GeoNames. You will soon observe that the annotations are far from perfect. For example, "florida" is annotated as {'id': 3442584, 'name': 'Florida, UY'}, which denotes a city in Uruguay. The current implementation of GeoNamesAnnotator could be improved in many ways. For example, while the GeoNames API can return multiple results for ambiguous queries such as "florida", the GeoNamesAnnotator considers only the first result — and there is no guarantee that the first result is the best. It may also be possible to improve the quality of the results by playing with some of the API request parameters documented here. Some of the exercises at the end of this unit ask you to explore these possibilities. We're now ready to see our grammar parse an input: In [17]: from collections import defaultdict from geonames import GeoNamesAnnotator from parsing import * geonames_annotator = GeoNamesAnnotator() rules = rules_travel + rules_travel_locations travel_annotators = [geonames_annotator] grammar = Grammar(rules=rules, annotators=travel_annotators) parses = grammar.parse_input('from boston to austin') print(parses[0].semantics)  Loaded GeoNamesAnnotator cache with 1462 items Created grammar with 11 rules {'domain': 'travel', 'origin': {'name': 'Boston, MA, US', 'id': 4930956}, 'destination': {'name': 'Austin, TX, US', 'id': 4671654}}  ### Travel modes¶ Let's turn our attention from travel locations to travel arguments. One kind of travel argument is a travel mode. Our semantic representation defines eight travel modes (air, bike, boat, bus, car, taxi, train, and transit), and our training examples illustrate some common ways of expressing each travel mode. In fact, individual examples will suggest specific lexical rules. Consider this example: Example(input='flights from portland or to seattle wa', semantics={'domain': 'travel', 'mode': 'air', 'origin': {'id': 5746545, 'name': 'Portland, OR, US'}, 'destination': {'id': 5809844, 'name': 'Seattle, WA, US'}}),  The pairing of the token "flights" with the semantic fragment {'mode': 'air'} suggests the rule: Rule('$AirMode', 'flights', {'mode': 'air'})



Actually, for simplicity, we're going to add the 'air' semantics one level "higher" in the grammar, like this:

Rule('$TravelMode', '$AirMode', {'mode': 'air'})
Rule('$AirMode', 'flights')  The rules below illustrate this approach. These rules will allow our grammar to handle the most common ways of referring to each travel mode. All of the lexical rules below use phrases which are either highly obvious (such as 'taxi' for $TaxiMode) or else are motivated by specific examples in our training dataset (not the test set!).

In [18]:
rules_travel_modes = [
Rule('$TravelArgument', '$TravelMode', sems_0),

Rule('$TravelMode', '$AirMode', {'mode': 'air'}),
Rule('$TravelMode', '$BikeMode', {'mode': 'bike'}),
Rule('$TravelMode', '$BoatMode', {'mode': 'boat'}),
Rule('$TravelMode', '$BusMode', {'mode': 'bus'}),
Rule('$TravelMode', '$CarMode', {'mode': 'car'}),
Rule('$TravelMode', '$TaxiMode', {'mode': 'taxi'}),
Rule('$TravelMode', '$TrainMode', {'mode': 'train'}),
Rule('$TravelMode', '$TransitMode', {'mode': 'transit'}),

Rule('$AirMode', 'air fare'), Rule('$AirMode', 'air fares'),
Rule('$AirMode', 'airbus'), Rule('$AirMode', 'airfare'),
Rule('$AirMode', 'airfares'), Rule('$AirMode', 'airline'),
Rule('$AirMode', 'airlines'), Rule('$AirMode', '?by air'),
Rule('$AirMode', 'flight'), Rule('$AirMode', 'flights'),
Rule('$AirMode', 'fly'), Rule('$BikeMode', '?by bike'),
Rule('$BikeMode', 'bike riding'), Rule('$BoatMode', '?by boat'),
Rule('$BoatMode', 'cruise'), Rule('$BoatMode', 'cruises'),
Rule('$BoatMode', 'norwegian cruise lines'), Rule('$BusMode', '?by bus'),
Rule('$BusMode', 'bus tours'), Rule('$BusMode', 'buses'),
Rule('$BusMode', 'shutle'), Rule('$BusMode', 'shuttle'),

Rule('$CarMode', '?by car'), Rule('$CarMode', 'drive'),
Rule('$CarMode', 'driving'), Rule('$CarMode', 'gas'),

Rule('$TaxiMode', 'cab'), Rule('$TaxiMode', 'car service'),
Rule('$TaxiMode', 'taxi'), Rule('$TrainMode', '?by train'),
Rule('$TrainMode', 'trains'), Rule('$TrainMode', 'amtrak'),

Rule('$TransitMode', '?by public transportation'), Rule('$TransitMode', '?by ?public transit'),
]


Let's see the rules in action on a toy example.

In [19]:
rules = rules_travel + rules_travel_locations + rules_travel_modes
grammar = Grammar(rules=rules, annotators=travel_annotators)
parses = grammar.parse_input('from boston to austin by train')
print(parses[0].semantics)

Created grammar with 54 rules
{'mode': 'train', 'domain': 'travel', 'origin': {'name': 'Boston, MA, US', 'id': 4930956}, 'destination': {'name': 'Austin, TX, US', 'id': 4671654}}


Great, it works.

We're far from done with the travel grammar, but we now have enough in place that we should be able to parse several of the examples in our training data. This means that we can start hill-climbing on oracle accuracy!

To drive our grammar engineering process, we're going to use a SippyCup utility function called sample_wins_and_losses(), which will report our oracle accuracy on the training data, and show some examples we're parsing correctly and some examples we're not.

(Note that sample_wins_and_losses() requires a Domain, a Model, and a Metric. A description of these classes is tangential to our presentation.  If you're interested, read some SippyCup code! It's not very complicated.)

In [20]:
from experiment import sample_wins_and_losses
from metrics import SemanticsOracleAccuracyMetric
from scoring import Model
from travel import TravelDomain

domain = TravelDomain()
model = Model(grammar=grammar)
metric = SemanticsOracleAccuracyMetric()

sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=31)

Loaded GeoNamesAnnotator cache with 1462 items
================================================================================
Evaluating on 75 examples

--------------------------------------------------------------------------------
Over 75 examples:

semantics accuracy                 0.133
semantics oracle accuracy          0.133
number of parses                   0.173
spurious ambiguity                 0.000
has travel parse                   0.173

5 of 10 wins on semantics oracle accuracy:

airfares to honolulu hawaii
cruises from new york
flight from atlanta to vermont
flights from portland or to seattle wa
train from moscow to st. petersburg

10 of 65 losses on semantics oracle accuracy:

cheap tickets to south carolina
cheapest place to live in florida
direct flights from california to loreto mexico
discount travel flights to austin texas
flights newark to raleigh nc
one day cruise from fort lauderdale florida
public transportation to new jersey
ride this train to roseburg oregon now ther's a town for ya
university of washington transportation to seatac



As you can see, sample_wins_and_losses() doesn't print many details — just a few evaluation metrics, and then examples of wins and losses with the current grammar. (A "win" is an example on which our primary metric has a positive value.) Our primary metric, semantics oracle accuracy, stands at 0.133 — not great, but greater than zero, so it's a start. The wins make sense: we can parse queries which consist solely of travel locations and travel modes, with no extraneous elements. The losses are more interesting, because they will provide the motivation for the next phase of work.

### Travel triggers¶

Many of the examples we're currently failing to parse contain phrases (such as "tickets" or "transportation") which indicate a travel intent, but do not specify a travel mode. These phrases constitute the second type of travel argument, namely travel triggers. Inspection of the examples in our training dataset suggest a small number of lexical rules, shown here.

In [21]:
rules_travel_triggers = [
Rule('$TravelArgument', '$TravelTrigger', {}),

Rule('$TravelTrigger', 'tickets'), Rule('$TravelTrigger', 'transportation'),
Rule('$TravelTrigger', 'travel'), Rule('$TravelTrigger', 'travel packages'),
Rule('$TravelTrigger', 'trip'), ]  Let's run sample_wins_and_losses() again, to see how much we've gained, and what to work on next. In [22]: rules = rules_travel + rules_travel_locations + rules_travel_modes + rules_travel_triggers grammar = Grammar(rules=rules, annotators=travel_annotators) model = Model(grammar=grammar) sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)  Created grammar with 60 rules ================================================================================ Evaluating on 75 examples -------------------------------------------------------------------------------- Over 75 examples: semantics accuracy 0.173 semantics oracle accuracy 0.173 number of parses 0.213 spurious ambiguity 0.000 has travel parse 0.213 5 of 13 wins on semantics oracle accuracy: cruises from new york flight from atlanta to vermont flight from ft.lauderdale florida to lexington kentucky train from moscow to st. petersburg travel packages from hartford ct. to berlin germany 10 of 62 losses on semantics oracle accuracy: birmingham al distance from indianapolish in cost of gas from cedar rapids ia to las vegas direct flights from california to loreto mexico distance of oxnard ca. to rohnert park ca. distance usa to peru jet blue flights to new york one way air fare to chicago rochester ny bus tours to yankees stadium scenic drive to duluth mn what is 1500 miles away from texas city  So we've gained about four points (that is, 0.04) in oracle accuracy on the training dataset. We're making progress! Again, the losses suggest where to go next. ### Request types¶ The third and last kind of travel argument is a request type, which indicates a specific type of information sought, such as directions or distance. We'll adopt the same methodology, adding lexical rules motivated by specific training examples, tied together by higher-level rules which add semantics. In [23]: rules_request_types = [ Rule('$TravelArgument', '$RequestType', sems_0), Rule('$RequestType', '$DirectionsRequest', {'type': 'directions'}), Rule('$RequestType', '$DistanceRequest', {'type': 'distance'}), Rule('$RequestType', '$ScheduleRequest', {'type': 'schedule'}), Rule('$RequestType', '$CostRequest', {'type': 'cost'}), Rule('$DirectionsRequest', 'directions'),
Rule('$DirectionsRequest', 'how do i get'), Rule('$DistanceRequest', 'distance'),
Rule('$ScheduleRequest', 'schedule'), Rule('$CostRequest', 'cost'),
]


Again, we'll check our progress using sample_wins_and_losses().

In [24]:
rules = rules_travel + rules_travel_locations + rules_travel_modes + rules_travel_triggers + rules_request_types
grammar = Grammar(rules=rules, annotators=travel_annotators)
model = Model(grammar=grammar)
sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)

Created grammar with 70 rules
================================================================================
Evaluating on 75 examples

--------------------------------------------------------------------------------
Over 75 examples:

semantics accuracy                 0.200
semantics oracle accuracy          0.200
number of parses                   0.240
spurious ambiguity                 0.000
has travel parse                   0.240

5 of 15 wins on semantics oracle accuracy:

cruise from new york to canada
cruises from new york
flight from ft.lauderdale florida to lexington kentucky
flights from portland or to seattle wa

10 of 60 losses on semantics oracle accuracy:

cheapest place to live in florida
cost of gas from cedar rapids ia to las vegas
distance cunmming georgia to chattanooga
distance usa to peru
driving distance washington dc to niagara falls
ithaca to scranton trains
one day cruise from fort lauderdale florida
public transportation to new jersey
santiago chile to atlanta ga. airline schedule 3 26 06



Great, oracle accuracy is up to 0.20. But there's still one big piece we're missing.

### Optionals¶

A key ingredient of the phrase-bag approach to grammar building is the ability to accept optional elements interspersed freely among the query elements. Optionals are phrases which can be either present or absent; typically, they contribute nothing to the semantics.

The following rules illustrate one approach to allowing optionals. The first two rules allow any $TravelQueryElement to combine with an $Optionals either to the right or to the left, while ignoring its semantics. The third rule defines $Optionals as a sequence of one or more $Optional elements, while the following rules define several specific categories of optionals. As usual, most of the lexical rules are motivated by specific examples from the training dataset, with a few extras included just because they are super obvious.

This is not necessarily the best design! One of the exercises will challenge you to do better.

In [25]:
rules_optionals = [
Rule('$TravelQueryElement', '$TravelQueryElement $Optionals', sems_0), Rule('$TravelQueryElement', '$Optionals$TravelQueryElement', sems_1),

Rule('$Optionals', '$Optional ?$Optionals'), Rule('$Optional', '$Show'), Rule('$Optional', '$Modifier'), Rule('$Optional', '$Carrier'), Rule('$Optional', '$Stopword'), Rule('$Optional', '$Determiner'), Rule('$Show', 'book'),
Rule('$Show', 'give ?me'), Rule('$Show', 'show ?me'),

Rule('$Modifier', 'cheap'), Rule('$Modifier', 'cheapest'),
Rule('$Modifier', 'discount'), Rule('$Modifier', 'honeymoon'),
Rule('$Modifier', 'one way'), Rule('$Modifier', 'direct'),
Rule('$Modifier', 'scenic'), Rule('$Modifier', 'transatlantic'),
Rule('$Modifier', 'one day'), Rule('$Modifier', 'last minute'),

Rule('$Carrier', 'delta'), Rule('$Carrier', 'jet blue'),
Rule('$Carrier', 'spirit airlines'), Rule('$Carrier', 'amtrak'),

Rule('$Stopword', 'all'), Rule('$Stopword', 'of'),
Rule('$Stopword', 'what'), Rule('$Stopword', 'will'),
Rule('$Stopword', 'it'), Rule('$Stopword', 'to'),

Rule('$Determiner', 'a'), Rule('$Determiner', 'an'),
]


Note that the last rule depends on the $Token category, which can be applied to any token by the TokenAnnotator. So let's add the TokenAnnotator to our list of annotators. In [28]: travel_annotators = [geonames_annotator, TokenAnnotator()]  As usual, we'll check our progress using sample_wins_and_losses(). In [29]: rules = rules_travel + rules_travel_locations + rules_travel_modes + rules_travel_triggers + rules_request_types + rules_optionals + rules_not_travel grammar = Grammar(rules=rules, annotators=travel_annotators) model = Model(grammar=grammar) sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)  Created grammar with 107 rules ================================================================================ Evaluating on 75 examples -------------------------------------------------------------------------------- Over 75 examples: semantics accuracy 0.173 semantics oracle accuracy 0.573 number of parses 1.773 spurious ambiguity 0.025 has travel parse 0.493 5 of 43 wins on semantics oracle accuracy: cheap flights to hawaii directions from washington to canada jet blue flights to new york travel packages from hartford ct. to berlin germany what will it cost to drive from kemerovo to st. petersburg russia 10 of 32 losses on semantics oracle accuracy: bike riding-seattle to portland birmingham al distance from indianapolish in directions to rupparena in lexington distance of oxnard ca. to rohnert park ca. flights newark to raleigh nc fly boston to myrtle beach spirit airlines ithaca to scranton trains norwegian cruises lines to alaska combined with land tours to denali rochester ny bus tours to yankees stadium transatlantic cruise southampton to tampa  We've achieved another big gain in oracle accuracy, from 0.400 to 0.573, just by ensuring that we offer a "negative" prediction for every input. (Note that the mean number of parses has increased by exactly 1, from 0.773 to 1.773.) However, for the first time, a big gap has opened between accuracy, at 0.173, and oracle accuracy, at 0.573. The problem is that we don't yet have a scoring model for the travel domain, so the ranking of parses is arbitrary. In order to close the gap, we need to create a scoring model. One of the exercises will ask you to pursue this. ## Exercises ¶ Several of these exercises ask you to measure the impact of your change on key evaluation metrics. Part of your job is to decide which evaluation metrics are most relevant for the change you're making. It's probably best to evaluate only on training data, in order to keep the test data unseen during development. (But the test data is hardly a state secret, so whatever.) ### Straightforward¶ 1. Select 20 of the 6,588 queries in aol-travel-queries.txt and manually annotate them with target semantics. Select your queries using uniform random sampling — this will minimize overlap between different people completing this exercise, and therefore maximize the value of the aggregate labor. (In Python, you can sample from a list using random.sample().) You will likely find some cases where it's not clear what the right semantics are. Do your best. The point of the exercise is to develop an awareness of the challenges of annotation, and to recognize that there's no such thing as perfectly annotated data. 2. Many of the remaining errors on training examples occur because the origin isn't marked by "from". Examples include "transatlantic cruise southampton to tampa", "fly boston to myrtle beach spirit airlines", and "distance usa to peru". Extend the grammar to handle examples like these. Measure the impact on key evaluation metrics. 3. Does your solution to the previous exercise handle examples where some other query element intervenes between origin and destination? Examples include "university of washington transportation to seatac", "birmingham al distance from indianapolish in", and "nyc flights to buffalo ny". If not, extend the grammar to handle examples like these. Measure the impact on key evaluation metrics. 4. The current structure of the grammar permits parses containing any number of $FromLocations and $ToLocations, including zero. Find a way to require that (a) there is at least one $FromLocation or $ToLocation, (b) there are not multiple $FromLocations or $ToLocations. 5. The travel grammar is lacking a scoring model, and it shows a big gap between accuracy and oracle accuracy. Examine and diagnose some examples where accuracy is 0 even though oracle accuracy is 1. Propose and implement a scoring model, and measure its efficacy in closing that gap. ### Challenging¶ 1. Extend Grammar to allow rules which mix terminals and non-terminals on the RHS, such as Rule('$RouteQuery', '$TravelMode from$Location to \$Location').

2. Try to improve the precision of the GeoNamesAnnotator by fiddling with the GeoNames API request parameters documented here. For example, the featureClass, countryBias, or orderby parameters seem like promising targets.

3. Try to improve the coverage of the GeoNamesAnnotator by enabling it to return multiple annotations for ambiguous location names. Investigate the impact of varying the maximum number of annotations on various performance metrics, including accuracy, oracle accuracy, and number of parses. How would you characterize the tradeoff you're making?

4. Building on the previous exercise, implement a feature which captures information about the result rank of annotations generated by the GeoNamesAnnotator, and see if you can use this feature to narrow the gap between accuracy and oracle accuracy.

5. You have probably noticed that one of our standard evaluation metrics is something called "spurious ambiguity". Dig into the SippyCup codebase to figure out what spurious ambiguity is. Here's a hint: it's something bad, so we want to push that metric toward zero. Find a training example where it's not zero, and figure out why the example exhibits spurious ambiguity. Are there changes we can make to the grammar to reduce spurious ambiguity? Also, why is spurious ambiguity undesirable?

6. In its current form, the travel grammar parses lots of queries it shouldn't. (By "parses", we mean "generates a positive parse".) This problem is known as overtriggering. The overtriggering problem is hard to observe on our tiny dataset of 100 examples, where most of the examples are positive examples. Investigate overtriggering by downloading the AOL query dataset, identifying the 1,000 most frequent queries, and running them through the grammar. How many cases of overtriggering do you find? Can you suggest some simple changes to minimize overtriggering? (Warning: the AOL dataset contains lots of queries which may be offensive. Skip this exercise if you're not down with that.)

7. Consider the queries "flights los angeles hawaii" vs. "flights los angeles california". Despite the superficial resemblance, the second query seems to mean flights to Los Angeles, not flights from Los Angeles to California. But the grammar can successfully interpret both queries only if it permits both interpretations for each query. This creates a ranking problem: which interpretation should be scored higher? Can you add to the scoring model a feature which solves the problem?

8. Develop and execute a strategy to annotate all 6,588 queries in aol-travel-queries.txt with target semantics using crowdsourcing. Warning: this is an ambitious exercise.