Notebook

SippyCup
Unit 2: Travel queries

Bill MacCartney
Spring 2015

This is Unit 2 of the SippyCup codelab.

In our next case study, we're going to look at travel queries, in which a user seeks information about getting from here to there. Examples include:

"birmingham al distance from indianapolish in"
"directions from washington to canada"
"discount travel flights to austin texas"

It's easy to imagine applications for semantic parsing in the travel domain. Providing answers to queries like these is a core part of the value proposition of intelligent assistants such as Apple's Siri, Google Now, Microsoft's Cortana, and Amazon Echo.

The travel domain has some important differences to the domain of natural language arithmetic we looked at in our first case study.

Above all, it's far more diverse. The lexicon of travel queries can include thousands of words, not merely dozens — indeed, its potential size is unbounded. And the words and phrases can be assembled in a dizzying variety of syntactic structures.
It's much more ambiguous. Consider "directions from washington to canada". Does that refer to the State of Washington, or to the District of Columbia? And where exactly in Canada are we headed?
It can also be much messier. Because it's a real-world application, and because users ~~are idiots~~ make mistakes, travel queries can include lots of misspellings (such as "indianapolish"), mangled syntax, and other linguistic infelicities. It's not all neat and tidy like the arithmetic case study.
However, the semantics are flatter. The semantic representation for a travel query may need to represent the destination, the origin, the transportation mode, and perhaps a few other bits of information. But, unlike the expression trees we looked at in the arithmetic case study, it does not need to contain arbitrarily deeply-nested recursive structure.

Example inputs¶

Whenever we start work in a new domain, we must begin with data collection. To drive an effective development process, we need a large, realistic, representative sample of the inputs our system is likely to encounter in its intended application. In the arithmetic case study, we copped out on the data collection, in the name of pedagogical convenience. This time, we'll try harder.

For this purpose, we'll leverage the AOL search query dataset (Pass et al. 2006), a resource which has proven immensely useful over time, despite its controversial origin. This dataset contains approximately 10M unique search queries issued by about 650K AOL users in 2006. Because the dataset was quickly retracted by AOL, it can be difficult to find online, but it is currently available for download here.

Of course, the majority of those 10M queries are not related to travel. (The five most frequent queries are "google", "ebay", "yahoo", "yahoo.com", and "mapquest". Remember, it was 2006. The query "facebook" doesn't show up until position 119!) To construct a query sample suitable for our purposes, we'll need to do some filtering.

Since the queries we're interested in always refer to a location, we'll start by filtering for location names. Recognizing location names is itself a hard problem, so for current purposes we'll adopt a good-enough solution: we'll grep the queries for the names of about 600 U.S. locations (cities, states, mountains, lakes, and so on) listed in Geobase. (We'll be using Geobase again in the next unit.) This winnows the 10M unique queries to about 1M. It's a good start, but it's still not enough. (The top five queries are now "white pages", "florida lottery", "wells fargo", "whitney houston", and "las vegas".) We need to find queries which not only contain a location, but also a clear indication of travel intent.

Since travel is about going from one location to another, let's grep for queries which contain both "from" and "to". This cuts the number of queries to about 3,100, and we're now seeing the kind of queries we're looking for:

"airline flights from raleigh durham to nashville tennessee"
"distance from seattle to vancouver"
"cheap flights to italy from new york"
"international flights from usa to morocco"
"milage from portlan oregon to canada"
"distance from alabama to arkansas in hours"
"driving directions to sault ste marie canada from vermont"
"great white sharks life cycle from brith to death"
"directions from newark airport to bayonne bridge"
"poem it takes courage to love from tyler perrys movie"

A couple of observations here. First, not all of the queries are actually travel queries. That's OK — we actually want to include some queries which do not belong to the target domain. Such "negative" queries are important for evaluating our system's ability to distinguish between queries it should be parsing and queries it shouldn't. Second, there are a lot of mispelings. Again, that's OK. We want a realistic sample which reflects the inputs our system will actually encounter in its intended application. Messy inputs are a fact of life, and they're one of the things that distinguishes this domain from the highly artificial example of natural language arithmetic we looked at in the last unit.

There's still a problem, though. Not all travel queries contain both "from" and "to". Many valid travel queries contain only a destination, but no origin — and some contain an origin, but no destination. If we relax the conjunction to a disjunction (that is, "from" or "to", we get about 23,000 queries, most of which are still not what we're looking for (for example, "colorado rent to own", "moving to florida"). However, the list of examples above (and its continuation, not shown) suggest many other terms we can usefully search for: "airline", "flights", "distance", "mileage", "driving", "directions", and so on. We constructed a list of about 60 such terms (recorded in a comment in travel_examples.py). We then selected queries which contain a location, either "from" or "to", and either one of the ~60 travel terms or both "from" and "to". This left 6,588 queries, and the results look pretty good:

"flight one way las vegas nv to rapid city sd"
"number to alaska airlines"
"travel time to denver colorado"
"amtrak schedule in california from burbanlk airport"
"airfare las vegas to my"
"cycling route san francisco to sacramento"
"where to find cheapest gasoline in new castle delaware"
"discount tickets to six flags and discounted hotels in new jersey"
"public transportation washington dc to new york city"
"transportation from newark airport to manhattan"

Three of those are not really travel queries, but that's perfectly fine.

Let's recap. To construct our query sample, we:

Began from the set of unique queries in the AOL dataset (10M queries).
Selected queries containing one of the 600 locations named in Geobase (1M queries).
Selected queries containing "from" or "to" (23K queries).
Selected queries containing one of about 60 travel terms, or containing both "from" and "to" (6,588 queries).

We've got our inputs; now we need to think about the outputs.

Semantic representation ¶

For the travel domain, we're going to use a style of semantic representation based on nested maps of key-value pairs (represented by Python dictionaries). The keys belong to a small, fixed set of strings. The values can be numbers, strings, or nested maps, depending upon the key.

Here are some examples:

{'domain': 'travel', 'type': 'directions', 'mode': 'car',
   'destination': {'id': 4793846, 'name': 'Williamsburg, VA, US'}}
{'domain': 'travel', 'type': 'duration', 'mode': 'bus',
   'origin': {'id': 4500546, 'name': 'Atlantic City, NJ, US'},
   'destination': {'id': 5128581, 'name': 'New York City, NY, US'}}
{'domain': 'travel', 'type': 'cost', 'mode': 'air',
   'origin': {'id': 5101798, 'name': 'Newark, NJ, US'},
   'destination': {'id': 4574324, 'name': 'Charleston, SC, US'}}

You can probably discern what each example means without any difficulty. The first is asking for driving directions to Williamsburg. The second is asking for the duration of a bus trip from Atlantic City to New York City. And the third is asking for the cost of an airfare from Newark to Charleston.

To be a bit more formal, the following keys are allowed in the top-level map:

domain has one of two values: travel for travel queries, or other for non-travel queries.
type specifies the type of information requested, one of: directions, distance, duration, cost, or schedule.
mode specifies the mode of travel, one of: air, bike, boat, bus, car, taxi, train, or transit.
destination specifies the destination of travel using a nested map representing a location.
origin specifies the origin of travel using a nested map representing a location.

In nested maps representating locations, just two keys are allowed:

id is an integer which is the unique identifier of the location in the GeoNames geographical database.
names is a human-readable name for the location whose format is defined by code in geonames.py.

Note that our semantic representation both resolves ambiguity and canonicalizes. One the one hand, the phrase "new york" has a different semantic representation if it refers to the city than if it refers to the state; on the other hand, "new york city" and "nyc" have the same semantic representation. These qualities will make it easy for a downstream executor to turn a semantic representation directly into action.

In a real-world application, the executor would likely be a module that translates our semantic representations into requests to backend APIs to compute optimal routes, retrieve bus schedules or airfares, or whatever. However, because such functionality is not easily captured in a few lines of Python code, SippyCup doesn't include an executor for the travel domain.

Since we don't have an executor, we won't be dealing with denotations in the travel domain. (Abstractly, the denotation of a directions request is some kind of complex computational object representing a route.) Instead, training and evaluation in this domain will consider only whether we're generating the correct semantics. We'll return to considering denotations in our third case study.

Example data¶

We've described a process for extracting 6,588 (likely) travel queries from the AOL search query dataset, and we've defined a simple semantic representation for such queries. The next step should be to manually annotate each query with its semantics, in order generate example data for training and evaluation.

However, there's a problem. Manual annotation is tedious and time-consuming, and therefore expensive. For datasets of significant size, it is usually not practical for one person to do all the work. This suggests distributing the work over a cluster of graduate students — which is in fact the method that was used to generate the Geo880 dataset that we'll use in our third case study. Accordingly, one of the exercises at the end of this section will ask you to annotate a score of travel queries.

Alternatively, if the annotation task does not require specialized knowledge, the work can often be distributed over workers on a crowdsourcing platform such as Amazon Mechanical Turk. Our semantic representation for the travel domain is simple enough that the crowdsourcing approach might be feasible. Of course, the workers wouldn't write down the semantics directly; rather, they'd fill out a carefully-designed form containing questions about the query, and a script would generate semantic representations based on their responses. We leave working out the details of the crowdsourcing approach as an exercise for the motivated reader. :-)

The difficulty of producing annotated data, and the consequent paucity of such data, has been the eternal bane of research in semantic parsing. Indeed, overcoming this obstacle was the chief motivation for the shift made by Liang et al. 2011 to learn not from semantics but from denotations, which are easier and cheaper to generate via crowdsourcing. Despite this shift, annotated data remains scarce. To this day, the largest standard dataset for semantic parsing (the WebQuestions dataset) contains only 5,810 examples.

For the purposes of this codelab, we have manually annotated 100 examples, given in travel_examples.py. The examples were sampled uniformly at random from the set of 6,588 AOL queries described earlier, and are divided into a training set of 75 examples and a test set of 25 examples. You might find yourself quibbling with some of the annotation choices. In some cases, our semantic representation is too impoverished to capture the user's intent completely. For example, "scenic drive to duluth mn", "honeymoon trip to hawaii", and "jet blue flights to new york" contain modifiers which cannot be expressed in our semantic representation. In other cases, the user's intent may be uncertain or obscure: what exactly does "airbus from boston to europe" mean?

Enriching the Grammar class¶

Now that we have some example data for the travel domain, we'd like to start building a grammar. However, the grammar formalism we introduced in the arithmetic case study was quite restrictive. Every rule had to be in Chomsky normal form: either a unary lexical rule, or a binary compositional rule. If you did the exercises at the end of the arithmetic case study, you will have chafed against those restrictions. So, to make grammar engineering far more convenient, we're going to loosen things up. We'll modify the Grammar class by adding:

Annotators, which are external modules for assigning categories and semantics to phrases of specific types.
N-ary lexical rules, such as Rule('$City', 'new york city').
Unary compositional rules, such as Rule('$Location', '$City').
N-ary compositional rules, such as Rule('$RouteQuery', '$FromLocation $ToLocation $TravelMode').
Rules with optional elements on the RHS, such as Rule('$BikeMode', '?by bike'), which can match either "bike" or "by bike".
A designated start symbol, typically $ROOT, to allow us to limit which categories can yield complete parses.

In order to make all this stuff work, we'll need to redefine our Grammar class and some of its affiliated functions. First, we'll redefine the Grammar constructor to add member variables for storing unary rules, annotators, and the start symbol.

In [1]:

class Grammar:
    def __init__(self, rules=[], annotators=[], start_symbol='$ROOT'):
        self.categories = set()
        self.lexical_rules = defaultdict(list)
        self.unary_rules = defaultdict(list)
        self.binary_rules = defaultdict(list)
        self.annotators = annotators
        self.start_symbol = start_symbol
        for rule in rules:
            add_rule(self, rule)
        print('Created grammar with %d rules.' % len(rules))

    def parse_input(self, input):
        """Returns a list of parses for the given input."""
        return parse_input(self, input)

Next, we'll redefine add_rule(), so that it knows what to do when it encounters unary or n-ary compositional rules, or rule containing optionals:

In [2]:

def add_rule(grammar, rule):
    if contains_optionals(rule):
        add_rule_containing_optional(grammar, rule)
    elif is_lexical(rule):
        grammar.lexical_rules[rule.rhs].append(rule)
    elif is_unary(rule):
        grammar.unary_rules[rule.rhs].append(rule)
    elif is_binary(rule):
        grammar.binary_rules[rule.rhs].append(rule)
    elif all([is_cat(rhsi) for rhsi in rule.rhs]):
        add_n_ary_rule(grammar, rule)
    else:
        # One of the exercises will ask you to handle this case.
        raise Exception('RHS mixes terminals and non-terminals: %s' % rule)
        
def is_unary(rule):
    """
    Returns true iff the given Rule is a unary compositional rule, i.e.,
    contains only a single category (non-terminal) on the RHS.
    """
    return len(rule.rhs) == 1 and is_cat(rule.rhs[0])

Finally, we'll redefine parse_input(), so that it properly invokes functions for applying annotators and unary rules (in addition to existing invocations of functions for applying lexical rules and binary rules).

In [3]:

def parse_input(grammar, input):
    """Returns a list of all parses for input using grammar."""
    tokens = input.split()
    chart = defaultdict(list)
    for j in range(1, len(tokens) + 1):
        for i in range(j - 1, -1, -1):
            apply_annotators(grammar, chart, tokens, i, j)               # new!
            apply_lexical_rules(grammar, chart, tokens, i, j)
            apply_binary_rules(grammar, chart, i, j)
            apply_unary_rules(grammar, chart, i, j)                      # new!
    parses = chart[(0, len(tokens))]
    if hasattr(grammar, 'start_symbol') and grammar.start_symbol:        # new!
        parses = [parse for parse in parses if parse.rule.lhs == grammar.start_symbol]
    return parses

Note that if grammar.start_symbol is defined (the usual case), then we will return only parses having that category at the root.

Annotators¶

Our travel grammar will need to recognize names of cities (and other kinds of locations). We could achieve this by writing lots of lexical rules:

In [4]:

from parsing import Rule

rules_cities = [
    Rule('$City', 'austin', {'id': 4671654, 'name': 'Austin, TX, US'}),
    Rule('$City', 'boston', {'id': 4930956, 'name': 'Boston, MA, US'}),
    Rule('$City', 'chicago', {'id': 4887398, 'name': 'Chicago, IL, US'}),
    Rule('$City', 'dallas', {'id': 4684888, 'name': 'Dallas, TX, US'}),
    # ...
]

However, there are thousands of cities in the world. Adding thousands of rules to our grammar would be cumbersome and difficult to maintain. Besides, there is probably already some library or service which knows how to recognize names of cities and map them into unique identifiers. If we could wrap this external module in the right kind of interface, we could make it directly available to our grammar.

We call such a module an annotator. What kind of an interface should it have? Well, it should take a token span (that is, a list of tokens) and return a list of zero or more interpretations, each a pair consisting of a category and a semantic representation.

In [5]:

class Annotator:
    """A base class for annotators."""
    def annotate(self, tokens):
        """Returns a list of pairs, each a category and a semantic representation."""
        return []

Let's take a moment to implement a couple of simple annotators which will prove useful later on. The first is a NumberAnnotator, which will annotate any string representing a number with category $Number and semantics equal to its numeric value.

In [6]:

class NumberAnnotator(Annotator):
    def annotate(self, tokens):
        if len(tokens) == 1:
            try:
                value = float(tokens[0])
                if int(value) == value:
                    value = int(value)
                return [('$Number', value)]
            except ValueError:
                pass
        return []

Here's the NumberAnnotator in action:

In [7]:

NumberAnnotator().annotate(['16'])

Out[7]:

[('$Number', 16)]

Another annotator that will prove very useful is the TokenAnnotator, which will annotate any single token with category $Token and semantics equal to the token itself.

In [8]:

class TokenAnnotator(Annotator):
    def annotate(self, tokens):
        if len(tokens) == 1:
            return [('$Token', tokens[0])]
        else:
            return []

Here's the TokenAnnotator in action:

In [9]:

TokenAnnotator().annotate(['foo'])

Out[9]:

[('$Token', 'foo')]

Later in this unit, we'll introduce the GeoNamesAnnotator, a more complex annotator designed to annotate names of locations.

We saw above that the redesigned Grammar class holds a list of annotators, and that the redesigned parse_input() function invokes a function called apply_annotators() on every token span. Its operation is straightforward: for every annotator in the grammar, and for every interpretation of the given span by the annotator, it creates a new parse and adds it to the corresponding chart cell.

In [10]:

def apply_annotators(grammar, chart, tokens, i, j):
    """Add parses to chart cell (i, j) by applying annotators."""
    if hasattr(grammar, 'annotators'):
        for annotator in grammar.annotators:
            for category, semantics in annotator.annotate(tokens[i:j]):
                if not check_capacity(chart, i, j):
                    return
                rule = Rule(category, tuple(tokens[i:j]), semantics)
                chart[(i, j)].append(Parse(rule, tokens[i:j]))

(The check_capacity() function will be defined in the next section.)

Unary rules¶

When designing a grammar, it's often convenient to write unary rules, which have a single category on the RHS. However, unary rules are not allowed in Chomsky normal form. One reason why unary rules are problematic is that badly-designed grammars can contain unary cycles, which can cause parsing to get stuck in an infinite regress.

In [11]:

rules_with_unary_cycle = [
    Rule('$ROOT', '$Ping', lambda sems: sems[0]),
    Rule('$Ping', '$Pong', lambda sems: sems[0] + '1'),
    Rule('$Pong', '$Ping', lambda sems: sems[0]),
    Rule('$Ping', 'ping', '1'),
]

If a grammar with these rules is used to parse the input "ping", it will generate an endless stream of candidate parses, with semantics '1', '11', '111', '1111', and so on.

There are various strategies for preventing unary cycles from causing a runaway parsing process. We'll adopt one of the simplest: rather than trying to identify unary cycles in advance, we'll just impose an upper bound on the capacity of any chart cell, and if the limit is reached, we'll terminate the parsing process.

With this in mind, the definition of apply_unary_rules() is straightforward.

In [12]:

def apply_unary_rules(grammar, chart, i, j):
    """Add parses to chart cell (i, j) by applying unary rules."""
    # Note that the last line of this method can add new parses to chart[(i,
    # j)], the list over which we are iterating.  Because of this, we
    # essentially get unary closure "for free".  (However, if the grammar
    # contains unary cycles, we'll get stuck in a loop, which is one reason for
    # check_capacity().)
    if hasattr(grammar, 'unary_rules'):
        for parse in chart[(i, j)]:
            for rule in grammar.unary_rules[(parse.rule.lhs,)]:
                if not check_capacity(chart, i, j):
                    return
                chart[(i, j)].append(Parse(rule, [parse]))

MAX_CELL_CAPACITY = 10000

# Important for catching e.g. unary cycles.
def check_capacity(chart, i, j):
    if len(chart[(i, j)]) >= MAX_CELL_CAPACITY:
        print('Cell (%d, %d) has reached capacity %d' % (
            i, j, MAX_CELL_CAPACITY))
        return False
    return True

Note that check_capacity() is not just for taming unary cycles — in fact, it's a general-purpose mechanism for bounding the amount of computation invested in a single parsing task. Accordingly, we've included calls to it, not only in apply_unary_rules(), but also in its three sibling functions. As you build richer, more complex grammars, you may find that some inputs have hundreds or even thousands of possible parses, and you may begin to run up against the chart cell capacity limit. Feel free to adjust it as needed.

N-ary rules¶

Like unary rules, n-ary compositional rules are make grammar engineering easier, by allowing grammar rules to express constraints in natural, intuitive ways.

In the previous unit, we saw how to eliminate the n-ary rule Rule('$E', '$E $BinOp $E') from a grammar by binarizing it — that is, by replacing it with two binary rules having equivalent effect. Obviously, the binarizing operation can be automated. The redesigned add_rule() calls a function named add_n_ary_rule() which performs this binarization.

In [13]:

def add_n_ary_rule(grammar, rule):
    """
    Handles adding a rule with three or more non-terminals on the RHS.
    We introduce a new category which covers all elements on the RHS except
    the first, and then generate two variants of the rule: one which
    consumes those elements to produce the new category, and another which
    combines the new category which the first element to produce the
    original LHS category.  We add these variants in place of the
    original rule.  (If the new rules still contain more than two elements
    on the RHS, we'll wind up recursing.)

    For example, if the original rule is:

        Rule('$Z', '$A $B $C $D')

    then we create a new category '$Z_$A' (roughly, "$Z missing $A to the left"),
    and add these rules instead:

        Rule('$Z_$A', '$B $C $D')
        Rule('$Z', '$A $Z_$A')
    """
    def add_category(base_name):
        assert is_cat(base_name)
        name = base_name
        while name in grammar.categories:
            name = name + '_'
        grammar.categories.add(name)
        return name
    category = add_category('%s_%s' % (rule.lhs, rule.rhs[0]))
    add_rule(grammar, Rule(category, rule.rhs[1:], lambda sems: sems))
    add_rule(grammar, Rule(rule.lhs, (rule.rhs[0], category),
                           lambda sems: apply_semantics(rule, [sems[0]] + sems[1])))

Optional elements¶

Another convenience for manual grammar engineering is the ability to mark an element in the RHS as optional — in SippyCup, by putting '?' in front of it. With such a mechanism, we can replace these two rules:

Rule('$BikeMode, 'bike')
Rule('$BikeMode', 'by bike')

with a single rule containing an optional element:

Rule('$BikeMode', '?by bike')

The optional mechanism also makes it easy to define one category as a sequence of another category:

Rule('$Thing', 'thing')
Rule('$Things', '$Thing ?$Things')

The following functions enable the optional mechanism.

In [14]:

from types import FunctionType

def is_optional(label):
    """
    Returns true iff the given RHS item is optional, i.e., is marked with an
    initial '?'.
    """
    return label.startswith('?') and len(label) > 1

def contains_optionals(rule):
    """Returns true iff the given Rule contains any optional items on the RHS."""
    return any([is_optional(rhsi) for rhsi in rule.rhs])

def add_rule_containing_optional(grammar, rule):
    """
    Handles adding a rule which contains an optional element on the RHS.
    We find the leftmost optional element on the RHS, and then generate
    two variants of the rule: one in which that element is required, and
    one in which it is removed.  We add these variants in place of the
    original rule.  (If there are more optional elements further to the
    right, we'll wind up recursing.)

    For example, if the original rule is:

        Rule('$Z', '$A ?$B ?$C $D')

    then we add these rules instead:

        Rule('$Z', '$A $B ?$C $D')
        Rule('$Z', '$A ?$C $D')
    """
    # Find index of the first optional element on the RHS.
    first = next((idx for idx, elt in enumerate(rule.rhs) if is_optional(elt)), -1)
    assert first >= 0
    assert len(rule.rhs) > 1, 'Entire RHS is optional: %s' % rule
    prefix = rule.rhs[:first]
    suffix = rule.rhs[(first + 1):]
    # First variant: the first optional element gets deoptionalized.
    deoptionalized = (rule.rhs[first][1:],)
    add_rule(grammar, Rule(rule.lhs, prefix + deoptionalized + suffix, rule.sem))
    # Second variant: the first optional element gets removed.
    # If the semantics is a value, just keep it as is.
    sem = rule.sem
    # But if it's a function, we need to supply a dummy argument for the removed element.
    if isinstance(rule.sem, FunctionType):
        sem = lambda sems: rule.sem(sems[:first] + [None] + sems[first:])
    add_rule(grammar, Rule(rule.lhs, prefix + suffix, sem))

Grammar engineering¶

It's time to start writing some grammar rules for the travel domain. We're going to adopt a data-driven approach. Using the 75 training examples as a development set, we will iteratively:

look at examples which are not yet parsed correctly,
identify common obstacles or sources of error,
introduce new rules to address those problems, and then
re-evaluate on the 75 training examples.

During grammar engineering, the performance metric we'll focus on is oracle accuracy (the proportion of examples for which any parse is correct), not accuracy (the proportion of examples for which the first parse is correct). Remember that oracle accuracy is an upper bound on accuracy. Oracle accuracy is a measure of the expressive power of the grammar: does it have the rules it needs to generate the correct parse? The gap between oracle accuracy and accuracy, on the other hand, reflects the ability of the scoring model to bring the correct parse to the top of the candidate list.

Phrase-bag grammars¶

For the travel domain, we're going to develop a style of grammar known as a phrase-bag grammar. To get a sense of how it will work, let's look at ten example inputs from the training data.

travel boston to fr. myers fla
how do i get from tulsa oklahoma to atlantic city. new jersey by air
airbus from boston to europe
cheap tickets to south carolina
birmingham al distance from indianapolish in
transportation to the philadelphia airport
one day cruise from fort lauderdale florida
directions from washington to canada
flights from portland or to seattle wa
honeymoon trip to hawaii

Here we've highlighted phrases in different colors, according to the roles they play in building the meaning of the input.

Green phrases indicate the destination of travel.
Blue phrases indicate the origin of travel.
Yellow phrases indicate a mode of travel: air, boat, etc.
Orange phrases indicate travel of some kind, but do not specify a travel mode.
Red phrases indicate a specific type of information sought: distance, directions, etc.
Gray phrases indicate "optional" words which contribute little to the semantics. (The modifiers "one day" and "honeymoon" may be quite meaningful to the user, but our semantic representation is too impoverished to capture them, so they are not relevant for our grammar.)

Note that the ordering of phrases in a query isn't particularly important. Whether the user says directions from washington to canada, or from washington to canada directions, or even to canada directions from washington, the intent is clearly the same. Some of these formulations might be more natural than others — and more common in the query logs — but all of them should be interpretable by our grammar.

That's the motivation for the phrase-bag style of grammar. In a simple phrase-bag grammar, a valid query is made up of one or more query elements, which can appear in any order, and optionals, which can be scattered freely amongst the query elements. (More complex phrase-bag grammars can impose further constraints.) For the travel domain, we can identify two types of query elements: travel locations and travel arguments. A travel location is either a to location (destination) or a from location (origin). A travel argument is either a travel mode, a travel trigger, or a request type.

We're ready to write our first grammar rules for the travel domain.

In [15]:

def sems_0(sems):
    return sems[0]

def sems_1(sems):
    return sems[1]

def merge_dicts(d1, d2):
    if not d2:
        return d1
    result = d1.copy()
    result.update(d2)
    return result

rules_travel = [
    Rule('$ROOT', '$TravelQuery', sems_0),
    Rule('$TravelQuery', '$TravelQueryElements',
         lambda sems: merge_dicts({'domain': 'travel'}, sems[0])),
    Rule('$TravelQueryElements', '$TravelQueryElement ?$TravelQueryElements',
         lambda sems: merge_dicts(sems[0], sems[1])),
    Rule('$TravelQueryElement', '$TravelLocation', sems_0),
    Rule('$TravelQueryElement', '$TravelArgument', sems_0),
]

These rules are incomplete: $TravelLocation and $TravelArgument are not yet defined, and there is no mention of optionals. But these rules define the high-level structure of our phrase-bag grammar.

Pay attention to the semantic functions attached to each rule. Remember that our semantic representations are maps of key-value pairs. In the rules which define $TravelQueryElement, the semantic function propagates the semantics of the child unchanged. In the rule which defines $TravelQueryElements, the semantic function merges the semantics of the children. And in the rule which defines $TravelQuery, the semantic function adds a key-value pair to the semantics of the child.

Travel locations¶

The rules introduced above left $TravelLocation undefined. Let's add some rules to define it.

In [16]:

rules_travel_locations = [
    Rule('$TravelLocation', '$ToLocation', sems_0),
    Rule('$TravelLocation', '$FromLocation', sems_0),
    Rule('$ToLocation', '$To $Location', lambda sems: {'destination': sems[1]}),
    Rule('$FromLocation', '$From $Location', lambda sems: {'origin': sems[1]}),
    Rule('$To', 'to'),
    Rule('$From', 'from'),
]

This looks good, but we still need something that defines $Location, which will match a phrase like "boston" and assign to it a semantic representation like {id: 4930956, name: 'Boston, MA, US'}. If only we had such a thing.

The `GeoNamesAnnotator`¶

Geocoding is the process of mapping a piece of text describing a location (such as a place name or an address) into a canonical, machine-readable representation (such as geographic coordinates or a unique identifier in a geographic database). Because geocoding is a common need across many applications, many geocoding services are available.

One such service is GeoNames. GeoNames defines a large geographic database in which each location is identified by a unique integer. For example, Boston is identified by 4930956. GeoNames also provides a free, RESTful API for geocoding requests. For example, a request to geocode "Palo Alto" looks like this:

http://api.geonames.org/searchJSON?q=Palo+Alto&username=demo

In SippyCup, the GeoNamesAnnotator (defined in geonames.py) is implemented as a wrapper around the GeoNames API. The job of the GeoNamesAnnotator is to recognize $Locations and generate semantics for them. Take a few minutes to skim that code now.

Note that the GeoNamesAnnotator uses a persistent cache which is pre-populated for phrases in the 100 annotated travel examples, avoiding the need for live calls to the GeoNames API. However, if you run on any other examples, you will be making live calls. By default, your requests will specify your username as wcmac. That's fine, but each (free) GeoNames account is limited to 2000 calls per hour. If too many people are making calls as wcmac, that quota could be exhausted quickly. If that happens, you'll want to create your own account on GeoNames.

You will soon observe that the annotations are far from perfect. For example, "florida" is annotated as {'id': 3442584, 'name': 'Florida, UY'}, which denotes a city in Uruguay. The current implementation of GeoNamesAnnotator could be improved in many ways. For example, while the GeoNames API can return multiple results for ambiguous queries such as "florida", the GeoNamesAnnotator considers only the first result — and there is no guarantee that the first result is the best. It may also be possible to improve the quality of the results by playing with some of the API request parameters documented here. Some of the exercises at the end of this unit ask you to explore these possibilities.

We're now ready to see our grammar parse an input:

In [17]:

from collections import defaultdict

from geonames import GeoNamesAnnotator
from parsing import *

geonames_annotator = GeoNamesAnnotator()
rules = rules_travel + rules_travel_locations
travel_annotators = [geonames_annotator]
grammar = Grammar(rules=rules, annotators=travel_annotators)
parses = grammar.parse_input('from boston to austin')
print(parses[0].semantics)

Loaded GeoNamesAnnotator cache with 1462 items
Created grammar with 11 rules
{'domain': 'travel', 'origin': {'name': 'Boston, MA, US', 'id': 4930956}, 'destination': {'name': 'Austin, TX, US', 'id': 4671654}}

Travel modes¶

Let's turn our attention from travel locations to travel arguments. One kind of travel argument is a travel mode. Our semantic representation defines eight travel modes (air, bike, boat, bus, car, taxi, train, and transit), and our training examples illustrate some common ways of expressing each travel mode. In fact, individual examples will suggest specific lexical rules. Consider this example:

Example(input='flights from portland or to seattle wa',
        semantics={'domain': 'travel', 'mode': 'air',
                   'origin': {'id': 5746545, 'name': 'Portland, OR, US'},
                   'destination': {'id': 5809844, 'name': 'Seattle, WA, US'}}),

The pairing of the token "flights" with the semantic fragment {'mode': 'air'} suggests the rule:

Rule('$AirMode', 'flights', {'mode': 'air'})

Actually, for simplicity, we're going to add the 'air' semantics one level "higher" in the grammar, like this:

Rule('$TravelMode', '$AirMode', {'mode': 'air'})
Rule('$AirMode', 'flights')

The rules below illustrate this approach. These rules will allow our grammar to handle the most common ways of referring to each travel mode. All of the lexical rules below use phrases which are either highly obvious (such as 'taxi' for $TaxiMode) or else are motivated by specific examples in our training dataset (not the test set!).

In [18]:

rules_travel_modes = [
    Rule('$TravelArgument', '$TravelMode', sems_0),

    Rule('$TravelMode', '$AirMode', {'mode': 'air'}),
    Rule('$TravelMode', '$BikeMode', {'mode': 'bike'}),
    Rule('$TravelMode', '$BoatMode', {'mode': 'boat'}),
    Rule('$TravelMode', '$BusMode', {'mode': 'bus'}),
    Rule('$TravelMode', '$CarMode', {'mode': 'car'}),
    Rule('$TravelMode', '$TaxiMode', {'mode': 'taxi'}),
    Rule('$TravelMode', '$TrainMode', {'mode': 'train'}),
    Rule('$TravelMode', '$TransitMode', {'mode': 'transit'}),

    Rule('$AirMode', 'air fare'),
    Rule('$AirMode', 'air fares'),
    Rule('$AirMode', 'airbus'),
    Rule('$AirMode', 'airfare'),
    Rule('$AirMode', 'airfares'),
    Rule('$AirMode', 'airline'),
    Rule('$AirMode', 'airlines'),
    Rule('$AirMode', '?by air'),
    Rule('$AirMode', 'flight'),
    Rule('$AirMode', 'flights'),
    Rule('$AirMode', 'fly'),

    Rule('$BikeMode', '?by bike'),
    Rule('$BikeMode', 'bike riding'),

    Rule('$BoatMode', '?by boat'),
    Rule('$BoatMode', 'cruise'),
    Rule('$BoatMode', 'cruises'),
    Rule('$BoatMode', 'norwegian cruise lines'),

    Rule('$BusMode', '?by bus'),
    Rule('$BusMode', 'bus tours'),
    Rule('$BusMode', 'buses'),
    Rule('$BusMode', 'shutle'),
    Rule('$BusMode', 'shuttle'),

    Rule('$CarMode', '?by car'),
    Rule('$CarMode', 'drive'),
    Rule('$CarMode', 'driving'),
    Rule('$CarMode', 'gas'),

    Rule('$TaxiMode', 'cab'),
    Rule('$TaxiMode', 'car service'),
    Rule('$TaxiMode', 'taxi'),

    Rule('$TrainMode', '?by train'),
    Rule('$TrainMode', 'trains'),
    Rule('$TrainMode', 'amtrak'),

    Rule('$TransitMode', '?by public transportation'),
    Rule('$TransitMode', '?by ?public transit'),
]

Let's see the rules in action on a toy example.

In [19]:

rules = rules_travel + rules_travel_locations + rules_travel_modes
grammar = Grammar(rules=rules, annotators=travel_annotators)
parses = grammar.parse_input('from boston to austin by train')
print(parses[0].semantics)

Created grammar with 54 rules
{'mode': 'train', 'domain': 'travel', 'origin': {'name': 'Boston, MA, US', 'id': 4930956}, 'destination': {'name': 'Austin, TX, US', 'id': 4671654}}

Great, it works.

We're far from done with the travel grammar, but we now have enough in place that we should be able to parse several of the examples in our training data. This means that we can start hill-climbing on oracle accuracy!

To drive our grammar engineering process, we're going to use a SippyCup utility function called sample_wins_and_losses(), which will report our oracle accuracy on the training data, and show some examples we're parsing correctly and some examples we're not.

(Note that sample_wins_and_losses() requires a Domain, a Model, and a Metric. A description of these classes is tangential to our presentation. If you're interested, read some SippyCup code! It's not very complicated.)

In [20]:

from experiment import sample_wins_and_losses
from metrics import SemanticsOracleAccuracyMetric
from scoring import Model
from travel import TravelDomain

domain = TravelDomain()
model = Model(grammar=grammar)
metric = SemanticsOracleAccuracyMetric()

sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=31)

Loaded GeoNamesAnnotator cache with 1462 items
================================================================================
Evaluating on 75 examples

--------------------------------------------------------------------------------
Over 75 examples:

semantics accuracy                 0.133
semantics oracle accuracy          0.133
number of parses                   0.173
spurious ambiguity                 0.000
has travel parse                   0.173

5 of 10 wins on semantics oracle accuracy:

  airfares to honolulu hawaii
  cruises from new york
  flight from atlanta to vermont
  flights from portland or to seattle wa
  train from moscow to st. petersburg

10 of 65 losses on semantics oracle accuracy:

  cheap tickets to south carolina
  cheapest place to live in florida
  direct flights from california to loreto mexico
  discount travel flights to austin texas
  flights newark to raleigh nc
  one day cruise from fort lauderdale florida
  public transportation to new jersey
  ride this train to roseburg oregon now ther's a town for ya
  transportation to the philadelphia airport
  university of washington transportation to seatac

As you can see, sample_wins_and_losses() doesn't print many details — just a few evaluation metrics, and then examples of wins and losses with the current grammar. (A "win" is an example on which our primary metric has a positive value.) Our primary metric, semantics oracle accuracy, stands at 0.133 — not great, but greater than zero, so it's a start. The wins make sense: we can parse queries which consist solely of travel locations and travel modes, with no extraneous elements. The losses are more interesting, because they will provide the motivation for the next phase of work.

Travel triggers¶

Many of the examples we're currently failing to parse contain phrases (such as "tickets" or "transportation") which indicate a travel intent, but do not specify a travel mode. These phrases constitute the second type of travel argument, namely travel triggers. Inspection of the examples in our training dataset suggest a small number of lexical rules, shown here.

In [21]:

rules_travel_triggers = [
    Rule('$TravelArgument', '$TravelTrigger', {}),

    Rule('$TravelTrigger', 'tickets'),
    Rule('$TravelTrigger', 'transportation'),
    Rule('$TravelTrigger', 'travel'),
    Rule('$TravelTrigger', 'travel packages'),
    Rule('$TravelTrigger', 'trip'),
]

Let's run sample_wins_and_losses() again, to see how much we've gained, and what to work on next.

In [22]:

rules = rules_travel + rules_travel_locations + rules_travel_modes + rules_travel_triggers
grammar = Grammar(rules=rules, annotators=travel_annotators)
model = Model(grammar=grammar)
sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)

Created grammar with 60 rules
================================================================================
Evaluating on 75 examples

--------------------------------------------------------------------------------
Over 75 examples:

semantics accuracy                 0.173
semantics oracle accuracy          0.173
number of parses                   0.213
spurious ambiguity                 0.000
has travel parse                   0.213

5 of 13 wins on semantics oracle accuracy:

  cruises from new york
  flight from atlanta to vermont
  flight from ft.lauderdale florida to lexington kentucky
  train from moscow to st. petersburg
  travel packages from hartford ct. to berlin germany

10 of 62 losses on semantics oracle accuracy:

  birmingham al distance from indianapolish in
  cost of gas from cedar rapids ia to las vegas
  direct flights from california to loreto mexico
  distance of oxnard ca. to rohnert park ca.
  distance usa to peru
  jet blue flights to new york
  one way air fare to chicago
  rochester ny bus tours to yankees stadium
  scenic drive to duluth mn
  what is 1500 miles away from texas city

So we've gained about four points (that is, 0.04) in oracle accuracy on the training dataset. We're making progress! Again, the losses suggest where to go next.

Request types¶

The third and last kind of travel argument is a request type, which indicates a specific type of information sought, such as directions or distance. We'll adopt the same methodology, adding lexical rules motivated by specific training examples, tied together by higher-level rules which add semantics.

In [23]:

rules_request_types = [
    Rule('$TravelArgument', '$RequestType', sems_0),

    Rule('$RequestType', '$DirectionsRequest', {'type': 'directions'}),
    Rule('$RequestType', '$DistanceRequest', {'type': 'distance'}),
    Rule('$RequestType', '$ScheduleRequest', {'type': 'schedule'}),
    Rule('$RequestType', '$CostRequest', {'type': 'cost'}),

    Rule('$DirectionsRequest', 'directions'),
    Rule('$DirectionsRequest', 'how do i get'),
    Rule('$DistanceRequest', 'distance'),
    Rule('$ScheduleRequest', 'schedule'),
    Rule('$CostRequest', 'cost'),
]

Again, we'll check our progress using sample_wins_and_losses().

In [24]:

rules = rules_travel + rules_travel_locations + rules_travel_modes + rules_travel_triggers + rules_request_types
grammar = Grammar(rules=rules, annotators=travel_annotators)
model = Model(grammar=grammar)
sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)

Created grammar with 70 rules
================================================================================
Evaluating on 75 examples

--------------------------------------------------------------------------------
Over 75 examples:

semantics accuracy                 0.200
semantics oracle accuracy          0.200
number of parses                   0.240
spurious ambiguity                 0.000
has travel parse                   0.240

5 of 15 wins on semantics oracle accuracy:

  cruise from new york to canada
  cruises from new york
  directions from washington to canada
  flight from ft.lauderdale florida to lexington kentucky
  flights from portland or to seattle wa

10 of 60 losses on semantics oracle accuracy:

  cheapest place to live in florida
  cost of gas from cedar rapids ia to las vegas
  distance cunmming georgia to chattanooga
  distance usa to peru
  driving distance washington dc to niagara falls
  ithaca to scranton trains
  one day cruise from fort lauderdale florida
  public transportation to new jersey
  santiago chile to atlanta ga. airline schedule 3 26 06
  transportation to the philadelphia airport

Great, oracle accuracy is up to 0.20. But there's still one big piece we're missing.

Optionals¶

A key ingredient of the phrase-bag approach to grammar building is the ability to accept optional elements interspersed freely among the query elements. Optionals are phrases which can be either present or absent; typically, they contribute nothing to the semantics.

The following rules illustrate one approach to allowing optionals. The first two rules allow any $TravelQueryElement to combine with an $Optionals either to the right or to the left, while ignoring its semantics. The third rule defines $Optionals as a sequence of one or more $Optional elements, while the following rules define several specific categories of optionals. As usual, most of the lexical rules are motivated by specific examples from the training dataset, with a few extras included just because they are super obvious.

This is not necessarily the best design! One of the exercises will challenge you to do better.

In [25]:

rules_optionals = [
    Rule('$TravelQueryElement', '$TravelQueryElement $Optionals', sems_0),
    Rule('$TravelQueryElement', '$Optionals $TravelQueryElement', sems_1),

    Rule('$Optionals', '$Optional ?$Optionals'),

    Rule('$Optional', '$Show'),
    Rule('$Optional', '$Modifier'),
    Rule('$Optional', '$Carrier'),
    Rule('$Optional', '$Stopword'),
    Rule('$Optional', '$Determiner'),

    Rule('$Show', 'book'),
    Rule('$Show', 'give ?me'),
    Rule('$Show', 'show ?me'),

    Rule('$Modifier', 'cheap'),
    Rule('$Modifier', 'cheapest'),
    Rule('$Modifier', 'discount'),
    Rule('$Modifier', 'honeymoon'),
    Rule('$Modifier', 'one way'),
    Rule('$Modifier', 'direct'),
    Rule('$Modifier', 'scenic'),
    Rule('$Modifier', 'transatlantic'),
    Rule('$Modifier', 'one day'),
    Rule('$Modifier', 'last minute'),

    Rule('$Carrier', 'delta'),
    Rule('$Carrier', 'jet blue'),
    Rule('$Carrier', 'spirit airlines'),
    Rule('$Carrier', 'amtrak'),

    Rule('$Stopword', 'all'),
    Rule('$Stopword', 'of'),
    Rule('$Stopword', 'what'),
    Rule('$Stopword', 'will'),
    Rule('$Stopword', 'it'),
    Rule('$Stopword', 'to'),

    Rule('$Determiner', 'a'),
    Rule('$Determiner', 'an'),
    Rule('$Determiner', 'the'),
]

Again, we'll check our progress using sample_wins_and_losses().

In [26]:

rules = rules_travel + rules_travel_locations + rules_travel_modes + rules_travel_triggers + rules_request_types + rules_optionals
grammar = Grammar(rules=rules, annotators=travel_annotators)
model = Model(grammar=grammar)
sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)

Created grammar with 104 rules
================================================================================
Evaluating on 75 examples

--------------------------------------------------------------------------------
Over 75 examples:

semantics accuracy                 0.400
semantics oracle accuracy          0.400
number of parses                   0.773
spurious ambiguity                 0.039
has travel parse                   0.493

5 of 30 wins on semantics oracle accuracy:

  delta flights to oakland
  driving from los angeles to seattle
  flight from ft.lauderdale florida to lexington kentucky
  honeymoon trip to hawaii
  what will it cost to drive from kemerovo to st. petersburg russia

10 of 45 losses on semantics oracle accuracy:

  buses to sacramento rivercats games
  cheap tickets to south carolina
  cheapest place to live in florida
  lawton va to orlando fl amtrak
  nyc flights to buffalo ny
  photos at las vegas nevada from april 1 to april 5 2006
  public transportation to new jersey
  rand mcnally direction to horse shoe casino hammond in
  transportation to the philadelphia airport
  what is 1500 miles away from texas city

Adding support for optionals has doubled oracle accuracy on the training dataset, from 0.200 to 0.400. This is a big gain! However, there are still many losses, and many of them share a property: they are negative examples.

Negative examples¶

A semantic parsing model for a given domain should be able to predict that a given input does not belong to the domain. We call such inputs negative examples. For the travel domain, negative examples include:

discount tickets to new york city ballet
george washington borrows 500 000 from pennsylvania farmer to finance war
ride this train to roseburg oregon now ther's a town for ya

Much of the academic literature on semantic parsing describes systems which are required always to produce an in-domain semantic representation, no matter what the input. If the input is not in-domain, the result is usually garbage. In real-world applications, it's much better to have models which can learn when to produce no positive output.

The easiest way to achieve this is to introduce some rules which allow any input to be parsed with "negative" semantics, and then learn weights for those rule features in the scoring model. In the travel domain, the "negative" semantic representation is the special value {'domain': 'other'}.

In [27]:

rules_not_travel = [
    Rule('$ROOT', '$NotTravelQuery', sems_0),
    Rule('$NotTravelQuery', '$Text', {'domain': 'other'}),
    Rule('$Text', '$Token ?$Text'),
]

Note that the last rule depends on the $Token category, which can be applied to any token by the TokenAnnotator. So let's add the TokenAnnotator to our list of annotators.

In [28]:

travel_annotators = [geonames_annotator, TokenAnnotator()]

As usual, we'll check our progress using sample_wins_and_losses().

In [29]:

rules = rules_travel + rules_travel_locations + rules_travel_modes + rules_travel_triggers + rules_request_types + rules_optionals + rules_not_travel
grammar = Grammar(rules=rules, annotators=travel_annotators)
model = Model(grammar=grammar)
sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)

Created grammar with 107 rules
================================================================================
Evaluating on 75 examples

--------------------------------------------------------------------------------
Over 75 examples:

semantics accuracy                 0.173
semantics oracle accuracy          0.573
number of parses                   1.773
spurious ambiguity                 0.025
has travel parse                   0.493

5 of 43 wins on semantics oracle accuracy:

  cheap flights to hawaii
  directions from washington to canada
  jet blue flights to new york
  travel packages from hartford ct. to berlin germany
  what will it cost to drive from kemerovo to st. petersburg russia

10 of 32 losses on semantics oracle accuracy:

  bike riding-seattle to portland
  birmingham al distance from indianapolish in
  directions to rupparena in lexington
  distance of oxnard ca. to rohnert park ca.
  flights newark to raleigh nc
  fly boston to myrtle beach spirit airlines
  ithaca to scranton trains
  norwegian cruises lines to alaska combined with land tours to denali
  rochester ny bus tours to yankees stadium
  transatlantic cruise southampton to tampa

We've achieved another big gain in oracle accuracy, from 0.400 to 0.573, just by ensuring that we offer a "negative" prediction for every input. (Note that the mean number of parses has increased by exactly 1, from 0.773 to 1.773.) However, for the first time, a big gap has opened between accuracy, at 0.173, and oracle accuracy, at 0.573. The problem is that we don't yet have a scoring model for the travel domain, so the ranking of parses is arbitrary. In order to close the gap, we need to create a scoring model. One of the exercises will ask you to pursue this.

Exercises ¶

Several of these exercises ask you to measure the impact of your change on key evaluation metrics. Part of your job is to decide which evaluation metrics are most relevant for the change you're making. It's probably best to evaluate only on training data, in order to keep the test data unseen during development. (But the test data is hardly a state secret, so whatever.)

Straightforward¶

Select 20 of the 6,588 queries in aol-travel-queries.txt and manually annotate them with target semantics. Select your queries using uniform random sampling — this will minimize overlap between different people completing this exercise, and therefore maximize the value of the aggregate labor. (In Python, you can sample from a list using random.sample().) You will likely find some cases where it's not clear what the right semantics are. Do your best. The point of the exercise is to develop an awareness of the challenges of annotation, and to recognize that there's no such thing as perfectly annotated data.
Many of the remaining errors on training examples occur because the origin isn't marked by "from". Examples include "transatlantic cruise southampton to tampa", "fly boston to myrtle beach spirit airlines", and "distance usa to peru". Extend the grammar to handle examples like these. Measure the impact on key evaluation metrics.
Does your solution to the previous exercise handle examples where some other query element intervenes between origin and destination? Examples include "university of washington transportation to seatac", "birmingham al distance from indianapolish in", and "nyc flights to buffalo ny". If not, extend the grammar to handle examples like these. Measure the impact on key evaluation metrics.
The current structure of the grammar permits parses containing any number of $FromLocations and $ToLocations, including zero. Find a way to require that (a) there is at least one $FromLocation or $ToLocation, (b) there are not multiple $FromLocations or $ToLocations.
The travel grammar is lacking a scoring model, and it shows a big gap between accuracy and oracle accuracy. Examine and diagnose some examples where accuracy is 0 even though oracle accuracy is 1. Propose and implement a scoring model, and measure its efficacy in closing that gap.

Challenging¶

Extend Grammar to allow rules which mix terminals and non-terminals on the RHS, such as Rule('$RouteQuery', '$TravelMode from $Location to $Location').
Try to improve the precision of the GeoNamesAnnotator by fiddling with the GeoNames API request parameters documented here. For example, the featureClass, countryBias, or orderby parameters seem like promising targets.
Try to improve the coverage of the GeoNamesAnnotator by enabling it to return multiple annotations for ambiguous location names. Investigate the impact of varying the maximum number of annotations on various performance metrics, including accuracy, oracle accuracy, and number of parses. How would you characterize the tradeoff you're making?
Building on the previous exercise, implement a feature which captures information about the result rank of annotations generated by the GeoNamesAnnotator, and see if you can use this feature to narrow the gap between accuracy and oracle accuracy.
You have probably noticed that one of our standard evaluation metrics is something called "spurious ambiguity". Dig into the SippyCup codebase to figure out what spurious ambiguity is. Here's a hint: it's something bad, so we want to push that metric toward zero. Find a training example where it's not zero, and figure out why the example exhibits spurious ambiguity. Are there changes we can make to the grammar to reduce spurious ambiguity? Also, why is spurious ambiguity undesirable?
In its current form, the travel grammar parses lots of queries it shouldn't. (By "parses", we mean "generates a positive parse".) This problem is known as overtriggering. The overtriggering problem is hard to observe on our tiny dataset of 100 examples, where most of the examples are positive examples. Investigate overtriggering by downloading the AOL query dataset, identifying the 1,000 most frequent queries, and running them through the grammar. How many cases of overtriggering do you find? Can you suggest some simple changes to minimize overtriggering? (Warning: the AOL dataset contains lots of queries which may be offensive. Skip this exercise if you're not down with that.)
Consider the queries "flights los angeles hawaii" vs. "flights los angeles california". Despite the superficial resemblance, the second query seems to mean flights to Los Angeles, not flights from Los Angeles to California. But the grammar can successfully interpret both queries only if it permits both interpretations for each query. This creates a ranking problem: which interpretation should be scored higher? Can you add to the scoring model a feature which solves the problem?
Develop and execute a strategy to annotate all 6,588 queries in aol-travel-queries.txt with target semantics using crowdsourcing. Warning: this is an ambitious exercise.

SippyCup Unit 2: Travel queries