#!/usr/bin/env python
# coding: utf-8

# # SippyCup
# ## Unit 2: Travel queries
#
# ### Bill MacCartney
# ### Spring 2015
#
# This is Unit 2 of the SippyCup codelab. #
# In our next case study, we're going to look at travel queries, in which a user seeks information about getting from here to there. Examples include:
#
#     "birmingham al distance from indianapolish in"
#     "directions from washington to canada"
#     "discount travel flights to austin texas"
#
# It's easy to imagine applications for semantic parsing in the travel domain. Providing answers to queries like these is a core part of the value proposition of intelligent assistants such as [Apple's Siri][], [Google Now][], [Microsoft's Cortana][], and [Amazon Echo][].
#
# [Apple's Siri]: https://www.apple.com/ios/siri/
# [Google Now]: https://www.google.com/landing/now/
# [Microsoft's Cortana]: http://www.microsoft.com/en-us/mobile/campaign-cortana/
# [Amazon Echo]: http://www.amazon.com/oc/echo/
#
# The travel domain has some important differences from the domain of natural language arithmetic we looked at in our [first case study](./sippycup-unit-1.ipynb).
#
# - **Above all, it's far more diverse.** The lexicon of travel queries can include thousands of words, not merely dozens — indeed, its potential size is unbounded. And the words and phrases can be assembled in a dizzying variety of syntactic structures.
# - **It's much more ambiguous.** Consider `"directions from washington to canada"`. Does that refer to the State of Washington, or to the District of Columbia? And where exactly in Canada are we headed?
# - **It can also be much messier.** Because it's a real-world application, and because users ~~are idiots~~ make mistakes, travel queries can include lots of misspellings (such as `"indianapolish"`), mangled syntax, and other linguistic infelicities. It's not all neat and tidy like the arithmetic case study.
# - **However, the semantics are flatter.** The semantic representation for a travel query may need to represent the destination, the origin, the transportation mode, and perhaps a few other bits of information. But, unlike the expression trees we looked at in the arithmetic case study, it does not need to contain arbitrarily deeply-nested recursive structure.

# ## Example inputs

# Whenever we start work in a new domain, we must begin with data collection. To drive an effective development process, we need a large, realistic, representative sample of the inputs our system is likely to encounter in its intended application. In the arithmetic case study, we copped out on the data collection, in the name of pedagogical convenience. This time, we'll try harder.
#
# For this purpose, we'll leverage the AOL search query dataset ([Pass et al. 2006][]), a resource which has proven immensely useful over time, despite its [controversial origin][]. This dataset contains approximately 10M unique search queries issued by about 650K AOL users in 2006. Because the dataset was quickly retracted by AOL, it can be difficult to find online, but it is currently available for download [here][aol-dataset].
#
# [Pass et al. 2006]: http://people.cs.georgetown.edu/~abdur/publications/pos-infoscale.pdf
# [controversial origin]: http://en.wikipedia.org/wiki/AOL_search_data_leak
# [aol-dataset]: http://www.cim.mcgill.ca/~dudek/206/Logs/AOL-user-ct-collection/
#
# Of course, the majority of those 10M queries are not related to travel. (The five most frequent queries are "google", "ebay", "yahoo", "yahoo.com", and "mapquest". Remember, it was 2006. The query "facebook" doesn't show up until position 119!) To construct a query sample suitable for our purposes, we'll need to do some filtering.
#
# Since the queries we're interested in always refer to a location, we'll start by filtering for location names. Recognizing location names is itself a hard problem, so for current purposes we'll adopt a good-enough solution: we'll grep the queries for the names of about 600 U.S. locations (cities, states, mountains, lakes, and so on) listed in [Geobase][]. (We'll be using Geobase again in the next unit.) This winnows the 10M unique queries to about 1M. It's a good start, but it's still not enough. (The top five queries are now "white pages", "florida lottery", "wells fargo", "whitney houston", and "las vegas".) We need to find queries which not only contain a location, but also a clear indication of travel intent.
#
# [Geobase]: ftp://ftp.cs.utexas.edu/pub/mooney/nl-ilp-data/geosystem/geobase
#
# Since travel is about going *from* one location *to* another, let's grep for queries which contain both "from" and "to". This cuts the number of queries to about 3,100, and we're now seeing the kind of queries we're looking for:
#
#     "airline flights from raleigh durham to nashville tennessee"
#     "distance from seattle to vancouver"
#     "cheap flights to italy from new york"
#     "international flights from usa to morocco"
#     "milage from portlan oregon to canada"
#     "distance from alabama to arkansas in hours"
#     "driving directions to sault ste marie canada from vermont"
#     "great white sharks life cycle from brith to death"
#     "directions from newark airport to bayonne bridge"
#     "poem it takes courage to love from tyler perrys movie"
#
# A couple of observations here. First, not all of the queries are actually travel queries. That's OK — we actually *want* to include some queries which do not belong to the target domain. Such "negative" queries are important for evaluating our system's ability to distinguish between queries it should be parsing and queries it shouldn't. Second, there are a lot of mispelings. Again, that's OK. We want a *realistic* sample which reflects the inputs our system will actually encounter in its intended application. Messy inputs are a fact of life, and they're one of the things that distinguishes this domain from the highly artificial example of natural language arithmetic we looked at in the last unit.
#
# There's still a problem, though. Not all travel queries contain both "from" and "to". Many valid travel queries contain only a destination, but no origin — and some contain an origin, but no destination. If we relax the conjunction to a disjunction (that is, "from" *or* "to"), we get about 23,000 queries, most of which are still not what we're looking for (for example, "colorado rent to own", "moving to florida"). However, the list of examples above (and its continuation, not shown) suggests many other terms we can usefully search for: "airline", "flights", "distance", "mileage", "driving", "directions", and so on. We constructed a list of about 60 such terms (recorded in a comment in [`travel_examples.py`](travel_examples.py)). We then selected queries which contain a location, either "from" or "to", and either one of the ~60 travel terms or both "from" and "to". (A rough sketch of this filtering logic appears below.)
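# Here is a minimal sketch of that filtering logic; it is not part of SippyCup, and the word lists are tiny stand-ins (the real location list comes from Geobase, and the real travel-term list has about 60 entries, recorded in a comment in `travel_examples.py`):
#
#     LOCATIONS = {'seattle', 'vancouver', 'austin', 'florida', 'morocco'}    # stand-in for Geobase names
#     TRAVEL_TERMS = {'airline', 'flights', 'distance', 'mileage', 'driving', 'directions'}
#
#     def is_candidate_travel_query(query):
#         tokens = set(query.split())
#         has_location = any(loc in query for loc in LOCATIONS)
#         has_from, has_to = 'from' in tokens, 'to' in tokens
#         has_travel_term = any(term in tokens for term in TRAVEL_TERMS)
#         # a location, plus "from" or "to", plus a travel term or both "from" and "to"
#         return has_location and (has_from or has_to) and (has_travel_term or (has_from and has_to))
#
#     queries = ['distance from seattle to vancouver', 'moving to florida']
#     print([q for q in queries if is_candidate_travel_query(q)])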
# This left 6,588 queries, and the results look pretty good:
#
#     "flight one way las vegas nv to rapid city sd"
#     "number to alaska airlines"
#     "travel time to denver colorado"
#     "amtrak schedule in california from burbanlk airport"
#     "airfare las vegas to my"
#     "cycling route san francisco to sacramento"
#     "where to find cheapest gasoline in new castle delaware"
#     "discount tickets to six flags and discounted hotels in new jersey"
#     "public transportation washington dc to new york city"
#     "transportation from newark airport to manhattan"
#
# Three of those are not really travel queries, but that's perfectly fine.
#
# Let's recap. To construct our query sample, we:
#
# - Began from the set of unique queries in the AOL dataset (10M queries).
# - Selected queries containing one of the 600 locations named in Geobase (1M queries).
# - Selected queries containing "from" or "to" (23K queries).
# - Selected queries containing one of about 60 travel terms, or containing both "from" and "to" (6,588 queries).
#
# We've got our inputs; now we need to think about the outputs.

# ## Semantic representation

# For the travel domain, we're going to use a style of semantic representation based on nested maps of key-value pairs (represented by Python dictionaries). The keys belong to a small, fixed set of strings. The values can be numbers, strings, or nested maps, depending upon the key.
#
# Here are some examples:
#
#     {'domain': 'travel', 'type': 'directions', 'mode': 'car',
#      'destination': {'id': 4793846, 'name': 'Williamsburg, VA, US'}}
#     {'domain': 'travel', 'type': 'duration', 'mode': 'bus',
#      'origin': {'id': 4500546, 'name': 'Atlantic City, NJ, US'},
#      'destination': {'id': 5128581, 'name': 'New York City, NY, US'}}
#     {'domain': 'travel', 'type': 'cost', 'mode': 'air',
#      'origin': {'id': 5101798, 'name': 'Newark, NJ, US'},
#      'destination': {'id': 4574324, 'name': 'Charleston, SC, US'}}
#
# You can probably discern what each example means without any difficulty. The first is asking for driving directions to Williamsburg. The second is asking for the duration of a bus trip from Atlantic City to New York City. And the third is asking for the cost of an airfare from Newark to Charleston.
#
# To be a bit more formal, the following keys are allowed in the top-level map:
#
# - `domain` has one of two values: `travel` for travel queries, or `other` for non-travel queries.
# - `type` specifies the type of information requested, one of: `directions`, `distance`, `duration`, `cost`, or `schedule`.
# - `mode` specifies the mode of travel, one of: `air`, `bike`, `boat`, `bus`, `car`, `taxi`, `train`, or `transit`.
# - `destination` specifies the destination of travel using a nested map representing a location.
# - `origin` specifies the origin of travel using a nested map representing a location.
#
# In nested maps representing locations, just two keys are allowed:
#
# - `id` is an integer which is the unique identifier of the location in the [GeoNames][] geographical database.
# - `name` is a human-readable name for the location whose format is defined by code in [`geonames.py`](./geonames.py).
#
# [GeoNames]: http://www.geonames.org/
#
# Note that our semantic representation both resolves ambiguity and canonicalizes. On the one hand, the phrase "`new york`" has a different semantic representation if it refers to the city than if it refers to the state; on the other hand, "`new york city`" and "`nyc`" have the same semantic representation.
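# As a concrete aside (this helper is not part of SippyCup), a candidate semantic representation can be checked against the keys defined above in a few lines of Python:
#
#     TOP_LEVEL_KEYS = {'domain', 'type', 'mode', 'destination', 'origin'}
#     LOCATION_KEYS = {'id', 'name'}
#
#     def is_valid_travel_semantics(sem):
#         if not set(sem) <= TOP_LEVEL_KEYS:
#             return False
#         # Origin and destination, if present, must be location maps.
#         return all(set(sem[key]) <= LOCATION_KEYS
#                    for key in ('origin', 'destination') if key in sem)
#
#     print(is_valid_travel_semantics(
#         {'domain': 'travel', 'type': 'directions', 'mode': 'car',
#          'destination': {'id': 4793846, 'name': 'Williamsburg, VA, US'}}))  # True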
# This combination of disambiguation and canonicalization will make it easy for a downstream executor to turn a semantic representation directly into action.
#
# In a real-world application, the executor would likely be a module that translates our semantic representations into requests to backend APIs to compute optimal routes, retrieve bus schedules or airfares, or whatever. However, because such functionality is not easily captured in a few lines of Python code, SippyCup doesn't include an executor for the travel domain.
#
# Since we don't have an executor, we won't be dealing with denotations in the travel domain. (Abstractly, the denotation of a directions request is some kind of complex computational object representing a route.) Instead, training and evaluation in this domain will consider only whether we're generating the correct semantics. We'll return to considering denotations in our [third case study][].
#
# [third case study]: ./sippycup-unit-3.ipynb

# ## Example data

# We've described a process for extracting 6,588 (likely) travel queries from the AOL search query dataset, and we've defined a simple semantic representation for such queries. The next step should be to manually annotate each query with its semantics, in order to generate example data for training and evaluation.
#
# However, there's a problem. Manual annotation is tedious and time-consuming, and therefore expensive. For datasets of significant size, it is usually not practical for one person to do all the work. This suggests distributing the work over a cluster of graduate students — which is in fact the method that was used to generate the [Geo880][] dataset that we'll use in our [third case study][]. Accordingly, one of the [exercises](#travel-exercises) at the end of this section will ask you to annotate a score of travel queries.
#
# Alternatively, if the annotation task does not require specialized knowledge, the work can often be distributed over workers on a crowdsourcing platform such as [Amazon Mechanical Turk][]. Our semantic representation for the travel domain is simple enough that the crowdsourcing approach might be feasible. Of course, the workers wouldn't write down the semantics directly; rather, they'd fill out a carefully-designed form containing questions about the query, and a script would generate semantic representations based on their responses. We leave working out the details of the crowdsourcing approach as an exercise for the motivated reader. :-)
#
# [Geo880]: http://www.cs.utexas.edu/users/ml/geo.html
# [third case study]: ./sippycup-unit-3.ipynb
# [Amazon Mechanical Turk]: https://www.mturk.com/mturk/welcome
#
# The difficulty of producing annotated data, and the consequent paucity of such data, has been the eternal bane of research in semantic parsing. Indeed, overcoming this obstacle was the chief motivation for the shift made by [Liang et al. 2011][] to learn not from semantics but from denotations, which are easier and cheaper to generate via crowdsourcing. Despite this shift, annotated data remains scarce. To this day, the largest standard dataset for semantic parsing (the [WebQuestions][] dataset) contains only 5,810 examples.
#
# [WebQuestions]: http://www-nlp.stanford.edu/software/sempre/
# [Liang et al. 2011]: http://www.cs.berkeley.edu/~jordan/papers/liang-jordan-klein-acl2011.pdf
#
# For the purposes of this codelab, we have manually annotated 100 examples, given in [`travel_examples.py`](./travel_examples.py).
# The examples were sampled uniformly at random from the set of 6,588 AOL queries described earlier, and are divided into a training set of 75 examples and a test set of 25 examples. You might find yourself quibbling with some of the annotation choices. In some cases, our semantic representation is too impoverished to capture the user's intent completely. For example, "`scenic drive to duluth mn`", "`honeymoon trip to hawaii`", and "`jet blue flights to new york`" contain modifiers which cannot be expressed in our semantic representation. In other cases, the user's intent may be uncertain or obscure: what exactly does "`airbus from boston to europe`" mean?

# ## Enriching the Grammar class

# Now that we have some example data for the travel domain, we'd like to start building a grammar. However, the grammar formalism we introduced in the arithmetic case study was quite restrictive. Every rule had to be in Chomsky normal form: either a unary lexical rule, or a binary compositional rule. If you did the [exercises](#arithmetic-exercises) at the end of the arithmetic case study, you will have chafed against those restrictions. So, to make grammar engineering far more convenient, we're going to loosen things up. We'll modify the `Grammar` class by adding:
#
# - **Annotators**, which are external modules for assigning categories and semantics to phrases of specific types.
# - **N-ary lexical rules**, such as `Rule('$City', 'new york city')`.
# - **Unary compositional rules**, such as `Rule('$Location', '$City')`.
# - **N-ary compositional rules**, such as `Rule('$RouteQuery', '$FromLocation $ToLocation $TravelMode')`.
# - Rules with **optional elements** on the RHS, such as `Rule('$BikeMode', '?by bike')`, which can match either `"bike"` or `"by bike"`.
# - A **designated start symbol**, typically `$ROOT`, to allow us to limit which categories can yield complete parses.
#
# In order to make all this stuff work, we'll need to redefine our `Grammar` class and some of its affiliated functions. First, we'll redefine the `Grammar` constructor to add member variables for storing unary rules, annotators, and the start symbol.

# In[1]:

class Grammar:
    def __init__(self, rules=[], annotators=[], start_symbol='$ROOT'):
        self.categories = set()
        self.lexical_rules = defaultdict(list)
        self.unary_rules = defaultdict(list)
        self.binary_rules = defaultdict(list)
        self.annotators = annotators
        self.start_symbol = start_symbol
        for rule in rules:
            add_rule(self, rule)
        print('Created grammar with %d rules.' % len(rules))

    def parse_input(self, input):
        """Returns a list of parses for the given input."""
        return parse_input(self, input)


# Next, we'll redefine `add_rule()`, so that it knows what to do when it encounters unary or n-ary compositional rules, or rules containing optionals:

# In[2]:

def add_rule(grammar, rule):
    if contains_optionals(rule):
        add_rule_containing_optional(grammar, rule)
    elif is_lexical(rule):
        grammar.lexical_rules[rule.rhs].append(rule)
    elif is_unary(rule):
        grammar.unary_rules[rule.rhs].append(rule)
    elif is_binary(rule):
        grammar.binary_rules[rule.rhs].append(rule)
    elif all([is_cat(rhsi) for rhsi in rule.rhs]):
        add_n_ary_rule(grammar, rule)
    else:
        # One of the exercises will ask you to handle this case.
        raise Exception('RHS mixes terminals and non-terminals: %s' % rule)

def is_unary(rule):
    """
    Returns true iff the given Rule is a unary compositional rule, i.e.,
    contains only a single category (non-terminal) on the RHS.
    """
    return len(rule.rhs) == 1 and is_cat(rule.rhs[0])


# Finally, we'll redefine `parse_input()`, so that it properly invokes functions for applying annotators and unary rules (in addition to existing invocations of functions for applying lexical rules and binary rules).

# In[3]:

def parse_input(grammar, input):
    """Returns a list of all parses for input using grammar."""
    tokens = input.split()
    chart = defaultdict(list)
    for j in range(1, len(tokens) + 1):
        for i in range(j - 1, -1, -1):
            apply_annotators(grammar, chart, tokens, i, j)         # new!
            apply_lexical_rules(grammar, chart, tokens, i, j)
            apply_binary_rules(grammar, chart, i, j)
            apply_unary_rules(grammar, chart, i, j)                # new!
    parses = chart[(0, len(tokens))]
    if hasattr(grammar, 'start_symbol') and grammar.start_symbol:  # new!
        parses = [parse for parse in parses if parse.rule.lhs == grammar.start_symbol]
    return parses


# Note that if `grammar.start_symbol` is defined (the usual case), then we will return only parses having that category at the root.

# ### Annotators

# Our travel grammar will need to recognize names of cities (and other kinds of locations). We could achieve this by writing lots of lexical rules:

# In[4]:

from parsing import Rule

rules_cities = [
    Rule('$City', 'austin', {'id': 4671654, 'name': 'Austin, TX, US'}),
    Rule('$City', 'boston', {'id': 4930956, 'name': 'Boston, MA, US'}),
    Rule('$City', 'chicago', {'id': 4887398, 'name': 'Chicago, IL, US'}),
    Rule('$City', 'dallas', {'id': 4684888, 'name': 'Dallas, TX, US'}),
    # ...
]


# However, there are thousands of cities in the world. Adding thousands of rules to our grammar would be cumbersome and difficult to maintain. Besides, there is probably already some library or service which knows how to recognize names of cities and map them into unique identifiers. If we could wrap this external module in the right kind of interface, we could make it directly available to our grammar.
#
# We call such a module an *annotator*. What kind of an interface should it have? Well, it should take a token span (that is, a list of tokens) and return a list of zero or more interpretations, each a pair consisting of a category and a semantic representation.

# In[5]:

class Annotator:
    """A base class for annotators."""
    def annotate(self, tokens):
        """Returns a list of pairs, each a category and a semantic representation."""
        return []


# Let's take a moment to implement a couple of simple annotators which will prove useful later on. The first is a `NumberAnnotator`, which will annotate any string representing a number with category `$Number` and semantics equal to its numeric value.

# In[6]:

class NumberAnnotator(Annotator):
    def annotate(self, tokens):
        if len(tokens) == 1:
            try:
                value = float(tokens[0])
                if int(value) == value:
                    value = int(value)
                return [('$Number', value)]
            except ValueError:
                pass
        return []


# Here's the `NumberAnnotator` in action:

# In[7]:

NumberAnnotator().annotate(['16'])


# Another annotator that will prove very useful is the `TokenAnnotator`, which will annotate any single token with category `$Token` and semantics equal to the token itself.

# In[8]:

class TokenAnnotator(Annotator):
    def annotate(self, tokens):
        if len(tokens) == 1:
            return [('$Token', tokens[0])]
        else:
            return []


# Here's the `TokenAnnotator` in action:

# In[9]:

TokenAnnotator().annotate(['foo'])


# Later in this unit, we'll introduce the `GeoNamesAnnotator`, a more complex annotator designed to annotate names of locations.
# We saw above that the redesigned `Grammar` class holds a list of annotators, and that the redesigned `parse_input()` function invokes a function called `apply_annotators()` on every token span. Its operation is straightforward: for every annotator in the grammar, and for every interpretation of the given span by the annotator, it creates a new parse and adds it to the corresponding chart cell.

# In[10]:

def apply_annotators(grammar, chart, tokens, i, j):
    """Add parses to chart cell (i, j) by applying annotators."""
    if hasattr(grammar, 'annotators'):
        for annotator in grammar.annotators:
            for category, semantics in annotator.annotate(tokens[i:j]):
                if not check_capacity(chart, i, j):
                    return
                rule = Rule(category, tuple(tokens[i:j]), semantics)
                chart[(i, j)].append(Parse(rule, tokens[i:j]))


# (The `check_capacity()` function will be defined in the next section.)

# ### Unary rules

# When designing a grammar, it's often convenient to write *unary* rules, which have a single category on the RHS. However, unary rules are not allowed in Chomsky normal form. One reason why unary rules are problematic is that badly-designed grammars can contain *unary cycles*, which can cause parsing to get stuck in an infinite regress.

# In[11]:

rules_with_unary_cycle = [
    Rule('$ROOT', '$Ping', lambda sems: sems[0]),
    Rule('$Ping', '$Pong', lambda sems: sems[0] + '1'),
    Rule('$Pong', '$Ping', lambda sems: sems[0]),
    Rule('$Ping', 'ping', '1'),
]


# If a grammar with these rules is used to parse the input `"ping"`, it will generate an endless stream of candidate parses, with semantics '1', '11', '111', '1111', and so on.
#
# There are various strategies for preventing unary cycles from causing a runaway parsing process. We'll adopt one of the simplest: rather than trying to identify unary cycles in advance, we'll just impose an upper bound on the capacity of any chart cell, and if the limit is reached, we'll terminate the parsing process.
#
# With this in mind, the definition of `apply_unary_rules()` is straightforward.

# In[12]:

def apply_unary_rules(grammar, chart, i, j):
    """Add parses to chart cell (i, j) by applying unary rules."""
    # Note that the last line of this method can add new parses to chart[(i,
    # j)], the list over which we are iterating.  Because of this, we
    # essentially get unary closure "for free".  (However, if the grammar
    # contains unary cycles, we'll get stuck in a loop, which is one reason for
    # check_capacity().)
    if hasattr(grammar, 'unary_rules'):
        for parse in chart[(i, j)]:
            for rule in grammar.unary_rules[(parse.rule.lhs,)]:
                if not check_capacity(chart, i, j):
                    return
                chart[(i, j)].append(Parse(rule, [parse]))

MAX_CELL_CAPACITY = 10000  # Important for catching e.g. unary cycles.

def check_capacity(chart, i, j):
    if len(chart[(i, j)]) >= MAX_CELL_CAPACITY:
        print('Cell (%d, %d) has reached capacity %d' % (
            i, j, MAX_CELL_CAPACITY))
        return False
    return True


# Note that `check_capacity()` is not just for taming unary cycles — in fact, it's a general-purpose mechanism for bounding the amount of computation invested in a single parsing task. Accordingly, we've included calls to it, not only in `apply_unary_rules()`, but also in its three sibling functions. As you build richer, more complex grammars, you may find that some inputs have hundreds or even thousands of possible parses, and you may begin to run up against the chart cell capacity limit. Feel free to adjust it as needed.
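# If you want to see the capacity guard fire, you can build a grammar from the cyclic rules above and parse `"ping"`. This is just a sketch: it assumes the redefinitions above plus the imports (`from collections import defaultdict`, `from parsing import *`) that appear later in this unit.
#
#     cyclic_grammar = Grammar(rules=rules_with_unary_cycle)
#     parses = cyclic_grammar.parse_input('ping')  # prints the "reached capacity" warning
#     print(len(parses))
#
# Parsing terminates with a warning instead of looping forever, which is exactly the behavior we wanted.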
# ### N-ary rules

# Like unary rules, n-ary compositional rules make grammar engineering easier, by allowing grammar rules to express constraints in natural, intuitive ways.
#
# In the previous unit, we saw how to eliminate the n-ary rule `Rule('$E', '$E $BinOp $E')` from a grammar by *binarizing* it — that is, by replacing it with two binary rules having equivalent effect. Obviously, the binarizing operation can be automated. The redesigned `add_rule()` calls a function named `add_n_ary_rule()` which performs this binarization.

# In[13]:

def add_n_ary_rule(grammar, rule):
    """
    Handles adding a rule with three or more non-terminals on the RHS.
    We introduce a new category which covers all elements on the RHS except
    the first, and then generate two variants of the rule: one which
    consumes those elements to produce the new category, and another which
    combines the new category with the first element to produce the original
    LHS category.  We add these variants in place of the original rule.
    (If the new rules still contain more than two elements on the RHS,
    we'll wind up recursing.)

    For example, if the original rule is:

        Rule('$Z', '$A $B $C $D')

    then we create a new category '$Z_$A' (roughly, "$Z missing $A to the
    left"), and add these rules instead:

        Rule('$Z_$A', '$B $C $D')
        Rule('$Z', '$A $Z_$A')
    """
    def add_category(base_name):
        assert is_cat(base_name)
        name = base_name
        while name in grammar.categories:
            name = name + '_'
        grammar.categories.add(name)
        return name
    category = add_category('%s_%s' % (rule.lhs, rule.rhs[0]))
    add_rule(grammar, Rule(category, rule.rhs[1:], lambda sems: sems))
    add_rule(grammar, Rule(rule.lhs, (rule.rhs[0], category),
                           lambda sems: apply_semantics(rule, [sems[0]] + sems[1])))


# ### Optional elements

# Another convenience for manual grammar engineering is the ability to mark an element in the RHS as optional — in SippyCup, by putting '?' in front of it. With such a mechanism, we can replace these two rules:
#
#     Rule('$BikeMode', 'bike')
#     Rule('$BikeMode', 'by bike')
#
# with a single rule containing an optional element:
#
#     Rule('$BikeMode', '?by bike')
#
# The optional mechanism also makes it easy to define one category as a sequence of another category:
#
#     Rule('$Thing', 'thing')
#     Rule('$Things', '$Thing ?$Things')
# # The following functions enable the optional mechanism. # In[14]: from types import FunctionType def is_optional(label): """ Returns true iff the given RHS item is optional, i.e., is marked with an initial '?'. """ return label.startswith('?') and len(label) > 1 def contains_optionals(rule): """Returns true iff the given Rule contains any optional items on the RHS.""" return any([is_optional(rhsi) for rhsi in rule.rhs]) def add_rule_containing_optional(grammar, rule): """ Handles adding a rule which contains an optional element on the RHS. We find the leftmost optional element on the RHS, and then generate two variants of the rule: one in which that element is required, and one in which it is removed. We add these variants in place of the original rule. (If there are more optional elements further to the right, we'll wind up recursing.) For example, if the original rule is: Rule('$Z', '$A ?$B ?$C $D') then we add these rules instead: Rule('$Z', '$A $B ?$C $D') Rule('$Z', '$A ?$C $D') """ # Find index of the first optional element on the RHS. first = next((idx for idx, elt in enumerate(rule.rhs) if is_optional(elt)), -1) assert first >= 0 assert len(rule.rhs) > 1, 'Entire RHS is optional: %s' % rule prefix = rule.rhs[:first] suffix = rule.rhs[(first + 1):] # First variant: the first optional element gets deoptionalized. deoptionalized = (rule.rhs[first][1:],) add_rule(grammar, Rule(rule.lhs, prefix + deoptionalized + suffix, rule.sem)) # Second variant: the first optional element gets removed. # If the semantics is a value, just keep it as is. sem = rule.sem # But if it's a function, we need to supply a dummy argument for the removed element. if isinstance(rule.sem, FunctionType): sem = lambda sems: rule.sem(sems[:first] + [None] + sems[first:]) add_rule(grammar, Rule(rule.lhs, prefix + suffix, sem)) # # ## Grammar engineering # It's time to start writing some grammar rules for the travel domain. We're going to adopt a data-driven approach. Using the 75 training examples as a development set, we will iteratively: # # - look at examples which are not yet parsed correctly, # - identify common obstacles or sources of error, # - introduce new rules to address those problems, and then # - re-evaluate on the 75 training examples. # # During grammar engineering, the performance metric we'll focus on is *oracle* accuracy (the proportion of examples for which *any* parse is correct), not accuracy (the proportion of examples for which the *first* parse is correct). Remember that oracle accuracy is an upper bound on accuracy. Oracle accuracy is a measure of the expressive power of the grammar: does it have the rules it needs to generate the correct parse? The gap between oracle accuracy and accuracy, on the other hand, reflects the ability of the scoring model to bring the correct parse to the top of the candidate list. # ### Phrase-bag grammars # For the travel domain, we're going to develop a style of grammar known as a *phrase-bag grammar*. To get a sense of how it will work, let's look at ten example inputs from the training data. # # # # # # # # # #
# travel boston to fr. myers fla
# how do i get from tulsa oklahoma to atlantic city. new jersey by air
# airbus from boston to europe
# cheap tickets to south carolina
# birmingham al distance from indianapolish in
# transportation to the philadelphia airport
# one day cruise from fort lauderdale florida
# directions from washington to canada
# flights from portland or to seattle wa
# honeymoon trip to hawaii
# 
# # Here we've highlighted phrases in different colors, according to the roles they play in building the meaning of the input. # # - Green phrases indicate the destination of travel. # - Blue phrases indicate the origin of travel. # - Yellow phrases indicate a mode of travel: air, boat, etc. # - Orange phrases indicate travel of some kind, but do not specify a travel mode. # - Red phrases indicate a specific type of information sought: distance, directions, etc. # - Gray phrases indicate "optional" words which contribute little to the semantics. (The modifiers "one day" and "honeymoon" may be quite meaningful to the user, but our semantic representation is too impoverished to capture them, so they are not relevant for our grammar.) # # Note that the *ordering* of phrases in a query isn't particularly important. Whether the user says directions from washington to canada, or from washington to canada directions, or even to canada directions from washington, the intent is clearly the same. Some of these formulations might be more natural than others — and more common in the query logs — but all of them should be interpretable by our grammar. # # That's the motivation for the phrase-bag style of grammar. In a simple phrase-bag grammar, a valid query is made up of one or more *query elements*, which can appear in any order, and *optionals*, which can be scattered freely amongst the query elements. (More complex phrase-bag grammars can impose further constraints.) For the travel domain, we can identify two types of query elements: *travel locations* and *travel arguments*. # A travel location is either a *to location* (destination) or a *from location* (origin). # A travel argument is either a *travel mode*, a *travel trigger*, or a *request type*. # # We're ready to write our first grammar rules for the travel domain. # In[15]: def sems_0(sems): return sems[0] def sems_1(sems): return sems[1] def merge_dicts(d1, d2): if not d2: return d1 result = d1.copy() result.update(d2) return result rules_travel = [ Rule('$ROOT', '$TravelQuery', sems_0), Rule('$TravelQuery', '$TravelQueryElements', lambda sems: merge_dicts({'domain': 'travel'}, sems[0])), Rule('$TravelQueryElements', '$TravelQueryElement ?$TravelQueryElements', lambda sems: merge_dicts(sems[0], sems[1])), Rule('$TravelQueryElement', '$TravelLocation', sems_0), Rule('$TravelQueryElement', '$TravelArgument', sems_0), ] # These rules are incomplete: `$TravelLocation` and `$TravelArgument` are not yet defined, and there is no mention of optionals. But these rules define the high-level structure of our phrase-bag grammar. # # Pay attention to the semantic functions attached to each rule. # Remember that our semantic representations are maps of key-value pairs. # In the rules which define `$TravelQueryElement`, the semantic function propagates the semantics of the child unchanged. # In the rule which defines `$TravelQueryElements`, the semantic function merges the semantics of the children. # And in the rule which defines `$TravelQuery`, the semantic function adds a key-value pair to the semantics of the child. # ### Travel locations # The rules introduced above left `$TravelLocation` undefined. Let's add some rules to define it. 
# In[16]: rules_travel_locations = [ Rule('$TravelLocation', '$ToLocation', sems_0), Rule('$TravelLocation', '$FromLocation', sems_0), Rule('$ToLocation', '$To $Location', lambda sems: {'destination': sems[1]}), Rule('$FromLocation', '$From $Location', lambda sems: {'origin': sems[1]}), Rule('$To', 'to'), Rule('$From', 'from'), ] # This looks good, but we still need something that defines `$Location`, which will match a phrase like "`boston`" and assign to it a semantic representation like `{id: 4930956, name: 'Boston, MA, US'}`. If only we had such a thing. # ### The `GeoNamesAnnotator` # [Geocoding][] is the process of mapping a piece of text describing a location (such as a place name or an address) into a canonical, machine-readable representation (such as [geographic coordinates][] or a unique identifier in a geographic database). Because geocoding is a common need across many applications, many geocoding services are available. # # One such service is [GeoNames][]. GeoNames defines a large geographic database in which each location is identified by a unique integer. For example, Boston is identified by [4930956][]. GeoNames also provides a free, [RESTful][] API for geocoding requests. For example, a request to geocode "Palo Alto" looks like this: # # [http://api.geonames.org/searchJSON?q=Palo+Alto&username=demo](http://api.geonames.org/searchJSON?q=Palo+Alto&username=demo) # # [geocoding]: http://en.wikipedia.org/wiki/Geocoding # [geographic coordinates]: http://en.wikipedia.org/wiki/Geographic_coordinate_system # [GeoNames]: http://www.geonames.org/ # [RESTful]: http://en.wikipedia.org/wiki/Representational_state_transfer # [4930956]: http://www.geonames.org/4930956/ # # In SippyCup, the `GeoNamesAnnotator` (defined in [`geonames.py`](./geonames.py)) is implemented as a wrapper around the GeoNames API. The job of the `GeoNamesAnnotator` is to recognize `$Location`s and generate semantics for them. Take a few minutes to skim that code now. # # Note that the `GeoNamesAnnotator` uses a persistent cache which is pre-populated for phrases in the 100 annotated travel examples, avoiding the need for live calls to the GeoNames API. However, if you run on any other examples, you will be making live calls. By default, your requests will specify your username as `wcmac`. That's fine, but each (free) GeoNames account is limited to 2000 calls per hour. If too many people are making calls as `wcmac`, that quota could be exhausted quickly. If that happens, you'll want to [create your own account](http://www.geonames.org/login) on GeoNames. # # You will soon observe that the annotations are far from perfect. For example, "`florida`" is annotated as `{'id': 3442584, 'name': 'Florida, UY'}`, which denotes a city in Uruguay. The current implementation of `GeoNamesAnnotator` could be improved in many ways. For example, while the GeoNames API can return multiple results for ambiguous queries such as "`florida`", the `GeoNamesAnnotator` considers only the first result — and there is no guarantee that the first result is the best. It may also be possible to improve the quality of the results by playing with some of the API request parameters documented [here](http://www.geonames.org/export/geonames-search.html). Some of the [exercises](#travel-exercises) at the end of this unit ask you to explore these possibilities. 
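# If you want to poke at the GeoNames API directly (outside the cached `GeoNamesAnnotator`), a request like the one above is easy to make with the `requests` library. This is only a sketch for experimentation: `q` and `username` come from the example URL above, `maxRows` is one of the request parameters documented at the GeoNames search page linked above, and remember that the free quota is limited.
#
#     import requests
#
#     params = {'q': 'palo alto', 'username': 'demo', 'maxRows': 3}
#     results = requests.get('http://api.geonames.org/searchJSON', params=params).json()
#     for result in results.get('geonames', []):
#         print(result['geonameId'], result['name'], result.get('countryCode'))
#
# Comparing the first result against the later ones is a good way to build intuition for the ambiguity problems ("florida", "washington") described above.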
# # We're now ready to see our grammar parse an input: # In[17]: from collections import defaultdict from geonames import GeoNamesAnnotator from parsing import * geonames_annotator = GeoNamesAnnotator() rules = rules_travel + rules_travel_locations travel_annotators = [geonames_annotator] grammar = Grammar(rules=rules, annotators=travel_annotators) parses = grammar.parse_input('from boston to austin') print(parses[0].semantics) # ### Travel modes # Let's turn our attention from travel locations to travel arguments. One kind of travel argument is a travel mode. Our semantic representation defines eight travel modes (air, bike, boat, bus, car, taxi, train, and transit), and our training examples illustrate some common ways of expressing each travel mode. In fact, individual examples will suggest specific lexical rules. Consider this example: # # Example(input='flights from portland or to seattle wa', # semantics={'domain': 'travel', 'mode': 'air', # 'origin': {'id': 5746545, 'name': 'Portland, OR, US'}, # 'destination': {'id': 5809844, 'name': 'Seattle, WA, US'}}), # # The pairing of the token "`flights`" with the semantic fragment `{'mode': 'air'}` suggests the rule: # # Rule('$AirMode', 'flights', {'mode': 'air'}) # # Actually, for simplicity, we're going to add the '`air`' semantics one level "higher" in the grammar, like this: # # Rule('$TravelMode', '$AirMode', {'mode': 'air'}) # Rule('$AirMode', 'flights') # # The rules below illustrate this approach. These rules will allow our grammar to handle the most common ways of referring to each travel mode. All of the lexical rules below use phrases which are either highly obvious (such as '`taxi`' for `$TaxiMode`) or else are motivated by specific examples in our training dataset (not the test set!). # In[18]: rules_travel_modes = [ Rule('$TravelArgument', '$TravelMode', sems_0), Rule('$TravelMode', '$AirMode', {'mode': 'air'}), Rule('$TravelMode', '$BikeMode', {'mode': 'bike'}), Rule('$TravelMode', '$BoatMode', {'mode': 'boat'}), Rule('$TravelMode', '$BusMode', {'mode': 'bus'}), Rule('$TravelMode', '$CarMode', {'mode': 'car'}), Rule('$TravelMode', '$TaxiMode', {'mode': 'taxi'}), Rule('$TravelMode', '$TrainMode', {'mode': 'train'}), Rule('$TravelMode', '$TransitMode', {'mode': 'transit'}), Rule('$AirMode', 'air fare'), Rule('$AirMode', 'air fares'), Rule('$AirMode', 'airbus'), Rule('$AirMode', 'airfare'), Rule('$AirMode', 'airfares'), Rule('$AirMode', 'airline'), Rule('$AirMode', 'airlines'), Rule('$AirMode', '?by air'), Rule('$AirMode', 'flight'), Rule('$AirMode', 'flights'), Rule('$AirMode', 'fly'), Rule('$BikeMode', '?by bike'), Rule('$BikeMode', 'bike riding'), Rule('$BoatMode', '?by boat'), Rule('$BoatMode', 'cruise'), Rule('$BoatMode', 'cruises'), Rule('$BoatMode', 'norwegian cruise lines'), Rule('$BusMode', '?by bus'), Rule('$BusMode', 'bus tours'), Rule('$BusMode', 'buses'), Rule('$BusMode', 'shutle'), Rule('$BusMode', 'shuttle'), Rule('$CarMode', '?by car'), Rule('$CarMode', 'drive'), Rule('$CarMode', 'driving'), Rule('$CarMode', 'gas'), Rule('$TaxiMode', 'cab'), Rule('$TaxiMode', 'car service'), Rule('$TaxiMode', 'taxi'), Rule('$TrainMode', '?by train'), Rule('$TrainMode', 'trains'), Rule('$TrainMode', 'amtrak'), Rule('$TransitMode', '?by public transportation'), Rule('$TransitMode', '?by ?public transit'), ] # Let's see the rules in action on a toy example. 
# In[19]: rules = rules_travel + rules_travel_locations + rules_travel_modes grammar = Grammar(rules=rules, annotators=travel_annotators) parses = grammar.parse_input('from boston to austin by train') print(parses[0].semantics) # Great, it works. # # We're far from done with the travel grammar, but we now have enough in place that we should be able to parse several of the examples in our training data. This means that we can start hill-climbing on oracle accuracy! # # To drive our grammar engineering process, we're going to use a SippyCup utility function called `sample_wins_and_losses()`, which will report our oracle accuracy on the training data, and show some examples we're parsing correctly and some examples we're not. # # (Note that `sample_wins_and_losses()` requires a `Domain`, a `Model`, and a `Metric`. A description of these classes is tangential to our presentation.  If you're interested, read some SippyCup code! It's not very complicated.) # # # In[20]: from experiment import sample_wins_and_losses from metrics import SemanticsOracleAccuracyMetric from scoring import Model from travel import TravelDomain domain = TravelDomain() model = Model(grammar=grammar) metric = SemanticsOracleAccuracyMetric() sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=31) # As you can see, `sample_wins_and_losses()` doesn't print many details — just a few evaluation metrics, and then examples of wins and losses with the current grammar. (A "win" is an example on which our primary metric has a positive value.) Our primary metric, semantics oracle accuracy, stands at 0.133 — not great, but greater than zero, so it's a start. The wins make sense: we can parse queries which consist solely of travel locations and travel modes, with no extraneous elements. The losses are more interesting, because they will provide the motivation for the next phase of work. # ### Travel triggers # Many of the examples we're currently failing to parse contain phrases (such as "tickets" or "transportation") which indicate a travel intent, but do not specify a travel mode. These phrases constitute the second type of travel argument, namely *travel triggers*. Inspection of the examples in our training dataset suggest a small number of lexical rules, shown here. # In[21]: rules_travel_triggers = [ Rule('$TravelArgument', '$TravelTrigger', {}), Rule('$TravelTrigger', 'tickets'), Rule('$TravelTrigger', 'transportation'), Rule('$TravelTrigger', 'travel'), Rule('$TravelTrigger', 'travel packages'), Rule('$TravelTrigger', 'trip'), ] # Let's run `sample_wins_and_losses()` again, to see how much we've gained, and what to work on next. # In[22]: rules = rules_travel + rules_travel_locations + rules_travel_modes + rules_travel_triggers grammar = Grammar(rules=rules, annotators=travel_annotators) model = Model(grammar=grammar) sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1) # So we've gained about four points (that is, 0.04) in oracle accuracy on the training dataset. We're making progress! Again, the losses suggest where to go next. # ### Request types # The third and last kind of travel argument is a request type, which indicates a specific type of information sought, such as directions or distance. We'll adopt the same methodology, adding lexical rules motivated by specific training examples, tied together by higher-level rules which add semantics. 
# In[23]: rules_request_types = [ Rule('$TravelArgument', '$RequestType', sems_0), Rule('$RequestType', '$DirectionsRequest', {'type': 'directions'}), Rule('$RequestType', '$DistanceRequest', {'type': 'distance'}), Rule('$RequestType', '$ScheduleRequest', {'type': 'schedule'}), Rule('$RequestType', '$CostRequest', {'type': 'cost'}), Rule('$DirectionsRequest', 'directions'), Rule('$DirectionsRequest', 'how do i get'), Rule('$DistanceRequest', 'distance'), Rule('$ScheduleRequest', 'schedule'), Rule('$CostRequest', 'cost'), ] # Again, we'll check our progress using `sample_wins_and_losses()`. # In[24]: rules = rules_travel + rules_travel_locations + rules_travel_modes + rules_travel_triggers + rules_request_types grammar = Grammar(rules=rules, annotators=travel_annotators) model = Model(grammar=grammar) sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1) # Great, oracle accuracy is up to 0.20. But there's still one big piece we're missing. # ### Optionals # A key ingredient of the phrase-bag approach to grammar building is the ability to accept *optional* elements interspersed freely among the query elements. Optionals are phrases which can be either present or absent; typically, they contribute nothing to the semantics. # # The following rules illustrate one approach to allowing optionals. The first two rules allow any `$TravelQueryElement` to combine with an `$Optionals` either to the right or to the left, while ignoring its semantics. The third rule defines `$Optionals` as a sequence of one or more `$Optional` elements, while the following rules define several specific categories of optionals. As usual, most of the lexical rules are motivated by specific examples from the training dataset, with a few extras included just because they are super obvious. # # This is not necessarily the best design! One of the [exercises](#travel-exercises) will challenge you to do better. # In[25]: rules_optionals = [ Rule('$TravelQueryElement', '$TravelQueryElement $Optionals', sems_0), Rule('$TravelQueryElement', '$Optionals $TravelQueryElement', sems_1), Rule('$Optionals', '$Optional ?$Optionals'), Rule('$Optional', '$Show'), Rule('$Optional', '$Modifier'), Rule('$Optional', '$Carrier'), Rule('$Optional', '$Stopword'), Rule('$Optional', '$Determiner'), Rule('$Show', 'book'), Rule('$Show', 'give ?me'), Rule('$Show', 'show ?me'), Rule('$Modifier', 'cheap'), Rule('$Modifier', 'cheapest'), Rule('$Modifier', 'discount'), Rule('$Modifier', 'honeymoon'), Rule('$Modifier', 'one way'), Rule('$Modifier', 'direct'), Rule('$Modifier', 'scenic'), Rule('$Modifier', 'transatlantic'), Rule('$Modifier', 'one day'), Rule('$Modifier', 'last minute'), Rule('$Carrier', 'delta'), Rule('$Carrier', 'jet blue'), Rule('$Carrier', 'spirit airlines'), Rule('$Carrier', 'amtrak'), Rule('$Stopword', 'all'), Rule('$Stopword', 'of'), Rule('$Stopword', 'what'), Rule('$Stopword', 'will'), Rule('$Stopword', 'it'), Rule('$Stopword', 'to'), Rule('$Determiner', 'a'), Rule('$Determiner', 'an'), Rule('$Determiner', 'the'), ] # Again, we'll check our progress using `sample_wins_and_losses()`. # In[26]: rules = rules_travel + rules_travel_locations + rules_travel_modes + rules_travel_triggers + rules_request_types + rules_optionals grammar = Grammar(rules=rules, annotators=travel_annotators) model = Model(grammar=grammar) sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1) # Adding support for optionals has doubled oracle accuracy on the training dataset, from 0.200 to 0.400. This is a big gain! 
However, there are still many losses, and many of them share a property: they are *negative examples*. # ### Negative examples # A semantic parsing model for a given domain should be able to predict that a given input does *not* belong to the domain. We call such inputs *negative examples*. For the travel domain, negative examples include: # # discount tickets to new york city ballet # george washington borrows 500 000 from pennsylvania farmer to finance war # ride this train to roseburg oregon now ther's a town for ya # # Much of the academic literature on semantic parsing describes systems which are required always to produce an in-domain semantic representation, no matter what the input. If the input is *not* in-domain, the result is usually garbage. In real-world applications, it's much better to have models which can learn when to produce no positive output. # # The easiest way to achieve this is to introduce some rules which allow any input to be parsed with "negative" semantics, and then learn weights for those rule features in the scoring model. In the travel domain, the "negative" semantic representation is the special value `{'domain': 'other'}`. # In[27]: rules_not_travel = [ Rule('$ROOT', '$NotTravelQuery', sems_0), Rule('$NotTravelQuery', '$Text', {'domain': 'other'}), Rule('$Text', '$Token ?$Text'), ] # Note that the last rule depends on the `$Token` category, which can be applied to any token by the `TokenAnnotator`. So let's add the `TokenAnnotator` to our list of annotators. # In[28]: travel_annotators = [geonames_annotator, TokenAnnotator()] # As usual, we'll check our progress using `sample_wins_and_losses()`. # In[29]: rules = rules_travel + rules_travel_locations + rules_travel_modes + rules_travel_triggers + rules_request_types + rules_optionals + rules_not_travel grammar = Grammar(rules=rules, annotators=travel_annotators) model = Model(grammar=grammar) sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1) # We've achieved another big gain in oracle accuracy, from 0.400 to 0.573, just by ensuring that we offer a "negative" prediction for every input. (Note that the mean number of parses has increased by exactly 1, from 0.773 to 1.773.) However, for the first time, a big gap has opened between accuracy, at 0.173, and oracle accuracy, at 0.573. The problem is that we don't yet have a scoring model for the travel domain, so the ranking of parses is arbitrary. In order to close the gap, we need to create a scoring model. One of the [exercises](#travel-exercises) will ask you to pursue this. # ## Exercises # Several of these exercises ask you to measure the impact of your change on key evaluation metrics. Part of your job is to decide which evaluation metrics are most relevant for the change you're making. It's probably best to evaluate only on training data, in order to keep the test data unseen during development. (But the test data is hardly a state secret, so whatever.) # # ### Straightforward # # 1. Select 20 of the 6,588 queries in [aol-travel-queries.txt](./aol-travel-queries.txt) and manually annotate them with target semantics. Select your queries using uniform random sampling — this will minimize overlap between different people completing this exercise, and therefore maximize the value of the aggregate labor. (In Python, you can sample from a list using `random.sample()`.) You will likely find some cases where it's not clear what the right semantics are. Do your best. 
The point of the exercise is to develop an awareness of the challenges of annotation, and to recognize that there's no such thing as perfectly annotated data. # # 1. Many of the remaining errors on training examples occur because the origin isn't marked by "from". Examples include "`transatlantic cruise southampton to tampa`", "`fly boston to myrtle beach spirit airlines`", and "`distance usa to peru`". Extend the grammar to handle examples like these. Measure the impact on key evaluation metrics. # # 1. Does your solution to the previous exercise handle examples where some other query element intervenes between origin and destination? Examples include "`university of washington transportation to seatac`", "`birmingham al distance from indianapolish in`", and "`nyc flights to buffalo ny`". If not, extend the grammar to handle examples like these. Measure the impact on key evaluation metrics. # # 1. The current structure of the grammar permits parses containing any number of `$FromLocation`s and `$ToLocation`s, including zero. Find a way to require that (a) there is at least one `$FromLocation` or `$ToLocation`, (b) there are not multiple `$FromLocation`s or `$ToLocation`s. # # 1. The travel grammar is lacking a scoring model, and it shows a big gap between accuracy and oracle accuracy. Examine and diagnose some examples where accuracy is 0 even though oracle accuracy is 1. Propose and implement a scoring model, and measure its efficacy in closing that gap. # # # ### Challenging # # 1. Extend `Grammar` to allow rules which mix terminals and non-terminals on the RHS, such as `Rule('$RouteQuery', '$TravelMode from $Location to $Location')`. # # 1. Try to improve the precision of the `GeoNamesAnnotator` by fiddling with the GeoNames API request parameters documented [here](http://www.geonames.org/export/geonames-search.html). For example, the `featureClass`, `countryBias`, or `orderby` parameters seem like promising targets. # # 1. Try to improve the coverage of the `GeoNamesAnnotator` by enabling it to return multiple annotations for ambiguous location names. Investigate the impact of varying the maximum number of annotations on various performance metrics, including accuracy, oracle accuracy, and number of parses. How would you characterize the tradeoff you're making? # # 1. Building on the previous exercise, implement a feature which captures information about the result rank of annotations generated by the `GeoNamesAnnotator`, and see if you can use this feature to narrow the gap between accuracy and oracle accuracy. # # 1. You have probably noticed that one of our standard evaluation metrics is something called "spurious ambiguity". Dig into the SippyCup codebase to figure out what spurious ambiguity is. Here's a hint: it's something bad, so we want to push that metric toward zero. Find a training example where it's not zero, and figure out why the example exhibits spurious ambiguity. Are there changes we can make to the grammar to reduce spurious ambiguity? Also, why is spurious ambiguity undesirable? # # 1. In its current form, the travel grammar parses lots of queries it shouldn't. (By "parses", we mean "generates a *positive* parse".) This problem is known as *overtriggering*. The overtriggering problem is hard to observe on our tiny dataset of 100 examples, where most of the examples are positive examples. Investigate overtriggering by downloading the AOL query dataset, identifying the 1,000 most frequent queries, and running them through the grammar. 
How many cases of overtriggering do you find? Can you suggest some simple changes to minimize overtriggering? (Warning: the AOL dataset contains lots of queries which may be offensive. Skip this exercise if you're not down with that.) # # 1. Consider the queries "`flights los angeles hawaii`" vs. "`flights los angeles california`". Despite the superficial resemblance, the second query seems to mean flights *to* Los Angeles, not flights *from* Los Angeles *to* California. But the grammar can successfully interpret both queries only if it permits both interpretations for each query. This creates a ranking problem: which interpretation should be scored higher? Can you add to the scoring model a feature which solves the problem? # # 1. Develop and execute a strategy to annotate all 6,588 queries in [aol-travel-queries.txt](./aol-travel-queries.txt) with target semantics using crowdsourcing. Warning: this is an ambitious exercise. # # Copyright (C) 2015 Bill MacCartney