#!/usr/bin/env python
# coding: utf-8

# SippyCup
# Unit 3: Geography queries
#
# Bill MacCartney
# Spring 2015

# This is Unit 3 of the SippyCup codelab.
# Our third case study will examine the domain of geography queries. In particular, we'll focus on the [Geo880][] corpus, which contains 880 queries about U.S. geography. Examples include:
#
# [Geo880]: http://www.cs.utexas.edu/users/ml/geo.html
#
# "which states border texas?"
# "how many states border the largest state?"
# "what is the size of the capital of texas?"
#
# The Geo880 queries have a quite different character from the arithmetic queries and travel queries we have examined previously. They differ from the arithmetic queries in using a large vocabulary, and in exhibiting greater degrees of both lexical and syntactic ambiguity. They differ from the travel queries in adhering to conventional rules for spelling and syntax, and in having semantics with arbitrarily complex compositional structure. For example:
#
# "what rivers flow through states that border the state with the largest population?"
# "what is the population of the capital of the largest state through which the mississippi runs?"
# "what is the longest river that passes the states that border the state that borders the most states?"
#
# Geo880 was developed in Ray Mooney's group at UT Austin. It is of particular interest because it has for many years served as a standard evaluation for semantic parsing systems. (See, for example, [Zelle & Mooney 1996][], [Tang & Mooney 2001][], [Zettlemoyer & Collins 2005][], and [Liang et al. 2011][].) It has thereby become, for many, a paradigmatic application of semantic parsing. It has also served as a bridge between an older current of research on natural language interfaces to databases (NLIDBs) (see [Androutsopoulos et al. 1995][]) and the modern era of semantic parsing.
#
# [Zelle & Mooney 1996]: http://www.aaai.org/Papers/AAAI/1996/AAAI96-156.pdf
# [Tang & Mooney 2001]: http://www.cs.utexas.edu/~ai-lab/pubs/cocktail-ecml-01.pdf
# [Zettlemoyer & Collins 2005]: http://people.csail.mit.edu/lsz/papers/zc-uai05.pdf
# [Liang et al. 2011]: http://www.cs.berkeley.edu/~jordan/papers/liang-jordan-klein-acl2011.pdf
# [Androutsopoulos et al. 1995]: http://arxiv.org/pdf/cmp-lg/9503016.pdf
#
# The domain of geography queries is also of interest because there are many plausible real-world applications for semantic parsing which similarly involve complex compositional queries against a richly structured knowledge base. For example, some people are passionate about baseball statistics, and might want to ask queries like:
#
# "pitchers who have struck out four batters in one inning"
# "players who have stolen at least 100 bases in a season"
# "complete games with fewer than 90 pitches"
# "most home runs hit in one game"
#
# Environmental advocates and policymakers might have queries like:
#
# "which country has the highest co2 emissions"
# "what five countries have the highest per capita co2 emissions"
# "what country's co2 emissions increased the most over the last five years"
# "what fraction of co2 emissions was from european countries in 2010"
#
# Techniques that work in the geography domain are likely to work in these other domains too.

# ## The Geo880 dataset

# I've been told that the Geo880 queries were collected from students in classes taught by Ray Mooney at UT Austin. I'm not sure whether I've got the story right. But this account is consistent with one of the notable limitations of the dataset: it is not a natural distribution, and not a realistic representation of the geography queries that people actually ask on, say, Google.
# Nobody ever asks, "what is the longest river that passes the states that border the state that borders the most states?" Nobody. Ever.
#
# The dataset was published online by Rohit Jaivant Kate in a [Prolog file][] containing semantic representations in Prolog style. It was later republished by Yuk Wah Wong as an [XML file][] containing additional metadata for each example, including translations into Spanish, Japanese, and Turkish; syntactic parse trees; and semantics in two different representations: Prolog and [FunQL][].
#
# [Prolog file]: ftp://ftp.cs.utexas.edu/pub/mooney/nl-ilp-data/geosystem/geoqueries880
# [XML file]: http://www.cs.utexas.edu/~ml/wasp/geo-funql/corpus.xml
# [FunQL]: http://www.cs.utexas.edu/~ml/wasp/geo-funql.html
#
# In SippyCup, we're not going to use either Prolog or FunQL semantics. Instead, we'll use examples which have been annotated only with denotations (which were provided by Percy Liang — thanks!). Of course, our grammar will require a semantic representation, even if our examples are not annotated with semantics. We will introduce one [below](#geoquery-semantic-representation).
#
# The Geo880 dataset is conventionally divided into 600 training examples and 280 test examples. In SippyCup, the dataset can be found in [`geo880.py`](./geo880.py). Let's take a peek.

# In[1]:

from geo880 import geo880_train_examples, geo880_test_examples

print('train examples:', len(geo880_train_examples))
print('test examples: ', len(geo880_test_examples))
print(geo880_train_examples[0])
print(geo880_test_examples[0])


# ## The Geobase knowledge base

# Geobase is a small knowledge base about the geography of the United States. It contains (almost) all the information needed to answer queries in the Geo880 dataset, including facts about:
#
# - states: capital, area, population, major cities, neighboring states, highest and lowest points and elevations
# - cities: containing state and population
# - rivers: length and states traversed
# - mountains: containing state and height
# - roads: states traversed
# - lakes: area, states traversed
#
# SippyCup contains a class called `GeobaseReader` (in [`geobase.py`](./geobase.py)) which facilitates working with Geobase in Python. It reads and parses the Geobase Prolog file, and creates a set of tuples representing its content. Let's take a look.
#
# [Prolog file]: ftp://ftp.cs.utexas.edu/pub/mooney/nl-ilp-data/geosystem/geobase

# In[2]:

from geobase import GeobaseReader

reader = GeobaseReader()
unaries = [str(t) for t in reader.tuples if len(t) == 2]
print('\nSome unaries:\n ' + '\n '.join(unaries[:10]))
binaries = [str(t) for t in reader.tuples if len(t) == 3]
print('\nSome binaries:\n ' + '\n '.join(binaries[:10]))


# Some observations here:
#
# - _Unaries_ are pairs consisting of a unary predicate (a type) and an entity.
# - _Binaries_ are triples consisting of a binary predicate (a relation) and two entities (or an entity and a numeric or string value).
# - Entities are named by unique identifiers of the form `/type/name`. This is a `GeobaseReader` convention; these identifiers are not used in the original Prolog file.
# - Some entities have the generic type `place` because they occur in the Prolog file only as the highest or lowest point in a state, and it's hard to reliably assign such points to one of the more specific types.
# - The original Prolog file is inconsistent about units. For example, the area of states is expressed in square miles, but the area of lakes is expressed in square kilometers. `GeobaseReader` converts everything to [SI units][]: meters and square meters.
#
# [SI units]: http://en.wikipedia.org/wiki/International_System_of_Units
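# To get a more concrete feel for how a single entity is represented, here is a small optional sketch. It assumes nothing beyond `reader.tuples` as described above, plus the entity id `/state/texas` (an id of the `/type/name` form, which we'll also use later in this unit).

# In[ ]:

# Sketch: collect every Geobase tuple that mentions a particular entity id.
texas_facts = [t for t in reader.tuples if '/state/texas' in t[1:]]
for t in texas_facts[:10]:
    print(t)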
# ## Semantic representation

# `GeobaseReader` merely reads the data in Geobase into a set of tuples. It doesn't provide any facility for querying that data. That's where `GraphKB` and `GraphKBExecutor` come in. `GraphKB` is a graph-structured knowledge base, with indexing for fast lookups. `GraphKBExecutor` defines a representation for formal queries against that knowledge base, and supports query execution. The formal query language defined by `GraphKBExecutor` will serve as our semantic representation for the geography domain.

# ### The `GraphKB` class

# A `GraphKB` is a generic graph-structured knowledge base, or equivalently, a set of relational pairs and triples, with indexing for fast lookups. It represents a knowledge base as a set of tuples, each either:
#
# - a pair, consisting of a unary relation and an element which belongs to it, or
# - a triple, consisting of a binary relation and a pair of elements which belong to it.
#
# For example, we can construct a `GraphKB` representing facts about [The Simpsons][]:
#
# [The Simpsons]: http://en.wikipedia.org/wiki/The_Simpsons

# In[3]:

from graph_kb import GraphKB

simpsons_tuples = [

    # unaries
    ('male', 'homer'),
    ('female', 'marge'),
    ('male', 'bart'),
    ('female', 'lisa'),
    ('female', 'maggie'),
    ('adult', 'homer'),
    ('adult', 'marge'),
    ('child', 'bart'),
    ('child', 'lisa'),
    ('child', 'maggie'),

    # binaries
    ('has_age', 'homer', 36),
    ('has_age', 'marge', 34),
    ('has_age', 'bart', 10),
    ('has_age', 'lisa', 8),
    ('has_age', 'maggie', 1),
    ('has_brother', 'lisa', 'bart'),
    ('has_brother', 'maggie', 'bart'),
    ('has_sister', 'bart', 'maggie'),
    ('has_sister', 'bart', 'lisa'),
    ('has_sister', 'lisa', 'maggie'),
    ('has_sister', 'maggie', 'lisa'),
    ('has_father', 'bart', 'homer'),
    ('has_father', 'lisa', 'homer'),
    ('has_father', 'maggie', 'homer'),
    ('has_mother', 'bart', 'marge'),
    ('has_mother', 'lisa', 'marge'),
    ('has_mother', 'maggie', 'marge'),
]

simpsons_kb = GraphKB(simpsons_tuples)


# The `GraphKB` object now contains three indexes:
#
# - `unaries[U]`: all entities belonging to unary relation `U`
# - `binaries_fwd[B][E]`: all entities `X` such that `(E, X)` belongs to binary relation `B`
# - `binaries_rev[B][E]`: all entities `X` such that `(X, E)` belongs to binary relation `B`
#
# For example:

# In[4]:

simpsons_kb.unaries['child']


# In[5]:

simpsons_kb.binaries_fwd['has_sister']['lisa']


# In[6]:

simpsons_kb.binaries_rev['has_sister']['lisa']


# ### The `GraphKBExecutor` class

# A `GraphKBExecutor` executes formal queries against a `GraphKB` and returns their denotations. Queries are represented by Python tuples, and can be nested. Denotations are also represented by Python tuples, but are conceptually sets (possibly empty). The elements of these tuples are always sorted in canonical order, so that they can be reliably compared for set equality.
#
# The query language defined by `GraphKBExecutor` is perhaps most easily explained by example:

# In[7]:

queries = [
    'bart',
    'male',
    ('has_sister', 'lisa'),   # who has sister lisa?
    ('lisa', 'has_sister'),   # lisa has sister who, i.e., who is a sister of lisa?
    ('lisa', 'has_brother'),  # lisa has brother who, i.e., who is a brother of lisa?
    ('.and', 'male', 'child'),
    ('.or', 'male', 'adult'),
    ('.not', 'child'),
    ('.any',),                # anything
    ('.any', 'has_sister'),   # anything has sister who, i.e., who is a sister of anything?
    ('.and', 'child', ('.not', ('.any', 'has_sister'))),
    ('.count', ('bart', 'has_sister')),
    ('has_age', ('.gt', 21)),
    ('has_age', ('.lt', 2)),
    ('has_age', ('.eq', 10)),
    ('.max', 'has_age', 'female'),
    ('.min', 'has_age', ('bart', 'has_sister')),
    ('.max', 'has_age', '.any'),
    ('.argmax', 'has_age', 'female'),
    ('.argmin', 'has_age', ('bart', 'has_sister')),
    ('.argmax', 'has_age', '.any'),
]

executor = simpsons_kb.executor()
for query in queries:
    print()
    print('Q ', query)
    print('D ', executor.execute(query))


# Note that the query `(R, E)` denotes entities having relation `R` to entity `E`, whereas the query `(E, R)` denotes entities to which entity `E` has relation `R`.
#
# For a more detailed understanding of the style of semantic representation defined by `GraphKBExecutor`, take a look at the [source code](graph_kb.py).

# ### Using `GraphKBExecutor` with Geobase

# In[8]:

geobase = GraphKB(reader.tuples)
executor = geobase.executor()
queries = [
    ('/state/texas', 'capital'),                      # capital of texas
    ('.and', 'river', ('traverses', '/state/utah')),  # rivers that traverse utah
    ('.argmax', 'height', 'mountain'),                # tallest mountain
]
for query in queries:
    print()
    print(query)
    print(executor.execute(query))


# ## Grammar engineering

# It's time to start developing a grammar for the geography domain. As in Unit 2, the performance metric we'll focus on during grammar engineering is *oracle* accuracy (the proportion of examples for which *any* parse is correct), not accuracy (the proportion of examples for which the *first* parse is correct). Remember that oracle accuracy is an upper bound on accuracy, and is a measure of the expressive power of the grammar: does it have the rules it needs to generate the correct parse? The gap between oracle accuracy and accuracy, on the other hand, reflects the ability of the scoring model to bring the correct parse to the top of the candidate list.
#
# As always, we're going to take a data-driven approach to grammar engineering. We want to introduce rules which will enable us to handle the lexical items and syntactic structures that we actually observe in the Geo880 training data. To that end, let's count the words that appear among the 600 training examples. (We do _not_ examine the test data!)

# In[9]:

from collections import defaultdict
from operator import itemgetter

from geo880 import geo880_train_examples

words = [word for example in geo880_train_examples for word in example.input.split()]
counts = defaultdict(int)
for word in words:
    counts[word] += 1
counts = sorted([(count, word) for word, count in counts.items()], reverse=True)

print('There were %d tokens of %d types:\n' % (len(words), len(counts)))
print(', '.join(['%s (%d)' % (word, count) for count, word in counts[:50]] + ['...']))


# There are at least four major categories of words here:
#
# - Words that refer to *entities*, such as "texas", "mississippi", "usa", and "austin".
# - Words that refer to *types*, such as "state", "river", and "cities".
# - Words that refer to *relations*, such as "in", "borders", "capital", and "long".
# - Other function words, such as "the", "what", "how", and "are".
#
# One might make finer distinctions, but this seems like a reasonable starting point. Note that these categories do not always correspond to traditional syntactic categories. While the entities are typically proper nouns, and the types are typically common nouns, the relations include prepositions, verbs, nouns, and adjectives.
#
# The design of our grammar will roughly follow this schema. The major categories will include `$Entity`, `$Type`, `$Collection`, `$Relation`, and `$Optional`.
# ### Optionals

# In [Unit 2][], our grammar engineering process didn't really start cooking until we introduced optionals. This time around, let's *begin* with the optionals. We'll define as `$Optional` every word in the Geo880 training data which does not plainly refer to an entity, type, or relation. And we'll let any query be preceded or followed by a sequence of one or more `$Optionals`.
#
# [Unit 2]: ./sippycup-unit-2.ipynb

# In[10]:

from parsing import Grammar, Rule

optional_words = [
    'the', '?', 'what', 'is', 'in', 'of', 'how', 'many', 'are', 'which', 'that',
    'with', 'has', 'major', 'does', 'have', 'where', 'me', 'there', 'give', 'name',
    'all', 'a', 'by', 'you', 'to', 'tell', 'other', 'it', 'do', 'whose', 'show',
    'one', 'on', 'for', 'can', 'whats', 'urban', 'them', 'list', 'exist', 'each',
    'could', 'about'
]

rules_optionals = [
    Rule('$ROOT', '?$Optionals $Query ?$Optionals', lambda sems: sems[1]),
    Rule('$Optionals', '$Optional ?$Optionals'),
] + [Rule('$Optional', word) for word in optional_words]


# Because `$Query` has not yet been defined, we won't be able to parse anything yet.

# ### Entities and collections

# Our grammar will need to be able to recognize names of entities, such as `"utah"`. There are hundreds of entities in Geobase, and we don't want to have to introduce a grammar rule for each entity. Instead, we'll define a new annotator, `GeobaseAnnotator`, which simply annotates phrases which exactly match names in Geobase.

# In[11]:

from annotator import Annotator, NumberAnnotator

class GeobaseAnnotator(Annotator):
    def __init__(self, geobase):
        self.geobase = geobase

    def annotate(self, tokens):
        phrase = ' '.join(tokens)
        places = self.geobase.binaries_rev['name'][phrase]
        return [('$Entity', place) for place in places]


# Now a couple of rules that will enable us to parse inputs that simply name locations, such as `"utah"`. The category `$Collection` will cover any phrase which denotes a set of entities; an `$Entity` is just a special case, a singleton collection. `$Query` is the top-level category which `$ROOT` surrounds with optionals, and for now a `$Query` is simply a `$Collection`.

# In[12]:

rules_collection_entity = [
    Rule('$Query', '$Collection', lambda sems: sems[0]),
    Rule('$Collection', '$Entity', lambda sems: sems[0]),
]

rules = rules_optionals + rules_collection_entity


# Now let's make a grammar.

# In[13]:

annotators = [NumberAnnotator(), GeobaseAnnotator(geobase)]
grammar = Grammar(rules=rules, annotators=annotators)


# Let's try to parse some inputs which just name locations.

# In[14]:

parses = grammar.parse_input('what is utah')
for parse in parses[:1]:
    print('\n'.join([str(parse.semantics), str(executor.execute(parse.semantics))]))


# Great, it worked. Now let's run an evaluation on the Geo880 training examples.

# In[15]:

from experiment import sample_wins_and_losses
from geoquery import GeoQueryDomain
from metrics import DenotationOracleAccuracyMetric
from scoring import Model

domain = GeoQueryDomain()
model = Model(grammar=grammar, executor=executor.execute)
metric = DenotationOracleAccuracyMetric()

sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)


# We don't yet have a single win: denotation oracle accuracy remains stuck at zero. However, the average number of parses is slightly greater than zero, meaning that there are a few examples which our grammar can parse (though not correctly). It would be interesting to know which examples. There's a utility function in [`experiment.py`](./experiment.py) which will give you the visibility you need. See if you can figure out what to do.
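# If you'd like a quick peek before reaching for `experiment.py`, here is a rough sketch that uses only the pieces we've already seen: run the grammar over the training inputs and keep the ones that receive at least one parse. (It only tells us which inputs parse at all, not whether any parse is correct.)

# In[ ]:

# Sketch: which training inputs does the current grammar parse at all?
parseable = [ex.input for ex in geo880_train_examples if grammar.parse_input(ex.input)]
print('%d of %d training inputs parse' % (len(parseable), len(geo880_train_examples)))
print('\n'.join(parseable[:5]))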
# ### Types

# The words in the training data include lots of words for types. Let's write down some lexical rules defining the category `$Type`, guided as usual by the words we actually see in the training data. We'll also make `$Type` a kind of `$Collection`.

# In[16]:

rules_types = [
    Rule('$Collection', '$Type', lambda sems: sems[0]),

    Rule('$Type', 'state', 'state'),
    Rule('$Type', 'states', 'state'),
    Rule('$Type', 'city', 'city'),
    Rule('$Type', 'cities', 'city'),
    Rule('$Type', 'big cities', 'city'),
    Rule('$Type', 'towns', 'city'),
    Rule('$Type', 'river', 'river'),
    Rule('$Type', 'rivers', 'river'),
    Rule('$Type', 'mountain', 'mountain'),
    Rule('$Type', 'mountains', 'mountain'),
    Rule('$Type', 'mount', 'mountain'),
    Rule('$Type', 'peak', 'mountain'),
    Rule('$Type', 'road', 'road'),
    Rule('$Type', 'roads', 'road'),
    Rule('$Type', 'lake', 'lake'),
    Rule('$Type', 'lakes', 'lake'),
    Rule('$Type', 'country', 'country'),
    Rule('$Type', 'countries', 'country'),
]


# We should now be able to parse inputs denoting types, such as `"name the lakes"`:

# In[17]:

rules = rules_optionals + rules_collection_entity + rules_types
grammar = Grammar(rules=rules, annotators=annotators)
parses = grammar.parse_input('name the lakes')
for parse in parses[:1]:
    print('\n'.join([str(parse.semantics), str(executor.execute(parse.semantics))]))


# It worked. Let's evaluate on the Geo880 training data again.

# In[18]:

model = Model(grammar=grammar, executor=executor.execute)
sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)


# Liftoff! We have two wins, and denotation oracle accuracy is greater than zero! Just barely.

# ### Relations and joins

# In order to really make this bird fly, we're going to have to handle relations. In particular, we'd like to be able to parse queries which combine a relation with an entity or collection, such as `"what is the capital of vermont"`.
#
# As usual, we'll adopt a data-driven approach. The training examples include lots of words and phrases which refer to relations, both "forward" relations (like `"traverses"`) and "reverse" relations (like `"traversed by"`). Guided by the training data, we'll write lexical rules which define the categories `$FwdRelation` and `$RevRelation`. Then we'll add rules that allow either a `$FwdRelation` or a `$RevRelation` to be promoted to a generic `$Relation`, with semantic functions which ensure that the semantics are constructed with the proper orientation. Finally, we'll define a rule for _joining_ a `$Relation` (such as `"capital of"`) with a `$Collection` (such as `"vermont"`) to yield another `$Collection` (such as `"capital of vermont"`).
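# Before writing out the rules, here is a tiny illustrative sketch (not part of the grammar) of what those semantic functions will do. A forward relation builds a query of the form `(R, E)`, while a reverse relation builds `(E, R)`, matching the two query shapes we saw with `GraphKBExecutor` above. The relation names `'traverses'` and `'capital'` are the same ones used in the Geobase queries earlier in this unit.

# In[ ]:

# Illustration only: a $FwdRelation like "traverses" will yield a function that
# puts the relation first, while a $RevRelation like "capital" will yield a
# function that puts its argument first.
fwd_traverses = lambda arg: ('traverses', arg)  # forward orientation: (R, E)
rev_capital = lambda arg: (arg, 'capital')      # reverse orientation: (E, R)

print(fwd_traverses('/state/utah'))
print(rev_capital('/state/vermont'))
print(executor.execute(rev_capital('/state/vermont')))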
# In[19]:

rules_relations = [
    Rule('$Collection', '$Relation ?$Optionals $Collection', lambda sems: sems[0](sems[2])),
    Rule('$Relation', '$FwdRelation', lambda sems: (lambda arg: (sems[0], arg))),
    Rule('$Relation', '$RevRelation', lambda sems: (lambda arg: (arg, sems[0]))),

    Rule('$FwdRelation', '$FwdBordersRelation', 'borders'),
    Rule('$FwdBordersRelation', 'border'),
    Rule('$FwdBordersRelation', 'bordering'),
    Rule('$FwdBordersRelation', 'borders'),
    Rule('$FwdBordersRelation', 'neighbor'),
    Rule('$FwdBordersRelation', 'neighboring'),
    Rule('$FwdBordersRelation', 'surrounding'),
    Rule('$FwdBordersRelation', 'next to'),

    Rule('$FwdRelation', '$FwdTraversesRelation', 'traverses'),
    Rule('$FwdTraversesRelation', 'cross ?over'),
    Rule('$FwdTraversesRelation', 'flow through'),
    Rule('$FwdTraversesRelation', 'flowing through'),
    Rule('$FwdTraversesRelation', 'flows through'),
    Rule('$FwdTraversesRelation', 'go through'),
    Rule('$FwdTraversesRelation', 'goes through'),
    Rule('$FwdTraversesRelation', 'in'),
    Rule('$FwdTraversesRelation', 'pass through'),
    Rule('$FwdTraversesRelation', 'passes through'),
    Rule('$FwdTraversesRelation', 'run through'),
    Rule('$FwdTraversesRelation', 'running through'),
    Rule('$FwdTraversesRelation', 'runs through'),
    Rule('$FwdTraversesRelation', 'traverse'),
    Rule('$FwdTraversesRelation', 'traverses'),

    Rule('$RevRelation', '$RevTraversesRelation', 'traverses'),
    Rule('$RevTraversesRelation', 'has'),
    Rule('$RevTraversesRelation', 'have'),  # 'how many states have major rivers'
    Rule('$RevTraversesRelation', 'lie on'),
    Rule('$RevTraversesRelation', 'next to'),
    Rule('$RevTraversesRelation', 'traversed by'),
    Rule('$RevTraversesRelation', 'washed by'),

    Rule('$FwdRelation', '$FwdContainsRelation', 'contains'),  # 'how many states have a city named springfield'
    Rule('$FwdContainsRelation', 'has'),
    Rule('$FwdContainsRelation', 'have'),

    Rule('$RevRelation', '$RevContainsRelation', 'contains'),
    Rule('$RevContainsRelation', 'contained by'),
    Rule('$RevContainsRelation', 'in'),
    Rule('$RevContainsRelation', 'found in'),
    Rule('$RevContainsRelation', 'located in'),
    Rule('$RevContainsRelation', 'of'),

    Rule('$RevRelation', '$RevCapitalRelation', 'capital'),
    Rule('$RevCapitalRelation', 'capital'),
    Rule('$RevCapitalRelation', 'capitals'),

    Rule('$RevRelation', '$RevHighestPointRelation', 'highest_point'),
    Rule('$RevHighestPointRelation', 'high point'),
    Rule('$RevHighestPointRelation', 'high points'),
    Rule('$RevHighestPointRelation', 'highest point'),
    Rule('$RevHighestPointRelation', 'highest points'),

    Rule('$RevRelation', '$RevLowestPointRelation', 'lowest_point'),
    Rule('$RevLowestPointRelation', 'low point'),
    Rule('$RevLowestPointRelation', 'low points'),
    Rule('$RevLowestPointRelation', 'lowest point'),
    Rule('$RevLowestPointRelation', 'lowest points'),
    Rule('$RevLowestPointRelation', 'lowest spot'),

    Rule('$RevRelation', '$RevHighestElevationRelation', 'highest_elevation'),
    Rule('$RevHighestElevationRelation', '?highest elevation'),

    Rule('$RevRelation', '$RevHeightRelation', 'height'),
    Rule('$RevHeightRelation', 'elevation'),
    Rule('$RevHeightRelation', 'height'),
    Rule('$RevHeightRelation', 'high'),
    Rule('$RevHeightRelation', 'tall'),

    Rule('$RevRelation', '$RevAreaRelation', 'area'),
    Rule('$RevAreaRelation', 'area'),
    Rule('$RevAreaRelation', 'big'),
    Rule('$RevAreaRelation', 'large'),
    Rule('$RevAreaRelation', 'size'),

    Rule('$RevRelation', '$RevPopulationRelation', 'population'),
    Rule('$RevPopulationRelation', 'big'),
    Rule('$RevPopulationRelation', 'large'),
    Rule('$RevPopulationRelation', 'populated'),
    Rule('$RevPopulationRelation', 'population'),
    Rule('$RevPopulationRelation', 'populations'),
    Rule('$RevPopulationRelation', 'populous'),
    Rule('$RevPopulationRelation', 'size'),

    Rule('$RevRelation', '$RevLengthRelation', 'length'),
    Rule('$RevLengthRelation', 'length'),
    Rule('$RevLengthRelation', 'long'),
]


# We should now be able to parse `"what is the capital of vermont"`. Let's see:

# In[20]:

rules = rules_optionals + rules_collection_entity + rules_types + rules_relations
grammar = Grammar(rules=rules, annotators=annotators)
parses = grammar.parse_input('what is the capital of vermont ?')
for parse in parses[:1]:
    print('\n'.join([str(parse.semantics), str(executor.execute(parse.semantics))]))


# Montpelier! I always forget that one.
#
# OK, let's evaluate our progress on the Geo880 training data.

# In[21]:

model = Model(grammar=grammar, executor=executor.execute)
sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)


# Hot diggity, it's working. Denotation oracle accuracy is over 12%, double digits. We have 75 wins, and they're what we expect: queries that simply combine a relation and an entity (or collection).

# ### Intersections

# Many queries intersect two collections, as in `"states bordering california"`, which denotes the intersection of the states with the things that border california. We'll handle these by allowing two `$Collection`s, possibly separated by one or two optional words, to combine via `.and`.

# In[22]:

rules_intersection = [
    Rule('$Collection', '$Collection $Collection',
         lambda sems: ('.and', sems[0], sems[1])),
    Rule('$Collection', '$Collection $Optional $Collection',
         lambda sems: ('.and', sems[0], sems[2])),
    Rule('$Collection', '$Collection $Optional $Optional $Collection',
         lambda sems: ('.and', sems[0], sems[3])),
]


# In[23]:

rules = rules_optionals + rules_collection_entity + rules_types + rules_relations + rules_intersection
grammar = Grammar(rules=rules, annotators=annotators)
parses = grammar.parse_input('states bordering california')
for parse in parses[:1]:
    print('\n'.join([str(parse.semantics), str(executor.execute(parse.semantics))]))


# Let's evaluate the impact on the Geo880 training examples.

# In[24]:

model = Model(grammar=grammar, executor=executor.execute)
sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)


# Great, denotation oracle accuracy has more than doubled, from 12% to 28%. And the wins now include intersections like `"which states border new york"`. The losses, however, are clearly dominated by one category of error.

# ### Superlatives

# Many of the losses involve superlatives, such as `"biggest"` or `"shortest"`. Let's remedy that. As usual, we let the training examples guide us in adding lexical rules.
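# One rough way to let the data do the guiding, reusing the `counts` list of (count, word) pairs we built earlier: list the superlative-looking words that actually occur in the training questions. (This is just a heuristic sketch; matching on "-est" will also pick up a few non-superlatives like "west".)

# In[ ]:

# Sketch: superlative-looking words ("-est") in the training questions.
print([(word, count) for count, word in counts if word.endswith('est')])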
# In[25]:

rules_superlatives = [
    Rule('$Collection', '$Superlative ?$Optionals $Collection', lambda sems: sems[0] + (sems[2],)),
    Rule('$Collection', '$Collection ?$Optionals $Superlative', lambda sems: sems[2] + (sems[0],)),

    Rule('$Superlative', 'largest', ('.argmax', 'area')),
    Rule('$Superlative', 'largest', ('.argmax', 'population')),
    Rule('$Superlative', 'biggest', ('.argmax', 'area')),
    Rule('$Superlative', 'biggest', ('.argmax', 'population')),
    Rule('$Superlative', 'smallest', ('.argmin', 'area')),
    Rule('$Superlative', 'smallest', ('.argmin', 'population')),
    Rule('$Superlative', 'longest', ('.argmax', 'length')),
    Rule('$Superlative', 'shortest', ('.argmin', 'length')),
    Rule('$Superlative', 'tallest', ('.argmax', 'height')),
    Rule('$Superlative', 'highest', ('.argmax', 'height')),

    Rule('$Superlative', '$MostLeast $RevRelation', lambda sems: (sems[0], sems[1])),
    Rule('$MostLeast', 'most', '.argmax'),
    Rule('$MostLeast', 'least', '.argmin'),
    Rule('$MostLeast', 'lowest', '.argmin'),
    Rule('$MostLeast', 'greatest', '.argmax'),
    Rule('$MostLeast', 'highest', '.argmax'),
]


# Now we should be able to parse `"tallest mountain"`:

# In[26]:

rules = rules_optionals + rules_collection_entity + rules_types + rules_relations + rules_intersection + rules_superlatives
grammar = Grammar(rules=rules, annotators=annotators)
parses = grammar.parse_input('tallest mountain')
for parse in parses[:1]:
    print('\n'.join([str(parse.semantics), str(executor.execute(parse.semantics))]))


# Let's evaluate the impact on the Geo880 training examples.

# In[27]:

model = Model(grammar=grammar, executor=executor.execute)
sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)


# Wow, superlatives make a big difference. Denotation oracle accuracy has surged from 28% to 42%.

# ### Reverse joins

# Our join rule requires the `$Relation` to precede the `$Collection`. But in queries like `"which states does the rio grande cross"`, the relation follows its argument. To handle these, we'll allow a join in the opposite order, reversing the orientation of the relation's semantics.

# In[28]:

def reverse(relation_sem):
    """Reverse the orientation of a relation's semantic function.

    relation_sem is a lambda function which takes an arg and forms a pair,
    either (rel, arg) or (arg, rel). We want to swap the order of the pair.
    """
    def apply_and_swap(arg):
        pair = relation_sem(arg)
        return (pair[1], pair[0])
    return apply_and_swap

rules_reverse_joins = [
    Rule('$Collection', '$Collection ?$Optionals $Relation',
         lambda sems: reverse(sems[2])(sems[0])),
]


# In[29]:

rules = rules_optionals + rules_collection_entity + rules_types + rules_relations + rules_intersection + rules_superlatives + rules_reverse_joins
grammar = Grammar(rules=rules, annotators=annotators)
parses = grammar.parse_input('which states does the rio grande cross')
for parse in parses[:1]:
    print('\n'.join([str(parse.semantics), str(executor.execute(parse.semantics))]))


# Let's evaluate the impact on the Geo880 training examples.

# In[30]:

model = Model(grammar=grammar, executor=executor.execute)
sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)


# This time the gain in denotation oracle accuracy was more modest, from 42% to 47%. Still, we are making good progress. However, note that a substantial gap has opened between accuracy and oracle accuracy. This indicates that we could benefit from adding a scoring model.

# ## Feature engineering

# Through an iterative process of grammar engineering, we've managed to increase denotation oracle accuracy to 47%. But we've been ignoring denotation accuracy, which now lags far behind, at 25%. This represents an opportunity.
#
# In order to figure out how best to fix the problem, we need to do some error analysis. Let's look for some specific examples where denotation accuracy is 0, even though denotation oracle accuracy is 1. In other words, let's look for some examples where we have a correct parse, but it's not ranked at the top. We should be able to find some cases like that among the first ten examples of the Geo880 training data.

# In[31]:

from experiment import evaluate_model
from metrics import denotation_match_metrics

evaluate_model(model=model,
               examples=geo880_train_examples[:10],
               metrics=denotation_match_metrics(),
               print_examples=True)


# Take a look through that output. Over the ten examples, we achieved denotation oracle accuracy of 60%, but denotation accuracy of just 40%. In other words, there were two examples where we generated a correct parse, but failed to rank it at the top. Take a closer look at those two cases.
#
# The first case is `"what state has the shortest river ?"`. The top parse has semantics `('.and', 'state', ('.argmin', 'length', 'river'))`, which means something like "states that are the shortest river". That's not right. In fact, there's no such thing: the denotation is empty.
#
# The second case is `"what is the highest mountain in alaska ?"`. The top parse has semantics `('.argmax', 'height', ('.and', 'mountain', '/state/alaska'))`, which means "the highest mountain which is alaska". Again, there's no such thing: the denotation is empty.
#
# So in both of the cases where we put the wrong parse at the top, the top parse had nonsensical semantics with an empty denotation. In fact, if you scroll through the output above, you will see that there are a _lot_ of candidate parses with empty denotations. Seems like we could make a big improvement just by downweighting parses with empty denotations. This is easy to do.

# In[32]:

def empty_denotation_feature(parse):
    features = defaultdict(float)
    if parse.denotation == ():
        features['empty_denotation'] += 1.0
    return features

weights = {'empty_denotation': -1.0}

model = Model(grammar=grammar,
              feature_fn=empty_denotation_feature,
              weights=weights,
              executor=executor.execute)


# Let's evaluate the impact of using our new `empty_denotation` feature on the Geo880 training examples.

# In[33]:

from experiment import evaluate_model
from metrics import denotation_match_metrics

evaluate_model(model=model,
               examples=geo880_train_examples,
               metrics=denotation_match_metrics(),
               print_examples=False)


# Great! Using the `empty_denotation` feature has enabled us to increase denotation accuracy from 25% to 39%. That's a big gain! In fact, we've closed most of the gap between accuracy and oracle accuracy. As a result, the headroom for further gains from feature engineering is limited — but it's not zero. The [exercises](#geoquery-exercises) will ask you to push further.

# ## Exercises

# Several of these exercises ask you to measure the impact of your change on key evaluation metrics. Part of your job is to decide which evaluation metrics are most relevant for the change you're making. It's probably best to evaluate only on training data, in order to keep the test data unseen during development. (But the test data is hardly a state secret, so whatever.)
#
# ### Straightforward
#
# 1. Extend the grammar to handle queries which use the phrase "how many people" to ask about population. There are many examples in the Geo880 training data. Measure the impact of this change on key evaluation metrics.
#
# 1. Extend the grammar to handle counting questions, indicated by the phrases "how many" or "number of", as in "how many states does missouri border". Measure the impact of this change on key evaluation metrics.
# 1. Several examples in the Geo880 training dataset fail to parse because they refer to locations using names that, while valid, are not recognized by the `GeobaseAnnotator`. For example, some queries use "america" to refer to `/country/usa`, but `GeobaseAnnotator` recognizes only "usa". Find unannotated location references in the Geo880 training dataset, and extend the grammar to handle them. Measure the impact of this change on key evaluation metrics.
#
# 1. Extend the grammar to handle examples of the form "where is X", such as "where is mount whitney" or "where is san diego". Measure the impact of this change on key evaluation metrics.
#
# 1. Quite a few examples in the Geo880 training dataset specify a set of entities by name, as in "how many states have a city named springfield" or "how many rivers are called colorado". Extend the grammar to handle these, leveraging the `TokenAnnotator` introduced in [Unit 2][]. Measure the impact of this change on key evaluation metrics.
#
# 1. Extend the grammar to handle phrases like "austin texas" or "atlanta ga", where two entity names appear in sequence. Make sure that "atlanta ga" has semantics and a denotation, whereas "atlanta tx" has semantics but no denotation. Measure the impact of this change on key evaluation metrics.
#
# 1. In [Unit 2][], while examining the travel domain, we saw big gains from including rule features in our feature representation. Experiment with adding rule features to the GeoQuery model. Are the learned weights intuitive? Do the rule features help? If so, identify a few specific examples which are fixed by the inclusion of rule features. Measure the impact on key evaluation metrics.
#
# 1. Find an example where using the `empty_denotation` feature causes a loss (a bad prediction).
#
# 1. The `empty_denotation` feature doesn't help on count questions (e.g., "how many states ..."), where all parses, good or bad, typically yield a denotation which is a single number. Add a new feature which helps in such cases. Identify a few specific examples which are fixed by the new feature. Measure the impact on key evaluation metrics. [This exercise assumes you have already done the exercise on handling count questions.]
#
# ### Challenging
#
# 1. Extend the grammar to handle comparisons, such as "mountains with height greater than 5000" or "mountains with height greater than the height of bona".
#
# 1. Building on the previous exercise, extend the grammar to handle even those comparisons which involve ellipsis, such as "mountains higher than \[the height of\] mt. katahdin" or "rivers longer than \[the length of\] the colorado river" (where the bracketed phrase does not appear in the surface form!).
#
# 1. Extend the grammar to handle queries involving units, such as "what is the area of maryland in square kilometers" or "how long is the mississippi river in miles".
#
# 1. The Geo880 training dataset contains many examples involving population density. Extend the grammar to handle these examples. Measure the impact on key evaluation metrics.
#
# 1. Several examples in the Geo880 training dataset involve some form of negation, expressed by the words "not", "no", or "excluding". Extend the grammar to handle these examples. Measure the impact on key evaluation metrics.
#
# 1. The current grammar handles "capital of texas", but not "texas capital". It handles "has austin capital", but not "has capital austin". In general, it defines every phrase which expresses a relation as either a `$FwdRelation` or a `$RevRelation`, and constrains word order accordingly. Extend the grammar to allow any phrase which is ordinarily a `$FwdRelation` to function as a `$RevRelation`, and vice versa. Can you now handle "texas capital" and "has capital austin"? Measure the impact on key evaluation metrics.
#
# 1. What if we permit any word to be optionalized by adding the rule `Rule('$Optional', '$Token')` to our grammar? (Recall that `$Token` is produced by the `TokenAnnotator`.) Measure the impact of this change on key evaluation metrics. You will likely find that the change has some negative effects and some positive effects. Is there a way to mitigate the negative effects, while preserving the positive effects?
#
# 1. The success of the `empty_denotation` feature demonstrates the potential of denotation features. Can we go further? Experiment with features that characterize the *size* of the denotation (that is, the number of answers). Are two answers better than one? Are ten answers better than two? If you find some features that seem to work, identify a few specific examples which are fixed by your new features. Measure the impact on key evaluation metrics.
#
# [Unit 2]: ./sippycup-unit-2.ipynb

# Copyright (C) 2015 Bill MacCartney