# SippyCup Unit 3: Geography queries

Bill MacCartney
Spring 2015

This is Unit 3 of the SippyCup codelab.

Our third case study will examine the domain of geography queries. In particular, we'll focus on the Geo880 corpus, which contains 880 queries about U.S. geography. Examples include:

"which states border texas?"
"how many states border the largest state?"
"what is the size of the capital of texas?"



The Geo880 queries have a quite different character from the arithmetic queries and travel queries we have examined previously. They differ from the arithmetic queries in using a large vocabulary, and in exhibiting greater degrees of both lexical and syntactic ambiguity. They differ from the travel queries in adhering to conventional rules for spelling and syntax, and in having semantics with arbitrarily complex compositional structure. For example:

"what rivers flow through states that border the state with the largest population?"
"what is the population of the capital of the largest state through which the mississippi runs?"
"what is the longest river that passes the states that border the state that borders the most states?"



Geo880 was developed in Ray Mooney's group at UT Austin. It is of particular interest because it has for many years served as a standard evaluation for semantic parsing systems. (See, for example, Zelle & Mooney 1996, Tang & Mooney 2001, Zettlemoyer & Collins 2005, and Liang et al. 2011.) It has thereby become, for many, a paradigmatic application of semantic parsing. It has also served as a bridge between an older current of research on natural language interfaces to databases (NLIDBs) (see Androutsopoulos et al. 1995) and the modern era of semantic parsing.

The domain of geography queries is also of interest because there are many plausible real-world applications for semantic parsing which similarly involve complex compositional queries against a richly structured knowledge base. For example, some people are passionate about baseball statistics, and might want to ask queries like:

"pitchers who have struck out four batters in one inning"
"players who have stolen at least 100 bases in a season"
"complete games with fewer than 90 pitches"
"most home runs hit in one game"



Environmental advocates and policymakers might have queries like:

"which country has the highest co2 emissions"
"what five countries have the highest per capita co2 emissions"
"what country's co2 emissions increased the most over the last five years"
"what fraction of co2 emissions was from european countries in 2010"



Techniques that work in the geography domain are likely to work in these other domains too.

## The Geo880 dataset¶

I've been told that the Geo880 queries were collected from students in classes taught by Ray Mooney at UT Austin. I'm not sure whether I've got the story right. But this account is consistent with one of the notable limitations of the dataset: it is not a natural distribution, and not a realistic representation of the geography queries that people actually ask on, say, Google. Nobody ever asks, "what is the longest river that passes the states that border the state that borders the most states?" Nobody. Ever.

The dataset was published online by Rohit Jaivant Kate in a Prolog file containing semantic representations in Prolog style. It was later republished by Yuk Wah Wong as an XML file containing additional metadata for each example, including translations into Spanish, Japanese, and Turkish; syntactic parse trees; and semantics in two different representations: Prolog and FunQL.

In SippyCup, we're not going to use either Prolog or FunQL semantics. Instead, we'll use examples which have been annotated only with denotations (which were provided by Percy Liang — thanks!). Of course, our grammar will require a semantic representation, even if our examples are not annotated with semantics. We will introduce one below.

The Geo880 dataset is conventionally divided into 600 training examples and 280 test examples. In SippyCup, the dataset can in found in geo880.py. Let's take a peek.

In [1]:
from geo880 import geo880_train_examples, geo880_test_examples

print('train examples:', len(geo880_train_examples))
print('test examples: ', len(geo880_test_examples))
print(geo880_train_examples[0])
print(geo880_test_examples[0])

train examples: 600
test examples:  280
Example(input='what is the highest point in florida ?', denotation=('/place/walton_county',))
Example(input='which state is the smallest ?', denotation=('/state/district_of_columbia',))


## The Geobase knowledge base¶

Geobase is a small knowledge base about the geography of the United States. It contains (almost) all the information needed to answer queries in the Geo880 dataset, including facts about:

• states: capital, area, population, major cities, neighboring states, highest and lowest points and elevations
• cities: containing state and population
• rivers: length and states traversed
• mountains: containing state and height
• lakes: area, states traversed

SippyCup contains a class called GeobaseReader (in geobase.py) which facilitates working with Geobase in Python. It reads and parses the Geobase Prolog file, and creates a set of tuples representing its content. Let's take a look.

In [2]:
from geobase import GeobaseReader

unaries = [str(t) for t in reader.tuples if len(t) == 2]
print('\nSome unaries:\n  ' + '\n  '.join(unaries[:10]))
binaries = [str(t) for t in reader.tuples if len(t) == 3]
print('\nSome binaries:\n  ' + '\n  '.join(binaries[:10]))

GeobaseReader read 51 state rows.

Some unaries:
('city', '/city/providence_ri')
('city', '/city/ewa_hi')
('state', '/state/district_of_columbia')
('city', '/city/fort_wayne_in')
('city', '/city/concord_ca')
('place', '/place/little_river')
('city', '/city/high_point_nc')
('city', '/city/el_paso_tx')
('city', '/city/santa_monica_ca')

Some binaries:
('contains', '/state/arizona', '/city/tempe_az')
('contains', '/state/texas', '/city/midland_tx')
('contains', '/country/usa', '/state/indiana')
('contains', '/country/usa', '/city/stockton_ca')
('contains', '/country/usa', '/city/utica_ny')
('contains', '/country/usa', '/city/augusta_me')
('contains', '/country/usa', '/state/connecticut')
('abbreviation', '/state/alabama', 'al')


Some observations here:

• Unaries are pairs consisting of a unary predicate (a type) and an entity.
• Binaries are triples consisting of binary predicate (a relation) and two entities (or an entity and a numeric or string value).
• Entities are named by unique identifiers of the form /type/name. This is a GeobaseReader convention; these identifiers are not used in the original Prolog file.
• Some entities have the generic type place because they occur in the Prolog file only as the highest or lowest point in a state, and it's hard to reliably assign such points to one of the more specific types.
• The original Prolog file is inconsistent about units. For example, the area of states is expressed in square miles, but the area of lakes is expressed in square kilometers. GeobaseReader converts everything to SI units: meters and square meters.

## Semantic representation ¶

GeobaseReader merely reads the data in Geobase into a set of tuples. It doesn't provide any facility for querying that data. That's where GraphKB and GraphKBExecutor come in. GraphKB is a graph-structured knowledge base, with indexing for fast lookups. GraphKBExecutor defines a representation for formal queries against that knowledge base, and supports query execution. The formal query language defined by GraphKBExecutor will serve as our semantic representation for the geography domain.

### The GraphKB class¶

A GraphKB is a generic graph-structured knowledge base, or equivalently, a set of relational pairs and triples, with indexing for fast lookups. It represents a knowledge base as set of tuples, each either:

• a pair, consisting of a unary relation and an element which belongs to it, or
• a triple consisting of a binary relation and a pair of elements which belong to it.

For example, we can construct a GraphKB representing facts about The Simpsons:

In [3]:
from graph_kb import GraphKB

simpsons_tuples = [

# unaries
('male', 'homer'),
('female', 'marge'),
('male', 'bart'),
('female', 'lisa'),
('female', 'maggie'),
('child', 'bart'),
('child', 'lisa'),
('child', 'maggie'),

# binaries
('has_age', 'homer', 36),
('has_age', 'marge', 34),
('has_age', 'bart', 10),
('has_age', 'lisa', 8),
('has_age', 'maggie', 1),
('has_brother', 'lisa', 'bart'),
('has_brother', 'maggie', 'bart'),
('has_sister', 'bart', 'maggie'),
('has_sister', 'bart', 'lisa'),
('has_sister', 'lisa', 'maggie'),
('has_sister', 'maggie', 'lisa'),
('has_father', 'bart', 'homer'),
('has_father', 'lisa', 'homer'),
('has_father', 'maggie', 'homer'),
('has_mother', 'bart', 'marge'),
('has_mother', 'lisa', 'marge'),
('has_mother', 'maggie', 'marge'),
]

simpsons_kb = GraphKB(simpsons_tuples)


The GraphKB object now contains three indexes:

• unaries[U]: all entities belonging to unary relation U
• binaries_fwd[B][E]: all entities X such that (E, X) belongs to binary relation B
• binaries_rev[B][E]: all entities X such that (X, E) belongs to binary relation B

For example:

In [4]:
simpsons_kb.unaries['child']

Out[4]:
{'bart', 'lisa', 'maggie'}
In [5]:
simpsons_kb.binaries_fwd['has_sister']['lisa']

Out[5]:
{'maggie'}
In [6]:
simpsons_kb.binaries_rev['has_sister']['lisa']

Out[6]:
{'bart', 'maggie'}

### The GraphKBExecutor class¶

A GraphKBExecutor executes formal queries against a GraphKB and returns their denotations. Queries are represented by Python tuples, and can be nested. Denotations are also represented by Python tuples, but are conceptually sets (possibly empty). The elements of these tuples are always sorted in canonical order, so that they can be reliably compared for set equality.

The query language defined by GraphKBExecutor is perhaps most easily explained by example:

In [7]:
queries = [
'bart',
'male',
('has_sister', 'lisa'),  # who has sister lisa?
('lisa', 'has_sister'),  # lisa has sister who, i.e., who is a sister of lisa?
('lisa', 'has_brother'), # lisa has brother who, i.e., who is a brother of lisa?
('.and', 'male', 'child'),
('.not', 'child'),
('.any',),               # anything
('.any', 'has_sister'),  # anything has sister who, i.e., who is a sister of anything?
('.and', 'child', ('.not', ('.any', 'has_sister'))),
('.count', ('bart', 'has_sister')),
('has_age', ('.gt', 21)),
('has_age', ('.lt', 2)),
('has_age', ('.eq', 10)),
('.max', 'has_age', 'female'),
('.min', 'has_age', ('bart', 'has_sister')),
('.max', 'has_age', '.any'),
('.argmax', 'has_age', 'female'),
('.argmin', 'has_age', ('bart', 'has_sister')),
('.argmax', 'has_age', '.any'),
]

executor = simpsons_kb.executor()
for query in queries:
print()
print('Q  ', query)
print('D  ', executor.execute(query))

Q   bart
D   ('bart',)

Q   male
D   ('bart', 'homer')

Q   ('has_sister', 'lisa')
D   ('bart', 'maggie')

Q   ('lisa', 'has_sister')
D   ('maggie',)

Q   ('lisa', 'has_brother')
D   ('bart',)

Q   ('.and', 'male', 'child')
D   ('bart',)

D   ('bart', 'homer', 'marge')

Q   ('.not', 'child')
D   (1, 10, 34, 36, 8, 'homer', 'marge')

Q   ('.any',)
D   (1, 10, 34, 36, 8, 'bart', 'homer', 'lisa', 'maggie', 'marge')

Q   ('.any', 'has_sister')
D   ('lisa', 'maggie')

Q   ('.and', 'child', ('.not', ('.any', 'has_sister')))
D   ('bart',)

Q   ('.count', ('bart', 'has_sister'))
D   (2,)

Q   ('has_age', ('.gt', 21))
D   ('homer', 'marge')

Q   ('has_age', ('.lt', 2))
D   ('maggie',)

Q   ('has_age', ('.eq', 10))
D   ('bart',)

Q   ('.max', 'has_age', 'female')
D   (34,)

Q   ('.min', 'has_age', ('bart', 'has_sister'))
D   (1,)

Q   ('.max', 'has_age', '.any')
D   (36,)

Q   ('.argmax', 'has_age', 'female')
D   ('marge',)

Q   ('.argmin', 'has_age', ('bart', 'has_sister'))
D   ('maggie',)

Q   ('.argmax', 'has_age', '.any')
D   ('homer',)


Note that the query (R E) denotes entities having relation R to entity E, whereas the query (E R) denotes entities to which entity E has relation R.

For a more detailed understanding of the style of semantic representation defined by GraphKBExecutor, take a look at the source code.

### Using GraphKBExecutor with Geobase¶

In [8]:
geobase = GraphKB(reader.tuples)
executor = geobase.executor()
queries = [
('/state/texas', 'capital'), # capital of texas
('.and', 'river', ('traverses', '/state/utah')), # rivers that traverse utah
('.argmax', 'height', 'mountain'), # tallest mountain
]
for query in queries:
print()
print(query)
print(executor.execute(query))

('/state/texas', 'capital')
('/city/austin_tx',)

('.and', 'river', ('traverses', '/state/utah'))

('.argmax', 'height', 'mountain')
('/mountain/mckinley',)


## Grammar engineering¶

It's time to start developing a grammar for the geography domain. As in Unit 2, the performance metric we'll focus on during grammar engineering is oracle accuracy (the proportion of examples for which any parse is correct), not accuracy (the proportion of examples for which the first parse is correct). Remember that oracle accuracy is an upper bound on accuracy, and is a measure of the expressive power of the grammar: does it have the rules it needs to generate the correct parse? The gap between oracle accuracy and accuracy, on the other hand, reflects the ability of the scoring model to bring the correct parse to the top of the candidate list.

As always, we're going to take a data-driven approach to grammar engineering. We want to introduce rules which will enable us to handle the lexical items and syntactic structures that we actually observe in the Geo880 training data. To that end, let's count the words that appear among the 600 training examples. (We do not examine the test data!)

In [9]:
from collections import defaultdict
from operator import itemgetter
from geo880 import geo880_train_examples

words = [word for example in geo880_train_examples for word in example.input.split()]
counts = defaultdict(int)
for word in words:
counts[word] += 1
counts = sorted([(count, word) for word, count in counts.items()], reverse=True)
print('There were %d tokens of %d types:\n' % (len(words), len(counts)))
print(', '.join(['%s (%d)' % (word, count) for count, word in counts[:50]] + ['...']))

There were 5109 tokens of 248 types:

the (618), ? (600), what (386), is (283), in (247), state (170), states (169), of (150), how (111), many (90), are (90), population (86), river (76), which (75), through (73), largest (67), border (65), texas (63), rivers (62), cities (56), point (55), highest (54), city (49), that (48), capital (48), with (44), has (43), major (42), run (34), smallest (31), people (29), mississippi (29), us (28), does (27), usa (26), most (26), longest (25), area (24), lowest (23), have (23), density (23), colorado (23), borders (23), new (22), live (22), where (17), runs (17), biggest (16), california (14), alaska (14), ...


There are at least four major categories of words here:

• Words that refer to entities, such as "texas", "mississippi", "usa", and "austin".
• Words that refer to types, such as "state", "river", and "cities".
• Words that refer to relations, such as "in", "borders", "capital", and "long".
• Other function words, such as "the", "what", "how", and "are".

One might make finer distinctions, but this seems like a reasonable starting point. Note that these categories do not always correspond to traditional syntactic categories. While the entities are typically proper nouns, and the types are typically common nouns, the relations include prepositions, verbs, nouns, and adjectives.

The design of our grammar will roughly follow this schema. The major categories will include $Entity, $Type, $Collection, $Relation, and $Optional. ### Optionals¶ In Unit 2, our grammar engineering process didn't really start cooking until we introduced optionals. This time around, let's begin with the optionals. We'll define as $Optional every word in the Geo880 training data which does not plainly refer to an entity, type, or relation. And we'll let any query be preceded or followed by a sequence of one or more $Optionals. In [10]: from parsing import Grammar, Rule optional_words = [ 'the', '?', 'what', 'is', 'in', 'of', 'how', 'many', 'are', 'which', 'that', 'with', 'has', 'major', 'does', 'have', 'where', 'me', 'there', 'give', 'name', 'all', 'a', 'by', 'you', 'to', 'tell', 'other', 'it', 'do', 'whose', 'show', 'one', 'on', 'for', 'can', 'whats', 'urban', 'them', 'list', 'exist', 'each', 'could', 'about' ] rules_optionals = [ Rule('$ROOT', '?$Optionals$Query ?$Optionals', lambda sems: sems[1]), Rule('$Optionals', '$Optional ?$Optionals'),
] + [Rule('$Optional', word) for word in optional_words]  Because $Query has not yet been defined, we won't be able to parse anything yet.

### Entities and collections¶

Our grammar will need to be able to recognize names of entities, such as "utah". There are hundreds of entities in Geobase, and we don't want to have to introduce a grammar rule for each entity. Instead, we'll define a new annotator, GeobaseAnnotator, which simply annotates phrases which exactly match names in Geobase.

In [11]:
from annotator import Annotator, NumberAnnotator

class GeobaseAnnotator(Annotator):
def __init__(self, geobase):
self.geobase = geobase

def annotate(self, tokens):
phrase = ' '.join(tokens)
places = self.geobase.binaries_rev['name'][phrase]
return [('$Entity', place) for place in places]  Now a couple of rules that will enable us to parse inputs that simply name locations, such as "utah". (TODO: explain rationale for $Collection and $Query.) In [12]: rules_collection_entity = [ Rule('$Query', '$Collection', lambda sems: sems[0]), Rule('$Collection', '$Entity', lambda sems: sems[0]), ] rules = rules_optionals + rules_collection_entity  Now let's make a grammar. In [13]: annotators = [NumberAnnotator(), GeobaseAnnotator(geobase)] grammar = Grammar(rules=rules, annotators=annotators)  Created grammar with 48 rules  Let's try to parse some inputs which just name locations. In [14]: parses = grammar.parse_input('what is utah') for parse in parses[:1]: print('\n'.join([str(parse.semantics), str(executor.execute(parse.semantics))]))  /state/utah ('/state/utah',)  Great, it worked. Now let's run an evaluation on the Geo880 training examples. In [15]: from experiment import sample_wins_and_losses from geoquery import GeoQueryDomain from metrics import DenotationOracleAccuracyMetric from scoring import Model domain = GeoQueryDomain() model = Model(grammar=grammar, executor=executor.execute) metric = DenotationOracleAccuracyMetric() sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)  GeobaseReader read 51 state rows. GeobaseReader read 386 city rows. GeobaseReader read 46 river rows. GeobaseReader read 51 border rows. GeobaseReader read 51 highlow rows. GeobaseReader read 50 mountain rows. GeobaseReader read 40 road rows. GeobaseReader read 22 lake rows. GeobaseReader read 1 country row. GeobaseReader computed transitive closure of 'contains', adding 495 edges ================================================================================ Evaluating on 600 examples -------------------------------------------------------------------------------- Over 600 examples: denotation accuracy 0.000 denotation oracle accuracy 0.000 number of parses 0.028 spurious ambiguity 0.000 0 of 0 wins on denotation oracle accuracy: 10 of 600 losses on denotation oracle accuracy: how long is the shortest river in the usa ? name the states which have no surrounding states ? what is the capital of the state with the highest point ? what is the capital of utah ? what is the highest elevation in new mexico ? what is the highest point in the smallest state ? what is the population density of wyoming ? what is the total population of the states that border texas ? what state has the largest population ? what states does the mississippi run through ?  We don't yet have a single win: denotation oracle accuracy remains stuck at zero. However, the average number of parses is slightly greater than zero, meaning that there are a few examples which our grammar can parse (though not correctly). It would be interesting to know which examples. There's a utility function in experiment.py which will give you the visibility you need. See if you can figure out what to do. ### Types¶ (TODO: the words in the training data include lots of words for types. Let's write down some lexical rules defining the category $Type, guided as usual by the words we actually see in the training data. We'll also make $Type a kind of $Collection.)

In [16]:
rules_types = [
Rule('$Collection', '$Type', lambda sems: sems[0]),

Rule('$Type', 'state', 'state'), Rule('$Type', 'states', 'state'),
Rule('$Type', 'city', 'city'), Rule('$Type', 'cities', 'city'),
Rule('$Type', 'big cities', 'city'), Rule('$Type', 'towns', 'city'),
Rule('$Type', 'river', 'river'), Rule('$Type', 'rivers', 'river'),
Rule('$Type', 'mountain', 'mountain'), Rule('$Type', 'mountains', 'mountain'),
Rule('$Type', 'mount', 'mountain'), Rule('$Type', 'peak', 'mountain'),
Rule('$Type', 'road', 'road'), Rule('$Type', 'roads', 'road'),
Rule('$Type', 'lake', 'lake'), Rule('$Type', 'lakes', 'lake'),
Rule('$Type', 'country', 'country'), Rule('$Type', 'countries', 'country'),
]


We should now be able to parse inputs denoting types, such as "name the lakes":

In [17]:
rules = rules_optionals + rules_collection_entity + rules_types
grammar = Grammar(rules=rules, annotators=annotators)
parses = grammar.parse_input('name the lakes')
for parse in parses[:1]:
print('\n'.join([str(parse.semantics), str(executor.execute(parse.semantics))]))

Created grammar with 67 rules
lake
('/lake/becharof', '/lake/champlain', '/lake/erie', '/lake/flathead', '/lake/great_salt_lake', '/lake/huron', '/lake/iliamna', '/lake/lake_of_the_woods', '/lake/michigan', '/lake/mille_lacs', '/lake/naknek', '/lake/okeechobee', '/lake/ontario', '/lake/pontchartrain', '/lake/rainy', '/lake/red', '/lake/salton_sea', '/lake/st._clair', '/lake/superior', '/lake/tahoe', '/lake/teshekpuk', '/lake/winnebago')


It worked. Let's evaluate on the Geo880 training data again.

In [18]:
model = Model(grammar=grammar, executor=executor.execute)
sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)

================================================================================
Evaluating on 600 examples

--------------------------------------------------------------------------------
Over 600 examples:

denotation accuracy                0.003
denotation oracle accuracy         0.003
number of parses                   0.033
spurious ambiguity                 0.000

2 of 2 wins on denotation oracle accuracy:

list the states ?
what are the states ?

10 of 598 losses on denotation oracle accuracy:

give me the cities which are in texas ?
how many people are there in iowa ?
how many states have major rivers ?
name the rivers in arkansas ?
what river runs through virginia ?
what rivers are in utah ?
what state has the highest elevation ?
what states border rhode island ?
where is mount whitney located ?
which rivers run through the state with the largest city in the us ?



Liftoff! We have two wins, and denotation oracle accuracy is greater than zero! Just barely.

### Relations and joins¶

In order to really make this bird fly, we're going to have to handle relations. In particular, we'd like to be able to parse queries which combine a relation with an entity or collection, such as "what is the capital of vermont".

As usual, we'll adopt a data-driven approach. The training examples include lots of words and phrases which refer to relations, both "forward" relations (like "traverses") and "reverse" relations (like "traversed by"). Guided by the training data, we'll write lexical rules which define the categories $FwdRelation and $RevRelation. Then we'll add rules that allow either a $FwdRelation or a $RevRelation to be promoted to a generic $Relation, with semantic functions which ensure that the semantics are constructed with the proper orientation. Finally, we'll define a rule for joining a $Relation (such as "capital of") with a $Collection (such as "vermont") to yield another $Collection (such as "capital of vermont").

In [19]:
rules_relations = [
Rule('$Collection', '$Relation ?$Optionals$Collection', lambda sems: sems[0](sems[2])),

Rule('$Relation', '$FwdRelation', lambda sems: (lambda arg: (sems[0], arg))),
Rule('$Relation', '$RevRelation', lambda sems: (lambda arg: (arg, sems[0]))),

Rule('$FwdRelation', '$FwdBordersRelation', 'borders'),
Rule('$FwdBordersRelation', 'border'), Rule('$FwdBordersRelation', 'bordering'),
Rule('$FwdBordersRelation', 'borders'), Rule('$FwdBordersRelation', 'neighbor'),
Rule('$FwdBordersRelation', 'neighboring'), Rule('$FwdBordersRelation', 'surrounding'),
Rule('$FwdBordersRelation', 'next to'), Rule('$FwdRelation', '$FwdTraversesRelation', 'traverses'), Rule('$FwdTraversesRelation', 'cross ?over'),
Rule('$FwdTraversesRelation', 'flow through'), Rule('$FwdTraversesRelation', 'flowing through'),
Rule('$FwdTraversesRelation', 'flows through'), Rule('$FwdTraversesRelation', 'go through'),
Rule('$FwdTraversesRelation', 'goes through'), Rule('$FwdTraversesRelation', 'in'),
Rule('$FwdTraversesRelation', 'pass through'), Rule('$FwdTraversesRelation', 'passes through'),
Rule('$FwdTraversesRelation', 'run through'), Rule('$FwdTraversesRelation', 'running through'),
Rule('$FwdTraversesRelation', 'runs through'), Rule('$FwdTraversesRelation', 'traverse'),
Rule('$FwdTraversesRelation', 'traverses'), Rule('$RevRelation', '$RevTraversesRelation', 'traverses'), Rule('$RevTraversesRelation', 'has'),
Rule('$RevTraversesRelation', 'have'), # 'how many states have major rivers' Rule('$RevTraversesRelation', 'lie on'),
Rule('$RevTraversesRelation', 'next to'), Rule('$RevTraversesRelation', 'traversed by'),
Rule('$RevTraversesRelation', 'washed by'), Rule('$FwdRelation', '$FwdContainsRelation', 'contains'), # 'how many states have a city named springfield' Rule('$FwdContainsRelation', 'has'),
Rule('$FwdContainsRelation', 'have'), Rule('$RevRelation', '$RevContainsRelation', 'contains'), Rule('$RevContainsRelation', 'contained by'),
Rule('$RevContainsRelation', 'in'), Rule('$RevContainsRelation', 'found in'),
Rule('$RevContainsRelation', 'located in'), Rule('$RevContainsRelation', 'of'),

Rule('$RevRelation', '$RevCapitalRelation', 'capital'),
Rule('$RevCapitalRelation', 'capital'), Rule('$RevCapitalRelation', 'capitals'),

Rule('$RevRelation', '$RevHighestPointRelation', 'highest_point'),
Rule('$RevHighestPointRelation', 'high point'), Rule('$RevHighestPointRelation', 'high points'),
Rule('$RevHighestPointRelation', 'highest point'), Rule('$RevHighestPointRelation', 'highest points'),

Rule('$RevRelation', '$RevLowestPointRelation', 'lowest_point'),
Rule('$RevLowestPointRelation', 'low point'), Rule('$RevLowestPointRelation', 'low points'),
Rule('$RevLowestPointRelation', 'lowest point'), Rule('$RevLowestPointRelation', 'lowest points'),
Rule('$RevLowestPointRelation', 'lowest spot'), Rule('$RevRelation', '$RevHighestElevationRelation', 'highest_elevation'), Rule('$RevHighestElevationRelation', '?highest elevation'),

Rule('$RevRelation', '$RevHeightRelation', 'height'),
Rule('$RevHeightRelation', 'elevation'), Rule('$RevHeightRelation', 'height'),
Rule('$RevHeightRelation', 'high'), Rule('$RevHeightRelation', 'tall'),

Rule('$RevRelation', '$RevAreaRelation', 'area'),
Rule('$RevAreaRelation', 'area'), Rule('$RevAreaRelation', 'big'),
Rule('$RevAreaRelation', 'large'), Rule('$RevAreaRelation', 'size'),

Rule('$RevRelation', '$RevPopulationRelation', 'population'),
Rule('$RevPopulationRelation', 'big'), Rule('$RevPopulationRelation', 'large'),
Rule('$RevPopulationRelation', 'populated'), Rule('$RevPopulationRelation', 'population'),
Rule('$RevPopulationRelation', 'populations'), Rule('$RevPopulationRelation', 'populous'),
Rule('$RevPopulationRelation', 'size'), Rule('$RevRelation', '$RevLengthRelation', 'length'), Rule('$RevLengthRelation', 'length'),
Rule('$RevLengthRelation', 'long'), ]  We should now be able to parse "what is the capital of vermont". Let's see: In [20]: rules = rules_optionals + rules_collection_entity + rules_types + rules_relations grammar = Grammar(rules=rules, annotators=annotators) parses = grammar.parse_input('what is the capital of vermont ?') for parse in parses[:1]: print('\n'.join([str(parse.semantics), str(executor.execute(parse.semantics))]))  Created grammar with 146 rules ('/state/vermont', 'capital') ('/city/montpelier_vt',)  Montpelier! I always forget that one. OK, let's evaluate our progress on the Geo880 training data. In [21]: model = Model(grammar=grammar, executor=executor.execute) sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)  ================================================================================ Evaluating on 600 examples -------------------------------------------------------------------------------- Over 600 examples: denotation accuracy 0.113 denotation oracle accuracy 0.125 number of parses 0.427 spurious ambiguity 0.000 5 of 75 wins on denotation oracle accuracy: what is the capital of utah ? what is the highest point in texas ? what is the population of idaho ? what is the population of rhode island ? what is the population of sacramento ? 10 of 525 losses on denotation oracle accuracy: give me the longest river that passes through the us ? how many cities are there in the us ? what are the major cities in new mexico ? what are the states through which the longest river runs ? what is the area of the state with the smallest population density ? what is the biggest city in usa ? what is the highest elevation in new mexico ? what is the largest state capital in population ? what river traverses the most states ? what states border hawaii ?  Hot diggity, it's working. Denotation oracle accuracy is over 12%, double digits. We have 75 wins, and they're what we expect: queries that simply combine a relation and an entity (or collection). ### Intersections¶ In [22]: rules_intersection = [ Rule('$Collection', '$Collection$Collection',
lambda sems: ('.and', sems[0], sems[1])),
Rule('$Collection', '$Collection $Optional$Collection',
lambda sems: ('.and', sems[0], sems[2])),
Rule('$Collection', '$Collection $Optional$Optional $Collection', lambda sems: ('.and', sems[0], sems[3])), ]  In [23]: rules = rules_optionals + rules_collection_entity + rules_types + rules_relations + rules_intersection grammar = Grammar(rules=rules, annotators=annotators) parses = grammar.parse_input('states bordering california') for parse in parses[:1]: print('\n'.join([str(parse.semantics), str(executor.execute(parse.semantics))]))  Created grammar with 149 rules ('.and', 'state', ('borders', '/state/california')) ('/state/arizona', '/state/nevada', '/state/oregon')  Let's evaluate the impact on the Geo880 training examples. In [24]: model = Model(grammar=grammar, executor=executor.execute) sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)  ================================================================================ Evaluating on 600 examples -------------------------------------------------------------------------------- Over 600 examples: denotation accuracy 0.198 denotation oracle accuracy 0.277 number of parses 1.962 spurious ambiguity 0.000 5 of 166 wins on denotation oracle accuracy: give me the cities in virginia ? how high is guadalupe peak ? what are the neighboring states for michigan ? what is the lowest point in louisiana ? what rivers run through west virginia ? 10 of 434 losses on denotation oracle accuracy: what is the largest state traversed by the mississippi river ? what is the longest river ? what is the longest river in pennsylvania ? what is the most populous state in the us ? what is the name of the state with the lowest point ? what is the population density of the state with the smallest area ? what is the shortest river in nebraska ? what rivers flow through states that alabama borders ? where is san jose ? which state is the largest city in montana in ?  Great, denotation oracle accuracy has more than doubled, from 12% to 28%. And the wins now include intersections like "which states border new york". The losses, however, are clearly dominated by one category of error. ### Superlatives¶ Many of the losses involve superlatives, such as "biggest" or "shortest". Let's remedy that. As usual, we let the training examples guide us in adding lexical rules. In [25]: rules_superlatives = [ Rule('$Collection', '$Superlative ?$Optionals $Collection', lambda sems: sems[0] + (sems[2],)), Rule('$Collection', '$Collection ?$Optionals $Superlative', lambda sems: sems[2] + (sems[0],)), Rule('$Superlative', 'largest', ('.argmax', 'area')),
Rule('$Superlative', 'largest', ('.argmax', 'population')), Rule('$Superlative', 'biggest', ('.argmax', 'area')),
Rule('$Superlative', 'biggest', ('.argmax', 'population')), Rule('$Superlative', 'smallest', ('.argmin', 'area')),
Rule('$Superlative', 'smallest', ('.argmin', 'population')), Rule('$Superlative', 'longest', ('.argmax', 'length')),
Rule('$Superlative', 'shortest', ('.argmin', 'length')), Rule('$Superlative', 'tallest', ('.argmax', 'height')),
Rule('$Superlative', 'highest', ('.argmax', 'height')), Rule('$Superlative', '$MostLeast$RevRelation', lambda sems: (sems[0], sems[1])),
Rule('$MostLeast', 'most', '.argmax'), Rule('$MostLeast', 'least', '.argmin'),
Rule('$MostLeast', 'lowest', '.argmin'), Rule('$MostLeast', 'greatest', '.argmax'),
Rule('$MostLeast', 'highest', '.argmax'), ]  Now we should be able to parse "tallest mountain": In [26]: rules = rules_optionals + rules_collection_entity + rules_types + rules_relations + rules_intersection + rules_superlatives grammar = Grammar(rules=rules, annotators=annotators) parses = grammar.parse_input('tallest mountain') for parse in parses[:1]: print('\n'.join([str(parse.semantics), str(executor.execute(parse.semantics))]))  Created grammar with 167 rules ('.argmax', 'height', 'mountain') ('/mountain/mckinley',)  Let's evaluate the impact on the Geo880 training examples. In [27]: model = Model(grammar=grammar, executor=executor.execute) sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)  ================================================================================ Evaluating on 600 examples -------------------------------------------------------------------------------- Over 600 examples: denotation accuracy 0.245 denotation oracle accuracy 0.422 number of parses 4.430 spurious ambiguity 0.000 5 of 253 wins on denotation oracle accuracy: what can you tell me about the population of missouri ? what is the population of texas ? what state borders michigan ? what state has the city with the most population ? where is the lowest spot in iowa ? 10 of 347 losses on denotation oracle accuracy: how many states are next to major rivers ? how many states have cities named austin ? how many states have cities or towns named springfield ? number of citizens in boulder ? of the states washed by the mississippi river which has the lowest point ? what is the highest point in the united states ? what is the longest river in the state with the highest point ? what is the population of springfield missouri ? what state has the largest urban population ? what state has the smallest population density ?  Wow, superlatives make a big difference. Denotation oracle accuracy has surged from 28% to 42%. ### Reverse joins¶ In [28]: def reverse(relation_sem): """TODO""" # relation_sem is a lambda function which takes an arg and forms a pair, # either (rel, arg) or (arg, rel). We want to swap the order of the pair. def apply_and_swap(arg): pair = relation_sem(arg) return (pair[1], pair[0]) return apply_and_swap rules_reverse_joins = [ Rule('$Collection', '$Collection ?$Optionals $Relation', lambda sems: reverse(sems[2])(sems[0])), ]  In [29]: rules = rules_optionals + rules_collection_entity + rules_types + rules_relations + rules_intersection + rules_superlatives + rules_reverse_joins grammar = Grammar(rules=rules, annotators=annotators) parses = grammar.parse_input('which states does the rio grande cross') for parse in parses[:1]: print('\n'.join([str(parse.semantics), str(executor.execute(parse.semantics))]))  Created grammar with 168 rules ('.and', 'state', ('/river/rio_grande', 'traverses')) ('/state/colorado', '/state/new_mexico', '/state/texas')  Let's evaluate the impact on the Geo880 training examples. In [30]: model = Model(grammar=grammar, executor=executor.execute) sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)  ================================================================================ Evaluating on 600 examples -------------------------------------------------------------------------------- Max cell capacity 1000 has been hit 1 times Max cell capacity 1000 has been hit 2 times Max cell capacity 1000 has been hit 4 times Max cell capacity 1000 has been hit 8 times Max cell capacity 1000 has been hit 16 times Max cell capacity 1000 has been hit 32 times Over 600 examples: denotation accuracy 0.257 denotation oracle accuracy 0.468 number of parses 13.210 spurious ambiguity 0.001 Max cell capacity 1000 has been hit 64 times 5 of 281 wins on denotation oracle accuracy: how long is the mississippi ? what is the biggest city in oregon ? what is the biggest city in wyoming ? what is the capital of washington ? what is the smallest city in the largest state ? 10 of 319 losses on denotation oracle accuracy: how many people live in the state with the largest population density ? how many rivers are found in colorado ? how many rivers are there in idaho ? name the major lakes in michigan ? what are the major cities in the smallest state in the us ? what is the tallest mountain in america ? what river is the longest one in the united states ? what rivers run through the states that border the state with the capital atlanta ? where is indianapolis ? which states border the longest river in the usa ?  This time the gain in denotation oracle accuracy was more modest, from 42% to 47%. Still, we are making good progress. However, note that a substantial gap has opened between accuracy and oracle accuracy. This indicates that we could benefit from adding a scoring model. ## Feature engineering¶ Through an iterative process of grammar engineering, we've managed to increase denotation oracle accuracy of 47%. But we've been ignoring denotation accuracy, which now lags far behind, at 25%. This represents an opportunity. In order to figure out how best to fix the problem, we need to do some error analysis. Let's look for some specific examples where denotation accuracy is 0, even though denotation oracle accuracy is 1. In other words, let's look for some examples where we have a correct parse, but it's not ranked at the top. We should be able to find some cases like that among the first ten examples of the Geo880 training data. In [31]: from experiment import evaluate_model from metrics import denotation_match_metrics evaluate_model(model=model, examples=geo880_train_examples[:10], metrics=denotation_match_metrics(), print_examples=True)  ================================================================================ Evaluating on 10 examples -------------------------------------------------------------------------------- input what is the highest point in florida ? target denotation ('/place/walton_county',) denotation accuracy 1 denotation oracle accuracy 1 number of parses 3 spurious ambiguity 0 0 0.000 semantics ('/state/florida', 'highest_point') denotation + ('/place/walton_county',) 1 0.000 semantics (('traverses', '/state/florida'), 'highest_point') denotation - () 2 0.000 semantics (('/state/florida', 'contains'), 'highest_point') denotation - () -------------------------------------------------------------------------------- input what are the high points of states surrounding mississippi ? target denotation ('/place/cheaha_mountain', '/place/clingmans_dome', '/place/driskill_mountain', '/place/magazine_mountain') denotation accuracy 1 denotation oracle accuracy 1 number of parses 28 spurious ambiguity 0 0 0.000 semantics (('.and', 'state', ('borders', '/state/mississippi')), 'highest_point') denotation + ('/place/cheaha_mountain', '/place/clingmans_dome', '/place/driskill_mountain', '/place/magazine_mountain') 1 0.000 semantics (('.and', 'state', ('borders', '/river/mississippi')), 'highest_point') denotation - () 2 0.000 semantics (('.and', ('state', 'borders'), '/state/mississippi'), 'highest_point') denotation - ('/place/woodall_mountain',) 3 0.000 semantics (('.and', ('state', 'borders'), '/river/mississippi'), 'highest_point') denotation - () 4 0.000 semantics ((('.and', 'state', ('borders', '/state/mississippi')), 'contains'), 'highest_point') denotation - () 5 0.000 semantics ((('.and', 'state', ('borders', '/river/mississippi')), 'contains'), 'highest_point') denotation - () 6 0.000 semantics ((('.and', ('state', 'borders'), '/state/mississippi'), 'contains'), 'highest_point') denotation - () 7 0.000 semantics ((('.and', ('state', 'borders'), '/river/mississippi'), 'contains'), 'highest_point') denotation - () 8 0.000 semantics (('.and', ('state', 'contains'), ('borders', '/state/mississippi')), 'highest_point') denotation - () 9 0.000 semantics (('.and', ('state', 'contains'), ('borders', '/river/mississippi')), 'highest_point') denotation - () 10 0.000 semantics (('.and', (('state', 'borders'), 'contains'), '/state/mississippi'), 'highest_point') denotation - () 11 0.000 semantics (('.and', (('state', 'borders'), 'contains'), '/river/mississippi'), 'highest_point') denotation - () 12 0.000 semantics (('.and', (('state', 'contains'), 'borders'), '/state/mississippi'), 'highest_point') denotation - () 13 0.000 semantics (('.and', (('state', 'contains'), 'borders'), '/river/mississippi'), 'highest_point') denotation - () 14 0.000 semantics ('.and', ('state', 'highest_point'), ('borders', '/state/mississippi')) denotation - () 15 0.000 semantics ('.and', ('state', 'highest_point'), ('borders', '/river/mississippi')) denotation - () 16 0.000 semantics ('.and', (('state', 'contains'), 'highest_point'), ('borders', '/state/mississippi')) denotation - () 17 0.000 semantics ('.and', (('state', 'contains'), 'highest_point'), ('borders', '/river/mississippi')) denotation - () 18 0.000 semantics ('.and', (('state', 'borders'), 'highest_point'), '/state/mississippi') denotation - () 19 0.000 semantics ('.and', (('state', 'borders'), 'highest_point'), '/river/mississippi') denotation - () 20 0.000 semantics ('.and', ((('state', 'borders'), 'contains'), 'highest_point'), '/state/mississippi') denotation - () 21 0.000 semantics ('.and', ((('state', 'borders'), 'contains'), 'highest_point'), '/river/mississippi') denotation - () 22 0.000 semantics ('.and', ((('state', 'contains'), 'borders'), 'highest_point'), '/state/mississippi') denotation - () 23 0.000 semantics ('.and', ((('state', 'contains'), 'borders'), 'highest_point'), '/river/mississippi') denotation - () 24 0.000 semantics ('.and', (('state', 'highest_point'), 'borders'), '/state/mississippi') denotation - () 25 0.000 semantics ('.and', (('state', 'highest_point'), 'borders'), '/river/mississippi') denotation - () 26 0.000 semantics ('.and', ((('state', 'contains'), 'highest_point'), 'borders'), '/state/mississippi') denotation - () 27 0.000 semantics ('.and', ((('state', 'contains'), 'highest_point'), 'borders'), '/river/mississippi') denotation - () -------------------------------------------------------------------------------- input what state has the shortest river ? target denotation ('/state/delaware', '/state/new_jersey', '/state/new_york', '/state/pennsylvania') denotation accuracy 0 denotation oracle accuracy 1 number of parses 8 spurious ambiguity 0 0 0.000 semantics ('.and', 'state', ('.argmin', 'length', 'river')) denotation - () 1 0.000 semantics ('.and', 'state', (('.argmin', 'length', 'river'), 'traverses')) denotation + ('/state/delaware', '/state/new_jersey', '/state/new_york', '/state/pennsylvania') 2 0.000 semantics ('.and', 'state', ('contains', ('.argmin', 'length', 'river'))) denotation - () 3 0.000 semantics ('.and', ('traverses', 'state'), ('.argmin', 'length', 'river')) denotation - ('/river/delaware',) 4 0.000 semantics ('.and', ('state', 'contains'), ('.argmin', 'length', 'river')) denotation - () 5 0.000 semantics ('.and', ('.argmin', 'length', 'state'), 'river') denotation - () 6 0.000 semantics ('.and', ('.argmin', 'length', ('traverses', 'state')), 'river') denotation - ('/river/delaware',) 7 0.000 semantics ('.and', ('.argmin', 'length', ('state', 'contains')), 'river') denotation - () -------------------------------------------------------------------------------- input what is the tallest mountain in the united states ? target denotation ('/mountain/mckinley',) denotation accuracy 0 denotation oracle accuracy 0 number of parses 0 spurious ambiguity 0 -------------------------------------------------------------------------------- input what is the capital of maine ? target denotation ('/city/augusta_me',) denotation accuracy 1 denotation oracle accuracy 1 number of parses 2 spurious ambiguity 0 0 0.000 semantics ('/state/maine', 'capital') denotation + ('/city/augusta_me',) 1 0.000 semantics (('/state/maine', 'contains'), 'capital') denotation - () -------------------------------------------------------------------------------- input what are the populations of states through which the mississippi river run ? target denotation (11400000, 2286000, 2364000, 2520000, 2913000, 4076000, 4206000, 4591000, 4700000, 4916000) denotation accuracy 0 denotation oracle accuracy 0 number of parses 0 spurious ambiguity 0 -------------------------------------------------------------------------------- input name all the lakes of us ? target denotation ('/lake/becharof', '/lake/champlain', '/lake/erie', '/lake/flathead', '/lake/great_salt_lake', '/lake/huron', '/lake/iliamna', '/lake/lake_of_the_woods', '/lake/michigan', '/lake/mille_lacs', '/lake/naknek', '/lake/okeechobee', '/lake/ontario', '/lake/pontchartrain', '/lake/rainy', '/lake/red', '/lake/salton_sea', '/lake/st._clair', '/lake/superior', '/lake/tahoe', '/lake/teshekpuk', '/lake/winnebago') denotation accuracy 0 denotation oracle accuracy 0 number of parses 0 spurious ambiguity 0 -------------------------------------------------------------------------------- input which states border states through which the mississippi traverses ? target denotation ('/state/alabama', '/state/arkansas', '/state/georgia', '/state/illinois', '/state/indiana', '/state/iowa', '/state/kansas', '/state/kentucky', '/state/louisiana', '/state/michigan', '/state/minnesota', '/state/mississippi', '/state/missouri', '/state/nebraska', '/state/north_carolina', '/state/north_dakota', '/state/ohio', '/state/oklahoma', '/state/south_dakota', '/state/tennessee', '/state/texas', '/state/virginia', '/state/west_virginia', '/state/wisconsin') denotation accuracy 0 denotation oracle accuracy 0 number of parses 0 spurious ambiguity 0 -------------------------------------------------------------------------------- input what is the highest mountain in alaska ? target denotation ('/mountain/mckinley',) denotation accuracy 0 denotation oracle accuracy 1 number of parses 12 spurious ambiguity 0 0 0.000 semantics ('.argmax', 'height', ('.and', 'mountain', '/state/alaska')) denotation - () 1 0.000 semantics ('.argmax', 'height', ('.and', 'mountain', ('traverses', '/state/alaska'))) denotation - () 2 0.000 semantics ('.argmax', 'height', ('.and', 'mountain', ('/state/alaska', 'contains'))) denotation + ('/mountain/mckinley',) 3 0.000 semantics ('.argmax', 'height', ('.and', ('mountain', 'traverses'), '/state/alaska')) denotation - () 4 0.000 semantics ('.argmax', 'height', ('.and', ('contains', 'mountain'), '/state/alaska')) denotation - () 5 0.000 semantics ('.and', ('.argmax', 'height', 'mountain'), '/state/alaska') denotation - () 6 0.000 semantics ('.and', ('.argmax', 'height', 'mountain'), ('traverses', '/state/alaska')) denotation - () 7 0.000 semantics ('.and', ('.argmax', 'height', 'mountain'), ('/state/alaska', 'contains')) denotation + ('/mountain/mckinley',) 8 0.000 semantics ('.and', ('.argmax', 'height', ('mountain', 'traverses')), '/state/alaska') denotation - () 9 0.000 semantics ('.and', ('.argmax', 'height', ('contains', 'mountain')), '/state/alaska') denotation - () 10 0.000 semantics ('.and', (('.argmax', 'height', 'mountain'), 'traverses'), '/state/alaska') denotation - () 11 0.000 semantics ('.and', ('contains', ('.argmax', 'height', 'mountain')), '/state/alaska') denotation - ('/state/alaska',) -------------------------------------------------------------------------------- input what is the population of illinois ? target denotation (11400000,) denotation accuracy 1 denotation oracle accuracy 1 number of parses 2 spurious ambiguity 0 0 0.000 semantics ('/state/illinois', 'population') denotation + (11400000,) 1 0.000 semantics (('/state/illinois', 'contains'), 'population') denotation - (100054, 124160, 139712, 3005172, 58267, 60278, 60590, 61232, 63668, 66116, 67653, 73706, 77956, 81293, 93939) -------------------------------------------------------------------------------- Over 10 examples: denotation accuracy 0.400 denotation oracle accuracy 0.600 number of parses 5.500 spurious ambiguity 0.000  Take a look through that output. Over the ten examples, we achieved denotation oracle accuracy of 60%, but denotation accuracy of just 40%. In other words, there were two examples where we generated a correct parse, but failed to rank it at the top. Take a closer look at those two cases. The first case is "what state has the shortest river ?". The top parse has semantics ('.and', 'state', ('.argmin', 'length', 'river')), which means something like "states that are the shortest river". That's not right. In fact, there's no such thing: the denotation is empty. The second case is "what is the highest mountain in alaska ?". The top parse has semantics ('.argmax', 'height', ('.and', 'mountain', '/state/alaska')), which means "the highest mountain which is alaska". Again, there's no such thing: the denotation is empty. So in both of the cases where we put the wrong parse at the top, the top parse had nonsensical semantics with an empty denotation. In fact, if you scroll through the output above, you will see that there are a lot of candidate parses with empty denotations. Seems like we could make a big improvement just by downweighting parses with empty denotations. This is easy to do. In [32]: def empty_denotation_feature(parse): features = defaultdict(float) if parse.denotation == (): features['empty_denotation'] += 1.0 return features weights = {'empty_denotation': -1.0} model = Model(grammar=grammar, feature_fn=empty_denotation_feature, weights=weights, executor=executor.execute)  Let's evaluate the impact of using our new empty_denotation feature on the Geo880 training examples. In [33]: from experiment import evaluate_model from metrics import denotation_match_metrics evaluate_model(model=model, examples=geo880_train_examples, metrics=denotation_match_metrics(), print_examples=False)  ================================================================================ Evaluating on 600 examples -------------------------------------------------------------------------------- Max cell capacity 1000 has been hit 128 times Over 600 examples: denotation accuracy 0.387 denotation oracle accuracy 0.468 number of parses 13.210 spurious ambiguity 0.001  Great! Using the empty_denotation feature has enabled us to increase denotation accuracy from 25% to 39%. That's a big gain! In fact, we've closed most of the gap between accuracy and oracle accuracy. As a result, the headroom for further gains from feature engineering is limited — but it's not zero. The exercises will ask you to push further. ## Exercises ¶ Several of these exercises ask you to measure the impact of your change on key evaluation metrics. Part of your job is to decide which evaluation metrics are most relevant for the change you're making. It's probably best to evaluate only on training data, in order to keep the test data unseen during development. (But the test data is hardly a state secret, so whatever.) ### Straightforward¶ 1. Extend the grammar to handle queries which use the phrase "how many people" to ask about population. There are many examples in the Geo880 training data. Measure the impact of this change on key evaluation metrics. 2. Extend the grammar to handle counting questions, indicated by the phrases "how many" or "number of", as in "how many states does missouri border". Measure the impact of this change on key evaluation metrics. 3. Several examples in the Geo880 training dataset fail to parse because they refer to locations using names that, while valid, are not recognized by the GeobaseAnnotator. For example, some queries use "america" to refer to /country/usa, but GeobaseAnnotator recognizes only "usa". Find unannotated location references in the Geo880 training dataset, and extend the grammar to handle them. Measure the impact of this change on key evaluation metrics. 4. Extend the grammar to handle examples of the form "where is X", such as "where is mount whitney" or "where is san diego". Measure the impact of this change on key evaluation metrics. 5. Quite a few examples in the Geo880 training dataset specify a set of entities by name, as in "how many states have a city named springfield" or "how many rivers are called colorado". Extend the grammar to handle these, leveraging the TokenAnnotator introduced in Unit 2. Measure the impact of this change on key evaluation metrics. 6. Extend the grammar to handle phrases like "austin texas" or "atlanta ga", where two entity names appear in sequence. Make sure that "atlanta ga" has semantics and a denotation, whereas "atlanta tx" has semantics but no denotation. Measure the impact of this change on key evaluation metrics. 7. In Unit 2, while examining the travel domain, we saw big gains from including rule features in our feature representation. Experiment with adding rule features to the GeoQuery model. Are the learned weights intuitive? Do the rule features help? If so, identify a few specific examples which are fixed by the inclusion of rule features. Measure the impact on key evaluation metrics. 8. Find an example where using the empty_denotation feature causes a loss (a bad prediction). 9. The empty_denotation feature doesn't help on count questions (e.g., "how many states ..."), where all parses, good or bad, typically yield a denotation which is a single number. Add a new feature which helps in such cases. Identify a few specific examples which are fixed by the new feature. Measure the impact on key evaluation metrics. [This exercise assumes you have already done the exercise on handling count questions.] ### Challenging¶ 1. Extend the grammar to handle comparisons, such as "mountains with height greater than 5000" or "mountains with height greater than the height of bona". 2. Building on the previous exercise, extend the grammar to handle even those comparisons which involve ellipsis, such as "mountains higher than [the height of] mt. katahdin" or "rivers longer than [the length of] the colorado river" (where the bracketed phrase does not appear in the surface form!). 3. Extend the grammar to handle queries involving units, such as "what is the area of maryland in square kilometers" or "how long is the mississippi river in miles". 4. The Geo880 training dataset contains many examples involving population density. Extend the grammar to handle these examples. Measure the impact on key evaluation metrics. 5. Several examples in the Geo880 training dataset involve some form of negation, expressed by the words "not", "no", or "excluding". Extend the grammar to handle these examples. Measure the impact on key evaluation metrics. 6. The current grammar handles "capital of texas", but not "texas capital". It handles "has austin capital", but not "has capital austin". In general, it defines every phrase which expresses a relation as either a $FwdRelation or a $RevRelation, and constrains word order accordingly. Extend the grammar to allow any phrase which is ordinarily a $FwdRelation to function as a $RevRelation, and vice versa. Can you now handle "texas capital" and "has capital austin"? Measure the impact on key evaluation metrics. 7. What if we permit any word to be optionalized by adding the rule Rule('$Optional', '$Token') to our grammar? (Recall that $Token is produced by the TokenAnnotator.) Measure the impact of this change on key evaluation metrics. You will likely find that the change has some negative effects and some positive effects. Is there a way to mitigate the negative effects, while preserving the positive effects?

8. The success of the empty_denotation feature demonstrates the potential of denotation features. Can we go further? Experiment with features that characterize the size of the denotation (that is, the number of answers). Are two answers better than one? Are ten answers better than two? If you find some features that seem to work, identify a few specific examples which are fixed by your new features. Measure the impact on key evaluation metrics.