Homework 5: Semantic date parsing

In [ ]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018 term"

Submission

For this homework, the submission procedure is a bit different than before:

  1. Run the complete notebook, or at least up to the section 'A rule-based baseline'. (It's probably easier to just run the whole notebook.) Leave all the output embedded.
  2. Save the notebook and upload the notebook file to Canvas: https://canvas.stanford.edu/courses/83399/assignments/135084

It's fine to add cells for your work, but don't delete any, so that we're sure we have the output we need for evaluation.

As usual, evaluating your work is not really our central goal. In fact, we've included various tests and checks to help ensure that you do everything correctly. Our goal is to give you the experience of designing and optimizing a semantic parser that could have real-world applicability.

Set-up

  1. Make sure the sys.path.append value is the path to your local SippyCup repository. (Alternatively, you can add SippyCup to your Python path; talk with a member of the teaching team if you'd like to do that but aren't sure how.)

  2. Make sure that semparse_dateparse_data.py is in the current directory (or available via your Python path).

In [ ]:
import sys
sys.path.append('../sippycup')
In [ ]:
import copy
import re
import calendar
import itertools
import random
from collections import defaultdict
from datetime import date
from dateutil import parser

Goals

The goal of this homework is to train a semantic parser to interpret date strings. Although this requires only simple syntactic structures, it turns out to be a deep understanding challenge involving contextual inference. Date strings are often ambiguous, and there are conventions for how to resolve the ambiguities. For example, 1/2/11 is ambiguous between Jan 2 and Feb 1, and the 11 could pick out 1211, 1911, 2011, and so forth.

The homework is structured much like our bake-off: the first goal is to create a grammar that gets perfect oracle accuracy, and the second goal is to design features that resolve ambiguities accurately and so predict the correct denotations.

In this case, it's possible to write a high-quality grammar right from the start. We know (approximately) what most of the words mean, so we really just need to handle the variation in the forms people use. The bigger challenge comes in defining features that capture the subtler interpretive conventions.

In the final part, we bring in a baseline: the rule-based dateutil.parser.parse. It's good, achieving around 70% on the train and dev sets. However, there are some strings that it just can't parse, and, since it isn't learning-based, it can't learn the conventions latent in the data. You'll be able to do much better!
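
As a quick illustration of the conventions at stake (just a sketch, assuming dateutil's default settings, which prefer the month-first reading and expand 2-digit years to a nearby century):

In [ ]:
# With its defaults, dateutil reads "1/2/11" as month/day/year:
parser.parse("1/2/11").date()  # expected: date(2011, 1, 2)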

Dataset

The dataset has train and dev portions, with 1000 and 500 examples, respectively.

In [ ]:
from semparse_dateparse_data import dateparse_train, dateparse_dev

Here's a look at the data via random samples. Running this cell a few times will give you a feel for the data.

  • All the punctuation has been removed from the strings for convenience.

  • All of the examples end with one of two informal timezone flags: US if the string comes from text produced in the U.S., and non-US otherwise. This is an important piece of information that can be used in learning, since (I assume) only Americans use the jumbled month/day/year order.

In [ ]:
# Display a random sample of training examples:
for ex in random.sample(dateparse_train, k=5):    
    print("=" * 70)
    print("Input: {}".format(ex.input))
    print("Denotation: {}".format(repr(ex.denotation)))

Core interpretive functionality

Here's the semantics you'll want to use. You won't need to change it.

Important: date_semantics expects its arguments in international order: year, month, day.

In [ ]:
def date_semantics(year, month, day):
    """Interpret the arguments as a date objects if possible. Returns 
    None of the datetime module won't treat the arguments as a date 
    (say, because it would be a Feb 31)."""
    try:
        return date(year=year, month=month, day=day)
    except ValueError:
        return None

# The key dt is arbitrary, but the grammar rules need to hook into it.
ops = {'dt': date_semantics}

def execute(semantics):
    """The usual SippyCup interpretation function"""
    if isinstance(semantics, tuple):
        op = ops[semantics[0]]
        args = [execute(arg) for arg in semantics[1:]]
        return op(*args)
    else:
        return semantics
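
Here's a quick check of how the representation and executor fit together:

In [ ]:
# A semantics tuple keyed by 'dt' executes to a date object:
print(execute(('dt', 2015, 5, 4)))   # 2015-05-04

# Impossible dates (e.g., Feb 31) come back as None:
print(execute(('dt', 2015, 2, 31)))  # None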

Questions 1–6: Build the crude grammar [6 points]

The following is a high-level summary of the grammar you're trying to build. Rules in green have been completed for you. Rules in orange still need to be written.

dateparse-grammar-summary.png

It's easy to accidentally add repeated rules and problematic rules to a grammar, forget that one has done that, and then plunge into a pit of grammatical despair as one deals with the bugs this introduces. To help prevent this, you'll write the entire grammar in a single cell of this notebook, to ensure that reloading the cell gives you a fresh start. So questions 1–6 are in the next cell, along with the rules we wrote for you (and a lot of hints).

In [ ]:
from parsing import Grammar, Rule, add_rule


# Use $DATE as the root node for full date expressions:
gram = Grammar(start_symbol="$DATE")

# Fill this with Rule objects:
rules = []


######################################################################
################# TIMEZONE RULES (ALREADY COMPLETED) #################

# Make sure the grammar can handle final timezone information:
timezone_rules = [
    Rule('$DATE', '$DATE $TZ', (lambda sems : sems[0])),
    Rule('$TZ', 'US'),
    Rule('$TZ', 'non-US') ]

rules += timezone_rules


######################################################################
################ RULES FOR MONTHS (ALREADY COMPLETED) ################
#
# As a reminder about the notation, concepts, etc., here are lexical 
# rules for the months. They all determine trees
# 
#    $M : i
#      |
#  monthname
#
# where monthname is the ith month of the year.

# Full month names with "" in the 0th position so that 'January' has 
# index 1, 'February' index 2, ...
months = [str(s) for s in calendar.month_name]
# Add the rules:
rules += [Rule('$M', m, i) for i, m in enumerate(months) if m]

# 3-letter month names like "Jan", "Feb", with "" in 0th position:
mos = [str(s) for s in calendar.month_abbr]
# Add the rules:
rules += [Rule('$M', m, i) for i, m in enumerate(mos) if m]

# Numerical months:
num_months = [str(i) for i in range(1, 13)]
# Add the rules:
rules += [Rule('$M', m, int(m)) for m in num_months]

# 1-digit months with an initial zero:
num_months_padded = [str(i).zfill(2) for i in range(1,10)]
# Add the rules:
rules += [Rule('$M', m, int(m)) for m in num_months_padded]
        
    
######################################################################
# Question 1: Rules for integer days
#
# Add lexical rules for the days. Again, these will be structures
#
#   $D : i
#     |
#    day
# 
# where day is a 1- or 2-digit string and i is its corresponding int.

# Here are days of the month without and with two-digit padding. Each 
# day string m can be interpreted semantically as int(m).

days = [str(d) for d in range(1, 32)]

days_padded = [str(i).zfill(2) for i in range(1, 11)]

# Add to the rules list here:



######################################################################
# Question 2: Rules for ordinal days  
#
# The data contain day names like "2nd". Use int2ordinal to create 
# rules for these too:
#
#   $D : int(i)
#        |
#  int2ordinal(i)
#
# The meaning of these expressions is the corresponding int -- e.g.,
# '2nd' means 2 and '21st' means 21.

def int2ordinal(s):
    """Forms numerals like "1st" from int strs like 1"""
    suffixes = {"1": "st", "2": "nd", "3": "rd"}    
    if len(s) == 2 and s[0] == '1': # teens
        suff = "th"
    else: # all others use suffixes or default to "th"
        suff = suffixes.get(s[-1], "th")
    return s + suff
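
# For illustration only (not grammar rules): int2ordinal("2") == "2nd",
# int2ordinal("11") == "11th", and int2ordinal("21") == "21st".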

# Add to the rules list here; you can use `days` above for the 
# core meanings.



######################################################################
# Question 3: Rules for 'the'
#
# The data contain expressions like "the 3rd". Add a lexical rule to 
# include 'the', and a binary combination rule so that we have 
# structures like
#
#      $D
#     /  \ 
#   $Det  $D
#    |    |
#   the  3rd
#
# We emphasize that this requires *two* new rules: one to add 'the' 
# as a $Det, and another licensing the parent and its children in 
# the tree above. The second should just pass up the meaning of its 
# $D child unmodified, so that the entire tree is interpreted as 
# semantically equivalent to what 3rd alone means.

# Add to the rules list here:



######################################################################
# Question 4: Rules for 4-digit years
#
# Long (4-digit) years are easy: they can be interpreted as themselves. 
# So these are rules of the form
#
#   $Y : int(year)
#       |
#      year
#
# where year is a 4-digit string.

# This range will be fine for our data:
years_long = [str(y) for y in range(1900, 2100)]

# Add to the rules list here:



######################################################################
# Question 5: Rules for 2-digit years
#
# Two-digit years are ambiguous about their century. Intuitively, for 
# each 2-digit year, we need a rule for interpreting it in all 
# potential centuries:

# All 2-digit years:
years_short = [str(x).zfill(2) for x in range(0, 100)]
# A suitable range of century multipliers given our data:
centuries = range(1700, 2500, 100)       

# Add to the rules list here:



######################################################################
########### INITIAL COMPOSITION RULES (ALREADY COMPLETED) ############
#
# Here are some initial composition rules. Together, these rules 
# create structures like this:
#
#           $DATE
#           /    \
#   $MonthDay    $Y
#    /     \
#   $M     $D
#
# and the 'dt' key connects with the ops dictionary we defined above.
# The semantics is creating a tuple (dt year month day) for the
# sake of date_semantics. You'll clearly need rules for at least
# 
# * $MonthDay in the reverse order
# * $Y before $MonthDay

composition_rules = [
    #                                                $M        $D
    Rule('$MonthDay', '$M $D', lambda sems : ('dt', sems[0], sems[1])),
    #                                           
    Rule('$DATE', '$MonthDay $Y', 
    #                   dt            $Y       $M           $D
         lambda sems: (sems[0][0], sems[1], sems[0][1], sems[0][2]))
]
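
# For illustration: with these two rules, "Apr 3 1986" is assigned the
# semantics ('dt', 1986, 4, 3), which execute() maps to date(1986, 4, 3).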


######################################################################
# Question 6: The remaining composition rules
#
# We need rules for
#
#    $MonthDay
#     /     \
#    $D     $M
#
# and 
#   
#      $DATE
#      /    \
#     $Y  $MonthDay

composition_rules += [
    
]

rules += composition_rules

######################################################################
# Add all the rules to the grammar:

for rule in rules:
    add_rule(gram, rule)

print('Grammar now has {} lexical rules, {} unary rules, '
      'and {} binary rules'.format(
          len(gram.lexical_rules), 
          len(gram.unary_rules), 
          len(gram.binary_rules)))

Your grammar should have

  • 366 lexical rules
  • 0 unary rules
  • 6 binary rules

It can be informative to enter your own date strings and see what happens to them:

In [ ]:
from parsing import parse_to_pretty_string

def parse_and_interpret(s, grammar=gram):
    for i, parse in enumerate(grammar.parse_input(s)):
        print("=" * 70)
        print("Parse {}: {}".format(i+1, parse_to_pretty_string(parse)))
        print("Denotation: {}".format(execute(parse.semantics)))
In [ ]:
parse_and_interpret("5 4 2015 US")

Check oracle accuracy

You should be able to get 100% oracle accuracy with your grammar:

In [ ]:
def check_oracle_accuracy(grammar=None, examples=dateparse_train, verbose=True):
    """Use verbose=True to print out cases where your grammar can't
    find a correct denotation among all of its parses."""
    oracle = 0
    for ex in examples:
        # All the denotations for all the parses:
        dens = [execute(parse.semantics) for parse in grammar.parse_input(ex.input)]
        if ex.denotation in dens:
            oracle += 1
        elif verbose:
            print("=" * 70)
            print(ex.input)
            print(set(dens))
            print(repr(ex.denotation))
    percent_correct = oracle / float(len(examples))
    print("Oracle accuracy: {0:} / {1:} ({2:0.02%})".format(
        oracle, len(examples), percent_correct))
In [ ]:
check_oracle_accuracy(
    grammar=gram, examples=dateparse_train, verbose=False)

Question 7: Timezone sensitivity [2 points]

As noted above, only Americans use the month/day/year order. This suggests that we can disambiguate otherwise ambiguous dates by writing a feature function that is sensitive to this pre-terminal ordering and the associated informal timezone string.

Your task: Complete timezone_sensitivity_features so that, when given an input parse, it returns a dict of length 1 that maps the concatenation of the timezone string and the pre-terminal nodes to 1. You'll likely want to make use of get_lemmas and get_timezone to do this. (See test_timezone_sensitivity_features for a test, in case this description is ambiguous.)

In [ ]:
def get_timezone(parse):
    """Returns the timezone string in `parse`."""
    return parse.children[1].rule.rhs[0]


def get_lemmas(parse):
    """Returns the list of (pre-terminal, word) pairs in `parse`."""
    labs = []
    for t in parse.children:
        if len(t.rule.rhs) == 1:
            labs.append((t.rule.lhs, t.rule.rhs[0]))
        else:
            labs += get_lemmas(t)
    return labs
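
To get a feel for these helpers, you can inspect one of your grammar's parses. (This is just a sketch; exactly which parse comes back first depends on your grammar, so the pre-terminals shown in the comments might differ.)

In [ ]:
demo_parses = gram.parse_input("29 Jun 1969 non-US")
if demo_parses:
    print(get_lemmas(demo_parses[0]))    # e.g., [('$D', '29'), ('$M', 'Jun'), ('$Y', '1969'), ('$TZ', 'non-US')]
    print(get_timezone(demo_parses[0]))  # 'non-US'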
In [ ]:
def timezone_sensitivity_features(parse):
    feats = defaultdict(float)
    
    
    return feats

It's always a good idea to write a simple unit test:

In [ ]:
def test_timezone_sensitivity_features():
    from parsing import Parse, Rule
    
    p = Parse(
        Rule('$DATE', '$DATE $TZ'), 
        [Parse(
            Rule('$DATE', '$MonthDay $Y'), 
            [Parse(Rule('$MonthDay', '$D $M'), 
                   [Parse(Rule('$D', '29'), ['29']), 
                    Parse(Rule('$M', 'Jun'), ['Jun'])]),
             Parse(Rule('$Y', '1969'), ['1969'])]),
         Parse(Rule('$TZ', 'non-US'), ['non-US'])])    
    
    expected = {'non-US $D $M $Y $TZ': 1}
    result = timezone_sensitivity_features(p)    
    if result == expected:
        print("timezone_sensitivity_features test passed!")
    else:
        raise AssertionError("timezone_sensitivity_features has a bug!")
In [ ]:
test_timezone_sensitivity_features()

Question 8: Year-string length [2 points]

There is another interesting bias in the data: throughout the latter half of the 20th century, people felt that they could give the year with just two digits, secure in the knowledge that this would somehow never be ambiguous (a rather depressing assumption). Let's capture this bias in a feature function.

Your task: Complete year_length_features so that it returns a dict of length 1 in which the key has the form $Ylen={LENGTH} year={YEAR}, where LENGTH is the length of the year string in the input and YEAR is the year field in the denotation. Return the empty dict if the denotation has no year. Tips: get_lemmas will make it easy to access the year string in the input, and parse.denotation is a datetime.date object from which it is easy to get the year.

In [ ]:
def year_length_features(parse):
    feats = defaultdict(float)
    
    
    return feats

Here's a test for year_length_features:

In [ ]:
def test_year_length_features():
    from parsing import Parse, Rule
    
    p = Parse(
        Rule('$DATE', '$DATE $TZ'), 
        [Parse(
            Rule('$DATE', '$MonthDay $Y'), 
            [Parse(Rule('$MonthDay', '$D $M'), 
                   [Parse(Rule('$D', '29'), ['29']), 
                    Parse(Rule('$M', 'Jun'), ['Jun'])]),
             Parse(Rule('$Y', '69'), ['69'])]),
         Parse(Rule('$TZ', 'non-US'), ['non-US'])])
    
    p.denotation = date(1969, 6, 29)
            
    expected = {'$Ylen=2 year=1969': 1}
    result = year_length_features(p)
    if result == expected:
        print("year_length_features test passed!")
    else:
        raise AssertionError("year_length_features has a bug!")
In [ ]:
test_year_length_features()

Putting the feature functions together

Here we just knit together timezone_sensitivity_features and year_length_features to train a model on both:

In [ ]:
def date_rule_features(parse): 
    features = defaultdict(float)
    features.update(timezone_sensitivity_features(parse))
    features.update(year_length_features(parse))
    return features

Train a model

The training step. Feel free to fiddle (e.g., more training epochs or a lower learning rate), but this is optional. It's fine to run this as-is and move on.

In [ ]:
from learning import latent_sgd
from metrics import DenotationAccuracyMetric
from scoring import Model

model = Model(grammar=gram, feature_fn=date_rule_features, executor=execute)

trained_model = latent_sgd(
    model, 
    dateparse_train,
    loss='hinge',
    l2_penalty=0,
    training_metric=DenotationAccuracyMetric(), 
    T=10, 
    eta=0.1)
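
If you do want to fiddle (optional; these are merely hypothetical settings, not a recommendation), a longer run with a lower learning rate would look something like this:

In [ ]:
# Uncomment to retrain with more passes and a smaller step size:
# trained_model = latent_sgd(
#     model,
#     dateparse_train,
#     loss='hinge',
#     l2_penalty=0,
#     training_metric=DenotationAccuracyMetric(),
#     T=20,
#     eta=0.05)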

Evaluate the trained model

Now evaluate your model on the held-out data. Nothing to do here beyond running the cell so that we can see the output printed in the notebook.

In [ ]:
from experiment import evaluate_model
from metrics import denotation_match_metrics

evaluate_model(
    model=trained_model, 
    examples=dateparse_dev,
    examples_label="Dev",
    metrics=denotation_match_metrics(),
    print_examples=False)

A rule-based baseline

Was it worth the trouble? I think it was, in that you probably soundly beat the high-performance dateutil parser. Running the following is optional; it's purely in case you are curious about what the baseline achieves.

In [ ]:
def evaluate_dateutil(dataset, verbose=False):
    """dataset should be one data['train'] or data['dev']. Use
    verbose=True to see information about the errors."""
    results = defaultdict(int)
    for ex in dataset:
        input = re.sub(r"(non-)?US$", "", ex.input)
        prediction = None 
        try:
            prediction = parser.parse(input).date()
        except ValueError:
            if verbose:
                print("dateutil can't parse '%s'" % input)
        results[prediction == ex.denotation] += 1
        if prediction and prediction != ex.denotation and verbose:
            print("dateutil predicted %s to mean %s" % (input, repr(prediction)))
    acc = results[True] / float(len(dataset))
    return "accuracy: {0:} / {1:} ({2:0.2%})".format(
        results[True], len(dataset), acc)

# Use verbose=True to see where and how dateutil stumbles:
for key, data in (('train', dateparse_train), ('dev', dateparse_dev)):
    print("{}: {}".format(key, evaluate_dateutil(data, verbose=False)))

Inspect the trained model

Which examples are you getting wrong? (Running this step is optional; it's recommended if you wish to improve the model by adding more features.)

In [ ]:
def sample_errors(model=None, dataset=None, k=5):
    data = copy.copy(dataset)
    random.shuffle(data)
    errors = 0
    for i in range(len(data)):
        ex = data[i]
        parses = model.parse_input(ex.input)
        if not parses or (parses[0].denotation != ex.denotation):
            best_parse = parses[0] if parses else None
            prediction = parses[0].denotation if parses else None
            print("=" * 70)
            print('Input:', ex.input)
            print('Best parse:', best_parse)
            print('Prediction:', prediction)
            print('Actual:', ex.denotation)
            errors += 1
            if errors >= k:
                return
In [ ]:
sample_errors(model=trained_model, dataset=dateparse_train)