import nltk
import tagutils; reload(tagutils)
from tagutils import *
from IPython.core.display import HTML
from nltk.corpus import brown
import random as pyrand
sents = list(brown.tagged_sents())
n = len(sents)
test = range(0, n, 10)                        # hold out every 10th sentence
training = sorted(set(range(n)) - set(test))  # remaining indices for training
training_set = [sents[i] for i in training]
test_set = [sents[i] for i in test]
print len(training_set)
print len(test_set)
51606
5734
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(training_set, backoff=t0)
t2 = nltk.BigramTagger(training_set, backoff=t1)
t2.evaluate(test_set)
0.9236947791164659
import nltk.tag.api
# help(nltk.tag.api.TaggerI)
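The `TaggerI` interface essentially requires a `tag(tokens)` method that maps a list of tokens to a list of `(token, tag)` pairs; evaluation is provided by the base class. A minimal standalone sketch of that contract (written without subclassing so it runs outside NLTK; it mirrors what `nltk.DefaultTagger` does):

```python
class ConstantTagger:
    """Minimal tagger following the TaggerI contract: tag(tokens)
    returns a list of (token, tag) pairs, here always the same tag."""
    def __init__(self, tag):
        self._tag = tag

    def tag(self, tokens):
        return [(token, self._tag) for token in tokens]

tagger = ConstantTagger('NN')
print(tagger.tag(['the', 'cat', 'sat']))
# [('the', 'NN'), ('cat', 'NN'), ('sat', 'NN')]
```

Your WordNet taggers only need to supply `tag` in the same shape (and, when subclassing `nltk.TaggerI`, they can then be used as a `backoff` for other taggers).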
Your homework consists of implementing new taggers based on WordNet. Regular taggers suffer from sparsity: we don't know what tag to assign to a word we have never seen in context.
However, for many words, WordNet can supply useful information to help with tagging. You need to work out some ideas, implement them, and test them.
There are different implementation strategies, but a simple one might be to look up unknown words in WordNet and fall back on the part of speech of their synsets. This may not be the best strategy, but it's a good way of getting started.
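One way to realize a simple WordNet fallback: count how often each WordNet POS occurs among a word's synsets, map the most frequent one to a Brown-style tag, and use that for unseen words. The sketch below injects the synset lookup as a plain dict so it runs without the WordNet data; in practice you would build the counts from `nltk.corpus.wordnet.synsets(word)`, and the POS-to-Brown mapping shown is an assumption you should refine:

```python
# Assumed mapping from WordNet POS markers to Brown-style tags
WN_TO_BROWN = {'n': 'NN', 'v': 'VB', 'a': 'JJ', 's': 'JJ', 'r': 'RB'}

class WordnetBackoffTagger:
    """Tag words by their most frequent WordNet POS.
    pos_lookup: word -> {wordnet_pos: count}; in practice derived
    from nltk.corpus.wordnet.synsets(word)."""
    def __init__(self, pos_lookup, default='NN'):
        self.pos_lookup = pos_lookup
        self.default = default

    def tag(self, tokens):
        return [(t, self.tag_word(t)) for t in tokens]

    def tag_word(self, word):
        counts = self.pos_lookup.get(word.lower())
        if not counts:
            return self.default          # not in WordNet: fall back
        best = max(counts, key=counts.get)
        return WN_TO_BROWN.get(best, self.default)

# Toy lookup standing in for WordNet synset counts (assumption)
lookup = {'run': {'v': 41, 'n': 16}, 'happy': {'a': 4}}
tagger = WordnetBackoffTagger(lookup)
print(tagger.tag(['run', 'happy', 'blorf']))
# [('run', 'VB'), ('happy', 'JJ'), ('blorf', 'NN')]
```

Plugged in as the `backoff` of a `UnigramTagger`, this replaces the blanket `DefaultTagger('NN')` for words WordNet knows about.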
Another strategy is to use WordNet to generate a cloud of related words around a given word, and then see whether you can find bigrams in an existing model for any of the related words.
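The cloud can be collected from the lemma names of a word's synsets (and perhaps their hypernyms), after which you query the bigram model with each related word in turn. A sketch with a toy thesaurus and a toy bigram table standing in for the real lookups (with WordNet, the cloud would come from `s.lemma_names()` over `wn.synsets(word)`):

```python
def related_words(word, thesaurus):
    """Cloud of words related to `word`. With WordNet this would be
    the union of lemma names over wn.synsets(word)."""
    cloud = set()
    for synonyms in thesaurus.get(word, []):
        cloud.update(synonyms)
    cloud.discard(word)
    return cloud

def tag_via_cloud(word, context_tag, bigram_table, thesaurus, default='NN'):
    """Try the word itself, then each related word, against a
    (previous_tag, word) -> tag table standing in for a bigram model."""
    for candidate in [word] + sorted(related_words(word, thesaurus)):
        tag = bigram_table.get((context_tag, candidate))
        if tag is not None:
            return tag
    return default

thesaurus = {'sofa': [['couch', 'lounge']]}   # toy stand-in (assumption)
bigrams = {('AT', 'couch'): 'NN'}             # toy stand-in (assumption)
print(tag_via_cloud('sofa', 'AT', bigrams, thesaurus))
# NN  (found via the related word 'couch')
```

The design choice here is to prefer evidence for the word itself and only then consult the cloud, so related words never override bigram statistics that were actually observed.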
Implement your model(s) so that they conform to the NLTK tagging API, evaluate them on the training and test sets defined above, and be ready to present your work (idea, implementation, evaluation) in the exercises.