import nltk
import tagutils; reload(tagutils)
from tagutils import *
from IPython.core.display import HTML
from nltk.corpus import brown
import random as pyrand
sents = list(brown.tagged_sents())
n = len(sents)
test = range(0, n, 10)                        # hold out every 10th sentence
training = sorted(set(range(n)) - set(test))  # remaining indices for training
training_set = [sents[i] for i in training]
test_set = [sents[i] for i in test]
print len(training_set)
print len(test_set)
51606
5734
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(training_set, backoff=t0)
t2 = nltk.BigramTagger(training_set, backoff=t1)
t2.evaluate(test_set)
0.9236947791164659
import nltk.tag.api
# help(nltk.tag.api.TaggerI)
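The `TaggerI` interface essentially requires a `tag(tokens)` method that maps a list of tokens to a list of `(token, tag)` pairs; evaluation is provided by the base class. A minimal standalone sketch of that contract (written without subclassing so it runs outside NLTK; it mirrors what `nltk.DefaultTagger` does):

```python
class ConstantTagger:
    """Minimal tagger following the TaggerI contract: tag(tokens)
    returns a list of (token, tag) pairs, here always the same tag."""
    def __init__(self, tag):
        self._tag = tag

    def tag(self, tokens):
        return [(token, self._tag) for token in tokens]

tagger = ConstantTagger('NN')
print(tagger.tag(['the', 'cat', 'sat']))
# [('the', 'NN'), ('cat', 'NN'), ('sat', 'NN')]
```

Your WordNet taggers only need to supply `tag` in the same shape (and, when subclassing `nltk.TaggerI`, they can then be used as a `backoff` for other taggers).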
Your homework consists of implementing new taggers based on WordNet. Regular taggers suffer from sparsity: we don't know what tag to assign to a word we have never seen in context.
However, for many words, WordNet can supply useful information to help with tagging. You need to work out some ideas, implement them, and test them.
There are different implementation strategies, but a simple one might be to look up unknown words in WordNet and fall back on the part of speech of their synsets. This may not be the best strategy, but it's a good way of getting started.
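One way to realize a simple WordNet fallback: count how often each WordNet POS occurs among a word's synsets, map the most frequent one to a Brown-style tag, and use that for unseen words. The sketch below injects the synset lookup as a plain dict so it runs without the WordNet data; in practice you would build the counts from `nltk.corpus.wordnet.synsets(word)`, and the POS-to-Brown mapping shown is an assumption you should refine:

```python
# Assumed mapping from WordNet POS markers to Brown-style tags
WN_TO_BROWN = {'n': 'NN', 'v': 'VB', 'a': 'JJ', 's': 'JJ', 'r': 'RB'}

class WordnetBackoffTagger:
    """Tag words by their most frequent WordNet POS.
    pos_lookup: word -> {wordnet_pos: count}; in practice derived
    from nltk.corpus.wordnet.synsets(word)."""
    def __init__(self, pos_lookup, default='NN'):
        self.pos_lookup = pos_lookup
        self.default = default

    def tag(self, tokens):
        return [(t, self.tag_word(t)) for t in tokens]

    def tag_word(self, word):
        counts = self.pos_lookup.get(word.lower())
        if not counts:
            return self.default          # not in WordNet: fall back
        best = max(counts, key=counts.get)
        return WN_TO_BROWN.get(best, self.default)

# Toy lookup standing in for WordNet synset counts (assumption)
lookup = {'run': {'v': 41, 'n': 16}, 'happy': {'a': 4}}
tagger = WordnetBackoffTagger(lookup)
print(tagger.tag(['run', 'happy', 'blorf']))
# [('run', 'VB'), ('happy', 'JJ'), ('blorf', 'NN')]
```

Plugged in as the `backoff` of a `UnigramTagger`, this replaces the blanket `DefaultTagger('NN')` for words WordNet knows about.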
Another strategy is to use WordNet to generate a cloud of related words around a given word, and then see whether you can find bigrams in an existing model for any of the related words.
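The cloud can be collected from the lemma names of a word's synsets (and perhaps their hypernyms), after which you query the bigram model with each related word in turn. A sketch with a toy thesaurus and a toy bigram table standing in for the real lookups (with WordNet, the cloud would come from `s.lemma_names()` over `wn.synsets(word)`):

```python
def related_words(word, thesaurus):
    """Cloud of words related to `word`. With WordNet this would be
    the union of lemma names over wn.synsets(word)."""
    cloud = set()
    for synonyms in thesaurus.get(word, []):
        cloud.update(synonyms)
    cloud.discard(word)
    return cloud

def tag_via_cloud(word, context_tag, bigram_table, thesaurus, default='NN'):
    """Try the word itself, then each related word, against a
    (previous_tag, word) -> tag table standing in for a bigram model."""
    for candidate in [word] + sorted(related_words(word, thesaurus)):
        tag = bigram_table.get((context_tag, candidate))
        if tag is not None:
            return tag
    return default

thesaurus = {'sofa': [['couch', 'lounge']]}   # toy stand-in (assumption)
bigrams = {('AT', 'couch'): 'NN'}             # toy stand-in (assumption)
print(tag_via_cloud('sofa', 'AT', bigrams, thesaurus))
# NN  (found via the related word 'couch')
```

The design choice here is to prefer evidence for the word itself and only then consult the cloud, so related words never override bigram statistics that were actually observed.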
Implement your model(s) so that they conform to the NLTK tagging API, evaluate them on the training and test sets defined above, and be ready to present your work (idea, implementation, evaluation) in the exercises.