Spanish POS-tagger with NLTK

NLTK ships with a pre-trained English POS-tagger. For other languages you will need to train one yourself. Depending on the language this is not a problem, because NLTK offers many corpora that you can download and use.

In this example I'm going to use a Spanish corpus called cess_esp. To check whether the tagger performs well I need to evaluate it, so I'll need some test data. For this reason I'm going to split the corpus into two sets: 90% for training and 10% for testing.

In [1]:
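from nltk.corpus import cess_esp  # needs the corpus: nltk.download('cess_esp')

sents = cess_esp.tagged_sents()
training = []
test = []
# Every tenth sentence (i % 10 == 0) goes to the test set, the rest to training
for i in range(len(sents)):
    if i % 10:
        training.append(sents[i])
    else:
        test.append(sents[i])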
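
The modulo check above sends the sentences at indices 0, 10, 20, ... to the test set and everything else to training. Here is a minimal sketch of the same idea on a plain list, so the 90/10 proportion is easy to verify. The helper name split_every_nth is hypothetical (it is not part of NLTK), and the list of integers is only a stand-in for the tagged sentences.

```python
def split_every_nth(items, n=10):
    """Send every n-th item (indices 0, n, 2n, ...) to test, the rest to training."""
    training, test = [], []
    for i, item in enumerate(items):
        if i % n:  # non-zero remainder: keep for training
            training.append(item)
        else:      # remainder 0: hold out for testing
            test.append(item)
    return training, test


# Stand-in data; with the corpus you would pass cess_esp.tagged_sents() instead.
training, test = split_every_nth(list(range(100)))
print(len(training), len(test))  # 90 10
```

Interleaving the held-out sentences like this keeps both sets spread over the whole corpus, instead of testing only on its last part.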

NLTK provides several types of taggers, so I'm going to train a few of them and then choose the most accurate one.

In [2]:
from nltk import UnigramTagger, BigramTagger, TrigramTagger
from nltk.tag.hmm import HiddenMarkovModelTagger

unigram_tagger = UnigramTagger(training)
bigram_tagger = BigramTagger(training, backoff=unigram_tagger) # uses unigram tagger in case it can't tag a word
trigram_tagger = TrigramTagger(training, backoff=unigram_tagger)
hmm_tagger = HiddenMarkovModelTagger.train(training)
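
To make the backoff idea concrete, here is a rough sketch in plain Python of a tagger that falls back to another one when it has never seen a word. ToyUnigramTagger and ToyDefaultTagger are hypothetical classes written for illustration only, not NLTK's implementation, and the tags are simplified placeholders rather than the real cess_esp tagset.

```python
from collections import Counter, defaultdict


class ToyDefaultTagger:
    """Hypothetical last resort: assigns the same tag to every word."""

    def __init__(self, tag):
        self.default = tag

    def tag_one(self, word):
        return self.default


class ToyUnigramTagger:
    """Hypothetical stand-in: tags a word with its most frequent training tag."""

    def __init__(self, tagged_sents, backoff=None):
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            for word, tag in sent:
                counts[word][tag] += 1
        # Keep only the most common tag seen for each word.
        self.table = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        self.backoff = backoff

    def tag_one(self, word):
        if word in self.table:
            return self.table[word]
        # Unknown word: defer to the backoff tagger, if there is one.
        return self.backoff.tag_one(word) if self.backoff else None

    def tag(self, words):
        return [(w, self.tag_one(w)) for w in words]


# Tiny made-up training set with simplified tags.
train = [[("el", "da"), ("gato", "nc"), ("come", "vm")]]
tagger = ToyUnigramTagger(train, backoff=ToyDefaultTagger("nc"))
print(tagger.tag(["el", "perro", "come"]))
# [('el', 'da'), ('perro', 'nc'), ('come', 'vm')]
```

The word "perro" never appears in the training data, so its tag comes from the backoff tagger. Without a backoff it would be tagged None.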

Now, let's evaluate each tagger with our test data.

In [3]:
print('UnigramTagger: %.1f %%' % (unigram_tagger.evaluate(test) * 100))
print('BigramTagger: %.1f %%' % (bigram_tagger.evaluate(test) * 100))
print('TrigramTagger: %.1f %%' % (trigram_tagger.evaluate(test) * 100))
print('HMM: %.1f %%' % (hmm_tagger.evaluate(test) * 100))
UnigramTagger: 87.6 %
BigramTagger: 89.4 %
TrigramTagger: 89.0 %
HMM: 89.9 %
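
The percentages above are token-level accuracy: the share of words in the test sentences whose predicted tag matches the gold tag. A small sketch of that computation in plain Python may help to read the numbers. The function token_accuracy and the two tiny data sets are hypothetical and only illustrate the metric, they are not NLTK code.

```python
def token_accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for gold, pred in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, pred_tag) in zip(gold, pred):
            if gold_tag == pred_tag:
                correct += 1
            total += 1
    return correct / total if total else 0.0


# Made-up gold annotations and tagger output (simplified tags).
gold = [[("el", "da"), ("gato", "nc"), ("come", "vm")],
        [("la", "da"), ("casa", "nc")]]
pred = [[("el", "da"), ("gato", "nc"), ("come", "nc")],
        [("la", "da"), ("casa", "nc")]]

print("%.1f %%" % (token_accuracy(gold, pred) * 100))  # 80.0 %
```

Four of the five tokens are tagged correctly, which gives 80.0 %.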

In this case HiddenMarkovModelTagger is the most accurate. Besides, this tagger can be tuned a bit more, so you may find this link useful.

Depending on the size of the corpus, training a tagger can take a long time (especially the HMM tagger), and you probably don't want to pay that cost in a production application. You can use the same technique that NLTK uses for its default POS-tagger: dump the trained tagger with pickle and load it whenever you need it.

In [4]:
import pickle

# Dump trained tagger (pickle files must be opened in binary mode)
with open('unigram_spanish.pickle', 'wb') as fd:
    pickle.dump(unigram_tagger, fd)

# Load tagger
with open('unigram_spanish.pickle', 'rb') as fd:
    tagger = pickle.load(fd)
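
One way to combine both steps is a small "train once, load afterwards" helper. The sketch below is only an illustration: load_or_train and fake_train are hypothetical names, and a plain dict stands in for the trained tagger so the example runs without NLTK. With a real tagger you would pass something like lambda: UnigramTagger(training) as train_fn.

```python
import os
import pickle
import tempfile


def load_or_train(path, train_fn):
    """Load a pickled model from path, or train it and save it if missing."""
    if os.path.exists(path):
        with open(path, "rb") as fd:
            return pickle.load(fd)
    model = train_fn()
    with open(path, "wb") as fd:
        pickle.dump(model, fd)
    return model


calls = []


def fake_train():
    """Stand-in for an expensive training step; records each call."""
    calls.append(1)
    return {"el": "da", "gato": "nc"}


path = os.path.join(tempfile.mkdtemp(), "toy_tagger.pickle")
first = load_or_train(path, fake_train)   # trains and writes the file
second = load_or_train(path, fake_train)  # reads the file, no training
print(first == second, len(calls))  # True 1
```

Keep in mind that unpickling can execute arbitrary code, so only load pickle files that you created yourself or that come from a source you trust.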

If you save unigram_spanish.pickle in one of the directories listed in nltk.data.path, you will be able to load your tagger with the nltk.data.load function, which is cleaner because you won’t have to use absolute paths or unintuitive relative paths.