# Spanish POS-tagger with NLTK

NLTK comes with an English POS-tagger trained by default. For other languages you will need to train one yourself. Depending on the language, this is not a problem because, as you may know, NLTK ships with many corpora that you can download and use.

In this example I'm going to use a Spanish corpus called cess_esp. Since I want to test whether the tagger performs well or not, I need to evaluate it, so I'll need some test data. For this reason I'm going to split the corpus into two sets: 90% for training and 10% for testing.

In [1]:
from nltk.corpus import cess_esp

sents = cess_esp.tagged_sents()
training = []
test = []
for i in range(len(sents)):
    # every 10th sentence goes to the test set, the rest to training
    if i % 10:
        training.append(sents[i])
    else:
        test.append(sents[i])


NLTK provides several types of taggers, so I'm going to train a few of them and then choose the most accurate one.

In [2]:
from nltk import UnigramTagger, BigramTagger, TrigramTagger
from nltk.tag.hmm import HiddenMarkovModelTagger

unigram_tagger = UnigramTagger(training)
bigram_tagger = BigramTagger(training, backoff=unigram_tagger)    # falls back to the unigram tagger when it can't tag a word
trigram_tagger = TrigramTagger(training, backoff=unigram_tagger)
hmm_tagger = HiddenMarkovModelTagger.train(training)


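Before evaluating, it's worth seeing what `backoff` actually does. Here is a minimal sketch on a hypothetical two-sentence toy corpus (not cess_esp): a plain `BigramTagger` returns `None` when it meets a context it never saw in training, while one built with `backoff=` falls back to the unigram tagger's per-word tag.

```python
from nltk import UnigramTagger, BigramTagger

# Hypothetical toy corpus, just to illustrate how backoff works
toy = [[('el', 'da'), ('gato', 'nc'), ('duerme', 'vm')],
       [('la', 'da'), ('casa', 'nc')]]

uni = UnigramTagger(toy)
bi_alone = BigramTagger(toy)
bi_backoff = BigramTagger(toy, backoff=uni)

# 'gato' was never seen at the start of a sentence, so its bigram
# context is unknown: the plain bigram tagger gives up, the one with
# backoff recovers via the unigram tagger
print(bi_alone.tag(['gato', 'duerme']))
print(bi_backoff.tag(['gato', 'duerme']))
```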
Now, let's evaluate each tagger with our test data.

In [3]:
print('UnigramTagger: %.1f %%' % (unigram_tagger.evaluate(test) * 100))
print('BigramTagger: %.1f %%' % (bigram_tagger.evaluate(test) * 100))
print('TrigramTagger: %.1f %%' % (trigram_tagger.evaluate(test) * 100))
print('HMM: %.1f %%' % (hmm_tagger.evaluate(test) * 100))

UnigramTagger: 87.6 %
BigramTagger: 89.4 %
TrigramTagger: 89.0 %
HMM: 89.9 %


In this case HiddenMarkovModelTagger performs best. Besides, this tagger can be tuned a bit further, so you may find this link useful.

Depending on the size of the corpus, training a tagger can take a long time (especially the HMM tagger), and you probably don't want a performance hit in your application in production. You can use the same technique that NLTK uses with its default POS-tagger: dump the trained tagger with pickle and load it back whenever you need it.

In [4]:
import pickle

# Dump the trained tagger to disk (pickles are binary, so open in 'wb' mode)
with open('unigram_spanish.pickle', 'wb') as fd:
    pickle.dump(unigram_tagger, fd)
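Loading it back is the mirror image. Here is a self-contained sketch of the full round trip, using a hypothetical toy corpus and a temporary file so it doesn't depend on the cells above:

```python
import os
import pickle
import tempfile

from nltk import UnigramTagger

# Hypothetical toy corpus, just to keep the sketch self-contained
toy = [[('el', 'da'), ('gato', 'nc'), ('duerme', 'vm')]]
tagger = UnigramTagger(toy)

path = os.path.join(tempfile.mkdtemp(), 'toy_spanish.pickle')

# Dump the trained tagger...
with open(path, 'wb') as fd:
    pickle.dump(tagger, fd)

# ...and load it back later, e.g. at application start-up
with open(path, 'rb') as fd:
    loaded = pickle.load(fd)

print(loaded.tag(['el', 'gato']))  # tags come straight from the toy corpus
```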