TreeTagger-based Backoff Lemmatizer

Here is what a TreeTagger-based sublemmatizer for the CLTK Latin Backoff Lemmatizer would look like. It is something like a wrapper around a wrapper: it creates a subclass of NLTK's sequential backoff tagger that gets lemma information from the treetagger-python wrapper. The lemmatize method takes a list of tokens (like all of the CLTK backoff lemmatizers), joins them, and runs the resulting string through TreeTagger. The Backoff Lemmatizer works by trying to return a lemma for a given token with one lemmatizer and, if no match is found (i.e. None is returned), backing off to a second lemmatizer (and a third, etc.). Accordingly, in the TreeTaggerLemmatizer, '<unknown>' is replaced with None so that the backoff tagger can work properly.
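The backoff idea itself can be sketched in a few lines. This is a minimal illustration, not CLTK code: the DictLemmatizer class here is hypothetical, standing in for any sublemmatizer that returns None on a miss so the chain can fall through to the next one.

```python
# Minimal sketch of sequential backoff (hypothetical DictLemmatizer,
# not part of CLTK): a lemmatizer returns None on a miss, which tells
# the chain to try the next lemmatizer in line.

class DictLemmatizer:
    def __init__(self, lemmas, backoff=None):
        self.lemmas = lemmas      # token -> lemma lookup table
        self.backoff = backoff    # next lemmatizer to try on a miss

    def lemmatize_token(self, token):
        lemma = self.lemmas.get(token)  # None when the token is unknown
        if lemma is None and self.backoff is not None:
            return self.backoff.lemmatize_token(token)
        return lemma

second = DictLemmatizer({'vabo': 'vo'})
first = DictLemmatizer({'est': 'sum'}, backoff=second)

print(first.lemmatize_token('est'))   # found by the first lemmatizer
print(first.lemmatize_token('vabo'))  # falls through to the second
print(first.lemmatize_token('brilgum'))  # no match anywhere: None
```

Replacing TreeTagger's '<unknown>' with None is what lets the TreeTaggerLemmatizer participate in exactly this kind of chain.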

When used on its own, the results are what you would expect from TreeTagger. This is shown in the first example below using the first paragraph of Sallust's Bellum Catilinae.

The next two examples—using the nonsense Latin text of a Jabberwocky translation—show the flexibility of the Backoff Lemmatizer and how the TreeTaggerLemmatizer results can be improved upon.

In the first example, we create a chain of TreeTaggerLemmatizer and RegexpLemmatizer and see that a nonsense word like vabo is reasonably(?) lemmatized to vo, because the Latin regex substitution patterns include -abo > -o. Vabo is the hapaxest of hapax legomena—it has never appeared in another Latin text and so will not be found in any lemma dictionary. (It is also admittedly wrong—vabo here is clearly the ablative of a noun vabus or vabum in the phrase in vabo. We will fix this in a future post by combining lemmatization with POS-tagging.)
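The -abo > -o substitution can be illustrated with a small self-contained sketch. The pattern list below is illustrative only (it is not the actual latin_sub_patterns list from CLTK), but it shows the mechanism: try each regex against the token and, on a match, return the substituted form; otherwise return None so the backoff chain can continue.

```python
import re

# Illustrative regex-substitution lemmatizer; the single pattern here
# mimics the -abo > -o rule, but is not the real CLTK pattern list.
patterns = [(r'(\w+)abo$', r'\1o')]

def regex_lemmatize(token):
    for pattern, repl in patterns:
        if re.search(pattern, token):
            return re.sub(pattern, repl, token)
    return None  # no pattern matched; let the backoff chain continue

print(regex_lemmatize('vabo'))  # -abo ending rewritten to -o
print(regex_lemmatize('tovi'))  # no match, so None
```

This is why vabo comes back as vo: the rule fires on the ending alone, with no knowledge that the token is a noun in an ablative phrase.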

The next example expands the backoff chain to include additional regex patterns and a custom lemma dictionary (using the UnigramLemmatizer) to fully lemmatize the first ten tokens of the Jabberwocky.

The TreeTaggerLemmatizer works well and I plan to introduce it to the CLTK with the next update to the lemmatize class. It of course requires TreeTagger itself to be installed, as well as the treetagger-python package. [PJB 5.6.18]

In [1]:
# Imports

from nltk.tag.sequential import SequentialBackoffTagger

from cltk.tokenize.word import WordTokenizer

from treetagger import TreeTagger

from pprint import pprint
In [2]:
# Create instances of CLTK tools

tokenizer = WordTokenizer('latin')
In [3]:
# Create TreeTaggerLemmatizer as subclass of NLTK's Sequential Backoff Tagger

class TreeTaggerLemmatizer(SequentialBackoffTagger):
    """"""
    def __init__(self, backoff=None):
        """Setup for TreeTaggerLemmatizer()."""
        SequentialBackoffTagger.__init__(self, backoff)
        self.tagger = TreeTagger(language='latin') # Error trap to see if module is installed!
        self._lemmas = []
        
        
    def choose_tag(self, tokens, index, history):
        """Return the lemma at the given index in the _lemmas list
        created by TreeTagger in lemmatize.
        :param tokens: List of tokens to be lemmatized
        :param index: Int, index of the current token
        :param history: List of tokens that have already been lemmatized
        :return: String, spec. the lemma found at the current index.
        """
        return self._lemmas[index]    
    
    def lemmatize(self, tokens):
        """Run TreeTagger over the joined tokens, store its lemmas
        (mapping '<unknown>' to None), then tag via the backoff chain.
        :param tokens: List of tokens to be lemmatized
        :return: List of (token, lemma) tuples
        """
        text = " ".join([token.lower() for token in tokens])
        lemmas = []
        for _, _, lemma in self.tagger.tag(text):
            if lemma == '<unknown>':
                lemmas.append(None)
            else:
                lemmas.append(lemma.split('|')[0])
        self._lemmas = lemmas
        return self.tag(tokens)
            
In [4]:
# Create instance of lemmatizer

lemmatizer = TreeTaggerLemmatizer()
In [5]:
# Sample text

text = """Omnis homines qui sese student praestare ceteris animalibus, summa ope niti decet, ne vitam silentio transeant veluti pecora, quae natura prona atque ventri oboedientia finxit. Sed nostra omnis vis in animo et corpore sita est: animi imperio, corporis servitio magis utimur; alterum nobis cum dis, alterum cum beluis commune est. Quo mihi rectius videtur ingeni quam virium opibus gloriam quaerere et, quoniam vita ipsa, qua fruimur, brevis est, memoriam nostri quam maxume longam efficere. Nam divitiarum et formae gloria fluxa atque fragilis est, virtus clara aeternaque habetur. Sed diu magnum inter mortalis certamen fuit, vine corporis an virtute animi res militaris magis procederet. Nam et, prius quam incipias, consulto et, ubi consulueris, mature facto opus est. Ita utrumque per se indigens alterum alterius auxilio eget.
"""
In [6]:
%%time

# Get lemmas

lemma_pairs = lemmatizer.lemmatize(tokenizer.tokenize(text))
CPU times: user 34.6 ms, sys: 16.1 ms, total: 50.7 ms
Wall time: 6.37 s
In [7]:
# Print sample

pprint(lemma_pairs[:10])
[('Omnis', 'omnis'),
 ('homines', 'homo'),
 ('qui', 'qui'),
 ('sese', 'sui'),
 ('student', None),
 ('praestare', 'praesto'),
 ('ceteris', 'ceterus'),
 ('animalibus', 'animal'),
 (',', ','),
 ('summa', 'summus')]
In [8]:
# Another sample text

text = """ Est brilgum: tovi slimici
In vabo tererotitant
Brogovi sunt macresculi
Momi rasti strugitant.

"Fuge Gabrobocchia, fili mi,
Qui fero lacerat morsu:
Diffide Iubiubae avi
Es procul ab Unguimanu."""
In [9]:
%%time

# Get lemmas

lemma_pairs = lemmatizer.lemmatize(tokenizer.tokenize(text))
CPU times: user 9.78 ms, sys: 13.1 ms, total: 22.9 ms
Wall time: 7.26 s
In [10]:
# Print sample

pprint(lemma_pairs[:10])
[('Est', 'sum'),
 ('brilgum', None),
 (':', ':'),
 ('tovi', None),
 ('slimici', None),
 ('In', 'in'),
 ('vabo', None),
 ('tererotitant', None),
 ('Brogovi', None),
 ('sunt', 'sum')]

Note here that only words like 'est' and 'in' (and punctuation) are lemmatized on a first pass. These are of course the functional vocabulary necessary to keep the nonsense Latin recognizable as Latin at all.

In [11]:
# Import additional lemmatizers/resources

from cltk.lemmatize.backoff import UnigramLemmatizer, RegexpLemmatizer
from cltk.lemmatize.latin.latin import latin_sub_patterns
In [12]:
# Set up lemmatizer with backoff chain

backoff = RegexpLemmatizer(latin_sub_patterns, backoff=None)
lemmatizer = TreeTaggerLemmatizer(backoff=backoff)
In [13]:
%%time

# Get lemmas

lemma_pairs = lemmatizer.lemmatize(tokenizer.tokenize(text))
CPU times: user 11.7 ms, sys: 11.3 ms, total: 22.9 ms
Wall time: 7.3 s
In [14]:
# Print sample

pprint(lemma_pairs[:10])
[('Est', 'sum'),
 ('brilgum', None),
 (':', ':'),
 ('tovi', None),
 ('slimici', None),
 ('In', 'in'),
 ('vabo', 'vo'),
 ('tererotitant', None),
 ('Brogovi', None),
 ('sunt', 'sum')]

Note how the introduction of the RegexpLemmatizer to the backoff chain has now returned the result vo for vabo. (As I mention in the introduction, this is an incorrect result, but a result nonetheless.)

In [15]:
# Set up a more expansive backoff chain

r = RegexpLemmatizer([
    ('(.)(ant)$', '\\1o'), 
    ('(.)(um)$', '\\1us'),
    ('(.)(i)$', '\\1us')
])
u = UnigramLemmatizer(model={'Momi': 'momus'}, backoff=r)
backoff = RegexpLemmatizer(latin_sub_patterns, backoff=u)
lemmatizer = TreeTaggerLemmatizer(backoff=backoff)
In [16]:
%%time

# Get lemmas

lemma_pairs = lemmatizer.lemmatize(tokenizer.tokenize(text))
CPU times: user 11.8 ms, sys: 12.6 ms, total: 24.4 ms
Wall time: 6.51 s
In [17]:
# Print sample

pprint(lemma_pairs[:10])
[('Est', 'sum'),
 ('brilgum', 'brilgus'),
 (':', ':'),
 ('tovi', 'tovus'),
 ('slimici', 'slimicus'),
 ('In', 'in'),
 ('vabo', 'vo'),
 ('tererotitant', 'tererotito'),
 ('Brogovi', 'Brogovus'),
 ('sunt', 'sum')]

Note that all of the first ten words of the Jabberwocky translation have now been lemmatized, plausibly, if not correctly (whatever correctly means in lemmatizing nonsense poetry).