Latin Lemmatization with Collatinus

Collatinus is a lemmatizer and morphological analyser for Latin texts developed by Yves Ouvrard and Philippe Verkerk. The lemmatizer derives its results from (1) a lexicon of 11,000 Latin lemmas and (2) an inflection engine that generates possible forms. It is available as a standalone app for macOS and also has a web version. Collatinus is released under the GNU GPL v3 license.

A Python version of Collatinus, PyCollatinus, is now available, ported from the original C++ by Thibault Clérice. It returns a wealth of useful morphological information, and it is fast.

In this post, I demonstrate its basic use and suggest some provisional strategies for correctly matching a token with its lemma. The challenge in matching tokens and lemmas is that PyCollatinus returns as many analyses as possible for a given form, so we need some way of choosing one among several options. Here is a sample result for homines:

[{'form': 'homines',
  'lemma': 'homo',
  'morph': 'nominatif pluriel',
  'radical': 'homin',
  'desinence': 'es'},
 {'form': 'homines',
  'lemma': 'homo',
  'morph': 'vocatif pluriel',
  'radical': 'homin',
  'desinence': 'es'},
 {'form': 'homines',
  'lemma': 'homo',
  'morph': 'accusatif pluriel',
  'radical': 'homin',
  'desinence': 'es'}]
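Since each analysis carries a 'lemma' key, collapsing a result to the set of distinct lemmas shows at a glance whether a form is ambiguous at the lemma level. A minimal sketch on the homines analyses above (the list of dicts is copied from the output, trimmed to the relevant keys):

```python
# Analyses for 'homines' as returned above, trimmed for brevity
analyses = [
    {'form': 'homines', 'lemma': 'homo', 'morph': 'nominatif pluriel'},
    {'form': 'homines', 'lemma': 'homo', 'morph': 'vocatif pluriel'},
    {'form': 'homines', 'lemma': 'homo', 'morph': 'accusatif pluriel'},
]

# Collapse the analyses to the set of distinct lemmas
distinct_lemmas = {a['lemma'] for a in analyses}
print(distinct_lemmas)  # {'homo'}
```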

There is no difficulty here, as all three possible analyses resolve to the same lemma, homo. Let's look at another form, prona:

[{'form': 'prona',
  'lemma': 'pronus',
  'morph': 'nominatif féminin singulier',
  'radical': 'pron',
  'desinence': 'a'},
 {'form': 'prona',
  'lemma': 'pronus',
  'morph': 'vocatif féminin singulier',
  'radical': 'pron',
  'desinence': 'a'},
 {'form': 'prona',
  'lemma': 'pronus',
  'morph': 'ablatif féminin singulier',
  'radical': 'pron',
  'desinence': 'a'},
 {'form': 'prona',
  'lemma': 'pronus',
  'morph': 'nominatif neutre pluriel',
  'radical': 'pron',
  'desinence': 'a'},
 {'form': 'prona',
  'lemma': 'pronus',
  'morph': 'vocatif neutre pluriel',
  'radical': 'pron',
  'desinence': 'a'},
 {'form': 'prona',
  'lemma': 'pronus',
  'morph': 'accusatif neutre pluriel',
  'radical': 'pron',
  'desinence': 'a'},
 {'form': 'prona',
  'lemma': 'prono',
  'morph': '2ème singulier impératif présent actif',
  'radical': 'pron',
  'desinence': 'a'}]

Here we need to disambiguate between the very common pronus and the very uncommon prono. But without introducing corpus information, that is, by looking only at this result, we have no way of knowing what is common and what is uncommon. Accordingly, for this experiment, I have decided to pick the lemma that appears most often among the analyses. This works (in an overdetermined way) for homines. It works for a word like student, which resolves to only one lemma, studeo. It happens to work for pronus. But if we inspect the results, there are plenty of suspect cases as well: magis from magus, natura from nascor, and the infamous est from edo. In future posts, I will go through sounder solutions to disambiguation (weighting results will be involved, though not quite like this).
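The most-frequent-lemma heuristic can be expressed with a `Counter` over the lemma column of a single result. A minimal sketch using the prona analyses above (six pronus readings, one prono):

```python
from collections import Counter

# Lemmas drawn from the seven 'prona' analyses shown above
lemmas = ['pronus'] * 6 + ['prono']

# Count each lemma and take the most frequent one
counts = Counter(lemmas)
best_lemma, freq = counts.most_common(1)[0]
print(best_lemma, freq)  # pronus 6
```

This is equivalent to the max-weight selection used later in the notebook; `most_common(1)` simply spares us computing percentages first.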

Lastly, a note about punctuation. Unlike TreeTagger, Collatinus does not include punctuation in the tokens to be lemmatized. Since one of the goals of these posts is to compare lemmatizers, we need to address this difference. Below I show one strategy for reintroducing the dropped punctuation so that we have comparable lemmatizer results between platforms. [PJB 5.7.18]

In [1]:
# Imports

from pycollatinus import Lemmatiseur

from cltk.tokenize.word import WordTokenizer

from pprint import pprint
In [2]:
%%capture --no-display 
# ^^^ Ignore cell-specific warnings ^^^

# Set up lemmatizer

analyzer = Lemmatiseur()
In [3]:
# Set up test text

# Sall. Bell. Cat. 1
text = """Omnis homines, qui sese student praestare ceteris animalibus, summa ope niti decet, ne vitam silentio transeant veluti pecora, quae natura prona atque ventri oboedientia finxit. Sed nostra omnis vis in animo et corpore sita est: animi imperio, corporis servitio magis utimur; alterum nobis cum dis, alterum cum beluis commune est. Quo mihi rectius videtur ingeni quam virium opibus gloriam quaerere et, quoniam vita ipsa, qua fruimur, brevis est, memoriam nostri quam maxume longam efficere. Nam divitiarum et formae gloria fluxa atque fragilis est, virtus clara aeternaque habetur. Sed diu magnum inter mortalis certamen fuit, vine corporis an virtute animi res militaris magis procederet. Nam et, prius quam incipias, consulto et, ubi consulueris, mature facto opus est. Ita utrumque per se indigens alterum alterius auxilio eget.
"""
In [4]:
# Create instances of CLTK tools

tokenizer = WordTokenizer('latin')
tokens = tokenizer.tokenize(text)
text_string = " ".join(tokens)
In [5]:
# Get length of token list

print(f'There are {len(tokens)} tokens in the sample text.')
There are 151 tokens in the sample text.
In [6]:
%%time

# Get Collatinus results

results = analyzer.lemmatise_multiple(text_string)
CPU times: user 83.5 ms, sys: 3.97 ms, total: 87.5 ms
Wall time: 88.5 ms
In [7]:
# Print sample of result

results[20]
Out[7]:
[{'form': 'prona',
  'lemma': 'pronus',
  'morph': 'nominatif féminin singulier',
  'radical': 'pron',
  'desinence': 'a'},
 {'form': 'prona',
  'lemma': 'pronus',
  'morph': 'vocatif féminin singulier',
  'radical': 'pron',
  'desinence': 'a'},
 {'form': 'prona',
  'lemma': 'pronus',
  'morph': 'ablatif féminin singulier',
  'radical': 'pron',
  'desinence': 'a'},
 {'form': 'prona',
  'lemma': 'pronus',
  'morph': 'nominatif neutre pluriel',
  'radical': 'pron',
  'desinence': 'a'},
 {'form': 'prona',
  'lemma': 'pronus',
  'morph': 'vocatif neutre pluriel',
  'radical': 'pron',
  'desinence': 'a'},
 {'form': 'prona',
  'lemma': 'pronus',
  'morph': 'accusatif neutre pluriel',
  'radical': 'pron',
  'desinence': 'a'},
 {'form': 'prona',
  'lemma': 'prono',
  'morph': '2ème singulier impératif présent actif',
  'radical': 'pron',
  'desinence': 'a'}]
In [8]:
# Limit results to only lemma info

lemmas = []

for result in results:
    _lemmas = []
    for _result in result:
        _lemmas.append(_result['lemma'])
    lemmas.append(_lemmas)
In [9]:
# Print lemma info

pprint(lemmas[:3])
[['omne',
  'omnis',
  'omnis',
  'omnis',
  'omnis',
  'omnis',
  'omnis',
  'omnis',
  'omnis',
  'omnis'],
 ['homo', 'homo', 'homo'],
 ['qui', 'qui', 'quis', 'queo', 'queo', 'qui']]
In [10]:
# A way to weight the lemma results

from collections import Counter

c = Counter(lemmas[0])
weights = [(i, c[i] / len(lemmas[0]) * 100.0) for i in c]

print(f'There are {len(weights)} different lemmas in the results for \'{tokens[0]}\'.')
for weight in weights:
    print(f'- {weight[1]}% {weight[0]}')
There are 2 different lemmas in the results for 'Omnis'.
- 10.0% omne
- 90.0% omnis
In [11]:
# Get weighted lemmas

weighted_lemmas = []

for lemma in lemmas:
    c = Counter(lemma)
    weights = [(i, c[i] / len(lemma) * 100.0) for i in c]
    weighted_lemmas.append(weights)
In [12]:
# Print weight lemmas

pprint(weighted_lemmas[:10])
[[('omne', 10.0), ('omnis', 90.0)],
 [('homo', 100.0)],
 [('qui', 50.0), ('quis', 16.666666666666664), ('queo', 33.33333333333333)],
 [('se', 100.0)],
 [('studeo', 100.0)],
 [('praesto', 100.0)],
 [('ceteri', 42.857142857142854),
  ('ceterum', 14.285714285714285),
  ('ceterus', 42.857142857142854)],
 [('animal', 25.0), ('animalis', 75.0)],
 [('summa', 23.076923076923077),
  ('summum', 23.076923076923077),
  ('summus', 46.15384615384615),
  ('summo', 7.6923076923076925)],
 [('Opis', 25.0), ('Ops', 25.0), ('ops', 25.0), ('opos', 25.0)]]
In [13]:
# Get max weight for each lemma

lemma_max = []

for weighted_lemma in weighted_lemmas:
    weight_max = max(weighted_lemma, key=lambda item: item[1])[0]
    lemma_max.append(weight_max)
In [14]:
# Print max weight

pprint(lemma_max[:10])
['omnis',
 'homo',
 'qui',
 'se',
 'studeo',
 'praesto',
 'ceteri',
 'animalis',
 'summus',
 'Opis']
In [15]:
# Compare lengths of original token list and resulting lemma list

print(f'There are {len(tokens)} tokens in the sample text, but only {len(lemma_max)} lemmas!')
There are 151 tokens in the sample text, but only 126 lemmas!

Unlike TreeTagger, Collatinus does not include punctuation in the tokens to be lemmatized, which explains the difference between our input and output lists. From what I have been able to determine, punctuation marks are the only tokens that are ignored. Accordingly, we can easily restore the punctuation by comparing the two lists.

In [16]:
# Align tokens & lemmas due to missing punctuation

from string import punctuation

lemma_pairs = []

pos = 0
for token in tokens:
    if token in punctuation:
        lemma_pairs.append((token, token))
    else:
        lemma_pairs.append((token, lemma_max[pos]))
        pos += 1
In [17]:
pprint(lemma_pairs[:50])
[('Omnis', 'omnis'),
 ('homines', 'homo'),
 (',', ','),
 ('qui', 'qui'),
 ('sese', 'se'),
 ('student', 'studeo'),
 ('praestare', 'praesto'),
 ('ceteris', 'ceteri'),
 ('animalibus', 'animalis'),
 (',', ','),
 ('summa', 'summus'),
 ('ope', 'Opis'),
 ('niti', 'nitor'),
 ('decet', 'decet'),
 (',', ','),
 ('ne', 'ne'),
 ('vitam', 'uita'),
 ('silentio', 'silentium'),
 ('transeant', 'transeo'),
 ('veluti', 'ueluti'),
 ('pecora', 'pecus'),
 (',', ','),
 ('quae', 'qui'),
 ('natura', 'nascor'),
 ('prona', 'pronus'),
 ('atque', 'atque'),
 ('ventri', 'venter'),
 ('oboedientia', 'oboedio'),
 ('finxit', 'fingo'),
 ('.', '.'),
 ('Sed', 'sed'),
 ('nostra', 'noster'),
 ('omnis', 'omnis'),
 ('vis', 'uia'),
 ('in', 'in'),
 ('animo', 'animus'),
 ('et', 'et'),
 ('corpore', 'corpus'),
 ('sita', 'sino'),
 ('est', 'edo'),
 (':', ':'),
 ('animi', 'animus'),
 ('imperio', 'imperium'),
 (',', ','),
 ('corporis', 'corpus'),
 ('servitio', 'servitium'),
 ('magis', 'magus'),
 ('utimur', 'utor'),
 (';', ';'),
 ('alterum', 'alter')]