Latin Lemmatization with LemLat

LemLat is yet another lemmatizer and morphological tagger for Latin. It is lexicon-driven, built on a substantial base of the Oxford Latin Dictionary, Georges's Ausführliches Lateinisch-Deutsches Handwörterbuch, Gradenwitz's Laterculi Vocum Latinarum, and, added recently, the onomasticon from Forcellini's Lexicon Totius Latinitatis. As far as lexical coverage, it surpasses (from what I can tell) similar tools. It was developed in 1990 by Andrea Bozzi, Giuseppe Cappelli, and Nino Marinone, with substantial additions in 2002 by Bozzi, Marco Passarotti, and Paolo Ruffolo. Work continues through the CIRCSE Research Centre and the code for version 3.0 (which this Notebook uses) is hosted on GitHub.

There is no Python wrapper for LemLat. As with LatMor, this notebook moves in that direction, using subprocess to generate command line results and then parsing them. In a future post, I will build a wrapper for LemLat that can be used with the CLTK Backoff Lemmatizer. (NB: This runs pretty slowly. I have some ideas for speeding it up, but those will have to wait for now.)

Unlike other recent posts, I have abandoned (for the moment) the strategy of choosing a single lemma for each token. As will become apparent in the next phase of this series of lemmatizer review posts, the CLTK Backoff Lemmatizer will soon begin to incorporate weighted scores for lemmas based on all available information. Accordingly, there is no benefit to discarding possible (if incorrect) lemmas at this stage in the tagging process.

The last section of this post offers assistance with installation and configuration of LemLat for OSX. [PJB 5.11.18]

In [1]:
# # Install LemLat

# # See last cell for more information.
In [2]:
# Imports

import os
import re
import subprocess
import shlex

from collections import Counter

from cltk.tokenize.word import WordTokenizer

from pprint import pprint
In [3]:
# Constants

path = '/usr/local/bin/lemlat'
In [4]:
# Set up tools

In [5]:
# Create instances of CLTK tools

tokenizer = WordTokenizer('latin')

Working with LatMor and subprocess

In [6]:
# Set up subprocess commands

tokens = 'carpe diem , quam minimum credula postero'.split()
text = '\n'.join([token.lower() for token in tokens])
cmd1 = ['echo', text]
cmd2 = './lemlat'
In [7]:

# Set up subprocess and pipe

p1 = subprocess.Popen(cmd1, stdout=subprocess.PIPE)
p2 = subprocess.Popen(shlex.split(cmd2), stdin=p1.stdout, stdout=subprocess.PIPE)
output = p2.communicate()[0].decode()
CPU times: user 1.84 ms, sys: 6.15 ms, total: 7.99 ms
Wall time: 4.71 s
In [8]:
# Parse results

# Split into entries by token
results = re.split(r'A>', output.strip())[1:-1]

# Split into lemma lists
lemmas = [re.findall(r'LEMMA =+\n\s(\w+)', result) for result in results]
In [9]:
[['carpo', 'carpus'],
 ['dies', 'dies'],
 ['credulus', 'credulus'],
 ['postero', 'posterus', 'posterum']]
In [10]:
## Build a lemmatize function; use form frequency to return a single lemma

def lemlat_lemmatize(tokens):
    from string import punctuation
    text = '\n'.join([token.lower() for token in tokens])
    cmd1 = ['echo', text]
    p1 = subprocess.Popen(cmd1, stdout=subprocess.PIPE)
    p2 = subprocess.Popen(shlex.split(cmd2), stdin=p1.stdout, stdout=subprocess.PIPE)
    output = p2.communicate()[0].decode()
    results = re.split(r'A>', output.strip())[1:-1]
    lemmas = []
    for i, result in enumerate(results):
        if result:
            form = tokens[i]
            _lemmas = re.findall(r'LEMMA =+\n\s(\w+)', result)
            if _lemmas:
            elif form in punctuation:
    return lemmas
In [11]:

pprint(lemlat_lemmatize('carpe diem , quam minimum credula postero'.split()))
[['carpo', 'carpus'],
 ['dies', 'dies'],
 ['credulus', 'credulus'],
 ['postero', 'posterus', 'posterum']]
CPU times: user 3.85 ms, sys: 10.8 ms, total: 14.6 ms
Wall time: 4.64 s
In [12]:
# Set up sample text

# Sall. Bell. Cat. 1
text = """Omnis homines, qui sese student praestare ceteris animalibus, summa ope niti decet, ne vitam silentio transeant veluti pecora, quae natura prona atque ventri oboedientia finxit. Sed nostra omnis vis in animo et corpore sita est: animi imperio, corporis servitio magis utimur; alterum nobis cum dis, alterum cum beluis commune est. Quo mihi rectius videtur ingeni quam virium opibus gloriam quaerere et, quoniam vita ipsa, qua fruimur, brevis est, memoriam nostri quam maxume longam efficere. Nam divitiarum et formae gloria fluxa atque fragilis est, virtus clara aeternaque habetur. Sed diu magnum inter mortalis certamen fuit, vine corporis an virtute animi res militaris magis procederet. Nam et, prius quam incipias, consulto et, ubi consulueris, mature facto opus est. Ita utrumque per se indigens alterum alterius auxilio eget.
In [13]:

# Tokenize and lemmatize sample text

tokens = tokenizer.tokenize(text)
lemmas = lemlat_lemmatize(tokens)
CPU times: user 21.7 ms, sys: 13.9 ms, total: 35.6 ms
Wall time: 1min 32s
In [14]:
pprint(list(zip(tokens, lemmas))[:10])
[('Omnis', ['omnis']),
 ('homines', ['homo', 'homo']),
 (',', [',']),
 ('qui', ['queo', 'quis', 'qui']),
 ('sese', ['se']),
 ('student', ['studeo']),
 ('praestare', ['praesto']),
 ('ceteris', ['ceterus']),
 ('animalibus', ['animalis', 'animal']),
 (',', [','])]

LemLat performance on larger text

In [15]:
# Tagging performance on the entirety of Sallust's *Bellum Catilinum*

from cltk.corpus.latin import latinlibrary
bc = latinlibrary.raw('sall.1.txt')
bc = bc[bc.find('[1]'):bc.find('Sallust The Latin Library The Classics Page')]
In [16]:
# Script for preprocessing texts

import html
import re
import string
from cltk.stem.latin.j_v import JVReplacer

def preprocess(text):
    replacer = JVReplacer()
    text = html.unescape(text) # Handle html entities
    text = re.sub(r' ?', ' ',text) #  stripped incorrectly in corpus?
    text = re.sub(r'\x00',' ',text) #Another space problem?
    text = text.lower()
    text = replacer.replace(text) #Normalize u/v & i/j    
    punctuation ="\"#$%&\'()*+,-/:;<=>@[\]^_`{|}~.?!«»—"
    translator = str.maketrans({key: " " for key in punctuation})
    text = text.translate(translator)
    translator = str.maketrans({key: " " for key in '0123456789'})
    text = text.translate(translator)
    text = re.sub('[ ]+',' ', text) # Remove double spaces
    text = re.sub('\s+\n+\s+','\n', text) # Remove double lines and trim spaces around new lines
    return text.strip()
In [17]:
# Preprocess text

bc = preprocess(bc)
bc_tokens = tokenizer.tokenize(bc)
print(f'There are {len(bc_tokens)} tokens in Sallust\'s *Bellum catilinae*')
There are 10802 tokens in Sallust's *Bellum catilinae*
In [18]:

results = lemlat_lemmatize(tokens)
CPU times: user 6.48 ms, sys: 9.44 ms, total: 15.9 ms
Wall time: 1min 36s
In [19]:
[['omnis'], ['homo', 'homo'], [','], ['queo', 'quis', 'qui'], ['se'], ['studeo'], ['praesto'], ['ceterus'], ['animalis', 'animal'], [','], ['summo', 'summa', 'summus'], ['ope', 'ope', 'ops', 'opis', 'opis'], ['nitor'], ['decet'], [','], ['ne', 'ne', 'neo'], ['uita'], ['silentium', 'silentium'], ['transeo'], None, ['pecus'], [','], ['quis', 'qui'], ['natura', 'natura'], ['prono', 'pronus']]

Help with installing LemLat

The installation instructions for LemLat are available on the GitHub README. What I offer here is primarily documentation of how I prefer to install LemLat on OSX.

  1. Download and unzip the 'embedded' version of LemLat
  2. Rename and move the unzipped folder to /usr/local/bin; a command like mv ./osx_embedded /usr/local/bin/lemlat should work.
    • NB: I have not been able to run lemlat outside of /usr/local/bin, so you may need to change directory to continue.
  3. You should be all set now—try it out with the following:

    • echo laudat | ./lemlat
    • Output

        LEMLAT: latin morphological lemmatizer *
  4. There are a number of options for input and output files discussed on the GitHub README.

LemLat should now work as expected in the Notebooks above. If you notice any problems with the installation instructions, please open an issue in this repo.—PJB