Latin Lemmatization with TreeTagger

This is the first installment in a series of posts/notebooks on Latin lemmatization that will cover: 1. introductions to existing options for Latin lemmatization (esp. those available for Python); 2. comparisons/benchmarks for existing lemmatizers; and 3. development notes on the CLTK Latin Backoff Lemmatizer and related projects.

TreeTagger is a probabilistic, decision tree-based part-of-speech tagger written by Helmut Schmid in 1994. It is described in this paper. Though originally written for German tagging, parameter files have since been written for a number of languages including Latin. This notebook uses G. Brandolini's parameter file which is based on a number of sources for Latin lexical and morphological data: PROIEL data, Perseus data, Index Thomisticus data and Whitaker's Words.

Lemmatization is a by-product of TreeTagger's POS tagging, but a useful one. TreeTagger runs quickly, performs well, and has two Python wrappers: treetaggerwrapper and treetagger-python. This notebook introduces both, gives example workflows, and reports some execution times. The last section of this post offers help with installing and configuring TreeTagger on OSX. [PJB 5.4.18]

In [1]:
# # Install TreeTagger

# # See the last cell of this notebook for installation information.
In [2]:
# # Install treetaggerwrapper

# !pipenv install treetaggerwrapper

# # See docs for more information:
# #
# # Some installation help for TreeTagger at the bottom of this notebook

Working with treetaggerwrapper

In [3]:
# Imports

import treetaggerwrapper

from pprint import pprint
In [4]:
# Create Latin tagger

tagger = treetaggerwrapper.TreeTagger(TAGLANG='la')
In [5]:
# Set up test text

# Sall. Bell. Cat. 1
text = """Omnis homines, qui sese student praestare ceteris animalibus, summa ope niti decet, ne vitam silentio transeant veluti pecora, quae natura prona atque ventri oboedientia finxit. Sed nostra omnis vis in animo et corpore sita est: animi imperio, corporis servitio magis utimur; alterum nobis cum dis, alterum cum beluis commune est. Quo mihi rectius videtur ingeni quam virium opibus gloriam quaerere et, quoniam vita ipsa, qua fruimur, brevis est, memoriam nostri quam maxume longam efficere. Nam divitiarum et formae gloria fluxa atque fragilis est, virtus clara aeternaque habetur. Sed diu magnum inter mortalis certamen fuit, vine corporis an virtute animi res militaris magis procederet. Nam et, prius quam incipias, consulto et, ubi consulueris, mature facto opus est. Ita utrumque per se indigens alterum alterius auxilio eget."""
In [6]:
%%time

# Tag with treetagger

print(f'Tagging {len(text.split())} tokens...')
tags = tagger.tag_text(text)
Tagging 125 tokens...
CPU times: user 11.5 ms, sys: 8.72 ms, total: 20.2 ms
Wall time: 2.81 s
In [7]:
# View output from tagger

In [8]:
# View output from tagger, delimited by tab

for tag in tags[:10]:
    print(tag.split('\t'))
['Omnis', 'PRON', 'omnis']
['homines', 'N:nom', 'homo']
[',', 'PUN', ',']
['qui', 'REL', 'qui']
['sese', 'PRON', 'sui']
['student', 'V:IND', 'studeo']
['praestare', 'V:INF', 'praesto']
['ceteris', 'ADJ:abl', 'ceterus']
['animalibus', 'N:abl', 'animal|animalis']
[',', 'PUN', ',']
In [9]:
# Format output from tagger as tuples

tags_tuples = treetaggerwrapper.make_tags(tags)
pprint(tags_tuples[:10])
[Tag(word='Omnis', pos='PRON', lemma='omnis'),
 Tag(word='homines', pos='N:nom', lemma='homo'),
 Tag(word=',', pos='PUN', lemma=','),
 Tag(word='qui', pos='REL', lemma='qui'),
 Tag(word='sese', pos='PRON', lemma='sui'),
 Tag(word='student', pos='V:IND', lemma='studeo'),
 Tag(word='praestare', pos='V:INF', lemma='praesto'),
 Tag(word='ceteris', pos='ADJ:abl', lemma='ceterus'),
 Tag(word='animalibus', pos='N:abl', lemma='animal|animalis'),
 Tag(word=',', pos='PUN', lemma=',')]
In [10]:
# Format output as (token, lemma)

lemma_pairs = [(token, lemma) for token, _, lemma in tags_tuples]
pprint(lemma_pairs[:10])
[('Omnis', 'omnis'),
 ('homines', 'homo'),
 (',', ','),
 ('qui', 'qui'),
 ('sese', 'sui'),
 ('student', 'studeo'),
 ('praestare', 'praesto'),
 ('ceteris', 'ceterus'),
 ('animalibus', 'animal|animalis'),
 (',', ',')]
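
Note that ambiguous lemmas come back joined with a pipe, as in 'animal|animalis' above. Here is a minimal sketch for handling them, assuming (token, lemma) pairs in the form just shown. The helper name `split_lemma` and the "keep the first candidate" policy are illustrative choices of mine, not part of either wrapper.

```python
# Hypothetical helper (not part of treetaggerwrapper or treetagger-python)

def split_lemma(lemma, sep='|'):
    """Return the candidate lemmas contained in a pipe-joined lemma string."""
    candidates = [c for c in lemma.split(sep) if c]
    # Fall back to the raw string, e.g. when the lemma is itself a '|' token
    return candidates or [lemma]

# Sample pairs copied from the output above
sample_pairs = [('homines', 'homo'), ('animalibus', 'animal|animalis')]

# Keep every candidate lemma ...
candidates = [(token, split_lemma(lemma)) for token, lemma in sample_pairs]

# ... or, more crudely, keep only the first one
first_only = [(token, split_lemma(lemma)[0]) for token, lemma in sample_pairs]

print(candidates)
print(first_only)
```

Which policy makes sense depends on the downstream task; keeping all candidates defers the decision.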

Working with treetagger-python

In [11]:
# # Install treetagger-python

# # Working off a fork of treetagger-python since the main package does not yet support 'latin'

# !pipenv install git+[email protected]#egg=treetagger-python

# # Also, add to .bash_profile (vel sim):
# # export TREETAGGER_HOME='/path/to/your/TreeTagger/cmd/'

# # See docs for more information:
# #

# # Some installation help for treetagger at the bottom of this notebook
In [12]:
# Imports

from treetagger import TreeTagger
In [13]:
# Create Latin tagger

tagger = TreeTagger(language='latin')
In [14]:
%%time

# Tag with treetagger-python

print(f'Tagging {len(text.split())} tokens...')
tags_list = tagger.tag(text)
Tagging 125 tokens...
CPU times: user 4.65 ms, sys: 7.54 ms, total: 12.2 ms
Wall time: 2.61 s
In [15]:
# View output from treetagger-python

pprint(tags_list[:10])
[['Omnis', 'PRON', 'omnis'],
 ['homines', 'N:nom', 'homo'],
 [',', 'PUN', ','],
 ['qui', 'REL', 'qui'],
 ['sese', 'PRON', 'sui'],
 ['student', 'V:IND', 'studeo'],
 ['praestare', 'V:INF', 'praesto'],
 ['ceteris', 'ADJ:abl', 'ceterus'],
 ['animalibus', 'N:abl', 'animal|animalis'],
 [',', 'PUN', ',']]
In [16]:
# Make a lemma pair list for treetagger-python output

lemma_pairs_2 = [(token, lemma) for token, _, lemma in tags_list]

Since both wrappers call the same TreeTagger installation, we should expect their output to be the same. The thing is...

In [17]:
# Compare output

unks = []

for i, pair in enumerate(lemma_pairs):
    if pair != lemma_pairs_2[i]:
        unks.append((pair, lemma_pairs_2[i]))

print(f'There were {len(unks)} lemma pairs that did not match. Here are the first five:')
print(unks[:5])
There were 1 lemma pairs that did not match. Here are the first five:
[(('aeternaque', 'aeternaque'), ('aeternaque', '<unknown>'))]

While it is true that both wrappers use the same TreeTagger installation, they run it with slightly different options. treetaggerwrapper returns the token itself as the lemma when no match is found (cf. running TreeTagger on the command line with the flag '-no-unknown'), while treetagger-python returns '<unknown>' in this case.

We can adjust for this by running treetaggerwrapper with different parameters, specifically by setting TAGOPT so that it does not include the '-no-unknown' flag.
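
The adjustment can also be made in the other direction, after tagging, by replacing '<unknown>' with the token itself. This is a minimal sketch, assuming lemma pairs in the (token, lemma) form used above; `fill_unknown` is a hypothetical helper, not something either wrapper provides.

```python
UNKNOWN = '<unknown>'  # lemma placeholder seen in the mismatch above

def fill_unknown(pairs, unknown=UNKNOWN):
    """Return (token, lemma) pairs with the placeholder replaced by the token."""
    return [(token, token if lemma == unknown else lemma)
            for token, lemma in pairs]

# Sample pairs taken from the output above
pairs_with_unknown = [('homines', 'homo'), ('aeternaque', '<unknown>')]

print(fill_unknown(pairs_with_unknown))
# prints: [('homines', 'homo'), ('aeternaque', 'aeternaque')]
```

Applied to the treetagger-python pairs, this should bring them in line with treetaggerwrapper's default behavior, though I have only checked it against the single mismatch shown here.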

In [18]:
# Create Latin tagger to return '<unknown>'; rerun

tagger = treetaggerwrapper.TreeTagger(TAGLANG='la', TAGOPT='-token -lemma -sgml -quiet')
tags = tagger.tag_text(text)
tags_tuples = treetaggerwrapper.make_tags(tags)
lemma_pairs = [(token, lemma) for token, _, lemma in tags_tuples]
In [19]:
# Compare output again

lemma_pairs == lemma_pairs_2


In [20]:
# Tagging performance on the entirety of Sallust's *Bellum Catilinae*

from cltk.corpus.latin import latinlibrary
bc = latinlibrary.raw('sall.1.txt')
bc = bc[bc.find('[1]'):bc.find('Sallust The Latin Library The Classics Page')]
In [21]:
# Script for preprocessing texts

import html
import re
import string
from cltk.stem.latin.j_v import JVReplacer

def preprocess(text):
    replacer = JVReplacer()
    text = html.unescape(text) # Handle html entities
    text = re.sub(r'&nbsp;?', ' ',text) #&nbsp; stripped incorrectly in corpus?
    text = re.sub(r'\x00',' ',text) #Another space problem?
    text = text.lower()
    text = replacer.replace(text) #Normalize u/v & i/j    
    punctuation = "\"#$%&'()*+,-/:;<=>@[\\]^_`{|}~.?!«»—"
    translator = str.maketrans({key: " " for key in punctuation})
    text = text.translate(translator)
    translator = str.maketrans({key: " " for key in '0123456789'})
    text = text.translate(translator)
    text = re.sub(r'[ ]+', ' ', text) # Remove double spaces
    text = re.sub(r'\s+\n+\s+', '\n', text) # Remove double lines and trim spaces around new lines
    return text.strip()
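
To see what the punctuation and whitespace steps do in isolation, here is a reduced sketch that leaves out the CLTK-dependent j/v normalization and the HTML handling. The function name `strip_punctuation` and the shortened punctuation set are assumptions made for illustration only.

```python
import re

PUNCTUATION = '.,;:!?()[]"\''  # shortened set, for illustration only

def strip_punctuation(text):
    """Lowercase, map punctuation to spaces, and collapse runs of spaces."""
    translator = str.maketrans({char: ' ' for char in PUNCTUATION})
    text = text.lower().translate(translator)
    return re.sub(r' +', ' ', text).strip()

sample = 'Omnis homines, qui sese student praestare ceteris animalibus,'
print(strip_punctuation(sample))
# prints: omnis homines qui sese student praestare ceteris animalibus
```

Mapping punctuation to spaces rather than deleting it keeps words apart where the source has punctuation but no space between them.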
In [22]:
# Preprocess text

bc = preprocess(bc)
In [23]:
%%time

tagger1 = treetaggerwrapper.TreeTagger(TAGLANG='la')
print(f'Tagging {len(bc.split())} tokens with treetaggerwrapper...')
tags = tagger1.tag_text(bc)
Tagging 10665 tokens with treetaggerwrapper...
CPU times: user 539 ms, sys: 177 ms, total: 716 ms
Wall time: 3.25 s
In [24]:
%%time

tagger2 = TreeTagger(language='latin')
print(f'Tagging {len(bc.split())} tokens with treetagger-python...')
tags_list = tagger2.tag(bc)
Tagging 10665 tokens with treetagger-python...
CPU times: user 76.3 ms, sys: 38.7 ms, total: 115 ms
Wall time: 2.77 s

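treetagger-python seems to run a bit quicker here (2.77 s vs. 3.25 s wall time on the full text, and 2.61 s vs. 2.81 s on the short passage), though the difference is small.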
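
The timings above are single runs inside the notebook. Outside a notebook, a comparable wall-clock measurement can be sketched with `time.perf_counter`. The `time_call` helper and the stand-in `fake_tag` function are hypothetical; in practice one would pass `tagger1.tag_text` or `tagger2.tag` together with `bc`.

```python
import time

def time_call(func, *args, repeats=3):
    """Call func(*args) `repeats` times; return (best wall time in s, last result)."""
    best = float('inf')
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return best, result

# Stand-in for a tagger call so the sketch runs without TreeTagger installed
def fake_tag(text):
    return [[token, 'POS', token] for token in text.split()]

elapsed, tagged = time_call(fake_tag, 'alterum alterius auxilio eget')
print(f'{len(tagged)} tokens in {elapsed:.6f} s')
```

Repeating the measurement matters here, since the two wrappers differ by only a few tenths of a second in the runs above.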

Help with installing TreeTagger

The installation instructions for TreeTagger (at least on OSX) are reasonably clear. What I offer here is primarily documentation of how I prefer to install TreeTagger, with specific attention to working with Latin.

  1. Download all of the TreeTagger files, i.e. (again for OSX):
    • tree-tagger-MacOSX-3.2.tar.gz
    • tagger-scripts.tar.gz
  2. Download the Latin parameters file. NB: There are two Latin files; for this notebook I am using G. Brandolini's file (latin-par-linux-3.2.bin.gz).
  3. Unzip tree-tagger-MacOSX-3.2.tar.gz
  4. Rename this folder treetagger and put the other three (3) files inside. You should not unzip the other files.
  5. Move this folder to /usr/local/bin; a command like mv ./treetagger /usr/local/bin should work.
  6. Change directory to /usr/local/bin/treetagger and run the install script, i.e. sh
  7. You should be all set now—try it out with the following:
    • echo 'Salve munde!' | cmd/tree-tagger-latin
    • Output
        Salve   V:IMP   salveo
        munde   N:voc   mundus
        !   SENT    !
  8. It is probably a good idea to add TreeTagger's location to PATH.
    • Open ~/.bash_profile (or the appropriate file for whatever shell you are using) and add:
      • export PATH=/usr/local/bin/treetagger/cmd:/usr/local/bin/treetagger/bin:$PATH
    • treetagger-python also requires that you add the following line to ~/.bash_profile:
      • export TREETAGGER_HOME='/usr/local/bin/treetagger/cmd/'
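
A quick way to confirm from Python that the settings in step 8 are visible is sketched below. The function name `check_treetagger_setup` is hypothetical; it only inspects PATH and the TREETAGGER_HOME variable and does not run TreeTagger itself.

```python
import os
import shutil

def check_treetagger_setup(command='tree-tagger-latin'):
    """Report whether the TreeTagger command and TREETAGGER_HOME are visible."""
    home = os.environ.get('TREETAGGER_HOME')
    return {
        'command_on_path': shutil.which(command),  # None if not found on PATH
        'treetagger_home': home,                   # None if the variable is unset
        'home_exists': bool(home) and os.path.isdir(home),
    }

for key, value in check_treetagger_setup().items():
    print(f'{key}: {value}')
```

If `command_on_path` is None or `home_exists` is False, the shell profile changes have probably not been picked up by the process running the notebook.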

TreeTagger should now work as expected in the notebook above. If you notice any problems with the installation instructions, please open an issue in this repo. [PJB]