Latin Lemmatization with Whitaker's Words

Whitaker's Words is a command-line Latin dictionary and morphological analyzer written in Ada by William Whitaker in 1993. It is a beloved digital Latin resource with several unofficial implementations and ports, and this notebook can be considered yet another example. It is also the focus of a digital preservation effort by Martin Keegan, who hosts and maintains the original code on GitHub.

There is no Python wrapper for Words that I could find. (I did find an in-progress Python port by Archimedes Digital, but its results are too uneven at this point to publish; see Appendix B of this notebook.) This notebook moves in that direction, using subprocess to write tokens to a temporary file, run Words on that file, and then parse the results. A temporary file became necessary because Words paginates stdout; writing to and reading from a file is, as far as I can tell, the most efficient way to get batch results from Words. The file I/O likely adds some processing time, but Words still runs quite fast.

Appendix A offers assistance with installation and configuration of Whitaker's Words for OSX. [PJB 5.11.18]

In [1]:
# Imports

import os
import re
import subprocess
import shlex

from collections import Counter

from cltk.tokenize.word import WordTokenizer

from pprint import pprint
In [2]:
# Constants

path = '/usr/local/bin/words'
In [3]:
# Set up tools

os.chdir(path)
In [4]:
# Create instances of CLTK tools

tokenizer = WordTokenizer('latin')

Working with Words and subprocess

In [5]:
# Set up subprocess commands

token = 'verbum'
cmd1 = f'echo {token}'
cmd2 = 'xargs words'
In [6]:
%%time

# Build subprocess pipe
# NB: I'll write up shlex and subprocess.PIPE at some point

p1 = subprocess.Popen(shlex.split(cmd1), stdout=subprocess.PIPE)
p2 = subprocess.Popen(shlex.split(cmd2), stdin=p1.stdout, stdout=subprocess.PIPE)
result = p2.communicate()[0].decode()
print(result)
verb.um              N      2 2 NOM S N                 
verb.um              N      2 2 VOC S N                 
verb.um              N      2 2 ACC S N                 
verbum, verbi  N (2nd) N   [XXXAX]  
word; proverb; [verba dare alicui => cheat/deceive someone];
*

CPU times: user 7.81 ms, sys: 17.2 ms, total: 25 ms
Wall time: 49.2 ms
In [7]:
lemmas = re.findall(r"^.*?\[.{5}\].*?$",result,re.MULTILINE)
lemmas = [item.split(',')[0] for item in lemmas]
print(lemmas)
['verbum']
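The regex above relies on the shape of Words output: inflection lines carry no brackets, while each dictionary-entry line carries a five-character flag code such as `[XXXAX]`. A self-contained sketch of the same extraction on hardcoded sample output:

```python
import re

# Sample raw Words output for the token 'verbum' (copied from the cell above)
sample = """verb.um              N      2 2 NOM S N
verb.um              N      2 2 VOC S N
verb.um              N      2 2 ACC S N
verbum, verbi  N (2nd) N   [XXXAX]
word; proverb; [verba dare alicui => cheat/deceive someone];
*"""

# Only dictionary-entry lines contain a bracketed code of exactly five
# characters, so this pattern isolates the headword lines.
entries = re.findall(r"^.*?\[.{5}\].*?$", sample, re.MULTILINE)
lemmas = [entry.split(',')[0] for entry in entries]
print(lemmas)  # ['verbum']
```

Note that the definition line's longer bracketed phrase (`[verba dare alicui => ...]`) does not match, since `.{5}` requires exactly five characters between the brackets.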
In [8]:
## Build a lemmatize function; return a list of candidate lemmas for each token

def words_lemmatize(tokens):
    from string import punctuation
    text = '\n'.join(tokens)
    
    os.makedirs('tmp', exist_ok=True)
    
    # Write the tokens to a temporary file, one per line
    with open('tmp/tmp.txt', 'w+') as f:
        f.write(text)
    
    # Run Words on the input file, writing results to a second file
    cmd1 = 'words tmp/tmp.txt tmp/tmp_out.txt'
    subprocess.run(shlex.split(cmd1))
    
    with open('tmp/tmp_out.txt') as f:
        output = f.read()

    # Split the output into one block per token
    results = output.strip().replace('*', '\n').split('\n\n')
    result_lemmas = []
    for result in results:
        # Dictionary-entry lines carry a five-character code in brackets, e.g. [XXXAX]
        _lemmas = re.findall(r"^\w.*?\[.....].*?$", result, re.MULTILINE)
        _lemmas = [re.split(r'[,| ]', item)[0] for item in _lemmas]
        result_lemmas.append(_lemmas)
    
    # Align the Words blocks with the original tokens; punctuation and
    # single-character tokens are assumed to produce no block of their own
    lemmas = []
    pos = 0
    for token in tokens:
        if token in punctuation or len(token) == 1:
            lemmas.append([token])
        elif result_lemmas[pos]:
            lemmas.append(result_lemmas[pos])
            pos += 1
        else:
            lemmas.append(None)
            pos += 1
    
    return lemmas
In [9]:
%%time

print(words_lemmatize('carpe diem , quam minimum credula postero'.split()))
[['carpo'], ['dies'], [','], ['quam', 'quam'], ['parvus'], ['credulus'], ['posterus', 'posterus']]
CPU times: user 6.68 ms, sys: 10.8 ms, total: 17.5 ms
Wall time: 55.8 ms
In [10]:
# Set up sample text

# Sall. Bell. Cat. 1
text = """Omnis homines, qui sese student praestare ceteris animalibus, summa ope niti decet, ne vitam silentio transeant veluti pecora, quae natura prona atque ventri oboedientia finxit. Sed nostra omnis vis in animo et corpore sita est: animi imperio, corporis servitio magis utimur; alterum nobis cum dis, alterum cum beluis commune est. Quo mihi rectius videtur ingeni quam virium opibus gloriam quaerere et, quoniam vita ipsa, qua fruimur, brevis est, memoriam nostri quam maxume longam efficere. Nam divitiarum et formae gloria fluxa atque fragilis est, virtus clara aeternaque habetur. Sed diu magnum inter mortalis certamen fuit, vine corporis an virtute animi res militaris magis procederet. Nam et, prius quam incipias, consulto et, ubi consulueris, mature facto opus est. Ita utrumque per se indigens alterum alterius auxilio eget."""
In [11]:
%%time

# Tokenize and lemmatize sample text

tokens = tokenizer.tokenize(text)
lemmas = words_lemmatize(tokens)
CPU times: user 17.2 ms, sys: 10.8 ms, total: 27.9 ms
Wall time: 226 ms
In [12]:
print(list(zip(tokens, lemmas))[:25])
[('Omnis', ['omnis', 'omne', 'omnis']), ('homines', ['homo']), (',', [',']), ('qui', ['queo', 'qui']), ('sese', None), ('student', ['studeo']), ('praestare', ['praesto', 'praesto', 'praesto']), ('ceteris', ['ceterus']), ('animalibus', ['animal', 'animalis', 'animalis']), (',', [',']), ('summa', ['summus', 'summa', 'summum']), ('ope', ['ops']), ('niti', ['nitor', 'nitor']), ('decet', ['decet']), (',', [',']), ('ne', ['neo', 'ne', 'ne']), ('vitam', ['vita']), ('silentio', ['silentium']), ('transeant', ['transeo']), ('veluti', ['veluti']), ('pecora', ['pecus']), (',', [',']), ('quae', None), ('natura', ['nascor', 'natura', 'naturo']), ('prona', ['pronus'])]
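Note that `Counter`, imported at the top of the notebook, has not been used yet; one natural follow-up is to collapse each ambiguous candidate list into a single lemma by taking its most frequent form, since Words repeats a lemma once per matching analysis. A minimal sketch on hypothetical input (the helper name is mine):

```python
from collections import Counter

def pick_most_common(candidates):
    # Return the most frequent candidate, or None for unmatched tokens;
    # ties are broken by first appearance (Counter preserves insertion order)
    if not candidates:
        return None
    return Counter(candidates).most_common(1)[0][0]

# Hypothetical output from words_lemmatize
lemma_lists = [['omnis', 'omne', 'omnis'], ['homo'], None, ['queo', 'qui']]
print([pick_most_common(lemmas) for lemmas in lemma_lists])
# ['omnis', 'homo', None, 'queo']
```

Frequency within the list is only a rough proxy for likelihood; a context-aware disambiguator would do better, but this keeps the output one lemma per token.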
In [13]:
# Tagging performance on the entirety of Sallust's *Bellum Catilinae*

from cltk.corpus.latin import latinlibrary
bc = latinlibrary.raw('sall.1.txt')
bc = bc[bc.find('[1]'):bc.find('Sallust The Latin Library The Classics Page')]
In [14]:
# Script for preprocessing texts

import html
import re
import string
from cltk.stem.latin.j_v import JVReplacer

def preprocess(text):
    
    replacer = JVReplacer()
    
    text = html.unescape(text) # Handle html entities
    text = re.sub(r'\xa0', ' ', text) # Non-breaking space stripped incorrectly in corpus?
    text = re.sub(r'\x00', ' ', text) # Another space problem?
        
    text = text.lower()
    text = replacer.replace(text) #Normalize u/v & i/j    
    
    punctuation ="\"#$%&\'()*+,-/:;<=>@[\]^_`{|}~.?!«»—"
    translator = str.maketrans({key: " " for key in punctuation})
    text = text.translate(translator)
    
    translator = str.maketrans({key: " " for key in '0123456789'})
    text = text.translate(translator)
    
    text = re.sub(r'[ ]+', ' ', text) # Remove double spaces
    text = re.sub(r'\s+\n+\s+', '\n', text) # Remove double lines and trim spaces around new lines
    
    return text.strip()
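The j/v normalization step can be sketched without CLTK; on lowercase text, JVReplacer amounts to mapping `j` to `i` and `v` to `u` (the function name here is mine, for illustration):

```python
def jv_replace(text):
    # Normalize orthographic variants in lowercase Latin text:
    # consonantal j -> i and v -> u, mirroring CLTK's JVReplacer
    return text.replace('j', 'i').replace('v', 'u')

print(jv_replace('veluti iam'))  # 'ueluti iam'
```

This matters for dictionary lookup because editions vary in their use of i/j and u/v; Words itself is tolerant of both, but normalizing keeps token counts consistent across texts.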
In [15]:
# Preprocess text

bc = preprocess(bc)
bc_tokens = tokenizer.tokenize(bc)
print(f'There are {len(bc_tokens)} tokens in Sallust\'s *Bellum catilinae*')
There are 10802 tokens in Sallust's *Bellum catilinae*
In [16]:
%%time

results = words_lemmatize(bc_tokens)
CPU times: user 253 ms, sys: 30.3 ms, total: 283 ms
Wall time: 14.8 s

Appendix A: Help with installing Whitaker's Words

The original Ada source for Whitaker's Words is readily available in several places online. For OSX, the best option at this point is the binary provided by David Sanson. Installation from there is pretty straightforward, especially if you have been following the other posts in this lemmatization series.

  1. Download and unzip the Words binary
  2. Move the unzipped folder to /usr/local/bin; a command like mv ./words /usr/local/bin should work.
  3. Change directory to /usr/local/bin/words.
  4. You should be all set now—try it out with the following:

    • echo verbum | xargs words
    • Output

        verb.um              N      2 2 NOM S N                 
        verb.um              N      2 2 VOC S N                 
        verb.um              N      2 2 ACC S N                 
        verbum, verbi  N (2nd) N   [XXXAX]  
        word; proverb; [verba dare alicui => cheat/deceive someone];
        *

Words should now work as expected in the Notebooks above. If you notice any problems with the installation instructions, please open an issue in this repo.—PJB

Appendix B: open_words results

In [17]:
# ! pipenv install git+https://github.com/ArchimedesDigital/open_words.git#egg=open_words

from open_words.parse import Parse

parser = Parse()
parser.parse_line('carpe diem')
Out[17]:
[{'word': 'carpe',
  'defs': [{'orth': ['carpa', 'carpae'],
    'senses': ['carp', '(Erasmus)'],
    'infls': [{'ending': 'e',
      'pos': 'noun',
      'form': {'declension': 'nominative',
       'number': 'singular',
       'gender': 'C'}},
     {'ending': 'e',
      'pos': 'noun',
      'form': {'declension': 'vocative', 'number': 'singular', 'gender': 'C'}},
     {'ending': 'e',
      'pos': 'noun',
      'form': {'declension': 'ablative', 'number': 'singular', 'gender': 'C'}},
     {'ending': 'e',
      'pos': 'verb',
      'form': {'tense': 'present',
       'voice': 'active',
       'mood': 'imperative',
       'person': 2,
       'number': 'singular'}}]}]},
 {'word': 'diem',
  'defs': [{'orth': ['dies', 'die'],
    'senses': ['day',
     'daylight',
     '(sunlit hours)',
     '(24 hours from midnight)',
     'open sky',
     'weather'],
    'infls': [{'ending': 'em',
      'pos': 'noun',
      'form': {'declension': 'accusative',
       'number': 'singular',
       'gender': 'C'}}]}]}]
In [18]:
parser.parse('posse')
Out[18]:
{'word': 'posse',
 'defs': [{'orth': ['posso', 'pote', 'potui', ''],
   'senses': ['be able, can',
    '[multum  posse => have much/more/most influence/power]'],
   'infls': [{'ending': 'e',
     'pos': 'verb',
     'form': {'tense': 'present',
      'voice': 'active',
      'mood': 'infinitive',
      'person': 0,
      'number': ''}}]}]}
In [19]:
parser.parse('sunt')
Out[19]:
{'word': 'sunt',
 'defs': [{'orth': ['sunt'],
   'senses': ['to be, exist',
    'also used to form verb perfect passive tenses with NOM PERF PPL'],
   'infls': [{'form': {'form': 'PRES ACTIVE IND 3 P'},
     'ending': '',
     'pos': 'verb'}]}]}