Problem: You want to remove all of the macrons from a string, like the following sentence from Caesar's Bellum Gallicum.
text_with_macrons = """Gallia est omnis dīvīsa in partēs trēs, quārum ūnam incolunt Belgae, aliam Aquītānī, tertiam quī ipsōrum linguā Celtae, nostrā Gallī appellantur."""
Here are three methods for removing macrons: 1. with `replace`; 2. with `re.sub`; and 3. with `translate`. Skip to the end for the TL;DR best solution.
replace
# simple replacement
word = 'dīvīsa'
word_without_macrons = word.replace('ī', 'i')
print(f"{word} > {word_without_macrons}")
word = 'Aquītānī'
word_without_macrons = word.replace('ā', 'a').replace('ī', 'i')
print(f"{word} > {word_without_macrons}")
dīvīsa > divisa
Aquītānī > Aquitani
It would be tedious to chain together enough `replace` calls to solve this problem. So, we can create a dictionary of replacement patterns and loop over them, replacing the text on each pass.
# create dictionary of macrons
macron_map = {
'ā': 'a',
'ē': 'e',
'ī': 'i',
'ō': 'o',
'ū': 'u',
'ȳ': 'y',
'Ā': 'A',
'Ē': 'E',
'Ī': 'I',
'Ō': 'O',
'Ū': 'U',
'Ȳ': 'Y'
}
# compact method with dictionary comprehension
vowels = 'aeiouyAEIOUY'
vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'
macron_map = {k: v for k, v in zip(vowels_with_macrons, vowels)}
print(macron_map)
{'ā': 'a', 'ē': 'e', 'ī': 'i', 'ō': 'o', 'ū': 'u', 'ȳ': 'y', 'Ā': 'A', 'Ē': 'E', 'Ī': 'I', 'Ō': 'O', 'Ū': 'U', 'Ȳ': 'Y'}
# replace by iterating over dictionary
text_without_macrons = text_with_macrons
for k, v in macron_map.items():
    text_without_macrons = text_without_macrons.replace(k, v)
print(text_without_macrons)
Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.
# function for replace by iterating over dictionary
def remove_macrons_1(text_with_macrons, replacement_dictionary):
    text_without_macrons = text_with_macrons
    for k, v in replacement_dictionary.items():
        text_without_macrons = text_without_macrons.replace(k, v)
    return text_without_macrons
%time remove_macrons_1(text_with_macrons, macron_map)
CPU times: user 4 µs, sys: 1 µs, total: 5 µs Wall time: 12.2 µs
'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.'
Another option would be to do the same thing with regular expressions instead of `replace`...
# function for re.sub by iterating over dictionary
import re
def remove_macrons_2(text_with_macrons, replacement_dictionary):
    text_without_macrons = text_with_macrons
    for k, v in replacement_dictionary.items():
        text_without_macrons = re.sub(k, v, text_without_macrons)
    return text_without_macrons
%time remove_macrons_2(text_with_macrons, macron_map)
CPU times: user 4 µs, sys: 1 µs, total: 5 µs Wall time: 7.87 µs
'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.'
For a single sentence, this turns out to take about the same amount of time to run (not so with larger texts, as we see below).
translate
Another option is the `translate` method. This allows us to make all of the changes with a single translation table, without having to loop repeatedly over the original string.
# compact method with dictionary comprehension
# note that translate uses ord, i.e. the Unicode code point, as the key for each mapped character
vowels = 'aeiouyAEIOUY'
vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'
macron_table = {ord(k): v for k, v in zip(vowels_with_macrons, vowels)}
print(macron_table)
{257: 'a', 275: 'e', 299: 'i', 333: 'o', 363: 'u', 563: 'y', 256: 'A', 274: 'E', 298: 'I', 332: 'O', 362: 'U', 562: 'Y'}
# function for replacing macrons with translate
def remove_macrons_3(text_with_macrons, macron_table):
    text_without_macrons = text_with_macrons.translate(macron_table)
    return text_without_macrons
%time remove_macrons_3(text_with_macrons, macron_table)
CPU times: user 14 µs, sys: 1 µs, total: 15 µs Wall time: 19.8 µs
'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.'
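As an aside, the same `ord`-keyed table can be built with the standard library's `str.maketrans`, which converts the characters to code points for us; a minimal sketch:

```python
# str.maketrans builds a {code point: replacement} table,
# equivalent to the dictionary comprehension with ord() above
vowels = 'aeiouyAEIOUY'
vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'
macron_table = str.maketrans(vowels_with_macrons, vowels)

print('dīvīsa'.translate(macron_table))  # divisa
```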
All three methods run at about the same speed on a single sentence. But minor differences can add up as the amount of text to be processed increases. How do these recipes perform on larger texts?
# Get sample text with macrons
# Here we'll use the Dickinson College Commentaries text of Caesar's *Bellum Gallicum* (which has macrons!) as found in conventus-lex's github repo for Maccer.
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://raw.githubusercontent.com/conventus-lex/maccer/master/sources/DCC/Caesar%20-%20Selections%20from%20the%20Gallic%20War.txt'
r = session.get(url)
test = r.text
test = test[test.find('1.1'):] # remove 'metadata'
test = re.sub(r'\d\.\d+', '', test) # remove chapter headings, e.g. 1.1
print(test[2:147]) # print sample
Gallia est omnis dīvīsa in partēs trēs, quārum ūnam incolunt Belgae, aliam Aquītānī, tertiam quī ipsōrum linguā Celtae, nostrā Gallī appellantur.
print(f'This text has {len(test.split())} words.')
This text has 6399 words.
Here are the results of timeit on my iMac 2.7 GHz Intel Core i5...
%timeit -n 1000 remove_macrons_1(test, macron_map)
317 µs ± 21.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit -n 1000 remove_macrons_2(test, macron_map)
1.74 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit -n 1000 remove_macrons_3(test, macron_table)
4.74 ms ± 947 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The string method `replace`, even with its multiple passes over the string, is much faster than the other two methods.
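If we wanted to stay with regular expressions, a single compiled pattern with a callable replacement would avoid the repeated passes. This is just a sketch of that alternative, not one of the timed recipes above:

```python
import re

vowels = 'aeiouyAEIOUY'
vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'
macron_map = dict(zip(vowels_with_macrons, vowels))

# one character class matching any macron vowel; the replacement
# function looks up the plain vowel for each match
macron_pattern = re.compile('[%s]' % vowels_with_macrons)

def remove_macrons_re(text_with_macrons):
    return macron_pattern.sub(lambda m: macron_map[m.group(0)], text_with_macrons)

print(remove_macrons_re('Aquītānī'))  # Aquitani
```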
Before wrapping up a discussion about string replacement and unicode characters with diacriticals, it seems like a good time to mention decomposed and precomposed unicode characters. Note the following behavior.
word1 = 'dīvīsa'
print(len(word1))
6
word2 = 'dīvīsa'
print(len(word2))
8
print(word1 == word2)
False
These strings are not the same—word2 contains two decomposed lower-case-i-with-macrons.
print(word1.encode('unicode-escape'))
print(word2.encode('unicode-escape'))
b'd\\u012bv\\u012bsa'
b'di\\u0304vi\\u0304sa'
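`unicodedata.name` makes the difference explicit: the precomposed string has one character per vowel, while the decomposed string pairs a plain letter with a combining mark.

```python
import unicodedata

for char in '\u012b':       # precomposed i-with-macron: one character
    print(unicodedata.name(char))
for char in 'i\u0304':      # decomposed: base letter plus combining mark
    print(unicodedata.name(char))
# LATIN SMALL LETTER I WITH MACRON
# LATIN SMALL LETTER I
# COMBINING MACRON
```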
It seems like a good idea to handle these differences before attempting to replace characters. We can use unicodedata.normalize to convert all strings for replacement to Normalization Form C (NFC) before processing.
import unicodedata
word2 = unicodedata.normalize('NFC', word2)
print(len(word2))
6
# function with NFC preprocessing
def remove_macrons_1b(text_with_macrons, replacement_dictionary):
    text_without_macrons = unicodedata.normalize('NFC', text_with_macrons)
    for k, v in replacement_dictionary.items():
        text_without_macrons = text_without_macrons.replace(k, v)
    return text_without_macrons
%time remove_macrons_1b(text_with_macrons, macron_map)
CPU times: user 8 µs, sys: 2 µs, total: 10 µs Wall time: 18.8 µs
'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.'
Putting it all together we have the following function that we can use for macron replacement.
import unicodedata

def remove_macrons(text_with_macrons):
    '''Replace macrons in Latin text.'''
    vowels = 'aeiouyAEIOUY'
    vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'
    replacement_dictionary = {k: v for k, v in zip(vowels_with_macrons, vowels)}
    text_without_macrons = unicodedata.normalize('NFC', text_with_macrons)
    for k, v in replacement_dictionary.items():
        text_without_macrons = text_without_macrons.replace(k, v)
    return text_without_macrons
%timeit -n 1000 remove_macrons(test)
415 µs ± 34.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print(remove_macrons(test)[:147])
Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.
So, slightly slower with normalization, but still faster than other methods.
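As a sanity check, here is a minimal sketch of the same normalize-then-replace steps run against both the precomposed and the decomposed spellings of dīvīsa; without the NFC step, the second call would miss the decomposed macrons entirely.

```python
import unicodedata

macron_map = dict(zip('āēīōūȳĀĒĪŌŪȲ', 'aeiouyAEIOUY'))

def strip_macrons(text):
    # normalize first so decomposed i + combining macron becomes ī
    text = unicodedata.normalize('NFC', text)
    for k, v in macron_map.items():
        text = text.replace(k, v)
    return text

print(strip_macrons('d\u012bv\u012bsa'))    # precomposed: divisa
print(strip_macrons('di\u0304vi\u0304sa'))  # decomposed: divisa
```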
Problem: You want to remove all of the diacriticals from a string of Greek text, like the following sentence from Thucydides's Historiae.
text_with_diacriticals = """Θουκυδίδης Ἀθηναῖος ξυνέγραψε τὸν πόλεμον τῶν Πελοποννησίων καὶ Ἀθηναίων, ὡς ἐπολέμησαν πρὸς ἀλλήλους, ἀρξάμενος εὐθὺς καθισταμένου καὶ ἐλπίσας μέγαν τε ἔσεσθαι καὶ ἀξιολογώτατον τῶν προγεγενημένων, τεκμαιρόμενος ὅτι ἀκμάζοντές τε ᾖσαν ἐς αὐτὸν ἀμφότεροι παρασκευῇ τῇ πάσῃ καὶ τὸ ἄλλο Ἑλληνικὸν ὁρῶν ξυνιστάμενον πρὸς ἑκατέρους, τὸ μὲν εὐθύς, τὸ δὲ καὶ διανοούμενον."""
word1 = 'Θουκυδίδης' # composed
word2 = 'Θουκυδίδης' # decomposed
print(f'{word1} is {len(word1)} characters long.')
print(f'{word2} is {len(word2)} characters long.')
print(f'{word1} and {word2} are equal: {word1 == word2}.')
Θουκυδίδης is 10 characters long.
Θουκυδίδης is 11 characters long.
Θουκυδίδης and Θουκυδίδης are equal: False.
# Characters with their unicode points
for char in word2:
    print(char, char.encode('unicode-escape'))
Θ b'\\u0398'
ο b'\\u03bf'
υ b'\\u03c5'
κ b'\\u03ba'
υ b'\\u03c5'
δ b'\\u03b4'
ι b'\\u03b9'
́ b'\\u0301'
δ b'\\u03b4'
η b'\\u03b7'
ς b'\\u03c2'
# So, our plan is to strip everything like \u0301, like below...
print(u'\u03b9\u0301') # prints iota with acute accent
print(u'\u03b9\u0301'.replace(u'\u0301', '')) # prints iota
ί
ι
Note that the `combining` function in `unicodedata` returns 0 if the character is not in a "canonical combining class"...
import unicodedata
print(unicodedata.combining(u'\u0398')) # capital theta is not a combining character
print(unicodedata.combining(u'\u0301')) # the acute accent is a combining character
0
230
So we can build a map of all combining characters by iterating over the Unicode range and discarding any character for which `unicodedata.combining` returns 0.
import sys
combining_character_table = dict.fromkeys(
    c for c in range(sys.maxunicode) if unicodedata.combining(chr(c))
)
# decompose string
text_with_diacriticals = unicodedata.normalize('NFD', text_with_diacriticals)
# replace combining characters with translate
text_without_diacriticals = text_with_diacriticals.translate(combining_character_table)
# print results
print(text_without_diacriticals)
Θουκυδιδης Αθηναιος ξυνεγραψε τον πολεμον των Πελοποννησιων και Αθηναιων, ως επολεμησαν προς αλληλους, αρξαμενος ευθυς καθισταμενου και ελπισας μεγαν τε εσεσθαι και αξιολογωτατον των προγεγενημενων, τεκμαιρομενος οτι ακμαζοντες τε ησαν ες αυτον αμφοτεροι παρασκευη τη παση και το αλλο Ελληνικον ορων ξυνισταμενον προς εκατερους, το μεν ευθυς, το δε και διανοουμενον.
# function for removing diacriticals
import sys
import unicodedata
def remove_diacriticals(text_with_diacriticals):
    '''Strip all combining characters from a string.'''
    combining_character_table = dict.fromkeys(
        c for c in range(sys.maxunicode) if unicodedata.combining(chr(c))
    )
    text_with_diacriticals = unicodedata.normalize('NFD', text_with_diacriticals)
    text_without_diacriticals = text_with_diacriticals.translate(combining_character_table)
    return text_without_diacriticals
%timeit -n 100 text_without_diacriticals = remove_diacriticals(text_with_diacriticals)
390 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
print(text_without_diacriticals)
Θουκυδιδης Αθηναιος ξυνεγραψε τον πολεμον των Πελοποννησιων και Αθηναιων, ως επολεμησαν προς αλληλους, αρξαμενος ευθυς καθισταμενου και ελπισας μεγαν τε εσεσθαι και αξιολογωτατον των προγεγενημενων, τεκμαιρομενος οτι ακμαζοντες τε ησαν ες αυτον αμφοτεροι παρασκευη τη παση και το αλλο Ελληνικον ορων ξυνισταμενον προς εκατερους, το μεν ευθυς, το δε και διανοουμενον.
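Most of the time reported above is spent rebuilding the combining-character table, which scans the entire Unicode range on every call. A sketch that hoists the table out of the function and builds it once:

```python
import sys
import unicodedata

# build the table once at module level and reuse it on every call
COMBINING_CHARACTER_TABLE = dict.fromkeys(
    c for c in range(sys.maxunicode) if unicodedata.combining(chr(c))
)

def remove_diacriticals_cached(text_with_diacriticals):
    '''Strip combining characters, reusing the precomputed table.'''
    decomposed = unicodedata.normalize('NFD', text_with_diacriticals)
    return decomposed.translate(COMBINING_CHARACTER_TABLE)

print(remove_diacriticals_cached('Θουκυδίδης'))  # Θουκυδιδης
```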
aconser notes that this method could also be used for removing macrons...
import unicodedata
def remove_macrons_4(text_with_macrons):
    MACRON = u'\u0304'
    temp = unicodedata.normalize('NFD', text_with_macrons)
    text_without_macrons = temp.replace(MACRON, '')
    return unicodedata.normalize('NFC', text_without_macrons)
%timeit -n 1000 remove_macrons_4(test)
2.22 ms ± 49.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Cf.
%timeit -n 1000 remove_macrons(test)
376 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Problem: You want to extract the Greek words from a Latin text (or, really, any non-Greek text).
# Cicero Att 1.4
# http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3Atext%3A1999.02.0008%3Abook%3D1%3Aletter%3D1%3Asection%3D4
text_with_greek = """
abs te peto ut mihi hoc ignoscas et me existimes humanitate esse prohibitum ne contra amici summam existimationem miserrimo eius tempore venirem, cum is omnia sua studia et officia in me contulisset. quod si voles in me esse durior, ambitionem putabis mihi obstitisse. ego autem arbitror, etiam si id sit, mihi ignoscendum esse, “ἐπεὶ οὐχ ἱερήϊον οὐδὲ βοεΐην.” vides enim in quo cursu simus et quam omnis gratias non modo retinendas verum etiam acquirendas putemus. spero tibi me causam probasse, cupio quidem certe.
"""
We can use regular expressions to replace any non-Greek characters with a space and then split the string on that space to return a list of Greek words. We start by defining the unicode range for Greek characters...
# Define unicode character ranges: Greek and Coptic (with the preceding combining marks block) and Greek Extended
GREEK_UNICODE_RANGE = '\u0300-\u03FF'
GREEK_EXT_UNICODE_RANGE = '\u1F00-\u1FFF'
import re
%timeit -n 1000 greek_words = re.sub('[^%s%s]' % (GREEK_UNICODE_RANGE, GREEK_EXT_UNICODE_RANGE),' ', text_with_greek).split()
91.3 µs ± 3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
greek_words = re.sub('[^%s%s]' % (GREEK_UNICODE_RANGE, GREEK_EXT_UNICODE_RANGE),' ', text_with_greek).split()
print(greek_words)
print(greek_words == ['ἐπεὶ', 'οὐχ', 'ἱερήϊον', 'οὐδὲ', 'βοεΐην'])
['ἐπεὶ', 'οὐχ', 'ἱερήϊον', 'οὐδὲ', 'βοεΐην'] True
# function for extracting Greek words
import re
def extract_greek(text_with_greek):
    '''Return a list of the Greek words in a string.'''
    GREEK_UNICODE_RANGE = '\u0300-\u03FF'
    GREEK_EXT_UNICODE_RANGE = '\u1F00-\u1FFF'
    return re.sub('[^%s%s]' % (GREEK_UNICODE_RANGE, GREEK_EXT_UNICODE_RANGE), ' ', text_with_greek).split()
print(extract_greek(text_with_greek))
['ἐπεὶ', 'οὐχ', 'ἱερήϊον', 'οὐδὲ', 'βοεΐην']
This recipe could be extended to other language character sets by redefining the unicode character ranges as necessary.
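For example, swapping in the basic Cyrillic block (U+0400–U+04FF) extracts Russian words instead; this is a sketch and ignores the Cyrillic supplement blocks:

```python
import re

CYRILLIC_UNICODE_RANGE = '\u0400-\u04FF'

def extract_cyrillic(text):
    '''Return a list of the Cyrillic words in a string.'''
    return re.sub('[^%s]' % CYRILLIC_UNICODE_RANGE, ' ', text).split()

print(extract_cyrillic('Tolstoy begins: "Все счастливые семьи похожи друг на друга."'))
# ['Все', 'счастливые', 'семьи', 'похожи', 'друг', 'на', 'друга']
```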
Problem: You want to change iota subscripts to iota adscripts, e.g. τῷ θεῷ to τῶι θεῶι.
text_with_iota_subscripts = """Χάρις δὲ τῷ θεῷ τῷ διδόντι τὴν αὐτὴν σπουδὴν ὑπὲρ ὑμῶν ἐν τῇ καρδίᾳ Τίτου, ὅτι τὴν μὲν παράκλησιν ἐδέξατο, σπουδαιότερος δὲ ὑπάρχων αὐθαίρετος ἐξῆλθεν πρὸς ὑμᾶς."""
As with the recipes above for dealing with diacriticals, we will take advantage of unicode decomposition to replace iota subscripts with a full iota.
import unicodedata
def make_iota_adscripts(text):
    text = unicodedata.normalize('NFD', text)
    text = text.replace('\u0345', 'ι')
    text = unicodedata.normalize('NFC', text)
    return text
%timeit -n 100 text_with_iota_adscripts = make_iota_adscripts(text_with_iota_subscripts)
27.2 µs ± 8.54 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print(make_iota_adscripts(text_with_iota_subscripts))
Χάρις δὲ τῶι θεῶι τῶι διδόντι τὴν αὐτὴν σπουδὴν ὑπὲρ ὑμῶν ἐν τῆι καρδίαι Τίτου, ὅτι τὴν μὲν παράκλησιν ἐδέξατο, σπουδαιότερος δὲ ὑπάρχων αὐθαίρετος ἐξῆλθεν πρὸς ὑμᾶς.
Please open an issue for any problems you see with the code. You can also use issues to suggest another Python solution for any of the recipes in this notebook.