Python Classics Cookbook

Edited by Patrick J. Burns

with contributions from Anna Conser

1. Replace macrons

Problem: You want to remove all of the macrons from a string, like the following sentence from Caesar's *Bellum Gallicum*.

In [1]:
text_with_macrons = """Gallia est omnis dīvīsa in partēs trēs, quārum ūnam incolunt Belgae, aliam Aquītānī, tertiam quī ipsōrum linguā Celtae, nostrā Gallī appellantur."""

Here are three methods for removing macrons: 1. with `replace`; 2. with `re.sub`; and 3. with `translate`. (For the TL;DR best solution, skip ahead to the final `remove_macrons` function at the end of this recipe.)

string replace

In [2]:
# simple replacement

word = 'dīvīsa'
word_without_macrons = word.replace('ī', 'i')
print(f"{word} > {word_without_macrons}")

word = 'Aquītānī'
word_without_macrons = word.replace('ā', 'a').replace('ī', 'i')
print(f"{word} > {word_without_macrons}")
dīvīsa > divisa
Aquītānī > Aquitani

It would be tedious to chain together enough `replace` calls to solve this problem. Instead, we can create a dictionary of replacement patterns and loop over it, replacing characters in the text on each pass.

In [3]:
# create dictionary of macrons

macron_map = {
    'ā': 'a', 
    'ē': 'e', 
    'ī': 'i', 
    'ō': 'o', 
    'ū': 'u',
    'ȳ': 'y',
    'Ā': 'A',
    'Ē': 'E', 
    'Ī': 'I', 
    'Ō': 'O', 
    'Ū': 'U',
    'Ȳ': 'Y'
}

# compact method with dictionary comprehension

vowels = 'aeiouyAEIOUY'
vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'
macron_map = {k: v for k, v in zip(vowels_with_macrons, vowels)}    

print(macron_map)
{'ā': 'a', 'ē': 'e', 'ī': 'i', 'ō': 'o', 'ū': 'u', 'ȳ': 'y', 'Ā': 'A', 'Ē': 'E', 'Ī': 'I', 'Ō': 'O', 'Ū': 'U', 'Ȳ': 'Y'}
In [4]:
# replace by iterating over dictionary

text_without_macrons = text_with_macrons

for k, v in macron_map.items():
    text_without_macrons = text_without_macrons.replace(k, v)
    
print(text_without_macrons)
Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.
In [5]:
# function for replace by iterating over dictionary

def remove_macrons_1(text_with_macrons, replacement_dictionary):
    text_without_macrons = text_with_macrons
    for k, v in replacement_dictionary.items():
        text_without_macrons = text_without_macrons.replace(k, v)    
    return text_without_macrons
In [6]:
%time
remove_macrons_1(text_with_macrons, macron_map)
CPU times: user 6 µs, sys: 1 µs, total: 7 µs
Wall time: 12.2 µs
Out[6]:
'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.'

replacement with regular expressions

Another option would be to do the same thing with regular expressions instead of `replace`...

In [7]:
# function for re.sub by iterating over dictionary

import re

def remove_macrons_2(text_with_macrons, replacement_dictionary):
    text_without_macrons = text_with_macrons
    for k, v in replacement_dictionary.items():
        text_without_macrons = re.sub(k, v, text_without_macrons)
    return text_without_macrons
In [8]:
%time
remove_macrons_2(text_with_macrons, macron_map)
CPU times: user 6 µs, sys: 1 µs, total: 7 µs
Wall time: 73.9 µs
Out[8]:
'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.'

For a single sentence, the replace and re.sub recipes run in roughly the same amount of time (not so with larger texts, as we will see below).
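
As an aside, the twelve separate substitutions could also be collapsed into a single pass with `re.sub`, using a character class and a replacement function. A minimal sketch (the name `remove_macrons_2b` is our own and is not timed below):

In [ ]:
# one-pass regex replacement: a character class of all macron vowels,
# with a callback that looks each match up in the replacement dictionary

import re

macron_pattern = re.compile('[%s]' % ''.join(macron_map))

def remove_macrons_2b(text_with_macrons, replacement_dictionary):
    return macron_pattern.sub(lambda m: replacement_dictionary[m.group()], text_with_macrons)

print(remove_macrons_2b(text_with_macrons, macron_map))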

replacement with translate

Another option is the translate method. This allows us to make all changes using a translation table without having to loop repeatedly over the original string.

In [9]:
# compact method with dictionary comprehension
# note that the translate table is keyed by ord(), i.e. the Unicode code point of each mapped character

vowels = 'aeiouyAEIOUY'
vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'
macron_table = {ord(k): v for k, v in zip(vowels_with_macrons, vowels)}    

print(macron_table)
{257: 'a', 275: 'e', 299: 'i', 333: 'o', 363: 'u', 563: 'y', 256: 'A', 274: 'E', 298: 'I', 332: 'O', 362: 'U', 562: 'Y'}
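
Incidentally, the standard library can build this table for us: `str.maketrans`, given two strings of equal length, returns an equivalent translation table (keyed by code point, with code points rather than one-character strings as values; `translate` accepts both). The name `macron_table_alt` is our own:

In [ ]:
# equivalent translation table built with str.maketrans
macron_table_alt = str.maketrans(vowels_with_macrons, vowels)
print(text_with_macrons.translate(macron_table_alt))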
In [10]:
# function for replacing macrons with translate

def remove_macrons_3(text_with_macrons, macron_table):
    text_without_macrons = text_with_macrons.translate(macron_table)
    return text_without_macrons
In [11]:
%time
remove_macrons_3(text_with_macrons, macron_table)
CPU times: user 5 µs, sys: 2 µs, total: 7 µs
Wall time: 11.7 µs
Out[11]:
'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.'

Testing recipes on larger texts

All three methods run at about the same speed on a single sentence. But minor differences can add up as the amount of text to be processed increases. How do these recipes perform on larger texts?

In [12]:
# Get sample text with macrons
# Here we'll use the Dickinson College Commentaries text of Caesar's *Bellum Gallicum* (which has macrons!) as found in conventus-lex's GitHub repo for Maccer.

from requests_html import HTMLSession
session = HTMLSession()
url = 'https://raw.githubusercontent.com/conventus-lex/maccer/master/sources/DCC/Caesar%20-%20Selections%20from%20the%20Gallic%20War.txt'
r = session.get(url)
test = r.text
test = test[test.find('1.1'):] # remove 'metadata'
test = re.sub(r'\d\.\d+', '', test) # remove chapter headings, e.g. 1.1
print(test[2:147]) # print sample
Gallia est omnis dīvīsa in partēs trēs, quārum ūnam incolunt Belgae, aliam Aquītānī, tertiam quī ipsōrum linguā Celtae, nostrā Gallī appellantur.
In [13]:
print(f'This text has {len(test.split())} words.')
This text has 6399 words.

Here are the results of `timeit` on my iMac (2.7 GHz Intel Core i5)...

In [14]:
%timeit -n 1000 remove_macrons_1(test, macron_map)
279 µs ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [15]:
%timeit -n 1000 remove_macrons_2(test, macron_map)
2.15 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [16]:
%timeit -n 1000 remove_macrons_3(test, macron_table)
6.64 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The string method `replace`, even with its multiple passes over the string, is much faster than the other two methods.

Warning about combining characters

Before wrapping up this discussion of string replacement and Unicode characters with diacriticals, it seems like a good time to mention decomposed and precomposed Unicode characters. Note the following behavior.

In [17]:
word1 = 'dīvīsa'
print(len(word1))
6
In [18]:
word2 = 'dīvīsa'
print(len(word2))
8
In [19]:
print(word1 == word2)
False

These strings are not the same: word2 contains two decomposed sequences (i followed by a combining macron) in place of the precomposed ī.

In [20]:
print(word1.encode('unicode-escape'))
print(word2.encode('unicode-escape'))
b'd\\u012bv\\u012bsa'
b'di\\u0304vi\\u0304sa'

It seems like a good idea to handle these differences before attempting to replace characters. We can use unicodedata.normalize to convert all strings for replacement to Normalization Form C (NFC) before processing.

In [21]:
import unicodedata
word2 = unicodedata.normalize('NFC', word2)
print(len(word2))
6
In [22]:
# function with NFC preprocessing

def remove_macrons_1b(text_with_macrons, replacement_dictionary):
    text_without_macrons = unicodedata.normalize('NFC', text_with_macrons)
    for k, v in replacement_dictionary.items():
        text_without_macrons = text_without_macrons.replace(k, v)    
    return text_without_macrons
In [23]:
%time
remove_macrons_1b(text_with_macrons, macron_map)
CPU times: user 6 µs, sys: 1e+03 ns, total: 7 µs
Wall time: 12.9 µs
Out[23]:
'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.'

Putting it all together, we have the following function that we can use for macron replacement.

In [24]:
import unicodedata

def remove_macrons(text_with_macrons):
    '''Replace macrons in Latin text'''
    vowels = 'aeiouyAEIOUY'
    vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'
    replacement_dictionary = {k: v for k, v in zip(vowels_with_macrons, vowels)}

    # normalize to NFC first so that decomposed macrons are caught as well
    text_without_macrons = unicodedata.normalize('NFC', text_with_macrons)

    for k, v in replacement_dictionary.items():
        text_without_macrons = text_without_macrons.replace(k, v)

    return text_without_macrons
In [25]:
%timeit -n 1000 remove_macrons(test)
618 µs ± 148 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [26]:
print(remove_macrons(test)[:147])
Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.

So, somewhat slower with normalization (618 µs vs. 279 µs on the test text), but still faster than the other two methods.

2. Remove diacriticals

Problem: You want to remove all of the diacriticals from a string of Greek text, like the following sentence from Thucydides's *Historiae*.

In [27]:
text_with_diacriticals = """Θουκυδίδης Ἀθηναῖος ξυνέγραψε τὸν πόλεμον τῶν Πελοποννησίων καὶ Ἀθηναίων, ὡς ἐπολέμησαν πρὸς ἀλλήλους, ἀρξάμενος εὐθὺς καθισταμένου καὶ ἐλπίσας μέγαν τε ἔσεσθαι καὶ ἀξιολογώτατον τῶν προγεγενημένων, τεκμαιρόμενος ὅτι ἀκμάζοντές τε ᾖσαν ἐς αὐτὸν ἀμφότεροι παρασκευῇ τῇ πάσῃ καὶ τὸ ἄλλο Ἑλληνικὸν ὁρῶν ξυνιστάμενον πρὸς ἑκατέρους, τὸ μὲν εὐθύς, τὸ δὲ καὶ διανοούμενον."""

In a certain respect, this recipe appears similar to [Replace macrons](#) in that it is also a replacement task. But the number of possible character combinations would be burdensome to map by hand, so we are better off looking for a more manageable solution, one that works with entire ranges of characters. One strategy along these lines is to decompose the Unicode characters and strip the combining characters using translate (cf. **PCRx 2.12**). This way, rather than figuring out a mapping for every possible precomposed character (e.g. ί to ι), we can map the diacriticals separately and remove them.

In [28]:
word1 = 'Θουκυδίδης' # composed
word2 = 'Θουκυδίδης' # decomposed

print(f'{word1} is {len(word1)} characters long.')
print(f'{word2} is {len(word2)} characters long.')
print(f'{word1} and {word2} are equal: {word1 == word2}.')
Θουκυδίδης is 10 characters long.
Θουκυδίδης is 11 characters long.
Θουκυδίδης and Θουκυδίδης are equal: False.
In [29]:
# Characters with their Unicode code points

for char in word2:
    print(char, char.encode('unicode-escape'))
Θ b'\\u0398'
ο b'\\u03bf'
υ b'\\u03c5'
κ b'\\u03ba'
υ b'\\u03c5'
δ b'\\u03b4'
ι b'\\u03b9'
́ b'\\u0301'
δ b'\\u03b4'
η b'\\u03b7'
ς b'\\u03c2'
In [30]:
# So, our plan is to strip combining characters like \u0301, as below...

print(u'\u03b9\u0301') # prints iota with acute accent
print(u'\u03b9\u0301'.replace(u'\u0301', '')) # prints iota
ί
ι

Note that the combining function in unicodedata returns a character's canonical combining class, and returns 0 for characters that are not combining characters...

In [31]:
import unicodedata

print(unicodedata.combining(u'\u0398')) # theta is not a combining character
print(unicodedata.combining(u'\u0301')) # combining acute accent, combining class 230
0
230

So we can build a map of all combining characters by iterating through all Unicode code points and discarding anything that returns 0 for unicodedata.combining.

In [32]:
import sys

combining_character_table = dict.fromkeys(c for c in range(sys.maxunicode) 
                                          if unicodedata.combining(chr(c))
                                         )                                        
In [33]:
# decompose string

text_with_diacriticals = unicodedata.normalize('NFD', text_with_diacriticals)

# replace combining characters with translate
text_without_diacriticals = text_with_diacriticals.translate(combining_character_table)

# print results
print(text_without_diacriticals)
Θουκυδιδης Αθηναιος ξυνεγραψε τον πολεμον των Πελοποννησιων και Αθηναιων, ως επολεμησαν προς αλληλους, αρξαμενος ευθυς καθισταμενου και ελπισας μεγαν τε εσεσθαι και αξιολογωτατον των προγεγενημενων, τεκμαιρομενος οτι ακμαζοντες τε ησαν ες αυτον αμφοτεροι παρασκευη τη παση και το αλλο Ελληνικον ορων ξυνισταμενον προς εκατερους, το μεν ευθυς, το δε και διανοουμενον.
In [34]:
# function for removing diacriticals

import sys
import unicodedata

def remove_diacriticals(text_with_diacriticals):
    '''Remove diacritical (combining) characters from text'''
    combining_character_table = dict.fromkeys(c for c in range(sys.maxunicode) 
                                          if unicodedata.combining(chr(c))
                                         )
    
    text_with_diacriticals = unicodedata.normalize('NFD', text_with_diacriticals)
    
    text_without_diacriticals = text_with_diacriticals.translate(combining_character_table)
    
    return text_without_diacriticals
In [35]:
%timeit -n 100 text_without_diacriticals = remove_diacriticals(text_with_diacriticals)
379 ms ± 34.2 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [36]:
print(text_without_diacriticals)
Θουκυδιδης Αθηναιος ξυνεγραψε τον πολεμον των Πελοποννησιων και Αθηναιων, ως επολεμησαν προς αλληλους, αρξαμενος ευθυς καθισταμενου και ελπισας μεγαν τε εσεσθαι και αξιολογωτατον των προγεγενημενων, τεκμαιρομενος οτι ακμαζοντες τε ησαν ες αυτον αμφοτεροι παρασκευη τη παση και το αλλο Ελληνικον ορων ξυνισταμενον προς εκατερους, το μεν ευθυς, το δε και διανοουμενον.
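
Most of that runtime is spent rebuilding the combining-character table on every call. A sketch that hoists the table out of the function (the name `remove_diacriticals_fast` is our own, and it has not been timed here) leaves only the normalize and translate steps per call:

In [ ]:
import sys
import unicodedata

# build the combining-character table once, at definition time
COMBINING_CHARACTER_TABLE = dict.fromkeys(
    c for c in range(sys.maxunicode) if unicodedata.combining(chr(c))
)

def remove_diacriticals_fast(text_with_diacriticals):
    '''Remove combining characters using the precomputed table'''
    decomposed = unicodedata.normalize('NFD', text_with_diacriticals)
    return decomposed.translate(COMBINING_CHARACTER_TABLE)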

Anna Conser notes that this method could also be used for removing macrons...

In [37]:
import unicodedata

def remove_macrons_4(text_with_macrons):
    MACRON = u'\u0304'
    temp = unicodedata.normalize('NFD', text_with_macrons)
    text_without_macrons = temp.replace(MACRON, '')
    return unicodedata.normalize('NFC', text_without_macrons)
In [38]:
%timeit -n 1000 remove_macrons_4(test)
2.16 ms ± 33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [39]:
# Cf. the replace-based remove_macrons above, which is roughly five times faster on this text
%timeit -n 1000 remove_macrons(test)
394 µs ± 68.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

3. Extract Greek words

Problem: You want to extract the Greek words from a Latin text (or, really, any non-Greek text).

In [40]:
# Cicero Att 1.4
# http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3Atext%3A1999.02.0008%3Abook%3D1%3Aletter%3D1%3Asection%3D4

text_with_greek = """
abs te peto ut mihi hoc ignoscas et me existimes humanitate esse prohibitum ne contra amici summam existimationem miserrimo eius tempore venirem, cum is omnia sua studia et officia in me contulisset. quod si voles in me esse durior, ambitionem putabis mihi obstitisse. ego autem arbitror, etiam si id sit, mihi ignoscendum esse, “ἐπεὶ οὐχ ἱερήϊον οὐδὲ βοεΐην.” vides enim in quo cursu simus et quam omnis gratias non modo retinendas verum etiam acquirendas putemus. spero tibi me causam probasse, cupio quidem certe. 
"""

We can use regular expressions to replace any non-Greek characters with a space and then split the string on whitespace to return a list of Greek words. We start by defining the Unicode ranges for Greek characters...

In [41]:
# Define Unicode ranges: Greek and Coptic (plus combining diacritical marks) and Greek Extended
GREEK_UNICODE_RANGE = '\u0300-\u03FF'
GREEK_EXT_UNICODE_RANGE = '\u1F00-\u1FFF'
In [42]:
import re

%timeit -n 1000 greek_words = re.sub('[^%s%s]' % (GREEK_UNICODE_RANGE, GREEK_EXT_UNICODE_RANGE), ' ', text_with_greek).split()
102 µs ± 14.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [43]:
greek_words = re.sub('[^%s%s]' % (GREEK_UNICODE_RANGE, GREEK_EXT_UNICODE_RANGE), ' ', text_with_greek).split()
print(greek_words)
print(greek_words == ['ἐπεὶ', 'οὐχ', 'ἱερήϊον', 'οὐδὲ', 'βοεΐην'])
['ἐπεὶ', 'οὐχ', 'ἱερήϊον', 'οὐδὲ', 'βοεΐην']
True
In [44]:
# function for extracting Greek words

import re

def extract_greek(text_with_greek):
    '''Extract Greek words from a text'''
    GREEK_UNICODE_RANGE = '\u0300-\u03FF'
    GREEK_EXT_UNICODE_RANGE = '\u1F00-\u1FFF'

    return re.sub('[^%s%s]' % (GREEK_UNICODE_RANGE, GREEK_EXT_UNICODE_RANGE), ' ', text_with_greek).split()
In [45]:
print(extract_greek(text_with_greek))
['ἐπεὶ', 'οὐχ', 'ἱερήϊον', 'οὐδὲ', 'βοεΐην']

This recipe could be extended to other language character sets by redefining the Unicode character ranges as necessary, as in the sketch below.
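
For example, a hypothetical `extract_cyrillic` could swap in the Cyrillic block (U+0400 to U+04FF):

In [ ]:
import re

CYRILLIC_UNICODE_RANGE = '\u0400-\u04FF'

def extract_cyrillic(text):
    '''Extract Cyrillic words from a text (illustrative sketch)'''
    return re.sub('[^%s]' % CYRILLIC_UNICODE_RANGE, ' ', text).split()

print(extract_cyrillic('Lorem ipsum мир dolor sit amet'))  # prints ['мир']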


Please open an issue for any problems you see with the code. You can also use issues if you would like to suggest another Python solution for any of the recipes in this notebook.