Problem: You want to remove all of the macrons from a string, like the following sentence from Caesar's Bellum Gallicum.
text_with_macrons = """Gallia est omnis dīvīsa in partēs trēs, quārum ūnam incolunt Belgae, aliam Aquītānī, tertiam quī ipsōrum linguā Celtae, nostrā Gallī appellantur."""
Here are three methods for removing macrons: 1. with `replace`; 2. with `re.sub`; and 3. with `translate`. Skip to the end for the TL;DR best solution.
replace
# simple replacement
word = 'dīvīsa'
word_without_macrons = word.replace('ī', 'i')
print(f"{word} > {word_without_macrons}")
word = 'Aquītānī'
word_without_macrons = word.replace('ā', 'a').replace('ī', 'i')
print(f"{word} > {word_without_macrons}")
dīvīsa > divisa
Aquītānī > Aquitani
It would be tedious to chain together enough `replace` calls to solve this problem. So, we can create a dictionary of replacement patterns and loop over them, replacing the text on each pass.
# create dictionary of macrons
macron_map = {
'ā': 'a',
'ē': 'e',
'ī': 'i',
'ō': 'o',
'ū': 'u',
'ȳ': 'y',
'Ā': 'A',
'Ē': 'E',
'Ī': 'I',
'Ō': 'O',
'Ū': 'U',
'Ȳ': 'Y'
}
# compact method with dictionary comprehension
vowels = 'aeiouyAEIOUY'
vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'
macron_map = {k: v for k, v in zip(vowels_with_macrons, vowels)}
print(macron_map)
{'ā': 'a', 'ē': 'e', 'ī': 'i', 'ō': 'o', 'ū': 'u', 'ȳ': 'y', 'Ā': 'A', 'Ē': 'E', 'Ī': 'I', 'Ō': 'O', 'Ū': 'U', 'Ȳ': 'Y'}
# replace by iterating over dictionary
text_without_macrons = text_with_macrons
for k, v in macron_map.items():
    text_without_macrons = text_without_macrons.replace(k, v)
print(text_without_macrons)
Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.
# function for replace by iterating over dictionary
def remove_macrons_1(text_with_macrons, replacement_dictionary):
    text_without_macrons = text_with_macrons
    for k, v in replacement_dictionary.items():
        text_without_macrons = text_without_macrons.replace(k, v)
    return text_without_macrons
%time remove_macrons_1(text_with_macrons, macron_map)
CPU times: user 4 µs, sys: 1 µs, total: 5 µs Wall time: 12.2 µs
'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.'
Another option would be to do the same thing with regular expressions instead of `replace`...
# function for re.sub by iterating over dictionary
import re
def remove_macrons_2(text_with_macrons, replacement_dictionary):
    text_without_macrons = text_with_macrons
    for k, v in replacement_dictionary.items():
        text_without_macrons = re.sub(k, v, text_without_macrons)
    return text_without_macrons
%time remove_macrons_2(text_with_macrons, macron_map)
CPU times: user 4 µs, sys: 1 µs, total: 5 µs Wall time: 7.87 µs
'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.'
For a single sentence, this turns out to take about the same amount of time to run (not so with larger texts, as we see below).
translate
Another option is the `translate` method. This allows us to make all of the changes with a single translation table, without having to loop repeatedly over the original string.
# compact method with dictionary comprehension
# note that translate uses ord, i.e. the Unicode code point, as the key for each mapped character
vowels = 'aeiouyAEIOUY'
vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'
macron_table = {ord(k): v for k, v in zip(vowels_with_macrons, vowels)}
print(macron_table)
{257: 'a', 275: 'e', 299: 'i', 333: 'o', 363: 'u', 563: 'y', 256: 'A', 274: 'E', 298: 'I', 332: 'O', 362: 'U', 562: 'Y'}
# function for replacing macrons with translate
def remove_macrons_3(text_with_macrons, macron_table):
    text_without_macrons = text_with_macrons.translate(macron_table)
    return text_without_macrons
%time remove_macrons_3(text_with_macrons, macron_table)
CPU times: user 14 µs, sys: 1 µs, total: 15 µs Wall time: 19.8 µs
'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.'
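As an aside, the same `ord`-keyed table can be built with the standard library's `str.maketrans`, which converts the characters to code points for us; a minimal sketch:

```python
# str.maketrans builds a {code point: replacement} table,
# equivalent to the dictionary comprehension with ord() above
vowels = 'aeiouyAEIOUY'
vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'
macron_table = str.maketrans(vowels_with_macrons, vowels)

print('dīvīsa'.translate(macron_table))  # divisa
```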
All three methods run at about the same speed on a single sentence. But minor differences can add up as the amount of text to be processed increases. How do these recipes perform on larger texts?
# Get sample text with macrons
# Here we'll use the Dickinson College Commentaries text of Caesar's *Bellum Gallicum* (which has macrons!) as found in conventus-lex's github repo for Maccer.
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://raw.githubusercontent.com/conventus-lex/maccer/master/sources/DCC/Caesar%20-%20Selections%20from%20the%20Gallic%20War.txt'
r = session.get(url)
test = r.text
test = test[test.find('1.1'):] # remove 'metadata'
test = re.sub(r'\d\.\d+', '', test) # remove chapter headings, e.g. 1.1
print(test[2:147]) # print sample
Gallia est omnis dīvīsa in partēs trēs, quārum ūnam incolunt Belgae, aliam Aquītānī, tertiam quī ipsōrum linguā Celtae, nostrā Gallī appellantur.
print(f'This text has {len(test.split())} words.')
This text has 6399 words.
Here are the results of timeit on my iMac 2.7 GHz Intel Core i5...
%timeit -n 1000 remove_macrons_1(test, macron_map)
317 µs ± 21.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit -n 1000 remove_macrons_2(test, macron_map)
1.74 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit -n 1000 remove_macrons_3(test, macron_table)
4.74 ms ± 947 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The string method `replace`, even with its multiple passes over the string, is much faster than the other two methods.
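If we wanted to stay with regular expressions, a single compiled pattern with a callable replacement would avoid the repeated passes. This is just a sketch of that alternative, not one of the timed recipes above:

```python
import re

vowels = 'aeiouyAEIOUY'
vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'
macron_map = dict(zip(vowels_with_macrons, vowels))

# one character class matching any macron vowel; the replacement
# function looks up the plain vowel for each match
macron_pattern = re.compile('[%s]' % vowels_with_macrons)

def remove_macrons_re(text_with_macrons):
    return macron_pattern.sub(lambda m: macron_map[m.group(0)], text_with_macrons)

print(remove_macrons_re('Aquītānī'))  # Aquitani
```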
Before wrapping up a discussion about string replacement and unicode characters with diacriticals, it seems like a good time to mention decomposed and precomposed unicode characters. Note the following behavior.
word1 = 'dīvīsa'
print(len(word1))
6
word2 = 'dīvīsa'
print(len(word2))
8
print(word1 == word2)
False
These strings are not the same—word2 contains two decomposed lower-case-i-with-macrons.
print(word1.encode('unicode-escape'))
print(word2.encode('unicode-escape'))
b'd\\u012bv\\u012bsa'
b'di\\u0304vi\\u0304sa'
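`unicodedata.name` makes the difference explicit: the precomposed string has one character per vowel, while the decomposed string pairs a plain letter with a combining mark.

```python
import unicodedata

for char in '\u012b':       # precomposed i-with-macron: one character
    print(unicodedata.name(char))
for char in 'i\u0304':      # decomposed: base letter plus combining mark
    print(unicodedata.name(char))
# LATIN SMALL LETTER I WITH MACRON
# LATIN SMALL LETTER I
# COMBINING MACRON
```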
It seems like a good idea to handle these differences before attempting to replace characters. We can use unicodedata.normalize to convert all strings for replacement to Normalization Form C (NFC) before processing.
import unicodedata
word2 = unicodedata.normalize('NFC', word2)
print(len(word2))
6
# function with NFC preprocessing
def remove_macrons_1b(text_with_macrons, replacement_dictionary):
    text_without_macrons = unicodedata.normalize('NFC', text_with_macrons)
    for k, v in replacement_dictionary.items():
        text_without_macrons = text_without_macrons.replace(k, v)
    return text_without_macrons
%time remove_macrons_1b(text_with_macrons, macron_map)
CPU times: user 8 µs, sys: 2 µs, total: 10 µs Wall time: 18.8 µs
'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.'
Putting it all together we have the following function that we can use for macron replacement.
import unicodedata

def remove_macrons(text_with_macrons):
    '''Replace macrons in Latin text.'''
    vowels = 'aeiouyAEIOUY'
    vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'
    replacement_dictionary = {k: v for k, v in zip(vowels_with_macrons, vowels)}
    text_without_macrons = unicodedata.normalize('NFC', text_with_macrons)
    for k, v in replacement_dictionary.items():
        text_without_macrons = text_without_macrons.replace(k, v)
    return text_without_macrons
%timeit -n 1000 remove_macrons(test)
415 µs ± 34.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print(remove_macrons(test)[:147])
Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.
So, slightly slower with normalization, but still faster than other methods.
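As a sanity check, here is a minimal sketch of the same normalize-then-replace steps run against both the precomposed and the decomposed spellings of dīvīsa; without the NFC step, the second call would miss the decomposed macrons entirely.

```python
import unicodedata

macron_map = dict(zip('āēīōūȳĀĒĪŌŪȲ', 'aeiouyAEIOUY'))

def strip_macrons(text):
    # normalize first so decomposed i + combining macron becomes ī
    text = unicodedata.normalize('NFC', text)
    for k, v in macron_map.items():
        text = text.replace(k, v)
    return text

print(strip_macrons('d\u012bv\u012bsa'))    # precomposed: divisa
print(strip_macrons('di\u0304vi\u0304sa'))  # decomposed: divisa
```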
Problem: You want to remove all of the diacriticals from a string of Greek text, like the following sentence from Thucydides's Historiae.
text_with_diacriticals = """Θουκυδίδης Ἀθηναῖος ξυνέγραψε τὸν πόλεμον τῶν Πελοποννησίων καὶ Ἀθηναίων, ὡς ἐπολέμησαν πρὸς ἀλλήλους, ἀρξάμενος εὐθὺς καθισταμένου καὶ ἐλπίσας μέγαν τε ἔσεσθαι καὶ ἀξιολογώτατον τῶν προγεγενημένων, τεκμαιρόμενος ὅτι ἀκμάζοντές τε ᾖσαν ἐς αὐτὸν ἀμφότεροι παρασκευῇ τῇ πάσῃ καὶ τὸ ἄλλο Ἑλληνικὸν ὁρῶν ξυνιστάμενον πρὸς ἑκατέρους, τὸ μὲν εὐθύς, τὸ δὲ καὶ διανοούμενον."""
word1 = 'Θουκυδίδης' # composed
word2 = 'Θουκυδίδης' # decomposed
print(f'{word1} is {len(word1)} characters long.')
print(f'{word2} is {len(word2)} characters long.')
print(f'{word1} and {word2} are equal: {word1 == word2}.')
Θουκυδίδης is 10 characters long.
Θουκυδίδης is 11 characters long.
Θουκυδίδης and Θουκυδίδης are equal: False.
# Characters with their unicode points
for char in word2:
    print(char, char.encode('unicode-escape'))
Θ b'\\u0398'
ο b'\\u03bf'
υ b'\\u03c5'
κ b'\\u03ba'
υ b'\\u03c5'
δ b'\\u03b4'
ι b'\\u03b9'
́ b'\\u0301'
δ b'\\u03b4'
η b'\\u03b7'
ς b'\\u03c2'
# So, our plan is to strip everything like \u0301, like below...
print(u'\u03b9\u0301') # prints iota with acute accent
print(u'\u03b9\u0301'.replace(u'\u0301', '')) # prints iota
ί
ι
Note that the `combining` function in `unicodedata` returns 0 if the character is not in a "canonical combining class"...
import unicodedata
print(unicodedata.combining(u'\u0398')) # capital theta is not a combining character
print(unicodedata.combining(u'\u0301')) # the acute accent is a combining character
0
230
So we can build a map of all combining characters by iterating over the Unicode range and discarding any character for which `unicodedata.combining` returns 0.
import sys
combining_character_table = dict.fromkeys(
    c for c in range(sys.maxunicode) if unicodedata.combining(chr(c))
)
# decompose string
text_with_diacriticals = unicodedata.normalize('NFD', text_with_diacriticals)
# replace combining characters with translate
text_without_diacriticals = text_with_diacriticals.translate(combining_character_table)
# print results
print(text_without_diacriticals)
Θουκυδιδης Αθηναιος ξυνεγραψε τον πολεμον των Πελοποννησιων και Αθηναιων, ως επολεμησαν προς αλληλους, αρξαμενος ευθυς καθισταμενου και ελπισας μεγαν τε εσεσθαι και αξιολογωτατον των προγεγενημενων, τεκμαιρομενος οτι ακμαζοντες τε ησαν ες αυτον αμφοτεροι παρασκευη τη παση και το αλλο Ελληνικον ορων ξυνισταμενον προς εκατερους, το μεν ευθυς, το δε και διανοουμενον.
# function for removing diacriticals
import sys
import unicodedata
def remove_diacriticals(text_with_diacriticals):
    '''Strip all combining characters from a string.'''
    combining_character_table = dict.fromkeys(
        c for c in range(sys.maxunicode) if unicodedata.combining(chr(c))
    )
    text_with_diacriticals = unicodedata.normalize('NFD', text_with_diacriticals)
    text_without_diacriticals = text_with_diacriticals.translate(combining_character_table)
    return text_without_diacriticals
%timeit -n 100 text_without_diacriticals = remove_diacriticals(text_with_diacriticals)
390 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
print(text_without_diacriticals)
Θουκυδιδης Αθηναιος ξυνεγραψε τον πολεμον των Πελοποννησιων και Αθηναιων, ως επολεμησαν προς αλληλους, αρξαμενος ευθυς καθισταμενου και ελπισας μεγαν τε εσεσθαι και αξιολογωτατον των προγεγενημενων, τεκμαιρομενος οτι ακμαζοντες τε ησαν ες αυτον αμφοτεροι παρασκευη τη παση και το αλλο Ελληνικον ορων ξυνισταμενον προς εκατερους, το μεν ευθυς, το δε και διανοουμενον.
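Most of the time reported above is spent rebuilding the combining-character table, which scans the entire Unicode range on every call. A sketch that hoists the table out of the function and builds it once:

```python
import sys
import unicodedata

# build the table once at module level and reuse it on every call
COMBINING_CHARACTER_TABLE = dict.fromkeys(
    c for c in range(sys.maxunicode) if unicodedata.combining(chr(c))
)

def remove_diacriticals_cached(text_with_diacriticals):
    '''Strip combining characters, reusing the precomputed table.'''
    decomposed = unicodedata.normalize('NFD', text_with_diacriticals)
    return decomposed.translate(COMBINING_CHARACTER_TABLE)

print(remove_diacriticals_cached('Θουκυδίδης'))  # Θουκυδιδης
```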
aconser notes that this method could also be used for removing macrons...
import unicodedata
def remove_macrons_4(text_with_macrons):
    MACRON = u'\u0304'
    temp = unicodedata.normalize('NFD', text_with_macrons)
    text_without_macrons = temp.replace(MACRON, '')
    return unicodedata.normalize('NFC', text_without_macrons)
%timeit -n 1000 remove_macrons_4(test)
2.22 ms ± 49.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Cf.
%timeit -n 1000 remove_macrons(test)
376 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Problem: You want to extract the Greek words from a Latin text (or, really, any non-Greek text).
# Cicero Att 1.4
# http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3Atext%3A1999.02.0008%3Abook%3D1%3Aletter%3D1%3Asection%3D4
text_with_greek = """
abs te peto ut mihi hoc ignoscas et me existimes humanitate esse prohibitum ne contra amici summam existimationem miserrimo eius tempore venirem, cum is omnia sua studia et officia in me contulisset. quod si voles in me esse durior, ambitionem putabis mihi obstitisse. ego autem arbitror, etiam si id sit, mihi ignoscendum esse, “ἐπεὶ οὐχ ἱερήϊον οὐδὲ βοεΐην.” vides enim in quo cursu simus et quam omnis gratias non modo retinendas verum etiam acquirendas putemus. spero tibi me causam probasse, cupio quidem certe.
"""
We can use regular expressions to replace any non-Greek characters with a space and then split the string on that space to return a list of Greek words. We start by defining the unicode range for Greek characters...
# Define unicode character ranges: Greek and Coptic (with the preceding combining marks block) and Greek Extended
GREEK_UNICODE_RANGE = '\u0300-\u03FF'
GREEK_EXT_UNICODE_RANGE = '\u1F00-\u1FFF'
import re
%timeit -n 1000 greek_words = re.sub('[^%s%s]' % (GREEK_UNICODE_RANGE, GREEK_EXT_UNICODE_RANGE),' ', text_with_greek).split()
91.3 µs ± 3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
greek_words = re.sub('[^%s%s]' % (GREEK_UNICODE_RANGE, GREEK_EXT_UNICODE_RANGE),' ', text_with_greek).split()
print(greek_words)
print(greek_words == ['ἐπεὶ', 'οὐχ', 'ἱερήϊον', 'οὐδὲ', 'βοεΐην'])
['ἐπεὶ', 'οὐχ', 'ἱερήϊον', 'οὐδὲ', 'βοεΐην'] True
# function for extracting Greek words
import re
def extract_greek(text_with_greek):
    '''Return a list of the Greek words in a string.'''
    GREEK_UNICODE_RANGE = '\u0300-\u03FF'
    GREEK_EXT_UNICODE_RANGE = '\u1F00-\u1FFF'
    return re.sub('[^%s%s]' % (GREEK_UNICODE_RANGE, GREEK_EXT_UNICODE_RANGE), ' ', text_with_greek).split()
print(extract_greek(text_with_greek))
['ἐπεὶ', 'οὐχ', 'ἱερήϊον', 'οὐδὲ', 'βοεΐην']
This recipe could be extended to other language character sets by redefining the unicode character ranges as necessary.
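For example, swapping in the basic Cyrillic block (U+0400–U+04FF) extracts Russian words instead; this is a sketch and ignores the Cyrillic supplement blocks:

```python
import re

CYRILLIC_UNICODE_RANGE = '\u0400-\u04FF'

def extract_cyrillic(text):
    '''Return a list of the Cyrillic words in a string.'''
    return re.sub('[^%s]' % CYRILLIC_UNICODE_RANGE, ' ', text).split()

print(extract_cyrillic('Tolstoy begins: "Все счастливые семьи похожи друг на друга."'))
# ['Все', 'счастливые', 'семьи', 'похожи', 'друг', 'на', 'друга']
```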
Problem: You want to change iota subscripts to iota adscripts, e.g. τῷ θεῷ to τῶι θεῶι.
text_with_iota_subscripts = """Χάρις δὲ τῷ θεῷ τῷ διδόντι τὴν αὐτὴν σπουδὴν ὑπὲρ ὑμῶν ἐν τῇ καρδίᾳ Τίτου, ὅτι τὴν μὲν παράκλησιν ἐδέξατο, σπουδαιότερος δὲ ὑπάρχων αὐθαίρετος ἐξῆλθεν πρὸς ὑμᾶς."""
As with the recipes above for dealing with diacriticals, we will take advantage of unicode decomposition to replace iota subscripts with a full iota.
import unicodedata
def make_iota_adscripts(text):
    text = unicodedata.normalize('NFD', text)
    text = text.replace('\u0345', 'ι')
    text = unicodedata.normalize('NFC', text)
    return text
%timeit -n 100 text_with_iota_adscripts = make_iota_adscripts(text_with_iota_subscripts)
27.2 µs ± 8.54 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print(make_iota_adscripts(text_with_iota_subscripts))
Χάρις δὲ τῶι θεῶι τῶι διδόντι τὴν αὐτὴν σπουδὴν ὑπὲρ ὑμῶν ἐν τῆι καρδίαι Τίτου, ὅτι τὴν μὲν παράκλησιν ἐδέξατο, σπουδαιότερος δὲ ὑπάρχων αὐθαίρετος ἐξῆλθεν πρὸς ὑμᾶς.
Please open an issue for any problems you see with the code. You can also use issues to suggest another Python solution for any of the recipes in this notebook.