Classifying Roman Names by Gender

This notebook follows section 6.1 of the NLTK Book on Supervised Classification, specifically the examples on gender identification (pp. 222-227 in the print edition). The NLTK corpora include a 'names' corpus divided male/female. (Passing over for the moment the limited scope of this binary.) In order to build a similar classifier for Roman names, I downloaded name/gender information from the Romans1By1 site, building a dictionary with the necessary data.

The first part of this notebook follows the NLTK example closely, using the NaiveBayesClassifier with the last letter of a given name as the feature. Unsurprisingly, -s correlates well with 'male' names and -a with 'female'. This simple classifier consistently produced accuracies of 95-98%.

The second part of this notebook extends the feature set to look at the terminations of all names available for each person, i.e. praenomen, nomen, and cognomen. It also uses the total number of words in a name as a feature. With this extended featureset, accuracy is consistently at 99%.

It would be interesting to see what other features could be used to improve the classifier (though it is already pretty accurate). Also, if anyone knows of more datasets with Roman names tagged for gender, it would be interesting to see how the classifier performs on new data. [PJB 3.8.18]

Setup notebook

In [1]:
# Imports

import random
import pickle

import nltk

from pprint import pprint

Get Roman names data

In [2]:
## Get list of names in Latin

## We can use the Romans1by1 database to get a list of Roman names by gender.
## Note that the Romans1by1 data is available under under a Creative Commons Attribution 4.0 International Licence.
## See http://romans1by1.com/pages/phome for terms of use.
#
# from requests_html import HTMLSession
#
# from tqdm import tqdm
#
# session = HTMLSession()
# urlbase = "http://romans1by1.com/rpeople/people?page="
#
# records = []
#
# for i in tqdm(range(1,138)):
#     r = session.get('{}{}'.format(urlbase, i))
#     thead = r.html.find('thead', first=True)
#     tbody = r.html.find('tbody', first=True)
#     keys = [th.text for th in ths]
#     trs = tbody.find('tr')
#     for tr in trs:
#         record = {}
#         tds = tr.find('td')
#         for i, td in enumerate(tds):
#             record[keys[i]] = td.text
#         records.append(record)
# pickle.dump(records, open('romans1by1.p', 'wb'))

records = pickle.load(open('./data/romans1by1.p', 'rb'))

Using nomen information

In [3]:
# Define a function for classification feature following NLTK Book example; here, final letter

def gender_features(word):
    return {'last_letter': word[-1]}
In [4]:
# Separate records by gender

male = [record for record in records  if record['Gender'] == 'Male']
female = [record for record in records  if record['Gender'] == 'Female']
In [5]:
# Helper function for removing non-Latin characters, spec. Greek in this case

import unicodedata as ud

latin_letters= {}

def is_latin(uchr):
    try: return latin_letters[uchr]
    except KeyError:
         return latin_letters.setdefault(uchr, 'LATIN' in ud.name(uchr))

def only_roman_chars(unistr):
    return all(is_latin(uchr)
           for uchr in unistr
           if uchr.isalpha()) # isalpha suggested by John Machin
In [6]:
# Build list of nomina for 'male'; drop incomplete names and Greek names

male_names = []
for record in male:
    if record['Nomen'].isalpha() and only_roman_chars(record['Nomen']):
        male_names.append(record['Nomen'])
male_names = sorted(list(set(male_names)))

print('There are {} male nomina'.format(len(male_names)))
print('Some examples include: {}'.format(", ".join(male_names[:10])))
There are 636 male nomina
Some examples include: Abudius, Aburnius, Aciliius, Acilius, Aebutius, Aelia, Aelius, Aemilianus, Aemilius, Aesernius
In [7]:
# Build list of nomina for 'female'; drop incomplete names and Greek names

female_names = []
for record in female:
    if record['Nomen'].isalpha() and only_roman_chars(record['Nomen']):
        female_names.append(record['Nomen'])
female_names = sorted(list(set(female_names)))

print('There are {} female nomina'.format(len(female_names)))
print('Some examples include: {}'.format(", ".join(female_names[:10])))
There are 232 female nomina
Some examples include: Abuccia, Aburia, Acutia, Aebutia, Aelia, Aelius, Aemilia, Ambia, Amibia, Anclarenia
In [8]:
# Create list of names with labels & shuffle for train/test classifier

names = ([(name, 'Male') for name in male_names] +
         [(name, 'Female') for name in female_names])

print('There are {} names in this dataset.'.format(len(names)))
random.shuffle(names)
There are 868 names in this dataset.
In [9]:
# Set up classifier, spec. Naive Bayers classifier

featuresets = [(gender_features(n), g) for (n,g) in names]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
In [10]:
# Give accuracy 

print(nltk.classify.accuracy(classifier, test_set))
0.96
In [11]:
# Sample classifications...

print(classifier.classify(gender_features('Ulysses')))
print(classifier.classify(gender_features('Penelope')))
print(classifier.classify(gender_features('Hercules')))
print(classifier.classify(gender_features('Deineira')))
print(classifier.classify(gender_features('Aeneas')))
print(classifier.classify(gender_features('Dido'))) # Note incorrect classification here.
Male
Female
Male
Female
Male
Male
In [12]:
# Show the 'most informative features

classifier.show_most_informative_features(10)
Most Informative Features
             last_letter = 's'              Male : Female =     80.8 : 1.0
             last_letter = 'a'            Female : Male   =     32.6 : 1.0
             last_letter = 'o'              Male : Female =      1.1 : 1.0
In [13]:
# Review errors from test

errors = []
for (name, tag) in names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))
        ref_name = name

print('There were {} errors out of {}.'.format(len(errors), len(names)))        
        
for (tag, guess, name) in sorted(errors):
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))            
There were 23 errors out of 868.
correct=Female   guess=Male     name=Aelius                        
correct=Female   guess=Male     name=Caetennius                    
correct=Female   guess=Male     name=Charitio                      
correct=Male     guess=Female   name=Aelia                         
correct=Male     guess=Female   name=Aquila                        
correct=Male     guess=Female   name=Aurelia                       
correct=Male     guess=Female   name=Axenna                        
correct=Male     guess=Female   name=Charagonia                    
correct=Male     guess=Female   name=Claudia                       
correct=Male     guess=Female   name=Concinna                      
correct=Male     guess=Female   name=Dida                          
correct=Male     guess=Female   name=Fera                          
correct=Male     guess=Female   name=Iulia                         
correct=Male     guess=Female   name=Messala                       
correct=Male     guess=Female   name=Prastina                      
correct=Male     guess=Female   name=Rustia                        
correct=Male     guess=Female   name=Sufena                        
correct=Male     guess=Female   name=Sura                          
correct=Male     guess=Female   name=Surucca                       
correct=Male     guess=Female   name=Torquata                      
correct=Male     guess=Female   name=Ulpia                         
correct=Male     guess=Female   name=Valeria                       
correct=Male     guess=Female   name=Viccia                        
In [14]:
print([record for record in records if record['Nomen'] == ref_name][0])
{'Details': 'Show', 'Code': '530', 'Praenomen': '', 'Nomen': 'Ulpia', 'Cognomen/Personal name': 'Ianuaria', 'Father/Master name': '', 'Natione': '', 'Ethnicity': '', 'Gender': 'Female', 'Citizen': 'true', 'Libertus/-a': 'false', 'Veteranus': 'false', 'Peregrine': 'false', 'Slave': 'false', 'Veteranus unit': '', 'Veteranus rank': '', 'Tribus': '', 'Origo': '', 'Domus': '', 'Inscription code': '00017DS', 'Province': 'Dacia Superior'}

Using full name information

In [15]:
# Build list of full names with gender

fullnames = []

for record in records:
    fullname = [record['Praenomen'], record['Nomen'], record['Cognomen/Personal name']]
    fullname = ' '.join(fullname).strip()
    if fullname.replace(' ','').isalpha() and only_roman_chars(fullname) and fullname.lower() != "such":
        if record['Gender'] == 'Male' or record['Gender'] == 'Female':
            fullnames.append((fullname, record['Gender']))

print('There are {} names in this dataset.'.format(len(fullnames)))
random.shuffle(fullnames)
There are 7256 names in this dataset.
In [16]:
# Build expanded feature set; looking at the final two (2) letters of each name seemed to yield
# better results that just the last letter, e.g. -us being +300x more common in male names then female.

def features(name):
    name_components = name.split()
    features = {}
    features['last1_2'] = name_components[0][-2:]
    if len(name_components) > 1:
       features['last2_2'] = name_components[1][-2:]
    else:
       features['last2_2'] = None
    if len(name_components) > 2:
       features['last3_2'] = name_components[1][-2:]
    else:
       features['last3_2'] = None
    features['wordcount'] = len(name_components) # Number of words in name also seems to be a useful feature
    return features
In [17]:
# Setup classifer with train/dev/test sets

featuresets = [(features(n), g) for (n,g) in fullnames]
train_set, devtest_set, test_set = featuresets[1400:], featuresets[700:1400], featuresets[:700]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))
0.9885714285714285
In [18]:
classifier.show_most_informative_features(10)
Most Informative Features
                 last2_2 = 'us'             Male : Female =    298.8 : 1.0
                 last3_2 = 'us'             Male : Female =    265.2 : 1.0
                 last1_2 = 'ia'           Female : Male   =    251.5 : 1.0
                 last2_2 = 'na'           Female : Male   =    246.2 : 1.0
                 last1_2 = 'us'             Male : Female =    164.3 : 1.0
                 last2_2 = 'ia'           Female : Male   =    108.4 : 1.0
                 last1_2 = 'na'           Female : Male   =     80.6 : 1.0
                 last2_2 = 'ta'           Female : Male   =     54.7 : 1.0
                 last1_2 = 'ta'           Female : Male   =     44.6 : 1.0
               wordcount = 3                Male : Female =     41.2 : 1.0
In [19]:
# Review errors from devtest

errors = []
for (name, tag) in fullnames:
    guess = classifier.classify(features(name))
    if guess != tag:
        errors.append((tag, guess, name))

print('There were {} errors out of {}.'.format(len(errors), len(fullnames)))
        
for (tag, guess, name) in sorted(errors):
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))     
There were 91 errors out of 7256.
correct=Female   guess=Male     name=Aelius Florentina             
correct=Female   guess=Male     name=Aelius Severina               
correct=Female   guess=Male     name=Amabilis                      
correct=Female   guess=Male     name=Asholes                       
correct=Female   guess=Male     name=Aurelius                      
correct=Female   guess=Male     name=Batsinis                      
correct=Female   guess=Male     name=Cethithis                     
correct=Female   guess=Male     name=Cinenes                       
correct=Female   guess=Male     name=Corinnis                      
correct=Female   guess=Male     name=Dentusucus                    
correct=Female   guess=Male     name=Dimidusis                     
correct=Female   guess=Male     name=Dutuborinis                   
correct=Female   guess=Male     name=Epictesis                     
correct=Female   guess=Male     name=Euhelpis                      
correct=Female   guess=Male     name=Flavia Delus                  
correct=Female   guess=Male     name=Helpis                        
correct=Female   guess=Male     name=Mucapuis                      
correct=Female   guess=Male     name=Mucapuius                     
correct=Female   guess=Male     name=Myro                          
correct=Female   guess=Male     name=Pieris                        
correct=Female   guess=Male     name=Publius Caetennius Clemes     
correct=Female   guess=Male     name=Sarbis                        
correct=Female   guess=Male     name=Zieisis                       
correct=Female   guess=Male     name=Ziles                         
correct=Female   guess=Male     name=Zises Mucazenis               
correct=Male     guess=Female   name=Abisalma                      
correct=Male     guess=Female   name=Aelia Potens                  
correct=Male     guess=Female   name=Agatho                        
correct=Male     guess=Female   name=Agricola                      
correct=Male     guess=Female   name=Agrippa                       
correct=Male     guess=Female   name=Amica                         
correct=Male     guess=Female   name=Aquila                        
correct=Male     guess=Female   name=Aquila Barsemon               
correct=Male     guess=Female   name=Aqvila                        
correct=Male     guess=Female   name=Arsama                        
correct=Male     guess=Female   name=Atila                         
correct=Male     guess=Female   name=Aurelia Iulia                 
correct=Male     guess=Female   name=Cerdo                         
correct=Male     guess=Female   name=Charagonia Arche              
correct=Male     guess=Female   name=Clagissa                      
correct=Male     guess=Female   name=Claudia Hygia                 
correct=Male     guess=Female   name=Coca                          
correct=Male     guess=Female   name=Currithie                     
correct=Male     guess=Female   name=Daphno                        
correct=Male     guess=Female   name=Dasa                          
correct=Male     guess=Female   name=Dasa                          
correct=Male     guess=Female   name=Dida Sita                     
correct=Male     guess=Female   name=Diurpagissa                   
correct=Male     guess=Female   name=Dizala                        
correct=Male     guess=Female   name=Dotu                          
correct=Male     guess=Female   name=Ediuna                        
correct=Male     guess=Female   name=Hera                          
correct=Male     guess=Female   name=Hera                          
correct=Male     guess=Female   name=Hercla                        
correct=Male     guess=Female   name=Iaehetav                      
correct=Male     guess=Female   name=Iulia Pollitta                
correct=Male     guess=Female   name=Laxtucissa                    
correct=Male     guess=Female   name=Libella                       
correct=Male     guess=Female   name=Lossa                         
correct=Male     guess=Female   name=Marsua                        
correct=Male     guess=Female   name=Mnasea                        
correct=Male     guess=Female   name=Mucala                        
correct=Male     guess=Female   name=Mucatra                       
correct=Male     guess=Female   name=Nica                          
correct=Male     guess=Female   name=Primattia                     
correct=Male     guess=Female   name=Primitiva                     
correct=Male     guess=Female   name=Publia Aelia Florentina       
correct=Male     guess=Female   name=Rodo                          
correct=Male     guess=Female   name=Romaesta                      
correct=Male     guess=Female   name=Rustia Respecta               
correct=Male     guess=Female   name=Scaris Busila                 
correct=Male     guess=Female   name=Seneca                        
correct=Male     guess=Female   name=Sextia Torquata               
correct=Male     guess=Female   name=Sita                          
correct=Male     guess=Female   name=Soda                          
correct=Male     guess=Female   name=Sola                          
correct=Male     guess=Female   name=Sura                          
correct=Male     guess=Female   name=Surucca                       
correct=Male     guess=Female   name=Tara                          
correct=Male     guess=Female   name=Tarsa                         
correct=Male     guess=Female   name=Tarsa                         
correct=Male     guess=Female   name=Tsinna                        
correct=Male     guess=Female   name=Ulpia Andia                   
correct=Male     guess=Female   name=Ulpia Diotima                 
correct=Male     guess=Female   name=Valeria Priscilla             
correct=Male     guess=Female   name=Valeria Zile                  
correct=Male     guess=Female   name=Viccia Satunina               
correct=Male     guess=Female   name=Zacca                         
correct=Male     guess=Female   name=Zina                          
correct=Male     guess=Female   name=Zinama                        
correct=Male     guess=Female   name=Zura                          
In [20]:
# What other features could we use to boost accuracy?
In [21]:
# An example classification...

print(classifier.classify(features('Lucius Annaeus Seneca')))
print(classifier.classify(features('Pompeia Paulina')))
Male
Female
In [22]:
# Sample classifications...

print(classifier.classify(features('Ulysses')))
print(classifier.classify(features('Penelope')))
print(classifier.classify(features('Hercules')))
print(classifier.classify(features('Deineira')))
print(classifier.classify(features('Aeneas')))
print(classifier.classify(features('Dido'))) # Note correct classification here; see above
Male
Female
Male
Female
Male
Female