Using word endings to determine part of speech

This notebook follows section 6.1 of the NLTK Book on Supervised Classification, specifically the examples on part-of-speech tagging (pp. 229-230 in the print edition). The example in the NLTK Book uses the tagged Brown Corpus. To get a similar collection, I retrieve token and POS information from G. Celano's Lemmatized Ancient Greek XML repository, specifically the 'Unique Values' dataset, a collection of all of the unique tokens in the Morpheus and PerseusUnderPhilologic databases. The requests_html package is used to parse the XML and return a list of tuples of this form: [(token1, pos1), (token2, pos2), ...].
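For readers without requests_html installed, the same extraction can be sketched with only the standard library. The element names (`d` for a record, `f` for the form, `p` for the POS string) are taken from the parsing code below; the sample XML string here is illustrative, not a real excerpt from the repository:

```python
import xml.etree.ElementTree as ET

# Illustrative sample mimicking the repository's structure:
# each <d> record holds a token form (<f>) and a POS string (<p>).
sample = """
<ds>
  <d><f>ἀρχόμενος</f><p>v--------</p></d>
  <d><f>Φοῖβε</f><p>n--------</p></d>
</ds>
"""

root = ET.fromstring(sample)
records = []
for d in root.iter('d'):
    form = d.find('f').text
    pos = d.find('p').text[0]  # first character is the coarse POS tag
    records.append((form, pos))

print(records)  # [('ἀρχόμενος', 'v'), ('Φοῖβε', 'n')]
```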

With this tagged collection, we get the top 100 endings of 1, 2, or 3 letters. We then use this featureset to train the DecisionTreeClassifier. (Not the fastest process, btw!) The results are honestly not impressive: consistently around 59-60% accuracy. Adding similar features, e.g. 4- and 5-letter terminations, does not meaningfully improve accuracy.
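To make the featureset shape concrete, here is a small worked example using a hypothetical three-suffix list in place of the full top-100 list the notebook builds below:

```python
common_suffixes = ['ς', 'ος', 'νος']  # illustrative subset of the top-100 list

def pos_features(word):
    # One boolean feature per candidate suffix
    return {'endswith({})'.format(s): word.lower().endswith(s)
            for s in common_suffixes}

print(pos_features('λόγος'))
# {'endswith(ς)': True, 'endswith(ος)': True, 'endswith(νος)': False}
```

Each word thus becomes a dictionary of endswith(...) booleans, which is the input format NLTK's classifiers expect.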

One nice feature of the DecisionTreeClassifier is that you can generate pseudocode to trace the decision-making process. Here is a sample decision tree for this test:

Sample decision tree (depth=4)

    if endswith(ς) == False: 
      if endswith(υ) == False: 
        if endswith(ῳ) == False: 
          if endswith(ον) == False: return 'a'
          if endswith(ον) == True: return 'v'
        if endswith(ῳ) == True: return 'n'
      if endswith(υ) == True: 
        if endswith(νου) == False: return 'n'
        if endswith(νου) == True: return 'v'
    if endswith(ς) == True: 
      if endswith(τες) == False: 
        if endswith(τας) == False: 
          if endswith(νος) == False: return 'n'
          if endswith(νος) == True: return 'v'
        if endswith(τας) == True: return 'v'
      if endswith(τες) == True: return 'v'

Not much more to add here. The results (i.e. 59-60%) strike me as not unlike being in first-year Greek, where endings are helpful (and some endings, e.g. -εσθαι, are much, much more helpful than others) but not always decisive. The amount of ambiguity in the endings remains a challenge, one that only becomes easier to overcome with a better handle on working with words (and word endings) in context. And context is where the next tutorial will take us. [PJB 3.10.18]

In [9]:
# Imports

import random
import pickle

import nltk
In [10]:
## Get POS data

# from requests_html import HTMLSession

# from tqdm import tqdm

# session = HTMLSession()
# urlbase = "https://raw.githubusercontent.com/gcelano/LemmatizedAncientGreekXML/master/uniqueTokens/values/"

# records = []

# for i in range(1,2):
#     r = session.get('{}{}.xml'.format(urlbase, i))
#     ds = r.html.find('d')
#     for d in tqdm(ds):
#         form = d.find('f', first=True).text
#         pos = d.find('p', first=True).text[0]
#         records.append((form, pos))
        
# pickle.dump(records, open('./data/greek_pos.p', 'wb'))
records = pickle.load(open('./data/greek_pos.p', 'rb'))
In [11]:
# Get stats on dataset

print('This dataset consists of {} tokens from the Lemmatized Ancient Greek XML Unique Tokens corpus'.format(len(records)))
print('Here is a sample of the dataset: {}'.format(records[:10]))
This dataset consists of 100000 tokens from the Lemmatized Ancient Greek XML Unique Tokens corpus
Here is a sample of the dataset: [('ἀρχόμενος', 'v'), ('σέο', 'p'), (',', 'u'), ('Φοῖβε', 'n'), ('παλαιγενέων', 'm'), ('κλέα', 'n'), ('φωτῶν', 'n'), ('μνήσομαι', 'v'), ('οἳ', 'p'), ('Πόντοιο', 'n')]
In [17]:
# Spot-check the dataset: the same form can appear more than once

de = [word for word, _ in records if word == 'δέ']
print(de)
['δέ', 'δέ', 'δέ']
In [4]:
# Dictionary of POS tags from https://github.com/gcelano/LemmatizedAncientGreekXML

pos_tags = {
    'n': 'noun',
    'v': 'verb',
    'a': 'adjective',
    'd': 'adverb',
    'l': 'article',
    'g': 'particle',
    'c': 'conjunction',
    'r': 'preposition',
    'p': 'pronoun',
    'm': 'numeral',
    'i': 'interjection',
    'u': 'punctuation'
}
In [5]:
# Get a list of the top word terminations of 1, 2, and 3 characters

suffix_fdist = nltk.FreqDist()

greek_words = [word for word, _ in records]

for word in greek_words:
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1
    
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
print(common_suffixes)
['ν', 'ς', 'ι', 'ον', 'α', 'ος', 'αι', 'ων', 'ας', 'ιν', 'ις', 'αν', 'ε', 'ες', 'ο', 'ʼ', 'υ', 'εν', 'ου', 'οι', 'σιν', 'υς', 'ους', 'το', '᾽', 'τες', 'τα', 'σι', 'ῶν', 'τος', 'ται', 'ης', 'ῳ', 'νος', 'η', 'οις', 'ην', 'θαι', 'ει', 'νον', 'των', 'τας', 'σαν', 'ὸν', 'ιον', 'ῖς', 'ίας', 'νων', 'ντα', 'ῃ', 'νοι', 'ειν', 'όν', 'ί', 'ντο', 'τον', 'ναι', 'σαι', 'ὶ', 'να', 'ρον', 'τʼ', 'ίαν', 'ως', 'ός', 'ῖν', 'ίων', 'εῖν', 'αις', 'ὸς', 'μεν', 'τι', 'ῦ', 'ᾳ', 'σας', 'ω', 'οῦ', 'τ᾽', 'ια', 'δʼ', 'σα', 'τε', 'ά', 'εις', 'ίου', 'ὺς', 'ὴν', 'νου', 'ὰς', 'σε', 'ῆς', 'ῷ', 'ῇ', 'νης', 'ὰ', 'νας', 'ρα', 'λον', 'νην', 'δ᾽']
In [6]:
# Function for getting features for a given word

def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    return features 
In [7]:
# Build featuresets and randomize

tagged_words = records
featuresets = [(pos_features(n), g) for (n,g) in tagged_words]
random.shuffle(featuresets)
In [8]:
# Make train/test sets

size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
In [9]:
# Create classifier instance and test

classifier = nltk.DecisionTreeClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)
Out[9]:
0.5852
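For context (a sketch added here, not part of the NLTK Book exercise): one way to gauge whether ~59% is meaningful is to compare it against a classifier that always predicts the most frequent tag seen in training. The labels below are a toy illustration using the dataset's coarse tags, not the real distribution:

```python
from collections import Counter

def majority_baseline(train_labels):
    # Always predict the single most common label seen in training
    return Counter(train_labels).most_common(1)[0][0]

# Toy labeled data for illustration
train_labels = ['n', 'n', 'v', 'n', 'a', 'v', 'n']
test_labels = ['n', 'v', 'n', 'a']

pred = majority_baseline(train_labels)
accuracy = sum(1 for g in test_labels if g == pred) / len(test_labels)
print(pred, accuracy)  # n 0.5
```

If the real majority class (likely nouns) covers well under 59% of tokens, the suffix features are contributing real signal despite the modest absolute score.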
In [10]:
word = 'λόγος'
pos_test = classifier.classify(pos_features(word))
print('The classifier determines the POS of {} to be {}.'.format(word, pos_tags[pos_test]))
The classifier determines the POS of λόγος to be noun.
In [11]:
# Return first four levels of decision tree for the above classification

print(classifier.pseudocode(depth=4))
if endswith(ς) == False: 
  if endswith(υ) == False: 
    if endswith(ῳ) == False: 
      if endswith(ον) == False: return 'n'
      if endswith(ον) == True: return 'a'
    if endswith(ῳ) == True: return 'n'
  if endswith(υ) == True: 
    if endswith(νου) == False: return 'n'
    if endswith(νου) == True: return 'v'
if endswith(ς) == True: 
  if endswith(τες) == False: 
    if endswith(τας) == False: 
      if endswith(νος) == False: return 'v'
      if endswith(νος) == True: return 'v'
    if endswith(τας) == True: return 'v'
  if endswith(τες) == True: return 'v'

In [12]:
# Add more terminations as features, i.e. 4- and 5-letter endings

suffix_fdist = nltk.FreqDist()

greek_words = [word for word, _ in records]

for word in greek_words:
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1
    suffix_fdist[word[-4:]] += 1
    suffix_fdist[word[-5:]] += 1
    
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
print(common_suffixes)
['ν', 'ς', 'ι', 'ον', 'α', 'ος', 'αι', 'ων', 'ας', 'ιν', 'ις', 'αν', 'ε', 'ο', 'ες', 'ʼ', 'υ', 'εν', 'ου', 'οι', 'σιν', 'υς', 'ους', 'το', '᾽', 'τες', 'τα', 'σι', 'ντες', 'ῶν', 'τος', 'ται', 'ης', 'ῳ', 'νος', 'η', 'οις', 'ην', 'θαι', 'ει', 'σθαι', 'νον', 'των', 'τας', 'σαν', 'ὸν', 'ιον', 'ῖς', 'ίας', 'νων', 'ντα', 'ῃ', 'νοι', 'ειν', 'όν', 'ί', 'τʼ', 'ντο', 'ντος', 'δʼ', 'ντας', 'τον', 'ενος', 'μενος', 'ναι', 'σαι', 'ὶ', 'να', 'ρον', 'ντων', 'δ᾽', 'ενοι', 'ίαν', 'ως', 'μενοι', 'ός', 'τ᾽', 'νους', 'ένων', 'ίων', 'ῖν', 'μένων', 'εῖν', 'μεν', 'αις', 'ὸς', 'τι', 'ενον', 'μενον', 'ῦ', 'εσθαι', 'ᾳ', 'ω', 'σας', 'οῦ', 'ια', 'τε', 'σα', 'νται', 'ά']
In [13]:
# Build featureset, train classifier, and test

tagged_words = records
featuresets = [(pos_features(n), g) for (n,g) in tagged_words]
random.shuffle(featuresets)

size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]

classifier = nltk.DecisionTreeClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)
Out[13]:
0.5921