Using word endings and sentence position to determine part of speech

This notebook follows section 6.1 of the NLTK Book on Supervised Classification, specifically the examples on exploiting content (pp. 230-231 in the print edition). Since we need both tagged words and also sentence positions, I put together a new dataset, again drawn from G. Celano's Lemmatized Ancient Greek XML repository. This time I parse the XML of the 1K Years of Greek texts for the works of Aristotle. This gives us around ~34,000 tagged sentences to work with.

As in the last notebook in this series, we use the endings of either 1, 2, or 3 letters from each word as well as the previous token. We then use this featureset to train the NaiveBayesClassifier. This shows a considerable bump in accuracy—now 81-83%. (Though note in the final cell—running the previous notebook's decision-tree classification on this dataset in itself bumps accuracy up to ~68%). Still room for improvement, but its good to confirm our instincts as Greek readers that word endings alone are only so useful in determining part of speech. Context, even just knowledge of the preceding word, can significantly increase understand of a given word form.

This is still a limited, simplistic even, approach to context. For example, it would likely be helpful to know not just the preceding word, but the part of speech of the preceding word. As Bird et al. note: "In general, simple classifiers always treat each input as independent from all other inputs." In the following notebook, we will look at working with related inputs through the use of joint classifiers. [PJB 3.11.18]

In [1]:
# Imports

import random
import pickle

import nltk
In [2]:
## Get tagged sent data (for Aristotle's works in the 1K Years of Greek texts)

# from requests_html import HTMLSession

# from tqdm import tqdm

# url_base = "https://raw.githubusercontent.com/gcelano/LemmatizedAncientGreekXML/master/texts/"

# aristotle_xmls = ['tlg0086.tlg001.1st1K-grc1.xml',
#                      'tlg0086.tlg001.1st1K-grc2.xml',
#                      'tlg0086.tlg002.1st1K-grc1.xml',
#                      'tlg0086.tlg002.1st1K-grc2.xml',
#                      'tlg0086.tlg003.1st1K-grc1.xml',
#                      'tlg0086.tlg003.1st1K-grc2.xml',
#                      'tlg0086.tlg006.1st1K-grc1.xml',
#                      'tlg0086.tlg008.1st1K-grc1.xml',
#                      'tlg0086.tlg014.1st1K-grc1.xml',
#                      'tlg0086.tlg016.1st1K-grc1.xml',
#                      'tlg0086.tlg017.1st1K-grc1.xml',
#                      'tlg0086.tlg018.1st1K-grc1.xml',
#                      'tlg0086.tlg020.1st1K-grc1.xml',
#                      'tlg0086.tlg022.1st1K-grc1.xml',
#                      'tlg0086.tlg024.1st1K-grc1.xml',
#                      'tlg0086.tlg026.1st1K-grc1.xml',
#                      'tlg0086.tlg030.1st1K-grc1.xml',
#                      'tlg0086.tlg031.1st1K-grc1.xml',
#                      'tlg0086.tlg034.1st1K-grc1.xml',
#                      'tlg0086.tlg037.1st1K-grc1.xml',
#                      'tlg0086.tlg040.1st1K-grc1.xml',
#                      'tlg0086.tlg041.1st1K-grc1.xml',
#                      'tlg0086.tlg042.1st1K-grc1.xml',
#                      'tlg0086.tlg044.1st1K-grc1.xml',
#                      'tlg0086.tlg052.1st1K-grc1.xml',
#                      'tlg0086.tlg054.1st1K-grc1.xml']

# session = HTMLSession()

# sents = []

# for aristotle_xml in aristotle_xmls:
#     r = session.get('{}{}'.format(url_base, aristotle_xml))
    
#     sents_xml = r.html.find('s')

#     for sent_xml in tqdm(sents_xml):
#         sent = []
#         ts = sent_xml.find('t')
#         for t in ts:
#             pos = t.attrs['o'][0]
#             form = t.find('f', first=True).text
#             sent.append((form, pos))
#         sents.append(sent)
        
#pickle.dump(sents, open('./data/aristotle_tagged_sents.p', 'wb'))

tagged_sents = pickle.load(open('./data/aristotle_tagged_sents.p', 'rb'))
In [3]:
# Summarize data

print('There are {} sentences in this dataset.'.format(len(tagged_sents)))
print('Here is a sample sentence: {}.'.format(tagged_sents[2]))
There are 34244 sentences in this dataset.
Here is a sample sentence: [('εἶτα', 'd'), ('διορίσαι', 'v'), ('τί', 'p'), ('ἐστι', 'v'), ('πρότασις', 'n'), ('καὶ', 'd'), ('τί', 'd'), ('ὅρος', 'n'), ('καὶ', 'd'), ('τί', 'p'), ('συλλογισμός', 'n'), (',', 'u'), ('καὶ', 'd'), ('ποῖος', 'a'), ('τέλειος', 'a'), ('καὶ', 'd'), ('ποῖος', 'a'), ('ἀτελής', 'a'), (',', 'u'), ('μετὰ', 'r'), ('δὲ', 'g'), ('ταῦτα', 'p'), ('τί', 'd'), ('τὸ', 'l'), ('ἐν', 'r'), ('ὅλῳ', 'a'), ('εἶναι', 'v'), ('ἢ', 'd'), ('μὴ', 'd'), ('εἶναι', 'v'), ('τόδε', 'p'), ('τῷδε', 'p'), (',', 'u'), ('καὶ', 'd'), ('τί', 'p'), ('λέγομεν', 'v'), ('τὸ', 'l'), ('κατὰ', 'r'), ('παντὸς', 'a'), ('ἢ', 'c'), ('μηδενὸς', 'p'), ('κατηγορεῖσθαι', 'v'), ('.', 'u')].
In [4]:
# Create list of sentences without tags; in book, necessary?

sents = [[word for word, _ in sent] for sent in tagged_sents]
In [5]:
# Definte features

def pos_features(sentence, i):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    return features
In [6]:
# Here is an example of a feature for the second word in the third sentence...

pos_features(sents[2], 1)
Out[6]:
{'prev-word': 'εἶτα', 'suffix(1)': 'ι', 'suffix(2)': 'αι', 'suffix(3)': 'σαι'}
In [7]:
# Put together featuresets

featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append((pos_features(untagged_sent, i), tag))

random.shuffle(featuresets)
In [8]:
# Set up train/test data

size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
In [9]:
# Train classifier and evaluate

classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)
Out[9]:
0.8215093729799612

Appendix: Testing DecisionTreeClassifier on this dataset

This appendix takes the method/code from the "Using word endings to determine part of speech" notebook and applies it to the Aristotle sentences. Perhaps this is due to the more "natural" state of this dataset—i.e. the previous experiment was run using the "Unique Tokens" dataset in Lemmatized Ancient Greek XML. Sentences drawn directly from Aristotle's works show a more natural distribution of words, and so word endings, which may be responsible for the increase in accuracy; e.g. a word such as "καὶ" appears three times in the previous dataset but 3143 times in the dataset used below. Note that I limited to the number of tagged words for the classifer to 100000, because of performance problems on the complete dataset. Increasing this number may incrementally increase accuracy.

In [10]:
# DecisionTreeClassifer on Aristotle sentences, 100000 samples
suffix_fdist = nltk.FreqDist()    

flat_tagged_sents = [val for sublist in tagged_sents for val in sublist]
greek_words = [word for word, _ in flat_tagged_sents]

for word in greek_words:
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1
    
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]

def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    return features

tagged_words = flat_tagged_sents[:100000]
featuresets = [(pos_features(n), g) for (n,g) in tagged_words]

random.shuffle(featuresets[:100000])

size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]

classifier = nltk.DecisionTreeClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)
Out[10]:
0.6847
In [11]:
# NaiveBayesClassifier on 100000 samples, for comparison

def pos_features(sentence, i):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    return features

featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append((pos_features(untagged_sent, i), tag))

random.shuffle(featuresets[:100000])

size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]

classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)
Out[11]:
0.8264221073044602