The following is a tutorial aimed at a true machine learning novice, as I was when I took on this project a year ago. (And which I still am -- though now I'm just regular green instead of neon green when it comes to this material.) We're going to build a tagger capable of predicting which character speaks a line of dialogue in the novels of A Song of Ice and Fire, by George RR Martin. To do that we're going to use conditional random fields, and a wonderful utility called CRFsuite, by Naoki Okazaki. For our text processing we'll be using Python 2.7 and NLTK (Natural Language Toolkit).

I'm going to get granular with this; my hope is that by mentioning each piece of my workflow, you may learn about some new tools and techniques that will be useful to you in your own projects. The code explanations will be from a novice and aimed at a beginner, someone who recognizes Python syntax and knows list comprehensions, but not much beyond that. Feel free to skip stuff if my expounding is sapping your life force.

Important: if you're hoping to learn about CRFs from a theoretical perspective, this is not your tutorial. To me, CRFsuite is a beautiful black monolith that I run my monkey paws over. We'll spend some time experimenting with boosting the tagger's accuracy, but that will be trial and error. While that may be disappointing, consider this:

  • I was able to get a great result (~75% accuracy) with CRFsuite off-the-shelf
  • There will be no LaTeX in this document.

Our gameplan is pretty simple. As with any machine learning algorithm, we need to prepare a training and a test set. We'll then select a set of features -- things the algorithm will monitor in order to make its classifications. After we process our text with these features, we then feed the results into CRFsuite and congratulate ourselves on a job well done. (Or steel ourselves for the painstaking work of verifying the computer's guesses.)

Let's get started.

Loading Text Data

First we've got to track down a copy of our source text, and I'll leave it up to you whether or not you pay the iron price for it.

If you're new to the world of natural language processing, you may not yet appreciate how plaintext isn't so plain. Every .txt file has a character encoding, which dictates how each character is represented as bytes. ASCII, the format I read GameFAQs Ocarina of Time walkthroughs in, could only represent 128 characters; it has since been superseded by UTF-8, which can encode every Unicode character. My copy of ASOIAF is in UTF-8, which brings a little cognitive overhead with it, but is in fact a bonus.
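To see that 128-character ceiling in action, here's a quick experiment you can run in any Python shell:

```python
# ASCII covers code points 0-127; a curly quote lives far outside that range.
print(ord(u"A"))       # 65 -- comfortably inside ASCII
print(ord(u"\u201c"))  # 8220 -- the left curly quote, no ASCII equivalent

try:
    u"\u201c".encode("ascii")
except UnicodeEncodeError:
    print("ASCII cannot encode a curly quote; UTF-8 can.")

print(u"\u201c".encode("utf-8"))  # UTF-8 spends three bytes on it
```

Keep those three bytes in mind; they'll turn up again when we load the corpus with the wrong encoding.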

We'll pull this text file into NLTK so we can more easily manipulate it. NLTK can do a ton of stuff, and it's how I learned Python, so if this stuff seems interesting to you, check out their excellent online book. For our purposes we're going to use it for loading our data and tokenizing it. Tokenizing is the act of splitting a sentence into its words and punctuation, and is a common step in NLP projects.

In [1]:
import nltk
nltk.word_tokenize("NLTK is ready to go.")
Out[1]:
['NLTK', 'is', 'ready', 'to', 'go', '.']

NLTK comes preloaded with a number of text corpora, but we need to roll our own.

Create a folder and put your ASOIAF text files into it. Since the series is so long, an omnibus plaintext would be almost 10 megabytes. Not ideal when you're doing find and replace. I've got mine split up by book, but the real pro, who is planning on doing further analysis, will separate each book into its chapters and give them a sequential ID.

But let's keep it simple for now! Once we have our text files in the folder, we run this code:

In [2]:
corpus = nltk.corpus.PlaintextCorpusReader(r'corpus', 'George.*\.txt', encoding = 'utf-8')

The r denotes a raw string in Python, which means I don't have to escape any backslashes. Not a problem here, because I'm going directly into a folder named "corpus", but if your folder setup is more elaborate, it's good to keep in mind.
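If raw strings are new to you, a quick shell check shows what the r prefix buys (the path below is made up for illustration):

```python
# In a normal string, a backslash starts an escape sequence and must be doubled.
plain = "corpus\\winter\\dialogue.txt"
raw = r"corpus\winter\dialogue.txt"
print(plain == raw)  # True -- two spellings of the same path

print(len("\n"))   # 1 -- "\n" is a single newline character
print(len(r"\n"))  # 2 -- r"\n" is a backslash followed by an n
```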

The second argument is a regex, which tells NLTK to grab any file in the corpus folder whose name starts with "George" and ends in ".txt".

The encoding parameter is very important -- if there's a mismatch between your file encodings and what NLTK is expecting, you'll get errors.

An NLTK corpus is handy because it lets us access the text at a number of different levels.

In [3]:
corpus.words("George R. R. Martin - 01 - A Game Of Thrones.txt")[-5:]
Out[3]:
[u'the', u'music', u'of', u'dragons', u'.']
In [4]:
corpus.words()[0]
Out[4]:
u'PROLOGUE'
In [5]:
corpus.sents()[1][:6]
Out[5]:
[u'\u201c', u'We', u'should', u'start', u'back', u',\u201d']

There we hear the doomed Gared from the prologue of AGOT, and see some Unicode characters represented in Python. You can see that all Unicode strings are prefaced with a u, and within them they contain escape sequences. \u201c is a left curly quote, \u201d a right one. I mentioned earlier that UTF-8 is a bonus, and here's why. Let's see what would happen if we opened this corpus without UTF-8 encoding.

In [6]:
bad_corpus = nltk.corpus.PlaintextCorpusReader(r'corpus', '.*\.txt')
In [7]:
bad_corpus.sents()[1][:9]
Out[7]:
['\xe2', '\x80\x9c', 'We', 'should', 'start', 'back', ',', '\xe2', '\x80\x9d']

In the same way that \u introduces a Unicode escape, \x introduces a byte written in hexadecimal. Those three bytes -- \xe2, \x80, \x9c -- are the UTF-8 encoding of the left curly quote, but since NLTK doesn't know the encoding, it treats them as separate one-byte characters and tries to tokenize them. You can see that it doesn't know how.
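You can confirm in a shell that those three bytes reassemble into a single character once the encoding is declared; this is all the encoding='utf-8' argument was doing for us:

```python
# The bytes NLTK printed, stitched back together and decoded as UTF-8.
raw_bytes = b"\xe2\x80\x9c"
print(len(raw_bytes))                          # 3 -- three bytes on disk...
print(len(raw_bytes.decode("utf-8")))          # 1 -- ...but one character
print(raw_bytes.decode("utf-8") == u"\u201c")  # True -- the left curly quote
```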

We'll be working with paragraphs, so let's look at one of those:

In [8]:
print corpus.paras()[1]
[[u'\u201c', u'We', u'should', u'start', u'back', u',\u201d', u'Gared', u'urged', u'as', u'the', u'woods', u'began', u'to', u'grow', u'dark', u'around', u'them', u'.'], [u'\u201c', u'The', u'wildlings', u'are', u'dead', u'.\u201d']]

You can see how NLTK structures the data. Sentences are lists of tokens, paragraphs are lists of sentences. Easy enough!

Feature Sets

Next we need to prepare our training data, but in order to do that, we have to decide what labels we'll use. When a part-of-speech tagger is working, it knows that any given token will belong to a lexical category, each of which has a label. JJ is an adjective, NN is a noun, IN is a preposition. These labels are crucial to the tagger's competency. The Penn Treebank lists 36 such part-of-speech labels.

What will our labels be? The naive answer is character names. That ain't gonna work, for a few reasons:

  1. ASOIAF has over 1,000 speakers. That's too many choices for our poor tagger. We want to winnow down the labels as much as possible, so that we can score a few correct classifications on dumb luck.
  2. Characters are referred to in many different ways. Joffrey might be "Joffrey", "Joff", "the prince", or simply "he".
  3. If we use character names as labels, these characters will have to be represented in the training data. Otherwise our tagger won't have any information on the characters, and will therefore never guess them.
  4. All the characters sound alike, basically. (I discovered this in another machine learning experiment, in which I tried to cluster the characters by their vocabulary.) Some have catchphrases, like grievous for Varys and Hodor for Hodor, but those are rare. Plus, most of the speakers don't get enough speaking time to differentiate themselves from everybody else.

So while it's tempting to tag by character names, let's discard it in favor of something that better represents the tagging process ongoing in a reader's mind.

Pick up the nearest novel, turn to a random page, and try to figure out who's talking. How do you do it? You look for proper nouns in the vicinity of the dialogue.

“Will saw them,” Gared said.

[...]

Ser Waymar Royce glanced at the sky with disinterest. “It does that every day about this time. Are you unmanned by the dark, Gared?”

Not every line of dialogue is tagged, though. Say you space out, return to the text, and see this:

“Did you make note of the position of the bodies?”

You look at the paragraphs above and below. Here are the two above:

“Did you see any weapons?”

“Some swords, a few bows. One man had an axe. Heavy-looking, double-bladed, a cruel piece of iron. It was on the ground beside him, right by his hand.”

No help there. Here are the next two:

Will shrugged. “A couple are sitting up against the rock. Most of them on the ground. Fallen, like.”

“Or sleeping,” Royce suggested.

We know that Will won't be asking and answering himself, so we can say he's not the speaker, and since most dialogue goes back and forth, we assume Royce was our speaker.

That same heuristic will inform our tagger. We'll teach it to recognize proper nouns near the text, and to refer to nearby paragraphs when it can't find any. Here are our tags, then:

PS +/-2, FN +/-2, NN +/-2, Other.

PS is a "post speaker". If a paragraph's tag is PS -2, that means the decisive piece of text was the person whose name appears right after the dialogue two paragraphs back. If it's FN 1, that means it's the first name that appears in the next paragraph. NN 0 means there are at least two character names preceding the dialogue in the paragraph, and we want the one which is nearest to the dialogue.

I'll also toss in an ADR +/-2, for characters addressed within a line of dialogue.

Tagging

Now we prep the training data. Here's where SublimeText is handy. I went into AGOT, selected a left curly quote, did a Find -> Quick Find All, and then hit the Home key twice. Now I have a cursor at the beginning of every paragraph with dialogue in it. I then type "{}". Since there are no curly brackets in the text, we can use them to contain markup we may wish to strip out later.

We'll use this regex -- (?<=\{)(?=\}) -- to hop from curly bracket to curly bracket. If you haven't seen that notation before, these are lookarounds. The first set of parentheses is a lookbehind: the match must come right after an opening curly bracket (escaped with a backslash). The second is a lookahead: the match must come right before a closing bracket. Together they match the empty position inside a {} pair, which is exactly where your cursor needs to be, and once you've typed a tag between the brackets, that pair stops matching. You can see that both have the ?= construction; only the lookbehind has a < in there too.
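Python's re module supports the same lookarounds, so you can sanity-check the pattern outside SublimeText on a toy line (invented text, not the real corpus):

```python
import re

line = "{} Ned frowned. {Ned} said nothing. {}"
# Zero-width match: right after a {, right before a }. Only *empty* pairs
# qualify, so brackets you've already filled with a tag stop matching.
hits = re.findall(r"(?<=\{)(?=\})", line)
print(len(hits))  # 2 -- the two untagged {} pairs
```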

Now you can navigate brackets with a press of the F3 button, which is the Windows shortcut for Find Next in SublimeText. This kind of efficiency is important, because you'll be tagging a thousand paragraphs of dialogue or so. At least that's how many I did. It wasn't as brutal as I was expecting, nor as time-consuming. (Though it's been a year since it happened, so maybe that's a lie.)

One note before you get started: consider whether or not you want to use the positional labels (PS, FN, NN), or actual character names. I know, I said we're not doing character names, but if you commit to the positionals then you are marrying this training data to a certain model. If you just type {Jon} when Jon says something, then you can convert that to a positional later, or use a different set of labels if you come up with something better.

I don't think there's a good answer here. Last year I tagged by character names. Now I'm forced to do a weird preprocessing step, which is disambiguating the decisive speaker tag. If Ned's name shows up two paragraphs back and one paragraph ahead, which is preferred? This will have major effects on the tagger's behavior, and doing it programmatically makes it a tiny bit dumber. So I'm not sure what I'd recommend to you -- I think it's easier from a tagging perspective to see a character name and type a character name, but far easier from a programming perspective to have the positional labels.

Extracting Features

Okay, you tagged some stuff. I salute your dedication to the cause of NLP. All we have to do now is write a few functions that will accept a paragraph and tag it with all the features we're interested in.

What are those, again? The workhorses that will be responsible for most of our accuracy are simple: is there a PS, FN, or NN in the current paragraph, or those paragraphs around it?

Finding Names

Our first function will have to scan for proper names. You could do this by doing some part-of-speech tagging.

In [9]:
sentence = corpus.paras()[33][0]
print " ".join(sentence)
print nltk.pos_tag(sentence)
“ Such eloquence , Gared ,” Ser Waymar observed .
[(u'\u201c', 'NN'), (u'Such', 'JJ'), (u'eloquence', 'NN'), (u',', ','), (u'Gared', 'NNP'), (u',\u201d', 'NNP'), (u'Ser', 'NNP'), (u'Waymar', 'NNP'), (u'observed', 'VBD'), (u'.', '.')]

Those NNP tags next to Ser and Waymar tell us it's a proper noun. But there are a couple of downsides here:

  1. Miscategorization. See how that closing quote was also classified as a proper noun?
  2. POS tagging takes time.
In [10]:
%timeit nltk.pos_tag(sentence)
10 loops, best of 3: 18.1 ms per loop
In [11]:
asoiaf_sentence_count = 143669
( asoiaf_sentence_count * 18.1 ) / 1000 / 60
Out[11]:
43.34014833333334

There are a lot of paragraphs to iterate through in ASOIAF, and 40-odd minutes of POS tagging will make testing and refactoring a drag. Sure, we could do one tagging off the bat, and then do lookups from there on out, but that's one more data structure to deal with, and that tagging would have to be redone whenever you modify the source text. (And that will happen now and then.)

Luckily we're not forced to do POS tagging to identify character names. It's one of the benefits of choosing ASOIAF as your object of study: there's a ton of data mining that's already been done. Let's go scrape some of that.

Scraping

I went to the very handy A Wiki of Ice and Fire, scraped their character list with a basic copy and paste, and had a nearly comprehensive list of every name particle. You can get that list here. If that's good enough for you, we'll rendezvous at the next section. For those who might be curious how we might scrape that page programmatically, let me tick off a few options that I've used in other projects.

Wget

A great utility that is dead simple if you're lucky enough to have a predictable set of links to crawl. Instead of writing a full-blown crawler, you create a list of links and pass it in with the -i flag, like so:

wget -i list_of_links.txt

Requests

Python has a library, requests, which works nicely for single pages.

In [12]:
import requests

r = requests.get("http://awoiaf.westeros.org/index.php/List_of_characters")
html = r.text
print html[:100]
<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
<head>
<meta charset="UTF-8"/>
<title

Parsing

Once we've got the HTML, we need to shuck all the extra markup to get at our links. BeautifulSoup is an HTML parser that will get you the links with a minimum of fuss. After you've installed it and parsed some HTML with it, fetching the links from a page is as simple as:

parsed_html.find_all("a")

You can read more about that here.

I'd like to show you a different approach, which uses a library called lxml. This is another parser, one that lets you use XPath. I'm new to XPath, but it's a powerful way to traverse tree data.

In [13]:
import lxml.html

tree = lxml.html.fromstring(html)
character_names = tree.xpath("//ul/li/a[1]/@title")
print character_names[:5]
['Abelar Hightower', 'Addam', 'Addam Frey', 'Addam Marbrand', 'Addam Osgrey']

If you're squinting at that XPath expression in the above codeblock, here's how it works.

tree.xpath("//ul        # select all unordered lists, wherever they are on the page
             /li        # select the li tags within these unordered lists
             /a[1]      # select the first link within these li's...
             /@title    # return the title attribute
           ")
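If you'd like to poke at that query without installing lxml, the standard library's xml.etree.ElementTree understands a useful subset of XPath. Here it is running against a made-up fragment shaped roughly like the wiki's list (the titles are real character names, but the markup is invented for illustration):

```python
from xml.etree import ElementTree

snippet = """
<div>
  <ul>
    <li><a title="Abelar Hightower">Abelar Hightower</a> <a title="House Hightower">ref</a></li>
    <li><a title="Addam Marbrand">Addam Marbrand</a></li>
  </ul>
</div>
"""

tree = ElementTree.fromstring(snippet)
# Same shape as the lxml query: every ul, its li children, the first a in each.
names = [a.get("title") for a in tree.findall(".//ul/li/a[1]")]
print(names)  # ['Abelar Hightower', 'Addam Marbrand']
```

Note that the second link in the first li is skipped, exactly as a[1] promises.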

From here, we just need to atomize these names and remove any particles we might not want to recognize as part of a name. Just scrolling through AWOIAF's page, I see things like "Taena of Myr". We don't want our tagger to guess a line was spoken by "of".

NLTK can help us with that. It comes with a corpus of "stopwords". Stopwords are those mundane words that are so ubiquitous they're of little use when it comes to characterizing a piece of text.

In [14]:
particles = ' '.join(character_names).split(" ")
print len(set(particles))

stopwords = nltk.corpus.stopwords.words('english')
print stopwords[:5]

particles = set(particles) - set(stopwords)
print len(particles)

# Some things will slip through the cracks. Since Aegon I is in the AWOIAF list,
# that means a capital I is now considered a name. You'll have to clean that up!
"I" in particles
2167
['i', 'me', 'my', 'myself', 'we']
2146
Out[14]:
True

Finally we'll want to include some character nicknames that the scraper might have missed, like Dany, Blackfish, and Joff. Once you're happy with your list, dump it to a file somewhere for future reference.

Finding Names, part two

We've abandoned the part-of-speech approach and have gotten a list of name particles. We'll extract sequences and test to see if their components can be found in that list. It's finally time to write some code of our own.

In [15]:
import itertools
from operator import itemgetter

particles = [particle.rstrip('\n') for particle in open('asoiaf_name_particles.txt')]
tokens = [u'\u201c', u'Such', u'eloquence', u',', u'Gared', u',\u201d', u'Ser', u'Waymar', u'observed', u'.']

def roll_call(tokens, particles):
    speakers = {}
    particle_indices = [i for (i, w) in enumerate(tokens) if w in particles]
    for k, g in itertools.groupby(enumerate(particle_indices), lambda (i,x): i-x):
        index_run = map(itemgetter(1), g)
        speaker_name = ' '.join(tokens[i] for i in index_run)
        speakers[min(index_run)] = speaker_name
    return speakers

This is a function that makes use of lambdas, which is to say that there's no way I would have used it last year when I did this project. But the script I used to get the job done at the time is so tragically ugly and unreadable I couldn't bear to put it on the internet. Plus, I think there's some cool stuff to be learned here for a novice, so I'll step through this one.

Itertools is a library worth knowing. I use it frequently to flatten nested lists, or to do permutations. At the moment we're interested in its groupby function. While coming up with a new version of this function for this tutorial, I completely glossed over groupby in favor of dropwhile and takewhile, which I slapped together in a semi-fancy recursive solution.

But as I kept designing I realized that roll_call, to be useful to the different functions calling it, would have to know the locations of the names it found. So I thought I would locate all the indices that are in our list of name particles. You see that on line 3 of the function:

particle_indices = [i for (i, w) in enumerate(tokens) if w in particles]

Learning about enumerate was a big help when I was getting my feet wet in Python. It takes an iterable and returns a tuple for each element, containing the index and the element.
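A one-liner in the shell shows the shape of what enumerate hands back:

```python
tokens = ["Ser", "Waymar", "observed"]
# enumerate pairs each element with its position in the iterable.
print(list(enumerate(tokens)))  # [(0, 'Ser'), (1, 'Waymar'), (2, 'observed')]
```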

Line four is the slickest piece of code you'll see in this tutorial, and I did not write it. It's pulled straight from the itertools docs.

for k, g in itertools.groupby(enumerate(particle_indices), lambda (i,x): i-x):

Groupby scrolls through an iterable and groups the elements according to the result of the lambda. Lambdas are anonymous functions. Unlike roll_call, they don't get a def block. They're just bits of code that take an argument and return a value. Our function here is a simple subtraction: each element's position in the list minus the element itself.

Let's look at how that plays out.

In [16]:
print tokens
particle_indices = [i for (i, w) in enumerate(tokens) if w in particles]
print particle_indices

for index, location in enumerate(particle_indices):
    difference = index - location
    print "{} - {} = {}".format(index, location, difference)
[u'\u201c', u'Such', u'eloquence', u',', u'Gared', u',\u201d', u'Ser', u'Waymar', u'observed', u'.']
[4, 6, 7]
0 - 4 = -4
1 - 6 = -5
2 - 7 = -5

That's the trick of the groupby: indices always increase by one, so if two neighboring elements in a list are indeed consecutive -- increasing by one themselves -- then the result of that subtraction will remain constant.

The groupby method sees the -4 and assigns 4 to its own group. The 6, 7, both having -5, get placed together.

Now that we know where the particle sequences are, we have to go get them. What does groupby return? A key, which is the result of our lambda, and a "grouper" object. The grouper is an iterator over the (position, token-index) tuples we fed in, so the number we're looking for sits at index 1 of each tuple. We use map, which applies a function to every item in a collection -- in this case itemgetter(1), which pulls that token index out of each tuple.
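Here's the full grouping on our toy indices, written with a Python 3-friendly lambda (lambdas lost tuple-unpacking arguments like (i, x) in Python 3, so this version indexes into the pair instead):

```python
import itertools
from operator import itemgetter

particle_indices = [4, 6, 7]  # where "Gared", "Ser", "Waymar" sit in the sentence
runs = []
for key, group in itertools.groupby(enumerate(particle_indices),
                                    lambda pair: pair[0] - pair[1]):
    # itemgetter(1) pulls the token index out of each (position, index) tuple.
    runs.append(list(map(itemgetter(1), group)))
print(runs)  # [[4], [6, 7]] -- 6 and 7 are consecutive, so they form one name
```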

Once the groupby is complete, it's a matter of simple list indexing to pull out the names, which are then handed over to a dictionary called "speakers".

In [17]:
roll_call(tokens, particles)
Out[17]:
{4: u'Gared', 6: u'Ser Waymar'}

Optimizing

Let's compare the speed of this function to the POS tagging approach:

In [18]:
%timeit roll_call(tokens, particles)
100 loops, best of 3: 4.28 ms per loop

Roughly four times faster isn't bad, but we can improve on that with a set. Sets are much better bouncers than lists: membership checks are effectively constant time, while a list has to scan its elements one by one to decide whether someone is "on the list".
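You can reproduce the gap with the stdlib's timeit module; exact numbers will vary by machine, so this sketch only checks which container wins:

```python
import timeit

names_list = ["name%d" % i for i in range(5000)]
names_set = set(names_list)

# Worst case for the list: the element we ask about sits at the very end.
list_time = timeit.timeit(lambda: "name4999" in names_list, number=2000)
set_time = timeit.timeit(lambda: "name4999" in names_set, number=2000)
print(set_time < list_time)  # True: hashing beats scanning
```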

In [19]:
set_of_particles = set(particle.rstrip('\n') for particle in open('asoiaf_name_particles.txt'))
%timeit roll_call(tokens, set_of_particles)
10000 loops, best of 3: 19.8 µs per loop

You know you're in good shape when you start seeing Greek letters in your time profiler.

Finding Names Relative to Dialogue

Now we need to write some functions that call roll_call in the right places, so that we find the speakers before the dialogue starts, the characters named during the dialogue, and the characters explicitly tagged after the dialogue. We'll put these all in a class that can return a full dictionary of speaker positions, which we can then pass off to another featureset builder, and then off to CRFsuite.

But I'd like to get our data in better shape, first.

XML Parser

After my gratifying XPath one-liner above, I decided to write an XML parser for our text files. The format makes a ton of sense for this kind of data. ASOIAF is a set of novels, each of which contains chapters, each of which contains paragraphs, some of which contain dialogue -- and we want to unobtrusively mark up that text. If I didn't convert to XML -- and I didn't, the first time I rode this merry-go-round -- I'd have tag data clogging up my paragraphs.

I'd rather not talk about the following script: it reminds me of my very first days working in Python, writing leviathan functions, littered with little one-liners to massage the data just so, rife with slightly-too-long variable names.

In [20]:
from lxml import etree
import codecs
import re

def ASOIAFtoXML(input):
  # Each item in input should be a dictionary with a book's title and its location on disk.
  root = etree.Element("root")
  for item in input:
    title = item["title"]
    current_book = etree.Element("book", title=item["title"])
    root.append(current_book)
    with codecs.open(item["contents"], "r", encoding="utf-8") as book_file:
        #A hack for text files that don't have any lines that match our chapter matcher.
        current_chapter = etree.Element("chapter", title="Debug")
        for paragraph in book_file:
            paragraph = paragraph.strip()
            if paragraph != "":
                title_match = re.match("\A[A-Z\W ]+\Z", paragraph)
                if title_match:
                    current_chapter = etree.Element("chapter", title=title_match.group())
                    current_book.append(current_chapter)
                else:
                    current_graf = etree.SubElement(current_chapter, "paragraph")
                    while paragraph != "":
                        current_dialogue = current_graf.xpath('./dialogue[last()]')
                        speaker_name = None  # reset so a previous paragraph's tag can't leak through
                        speaker_match = re.search("(\{(.*?)\} )", paragraph)
                        if speaker_match:
                            speaker_tag = speaker_match.group(1)
                            speaker_name = speaker_match.group(2)
                            paragraph = paragraph.replace(speaker_tag, "")
                        open_quote = paragraph.find(u"\u201c")
                        if open_quote == -1:
                            if current_dialogue:
                                current_dialogue[0].tail = paragraph
                            else:
                                current_graf.text = paragraph
                            paragraph = ""
                        elif open_quote == 0:
                            current_dialogue = etree.SubElement(current_graf, "dialogue")
                            if speaker_name:
                                current_dialogue.attrib["speaker"] = speaker_name
                            close_quote = paragraph.find(u"\u201d") + 1
                            if close_quote == 0:
                                # Find will return -1 in this case, so by testing for equality to 0
                                # we're asking if there's no closing quote remaining. This occurs
                                # in long monologues that have paragraph breaks.
                                close_quote = len(paragraph)
                            current_dialogue.text = paragraph[open_quote: close_quote]
                            paragraph = paragraph[close_quote:]
                        else:
                            if current_dialogue:
                                current_dialogue[0].tail = paragraph[:open_quote]
                            else:
                                current_graf.text = paragraph[:open_quote]
                            paragraph = paragraph[open_quote:]
  return root

tree = ASOIAFtoXML([{"title": "AGOT", "contents": "corpus/train_asoiaf_tagged.txt"}])

# Here's how you dump that tree to a file.
# et = etree.ElementTree(tree)
# et.write(codecs.open("asoiaf.xml", "w", encoding="utf-8"), pretty_print=True)

The gist of it is this: we use lxml to create a tree, then we iterate through a text file, one line at a time. If that line matches the chapter header format (all capital letters, punctuation, and spaces), then we append a new chapter to the current book node. Once we're in the chapter we chomp our way through the paragraphs, using another regex to give the dialogue nodes a speaker attribute, assuming we've already tagged it.

One interesting note about XML. It's hierarchical data, so it naturally wants everything to be neatly contained, one node within another. That's not the case with a paragraph of prose, where dialogue nodes float in a stream of narration. lxml has a solution: text and tail. A node's text property holds whatever comes before its first child node, and each child's tail property holds the text between that child's closing tag and whatever comes next.

In [21]:
markup = '''<paragraph>Worse and worse, Catelyn thought in despair. My brother is a fool.
Unbidden, unwanted, tears filled her eyes. <dialogue speaker="Catelyn Stark">
&#8220;If this was an escape,&#8221;</dialogue> she said softly,
<dialogue speaker="Catelyn Stark">&#8220;and not an exchange of hostages, why should the Lannisters
give my daughters to Brienne?&#8221;</dialogue></paragraph>'''
graf = lxml.etree.fromstring(markup)
print graf.text
Worse and worse, Catelyn thought in despair. My brother is a fool.
Unbidden, unwanted, tears filled her eyes. 

Now we get the first dialogue node.

In [22]:
print graf[0].text
“If this was an escape,”

What happens to "she said softly"? We assign that to the dialogue node's tail property.

In [23]:
print graf[0].tail
 she said softly,

And we carry on like this, adding more dialogue nodes and their tails.

Incidentally, this makes our life so easy when it comes time to find the speakers in a paragraph. Which is now!

In [34]:
class feature_extractor_simple:
    """Analyze dialogue features of a paragraph. Paragraph should be an lxml node."""
    def __init__(self, paragraph_node, particles, tag_distance=0):
        self.paragraph = paragraph_node
        self.particles = set(particles)
        self.tag_distance = tag_distance
        self.raw = ''.join(t for t in self.paragraph.itertext())
        self.tokens = self.tokenize(self.raw)

    def tokenize(self, string):
        return nltk.wordpunct_tokenize(string)

    def find_speakers(self, tokens):
        speakers = {}
        particle_indices = [i for (i, w) in enumerate(tokens) if w in self.particles]
        for k, g in itertools.groupby(enumerate(particle_indices), lambda (i,x): i-x):
            index_run = map(itemgetter(1), g)
            speaker_name = ' '.join(tokens[i] for i in index_run)
            speakers[min(index_run)] = speaker_name
        return speakers

    def pre_speak(self, prior_tag="FN", near_tag="NN"):
        #Names that appear before the first line of dialogue.
        features = {}
        if self.paragraph.text is not None:
            speakers = self.find_speakers(self.tokenize(self.paragraph.text))
            if len(speakers) > 0:
                features.update({"{} {}".format(prior_tag, self.tag_distance): speakers[min(speakers.keys())]})
            if len(speakers) > 1:
                features.update({"{} {}".format(near_tag,self.tag_distance): speakers[max(speakers.keys())]})
        return features

    def dur_speak(self, tag="ADR"):
        # Names that appear in dialogue, in the vocative case.
        features = {}
        for dialogue in self.paragraph.itertext("dialogue", with_tail=False):
            tokens = self.tokenize(dialogue)
            named = self.find_speakers(tokens)
            addressed = {k: v for (k, v) in named.items() if tokens[k-1] == "," or tokens[k + 1 + v.count(" ")].startswith(",")}
            if len(addressed) > 0:
                features.update({"{} {}".format(tag, self.tag_distance): addressed[max(addressed.keys())]})
        return features

    def post_speak(self, tag="PS"):
        features = {}
        # Names right after a closing quote.
        tails = (line.tail for line in self.paragraph.iterfind("dialogue") if line.tail is not None)
        for tail in tails:
            tokens = self.tokenize(tail)
            speakers = {k: v for (k, v) in self.find_speakers(tokens).items() if k <= 1}
            if len(speakers) > 0:
                features.update({"{} {}".format(tag, self.tag_distance): speakers[min(speakers.keys())]})
                break
        return features

A few words about these functions.

If you're new to Python and haven't dug into classes yet, don't let them intimidate you. To use a class, you basically write functions as normal, except you give them all this self argument. This lets Python know which instance of the class the method is operating on. A class is a clone factory, and an instance is an individual clone. They all have the same DNA -- that is, methods and data representations -- but their personalities are different because of their different life experiences -- in this analogy, the data they are given.

Classes also have the special __init__ method, which lets you initialize an instance with certain data.
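A toy class (hypothetical, nothing to do with our extractor) shows the moving parts:

```python
class Character(object):
    """One factory; each instance gets its own data via __init__."""
    def __init__(self, name):
        self.name = name  # self points at the particular clone being built

    def speak(self, line):
        # Methods receive the instance as self, so they can reach its data.
        return "{} says: {}".format(self.name, line)

jon = Character("Jon")
print(jon.speak("Winter is coming."))  # Jon says: Winter is coming.
```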

With your data in the care of a custom class, you can relax. Because you've done the work of abstracting its behaviors, you can now ring a silver bell and have your data brought to you on a tray.

In [25]:
paragraph = tree.xpath(".//paragraph")[32]

example_extractor = feature_extractor_simple(paragraph, particles)
print example_extractor.raw
print example_extractor.pre_speak()
print example_extractor.dur_speak()
print example_extractor.post_speak()
“Such eloquence, Gared,” Ser Waymar observed. “I never suspected you had it in you.”
{}
{'ADR 0': u'Gared'}
{'PS 0': 'Ser Waymar'}

If you're confused by some of these methods, I'll quickly explain what they're doing here. If the above code looks reasonable to you, you know the drill: rendezvous at the next section.

There is some ungainly dictionary manipulation going on, and that's because dictionaries are unordered in Python. That's something I knew, in the way that you kinda know your pocket feels wrong when you're just about to lock yourself out of the house. I had to do a lot of very painful data checking, just now. So to ensure we're getting the first or last speaker, as the case may be, I'm looking at the keys and selecting the min/max.

pre_speak

As I mentioned above, the paragraph's text property will contain all the text before the first line of dialogue. We simply check that for character names.

dur_speak

There can be many lines of dialogue in one paragraph, and we iterate through them all with this:

for dialogue in self.paragraph.itertext("dialogue", with_tail=False)

itertext is an lxml method that will get all the texts of a node. Here we specify that we only want the text of dialogue nodes, and set the with_tail parameter to False, so we know we're only getting dialogue.
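
If you want to poke at this without lxml, here's a rough stdlib approximation of the same idea -- iterfind gets the dialogue nodes, and taking only .text (never .tail) mimics with_tail=False. The markup is a made-up miniature of the tutorial's paragraph format:

```python
import xml.etree.ElementTree as ET

xml = ('<paragraph>He cleared his throat. '
       '<dialogue>"Such eloquence, Gared,"</dialogue> Ser Waymar observed. '
       '<dialogue>"I never suspected you had it in you."</dialogue></paragraph>')
paragraph = ET.fromstring(xml)

# Only the dialogue texts, no tails -- what itertext("dialogue",
# with_tail=False) gives you in lxml.
dialogues = [node.text for node in paragraph.iterfind("dialogue")]
```

The tails (" Ser Waymar observed.") stay out of the result, which is the whole point.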

Once we find the character names, the method then selects only those that have a comma to either side, indicating the vocative case. (e.g. "Ned, promise me." / "Promise me, Ned.")

My gut tells me that the last person addressed in dialogue is the most likely to be the one responding in the next paragraph, so we'll let the function overwrite the earlier addresses and return the last one heard.

post_speak

For this method, we want the first speaker, so we break the loop once we find something. (Taking the first one is a somewhat arbitrary decision.)

The method looks in a two token window to the right of any closing quotes. That's where you find stuff like,

"Hello," Jon said. "Goodbye," said Jon.

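Here's a sketch of that two-token window -- the name set and the crude tokenizer are stand-ins, not the tagger's real ones:

```python
known_names = {"Jon", "Ned", "Catelyn"}

def first_speaker_after_quote(tail):
    """Check the first two tokens after a closing quote for a known name."""
    tokens = tail.replace(".", " ").replace(",", " ").split()
    for token in tokens[:2]:
        if token in known_names:
            return token
    return None
```

Both orderings land the name inside the window: first_speaker_after_quote(" Jon said.") and first_speaker_after_quote(" said Jon.") each return 'Jon'.
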
One syntax tip for very new programmers: you can call functions in a list comprehension.

tails = [line.tail for line in self.paragraph.iterfind("dialogue") if line.tail is not None]

This lets me get the dialogue tails in one shot. (I just have to add a conditional that discards any None results.)

CRFsuite

This may be the part you're most anxious about. It's got all the conditional random fields, whatever those are, and it's run from the command line, and there's no way to see how it works from the inside.

In fact, CRFsuite is the easy and fun part of this. I discovered while writing this that it has a Python wrapper, but for this tutorial, we're going to keep it simple and use the executable from the command line.

(I plan on beefing this tagger up when TWOW comes out, and I'll integrate that then. But I've got a couple years before I have to worry about it.)

All CRFsuite needs is a plaintext file with many rows of tab-separated features, like this:

FN 0    Graf Sent Len=4 FN 1=True   FN -2=True  FN 0=True   NN 1=True

This is the format for the training data. The first tag is the correct answer; all the following bits of text are features. These strings can look like anything you want, though don't use a colon -- that's reserved for weighting features, and will get your stuff misinterpreted.
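
Producing a row like that is just a tab-join; the feature strings below are invented for illustration:

```python
# The correct label goes first, then the features, all tab-separated.
label = "FN 0"
features = ["Graf Sent Len=4", "FN 1=True", "NN 1=True"]
row = "\t".join([label] + features)
# row == "FN 0\tGraf Sent Len=4\tFN 1=True\tNN 1=True"
```

One such row per paragraph, and you have a training file.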

You'll open up a command prompt wherever your crfsuite.exe lives, and type in this:

crfsuite learn -m asoiaf.model train.txt

That will create a model, which is the brain of your tagger. You can call it anything; I named mine asoiaf. To see your model's accuracy, you type this:

crfsuite tag -qt -m asoiaf.model test.txt

To actually do some tagging, you type this:

crfsuite tag -m asoiaf.model untagged.txt

untagged.txt will look just like train.txt, only there won't be a correct label at the beginning. So you would have instead something like:

NN -1=True  FN 0=True   FN 2=True   FN -1=True  NN 0=True

Naoki Okazaki, CRFsuite's author, gives a fuller tutorial here.

With all that mentioned up top, let's start messing around with featuresets to see what makes for an accurate tagger. We'll start with the most basic: a series of booleans describing the positional labels in and around the paragraph.

Here's our feature_extractor class again, this time with some new methods at the top.

In [35]:
-
In [27]:
paragraph = tree.xpath(".//paragraph")[-1]

example_extractor = feature_extractor(paragraph, particles)
print example_extractor.raw
print example_extractor.features()
print example_extractor.local_features()
print example_extractor.feature_booleans()
And in their hands, the daggers.
{}
['NoQuotes=True', 'Short Graf=True', 'Little Talk=True']
['PS 0=False', 'FN 0=False', 'NN 0=False', 'ADR 0=False']

Last night, in an undocumented frenzy of machine learning, I evolved a feature set. Here are some sketchy, in media res notes on that process.

Featureset 1: Positional Booleans, True Only

Label   Count  Recall
PS 0    207    0.9949
FN 0    185    0.95
NULL    118    0.3492
OTHER   56     0.3939
PS -2   44     0.5238

Item accuracy: 430 / 678 (0.6342)

We're going to be seeing a bunch of these statistics, so let me tell you what they are.

Let's imagine that we're out at lunch, people watching. I ask you to identify members of the Illuminati as they pass. You, being fully aware that we all live in the dark shadow of this grand cabal, finish your ravioli and start tagging people.

Precision, which I've ignored in the statistics I'll be showing you, measures how often your positive guesses are right. In other words, how rarely do you wrongly accuse a passerby of being in the Illuminati?

Recall measures, for each label, how many of its true instances in the test data your model found. Of all the actual Illuminati who walked past, what fraction did you tag?

And F1 is a composite score for both these metrics. You can see how if you were to guess that everyone is in the Illuminati, that would net you a perfect recall score and a lousy precision score.

Because all these tags will be vetted by me, I'm not terribly worried about precision. I want recall and item accuracy.
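
If you'd rather have the two metrics in code than in metaphor, here's a toy computation on made-up tags:

```python
gold = ["PS 0", "NULL", "FN 0", "PS 0", "NULL"]
pred = ["PS 0", "NULL", "PS 0", "PS 0", "FN 0"]

def recall(label):
    # Of the gold instances of this label, how many did we find?
    hits = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    total = sum(1 for g in gold if g == label)
    return hits / float(total)

def precision(label):
    # Of our guesses of this label, how many were right?
    hits = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    guessed = sum(1 for p in pred if p == label)
    return hits / float(guessed)
```

Guessing PS 0 everywhere would push its recall to 1.0 while its precision cratered -- the accuse-everyone strategy.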

In this first featureset, I only looked at positive booleans. So in the paragraph above, the entire featureset consisted of: "ADR 0=True" and "PS 0=True". It scored a 63.4% on item accuracy.

Is 63.4% good? Considering that NULL, PS 0, and FN 0 make up three-quarters of our test instances, and those are essentially gimmes, we can certainly do better. Let's throw in the other half of our positional booleans, the falses.

Featureset 2: All Positional Booleans

Label   Count  Recall
NULL    254    0.9048
PS 0    204    0.9899
FN 0    149    0.975
OTHER   24     0.2273
PS -2   19     0.2857

Item accuracy: 515 / 678 (0.7596)

Now we're hammering the gimmes and have achieved a respectable score. 75% accuracy means you basically have to tag AGOT and the first third of ACOK by hand, and the computer will spot you for the other three quarters. That's a bunch of hours, but not hideous.

Still, there's really no reason to have anything but 98%+ recall on the NULL label, so let's add a feature to target that.

Featureset 3: Any quotes?

Label   Count  Recall
PS 0    218    0.9907
NULL    180    0.9119
FN 0    167    0.9118
OTHER   63     0.3784
PS 2    25     0.5

Item accuracy: 550 / 710 (0.7746)

The new feature here counts the number of opening quotes in the paragraph.

I've gotta say, I'm surprised the NULLs weren't even more accurate. More work to be done there. Next I'd like to notch up FN 0.

Featureset 4: Index of first name?

Label   Count  Recall
PS 0    218    0.9907
NULL    183    0.9057
FN 0    157    0.8971
OTHER   68     0.4189
PS -2   23     0.5484

Item accuracy: 551 / 710 (0.7761)

This feature reports the index at which we see the first name.

Hm: may be overcomplicating, let's switch to a boolean.

Featureset 5: Index 0 a name? + No redundancies

Label   Count  Recall
PS 0    216    0.986
FN 0    166    0.9265
NULL    160    1
OTHER   85     0.5811
PS 2    32     0.7143

Item accuracy: 578 / 710 (0.8141)

That was it! I was accidentally clogging up the featureset with redundancies by calculating LeftQuoteCount for each paragraph and its five neighbors.

Once I fixed that, NULL shot up to a perfect score... but we've run out of cheap gas. I might actually have to be clever to improve this much further! We'll see if that works...

Featureset 6: Post Speaker + and -2

Here we introduce a boolean: are there post speakers two paragraphs behind and two paragraphs ahead of the current one? The theory is that this will bump up our PS -2 score.

Label   Count  Recall
PS 0    216    0.986
FN 0    166    0.9265
NULL    160    1
OTHER   84     0.5676
PS 2    32     0.7143

Item accuracy: 578 / 710 (0.8141)

Nothing doing there!

Featureset 7: Sequences??

Label   Count  Recall
PS 0    217    0.986
FN 0    168    0.9265
NULL    160    1
OTHER   82     0.5541
PS 2    30     0.6429

Item accuracy: 576 / 710 (0.8113) Instance accuracy: 56 / 142 (0.3944)

Now hold the phone! Turns out that CRF can operate on sequences, and indeed that is its whole point. I had been ignoring that instance accuracy for a while, because it was always 0/1. Which meant that it was looking at the entire text as one long conversation, so we weren't using sequences at all.

Excuse me a second, let me smack myself in the face. Assuming this will improve our accuracy -- and that's an open question -- how do we want to approach this? In this run I made each sequence five paragraphs long, but that doesn't seem correct.

Perhaps if there are two NULLs in a row, the conversation is over, and that will be the sequence.

After playing with it some more, I can't get the model to care about my conversations. As I understand it, it should have a special set of transition weights, depending on where we are in the sequence. So if we're at the start of a conversation, it might make a different decision on a label than it would in the middle, or at the end.

But nothing in the text dump of my model indicates that's happening. I'll keep fiddling with some other features in the meantime.

Oh, and let's have a look at the script that is generating our training and testing data. It's not efficient, since it calculates each paragraph's features five different times. I'll leave it like that for this tutorial, but know that you can speed this thing up by doing one loop, storing each paragraph's feature booleans, and then doing another loop in which you add these in.

In [28]:
tree = ASOIAFtoXML([{"title": "ASOIAF", "contents": "corpus/train_asoiaf_pos_tagged.txt"}])
paragraphs = tree.xpath(".//paragraph")
In [ ]:
def prep_test_data(paragraphs):
    max_index = len(paragraphs)
    results = []
    for index, paragraph in enumerate(paragraphs):
        extractor = feature_extractor(paragraph, set_of_particles)
        all_features = extractor.local_features() + extractor.feature_booleans()
        for n in [-2, -1, 1, 2]:
            if 0 <= n+index < max_index:
                neighbor_features = feature_extractor(paragraphs[index + n], set_of_particles, tag_distance = n).feature_booleans()
                if neighbor_features:
                    all_features += neighbor_features      
        all_features.insert(0, extractor.speaker)
        results.append("\t".join(all_features))
    return results

# results = prep_test_data(paragraphs)

# The same function rewritten as a generator: it yields each row as it's
# built, instead of holding the whole list in memory.
def prep_test_data(paragraphs):
    max_index = len(paragraphs)
    for index, paragraph in enumerate(paragraphs):
        extractor = feature_extractor(paragraph, set_of_particles)
        all_features = extractor.local_features() + extractor.feature_booleans()
        for n in [-2, -1, 1, 2]:
            if 0 <= n+index < max_index:
                neighbor_features = feature_extractor(paragraphs[index + n], set_of_particles, tag_distance = n).feature_booleans()
                if neighbor_features:
                    all_features += neighbor_features      
        all_features.insert(0, extractor.speaker)
        yield ("\t".join(all_features))

%prun results = list(prep_test_data(paragraphs))
In [30]:
max_index = len(results)

with codecs.open(r"new_test.txt", "w", "utf-8") as output:
    for line in results[:int(max_index/2)]:
        output.write(line + '\n')

with codecs.open(r"new_train.txt", "w", "utf-8") as output:
    for line in results[int(max_index/2):]:
        output.write(line + '\n')
(The traceback below is what happens if you pass the raw generator along without wrapping it in list() -- generators have no len(). The list() call above is the fix.)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-30-18bc0ae78269> in <module>()
----> 1 max_index = len(results)
      2 
      3 with codecs.open(r"new_test.txt", "w", "utf-8") as output:
      4     for line in results[:int(max_index/2)]:
      5             output.write(line + '\n')

TypeError: object of type 'generator' has no len()

More Featuresets

I tried a few more features.

  • Counting the number of names that appear before the first line of dialogue. Theory: this is where NN is most common. Result: nope.
  • A feature which notices if the paragraph is all or mostly dialogue. This did a good job of biasing PS -2s and FN -2s, but didn't make much difference overall.
  • Short/long paragraphs. Didn't do much.
  • "And" or "but" in the pre-speech. (Trying to target NN 0, which is being underguessed.)

I thought that last one was pretty clever, but it didn't work, and we never got above 81%.

I idly tried swapping the training set with the test set and scored an 84%. You're not supposed to spend too much time perfecting your features for a certain test set, because that leads to overfitting. In fact it's a better idea to mix up your training and test sets as much as possible. I haven't been shuffling my training and test sets because I thought that would screw up the sequence, but I'm not currently paying attention to sequence information, so why not? We'll shuffle.
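
The shuffle itself is a couple of lines. The rows below are stand-ins for the real tab-separated feature rows, and the fixed seed is just so the split is repeatable:

```python
import random

# Stand-ins for the rows that prep_test_data produces.
rows = ["row {}".format(i) for i in range(10)]

random.seed(42)                 # fixed seed: same split every run
random.shuffle(rows)
midpoint = len(rows) // 2
test_rows, train_rows = rows[:midpoint], rows[midpoint:]
```

Half and half, no overlap, order randomized -- exactly what we want as long as we're ignoring sequence information.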

Some shuffling occurs

That got an 82%.

Okay! I think we've reached the limits of my skill, here.

There's No Way This Tutorial Is Still Going, Right?

Let's wrap it up, and talk about how one might take this further.

  • Beef up the training set. Right now I've got training and test sets of 700 paragraphs apiece. The series as a whole has ~40,000 paragraphs. Which is cool, that I tagged 1.7% of the series and got 80% out of it. (That 80% is suspect to me; in practice it felt like 75%.) What happens if we use 10,000 paragraphs for training and tagging? The marginal value of extra training data diminishes pretty quickly, but for low frequency tags like ADR, there simply aren't enough observations in the 700 I've got.
  • Pore over the CRFsuite docs. There are surely some options that could help me out.
  • Experiment with weighting features.
  • Really take a crack at sequences.
  • Implement Python wrapper. Besides saving me a lot of data passing, this would let me really smarten the tagger up. For instance, I could do...
  • Fallbacks. The weakness of my tagger is that it reaches for OTHER too frequently. OTHER is the tagger saying, look, I know somebody's talking in this paragraph, but you haven't given me any way to identify 'em. So there's never any upside to having an OTHER tag -- the wildest stab in the dark would be an improvement.
  • Gender tagging. I love dialogue with names in the tail. That doesn't happen when a man and a woman are in conversation -- "he said" and "she said" are already unambiguous, so the names drop out. So a lot of Catelyn's dialogue isn't being tagged, because she's typically the only woman in the room, and the text reliably refers to her as just she. This would be easy to circumvent: we could look for a she in a dialogue tail, then check the title of the current chapter. If it's a woman's name, there you go.
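
That last idea might sketch out like this -- the POV list and the matching logic are hypothetical, not anything the tagger does yet:

```python
# Hypothetical set of female POV chapter titles.
FEMALE_POV = {"Catelyn", "Sansa", "Arya", "Daenerys", "Cersei", "Brienne"}

def guess_she(tail, chapter_title):
    # If the dialogue tail says "she" and the chapter is named for a woman,
    # guess the POV character as the speaker.
    words = tail.lower().replace(",", " ").replace(".", " ").split()
    if "she" in words and chapter_title in FEMALE_POV:
        return chapter_title
    return None
```

In a Catelyn chapter, a tail like "she said, pulling her cloak tight." would resolve to Catelyn; in a Jon chapter it stays unresolved.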

Conclusion

Okay! I hope that was helpful to somebody. Thanks for reading, and if you'd like to get in touch, I'm on Twitter.

Also, I should say that all of this tagging was done for a large critical study of A Game of Thrones. If you're a fan of those books and want to read the analysis that this tagging allowed me to do, I'll be posting that soon.