Chapter 5: Building NLP applications

-- A Python Course for the Humanities by Folgert Karsdorp and Maarten van Gompel

Chapter is still in DRAFT stage


In the last chapter we built some tools to prepare corpora for further processing. Being able to tokenise a text is nice, but from a humanities perspective not very interesting in itself. So, what are we going to do with it? In this chapter, you'll implement two major applications that build upon the tools you developed. The first is a relatively simple program that scores each text in a corpus according to its Automatic Readability Index. In the second, we will build a system that can predict who wrote a given text. Again, we'll need to cover a lot of ground, and things become more challenging from here on. So, let's get started!

Automatic Readability Index

The Automatic Readability Index (ARI) is a readability test designed to gauge the understandability of a text. The formula for calculating the ARI is as follows:

$$ 4.71 \cdot \frac{nchars}{nwords} + 0.5 \cdot \frac{nwords}{nsents} - 21.43 $$

Let's apply some wishful thinking. If we had all the information needed to compute this formula (the number of characters, $nchars$, the number of words, $nwords$, and the number of sentences, $nsents$), we could start by writing a function that does the computation for us.
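
To get a feel for the formula, here is the computation for the counts used in the test below (300 characters, 40 words, 10 sentences):

$$ 4.71 \cdot \frac{300}{40} + 0.5 \cdot \frac{40}{10} - 21.43 = 35.325 + 2 - 21.43 = 15.895 $$

Let's now write that function.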


Quiz!

Write a function automatic_readability_index that takes three arguments n_chars, n_words and n_sents and returns the ARI given those arguments.

In [ ]:
def automatic_readability_index(n_chars, n_words, n_sents):
    # insert your code here

# do not modify the code below, it is for testing your answer only!
# it should output True if you did well
print(abs(automatic_readability_index(300, 40, 10) - 15.895) < 0.001)

Now we need to write some code to obtain the numbers we so wishfully assumed to have. We will use the code we wrote in earlier chapters to read and tokenise texts. In the file preprocessing.py we defined a function read_corpus which reads all files with the extension .txt in a given directory. It tokenises each text into a list of sentences, each of which is represented as a list of words. All words are lowercased and all punctuation is removed. We import the function using the following line of code:

In [ ]:
from pyhum.preprocessing import read_corpus

Let's write a function extract_counts that takes a list of sentences as input and returns the number of characters, the number of words and the number of sentences as a tuple.


Quiz!

Write the function extract_counts.

In [ ]:
def extract_counts(sentences):
    # insert your code here

# do not modify the code below, for testing only!
print(extract_counts(
    [["this", "was", "rather", "easy"], 
     ["please", "give", "me", "something", "more", "challenging"]]) == (53, 10, 2))

Well done! We're almost there. We can use our two functions to compute the ARI for a given text as follows:

In [ ]:
sentences = [["this", "was", "rather", "easy"], 
             ["Please", "give", "me", "something", "more", "challenging"]]

n_chars, n_words, n_sents = extract_counts(sentences)
print(automatic_readability_index(n_chars, n_words, n_sents))

However, it would be nice to have a little more abstraction.


Quiz!

Write the function compute_ARI that takes as argument a list of sentences (represented by lists of words) and returns the Automatic Readability Index for that input.

In [ ]:
def compute_ARI(sentences):
    # insert your code here
    
# do not modify the code below, it is for testing your answer only!
# it should output True if you did well
print(abs(compute_ARI(sentences) - 6.033) < 0.001)

Finally it would be nice to compare the readability of a number of texts. We need a function that iterates through the files in a directory and prints the Automatic Readability Index for each text.


Quiz!

Write a function compute_ARIs that takes the name of a directory as input and prints the Automatic Readability Index for each document in that directory.

In [ ]:
def compute_ARIs(directory):
    # insert your code here

Remember that in Chapter 3 we plotted various basic statistics using the Python plotting library matplotlib. Can you do the same for the ARIs of all documents?

In [ ]:
import matplotlib.pyplot as plt

# insert your code here
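
If you are unsure where to start, here is a minimal sketch of one possible approach. It assumes the ARI values have been collected in a dictionary called ari_scores mapping filenames to their ARI; the variable name and the example values are hypothetical placeholders, so adapt the sketch to your own compute_ARIs.

In [ ]:
import matplotlib.pyplot as plt

# hypothetical placeholder data; replace with the ARIs you computed above
ari_scores = {"text-a.txt": 8.5, "text-b.txt": 12.1, "text-c.txt": 15.0}

positions = range(len(ari_scores))
plt.bar(positions, list(ari_scores.values()))                # one bar per document
plt.xticks(positions, list(ari_scores.keys()), rotation=90)  # label the bars with the filenames
plt.ylabel("Automatic Readability Index")
plt.show()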

Authorship attribution

In this section you will implement the core of an authorship attribution application. You won't build a full stand-alone application, but rather focus on the functions needed to attribute new texts to their most likely authors.

The core of our application will be a naive Bayes classifier. Following good programming principles, we will try to make this classifier as generic as possible, so that it can also be used in contexts other than authorship attribution, such as document classification or classification in general.

The naive Bayes classifier is a probabilistic classifier that, given a set of features, tries to find the class with the highest probability. It is based on Bayes' theorem and is called naive because of its strong independence assumption: the presence or absence of each feature is assumed to be independent of every other feature. The posterior probability of a class is proportional to its prior probability multiplied by the probabilities of all features given that class:

$$ P(y|x_1,\ldots,x_n) \propto P(y) \prod^n_{i=1} P(x_i|y)$$

Classification is based on the maximum a posteriori or MAP decision rule, which simply picks the class (or author, in our case) that is most probable:

$$ classify(x_1, \ldots, x_n) = \arg\max_y P(y) \prod^n_{i=1} P(x_i|y) $$
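
As a toy illustration with entirely made-up numbers: suppose we have two candidate authors $A$ and $B$ with equal priors $P(A) = P(B) = 0.5$, and a two-word text with $P(x_1|A) = 0.01$, $P(x_2|A) = 0.02$, $P(x_1|B) = 0.03$ and $P(x_2|B) = 0.001$. Then

$$ P(A) \cdot P(x_1|A) \cdot P(x_2|A) = 0.5 \cdot 0.01 \cdot 0.02 = 0.0001, \qquad P(B) \cdot P(x_1|B) \cdot P(x_2|B) = 0.5 \cdot 0.03 \cdot 0.001 = 0.000015, $$

so the decision rule attributes the text to author $A$.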

The main function we will implement has a simple job: take a text as an argument and classify it as being written by one of the authors. Let's again apply some wishful thinking and implement the function as follows:

In [ ]:
def predict_author(text, feature_database):
    "Predict who wrote this text."
    return classify(score(extract_features(text), feature_database))

This function takes two arguments: the text to classify and the training data against which we want to classify the text. The function is basically an abstraction layer on top of classify, extract_features and score. classify is a simple function that takes a dictionary of {$author_i$: $P(author_i|text)$} and returns the author that is most probable. Let's implement this function.


Quiz!

Implement the function classify. It takes one argument, scores, which is a dictionary with authors as keys and the probability of each author as value. Return the author with the highest probability. (Tip: use the built-in function max; see the documentation.)

In [ ]:
scores = {"Hermans": 0.15, "Voskuil": 0.55, "Reve": 0.2, "Mulisch": 0.18, "Claus": 0.02}

def classify(scores):
    # insert your code here
    
print(classify(scores) == "Voskuil")

The function extract_features is rather straightforward as well. We'll build this function on top of the functions we defined in the previous chapters. For the moment we will assume that our model is a bag-of-words (BOW) model where the only features are individual words. We will define extract_features as an abstraction layer on top of read_corpus_file and tokenise as follows:

In [ ]:
from pyhum.preprocessing import read_corpus_file, tokenise

def extract_features(filename):
    return tokenise(read_corpus_file(filename))

Now for our training data, we need to store for each word how often it occurs with a particular author. As our data structure we will use a nested dictionary of the following structure author -> word -> count. We'll store the counts in the variable feature_database.

(For more information about defaultdict, see the defaultdict documentation. For more information about lambda expressions, see the lambda documentation.)

In [ ]:
from collections import defaultdict

feature_database = defaultdict(lambda: defaultdict(int))
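
To see why this nested defaultdict is convenient, here is a tiny throwaway demonstration (the variable demo and the keys are made up): missing keys are created on the fly with a count of zero, so we can increment counts without first checking whether a key already exists.

In [ ]:
demo = defaultdict(lambda: defaultdict(int))
demo["austen"]["emma"] += 1      # no KeyError: the inner dictionary and the count are created on the fly
demo["austen"]["emma"] += 1
print(demo["austen"]["emma"])        # prints 2
print(demo["austen"]["persuasion"])  # prints 0: unseen words simply have a count of zero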

To fill our feature_database we will need a couple of functions. First we need a function that returns the author of a particular text. To make things a little easier, we named our training files with the author's name followed by the title of the book, e.g. austen-emma.txt, or, when the path is part of the filename, /path/to/austen-emma.txt.
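
One hedged hint (not the full solution): the os.path module can strip the directory part of a path, which may come in handy for the second test case below.

In [ ]:
import os

# os.path.basename returns the final component of a path
print(os.path.basename("/path/to/Austen-emma.txt"))   # prints Austen-emma.txt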


Quiz!

Write a function extract_author that takes a filename as input and returns the name of the author.

In [ ]:
def extract_author(filename):
    # insert your code here

# do not modify the code below, it is for testing your answer only!
# it should output True if you did well
print(extract_author("Austen-emma.txt") == "Austen")
print(extract_author("/path/to/Austen-emma.txt") == "Austen")

Next we'll need a function, update_counts, that takes as arguments the name of an author and a text (the words extracted with extract_features) and adds the word counts to our feature_database. The function should return the updated feature_database. (Note that the test below passes a raw string, so your implementation may need to tokenise the text first.)

In [ ]:
from pyhum.preprocessing import tokenise

def update_counts(author, text, feature_database):
    # insert your code here
    return feature_database

# do not modify the code below, for testing only!
feature_database = defaultdict(lambda: defaultdict(int))
feature_database = update_counts("Anonymous", "This was written with a lack of inspiration", 
                                 feature_database)
test_database = defaultdict(lambda: defaultdict(int))
for word in "This was written with a lack of inspiration".split():
    test_database["Anonymous"][word] += 1
print(sorted(feature_database.items()) == sorted(test_database.items()))

Finally, we define a function add_file_to_database that takes a filename and the feature_database as input, extracts the author from the filename and adds the feature counts of that file to the feature_database:

In [ ]:
def add_file_to_database(filename, feature_database):
    return update_counts(extract_author(filename), 
                         extract_features(filename), 
                         feature_database)

Quiz!

Now that we have a function to add one file to the feature_database, we need a function that adds an entire corpus to the database. Write a function that takes the name of a directory as input and adds all files in this directory to the feature_database. Again, the function should return an updated version of our database.

In [ ]:
import os

def add_directory_to_database(directory, feature_database):
    # insert your code here
    return feature_database

We now have a function to extract the features of a document. We have also defined a couple of functions to add those features to our feature_database. It is now time to implement the core function of our authorship attribution application.

Before we implement the score function, we will first write a function to compute the probability of a single feature given an author. There are two things to note about this function.

First, if a given author and a word never occur together in the feature_database, the probability of that word given the author is zero, and because we multiply the probabilities of all features, the probability of that class becomes zero as well (think about this if you don't see why). Needless to say, this is rather problematic. A common strategy to overcome this problem is to add a pseudocount (normally 1) to the observed counts. The pseudocounts need to be incorporated in both the numerator and the denominator.
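
With this add-one smoothing, the probability of a feature $x_i$ given an author $y$ is estimated as

$$ P(x_i|y) = \frac{count(x_i, y) + 1}{\sum_j count(x_j, y) + n_{features}} $$

where $n_{features}$ is the number of unique features in the database. This is exactly what the log_probability function defined below computes, before taking the log.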

Second, the probability of a single feature will normally be quite small. If we multiply the probabilities of all our features given an author, we will end up with a very small number, possibly too small to be adequately represented by Python. Consider the code below:

In [ ]:
x = 0.00000000000000001
for i in range(30):
    x = x * 0.000000000000001
    print(x)

After fewer than 30 multiplications, the values become too small for Python to distinguish from each other. Even worse, they underflow to zero, and since multiplying by zero yields zero, all our class probabilities could end up being zero! We therefore take the log of the individual feature probabilities and sum them to obtain our final score.
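
In log space, the decision rule from earlier turns the product into a sum (taking the log does not change which class comes out as the most probable one):

$$ classify(x_1, \ldots, x_n) = \arg\max_y \left( \log P(y) + \sum^n_{i=1} \log P(x_i|y) \right) $$

In the score function below we will leave out the prior $\log P(y)$, which amounts to assuming that every author is equally likely beforehand. Let's first implement the log_probability function.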

In [ ]:
from math import log

def log_probability(feature_counts, features_sum, n_features):
    return log((feature_counts + 1.0) / (features_sum + n_features))

Here, feature_counts is the number of times the given feature occurs with a particular author, features_sum is the sum of all feature counts for that author, and n_features is the number of unique features in our feature_database (it appears in the denominator because we add a pseudocount for every possible feature).
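
To get a feel for these arguments, here is a small example using the toy counts from the test cell further below: author "A" has seen the word "the" twice, has 8 feature tokens in total, and the database contains 3 unique features.

In [ ]:
# counts taken from the toy feature_database used in the test of the score function below
print(log_probability(2, 8, 3))   # log(3 / 11), roughly -1.299
# an unseen word still receives a small non-zero probability thanks to the pseudocount
print(log_probability(0, 8, 3))   # log(1 / 11), roughly -2.398

Now that we have defined this crucial function, we are ready to implement our score function.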


Quiz!

The function score takes as input a list of features and the feature_database. It should return a dictionary with authors as keys and their scores (the sum of the log probabilities of all features given that author) as values. We'll provide the basic frame for this function below and ask you to fill in the details. This is without a doubt the most challenging Quiz! you have seen so far and we will be very impressed if you get it right.

In [ ]:
def score(features, feature_database):
    "Predict who wrote the document on the basis of the corpus."
    scores = defaultdict(float)
    # compute the number of features in the feature database here
    for author in feature_database:
        # compute the probability of features given that author here
    return scores

# do not modify the code below, for testing your answer only! 
# It should return True if you did well!
features = ["the", "a", "the", "be", "book"]
feature_database = defaultdict(lambda: defaultdict(int))
feature_database["A"]["the"] = 2
feature_database["A"]["a"] = 5
feature_database["A"]["book"]= 1
feature_database["B"]["the"] = 5
feature_database["B"]["a"] = 1
feature_database["B"]["book"] = 6
print(abs(dict(score(features, feature_database))["A"] - -7.30734) < 0.001)

Wow! You have really done a great job. Almost entirely by yourself, you have implemented a complete naive Bayes classifier that can be used for all kinds of classification problems, such as document classification and authorship attribution.

Now let's put everything together and test how well our system works. In the folder data/gutenberg/training we have provided a couple of training documents by different authors. The folder data/gutenberg/testing provides three test documents. Let's test our classifier on one of them!

In [ ]:
# first define the feature_database
feature_database = defaultdict(lambda: defaultdict(int))
feature_database = add_directory_to_database("data/gutenberg/training", feature_database)
print(predict_author("data/gutenberg/testing/milton-poetical.txt", feature_database))

It would be nice to evaluate our classifier on more than one document and to obtain some sort of score of how well our classifier performs. We will implement two functions: test_from_corpus and analyze_results.


Quiz!

test_from_corpus takes as input the name of a directory and a trained feature database. It then tries to predict the author of all files in the given directory. The function should return a list of (ground-truth-author, predicted-author) tuples.

In [ ]:
def test_from_corpus(directory, feature_database):
    results = []
    # insert your code here
    return results

Finally we will implement the function analyze_results which takes a list of (ground-truth-author, predicted-author) tuples as input and returns the accuracy of the classifier, which is defined as:

$$ accuracy(X) = \frac{\textrm{number of correct predictions}}{\textrm{total number of predictions}}$$
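
For example, in the test below three of the five predictions are correct, so the expected accuracy is

$$ accuracy = \frac{3}{5} = 0.6 $$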


Quiz!

Implement the function analyze_results and test your classifier on the test corpus in data/gutenberg/testing:

In [ ]:
def analyze_results(results):
    # insert your code here

# do not modify the code below, for testing only!
print(analyze_results([("A", "A"), ("A", "B"), ("C", "C"), ("D", "C"), ("E", "E")]) == 0.6)
