Chapter 9 - Learning from Examples

-- A Python course for the Humanities


What are the rules for distinguishing a real e-mail message from spam? You might come up with rules like "if a message has XXX in the subject" it is probably spam. Or if the message contains many mentions of large amounts of money, it's probably spam. Spammers change their behavior all the time and their spamming techniques have evolved substantially over time. You will soon realize that hand-crafting a set of rules for spam is practically infeasable. In his famous essay 'A plan for Spam', Paul Graham thought exactly the same thing. Graham proposed to employ techniques from Machine Learning to automatically learn from a number of examples which features distinguish best between spam and non-spam. The basic idea of his spam filter is that we keep track of the frequencies with which words occur in spam messages and in non-spam messages and compare the frequencies of the words in a new, previously unseen e-mail to those statistics to decide whether we're dealing with spam or not.

Graham's spam filter is an example of a supervised learning system. We call it supervized, because on the basis of a set of training examples of which the label is known (e.g. spam or non-spam), we attempt to automatically find a way to distinguish between the two categories. More formally, supervized learning systems estimate functions $f : X \rightarrow Y$ from a multi-dimensional input space $X$ to a set of discrete classes $Y$. In the example of the spam filter, $f$ would be the filter we try to learn, $X$ could be the word frequencies of the e-mail messages (both spam and non-spam) and $Y$ would be the two classes spam and non-spam ($Y = \{ \textrm{spam},\textrm{non-spam}\})$.

Have a look at the following example. Say we have a training set of four e-mail messages, two spams and two non-spams. In those four e-mails we find a total of 5 unique words. For each training e-mail $x \in X$, we write down how often each of those words occurs in $x$:

sunny friend viagra dollar million
non-spam 1 1 0 1 0
non-spam 1 1 0 0 0
spam 0 1 1 0 0
spam 0 0 0 1 1

Now we get a new e-mail in our inbox that has the following features:

sunny friend viagra dollar million
? 1 0 1 1 0

What do you think? Is it spam or non-spam? Why? There you have it, you have made a prediction on the basis of training examples. This is essentially what Machine Learning attempts to do, only it tries to do it automatically using a computer...

In this chapter we will introduce you to an important supervised machine learning algorithm: naive bayes. There are many excellent introductions and tutorials to machine learning in which this algorithm is discussed. I will not repeat those here, but try to focus on the applicability of the algorithm to a problem relevant in humanities research. This chapter contains another practical part in which we will develop an Authorship Attribution System.


Naive Bayes Learner

You might say that the attribution of authors to pieces of art, such as literature, music and paintings has been a source of fierce debate ever since art came into existence. Nowadays, the field of stylometry and authorship attribution draws heavily on the aid of computational systems, such as machine learning systems and artificial intelligence systems to automatically learn characteristic features for particular authors, composers or painters. In this section we will develop a simple computational attribution system. At the core of this system will be a general naive bayes classifier that can be put to use for all kinds of similar problems, such as text classification and classification in general.

Our Authorship Attribution System can be thought of as a subclass of a naive bayes learner which in its turn can be thought of as a subclass of a more general learner. This general learner doesn't actually do anything but provides a framework for the more complex learners. Let's start with an implementation of the abstract learner:

In [1]:
class Learner:
    """Abstract Learner. A Learner can be trained
    with a dataset X and the corresponding labels y. After training
    the Learner can be asked to predict the outcome of an example."""

    def fit(self, X, y):
        """Fit or train the learner."""
        self.X = X
        self.y = y

    def predict(self, x):
        """Predict the outcome of an example x."""
        raise NotImplementedError

A Learner must specify two methods: fit and predict. fit takes as argument some multi-dimensional input space $X$ where for each example $x \in X$ we have observed a class label $y_x$ where $y \in Y$.


Quiz!

Have a look at the following rather dumb learner that simply predicts that the class label for each new unseen example is the label that occurs most often in the training data. Complete the fit method in the code below:

In [ ]:
class MajorityLearner(Learner):
    """Dumb Learner that always predicts that the class of
    a new, unseen example is the label that occurs most often
    in the training data."""
    
    def fit(self, X, y):
        "Find that label in the training data that occurs most often."
        # insert your code here
        
    def predict(self, x):
        "Always predict that `x`'s label is equal to the most popular one."
        return self.most_popular
    
# these tests should return True if your code is correct
learner = MajorityLearner()
X = [('a', 'a', 'b'), ('a', 'a', 'a'), ('b', 'b', 'b')]
y = [0, 0, 1]
learner.fit(X, y)
print(learner.predict(('c', 'c', 'c')) == 0)

The naive bayes classifier is a probabilistic classifier that, given a set of features, tries to find the class with the highest probability. It is based on applying Bayes' theorem and is called naive because of its strong independence assumption between features. This means that the absence or presence of each feature is assumed to be independent of each other. We compute the posterior probability of a class as the product of the prior probability of a class and the joint probability of all features given that class:

$$ P(y|x_1,\ldots,x_n) \propto P(y) \prod^n_{i=1} P(x_i|y)$$

Classification is based on the maximum a posteriori or MAP descision rule which simply picks the class (or author in our case) that is most probable:

$$ predict(x_1, \ldots, x_n) = \arg\max_y P(y) \prod^n_{i=1} P(x_i|y) $$

If you're unfamiliar with reading formulas, this might all seem quite daunting. To better understand what is going on, let's work out a small example. Say we have a small corpus of five very short books of which the author of the fifth book is unknown. The total vocabulary $V$ is 10 words long. For each book we store how often each word $w_i \in V$ occurs:

the poetry society america realism a harry magic health system
David Foster Wallace 6 4 3 3 0 10 0 0 1 4
David Foster Wallace 8 1 2 4 0 7 0 0 10 3
Walt Whitman 12 10 1 8 0 4 0 0 0 4
J.K. Rowling 8 0 0 0 0 15 12 10 0 0
? 7 4 0 0 0 12 6 8 3 0

What is the probability of $P(y=\textrm{Walt Whitman}|x = [12, 10, 1, 8, 0, 4, 0, 0, 0, 4])$? And what is the probability of $P(y=\textrm{J.K. Rowling}|x = [7, 4, 0, 0, 0, 12, 6, 8, 3, 0])$?

The probability of a word like the given some author is computed by dividing the number of occurences of that word by the total number of words for that author. In the case of Walt Whitman, the probability of the word poetry is:

$$ \begin{array}{lll} P(x_i=\textrm{poetry}|y=\textrm{Walt Whitman}) & = & \frac{10}{12 + 10 + 1 + 8 + 0 + 4 + 0 + 0 + 0 + 4}\\ & = & \frac{10}{39} \\ & = & 0.256 \\ \end{array} $$

Quiz!

Compute the probability of the word magic given J.K. Rowling and the word society given David Foster Wallace.

Double click this cell and write down your answer.


The posterior probability of a class computes the joint probability of all features given that class. This means that for each nonzero word $w_1, w_2, \ldots, w_n$ in our unknown book, we compute the probability of that word given a particular author $y$: $P(w_i|y)$. We then take the product (joint probability) of these individual words, which multiplied by the prior probability of the author, provides us with the posterior probability. Let's find out how that works.


Quiz!

a) Compute the joint probability of all features in our unknown book given $y$=J.K. Rowling:

$$\begin{array}{llll} P(y|X_?) & \propto & (P(\textit{the}|y) \times 7) & \times \\ & & (P(\textit{poetry}|y) \times 4) & \times \\ & & (P(\textit{a}|y) \times 12) & \times \\ & & (P(\textit{harry}|y) \times 6) & \times \\ & & (P(\textit{magic}|y) \times 8) & \times \\ & & (P(\textit{health}|y) \times 3) & \\ & = & ? \end{array}$$
In [ ]:
# insert your code here

b) A common strategy to surpass this rather unpleasant outcome is to add pseudocounts to the observed counts, normally 1. The pseudocounts need to be incorporated in both the numerator and the denominator. This is called smoothing the distribution using Laplacian Smoothing. Recompute the joint probability using our smoothing technique.

In [ ]:
# insert your code here

c) To compute the final posterior probability we must multiply the joint probability of the words given J.K. Rowling with the prior probability that a book was written by J.K. Rowling in our corpus. Do that in the cell below:

In [ ]:
# insert your code here

d) Now that we know how to compute the posterior probability, it is time to implement our naive bayes learner. We will start with implementing the fit method. The fit method has three core jobs:

  1. extract all the counts of each feature given each class;
  2. count how often each class label occurs in the training data;
  3. count the number of unique features.

The NaiveBayesLearner below provides the skeleton of our class. Implement the fit method.

In [ ]:
from collections import defaultdict, Counter


class NaiveBayesLearner(Learner):
    """Naive Bayes Learner. This learner can be
    initialized using nb = NaiveBayesLearner()
    to construct a classifier that incorporates the prior
    probabilities of each possible class label."""

    def fit(self, X, y):
        """Fit or train the naive bayes classifier. X must be an
        iterable of examples where each example is an iterable as
        well."""
        self.C = # insert your code here (class counts)
        self.N = # insert your code here (feature counts per class)
        # add the feature counts per class here
        
        self.V = len(set(x for y_x in self.N for x in self.N[y_x])) # number of unique features
       
    def predict(self, x):
        """Predict the outcome for example x. Choose the most
        likely outcome out of all possible outcomes."""
        pass
    
# these tests should return True if your code is correct
nb = NaiveBayesLearner()
X, y = [('a', 'a', 'a', 'b'), ('a', 'b', 'b')], [1, 0]
nb.fit(X, y)
print(nb.N[1]['b'] == 1 and nb.C[1] == 1)

Now that we have specified the fit method of our learner, we can continue with the predict method. predict takes as argument a single example and returns the class with the maximum posterior probability. This means that within predict we have to compute the posterior probability of all unique class labels in our training data. We could do all of the computation in the single method predict, but our code will remain a little more clean if we split it into a number of helper functions.


Quiz!

a) We will start with a method that computes the probability of a feature $x_i$ given a class label $y$. The method probability in the code below takes as argument a feature $x_i$ and and class label $y$ and should return the probability $P(x_i|y)$.

In [ ]:
class NaiveBayesLearner(Learner):
    """Naive Bayes Learner. This learner can be
    initialized using nb = NaiveBayesLearner()
    to construct a classifier that incorporates the prior
    probabilities of each possible class label."""

    def fit(self, X, y):
        """Fit or train the naive bayes classifier. X must be an
        iterable of examples where each example is an iterable as
        well."""
        self.C = Counter(y)
        self.N = defaultdict(Counter)
        for x, y_x in zip(X, y):
            self.N[y_x] += Counter(x)
        self.V = len(set(x for y_x in self.N for x in self.N[y_x]))

    def prior(self, y):
        """Return the prior probability of class y."""
        pass
            
    def probability(self, x, y):
        """Apply Laplace Smoothing to give a probability
        estimate of feature x given y."""
        # insert your code here

# these tests should return True if your code is correct
nb = NaiveBayesLearner()
X, y = [('a', 'a', 'a', 'b'), ('a', 'b', 'b'), ('b', 'b', 'b')], [1, 0, 0]
nb.fit(X, y)
print(abs(nb.probability('a', 1) - 0.66666) < 0.00001)
print(abs(nb.probability('c', 0) - 0.125) < 0.00001)

b) Next, complete the method prior in the class definition above (remove the pass statement). It takes as argument a class label and should return the prior probability of that label.

In [ ]:
# these tests should return True if your code is correct
nb = NaiveBayesLearner()
X, y = [('a', 'a', 'a', 'b'), ('a', 'b', 'b'), ('b', 'b', 'b')], [1, 0, 0]
nb.fit(X, y)
print(round(nb.prior(1), 2) == 0.33)

With the two methods prior and probability in place, we are ready to implement the final predict method. Before we do that, however, there is a small technical issue that must be discussed. The formula above states that we have to compute the joint probability of all features given a label as the product over the individual probabilities. However, each probability is a number between 0 and 1 and when we multiply those numbers, the resulting number will be smaller than both individual probabilities. Now imagine we have a thousand features or -- and in natural language that is still quite small -- 100k features. If we now multiply the probabilities of all these features we will get a very small number, possibly too small to be adequately represented by Python. Consider the code below:

In [ ]:
x = 1e-10
for i in range(10):
    x = x * x
    print(x)

After less than 10 multiplications, the values are too small for Python to distinguish them from each other. Even worse, the values default to zero and as we know multiplying by zero will return zero, and therefore the result of computing the joint probability could be zero! We therefore take the log of the individual feature probabilities and sum them to obtain our final score. This is equivalent to taking the product of the probabilties.

In [ ]:
from math import log

x = log(1e-10)
for i in range(10):
    x = x + x
    print(x)

Quiz!

Implement the method predict. It takes a argument a list or some other iterable of values and should return the label that maximizes the posterior probability.

In [ ]:
class NaiveBayesLearner(Learner):
    """Naive Bayes Learner. This learner can be
    initialized using nb = NaiveBayesLearner()
    to construct a classifier that incorporates the prior
    probabilities of each possible class label."""

    def fit(self, X, y):
        """Fit or train the naive bayes classifier. X must be an
        iterable of examples where each example is an iterable as
        well."""
        self.C = Counter(y)
        self.N = defaultdict(Counter)
        for x, y_x in zip(X, y):
            self.N[y_x] += Counter(x)
        self.V = len(set(x for y_x in self.N for x in self.N[y_x]))

    def prior(self, y):
        """Return the prior probability of class y."""
        return self.C[y] / sum(self.C.values())

    def probability(self, x, y):
        """Apply Laplace Smoothing to give a probability
        estimate of feature x given y."""
        return (self.N[y][x] + 1.0) / (sum(self.N[y].values()) + self.V)

    def predict(self, x):
        """Predict the outcome for example x. Choose the most
        likely outcome out of all possible outcomes."""
        # insert your code here

# these tests should return True if your code is correct
nb = NaiveBayesLearner()
X, y = [('a', 'a', 'a', 'b'), ('a', 'b', 'b'), ('b', 'b', 'b')], [1, 0, 0]
nb.fit(X, y)
print(nb.predict(('a', 'a')) == 1 and 
      nb.predict(('a', 'b')) == 0 and
      nb.predict(('b', 'b')) == 0)

Authorship Attribution Learner

Authorship attribution is a more specific instance of a more general problem and can therefore be thought of as a subclass of a NaiveBayesLearner. In textual authorship attribution our input space will be represented by textual features and the class labels are the authors we want attribute to a written document. Before we implement the Learner, we must first define a number of functions to extract features from running text.

What kind of features can we use for authorship attribution? Words, defined as everything surrounded by spaces, are generally conceived to be good features. The same holds for bigrams of words and character $n$-grams. However, there ain't no such thing as a free lunch. So, let's not restrict ourselves to a single feature representation but experiment with a number of different representations and see what works best.

In the folder data/british-novels you will find 26 famous British novels downloaded from Project Gutenberg. This is a small toy dataset that we will use in our experiments. First, we will create a simple representation of a document. I choose to represent each document as a tuple of an author, a title and the actual text. Instead of ordinary tuples we will use the namedtuple from the collections module in Python's standard library. A namedtuple can be constructed as follows:

In [ ]:
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
p = Point(1, y=2)

where the first argument represents the typename of the tuple and the second argument specifies the fieldnames. These fieldnames can be conveniently accessed using:

In [ ]:
print(p.x) # by name
print(p[1]) # by index

Namedtuples follow the exact same behavior as regular tuples: they are hashable and can be unpacked:

In [ ]:
x, y = p
print(x)

We define a namedtuple for our documents as follows:

In [ ]:
Document = namedtuple("Document", ["author", "title", "text"])

Quiz!

a) Next we will write some functions to read a file and extract the contents as either an iterable of word $n$-grams or character $n$-grams. We start with a function to extract all $n$-grams (either character or word) from an iterable. The function should take as argument a fixed order iterable, and a tuple representing the $n$-gram range. When a tuple of (1, 2) is given, the function should return all unigrams and bigrams in the iterable. When passed (1, 1) as $n$-gram range, it should only return unigrams. Implement the function ngrams:

In [ ]:
def ngrams(iterable, ngram_range=(1, 1)):
    "Return the iterable as a sequence of n-grams."
    min_n, max_n = ngram_range
    if isinstance(iterable, (list, set)):
        iterable = tuple(iterable)
    ngrams = []
    # insert your code here
    return ngrams

# these tests should return True if your code is correct
print(ngrams("humanities", ngram_range=(2, 3)) == ['hu', 'um', 'ma', 'an', 'ni', 'it', 
                                                   'ti', 'ie', 'es', 'hum', 'uma', 'man', 
                                                   'ani', 'nit', 'iti', 'tie', 'ies'])
print(ngrams("teach me Python".split(), ngram_range=(1, 2)) ==  [
    ('teach',), ('me',), ('Python',), ('teach', 'me'), ('me', 'Python')])

b) Can you think of any reason why we convert our iterable into a tuple?

Double click this cell and write down your answer.


We will write a function make_document that takes as argument a filename and returns an instance of our named tuple Document. Each filename in data/british-novels consist of the author and the title separated by an underscore. This allows us to use the filenames to easily extract the title and the author. The function make_document takes as argument a filename, an $n$-gram range, an argument that states whether to lowercase the text the type of $n$-grams (either word or char) and how large the sample of each text should be:

In [ ]:
from os.path import basename
from itertools import islice
import re

def tokenize(text, lowercase=True):
    "Tokenize a string on whitespace."
    text = text.lower() if lowercase else text
    for match in re.finditer(r"\w+(\.?\w+)*", text):
        yield match.group()

def make_document(filename, ngram_range=(1, 1), lowercase=True, ngram_type='word', sample=5000):
    with open(filename) as infile:
        text = list(islice(tokenize(infile.read(), lowercase=lowercase), sample))
    if ngram_type == 'char':
        text = ' '.join(text)
    author, title = basename(filename).replace('.txt', '').split('_')
    return Document(author, title, ngrams(text, ngram_range=ngram_range))

The sample size is important since we want to control for the fact that some documents are longer than others which might influence our distributions.

OK, so we have our feature extraction in place as well as our naive bayes learner. Let's now focus on implementing the authorship attribution learner. As said, we will implement the learner as a subclass of the NaiveBayesLearner. The prediction method will remain the same, but we have to make some small changes to the fit method that specifically target the authorship attribution problem.

A fundamental insight from stylometry and computational authorship attribution is that function words (e.g. the, of, in etc) are particularly suited for the problem of authorship attribution because they are writer invariant. Content words (nouns, verbs and adjectives) draw too much attention to the actual contents of a text which makes it hard to find commonalities between two texts written by the same author when they deal with different topics. Therefore researchers tend to extract only the $n$ most frequent words from a text (which dominantly consist of function words).


Quiz!

Implement the fit method in the AuthorshipLearner class below. Contrary to our previous implementation, here you should only store the $n$ most frequent features.

In [ ]:
class AuthorshipLearner(NaiveBayesLearner):
    """Subclass of NaiveBayesLearner that targets the authorship
    attribution problem. The learner can be initialized using
    >>> al = AuthorshipLearner(n_most_frequent=100)
    where the argument `n_most_frequent` is used to specify
    the n most frequent features. If set to None, all features
    will be used."""
    
    def __init__(self, n_most_frequent=100):
        self.n_most_frequent = n_most_frequent
        
    def fit(self, X, y):
        """Fit or train the Authorship Learner. X must be an
        iterable of examples where each example is an iterable as
        well."""
        self.C = Counter(y)
        self.N = defaultdict(Counter)
        # insert your code here
            
# these tests should return True if your code is correct
X = [('a', 'a', 'a', 'b', 'b', 'c'), ('a', 'c', 'b', 'b'), ('b', 'b', 'b', 'c', 'c', 'a')]
y = [1, 0, 0]
al = AuthorshipLearner(n_most_frequent=2)
al.fit(X, y)
print('c' not in al.N[1] and set(['b', 'c']) == set(al.N[0].keys()))

Experimentation and Evaluation

So there we have it: we have created an authorship attribution system on the basis of a naive bayes learner. It is time to put the learner to the test. But how do we test our Learner? What kind of evaluation metrics are appropriate? In this section we will show you some important evaluation techniques and evaluation metrics that will give you a detailed view on how well your system is performing. We will also show you how to optimize for a set of parameters.

Evaluation Metrics

Perhaps the most simple form of evaluation is by means of accuracy. Accuracy is defined as the proportion of correctly predicted outcomes:

$$ accuracy(y_1, y_2, \ldots, y_n) = \frac{1}{n} \sum^n_{i=1} \left\{\begin{array}{l l}1 & \quad \text{if $y_i$ equals $t_i$} \\0 & \quad \text{otherwise} \end{array} \right. $$

Where $y_i$ is the label predicted by our Learner and $t_i$ the true label.


Quiz!

Implement the function accuracy. It takes as argument a list of label predictions for each item in a test set and a list of all true labels. It should return the proportion of correctly predicted labels.

In [ ]:
def accuracy(pred_labels, true_labels):
    """Return the accuracy, defined as the proportion of
    correctly predicted labels."""
    # insert your code here

# these tests should return True if your code is correct
predictions = [1, 2, 0, 0, 1]
true =        [1, 2, 0, 1, 1]
print(accuracy(predictions, true) == 0.8)

Although accuracy is an intuitive meassure, it is not very insightful as it doesn't provide us with any information about what kind of errors the system makes. We can distinguish two kinds of errors:

  1. False positives: cases in which the learner wrongly predicts an item to be of a class;
  2. False negatives: cases in which the learner wrongly predicts an item not to be of a class;

Likewise we can distinguish between two kinds of correct predictions:

  1. True positives: cases in which the learner rightfully predicts an item to be of a class;
  2. True negatives: cases in which the learner rightfully predicts an item not to be of a class;

We can use these distinctions to compute two different evaluation measures: precision and recall. Precision is computed by dividing the number of true positives by the sum of the true positives and the false positives:

$$\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$$

Recall is computed by dividing the number of true positves by the sum of the number of true positives and the number of false negatives:

$$\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$$

The $F$-score combines the two measures into a harmonic mean:

$$\text{$F$-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

The following function takes as input a list of predictions and their corresponding ground truth labels. It returns a matrix in which for each unique label we count the number of true and false positives and the number of false negatives.

In [ ]:
def error_statistics(preds, trues):
    """Given a list of predictions and ground truth labels,
    return a matrix in which for each unique label, we count
    the number of true and false positives and the number of 
    false negatives."""
    errors = defaultdict(Counter)
    for pred, true in zip(preds, trues):
        if pred == true:
            errors[true]['tp'] += 1
        else:
            errors[pred]['fn'] += 1
            errors[true]['fp'] += 1
    return errors

Given the error statistics, computing the precision, recall and F-score becomes trivial. First we implement the precision function:

In [ ]:
from statistics import mean # only available in Python 3.4+

def precision(pred_labels, true_labels, average=True, matrix=None):
    if matrix is None:
        matrix = error_statistics(pred_labels, true_labels)
    scores = {}
    for label in set(true_labels):
        try:
            scores[label] = matrix[label]['tp'] / (matrix[label]['tp'] + matrix[label]['fp'])
        except ZeroDivisionError:
            scores[label] = 0.0
    return scores if not average else mean(scores.values())

print(precision(predictions, true))

Quiz!

a) Now it is your turn to implement the recall function:

In [ ]:
from statistics import mean

def recall(pred_labels, true_labels, average=True, matrix = None):
    if matrix is None:
        matrix = error_statistics(pred_labels, true_labels)
    # insert your code here
    
# these tests should return True if your code is correct
predictions = [1, 2, 0, 0, 1]
true =        [1, 2, 0, 1, 1]
print(abs(recall(predictions, true) - 0.8333) < 0.0001)

c) Finally, implement the function to compute the F-score:

In [ ]:
def f_score(pred_labels, true_labels, average=True, matrix=None):
    if matrix is None:
        matrix = error_statistics(pred_labels, true_labels)
    # insert your code here

# these tests should return True if your code is correct
predictions = [1, 2, 0, 0, 1]
true =        [1, 2, 0, 1, 1]
print(abs(f_score(predictions, true, average=True) - 0.8222) < 0.0001)

Evaluation strategies

Now that we have our evaluation metrics in place, we will write some functions that allow us to test the performance of our AuthorshipLearner. We implement a function called test which takes as argument a fitted Learner object and a testing input space test_X. We evaluate by comparing the predictions of the learner against the ground truth labels in test_y given a particular scoring function score_fn:

In [ ]:
def test(learner, test_X, test_y, score_fn=accuracy, verbose=False):
    """Return the proportion of examples in test_X that
    are correctly predicted by the learner. The learner
    must already be fitted."""
    pred_y = []
    for y_x, x in zip(test_y, test_X):
        prediction = learner.predict(x)
        pred_y.append(prediction)
        if verbose >= 3:
            print('%s! (true = %s, pred = %s)' % (
                'Wrong' if y_x != prediction else 'Correct', y_x, prediction))
    return score_fn(pred_y, test_y)

To use this function, we have to split our data set into a train and test set. A basic strategy would be to make a random split in our data set and train on one of the resulting chunks while testing on the other. We define yet another function train_and_test that takes as argument a Learner object and an input space $X$ and the corresponding target labels $y$. The learner is tested on the examples between a start index and end index and trained on the remainder:

In [ ]:
def train_and_test(learner, X, y, start, end, score_fn=accuracy, verbose=False):
    """Train and test a Learner. We reserve X[start:end] for testing
    and test on the remainder. Returns the proportion of examples
    that are correctly predicted by the learner."""
    train_X, train_y = X[:start] + X[end:], y[:start] + y[end:]
    test_X, test_y = X[start:end], y[start:end]
    learner.fit(train_X, train_y)
    return test(learner, test_X, test_y, score_fn=score_fn, verbose=verbose)

Since our corpus is so small, instead of dividing our data set into only two parts, it is better to use a method call $n$-fold cross-validation where we divide the data at random into $n$ subsets of approximately equal size. We then take a single fold to test on and train on the remaining folds. The function cross_validate takes as argument a Learner object an input space $X$ and the corresponding target labels $y$, and an argument $n$ specifying the number of folds. It performs $n$-fold cross-validation using this learner and returns the mean of the performance given a scoring function score_fn.

In [ ]:
import random
from statistics import mean 

def shuffle(*args, seed=None):
    """Similar to random.shuffle, but takes multiple arguments
    that will be shuffled in parallel."""
    data = list(zip(*args))
    random.shuffle(data, random=seed)
    return zip(*data)

def cross_validate(learner, X, y, k=10, score_fn=accuracy, verbose=False):
    if k is None: k = len(X)
    n = len(X)
    X, y = shuffle(X, y)
    scores = []
    for i in range(k):
        start, end = int(i * (n / k)), int((i + 1) * (n / k))
        score = train_and_test(learner, X, y, start, end, score_fn=score_fn, verbose=verbose)
        if verbose >= 2:
            print("Cross validation on fold %d/%d = %.3f" % (i+1, k, score))
        scores.append(score)
    return mean(scores)

By setting k to None, the function will take out a single document for testing and train on the remaining texts. This type of cross-validation is called leave one out cross-validation (LOO cross-validation). This type of validation seems ideal for our small data set, because for each author we have no more and no less than 2 documents. LOO cross-validation ensures us that for each test document we will have at least one document in our training data with the correct author.

Authorship Attribution Experiments

We have implemented a number of evaluation metrics as well as some evaluation strategies, such as cross-validation. In this section we will setup some experiments to finally test the performance of our authorship attribution system. There are many parameters to choose from: how large is our sample from each document, what kind of $n$-grams do we use and what will be their size? We will first do a simple run with only default settings. We make a list of all documents:

In [ ]:
from glob import glob

documents = [make_document(f) for f in glob('data/british-novels/*.txt')]

Next we perform leave one out cross-validation on the dataset and evaluate by means of the $F$-score:

In [ ]:
authors, titles, texts = zip(*documents)
cross_validate(AuthorshipLearner(), texts, authors, k=None, score_fn=f_score)

Quiz!

a) As you can see the score isn't particularly high. How can we improve upon this number? First, let's experiment with the sample size. Write a loop that increases the sample size from 100 to 5000 with steps of 500. For each step, make a list of Documents and evaluate by means of LOO cross-validation and the $F$-score. Store the result in the dictionary called scores with the steps as keys and the scores as values:

In [ ]:
%matplotlib inline # execute this line first
In [ ]:
scores = {}
# insert your code here

Visualize the scores by executing the following cell:

In [ ]:
import seaborn as sb

samples, scores = zip(*sorted(scores.items()))
sb.plt.plot(samples, scores)

b) Do the same thing with the variable n_most_frequent passed to the AuthorshipLearner. Let n_most_frequent be in the range between 50 and 1000 with a step size of 100.

In [ ]:
scores = {}
# insert your code here

Visualize the scores by executing the following cell:

In [ ]:
n_most_frequent, scores = zip(*sorted(scores.items()))
sb.plt.plot(n_most_frequent, scores)

Manually setting and unsetting all parameters is a rather cumbersome job. It would be much more efficient if we could automatically explore a large parameter space. Parameter optimization is a large (and unsolved) field of study. One easy way to evaluate the paramater space is to perform a grid search, in which we explore all possible parameter combinations given a list of parameter settings. Naturally this list cannot be too big, because we take the product of all parameter settings which quickly grows beyond a manageble set of parameter settings.

The following function takes as argument an intialized learner, the directory containing the corpus of training files, how many folds we want to use in the cross-validation, the scoring function to use and an argument to specify the parameter settings:

In [ ]:
from itertools import product

def grid_search(learner, directory, n_folds=10, params={}, score_fn=accuracy, verbose=0):
    if not isinstance(learner, Learner):
        raise ValueError("Learner is not initialized")
    arguments, values = zip(*params.items())
    grid_scores = []
    for param_setting in product(*values):
        param_setting = dict(zip(arguments, param_setting))
        if verbose >= 1:
            print('Searching grid with %s' % param_setting)
        learner_settings, feature_settings = {}, {}
        for arg, value in param_setting.items():
            if arg in learner.__dict__:
                learner_settings[arg] = value
            else:
                feature_settings[arg] = value
        for arg, value in learner_settings.items():
            learner.__setattr__(arg, value)
        documents = [make_document(f, **feature_settings) for f in glob(directory + '/*.txt')]
        y, _, X = zip(*documents)
        score = cross_validate(learner, X, y, k=n_folds, score_fn=score_fn, verbose=verbose)
        grid_scores.append((param_setting, score))
    grid_scores.sort(key=lambda i: i[1])
    return grid_scores[-1]

The parameter settings argument is a dictionary consisting of the argument names and their values of either the learner class or the make_document function:

In [ ]:
params = {'n_most_frequent': [50, 100], 'sample': [500, 1000, 1500]}

The function computes the score for all combinations of parameters and returns the best performing parameter setting and corresponding score:

In [ ]:
grid_search(AuthorshipLearner(), 'data/british-novels/', 
            params=params, n_folds=None, score_fn=f_score, verbose=1)

Challenge!

Experiment with a range of different parameter settings using the grid_search function and report on your best performing system.

In [ ]:
# insert your code here

You've reached the end of the chapter. Ignore the code below, it's just here to make the page pretty:

In [370]:
from IPython.core.display import HTML
def css_styling():
    styles = open("styles/custom.css", "r").read()
    return HTML(styles)
css_styling()
Out[370]:
/* Placeholder for custom user CSS mainly to be overridden in profile/static/custom/custom.css This will always be an empty file in IPython */