This notebook is a part of work being done for the Trace of Theory project, a collaboration between researchers of NovelTM and the HathiTrust Research Center (HTRC). In particular, we want to use both supervised and unsupervised machine learning techniques on HTRC texts to gain a better understanding of the extent and nature of theory in various genres.
This notebook is a much shorter version of the Classifying Philosophical Texts notebook, where many of the steps are explained in more detail. The purpose of this notebook is to try the same methodology on another corpus.
Below is a longer chunk of (mostly unexplained) code that essentially produces two visualizations for document similarity:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import MDS
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import ward, dendrogram
%matplotlib inline
# plot the documents from the corpus
def plot_corpus_similarity(corpus, vectorizer):

    # generate the vectors, distances and positions
    texts = [corpus.raw(fileid) for fileid in corpus.fileids()]
    documentTermMatrix = vectorizer.fit_transform(texts)
    distances = 1 - cosine_similarity(documentTermMatrix)
    mds = MDS(dissimilarity="precomputed", random_state=1)
    positions = mds.fit_transform(distances)

    # plot dendrogram
    linkage_matrix = ward(distances)
    plt.figure(figsize=(8,10))
    dendrogram(linkage_matrix, labels=corpus.fileids(), orientation="right");
    plt.show() # fixes margins

    # plot scatter
    xvalues = positions[:, 0]
    yvalues = positions[:, 1]
    plt.figure(figsize=(20,10))
    for x, y, name in zip(xvalues, yvalues, corpus.fileids()):
        plt.scatter(x, y)
        # the colour-coding here is a bit of a hard-coded hack for what is otherwise mostly reusable code
        plt.text(x, y, name.replace(".txt", "")[:25], color='red' if 'NonLitCrit' in name else 'green')
    plt.show()
We'll begin by loading our corpus into an NLTK corpus for convenience (all the plain text files in the data/LitCrit/texts directory of the repository).
import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
data_dir = "../../data/LitCrit"
corpus = PlaintextCorpusReader(data_dir+"/texts", r".*\.txt")
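As a quick sanity check (assuming the directory layout described above), we can confirm how many plain text files the corpus reader found:
print(len(corpus.fileids()))  # how many .txt files were found
print(corpus.fileids()[:5])   # peek at the first few file names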
Now that we have a corpus, we can try out the functions above by sending in the entire corpus with a simple vectorizer that uses neither a stopword list nor a restricted list of keywords. Notice that we use a TF-IDF vectorizer but tell it not to compute document frequencies (use_idf=False), which simply normalizes (relativizes) the term frequencies by document.
simple_vectorizer = TfidfVectorizer(use_idf=False)
plot_corpus_similarity(corpus, simple_vectorizer)
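As an aside, here is a small sketch (reusing corpus and simple_vectorizer from above) that peeks at the document-term matrix itself; with use_idf=False and the vectorizer's default norm="l2", each row is simply the document's term frequencies scaled to unit length, so document length no longer matters:
import numpy as np

# rebuild the document-term matrix and check its shape and row normalization
dtm = simple_vectorizer.fit_transform([corpus.raw(fileid) for fileid in corpus.fileids()])
print(dtm.shape)  # (number of documents, number of terms)
print(np.linalg.norm(dtm[0].toarray()))  # ~1.0 for every document row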
We see that the documents cluster fairly well in both graphs.
Let's repeat the experiment, this time using English stopwords – this will remove common function words from the relative term frequencies matrix. We'll also limit the number of terms considered to 5,000 for efficiency.
stoplist_vectorizer = TfidfVectorizer(use_idf=False, stop_words=nltk.corpus.stopwords.words("english"), max_features=5000)
plot_corpus_similarity(corpus, stoplist_vectorizer)
These new graphs with stopwords show some differences: the clusters seem a bit tighter than without stopwords.
Rather than exclude certain words as we've done with the English stopwords, we can also run the same process by only including certain (LitCrit) words. We have a list of LitCrit keywords developed by Laura & Co. that we can use.
keywords = list(set([line.rstrip('\n').lower() for line in open(data_dir+'/LitTermsNonHierarch.txt')]))
keywords_vectorizer = TfidfVectorizer(use_idf=False, vocabulary=keywords)
plot_corpus_similarity(corpus, keywords_vectorizer)
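One way to confirm that the vocabulary restriction behaves as expected (a sketch reusing keywords and keywords_vectorizer from above) is to check that the matrix has exactly one column per keyword, regardless of what other words appear in the texts:
# with a fixed vocabulary, the columns of the matrix are exactly the keyword list
dtm = keywords_vectorizer.fit_transform([corpus.raw(fileid) for fileid in corpus.fileids()])
print(dtm.shape[1] == len(keywords))  # True: one column per keyword
print(sorted(keywords)[:10])  # a peek at some of the LitCrit keywords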
At a quick glance, the clustering here is better with the keywords (unlike our experiment with philosophical texts, where the philosophical keywords performed less well than the stopword list).
The major take-away from this exploratory work with document similarities is that our LitCrit and NonLitCrit texts do cluster into recognizable groups, and that restricting the vocabulary to the LitCrit keywords produces the most convincing separation.
Supervised classification is essentially a process where we help the computer understand how known items (like texts) are classified in a training set, and then the computer tries to help us by classifying items it hasn't seen before. To determine the accuracy of a classifier, we can use a subset of our known items for training purposes and then test against another subset of our known items to see how many are correctly classified.
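Before running the full benchmark below, here is a minimal sketch of that idea with a single classifier; the texts and labels are invented for illustration only:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical labelled texts for training and testing
train_texts = ["a close reading of the novel and its narrator", "statistical methods for measuring crop yields"]
train_labels = ["LitCrit", "NonLitCrit"]
test_texts = ["the irony of the narrator undermines the plot"]
test_labels = ["LitCrit"]

vectorizer = TfidfVectorizer(use_idf=False)
X_train = vectorizer.fit_transform(train_texts)  # learn the vocabulary from the training set only
X_test = vectorizer.transform(test_texts)  # reuse that vocabulary for the test set

classifier = MultinomialNB(alpha=.01).fit(X_train, train_labels)
print(classifier.score(X_test, test_labels))  # proportion of test items classified correctly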
import random
from pandas import DataFrame
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid

def benchmark_svms(labelled_texts, runs, vectorizer):
    results = defaultdict(list)
    split = int(len(labelled_texts)/2)
    for i in range(0, runs):

        # shuffle and split the labelled texts into equal-sized training and test sets
        random.shuffle(labelled_texts)
        train_set, test_set = labelled_texts[split:], labelled_texts[:split]
        train_set_categories = ["NonLitCrit" if "NonLitCrit" in category else "LitCrit" for category, text in train_set]
        test_set_categories = ["NonLitCrit" if "NonLitCrit" in category else "LitCrit" for category, text in test_set]

        # fit the vectorizer on the training texts and reuse its vocabulary for the test texts
        X_train = vectorizer.fit_transform([text for category, text in train_set])
        X_test = vectorizer.transform([text for category, text in test_set])

        # train and score each classifier on this split
        for clf, name in (
                (RidgeClassifier(tol=1e-2, solver="lsqr"), "Ridge Classifier"),
                (Perceptron(n_iter=50), "Perceptron"),
                (PassiveAggressiveClassifier(n_iter=50), "Passive-Aggressive"),
                (KNeighborsClassifier(n_neighbors=10), "kNN"),
                (LinearSVC(), "LinearSVC"),
                (LinearSVC(loss='l2', penalty="l2", dual=False, tol=1e-3), "LinearSVC l2"),
                (LinearSVC(loss='l2', penalty="l1", dual=False, tol=1e-3), "LinearSVC l1"),
                (SGDClassifier(alpha=.0001, n_iter=50, penalty="l2"), "SGD l2"),
                (SGDClassifier(alpha=.0001, n_iter=50, penalty="l1"), "SGD l1"),
                (SGDClassifier(alpha=.0001, n_iter=50, penalty="elasticnet"), "SGD elasticnet"),
                (NearestCentroid(), "NearestCentroid (aka Rocchio classifier)"),
                (MultinomialNB(alpha=.01), "Naïve Bayes Multinomial"),
                (BernoulliNB(alpha=.01), "Naïve Bayes Bernoulli")):
            clf.fit(X_train, train_set_categories)
            results[name].append(clf.score(X_test, test_set_categories))

    # collect the accuracy scores into a DataFrame, print the ranked averages and plot the runs
    orderedresults = [(name, values) for name, values in results.items()]
    results_df = DataFrame([values for name, values in orderedresults], index=[name for name, values in orderedresults])
    print("Ordered averages:")
    print(results_df.mean(axis=1).order(ascending=False))
    results_df.transpose().plot(figsize=(20, 10))
Let's send in all the texts from our corpus. We pair each text with its file ID (which serves as its label) because the order of the texts is shuffled during each run and we need a way of retrieving the label afterwards.
labelled_texts = [(fileid, corpus.raw(fileid)) for fileid in corpus.fileids()]
benchmark_svms(labelled_texts, 5, keywords_vectorizer)
Ordered averages:
Naïve Bayes Multinomial                     0.914286
Perceptron                                  0.904762
NearestCentroid (aka Rocchio classifier)    0.904762
LinearSVC                                   0.895238
LinearSVC l2                                0.895238
Passive-Aggressive                          0.895238
SGD elasticnet                              0.895238
Naïve Bayes Bernoulli                       0.885714
SGD l1                                      0.885714
Ridge Classifier                            0.885714
kNN                                         0.885714
SGD l2                                      0.876190
LinearSVC l1                                0.847619
dtype: float64
Those scores are pretty good – a score of about .91 indicates that, averaged over the course of 5 runs, that classifier correctly classified texts in our test set roughly 91% of the time (note that every time this test is run the results may vary).
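If we want more detail than a single accuracy number, we can also look at one run more closely, for instance with a per-class breakdown (a sketch reusing labelled_texts, keywords_vectorizer and the imports from above):
from sklearn.metrics import classification_report

# a single train/test split, scored per class rather than as overall accuracy
random.shuffle(labelled_texts)
split = int(len(labelled_texts)/2)
train_set, test_set = labelled_texts[split:], labelled_texts[:split]
train_labels = ["NonLitCrit" if "NonLitCrit" in fileid else "LitCrit" for fileid, text in train_set]
test_labels = ["NonLitCrit" if "NonLitCrit" in fileid else "LitCrit" for fileid, text in test_set]
X_train = keywords_vectorizer.fit_transform([text for fileid, text in train_set])
X_test = keywords_vectorizer.transform([text for fileid, text in test_set])
classifier = MultinomialNB(alpha=.01).fit(X_train, train_labels)
print(classification_report(test_labels, classifier.predict(X_test)))  # precision and recall per class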
(CC-BY) By Stéfan Sinclair, Geoffrey Rockwell and the Trace of Theory team, last updated November 13, 2015.