This notebook is part of work being done for the Trace of Theory project, a collaboration between researchers of NovelTM and the HathiTrust Research Center (HTRC). In particular, we want to use both supervised and unsupervised machine learning techniques on HTRC texts to gain a better understanding of the extent and nature of theory in various genres.
This notebook builds on the Classifying Philosophical Texts notebook where we looked at building a classifier for philosophical texts, based on a small training corpus. In this notebook we'll try to use a trained classifier to identify philosophical texts based on genre-specific wordcounts for 178,381 volumes from the HathiTrust Digital Library; the genres are fiction, drama and poetry.
The first step is to (re)build our philosophical classifier. It's worth reiterating that the classifier is being trained on a relatively small corpus (so it isn't likely as representative as it might be) and that the new HTRC genre corpus is literature-specific (so a different kind of beast from our training corpus). Is it still useful as a classifier? That's part of what we'd like to find out.
The classifier created below is essentially the same as before, though we'll use the LinearSVC algorithm because it provides a way of not just classifying (philosophical or non-philosophical) but also of expressing a value for how philosophical or not the text is.
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
# define the training corpus to use (while filtering out our Philosophical outliers)
philo_data_dir = "../../data/philosophy"
philo_corpus = nltk.corpus.reader.plaintext.PlaintextCorpusReader(philo_data_dir+"/texts", r".*\.txt")
filtered_fileids = [fileid for fileid in philo_corpus.fileids() if "GameOfLogic" not in fileid and "ThusSpakeZarathustr" not in fileid]
# create TF-IDF (actually relative frequencies) vectorizer
stopword_vectorizer = TfidfVectorizer(use_idf=False, stop_words=nltk.corpus.stopwords.words("english"), max_features=10000)
X_train = stopword_vectorizer.fit_transform([philo_corpus.raw(fileid) for fileid in filtered_fileids])
philo_categories = ["Philosophy" if "Philosophy" in fileid else "Other" for fileid in filtered_fileids]
# create a classifier
philo_clf = LinearSVC(loss='squared_hinge', penalty="l2", dual=False, tol=1e-3)
philo_clf.fit(X_train, philo_categories)
LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', multi_class='ovr', penalty='l2', random_state=None, tol=0.001, verbose=0)
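As a sanity check on how we'll use this classifier, here's a minimal, self-contained sketch (toy synthetic texts, not our actual corpus) of the property we care about: LinearSVC's decision_function returns a signed distance from the separating hyperplane, which we can read as a graded "how philosophical" score rather than a binary label.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy training texts: two "philosophical" and two "other" (synthetic examples)
toy_texts = [
    "being essence metaphysics reason truth knowledge",
    "being truth reason logic mind knowledge essence",
    "ship storm captain island treasure voyage",
    "love garden spring flowers morning song",
]
toy_labels = ["Philosophy", "Philosophy", "Other", "Other"]

toy_vec = TfidfVectorizer(use_idf=False)
X_toy = toy_vec.fit_transform(toy_texts)
toy_clf = LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-3)
toy_clf.fit(X_toy, toy_labels)

# decision_function: higher values lean toward "Philosophy"
# (classes_ is sorted alphabetically, so positive = classes_[1] = "Philosophy")
scores = toy_clf.decision_function(toy_vec.transform([
    "metaphysics and the essence of reason",
    "a voyage to the treasure island",
]))
```

The first query shares vocabulary only with the philosophy examples, so it should receive the higher score of the two.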
The HTRC Genre corpus is organized by genre (if you rsync the directory rather than just downloading the files from the web, the files are organized into subfolders by genre). For each genre there's a metadata file listing all the volumes for that genre and then a set of compressed archives (.tar.gz) organized by time slice. Our strategy here will be as follows:

- walk the genre subdirectories and, for each genre, load the metadata file into a dataframe, adding a column for our prediction values
- for each .tar.gz archive in a genre directory, read each volume's TSV file of word counts and expand the counts back into a pseudo-text
- vectorize that text with our trained vectorizer and store the classifier's decision value in the metadata
- sort each genre's metadata by prediction value, from most to least philosophical
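The per-volume feature files are tab-separated word/count pairs, and the expansion step in the function that follows can be sketched in isolation (using synthetic TSV bytes here, not a real HTRC file):

```python
import io

# Synthetic stand-in for one volume's wordcount TSV (token \t count per line),
# mimicking the format of the HTRC genre feature files.
fake_tsv = io.BytesIO(b"the\t3\nreason\t2\n1844\t1\ntruth\t1\n")

text = ""
for line in fake_tsv.readlines():
    word, count = line.decode("utf-8").strip().split("\t")
    # keep only tokens containing at least one letter (drops "1844")
    if any(c.isalpha() for c in word):
        text += (word + " ") * int(count)
# text is now "the the the reason reason truth "
```

Repeating each token by its count reconstructs a bag-of-words "text" that our existing vectorizer can consume unchanged; as noted in the code comments, a vectorizer that accepted the counts directly would be faster.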
from os import walk
from os.path import join
import glob
import tarfile

import pandas as pd

def get_genre_metadata_and_predictions(genre_dir, clf, vectorizer):
    metadatas = {}
    for (dirpath, dirnames, filenames) in walk(genre_dir):
        for genre in dirnames:
            genre_path = join(genre_dir, genre)
            metadata = pd.read_csv(join(genre_path, genre+"_metadata.csv"), index_col=0)
            metadata['prediction'] = 0.0
            for tgz in glob.glob(join(genre_path, "*.tar.gz")):
                print("Analyzing "+tgz)
                tar = tarfile.open(tgz, "r:gz")
                for tarinfo in tar:
                    if tarinfo.isreg() and tarinfo.name.endswith("tsv"):
                        # read in the TSV file and expand the text (it would probably be
                        # quicker to create a vectorizer that can use the feature counts
                        # directly, but oh well)
                        text = ""
                        tsv = tar.extractfile(tarinfo)
                        for line in tsv.readlines():
                            word, count = line.decode("utf-8").strip().split("\t")
                            if any(c.isalpha() for c in word):
                                text += (word + " ") * int(count)
                        # predict the class: store the decision value for this volume,
                        # keyed by the file name minus its ".tsv" extension
                        X_test = vectorizer.transform([text])
                        metadata.loc[tarinfo.name[0:-4], 'prediction'] = clf.decision_function(X_test)[0]
                tar.close()
            metadatas[genre] = metadata.sort_values('prediction', ascending=False)
        break  # only process the top-level genre subdirectories
    return metadatas
Now we should be ready to use our classifier on the HTRC Genre corpus. This returns a dictionary object with keys for each genre (fiction, drama, poetry) and values that are pandas dataframes with all the existing metadata for each volume, plus the philosophical prediction that we've added.
htrc_genre_dir = "/Users/sgs/Downloads/genre"
philo_metadatas = get_genre_metadata_and_predictions(htrc_genre_dir, philo_clf, stopword_vectorizer)
We can have a quick peek to see how many volumes are contained in each genre:
total = 0
for genre, metadata in philo_metadatas.items():
total += len(metadata.index)
print(genre+": "+"{:,}".format(len(metadata.index)))
print("total: "+"{:,}".format(total))
fiction: 101,948
drama: 17,709
poetry: 58,724
total: 178,381
Ok, so let's rub our hands together in anticipation and have a closer look at the predictions. For each genre, let's enumerate the 15 most philosophical texts (i.e. the texts that were assigned the highest values by our philosophical classifier).
for genre, metadata in philo_metadatas.items():
    print(genre)
    for name, row in metadata.head(15).iterrows():
        print("  "+str(row['prediction']) + ": " + str(row["author"]) + " " + str(row["title"])[:40] + " ("+ name+")")
fiction
  0.942876776127685: Hamilton, William, Lectures on metaphysics and logic (uc2.ark+=13960=t6d21t16h)
  0.9264089941279002: Hamilton, William, Lectures on metaphysics and logic (uc2.ark+=13960=t5x63ch8w)
  0.7556690393458386: Hamilton, William, Lectures on metaphysics and logic (uc2.ark+=13960=t3707z04z)
  0.7400285535717109: Alcott, Amos Bronson, Table-talk (mdp.39015063976719)
  0.7343142773526233: Ladd, George Trumbull, Primer of psychology (nyp.33433070247659)
  0.7240166966930507: Morley, John, Critical miscellanies (uva.x002075999)
  0.7224443869202736: Morley, John, Critical miscellanies (uc1.b3312082)
  0.7221550252376431: Morley, John, Critical miscellanies (mdp.39015008447859)
  0.7219327902034643: Morley, John, Critical miscellanies (uc1.b3271511)
  0.7205189779025428: Morley, John, Critical miscellanies (uc2.ark+=13960=t9z03906q)
  0.6752848002922364: Smith, Garnet, The melancholy of Stephen Allard (nyp.33433074925573)
  0.6685784667208577: Dillon, Henry Augustus Dillon-Lee, The life and opinions of Sir Richard Mal (wu.89099782401)
  0.6676523527212904: Greg, William R. Enigmas of life (uc1.b293840)
  0.656211764437482: Lovett, Robert Morss, A wingéd victory (mdp.39015063939238)
  0.6559990109761032: Mackintosh, James, The miscellaneous works of the Right Hon (mdp.39015011444083)
drama
  0.46788088291335583: Jones, Lloyd, A reply to Mr. R. Carlile's objections t (wu.89097121669)
  0.10232714349406424: Muilman, Teresia Constantia, A letter humbly address'd to the Right H (mdp.39015035813768)
  0.06650988674725727: nan Boston medical police (hvd.hxj8ev)
  0.06509884255367532: Cleveland, Grover, Speech of Grover Cleveland, president of (njp.32101067015907)
  0.054464962463584565: Combe, William, A letter to Her Grace the Duchess of Dev (mdp.39015073305511)
  0.04846397981574457: nan An address to the people of Maine from t (nyp.33433034030118)
  0.02053164585346362: Lessing, Gotthold Ephraim, Nathan the wise (uc2.ark+=13960=t0sq8qm0z)
  0.0: Jones, Henry Arthur, Judah; (nyp.33433074928197)
  0.0: Euripides. The tragedies of Euripides in English ve (nyp.33433082192612)
  0.0: Euripides. The tragedies of Euripides in English ve (nyp.33433082192604)
  0.0: Shakespeare, William, Much ado about nothing (nyp.33433075793939)
  0.0: Shakespeare, William, The works of William Shakespeare (uva.x000031590)
  0.0: Davidson, John, Plays (nyp.33433074912506)
  0.0: Steele, Richard, Richard Steele (nyp.33433074912175)
  0.0: Shakespeare, William, The complete works of Shakespeare (nyp.33433074892237)
poetry
  0.6253940555248959: Greenlaw, Asbury Lincoln, Resident forces of life, the evolution o (loc.ark+=13960=t4qj82h6d)
  0.6236786508373916: Greenlaw, Asbury Lincoln, Resident forces of life, the evolution o (nyp.33433075833206)
  0.540108435156661: Spalding, John Lancaster, The Spalding year-book; (mdp.39015064337135)
  0.5400371868572602: Newcomb, Charles Benjamin, Principles of psychic philosophy (njp.32101066127778)
  0.46935127211324157: Laidlaw, James S. God in reason and intuition (loc.ark+=13960=t3126bv3p)
  0.4415951471487116: Gilmour, William Pegram, A diagnosis (uc2.ark+=13960=t2b853s2m)
  0.43340354399532943: Gilmour, William Pegram, A diagnosis (loc.ark+=13960=t3vt29v3s)
  0.4215994505874262: White, William Allen, A theory of spiritual progress; (uc2.ark+=13960=t46q1vf0j)
  0.4171965056415603: Boyd, Jackson, The unveiling; (uc2.ark+=13960=t80k2b290)
  0.41382667807742624: White, William Allen, A theory of spiritual progress; (nyp.33433081958526)
  0.40420893766678634: Parsons, A. R. Surf lines that mark where waves of thou (loc.ark+=13960=t2988t00v)
  0.39825467176157203: nan Surf lines that mark where waves of thou (uc2.ark+=13960=t5n873j3b)
  0.3966047587208943: nan Surf lines that mark where waves of thou (nyp.33433074826847)
  0.3899518109492506: De Waters, Lillian (Stephenson), Good cheer (loc.ark+=13960=t85h86d32)
  0.38421283612256785: Blavatsky, H. P. Quotations (uc2.ark+=13960=t23b66b8t)
One of the things this immediately exposes is that the HTRC genre feature sets contain duplicate texts (presumably because of sampling from different libraries – these aren't identical files since they're the products of separate digitizations and possibly separate editions). This is an annoyance, though we can probably just skip over duplicates when looking at the top samples from each group.
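One hedged way to skip over duplicates when inspecting the top of each list is to keep only the highest-scoring copy of each author/title pair. Here's a minimal sketch using a synthetic stand-in dataframe (the real frames from philo_metadatas have the same author, title and prediction columns):

```python
import pandas as pd

# Synthetic stand-in for one genre's metadata, with one duplicated work
metadata = pd.DataFrame({
    "author": ["Hamilton, William", "Hamilton, William", "Alcott, Amos Bronson"],
    "title": ["Lectures on metaphysics and logic",
              "Lectures on metaphysics and logic",
              "Table-talk"],
    "prediction": [0.94, 0.93, 0.74],
})

# Sort by score, then keep the first (highest-scoring) copy of each work
deduped = (metadata.sort_values("prediction", ascending=False)
                   .drop_duplicates(subset=["author", "title"]))
```

Note that this only catches duplicates whose author and title strings match exactly; separately catalogued editions with variant metadata would slip through.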
Another noticeable thing is the set of 0.0 scores for drama. Those scores likely mean that no prediction was made for those texts, for some reason. More importantly, it means that we don't dig very deep into the drama corpus before reaching texts with zero or negative scores, which suggests that the corpus as a whole is less philosophical.
Because our volume metadata contains date/year values, we can plot the philosophical predictions by year for each genre. This might give us a sense of some diachronic trends – how things change over time.
import matplotlib.pyplot as plt
%matplotlib inline
for genre, metadata in philo_metadatas.items():
    metadata.plot(kind='scatter', x='date', y='prediction', label=genre)
The clearest thing from these graphs is that all three genres show increasing variability over time. This is no doubt partly because there are more texts per year as we move forward in time, but that doesn't fully explain the larger spread in scores – it would seem that some texts are getting more philosophical while others are getting less philosophical.
We also see here a confirmation of our earlier observation about drama being less philosophical: we can eyeball that most scores are under zero. We can observe a crevasse around 1900 in drama, and an even more pronounced gap in poetry at about the same time. This may be caused by an issue with the HTRC genre feature sets where poetry_1894-1899.tar.gz and drama_1880-1884.tar.gz are empty.
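The suspect archives can be checked directly. This sketch flags .tar.gz files with no members; to stay self-contained it builds a throwaway empty archive in a temporary directory (reusing the poetry_1894-1899.tar.gz name for illustration) rather than touching the real corpus:

```python
import os
import tarfile
import tempfile

def empty_archives(paths):
    """Return the subset of .tar.gz paths that contain no members."""
    empties = []
    for path in paths:
        with tarfile.open(path, "r:gz") as tar:
            if not tar.getmembers():
                empties.append(path)
    return empties

# Demonstration with a throwaway empty archive in a temp directory
tmpdir = tempfile.mkdtemp()
empty_path = os.path.join(tmpdir, "poetry_1894-1899.tar.gz")
with tarfile.open(empty_path, "w:gz"):
    pass  # write no members, producing an empty archive

flagged = empty_archives([empty_path])
```

Running the same check over glob.glob(join(htrc_genre_dir, "*", "*.tar.gz")) would confirm which time slices contributed no volumes, and hence no predictions, to the plots above.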
Plotting all 178,381 points makes for very dense scatterplots. Another way to look at change over time is to consider the annual mean philosophical value for each genre. In other words, we look at the average of all the classifier predictions by year, and then compare by genre.
plt.figure(figsize=(10,5))
for genre, metadata in philo_metadatas.items():
    values = {}
    for key, grp in metadata.groupby('date'):
        values[key] = grp['prediction'].mean()
    plt.plot(list(values.keys()), list(values.values()), label=genre)
plt.legend(philo_metadatas.keys())
plt.show()
The first century or so shows more erratic fluctuation than the rest (possibly because fewer texts are sampled per year). We see the same weirdness with drama and poetry that we noticed in the previous plots, though it's even clearer here. Most genres show a gradual drop in philosophically classified texts, with the possible exception of drama.
Even though our philosophical classifier was created with a relatively small and heterogeneous corpus, it seems useful for identifying philosophical texts in the HTRC Genre corpus, as well as for suggesting some possible insights about genre and change over time. In particular, it might be interesting to run the same experiment with the LitCrit classifier.
The HTRC Genre corpus is a wonderful resource because it provides well-organized and readily-accessible word frequency values. The next step might be to try something similar on the full HTRC corpus of 4.8 million volumes (though likely a subset of that, since our classifier has been trained on English texts only).
(CC-BY) By Stéfan Sinclair, Geoffrey Rockwell and the Trace of Theory team, last updated November 16, 2015.