This notebook is a part of work being done for the Trace of Theory project, a collaboration between researchers of NovelTM and the HathiTrust Research Center (HTRC). In particular, we want to use both supervised and unsupervised machine learning techniques on HTRC texts to gain a better understanding of the extent and nature of theory in various genres.
This notebook is a much shorter version of the Classifying Philosophical Texts notebook, where many of the steps are explained in more detail. The purpose of this notebook is to try the same methodology on another corpus.
Below is a longer chunk of (mostly unexplained) code that essentially produces two visualizations for document similarity:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import MDS
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import ward, dendrogram
%matplotlib inline
# plot the documents from the corpus
def plot_corpus_similarity(corpus, vectorizer):

    # generate the vectors, distances and positions
    texts = [corpus.raw(fileid) for fileid in corpus.fileids()]
    documentTermMatrix = vectorizer.fit_transform(texts)
    distances = 1 - cosine_similarity(documentTermMatrix)
    mds = MDS(dissimilarity="precomputed", random_state=1)
    positions = mds.fit_transform(distances)

    # plot dendrogram
    linkage_matrix = ward(distances)
    plt.figure(figsize=(8,10))
    dendrogram(linkage_matrix, labels=corpus.fileids(), orientation="right");
    plt.show() # fixes margins

    # plot scatter
    xvalues = positions[:, 0]
    yvalues = positions[:, 1]
    plt.figure(figsize=(20,10))
    for x, y, name in zip(xvalues, yvalues, corpus.fileids()):
        plt.scatter(x, y)
        # the colour-coding here is a bit of a hard-coded hack for what is otherwise mostly reusable code
        plt.text(x, y, name.replace(".txt", "")[:25], color='red' if 'NonLitCrit' in name else 'green')
    plt.show()
We'll begin by loading our corpus into an NLTK corpus for convenience (all the plain text files in the data/LitCrit/texts directory of the repository).
import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
data_dir = "../../data/LitCrit"
corpus = PlaintextCorpusReader(data_dir+"/texts", r".*\.txt")
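As a quick sanity check (assuming the directory layout described above), we can confirm how many plain text files the corpus reader found:
print(len(corpus.fileids()))  # how many .txt files were found
print(corpus.fileids()[:5])   # peek at the first few file names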
Now that we have a corpus, we can try out the functions above by sending in the entire corpus with a simple vectorizer that uses neither a stopword list nor a restricted list of keywords. Notice that we use a TF-IDF vectorizer but tell it not to compute document frequencies (use_idf=False), which simply normalizes (relativizes) the term frequencies by document.
simple_vectorizer = TfidfVectorizer(use_idf=False)
plot_corpus_similarity(corpus, simple_vectorizer)
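As an aside, here is a small sketch (reusing corpus and simple_vectorizer from above) that peeks at the document-term matrix itself; with use_idf=False and the vectorizer's default norm="l2", each row is simply the document's term frequencies scaled to unit length, so document length no longer matters:
import numpy as np

# rebuild the document-term matrix and check its shape and row normalization
dtm = simple_vectorizer.fit_transform([corpus.raw(fileid) for fileid in corpus.fileids()])
print(dtm.shape)  # (number of documents, number of terms)
print(np.linalg.norm(dtm[0].toarray()))  # ~1.0 for every document row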
We see that the documents cluster fairly well in both graphs.
Let's repeat the experiment, this time using English stopwords – this will remove common function words from the relative term frequencies matrix. We'll also limit the number of terms considered to 5,000 for efficiency.
stoplist_vectorizer = TfidfVectorizer(use_idf=False, stop_words=nltk.corpus.stopwords.words("english"), max_features=5000)
plot_corpus_similarity(corpus, stoplist_vectorizer)
These new graphs with stopwords show some differences: the clusters seem a bit tighter than without stopwords.
Rather than exclude certain words as we've done with the English stopwords, we can also run the same process by only including certain (LitCrit) words. We have a list of LitCrit keywords developed by Laura & Co. that we can use.
keywords = list(set([line.rstrip('\n').lower() for line in open(data_dir+'/LitTermsNonHierarch.txt')]))
keywords_vectorizer = TfidfVectorizer(use_idf=False, vocabulary=keywords)
plot_corpus_similarity(corpus, keywords_vectorizer)
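One way to confirm that the vocabulary restriction behaves as expected (a sketch reusing keywords and keywords_vectorizer from above) is to check that the matrix has exactly one column per keyword, regardless of what other words appear in the texts:
# with a fixed vocabulary, the columns of the matrix are exactly the keyword list
dtm = keywords_vectorizer.fit_transform([corpus.raw(fileid) for fileid in corpus.fileids()])
print(dtm.shape[1] == len(keywords))  # True: one column per keyword
print(sorted(keywords)[:10])  # a peek at some of the LitCrit keywords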
At a quick glance, the clustering here is better with the keywords (unlike our experiment with philosophical texts, where the philosophical keywords performed less well than the stopword list).
The major take-away from this exploratory work with document similarities is that our LitCrit and NonLitCrit texts do cluster into recognizable groups, and that restricting the vocabulary to the LitCrit keywords produces the most convincing separation.
Supervised classification is essentially a process where we help the computer understand how known items (like texts) are classified in a training set, and then the computer tries to help us by classifying items it hasn't seen before. To determine the accuracy of a classifier, we can use a subset of our known items for training purposes and then test against another subset of our known items to see how many are correctly classified.
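Before running the full benchmark below, here is a minimal sketch of that idea with a single classifier; the texts and labels are invented for illustration only:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical labelled texts for training and testing
train_texts = ["a close reading of the novel and its narrator", "statistical methods for measuring crop yields"]
train_labels = ["LitCrit", "NonLitCrit"]
test_texts = ["the irony of the narrator undermines the plot"]
test_labels = ["LitCrit"]

vectorizer = TfidfVectorizer(use_idf=False)
X_train = vectorizer.fit_transform(train_texts)  # learn the vocabulary from the training set only
X_test = vectorizer.transform(test_texts)  # reuse that vocabulary for the test set

classifier = MultinomialNB(alpha=.01).fit(X_train, train_labels)
print(classifier.score(X_test, test_labels))  # proportion of test items classified correctly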
import random
from pandas import DataFrame
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid

def benchmark_svms(labelled_texts, runs, vectorizer):
    results = defaultdict(list)
    split = int(len(labelled_texts)/2)
    for i in range(0, runs):

        # shuffle and split the labelled texts into equal-sized training and test sets
        random.shuffle(labelled_texts)
        train_set, test_set = labelled_texts[split:], labelled_texts[:split]
        train_set_categories = ["NonLitCrit" if "NonLitCrit" in category else "LitCrit" for category, text in train_set]
        test_set_categories = ["NonLitCrit" if "NonLitCrit" in category else "LitCrit" for category, text in test_set]

        # fit the vectorizer on the training texts and reuse its vocabulary for the test texts
        X_train = vectorizer.fit_transform([text for category, text in train_set])
        X_test = vectorizer.transform([text for category, text in test_set])

        # train and score each classifier on this split
        for clf, name in (
                (RidgeClassifier(tol=1e-2, solver="lsqr"), "Ridge Classifier"),
                (Perceptron(n_iter=50), "Perceptron"),
                (PassiveAggressiveClassifier(n_iter=50), "Passive-Aggressive"),
                (KNeighborsClassifier(n_neighbors=10), "kNN"),
                (LinearSVC(), "LinearSVC"),
                (LinearSVC(loss='l2', penalty="l2", dual=False, tol=1e-3), "LinearSVC l2"),
                (LinearSVC(loss='l2', penalty="l1", dual=False, tol=1e-3), "LinearSVC l1"),
                (SGDClassifier(alpha=.0001, n_iter=50, penalty="l2"), "SGD l2"),
                (SGDClassifier(alpha=.0001, n_iter=50, penalty="l1"), "SGD l1"),
                (SGDClassifier(alpha=.0001, n_iter=50, penalty="elasticnet"), "SGD elasticnet"),
                (NearestCentroid(), "NearestCentroid (aka Rocchio classifier)"),
                (MultinomialNB(alpha=.01), "Naïve Bayes Multinomial"),
                (BernoulliNB(alpha=.01), "Naïve Bayes Bernoulli")):
            clf.fit(X_train, train_set_categories)
            results[name].append(clf.score(X_test, test_set_categories))

    # collect the accuracy scores into a DataFrame, print the ranked averages and plot the runs
    orderedresults = [(name, values) for name, values in results.items()]
    results_df = DataFrame([values for name, values in orderedresults], index=[name for name, values in orderedresults])
    print("Ordered averages:")
    print(results_df.mean(axis=1).order(ascending=False))
    results_df.transpose().plot(figsize=(20, 10))
Let's send in all the texts from our corpus. We pair each text with its file ID (which serves as its label) because the order of the texts is shuffled during each run and we need a way of retrieving the label afterwards.
labelled_texts = [(fileid, corpus.raw(fileid)) for fileid in corpus.fileids()]
benchmark_svms(labelled_texts, 5, keywords_vectorizer)
Ordered averages:
Naïve Bayes Multinomial                     0.914286
Perceptron                                  0.904762
NearestCentroid (aka Rocchio classifier)    0.904762
LinearSVC                                   0.895238
LinearSVC l2                                0.895238
Passive-Aggressive                          0.895238
SGD elasticnet                              0.895238
Naïve Bayes Bernoulli                       0.885714
SGD l1                                      0.885714
Ridge Classifier                            0.885714
kNN                                         0.885714
SGD l2                                      0.876190
LinearSVC l1                                0.847619
dtype: float64
Those scores are pretty good – a score of about .91 indicates that, averaged over the course of 5 runs, that classifier correctly classified texts in our test set roughly 91% of the time (note that every time this test is run the results may vary).
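If we want more detail than a single accuracy number, we can also look at one run more closely, for instance with a per-class breakdown (a sketch reusing labelled_texts, keywords_vectorizer and the imports from above):
from sklearn.metrics import classification_report

# a single train/test split, scored per class rather than as overall accuracy
random.shuffle(labelled_texts)
split = int(len(labelled_texts)/2)
train_set, test_set = labelled_texts[split:], labelled_texts[:split]
train_labels = ["NonLitCrit" if "NonLitCrit" in fileid else "LitCrit" for fileid, text in train_set]
test_labels = ["NonLitCrit" if "NonLitCrit" in fileid else "LitCrit" for fileid, text in test_set]
X_train = keywords_vectorizer.fit_transform([text for fileid, text in train_set])
X_test = keywords_vectorizer.transform([text for fileid, text in test_set])
classifier = MultinomialNB(alpha=.01).fit(X_train, train_labels)
print(classification_report(test_labels, classifier.predict(X_test)))  # precision and recall per class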
(CC-BY) By Stéfan Sinclair, Geoffrey Rockwell and the Trace of Theory team, last updated November 13, 2015.