Unsupervised classification of philosophical genres

This notebook is part of ongoing work for the Trace of Theory project, a collaboration between researchers of NovelTM and the HathiTrust Research Center (HTRC).

Here, we'll use unsupervised techniques to identify clusters of similar texts within a corpus of about 3,200 philosophical texts. These texts were previously identified in the HathiTrust public domain corpus using a list of philosophical keywords. The idea now is to look for something like philosophical "genres" within this subcorpus and to compare the computational results to human labels. Our features will mix word-count data with measures of form and with textual metadata, so that we're examining not just subject matter, but also style and (minimal) context.

The work below is almost exclusively about methods. There's not a lot of analysis, and the notebook ends with suggestions for things to try, rather than conclusions about philosophical genre.

Roadmap

  • Download feature data for the 3,200 philosophical texts from the HathiTrust Research Center
  • Parse feature data
  • Calculate other features derived from the same sources
  • Reduce the dimensions of the feature space in order to have some hope of clustering the texts
  • Perform k-means and DBSCAN clustering on the dimension-reduced features
  • Visualize the clustering output alongside the human labels; use both static plots (via Matplotlib/Seaborn) and interactives (via Bokeh)
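As a rough preview of the dimension-reduction and clustering steps in the roadmap, here is a minimal sketch on synthetic data. It is not the notebook's actual pipeline: the `toy_features` matrix, the three made-up group centers, and every parameter value (`n_components=2`, `n_clusters=3`, `eps=0.5`, `min_samples=5`) are assumptions chosen only so the example runs on its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN

# Hypothetical stand-in for the real feature matrix:
# 150 "texts" (rows) with 20 numeric features (columns),
# drawn from three artificial groups of 50 texts each.
rng = np.random.RandomState(0)
centers = np.array([[0.0] * 20, [6.0] * 20, [-6.0] * 20])
toy_features = np.vstack(
    [rng.normal(loc=c, scale=1.0, size=(50, 20)) for c in centers]
)

# Put every feature on a comparable scale, then project to 2 dimensions
scaled = StandardScaler().fit_transform(toy_features)
reduced = PCA(n_components=2).fit_transform(scaled)

# k-means needs the number of clusters up front
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)

# DBSCAN infers the number of clusters from density; label -1 marks noise
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(reduced)

print('Reduced shape:', reduced.shape)
print('k-means cluster sizes:', np.bincount(kmeans_labels))
print('DBSCAN labels found:', sorted(set(dbscan_labels)))
```

The two algorithms are complementary: k-means assigns every text to one of a fixed number of clusters, while DBSCAN can leave sparse points unassigned, which may be useful if some texts do not belong to any recognizable group.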

Imports

In [1]:
import warnings
warnings.filterwarnings('ignore')

# Data libraries
import pandas as pd
from nltk.corpus.reader import PlaintextCorpusReader
from nltk.corpus import cmudict
import nltk
from collections import defaultdict
import bz2
import json
import os
import subprocess

# Machine learning and math libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
import numpy as np

# Plotting libraries
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns  # Note that seaborn >= 0.6.0 is required for some plots
from bokeh.charts import Scatter, output_notebook, show  # Note that bokeh.charts was removed in Bokeh 0.12.9; an older Bokeh is required
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.plotting import figure
from bokeh.palettes import Spectral5, Spectral6, Spectral7

# Set up plotting
sns.set_context('talk')
sns.set_style('darkgrid')
plt.figure(figsize=(8, 6))
basecolor = 'steelblue'
%matplotlib inline
output_notebook()