This notebook is part of the work being done for the Trace of Theory project, a collaboration between researchers at NovelTM and the HathiTrust Research Center (HTRC).
Here, we'll use unsupervised techniques to identify clusters of similar texts within a corpus of about 3,200 philosophical texts. These texts were previously identified in the HathiTrust public domain corpus using a list of philosophical keywords. The idea now is to look for something like philosophical "genres" within this subcorpus and to compare the computational results to human labels. Our features will mix word-count data with measures of form and with textual metadata, so that we're examining not just subject matter, but also style and (minimal) context.
The work below is almost exclusively about methods. There's not a lot of analysis, and the notebook ends with suggestions for things to try, rather than conclusions about philosophical genre.
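To make the overall approach concrete before the real data is loaded, here is a minimal sketch of the kind of pipeline described above: word-count features combined with a measure of form and a piece of metadata, then standardized, reduced, and clustered. The toy texts, the `toy_years` values, and the choice of mean word length as a "form" feature are all assumptions made for illustration. They are not the notebook's actual data or feature set.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical toy documents standing in for the philosophical texts
toy_texts = [
    "reason and the categorical imperative of moral duty",
    "moral duty follows from reason and the good will",
    "the syllogism and the rules of formal logic",
    "formal logic treats the valid forms of the syllogism",
]

# Hypothetical metadata: one publication year per toy document
toy_years = np.array([[1790.0], [1795.0], [1850.0], [1860.0]])

# Word-count features, TF-IDF weighted, as a dense array
tfidf = TfidfVectorizer().fit_transform(toy_texts).toarray()

# Hypothetical "form" feature: mean word length in each document
mean_word_len = np.array(
    [[np.mean([len(word) for word in text.split()])] for text in toy_texts]
)

# Stack all feature types side by side: one row per document
features = np.hstack([tfidf, mean_word_len, toy_years])

# Standardize so that no single feature dominates by scale alone
scaled = StandardScaler().fit_transform(features)

# Reduce to two dimensions, then cluster in the reduced space
reduced = PCA(n_components=2).fit_transform(scaled)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)

print(labels)
```

Standardizing before PCA matters here because the year column is on a very different scale from the TF-IDF weights. Without it, the metadata would swamp the word-count features.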
import warnings
warnings.filterwarnings('ignore')
# Data libraries
import pandas as pd
from nltk.corpus.reader import PlaintextCorpusReader
from nltk.corpus import cmudict
import nltk
from collections import defaultdict
import bz2
import json
import os
import subprocess
# Machine learning and math libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
import numpy as np
# Plotting libraries
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns # Note that seaborn >= 0.6.0 is required for some plots
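# Note: bokeh.charts was removed from later Bokeh releases (split out as the bkcharts package),
# so this import requires an older Bokeh version
from bokeh.charts import Scatter, output_notebook, show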
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.plotting import figure
from bokeh.palettes import Spectral5, Spectral6, Spectral7
# Set up plotting
sns.set_context('talk')
sns.set_style('darkgrid')
plt.figure(figsize=(8, 6))
basecolor = 'steelblue'
%matplotlib inline
output_notebook()