Here, we'll use unsupervised techniques to identify clusters of similar texts within a corpus of about 3,200 philosophical texts. These texts were previously identified in the HathiTrust public domain corpus using a list of philosophical keywords. The idea now is to look for something like philosophical "genres" within this subcorpus and to compare the computational results to human labels. Our features will mix word-count data with measures of form and with textual metadata, so that we're examining not just subject matter, but also style and (minimal) context.
The work below is almost exclusively about methods. There's not a lot of analysis, and the notebook ends with suggestions for things to try, rather than conclusions about philosophical genre.
import warnings
warnings.filterwarnings('ignore')

# Data libraries
import pandas as pd
from nltk.corpus.reader import PlaintextCorpusReader
from nltk.corpus import cmudict
import nltk
from collections import defaultdict
import bz2
import json
import os
import subprocess

# Machine learning and math libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
import numpy as np

# Plotting libraries
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns  # Note that seaborn >= 0.6.0 is required for some plots
from bokeh.charts import Scatter, output_notebook, show
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.plotting import figure
from bokeh.palettes import Spectral5, Spectral6, Spectral7

# Set up plotting
sns.set_context('talk')
sns.set_style('darkgrid')
plt.figure(figsize=(8, 6))
basecolor = 'steelblue'
%matplotlib inline
output_notebook()
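To make the feature-mixing idea from the introduction concrete, here is a minimal sketch of how word-weight, form, and metadata features might be combined into a single matrix and clustered. It is not the pipeline used in this notebook: the toy texts, the publication years, the `build_features` helper, and the choice of mean word length as a "form" measure are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the corpus; these texts and years are invented.
toy_texts = [
    "the soul and the nature of virtue and the good",
    "virtue is the good of the soul",
    "logic treats of propositions and valid inference",
    "a valid inference follows from true propositions in logic",
]
toy_years = np.array([1850, 1862, 1901, 1910])  # hypothetical metadata feature


def build_features(texts, years):
    """Stack TF-IDF weights, one form measure, and one metadata column."""
    # Subject matter: TF-IDF weights for each word in the vocabulary.
    tfidf = TfidfVectorizer().fit_transform(texts).toarray()
    # Form: mean word length per text (an assumed, very crude style measure).
    mean_word_len = np.array(
        [np.mean([len(word) for word in text.split()]) for text in texts]
    )
    # Context: publication year as a single numeric metadata column.
    extra = np.column_stack([mean_word_len, years])
    combined = np.hstack([tfidf, extra])
    # Standardize so that no feature dominates only because of its units.
    return StandardScaler().fit_transform(combined)


features = build_features(toy_texts, toy_years)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(features.shape, labels)
```

Scaling matters here because a raw year (around 1900) would otherwise swamp TF-IDF weights, which lie between 0 and 1. How much weight each feature family should get is a design choice that this sketch leaves open.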