Introduction

This is a sand pit for experimenting with ideas around using sentiment analysis to understand plot trajectories. My work on this was prompted by the discussion surrounding Matthew L. Jockers' Syuzhet package.

I don't use R, so I'm attempting to reproduce and extend Matthew's findings via this IPython Notebook.

The meaning of noise

I'm somewhat uncomfortable with how the notion of 'noise' is being used here. Assuming the whole sentiment-analysis tactic makes sense, there are two sources of noise:

  • Algorithmic Noise - arising because the sentiment analysis makes mistakes.
  • Narrative Noise - arising from the prose itself.

Neither of these is likely to be simple Gaussian noise, so I'd prefer to defer interpretation of the noise and instead find different ways to look at the data.

Dependencies

I installed TextBlob for NLTK-powered analysis:

$ pip3 install -U textblob
$ python3 -m textblob.download_corpora

I downloaded the AFINN implementation from here and patched it to work with Python 3 (here).
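For reference, the core of the AFINN approach is just a word-list lookup: each word in the AFINN-111 list carries an integer valence from -5 to +5, and a sentence scores the sum of its words. A minimal sketch of the idea (the lexicon below is a tiny illustrative subset, and `afinn_style_sentiment` is my own hypothetical name, not the patched module's API):

```python
import re

# A tiny illustrative subset of an AFINN-style word list
# (integer valences between -5 and +5).
AFINN_SAMPLE = {
    "good": 3, "love": 3, "happy": 3,
    "bad": -3, "hate": -3, "dead": -3,
}

def afinn_style_sentiment(sentence, lexicon=AFINN_SAMPLE):
    """Sum the valence of every known word in the sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(lexicon.get(word, 0) for word in words)

print(afinn_style_sentiment("I love a good story"))   # 6
print(afinn_style_sentiment("I hate a bad ending"))   # -6
```

Unknown words simply score zero, which is one reason the per-sentence signal is so spiky: many sentences contain no lexicon words at all.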

I downloaded some books from Project Gutenberg, e.g. http://www.gutenberg.org/ebooks/4217

I could not find implementations of the 'bing' and 'nrc' sentiment algorithms, and have not yet tried plumbing in the Stanford parser. I'm using AFINN as the default to try to make results comparable across implementations.

I do not have access to all of the texts used by Matthew, which also limits reproducibility.

Implementation

The following code sets up basic sentence-level sentiment analysis. The main addition is that it also supports collecting the 'cumulative emotional valence' as well as just the 'local' values.

In [8]:
from textblob import TextBlob
import afinn
import random
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = 10, 5

def senticuml(filename, method='afinn', show_sample=False):
    with open(filename, "r") as myfile:
        text = myfile.read().replace('\n', ' ')

    blob = TextBlob(text)

    tot = 0.0
    sent = []  # per-sentence 'local' sentiment
    cuml = []  # running total of sentiment
    for i, sentence in enumerate(blob.sentences):
        if method == 'afinn':
            senti = afinn.sentiment(str(sentence))
        elif method == 'random':
            senti = random.uniform(-1, 1)
        else:
            senti = sentence.sentiment.polarity
        sent.append(senti)
        tot += senti
        cuml.append(tot)
        if show_sample and 100 < i < 110:
            print(tot, senti, str(sentence)[0:40])

    return sent, cuml

def sentiplot(filename, title, use_cuml=True, method='afinn'):
    plt.figure()
    sent, cuml = senticuml(filename, method=method)
    if use_cuml:
        plt.plot(cuml, label=title)
        plt.ylabel("Cumulative Emotional Valence")
    else:
        plt.plot(sent, label=title)
        plt.ylabel("Emotional Valence")
    plt.xlabel("Sentence #")
    plt.legend()
    
In [3]:
sentiplot("texts/pg4217-trimmed.txt", "A Portrait of the Artist as a Young Man",use_cuml=False)

This seems broadly comparable with the 'noisy plot' shown here.

However, Matthew is using the 'bing' method (which seems to be a more heavily discretised scale than AFINN), so the results are not directly comparable.

Cumulative Emotional Valence

Rather than immediately attempting to 'smooth out' this noisy data, I'd like to find alternative ways of interpreting and visualising it to make the structure clearer. Here, we treat the sentence-level 'emotional valence' as a modulation rather than as an absolute value. Specifically, we shift to plotting the running sum of the individual valence values as we 'read' through the text, which I'll call the 'cumulative emotional valence'.
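The running sum computed inside senticuml() is just a prefix sum over the per-sentence values. In isolation, with hypothetical valence values, the transformation looks like this:

```python
from itertools import accumulate

# Hypothetical per-sentence valences, for illustration only.
sent = [0.5, -1.0, 2.0, 0.0, -0.5]

# The 'cumulative emotional valence' is the prefix sum of those values:
cuml = list(accumulate(sent))
print(cuml)  # [0.5, -0.5, 1.5, 1.5, 1.0]
```

Note that any constant positive or negative bias in the per-sentence scores becomes a linear trend in the cumulative plot, which matters when comparing analysers later on.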

Taking this approach, the noisy parts are naturally diminished, and the runs of positive or negative sentiment are much clearer:

In [4]:
sentiplot("texts/pg4217-trimmed.txt", "A Portrait of the Artist as a Young Man")

Here is the same plot for The Da Vinci Code. Comparing this (rather arbitrarily) with Portrait, the middle swing goes up rather than down, and there is a much higher level of fluctuation throughout:

In [5]:
sentiplot("texts/tdvc.txt", "The Da Vinci Code")

And here are the same plots for two more texts that were used as examples during the discussions around the Syuzhet package...

In [6]:
sentiplot("texts/pg174-trimmed.txt", "The Picture of Dorian Gray")
sentiplot("texts/pg1112-trimmed.txt", "THE TRAGEDY OF ROMEO AND JULIET")

Given how these stories end, this probably indicates that the cumulative method rather underestimates the negative swing towards the end of the text. (We could perhaps consider a summation window rather than always summing from the start?)
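One way that summation-window idea might look, as a rough sketch (windowed_valence is a hypothetical helper, and the window size is arbitrary):

```python
def windowed_valence(sent, window=200):
    """Sum sentiment over a sliding window of the last `window` sentences,
    rather than over the whole text so far."""
    out = []
    total = 0.0
    for i, v in enumerate(sent):
        total += v
        if i >= window:
            total -= sent[i - window]  # drop the value leaving the window
        out.append(total)
    return out

# With a window larger than the text, this matches the cumulative sum:
print(windowed_valence([1, -1, 2, -2], window=10))  # [1.0, 0.0, 2.0, 0.0]
# With a small window, early sentences stop dominating the tail:
print(windowed_valence([1, -1, 2, -2], window=2))   # [1.0, 0.0, 1.0, 0.0]
```

A windowed sum would let a strongly negative ending pull the curve down even after a long positive stretch, at the cost of introducing a window-size parameter to tune.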

Putting that aside for the moment, what does leap out here is (a) how different these two are from the Joyce example above and (b) how strikingly similar they are to each other. We can investigate this a little further by rescaling and overlaying the two plots...

In [10]:
def rescale(cuml):
    # Rescale so position runs over 0-1 and the peak valence is 1
    max_cuml = max(cuml)
    n = len(cuml)
    x = [i / n for i in range(n)]
    y = [c / max_cuml for c in cuml]
    return x, y

sent_pdg, cuml_pdg = senticuml("texts/pg174-trimmed.txt")
sent_trj, cuml_trj = senticuml("texts/pg1112-trimmed.txt")

x_pdg, y_pdg = rescale(cuml_pdg)
x_trj, y_trj = rescale(cuml_trj)
plt.plot(x_pdg, y_pdg, label="The Picture of Dorian Gray")
plt.plot(x_trj, y_trj, label="THE TRAGEDY OF ROMEO AND JULIET")
plt.legend(loc=4)
plt.xlabel("Position in text")
plt.ylabel("Rescaled Cumulative Emotional Valence")
plt.show()

This is interesting, but hardly conclusive of anything on its own. It would at least need to be run on a larger corpus (e.g. Markov variations) and/or compared against randomly generated texts or manually calibrated data sets.

However, as a neat visualisation method for exploring texts, it seems quite appealing and may be worth considering applying to texts found in the web archive. Clearly, this would be much improved by being able to quickly see which sections of the text corresponded to which parts of the graph (e.g. hover text, or text-by-plot with highlighting), but that's rather too fancy to do in an IPython Notebook, I think.

Statistical Significance

Are these trends significant? One approach is to generate random sentiment versions and see if that accidentally generates 'trends'.

In [17]:
# Set the seed so we can re-generate the same path
random.seed(2)
# Plot using the 'random' sentiment analysis:
sentiplot("texts/tdvc.txt", "The Random Walker", method='random')

It's easy to see how you might find order in that noise, so any treatment of these cumulative graphs probably needs to bake in some comparison with the likelihood that the same result could be generated by a random process.
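One rough way to quantify that comparison would be to simulate many random-sentiment 'texts' of the same length and ask how often a random walk wanders at least as far from zero as the real cumulative curve does. The function names and parameters below are my own illustrative choices:

```python
import random

def random_walk_extreme(n, trials=2000, seed=42):
    """Distribution of the maximum absolute cumulative valence reached
    by a random-sentiment 'text' of n sentences."""
    rng = random.Random(seed)
    extremes = []
    for _ in range(trials):
        total, peak = 0.0, 0.0
        for _ in range(n):
            total += rng.uniform(-1, 1)
            peak = max(peak, abs(total))
        extremes.append(peak)
    return extremes

def p_value(observed_peak, n):
    """Fraction of random walks whose peak is at least as extreme."""
    extremes = random_walk_extreme(n)
    return sum(e >= observed_peak for e in extremes) / len(extremes)

# A large observed peak relative to the text length yields a small
# p-value, suggesting the trend is not just random drift.
```

This is only a sketch: the real per-sentence scores are not uniform on [-1, 1], so a fairer null model would shuffle the observed sentence scores rather than draw fresh random ones.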

On Comparing Sentiment Algorithms

It would be good to produce a more detailed comparison of sentiment analysis algorithms. However, the only other one I have access to right now is TextBlob's default analyser (based on the Python pattern library, and labelled 'nltk' in the code above).

Oddly, this seems to consistently lean towards generating positive sentiments in the texts I've tested it on. For example:

In [13]:
sentiplot("texts/pg4217-trimmed.txt", "A Portrait of the Artist as a Young Man",method="nltk",use_cuml=False)
sentiplot("texts/pg4217-trimmed.txt", "A Portrait of the Artist as a Young Man",method="nltk")
plt.legend(loc=4)
plt.show()

Overlaying multiple plots, the linear trend seems consistent and acts to mask differences between texts...

In [15]:
sent_pdg, cuml_pdg = senticuml("texts/pg174-trimmed.txt",method="nltk")
sent_trj, cuml_trj = senticuml("texts/pg1112-trimmed.txt",method="nltk")
sent_pay, cuml_pay = senticuml("texts/pg4217-trimmed.txt",method="nltk")

x_pdg, y_pdg = rescale(cuml_pdg)
x_trj, y_trj = rescale(cuml_trj)
x_pay, y_pay= rescale(cuml_pay)
plt.plot(x_pdg, y_pdg, label="The Picture of Dorian Gray")
plt.plot(x_trj, y_trj, label="THE TRAGEDY OF ROMEO AND JULIET")
plt.plot(x_pay, y_pay, label="A Portrait of the Artist as a Young Man")
plt.legend(loc=4)
plt.xlabel("Position in text")
plt.ylabel("Rescaled Cumulative Emotional Valence")
plt.show()

Calibration

If this is going to be anything more than a visual hook or curiosity, the coding of sentiment analysis really ought to be calibrated against 'close reading', in the manner of this blog. As a start, one could take the data from that blog and replot it in the cumulative form to see how it compares.
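As a sketch of what that comparison might look like, one could accumulate hand-coded close-reading scores alongside machine scores for the same sections and measure how well the two cumulative curves agree. All values below are invented for illustration:

```python
from itertools import accumulate

# Hypothetical hand-coded sentiment scores from a close reading
# (one value per section), alongside machine scores for the same sections:
hand_coded = [1, 2, 1, -1, -3, -2, 0, 2, 3, -4]
machine    = [0.5, 1.5, 0.8, -0.4, -2.0, -1.0, 0.2, 1.0, 2.5, -3.0]

cuml_hand = list(accumulate(hand_coded))
cuml_mach = list(accumulate(machine))

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Agreement between the two cumulative curves:
print(round(pearson(cuml_hand, cuml_mach), 2))
```

The two cumulative curves could equally be overlaid with plt.plot, as in the rescaled plots above, to eyeball where the machine scores drift away from the close reading.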

On Smoothing

...

Suitability of Fourier Transforms

  • Pretty sure the periodicity problem can be avoided: I don't think the use of wave-based transforms in JPEG or MP3 forces periodicity on the data, and there are likely ways of dealing with it, such as windowing or signal mirroring.
  • Pretty sure this is entirely separate from 'ringing artefacts', which are largely arbitrary and arise from how the frequency truncation was done. i.e. a step function in frequency space creates a ringing sinc in the time domain of the signal, primarily determined by the depth of the step (i.e. the shape and height of the frequency curve at the truncation point).
  • An empirical approach to this is to fiddle with the input parameters and attempt to discern whether fiddling with the frequency cut-off has a stronger effect on the result than twiddling the input data.

Could try Fourier smoothing with mirroring to avoid the periodicity problem, but I'm not sure that's worth pursuing yet.
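A rough sketch of what that mirrored smoothing might look like (using NumPy's FFT routines; the function name, the `keep` parameter, and the test signal are all arbitrary illustrations):

```python
import numpy as np

def fourier_smooth_mirrored(values, keep=10):
    """Low-pass filter a 1-D signal, mirroring it first so the FFT sees
    a sequence whose ends join up smoothly (sidestepping the periodicity
    assumption), then keeping only the lowest `keep` frequency bins."""
    values = np.asarray(values, dtype=float)
    mirrored = np.concatenate([values, values[::-1]])  # even extension
    spectrum = np.fft.rfft(mirrored)
    spectrum[keep:] = 0                    # truncate the high frequencies
    smoothed = np.fft.irfft(spectrum, n=mirrored.size)
    return smoothed[:values.size]          # discard the mirrored half

# A noisy upward ramp: with mirroring, the smoothed curve is not forced
# to bend its ends back toward each other to make the signal periodic.
noisy = np.linspace(0, 1, 200) + np.random.default_rng(0).normal(0, 0.05, 200)
smooth = fourier_smooth_mirrored(noisy, keep=5)
```

The even extension turns a rising cumulative curve into a triangle-like shape, so the filter no longer sees a discontinuity at the boundary; the hard truncation at `keep` can still ring, which is the separate artefact noted above.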

Below from http://stackoverflow.com/questions/19122157/fft-bandpass-filter-in-python

In [ ]:
import numpy as np
from scipy.fftpack import rfft, irfft, fftfreq

time   = np.linspace(0,10,2000)
signal = np.cos(5*np.pi*time) + np.cos(7*np.pi*time)

W = fftfreq(signal.size, d=time[1]-time[0])
f_signal = rfft(signal)

# If our original signal time was in seconds, this is now in Hz    
cut_f_signal = f_signal.copy()
cut_f_signal[(W<6)] = 0

cut_signal = irfft(cut_f_signal)

# And plot...

import pylab as plt
plt.subplot(221)
plt.plot(time,signal)
plt.subplot(222)
plt.plot(W,f_signal)
plt.xlim(0,10)
plt.subplot(223)
plt.plot(W,cut_f_signal)
plt.xlim(0,10)
plt.subplot(224)
plt.plot(time,cut_signal)
plt.show()