Bake-off: The semantic orientation method

Important: This isn't being run as a bake-off this year. It's included in the repository in case people want to do additional exploration or incorporate this kind of evaluation into a project.

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018 term"


The semantic orientation method of Turney and Littman 2003 is a general, unsupervised (or lightly supervised?) method for building lexicons for any desired semantic dimension using VSMs.

The method relies on intuitive seed sets and on the basic distance measures used in nearly all work with VSMs. Here's a summary of the method:

  1. Let $X$ be a VSM with dimension $m \times n$ and vocabulary $V$. Let $I$ be the set of indices $\{1, 2, \ldots, m\}$.

  2. For $i \in I$, let $x_{i}$ be the vector representation of $V_{i}$, i.e., row $i$ of $X$.

  3. Define two seed-sets $S_{1} \subseteq I$ and $S_{2} \subseteq I$. They should have the same cardinality and be semantically opposing in some way that is appropriate for your matrix.

  4. Pick a vector distance measure $\textbf{f}$.

  5. For all $i \in I$:

$$\textbf{score}(x_{i}) =
\left(\sum_{j \in S_{1}} \textbf{f}(x_{i}, x_{j})\right) -
\left(\sum_{k \in S_{2}} \textbf{f}(x_{i}, x_{k})\right)$$

This method is implemented below as semantic_orientation. You can play around with making your own seed-sets and seeing what scores the method assigns. After that, the bake-off itself involves assessing your output against a multidimensional sentiment lexicon.
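
To make the scoring rule concrete, here is a minimal sketch on a toy two-dimensional space. The vectors in `toy`, the helper `cosine_distance`, and the function `so_score` are hypothetical illustrations, not part of the course code; each seed set contains a single word to keep the arithmetic easy to follow.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of u and v."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical toy VSM: four words in a 2-dimensional space.
toy = {
    'good': np.array([1.0, 0.0]),
    'bad': np.array([0.0, 1.0]),
    'fine': np.array([0.9, 0.1]),
    'awful': np.array([0.1, 0.9]),
}

def so_score(word, seeds1=('bad',), seeds2=('good',)):
    """Summed distance to seeds1 minus summed distance to seeds2."""
    x = toy[word]
    d1 = sum(cosine_distance(x, toy[s]) for s in seeds1)
    d2 = sum(cosine_distance(x, toy[s]) for s in seeds2)
    return d1 - d2

toy_scores = {w: so_score(w) for w in toy}

# Rank the vocabulary from highest to lowest score.
ranked = sorted(toy_scores, key=toy_scores.get, reverse=True)
print(ranked)
```

Because the score is a difference of distances, words that are far from `seeds1` and close to `seeds2` get the highest scores. With the negative seed in `seeds1`, positive words therefore end up at the top of the ranking.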


In [2]:
from collections import defaultdict
import csv
import numpy as np
import os
import pandas as pd
from scipy.stats import pearsonr, spearmanr
import vsm
In [3]:
data_home = 'vsmdata'


In [4]:
def semantic_orientation(
        df,
        seeds1=('bad', 'nasty', 'poor', 'negative', 'unfortunate', 'wrong', 'inferior'),
        seeds2=('good', 'nice', 'excellent', 'positive', 'fortunate', 'correct', 'superior'),
        distfunc=vsm.cosine):
    """No frills implementation of the Semantic Orientation (SO) method of
    Turney and Littman. `seeds1` and `seeds2` should be representative members
    of two intuitively opposing semantic classes. The method will then try
    to rank the vocabulary by its relative association with each seed set.

    Parameters
    ----------
    df : pd.DataFrame
        The matrix used to derive the SO ranking.
    seeds1 : tuple of str
        The default is the negative seed set of Turney and Littman.
    seeds2 : tuple of str
        The default is the positive seed set of Turney and Littman.
    distfunc : function mapping vector pairs to floats (default: `cosine`)
        The measure of distance between vectors. Can also be `euclidean`,
        `matching`, `jaccard`, as well as any other distance measure
        between 1d vectors.

    Returns
    -------
    pd.Series
        The vocabulary ranked according to the SO method. Scores are
        distance differences sorted in descending order, so words
        closest to `seeds2` are at the top and words closest to
        `seeds1` are at the bottom.

    """
    rownames = set(df.index)
    # Check that the seed sets are in the vocabulary, filtering
    # where necessary, and warn the user about exclusions:
    seeds1 = _value_check(seeds1, "seeds1", rownames)
    seeds2 = _value_check(seeds2, "seeds2", rownames)
    # Subframes for the two seeds-sets
    sm1 = df.loc[seeds1]
    sm2 = df.loc[seeds2]
    # Core semantic orientation calculation:
    def row_func(row):
        val1 = sm1.apply(lambda x: distfunc(row, x), axis=1).sum()
        val2 = sm2.apply(lambda x: distfunc(row, x), axis=1).sum()
        return val1 - val2
    scores = df.apply(row_func, axis=1)
    return scores.sort_values(ascending=False)

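def _value_check(ss, name, rownames):
    # Keep only the seed words that occur in the vocabulary. A list is
    # returned because recent pandas versions reject sets as `.loc` indexers.
    new = []
    for w in ss:
        if w not in rownames:
            print("Warning: {} not in {}".format(w, name))
        elif w not in new:
            new.append(w)
    return new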
In [5]:
imdb20 = pd.read_csv(
    os.path.join(data_home, 'imdb_window20-flat.csv.gz'), index_col=0)
In [6]:
imdb20_ppmi = vsm.pmi(imdb20)
In [7]:
imdb20_ppmi_so = semantic_orientation(imdb20_ppmi)
Warning: inferior not in seeds1
In [8]:
imdb20_ppmi_so.head()
excellent    0.596622
superb       0.249968
great        0.247541
superior     0.230842
nice         0.189436
dtype: float64
In [9]:
imdb20_ppmi_so.tail()
unfortunate   -1.694198
nasty         -1.838113
poor          -1.907216
wrong         -1.929000
bad           -1.954924
dtype: float64

Multidimensional sentiment lexicon

Warriner et al. (2013) released a dataset called 'Norms of valence, arousal, and dominance for 13,915 English lemmas'. This is included in vsmdata as Ratings_Warriner_et_al.csv. The following code reads this file in and creates a DataFrame that gives just the overall means for these three semantic dimensions.

In [10]:
def load_warriner_lexicon(src_filename, df=None):
    """Read in 'Ratings_Warriner_et_al.csv' and optionally restrict its
    vocabulary to items in `df`.

    Parameters
    ----------
    src_filename : str
        Full path to 'Ratings_Warriner_et_al.csv'
    df : pd.DataFrame or None
        If this is given, then its index is intersected with the
        vocabulary from the lexicon, and we return a lexicon
        containing only values in both vocabularies.

    Returns
    -------
    pd.DataFrame

    """
    lexicon = pd.read_csv(src_filename, index_col=0)
    lexicon = lexicon[['Word', 'V.Mean.Sum', 'A.Mean.Sum', 'D.Mean.Sum']]
    lexicon = lexicon.set_index('Word').rename(
        columns={'V.Mean.Sum': 'Valence', 
                 'A.Mean.Sum': 'Arousal', 
                 'D.Mean.Sum': 'Dominance'})
    if df is not None:
        shared_vocab = sorted(set(lexicon.index) & set(df.index))
        lexicon = lexicon.loc[shared_vocab]
    return lexicon
In [11]:
lexicon = load_warriner_lexicon(
    os.path.join(data_home, 'Ratings_Warriner_et_al.csv'),
    df=imdb20_ppmi)
In [12]:
lexicon.head()
          Valence  Arousal  Dominance
TV           5.42     4.29       6.23
ability      7.00     4.85       6.55
able         6.64     3.38       6.17
abortion     2.58     5.43       4.73
absolute     5.43     3.48       5.58


We evaluate VSMs by the Pearson correlation coefficient between the scores delivered by semantic_orientation and the values in the Warriner et al. lexicon.

In [13]:
def evaluation(lexicon, so, colname='Valence', metric=pearsonr):
    lexicon['so'] = so
    rho, pvalue = metric(lexicon['so'], lexicon[colname])
    print("{0:}'s r: {1:0.3f}".format(metric.__name__, rho))
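
The `metric` argument of `evaluation` accepts any function with the same call pattern as `pearsonr`, including `spearmanr`, which is imported above. The following sketch uses hypothetical toy data to show how the two can differ: Pearson measures linear association, whereas Spearman depends only on rank order.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical toy data: y is a monotone but nonlinear function of x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

# Both functions return the coefficient first and the p-value second.
r, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)

print("pearsonr: {:0.3f}".format(r))
print("spearmanr: {:0.3f}".format(rho))
```

Here the ranks of `x` and `y` agree perfectly, so Spearman's coefficient is 1, while Pearson's falls below 1 because the relationship is not linear. If SO scores track the lexicon values in order but not in scale, passing `metric=spearmanr` to `evaluation` may be a useful complementary check.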


Here's a simple baseline: PPMI on imdb20 as loaded above.

In [14]:
imdb20_ppmi_so = semantic_orientation(imdb20_ppmi)
Warning: inferior not in seeds1
In [15]:
evaluation(lexicon, imdb20_ppmi_so, colname='Valence')
pearsonr's r: 0.361
In [16]:
evaluation(lexicon, imdb20_ppmi_so, colname='Arousal')
pearsonr's r: 0.005
In [17]:
evaluation(lexicon, imdb20_ppmi_so, colname='Dominance')
pearsonr's r: 0.315

Bake-off submission

  1. The name of the count matrix you started with (must be one in vsmdata).
  2. The seed-sets you used.
  3. A description of the steps you took to create your bake-off VSM – must be different from the above baseline.
  4. Your Pearson r values for 'Valence', 'Arousal', and 'Dominance'.