# Bake-off: The semantic orientation method

Important: This isn't being run as a bake-off this year. It's included in the repository in case people want to do additional exploration or incorporate this kind of evaluation into a project.

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018 term"


## Overview

The semantic orientation method of Turney and Littman (2003) is a general, unsupervised (or lightly supervised?) method for building lexicons for any desired semantic dimension using VSMs.

The method relies on intuitive seed-sets and the basic vector distance measures used throughout work with VSMs. Here's a summary of the method:

1. Let $X$ be a VSM with dimension $m \times n$ and vocabulary $V$. Let $I$ be the set of indices $\{1, 2, \ldots, m\}$.

2. For $i \in I$, let $X_{i}$ be the vector representation of $V_{i}$.

3. Define two seed-sets $S_{1} \subseteq I$ and $S_{2} \subseteq I$. They should have the same cardinality and be semantically opposing in some way that is appropriate for your matrix.

4. Pick a vector distance measure $\textbf{f}$.

5. For all $i \in I$:

$$\textbf{score}(X_{i}) = \left(\sum_{j \in S_{1}} \textbf{f}(X_{i}, X_{j})\right) - \left(\sum_{k \in S_{2}} \textbf{f}(X_{i}, X_{k})\right)$$
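As a rough illustration of steps 1 to 5, here is a minimal NumPy sketch on a toy matrix. The helper names `cosine_distance` and `so_scores` and the matrix values are assumptions made for this example; they are not the notebook's implementation. Note that the sketch uses 0-based row indices, whereas the summary above indexes from 1.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of u and v."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def so_scores(X, seeds1, seeds2, f=cosine_distance):
    """For each row X_i: sum of f(X_i, X_j) over seeds1 minus the
    sum of f(X_i, X_k) over seeds2 (hypothetical helper)."""
    return np.array([
        sum(f(row, X[j]) for j in seeds1) - sum(f(row, X[k]) for k in seeds2)
        for row in X])

# Toy 4 x 2 matrix (made-up values): rows 0 and 1 point one way,
# rows 2 and 3 point the other way.
X = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9]])

# One seed per set: row 0 for S1 and row 2 for S2.
scores = so_scores(X, seeds1=[0], seeds2=[2])
print(scores)
```

Because `f` is a distance, rows near the first seed-set get negative scores and rows near the second get positive ones, so here rows 0 and 1 score below zero and rows 2 and 3 above it.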

This method is implemented below as semantic_orientation. You can play around with making your own seed-sets and seeing what scores the method assigns. After that, the bake-off itself involves assessing your output against a multidimensional sentiment lexicon.

## Set-up

In [2]:
from collections import defaultdict
import csv
import numpy as np
import os
import pandas as pd
from scipy.stats import pearsonr, spearmanr
import vsm

In [3]:
data_home = 'vsmdata'


## Implementation

In [4]:
def semantic_orientation(
        df,
        seeds1=('bad', 'nasty', 'poor', 'negative', 'unfortunate', 'wrong', 'inferior'),
        seeds2=('good', 'nice', 'excellent', 'positive', 'fortunate', 'correct', 'superior'),
        distfunc=vsm.cosine):
    """No-frills implementation of the semantic orientation (SO) method of
    Turney and Littman. seeds1 and seeds2 should be representative members
    of two intuitively opposing semantic classes. The method will then try
    to rank the vocabulary by its relative association with each seed set.

    Parameters
    ----------
    df : pd.DataFrame
        The matrix used to derive the SO ranking.
    seeds1 : tuple of str
        The default is the negative seed set of Turney and Littman.
    seeds2 : tuple of str
        The default is the positive seed set of Turney and Littman.
    distfunc : function mapping vector pairs to floats (default: cosine)
        The measure of distance between vectors. Can also be euclidean,
        matching, jaccard, as well as any other distance measure
        between 1d vectors.

    Returns
    -------
    pd.Series
        The vocabulary ranked according to the SO method. Because
        distfunc is a distance, words closest to seeds2 are at the top
        and words closest to seeds1 are at the bottom.

    """
    rownames = set(df.index)
    # Check that the seed sets are in the vocabulary, filtering
    # where necessary, and warn the user about exclusions:
    seeds1 = _value_check(seeds1, "seeds1", rownames)
    seeds2 = _value_check(seeds2, "seeds2", rownames)

    # Subframes for the two seed sets:
    sm1 = df.loc[seeds1]
    sm2 = df.loc[seeds2]

    # Core semantic orientation calculation: summed distance to
    # seeds1 minus summed distance to seeds2.
    def row_func(row):
        val1 = sm1.apply(lambda x: distfunc(row, x), axis=1).sum()
        val2 = sm2.apply(lambda x: distfunc(row, x), axis=1).sum()
        return val1 - val2

    scores = df.apply(row_func, axis=1)
    return scores.sort_values(ascending=False)

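def _value_check(ss, name, rownames):
    # Keep only the seeds found in the vocabulary. A list (rather than
    # a set) preserves order and is accepted by df.loc.
    new = []
    for w in ss:
        if w not in rownames:
            print("Warning: {} not in {}".format(w, name))
        elif w not in new:
            new.append(w)
    return new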

In [5]:
imdb20 = pd.read_csv(
    os.path.join(data_home, 'imdb_window20-flat.csv.gz'), index_col=0)

In [6]:
imdb20_ppmi = vsm.pmi(imdb20)

In [7]:
imdb20_ppmi_so = semantic_orientation(imdb20_ppmi)

Warning: inferior not in seeds1
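The warning above shows that 'inferior' was filtered out, so this run uses six seeds in seeds1 and seven in seeds2, although step 3 of the method asks for equal cardinality. With a difference of sums, the larger set contributes one extra distance term to every score. The sketch below illustrates that effect under stated assumptions and is not part of the notebook's implementation: `cosine_distance`, `so_score`, the `agg` parameter, and the toy vectors are all hypothetical. It compares summed and mean distances for a vector that is equally far from every seed.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of u and v."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def so_score(x, seeds1, seeds2, agg=np.sum):
    """Aggregate distance to seeds1 minus aggregate distance to seeds2.
    agg is a hypothetical knob: np.sum follows the method as described,
    np.mean is one possible adjustment for unequal seed-set sizes."""
    d1 = agg([cosine_distance(x, s) for s in seeds1])
    d2 = agg([cosine_distance(x, s) for s in seeds2])
    return d1 - d2

# Made-up vectors: x is equally distant from every seed.
x = np.array([1.0, 1.0])
seeds1 = [np.array([1.0, 0.0])] * 2   # two seeds
seeds2 = [np.array([0.0, 1.0])] * 3   # three seeds

sum_score = so_score(x, seeds1, seeds2, agg=np.sum)
mean_score = so_score(x, seeds1, seeds2, agg=np.mean)
print(sum_score, mean_score)
```

With sums, the neutral vector gets a negative score purely because seeds2 is larger; with means, its score is approximately zero. Whether averaging (or trimming the larger set) is appropriate for a given matrix is a judgment call.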

In [8]:
imdb20_ppmi_so.head()

Out[8]:
excellent    0.596622
superb       0.249968
great        0.247541
superior     0.230842
nice         0.189436
dtype: float64

In [9]:
imdb20_ppmi_so.tail()

Out[9]:
unfortunate   -1.694198
nasty         -1.838113
poor          -1.907216
wrong         -1.929000
dtype: float64

## Multidimensional sentiment lexicon

Warriner et al. (2013) released a dataset called 'Norms of valence, arousal, and dominance for 13,915 English lemmas'. This is included in vsmdata as Ratings_Warriner_et_al.csv. The following code reads this file in and creates a DataFrame that gives just the overall means for these three semantic dimensions.

In [10]:
def load_warriner_lexicon(src_filename, df=None):
    """Read in 'Ratings_Warriner_et_al.csv' and optionally restrict its
    vocabulary to items in df.

    Parameters
    ----------
    src_filename : str
        Full path to 'Ratings_Warriner_et_al.csv'
    df : pd.DataFrame or None
        If this is given, then its index is intersected with the
        vocabulary from the lexicon, and we return a lexicon
        containing only values in both vocabularies.

    Returns
    -------
    pd.DataFrame

    """
    # Read the raw ratings file; only the columns selected below are kept.
    lexicon = pd.read_csv(src_filename)
    lexicon = lexicon[['Word', 'V.Mean.Sum', 'A.Mean.Sum', 'D.Mean.Sum']]
    lexicon = lexicon.set_index('Word').rename(
        columns={'V.Mean.Sum': 'Valence',
                 'A.Mean.Sum': 'Arousal',
                 'D.Mean.Sum': 'Dominance'})
    if df is not None:
        shared_vocab = sorted(set(lexicon.index) & set(df.index))
        lexicon = lexicon.loc[shared_vocab]
    return lexicon

In [11]:
lexicon = load_warriner_lexicon(
    os.path.join(data_home, 'Ratings_Warriner_et_al.csv'),
    imdb20_ppmi)

In [12]:
lexicon.head()

Out[12]:
          Valence  Arousal  Dominance
Word
TV           5.42     4.29       6.23
ability      7.00     4.85       6.55
able         6.64     3.38       6.17
abortion     2.58     5.43       4.73
absolute     5.43     3.48       5.58

## Evaluation

We evaluate VSMs by the Pearson correlation coefficient between the scores delivered by semantic_orientation and the values in the Warriner et al. lexicon.

In [13]:
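def evaluation(lexicon, so, colname='Valence', metric=pearsonr):
    # Assigning the Series aligns the SO scores to the lexicon's index.
    # Note that this adds an 'so' column to the lexicon passed in.
    lexicon['so'] = so
    rho, pvalue = metric(lexicon['so'], lexicon[colname])
    print("{0:}'s r: {1:0.3f}".format(metric.__name__, rho))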
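The evaluation step depends on two things that are easy to miss: pandas aligns the SO scores to the lexicon by index label, and the `metric` argument accepts any function with the same return shape as `pearsonr`, such as the `spearmanr` imported in the set-up. The sketch below shows both on made-up data; `toy_lexicon`, `toy_so`, and all values in them are hypothetical.

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Hypothetical lexicon and SO scores; note the different row orders
# and the extra word in the scores.
toy_lexicon = pd.DataFrame(
    {'Valence': [7.0, 2.0, 5.0, 6.0]},
    index=['good', 'bad', 'okay', 'nice'])
toy_so = pd.Series(
    [-1.5, 0.1, 0.4, 0.5, 9.9],
    index=['bad', 'okay', 'nice', 'good', 'extra'])

# Column assignment aligns on the index, so each word receives its own
# score and 'extra' (absent from the lexicon) is dropped.
toy_lexicon['so'] = toy_so

# Both metrics return a (statistic, p-value) pair.
r, _ = pearsonr(toy_lexicon['so'], toy_lexicon['Valence'])
rho, _ = spearmanr(toy_lexicon['so'], toy_lexicon['Valence'])
print("pearsonr: {:0.3f}, spearmanr: {:0.3f}".format(r, rho))
```

Pearson's r measures linear agreement between the scores and the ratings, while Spearman's rho only compares their rank orders, so the two can diverge when the relationship is monotonic but not linear.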


## Baseline

Here's a simple baseline: PPMI on imdb20 as loaded above.

In [14]:
imdb20_ppmi_so = semantic_orientation(imdb20_ppmi)

Warning: inferior not in seeds1

In [15]:
evaluation(lexicon, imdb20_ppmi_so, colname='Valence')

pearsonr's r: 0.361

In [16]:
evaluation(lexicon, imdb20_ppmi_so, colname='Arousal')

pearsonr's r: 0.005

In [17]:
evaluation(lexicon, imdb20_ppmi_so, colname='Dominance')

pearsonr's r: 0.315


## Bake-off submission

A bake-off submission would include:

1. The name of the count matrix you started with (must be one in vsmdata).
2. The seed-sets you used.
3. A description of the steps you took to create your bake-off VSM (must be different from the above baseline).
4. Your Pearson r values for 'Valence', 'Arousal', and 'Dominance'.