Important: This isn't being run as a bake-off this year. It's included in the repository in case people want to do additional exploration or incorporate this kind of evaluation into a project.
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018 term"
The semantic orientation method of Turney and Littman (2003) is a general, unsupervised (or lightly supervised?) method for building lexicons for any desired semantic dimension using VSMs.
The method relies on intuitive seed sets and the standard distance measures used in nearly all work with VSMs. Here's a summary of the method:
1. Let $X$ be a VSM with dimension $m \times n$ and vocabulary $V$. Let $I$ be the set of indices $\{1, 2, \ldots, m\}$.
2. For $i \in I$, let $X_{i}$ be the vector representation of $V_{i}$.
3. Define two seed sets $S_{1} \subseteq I$ and $S_{2} \subseteq I$. They should have the same cardinality and be semantically opposing in some way that is appropriate for your matrix.
4. Pick a vector distance measure $\textbf{f}$.
5. For all $i \in I$:
$$\textbf{score}(i) = \left(\sum_{j \in S_{1}} \textbf{f}(X_{i}, X_{j})\right) - \left(\sum_{j \in S_{2}} \textbf{f}(X_{i}, X_{j})\right)$$
This method is implemented below as `semantic_orientation`. You can play around with making your own seed sets and seeing what scores the method assigns. After that, the bake-off itself involves assessing your output against a multidimensional sentiment lexicon.
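Before turning to the full implementation, the scoring rule summarized above can be sketched on a toy matrix. This is only an illustration under stated assumptions: `cosine_distance`, `so_scores`, and the four-word matrix `toy` are hypothetical names and made-up values, and cosine distance is taken to be one minus cosine similarity.

```python
import numpy as np
import pandas as pd


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of u and v."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


def so_scores(X, seeds1, seeds2, distfunc=cosine_distance):
    """For each row, sum its distances to seeds1 and subtract the
    sum of its distances to seeds2, then sort in descending order."""
    scores = {}
    for word, row in X.iterrows():
        d1 = sum(distfunc(row.values, X.loc[s].values) for s in seeds1)
        d2 = sum(distfunc(row.values, X.loc[s].values) for s in seeds2)
        scores[word] = d1 - d2
    return pd.Series(scores).sort_values(ascending=False)


# Hypothetical 4-word VSM with 2 made-up context dimensions.
toy = pd.DataFrame(
    [[1.0, 0.0],   # bad
     [0.9, 0.1],   # awful
     [0.0, 1.0],   # good
     [0.1, 0.9]],  # great
    index=['bad', 'awful', 'good', 'great'])

toy_so = so_scores(toy, seeds1=['bad'], seeds2=['good'])
print(toy_so)
```

Because the score is a difference of distances, a high score means a word is far from the first seed set and close to the second. In this toy case `good` receives the highest score and `bad` the lowest.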
from collections import defaultdict
import csv
import numpy as np
import os
import pandas as pd
from scipy.stats import pearsonr, spearmanr
import vsm
data_home = 'vsmdata'
def semantic_orientation(
        df,
        seeds1=('bad', 'nasty', 'poor', 'negative', 'unfortunate', 'wrong', 'inferior'),
        seeds2=('good', 'nice', 'excellent', 'positive', 'fortunate', 'correct', 'superior'),
        distfunc=vsm.cosine):
    """No-frills implementation of the Semantic Orientation (SO) method of
    Turney and Littman. `seeds1` and `seeds2` should be representative members
    of two intuitively opposing semantic classes. The method will then try
    to rank the vocabulary by its relative association with each seed set.

    Parameters
    ----------
    df : pd.DataFrame
        The matrix used to derive the SO ranking.
    seeds1 : tuple of str
        The default is the negative seed set of Turney and Littman.
    seeds2 : tuple of str
        The default is the positive seed set of Turney and Littman.
    distfunc : function mapping vector pairs to floats (default: `cosine`)
        The measure of distance between vectors. Can also be `euclidean`,
        `matching`, `jaccard`, as well as any other distance measure
        between 1d vectors.

    Returns
    -------
    pd.Series
        The vocabulary ranked according to the SO method. Scores are
        distances to `seeds1` minus distances to `seeds2`, sorted in
        descending order, so words closest to `seeds2` are at the top
        and words closest to `seeds1` are at the bottom.

    """
    rownames = set(df.index)
    # Check that the seed sets are in the vocabulary, filtering
    # where necessary, and warn the user about exclusions:
    seeds1 = _value_check(seeds1, "seeds1", rownames)
    seeds2 = _value_check(seeds2, "seeds2", rownames)
    # Subframes for the two seed sets:
    sm1 = df.loc[seeds1]
    sm2 = df.loc[seeds2]

    # Core semantic orientation calculation:
    def row_func(row):
        val1 = sm1.apply(lambda x: distfunc(row, x), axis=1).sum()
        val2 = sm2.apply(lambda x: distfunc(row, x), axis=1).sum()
        return val1 - val2

    scores = df.apply(row_func, axis=1)
    return scores.sort_values(ascending=False)


def _value_check(ss, name, rownames):
    # Keep only the seeds found in the vocabulary. A list is returned
    # rather than a set, since recent pandas versions do not accept a
    # set as a `.loc` indexer.
    new = []
    for w in ss:
        if w not in rownames:
            print("Warning: {} not in {}".format(w, name))
        elif w not in new:
            new.append(w)
    return new
imdb20 = pd.read_csv(
    os.path.join(data_home, 'imdb_window20-flat.csv.gz'), index_col=0)
imdb20_ppmi = vsm.pmi(imdb20)
imdb20_ppmi_so = semantic_orientation(imdb20_ppmi)
Warning: inferior not in seeds1
imdb20_ppmi_so.head()
excellent    0.596622
superb       0.249968
great        0.247541
superior     0.230842
nice         0.189436
dtype: float64
imdb20_ppmi_so.tail()
unfortunate   -1.694198
nasty         -1.838113
poor          -1.907216
wrong         -1.929000
bad           -1.954924
dtype: float64
Warriner et al. (2013) released a dataset called 'Norms of valence, arousal, and dominance for 13,915 English lemmas'. This is included in `vsmdata` as `Ratings_Warriner_et_al.csv`. The following code reads this file in and creates a DataFrame that gives just the overall means for these three semantic dimensions.
def load_warriner_lexicon(src_filename, df=None):
    """Read in 'Ratings_Warriner_et_al.csv' and optionally restrict its
    vocabulary to items in `df`.

    Parameters
    ----------
    src_filename : str
        Full path to 'Ratings_Warriner_et_al.csv'
    df : pd.DataFrame or None
        If this is given, then its index is intersected with the
        vocabulary from the lexicon, and we return a lexicon
        containing only values in both vocabularies.

    Returns
    -------
    pd.DataFrame

    """
    lexicon = pd.read_csv(src_filename, index_col=0)
    lexicon = lexicon[['Word', 'V.Mean.Sum', 'A.Mean.Sum', 'D.Mean.Sum']]
    lexicon = lexicon.set_index('Word').rename(
        columns={'V.Mean.Sum': 'Valence',
                 'A.Mean.Sum': 'Arousal',
                 'D.Mean.Sum': 'Dominance'})
    if df is not None:
        shared_vocab = sorted(set(lexicon.index) & set(df.index))
        lexicon = lexicon.loc[shared_vocab]
    return lexicon
lexicon = load_warriner_lexicon(
    os.path.join(data_home, 'Ratings_Warriner_et_al.csv'),
    imdb20_ppmi)
lexicon.head()
| Word | Valence | Arousal | Dominance |
|---|---|---|---|
| TV | 5.42 | 4.29 | 6.23 |
| ability | 7.00 | 4.85 | 6.55 |
| able | 6.64 | 3.38 | 6.17 |
| abortion | 2.58 | 5.43 | 4.73 |
| absolute | 5.43 | 3.48 | 5.58 |
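To see what the optional `df` argument of `load_warriner_lexicon` does without the real file, here is a minimal sketch on an in-memory CSV. The placeholder words, the numbers, and the `toy_*` names are all hypothetical, and the unnamed leading index column is an assumption based on the `index_col=0` argument used in the loader.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV: placeholder words and made-up numbers,
# with an unnamed leading index column (an assumption; see the lead-in).
toy_csv = io.StringIO(
    ",Word,V.Mean.Sum,A.Mean.Sum,D.Mean.Sum\n"
    "1,wordA,5.0,4.0,6.0\n"
    "2,wordB,7.0,5.0,6.5\n"
    "3,wordC,2.5,5.5,4.5\n")

# Same selection and renaming steps as in the loader.
toy_lexicon = pd.read_csv(toy_csv, index_col=0)
toy_lexicon = toy_lexicon[['Word', 'V.Mean.Sum', 'A.Mean.Sum', 'D.Mean.Sum']]
toy_lexicon = toy_lexicon.set_index('Word').rename(
    columns={'V.Mean.Sum': 'Valence',
             'A.Mean.Sum': 'Arousal',
             'D.Mean.Sum': 'Dominance'})

# Hypothetical VSM whose vocabulary only partly overlaps the lexicon.
toy_vsm = pd.DataFrame(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    index=['wordB', 'wordC', 'wordZ'])

# Keep only the words found in both vocabularies, in sorted order.
shared_vocab = sorted(set(toy_lexicon.index) & set(toy_vsm.index))
restricted = toy_lexicon.loc[shared_vocab]
print(restricted)
```

Only `wordB` and `wordC` survive the intersection, so every remaining lexicon entry has a corresponding row in the VSM.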
We evaluate VSMs by the Pearson correlation coefficient between the scores delivered by `semantic_orientation` and the values in the Warriner et al. lexicon.
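def evaluation(lexicon, so, colname='Valence', metric=pearsonr):
    # Align the SO scores with the lexicon by index, then correlate
    # them with the chosen lexicon column:
    lexicon['so'] = so
    rho, pvalue = metric(lexicon['so'], lexicon[colname])
    print("{0:}'s r: {1:0.3f}".format(metric.__name__, rho))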
Here's a simple baseline: PPMI on `imdb20` as loaded above.
imdb20_ppmi_so = semantic_orientation(imdb20_ppmi)
Warning: inferior not in seeds1
evaluation(lexicon, imdb20_ppmi_so, colname='Valence')
pearsonr's r: 0.361
evaluation(lexicon, imdb20_ppmi_so, colname='Arousal')
pearsonr's r: 0.005
evaluation(lexicon, imdb20_ppmi_so, colname='Dominance')
pearsonr's r: 0.315
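The `metric` argument of `evaluation` also accepts `spearmanr`, which is imported above but not used in these runs. As a hedged illustration of how the two metrics can differ, here is a self-contained sketch on made-up values. The helper `correlation` and the `toy_*` objects are hypothetical; unlike `evaluation`, the helper returns the statistic and leaves the lexicon unmodified.

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr


def correlation(lexicon, so, colname='Valence', metric=pearsonr):
    """Correlate SO scores with one lexicon column over shared words."""
    # Align on the index and keep only words present on both sides.
    joined = pd.concat(
        [lexicon[colname], so.rename('so')], axis=1, join='inner').dropna()
    stat, pvalue = metric(joined['so'], joined[colname])
    return float(stat), float(pvalue)


# Hypothetical values, for illustration only.
toy_lexicon = pd.DataFrame(
    {'Valence': [2.0, 3.0, 7.0, 8.0]},
    index=['bad', 'awful', 'good', 'great'])
toy_so = pd.Series(
    [-1.0, -0.8, 0.9, 1.0],
    index=['bad', 'awful', 'great', 'good'])

r, _ = correlation(toy_lexicon, toy_so, metric=pearsonr)
rho, _ = correlation(toy_lexicon, toy_so, metric=spearmanr)
print("pearsonr: {:0.3f}; spearmanr: {:0.3f}".format(r, rho))
```

Pearson measures linear association between the raw values, while Spearman compares only their ranks, so in this toy case the swapped order of `good` and `great` lowers the rank correlation more than the linear one.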