**Important**: This isn't being run as a bake-off this year. It's included in the repository in case people want to do additional exploration or incorporate this kind of evaluation into a project.

In [1]:

```
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018 term"
```

The semantic orientation method of Turney and Littman 2003 is a general, unsupervised (or lightly supervised?) method for building lexicons for any desired semantic dimension using VSMs.

The method relies on intuitive seed-sets and the basic distance measures that we use throughout our work with VSMs. Here's a summary of the method:

Let $X$ be a VSM with dimension $m \times n$ and vocabulary $V$. Let $I$ be the set of indices $\{1, 2, \ldots, m\}$.

For $i \in I$, let $X_{i}$ be the vector representation of $V_{i}$.

Define two seed-sets $S_{1} \subseteq I$ and $S_{2} \subseteq I$. They should have the same cardinality and be semantically opposing in some way that is appropriate for your matrix.

Pick a vector distance measure $\textbf{f}$.

For all $i \in I$:

$$\textbf{score}(X_{i}) =
\left(\sum_{j \in S_{1}} \textbf{f}(X_{i}, X_{j})\right) -
\left(\sum_{k \in S_{2}} \textbf{f}(X_{i}, X_{k})\right)$$

This method is implemented below as `semantic_orientation`. You can play around with making your own seed-sets and seeing what scores the method assigns. After that, the bake-off itself involves assessing your output against a multidimensional sentiment lexicon.
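To make the scoring rule concrete, here is a minimal sketch on a made-up VSM with four words and two dimensions. The matrix values, the single-word seed-sets, and the helper names `cosine_distance` and `so_score` are illustrative assumptions, not part of the course code.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of u and v."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical toy VSM (rows are word vectors); values are invented.
vocab = ['bad', 'good', 'awful', 'great']
X = np.array([
    [1.0, 0.0],   # bad
    [0.0, 1.0],   # good
    [0.9, 0.1],   # awful
    [0.1, 0.9]])  # great

S1 = [0]  # seed-set 1: index of 'bad'
S2 = [1]  # seed-set 2: index of 'good'

def so_score(i):
    """Summed distance to S1 minus summed distance to S2 for row i."""
    d1 = sum(cosine_distance(X[i], X[j]) for j in S1)
    d2 = sum(cosine_distance(X[i], X[k]) for k in S2)
    return d1 - d2

toy_scores = {w: so_score(i) for i, w in enumerate(vocab)}
for w, s in sorted(toy_scores.items(), key=lambda item: item[1]):
    print("{0:>6}: {1: 0.3f}".format(w, s))
```

Because the measure is a distance, words near $S_{1}$ get low (negative) scores and words near $S_{2}$ get high (positive) scores.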

In [2]:

```
from collections import defaultdict
import csv
import numpy as np
import os
import pandas as pd
from scipy.stats import pearsonr, spearmanr
import vsm
```

In [3]:

```
data_home = 'vsmdata'
```

In [4]:

```
def semantic_orientation(
        df,
        seeds1=('bad', 'nasty', 'poor', 'negative', 'unfortunate', 'wrong', 'inferior'),
        seeds2=('good', 'nice', 'excellent', 'positive', 'fortunate', 'correct', 'superior'),
        distfunc=vsm.cosine):
    """No frills implementation of the Semantic Orientation (SO) method of
    Turney and Littman. `seeds1` and `seeds2` should be representative members
    of two intuitively opposing semantic classes. The method will then try
    to rank the vocabulary by its relative association with each seed set.

    Parameters
    ----------
    df : pd.DataFrame
        The matrix used to derive the SO ranking.
    seeds1 : tuple of str
        The default is the negative seed set of Turney and Littman.
    seeds2 : tuple of str
        The default is the positive seed set of Turney and Littman.
    distfunc : function mapping vector pairs to floats (default: `cosine`)
        The measure of distance between vectors. Can also be `euclidean`,
        `matching`, `jaccard`, as well as any other distance measure
        between 1d vectors.

    Returns
    -------
    pd.Series
        The vocabulary sorted by SO score, highest first. The score is
        the summed distance to `seeds1` minus the summed distance to
        `seeds2`, so with a true distance measure such as `cosine`,
        words closest to `seeds2` are at the top and words closest to
        `seeds1` are at the bottom.

    """
    rownames = set(df.index)
    # Check that the seed sets are in the vocabulary, filtering
    # where necessary, and warn the user about exclusions:
    seeds1 = _value_check(seeds1, "seeds1", rownames)
    seeds2 = _value_check(seeds2, "seeds2", rownames)
    # Subframes for the two seed-sets:
    sm1 = df.loc[seeds1]
    sm2 = df.loc[seeds2]

    # Core semantic orientation calculation:
    def row_func(row):
        val1 = sm1.apply(lambda x: distfunc(row, x), axis=1).sum()
        val2 = sm2.apply(lambda x: distfunc(row, x), axis=1).sum()
        return val1 - val2

    scores = df.apply(row_func, axis=1)
    return scores.sort_values(ascending=False)


def _value_check(ss, name, rownames):
    # Collect into a list rather than a set: recent pandas versions
    # do not accept a set as a `.loc` indexer, and a list also
    # preserves the order of the seeds.
    new = []
    for w in ss:
        if w not in rownames:
            print("Warning: {} not in {}".format(w, name))
        else:
            new.append(w)
    return new
```
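The row-by-row `apply` calls above can be slow on a large vocabulary. As a hedged alternative, the sketch below computes the same quantity in bulk with `scipy.spatial.distance.cdist`. The function name `semantic_orientation_vectorized` and the toy matrix are assumptions made for illustration; this version only supports metrics that `cdist` knows by name (such as `'cosine'` or `'euclidean'`), not arbitrary functions from `vsm`.

```python
import pandas as pd
from scipy.spatial.distance import cdist

def semantic_orientation_vectorized(df, seeds1, seeds2, metric='cosine'):
    """Summed distance to seeds1 minus summed distance to seeds2,
    computed for all rows at once. Hypothetical helper, not course code."""
    # Silently drop seeds that are missing from the vocabulary.
    seeds1 = [w for w in seeds1 if w in df.index]
    seeds2 = [w for w in seeds2 if w in df.index]
    # cdist returns an (n_words x n_seeds) matrix of pairwise distances.
    d1 = cdist(df.values, df.loc[seeds1].values, metric=metric).sum(axis=1)
    d2 = cdist(df.values, df.loc[seeds2].values, metric=metric).sum(axis=1)
    scores = pd.Series(d1 - d2, index=df.index)
    return scores.sort_values(ascending=False)

# Invented count-like vectors for four words.
toy = pd.DataFrame(
    [[4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0],
     [3.0, 2.0, 0.0],
     [0.0, 2.0, 3.0]],
    index=['poor', 'superior', 'dull', 'brilliant'])

toy_so = semantic_orientation_vectorized(
    toy, seeds1=('poor',), seeds2=('superior',))
print(toy_so)
```

On this toy matrix the words near `'superior'` sort to the top and the words near `'poor'` sort to the bottom, which is the direction you should expect from any distance-based `distfunc`.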

In [5]:

```
imdb20 = pd.read_csv(
    os.path.join(data_home, 'imdb_window20-flat.csv.gz'), index_col=0)
```

In [6]:

```
imdb20_ppmi = vsm.pmi(imdb20)
```

In [7]:

```
imdb20_ppmi_so = semantic_orientation(imdb20_ppmi)
```

In [8]:

```
imdb20_ppmi_so.head()
```

Out[8]:

In [9]:

```
imdb20_ppmi_so.tail()
```

Out[9]:

Warriner et al. (2013) released a dataset called 'Norms of valence, arousal, and dominance for 13,915 English lemmas'. This is included in `vsmdata` as `Ratings_Warriner_et_al.csv`. The following code reads this file in and creates a DataFrame that gives just the overall means for these three semantic dimensions.

In [10]:

```
def load_warriner_lexicon(src_filename, df=None):
    """Read in 'Ratings_Warriner_et_al.csv' and optionally restrict its
    vocabulary to items in `df`.

    Parameters
    ----------
    src_filename : str
        Full path to 'Ratings_Warriner_et_al.csv'
    df : pd.DataFrame or None
        If this is given, then its index is intersected with the
        vocabulary from the lexicon, and we return a lexicon
        containing only values in both vocabularies.

    Returns
    -------
    pd.DataFrame

    """
    lexicon = pd.read_csv(src_filename, index_col=0)
    lexicon = lexicon[['Word', 'V.Mean.Sum', 'A.Mean.Sum', 'D.Mean.Sum']]
    lexicon = lexicon.set_index('Word').rename(
        columns={'V.Mean.Sum': 'Valence',
                 'A.Mean.Sum': 'Arousal',
                 'D.Mean.Sum': 'Dominance'})
    if df is not None:
        shared_vocab = sorted(set(lexicon.index) & set(df.index))
        lexicon = lexicon.loc[shared_vocab]
    return lexicon
```
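The sketch below walks through the same select, rename, and intersect steps on an in-memory CSV, so you can see the resulting shape without the real file. The rows are invented placeholders and are not actual Warriner ratings. The unnamed leading index column is an assumption inferred from the `index_col=0` call above.

```python
import io
import pandas as pd

# Made-up rows in the assumed column layout; NOT real Warriner ratings.
toy_csv = io.StringIO(
    ",Word,V.Mean.Sum,A.Mean.Sum,D.Mean.Sum\n"
    "1,alpha,6.0,4.0,5.0\n"
    "2,beta,3.0,5.0,4.0\n"
    "3,gamma,5.0,3.0,6.0\n")

toy_lexicon = pd.read_csv(toy_csv, index_col=0)
toy_lexicon = toy_lexicon[['Word', 'V.Mean.Sum', 'A.Mean.Sum', 'D.Mean.Sum']]
toy_lexicon = toy_lexicon.set_index('Word').rename(
    columns={'V.Mean.Sum': 'Valence',
             'A.Mean.Sum': 'Arousal',
             'D.Mean.Sum': 'Dominance'})

# Hypothetical VSM whose vocabulary only partly overlaps the lexicon.
toy_vsm = pd.DataFrame([[1, 0], [0, 1]], index=['beta', 'delta'])

# Keep only the words that appear in both vocabularies.
shared_vocab = sorted(set(toy_lexicon.index) & set(toy_vsm.index))
toy_restricted = toy_lexicon.loc[shared_vocab]
print(toy_restricted)
```

Only `'beta'` survives the intersection here, since `'delta'` is missing from the lexicon and `'alpha'` and `'gamma'` are missing from the VSM.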

In [11]:

```
lexicon = load_warriner_lexicon(
    os.path.join(data_home, 'Ratings_Warriner_et_al.csv'),
    imdb20_ppmi)
```

In [12]:

```
lexicon.head()
```

Out[12]:

We evaluate VSMs by the Pearson correlation coefficient between the scores delivered by `semantic_orientation` and the values in the Warriner et al. lexicon.

In [13]:

```
def evaluation(lexicon, so, colname='Valence', metric=pearsonr):
    lexicon['so'] = so
    rho, pvalue = metric(lexicon['so'], lexicon[colname])
    print("{0:}'s r: {1:0.3f}".format(metric.__name__, rho))
```
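Two details of `evaluation` are easy to miss: the assignment `lexicon['so'] = so` aligns the scores to the lexicon by index, and the `metric` argument can be swapped for `spearmanr`, which is imported above. Here is a small sketch of both on invented values. The word labels and numbers are placeholders, not lexicon data.

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Hypothetical gold ratings for four words (not real lexicon values).
toy_lexicon = pd.DataFrame(
    {'Valence': [2.0, 3.5, 6.0, 7.5]},
    index=['w1', 'w2', 'w3', 'w4'])

# Hypothetical SO scores, in a different order and with one extra word.
toy_so = pd.Series(
    [0.9, -1.0, 0.2, 0.4, -0.5],
    index=['w4', 'w1', 'w3', 'extra', 'w2'])

# Column assignment aligns on the index, so the order of `toy_so` does
# not matter, and 'extra' is dropped because it is not in the lexicon.
toy_lexicon['so'] = toy_so

# Pearson measures linear association; Spearman measures rank agreement.
r, r_pvalue = pearsonr(toy_lexicon['so'], toy_lexicon['Valence'])
rho, rho_pvalue = spearmanr(toy_lexicon['so'], toy_lexicon['Valence'])
print("pearsonr: {0:0.3f}; spearmanr: {1:0.3f}".format(r, rho))
```

In this toy case the ranking is perfectly monotonic, so the Spearman coefficient is 1 while the Pearson coefficient is slightly lower. Note that a word present in the lexicon but absent from the scores would produce a missing value in the `so` column.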

Here's a simple baseline: PPMI on `imdb20` as loaded above.

In [14]:

```
imdb20_ppmi_so = semantic_orientation(imdb20_ppmi)
```

In [15]:

```
evaluation(lexicon, imdb20_ppmi_so, colname='Valence')
```

In [16]:

```
evaluation(lexicon, imdb20_ppmi_so, colname='Arousal')
```

In [17]:

```
evaluation(lexicon, imdb20_ppmi_so, colname='Dominance')
```

- The name of the count matrix you started with (must be one in `vsmdata`).
- The seed-sets you used.
- A description of the steps you took to create your bake-off VSM – must be different from the above baseline.
- Your Pearson r values for 'Valence', 'Arousal', and 'Dominance'.