In [ ]:

```
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018 term"
```

This homework covers material from the unit on distributed representations. The primary goal is to explore some new techniques for building and assessing VSMs. The code you write as part of the assignment should be useful for research involving vector representations as well.

Like all homeworks, this should be submitted via Canvas. All you have to do is paste in your answers (which are all numerical values) and include the SUNetIds of anyone you worked with. Here's a direct link to the homework form:

https://canvas.stanford.edu/courses/83399/quizzes/50268

**Contents**

In [ ]:

```
import numpy as np
import os
import pandas as pd
from mittens import GloVe
from scipy.stats import pearsonr
import vsm
```

First, implement Dice distance for real-valued vectors of dimension $n$, as

$$\textbf{dice}(u, v) = 1 - \frac{ 2 \sum_{i=1}^{n}\min(u_{i}, v_{i}) }{ \sum_{i=1}^{n} u_{i} + v_{i} }$$

(You can use `vsm.matching` for part of this.)
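As a reference point, here is one possible sketch that works directly from the formula above rather than via `vsm.matching` (whose exact interface isn't shown here); treat it as an illustration, not the required implementation:

```python
import numpy as np

def dice(u, v):
    """Dice distance between two real-valued vectors `u` and `v`,
    translated directly from the formula above."""
    # Twice the sum of elementwise minima, over the sum of all values:
    return 1.0 - (2.0 * np.minimum(u, v).sum()) / (u + v).sum()

u = np.array([4., 4., 2., 0.])
v = np.array([4., 61., 8., 18.])
print(round(dice(u, v), 5))  # 0.80198, matching the test below
```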

Second, you might want to test your implementation. Here's a simple function for that:

In [ ]:

```
def test_dice_implementation(func):
    """`func` should be an implementation of `dice` as defined above."""
    X = np.array([
        [ 4.,  4.,  2.,  0.],
        [ 4., 61.,  8., 18.],
        [ 2.,  8., 10.,  0.],
        [ 0., 18.,  0.,  5.]])
    assert func(X[0], X[1]).round(5) == 0.80198
    assert func(X[1], X[2]).round(5) == 0.67568
```

Third, use your implementation to measure the distance between A and B and between B and C in the toy `ABC` matrix we used in the first VSM notebook, repeated here for convenience.

In [ ]:

```
ABC = pd.DataFrame([
    [ 2.0,  4.0],
    [10.0, 15.0],
    [14.0, 10.0]],
    index=['A', 'B', 'C'],
    columns=['x', 'y'])
ABC
```

**To submit:**

- Dice distance between A and B.
- Dice distance between B and C.

(The real question, which these values answer, is whether this measure places A and B close together relative to B and C – our goal for that example.)

The t-test statistic can be thought of as a reweighting scheme. For a count matrix $X$, row index $i$, and column index $j$:

$$\textbf{ttest}(X, i, j) = \frac{ P(X, i, j) - \big(P(X, i, *)P(X, *, j)\big) }{ \sqrt{(P(X, i, *)P(X, *, j))} }$$

where $P(X, i, j)$ is $X_{ij}$ divided by the total values in $X$, $P(X, i, *)$ is the sum of the values in row $i$ of $X$ divided by the total values in $X$, and $P(X, *, j)$ is the sum of the values in column $j$ of $X$ divided by the total values in $X$.

First, implement this reweighting scheme.
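One possible sketch, again working directly from the definition (the handling of all-zero rows or columns is left aside here, since the formula's denominators assume nonzero marginals):

```python
import numpy as np
import pandas as pd

def ttest(df):
    """Sketch of t-test reweighting for a count matrix `df`
    (a `pd.DataFrame`), following the definition above."""
    X = df.values
    total = X.sum()
    P = X / total                       # P(X, i, j)
    P_rows = P.sum(axis=1)              # P(X, i, *)
    P_cols = P.sum(axis=0)              # P(X, *, j)
    expected = np.outer(P_rows, P_cols) # P(X, i, *) * P(X, *, j)
    rescaled = (P - expected) / np.sqrt(expected)
    return pd.DataFrame(rescaled, index=df.index, columns=df.columns)
```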

Second, test your implementation:

In [ ]:

```
def test_ttest_implementation(func):
    """`func` should be an implementation of ttest reweighting as defined above."""
    X = pd.DataFrame(np.array([
        [ 4.,  4.,  2.,  0.],
        [ 4., 61.,  8., 18.],
        [ 2.,  8., 10.,  0.],
        [ 0., 18.,  0.,  5.]]))
    actual = np.array([
        [ 0.33056, -0.07689,  0.04321, -0.10532],
        [-0.07689,  0.03839, -0.10874,  0.07574],
        [ 0.04321, -0.10874,  0.36111, -0.14894],
        [-0.10532,  0.07574, -0.14894,  0.05767]])
    predicted = func(X)
    assert np.array_equal(predicted.round(5), actual)
```

Third, apply your implementation to the matrix stored in `imdb_window5-scaled.csv.gz`.

**To submit**: the cell value for the row labeled *superb* and the column labeled *movie*.

(The goal here is really to obtain a working implementation of $\textbf{ttest}$. It could be an ingredient in a winning bake-off entry!)

We've seen that raw count matrices encode a lot of frequency information. This is not necessarily all bad (stronger words like *superb* will be rarer than weak ones like *good* in part because of their more specialized semantics), but we do hope that our reweighting schemes will get us away from these relatively mundane associations. Thus, for any reweighting scheme, we should ask about its correlation with the raw co-occurrence counts.

Your task: using `scipy.stats.pearsonr`, calculate the Pearson correlation coefficient between the raw count values of `imdb5` as loaded in the previous question and the values obtained from applying PMI and Positive PMI to this matrix, and from reweighting each row by its length norm (as defined in the first notebook for this unit; `vsm.length_norm`). Note: `X.values.ravel()` will give you the vector of values in the `pd.DataFrame` instance `X`.
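The pattern is the same for all three comparisons; here is a sketch on a toy matrix, using a plain NumPy stand-in for `vsm.length_norm` (each row divided by its Euclidean length) since the course's `vsm` module isn't loaded here:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Toy count matrix standing in for `imdb5`:
X = pd.DataFrame([
    [ 4.,  4.,  2.,  1.],
    [ 4., 61.,  8., 18.],
    [ 2.,  8., 10.,  1.]])

# Stand-in for `vsm.length_norm`: each row divided by its L2 length.
lengths = np.sqrt((X.values ** 2).sum(axis=1, keepdims=True))
X_norm = X.values / lengths

# Flatten both matrices into vectors of cell values and correlate them:
rho, p = pearsonr(X.values.ravel(), X_norm.ravel())
print(round(rho, 3))
```

The same two lines at the end apply unchanged to the PMI and Positive PMI comparisons, with the reweighted matrix in place of `X_norm`.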

**To submit:**

- Correlation coefficient for the PMI comparison.
- Correlation coefficient for the Positive PMI comparison.
- Correlation coefficient for the length-norm comparison.

(The hope is that seeing these values will give you a better sense for how these reweighting schemes compare to the input count matrices.)

We saw that GloVe can be thought of as seeking vectors whose dot products are proportional to their PMI values. How close does GloVe come to this in practice? This question asks you to conduct a simple empirical assessment of that:

- Load the matrix stored as `imdb_window5-scaled.csv.gz` in the data distribution. Call this `imdb5`.
- Reweight `imdb5` with Positive PMI.
- Run GloVe on `imdb5` for 10 iterations, learning vectors of dimension 20 (`n=20`). Definitely use the implementation in the `mittens` package, not in `vsm.glove`, else this will take way too long. Except for `max_iter` and `n`, use all the default parameters.
- Report the correlation between the cell values in the PMI and GloVe versions. For this, you can include all 0 values (even though GloVe ignores them). Use `pearsonr` as above.

One of the goals of subword modeling is to capture out-of-vocabulary (OOV) words. This is particularly important for **expressive elongations** like *coooooool* and *booriiiing*. Because the amount of elongation is highly variable, we're unlikely to have good representations for such words. How does our simple approach to subword modeling do with these phenomena?
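To see why a character-level VSM can help here, note that an elongated form still shares character n-grams with its base form. A minimal sketch of n-gram extraction (the `<w>`/`</w>` boundary symbols are an assumed convention for illustration and may differ from what `vsm.ngram_vsm` uses internally):

```python
def char_ngrams(word, n=4):
    """Set of character n-grams of `word`, padded with boundary
    symbols (`<w>`/`</w>` is an assumed convention here)."""
    chars = ['<w>'] + list(word) + ['</w>']
    return {tuple(chars[i:i + n]) for i in range(len(chars) - n + 1)}

# The elongated form shares its edge 4-grams with the base form:
shared = char_ngrams('cool') & char_ngrams('cooooool')
print(sorted(shared))  # the word-initial and word-final 4-grams
```

Because those shared n-grams anchor both words to the same region of the n-gram space, pooling n-gram vectors can give even unseen elongations a reasonable representation.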

**Your task:**

- Use `vsm.ngram_vsm` to create a 4-gram character-level VSM from the matrix in `imdb_window20-flat.csv.gz`.
- Using `character_level_rep` from the notebook for representing words in this space, calculate the cosine distance for the pair `cool` and `cooooool`.

**To submit**: the cosine distance between `cool` and `cooooool`.

(Of course, the broader question we want to answer is whether these words are being modeled as similar, which is a more subjective, comparative question. It does depend on these distance calculations, though.)