__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018 term"
This homework covers material from the unit on distributed representations. The primary goal is to explore some new techniques for building and assessing VSMs. The code you write as part of the assignment should be useful for research involving vector representations as well.
Like all homeworks, this should be submitted via Canvas. All you have to do is paste in your answers (which are all numerical values) and include the SUNetIds of anyone you worked with. Here's a direct link to the homework form:
https://canvas.stanford.edu/courses/83399/quizzes/50268
import numpy as np
import os
import pandas as pd
from mittens import GloVe
from scipy.stats import pearsonr
import vsm
First, implement Dice distance for real-valued vectors of dimension $n$, as

$$\textbf{dice}(u, v) = 1 - \frac{2\sum_{i=1}^{n}\min(u_{i}, v_{i})}{\sum_{i=1}^{n} u_{i} + v_{i}}$$

(You can use `vsm.matching` for part of this.)
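A minimal sketch of such an implementation in plain NumPy follows; it inlines the sum of elementwise minima (the quantity that `vsm.matching` is suggested for above) so the cell stands alone:

```python
import numpy as np

def dice(u, v):
    # Dice distance as defined above: 1 minus twice the elementwise
    # minimum overlap, normalized by the total mass of both vectors.
    return 1.0 - (2.0 * np.sum(np.minimum(u, v)) / np.sum(u + v))
```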
Second, you might want to test your implementation. Here's a simple function for that:
def test_dice_implementation(func):
    """`func` should be an implementation of `dice` as defined above."""
    X = np.array([
        [ 4.,  4.,  2.,  0.],
        [ 4., 61.,  8., 18.],
        [ 2.,  8., 10.,  0.],
        [ 0., 18.,  0.,  5.]])
    assert func(X[0], X[1]).round(5) == 0.80198
    assert func(X[1], X[2]).round(5) == 0.67568
Third, use your implementation to measure the distance between A and B and between B and C in the toy ABC
matrix we used in the first VSM notebook, repeated here for convenience.
ABC = pd.DataFrame([
    [ 2.0,  4.0],
    [10.0, 15.0],
    [14.0, 10.0]],
    index=['A', 'B', 'C'],
    columns=['x', 'y'])
ABC
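As a sketch of the mechanics (a standalone copy of a Dice implementation is inlined here so the cell runs on its own; the printed values are left for you to inspect):

```python
import numpy as np
import pandas as pd

def dice(u, v):
    # Standalone copy of the Dice distance defined earlier.
    return 1.0 - (2.0 * np.sum(np.minimum(u, v)) / np.sum(u + v))

ABC = pd.DataFrame([
    [ 2.0,  4.0],
    [10.0, 15.0],
    [14.0, 10.0]],
    index=['A', 'B', 'C'],
    columns=['x', 'y'])

ab = dice(ABC.loc['A'], ABC.loc['B'])
bc = dice(ABC.loc['B'], ABC.loc['C'])
print(ab, bc)
```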
To submit: the Dice distance between A and B, and the Dice distance between B and C.
(The real question, which these values answer, is whether this measure places A and B close together relative to B and C – our goal for that example.)
The t-test statistic can be thought of as a reweighting scheme. For a count matrix $X$, row index $i$, and column index $j$:

$$\textbf{ttest}(X, i, j) = \frac{P(X, i, j) - \big(P(X, i, *)P(X, *, j)\big)}{\sqrt{P(X, i, *)P(X, *, j)}}$$

where $P(X, i, j)$ is $X_{ij}$ divided by the total of all values in $X$, $P(X, i, *)$ is the sum of the values in row $i$ of $X$ divided by the total of all values in $X$, and $P(X, *, j)$ is the sum of the values in column $j$ of $X$ divided by the total of all values in $X$.
First, implement this reweighting scheme.
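One way to do this is with vectorized operations over the joint and marginal probabilities. A sketch, assuming the input is a `pd.DataFrame` of counts (and that no row or column is entirely zero):

```python
import numpy as np
import pandas as pd

def ttest(df):
    # T-test reweighting as defined above, applied to a count DataFrame.
    X = df.values
    P = X / X.sum()                       # joint: P(X, i, j)
    rows = P.sum(axis=1, keepdims=True)   # marginal: P(X, i, *)
    cols = P.sum(axis=0, keepdims=True)   # marginal: P(X, *, j)
    expected = rows * cols
    return pd.DataFrame(
        (P - expected) / np.sqrt(expected),
        index=df.index, columns=df.columns)
```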
Second, test your implementation:
def test_ttest_implementation(func):
    """`func` should be an implementation of ttest reweighting as defined above."""
    X = pd.DataFrame(np.array([
        [ 4.,  4.,  2.,  0.],
        [ 4., 61.,  8., 18.],
        [ 2.,  8., 10.,  0.],
        [ 0., 18.,  0.,  5.]]))
    actual = np.array([
        [ 0.33056, -0.07689,  0.04321, -0.10532],
        [-0.07689,  0.03839, -0.10874,  0.07574],
        [ 0.04321, -0.10874,  0.36111, -0.14894],
        [-0.10532,  0.07574, -0.14894,  0.05767]])
    predicted = func(X)
    assert np.array_equal(predicted.round(5), actual)
Third, apply your implementation to the matrix stored in `imdb_window5-scaled.csv.gz`.
To submit: the cell value for the row labeled superb and the column labeled movie.
(The goal here is really to obtain a working implementation of $\textbf{ttest}$. It could be an ingredient in a winning bake-off entry!)
We've seen that raw count matrices encode a lot of frequency information. This is not necessarily all bad (stronger words like superb will be rarer than weak ones like good in part because of their more specialized semantics), but we do hope that our reweighting schemes will get us away from these relatively mundane associations. Thus, for any reweighting scheme, we should ask about its correlation with the raw co-occurrence counts.
Your task: using `scipy.stats.pearsonr`, calculate the Pearson correlation coefficient between the raw count values of `imdb5` as loaded in the previous question and the values obtained from applying PMI and Positive PMI to this matrix, and from reweighting each row by its length norm (as defined in the first notebook for this unit; `vsm.length_norm`). Note: `X.values.ravel()` will give you the vector of values in the `pd.DataFrame` instance `X`.
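The mechanics look like the following sketch. It uses a small toy count matrix and simple stand-ins for the course's `vsm.pmi` and `vsm.length_norm` helpers, since `imdb5` isn't bundled with this document (all names here are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def pmi(df, positive=True):
    # Simple (Positive) PMI reweighting; a stand-in for the course's vsm.pmi.
    X = df.values
    total = X.sum()
    expected = X.sum(axis=1, keepdims=True) * X.sum(axis=0, keepdims=True) / total
    with np.errstate(divide='ignore', invalid='ignore'):
        vals = np.log(X / expected)
    vals[~np.isfinite(vals)] = 0.0          # map log(0) cells to 0
    if positive:
        vals = np.maximum(vals, 0.0)
    return pd.DataFrame(vals, index=df.index, columns=df.columns)

def length_norm(df):
    # Row-wise L2 normalization; a stand-in for vsm.length_norm.
    X = df.values
    return pd.DataFrame(
        X / np.linalg.norm(X, axis=1, keepdims=True),
        index=df.index, columns=df.columns)

toy = pd.DataFrame(np.array([
    [ 4.,  4.,  2.,  0.],
    [ 4., 61.,  8., 18.],
    [ 2.,  8., 10.,  0.],
    [ 0., 18.,  0.,  5.]]))

raw = toy.values.ravel()
rho_ppmi, _ = pearsonr(raw, pmi(toy).values.ravel())
rho_ln, _ = pearsonr(raw, length_norm(toy).values.ravel())
```

For the homework itself, swap `toy` for `imdb5` and the stand-ins for the real `vsm` functions.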
To submit: the three Pearson correlation coefficients (raw counts vs. PMI, vs. Positive PMI, and vs. length-normed rows).
(The hope is that seeing these values will give you a better sense for how these reweighting schemes compare to the input count matrices.)
We saw that GloVe can be thought of as seeking vectors whose dot products are proportional to their PMI values. How close does GloVe come to this in practice? This question asks you to conduct a simple empirical assessment of that:

1. Load the matrix in `imdb_window5-scaled.csv.gz` from the data distribution. Call this `imdb5`.
2. Reweight `imdb5` with Positive PMI.
3. Run GloVe on `imdb5` for 10 iterations, learning vectors of dimension 20 (`n=20`). Definitely use the implementation in the `mittens` package, not the one in `vsm.glove`, else this will take way too long. Except for `max_iter` and `n`
as above.One of the goals of subword modeling is to capture out-of-vocabulary (OOV) words. This is particularly important for expressive elogations like coooooool and booriiiing. Because the amount of elongation is highly variable, we're unlikely to have good representations for such words. How does our simple approach to subword modeling do with these phenomena?
Your task:

1. Use `vsm.ngram_vsm` to create a 4-gram character-level VSM from the matrix in `imdb_window20-flat.csv.gz`.
2. Using `character_level_rep` from the notebook for representing words in this space, calculate the cosine distance for the pair `cool` and `cooooool`.

To submit: the cosine distance between `cool` and `cooooool`.
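A sketch of the pipeline follows, with toy stand-ins for `vsm.ngram_vsm` and `character_level_rep` and an invented three-word vocabulary, since the IMDB matrix isn't included here (the `<w>`/`</w>` padding symbols are also an assumption):

```python
import numpy as np
import pandas as pd
from collections import defaultdict

def character_ngrams(w, n=4):
    # Character n-grams with word-boundary padding symbols.
    chars = ["<w>"] + list(w) + ["</w>"]
    return ["".join(chars[i:i+n]) for i in range(len(chars) - n + 1)]

def ngram_vsm(df, n=4):
    # Toy stand-in for vsm.ngram_vsm: each n-gram's row is the sum of
    # the rows of the vocabulary words that contain it.
    grams = defaultdict(lambda: np.zeros(df.shape[1]))
    for w, row in df.iterrows():
        for g in character_ngrams(w, n):
            grams[g] += row.values
    return pd.DataFrame(dict(grams)).T

def character_level_rep(word, cf, n=4):
    # Represent a word as the sum of its known character n-gram vectors.
    reps = [cf.loc[g].values for g in character_ngrams(word, n) if g in cf.index]
    return np.sum(reps, axis=0)

def cosine_distance(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Invented toy word-level VSM (3 words, 5 dimensions):
toy = pd.DataFrame(
    np.random.RandomState(0).rand(3, 5),
    index=["cool", "cooool", "fool"])

cf = ngram_vsm(toy, n=4)
d = cosine_distance(
    character_level_rep("cool", cf),
    character_level_rep("cooooool", cf))
```

Note that `cooooool` never appears in the toy vocabulary: its representation is assembled entirely from the character n-grams of words that do, which is the point of the subword approach.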
(Of course, the broader question we want to answer is whether these words are being modeled as similar, which is a more subjective, comparative question. It does depend on these distance calculations, though.)