__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018 term"
This homework covers material from the unit on distributed representations. The primary goal is to explore some new techniques for building and assessing VSMs. The code you write as part of the assignment should be useful for research involving vector representations as well.
Like all homeworks, this should be submitted via Canvas. All you have to do is paste in your answers (which are all numerical values) and include the SUNetIds of anyone you worked with. Here's a direct link to the homework form:
https://canvas.stanford.edu/courses/83399/quizzes/50268
import numpy as np
import os
import pandas as pd
from mittens import GloVe
from scipy.stats import pearsonr
import vsm
First, implement Dice distance for real-valued vectors of dimension $n$, as

$$\textbf{dice}(u, v) = 1 - \frac{2\sum_{i=1}^{n}\min(u_{i}, v_{i})}{\sum_{i=1}^{n} u_{i} + v_{i}}$$

(You can use `vsm.matching` for part of this.)
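A minimal sketch of such an implementation in plain NumPy follows; it inlines the sum of elementwise minima (the quantity that `vsm.matching` is suggested for above) so the cell stands alone:

```python
import numpy as np

def dice(u, v):
    # Dice distance as defined above: 1 minus twice the elementwise
    # minimum overlap, normalized by the total mass of both vectors.
    return 1.0 - (2.0 * np.sum(np.minimum(u, v)) / np.sum(u + v))
```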
Second, you might want to test your implementation. Here's a simple function for that:
def test_dice_implementation(func):
    """`func` should be an implementation of `dice` as defined above."""
    X = np.array([
        [ 4.,  4.,  2.,  0.],
        [ 4., 61.,  8., 18.],
        [ 2.,  8., 10.,  0.],
        [ 0., 18.,  0.,  5.]])
    assert func(X[0], X[1]).round(5) == 0.80198
    assert func(X[1], X[2]).round(5) == 0.67568
Third, use your implementation to measure the distance between A and B and between B and C in the toy ABC
matrix we used in the first VSM notebook, repeated here for convenience.
ABC = pd.DataFrame([
    [ 2.0,  4.0],
    [10.0, 15.0],
    [14.0, 10.0]],
    index=['A', 'B', 'C'],
    columns=['x', 'y'])
ABC
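As a sketch of the mechanics (a standalone copy of a Dice implementation is inlined here so the cell runs on its own; the printed values are left for you to inspect):

```python
import numpy as np
import pandas as pd

def dice(u, v):
    # Standalone copy of the Dice distance defined earlier.
    return 1.0 - (2.0 * np.sum(np.minimum(u, v)) / np.sum(u + v))

ABC = pd.DataFrame([
    [ 2.0,  4.0],
    [10.0, 15.0],
    [14.0, 10.0]],
    index=['A', 'B', 'C'],
    columns=['x', 'y'])

ab = dice(ABC.loc['A'], ABC.loc['B'])
bc = dice(ABC.loc['B'], ABC.loc['C'])
print(ab, bc)
```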
To submit: the Dice distance between A and B, and the Dice distance between B and C.
(The real question, which these values answer, is whether this measure places A and B close together relative to B and C – our goal for that example.)
The t-test statistic can be thought of as a reweighting scheme. For a count matrix $X$, row index $i$, and column index $j$:

$$\textbf{ttest}(X, i, j) = \frac{P(X, i, j) - \big(P(X, i, *)P(X, *, j)\big)}{\sqrt{P(X, i, *)P(X, *, j)}}$$

where $P(X, i, j)$ is $X_{ij}$ divided by the total of all values in $X$, $P(X, i, *)$ is the sum of the values in row $i$ of $X$ divided by the total of all values in $X$, and $P(X, *, j)$ is the sum of the values in column $j$ of $X$ divided by the total of all values in $X$.
First, implement this reweighting scheme.
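One way to do this is with vectorized operations over the joint and marginal probabilities. A sketch, assuming the input is a `pd.DataFrame` of counts (and that no row or column is entirely zero):

```python
import numpy as np
import pandas as pd

def ttest(df):
    # T-test reweighting as defined above, applied to a count DataFrame.
    X = df.values
    P = X / X.sum()                       # joint: P(X, i, j)
    rows = P.sum(axis=1, keepdims=True)   # marginal: P(X, i, *)
    cols = P.sum(axis=0, keepdims=True)   # marginal: P(X, *, j)
    expected = rows * cols
    return pd.DataFrame(
        (P - expected) / np.sqrt(expected),
        index=df.index, columns=df.columns)
```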
Second, test your implementation:
def test_ttest_implementation(func):
    """`func` should be an implementation of ttest reweighting as defined above."""
    X = pd.DataFrame(np.array([
        [ 4.,  4.,  2.,  0.],
        [ 4., 61.,  8., 18.],
        [ 2.,  8., 10.,  0.],
        [ 0., 18.,  0.,  5.]]))
    actual = np.array([
        [ 0.33056, -0.07689,  0.04321, -0.10532],
        [-0.07689,  0.03839, -0.10874,  0.07574],
        [ 0.04321, -0.10874,  0.36111, -0.14894],
        [-0.10532,  0.07574, -0.14894,  0.05767]])
    predicted = func(X)
    assert np.array_equal(predicted.round(5), actual)
Third, apply your implementation to the matrix stored in `imdb_window5-scaled.csv.gz`.
To submit: the cell value for the row labeled superb and the column labeled movie.
(The goal here is really to obtain a working implementation of $\textbf{ttest}$. It could be an ingredient in a winning bake-off entry!)
We've seen that raw count matrices encode a lot of frequency information. This is not necessarily all bad (stronger words like superb will be rarer than weak ones like good in part because of their more specialized semantics), but we do hope that our reweighting schemes will get us away from these relatively mundane associations. Thus, for any reweighting scheme, we should ask about its correlation with the raw co-occurrence counts.
Your task: using `scipy.stats.pearsonr`, calculate the Pearson correlation coefficient between the raw count values of `imdb5` as loaded in the previous question and the values obtained from applying PMI and Positive PMI to this matrix, and from reweighting each row by its length norm (as defined in the first notebook for this unit; `vsm.length_norm`). Note: `X.values.ravel()` will give you the vector of values in the `pd.DataFrame` instance `X`.
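The mechanics look like the following sketch. It uses a small toy count matrix and simple stand-ins for the course's `vsm.pmi` and `vsm.length_norm` helpers, since `imdb5` isn't bundled with this document (all names here are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def pmi(df, positive=True):
    # Simple (Positive) PMI reweighting; a stand-in for the course's vsm.pmi.
    X = df.values
    total = X.sum()
    expected = X.sum(axis=1, keepdims=True) * X.sum(axis=0, keepdims=True) / total
    with np.errstate(divide='ignore', invalid='ignore'):
        vals = np.log(X / expected)
    vals[~np.isfinite(vals)] = 0.0          # map log(0) cells to 0
    if positive:
        vals = np.maximum(vals, 0.0)
    return pd.DataFrame(vals, index=df.index, columns=df.columns)

def length_norm(df):
    # Row-wise L2 normalization; a stand-in for vsm.length_norm.
    X = df.values
    return pd.DataFrame(
        X / np.linalg.norm(X, axis=1, keepdims=True),
        index=df.index, columns=df.columns)

toy = pd.DataFrame(np.array([
    [ 4.,  4.,  2.,  0.],
    [ 4., 61.,  8., 18.],
    [ 2.,  8., 10.,  0.],
    [ 0., 18.,  0.,  5.]]))

raw = toy.values.ravel()
rho_ppmi, _ = pearsonr(raw, pmi(toy).values.ravel())
rho_ln, _ = pearsonr(raw, length_norm(toy).values.ravel())
```

For the homework itself, swap `toy` for `imdb5` and the stand-ins for the real `vsm` functions.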
To submit: the three Pearson correlation coefficients (raw counts vs. PMI, vs. Positive PMI, and vs. length-normed rows).
(The hope is that seeing these values will give you a better sense for how these reweighting schemes compare to the input count matrices.)
We saw that GloVe can be thought of as seeking vectors whose dot products are proportional to their PMI values. How close does GloVe come to this in practice? This question asks you to conduct a simple empirical assessment of that:

1. Load the matrix in `imdb_window5-scaled.csv.gz` from the data distribution. Call this `imdb5`.
2. Reweight `imdb5` with Positive PMI.
3. Run GloVe on `imdb5` for 10 iterations, learning vectors of dimension 20 (`n=20`). Definitely use the implementation in the `mittens` package, not the one in `vsm.glove`, else this will take way too long. Except for `max_iter` and `n`
as above.One of the goals of subword modeling is to capture out-of-vocabulary (OOV) words. This is particularly important for expressive elogations like coooooool and booriiiing. Because the amount of elongation is highly variable, we're unlikely to have good representations for such words. How does our simple approach to subword modeling do with these phenomena?
Your task:

1. Use `vsm.ngram_vsm` to create a 4-gram character-level VSM from the matrix in `imdb_window20-flat.csv.gz`.
2. Using `character_level_rep` from the notebook for representing words in this space, calculate the cosine distance for the pair `cool` and `cooooool`.

To submit: the cosine distance between `cool` and `cooooool`.
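A sketch of the pipeline follows, with toy stand-ins for `vsm.ngram_vsm` and `character_level_rep` and an invented three-word vocabulary, since the IMDB matrix isn't included here (the `<w>`/`</w>` padding symbols are also an assumption):

```python
import numpy as np
import pandas as pd
from collections import defaultdict

def character_ngrams(w, n=4):
    # Character n-grams with word-boundary padding symbols.
    chars = ["<w>"] + list(w) + ["</w>"]
    return ["".join(chars[i:i+n]) for i in range(len(chars) - n + 1)]

def ngram_vsm(df, n=4):
    # Toy stand-in for vsm.ngram_vsm: each n-gram's row is the sum of
    # the rows of the vocabulary words that contain it.
    grams = defaultdict(lambda: np.zeros(df.shape[1]))
    for w, row in df.iterrows():
        for g in character_ngrams(w, n):
            grams[g] += row.values
    return pd.DataFrame(dict(grams)).T

def character_level_rep(word, cf, n=4):
    # Represent a word as the sum of its known character n-gram vectors.
    reps = [cf.loc[g].values for g in character_ngrams(word, n) if g in cf.index]
    return np.sum(reps, axis=0)

def cosine_distance(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Invented toy word-level VSM (3 words, 5 dimensions):
toy = pd.DataFrame(
    np.random.RandomState(0).rand(3, 5),
    index=["cool", "cooool", "fool"])

cf = ngram_vsm(toy, n=4)
d = cosine_distance(
    character_level_rep("cool", cf),
    character_level_rep("cooooool", cf))
```

Note that `cooooool` never appears in the toy vocabulary: its representation is assembled entirely from the character n-grams of words that do, which is the point of the subword approach.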
(Of course, the broader question we want to answer is whether these words are being modeled as similar, which is a more subjective, comparative question. It does depend on these distance calculations, though.)