import numpy as np
import os

glove_dir = 'D:/data/'

# Build a dictionary mapping each word to its 100-dimensional GloVe vector.
embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding="utf8")
for line in f:
    values = line.split()
    word = values[0]                                  # the first token on each line is the word
    coefs = np.asarray(values[1:], dtype='float32')   # the rest are the vector components
    embeddings_index[word] = coefs
f.close()
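As a quick sanity check (assuming the file above loaded correctly), you can print how many word vectors were read and confirm the embedding dimensionality:

# The glove.6B.100d file contains 400,000 words, each mapped to a 100-dimensional vector.
print("Found %s word vectors." % len(embeddings_index))
print("Embedding shape:", embeddings_index["the"].shape)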
To measure how similar two words are, we need a way to measure the degree of similarity between their embedding vectors. Given two vectors $u$ and $v$, cosine similarity is defined as follows:
$$\text{CosineSimilarity}(u, v) = \frac{u \cdot v}{||u||_2 \, ||v||_2} = \cos(\theta) \tag{1}$$

where $u \cdot v$ is the dot product (or inner product) of the two vectors, $||u||_2$ is the norm (or length) of the vector $u$, and $\theta$ is the angle between $u$ and $v$. This similarity depends on the angle between $u$ and $v$: if $u$ and $v$ are very similar, their cosine similarity will be close to 1; if they are dissimilar, it will take a smaller value.
Exercise: Implement a similarity() function to evaluate the cosine similarity between word vectors.
Reminder: The norm of $u$ is defined as $ ||u||_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$
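For reference, here is a minimal NumPy sketch that applies equation (1) directly; the function name cosine_similarity_np is our own, not part of the exercise:

def cosine_similarity_np(u, v):
    # Cosine similarity = (u . v) / (||u||_2 * ||v||_2), per equation (1).
    dot = np.dot(u, v)
    norm_u = np.sqrt(np.sum(u ** 2))
    norm_v = np.sqrt(np.sum(v ** 2))
    return dot / (norm_u * norm_v)

In practice we can simply wrap scikit-learn's implementation, as done below.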
from sklearn.metrics.pairwise import cosine_similarity

def similarity(u, v):
    # sklearn expects 2-D arrays, so reshape each vector to shape (1, n)
    # and squeeze the resulting 1x1 matrix back down to a scalar.
    return np.squeeze(cosine_similarity(u.reshape(1, -1), v.reshape(1, -1)))
father = embeddings_index["father"]
mother = embeddings_index["mother"]
ball = embeddings_index["ball"]
crocodile = embeddings_index["crocodile"]
france = embeddings_index["france"]
italy = embeddings_index["italy"]
paris = embeddings_index["paris"]
rome = embeddings_index["rome"]
print("cosine_similarity(father, mother) = ", similarity(father, mother))
print("cosine_similarity(ball, crocodile) = ",similarity(ball, crocodile))
print("cosine_similarity(france - paris, iran - tehran) = ",similarity(france - paris, rome - italy))
In the word analogy task, we complete the sentence "*a* is to *b* as *c* is to **____**". An example is '*man* is to *woman* as *king* is to *queen*'. More precisely, we are trying to find a word $d$ such that the associated word vectors $e_a, e_b, e_c, e_d$ are related in the following manner: $e_b - e_a \approx e_d - e_c$. We will measure the similarity between $e_b - e_a$ and $e_d - e_c$ using cosine similarity.
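As a quick sanity check of that relation (assuming 'man', 'woman', 'king', and 'queen' are all in the GloVe vocabulary, which they are for the 6B set), the two difference vectors should already be noticeably similar:

# e_b - e_a should point in roughly the same direction as e_d - e_c
man, woman = embeddings_index["man"], embeddings_index["woman"]
king, queen = embeddings_index["king"], embeddings_index["queen"]
print(similarity(woman - man, queen - king))   # expect a clearly positive value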
Exercise: Complete the code below to be able to perform word analogies!
embeddings_index["father"]
def complete_analogy(word_a, word_b, word_c, embeddings_index):
    """
    Performs the word analogy task as explained above: a is to b as c is to ____.

    Arguments:
    word_a -- a word, string
    word_b -- a word, string
    word_c -- a word, string
    embeddings_index -- dictionary that maps words to their corresponding vectors.

    Returns:
    best_word -- the word such that e_b - e_a is close to e_best_word - e_c, as measured by cosine similarity
    """
    # convert words to lower case
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()

    # Get the word embeddings e_a, e_b and e_c
    e_a, e_b, e_c = embeddings_index[word_a], embeddings_index[word_b], embeddings_index[word_c]

    words = embeddings_index.keys()
    max_cosine_sim = -100   # Initialize max_cosine_sim to a large negative number
    best_word = None        # Initialize best_word with None; it will keep track of the word to output

    # loop over the whole word vector set
    for w in words:
        # to avoid best_word being one of the input words, skip them
        if w in [word_a, word_b, word_c]:
            continue

        # Compute cosine similarity between the vector (e_b - e_a) and the vector ((w's vector representation) - e_c)
        cosine_sim = similarity(e_b - e_a, embeddings_index[w] - e_c)

        # If cosine_sim is greater than the max_cosine_sim seen so far,
        # set max_cosine_sim to the current cosine_sim and best_word to the current word
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w

    return best_word
Run the cells below to test your code; this may take 1-2 minutes.
complete_analogy('china', 'chinese', 'iran', embeddings_index)
complete_analogy('india', 'delhi', 'iran', embeddings_index)
complete_analogy('man', 'woman', 'boy', embeddings_index)
complete_analogy('small', 'smaller', 'big', embeddings_index)
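The loop above calls similarity() once per vocabulary word, which is why the search is slow. As an optimization sketch (not part of the original exercise, and assuming all vectors share the same dimension), the same search can be vectorized by stacking every embedding into one matrix and computing all cosine similarities in a single NumPy operation:

# Vectorized analogy search over the whole vocabulary at once.
words = list(embeddings_index.keys())
matrix = np.stack([embeddings_index[w] for w in words])        # shape: (vocab_size, 100)

def complete_analogy_fast(word_a, word_b, word_c, words=words, matrix=matrix):
    e_a, e_b, e_c = (embeddings_index[w.lower()] for w in (word_a, word_b, word_c))
    target = e_b - e_a                                          # direction we want to match
    diffs = matrix - e_c                                        # (w's vector) - e_c for every w
    sims = diffs @ target / (np.linalg.norm(diffs, axis=1) * np.linalg.norm(target) + 1e-8)
    # mask out the three input words so they cannot be returned
    for w in (word_a.lower(), word_b.lower(), word_c.lower()):
        sims[words.index(w)] = -np.inf
    return words[int(np.argmax(sims))]

complete_analogy_fast('small', 'smaller', 'big')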