This word vector visualization tool was built to illustrate some of the properties word vectors have. The vectors themselves were encoded using the GloVe model by the good folks at Stanford (link to GloVe), and this script was heavily inspired by this page, taken from the materials of one of Stanford's courses.
In short, they trained the GloVe model on Wikipedia (2014) and Gigaword, producing vectors for a 400k-word vocabulary.
I have chosen to use their smallest vectors, with the fewest dimensions (50d), since the difference in performance was insignificant compared to the difference in file size. However, you are more than welcome to download the other files from the GloVe site (zip files) and play around with them.
I changed the functions to be slower and less efficient, but in doing so you can see inside them and gain a better understanding of how they work. I have also changed the plotting tools to pyplot and made some enhancements.
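If you do download one of the raw GloVe files, parsing it into the same word-to-vector dictionary used below is straightforward. Here is a minimal sketch; `load_glove` is a hypothetical helper (not part of this notebook), and the file name is whichever `.txt` you unzipped:

```python
import numpy as np

def load_glove(path):
    """Parse a raw GloVe text file (one word followed by floats per line)
    into a dict mapping word -> numpy vector."""
    word_dict = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word_dict[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return word_dict

# word_dict = load_glove('glove.6B.50d.txt')  # after unzipping glove.6B.zip
```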
# Get the interactive plotting tools for matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.graph_objects as go
import plotly.express as px
# Get tools to download files and load them
import pickle
import urllib.request
from os.path import exists as check_path
from os import makedirs
# Get tools to perform analysis
import numpy as np
from heapq import heappushpop
from sklearn.decomposition import PCA
The next cells define two helper functions:
- `download_files_from_github`
- `load_word2vecfiles`
def download_files_from_github(file_target_dir):
    main_url = 'https://raw.githubusercontent.com/Goussha/word-vector-visualization/master/'
    if not check_path(file_target_dir):
        makedirs(file_target_dir)
    urls = [main_url + 'file{}.p'.format(x) for x in range(1, 9)]
    file_names = [file_target_dir + 'file{}.p'.format(x) for x in range(1, 9)]
    for file_name, url in zip(file_names, urls):
        if not check_path(file_name):
            print('Downloading file: ', file_name)
            filename, headers = urllib.request.urlretrieve(url, filename=file_name)
        else:
            print('Already exists: {}'.format(file_name))
def load_word2vecfiles(file_target_dir):
    word_dict_loaded = {}
    for file_num in range(1, 9):
        full_file_name = file_target_dir + 'file{}.p'.format(file_num)
        print('Loading file: {}'.format(full_file_name))
        with open(full_file_name, 'rb') as fp:
            data = pickle.load(fp)
            word_dict_loaded.update(data)
    return word_dict_loaded
Run the next cell to download and load the files from my GitHub
(it should take about 15 seconds to download and load).
file_target_dir = "./tmp/"
#Download files
download_files_from_github(file_target_dir)
#Load files and create dict
word_dict = load_word2vecfiles(file_target_dir)
Downloading file: ./tmp/file1.p
Downloading file: ./tmp/file2.p
Downloading file: ./tmp/file3.p
Downloading file: ./tmp/file4.p
Downloading file: ./tmp/file5.p
Downloading file: ./tmp/file6.p
Downloading file: ./tmp/file7.p
Downloading file: ./tmp/file8.p
Loading file: ./tmp/file1.p
Loading file: ./tmp/file2.p
Loading file: ./tmp/file3.p
Loading file: ./tmp/file4.p
Loading file: ./tmp/file5.p
Loading file: ./tmp/file6.p
Loading file: ./tmp/file7.p
Loading file: ./tmp/file8.p
If you wish to check out the other word vector files, you can download them here (zip files). After downloading and unzipping, uncomment the next cell and run it (to be added in the future).
'''Not ready yet, to be added'''
Cosine similarity reflects the degree of similarity between two vectors:

cos_sim(u, v) = (u · v) / (‖u‖ ‖v‖)

As I mentioned, there are more efficient ways to compute this, but this way you are able to see exactly what is being calculated.
Run the next cell.
def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v
    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)
    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """
    epsilon = 1e-10  # Prevent dividing by 0
    # Compute the dot product between u and v
    dot = np.dot(u.T, v)
    # Compute the L2 norm of u
    norm_u = np.sqrt(np.sum(u ** 2))
    # Compute the L2 norm of v
    norm_v = np.sqrt(np.sum(v ** 2))
    # Compute the cosine similarity defined by the formula above
    cosine_similarity = dot / ((norm_u * norm_v) + epsilon)
    return cosine_similarity
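As a quick sanity check on the formula, here is a toy computation (using `np.linalg.norm` as a shortcut for the explicit square root of the sum of squares): identical vectors score about 1, orthogonal vectors 0, and opposite vectors about -1.

```python
import numpy as np

def cos_sim(u, v, eps=1e-10):
    # Same formula as above, written compactly
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
print(cos_sim(u, u))   # ~1.0 (identical)
print(cos_sim(u, v))   # 0.0  (orthogonal)
print(cos_sim(u, -u))  # ~-1.0 (opposite)
```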
A function that finds the words most similar to the input word, by calculating the cosine similarity between its word vector and all the other word vectors and returning the k most similar words.
def most_k_similar(word_in, word_dict, k=1):
    """
    most_k_similar finds the k most similar words
    Arguments:
        word_in -- a word in the corpus
        word_dict -- dictionary of word - word vector pairs
        k -- number of words to return
    Returns:
        tuple of the k most similar words
    """
    words = word_dict.keys()
    word_vec = word_dict[word_in]
    # Min-heap of (similarity, word) pairs; push-pop keeps the k best seen so far
    most_similars_heap = [(-100, '') for _ in range(k)]
    for w in words:
        if w == word_in:
            continue
        cosine_sim = cosine_similarity(word_vec, word_dict[w])
        heappushpop(most_similars_heap, (cosine_sim, w))
    _, best_words = zip(*most_similars_heap)
    return best_words
Takes a list of words and returns the word that doesn't match, by comparing the cosine similarity of each word with all the other words and returning the word with the lowest total score.
def doesnt_match(words, word_dict):
    dots_tot = []
    for w in words:
        dots = 0
        for w2 in words:
            if w2 == w:
                continue
            v = word_dict[w]
            u = word_dict[w2]
            dots = dots + cosine_similarity(v, u)
        dots_tot.append(dots)
    return words[np.argmin(dots_tot)]
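The scoring logic can be sanity-checked on tiny synthetic vectors. In this sketch, `odd_one_out` is a hypothetical compact rewrite of `doesnt_match` so the snippet stands alone: 'x' and 'y' point roughly the same way, while 'z' is orthogonal to both, so 'z' should get the lowest summed similarity.

```python
import numpy as np

toy_dict = {
    'x': np.array([1.0, 0.1]),
    'y': np.array([1.0, 0.2]),
    'z': np.array([0.0, 1.0]),
}

def cos_sim(u, v, eps=1e-10):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

def odd_one_out(words, word_dict):
    # For each word, sum its similarity to every other word; lowest total loses
    totals = [sum(cos_sim(word_dict[w], word_dict[w2])
                  for w2 in words if w2 != w) for w in words]
    return words[int(np.argmin(totals))]

print(odd_one_out(['x', 'y', 'z'], toy_dict))  # 'z'
```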
To find the analogy between words, this function subtracts one word vector from another, and then adds the difference to the vector of a third word. The difference between two word vectors represents the difference between their meanings, or the relationship between them, also known as their analogy. By adding this difference to a different word vector, you can find a fourth word that has the same relationship to word 3 as words 1 and 2 have to each other.
Meaning: man is to king as woman is to X (queen).
def complete_analogy(word_a, word_b, word_c, word_dict):
    """
    Performs the word analogy task as explained above: a is to b as c is to ____.
    Arguments:
        word_a -- a word, string
        word_b -- a word, string
        word_c -- a word, string
        word_dict -- dictionary that maps words to their corresponding vectors.
    Returns:
        best_word -- the word such that v_b - v_a is close to v_best_word - v_c, as measured by cosine similarity
    """
    # Convert words to lowercase
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    # Get the word embeddings e_a, e_b and e_c
    e_a, e_b, e_c = word_dict[word_a], word_dict[word_b], word_dict[word_c]
    words = word_dict.keys()
    max_cosine_sim = -100  # Initialize max_cosine_sim to a large negative number
    best_word = None       # Keeps track of the word to output
    # Place the input words in a set for faster membership tests than a list
    input_words_set = set([word_a, word_b, word_c])
    # Loop over the whole vocabulary
    for w in words:
        # To avoid best_word being one of the input words, skip them
        if w in input_words_set:
            continue
        # Compute cosine similarity between (e_b - e_a) and ((w's vector) - e_c)
        cosine_sim = cosine_similarity(e_b - e_a, word_dict[w] - e_c)
        # If cosine_sim beats the best seen so far, keep this word
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
    return best_word
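The vector arithmetic behind the analogy can be illustrated with hand-made 2-d vectors, where the first dimension loosely encodes "royalty" and the second "gender". This is a toy sketch of the idea, not the real GloVe geometry:

```python
import numpy as np

toy = {
    'man':   np.array([0.0, 1.0]),
    'woman': np.array([0.0, -1.0]),
    'king':  np.array([1.0, 1.0]),
    'queen': np.array([1.0, -1.0]),
    'apple': np.array([-1.0, 0.0]),
}

def cos_sim(u, v, eps=1e-10):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

# king - man + woman should land exactly on queen in this toy space
target = toy['king'] - toy['man'] + toy['woman']
best = max((w for w in toy if w not in ('man', 'king', 'woman')),
           key=lambda w: cos_sim(toy[w], target))
print(best)  # 'queen'
```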
Try replacing the words and running the cells again.
doesnt_match(['red','two','one','four'],word_dict)
'red'
doesnt_match(['red','one','blue','orange'],word_dict)
'one'
doesnt_match(['up','down','yes','back','front'],word_dict)
'yes'
doesnt_match(['big','small','huge'],word_dict)
'small'
doesnt_match(['big','small','tiny'],word_dict)
'big'
print(most_k_similar('small',word_dict,10))
('few', 'one', 'typically', 'mostly', 'usually', 'smaller', 'well', 'larger', 'tiny', 'large')
print(most_k_similar('god',word_dict,10))
('gods', 'true', 'jesus', 'sacred', 'christ', 'faith', 'allah', 'heaven', 'holy', 'divine')
a is to b as c is to X
complete_analogy('man', 'woman', 'actor', word_dict)
'actress'
complete_analogy('man', 'king', 'woman', word_dict)
'queen'
complete_analogy('japan', 'japanese', 'australia',word_dict)
'british'
complete_analogy('usa', 'obama', 'israel',word_dict)
'netanyahu'
complete_analogy('tall', 'tallest', 'long',word_dict)
'longest'
complete_analogy('good', 'fantastic', 'bad',word_dict)
'incredible'
complete_analogy('germany', 'berlin', 'israel',word_dict)
'jerusalem'
complete_analogy('germany', 'europe', 'israel',word_dict)
'asia'
complete_analogy('good', 'bad', 'up',word_dict)
'subprime'
Run the next cell.
The function takes a list of words and their vector representations. Since we can't plot 50-dimensional vectors, we must extract the most important features. Here we do it using PCA, but this can also be done with other methods such as t-SNE.
The plots are interactive, and you can zoom in and out or rotate the axes.
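The reduction step itself can be sketched in isolation: PCA maps each high-dimensional vector to its first few principal components. Here random arrays stand in for word vectors; three components drive the plot axes and a fourth could drive the color scale.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
vectors = rng.normal(size=(8, 50))          # 8 fake "words", 50 dimensions each
reduced = PCA(n_components=4).fit_transform(vectors)
print(reduced.shape)  # (8, 4)
```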
def display_pca_scatterplot_iplot(word_dict=word_dict, words=None, sample=0):
    if words is None:
        if sample > 0:
            words = np.random.choice(list(word_dict.keys()), sample)
        else:
            words = word_dict.keys()
    word_vectors = np.array([word_dict[w] for w in words])
    # Keep the first four principal components: three for the axes, one for the color scale
    fourdim = PCA().fit_transform(word_vectors)[:, :4]
    fig = px.scatter_3d(x=fourdim[:, 0], y=fourdim[:, 1], z=fourdim[:, 2],
                        color=fourdim[:, 3], text=words, size_max=60)
    # Alternative 2-D version:
    # fig = px.scatter(x=fourdim[:, 0], y=fourdim[:, 1], color=fourdim[:, 2], text=words, size_max=60)
    fig.update_traces({'textposition': 'top center', 'marker': {'showscale': False}})
    fig.update_layout(
        title_text='PCA',
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)',
        scene=dict(xaxis=dict(nticks=40, range=[-10, 10]),
                   yaxis=dict(nticks=40, range=[-10, 10]),
                   zaxis=dict(nticks=40, range=[-10, 10])))
    fig.show()
display_pca_scatterplot_iplot(word_dict,
['man','woman','king','queen','prince','princess','duke','duchess'])
display_pca_scatterplot_iplot(word_dict,
['germany','german','italy','italian','england','english','russia','russian'])
display_pca_scatterplot_iplot(word_dict,
['israel','jerusalem','italy','rome','france','paris','ireland','dublin'])
Family relations
display_pca_scatterplot_iplot(word_dict,
['son','father','daughter','mother','uncle','aunt','niece','nephew'])
display_pca_scatterplot_iplot(word_dict,
['son','father','daughter','mother','dad','mom','grandpa','grandma','grandfather','grandmother'])
display_pca_scatterplot_iplot(word_dict,
['coffee', 'tea', 'beer', 'wine', 'brandy', 'rum', 'champagne', 'water',
'spaghetti', 'borscht', 'hamburger', 'pizza', 'falafel', 'sushi', 'meatballs',
'dog', 'horse', 'cat', 'monkey', 'parrot', 'koala', 'lizard',
'frog', 'toad', 'monkey', 'ape', 'kangaroo', 'wombat', 'wolf',
'france', 'germany', 'hungary', 'luxembourg', 'australia', 'fiji', 'china',
'homework', 'assignment', 'problem', 'exam', 'test', 'class',
'school', 'college', 'university', 'institute','soda','fanta','cars'])
Pick a few words, find the 10 most similar words for each, and display them.
word_list=[]
word_list.append(most_k_similar('israel' ,word_dict,10))
word_list.append(most_k_similar('chair' ,word_dict,10))
word_list.append(most_k_similar('spain' ,word_dict,10))
word_list.append(most_k_similar('good' ,word_dict,10))
words = [i for sub in word_list for i in sub]
display_pca_scatterplot_iplot(word_dict, words)
Randomly pick 300 words and display their vectors
display_pca_scatterplot_iplot(word_dict, sample=300)