This word vector visualization tool was built to illustrate some of the properties word vectors have. The vectors themselves were encoded using the GloVe model by the good folks at Stanford (link to GloVe), and this script was heavily inspired by this page, taken from the materials of one of Stanford's courses.
In short, they trained the GloVe model on Wikipedia (2014) and Gigaword, producing vectors for a 400k-word vocabulary.
I have chosen to use their smallest vectors, with the fewest dimensions (50d), since the difference in performance was insignificant compared to the difference in file size. However, you are more than welcome to download the other files from the GloVe site (zip files) and play around with them.
I changed the functions to be slower and less efficient, but in doing so you can see inside them and gain a better understanding of how they work. I have also changed the plotting tools to pyplot and made some enhancements.
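If you do download one of the raw GloVe files, parsing it into the same word-to-vector dictionary used below is straightforward. Here is a minimal sketch; `load_glove` is a hypothetical helper (not part of this notebook), and the file name is whichever `.txt` you unzipped:

```python
import numpy as np

def load_glove(path):
    """Parse a raw GloVe text file (one word followed by floats per line)
    into a dict mapping word -> numpy vector."""
    word_dict = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word_dict[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return word_dict

# word_dict = load_glove('glove.6B.50d.txt')  # after unzipping glove.6B.zip
```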
# Get the interactive plotting tools for matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.graph_objects as go
import plotly.express as px
# Get tools to download files and load them
import pickle
import urllib.request
from os.path import exists as check_path
from os import makedirs
# Get tools to perform analysis
import numpy as np
from heapq import heappushpop
from sklearn.decomposition import PCA
The next cells define two helper functions:
- `download_files_from_github`
- `load_word2vecfiles`
def download_files_from_github(file_target_dir):
    main_url = 'https://raw.githubusercontent.com/Goussha/word-vector-visualization/master/'
    if not check_path(file_target_dir):
        makedirs(file_target_dir)
    urls = [main_url + 'file{}.p'.format(x) for x in range(1, 9)]
    file_names = [file_target_dir + 'file{}.p'.format(x) for x in range(1, 9)]
    for file_name, url in zip(file_names, urls):
        if not check_path(file_name):
            print('Downloading file: ', file_name)
            filename, headers = urllib.request.urlretrieve(url, filename=file_name)
        else:
            print('Already exists: {}'.format(file_name))
def load_word2vecfiles(file_target_dir):
    word_dict_loaded = {}
    for file_num in range(1, 9):
        full_file_name = file_target_dir + 'file{}.p'.format(file_num)
        print('Loading file: {}'.format(full_file_name))
        with open(full_file_name, 'rb') as fp:
            data = pickle.load(fp)
            word_dict_loaded.update(data)
    return word_dict_loaded
Run the next cell to download and load the files from my GitHub
(it should take about 15 seconds to download and load).
file_target_dir = "./tmp/"
#Download files
download_files_from_github(file_target_dir)
#Load files and create dict
word_dict = load_word2vecfiles(file_target_dir)
Downloading file: ./tmp/file1.p
Downloading file: ./tmp/file2.p
Downloading file: ./tmp/file3.p
Downloading file: ./tmp/file4.p
Downloading file: ./tmp/file5.p
Downloading file: ./tmp/file6.p
Downloading file: ./tmp/file7.p
Downloading file: ./tmp/file8.p
Loading file: ./tmp/file1.p
Loading file: ./tmp/file2.p
Loading file: ./tmp/file3.p
Loading file: ./tmp/file4.p
Loading file: ./tmp/file5.p
Loading file: ./tmp/file6.p
Loading file: ./tmp/file7.p
Loading file: ./tmp/file8.p
If you wish to check out the other word vector files, you can download them here (zip files). After downloading and unzipping, uncomment the next cell and run it (to be added in the future).
'''Not ready yet, to be added'''
Cosine similarity reflects the degree of similarity between two vectors:

cos_sim(u, v) = (u · v) / (‖u‖ ‖v‖)

As I mentioned, there are more efficient ways to compute this, but this way you are able to see exactly what is being calculated.
Run the next cell.
def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v
    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)
    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """
    epsilon = 1e-10  # Prevent dividing by 0
    # Compute the dot product between u and v
    dot = np.dot(u.T, v)
    # Compute the L2 norm of u
    norm_u = np.sqrt(np.sum(u ** 2))
    # Compute the L2 norm of v
    norm_v = np.sqrt(np.sum(v ** 2))
    # Compute the cosine similarity defined by the formula above
    cosine_similarity = dot / ((norm_u * norm_v) + epsilon)
    return cosine_similarity
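As a quick sanity check on the formula, here is a toy computation (using `np.linalg.norm` as a shortcut for the explicit square root of the sum of squares): identical vectors score about 1, orthogonal vectors 0, and opposite vectors about -1.

```python
import numpy as np

def cos_sim(u, v, eps=1e-10):
    # Same formula as above, written compactly
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
print(cos_sim(u, u))   # ~1.0 (identical)
print(cos_sim(u, v))   # 0.0  (orthogonal)
print(cos_sim(u, -u))  # ~-1.0 (opposite)
```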
A function that finds the words most similar to the input word, by calculating the cosine similarity between its word vector and all the other word vectors and returning the k most similar words.
def most_k_similar(word_in, word_dict, k=1):
    """
    most_k_similar finds the k most similar words
    Arguments:
        word_in -- a word in the corpus
        word_dict -- dictionary of word - word vector pairs
        k -- number of words to return
    Returns:
        tuple of the k most similar words
    """
    words = word_dict.keys()
    word_vec = word_dict[word_in]
    # Min-heap of (similarity, word) pairs; push-pop keeps the k best seen so far
    most_similars_heap = [(-100, '') for _ in range(k)]
    for w in words:
        if w == word_in:
            continue
        cosine_sim = cosine_similarity(word_vec, word_dict[w])
        heappushpop(most_similars_heap, (cosine_sim, w))
    _, best_words = zip(*most_similars_heap)
    return best_words
Takes a list of words and returns the word that doesn't match, by comparing the cosine similarity of each word with all the other words and returning the word with the lowest total score.
def doesnt_match(words, word_dict):
    dots_tot = []
    for w in words:
        dots = 0
        for w2 in words:
            if w2 == w:
                continue
            v = word_dict[w]
            u = word_dict[w2]
            dots = dots + cosine_similarity(v, u)
        dots_tot.append(dots)
    return words[np.argmin(dots_tot)]
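The scoring logic can be sanity-checked on tiny synthetic vectors. In this sketch, `odd_one_out` is a hypothetical compact rewrite of `doesnt_match` so the snippet stands alone: 'x' and 'y' point roughly the same way, while 'z' is orthogonal to both, so 'z' should get the lowest summed similarity.

```python
import numpy as np

toy_dict = {
    'x': np.array([1.0, 0.1]),
    'y': np.array([1.0, 0.2]),
    'z': np.array([0.0, 1.0]),
}

def cos_sim(u, v, eps=1e-10):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

def odd_one_out(words, word_dict):
    # For each word, sum its similarity to every other word; lowest total loses
    totals = [sum(cos_sim(word_dict[w], word_dict[w2])
                  for w2 in words if w2 != w) for w in words]
    return words[int(np.argmin(totals))]

print(odd_one_out(['x', 'y', 'z'], toy_dict))  # 'z'
```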
To find the analogy between words, this function subtracts one word vector from another, and then adds the difference to the vector of a third word. The difference between two word vectors represents the difference between their meanings, or the relationship between them, also known as their analogy. By adding this difference to a different word vector, you can find a fourth word that has the same relationship to word 3 as words 1 and 2 have to each other.
Meaning: man is to king as woman is to X (queen).
def complete_analogy(word_a, word_b, word_c, word_dict):
    """
    Performs the word analogy task as explained above: a is to b as c is to ____.
    Arguments:
        word_a -- a word, string
        word_b -- a word, string
        word_c -- a word, string
        word_dict -- dictionary that maps words to their corresponding vectors.
    Returns:
        best_word -- the word such that v_b - v_a is close to v_best_word - v_c, as measured by cosine similarity
    """
    # Convert words to lowercase
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    # Get the word embeddings e_a, e_b and e_c
    e_a, e_b, e_c = word_dict[word_a], word_dict[word_b], word_dict[word_c]
    words = word_dict.keys()
    max_cosine_sim = -100  # Initialize max_cosine_sim to a large negative number
    best_word = None       # Keeps track of the word to output
    # Place the input words in a set for faster membership tests than a list
    input_words_set = set([word_a, word_b, word_c])
    # Loop over the whole vocabulary
    for w in words:
        # To avoid best_word being one of the input words, skip them
        if w in input_words_set:
            continue
        # Compute cosine similarity between (e_b - e_a) and ((w's vector) - e_c)
        cosine_sim = cosine_similarity(e_b - e_a, word_dict[w] - e_c)
        # If cosine_sim beats the best seen so far, keep this word
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
    return best_word
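The vector arithmetic behind the analogy can be illustrated with hand-made 2-d vectors, where the first dimension loosely encodes "royalty" and the second "gender". This is a toy sketch of the idea, not the real GloVe geometry:

```python
import numpy as np

toy = {
    'man':   np.array([0.0, 1.0]),
    'woman': np.array([0.0, -1.0]),
    'king':  np.array([1.0, 1.0]),
    'queen': np.array([1.0, -1.0]),
    'apple': np.array([-1.0, 0.0]),
}

def cos_sim(u, v, eps=1e-10):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

# king - man + woman should land exactly on queen in this toy space
target = toy['king'] - toy['man'] + toy['woman']
best = max((w for w in toy if w not in ('man', 'king', 'woman')),
           key=lambda w: cos_sim(toy[w], target))
print(best)  # 'queen'
```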
Try replacing the words and running the cells again.
doesnt_match(['red','two','one','four'],word_dict)
'red'
doesnt_match(['red','one','blue','orange'],word_dict)
'one'
doesnt_match(['up','down','yes','back','front'],word_dict)
'yes'
doesnt_match(['big','small','huge'],word_dict)
'small'
doesnt_match(['big','small','tiny'],word_dict)
'big'
print(most_k_similar('small',word_dict,10))
('few', 'one', 'typically', 'mostly', 'usually', 'smaller', 'well', 'larger', 'tiny', 'large')
print(most_k_similar('god',word_dict,10))
('gods', 'true', 'jesus', 'sacred', 'christ', 'faith', 'allah', 'heaven', 'holy', 'divine')
a is to b as c is to X
complete_analogy('man', 'woman', 'actor', word_dict)
'actress'
complete_analogy('man', 'king', 'woman', word_dict)
'queen'
complete_analogy('japan', 'japanese', 'australia',word_dict)
'british'
complete_analogy('usa', 'obama', 'israel',word_dict)
'netanyahu'
complete_analogy('tall', 'tallest', 'long',word_dict)
'longest'
complete_analogy('good', 'fantastic', 'bad',word_dict)
'incredible'
complete_analogy('germany', 'berlin', 'israel',word_dict)
'jerusalem'
complete_analogy('germany', 'europe', 'israel',word_dict)
'asia'
complete_analogy('good', 'bad', 'up',word_dict)
'subprime'
Run the next cell.
The function takes a list of words and their vector representations. Since we can't plot 50-dimensional vectors, we must extract the most important features. Here we do it using PCA, but this can also be done with other methods such as t-SNE.
The plots are interactive, and you can zoom in and out or rotate the axes.
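The reduction step itself can be sketched in isolation: PCA maps each high-dimensional vector to its first few principal components. Here random arrays stand in for word vectors; three components drive the plot axes and a fourth could drive the color scale.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
vectors = rng.normal(size=(8, 50))          # 8 fake "words", 50 dimensions each
reduced = PCA(n_components=4).fit_transform(vectors)
print(reduced.shape)  # (8, 4)
```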
def display_pca_scatterplot_iplot(word_dict=word_dict, words=None, sample=0):
    if words is None:
        if sample > 0:
            words = np.random.choice(list(word_dict.keys()), sample)
        else:
            words = word_dict.keys()
    word_vectors = np.array([word_dict[w] for w in words])
    # Keep the first four principal components: three for the axes, one for the color scale
    fourdim = PCA().fit_transform(word_vectors)[:, :4]
    fig = px.scatter_3d(x=fourdim[:, 0], y=fourdim[:, 1], z=fourdim[:, 2],
                        color=fourdim[:, 3], text=words, size_max=60)
    # Alternative 2-D version:
    # fig = px.scatter(x=fourdim[:, 0], y=fourdim[:, 1], color=fourdim[:, 2], text=words, size_max=60)
    fig.update_traces({'textposition': 'top center', 'marker': {'showscale': False}})
    fig.update_layout(
        title_text='PCA',
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)',
        scene=dict(xaxis=dict(nticks=40, range=[-10, 10]),
                   yaxis=dict(nticks=40, range=[-10, 10]),
                   zaxis=dict(nticks=40, range=[-10, 10])))
    fig.show()
display_pca_scatterplot_iplot(word_dict,
['man','woman','king','queen','prince','princess','duke','duchess'])
display_pca_scatterplot_iplot(word_dict,
['germany','german','italy','italian','england','english','russia','russian'])
display_pca_scatterplot_iplot(word_dict,
['israel','jerusalem','italy','rome','france','paris','ireland','dublin'])
Family relations
display_pca_scatterplot_iplot(word_dict,
['son','father','daughter','mother','uncle','aunt','niece','nephew'])
display_pca_scatterplot_iplot(word_dict,
['son','father','daughter','mother','dad','mom','grandpa','grandma','grandfather','grandmother'])
display_pca_scatterplot_iplot(word_dict,
['coffee', 'tea', 'beer', 'wine', 'brandy', 'rum', 'champagne', 'water',
'spaghetti', 'borscht', 'hamburger', 'pizza', 'falafel', 'sushi', 'meatballs',
'dog', 'horse', 'cat', 'monkey', 'parrot', 'koala', 'lizard',
'frog', 'toad', 'monkey', 'ape', 'kangaroo', 'wombat', 'wolf',
'france', 'germany', 'hungary', 'luxembourg', 'australia', 'fiji', 'china',
'homework', 'assignment', 'problem', 'exam', 'test', 'class',
'school', 'college', 'university', 'institute','soda','fanta','cars'])
Pick a few words, find the 10 most similar words for each, and display them.
word_list=[]
word_list.append(most_k_similar('israel' ,word_dict,10))
word_list.append(most_k_similar('chair' ,word_dict,10))
word_list.append(most_k_similar('spain' ,word_dict,10))
word_list.append(most_k_similar('good' ,word_dict,10))
words = [i for sub in word_list for i in sub]
display_pca_scatterplot_iplot(word_dict, words)
Randomly pick 300 words and display their vectors
display_pca_scatterplot_iplot(word_dict, sample=300)