word2vec

This notebook is equivalent to demo-word.sh, demo-analogy.sh, demo-phrases.sh, and demo-classes.sh from Google's original word2vec distribution.

In [1]:
%load_ext autoreload
%autoreload 2

Training

Download some data, for example: http://mattmahoney.net/dc/text8.zip
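
For example, the text8 corpus can be downloaded and unpacked with a few lines of standard-library Python (a minimal sketch; adjust the paths to your machine):

import urllib.request
import zipfile

# Download text8.zip and extract the plain-text "text8" file into the current directory
urllib.request.urlretrieve('http://mattmahoney.net/dc/text8.zip', 'text8.zip')
with zipfile.ZipFile('text8.zip') as z:
    z.extractall()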

In [3]:
import word2vec

Run word2phrase to group frequently co-occurring words into single tokens, e.g. "Los Angeles" becomes "Los_Angeles".

In [4]:
word2vec.word2phrase('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-phrases', verbose=True)
Starting training using file /Users/drodriguez/Downloads/text8
Words processed: 17000K     Vocab size: 4399K  
Vocab size (unigrams + bigrams): 2419827
Words in train file: 17005206

This created a text8-phrases file that we can use as a better input for word2vec. Note that you could easily skip this step and use the original text8 data as input for word2vec directly.
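
If you are curious, you can peek at the new file and see that frequent bigrams have been joined with an underscore (an illustrative snippet using the same path as above):

with open('/Users/drodriguez/Downloads/text8-phrases') as f:
    print(f.read(200))  # first few hundred characters of the phrase-joined text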

Now actually train the word2vec model.

In [5]:
word2vec.word2vec('/Users/drodriguez/Downloads/text8-phrases', '/Users/drodriguez/Downloads/text8.bin', size=100, verbose=True)
Starting training using file /Users/drodriguez/Downloads/text8-phrases
Vocab size: 98331
Words in train file: 15857306
Alpha: 0.000002  Progress: 100.03%  Words/thread/sec: 323.95k  

That created a text8.bin file containing the word vectors in a binary format.

Now we generate clusters of the word vectors based on the trained model. Note that word2clusters trains from the original text8 file (not the phrases file), which is why the vocabulary size reported below differs from the one above; the 100 here is the number of clusters to produce.

In [6]:
word2vec.word2clusters('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-clusters.txt', 100, verbose=True)
Starting training using file /Users/drodriguez/Downloads/text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.04%  Words/thread/sec: 317.72k  

That created a text8-clusters.txt file with the cluster number for every word in the vocabulary.
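
The clusters file is plain text with one word and its cluster number per line, so it is easy to inspect directly (an illustrative snippet using the same path as above):

with open('/Users/drodriguez/Downloads/text8-clusters.txt') as f:
    for _ in range(5):
        print(f.readline().strip())  # each line: a word followed by its cluster id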

Predictions

In [1]:
%load_ext autoreload
%autoreload 2
In [3]:
import word2vec

Load the word2vec binary file created above.

In [4]:
model = word2vec.load('/Users/drodriguez/Downloads/text8.bin')

We can take a look at the vocabulary as a numpy array

In [5]:
model.vocab
Out[5]:
array(['</s>', 'the', 'of', ..., 'dakotas', 'nias', 'burlesques'],
      dtype='<U78')

Or take a look at the whole matrix

In [6]:
model.vectors.shape
Out[6]:
(98331, 100)
In [7]:
model.vectors
Out[7]:
array([[ 0.14333282,  0.15825513, -0.13715845, ...,  0.05456942,
         0.10955409,  0.00693387],
       [ 0.07306823,  0.1179086 ,  0.10995189, ...,  0.09345266,
        -0.1312812 , -0.00915683],
       [ 0.26229969,  0.02270839,  0.05854911, ...,  0.03924898,
        -0.03867628,  0.21437503],
       ...,
       [-0.1427108 ,  0.10650002,  0.07283197, ...,  0.14563465,
        -0.06967127,  0.037186  ],
       [ 0.06538665, -0.04184594,  0.13385373, ...,  0.08183857,
        -0.07006828, -0.09386028],
       [-0.00991228, -0.12096601,  0.10771658, ...,  0.01684521,
        -0.143217  , -0.10602982]])

We can retrieve the vector of an individual word:

In [8]:
model['dog'].shape
Out[8]:
(100,)
In [9]:
model['dog'][:10]
Out[9]:
array([ 0.06666815,  0.12450022,  0.02513653,  0.12673911,  0.13396765,
       -0.00938436,  0.06476378,  0.15387769,  0.05472341, -0.08388881])
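
Indexing the model by a word is just a convenience for finding that word's row in the vocabulary and returning the matching row of the vectors matrix. A quick sketch of the equivalent manual lookup, assuming the vocab/vectors layout shown above:

import numpy as np

ix = np.where(model.vocab == 'dog')[0][0]  # row index of "dog" in the vocabulary
model.vectors[ix][:10]                     # same values as model['dog'][:10]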

We can calculate the distance between two or more words; the result contains the cosine similarity for every pairwise combination:

In [10]:
model.distance("dog", "cat", "fish")
Out[10]:
[('dog', 'cat', 0.8693732680572173),
 ('dog', 'fish', 0.5900484800297155),
 ('cat', 'fish', 0.6269017149314428)]
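
These metrics are cosine similarities, so values closer to 1 mean the words are closer in the vector space. A minimal sketch of the same computation done by hand with numpy (cosine_sim is our own helper, not part of the library):

import numpy as np

def cosine_sim(v1, v2):
    # cosine similarity: dot product divided by the product of the norms
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

cosine_sim(model['dog'], model['cat'])  # should match the ('dog', 'cat') value above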

Similarity

We can do simple queries to retrieve words similar to "dog" based on cosine similarity:

In [11]:
indexes, metrics = model.similar("dog")
indexes, metrics
Out[11]:
(array([ 2437,  5478,  7593, 10230,  3964,  9963,  2428, 10309,  4812,
         2391]),
 array([0.86937327, 0.83396105, 0.77854628, 0.7692265 , 0.76743628,
        0.7612772 , 0.7600788 , 0.75935677, 0.75693881, 0.75438956]))

This returned a tuple with two items:

  1. a numpy array with the indexes of the similar words in the vocabulary
  2. a numpy array with the cosine similarity to each of those words

We can get the words for those indexes

In [12]:
model.vocab[indexes]
Out[12]:
array(['cat', 'cow', 'goat', 'pig', 'dogs', 'rabbit', 'bear', 'rat',
       'wolf', 'girl'], dtype='<U78')

There is a helper function to create a combined response as a numpy record array

In [13]:
model.generate_response(indexes, metrics)
Out[13]:
rec.array([('cat', 0.86937327), ('cow', 0.83396105), ('goat', 0.77854628),
           ('pig', 0.7692265 ), ('dogs', 0.76743628),
           ('rabbit', 0.7612772 ), ('bear', 0.7600788 ),
           ('rat', 0.75935677), ('wolf', 0.75693881),
           ('girl', 0.75438956)],
          dtype=[('word', '<U78'), ('metric', '<f8')])

It is easy to turn that numpy record array into a pure Python response:

In [14]:
model.generate_response(indexes, metrics).tolist()
Out[14]:
[('cat', 0.8693732680572173),
 ('cow', 0.8339610529888226),
 ('goat', 0.7785462766666428),
 ('pig', 0.7692265048531302),
 ('dogs', 0.7674362783482181),
 ('rabbit', 0.7612771996422674),
 ('bear', 0.7600788045286304),
 ('rat', 0.7593567655129181),
 ('wolf', 0.7569388070301634),
 ('girl', 0.754389556345068)]

Phrases

Since we trained the model with the output of word2phrase, we can ask for the similarity of "phrases", that is, combined words such as "Los Angeles" (stored as "los_angeles"):

In [15]:
indexes, metrics = model.similar('los_angeles')
model.generate_response(indexes, metrics).tolist()
Out[15]:
[('san_francisco', 0.8876351265573288),
 ('san_diego', 0.8652920422732189),
 ('seattle', 0.8387625165949533),
 ('las_vegas', 0.8325965377422355),
 ('california', 0.8252775393303263),
 ('miami', 0.8167069457881345),
 ('detroit', 0.8164911899252103),
 ('chicago', 0.813283620659967),
 ('cincinnati', 0.8116379669114295),
 ('cleveland', 0.810708205429068)]

Analogies

It is possible to do more complex queries, like analogies such as: king - man + woman = queen. This method returns the same kind of result as the similarity queries above: the indexes of the words in the vocabulary and the metric for each.

In [16]:
indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'])
indexes, metrics
Out[16]:
(array([1087, 6768, 1145, 7523, 1335, 8419, 3141, 1827,  344, 4980]),
 array([0.28823424, 0.26614362, 0.26265608, 0.26111525, 0.26091172,
        0.25844542, 0.25781944, 0.25678284, 0.25424551, 0.2529607 ]))
In [17]:
model.generate_response(indexes, metrics).tolist()
Out[17]:
[('queen', 0.28823424120681784),
 ('regent', 0.26614361576778933),
 ('prince', 0.2626560787162791),
 ('empress', 0.2611152451318436),
 ('wife', 0.26091172315990346),
 ('aragon', 0.25844541581050506),
 ('monarch', 0.25781944140528035),
 ('throne', 0.256782835877586),
 ('son', 0.25424550637754495),
 ('heir', 0.25296070456687614)]
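
Under the hood an analogy query combines the positive and negative word vectors and ranks the whole vocabulary by its similarity to the result. A rough numpy sketch of that idea (not the library's exact implementation, which also excludes the query words from the results for you):

import numpy as np

# king - man + woman
target = model['king'] - model['man'] + model['woman']
target /= np.linalg.norm(target)

# cosine similarity of the target against every vector in the vocabulary
sims = model.vectors.dot(target) / np.linalg.norm(model.vectors, axis=1)

best = np.argsort(-sims)  # vocabulary indexes sorted by decreasing similarity
print([model.vocab[i] for i in best[:13] if model.vocab[i] not in ('king', 'man', 'woman')][:10])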

Clusters

In [18]:
clusters = word2vec.load_clusters('/Users/drodriguez/Downloads/text8-clusters.txt')

The clusters object contains the vocabulary and the cluster assigned to every word. We can take a look at the vocabulary as a numpy array:

In [19]:
clusters.vocab
Out[19]:
array(['</s>', 'the', 'of', ..., 'bredon', 'skirting', 'santamaria'],
      dtype='<U29')
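
We can also ask for the cluster number of an individual word; assuming the clusters object supports item access by word (check your version of the package), that looks like:

clusters['dog']  # the integer cluster id assigned to "dog"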

We can get all the words grouped in a specific cluster:

In [20]:
clusters.get_words_on_cluster(90).shape
Out[20]:
(206,)
In [21]:
clusters.get_words_on_cluster(90)[:10]
Out[21]:
array(['along', 'associated', 'relations', 'relationship', 'deal',
       'combined', 'contact', 'connection', 'respect', 'mixed'],
      dtype='<U29')

We can add the clusters to the word2vec model and generate a response that includes the cluster number of each word as a third field:

In [22]:
model.clusters = clusters
In [23]:
indexes, metrics = model.analogy(pos=["paris", "germany"], neg=["france"])
In [24]:
model.generate_response(indexes, metrics).tolist()
Out[24]:
[('berlin', 0.3187078682472152, 15),
 ('vienna', 0.28562803640143397, 12),
 ('munich', 0.28527806428082675, 21),
 ('moscow', 0.27085681100243797, 74),
 ('leipzig', 0.2697639527846636, 8),
 ('st_petersburg', 0.25841328545046965, 61),
 ('prague', 0.2571333430942206, 72),
 ('bonn', 0.2546126113385251, 8),
 ('dresden', 0.2471285069069249, 71),
 ('warsaw', 0.2450778083401204, 74)]