Word2Vec.jl

In [1]:
using Word2Vec

Training

Three functions are available for training:

  • word2vec
  • word2phrase
  • word2clusters

We first download the text8 corpus from http://mattmahoney.net/dc/text8.zip and unzip it.
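
This can also be done from within Julia; a minimal sketch, assuming the unzip utility is available on the system path (file names are illustrative):

download("http://mattmahoney.net/dc/text8.zip", "text8.zip")  # fetch the archive with Base.download
run(`unzip text8.zip`)                                         # extract the plain-text file text8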

All functions are documented, so we can type ?functionname to check the input options.

In [2]:
?word2vec
search: 
Out[2]:

word2vec(train, output; size=100, window=5, sample=1e-3, hs=0, negative=5, threads=12, iter=5, min_count=5, alpha=0.025, debug=2, binary=1, cbow=1, save_vocal=Void(), read_vocab=Void(), verbose=false,)

Parameters for training:
    train <file>
        Use text data from <file> to train the model
    output <file>
        Use <file> to save the resulting word vectors / word clusters
    size <Int>
        Set size of word vectors; default is 100
    window <Int>
        Set max skip length between words; default is 5
    sample <AbstractFloat>
        Set threshold for occurrence of words. Those that appear with
        higher frequency in the training data will be randomly
        down-sampled; default is 1e-5.
    hs <Int>
        Use Hierarchical Softmax; default is 1 (0 = not used)
    negative <Int>
        Number of negative examples; default is 0, common values are 
        5 - 10 (0 = not used)
    threads <Int>
        Use <Int> threads (default 12)
    iter <Int>
        Run more training iterations (default 5)
    min_count <Int>
        This will discard words that appear less than <Int> times; default
        is 5
    alpha <AbstractFloat>
        Set the starting learning rate; default is 0.025
    debug <Int>
        Set the debug mode (default = 2 = more info during training)
    binary <Int>
        Save the resulting vectors in binary mode; default is 0 (off)
    cbow <Int>
        Use the continuous bag of words model; default is 1 (use 0 for the
        skip-gram model)
    save_vocab <file>
        The vocabulary will be saved to <file>
    read_vocab <file>
        The vocabulary will be read from <file>, not constructed from the
        training data
    verbose <Bool>
        Print output from training
word2vec Word2Vec

In [4]:
word2vec("Downloads/text8", "text8-vec.txt", verbose=true)
Starting training using file Downloads/text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.03%  Words/thread/sec: 356.11k  

This will create a text file text8-vec.txt in which each word of text8 is represented by a vector. In some applications we want a vector representation of a larger piece of text. For example, instead of treating "san" and "francisco" as two separate words, we want a single vector for "san francisco". This can be achieved by pre-processing the text corpus with the function word2phrase.

In [29]:
word2phrase("Downloads/text8", "text8phrase")
word2vec("text8phrase", "text8phrase-vec.txt", verbose=true)
Starting training using file Downloads/text8

Vocab size (unigrams + bigrams): 2419827
Words in train file: 17005206
Starting training using file text8phrase
Vocab size: 98331
Words in train file: 15857306
Alpha: 0.000002  Progress: 100.03%  Words/thread/sec: 341.90k  

word2clusters assigns each word a cluster ID.

In [22]:
word2clusters("text8", "text8-class.txt", 100)
Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000006  Progress: 99.99%  Words/thread/sec: 382.50k  

Modelling

In [25]:
;ls
text8
text8-class.txt
text8phrase
text8phrase-vec.txt
text8-vec.txt
text8.zip
In [4]:
model = wordvectors("text8-vec.txt")
Out[4]:
WordVectors 71291 words, 100-element Float64 vectors

Here are some basic functions: size returns the dimensions (vector length, vocabulary size), and vocabulary returns the list of words in the model.

In [26]:
size(model)
Out[26]:
(100,71291)
In [7]:
words = vocabulary(model)
Out[7]:
71291-element Array{AbstractString,1}:
 "</s>"        
 "the"         
 "of"          
 "and"         
 "one"         
 "in"          
 "a"           
 "to"          
 "zero"        
 "nine"        
 "two"         
 "is"          
 "as"          
 ⋮             
 "raam"        
 "barad"       
 "baume"       
 "mothmen"     
 "gallopin"    
 "horsecollar" 
 "mojitos"     
 "snaggletooth"
 "introvigne"  
 "denishawn"   
 "tamiris"     
 "dolophine"   
In [36]:
idx = index(model, "book")
Out[36]:
199
In [37]:
words[idx]
Out[37]:
"book"

We can retrieve the vector representation of individual words and compute the cosine similarity between two words.

In [6]:
get_vector(model, "one")
Out[6]:
100-element Array{Float64,1}:
 -0.00124171 
 -0.153338   
  0.102503   
  0.0189016  
  0.0481557  
 -0.017203   
 -0.0345992  
 -0.143795   
  0.13964    
  0.10404    
  0.0987664  
  0.000247274
 -0.0294016  
  ⋮          
 -0.0729129  
  0.00609002 
 -0.115113   
 -0.1635     
  0.104623   
 -0.0815325  
 -0.0979441  
 -0.0522775  
 -0.0893822  
  0.121403   
 -0.0100501  
  0.100918   
In [7]:
similarity(model, "one", "two")
Out[7]:
1-element Array{Float64,1}:
 0.795706
In [8]:
similarity(model, "one", "hello")
Out[8]:
1-element Array{Float64,1}:
 0.000406437
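
As a sanity check, similarity should agree with the cosine of the angle between the two word vectors. A minimal sketch of that computation (dividing by the norms keeps it valid whether or not the stored vectors are already unit-normalized; on Julia 1.0+ dot and norm come from LinearAlgebra):

using LinearAlgebra                    # dot and norm; part of Base on older Julia versions
v1 = get_vector(model, "one")
v2 = get_vector(model, "two")
dot(v1, v2) / (norm(v1) * norm(v2))    # expected to be close to similarity(model, "one", "two")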

The function cosine(model, word, n) returns the indices and cosine similarities of the n words closest to word.

In [5]:
idxs, dists = cosine(model, "paris", 10)
Out[5]:
([1056,5356,3222,6964,4611,9122,4219,4218,12359,15749],[0.9999999999999999,0.7663500088541577,0.754464599846764,0.7269210839899815,0.6949826015235541,0.6906811525014919,0.6899940820154246,0.6874193547947338,0.6820260362215733,0.6817421670970778])

We can use Gadfly to plot the cosine similarities of the 10 words closest to "paris".

In [3]:
using Gadfly
In [8]:
plot(x=words[idxs], y=dists)
Out[8]:
[Gadfly plot: cosine similarity (y) for the ten words closest to "paris" (x): paris, venice, vienna, leipzig, munich, villa, florence, milan, bologna, turin]
In [12]:
?analogy
search: 
Out[12]:

analogy(wv, pos, neg, n=5)

Compute the analogy similarity between two lists of words. The positions and the similarity values of the top n similar words will be returned. For example, king - man + woman = queen will be pos=["king", "woman"], neg=["man"].

analogy analogy_words
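
Conceptually, such an analogy query is vector arithmetic: add the positive word vectors, subtract the negative ones, and look for the words closest to the result. A hedged sketch of that idea (the package's exact combination and exclusion rules may differ):

using LinearAlgebra                                    # dot and norm on Julia 1.0+
v = get_vector(model, "king") - get_vector(model, "man") + get_vector(model, "woman")
allwords = vocabulary(model)
sims = [dot(get_vector(model, w), v) / (norm(get_vector(model, w)) * norm(v)) for w in allwords]
allwords[sortperm(sims, rev=true)][1:10]               # nearest words; the query words themselves rank highly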

In [10]:
indxs, dists = analogy(model, ["king", "woman"], ["man"], 8)
Out[10]:
([904,6854,1062,2033,527,12269,2076,3422],[0.29024117726721255,0.26277586168028433,0.253278904324895,0.25208853175214935,0.24775773633691375,0.24558402441677105,0.24309061947916874,0.2418303817434562])
In [11]:
plot(x=words[indxs], y=dists)
Out[11]:
[Gadfly plot: analogy similarity (y) for the top words (x): queen, empress, prince, elizabeth, emperor, sigismund, throne, princess]

analogy_words is a wrapper around analogy.
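
Presumably the wrapper simply maps the indices returned by analogy back to vocabulary entries; a sketch of that assumed equivalence:

idxs, _ = analogy(model, ["king", "woman"], ["man"], 8)
vocabulary(model)[idxs]    # should match analogy_words(model, ["king", "woman"], ["man"], 8)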

In [13]:
?analogy_words
search: 
Out[13]:

analogy_words(wv, pos, neg, n=5)

Return the top n words computed by analogy similarity between positive words pos and negative words neg from the WordVectors wv.

analogy_words

In [23]:
analogy_words(model, ["paris", "germany"], ["france"], 10)
Out[23]:
10-element Array{AbstractString,1}:
 "berlin"    
 "munich"    
 "leipzig"   
 "vienna"    
 "bonn"      
 "dresden"   
 "hamburg"   
 "stuttgart" 
 "frankfurt" 
 "heidelberg"
In [30]:
model2 = wordvectors("text8phrase-vec.txt")
Out[30]:
WordVectors 98331 words, 100-element Float64 vectors

model2 was trained on the corpus pre-processed by word2phrase, so we can compute similar words for phrases.

In [32]:
cosine_similar_words(model2, "los_angeles", 13)
Out[32]:
13-element Array{AbstractString,1}:
 "los_angeles"  
 "san_francisco"
 "san_diego"    
 "miami"        
 "las_vegas"    
 "seattle"      
 "cincinnati"   
 "cleveland"    
 "st_louis"     
 "california"   
 "chicago"      
 "dallas"       
 "atlanta"      

Clustering

In [61]:
model3 = wordclusters("text8-class.txt")
Out[61]:
WordClusters 71291 words, 100 clusters

The function clusters returns all the cluster IDs in a model.

In [62]:
clusters(model3)
Out[62]:
100-element Array{Integer,1}:
  0
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
  ⋮
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99

We can use get_cluster to retrieve the cluster ID of a given word and get_words to retrieve all the words in a given cluster.

In [65]:
get_cluster(model3, "two")
Out[65]:
39
In [66]:
get_words(model3, 39)
Out[66]:
116-element Array{AbstractString,1}:
 "one"          
 "zero"         
 "nine"         
 "two"          
 "eight"        
 "five"         
 "three"        
 "four"         
 "six"          
 "seven"        
 "years"        
 "th"           
 "century"      
 ⋮              
 "interceptions"
 "nisan"        
 "weekday"      
 "ramadan"      
 "weekdays"     
 "workday"      
 "thirtieth"    
 "lunations"    
 "graders"      
 "goodwrench"   
 "spoked"       
 "rublei"