This tutorial shows how to use the Non-Metric Space Library (NMSLIB) for similarity queries against a Word2Vec model built with gensim.
Gensim's built-in k-nearest-neighbor search over a vector space is exact: it uses brute force, which is linear in the number of indexed documents, although with extremely low constant factors. Exact results are overkill for many applications, where approximate results retrieved in sub-linear time are good enough. Nmslib can find approximate nearest neighbors much faster. Compared to Annoy, nmslib exposes more parameters to control build time, query time, and accuracy, and can achieve both faster and more accurate nearest-neighbor search. The comparison at the end of this tutorial evaluates the Annoy and nmslib indexers under different parameter settings.
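To make the baseline concrete, here is a minimal pure-Python sketch (not gensim's actual implementation) of exact brute-force nearest-neighbor search by cosine similarity — the O(n) scan over every indexed vector that the approximate indexers avoid:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def brute_force_knn(query, vectors, topn=3):
    """Exact k-NN: score every indexed vector, then sort. O(n) per query."""
    scored = [(key, cosine(query, vec)) for key, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy 3-dimensional "embeddings" (illustrative only):
vectors = {
    'cat': [1.0, 0.1, 0.0],
    'dog': [0.9, 0.2, 0.1],
    'car': [0.0, 1.0, 0.9],
}
print(brute_force_knn([1.0, 0.0, 0.0], vectors, topn=2))
```

Approximate indexers such as Annoy and nmslib replace this full scan with a search over a precomputed graph or tree structure, trading a little accuracy for sub-linear query time.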
Additional libraries needed for this tutorial:
# pip install watermark
%reload_ext watermark
%watermark -v -m -p gensim,numpy,scipy,psutil,matplotlib
CPython 3.6.0 IPython 7.5.0 gensim 3.7.3 numpy 1.16.2 scipy 1.2.1 psutil 5.6.3 matplotlib 3.1.0 compiler : GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5) system : Darwin release : 18.2.0 machine : x86_64 processor : i386 CPU cores : 4 interpreter: 64bit
import os.path
if not os.path.isfile('text8'):
!wget -c http://mattmahoney.net/dc/text8.zip
!unzip text8.zip
--2019-06-27 13:48:42-- http://mattmahoney.net/dc/text8.zip Resolving mattmahoney.net... 67.195.197.75 Connecting to mattmahoney.net|67.195.197.75|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 31344016 (30M) [application/zip] Saving to: 'text8.zip' text8.zip 100%[=====================>] 29.89M 327KB/s in 98s 2019-06-27 13:50:21 (313 KB/s) - 'text8.zip' saved [31344016/31344016] Archive: text8.zip inflating: text8
I won't set up logging here, since it produces verbose output in notebooks. If you do want logging, set LOGS = True in the cell below.
LOGS = False
if LOGS:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from gensim.models import Word2Vec, KeyedVectors
from gensim.models.word2vec import Text8Corpus
# Using params from Word2Vec_FastText_Comparison
params = {
'alpha': 0.05,
'size': 100,
'window': 5,
'iter': 5,
'min_count': 5,
'sample': 1e-4,
'sg': 1,
'hs': 0,
'negative': 5
}
model = Word2Vec(Text8Corpus('text8'), **params)
print(model)
Word2Vec(vocab=71290, size=100, alpha=0.05)
See the Word2Vec tutorial for how to initialize and save this model.
# Set up the model and vector that we are using in the comparison
from gensim.similarities.index import AnnoyIndexer
from gensim.similarities.nmslib import NmslibIndexer
model.init_sims()
annoy_index = AnnoyIndexer(model, 300)
nmslib_index = NmslibIndexer(model, {'M': 100, 'indexThreadQty': 1, 'efConstruction': 100}, {'efSearch': 10})
# Dry run to make sure both indices are fully in RAM
vector = model.wv.syn0norm[0]
print(model.most_similar([vector], topn=5, indexer=annoy_index))
print(model.most_similar([vector], topn=5, indexer=nmslib_index))
print(model.most_similar([vector], topn=5))
[('the', 1.0), ('of', 0.700629323720932), ('in', 0.6932138800621033), ('which', 0.6132444441318512), ('for', 0.6099657416343689)] [('the', 0.9999999701976776), ('of', 0.91037717461586), ('in', 0.9058823883533478), ('a', 0.8834112882614136), ('and', 0.8790014386177063)] [('the', 1.0000001192092896), ('of', 0.82075434923172), ('in', 0.811764657497406), ('a', 0.7668224573135376), ('and', 0.7580028772354126)]
import time
import numpy as np
def avg_query_time(indexer=None, queries=1000):
    """
    Average query time of a most_similar call, measured over `queries`
    random queries; uses the given approximate indexer if one is provided.
    """
    total_time = 0
    for _ in range(queries):
        rand_vec = model.wv.syn0norm[np.random.randint(0, len(model.wv.vocab))]
        start_time = time.perf_counter()
        model.most_similar([rand_vec], topn=5, indexer=indexer)
        total_time += time.perf_counter() - start_time
    return total_time / queries
queries = 10000
gensim_time = avg_query_time(queries=queries)
annoy_time = avg_query_time(annoy_index, queries=queries)
nmslib_time = avg_query_time(nmslib_index, queries=queries)
print("Gensim (s/query):\t{0:.5f}".format(gensim_time))
print("Annoy (s/query):\t{0:.5f}".format(annoy_time))
print("Nmslib (s/query):\t{0:.5f}".format(nmslib_time))
speed_improvement_gensim = gensim_time / nmslib_time
speed_improvement_annoy = annoy_time / nmslib_time
print ("\nNmslib is {0:.2f} times faster on average on this particular run".format(speed_improvement_gensim))
print ("\nNmslib is {0:.2f} times faster on average than annoy on this particular run".format(speed_improvement_annoy))
Gensim (s/query): 0.00560 Annoy (s/query): 0.00099 Nmslib (s/query): 0.00046 Nmslib is 12.16 times faster on average on this particular run Nmslib is 2.15 times faster on average than annoy on this particular run
An instance of NmslibIndexer needs to be created in order to use Nmslib in gensim. The NmslibIndexer class is located in gensim.similarities.nmslib.

NmslibIndexer() takes three parameters:

model: A Word2Vec or Doc2Vec model.

index_params: Parameters for building the nmslib index. index_params affects the build time and the index size. An example is {'M': 100, 'indexThreadQty': 1, 'efConstruction': 100}. Increasing the values of M and efConstruction improves search accuracy, but also lengthens indexing time. indexThreadQty is the number of threads used to build the index.

query_time_params: Parameters for querying the nmslib index. query_time_params affects the query time and the search accuracy. An example is {'efSearch': 100}. A larger efSearch gives more accurate results, but longer query times.

More information can be found in the nmslib documentation. The relationship between parameters, build/query time, and accuracy is investigated later in this tutorial.

Now that we are ready to make a query, let's find the top 5 most similar words to "science" in the Text8 corpus. To make a similarity query we call Word2Vec.most_similar as we would traditionally, but with an added indexer parameter. The only indexers supported in gensim as of now are Annoy and Nmslib.
# Building nmslib indexer
nmslib_index = NmslibIndexer(model, {'M': 100, 'indexThreadQty': 1, 'efConstruction': 100}, {'efSearch': 10})
# Derive the vector for the word "science" in our model
vector = model["science"]
# The instance of NmslibIndexer we just created is passed
approximate_neighbors = model.most_similar([vector], topn=11, indexer=nmslib_index)
# Neatly print the approximate_neighbors and their corresponding cosine similarity values
print("Approximate Neighbors")
for neighbor in approximate_neighbors:
print(neighbor)
normal_neighbors = model.most_similar([vector], topn=11)
print("\nNormal (not nmslib-indexed) Neighbors")
for neighbor in normal_neighbors:
print(neighbor)
Approximate Neighbors ('science', 1.0000000596046448) ('fiction', 0.8769421577453613) ('protoscience', 0.8432635962963104) ('multidisciplinary', 0.835610032081604) ('sciences', 0.8348604440689087) ('astrobiology', 0.8340338170528412) ('actuarial', 0.8339103162288666) ('interdisciplinary', 0.8327268362045288) ('xenobiology', 0.8318319618701935) ('criminology', 0.8261869251728058) ('futurists', 0.82555091381073) Normal (not nmslib-indexed) Neighbors ('science', 0.9999998807907104) ('fiction', 0.7538841962814331) ('protoscience', 0.6865270733833313) ('multidisciplinary', 0.6712199449539185) ('sciences', 0.6697208881378174) ('astrobiology', 0.6680675148963928) ('actuarial', 0.6678205132484436) ('interdisciplinary', 0.6654534339904785) ('xenobiology', 0.663663923740387) ('vernor', 0.6569585800170898) ('criminology', 0.652373731136322)
The closer the cosine similarity of a vector is to 1, the more similar that word is to our query, which was the vector for "science". In this case the returned neighbors are almost identical, although the nmslib indexer reports similarity scores on a different scale than the exact search.
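One simple way to quantify how well the approximate search agrees with the exact one — the same set-intersection idea this tutorial uses later in its parameter evaluation — is the fraction of exact top-n words that the approximate search also returns. A small self-contained sketch (the helper name and toy result lists are made up for illustration):

```python
def overlap_accuracy(approximate, exact):
    """Fraction of exact top-n words also found by the approximate search.

    Both arguments are lists of (word, similarity) pairs, shaped like the
    output of most_similar(); only the words matter, not the scores.
    """
    approx_words = {word for word, _ in approximate}
    exact_words = {word for word, _ in exact}
    return len(approx_words & exact_words) / len(exact_words)

# Toy result lists standing in for most_similar() output:
approx = [('science', 0.87), ('fiction', 0.84), ('sciences', 0.83)]
exact = [('science', 0.99), ('fiction', 0.75), ('biology', 0.67)]
print(overlap_accuracy(approx, exact))  # 2 of the 3 exact words recovered
```

Measuring overlap on the words rather than the scores sidesteps the score-scaling difference between the exact and nmslib-indexed results.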
You can save your index to disk and load it back, to avoid having to construct it each time. Saving creates two files, fname and fname.d; both are needed to correctly restore all attributes.
import os
fname = '/tmp/mymodel.index'
# Persist index to disk
nmslib_index.save(fname)
# Load index back
if os.path.exists(fname):
nmslib_index2 = NmslibIndexer.load(fname)
nmslib_index2.model = model
# Results should be identical to above
vector = model["science"]
approximate_neighbors2 = model.most_similar([vector], topn=11, indexer=nmslib_index2)
for neighbor in approximate_neighbors2:
print(neighbor)
assert approximate_neighbors == approximate_neighbors2
('science', 1.0000000596046448) ('fiction', 0.8769421577453613) ('protoscience', 0.8432635962963104) ('multidisciplinary', 0.835610032081604) ('sciences', 0.8348604440689087) ('astrobiology', 0.8340338170528412) ('actuarial', 0.8339103162288666) ('interdisciplinary', 0.8327268362045288) ('xenobiology', 0.8318319618701935) ('criminology', 0.8261869251728058) ('futurists', 0.82555091381073)
Be sure to use the same model when loading that was used to build the index; otherwise you will get unexpected behavior.
The nmslib library has a useful feature: indices can be memory-mapped from disk, which saves memory when the same index is used by several processes.

Below are two snippets of code. The first builds a separate index in each process. The second shares a saved index between two processes via memory-mapping, and therefore uses less total RAM.
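The memory-mapping idea itself can be illustrated with Python's standard mmap module, independently of nmslib (a toy sketch; the file here is just a byte payload standing in for a serialized index). Several mappings of the same file are backed by the same OS page cache, so the data is not duplicated in physical memory:

```python
import mmap
import os
import tempfile

# Write a stand-in "index" file to disk.
payload = b'pretend this is a large serialized index ' * 1000
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(payload)

# Two independent mappings of the same file: the OS backs both with the
# same physical pages instead of copying the data per reader.
with open(path, 'rb') as f1, open(path, 'rb') as f2:
    map1 = mmap.mmap(f1.fileno(), 0, access=mmap.ACCESS_READ)
    map2 = mmap.mmap(f2.fileno(), 0, access=mmap.ACCESS_READ)
    assert map1[:40] == map2[:40] == payload[:40]
    map1.close()
    map2.close()

os.remove(path)
```

This is why the second snippet below, where both processes load the same saved index file, shows a noticeably smaller resident set size than the first, where each process builds its own copy.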
# Remove verbosity from code below (if logging active)
if LOGS:
logging.disable(logging.CRITICAL)
from multiprocessing import Process
import psutil
%%time
model.save('/tmp/mymodel.pkl')
def f(process_id):
print('Process Id: {}'.format(os.getpid()))
process = psutil.Process(os.getpid())
new_model = Word2Vec.load('/tmp/mymodel.pkl')
vector = new_model["science"]
nmslib_index = NmslibIndexer(new_model, {'M': 100, 'indexThreadQty': 1, 'efConstruction': 100}, {'efSearch': 10})
approximate_neighbors = new_model.most_similar([vector], topn=5, indexer=nmslib_index)
print('\nMemory used by process {}: {}\n---'.format(os.getpid(), process.memory_info()))
# Create and run two parallel processes; each builds its own index.
p1 = Process(target=f, args=('1',))
p1.start()
p1.join()
p2 = Process(target=f, args=('2',))
p2.start()
p2.join()
Process Id: 36262 Memory used by process 36262: pmem(rss=600539136, vms=6052294656, pfaults=170185, pageins=2) --- Process Id: 36263 Memory used by process 36263: pmem(rss=600539136, vms=6052294656, pfaults=170155, pageins=0) --- CPU times: user 547 ms, sys: 308 ms, total: 856 ms Wall time: 59.2 s
%%time
model.save('/tmp/mymodel.pkl')
def f(process_id):
print('Process Id: {}'.format(os.getpid()))
process = psutil.Process(os.getpid())
new_model = Word2Vec.load('/tmp/mymodel.pkl')
vector = new_model["science"]
nmslib_index = NmslibIndexer.load('/tmp/mymodel.index')
nmslib_index.model = new_model
approximate_neighbors = new_model.most_similar([vector], topn=5, indexer=nmslib_index)
print('\nMemory used by process {}: {}\n---'.format(os.getpid(), process.memory_info()))
# Create and run two parallel processes that share the same saved index file.
p1 = Process(target=f, args=('1',))
p1.start()
p1.join()
p2 = Process(target=f, args=('2',))
p2.start()
p2.join()
Process Id: 36265 Memory used by process 36265: pmem(rss=417345536, vms=5981540352, pfaults=129428, pageins=0) --- Process Id: 36266 Memory used by process 36266: pmem(rss=417579008, vms=5981540352, pfaults=129528, pageins=0) --- CPU times: user 581 ms, sys: 266 ms, total: 847 ms Wall time: 3.74 s
import matplotlib.pyplot as plt
%matplotlib inline
exact_results = [element[0] for element in model.most_similar([model.wv.syn0norm[0]], topn=100)]
# For calculating query time
queries = 1000
def create_evaluation_graph(x_values, y_values_init, y_values_accuracy, y_values_query, param_name):
plt.figure(1, figsize=(12, 6))
plt.subplot(231)
plt.plot(x_values, y_values_init)
plt.title("{} vs initialization time".format(param_name))
plt.ylabel("Initialization time (s)")
plt.xlabel(param_name)
plt.subplot(232)
plt.plot(x_values, y_values_accuracy)
plt.title("{} vs accuracy".format(param_name))
plt.ylabel("% accuracy")
plt.xlabel(param_name)
plt.tight_layout()
plt.subplot(233)
plt.plot(y_values_init, y_values_accuracy)
plt.title("Initialization time vs accuracy")
plt.ylabel("% accuracy")
plt.xlabel("Initialization time (s)")
plt.tight_layout()
plt.subplot(234)
plt.plot(x_values, y_values_query)
plt.title("{} vs query time".format(param_name))
plt.ylabel("query time")
plt.xlabel(param_name)
plt.tight_layout()
plt.subplot(235)
plt.plot(y_values_query, y_values_accuracy)
plt.title("query time vs accuracy")
plt.ylabel("% accuracy")
plt.xlabel("query time (s)")
plt.tight_layout()
plt.show()
def evaluate_nmslib_performance(parameter, is_parameter_query, parameter_start, parameter_end, parameter_step):
nmslib_x_values = []
nmslib_y_values_init = []
nmslib_y_values_accuracy = []
nmslib_y_values_query = []
index_params = {'M': 100, 'indexThreadQty': 10, 'efConstruction': 100, 'post': 0}
query_params = {'efSearch': 100}
for x in range(parameter_start, parameter_end, parameter_step):
nmslib_x_values.append(x)
start_time = time.time()
if is_parameter_query:
query_params[parameter] = x
else:
index_params[parameter] = x
nmslib_index = NmslibIndexer(model, index_params, query_params)
nmslib_y_values_init.append(time.time() - start_time)
approximate_results = model.most_similar([model.wv.syn0norm[0]], topn=100, indexer=nmslib_index)
top_words = [result[0] for result in approximate_results]
nmslib_y_values_accuracy.append(len(set(top_words).intersection(exact_results)))
nmslib_y_values_query.append(avg_query_time(nmslib_index, queries=queries))
create_evaluation_graph(nmslib_x_values,
nmslib_y_values_init,
nmslib_y_values_accuracy,
nmslib_y_values_query,
parameter)
# Evaluate nmslib indexer, changing the parameter M
evaluate_nmslib_performance("M", False, 50, 401, 50)
# Evaluate nmslib indexer, changing the parameter efConstruction
evaluate_nmslib_performance("efConstruction", False, 50, 1001, 100)
# Evaluate nmslib indexer, changing the parameter efSearch
evaluate_nmslib_performance("efSearch", True, 50, 401, 100)
# Evaluate annoy indexer, changing the parameter num_tree
annoy_x_values = []
annoy_y_values_init = []
annoy_y_values_accuracy = []
annoy_y_values_query = []
for x in range(100, 401, 50):
annoy_x_values.append(x)
start_time = time.time()
annoy_index = AnnoyIndexer(model, x)
annoy_y_values_init.append(time.time() - start_time)
approximate_results = model.most_similar([model.wv.syn0norm[0]], topn=100, indexer=annoy_index)
top_words = [result[0] for result in approximate_results]
annoy_y_values_accuracy.append(len(set(top_words).intersection(exact_results)))
annoy_y_values_query.append(avg_query_time(annoy_index, queries=queries))
create_evaluation_graph(annoy_x_values,
annoy_y_values_init,
annoy_y_values_accuracy,
annoy_y_values_query,
"num_tree")
# nmslib indexer changing the parameter M, efConstruction, efSearch
nmslib_y_values_init = []
nmslib_y_values_accuracy = []
nmslib_y_values_query = []
for M in [100, 200]:
for efConstruction in [100, 200]:
for efSearch in [100, 200]:
start_time = time.time()
nmslib_index = NmslibIndexer(model,
{'M': M, 'indexThreadQty': 10, 'efConstruction': efConstruction, 'post': 0},
{'efSearch': efSearch})
nmslib_y_values_init.append(time.time() - start_time)
approximate_results = model.most_similar([model.wv.syn0norm[0]], topn=100, indexer=nmslib_index)
top_words = [result[0] for result in approximate_results]
nmslib_y_values_accuracy.append(len(set(top_words).intersection(exact_results)))
nmslib_y_values_query.append(avg_query_time(nmslib_index, queries=queries))
# Make a comparison between annoy and nmslib indexer
plt.figure(1, figsize=(12, 6))
plt.subplot(121)
plt.scatter(nmslib_y_values_init, nmslib_y_values_accuracy, label="nmslib", color='r', marker='o')
plt.scatter(annoy_y_values_init, annoy_y_values_accuracy, label="annoy", color='b', marker='x')
plt.legend()
plt.title("Initialization time vs accuracy. Upper left is better.")
plt.ylabel("% accuracy")
plt.xlabel("Initialization time (s)")
plt.subplot(122)
plt.scatter(nmslib_y_values_query, nmslib_y_values_accuracy, label="nmslib", color='r', marker='o')
plt.scatter(annoy_y_values_query, annoy_y_values_accuracy, label="annoy", color='b', marker='x')
plt.legend()
plt.title("Query time vs accuracy. Upper left is better.")
plt.ylabel("% accuracy")
plt.xlabel("Query time (s)")
plt.xlim(min(nmslib_y_values_query+annoy_y_values_query), max(nmslib_y_values_query+annoy_y_values_query))
plt.tight_layout()
plt.show()
Our model can be exported to the word2vec C format. There are both binary and plain-text word2vec formats. Both can be read with a variety of other software, or imported back into gensim as a KeyedVectors object.
# To export our model as text
model.wv.save_word2vec_format('/tmp/vectors.txt', binary=False)
from smart_open import open
# View the first 3 lines of the exported file
# The first line has the total number of entries and the vector dimension count.
# The next lines have a key (a string) followed by its vector.
with open('/tmp/vectors.txt') as myfile:
for i in range(3):
print(myfile.readline().strip())
71290 100 the -0.20548718 0.19478682 -0.15149663 -0.31142342 0.014471135 -0.17996445 -0.07373469 0.09573618 -0.06328416 0.15571225 0.021413572 -0.12776679 -0.16940169 0.15807933 -0.21688043 0.074471496 0.08091913 0.07911484 0.31909388 -0.12297766 0.16993207 -0.02962172 0.08481803 0.12566781 0.02949822 -0.009697897 0.10780254 -0.102594994 -0.03935867 0.2679534 -0.061677158 -0.26071545 0.16285498 0.051780242 -0.1697231 0.24037386 0.0726078 -0.090416454 0.0776138 -0.06611322 0.057015926 0.07859522 -0.1910579 0.2974446 -0.033308737 -0.07360004 0.10797568 0.3595622 0.26797494 -0.062491674 0.21733648 -0.08524646 -0.06860078 -0.01714756 0.1305319 -0.09754544 -0.11249808 0.27328265 0.0041686473 -0.09874534 -0.30283058 0.111191 -0.026302295 -0.095534325 -0.0907799 -0.09120328 0.00068672217 0.31802058 -0.03345536 0.103762306 0.068564445 0.07402255 0.013657822 0.020439595 0.14985266 -0.13516407 0.36674532 0.0077319355 0.24709526 0.07666927 -0.11271039 0.02220251 -0.0235162 0.06409378 -0.10098407 0.23384795 0.094924204 0.061178204 0.19992544 0.29211295 0.004201066 0.12897494 -0.07837112 0.06269808 0.18545003 0.10452479 0.093281575 -0.24360427 0.01996208 0.35048977 of -0.19580448 0.209091 -0.099757686 -0.21272708 0.022300478 -0.21946296 -0.19841547 0.09413904 0.08382151 0.20281556 0.12914163 -0.062828615 -0.24957581 0.29098126 -0.12465087 0.09803499 0.08116001 0.06726196 0.2615518 -0.035177294 0.06914535 -0.09427601 0.0011476864 0.23290296 0.0034714406 0.009549841 0.03912403 0.11168332 -0.10166585 0.34404838 -0.08050078 -0.3933344 0.17402974 -0.15764657 -0.21314052 0.20686424 0.034598276 0.16018851 0.14357902 0.046236232 0.075825214 0.029642927 -0.059190348 0.4163471 -0.1367429 -0.017528763 0.19181107 0.2601198 -0.020112848 -0.23402186 0.2841525 -0.10974068 -0.002565893 0.00070757867 0.13032512 -0.002393167 -0.14120881 0.22138755 0.027622899 0.06942904 -0.39498508 0.1133777 0.19053803 -0.062439334 -0.025348661 -0.11142109 0.015062763 0.3285828 -0.09184951 0.2661699 -0.11710489 
-0.15770112 -0.12773664 0.15360866 0.08832063 -0.20914252 0.32392043 -0.023845093 0.3131217 0.08974748 -0.11354328 -0.2037927 -0.06780317 0.20184614 -0.13539118 0.2029387 0.07701099 -0.048417546 0.09797926 0.284204 0.036153372 0.17912139 -0.118080124 -0.025121484 0.10947146 0.09291596 0.1244357 0.006804844 0.025120731 0.28958535
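The text format is simple enough to parse by hand. As a sketch of what the loader has to do (KeyedVectors.load_word2vec_format is the real implementation; the function name and toy data here are made up), the following reads the "count dim" header and then one key/vector line per entry:

```python
import io

def parse_word2vec_text(stream):
    """Parse the plain-text word2vec format: a 'count dim' header line,
    then one 'key v1 v2 ... v_dim' line per vector."""
    header = stream.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(count):
        parts = stream.readline().rstrip().split(' ')
        vector = [float(x) for x in parts[1:]]
        assert len(vector) == dim, 'malformed vector line'
        vectors[parts[0]] = vector
    return vectors

# Toy file with 2 entries of dimension 3:
sample = io.StringIO("2 3\nthe -0.2 0.19 -0.15\nof -0.19 0.2 -0.09\n")
vecs = parse_word2vec_text(sample)
print(len(vecs), vecs['the'])
```

The binary format stores the same header and keys but packs each vector as raw little-endian floats, which is why it loads faster and takes less disk space.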
# To import a word2vec text model
wv = KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)
# To export our model as binary
model.wv.save_word2vec_format('/tmp/vectors.bin', binary=True)
# To import a word2vec binary model
wv = KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)
# To create and save Nmslib Index from a loaded `KeyedVectors` object
nmslib_index = NmslibIndexer(wv,
{'M': 100, 'indexThreadQty': 1, 'efConstruction': 100}, {'efSearch': 100})
nmslib_index.save('/tmp/mymodel.index')
# Load and test the saved word vectors and saved nmslib index
wv = KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)
nmslib_index = NmslibIndexer.load('/tmp/mymodel.index')
nmslib_index.model = wv
vector = wv["cat"]
approximate_neighbors = wv.most_similar([vector], topn=11, indexer=nmslib_index)
# Neatly print the approximate_neighbors and their corresponding cosine similarity values
print("Approximate Neighbors")
for neighbor in approximate_neighbors:
print(neighbor)
normal_neighbors = wv.most_similar([vector], topn=11)
print("\nNormal (not Nmslib-indexed) Neighbors")
for neighbor in normal_neighbors:
print(neighbor)
Approximate Neighbors ('cat', 1.0) ('cats', 0.8335747122764587) ('meow', 0.8298084139823914) ('leopardus', 0.8215900659561157) ('albino', 0.8207193315029144) ('poodle', 0.8168523907661438) ('saimiri', 0.8167656660079956) ('squirrel', 0.8138661682605743) ('sighthound', 0.8134040832519531) ('proboscis', 0.8130573630332947) ('eared', 0.812840610742569) Normal (not Nmslib-indexed) Neighbors ('cat', 1.0) ('cats', 0.6671494245529175) ('meow', 0.6596168279647827) ('leopardus', 0.6431801319122314) ('albino', 0.6414386034011841) ('poodle', 0.633704662322998) ('saimiri', 0.633531391620636) ('squirrel', 0.6277321577072144) ('sighthound', 0.6268081665039062) ('proboscis', 0.6261147260665894) ('eared', 0.6256811618804932)
In this notebook we used the nmslib module to build an indexed approximation of our word embeddings. To do so, we took the following steps:

1. Trained a Word2Vec model on the Text8 corpus.
2. Built an NmslibIndexer (and an AnnoyIndexer for comparison) over the model's vectors.
3. Ran approximate similarity queries, compared them against exact results, and benchmarked query times.
4. Saved the index to disk, loaded it back, and shared it between processes.
5. Explored how the parameters M, efConstruction, and efSearch trade off build time, query time, and accuracy.