Tutorial on Poincaré Embeddings¶

This notebook discusses the motivation and idea behind Poincaré embeddings and demonstrates what kind of operations can be done with them. It also presents quantitative evaluation results on evaluation tasks mentioned in the original paper, and compares them to results obtained from other, non-gensim implementations. Lastly, it tries to provide some intuition behind the nature of the embeddings and hyperbolic space through visualizations.

1. Introduction¶

1.1 Concept and use-case¶

Poincaré embeddings are a method to learn vector representations of nodes in a graph. The input data is of the form of a list of relations (edges) between nodes, and the model tries to learn representations such that the vectors for the nodes accurately represent the distances between them.

The learnt embeddings capture notions of both hierarchy and similarity - similarity by placing connected nodes close to each other and unconnected nodes far from each other; hierarchy by placing nodes lower in the hierarchy farther from the origin, i.e. with higher norms.

The paper uses this model to learn embeddings of nodes in the WordNet noun hierarchy, and evaluates these on 3 tasks - reconstruction, link prediction and lexical entailment, which are described in the section on evaluation. We have compared the results of our Poincaré model implementation on these tasks to other open-source implementations and the results mentioned in the paper.

The paper also describes a variant of the Poincaré model to learn embeddings of nodes in a symmetric graph, unlike the WordNet noun hierarchy, which is directed and asymmetric. The datasets used in the paper for this model are scientific collaboration networks, in which the nodes are researchers and an edge represents that the two researchers have co-authored a paper.

This variant has not been implemented yet, and is therefore not a part of our experiments.

1.2 Motivation¶

The main innovation here is that these embeddings are learnt in hyperbolic space, as opposed to the commonly used Euclidean space. The reason behind this is that hyperbolic space is more suitable for capturing any hierarchical information inherently present in the graph. Embedding nodes into a Euclidean space while preserving the distance between the nodes usually requires a very high number of dimensions. A simple illustration of this can be seen below -

Example tree

Here, the positions of nodes represent their position in 2-D euclidean space. Ideally, the distances between the nodes (A, D) should be the same as that between (D, H) and as that between H and its child nodes. Similarly, all the child nodes of H must be equally far away from node A. It becomes progressively hard to accurately preserve these distances in Euclidean space as the degree and depth of the tree grows larger. Hierarchical structures may also have cross-connections (effectively a directed graph).

There is no representation of this simple tree in 2-dimensional Euclidean space which can reflect these distances correctly. This can be solved by adding more dimensions, but this becomes computationally infeasible as the number of required dimensions grows exponentially. Hyperbolic space is a metric space in which distances aren't straight lines - they are curves, and this allows such tree-like hierarchical structures to have a representation that captures the distances more accurately even in low dimensions.

2. Training the embedding¶

In [1]:

# Import required modules and train an example embedding
% cd ../..

/home/jayant/projects/gensim

In [2]:

%load_ext autoreload   
%autoreload 2

import os
import logging
import numpy as np

from gensim.models.poincare import PoincareModel, PoincareKeyedVectors, PoincareRelations

logging.basicConfig(level=logging.INFO)

poincare_directory = os.path.join(os.getcwd(), 'docs', 'notebooks', 'poincare')
data_directory = os.path.join(poincare_directory, 'data')
wordnet_mammal_file = os.path.join(data_directory, 'wordnet_mammal_hypernyms.tsv')

The model can be initialized using an iterable of relations, where a relation is simply a pair of nodes -

In [3]:

model = PoincareModel(train_data=[('node.1', 'node.2'), ('node.2', 'node.3')])

INFO:gensim.models.poincare:Loading relations from train data..
INFO:gensim.models.poincare:Loaded 2 relations from train data, 3 nodes

The model can also be initialized from a csv-like file containing one relation per line. The module provides a convenience class PoincareRelations to do so.

In [34]:

relations = PoincareRelations(file_path=wordnet_mammal_file, delimiter='\t')
model = PoincareModel(train_data=relations)

INFO:gensim.models.poincare:Loading relations from train data..
INFO:gensim.models.poincare:Loaded 7724 relations from train data, 1182 unique terms

Note that the above only initializes the model and does not begin training. To train the model -

In [35]:

model = PoincareModel(train_data=relations, size=2, burn_in=0)
model.train(epochs=1, print_every=500)

INFO:gensim.models.poincare:Loading relations from train data..
INFO:gensim.models.poincare:Loaded 7724 relations from train data, 1182 unique terms
INFO:gensim.models.poincare:training model of size 2 with 1 workers on 7724 relations for 1 epochs and 0 burn-in epochs, using lr=0.10000 burn-in lr=0.01000 negative=10
INFO:gensim.models.poincare:Starting training (1 epochs)----------------------------------------
INFO:gensim.models.poincare:Training on epoch 1, examples #4990-#5000, loss: 23.57
INFO:gensim.models.poincare:Time taken for 5000 examples: 0.47 s, 10562.18 examples / s
INFO:gensim.models.poincare:Training finished

The same model can be trained further on more epochs in case the user decides that the model hasn't converged yet.

In [36]:

model.train(epochs=1, print_every=500)

INFO:gensim.models.poincare:training model of size 2 with 1 workers on 7724 relations for 1 epochs and 0 burn-in epochs, using lr=0.10000 burn-in lr=0.01000 negative=10
INFO:gensim.models.poincare:Starting training (1 epochs)----------------------------------------
INFO:gensim.models.poincare:Training on epoch 1, examples #4990-#5000, loss: 21.98
INFO:gensim.models.poincare:Time taken for 5000 examples: 0.48 s, 10442.40 examples / s
INFO:gensim.models.poincare:Training finished

The model can be saved and loaded using two different methods -

In [5]:

# Saves the entire PoincareModel instance, the loaded model can be trained further
model.save('/tmp/test_model')
PoincareModel.load('/tmp/test_model')

INFO:gensim.utils:saving PoincareModel object under /tmp/test_model, separately None
INFO:gensim.utils:saved /tmp/test_model
INFO:gensim.utils:loading PoincareModel object from /tmp/test_model
INFO:gensim.utils:loading kv recursively from /tmp/test_model.kv.* with mmap=None
INFO:gensim.utils:loaded /tmp/test_model

Out[5]:

<gensim.models.poincare.PoincareModel at 0x7f82ec108668>

In [7]:

# Saves only the vectors from the PoincareModel instance, in the commonly used word2vec format
model.kv.save_word2vec_format('/tmp/test_vectors')
PoincareKeyedVectors.load_word2vec_format('/tmp/test_vectors')

INFO:gensim.models.keyedvectors:storing 3x50 projection weights into /tmp/test_vectors
INFO:gensim.models.keyedvectors:loading projection weights from /tmp/test_vectors
INFO:gensim.models.keyedvectors:loaded (3, 50) matrix from /tmp/test_vectors

Out[7]:

<gensim.models.poincare.PoincareKeyedVectors at 0x7f82d03c6588>

3. What the embedding can be used for¶

In [4]:

# Load an example model
models_directory = os.path.join(poincare_directory, 'models')
test_model_path = os.path.join(models_directory, 'gensim_model_batch_size_10_burn_in_0_epochs_50_neg_20_dim_50')
model = PoincareModel.load(test_model_path)

INFO:gensim.utils:loading PoincareModel object from /home/jayant/projects/gensim/docs/notebooks/poincare/models/gensim_model_batch_size_10_burn_in_0_epochs_50_neg_20_dim_50
INFO:gensim.utils:loading kv recursively from /home/jayant/projects/gensim/docs/notebooks/poincare/models/gensim_model_batch_size_10_burn_in_0_epochs_50_neg_20_dim_50.kv.* with mmap=None
INFO:gensim.utils:loaded /home/jayant/projects/gensim/docs/notebooks/poincare/models/gensim_model_batch_size_10_burn_in_0_epochs_50_neg_20_dim_50

The learnt representations can be used to perform various kinds of useful operations. This section is split into two - some simple operations that are directly mentioned in the paper, as well as some experimental operations that are hinted at, and might require more work to refine.

The models that are used in this section have been trained on the transitive closure of the WordNet hypernym graph. The transitive closure is the list of all the direct and indirect hypernyms in the WordNet graph. An example of a direct hypernym is (seat.n.03, furniture.n.01) while an example of an indirect hypernym is (seat.n.03, physical_entity.n.01).

3.1 Simple operations¶

All the following operations are based simply on the notion of distance between two nodes in hyperbolic space.

In [11]:

# Distance between any two nodes
model.kv.distance('plant.n.02', 'tree.n.01')

Out[11]:

2.9232418343441235

In [12]:

model.kv.distance('plant.n.02', 'animal.n.01')

Out[12]:

5.5111423377921103

In [13]:

# Nodes most similar to a given input node
model.kv.most_similar('electricity.n.01')

Out[13]:

[('phenomenon.n.01', 2.0296901412261614),
 ('natural_phenomenon.n.01', 2.1052921648852934),
 ('physical_phenomenon.n.01', 2.1084626073820045),
 ('photoelectricity.n.01', 2.4527217652991005),
 ('piezoelectricity.n.01', 2.4687111939575397),
 ('galvanism.n.01', 2.9496409087300357),
 ('cloud.n.02', 3.164090455102602),
 ('electrical_phenomenon.n.01', 3.2563741920630225),
 ('pressure.n.01', 3.3063009504377368),
 ('atmospheric_phenomenon.n.01', 3.313970950348909)]

In [14]:

model.kv.most_similar('man.n.01')

Out[14]:

[('male.n.02', 1.725430794111438),
 ('physical_entity.n.01', 3.5532684790327624),
 ('whole.n.02', 3.5663516391532815),
 ('object.n.01', 3.5885342299888077),
 ('adult.n.01', 3.6422291495399124),
 ('organism.n.01', 4.096498630105297),
 ('causal_agent.n.01', 4.127447093914292),
 ('living_thing.n.01', 4.198756842588067),
 ('person.n.01', 4.371831459784078),
 ('lawyer.n.01', 4.581830548066727)]

In [15]:

# Nodes closer to node 1 than node 2 is from node 1
model.kv.nodes_closer_than('dog.n.01', 'carnivore.n.01')

Out[15]:

['domestic_animal.n.01',
 'canine.n.02',
 'terrier.n.01',
 'hunting_dog.n.01',
 'hound.n.01']

In [16]:

# Rank of distance of node 2 from node 1 in relation to distances of all nodes from node 1
model.kv.rank('dog.n.01', 'carnivore.n.01')

Out[16]:

In [11]:

# Finding Poincare distance between input vectors
vector_1 = np.random.uniform(size=(100,))
vector_2 = np.random.uniform(size=(100,))
vectors_multiple = np.random.uniform(size=(5, 100))

# Distance between vector_1 and vector_2
print(PoincareKeyedVectors.vector_distance(vector_1, vector_2))
# Distance between vector_1 and each vector in vectors_multiple
print(PoincareKeyedVectors.vector_distance_batch(vector_1, vectors_multiple))

0.24618276804
[ 0.20492232  0.21622492  0.22568267  0.20813361  0.26086168]

3.2 Experimental operations¶

These operations are based on the notion that the norm of a vector represents its hierarchical position. Leaf nodes typically tend to have the highest norms, and as we move up the hierarchy, the norm decreases, with the root node being close to the center (or origin).

In [17]:

# Closest child node
model.kv.closest_child('person.n.01')

Out[17]:

'writer.n.01'

In [18]:

# Closest parent node
model.kv.closest_parent('person.n.01')

Out[18]:

'causal_agent.n.01'

In [13]:

# Position in hierarchy - lower values represent that the node is higher in the hierarchy
print(model.kv.norm('person.n.01'))
print(model.kv.norm('teacher.n.01'))

0.940798238231
0.967868985302

In [14]:

# Difference in hierarchy between the first node and the second node
# Positive values indicate the first node is higher in the hierarchy
print(model.kv.difference_in_hierarchy('person.n.01', 'teacher.n.01'))

0.027070747071

In [19]:

# One possible descendant chain
model.kv.descendants('mammal.n.01')

Out[19]:

['carnivore.n.01',
 'dog.n.01',
 'hunting_dog.n.01',
 'terrier.n.01',
 'sporting_dog.n.01']

In [20]:

# One possible ancestor chain
model.kv.ancestors('dog.n.01')

Out[20]:

['canine.n.02',
 'domestic_animal.n.01',
 'placental.n.01',
 'ungulate.n.01',
 'chordate.n.01',
 'animal.n.01',
 'physical_entity.n.01']

Note that the chains are not symmetric - while descending to the closest child recursively, starting with mammal, the closest child of carnivore is dog, however, while ascending from dog to the closest parent, the closest parent to dog is canine.

This is despite the fact that Poincaré distance is symmetric (like any distance in a metric space). The asymmetry stems from the fact that even if node Y is the closest node to node X amongst all nodes with a higher norm (lower in the hierarchy) than X, node X may not be the closest node to node Y amongst all the nodes with a lower norm (higher in the hierarchy) than Y.

4. Evaluation¶

The following section presents the results of Poincaré models trained using some open-source implementations of the model, as well as the Gensim implementation. Original results as mentioned in the paper are also provided for reference.

The following two external, open-source implementations are used -

C++
Numpy

This is the list of tasks from the paper that have been evaluated -

WordNet reconstruction
WordNet link prediction
Lexical entailment on HyperLex

4.1 WordNet Reconstruction¶

For this task, embeddings are learnt using the entire transitive closure of the WordNet noun hypernym hierarchy. Subsequently, for every hypernym pair (u, v), the rank of v amongst all nodes that do not have a positive edge with v is computed. The final metric mean_rank is the average of all these ranks. The MAP metric is the mean of the Average Precision of the rankings for all positive nodes for a given node u.

Note that this task tests representation capacity of the learnt embeddings, and not the generalization ability.

The prefix in the model names denote -

cpp_: Model trained using open source C++ implementation
numpy_: Model trained using open source Numpy implementation
gensim_: Model trained using Gensim implementation

The rest of the model name contains information about the model hyperparameters used for training.

For more details about the exact evaluation method, and to reproduce the results described above, refer to the detailed evaluation notebook.

Our results -

Reconstruction Results

Results from the paper - Reconstruction Results

The figures above illustrate a few things -

The gensim implementation does significantly better for all model sizes and hyperparameters than both the other implementations.
The results from the original paper have not been achieved completely. Especially for models with lower dimensions, the paper mentions significantly better mean rank and MAP for the reconstruction task.
Using a higher number of negatives or a different batch size does not have a consistently significant impact on the results.
Using burn-in leads to better results with low model sizes, however the results do not improve significantly with increasing model size.

4.2 Link Prediction¶

This task is similar to the reconstruction task described above, except that the list of relations is split into a training and testing set, and the mean rank reported is for the edges in the test set.

Therefore, this tests the ability of the model to predict unseen edges between nodes, i.e. generalization ability, as opposed to the representation capacity tested in the Reconstruction task

Our results -

Link Prediction Eval

Link Prediction Paper

These results follow similar trends to the reconstruction results. Specifically -

The gensim implementation does significantly better for all model sizes and hyperparameters than both the other implementations.
The results from the original paper have not been achieved completely. Especially for models with lower dimensions, the paper mentions significantly better mean rank and MAP for the reconstruction task.
Using a higher number of negatives or a different batch size does not have a consistently significant impact on the results.
Using burn-in leads to better results with low model sizes, however the results do not improve significantly with increasing model size.

The main difference from the reconstruction results is that mean ranks for link prediction are slightly worse most of the time than the corresponding reconstruction results. This is to be expected, as link prediction is performed on a held-out test set.

4.3 Lexical Entailment¶

The Lexical Entailment task is performed using the HyperLex dataset, a collection of 2163 noun pairs and scores that denote "To what degree is noun A a type of noun Y". For example -

girl person 9.85

These scores are out of 10.

The spearman's correlation score is computed for the predicted and actual similarity scores, with the models trained on the entire WordNet noun hierarchy.

Our results - LE Results

Results from paper (for Poincaré Embeddings, as well as other embeddings from previous papers) - LE Results

Some observations -

We achieve a max spearman score of 0.48, fairly close to the spearman score of 0.512 mentioned in the paper.
The best results are obtained with 20 negative examples, a batch size of 10, and no burn-in, however the differences are too low to make a meaningful conclusion.

However, there are a few ambiguities and caveats -

The paper does not mention which hyperparameters and model size have been used for the above mentioned result. Hence it is possible that the results are achieved with a significantly lower model size than the one we use, which would imply that our implementation still has some way to go.
The same word can have multiple nodes in the WordNet dataset for different senses of the word, and it is unclear in the paper how to decide which node to pick. For the above results, we have gone with the sane default of picking the particular sense that has the maximum similarity score with the target word.
Certain words in the HyperLex dataset seem to be absent from the WordNet data - the paper does not mention any such thing. Pairs containing missing words have been omitted from the evaluation (182/2163).

4.4 Link Prediction on Collaboration Networks¶

The paper also describes a variant of the Poincaré model to learn embeddings of nodes in a symmetric graph, unlike the WordNet noun hierarchy, which is directed and asymmetric. The datasets used in the paper for this model are scientific collaboration networks, in which the nodes are researchers and an edge represents that the two researchers have co-authored a paper.

This variant has not been implemented yet, and is therefore not a part of our experiments.

5. Visualization¶

The paper presents a visual representation of a 2-D model trained on the mammals subtree of the WordNet noun hierarchy. This is a useful tool to get an intuitive sense of the model, and is also helpful for the purposes of debugging where the model could be learning incorrect representations. Visualizations for some models are presented below.

In [37]:

import pickle
from plotly.offline import init_notebook_mode, iplot
from gensim.viz.poincare import poincare_2d_visualization, poincare_distance_heatmap

init_notebook_mode(connected=True)

show_node_labels = [
    'mammal.n.01', 'placental.n.01', 'ungulate.n.01', 'carnivore.n.01', 'rodent.n.01',
    'canine.n.02', 'even-toed_ungulate.n.01', 'odd-toed_ungulate.n.01', 'elephant.n.01',
    'rhinoceros.n.01', 'german_shepherd.n.01', 'feline.n.01', 'tiger.n.02', 'homo_sapiens.n.01']

In [38]:

tree = pickle.load(open(os.path.join(poincare_directory, 'data', 'mammal_tree.pkl'), 'rb'))

In [39]:

model = PoincareModel.load(os.path.join(models_directory, 'gensim_mammals_epochs_50_dim_2'))
figure_title = """
<b>2-D Visualization of model trained on mammals subtree</b><br>
50 epochs, model hasn't converged"""
iplot(poincare_2d_visualization(model, tree, figure_title, show_node_labels=show_node_labels))

INFO:gensim.utils:loading PoincareModel object from /home/jayant/projects/gensim/docs/notebooks/poincare/models/gensim_mammals_epochs_50_dim_2
INFO:gensim.utils:loading kv recursively from /home/jayant/projects/gensim/docs/notebooks/poincare/models/gensim_mammals_epochs_50_dim_2.kv.* with mmap=None
INFO:gensim.utils:loaded /home/jayant/projects/gensim/docs/notebooks/poincare/models/gensim_mammals_epochs_50_dim_2

In [40]:

model = PoincareModel.load(os.path.join(models_directory, 'gensim_mammals_epochs_200_dim_2'))
figure_title = """
<b>2-D Visualization of model trained on mammals subtree</b><br>
200 epochs, model is closer to convergence"""
iplot(poincare_2d_visualization(model, tree, figure_title, show_node_labels=show_node_labels))

INFO:gensim.utils:loading PoincareModel object from /home/jayant/projects/gensim/docs/notebooks/poincare/models/gensim_mammals_epochs_200_dim_2
INFO:gensim.utils:loading kv recursively from /home/jayant/projects/gensim/docs/notebooks/poincare/models/gensim_mammals_epochs_200_dim_2.kv.* with mmap=None
INFO:gensim.utils:loaded /home/jayant/projects/gensim/docs/notebooks/poincare/models/gensim_mammals_epochs_200_dim_2

This is slightly different from the representation shown in the paper. Some key differences are -

Some of the nodes fairly high in the hierarchy (carnivore, canine, odd-toed ungulate) are much closer to the boundary than in the paper. As a result, it is likely that distances from them to certain positive nodes will be higher than they should be.
Some nodes very close to the boundary have an edge with a node which is also very close to the boundary, but on the other side of the disk. This is certainly an incorrect placement, as the distance between these two nodes is going be much higher.

Note that the actual distance between two nodes is not the same as the "naked-eye distance" seen in the above representation. The actual distance is the Poincaré distance, whereas the "naked-eye distance" is the Euclidean distance.

To get a better sense of Poincaré distance, a visualization is presented below.

In [41]:

iplot(poincare_distance_heatmap([0.0, 0.0]))

In [42]:

iplot(poincare_distance_heatmap([0.5, 0.5]))

In [43]:

iplot(poincare_distance_heatmap([0.7, 0.7]))

Some interesting things to note here are -

Distances increase much more quickly when the origin point is closer to the boundary - this can be seen visually from the fact that the darker region around the origin point grows smaller and smaller (though the changing scales make this slightly error-prone, the fact still holds).

This is also borne out from the mathematical formula for Poincare distance.

For points close to the boundary, distance to other points close to the boundary is relatively low to points with a low euclidean distance, but then increases significantly and saturates - this explains the positioning of children of parent node close to the boundary in a small arc around the parent node, and the presence of all negative nodes to the parent node (nodes without an edge to the parent node) outside of that small arc.

6. Next steps¶

The model can be investigated further to understand why it doesn't produce results as good as the paper. It is possible that this might be due to training details not present in the paper, or due to us incorrectly interpreting some ambiguous parts of the paper. We have not been able to clarify all such ambiguitities in communication with the authors.
Optimizing the training process further - with a model size of 50 dimensions and a dataset with ~700k relations and ~80k nodes, the Gensim implementation takes around 70 seconds to complete an epoch (~10k relations per second), whereas the open source C++ implementation takes around 1/10th the time (~95k relations per second).
Implementing the variant of the model mentioned in the paper for symmetric graphs and evaluating on the scientific collaboration datasets described earlier in the report.