Notebook

Relational Learning with TensorFlow¶

Brian Jones, FireEye Inc.
March 2016
Accompanying code: GitHub site

Table of Contents¶

Relational Learning
Loading the Data
Preprocessing
Background: Matrix Factorization
Tensor Decomposition
Optimization
Max-Norm Regularization
Loss Functions
Bilinear Model (RESCAL)
TransE
Take It For a Spin
Visualization
References

Like many machine learning researchers, I recently decided to take some time to learn Google's new [TensorFlow](http://www.tensorflow.org/) framework. For those who have not taken a look yet, TensorFlow provides a well-designed API in both C and Python for building and training graph-based models like deep neural networks. Most of the existing introductory material focuses on computer vision and sequence modeling, so I thought I'd take a different route and use it to build statistical relational learning models. I turned that experience into this tutorial in case it is useful for others who are also learning the framework or interested in the topic.

In this tutorial we'll build three different types of learning models, implement a max-norm constraint regularization technique, and evaluate them on a popular dataset

Before diving in, make sure you've gone over the background material provided on the official site, as I won't be covering the absolute basics. In addition to TensorFlow, we'll also be using some standard Python numerical libraries: NumPy and Pandas. Finally, we'll make use of some utilities that can be found in the accompanying code for this tutorial.

Relational Learning¶

Relational learning models try to predict unseen or future relationships between entities based on past observations. Item recommendation is a common example, where the model predicts how much a person will enjoy a particular item based on past rating (or consumption) data collected from a user base. A similar task is link prediction, where the goal is to predict new edges, or edge values, in a graph that are likely to be observed in the future. When applied to a knowledge database like WordNet or FreeBase, it is sometimes used to suggest missing facts for "knowledge base completion".

link prediction

Many approaches have been developed for relational learning, and here we will focus on a subset called latent factor models. While TensorFlow is primarily geared towards neural network construction with layers of weights and nonlinear activations, it is also a nice framework for building latent factor models and even offers some specific functionality for them with its "embedding" methods.

What makes frameworks like TensorFlow , Theano, and Torch so great for productivity is their automatic differentiation feature, which calculates all of the partial derivatives necessary for running optimization procedures like stochastic gradient descent (SGD). One simply needs to declare the computational graph structure and the framework takes care of all of the messy details for backpropagation. While calculating derivatives by hand can be fun (for some!), having to do it for every model variation you are developing is incredibly time-consuming.

Let's get started!

Loading the Data¶

For this tutorial, we'll be using a subset of the WordNet database that is popular for benchmarking relational learning models [Bordes14]. You can download it here. WordNet is a large knowledge base of English words, their definitions, and their relationships to each other. This particular dataset contains one file with word definitions, and three files with relationships for standard train / validate / test purposes. The code below will load the data from text files into Pandas DataFrames objects, and also replaces the identifier numbers with actual words to make them more readable.

In [46]:

import os
import numpy as np
import pandas as pd
import tensorflow as tf
from IPython.display import display

from tf_rl_tutorial import util

def read_and_replace(fpath, def_df):
    df = pd.read_table(fpath, names=['head', 'rel', 'tail'])
    df['head'] = def_df.loc[df['head']]['word'].values
    df['tail'] = def_df.loc[df['tail']]['word'].values
    return df

data_dir = '~/data/wordnet-mlj12' # change to where you extracted the data
definitions = pd.read_table(os.path.join(data_dir, 'wordnet-mlj12-definitions.txt'), 
                            index_col=0, names=['word', 'definition'])
train = read_and_replace(os.path.join(data_dir, 'wordnet-mlj12-train.txt'), definitions)
val = read_and_replace(os.path.join(data_dir, 'wordnet-mlj12-valid.txt'), definitions)
test = read_and_replace(os.path.join(data_dir, 'wordnet-mlj12-test.txt'), definitions)

print('Train shape:', train.shape)
print('Validation shape:', val.shape)
print('Test shape:', test.shape)
all_train_entities = set(train['head']).union(train['tail'])
print('Training entity count: {}'.format(len(all_train_entities)))
print('Training relationship type count: {}'.format(len(set(train['rel']))))
print('Example training triples:')
display(train.sample(5))

Train shape: (141442, 3)
Validation shape: (5000, 3)
Test shape: (5000, 3)
Training entity count: 40943
Training relationship type count: 18
Example training triples:

	head	rel	tail
9565	__incinerate_VB_2	_verb_group	__incinerate_VB_1
132173	__outside_door_NN_1	_has_part	__panelling_NN_1
108291	__negative_JJ_2	_derivationally_related_form	__negativity_NN_2
82581	__typography_NN_1	_derivationally_related_form	__typographer_NN_1
127266	__dominican_republic_NN_1	_part_of	__hispaniola_NN_1

As we can see, this dataset contains relationship triples where the left-hand side (head) and right-hand side (tail) entries are things (or concepts) and the middle entry is a relationship between them. It contains 18 different relationship types, making this a multi-relational dataset. The extra suffix on each word serves to differentiate between homographs (e.g. fish the animal vs. fish the verb).

Preprocessing¶

The validation and test sets also contain true statements in the form shown above. One thing I noticed is that most of the triples in these two sets are mirror images of triples in the training set. This is due to most relationships having an opposite counterpart, for example "fishing_rod has_part reel" and "reel part_of fishing_rod". I think this makes the problem a bit too easy, so let's remove the mirror-image triples from the training set. We'll also remove triples in the validation and test sets where the head or tail entity is not present in that position in the training set, because our first model will not be able to handle those.

In [47]:

from collections import defaultdict

mask = np.zeros(len(train)).astype(bool)
lookup = defaultdict(list)
for idx,h,r,t in train.itertuples():
    lookup[(h,t)].append(idx) 
for h,r,t in pd.concat((val,test)).itertuples(index=False):
    mask[lookup[(h,t)]] = True
    mask[lookup[(t,h)]] = True
train = train.loc[~mask]
heads, tails = set(train['head']), set(train['tail'])
val = val.loc[val['head'].isin(heads) & val['tail'].isin(tails)]
test = test.loc[test['head'].isin(heads) & test['tail'].isin(tails)]
print('Train shape:', train.shape)
print('Validation shape:', val.shape)
print('Test shape:', test.shape)

Train shape: (131974, 3)
Validation shape: (4389, 3)
Test shape: (4394, 3)

Now let's add some false statements to the validation and test sets to make this a classification problem, as was done in [Socher13]. For each true statement, we'll corrupt it by replacing either the head or tail entity with a random one (checking to make sure that the resulting triple isn't in our data). Some code to create these true/false pairs is provided in the util module function create_tf_pairs().

In [48]:

rng = np.random.RandomState(42)
combined_df = pd.concat((train, val, test))
val = util.create_tf_pairs(val, combined_df, rng)
test = util.create_tf_pairs(test, combined_df, rng)
print('Validation shape:', val.shape)
print('Test shape:', test.shape)

Validation shape: (8778, 4)
Test shape: (8788, 4)

To see what kind of prediction task we're up against, let's examine the training and test data for relationships involving the entity "brain cell":

In [49]:

example_entity = '__brain_cell_NN_1'
example_train_rows = (train['head'] == example_entity) | (train['tail'] == example_entity)
print('Train:')
display(train.loc[example_train_rows])
example_test_rows = (test['head'] == example_entity) | (test['tail'] == example_entity)
print('Test:')
display(test.loc[example_test_rows])

Train:

	head	rel	tail
1409	__encephalon_NN_1	_has_part	__brain_cell_NN_1
22089	__brain_cell_NN_1	_hyponym	__golgi_cell_NN_1
44047	__golgi_cell_NN_1	_hypernym	__brain_cell_NN_1
59003	__brain_cell_NN_1	_part_of	__encephalon_NN_1

Test:

	head	rel	tail	truth_flag
4320	__neuron_NN_1	_hyponym	__brain_cell_NN_1	True
4321	__lobbyist_NN_1	_hyponym	__brain_cell_NN_1	False
4505	__spice_up_VB_2	_derivationally_related_form	__brain_cell_NN_1	False

We want to build models that can learn "a neuron is a type of brain cell", and "a lobbyist is not a type of brain cell", having never seen those statements before.

Background: Matrix Factorization¶

Before describing the multi-relational models we will build, it is useful to consider a popular single-relationship modeling technique called matrix factorization. This approach became very popular during the $1M Netflix Prize movie recommendation contest, as it was one of the primary techniques used by the top teams [Koren09].

In a factorization model, we will represent our training data for a single relationship type as a matrix, $\textbf{X}$. Each row in $\textbf{X}$ corresponds to an entity appearing in the head of the relationships, each column a tail entity, and the matrix entries contain the values for all possible pairings.

For movie ratings each row would be a person, each column a movie, the values integers between 1-5, and the matrix only partially observed (each person has not rated every movie). Our training data consists of true statements only, which raises the issue of how we will represent all other possible statements. This is known as the open vs. closed world assumption, and for now we will go with the latter and treat all other statements as false. This is clearly not perfect, as some true statements have been intentionally withheld in the validation and test sets, and knowledge bases like WordNet are seldom complete. In some previous work the unobserved entries are weighted less to account for this [Hu08].

A single matrix is sufficient for datasets with only one relationship type, but for multi-relational data like we are handling with WordNet we would need to do this for each of the 18 relationship types. Here is what part of our matrix looks like for the relationship type has_part:

In [11]:

has_part_triples = val.loc[val['rel'] == '_has_part']
query_entities = ['__noaa_NN_1', '__vascular_plant_NN_1', '__retina_NN_1']
has_part_example = has_part_triples.loc[has_part_triples['head'].isin(query_entities)]
matrix_view = pd.pivot_table(has_part_example, 'truth_flag', 'head', 'tail', 
                             fill_value=False).astype(int)
display(matrix_view)

tail	__acculturate_VB_1	__national_weather_service_NN_1	__rod_cell_NN_1	__singapore_NN_2	__vascular_tissue_NN_1
head
__noaa_NN_1	0	1	0	0	0
__retina_NN_1	0	0	1	0	0
__vascular_plant_NN_1	0	0	0	0	1

The factorization model approximates the observation matrix $\textbf{X}$ as the product of two matrices. We'll call them $\textbf{H}$ and $\textbf{T}$ for "head" and "tail". If $\textbf{X}$ contains $I$ rows and $J$ columns, then $\textbf{H} = [h_1, h_2, ..., h_I] \in \unicode{x211D}^{DxI}$ and $\textbf{T} = [t_1, t_2, ..., t_J]\in \unicode{x211D}^{DxJ}$. $D$ is typically much smaller than $I$ or $J$.

$$ \textbf{X} \approx \textbf{H}^T \textbf{T} $$

matrix factorization

For relational learning, one can think of the matrix $\textbf{H}$ as containing latent factor vectors in its columns for the entities appearing in the head entry of our relationship triples and likewise for $\textbf{T}$ and the tail entities. This is the same concept that has become popular in word2vec models [Mikolov13], where the vector representation is instead called an embedding.

The model output for the relationship between head entity $i$ and tail entity $j$ is the dot product of their latent vectors:

$$ f(i,j) = h_i^T t_j = \sum_{d=1}^D \textbf{H}_{di} \textbf{T}_{dj} $$

The figure below contains a simple geometric illustration in a D=2 space for some of the words shown in the has_part matrix above. Since $ h_i^T t_j = \lVert h_i \rVert \lVert t_j \rVert cos \left( \theta_{ij} \right)$, the model score for (retina, has_part, singapore) will be 0 due to the vectors being orthogonal, whereas the score for (retina, has_part, rod_cell) will be positive and based on the length of each vector and their angle $\theta$.

dot product

A common optimization objective for this model is to find latent vectors for all head and tail entities such that the total squared approximation error is minimized. This is similar to a truncated SVD but without orthogonality constraints:

$$ min_{\textbf{H},\textbf{T}} \ \lVert \textbf{X} - \textbf{H}^T \textbf{T} \rVert_{Fro}^{2} \ = \ min_{\textbf{H},\textbf{T}} \ \sum_{ij} \left( \textbf{X}_{ij} - h_i^T t_j \right) ^2 $$

Note that squared error is not the only possible choice, and regularization is often important to control model complexity. We will explore both of these topics as we move through the tutorial.

Implementing matrix factorization in TensorFlow is pretty simple, but we're going to hold off because we'd rather address multi-relational learning, and it is also a special case of our next model.

Tensor Decomposition¶

A major limitation of the matrix factorization described above is that it only applies to a single relationship type at a time. For a multi-relational dataset like WordNet, this results in 18 independent models, and anything one model "learns" cannot influence the others.

One way to encourage knowledge transfer across relationships is to stack their matrices into a 3D array called a tensor. We can then build one factorization model for this tensor which shares the head and tail entity representations across all relationships.

A natural fit for what we have described is the CANDECOMP / PARAFAC (CP) tensor decomposition [Hitchcock27] [Kolda09]. We will add a new factor matrix for our relationship types, $\textbf{R} = [r_1, r_2, ..., r_K] \in \unicode{x211D}^{DxK}$, and CP then approximates the target tensor $\bf{X}$ as the sum of $D$ rank-one tensors, each formed by the outer product of one dimension from the factor matrices.

$$ \textbf{X} \approx \sum_{d=1}^{D} \textbf{H}_{d,:} \circ \textbf{T}_{d,:} \circ \textbf{R}_{d,:} $$

where $\circ$ denotes the outer product, and the subscript $d,:$ denotes a matrix row in Matlab/NumPy slice syntax. The model output for a single cell is similar to matrix factorization, but now each entry in the dot product between head and tail is also multiplied by the corresponding entry in the relationship's vector:

$$ f(i,j,k) = \sum_{d=1}^{D} \textbf{H}_{di} \textbf{T}_{dj} \textbf{R}_{dk} $$

or as a matrix multiplication:

$$ f(i,j,k) = h_i^T diag(r_k) \ t_j $$

where $diag(r_k)$ denotes a $D \ x \ D$ matrix with the elements from vector $r_k$ on the diagonal.

Also as before, when the decomposition is not full-rank a common objective is least-squares approximation:

$$ min_{\textbf{H},\textbf{T},\textbf{R}} \ \sum_{ijk} \left( \textbf{X}_{ijk} - f(i,j,k) \right)^2 $$

Defining a model like this in TensorFlow is straightforward. First, we create placeholders for the triples that the model takes as input, as well as the target (desired output) placeholder for each input triple during training. In our case the target values are boolean and will be represented by 1 and 0. Using None as the dimension size means that it will be determined at runtime by the size of the mini-batch passed in for each training step.

In [12]:

graph = tf.Graph()
with graph.as_default():
    # input and target placeholders
    head_input = tf.placeholder(tf.int32, shape=[None])
    rel_input = tf.placeholder(tf.int32, shape=[None])
    tail_input = tf.placeholder(tf.int32, shape=[None])
    target = tf.placeholder(tf.float32, shape=[None])

CP decomposition requires two separate embeddings for the head and tail entities, even if the same entity appears in both sets (other models do not require this, as we'll see). One way of implementing the input and embedding layer would be to one-hot encode each of our three inputs into long vectors where only the entry for the correpsonding item is set to 1. These input vectors would then be connected to the embedding layer in a fully-connected fashion as is typical in neural networks, and the weight matrix would contain the embedding for each item. Instead, TensorFlow provides a special method for this use case, tf.nn.embedding_lookup(), which will accomplish the same goal with less code. We'll also need to convert our strings into integers in order to feed them into the input placeholders.

In [13]:

embedding_size = 20
head_cnt = len(set(train['head']))
rel_cnt = len(set(train['rel']))
tail_cnt = len(set(train['tail']))
    
with graph.as_default():
    # embedding variables
    init_sd = 1.0 / np.sqrt(embedding_size)
    head_embedding_vars = tf.Variable(tf.truncated_normal([head_cnt, embedding_size], 
                                                          stddev=init_sd))
    rel_embedding_vars = tf.Variable(tf.truncated_normal([rel_cnt, embedding_size], 
                                                         stddev=init_sd))
    tail_embedding_vars = tf.Variable(tf.truncated_normal([tail_cnt, embedding_size], 
                                                          stddev=init_sd))
    # embedding layer for the (h, r, t) triple being fed in as input
    head_embed = tf.nn.embedding_lookup(head_embedding_vars, head_input)
    rel_embed = tf.nn.embedding_lookup(rel_embedding_vars, rel_input)
    tail_embed = tf.nn.embedding_lookup(tail_embedding_vars, tail_input)
    # CP model output
    output = tf.reduce_sum(tf.mul(tf.mul(head_embed, rel_embed), tail_embed), 1)
    
# TensorFlow requires integer indices
field_categories = (set(train['head']), set(train['rel']), set(train['tail']))
train, train_idx_array = util.make_categorical(train, field_categories)
val, val_idx_array = util.make_categorical(val, field_categories)
test, test_idx_array = util.make_categorical(test, field_categories)

Optimization and Negative Sampling¶

With our model structure defined, we now need to determine how to optimize it on the training data. A popular method for the squared-loss CP model above is alternating least squares, which can efficiently handle the huge number of zero entries typical in a relationship tensor [Kolda09]. We'd like to use the nice SGD optimizers that come with TensorFlow, however, and may also want to use other objective functions where alternating least squares is not appropriate. To do so, we must decide how to form our random training mini-batches.

For this tutorial, we will use an approach that has become popular for training embedding models called negative sampling [Mikolov13] [Bordes11] [Socher13]. We will augment each positive example in the mini-batch by creating a negative counterpart with a corruption operation. Corruption is accomplished by replacing either the head or the tail (but not the relationship type) with a random entity, just as we did to create the negative examples in the validation and test sets. While this method is usually used for logistic loss or pairwise loss functions (which we'll examine soon), we will first give it a shot on squared loss.

Note that we have deviated a bit from the original CP formulation above, as these mini-batches are not a random sample from the overall objective function. Our model will therefore not be exactly the CP decomposition described above, but instead "CP-like".

For brevity I have omitted the ContrastiveTrainingProvider class which provides these training mini-batches, but it can be found in the sample code. Because this model creates a separate embedding for head and tail entities, we will tell the batch provider to honor this by setting separate_head_tail=True. Let's take a look at an example mini-batch it provides:

In [17]:

from tf_rl_tutorial.util import ContrastiveTrainingProvider

batch_provider = ContrastiveTrainingProvider(train_idx_array, batch_pos_cnt=3, 
                                             separate_head_tail=True)
batch_triples, batch_labels = batch_provider.next_batch()
batch_df = pd.DataFrame()
batch_df['head'] = pd.Categorical.from_codes(batch_triples[:,0], train['head'].cat.categories)
batch_df['rel'] = pd.Categorical.from_codes(batch_triples[:,1], train['rel'].cat.categories)
batch_df['tail'] = pd.Categorical.from_codes(batch_triples[:,2], train['tail'].cat.categories)
batch_df['label'] = batch_labels
display(batch_triples)
print('which encodes:')
display(batch_df)

array([[35150,     6, 11760],
       [35150,     6,  9035],
       [ 4505,    17, 27691],
       [ 4505,    17, 10438],
       [  739,    10, 22320],
       [  739,    10, 26591]])

which encodes:

	head	rel	tail	label
0	__tractable_JJ_1	_also_see	__manageable_JJ_1	1
1	__tractable_JJ_1	_also_see	__true_flycatcher_NN_1	0
2	__trash_VB_2	_hypernym	__pick_at_VB_3	1
3	__trash_VB_2	_hypernym	__factor_NN_3	0
4	__wish_well_VB_1	_derivationally_related_form	__wish_NN_2	1
5	__wish_well_VB_1	_derivationally_related_form	__athletics_NN_2	0

We are now ready to train. We'll monitor progress by periodically printing the average loss on both the current training mini-batch and the validation set. We'll also track how many of the contrastive pairs in the validation set are ranked properly, which means that the positive triple receives a higher score than the negative triple. This is accomplished using util.pair_ranking_accuracy().

In [18]:

max_iter = 30000

batch_provider = ContrastiveTrainingProvider(train_idx_array, batch_pos_cnt=100, 
                                             separate_head_tail=True)
opt = tf.train.AdagradOptimizer(1.0)

sess = tf.Session(graph=graph)
with graph.as_default():
    loss = tf.reduce_sum(tf.square(output - target))
    train_step = opt.minimize(loss)
    sess.run(tf.initialize_all_variables())

# feed dict for monitoring progress on validation set
val_labels = np.array(val['truth_flag'], dtype=np.float)
val_feed_dict = {head_input: val_idx_array[:,0],
                 rel_input: val_idx_array[:,1],
                 tail_input: val_idx_array[:,2],
                 target: val_labels}

for i in range(max_iter):
    batch_triples, batch_labels = batch_provider.next_batch()
    feed_dict = {head_input: batch_triples[:,0],
                 rel_input: batch_triples[:,1],
                 tail_input: batch_triples[:,2],
                 target: batch_labels}
    if i % 2000 == 0 or i == (max_iter-1):
        batch_avg_loss = sess.run(loss, feed_dict) / len(batch_labels)
        val_output, val_loss = sess.run((output,loss), val_feed_dict)
        val_avg_loss = val_loss / len(val_labels)
        val_pair_ranking_acc = util.pair_ranking_accuracy(val_output)
        msg = 'iter {}, batch loss: {:.2}, val loss: {:.2}, val pair ranking acc: {:.2}'
        print(msg.format(i, batch_avg_loss, val_avg_loss, val_pair_ranking_acc))
    sess.run(train_step, feed_dict)

sess.close()

iter 0, batch loss: 0.5, val loss: 0.5, val pair ranking acc: 0.51
iter 2000, batch loss: 0.5, val loss: 0.54, val pair ranking acc: 0.49
iter 4000, batch loss: 0.12, val loss: 0.55, val pair ranking acc: 0.55
iter 6000, batch loss: 0.11, val loss: 0.55, val pair ranking acc: 0.56
iter 8000, batch loss: 0.077, val loss: 0.53, val pair ranking acc: 0.57
iter 10000, batch loss: 0.072, val loss: 0.52, val pair ranking acc: 0.58
iter 12000, batch loss: 0.069, val loss: 0.51, val pair ranking acc: 0.58
iter 14000, batch loss: 0.077, val loss: 0.51, val pair ranking acc: 0.59
iter 16000, batch loss: 0.061, val loss: 0.5, val pair ranking acc: 0.6
iter 18000, batch loss: 0.055, val loss: 0.5, val pair ranking acc: 0.6
iter 20000, batch loss: 0.036, val loss: 0.49, val pair ranking acc: 0.6
iter 22000, batch loss: 0.043, val loss: 0.49, val pair ranking acc: 0.61
iter 24000, batch loss: 0.05, val loss: 0.48, val pair ranking acc: 0.62
iter 26000, batch loss: 0.059, val loss: 0.48, val pair ranking acc: 0.62
iter 28000, batch loss: 0.039, val loss: 0.48, val pair ranking acc: 0.62
iter 29999, batch loss: 0.047, val loss: 0.47, val pair ranking acc: 0.63

Not so good. It is achieving a very tight fit on the training set but making little progress on the validation set, a sign of overfitting.

Max-Norm Regularization¶

There are a variety of things we could do here, for example lowering the number of model parameters by reducing the embedding dimension. A popular approach for latent factor models is instead to add a regularization term to our objective function that operates on the model parameters $\Omega$:

$$ min_{\Omega} \ \ Loss+ \lambda Reg(\Omega) $$

The $L_2$ (ridge) penalty is a common choice which encourages the sum of the squares of the embeddings to be small:

$$ Reg_{L_{2}} ( \textbf{H},\textbf{T},\textbf{R} ) = \lVert \textbf{H} \rVert_{Fro}^2 + \lVert \textbf{T} \rVert_{Fro}^2 + \lVert \textbf{R} \rVert_{Fro}^2 $$

We could easily add this form of regularization to our CP model with a single line of code, but it creates a bit of a nuisance: it destroys the nice sparse updates we get in SGD and can slow down the optimization process significantly. This is because SGD subsamples the overall loss using a subset of training examples, and doing this creates 0 gradient and no updates for embeddings not present in the mini-batch (note: this is not true for all optimizers, see caveat below). The regularization term, however, is data-independent and creates a gradient for all embeddings on every SGD step. There are some clever approaches to handle this issue [Bottou12] [Carpenter08], but they would add a bit of extra complexity to our code. Other researchers have opted to only apply the regularization gradient to parameters active in the mini-batch [Koren09].

An alternative way to control model complexity, which will also preserve the sparsity of the SGD gradients, is to instead constrain all embedding vectors to lie within an $L_2$ ball of radius $C$. This is called max-norm regularization, and has been shown to be effective in a variety of settings including relational learning models [Sbrero04] [Srivastava14] [Bordes13]. Our optimization problem is now:

$$ min_{\textbf{H},\textbf{T},\textbf{R}} \ Loss \\ s.t. \\ \forall i \ \ \lVert h_i \rVert <= C, \ \ \forall j \ \ \lVert t_j \rVert <= C, \ \ \forall k \ \ \lVert r_k \rVert <= C $$

We can enforce these constraints by using projected gradient descent: after each step, the embedding vectors that have moved outside of the $L_2$ ball are projected back onto it. If the optimizer only updates embedding vectors with nonzero gradient, the ones present in each mini-batch, we only need to check this small subset after each step. MaxNorm This form of constraint is not directly supported by TensorFlow, but we can implement it by tracking which rows in the embedding matrices could have changed, and using the scatter_update() function to apply sparse projection updates. I'm not sure if this is the best way to accomplish the goal, it's just the one I came up with when learning the TensorFlow API.

Caveat: As of when this tutorial was written, all of the optimizers I tried in TensorFlow ignore variables with zero gradient except AdamOptimizer, which can apply momentum updates to them. The full code example therefore uses a dense projection operation when Adam is used, and an efficient sparse update otherwise.

In [50]:

maxnorm = 1.5

def sparse_maxnorm_update(var_matrix, indices, maxnorm=1.0):
    selected_rows = tf.nn.embedding_lookup(var_matrix, indices)
    row_norms = tf.sqrt(tf.reduce_sum(tf.square(selected_rows), 1))
    scaling = maxnorm / tf.maximum(row_norms, maxnorm)
    scaled = selected_rows * tf.expand_dims(scaling, 1)
    return tf.scatter_update(var_matrix, indices, scaled)

def dense_maxnorm_update(var_matrix, maxnorm=1.0):
    row_norms = tf.sqrt(tf.reduce_sum(tf.square(var_matrix), 1))
    scaling = maxnorm / tf.maximum(row_norms, maxnorm)
    scaled = var_matrix * tf.expand_dims(scaling, 1)
    return tf.assign(var_matrix, scaled)

with graph.as_default():
    # tf.unique used to gather the head, tail, and rel indices that are active in the minibatch
    head_constraint = sparse_maxnorm_update(head_embedding_vars, 
                                            tf.unique(head_input)[0], maxnorm)
    rel_constraint = sparse_maxnorm_update(rel_embedding_vars, 
                                           tf.unique(rel_input)[0], maxnorm)
    tail_constraint = sparse_maxnorm_update(tail_embedding_vars, 
                                            tf.unique(tail_input)[0], maxnorm)
    postprocess_step = [head_constraint, rel_constraint, tail_constraint]

We can now run the max-norm constraint operation after each training step and see if we get a better result:

In [20]:

max_iter = 30000

sess = tf.Session(graph=graph)
with graph.as_default():
    # init variables and ensure constraints are initially satisfied
    sess.run(tf.initialize_all_variables())
    sess.run(dense_maxnorm_update(head_embedding_vars, maxnorm))
    sess.run(dense_maxnorm_update(rel_embedding_vars, maxnorm))
    sess.run(dense_maxnorm_update(tail_embedding_vars, maxnorm))

for i in range(max_iter):
    batch_triples, batch_labels = batch_provider.next_batch()
    feed_dict = {head_input: batch_triples[:,0],
                 rel_input: batch_triples[:,1],
                 tail_input: batch_triples[:,2],
                 target: batch_labels}
    if i % 2000 == 0 or i == (max_iter-1):
        batch_avg_loss = sess.run(loss, feed_dict) / len(batch_labels)
        val_output, val_loss = sess.run((output,loss), val_feed_dict)
        val_avg_loss = val_loss / len(val_labels)
        val_pair_ranking_acc = util.pair_ranking_accuracy(val_output)
        # check on embedding norm constraints
        all_embed = np.vstack(sess.run([head_embedding_vars, 
                                        rel_embedding_vars, 
                                        tail_embedding_vars]))
        norms = np.linalg.norm(all_embed, axis=1)
        msg = 'iter {}, batch loss: {:.2}, val loss: {:.2}, val pair ranking acc: {:.2}, max norm: {:.3}'
        print(msg.format(i, batch_avg_loss, val_avg_loss, val_pair_ranking_acc, np.max(norms)))

    sess.run(train_step, feed_dict)
    sess.run(postprocess_step, feed_dict)

iter 0, batch loss: 0.5, val loss: 0.5, val pair ranking acc: 0.5, max norm: 1.35
iter 2000, batch loss: 0.39, val loss: 0.5, val pair ranking acc: 0.55, max norm: 1.5
iter 4000, batch loss: 0.24, val loss: 0.43, val pair ranking acc: 0.65, max norm: 1.5
iter 6000, batch loss: 0.12, val loss: 0.4, val pair ranking acc: 0.68, max norm: 1.5
iter 8000, batch loss: 0.15, val loss: 0.38, val pair ranking acc: 0.71, max norm: 1.5
iter 10000, batch loss: 0.1, val loss: 0.37, val pair ranking acc: 0.73, max norm: 1.5
iter 12000, batch loss: 0.11, val loss: 0.36, val pair ranking acc: 0.73, max norm: 1.5
iter 14000, batch loss: 0.088, val loss: 0.35, val pair ranking acc: 0.75, max norm: 1.5
iter 16000, batch loss: 0.11, val loss: 0.34, val pair ranking acc: 0.76, max norm: 1.5
iter 18000, batch loss: 0.091, val loss: 0.34, val pair ranking acc: 0.76, max norm: 1.5
iter 20000, batch loss: 0.11, val loss: 0.33, val pair ranking acc: 0.77, max norm: 1.5
iter 22000, batch loss: 0.081, val loss: 0.33, val pair ranking acc: 0.77, max norm: 1.5
iter 24000, batch loss: 0.096, val loss: 0.33, val pair ranking acc: 0.77, max norm: 1.5
iter 26000, batch loss: 0.076, val loss: 0.33, val pair ranking acc: 0.76, max norm: 1.5
iter 28000, batch loss: 0.085, val loss: 0.32, val pair ranking acc: 0.77, max norm: 1.5
iter 29999, batch loss: 0.085, val loss: 0.32, val pair ranking acc: 0.76, max norm: 1.5

While there is still a large gap between training and validation loss, this variant is faring better than its unregularized counterpart. Let's see how it does on the test set.

For evaluation on the test set we will not just report the pair ranking accuracy, but will also turn our model into a boolean classifier and measure prediction accuracy. To do this we'll need to threshold the real-valued model output. As is done in [Socher13], the thresholds for each relationship type will be found on the validation set, and this code is in util.threshold_and_eval().

In [21]:

test_feed_dict = {head_input: test_idx_array[:,0],
                  rel_input: test_idx_array[:,1],
                  tail_input: test_idx_array[:,2]}

test_scores = sess.run(output, test_feed_dict)
val_scores = sess.run(output, val_feed_dict)
acc, pred, scores, thresh_map = util.threshold_and_eval(test, test_scores, 
                                                        val, val_scores)
print('Test set accuracy: {:.2}'.format(acc))
sess.close()

Test set accuracy: 0.71

Is that the best it can do? Probably not, as there are a number of things we could experiment with like the max-norm constraint, embedding dimension, optimizer type, optimizer parameters, alternate loss functions, how many negatives we sample per positive, etc. For now, let's move on and look at some alternative loss functions and more models.

Loss Functions¶

Per-Instance Loss¶

So far we have used squared loss as the objective for model training which implies a normally-distributed likelihood. For binary data, a Bernoulli model is a natural choice and only requires two changes: we squash the output with a sigmoid function, and switch to cross-entropy loss. Another popular choice is the hinge loss from SVM and other max-margin models.

Pairwise Ranking Margin¶

The loss functions above operate on individual positive and negative training triples, and encourage the model to map known positive examples to 1 and all others to 0. The contrastive sampling we are utilizing suggests an alternative objective function we can optimize. Instead of trying to map each positive instance to 1 and its corrupted partner to 0, we could instead ask the model to just try to score the positive instance higher by a certain margin. If this could be done for all contrastive pairs, then our positive training examples would all be ranked above the others.

Instead of attempting to satisfy this constraint exactly, we can optimize the so-called soft margin of hinge-loss penalties in a similar fashion as RankSVM [Joachims06].

$$ RankingLoss\left((i,j,k), (i',j',k)\right) = [\gamma - f(i,j,k) + f(i',j',k) ]_{+} $$

where $(i,j,k)$ is the positive example, $(i',j',k)$ is the corrupted triple where either the head $i$ or the tail $j$ has been modified, $\gamma$ is the margin, and $[x]_{+} = max(0, x)$.

This is where automatic differentiation really shines, as we can easily experiment with these loss function variations with just a few lines of code. Since we've already used squared loss in our first attempt, we'll switch to others in the upcoming models.

In [51]:

def least_squares_objective(output, target, add_bias=True):
    y = output
    if add_bias:
        bias = tf.Variable([0.0])
        y = output + bias
    loss = tf.reduce_sum(tf.square(y - target))
    return y, loss

def logistic_objective(output, target, add_bias=True):
    y = output
    if add_bias:
        bias = tf.Variable([0.0])
        y = output + bias
    squashed_y = tf.clip_by_value(tf.sigmoid(y), 0.001, 0.999) # avoid NaNs
    loss = -tf.reduce_sum(target*tf.log(squashed_y) + (1-target)*tf.log(1-squashed_y))
    return squashed_y, loss

def ranking_margin_objective(output, margin=1.0):
    ''' This only works when given model output on alternating
        positive/negative pairs: [pos,neg,pos,neg,...] '''
    y_pairs = tf.reshape(output, [-1,2]) # fold: 1 x n -> [n/2 x 2]
    pos_scores, neg_scores = tf.split(1, 2, y_pairs) # separate the pairs
    hinge_losses = tf.nn.relu(margin - pos_scores + neg_scores)
    total_hinge_loss = tf.reduce_sum(hinge_losses)
    return output, total_hinge_loss

Bilinear Model (RESCAL)¶

RESCAL [Nickel11] employs two changes to the basic CP decomposition:

Entities have a single embedding space; each entity is mapped to the same embedding vector no matter its position in the relationship triple
Each relationship type is embedded into a larger $D^2$ space which operates as a $D \ x \ D$ bilinear operator on the entity embeddings

The figure below is based on one from the original paper that I think does a nice job depicting this decomposition:

RESCAL

The two separate $\textbf{H}$ and $\textbf{T}$ matrices from before have been replaced with single embedding matrix for all entities, $\textbf{E} = [e_1,e_2,...,e_N] \in \unicode{x211D}^{DxN}$, and the model output for an input triple is:

$$ f(i,j,k) = e_i^T \textbf{R}_k e_j $$

where $\textbf{R}_k$ is a $D \ x \ D$ matrix for relationship type $k$.

This is a more flexible tensor decomposition than CP, as the relationship matrix introduces interaction terms between each entry in the embedding vectors. As shown before, CP can be seen as a special case of this formulation where each relationship type operates as a diagonal scaling matrix instead of a full bilinear one.

Implementing a bilinear model requires only a few changes to our code above. We are going to start doing things a little differently, however, and use a common base class for our models which contains reusable code. Each model only has to implement the _create_model() method to define its structure, training step, and postprocessing step.

We will also deviate a bit from the original RESCAL paper which proposes $L_2$ regularization, and instead stick to our max-norm constraints. A bit of experimentation has shown that it can help to use a larger max-norm for the relationship embeddings, which can be specified using rel_maxnorm_mult. Finally, we'll switch to logistic loss instead of squared loss, and again utilize a contrastive negative sampling approach.

In [23]:

from tf_rl_tutorial.models import BaseModel, dense_maxnorm 

class Bilinear(BaseModel):
    
    def __init__(self, embedding_size, maxnorm=1.0, rel_maxnorm_mult=3.0, 
                 batch_pos_cnt=100, max_iter=1000, 
                 model_type='least_squares', add_bias=True, opt=None):
        super(Bilinear, self).__init__(
            embedding_size=embedding_size,
            maxnorm=maxnorm,
            batch_pos_cnt=batch_pos_cnt,
            max_iter=max_iter,
            model_type=model_type,
            opt=opt)
        self.rel_maxnorm_mult = rel_maxnorm_mult
    
    def _create_model(self, train_triples):
        # Count unique items to determine embedding matrix sizes
        entity_cnt = len(set(train_triples[:,0]).union(train_triples[:,2]))
        rel_cnt = len(set(train_triples[:,1]))
        init_sd = 1.0 / np.sqrt(self.embedding_size)
        # Embedding variables for all entities and relationship types
        entity_embedding_shape = [entity_cnt, self.embedding_size]
        # Relationship embeddings will be stored in flattened format to make 
        # applying maxnorm constraints easier
        rel_embedding_shape = [rel_cnt, self.embedding_size * self.embedding_size]
        entity_init = tf.truncated_normal(entity_embedding_shape, stddev=init_sd)
        rel_init = tf.truncated_normal(rel_embedding_shape, stddev=init_sd)
        if self.maxnorm is not None:
            # Ensure maxnorm constraints are initially satisfied
            entity_init = dense_maxnorm(entity_init, self.maxnorm)
            rel_init = dense_maxnorm(rel_init, self.maxnorm)
        self.entity_embedding_vars = tf.Variable(entity_init)
        self.rel_embedding_vars = tf.Variable(rel_init)
        # Embedding layer for each (head, rel, tail) triple being fed in as input
        head_embed = tf.nn.embedding_lookup(self.entity_embedding_vars, self.head_input)
        tail_embed = tf.nn.embedding_lookup(self.entity_embedding_vars, self.tail_input)
        rel_embed = tf.nn.embedding_lookup(self.rel_embedding_vars, self.rel_input)
        # Reshape rel_embed into square D x D matrices
        rel_embed_square = tf.reshape(rel_embed, (-1, self.embedding_size, self.embedding_size))
        # Reshape head_embed and tail_embed to be suitable for the matrix multiplication
        head_embed_row = tf.expand_dims(head_embed, 1) # embeddings as row vectors
        tail_embed_col = tf.expand_dims(tail_embed, 2) # embeddings as column vectors
        head_rel_mult = tf.batch_matmul(head_embed_row, rel_embed_square)
        # Output needs a squeeze into a 1d vector
        raw_output = tf.squeeze(tf.batch_matmul(head_rel_mult, tail_embed_col)) 
        self.output, self.loss = self._create_output_and_loss(raw_output)
        # Optimization
        self.train_step = self.opt.minimize(self.loss)
        if self.maxnorm is not None:
            # Post-processing to limit embedding vars to L2 ball
            rel_maxnorm = self.maxnorm * self.rel_maxnorm_mult
            unique_ent_indices = tf.unique(tf.concat(0, [self.head_input, self.tail_input]))[0]
            unique_rel_indices = tf.unique(self.rel_input)[0]
            entity_constraint = self._norm_constraint_op(self.entity_embedding_vars, 
                                                         unique_ent_indices, 
                                                         self.maxnorm)
            rel_constraint = self._norm_constraint_op(self.rel_embedding_vars, 
                                                      unique_rel_indices, 
                                                      rel_maxnorm)
            self.post_step = [entity_constraint, rel_constraint]

Because the rest of our models will use a single embedding space for all entities, we need to redo the categorical mappings:

In [53]:

all_entities = set(train['head']).union(set(train['tail']))
field_categories = (all_entities, set(train['rel']), all_entities)
train, train_idx_array = util.make_categorical(train, field_categories)
val, val_idx_array = util.make_categorical(val, field_categories)
test, test_idx_array = util.make_categorical(test, field_categories)

In [25]:

bilinear = Bilinear(embedding_size=20,
                    maxnorm=1.0,
                    rel_maxnorm_mult=6.0,
                    batch_pos_cnt=100, 
                    max_iter=30000,
                    model_type='logistic')

val_feed_dict = bilinear.create_feed_dict(val_idx_array, val_labels)

def train_step_callback(itr, batch_feed_dict):
    if (itr % 2000) == 0 or (itr == (bilinear.max_iter-1)):
        batch_size = len(batch_feed_dict[bilinear.target])
        batch_avg_loss = bilinear.sess.run(bilinear.loss, batch_feed_dict) / batch_size
        val_output, val_loss = bilinear.sess.run((bilinear.output, bilinear.loss), 
                                                 val_feed_dict)
        val_avg_loss = val_loss / len(val_labels)
        val_pair_ranking_acc = util.pair_ranking_accuracy(val_output)
        msg = 'itr {}, batch loss: {:.2}, val loss: {:.2}, val pair ranking acc: {:.2}'
        print(msg.format(itr, batch_avg_loss, val_avg_loss, val_pair_ranking_acc))
    return True

bilinear.fit(train_idx_array, train_step_callback)

acc, pred, scores, thresh_map = util.model_threshold_and_eval(bilinear, test, val)
print('Test set accuracy: {:.2}'.format(acc))
bilinear.close()

itr 0, batch loss: 0.69, val loss: 0.69, val pair ranking acc: 0.5
itr 2000, batch loss: 0.66, val loss: 0.7, val pair ranking acc: 0.53
itr 4000, batch loss: 0.58, val loss: 0.68, val pair ranking acc: 0.61
itr 6000, batch loss: 0.53, val loss: 0.66, val pair ranking acc: 0.65
itr 8000, batch loss: 0.48, val loss: 0.65, val pair ranking acc: 0.69
itr 10000, batch loss: 0.52, val loss: 0.64, val pair ranking acc: 0.7
itr 12000, batch loss: 0.38, val loss: 0.64, val pair ranking acc: 0.71
itr 14000, batch loss: 0.43, val loss: 0.62, val pair ranking acc: 0.72
itr 16000, batch loss: 0.43, val loss: 0.62, val pair ranking acc: 0.74
itr 18000, batch loss: 0.39, val loss: 0.61, val pair ranking acc: 0.73
itr 20000, batch loss: 0.4, val loss: 0.61, val pair ranking acc: 0.76
itr 22000, batch loss: 0.39, val loss: 0.6, val pair ranking acc: 0.78
itr 24000, batch loss: 0.4, val loss: 0.59, val pair ranking acc: 0.78
itr 26000, batch loss: 0.4, val loss: 0.58, val pair ranking acc: 0.78
itr 28000, batch loss: 0.35, val loss: 0.58, val pair ranking acc: 0.79
itr 29999, batch loss: 0.35, val loss: 0.56, val pair ranking acc: 0.8
Test set accuracy: 0.75

TransE¶

The TransE model [Bordes13] is different than the ones we have built so far because it is not based on tensor decomposition. It also embeds entities into a vector space, but instead treats relationship types as translations between the entity vectors. The model scores a triple based on a dissimilarity function between the head and tail entity after applying the relationship's translation to the head. This formulation was inspired by previous work on word embedding models, where researchers discovered after the fact that translations in the embedding space captured certain types of relationships between them.

TransE is shown below in a 2D embedding space for both a perfect match (left) and a depiction of what has_part might look like (right):

TransE

The model output for a triple is:

$$ f(i,j,k) = -d(e_i+r_k, e_j) $$

and for Euclidean distance:

$$ d(e_i+r_k, e_j) = \lVert e_i + r_k - e_j \rVert $$

Other distances can be used, such as Manhattan, squared Euclidean, etc. The distance is negated in order to keep with our convention that models should output higher scores for true statements, here 0 being the best possible score. For training, the pairwise ranking loss described above is used with the margin $\gamma$ as a configurable hyperparameter. The loss across all training triples $S$ and their corrupted counterparts $S'$ is:

$$ Loss = \sum_{(i,j,k) \in S} \ \sum_{(i',j',k) \in S_{(i,j,k)}'} [\gamma + d(e_i+r_k, e_j) - d(e_{i'}+r_{k'}, e_{j'})]_{+} $$

which is optimized by sampling positive and negative pairs in mini-batch SGD, just as we have been doing with our previous two models. In the paper, the authors propose constraining the embedding vectors to lie on the unit sphere: $\forall i \ ||e_i|| = 1$. Here we will continue to use a max-norm constraint, which is less restrictive and allows them to instead lie within the unit ball. The relationship translation vectors are left unconstrained.

In [35]:

class TransE(BaseModel):

    def __init__(self, embedding_size, batch_pos_cnt=100, 
                 max_iter=1000, dist='euclidean', 
                 margin=1.0, opt=None):
        super(TransE, self).__init__(embedding_size=embedding_size,
                                     maxnorm=1.0,
                                     batch_pos_cnt=batch_pos_cnt,
                                     max_iter=max_iter,
                                     model_type='ranking_margin',
                                     opt=opt)
        self.dist = dist
        self.margin = margin
        self.EPS = 1e-3 # for sqrt gradient when dist='euclidean'
    
    def _create_model(self, train_triples):
        # Count unique items to determine embedding matrix sizes
        entity_cnt = len(set(train_triples[:,0]).union(train_triples[:,2]))
        rel_cnt = len(set(train_triples[:,1]))
        init_sd = 1.0 / np.sqrt(self.embedding_size)
        # Embedding variables
        entity_var_shape = [entity_cnt, self.embedding_size]
        rel_var_shape = [rel_cnt, self.embedding_size]
        entity_init  = tf.truncated_normal(entity_var_shape, stddev=init_sd)
        rel_init = tf.truncated_normal(rel_var_shape, stddev=init_sd)
        # Ensure maxnorm constraints are initially satisfied
        entity_init = dense_maxnorm(entity_init, self.maxnorm)
        self.entity_embedding_vars = tf.Variable(entity_init)
        self.rel_embedding_vars = tf.Variable(rel_init)
        # Embedding layer for each (head, rel, tail) triple being fed in as input
        head_embed = tf.nn.embedding_lookup(self.entity_embedding_vars, self.head_input)
        tail_embed = tf.nn.embedding_lookup(self.entity_embedding_vars, self.tail_input)
        rel_embed = tf.nn.embedding_lookup(self.rel_embedding_vars, self.rel_input)
        # Relationship vector acts as a translation in entity embedding space
        diff_vec = tail_embed - (head_embed + rel_embed)
        # negative dist so higher scores are better (important for pairwise loss)
        if self.dist == 'manhattan':
            raw_output = -tf.reduce_sum(tf.abs(diff_vec), 1)
        elif self.dist == 'euclidean':
            # +eps because gradients can misbehave for small values in sqrt
            raw_output = -tf.sqrt(tf.reduce_sum(tf.square(diff_vec), 1) + self.EPS)
        elif self.dist == 'sqeuclidean':
            raw_output = -tf.reduce_sum(tf.square(diff_vec), 1)
        else:
            raise Exception('Unknown distance type')
        # Model output
        self.output, self.loss = ranking_margin_objective(raw_output, self.margin)
        # Optimization with postprocessing to limit embedding vars to L2 ball
        self.train_step = self.opt.minimize(self.loss)
        unique_ent_indices = tf.unique(tf.concat(0, [self.head_input, self.tail_input]))[0]
        self.post_step = self._norm_constraint_op(self.entity_embedding_vars, 
                                                  unique_ent_indices, 
                                                  self.maxnorm)

In [31]:

transE = TransE(embedding_size=20,
                margin=1.0,
                dist='euclidean',
                batch_pos_cnt=100, 
                max_iter=30000)

val_feed_dict = transE.create_feed_dict(val_idx_array, val_labels)

def train_step_callback(itr, batch_feed_dict):
    if (itr % 2000) == 0 or (itr == (transE.max_iter-1)):
        batch_size = len(batch_feed_dict[transE.target])
        batch_avg_loss = transE.sess.run(transE.loss, batch_feed_dict) / batch_size
        val_output, val_loss = transE.sess.run((transE.output, transE.loss), val_feed_dict)
        val_avg_loss = val_loss / len(val_labels)
        val_pair_ranking_acc = util.pair_ranking_accuracy(val_output)
        msg = 'itr {}, batch loss: {:.2}, val loss: {:.2}, val pair ranking acc: {:.2}'
        print(msg.format(itr, batch_avg_loss, val_avg_loss, val_pair_ranking_acc))
    return True

transE.fit(train_idx_array, train_step_callback)

acc, pred, scores, thresh_map = util.model_threshold_and_eval(transE, test, val)
print('Test set accuracy: {:.2}'.format(acc))

itr 0, batch loss: 0.48, val loss: 0.5, val pair ranking acc: 0.5
itr 2000, batch loss: 0.36, val loss: 0.41, val pair ranking acc: 0.63
itr 4000, batch loss: 0.28, val loss: 0.38, val pair ranking acc: 0.69
itr 6000, batch loss: 0.24, val loss: 0.34, val pair ranking acc: 0.77
itr 8000, batch loss: 0.15, val loss: 0.31, val pair ranking acc: 0.82
itr 10000, batch loss: 0.17, val loss: 0.29, val pair ranking acc: 0.86
itr 12000, batch loss: 0.1, val loss: 0.27, val pair ranking acc: 0.88
itr 14000, batch loss: 0.12, val loss: 0.25, val pair ranking acc: 0.89
itr 16000, batch loss: 0.094, val loss: 0.24, val pair ranking acc: 0.91
itr 18000, batch loss: 0.082, val loss: 0.23, val pair ranking acc: 0.91
itr 20000, batch loss: 0.11, val loss: 0.22, val pair ranking acc: 0.92
itr 22000, batch loss: 0.1, val loss: 0.22, val pair ranking acc: 0.92
itr 24000, batch loss: 0.089, val loss: 0.21, val pair ranking acc: 0.92
itr 26000, batch loss: 0.094, val loss: 0.21, val pair ranking acc: 0.93
itr 28000, batch loss: 0.096, val loss: 0.2, val pair ranking acc: 0.92
itr 29999, batch loss: 0.083, val loss: 0.2, val pair ranking acc: 0.92
Test set accuracy: 0.87

This is not too bad, 87% is quite a bit better than our previous attempts.

Take It For a Spin¶

Let's take a look at the top 10 scoring triples for a single head entity and relationship type. We'll use (brain cell, part of, ?):

In [32]:

entity = '__brain_cell_NN_1'
relationship_type = '_part_of'
entity_cats = train['head'].cat.categories
rel_cats = train['rel'].cat.categories
entity_code = (entity_cats == entity).argmax()
has_part_code = (rel_cats == relationship_type).argmax()

print(entity_code, pd.Categorical.from_codes([entity_code], entity_cats).astype('str'))
print(has_part_code, pd.Categorical.from_codes([has_part_code], rel_cats).astype('str'))

query_triples = np.zeros((len(entity_cats), 3))
query_triples[:,0] = entity_code
query_triples[:,1] = has_part_code
query_triples[:,2] = range(len(entity_cats))

scores = transE.predict(query_triples)
sorted_idxs = np.argsort(scores)[::-1]
print()
print('Top 10 matches:')
print(np.array(entity_cats)[sorted_idxs][:10])

example_train_rows = (train['head'] == entity) | (train['tail'] == entity)
print('\nTraining triples for entity:')
display(train.loc[example_train_rows])

26308 ['__brain_cell_NN_1']
4 ['_part_of']

Top 10 matches:
['__brain_cell_NN_1' '__encephalon_NN_1' '__temporal_lobe_NN_1'
 '__funiculus_NN_2' '__neural_structure_NN_1' '__pallium_NN_1'
 '__systema_nervosum_NN_1' '__external_body_part_NN_1'
 '__male_reproductive_system_NN_1' '__hypothalamus_NN_1']

Training triples for entity:

	head	rel	tail
1325	__encephalon_NN_1	_has_part	__brain_cell_NN_1
20593	__brain_cell_NN_1	_hyponym	__golgi_cell_NN_1
41067	__golgi_cell_NN_1	_hypernym	__brain_cell_NN_1
55004	__brain_cell_NN_1	_part_of	__encephalon_NN_1

Some good predictions in there, especially considering how few training examples included the word directly (shown at the bottom). One error is the object itself, and the other errors are at least body parts. Not all queries I've tried perform as well as this one, however!

Visualization¶

Another useful feature of these relational learning models is that the embedding vectors can be used for clustering and visualization. Let's take a look at some of them laid out in a 2D scatter using Barnes-Hut T-SNE [VanDerMaaten14]. This requires scikit-learn 0.17 or above and the bokeh plotting library.

We will color each embedding point by the word's top-level category in WordNet, called its "lexname". I have included a file with these in the data directory. You should be able to mouse over the plot and see the words, which lets you get a feel for how things are being clustered in embedding space. Note that we are only showing 10% of the data here in order to save some time and keep the web browser happy. You should be able to run it on all of the embeddings if you would like, as Barnes-Hut T-SNE is an $O(n \ log(n))$ algorithm.

In [54]:

from sklearn.manifold import TSNE
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.plotting import figure, output_notebook, show, ColumnDataSource, reset_output
from bokeh.palettes import Spectral11

subsample = 10

reset_output()
output_notebook()

lexnames = pd.read_table(os.path.join(data_dir, 'wordnet_lexnames.txt'), index_col=0)
entity_embeddings = transE.sess.run(transE.entity_embedding_vars)
entity_cats = train['head'].cat.categories
entity_names = pd.Categorical.from_codes(range(len(entity_embeddings)), 
                                         entity_cats).astype(str)
entity_lexnames = lexnames.loc[entity_names].values

# Run on just a subset of the data to save some time
emb_subset = entity_embeddings[::subsample, :] 
emb_subset_names = entity_names[::subsample]
emb_subset_lexnames = entity_lexnames[::subsample]

print('Embeddings shape:', entity_embeddings.shape)
print('Using subset:', emb_subset.shape)
print('Running T-SNE, may take a while...')
tsne = TSNE(n_iter=1000, method='barnes_hut')
lowdim = tsne.fit_transform(emb_subset)

print('Plotting...')
source = ColumnDataSource(
  data=dict(x=lowdim[:,0],
            y=lowdim[:,1],
            name=emb_subset_names,
            lexname=emb_subset_lexnames)
)
colormap = {}
for i,ln in enumerate(set(emb_subset_lexnames.flat)):
    colormap[ln] = Spectral11[i % len(Spectral11)]
colors = [colormap[ln] for ln in emb_subset_lexnames.flat]
tools = 'pan,wheel_zoom,box_zoom,reset,resize,hover'
fig = figure(title="T-SNE of WordNet TransE Embeddings", 
             plot_width=800, plot_height=600, tools=tools)
fig.scatter('x', 'y', source=source, alpha=0.5, fill_color=colors, line_color=None)
hover = fig.select(dict(type=HoverTool))
hover.tooltips = [('','@name, @lexname')]
h = show(fig)

Embeddings shape: (39996, 20)
Using subset: (4000, 20)
Running T-SNE, may take a while...
Plotting...

Concluding Thoughts¶

This tutorial has covered three relatively simple models for relational learning: two based on the CP and RESCAL tensor decompositions, and TransE which treats relationships as translations in embedding space. We trained each with contrastive negative sampling and max-norm regularization, which of course are not the only possibilities.

TransE performed the best for me on this particular dataset, although in fairness not much time was spent optimizing hyperparameters for each model. Based on the brief time I have spent with it on the WordNet data, it appears to cluster things well, but does not necessarily possess high precision for all of the fine-grained relationships.

Many other models have also been proposed in the literature, for example TransH [Wang14], Neural Tensor Network (NTN) [Socher13], and models using a more traditional neural network architecture Dong14. I believe that all of these are good candidates for TensorFlow and it would be fun to add them to the framework we have developed here.

One thing to note is that this evaluation is measuring classification accuracy on a balanced test set with an equal number of positive and negative instances. Other performance measures have been proposed which take more of an information-retrieval approach, for example comparing the score of each positive test triple against all of its possible corruptions and reporting things like mean rank and precision@k. I did not use these for this tutorial because they are a bit more involved to code up, and evaluation takes much longer. They may be a better metric than accuracy on a balanced set, however, depending on application.

Lastly, this is a relatively small dataset for the complex task the model is attempting to learn, with only about 3 triples per entity on average. Many recent publications on knowledge-base completion are using much larger data sets, but I decided to go with this particular one to keep training times reasonable for a tutorial.

I hope you have enjoyed this tutorial and found it useful! I know I had a lot of fun putting it together.

References¶

[Bordes11] Bordes, Antoine, et al. "Learning structured embeddings of knowledge bases." Conference on Artificial Intelligence. No. EPFL-CONF-192344. 2011.

[Bordes13] Bordes, Antoine, et al. "Translating embeddings for modeling multi-relational data." Advances in Neural Information Processing Systems. 2013.

[Bordes14] Bordes, Antoine, et al. "A semantic matching energy function for learning with multi-relational data." Machine Learning 94.2 (2014): 233-259.

[Bottou12] Bottou, Léon. "Stochastic gradient descent tricks." Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg, 2012. 421-436.

[Carpenter08] Carpenter, Bob. "Lazy sparse stochastic gradient descent for regularized multinomial logistic regression." Alias-i, Inc., Tech. Rep (2008): 1-20.

[Dong14] Dong, Xin, et al. "Knowledge vault: A web-scale approach to probabilistic knowledge fusion." Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.

[Gutmann10] Gutmann, Michael, and Aapo Hyvärinen. "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models." International Conference on Artificial Intelligence and Statistics. 2010.

[Hitchcock27] Hitchcock, Frank L. "The expression of a tensor or a polyadic as a sum of products." Journal of Mathematics and Physics 6.1 (1927): 164-189.

[Hu08] Hu, Yifan, Yehuda Koren, and Chris Volinsky. "Collaborative filtering for implicit feedback datasets." Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on. Ieee, 2008.

[Joachims06] Joachims, Thorsten. "Training linear SVMs in linear time." Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006.

[Kolda09] Kolda, Tamara G., and Brett W. Bader. "Tensor decompositions and applications." SIAM review 51.3 (2009): 455-500.

[Koren09] Koren, Yehuda, Robert Bell, and Chris Volinsky. "Matrix factorization techniques for recommender systems." Computer 8 (2009): 30-37.

[Mikolov13] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013.

[Mnih13] Mnih, Andriy, and Koray Kavukcuoglu. "Learning word embeddings efficiently with noise-contrastive estimation." Advances in Neural Information Processing Systems. 2013.

[Nickel11] Nickel, Maximilian, Volker Tresp, and Hans-Peter Kriegel. "A three-way model for collective learning on multi-relational data." Proceedings of the 28th international conference on machine learning (ICML-11). 2011.

[Nickel15] Nickel, Maximilian, et al. "A review of relational machine learning for knowledge graphs: From multi-relational link prediction to automated knowledge graph construction." arXiv preprint arXiv:1503.00759 (2015).

[Rendle04] Rendle, Steffen, et al. "BPR: Bayesian personalized ranking from implicit feedback." Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence. AUAI Press, 2009.

[Sbrero04] Srebro, Nathan, Jason Rennie, and Tommi S. Jaakkola. "Maximum-margin matrix factorization." Advances in neural information processing systems. 2004.

[Socher13] Socher, Richard, et al. "Reasoning with neural tensor networks for knowledge base completion." Advances in Neural Information Processing Systems. 2013.

[Srivastava14] Srivastava, Nitish, et al. "Dropout: A simple way to prevent neural networks from overfitting." The Journal of Machine Learning Research 15.1 (2014): 1929-1958.

[VanDerMaaten14] Van Der Maaten, Laurens. "Accelerating t-sne using tree-based algorithms." The Journal of Machine Learning Research 15.1 (2014): 3221-3245.

[Wang14] Wang, Zhen, et al. "Knowledge Graph Embedding by Translating on Hyperplanes." AAAI. 2014.