Brian Jones, FireEye Inc.

March 2016

Accompanying code: GitHub site

- Relational Learning
- Loading the Data
- Preprocessing
- Background: Matrix Factorization
- Tensor Decomposition
- Optimization
- Max-Norm Regularization
- Loss Functions
- Bilinear Model (RESCAL)
- TransE
- Take It For a Spin
- Visualization
- References

Like many machine learning researchers, I recently decided to take some time to learn Google's new TensorFlow framework. For those who have not taken a look yet, TensorFlow provides a well-designed API in both C and Python for building and training graph-based models like deep neural networks. Most of the existing introductory material focuses on computer vision and sequence modeling, so I thought I'd take a different route and use it to build statistical relational learning models. I turned that experience into this tutorial in case it is useful for others who are also learning the framework or interested in the topic.

In this tutorial we'll build three different types of learning models, implement a max-norm constraint regularization technique, and evaluate them on a popular dataset

Before diving in, make sure you've gone over the background material provided on the official site, as I won't be covering the absolute basics. In addition to TensorFlow, we'll also be using some standard Python numerical libraries: NumPy and Pandas. Finally, we'll make use of some utilities that can be found in the accompanying code for this tutorial.

Relational learning models try to predict unseen or future relationships between entities based on past observations. Item recommendation is a common example, where the model predicts how much a person will enjoy a particular item based on past rating (or consumption) data collected from a user base. A similar task is link prediction, where the goal is to predict new edges, or edge values, in a graph that are likely to be observed in the future. When applied to a knowledge database like WordNet or FreeBase, it is sometimes used to suggest missing facts for "knowledge base completion".

Many approaches have been developed for relational learning, and here we will focus on a subset called latent factor models. While TensorFlow is primarily geared towards neural network construction with layers of weights and nonlinear activations, it is also a nice framework for building latent factor models and even offers some specific functionality for them with its "embedding" methods.

What makes frameworks like TensorFlow , Theano, and Torch so great for productivity is their automatic differentiation feature, which calculates all of the partial derivatives necessary for running optimization procedures like stochastic gradient descent (SGD). One simply needs to declare the computational graph structure and the framework takes care of all of the messy details for backpropagation. While calculating derivatives by hand can be fun (for some!), having to do it for every model variation you are developing is incredibly time-consuming.

Let's get started!

For this tutorial, we'll be using a subset of the WordNet database that is popular for benchmarking relational learning models [Bordes14]. You can download it here. WordNet is a large knowledge base of English words, their definitions, and their relationships to each other. This particular dataset contains one file with word definitions, and three files with relationships for standard train / validate / test purposes. The code below will load the data from text files into Pandas DataFrames objects, and also replaces the identifier numbers with actual words to make them more readable.

In [46]:

```
import os
import numpy as np
import pandas as pd
import tensorflow as tf
from IPython.display import display
from tf_rl_tutorial import util
def read_and_replace(fpath, def_df):
df = pd.read_table(fpath, names=['head', 'rel', 'tail'])
df['head'] = def_df.loc[df['head']]['word'].values
df['tail'] = def_df.loc[df['tail']]['word'].values
return df
data_dir = '~/data/wordnet-mlj12' # change to where you extracted the data
definitions = pd.read_table(os.path.join(data_dir, 'wordnet-mlj12-definitions.txt'),
index_col=0, names=['word', 'definition'])
train = read_and_replace(os.path.join(data_dir, 'wordnet-mlj12-train.txt'), definitions)
val = read_and_replace(os.path.join(data_dir, 'wordnet-mlj12-valid.txt'), definitions)
test = read_and_replace(os.path.join(data_dir, 'wordnet-mlj12-test.txt'), definitions)
print('Train shape:', train.shape)
print('Validation shape:', val.shape)
print('Test shape:', test.shape)
all_train_entities = set(train['head']).union(train['tail'])
print('Training entity count: {}'.format(len(all_train_entities)))
print('Training relationship type count: {}'.format(len(set(train['rel']))))
print('Example training triples:')
display(train.sample(5))
```

The validation and test sets also contain true statements in the form shown above. One thing I noticed is that most of the triples in these two sets are mirror images of triples in the training set. This is due to most relationships having an opposite counterpart, for example "fishing_rod has_part reel" and "reel part_of fishing_rod". I think this makes the problem a bit too easy, so let's remove the mirror-image triples from the training set. We'll also remove triples in the validation and test sets where the head or tail entity is not present in that position in the training set, because our first model will not be able to handle those.

In [47]:

```
from collections import defaultdict
mask = np.zeros(len(train)).astype(bool)
lookup = defaultdict(list)
for idx,h,r,t in train.itertuples():
lookup[(h,t)].append(idx)
for h,r,t in pd.concat((val,test)).itertuples(index=False):
mask[lookup[(h,t)]] = True
mask[lookup[(t,h)]] = True
train = train.loc[~mask]
heads, tails = set(train['head']), set(train['tail'])
val = val.loc[val['head'].isin(heads) & val['tail'].isin(tails)]
test = test.loc[test['head'].isin(heads) & test['tail'].isin(tails)]
print('Train shape:', train.shape)
print('Validation shape:', val.shape)
print('Test shape:', test.shape)
```

*util* module function *create_tf_pairs()*.

In [48]:

```
rng = np.random.RandomState(42)
combined_df = pd.concat((train, val, test))
val = util.create_tf_pairs(val, combined_df, rng)
test = util.create_tf_pairs(test, combined_df, rng)
print('Validation shape:', val.shape)
print('Test shape:', test.shape)
```

In [49]:

```
example_entity = '__brain_cell_NN_1'
example_train_rows = (train['head'] == example_entity) | (train['tail'] == example_entity)
print('Train:')
display(train.loc[example_train_rows])
example_test_rows = (test['head'] == example_entity) | (test['tail'] == example_entity)
print('Test:')
display(test.loc[example_test_rows])
```

**is** a type of brain cell", and "a lobbyist **is not** a type of brain cell", having never seen those statements before.

Before describing the multi-relational models we will build, it is useful to consider a popular single-relationship modeling technique called matrix factorization. This approach became very popular during the \$1M Netflix Prize movie recommendation contest, as it was one of the primary techniques used by the top teams [Koren09].

In a factorization model, we will represent our training data for a single relationship type as a matrix, $\textbf{X}$. Each row in $\textbf{X}$ corresponds to an entity appearing in the head of the relationships, each column a tail entity, and the matrix entries contain the values for all possible pairings.

For movie ratings each row would be a person, each column a movie, the values integers between 1-5, and the matrix only partially observed (each person has not rated every movie). Our training data consists of true statements only, which raises the issue of how we will represent all other possible statements. This is known as the open vs. closed world assumption, and for now we will go with the latter and treat all other statements as false. This is clearly not perfect, as some true statements have been intentionally withheld in the validation and test sets, and knowledge bases like WordNet are seldom complete. In some previous work the unobserved entries are weighted less to account for this [Hu08].

A single matrix is sufficient for datasets with only one relationship type, but for multi-relational data like we are handling with WordNet we would need to do this for each of the 18 relationship types. Here is what part of our matrix looks like for the relationship type *has_part*:

In [11]:

```
has_part_triples = val.loc[val['rel'] == '_has_part']
query_entities = ['__noaa_NN_1', '__vascular_plant_NN_1', '__retina_NN_1']
has_part_example = has_part_triples.loc[has_part_triples['head'].isin(query_entities)]
matrix_view = pd.pivot_table(has_part_example, 'truth_flag', 'head', 'tail',
fill_value=False).astype(int)
display(matrix_view)
```

The factorization model approximates the observation matrix $\textbf{X}$ as the product of two matrices. We'll call them $\textbf{H}$ and $\textbf{T}$ for "head" and "tail". If $\textbf{X}$ contains $I$ rows and $J$ columns, then $\textbf{H} = [h_1, h_2, ..., h_I] \in \unicode{x211D}^{DxI}$ and $\textbf{T} = [t_1, t_2, ..., t_J]\in \unicode{x211D}^{DxJ}$. $D$ is typically much smaller than $I$ or $J$.

$$ \textbf{X} \approx \textbf{H}^T \textbf{T} $$For relational learning, one can think of the matrix $\textbf{H}$ as containing latent factor vectors in its columns for the entities appearing in the head entry of our relationship triples and likewise for $\textbf{T}$ and the tail entities. This is the same concept that has become popular in word2vec models [Mikolov13], where the vector representation is instead called an embedding.

The model output for the relationship between head entity $i$ and tail entity $j$ is the dot product of their latent vectors:

$$ f(i,j) = h_i^T t_j = \sum_{d=1}^D \textbf{H}_{di} \textbf{T}_{dj} $$The figure below contains a simple geometric illustration in a *D*=2 space for some of the words shown in the *has_part* matrix above. Since $ h_i^T t_j = \lVert h_i \rVert \lVert t_j \rVert cos \left( \theta_{ij} \right)$, the model score for (retina, has_part, singapore) will be 0 due to the vectors being orthogonal, whereas the score for (retina, has_part, rod_cell) will be positive and based on the length of each vector and their angle $\theta$.

A common optimization objective for this model is to find latent vectors for all head and tail entities such that the total squared approximation error is minimized. This is similar to a truncated SVD but without orthogonality constraints:

$$ min_{\textbf{H},\textbf{T}} \ \lVert \textbf{X} - \textbf{H}^T \textbf{T} \rVert_{Fro}^{2} \ = \ min_{\textbf{H},\textbf{T}} \ \sum_{ij} \left( \textbf{X}_{ij} - h_i^T t_j \right) ^2 $$Note that squared error is not the only possible choice, and regularization is often important to control model complexity. We will explore both of these topics as we move through the tutorial.

Implementing matrix factorization in TensorFlow is pretty simple, but we're going to hold off because we'd rather address multi-relational learning, and it is also a special case of our next model.

A major limitation of the matrix factorization described above is that it only applies to a single relationship type at a time. For a multi-relational dataset like WordNet, this results in 18 independent models, and anything one model "learns" cannot influence the others.

One way to encourage knowledge transfer across relationships is to stack their matrices into a 3D array called a tensor. We can then build one factorization model for this tensor which shares the head and tail entity representations across all relationships.

A natural fit for what we have described is the CANDECOMP / PARAFAC (CP) tensor decomposition [Hitchcock27] [Kolda09]. We will add a new factor matrix for our relationship types, $\textbf{R} = [r_1, r_2, ..., r_K] \in \unicode{x211D}^{DxK}$, and CP then approximates the target tensor $\bf{X}$ as the sum of $D$ rank-one tensors, each formed by the outer product of one dimension from the factor matrices.

$$ \textbf{X} \approx \sum_{d=1}^{D} \textbf{H}_{d,:} \circ \textbf{T}_{d,:} \circ \textbf{R}_{d,:} $$where $\circ$ denotes the outer product, and the subscript $d,:$ denotes a matrix row in Matlab/NumPy slice syntax. The model output for a single cell is similar to matrix factorization, but now each entry in the dot product between head and tail is also multiplied by the corresponding entry in the relationship's vector:

$$ f(i,j,k) = \sum_{d=1}^{D} \textbf{H}_{di} \textbf{T}_{dj} \textbf{R}_{dk} $$or as a matrix multiplication:

$$ f(i,j,k) = h_i^T diag(r_k) \ t_j $$where $diag(r_k)$ denotes a $D \ x \ D$ matrix with the elements from vector $r_k$ on the diagonal.

Also as before, when the decomposition is not full-rank a common objective is least-squares approximation:

$$ min_{\textbf{H},\textbf{T},\textbf{R}} \ \sum_{ijk} \left( \textbf{X}_{ijk} - f(i,j,k) \right)^2 $$*None* as the dimension size means that it will be determined at runtime by the size of the mini-batch passed in for each training step.

In [12]:

```
graph = tf.Graph()
with graph.as_default():
# input and target placeholders
head_input = tf.placeholder(tf.int32, shape=[None])
rel_input = tf.placeholder(tf.int32, shape=[None])
tail_input = tf.placeholder(tf.int32, shape=[None])
target = tf.placeholder(tf.float32, shape=[None])
```

*tf.nn.embedding_lookup()*, which will accomplish the same goal with less code. We'll also need to convert our strings into integers in order to feed them into the input placeholders.

In [13]:

```
embedding_size = 20
head_cnt = len(set(train['head']))
rel_cnt = len(set(train['rel']))
tail_cnt = len(set(train['tail']))
with graph.as_default():
# embedding variables
init_sd = 1.0 / np.sqrt(embedding_size)
head_embedding_vars = tf.Variable(tf.truncated_normal([head_cnt, embedding_size],
stddev=init_sd))
rel_embedding_vars = tf.Variable(tf.truncated_normal([rel_cnt, embedding_size],
stddev=init_sd))
tail_embedding_vars = tf.Variable(tf.truncated_normal([tail_cnt, embedding_size],
stddev=init_sd))
# embedding layer for the (h, r, t) triple being fed in as input
head_embed = tf.nn.embedding_lookup(head_embedding_vars, head_input)
rel_embed = tf.nn.embedding_lookup(rel_embedding_vars, rel_input)
tail_embed = tf.nn.embedding_lookup(tail_embedding_vars, tail_input)
# CP model output
output = tf.reduce_sum(tf.mul(tf.mul(head_embed, rel_embed), tail_embed), 1)
# TensorFlow requires integer indices
field_categories = (set(train['head']), set(train['rel']), set(train['tail']))
train, train_idx_array = util.make_categorical(train, field_categories)
val, val_idx_array = util.make_categorical(val, field_categories)
test, test_idx_array = util.make_categorical(test, field_categories)
```

With our model structure defined, we now need to determine how to optimize it on the training data. A popular method for the squared-loss CP model above is alternating least squares, which can efficiently handle the huge number of zero entries typical in a relationship tensor [Kolda09]. We'd like to use the nice SGD optimizers that come with TensorFlow, however, and may also want to use other objective functions where alternating least squares is not appropriate. To do so, we must decide how to form our random training mini-batches.

For this tutorial, we will use an approach that has become popular for training embedding models called negative sampling [Mikolov13] [Bordes11] [Socher13]. We will augment each positive example in the mini-batch by creating a negative counterpart with a corruption operation. Corruption is accomplished by replacing either the head or the tail (but not the relationship type) with a random entity, just as we did to create the negative examples in the validation and test sets. While this method is usually used for logistic loss or pairwise loss functions (which we'll examine soon), we will first give it a shot on squared loss.

Note that we have deviated a bit from the original CP formulation above, as these mini-batches are not a random sample from the overall objective function. Our model will therefore not be exactly the CP decomposition described above, but instead "CP-like".

For brevity I have omitted the *ContrastiveTrainingProvider* class which provides these training mini-batches, but it can be found in the sample code. Because this model creates a separate embedding for head and tail entities, we will tell the batch provider to honor this by setting *separate_head_tail=True*. Let's take a look at an example mini-batch it provides:

In [17]:

```
from tf_rl_tutorial.util import ContrastiveTrainingProvider
batch_provider = ContrastiveTrainingProvider(train_idx_array, batch_pos_cnt=3,
separate_head_tail=True)
batch_triples, batch_labels = batch_provider.next_batch()
batch_df = pd.DataFrame()
batch_df['head'] = pd.Categorical.from_codes(batch_triples[:,0], train['head'].cat.categories)
batch_df['rel'] = pd.Categorical.from_codes(batch_triples[:,1], train['rel'].cat.categories)
batch_df['tail'] = pd.Categorical.from_codes(batch_triples[:,2], train['tail'].cat.categories)
batch_df['label'] = batch_labels
display(batch_triples)
print('which encodes:')
display(batch_df)
```

*util.pair_ranking_accuracy()*.

In [18]:

```
max_iter = 30000
batch_provider = ContrastiveTrainingProvider(train_idx_array, batch_pos_cnt=100,
separate_head_tail=True)
opt = tf.train.AdagradOptimizer(1.0)
sess = tf.Session(graph=graph)
with graph.as_default():
loss = tf.reduce_sum(tf.square(output - target))
train_step = opt.minimize(loss)
sess.run(tf.initialize_all_variables())
# feed dict for monitoring progress on validation set
val_labels = np.array(val['truth_flag'], dtype=np.float)
val_feed_dict = {head_input: val_idx_array[:,0],
rel_input: val_idx_array[:,1],
tail_input: val_idx_array[:,2],
target: val_labels}
for i in range(max_iter):
batch_triples, batch_labels = batch_provider.next_batch()
feed_dict = {head_input: batch_triples[:,0],
rel_input: batch_triples[:,1],
tail_input: batch_triples[:,2],
target: batch_labels}
if i % 2000 == 0 or i == (max_iter-1):
batch_avg_loss = sess.run(loss, feed_dict) / len(batch_labels)
val_output, val_loss = sess.run((output,loss), val_feed_dict)
val_avg_loss = val_loss / len(val_labels)
val_pair_ranking_acc = util.pair_ranking_accuracy(val_output)
msg = 'iter {}, batch loss: {:.2}, val loss: {:.2}, val pair ranking acc: {:.2}'
print(msg.format(i, batch_avg_loss, val_avg_loss, val_pair_ranking_acc))
sess.run(train_step, feed_dict)
sess.close()
```

Not so good. It is achieving a very tight fit on the training set but making little progress on the validation set, a sign of overfitting.

There are a variety of things we could do here, for example lowering the number of model parameters by reducing the embedding dimension. A popular approach for latent factor models is instead to add a regularization term to our objective function that operates on the model parameters $\Omega$:

$$ min_{\Omega} \ \ Loss+ \lambda Reg(\Omega) $$The $L_2$ (ridge) penalty is a common choice which encourages the sum of the squares of the embeddings to be small:

$$ Reg_{L_{2}} ( \textbf{H},\textbf{T},\textbf{R} ) = \lVert \textbf{H} \rVert_{Fro}^2 + \lVert \textbf{T} \rVert_{Fro}^2 + \lVert \textbf{R} \rVert_{Fro}^2 $$We could easily add this form of regularization to our CP model with a single line of code, but it creates a bit of a nuisance: it destroys the nice sparse updates we get in SGD and can slow down the optimization process significantly. This is because SGD subsamples the overall loss using a subset of training examples, and doing this creates 0 gradient and no updates for embeddings not present in the mini-batch (**note**: this is not true for all optimizers, see caveat below). The regularization term, however, is data-independent and creates a gradient for all embeddings on every SGD step. There are some clever approaches to handle this issue [Bottou12] [Carpenter08], but they would add a bit of extra complexity to our code. Other researchers have opted to only apply the regularization gradient to parameters active in the mini-batch [Koren09].

An alternative way to control model complexity, which will also preserve the sparsity of the SGD gradients, is to instead constrain all embedding vectors to lie within an $L_2$ ball of radius $C$. This is called max-norm regularization, and has been shown to be effective in a variety of settings including relational learning models [Sbrero04] [Srivastava14] [Bordes13]. Our optimization problem is now:

$$ min_{\textbf{H},\textbf{T},\textbf{R}} \ Loss \\ s.t. \\ \forall i \ \ \lVert h_i \rVert <= C, \ \ \forall j \ \ \lVert t_j \rVert <= C, \ \ \forall k \ \ \lVert r_k \rVert <= C $$We can enforce these constraints by using projected gradient descent: after each step, the embedding vectors that have moved outside of the $L_2$ ball are projected back onto it. If the optimizer only updates embedding vectors with nonzero gradient, the ones present in each mini-batch, we only need to check this small subset after each step.
This form of constraint is not directly supported by TensorFlow, but we can implement it by tracking which rows in the embedding matrices could have changed, and using the *scatter_update()* function to apply sparse projection updates. I'm not sure if this is the best way to accomplish the goal, it's just the one I came up with when learning the TensorFlow API.

**Caveat**: As of when this tutorial was written, all of the optimizers I tried in TensorFlow ignore variables with zero gradient except *AdamOptimizer*, which can apply momentum updates to them. The full code example therefore uses a dense projection operation when Adam is used, and an efficient sparse update otherwise.

In [50]:

```
maxnorm = 1.5
def sparse_maxnorm_update(var_matrix, indices, maxnorm=1.0):
selected_rows = tf.nn.embedding_lookup(var_matrix, indices)
row_norms = tf.sqrt(tf.reduce_sum(tf.square(selected_rows), 1))
scaling = maxnorm / tf.maximum(row_norms, maxnorm)
scaled = selected_rows * tf.expand_dims(scaling, 1)
return tf.scatter_update(var_matrix, indices, scaled)
def dense_maxnorm_update(var_matrix, maxnorm=1.0):
row_norms = tf.sqrt(tf.reduce_sum(tf.square(var_matrix), 1))
scaling = maxnorm / tf.maximum(row_norms, maxnorm)
scaled = var_matrix * tf.expand_dims(scaling, 1)
return tf.assign(var_matrix, scaled)
with graph.as_default():
# tf.unique used to gather the head, tail, and rel indices that are active in the minibatch
head_constraint = sparse_maxnorm_update(head_embedding_vars,
tf.unique(head_input)[0], maxnorm)
rel_constraint = sparse_maxnorm_update(rel_embedding_vars,
tf.unique(rel_input)[0], maxnorm)
tail_constraint = sparse_maxnorm_update(tail_embedding_vars,
tf.unique(tail_input)[0], maxnorm)
postprocess_step = [head_constraint, rel_constraint, tail_constraint]
```

In [20]:

```
max_iter = 30000
sess = tf.Session(graph=graph)
with graph.as_default():
# init variables and ensure constraints are initially satisfied
sess.run(tf.initialize_all_variables())
sess.run(dense_maxnorm_update(head_embedding_vars, maxnorm))
sess.run(dense_maxnorm_update(rel_embedding_vars, maxnorm))
sess.run(dense_maxnorm_update(tail_embedding_vars, maxnorm))
for i in range(max_iter):
batch_triples, batch_labels = batch_provider.next_batch()
feed_dict = {head_input: batch_triples[:,0],
rel_input: batch_triples[:,1],
tail_input: batch_triples[:,2],
target: batch_labels}
if i % 2000 == 0 or i == (max_iter-1):
batch_avg_loss = sess.run(loss, feed_dict) / len(batch_labels)
val_output, val_loss = sess.run((output,loss), val_feed_dict)
val_avg_loss = val_loss / len(val_labels)
val_pair_ranking_acc = util.pair_ranking_accuracy(val_output)
# check on embedding norm constraints
all_embed = np.vstack(sess.run([head_embedding_vars,
rel_embedding_vars,
tail_embedding_vars]))
norms = np.linalg.norm(all_embed, axis=1)
msg = 'iter {}, batch loss: {:.2}, val loss: {:.2}, val pair ranking acc: {:.2}, max norm: {:.3}'
print(msg.format(i, batch_avg_loss, val_avg_loss, val_pair_ranking_acc, np.max(norms)))
sess.run(train_step, feed_dict)
sess.run(postprocess_step, feed_dict)
```

While there is still a large gap between training and validation loss, this variant is faring better than its unregularized counterpart. Let's see how it does on the test set.

For evaluation on the test set we will not just report the pair ranking accuracy, but will also turn our model into a boolean classifier and measure prediction accuracy. To do this we'll need to threshold the real-valued model output. As is done in [Socher13], the thresholds for each relationship type will be found on the validation set, and this code is in *util.threshold_and_eval()*.

In [21]:

```
test_feed_dict = {head_input: test_idx_array[:,0],
rel_input: test_idx_array[:,1],
tail_input: test_idx_array[:,2]}
test_scores = sess.run(output, test_feed_dict)
val_scores = sess.run(output, val_feed_dict)
acc, pred, scores, thresh_map = util.threshold_and_eval(test, test_scores,
val, val_scores)
print('Test set accuracy: {:.2}'.format(acc))
sess.close()
```

So far we have used squared loss as the objective for model training which implies a normally-distributed likelihood. For binary data, a Bernoulli model is a natural choice and only requires two changes: we squash the output with a sigmoid function, and switch to cross-entropy loss. Another popular choice is the hinge loss from SVM and other max-margin models.

The loss functions above operate on individual positive and negative training triples, and encourage the model to map known positive examples to 1 and all others to 0. The contrastive sampling we are utilizing suggests an alternative objective function we can optimize. Instead of trying to map each positive instance to 1 and its corrupted partner to 0, we could instead ask the model to just try to score the positive instance higher by a certain margin. If this could be done for all contrastive pairs, then our positive training examples would all be ranked above the others.

Instead of attempting to satisfy this constraint exactly, we can optimize the so-called soft margin of hinge-loss penalties in a similar fashion as RankSVM [Joachims06].

$$ RankingLoss\left((i,j,k), (i',j',k)\right) = [\gamma - f(i,j,k) + f(i',j',k) ]_{+} $$where $(i,j,k)$ is the positive example, $(i',j',k)$ is the corrupted triple where either the head $i$ or the tail $j$ has been modified, $\gamma$ is the margin, and $[x]_{+} = max(0, x)$.

In [51]:

```
def least_squares_objective(output, target, add_bias=True):
y = output
if add_bias:
bias = tf.Variable([0.0])
y = output + bias
loss = tf.reduce_sum(tf.square(y - target))
return y, loss
def logistic_objective(output, target, add_bias=True):
y = output
if add_bias:
bias = tf.Variable([0.0])
y = output + bias
squashed_y = tf.clip_by_value(tf.sigmoid(y), 0.001, 0.999) # avoid NaNs
loss = -tf.reduce_sum(target*tf.log(squashed_y) + (1-target)*tf.log(1-squashed_y))
return squashed_y, loss
def ranking_margin_objective(output, margin=1.0):
''' This only works when given model output on alternating
positive/negative pairs: [pos,neg,pos,neg,...] '''
y_pairs = tf.reshape(output, [-1,2]) # fold: 1 x n -> [n/2 x 2]
pos_scores, neg_scores = tf.split(1, 2, y_pairs) # separate the pairs
hinge_losses = tf.nn.relu(margin - pos_scores + neg_scores)
total_hinge_loss = tf.reduce_sum(hinge_losses)
return output, total_hinge_loss
```

RESCAL [Nickel11] employs two changes to the basic CP decomposition:

- Entities have a single embedding space; each entity is mapped to the same embedding vector no matter its position in the relationship triple
- Each relationship type is embedded into a larger $D^2$ space which operates as a $D \ x \ D$ bilinear operator on the entity embeddings

The figure below is based on one from the original paper that I think does a nice job depicting this decomposition:

The two separate $\textbf{H}$ and $\textbf{T}$ matrices from before have been replaced with single embedding matrix for all entities, $\textbf{E} = [e_1,e_2,...,e_N] \in \unicode{x211D}^{DxN}$, and the model output for an input triple is:

$$ f(i,j,k) = e_i^T \textbf{R}_k e_j $$where $\textbf{R}_k$ is a $D \ x \ D$ matrix for relationship type $k$.

This is a more flexible tensor decomposition than CP, as the relationship matrix introduces interaction terms between each entry in the embedding vectors. As shown before, CP can be seen as a special case of this formulation where each relationship type operates as a diagonal scaling matrix instead of a full bilinear one.

Implementing a bilinear model requires only a few changes to our code above. We are going to start doing things a little differently, however, and use a common base class for our models which contains reusable code. Each model only has to implement the *_create_model()* method to define its structure, training step, and postprocessing step.

We will also deviate a bit from the original RESCAL paper which proposes $L_2$ regularization, and instead stick to our max-norm constraints. A bit of experimentation has shown that it can help to use a larger max-norm for the relationship embeddings, which can be specified using *rel_maxnorm_mult*. Finally, we'll switch to logistic loss instead of squared loss, and again utilize a contrastive negative sampling approach.

In [23]:

```
from tf_rl_tutorial.models import BaseModel, dense_maxnorm
class Bilinear(BaseModel):
def __init__(self, embedding_size, maxnorm=1.0, rel_maxnorm_mult=3.0,
batch_pos_cnt=100, max_iter=1000,
model_type='least_squares', add_bias=True, opt=None):
super(Bilinear, self).__init__(
embedding_size=embedding_size,
maxnorm=maxnorm,
batch_pos_cnt=batch_pos_cnt,
max_iter=max_iter,
model_type=model_type,
opt=opt)
self.rel_maxnorm_mult = rel_maxnorm_mult
def _create_model(self, train_triples):
# Count unique items to determine embedding matrix sizes
entity_cnt = len(set(train_triples[:,0]).union(train_triples[:,2]))
rel_cnt = len(set(train_triples[:,1]))
init_sd = 1.0 / np.sqrt(self.embedding_size)
# Embedding variables for all entities and relationship types
entity_embedding_shape = [entity_cnt, self.embedding_size]
# Relationship embeddings will be stored in flattened format to make
# applying maxnorm constraints easier
rel_embedding_shape = [rel_cnt, self.embedding_size * self.embedding_size]
entity_init = tf.truncated_normal(entity_embedding_shape, stddev=init_sd)
rel_init = tf.truncated_normal(rel_embedding_shape, stddev=init_sd)
if self.maxnorm is not None:
# Ensure maxnorm constraints are initially satisfied
entity_init = dense_maxnorm(entity_init, self.maxnorm)
rel_init = dense_maxnorm(rel_init, self.maxnorm)
self.entity_embedding_vars = tf.Variable(entity_init)
self.rel_embedding_vars = tf.Variable(rel_init)
# Embedding layer for each (head, rel, tail) triple being fed in as input
head_embed = tf.nn.embedding_lookup(self.entity_embedding_vars, self.head_input)
tail_embed = tf.nn.embedding_lookup(self.entity_embedding_vars, self.tail_input)
rel_embed = tf.nn.embedding_lookup(self.rel_embedding_vars, self.rel_input)
# Reshape rel_embed into square D x D matrices
rel_embed_square = tf.reshape(rel_embed, (-1, self.embedding_size, self.embedding_size))
# Reshape head_embed and tail_embed to be suitable for the matrix multiplication
head_embed_row = tf.expand_dims(head_embed, 1) # embeddings as row vectors
tail_embed_col = tf.expand_dims(tail_embed, 2) # embeddings as column vectors
head_rel_mult = tf.batch_matmul(head_embed_row, rel_embed_square)
# Output needs a squeeze into a 1d vector
raw_output = tf.squeeze(tf.batch_matmul(head_rel_mult, tail_embed_col))
self.output, self.loss = self._create_output_and_loss(raw_output)
# Optimization
self.train_step = self.opt.minimize(self.loss)
if self.maxnorm is not None:
# Post-processing to limit embedding vars to L2 ball
rel_maxnorm = self.maxnorm * self.rel_maxnorm_mult
unique_ent_indices = tf.unique(tf.concat(0, [self.head_input, self.tail_input]))[0]
unique_rel_indices = tf.unique(self.rel_input)[0]
entity_constraint = self._norm_constraint_op(self.entity_embedding_vars,
unique_ent_indices,
self.maxnorm)
rel_constraint = self._norm_constraint_op(self.rel_embedding_vars,
unique_rel_indices,
rel_maxnorm)
self.post_step = [entity_constraint, rel_constraint]
```

In [53]:

```
all_entities = set(train['head']).union(set(train['tail']))
field_categories = (all_entities, set(train['rel']), all_entities)
train, train_idx_array = util.make_categorical(train, field_categories)
val, val_idx_array = util.make_categorical(val, field_categories)
test, test_idx_array = util.make_categorical(test, field_categories)
```

In [25]:

```
bilinear = Bilinear(embedding_size=20,
maxnorm=1.0,
rel_maxnorm_mult=6.0,
batch_pos_cnt=100,
max_iter=30000,
model_type='logistic')
val_feed_dict = bilinear.create_feed_dict(val_idx_array, val_labels)
def train_step_callback(itr, batch_feed_dict):
if (itr % 2000) == 0 or (itr == (bilinear.max_iter-1)):
batch_size = len(batch_feed_dict[bilinear.target])
batch_avg_loss = bilinear.sess.run(bilinear.loss, batch_feed_dict) / batch_size
val_output, val_loss = bilinear.sess.run((bilinear.output, bilinear.loss),
val_feed_dict)
val_avg_loss = val_loss / len(val_labels)
val_pair_ranking_acc = util.pair_ranking_accuracy(val_output)
msg = 'itr {}, batch loss: {:.2}, val loss: {:.2}, val pair ranking acc: {:.2}'
print(msg.format(itr, batch_avg_loss, val_avg_loss, val_pair_ranking_acc))
return True
bilinear.fit(train_idx_array, train_step_callback)
acc, pred, scores, thresh_map = util.model_threshold_and_eval(bilinear, test, val)
print('Test set accuracy: {:.2}'.format(acc))
bilinear.close()
```

The TransE model [Bordes13] is different than the ones we have built so far because it is not based on tensor decomposition. It also embeds entities into a vector space, but instead treats relationship types as translations between the entity vectors. The model scores a triple based on a dissimilarity function between the head and tail entity after applying the relationship's translation to the head. This formulation was inspired by previous work on word embedding models, where researchers discovered after the fact that translations in the embedding space captured certain types of relationships between them.

TransE is shown below in a 2D embedding space for both a perfect match (left) and a depiction of what *has_part* might look like (right):

The model output for a triple is:

$$ f(i,j,k) = -d(e_i+r_k, e_j) $$and for Euclidean distance:

$$ d(e_i+r_k, e_j) = \lVert e_i + r_k - e_j \rVert $$Other distances can be used, such as Manhattan, squared Euclidean, etc. The distance is negated in order to keep with our convention that models should output higher scores for true statements, here 0 being the best possible score. For training, the pairwise ranking loss described above is used with the margin $\gamma$ as a configurable hyperparameter. The loss across all training triples $S$ and their corrupted counterparts $S'$ is:

$$ Loss = \sum_{(i,j,k) \in S} \ \sum_{(i',j',k) \in S_{(i,j,k)}'} [\gamma + d(e_i+r_k, e_j) - d(e_{i'}+r_{k'}, e_{j'})]_{+} $$which is optimized by sampling positive and negative pairs in mini-batch SGD, just as we have been doing with our previous two models. In the paper, the authors propose constraining the embedding vectors to lie on the unit sphere: $\forall i \ ||e_i|| = 1$. Here we will continue to use a max-norm constraint, which is less restrictive and allows them to instead lie within the unit ball. The relationship translation vectors are left unconstrained.

In [35]:

```
class TransE(BaseModel):
def __init__(self, embedding_size, batch_pos_cnt=100,
max_iter=1000, dist='euclidean',
margin=1.0, opt=None):
super(TransE, self).__init__(embedding_size=embedding_size,
maxnorm=1.0,
batch_pos_cnt=batch_pos_cnt,
max_iter=max_iter,
model_type='ranking_margin',
opt=opt)
self.dist = dist
self.margin = margin
self.EPS = 1e-3 # for sqrt gradient when dist='euclidean'
def _create_model(self, train_triples):
# Count unique items to determine embedding matrix sizes
entity_cnt = len(set(train_triples[:,0]).union(train_triples[:,2]))
rel_cnt = len(set(train_triples[:,1]))
init_sd = 1.0 / np.sqrt(self.embedding_size)
# Embedding variables
entity_var_shape = [entity_cnt, self.embedding_size]
rel_var_shape = [rel_cnt, self.embedding_size]
entity_init = tf.truncated_normal(entity_var_shape, stddev=init_sd)
rel_init = tf.truncated_normal(rel_var_shape, stddev=init_sd)
# Ensure maxnorm constraints are initially satisfied
entity_init = dense_maxnorm(entity_init, self.maxnorm)
self.entity_embedding_vars = tf.Variable(entity_init)
self.rel_embedding_vars = tf.Variable(rel_init)
# Embedding layer for each (head, rel, tail) triple being fed in as input
head_embed = tf.nn.embedding_lookup(self.entity_embedding_vars, self.head_input)
tail_embed = tf.nn.embedding_lookup(self.entity_embedding_vars, self.tail_input)
rel_embed = tf.nn.embedding_lookup(self.rel_embedding_vars, self.rel_input)
# Relationship vector acts as a translation in entity embedding space
diff_vec = tail_embed - (head_embed + rel_embed)
# negative dist so higher scores are better (important for pairwise loss)
if self.dist == 'manhattan':
raw_output = -tf.reduce_sum(tf.abs(diff_vec), 1)
elif self.dist == 'euclidean':
# +eps because gradients can misbehave for small values in sqrt
raw_output = -tf.sqrt(tf.reduce_sum(tf.square(diff_vec), 1) + self.EPS)
elif self.dist == 'sqeuclidean':
raw_output = -tf.reduce_sum(tf.square(diff_vec), 1)
else:
raise Exception('Unknown distance type')
# Model output
self.output, self.loss = ranking_margin_objective(raw_output, self.margin)
# Optimization with postprocessing to limit embedding vars to L2 ball
self.train_step = self.opt.minimize(self.loss)
unique_ent_indices = tf.unique(tf.concat(0, [self.head_input, self.tail_input]))[0]
self.post_step = self._norm_constraint_op(self.entity_embedding_vars,
unique_ent_indices,
self.maxnorm)
```

In [31]:

```
transE = TransE(embedding_size=20,
margin=1.0,
dist='euclidean',
batch_pos_cnt=100,
max_iter=30000)
val_feed_dict = transE.create_feed_dict(val_idx_array, val_labels)
def train_step_callback(itr, batch_feed_dict):
if (itr % 2000) == 0 or (itr == (transE.max_iter-1)):
batch_size = len(batch_feed_dict[transE.target])
batch_avg_loss = transE.sess.run(transE.loss, batch_feed_dict) / batch_size
val_output, val_loss = transE.sess.run((transE.output, transE.loss), val_feed_dict)
val_avg_loss = val_loss / len(val_labels)
val_pair_ranking_acc = util.pair_ranking_accuracy(val_output)
msg = 'itr {}, batch loss: {:.2}, val loss: {:.2}, val pair ranking acc: {:.2}'
print(msg.format(itr, batch_avg_loss, val_avg_loss, val_pair_ranking_acc))
return True
transE.fit(train_idx_array, train_step_callback)
acc, pred, scores, thresh_map = util.model_threshold_and_eval(transE, test, val)
print('Test set accuracy: {:.2}'.format(acc))
```

This is not too bad, 87% is quite a bit better than our previous attempts.

Let's take a look at the top 10 scoring triples for a single head entity and relationship type. We'll use (brain cell, part of, ?):

In [32]:

```
entity = '__brain_cell_NN_1'
relationship_type = '_part_of'
entity_cats = train['head'].cat.categories
rel_cats = train['rel'].cat.categories
entity_code = (entity_cats == entity).argmax()
has_part_code = (rel_cats == relationship_type).argmax()
print(entity_code, pd.Categorical.from_codes([entity_code], entity_cats).astype('str'))
print(has_part_code, pd.Categorical.from_codes([has_part_code], rel_cats).astype('str'))
query_triples = np.zeros((len(entity_cats), 3))
query_triples[:,0] = entity_code
query_triples[:,1] = has_part_code
query_triples[:,2] = range(len(entity_cats))
scores = transE.predict(query_triples)
sorted_idxs = np.argsort(scores)[::-1]
print()
print('Top 10 matches:')
print(np.array(entity_cats)[sorted_idxs][:10])
example_train_rows = (train['head'] == entity) | (train['tail'] == entity)
print('\nTraining triples for entity:')
display(train.loc[example_train_rows])
```

Another useful feature of these relational learning models is that the embedding vectors can be used for clustering and visualization. Let's take a look at some of them laid out in a 2D scatter using Barnes-Hut T-SNE [VanDerMaaten14]. This requires scikit-learn 0.17 or above and the bokeh plotting library.

We will color each embedding point by the word's top-level category in WordNet, called its "lexname". I have included a file with these in the data directory. You should be able to mouse over the plot and see the words, which lets you get a feel for how things are being clustered in embedding space. Note that we are only showing 10% of the data here in order to save some time and keep the web browser happy. You should be able to run it on all of the embeddings if you would like, as Barnes-Hut T-SNE is an $O(n \ log(n))$ algorithm.

In [54]:

```
from sklearn.manifold import TSNE
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.plotting import figure, output_notebook, show, ColumnDataSource, reset_output
from bokeh.palettes import Spectral11
subsample = 10
reset_output()
output_notebook()
lexnames = pd.read_table(os.path.join(data_dir, 'wordnet_lexnames.txt'), index_col=0)
entity_embeddings = transE.sess.run(transE.entity_embedding_vars)
entity_cats = train['head'].cat.categories
entity_names = pd.Categorical.from_codes(range(len(entity_embeddings)),
entity_cats).astype(str)
entity_lexnames = lexnames.loc[entity_names].values
# Run on just a subset of the data to save some time
emb_subset = entity_embeddings[::subsample, :]
emb_subset_names = entity_names[::subsample]
emb_subset_lexnames = entity_lexnames[::subsample]
print('Embeddings shape:', entity_embeddings.shape)
print('Using subset:', emb_subset.shape)
print('Running T-SNE, may take a while...')
tsne = TSNE(n_iter=1000, method='barnes_hut')
lowdim = tsne.fit_transform(emb_subset)
print('Plotting...')
source = ColumnDataSource(
data=dict(x=lowdim[:,0],
y=lowdim[:,1],
name=emb_subset_names,
lexname=emb_subset_lexnames)
)
colormap = {}
for i,ln in enumerate(set(emb_subset_lexnames.flat)):
colormap[ln] = Spectral11[i % len(Spectral11)]
colors = [colormap[ln] for ln in emb_subset_lexnames.flat]
tools = 'pan,wheel_zoom,box_zoom,reset,resize,hover'
fig = figure(title="T-SNE of WordNet TransE Embeddings",
plot_width=800, plot_height=600, tools=tools)
fig.scatter('x', 'y', source=source, alpha=0.5, fill_color=colors, line_color=None)
hover = fig.select(dict(type=HoverTool))
hover.tooltips = [('','@name, @lexname')]
h = show(fig)
```

This tutorial has covered three relatively simple models for relational learning: two based on the CP and RESCAL tensor decompositions, and TransE which treats relationships as translations in embedding space. We trained each with contrastive negative sampling and max-norm regularization, which of course are not the only possibilities.

TransE performed the best for me on this particular dataset, although in fairness not much time was spent optimizing hyperparameters for each model. Based on the brief time I have spent with it on the WordNet data, it appears to cluster things well, but does not necessarily possess high precision for all of the fine-grained relationships.

Many other models have also been proposed in the literature, for example TransH [Wang14], Neural Tensor Network (NTN) [Socher13], and models using a more traditional neural network architecture Dong14. I believe that all of these are good candidates for TensorFlow and it would be fun to add them to the framework we have developed here.

One thing to note is that this evaluation is measuring classification accuracy on a balanced test set with an equal number of positive and negative instances. Other performance measures have been proposed which take more of an information-retrieval approach, for example comparing the score of each positive test triple against all of its possible corruptions and reporting things like mean rank and [email protected] I did not use these for this tutorial because they are a bit more involved to code up, and evaluation takes much longer. They may be a better metric than accuracy on a balanced set, however, depending on application.

Lastly, this is a relatively small dataset for the complex task the model is attempting to learn, with only about 3 triples per entity on average. Many recent publications on knowledge-base completion are using much larger data sets, but I decided to go with this particular one to keep training times reasonable for a tutorial.

I hope you have enjoyed this tutorial and found it useful! I know I had a lot of fun putting it together.

**[Bordes11]** Bordes, Antoine, et al. "Learning structured embeddings of knowledge bases." Conference on Artificial Intelligence. No. EPFL-CONF-192344. 2011.

**[Bordes13]** Bordes, Antoine, et al. "Translating embeddings for modeling multi-relational data." Advances in Neural Information Processing Systems. 2013.

**[Bordes14]** Bordes, Antoine, et al. "A semantic matching energy function for learning with multi-relational data." Machine Learning 94.2 (2014): 233-259.

**[Bottou12]** Bottou, Léon. "Stochastic gradient descent tricks." Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg, 2012. 421-436.

**[Carpenter08]** Carpenter, Bob. "Lazy sparse stochastic gradient descent for regularized multinomial logistic regression." Alias-i, Inc., Tech. Rep (2008): 1-20.

**[Dong14]** Dong, Xin, et al. "Knowledge vault: A web-scale approach to probabilistic knowledge fusion." Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.

**[Gutmann10]** Gutmann, Michael, and Aapo Hyvärinen. "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models." International Conference on Artificial Intelligence and Statistics. 2010.

**[Hitchcock27]** Hitchcock, Frank L. "The expression of a tensor or a polyadic as a sum of products." Journal of Mathematics and Physics 6.1 (1927): 164-189.

**[Hu08]** Hu, Yifan, Yehuda Koren, and Chris Volinsky. "Collaborative filtering for implicit feedback datasets." Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on. Ieee, 2008.

**[Joachims06]** Joachims, Thorsten. "Training linear SVMs in linear time." Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006.

**[Kolda09]** Kolda, Tamara G., and Brett W. Bader. "Tensor decompositions and applications." SIAM review 51.3 (2009): 455-500.

**[Koren09]** Koren, Yehuda, Robert Bell, and Chris Volinsky. "Matrix factorization techniques for recommender systems." Computer 8 (2009): 30-37.

<a name=Mikolov13></a> **[Mikolov13]** Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013.

**[Mnih13]** Mnih, Andriy, and Koray Kavukcuoglu. "Learning word embeddings efficiently with noise-contrastive estimation." Advances in Neural Information Processing Systems. 2013.

**[Nickel11]** Nickel, Maximilian, Volker Tresp, and Hans-Peter Kriegel. "A three-way model for collective learning on multi-relational data." Proceedings of the 28th international conference on machine learning (ICML-11). 2011.

**[Nickel15]** Nickel, Maximilian, et al. "A review of relational machine learning for knowledge graphs: From multi-relational link prediction to automated knowledge graph construction." arXiv preprint arXiv:1503.00759 (2015).

**[Rendle04]** Rendle, Steffen, et al. "BPR: Bayesian personalized ranking from implicit feedback." Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence. AUAI Press, 2009.

**[Sbrero04]** Srebro, Nathan, Jason Rennie, and Tommi S. Jaakkola. "Maximum-margin matrix factorization." Advances in neural information processing systems. 2004.

**[Socher13]** Socher, Richard, et al. "Reasoning with neural tensor networks for knowledge base completion." Advances in Neural Information Processing Systems. 2013.

**[Srivastava14]** Srivastava, Nitish, et al. "Dropout: A simple way to prevent neural networks from overfitting." The Journal of Machine Learning Research 15.1 (2014): 1929-1958.

**[VanDerMaaten14]** Van Der Maaten, Laurens. "Accelerating t-sne using tree-based algorithms." The Journal of Machine Learning Research 15.1 (2014): 3221-3245.

**[Wang14]** Wang, Zhen, et al. "Knowledge Graph Embedding by Translating on Hyperplanes." AAAI. 2014.