Notebook

Link prediction with GCN¶

Run the latest release of this notebook:

In this example, we use our implementation of the GCN algorithm to build a model that predicts citation links in the Cora dataset (see below). The problem is treated as a supervised link prediction problem on a homogeneous citation network with nodes representing papers (with attributes such as binary keyword indicators and categorical subject) and links corresponding to paper-paper citations.

To address this problem, we build a model with the following architecture. First we build a two-layer GCN model that takes labeled node pairs (citing-paper -> cited-paper) corresponding to possible citation links, and outputs a pair of node embeddings for the citing-paper and cited-paper nodes of the pair. These embeddings are then fed into a link classification layer, which first applies a binary operator to those node embeddings (e.g., concatenating them) to construct the embedding of the potential link. Thus obtained link embeddings are passed through the dense link classification layer to obtain link predictions - probability for these candidate links to actually exist in the network. The entire model is trained end-to-end by minimizing the loss function of choice (e.g., binary cross-entropy between predicted link probabilities and true link labels, with true/false citation links having labels 1/0) using stochastic gradient descent (SGD) updates of the model parameters, with minibatches of 'training' links fed into the model.

In [1]:

%pip show tensorflow

Name: tensorflow
Version: 2.2.0
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /usr/local/lib/python3.6/dist-packages
Requires: google-pasta, keras-preprocessing, six, gast, astunparse, tensorflow-estimator, wrapt, tensorboard, termcolor, grpcio, protobuf, opt-einsum, absl-py, scipy, numpy, h5py, wheel
Required-by: fancyimpute

In [2]:

!curl https://patch-diff.githubusercontent.com/raw/tensorflow/tensorflow/pull/41701.diff | patch --strip=1 --directory=/usr/local/lib/python3.6/dist-packages

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3628    0  3628    0     0  16642      0 --:--:-- --:--:-- --:--:-- 16642
patching file tensorflow/python/keras/engine/compile_utils.py
Hunk #1 succeeded at 278 (offset 2 lines).
Hunk #2 succeeded at 448 (offset 3 lines).
can't find file to patch at input line 44
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|diff --git a/tensorflow/python/keras/engine/compile_utils_test.py b/tensorflow/python/keras/engine/compile_utils_test.py
|index 39127270539a3..d2cc78f768fbc 100644
|--- a/tensorflow/python/keras/engine/compile_utils_test.py
|+++ b/tensorflow/python/keras/engine/compile_utils_test.py
--------------------------
File to patch: 
Skip this patch? [y] 
Skipping patch.
1 out of 1 hunk ignored
patching file tensorflow/python/keras/engine/training.py
Hunk #1 succeeded at 328 with fuzz 2 (offset -215 lines).

In [3]:

# confirm the patch was applied
print("training.py")
!grep -C3 loss=loss /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py
print("\n\ncompile_utils.py")
!grep -C3 is_binary_crossentropy /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/compile_utils.py

training.py
          loss, loss_weights, output_names=self.output_names)
      mc = compile_utils.MetricsContainer(metrics,
                                          weighted_metrics,
                                          loss=loss,
                                          output_names=self.output_names)
      self.compiled_metrics = mc



compile_utils.py
      y_p_rank = len(y_p.shape.as_list())
      y_t_last_dim = y_t.shape.as_list()[-1]
      y_p_last_dim = y_p.shape.as_list()[-1]
      is_binary_crossentropy = (
          isinstance(self._loss, losses_mod.BinaryCrossentropy) or
          (isinstance(self._loss, losses_mod.LossFunctionWrapper) and
           (self._loss.fn == losses_mod.binary_crossentropy)) or
          self._loss == "bce")
      is_binary = y_p_last_dim == 1 or is_binary_crossentropy
      is_sparse_categorical = (
          y_t_rank < y_p_rank or y_t_last_dim == 1 and y_p_last_dim > 1)

In [4]:

# example from https://github.com/tensorflow/tensorflow/issues/41361

import tensorflow as tf

def run(loss):
  inp = tf.keras.Input(3)
  model = tf.keras.Model(inp, inp)

  model.compile(loss=loss, metrics=["acc"])
  model.evaluate(tf.constant([[0.1, 0.6, 0.9]]), tf.constant([[0, 1, 1]]), verbose=0)

  if tf.version.VERSION == "2.2.0":
      print(repr(loss), model.compiled_metrics.metrics[0]._fn.__name__)
  else:
      print(repr(loss), model._per_output_metrics[0]["acc"]._fn.__name__)

run("bce")
run("binary_crossentropy")
run(tf.keras.losses.binary_crossentropy)
run(tf.keras.losses.BinaryCrossentropy())
run(tf.keras.losses.get("bce"))

'bce' binary_accuracy
'binary_crossentropy' categorical_accuracy
<function binary_crossentropy at 0x7ff1b8db0950> categorical_accuracy
<tensorflow.python.keras.losses.BinaryCrossentropy object at 0x7ff1a8dc1208> binary_accuracy
<function binary_crossentropy at 0x7ff1b8db0950> categorical_accuracy

In [5]:

# install StellarGraph if running on Google Colab
import sys
if 'google.colab' in sys.modules:
  %pip install -q stellargraph[demos]==1.2.1

     |████████████████████████████████| 440kB 2.6MB/s 
     |████████████████████████████████| 235kB 4.9MB/s 
     |████████████████████████████████| 51kB 4.7MB/s 
  Building wheel for mplleaflet (setup.py) ... done

In [6]:

# verify that we're using the correct version of StellarGraph for this notebook
import stellargraph as sg

try:
    sg.utils.validate_notebook_version("1.2.1")
except AttributeError:
    raise ValueError(
        f"This notebook requires StellarGraph version 1.2.1, but a different version {sg.__version__} is installed.  Please see <https://github.com/stellargraph/stellargraph/issues/1172>."
    ) from None

In [7]:

import stellargraph as sg
from stellargraph.data import EdgeSplitter
from stellargraph.mapper import FullBatchLinkGenerator
from stellargraph.layer import GCN, LinkEmbedding


from tensorflow import keras
from sklearn import preprocessing, feature_extraction, model_selection

from stellargraph import globalvar
from stellargraph import datasets
from IPython.display import display, HTML
%matplotlib inline

Loading the CORA network data¶

(See the "Loading from Pandas" demo for details on how data can be loaded.)

In [8]:

dataset = datasets.Cora()
display(HTML(dataset.description))
G, _ = dataset.load(subject_as_feature=True)

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.

In [9]:

print(G.info())

StellarGraph: Undirected multigraph
 Nodes: 2708, Edges: 5429

 Node types:
  paper: [2708]
    Features: float32 vector, length 1440
    Edge types: paper-cites->paper

 Edge types:
    paper-cites->paper: [5429]
        Weights: all 1 (default)
        Features: none

We aim to train a link prediction model, hence we need to prepare the train and test sets of links and the corresponding graphs with those links removed.

We are going to split our input graph into a train and test graphs using the EdgeSplitter class in stellargraph.data. We will use the train graph for training the model (a binary classifier that, given two nodes, predicts whether a link between these two nodes should exist or not) and the test graph for evaluating the model's performance on hold out data. Each of these graphs will have the same number of nodes as the input graph, but the number of links will differ (be reduced) as some of the links will be removed during each split and used as the positive samples for training/testing the link prediction classifier.

From the original graph G, extract a randomly sampled subset of test edges (true and false citation links) and the reduced graph G_test with the positive test edges removed:

In [10]:

# Define an edge splitter on the original graph G:
edge_splitter_test = EdgeSplitter(G)

# Randomly sample a fraction p=0.1 of all positive links, and same number of negative links, from G, and obtain the
# reduced graph G_test with the sampled links removed:
G_test, edge_ids_test, edge_labels_test = edge_splitter_test.train_test_split(
    p=0.1, method="global", keep_connected=True
)

** Sampled 542 positive and 542 negative edges. **

The reduced graph G_test, together with the test ground truth set of links (edge_ids_test, edge_labels_test), will be used for testing the model.

Now repeat this procedure to obtain the training data for the model. From the reduced graph G_test, extract a randomly sampled subset of train edges (true and false citation links) and the reduced graph G_train with the positive train edges removed:

In [11]:

# Define an edge splitter on the reduced graph G_test:
edge_splitter_train = EdgeSplitter(G_test)

# Randomly sample a fraction p=0.1 of all positive links, and same number of negative links, from G_test, and obtain the
# reduced graph G_train with the sampled links removed:
G_train, edge_ids_train, edge_labels_train = edge_splitter_train.train_test_split(
    p=0.1, method="global", keep_connected=True
)

** Sampled 488 positive and 488 negative edges. **

G_train, together with the train ground truth set of links (edge_ids_train, edge_labels_train), will be used for training the model.

Creating the GCN link model¶

Next, we create the link generators for the train and test link examples to the model. The link generators take the pairs of nodes (citing-paper, cited-paper) that are given in the .flow method to the Keras model, together with the corresponding binary labels indicating whether those pairs represent true or false links.

The number of epochs for training the model:

In [12]:

epochs = 50

For training we create a generator on the G_train graph, and make an iterator over the training links using the generator's flow() method:

In [13]:

train_gen = FullBatchLinkGenerator(G_train, method="gcn")
train_flow = train_gen.flow(edge_ids_train, edge_labels_train)

Using GCN (local pooling) filters...

In [14]:

test_gen = FullBatchLinkGenerator(G_test, method="gcn")
test_flow = train_gen.flow(edge_ids_test, edge_labels_test)

Using GCN (local pooling) filters...

Now we can specify our machine learning model, we need a few more parameters for this:

the layer_sizes is a list of hidden feature sizes of each layer in the model. In this example we use two GCN layers with 16-dimensional hidden node features at each layer.
activations is a list of activations applied to each layer's output
dropout=0.3 specifies a 30% dropout at each layer.

We create a GCN model as follows:

In [15]:

gcn = GCN(
    layer_sizes=[16, 16], activations=["relu", "relu"], generator=train_gen, dropout=0.3
)

To create a Keras model we now expose the input and output tensors of the GCN model for link prediction, via the GCN.in_out_tensors method:

In [16]:

x_inp, x_out = gcn.in_out_tensors()

Final link classification layer that takes a pair of node embeddings produced by the GCN model, applies a binary operator to them to produce the corresponding link embedding (ip for inner product; other options for the binary operator can be seen by running a cell with ?LinkEmbedding in it), and passes it through a dense layer:

In [17]:

prediction = LinkEmbedding(activation="relu", method="ip")(x_out)

The predictions need to be reshaped from (X, 1) to (X,) to match the shape of the targets we have supplied above.

In [18]:

prediction = keras.layers.Reshape((-1,))(prediction)

Stack the GCN and prediction layers into a Keras model, and specify the loss

In [19]:

model = keras.Model(inputs=x_inp, outputs=prediction)

model.compile(
    optimizer=keras.optimizers.Adam(lr=0.01),
    loss=keras.losses.binary_crossentropy,
    metrics=["acc"],
)

In [20]:

help(type(model.compiled_metrics)) # does this have the new signature? yes

Help on class MetricsContainer in module tensorflow.python.keras.engine.compile_utils:

class MetricsContainer(Container)
 |  A container class for metrics passed to `Model.compile`.
 |  
 |  Method resolution order:
 |      MetricsContainer
 |      Container
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, metrics=None, weighted_metrics=None, output_names=None, loss=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  update_state(self, y_true, y_pred, sample_weight=None)
 |      Updates the state of per-output metrics.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  metrics
 |      Metrics created by this container.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from Container:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)

In [21]:

model.compiled_metrics._loss

Out[21]:

<function tensorflow.python.keras.losses.binary_crossentropy>

Evaluate the initial (untrained) model on the train and test set:

In [22]:

init_train_metrics = model.evaluate(train_flow)
init_test_metrics = model.evaluate(test_flow)

print("\nTrain Set Metrics of the initial (untrained) model:")
for name, val in zip(model.metrics_names, init_train_metrics):
    print("\t{}: {:0.4f}".format(name, val))

print("\nTest Set Metrics of the initial (untrained) model:")
for name, val in zip(model.metrics_names, init_test_metrics):
    print("\t{}: {:0.4f}".format(name, val))

1/1 [==============================] - 0s 2ms/step - loss: 1.9011 - acc: 0.0000e+00
1/1 [==============================] - 0s 1ms/step - loss: 1.9194 - acc: 0.0000e+00

Train Set Metrics of the initial (untrained) model:
	loss: 1.9011
	acc: 0.0000

Test Set Metrics of the initial (untrained) model:
	loss: 1.9194
	acc: 0.0000

Train the model:

In [23]:

history = model.fit(
    train_flow, epochs=epochs, validation_data=test_flow, verbose=2, shuffle=False
)

Epoch 1/50
1/1 - 0s - loss: 1.8239 - acc: 0.0000e+00 - val_loss: 1.9013 - val_acc: 1.0000
Epoch 2/50
1/1 - 0s - loss: 2.2800 - acc: 0.0000e+00 - val_loss: 0.7303 - val_acc: 1.0000
Epoch 3/50
1/1 - 0s - loss: 0.7164 - acc: 0.0000e+00 - val_loss: 0.7009 - val_acc: 1.0000
Epoch 4/50
1/1 - 0s - loss: 0.6643 - acc: 0.0000e+00 - val_loss: 0.6633 - val_acc: 1.0000
Epoch 5/50
1/1 - 0s - loss: 0.6591 - acc: 0.0000e+00 - val_loss: 0.6840 - val_acc: 1.0000
Epoch 6/50
1/1 - 0s - loss: 0.7817 - acc: 0.0000e+00 - val_loss: 0.6575 - val_acc: 1.0000
Epoch 7/50
1/1 - 0s - loss: 0.6532 - acc: 0.0000e+00 - val_loss: 0.6751 - val_acc: 1.0000
Epoch 8/50
1/1 - 0s - loss: 0.6580 - acc: 0.0000e+00 - val_loss: 0.7115 - val_acc: 1.0000
Epoch 9/50
1/1 - 0s - loss: 0.6634 - acc: 0.0000e+00 - val_loss: 0.7146 - val_acc: 1.0000
Epoch 10/50
1/1 - 0s - loss: 0.6663 - acc: 0.0000e+00 - val_loss: 0.6902 - val_acc: 1.0000
Epoch 11/50
1/1 - 0s - loss: 0.6104 - acc: 0.0000e+00 - val_loss: 0.6537 - val_acc: 1.0000
Epoch 12/50
1/1 - 0s - loss: 0.5665 - acc: 0.0000e+00 - val_loss: 0.6539 - val_acc: 1.0000
Epoch 13/50
1/1 - 0s - loss: 0.5567 - acc: 0.0000e+00 - val_loss: 0.7373 - val_acc: 1.0000
Epoch 14/50
1/1 - 0s - loss: 0.6328 - acc: 0.0000e+00 - val_loss: 0.8294 - val_acc: 1.0000
Epoch 15/50
1/1 - 0s - loss: 0.7314 - acc: 0.0000e+00 - val_loss: 0.8411 - val_acc: 1.0000
Epoch 16/50
1/1 - 0s - loss: 0.6917 - acc: 0.0000e+00 - val_loss: 0.7806 - val_acc: 1.0000
Epoch 17/50
1/1 - 0s - loss: 0.6041 - acc: 0.0000e+00 - val_loss: 0.7094 - val_acc: 1.0000
Epoch 18/50
1/1 - 0s - loss: 0.5606 - acc: 0.0000e+00 - val_loss: 0.6731 - val_acc: 1.0000
Epoch 19/50
1/1 - 0s - loss: 0.5116 - acc: 0.0000e+00 - val_loss: 0.6556 - val_acc: 1.0000
Epoch 20/50
1/1 - 0s - loss: 0.4408 - acc: 0.0000e+00 - val_loss: 0.6464 - val_acc: 1.0000
Epoch 21/50
1/1 - 0s - loss: 0.4645 - acc: 0.0000e+00 - val_loss: 0.6541 - val_acc: 1.0000
Epoch 22/50
1/1 - 0s - loss: 0.4752 - acc: 0.0000e+00 - val_loss: 0.6613 - val_acc: 1.0000
Epoch 23/50
1/1 - 0s - loss: 0.4297 - acc: 0.0000e+00 - val_loss: 0.6739 - val_acc: 1.0000
Epoch 24/50
1/1 - 0s - loss: 0.4512 - acc: 0.0000e+00 - val_loss: 0.6806 - val_acc: 1.0000
Epoch 25/50
1/1 - 0s - loss: 0.4228 - acc: 0.0000e+00 - val_loss: 0.7265 - val_acc: 1.0000
Epoch 26/50
1/1 - 0s - loss: 0.4397 - acc: 0.0000e+00 - val_loss: 0.7720 - val_acc: 1.0000
Epoch 27/50
1/1 - 0s - loss: 0.4142 - acc: 0.0000e+00 - val_loss: 0.8096 - val_acc: 1.0000
Epoch 28/50
1/1 - 0s - loss: 0.4156 - acc: 0.0000e+00 - val_loss: 0.8088 - val_acc: 1.0000
Epoch 29/50
1/1 - 0s - loss: 0.4002 - acc: 0.0000e+00 - val_loss: 0.7927 - val_acc: 1.0000
Epoch 30/50
1/1 - 0s - loss: 0.3574 - acc: 0.0000e+00 - val_loss: 0.6540 - val_acc: 1.0000
Epoch 31/50
1/1 - 0s - loss: 0.3789 - acc: 0.0000e+00 - val_loss: 0.6474 - val_acc: 1.0000
Epoch 32/50
1/1 - 0s - loss: 0.3895 - acc: 0.0000e+00 - val_loss: 0.6569 - val_acc: 1.0000
Epoch 33/50
1/1 - 0s - loss: 0.4090 - acc: 0.0000e+00 - val_loss: 0.6549 - val_acc: 1.0000
Epoch 34/50
1/1 - 0s - loss: 0.3769 - acc: 0.0000e+00 - val_loss: 0.6673 - val_acc: 1.0000
Epoch 35/50
1/1 - 0s - loss: 0.4104 - acc: 0.0000e+00 - val_loss: 0.6995 - val_acc: 1.0000
Epoch 36/50
1/1 - 0s - loss: 0.3890 - acc: 0.0000e+00 - val_loss: 0.7183 - val_acc: 1.0000
Epoch 37/50
1/1 - 0s - loss: 0.3726 - acc: 0.0000e+00 - val_loss: 0.7319 - val_acc: 1.0000
Epoch 38/50
1/1 - 0s - loss: 0.3751 - acc: 0.0000e+00 - val_loss: 0.7495 - val_acc: 1.0000
Epoch 39/50
1/1 - 0s - loss: 0.3590 - acc: 0.0000e+00 - val_loss: 0.7862 - val_acc: 1.0000
Epoch 40/50
1/1 - 0s - loss: 0.3596 - acc: 0.0000e+00 - val_loss: 0.7452 - val_acc: 1.0000
Epoch 41/50
1/1 - 0s - loss: 0.2787 - acc: 0.0000e+00 - val_loss: 0.7562 - val_acc: 1.0000
Epoch 42/50
1/1 - 0s - loss: 0.3174 - acc: 0.0000e+00 - val_loss: 0.7422 - val_acc: 1.0000
Epoch 43/50
1/1 - 0s - loss: 0.2992 - acc: 0.0000e+00 - val_loss: 0.7888 - val_acc: 1.0000
Epoch 44/50
1/1 - 0s - loss: 0.2931 - acc: 0.0000e+00 - val_loss: 0.8191 - val_acc: 1.0000
Epoch 45/50
1/1 - 0s - loss: 0.2849 - acc: 0.0000e+00 - val_loss: 0.8627 - val_acc: 1.0000
Epoch 46/50
1/1 - 0s - loss: 0.3036 - acc: 0.0000e+00 - val_loss: 0.8936 - val_acc: 1.0000
Epoch 47/50
1/1 - 0s - loss: 0.2882 - acc: 0.0000e+00 - val_loss: 0.8857 - val_acc: 1.0000
Epoch 48/50
1/1 - 0s - loss: 0.2673 - acc: 0.0000e+00 - val_loss: 0.9019 - val_acc: 1.0000
Epoch 49/50
1/1 - 0s - loss: 0.2910 - acc: 0.0000e+00 - val_loss: 0.9017 - val_acc: 1.0000
Epoch 50/50
1/1 - 0s - loss: 0.2768 - acc: 0.0000e+00 - val_loss: 0.9232 - val_acc: 1.0000

Plot the training history:

In [24]:

sg.utils.plot_history(history)

Evaluate the trained model on test citation links:

In [25]:

train_metrics = model.evaluate(train_flow)
test_metrics = model.evaluate(test_flow)

print("\nTrain Set Metrics of the trained model:")
for name, val in zip(model.metrics_names, train_metrics):
    print("\t{}: {:0.4f}".format(name, val))

print("\nTest Set Metrics of the trained model:")
for name, val in zip(model.metrics_names, test_metrics):
    print("\t{}: {:0.4f}".format(name, val))

1/1 [==============================] - 0s 1ms/step - loss: 0.2011 - acc: 0.0000e+00
1/1 [==============================] - 0s 1ms/step - loss: 0.9232 - acc: 1.0000

Train Set Metrics of the trained model:
	loss: 0.2011
	acc: 0.0000

Test Set Metrics of the trained model:
	loss: 0.9232
	acc: 1.0000

Run the latest release of this notebook: