In this example, we use our implementation of the GCN algorithm to build a model that predicts citation links in the Cora dataset (see below). The problem is treated as a supervised link prediction problem on a homogeneous citation network with nodes representing papers (with attributes such as binary keyword indicators and categorical subject) and links corresponding to paper-paper citations.
To address this problem, we build a model with the following architecture. First we build a two-layer GCN model that takes labeled node pairs (citing-paper
-> cited-paper
) corresponding to possible citation links, and outputs a pair of node embeddings for the citing-paper
and cited-paper
nodes of the pair. These embeddings are then fed into a link classification layer, which first applies a binary operator to those node embeddings (e.g., concatenating them) to construct the embedding of the potential link. Thus obtained link embeddings are passed through the dense link classification layer to obtain link predictions - probability for these candidate links to actually exist in the network. The entire model is trained end-to-end by minimizing the loss function of choice (e.g., binary cross-entropy between predicted link probabilities and true link labels, with true/false citation links having labels 1/0) using stochastic gradient descent (SGD) updates of the model parameters, with minibatches of 'training' links fed into the model.
%pip show tensorflow
Name: tensorflow Version: 2.2.0 Summary: TensorFlow is an open source machine learning framework for everyone. Home-page: https://www.tensorflow.org/ Author: Google Inc. Author-email: packages@tensorflow.org License: Apache 2.0 Location: /usr/local/lib/python3.6/dist-packages Requires: google-pasta, keras-preprocessing, six, gast, astunparse, tensorflow-estimator, wrapt, tensorboard, termcolor, grpcio, protobuf, opt-einsum, absl-py, scipy, numpy, h5py, wheel Required-by: fancyimpute
!curl https://patch-diff.githubusercontent.com/raw/tensorflow/tensorflow/pull/41701.diff | patch --strip=1 --directory=/usr/local/lib/python3.6/dist-packages
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 3628 0 3628 0 0 16642 0 --:--:-- --:--:-- --:--:-- 16642 patching file tensorflow/python/keras/engine/compile_utils.py Hunk #1 succeeded at 278 (offset 2 lines). Hunk #2 succeeded at 448 (offset 3 lines). can't find file to patch at input line 44 Perhaps you used the wrong -p or --strip option? The text leading up to this was: -------------------------- |diff --git a/tensorflow/python/keras/engine/compile_utils_test.py b/tensorflow/python/keras/engine/compile_utils_test.py |index 39127270539a3..d2cc78f768fbc 100644 |--- a/tensorflow/python/keras/engine/compile_utils_test.py |+++ b/tensorflow/python/keras/engine/compile_utils_test.py -------------------------- File to patch: Skip this patch? [y] Skipping patch. 1 out of 1 hunk ignored patching file tensorflow/python/keras/engine/training.py Hunk #1 succeeded at 328 with fuzz 2 (offset -215 lines).
# confirm the patch was applied
print("training.py")
!grep -C3 loss=loss /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py
print("\n\ncompile_utils.py")
!grep -C3 is_binary_crossentropy /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/compile_utils.py
training.py loss, loss_weights, output_names=self.output_names) mc = compile_utils.MetricsContainer(metrics, weighted_metrics, loss=loss, output_names=self.output_names) self.compiled_metrics = mc compile_utils.py y_p_rank = len(y_p.shape.as_list()) y_t_last_dim = y_t.shape.as_list()[-1] y_p_last_dim = y_p.shape.as_list()[-1] is_binary_crossentropy = ( isinstance(self._loss, losses_mod.BinaryCrossentropy) or (isinstance(self._loss, losses_mod.LossFunctionWrapper) and (self._loss.fn == losses_mod.binary_crossentropy)) or self._loss == "bce") is_binary = y_p_last_dim == 1 or is_binary_crossentropy is_sparse_categorical = ( y_t_rank < y_p_rank or y_t_last_dim == 1 and y_p_last_dim > 1)
# example from https://github.com/tensorflow/tensorflow/issues/41361
import tensorflow as tf
def run(loss):
inp = tf.keras.Input(3)
model = tf.keras.Model(inp, inp)
model.compile(loss=loss, metrics=["acc"])
model.evaluate(tf.constant([[0.1, 0.6, 0.9]]), tf.constant([[0, 1, 1]]), verbose=0)
if tf.version.VERSION == "2.2.0":
print(repr(loss), model.compiled_metrics.metrics[0]._fn.__name__)
else:
print(repr(loss), model._per_output_metrics[0]["acc"]._fn.__name__)
run("bce")
run("binary_crossentropy")
run(tf.keras.losses.binary_crossentropy)
run(tf.keras.losses.BinaryCrossentropy())
run(tf.keras.losses.get("bce"))
'bce' binary_accuracy 'binary_crossentropy' categorical_accuracy <function binary_crossentropy at 0x7ff1b8db0950> categorical_accuracy <tensorflow.python.keras.losses.BinaryCrossentropy object at 0x7ff1a8dc1208> binary_accuracy <function binary_crossentropy at 0x7ff1b8db0950> categorical_accuracy
# install StellarGraph if running on Google Colab
import sys
if 'google.colab' in sys.modules:
%pip install -q stellargraph[demos]==1.2.1
|████████████████████████████████| 440kB 2.6MB/s |████████████████████████████████| 235kB 4.9MB/s |████████████████████████████████| 51kB 4.7MB/s Building wheel for mplleaflet (setup.py) ... done
# verify that we're using the correct version of StellarGraph for this notebook
import stellargraph as sg
try:
sg.utils.validate_notebook_version("1.2.1")
except AttributeError:
raise ValueError(
f"This notebook requires StellarGraph version 1.2.1, but a different version {sg.__version__} is installed. Please see <https://github.com/stellargraph/stellargraph/issues/1172>."
) from None
import stellargraph as sg
from stellargraph.data import EdgeSplitter
from stellargraph.mapper import FullBatchLinkGenerator
from stellargraph.layer import GCN, LinkEmbedding
from tensorflow import keras
from sklearn import preprocessing, feature_extraction, model_selection
from stellargraph import globalvar
from stellargraph import datasets
from IPython.display import display, HTML
%matplotlib inline
(See the "Loading from Pandas" demo for details on how data can be loaded.)
dataset = datasets.Cora()
display(HTML(dataset.description))
G, _ = dataset.load(subject_as_feature=True)
print(G.info())
StellarGraph: Undirected multigraph Nodes: 2708, Edges: 5429 Node types: paper: [2708] Features: float32 vector, length 1440 Edge types: paper-cites->paper Edge types: paper-cites->paper: [5429] Weights: all 1 (default) Features: none
We aim to train a link prediction model, hence we need to prepare the train and test sets of links and the corresponding graphs with those links removed.
We are going to split our input graph into a train and test graphs using the EdgeSplitter class in stellargraph.data
. We will use the train graph for training the model (a binary classifier that, given two nodes, predicts whether a link between these two nodes should exist or not) and the test graph for evaluating the model's performance on hold out data.
Each of these graphs will have the same number of nodes as the input graph, but the number of links will differ (be reduced) as some of the links will be removed during each split and used as the positive samples for training/testing the link prediction classifier.
From the original graph G, extract a randomly sampled subset of test edges (true and false citation links) and the reduced graph G_test with the positive test edges removed:
# Define an edge splitter on the original graph G:
edge_splitter_test = EdgeSplitter(G)
# Randomly sample a fraction p=0.1 of all positive links, and same number of negative links, from G, and obtain the
# reduced graph G_test with the sampled links removed:
G_test, edge_ids_test, edge_labels_test = edge_splitter_test.train_test_split(
p=0.1, method="global", keep_connected=True
)
** Sampled 542 positive and 542 negative edges. **
The reduced graph G_test, together with the test ground truth set of links (edge_ids_test, edge_labels_test), will be used for testing the model.
Now repeat this procedure to obtain the training data for the model. From the reduced graph G_test, extract a randomly sampled subset of train edges (true and false citation links) and the reduced graph G_train with the positive train edges removed:
# Define an edge splitter on the reduced graph G_test:
edge_splitter_train = EdgeSplitter(G_test)
# Randomly sample a fraction p=0.1 of all positive links, and same number of negative links, from G_test, and obtain the
# reduced graph G_train with the sampled links removed:
G_train, edge_ids_train, edge_labels_train = edge_splitter_train.train_test_split(
p=0.1, method="global", keep_connected=True
)
** Sampled 488 positive and 488 negative edges. **
G_train, together with the train ground truth set of links (edge_ids_train, edge_labels_train), will be used for training the model.
Next, we create the link generators for the train and test link examples to the model. The link generators take the pairs of nodes (citing-paper
, cited-paper
) that are given in the .flow
method to the Keras model, together with the corresponding binary labels indicating whether those pairs represent true or false links.
The number of epochs for training the model:
epochs = 50
For training we create a generator on the G_train
graph, and make an iterator over the training links using the generator's flow()
method:
train_gen = FullBatchLinkGenerator(G_train, method="gcn")
train_flow = train_gen.flow(edge_ids_train, edge_labels_train)
Using GCN (local pooling) filters...
test_gen = FullBatchLinkGenerator(G_test, method="gcn")
test_flow = train_gen.flow(edge_ids_test, edge_labels_test)
Using GCN (local pooling) filters...
Now we can specify our machine learning model, we need a few more parameters for this:
layer_sizes
is a list of hidden feature sizes of each layer in the model. In this example we use two GCN layers with 16-dimensional hidden node features at each layer.activations
is a list of activations applied to each layer's outputdropout=0.3
specifies a 30% dropout at each layer.We create a GCN model as follows:
gcn = GCN(
layer_sizes=[16, 16], activations=["relu", "relu"], generator=train_gen, dropout=0.3
)
To create a Keras model we now expose the input and output tensors of the GCN model for link prediction, via the GCN.in_out_tensors
method:
x_inp, x_out = gcn.in_out_tensors()
Final link classification layer that takes a pair of node embeddings produced by the GCN model, applies a binary operator to them to produce the corresponding link embedding (ip
for inner product; other options for the binary operator can be seen by running a cell with ?LinkEmbedding
in it), and passes it through a dense layer:
prediction = LinkEmbedding(activation="relu", method="ip")(x_out)
The predictions need to be reshaped from (X, 1)
to (X,)
to match the shape of the targets we have supplied above.
prediction = keras.layers.Reshape((-1,))(prediction)
Stack the GCN and prediction layers into a Keras model, and specify the loss
model = keras.Model(inputs=x_inp, outputs=prediction)
model.compile(
optimizer=keras.optimizers.Adam(lr=0.01),
loss=keras.losses.binary_crossentropy,
metrics=["acc"],
)
help(type(model.compiled_metrics)) # does this have the new signature? yes
Help on class MetricsContainer in module tensorflow.python.keras.engine.compile_utils: class MetricsContainer(Container) | A container class for metrics passed to `Model.compile`. | | Method resolution order: | MetricsContainer | Container | builtins.object | | Methods defined here: | | __init__(self, metrics=None, weighted_metrics=None, output_names=None, loss=None) | Initialize self. See help(type(self)) for accurate signature. | | update_state(self, y_true, y_pred, sample_weight=None) | Updates the state of per-output metrics. | | ---------------------------------------------------------------------- | Data descriptors defined here: | | metrics | Metrics created by this container. | | ---------------------------------------------------------------------- | Data descriptors inherited from Container: | | __dict__ | dictionary for instance variables (if defined) | | __weakref__ | list of weak references to the object (if defined)
model.compiled_metrics._loss
<function tensorflow.python.keras.losses.binary_crossentropy>
Evaluate the initial (untrained) model on the train and test set:
init_train_metrics = model.evaluate(train_flow)
init_test_metrics = model.evaluate(test_flow)
print("\nTrain Set Metrics of the initial (untrained) model:")
for name, val in zip(model.metrics_names, init_train_metrics):
print("\t{}: {:0.4f}".format(name, val))
print("\nTest Set Metrics of the initial (untrained) model:")
for name, val in zip(model.metrics_names, init_test_metrics):
print("\t{}: {:0.4f}".format(name, val))
1/1 [==============================] - 0s 2ms/step - loss: 1.9011 - acc: 0.0000e+00 1/1 [==============================] - 0s 1ms/step - loss: 1.9194 - acc: 0.0000e+00 Train Set Metrics of the initial (untrained) model: loss: 1.9011 acc: 0.0000 Test Set Metrics of the initial (untrained) model: loss: 1.9194 acc: 0.0000
Train the model:
history = model.fit(
train_flow, epochs=epochs, validation_data=test_flow, verbose=2, shuffle=False
)
Epoch 1/50 1/1 - 0s - loss: 1.8239 - acc: 0.0000e+00 - val_loss: 1.9013 - val_acc: 1.0000 Epoch 2/50 1/1 - 0s - loss: 2.2800 - acc: 0.0000e+00 - val_loss: 0.7303 - val_acc: 1.0000 Epoch 3/50 1/1 - 0s - loss: 0.7164 - acc: 0.0000e+00 - val_loss: 0.7009 - val_acc: 1.0000 Epoch 4/50 1/1 - 0s - loss: 0.6643 - acc: 0.0000e+00 - val_loss: 0.6633 - val_acc: 1.0000 Epoch 5/50 1/1 - 0s - loss: 0.6591 - acc: 0.0000e+00 - val_loss: 0.6840 - val_acc: 1.0000 Epoch 6/50 1/1 - 0s - loss: 0.7817 - acc: 0.0000e+00 - val_loss: 0.6575 - val_acc: 1.0000 Epoch 7/50 1/1 - 0s - loss: 0.6532 - acc: 0.0000e+00 - val_loss: 0.6751 - val_acc: 1.0000 Epoch 8/50 1/1 - 0s - loss: 0.6580 - acc: 0.0000e+00 - val_loss: 0.7115 - val_acc: 1.0000 Epoch 9/50 1/1 - 0s - loss: 0.6634 - acc: 0.0000e+00 - val_loss: 0.7146 - val_acc: 1.0000 Epoch 10/50 1/1 - 0s - loss: 0.6663 - acc: 0.0000e+00 - val_loss: 0.6902 - val_acc: 1.0000 Epoch 11/50 1/1 - 0s - loss: 0.6104 - acc: 0.0000e+00 - val_loss: 0.6537 - val_acc: 1.0000 Epoch 12/50 1/1 - 0s - loss: 0.5665 - acc: 0.0000e+00 - val_loss: 0.6539 - val_acc: 1.0000 Epoch 13/50 1/1 - 0s - loss: 0.5567 - acc: 0.0000e+00 - val_loss: 0.7373 - val_acc: 1.0000 Epoch 14/50 1/1 - 0s - loss: 0.6328 - acc: 0.0000e+00 - val_loss: 0.8294 - val_acc: 1.0000 Epoch 15/50 1/1 - 0s - loss: 0.7314 - acc: 0.0000e+00 - val_loss: 0.8411 - val_acc: 1.0000 Epoch 16/50 1/1 - 0s - loss: 0.6917 - acc: 0.0000e+00 - val_loss: 0.7806 - val_acc: 1.0000 Epoch 17/50 1/1 - 0s - loss: 0.6041 - acc: 0.0000e+00 - val_loss: 0.7094 - val_acc: 1.0000 Epoch 18/50 1/1 - 0s - loss: 0.5606 - acc: 0.0000e+00 - val_loss: 0.6731 - val_acc: 1.0000 Epoch 19/50 1/1 - 0s - loss: 0.5116 - acc: 0.0000e+00 - val_loss: 0.6556 - val_acc: 1.0000 Epoch 20/50 1/1 - 0s - loss: 0.4408 - acc: 0.0000e+00 - val_loss: 0.6464 - val_acc: 1.0000 Epoch 21/50 1/1 - 0s - loss: 0.4645 - acc: 0.0000e+00 - val_loss: 0.6541 - val_acc: 1.0000 Epoch 22/50 1/1 - 0s - loss: 0.4752 - acc: 0.0000e+00 - val_loss: 0.6613 - val_acc: 1.0000 Epoch 23/50 1/1 - 0s - loss: 0.4297 - acc: 0.0000e+00 - val_loss: 0.6739 - val_acc: 1.0000 Epoch 24/50 1/1 - 0s - loss: 0.4512 - acc: 0.0000e+00 - val_loss: 0.6806 - val_acc: 1.0000 Epoch 25/50 1/1 - 0s - loss: 0.4228 - acc: 0.0000e+00 - val_loss: 0.7265 - val_acc: 1.0000 Epoch 26/50 1/1 - 0s - loss: 0.4397 - acc: 0.0000e+00 - val_loss: 0.7720 - val_acc: 1.0000 Epoch 27/50 1/1 - 0s - loss: 0.4142 - acc: 0.0000e+00 - val_loss: 0.8096 - val_acc: 1.0000 Epoch 28/50 1/1 - 0s - loss: 0.4156 - acc: 0.0000e+00 - val_loss: 0.8088 - val_acc: 1.0000 Epoch 29/50 1/1 - 0s - loss: 0.4002 - acc: 0.0000e+00 - val_loss: 0.7927 - val_acc: 1.0000 Epoch 30/50 1/1 - 0s - loss: 0.3574 - acc: 0.0000e+00 - val_loss: 0.6540 - val_acc: 1.0000 Epoch 31/50 1/1 - 0s - loss: 0.3789 - acc: 0.0000e+00 - val_loss: 0.6474 - val_acc: 1.0000 Epoch 32/50 1/1 - 0s - loss: 0.3895 - acc: 0.0000e+00 - val_loss: 0.6569 - val_acc: 1.0000 Epoch 33/50 1/1 - 0s - loss: 0.4090 - acc: 0.0000e+00 - val_loss: 0.6549 - val_acc: 1.0000 Epoch 34/50 1/1 - 0s - loss: 0.3769 - acc: 0.0000e+00 - val_loss: 0.6673 - val_acc: 1.0000 Epoch 35/50 1/1 - 0s - loss: 0.4104 - acc: 0.0000e+00 - val_loss: 0.6995 - val_acc: 1.0000 Epoch 36/50 1/1 - 0s - loss: 0.3890 - acc: 0.0000e+00 - val_loss: 0.7183 - val_acc: 1.0000 Epoch 37/50 1/1 - 0s - loss: 0.3726 - acc: 0.0000e+00 - val_loss: 0.7319 - val_acc: 1.0000 Epoch 38/50 1/1 - 0s - loss: 0.3751 - acc: 0.0000e+00 - val_loss: 0.7495 - val_acc: 1.0000 Epoch 39/50 1/1 - 0s - loss: 0.3590 - acc: 0.0000e+00 - val_loss: 0.7862 - val_acc: 1.0000 Epoch 40/50 1/1 - 0s - loss: 0.3596 - acc: 0.0000e+00 - val_loss: 0.7452 - val_acc: 1.0000 Epoch 41/50 1/1 - 0s - loss: 0.2787 - acc: 0.0000e+00 - val_loss: 0.7562 - val_acc: 1.0000 Epoch 42/50 1/1 - 0s - loss: 0.3174 - acc: 0.0000e+00 - val_loss: 0.7422 - val_acc: 1.0000 Epoch 43/50 1/1 - 0s - loss: 0.2992 - acc: 0.0000e+00 - val_loss: 0.7888 - val_acc: 1.0000 Epoch 44/50 1/1 - 0s - loss: 0.2931 - acc: 0.0000e+00 - val_loss: 0.8191 - val_acc: 1.0000 Epoch 45/50 1/1 - 0s - loss: 0.2849 - acc: 0.0000e+00 - val_loss: 0.8627 - val_acc: 1.0000 Epoch 46/50 1/1 - 0s - loss: 0.3036 - acc: 0.0000e+00 - val_loss: 0.8936 - val_acc: 1.0000 Epoch 47/50 1/1 - 0s - loss: 0.2882 - acc: 0.0000e+00 - val_loss: 0.8857 - val_acc: 1.0000 Epoch 48/50 1/1 - 0s - loss: 0.2673 - acc: 0.0000e+00 - val_loss: 0.9019 - val_acc: 1.0000 Epoch 49/50 1/1 - 0s - loss: 0.2910 - acc: 0.0000e+00 - val_loss: 0.9017 - val_acc: 1.0000 Epoch 50/50 1/1 - 0s - loss: 0.2768 - acc: 0.0000e+00 - val_loss: 0.9232 - val_acc: 1.0000
Plot the training history:
sg.utils.plot_history(history)
Evaluate the trained model on test citation links:
train_metrics = model.evaluate(train_flow)
test_metrics = model.evaluate(test_flow)
print("\nTrain Set Metrics of the trained model:")
for name, val in zip(model.metrics_names, train_metrics):
print("\t{}: {:0.4f}".format(name, val))
print("\nTest Set Metrics of the trained model:")
for name, val in zip(model.metrics_names, test_metrics):
print("\t{}: {:0.4f}".format(name, val))
1/1 [==============================] - 0s 1ms/step - loss: 0.2011 - acc: 0.0000e+00 1/1 [==============================] - 0s 1ms/step - loss: 0.9232 - acc: 1.0000 Train Set Metrics of the trained model: loss: 0.2011 acc: 0.0000 Test Set Metrics of the trained model: loss: 0.9232 acc: 1.0000