Node Classification on Citation Network

In this tutorial, we demonstrate how GraphScope handles a node classification task on a citation network by combining analytical, interactive, and graph neural network computation.

In this example, we use the ogbn-mag dataset. ogbn-mag is a heterogeneous network composed of a subset of the Microsoft Academic Graph. It contains four types of entities (papers, authors, institutions, and fields of study), as well as four types of directed relations, each connecting two entities.

Given the heterogeneous ogbn-mag data, the task is to predict the class of each paper. We apply both the attribute and structural information to classify papers. In the graph, each paper node contains a 128-dimensional word2vec vector representing its content, which is obtained by averaging the embeddings of words in its title and abstract. The embeddings of individual words are pre-trained. The structural information is computed on-the-fly.
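
To make the "averaging the embeddings of words" step concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and the helper name `average_embedding` are made up for illustration; the real dataset ships precomputed 128-dimensional features, so this step is not needed to follow the tutorial.

```python
# Toy illustration: a paper's feature vector as the mean of its word embeddings.
# The 3-dimensional vectors are invented; ogbn-mag uses 128 dimensions.
toy_embeddings = {
    "graph": [1.0, 0.0, 2.0],
    "neural": [0.0, 2.0, 2.0],
    "network": [2.0, 1.0, 2.0],
}


def average_embedding(words, embeddings):
    """Return the element-wise mean of the embeddings of the known words."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise ValueError("no known words to average")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


paper_vector = average_embedding(["graph", "neural", "network"], toy_embeddings)
print(paper_vector)
```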

This tutorial has the following steps:

  • Create a session and load the graph.
  • Query graph data.
  • Run graph algorithms.
  • Run graph-based machine learning tasks.

First, let's create a session and load the ogbn-mag dataset as a graph.

In [ ]:
import os
import graphscope
from graphscope.dataset.ogbn_mag import load_ogbn_mag

k8s_volumes = {
    "data": {
        "type": "hostPath",
        "field": {
            "path": "/testingdata",
            "type": "Directory"
        },
        "mounts": {
            "mountPath": "/home/jovyan/datasets",
            "readOnly": True
        }
    }
}

graphscope.set_option(show_log=True)
sess = graphscope.session(k8s_volumes=k8s_volumes)

graph = load_ogbn_mag(sess, "/home/jovyan/datasets/ogbn_mag_small/")

Interactive query with Gremlin

In this example, we launch an interactive query and use graph traversal to count the number of papers that two given authors have co-authored. To simplify the query, we assume the authors can be uniquely identified by the IDs 2 and 4307, respectively.
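
Conceptually, the traversal starts at one author, follows `writes` edges to that author's papers, and keeps only the papers that the other author also wrote. The sketch below reproduces that logic on a tiny in-memory dictionary. The paper IDs and the helper `count_coauthored` are hypothetical and are not part of GraphScope.

```python
# Toy "writes" edges: author id -> set of paper ids (made-up data).
writes = {
    2: {"p1", "p2", "p3"},
    4307: {"p2", "p3", "p4"},
    99: {"p1"},
}


def count_coauthored(writes, author_a, author_b):
    """Count the papers written by both authors."""
    papers_a = writes.get(author_a, set())
    papers_b = writes.get(author_b, set())
    return len(papers_a & papers_b)


coauthored = count_coauthored(writes, 2, 4307)
print("result", coauthored)
```

The Gremlin query in the next cell expresses the same intersection declaratively, so the engine can evaluate it over the distributed graph instead of over a local dictionary.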

In [ ]:
# get the entrypoint for submitting Gremlin queries on graph g.
interactive = sess.gremlin(graph)

# count the number of papers two authors (with id 2 and 4307) have co-authored.
papers = interactive.execute("g.V().has('author', 'id', 2).out('writes').where(__.in('writes').has('id', 4307)).count()").one()
print("result", papers)

Graph analytics with analytical engine

Continuing our example, we run graph algorithms on the graph to generate structural features. Below, we first derive a subgraph by extracting publications from a specific time range out of the entire graph (using Gremlin!), and then run k-core decomposition and triangle counting to generate the structural features of each paper node.
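
For readers unfamiliar with the two algorithms, the following is a small reference sketch in plain Python. It treats the toy graph as undirected and is only meant to show what the features measure; it makes no claim about how GraphScope's analytical engine implements them, and the function names are illustrative.

```python
def k_core_nodes(adj, k):
    """Return the nodes that survive repeated removal of nodes with degree < k."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for node in list(alive):
            degree = sum(1 for nbr in adj[node] if nbr in alive)
            if degree < k:
                alive.remove(node)
                changed = True
    return alive


def triangle_counts(adj):
    """Return, for each node, the number of triangles it participates in."""
    counts = {}
    for node, nbrs in adj.items():
        # Each triangle through `node` is found twice, once per neighbor order.
        ordered_pairs = sum(1 for u in nbrs for v in nbrs if u != v and v in adj[u])
        counts[node] = ordered_pairs // 2
    return counts


# Undirected toy graph: a triangle a-b-c plus a pendant node d attached to c.
toy_adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
core2 = k_core_nodes(toy_adj, 2)
tri = triangle_counts(toy_adj)
print(sorted(core2), tri)
```

Here node `d` drops out of the 2-core because it has only one neighbor, while `a`, `b`, and `c` each belong to one triangle.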

In [ ]:
# extract a subgraph of publications within a time range.
sub_graph = interactive.subgraph(
    "g.V().has('year', inside(2014, 2020)).outE('cites')"
)

# project the subgraph to simple graph by selecting papers and their citations.
simple_g = sub_graph.project(vertices={"paper": []}, edges={"cites": []})
# compute k-core and triangle counting.
kc_result = graphscope.k_core(simple_g, k=5)
tc_result = graphscope.triangles(simple_g)

# add the results as new columns to the citation graph.
sub_graph = sub_graph.add_column(kc_result, {"kcore": "r"})
sub_graph = sub_graph.add_column(tc_result, {"tc": "r"})

Graph neural networks (GNNs)

Then, we use the generated structural features and the original features to train a model with the learning engine.

In our example, we train a GCN model to classify the nodes (papers) into 349 categories, each of which represents a venue (e.g., a pre-print server or a conference).
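
As a rough intuition for what a graph-convolution layer does with these features, the sketch below implements one simplified step in plain Python: each node averages its own features with those of its neighbors, applies a linear transform, and then a ReLU. This is a deliberately reduced variant (mean aggregation with self-loops rather than the symmetric normalization of the original GCN formulation), and all names and numbers are made up. It is not the model used by the learning engine.

```python
def gcn_layer(adj, features, weight):
    """One simplified graph-convolution step.

    adj: node -> list of neighbor nodes
    features: node -> list of floats (length d_in)
    weight: d_in x d_out matrix given as a list of rows
    """
    d_in = len(weight)
    d_out = len(weight[0])
    out = {}
    for node in adj:
        group = [node] + list(adj[node])  # self-loop plus neighbors
        mean = [sum(features[n][i] for n in group) / len(group) for i in range(d_in)]
        transformed = [
            sum(mean[i] * weight[i][j] for i in range(d_in)) for j in range(d_out)
        ]
        out[node] = [max(0.0, x) for x in transformed]  # ReLU
    return out


# Toy citation path p1 - p2 - p3 with 2-dimensional features.
citations = {"p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2"]}
toy_features = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [1.0, 1.0]}
toy_weight = [[1.0, -1.0], [0.0, 1.0]]

hidden = gcn_layer(citations, toy_features, toy_weight)
print(hidden)
```

Stacking two such layers lets a paper's representation depend on papers up to two citations away, which is why structural features such as k-core and triangle counts can complement the text embeddings.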

In [ ]:
# define the features for learning:
# the original 128-dimensional feature plus the k-core and triangle-count results.
paper_features = []
for i in range(128):
    paper_features.append("feat_" + str(i))
paper_features.append("kcore")
paper_features.append("tc")

# launch a learning engine. here we split the dataset: 75% for training, 10% for validation and 15% for testing.
lg = sess.learning(sub_graph, nodes=[("paper", paper_features)],
                   edges=[("paper", "cites", "paper")],
                   gen_labels=[
                       ("train", "paper", 100, (0, 75)),
                       ("val", "paper", 100, (75, 85)),
                       ("test", "paper", 100, (85, 100))
                   ])

# then we define the training process, using the built-in GCN model.
from graphscope.learning.examples import GCN
from graphscope.learning.graphlearn.python.model.tf.trainer import LocalTFTrainer
from graphscope.learning.graphlearn.python.model.tf.optimizer import get_tf_optimizer

def train(config, graph):
    def model_fn():
        return GCN(graph,
                    config["class_num"],
                    config["features_num"],
                    config["batch_size"],
                    val_batch_size=config["val_batch_size"],
                    test_batch_size=config["test_batch_size"],
                    categorical_attrs_desc=config["categorical_attrs_desc"],
                    hidden_dim=config["hidden_dim"],
                    in_drop_rate=config["in_drop_rate"],
                    neighs_num=config["neighs_num"],
                    hops_num=config["hops_num"],
                    node_type=config["node_type"],
                    edge_type=config["edge_type"],
                    full_graph_mode=config["full_graph_mode"])
    trainer = LocalTFTrainer(model_fn,
                             epoch=config["epoch"],
                             optimizer=get_tf_optimizer(
                                 config["learning_algo"],
                                 config["learning_rate"],
                                 config["weight_decay"]))
    trainer.train_and_evaluate()

# hyperparameters config.
config = {"class_num": 349,  # output dimension
          "features_num": 130,  # 128 dimensions + kcore + triangle count
          "batch_size": 500,
          "val_batch_size": 100,
          "test_batch_size": 100,
          "categorical_attrs_desc": "",
          "hidden_dim": 256,
          "in_drop_rate": 0.5,
          "hops_num": 2,
          "neighs_num": [5, 10],
          "full_graph_mode": False,
          "agg_type": "gcn",  # mean, sum
          "learning_algo": "adam",
          "learning_rate": 0.01,
          "weight_decay": 0.0005,
          "epoch": 5,
          "node_type": "paper",
          "edge_type": "cites"}

# start training and evaluating
train(config, lg)
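
A note on the sampling settings: assuming `neighs_num` is read as the per-hop fan-out (an interpretation, not something stated above), `hops_num=2` with `neighs_num=[5, 10]` means each seed paper samples up to 5 neighbors, and each of those up to 10 more. The hypothetical helper below computes the resulting upper bound on nodes touched per seed.

```python
def sampled_nodes_per_seed(neighs_num):
    """Upper bound on nodes touched per seed node, including the seed itself."""
    total = 1
    frontier = 1
    for fanout in neighs_num:
        frontier *= fanout  # nodes reachable at this hop
        total += frontier
    return total


per_seed = sampled_nodes_per_seed([5, 10])  # 1 + 5 + 50
per_batch = per_seed * 500  # with batch_size = 500
print(per_seed, per_batch)
```

Under that assumption, larger fan-outs widen the receptive field of each paper but increase the per-batch cost multiplicatively.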

Finally, don't forget to close the session.

In [ ]:
# close the session.
sess.close()