This notebook illustrates how Node2Vec
[1] can be applied to learn low dimensional node embeddings of an edge weighted graph through weighted biased random walks over the graph.
Specifically, the notebook demonstrates:
The example uses components from the stellargraph
, Gensim
, and scikit-learn
libraries.
Note: For clarity of exposition, this notebook forgoes the use of standard machine learning practices such as Node2Vec
parameter tuning, node feature standarization, data splitting that handles class imbalance, classifier selection, and classifier tuning to maximize predictive accuracy. We leave such improvements to the reader.
[1] Node2Vec: Scalable Feature Learning for Networks. A. Grover, J. Leskovec. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016. (link)
[2] Distributed representations of words and phrases and their compositionality. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. In Advances in Neural Information Processing Systems (NIPS), pp. 3111-3119, 2013. (link)
[3] Gensim: Topic modelling for humans. (link)
[4] scikit-learn: Machine Learning in Python (link)
# install StellarGraph if running on Google Colab
import sys
if 'google.colab' in sys.modules:
%pip install -q stellargraph[demos]==1.2.1
# verify that we're using the correct version of StellarGraph for this notebook
import stellargraph as sg
try:
sg.utils.validate_notebook_version("1.2.1")
except AttributeError:
raise ValueError(
f"This notebook requires StellarGraph version 1.2.1, but a different version {sg.__version__} is installed. Please see <https://github.com/stellargraph/stellargraph/issues/1172>."
) from None
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.metrics.pairwise import pairwise_distances
from sklearn import preprocessing
import numpy as np
from stellargraph.data import BiasedRandomWalk
from stellargraph import StellarGraph
from gensim.models import Word2Vec
import warnings
import collections
from stellargraph import datasets
from IPython.display import display, HTML
import matplotlib.pyplot as plt
%matplotlib inline
walk_length = 100 # maximum length of a random walk to use throughout this notebook
For weighted biased random walks the underlying graph should have weights over the edges. Since the links in the Cora dataset are unweighted, we need to synthetically add weights to the links in the graph. One possibility is to weight each edge by the similarity of its end nodes. Here we assign the Jaccard similarity of the features of the pair of nodes as the weight of edge which we can compute by logical operations on NumPy dataframes of the one-hot encoded features.
def jaccard_weights(graph, _subjects, edges):
sources = graph.node_features(edges.source)
targets = graph.node_features(edges.target)
intersection = np.logical_and(sources, targets)
union = np.logical_or(sources, targets)
return intersection.sum(axis=1) / union.sum(axis=1)
(See the "Loading from Pandas" demo for details on how data can be loaded.)
dataset = datasets.Cora()
display(HTML(dataset.description))
G, subjects = dataset.load(
largest_connected_component_only=True,
edge_weights=jaccard_weights,
str_node_ids=True, # Word2Vec requires strings, not ints
)
print(G.info())
StellarGraph: Undirected multigraph Nodes: 2485, Edges: 5209 Node types: paper: [2485] Features: float32 vector, length 1433 Edge types: paper-cites->paper Edge types: paper-cites->paper: [5209]
Just a quick look at the weights of the edges.
_, weights = G.edges(include_edge_weight=True)
wt, cnt = np.unique(weights, return_counts=True)
plt.figure(figsize=(10, 8))
plt.bar(wt, cnt, width=0.005, color="b")
plt.title("Edge weights histogram")
plt.ylabel("Count")
plt.xlabel("edge weights")
plt.xticks(np.linspace(0, 1, 10))
plt.show()
The above distribution of edge weights illustrates that majority of linked nodes are insignificantly similar in terms of their attributes.
The Node2Vec algorithm is a method for learning continuous feature representations for nodes in networks [1]. This approach can simply be described as a mapping of nodes to a low dimensional space of features that maximizes the likelihood of persevering neighborhood structure of the nodes. This approach is not tied to a fixed definition of neighborhood of a node but can be used in conjunction with different notions of node neighborhood, such as, homophily or structural equivalence, among other concepts. The algorithm efficiently explores diverse neighborhoods of nodes through a biased random walk procedure that is parametrized to emulate a specific concept of the neighborhood of a node.
Once a predefined number of walks, of fixed lengths, have been sampled, the low dimension embedding vectors of nodes can be learnt using Word2vec
algorithm [2]. We use the Word2Vec implementation in the free Python library Gensim
[3] to learn representations for each node in the graph.
The stellargraph library provides an implementation of random walks that can be unweighted or weighted as required by Node2Vec. The random walks have a predefined fixed maximum length and are controlled by three parameters p
, q
, and weight
. By default, the weight over the edges is assumed to be 1. See [1] for a detailed description of these parameters.
The first step for the weighted biased random walk is to build a random walk object by passing it a Stellargraph object.
rw = BiasedRandomWalk(G)
The next step is to sample a set of random walks of predefined length starting from each node of the graph. Parameters p
, q
, and weighted
influence the type of random walks in the procedure. In this demo, we are going to start 10 random walks from each node in the graph with a length up to 100. We set parameter p
to 0.5 and q
to 2.0 and the weight parameter set to True. The run
method in the random walk will check if the weights over the edges are available and resolve other issues, such as, whether the weights are numeric and that their is no ambiguity of edge traversal (i.e. each pair of node is connected by a unique numerically weighted edge).
weighted_walks = rw.run(
nodes=G.nodes(), # root nodes
length=walk_length, # maximum length of a random walk
n=10, # number of random walks per root node
p=0.5, # Defines (unormalised) probability, 1/p, of returning to source node
q=2.0, # Defines (unormalised) probability, 1/q, for moving away from source node
weighted=True, # for weighted random walks
seed=42, # random seed fixed for reproducibility
)
print("Number of random walks: {}".format(len(weighted_walks)))
Number of random walks: 24850
Once we have a sample set of walks, we learn the low-dimensional embedding of nodes using Word2Vec approach. We set the dimensionality of the learned embedding vectors to 128 as in [1].
weighted_model = Word2Vec(
weighted_walks, size=128, window=5, min_count=0, sg=1, workers=1, iter=1
)
# The embedding vectors can be retrieved from model.wv using the node ID as key.
# E.g., for node id '19231', the embedding vector is retrieved as
emb = weighted_model.wv["19231"]
emb.shape
(128,)
# Retrieve node embeddings and corresponding subjects
node_ids = weighted_model.wv.index2word # list of node IDs
weighted_node_embeddings = (
weighted_model.wv.vectors
) # numpy.ndarray of size number of nodes times embeddings dimensionality
# the gensim ordering may not match the StellarGraph one, so rearrange
node_targets = subjects.loc[node_ids].astype("category")
# Apply t-SNE transformation on node embeddings
tsne = TSNE(n_components=2, random_state=42)
weighted_node_embeddings_2d = tsne.fit_transform(weighted_node_embeddings)
# draw the points
alpha = 0.7
plt.figure(figsize=(10, 8))
plt.scatter(
weighted_node_embeddings_2d[:, 0],
weighted_node_embeddings_2d[:, 1],
c=node_targets.cat.codes,
cmap="jet",
alpha=0.7,
)
plt.show()
The node embeddings calculated using Word2Vec
can be used as feature vectors in a downstream task such as node classification. Here we give an example of training a logistic regression classifier using the node embeddings, learnt above, as features.
# X will hold the 128-dimensional input features
X = weighted_node_embeddings
# y holds the corresponding target values
y = np.array(node_targets)
We use 75% of the data for training and the remaining 25% for testing as a hold out test set.
X_train, X_test, y_train, y_test = train_test_split(
X, y, train_size=0.75, test_size=None, random_state=42
)
print(
"Array shapes:\n X_train = {}\n y_train = {}\n X_test = {}\n y_test = {}".format(
X_train.shape, y_train.shape, X_test.shape, y_test.shape
)
)
Array shapes: X_train = (1863, 128) y_train = (1863,) X_test = (622, 128) y_test = (622,)
We train a Logistic Regression classifier on the training data.
clf = LogisticRegressionCV(
Cs=10,
cv=10,
tol=0.001,
max_iter=1000,
scoring="accuracy",
verbose=False,
multi_class="ovr",
random_state=5434,
)
clf.fit(X_train, y_train)
LogisticRegressionCV(Cs=10, class_weight=None, cv=10, dual=False, fit_intercept=True, intercept_scaling=1.0, l1_ratios=None, max_iter=1000, multi_class='ovr', n_jobs=None, penalty='l2', random_state=5434, refit=True, scoring='accuracy', solver='lbfgs', tol=0.001, verbose=False)
Predict the hold out test set.
y_pred = clf.predict(X_test)
Calculate the accuracy of the classifier on the test set.
accuracy_score(y_test, y_pred)
0.8263665594855305
Lets compare weighted random walks with unweighted random walks. This simply requires toggling the weight parameter to False in the run
method of the BiasedRandomWalk. Note, the weight parameter is by default set to False, hence, not specifying the weight parameter would result in the same action.
walks = rw.run(
nodes=G.nodes(), # root nodes
length=walk_length, # maximum length of a random walk
n=10, # number of random walks per root node
p=0.5, # Defines (unormalised) probability, 1/p, of returning to source node
q=2.0, # Defines (unormalised) probability, 1/q, for moving away from source node
weighted=False, # since we are interested in unweighted walks
seed=42, # for reproducibility
)
print("Number of random walks: {}".format(len(walks)))
Number of random walks: 24850
model = Word2Vec(walks, size=128, window=5, min_count=0, sg=1, workers=1, iter=1)
# Retrieve node embeddings and corresponding subjects
node_ids = model.wv.index2word # list of node IDs
node_embeddings = (
model.wv.vectors
) # numpy.ndarray of size number of nodes times embeddings dimensionality
node_targets = subjects.loc[node_ids].astype("category")
# Apply t-SNE transformation on node embeddings
tsne = TSNE(n_components=2, random_state=42)
node_embeddings_2d = tsne.fit_transform(node_embeddings)
# draw the points
alpha = 0.7
plt.figure(figsize=(10, 8))
plt.scatter(
node_embeddings_2d[:, 0],
node_embeddings_2d[:, 1],
c=node_targets.cat.codes,
cmap="jet",
alpha=0.7,
)
plt.show()
Visual comparison of node embedding plots for weighted and unweighted random walks illustrates the differences between the two.
# X will hold the 128-dimensional input features
X = node_embeddings
# y holds the corresponding target values
y = np.array(node_targets)
X_train, X_test, y_train, y_test = train_test_split(
X, y, train_size=0.75, test_size=None, random_state=42
)
print(
"Array shapes:\n X_train = {}\n y_train = {}\n X_test = {}\n y_test = {}".format(
X_train.shape, y_train.shape, X_test.shape, y_test.shape
)
)
Array shapes: X_train = (1863, 128) y_train = (1863,) X_test = (622, 128) y_test = (622,)
clf = LogisticRegressionCV(
Cs=10,
cv=10,
tol=0.01,
max_iter=1000,
scoring="accuracy",
verbose=False,
multi_class="ovr",
random_state=5434,
)
clf.fit(X_train, y_train)
LogisticRegressionCV(Cs=10, class_weight=None, cv=10, dual=False, fit_intercept=True, intercept_scaling=1.0, l1_ratios=None, max_iter=1000, multi_class='ovr', n_jobs=None, penalty='l2', random_state=5434, refit=True, scoring='accuracy', solver='lbfgs', tol=0.01, verbose=False)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
0.8247588424437299
Generally, the node embeddings extracted from unweighted random walks are more representative of the underlying community structure of the Cora
dataset than the embeddings learnt from weighted random walks over the artificially weighted Cora network.
Lastly, we perform a quick check of whether weighted biased random walks are identical to unweighted biased random walks when weights over the edges are identically 1.
First, set weights of all edges in the graph to 1.
unweighted_G, _ = dataset.load(
largest_connected_component_only=True,
edge_weights=None,
str_node_ids=True, # Word2Vec requires strings, not ints
)
Quick check to confirm if all edge weights are actually 1.
_, weights = unweighted_G.edges(include_edge_weight=True)
wt, cnt = np.unique(weights, return_counts=True)
plt.figure(figsize=(10, 8))
plt.bar(wt, cnt, width=0.005, color="b")
plt.title("Edge weights histogram")
plt.ylabel("Count")
plt.xlabel("edge weights")
plt.xticks(np.linspace(0, 1, 10))
plt.show()
rw = BiasedRandomWalk(unweighted_G)
weighted_walks = rw.run(
nodes=unweighted_G.nodes(), # root nodes
length=walk_length, # maximum length of a random walk
n=10, # number of random walks per root node
p=0.5, # Defines (unormalised) probability, 1/p, of returning to source node
q=2.0, # Defines (unormalised) probability, 1/q, for moving away from source node
weighted=True, # indicates the walks are weighted
seed=42, # seed fixed for reproducibility
)
print("Number of random walks: {}".format(len(weighted_walks)))
Number of random walks: 24850
Compare unweighted walks with weighted walks when all weights are uniformly set to 1. Note, the two sets should be identical given all other parameters and random seeds are fixed.
assert walks == weighted_walks
weighted_model = Word2Vec(
weighted_walks, size=128, window=5, min_count=0, sg=1, workers=1, iter=1
)
# Retrieve node embeddings and corresponding subjects
node_ids = weighted_model.wv.index2word # list of node IDs
weighted_node_embeddings = (
weighted_model.wv.vectors
) # numpy.ndarray of size number of nodes times embeddings dimensionality
# Apply t-SNE transformation on node embeddings
tsne = TSNE(n_components=2, random_state=42)
weighted_node_embeddings_2d = tsne.fit_transform(weighted_node_embeddings)
# draw the points
alpha = 0.7
plt.figure(figsize=(10, 8))
plt.scatter(
weighted_node_embeddings_2d[:, 0],
weighted_node_embeddings_2d[:, 1],
c=node_targets.cat.codes,
cmap="jet",
alpha=0.7,
)
plt.show()
Compare classification of nodes through logistic regression on embeddings learnt from weighted (weight == 1) walks to that of unweighted walks demonstrated above.
# X will hold the 128-dimensional input features
X = weighted_node_embeddings
# y holds the corresponding target values
y = np.array(node_targets)
X_train, X_test, y_train, y_test = train_test_split(
X, y, train_size=0.75, test_size=None, random_state=42
)
print(
"Array shapes:\n X_train = {}\n y_train = {}\n X_test = {}\n y_test = {}".format(
X_train.shape, y_train.shape, X_test.shape, y_test.shape
)
)
Array shapes: X_train = (1863, 128) y_train = (1863,) X_test = (622, 128) y_test = (622,)
clf = LogisticRegressionCV(
Cs=10,
cv=10,
tol=0.001,
max_iter=1000,
scoring="accuracy",
verbose=False,
multi_class="ovr",
random_state=5434,
)
clf.fit(X_train, y_train)
LogisticRegressionCV(Cs=10, class_weight=None, cv=10, dual=False, fit_intercept=True, intercept_scaling=1.0, l1_ratios=None, max_iter=1000, multi_class='ovr', n_jobs=None, penalty='l2', random_state=5434, refit=True, scoring='accuracy', solver='lbfgs', tol=0.001, verbose=False)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
0.8247588424437299
np.array_equal(weighted_node_embeddings, node_embeddings)
True
The weighted random walks with weight == 1 are identical to unweighted random walks. Moreover, the embeddings learnt over the two kinds of walks are identical as well.
The above demo of application of Node2Vec method on the CORA dataset using weighted biased random walks demonstrates, weighted biased random walks produce inherently different node embeddings from the embeddings learnt through unweighted random walks over the same graph, as illustrated by t-SNE visualization of the two as well as comparison of performance over node classification. These differences illustrate that (realistic) weights on the edges of a graph can be leveraged to learn more accurate representation of nodes in low dimensional feature space.