Here, we empirically benchmark graph embedding algorithms. We implement multiple evaluation tasks:
Link Prediction. This is done for all graphs. We remove a fixed percentage of edges from the graph and predict the missing links. We train a logistic regression and a LightGBM gradient-boosted decision tree to predict the held-out edges and report their AUC, accuracy, and F1 score.
Clustering (on graphs with community/cluster labels). We run hierarchical agglomerative clustering on the graph embeddings and measure the overlap between the resulting clusters and the real-world communities in the network. Agglomerative clustering is chosen because it is deterministic and relatively insensitive to cluster shape, scaling, and the embedding's metric space. We measure overlap with the Rand index, mutual information score, and Fowlkes-Mallows score.
Label prediction. This can be multilabel classification (for graphs where a node can belong to multiple groups) or regular classification. We train a logistic regression and a LightGBM gradient-boosted decision tree to predict the labels and report their AUC, accuracy, and F1 score.
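To make the link-prediction setup concrete, here is a minimal sketch on random stand-in data. The Hadamard product of endpoint embeddings is one common edge featurization; this is an illustration, not necessarily what `link_pred.LinkPrediction` does internally:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

rng = np.random.default_rng(0)
# toy stand-ins: node embeddings, plus positive and negative node pairs
emb = rng.normal(size=(100, 16))
pos = rng.integers(0, 100, size=(200, 2))   # "real" edges
neg = rng.integers(0, 100, size=(200, 2))   # sampled non-edges

# Hadamard edge featurization: elementwise product of endpoint embeddings
X = np.vstack([emb[pos[:, 0]] * emb[pos[:, 1]],
               emb[neg[:, 0]] * emb[neg[:, 1]]])
y = np.array([1] * len(pos) + [0] * len(neg))

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]
print(roc_auc_score(y, proba), accuracy_score(y, clf.predict(X)),
      f1_score(y, clf.predict(X)))
```

In the real evaluation, the classifier is trained on edges from the train graph and scored on the held-out positive edges plus sampled negatives.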
We see a sharp divide in empirical performance between first-order and higher-order graph embedding methods.
First-order methods directly minimize the distance between nodes and their neighbours. Most graph embedding methods based on adjacency matrix factorization (Laplacian eigenmaps, SVD, etc.) are first order.
Higher-order methods take deeper graph structure into account. For instance, a second-order method would account for neighbours-of-neighbours in each node's embedding; third- and higher-order methods deepen this relationship further.
Note that you can do higher-order embedding with graph factorization algorithms by augmenting the graph adjacency matrix.
For instance, one of the methods tested below samples random walks on the graph and builds a co-occurrence matrix from those samples to train GloVe (a first-order algorithm). GGVec and GraRep generate higher-order adjacency matrices by taking the dot product of the graph's random-walk Markov chain transition matrix with itself, then applying a first-order embedding method to the result.
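A minimal sketch of that augmentation, assuming nodes labeled 0..n-1 and scipy sparse matrices (the function name is ours):

```python
import numpy as np
import scipy.sparse as sparse

def higher_order_transition(A, order=2):
    """Row-normalize adjacency matrix A into the random-walk
    transition matrix T, then raise it to the given power."""
    A = sparse.csr_matrix(A, dtype=np.float64)
    deg = np.asarray(A.sum(axis=1)).flatten()
    deg[deg == 0] = 1.0  # avoid division by zero on isolated nodes
    T = sparse.diags(1.0 / deg) @ A
    Tk = T
    for _ in range(order - 1):
        Tk = Tk @ T
    return Tk

# tiny path-shaped example: node 0 connected to nodes 1 and 2
A = sparse.csr_matrix(np.array([
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]))
T2 = higher_order_transition(A, order=2).toarray()
```

Any first-order factorization method can then be applied to `T2` (or a sum of powers of `T`) to capture second-order structure.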
Methods based on random walks + word2vec (DeepWalk, node2vec, etc.) are naturally higher-order. The order can be constrained by shrinking the word2vec window or by restricting walks to be extremely short.
We find that first-order methods generally outperform higher-order methods on clustering and label prediction. Recommended first-order methods are GGVec and ProNE.
On the other hand, higher-order methods perform better on the link prediction task. Interestingly, the gap in link prediction performance is nonexistent on artificially created graphs. This suggests higher-order methods do learn some of the structure intrinsic to real-world graphs.
These results underline the importance of evaluating embedding models on a diverse set of downstream tasks.
Moreover, we find that neural-net-based methods (DeepWalk, node2vec, and descendants) are extremely inefficient with respect to output dimensions: they tend to perform much worse with a small number of output dimensions.
We develop GGVec, a first- (and higher-) order embedding algorithm.
This method is very fast and scales to large graphs because it directly minimizes the distance between nodes that share an edge (like GloVe). It is naturally a first-order method but can be made higher order through the trick mentioned above (dot products of the graph transition matrix); the higher-order variants scale worse, however.
Moreover, it is the first algorithm that can embed directly from an edgelist file, because the minimization loop is per-edge. This remains to be implemented.
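A toy sketch of what a per-edge minimization loop looks like. This is illustrative only, not GGVec's actual loss or gradients: it simply pulls connected nodes together and pushes randomly sampled pairs apart, touching one edge at a time (which is what makes streaming from an edgelist file possible):

```python
import numpy as np

def edgewise_sgd(edges, n_nodes, dim=32, lr=0.05, epochs=50, neg_ratio=1, seed=42):
    """Toy per-edge embedding loop (GloVe/GGVec-style sketch)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=(n_nodes, dim))
    for _ in range(epochs):
        for src, dst in edges:
            # positive pair: shrink the distance between endpoints
            grad = w[src] - w[dst]
            w[src] -= lr * grad
            w[dst] += lr * grad
            # negative samples: gently push random pairs apart
            for neg in rng.integers(0, n_nodes, size=neg_ratio):
                grad = w[src] - w[neg]
                w[src] += lr * 0.1 * grad
                w[neg] -= lr * 0.1 * grad
    return w

# two disjoint triangles: nodes {0,1,2} and {3,4,5}
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
w = edgewise_sgd(edges, n_nodes=6)
```

After training, nodes within a triangle end up closer to each other than to nodes in the other triangle.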
import gc
import networkx as nx
import numpy as np
import os
import pandas as pd
import time
import scipy
import sklearn
from sklearn import cluster, linear_model
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
import sys
import warnings # Silence perf warning
sys.path.append(os.path.realpath('..'))
import nodevectors
import csrgraph as cg
from csrgraph import methods
from nodevectors.evaluation import link_pred
from nodevectors.evaluation import graph_eval
# UMAP to test (on pip)
import umap
warnings.simplefilter("ignore")
def nx_node_weights(G, method, **kwargs):
    """Node weights through the networkX API.
    Assumes nodes are labeled 0..len(G)-1."""
    pr = np.zeros(len(G))
    prdict = method(G, **kwargs)
    for i in G.nodes:
        pr[i] = prdict[i]
    return pr
#### CONFIG
N_COMPONENTS = 32 # resulting embedding dim
SEED = 42 # RNG Seed
TEST_SIZE = 0.2
# For resampling tests
RESAMPLE_WALKS = 10
RESAMPLE_LEN = 6
Data for these notebooks can be found here: https://github.com/VHRanger/Graph-Data. Download it and point the graph generation methods below to it.
The data lives in a separate repo to avoid polluting the pip package.
#### GRAPHS
#### Uncomment one to choose which graph to run evaluation on
#### Artificial random graphs
# G = nx.binomial_graph(700, 0.6)
# G, labels = graph_eval.make_cluster_graph(n_nodes=820, n_clusters=18, connections=1000, drop_pct=0.5)
# G, labels = graph_eval.make_weighed_cluster_graph(n_nodes=500, n_clusters=6, connections=1500, drop_pct=0.2, max_edge_weight=15)
#### Social graphs
# G, labels = graph_eval.make_blogcatalog(dedupe=True)
# G, mlabels = graph_eval.make_blogcatalog(dedupe=False)
G, labels = graph_eval.make_email()
# G, labels = graph_eval.get_karateclub("facebook") # twitch, github, facebook, wikipedia
# G = graph_eval.get_from_snap(url="http://snap.stanford.edu/data/facebook_combined.txt.gz", sep=' ', header=None, comment='#')
#### Biology Graphs
# G, mlabels = graph_eval.get_n2v_ppi("../data/bioNEV/node2vec_PPI")
#### Needs OutOfBounds Nodes support from CSRGraphs to work
# G = graph_eval.get_drugbank_ddi("../data/bioNEV/DrugBank_DDI")
# G, mlabels = graph_eval.get_mashup_ppi("../data/bioNEV/Mashup_PPI")
#### For Link Prediction: Split graph into train and test edge sets
#### (All nodes are still present in both)
G_train, testing_pos_edges = link_pred.split_train_test_graph(G, testing_ratio=TEST_SIZE)
#### Lazy way to set up evaluation
try:
    y = labels.label
    n_clusters = y.nunique()
    HAS_LABELS = True
    print(f"clusters: {n_clusters}")
except:
    try:  # Multilabels
        y = MultiLabelBinarizer().fit_transform(mlabels.mlabels)
        HAS_LABELS = True
        print(f"multilabels: {y.shape[1]}")
    except:  # No Labels
        HAS_LABELS = False
        print("No Labels")
NNODES = len(G)
print(f"Nodes: {NNODES}\nEdges: {len(G.edges)}\nconnected: {nx.is_connected(G_train)}")
clusters: 38
Nodes: 10312
Edges: 333983
connected: True
ggvec_params = dict(
    n_components=N_COMPONENTS,
    order=2,
    tol=0.07,
    tol_samples=10,
    max_epoch=6_000,
    learning_rate=0.1,
    negative_ratio=0.15,
    exponent=0.33,
    verbose=True,
)
start_t = time.time()
w_train = nodevectors.GGVec(**ggvec_params).fit_transform(G_train)
print(f"Time: {time.time() - start_t :.4f}")
result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)
time.sleep(0.1)
if HAS_LABELS:
    w = nodevectors.GGVec(**ggvec_params).fit_transform(G)
    graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)
Loss: 0.0070 :   0%|          | 12/6000 [00:27<3:52:49, 2.33s/it]
Converged! Loss: 0.0070
Time: 32.4033
Link Prediction:
    (logit) AUC-ROC: 0.956, AUC-PR: 0.954, Acc: 0.891, F1: 0.890
    (lgbm) AUC-ROC: 0.961, AUC-PR: 0.959, Acc: 0.897, F1: 0.898
Loss: 0.0070 :   0%|          | 12/6000 [00:27<3:49:13, 2.30s/it]
Converged! Loss: 0.0070
MI: 0.13, RAND 0.10, FM: 0.10
Label Prediction:
    (logit) Acc: 0.312, F1 micro: 0.312, F1 macro: 0.312
    (lgbm) Acc: 0.321, F1 micro: 0.321, F1 macro: 0.321
n2v_params = dict(
    n_components=N_COMPONENTS,
    epochs=20,
    walklen=60,
    return_weight=1.,
    neighbor_weight=1.,
    w2vparams={
        "window": 3,
        "negative": 5,
        "iter": 2,
        "batch_words": 128}
)
start_t = time.time()
w_train = nodevectors.Node2Vec(**n2v_params).fit_transform(G_train)
print(f"Time: {time.time() - start_t :.4f}")
result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)
if HAS_LABELS:
    w = nodevectors.Node2Vec(**n2v_params).fit_transform(G)
    graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)
Making walks... Done, T=3.73
Mapping Walk Names... Done, T=13.18
Training W2V... Done, T=41.50
Time: 60.9425
Link Prediction:
    (logit) AUC-ROC: 0.871, AUC-PR: 0.874, Acc: 0.794, F1: 0.792
    (lgbm) AUC-ROC: 0.955, AUC-PR: 0.951, Acc: 0.889, F1: 0.889
Making walks... Done, T=2.12
Mapping Walk Names... Done, T=13.19
Training W2V... Done, T=43.05
MI: 0.11, RAND 0.13, FM: 0.13
Label Prediction:
    (logit) Acc: 0.334, F1 micro: 0.334, F1 macro: 0.334
    (lgbm) Acc: 0.313, F1 micro: 0.313, F1 macro: 0.313
pne_params = dict(
    n_components=N_COMPONENTS,
    step=5,
    mu=0.2,
    theta=0.5,
)
start_t = time.time()
pne = nodevectors.ProNE(**pne_params)
w_train = pne.fit_transform(G_train)
print(f"Time: {time.time() - start_t :.4f}")
result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)
if HAS_LABELS:
    pne = nodevectors.ProNE(**pne_params)
    w = pne.fit_transform(G)
    graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)
Time: 2.4970
Link Prediction:
    (logit) AUC-ROC: 0.940, AUC-PR: 0.939, Acc: 0.867, F1: 0.866
    (lgbm) AUC-ROC: 0.957, AUC-PR: 0.954, Acc: 0.885, F1: 0.885
MI: 0.13, RAND 0.11, FM: 0.11
Label Prediction:
    (logit) Acc: 0.348, F1 micro: 0.348, F1 macro: 0.348
    (lgbm) Acc: 0.339, F1 micro: 0.339, F1 macro: 0.339
grarep_params = dict(
    n_components=N_COMPONENTS,
    order=1,
    embedder=TruncatedSVD(
        n_iter=10,
        random_state=42),
    # merger=(lambda x: np.sum(x, axis=0)),
    merger=lambda x: x[-1]
)
start_t = time.time()
w_train = nodevectors.GraRep(**grarep_params).fit_transform(G_train)
print(f"Time: {time.time() - start_t :.4f}")
result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)
time.sleep(0.1)
if HAS_LABELS:
    w = nodevectors.GraRep(**grarep_params).fit_transform(G)
    graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)
100%|██████████| 1/1 [00:22<00:00, 22.44s/it]
Time: 24.7683
Link Prediction:
    (logit) AUC-ROC: 0.922, AUC-PR: 0.909, Acc: 0.843, F1: 0.843
    (lgbm) AUC-ROC: 0.950, AUC-PR: 0.946, Acc: 0.874, F1: 0.877
100%|██████████| 1/1 [00:22<00:00, 22.16s/it]
MI: 0.04, RAND 0.23, FM: 0.23
Label Prediction:
    (logit) Acc: 0.145, F1 micro: 0.145, F1 macro: 0.145
    (lgbm) Acc: 0.310, F1 micro: 0.310, F1 macro: 0.310
ump_params = dict(
    embedder=umap.UMAP,
    n_neighbors=3,
    min_dist=0.,
    metric='cosine',
    normalize_graph=True,
    n_components=N_COMPONENTS,
)
start_t = time.time()
w_train = nodevectors.SKLearnEmbedder(**ump_params).fit_transform(G_train)
print(f"Time: {time.time() - start_t :.4f}")
result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)
if HAS_LABELS:
    w = nodevectors.SKLearnEmbedder(**ump_params).fit_transform(G)
    graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)
Time: 39.4685
Link Prediction:
    (logit) AUC-ROC: 0.927, AUC-PR: 0.924, Acc: 0.859, F1: 0.860
    (lgbm) AUC-ROC: 0.937, AUC-PR: 0.936, Acc: 0.866, F1: 0.865
MI: 0.04, RAND 0.07, FM: 0.07
Label Prediction:
    (logit) Acc: 0.178, F1 micro: 0.178, F1 macro: 0.178
    (lgbm) Acc: 0.176, F1 micro: 0.176, F1 macro: 0.176
### GloVe with random walks ###
glove_params = dict(
    n_components=N_COMPONENTS,
    tol=0.0005,
    max_epoch=6_000,
    learning_rate=0.02,
    max_loss=10.,
    max_count=50,
    exponent=0.5,
)
start_t = time.time()
wg = cg.csrgraph(G_train).random_walk_resample(walklen=RESAMPLE_LEN, epochs=RESAMPLE_WALKS)
w_train = nodevectors.Glove(**glove_params).fit_transform(wg)
print(f"Time: {time.time() - start_t :.4f}")
print(f"Virtual edges: {wg.dst.size}")
result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)
if HAS_LABELS:
    wg = cg.csrgraph(G).random_walk_resample(walklen=RESAMPLE_LEN, epochs=RESAMPLE_WALKS)
    w = nodevectors.Glove(**glove_params).fit_transform(wg)
    graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)
5%|▌         | 303/6000 [00:21<06:43, 14.12it/s]
Time: 25.4425
Virtual edges: 506553
Link Prediction:
    (logit) AUC-ROC: 0.671, AUC-PR: 0.709, Acc: 0.609, F1: 0.562
    (lgbm) AUC-ROC: 0.870, AUC-PR: 0.872, Acc: 0.787, F1: 0.791
5%|▍         | 274/6000 [00:21<07:30, 12.72it/s]
MI: 0.00, RAND 0.04, FM: 0.04
Label Prediction:
    (logit) Acc: 0.149, F1 micro: 0.149, F1 macro: 0.149
    (lgbm) Acc: 0.131, F1 micro: 0.131, F1 macro: 0.131
# From the related karateclub lib (on pip)
# https://github.com/benedekrozemberczki/KarateClub
# from karateclub.node_embedding.neighbourhood import NodeSketch, Walklets
###### Slooooowwwwwww ########
# walklets_params = dict(
#     walk_number=10,
#     walk_length=30,
#     dimensions=N_COMPONENTS,
#     window_size=4,
#     epochs=1,
#     learning_rate=0.05
# )
# try:  # Karateclub models don't handle certain graphs
#     start_t = time.time()
#     model = Walklets(**walklets_params)
#     model.fit(G_train)
#     print(f"Time: {time.time() - start_t :.3f}")
#     w_train = model.get_embedding()
#     result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)
#     if HAS_LABELS:
#         model = Walklets(**walklets_params)
#         model.fit(G)
#         w = model.get_embedding()
#         graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)
# except: pass
### Completely random baseline ###
w = np.random.randn(len(G), N_COMPONENTS)
result = link_pred.LinkPrediction(w, G, G_train, testing_pos_edges)
try:
    graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)
except: pass
Link Prediction:
    (logit) AUC-ROC: 0.572, AUC-PR: 0.569, Acc: 0.551, F1: 0.551
    (lgbm) AUC-ROC: 0.750, AUC-PR: 0.775, Acc: 0.684, F1: 0.664
MI: -0.00, RAND 0.04, FM: 0.04
Label Prediction:
    (logit) Acc: 0.143, F1 micro: 0.143, F1 macro: 0.143
    (lgbm) Acc: 0.115, F1 micro: 0.115, F1 macro: 0.115