Here, we empirically benchmark graph embedding algorithms. We implement multiple evaluation tasks:
Link Prediction. This is done for all graphs. We remove a fixed percentage of edges from the graph and predict the missing links. We train a logistic regression and a LightGBM gradient-boosted decision tree to predict the held-out edges and report their AUC, accuracy, and F1 score.
Clustering (on graphs with community/cluster labels). We run hierarchical agglomerative clustering on the graph embeddings and measure the overlap between the resulting clusters and the real-world communities in the network. Agglomerative clustering is chosen because it is deterministic and relatively insensitive to cluster shape, scaling, and the embedding's metric space. We measure overlap with the Rand index, mutual information score, and Fowlkes-Mallows score.
Label prediction. This can be multilabel classification (for graphs where a node can belong to multiple groups) or regular classification. We train a logistic regression and a LightGBM gradient-boosted decision tree to predict the labels and report their AUC, accuracy, and F1 score.
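To make the link-prediction setup concrete, here is a minimal sketch on random stand-in data. The Hadamard product of endpoint embeddings is one common edge featurization; this is an illustration, not necessarily what `link_pred.LinkPrediction` does internally:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

rng = np.random.default_rng(0)
# toy stand-ins: node embeddings, plus positive and negative node pairs
emb = rng.normal(size=(100, 16))
pos = rng.integers(0, 100, size=(200, 2))   # "real" edges
neg = rng.integers(0, 100, size=(200, 2))   # sampled non-edges

# Hadamard edge featurization: elementwise product of endpoint embeddings
X = np.vstack([emb[pos[:, 0]] * emb[pos[:, 1]],
               emb[neg[:, 0]] * emb[neg[:, 1]]])
y = np.array([1] * len(pos) + [0] * len(neg))

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]
print(roc_auc_score(y, proba), accuracy_score(y, clf.predict(X)),
      f1_score(y, clf.predict(X)))
```

In the real evaluation, the classifier is trained on edges from the train graph and scored on the held-out positive edges plus sampled negatives.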
We see a sharp divide in empirical performance between first-order and higher-order graph embedding methods.
First-order methods directly minimize the distance between nodes and their neighbours. Most graph embedding methods based on adjacency matrix factorization (Laplacian eigenmaps, SVD, etc.) are first order.
Higher-order methods take deeper graph structure into account. For instance, a second-order method would account for neighbours-of-neighbours in each node's embedding; third- and higher-order methods deepen this relationship further.
Note that you can do higher-order embedding with graph factorization algorithms by augmenting the graph adjacency matrix.
For instance, one of the methods tested below samples random walks on the graph and builds a co-occurrence matrix from those samples to train GloVe (a first-order algorithm). GGVec and GraRep generate higher-order adjacency matrices by taking the dot product of the graph's random-walk Markov chain transition matrix with itself, then applying a first-order embedding method to the result.
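A minimal sketch of that augmentation, assuming nodes labeled 0..n-1 and scipy sparse matrices (the function name is ours):

```python
import numpy as np
import scipy.sparse as sparse

def higher_order_transition(A, order=2):
    """Row-normalize adjacency matrix A into the random-walk
    transition matrix T, then raise it to the given power."""
    A = sparse.csr_matrix(A, dtype=np.float64)
    deg = np.asarray(A.sum(axis=1)).flatten()
    deg[deg == 0] = 1.0  # avoid division by zero on isolated nodes
    T = sparse.diags(1.0 / deg) @ A
    Tk = T
    for _ in range(order - 1):
        Tk = Tk @ T
    return Tk

# tiny path-shaped example: node 0 connected to nodes 1 and 2
A = sparse.csr_matrix(np.array([
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]))
T2 = higher_order_transition(A, order=2).toarray()
```

Any first-order factorization method can then be applied to `T2` (or a sum of powers of `T`) to capture second-order structure.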
Methods based on random walks + word2vec (DeepWalk, node2vec, etc.) are naturally higher-order. The order can be constrained by shrinking the word2vec window or by restricting walks to be extremely short.
We find that first-order methods generally outperform higher-order methods on clustering and label prediction. Recommended first-order methods are GGVec and ProNE.
On the other hand, higher-order methods perform better on the link prediction task. Interestingly, the gap in link prediction performance is nonexistent on artificially created graphs. This suggests higher-order methods do learn some of the structure intrinsic to real-world graphs.
These results underline the importance of evaluating embedding models on a diverse set of downstream tasks.
Moreover, we find that neural-net-based methods (DeepWalk, node2vec, and descendants) are extremely inefficient with respect to output dimensions: they tend to perform much worse with a small number of output dimensions.
We develop GGVec, a first- (and higher-) order embedding algorithm.
This method is very fast and scales to large graphs because it directly minimizes the distance between nodes that share an edge (like GloVe). It is naturally a first-order method but can be made higher order through the trick mentioned above (dot products of the graph transition matrix); the higher-order variants scale worse, however.
Moreover, it is the first algorithm that can embed directly from an edgelist file, because the minimization loop is per-edge. This remains to be implemented.
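A toy sketch of what a per-edge minimization loop looks like. This is illustrative only, not GGVec's actual loss or gradients: it simply pulls connected nodes together and pushes randomly sampled pairs apart, touching one edge at a time (which is what makes streaming from an edgelist file possible):

```python
import numpy as np

def edgewise_sgd(edges, n_nodes, dim=32, lr=0.05, epochs=50, neg_ratio=1, seed=42):
    """Toy per-edge embedding loop (GloVe/GGVec-style sketch)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=(n_nodes, dim))
    for _ in range(epochs):
        for src, dst in edges:
            # positive pair: shrink the distance between endpoints
            grad = w[src] - w[dst]
            w[src] -= lr * grad
            w[dst] += lr * grad
            # negative samples: gently push random pairs apart
            for neg in rng.integers(0, n_nodes, size=neg_ratio):
                grad = w[src] - w[neg]
                w[src] += lr * 0.1 * grad
                w[neg] -= lr * 0.1 * grad
    return w

# two disjoint triangles: nodes {0,1,2} and {3,4,5}
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
w = edgewise_sgd(edges, n_nodes=6)
```

After training, nodes within a triangle end up closer to each other than to nodes in the other triangle.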
import gc
import networkx as nx
import numpy as np
import os
import pandas as pd
import time
import scipy
import sklearn
from sklearn import cluster, linear_model
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
import sys
import warnings # Silence perf warning
sys.path.append(os.path.realpath('..'))
import nodevectors
import csrgraph as cg
from csrgraph import methods
from nodevectors.evaluation import link_pred
from nodevectors.evaluation import graph_eval
# UMAP to test (on pip)
import umap
warnings.simplefilter("ignore")
def nx_node_weights(G, method, **kwargs):
    """Node weights through the networkX API.
    Assumes nodes are labeled 0..len(G)-1."""
    pr = np.zeros(len(G))
    prdict = method(G, **kwargs)
    for i in G.nodes:
        pr[i] = prdict[i]
    return pr
#### CONFIG
N_COMPONENTS = 32 # resulting embedding dim
SEED = 42 # RNG Seed
TEST_SIZE = 0.2
# For resampling tests
RESAMPLE_WALKS = 10
RESAMPLE_LEN = 6
Data for these notebooks can be found here: https://github.com/VHRanger/Graph-Data. Download it and point the graph generation methods below to it.
The data lives in a separate repo to avoid polluting the pip package.
#### GRAPHS
#### Uncomment one to choose which graph to run evaluation on
#### Artificial random graphs
# G = nx.binomial_graph(700, 0.6)
# G, labels = graph_eval.make_cluster_graph(n_nodes=820, n_clusters=18, connections=1000, drop_pct=0.5)
# G, labels = graph_eval.make_weighed_cluster_graph(n_nodes=500, n_clusters=6, connections=1500, drop_pct=0.2, max_edge_weight=15)
#### Social graphs
# G, labels = graph_eval.make_blogcatalog(dedupe=True)
# G, mlabels = graph_eval.make_blogcatalog(dedupe=False)
G, labels = graph_eval.make_email()
# G, labels = graph_eval.get_karateclub("facebook") # twitch, github, facebook, wikipedia
# G = graph_eval.get_from_snap(url="http://snap.stanford.edu/data/facebook_combined.txt.gz", sep=' ', header=None, comment='#')
#### Biology Graphs
# G, mlabels = graph_eval.get_n2v_ppi("../data/bioNEV/node2vec_PPI")
#### Needs OutOfBounds Nodes support from CSRGraphs to work
# G = graph_eval.get_drugbank_ddi("../data/bioNEV/DrugBank_DDI")
# G, mlabels = graph_eval.get_mashup_ppi("../data/bioNEV/Mashup_PPI")
#### For Link Prediction: Split graph into train and test edge sets
#### (All nodes are still present in both)
G_train, testing_pos_edges = link_pred.split_train_test_graph(G, testing_ratio=TEST_SIZE)
#### Lazy way to set up evaluation
try:
    y = labels.label
    n_clusters = y.nunique()
    HAS_LABELS = True
    print(f"clusters: {n_clusters}")
except:
    try:  # Multilabels
        y = MultiLabelBinarizer().fit_transform(mlabels.mlabels)
        HAS_LABELS = True
        print(f"multilabels: {y.shape[1]}")
    except:  # No Labels
        HAS_LABELS = False
        print("No Labels")
NNODES = len(G)
print(f"Nodes: {NNODES}\nEdges: {len(G.edges)}\nconnected: {nx.is_connected(G_train)}")
clusters: 38
Nodes: 10312
Edges: 333983
connected: True
ggvec_params = dict(
    n_components=N_COMPONENTS,
    order=2,
    tol=0.07,
    tol_samples=10,
    max_epoch=6_000,
    learning_rate=0.1,
    negative_ratio=0.15,
    exponent=0.33,
    verbose=True,
)
start_t = time.time()
w_train = nodevectors.GGVec(**ggvec_params).fit_transform(G_train)
print(f"Time: {time.time() - start_t :.4f}")
result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)
time.sleep(0.1)
if HAS_LABELS:
    w = nodevectors.GGVec(**ggvec_params).fit_transform(G)
    graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)
Loss: 0.0070 :   0%|          | 12/6000 [00:27<3:52:49, 2.33s/it]
Converged! Loss: 0.0070
Time: 32.4033
Link Prediction:
    (logit) AUC-ROC: 0.956, AUC-PR: 0.954, Acc: 0.891, F1: 0.890
    (lgbm) AUC-ROC: 0.961, AUC-PR: 0.959, Acc: 0.897, F1: 0.898
Loss: 0.0070 :   0%|          | 12/6000 [00:27<3:49:13, 2.30s/it]
Converged! Loss: 0.0070
MI: 0.13, RAND 0.10, FM: 0.10
Label Prediction:
    (logit) Acc: 0.312, F1 micro: 0.312, F1 macro: 0.312
    (lgbm) Acc: 0.321, F1 micro: 0.321, F1 macro: 0.321
n2v_params = dict(
    n_components=N_COMPONENTS,
    epochs=20,
    walklen=60,
    return_weight=1.,
    neighbor_weight=1.,
    w2vparams={
        "window": 3,
        "negative": 5,
        "iter": 2,
        "batch_words": 128}
)
start_t = time.time()
w_train = nodevectors.Node2Vec(**n2v_params).fit_transform(G_train)
print(f"Time: {time.time() - start_t :.4f}")
result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)
if HAS_LABELS:
    w = nodevectors.Node2Vec(**n2v_params).fit_transform(G)
    graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)
Making walks... Done, T=3.73
Mapping Walk Names... Done, T=13.18
Training W2V... Done, T=41.50
Time: 60.9425
Link Prediction:
    (logit) AUC-ROC: 0.871, AUC-PR: 0.874, Acc: 0.794, F1: 0.792
    (lgbm) AUC-ROC: 0.955, AUC-PR: 0.951, Acc: 0.889, F1: 0.889
Making walks... Done, T=2.12
Mapping Walk Names... Done, T=13.19
Training W2V... Done, T=43.05
MI: 0.11, RAND 0.13, FM: 0.13
Label Prediction:
    (logit) Acc: 0.334, F1 micro: 0.334, F1 macro: 0.334
    (lgbm) Acc: 0.313, F1 micro: 0.313, F1 macro: 0.313
pne_params = dict(
    n_components=N_COMPONENTS,
    step=5,
    mu=0.2,
    theta=0.5,
)
start_t = time.time()
pne = nodevectors.ProNE(**pne_params)
w_train = pne.fit_transform(G_train)
print(f"Time: {time.time() - start_t :.4f}")
result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)
if HAS_LABELS:
    pne = nodevectors.ProNE(**pne_params)
    w = pne.fit_transform(G)
    graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)
Time: 2.4970
Link Prediction:
    (logit) AUC-ROC: 0.940, AUC-PR: 0.939, Acc: 0.867, F1: 0.866
    (lgbm) AUC-ROC: 0.957, AUC-PR: 0.954, Acc: 0.885, F1: 0.885
MI: 0.13, RAND 0.11, FM: 0.11
Label Prediction:
    (logit) Acc: 0.348, F1 micro: 0.348, F1 macro: 0.348
    (lgbm) Acc: 0.339, F1 micro: 0.339, F1 macro: 0.339
grarep_params = dict(
    n_components=N_COMPONENTS,
    order=1,
    embedder=TruncatedSVD(
        n_iter=10,
        random_state=42),
    # merger=(lambda x: np.sum(x, axis=0)),
    merger=lambda x: x[-1]
)
start_t = time.time()
w_train = nodevectors.GraRep(**grarep_params).fit_transform(G_train)
print(f"Time: {time.time() - start_t :.4f}")
result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)
time.sleep(0.1)
if HAS_LABELS:
    w = nodevectors.GraRep(**grarep_params).fit_transform(G)
    graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)
100%|██████████| 1/1 [00:22<00:00, 22.44s/it]
Time: 24.7683
Link Prediction:
    (logit) AUC-ROC: 0.922, AUC-PR: 0.909, Acc: 0.843, F1: 0.843
    (lgbm) AUC-ROC: 0.950, AUC-PR: 0.946, Acc: 0.874, F1: 0.877
100%|██████████| 1/1 [00:22<00:00, 22.16s/it]
MI: 0.04, RAND 0.23, FM: 0.23
Label Prediction:
    (logit) Acc: 0.145, F1 micro: 0.145, F1 macro: 0.145
    (lgbm) Acc: 0.310, F1 micro: 0.310, F1 macro: 0.310
ump_params = dict(
    embedder=umap.UMAP,
    n_neighbors=3,
    min_dist=0.,
    metric='cosine',
    normalize_graph=True,
    n_components=N_COMPONENTS,
)
start_t = time.time()
w_train = nodevectors.SKLearnEmbedder(**ump_params).fit_transform(G_train)
print(f"Time: {time.time() - start_t :.4f}")
result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)
if HAS_LABELS:
    w = nodevectors.SKLearnEmbedder(**ump_params).fit_transform(G)
    graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)
Time: 39.4685
Link Prediction:
    (logit) AUC-ROC: 0.927, AUC-PR: 0.924, Acc: 0.859, F1: 0.860
    (lgbm) AUC-ROC: 0.937, AUC-PR: 0.936, Acc: 0.866, F1: 0.865
MI: 0.04, RAND 0.07, FM: 0.07
Label Prediction:
    (logit) Acc: 0.178, F1 micro: 0.178, F1 macro: 0.178
    (lgbm) Acc: 0.176, F1 micro: 0.176, F1 macro: 0.176
### GloVe with random walks ###
glove_params = dict(
    n_components=N_COMPONENTS,
    tol=0.0005,
    max_epoch=6_000,
    learning_rate=0.02,
    max_loss=10.,
    max_count=50,
    exponent=0.5,
)
start_t = time.time()
wg = cg.csrgraph(G_train).random_walk_resample(walklen=RESAMPLE_LEN, epochs=RESAMPLE_WALKS)
w_train = nodevectors.Glove(**glove_params).fit_transform(wg)
print(f"Time: {time.time() - start_t :.4f}")
print(f"Virtual edges: {wg.dst.size}")
result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)
if HAS_LABELS:
    wg = cg.csrgraph(G).random_walk_resample(walklen=RESAMPLE_LEN, epochs=RESAMPLE_WALKS)
    w = nodevectors.Glove(**glove_params).fit_transform(wg)
    graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)
5%|▌         | 303/6000 [00:21<06:43, 14.12it/s]
Time: 25.4425
Virtual edges: 506553
Link Prediction:
    (logit) AUC-ROC: 0.671, AUC-PR: 0.709, Acc: 0.609, F1: 0.562
    (lgbm) AUC-ROC: 0.870, AUC-PR: 0.872, Acc: 0.787, F1: 0.791
5%|▍         | 274/6000 [00:21<07:30, 12.72it/s]
MI: 0.00, RAND 0.04, FM: 0.04
Label Prediction:
    (logit) Acc: 0.149, F1 micro: 0.149, F1 macro: 0.149
    (lgbm) Acc: 0.131, F1 micro: 0.131, F1 macro: 0.131
# From the related karateclub lib (on pip)
# https://github.com/benedekrozemberczki/KarateClub
# from karateclub.node_embedding.neighbourhood import NodeSketch, Walklets
###### Slooooowwwwwww ########
# walklets_params = dict(
#     walk_number=10,
#     walk_length=30,
#     dimensions=N_COMPONENTS,
#     window_size=4,
#     epochs=1,
#     learning_rate=0.05
# )
# try:  # Karateclub models don't handle certain graphs
#     start_t = time.time()
#     model = Walklets(**walklets_params)
#     model.fit(G_train)
#     print(f"Time: {time.time() - start_t :.3f}")
#     w_train = model.get_embedding()
#     result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)
#     if HAS_LABELS:
#         model = Walklets(**walklets_params)
#         model.fit(G)
#         w = model.get_embedding()
#         graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)
# except: pass
### Completely random baseline ###
w = np.random.randn(len(G), N_COMPONENTS)
result = link_pred.LinkPrediction(w, G, G_train, testing_pos_edges)
try:
    graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)
except: pass
Link Prediction:
    (logit) AUC-ROC: 0.572, AUC-PR: 0.569, Acc: 0.551, F1: 0.551
    (lgbm) AUC-ROC: 0.750, AUC-PR: 0.775, Acc: 0.684, F1: 0.664
MI: -0.00, RAND 0.04, FM: 0.04
Label Prediction:
    (logit) Acc: 0.143, F1 micro: 0.143, F1 macro: 0.143
    (lgbm) Acc: 0.115, F1 micro: 0.115, F1 macro: 0.115