Read_graph

A graph is stored in a file in a format like below:

node1 \t node2

Where each line is an edge of the graph. We read the txt representation into defaultdict.

In [13]:
from collections import defaultdict
import numpy as np
import random
from gensim.models import Word2Vec, keyedvectors 
from sklearn.linear_model import LogisticRegression as lg
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import *
import sys
def read_graph(filename):
  graph = defaultdict(list)
  with open(filename) as f:
    for line in f:
        line = line.strip().split('\t')
        graph[line[0]].append(line[1])
        graph[line[1]].append(line[0])
  return graph

g=read_graph('../data/sigmod_subgraph.txt')
# you can print the graph to see the list representation. print(g)

Random Walk

A random walk visits nodes in a graph randomly. The basic random walk keeps selecting one of the neiboring nodes, and continue the process in a long trace. There are several problems with this random walk approach. One is that the random walk could be trapped in a small region. A worse scenario is that some parts may not be even reachable. There are several approaches to mitigating these problems. One traditional method is to do random jump with some probability. This is widely used in PageRank algorithms.

Such traditional random walks are not good for embedding purpose, because the random walk taces are mostly short due to the random teleporting. DeepWalk produces traces that are of fixed length, mostly 100, hence solves the short trace problem.

For directled graphs, the short trace problem is more acute, because most traces are short even if teleporting is limited because of directions in the graph. In other words, shortwalks are inevitable in random walks. To solve this problem, we utilize the short walks but create the training pairs in a different way, so that the performance is still good. This work is summarized in "ShortWalk: an approach to network embedding on directed graphs".

The following random walk algorithm is the one similar to DeepWalk--it has a fixed length without random jump.

In [14]:
def walk(G, LENGTH = 10):
    nodes = list(G.keys())
    random.shuffle(nodes)
    N=len(G)
    complete_path=""
    for i,node in enumerate(nodes):
        cur = node
        path = [cur]
        while len(path) < LENGTH:
            nextnode = G[cur][np.random.randint(len(G[cur]))] 
            cur=nextnode
            path.append(cur)
         
        complete_path=complete_path+' '.join(path)+ '\n'
    return complete_path

paths=walk(g)
# you can print the paths to see that traces are treated as text. print(paths)
file = open(dataset+".walkpath", 'w')
file.write(paths)
file.close()
In [ ]:
 
In [15]:
class sentence(object):
    def __init__(self, filename):
        self.filename = filename
    def __iter__(self):
        for line in open(self.filename):
            yield line.rstrip().split('\t')
                  
def embedding(filename):
    model = Word2Vec(
        sentence(filename),
        min_count = 0,
        size = 128,
        window = 5,
        sg = 1,
        hs = 0, # negative sampling
    )
    return model

model = embedding(dataset+".walkpath")
            
2021-02-24 09:59:42,441 : INFO : collecting all words and their counts 
2021-02-24 09:59:42,446 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types 
2021-02-24 09:59:42,478 : INFO : collected 2619 word types from a corpus of 2619 raw words and 2619 sentences 
2021-02-24 09:59:42,480 : INFO : Loading a fresh vocabulary 
2021-02-24 09:59:42,498 : INFO : effective_min_count=0 retains 2619 unique words (100% of original 2619, drops 0) 
2021-02-24 09:59:42,500 : INFO : effective_min_count=0 leaves 2619 word corpus (100% of original 2619, drops 0) 
2021-02-24 09:59:42,526 : INFO : deleting the raw counts dictionary of 2619 items 
2021-02-24 09:59:42,527 : INFO : sample=0.001 downsamples 0 most-common words 
2021-02-24 09:59:42,529 : INFO : downsampling leaves estimated 2619 word corpus (100.0% of prior 2619) 
2021-02-24 09:59:42,546 : INFO : estimated required memory for 2619 words and 128 dimensions: 3991356 bytes 
2021-02-24 09:59:42,548 : INFO : resetting layer weights 
2021-02-24 09:59:43,242 : INFO : training model with 3 workers on 2619 vocabulary and 128 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 
2021-02-24 09:59:43,286 : INFO : worker thread finished; awaiting finish of 2 more threads 
2021-02-24 09:59:43,287 : INFO : worker thread finished; awaiting finish of 1 more threads 
2021-02-24 09:59:43,289 : INFO : worker thread finished; awaiting finish of 0 more threads 
2021-02-24 09:59:43,290 : INFO : EPOCH - 1 : training on 2619 raw words (2619 effective words) took 0.0s, 132864 effective words/s 
2021-02-24 09:59:43,330 : INFO : worker thread finished; awaiting finish of 2 more threads 
2021-02-24 09:59:43,332 : INFO : worker thread finished; awaiting finish of 1 more threads 
2021-02-24 09:59:43,334 : INFO : worker thread finished; awaiting finish of 0 more threads 
2021-02-24 09:59:43,336 : INFO : EPOCH - 2 : training on 2619 raw words (2619 effective words) took 0.0s, 455724 effective words/s 
2021-02-24 09:59:43,382 : INFO : worker thread finished; awaiting finish of 2 more threads 
2021-02-24 09:59:43,384 : INFO : worker thread finished; awaiting finish of 1 more threads 
2021-02-24 09:59:43,385 : INFO : worker thread finished; awaiting finish of 0 more threads 
2021-02-24 09:59:43,387 : INFO : EPOCH - 3 : training on 2619 raw words (2619 effective words) took 0.0s, 72363 effective words/s 
2021-02-24 09:59:43,422 : INFO : worker thread finished; awaiting finish of 2 more threads 
2021-02-24 09:59:43,425 : INFO : worker thread finished; awaiting finish of 1 more threads 
2021-02-24 09:59:43,428 : INFO : worker thread finished; awaiting finish of 0 more threads 
2021-02-24 09:59:43,430 : INFO : EPOCH - 4 : training on 2619 raw words (2619 effective words) took 0.0s, 81995 effective words/s 
2021-02-24 09:59:43,456 : INFO : worker thread finished; awaiting finish of 2 more threads 
2021-02-24 09:59:43,459 : INFO : worker thread finished; awaiting finish of 1 more threads 
2021-02-24 09:59:43,461 : INFO : worker thread finished; awaiting finish of 0 more threads 
2021-02-24 09:59:43,463 : INFO : EPOCH - 5 : training on 2619 raw words (2619 effective words) took 0.0s, 98933 effective words/s 
2021-02-24 09:59:43,464 : INFO : training on a 13095 raw words (13095 effective words) took 0.2s, 59678 effective words/s 
2021-02-24 09:59:43,466 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay 
In [16]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s ' , level=logging .INFO)

def evaluation(test, model):
    i = open(test+'-idxs.txt')
    l = open(test+'-labels.txt')
    X = []
    Y = []
    labels = {}
    for idx,label in zip(i,l):
        idx = idx.rstrip()
        try:
            X.append(model.wv[idx])
            label = label.rstrip()
            if label not in labels:
                labels[label] = len(labels)
            Y.append(labels[label])
        except KeyError:
            continue
    Y_predict = cross_val_predict(lg(solver='lbfgs',multi_class='ovr'),X,Y,cv=10) 
    # use logistic regression as the classifier, 10-fold cross validation
    res = classification_report(Y,Y_predict,digits=4)
    print(res)
    
    with open(f"{test}-{model}-result.txt",'w') as fout:
        fout.write(res)
        
def deepwalk(filename):
    model = Word2Vec(
        sentence(filename),
        min_count = 0,
        size = 128,
        window = 5,
        sg = 1,
        hs = 0, # negative sampling
        workers = 4,
    )
    return model

class sentence(object):
    def __init__(self, filename):
        self.filename = filename
    def __iter__(self):
        for line in open(self.filename):
            yield line.rstrip().split('\t')

Project_topics

  • Network embedding evaluaiton:
    • Node classification
    • Link prediction
    • Can you extend the evaluation on new datasets that are used in serious academic papers?
  • DeepWalk hyper parameters:
    • DeepWalk does not use the 'standard' random walk. It uses the fixed length instead of random jump with a probability. The purpuse of fixed length is to generate walking traces ("senteces") that are not very short.
      • Fixed length can trap the random walking inside a restricted area. Will this cause a problem?
      • Random jump with a probability: What is the walking trace length distribution? What is the normal length of traces?
    • Walk length in DeepWalk: how long the walk length should be?
  • Node2Vec etc
    • Can you compare a few network embedding algorithms?
  • Mathematics of random walk
    • Visiting probability of a node is proportional to its degree in undirected graphs. Is fixed-length RW change such property?
    • Visiting probability of a node is proportional to the PageRank value in directed graphs. PageRank is better than node degree when evaluating the importance of nodes. Whether we should rely on PageRank values in embedding?
  • Directed graph: what is the performance if the graph is directed?
    • More evaluation on our ShortWalk algorithm?
    • ShortWalk: an approach to network embedding on directed graphs
  • Heterogeneous_graph: How to learn representations for heterogeneous graphs that have different node types and/or edge types?

Resources