A graph is stored in a file in a format like below:

`node1 \t node2`

where each line is an edge of the graph. We read this text representation into a `defaultdict` that maps each node to its list of neighbors.

In [13]:

```
from collections import defaultdict
import numpy as np
import random
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression as lg
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

def read_graph(filename):
    """Read a tab-separated edge list into an undirected adjacency list."""
    graph = defaultdict(list)
    with open(filename) as f:
        for line in f:
            node1, node2 = line.strip().split('\t')
            graph[node1].append(node2)  # store the edge in both directions
            graph[node2].append(node1)
    return graph

g = read_graph('../data/sigmod_subgraph.txt')
# you can print the graph to see the list representation: print(g)
```

A random walk visits the nodes of a graph at random. The basic random walk repeatedly selects one of the current node's neighbors, extending the process into one long trace. This approach has several problems. One is that the walk can be trapped in a small region of the graph. A worse scenario is that some parts of the graph may not be reachable at all. There are several ways to mitigate these problems. One traditional method is to perform a random jump (teleport) to an arbitrary node with some probability, as is widely done in PageRank-style algorithms.
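For illustration, a PageRank-style walk with a teleport probability can be sketched as follows; the toy graph, start node, and `alpha = 0.15` are assumptions for the sketch, not values from this notebook:

```python
import random

def walk_with_teleport(graph, start, steps, alpha=0.15):
    """With probability alpha, jump to a uniformly random node
    instead of following an edge (this also escapes dead ends)."""
    nodes = list(graph.keys())
    cur = start
    path = [cur]
    for _ in range(steps):
        if random.random() < alpha or not graph[cur]:
            cur = random.choice(nodes)          # random jump (teleport)
        else:
            cur = random.choice(graph[cur])     # normal step to a neighbor
        path.append(cur)
    return path

g_toy = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}
print(walk_with_teleport(g_toy, 'a', 5))
```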

Such traditional random walks are not well suited for embedding, because the random teleporting makes most traces short. DeepWalk instead produces traces of a fixed length, often around 100 steps, which solves the short-trace problem.

For directed graphs, the short-trace problem is more acute: because of the edge directions, most traces are short even when teleporting is limited. In other words, short walks are inevitable in random walks on directed graphs. To solve this problem, we keep the short walks but construct the training pairs in a different way, so that the performance is still good. This work is summarized in "ShortWalk: an approach to network embedding on directed graphs".

The following random walk algorithm is similar to DeepWalk's: it uses a fixed length and no random jumps.

In [14]:

```
def walk(G, length=10):
    """Generate one fixed-length walk starting from every node."""
    nodes = list(G.keys())
    random.shuffle(nodes)   # visit the start nodes in random order
    complete_path = ""
    for node in nodes:
        cur = node
        path = [cur]
        while len(path) < length:
            cur = random.choice(G[cur])   # step to a random neighbor
            path.append(cur)
        complete_path += ' '.join(path) + '\n'
    return complete_path

paths = walk(g)
# you can print the paths to see that traces are treated as text: print(paths)
dataset = '../data/sigmod_subgraph'   # base name for the output files
with open(dataset + ".walkpath", 'w') as f:
    f.write(paths)
```

In [15]:

```
class sentence(object):
    """Stream the walk file line by line as token lists for Word2Vec."""
    def __init__(self, filename):
        self.filename = filename
    def __iter__(self):
        for line in open(self.filename):
            yield line.rstrip().split()   # traces are space-separated

def embedding(filename):
    model = Word2Vec(
        sentence(filename),
        min_count = 0,   # keep every node, however rare
        size = 128,      # embedding dimension ('vector_size' in gensim >= 4)
        window = 5,
        sg = 1,          # skip-gram
        hs = 0,          # negative sampling
    )
    return model

model = embedding(dataset + ".walkpath")
```

In [16]:

```
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

def evaluation(test, model):
    """Node classification: embed each labeled node, then run a
    10-fold cross-validated logistic regression."""
    X = []
    Y = []
    labels = {}
    with open(test + '-idxs.txt') as idxs, open(test + '-labels.txt') as lbls:
        for idx, label in zip(idxs, lbls):
            idx = idx.rstrip()
            if idx not in model.wv:
                continue   # skip nodes that never appeared in a walk
            X.append(model.wv[idx])
            label = label.rstrip()
            if label not in labels:
                labels[label] = len(labels)   # map label strings to integers
            Y.append(labels[label])
    # logistic regression as the classifier, 10-fold cross validation
    Y_predict = cross_val_predict(lg(solver='lbfgs', multi_class='ovr'), X, Y, cv=10)
    res = classification_report(Y, Y_predict, digits=4)
    print(res)
    with open(f"{test}-result.txt", 'w') as fout:
        fout.write(res)

def deepwalk(filename):
    model = Word2Vec(
        sentence(filename),
        min_count = 0,
        size = 128,    # 'vector_size' in gensim >= 4
        window = 5,
        sg = 1,        # skip-gram
        hs = 0,        # negative sampling
        workers = 4,
    )
    return model
```

- Network embedding evaluation:
- Node classification
- Link prediction
- Can you extend the evaluation to new datasets that are used in serious academic papers?
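As a sketch of the link-prediction setting, a candidate edge can be scored by the cosine similarity of its endpoints' embeddings; the toy vectors below are illustrative, and in practice `emb` would be `model.wv`:

```python
import numpy as np

def edge_score(emb, u, v):
    """Cosine similarity between two node embeddings, used as a
    link-prediction score (higher = more likely to be an edge)."""
    a, b = emb[u], emb[v]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

emb = {'a': np.array([1.0, 0.0]),
       'b': np.array([0.9, 0.1]),
       'c': np.array([0.0, 1.0])}
print(edge_score(emb, 'a', 'b'))   # close to 1: likely edge
print(edge_score(emb, 'a', 'c'))   # close to 0: unlikely edge
```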

- DeepWalk hyper parameters:
- DeepWalk does not use the 'standard' random walk: it uses a fixed length instead of a random jump with some probability. The purpose of the fixed length is to generate walking traces ("sentences") that are not very short.
- A fixed length can still trap the random walk inside a restricted area. Will this cause a problem?
- Random jump with a probability: what is the distribution of walking trace lengths? What is the typical trace length?
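The last question has a quick answer by simulation: if each step independently teleports with probability `alpha`, segment lengths between jumps follow a geometric distribution with mean about `1/alpha` (`alpha = 0.15` is an illustrative value):

```python
import random

def trace_lengths(alpha, n_traces=100_000):
    """Sample the number of steps between two consecutive teleports."""
    lengths = []
    for _ in range(n_traces):
        length = 1
        while random.random() >= alpha:   # continue the walk w.p. 1 - alpha
            length += 1
        lengths.append(length)
    return lengths

random.seed(0)
lengths = trace_lengths(alpha=0.15)
print(sum(lengths) / len(lengths))   # close to 1 / 0.15, i.e. about 6.7
```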

- Walk length in DeepWalk: how long should the walks be?

- Node2Vec etc
- Can you compare a few network embedding algorithms?

- Mathematics of random walk
- The visiting probability of a node is proportional to its degree in undirected graphs. Does the fixed-length random walk change this property?
- The visiting probability of a node is proportional to its PageRank value in directed graphs. PageRank is better than node degree for evaluating the importance of nodes. Should we rely on PageRank values for embedding?
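The degree-proportionality claim is easy to verify empirically: on a small undirected graph, the fraction of visits to each node in a long standard walk should approach degree(v) / 2|E| (the toy graph and step count below are illustrative):

```python
import random
from collections import Counter, defaultdict

edges = [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c')]
adj = defaultdict(list)
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

random.seed(1)
cur = 'a'
visits = Counter()
for _ in range(200_000):
    cur = random.choice(adj[cur])   # standard unbounded random walk
    visits[cur] += 1

total = sum(visits.values())
two_E = sum(len(nbrs) for nbrs in adj.values())   # = 2 * |E|
for node in sorted(adj):
    # empirical visit frequency vs. theoretical degree / 2|E|
    print(node, round(visits[node] / total, 3), len(adj[node]) / two_E)
```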

- Directed graph: what is the performance if the graph is directed?
- More evaluation on our ShortWalk algorithm?
- ShortWalk: an approach to network embedding on directed graphs

- Heterogeneous graphs: how do we learn representations for graphs that have different node types and/or edge types?

- DeepWalk: Online Learning of Social Representations
- https://snap.stanford.edu/node2vec/
- ShortWalk: an approach to network embedding on directed graphs