Notebook

Node Label Prediction \ Link Prediction¶

In [1]:

import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import networkx as nx
%matplotlib inline

We will start by node label prediction. Download this network. It contains protein communications in Baker’s yeast. Each node (protein) has a special binary attribute ICSC (intracellular signaling cascade).

In [3]:

g = nx.read_gml('seminar_prediction/ppi.CC.gml')
cc = list(nx.connected_components(g))
g = nx.subgraph(g,cc[0])
g = nx.relabel.convert_node_labels_to_integers(g)

In [4]:

labels = np.array(nx.get_node_attributes(g, 'ICSC').values(), dtype=float)

In [5]:

nx.draw_spring(g, node_color = labels)

It might not be clear from the picture above but the level of homogeneity is quite high. For each node we are able to compute the average value of label

In [8]:

nnICSC = np.asarray(map(lambda(v): np.mean(labels[g.neighbors(v)]), g.nodes_iter()))
nnICSC
plt.figure(figsize=(10,5))
plt.hist(nnICSC[np.where(labels == 1)], bins=6,)

Out[8]:

(array([  5.,   1.,   3.,  15.,  12.,  34.]),
 array([ 0.        ,  0.16666667,  0.33333333,  0.5       ,  0.66666667,
         0.83333333,  1.        ]),
 <a list of 6 Patch objects>)

Iterative Classification Method¶

ICM is kind of NN-classifier. Our prediction is based on the largest ratio of nearest neighbours of labeled nodes.

Task 1¶

Randomly set unlabeled nodes.
Implement classification rule of ICM (HINT look at the code cell above)
Implement function for classification error and use it wisely

In [9]:

lablNN = labels[:]
idx = np.random.randint(0,len(lablNN), size=40)
lablNN[idx] = np.nan

In [12]:

idx = np.nonzero(np.isnan(lablNN))[0]
lablClassified = np.asarray(map(lambda(v): np.round(np.nanmean(lablNN[g.neighbors(v)])), idx))
lablNN[idx] = lablClassified
lablNN

Out[12]:

array([ 1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,
        1.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,
        0.,  1.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,  1.,
        0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,
        0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  0.,  0.,
        1.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,
        0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,
        1.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  0.])

In [13]:

def classError(trueLabel, predLabels):
    return np.sum(trueLabel != predLabels)

In [14]:

classError(labels, lablNN)

Out[14]:

Label Propagation¶

Now instead of looking at neigbours we are switching random walks between all the nodes

Just to recall the Label Propagation method:

Compute $P = D^{-1}A$
Set $Y^{(0)} = (Y_l,0)$ ($Y_l$ - labeled data)
repeat
- $Y^{(t+1)} = PY^{(t)}$
- $Y_l^{(t+1)} = Y_l$
until $Y^{(t)}$ converges

In [15]:

# It is better to initialize like that
fixedLabels = labels[:]+1
curLabels = labels[:]+1

# And indicate labeled nodes instead of unlabeled
idxLabeled = np.zeros((g.number_of_nodes(),), dtype=bool)
idxLabeled[np.random.randint(0,len(labels), size=90)] = True
curLabels[~idxLabeled] = 0

In [16]:

def LabelPropagation(G, idxLabeled, curLabels, fixedLabels, iters = 1000):
    A = np.asarray(nx.adj_matrix(G).todense())
    P = np.diag(1.0/sum(A)).dot(A)
    
    iterNorm = 1.0
    resultLabels = P.dot(curLabels)
    deltaNorm = 1.0
    iterNum = 1
    
    while (deltaNorm > 1e-3 and iterNum < iters and ~np.all(resultLabels >= 1)):
        iterNorm = np.linalg.norm(resultLabels - curLabels)
        runLabels, curLabels = resultLabels[:],resultLabels[:]
        runLabels[idxLabeled] = fixedLabels[idxLabeled]
        resultLabels = P.dot(runLabels)
        deltaNorm = np.abs(iterNorm - np.linalg.norm(resultLabels - curLabels))
        print deltaNorm
        iterNum+=1
    print 'Converged in {0} iteration'.format(iterNum)
    return np.round(resultLabels)

In [18]:

result = LabelPropagation(g, idxLabeled, curLabels, fixedLabels, iters = 1000)
result

6.6004377163
1.05281315836
0.279725703227
0.129492471182
0.051648016945
0.0200835427088
0.0128969227926
0.00434079880178
0.00382025524803
0.0013564441591
0.00124407968333
0.000582641210313
Converged in 13 iteration

Out[18]:

array([ 1.,  1.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,
        2.,  2.,  2.,  1.,  1.,  1.,  1.,  2.,  1.,  1.,  1.,  2.,  2.,
        2.,  2.,  1.,  2.,  1.,  1.,  1.,  2.,  1.,  1.,  1.,  2.,  2.,
        1.,  2.,  1.,  2.,  2.,  1.,  1.,  1.,  1.,  1.,  2.,  2.,  2.,
        1.,  1.,  1.,  1.,  1.,  1.,  2.,  1.,  1.,  1.,  2.,  2.,  2.,
        1.,  1.,  2.,  1.,  1.,  1.,  2.,  2.,  2.,  1.,  2.,  1.,  1.,
        1.,  1.,  2.,  2.,  1.,  1.,  1.,  2.,  2.,  1.,  1.,  1.,  1.,
        1.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  1.,  1.,  1.,
        1.,  2.,  2.,  2.,  1.,  1.,  1.,  1.,  2.,  2.,  1.,  1.,  1.,
        2.,  1.,  2.,  2.,  2.,  2.,  2.,  2.,  1.,  1.])

In [19]:

classError(labels+1, result)

Out[19]:

Link Prediction - Scoring functions¶

In this section we will implement some scoring functions for link prediction and compare the values for adjacent and non-adjacent nodes.

Load french blog network and compute the following scores:

In [20]:

g = nx.read_gml('./seminar_prediction/fblog.gml')
vNum = g.number_of_nodes()

In [21]:

def matrixPlot(A):
    plt.figure(1, figsize=(6, 6))
    plt.imshow(A,
           interpolation="none"
           )

Shortest Path Length¶

In [22]:

# Your code here
spScore = nx.shortest_paths.floyd_warshall_numpy(g)
matrixPlot(spScore)

Number of common neighbours¶

In [27]:

# Your code here
nnScore = np.zeros((vNum,vNum))
for i in xrange(vNum):
    for j in xrange(i+1, vNum):
        nnScore[i,j] = len(np.intersect1d(nx.neighbors(g,i),nx.neighbors(g,j)))
nnScore[0,]

Out[27]:

array([ 0.,  1.,  2.,  1.,  2.,  2.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,
        1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,
        0.,  0.,  0.,  0.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

Jaccard Score¶

In [29]:

# Your code here
jaScore = np.zeros((vNum,vNum))
for i in xrange(vNum):
    for j in xrange(i+1, vNum):
        jaScore[i,j] = 1.0*len(np.intersect1d(nx.neighbors(g,i),nx.neighbors(g,j))) / len(np.union1d(nx.neighbors(g,i),nx.neighbors(g,j)))
jaScore[0,]

Out[29]:

array([ 0.        ,  0.05882353,  0.5       ,  0.09090909,  0.14285714,
        0.33333333,  0.4       ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.02      ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.03448276,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.04347826,  0.        ,  0.        ,
        0.01818182,  0.        ,  0.        ,  0.        ,  0.        ,
        0.03703704,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.01960784,  0.        ,
        0.        ,  0.        ,  0.04545455,  0.        ,  0.        ,
        0.04166667,  0.        ,  0.09090909,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.05      ,  0.        ,  0.        ,  0.03846154,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.02272727,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.03921569,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ])

Adamic/Adar Score¶

$Score(a,b) = \sum\limits_{v \in \text{NN}(a) \cap \text{NN}(b)} \frac{1}{\log(|\text{NN}(v)|)}$

In [31]:

# Your code here
aaScore = np.zeros((vNum,vNum))
for i in xrange(vNum):
    for j in xrange(i+1, vNum):
        for v in nx.common_neighbors(g, i, j):
            aaScore[i,j] += 1/np.log(len(nx.neighbors(g,v)))
aaScore[0,]

Out[31]:

array([ 0.        ,  0.43429448,  0.79496824,  0.36067376,  0.79496824,
        0.79496824,  0.79496824,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.43429448,  0.        ,
        0.        ,  0.        ,  0.        ,  0.36067376,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.36067376,  0.        ,  0.        ,
        0.36067376,  0.        ,  0.        ,  0.        ,  0.        ,
        0.36067376,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.36067376,  0.        ,
        0.        ,  0.        ,  0.36067376,  0.        ,  0.        ,
        0.43429448,  0.        ,  0.36067376,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.36067376,  0.        ,  0.        ,  0.36067376,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.43429448,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.79496824,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ])

Preferential Attachment score¶

$Score(a,b) = |\text{NN}(a)| \times |\text{NN}(b)|$

In [32]:

# Your code here
paScore = np.zeros((vNum,vNum))
for i in xrange(vNum):
    for j in xrange(i+1,vNum):
        paScore[i,j] = len(nx.neighbors(g,i))*len(nx.neighbors(g,j))

Katz Score¶

$Score(a,b) = \sum\limits_{\text{Paths}_{a,b}} \beta^{|p_{a,b}|}\times|p_{a,b}|$

In [ ]:

# Your code here
katzScore = np.zeros((vNum, vNum))
beta = 0.6
for i in xrange(vNum):
    for j in xrange(i+1, vNum):
        l = map(lambda x: beta**len(x)*len(x), tuple(nx.all_simple_paths(g, i, j, cutoff=4)))
        katzScore[i,j] = np.sum(np.power(beta, l)*l)

Let's compare the scores behavious for pairs of nodes with or without edge in-between

In [42]:

A = np.asarray(nx.adj_matrix(g).todense())
xyTriu = np.vstack(np.triu_indices_from(A, k=1)).T
wEdge = [nnScore[xy[0],xy[1]] for xy in xyTriu if A[xy[0],xy[1]]]
woEdge = [nnScore[xy[0],xy[1]] for xy in xyTriu if ~A[xy[0],xy[1]]]
data = [wEdge, woEdge]

In [43]:

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(10,5))
axes.violinplot(data, showmeans=True)
axes.set_xticklabels(['', 'With Edges', '', 'W/o Edges'])

Out[43]:

[<matplotlib.text.Text at 0x7f4379bdc610>,
 <matplotlib.text.Text at 0x7f4379bed1d0>,
 <matplotlib.text.Text at 0x7f4379b97350>,
 <matplotlib.text.Text at 0x7f4379b97a90>]