Centrality measures experiments

Author: picorana

Github repository: link

If you are looking at this via GitHub, please use this link to view it properly: this notebook on nbviewer

Centrality

Centrality measures address the question of defining the importance of a node in a graph.
The question "which node is the most important in a graph?" can be answered in many different ways, depending on the context.

The question is in fact perfectly comparable to asking "Which one of my friends is the most famous in my circle?", "Which research paper is the most relevant?", or "Which GitHub repository is the best?"

For this purpose, we can choose among different ways of measuring it. Possible types of measures are:

  • Degree centrality
  • Closeness centrality
  • Betweenness centrality
  • Eigenvector centrality
  • PageRank centrality

The aim of this notebook is to explain a possible implementation for computing eigenvector centrality and to show a comparison with degree centrality.

Initial setup

We are going to use the networkx library for storing and managing the graph, plus scipy and numpy for working with matrices.

In [1]:
import networkx as nx
import scipy as sc
import numpy as np

For the sake of visualization, we are going to use plotly. Since talking about visualization is not the purpose of this notebook, I made a module to manage all the visualization-related functions and to store plotly configurations. The code of this module can be found in the same repository as this notebook, if you are interested in taking a look.

In [2]:
import plotly_config as pc

Degree Centrality

The degree of a node is simply the number of connections it has with other nodes. In the context of a social network, if a user is a node, their degree can be interpreted as the number of friends they have.

Degree Centrality is a simple way of measuring how relevant a node is.

Let us start by playing with a toy graph, which we define here:

In [3]:
test_G = nx.Graph()
test_G.add_edge(0, 1)
test_G.add_edge(1, 2)
test_G.add_edge(2, 3)
test_G.add_edge(2, 10)
test_G.add_edge(3, 0)
test_G.add_edge(3, 1)
test_G.add_edge(1, 4)
test_G.add_edge(4, 5)
test_G.add_edge(5, 6)
test_G.add_edge(5, 7)
test_G.add_edge(5, 8)
test_G.add_edge(5, 9)
test_G.add_edge(5, 11)

And here we draw it:

In [4]:
pc.drawSpringLayoutGraph(test_G)
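Before moving on, the degree of each node can be computed by hand from the edge list. This is a minimal sketch that avoids networkx and uses the same edges defined above:

```python
from collections import Counter

# same edges as the toy graph above
edges = [(0, 1), (1, 2), (2, 3), (2, 10), (3, 0), (3, 1),
         (1, 4), (4, 5), (5, 6), (5, 7), (5, 8), (5, 9), (5, 11)]

# count how many edges touch each node
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

print(degree[5])  # node 5 touches 6 edges, the highest degree
```
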

Eigenvector Centrality

Eigenvector Centrality is just another way of computing the influence of a node in a network. The main idea behind this is that having connections to many nodes with high centrality values is more important than just having many connections.

In the context of a social network, one interpretation of this is that a person with famous friends is more famous than a person with merely many friends.

Computing Eigenvector Centrality

Let $A = (a_{v,t})$ be the adjacency matrix of a graph $G$.
Let $M(v)$ be the set of neighbors of $v$.
Let $\lambda$ be the greatest eigenvalue of $A$.

The eigenvector centrality score of a vertex $v \in G$ can be defined as:

$$ x_v = \frac{1}{\lambda}\sum\limits_{t \in M(v)} x_t = \frac{1}{\lambda}\sum\limits_{t \in G} a_{v,t}x_t$$

In matrix form this reads $Ax = \lambda x$: the vector of centrality scores is an eigenvector of $A$ associated with its largest eigenvalue. First, we compute the adjacency matrix. Since our graph is undirected, the adjacency matrix will be symmetric.

In [5]:
def create_adjacency_matrix(test_G):
    nodelist = test_G.nodes()
    nlen = len(nodelist)

    # define an index of the nodes in the graph
    index = dict(zip(nodelist,range(nlen)))

    # initialize a matrix full of zeros
    M = np.zeros((nlen,nlen), dtype=int)

    # for each edge in G
    for u,v in test_G.edges_iter():
        i,j = index[u], index[v]

        # write 1 in each cell corresponding to adjacent nodes
        M[i,j] = 1

        # if the graph is undirected, this matrix is going to be symmetric
        if not test_G.is_directed():
            M[j,i] = M[i,j]
            
    return M

M = create_adjacency_matrix(test_G)
print M
[[0 1 0 1 0 0 0 0 0 0 0 0]
 [1 0 1 1 1 0 0 0 0 0 0 0]
 [0 1 0 1 0 0 0 0 0 0 1 0]
 [1 1 1 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 1 0 1 1 1 1 0 1]
 [0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0]]

As you can see, this matrix represents the connections between nodes.

Now, this matrix is quite wasteful: with all those zero cells, it stores a lot of information we don't really need. We are, in fact, only interested in the adjacent nodes, so we might as well use a much more memory-efficient sparse matrix. Scipy conveniently offers a function to convert a dense matrix directly into a sparse one, so we are going to use that.

In [6]:
M_sparse = sc.sparse.coo_matrix(M)

print M_sparse
  (0, 1)	1
  (0, 3)	1
  (1, 0)	1
  (1, 2)	1
  (1, 3)	1
  (1, 4)	1
  (2, 1)	1
  (2, 3)	1
  (2, 10)	1
  (3, 0)	1
  (3, 1)	1
  (3, 2)	1
  (4, 1)	1
  (4, 5)	1
  (5, 4)	1
  (5, 6)	1
  (5, 7)	1
  (5, 8)	1
  (5, 9)	1
  (5, 11)	1
  (6, 5)	1
  (7, 5)	1
  (8, 5)	1
  (9, 5)	1
  (10, 2)	1
  (11, 5)	1

Now we need to compute the eigenvectors and the resulting eigenvector centrality.

In [7]:
def compute_eigen_centr(M_sparse):
    eigenvalues, eigenvectors = sc.sparse.linalg.eigs(M_sparse.asfptype())
    index_of_greatest_eigenvalue = eigenvalues.real.argmax(axis=0)
    dominant_eigenvector = eigenvectors[:, index_of_greatest_eigenvalue]

    eigenvector_centrality = {}

    # use the matrix size instead of relying on the global graph object
    for i in range(M_sparse.shape[0]):
        eigenvector_centrality[i] = abs(dominant_eigenvector[i].real)

    # normalize so that the scores sum to 1
    factor = 1.0 / sum(eigenvector_centrality.values())
    for k in eigenvector_centrality:
        eigenvector_centrality[k] = eigenvector_centrality[k] * factor

    return eigenvector_centrality

eigen_centr = compute_eigen_centr(M_sparse) 
print eigen_centr
{0: 0.11944518672642711, 1: 0.18090047353384806, 2: 0.13670202819104554, 3: 0.15528231721527438, 4: 0.097721382250572433, 5: 0.094139885154717648, 6: 0.033447744709516961, 7: 0.033447744709517162, 8: 0.033447744709517058, 9: 0.033447744709517176, 10: 0.048570003380529608, 11: 0.03344774470951694}
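As a sanity check, the same scores can be reproduced with numpy's dense eigensolver, which is fine at this scale. This sketch rebuilds the adjacency matrix from the toy edge list instead of reusing the objects above:

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (2, 10), (3, 0), (3, 1),
         (1, 4), (4, 5), (5, 6), (5, 7), (5, 8), (5, 9), (5, 11)]

n = 12
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0  # undirected: symmetric matrix

w, V = np.linalg.eigh(A)         # eigh: eigendecomposition for symmetric matrices
x = np.abs(V[:, np.argmax(w)])   # eigenvector of the greatest eigenvalue
x = x / x.sum()                  # L1-normalize, as in compute_eigen_centr

print(int(np.argmax(x)))  # node 1 has the highest eigenvector centrality
```
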

Now, the approach above is just one way of doing it, and it assumes that the matrix is small enough for us to actually compute the eigenvectors.

In other circumstances, we would have to use an approximation. The power method, shown in the code below, iteratively approximates the dominant eigenvector and eigenvalue.

The method starts from a random vector $b_0$.
It assumes that:

  • $A$ has an eigenvalue that is strictly greater in magnitude than its other eigenvalues
  • The starting vector $b_0$ has a nonzero component in the direction of an eigenvector associated with the dominant eigenvalue.

At every iteration, the vector is multiplied by the adjacency matrix $A$ and normalized:

$$b_{k+1} = \frac{Ab_k}{||Ab_k||}$$

$b_k$ should converge to an eigenvector associated with the dominant eigenvalue.

In [8]:
def power_iteration(A):
    # start with a random vector
    bk = np.random.rand(A.shape[0])

    num_simulations = 100

    for _ in range(num_simulations):
        # multiply by A once, then normalize
        Abk = np.dot(A, bk)
        bk = Abk / np.linalg.norm(Abk)

    return bk

def normalize_vector(vec):
    return vec / np.linalg.norm(vec, ord=1)
        
print normalize_vector(power_iteration(M))
[ 0.11944516  0.18090021  0.136702    0.15528222  0.09772155  0.0941394
  0.0334479   0.0334479   0.0334479   0.0334479   0.04856995  0.0334479 ]
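Once $b_k$ has converged, the dominant eigenvalue itself can be recovered with the Rayleigh quotient $\lambda \approx \frac{b_k^\top A b_k}{b_k^\top b_k}$. A short sketch, again rebuilding the toy adjacency matrix rather than reusing `M`:

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (2, 10), (3, 0), (3, 1),
         (1, 4), (4, 5), (5, 6), (5, 7), (5, 8), (5, 9), (5, 11)]

n = 12
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

# power iteration, as above
b = np.random.rand(n)
for _ in range(100):
    Ab = np.dot(A, b)
    b = Ab / np.linalg.norm(Ab)

# Rayleigh quotient: estimate of the dominant eigenvalue
lam = np.dot(b, np.dot(A, b)) / np.dot(b, b)
```
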

In the end, we finally visually represent the values to get a better idea of how they compare with what we have seen previously.

In [9]:
pc.drawSpringLayoutGraphFromDict(test_G, eigen_centr)

In the graph above, you can actually compare the difference between degree centrality and eigenvector centrality because of how they are represented: the size of a node represents its eigenvector centrality, while the color represents its degree centrality.

It is interesting to notice that node 5 has 6 edges to other nodes and therefore the highest degree centrality, but it does not keep the same relevance under eigenvector centrality, for which node 1 holds the highest value, because node 1 is connected to more highly connected nodes.

Reddit test

Now we want to analyze how this works on a much bigger network: the graph of communities on Reddit.

I collected the data from Reddit's API in the months of October and November 2016.

A node, in this case, is a subreddit, that is, a sub-community of Reddit. Each subreddit revolves around a topic, so users usually post in subreddits about their favourite interests. The range of topics is very wide: many subreddits are about technology or programming, but there are also subreddits about art, various hobbies, or very general themes.

An edge is formed between two subreddits when they share a certain number of users. Reasoning that many users of the subreddit 'archlinux' probably also post in the subreddit 'linux', I consider two subreddits more similar the more users they share. For this purpose, the weight of an edge is the Jaccard similarity between the two subreddits, computed on the sets of users that post in each:

$$weight(a, b) = \frac{|users_a \cap users_b|}{|users_a \cup users_b|}$$
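For concreteness, this Jaccard weight can be computed directly from the two sets of posting users. A minimal sketch with made-up user sets (the names are only illustrative):

```python
def jaccard(users_a, users_b):
    # |intersection| / |union| of the two user sets
    users_a, users_b = set(users_a), set(users_b)
    if not users_a and not users_b:
        return 0.0
    return len(users_a & users_b) / float(len(users_a | users_b))

# hypothetical sets of users posting in two related subreddits
linux_users = {'alice', 'bob', 'carol', 'dave'}
archlinux_users = {'bob', 'carol', 'eve'}

print(jaccard(linux_users, archlinux_users))  # 2 shared out of 5 total -> 0.4
```
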

I stored the data into an edgelist:

In [10]:
G = nx.read_edgelist(open('data/edgelist_weighted_clean.txt', 'rb'))
print "number of nodes: " + str(G.number_of_nodes()), "number of edges: " + str(G.number_of_edges())
number of nodes: 6909 number of edges: 518493

So we have 6909 subreddits and 518493 edges between them. The next step is to compute the eigenvector centrality. This time, I'm going to use the function built in networkx because of the amount of data to process.

In [11]:
eigenvector_centrality = nx.eigenvector_centrality(G)

Since it would be very difficult to render so many edges in a reasonable amount of time, I had to trim some of them based on their weight before visualizing the resulting graph. Therefore, although the centrality calculations are made on the full graph, the less meaningful edges have been removed from the visualization. After this step, I also removed the nodes that no longer had any edges.

In [12]:
G2 = pc.trim_graph_edges_for_visualization(G)
print G2.number_of_nodes()
print G2.number_of_edges()
2026
5912

At this point we can draw the graph.

As in the other graphs above, color is indicative of degree centrality, while size is indicative of eigenvector centrality. Hover on the nodes to see the names of the subreddits.

In [13]:
pc.drawSpringLayoutGraphFromDict(G2, eigenvector_centrality)

We can see that there is a central group of subreddits that are very big both in degree and eigenvector centrality.

Those are the more general subreddits, the ones everybody can post in, often about broad topics like sharing funny pics or talking about politics. They share a huge number of users with most other subreddits.

On the borders of the graph, we find subreddits for more specific interests, such as particular games or particular countries.

In general, it turns out that in this case degree centrality is pretty similar to eigenvector centrality.

If we tweak the layout's clustering coefficient, we can see much more clearly that there are actual clusters of communities: by hovering with the mouse, you can notice that nearby subreddits actually cover similar topics.
The center is still very dense with very general topics, but you can see that, for example, all the programming- and technology-related subreddits are near each other.

In [14]:
pc.drawSpringLayoutGraphFromDict(G2, eigenvector_centrality, coefficient=0.025)