Getting started

The package we will use for our network analysis is igraph. There are also other packages out there for Python, such as networkx or graphtool, but will not use them during this course. First, we will load all required packages.

In [1]:
# Networks
import igraph as ig
import ighelper
import louvain

# Computation
import numpy as np
np.random.seed(0)
import scipy
import random
random.seed(0)

# Data
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
/usr/local/lib/python2.7/dist-packages/IPython/html.py:14: ShimWarning: The `IPython.html` package has been deprecated. You should import from `notebook` instead. `IPython.html.widgets` has moved to `ipywidgets`.
  "`IPython.html.widgets` has moved to `ipywidgets`.", ShimWarning)

Note that at any time, you can get some information by pressing Shift-Tab when your cursor is on some function, and simply Tab to autocomplete some value.

We can create a graph ourselves, adding vertices and edges as we go.

In [2]:
G = ig.Graph(directed=False)
G.add_vertices(n=4)
G.add_edges([(0, 1),
             (0, 2),
             (0, 3),
             (1, 2),
             (1, 3),
             (2, 3)])

We can get a summary of the graph, providing some basic information on the graph

In [3]:
G.summary()
Out[3]:
'IGRAPH U--- 4 6 -- '

The result indicates this graph is undirected (indicated by the U) has 4 nodes and 6 edges. The information on the number of nodes and edges can also be obtained using functions.

In [4]:
n = G.vcount()
m = G.ecount()
print('{0} nodes, {1} edges'.format(n, m))
4 nodes, 6 edges

In the remainder of this exercise, we will work with various networks, that need to be downloaded. There is one exception, a network that has become so famous that it is built in igraph: the karate club network constructed by Zachary (1977).

In [5]:
G = ig.Graph.Famous('Zachary')

You can plot the graph easily

In [6]:
G['layout'] = G.layout_fruchterman_reingold()
G.vs['color'] = 'gray'
G.es['color'] = 'gray'
ig.plot(G)
Out[6]:

The most basic elements of any graph are the nodes and the connection between the nodes, called edges. Nodes are also called vertices and edges are also called links. These words will be used interchangebly. Vertices and edges are the terms that are most often used in graph theory, while nodes and links are more common in (social) network analysis. Sometimes, you also see the term tie or arc for referring to an edge.

In igraph the terms vertex and edges are used throughout, and they can be accessed through a so-called VertexSequence and EdgeSequence. You can make different selections of vertices and edges, either based on attributes or simply specifying specific (vertex or edge) indices. The functionality on vertex and edge sequences is quite extensive. The following just demonstrates one possibility.

In [7]:
vs = G.vs[[0, 1, 2, 3]]
es = G.es.select(_within=vs)
vs['color'] = 'blue'
es['color'] = 'blue'
ig.plot(G)
Out[7]:

The nodes that are linked to another node are called the neighbors of a node. The number of neighbours is called the degree of a node.

In [8]:
neighbors = G.neighbors(4)
print(neighbors, G.degree(4))
([0, 6, 10], 3)

Paths

One of the most basic operations on any network is finding a path from one node to another node, not unlike finding a route from one city to another using some navigational software.

In [9]:
edge_paths = G.get_shortest_paths(v=16, to=15, output='epath')
edge_paths
Out[9]:
[[39, 4, 1, 28, 48]]

This retrieves a shortest path between node 16 and node 15. It returns the indices of the edges because we set output='epath'. In this case there is only one path of 5 edges long. We can also get the endpoints of those edges, so that the path becomes a bit more clear.

In [10]:
edge_path = G.es[edge_paths[0]];
[(edge.source, edge.target) for edge in edge_path]
Out[10]:
[(5, 16), (0, 5), (0, 2), (2, 32), (15, 32)]

We can also get the same path in terms of vertices

In [11]:
vertex_path = G.vs[G.get_shortest_paths(v=16, to=15, output='vpath')[0]]
[v.index for v in vertex_path]
Out[11]:
[16, 5, 0, 2, 32, 15]

We can visualize this path as follows.

In [12]:
G.vs['color'] = 'gray'
vertex_path['color'] = 'red'

G.es['color'] = 'gray'
G.es['width'] = 0.5
edge_path['color'] = 'red'
edge_path['width'] = 2

ig.plot(G)
Out[12]:

Rather than determining the actual shortest paths, we may also simply be interested in the distance between two nodes.

In [13]:
G.shortest_paths(source=16, target=15)
Out[13]:
[[5]]

This only provides the actual length of the path (5 edges) rather than the actual path. We can also use this function to get the distance of all nodes to all other nodes. This is conveniently represented as a matrix, for which the numpy library is especially well suited.

In [14]:
distances = np.array(G.shortest_paths())
distances
Out[14]:
array([[0, 1, 1, ..., 1, 2, 2],
       [1, 0, 1, ..., 2, 2, 2],
       [1, 1, 0, ..., 2, 1, 2],
       ..., 
       [1, 2, 2, ..., 0, 1, 1],
       [2, 2, 1, ..., 1, 0, 1],
       [2, 2, 2, ..., 1, 1, 0]])

The largest distance from a node to any other node is called the eccentricity. If a node has a very high eccentricity, it is a bit peripheral. If a node has a very low eccentricity, it means that it is relatively close to to all other nodes, and that it is in the center of the graph in a certain sense.

In [15]:
eccentricity = distances.max(1)
ig.plot(G, vertex_size=5*(6 - eccentricity))
Out[15]:

The minimum eccentricity is called the radius of the graph and the maximum eccentricity is called the diameter. The diameter of any graph is always larger than the radius and smaller than twice the radius.

The network we currently look at is connected, which is not necessarily the case. It can consist of multiple components. Let us delete a number of edges so that we get a network with two components.

In [16]:
G.delete_edges(G.es.select(_between=[(4, 5, 6, 10), (0,)]))
ig.plot(G)
Out[16]:

The path we constructed earlier is now no longer connected. In this case, there is no longer any path between the two nodes at all. The distance between the two nodes in hence infinitely large.

In [17]:
G.shortest_paths(source=16, target=15)
Out[17]:
[[inf]]

There is now no longer a path between any two nodes, and the graph is then no longer said to be connected.

In [18]:
G.is_connected()
Out[18]:
False

The network is now said to be disconnected and the different parts of the network that still are connected are called connected components.

In [19]:
components = G.clusters()
G.vs[components[0]]['color'] = 'red'
G.vs[components[1]]['color'] = 'blue'
ig.plot(G)
Out[19]:

Usually, networks tend to have one large component, and many smaller components. In empirical networks, it is therefore common practice to restrict any further analyses to the largest connected component. Because the difference between the largest connected component and the other components is usually quite large, the largest connected component is sometimes also called the giant component.

In [20]:
H = G.clusters().giant()
ig.plot(H, layout=H.layout_fruchterman_reingold())
Out[20]:

Cycles

If there are two paths going from one node to another node, we can join them together and make a cycle. We start again with a fresh graph.

In [21]:
G = ig.Graph.Famous('Zachary')
G['layout'] = G.layout_fruchterman_reingold()
In [22]:
cycle = G.es[G.get_eids(path=(24, 25, 23, 27, 24))]

G.es['color'] = 'gray'
G.es['width'] = 0.5
cycle['color'] = 'red'
cycle['width'] = 2

ig.plot(G)
Out[22]:
In [23]:
[(e.source, e.target) for e in cycle]
Out[23]:
[(24, 25), (23, 25), (23, 27), (24, 27)]

This is a cycle of length 4, but longer cycles do exist in the graph. The smallest cycle of length 3 are of particular interest to the social sciences. A small group of three people is commonly called a triad in the social sciences. It is of particular interest because if two people have a friend in common, they are more likely to become friends themselves. This is known as triadic closure.

If there are many closed triads, i.e. many cycles of length 3, it suggests that many people have many friends in common. In particular, the network is then said to be clustered. The clustering coefficient for a node is defined as the proportion of neighbours that are connected amongst each other. A clustering coefficient of 1 then means that all the neighbours of a node are connected to each other, while a clustering coefficient of 0 means that no neigbhours are connected amongst each other. The overall clustering is then simply the average over the whole network.

In [24]:
clustering = G.transitivity_local_undirected(mode="zero")

ig.plot(G, vertex_size=10*(np.array(clustering)+0.5))
Out[24]:

As you can see, the nodes that are more clustered seem to be more peripheral. This is something that can be seen more often. Nodes that have many neighbours usually have fewer connections between all their neighbours, so that they tend to have a lower clustering coefficient.

Nodes that have a higher clustering coefficient tend to be well embedded in the network. Nodes with a low clustering coefficient on the other hand tend to function as bridges, connecting different parts of the network. For example, when removing node 0 from the network, this disconnects the network.

In [25]:
H = G.copy()
H.delete_vertices(0)
components = H.clusters()
ig.plot(components, layout=H.layout_fruchterman_reingold())
Out[25]:

This means that node 0 functions as a bridge. In fact, this is the only node that acts as a bridge in this graph.

In [26]:
is_bridge = [False]*G.vcount()
for v in range(G.vcount()):
    H = G.copy()
    H.delete_vertices(v)
    is_bridge[v] = len(H.clusters()) > 1
    
ig.plot(G, vertex_color=is_bridge, palette=ig.RainbowPalette(2))
Out[26]:

Social network analysis

Reciprocity

Perhaps one of the most fundamental aspect in many social networks is reciprocity, the tendency to do unto another what (s)he did to you. That is, the relationships between people have some tendency to be symmetric. This is surely not the case for all relationships, as there are for example some clear assymmetries between for example an employer and an employee or a soldier and a general. Nonetheless, in other cases, we may expect some reciprocity to hold.

In our current case, the graph is undirected, so the reciprocity equals 1.

In [27]:
G.reciprocity()
Out[27]:
1.0

Homophily

If two nodes are more likely to be connected when they share a certain attribute, this is known as homophily. We will illustrate this on a network of contact between pupils from a high school available from SocioPatterns. The contacts were collected through devices that would automatically record face-to-face interaction (i.e. physical proximity) between pupils during 5 days.

In [28]:
G = ighelper.GraphFromURL('https://storage.googleapis.com/css-files/sociopatterns.graphml')
G = G.clusters().giant()

G.es['width'] = 0.01*np.array(G.es['weight'])
G['layout'] = G.layout_fruchterman_reingold(weights='weight')
G.vs['color'] = ['pink' if v['gender'] == 'F' else 'blue' for v in G.vs]

ig.plot(G, vertex_size=8)
Out[28]: