#!/usr/bin/env python
# coding: utf-8

# # Getting started

# The package we will use for our network analysis is `igraph`. There are also other packages out there for Python, such as `networkx` or `graph-tool`, but we will not use them during this course. First, we will load all required packages.

# In[1]:

# Networks
import igraph as ig
import leidenalg

# Computation
import numpy as np
np.random.seed(0)
import scipy
import random
random.seed(0)

# Data
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
get_ipython().run_line_magic('matplotlib', 'inline')

# Note that at any time, you can get some information by pressing Shift-Tab when your cursor is on some function, and simply Tab to autocomplete some value.

# We can create a graph ourselves, adding vertices and edges as we go.

# In[2]:

G = ig.Graph(directed=False)
G.add_vertices(n=4)
G.add_edges([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)])

# We can get a summary of the graph, providing some basic information.

# In[3]:

G.summary()

# The result indicates that this graph is undirected (indicated by the `U`) and has 4 nodes and 6 edges. The number of nodes and edges can also be obtained using functions.

# In[4]:

n = G.vcount()
m = G.ecount()
print('{0} nodes, {1} edges'.format(n, m))

# In the remainder of this exercise, we will work with various networks that need to be downloaded. There is one exception, a network that has become so famous that it is built into `igraph`: the karate club network constructed by [Zachary (1977)](https://www.jstor.org/stable/3629752?seq=1#page_scan_tab_contents).

# In[5]:

G = ig.Graph.Famous('Zachary')

# You can plot the graph easily.

# In[6]:

G['layout'] = G.layout_fruchterman_reingold()
G.vs['color'] = 'gray'
G.es['color'] = 'gray'
ig.plot(G)

# The most basic elements of any graph are the *nodes* and the connections between them, called *edges*. Nodes are also called *vertices* and edges are also called *links*. These words will be used interchangeably. Vertex and edge are the terms most often used in graph theory, while node and link are more common in (social) network analysis. Sometimes you also see the terms *tie* or *arc* used to refer to an edge.

# In `igraph` the terms vertex and edge are used throughout, and they can be accessed through a so-called `VertexSeq` and `EdgeSeq`. You can make different selections of vertices and edges, either based on attributes or by simply specifying specific (vertex or edge) indices. The functionality on vertex and edge sequences is quite extensive. The following demonstrates just one possibility.

# In[7]:

vs = G.vs[[0, 1, 2, 3]]
es = G.es.select(_within=vs)
vs['color'] = 'blue'
es['color'] = 'blue'
ig.plot(G)

# The nodes that are linked to a node are called its *neighbors*. The number of neighbors of a node is called its *degree*.

# In[8]:

neighbors = G.neighbors(4)
print(neighbors, G.degree(4))

# ## Paths

# One of the most basic operations on any network is finding a *path* from one node to another, not unlike finding a route from one city to another using some navigational software.

# In[9]:

edge_paths = G.get_shortest_paths(v=16, to=15, output='epath')
edge_paths

# This retrieves a shortest path between node `16` and node `15`. It returns the *indices* of the edges because we set `output='epath'`. In this case it returns a single path, which is 5 edges long.
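# By default, `get_shortest_paths` returns only one shortest path per target. If you want to check whether the shortest path is unique, you can retrieve all shortest paths between the two nodes; a minimal sketch using `get_all_shortest_paths` (which returns vertex paths):

# In[ ]:

all_paths = G.get_all_shortest_paths(v=16, to=15)
# Each path is a list of vertex indices; a path of 6 vertices has 5 edges.
print(len(all_paths), 'shortest path(s) of', len(all_paths[0]) - 1, 'edges')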
# We can also get the endpoints of those edges, so that the path becomes a bit clearer.

# In[10]:

edge_path = G.es[edge_paths[0]]
[(edge.source, edge.target) for edge in edge_path]

# We can also get the same path in terms of vertices.

# In[11]:

vertex_path = G.vs[G.get_shortest_paths(v=16, to=15, output='vpath')[0]]
[v.index for v in vertex_path]

# We can visualize this path as follows.

# In[12]:

G.vs['color'] = 'gray'
vertex_path['color'] = 'red'
G.es['color'] = 'gray'
G.es['width'] = 0.5
edge_path['color'] = 'red'
edge_path['width'] = 2
ig.plot(G)

# Rather than determining the actual shortest paths, we may also simply be interested in the distance between two nodes.

# In[13]:

G.shortest_paths(source=16, target=15)

# This only provides the length of the path (5 edges) rather than the path itself. We can also use this function to get the distance from every node to every other node. This is conveniently represented as a matrix, for which the `numpy` library is especially well suited.

# In[14]:

distances = np.array(G.shortest_paths())
distances

# The largest distance from a node to any other node is called its *eccentricity*. If a node has a very high eccentricity, it is somewhat peripheral. If a node has a very low eccentricity, it is relatively close to all other nodes, and in a certain sense lies in the center of the graph.

# In[15]:

eccentricity = distances.max(1)
ig.plot(G, vertex_size=5*(6 - eccentricity))

# The minimum eccentricity is called the *radius* of the graph and the maximum eccentricity is called the *diameter*. The diameter of any graph is always at least the radius and at most twice the radius.

# The network we are currently looking at is connected, but that is not necessarily the case for every network: a network can consist of multiple components. Let us delete a number of edges so that we get a network with two components.

# In[16]:

G.delete_edges(G.es.select(_between=[(4, 5, 6, 10), (0,)]))
ig.plot(G)

# The path we constructed earlier is now broken. In fact, there is no longer any path between the two nodes at all. The distance between the two nodes is hence infinitely large.

# In[17]:

G.shortest_paths(source=16, target=15)

# There is no longer a path between every pair of nodes, so the graph is no longer said to be *connected*.

# In[18]:

G.is_connected()

# The network is now said to be *disconnected*, and the parts of the network that are still connected are called *connected components*.

# In[19]:

components = G.clusters()
G.vs[components[0]]['color'] = 'red'
G.vs[components[1]]['color'] = 'blue'
ig.plot(G)

# Usually, networks tend to have one large component and many smaller components. In empirical networks, it is therefore common practice to restrict any further analyses to the largest connected component. Because the difference in size between the largest connected component and the other components is usually quite large, the largest connected component is sometimes also called the *giant component*.

# In[20]:

H = G.clusters().giant()
ig.plot(H, layout=H.layout_fruchterman_reingold())

# ## Cycles

# If there are two paths going from one node to another node, we can join them together to make a *cycle*. We start again with a fresh graph.

# In[21]:

G = ig.Graph.Famous('Zachary')
G['layout'] = G.layout_fruchterman_reingold()

# In[22]:

cycle = G.es[G.get_eids(path=(24, 25, 23, 27, 24))]
G.es['color'] = 'gray'
G.es['width'] = 0.5
cycle['color'] = 'red'
cycle['width'] = 2
ig.plot(G)

# In[23]:

[(e.source, e.target) for e in cycle]

# This is a cycle of length 4, but longer cycles do exist in the graph.
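# The length of the shortest cycle in a graph is called its *girth*. As a quick check, a minimal sketch using `igraph`'s built-in `girth()` method shows that the shortest cycles in this network have length 3:

# In[ ]:

G.girth()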
# The smallest cycles, those of length 3, are of particular interest in the social sciences. A small group of three people is commonly called a *triad* in the social sciences. Such triads are interesting because if two people have a friend in common, they are more likely to become friends themselves. This is known as *triadic closure*.
#
# If there are many closed triads, i.e. many cycles of length 3, it suggests that many people have many friends in common. The network is then said to be *clustered*. The *clustering coefficient* of a node is defined as the proportion of its neighbors that are connected amongst each other. A clustering coefficient of 1 then means that all the neighbors of a node are connected to each other, while a clustering coefficient of 0 means that no neighbors are connected amongst each other. The overall clustering is then simply the average over the whole network.

# In[24]:

clustering = G.transitivity_local_undirected(mode="zero")
ig.plot(G, vertex_size=10*(np.array(clustering)+0.5))

# As you can see, the nodes that are more clustered seem to be more peripheral. This is something that can be seen more often. Nodes that have many neighbors usually have relatively few connections amongst *all* those neighbors, so that they tend to have a lower clustering coefficient.
#
# Nodes that have a higher clustering coefficient tend to be well embedded in the network. Nodes with a low clustering coefficient, on the other hand, tend to function as bridges, connecting different parts of the network. For example, removing node 0 from the network disconnects the network.

# In[25]:

H = G.copy()
H.delete_vertices(0)
components = H.clusters()
ig.plot(components, layout=H.layout_fruchterman_reingold())

# This means that node 0 functions as a bridge. In fact, this is the only node that acts as a bridge in this graph. (Such a node is formally called a *cut vertex* or *articulation point*, and `igraph` can also find them directly with `articulation_points()`.)

# In[26]:

is_bridge = [False]*G.vcount()
for v in range(G.vcount()):
    H = G.copy()
    H.delete_vertices(v)
    is_bridge[v] = len(H.clusters()) > 1
ig.plot(G, vertex_color=is_bridge, palette=ig.RainbowPalette(2))

# # Social network analysis

# ## Reciprocity

# Perhaps one of the most fundamental aspects of many social networks is *reciprocity*: the tendency to do unto another what (s)he did to you. That is, the relationships between people have some tendency to be symmetric. This is surely not the case for all relationships, as there are some clear asymmetries between, for example, an employer and an employee or a soldier and a general. Nonetheless, in other cases we may expect some reciprocity to hold.

# In our current case, the graph is undirected, so the reciprocity equals 1.

# In[27]:

G.reciprocity()

# ## Homophily

# If two nodes are more likely to be connected when they share a certain attribute, this is known as *homophily*. We will illustrate this on a network of contact between pupils from a high school available from [SocioPatterns](http://www.sociopatterns.org/datasets/high-school-contact-and-friendship-networks/). The contacts were collected through devices that automatically record face-to-face interaction (i.e. physical proximity) between pupils during 5 days.

# In[28]:

G = ig.Graph.Read('data/sociopatterns.gml')
G = G.clusters().giant()
G.es['width'] = 0.01*np.array(G.es['weight'])
G['layout'] = G.layout_fruchterman_reingold(weights='weight')
G.vs['color'] = ['pink' if v['gender'] == 'F' else 'blue' for v in G.vs]
ig.plot(G, vertex_size=8)
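# Before quantifying homophily, it helps to know the gender composition of the network; a minimal sketch, simply tabulating the values of the `gender` attribute:

# In[ ]:

from collections import Counter
Counter(G.vs['gender'])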
# For weighted graphs, the weighted degree is called the *strength*.

# In[29]:

strength = G.strength(weights='weight')

# In the above plot, the male pupils are colored blue and the female pupils pink. The central question is now whether pupils tend to have more links to the same sex, or whether their links are more evenly distributed. The more general term for homophily is *assortativity*, which refers to how likely two nodes that share an attribute are to be linked. From the plot alone it is not immediately clear whether there is gender assortativity (or homophily) in this network, but a statistical measure of assortativity gives us more insight.

# In[30]:

G.assortativity_nominal(ig.VertexClustering.FromAttribute(G, 'gender').membership)

# Like a correlation, the assortativity is 0 if gender has no relation with the connectivity, 1 if all connections are between nodes of the same class, and -1 if all connections are between nodes of different classes. An assortativity of 0.17 is then not very high, but it does indicate some tendency for pupils of the same sex to be linked.

# The pupils mostly associate with other people from the same (school)class. This is clear from the assortativity based on the (school)class.

# In[31]:

G.assortativity_nominal(ig.VertexClustering.FromAttribute(G, 'class').membership)

# If we look at the gender assortativity per class, it drops to about 0. This suggests that the overall assortativity of 0.17 is due to gender imbalance in the classes.

# In[32]:

classes = ig.VertexClustering.FromAttribute(G, 'class')
for H in classes.subgraphs():
    gender_ratio = sum([v['gender'] == 'F' for v in H.vs])/float(H.vcount())
    gender_assortativity = H.assortativity_nominal(ig.VertexClustering.FromAttribute(H, 'gender').membership)
    print('class: {0},\t% female: {1:.2f},\tassortativity: {2:.3f}'.format(H.vs['class'][0], gender_ratio, gender_assortativity))

# ## Social influence

# One of the key concepts in social network analysis is social influence. The basic idea is that people who are linked influence each other. For example, if person A holds opinion A but all his neighbors hold opinion B, then he is likely to switch to opinion B. Let us simulate some simple opinion dynamics on the highschool network. We will assume that every node switches to the majority opinion in its neighborhood, and we will start from a random initial condition.

# In[33]:

G = ig.Graph.Read('data/sociopatterns.gml')
G = G.clusters().giant()
G.es['width'] = 0.01*np.array(G.es['weight'])
G['layout'] = G.layout_fruchterman_reingold(weights='weight')

# In[34]:

G.vs['opinion'] = 0
G.vs['new_opinion'] = np.random.randint(2, size=G.vcount())
while G.vs['opinion'] != G.vs['new_opinion']:
    G.vs['opinion'] = G.vs['new_opinion']
    for v in G.vs:
        v['new_opinion'] = 1*(sum(G.vs[G.neighbors(v)]['opinion']) > 0.5*v.degree())
ig.plot(G, vertex_color=[int(v['opinion']) for v in G.vs], palette=ig.RainbowPalette(2), vertex_size=8)

# One of the problems with social influence is that it is usually confounded with homophily. Let us check whether there is assortativity on the result of the opinion dynamics.

# In[35]:

G.assortativity_nominal(G.vs['opinion'])

# The assortativity on the opinion is roughly the same as the assortativity on the classes. But that begs the question of whether homophily was operating or social influence. The cases here are clear: the class likely influences the probability of a link, so there homophily is operating, while in the opinion dynamics simulation we know that the network did not change, only the opinions did, so there is clear social influence. But if we empirically observe some assortativity on some variable, we cannot determine unambiguously whether it results from homophily or from social influence.
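# One way to see the group structure at work in the simulation is to tabulate the final opinion per class. A minimal sketch, assuming the simulation above has just been run (so `opinion` holds the converged 0/1 values); a mean close to 0 or 1 indicates that a class has (nearly) converged to a single opinion:

# In[ ]:

df = pd.DataFrame({'class': G.vs['class'], 'opinion': G.vs['opinion']})
df.groupby('class')['opinion'].mean()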
# ## Centrality

# Sometimes, rather than viewing social influence as a process, it is seen as the overall influence some person can exert over others. It is thought that more central persons are generally more influential: in order to sway opinion, you would for example do better to target the more central persons. However, there are several variants of centrality, and all of them have some merit. We will illustrate them on the karate club network again.

# In[36]:

G = ig.Graph.Famous('Zachary')
G.vs['color'] = 'gray'
G.es['color'] = 'gray'
G['layout'] = G.layout_auto()

# ### Degree centrality

# The simplest is degree centrality, which is nothing more than the number of contacts somebody has.

# In[37]:

G.vs['centrality'] = G.degree()
ig.plot(G, vertex_size=G.vs['centrality'])

# ### Eigenvector centrality

# The idea of eigenvector centrality is that you are as central as the people you are connected with. So, if you connect to many central people, you should be central yourself. Perhaps this recursive notion of centrality is confusing, but it leads to an elegant solution. If you write out the math, it turns out that the eigenvector with the largest eigenvalue is the solution, whence the name.

# In[38]:

G.vs['centrality'] = G.eigenvector_centrality()
ig.plot(G, vertex_size=10*np.array(G.vs['centrality']))

# ### Pagerank

# One problem with eigenvector centrality is that it is not always well-defined for directed graphs. The interesting thing about eigenvector centrality is that it is similar to the proportion of time spent in a node if a *random walker* were to traverse the network, choosing links to follow at random. In directed graphs there may be so-called *sinks*: nodes with no outgoing edges. If you think of a random walker, you can imagine the problem: (s)he gets stuck in a sink. The opposite of *sinks* are called *sources*: nodes with no incoming edges. A random walker would never arrive in a source.
#
# To alleviate this problem, the idea of *teleportation* was introduced: with a small probability, the random walker restarts in a node chosen at random. Hence, the walker can always escape from any sink, and there is always a probability that (s)he arrives at a source. This was introduced by the founders of Google to model a *random surfer*, jumping from webpage to webpage and every now and then starting anew. This centrality measure is called pagerank, and it forms the core of Google's search engine.

# In[39]:

G.vs['centrality'] = G.pagerank()
ig.plot(G, vertex_size=100*np.array(G.vs['centrality']))

# ### Betweenness centrality

# The previous two centralities were based on random walks in a certain sense. Betweenness centrality, in contrast, uses shortest paths rather than random walks. The fraction of all shortest paths that pass through a certain node is called its betweenness centrality.

# In[40]:

G.vs['centrality'] = G.betweenness(G.vs)
ig.plot(G, vertex_size=0.1*np.array(G.vs['centrality']) + 1)

# As you can see, most centralities tend to give relatively similar results, but there can be some variation.
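# This similarity can be quantified; a minimal sketch computing the Spearman rank correlation of each measure with degree, using `scipy.stats.spearmanr`:

# In[ ]:

import scipy.stats

centralities = {
    'degree': G.degree(),
    'eigenvector': G.eigenvector_centrality(),
    'pagerank': G.pagerank(),
    'betweenness': G.betweenness(),
}
for name, values in centralities.items():
    # Rank correlation with degree; 1.0 means identical ranking of the nodes.
    rho, _ = scipy.stats.spearmanr(centralities['degree'], values)
    print('degree vs {0}: {1:.2f}'.format(name, rho))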
# ## Weak ties

# There are two main ideas associated with the so-called strength of weak ties (i.e. links of low weight). The first is that weak ties hold together the network. The second is that new information is obtained through weak ties. The first can be studied relatively easily on any given weighted network. The second is more difficult and less frequently studied, because it also requires observations on information sharing.
#
# The idea that weak ties hold together the network can be studied in two ways. One straightforward possibility is to check whether cutting the weak ties relatively quickly disconnects the network. Another possibility is to check whether weak ties mostly fall between groups of people. Finally, it is often also analysed whether weak ties mostly fall between nodes that have relatively few common neighbors.
#
# We will study this using the highschool network.

# In[41]:

G = ig.Graph.Read('data/sociopatterns.gml')
G = G.clusters().giant()
G['layout'] = G.layout_fruchterman_reingold(weights='weight')
G.es['width'] = 0.01*np.array(G.es['weight'])

# Let us see where the weak links are in the network. We will regard all links with weights below the average as weak links.

# In[42]:

class_partition = ig.VertexClustering.FromAttribute(G, 'class')
ig.plot(class_partition,
        edge_color=['gray' if e['weight'] < np.mean(G.es['weight']) else 'black' for e in G.es],
        vertex_size=8)

# The weak links appear to fall between the different classes, while the stronger links are more likely to be found within the classes. This can easily be computed.

# In[43]:

is_crossing_edge = class_partition.crossing()
weight_between = np.mean([e['weight'] for e in G.es if is_crossing_edge[e.index]])
weight_within = np.mean([e['weight'] for e in G.es if not is_crossing_edge[e.index]])
print('Weight within {0:.2f}, between {1:.2f}.'.format(weight_within, weight_between))

# This suggests that if we delete the weak links first, we disintegrate the network. Let us delete edges from the graph in two different orders: the weakest edges first, or the strongest edges first. We look at whether there is a clear giant component, or whether the network splits into more equally sized components. As a measure, we track the sum of squared component sizes excluding the giant component: this quantity is large when the network has fallen apart into several sizeable pieces.

# In[44]:

H = G.copy()
weak_component_size = []
while H.ecount() > 0:
    H.delete_edges(min(H.es, key=lambda e: e['weight']))
    cl = H.clusters()
    weak_component_size.append(sum(np.array(cl.sizes())**2) - cl.giant().vcount()**2)

H = G.copy()
strong_component_size = []
while H.ecount() > 0:
    H.delete_edges(max(H.es, key=lambda e: e['weight']))
    cl = H.clusters()
    strong_component_size.append(sum(np.array(cl.sizes())**2) - cl.giant().vcount()**2)

# Plotting the results shows that deleting the weak edges first has a quite different effect than deleting the strong edges first.

# In[45]:

plt.plot(weak_component_size, label='Weak')
plt.plot(strong_component_size, label='Strong')
plt.xlabel('Edges deleted')
plt.ylabel('Component size (excl. giant)')
plt.legend(loc='best');

# The plot may be a bit obscure. Inspecting the actual resulting networks after deleting a critical number of edges may be more revealing.

# In[46]:

break_point = np.array(weak_component_size).argmax()
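# The `break_point` is the number of deleted edges at which the network is most fragmented according to our measure, $\sum_c n_c^2 - n_{\mathrm{giant}}^2$ (the sum of squared component sizes minus the squared size of the largest component). A quick look at where that happens:

# In[ ]:

print('Most fragmented after deleting {0} of {1} edges'.format(break_point, G.ecount()))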
# Let us first look at the network if we remove the strong links first. It is clear that this network has a single giant component, which is only held together by weak links, and many isolated nodes.

# In[47]:

H = G.copy()
del H['layout']
H.delete_edges(sorted(H.es, key=lambda e: e['weight'], reverse=True)[:break_point])
ig.plot(H.clusters())

# This contrasts sharply with the network if we remove the weak links first. We can clearly see that it separates into several smaller but sizeable components.

# In[48]:

H = G.copy()
del H['layout']
H.delete_edges(sorted(H.es, key=lambda e: e['weight'])[:break_point])
ig.plot(H.clusters())

# Finally, we can also check whether strong links tend to occur between people who have a high overlap of common neighbors. This overlap is usually calculated as the Jaccard similarity.

# In[49]:

jaccard_similarity = G.similarity_jaccard(pairs=G.get_edgelist())
df = pd.DataFrame({'jaccard': jaccard_similarity, 'weight': G.es['weight']})
df['jaccard_group'] = np.round(df['jaccard'], 1)
sns.boxplot(x='jaccard_group', y='weight', data=df)
plt.yscale('log')

# ## Structural holes

# When nodes have a high clustering coefficient, they are often thought to be well *embedded* in the network. People in such a position tend to have friends that are also friends of each other. In a certain sense, they are in a *cohesive* environment. This is generally thought to increase trust in one another, but it also tends to be associated with higher peer pressure and stronger enforcement of social norms.
#
# Nodes that have a low clustering coefficient occupy a rather different position in the network. They tend to connect different parts of the network, and we saw earlier that they tend to act as bridges. The notion of a bridge can be relaxed a bit to that of a *local bridge*: an edge whose endpoints have no common neighbors, or stated differently, an edge with a Jaccard similarity of 0.
#
# Nodes in such a position are said to "fill" a *structural hole*. They are the go-between for different people, and this position provides them with a strategic advantage. For example, a real-estate agent may facilitate a transaction between a buyer and a seller; neither buyer nor seller would be connected if it wasn't for the real-estate agent. Indeed, people in the position of a *structural hole* are sometimes also called *brokers*.
#
# People occupying structural holes are also thought to have more opportunities for innovation and creativity. At the interface of different groups, they may hear much more, and more diverse, information. This could enable them to integrate this diverse information into something new.
#
# Let us identify the brokers in the highschool network.

# In[50]:

G = ig.Graph.Read('data/sociopatterns.gml')
G = G.clusters().giant()
G['layout'] = G.layout_fruchterman_reingold(weights='weight')
G.es['width'] = 0.01*np.array(G.es['weight'])

# In[51]:

G.vs['clustering'] = G.transitivity_local_undirected()
most_broker = min(G.vs, key=lambda v: v['clustering'])
G.vs['color'] = 'gray'
most_broker['color'] = 'red'
G.vs[G.neighbors(most_broker)]['color'] = 'blue'
ig.plot(G, vertex_size=10.0*(1 - np.array(G.vs['clustering'])))

# As can be seen, the node with the lowest clustering tends to connect people from across the network. This contrasts with a node with a high clustering coefficient (but at least as many neighbors).

# In[52]:

least_broker = max(G.vs.select(_degree_ge=most_broker.degree()), key=lambda v: v['clustering'])
G.vs['color'] = 'gray'
least_broker['color'] = 'red'
G.vs[G.neighbors(least_broker)]['color'] = 'blue'
ig.plot(G, vertex_size=10.0*(1 - np.array(G.vs['clustering'])))

# We can make this more explicit by calculating how dispersed the neighbors of a node are across the different classes, measured as the entropy of the class distribution of the neighbors. A higher entropy indicates that the neighbors are spread more evenly across the classes, whereas a lower entropy indicates that the neighbors are mostly concentrated in a few classes.
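# Concretely, if $p_c$ denotes the fraction of a node's neighbors that belong to class $c$, the entropy is $H = -\sum_c p_c \ln p_c$. It equals 0 when all neighbors sit in a single class and reaches its maximum of $\ln C$ when the neighbors are spread evenly over all $C$ classes. This is exactly what the function below computes.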
# In[53]:

from collections import Counter

def entropy(lst):
    freq = Counter(lst)
    prob = np.array([float(f)/len(lst) for f in freq.values()])
    return -sum(prob*np.log(prob))

entropy_node = [entropy(G.vs[G.neighbors(v)]['class']) for v in G.vs]
plt.plot(G.vs['clustering'], entropy_node, '.')
plt.xlabel('Clustering')
plt.ylabel('Entropy')

# ## Equivalence

# People occupy different positions in a network; we already saw brokers, for example. This idea can be generalized to the notion of equivalence. The strictest definition of equivalence between two nodes is that they share exactly the same neighbors, which is called *structural equivalence*. A slightly more relaxed definition is that of *isomorphic equivalence*, meaning that switching the labels of the two nodes does not change the structure of the graph. This is actually still quite restrictive, and hard to compute.
#
# The most relaxed definition is *regular equivalence*, which considers two nodes equivalent if they connect to similar others. For example, employees report to their managers, who report to the board. All managers then occupy a regularly equivalent position: they connect to employees and board members. Similarly, all employees are regularly equivalent: they all report to managers (even though they don't necessarily report to the same manager). This is a recursive definition, and it is also relatively difficult to compute.
#
# Of course, exact regular equivalence may not be the most informative, so some number of 'errors' (i.e. deviations from the other members of the equivalence class) may be tolerated. This is sometimes also referred to as *role detection*, as a generalization of the *community detection* we will encounter later on.
#
# Unfortunately, there are currently no implementations available for `igraph`, and in fact no implementations for Python that I know of.

# ## Social balance

# Social balance does not refer to some zen concept of balance or a state of peace. If anything, it actually has more to do with conflict than with peace. Social balance is a theory regarding the structure of *negative links*, i.e. links that have a negative connotation, such as hate or war. These can either be represented as a separate type of link, or as links having some continuous weight, which may also be negative.

# The basis of the theory is that if two nodes are positively connected, they should relate similarly to any third party: either both should have positive relations with it, or both should have negative relations with it. The following two triads are socially balanced.

# In[54]:

H = ig.Graph.Full(3)
H.es['weight'] = [1, -1, -1]
ig.plot(H, bbox=[0, 0, 100, 100], vertex_color='gray', edge_width=2,
        edge_color=['blue' if e['weight'] > 0 else 'red' for e in H.es])

# In[55]:

H = ig.Graph.Full(3)
H.es['weight'] = [1, 1, 1]
ig.plot(H, bbox=[0, 0, 100, 100], vertex_color='gray', edge_width=2,
        edge_color=['blue' if e['weight'] > 0 else 'red' for e in H.es])
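# The balance of a triad can be read off from the product of its edge weights: the product is positive exactly for balanced triads. A quick check for the all-positive triad above:

# In[ ]:

np.product(H.es['weight']) > 0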
# The following two triads are then *unbalanced*.

# In[56]:

H = ig.Graph.Full(3)
H.es['weight'] = [1, 1, -1]
ig.plot(H, bbox=[0, 0, 100, 100], vertex_color='gray', edge_width=2,
        edge_color=['blue' if e['weight'] > 0 else 'red' for e in H.es])

# In[57]:

H = ig.Graph.Full(3)
H.es['weight'] = [-1, -1, -1]
ig.plot(H, bbox=[0, 0, 100, 100], vertex_color='gray', edge_width=2,
        edge_color=['blue' if e['weight'] > 0 else 'red' for e in H.es])

# The idea of social balance is that unbalanced triads are unstable and tend to disappear, whereas balanced triads tend to be stable. Note that the product of the weights of the stable triads is always positive. If all cycles in a network have such a positive weight product, the whole network is said to be balanced (but this rarely happens). If a network is balanced, something interesting happens: the network splits into two mutually antagonistic factions.

# Let us explore this in a network of international (conflict) relations from 2000-2015. This network contains both conflicts and alliances, where the conflicts are negatively weighted and the alliances positively weighted.

# In[58]:

G = ig.Graph.Read('data/international_relations_2000_2015.gml')
G['layout'] = G.layout_fruchterman_reingold()
ig.plot(G, vertex_color='gray', vertex_size=8,
        edge_color=['blue' if e['weight'] > 0 else 'red' for e in G.es])

# Let us explore what the triads look like in this international network.

# In[59]:

triads = []
def store_signs(H, v, i):
    # Only store the complete triad
    if (H.vs[v].subgraph().ecount() == 3):
        triads.append( (v, np.product(H.vs[v].subgraph().es['weight']) ) )
    return 0
G.motifs_randesu(callback=store_signs)

# Let us see how many triads are found in this network.

# In[60]:

len(triads)

# Now let us see which triads have a negative sign, and are hence unbalanced.

# In[61]:

len([t for (t, s) in triads if s < 0])

# Let us examine one such unbalanced triad.

# In[62]:

t = next(t for (t, s) in triads if s < 0)
H = G.vs[t].subgraph()
[(H.vs[e.source]['StateNme'], H.vs[e.target]['StateNme'], e['weight']) for e in H.es]

# In this case, the US is in conflict with Iran but allied with Pakistan, while Pakistan is allied with Iran. As you can already sense, this creates some tension: one of the three countries is likely to revise its relations. Either the US becomes allied with Iran (perhaps an unlikely scenario), or it comes into conflict with Pakistan, or Pakistan also enters into conflict with Iran.

# # Complex networks

# The previous chapter focussed on concepts that originate in the sociological tradition of network analysis. This chapter covers aspects that originate in the so-called *complex networks* perspective. These are more recent developments, often motivated by the exact sciences. Their outlook is generally more mathematical and statistical, rather than substantive. One challenge is to bridge these two somewhat separate worlds.

# ## Random graphs

# There are various motivations for studying random graphs. The first is that they provide a baseline to compare real-world networks with. In other words, a random graph can function as a null-model. If a certain characteristic is reproduced in a random graph (e.g. path length), then such a simple random process already provides an "explanation" for the observation. That is, we don't need to make any assumption other than randomness about the underlying process to explain that characteristic. On the other hand, if a certain characteristic is not reproduced in a random graph (e.g. clustering), then this characteristic is unlikely to arise at random. This is usually what is understood as "significance": the characteristic has a low p-value, i.e. a low probability of being observed under the null-hypothesis. However, it also makes clear that the random graph is in that case a bad model for our observations. This seems to be sometimes forgotten when people use "significance": it simply means your null-model is really bad at reproducing your observations.
#
# Another reason for having null-models is that they may provide some (first-order) approximation of real-world graphs. The amount of data always remains limited (even today), and it always constitutes only one possible observation. So, if we want to make a statement about the effect of certain graph properties on dynamical processes (e.g. the effect of clustering on opinion dynamics), it is convenient to have networks on which we can test our theories.

# ### Erdös-Rényi

# The simplest random graph simply connects all pairs of nodes with the same probability $p$. There is also a slightly different form where you instead specify the total number of links $m$, which are placed between pairs of nodes uniformly at random. These are called Erdös-Rényi graphs, after two prolific authors in graph theory. You can easily generate random graphs in `igraph`. Let us compare the highschool network with a random graph with the same number of nodes and links.

# In[63]:

G = ig.Graph.Read('data/sociopatterns.gml')
G = G.clusters().giant()
G['layout'] = G.layout_fruchterman_reingold(weights='weight')
G_random = ig.Graph.Erdos_Renyi(n=G.vcount(), m=G.ecount())

# Normally, you would generate many random graphs and compare the distribution of the values of various characteristics. We will not do that here and just illustrate the comparisons. Let us start with the average path length.

# In[64]:

avg_distance = np.array(G.shortest_paths()).mean()
avg_distance_random = np.array(G_random.shortest_paths()).mean()
print('Empirical average distance: {0:.2f}, random: {1:.2f}'.format(avg_distance, avg_distance_random))

# The empirical average distance is quite close to that of the random graph; the shortest paths in the random graph are actually somewhat shorter than in the empirical network. The so-called six degrees of separation (here only about 2) thus already seems to be present in a random network. In part, the longer distances in the empirical network are due to the fact that the highschool network is clustered in classes: to get from one part of the network to another, a path first needs to exit one class and enter another. With random links, such boundaries play no role, which tends to make the average distances smaller.
#
# Now let us look at clustering.

# In[65]:

clustering = G.transitivity_avglocal_undirected()
clustering_random = G_random.transitivity_avglocal_undirected()
print('Empirical clustering: {0:.2f}, random: {1:.2f}'.format(clustering, clustering_random))

# This is a much larger difference: the clustering in the random graph is nowhere near as high as in the empirical graph. This is what we would expect in a random graph: for clustering we need all three edges of a triad to be present, and the probability that this happens is relatively small.

# Finally, let us examine the number of neighbors each node has.

# In[66]:

degree = G.degree()
degree_random = G_random.degree()
plt.plot(degree, degree_random, '.')
plt.xlabel('Degree')
plt.ylabel('Random degree')

# As you can see, there is almost no correspondence between the empirical degree and the random degree.
# This is not surprising, since all nodes are treated the same in the random graph, while empirically there are clear differences (some people are more popular, some are more solitary). Indeed, this difference also becomes clear if we look at the distribution of the degree.

# In an Erdös-Rényi graph, the degree of each node is Poisson distributed (close to a normal distribution): on average each node has a degree of $\langle k \rangle = p(n-1) \approx pn$, and most nodes have a degree relatively close to that.

# In[67]:

sns.kdeplot(np.array(degree), label='Empirical')
sns.kdeplot(np.array(degree_random), label='Random')
plt.legend(loc='best')
plt.xlabel('Degree')
plt.ylabel('Probability')

# The empirical network has a much broader degree distribution than what is observed randomly. This is typically the case: most real-world networks have rather broad degree distributions. Since this happens so frequently, a common alternative type of random graph is often considered.

# ### Configuration model

# The configuration model starts from the observed empirical degrees. Rather than assuming that links have the same probability for all pairs of nodes, it assumes that each node has the same number of neighbors as observed empirically.

# In[68]:

G_random_config = ig.Graph.Degree_Sequence(degree)

# This creates a random graph having exactly the same degree for each node. (Note that the default generation method may produce multi-edges and self-loops; the `method='vl'` option instead generates a simple connected graph.)

# In[69]:

plt.plot(degree, G_random_config.degree(), '.')

# Let us again compare the average shortest path length and the clustering coefficient.

# In[70]:

avg_distance_random_config = np.array(G_random_config.shortest_paths()).mean()
print('Empirical average distance: {0:.2f}, random: {1:.2f}, configuration: {2:.2f}'.format(
    avg_distance, avg_distance_random, avg_distance_random_config))
clustering_random_config = G_random_config.transitivity_avglocal_undirected()
print('Empirical clustering: {0:.2f}, random: {1:.2f}, configuration: {2:.2f}'.format(
    clustering, clustering_random, clustering_random_config))

# As you can see, the clustering and the average distance are not much affected by the changed degree distribution. Hence, the clustering is still substantially higher in the empirical network than randomly expected, while the average distance is similar to that of the random graph. This is what is sometimes called a *small world*: a network with relatively high clustering and small distances.

# One difficulty with graphs is that they sometimes show dependencies that you might not expect at first sight. For example, it was thought that a clustering coefficient that decreases with degree would be indicative of some hierarchical structure: the high-degree nodes would connect people from different communities (low clustering), whereas the low-degree nodes would connect mostly to people within the same community (high clustering). Let us check whether that is the case for our empirical network.

# In[71]:

G.vs['clustering'] = G.transitivity_local_undirected()
plt.plot(G.degree(), G.vs['clustering'], '.')
scipy.stats.pearsonr(G.degree(), G.vs['clustering'])

# The result above indicates that there is quite a strong correlation between the clustering and the degree. The result is apparently also highly significant, as indicated by the p-value in the order of $10^{-30}$. However, this is misleading. The test for significance of a correlation assumes that the two variables are independently sampled from some distribution, which is not the case for networks.
# In fact, if we calculate the correlation for the random configuration graph, we also see a clear correlation.

# In[72]:

G_random_config.vs['clustering'] = G_random_config.transitivity_local_undirected()
plt.plot(G_random_config.degree(), G_random_config.vs['clustering'], '.')
scipy.stats.pearsonr(G_random_config.degree(), G_random_config.vs['clustering'])

# Although the two plots are very distinct, the correlation between the degree and the clustering is still 'significant' according to the p-value, which is in the order of $10^{-5}$. The correlation is not as clear as in the empirical network, so there may be some merit to the idea of a hierarchical structure in the empirical network.
#
# If you really want to establish that some observation is different from what can be expected at random, you should repeatedly perform the same analysis on random graphs, and compare your empirical result to the obtained distribution of random results.
#
# The key point here is: don't trust p-values without reflecting on what exactly they test. Furthermore, not all non-zero results are necessarily meaningful: as we have just seen, a correlation of -0.23 can also be achieved in a random graph.

# ### Scale-free

# Most real-world networks have a rather skewed degree distribution; such networks are sometimes referred to as scale-free networks. Already early in the 20th century it was suggested that such a type of distribution could arise via a rich-get-richer mechanism. Such a mechanism has also been called the Matthew effect or preferential attachment. The idea is always similar: the probability of getting one more link is proportional to the number of links a node already has. Indeed, such mechanisms do seem to operate (at least to some extent) in a number of networks. The most recent well-known model was introduced by Barabási & Albert, and is often simply referred to as the BA-model.

# In[73]:

ig.plot(ig.Graph.Barabasi(n=50, m=3))

# In practice, when comparing properties such as clustering or path length, it makes more sense to use the configuration model. Nonetheless, the scale-free model is often used to compare some process against the Erdös-Rényi graphs. For example, a spreading process works quite differently in scale-free graphs than in ER graphs.
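# To see this structural difference directly, here is a minimal sketch comparing the degree distribution of a BA graph with that of an ER graph of the same size; the heavy tail of the BA model should be clearly visible:

# In[ ]:

G_ba = ig.Graph.Barabasi(n=1000, m=3)
G_er = ig.Graph.Erdos_Renyi(n=1000, m=G_ba.ecount())
sns.kdeplot(np.array(G_ba.degree()), label='Barabasi-Albert')
sns.kdeplot(np.array(G_er.degree()), label='Erdos-Renyi')
plt.legend(loc='best')
plt.xlabel('Degree')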
# ## Spreading

# Spreading processes are much like opinion processes, but they work slightly differently. Compared to the opinion dynamics examined earlier in this notebook, a single exposure to an "infected" individual usually suffices to spread the infection. This differs from an opinion model, where it is usually assumed that only multiple exposures induce behavioral changes. This is sometimes referred to as the difference between *simple contagion* and *complex contagion*.
#
# Let us examine a very simple process that starts with some initial person becoming infected. After becoming infected, a person has a single chance to infect each of their neighbors. If some neighbor becomes infected, the process continues from that neighbor.

# In[74]:

def spreading(G, initial_node, p):
    G.vs['infected'] = False
    infect_nodes = [initial_node]
    while infect_nodes:
        v = infect_nodes.pop()
        v['infected'] = True
        for u in G.vs[G.neighbors(v)]:
            if not u['infected'] and np.random.rand() < p:
                infect_nodes.append(u)
    return sum(G.vs['infected'])

# Let us see what happens if we start from some random node. The probability below is chosen as $p = 1/\langle k \rangle$, such that on average about 1 neighbor is infected. As you will see, sometimes the infection spreads to quite some extent, whereas at other times it does not spread at all. You can run it multiple times to see how the result changes.

# In[75]:

spreading(G, G.vs[np.random.randint(G.vcount())], 1/np.mean(G.degree()))
G.vs['color'] = ['red' if v['infected'] else 'gray' for v in G.vs]
ig.plot(G)

# Let us look a bit more systematically into the spreading behavior by repeating the process many times.

# In[76]:

def repeat_spreading(G, p, n_runs):
    return np.mean([spreading(G, G.vs[np.random.randint(G.vcount())], p) for idx in range(n_runs)])

# Now we can easily repeat it for different infection probabilities. It is interesting to see how the empirical network compares to the two random graphs we previously encountered (the ER random graph and the configuration model), as well as to a scale-free graph.
# *Note: the following code may take quite some time to execute.*
# In[77]:

n_runs = 100
p_list = np.linspace(0, 0.1, 10)
nb_infected = [repeat_spreading(G, p, n_runs) for p in p_list]
nb_infected_random = [repeat_spreading(G_random, p, n_runs) for p in p_list]
nb_infected_random_config = [repeat_spreading(G_random_config, p, n_runs) for p in p_list]
G_scale_free = ig.Graph.Barabasi(n=G.vcount(), m=int(np.mean(G.degree())))
nb_infected_scale_free = [repeat_spreading(G_scale_free, p, n_runs) for p in p_list]

# Let us plot the results.

# In[78]:

plt.plot(p_list, nb_infected, label='Empirical')
plt.plot(p_list, nb_infected_random, label='Random')
plt.plot(p_list, nb_infected_random_config, label='Random Configuration')
plt.plot(p_list, nb_infected_scale_free, label='Scale Free')
plt.xlim(0, 0.1)
plt.xlabel('Probability')
plt.ylabel('Infected')
plt.legend(loc='best')

# The clearest difference is that the scale-free network is much more "infectious", purely because of its structure. This is a well-known result: ordinary random graphs have some threshold (here around 0.03) below which almost nobody is infected, and above which a sizeable part of the network becomes infected. In the scale-free graph this is radically different: even for the smallest probabilities, infections spread to a sizeable part of the network. The difference around 0.03 is substantial: more than 200 infections in the scale-free graph, against fewer than 50 in the other graphs.
#
# Nonetheless, for the highschool social network, the random graph is a very reasonable approximation, and the configuration model an even better one. This is important, because for further analysis we could use the random approximation, which makes it significantly easier and simpler to get further insights.
#
# Remarkably, the presence of a clear group structure in terms of classes does not seem to have any effect here. The group structure is completely absent in the configuration model, yet the two show almost identical results. This is quite different from the opinion dynamics, which depend clearly on the group structure.
#
# There are many other questions of interest here: how does the spreading depend on which node is infected first? How is an infection best contained? How many nodes need to be vaccinated to make spreading unlikely? We won't go into these questions here, but as you can imagine, they are of great importance.

# ## Community detection

# A common task in network analysis is *community detection*: finding groups of nodes that are relatively densely connected within communities, but sparsely connected between communities. It provides a sort of bird's-eye view of the network, which helps in its interpretation. Communities often seem to share some characteristic (or, in for example biological networks, some function).
#
# The method that has been most prominent in the literature is called modularity. It is based on comparing the real network to a random network using the configuration null-model: the idea is to find communities with many more edges inside them than would be expected at random. In its basic (unweighted) form, modularity can be written as $Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$, where $A_{ij}$ is the adjacency matrix, $k_i$ the degree of node $i$ and $c_i$ its community. Let us illustrate this on the highschool network. We use the recently introduced Leiden algorithm to detect communities.

# ### Modularity

# In[79]:

partition = leidenalg.find_partition(G, leidenalg.ModularityVertexPartition, weights='weight')
ig.plot(partition)

# As you can see, modularity finds a number of groups that seem to uncover the classes to some extent.
# The actual overlap between the two partitions (one based on the classes and the other based on modularity) can be calculated using the *normalised mutual information* (NMI). This value is 1 if the two partitions correspond exactly, and 0 if there is no correspondence at all.

# In[80]:

G.vs['modularity'] = partition.membership
class_partition = ig.VertexClustering.FromAttribute(G, 'class')
partition.compare_to(class_partition, 'nmi')

# ### CPM

# Modularity thus seems to uncover the classes quite well. Nonetheless, some classes seem to be split into multiple communities, probably because some social groups within the classes exist. Let us examine one particular class more closely.

# In[81]:

H_class = class_partition.subgraph(4)
H_class['layout'] = H_class.layout_fruchterman_reingold(weights='weight')
H_class.es['width'] = 0.01*np.array(H_class.es['weight'])
partition_of_class = ig.VertexClustering.FromAttribute(H_class, 'modularity')
ig.plot(partition_of_class)

# We see that only a few people are not in the same community as the rest of the class. They are also clearly not so strongly connected to the rest of the class. However, there does seem to be some finer structure within this class. Let us try modularity again, now on this class only.

# In[82]:

partition_of_class2 = leidenalg.find_partition(H_class, leidenalg.ModularityVertexPartition, weights='weight')
ig.plot(partition_of_class2)

# The results are interesting: they suggest there are indeed some smaller groups within the classes. But the definition of 'community' is now no longer clear. What would be called a community at one level (i.e. the class at the overall level of the school) is no longer considered a community at a lower level, as modularity tends to split it up. This is related to the problem of the *resolution limit* in community detection: modularity may 'hide' smaller communities inside the communities it detects.
#
# Both the groups within a class and the classes themselves are valid social groups; clearly both have some relevance. Indeed, which resolution is "better" is not a question with a single unique answer. However, the downside of modularity is that its resolution in a sense depends on the size of the network we are looking at.
#
# We will now investigate another method that does not suffer from this problem, called the *Constant Potts Model* (CPM). Although it does require us to choose a resolution parameter $\gamma$, it has a definition of communities that is independent of the size of the network: in its basic form it maximizes $\sum_c \left[ e_c - \gamma \binom{n_c}{2} \right]$, where $e_c$ is the number of edges inside community $c$ and $n_c$ its number of nodes, so that each community should have an internal density of at least $\gamma$.

# In[83]:

resolution_parameter = 1.15
partition = leidenalg.find_partition(G, leidenalg.CPMVertexPartition, weights='weight',
                                     resolution_parameter=resolution_parameter)
ig.plot(partition)

# In[84]:

G.vs['CPM'] = partition.membership
H_class = ig.VertexClustering.FromAttribute(G, 'class').subgraph(4)
H_class['layout'] = H_class.layout_fruchterman_reingold(weights='weight')
H_class.es['width'] = 0.01*np.array(H_class.es['weight'])
partition_of_class = ig.VertexClustering.FromAttribute(H_class, 'CPM')
ig.plot(partition_of_class)

# CPM finds that this class is completely part of a single community.
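# Since `partition` now holds the CPM communities of the whole network, we can again quantify the correspondence with the classes, using the same NMI comparison as before:

# In[ ]:

partition.compare_to(class_partition, 'nmi')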
# If we rerun CPM on this particular class network, we should find a highly similar community structure.

# In[85]:

partition_of_class2 = leidenalg.find_partition(H_class, leidenalg.CPMVertexPartition, weights='weight',
                                               resolution_parameter=resolution_parameter)
ig.plot(partition_of_class2)

# Now we can change the resolution parameter if we want to find smaller subgroups within the classes.

# In[86]:

resolution_parameter = 25
partition_of_class2 = leidenalg.find_partition(H_class, leidenalg.CPMVertexPartition, weights='weight',
                                               resolution_parameter=resolution_parameter)
ig.plot(partition_of_class2)

# If we apply this new resolution parameter at the level of the whole school as well, we should again get similar results.

# In[87]:

partition = leidenalg.find_partition(G, leidenalg.CPMVertexPartition, weights='weight',
                                     resolution_parameter=resolution_parameter)
ig.plot(partition)

# Indeed, now each class seems to be partitioned into multiple smaller groups. In other words, the definition of what constitutes a 'community' is independent of whether we look at the level of a single class or at the level of the entire school.
#
# However, it can be tricky to find resolutions that are of interest. It seems impossible to do this automatically without creating other problems, so it really depends on your own interpretation of the network and of the results. Nonetheless, there are some tools to help you. One is to construct a so-called *resolution profile*: we simply detect the partition for each resolution within a certain range. This can be done reasonably efficiently using bisectioning.
# *Note: constructing the resolution profile may take some time.*
# In[88]:

opt = leidenalg.Optimiser()
resolution_profile = opt.resolution_profile(G, leidenalg.CPMVertexPartition,
                                            resolution_range=(0, 25), weights='weight')

# Let us plot the results and see how they correspond to the classes as they are present in the data.

# In[89]:

fig = plt.figure()
ax1 = fig.add_subplot(111)
ax2 = ax1.twinx()
ax1.step([part.resolution_parameter for part in resolution_profile],
         [part.total_weight_in_all_comms() for part in resolution_profile],
         'blue', label='Internal weight')
ax2.step([part.resolution_parameter for part in resolution_profile],
         [part.compare_to(class_partition, 'nmi') for part in resolution_profile],
         'red', label='NMI (classes)')
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')
plt.xscale('log')

# This reveals that resolutions somewhere between 0.1 and 0.3 all give identical partitions. A partition in this range is hence not very sensitive to small changes in the resolution parameter, which might be an indication that there is something of interest in that range.

# In[90]:

resolution_parameter = 0.2
partition = leidenalg.find_partition(G, leidenalg.CPMVertexPartition, weights='weight',
                                     resolution_parameter=resolution_parameter)
ig.plot(partition)

# If we take a closer look, we see that the classes can actually be divided into three groups: PC/PSI (Physics & Chemistry/engineering), BIO (biology) and MP (Mathematics & Physics). This suggests that the pupils tend to connect more to other people with a similar educational background or interest.

# In[91]:

from collections import Counter
[Counter(H.vs['class']) for H in partition.subgraphs()]

# ## Negative links

# Modularity does not work correctly if there are negative weights. However, this can be corrected for. Let us again work with the international relations network.

# In[92]:

G = ig.Graph.Read('data/international_relations_2000_2015.gml')

# In order to properly detect communities with negative links, we need to split the network into two parts: one with all the positive weights, and one with all the negative weights. We then want to maximize the number of internal links in the positive network, and minimize the number of internal links in the negative network. This is represented below by the respectively positive and negative `layer_weights`.

# In[93]:

G_positive = G.subgraph_edges(G.es.select(weight_gt=0), delete_vertices=False)
G_negative = G.subgraph_edges(G.es.select(weight_lt=0), delete_vertices=False)
G_negative.es['weight'] = -np.array(G_negative.es['weight'])
partition_positive = leidenalg.ModularityVertexPartition(G_positive, weights='weight')
partition_negative = leidenalg.ModularityVertexPartition(G_negative, weights='weight')
opt = leidenalg.Optimiser()
diff = opt.optimise_partition_multiplex([partition_positive, partition_negative],
                                        layer_weights=[1.0, -1.0])
mod_partition = ig.VertexClustering(G, membership=partition_positive.membership)

# Let us examine the results.

# In[94]:

G['layout'] = G.layout_fruchterman_reingold()
ig.plot(mod_partition, vertex_size=8,
        edge_color=['blue' if e['weight'] > 0 else 'red' for e in G.es])

# Modularity still has some issues here, because of its dependence on degree. For example, long paths connected through negative links are sometimes put in the same community, which is a bit strange, since such nodes should rather be in separate communities.
#
# Again, CPM works somewhat better in this regard. We can simply use CPM directly, as it has no problems dealing with negative weights.
# By setting the resolution parameter to 0, we find communities that are positively connected within, but negatively connected between them. In fact, this corresponds to finding communities that are socially balanced.

# In[95]:

CPM_partition = leidenalg.find_partition(G, leidenalg.CPMVertexPartition, weights='weight',
                                         resolution_parameter=0)
G['layout'] = G.layout_fruchterman_reingold()
ig.plot(CPM_partition, vertex_size=8,
        edge_color=['blue' if e['weight'] > 0 else 'red' for e in G.es])

# # Conclusion

# This concludes the introduction to the network aspects of computational social science. You can keep this notebook to look things up during the assignment. Also, don't forget that you can always get help within the notebook using Shift-Tab and Tab. If you need more help on `igraph`, you can also go to either the [reference manual](http://igraph.org/python/doc/tutorial/tutorial.html) or the [introduction](http://igraph.org/python/doc/tutorial/tutorial.html). Finally, you can always ask one of the lecturers to help you out.