NTDS'19 assignment 1: network science — solution¶

Eda Bayram, EPFL LTS4 and Nikolaos Karalias, EPFL LTS2.

Students¶

Team: <your team number>
Students: <your name> (for the indivudual submission) or <the name of all students in the team> (for the team submission)

Rules¶

Grading:

The first deadline is for individual submissions. The second deadline is for the team submission.
All team members will receive the same grade based on the team solution submitted on the second deadline.
As a fallback, a team can ask for individual grading. In that case, solutions submitted on the first deadline are graded.
Collaboration between team members is encouraged. No collaboration between teams is allowed.

Submission:

Textual answers shall be short. Typically one to two sentences.
Code has to be clean.
You cannot import any other library than we imported. Note that Networkx is imported in the second section and cannot be used in the first.
When submitting, the notebook is executed and the results are stored. I.e., if you open the notebook again it should show numerical results and plots. We won't be able to execute your notebooks.
The notebook is re-executed from a blank state before submission. That is to be sure it is reproducible. You can click "Kernel" then "Restart Kernel and Run All Cells" in Jupyter.

Objective¶

The purpose of this milestone is to explore a given dataset, represent it by network by constructing different graphs. In the first section, you will analyze the network properties. In the second section, you will explore various network models and find out the network model fitting the ones you construct from the dataset.

Cora Dataset¶

The Cora dataset consists of scientific publications classified into one of seven research fields.

Citation graph: the citation network can be constructed from the connections given in the cora.cites file.
Feature graph: each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary and its research field, given in the cora.content file. The dictionary consists of 1433 unique words. A feature graph can be constructed using the Euclidean distance between the feature vector of the publications.

The README provides details about the content of cora.cites and cora.content.

Section 1: Network Properties¶

In [1]:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

%matplotlib inline

Question 1: Construct a Citation Graph and a Feature Graph¶

Read the cora.content file into a Pandas DataFrame by setting a header for the column names. Check the README file.

In [2]:

feature_list = list(range(1,1434))
column_list = ['paper_id'] + feature_list + ['class_label']
pd_content = pd.read_csv('data/cora/cora.content', delimiter='\t', names=column_list) 
pd_content.head()

Out[2]:

	paper_id	...	1426	1427	class_label
0	31336	...	0	1	Neural_Networks
1	1061127	...	1	0	Rule_Learning
2	1106406	...	0	0	Reinforcement_Learning
3	13195	...	0	0	Reinforcement_Learning
4	37879	...	0	0	Probabilistic_Methods

5 rows × 1435 columns

Print out the number of papers contained in each of the reasearch fields.

Hint: You can use the value_counts() function.

In [3]:

pd_content['class_label'].value_counts()

Out[3]:

Neural_Networks           818
Probabilistic_Methods     426
Genetic_Algorithms        418
Theory                    351
Case_Based                298
Reinforcement_Learning    217
Rule_Learning             180
Name: class_label, dtype: int64

Select all papers from a field of your choice and store their feature vectors into a NumPy array. Check its shape.

In [4]:

my_field = "Neural_Networks"
pd_field = pd_content[pd_content['class_label'] == my_field]
features = pd_field[feature_list].to_numpy()
features.shape

Out[4]:

(818, 1433)

Let $D$ be the Euclidean distance matrix whose $(i,j)$ entry corresponds to the Euclidean distance between feature vectors $i$ and $j$. Using the feature vectors of the papers from the field which you have selected, construct $D$ as a Numpy array.

In [5]:

num_nodes = features.shape[0]
distance = np.zeros((num_nodes,num_nodes))

for row in range(num_nodes):
    for row2 in range(num_nodes):
        distance[row,row2] = np.sqrt(((features[row,:]-features[row2,:])**2).sum())
        
distance.shape

Out[5]:

(818, 818)

Check the mean pairwise distance $\mathbb{E}[D]$.

In [6]:

mean_distance = distance.mean()
mean_distance

Out[6]:

5.689638434824233

Plot an histogram of the euclidean distances.

In [7]:

plt.figure(1, figsize=(8, 4))
plt.title("Histogram of Euclidean distances between papers")
plt.hist(distance.flatten());

Now create an adjacency matrix for the papers by thresholding the Euclidean distance matrix. The resulting (unweighted) adjacency matrix should have entries $$ A_{ij} = \begin{cases} 1, \; \text{if} \; d(i,j)< \mathbb{E}[D], \; i \neq j, \\ 0, \; \text{otherwise.} \end{cases} $$

First, let us choose the mean distance as the threshold.

In [8]:

threshold = mean_distance
A_feature = (distance < threshold).astype(float) - np.eye(distance.shape[0])

Now read the cora.cites file and construct the citation graph by converting the given citation connections into an adjacency matrix.

In [9]:

cora_cites = np.genfromtxt('data/cora/cora.cites', delimiter='\t')
# get graph edges and flatten them for processing
edges = cora_cites.astype(int)
edges = edges.flatten()

# get adjacency matrix from edge_list
newedge = pd.factorize(edges)[0]
new_edge_list = newedge.reshape(-1, 2)
dim = max(newedge) + 1
adjmat = np.zeros((dim, dim))
adjmat[new_edge_list[:, 0], new_edge_list[:, 1]] = 1

# caution, matrix is not symmetric at this point 
A_citation = adjmat
A_citation.shape

Out[9]:

(2708, 2708)

Get the adjacency matrix of the citation graph for the field that you chose. You have to appropriately reduce the adjacency matrix of the citation graph.

In [10]:

# map the paper ids from the selected community to the relabeled ones
nodemap = dict(zip(edges, newedge))

community_nodes = [nodemap[i] for i in pd_field.paper_id]

A_citation = A_citation[community_nodes, :][:, community_nodes]

Check if your adjacency matrix is symmetric. Symmetrize your final adjacency matrix if it's not already symmetric.

In [11]:

A_citation = ((A_citation + A_citation.T)>0).astype(float)
np.count_nonzero(A_citation - A_citation.transpose())

Out[11]:

Check the shape of your adjacency matrix again.

In [12]:

A_citation.shape

Out[12]:

(818, 818)

Question 2: Degree Distribution and Moments¶

What is the total number of edges in each graph?

In [13]:

num_edges_feature = A_feature.sum() / 2
num_edges_citation = A_citation.sum() / 2
print(f"Number of edges in the feature graph: {num_edges_feature}")
print(f"Number of edges in the citation graph: {num_edges_citation}")

Number of edges in the feature graph: 136771.0
Number of edges in the citation graph: 1175.0

Plot the degree distribution histogram for each of the graphs.

In [14]:

degrees_citation = A_citation.sum(1)
degrees_feature = A_feature.sum(1)

deg_hist_normalization = np.ones(degrees_citation.shape[0]) / degrees_citation.shape[0]

fig, axes = plt.subplots(1, 2, figsize=(16, 4))
axes[0].set_title('Citation graph degree distribution')
axes[0].hist(degrees_citation, weights=deg_hist_normalization);
axes[1].set_title('Feature graph degree distribution')
axes[1].hist(degrees_feature, weights=deg_hist_normalization);

Calculate the first and second moments of the degree distribution of each graph.

In [15]:

degfs, counts_feat = np.unique(degrees_feature, return_counts=True)
degcs, counts_cit = np.unique(degrees_citation, return_counts=True)

degfeatprobs = counts_feat / counts_feat.sum()
degcitprobs = counts_cit / counts_cit.sum()

cit_moment_1 = (degcs*degcitprobs).sum()
cit_moment_2 = (degcs*degcs*degcitprobs).sum()

feat_moment_1 = (degfs*degfeatprobs).sum()
feat_moment_2 = (degfs*degfs*degfeatprobs).sum()

print(f"1st moment of citation graph: {cit_moment_1}")
print(f"2nd moment of citation graph: {cit_moment_2}")
print(f"1st moment of feature graph: {feat_moment_1}")
print(f"2nd moment of feature graph: {feat_moment_2}")

1st moment of citation graph: 2.8728606356968216
2nd moment of citation graph: 23.765281173594133
1st moment of feature graph: 334.4034229828851
2nd moment of feature graph: 167201.1980440098

What information do the moments provide you about the graphs? Explain the differences in moments between graphs by comparing their degree distributions.

Your answer here: The first moment of the graph gives us the average degree. The second moment measures the spread of the distribution around 0. This is slightly different from the variance (second central moment) which measures the spread of the distribution around the mean.

The citation graph is significantly sparser (recall the number of edges from the previous step). This leads to a spike at low degree nodes in the degree distribution and therefore explains the value of the first moment for that graph. For the feature graph we observe that it has more highly connected nodes (notice the fat tail of the distribution). This leads to a higher average degree for the feature graph.

For the feature graph we observe from the degree histogram that the tail of the distribution is more spread out (larger range of degrees). This explains the value the second moment and its differences with the second moment of the citation graph which has less spread in its distribution (smaller range of degrees) and therefore smaller second moment.

Select the 20 largest hubs for each of the graphs and remove them. Observe the sparsity pattern of the adjacency matrices of the citation and feature graphs before and after such a reduction.

In [16]:

# get degrees of the new unordered one
sort_feat = np.argsort(degrees_feature)
sort_cit = np.argsort(degrees_citation)

reduced_A_feature = A_feature[sort_feat[:-20],:][:, sort_feat[:-20]]
reduced_A_citation = A_citation[sort_cit[:-20],:][:, sort_cit[:-20]]

A_feature = A_feature[sort_feat,:][:,sort_feat]
A_citation = A_citation[sort_cit,:][:,sort_cit]

fig, axes = plt.subplots(2, 2, figsize=(16, 16))
axes[0, 0].set_title('Feature graph: adjacency matrix sparsity pattern')
axes[0, 0].spy(A_feature);
axes[0, 1].set_title('Feature graph without top 20 hubs: adjacency matrix sparsity pattern')
axes[0, 1].spy(reduced_A_feature);
axes[1, 0].set_title('Citation graph: adjacency matrix sparsity pattern')
axes[1, 0].spy(A_citation);
axes[1, 1].set_title('Citation graph without top 20 hubs: adjacency matrix sparsity pattern')
axes[1, 1].spy(reduced_A_citation);

Plot the new degree distribution histograms.

In [17]:

reduced_degrees_feat = reduced_A_feature.sum(1)
reduced_degrees_cit = reduced_A_citation.sum(1)

deg_hist_normalization = np.ones(reduced_degrees_feat.shape[0]) / reduced_degrees_feat.shape[0]

fig, axes = plt.subplots(1, 2, figsize=(16, 4))
axes[0].set_title('Citation graph degree distribution')
axes[0].hist(reduced_degrees_cit, weights=deg_hist_normalization);
axes[1].set_title('Feature graph degree distribution')
axes[1].hist(reduced_degrees_feat, weights=deg_hist_normalization);

Compute the first and second moments for the new graphs.

In [18]:

reduced_degfs, reduced_counts_feat = np.unique(reduced_degrees_feat,return_counts=True)
reduced_degns, reduced_counts_net = np.unique(reduced_degrees_cit,return_counts=True)

reduced_degfeatprobs = reduced_counts_feat / reduced_counts_feat.sum()
reduced_degnetprobs = reduced_counts_net / reduced_counts_net.sum()

reduced_cit_moment_1 = (reduced_degns*reduced_degnetprobs).sum()
reduced_cit_moment_2 = (reduced_degns*reduced_degns*reduced_degnetprobs).sum()

reduced_feat_moment_1 = (reduced_degfs*reduced_degfeatprobs).sum()
reduced_feat_moment_2 = (reduced_degfs*reduced_degfs*reduced_degfeatprobs).sum()

print("Citation graph first moment:", reduced_cit_moment_1)
print("Citation graph second moment:", reduced_cit_moment_2)
print("Feature graph first moment: ", reduced_feat_moment_1)
print("Feature graph second moment: ", reduced_feat_moment_2)

Citation graph first moment: 1.9724310776942355
Citation graph second moment: 6.270676691729323
Feature graph first moment:  302.30827067669173
Feature graph second moment:  142170.32581453634

Print the number of edges in the reduced graphs.

In [19]:

print("Feat graph edges", reduced_A_feature.sum() / 2)
print("Citation graph edges", reduced_A_citation.sum() / 2)

Feat graph edges 120621.0
Citation graph edges 787.0

Is the effect of removing the hubs the same for both networks? Look at the percentage changes for each moment. Which of the moments is affected the most and in which graph? Explain why.

Hint: Examine the degree distributions.

Your answer here: Removing hubs has a more pronounced effect in the degree distribution of the citation graph. It is thin tailed and therefore the second moment is going to be affected the most. Notice that the second moment of the citation graph is roughly 0.25 of the original while for the feature graph it's closer to 0.85 times the original value. We also observe that the tail of the feature graph degree distribution still has a high probability and it hasn't shifted at all. This is in contrast with the citation graph where the maximum degree has shifted considerably.

Question 3: Pruning, sparsity, paths¶

By adjusting the threshold of the euclidean distance matrix, prune the feature graph so that its number of edges is roughly close (within a hundred edges) to the number of edges in the citation graph.

In [20]:

threshold = 2.83

A_feature_pruned = (distance < threshold).astype(float) - np.eye(distance.shape[0])
num_edges_feature_pruned = A_feature_pruned .sum() / 2

print(f"Number of edges in the feature graph: {num_edges_feature}")
print(f"Number of edges in the feature graph after pruning: {num_edges_feature_pruned}")
print(f"Number of edges in the citation graph: {num_edges_citation}")

Number of edges in the feature graph: 136771.0
Number of edges in the feature graph after pruning: 1386.0
Number of edges in the citation graph: 1175.0

Check your results by comparing the sparsity patterns and total number of edges between the graphs.

In [21]:

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes[0].set_title('Citation graph sparsity')
axes[0].spy(A_citation);
axes[1].set_title('Feature graph sparsity')
axes[1].spy(A_feature_pruned);

Let $C_{k}(i,j)$ denote the number of paths of length $k$ from node $i$ to node $j$.

We define the path matrix $P$, with entries $ P_{ij} = \displaystyle\sum_{k=0}^{N}C_{k}(i,j). $

Calculate the path matrices for both the citation and the unpruned feature graphs for $N =10$.

Hint: Use powers of the adjacency matrix.

In [22]:

path_lengths = range(11)

path_matrix_citation = sum([np.linalg.matrix_power(A_citation, k) for k in path_lengths])
path_matrix_feature = sum([np.linalg.matrix_power(A_feature, k) for k in path_lengths])

Check the sparsity pattern for both of path matrices.

In [23]:

fig, axes = plt.subplots(1, 2, figsize=(16, 9))
axes[0].set_title('Citation Path matrix sparsity')
axes[0].spy(path_matrix_citation);
axes[1].set_title('Feature Path matrix sparsity')
axes[1].spy(path_matrix_feature);

Now calculate the path matrix of the pruned feature graph for $N=10$. Plot the corresponding sparsity pattern. Is there any difference?

In [24]:

path_matrix_pruned = sum([np.linalg.matrix_power(A_feature_pruned, k) for k in path_lengths])

plt.figure(figsize=(12, 6))
plt.title('Feature Path matrix sparsity')
plt.spy(path_matrix_pruned);

Your answer here: Yes, the path matrix for the unpruned graph doesn't have zero entries while the one for the pruned graph does.

Describe how you can use the above process of counting paths to determine whether a graph is connected or not. Is the original (unpruned) feature graph connected?

Your answer here: Compute the path matrix for $N=|V|-1$ where $V$ represents the vertex set of the graph. If the matrix has no zero entries then the graph is connected, otherwise it's not. $|V|-1$ is the largest possible number of steps between two nodes in any given graph. This is the case for the path graph on N nodes. It is straightforward to observe that $N-1$ steps are needed to get from one end of the path graph to the other.

If the graph is connected, how can you guess its diameter using the path matrix?

Your answer here: The diameter is the length of the "longest shortest path". Compute the smallest $N$ for which the path matrix has no zero entries. This means that $N$ will be the smallest path length at which paths between all pairs of nodes exist, which by definition is the diameter.

If any of your graphs is connected, calculate the diameter using that process.

In [25]:

path_matrix = np.zeros((A_feature.shape[0],A_feature.shape[0]))
for k in range(11):
    path_matrix += np.linalg.matrix_power(A_feature, k)
    if (path_matrix==0).sum() == 0:
        break    

diameter = k
print(f"The diameter is: {diameter}")

The diameter is: 2

Check if your guess was correct using NetworkX. Note: usage of NetworkX is only allowed in this part of Section 1.

In [26]:

import networkx as nx
feature_graph = nx.from_numpy_matrix(A_feature)
print(f"Diameter according to networkx: {nx.diameter(feature_graph)}")

Diameter according to networkx: 2

Section 2: Network Models¶

In this section, you will analyze the feature and citation graphs you constructed in the previous section in terms of the network model types. For this purpose, you can use the NetworkX libary imported below.

In [27]:

import networkx as nx

Let us create NetworkX graph objects from the adjacency matrices computed in the previous section.

In [28]:

G_citation = nx.from_numpy_matrix(A_citation)
print('Number of nodes: {}, Number of edges: {}'. format(G_citation.number_of_nodes(), G_citation.number_of_edges()))
print('Number of self-loops: {}, Number of connected components: {}'. format(G_citation.number_of_selfloops(), nx.number_connected_components(G_citation)))

Number of nodes: 818, Number of edges: 1175
Number of self-loops: 0, Number of connected components: 104

In the rest of this assignment, we will consider the pruned feature graph as the feature network.

In [29]:

G_feature = nx.from_numpy_matrix(A_feature_pruned)
print('Number of nodes: {}, Number of edges: {}'. format(G_feature.number_of_nodes(), G_feature.number_of_edges()))
print('Number of self-loops: {}, Number of connected components: {}'. format(G_feature.number_of_selfloops(), nx.number_connected_components(G_feature)))

Number of nodes: 818, Number of edges: 1386
Number of self-loops: 0, Number of connected components: 684

Question 4: Simulation with Erdős–Rényi and Barabási–Albert models¶

Create an Erdős–Rényi and a Barabási–Albert graph using NetworkX to simulate the citation graph and the feature graph you have. When choosing parameters for the networks, take into account the number of vertices and edges of the original networks.

The number of nodes should exactly match the number of nodes in the original citation and feature graphs.

In [30]:

assert len(G_citation.nodes()) == len(G_feature.nodes())
n = len(G_citation.nodes())
n

Out[30]:

The number of match shall fit the average of the number of edges in the citation and the feature graph.

In [31]:

m = np.round((G_citation.size() + G_feature.size()) / 2)
m

Out[31]:

1280.0

How do you determine the probability parameter for the Erdős–Rényi graph?

Your answer here: For the ER network, the expected number of edges is $$E[m] = pn(n-1)/2.$$

In [32]:

p = 2*m / (n * (n-1))
G_er = nx.erdos_renyi_graph(n, p)

Check the number of edges in the Erdős–Rényi graph.

In [33]:

print('My Erdos-Rényi network that simulates the citation graph has {} edges.'.format(G_er.size()))

My Erdos-Rényi network that simulates the citation graph has 1288 edges.

How do you determine the preferential attachment parameter for Barabási–Albert graphs?

Your answer here: For the BA network, m is fixed to $$m = q(n-q),$$ where $q$ is the model parameter.

In [34]:

q = 2
G_ba = nx.barabasi_albert_graph(n, q)

Check the number of edges in the Barabási–Albert graph.

In [35]:

print('My Barabási-Albert network that simulates the citation graph has {} edges.'.format(G_ba.size()))

My Barabási-Albert network that simulates the citation graph has 1632 edges.

Question 5: Giant Component¶

Check the size of the largest connected component in the citation and feature graphs.

In [36]:

giant_citation = max(nx.connected_component_subgraphs(G_citation), key=len)
print('The giant component of the citation graph has {} nodes and {} edges.'.format(giant_citation.number_of_nodes(), giant_citation.size()))

The giant component of the citation graph has 636 nodes and 1079 edges.

In [37]:

giant_feature = max(nx.connected_component_subgraphs(G_feature), key=len)
print('The giant component of the feature graph has {} nodes and {} edges.'.format(giant_feature.number_of_nodes(), giant_feature.size()))

The giant component of the feature graph has 117 nodes and 1364 edges.

Check the size of the giant components in the generated Erdős–Rényi graph.

In [38]:

giant_er = max(nx.connected_component_subgraphs(G_er), key=len)
print('The giant component of the Erdos-Rényi network has {} nodes and {} edges.'.format(giant_er.number_of_nodes(), giant_er.size()))

The giant component of the Erdos-Rényi network has 768 nodes and 1286 edges.

Let us match the number of nodes in the giant component of the feature graph by simulating a new Erdős–Rényi network. How do you choose the probability parameter this time?

Hint: Recall the expected giant component size from the lectures.

Your answer here: We know from the lectures that in Erdős–Rényi networks, the fraction of nodes in the giant component $$S = \frac{N_G}{N}$$ grows with the average degree by $$S = 1 - e^{-\langle k \rangle S}.$$ Therefore, as $\langle k \rangle = (N-1)p$:

In [39]:

S = giant_feature.number_of_nodes() / n
p_new = - np.log(1-S) / (S * (n-1))
G_er_new = nx.erdos_renyi_graph(n, p_new)

Check the size of the new Erdős–Rényi network and its giant component.

In [40]:

print('My new Erdos Renyi network that simulates the citation graph has {} edges.'.format(G_er_new.size()))
giant_er_new = max(nx.connected_component_subgraphs(G_er_new), key=len) # Your code here.
print('The giant component of the new Erdos-Rényi network has {} nodes and {} edges.'.format(giant_er_new.number_of_nodes(), giant_er_new.size()))

My new Erdos Renyi network that simulates the citation graph has 463 edges.
The giant component of the new Erdos-Rényi network has 249 nodes and 253 edges.

Question 6: Degree Distributions¶

Recall the degree distribution of the citation and the feature graph.

In [41]:

fig, axes = plt.subplots(1, 2, figsize=(15, 6))
axes[0].set_title('Citation graph')
citation_degrees = [degree for node, degree in G_citation.degree()]
axes[0].hist(citation_degrees);
axes[1].set_title('Feature graph')
feature_degrees = [degree for node, degree in G_feature.degree()]
axes[1].hist(feature_degrees);

What does the degree distribution tell us about a network? Can you make a prediction on the network model type of the citation and the feature graph by looking at their degree distributions?

Your answe here: The degree distribution may give us a hint about the formation pattern of a network. For example, a power-law degree distribution signifies a scale-free network and a binomial distribution is outcome of a random network formation. For the feature and the citation graph, we observe a degree distribution close to a power-law distribution. Hence, we can predict that they might have a scale-free growth.

Now, plot the degree distribution historgrams for the simulated networks.

In [42]:

fig, axes = plt.subplots(1, 3, figsize=(20, 6))
axes[0].set_title('Erdos-Rényi network')
er_degrees = [degree for node, degree in G_er.degree()]
axes[0].hist(er_degrees);
axes[1].set_title('Barabási-Albert network')
ba_degrees = [degree for node, degree in G_ba.degree()]
axes[1].hist(ba_degrees);
axes[2].set_title('new Erdos-Rényi network')
er_new_degrees = [degree for node, degree in G_er_new.degree()]
axes[2].hist(er_new_degrees);

In terms of the degree distribution, is there a good match between the citation and feature graphs and the simulated networks? For the citation graph, choose one of the simulated networks above that match its degree distribution best. Indicate your preference below.

Your answer here: Degree distribution of the citation graph matches that of the Barabási-Albert network best.

You can also simulate a network using the configuration model to match its degree disctribution exactly. Refer to Configuration model.

Let us create another network to match the degree distribution of the feature graph.

In [43]:

G_config = nx.configuration_model(feature_degrees) 
print('Configuration model has {} nodes and {} edges.'.format(G_config.number_of_nodes(), G_config.size()))

Configuration model has 818 nodes and 1386 edges.

Does it mean that we create the same graph with the feature graph by the configuration model? If not, how do you understand that they are not the same?

Your answer here: It does not mean considering the topology. We can prove that they are not the same graph simply by checking the number of connected components for example.

In [44]:

nx.number_connected_components(G_config)

Out[44]:

Question 7: Clustering Coefficient¶

Let us check the average clustering coefficient of the original citation and feature graphs.

In [45]:

nx.average_clustering(G_citation)

Out[45]:

0.21693567980632192

In [46]:

nx.average_clustering(G_feature)

Out[46]:

0.1220744470334593

What does the clustering coefficient tell us about a network? Comment on the values you obtain for the citation and feature graph.

Your answer here: Clustering coefficient give us a measure about the connectivity pattern of a node to its neighbors and give us a hint about the network model. We obtain small values for a "small-world" network models since they are subject to unexpected wirings to some degree. Relatively large values are observed for the networks having more regular connectivity patterns. By looking at the the values obtained here, we can say that the clustering coeficients of the citation and feature graphs are as large as to show that they are not random.

Now, let us check the average clustering coefficient for the simulated networks.

In [47]:

nx.average_clustering(G_er)

Out[47]:

0.003341483292583537

In [48]:

nx.average_clustering(G_ba)

Out[48]:

0.023269203696504245

In [49]:

nx.average_clustering(nx.Graph(G_config))

Out[49]:

0.056406212498437525

Comment on the values you obtain for the simulated networks. Is there any good match to the citation or feature graph in terms of clustering coefficient?

Your answer here: Since ER network is constructed totally random, it has a very small clustering coefficient. BA and configuration model might have relatively higher clustering coefficient yet not as high as the original graphs that we constructed from the dataset. It is probably because the original graphs have much more of a regular pattern than a BA network model.

Check the other network model generators provided by NetworkX. Which one do you predict to have a better match to the citation graph or the feature graph in terms of degree distribution and clustering coefficient at the same time? Justify your answer.

Your answer here: Our graphs display more regular patterns than a BA network considering the clustering coefficient, but still seem to have a power-law degree distribution considering the degree distributions. We can try clustered power-law graph, considering the sparsity pattern in the previous section.

If you find a better fit, create a graph object below for that network model. Print the number of edges and the average clustering coefficient. Plot the histogram of the degree distribution.

In [50]:

q = 2  # preferential attachement (determines the number of edges)
p = 0.2  # probability of triangle rewiring (determines regularity)
G_pl = nx.powerlaw_cluster_graph(n, q, p)
nx.average_clustering(G_pl)

Out[50]:

0.179981922130866

In [51]:

G_pl.size()

Out[51]:

In [52]:

plt.hist([degree for node, degree in G_pl.degree()]);

Comment on the similarities of your match.

Your answer here: BA network was a good fit in terms of degree distribution but it has a very small clustering coefficient due to its random wirings. Using the clustered power law model, we are able to obtain a graph with a higher clustering coefficient since the graph has a more regular connectivity pattern due to the clusters. Also, it still displays a power-law degree distribution.