In the previous sessions, you learned how to construct scientometric networks in Python. It was clear that this can be quite challenging. VOSviewer takes care of a lot of the necessary work in creating scientometric networks. You can hence use VOSviewer to create networks, which you could then export and analyse further in Python. We will here take this approach.
You have previously constructed scientometric networks using VOSviewer. You can import the resulting network for further analysis in igraph
. In order to import the file in igraph
you need to have saved both the map
file and the network
file in VOSviewer. See the manual of VOSviewer for more explanation. As in the previous Python notebook, we have prepared some files for you, in this case the author collaboration network from the Web of Science files that we analysed previously.
We first import the necessary packages. You will presumably recognize these still from the previous Python notebook.
import pandas as pd
import igraph as ig
Now let us read the map and network file from VOSviewer.
data-files/vosviewer/vosviewer_map.txt
using tabs ('\t'
) as a field separator, and call the resulting variable map_df
.
The network file from VOSviewer has no header, so we set it manually
network_df = pd.read_csv('data-files/vosviewer/vosviewer_network.txt', sep='\t', header=None,
names=['idA', 'idB', 'weight'])
Now we have loaded the data, so we can simply construct a network as before.
G_vosviewer = ig.Graph.DictList(
vertices=map_df.to_dict('records'),
edges=network_df.to_dict('records'),
vertex_name_attr='id',
edge_foreign_keys=('idA', 'idB'),
directed=False
)
The layout and clustering is also stored by VOSviewer, and we can use that to display the same visualization in igraph
.
layout = ig.Layout(coords=zip(*[G_vosviewer.vs['x'], G_vosviewer.vs['y']]))
clustering = ig.VertexClustering.FromAttribute(G_vosviewer, 'cluster')
ig.plot(clustering, layout=layout, vertex_size=4, vertex_frame_width=0, vertex_label=None)
A common phenomenon in many networks is the presence of group structure, where nodes within the same group are densely connected. Such a structure is sometimes called a modular structure, and a frequently used measure of group structure is known as modularity. You have already encountered this functionality briefly in VOSviewer, which provides clusters. Here we will explore this a bit more in-depth.
First, we will import a package called leidenalg
which is the Leiden algorithm, which we will use for clustering. It is built on top of igraph
so that it easily integrates with all the exisiting methods of igraph
.
import leidenalg
Now let us find clusters in the collaboration network from VOSviewer, using the weight of the edges. Because the algorithm is stochastic, it may yield somewhat different results every time you run it. To prevent that from happening, and to always get the same result, we will set the random seed to 0. The result is a VertexClustering
, which we already briefly encountered when using the clustering results from VOSviewer.
We will first find clusters using modularity.
optimiser = leidenalg.Optimiser()
optimiser.set_rng_seed(0)
clusters = leidenalg.ModularityVertexPartition(G_vosviewer, weights='weight')
optimiser.optimise_partition(clusters)
The length of the clusters
variable indicates the number of clusters.
len(clusters)
When accessing clusters
variable as a list, each element corresponds to the set of nodes in that cluster.
30
?
Hence, node 548
, node 1052
, etc... belong to cluster 30
. Another way to look at the clusters is by looking at the membership
of clusters
.
Hence, node 0
belongs to cluster 7
, node 1
belongs to cluster 9
, node 2
belongs to cluster 4
, et cetera.
Let us take a closer look at the largest cluster.
H = clusters.giant()
print(H.summary())
We could again detect clusters using modularity in the largest cluster.
optimiser.set_rng_seed(0)
subclusters = leidenalg.ModularityVertexPartition(H, weights='weight')
optimiser.optimise_partition(subclusters)
ig.plot(subclusters, vertex_size=5, vertex_label=None)
In general, modularity will continue to find subclusters in this way. An alternative approach, called CPM, does not suffer from that problem.
Let us detect clusters using CPM. We do have to specify a parameter, called the resolution_parameter
. As its name suggests, it specifies the resolution of the clusters we would like to find. At a higher resolution we will tend to find smaller clusters, while at a lower resolution we find larger clusters. Let us use the resolution parameter 0.01.
optimiser.set_rng_seed(0)
clusters = leidenalg.CPMVertexPartition(G_vosviewer,
weights='weight',
resolution_parameter=0.1)
optimiser.optimise_partition(clusters)
clusters.giant().vcount()
resolution_parameter
. How many subclusters do you find? How does that compare to modularity?
resolution_parameter
.
Modularity adapts itself to the network. In a sense that is convenient, because you then do not have to specify any parameters. On the other hand, it makes the definition of what a "cluster" is less clear.
CPM does not adapt itself to the network, and maintains the same defintion across different networks. That is convenient, because it brings more clarity to what we mean by a "cluster". Whenever you try to find subclusters using the same resolution_parameter
, CPM should not find any subclusters. In practice, it may happen that CPM still finds some subclusters, in which case the original clusters were actually not the best possible. The Leiden algorithm can be run for multiple iterations, and with each iteration, the chances are smaller that CPM would find such subclusters. Modularity will always find subclusters, independent of the number of iterations.
optimise_partition
returns how much further it managed to improve the function, so that if it returns 0.0
, it means it couldn't find any further improvement. Execute the cell repeatedly. Does it return 0.0 after some time?
Let us compare the clusters that we detected in Python with the clustering results from VOSviewer.
We can summarize the overall similarity to the partition based on the disciplines using the Normalised Mutual Information (NMI). The NMI varies between 0 and 1 and equals 1 if both are identical.
clusters.compare_to(clustering, method='nmi')
There are some differences between the clustering from VOSviewer and the clusters we detected in Python. This will of course highly depend on what resolution parameter we have used for both results. One other important difference is that VOSviewer will by default use normalized weights. By default, it will divide the weight of a link by the expected weight, assuming that the total link weight of each node would remain the same, which is sometimes referred to as the association strength. We also perform this normalization here.
G_vosviewer.es['weight_normalized'] = [
e['weight']/( G_vosviewer.vs[e.source]['weight<Total link strength>']*G_vosviewer.vs[e.target]['weight<Total link strength>'] / (2*sum(G_vosviewer.es['weight'])) )
for e in G_vosviewer.es]
By default VOSviewer uses the default resolution of 1
for these normalized weights. If we now detect clusters using these weights, you will see that the result are more closely aligned to the VOSviewer results.
clusters = leidenalg.find_partition(G_vosviewer, leidenalg.CPMVertexPartition,
weights='weight_normalized', resolution_parameter=1,
n_iterations=10)
clusters.compare_to(clustering, method='nmi')
Finally, the Leiden algorithm is also directly implemented in igraph
itself nowadays. It is somewhat less elaborate than the leidenalg
package, but it is also substantially faster. If you are analysing very large networks, it might be better to use the igraph
Leiden algorithm. Using it is straightforward.
clusters = G_vosviewer.community_leiden(objective_function='CPM',weights='weight_normalized',
resolution_parameter=1.0, n_iterations=10)
clusters.compare_to(clustering, method='nmi')
Now let us explore cluster detection a bit further.
resolution_parameter
when detecting clusters using the CPM method. What resolution_parameter
seems reasonable to you, and why?
resolution_parameter
such that the network separates in two large clusters (and some remaining small clusters). What is the cause of these two large clusters? (Hint: examine the author names)
We will now use the same type of clustering technique that we used previously in a slightly different way. Instead of clustering a network, we will cluster a specific type of network, namely a bipartite network. This requires a slightly different (and more complicated) approach. More specifically, we will cluster a document-term network, where documents are linked to terms if those terms appear in a document.
We leave the task of extracting terms to VOSviewer, and simply import the resulting document-term network in Python. At the end of the notebook, you will find instructions how to extract the document-term network from VOSviewer yourself.
We read two files: (1) the terms.txt
file, which simply contains the terms and their id
; and (2) the doc-term.txt
file, which contains which term occurs in which document. The document id
refers to the line number of the WoS files that were read by VOSviewer. We will encounter this later.
terms_df = pd.read_csv('data-files/vosviewer/terms.txt', sep='\t', index_col='id')
doc_terms_df = pd.read_csv('data-files/vosviewer/doc-term.txt', sep='\t')
In this file, both the documents and the terms are using the same numbers, so that igraph
cannot distinguish them (e.g. there is both a document 1
and a term 1
). We therefore create separate ids for both the documents and the terms.
doc_terms_df['document id'] = doc_terms_df['document id'].map(lambda x: str(x) + '-doc');
doc_terms_df['term id'] = doc_terms_df['term id'].map(lambda x: str(x) + '-term');
We can now create the network.
G_doc_term = ig.Graph.TupleList(
edges=doc_terms_df.values,
vertex_name_attr='id',
directed=False
)
This is a bipartite network, and we create a specific vertex attribute to indicate what the type is of the node: either a doc
or a term
.
G_doc_term.vs['type'] = ['doc' if 'doc' in v['id'] else 'term' for v in G_doc_term.vs]
Similar to the co-authorship network, VOSviewer typically normalizes the weights in a network by using the association strength, and we will also use that here.
G_doc_term.es['weight'] = [2.0*G_doc_term.ecount()/(G_doc_term.vs[e.source].degree()*G_doc_term.vs[e.target].degree())
for e in G_doc_term.es];
We now employ a small trick in the leidenalg
package in order to do clustering in a bipartite network. We will not explain the full details here, please see the documentation for a brief explanation of this approach. Please note that this approach is not possible using the internal igraph
Leiden algorithm.
partition, partition_docs, partition_terms = leidenalg.CPMVertexPartition.Bipartite(
G_doc_term, types='type', weights='weight', resolution_parameter_01=1)
We are now ready to detect clusters, but we are going to use all three partitions we created. We do so by using the function optimise_partition_multiplex
instead of the optimise_partition
function that we used previously. We have to pass a list of partitions to that function. For the trick to work, we also need to pass the argument layer_weights=[1,-1,-1]
, which assumes that the partition
is the first element of the list that we pass.
optimiser = leidenalg.Optimiser()
optimiser.set_rng_seed(0)
optimiser.optimise_partition_multiplex(
[partition, partition_docs, partition_terms],
layer_weights=[1,-1,-1], n_iterations=100)
Now partition
contains the clustering results (actually, partition_docs
and partition_terms
contain the identical clustering results). We extract the cluster membership of each node, and make it a new node attribute.
G_doc_term.vs['cluster'] = partition.membership
G_doc_term.vs['degree'] = G_doc_term.degree();
We will now create a so-called projection of the bipartite graph, which actually simply refers to the creation of a co-occurrence network.
G_doc_term.vs['type_int'] = [1 if v['type'] == 'term' else 0 for v in G_doc_term.vs];
G_terms = G_doc_term.bipartite_projection(types='type_int', which=1);
G_terms.simplify(combine_edges='sum');
G_terms.vs['id'] = [int(v['id'][:-5]) for v in G_terms.vs];
G_terms.vs['term'] = [terms_df.loc[v['id'],'term'] for v in G_terms.vs];
Now G_terms
contains only terms and the co-occurrence between them. We will export this network to a file format so that we can read it back into VOSviewer. First, let us create the output directory (if necessary).
import os
output_dir = 'results/'
if not os.path.exists(output_dir):
os.makedirs(output_dir)
Now we export the network G_terms
in file format which is understandable to VOSviewer.
nodes_df = pd.DataFrame.from_dict({attr: G_terms.vs[attr] for attr in G_terms.vs.attributes()});
nodes_df['label'] = nodes_df['term'];
nodes_df['cluster'] += 1;
nodes_df['weight<Occurence>'] = nodes_df['degree'];
nodes_df = nodes_df.sort_values('id')
nodes_df[['id', 'label', 'cluster', 'weight<Occurence>']].to_csv(output_dir + 'map_vosviewer.txt', sep='\t', index=False);
edge_df = pd.DataFrame([(G_terms.vs[e.source]['id'], G_terms.vs[e.target]['id'], e['weight']) for e in G_terms.es],
columns=['source', 'target', 'weight']);
edge_df = edge_df.sort_values(['source', 'target']);
edge_df.to_csv(output_dir + 'network_vosviewer.txt', sep='\t', index=False, header=False);
The great benefit of doing the clustering in Python is that we now also have a clustering of the publications. This is something that is not possible in VOSviewer.
Let us first load the actual publication files which were used by VOSviewer (we have already done this in the previous notebook). As said, the document id
refers to the line number of the WoS files that were read by VOSviewer, starting from 1
. We therefore also create a document id
that is the same.
import glob
import csv
files = sorted(glob.glob('data-files/wos/tab-delimited/*.txt'))
publications_df = pd.concat(pd.read_csv(f, sep='\t', quoting=csv.QUOTE_NONE,
usecols=range(68), index_col='UT') for f in files)
publications_df['document id'] = range(1,publications_df.shape[0]+1)
Now let us create a dataframe from G_doc_term
with all the information from the documents.
nodes_df = pd.DataFrame.from_dict({attr: G_doc_term.vs[attr] for attr in G_doc_term.vs.attributes()});
nodes_df = nodes_df[nodes_df['type'] == 'doc'];
Now we need back the original integer document id
, instead of the identifiers we created doc-1
, doc-2
, etc... We can then use those document id
to merge back the results with the original information from the publications.
nodes_df['document id'] = nodes_df['id'].str[:-4].astype(int);
publications_df = pd.merge(nodes_df[['document id', 'cluster']], publications_df,
left_on='document id', right_on='document id')
Finally, for further inspection, we may want to export our results to a .csv
file.
publications_df[['AU', 'PY', 'TI', 'SO', 'cluster']].to_csv(output_dir + 'publications_clustering.csv',
index=False)
terms.csv
file and the doc-terms.csv
file.
VOSviewer will now calculate the "relevance" scores. When it is done, you will be shown a list of terms together with the number of their occurrences and the relevance scores. Please follow the following remaining steps.
terms.txt
) and make sure you choose an appropriate directory and then press "Export".doc-terms.txt
) and make sure you choose an appropriate directory and then press "Export".terms.csv
file and the doc-terms.csv
files. Detect the clusters in this bipartite network, as explained above.