Introduction

In the previous sessions, you learned how to construct scientometric networks in Python, which can be quite challenging. VOSviewer takes care of much of the work needed to create such networks. You can therefore create networks in VOSviewer, export them, and analyse them further in Python. That is the approach we take here.

VOSviewer

You have previously constructed scientometric networks using VOSviewer. You can import the resulting network for further analysis in igraph. In order to import the file in igraph you need to have saved both the map file and the network file in VOSviewer. See the manual of VOSviewer for more explanation. As in the previous Python notebook, we have prepared some files for you, in this case the author collaboration network from the Web of Science files that we analysed previously.

We first import the necessary packages, which you will probably recognize from the previous Python notebook.

In [ ]:
import pandas as pd
import igraph as ig

Now let us read the map and network file from VOSviewer.

Read the file data-files/vosviewer/vosviewer_map.txt using tabs ('\t') as a field separator, and call the resulting variable map_df.
In [ ]:
map_df = pd.read_csv('data-files/vosviewer/vosviewer_map.txt', sep='\t')

The network file from VOSviewer has no header, so we set the column names manually.

In [ ]:
network_df = pd.read_csv('data-files/vosviewer/vosviewer_network.txt', sep='\t', header=None,
                         names=['idA', 'idB', 'weight'])

Now we have loaded the data, so we can simply construct a network as before.

In [ ]:
G_vosviewer = ig.Graph.DictList(
      vertices=map_df.to_dict('records'),
      edges=network_df.to_dict('records'),
      vertex_name_attr='id',
      edge_foreign_keys=('idA', 'idB'),
      directed=False
      )

The layout and clustering are also stored by VOSviewer, and we can use them to display the same visualization in igraph.

In [ ]:
layout = ig.Layout(coords=list(zip(G_vosviewer.vs['x'], G_vosviewer.vs['y'])))
clustering = ig.VertexClustering.FromAttribute(G_vosviewer, 'cluster')

ig.plot(clustering, layout=layout, vertex_size=4, vertex_frame_width=0, vertex_label=None)

Clustering

A common phenomenon in many networks is the presence of group structure, where nodes within the same group are densely connected. Such a structure is sometimes called a modular structure, and a frequently used measure of group structure is known as modularity. You have already encountered this functionality briefly in VOSviewer, which provides clusters. Here we will explore this a bit more in-depth.
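
To make the notion of modularity a bit more concrete, here is a minimal sketch that computes it by hand for a toy network. It assumes an undirected, unweighted network and the common definition in which each cluster contributes its fraction of internal edges minus the fraction expected from the degrees of its nodes. The function name `modularity` and the toy edge list are made up for illustration and are not part of igraph or leidenalg.

```python
from collections import defaultdict

def modularity(edges, membership):
    """Modularity of an undirected, unweighted graph given as an edge list."""
    m = len(edges)
    internal = defaultdict(int)    # number of edges inside each cluster
    degree_sum = defaultdict(int)  # total degree of the nodes in each cluster
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    # Each cluster contributes: observed internal fraction minus expected fraction.
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Toy network (hypothetical data): two triangles joined by a single edge.
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]

q_two_groups = modularity(toy_edges, [0, 0, 0, 1, 1, 1])
q_one_group = modularity(toy_edges, [0, 0, 0, 0, 0, 0])
print(q_two_groups, q_one_group)
```

Splitting the toy network into its two triangles gives a higher modularity than putting all nodes in a single group, which matches the intuition that the triangles are densely connected internally.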

First, we will import a package called leidenalg, which implements the Leiden algorithm that we will use for clustering. It is built on top of igraph, so that it integrates easily with all the existing methods of igraph.

In [ ]:
import leidenalg

Now let us find clusters in the collaboration network from VOSviewer, using the weight of the edges. Because the algorithm is stochastic, it may yield somewhat different results every time you run it. To prevent that from happening, and to always get the same result, we will set the random seed to 0. The result is a VertexClustering, which we already briefly encountered when using the clustering results from VOSviewer.

We will first find clusters using modularity.

In [ ]:
optimiser = leidenalg.Optimiser()
optimiser.set_rng_seed(0)
clusters = leidenalg.ModularityVertexPartition(G_vosviewer, weights='weight')
optimiser.optimise_partition(clusters)

The length of the clusters variable indicates the number of clusters.

In [ ]:
len(clusters)

When accessing the clusters variable as a list, each element corresponds to the set of nodes in that cluster.

What are the nodes in cluster 30?
In [ ]:
clusters[30]

Hence, node 548, node 1052, and so on belong to cluster 30. Another way to look at the clusters is through the cluster membership of each node.

What is the membership of the first 10 nodes?
In [ ]:
clusters.membership[:10]

Hence, node 0 belongs to cluster 7, node 1 belongs to cluster 9, node 2 belongs to cluster 4, et cetera.
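
The two views (a list of clusters and a membership list) carry the same information. The sketch below converts between them in plain Python, without igraph. The membership list and both function names are hypothetical and only serve to illustrate the relation.

```python
from collections import defaultdict

# Hypothetical membership list: position i holds the cluster of node i.
membership = [0, 0, 1, 2, 1, 0]

def membership_to_clusters(membership):
    """Group node indices by cluster, ordered by cluster number."""
    groups = defaultdict(list)
    for node, cluster in enumerate(membership):
        groups[cluster].append(node)
    return [groups[c] for c in sorted(groups)]

def clusters_to_membership(clusters):
    """Invert the grouping: look up the cluster of every node."""
    result = [None] * sum(len(nodes) for nodes in clusters)
    for cluster, nodes in enumerate(clusters):
        for node in nodes:
            result[node] = cluster
    return result

toy_clusters = membership_to_clusters(membership)
print(toy_clusters)                          # [[0, 1, 5], [2, 4], [3]]
print(clusters_to_membership(toy_clusters))  # [0, 0, 1, 2, 1, 0]
```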

Let us take a closer look at the largest cluster.

In [ ]:
H = clusters.giant()
print(H.summary())

We could again detect clusters using modularity in the largest cluster.

In [ ]:
optimiser.set_rng_seed(0)
subclusters = leidenalg.ModularityVertexPartition(H, weights='weight')
optimiser.optimise_partition(subclusters)
ig.plot(subclusters, vertex_size=5, vertex_label=None)

In general, modularity will continue to find subclusters in this way. An alternative approach, called CPM, does not suffer from that problem.

Let us detect clusters using CPM. We do have to specify a parameter, called the resolution_parameter. As its name suggests, it specifies the resolution of the clusters we would like to find. At a higher resolution we will tend to find smaller clusters, while at a lower resolution we find larger clusters. Let us use the resolution parameter 0.1.
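
The effect of the resolution parameter can be illustrated with a small hand-written sketch. It assumes that the CPM quality of a partition is the sum, over clusters, of the internal edge weight minus the resolution times the number of possible node pairs in the cluster, with all node sizes equal to 1. The function `cpm_quality` and the toy network are made up for this illustration and are not the leidenalg implementation.

```python
def cpm_quality(edges, membership, resolution):
    """Sum over clusters of (internal weight - resolution * n_c * (n_c - 1) / 2)."""
    sizes = {}
    internal = {}
    for c in membership:
        sizes[c] = sizes.get(c, 0) + 1
    for u, v, w in edges:
        if membership[u] == membership[v]:
            c = membership[u]
            internal[c] = internal.get(c, 0.0) + w
    return sum(internal.get(c, 0.0) - resolution * n * (n - 1) / 2
               for c, n in sizes.items())

# Toy network (hypothetical data): two triangles joined by a single edge.
toy_edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
             (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 1.0)]
merged = [0, 0, 0, 0, 0, 0]  # one large cluster
split = [0, 0, 0, 1, 1, 1]   # one cluster per triangle

for resolution in (0.05, 0.5):
    print(resolution,
          cpm_quality(toy_edges, merged, resolution),
          cpm_quality(toy_edges, split, resolution))
```

In this toy network, one edge connects the two triangles out of nine possible pairs, so keeping them together only pays off when the resolution is below 1/9. At the lower resolution the single large cluster scores better, and at the higher resolution the two smaller clusters do.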

In [ ]:
optimiser.set_rng_seed(0)
clusters = leidenalg.CPMVertexPartition(G_vosviewer,
                                     weights='weight',
                                     resolution_parameter=0.1)
optimiser.optimise_partition(clusters)
clusters.giant().vcount()

Detect subclusters in the largest cluster using CPM, using the same resolution_parameter. How many subclusters do you find? How does that compare to modularity?
In [ ]:
 
Try to find more subclusters by specifying a higher resolution_parameter.
In [ ]:
 

Modularity adapts itself to the network. In a sense that is convenient, because you then do not have to specify any parameters. On the other hand, it makes the definition of what a "cluster" is less clear.

CPM does not adapt itself to the network, and maintains the same definition across different networks. That is convenient, because it brings more clarity to what we mean by a "cluster". Whenever you try to find subclusters using the same resolution_parameter, CPM should not find any subclusters. In practice, it may happen that CPM still finds some subclusters, in which case the original clusters were actually not the best possible. The Leiden algorithm can be run for multiple iterations, and with each iteration, the chances are smaller that CPM would find such subclusters. Modularity will always find subclusters, independent of the number of iterations.

Try to optimise the partition further with more iterations, as indicated below (n_iterations=10). Note that the function returns how much it managed to improve the quality function, so if it returns 0.0, it could not find any further improvement. Execute the cell repeatedly. Does it return 0.0 after some time?
In [ ]:
optimiser.optimise_partition(clusters, n_iterations=10)

Let us compare the clusters that we detected in Python with the clustering results from VOSviewer.

We can summarize the overall similarity between the two partitions using the Normalised Mutual Information (NMI). The NMI varies between 0 and 1, and equals 1 if the two partitions are identical.
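
As a rough illustration of what the NMI measures, the following sketch computes it by hand for two membership lists. It assumes the normalisation in which twice the mutual information is divided by the sum of the two entropies; the details of the igraph implementation may differ. The function `nmi` and the example lists are made up for illustration.

```python
from collections import Counter
from math import log

def nmi(a, b):
    """Normalised mutual information between two membership lists."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))

    def entropy(counts):
        return -sum(c / n * log(c / n) for c in counts.values())

    mutual_info = sum(c / n * log(c * n / (count_a[x] * count_b[y]))
                      for (x, y), c in count_ab.items())
    h_a, h_b = entropy(count_a), entropy(count_b)
    if h_a + h_b == 0:
        return 1.0  # both partitions put all nodes in a single cluster
    return 2 * mutual_info / (h_a + h_b)

# Same grouping with swapped labels: the partitions are identical.
nmi_same = nmi([0, 0, 1, 1], [1, 1, 0, 0])
# Groupings that cut across each other: knowing one says nothing about the other.
nmi_unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
print(nmi_same, nmi_unrelated)
```

Note that the NMI only looks at which nodes are grouped together, so the cluster numbers themselves do not have to match.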

In [ ]:
clusters.compare_to(clustering, method='nmi')

There are some differences between the clustering from VOSviewer and the clusters we detected in Python. These will of course depend strongly on the resolution parameter used for each result. Another important difference is that VOSviewer uses normalized weights by default: it divides the weight of a link by the expected weight of that link, assuming that the total link weight of each node remains the same. This normalization is sometimes referred to as the association strength. We also perform this normalization here.
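
The normalization can be sketched on a toy network in plain Python. The sketch assumes that the expected weight of a link between two nodes equals the product of their total link strengths divided by twice the total link weight. Here the total link strength is computed from the edge list itself, whereas the notebook reads it from the VOSviewer map file. The function `association_strength` and the toy edges are hypothetical.

```python
from collections import defaultdict

def association_strength(edges):
    """Divide each link weight by its expected weight k_i * k_j / (2 * m)."""
    total = sum(w for _, _, w in edges)
    strength = defaultdict(float)  # total link strength per node
    for u, v, w in edges:
        strength[u] += w
        strength[v] += w
    return {(u, v): w / (strength[u] * strength[v] / (2 * total))
            for u, v, w in edges}

# Toy network (hypothetical data): a and b collaborate often, c only rarely.
toy_edges = [('a', 'b', 4.0), ('a', 'c', 1.0), ('b', 'c', 1.0)]
normalized = association_strength(toy_edges)
print(normalized)
```

A normalized weight above 1 means that the link is stronger than expected given how active the two nodes are overall.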

In [ ]:
# Association strength: observed weight divided by the expected weight k_i * k_j / (2m).
total_weight = sum(G_vosviewer.es['weight'])
strength = G_vosviewer.vs['weight<Total link strength>']
G_vosviewer.es['weight_normalized'] = [
    e['weight'] / (strength[e.source] * strength[e.target] / (2 * total_weight))
    for e in G_vosviewer.es]

VOSviewer uses a default resolution of 1 for these normalized weights. If we now detect clusters using these weights, you will see that the results are more closely aligned with the VOSviewer results.

In [ ]:
clusters = leidenalg.find_partition(G_vosviewer, leidenalg.CPMVertexPartition, 
                                       weights='weight_normalized', resolution_parameter=1,
                                       n_iterations=10)

clusters.compare_to(clustering, method='nmi')

Now let us explore cluster detection a bit further.

Vary the resolution_parameter when detecting clusters using the CPM method. What resolution_parameter seems reasonable to you, and why?
In [ ]:
 
Try to find a resolution_parameter such that the network separates into two large clusters (and some remaining small clusters). What is the cause of these two large clusters? (Hint: examine the author names)
In [ ]:
 
Compare the co-authorship network that we created previously in Python to the network created in VOSviewer. What are the differences?
In [ ]:
 

Document-term clustering

We will now use the same type of clustering technique that we used previously in a slightly different way. Instead of clustering an ordinary network, we will cluster a specific type of network, namely a bipartite network. This requires a slightly different (and more complicated) approach. More specifically, we will cluster a document-term network, where documents are linked to terms if those terms appear in a document.

We leave the task of extracting terms to VOSviewer, and simply import the resulting document-term network in Python. At the end of the notebook, you will find instructions on how to extract the document-term network from VOSviewer yourself.

We read two files: (1) the terms.txt file, which simply contains the terms and their id; and (2) the doc-term.txt file, which contains which term occurs in which document. The document id refers to the line number of the WoS files that were read by VOSviewer. We will encounter this later.

In [ ]:
terms_df = pd.read_csv('data-files/vosviewer/terms.txt', sep='\t', index_col='id')
doc_terms_df = pd.read_csv('data-files/vosviewer/doc-term.txt', sep='\t')

In this file, both the documents and the terms are using the same numbers, so that igraph cannot distinguish them (e.g. there is both a document 1 and a term 1). We therefore create separate ids for both the documents and the terms.
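
The problem, and the way the suffixes solve it, can be seen in a tiny example in plain Python. The document-term pairs below are made up for illustration.

```python
# Hypothetical (document id, term id) pairs, using the same numbers for both.
doc_term_pairs = [(1, 1), (1, 2), (2, 1)]

# Without separate ids, document 1 and term 1 collapse into the same node.
raw_nodes = {d for d, _ in doc_term_pairs} | {t for _, t in doc_term_pairs}

# With a suffix, documents and terms get distinct names.
labelled_pairs = [(f'{d}-doc', f'{t}-term') for d, t in doc_term_pairs]
labelled_nodes = ({d for d, _ in labelled_pairs}
                  | {t for _, t in labelled_pairs})

print(len(raw_nodes), len(labelled_nodes))  # 2 4

# The original integer id can be recovered by stripping the suffix again.
original_id = int('12-doc'[:-len('-doc')])
print(original_id)  # 12
```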

In [ ]:
doc_terms_df['document id'] = doc_terms_df['document id'].map(lambda x: str(x) + '-doc');
doc_terms_df['term id'] = doc_terms_df['term id'].map(lambda x: str(x) + '-term');

We can now create the network.

In [ ]:
G_doc_term = ig.Graph.TupleList(
      edges=doc_terms_df.values,
      vertex_name_attr='id',
      directed=False
      )

This is a bipartite network, and we create a vertex attribute to indicate the type of each node: either a doc or a term.

In [ ]:
G_doc_term.vs['type'] = ['doc' if 'doc' in v['id'] else 'term' for v in G_doc_term.vs]

Similar to the co-authorship network, VOSviewer typically normalizes the weights in a network by using the association strength, and we will also use that here.

In [ ]:
G_doc_term.es['weight'] = [2.0*G_doc_term.ecount()/(G_doc_term.vs[e.source].degree()*G_doc_term.vs[e.target].degree()) 
                           for e in G_doc_term.es];

We now employ a small trick in order to do clustering in a bipartite network. We will not explain the full details, but it involves creating two additional networks that contain all nodes but no edges. Please see the leidenalg documentation for a brief explanation of this approach.

In [ ]:
H_docs = G_doc_term.subgraph_edges([], delete_vertices=False);
H_docs.vs['node_sizes'] = [1 if v['type'] == 'doc' else 0 for v in H_docs.vs];

H_terms = G_doc_term.subgraph_edges([], delete_vertices=False);
H_terms.vs['node_sizes'] = [1 if v['type'] == 'term' else 0 for v in H_terms.vs];

In order to make this trick work, we now also have to create three separate partitions as follows. The variable res_param holds the resolution parameter, which plays a similar role as before. A value of around 1 seems to give reasonable results in this case.

In [ ]:
res_param = 1
partition = leidenalg.CPMVertexPartition(G_doc_term, weights='weight', 
                                       resolution_parameter=res_param)

partition_docs = leidenalg.CPMVertexPartition(H_docs, weights='weight', 
                                       node_sizes=H_docs.vs['node_sizes'], 
                                       resolution_parameter=res_param)

partition_terms = leidenalg.CPMVertexPartition(H_terms, weights='weight', 
                                        node_sizes=H_terms.vs['node_sizes'], 
                                        resolution_parameter=res_param)

We are now ready to detect clusters, but we are going to use all three partitions we created. We do so by using the function optimise_partition_multiplex instead of the optimise_partition function that we used previously. We have to pass a list of partitions to that function. For the trick to work, we also need to pass the argument layer_weights=[1,-1,-1], which assumes that the partition is the first element of the list that we pass.
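
Why the layer weights [1,-1,-1] help can be checked with a little arithmetic. The sketch below assumes that the CPM quality of a single cluster is its internal edge weight minus the resolution times the number of possible node pairs, where the cluster size is the sum of the node sizes. Under that assumption, subtracting the two empty layers removes the penalty for document-document and term-term pairs, so that only document-term pairs are penalised. Both function names are hypothetical and this is not the leidenalg implementation.

```python
def possible_pairs(n):
    """Number of node pairs in a cluster of total size n."""
    return n * (n - 1) / 2

def combined_quality(m_c, n_docs, n_terms, resolution):
    """Layer-weighted sum (1, -1, -1) of three CPM qualities for one cluster."""
    q_full = m_c - resolution * possible_pairs(n_docs + n_terms)
    q_docs = 0.0 - resolution * possible_pairs(n_docs)    # no edges, only docs have size 1
    q_terms = 0.0 - resolution * possible_pairs(n_terms)  # no edges, only terms have size 1
    return 1 * q_full - 1 * q_docs - 1 * q_terms

def bipartite_quality(m_c, n_docs, n_terms, resolution):
    """Internal weight minus resolution times the number of doc-term pairs."""
    return m_c - resolution * n_docs * n_terms

# Hypothetical cluster: 4 documents, 3 terms, internal edge weight 9.5.
print(combined_quality(9.5, 4, 3, 1.0), bipartite_quality(9.5, 4, 3, 1.0))
```

Both functions give the same value for any cluster, which is the sense in which the three partitions together behave like a single bipartite quality function.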

In [ ]:
optimiser = leidenalg.Optimiser()
optimiser.set_rng_seed(0)
optimiser.optimise_partition_multiplex(
              [partition, partition_docs, partition_terms],  
              layer_weights=[1,-1,-1], n_iterations=100)

Now partition contains the clustering results (actually, partition_docs and partition_terms contain the identical clustering results). We extract the cluster membership of each node, and make it a new node attribute.

In [ ]:
G_doc_term.vs['cluster'] = partition.membership
G_doc_term.vs['degree'] = G_doc_term.degree();

We will now create a so-called projection of the bipartite graph, which actually simply refers to the creation of a co-occurrence network.
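
To see what such a projection amounts to, here is a minimal sketch in plain Python that counts, for every pair of terms, the number of documents in which both occur. The document-term pairs are made up, and the sketch does not use igraph.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical document-term pairs.
doc_term_pairs = [('1-doc', '1-term'), ('1-doc', '2-term'),
                  ('2-doc', '1-term'), ('2-doc', '2-term'), ('2-doc', '3-term')]

# Collect the set of terms for each document.
terms_by_doc = defaultdict(set)
for doc, term in doc_term_pairs:
    terms_by_doc[doc].add(term)

# Every pair of terms in the same document adds 1 to their co-occurrence.
cooccurrence = Counter()
for terms in terms_by_doc.values():
    for pair in combinations(sorted(terms), 2):
        cooccurrence[pair] += 1

print(dict(cooccurrence))
```

The resulting counts play the role of the edge weights in the term co-occurrence network.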

In [ ]:
G_doc_term.vs['type_int'] = [1 if v['type'] == 'term' else 0 for v in G_doc_term.vs];
G_terms = G_doc_term.bipartite_projection(types='type_int', which=1);
G_terms.simplify(combine_edges='sum');

G_terms.vs['id'] = [int(v['id'][:-5]) for v in G_terms.vs];
G_terms.vs['term'] = [terms_df.loc[v['id'],'term'] for v in G_terms.vs];

Now G_terms contains only terms and the co-occurrence between them. We will export this network to a file format so that we can read it back into VOSviewer. First, let us create the output directory (if necessary).

In [ ]:
import os
output_dir = 'results/'
# Create the output directory; do nothing if it already exists.
os.makedirs(output_dir, exist_ok=True)

Now we export the network G_terms in a file format that VOSviewer can read.

In [ ]:
nodes_df = pd.DataFrame.from_dict({attr: G_terms.vs[attr] for attr in G_terms.vs.attributes()});
nodes_df['label'] = nodes_df['term'];
nodes_df['cluster'] += 1;
nodes_df['weight<Occurence>'] = nodes_df['degree'];
nodes_df = nodes_df.sort_values('id')
nodes_df[['id', 'label', 'cluster', 'weight<Occurence>']].to_csv(output_dir + 'map_vosviewer.txt', sep='\t', index=False);

edge_df = pd.DataFrame([(G_terms.vs[e.source]['id'], G_terms.vs[e.target]['id'], e['weight']) for e in G_terms.es],
                       columns=['source', 'target', 'weight']);
edge_df = edge_df.sort_values(['source', 'target']);
edge_df.to_csv(output_dir + 'network_vosviewer.txt', sep='\t', index=False, header=False);

The great benefit of doing the clustering in Python is that we now also have a clustering of the publications. This is something that is not possible in VOSviewer.

Let us first load the actual publication files which were used by VOSviewer (we have already done this in the previous notebook). As said, the document id refers to the line number of the WoS files that were read by VOSviewer, starting from 1. We therefore also create a document id that is the same.

In [ ]:
import glob
import csv
files = sorted(glob.glob('data-files/wos/tab-delimited/*.txt'))
publications_df = pd.concat(pd.read_csv(f, sep='\t', quoting=csv.QUOTE_NONE, 
                                        usecols=range(68), index_col='UT') for f in files)
publications_df['document id'] = range(1,publications_df.shape[0]+1)

Now let us create a dataframe from G_doc_term with all the information from the documents.

In [ ]:
nodes_df = pd.DataFrame.from_dict({attr: G_doc_term.vs[attr] for attr in G_doc_term.vs.attributes()});
nodes_df = nodes_df[nodes_df['type'] == 'doc'];

Now we need to recover the original integer document id from the identifiers we created (1-doc, 2-doc, and so on). We can then use the document id to merge the results back with the original information from the publications.

In [ ]:
nodes_df['document id'] = nodes_df['id'].str[:-4].astype(int);
publications_df = pd.merge(nodes_df[['document id', 'cluster']], publications_df, 
                           left_on='document id', right_on='document id')

Finally, for further inspection, we may want to export our results to a .csv file.

In [ ]:
publications_df[['AU', 'PY', 'TI', 'SO', 'cluster']].to_csv(output_dir + 'publications_clustering.csv', 
                                                            index=False)

Own analysis

Load your own data in VOSviewer and create a co-citation network of journals.
In [ ]:
 
Detect communities in the journal co-citation network. What do you think the different clusters mean?
In [ ]:
 
Load your own data in VOSviewer and create a term-map. Please take the following steps to create the term-map and extract the terms.txt file and the doc-terms.txt file.
  1. Open VOSviewer and press the button "Create...".
  2. Choose "Create a map based on text data" and press "Next".
  3. Choose "Read data from bibliographic database files" and press "Next".
  4. Choose the "Web of Science" tab and select the files you have downloaded yourself and press "Next".
  5. Choose "Title and abstract fields" (the default) and press "Next".
  6. Choose "Binary counting" (the default) and press "Next".
  7. Leave the default threshold of 10 and press "Next".
  8. Leave the default number of terms to be selected and press "Next".
VOSviewer will now calculate the "relevance" scores. When it is done, you will be shown a list of terms together with their number of occurrences and relevance scores. Please complete the remaining steps.
  1. Right-click on the list of terms and choose "Export selected terms...". Choose an appropriate file name (terms.txt) and directory, and then press "Export".
  2. Right-click on the list of terms and choose "Export document-term relations...". Choose an appropriate file name (doc-terms.txt) and directory, and then press "Export".
Load the terms.txt file and the doc-terms.txt file. Detect the clusters in this bipartite network, as explained above.
In [ ]:
 
Compare the results to the clusters you can detect immediately in VOSviewer itself. Are they similar or not?
In [ ]:
 
Try to identify the main topic for the largest few clusters on the basis of the terms in the term map. Does that match well with the publications in the same cluster? Do you see any discrepancies?
In [ ]: