Introduction

In this lab exercise, you will learn how to perform scientometric network analysis in Python. We will start with practicalities on some basic data handling and import. We then move on to creating a network and cover some basic analysis. In the next session, we will be using more advanced techniques.

This Python notebook is intended to be used as an exercise. We have filled in many details for you, but in some places we will ask you to fill in the blanks. Exercises where you are asked to do something, or to think about something, will be indicated like this. If you need to write and execute your own code, we provide an empty cell below the exercise to do so.
If you need any help with anything, please don't hesitate to ask your teachers.

Data handling

Python is a general purpose programming language and it can be used to handle data in general. In this notebook we will specifically deal with scientometric datasets, but you can also use it for other purposes.

We will start by handling some data from a scientometric data source. There are many different possible data sources, and we discussed some of them earlier this week. In this notebook we will focus on data downloaded from Web of Science. We have already downloaded some data for you to demonstrate Python. At the end of the exercise you will be asked to load your own data.

The data that we provide is a selection of Tropical Medicine publications by authors from Belgium, published between 2000 and 2017.

Note: You cannot load your own data when you run this notebook online using Binder.

We start by loading the data. To do so, we first need a package that can read it. A very versatile package for handling data in Python is called pandas. For those of you familiar with R, its central data structure, the DataFrame, is similar to the data.frame in R.

We import this package as follows, and we call the pandas package pd, for easy reference. We also need the csv package to indicate some options to the pandas package.

In order to execute the code you have to press Ctrl-Enter while selecting the code cell below. Alternatively, you can press the "Play" button at the top of the screen. This also moves to the next cell at the same time. Using Shift-Enter instead of Ctrl-Enter will also execute the code and move to the next cell at the same time.
In [ ]:
import pandas as pd
import csv
If you have executed that code cell correctly, it should now be numbered 1. While the code in a cell is being executed it is marked by an asterisk *. Each cell of executed code will be numbered in the order in which you execute it. If you execute it again, it will be numbered 2, et cetera.

We are now ready to read in the data that we provided. We have named the pandas package pd, which will save us some typing.

In [ ]:
publications_df = pd.read_csv('data-files/wos/tab-delimited/savedrecs_0001_0500.txt', 
                              sep='\t', index_col='UT',
                              quoting=csv.QUOTE_NONE, usecols=range(68))

We called the function read_csv of the pandas package. We provide it with several arguments.

  1. The location of the file we want to read.

  2. The second argument is a named argument: we provide both the name of the argument (sep) and its value ('\t'). This indicates the separator between different fields. In this case it is a tab-delimited file, so the fields are separated by tabs, which is indicated by '\t'.

  3. The third argument is again a named argument. We indicate that the UT field should be the index. This is the unique identifier that WoS uses.

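The two subsequent arguments are needed to correctly handle some peculiarities of WoS files: quoting=csv.QUOTE_NONE tells pandas to treat quotation marks as ordinary characters, and usecols=range(68) reads only the first 68 columns.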
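
To see what these arguments do, here is a minimal sketch on a made-up extract with three columns. The identifiers and titles are invented for illustration and do not come from the dataset. The text is read from a string instead of a file, so the example runs on its own.

```python
import csv
import io

import pandas as pd

# Hypothetical tab-delimited text in the style of a WoS export.
# Note the stray double quote in the first title.
raw = (
    'UT\tTI\tPY\n'
    'WOS:A\tThe "hidden burden of malaria\t2001\n'
    'WOS:B\tSleeping sickness revisited\t2005\n'
)

toy_df = pd.read_csv(
    io.StringIO(raw),        # any file-like object works in place of a path
    sep='\t',                # fields are separated by tabs
    index_col='UT',          # use the identifier column as the row index
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    usecols=range(3),        # keep only the first three columns
)

print(toy_df)
```

Because quoting is switched off, the unmatched quotation mark in the first title is kept as a normal character and both records are read as separate rows.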

We downloaded some example files for you, which are located in the folder data-files/wos. At the end of this notebook, you will be asked to download your own data. If you want to load that data instead, use the path to that data.
Note: Windows usually uses backslashes \ to separate directories. In Python you can also use the forward slash /, which is usually more convenient because the backslash has a special meaning inside Python strings (it starts an escape sequence such as \t).

The pandas package took care of reading the file, and has now stored it in the variable called publications_df. You can take a closer look at publications_df to see the data that we just read.

In [ ]:
publications_df

You will see that the data has quite cryptic column headers. Each line contains information about a single publication, and contains various details, such as the title (TI), abstract (AB), authors (AU), journal title (SO) and cited references (CR). Unfortunately, the documentation of Web of Science is relatively limited, but some explanation can be found here. You can retrieve this information in various ways from the pandas dataframe publications_df. For example, you can list the first five titles as follows:

In [ ]:
publications_df.TI[:5]

Here, [:5] indicates that you want the first elements (starting at 0) until (but excluding) 5, so item 0, 1, 2, 3 and 4. This is called a slice of the data. You can also look at authors for rows 5 until 10 as follows:

In [ ]:
publications_df.AU[5:10]

In order to get the last few elements, you can use negative indices. The last element is indicated by -1, the penultimate element is indicated by -2, and so on. You can get the journals of the last five publications as follows:

In [ ]:
publications_df.SO[-5:]
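
The slice notation is not specific to pandas. As a small sketch, here are the same kinds of slices on a plain Python list (the list of years is made up for illustration):

```python
# A plain Python list standing in for one column of the dataframe.
toy_years = [2000, 2001, 2002, 2003, 2004, 2005, 2006]

first_three = toy_years[:3]    # items 0, 1 and 2
middle = toy_years[2:5]        # items 2, 3 and 4; the end index is excluded
last_two = toy_years[-2:]      # the final two items
last_item = toy_years[-1]      # negative indices count from the end

print(first_three, middle, last_two, last_item)
```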

Alternatively, there are various ways to index the dataframe. For example, to get the title and abstract for the first five elements you can do the following.

In [ ]:
publications_df[0:5][['TI', 'AB']]

The notation ['TI', 'AB'] creates a list of elements in Python. We now used it to get multiple columns from the dataframe.

The following does exactly the same:

In [ ]:
publications_df[['TI', 'AB']][0:5]

The pandas package automatically determines whether you try to get columns or rows. Slices are always assumed to refer to rows.

Show the title (TI), abstract (AB), journal (SO) and publication year (PY) for rows 200-210.
To start typing in the cell below, select the cell using the mouse, or select it using the arrows on the keyboard and press Enter
In [ ]:
 

You can also access a particular UT directly by using the .loc indexer.

In [ ]:
publications_df.loc['WOS:000419235100004', ['TI', 'AU', 'SO', 'PY']]

Reading multiple files

Until now we have only loaded one file. But we have of course downloaded more files, and we need to load all of them. We can list all files in a directory using the package glob. We first import the package.

In [ ]:
import glob

Now, let us get a list of all files in the directory data-files/wos/tab-delimited/.

In [ ]:
files = sorted(glob.glob('data-files/wos/tab-delimited/*.txt'))
files

We asked glob for a list of files that end with txt (*.txt) in the directory data-files/wos/tab-delimited. We sorted the list to ensure that we read the files in the correct order. We can now simply pass this list of files to read multiple WoS files.
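
The sketch below shows the same pattern matching and sorting on a temporary folder, so that it can be run without the course data. The file names are hypothetical and only mimic the naming of WoS exports.

```python
import glob
import os
import tempfile

# Create a throwaway directory with a few empty, hypothetical files.
with tempfile.TemporaryDirectory() as folder:
    for name in ['savedrecs_1001_1500.txt',
                 'savedrecs_0001_0500.txt',
                 'savedrecs_0501_1000.txt',
                 'notes.md']:
        open(os.path.join(folder, name), 'w').close()

    # Only names matching the pattern are returned. Sorting makes the
    # order predictable, because glob itself gives no ordering guarantee.
    toy_found = sorted(glob.glob(os.path.join(folder, '*.txt')))
    toy_names = [os.path.basename(path) for path in toy_found]

print(toy_names)
```

Note that the sorting is alphabetical. It matches the numerical order here only because the record numbers in the file names are padded with zeros.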

In [ ]:
publications_df = pd.concat(pd.read_csv(f, sep='\t', quoting=csv.QUOTE_NONE, 
                                        usecols=range(68), index_col='UT') for f in files)
publications_df = publications_df.sort_index()
Now check out the new publications_df data frame, and see how many rows it has.
In [ ]:
 

Data summarisation

The pandas package provides various ways to summarise the data and get a useful overview of the data. For example, you can group by a certain column, and count or sum things. For example, we can count the number of articles in each journal that is included in this dataset:

In [ ]:
grouped_by_journal = publications_df.groupby('SO')
grouped_by_journal.size().sort_values(ascending=False)[:10]

We could also ask the mean publication year of publications in those journals

In [ ]:
grouped_by_journal['PY'].mean()
Group by the year (PY) and count the number of papers from each year.
Now it is time to introduce you to a little trick: you can get a list of all functions and attributes of some variable by simply pressing Tab. For example, you can type publications_df., including the . and then press Tab (make sure the cursor is located after the .). If you then start typing the name of the function you are looking for and press Tab again, the notebook will automatically complete it as far as possible. This works in general: whenever you press Tab, the notebook will try to autocomplete whatever you are typing. One other trick: if you have selected a function and press Shift-Tab you will get documentation of what this function does. You can press the + to find out more.
In [ ]:
 

Network generation

Ultimately, we would like to use this data to generate scientometric networks. This is not a trivial task, and we will now show how to construct a co-authorship network and a journal-level bibliographic coupling network.

We first load the network analysis package that we will use in the notebook, igraph.

Import the package igraph and call it ig.
In [ ]:
 

Co-authorship

We first build a co-authorship network. We will do this one publication at a time. All combinations of authors that are involved in a publication are co-authors. Let us look at the authors of the first publication.

In [ ]:
publications_df['AU'].iloc[0]

Note that the authors are all listed and separated with a semicolon (;). In computer terms, it is now a single string. We will split this string of all authors into a list of strings where each string then represents a single author.

In [ ]:
publications_df['AU_split'] = publications_df['AU'].fillna('').str.split('; ')
In [ ]:
authors = publications_df['AU_split'].iloc[0]
authors
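
As a small sketch of what the split does to a single string, consider the following made-up author field (the names are invented). It also shows what happens to a missing author field that was replaced by an empty string.

```python
# An author string in the style of the AU field (names are invented).
toy_au = 'Peeters, A; Janssens, B; Maes, C'
toy_author_list = toy_au.split('; ')

# A missing author field that was replaced by an empty string.
toy_no_authors = ''.split('; ')

print(toy_author_list)
print(toy_no_authors)
```

Splitting an empty string gives a list with one empty name rather than an empty list. A list with a single element has no pairs, so such publications do not add any edges later on.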

In order to create all possible combinations, we can use a convenient package, called itertools. The function combinations can generate all possible combinations of the elements of a list.

In [ ]:
import itertools as itr
list(itr.combinations(authors, 2))
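
To get a feeling for how quickly the number of pairs grows, here is a short sketch with four made-up authors:

```python
import itertools as itr

toy_authors = ['A', 'B', 'C', 'D']
toy_pairs = list(itr.combinations(toy_authors, 2))

# Each unordered pair appears exactly once: ('A', 'B') but not ('B', 'A').
print(toy_pairs)

# The number of pairs for n authors is n * (n - 1) / 2.
n = len(toy_authors)
print(len(toy_pairs), n * (n - 1) // 2)
```

With four authors there are 6 pairs, but a publication with 100 authors already produces 4950 pairs. A few publications with very long author lists can therefore contribute a large share of all edges.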

Of course, we don't want to do this for a single publication only, but rather for all publications in our dataset. We can do that using the function apply. We can supply it with a small function (called a lambda function) that simply takes some input and produces some output. In this case, the input is the list of authors of a publication, and the output is the result of itr.combinations(...).

In [ ]:
coauthors_per_publication = publications_df['AU_split'].apply(
    lambda authors: list(itr.combinations(authors, 2)))

The variable coauthors_per_publication is now a pandas Series that holds, for each publication, a list of co-author pairs. So, the first element of coauthors_per_publication contains the co-author pairs we examined previously.

In [ ]:
coauthors_per_publication.iloc[0]

Let us "flatten" this list. We can do that as follows:

In [ ]:
coauthors = [coauthor 
                 for coauthors_publication in coauthors_per_publication 
                     for coauthor in coauthors_publication]
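
The nested loop in this kind of list comprehension can be hard to read at first. The sketch below applies it to a small made-up example and compares it with an equivalent spelling based on itertools.

```python
import itertools as itr

# Hypothetical co-author pairs for three publications; the second
# publication has a single author and therefore no pairs.
toy_pairs_per_publication = [
    [('A', 'B'), ('A', 'C'), ('B', 'C')],
    [],
    [('A', 'B')],
]

# The nested comprehension: the outer loop comes first, the inner loop second.
flat_comprehension = [pair
                      for pairs in toy_pairs_per_publication
                      for pair in pairs]

# An equivalent spelling with itertools.
flat_chain = list(itr.chain.from_iterable(toy_pairs_per_publication))

print(flat_chain)
```

Both versions give one long list of pairs, in which the pair ('A', 'B') occurs twice because these two authors share two publications.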

Finally, we can create the actual network as follows

In [ ]:
G_coauthorship = ig.Graph.TupleList(
      edges=coauthors,
      vertex_name_attr='author',
      directed=False
      )

Note that this graph still contains many duplicate edges: two authors who wrote several papers together are connected by one edge for each joint paper. Let us therefore simplify the graph and count how many edges were merged. We first create a so-called edge attribute n_joint_papers with the value 1 for every edge, using the edge sequence es of the graph. We then sum this attribute when we simplify the graph.

In [ ]:
G_coauthorship.es['n_joint_papers'] = 1
G_coauthorship = G_coauthorship.simplify(combine_edges='sum')
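
The result of summing an attribute that equals 1 on every edge is simply a count of the duplicates. The following sketch reproduces that count in plain Python on made-up pairs. It only illustrates the idea and is not how igraph implements simplify.

```python
from collections import Counter

# Hypothetical flattened co-author pairs; A and B share two papers.
toy_edges = [('A', 'B'), ('A', 'C'), ('B', 'C'), ('B', 'A')]

# Sort each pair so that ('B', 'A') and ('A', 'B') count as the same
# undirected edge, then count how often each edge occurs.
toy_joint_papers = Counter(tuple(sorted(edge)) for edge in toy_edges)

print(toy_joint_papers)
```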

Let us see how many authors (i.e. nodes) there are in the network. This is called the vcount (vertex count) in igraph.

In [ ]:
G_coauthorship.vcount()

Similarly, the number of edges is available as the ecount of the graph.

In [ ]:
G_coauthorship.ecount()

We can do all sorts of analysis on this network. But first, we will create a bibliographic coupling network.

Bibliographic coupling

Bibliographic coupling and co-authorship are in a sense very similar. Previously, we computed for each publication all combinations of co-authors. For bibliographic coupling we can compute for each cited reference all combinations of citing journals. We will first create a dataframe of all citations from a journal (SO) to a cited reference (CR). Similar to the authors, we first need to split the cited references.

In [ ]:
publication_with_cr_df = publications_df.loc[pd.notnull(publications_df['CR']), ['SO', 'CR']]
publication_with_cr_df['CR'] = publication_with_cr_df['CR'].str.split('; ')

We now simply list all citations from a certain journal (SO) to a certain cited reference (CR).

In [ ]:
journal_cits = [(row['SO'], cr) 
            for idx, row in publication_with_cr_df.iterrows()
                for cr in row['CR']]
journal_cits_df = pd.DataFrame(journal_cits, columns=('SO', 'CR'))

We then create all bibliographic couplings per cited reference as follows. We first group by the cited reference (CR) and then take all combinations of citing journals.

In [ ]:
bibcoupling_per_cr = journal_cits_df.groupby('CR').apply(lambda x: itr.combinations(x['SO'], 2))

We again "flatten" this list.

In [ ]:
bibcouplings = [coupling
                 for couplings in bibcoupling_per_cr
                   for coupling in couplings]
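
The grouping step is the least obvious part of this construction. Here is a sketch of the same logic in plain Python on a handful of made-up citation records, using a dictionary instead of a pandas groupby.

```python
import itertools as itr
from collections import defaultdict

# Hypothetical (citing journal, cited reference) records.
toy_citations = [
    ('Journal X', 'ref 1'),
    ('Journal Y', 'ref 1'),
    ('Journal Z', 'ref 1'),
    ('Journal X', 'ref 2'),
    ('Journal Y', 'ref 2'),
    ('Journal Z', 'ref 3'),
]

# Group the citing journals by cited reference.
toy_citing = defaultdict(list)
for journal, reference in toy_citations:
    toy_citing[reference].append(journal)

# Every pair of journals citing the same reference is one coupling.
toy_couplings = [pair
                 for journals in toy_citing.values()
                 for pair in itr.combinations(journals, 2)]

print(toy_couplings)
```

Journal X and Journal Y share two cited references, so their pair occurs twice in the list. A reference that is cited only once, like ref 3, does not produce any coupling.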

We can then create the network.

In [ ]:
G_coupling = ig.Graph.TupleList(
      edges=bibcouplings,
      vertex_name_attr='SO',
      directed=False
      )
We again need to simplify this network. Create a new edge attribute called coupling, set it to 1, and then sum this attribute when simplifying the network.
In [ ]:
 

This network should be reasonably sized, and you should be able to visualize this network by calling ig.plot.

In [ ]:
ig.plot(G_coupling, vertex_label=G_coupling.vs['SO'])

Network analysis

Now that we have created some scientometric networks, let us look at some basic analyses of these networks.

Connectivity

Let us start with a very simple question. Is the co-authorship network connected?

In [ ]:
G_coauthorship.is_connected()

Apparently, not all authors in this dataset are connected via co-authored papers.

How many authors do you think will be connected to each other? 500? 5000? Almost everybody?

In order to take a closer look, we need to detect the connected components. This is easily done, but the function is confusingly called clusters (recent versions of python-igraph also offer it under the name connected_components).

In [ ]:
components = G_coauthorship.clusters()

We only want the so-called giant component.

What function do you think returns the giant component?
Remember, you can use Tab and Shift-Tab to find out more about possible functions.
In [ ]:
 

Let us only look at the giant component.

In [ ]:
H = components.giant()

Let us check how many nodes are in the giant component. We can call the function summary.

In [ ]:
print(H.summary())

The first line indicates that we have an undirected graph (U) with 7871 nodes and 69928 links. The next line shows vertex attributes (indicated by the v behind the name of the attribute), and edge attributes (indicated by the e).

  1. What is the percentage of nodes that are in the giant component?
  2. Double check whether the giant component is connected.
In [ ]:
 

Let us take a closer look at how far authors in this data set are apart from one another. Let us simply take a look at node number 0 (remember, the first node has number 0, not 1) and node number 355.

In [ ]:
paths = G_coauthorship.get_shortest_paths(0, 355)
paths

This returns a list of shortest paths, one for each target node. Since we asked for a single target, node number 355, the list contains only one path, so let us select that.

In [ ]:
path = paths[0]
path
How many nodes are in the path? What is the path length?

These numbers probably do not mean that much to you. You can find out more about an individual node by looking at the VertexSequence of igraph, abbreviated as vs. This is a sort of list of all vertices, and is indexed by brackets [ ], similar to lists, instead of parentheses ( ) as we do for functions.

In [ ]:
G_coauthorship.vs[0]

The vertex itself behaves like a dictionary of attributes, and you can return only the author name as follows

In [ ]:
G_coauthorship.vs[0]['author']

You can also list multiple vertices at once.

In [ ]:
G_coauthorship.vs[[0, 3, 223, 355]]['author']

You can of course also simply pass the variable path that we constructed earlier.

In [ ]:
G_coauthorship.vs[path]['author']

This shows that Osaer collaborated with Geert, who collaborated with Van Marck, who in the end collaborated with Watkins.

You can also get the vertex by searching for the author name. For example, if we want to find 'Van Marck, E' we can use the following.

In [ ]:
G_coauthorship.vs.find(author_eq = 'Van Marck, E')

Here author_eq refers to the condition that the vertex attribute author should equal 'Van Marck, E'.

Find the shortest path from 'Van Marck, E' to 'Migchelsen, S'. Who is in between?
In [ ]:
 

We can let igraph also calculate how far apart all nodes are.

The following may take some time to run
In [ ]:
path_lengths = G_coauthorship.path_length_hist()
print(path_lengths)
How far apart are most authors? Do you think most authors are close by? Or do you think they are far apart?

Let us take a closer look at the path between node 0 and node 355 again. Instead of the nodes on the path, we now want to take a closer look at the edges on the path.

In [ ]:
epath = G_coauthorship.get_shortest_paths(0, 355, output='epath')
epath

There are three edges on this path, but the numbers themselves are not very informative. They refer to the edges, and similar to the VertexSequence we encountered earlier, there is also an EdgeSequence, abbreviated as es. Let us take a closer look at the number of joint papers that the authors have co-authored.

In [ ]:
G_coauthorship.es[epath[0]]['n_joint_papers']

Perhaps there are other paths that connect the two authors with more joint papers? Perhaps we could use the number of joint papers as weights?

In [ ]:
epath = G_coauthorship.get_shortest_paths(0, 355, weights='n_joint_papers', output='epath')
epath

We do get a different path, which is actually longer. Let us take a look at the number of joint papers.

In [ ]:
G_coauthorship.es[epath[0]]['n_joint_papers']

The total number of joint papers is lower! That is because shortest path means: the path with the lowest sum of the weights. This is clearly not what we want. You should always be aware of this whenever using the concept of the shortest path.

Attention! Weighted shortest paths have the lowest total weight.
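
The following sketch makes this concrete on a small made-up network. It uses a plain-Python version of Dijkstra's algorithm, written for illustration only. The last step, inverting the weights so that many joint papers mean a short distance, is one common workaround and is an assumption on our part, not something the exercise above prescribes.

```python
import heapq

def lowest_weight_path(graph, source, target):
    """Dijkstra's algorithm: the path with the smallest sum of weights."""
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        total, node, route = heapq.heappop(queue)
        if node == target:
            return total, route
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in graph[node].items():
            if neighbour not in visited:
                heapq.heappush(queue, (total + weight, neighbour, route + [neighbour]))
    return None

# Hypothetical network: edge weights are numbers of joint papers.
toy_graph = {
    'A': {'B': 10, 'C': 1},
    'B': {'A': 10, 'D': 10},
    'C': {'A': 1, 'E': 1},
    'D': {'B': 10, 'E': 1},
    'E': {'C': 1, 'D': 1},
}

# Using joint papers as weights favours the weak ties A-C-E-D.
print(lowest_weight_path(toy_graph, 'A', 'D'))

# Inverting the weights turns strong ties into short distances.
toy_inverted = {node: {other: 1 / weight for other, weight in links.items()}
                for node, links in toy_graph.items()}
print(lowest_weight_path(toy_inverted, 'A', 'D')[1])
```

With the raw weights the route runs over three edges with one joint paper each, for a total of 3. With the inverted weights the route follows the two edges with ten joint papers each.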

Clustering coefficient

Let us look whether co-authors of an author also tend to be co-authors among themselves.

Let us take a look at the co-authors of author number 0, which are called the neighbors in network terminology.

In [ ]:
G_coauthorship.neighborhood(0)

What we actually want to know is whether many of those neighbors are connected. That is, we want to take the subgraph of all authors that have co-authored with author number 0.

In [ ]:
H = G_coauthorship.induced_subgraph(G_coauthorship.neighborhood(0))
print(H.summary())

This subgraph only has 4 nodes (including node 0, so it has 3 neighbours) and 6 edges. This is sufficiently small to be easily plotted for visual inspection.

In [ ]:
H.vs['color'] = 'red'
H.vs[0]['color'] = 'grey'
ig.plot(H)
Do many of the co-authors collaborate among themselves as well? Why do you think this happens?

We can also ask igraph to calculate the clustering coefficient of node 0. In igraph this is called transitivity, which is a different term for the same concept.

In [ ]:
G_coauthorship.transitivity_local_undirected(0)
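
To make the definition concrete, here is a sketch that computes the local clustering coefficient by hand on a small made-up network: it is the fraction of pairs of neighbours that are themselves linked. The function name and the network are hypothetical and only serve as illustration.

```python
import itertools as itr

# Hypothetical co-authorship network stored as sets of neighbours.
toy_neighbours = {
    'A': {'B', 'C', 'D'},
    'B': {'A', 'C'},
    'C': {'A', 'B'},
    'D': {'A'},
}

def local_clustering(neighbours, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    pairs = list(itr.combinations(sorted(neighbours[node]), 2))
    if not pairs:
        return float('nan')  # undefined for nodes with fewer than two neighbours
    linked = sum(1 for u, v in pairs if v in neighbours[u])
    return linked / len(pairs)

print(local_clustering(toy_neighbours, 'A'))
print(local_clustering(toy_neighbours, 'B'))
```

Author A has three co-authors and hence three possible pairs, of which only B and C have written a paper together, so the coefficient is 1/3. Author B has two co-authors who are linked to each other, so the coefficient is 1.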
What percentage of the co-authors of node 0 have also written papers with each other?

Let us now calculate this for all nodes.

In [ ]:
G_coauthorship.transitivity_avglocal_undirected()
What percentage of the co-authors have also written papers with each other on average? Do you think this is high or not?
In [ ]:
 

Centrality

Often, people want to identify which nodes seem to be most important in some way in the network. This is often thought of as a type of centrality of a node.

Degree

The simplest type of centrality is the degree of a node, which is simply the number of its neighbors. Previously, we saw that node 0 had 3 neighbors, so its degree is 3.

In [ ]:
G_coauthorship.degree(0)

We can also simply calculate the degree for everybody and store it in a new vertex attribute called degree.

In [ ]:
G_coauthorship.vs['degree'] = G_coauthorship.degree()
What is the degree of 'Van Marck, E'?
In [ ]:
 

We can also take a look at the complete degree distribution. To plot it, we load the matplotlib package. We import the plotting functionality and name the package plt. We also include a statement telling Python to show the plots immediately in this notebook.

In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline

Now let us plot a histogram of the degree, using 50 bins.

In [ ]:
plt.hist(G_coauthorship.vs['degree'], 50);
plt.yscale('log')

This clearly shows that the degree distribution is quite skewed. Most authors have only few collaborators, while a few authors have many collaborators. If the degree distribution is so skewed, it is sometimes referred to as a scale-free network, although the exact definition has been a topic of intense discussion recently.

The code below sorts the nodes in descending order of the degree.

In [ ]:
highest_degree = sorted(G_coauthorship.vs, key=lambda v: v['degree'], reverse=True)

The sorted function takes a list as input, G_coauthorship.vs in our case, and sorts it according to a sort key. We indicate the sort key by a small function, called a lambda function, that returns the degree. In other words, the sorted function will sort the nodes according to the degree. By indicating reverse=True we obtain a list that is sorted highest to lowest, instead of the other way around.
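
Here is the same sorting pattern as a sketch on a short list of made-up records, where plain dictionaries stand in for the vertices:

```python
# Hypothetical (author, degree) records.
toy_records = [
    {'author': 'A', 'degree': 3},
    {'author': 'B', 'degree': 12},
    {'author': 'C', 'degree': 7},
]

# Sort by the value returned by the key function, largest first.
toy_ranking = sorted(toy_records, key=lambda record: record['degree'], reverse=True)

print([record['author'] for record in toy_ranking])
```

Note that sorted returns a new list and leaves the original list in its original order.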

You can look at the first five results in the following way.

In [ ]:
highest_degree[:5]

So, apparently, U D'Allessandro has collaborated with 695 other authors! This of course only considers the number of co-authors; it does not take into account the number of papers written with each of them. When we do take such edge weights into account, like the number of joint papers, the weighted degree is referred to as the strength of a node (a term that can be a bit confusing).

Let us look at the strength of node 0.

In [ ]:
G_coauthorship.strength(0, weights='n_joint_papers')

Apparently, author 0 collaborated with 3 different authors, and has a total strength of 3. But what does this 3 mean? We need to carefully think about this. Suppose that author 0 has co-authored a single publication with three other co-authors, then each of the three edges would have a weight of n_joint_papers = 1. So, the strength would be 3. Hence, the strength denotes the total number of collaborations that an author had, which depends both on the number of publications and the number of collaborators per paper.

Sometimes, we wish to take into account the number of co-authorships when creating a link weight. We can then fractionally count the weight of each collaboration between $n_a$ authors as

$$\frac{1}{n_a - 1}.$$
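
As a sketch of what this weight does, consider a single made-up publication with four authors. We use exact fractions to avoid rounding; the variable names are only for illustration.

```python
import itertools as itr
from collections import defaultdict
from fractions import Fraction

# One hypothetical publication with four authors.
paper_authors = ['A', 'B', 'C', 'D']
n_a = len(paper_authors)

# Every pair of co-authors receives a weight of 1 / (n_a - 1).
pair_weight = Fraction(1, n_a - 1)

# Add the weight of each pair to the strength of both of its authors.
paper_strength = defaultdict(Fraction)
for first, second in itr.combinations(paper_authors, 2):
    paper_strength[first] += pair_weight
    paper_strength[second] += pair_weight

print(dict(paper_strength))
```

Each author has three co-authors on this paper, and each of those links weighs 1/3, so the paper adds exactly 1 to the strength of every author, regardless of how long the author list is.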

We need to go back to the publications_df in order to construct such a fractional edge weight.

In [ ]:
import itertools as itr
[(coauthor[0], coauthor[1], 1/(len(authors) - 1)) for coauthor in itr.combinations(authors, 2)]

We again do this for all publications.

In [ ]:
coauthors_per_publication = publications_df['AU_split'].apply(
    lambda authors: 
        [(coauthor[0], coauthor[1], 1, 1/(len(authors) - 1)) 
             for coauthor in itr.combinations(authors, 2)])

The variable coauthors_per_publication now again holds a list of co-author pairs for each publication, but each pair now also includes a full weight of 1 and a fractional weight of 1/(len(authors) - 1), where len(authors) is the number of authors of the publication. We again need to flatten this list.

In [ ]:
coauthors = [coauthor 
                 for coauthors_publication in coauthors_per_publication 
                     for coauthor in coauthors_publication]

We can again create the network, but now we can pass two edge attributes, n_joint_papers and n_joint_papers_frac. We of course also have to simplify the network again.

In [ ]:
G_coauthorship = ig.Graph.TupleList(
      edges=coauthors,
      vertex_name_attr='author',
      directed=False,
      edge_attrs=('n_joint_papers', 'n_joint_papers_frac')
      )
G_coauthorship = G_coauthorship.simplify(loops=False, combine_edges='sum')
What is the sum of n_joint_papers_frac over all co-authors? Then shouldn't the strength sum up to a whole number? Why isn't that the case here? (Hint: look at the authors of publication 'WOS:000242241600004'.)
In [ ]:
publications_df.loc['WOS:000242241600004', 'AU']

Betweenness centrality

Betweenness centrality is much more elaborate, and gives an indication of the number of times a node is on the shortest path from one node to another node.

As you can imagine, this can take quite some time to calculate for all nodes. We will therefore use the somewhat smaller bibliographic coupling network of journals.

Note: On larger networks, it may take a long time to calculate the betweenness centrality.
In [ ]:
G_coupling.vs['betweenness'] = G_coupling.betweenness()

Now we can look at the journals that have the highest betweenness.

In [ ]:
sorted(G_coupling.vs, key=lambda v: v['betweenness'], reverse=True)[:5]

As we did previously when dealing with shortest paths, we can also use a weight for determining the shortest paths.

In [ ]:
G_coupling.vs['betweenness_weighted'] = G_coupling.betweenness(weights='coupling')
Which journal has the highest weighted betweenness centrality? Does this make sense if you compare it to the unweighted betweenness centrality?
In [ ]:
 
Attention! Weighted shortest paths have the lowest total weight.

Pagerank

One way of identifying central nodes relies on the idea of a random walk in a network. We will study this in the journal bibliographic coupling network. When performing such a random walk, we simply go from one journal to the next, following the bibliographic coupling links. The journal that is most frequently visited during such a random walk is then seen as most central. This is actually the idea that underlies Google's famous search engine. Luckily, we can compute that a lot faster than betweenness.

In [ ]:
G_coupling.vs['pagerank'] = G_coupling.pagerank()
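
To illustrate the random walk idea, here is a minimal sketch of Pagerank by repeated updating (power iteration) on a tiny made-up network. It is a simplified illustration for an undirected, unweighted graph in which every node has at least one neighbour, not igraph's implementation. The damping factor of 0.85, the probability of following a link rather than jumping to a random node, is the conventional choice and is assumed here.

```python
def pagerank_sketch(neighbours, damping=0.85, iterations=100):
    """Power iteration for Pagerank on an undirected, unweighted graph."""
    nodes = sorted(neighbours)
    n = len(nodes)
    rank = {node: 1 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Probability of arriving here by following a link ...
            arriving = sum(rank[other] / len(neighbours[other])
                           for other in neighbours[node])
            # ... or by jumping to a random node.
            new_rank[node] = (1 - damping) / n + damping * arriving
        rank = new_rank
    return rank

# Hypothetical coupling network: journal 'hub' is linked to all others.
toy_journals = {
    'hub': {'j1', 'j2', 'j3'},
    'j1': {'hub', 'j2'},
    'j2': {'hub', 'j1'},
    'j3': {'hub'},
}
toy_rank = pagerank_sketch(toy_journals)
print(max(toy_rank, key=toy_rank.get))
```

The scores add up to 1, and the journal that is linked to all others is visited most often by the random walk.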
Get the top 5 most central journals according to Pagerank. Who is the most central? Are the results very different from the betweenness?
In [ ]:
 

We can again take into account the weights. In pagerank this means that a journal that is more closely bibliographically coupled will be more likely to be visited during a random walk. This is actually much more in line with our intuition than the shortest path. Let us see what we get if we do that.

In [ ]:
G_coupling.vs['pagerank_weighted'] = G_coupling.pagerank(weights='coupling')
Are the results different for the weighted version of pagerank?
In [ ]:
 
Pagerank is very similar to the techniques that underlie the journal "Eigenfactor" and the "SCImago Journal Rank", which are seen as indicators of the scientific impact of a journal. Do you think it makes sense to interpret Pagerank on a bibliographic coupling network as the scientific impact of a journal? Why (not)?

Analysis of your own data

You have now learned the basics of handling WoS files and transforming them into scientometric networks. Please take some time now to do your own analysis.

Go to Web of Science and select a publication set of interest. Make sure that the number of publications is higher than 1000, but lower than 5000. Export the files as follows:
  1. Export using "Save to Other File Formats".
  2. Select the appropriate records (e.g. 1-500, 501-1000, etc...).
  3. Select the Record Content "Full Record and Cited References".
  4. Select the File Format "Tab delimited (Win, UTF8)".
  5. Click on Send.
Repeat the above steps for each batch of 500 publications. Load the data from all files you downloaded using pandas
In [ ]:
 
Create a co-authorship network of your publications. Hint: use the approach you encountered earlier.
In [ ]:
 
Identify the authors that are most central to the coauthorship network and interpret the results.
In [ ]:
 
Create a co-citation network of your publications. Hint: use the bibliographic coupling approach, but switch the roles of the source and the target.
In [ ]:
 
Identify the publications that are most central to the co-citation network and interpret the results. Are they relatively recent publications or not?
In [ ]:
 
