In this lab exercise, you will learn how to perform scientometric network analysis in Python. We will start with practicalities on some basic data handling and import. We then move on to creating a network and cover some basic analysis. In the next session, we will be using more advanced techniques.
Python is a general-purpose programming language and can be used to handle data of any kind. In this notebook we will specifically deal with scientometric datasets, but you can also use it for other purposes.
We will start by handling some data from a scientometric data source. There are many different possible data sources, and we discussed some of them earlier this week. In this notebook we will focus on data downloaded from Web of Science. We have already downloaded some data for you to demonstrate Python. At the end of the exercise you will be asked to load your own data.
The data that we provided is a selection of publications from 2000-2017 by authors from Belgium in the field of Tropical Medicine.
We start by loading the data. In order to read in the data, we first need to make sure that Python is able to read it. A very versatile package for handling data in Python is called pandas. For those of you familiar with R, it is similar to the data.frame in R.
We import this package as follows, and we call the pandas package pd, for easy reference. We also need the csv package to indicate some options to the pandas package.
You can execute the code by pressing Ctrl-Enter while selecting the code cell below. Alternatively, you can press the "Play" button at the top of the screen, which also moves to the next cell at the same time. Using Shift-Enter instead of Ctrl-Enter will also execute the code and move to the next cell at the same time.
import pandas as pd
import csv
Once the cell has been executed, it is numbered 1. While the code in a cell is being executed it is marked by an asterisk *. Each cell of executed code is numbered in the order in which you execute it. If you execute the cell again, it will be numbered 2, et cetera.
We are now ready to read in the data that you just downloaded. We have named the pandas package pd, which will save us some typing.
publications_df = pd.read_csv('data-files/wos/tab-delimited/savedrecs_0001_0500.txt',
sep='\t', index_col='UT',
quoting=csv.QUOTE_NONE, usecols=range(68))
We called the function read_csv of the pandas package. We provide it with several arguments:

- The location of the file we want to read.
- The second argument is a named argument: we provide both the name of the argument (sep) and its value ('\t'). This indicates the separator between different fields. In this case it is a tab-delimited file, so the fields are separated by tabs, which is indicated by '\t'.
- The third argument is again a named argument. We indicate that the UT field should be the index. This is the unique identifier that WoS uses.
- The two subsequent arguments are needed to correctly handle some peculiarities of WoS files.
The files are located in the directory data-files/wos. At the end of this notebook, you will be asked to download your own data. If you want to load that data instead, use the path to that data.
Note that although Windows uses the backslash \ to separate directories, in Python you can also use the forward slash /, which is usually more convenient for a number of reasons.
The pandas package took care of reading the file, and has now stored it in the variable called publications_df. You can take a closer look at publications_df to see the data that we just read.
publications_df
You will see that the data has quite cryptic column headers. Each line contains information about a single publication, and contains various details, such as the title (TI), abstract (AB), authors (AU), journal title (SO) and cited references (CR). Unfortunately, the documentation of Web of Science is relatively limited, but some explanation can be found here. You can retrieve this information in various ways from the pandas dataframe publications_df. For example, you can list the first five titles as follows:
publications_df.TI[:5]
Here, [:5] indicates that you want the first elements (starting at 0) until (but excluding) 5, so items 0, 1, 2, 3 and 4. This is called a slice of the data. You can also look at the authors for rows 5 until 10 as follows:
publications_df.AU[5:10]
In order to get the last few elements, you can use negative indices. The last element is indicated by -1, the penultimate element by -2, and so on. You can get the journals for the last five publications as follows:
publications_df.SO[-5:]
Alternatively, there are various ways to index the dataframe. For example, to get the title and abstract for the first five elements you can do the following.
publications_df[0:5][['TI', 'AB']]
The notation ['TI', 'AB'] creates a list of elements in Python. We use it here to get multiple columns from the dataframe.
The following does exactly the same:
publications_df[['TI', 'AB']][0:5]
The pandas package automatically determines whether you are trying to get columns or rows. Slices are always assumed to refer to rows.
Try it yourself: get the title (TI), abstract (AB), journal (SO) and publication year (PY) for rows 200-210. Enter your code in the cell below.
You can also access a particular UT directly by using the .loc indexer.
publications_df.loc['WOS:000419235100004', ['TI', 'AU', 'SO', 'PY']]
Until now we have only loaded one file. But we have of course downloaded more files, and we need to load all of them. We can list all files in a directory using the package glob. We first import the package.
import glob
Now, let us get a list of all files in the directory data-files/wos/tab-delimited/.
files = sorted(glob.glob('data-files/wos/tab-delimited/*.txt'))
files
We asked glob for a list of files that end with txt (*.txt) in the directory data-files/wos/tab-delimited. We sorted the list to ensure that we read the files in the correct order. We can now simply pass this list of files to read multiple WoS files.
publications_df = pd.concat(pd.read_csv(f, sep='\t', quoting=csv.QUOTE_NONE,
usecols=range(68), index_col='UT') for f in files)
publications_df = publications_df.sort_index()
publications_df
Try it yourself: take a look at the combined data frame, and see how many rows it has.
The pandas package provides various ways to summarise the data and get a useful overview. For example, you can group by a certain column, and count or sum things. We can, for instance, count the number of articles in each journal included in this dataset:
grouped_by_journal = publications_df.groupby('SO')
grouped_by_journal.size().sort_values(ascending=False)[:10]
We could also ask for the mean publication year of the publications in those journals:
grouped_by_journal['PY'].mean()
Try it yourself: group the publications by publication year (PY) and count the number of papers from each year.
A useful trick is autocompletion using Tab. For example, you can type publications_df., including the ., and then press Tab (make sure the cursor is located after the .). If you then start typing the name of the function you are looking for and press Tab again, Python will automatically complete it as much as possible. This works in general: whenever you press Tab, Python will try to autocomplete whatever you are typing.
One other trick: if you have selected a function and press Shift-Tab you will get documentation of what this function does. You can press the + to find out more.
Ultimately, we would like to use this data to generate scientometric networks. This is not a trivial task, and we will now show how to construct a co-authorship network and a journal-level bibliographic coupling network.
We first load the network analysis package that we will use in this notebook, igraph.
Try it yourself: import igraph and call it ig.
We first build a co-authorship network. We will do this one publication at a time. All combinations of authors that are involved in a publication are co-authors. Let us look at the authors of publication 0.
publications_df['AU'][0]
Note that the authors are all listed in a single string, separated by semicolons (;). We will split this string into a list of strings, where each string represents a single author.
publications_df['AU_split'] = publications_df['AU'].fillna('').str.split('; ')
authors = publications_df['AU_split'][0]
authors
In order to create all possible combinations, we can use a convenient package called itertools. The function combinations can generate all possible combinations of the elements of a list.
import itertools as itr
list(itr.combinations(authors, 2))
Of course, we don't want to do this for a single publication only, but rather for all publications in our dataset. We can do that using the function apply. We supply it with a small function (called a lambda function) that simply takes some input and produces some output. In this case, the input is the list of authors, and the output is the result of itr.combinations(...).
coauthors_per_publication = publications_df['AU_split'].apply(
lambda authors: list(itr.combinations(authors, 2)))
The variable coauthors_per_publication is now a list of lists of co-authors per publication. That is, each element of coauthors_per_publication contains a list of all pairs of co-authors for that publication. So, coauthors_per_publication[0] contains the co-authors we examined previously.
coauthors_per_publication[0]
Let us turn each element of this list into a separate row. This is done using explode in pandas. Publications with only one author have no co-authors, which results in an NA (Not Available) value. We will drop those using dropna.
coauthors = coauthors_per_publication.explode().dropna()
Finally, we can create the actual network as follows:
G_coauthorship = ig.Graph.TupleList(
edges=coauthors.to_list(),
vertex_name_attr='author',
directed=False
)
Note that this graph will still contain many duplicate edges, because each pair of authors appears once for every joint publication. Let us therefore simplify this graph, and simply count the duplicate edges. We first create a so-called edge attribute n_joint_papers. We can create it by using the edge sequence es of the graph. We can then simply sum this attribute when we simplify the graph.
G_coauthorship.es['n_joint_papers'] = 1
G_coauthorship = G_coauthorship.simplify(combine_edges='sum')
Let us see how many authors (i.e. nodes) there are in the network. This is called the vcount (vertex count) in igraph.
G_coauthorship.vcount()
Similarly, the number of edges is available as the ecount of the graph.
G_coauthorship.ecount()
We can do all sorts of analysis on this network. But first, we will create a bibliographic coupling network.
Bibliographic coupling and co-authorship are in a sense very similar. Previously, we computed for each publication the combinations of all co-authors. For bibliographic coupling we can compute for each cited reference the combinations of all citing journals. We will first create a dataframe of all journal citations (SO) of a certain cited reference (CR). Similar to the authors, we first need to split the cited references.
publication_with_cr_df = publications_df.loc[pd.notnull(publications_df['CR']), ['SO', 'CR']]
publication_with_cr_df['CR'] = publication_with_cr_df['CR'].str.split('; ')
We now simply list all citations from a certain journal (SO) to a certain cited reference (CR).
journal_cits_df = publication_with_cr_df[['SO', 'CR']].explode('CR')
We then create all bibliographic couplings per cited reference as follows. We first group by the cited reference (CR) and then take all combinations of citing journals.
bibcoupling_per_cr = journal_cits_df.groupby('CR').apply(lambda x: list(itr.combinations(x['SO'], 2)))
We again explode all combinations of two sources citing the same reference.
bibcouplings = bibcoupling_per_cr.explode().dropna()
We can then create the network.
G_coupling = ig.Graph.TupleList(
edges=bibcouplings,
vertex_name_attr='SO',
directed=False
)
Try it yourself: this network again contains duplicate edges. Create an edge attribute coupling, set it to 1, and then sum this attribute when simplifying the network.
This network should be reasonably sized, and you should be able to visualize it by calling ig.plot.
ig.plot(G_coupling, vertex_label=G_coupling.vs['SO'])
Now that we have created some scientometric networks, let us look at some basic analyses of these networks.
Let us start with a very simple question. Is the co-authorship network connected?
G_coauthorship.is_connected()
Apparently, not all authors in this dataset are connected via co-authored papers.
In order to take a closer look, we need to detect the connected components. This is easily done, although the function is somewhat confusingly called clusters.
components = G_coauthorship.clusters()
We only want the so-called giant component. Use Tab and Shift-Tab to find out more about the possible functions of components.
Let us only look at the giant component.
H = components.giant()
Let us check how many nodes are in the giant component. We can call the function summary.
print(H.summary())
The first line indicates that we have an undirected graph (U) with 7871 nodes and 69928 links. The next line shows the vertex attributes (indicated by the v behind the name of the attribute) and the edge attributes (indicated by the e).
Let us take a closer look at how far authors in this data set are apart from one another. Let us simply take node number 0 (remember, the first node has number 0, not 1) and node number 355.
paths = G_coauthorship.get_shortest_paths(0, 355)
paths
This returns a list of shortest paths between node number 0 and node number 355. In this case, the list contains only one path, so let us select it.
path = paths[0]
path
These numbers probably do not mean that much to you. You can find out more about an individual node by looking at the VertexSequence of igraph, abbreviated as vs. This is a sort of list of all vertices, and is indexed by brackets [ ], similar to lists, instead of parentheses ( ) as we use for functions.
G_coauthorship.vs[0]
The vertex itself is also a type of list (called a dictionary), and you can return just the author name as follows:
G_coauthorship.vs[0]['author']
You can also list multiple vertices at once.
G_coauthorship.vs[[0, 3, 223, 355]]['author']
You can of course also simply pass the variable path
that we constructed earlier.
G_coauthorship.vs[path]['author']
This shows that Osaer collaborated with Geert, who collaborated with Van Marck, who in the end collaborated with Watkins.
You can also get a vertex by searching for the author name. For example, if we want to find 'Van Marck, E' we can use the following.
G_coauthorship.vs.find(author_eq = 'Van Marck, E')
Here author_eq refers to the condition that the vertex attribute author should equal 'Van Marck, E'.
Try it yourself: find the shortest path from 'Van Marck, E' to 'Migchelsen, S'. Who is in between?
We can also let igraph calculate how far apart all nodes are.
path_lengths = G_coauthorship.path_length_hist()
print(path_lengths)
Let us take a closer look at the path between node 0 and node 355 again. Instead of the nodes on the path, we now want to take a closer look at the edges on the path.
epath = G_coauthorship.get_shortest_paths(0, 355, output='epath')
epath
There are three edges on this path, but the numbers themselves are not very informative. They refer to the edges, and similar to the VertexSequence we encountered earlier, there is also an EdgeSequence, abbreviated as es. Let us take a closer look at the number of joint papers that these authors had co-authored.
G_coauthorship.es[epath[0]]['n_joint_papers']
Perhaps there are other paths that connect the two authors with more joint papers? Perhaps we could use the number of joint papers as weights?
epath = G_coauthorship.get_shortest_paths(0, 355, weights='n_joint_papers', output='epath')
epath
We do get a different path, which is actually longer. Let us take a look at the number of joint papers.
G_coauthorship.es[epath[0]]['n_joint_papers']
The total number of joint papers is lower! That is because shortest path means: the path with the lowest sum of the weights. This is clearly not what we want. You should always be aware of this whenever using the concept of the shortest path.
Let us look at whether co-authors of an author also tend to be co-authors among themselves. Let us take a look at the co-authors of author number 0, which are called the neighbors in network terminology.
G_coauthorship.neighborhood(0)
What we actually want to know is whether many of those neighbors are connected. That is, we want to take the subgraph of all authors that have co-authored with author number 0.
H = G_coauthorship.induced_subgraph(G_coauthorship.neighborhood(0))
print(H.summary())
This subgraph only has 4 nodes (including node 0, so it has 3 neighbours) and 6 edges. This is sufficiently small to be easily plotted for visual inspection.
H.vs['color'] = 'red'
H.vs[0]['color'] = 'grey'
ig.plot(H)
We can also ask igraph to calculate the clustering coefficient of node 0 (which is called transitivity in igraph; it is the same concept under a different name).
G_coauthorship.transitivity_local_undirected(0)
You can calculate the average over all nodes using the function transitivity_avglocal_undirected.
Often, people want to identify which nodes seem to be most important in some way in the network. This is often thought of as a type of centrality of a node.
The simplest type of centrality is the degree of a node, which is simply the number of its neighbors. Previously, we saw that node 0 had 3 neighbors; we therefore say its degree is 3.
G_coauthorship.degree(0)
We can also simply calculate the degree for all nodes and store it in a new vertex attribute called degree.
G_coauthorship.vs['degree'] = G_coauthorship.degree()
Try it yourself: what is the degree of 'Van Marck, E'?
We can also take a look at the complete degree distribution. To plot it, we load the matplotlib package. We import the plotting functionality and name it plt. We also include a statement telling Python to show the plots immediately in this notebook.
import matplotlib.pyplot as plt
%matplotlib inline
Now let us plot a histogram of the degree, using 50 bins.
plt.hist(G_coauthorship.vs['degree'], 50);
plt.yscale('log')
This clearly shows that the degree distribution is quite skewed. Most authors have only a few collaborators, while a few authors have many. When the degree distribution is so skewed, the network is sometimes referred to as scale-free, although the exact definition has recently been a topic of intense discussion.
The code below sorts the nodes in descending order of the degree.
highest_degree = sorted(G_coauthorship.vs, key=lambda v: v['degree'], reverse=True)
The sorted function takes a list as input, G_coauthorship.vs in our case, and sorts it according to a sort key. We indicate the sort key with a small function, called a lambda function, that returns the degree. In other words, the sorted function will sort the nodes according to their degree. By indicating reverse=True we obtain a list that is sorted from highest to lowest, instead of the other way around.
You can look at the first five results in the following way.
highest_degree[:5]
So, apparently, U D'Allessandro has collaborated with 715 other authors! This of course only considers the number of co-authors; it does not take into account the number of papers written with somebody else. When specifying edge weights such as the number of joint papers, the weighted degree is referred to as the strength of a node (which is sometimes a bit of a confusing term).
Let us look at the strength of node 0.
G_coauthorship.strength(0, weights='n_joint_papers')
Apparently, author 0 collaborated with 3 different authors, and has a total strength of 3. But what does this 3 mean? We need to think about this carefully. Suppose that author 0 has co-authored a single publication with three other co-authors; then each of the three edges would have a weight of n_joint_papers = 1, so the strength would be 3. Hence, the strength denotes the total number of collaborations that an author had, which depends both on the number of publications and on the number of collaborators per paper.
Sometimes, we wish to take the number of co-authors into account when creating a link weight. We can then fractionally count the weight of each collaboration between $n_a$ authors as

$$\frac{1}{n_a - 1}.$$

We need to go back to the publications_df in order to construct such a fractional edge weight.
import itertools as itr
[(coauthor[0], coauthor[1], 1/(len(authors) - 1)) for coauthor in itr.combinations(authors, 2)]
We again do this for all publications.
coauthors_per_publication = publications_df['AU_split'].apply(
lambda authors:
[(coauthor[0], coauthor[1], 1, 1/(len(authors) - 1))
for coauthor in itr.combinations(authors, 2)])
The variable coauthors_per_publication is again a list of lists of co-authors per publication, but now including a full weight of 1 and a fractional weight of 1/(len(authors) - 1), where len(authors) is the number of authors of the publication. We again explode this list.
coauthors = coauthors_per_publication.explode().dropna()
We can again create the network, but now we pass two edge attributes, n_joint_papers and n_joint_papers_frac. We of course also have to simplify the network again.
G_coauthorship = ig.Graph.TupleList(
edges=coauthors.to_list(),
vertex_name_attr='author',
directed=False,
edge_attrs=('n_joint_papers', 'n_joint_papers_frac')
)
G_coauthorship = G_coauthorship.simplify(loops=False, combine_edges='sum')
Try it yourself: calculate the strength of a node using the weight n_joint_papers_frac. Shouldn't the fractional weights sum up to a whole number over all co-authors? Why isn't that the case here? (Hint: look at the authors of publication 'WOS:000242241600004'.)
publications_df.loc['WOS:000242241600004', 'AU']
Betweenness centrality is much more elaborate, and gives an indication of the number of times a node is on the shortest path from one node to another node.
As you can imagine, this can take quite some time to calculate for all nodes. We will therefore use the somewhat smaller bibliographic coupling network of journals.
G_coupling.vs['betweenness'] = G_coupling.betweenness()
Now we can look at the journals that have the highest betweenness.
sorted(G_coupling.vs, key=lambda v: v['betweenness'], reverse=True)[:5]
As we did previously when dealing with shortest paths, we can also use a weight for determining the shortest paths.
G_coupling.vs['betweenness_weighted'] = G_coupling.betweenness(weights='coupling')
One way of identifying central nodes relies on the idea of a random walk in a network. We will study this in the journal bibliographic coupling network. When performing such a random walk, we simply go from one journal to the next, following the bibliographic coupling links. The journal that is most frequently visited during such a random walk is then seen as most central. This is actually the idea that underlies Google's famous search engine. Luckily, we can compute that a lot faster than betweenness.
G_coupling.vs['pagerank'] = G_coupling.pagerank()
We can again take the weights into account. In PageRank this means that a journal that is more closely bibliographically coupled to another will be more likely to be visited during a random walk. This is actually much more in line with our intuition than the shortest path. Let us see what we get if we do that.
G_coupling.vs['pagerank_weighted'] = G_coupling.pagerank(weights='coupling')
We can also create a co-authorship network using a more theoretical approach from graph theory. We first construct a network consisting of publications and authors.
We first again explode all authors for each publication, and create a graph from it.
author_pubs_df = publications_df['AU_split'].explode()
G_pub_authors = ig.Graph.TupleList(
edges=author_pubs_df.reset_index().values,
vertex_name_attr='name',
directed=False
)
This network consists of two types of nodes: publications and authors. This is called a bipartite graph. We can automatically get the types using is_bipartite.
is_bipartite, types = G_pub_authors.is_bipartite(return_types = True)
print(is_bipartite)
The actual types are simply returned as a list of True and False values, which are arbitrary labels for publications and authors. Let us see what the first label stands for.
print(types[0])
print(G_pub_authors.vs[0])
From the name of node 0 we can see that it refers to a publication, and so False indicates publications, while True indicates authors.
We now would like to perform a so-called bipartite projection onto the authors. This is exactly the type of operation that leads to a co-authorship network. If we were to project onto the publications instead, we would end up with a network of publications where two publications are linked if they share an author.
G_author_projection = G_pub_authors.bipartite_projection(types=types, which=True)
By default, it keeps track of the multiplicity (i.e. the number of joint papers) in the weight edge attribute. Unfortunately, it is not possible to do fractional counting using this approach.
Try it yourself: how does this projection compare to the co-authorship network G_coauthorship that we built earlier? (Hint: check out the degree.)
You have now learned the basics of handling WoS files and transforming them into scientometric networks. Please take some time now to do your own analysis.
Load the data from all files you downloaded using pandas.