In [1]:

```
# install.packages("igraph")
# install.packages("tidygraph")
```

Raphael Leung

The Indexing and Data Management Section (IDMS) in the House of Commons Library keeps a taxonomy/ controlled vocabulary of subject topics. Parliamentary stuff is regularly tagged against this taxonomy.

We want the taxonomy to more closely adhere to Simple Knowledge Organization System (SKOS). SKOS is an RDF-based model for simple knowledge structures like thesauri/ controlled vocabularies. It lets you organize your concepts into informal hierarchies and association networks, document and annotate them, link them to other concepts, group them into collections..., importantly, in a way that let's others outside your domain know what you're doing (because SKOS's a W3C recommendation). SKOS can be extended, too (i.e. “Not take it or leave it, but take what you want, make what you need".) Other open linked data vocabularies you may have heard of include Dublin Core for metadata terms.

If you're new to semantic web technologies, linked data vocabularies basically encourage vocabulary re-use across the Web, so you have a web of broader and narrower concepts. Core idea is to enable wider re-use and better interoperability.

Anyway, there're these taxonomies that have been maintained for a long time internally but not closely analyzed. They are used for indexing. They evolve and are manually curated. They have hierarchies (e.g. broader, narrower terms). Would be nice to have a closer look...

This uses `tidyraph`

in R to do so. It has `igraph`

under the hood combined with `dplyr`

syntax.

First, fetch the terms from http://lda.data.parliament.uk/terms and load in the csv.

In [5]:

```
terms_labels <- read.csv("terms_labels.csv", header=TRUE)
head(terms_labels)
```

This uses `dplyr`

(with piping syntax imported from `magrittr`

) in tidyverse which mostly makes nested function calls look cleaner and more readable.

In [6]:

```
library(dplyr)
terms_labels$uri %<>% as.character() %>% strsplit("/", fixed=TRUE) %>% sapply(tail, 1)
terms_labels$narrower...uri %<>% as.character() %>% strsplit("/", fixed=TRUE) %>% sapply(tail, 1) %>% sapply(function(x) if(identical(x,character(0))) NA else x)
terms_labels$broader...uri %<>% as.character() %>% strsplit("/", fixed=TRUE) %>% sapply(tail, 1) %>% sapply(function(x) if(identical(x,character(0))) NA else x)
terms_labels %<>% select(uri,broader...uri,narrower...uri,pref.label,class) %>% rename(ID=uri,broader_ID=broader...uri,narrower_ID=narrower...uri,idmsClass=class) %>% transform(ID = as.numeric(ID), broader_ID = as.numeric(broader_ID), narrower_ID = as.numeric(narrower_ID), pref.label = as.character(pref.label), idmsClass = as.character(idmsClass))
head(terms_labels)
```

I should note that, natively, the data exposed on lda.data.parliament.uk is in RDF format, a type of graph. However, there aren't many good packages that loads RDF data into R. (There's `SPARQL`

, `redland`

, `rrdf`

R packages which are maintained to various degrees.) There is a hacky way to run SPARQL on the site but it's not an actively developed-on service and large payloads often throw up errors. The terms data was thus exported into relational/ tabular shape as a CSV, and after cleaning, looks like it does above. So it's worth noting that `tidygraph`

graph objects more closely follow the property graph model (nodes and edges have properties) rather than RDF.

The tabular data above isn't yet a graph data structure. Common graph data structures/ representations include edgelists, adjacency/incidence matrices, and adjacency lists.

The next step converts it to an edgelist. Cocept A is broader than Concept B is represented by an edge from A to B.

If you're into network analysis, the key things to note are the graph is directed/ asymmetric, unweighted, and one-mode. ("One-mode" because both "from" and "to" are the same type of actor, here concepts. For a less simple graph, can also treat IDMS-defined classes like CTP, TPG, ORG etc as an edge property. Since concepts are shared between classes, this would make the graph multiplex, i.e. multiple relations between the same actors, instead of uniplex. Since the taxonomies evolve, we can also bring in time to make this a dynamic/ time-dependent instead of cross-sectional. But let's start with the simplest form.)

In [7]:

```
terms_labels_subclass <- terms_labels %>% filter(idmsClass=="TPG")
broader <- terms_labels_subclass %>% subset(select=c("broader_ID", "ID")) %>% na.omit() %>% unique() %>% rename(from=broader_ID,to=ID)
narrower <- terms_labels_subclass %>% subset(select=c("ID", "narrower_ID")) %>% na.omit() %>% unique() %>% rename(from=ID,to=narrower_ID)
edgelist <- broader %>% rbind(narrower) %>% unique()
head(edgelist)
```

As you can see, I subset the data to just the smallest and most intuitive taxonomy, TPG (which stands for "topic page"...) If you're interested in the story, these topics can be traced to the current website at http://parliament.uk/topics/topical-issues.htm, though that has been a graveyard of weblinks for the past few years and will eventually be replaced on https://beta.parliament.uk.

We'll look at all of the IDMS-defined classes at the end.

`tidygraph`

¶`tidygraph`

is a fairly new entry into the tidyverse in 2017. So it has `dplyr`

verbs and `magrittr`

pipes. `tidygraph`

adds more graph algorithms and visualizations to its underlying `igraph`

, which is an open-source network analysis package available in R, Python and C, docs here.

In [10]:

```
library(tidygraph)
library(igraph)
g <- edgelist %>% as_tbl_graph(directed=TRUE)
g
print(paste0("# of edges: ", gsize(g)," | # of nodes: ", gorder(g), " | # of weakly connected components: ", length(decompose(g, mode="weak"))))
```

In [11]:

```
options(repr.plot.width=10, repr.plot.height=10)
g %>% plot(layout=layout.fruchterman.reingold, vertex.size=degree(g)*0.6, vertex.label.color="darkblue", vertex.label.cex=0.6, vertex.color="orange", edge.arrow.size=0.1, edge.width=0.5, edge.color="gray")
```

`tidygraph`

lets you add properties to nodes and edges with `mutate`

, the `dplyr`

verb to create new columns. E.g. `g %>% activate(nodes) %>% mutate(name = as.numeric(name))`

activates the nodes and turn them from characters to numerics. This blog gives a great overview of functionality of the package in addition to the docs.

Now here's the exciting part -- graph traversal! Let's make an example graph.

(Less excitingly, `nbviewer`

currently has trouble rendering image/svg+xml 😐 so we'll have to make do with a static image)

In [12]:

```
options(repr.plot.width=5, repr.plot.height=4)
example <- create_tree(6, children=3) %>% mutate(name = letters[1:6])
# example %>% plot(layout=layout.reingold.tilford,vertex.label.cex=1, vertex.size=40)
library(IRdisplay)
display_png(file="igraphTraversalExample.PNG")
```

There are two main ways of traversing it: breadth-first search (BFS) and depth-first search (DFS). Both require specifying an entry point or the root node.

- BFS: Go a level at a time, until all nodes in that level have been reached, then continue to next level.
- a → b → c → d → e → f

- DFS: Go as deep as possible along each branch then backtracking upwards
- a → b → e → f → c → d (preorder; there are other orderings)

BFS uses a queue and DFS uses a stack. A queue is horizontal, like a restaurant queue, it's first-in-first-out/ last-in-last-out (FIFO/ LILO). Main operations are enqueue and dequeue. A stack is vertical, like a stack of food trays, it's first-in-last-out/ last-in-first-out (FILO/ LIFO). Main operations are push and pop. This IDEA assembly instructions illustrates it in a fun way...

You can implement queue and stack functionalities with e.g linked lists.

The relative complexity of both DFS and BFS is $O(E + V)$, because you vist every edge and node in worst case. So whichever larger term dominates. If graphs's really dense, $|E|$ can be as high as $|V|^2$, so $E$ dominates and complexity is $O(E)$. For a sparse graph where $E$<$V$ then complexity is $O(V)$.

In [13]:

```
example %>% mutate(rank = map_bfs_int(node_is_root(), .f = function(rank, ...) {
rank
}))
```

In [14]:

```
example %>% mutate(rank = map_dfs_int(node_is_root(), .f = function(rank, ...) {
rank
}))
```

We can see the orders in which the nodes are visited do indeed adhere to the explanation above.

Now that we know what BFS and DFS is, let's return to the indexing taxonomy.

One thing we can use the traversal algorithms for is to get the depth of each node. Depth of a node = the number of edges from the node to the tree's root node. Depth is generally useful to obtain because, e.g. here, we can then know how far down the taxonomy an indexer went to pick the subject tag.

Instead of returning the depth per node, can also do calculations/ return metrics of centrality/ finding paths, etc. There're many, many applications of traversal algorithms!

Here, we can add a property called "depth" to the nodes, calling `bfs_dist()`

and passing in a `tbl_graph`

object and specifying its root. `bfs_dist()`

gets the number of nodes between the root and each node in a bredth ﬁrst search.

But, how do we find the entry point/ root of the graph?

The root can be determined by passing all nodes into `neighbors()`

function in igraph, to check which node has zero incoming edges, which in a one-component graph, makes it the root node.

In [15]:

```
which(V(g) %>% sapply(function(x) neighbors(g,x,mode="in")) %>% sapply(length) == 0)
```

In [16]:

```
g %>% mutate(depth = bfs_dist(
which(V(g) %>% sapply(function(x) neighbors(g,x,mode="in")) %>% sapply(length) == 0)
))
```

But, that isn't very efficient -- there are 2 `sapply`

s (applies that return vectors).

The following tries to do the same in a more efficient way...

In [17]:

```
g %>% mutate(neighborhood_in_edges = map_local_dbl(mode="in", .f = function(neighborhood, ...) {
gsize(neighborhood)
})) %>% mutate(depth = bfs_dist(which(neighborhood_in_edges==0))) %>% arrange(neighborhood_in_edges)
```

For hierarchical graphs like taxonomies, a good visual check is plotting the graph in Reingold & Tilford’s Tidier layout.

In [18]:

```
options(repr.plot.width=10, repr.plot.height=5)
par(mfrow = c(1, 2))
g %>% plot(layout=layout.kamada.kawai,vertex.size=degree(g)*1.2, vertex.label.color="darkblue", vertex.label.cex=0.6, vertex.color="orange", edge.arrow.size=0.1, edge.width=0.5, edge.color="gray")
mtext("Kamada Kawai layout")
g %>% plot(layout=layout.reingold.tilford, vertex.size=10, vertex.label.color="darkblue", vertex.label.cex=0.6, vertex.color="orange", edge.arrow.size=0.2, edge.width=1, edge.color="darkgray")
mtext("Reingold & Tilford’s Tidier layout")
```