### Loading credentials from a local file;
### this cell is meant to be deleted before publishing
import yaml

with open("../creds.yml", "r") as ymlfile:
    cfg = yaml.safe_load(ymlfile)

uri = cfg["sonar_creds"]["uri"]
user = cfg["sonar_creds"]["user"]
password = cfg["sonar_creds"]["pass"]
SoNAR (IDH) - HNA Curriculum
Notebook 3: SoNAR (IDH)
This curriculum was created for the SoNAR (IDH) project. At its core, SoNAR (IDH) is a graph-based approach to structuring and linking large amounts of historical data (more on the SoNAR (IDH) project and database can be found in Notebook 3). Therefore, the whole curriculum focuses on graph theory and network analysis.
This notebook provides an introduction to the SoNAR (IDH) database and its underlying Neo4j graph-database technology as well as the Cypher query language which is part of the Neo4j ecosystem.
SoNAR (IDH) is short for Interfaces to Data for Historical Social Network Analysis and Research. The main objective of the project is the examination and evaluation of approaches to build and operate an advanced research technology environment supporting HNA.
SoNAR (IDH) is a research project carried out in collaboration between the following institutions:
One of the main elements of the SoNAR (IDH) project is a Neo4j graph database. This database contains the merged data of multiple archives and libraries. See Chapter 2 for more details about the structure and the contents of the SoNAR (IDH) database.
The SoNAR (IDH) database consists of nodes and edges. Each of the nodes and edges have additional properties that provide rich meta information.
This data description section provides details about the data sources and overall characteristics of the data. The section is based on the state of the SoNAR (IDH) database during February 2021. A diagram of the database schema can be found here.
`SocialRelation` edges, however, are implicit and were derived from `Resource` nodes. The SoNAR (IDH) database has the following aggregated characteristics:
Nodes Summary
Node Type | Node Count |
---|---|
CorpName | 1.487.711 |
GeoName | 308.197 |
MeetName | 814.044 |
PerName | 5.087.660 |
TopicTerm | 212.135 |
UniTitle | 385.300 |
ChronTerm | 537.054 |
IsilTerm | 611 |
Resource | 25.679.240 |
Edges Summary
Edge Type | Edge Count |
---|---|
RelationToPerName | 14.630.465 |
RelationToCorpName | 5.099.190 |
RelationToMeetName | 263.180 |
RelationToUniTitle | 53.998 |
RelationToTopicTerm | 4.951.617 |
RelationToGeoName | 5.140.556 |
RelationToChronTerm | 5.446.841 |
RelationToIsil | 55.556.913 |
RelationToResource | 7.387.400 |
SocialRelation | 40.301.595 |
SoNAR (IDH) combines data from four different data sources. The table below provides a compact overview:
Data Source | Number of Nodes | Number of Edges (incl. RelationToIsilTerm) |
---|---|---|
GND (Integrated Authority File) | 8.295.047 | 32.776.628 |
DNB (German National Library) | 19.384.733 | 5.655.859 |
ZDB (Zeitschriftendatenbank) | 1.908.334 | 43.419.339 |
KPE (Kalliope Union Catalog) | 4.386.173 | 16.678.334 |
SBB (Katalog der Staatsbibliothek zu Berlin) | to be added | to be added |
We will need some specific libraries to work with the SoNAR (IDH) database. Let's start by installing the `neo4j` library.
When you are using the curriculum on Binder or running it as a Docker container locally, the package is already installed. If you want to interact with the SoNAR (IDH) database independently, install the package by running the following line in a new notebook cell:
!pip install neo4j
from neo4j import GraphDatabase
driver = GraphDatabase.driver(uri, auth=(user, password))
With the code above we create a Neo4j driver object. This driver stores the connection details for the database. We can use this driver now to send requests to the database.
Data exploration is usually the very first thing to do when working with new data. So let's start diving into the SoNAR (IDH) database by exploring it.
Whenever we want to retrieve data from the Neo4j database of SoNAR (IDH) we can use a query language called "Cypher Query Language". Cypher provides a comparably easy to comprehend syntax for requesting data from the database. Furthermore, Cypher provides an extensive set of tools for applying graph algorithms, data science methods and data wrangling procedures.
Throughout this curriculum we will use this Cypher Query Language whenever we directly retrieve data from SoNAR (IDH). A more in-depth introduction to Cypher can be found here. More external resources are listed in the Cypher summary chapter.
We start off by requesting the database to return all node labels. Node labels are categories nodes can belong to. You can think of them as entity groups. The SoNAR (IDH) database distinguishes between persons, corporations and more. Let's ask the database to return all the labels available.
with driver.session() as session:
    result = session.run("CALL db.labels()").data()
result
[{'label': 'IsilTerm'}, {'label': 'CorpName'}, {'label': 'GeoName'}, {'label': 'MeetName'}, {'label': 'PerName'}, {'label': 'TopicTerm'}, {'label': 'UniTitle'}, {'label': 'Resource'}, {'label': 'ChronTerm'}]
Code Breakdown:

- The `with` statement is used to make the database call as resource-efficient and concise as possible. The `with` statement has further advantages, but explaining them would exceed the scope of this curriculum; an in-depth explanation of the `with` statement can be found here.
- When we request data from the database, we need to establish a connection (`session`). The `driver` object we created earlier stores the connection details. When we call the method `driver.session()`, we establish a new connection. This connection is assigned to the `session` object for the duration of the `with` block.
- The most relevant part of the code for retrieving the data is `"CALL db.labels()"`. This is the actual Cypher query. The `CALL` clause is used to call the `db.labels()` procedure. More details about Neo4j procedures can be found below.
- The result of this code chunk is a list that contains one key-value pair (`dictionary`) per label in the database.
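For further processing, the returned list of dictionaries can be flattened into a plain Python list of label names. A minimal sketch, using the output shown above as sample data:

```python
# result as returned by session.run("CALL db.labels()").data()
result = [{'label': 'IsilTerm'}, {'label': 'CorpName'}, {'label': 'GeoName'},
          {'label': 'MeetName'}, {'label': 'PerName'}, {'label': 'TopicTerm'},
          {'label': 'UniTitle'}, {'label': 'Resource'}, {'label': 'ChronTerm'}]

# extract the label name from each dictionary
labels = [record["label"] for record in result]
print(labels)
```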
Some useful built-in procedures for exploring and describing the database are listed in the table below. You can get a full list of built-in procedures by using the following query: CALL dbms.procedures()
Procedure | Description |
---|---|
db.labels() |
List all labels in the database. |
db.propertyKeys() |
List all property keys in the database. |
db.relationshipTypes() |
List all relationship types in the database. |
db.schema() |
Show the schema of the data. |
db.stats.retrieve() |
Retrieve statistical data about the current database. Valid sections are 'GRAPH COUNTS', 'TOKENS', 'QUERIES', 'META' |
Now, try one of the other procedures listed in the table above by following the same steps we used for the `db.labels()` call.
You can select nodes by using the `MATCH` statement. Cypher uses ASCII-art style syntax to define nodes, relationships and the direction of relationships in queries.

Nodes are referred to by using parentheses `()`. Inside the parentheses, you can define a node variable. This variable can be used to refer to a specific set of nodes throughout the rest of the query.

The example below matches any kind of node and assigns the variable name n: `(n)`. We use the `LIMIT` statement to tell the database we only want the first 5 results. The number of results can drastically increase the response time of the database, so the `LIMIT` statement is often handy when you want to test a query or suspect too many results.

The `RETURN` statement defines what the database returns after your query has been evaluated. You can be very specific in this statement in case you only want to retrieve certain aspects of the query results.
# define query
query = """
MATCH (n)
RETURN n
LIMIT 5
"""

# send query to database
with driver.session() as session:
    result = session.run(query).data()

# print result
result
[{'n': {'id': 'IsilTermAT_LAW', 'Name': 'AT-LAW'}}, {'n': {'id': 'IsilTermAT_NMW_Z', 'Name': 'AT-NMW-Z'}}, {'n': {'id': 'IsilTermAT_OeNB', 'Name': 'AT-OeNB'}}, {'n': {'id': 'IsilTermAT_UBK', 'Name': 'AT-UBK'}}, {'n': {'id': 'IsilTermAT_WBR', 'Name': 'AT-WBR'}}]
The output above is produced by calling the `.data()` method of the Neo4j Python driver. This method returns the result of our query as a list of dictionaries. This result type is quite versatile, since we can further manipulate the output to our liking by applying filters or transforming the result into different formats (e.g. a Pandas data frame).
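To illustrate such a transformation, the nested `{'n': {...}}` dictionaries can be flattened into plain rows. A quick sketch using the first three ISIL records from the output above as sample data:

```python
# query result as returned by .data(): a list of {'n': {...}} dictionaries
result = [{'n': {'id': 'IsilTermAT_LAW', 'Name': 'AT-LAW'}},
          {'n': {'id': 'IsilTermAT_NMW_Z', 'Name': 'AT-NMW-Z'}},
          {'n': {'id': 'IsilTermAT_OeNB', 'Name': 'AT-OeNB'}}]

# flatten the nested node dictionaries into plain rows; a list of flat
# dictionaries like this can be passed directly to pandas.DataFrame(rows)
rows = [record["n"] for record in result]

names = [row["Name"] for row in rows]
print(names)
```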
""" ... """
for the query to tell Python we are writing a character string over multiple lines. We are doing this, so the query looks tidy and well-structured. You also could write the full query in one line - but this results in bad readability and makes debugging more difficult.
In the next step, we want to apply filters inside the query, so we have control over the nodes we retrieve from the database.
The query below returns just one node of the type `PerName`, without specifying which exact node we want to retrieve.
# define query
query = """
MATCH (n:PerName)
RETURN n
LIMIT 1"""

# send query to database
with driver.session() as session:
    result = session.run(query).data()

# print result
result
[{'n': {'GenType': 'p', 'SpecType': 'piz', 'VariantName': 'Lombez, Ambrosius de;;;La Peirie, Ambroise;;;LaPeirie, Ambroise;;;Ambroise;;;Lombez, Ambroise de;;;LaPeyrie;;;Lombez, Ambrosius von', 'Id': '(DE-588)100000096', 'id': 'Aut100000096', 'Uri': 'http://d-nb.info/gnd/100000096', 'Name': 'Ambrosius'}}]
Filtering Nodes by Properties
Now, let's try to find a specific person. Let's try to find the node of Max Weber, the sociologist and political economist.
We can define a filter based on the properties of a node. The query below only returns nodes that have "Weber, Max" as their `Name` property. The names in SoNAR are based on their GND entry and follow the order `last name, first name`. You can check out GND entries on https://portal.dnb.de/.
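As a small aside on this ordering convention, a tiny helper could convert a natural name into GND order. Note that `to_gnd_order` is our own hypothetical function, not part of SoNAR or the GND tooling, and the naive split shown here ignores name particles such as "von":

```python
def to_gnd_order(full_name: str) -> str:
    """Convert 'First [Middle ...] Last' into GND-style 'Last, First [Middle ...]'.

    Naive sketch: treats the final whitespace-separated token as the last name,
    so particles like 'von' are not handled correctly.
    """
    parts = full_name.split()
    if len(parts) < 2:
        return full_name  # single-token names are left unchanged
    return f"{parts[-1]}, {' '.join(parts[:-1])}"

print(to_gnd_order("Max Weber"))  # -> Weber, Max
```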
We suspect that the name "Weber, Max" is not unique inside the large SoNAR (IDH) database. So we want to check how many Max Webers we can find. For that, we return the count of nodes (`RETURN count(n)`) and not the actual nodes.
query = """
MATCH (n:PerName {Name: 'Weber, Max'})
RETURN count(n)
"""

with driver.session() as session:
    result = session.run(query).data()

result
[{'count(n)': 34}]
In fact, we detected 34 hits in the database. So we need to apply more filters to find the correct Max Weber.
Let's start by checking which properties are available for nodes of type `PerName`:
query = """
MATCH (n:PerName)
WITH LABELS(n) AS labels, KEYS(n) AS keys
UNWIND labels AS label
UNWIND keys AS key
RETURN DISTINCT label, COLLECT(DISTINCT key) AS props
ORDER BY label
"""

with driver.session() as session:
    result = session.run(query).data()

result
[{'label': 'PerName', 'props': ['Uri', 'SpecType', 'VariantName', 'GenType', 'Id', 'id', 'Name', 'DateStrictOriginal', 'DateStrictEnd', 'DateApproxOriginal', 'DateApproxEnd', 'DateStrictBegin', 'Gender', 'DateApproxBegin', 'OldId']}]
Before we take a look at the output, let's talk about the query real quick:

In this query, we use two Cypher list functions (`LABELS()` and `KEYS()`). These functions return a list of the elements they are applied to (`KEYS(n)` returns all property names of the nodes captured in `n` as a list). The `UNWIND` clause is used to expand the created lists back into individual rows. Finally, we match the distinct labels (we only include `PerName` nodes in this query) with a list of distinct properties that belong to `PerName` nodes.
Here you can find the documentation for the applied functions and clauses:
Now, let's take a look at the result:
We can see that there are several date properties for `PerName` nodes. The year of birth is stored in the property called `DateApproxBegin`.
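If you want to pick out the date-related properties programmatically, a quick sketch over the property list returned above:

```python
# property names of PerName nodes, as returned by the query above
props = ['Uri', 'SpecType', 'VariantName', 'GenType', 'Id', 'id', 'Name',
         'DateStrictOriginal', 'DateStrictEnd', 'DateApproxOriginal',
         'DateApproxEnd', 'DateStrictBegin', 'Gender', 'DateApproxBegin', 'OldId']

# keep only the date-related properties
date_props = [p for p in props if p.startswith("Date")]
print(sorted(date_props))
```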
So let's apply a date filter. Let's assume we only know that Max Weber was born in the year 1864, and we want to filter based on this information.
query = """
MATCH (n:PerName)
WHERE n.Name = "Weber, Max" AND n.DateApproxBegin = "1864"
RETURN n
"""

with driver.session() as session:
    result = session.run(query).data()

result
[{'n': {'GenType': 'p', 'DateApproxEnd': '1920', 'DateStrictOriginal': '21.04.1864-14.06.1920', 'SpecType': 'piz', 'VariantName': 'Makesi, Weipei;;;Weber, Karl Emil Maximilian;;;Veber, Maks;;;Veber, M.;;;Weibo, ...;;;Uēbā, Makkusu;;;Wibir, Māks;;;Weibo, Makesi;;;Fībir, Māks;;;Vēbā, Makkusu;;;Ma ke si Wei bo;;;Makesi-Weibo;;;馬克思, 威培;;;فيبر، ماكس;;;マックス・ウェーバー;;;马克斯•韦伯;;;ובר, מקס;;;韦伯, 马克斯', 'DateStrictBegin': '21.04.1864', 'DateStrictEnd': '14.06.1920', 'Gender': '1', 'DateApproxOriginal': '1864-1920', 'Uri': 'http://d-nb.info/gnd/118629743', 'Name': 'Weber, Max', 'DateApproxBegin': '1864', 'Id': '(DE-588)118629743', 'id': 'Aut118629743'}}]
In the query above, we used a `WHERE` clause to apply a filter. You can define multiple conditions inside a filter, e.g. by combining several logical conditions with `AND`, `OR` or `XOR`. See this documentation page for more details.
As a last example, let's assume we only know that the last name of Max Weber is spelled "韦伯" in Chinese, so we need to use this information as a filter.
In the query result above, you can see a node property called `VariantName`. This property stores many alternative variants of the name we are looking for. So let's check how we can query the database by searching within this property using the `CONTAINS` operator (click here for more details):
query = """
MATCH (n:PerName)
WHERE n.VariantName CONTAINS "韦伯"
RETURN n
"""

with driver.session() as session:
    result = session.run(query).data()

result
[{'n': {'GenType': 'p', 'DateApproxEnd': '1920', 'DateStrictOriginal': '21.04.1864-14.06.1920', 'SpecType': 'piz', 'VariantName': 'Makesi, Weipei;;;Weber, Karl Emil Maximilian;;;Veber, Maks;;;Veber, M.;;;Weibo, ...;;;Uēbā, Makkusu;;;Wibir, Māks;;;Weibo, Makesi;;;Fībir, Māks;;;Vēbā, Makkusu;;;Ma ke si Wei bo;;;Makesi-Weibo;;;馬克思, 威培;;;فيبر، ماكس;;;マックス・ウェーバー;;;马克斯•韦伯;;;ובר, מקס;;;韦伯, 马克斯', 'DateStrictBegin': '21.04.1864', 'DateStrictEnd': '14.06.1920', 'Gender': '1', 'DateApproxOriginal': '1864-1920', 'Uri': 'http://d-nb.info/gnd/118629743', 'Name': 'Weber, Max', 'DateApproxBegin': '1864', 'Id': '(DE-588)118629743', 'id': 'Aut118629743'}}, {'n': {'GenType': 'p', 'VariantName': 'Veber, Mattias;;;Weber, Matthias;;;Weibo, Madiyasi;;;Ma di ya si Wei bo;;;Madiyasi-Weibo;;;Ma ti ya si Wei bo;;;Matiyasi-Weibo;;;Weibo, Matiyasi;;;马蒂亚斯•韦伯;;;韦伯, 马蒂亚斯', 'SpecType': 'piz', 'DateApproxBegin': '1967', 'Id': '(DE-588)124003303', 'DateApproxOriginal': '1967-', 'Gender': '1', 'id': 'Aut124003303', 'Uri': 'http://d-nb.info/gnd/124003303', 'Name': 'Weber, Mathias'}}, {'n': {'GenType': 'p', 'SpecType': 'piz', 'VariantName': 'Bi de Wei bo;;;Bide-Weibo;;;Veber, Peter;;;Weibo, Bide;;;韦伯彼得;;;韦伯, 彼得', 'OldId': '(DE-588)1018410120', 'DateApproxBegin': '1968', 'Gender': '1', 'Id': '(DE-588)124253679', 'DateApproxOriginal': '1968-', 'id': 'Aut124253679', 'Uri': 'http://d-nb.info/gnd/124253679', 'Name': 'Weber, Peter'}}]
Each node has an `Id` property. The `Id` property is a combination of the ISIL (International Standard Identifier for Libraries and Related Organisations) and the GND-ID. The `Id` of Max Weber is `(DE-588)118629743`: `DE-588` is the ISIL code of the GND (Gemeinsame Normdatei) and `118629743` is the GND-ID of Max Weber.
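This `(ISIL)identifier` layout can be taken apart with a small regular expression. A minimal sketch; `split_sonar_id` is our own hypothetical helper, not part of SoNAR:

```python
import re

def split_sonar_id(node_id: str):
    """Split an Id like '(DE-588)118629743' into (ISIL, local identifier).

    Assumes the '(ISIL)identifier' layout of the SoNAR Id property;
    anything else raises a ValueError.
    """
    match = re.fullmatch(r"\((?P<isil>[^)]+)\)(?P<local>.+)", node_id)
    if match is None:
        raise ValueError(f"unexpected Id format: {node_id!r}")
    return match.group("isil"), match.group("local")

print(split_sonar_id("(DE-588)118629743"))  # -> ('DE-588', '118629743')
```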
Similar to node labels, we can retrieve the categories of the relations inside the database. Every relation must have exactly one relationship type. This type defines the kind or category the relation belongs to.
with driver.session() as session:
    result = session.run("CALL db.relationshipTypes()").data()
result
[{'relationshipType': 'RelationToIsilTerm'}, {'relationshipType': 'RelationToTopicTerm'}, {'relationshipType': 'RelationToGeoName'}, {'relationshipType': 'RelationToUniTitle'}, {'relationshipType': 'RelationToCorpName'}, {'relationshipType': 'RelationToChronTerm'}, {'relationshipType': 'RelationToPerName'}, {'relationshipType': 'RelationToMeetName'}, {'relationshipType': 'SocialRelation'}, {'relationshipType': 'RelationToResource'}]
In the section about nodes, we saw that we need to use parentheses `()` to select nodes. When selecting relationships, on the other hand, we need to use brackets `[]` instead.

Additionally, we cannot query for plain relationships on their own; we need to define a pattern in which the relationship must appear in the database.

The simplest relationship pattern we can define is: the relationship must connect any two nodes. In the Cypher query language, this is expressed as:
()-[r]-()
query = """
MATCH ()-[r]-()
RETURN r
LIMIT 5
"""

with driver.session() as session:
    result = session.run(query).data()

result
[{'r': ({}, 'RelationToIsilTerm', {})}, {'r': ({}, 'RelationToIsilTerm', {})}, {'r': ({}, 'RelationToIsilTerm', {})}, {'r': ({}, 'RelationToIsilTerm', {})}, {'r': ({}, 'RelationToIsilTerm', {})}]
You can filter relationships in a similar fashion to how you filter nodes. Let's retrieve relationships of the type `SocialRelation`.
query = """
MATCH ()-[r:SocialRelation]-()
RETURN r
LIMIT 5
"""

with driver.session() as session:
    result = session.run(query).data()

result
[{'r': ({}, 'SocialRelation', {})}, {'r': ({}, 'SocialRelation', {})}, {'r': ({}, 'SocialRelation', {})}, {'r': ({}, 'SocialRelation', {})}, {'r': ({}, 'SocialRelation', {})}]
This result is correct, but the output is not very informative. Let's do some deeper exploration of the relationships.
Filtering Relationships by Properties
Just like nodes, relationships can have properties that provide meta information about the relation. Let's check the properties of the five relationships we retrieved above:
query = """
MATCH p = ()-[r:SocialRelation]-()
UNWIND relationships(p) as rel
RETURN properties(rel) as properties
LIMIT 5
"""

with driver.session() as session:
    result = session.run(query).data()

result
[{'properties': {'TypeAddInfo': 'undirected', 'SourceType': 'associatedRelation', 'Source': 'Bib1072198592'}}, {'properties': {'TypeAddInfo': 'undirected', 'SourceType': 'areCoEditors', 'Source': 'Bib1072198592'}}, {'properties': {'TypeAddInfo': 'undirected', 'SourceType': 'areCoEditors', 'Source': 'Bib1072198592'}}, {'properties': {'TypeAddInfo': 'undirected', 'SourceType': 'areCoEditors', 'Source': 'Bib1072198592'}}, {'properties': {'TypeAddInfo': 'undirected', 'SourceType': 'associatedRelation', 'Source': 'Bib1072198592'}}]
As we can see, the properties of relationships of the type `SocialRelation` have three different elements:

- `TypeAddInfo`: either directed or undirected
- `SourceType`: can take the values associatedRelation, areCoAuthors, areCoEditors, affiliatedRelation, correspondedRelation and knows
- `Source`: the id of the source

`SocialRelation` relationships are derived from `Resource` nodes. The `Source` property of a `SocialRelation` is the `id` of the corresponding `Resource` node.
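Before filtering on `SourceType`, it can be useful to tally which values actually occur in a sample. A quick sketch over the five property dictionaries shown above:

```python
from collections import Counter

# relationship properties as returned by the exploration query above
properties = [
    {'TypeAddInfo': 'undirected', 'SourceType': 'associatedRelation', 'Source': 'Bib1072198592'},
    {'TypeAddInfo': 'undirected', 'SourceType': 'areCoEditors', 'Source': 'Bib1072198592'},
    {'TypeAddInfo': 'undirected', 'SourceType': 'areCoEditors', 'Source': 'Bib1072198592'},
    {'TypeAddInfo': 'undirected', 'SourceType': 'areCoEditors', 'Source': 'Bib1072198592'},
    {'TypeAddInfo': 'undirected', 'SourceType': 'associatedRelation', 'Source': 'Bib1072198592'},
]

# tally how often each SourceType occurs in the sample
source_type_counts = Counter(p["SourceType"] for p in properties)
print(source_type_counts)
```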
Let's use the properties to filter out people that are connected to each other because they had a correspondence with each other.
# in the RETURN clause we define specifically which elements
# we want to retrieve; this way the output is easier to read
query = """
MATCH (n1:PerName)-[r:SocialRelation]-(n2:PerName)
WHERE r.SourceType = "correspondedRelation"
RETURN n1.Name, n2.Name, r.SourceType, r.TypeAddInfo
LIMIT 5
"""

with driver.session() as session:
    result = session.run(query).data()

result
[{'n1.Name': 'Vacchiery, Karl Albrecht von', 'n2.Name': 'Oefele, Andreas Felix von', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Vacchiery, Karl Albrecht von', 'n2.Name': 'Oefele, Andreas Felix von', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Plotho, Erich Christoph von', 'n2.Name': 'Maria Anna', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Plotho, Erich Christoph von', 'n2.Name': 'Gerstenberg, Heinrich Wilhelm von', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Plotho, Erich Christoph von', 'n2.Name': 'Fresenius, Johann Philipp', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}]
We can see that all of these relationships have a `TypeAddInfo` of directed. Relationships can be directed or undirected. In the SoNAR (IDH) database, all correspondences are directed and therefore capture whether someone contacted or was contacted by someone else.

Let's see who received letters from Max Weber. The query below extends the basic `()-[]-()` structure for representing a node-relationship search pattern with a `>`. This arrow specifies that we are only searching for directed relationships. So the new pattern scaffolding is `()-[]->()`.
query = """
MATCH (n1:PerName)-[r:SocialRelation]->(n2:PerName)
WHERE n1.Name = "Weber, Max" AND n1.DateApproxBegin = "1864"
AND r.SourceType = "correspondedRelation"
RETURN n1.Name, n2.Name, r.SourceType, r.TypeAddInfo
"""

with driver.session() as session:
    result = session.run(query).data()

result
[{'n1.Name': 'Weber, Max', 'n2.Name': 'Tönnies, Ferdinand', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Schiele, Friedrich Michael', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Fuchs, Carl Johannes', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Rade, Martin', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Rickert, Sophie', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Rickert, Heinrich', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Rickert, Sophie', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Radbruch, Gustav', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Schröder, Richard', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Radbruch, Gustav', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Rickert, Sophie', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Radbruch, Gustav', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Rickert, Heinrich', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Rickert, Sophie', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Rickert, Heinrich', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Rickert, Heinrich', 'r.SourceType': 
'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Radbruch, Gustav', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Rickert, Heinrich', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Rickert, Heinrich', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Radbruch, Gustav', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Rickert, Heinrich', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Rickert, Heinrich', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Rickert, Heinrich', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Rickert, Heinrich', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bezold, Carl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Hampe, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Fischer, Kuno', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Hampe, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Boll, Franz', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Hettner, Alfred', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Michels, Robert', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 
'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Koch, Adolf', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Koch, Adolf', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 
'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Philippovich, Eugen von', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Mayer-Pfannholz, Anton', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Amira, Karl von', 'r.SourceType': 
'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Amira, Karl von', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Amira, Karl von', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Deissmann, Adolf', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Lukács, Georg', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Susman, Margarete', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Wolfskehl, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Jaspers, Karl', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Jaffe, Else', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Ernst, Paul', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}, {'n1.Name': 'Weber, Max', 'n2.Name': 'Diederichs, Eugen', 'r.SourceType': 'correspondedRelation', 'r.TypeAddInfo': 'directed'}]
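Each letter produces its own row, so the same sender/receiver pair can appear many times in the output above. The duplicates can be collapsed on the Python side (alternatively, `RETURN DISTINCT` in the Cypher query achieves the same on the database side). A minimal sketch over a few sample rows shaped like the output above:

```python
# a few rows from the query output above; the same pair appears once per letter
rows = [
    {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl'},
    {'n1.Name': 'Weber, Max', 'n2.Name': 'Bücher, Karl'},
    {'n1.Name': 'Weber, Max', 'n2.Name': 'Rickert, Heinrich'},
    {'n1.Name': 'Weber, Max', 'n2.Name': 'Rickert, Heinrich'},
    {'n1.Name': 'Weber, Max', 'n2.Name': 'Jaspers, Karl'},
]

# keep each correspondent only once, sorted for stable output
correspondents = sorted({row['n2.Name'] for row in rows})
print(correspondents)
```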
So far, we have only focused on retrieving textual output from our queries. But of course we can visualize networks too. The code block below gives a quick example of how to visualize the query output as a network.

In the code below, we are going to use a custom-written function (`to_nx_graph()`). This function is stored in another Python file, so we can load it as if it were a library of its own. You can find a more in-depth explanation of the steps below in the chapter Complex Queries & Data Preparation.
The query below is an extension of the query we just used. We check out the network of people Max Weber corresponded with, but we also take a look at the second degree of the same relationships. So we also check the correspondences of the people Max Weber corresponded with.
# the line below loads the custom function "to_nx_graph()". See chapter 6 for more details.
from helper_functions.helper_fun import to_nx_graph

driver = GraphDatabase.driver(uri, auth=(user, password))

query = """
MATCH (n1:PerName)-[r:SocialRelation]->(n2:PerName)-[r2:SocialRelation]->(n3:PerName)
WHERE n1.Id = "(DE-588)118629743" AND r.SourceType = "correspondedRelation" AND r2.SourceType = "correspondedRelation"
RETURN *
"""

G = to_nx_graph(neo4j_driver=driver, query=query)
For the visualizations we are going to use a custom draw function. Please check out Chapter 3 in Notebook 2 for more details.
from matplotlib.colors import rgb2hex
from matplotlib.patches import Circle
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
# defining general variables
## we start off by setting the position of nodes and edges again
pos = nx.kamada_kawai_layout(G)
## set the color map to be used
color_map = plt.cm.plasma
## extract the node label attribute from graph object
#node_labels = nx.get_node_attributes(G, "label")
# setup node_colors
node_color_attribute = "type"
groups = set(nx.get_node_attributes(G, node_color_attribute).values())
group_ids = np.array(range(len(groups)))
if len(group_ids) > 1:
group_ids_norm = (group_ids - np.min(group_ids))/np.ptp(group_ids)
else:
group_ids_norm = group_ids
mapping = dict(zip(groups, group_ids_norm))
node_colors = [mapping[G.nodes()[n][node_color_attribute]] for n in G.nodes()]
# defining the graph options & styling
## dictionary for node options:
node_options = {
    "pos": pos,
    "node_size": 150,
    "alpha": 0.5, # node transparency
    "node_color": node_colors, # here we set the node_colors object as an option
    "cmap": color_map # this cmap defines the color scale we want to use
}
## dictionary for edge options:
edge_options = {
"pos": pos,
"width": 1.5,
"alpha": 0.2,
}
## set plot size and plot margins
plt.figure(figsize=[20, 20])
plt.margins(x=0.1, y = 0.1)
# draw the graph
## draw the nodes
nx.draw_networkx_nodes(G, **node_options)
## draw the edges
nx.draw_networkx_edges(G, **edge_options)
# create custom legend according to color_map
geom_list = [Circle((0, 0), color=rgb2hex(color_map(float(mapping[term])))) for term in groups]
plt.legend(geom_list, groups)
# show the plot
plt.show()
Write a query that retrieves all RelationToGeoName
edges from Max Weber as well as the corresponding GeoName
nodes.
Visualize the resulting graph (see Notebook 2 for an explanation on how to visualize graphs).
In this section about data exploration, we took a quick look at the very basics of the Cypher query language. Whenever you want to retrieve data directly from the SoNAR (IDH) database, you need to write a Cypher query.
A full introduction to this query language would exceed the scope of this curriculum, but the list below provides an overview of good resources for digging deeper into Cypher:
The upcoming sections of this curriculum also rely heavily on Cypher, but there won't be a detailed explanation of every clause and command used. You can treat these cells as code recipes and check out the aforementioned resources for documentation of the applied Cypher clauses.
We can also aggregate values and do more complex calculations with Cypher. Let's create a summary of how many Nodes, Relationships, Node Labels and Relationship Types are inside the database.
driver = GraphDatabase.driver(uri, auth=(user, password))
query = """
MATCH (n)
RETURN 'Number of Nodes: ' + count(n) as output
UNION
MATCH ()-[]->()
RETURN 'Number of Relationships: ' + count(*) as output
UNION
CALL db.labels() YIELD label
RETURN 'Number of Labels: ' + count(*) AS output
UNION
CALL db.relationshipTypes() YIELD relationshipType
RETURN 'Number of Relationship Types: ' + count(*) AS output
"""
with driver.session() as session:
result = session.run(query).data()
result
[{'output': 'Number of Nodes: 51953727'}, {'output': 'Number of Relationships: 184468575'}, {'output': 'Number of Labels: 9'}, {'output': 'Number of Relationship Types: 10'}]
In the next code cell, we calculate the count of each node category in the database.
driver = GraphDatabase.driver(uri, auth=(user, password))
query = """
MATCH (n)
RETURN DISTINCT COUNT(LABELS(n)) AS count, LABELS(n) AS label
ORDER BY count
"""
with driver.session() as session:
result = session.run(query).data()
result
[{'count': 611, 'label': ['IsilTerm']}, {'count': 308197, 'label': ['GeoName']}, {'count': 385300, 'label': ['UniTitle']}, {'count': 424270, 'label': ['TopicTerm']}, {'count': 814044, 'label': ['MeetName']}, {'count': 1487711, 'label': ['CorpName']}, {'count': 5087660, 'label': ['PerName']}, {'count': 5446841, 'label': ['ChronTerm']}, {'count': 37999093, 'label': ['Resource']}]
We can do the same count calculation for relationship types too. However, the query below uses a slightly different logic to retrieve the count per relationship type than the query we applied to the nodes above.
The query below calls the procedure db.relationshipTypes()
to retrieve a list of all relationship types in the database. Afterwards, we use a procedure called apoc.cypher.run()
. This procedure can be used to execute a Cypher query per row. We use this procedure to run the count
function for each type retrieved from db.relationshipTypes()
.
This way of writing the query is a lot faster than the way we used above in the section Summarize Node Labels.
query = """
CALL db.relationshipTypes() YIELD relationshipType as type
CALL apoc.cypher.run('MATCH ()-[:`'+type+'`]->() RETURN count(*) as count',{}) YIELD value
RETURN type, value.count AS count
ORDER BY count
"""
with driver.session() as session:
result = session.run(query).data()
result
[{'type': 'RelationToUniTitle', 'count': 128389}, {'type': 'RelationToMeetName', 'count': 422351}, {'type': 'RelationToChronTerm', 'count': 5454155}, {'type': 'RelationToCorpName', 'count': 6731666}, {'type': 'RelationToGeoName', 'count': 6873399}, {'type': 'RelationToResource', 'count': 7389423}, {'type': 'RelationToPerName', 'count': 20860575}, {'type': 'RelationToTopicTerm', 'count': 24279324}, {'type': 'SocialRelation', 'count': 37072940}, {'type': 'RelationToIsilTerm', 'count': 75256353}]
We can also easily create a plot from the result we just generated. The code block below uses pandas to convert the result of the code block above into a data frame. Furthermore, we use the pandas method plot.bar
to create a bar plot. More details on the method plot.bar
can be found here.
import pandas as pd
pd.DataFrame(result).plot.bar(x="type", y="count")
<AxesSubplot:xlabel='type'>
Centrality algorithms can be used to uncover the roles and importance of nodes in a network. There are many ways to measure the centrality of a node. The example below uses degree centrality, one of the simplest centrality measures. (Needham & Hodler, 2019)
Degree centrality simply counts the number of incoming and outgoing relationships of a node. It was introduced by Freeman in his paper "Centrality in social networks conceptual clarification" (1978).
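To make the measure concrete, here is degree centrality on a small toy graph using networkx (the graph and names are invented for illustration, not taken from the SoNAR (IDH) database):

```python
import networkx as nx

# build a tiny correspondence network (made-up example data)
G = nx.Graph()
G.add_edges_from([
    ("Weber", "Jaspers"),
    ("Weber", "Lukacs"),
    ("Weber", "Ernst"),
    ("Jaspers", "Lukacs"),
])

# raw degree: the number of relationships attached to each node
degrees = dict(G.degree())
print(degrees)  # Weber has 3 connections, Jaspers and Lukacs 2, Ernst 1

# networkx also offers a normalized variant: degree / (n - 1)
centrality = nx.degree_centrality(G)
print(max(centrality, key=centrality.get))  # -> Weber
```

The Cypher procedure used in the next cell computes the same kind of score, only on the database side and restricted to a projection of the graph.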
The example below calculates the number of SocialRelation
for PerName
nodes and returns the top 10 people with the most social relationships in the SoNAR (IDH) database.
More information about Cypher based centrality procedures can be found here.
# In the query below we use the built-in degree centrality procedure of Neo4j.
# We define a "node projection" and a "relationship projection", to narrow down the degree centrality calculation
# to a specific subset of nodes and edges.
# More details can be found by following the link mentioned in the text above.
query = """
CALL gds.alpha.degree.stream({
nodeProjection: {type: "PerName"},
relationshipProjection: {type: "SocialRelation"}
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).Name AS Name, score
ORDER BY score DESC
LIMIT 10
"""
with driver.session() as session:
result = session.run(query).data()
result
[{'Name': 'Unseld, Siegfried', 'score': 143347.0}, {'Name': 'Zeeh, Burgel', 'score': 76720.0}, {'Name': 'Ritzerfeld, Helene', 'score': 64609.0}, {'Name': 'Böttiger, Carl August', 'score': 47762.0}, {'Name': 'Höllerer, Walter', 'score': 45616.0}, {'Name': 'Mozart, Wolfgang Amadeus', 'score': 42590.0}, {'Name': 'Hauptmann, Gerhart', 'score': 37309.0}, {'Name': 'Goethe, Johann Wolfgang von', 'score': 36265.0}, {'Name': 'Francke, Gotthilf August', 'score': 35440.0}, {'Name': 'Johnson, Uwe', 'score': 35068.0}]
Task:
Calculate the degree centrality of PerName
nodes with respect to RelationToPerName
relationships.
As shown in notebook 2, we can run a pathfinding algorithm to find the shortest path between two nodes. The shortest path algorithm can take weighted relationships into account and is widely applied in navigation systems.
Furthermore, detecting shortest paths can provide insights into how close people are to each other, how similar they might be, or whether they have something in common. (Needham & Hodler, 2019)
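As a minimal illustration of the idea (a toy graph with invented names and weights, not SoNAR data), networkx can compute shortest paths directly:

```python
import networkx as nx

# toy network: edges carry a "cost" weight (invented values)
G = nx.Graph()
G.add_weighted_edges_from([
    ("Hume", "Annan", 1),
    ("Annan", "Fischer", 1),
    ("Fischer", "Curie", 1),
    ("Hume", "Curie", 5),  # a direct but "expensive" connection
], weight="cost")

# unweighted: fewest hops wins, so the direct edge is taken
print(nx.shortest_path(G, "Hume", "Curie"))
# ['Hume', 'Curie']

# weighted: Dijkstra minimizes the total cost, so the longer chain wins
print(nx.shortest_path(G, "Hume", "Curie", weight="cost"))
# ['Hume', 'Annan', 'Fischer', 'Curie']
```

The Neo4j procedure used below works the same way, except that the graph never leaves the database.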
The example below shows the calculation of the shortest path between John Hume ((DE-588)119444666
) and Marie Curie ((DE-588)118523023
). Also, we define a nodeProjection
and a relationshipProjection
. These projections are arguments you can use inside the shortest path procedure to define specific properties and characteristics of the nodes and relationships you want to consider for the shortest path calculation.
More information on the Cypher shortest path finding algorithm and the projections can be found here.
query = """
MATCH (start:PerName {Id: "(DE-588)119444666"}),
(end:PerName {Id: "(DE-588)118523023"})
CALL gds.alpha.shortestPath.stream({
startNode: start,
endNode: end,
nodeProjection: {type: "PerName"},
relationshipProjection: {
all: {
type: "SocialRelation",
orientation: "NATURAL",
TypeAddInfo: "directed",
SourceType: "correspondedRelation"
}
}})
YIELD nodeId, cost
RETURN gds.util.asNode(nodeId).Name AS Name
"""
with driver.session() as session:
result = session.run(query).data()
result
[{'Name': 'Hume, John'}, {'Name': 'Annan, Kofi A.'}, {'Name': 'Fischer, Joschka'}, {'Name': 'Bereska, Henryk'}, {'Name': 'Skłodowska-Curie, Marie'}]
Task:
In this last chapter of notebook 3 we want to take a look at more complex queries and data-processing procedures. The queries in this chapter use concepts and functionalities of the Cypher query language we have not used so far. As mentioned earlier, there won't be an in-depth explanation of how the queries work, but there will be links to the documentation of the most important parts.
For the query below we want to retrieve all resources (Resource
) and related works (UniTitle
). Furthermore, we apply a temporal filter, so we only retrieve resources and works created in a given time span.
from neo4j import GraphDatabase
import networkx as nx
driver = GraphDatabase.driver(uri, auth=(user, password))
from_year = "1900"
to_year = "1925"
# this is a RegEx pattern that matches a 4-digit year (e.g. "1800")
date_pattern = "([0-9]{4})"
# query scaffolding with placeholders
query = """
MATCH (n:UniTitle)-[r]-(m:Resource)
WHERE m.DateApproxBegin =~ "{date_pattern}"
AND toInteger(m.DateApproxBegin) >= toInteger({from_year})
AND toInteger(m.DateApproxBegin) <= toInteger({to_year})
RETURN *
"""
# replace placeholders in query scaffolding
query = query.format(from_year=from_year,
to_year=to_year,
date_pattern=date_pattern)
The query above uses the following elements to construct the database request:
Regular Expressions are used to select only correct year formats. Click here for more details on matching with Cypher using regular expressions.
Scalar Functions (toInteger()
) to convert string values to integer values. Click here for more details.
The Python string format()
method to replace placeholders inside a character string in Python. Click here for more details.
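The templating step can be tried in isolation (plain Python, no database connection needed). Note that str.format() only substitutes the named placeholders in the template; braces inside the substituted values, like the {4} in the year pattern, are left untouched:

```python
# a shortened version of the query scaffolding above, with two placeholders
template = 'WHERE m.DateApproxBegin =~ "{date_pattern}" AND toInteger(m.DateApproxBegin) >= toInteger({from_year})'

# fill in the placeholders; the regex value is inserted verbatim
query = template.format(date_pattern="([0-9]{4})", from_year="1900")
print(query)
# WHERE m.DateApproxBegin =~ "([0-9]{4})" AND toInteger(m.DateApproxBegin) >= toInteger(1900)
```

If the template itself ever needs a literal brace, it has to be doubled ("{{" / "}}"), as in the sociologists query later in this chapter.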
In the next step, we use a custom function to process the query. We import a function called to_nx_graph()
. This function helps us keep the code slim and clean. The function itself does things we have already done several times:
On the one hand, it sends the query to the SoNAR (IDH) database and ingests the database reply.
On the other hand, the function generates a networkx
graph object from the returned data. This process is similar to the one used in the chapter "Case Study: Nobel Laureates" in notebook 2.
Click here to see the source code of this helper function.
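Purely for intuition, a heavily simplified stand-in for such a helper might look as follows. The function name records_to_nx_graph() is hypothetical and this is not the real helper (the real one handles full node and relationship objects); it only shows the principle of turning rows shaped like the .data() output seen earlier into a graph:

```python
import networkx as nx

def records_to_nx_graph(records):
    """Hypothetical simplification: build a graph from rows shaped like
    the .data() output seen earlier ('n1.Name', 'n2.Name', 'r.SourceType')."""
    # a MultiGraph keeps parallel edges, like repeated correspondence entries
    G = nx.MultiGraph()
    for row in records:
        G.add_edge(row["n1.Name"], row["n2.Name"], SourceType=row["r.SourceType"])
    return G

# two of the rows returned for Max Weber earlier in this notebook
rows = [
    {"n1.Name": "Weber, Max", "n2.Name": "Jaspers, Karl", "r.SourceType": "correspondedRelation"},
    {"n1.Name": "Weber, Max", "n2.Name": "Ernst, Paul", "r.SourceType": "correspondedRelation"},
]
G = records_to_nx_graph(rows)
print(G.number_of_nodes(), G.number_of_edges())  # 3 2
```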
from helper_functions.helper_fun import to_nx_graph
G = to_nx_graph(neo4j_driver=driver,
query=query)
This graph object can easily be converted to a data frame and analyzed as tabular data.
import pandas as pd
graph_df = pd.DataFrame.from_dict(dict(G.nodes(data=True)), orient='index')
graph_df["type"].value_counts()
Resource 905 UniTitle 363 Name: type, dtype: int64
In the next step, we prepare the visualization of the graph. This way we get a general overview of the graph structure.
from matplotlib.colors import rgb2hex
from matplotlib.patches import Circle
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
# defining general variables
## we start off by setting the position of nodes and edges again
pos = nx.kamada_kawai_layout(G)
## set the color map to be used
color_map = plt.cm.plasma
# setup node_colors
node_color_attribute = "type"
groups = set(nx.get_node_attributes(G, node_color_attribute).values())
group_ids = np.array(range(len(groups)))
if len(group_ids) > 1:
group_ids_norm = (group_ids - np.min(group_ids))/np.ptp(group_ids)
else:
group_ids_norm = group_ids
mapping = dict(zip(groups, group_ids_norm))
node_colors = [mapping[G.nodes()[n][node_color_attribute]] for n in G.nodes()]
# defining the graph options & styling
## dictionary for node options:
node_options = {
    "pos": pos,
    "node_size": 150,
    "alpha": 0.5, # node transparency
    "node_color": node_colors, # here we set the node_colors object as an option
    "cmap": color_map # this cmap defines the color scale we want to use
}
## dictionary for edge options:
edge_options = {
"pos": pos,
"width": 1.5,
"alpha": 0.2,
}
## set plot size and plot margins
plt.figure(figsize=[20, 20])
plt.margins(x=0.1, y = 0.1)
# draw the graph
## draw the nodes
nx.draw_networkx_nodes(G, **node_options)
## draw the edges
nx.draw_networkx_edges(G, **edge_options)
# create custom legend according to color_map
geom_list = [Circle((0, 0), color=rgb2hex(color_map(float(mapping[term])))) for term in groups]
plt.legend(geom_list, groups)
# show the plot
plt.show()
The following example analyzes persons based on a specific topic term and whether the person was alive during a given time period.
The query below filters for people who were sociologists and were alive between January 1st, 1900 and January 1st, 1925. Furthermore, the query retrieves all connected resources of the persons that meet the filter criteria.
from neo4j import GraphDatabase
import networkx as nx
driver = GraphDatabase.driver(uri, auth=(user, password))
from_date = "1900-01-01"
to_date = "1925-01-01"
topic_term = "Soziolog"
date_pattern = "([0-9]{2}[.][0-9]{2}[.][0-9]{4})"
# query scaffolding with placeholders
query = """
MATCH (n:PerName)-[r1]-(t:TopicTerm),
(n:PerName)-[r2]-(m:Resource)
WHERE n.DateStrictBegin =~ '{date_pattern}' AND n.DateStrictEnd =~ '{date_pattern}' AND
t.Name CONTAINS "{topic_term}"
WITH apoc.date.parse(n.DateStrictBegin, "ms", "dd.MM.yyyy") AS parsed_birth,
apoc.date.parse(n.DateStrictEnd, "ms", "dd.MM.yyyy") AS parsed_death,
n, m, t, r1, r2
WHERE apoc.coll.max([date(datetime({{epochmillis: parsed_birth}})), date("{from_date}")]) <= apoc.coll.min([date(datetime({{epochmillis: parsed_death}})), date("{to_date}")])
RETURN *
LIMIT 2000
"""
# replace placeholders in query scaffolding
query = query.format(from_date=from_date,
to_date=to_date,
date_pattern=date_pattern,
topic_term=topic_term)
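The core of the lifespan filter in this query is a standard interval-overlap test: two date ranges overlap exactly when the later of the two start dates is not after the earlier of the two end dates. That is what the apoc.coll.max / apoc.coll.min comparison expresses. A plain-Python sketch of the same logic:

```python
from datetime import date

def overlaps(start_a, end_a, start_b, end_b):
    """Two closed intervals overlap iff max(starts) <= min(ends)."""
    return max(start_a, start_b) <= min(end_a, end_b)

# Max Weber's lifespan versus the filter period used above
lifespan = (date(1864, 4, 21), date(1920, 6, 14))
period = (date(1900, 1, 1), date(1925, 1, 1))

print(overlaps(*lifespan, *period))                            # True: alive during the period
print(overlaps(date(1980, 1, 1), date(1990, 1, 1), *period))   # False: born after the period
```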
The query above uses the following new elements to construct the database request:
The WITH
clause is used to chain new variables together with the rest of the query. More details on the WITH
clause can be found here.
The APOC procedures apoc.date.parse(), apoc.coll.max() and apoc.coll.min() are used to parse the date strings and to check whether a person's lifespan overlaps with the filter period.
In the next step, we call the custom function to_nx_graph()
again and convert the graph to a data frame.
from helper_functions.helper_fun import to_nx_graph
G = to_nx_graph(neo4j_driver=driver,
query=query)
import pandas as pd
graph_df = pd.DataFrame.from_dict(dict(G.nodes(data=True)), orient='index')
graph_df["type"].value_counts()
Resource 1978 PerName 124 TopicTerm 1 Name: type, dtype: int64
Aggregate by time period
Let's use the pandas data frame to aggregate the retrieved data and plot the distribution of the resources over time:
# change this value to the year range you want to use as an aggregation period (e.g. "10y" for ten years)
time_range = "10y"
# cleaning up the dataframe
# only keep observations with a valid year as "DateApproxBegin"
agg_df = graph_df[graph_df.DateApproxBegin.str.fullmatch(
"([0-9]{4})", na=False)]
# only keep Resources and drop all other node types
agg_df = agg_df[agg_df.type == "Resource"]
# add a new column called "clean_date" containing a correctly formatted date
agg_df.insert(0, "clean_date", pd.to_datetime(
agg_df.DateApproxBegin, format="%Y"), False)
# aggregate the data
# aggregate by the given time range and calculate the number of observations in the time period per node type
agg_df = agg_df.groupby(["type", pd.Grouper(key="clean_date", freq=time_range)])[
"type"].agg("count")
# reset the grouping index so we have a "normal" dataframe again
agg_df = agg_df.reset_index(name="count")
# plot the result
# replace the full "clean_date" values with only the string of the respective ending year of the time period
agg_df["clean_date"] = agg_df["clean_date"].dt.strftime("%Y")
# plot a bar chart
agg_df.plot.bar(x="clean_date", y="count")
<AxesSubplot:xlabel='clean_date'>
Visualize the network
Of course, we also can visualize the full network again:
from matplotlib.colors import rgb2hex
from matplotlib.patches import Circle
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx
# defining general variables
## we start off by setting the position of nodes and edges again
pos = nx.kamada_kawai_layout(G)
## set the color map to be used
color_map = plt.cm.plasma
# setup node_colors
node_color_attribute = "type"
groups = set(nx.get_node_attributes(G, node_color_attribute).values())
group_ids = np.array(range(len(groups)))
if len(group_ids) > 1:
group_ids_norm = (group_ids - np.min(group_ids))/np.ptp(group_ids)
else:
group_ids_norm = group_ids
mapping = dict(zip(groups, group_ids_norm))
node_colors = [mapping[G.nodes()[n][node_color_attribute]] for n in G.nodes()]
# defining the graph options & styling
## dictionary for node options:
node_options = {
    "pos": pos,
    "node_size": 150,
    "alpha": 0.5, # node transparency
    "node_color": node_colors, # here we set the node_colors object as an option
    "cmap": color_map # this cmap defines the color scale we want to use
}
## dictionary for edge options:
edge_options = {
"pos": pos,
"width": 1.5,
"alpha": 0.2,
}
## set plot size and plot margins
plt.figure(figsize=[20, 20])
plt.margins(x=0.1, y = 0.1)
# draw the graph
## draw the nodes
nx.draw_networkx_nodes(G, **node_options)
## draw the edges
nx.draw_networkx_edges(G, **edge_options)
# create custom legend according to color_map
geom_list = [Circle((0, 0), color=rgb2hex(color_map(float(mapping[term])))) for term in groups]
plt.legend(geom_list, groups)
# show the plot
plt.show()
Task:
This notebook introduced you to the SoNAR (IDH) database, the data structure and the Cypher query language to retrieve and analyze the SoNAR data. In the next notebook we are going to use an exploratory approach to analyze the historical network of physiologists.
This section provides the solutions for the exercises in this notebook.
db.labels()
call.
with driver.session() as session:
result = session.run("CALL db.propertyKeys()").data()
result
from helper_functions.helper_fun import to_nx_graph
from neo4j import GraphDatabase
import networkx as nx
driver = GraphDatabase.driver(uri, auth=(user, password))
query = """
MATCH (n1:PerName)-[r:RelationToGeoName]-(n2:GeoName)
WHERE n1.Id = "(DE-588)118629743"
RETURN *
"""
G = to_nx_graph(neo4j_driver=driver,
query=query)
from matplotlib.colors import rgb2hex
from matplotlib.patches import Circle
import matplotlib.pyplot as plt
import numpy as np
# defining general variables
## we start off by setting the position of nodes and edges again
pos = nx.kamada_kawai_layout(G)
## set the color map to be used
color_map = plt.cm.plasma
# setup node_colors
node_color_attribute = "type"
groups = set(nx.get_node_attributes(G, node_color_attribute).values())
group_ids = np.array(range(len(groups)))
if len(group_ids) > 1:
group_ids_norm = (group_ids - np.min(group_ids))/np.ptp(group_ids)
else:
group_ids_norm = group_ids
mapping = dict(zip(groups, group_ids_norm))
node_colors = [mapping[G.nodes()[n][node_color_attribute]] for n in G.nodes()]
# defining the graph options & styling
## dictionary for node options:
node_options = {
    "pos": pos,
    "node_size": 150,
    "alpha": 0.5, # node transparency
    "node_color": node_colors, # here we set the node_colors object as an option
    "cmap": color_map # this cmap defines the color scale we want to use
}
## dictionary for edge options:
edge_options = {
"pos": pos,
"width": 1.5,
"alpha": 0.2,
}
## set plot size and plot margins
plt.figure(figsize=[5,5])
plt.margins(x=0.1, y = 0.1)
# draw the graph
## draw the nodes
nx.draw_networkx_nodes(G, **node_options)
## draw the edges
nx.draw_networkx_edges(G, **edge_options)
# create custom legend according to color_map
geom_list = [Circle((0, 0), color=rgb2hex(color_map(float(mapping[term])))) for term in groups]
plt.legend(geom_list, groups)
# show the plot
plt.show()
query = """
CALL gds.alpha.degree.stream({
nodeProjection: {type: "PerName"},
relationshipProjection: {type: "RelationToPerName"}
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).Name AS Name, score
ORDER BY score DESC
LIMIT 10
"""
with driver.session() as session:
result = session.run(query).data()
result
query = """
MATCH (start:PerName {Id: "(DE-588)119444666"}),
(end:PerName {Id: "(DE-588)118523023"})
CALL gds.alpha.shortestPath.stream({
startNode: start,
endNode: end,
nodeProjection: {type: "PerName"},
relationshipProjection: {
all: {
type: "RelationToGeoName",
orientation: "NATURAL",
TypeAddInfo: "directed"
}
}})
YIELD nodeId, cost
RETURN gds.util.asNode(nodeId).Name AS Name
"""
with driver.session() as session:
result = session.run(query).data()
result
# change this value to the year range you want to use as an aggregation period (e.g. "10y" for ten years)
time_range = "5y"
# cleaning up the dataframe
# only keep observations with a valid year as "DateApproxBegin"
agg_df = graph_df[graph_df.DateApproxBegin.str.fullmatch(
"([0-9]{4})", na=False)]
# only keep Resources and drop all other node types
agg_df = agg_df[agg_df.type == "Resource"]
# add a new column called "clean_date" containing a correctly formatted date
agg_df.insert(0, "clean_date", pd.to_datetime(
agg_df.DateApproxBegin, format="%Y"), False)
# aggregate the data
# aggregate by the given time range and calculate the number of observations in the time period per node type
agg_df = agg_df.groupby(["type", pd.Grouper(key="clean_date", freq=time_range)])[
"type"].agg("count")
# reset the grouping index so we have a "normal" dataframe again
agg_df = agg_df.reset_index(name="count")
# plot the result
# replace the full "clean_date" values with only the string of the respective ending year of the time period
agg_df["clean_date"] = agg_df["clean_date"].dt.strftime("%Y")
# plot a bar chart
agg_df.plot.bar(x="clean_date", y="count")
Freeman, L. C. (1978). Centrality in social networks conceptual clarification. Social Networks, 1(3), 215–239. https://doi.org/10.1016/0378-8733(78)90021-7
Needham, M. & Hodler, A. (2019). Graph Algorithms: Practical Examples in Apache Spark and Neo4j. Beijing: O'Reilly.