While many databases, services, or museums might expose their data via a web API, such APIs have limitations. Matthew Lincoln has an excellent tutorial at The Programming Historian that walks us through how linked data differs from a typical web API, but the key difference is in the way the data is represented. When data is described using the 'Resource Description Framework' (RDF), the resource (the 'thing') is described via a series of relationships, rather than as rows in a table or keys having values.
Information is in the relationships. It's a network. It's a graph. Thus, every 'thing' in this graph can have its own uniform resource identifier (URI) that lives as a location on the internet. Information can then be created by making statements that use these URIs, similarly to how English grammar creates meaning: subject verb object. Or, in RDF-speak, 'subject predicate object', also known as a triple. In this way, data in different places can be linked together by referencing the elements they have in common. This is Linked Open Data (LOD). The access point for interrogating LOD is called an 'endpoint'.
Finally, SPARQL is an acronym for SPARQL Protocol and RDF Query Language (yes, it's one of those kinds of acronyms).
In this notebook, we're not using Python or R directly. Instead, we've set up a 'kernel' (think of that as the 'engine' for the notebook) that already includes everything necessary to set up and run SPARQL queries. (For reference, the kernel code is here). Both R and Python can interact with and query endpoints, and manipulate linked open data, but for the sake of learning a bit of what one can do with SPARQL, this notebook keeps all of that ancillary code tucked away. The [followup notebook](Using R to Retrieve and Visualize Data from SPARQL.ipynb) to this one shows you how to use R to do some basic manipulations of the query results.
Here, we are following Matthew Lincoln's tutorial.
Let's look at his example, which concerns the painting, 'The Nightwatch'.
<The Nightwatch> <was created by> <Rembrandt van Rijn> .
This statement has three elements:
<The Nightwatch>
<was created by>
<Rembrandt van Rijn>
Lincoln combines these, and other such statements, into a (pseudo-)RDF database like so:
<The Nightwatch> <was created by> <Rembrandt van Rijn> .
<The Nightwatch> <was created in> <1642> .
<The Nightwatch> <has medium> <oil on canvas> .
<Rembrandt van Rijn> <was born in> <1606> .
<Rembrandt van Rijn> <has nationality> <Dutch> .
<Johannes Vermeer> <has nationality> <Dutch> .
<Woman with a Balance> <was created by> <Johannes Vermeer> .
<Woman with a Balance> <has medium> <oil on canvas> .
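To get a feel for what 'querying triples' means before we touch a real endpoint, here is a minimal Python sketch (not part of Lincoln's tutorial, and not how a real triple store is implemented). It holds the pseudo-RDF statements above as plain tuples and matches patterns against them, with `None` playing the role that a variable such as `?p` plays in SPARQL. The function name `match` is our own invention for illustration.

```python
# Lincoln's pseudo-RDF statements as (subject, predicate, object) tuples.
TRIPLES = [
    ("The Nightwatch", "was created by", "Rembrandt van Rijn"),
    ("The Nightwatch", "was created in", "1642"),
    ("The Nightwatch", "has medium", "oil on canvas"),
    ("Rembrandt van Rijn", "was born in", "1606"),
    ("Rembrandt van Rijn", "has nationality", "Dutch"),
    ("Johannes Vermeer", "has nationality", "Dutch"),
    ("Woman with a Balance", "was created by", "Johannes Vermeer"),
    ("Woman with a Balance", "has medium", "oil on canvas"),
]


def match(subject=None, predicate=None, obj=None, triples=TRIPLES):
    """Return every triple that fits the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# One pattern: which things are oil on canvas?
paintings = [s for (s, _, _) in match(predicate="has medium", obj="oil on canvas")]

# Two patterns joined on a shared value: works created by Dutch artists.
dutch_artists = {s for (s, _, _) in match(predicate="has nationality", obj="Dutch")}
dutch_paintings = [
    s for (s, _, o) in match(predicate="was created by") if o in dutch_artists
]

print(paintings)
print(dutch_paintings)
```

The second lookup is the interesting one: it links two statements through the element they have in common (the artist), which is the same move a SPARQL query makes when a variable appears in more than one line.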
Such RDF databases describe nodes and links, so we can visualize them as a graph like so:
But there is a difference between the pseudo-RDF that Lincoln shows us, and what actual RDF might look like:
<http://data.rijksmuseum.nl/item/8909812347> <http://purl.org/dc/terms/creator> <http://dbpedia.org/resource/Rembrandt> .
The human-readable version requires more statements:
<http://data.rijksmuseum.nl/item/8909812347> <http://purl.org/dc/terms/title> "The Nightwatch" .
<http://purl.org/dc/terms/creator> <http://www.w3.org/1999/02/22-rdf-syntax-ns#label> "was created by" .
<http://dbpedia.org/resource/Rembrandt> <http://xmlns.com/foaf/0.1/name> "Rembrandt van Rijn" .
This is just a quick introduction; please do examine Lincoln's tutorial for more details. But now, let's explore how this notebook can be used to write some queries.
# Jupyter notebooks have various built-in commands called 'magics' that are accessed with the '%' character; these depend on the kernel.
# Let's see what the SPARQL kernel has
%lsmagics
# Before we run _any_ query in this notebook, we first have to tell the kernel
# which endpoint we're going to use. Let's use the British Museum's:
%endpoint http://collection.britishmuseum.org/sparql
Lincoln suggests that when we first encounter a new RDF graph, we explore the network of relationships from an example object to understand what is going on in the database and to see what is available for querying. Since we're querying the British Museum, let's take the Rosetta Stone as our example.
In the query below, `?p` and `?o` stand for 'predicate' and 'object'. Thus, we're building up a query that asks, 'show me every statement structured `<The Rosetta Stone> <predicate> <object>`'. When the results load, you can right-click on each result (which is a URI, remember) to see what we've discovered. This could give you the information you need to construct more complicated queries.
NB: The British Museum SPARQL endpoint and its underlying infrastructure do not appear to be well supported. Results are sometimes flaky or unreachable.
SELECT ?p ?o
WHERE {
<http://collection.britishmuseum.org/id/object/YCA62958> ?p ?o .
}
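The kernel hides the mechanics of talking to the endpoint, so here is a rough sketch of what an equivalent request could look like from plain Python, using only the standard library. It assumes the endpoint accepts the usual SPARQL protocol `GET` request with a `query` parameter and can return JSON results; given the flakiness noted above, the sketch only builds the request and leaves the actual network call commented out. The helper name `build_request` is ours.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://collection.britishmuseum.org/sparql"
QUERY = """
SELECT ?p ?o
WHERE {
  <http://collection.britishmuseum.org/id/object/YCA62958> ?p ?o .
}
"""


def build_request(endpoint, query):
    """Build (but do not send) a SPARQL protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)
print(request.full_url)

# To actually send it (needs network access and a responsive endpoint):
# import json
# from urllib.request import urlopen
# with urlopen(request, timeout=30) as response:
#     results = json.load(response)
```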
In this next query, we look for objects in the collection whose object type has the label 'fibula'.
%endpoint http://collection.britishmuseum.org/sparql
%display table
PREFIX bmo: <http://www.researchspace.org/ontology/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?object
WHERE {
# Search for all values of ?object that have a given "object type"
?object bmo:PX_object_type ?object_type .
# That object type should have the label "fibula"
?object_type skos:prefLabel "fibula" .
}
LIMIT 10
Wikidata is another endpoint we can query. Below we have a query by Sebastian Heath that extracts some of the genealogical data on Roman emperors contained in that database. The `wd:Q842606` can be expanded to refer to https://www.wikidata.org/wiki/Q842606, which describes the concept 'Roman Emperor'. `wdt:P39` is a predicate meaning 'Position held' (https://www.wikidata.org/wiki/Property:P39).
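A prefixed name is just shorthand: the part before the colon stands for a namespace, and the part after it is appended to that namespace to give the full identifier. The sketch below shows the idea in Python, using the two namespaces that the Wikidata query service predefines for `wd:` and `wdt:`. The `expand` helper is hypothetical, written only for illustration.

```python
# Namespaces that the Wikidata query service predefines for these prefixes.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Turn a prefixed name such as 'wd:Q842606' into a full IRI."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return prefixes[prefix] + local


print(expand("wd:Q842606"))  # http://www.wikidata.org/entity/Q842606
print(expand("wdt:P39"))     # http://www.wikidata.org/prop/direct/P39
```

The full IRIs are what the endpoint actually stores; the wiki pages linked in the text are the human-readable descriptions of the same item and property.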
%endpoint http://query.wikidata.org/sparql
%display table
SELECT ?emperorLabel ?emperor_dob
?childLabel
?motherLabel ?maternalGrandfatherLabel ?maternalGrandmotherLabel
?emperor ?child ?mother ?maternalGrandfather ?maternalGrandmother WHERE {
?emperor wdt:P39 wd:Q842606 . #p39: position held. Q842606: Roman Emperor
?emperor wdt:P569 ?emperor_dob . # p569: date of birth
?child wdt:P22 ?emperor . #p22: father
?child wdt:P25 ?mother . #p25: mother
OPTIONAL { ?mother wdt:P22 ?maternalGrandfather }
OPTIONAL { ?mother wdt:P25 ?maternalGrandmother }
# automatic label expander
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} ORDER BY ?emperor_dob
Let's visualize these relationships. We're running much the same query, but we use CONSTRUCT to create the nodes and edges that represent these familial relationships. We want to show 'emperor x is the father of person y' and 'person a is the mother of person y'. That gives us the structure. To get the content, we use the WHERE clause, where we first tell it to retrieve those individuals who were emperor, and then retrieve the data on their children.
Once you've run the query, use ctrl+f to find someone familiar, like Augustus (Q1405). In the resulting graph, an edge labeled 'P22', e.g. Q1405 -> P22 -> Q2259, can be read 'Q1405 is the father of Q2259', or rather, 'Augustus is the father of Julia the Elder'.
Roman genealogy... it's complicated!
%endpoint http://query.wikidata.org/sparql
%display diagram
CONSTRUCT {
?emperor wdt:P22 ?child . #p22: father
?mother wdt:P25 ?child . #p25: mother
}
WHERE {
?emperor wdt:P39 wd:Q842606 .
?child wdt:P22 ?emperor . #p22: father
?child wdt:P25 ?mother . #p25: mother
OPTIONAL { ?mother wdt:P22 ?maternalGrandfather }
OPTIONAL { ?mother wdt:P25 ?maternalGrandmother }
}
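The diagram is one way to read the constructed triples; another is to treat them as a list of edges. The Python sketch below assumes we have the CONSTRUCT output as (subject, predicate, object) tuples of bare IDs. Only the Q1405 edge comes from the discussion above; the second edge uses a made-up placeholder ID purely to show the mother relationship, and both helper functions are our own.

```python
from collections import defaultdict

# Edges as the CONSTRUCT template above shapes them: parent -> predicate -> child.
edges = [
    ("Q1405", "P22", "Q2259"),     # Augustus -> Julia the Elder (from the text)
    ("Q_MOTHER", "P25", "Q2259"),  # hypothetical placeholder ID, not a real item
]

# How this notebook reads the two predicates in the constructed graph.
LABELS = {"P22": "is the father of", "P25": "is the mother of"}


def children_by_parent(triples):
    """Group outgoing edges by their parent node."""
    graph = defaultdict(list)
    for parent, predicate, child in triples:
        graph[parent].append((predicate, child))
    return dict(graph)


def read_edges(triples):
    """Render each edge as a plain-language sentence."""
    return [f"{s} {LABELS.get(p, p)} {o}" for s, p, o in triples]


for sentence in read_edges(edges):
    print(sentence)
```

Note that this reading direction is a choice made by the CONSTRUCT template: in Wikidata itself, P22 points from the child to the father.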
Another excellent SPARQL endpoint is the Nomisma portal for numismatic materials.
%endpoint http://nomisma.org/query
Now, if you actually go to http://nomisma.org/sparql you'll find a query builder with the following information already preloaded:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX bio: <http://purl.org/vocab/bio/0.1/>
PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX dcmitype: <http://purl.org/dc/dcmitype/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX nm: <http://nomisma.org/id/>
PREFIX nmo: <http://nomisma.org/ontology#>
PREFIX org: <http://www.w3.org/ns/org#>
PREFIX osgeo: <http://data.ordnancesurvey.co.uk/ontology/geometry/>
PREFIX rdac: <http://www.rdaregistry.info/Elements/c/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX spatial: <http://jena.apache.org/spatial#>
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT * WHERE {
?s ?p ?o
} LIMIT 100
All those prefixes are the ontologies being used to describe the materials. The `?s ?p ?o` are the subject, predicate, and object that we're going to search for. Let's run some of the example queries that Nomisma can handle. Since Roman Emperors are often depicted on coins, let's see which emperors are present in Nomisma.
%display table
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX bio: <http://purl.org/vocab/bio/0.1/>
PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX dcmitype: <http://purl.org/dc/dcmitype/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX nm: <http://nomisma.org/id/>
PREFIX nmo: <http://nomisma.org/ontology#>
PREFIX org: <http://www.w3.org/ns/org#>
PREFIX osgeo: <http://data.ordnancesurvey.co.uk/ontology/geometry/>
PREFIX rdac: <http://www.rdaregistry.info/Elements/c/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX spatial: <http://jena.apache.org/spatial#>
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?uri ?label WHERE {
?uri a foaf:Person ;
skos:prefLabel ?label ;
org:hasMembership ?membership .
?membership org:role nm:roman_emperor .
FILTER(langMatches(lang(?label), "EN"))
}
uri | label |
---|---|
http://nomisma.org/id/vabalathus | Vabalathus |
http://nomisma.org/id/valerius_valens | Valerius Valens |
http://nomisma.org/id/arcadius | Arcadius |
http://nomisma.org/id/carausius | Carausius |
http://nomisma.org/id/valerian | Valerian |
http://nomisma.org/id/commodus | Commodus |
http://nomisma.org/id/justinian_i | Justinian I |
http://nomisma.org/id/tetricus_ii | Tetricus II |
http://nomisma.org/id/nerva | Nerva |
http://nomisma.org/id/olybrius | Olybrius |
http://nomisma.org/id/volusian | Volusian |
http://nomisma.org/id/sextus_martinianus | Martinianus |
http://nomisma.org/id/leo_ii | Leo II |
http://nomisma.org/id/elagabalus | Elagabalus |
http://nomisma.org/id/tiberius | Tiberius |
http://nomisma.org/id/sebastianus | Sebastianus |
http://nomisma.org/id/julius_nepos | Julius Nepos |
http://nomisma.org/id/numerian | Numerian |
http://nomisma.org/id/constantius_ii | Constantius II |
http://nomisma.org/id/maximus_barcelona | Maximus of Barcelona |
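The table above is the kernel's rendering of the results. If we ask an endpoint for JSON instead, the standard SPARQL JSON results format wraps each value in a small object. The sketch below flattens that structure into simple rows. The sample document is hand-written in that format from two of the rows above (it is not a captured response), and `bindings_to_rows` is a hypothetical helper.

```python
import json

# Hand-written sample in the SPARQL JSON results format, using two rows
# from the table above. A real response would contain many more bindings.
SAMPLE = """
{
  "head": {"vars": ["uri", "label"]},
  "results": {"bindings": [
    {"uri": {"type": "uri", "value": "http://nomisma.org/id/vabalathus"},
     "label": {"type": "literal", "xml:lang": "en", "value": "Vabalathus"}},
    {"uri": {"type": "uri", "value": "http://nomisma.org/id/nerva"},
     "label": {"type": "literal", "xml:lang": "en", "value": "Nerva"}}
  ]}
}
"""


def bindings_to_rows(document):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    variables = document["head"]["vars"]
    return [
        # A variable can be unbound in a given row (e.g. inside OPTIONAL).
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in document["results"]["bindings"]
    ]


rows = bindings_to_rows(json.loads(SAMPLE))
for row in rows:
    print(row["label"], row["uri"])
```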
We can also do spatial queries; this one looks for mints within 50 km of Athens.
It also specifies the format in which we want the results returned, and writes these results to a JSON file for further manipulation.
%format json
%display table
%outfile mints.json
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX nm: <http://nomisma.org/id/>
PREFIX nmo: <http://nomisma.org/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX spatial: <http://jena.apache.org/spatial#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT * WHERE {
?loc spatial:nearby (37.974722 23.7225 50 'km') ;
geo:lat ?lat ;
geo:long ?long .
?mint geo:location ?loc ;
skos:prefLabel ?label ;
a nmo:Mint
FILTER langMatches (lang(?label), 'en')
}
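To make the 'within 50 km' idea concrete, here is a small Python sketch of the same kind of test using the haversine great-circle formula. This is our own illustration of a radius filter, not necessarily how the endpoint's `spatial:nearby` function computes distance. The centre point is the one used in the query above; the two test points are made-up coordinates chosen to fall clearly inside and clearly outside the radius.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0          # mean Earth radius
ATHENS = (37.974722, 23.7225)     # (lat, long) from the query above


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, long) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def nearby(point, centre=ATHENS, radius_km=50):
    """True if the point lies within radius_km of the centre."""
    return haversine_km(point, centre) <= radius_km


# Illustrative coordinates only, not taken from Nomisma.
print(nearby((38.0, 23.8)))     # a few km from the centre
print(nearby((40.64, 22.94)))   # several hundred km to the north
```

Applied to the `lat` and `long` values saved in mints.json, a function like this would let us re-check or tighten the radius locally without re-running the query.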