Raphael Leung
Open data serves a public good. It boosts transparency and accountability, and creates economic value (£16 billion a year, according to Deloitte). Folks at the ODI and Open Data Camp have written more about that.
I've been working at the UK Parliamentary Digital Service on open data. Under the hood, we turn open parliamentary data from sources (mostly relational stores) into a graph with an ETL process, which is then exposed via APIs and consumed by, among others, beta.parliament.uk. (technical details).
But what graph, and why? There are many types and models of graphs. Commonly, the term property graph has come to denote an attributed, multi-relational graph, i.e. a graph where the edges are labeled and both vertices and edges can have any number of key/value properties associated with them. This (labeled) property graph model is commonly seen in graph databases like Neo4j and in network-science analysis. While the property graph model is related and increasingly commercially commonplace, here we discuss the RDF model instead.
Unlike other graph models, Resource Description Framework (RDF) is a W3C specification. Its standard query language is called SPARQL Protocol and RDF Query Language (yes, the "S" in SPARQL stands for SPARQL). Along with the Web Ontology Language (OWL), these standards commonly feature in discussions about the semantic web, linked open data, and knowledge graphs.
RDF has been championed by, among others, ODI founders Tim Berners-Lee and Nigel Shadbolt. They have knighthoods and Sir Tim is the inventor of the World Wide Web! They must know what they're talking about.
But, over the years, others have disagreed (Exhibit A, Exhibit B). If you have an appetite for this sort of thing, just put "is the semantic web dead?" into your search engine of choice and you'll get many counter-opinions.
Without stoking any fires, here's a quick, hopefully-neutral summary:
Our ETL pipeline for data orchestration is in C# and uses an open-source library called dotNetRDF.
But, in the vein of making data easier to access and understand, can we easily work with RDF data in python?
To find out, I first try to load some data from a pandas dataframe into an RDF store (Part 1).
Then I try to query an RDF store and try to turn the results into pandas dataframes (Part 2).
import rdfpandas.graph
import pandas as pd
import rdflib
import requests
df = pd.DataFrame(data = {
'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['https://id.parliament.uk/schema/House', 'https://id.parliament.uk/schema/House'],
'https://id.parliament.uk/schema/houseName': ['House of Commons', 'House of Lords'],
'https://id.parliament.uk/schema/househasHouseSeat': ['https://id.parliament.uk/Z7YQPdng', ""],
},
index=['https://id.parliament.uk/1AFu55Hs', 'https://id.parliament.uk/WkUWUBMx'])
df
| | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | https://id.parliament.uk/schema/houseName | https://id.parliament.uk/schema/househasHouseSeat |
|---|---|---|---|
| https://id.parliament.uk/1AFu55Hs | https://id.parliament.uk/schema/House | House of Commons | https://id.parliament.uk/Z7YQPdng |
| https://id.parliament.uk/WkUWUBMx | https://id.parliament.uk/schema/House | House of Lords | |
graph = rdfpandas.graph.to_graph(df)
payload = graph.serialize(format='turtle')
print(type(payload))
payload
<class 'bytes'>
b'@prefix ns1: <https://id.parliament.uk/schema/> .\n@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n@prefix xml: <http://www.w3.org/XML/1998/namespace> .\n@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n\n<https://id.parliament.uk/1AFu55Hs> a "https://id.parliament.uk/schema/House" ;\n ns1:houseName "House of Commons" ;\n ns1:househasHouseSeat "https://id.parliament.uk/Z7YQPdng" .\n\n<https://id.parliament.uk/WkUWUBMx> a "https://id.parliament.uk/schema/House" ;\n ns1:houseName "House of Lords" ;\n ns1:househasHouseSeat "" .\n\n'
This example uses a free version of GraphDB 8.5.0.
If you're interested in the history, this particular RDF store has been developed since the early 2000s by Ontotext, a company in Sofia, Bulgaria. It was previously called OWLIM. It supports the open-source RDF framework [Sesame](https://en.wikipedia.org/wiki/Sesame_(framework)), which later became RDF4J. They have blogged about UK Parliament's data service. At the time of writing, db-engines ranks them as the #7 RDF store.
See the db-engines link for alternative triplestores.
r = requests.get("http://localhost:7200/repositories/Master/size")
print("Number of statements in repository: ", r.text)
from IPython.display import Image
Image(filename='triplestore_before.png')
Number of statements in repository: 0
url = 'http://localhost:7200/repositories/Master/statements'
headers = {'Content-Type':'text/turtle'}
r = requests.request(method='POST', url=url, data=payload, headers=headers)
print("HTTP response code: ", r.status_code)
HTTP response code: 204
r = requests.get("http://localhost:7200/repositories/Master/size")
print("Number of statements in repository: ", r.text)
from IPython.display import Image
Image(filename='triplestore_after.png')
Number of statements in repository: 6
There was previous talk of adding RDF support for linked datasets into pandas, which did not happen (see thread). There have been attempts to develop such support in supplementary packages, funnily enough named `pandasrdf` and `rdfpandas`, neither of which currently has an active development community around it.
The above example uses the latter. There are indeed some issues here, the major ones being that the package treats IRIs as literals, and that it doesn't handle nulls/blanks correctly (see the RDF specs for more on IRIs). You can see both problems in the Turtle output above: the type and house-seat IRIs are serialized as quoted strings, and the missing house seat becomes an empty literal.
Suffice it to say there isn't currently a great ecosystem of support for RDF in python, at least compared to what dotNetRDF offers in C#. There are better ways of orchestrating data into a triplestore than with a python pipeline.
That said, the above demonstrates the basic feasibility of converting relational data into RDF and loading it into a triplestore in python, notwithstanding issues when the pipeline scales.
Perhaps more usefully, how about the other way round?
This uses the `SPARQLWrapper` package from RDFLib. I use this package as it recently added support for custom HTTP headers (commit).
The UK Parliament SPARQL endpoint is read-only, rate-limited, and currently under development; it requires a subscription key, which is passed as a header in the HTTP POST request.
from SPARQLWrapper import SPARQLWrapper2
wrapper = SPARQLWrapper2("https://api.parliament.uk/Live/sparql-endpoint/master?")
wrapper.customHttpHeaders = {'Content-Type': 'application/sparql-query', 'Ocp-Apim-Subscription-Key': 'INSERT_KEY'}
from SPARQLWrapper import XML, GET, POST, JSON, JSONLD, N3, TURTLE, RDF, SELECT, INSERT, RDFXML, CSV, TSV
from SPARQLWrapper import URLENCODED, POSTDIRECTLY
wrapper.setQuery("""
PREFIX : <https://id.parliament.uk/schema/>
SELECT ?displayAs ?houseName ?weblink
WHERE {
?person a :Person .
OPTIONAL { ?person :personGivenName ?givenName } .
OPTIONAL { ?person :personFamilyName ?familyName } .
OPTIONAL { ?person <http://example.com/F31CBD81AD8343898B49DC65743F0BDF> ?displayAs } .
OPTIONAL { ?person :personHasPersonWebLink ?weblink } .
?person :memberHasParliamentaryIncumbency ?incumbency .
FILTER NOT EXISTS { ?incumbency a :PastParliamentaryIncumbency . }
FILTER NOT EXISTS {
?incumbency :incumbencyHasIncumbencyInterruption ?interruption.
FILTER NOT EXISTS {
?interruption :endDate ?end.
}
}
?incumbency :seatIncumbencyHasHouseSeat ?houseSeat .
?houseSeat :houseSeatHasHouse ?house .
?house :houseName ?houseName .
FILTER regex(str(?weblink), "twitter", "i")
}
ORDER BY ?houseName
""")
wrapper.setMethod(POST)
wrapper.setRequestMethod(POSTDIRECTLY)
results = wrapper.query().convert()
df = pd.DataFrame(results.bindings)
df.head()
| | displayAs | houseName | weblink |
|---|---|---|---|
| 0 | Value(literal:'Adam Afriyie') | Value(literal:'House of Commons') | Value(uri:'https://twitter.com/AdamAfriyie') |
| 1 | Value(literal:'Afzal Khan') | Value(literal:'House of Commons') | Value(uri:'https://twitter.com/Afzal4Gorton') |
| 2 | Value(literal:'Alan Brown') | Value(literal:'House of Commons') | Value(uri:'https://twitter.com/alanbrownsnp') |
| 3 | Value(literal:'Alan Mak') | Value(literal:'House of Commons') | Value(uri:'https://twitter.com/AlanMakMP') |
| 4 | Value(literal:'Albert Owen') | Value(literal:'House of Commons') | Value(uri:'https://twitter.com/AlbertOwenMP') |
def extract(binding):
return binding.value
df.applymap(extract).head()
| | displayAs | houseName | weblink |
|---|---|---|---|
| 0 | Adam Afriyie | House of Commons | https://twitter.com/AdamAfriyie |
| 1 | Afzal Khan | House of Commons | https://twitter.com/Afzal4Gorton |
| 2 | Alan Brown | House of Commons | https://twitter.com/alanbrownsnp |
| 3 | Alan Mak | House of Commons | https://twitter.com/AlanMakMP |
| 4 | Albert Owen | House of Commons | https://twitter.com/AlbertOwenMP |
Besides `SELECT`, `SPARQLWrapper` also supports `CONSTRUCT`, `ASK`, and `DESCRIBE` queries (docs).
Technically, in the above example, we still had to convert the JSON response into a pandas dataframe. A package called `gastrodon` (like the Pokémon...) has even better pandas dataframe integration.
But it currently doesn't support custom HTTP headers when calling a remote SPARQL endpoint.
from gastrodon import RemoteEndpoint,QName,ttl,URIRef,inline
Here, my example uses the MusicBrainz SPARQL endpoint.
They are fairly random queries: the first returns records with a high track count (20 or more); the second shows that, apparently, the most common first letter of artist names is `T`.
Depending on subject interest, there are many other SPARQL endpoints from a wide range of organizations and data publishers. They vary quite a bit: some are built in-house and some are outsourced, with different backends as well as levels of maintenance. Some vendors may not support all parts of the SPARQL 1.1 specification, and some may have extensions for additional capability, but the basics should generally be supported.
If we have a) an endpoint that supports federated SPARQL queries and b) datasets whose resources are linked with predicates like `owl:sameAs`, we can start to expand the scope of our questions thanks to (relatively easy) data linkage between datasets.
prefixes=inline("""
@prefix mo: <http://purl.org/ontology/mo/>.
@prefix mbz: <http://purl.org/ontology/mbz#>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix bio: <http://purl.org/vocab/bio/0.1/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix tags: <http://www.holygoat.co.uk/owl/redwood/0.1/tags/>.
@prefix geo: <http://www.geonames.org/ontology#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix lingvoj: <http://www.lingvoj.org/ontology#>.
@prefix rel: <http://purl.org/vocab/relationship/>.
@prefix vocab: <http://dbtune.org/musicbrainz/resource/vocab/>.
@prefix event: <http://purl.org/NET/c4dm/event.owl#>.
@prefix map: <file:/home/moustaki/work/motools/musicbrainz/d2r-server-0.4/mbz_mapping_raw.n3#>.
@prefix db: <http://dbtune.org/musicbrainz/resource/>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix dc: <http://purl.org/dc/elements/1.1/>.
""").graph
endpoint=RemoteEndpoint(
"http://dbtune.org/musicbrainz/sparql",
prefixes=prefixes,
base_uri="http://dbtune.org/musicbrainz/resource/"
)
df=endpoint.select("""
SELECT ?artistName ?recordTitle ?tracks ?language ?coverArt
WHERE {
?record a mo:Record;
dc:title ?recordTitle;
foaf:maker ?s;
dc:language ?lang;
vocab:albummeta_coverarturl ?coverArt;
vocab:tracks ?tracks.
?lang lingvoj:iso2b ?language.
?s a mo:MusicArtist;
foaf:name ?artistName.
FILTER (?tracks>=20)
}
ORDER BY DESC(?tracks)
""")
df
| | artistName | recordTitle | tracks | language | coverArt |
|---|---|---|---|---|---|
| 0 | 池頼広 | ソニックX オリジナルサウンドトラックス | 40 | jpn | http://ec1.images-amazon.com/images/P/B0001AEL... |
| 1 | Fell Venus | @ | 37 | eng | |
| 2 | Various Artists | なるたる | 30 | jpn | http://ec1.images-amazon.com/images/P/B0000C9V... |
| 3 | 田中公平 | おたくのビデオ | 30 | jpn | http://ec1.images-amazon.com/images/P/B000UVES... |
| 4 | 松谷卓 | のだめカンタービレ | 26 | jpn | http://ec1.images-amazon.com/images/P/B000MZHT... |
| 5 | 菅野よう子 | エスカフローネ | 26 | jpn | http://images-jp.amazon.com/images/P/B00005072... |
| 6 | 近藤浩治 | ☆スーパーマリオ☆ ヨッシーアイランド オリジナル・サウンド ヴァージョン | 26 | jpn | http://ec1.images-amazon.com/images/P/B00005FN... |
| 7 | RUX | 우린 어디로 가는가 | 25 | kor | |
| 8 | 조영욱 | 싸이보그지만 괜찮아 | 25 | kor | http://ec1.images-amazon.com/images/P/B000LYE8... |
| 9 | Various Artists | おもひでぽろぽろ | 24 | jpn | http://ec1.images-amazon.com/images/P/B00005GF... |
| 10 | Various Artists | うたわれるもの オリジナルサウンドトラック | 24 | jpn | http://ec1.images-amazon.com/images/P/B000065E... |
| 11 | 菅野祐悟 | ホタルノヒカリ | 24 | jpn | http://ec1.images-amazon.com/images/P/B000TCZ7... |
| 12 | 辻陽 | トリック オリジナル・サウンドトラック | 24 | jpn | http://ec1.images-amazon.com/images/P/B00005HL... |
| 13 | 増田順一 | ポケモンひけるかな? | 21 | jpn | http://ec1.images-amazon.com/images/P/B000034C... |
| 14 | 久石譲 | となりのトトロ | 20 | jpn | http://ec1.images-amazon.com/images/P/B00004RC... |
| 15 | 藤圭子 | ゴールデン★ベスト | 20 | jpn | http://ec1.images-amazon.com/images/P/B000B63E... |
We can see `gastrodon` has the nice feature of returning a usable `pandas` dataframe straight from the query.
It also integrates gracefully with displaying namespaces and turns the `GROUP BY` variable into the dataframe index, as shown below.
Compared to `SPARQLWrapper`, this saves data-wrangling time.
df=endpoint.select("""
SELECT ?firstLetter (COUNT (?s) as ?n_artists)
WHERE {
?s a mo:MusicArtist;
foaf:name ?artistName.
BIND(UCASE(SUBSTR(?artistName, 1, 1)) AS ?firstLetter)
}
GROUP BY ?firstLetter ORDER BY DESC(?n_artists)
""")
df.head()
| firstLetter | n_artists |
|---|---|
| T | 93 |
| S | 87 |
| M | 82 |
| J | 70 |
| D | 63 |
%matplotlib inline
df.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x19dcc7de550>
The author of `gastrodon` has written a bunch of interesting notebooks for both local and remote endpoints here.
For more, there is also the SPARQL Jupyter kernel (see the example from the Semantic Web London meetup). However, as things stand, the python packages are more user-friendly than the SPARQL kernel, as they allow you to put both your SPARQL queries and python code in the same notebook.
Based on the above, it seems the answer is yes, we can work with RDF data in python. The tooling is imperfect, but it is developing.
Why might working with RDF data in python be desirable anyway?
If the goal is wider adoption of W3C's semantic web standards and recommendations (RDF, OWL, SPARQL, SKOS etc), then better integration with popular scripting and data analysis languages like python definitely wouldn't hurt.
It's also worth adding that while python remains one of the most popular tools for data science, some triplestores are adding native ML capabilities (e.g. Stardog's Predictive Analytics API allows for easier model building and graph-native NLP capabilities), which is quite exciting. There may be concerns about reproducibility when RDF data changes. This may be addressed by reification (statements about statements) and provenance (information about the people and activities that produced the data), which are used in RDF for versioning data and boosting reliability and trustworthiness.