
British National Bibliography - Linked Data

The British National Bibliography (BNB) Linked Data Platform provides access to the British National Bibliography, a comprehensive database detailing books and serials published in the UK since 1950.

This notebook will show how to start constructing queries over this service using SPARQL and then parsing the returned data into a pandas DataFrame.

The SPARQL endpoint for the service can be found at: http://bnb.data.bl.uk/sparql

A bulk data download of the Linked Data is also available.

Example queries can be found here:

Getting Started with the BNB as Linked Open Data
BNB example queries [Leigh Dodds]

In [ ]:

#Install a library to help us run some SPARQL queries if we haven't already installed it
#http://rdflib.github.io/sparqlwrapper/
!pip3 uninstall -y sparqlwrapper
!pip3 install sparqlwrapper

#NOTE: if you find the SPARQL queries slowing down, or throwing an error message, try the following:
## 1) Save your notebook.
## 2) Close it.
## 3) Shut it down.
## 4) Restart the notebook.
#This should reset sparqlwrapper.
# You will need to run the cells again to load packages, reset state etc., because you will have started a new IPython process.

In [ ]:

#Import the necessary packages
from SPARQLWrapper import SPARQLWrapper, JSON

In [ ]:

#Declare the BNB endpoint
endpoint="http://bnb.data.bl.uk/sparql"
sparql = SPARQLWrapper(endpoint)

In [ ]:

#My experience of SPARQL is that things work then they don't and you have no idea which bit is broken
#This test should work. It really should. It has before. And it shouldn't take too long.
#It comes from http://bnb.data.bl.uk/getting-started
q='''PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX bio: <http://purl.org/vocab/bio/0.1/>
PREFIX blt: <http://www.bl.uk/schemas/bibliographic/blterms#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX isbd: <http://iflastandards.info/ns/isbd/elements/>
PREFIX org: <http://www.w3.org/ns/org#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rda: <http://rdvocab.info/ElementsGr2/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?book ?bnb ?title WHERE {
    #Match the book by ISBN
    ?book bibo:isbn13 "9780729408745";
    #bind some variables to its other attributes
    blt:bnb ?bnb;
    dct:title ?title. }'''
sparql.setQuery(q)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
results

Assuming that the above test - using a provided example query - works, we can start to construct our own queries.

We'll follow the suggested approach of building up a comprehensive prefix statement that we can make use of in any query to the endpoint. If we come across further useful prefixes, we can add them in.

In [ ]:

#Declare a standard, if exhaustive, list of prefixes we can apply to each query
#Don't leave white space on the left hand side...
prefix='''
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX bio: <http://purl.org/vocab/bio/0.1/>
PREFIX blt: <http://www.bl.uk/schemas/bibliographic/blterms#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX isbd: <http://iflastandards.info/ns/isbd/elements/>
PREFIX org: <http://www.w3.org/ns/org#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rda: <http://rdvocab.info/ElementsGr2/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    
'''
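
If we expect to keep adding prefixes, one option is to hold them in a dict and generate the prefix statement from it. The following is only a sketch: `namespaces`, `make_prefix` and `short_prefix` are names made up for this illustration, and only a handful of the prefixes listed above are included.

```python
# Hypothetical helper: build a PREFIX block from a dict of namespaces
# (a subset of the prefixes declared above, for illustration only)
namespaces = {
    "bibo": "http://purl.org/ontology/bibo/",
    "blt": "http://www.bl.uk/schemas/bibliographic/blterms#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

def make_prefix(namespaces):
    ''' Return one "PREFIX name: <uri>" line per namespace, each newline terminated '''
    return "".join("PREFIX {0}: <{1}>\n".format(name, uri)
                   for name, uri in namespaces.items())

# Adding a further prefix is then a one-line change
namespaces["event"] = "http://purl.org/NET/c4dm/event.owl#"

short_prefix = make_prefix(namespaces)
print(short_prefix)
```

The generated string can be used in the same way as the hand-written `prefix` string, by prepending it to a query.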

Reviewing some of the other example queries, we can identify some useful query fragments.

For example, we note that a book may have a creator, and a title; and that a creator may have a name. If we piece these together, we should be able to get the titles of books created by a particular person.

In [ ]:

#Let's just test a simple query
#Search for books by author name
q='''
SELECT DISTINCT ?book ?title WHERE {
    ?book dct:creator ?author ;
        dct:title ?title.
    ?author foaf:name "Iain Banks".
} LIMIT 5
'''

In [ ]:

#Run the query, parse the response as JSON, and assign the result to a variable
sparql.setQuery(prefix+q)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

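For a SELECT query, the data comes back as a table of variable bindings (one set of bindings per result row) rather than as raw triples, represented using the SPARQL JSON results format.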

In [ ]:

#Here's what the response looks like
results
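
To make the layout of the response easier to see, here is a small hand-written example in the same shape. The values are placeholders, not real BNB data, and the names `sample`, `cols` and `rows` are made up for this sketch. Note that the `head` element lists the variables named in the SELECT clause, so we could read the column names from the response rather than declaring them ourselves.

```python
# Hand-written sample in the SPARQL JSON results layout (placeholder values, not real BNB data)
sample = {
    "head": {"vars": ["book", "title"]},
    "results": {"bindings": [
        {"book": {"type": "uri", "value": "http://example.org/book/1"},
         "title": {"type": "literal", "value": "An Example Title"}},
        {"book": {"type": "uri", "value": "http://example.org/book/2"},
         "title": {"type": "literal", "value": "Another Example Title"}},
    ]},
}

# The variable names requested in the SELECT clause
cols = sample["head"]["vars"]

# One row per set of bindings; a variable that is unbound in a row is absent from it,
# so we fall back to None in that case
rows = [[b[c]["value"] if c in b else None for c in cols]
        for b in sample["results"]["bindings"]]

for row in rows:
    print(row)
```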

In [ ]:

#Let's specify the response columns we want to display
answerCols=['book','title']

In [ ]:

#We can then iterate through these
for result in results["results"]["bindings"]:
    for ans in answerCols:
        print(result[ans]['value'], end=" ")
    print()

In [ ]:

#Let's make a function to handle that a little more tidily
def printResults(results,ansCols):
    ''' Print the required results column values from the SPARQL query '''
    for result in results["results"]["bindings"]:
        for ans in ansCols:
            print(result[ans]['value'], end=" ")
        print()

In [ ]:

printResults(results,answerCols)

In [ ]:

#Let's do a little more wrapping
def runQuery(endpoint,prefix,q):
    ''' Run a SPARQL query with a declared prefix over a specified endpoint '''
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(prefix+q)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

def queryResults(endpoint,prefix,q,ansCols):
    ''' Run a SPARQL query with a declared prefix over a specified endpoint and print the required results columns '''
    results=runQuery(endpoint,prefix,q)
    printResults(results,ansCols)

In [ ]:

queryResults(endpoint,prefix,q,answerCols)

In [ ]:

#Let's see what the results look like
results

In [ ]:

#Some endpoints will return data in other formats, for example flattened as a CSV data table
#We can flatten the data ourselves in an ad hoc way and get it into a pandas DataFrame

import pandas as pd

#pandas may have a better way of doing this?!
data=[]
for result in results["results"]["bindings"]:
    tmp={}
    for el in result:
        tmp[el]=result[el]['value']
    data.append(tmp)
    #Note that we lose the type information which we could have used to type the columns in the final dataframe

df = pd.DataFrame(data)
df
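
As a sketch of how the type information might be kept, the following converts each value using its `datatype` when one is declared. This is an illustration under assumptions rather than a complete solution: `CASTS`, `typedValue` and `typedRows` are names invented here, the mapping covers only a few XSD numeric types, and the `pages` variable in the sample is hypothetical.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Assumed mapping from a few XSD datatypes to Python constructors
CASTS = {
    XSD + "integer": int,
    XSD + "int": int,
    XSD + "decimal": float,
    XSD + "double": float,
}

def typedValue(binding):
    ''' Convert one binding to a Python value using its datatype, if we recognise it '''
    cast = CASTS.get(binding.get("datatype"), str)
    return cast(binding["value"])

def typedRows(results):
    ''' Flatten the bindings into a list of dicts, keeping numeric types where declared '''
    return [{var: typedValue(b) for var, b in row.items()}
            for row in results["results"]["bindings"]]

# Placeholder response (not real BNB data); "pages" is a hypothetical variable
sample = {"results": {"bindings": [
    {"title": {"type": "literal", "value": "An Example Title"},
     "pages": {"type": "literal", "datatype": XSD + "integer", "value": "288"}},
]}}

typed = typedRows(sample)
print(typed)
```

The resulting list of dicts can be passed to `pd.DataFrame()` in the same way as the `data` list above.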

In [ ]:

#Let's wrap everything up
def dict2df(results):
    ''' Flatten the SPARQL query results into a pandas DataFrame, with one column per query variable '''
    data=[]
    for result in results["results"]["bindings"]:
        tmp={}
        for el in result:
            tmp[el]=result[el]['value']
        data.append(tmp)

    df = pd.DataFrame(data)
    return df

def dfResults(endpoint,prefix,q):
    ''' Generate a data frame containing the results of running
        a SPARQL query with a declared prefix over a specified endpoint '''
    return dict2df( runQuery( endpoint, prefix, q ) )

In [ ]:

dfResults(endpoint,prefix,q)

Learning More About a Resource

A good way of inspecting the properties associated with a resource is to use the DESCRIBE command.

In [ ]:

q='DESCRIBE ?book WHERE { ?book bibo:isbn10 "1857232356" }'
ans=runQuery(endpoint,prefix,q)
ans

Trying to run this query gives me the following warning:

Format requested was JSON, but RDF/XML (application/rdf+xml;charset=UTF-8) has been returned by the endpoint

For some reason, the endpoint doesn't seem to want to provide a JSON represented result for the DESCRIBE query - so we will have to handle the RDF response ourselves.

If we try to serialise this response as a set of N-Triples, we are presented with a bytestream.

In [ ]:

ans.serialize(format="nt")

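We can get a clearer(?!) view by decoding the bytestream as a UTF-8 string and then printing the result. (More recent versions of rdflib return a string directly, so we only decode if we have actually been given bytes.)

In [ ]:

nt=ans.serialize(format="nt")
print(nt.decode("utf-8") if isinstance(nt,bytes) else nt) #newer rdflib versions return a str, so only decode bytes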

In [ ]:

#For convenience, let's just bundle that up in case we need to call it again
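def printDesc(endpoint,prefix,q):
    ''' Run a DESCRIBE query and print the response as N-Triples '''
    ans=runQuery(endpoint,prefix,q)
    nt=ans.serialize(format="nt")
    #Only decode if we were given bytes (newer rdflib versions return a str)
    print(nt.decode("utf-8") if isinstance(nt,bytes) else nt)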

In [ ]:

q='DESCRIBE ?book WHERE { ?book bibo:isbn10 "1857232356" }'
printDesc(endpoint,prefix,q)

Let's pull back some information about a book with a given ISBN:

In [ ]:

q='''
SELECT ?book ?bnb ?publicationEvent ?title ?creator WHERE {
    #Match the book by ISBN
    ?book bibo:isbn10 "1857232356";
    
        #bind some variables to other attributes of the work
        
        #Get the British National Bibliography number
        blt:bnb ?bnb;
        
        #Identify the publication event associated with this work
        blt:publication ?publicationEvent;
        
        #Identify the title of the work
        dct:title ?title;
        
        #Identify the creator of the work
        dct:creator ?creator.
    }
'''
runQuery(endpoint,prefix,q)

What other sorts of thing are we able to find out about the creator?

In [ ]:

q='''
SELECT DISTINCT ?property 
where {
    ?book bibo:isbn10 "1857232356";
        dct:creator ?creator.
    ?creator ?property ?x
}
'''
runQuery(endpoint,prefix,q)

Do you think you might be able to use any of this information to add the author's name into the response?

In [ ]:

q='''
SELECT ?book ?isbn10 ?bnb ?title ?author WHERE {
    #Match the book by ISBN
    ?book bibo:isbn10 "1857232356";
    
        #bind some variables to its other attributes
        blt:bnb ?bnb;
        dct:title ?title;
        bibo:isbn10 ?isbn10;
    
        dct:creator ?creator.
        
    ?creator foaf:name ?author.
    }
'''
dfResults(endpoint,prefix,q)

Now see what you can find out about the publication event.

In [ ]:

#YOUR INVESTIGATION HERE

Here's what I found:

In [ ]:

q='''
SELECT DISTINCT ?a ?b WHERE {
    <http://bnb.data.bl.uk/id/resource/012701972/publicationevent/LondonOrbit1994> ?a ?b.
}
'''
dfResults(endpoint,prefix,q)

What can we learn about the date?

In [ ]:

q='''
SELECT DISTINCT ?a ?b WHERE {
    <http://reference.data.gov.uk/id/year/1994> ?a ?b.
}
'''
dfResults(endpoint,prefix,q)

Putting these various pieces together, I should now be able to search for books by Iain Banks, using a FILTER expression to prune the results so that only books published between two dates are shown.

We can further tidy the way the results are presented by ordering them by publication date using the ORDER BY clause.

In [ ]:

q='''
SELECT DISTINCT ?book ?title ?date WHERE {
    #Find books by 'Iain Banks':
    ?book dct:creator ?author ;
        dct:title ?title.
    ?author foaf:name "Iain Banks".
    
    #Find when they were published:
    ?book blt:publication ?publicationEvent.
    ?publicationEvent event:time ?eventTime.
    ?eventTime rdfs:label ?date.
    
    #Look for books published between 1985 and 1990
    FILTER (?date>="1985" && ?date<"1990")
} ORDER BY ?date 
'''
dfResults(endpoint,prefix,q)
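
To see what the FILTER and ORDER BY parts of the query are doing, here is a rough client-side equivalent in plain Python. It assumes the `?date` labels are plain four-digit year strings, in which case comparing them as strings gives the same answer as comparing them as numbers. The rows are placeholders, not real BNB data, and `betweenDates` is a name made up for this sketch.

```python
# Placeholder rows in the shape of the flattened query results (not real BNB data)
rows = [
    {"title": "Example C", "date": "1992"},
    {"title": "Example A", "date": "1987"},
    {"title": "Example B", "date": "1985"},
    {"title": "Example D", "date": "1990"},
]

def betweenDates(rows, fromDate, toDate):
    ''' Keep rows with fromDate <= date < toDate (string comparison), ordered by date '''
    kept = [r for r in rows if fromDate <= r["date"] < toDate]
    return sorted(kept, key=lambda r: r["date"])

# Same bounds as the FILTER in the query: from 1985 up to, but not including, 1990
selected = betweenDates(rows, "1985", "1990")
print([r["title"] for r in selected])
```

Doing the filtering in the query itself is usually preferable, because the endpoint then only has to return the rows we actually want.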

When you have constructed a useful query, you might consider wrapping it within a reusable function. For example:

In [ ]:

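def getBooksByAuthorBetweenDates(author,fromDate,toDate):
    ''' Find books by a named author published between fromDate and toDate (inclusive), ordered by date '''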
    q='''
            SELECT DISTINCT ?book ?title ?date WHERE {{
                #Find books by name:
                ?book dct:creator ?author ;
                    dct:title ?title.
                ?author foaf:name "{0}".

                #Find when they were published:
                ?book blt:publication ?publicationEvent.
                ?publicationEvent event:time ?eventTime.
                ?eventTime rdfs:label ?date.

                #Look for books published between dates
                FILTER (?date>="{1}" && ?date<="{2}")
            }} ORDER BY ?date
        '''.format(author,fromDate,toDate)
    return dfResults(endpoint,prefix,q)

getBooksByAuthorBetweenDates("Terry Pratchett","1985","1987")
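
One thing to watch when dropping values into a query with `.format()` is that a name containing a double quote or a backslash would break the quoted string in the query. A minimal sketch of one way to guard against that is shown below. `escapeLiteral` and `template` are hypothetical names introduced here, and only the most common escapes are handled.

```python
def escapeLiteral(text):
    ''' Escape characters that would otherwise end or corrupt a double-quoted SPARQL string '''
    # Backslashes must be escaped first, so that the escapes added below are not doubled up
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))

# A one-line fragment of the query template used above
template = '?author foaf:name "{0}".'

# An ordinary name passes through unchanged
print(template.format(escapeLiteral("Iain Banks")))

# Embedded quotes are escaped rather than ending the string early
print(template.format(escapeLiteral('An "Awkward" Name')))
```

Inside a function such as `getBooksByAuthorBetweenDates()`, the same idea would be applied by passing `escapeLiteral(author)` to `.format()` in place of `author`.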

What Next?

This notebook has shown how to start working with the British National Bibliography Linked Data platform.