This Jupyter notebook is a step-by-step guide to the extraction of RDF graphs from TEI/XML documents using lxml and RDFLib, as suggested by LIFT.
LIFT is an open-source web application based entirely on Python. The aim of LIFT is to demonstrate how RDF graphs, supported by widely adopted ontological vocabularies, can be extracted from TEI/XML documents.
This notebook will show you how to leverage the lxml.etree library to parse TEI/XML documents and the RDFLib library to build RDF statements using the information extracted from the TEI input file.
TEI/XML - the standard vocabulary for textual encoding in the humanities https://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html
lxml.etree - a Python library for XML processing
RDFLib - a Python library for working with RDF https://rdflib.readthedocs.io/en/stable/index.html
Firstly, if you do not already have it, install lxml onto your computer by following the instructions provided at this link: https://lxml.de/installation.html.
Do the same for RDFLib. Information on how to install the library is available at https://rdflib.readthedocs.io/en/stable/gettingstarted.html.
The following blocks of code are ideally stored in a single Python file, which you can create and name something like TEItoRDF.py. Alternatively, remember that you can download this Jupyter notebook as a Python file by clicking on File > Download as > Python (.py). Let's go!
Starting with an empty Python file, we begin by importing lxml.etree (a library for processing XML using Python, cf. section 1) into our script:
from lxml import etree
To read from a TEI/XML file (hereafter referred to as 'input' or 'TEI document'), we use the parse() function:
tree = etree.parse('input-test.xml')
Make sure to specify the correct path. In this case, the file input-test.xml is stored in the current folder. For a basic introduction to paths see https://www.w3schools.com/html/html_filepaths.asp.
In order to retrieve the root element of the TEI document (i.e. input-test.xml), we use the function getroot() and store the result in the 'root' variable:
root = tree.getroot()
We also assign the values of the TEI attributes @xml:base and @xml:id, which are attached to the root element of the TEI document, to the variables 'base_uri' and 'edition_id' respectively. These will come in handy when generating entity URIs.
In order to retrieve the attributes we leverage the get() function (note how we substituted the prefix 'xml' with the actual namespace; this is the canonical way of working with attributes belonging to the xml namespace in lxml):
base_uri = root.get('{http://www.w3.org/XML/1998/namespace}base')
edition_id = root.get('{http://www.w3.org/XML/1998/namespace}id')
We then bind the TEI namespace to the prefix 'tei' (we will use this later to refer to TEI elements) as follows:
tei = {'tei': 'http://www.tei-c.org/ns/1.0'}
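The parsing steps above can be tried without any file on disk. The following is a minimal sketch that parses a tiny TEI document from a string; the sample markup and the constant names XML_NS and TEI_NS are made up for illustration and are not part of LIFT.

```python
from lxml import etree

# Clark notation prefix for attributes in the reserved xml namespace.
XML_NS = '{http://www.w3.org/XML/1998/namespace}'
TEI_NS = {'tei': 'http://www.tei-c.org/ns/1.0'}

# Hypothetical miniature TEI document, used only for illustration.
sample = b'''<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xml:base="https://example.org" xml:id="demo-edition">
  <teiHeader/>
</TEI>'''

root = etree.fromstring(sample)

# Attributes in the xml namespace are read with the full namespace URI.
base_uri = root.get(XML_NS + 'base')
edition_id = root.get(XML_NS + 'id')

# Elements in the TEI namespace are found through the prefix mapping.
header = root.find('./tei:teiHeader', TEI_NS)

print(base_uri, edition_id, header.tag)
```

Keeping the namespace URI in a constant avoids retyping the long string every time an @xml:id or @xml:lang is read.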
Firstly, we import the Graph, Literal, BNode, Namespace and URIRef classes from RDFLib as follows:
from rdflib import Graph, Literal, BNode, Namespace, URIRef
Secondly, we declare the namespaces of the ontological vocabularies that are going to provide the semantics of the resulting RDF graph. Some namespaces are available by direct import from RDFLib, so we can simply type:
from rdflib.namespace import RDF, RDFS, XSD, DCTERMS, OWL
Any other namespace is to be declared in the following way (these are the ontologies used in LIFT):
agrelon = Namespace("https://d-nb.info/standards/elementset/agrelon#")
crm = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
frbroo = Namespace("http://iflastandards.info/ns/fr/frbr/frbroo/")
pro = Namespace("http://purl.org/spar/pro/")
proles = Namespace("http://www.essepuntato.it/2013/10/politicalroles/")
prov = Namespace("http://www.w3.org/ns/prov#")
ti = Namespace("http://www.essepuntato.it/2012/04/tvc/")
An RDFLib graph is a set of RDF triples. We declare our output graph and name it 'g':
g = Graph()
Using the function bind(), we bind each of our namespaces to a prefix:
g.bind("agrelon", agrelon)
g.bind("crm", crm)
g.bind("frbroo", frbroo)
g.bind("dcterms", DCTERMS)
g.bind("owl", OWL)
g.bind("pro", pro)
g.bind("proles", proles)
g.bind("prov", prov)
g.bind("ti", ti)
In order to iterate through all <person> elements in the TEI document, we use the lxml findall() method, which takes as an argument an expression in a simple XPath-like language called ElementPath and returns a list of matching elements (line 1). Then, for each person in the TEI document, we:
- retrieve the value of the person's @xml:id (line 2).
- build a URI for the person by appending the @xml:id to the base URI. In order to make clear what kind of resource the URI represents, we also add the directory /person/ before the actual person's @xml:id (line 3).
- add an RDF triple to the graph: the subject is the person's URI, the predicate is rdf:type, the object is crm:E21_Person. This triple states that the person belongs to the class http://www.cidoc-crm.org/cidoc-crm/E21_Person (line 4).
We suggest that you keep the TEI document input-test.xml within sight, to better grasp how the extraction script works.
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    g.add((person_uri, RDF.type, crm.E21_Person))
Now, run the following print() functions to print out the set of triples just generated (this is just a test, which you can run at any time during this tutorial; at the end, we will print out the RDF graph to a file). RDFLib allows us to choose among different serialization formats, such as xml, n3, and nt:
print('RDF/XML serialization:\n')
print(g.serialize(format='xml'))
print('Notation3 serialization:\n')
print(g.serialize(format='n3'))
print('N-triples serialization:\n')
print(g.serialize(format='nt'))
RDF/XML serialization:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
  <rdf:Description rdf:about="https://example.org/person/Aristot">
    <rdf:type rdf:resource="http://www.cidoc-crm.org/cidoc-crm/E21_Person"/>
  </rdf:Description>
  <rdf:Description rdf:about="https://example.org/person/Socr">
    <rdf:type rdf:resource="http://www.cidoc-crm.org/cidoc-crm/E21_Person"/>
  </rdf:Description>
  <rdf:Description rdf:about="https://example.org/person/Criti">
    <rdf:type rdf:resource="http://www.cidoc-crm.org/cidoc-crm/E21_Person"/>
  </rdf:Description>
  <rdf:Description rdf:about="https://example.org/person/Xen">
    <rdf:type rdf:resource="http://www.cidoc-crm.org/cidoc-crm/E21_Person"/>
  </rdf:Description>
  <rdf:Description rdf:about="https://example.org/person/Plat">
    <rdf:type rdf:resource="http://www.cidoc-crm.org/cidoc-crm/E21_Person"/>
  </rdf:Description>
</rdf:RDF>

Notation3 serialization:

@prefix agrelon: <https://d-nb.info/standards/elementset/agrelon#> .
@prefix crm: <http://www.cidoc-crm.org/cidoc-crm/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix frbroo: <http://iflastandards.info/ns/fr/frbr/frbroo/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix pro: <http://purl.org/spar/pro/> .
@prefix proles: <http://www.essepuntato.it/2013/10/politicalroles/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ti: <http://www.essepuntato.it/2012/04/tvc/> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://example.org/person/Aristot> a crm:E21_Person .
<https://example.org/person/Criti> a crm:E21_Person .
<https://example.org/person/Plat> a crm:E21_Person .
<https://example.org/person/Socr> a crm:E21_Person .
<https://example.org/person/Xen> a crm:E21_Person .

N-triples serialization:

<https://example.org/person/Aristot> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cidoc-crm.org/cidoc-crm/E21_Person> .
<https://example.org/person/Socr> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cidoc-crm.org/cidoc-crm/E21_Person> .
<https://example.org/person/Criti> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cidoc-crm.org/cidoc-crm/E21_Person> .
<https://example.org/person/Xen> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cidoc-crm.org/cidoc-crm/E21_Person> .
<https://example.org/person/Plat> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cidoc-crm.org/cidoc-crm/E21_Person> .
Moving on, we look for a @sameAs attribute provided on the <person> element. We expect this attribute to contain one or more URIs pointing to authority records such as VIAF, or to resources about the same person such as DBpedia:
- Using the get() function, we look for a @sameAs attribute and, if one is found, split its contents by whitespace (lines 4-6).
- We iterate through the URIs found in the @sameAs attribute (lines 7-11). We record the URIs in the variable 'same_as_uri' (line 9).
- For each URI, we add an RDF triple to the graph: the subject is the person, the predicate is owl:sameAs, the object is the URI retrieved from within the @sameAs attribute. For example, if a @sameAs attribute contains two URIs, two distinct RDF triples are added to the graph.
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    same_as = person.get('sameAs')
    if same_as is not None:
        same_as = same_as.split()
        i = 0
        while i < len(same_as):
            same_as_uri = URIRef(same_as[i])
            g.add((person_uri, OWL.sameAs, same_as_uri))
            i += 1
The next step is to provide each person entity with a human-readable label:
- We look for a <persName> element within <person> (lines 1-4).
- If one is found, we retrieve its text and check whether an @xml:lang attribute is also present (lines 6-7).
- If an @xml:lang is found, the script adds an RDF triple. The subject of such a triple is the person, the predicate is rdfs:label, and the object is a literal value (i.e. an xsd:string). A language declaration is also attached to the triple (e.g. xml:lang='en' for English) (lines 8-9).
- If no @xml:lang is found, the script creates an RDF triple without declaring any specific language (lines 10-11).
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    persname = person.find('./tei:persName', tei)
    if persname is not None:
        label = persname.text
        if persname.get('{http://www.w3.org/XML/1998/namespace}lang') is not None:
            label_lang = persname.get('{http://www.w3.org/XML/1998/namespace}lang')
            g.add((person_uri, RDFS.label, Literal(label, lang=label_lang)))
        else:
            g.add((person_uri, RDFS.label, Literal(label)))
In TEI, groups of related <person> elements (e.g. persons of the same type) are usually nested within a common <listPerson> element. The following script retrieves any @type or @corresp attributes on <listPerson>. These should contain a natural language description of the person's type or an authority record URI respectively:
- We retrieve the <listPerson> parent element (line 4).
- We look for the attributes @type and/or @corresp (lines 5-6).
- If a @type attribute was found, we add an RDF triple formed by the person's URI, the property dcterms:description and a literal value containing a natural language description of the person's type (lines 7-8).
- If a @corresp attribute was found, we add an RDF triple formed by the person's URI, the property dcterms:subject and a URI (ideally) of an authority record (lines 9-10).
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    listperson = person.getparent()
    perstype = listperson.get('type')
    perscorr = listperson.get('corresp')
    if perstype is not None:
        g.add((person_uri, DCTERMS.description, Literal(perstype)))
    if perscorr is not None and perscorr.startswith('http'):
        g.add((person_uri, DCTERMS.subject, URIRef(perscorr)))
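Navigating from an element to its parent is specific to lxml (the standard library ElementTree has no equivalent method). The sketch below shows getparent() on a made-up fragment and checks the parent's local name before reading its attributes; the sample markup and the authority URI are hypothetical.

```python
from lxml import etree

tei = {'tei': 'http://www.tei-c.org/ns/1.0'}

# Hypothetical fragment: one person inside a typed list.
sample = b'''<listPerson xmlns="http://www.tei-c.org/ns/1.0" type="philosopher"
    corresp="https://example.org/authority/philosophers">
  <person xml:id="Plat"/>
</listPerson>'''
root = etree.fromstring(sample)

person = root.find('./tei:person', tei)
parent = person.getparent()

# QName splits a qualified tag into namespace and local name.
local_name = etree.QName(parent).localname
pers_type = parent.get('type') if local_name == 'listPerson' else None

print(local_name, pers_type)
print(root.getparent())  # the root element has no parent
```

Checking the local name guards against <person> elements that are nested in something other than a <listPerson>.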
We may also be interested in extracting all references to a particular person in the text. The following script does precisely this:
- It looks for any <persName> element in the text whose @ref attribute corresponds to the @xml:id of the person (lines 4-5).
- It retrieves the parent element of each such <persName> and creates a unique URI for it (lines 6-8).
- It adds an RDF triple formed by the person's URI, the property dcterms:isReferencedBy, and the parent element's URI (line 9).
- It states that the parent element is a frbroo:F23_Expression_Fragment (cf. http://iflastandards.info/ns/fr/frbr/frbroo/F23) that is part of the TEI file (lines 10-11).
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    ref = './tei:text//tei:persName[@ref="#' + person_id + '"]'
    for referenced_person in root.findall(ref, tei):
        parent = referenced_person.getparent()
        parent_id = parent.get('{http://www.w3.org/XML/1998/namespace}id')
        parent_uri = URIRef(base_uri + '/text/' + parent_id)
        g.add((person_uri, DCTERMS.isReferencedBy, parent_uri))
        g.add((parent_uri, RDF.type, frbroo.F23_Expression_Fragment))
        g.add((parent_uri, frbroo.R15i_is_fragment_of, URIRef(base_uri + '/' + edition_id)))
Our person's description is now complete.
Note that we could also write all of the above code by dividing it into smaller functions (i.e. def function_name()) as shown in the following block of code, then call the functions all together at the end. In this way, we avoid repeating the same lines of code and make our script a little easier to maintain. Note that the functions read the variables person_uri and person_id, which are set by the loop that calls them:
def subject(person):
    g.add((person_uri, RDF.type, crm.E21_Person))

def sameas(person):
    same_as = person.get('sameAs')
    if same_as is not None:
        same_as = same_as.split()
        i = 0
        while i < len(same_as):
            same_as_uri = URIRef(same_as[i])
            g.add((person_uri, OWL.sameAs, same_as_uri))
            i += 1

def persname(person):
    persname = person.find('./tei:persName', tei)
    if persname is not None:
        label = persname.text
        label_lang = persname.get('{http://www.w3.org/XML/1998/namespace}lang')
        if label_lang is not None:
            g.add((person_uri, RDFS.label, Literal(label, lang=label_lang)))
        else:
            g.add((person_uri, RDFS.label, Literal(label)))

def perstype(person):
    listperson = person.getparent()
    perstype = listperson.get('type')
    perscorr = listperson.get('corresp')
    if perstype is not None:
        g.add((person_uri, DCTERMS.description, Literal(perstype)))
    if perscorr is not None and perscorr.startswith('http'):
        g.add((person_uri, DCTERMS.subject, URIRef(perscorr)))

def referenced_person(person_id):
    ref = './tei:text//tei:persName[@ref="#' + person_id + '"]'
    for referenced_person in root.findall(ref, tei):
        parent = referenced_person.getparent()
        parent_id = parent.get('{http://www.w3.org/XML/1998/namespace}id')
        parent_uri = URIRef(base_uri + '/text/' + parent_id)
        g.add((person_uri, DCTERMS.isReferencedBy, parent_uri))
        g.add((parent_uri, RDF.type, frbroo.F23_Expression_Fragment))
        g.add((parent_uri, frbroo.R15i_is_fragment_of, URIRef(base_uri + '/' + edition_id)))

# Calling all functions
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    person_ref = '#' + person_id
    subject(person)
    sameas(person)
    persname(person)
    referenced_person(person_id)
    perstype(person)
For the rest of this Jupyter notebook, we will adopt this style: the script will be divided into functions, which will be called afterwards.
The following group of functions extracts information about people participating in events. Participation in an event revolves around the conceptual class pro:RoleInTime, which represents a "particular situation that describe a role an agent may have, that can be restricted to a particular time interval" (http://purl.org/spar/pro/RoleInTime). This class is directly related to the person, his/her role, a time, and an event.
Let's begin by checking if a person participates (i.e. has a role) in an event:
- For each <event> element within the <person> element, we build a unique URI representing the participation of the individual in the event (line 2).
- We add an RDF triple to the graph: the subject is the person, the predicate is pro:holdsRoleInTime, the object is the participation of the individual in the event (i.e. the pro:RoleInTime) (line 3). A visual diagram of the PRO ontology is available in its online documentation.
def partic_event(person):
    partic_event_uri = URIRef(base_uri + '/rit/' + person_id + '-at-' + event_id)  # same pattern as rit_uri in the calling loop
    g.add((person_uri, pro.holdsRoleInTime, partic_event_uri))
The next block aims to extract information about the role held by the person in a specific event.
- For each <event> element found in <person>, we add to our graph an RDF triple which assigns the participation of a person in a specific event to the class pro:RoleInTime (line 2).
- We look for a <persName> element within the <event> element (line 3).
- If we find a <persName> with an attribute @ref corresponding to that of the person record within which the event is nested (cf. the XPath expression //person[@xml:id="Socr"]//persName[@ref="#Socr"] in input-test.xml), as well as an attribute @role associated with this <persName> (line 4), we build a unique URI for such a role (line 5).
- We then add three RDF triples (lines 6-8): the first relates the participation to the role via pro:withRole; the second assigns the role to the class pro:Role; the third triple associates a human-readable label with the role entity.
- If a @corresp attribute is also present (line 9) (this should contain a link to an authority record for the role), the script adds an extra RDF triple to associate the authority record URI with the role entity via an owl:sameAs property (lines 10-11).
- If no role is specified for the person in <event>, we add the same triples as above with the only difference that the role is set to 'participant' (lines 12-17). The participation of the person in the event is taken for granted as the <event> element is nested within the <person> element.
def role_in_event(person):
    g.add((rit_uri, RDF.type, pro.RoleInTime))
    pers_in_event = event.find('./tei:desc/tei:persName', tei)
    if pers_in_event is not None and pers_in_event.get('ref') == person_ref and pers_in_event.get('role') is not None:
        role_uri = URIRef(base_uri + '/role/' + pers_in_event.get('role'))
        g.add((rit_uri, pro.withRole, role_uri))
        g.add((role_uri, RDF.type, pro.Role))
        g.add((role_uri, RDFS.label, Literal(pers_in_event.get('role'))))
        if pers_in_event.get('corresp') is not None:
            corresp_role_uri = URIRef(pers_in_event.get('corresp'))
            g.add((role_uri, OWL.sameAs, corresp_role_uri))
    else:
        g.add((rit_uri, pro.withRole, URIRef(base_uri + '/role/participant')))
        role_uri = URIRef(base_uri + '/role/participant')
        g.add((role_uri, RDF.type, pro.Role))
        g.add((role_uri, OWL.sameAs, URIRef('http://wordnet-rdf.princeton.edu/id/10421528-n')))
        g.add((role_uri, RDFS.label, Literal('participant')))
The following block aims at extracting information about the time of an event:
- We declare the namespace of the Time Interval ontology design pattern, which defines the TimeInterval class and the hasIntervalStartDate and hasIntervalEndDate properties (line 2). These terms do not belong to OWL, so they cannot be taken from RDFLib's OWL namespace.
- We build a URI for the time of the event (line 3) and relate the participation to it via ti:atTime (line 4).
- We assign the time entity to the class TimeInterval (line 5).
- We look for @when, @from, or @to attributes to determine the time interval and add RDF triples on the basis of the values found (lines 6-12).
def event_time():
    tio = Namespace('http://www.ontologydesignpatterns.org/cp/owl/timeinterval.owl#')
    event_time_uri = URIRef(base_uri + '/' + event_id + '-time')
    g.add((rit_uri, ti.atTime, event_time_uri))
    g.add((event_time_uri, RDF.type, tio.TimeInterval))
    if event.get('when') is not None:
        g.add((event_time_uri, tio.hasIntervalStartDate, Literal(event.get('when'), datatype=XSD.date)))
        g.add((event_time_uri, tio.hasIntervalEndDate, Literal(event.get('when'), datatype=XSD.date)))
    if event.get('from') is not None:
        g.add((event_time_uri, tio.hasIntervalStartDate, Literal(event.get('from'), datatype=XSD.date)))
    if event.get('to') is not None:
        g.add((event_time_uri, tio.hasIntervalEndDate, Literal(event.get('to'), datatype=XSD.date)))
We are now ready to describe the event itself:
- We relate the participation to the event via the property pro:relatesToEntity (line 2).
- We assign the event to the class crm:E5_Event (line 3).
- If present, we add the event's label, a description taken from @type, and a subject taken from @corresp (lines 4-10).
def event_desc():
    g.add((rit_uri, pro.relatesToEntity, URIRef(base_uri + '/event/' + event_id)))
    g.add((event_uri, RDF.type, crm.E5_Event))
    if event.find('./tei:label', tei) is not None:
        label = event.find('./tei:label', tei).text
        g.add((event_uri, RDFS.label, Literal(label)))
    if event.get('type') is not None:
        g.add((event_uri, DCTERMS.description, Literal(event.get('type'))))
    if event.get('corresp') is not None and event.get('corresp').startswith('http'):
        g.add((event_uri, DCTERMS.subject, URIRef(event.get('corresp'))))
In order to extract information about the place where the event took place, we:
- look for <placeName> elements in the <event> element (line 2).
- if more than one is found, look for the <placeName> element to which an attribute @type="place_of_event" is associated and add a triple relating the 'role-in-time' to that specific place. The place URI is built by concatenating the project base URI, a directory /place/ and the unique ID for the place (lines 3-6).
- if the <event> element contains only one reference to a place, simply use that as the place record for the event (lines 7-8).
def event_place():
    places = event.findall('./tei:desc/tei:placeName', tei)
    if len(places) > 1:
        for place in places:
            if place.get('type') == 'place_of_event':
                g.add((rit_uri, proles.relatesToPlace, URIRef(base_uri + '/place/' + place.get('ref').replace("#", ""))))
    elif len(places) == 1:
        g.add((rit_uri, proles.relatesToPlace, URIRef(base_uri + '/place/' + places[0].get('ref').replace("#", ""))))
If a literary source for the event is cited within the <event> element itself, we run the following set of instructions:
- We look for a <bibl> element within <event> (line 2).
- If a <bibl> is found (line 3), we build a URI for it after having retrieved its unique ID (lines 4-5), then add a new RDF triple to our graph: the subject is the event entity, which is linked to the source via the property prov:hadPrimarySource (line 6).
- We assign the source to the class prov:PrimarySource (line 7).
- The <bibl> element may contain the elements <author>, <title>, and <date>. An RDF triple is generated for each of these metadata, if present (lines 8-16).
- We look for a @sameAs attribute on <bibl> to relate the source to a related resource or to an authority record such as Worldcat (lines 17-20).
def event_source():
    source = event.find('./tei:bibl', tei)
    if source is not None:
        source_id = source.get('{http://www.w3.org/XML/1998/namespace}id')
        source_uri = URIRef(base_uri + '/source/' + source_id)
        g.add((event_uri, prov.hadPrimarySource, source_uri))
        g.add((source_uri, RDF.type, prov.PrimarySource))
        if source.find('./tei:author', tei) is not None and source.find('./tei:author', tei).get('ref') is not None:
            author_ref = source.find('./tei:author', tei).get('ref')
            author_id = author_ref.split('#')
            g.add((source_uri, DCTERMS.creator, URIRef(base_uri + '/person/' + author_id[1])))
        if source.find('./tei:title', tei) is not None:
            g.add((source_uri, DCTERMS.title, Literal(source.find('./tei:title', tei).text)))
        if source.find('./tei:date', tei) is not None:
            evdate = source.find('./tei:date', tei)
            g.add((source_uri, DCTERMS.date, Literal(evdate.get('when'), datatype=XSD.date)))
        if source.get('sameAs') is not None:
            sameAs = source.get('sameAs')
            if sameAs.startswith('http'):
                g.add((source_uri, OWL.sameAs, URIRef(source.get('sameAs'))))
Finally, we call the functions just created. If you wish, you can print out the resulting graph as done in section 3.3 of this notebook by typing print(g.serialize(format="n3")).
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    person_ref = '#' + person_id
    for event in person.findall('./tei:event', tei):
        event_id = event.get('{http://www.w3.org/XML/1998/namespace}id')
        event_uri = URIRef(base_uri + '/event/' + event_id)
        rit_uri = URIRef(base_uri + '/rit/' + person_id + '-at-' + event_id)
        partic_event(person)
        role_in_event(person)
        event_time()
        event_desc()
        event_place()
        event_source()
The aim of the following script is to extract information about the relationships in which a person participates. In TEI, relationships are normally encoded using the element <relation>, nested within the <listPerson> element. There are two main types of relationships: active/passive (unilateral relationship, e.g. Person A (active) is mother of Person B (passive)) and mutual (mutual relationship, e.g. Person A/B is colleague of Person B/A).
- We iterate through all <relation> elements (lines 1-2).
- If an @active attribute containing a reference to the person is found on <relation> (line 3), the script iterates through all possible values of the @passive attribute, adding an RDF triple for each of them (lines 4-8). The @name attribute on <relation> should provide a term from an ontology such as AgRelOn (https://d-nb.info/standards/elementset/agrelon) (line 7). The related persons are identified with the same /person/ URI pattern used for person entities.
- Otherwise, if a @mutual attribute containing a reference to the person is found (lines 9-10), the script adds an RDF triple for each of the other persons listed in it (lines 11-15).
def relation(person):
    for relation in root.findall('.//tei:listRelation/tei:relation', tei):
        if relation.get('active') is not None and relation.get('active') == person_ref:
            passive = relation.get('passive').replace("#", "").split()
            i = 0
            while i < len(passive):
                g.add((person_uri, agrelon[relation.get('name')], URIRef(base_uri + '/person/' + passive[i])))
                i += 1
        elif relation.get('mutual') is not None:
            if person_ref in relation.get('mutual').split():
                mutual = [ref.replace("#", "") for ref in relation.get('mutual').split() if ref != person_ref]
                i = 0
                while i < len(mutual):
                    g.add((person_uri, agrelon[relation.get('name')], URIRef(base_uri + '/person/' + mutual[i])))
                    i += 1
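When the person's own pointer is filtered out of @mutual, comparing whole tokens is safer than replacing substrings, because one identifier may be the beginning of another. The sketch below contrasts the two approaches on invented identifiers; both function names are hypothetical.

```python
def naive_others(mutual_value, person_id):
    """Substring replacement: damages IDs that merely contain person_id."""
    return mutual_value.replace('#', '').replace(person_id, '').split()

def other_parties(mutual_value, person_ref):
    """Token comparison: drop only the pointer that equals person_ref."""
    refs = mutual_value.split()
    return [ref.lstrip('#') for ref in refs if ref != person_ref]

# Hypothetical pointers: 'Plat' is a prefix of 'Platon'.
mutual = '#Plat #Platon #Xen'

print(naive_others(mutual, 'Plat'))    # ['on', 'Xen']
print(other_parties(mutual, '#Plat'))  # ['Platon', 'Xen']
```

The naive version silently turns 'Platon' into 'on', which would produce a URI for a person that does not exist.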
We now call the relation(person) function:
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    person_ref = '#' + person_id
    relation(person)
This section is about describing all places mentioned in the TEI file:
- For each <place> element found, we add an RDF triple to our graph assigning the place entity to the class crm:E53_Place.
def subject(place):
    g.add((place_uri, RDF.type, crm.E53_Place))
Moving on, we look for a @sameAs attribute provided on the <place> element. We expect this attribute to contain one or more URIs pointing to authority records such as VIAF, or to resources about the same place such as DBpedia:
- Using the get() function, we look for a @sameAs attribute and split its contents by whitespace (line 2).
- We iterate through the URIs found in the @sameAs attribute (lines 3-7). We record the URIs in the variable 'same_as_uri' (line 5).
- For each URI, we add an RDF triple to the graph: the subject is the place, the predicate is owl:sameAs, the object is the URI retrieved from within the @sameAs attribute.
def place_sameas(place):
    same_as = (place.get('sameAs') or '').split()  # empty list when @sameAs is absent
    i = 0
    while i < len(same_as):
        same_as_uri = URIRef(same_as[i])
        g.add((place_uri, OWL.sameAs, same_as_uri))
        i += 1
The next step is to provide each place entity with a human-readable label:
- We look for a <placeName> element within <place> (line 2).
- We retrieve its text and check whether an @xml:lang attribute is also present (lines 3-4).
- If an @xml:lang is found, the script adds an RDF triple. The subject of such a triple is the place entity, the predicate is rdfs:label, and the object is a literal value (i.e. an xsd:string). A language declaration is also attached to the triple (lines 5-6).
- If no @xml:lang is found, the script creates an RDF triple without declaring any specific language (lines 7-8).
def placename(place):
    placename = place.find('./tei:placeName', tei)
    label = placename.text
    label_lang = placename.get('{http://www.w3.org/XML/1998/namespace}lang')
    if label_lang is not None:
        g.add((place_uri, RDFS.label, Literal(label, lang=label_lang)))
    else:
        g.add((place_uri, RDFS.label, Literal(label)))
We may also be interested in extracting all references to a particular place in the text. The following script does precisely this:
- It looks for any <placeName> element in the text whose @ref attribute corresponds to the @xml:id of the place (lines 2-3).
- It retrieves the parent element of each such <placeName> and creates a unique URI for it (lines 4-6).
- It adds an RDF triple formed by the place's URI, the property dcterms:isReferencedBy, and the parent element's URI (line 7).
- It states that the parent element is a frbroo:F23_Expression_Fragment that is part of the TEI file (lines 8-9).
def referenced_place(place_id):
    ref = './/tei:placeName[@ref="#' + place_id + '"]'
    for referenced_place in root.findall(ref, tei):
        parent = referenced_place.getparent()
        parent_id = parent.get('{http://www.w3.org/XML/1998/namespace}id')
        parent_uri = URIRef(base_uri + '/text/' + parent_id)
        g.add((place_uri, DCTERMS.isReferencedBy, parent_uri))
        g.add((parent_uri, RDF.type, frbroo.F23_Expression_Fragment))
        g.add((parent_uri, frbroo.R15i_is_fragment_of, URIRef(base_uri + '/' + edition_id)))
Finally, we call the functions, iterating through each <place> element:
for place in root.findall('.//tei:place', tei):
    place_id = place.get('{http://www.w3.org/XML/1998/namespace}id')
    place_uri = URIRef(base_uri + '/place/' + place_id)
    place_ref = '#' + place_id
    subject(place)
    place_sameas(place)
    placename(place)
    referenced_place(place_id)
The following instructions write the RDF graph to external files. Besides the serialization format, you can specify a destination as follows:
# RDF/XML output
g.serialize(destination="output.xml", format='xml')
# Notation3 output
g.serialize(destination="output.n3", format='n3')
# N-triples output
g.serialize(destination="output.nt", format='nt')