Getting started with the DBpedia SPARQL endpoint

In this episode I begin to explore the DBpedia public SPARQL endpoint. I'll go through the following stages:

  • setting up tools
  • counting triples
  • counting predicates
  • examination of the predicate pr:skipperlastname
  • counting classes
  • examination of the class on:CareerStation

My method is a deliberate combination of systematic analysis (looking at counts, and using methods that can be applied to arbitrary predicates or classes) and opportunism (looking at topics that catch my eye). DBpedia is too heterogeneous to characterize in one article, but I'll begin to uncover the dark art of writing SPARQL queries against generic databases.

Setting up tools

A first step is to import a number of symbols that we'll use to do SPARQL queries and visualize the results.

In [1]:
import sys
from gastrodon import RemoteEndpoint,QName,ttl,URIRef,inline
import pandas as pd

First I'll define a few prefixes for namespaces that I want to use.

In [2]:
    @prefix : <> .
    @prefix on: <> .
    @prefix pr: <> .
time: 7.5 ms

Next I set up a SPARQL endpoint and register the above prefixes so I can use them; it is also important that I set the default graph and base_uri so we'll get good looking short results.

In [3]:
time: 12 ms
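
To see why registering prefixes and a base URI leads to short names in the results, here is a minimal sketch of the abbreviation step in plain Python. It is not how gastrodon does it internally; the helper name `shorten` is hypothetical, and the namespace URIs are assumptions (the usual DBpedia ontology, property, and resource namespaces).

```python
# Hypothetical helper: abbreviate a full URI the way a prefix-aware display might.
# The namespace URIs below are assumed, not taken from the notebook.
PREFIXES = {
    "on": "http://dbpedia.org/ontology/",
    "pr": "http://dbpedia.org/property/",
}
BASE_URI = "http://dbpedia.org/resource/"

def shorten(uri, prefixes=PREFIXES, base_uri=BASE_URI):
    """Return prefix:local, <local> relative to the base, or <uri> unchanged."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return prefix + ":" + uri[len(namespace):]
    if uri.startswith(base_uri):
        return "<" + uri[len(base_uri):] + ">"
    return "<" + uri + ">"

print(shorten("http://dbpedia.org/ontology/team"))       # on:team
print(shorten("http://dbpedia.org/resource/Alan_Alda"))  # <Alan_Alda>
```

A URI from a namespace that is neither registered nor under the base comes back in full, which is why unregistered vocabularies show up as long names.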

Counting Triples

First I count how many triples there are in the main graph

In [4]:"""
    SELECT (COUNT(*) AS ?count) { ?s ?p ?o .}
time: 2.01 s

Counting Predicates

For the next query I make a list of common predicates; note that there are a whole lot of them! The public SPARQL endpoint has a limit of 10,000 returned rows and we are finding many more than that.

Each predicate is a relationship between a topic and either another topic or a literal value. For instance, the rdf:type predicate links a topic to another topic representing a class of which the first topic is an instance:

<Alan_Alda> rdf:type on:Person .

rdfs:label, on the other hand, links topics to literal values, such as

<Alan_Alda> rdfs:label 
                "Alan Alda"@en,
                "アラン・アルダ"@ja .

Strings in RDF (like the one above) are unusual compared to other computer languages because they can contain language tags, a particularly helpful feature for multilingual databases such as DBpedia.

In [5]:"""
    SELECT ?p (COUNT(*) AS ?count) { ?s ?p ?o .} GROUP BY ?p ORDER BY DESC(?count)
rdf:type 113715893
<> 33623696
<> 23990506
rdfs:label 22430852
<> 15801285
on:wikiPageID 15797811
on:wikiPageRevisionID 15797811
<> 12845235
<> 12845235
<> 12845234
rdfs:comment 12391811
on:abstract 12390578
on:wikiPageExternalLink 7772279
on:wikiPageRedirects 7632358
<> 4146581
<> 4090049
<> 3080338
<> 2897004
on:team 2007122
on:birthDate 1740614
<> 1698618
on:thumbnail 1695460
pr:title 1566113
on:wikiPageDisambiguates 1537180
<> 1475015
<> 1448505
<> 1418209
on:birthPlace 1330297
pr:subdivisionType 1321475
<> 1289109
... ...
pr:plantLatM 135
ns2:v4b 135
ns5:v1b 135
pr:affdate 135
on:dissolved 135
pr:game11Attendance 135
ns5:v2b 135
pr:game9Attendance 135
pr:dharmaName 135
pr:irishGridReference 135
pr:seats1Next 135
pr:payloadCapacity 135
pr:ports 135
pr:compartment 135
pr:powerout 135
pr:seats1Begin 135
pr:longMin 135
pr:rotAreaSqm 135
pr:plantLongM 135
pr:deacon 135
pr:singleTemperature 135
pr:skipperlastname 135
pr:anchor 135
pr:flavour 135
pr:bo 135
pr:buschCarTeam 135
pr:majorsites 135
ns1:v4b 135
ns10:a 135
pr:seats1Last 134

10000 rows × 1 columns

time: 7.89 s
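
For readers new to SPARQL aggregates, the GROUP BY query above can be mimicked in miniature with plain Python. This is only a sketch over a tiny illustrative sample of triples (the Alan Alda triples come from the examples above; the others are made up for the demonstration).

```python
from collections import Counter

# A tiny illustrative sample of (subject, predicate, object) triples.
triples = [
    ("<Alan_Alda>", "rdf:type", "on:Person"),
    ("<Alan_Alda>", "rdfs:label", '"Alan Alda"@en'),
    ("<Alan_Alda>", "rdfs:label", '"アラン・アルダ"@ja'),
    ("<Aaron_Burr>", "rdf:type", "on:Person"),
    ("<Aaron_Burr>", "rdf:type", "on:Agent"),
]

# GROUP BY ?p with COUNT(*), then ORDER BY DESC(?count)
predicate_counts = Counter(p for _, p, _ in triples).most_common()
print(predicate_counts)  # [('rdf:type', 3), ('rdfs:label', 2)]
```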

Some notes.

First of all, properties that are original to DBpedia are in two namespaces: the on namespace contains DBpedia Ontology properties, which are better organized (mapped manually) than those in the pr namespace, which contains properties that are mapped automatically. The select function returns short names for predicates in these namespaces because I specified them in the prefix list above.

DBpedia also uses predicates that are defined in other namespaces, such as foaf and dc. Frequently these duplicate properties that are defined in DBpedia, but facilitate interoperability with tools and data that use standard vocabularies. select would show you short names for these too if I added them to the prefix list, but I didn't, so it doesn't.

If you look closely, you might notice we got exactly 10,000 results from this last query. This is not because DBpedia uses only 10,000 distinct predicates, but because the DBpedia SPARQL endpoint has a 10,000 row result limit. This can be annoying sometimes, but it protects the endpoint from people who write crazy queries. There is a bag of tricks for dealing with this, but for the purposes of this article, 10,000 predicates is enough to get started.
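
One trick from that bag is paging with LIMIT and OFFSET. The sketch below only builds the query strings; the helper name `paged_queries` is hypothetical, and I have not checked how far the public endpoint will let you page or how long each page takes. Paging also relies on a stable ordering, so a query whose ORDER BY leaves ties may need a secondary sort key.

```python
def paged_queries(query, page_size=10000, pages=3):
    """Yield copies of `query` with LIMIT/OFFSET appended, one per page."""
    for page in range(pages):
        yield f"{query} LIMIT {page_size} OFFSET {page * page_size}"

base = ("SELECT ?p (COUNT(*) AS ?count) { ?s ?p ?o .} "
        "GROUP BY ?p ORDER BY DESC(?count)")
for q in paged_queries(base, pages=2):
    print(q)
```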

This raises the question:

"How many distinct predicates are used in DBpedia?"

which is easy to answer with a SPARQL query:

In [6]:"""
    SELECT (COUNT(*) AS ?count) { SELECT DISTINCT ?p { ?s ?p ?o .} }
0 60649
time: 269 ms

When you have a number of "things" ordered by how prevalent they are, a cumulative distribution function is a great nonparametric method of characterizing the statistics.

In [7]:
time: 2.5 ms
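
As a rough sketch of the calculation (not necessarily the code that ran in the cell above), a cumulative distribution is a running sum of the counts divided by a total. The `dist` values shown in the tables below never reach 1.0 at the bottom of the 10,000 rows, which suggests the divisor is the total number of triples rather than the sum of the listed counts; the total used here is an assumption, back-computed from the first `dist` value.

```python
from itertools import accumulate

def cumulative_share(counts, total):
    """Running sum of counts divided by total, one value per row."""
    return [running / total for running in accumulate(counts)]

# First three counts from the predicate table above.
counts = [113715893, 33623696, 23990506]
# Assumed total: roughly 438.3 million triples, estimated as
# 113715893 / 0.259426; the real triple count may differ slightly.
ASSUMED_TOTAL = 438_336_531
for share in cumulative_share(counts, ASSUMED_TOTAL):
    print(round(share, 4))
```

With a pandas Series of counts, `cumsum()` followed by a division would do the same job.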
In [8]:
%matplotlib inline
<matplotlib.axes._subplots.AxesSubplot at 0x1673ca97518>
time: 345 ms

This distribution certainly looks like it has a "knee" somewhere in the teens, probably marking a transition from predicates that could apply to any topic, such as rdfs:comment, to predicates specific to certain subject areas, such as on:team.

In [9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1673c4a5390>
time: 120 ms

Here are the top 30 predicates, which together account for more than 80% of the triples in the main graph:

In [10]:
count dist
rdf:type 113715893 0.259426
<> 33623696 0.336134
<> 23990506 0.390864
rdfs:label 22430852 0.442037
<> 15801285 0.478085
on:wikiPageID 15797811 0.514126
on:wikiPageRevisionID 15797811 0.550166
<> 12845235 0.579471
<> 12845235 0.608775
<> 12845234 0.638080
rdfs:comment 12391811 0.666350
on:abstract 12390578 0.694617
on:wikiPageExternalLink 7772279 0.712348
on:wikiPageRedirects 7632358 0.729760
<> 4146581 0.739220
<> 4090049 0.748551
<> 3080338 0.755578
<> 2897004 0.762187
on:team 2007122 0.766766
on:birthDate 1740614 0.770737
<> 1698618 0.774612
on:thumbnail 1695460 0.778480
pr:title 1566113 0.782053
on:wikiPageDisambiguates 1537180 0.785560
<> 1475015 0.788925
<> 1448505 0.792230
<> 1418209 0.795465
on:birthPlace 1330297 0.798500
pr:subdivisionType 1321475 0.801515
<> 1289109 0.804456
time: 9.5 ms

Looking at the tail, I find some very random sorts of properties.

In [11]:
count dist
pr:buschCarTeam 135 0.998442
pr:majorsites 135 0.998442
ns1:v4b 135 0.998442
ns10:a 135 0.998442
pr:seats1Last 134 0.998443
time: 15 ms

Here are predicates that are at the 90%, 95%, 98%, and 99% cumulative distributions, just to get a sense of what happens as things get more rare.

In [12]:
count dist
on:areaTotal 179581 0.900161
time: 17.1 ms
In [13]:
count dist
pr:nativeName 31226 0.95004
time: 18 ms
In [14]:
count dist
pr:ordination 4839 0.980009
time: 15.5 ms
In [15]:
count dist
pr:namedAfter 1765 0.990003
time: 19.5 ms
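
The four lookups above all ask the same question: which is the first row whose cumulative share reaches a given threshold? Here is a small sketch of that lookup in plain Python. The function name is hypothetical and the rows are the four results shown above, so this illustrates the idea rather than reproducing the original cells.

```python
from bisect import bisect_left

def first_at_or_above(rows, threshold):
    """rows: (name, count, dist) tuples with dist ascending.
    Return the first row whose dist reaches the threshold, or None."""
    dists = [dist for _, _, dist in rows]
    i = bisect_left(dists, threshold)
    return rows[i] if i < len(rows) else None

rows = [
    ("on:areaTotal", 179581, 0.900161),
    ("pr:nativeName", 31226, 0.95004),
    ("pr:ordination", 4839, 0.980009),
    ("pr:namedAfter", 1765, 0.990003),
]
print(first_at_or_above(rows, 0.95))  # ('pr:nativeName', 31226, 0.95004)
```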

pr:skipperlastname (property ranked number 9993) caught my eye, so I take a look at it.

In [16]:"""
    SELECT ?s ?o  { ?s pr:skipperlastname ?o  }
s o
0 <1989–90_Whitbread_Round_the_World_Race> English
1 <1993–94_Whitbread_Round_the_World_Race> Field
2 <1973–74_Whitbread_Round_the_World_Race> Goodwin
3 <1977–78_Whitbread_Round_the_World_Race> James
4 <1989–90_Whitbread_Round_the_World_Race> Smith
5 <1993–94_Whitbread_Round_the_World_Race> Smith
6 <1981–82_Whitbread_Round_the_World_Race> Taylor
7 <1985–86_Whitbread_Round_the_World_Race> Taylor
8 <Oryx_Quest> Thompson
9 <1973–74_Whitbread_Round_the_World_Race> Ainslie
10 <The_Race_(yachting_race)> Lewis
11 <1977–78_Whitbread_Round_the_World_Race> Watts
12 <Volvo_Baltic_Race> Williams
13 <1981–82_Whitbread_Round_the_World_Race> Williams
14 <1977–78_Whitbread_Round_the_World_Race> Williams
15 <1973–74_Whitbread_Round_the_World_Race> Williams
16 <1989–90_Whitbread_Round_the_World_Race> Dubois
17 <1989–90_Whitbread_Round_the_World_Race> Edwards
18 <Volvo_Baltic_Race> Mortensen
19 <1993–94_Whitbread_Round_the_World_Race> Riley
20 <1985–86_Whitbread_Round_the_World_Race> Salmon
21 <1989–90_Whitbread_Round_the_World_Race> Salmon
22 <1989–90_Whitbread_Round_the_World_Race> Dalton
23 <1993–94_Whitbread_Round_the_World_Race> Dalton
24 <The_Race_(yachting_race)> Dalton
25 <1977–78_Whitbread_Round_the_World_Race> Francis
26 <1977–78_Whitbread_Round_the_World_Race> Ridgway
27 <1993–94_Whitbread_Round_the_World_Race> Dickson
28 <1985–86_Whitbread_Round_the_World_Race> Berner
29 <1993–94_Whitbread_Round_the_World_Race> Conner
... ... ...
105 <1973–74_Whitbread_Round_the_World_Race> Laucht
106 <1985–86_Whitbread_Round_the_World_Race> Lugt
107 <1993–94_Whitbread_Round_the_World_Race> Maisto
108 <1981–82_Whitbread_Round_the_World_Race> Malingri
109 <1973–74_Whitbread_Round_the_World_Race> Malingri
110 <1989–90_Whitbread_Round_the_World_Race> Mallé
111 <1981–82_Whitbread_Round_the_World_Race> Mcgown-Fyfe
112 <1973–74_Whitbread_Round_the_World_Race> Myatt
113 <1985–86_Whitbread_Round_the_World_Race> Norsk
114 <1981–82_Whitbread_Round_the_World_Race> Panada
115 <1973–74_Whitbread_Round_the_World_Race> Pascoli
116 <1973–74_Whitbread_Round_the_World_Race> Perlicki
117 <1973–74_Whitbread_Round_the_World_Race> Pienkawa
118 <1985–86_Whitbread_Round_the_World_Race> Péan
119 <1981–82_Whitbread_Round_the_World_Race> Rietschoten
120 <1977–78_Whitbread_Round_the_World_Race> Rietschoten
121 <1981–82_Whitbread_Round_the_World_Race> Stampi
122 <1981–82_Whitbread_Round_the_World_Race> Tabarly
123 <1985–86_Whitbread_Round_the_World_Race> Tabarly
124 <1989–90_Whitbread_Round_the_World_Race> Tabarly
125 <1993–94_Whitbread_Round_the_World_Race> Tabarly
126 <1973–74_Whitbread_Round_the_World_Race> Tabarly
127 <1981–82_Whitbread_Round_the_World_Race> Versluys
128 <1985–86_Whitbread_Round_the_World_Race> Versluys
129 <1981–82_Whitbread_Round_the_World_Race> Viant
130 <1977–78_Whitbread_Round_the_World_Race> Viant
131 <1973–74_Whitbread_Round_the_World_Race> Viant
132 <1985–86_Whitbread_Round_the_World_Race> Visiers
133 <1989–90_Whitbread_Round_the_World_Race> Wilkeri
134 <1985–86_Whitbread_Round_the_World_Race> Zehender-Mueller

135 rows × 2 columns

time: 483 ms

Looks like it has to do with sailing. It's not an area that I know much about, so I'll transclude the page describing one of the topics from Wikipedia so we can understand it.

In [17]:
from bs4 import BeautifulSoup
from IPython.display import display, HTML
from uritools import urijoin

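def transclude(file, base):
    # Load a saved HTML page and return its <body> for inline display.
    with open(file,"rt",encoding="utf8") as fp:
        soop = BeautifulSoup(fp,"html5lib")
    # Assumed loop body: resolve relative links against the page's base URL.
    for a in soop.find_all("a"):
        if a.has_attr("href"):
            a["href"] = urijoin(base, a["href"])
    return HTML(str(soop.body))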
time: 152 ms
In [18]:

The Race (yachting race)

From Wikipedia, the free encyclopedia

The Race was a round-the-world sailing race starting in Barcelona, Spain on December 31, 2000. It was the first ever non-stop, no-rules, no-limits, round-the-world sailing event, with a $2 million US prize. It was organized by Bruno Peyron.

The stated objectives of this race were:

  • to unite the different maritime cultures of the world
  • to gather together the world's premiere yachtsmen and women in a common event
  • to promote creativity in ocean sailing
  • to ally high technology and the environment
  • to create the most spectacular and most prestigious fleet of offshore racers that sailing has ever seen

A second race was planned for 2004, but was cancelled amid controversy that Tracy Edwards had organized a competing event called Oryx Quest.


The 2000–01 race was won by Club Med, skippered by Grant Dalton in 62d 6h 56' 33".

Pos  Boat                 Crew                       Country        Time
1    Club Med             Grant Dalton               New Zealand    62d 6h 56m 33s
2    Innovation Explorer  Loick Peyron & Skip Novak  France         64d 22h 32m 38s
3    Team Adventure       Cam Lewis                  United States  82d 20h 21m 02s
4    Warta Polpharma      Roman Paszke               Poland         99d 12h 31m
5    Team Legato          Tony Bullimore             Great Britain  104d 20h 52m
-    PlayStation          Steve Fossett              United States  DNF[a]
-    Team Philips         Pete Goss                  Great Britain  DNS
  [a] Damaged and forced to withdraw on day 16

Legend: DNF – Did not finish; DNS – Did not start;

time: 42.5 ms

That Wikipedia page is pretty informative. Let's see what facts are in DBpedia concerning "The Race".

Because I set the base_uri when I created the endpoint object, DBpedia resources (which largely correspond to Wikipedia pages) can be easily written using angle brackets. It would be tempting to create a namespace for them, but it turns out that SPARQL and Turtle let you write a wider range of characters inside angle brackets than in a prefixed name. In particular, the parentheses in <The_Race_(yachting_race)> are legal, but dbpedia:The_Race_(yachting_race) is not allowed!
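
If you generate queries from code, one way to stay out of trouble is to fall back to angle brackets whenever a name contains anything unusual. The sketch below is deliberately conservative and is an assumption of mine, not the full SPARQL grammar: it treats only letters, digits, and underscores as safe in a prefixed name, even though the real rules permit more.

```python
import re

# Conservative assumption: only letters, digits, and underscore are treated as
# safe in a prefixed name. The real SPARQL grammar permits more than this.
SAFE_LOCAL = re.compile(r"[A-Za-z0-9_]+")

def resource_term(local_name, prefix="dbpedia"):
    """Write prefix:local when clearly safe, otherwise a relative <IRI>."""
    if SAFE_LOCAL.fullmatch(local_name):
        return f"{prefix}:{local_name}"
    return f"<{local_name}>"

print(resource_term("Alan_Alda"))                 # dbpedia:Alan_Alda
print(resource_term("The_Race_(yachting_race)"))  # <The_Race_(yachting_race)>
```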

In [19]:
    SELECT ?p ?o  {<The_Race_(yachting_race)> ?p ?o  }
p o
0 rdf:type on:SportsEvent
1 rdf:type
2 rdf:type
3 rdf:type
4 rdf:type
5 rdf:type
6 rdf:type
7 rdf:type
8 rdf:type
9 rdf:type
10 rdfs:label The Race (yachting race)
11 rdfs:label The Race
12 rdfs:label The Race
13 rdfs:label The Race (vela)
14 rdfs:label The Race
15 rdfs:label The Race
16 rdfs:comment The Race : No Limit Around The World est une épreuve sportive imaginée et créée par Bruno Peyron...
17 rdfs:comment The Race war eine Hochseeregatta, die 2000/2001 auf Mehrrumpf-Segelyachten der G-Class - d. h. u...
18 rdfs:comment The Race : No Limit Around The World è stata una corsa a vela immaginata e creata da Bruno Peyro...
19 rdfs:comment The Race (fr. La Course du Millénaire) – regaty dookoła świata bez zawijania do portu, które odb...
20 rdfs:comment The Race was a round-the-world sailing race starting in Barcelona, Spain on December 31, 2000. I...
21 rdfs:comment The Race was de eerste non-stop, no-limits wedstrijd rond de wereld die startte op 31 december 2...
31 <Category:Yachting_races>
32 <Category:Round-the-world_sailing_competitions>
33 <Category:2000_in_sailing>
34 on:wikiPageID 2280786
35 on:wikiPageRevisionID 660377883
36 on:wikiPageExternalLink
39 on:abstract The Race : No Limit Around The World est une épreuve sportive imaginée et créée par Bruno Peyron...
40 on:abstract The Race was a round-the-world sailing race starting in Barcelona, Spain on December 31, 2000. I...
41 on:abstract The Race war eine Hochseeregatta, die 2000/2001 auf Mehrrumpf-Segelyachten der G-Class - d. h. u...
42 on:abstract The Race : No Limit Around The World è stata una corsa a vela immaginata e creata da Bruno Peyro...
43 on:abstract The Race (fr. La Course du Millénaire) – regaty dookoła świata bez zawijania do portu, które odb...
44 on:abstract The Race was de eerste non-stop, no-limits wedstrijd rond de wereld die startte op 31 december 2...
45 pr:nation FRA
46 pr:nation POL
47 pr:nation USA
48 pr:nation GBR
49 pr:nation NZL
50 pr:nationality yes
51 pr:pos 1
52 pr:pos 2
53 pr:pos 3
54 pr:pos 4
55 pr:pos 5
56 pr:pos
57 pr:time yes
58 pr:time DNF
59 pr:time 5381793.0
60 pr:time 5610758.0
61 pr:time 7158062.0
62 pr:time 8598660.0
63 pr:time 9060720.0
64 pr:time DNS
65 pr:boatname <Warta_Polpharma>
66 pr:boatname <Team_Philips>
67 pr:boatname <PlayStation_(yacht)>
68 pr:boatname <Doha_2006_(yacht)>
69 pr:boatname <Innovation_Explorer>
70 pr:boatname <Team_Adventure>
71 pr:boatname <Team_Legato>
72 pr:boatname yes
73 pr:dnf yes
74 pr:dns yes
75 pr:skipper yes
76 pr:skipper Skip Novak
77 pr:skipperfirstname Grant
78 pr:skipperfirstname Roman
79 pr:skipperfirstname Tony
80 pr:skipperfirstname Steve
81 pr:skipperfirstname Pete
82 pr:skipperfirstname Cam
83 pr:skipperfirstname Loick
84 pr:skipperlastname Lewis
85 pr:skipperlastname Dalton
86 pr:skipperlastname Bullimore
87 pr:skipperlastname Fossett
88 pr:skipperlastname Goss
89 pr:skipperlastname Paszke
90 pr:skipperlastname Peyron
91 <Race>
time: 466 ms

What's the story here? Cells from the table have been converted into facts, but the order of the facts has been scrambled. We know that one of the boats finished in "5381793.0" seconds, and we know there was a boat named "Warta_Polpharma" and so forth, but we don't know which boats finished, which boat finished in what time, which boat had what skipper, etc.
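
The pr:time values do line up with the finishing times in the Wikipedia table once those are converted to seconds, which is a quick way to convince yourself where the numbers came from. A small worked check (the helper name is mine):

```python
def to_seconds(days=0, hours=0, minutes=0, seconds=0):
    """Convert a days/hours/minutes/seconds duration to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

# Club Med, 62d 6h 56m 33s in the Wikipedia table
print(to_seconds(62, 6, 56, 33))   # 5381793
# Innovation Explorer, 64d 22h 32m 38s
print(to_seconds(64, 22, 32, 38))  # 5610758
```

Matching a time back to its boat this way only works because we have the original table to consult; nothing in the DBpedia facts themselves ties the two together.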

This is not a limitation of RDF, but it is a common limitation of RDF-based systems in the "Linked Data" era, and it's historically been a problem in RDF.

The basic problem is that if we want to write a statement like the one on the first row of the HTML table, we end up having to write something like

<Some_Node>
  pr:pos 1 ;
  pr:boat <Club_Med> ;
  pr:skipper <Grant_Dalton> ;
  pr:nation "NZL" ;
  pr:time 5381793.0 .

<The_Race_(yachting_race)> pr:entry <Some_Node> .

The only hard part is determining a name for <Some_Node>. In the case of DBpedia, names are derived from URIs in Wikipedia, a formula that doesn't apply when we're talking about a concept that doesn't have a URI in Wikipedia. We can duck the problem of assigning a name by using a blank node (which states a node exists without giving a specific name) but that causes problems of its own which come from the difficulty of having something nameless in a distributed system. (What if I want to talk about a nameless entity that exists in DBpedia?)

For specific problems, it's possible, and often straightforward, to find ways to name nodes like <Some_Node>. However, it is hard to find a solution that pleases everybody, particularly when we are talking about a system which is decentralized, in which people would like names to be stable over time, etc.
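
To make the difference concrete, here is a sketch with plain Python tuples standing in for triples. The pr:entry predicate is the hypothetical one from the example above, and the "__1" and "__3" node names are also hypothetical; the skipper names and times are taken from the table and query results shown earlier.

```python
# Flat facts attached to the race: the association between values is lost.
flat = [
    ("<The_Race_(yachting_race)>", "pr:skipperlastname", "Dalton"),
    ("<The_Race_(yachting_race)>", "pr:skipperlastname", "Lewis"),
    ("<The_Race_(yachting_race)>", "pr:time", 5381793.0),
    ("<The_Race_(yachting_race)>", "pr:time", 7158062.0),
]

# One node per entry keeps each row together (node names are hypothetical).
entries = [
    ("<The_Race_(yachting_race)>", "pr:entry", "<The_Race_(yachting_race)__1>"),
    ("<The_Race_(yachting_race)__1>", "pr:skipperlastname", "Dalton"),
    ("<The_Race_(yachting_race)__1>", "pr:time", 5381793.0),
    ("<The_Race_(yachting_race)>", "pr:entry", "<The_Race_(yachting_race)__3>"),
    ("<The_Race_(yachting_race)__3>", "pr:skipperlastname", "Lewis"),
    ("<The_Race_(yachting_race)__3>", "pr:time", 7158062.0),
]

def time_for_skipper(triples, lastname):
    """Find the node with this skipper, then read that node's pr:time."""
    for node, p, o in triples:
        if p == "pr:skipperlastname" and o == lastname:
            for s2, p2, o2 in triples:
                if s2 == node and p2 == "pr:time":
                    return o2
    return None

print(time_for_skipper(flat, "Lewis"))     # 5381793.0, which is Dalton's time
print(time_for_skipper(entries, "Lewis"))  # 7158062.0
```

With the flat facts, the lookup has no way to tell the two times apart and returns whichever comes first; with a node per entry, the same lookup gets the right answer.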

With conflicting demands, it's no wonder that this area has not been standardized by the W3C, but it's great to see that DBpedia is making some progress in this area, which I'll show in the next section.


Note I started this analysis by looking first at the most commonly used predicates. If I were looking at a SQL database, this would be like looking at a list of columns first, and if I were looking at an object-oriented program, it would be like looking at a list of methods and fields.

It would be much more common to look at tables first in SQL or classes first in Java, but RDF is different from SQL and Java, and it often makes sense to look at properties first.

For one thing, it is possible to write properties without defining any classes or categories, that is, the RDF statement

<SomeTopic> :hasNumber 1023 .

is self-sufficient and meaningful without knowledge that <SomeTopic> is a particular kind of topic. Thus, properties are more fundamental.

More practically, people get into more trouble with classes than they do with properties. Part of it is that people tend to argue more about classes (e.g. can a video game be art?) than they do about properties (e.g. "Hideo Kojima was the director of Metal Gear Solid"). In the case of DBpedia, one problem is the sheer number of categories:

In [20]:"""
    SELECT ?type (COUNT(*) AS ?count) { ?s a ?type .} GROUP BY ?type ORDER BY DESC(?count)
type count
<> 12856178
<> 5044222
on:Image 2897004
<> 2822488
<> 2720458
<> 2190190
<> 2061271
on:Person 1818074
<> 1654844
<> 1548330
on:Agent 1546264
<> 1529881
<> 1529881
<> 1475015
<> 1366065
<> 1365758
<> 1243399
<> 1243399
<> 1243399
<> 1243399
<> 1243399
<> 1229049
<> 1216438
on:TimePeriod 1127706
<> 1079890
<> 996625
<> 989272
on:CareerStation 977023
on:Place 881597
<> 839987
on:Location 839987
<> 689249
<> 659092
<> 644115
<> 620586
on:Settlement 581293
on:PopulatedPlace 516747
on:Work 508099
<> 497592
<> 496070
<> 496070
<> 478906
<> 468228
<> 455794
<> 421838
<> 407890
on:Athlete 392672
<> 373140
on:Organisation 352081
... ...
<> 257
... ...
<> 256

10000 rows × 1 columns

time: 1min 30s
In [21]:"""
    SELECT (COUNT(*) AS ?count) { SELECT DISTINCT ?type { ?s a ?type .} }
0 483605
time: 12.1 s

On average, that's nearly eight classes for every property!

DBpedia, it turns out, contains many types from YAGO, which are in turn generated from Wikipedia Categories and other data sources. Many of these classes, such as yago:WikicatPeopleFromYokohama and yago:MexicanMaleFilmActors, are members of very large families that include "People from Lanzarote", "Brazilian female professional wrestlers", and so on. Two common patterns are:

  1. Restriction types: One could name "People from Yokohama" as a class, and ask for instances of that class. Alternatively, one could query for people for whom the property "comes from" has the value "Yokohama". A class whose membership is determined by property values is a "restriction type".
  2. Intersection types: "Mexican Person" is a class, "Male Person" is a class, "Film Actor" is a class. The set of topics which are members of all of those classes is "Mexican Male Film Actors".

As you can say the same things with or without restriction and intersection types, it is a case-by-case decision as to whether to use them or to compose them from other elements. What is clear, in this case, is that there are so many realized restriction and intersection types from YAGO that they get in the way of seeing what kinds of things are talked about in DBpedia.

An easy "set of blinders" to use here is to look only at types that are in the DBpedia Ontology namespace. Rather than write a new SPARQL query, I use the filtering operator in Pandas to pick out common types from the DBpedia Ontology.
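
The filtering step amounts to keeping rows whose short name starts with the on: prefix. Here is a minimal sketch of that idea using a plain dictionary rather than the actual result frame; the function name is hypothetical and the yago: count is made up for the illustration.

```python
def ontology_types(type_counts, prefix="on:"):
    """Keep only entries whose short name starts with the given prefix."""
    return {name: n for name, n in type_counts.items() if name.startswith(prefix)}

# A few rows in the style of the result above; the yago: count is made up.
type_counts = {
    "on:Image": 2897004,
    "yago:WikicatPeopleFromYokohama": 300,
    "on:Person": 1818074,
}
print(ontology_types(type_counts))  # {'on:Image': 2897004, 'on:Person': 1818074}
```

In Pandas the same test would be written as a boolean mask over the index.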

In [22]:
on:Image 2897004
on:Person 1818074
on:Agent 1546264
on:TimePeriod 1127706
on:CareerStation 977023
on:Place 881597
on:Location 839987
on:Settlement 581293
on:PopulatedPlace 516747
on:Work 508099
on:Athlete 392672
on:Organisation 352081
on:SportsTeamMember 318735
on:OrganisationMember 318392
on:Species 306833
on:Eukaryote 302686
on:Village 231103
on:Animal 230175
on:MusicalWork 209142
on:ArchitecturalStructure 203065
on:PersonFunction 171413
on:SoccerPlayer 151207
on:Album 147917
on:Insect 146657
on:Film 129980
on:Building 123567
on:Company 109629
on:Infrastructure 92281
on:Artist 82757
on:Event 77583
on:OfficeHolder 71753
on:MusicalArtist 71014
on:TelevisionShow 70690
on:Station 69525
on:WrittenWork 69112
on:NaturalPlace 67863
on:Band 67831
on:Single 67150
on:Book 64239
on:Plant 62543
on:SocietalEvent 60321
on:MeanOfTransportation 59078
on:EducationalInstitution 55860
on:SportsSeason 55732
on:Software 52743
on:School 45445
on:Town 45415
on:SportsTeam 44978
on:BodyOfWater 44204
... ...
on:DartsPlayer 587
on:Chef 584
on:RugbyLeague 584
on:Winery 569
on:Jockey 556
on:MusicFestival 547
on:Skater 545
on:VoiceActor 543
on:Presenter 532
on:TableTennisPlayer 524
on:LawFirm 514
on:Rocket 501
on:Medician 501
on:FloweringPlant 499
on:AustralianFootballTeam 492
on:Moss 486
on:LacrossePlayer 482
on:CyclingTeam 482
on:SumoWrestler 473
on:Bodybuilder 472
on:SnookerPlayer 465
on:Photographer 459
on:Canoeist 458
on:AmateurBoxer 448
on:RoadJunction 448
on:Entomologist 437
on:Artery 428
on:SquashPlayer 422
on:Nerve 415
on:Racecourse 410
on:Pope 407
on:HandballTeam 393
on:GreenAlga 391
on:SolarEclipse 380
on:Database 363
on:RadioHost 359
on:Muscle 347
on:HorseTrainer 330
on:ClassicalMusicArtist 322
on:RoadTunnel 314
on:Poet 308
on:IceHockeyLeague 304
on:Brewery 292
on:Rower 279
on:BaseballSeason 275
on:PlayboyPlaymate 274
on:RaceTrack 269
on:NetballPlayer 263
on:CricketGround 260

388 rows × 1 columns

time: 21 ms

on:Image catches my eye, so I look at a few examples and pick one out.

In [23]:"""
    SELECT ?that { 
        ?that a on:Image
    } LIMIT 10
time: 293 ms
In [24]:
HTML('<img src="{0}">'.format([0,'that']))
time: 4 ms

These "topics" are what I would call "non-topic topics" in the sense that they are the subject of a statement, but not an actual "thing in the world" described by the knowledge base. (Wikipedia documents the outside world primarily, and only secondarily has a metadata catalog for items that are in it.)

The following query finds "plain ordinary topics"

In [25]:"""
    SELECT ?that { 
        ?that a on:Person
    } LIMIT 10
0 <Andreas_Ekberg>
1 <Danilo_Tognon>
2 <Lorine_Livington_Pruette>
3 <Megan_Lawrence>
4 <Nikolaos_Ventouras>
5 <Peter_Ceffons>
6 <Sani_ol_molk>
7 <Siniša_Žugić>
8 <Strength_athlete>
9 <Trampolino_Gigante_Corno_d'Aola>
time: 289 ms

"Andreas_Ekberg" is a shorthand for <> which is parallel to the Wikipedia page at <>. The select() method shows just "Andreas_Ekberg" because I registered <> as the base URI of this endpoint when I created the endpoint object way back at the beginning of this notebook.

What most people would think of as "topics" in DBpedia live in the <> namespace.

Another common kind of topic in DBpedia is the on:Agent:

In [26]:"""
    SELECT ?that { 
        ?that a on:Agent
    } LIMIT 10
0 <3Com>
1 <7-Eleven>
2 <A._C._Bhaktivedanta_Swami_Prabhupada>
3 <Aardman_Animations>
4 <Aaron_Burr>
5 <Abbie_Hoffman>
6 <>
7 <Abraham_Robinson>
8 <Abraham_de_Moivre>
9 <Academy_of_Motion_Picture_Arts_and_Sciences>
time: 280 ms

The "Agent" concept is connected with the shared attributes of individuals and organizations; I like to think that an "Agent" is something that can be the originator or recipient of a communication. If I remove people using the MINUS operator, only organizations remain.

In [27]:"""
    SELECT ?that { 
        ?that a on:Agent
        MINUS {?that a on:Person}
    } LIMIT 10
0 <3Com>
1 <7-Eleven>
2 <Aardman_Animations>
3 <>
4 <Academy_of_Motion_Picture_Arts_and_Sciences>
5 <Acorn_Computers>
6 <Activision>
7 <Ad_Lib,_Inc.>
8 <Adnams_Brewery>
9 <Aermacchi>
time: 290 ms
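
For a simple pattern like this one, where both sides share the single variable ?that, MINUS behaves like a set difference. A sketch with Python sets, using a handful of names from the results above as sample data (the `persons` set is my own illustrative selection):

```python
# Sample data: a few agents from the results above, and some known people.
agents = {"<3Com>", "<7-Eleven>", "<Aaron_Burr>", "<Abbie_Hoffman>", "<Aardman_Animations>"}
persons = {"<Aaron_Burr>", "<Abbie_Hoffman>", "<Alan_Alda>"}

# ?that a on:Agent MINUS { ?that a on:Person }
organizations = agents - persons
print(sorted(organizations))  # ['<3Com>', '<7-Eleven>', '<Aardman_Animations>']
```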

Unlike the classes I've shown so far, an on:TimePeriod can be either a topic or non-topic. Asking for just 10 time periods, I find that some of them correspond to calendar years:

In [28]:"""
    SELECT ?that { 
        ?that a on:TimePeriod
    } LIMIT 10
0 <1>
1 <10>
2 <100>
3 <1000>
4 <1001>
5 <1002>
6 <1003>
7 <1004>
8 <1005>
9 <1006>
time: 298 ms
In [29]:


From Wikipedia, the free encyclopedia

Year 1004 (MIV) was a leap year starting on Saturday (link will display the full calendar) of the Julian calendar.


time: 35 ms

If, however, I make a query that eliminates topics that start with a number, the query returns a large number of non-topics. Even though these resources are in the <> namespace, they don't have corresponding Wikipedia pages.

In [30]:"""
    SELECT ?that { 
        ?that a on:TimePeriod .
    } LIMIT 10
0 <A._M._A._Azeez__1>
1 <A._R._Colquhoun__1>
2 <Abbie_Wolanow__1>
3 <Abbie_Wolanow__2>
4 <Abbie_Wolanow__3>
5 <Abbie_Wolanow__4>
6 <Abbie_Wolanow__5>
7 <Abdul_Wahab_Khan__1>
8 <Adam_Wolanin__1>
9 <Adam_Wolanin__2>
time: 352 ms
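
One way to express "does not start with a number" is sketched below, first as a client-side filter over names and then as a FILTER clause that could sit inside the braces of the query. Both are assumptions of mine rather than the original cell; in particular the resource namespace in the FILTER is assumed.

```python
import re

names = ["1004", "1005", "Abbie_Wolanow__1", "Adam_Wolanin__2"]

# Client-side version: drop names that begin with a digit.
non_numeric = [n for n in names if not re.match(r"[0-9]", n)]
print(non_numeric)  # ['Abbie_Wolanow__1', 'Adam_Wolanin__2']

# Server-side sketch (assumed resource namespace): a FILTER for the query body.
FILTER_SKETCH = 'FILTER(!REGEX(STR(?that), "^http://dbpedia.org/resource/[0-9]"))'
print(FILTER_SKETCH)
```

Filtering on the server is usually preferable here, because a LIMIT applied before a client-side filter could return ten calendar years and nothing else.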

Let's take a closer look. It seems that this record describes a time that a soccer player spent playing for a team (although unfortunately it doesn't say when this time began or ended):

In [31]:"""
    BASE <>
    SELECT ?p ?o {
        <Abbie_Wolanow__1> ?p ?o .
    }
p o
0 rdf:type
1 rdf:type on:CareerStation
2 rdf:type on:TimePeriod
3 on:team <Hapoel_Tel_Aviv_F.C.>
time: 288 ms

This record is more complete, and shows how the career record can be linked to a time, as well as information about how the player performed:

In [32]:"""
    BASE <>
    SELECT ?p ?o {
        <Abbie_Wolanow__5> ?p ?o .
    }
p o
0 rdf:type
1 rdf:type on:CareerStation
2 rdf:type on:TimePeriod
3 on:numberOfGoals 0
4 on:numberOfMatches 1
5 on:team <United_States_men's_national_soccer_team>
6 on:years 1961-01-01
time: 283 ms

Going to the right of the career station (finding objects for which it is the subject) we see the team, but we don't see the player. Going to the left, however (finding subjects for which it is the object), we see the player.

In [33]:"""
    BASE <>
    SELECT ?s ?p  {
        ?s ?p <Abbie_Wolanow__5> .
    }
s p
0 <Abbie_Wolanow> on:careerStation
time: 282 ms

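Thus this fragment of the RDF graph looks like:

<Abbie_Wolanow> on:careerStation <Abbie_Wolanow__5> .
<Abbie_Wolanow__5> on:team <United_States_men's_national_soccer_team> .

and this is a general pattern for how one might deal with situations where we want to say something more complex than "Abbie Wolanow played for the U.S. Men's National Soccer Team".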
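
The "going right" and "going left" steps can be sketched over a list of tuples. The triples are the ones returned by the queries above; the function names `outgoing` and `incoming` are my own.

```python
triples = [
    ("<Abbie_Wolanow>", "on:careerStation", "<Abbie_Wolanow__5>"),
    ("<Abbie_Wolanow__5>", "on:team", "<United_States_men's_national_soccer_team>"),
    ("<Abbie_Wolanow__5>", "on:numberOfGoals", 0),
    ("<Abbie_Wolanow__5>", "on:numberOfMatches", 1),
]

def outgoing(node):
    """Going right: predicates and objects where node is the subject."""
    return [(p, o) for s, p, o in triples if s == node]

def incoming(node):
    """Going left: subjects and predicates where node is the object."""
    return [(s, p) for s, p, o in triples if o == node]

print(incoming("<Abbie_Wolanow__5>"))  # [('<Abbie_Wolanow>', 'on:careerStation')]
```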

In terms of the source data, Career stations are much like the race entries in the yachting example in that a single page on Wikipedia contains a number of "sub-topics" that need to be referred to in order to keep together facts such as "this boat was the third finisher" and "Cam Lewis was the skipper of this boat"

The difference is that DBpedia identifies individual career stations while it does not identify individual race entries.

Here is a survey of the different predicate types that are used to describe career stations. I was probably a bit unlucky to pick a player who didn't have on:years specified very often:

In [34]:"""
    SELECT ?p (COUNT(*) AS ?count) { 
        ?that a on:CareerStation .
        ?that ?p ?o .
    } GROUP BY ?p ORDER BY DESC(?count)
rdf:type 2931158
on:team 941316
on:years 927710
on:numberOfGoals 647584
on:numberOfMatches 645122
on:title 12
on:filename 2
on:description 2
on:deathDate 1
on:birthDate 1
<> 1
on:country 1
<> 1
time: 2.15 s

What sort of people have career stations? I count the career stations and get the following results:

In [35]:
    SELECT ?type (COUNT(*) AS ?count) { 
        ?station a on:CareerStation .
        ?who on:careerStation ?station .
        ?who a ?type .
    } GROUP BY ?type ORDER BY DESC(?count)
on:Person 977021
on:Agent 977021
on:SoccerPlayer 869100
on:Athlete 823540
on:SoccerManager 197693
on:SportsManager 197451
on:IceHockeyPlayer 2079
on:Building 178
on:River 178
on:AmericanFootballPlayer 116
on:Organisation 108
time: 12.7 s

Career stations seem heavily weighted towards people who play soccer! The numbers above are hard to compare to other characteristics, however, because they are counting the career stations instead of the people. For instance, Abbie Wolanow is counted five times because he has five career stations.

With a slightly different query, I can count the actual number of people of various types who have career stations.

In [36]:"""
    SELECT ?type (COUNT(*) AS ?count) {
        { SELECT DISTINCT ?who {
            ?station a on:CareerStation .
            ?who on:careerStation ?station .
        } }
        ?who a ?type .
    } GROUP BY ?type ORDER BY DESC(?count)
on:Person 135887
on:Agent 135887
on:SoccerPlayer 125421
on:Athlete 121552
on:SoccerManager 18652
on:SportsManager 18617
on:IceHockeyPlayer 268
on:Building 31
on:River 28
on:AmericanFootballPlayer 19
on:Cricketer 16
on:Organisation 14
on:FictionalCharacter 13
on:SportsTeam 9
time: 11.2 s
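
The difference between the last two queries is the difference between counting rows and counting distinct people. A small sketch with sample (person, station) pairs; the pairs are illustrative, built from names that appear in the results above.

```python
# Sample (person, station) pairs, as matched by ?who on:careerStation ?station.
pairs = [
    ("<Abbie_Wolanow>", "<Abbie_Wolanow__1>"),
    ("<Abbie_Wolanow>", "<Abbie_Wolanow__2>"),
    ("<Abbie_Wolanow>", "<Abbie_Wolanow__3>"),
    ("<Adam_Wolanin>", "<Adam_Wolanin__1>"),
    ("<Adam_Wolanin>", "<Adam_Wolanin__2>"),
]

station_rows = len(pairs)                         # one row per career station
distinct_people = len({who for who, _ in pairs})  # one row per person
print(station_rows, distinct_people)  # 5 2
```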

Note that the counts here do not need to add up to anything in particular, because it is possible for someone to be in more than one category at a time. For instance, we see the same count for on:Person and on:Agent as well as on:Athlete and on:SoccerPlayer because each soccer player is an athlete. I got suspicious, however, and found that if I added the number of soccer players to the number of soccer managers...

In [37]:
time: 2.5 ms

... and found they were equal! That suggests that all of the people with career stations are involved with soccer, and that on:SoccerPlayer and on:SoccerManager are mutually exclusive.

I test that mutually exclusive bit by counting the number of topics which are both soccer players and soccer managers:

In [38]:"""
    SELECT (COUNT(*) AS ?count) {
        ?x a on:SoccerPlayer .
        ?x a on:SoccerManager .
    }
0 8192
time: 528 ms

Those two really are mutually exclusive.

This seems strange to me. I don't know much about soccer (I am from the U.S. after all!) but coaches and team managers in other sports are frequently former players. Shouldn't they be in soccer?

I investigate just a bit more, first getting a sample of managers...

In [39]:"""
    SELECT ?x {
        ?x a on:SoccerManager .
    } LIMIT 10
0 <Alan_Shearer>
1 <Alex_Ferguson>
2 <Dennis_Bergkamp>
3 <Enzo_Scifo>
4 <Marco_van_Basten>
5 <Osvaldo_Ardiles>
6 <Ruud_Gullit>
7 <Walter_Winterbottom>
8 <Alejandro_Morera_Soto>
9 <Aleksandr_Smirnov_(footballer,_born_1968)>
time: 273 ms

... and then looking at the text description of one in particular:

In [40]:"""
    SELECT ?comment  {
       <> rdfs:comment ?comment .
    }
'Sir Alexander Chapman "Alex" Ferguson, CBE (born 31 December 1941) is a former Scottish football manager and player who managed Manchester United from 1986 to 2013. He is regarded by many players, managers and analysts to be one of the greatest and most successful managers of all time.'
time: 339 ms

As I suspected, Alex Ferguson was a player who became a manager. These things are not mutually exclusive in the real world, although they are mutually exclusive in DBpedia.

It's a typical example of what you find when you look at "how things are" as opposed to "how things are supposed to be".

If it were up to me you'd be a soccer player if you'd ever played soccer and you'd be a manager if you'd ever managed a soccer team. On the other hand, I don't have my own database of thousands of soccer players (and managers!) so having to accept data in the format it is provided in is part of the price of "free" data.


In this article I began an investigation of data in DBpedia that particularly focused on two kinds of topics: race entries and the careers of soccer players. In the first case, information about different entries in the race is scrambled, because no subject is introduced for each entry. In the second case, DBpedia provides identifiers for "Career Stations" upon which it can state facts such as what team a person played on, for what time period, and so forth.

I hope very much that the "Career Station" is the future of DBpedia because there are many other things that can be modeled very similarly such as:

  • a person's educational career
  • the work career of a person who works for multiple employers over time
  • times in which a person has been a member of a band
  • locations of a concert tour
  • results of a series of sports events

Introduced in DBpedia 3.9, "Career Station" is relatively new. Other generic databases such as Freebase and Wikidata have used mechanisms such as compound value types and qualifiers to similar effect. Let's hope that the enthusiasm soccer fans have brought to DBpedia will carry over to other sports and endeavors!

This article is part of a series.