Extracting and manipulating article metadata (RDF) from Het Laatste Nieuws

This is a very rough tutorial demonstrating the use of Python libraries to manipulate metadata for articles on Het Laatste Nieuws. We extract the metadata with an RDFa parser, run SPARQL queries against the metadata of a single article and of multiple articles (using pandas to work with the SPARQL query results), and finish with a silly example that visualises query results.

Setup

Install the requirements for manipulating RDF and RDFa. pandas is optional, but it is a useful library for handling SPARQL results as a tabular dataset.

pip install rdflib html5lib sparql-client pandas
pip install git+https://github.com/RDFLib/pyrdfa3.git
In [44]:
# load the required libraries
import pyRdfa, sparql, pandas

Load and query RDF for a single article

Load the metadata embedded in the article HTML by using the RDFa parser. Store it in a Graph object g.

In [35]:
p = pyRdfa.pyRdfa()
article_url = 'http://www.hln.be/hln/nl/957/Binnenland/article/detail/1565024/2013/01/19/Charles-Michel-Als-N-VA-en-PS-zo-voortdoen-blokkeert-alles.dhtml'
g = p.graph_from_source(article_url)

Serialize the graph containing the article metadata in the human-readable Turtle format.

In [36]:
print(g.serialize(format='turtle'))
@prefix article: <http://ogp.me/ns/article#> .
@prefix fb: <http://ogp.me/ns/fb#> .
@prefix og: <http://ogp.me/ns#> .

<http://www.hln.be/hln/nl/957/Binnenland/article/detail/1565024/2013/01/19/Charles-Michel-Als-N-VA-en-PS-zo-voortdoen-blokkeert-alles.dhtml> og:description "De PS en de N-VA zijn elkaars beste vijanden. Ze delen klappen uit, maar kunnen elkaar geen pijn doen want ze staan allebei in hun eigen ring te ...";
    og:image "http://static3.hln.be/static/photo/2013/17/6/4/20130119112748/media_xl_5482999.jpg";
    og:site_name "HLN";
    og:title "Charles Michel: \"Als N-VA en PS zo voortdoen, blokkeert alles\" ";
    og:type "article";
    og:url "http://www.hln.be/hln/nl/957/Binnenland/article/detail/1565024/2013/01/19/Charles-Michel-Als-N-VA-en-PS-zo-voortdoen-blokkeert-alles.dhtml";
    article:author "http://www.hln.be/auteur/Stijn-Vossen";
    article:expiration_time "2100-01-01T00:00:00MET";
    article:published_time "2013-01-19T10:34:00MET";
    article:section "Binnenland";
    article:tag "charles michel",
        "n va",
        "politiek",
        "ps",
        "verkiezingen 2014";
    fb:app_id "367957443214829" .
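
The article:published_time and article:expiration_time values above end in a MET zone abbreviation, which standard ISO 8601 parsers generally do not accept. A minimal Python 3 sketch for turning such a value into a timezone-aware datetime follows. It assumes that MET can be treated as a fixed UTC+1 offset (ignoring daylight saving time), and parse_hln_time is a hypothetical helper name, not part of any library used here.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "MET" is treated as a fixed UTC+1 offset, without daylight saving.
MET = timezone(timedelta(hours=1), "MET")


def parse_hln_time(value):
    """Parse a timestamp such as '2013-01-19T10:34:00MET' (hypothetical helper)."""
    if not value.endswith("MET"):
        raise ValueError("expected a timestamp ending in 'MET': %r" % value)
    # Strip the three-letter suffix and parse the remaining date and time.
    naive = datetime.strptime(value[:-3], "%Y-%m-%dT%H:%M:%S")
    return naive.replace(tzinfo=MET)


published = parse_hln_time("2013-01-19T10:34:00MET")
print(published.isoformat())  # 2013-01-19T10:34:00+01:00
```

With aware datetime objects, publication times can be compared and sorted directly instead of being handled as strings.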


Select and print the article title and the author (i.e. the author URL) using a SPARQL query on the graph g.

In [40]:
q = """
SELECT ?title ?author_url
WHERE { ?url <http://ogp.me/ns#title> ?title .
        ?url <http://ogp.me/ns/article#author> ?author_url . }"""
# each result row exposes the selected variables as attributes
for row in g.query(q):
    print(row.title, row.author_url)
Charles Michel: "Als N-VA en PS zo voortdoen, blokkeert alles"  http://www.hln.be/auteur/Stijn-Vossen
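
The full URIs in the query above can be shortened with SPARQL PREFIX declarations, reusing the og and article prefixes that appear in the Turtle output. The Python 3 sketch below only builds the query string; with_prefixes is a hypothetical helper name introduced for illustration.

```python
# Prefixes as they appear in the Turtle serialization of the article graph.
PREFIXES = {
    "og": "http://ogp.me/ns#",
    "article": "http://ogp.me/ns/article#",
}


def with_prefixes(body, prefixes=PREFIXES):
    """Prepend SPARQL PREFIX declarations to a query body (hypothetical helper)."""
    header = "\n".join(
        "PREFIX %s: <%s>" % (name, uri) for name, uri in sorted(prefixes.items())
    )
    return header + "\n" + body


q = with_prefixes("""SELECT ?title ?author_url
WHERE {
    ?url og:title ?title .
    ?url article:author ?author_url .
}""")
print(q)
```

The resulting string should be usable wherever the longer query is used, for example as the argument to g.query.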

Query RDF over multiple articles

I have collected the RDF metadata for a set of 200 articles over the last two weeks and loaded it into a locally running Fuseki instance. Instead of using the built-in rdflib SPARQL client to query a local graph g, we query the Fuseki SPARQL endpoint using the Python sparql-client.

As an example query we select all articles that have "N-VA" in the title, and return the link and the title for each of the 167 found articles, and optionally the author and the publishing date.

We convert the SPARQL results to a pandas DataFrame and use pandas and IPython together to render a slice of the results (rows 1 to 19) inline as HTML.

In [130]:
q = """
SELECT ?url ?title ?author ?pubdate
WHERE {
    ?url <http://ogp.me/ns#title> ?title .
    OPTIONAL {
        ?url <http://ogp.me/ns/article#author> ?author .
        ?url <http://ogp.me/ns/article#published_time> ?pubdate.
    }
    FILTER regex(?title, "N-VA", "i") }
"""
result = sparql.query('http://localhost:3030/ds/query', q)
# for row in result:
#     print(row)

df = pandas.DataFrame(result.fetchall(), columns=result.variables)
print('Nr. of articles:', len(df))
from IPython.display import HTML
# note: this slice shows rows 1 to 19 and skips row 0
HTML(df[1:20].to_html())
Nr. of articles: 167
Out[130]:
   | url | title | author | pubdate
1  | http://www.hln.be/hln/nl/13816/Verkiezingen-20... | N-VA zetelt in 19 Vlaams-Brabantse coalities, ... | None | None
2  | http://www.hln.be/hln/nl/13816/Verkiezingen-20... | N-VA, Groen en Open Vld vinden elkaar in distr... | http://www.hln.be/auteur/Toon Mast | 2012-12-17T17:25:00MET
3  | http://www.hln.be/hln/nl/957/Binnenland/articl... | N-VA-burgemeester ziet terugkeercentrum in buu... | http://www.hln.be/auteur/redactie | 2013-01-08T19:36:00MET
4  | http://www.hln.be/hln/nl/13816/Verkiezingen-20... | Aalsterse socialisten blijven in bestuur met N... | None | None
5  | http://www.hln.be/hln/nl/957/Binnenland/articl... | N-VA: "Niet alleen Glenn Audenaert keeg steekp... | None | None
6  | http://www.hln.be/hln/nl/13816/Verkiezingen-20... | Antwerpse sp.a verbaasd door overstap gemeente... | None | None
7  | http://www.hln.be/hln/nl/13816/Verkiezingen-20... | Denert (N-VA) verliest sjerp in Kruibeke | None | None
8  | http://www.hln.be/hln/nl/4833/Gevangenissen/ar... | N-VA pleit voor gevangenis op militair domein ... | None | None
9  | http://www.hln.be/hln/nl/13816/Verkiezingen-20... | Oud-VB-voorzitter Frank Vanhecke roept op om N... | None | None
10 | http://www.hln.be/hln/nl/13816/Verkiezingen-20... | N-VA Zaventem daagt vijf mandatarissen voor Ra... | None | None
11 | http://www.hln.be/hln/nl/957/Binnenland/articl... | Rutten: "Di Rupo focust beter op beleid, niet ... | http://www.hln.be/auteur/Steven-Peeters | 2013-01-07T10:16:00MET
12 | http://www.hln.be/hln/nl/13816/Verkiezingen-20... | N-VA-kandidaat in Tremelo belooft gratis te zu... | None | None
13 | http://www.hln.be/hln/nl/13816/Verkiezingen-20... | Mark Demesmaeker (N-VA) van Halle naar Europa | None | None
14 | http://www.hln.be/hln/nl/2741/KV-Kortrijk/arti... | N-VA-ministers tegen lessen Frans bij KV Kortrijk | None | None
15 | http://www.hln.be/hln/nl/13816/Verkiezingen-20... | Open Vld en N-VA smeden coalitie in Niel | None | None
16 | http://www.hln.be/hln/nl/13816/Verkiezingen-20... | Dewinter droomt van coalitie met N-VA in Antwe... | None | None
17 | http://www.hln.be/hln/nl/957/Binnenland/articl... | Marcourt: "Succes N-VA zal investeerders afsch... | None | None
18 | http://www.hln.be/hln/nl/13816/Verkiezingen-20... | "Stem op N-VA verhoogt kans op oven in Kampenh... | None | None
19 | http://www.hln.be/hln/nl/13816/Verkiezingen-20... | Nicht Emir Kir niet op N-VA-lijst Sint-Joost-t... | None | None
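
The sparql-client library hides the HTTP exchange with the endpoint. As a rough illustration of what such a query looks like at the protocol level, the Python 3 sketch below builds a SPARQL protocol GET request that asks for JSON results and converts a document in the standard SPARQL JSON results format into rows, with None for unbound OPTIONAL variables. It uses only the standard library; build_request and rows_from_json are hypothetical helper names, and the sample document is hand-written with made-up example.org URLs, not real endpoint output.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_request(endpoint, query):
    """Build a SPARQL protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def rows_from_json(payload):
    """Convert a SPARQL JSON results document into (variables, rows)."""
    doc = json.loads(payload)
    variables = doc["head"]["vars"]
    rows = [
        # Unbound variables are simply absent from a binding.
        tuple(b[v]["value"] if v in b else None for v in variables)
        for b in doc["results"]["bindings"]
    ]
    return variables, rows


# Hand-written sample in the SPARQL JSON results format (made-up data).
sample = json.dumps({
    "head": {"vars": ["url", "title", "author"]},
    "results": {"bindings": [
        {
            "url": {"type": "uri", "value": "http://example.org/article-1"},
            "title": {"type": "literal", "value": "Example title about N-VA"},
            "author": {"type": "uri", "value": "http://www.hln.be/auteur/redactie"},
        },
        {
            "url": {"type": "uri", "value": "http://example.org/article-2"},
            "title": {"type": "literal", "value": "Another example title"},
        },
    ]},
})

variables, rows = rows_from_json(sample)
print(rows[1])  # ('http://example.org/article-2', 'Another example title', None)

# Against a running endpoint (not executed here):
# from urllib.request import urlopen
# with urlopen(build_request('http://localhost:3030/ds/query', q)) as resp:
#     variables, rows = rows_from_json(resp.read().decode('utf-8'))
```

The (variables, rows) pair has the same shape as the arguments given to pandas.DataFrame above, so it could be loaded into a DataFrame in the same way.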

Process results of multiple articles

We demonstrate one use of the basic article metadata with a silly example. For all the parsed articles on HLN that contain 'N-VA' in their title, we show the image associated with that article.

This gives a quick glimpse, a basic visualisation, of the actors associated with such topics.

In [121]:
q = """
SELECT ?img
WHERE {
    ?url <http://ogp.me/ns#title> ?title .
    ?url <http://ogp.me/ns#image> ?img .
    FILTER regex(?title, "N-VA", "i") }
"""
result = sparql.query('http://localhost:3030/ds/query', q)
img_urls = [url[0] for url in result.fetchall()]
img_urls = [str(url) for url in img_urls if 'logo' not in str(url)]
In [126]:
# quote the src attribute so the closing '/' is not parsed as part of the URL
html = ''
for url in img_urls:
    html = html + '<img src="' + url + '"/>'
HTML(html)
Out[126]:
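
String concatenation works for a quick look, but URLs taken from scraped metadata can contain characters such as & or quotes that are not safe inside an HTML attribute. A minimal Python 3 sketch, using only the standard library, that escapes each URL before building the markup follows. build_gallery is a hypothetical helper name and the sample URL is made up for illustration.

```python
from html import escape


def build_gallery(urls):
    """Return one <img> tag per URL, with the src attribute quoted and escaped."""
    return ''.join('<img src="%s"/>' % escape(url, quote=True) for url in urls)


# Made-up URL, only to show the escaping of '&'.
sample = build_gallery(['http://example.org/photo.jpg?w=100&h=50'])
print(sample)  # <img src="http://example.org/photo.jpg?w=100&amp;h=50"/>
```

In the notebook, the result could be rendered with HTML(build_gallery(img_urls)).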