PoliticalMashup Search Engine

Diachronic comparative analysis of parliamentary proceedings made easy

Well that is our goal ;-)

You PhD students get to decide whether we succeeded.

Maarten Marx, ILPS, UvA



What is this talk about?

Our goals and requirements:

Create a repository of ALL parliamentary proceedings in one machine-readable format

Requirements

  1. Affiliation network of politicians, their parties and their speeches in parliament
    • For every word ever spoken in parliament, we indicate when it was said, by whom, in what role, in what context, $\ldots$
  2. Complete
  3. Very high precision

Two main use cases:

Easy search and exploration

  • Quickly find parts of proceedings, then "deep read" them, and explore further.
  • Very easy, exact reference to immutable sources (by means of a URL)
    • a URL exists for every speech, even every paragraph, in the context of a debate

Data science like applications

  • "Machine shallow reading" of large amounts of data.
  • Example:

Look at the neighbouring words of [Ii]mmigran?t.*, group by party, "plot" the development of these words over the last 100 years, and compare this for Canada, the UK, and the Netherlands.
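A minimal sketch of that analysis in Python, assuming the speeches are available as (year, party, text) records; the record layout here is illustrative, not the actual PoliticalMashup schema:

```python
import re
from collections import Counter, defaultdict

# Hypothetical (year, party, text) records, not the real data model.
speeches = [
    (1990, "PvdA", "the immigrant families arrived last year"),
    (1990, "VVD", "illegal immigrants should be deported"),
    (2010, "PvdA", "every immigrant deserves a fair chance"),
]

pattern = re.compile(r"[Ii]mmigran?t.*")

# Count the words directly adjacent to each matching token, per (party, year).
neighbours = defaultdict(Counter)
for year, party, text in speeches:
    tokens = text.split()
    for i, tok in enumerate(tokens):
        if pattern.fullmatch(tok):
            for j in (i - 1, i + 1):
                if 0 <= j < len(tokens):
                    neighbours[(party, year)][tokens[j]] += 1

for key in sorted(neighbours):
    print(key, neighbours[key].most_common(3))
```

Grouping the resulting counters per country and plotting them over time gives the comparison described above.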

Example application

  • Jaco Alberts, Vrij Nederland, 23 April 2016, "Wie is de Anti Wilders?" ("Who is the anti-Wilders?")

Context

<img align='bottom' width='30%' src = 'http://politicalmashup.nl/uploads/2014/04/Dilipad-logo-REVERSED-300dpi.png'/>

Credits

  • Sander Lijbrink & Kees Halvemaan (ILPS): data preparation, ElasticSearch, SERP
  • Kaspar Beelen (U Toronto): data preparation
  • Laura Mul (UvA, Bsc thesis): testing, user evaluation

Data and Data Model

  • Type and structure of the documents
  • Size of collection and index
  • XML model
  • Flattened JSON model

Data: Parliamentary Proceedings

What's in the data?

Actors: Politicians and parties

  • Names, abbreviations, different spellings, and identifiers (URIs) used in the proceedings
  • immutable attributes: gender, date of birth, Wikipedia pages, links to DBpedia, etc.
  • mutable attributes: membership of parties, constituencies, occupation, etc.

Data: Parliamentary Proceedings

Proceedings

  • Nested data with metadata at each level, and most text in the bottom level
  • topic
    • scene
      • speech
        • paragraph

Meaning

  • topic: a debate on $\ldots$
  • speech: a 'non-interrupted' sequence of words spoken by one person
  • scene:
    • CA/UK: 'subtopic'
    • NL: a speech given from the central lectern including interruptions by others
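The nesting above can be sketched as a Python dict, in the spirit of the flattened JSON model; the identifiers and field names here are illustrative, not the actual PoliticalMashup schema:

```python
# Illustrative nesting only; ids and field names are hypothetical.
topic = {
    "id": "nl.proc.example.1",
    "title": "Debate on ...",
    "scenes": [
        {
            "speaker": "Wilders",  # NL: a speech from the central lectern
            "speeches": [
                {
                    "speaker": "Wilders",
                    "party": "PVV",
                    "paragraphs": ["First paragraph ...", "Second paragraph ..."],
                },
                {
                    "speaker": "Pechtold",  # an interruption is its own speech
                    "party": "D66",
                    "paragraphs": ["Point of order ..."],
                },
            ],
        }
    ],
}

# Most text lives at the bottom level:
n_paragraphs = sum(
    len(sp["paragraphs"]) for sc in topic["scenes"] for sp in sc["speeches"]
)
print(n_paragraphs)  # 3 paragraphs in total
```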

Data: example proceedings

Data: an example

  • Typical raw data: a scanned and OCRed PDF file containing all that is said on one day
    • 100 pages, two columns, as dense as ACM style (but with fewer formulas, pictures, and tables)
  • We extract structure and add metadata at all levels, arriving at a semistructured document
    • containing more or less the same text in the same (reading) order

Example

http://resolver.politicalmashup.nl/nl.proc.sgd.d.198319840000869?view=html#nl.proc.sgd.d.198319840000869.2.13.51

Data: numbers and sizes

  • Data: parliamentary proceedings CA, NL, UK. Only debates.
    • Period: Dutch: 1814-2014, UK 1935-2014, CA: from 1994
    • three linked datasets: proceedings, politicians, parties
    • Data format: XML (and derived from that, HTML, RDF, and now also JSON)

Numbers (summer 2015)

      topics    scenes   speeches   GB XML   GB index
CA    29.193   128.918  1.299.671      2.8        5.0
NL   105.734   198.909  2.616.865      6.8       15.2
UK   182.967   320.196  4.455.924      6.8       17.9

Current numbers (next slide)

In [23]:
import pandas as pd

# Read the per-country statistics and select the columns for the table.
stats = pd.read_excel('PoliticalMashupStatistieken.xls', index_col=0)
ef = stats[['Topics', 'Scenes', 'Speeches', 'GB XML', 'GB Index', 'Period']]

print(ef.to_latex())
\begin{tabular}{lrrrrrl}
\toprule
{} &  Topics &  Scenes &  Speeches &  GB XML &  GB Index &     Period \\
Land            &         &         &           &         &           &            \\
\midrule
United Kingdom  &  332675 &  573477 &   6274707 &    11.0 &      27.0 &  1803-2014 \\
The Netherlands &  105734 &  198909 &   2616865 &     6.7 &      15.2 &  1814-2004 \\
Canada          &  211236 &  392598 &   4425249 &     5.1 &      14.7 &  1901-2014 \\
Sweden          &   51316 &       0 &    310740 &     1.1 &       1.9 &  1990-2014 \\
Denmark         &   18946 &       0 &    549053 &     0.8 &       1.4 &  1999-2014 \\
Norway          &   13262 &   88238 &    188825 &     0.5 &       1.2 &  1998-2014 \\
European Union  &   20302 &       0 &    282831 &     0.6 &       1.1 &  1999-2014 \\
Belgium         &   24423 &       0 &    139371 &     0.3 &       0.5 &  1995-2014 \\
\bottomrule
\end{tabular}

Information needs we want to support

  • known-item search is what most existing search engines on this data cater for; we additionally want to support exploratory search

Used techniques

  • facets
  • allow different rankings
  • search at multiple granularities: topic, scene, speech
  • Aggregations
    • time lines, also grouped by actors
    • word-cloud summaries, also grouped by actors.
  • Allow queries which combine content with debate network constraints:
    • return documents about moslims in which Wilders speaks, but in which he is not interrupted by Pechtold
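In Elasticsearch's query DSL, such a content-plus-network constraint can be sketched as a bool query; the field names below (text, speaker, interrupted_by) are assumptions about the flattened JSON model, not the actual mapping:

```python
# Sketch of the "about moslims, Wilders speaks, not interrupted by Pechtold"
# query. Field names are hypothetical, not the real PoliticalMashup mapping.
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"text": "moslims"}},
                {"match": {"speaker": "Wilders"}},
            ],
            "must_not": [
                {"match": {"interrupted_by": "Pechtold"}},
            ],
        }
    }
}

# With an Elasticsearch client this would be sent as, e.g.:
# es.search(index="proceedings", body=query)
```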

Specific technical choices

Ngram viewer

  • Goal: support each ngram, and show # hits per year over full period
  • Consequence: must also index stopwords
    • Index grew from 14.3 to 15.2 GB
    • stopwords also appear in "primitive" word clouds
      • filtering them out is expensive and must be done at query time
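Counting hits per year for one n-gram can be sketched as an Elasticsearch phrase query with a date histogram aggregation (which is why stopwords must be indexed: a phrase like "the immigrant" contains one); the field names ("text", "date") are assumptions, and "calendar_interval" is the parameter name in recent Elasticsearch versions:

```python
# Sketch of an n-gram viewer query: phrase match plus per-year hit counts.
# Field names are hypothetical, not the real PoliticalMashup mapping.
ngram_query = {
    "size": 0,  # we only need the aggregation buckets, not the hits themselves
    "query": {"match_phrase": {"text": "the immigrant"}},
    "aggs": {
        "per_year": {
            "date_histogram": {"field": "date", "calendar_interval": "year"}
        }
    },
}
```

Each bucket in the response then holds the hit count for one year, ready to be plotted over the full period.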

Demo

search.politicalmashup.nl

Highlights

  • Entry point retrieval
    • user chooses granularity
  • Data exploration:
    • facets
    • different sorting options
    • 3D histograms (# hits per actor per year)
    • 2D "summaries" (related terms per actor)
  • Network type queries
    • inclusion and exclusion of actors
  • Ngram viewer

Conclusions

  1. ElasticSearch seems well suited for exploratory search engines
  2. Aggregation options and possibilities grow fast
    • Kibana
  3. Prototype development on a serious (complete) dataset is feasible
    • reindexing the Dutch collection (15 GB) takes 90 minutes

Next steps

  1. User evaluation
  2. Stress testing
  3. Install at the KB (Koninklijke Bibliotheek, the Dutch national library), obtain serious traffic, get usage data
    • The KB currently hosts a search engine on a largely overlapping dataset.
    • The KB already hosts our in-house developed n-gram viewer over the KB newspaper (kranten) corpus
  4. Add more datasets
    • Done:
      • UK 1800-1930
      • CA 1900-1993
