Co-Occurring Tag Analysis

This notebook analyses how tags co-occur across various Parliamentary publications. The aim is to see whether topic tags fall into naturally occurring groupings by virtue of their co-occurrence when used to tag different classes of Parliamentary publication.

The data is provided as a set of Linked Data triples exported as Turtle (.ttl) data files. It represents, among other things, Parliamentary resources (such as early day motions or other proceedings records) and the subject/topic labels they are tagged with.

The data allows us to generate a graph that associates tags with resources, and from that a graph that directly associates tags with other tags by virtue of their commonly tagging the same resource or set of resources.
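
As a minimal sketch of that idea, counting how often each pair of tags shares a resource gives the edge weights of a tag-to-tag graph. The resource ids and tag lists below are made up for illustration and are not taken from the export.

```python
from collections import Counter
from itertools import combinations

# Hypothetical resource -> tags mapping (illustrative values only)
resource_tags = {
    'edm1': ['Roads', 'Charities'],
    'edm2': ['Charities', 'Health services'],
    'edm3': ['Roads', 'Charities', 'Health services'],
}

def cooccurrence_counts(resource_tags):
    """Count how many resources each unordered pair of tags shares."""
    counts = Counter()
    for tags in resource_tags.values():
        # Sort so that (a, b) and (b, a) are counted as the same pair
        for pair in combinations(sorted(set(tags)), 2):
            counts[pair] += 1
    return counts

tag_pairs = cooccurrence_counts(resource_tags)
print(tag_pairs.most_common())
```

Each key is a pair of tags and each value is the number of resources carrying both, which is the quantity the weighted projection later in the notebook computes with networkx.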

In [2]:
#Data files
!ls ../data/dataexport
edms        proceedings terms


Import a library that lets us work with the data files:

In [3]:
#Data is provided as Turtle/ttl files - rdflib handles those

#!pip3 install rdflib
from rdflib import Graph

Simple utility to load all the .ttl files in a particular directory into a graph:

In [4]:
import os
def ttl_graphbuilder(path,g=None,debug=False):
    #We can add the triples to an existing graph or create a new one for them
    if g is None:
        g=Graph()
    #Loop through all the files in the directory and then load the ones that have a .ttl suffix
    for ttl in [f for f in os.listdir(path) if f.endswith('.ttl')]:
        if debug: print(ttl)
        g.parse('{}/{}'.format(path,ttl), format='turtle')
    return g

Tools for running queries over a graph and either printing the result or putting it into a pandas dataframe:

In [5]:
def rdfQuery(graph,q):
    #Run the SPARQL query and print each result row on its own line
    ans=graph.query(q)
    for row in ans:
        for el in row:
            print(el,end=" ")
        print()

#ish via
import pandas as pd
def sparql2df(graph,q, cast_to_numeric=True):
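    a=graph.query(q)
    c = []
    for b in a.bindings:
        #Collect the values for one result row, in the order of the query variables
        rowvals=[]
        for k in a.vars:
            rowvals.append(b.get(k))
        c.append(rowvals)
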
    df = pd.DataFrame(c)
    df.columns = [str(v) for v in a.vars]
    if cast_to_numeric:
        df = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))

    return df

Tools to support the export and display of graphs: the networkx package is handy here, e.g. for exporting to GEXF format for use with Gephi. We can also run projections on the graph quite easily.

In [6]:
import networkx as nx
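
For example, a graph can be written out for Gephi with `nx.write_gexf`. The node names and the file name in this sketch are placeholders, and the temporary directory is only there so that the example can run anywhere.

```python
import os
import tempfile
import networkx as nx

demo_graph = nx.Graph()
demo_graph.add_edge('Roads', 'Charities', weight=2)

# Write to a temporary directory; in practice use any .gexf path
out_path = os.path.join(tempfile.mkdtemp(), 'topics.gexf')
nx.write_gexf(demo_graph, out_path)

# Reading the file back is a quick sanity check on the export
reloaded = nx.read_gexf(out_path)
print(sorted(reloaded.nodes()))
```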

Exploring the Data - Terms

In [7]:
path='../data/dataexport/terms'
termgraph=ttl_graphbuilder(path)
In [8]:
#What's in the graph generally?
q='''SELECT * WHERE {
    ?x ?y ?z.
} LIMIT 10'''
rdfQuery(termgraph,q)
CND Vittinghoff, Kurt LEG
In [9]:
#What does a term have associated with it more specifically?
q='''SELECT * WHERE {
    <> ?y ?z.
} LIMIT 10'''
rdfQuery(termgraph,q)
Defence policy TPG

Looks like the prefLabel is what we want:

In [10]:
q='''SELECT ?z ?topic WHERE {
    ?z <> ?topic.
} LIMIT 10'''
sparql2df(termgraph,q)
z topic
0 Tim Devlin
1 Standard Chartered Capital Markets
3 Comite des Sages for Air Transport
6 Regency Act 1937
7 Portuguese Trade and Tourism Office
9 Vittinghoff, Kurt

Exploring the Data - EDMS

In [11]:
path='../data/dataexport/edms'
g=ttl_graphbuilder(path)
In [12]:
#See what's there generally...
q='''SELECT * WHERE {
    ?x ?y ?z.
} LIMIT 10'''
rdfQuery(g,q)
That this House celebrates the 40th anniversary of the Ulster American Folk Park and Museum and the strong relationship between the United States (US) and the United Kingdom of Great Britain and Northern Ireland; notes that this anniversary underlines the special relationship between Northern Ireland and the US and its 16 Presidents who had Ulster connections; and hopes that this bond will continue to thrive and blossom in the years ahead. 2016-05-18T23:00:00+00:00 That this House remembers with respect and love, in the week of his birthday, the incredible legacy left by Nelson Mandela to the world; acknowledges that to many Africans he was known simply as Madiba; endorses the message behind Mandela Day that all citizens of the world should endeavour to do good every day; agrees that the actions of those citizens should focus on the realisation or restoration of dignity and empowerment, looks forward to welcoming Madiba's friend and one time fellow political prisoner Denis Goldberg to Parliament on 20 July 2016; and supports calls by the Nelson Mandela Foundation to make every day a Mandela Day.
In [13]:
#Explore a specific EDM
q='''SELECT * WHERE {
    <> ?y ?z.
}'''
rdfQuery(g,q)
That this House notes the announcement of 300 redundancies at the Nestlé manufacturing factories in York, Fawdon, Halifax and Girvan and that production of the Blue Riband bar will be transferred to Poland; acknowledges in the first three months of 2017 Nestlé achieved £21 billion in sales, a 0.4 per cent increase over the same period in 2016; further notes 156 of these job losses will be in York, a city that in the last six months has seen 2,000 job losses announced and has become the most inequitable city outside of the South East, and a further 110 jobs from Fawdon, Newcastle; recognises the losses come within a month of triggering Article 50, and as negotiations with the EU on the UK leaving the EU and the UK's future with the EU are commencing; further recognises the cost of importing products, including sugar, cocoa and production machinery, has risen due to the weakness of the pound and the uncertainty over the UK's future relationship with the single market and customs union; and calls on the Government to intervene and work with hon. Members, trades unions GMB and Unite and the company to avert these job losses now and prevent further job losses across Nestlé. 2017-04-25T23:00:00+00:00

Let's merge the EDM graph data with the terms data.

In [15]:

Now we can look at the term labels associated with a particular EDM.

In [16]:
q='''SELECT ?t WHERE {
    <> <> ?z.
    ?z <> ?t.
} LIMIT 10'''
rdfQuery(g,q)
Arms control 
International politics and government 
North America 
Defence policy 

We can also create a table that links topic labels with EDMs.

In [17]:
q='''SELECT DISTINCT ?edms ?topic {
    ?edms <> <>.
    ?edms <> ?z.
    ?z <> ?topic.
}'''
g_df=sparql2df(g,q)
g_df.head()
edms topic
0 Roads
1 Charities
2 Disability discrimination
3 Animals
4 Service industries

From this table, we can generate a bipartite networkx graph that links topic labels with EDMs.

In [18]:
#networkx 2.x replaced from_pandas_dataframe with from_pandas_edgelist
nxg=nx.from_pandas_edgelist(g_df, 'edms', 'topic')
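
A small sketch of the table-to-graph step, using `nx.from_pandas_edgelist` (the DataFrame helper in recent networkx releases) on an invented four-row table:

```python
import pandas as pd
import networkx as nx

# Invented rows: one row per (EDM, topic) pairing
demo_df = pd.DataFrame({
    'edms': ['edm1', 'edm1', 'edm2', 'edm2'],
    'topic': ['Roads', 'Charities', 'Charities', 'Health services'],
})
# Each row becomes an edge between an EDM node and a topic node
demo_bipartite = nx.from_pandas_edgelist(demo_df, 'edms', 'topic')
print(demo_bipartite.number_of_nodes(), demo_bipartite.number_of_edges())
```

Because EDM ids and topic labels share one node namespace, this relies on no topic label ever being identical to an EDM id.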

We can then project this bipartite graph onto just the topic label nodes - edges will now connect nodes that are linked through one or more common EDMs.

In [19]:
from networkx.algorithms import bipartite
#We can find the sets of names/tags associated with the disjoint sets in the graph
#I think the directedness of the graph means we can be reasonably sure the variable names are correctly ordered?
edms,topic=bipartite.sets(nxg)

#Collapse the bipartite graph to a graph of topic labels connected via a common EDM
topicgraph= bipartite.projected_graph(nxg, topic)

We can also generate a weighted graph, where edges are weighted relative to how many times topics are linked through different EDMs.

In [20]:
topicgraph_weighted= bipartite.weighted_projected_graph(nxg, topic)
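
The two projections can be compared on a toy bipartite graph. The EDM ids and topics here are invented, and the topic node set is built explicitly from the edge list rather than inferred from the graph.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Invented (EDM, topic) edges
edges = [('edm1', 'Roads'), ('edm1', 'Charities'),
         ('edm2', 'Roads'), ('edm2', 'Charities'),
         ('edm3', 'Charities'), ('edm3', 'Health services')]
B = nx.Graph(edges)
topics = {t for _, t in edges}

# Unweighted: an edge wherever two topics share at least one EDM
projected = bipartite.projected_graph(B, topics)
# Weighted: the edge weight is the number of EDMs the two topics share
weighted = bipartite.weighted_projected_graph(B, topics)
print(weighted['Roads']['Charities']['weight'])
```

Roads and Charities share two EDMs and so get a weight of 2, while Roads and Health services never share an EDM and are not connected in either projection.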

Predicting Topics

In [39]:
#!pip3 install scikit-learn
In [73]:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
In [56]:
edms topic
0 [Sports and Olympic Games]
1 [Tourism, Service industries, Food]
2 [Charities, Armed forces welfare]
3 [Health staff and professions, Health services...
4 [Diseases, Health education and preventive med...
In [57]:
q='''SELECT DISTINCT ?edms ?motiontext {
    ?edms <> <>.
    ?edms <> ?motiontext.
}'''
edms motiontext topic
0 That this House is aware that guide dog owners... [Roads, Charities, Disability discrimination, ...
1 That this House congratulates Titanic Belfast ... [Tourism]
2 That this House recognises the vitally importa... [Health services, Mental health]
3 That this House congratulates Glasgow-based Wo... [Charities]
4 That this House congratulates the Scotsman new... [Press]
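
A table of this shape can be built by grouping the one-row-per-tag table into a list of topics per EDM and then merging it with the motion text. This is a sketch under assumed names: the frames `topic_rows`, `motions` and `topics_per_edm` and the placeholder texts are hypothetical, not the notebook's own variables.

```python
import pandas as pd

# Invented rows standing in for the topic and motion text query results
topic_rows = pd.DataFrame({
    'edms': ['edm1', 'edm1', 'edm2'],
    'topic': ['Roads', 'Charities', 'Tourism'],
})
motions = pd.DataFrame({
    'edms': ['edm1', 'edm2'],
    'motiontext': ['placeholder text of motion 1', 'placeholder text of motion 2'],
})

# Collapse the one-row-per-tag table to one list of topics per EDM
topics_per_edm = topic_rows.groupby('edms')['topic'].apply(list).reset_index()
# Attach the motion text to each EDM's topic list
merged = pd.merge(motions, topics_per_edm, on='edms')
print(merged)
```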
In [69]:
X_train= np.array(m_df['motiontext'][:-100].tolist())
X_test = np.array(m_df['motiontext'][-100:].tolist())   
In [70]:
['Roads', 'Charities', 'Disability discrimination']
In [76]:
#ytrain= [[target_names.index(i) for i in t] for t in m_df['topic'][:-100] ]
y_train_text = [ t for t in m_df['topic'][:-100] ]
  'Disability discrimination',
  'Service industries'],
 ['Health services', 'Mental health']]
In [96]:
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(analyzer='word',stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
all_labels = mlb.inverse_transform(predicted)

hits, misses = [], []
for item, labels in zip(X_test, all_labels):
    if labels!=(): hits.append('{0} => {1}'.format(item, ', '.join(labels)))
    else: misses.append('{0} => {1}'.format(item, ', '.join(labels)))
print("some hits:\n{}\n\nsome misses:\n{}".format('\n'.join(hits[:3]),'\n'.join(misses[:3])))
some hits:
That this House believes that air pollution from burning diesel has a significant impact on health in the UK, causing thousands of premature deaths and reducing quality of life for many people; acknowledges that forthcoming legislation to tackle emissions, such as Clean Air Zones proposed by the Department for Environment, Food and Rural Affairs, needs to look at all sources of diesel emissions; notes that auxiliary engines, such as transport refrigeration units, can emit many times more nitrogen oxides and particulate matter than a vehicle's primary engine; and calls for legislation to reflect this in order to protect people's health and help promote the uptake of clean alternatives. => Pollution
That this House notes the decision to leave the EU made in the referendum held on 23 June 2016; further notes that this decision did not call for the UK to leave the Single European Market and the Conservative Party manifesto commitment to safeguard British interests in the Single Market; notes that the Prime Minister does not intend to offer the House a vote on the strategic priorities to be pursued in the UK's negotiations with the EU or on whether to trigger Article 50 of the Lisbon Treaty; duly notes the consequential impact on the value of the pound, the concerns expressed by the UK's business partners for the prospects for the UK economy and HM Treasury's estimate of the long-term cost of Hard Brexit; and calls on the Government to reverse its decision not to offer the House a vote on the strategic priorities to be pursued in the UK's negotiations with the EU and to seek a democratic mandate for the outcome of that negotiation by vote of the House. => EU law and treaties
That this House welcomes the birth of Ibrahim Al Hussein, thought to be the first child of Syrian refugees born in Aberdeen; congratulates his parents Fadila and Khalid and his brother Shadea on making their new home in the North East of Scotland after fleeing the horrors of war-torn Damascus; recognises the harrowing experiences Fadila and Khalid went through living in the refugee camp outside Erbil in Iraq for 27 months; regrets that they were unable to reach the UK before Shadea's birth meaning Fadila had to give birth to her eldest son in the camp where the family lived in a small tent with a mud floor; believes that more refugees must be brought into the UK in order to prevent more individuals and families from living in such horrific conditions; and calls for the Government to take in more refugees fleeing the war in Syria so that more children like Ibrahim can be born in safe and peaceful circumstances. => Asylum

some misses:
That this House congratulates Ayrshire College on winning the Semta Skills Award for Training Partner of the Year 2017; understands that Ayrshire College has developed a close strategic partnership with industry that supports both the students to learn valuable in-demand skills and local companies need for highly trained individuals to join the workforce; notes that Ayrshire College has sparked the imagination of 200 school pupils and college students with its competition to design a space experiment aided by NASA experts, the winning entry will be tested at the International Space Station in 2017; applauds its #ThisAyrshireGirlCan campaign to encourage more girls into engineering and advancing equality by promoting apprenticeships to female students; recognises the great success of its scheme to retrain unemployed engineers and through a partnership business help them re-join the local workforce; wishes Ayrshire College continued success with its strategic partnership approach. => 
That this House notes that Diageo closed its final salary pension scheme to new employees in 2005, replacing this with the Lifestyle Plan Pension Scheme; further notes Diageo's announcement that this replacement scheme will now be closed to new starters, without consultation, the final salary scheme closed completely, and that a defined contribution scheme will be introduced; notes the company's operating profit is expected to be £2.841 billion for 2016, a 3.5 per cent increase from that in 2015; considers that profits are generated through the hard work of Diageo's workforce and that these new pension proposals are an attack on the terms and conditions of the workers; fears that this proposal is purely a cost-cutting measure; supports the efforts of the trade unions GMB and UNITE to represent their members and secure pensions justice; and demands that the board of directors at Diageo intervene and reverse this proposal. => 
That this House recognises the achievements of BRAG Enterprise, which is based in Lochgelly; commends the work it  does in regenerating local communities through the creation and support of sustainable employment; welcomes the organisation's new programme Greenpower Formula 24 which introduces young people to STEM subjects through the designing, building, modification and racing of a full-size single seater battery-powered racing car; praises the project's aims to change the perception of STEM subjects and encourage more girls into the field; wishes the project every success and hopes it will inspire many into the field; understands the great potential for work like this to be developed in Scotland; and highlights the importance of STEM subjects to Scotland and its wider economy. => 
In [94]:
('Oil, petrol and natural gas',)
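
The label encoding and decoding steps are easier to follow on a toy example. The texts and labels below are invented, and with so little data the predictions themselves mean nothing; the point is only how `MultiLabelBinarizer` maps between label lists and indicator rows.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Invented toy data: short texts with one or more labels each
texts = ['road repairs and potholes', 'charity fun run on closed roads',
         'charity fundraising appeal', 'tourism boost for seaside hotels']
labels = [['Roads'], ['Charities', 'Roads'], ['Charities'], ['Tourism']]

mlb_demo = MultiLabelBinarizer()
Y_demo = mlb_demo.fit_transform(labels)   # rows: texts, columns: labels

clf_demo = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
clf_demo.fit(texts, Y_demo)

predicted_demo = clf_demo.predict(['new charity appeal'])
# inverse_transform maps each indicator row back to a tuple of label names;
# an empty tuple means no label was predicted
print(mlb_demo.classes_)
print(mlb_demo.inverse_transform(predicted_demo))
```

An empty tuple from `inverse_transform` corresponds to the "misses" reported above: no per-label classifier fired for that text.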

Exploring the Data - proceedings

In [ ]:
path='../data/dataexport/proceedings'
In [ ]:
!ls {path}
In [ ]:
!cat {path}/0006D323-D0B5-4E22-A26E-75ABB621F58E.ttl