Topic Modelling of Australian Parliamentary Press Releases

An exploration by Adel Rahmani
This Jupyter notebook can be found on GitHub

As part of his wonderful work with Trove data, Tim Sherratt has harvested and made available a subset of more than 12,000 Australian parliamentary press releases (the data can be downloaded from his GLAM Workbench website). The collection of press releases was built by selecting documents containing key words relating to immigration/refugee issues.

On his GLAM Workbench website Tim Sherratt explains how the documents were harvested:

Politicians talking about 'immigrants' and 'refugees'

Using the notebook above I harvested parliamentary press releases that included any of the terms 'immigrant', 'asylum seeker', 'boat people', 'illegal arrivals', or 'boat arrivals'. A total of 12,619 text files were harvested.

I was curious about the contents of the press releases; however, at more than 12,000 documents the collection is too large to read through, so I thought I'd get the computer to do it for me and use topic modelling to poke around the corpus.

Let's start by importing several modules that we will need.

In [2]:
# suppress warnings
import warnings
warnings.simplefilter('ignore')

# dataframe (think spreadsheet-like data) manipulation
import pandas as pd

# fast array operations  
import numpy as np

# file system access handling (more practical than the os module)
from pathlib import Path

# plotting
import matplotlib.pyplot as plt
import seaborn as sns

# natural language processing
from sklearn.feature_extraction.text import TfidfVectorizer

# matrix factorisation
from sklearn.decomposition import NMF

# dimensionality reduction
from umap import UMAP

# regular expressions
import re

# render plots within the notebook
%matplotlib inline

Reading the documents

I downloaded the Politicians talking about 'immigrants' and 'refugees' data from the GLAM Workbench, and unzipped it.

Each press release is its own .txt file inside a directory named texts.

Let's create a Path object and use it to get a list of all the files.

In [3]:
docs_path = [p for p in Path('texts').glob('*.txt')]
print(f"Found {len(docs_path)} documents.")
Found 12619 documents.

Each element of docs_path is a pathlib.PosixPath object with some useful methods.

For instance, we can read the text directly without an explicit call to open.

Let's illustrate this with the first press release.

In [4]:
docs_path[0]
Out[4]:
PosixPath('texts/2009-10-15-truss-warren-national-party-of-australia-211330210.txt')
In [5]:
# utf-8 is used by default on my computer so there's no real need for me
# to specify the encoding...

print(docs_path[0].read_text(encoding='utf-8').strip())  
Labor policy has failed on boat arrivals   15-October-2009

The fact that at least 25 people have died trying to get to Australia since the Federal Labor Government signalled  its new open door policy on border security should be enough for Labor to rethink its failed strategy.



The Leader of The Nationals, Warren Truss, said the processes put in place by the Coalition ended the trade in  human misery.



“That trade has tragically re-emerged and the world knows Australia is now a soft touch,” Mr Truss said. “The  Coalition policies closed the door to people smugglers and stopped the boats. Our tough approach worked once  and will work again.”



On Wednesday the Prime Minister tried to paint himself as a hardliner on this issue, saying: “The key thing is to  have a tough, hard-nosed approach to border security, dealing with the vile species who are represented by  smugglers on the one hand and a humane approach to our international obligations on the other.”



“Sadly, Mr Rudd is failing on both counts. The people smugglers have been given a green light and a few insults  will not deter them. After all the Prime Minister has used much stronger language than that on his Caucus  colleagues.



“There is nothing tough or humane about Labor’s approach - it is just weak and pathetic, full of mixed messages  and no deterrents.  “The worldwide tally of refugees and those seeking asylum has not changed significantly from when the Coalition  was in government.



“I understand that people are desperate to leave their embattled homelands. I understand there are economic  reasons why people choose Australia over Indonesia or other countries. And I also understand that without a  sense of order in this process, those who are waiting patiently to be granting refugee status in Australia in the  proper way are being shunted further and further back down the line.



“Labor has lost control on immigration. The pandering to special interests in Australia has left porous borders,  increasing social unrest here and worst of all, death on the high seas,” Mr Truss said.

Source:  Warren Truss, MP

Processing the documents

Now that we can read the documents we need to transform them into something
the computer can ingest.

Computers like numbers so we need to convert the text into a sequence of numbers.

This process is called tokenisation. We break the text into tokens and associate an integer with each unique token. Tokens can be many things, from single characters to n-grams, but for simplicity we'll use words. More specifically, we'll consider words of at least 2 letters (from the Latin alphabet).
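To make this concrete, here's a toy illustration (not part of the pipeline below) of turning a couple of made-up sentences into integer token ids, using the same kind of word pattern we'll rely on later:

import re

toy_docs = ["The boats have arrived.", "Labor policy on boat arrivals."]

# extract lowercase words of at least 2 letters (same idea as the token_pattern used later on)
tokenised = [re.findall(r'[a-z]{2,}', doc.lower()) for doc in toy_docs]

# map each unique token to an integer id
vocab = {tok: i for i, tok in enumerate(sorted({t for doc in tokenised for t in doc}))}

print(tokenised)
print(vocab)
print([[vocab[t] for t in doc] for doc in tokenised])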

Let's tokenise the documents using the excellent scikit-learn library. We'll use the TfidfVectorizer, and wrap it into a tokenize function (every time I try to leave the US spelling "tokenize" behind, they pull me back in!).

For anyone interested in the details behind the TF-IDF process I've got more information here.

Note: In natural language processing, it is common (especially for topic modelling) to stem or lemmatise the tokens. This is to avoid redundant terms (such as the singular and plural forms of the same word). I won't do that here because the process (lemmatisation most notably) increases the processing time, and I'm just doing some basic exploration.
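If you did want to lemmatise, here is a minimal sketch of how each document could be preprocessed before vectorising, assuming spaCy and its small English model (en_core_web_sm) are installed; it is not used anywhere in this notebook.

# optional: lemmatise documents before vectorising (not used in this notebook)
# assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

# keep only the components needed for lemmatisation
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatise(text):
    # join the lemmas of the alphabetic tokens back into a single string
    return ' '.join(tok.lemma_.lower() for tok in nlp(text) if tok.is_alpha)

# lemmatise('Boats are arriving') gives roughly 'boat be arrive'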

In [6]:
def tokenize(corpus, docs_path, **kwargs):
    '''
    Simple wrapper function around a sklearn 
    TfidfVectorizer object. 
    '''
    # create an instance of the vectoriser
    tfidf = TfidfVectorizer(**kwargs)                             

    # the vectoriser returns a sparse array which
    # we convert to a dense array for convenience
    X_tfidf = np.asarray(tfidf.fit_transform(corpus).todense())

    print(f"Tokenised {X_tfidf.shape[0]} documents using a vocabulary of {len(tfidf.get_feature_names())} tokens.")
    
    return X_tfidf, tfidf

We now need a corpus. Let's create a generator that will yield each document one at a time and call it corpus.

This generator will be consumed by the tokeniser.

In [7]:
corpus = (p.read_text(encoding='utf-8').strip() for p in docs_path)

Time to give it a spin. The tokeniser takes several parameters. The most important for us are the following:

  • min_df: minimum document frequency, i.e., only consider words which appear in at least min_df documents of the corpus.
  • max_df: maximum document frequency, i.e., only consider words which appear in at most max_df documents of the corpus.
  • token_pattern: what we mean by a token. I'm using a regular expression.
  • max_features: maximum size of the vocabulary.

Note: both min_df and max_df accept integer or float values. Integer values represent a number of documents in the corpus, floating point values must be between 0 and 1.0 and represent a proportion of documents in the corpus.
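For example, both of these are valid ways of expressing (roughly) the same constraints:

# a count: at least 10 documents, and at most 50% of the documents
TfidfVectorizer(min_df=10, max_df=0.5)

# a proportion: at least ~0.1% of the documents, and at most 50% of the documents
TfidfVectorizer(min_df=0.001, max_df=0.5)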

Another useful thing to give our tokeniser is a list of stopwords. These are typically words which are so common in the language they are virtually useless when trying to determine what a document is about.

In English, for instance, saying that the word "the" appears in a document does not shed much light on its contents.

Various collections of stopwords can be found. Here are the words in my STOPWORDS variable:

In [8]:
STOPWORDS = {'anything', 'mightn', 'upon', 'six', 'herein', 'hers', 'indeed', 'becomes', 'twenty', 'at', 'up', 'will', 'meanwhile', 'same', 'onto', 'seem', 'it', 'had', 'they', "'m", 'beforehand', 'describe', 'was', 'moreover', 'hereupon', 'your', 'due', 'un', 'eleven', 'further', 'him', 'is', 'whereas', 'hasnt', 'in', 'we', 'them', 'ten', 'however', 'done', 'fire', 'through', 'keep', 'sometimes', 'unless', 'needn', 'until', 'top', 'there', 'just', 'didn', 'because', 'wherever', 'couldnt', 'front', 'someone', 'afterwards', 'within', 'won', 'except', 'he', 'fill', 'ours', 'my', 'others', 'latterly', 'made', 'first', 'about', 'call', 'may', 'thence', 'seeming', 'nor', 'haven', 'couldn', 'nothing', 'everyone', 'enough', 'her', 'latter', 'detail', 'now', 'where', 'while', 'became', 'wouldn', 'besides', 'do', 'its', 'wasn', 'another', 'during', 'around', 'shouldn', 'some', 'whoever', 'once', 'inc', 'con', 'll', 'four', 'back', 'm', 'although', 've', 'either', 'their', 'beside', 'yourself', 'how', 'when', 'whom', 'sincere', 'thereafter', 'out', 'between', 'whether', 'hereafter', 'she', "'re", 'over', 'thru', 'i', 'very', 'whereupon', 'above', 'third', 'alone', 'aren', 'nevertheless', 'almost', 'various', 'nowhere', 'so', 'make', 'somehow', 'here', 'take', "'d", 'those', 'whereby', 'whereafter', 'mill', 'get', 'after', 'into', 'ourselves', 'more', 'regarding', 'quite', 'don', 'ever', 'everywhere', 'whole', 'five', 'ma', 'whence', 'below', 'eg', 'give', 'under', 'ltd', 'yours', 'd', 'whatever', 'might', 'be', 'using', 'serious', 'not', 'anyhow', 'ca', 'his', 'becoming', 'who', 'hasn', 'therein', 'again', 'me', 'empty', 'noone', 'being', 't', 'nobody', 'hadn', 'theirs', 'since', 'rather', 'mustn', 'nine', 'from', 'none', 'the', 'seems', "'ve", 'system', 'amongst', 'thereby', 'been', 'own', 'next', 'down', 'hundred', 'each', 'seemed', 'other', 'everything', 'across', 'ain', 'off', 'doesn', 'than', 'many', 'show', 'but', 'an', 'then', 'never', 'without', 'before', 'only', 'anyway', 'namely', 'o', 'etc', 'formerly', 'wherein', 'two', 'did', 'y', 'toward', 'thereupon', "'ll", 'full', 'most', 'have', 'always', 'were', 'myself', 'name', 'move', 'say', 'put', 'cry', 'become', 'would', 'to', 'am', 'bottom', 'having', 'amoungst', 'as', 'already', 'whenever', 'thin', 'us', 'that', 'whither', 'our', 'yourselves', 'cant', 'several', "'s", 'really', 'fifteen', 'otherwise', 'must', 'anywhere', 'much', 'hereby', 'anyone', 'for', 'could', 'often', 'themselves', 'can', 'all', 'too', 'sometime', 'what', 'somewhere', 'every', 'find', 'herself', 'together', 'are', 'well', 'de', 'on', 'which', 'interest', 'bill', 'isn', 'himself', 'therefore', 'whose', 'along', 'has', 'though', 'mostly', 'please', 'beyond', 'neither', 'against', 'go', 'behind', 'amount', 'something', 'hence', 'part', 'this', 'and', 'you', 'eight', 'per', 'among', 'least', 'side', 'mine', 'towards', 'see', 'a', 'also', 'by', 'via', 'twelve', 'forty', 'found', 'such', 'less', 'even', 'still', 'these', 'few', 's', 'perhaps', 'both', 'throughout', "n't", 'shan', 'elsewhere', 'co', 'sixty', 'why', 'one', 'if', 'thus', 'itself', 'used', 'ie', 'of', 'fifty', 'former', 'else', 'or', 'three', 'cannot', 'last', 'any', 'thick', 'no', 're', 'with', 'should', 'doing', 'weren', 'does', 'yet'}
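For what it's worth, a set very much like this one can be assembled from the stop word lists that ship with scikit-learn, NLTK and spaCy. A sketch (assuming NLTK's stopwords corpus and spaCy are installed; the result may differ slightly from the set above):

# one way to build a combined stopword set
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.corpus import stopwords                      # requires nltk.download('stopwords')
from spacy.lang.en.stop_words import STOP_WORDS

COMBINED_STOPWORDS = set(ENGLISH_STOP_WORDS) | set(stopwords.words('english')) | set(STOP_WORDS)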

Tokenisation

Let's now tokenise our 12,000+ press releases.

It's a good idea to place the corpus generator in the same cell as the call to tokenize, because otherwise, running the cell more than once will raise an exception due to the generator having been consumed.

(I'm using ipython magic to output the running time just for information. I'm running this on a 2014 MacBook Pro.)

In [9]:
%%time

corpus = (p.read_text(encoding='utf-8').strip() for p in docs_path)

X_tfidf, tfidf = tokenize(corpus,                     # the corpus (duh!)
                          docs_path,                  # list of paths to the individual documents
                          min_df=10,                  # only consider words which appear in at least 10 docs
                          max_df=0.5,                 # only consider words which appear in at most 50% of the docs
                          lowercase=True,             # convert everything to lowercase
                          token_pattern='[a-z]{2,}',  # what's a token (2 or more letters)
                          stop_words=STOPWORDS,       # which words are to be excluded
                          max_features=10000          # keep the top 10,000 tokens (based on tfidf scores)
                         )
Tokenised 12619 documents using a vocabulary of 10000 tokens.
CPU times: user 8.95 s, sys: 1.58 s, total: 10.5 s
Wall time: 13.3 s

Our tokenize function returns our transformed corpus X_tfidf, a numpy array with as many rows as there are documents in the corpus and as many columns as there are tokens (words) in the vocabulary, along with the trained tokeniser tfidf.

In [10]:
X_tfidf.shape
Out[10]:
(12619, 10000)

We can get the vocabulary from the tfidf object. Let's print out the first few tokens.

In [11]:
vocabulary = tfidf.get_feature_names()
print(vocabulary[:100])
['aa', 'aaa', 'aas', 'aat', 'ab', 'abandon', 'abandoned', 'abandoning', 'abandonment', 'abbott', 'abc', 'abcc', 'abe', 'abetz', 'abf', 'abhorrent', 'abide', 'abiding', 'abilities', 'ability', 'abject', 'able', 'abn', 'aboard', 'abolish', 'abolished', 'abolishing', 'abolition', 'aboriginal', 'aboriginals', 'aborigines', 'abortion', 'abroad', 'abs', 'absence', 'absent', 'absolute', 'absolutely', 'absorb', 'absorbed', 'absorption', 'abstract', 'absurd', 'abu', 'abundant', 'abundantly', 'abuse', 'abused', 'abuses', 'abusing', 'ac', 'academia', 'academic', 'academics', 'academy', 'accc', 'accelerate', 'accelerated', 'accelerating', 'accept', 'acceptable', 'acceptance', 'accepted', 'accepting', 'accepts', 'access', 'accessed', 'accessibility', 'accessible', 'accessing', 'accession', 'accident', 'accidents', 'accommodate', 'accommodated', 'accommodating', 'accommodation', 'accompanied', 'accompany', 'accompanying', 'accomplished', 'accord', 'accordance', 'accorded', 'according', 'accordingly', 'accords', 'account', 'accountability', 'accountable', 'accounted', 'accounting', 'accounts', 'accreditation', 'accredited', 'accrual', 'accumulated', 'accuracy', 'accurate', 'accurately']

Essentially, X_tfidf contains a count of how many times each word in the vocabulary occurs in each of the documents. In reality, things are a bit more subtle, due to the idf part of tfidf. Once again, see here for a more detailed description of what's happening behind the scenes.
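For the curious, here is a rough sketch of what the vectoriser computes under its default settings (raw term counts weighted by a smoothed inverse document frequency, followed by L2 normalisation of each document row). It's a simplified illustration only; see the link above for the details.

import numpy as np

def toy_tfidf(counts):
    '''
    counts: (n_docs, n_tokens) array of raw term counts.
    Mirrors sklearn's defaults: smooth_idf=True, sublinear_tf=False, norm='l2'.
    '''
    n_docs = counts.shape[0]
    df = (counts > 0).sum(axis=0)                  # in how many documents each token appears
    idf = np.log((1 + n_docs) / (1 + df)) + 1      # smoothed inverse document frequency
    weighted = counts * idf
    # L2-normalise each document row (assumes no all-zero rows)
    return weighted / np.linalg.norm(weighted, axis=1, keepdims=True)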

Great! We've managed to turn our collection of 12,619 press releases into a big array of numbers. Now what?

Although X_tfidf contains quite a bit of useful information, and is definitely more palatable for our computer than the original texts, it's not exactly illuminating to humans.

For starters, each document is described by 10,000 numbers (the tfidf scores for the 10,000 tokens of the vocabulary). This is too much (for us, the computer doesn't mind).

Furthermore, we'd like to go beyond being able to say that word "W" appears lots of times in document "D". In particular, we'd like to be able to learn something about the corpus as a whole.

This is where topic modelling can help.

Topic modelling

We're interested in finding out what the corpus is about, what topics (in the common sense of the word) are most salient in the collection of press releases. In this case, because Tim Sherratt did all the hard work of harvesting and selecting the press releases, we've got some idea of what the documents are about; however, as we shall see, a topic model can still help us gain new insights into the corpus.

We have just transformed our corpus into an array which tells us how important each word (token) of the vocabulary is to each document. This gives us some relationship between our documents and our vocabulary. We can exploit this relationship to introduce the notion of topics. One way to do this is to use matrix factorisation. Essentially, we take our big array (X_tfidf) with "number of documents" rows and "number of words" columns, and factorise it into 2 (or more) smaller arrays (matrices actually but I'll use the two terms interchangeably here).

For our specific purpose I will use Non-negative Matrix Factorisation (NMF) which will approximate X_tfidf with a product of a "number of documents" by "number of topics" array, and a "number of topics" by "number of words" array.

Essentially, the process decomposes the documents into latent topics (latent because they're not immediately apparent in X_tfidf), and describes the topics as a collection of scores over the vocabulary.

Note that with this approach it's up to us to choose the number of topics. Because our goal is to have something digestible by a human, and we're mostly interested in learning about the general features of the corpus, a reasonable number of topics, say 10, is the way to go.
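There's no single right answer, though. If you're willing to pay the cost of fitting several models, one rough way to compare candidate numbers of topics is to look at NMF's reconstruction error, which decreases (with diminishing returns) as the number of topics grows. A sketch:

# compare reconstruction error for a few candidate numbers of topics
# (slow: this fits one NMF model per value of k)
for k in (5, 10, 20):
    err = NMF(n_components=k, random_state=0).fit(X_tfidf).reconstruction_err_
    print(f"{k} topics: reconstruction error = {err:.2f}")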

We can use the wonderful scikit-learn library to factorise X_tfidf.

In [12]:
%%time
model = NMF(n_components=10, random_state=0)
X_nmf = model.fit_transform(X_tfidf)
CPU times: user 1min 45s, sys: 1.11 s, total: 1min 46s
Wall time: 27.3 s

The X_nmf array has 12,619 rows and 10 columns, and describes the relationship between the documents and the topics.

In [13]:
X_nmf.shape
Out[13]:
(12619, 10)

Recall that we said the factorisation approximates X_tfidf by a product of 2 arrays. One array is X_nmf, and the second, which has 10 rows and as many columns as the size of our vocabulary, describes the topics in terms of the tokens of our vocabulary.

In [14]:
model.components_.shape
Out[14]:
(10, 10000)
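If you want to convince yourself of the approximation, you can multiply the two arrays back together and look at the residual (a quick sanity check, not part of the analysis):

# X_tfidf (12619 x 10000) is approximated by X_nmf (12619 x 10) @ model.components_ (10 x 10000)
approx = X_nmf @ model.components_
print(approx.shape)                        # (12619, 10000), same shape as X_tfidf
print(np.linalg.norm(X_tfidf - approx))    # Frobenius norm of the residual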

Topics extraction

Let's create a helper function that will allow us to extract the topics by selecting for each topic the top N words with the highest scores in the corresponding row of the model.components_ array, and outputting them in a nice way.

In [15]:
def extract_topics(model, vec, sep=' | ', n=5):
    '''
    Extract topics, in terms of vocabulary tokens, from
    a trained tokeniser and a trained NMF model.
    '''
    
    topics = {}
    
    # sort the array so that the most important tokens are first
    idx = model.components_.argsort(axis=1)[:, ::-1]
    
    # extract the most important tokens 
    for i, t in enumerate(np.array(vec.get_feature_names())[idx]):
        topic = sep.join(t[:n])
        topics[i] = topic
    
    return topics
In [16]:
topics = extract_topics(model, tfidf, n=5)
topics
Out[16]:
{0: 'world | economic | countries | region | international',
 1: 'keenan | labor | boats | boat | morrison',
 2: 'migrants | services | community | ethnic | settlement',
 3: 'greens | detention | hanson | young | children',
 4: 'abbott | tony | tax | question | carbon',
 5: 'democrats | bartlett | senator | andrew | spokesperson',
 6: 'bowen | abbott | clare | offshore | processing',
 7: 'journalist | prime | think | going | got',
 8: 'vessel | command | border | protection | island',
 9: 'humanitarian | refugees | refugee | program | resettlement'}

Voilà! Our topics, each represented by its 5 most important words from our vocabulary.

A couple of comments are in order.

First, the number associated with each topic is arbitrary. It means nothing other than telling us which row of model.components_ (or column of X_nmf) corresponds to the topic.

Second, while some topics seem rather nice and self-explanatory, others look a bit strange. In particular, some topics are clearly overwhelmed by the names of politicians.

As much as I'd love to claim that this is evidence of our self-obsessed pollies' proclivity to talk about themselves, the truth is more prosaic. Let's take a closer look at our documents to try to improve our processing pipeline.

When in doubt, look at the data

So far we haven't really looked at the data. Let's fix that now and take a closer look at one of the press releases.

In [17]:
sample_press_release = docs_path[600].read_text(encoding='utf-8').strip()

print(sample_press_release[:5000])
Press conference with The Hon. Peter Dutton MP Minister for immigration, Senator The Hon. Concetta Fierravanti‐Wells Parliamentary Secretary to the Minister for Social Services and Paris Aristotle Refugee Resettlement Advisory Council

11 September 2015

Transcript Location:

Canberra

E&OE

PETER DUTTON: Ladies and gentlemen thank you very much for being here.

I’m going to make a few opening remarks and Minister Morrison is going to make some opening remarks. I'm happy to take questions and then I have a plane to catch so I'm going to duck off and let Scott answer any questions you may have for him.

Obviously this morning we've had a very productive meeting and I want to say thank you very much to all of the leaders, the community leaders, who joined us today to talk about how we're going to make a new life for 12,000 people who are living in a very, very desperate situation.

The message from the UNHCR, from IOM, from the Red Cross in my discussions with them in Europe this week, is that this is a very bad situation getting worse.

Millions of people now have left Syria. Millions of people in Syria are displaced. The political turmoil in Syria shows no sign of resolution and we have a very important job to do along with many other countries around the world to try and

Transcripts

provide an opportunity for these people to start a new life.

I'm very proud of the response that we've been able to provide and I want to say thank you again to all of those people that we met with this morning who had some great suggestions around the way in which we could provide support through the screening process and then ultimately through their settlement here in Australia. Scott.

SCOTT MORRISON: Thanks very much Peter and it's good to have you back and to get those reports.

I'm joined here today, obviously, by Parliamentary Secretary Senator Fierravanti‐Wells and also PARIS ARISTOTLE who is the Chair of the Refugee Resettlement Advisory Council and Paris will have a bit to say later after Peter has departed.

But it was a very constructive meeting today. The purpose of the meeting today was really to identify ways where we can begin to really harness this incredibly large level of community support.

This has been a very well received announcement and there is an outpouring of support from people right across the community of all different backgrounds, of all different faiths.

It's important we put in place measures to enable us to harness that, to direct it purposefully, to ensure it delivers the support and compassion people are seeking to provide.

I want to thank all of those community leaders who came along today and there will be many more of these meetings.

There will be more direct engagements with quite specific communities of all different backgrounds, of all different faiths, to ensure we maintain the momentum long after the images that we've seen that have sparked so much outpouring of support.

This support needs to be maintained over a long time because when you resettle someone in Australia you resettle them for life and that support has to be there over their lifetime and their families as they become great Australians, as they do.

That was an important part of today's meeting. The Refugee Resettlement Advisory Council will be meeting next week under Paris's leadership and we will then go further into the task of breaking this process down.

As I remarked this morning, we've already settled in the last two years, almost 8,000 refugees and humanitarian immigrants from Iraq and Syria. The focus of that intake has been, as it will be in the future, it's predominantly focused on those from persecuted minorities, around 70% in those categories, with the balance coming from other groups. And so the processes we have in place will continue to be applied.

We are the best in the world at refugee and humanitarian resettlement and those processes will be put to work in this task we have going forward.

PETER DUTTON: OK, any questions?

JOURNALIST: Minister, I believe there were some questions raised about how the Government itself would define what a persecuted minority was. Given you've said already that you will take the advice of the UNHCR on that, but the final decision will be yours to make as a Government, can you tell us if you're any closer to defining what a persecuted minority is?

PETER DUTTON: Well as Scott's pointed out, we for many years have been able to identify people who are at risk of persecution because of their religion, because of threats otherwise.

I think it's important for people to understand, not only is Australia the most generous in terms of the number of people we settle under the Refugee and Humanitarian Program, but we were ahead of the curve. Over the last couple of years we have settled thousands of people from Syria and Iraq.

We've been able to identify those people who were most at risk. We've been able to conduct the security checks, to conduct the health c

Transcript annotations

Aha! Notice the format. These are transcripts of statements or question-and-answer sessions given by politicians in front of journalists. In particular, each speaker is identified by annotations like "PETER DUTTON:", "SCOTT MORRISON:", or "JOURNALIST:".

The presence of this type of annotation will skew the topics towards politicians' names and words like "journalist".

There's no guarantee that the annotations are consistent across the 12,619 press releases, but since we've discovered this convention, let's at least deal with it.

So how can we deal with the annotations? This is a great opportunity for us to remember this wonderful quote by Jamie Zawinski:

Some people, when confronted with a problem, think "I know, I'll use regular expressions". Now they have two problems.

So, how can we deal with the annotations? I know, I'll use regular expressions.

We'll write a regex to match upper case words (including spaces) followed by a colon. Let's see if it works on our sample_press_release.

In [18]:
regex = re.compile(r'\s+([A-Z\s]+:)')
regex.findall(sample_press_release)
Out[18]:
['PETER\xa0DUTTON:',
 'SCOTT\xa0MORRISON:',
 'PETER\xa0DUTTON:',
 'JOURNALIST:',
 'PETER\xa0DUTTON:',
 'JOURNALIST:',
 'PETER\xa0DUTTON:',
 'JOURNALIST:',
 'PETER\xa0DUTTON:',
 'SCOTT\xa0MORRISON:',
 'JOURNALIST:',
 'SCOTT\xa0MORRISON:',
 'JOURNALIST:',
 'SCOTT\xa0MORRISON:',
 'JOURNALIST:',
 'SCOTT\xa0MORRISON:',
 'JOURNALIST:',
 'PETER\xa0DUTTON:',
 'JOURNALIST:',
 'PETER\xa0DUTTON:',
 'JOURNALIST:',
 'PETER\xa0DUTTON:',
 'JOURNALIST:',
 'PETER\xa0DUTTON:',
 'SCOTT\xa0MORRISON:',
 'JOURNALIST:',
 'SCOTT\xa0MORRISON:',
 'JOURNALIST:',
 'PETER\xa0DUTTON:',
 'JOURNALIST:',
 'PETER\xa0DUTTON:',
 'MINISTER\xa0MORRISON:',
 'PARIS\xa0ARISTOTLE:',
 'MINISTER\xa0MORRISON:',
 'PARLIAMENTARY\xa0SECRETARY\xa0FIERRAVANTI\xa0WELLS:',
 'MINISTER\xa0MORRISON:',
 'QUESTION:',
 'MINISTER\xa0MORRISON:',
 'QUESTION:',
 'PARLIAMENTARY\xa0SECRETARY\xa0FIERRAVANTI\xa0WELLS:',
 'MINISTER\xa0MORRISON:',
 'QUESTION:',
 'MINISTER\xa0MORRISON:',
 'PARLIAMENTARY\xa0SECRETARY\xa0FIERRAVANTI\xa0WELLS:',
 'MINISTER\xa0MORRISON:',
 'QUESTION:',
 'MINISTER\xa0MORRISON:',
 'QUESTION:',
 'MINISTER\xa0MORRISON:',
 'QUESTION:',
 'MINISTER\xa0MORRISON:',
 'PARLIAMENTARY\xa0SECRETARY\xa0FIERRAVANTI\xa0WELLS:',
 'QUESTION:',
 'PARLIAMENTARY\xa0SECRETARY\xa0FIERRAVANTI\xa0WELLS:',
 'MINISTER\xa0MORRISON:',
 'QUESTION:',
 'MINISTER\xa0MORRISON:',
 'QUESTION:',
 'MINISTER\xa0MORRISON:']

Looks OK, but what's that \xa0 business? It's a non-breaking space character. This is going to annoy me, so let's deal with it globally by replacing all occurrences of this character with a "normal" space character.

In [19]:
regex = re.compile(r'\s+([A-Z\s]+:)')
regex.findall(sample_press_release.replace('\xa0',' '))
Out[19]:
['PETER DUTTON:',
 'SCOTT MORRISON:',
 'PETER DUTTON:',
 'JOURNALIST:',
 'PETER DUTTON:',
 'JOURNALIST:',
 'PETER DUTTON:',
 'JOURNALIST:',
 'PETER DUTTON:',
 'SCOTT MORRISON:',
 'JOURNALIST:',
 'SCOTT MORRISON:',
 'JOURNALIST:',
 'SCOTT MORRISON:',
 'JOURNALIST:',
 'SCOTT MORRISON:',
 'JOURNALIST:',
 'PETER DUTTON:',
 'JOURNALIST:',
 'PETER DUTTON:',
 'JOURNALIST:',
 'PETER DUTTON:',
 'JOURNALIST:',
 'PETER DUTTON:',
 'SCOTT MORRISON:',
 'JOURNALIST:',
 'SCOTT MORRISON:',
 'JOURNALIST:',
 'PETER DUTTON:',
 'JOURNALIST:',
 'PETER DUTTON:',
 'MINISTER MORRISON:',
 'PARIS ARISTOTLE:',
 'MINISTER MORRISON:',
 'PARLIAMENTARY SECRETARY FIERRAVANTI WELLS:',
 'MINISTER MORRISON:',
 'QUESTION:',
 'MINISTER MORRISON:',
 'QUESTION:',
 'PARLIAMENTARY SECRETARY FIERRAVANTI WELLS:',
 'MINISTER MORRISON:',
 'QUESTION:',
 'MINISTER MORRISON:',
 'PARLIAMENTARY SECRETARY FIERRAVANTI WELLS:',
 'MINISTER MORRISON:',
 'QUESTION:',
 'MINISTER MORRISON:',
 'QUESTION:',
 'MINISTER MORRISON:',
 'QUESTION:',
 'MINISTER MORRISON:',
 'PARLIAMENTARY SECRETARY FIERRAVANTI WELLS:',
 'QUESTION:',
 'PARLIAMENTARY SECRETARY FIERRAVANTI WELLS:',
 'MINISTER MORRISON:',
 'QUESTION:',
 'MINISTER MORRISON:',
 'QUESTION:',
 'MINISTER MORRISON:']

Cleanup

Looks better. Now that we can capture these annotations, let's remove them.

In [20]:
print(regex.sub('', sample_press_release.replace('\xa0',' '))[:5000])
Press conference with The Hon. Peter Dutton MP Minister for immigration, Senator The Hon. Concetta Fierravanti‐Wells Parliamentary Secretary to the Minister for Social Services and Paris Aristotle Refugee Resettlement Advisory Council

11 September 2015

Transcript Location:

Canberra

E&OE Ladies and gentlemen thank you very much for being here.

I’m going to make a few opening remarks and Minister Morrison is going to make some opening remarks. I'm happy to take questions and then I have a plane to catch so I'm going to duck off and let Scott answer any questions you may have for him.

Obviously this morning we've had a very productive meeting and I want to say thank you very much to all of the leaders, the community leaders, who joined us today to talk about how we're going to make a new life for 12,000 people who are living in a very, very desperate situation.

The message from the UNHCR, from IOM, from the Red Cross in my discussions with them in Europe this week, is that this is a very bad situation getting worse.

Millions of people now have left Syria. Millions of people in Syria are displaced. The political turmoil in Syria shows no sign of resolution and we have a very important job to do along with many other countries around the world to try and

Transcripts

provide an opportunity for these people to start a new life.

I'm very proud of the response that we've been able to provide and I want to say thank you again to all of those people that we met with this morning who had some great suggestions around the way in which we could provide support through the screening process and then ultimately through their settlement here in Australia. Scott. Thanks very much Peter and it's good to have you back and to get those reports.

I'm joined here today, obviously, by Parliamentary Secretary Senator Fierravanti‐Wells and also PARIS ARISTOTLE who is the Chair of the Refugee Resettlement Advisory Council and Paris will have a bit to say later after Peter has departed.

But it was a very constructive meeting today. The purpose of the meeting today was really to identify ways where we can begin to really harness this incredibly large level of community support.

This has been a very well received announcement and there is an outpouring of support from people right across the community of all different backgrounds, of all different faiths.

It's important we put in place measures to enable us to harness that, to direct it purposefully, to ensure it delivers the support and compassion people are seeking to provide.

I want to thank all of those community leaders who came along today and there will be many more of these meetings.

There will be more direct engagements with quite specific communities of all different backgrounds, of all different faiths, to ensure we maintain the momentum long after the images that we've seen that have sparked so much outpouring of support.

This support needs to be maintained over a long time because when you resettle someone in Australia you resettle them for life and that support has to be there over their lifetime and their families as they become great Australians, as they do.

That was an important part of today's meeting. The Refugee Resettlement Advisory Council will be meeting next week under Paris's leadership and we will then go further into the task of breaking this process down.

As I remarked this morning, we've already settled in the last two years, almost 8,000 refugees and humanitarian immigrants from Iraq and Syria. The focus of that intake has been, as it will be in the future, it's predominantly focused on those from persecuted minorities, around 70% in those categories, with the balance coming from other groups. And so the processes we have in place will continue to be applied.

We are the best in the world at refugee and humanitarian resettlement and those processes will be put to work in this task we have going forward. OK, any questions? Minister, I believe there were some questions raised about how the Government itself would define what a persecuted minority was. Given you've said already that you will take the advice of the UNHCR on that, but the final decision will be yours to make as a Government, can you tell us if you're any closer to defining what a persecuted minority is? Well as Scott's pointed out, we for many years have been able to identify people who are at risk of persecution because of their religion, because of threats otherwise.

I think it's important for people to understand, not only is Australia the most generous in terms of the number of people we settle under the Refugee and Humanitarian Program, but we were ahead of the curve. Over the last couple of years we have settled thousands of people from Syria and Iraq.

We've been able to identify those people who were most at risk. We've been able to conduct the security checks, to conduct the health checks, and ultimately to allow those people a safe passage into Australian 

Updated pipeline - Rinse and repeat

Great! Let's update our processing pipeline so that the documents are cleaned up automatically.

In [21]:
pattern_UPPER = re.compile(r'\s+([A-Z\s]+:)')

def clean_text(path):
    return pattern_UPPER.sub('', path.read_text(encoding='utf-8').strip().replace('\xa0',' '))
    

We can now rerun the pipeline (notice the updated corpus generator).

In [22]:
%%time

corpus = (clean_text(p) for p in docs_path)

X_tfidf, tfidf = tokenize(corpus,
                          docs_path,
                          min_df=10, 
                          max_df=0.5, 
                          lowercase=True,
                          token_pattern="[a-z]{2,}",
                          stop_words=STOPWORDS,
                          max_features=10000
                         )
Tokenised 12619 documents using a vocabulary of 10000 tokens.
CPU times: user 11.8 s, sys: 813 ms, total: 12.6 s
Wall time: 12.5 s

Matrix factorisation take 2.

In [23]:
%%time
X_nmf = model.fit_transform(X_tfidf)
CPU times: user 2min 12s, sys: 1.07 s, total: 2min 13s
Wall time: 33.9 s
In [24]:
topics = extract_topics(model, tfidf, n=8)
topics
Out[24]:
{0: 'think | tax | going | labor | got | want | party | prime',
 1: 'keenan | boats | labor | boat | border | morrison | protection | illegal',
 2: 'detention | children | asylum | nauru | seekers | island | centres | centre',
 3: 'greens | hanson | young | senator | sarah | spokesperson | asylum | byard',
 4: 'countries | world | region | international | economic | security | foreign | united',
 5: 'democrats | bartlett | senator | andrew | spokesperson | org | senate | refugees',
 6: 'abbott | bowen | clare | offshore | processing | tony | boats | nauru',
 7: 'vessel | command | border | protection | christmas | island | stinson | jayne',
 8: 'humanitarian | refugees | refugee | program | resettlement | million | unhcr | assistance',
 9: 'migrants | services | ethnic | settlement | community | migrant | multicultural | program'}

Hmmm.... Some improvement but still too many politicians' names for my taste.

Interlude

At this stage, we need to think about what we're interested in. In some way, it is natural for politicians to feature prominently in the topics, given that we're looking at parliamentary press releases. One might thus be interested in associating the topics (or at least some of them) with political parties or personalities. In that case, we're done and we can use this trained model for further analyses.

However, I'd like to go beyond the individual politicians and see what the press releases are about. So let's take drastic measures and add the names of politicians, as well as a few other terms, to our stopwords.

In [25]:
%%time

my_stopwords = STOPWORDS | {'keenan','hanson','young', 
                            'sarah','bartlett','andrew','bowen',
                            'clare','abbott','tony','ruddock','morrison',
                            'journalist','mr','think','going','want','got'}

corpus = (clean_text(p) for p in docs_path)

X_tfidf, tfidf = tokenize(corpus,
                          docs_path,
                          min_df=10, 
                          max_df=0.5, 
                          lowercase=True,
                          token_pattern="[a-z]{2,}",
                          stop_words=my_stopwords,
                          max_features=10000
                         )
Tokenised 12619 documents using a vocabulary of 10000 tokens.
CPU times: user 11.9 s, sys: 829 ms, total: 12.7 s
Wall time: 12.5 s
In [26]:
%%time
X_nmf = model.fit_transform(X_tfidf)
CPU times: user 2min 18s, sys: 1.77 s, total: 2min 20s
Wall time: 35.8 s
In [27]:
topics = extract_topics(model, tfidf, n=5)
topics
Out[27]:
{0: 'tax | labor | party | carbon | prime',
 1: 'boats | labor | boat | border | protection',
 2: 'refugees | refugee | humanitarian | program | resettlement',
 3: 'nauru | processing | offshore | malaysia | boats',
 4: 'countries | world | region | security | economic',
 5: 'vessel | command | border | protection | christmas',
 6: 'democrats | senator | spokesperson | senate | org',
 7: 'aid | million | assistance | food | relief',
 8: 'detention | greens | children | asylum | seekers',
 9: 'migrants | services | settlement | community | migrant'}

This is nice. We notice "boats" alongside "boat". That's the price to pay for not stemming/lemmatising our text.

Topic 6 is a bit cryptic but the rest looks reasonable given the rusticity of our pipeline.

Reading at a distance

This is probably as good a time as any to illustrate how the min_df and max_df parameters of the tokeniser can affect our topics.

Let's start with max_df. A large value lets more of the words that are common across the corpus into our vocabulary, thereby tilting the topics towards more general concepts.

In [28]:
corpus = (clean_text(p) for p in docs_path)

X_tfidf, tfidf = tokenize(corpus,
                          docs_path,
                          min_df=10, 
                          max_df=0.9,    # allow tokens which appear in at most 90% of the corpus
                          lowercase=True,
                          token_pattern="[a-z]{2,}",
                          stop_words=my_stopwords,
                          max_features=10000
                         )
X_nmf = model.fit_transform(X_tfidf)
topics = extract_topics(model, tfidf, n=5)
topics
Tokenised 12619 documents using a vocabulary of 10000 tokens.
Out[28]:
{0: 'tax | government | people | labor | party',
 1: 'boats | labor | boat | border | people',
 2: 'immigration | ethnic | migration | affairs | minister',
 3: 'nauru | processing | offshore | minister | malaysia',
 4: 'australia | world | countries | region | international',
 5: 'vessel | command | border | protection | christmas',
 6: 'democrats | senator | spokesperson | senate | australian',
 7: 'humanitarian | refugees | australia | refugee | program',
 8: 'detention | greens | children | asylum | seekers',
 9: 'services | settlement | migrants | community | migrant'}

We see more general words like "government" or "people" appear in our topics.

By contrast, reducing max_df will exclude the more common tokens and result in a set of more specialised topics.

In [29]:
corpus = (clean_text(p) for p in docs_path)

X_tfidf, tfidf = tokenize(corpus,
                          docs_path,
                          min_df=10, 
                          max_df=0.2,    # allow tokens which appear in at most 20% of the corpus
                          lowercase=True,
                          token_pattern="[a-z]{2,}",
                          stop_words=my_stopwords,
                          max_features=10000
                         )
X_nmf = model.fit_transform(X_tfidf)
topics = extract_topics(model, tfidf, n=5)
topics
Tokenised 12619 documents using a vocabulary of 10000 tokens.
Out[29]:
{0: 'migration | review | rights | visa | tribunal',
 1: 'boats | illegal | rudd | michael | smugglers',
 2: 'tax | carbon | budget | jobs | billion',
 3: 'nauru | manus | island | png | processing',
 4: 'region | east | indonesia | asia | trade',
 5: 'vessel | command | christmas | island | stinson',
 6: 'democrats | spokesperson | org | stott | senate',
 7: 'greens | spokesperson | brown | byard | schultz',
 8: 'migrants | settlement | migrant | ethnic | multicultural',
 9: 'offshore | processing | malaysia | boats | jason'}

Notice how the topics are more focussed? We also see more people's names appearing.

Incidentally, this idea of moving from general to specialised topics by changing the way we construct our vocabulary can be thought of as reading the corpus from different "distances". This idea of distant reading comes from the digital humanities. The interested reader can find some relevant references in a work on legal documents that my colleagues and I did previously.

The min_df parameter works at the other end of the spectrum, controlling how many "rare" words we allow in the corpus. Words that appear in only a tiny number of documents usually aren't particularly interesting for building topics, so it's common to set a lower limit. We've used 10 so far, but let's illustrate what happens when we choose a higher lower bound.

In [30]:
corpus = (clean_text(p) for p in docs_path)

X_tfidf, tfidf = tokenize(corpus,
                          docs_path,
                          min_df=200,    # only use words which appear in at least 200 press releases
                          max_df=0.5,    
                          lowercase=True,
                          token_pattern="[a-z]{2,}",
                          stop_words=my_stopwords,
                          max_features=10000
                         )
X_nmf = model.fit_transform(X_tfidf)
topics = extract_topics(model, tfidf, n=5)
topics
Tokenised 12619 documents using a vocabulary of 3528 tokens.
Out[30]:
{0: 'tax | labor | party | carbon | know',
 1: 'boats | labor | boat | border | protection',
 2: 'migrants | settlement | services | ethnic | community',
 3: 'nauru | processing | offshore | asylum | malaysia',
 4: 'world | countries | region | international | security',
 5: 'democrats | senator | spokesperson | senate | org',
 6: 'vessel | command | border | protection | christmas',
 7: 'refugees | humanitarian | aid | million | refugee',
 8: 'detention | children | rights | asylum | centres',
 9: 'greens | senator | brown | spokesperson | asylum'}

Notice how our vocabulary has shrunk to far fewer than 10,000 tokens: a great many rare words have been culled by the higher threshold.

Base pipeline

So which parameters should you choose? Well, it's up to you and will depend on the purpose of the study. Here I'm interested in learning something about the topics of the press releases, so a moderately focused set of topics seems about right.

Let's rerun our pipeline with min_df=20 and max_df=0.5.

In [31]:
corpus = (clean_text(p) for p in docs_path)

my_stopwords = STOPWORDS | {'keenan','hanson','young', 
                            'sarah','bartlett','andrew','bowen',
                            'clare','abbott','tony','ruddock','morrison',
                            'journalist','mr','think','going','want','got'}

X_tfidf, tfidf = tokenize(corpus,
                          docs_path,
                          min_df=20, 
                          max_df=0.5,    
                          lowercase=True,
                          token_pattern="[a-z]{2,}",
                          stop_words=my_stopwords,
                          max_features=10000
                         )
X_nmf = model.fit_transform(X_tfidf)
topics = extract_topics(model, tfidf, n=5)
topics
Tokenised 12619 documents using a vocabulary of 10000 tokens.
Out[31]:
{0: 'tax | labor | party | carbon | prime',
 1: 'boats | labor | boat | border | protection',
 2: 'ethnic | migration | review | affairs | tribunal',
 3: 'nauru | processing | offshore | malaysia | asylum',
 4: 'countries | world | region | international | security',
 5: 'vessel | command | border | protection | christmas',
 6: 'democrats | senator | spokesperson | senate | org',
 7: 'humanitarian | refugees | refugee | program | million',
 8: 'detention | greens | children | asylum | seekers',
 9: 'migrants | services | settlement | community | migrant'}

Good enough for government work... Let's move on and try to "read" our corpus using this set of topics.

Note: Keep in mind that the topics aren't really these lists of 5 words. Rather, they are the numerical scores in each row of model.components_. We're simply extracting the 5 words with the highest scores for convenience.
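If you want to see the scores themselves rather than the shorthand labels, you can peek at a single row of model.components_, for instance for topic 0 (a quick illustrative check):

# raw scores of the 5 highest-scoring tokens for topic 0
vocabulary = np.array(tfidf.get_feature_names())
top = model.components_[0].argsort()[::-1][:5]
print(list(zip(vocabulary[top], model.components_[0][top].round(3))))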

Topic importance

The first thing I'd like to do is get a sense of how "important" these topics are (remember that the ordering of the topics above is meaningless).

One way to do this is to ask what proportion of our corpus belongs to a given topic. Each row of X_nmf corresponds to a press release, and contains the "importance scores" of each topic. If we normalise the rows of X_nmf to sum to 1, we turn the topic scores into proportions (unlike methods such as latent Dirichlet allocation, NMF does not treat documents as probability distributions over the topics, hence the scores don't sum to 1 by default).

We can then sum the columns of X_nmf to get an estimate of the number of documents which are associated with the topic.

More simply, if the topic proportion for document D for topic 1 is 0.2, we say that document D counts for 0.2 documents in the document tally for topic 1.

We can perform these manipulations directly on the array but it's more convenient to transform X_nmf into a pandas dataframe first.

In [32]:
df = pd.DataFrame(X_nmf, columns=extract_topics(model, tfidf, n=5).values())
df = df.div(df.sum(axis=1), axis=0)     
df.head()
Out[32]:
(columns, left to right: the 10 topic labels)
tax | labor | party | carbon | prime
boats | labor | boat | border | protection
ethnic | migration | review | affairs | tribunal
nauru | processing | offshore | malaysia | asylum
countries | world | region | international | security
vessel | command | border | protection | christmas
democrats | senator | spokesperson | senate | org
humanitarian | refugees | refugee | program | million
detention | greens | children | asylum | seekers
migrants | services | settlement | community | migrant
0 0.123345 0.534986 0.000000 0.058218 0.208694 0.000000 0.006109 0.020487 0.048162 0.000000
1 0.000000 0.000000 0.179445 0.000000 0.227359 0.008271 0.000000 0.523997 0.060928 0.000000
2 0.022627 0.003940 0.188790 0.067588 0.289858 0.000000 0.000000 0.392389 0.034808 0.000000
3 0.044017 0.000000 0.000000 0.000000 0.060260 0.070227 0.125036 0.000000 0.105946 0.594513
4 0.120850 0.000000 0.055066 0.425899 0.245945 0.033193 0.090209 0.015919 0.012921 0.000000

We can now visualise how prevalent the topics are within the corpus.

In [33]:
ax = df.sum(axis='rows').sort_values().plot(kind='barh', width=0.6, alpha=0.8, figsize=(12, 6))
ax.tick_params(axis = 'both', which = 'major', labelsize = 14)
ax.set_xlabel('Effective number of documents', fontsize=18)
ax.set_ylabel('Topics', fontsize=18);

As is often the case, the dominant topics are fairly general, because they capture the most basic ideas within the corpus, which are common to many documents (recall that documents don't necessarily belong to a single topic).

Topic allocation

We can also visualise how a given document is "made up" of different topics.

Let's illustrate this with the sample press release we used earlier when cleaning up the annotations.

In [34]:
ax = df.iloc[600].sort_values().plot(kind='barh', width=0.6, alpha=0.8, figsize=(12, 6))
ax.tick_params(axis = 'both', which = 'major', labelsize = 14)
ax.set_title(f"Topic allocation for {docs_path[600].name}", fontsize=14)
ax.set_xlabel('Topic proportion', fontsize=18)
ax.set_ylabel('Topics', fontsize=18);

A quick look at the press release shows that this decomposition is reasonable.

This corpus was harvested by looking for refugee/immigration related keywords, so the fact that the majority of the topics are about migration/refugee/humanitarian issues makes sense. However, there's a strange topic that seems to be about the carbon tax.

How is this related to refugees? Let's find out.

Topical archetype

First let's write a helper function to extract the most representative press release(s) for a given topic, and plot its topic allocation.

In [35]:
def plot_topic_allocation(doc_index):
    fig, ax = plt.subplots(figsize=(8, 3))
    df.iloc[doc_index].sort_values().plot(kind='barh', width=0.6, alpha=0.8, ax=ax)
    ax.tick_params(axis = 'both', which = 'major', labelsize = 14)
    ax.set_title(f"Topic allocation for {docs_path[doc_index].name}", fontsize=14)
    ax.set_xlabel('Topic proportion', fontsize=18)
    ax.set_ylabel('Topics', fontsize=18);


def get_most_representative_doc_for_topic(topic, n=3):            
    
    # sort the results according to the score for the topic of interest
    docs_idx = df.iloc[:, topic].sort_values(ascending=False).index.values[:n]
    
    # create a nice header
    label = f'************ {topics[topic].upper()} ************'
    print(f"\n\n{'='*len(label)}\n{label}\n{'='*len(label)}")          
    
    # extract the top n most representative documents          
    results = [docs_path[idx] for idx in docs_idx]    

    # output the results and plot the topic allocations      
    for i, item in zip(docs_idx, results):
        print(item.name)
        plot_topic_allocation(i)
    
    return results

The "carbon tax" topic is topic number 0, let's extract the top 10 press releases.

In [36]:
L  = get_most_representative_doc_for_topic(0, n=10)
==============================================================
************ TAX | LABOR | PARTY | CARBON | PRIME ************
==============================================================
2014-11-02-leigh-andrew-211447096.txt
2011-06-14-hunt-greg-211355794.txt
1978-03-05-fraser-malcolm-213722604.txt
2011-06-14-abbott-tony-211355793.txt
2011-07-27-abbott-tony-211358116.txt
2012-07-02-hockey-joe-211376011.txt
2011-07-13-gillard-julia-211357548.txt
2012-07-16-hockey-joe-211428530.txt
2012-07-16-hockey-joe-211428506.txt
2013-03-05-hockey-joe-211391530.txt

These documents seem to belong exclusively to the "carbon tax" topic. Let's look at the first press release.

In [37]:
print(L[0].read_text().strip())
ANDREW LEIGH MP  SHADOW ASSISTANT TREASURER  SHADOW MINISTER FOR COMPETITION  MEMBER FOR FRASER



E&OE TRANSCRIPT  DOORSTOP  SUNDAY, 2 NOVEMBER 2014  CANBERRA, PARLIAMENT HOUSE

SUBJECT/ S: W estern Sydney W anderers; Abbott Government’s broken  promise on GST; Abbott Government’s unfair budget; Climate change.

ANDREW LEIGH, SHADOW ASSISTANT TREASURER: Thanks very much  everyone for coming out today. My name is Andrew Leigh, the Shadow Assistant  Treasurer. I want to open with my congratulations to the Western Sydney  Wanderers, Asian Cup Champions. Just a great result for all Australians and many  hearts will be swelling with pride tonight.

Also wanted to say a few words about the statements that we’ve seen Joe Hockey  and Mathias Cormann making on GST distribution. It would be pretty clear to  anyone who follows the GST debate that if one state is going to get a larger amount  of GST that either means other states get a smaller share of GST or that the rate or

the base of the GST goes up. The only way that doesn’t happen is if the GST is a  magic pudding. And right now Joe Hockey and Mathias Cormann seem to be  auditioning for the roles of Bunyip Bluegum and Sam Sawnoff.

But the GST is not a magic pudding and if this government wants a mature and  responsible debate about tax they need to stop saying one thing in the west and  another thing in the east. They need to be very clear with the Australian people  about the implications of increasing any state’s share of the GST and they need to  be clear about the implications of that either for states getting a smaller share of  GST or for the rate of GST going up.

Tony Abbott said 33 times that he wouldn’t be increasing the GST so it’s pretty  strange now that he’s starting to starve states into submission and getting them to  accept what is clearly an underlying Liberal agenda of a higher GST. Happy to take  questions.

JOURNALIST: On another magic pudding, a lot of people think income tax and  bracket creep is going to be unsustainable, do you think there is an argument to link  it to inflation rather than rather than wages growth?

LEIGH: Well we certainly see from this government just taxes going up and up. We  were promised no new taxes before the election but the Prime Minister has broken  that promise along with so many others. He’s put on a new GP tax, he wants to  increase the fuel tax, he’s increasing income taxes and so we’ve seen on so many  fronts this being a Government that is putting up tax where it’s claimed before the  election that it would never do so.



JOURNALIST: Andrew, does Labor reject any shift from direct to indirect taxation?

LEIGH: Labor’s always open for sensible debates over taxation, but we were very  clear before the election that we wouldn’t be supporting changes to the GST and,  unlike to Coalition, we’re sticking that that promise.

JOURNALIST: Do you think income tax specifically needs to be overhauled the way  it is collected?

LEIGH: Labor’s not supporting increases in the income tax burden on low income  Australians. We’ve got at the moment from Tony Abbott so many slugs on the most  vulnerable. You know, if you look at poorest single families, a single mother earning  $65,000 a year, Tony Abbott’s budget has her losing $6,000. To me that’s simply  unconscionable and so we’ll apply the test of fairness to any proposals that Mr  Abbott brings forward, particularly after a generation in which we’ve seen billionaires  make out so much better than battlers.



JOURNALIST: We’ve got predictions that if nothing changes that’s a $25 billion  windfall for the Abbott Government, what sort of impact do you think that could  have, not just on the budget bottom line, but on low income earners?

LEIGH: Tony Abbott still seems incapable of balancing the budget. He said before  the election that he was going to bring down the deficit; in fact he’s increased the  deficit. His first economic update doubled the deficit, and even if parliament had just  rubber stamped the last budget, it would have brought down a higher, not a smaller,  deficit than under the Pre-Election Economic and Fiscal Outlook.



That’s because Mr Abbott is saying no to so many sensible sources of revenue, for  the mining tax, the carbon price, and through fair taxation of people with more than  $2 million in their superannuation accounts. That test of fairness is one that I think  all ordinary Australians would apply to anything that a Government is doing. The  Abbott Government seems unwilling to do that basic, moral act of placing itself in  the shoes of the most vulnerable.

JOURNALIST: Do we need income tax cuts then to sort of counter the bracket  creep?

LEIGH: If the Abbott Government wants to bring forward proposals to the  Parliament we’re always happy to look at them. But at the moment what it’s doing to

vulnerable Australians is it’s making it harder for them to afford the essentials in life.  Increasing the cost of going to the doctor, increasing the cost of driving to the  doctor, and on the long run, imposing cost of living impacts on Australians by kicking  the climate change can down the road.

JOURNALIST: When it comes to, perhaps as you say, mature debate over tax or  mature discussions, essentially what we’ve had is a mention of forward revenues,  what you do with the states and federal governments, just a mention of it in a  speech or two, and Labor straight away saying no GST, not over our dead body. Is  that a mature debate Labor’s involved in?

LEIGH: Well both parties before the last election were very clear that we didn’t  want to see increases in the GST. Labor has stuck steadfastly to that pledge but Mr  Abbott has instead cut $80 billion out of states health and education funding, very  clearly in an attempt to have as allies in this campaign for a higher GST - people like  Denis Napthine. We’re open for a tax debate; we brought down the Henry Tax  Review, we put in place an important tax reform in an emissions trading scheme  which raises the price of pollution and lowers the price of work. Textbook tax reform  undone by the Abbott Government.



JOURNALIST: Should with the GST though shouldn’t you, for example, look at well  would there be possible compensation for low income earners, would there be more  services out of it, a discussion about what it means rather than a blanket no way?

LEIGH: Look, if you think that low income earners are going to come out well out of  any Abbott tax reform, I’ve got a bridge you might like to buy. We have seen on  every twist and turn the Abbott Government’s changes hurting the most vulnerable.  The decision to focus only on spending and then to focus later on tax expenditures  has a regressive impact in and of itself and we’ve seen that independent modelling  from NATSEM showing that the poorest single parents lose one dollar in ten of their  incomes. So I’d be very surprised if the Abbott Government brings forward reforms  that are fair for Australians, and particularly that assist the most vulnerable. This  matters so much because we’ve had a huge rise in inequality over the past  generation, and so governments which want to give more to the affluent at the  expense of the vulnerable are running against the tide of history.

JOURNALIST: If there are changes that would address the concerns about bracket  creep, and that’s a $25 billion shortfall, how would you go about it or how would  Labor go about addressing that funding that’d be lost?

LEIGH: We’ll have our tax policies released well ahead of the next election and  certainly I would expect ahead of the timetable that the Abbott Government brought  its policies to the people. Certainly though you can judge us on what we did when  we were in office which is to deliver personal income tax cuts, to deliver a price on  carbon pollution, another textbook economic reform and then to begin looking at tax  expenditures such as the fact that if you’ve got more than $2 million in your  superannuation account, you’re getting a superannuation tax break that exceeds the  value of the full rate pension. So there the sorts of issues that we looked at in the  last government and that’s how you can expect a future Shorten Government would  approach tax reform.



JOURNALIST: Mr Leigh on climate change the Environment Minister says that the  Direct Action program will kick off next year. Labor’s position is that Direct Action  won’t work, but you also said that asylum seeker, turning asylum seeker boat  wouldn’t work and that seems to have worked. So is it now time for Labor to allow  this Direct Action policy to run a bit of its course to see if it does work and if it does  work and they get, reach their reduction targets where does that leave the  Opposition?

LEIGH: Climate change is not a policy that Australia is alone in approaching and  other countries around the world are asking themselves what’s the right way of  tackling climate change. And universally experts in those countries, economic  experts, are pointing to the importance of carbon pricing. It is very clear that putting  a price on carbon pollution gets you lowest cost abatement and that the effective  carbon tax that is Direct Action is going to be a bigger slug on Australian households  than an emissions trading scheme. And that’s simply because a pay-the-polluters  scheme is more expensive in order to get every tonne of abatement. RepuTex has  said that Direct Action will get maybe a fifth of the total emissions abatement that  Australia needs to hit. Ken Henry has said you’d need to spend around twice as  much as the Government’s budgeted in order to come anywhere close to the  targets.

So it’s very clear from theoretical evidence, from empirical evidence and from simply  listening to the experts that Direct Action won’t hit the mark, where as a carbon  price had seen the biggest fall in Australian emissions in 24 years, just in January  this year. That’s why countries around the world are shaking their head at Australia  dropping the ball. Australia is now unique in the world in being the only country that  has removed a nationwide price on carbon pollution and economic experts just  scratch their heads in bewilderment as to why we’d do that.

JOURNALIST: So to clarify you think, just using your phrase there, I don’t think the  Coalition would be too impressed with you describing it as a carbon tax given their  attacks over the years?

LEIGH: It’s very clear that the Coalition is slugging Australian households with  higher taxes, $2 billion of fuel taxes is about the cost of Direct Action. Mr Abbott in  the past has described other policies as being simply a ‘money-go-round’. That  seems a very apt description of his pay-the-polluters’ scheme.

Thanks everyone.

ENDS

A conundrum

This is really not about immigration/refugee issues. So why were these documents returned by a query on immigration-related keywords?

To find out why, let's create a function that extracts only the lines in a press release matching a search pattern. As our search patterns we'll use the keywords that Tim Sherratt used to harvest the documents, plus a few of our own.

In [38]:
def search_for_pattern_in_doc(path, regex):
    # print every line of the document that matches the search pattern
    text = path.read_text().strip()
    for line in text.splitlines():
        if regex.search(line.lower()) is not None:
            print(f"\n>>>>{line}\n")
In [39]:
pat = '|'.join(['migrant', 
                'immigration',
                'refugee',
                'asylum',
                'seeker', 
                'boat', 
                'illegal',
                'arrival', 
                'alien'])

regex = re.compile(pat, re.I)

# search each of the documents in L (selected earlier) and print its matching lines
for path in L:
    label = f'************ {path.name.upper()} ************'
    print(f"\n\n{'='*len(label)}\n{label}\n{'='*len(label)}")   
    search_for_pattern_in_doc(path, regex)
===============================================================
************ 2014-11-02-LEIGH-ANDREW-211447096.TXT ************
===============================================================

>>>>JOURNALIST: Mr Leigh on climate change the Environment Minister says that the  Direct Action program will kick off next year. Labor’s position is that Direct Action  won’t work, but you also said that asylum seeker, turning asylum seeker boat  wouldn’t work and that seems to have worked. So is it now time for Labor to allow  this Direct Action policy to run a bit of its course to see if it does work and if it does  work and they get, reach their reduction targets where does that leave the  Opposition?



============================================================
************ 2011-06-14-HUNT-GREG-211355794.TXT ************
============================================================

>>>>Mr Abbott, you said over the weekend that Nauru signing up to the UN convention on refugees was  imminent yet according to the United Nations, Nauru hasn’t even made an approach yet to the United  Nations. Just how imminent is it?



=================================================================
************ 1978-03-05-FRASER-MALCOLM-213722604.TXT ************
=================================================================


==============================================================
************ 2011-06-14-ABBOTT-TONY-211355793.TXT ************
==============================================================

>>>>Mr Abbott, you said over the weekend that Nauru signing up to the UN convention on refugees was  imminent yet according to the United Nations, Nauru hasn’t even made an approach yet to the United  Nations. Just how imminent is it?



==============================================================
************ 2011-07-27-ABBOTT-TONY-211358116.TXT ************
==============================================================

>>>>Well again, the same as I’ve got out of forums right around the country since the election but particularly  since the Prime Minister’s carbon tax was announced. I want to hear what people have got to say to me. I  hope people will be receptive to my message. Certainly, the message that I’ve been getting loud and clear  from the Australian people is that this tax is just toxic. They don’t like it and that’s the reason why the Prime  Minister seems now to be confining herself to Canberra. She went to the National Press Club last Thursday,  she went to a school last Friday, she went to Tasmania to talk about forestry on the weekend, she stayed in  Canberra to talk about boat people on Monday, yesterday she went to Melbourne to talk to Tony Blair. It



=============================================================
************ 2012-07-02-HOCKEY-JOE-211376011.TXT ************
=============================================================

>>>>Just on asylum seekers. Given your tears during the debate …


>>>>Asylum seeker.


>>>>Let me just say this. We had a three pronged approach. We had a Pacific solution, we had  temporary protection visas and where possible you turn the boats around and send them  back to where they came from. I cannot believe a Prime Minister who would contract out  the responsibly of a Prime Minister to a committee after what happened last week in  Parliament. I’m afraid we have a Prime Minister without any core principles and frankly, it’s  not just the carbon tax or the mining tax or asylum seekers or all the other mistakes she’s  made, it’s the fact that we have a Prime Minister without any core principles and she has  taken the Labor Party to hell in a hand basket, and for nothing. The Labor Party needs to



================================================================
************ 2011-07-13-GILLARD-JULIA-211357548.TXT ************
================================================================

>>>>Subjects:      Carbon price; Clean Energy Future; ACCC; Renewable  energy; Asylum seekers; Coal industry; Steel industry


>>>>JOURNALIST: (inaudible) on the Malaysian asylum seeker deal?



=============================================================
************ 2012-07-16-HOCKEY-JOE-211428530.TXT ************
=============================================================


=============================================================
************ 2012-07-16-HOCKEY-JOE-211428506.TXT ************
=============================================================


=============================================================
************ 2013-03-05-HOCKEY-JOE-211391530.TXT ************
=============================================================

>>>>Do you agree with Scott Morrison’s protocols he wants to put in place for asylum seekers?  Has he taken these to Shadow Cabinet? Or is he just freelancing?


>>>>upsets me most about what has happened with asylum seekers is they are being cast into the  community without a right to work. Therefore they are being put onto welfare without any  opportunity to get out of it. The living conditions are worse than what we, as Australians,  would tolerate. Therefore, how can Labor live with itself? At least with our Temporary  Protection Visas people had the opportunity to work.


>>>>I am not setting out to be critical of Scott Morrison. I am setting out to be critical of the  Government. The Government is the one - over 17,000 people came on boats last year.


>>>>But if there was a boat, would you support…


>>>>When you say that protocols for asylum seekers is up for debate - are you saying Scott  Morrison is freelancing at the moment and this is not actually Coalition policy?


>>>>No, I didn’t say that… The bottom line here is this; the government has an obligation to know  where people are located if they are coming to Australia by boat - if they are asylum seekers.  The Government has got an obligation to know where they are. The Government has a  responsibility, even a moral responsibility, to ensure that those people are not dumped into  the community with no capacity to work….


>>>>With no capacity to have a reasonable quality of life. You start to get behavioural issues  when people are just sitting around on welfare doing nothing with no hope of getting a job.  That is when you get your behavioural issues. That is when you get people engaging in crime  - not necessarily, but it can be the end outcome. That’s why I don’t think Scott Morrison’s  comments were motivated by anything other than genuine concern for the welfare of the  broader community and genuine concern for the welfare of the people seeking asylum.  Thanks very much.

Taxation or immigration?

Interesting. Not all of these press releases are about immigration/refugee issues, although some of them briefly mention related matters.

I think this might have to do with the nature of the documents. Some press releases are just that: a statement made by a member of parliament and released to the press. These tend to be focused on a specific issue relating to immigration/refugee matters.

However, some documents (like the one immediately above) are transcripts of Q&A sessions with journalists, which means they do not necessarily follow a set agenda. I think this is why some documents tackle many different issues, and might mention immigration matters among a more diverse set of topics.

What's particularly surprising, though, is that some of the press releases contain none of our keywords at all.
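A quick way to confirm this is to count the matching lines per document. Here is a minimal sketch (not an executed cell), reusing the list L and the compiled regex from the cells above:

# Sketch: count, for each document in L, how many lines match the keyword pattern.
for path in L:
    n_hits = sum(1 for line in path.read_text().strip().splitlines()
                 if regex.search(line.lower()) is not None)
    print(f"{path.name}: {n_hits} matching line(s)")

Let's look at one of these keyword-free documents more closely.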

In [40]:
print(L[2].read_text().strip())
t i » W  V  ► » w «  t> I Μ  ;

ΪΙ

FOR PRESS 5 MARCH 1978

ELECTORATE TALK

This year promises significant progress in the Australian  economy.

The results of the Government's firm and responsible economic  management in the last two" years will become even more evident.

It will also be a year when the most dramatic tax cuts in  Australia's history take direct effect throughout the economy. They will lift spending which will, in turn, assist industrial  production and stimulate economic activity. They will help

create jobs.

The Government's economic strategy has been clear. For a long  term and sustained reduction in the number of men. and women  seeking jobs, inflation had to be reduced.

Inflation has been making Australian industry uncompetitive. This 'meant that jobs were lost and opportunities for new jobs  were squandered.

Our success in the fight against inflation is now without  challenge - it is under 10% for the first time since 1972.

Inflation is still too high - but the rate will fall even  further this year.

It is timely that with this success the Government has already  in operation policies that will help create more jobs - without  fuelling inflation. '  -

The point that needs to be made is the February tax cuts alone  will put almost $1 billion into the hands of Australian families  this year. .

This extra money can now be spent on things we need'- for our­ selves and our families. That will mean more jobs for Australians

Of course,  unless Government policies continue to bear down  on inflation the additional benefits from the tax cuts will be  eroded, and job opportunities lost.  That fight to beat inflation  will go on as strongly as ever.

In terms of their impact on the economy, the Government's tax  cuts are comparable to the measures which other countries have  taken to stimulate their economies. ·  â–Â

But the key point  responsible one.   kept under proper

is that in Australia the stimulus is a  The growth in the money supply is being  control.

In other words, the stimulus from these tax cuts will not  push inflation up.

It is worth recalling that these tax cuts have been made  possible by controlling Government spending. We have been  prudent with taxpayers dollars - and are returning taxes to  all Australians. ,

While the Government introduced its tax cuts and tax indexation  in part as a stimulus to the economy, we brought about these  changes for another reason: we believed simply that Australians

were paying too much tax. Until we acted, our tax system took  away the incentive to work, the incentive to earn.

Now, with inflation coming down and with anti-inflationary  pressures at work in the economy, the Government's tax reforms  are in place to provide el stimulus to tha economy.

They can create increased demand for porducts and services  which will lead industry and business to create new jobs.

The Australian economy is recovering. Government' policies  are working to create a climate of stability and certainty  for private industry, which employs three out of every four  Australians.

As inflation continues to fall, as interest rates fall, and  as the effects of almost $1 billion worth of the February tax  cuts for this year alone take effect in the marketplace,  unemployment will steadily fall. .  ·  .

These are the pov/erful forces that will open up new job  opportunities for Australia. .

Any attempt at the alternative - the "quick fix" or return  to the "one stroke of the pen" approach - would be disastrous  for Australia. The Government rejects that course.

Curiouser and curiouser...

I really can't see why this press release would have been returned by a search on "refugee/immigration" keywords. Perhaps the press releases were manually curated at some point, and an editor assigned this one to the wrong category?

In any case, this hopefully illustrates how topic modelling can complement keyword-based queries, allowing us to gain a deeper insight into a large corpus of documents.

At this stage we could of course explore the other topics in a similar fashion (I encourage you to do so if you're interested), but let's instead take a different look at our corpus.

A picture is worth 10,000 tokens.

The visualisation of the topics' importance in the corpus, displayed above, is nice, but it is in some ways too high-level. We've aggregated each topic all the way down to a single number (the effective number of documents), thereby losing some of the nuance present in the topic model.

An alternative approach is to visualise both the documents and the topics at the same time. At the moment (in X_nmf) the documents are points in 10-dimensional space (because we have 10 topics). That's about 7 or 8 dimensions too many for us to look at, so we need to reduce the dimensionality of our representation, preferably down to 2 dimensions.

There are many ways to project (embed) a collection of points onto a 2D plane. A very common (and computationally cheap) one is principal component analysis; however, this method doesn't work well with our type of (nonlinear) data.

A very popular dimensionality reduction technique these days is t-distributed stochastic neighbour embedding (t-SNE to its friends). It works well but can be tricky to tune, and computationally slow at times, which is why I'll use my favourite dimensionality reduction technique to date: the awesome UMAP, which stands for Uniform Manifold Approximation and Projection for Dimension Reduction.

I also recommend watching this wonderful presentation by Leland McInnes, the creator of the UMAP library.

The general idea is to project our 10-dimensional points onto a 2-dimensional plane in such a way as to preserve local similarities. If two documents are neighbours in the 10-dimensional space, they have similar topic "signatures"; UMAP ensures that these two points also end up close to each other after projection onto the 2-dimensional plane.
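To make the idea of neighbours in topic space a little more concrete, here is a small illustrative sketch (not part of the original analysis) that finds the press release whose topic signature is closest to that of the first document, using scikit-learn's NearestNeighbors on X_nmf and assuming its rows follow the same order as docs_path:

# Illustrative sketch only: nearest neighbour in the 10-dimensional topic space.
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=2).fit(X_nmf)
_, idx = nn.kneighbors(X_nmf[:1])

# idx[0, 0] is the first document itself; idx[0, 1] is its closest neighbour.
print(docs_path[0].name, '<->', docs_path[idx[0, 1]].name)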

We first create an instance of the UMAP object, and then use it to transform X_nmf into an array that still has as many rows as we have documents, but only 2 columns.

In [41]:
%%time
# project the 10-dimensional topic representation of each document down to 2 dimensions
proj = UMAP(n_components=2, n_neighbors=100, min_dist=0.8, random_state=0)
X_proj = proj.fit_transform(X_nmf)
CPU times: user 33.1 s, sys: 2.26 s, total: 35.4 s
Wall time: 29.4 s
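As a quick sanity check (a sketch, not an executed cell), the projection should keep one row per document while reducing the 10 topic columns to 2:

# X_nmf: one row per document, 10 topic columns; X_proj: same rows, only 2 columns.
print(X_nmf.shape, '->', X_proj.shape)   # expected: (12619, 10) -> (12619, 2)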

Now that we've got a 2-dimensional version of our data, we can plot it as a scatter plot, where each point corresponds to a press release.

However, we'd also like to have a sense of what each document might be about.

We can do that by extracting the dominant topic for each document, and using it to colour the point representing the document.

Let's gather all this information into a dataframe.

In [42]:
dominant_topic = X_nmf.argsort(axis=1)[:, -1]
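Sorting each row of X_nmf and keeping the last index gives the column of the largest topic weight. An essentially equivalent (and arguably more direct) alternative, if you prefer, is argmax:

# Alternative, essentially equivalent (up to tie-breaking): argmax returns,
# for each row, the index of the largest topic weight.
dominant_topic = X_nmf.argmax(axis=1)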
In [43]:
df_proj = (pd.DataFrame(X_proj, columns=['x', 'y'])
               .assign(topic_num = dominant_topic)
          )

# add a human-readable label for each document's dominant topic
# (topics, defined earlier, maps a topic number to its top words)
df_proj = df_proj.assign(topic=df_proj.topic_num.map(topics))

# append the full set of topic weights so each document's topic signature sits alongside its coordinates
df_proj = pd.concat((df_proj, pd.DataFrame(X_nmf)), axis='columns')

df_proj.head()
Out[43]:
x y topic_num topic 0 1 2 3 4 5 6 7 8 9
0 -0.846373 -4.189077 1 boats | labor | boat | border | protection 0.013513 0.058611 0.000000 0.006378 0.022864 0.000000 0.000669 0.002244 0.005276 0.000000
1 -4.835822 -0.177999 7 humanitarian | refugees | refugee | program | ... 0.000000 0.000000 0.021207 0.000000 0.026870 0.000978 0.000000 0.061928 0.007201 0.000000
2 -5.005290 0.397936 7 humanitarian | refugees | refugee | program | ... 0.003917 0.000682 0.032677 0.011699 0.050171 0.000000 0.000000 0.067918 0.006025 0.000000
3 -0.079695 4.868925 9 migrants | services | settlement | community |... 0.003153 0.000000 0.000000 0.000000 0.004317 0.005031 0.008957 0.000000 0.007589 0.042587
4 -1.882492 -2.729572 3 nauru | processing | offshore | malaysia | asylum 0.015261 0.000000 0.006954 0.053783 0.031058 0.004192 0.011392 0.002010 0.001632 0.000000

For convenience, let's write a function to plot the results, as well as another helper function to zoom in on any part of the plot (I could have used bokeh to do that interactively, but for simplicity I'll stick with seaborn/matplotlib).

In [44]:
def plot_embedding(df_proj, xlim=None, ylim=None, figsize=(17, 10)):

    fig, ax = plt.subplots(figsize=figsize)

    # scatter plot of the 2D embedding: one point per press release,
    # coloured by the document's dominant topic
    sns.scatterplot(x='x', 
                    y='y', 
                    hue='topic', 
                    data=df_proj, 
                    palette='Paired', 
                    alpha=0.8, 
                    s=50,
                    ax=ax)

    # place the legend outside the plot and give it a title
    leg = ax.legend(bbox_to_anchor = (1, 1), markerscale=2, frameon=False, prop={"size":14})
    leg.texts[0].set_text("")  # blank the first legend entry (the hue label added by seaborn)
    leg.set_title('Dominant topic', prop={"size":18})

    # optionally zoom in on a region; when zooming, remove the legend
    # to leave more room for the points
    if xlim is not None:
        ax.set_xlim(xlim)
    if ylim is not None:
        ax.set_ylim(ylim)
        ax.get_legend().remove()

    ax.set_title('Topical portrait of the press releases', fontsize=18)

#     ax.set_axis_off() # comment this line to see the axes
    fig.tight_layout()

    return ax

def list_documents_in_frame(ax):
    # return the filenames of the press releases whose points lie within the current axis limits
    indices = df_proj[df_proj.x.between(*ax.get_xlim()) & df_proj.y.between(*ax.get_ylim())].index.values
    return [docs_path[i].name for i in indices]
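As a usage sketch (the axis limits below are purely illustrative, not values from the analysis), zooming in on a region and listing the press releases that fall inside it looks like this:

# Sketch: zoom in on an arbitrary region of the embedding and list the documents inside it.
ax_zoom = plot_embedding(df_proj, xlim=(-6, -4), ylim=(-1, 1))
print(list_documents_in_frame(ax_zoom))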
In [45]:
ax = plot_embedding(df_proj)