Harvest parliament press releases from Trove

Trove includes more than 370,000 press releases, speeches, and interview transcripts issued by Australian federal politicians and saved by the Parliamentary Library. You can view them all in Trove by searching for nuc:"APAR:PR" in the books & libraries category.

This notebook shows you how to harvest both metadata and full text from a search of the parliamentary press releases. The metadata is available from Trove, but to get the full text we have to go back to the Parliamentary Library's database, ParlInfo. The code in this notebook updates my original GitHub repository.

There are two main steps:

  • Use the Trove API to search for specific keywords within the press releases and harvest metadata from the results. This gives us urls that we can use to get the text of the press releases from ParlInfo.
  • Use the harvested urls to retrieve the press release from ParlInfo. The text of each release is extracted from the HTML page and saved as a plain text file.

Sometimes multiple press releases can be grouped together as 'works' in Trove. This is because Trove thinks that they're versions of the same thing. However, these versions are not always identical, and sometimes Trove has grouped press releases together incorrectly. To make sure that we harvest as many individual press releases as possible, the code below unpacks any versions contained within a 'work' and turns them into individual records.

It looks like the earlier documents have been OCRd and the results are quite variable. If you follow the fulltext_url link you should be able to view a PDF version for comparison.

It also seems that some documents only have a PDF version and not any OCRd text. These documents will be ignored by the save_texts() function, so you might end up with fewer texts than records.

The copyright statement attached to each record in Trove reads:

Copyright remains with the copyright holder. Contact the Australian Copyright Council for further information on your rights and responsibilities.

So depending on what you want to do with them, you might need to contact individual copyright holders for permission.

Duplicates and false positives

As noted above, Trove sometimes groups different press releases together as a single work. This seems to happen when press releases share a title and creator – for example, if an MP issues a press release titled 'Anzac Day' every year, these might be grouped as a single work. All of the different versions will be harvested by default. However, because the search operates at the work level, it's entirely possible that some of the grouped versions won't actually contain the search term you're looking for. To exclude these, we need to examine the text of each version individually to see if it matches.

There will also be press releases that have exactly the same text content, both within and across works. For example, when a press release is issued both by a Minister and their department, or when MPs disseminate press releases issued by their party.

To make it easier to deal with these two issues, I've added some post-harvest processing steps to:

  • remove records where the text content of the press release doesn't include any of the search terms (you'll need to adjust this to meet your needs)
  • add a hash column that represents the text content of a press release – this can be used to identify duplicates

An example – politicians talking about 'immigrants' and 'refugees'

I've used this notebook to update an example dataset relating to refugees that I first generated in December 2017. It's been created by searching for the terms 'immigrant', 'asylum seeker', 'boat people', 'illegal arrival', 'boat arrivals', and 'refugee' amongst the press releases. The exact query used is:

nuc:"APAR:PR" AND ("illegal arrival" OR text:"immigrant" OR text:"immigrants" OR "asylum seeker" OR "boat people" OR refugee OR "boat arrivals")

You can view the results of this query on Trove.

After unpacking the versions and harvesting available texts I ended up with 12,619 text files. You can browse the files on CloudStor, or download the complete dataset as a zip file (43mb).

Set your options

In the cell below you need to insert your search query and your Trove API key.

The search query can be anything you would enter in the Trove search box. As you can see from the examples below, it can include simple keywords, exact phrases, and boolean operators (AND, OR, and NOT).
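For example, any of the following would work as a value for query. These are just illustrations – they're not used elsewhere in this notebook.

In [ ]:
# Some example query values: keywords, exact phrases, and boolean operators
# query = 'bushfires'
# query = '"climate change"'
# query = '"asylum seeker" OR refugee'
# query = 'drought NOT flood'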

You can get a Trove API key by following these instructions.

You can change output_dir to save the results to a specific directory on your machine.

In [16]:
# Insert your query between the single quotes.
# query = '"illegal arrival" OR text:"immigrant" OR text:"immigrants" OR "asylum seeker" OR "boat people" OR refugee OR "boat arrivals"'
query = 'coronavirus OR covid'
# Insert your Trove API key between the single quotes
api_key = 'YOUR API KEY'
# You don't have to change this
output_dir = 'press-releases'

Import the libraries we'll need

In [2]:
import requests
import time
from requests.exceptions import HTTPError
from bs4 import BeautifulSoup
from slugify import slugify
import pandas as pd
from datetime import datetime
import os
import shutil
from IPython.display import display, HTML, FileLink
from tqdm.auto import tqdm
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from pathlib import Path
import re
import hashlib

# Set up a requests session that automatically retries failed requests
s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))

Define some functions to do the work

In [3]:
def get_total_results(params):
    '''
    Get the total number of results for a search.
    '''
    these_params = params.copy()
    these_params['n'] = 0
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)
    data = response.json()
    return int(data['response']['zone'][0]['records']['total'])

def get_fulltext_url(links):
    '''
    Loop through the identifiers to find a link to the digital version of the press release.
    '''
    url = None
    for link in links:
        if link['linktype'] == 'fulltext':
            url = link['value']
            break
    return url

def get_source(version):
    '''
    Get the metadata source of a version.
    The metadataSource field can be either a plain string or a dict with a 'value' key.
    '''
    source = None
    metadata_source = version.get('metadataSource')
    if isinstance(metadata_source, dict):
        source = metadata_source.get('value')
    elif metadata_source:
        source = metadata_source
    return source

def harvest_prs(query, api_key):
    '''
    Harvest details of parliamentary press releases using the Trove API.
    This function saves the 'version' level records individually (these are grouped under 'works').
    '''
    # Define parameters for the search -- you could change this of course
    # The nuc:"APAR:PR" limits the results to the Parliamentary Press Releases
    params = {
        'q': 'nuc:"APAR:PR" AND ({})'.format(query),
        'zone': 'article',
        'n': 100,
        'key': api_key,
        'bulkHarvest': 'true',
        'encoding': 'json',
        'include': 'workVersions',
        'l-availability': 'y'
    }
    start = '*'
    total = get_total_results(params)
    records = []
    url = 'https://api.trove.nla.gov.au/v2/result'
    with tqdm(total=total) as pbar:
        while start:
            params['s'] = start
            response = s.get(url, params=params)
            data = response.json()
            # If there's a startNext value then we get it to request the next page of results
            try:
                start = data['response']['zone'][0]['records']['nextStart']
            except KeyError:
                start = None
            for work in data['response']['zone'][0]['records']['work']:
                # Different records can be grouped within works as versions.
                # So we're going to extract each version as a separate record.
                for version in work['version']:
                    # Sometimes there are even versions grouped together in a version... ¯\_(ツ)_/¯
                    # We need to extract their ids from a single string
                    ids = version['id'].split()
                    # This may or may not be a list...
                    if isinstance(version['record'], list):
                        version_records = version['record']
                    else:
                        version_records = [version['record']]
                    # Loop through versions in versions.
                    for index, record in enumerate(version_records):
                        source = get_source(record)
                        if source == 'APAR:PR':
                            # Add the id to the version record
                            record['version_id'] = ids[index]
                            record['work_id'] = work['id']
                            record = clean_metadata(record)
                            records.append(record) 
            pbar.update(100)
            # Pause between requests to try to avoid hitting the API request limit
            time.sleep(0.2)
    return records

def stringify_values(version, field):
    '''
    If a value is a list, join it into a pipe-separated string.
    Otherwise just return the string value.
    '''
    try:
        if isinstance(version[field], list):
            values = [str(v) for v in version.get(field)]
            value = '|'.join(values)
        else:
            value = version.get(field, '')
    except KeyError:
        value = ''
    return value

def clean_metadata(version):
    '''
    Standardises, cleans, and stringifies record metadata.
    '''
    record = {}
    record['work_id'] = version['work_id']
    record['version_id'] = version['version_id']
    record['title'] = version.get('title')
    record['date'] = version.get('date')
    # Flatten any list values (eg creators, subjects) into pipe-separated strings
    record['creators'] = stringify_values(version, 'creator')
    record['subjects'] = stringify_values(version, 'subject')
    record['source'] = stringify_values(version, 'source')
    # Get the fulltext url from the list of identifiers
    try:
        record['fulltext_url'] = get_fulltext_url(version['identifier'])
    except KeyError:
        record['fulltext_url'] = ''
    record['trove_url'] = 'https://trove.nla.gov.au/version/{}'.format(version['version_id'])
    return record

def save_texts(records, output_dir, query):
    '''
    Get the text of press releases in the ParlInfo db.
    This function uses urls harvested from Trove to request press releases from Parlinfo.
    Text is extracted from the HTML files and saved as individual text files.
    '''
    # Loop through all the previously harvested records
    for record in tqdm(records):
        output_path = Path(output_dir, f'press-releases-{slugify(query)}', 'texts')
        output_path.mkdir(parents=True, exist_ok=True)
        filename = '{}-{}-{}.txt'.format(record['date'], slugify(record['creators']), record['version_id'])
        file_path = os.path.join(output_path, filename)
        # Only save files we haven't saved before, and only if there's a fulltext url to follow
        if record['fulltext_url'] and not os.path.exists(file_path):
            # Get the ParlInfo web page
            response = s.get(record['fulltext_url'])
            # Parse the web page with Beautiful Soup
            soup = BeautifulSoup(response.text, 'lxml')
            content = soup.find('div', class_='box')
            # If we find some text on the web page then save it.
            if content:
                # Open a file to write the text into
                with open(file_path, 'w', encoding='utf-8') as text_file:
                    # Get the contents of each paragraph and write it to the file
                    for para in content.find_all('p'):
                        text_file.write('{}\n\n'.format(para.get_text().strip()))
            time.sleep(0.5)

Harvest the metadata!

Running the cell below will harvest details of all the press releases matching our query using the Trove API. The results will be saved in the records variable for further use.

In [ ]:
records = harvest_prs(query, api_key)

Save the harvested metadata

The cells below convert the records variable into a Pandas DataFrame, have a little peek inside, and then save all the harvested metadata as a CSV-formatted text file. This file provides an index to the harvested press releases.

In [5]:
df = pd.DataFrame(records)
df.head()
Out[5]:
work_id version_id title date creators subjects source fulltext_url trove_url
0 10011567 266512042 Communique 2020-04-02 Council of Australian Governments EDUCATION COUNCIL http://parlinfo.aph.gov.au/parlInfo/search/dis... https://trove.nla.gov.au/version/266512042
1 193002751 211299706 Pacific Islands forum. 2006-10-22 Howard, John|Liberal Party of Australia PRIME MINISTER http://parlinfo.aph.gov.au/parlInfo/search/dis... https://trove.nla.gov.au/version/211299706
2 193002751 211285325 Pacific Islands Forum. 2005-10-27 Howard, John|Liberal Party of Australia PRIME MINISTER http://parlinfo.aph.gov.au/parlInfo/search/dis... https://trove.nla.gov.au/version/211285325
3 193002751 242219210 Pacific Islands Forum. 2004-08-06 Howard, John|Liberal Party of Australia Regionalism (International relations)|Pacific ... PRIME MINISTER http://parlinfo.aph.gov.au/parlInfo/search/dis... https://trove.nla.gov.au/version/242219210
4 193002751 273995943 Pacific Islands Forum 2021-02-02 Morrison, Scott|Liberal Party of Australia PRIME MINISTER http://parlinfo.aph.gov.au/parlInfo/search/dis... https://trove.nla.gov.au/version/273995943

Note that the number of records in the harvested data might be different to the number of search results. This is because we've unpacked versions that had been combined into a single work.

In [6]:
# How many records
df.shape[0]
Out[6]:
4030
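To see the effect of unpacking the versions, you can compare the number of individual records with the number of works they came from. This is just a quick check on the DataFrame we created above.

In [ ]:
# Number of distinct works the harvested records were unpacked from
df['work_id'].nunique()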

Download the text files

The details we've harvested from the Trove API include a url that points to the full text of the press release in the ParlInfo database. Now we can loop through all those urls, saving the text of the press releases.

In [ ]:
save_texts(records, output_dir, query)

Removing non-matches

As noted above, some of the press releases might not actually match our search. The cell below uses regular expressions to run a very basic check of the harvested text files to see if they contain the desired search terms. You will need to adjust pattern to suit your desired search results. In particular, you'll need to consider the amount of fuzziness you might expect in your search results and whether that will be captured by the regular expression pattern. If this is a problem, it might be better to use something like fuzzysearch to do the comparisons.

If the desired search terms are not found in a text file, the corresponding Trove record is removed from the results dataframe, and the text file is deleted.

In [8]:
# Change this!
pattern = r'(covid|coronavirus)'

for text_file in Path(output_dir, f'press-releases-{slugify(query)}', 'texts').glob('*.txt'):
    # Are our search terms in the file?
    if not re.search(pattern, text_file.read_text().lower()):
        # Get the version id
        version_id = re.search(r'\-(\d+)\.txt', text_file.name).group(1)
        # Remove the record with that version_id from the dataset
        df = df.loc[df['version_id'] != version_id]
        # Delete the text file
        text_file.unlink()
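If you're worried that OCR errors might cause false negatives with a strict regular expression, a fuzzy-matching check using the fuzzysearch package mentioned above might look something like the sketch below. Note that fuzzysearch isn't installed or imported earlier in this notebook, and the terms and max_l_dist value are just examples to adjust for your own query.

In [ ]:
# A sketch of a fuzzier check using the fuzzysearch package (an extra dependency)
from fuzzysearch import find_near_matches

terms = ['covid', 'coronavirus']

def contains_terms(text, terms, max_l_dist=1):
    # Return True if any of the terms appear in the text,
    # allowing up to max_l_dist character edits to account for OCR errors
    text = text.lower()
    return any(find_near_matches(term, text, max_l_dist=max_l_dist) for term in terms)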

How many records do we have now?

In [9]:
df.shape[0]
Out[9]:
3995

Find press releases with duplicate content

As noted above, there might be press releases that have the same content, but different metadata (eg title or creator). To make these duplicates easy to identify, this cell adds a hash column to the dataset. The hash value is a short string representation of each record's associated text file. If two records have the same hash value, then the contents of the press releases will be the same.

If you want, you can use this column to drop duplicates from the dataset. On the other hand, if you're interested in seeing how press releases are disseminated, you might want to group records by their hash values and compare the metadata within each group.

In [10]:
def get_hash(version_id):
    try:
        text_file = next(Path(output_dir, f'press-releases-{slugify(query)}', 'texts').glob(f'*-{version_id}.txt'))
        hashed = hashlib.sha1(text_file.read_text().encode()).hexdigest()
    except StopIteration:
        # No text file was saved for this version (probably there was only a PDF)
        print(version_id)
        hashed = None
    return hashed
    
df['hash'] = df['version_id'].apply(get_hash)
273635031

How many unique press releases are there?

In [11]:
df['hash'].nunique()
Out[11]:
2988
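If you only want one copy of each press release, you can use the hash column to drop duplicates from the DataFrame. Alternatively, grouping records that share a hash value lets you compare their metadata and see how a press release was disseminated. Here's a minimal sketch of both approaches.

In [ ]:
# Keep just the first record for each unique text
unique_df = df.drop_duplicates(subset='hash')

# Or look at groups of records that share the same text content
duplicates = df.loc[df.duplicated('hash', keep=False)].sort_values('hash')
duplicates.head()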

Save the dataset and zip all the results for easy download

Let's save the dataset as a CSV file for download.

In [12]:
# Save the data as a CSV file
output_path = Path(output_dir, f'press-releases-{slugify(query)}')
output_path.mkdir(parents=True, exist_ok=True)
df.to_csv(Path(output_path, f'press-releases-{slugify(query)}.csv'), index=False)

The metadata and text files we've harvested are all sitting in a directory named using the query value. If you're running this notebook on a cloud service, like Binder, you probably want to download it all. Running the cell below will zip up the whole directory and provide a convenient download link.

In [13]:
shutil.make_archive(output_path, 'zip', output_path)
display(HTML('<b>Download results</b>'))
# display(FileLink('{}.zip'.format(output_path)))
display(HTML(f'<a download="press-releases-{slugify(query)}.zip" href="{output_path}.zip">{output_path}.zip</a>'))

Created by Tim Sherratt for the GLAM Workbench.

Work on this notebook was supported by the Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab.