Harvest parliament press releases from Trove

Trove includes more than 370,000 press releases, speeches, and interview transcripts issued by Australian federal politicians and saved by the Parliamentary Library. You can view them all in Trove by searching for nuc:"APAR:PR" in the journals zone.

This notebook shows you how to harvest both metadata and full text from a search of the parliamentary press releases. The metadata is available from Trove, but to get the full text we have to go back to the Parliamentary Library's database, ParlInfo. The code in this notebook updates the harvesting code in my original GitHub repository.

There are two main steps:

  • Use the Trove API to search for specific keywords within the press releases and harvest metadata from the results. This gives us urls that we can use to get the text of the press releases from ParlInfo.
  • Use the harvested urls to retrieve the press releases from ParlInfo. The text of each release is extracted from its HTML page and saved as a plain text file.

Sometimes multiple press releases can be grouped together as 'works' in Trove. This is because Trove thinks that they're versions of the same thing. Indeed, there are multiple versions of some press releases. For example, sometimes the office of a Minister and the Minister's department both issue a copy of the same press release or transcript. But these versions are not always identical, and sometimes Trove has grouped press releases together incorrectly. To make sure that we harvest as many individual press releases as possible, the code below unpacks any versions contained within a 'work' and turns them into individual records. This means there will be more duplicates, but it also means you can explore how the versions might differ.
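
To make the unpacking concrete, here's a minimal sketch using a made-up 'work' (the field names follow the Trove API's JSON, but the values are invented for illustration):

In [ ]:
# A hypothetical 'work' containing two versions of the same press release
work = {
    'id': '12345678',
    'version': [
        {'id': '111', 'record': {'title': 'Transcript issued by the Minister'}},
        {'id': '222', 'record': {'title': 'Departmental copy of the transcript'}},
    ]
}
# Unpack each version into a separate record
records = []
for version in work['version']:
    record = version['record']
    record['version_id'] = version['id']
    records.append(record)
print(len(records))  # 2 -- one record per version, not one per work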

It looks like the earlier documents have been OCRd and the results are quite variable. If you follow the fulltext_url link you should be able to view a PDF version for comparison.

It also seems that some documents only have a PDF version and not any OCRd text. These documents will be ignored by the save_texts() function, so you might end up with fewer texts than records.

The copyright statement attached to each record in Trove reads:

Copyright remains with the copyright holder. Contact the Australian Copyright Council for further information on your rights and responsibilities.

So depending on what you want to do with them, you might need to contact individual copyright holders for permission.

An example – politicians talking about 'immigrants' and 'refugees'

I've used this notebook to update an example dataset relating to refugees that I first generated in December 2017. It was created by searching the press releases for the terms 'immigrant', 'asylum seeker', 'boat people', 'illegal arrival', 'boat arrivals', and 'refugee'. The exact query used is:

nuc:"APAR:PR" AND ("illegal arrival" OR text:"immigrant" OR text:"immigrants" OR "asylum seeker" OR "boat people" OR refugee OR "boat arrivals")

You can view the results of this query on Trove.

After unpacking the versions and harvesting available texts I ended up with 12,619 text files. You can browse the files on CloudStor, or download the complete dataset as a zip file (43MB).

Set your options

In the cell below you need to insert your search query and your Trove API key.

The search query can be anything you would enter in the Trove search box. As you can see from the examples below it can include phrases, exact phrases, and boolean operators (AND, OR, and NOT).
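
For example, any of the following would work as a value for query (these are just sketches; substitute your own search terms):

In [ ]:
# query = 'refugee'                            # a simple keyword
# query = '"asylum seeker"'                    # an exact phrase
# query = '"boat people" OR "boat arrivals"'   # boolean OR
# query = 'migration NOT "asylum seeker"'      # boolean NOT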

You can get a Trove API key by following these instructions.

You can change output_dir to save the results to a specific directory on your machine.

In [10]:
# Insert your query between the single quotes.
# query = '"illegal arrival" OR text:"immigrant" OR text:"immigrants" OR "asylum seeker" OR "boat people" OR refugee OR "boat arrivals"'
query = 'atomic'
# Insert your Trove API key between the single quotes
api_key = 'YOUR API KEY GOES HERE'
# You don't have to change this
output_dir = 'press-releases'
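
With the defaults above, a harvest ends up in a structure like this (a sketch based on the example query 'atomic'; the directory and file names are built from your slugified query, and the zip file only appears after you run the final cell):

press-releases/
├── press-releases-atomic/
│   ├── press-releases-atomic.csv
│   └── texts/
│       ├── 1991-09-16-evans-gareth-214098272.txt
│       └── ...
└── press-releases-atomic.zip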

Import the libraries we'll need

In [11]:
import requests
import time
from requests.exceptions import HTTPError
from bs4 import BeautifulSoup
from slugify import slugify
import pandas as pd
from datetime import datetime
import os
import shutil
from IPython.display import display, HTML, FileLink
from tqdm import tqdm_notebook
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))

Define some functions to do the work

In [12]:
def get_total_results(params):
    '''
    Get the total number of results for a search.
    '''
    these_params = params.copy()
    these_params['n'] = 0
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)
    data = response.json()
    return int(data['response']['zone'][0]['records']['total'])

def get_fulltext_url(links):
    '''
    Loop through the identifiers to find a link to the digital version of the item.
    '''
    url = None
    for link in links:
        if link['linktype'] == 'fulltext':
            url = link['value']
            break
    return url

def get_source(version):
    '''
    Get the metadata source of a version.
    The metadataSource field can be either a dict or a plain string.
    '''
    source = None
    metadata_source = version.get('metadataSource')
    if isinstance(metadata_source, dict):
        source = metadata_source.get('value')
    elif metadata_source:
        source = metadata_source
    return source

def harvest_prs(query, api_key):
    '''
    Harvest details of parliamentary press releases using the Trove API.
    This function saves the 'version' level records individually (these are grouped under 'works').
    '''
    # Define parameters for the search -- you could change this of course
    # The nuc:"APAR:PR" limits the results to the Parliamentary Press Releases
    params = {
        'q': 'nuc:"APAR:PR" AND ({})'.format(query),
        'zone': 'article',
        'n': 100,
        'key': api_key,
        'bulkHarvest': 'true',
        'encoding': 'json',
        'include': 'workVersions',
        'l-availability': 'y'
    }
    start = '*'
    total = get_total_results(params)
    records = []
    url = 'https://api.trove.nla.gov.au/v2/result'
    with tqdm_notebook(total=total) as pbar:
        while start:
            params['s'] = start
            response = s.get(url, params=params)
            data = response.json()
            # If there's a nextStart value then we use it to request the next page of results
            try:
                start = data['response']['zone'][0]['records']['nextStart']
            except KeyError:
                start = None
            for work in data['response']['zone'][0]['records']['work']:
                # Different records can be grouped within works as versions.
                # So we're going to extract each version as a separate record.
                for version in work['version']:
                    # Sometimes there are even versions grouped together in a version... ¯\_(ツ)_/¯
                    # We need to extract their ids from a single string
                    ids = version['id'].split()
                    # This may or may not be a list...
                    if isinstance(version['record'], list):
                        version_records = version['record']
                    else:
                        version_records = [version['record']]
                    # Loop through versions in versions.
                    for index, record in enumerate(version_records):
                        source = get_source(record)
                        if source == 'APAR:PR':
                            # Add the id to the version record
                            record['version_id'] = ids[index]
                            record = clean_metadata(record)
                            records.append(record) 
            pbar.update(100)
            # Pause between requests to try to avoid hitting the API request limit
            time.sleep(0.2)
    return records

def stringify_values(version, field):
    '''
    If a value is a list, join it into a pipe-separated string.
    Otherwise just return the string value.
    '''
    try:
        if isinstance(version[field], list):
            values = [str(v) for v in version.get(field)]
            value = '|'.join(values)
        else:
            value = version.get(field, '')
    except KeyError:
        value = ''
    return value

def clean_metadata(version):
    '''
    Standardises, cleans, and stringifies record metadata.
    '''
    record = {}
    record['version_id'] = version['version_id']
    record['title'] = version.get('title')
    record['date'] = version.get('date')
    # Join any multi-value fields into pipe-separated strings
    record['creators'] = stringify_values(version, 'creator')
    record['subjects'] = stringify_values(version, 'subject')
    record['source'] = stringify_values(version, 'source')
    # Get the fulltext url from the list of identifiers
    try:
        record['fulltext_url'] = get_fulltext_url(version['identifier'])
    except KeyError:
        record['fulltext_url'] = ''
    record['trove_url'] = 'https://trove.nla.gov.au/version/{}'.format(version['version_id'])
    return record

def save_texts(records, output_dir, query):
    '''
    Get the text of press releases in the ParlInfo db.
    This function uses urls harvested from Trove to request press releases from Parlinfo.
    Text is extracted from the HTML files and saved as individual text files.
    '''
    # Loop through all the previously harvested records
    for record in tqdm_notebook(records):
        output_path = os.path.join(output_dir, 'press-releases-{}'.format(slugify(query)), 'texts')
        os.makedirs(output_path, exist_ok=True)
        filename = '{}-{}-{}.txt'.format(record['date'], slugify(record['creators']), record['version_id'])
        file_path = os.path.join(output_path, filename)
        # Only save files we haven't saved before, and skip records without a fulltext url
        if record['fulltext_url'] and not os.path.exists(file_path):
            # Get the Parlinfo web page
            response = s.get(record['fulltext_url'])
            # Parse web page in Beautiful Soup
            soup = BeautifulSoup(response.text, 'lxml')
            content = soup.find('div', class_='box')
            # If we find some text on the web page then save it.
            if content:
                # Open file
                with open(file_path, 'w', encoding='utf-8') as text_file:
                    # Get the contents of each paragraph and write it to the file
                    for para in content.find_all('p'):
                        text_file.write('{}\n\n'.format(para.get_text().strip()))
            time.sleep(0.5)
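
If you just want to check how many results your query matches before running a full harvest, you could call get_total_results() directly, building the same parameters that harvest_prs() uses (a sketch, reusing the query and api_key set above):

In [ ]:
# A quick sanity check -- how many press releases match the query?
params = {
    'q': 'nuc:"APAR:PR" AND ({})'.format(query),
    'zone': 'article',
    'key': api_key,
    'encoding': 'json',
    'l-availability': 'y'
}
print(get_total_results(params))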

Harvest the metadata!

Running the cell below will harvest details of all the press releases matching our query using the Trove API. The results will be saved in the records variable for further use.

In [13]:
records = harvest_prs(query, api_key)

Save the harvested metadata

The cells below convert the records variable into a Pandas DataFrame, have a little peek inside, and then save all the harvested metadata as a CSV formatted text file. This file provides an index to the harvested press releases.

In [14]:
df = pd.DataFrame(records)
df.head()
Out[14]:
|   | creators      | date       | fulltext_url                                      | source                                  | subjects                                          | title                                              | trove_url                                  | version_id |
|---|---------------|------------|---------------------------------------------------|-----------------------------------------|---------------------------------------------------|----------------------------------------------------|--------------------------------------------|------------|
| 0 | Evans, Gareth | 1991-09-16 | http://parlinfo.aph.gov.au/parlInfo/search/dis... | Minister for Foreign Affairs and Trade  |                                                   | The International Atomic Energy Agency and the...  | https://trove.nla.gov.au/version/214098272 | 214098272  |
| 1 | ALP           | 1902-01-01 | http://parlinfo.aph.gov.au/parlInfo/search/dis... | AUSTRALIAN LABOR PARTY                  | History of the Federal Capital and Parliament ... | Australian Labor Party: 2nd Commonwealth Confe...  | https://trove.nla.gov.au/version/211168619 | 211168619  |
| 2 | ALP           | 1964-04-27 | http://parlinfo.aph.gov.au/parlInfo/search/dis... | LEADER OF THE OPPOSITION                |                                                   | Outside control of the liberal party               | https://trove.nla.gov.au/version/211168681 | 211168681  |
| 3 | ALP           | 1965-06-10 | http://parlinfo.aph.gov.au/parlInfo/search/dis... | LEADER OF THE OPPOSITION                |                                                   | Decisions of the federal executive                 | https://trove.nla.gov.au/version/211168736 | 211168736  |
| 4 | ALP           | 1964-11-09 | http://parlinfo.aph.gov.au/parlInfo/search/dis... | LEADER OF THE OPPOSITION                |                                                   | Decisions of the federal executive                 | https://trove.nla.gov.au/version/211168720 | 211168720  |

Note that the number of records in the harvested data might be different to the number of search results. This is because we've unpacked versions that had been combined into a single work.
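
Unpacking versions also reintroduces duplicates, so you might want a rough count of how many records share the same metadata (a sketch only; matching metadata doesn't guarantee the texts are identical):

In [ ]:
# Count versions that duplicate the title, date, and creators of an earlier version
df.duplicated(subset=['title', 'date', 'creators']).sum()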

In [15]:
# How many records
df.shape
Out[15]:
(1771, 8)
In [16]:
# Save the data as a CSV file
os.makedirs(os.path.join(output_dir, 'press-releases-{}'.format(slugify(query))), exist_ok=True)
df.to_csv(os.path.join(output_dir, 'press-releases-{}'.format(slugify(query)), 'press-releases-{}.csv'.format(slugify(query))), index=False)

Download the text files

The details we've harvested from the Trove API include a url that points to the full text of the press release in the ParlInfo database. Now we can loop through all those urls, saving the text of the press releases.

In [17]:
# Only run this cell if you need to reload the harvested metadata from the CSV
df = pd.read_csv(os.path.join(output_dir, 'press-releases-{}'.format(slugify(query)), 'press-releases-{}.csv'.format(slugify(query))), keep_default_na=False)
records = df.to_dict('records')
In [18]:
save_texts(records, output_dir, query)
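
As noted above, records that point to PDF-only documents are skipped, so you can end up with fewer text files than records. Here's a quick way to check, assuming the same output path that save_texts() builds:

In [ ]:
import glob
# Compare the number of saved text files with the number of harvested records
texts_path = os.path.join(output_dir, 'press-releases-{}'.format(slugify(query)), 'texts')
print('{} text files for {} records'.format(len(glob.glob(os.path.join(texts_path, '*.txt'))), len(records)))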

Zip the results for easy download

The metadata and text files we've harvested are all sitting in a directory named using the query value. If you're running this notebook on a cloud service, like Binder, you probably want to download it all. Running the cell below will zip up the whole directory and provide a convenient download link.

In [19]:
output_path = os.path.join(output_dir, 'press-releases-{}'.format(slugify(query)))
shutil.make_archive(output_path, 'zip', output_path)
display(HTML('<b>Download results</b>'))
display(FileLink('{}.zip'.format(output_path)))

Created by Tim Sherratt.

Work on this notebook was supported by the Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab.