There are two methods of harvesting the complete set of data from the People and Organisations zone – using the OAI-PMH API, or using the main Trove API in conjunction with the SRU interface. The OAI-PMH method is much faster, but includes duplicate records that you'll need to filter out afterwards. This notebook demonstrates the API/SRU method.
You can't use the SRU interface on its own because it limits the lifespan of result sets, so attempting to traverse the complete database produces unexpected results. The main Trove API doesn't include full details of People and Organisations, but it does include identifiers, and it does support bulk harvests. So you can harvest a complete list of identifiers from the main Trove API and then use these identifiers to request the full EAC-CPF records from the SRU interface. It's slow, but it seems to work.
I've saved a complete harvest of all the people and organisations data on CloudStor (700MB zip file). The harvest was run on 23 January 2023. The dataset is a single XML file containing one EAC-CPF encoded record per line.
import json
import os
import re
import time
from datetime import datetime
from pathlib import Path
import requests_cache
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from tqdm.auto import tqdm
s = requests_cache.CachedSession()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("https://", HTTPAdapter(max_retries=retries))
s.mount("http://", HTTPAdapter(max_retries=retries))
%%capture
# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv
# Insert your Trove API key
API_KEY = "YOUR API KEY"
# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")
First we need to get the identifiers for all the people and organisations records from the main Trove API. We'll use a 'blank' search to get everything. The bulkHarvest parameter is necessary for large data harvests as it keeps the result set in a fixed order, so you don't end up with duplicates.
params = {
    "zone": "people",
    "q": " ",  # Blank search to get everything
    "bulkHarvest": "true",
    "encoding": "json",
    "key": API_KEY,
    "n": 100,
}
api_url = "https://api.trove.nla.gov.au/v2/result"
def get_total_results(params):
    params["n"] = 0
    response = s.get(api_url, params=params, timeout=30)
    data = response.json()
    return int(data["response"]["zone"][0]["records"]["total"])
peau_ids = []
total = get_total_results(params.copy())
start = "*"
with tqdm(total=total) as pbar:
    while start:
        params["s"] = start
        response = s.get(api_url, params=params)
        data = response.json()
        for record in data["response"]["zone"][0]["records"]["people"]:
            peau_ids.append(record["id"])
        # If there are more results there'll be a value for `nextStart`
        # that we use as the `start` value in the next request.
        try:
            start = data["response"]["zone"][0]["records"]["nextStart"]
        # If there's no nextStart value then we've finished!
        except KeyError:
            start = None
        pbar.update(len(data["response"]["zone"][0]["records"]["people"]))
        time.sleep(0.2)
# Write the identifiers to a file as backup
with Path(f"peau_ids_{datetime.now().strftime('%Y%m%d')}.json").open("w") as json_file:
    json.dump(peau_ids, json_file)
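If the SRU stage of the harvest is interrupted later, you can reload the saved identifiers from this backup file rather than re-harvesting them from the API. A minimal sketch (the dated filename in the usage example is just an illustration; adjust it to match your own harvest):

```python
import json
from pathlib import Path


def load_saved_ids(path):
    """Reload a previously saved list of identifiers from the JSON backup file."""
    with Path(path).open("r") as json_file:
        return json.load(json_file)


# e.g. peau_ids = load_saved_ids("peau_ids_20230123.json")
```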
Now we have a big list of identifiers, we can use them to request the full records from the SRU interface.
# Basic params for SRU requests
p_params = {
    "version": "1.1",
    "operation": "searchRetrieve",
    "recordSchema": "urn:isbn:1-931666-33-4",  # EAC-CPF encoding
    "maximumRecords": 10,
    "startRecord": 1,
    "resultSetTTL": 300,
    "recordPacking": "xml",
    "recordXPath": "",
    "sortKeys": "",
}
p_api_url = "http://www.nla.gov.au/apps/srw/search/peopleaustralia"
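For reference, it can help to see what a single SRU request looks like once the query parameter is filled in. This is a sketch of how the full request URL is assembled from the base parameters and a party identifier (the helper function and the example identifier are illustrative, not part of the harvest code):

```python
from urllib.parse import urlencode


def build_sru_url(
    party_id,
    base_params,
    api_url="http://www.nla.gov.au/apps/srw/search/peopleaustralia",
):
    """Construct the full SRU request URL for a single party identifier."""
    params = dict(base_params)
    # The rec.identifier index matches the persistent nla.party identifier
    params["query"] = f'rec.identifier="http://nla.gov.au/nla.party-{party_id}"'
    return f"{api_url}?{urlencode(params)}"


# e.g. build_sru_url("12345", p_params)
```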
This retrieves a single EAC-CPF encoded record at a time, appending it to a dated peau-data-[YYYYMMDD].xml file. If the harvest is interrupted, delete this file before restarting to avoid creating duplicates. Query results are cached, so a restarted harvest will grab results from the cache where possible.

The output file has one EAC-CPF encoded record per line. This makes it easier to save and process the records efficiently.
for p_id in tqdm(peau_ids):
    # Construct a party id from the identifier and use it to query
    # the SRU interface via the rec.identifier index
    p_params["query"] = f'rec.identifier="http://nla.gov.au/nla.party-{p_id}"'
    response = s.get(p_api_url, params=p_params)
    soup = BeautifulSoup(response.content, "xml")
    with Path(f"peau-data-{datetime.now().strftime('%Y%m%d')}.xml").open(
        "a"
    ) as xml_file:
        for record in soup.find_all("record"):
            # Extract the EAC-CPF record
            eac_cpf = str(record.find("eac-cpf"))
            # Strip out any line breaks within the record
            eac_cpf = eac_cpf.replace("\n", "")
            eac_cpf = re.sub(r"\s+", " ", eac_cpf)
            # Write the record as a new line
            xml_file.write(eac_cpf + "\n")
    if not response.from_cache:
        time.sleep(0.2)
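To check the results, you can read the harvested file back line by line and parse each record. A minimal sketch using the standard library's ElementTree (the nameEntry/part element names come from the EAC-CPF schema; the `{*}` wildcard sidesteps namespace prefixes and needs Python 3.8 or later):

```python
import xml.etree.ElementTree as ET


def get_names(xml_line):
    """Extract name parts from a single EAC-CPF record (one line of the file)."""
    root = ET.fromstring(xml_line)
    # nameEntry/part elements hold the name components; {*} matches any namespace
    return [part.text for part in root.findall(".//{*}nameEntry/{*}part")]


# e.g.
# with open("peau-data-20230123.xml") as xml_file:
#     for line in xml_file:
#         print(get_names(line))
```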
This will take a long time. But it works. If you don't want to run your own harvest, just download my pre-harvested dataset from CloudStor (700MB zip file).
Created by Tim Sherratt for the GLAM Workbench.
The development of this notebook was supported by the Australian Cultural Data Engine.