There are two methods of harvesting the complete set of data from the People and Organisations zone – using the OAI-PMH API, or using the main Trove API in conjunction with the SRU interface. The OAI-PMH method is much faster, but includes duplicate records that you'll need to filter out afterwards. This notebook demonstrates the OAI-PMH method.
Using the ListRecords
method with the OAI-PMH API will cause an error unless you provide a set
parameter. There is a set for each organisation contributing data to the People and Organisations zone. So to harvest everything you need to loop through the list of sets, downloading the records for each. However, records are matched and merged across sets, so the same record can appear in multiple sets. This means your harvest will contain duplicate records.
The list of records also includes records that have been deleted. These records are effectively empty, containing only an identifier and a date. They're not saved as part of the harvest, but you could adjust the code below to save them to a file if you wanted to.
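Deleted records can also be recognised from their OAI-PMH headers: the OAI-PMH specification flags deletions with a status="deleted" attribute on the record's header element. A minimal sketch of checking for this, using a simplified sample record with namespaces omitted:

```python
import xml.etree.ElementTree as ET

# A deleted record as it might appear in an OAI-PMH response
# (simplified sample -- namespaces omitted, identifier invented)
deleted = """<record>
  <header status="deleted">
    <identifier>oai:example:123</identifier>
    <datestamp>2021-01-01</datestamp>
  </header>
</record>"""

record = ET.fromstring(deleted)
# Per the OAI-PMH spec, deletions carry status="deleted" on <header>
is_deleted = record.find("header").get("status") == "deleted"
print(is_deleted)  # True
```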
The results of the harvest are saved to a file named with the current date – peau-oai-data-YYYYMMDD.xml
. Each line in this file is a separate EAC-CPF encoded XML document.
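Because each line of the output file is a self-contained XML document, the harvest can be read back one line at a time. Here's a minimal sketch using the standard library — the filename and sample records below are invented, and simplified without the EAC-CPF namespace that real records include:

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Create a small sample file in the same line-per-record layout
# (invented records, simplified without the EAC-CPF namespace)
sample = Path("sample-peau.xml")
sample.write_text(
    "<eac-cpf><control><recordId>A1</recordId></control></eac-cpf>\n"
    "<eac-cpf><control><recordId>A2</recordId></control></eac-cpf>\n"
)

# Parse each line as a separate XML document
with sample.open() as xml_file:
    records = [ET.fromstring(line) for line in xml_file]
record_ids = [r.find("control/recordId").text for r in records]
print(record_ids)  # ['A1', 'A2']
```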
import re
from datetime import datetime
from pathlib import Path
import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from tqdm.auto import tqdm
# Not using requests_cache as caching results causes problems with resumptionToken
s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("https://", HTTPAdapter(max_retries=retries))
s.mount("http://", HTTPAdapter(max_retries=retries))
First use the ListSets
method to get a list of all the available sets.
# Get a list of sets
response = s.get(
    "http://www.nla.gov.au/apps/peopleaustralia-oai/OAIHandler?verb=ListSets"
)
set_soup = BeautifulSoup(response.text, features="xml")
Next loop through the list of sets, using the ListRecords
method with the set
parameter to get all the records in each set.
This retrieves a single EAC-CPF encoded record at a time, appending it to the peau-oai-data-YYYYMMDD.xml
file. If the harvest is interrupted, delete the output file before restarting to avoid creating duplicates.
# Set the output file
output = Path(f"peau-oai-data-{datetime.now().strftime('%Y%m%d')}.xml")

# Loop through the sets
for source in set_soup.find_all("set"):
    source_id = source.setSpec.string
    print(source_id)
    # Get the records in this set
    with tqdm() as pbar:
        params = {"verb": "ListRecords", "set": source_id, "metadataPrefix": "eac-cpf"}
        # OAI-PMH uses a resumption token to paginate through the complete results set
        # We'll continue harvesting until there's no resumption token
        while params:
            response = s.get(
                "http://www.nla.gov.au/apps/peopleaustralia-oai/OAIHandler",
                params=params,
            )
            soup = BeautifulSoup(response.text, features="xml")
            with output.open("a") as xml_file:
                for record in soup.find_all("record"):
                    # Extract the EAC-CPF record
                    # First check there is an eac-cpf record inside
                    # If there's not, it's probably a deleted record, so skip it
                    if not record.find("eac-cpf"):
                        # You could change this to display or save deleted records
                        # print(record)
                        continue
                    eac_cpf = str(record.find("eac-cpf"))
                    # Strip out any line breaks within the record
                    eac_cpf = eac_cpf.replace("\n", "")
                    eac_cpf = re.sub(r"\s+", " ", eac_cpf)
                    # Write the record as a new line
                    xml_file.write(eac_cpf + "\n")
                    pbar.update(1)
            # Get the resumption token and add it to the request params
            if soup.find("resumptionToken") and soup.find("resumptionToken").string:
                if not pbar.total:
                    pbar.total = int(soup.find("resumptionToken")["completeListSize"])
                params = {
                    "verb": "ListRecords",
                    "resumptionToken": soup.find("resumptionToken").string,
                }
            else:
                params = None
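Because matched and merged records appear in multiple sets, the harvest will contain duplicates that need filtering out afterwards. A post-harvest deduplication sketch, assuming duplicate records share the same recordId (the sample file and records here are invented, simplified without the EAC-CPF namespace):

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Build a small sample harvest containing a duplicate
# (invented records, simplified without the EAC-CPF namespace)
harvest = Path("sample-harvest.xml")
harvest.write_text(
    "<eac-cpf><control><recordId>A1</recordId></control></eac-cpf>\n"
    "<eac-cpf><control><recordId>A2</recordId></control></eac-cpf>\n"
    "<eac-cpf><control><recordId>A1</recordId></control></eac-cpf>\n"
)

# Keep only the first occurrence of each recordId
seen = set()
unique_lines = []
with harvest.open() as xml_file:
    for line in xml_file:
        record_id = ET.fromstring(line).find("control/recordId").text
        if record_id not in seen:
            seen.add(record_id)
            unique_lines.append(line)

print(len(unique_lines))  # 2
```

Writing the surviving lines back out with `Path.write_text("".join(unique_lines))` would give you a deduplicated dataset.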
Created by Tim Sherratt for the GLAM Workbench.
The development of this notebook was supported by the Australian Cultural Data Engine.