There are two methods of harvesting the complete set of data from the People and Organisations zone – using the OAI-PMH API, or using the main Trove API in conjunction with the SRU interface. The OAI-PMH method is much faster, but includes duplicate records that you'll need to filter out afterwards. This notebook demonstrates the OAI-PMH method.
Using the ListRecords
method with the OAI-PMH API will cause an error unless you provide a set
parameter. There is a set for each organisation contributing data to the People and Organisations zone. So to harvest everything you need to loop through the list of sets, downloading the records for each. However, records are matched and merged across sets, so the same record can appear in multiple sets. This means your harvest will contain duplicate records.
The list of records also includes records that have been deleted. These records are effectively empty, containing only an identifier and a date. They're not saved as part of the harvest, but you could adjust the code below to save them to a file if you wanted to.
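Deleted records can also be recognised from their OAI-PMH headers: the OAI-PMH specification flags deletions with a status="deleted" attribute on the record's header element. A minimal sketch of checking for this, using a simplified sample record with namespaces omitted:

```python
import xml.etree.ElementTree as ET

# A deleted record as it might appear in an OAI-PMH response
# (simplified sample -- namespaces omitted, identifier invented)
deleted = """<record>
  <header status="deleted">
    <identifier>oai:example:123</identifier>
    <datestamp>2021-01-01</datestamp>
  </header>
</record>"""

record = ET.fromstring(deleted)
# Per the OAI-PMH spec, deletions carry status="deleted" on <header>
is_deleted = record.find("header").get("status") == "deleted"
print(is_deleted)  # True
```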
The results of the harvest are saved to a file named with the current date – peau-oai-data-YYYYMMDD.xml
. Each line in this file is a separate EAC-CPF encoded XML document.
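Because each line of the output file is a self-contained XML document, the harvest can be read back one line at a time. Here's a minimal sketch using the standard library — the filename and sample records below are invented, and simplified without the EAC-CPF namespace that real records include:

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Create a small sample file in the same line-per-record layout
# (invented records, simplified without the EAC-CPF namespace)
sample = Path("sample-peau.xml")
sample.write_text(
    "<eac-cpf><control><recordId>A1</recordId></control></eac-cpf>\n"
    "<eac-cpf><control><recordId>A2</recordId></control></eac-cpf>\n"
)

# Parse each line as a separate XML document
with sample.open() as xml_file:
    records = [ET.fromstring(line) for line in xml_file]
record_ids = [r.find("control/recordId").text for r in records]
print(record_ids)  # ['A1', 'A2']
```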
import re
from datetime import datetime
from pathlib import Path
import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from tqdm.auto import tqdm
# Not using requests_cache as caching results causes problems with resumptionToken
s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("https://", HTTPAdapter(max_retries=retries))
s.mount("http://", HTTPAdapter(max_retries=retries))
First use the ListSets
method to get a list of all the available sets.
# Get a list of sets
response = s.get(
    "http://www.nla.gov.au/apps/peopleaustralia-oai/OAIHandler?verb=ListSets"
)
set_soup = BeautifulSoup(response.text, features="xml")
Next loop through the list of sets, using the ListRecords
method with the set
parameter to get all the records in each set.
This retrieves a single EAC-CPF encoded record at a time, appending it to the peau-oai-data-YYYYMMDD.xml
file. If the harvest is interrupted, delete the output file before restarting to avoid creating duplicates.
# Set the output file
output = Path(f"peau-oai-data-{datetime.now().strftime('%Y%m%d')}.xml")

# Loop through the sets
for source in set_soup.find_all("set"):
    source_id = source.setSpec.string
    print(source_id)
    # Get the records in this set
    with tqdm() as pbar:
        params = {"verb": "ListRecords", "set": source_id, "metadataPrefix": "eac-cpf"}
        # OAI-PMH uses a resumption token to paginate through the complete results set
        # We'll continue harvesting until there's no resumption token
        while params:
            response = s.get(
                "http://www.nla.gov.au/apps/peopleaustralia-oai/OAIHandler",
                params=params,
            )
            soup = BeautifulSoup(response.text, features="xml")
            with output.open("a") as xml_file:
                for record in soup.find_all("record"):
                    # Extract the EAC-CPF record
                    # First check there is an eac-cpf record inside
                    # If there's not, it's probably a deleted record, so skip it
                    if not record.find("eac-cpf"):
                        # You could change this to display or save deleted records
                        # print(record)
                        continue
                    eac_cpf = str(record.find("eac-cpf"))
                    # Strip out any line breaks within the record
                    eac_cpf = eac_cpf.replace("\n", "")
                    eac_cpf = re.sub(r"\s+", " ", eac_cpf)
                    # Write the record as a new line
                    xml_file.write(eac_cpf + "\n")
                    pbar.update(1)
            # Get the resumption token and add it to the request params
            if soup.find("resumptionToken") and soup.find("resumptionToken").string:
                if not pbar.total:
                    pbar.total = int(soup.find("resumptionToken")["completeListSize"])
                params = {
                    "verb": "ListRecords",
                    "resumptionToken": soup.find("resumptionToken").string,
                }
            else:
                params = None
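Because matched and merged records appear in multiple sets, the harvest will contain duplicates that need filtering out afterwards. A post-harvest deduplication sketch, assuming duplicate records share the same recordId (the sample file and records here are invented, simplified without the EAC-CPF namespace):

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Build a small sample harvest containing a duplicate
# (invented records, simplified without the EAC-CPF namespace)
harvest = Path("sample-harvest.xml")
harvest.write_text(
    "<eac-cpf><control><recordId>A1</recordId></control></eac-cpf>\n"
    "<eac-cpf><control><recordId>A2</recordId></control></eac-cpf>\n"
    "<eac-cpf><control><recordId>A1</recordId></control></eac-cpf>\n"
)

# Keep only the first occurrence of each recordId
seen = set()
unique_lines = []
with harvest.open() as xml_file:
    for line in xml_file:
        record_id = ET.fromstring(line).find("control/recordId").text
        if record_id not in seen:
            seen.add(record_id)
            unique_lines.append(line)

print(len(unique_lines))  # 2
```

Writing the surviving lines back out with `Path.write_text("".join(unique_lines))` would give you a deduplicated dataset.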
Created by Tim Sherratt for the GLAM Workbench.
The development of this notebook was supported by the Australian Cultural Data Engine.