Harvest records from the NMA API

The National Museum of Australia provides access to its collection data through an API. But if you're going to do any large-scale analysis of the data, you probably want to harvest and save it locally. This notebook helps you do just that.

According to the API documentation, the possible endpoints are:

  • /object - the museum catalogue plus images/media
  • /narrative - narratives by Museum staff about featured topics
  • /party - people and organisations associated with collection items
  • /place - locations associated with collection items
  • /collection - sub-collections within the museum catalogue
  • /media - images and other media associated with collection items

This notebook should harvest records from any of these endpoints, though I've only tested object, party, and place.

It harvests records in the simple JSON format and saves them as they are to a file-based database using TinyDB. See the other notebooks in this repository for examples of loading the JSON data into a DataFrame for manipulation and analysis.
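To give you a rough idea of what that looks like, here's a minimal sketch (assuming you've already run the 'place' harvest below, and that you have pandas installed) that loads one of the harvested databases into a DataFrame:

In [ ]:
import pandas as pd
from tinydb import TinyDB

# Open the database created by the 'place' harvest below and
# load all of the saved records into a DataFrame
place_db = TinyDB('nma_place_db.json')
df = pd.DataFrame(place_db.all())
df.head()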

If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!

Some tips:

  • Code cells have boxes around them.
  • To run a code cell, click on the cell and then hit Shift+Enter. The Shift+Enter combo will also move you to the next cell, so it's a quick way to work through the notebook.
  • While a cell is running, an asterisk (*) appears in the square brackets next to the cell. Once the cell has finished running, the asterisk will be replaced with a number.
  • In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.
  • To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.

Is this thing on? If you can't edit or run any of the code cells, you might be viewing a static (read-only) version of this notebook. Click here to load a live version running on Binder.

Import what we need

In [ ]:
import requests
from tinydb import TinyDB, Query
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from tqdm.auto import tqdm

# Create a requests session that automatically retries requests
# that fail with a server error
s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount('http://', HTTPAdapter(max_retries=retries))
s.mount('https://', HTTPAdapter(max_retries=retries))

# Each record type has its own endpoint, eg https://data.nma.gov.au/object
API_BASE_URL = 'https://data.nma.gov.au/{}'

Set our API key

To make full use of the NMA API and avoid rate limits, you should go and get yourself an API key. Once you have your key, paste it in below.

In [ ]:
# Paste your key in between the quotes
API_KEY = 'YOUR API KEY'
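Before running a full harvest, you might want to make a single request to check that your key works and to see what a record looks like. This is just a quick sketch using the session, base url, and parameters defined in this notebook; change 'object' to any of the other endpoints to suit.

In [ ]:
# Make a single request to the /object endpoint to check everything's working.
# Uses the session, API_BASE_URL, and API_KEY defined above.
response = s.get(
    API_BASE_URL.format('object'),
    headers={'apikey': API_KEY},
    params={'text': '*', 'limit': 1}
)
data = response.json()
print(data['meta']['results'])  # Total number of matching records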

Create some functions to do the work

In [ ]:
def get_total(endpoint, params, headers):
    '''
    Get the total number of results.
    '''
    response = s.get(endpoint, headers=headers, params=params)
    data = response.json()
    return data['meta']['results']

def harvest_records(record_type):
    '''
    Harvest all records of the given type, saving them to a TinyDB database.
    '''
    # Put api key in request headers
    headers = {
        'apikey': API_KEY
    }
    
    # Set basic params
    params = {
        'text': '*',
        'limit': 100, # Number of records per request
        'offset': 0 # We'll change this as we loop through
    }
    
    # Create a db to hold the results
    db = TinyDB('nma_{}_db.json'.format(record_type))
    
    # Get the endpoint for this type of record
    endpoint = API_BASE_URL.format(record_type)
    
    # Are there more records? We'll check this on each request.
    more = True
    
    # Get the total number of records
    total_records = get_total(endpoint, params, headers)
    
    # Make a progress bar
    with tqdm(total=total_records) as pbar:
        
        # Continue while 'more' is True
        while more:
            
            # Get the data
            response = s.get(endpoint, headers=headers, params=params)
            data = response.json()
            
            # Insert the records (in the 'data' field) into the db
            db.insert_multiple(data['data'])
            
            # If there's not a 'next' link, set more to False
            more = data.get('links', {}).get('next', False)
            
            # Increase the offset to get the next page of results
            params['offset'] += params['limit']
            
            # Update the progress bar
            pbar.update(len(data['data']))
        

Harvest records!

In [ ]:
harvest_records('place')
In [ ]:
harvest_records('party')
In [ ]:
harvest_records('object')
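
Once a harvest has finished, you can open the resulting TinyDB file and check how many records were saved. Here's a minimal sketch, assuming the default database filenames created by harvest_records() above:

In [ ]:
# Count the records saved by each harvest
for record_type in ['place', 'party', 'object']:
    db = TinyDB('nma_{}_db.json'.format(record_type))
    print(record_type, len(db))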

Created by Tim Sherratt for the GLAM Workbench.

Work on this notebook was supported by the Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab.