Get a random work from Trove using queries and facets

Here's another way you can get a random work from Trove's book, article, picture, map, music, or collection zones. This approach is particularly useful if you want to get a random result from a search, or want to apply a variety of facets. It's not as quick as pinging random work ids at Trove, but it's more flexible.

Basically this method gets all the available facets for a particular search. If the search has more than 100 results, it chooses one of the facets at random and applies it. It keeps doing this until the search returns less that 100 results. Then it chooses a work at random from the results. If you don't supply a query, it uses a random stop word to mix things up a bit.

The problem with this approach is that facets can't always be extracted from records, and there's no way of finding records without a particular facet. For example, you can use the year facet to limit results to a particular year, but what about records that don't have a year value. Once you start using that facet, they're invisible. I'm worried that this will mean that certain parts of Trove will never be surfaced. It would of course be much better if Trove just supported random sorting so I didn't have to do all these stupid workarounds.

Collection searches (ie using NUC identifiers) are particularly tricky, because items from a single collection can share very similar facet values. To try and limit the results in this sort of situation, I've provided a couple of extra parameters:

  • add_word – adds a random stopword to the query
  • add_number – adds a random two digit number to the query (useful if the records use numeric identifiers)

These can help increase the degree of randomness, but again I suspect some parts of collections will never be reached.

In [1]:
import requests
import json
import random
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from IPython.display import display, HTML

s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))

with open('stopwords.json', 'r') as json_file:
    STOPWORDS = json.load(json_file)
In [2]:
API_KEY = 'YOUR API KEY'
API_URL = 'http://api.trove.nla.gov.au/v2/result'
In [3]:
def get_facet_terms(terms):
    '''
    Get all the terms in a facet.
    '''
    facet_terms = []
    for term in terms:
        facet_terms.append(term['search'])
        if 'term' in term:
            facet_terms += get_facet_terms(term['term'])
    return facet_terms

def get_facets(data):
    '''
    Get the names/terms of facets available from a search.
    '''
    facets = []
    for facet in data['response']['zone'][0]['facets']['facet']:
        if facet['name'][:3] != 'adv' and facet['name'] != 'decade':
            terms = get_facet_terms(facet['term'])
            facets.append({'facet': facet['name'], 'terms': terms})
    return facets


def set_query(params, query=None, add_word=False, add_number=False):
    '''
    Add a 'q' value to the parameters, including random words and numbers if required.
    '''
    random_word = random.choice(STOPWORDS)
    random_number = random.randrange(1, 100)
    if query:
        if add_word:
            params['q'] = f'{query} "{random_word}"'
        elif add_number:
            params['q'] = f'{query} "{random_number:02}"'
        else:
            params['q'] = query
    else:
        params['q'] = f'"{random_word}"'
    return params


def get_random_work_from_zone(zone, query, **kwargs):
    total = 0
    applied_facets = []
    params = {
        'zone': zone,
        'encoding': 'json',
        # Keeping this at 0 until we've filtered the results speeds things up
        'n': '0',
        'key': API_KEY,
        'facet': 'all',
        'include': 'links'
    }
    params['q'] = query
    for key, value in kwargs.items():
        params[f'l-{key}'] = value
        applied_facets.append(key)
    response = s.get(API_URL, params=params)
    data = response.json()
    total = int(data['response']['zone'][0]['records']['total'])
    facets = get_facets(data)
    facets[:] = [f for f in facets if f.get('facet') not in applied_facets]
    # Keep going until we either have less than 100 results or we run out of facets
    while total > 100 and len(facets) > 0:
        # print(f'Facets: {len(facets)}')
        # Select another facet
        new_facet = random.choice(facets)
        # Add it to the applied list
        applied_facets.append(new_facet['facet'])
        # Add the new facet as a parameter
        params[f'l-{new_facet["facet"]}'] = random.choice(new_facet['terms'])
        # Get the new results
        response = s.get(API_URL, params=params)
        data = response.json()
        # Get the facets available from the new search
        facets = get_facets(data)
        # Remove facets from the list that have already been applied
        facets[:] = [f for f in facets if f.get('facet') not in applied_facets]
        total = int(data['response']['zone'][0]['records']['total'])
        # print(total)
        # print(response.url)
    if total > 0:
        params['n'] = '100'
        # Cleaning up a bit
        params.pop('facet', None)
        response = s.get(API_URL, params=params)
        data = response.json()
        work = random.choice(data['response']['zone'][0]['records']['work'])
        return work


def get_zones(data):
    '''
    Find which zones have results in them.
    '''
    zones = []
    for zone in data['response']['zone']:
        if int(zone['records']['total']) > 0:
            zones.append(zone['name'])
    return zones


def get_random_work(zone=None, query=None, add_word=False, add_number=False, **kwargs):
    tries = 0
    zones = []
    params = {
        'encoding': 'json',
        'n': '0',
        'key': API_KEY,
    }
    if zone:
        params['zone'] = zone
    else:
        params['zone'] = 'book,article,picture,map,music,collection'
    params = set_query(params, query, add_word)
    # Add any supplied facets
    for key, value in kwargs.items():
        params[f'l-{key}'] = value
    # Make sure that at least some zones have results
    while len(zones) == 0 and tries <=10:
        params = set_query(params, query, add_word, add_number)
        response = s.get(API_URL, params=params)
        #print(response.url)
        data = response.json()
        zones = get_zones(data)
        tries += 1
    if len(zones) > 0:
        work = get_random_work_from_zone(zone=random.choice(zones), query=params['q'], **kwargs)
        return work

Get a work from Chinese-Australian Historical Images in Australia (CHIA)

This is a collection were facets aren't terribly useful in slicing up the results because the range of values is very limited. However, items in this collection do have numeric identifiers, and so including the add_number parameter seems to help divide it up into chunks of less than 100.

In [4]:
get_random_work(query='(nuc:"VMUS:CHIA")', add_number=True)
Out[4]:
{'id': '197896325',
 'url': '/work/197896325',
 'troveUrl': 'https://trove.nla.gov.au/work/197896325',
 'title': "Quong Tart's funeral cortege leaving Ashfield residence",
 'issued': 1903,
 'type': ['Photograph'],
 'holdingsCount': 1,
 'versionCount': 1,
 'relevance': {'score': '805.48956', 'value': 'very relevant'},
 'identifier': [{'type': 'url',
   'linktype': 'fulltext',
   'value': 'http://www.chia.chinesemuseum.com.au/objects/D001092.htm'},
  {'type': 'url',
   'linktype': 'thumbnail',
   'value': 'http://www.chia.chinesemuseum.com.au/objects/thumbs/tn_P00842_00001.JPG'}]}

Get a photo with a thumbnail

Using the new imageInd parameter in the query to find records with thumbnails.

In [5]:
get_random_work(zone='picture', q='imageInd:thumbnail', format='Photograph')
Out[5]:
{'id': '231674897',
 'url': '/work/231674897',
 'troveUrl': 'https://trove.nla.gov.au/work/231674897',
 'title': "Boyd's Bay Bridge, Tweed Heads 1930s",
 'contributor': ['Aussie~mobs'],
 'issued': 2012,
 'type': ['Photograph'],
 'holdingsCount': 1,
 'versionCount': 3,
 'relevance': {'score': '0.020671109', 'value': 'vaguely relevant'},
 'snippet': ' the same spot…And it just hung on cables! And often when it would come back down it <b>wouldn</b>’t sit on',
 'identifier': [{'type': 'url',
   'linktype': 'fulltext',
   'value': 'https://www.flickr.com/photos/[email protected]/7990079055'},
  {'type': 'url',
   'linktype': 'fulltext',
   'value': 'https://www.flickr.com/photos/[email protected]/7990080389'},
  {'type': 'url',
   'linktype': 'fulltext',
   'value': 'https://www.flickr.com/photos/[email protected]/7990077627'},
  {'type': 'url',
   'linktype': 'thumbnail',
   'value': 'https://live.staticflickr.com/8296/7990077627_f84f8066e0_t.jpg'}]}

Get a work tagged 'Japan'

You can include as many additional facets as you want. Here's an example using publictag.

In [6]:
get_random_work(publictag='Japan')
Out[6]:
{'id': '16575570',
 'url': '/work/16575570',
 'troveUrl': 'https://trove.nla.gov.au/work/16575570',
 'title': 'Gertrude (Jean) Williams interviewed by Jennifer Gall',
 'contributor': ['Williams, Jean (Gertrude Jean), 1909-1999'],
 'issued': 1989,
 'type': ['Sound/Interview, lecture, talk', 'Sound'],
 'holdingsCount': 1,
 'versionCount': 1,
 'relevance': {'score': '7.267762E-4', 'value': 'vaguely relevant'},
 'snippet': ['-200321935 Jean Williams describes <b>her</b> experiences as a foreigner living in Japan over a 50 year period',
  " to Japanese libraries; husband's research projects and writing; <b>her</b> interest in netsuki; husband's"],
 'identifier': [{'type': 'url',
   'linktype': 'fulltext',
   'linktext': 'National Library of Australia digitised item',
   'value': 'http://nla.gov.au/nla.obj-200321935'},
  {'type': 'url',
   'linktype': 'thumbnail',
   'value': 'http://nla.gov.au/nla.obj-200321935-t'}]}

Display a random thumbnail

Just to cheer myself up a bit...

In [124]:
record = get_random_work(zone='picture', q='imageInd:thumbnail', format='Photograph')
for link in record['identifier']:
    if link['linktype'] == 'thumbnail':
        url = link['value']
        break
display(HTML(f'<img src="{url}">'))

Speed test

In [48]:
%%timeit
get_random_work()
The slowest run took 5.62 times longer than the fastest. This could mean that an intermediate result is being cached.
2.17 s ± 1.26 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [ ]: