Select a random(ish) record from DigitalNZ

The DigitalNZ API doesn't provide a random sort option. You can jump to a randomly selected page of results, but you can't go any deeper than 100,000 pages into a results set (that's 1,000,000 records if you set the per_page value to 100). So we need to find some way of filtering the results until there are fewer than 1,000,000; then we can grab a random page and record.
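
To make the paging arithmetic concrete, here is a minimal sketch. It takes the 1,000,000-record ceiling and the per_page value of 100 from this notebook as given rather than checking them against the API, and the function names `pages_needed` and `can_page_randomly` are hypothetical.

```python
import math

MAX_RECORDS = 1_000_000  # ceiling used in this notebook (an assumption here, not verified against the API)
PER_PAGE = 100

def pages_needed(total, per_page=PER_PAGE):
    '''Number of pages required to cover `total` results.'''
    return math.ceil(total / per_page)

def can_page_randomly(total, max_records=MAX_RECORDS):
    '''True if the result set is non-empty and under the ceiling.'''
    return 0 < total < max_records

print(pages_needed(31_640_164))       # pages in an unfiltered search of 31,640,164 results
print(can_page_randomly(31_640_164))  # too big, so it needs filtering first
print(can_page_randomly(250_000))     # small enough to pick a random page
```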

We can use facets to filter the results. As you can see at the bottom of this notebook, I did a bit of examination of the facets to understand their coverage. If only 50% of records have a value for a particular facet and we use it to filter the results, then 50% of the records will be missing from the pool we make our random selection from. So we want to use facets that have been applied to as many records as possible.

A blank search returns 31,640,164 results.

I extracted facets for category, display_collection, creator, placename, year, decade, century, language, content_partner, rights, collection, and usage. The facets that seem to have the best coverage are:

category: 31,653,142 records
content_partner: 31,642,453 records
year: 30,867,103 records

I don't know why category and content_partner have more records than a blank search – I suppose either the blank search is filtering out records, or some records have multiple values for these facets. Note, too, that year has 918 values! The maximum number of facet values that can be retrieved in a single request is 350, so this makes it tricky to filter the results using just the year facet. By applying category and content_partner before year, I should limit the number of year values, and hopefully avoid overlooking too many records. (I could analyse all the combinations of these facets to see how many records might be missed, but I don't think it's worth it at this stage.)
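
As a rough check on coverage, the facet totals listed above can be compared with the blank search total. This is only a sketch using the numbers quoted in this notebook; the `coverage` function is a hypothetical helper, and a ratio above 100% is merely consistent with the guesses above rather than proof of either one.

```python
BLANK_SEARCH_TOTAL = 31_640_164

# Facet totals quoted above
facet_totals = {
    'category': 31_653_142,
    'content_partner': 31_642_453,
    'year': 30_867_103,
}

def coverage(facet_total, search_total=BLANK_SEARCH_TOTAL):
    '''Facet total as a fraction of the blank search total.'''
    return facet_total / search_total

for facet, total in facet_totals.items():
    ratio = coverage(total)
    note = ' (more than the blank search)' if ratio > 1 else ''
    print(f'{facet}: {ratio:.2%}{note}')
```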

So for now, I've decided to apply a randomly selected value from each of these facets in the following order – category, content_partner, and year. After applying each filter I'll check whether we're under 1,000,000 results. If so, we'll grab a record by jumping to a random page and selecting a random result!

As you can see from the examples below, you can also supply your own filters if you want to limit the selection pool.
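
Filters end up as `and[facet][]` query parameters, while `text` is passed through as a plain search term. The sketch below shows what the resulting query string looks like without calling the API. `build_params` is a hypothetical stand-in that mirrors the `add_kwargs_to_params` function defined further down, and the API key is left out on purpose.

```python
from urllib.parse import urlencode, parse_qs

def build_params(**filters):
    '''Hypothetical helper mirroring add_kwargs_to_params (no API key included).'''
    params = {'per_page': 0}
    for key, value in filters.items():
        if key == 'text':
            # Free text search term
            params['text'] = value
        else:
            # Facet filter, e.g. and[category][]=Newspapers
            params[f'and[{key}][]'] = value
    return params

params = build_params(category='Newspapers', text='Evening Post')
query = urlencode(params)
print(query)            # URL-encoded query string
print(parse_qs(query))  # decoded again, to show the parameter names
```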

Import what we need

In [1]:

import requests
import random
import math
import pandas as pd
from tqdm.auto import tqdm
from IPython.display import Image, display, HTML

In [ ]:

API_KEY = '[YOUR API KEY]'
API_URL = 'http://api.digitalnz.org/v3/records.json'

Define some functions

In [3]:

def get_total(**kwargs):
    '''
    Get the total number of results from a query built using supplied kwargs as parameters.
    '''
    params = {
        'api_key': API_KEY,
        'per_page': 0
    }
    params = add_kwargs_to_params(params, kwargs)
    data = get_records(params)
    return data['search']['result_count']
    
def get_records(params):
    '''
    Get records from a search using the supplied parameters.
    '''
    response = requests.get(API_URL, params=params)
    # Fail early with a clear error on a bad API key or other HTTP problem
    response.raise_for_status()
    return response.json()

def add_kwargs_to_params(params, kwargs):
    '''
    Add kwargs to query parameters.
    '''
    for k, v in kwargs.items():
        if k == 'text':
            params[k] = v
        else:
            params[f'and[{k}][]'] = v
    return params

def get_random_result(**kwargs):
    '''
    Select a random result from a query built using supplied kwargs as parameters.
    Returns None if the query has no results.
    '''
    total = get_total(**kwargs)
    if not total:
        # Nothing to choose from
        return None
    pages = math.ceil(total / 100)
    page = random.randint(1, pages)
    params = {
        'api_key': API_KEY,
        'per_page': 100,
        'page': page
    }
    params = add_kwargs_to_params(params, kwargs)
    data = get_records(params)
    try:
        record = random.choice(data['search']['results'])
    except (KeyError, IndexError):
        # Either no results key, or an empty page of results
        record = None
    return record

def get_facets(facet, **kwargs):
    '''
    Get values for the specified facet.
    '''
    params = {
        'facets': [facet],
        'api_key': API_KEY,
        'per_page': 0,
        'facets_per_page': 350 # 350 is the max
    }
    params = add_kwargs_to_params(params, kwargs)
    data = get_records(params)
    total = data['search']['result_count']
    facets = data['search']['facets'][facet]
    return (total, facets)

def get_random_facet(facets):
    '''
    Select a facet value from a list of facets, using the facet counts as weights.
    Returns a (value, count) tuple.
    '''
    values = list(facets.items())
    weights = [count for _, count in values]
    return random.choices(values, weights=weights, k=1)[0]

def select_facet(facet, **kwargs):
    '''
    Apply the specified facet to a query, if the total results are less than 1,000,000 then get a random result.
    '''
    _, facets = get_facets(facet, **kwargs)
    value = get_random_facet(facets)
    print(f'  * {facet.title()}: {value[0]}')
    kwargs[facet] = value[0]
    if value[1] < 1000000:
        record = get_random_result(**kwargs)
    else:
        record = None
    return (record, kwargs)
    
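def get_random_record(**kwargs):
    '''
    Get a random record, applying randomly selected facet values until there are fewer than 1,000,000 results.
    Returns the string 'Too many' if the results can't be narrowed enough.
    '''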
    print('Additional filters:')
    if kwargs:
        total = get_total(**kwargs)
        if total < 1000000:
            print('  * None')
            return get_random_result(**kwargs)
    for facet in ['category', 'content_partner', 'year']:
        if facet not in kwargs:
            record, kwargs = select_facet(facet, **kwargs)
            if record:
                return record
    return 'Too many'
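
Why use the facet counts as weights in `get_random_facet`? The sketch below works through the probabilities with made-up counts. It assumes each record has exactly one value for the facet and that the final pick within a value is uniform; neither is guaranteed for real DigitalNZ data, which is part of why the selection is only random(ish). The `record_probability` function is hypothetical.

```python
from fractions import Fraction

# Made-up facet counts, for illustration only
facets = {'Newspapers': 6, 'Images': 3, 'Books': 1}
total = sum(facets.values())

def record_probability(value, weighted=True):
    '''Chance of picking one particular record that has this facet value.'''
    count = facets[value]
    if weighted:
        # Facet value chosen in proportion to its count
        p_value = Fraction(count, total)
    else:
        # Every facet value equally likely, whatever its size
        p_value = Fraction(1, len(facets))
    # Then one record chosen uniformly from those with this value
    return p_value * Fraction(1, count)

for value in facets:
    print(value, record_probability(value), record_probability(value, weighted=False))
```

With weighting, every record has the same chance of 1 in `total`. Without it, records under small facet values would turn up far more often than records under large ones.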

A random record

In [4]:

# Get a record
record = get_random_record()

# Display the results
display(HTML(f'\n<h4>{record["title"]}</h4>'))
if record['description']:
    display(HTML(f'<p>{record["description"]}</p>'))
display(HTML(f'<a href="{record["landing_url"]}">More...</a>'))

Additional filters:
  * Category: Newspapers
  * Content_Partner: National Library of New Zealand
  * Year: 1905

FURTHER JAPANESE SUCCESSES ON DAND. (Colonist, 12 June 1905)

More...
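
The display code in these examples assumes a record always comes back with the expected fields. Since `get_random_result` can return None and `get_random_record` can return the string 'Too many', a more defensive version might look like the sketch below. `record_to_html` is a hypothetical helper, not part of the notebook, and the sample URL is a placeholder. Note that it escapes field values, so any markup inside a description would be shown as text.

```python
from html import escape

def record_to_html(record, text_field='description', max_chars=500):
    '''Hypothetical helper: build display HTML, tolerating missing data.'''
    if not isinstance(record, dict):
        # Covers None and the 'Too many' string
        return '<p>No record found.</p>'
    parts = [f'<h4>{escape(record.get("title") or "Untitled")}</h4>']
    text = record.get(text_field)
    if text:
        snippet = text[:max_chars]
        ellipsis = '...' if len(text) > max_chars else ''
        parts.append(f'<p>{escape(snippet)}{ellipsis}</p>')
    url = record.get('landing_url')
    if url:
        parts.append(f'<a href="{escape(url)}">More...</a>')
    return '\n'.join(parts)

# Sample data: the URL is a placeholder, not a real landing page
sample = {'title': 'McCarthy, Woman', 'description': None, 'landing_url': 'https://example.org/record'}
print(record_to_html(sample))
print(record_to_html('Too many'))
```

In the notebook, the returned string could be passed straight to `display(HTML(...))`, using `text_field='fulltext'` for newspaper articles.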

A random newspaper article

In [5]:

# Get a record
record = get_random_record(category='Newspapers')

# Display the results
display(HTML(f'\n<h4>{record["title"]}</h4>'))
display(HTML(f'<p>{record["fulltext"][:500]}...</p>'))
display(HTML(f'<a href="{record["landing_url"]}">More...</a>'))

Additional filters:
  * Content_Partner: National Library of New Zealand
  * Year: 1885

THE FATE OF THE ALABAMA AWARD. (Evening Post, 14 November 1885)

the fate of the alabama award at the time when the geneva award on the alabama claims was i made it waß a matter of remark that not only were the claims of the united states government grossly excessive but the actual amount awarded three millions sterling was far in exasss of any legitimate demands an article which now nppears in the new york press headed a court of swindlers is a con elusive justification of that contention after the three millions were awarded it was determined that the money...

More...

A random newspaper article from a specific decade

In [6]:

# Get a record
record = get_random_record(category='Newspapers', decade='1920')

# Display the results
display(HTML(f'<h4>{record["title"]}</h4>'))
display(HTML(f'<p>{record["fulltext"][:500]}...</p>'))
display(HTML(f'<a href="{record["landing_url"]}">More...</a>'))

Additional filters:
  * Content_Partner: National Library of New Zealand
  * Year: 1920

THE TURF. (Marlborough Express, 05 February 1920)

the turf taranaki results press association new plymouth feb fi the taranaki jockey club meeting conoluded in fine weather results hurdles explorer 1112 1 fair paul 90 2 cheddar 90 3 scratched master moutoa won l-y 1 lengths time 3min 1 l-ssec munster fell kawa-j handicap starland 98 1 self alliance 810 2 haversnck 67 3 all started won by four lengths time imin slsec...

More...

A random article from a specific newspaper

The newspaper title is stored in collection_title and publisher, but you don't seem to be able to filter using either of these, so we'll just do a text search for the title instead. This may mean we get results that are not actually from this newspaper...

In [7]:

# Get a record
record = get_random_record(category='Newspapers', text='Evening Post')

# Display the results
display(HTML(f'<h4>{record["title"]}</h4>'))
display(HTML(f'<p>{record["fulltext"][:500]}...</p>'))
display(HTML(f'<a href="{record["landing_url"]}">More...</a>'))

Additional filters:
  * Content_Partner: National Library of New Zealand
  * Year: 1936

Page 1 Advertisements Column 4 (Evening Post, 29 January 1936)

newman bros ltd regular services picton-blenheim-christchurch nelson-motueka-takaka west coast glaciers full particulars from a ll government tourist offices thos cook and son t and w young wellington wanted to sell wanted sell piano 14 10s or make offer owner removing iron frame overstrung write 1007 evg post yxiunxiiid to sell girl technical uni form only few months wear cheap apply 01 graftou road roseueath vyanted sell singer g6k4 droplieads bargains singer dropheads 5 hand machines cheap gu...

More...
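
Because the text search can match records from other newspapers, it may be worth checking the returned record against the collection_title and publisher fields mentioned above. This is a sketch only: I'm assuming these fields may hold either a string or a list of strings, and `field_values` and `is_from_newspaper` are hypothetical helpers. The sample records are invented to show the check.

```python
def field_values(record, field):
    '''Return a field's values as a list, whether stored as a string, a list, or missing.'''
    value = record.get(field)
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def is_from_newspaper(record, title):
    '''True if `title` matches a value in collection_title or publisher (case-insensitive).'''
    wanted = title.casefold()
    for field in ('collection_title', 'publisher'):
        if any(wanted == str(v).casefold() for v in field_values(record, field)):
            return True
    return False

# Invented sample records, for illustration only
sample_hit = {'collection_title': ['Evening Post'], 'publisher': ['Evening Post']}
sample_miss = {'collection_title': ['Colonist'], 'publisher': 'Colonist'}

print(is_from_newspaper(sample_hit, 'Evening Post'))
print(is_from_newspaper(sample_miss, 'Evening Post'))
```

If the check fails, one option is simply to call `get_random_record` again until it passes.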

A random item from a specific content partner

In [8]:

# Get a record
record = get_random_record(content_partner='Puke Ariki')

# Display the results
display(HTML(f'<h4>{record["title"]}</h4>'))
if 'thumbnail_url' in record and record['thumbnail_url']:
    display(Image(url=record['thumbnail_url'], format='jpg'))
if record['description']:
    display(HTML(f'<p>{record["description"]}</p>'))
display(HTML(f'<a href="{record["landing_url"]}">More...</a>'))

Additional filters:
  * None

McCarthy, Woman

Woman seated on bench, portrait in background scratched out.; Black and White Sheet Film; Black and White Negative

More...

A random open image

In [9]:

# Get a record
record = get_random_record(category='Images', usage='Use commercially')

# Display the results
display(HTML(f'<h4>{record["title"]}</h4>'))
# Prefer the large thumbnail, fall back to the standard one, skip if neither is set
image_url = record.get('large_thumbnail_url') or record.get('thumbnail_url')
if image_url:
    display(Image(url=image_url, format='jpg'))
if record['description']:
    display(HTML(f'<p>{record["description"]}</p>'))
display(HTML(f'<a href="{record["landing_url"]}">More...</a>'))

Additional filters:
  * Content_Partner: Museum of New Zealand Te Papa Tongarewa

Avenue

More...

Created by Tim Sherratt for the GLAM Workbench. Support this project by becoming a GitHub sponsor.