Select a random(ish) record from DigitalNZ

The DigitalNZ API doesn't provide a random sort option. You can jump to a randomly selected page of results, but you can't go any deeper than 1,000,000 records into a result set (that's 10,000 pages if you set the per_page value to 100). So we need to find some way of filtering the results until there are fewer than 1,000,000, then we can grab a random page and record.
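
To make the paging constraint concrete, here's a minimal sketch of picking a random page while staying inside the reachable part of a result set. The names `MAX_RECORDS` and `pick_random_page` are my own, made up for illustration, and the 1,000,000 record ceiling is just the figure used throughout this notebook.

```python
import math
import random

PER_PAGE = 100
MAX_RECORDS = 1_000_000  # assumed ceiling on how deep paging can go

def pick_random_page(total, per_page=PER_PAGE, max_records=MAX_RECORDS):
    """Return a random 1-based page number, or None if there are no results."""
    # Only records inside the reachable window can ever be selected
    reachable = min(total, max_records)
    pages = math.ceil(reachable / per_page)
    if pages == 0:
        return None
    return random.randint(1, pages)

print(pick_random_page(250))         # a page between 1 and 3
print(pick_random_page(31_640_164))  # capped by the record ceiling
```

If the result set is bigger than the ceiling, everything beyond it can never be chosen, which is why the results need to be filtered first.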

We can use facets to filter the results. As you can see at the bottom of this notebook, I did a bit of examination of the facets to understand their coverage. If only 50% of records have a value for a particular facet and we use it to filter the results, then 50% of the records will be missing from the pool we make our random selection from. So we want to use facets that have been applied to as many records as possible.

A blank search returns 31,640,164 results.

I extracted facets for category, display_collection, creator, placename, year, decade, century, language, content_partner, rights, collection, and usage. The facets that seem to have the best coverage are:

  • category: 31,653,142 records
  • content_partner: 31,642,453 records
  • year: 30,867,103 records

I don't know why category and content_partner have more records than a blank search – I suppose either the blank search is filtering out records, or some records have multiple values for these facets. Note, too, that year has 918 values! The maximum number of facet values that can be retrieved in a single request is 350, so this makes it tricky to filter the results using just the year facet. By applying category and content_partner before year, I should limit the number of year values, and hopefully avoid overlooking too many records. (I could analyse all the combinations of these facets to see how many records might be missed, but I don't think it's worth it at this stage.)

So for now, I've decided to apply a randomly selected value from each of these facets in the following order: category, content_partner, and year. After applying each filter I'll check whether we're under 1,000,000 results. If so, we'll grab a record by jumping to a random page and selecting a random result!
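
The `get_random_facet` function defined below weights each facet value by its record count. Here's a small sketch of why that matters: a uniform choice between values would over-represent records that sit in small facet values. The counts are toy numbers invented for illustration, and `selection_probabilities` and `pick_weighted` are hypothetical helpers, not part of the notebook.

```python
import random

# Toy facet counts, invented for illustration only
facets = {'Newspapers': 900, 'Images': 90, 'Audio': 10}

def selection_probabilities(facets):
    """Probability of picking each facet value when weighted by record count."""
    total = sum(facets.values())
    return {value: count / total for value, count in facets.items()}

def pick_weighted(facets, rng=random):
    """Pick one facet value, using the record counts as weights."""
    values = list(facets.keys())
    weights = list(facets.values())
    return rng.choices(values, weights=weights, k=1)[0]

probs = selection_probabilities(facets)
choice = pick_weighted(facets, random.Random(42))  # seeded so the run is repeatable
print(probs)
print(choice)
```

With these weights, a facet value holding 90% of the records is chosen about 90% of the time, so the first filtering step roughly mirrors the make-up of the whole collection.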

As you can see from the examples below, you can also supply your own filters if you want to limit the selection pool.

Import what we need

In [217]:
import requests
import random
import math
import pandas as pd
from tqdm.auto import tqdm
from IPython.display import Image, display, HTML
In [15]:
API_KEY = '[YOUR API KEY]'
API_URL = 'http://api.digitalnz.org/v3/records.json'

Define some functions

In [197]:
def get_total(**kwargs):
    '''
    Get the total number of results from a query built using supplied kwargs as parameters.
    '''
    params = {
        'api_key': API_KEY,
        'per_page': 0
    }
    params = add_kwargs_to_params(params, kwargs)
    data = get_records(params)
    return data['search']['result_count']
    
def get_records(params):
    '''
    Get records from a search using the supplied parameters.
    '''
    response = requests.get(API_URL, params=params)
    return response.json()

def add_kwargs_to_params(params, kwargs):
    '''
    Add kwargs to query parameters.
    '''
    for k, v in kwargs.items():
        if k == 'text':
            params[k] = v
        else:
            params[f'and[{k}][]'] = v
    return params

def get_random_result(**kwargs):
    '''
    Select a random result from a query built using supplied kwargs as parameters.
    '''
    total = get_total(**kwargs)
    if total == 0:
        # Nothing matches the query, so there's no page to choose from
        return None
    pages = math.ceil(total / 100)
    page = random.randint(1, pages)
    params = {
        'api_key': API_KEY,
        'per_page': 100,
        'page': page
    }
    params = add_kwargs_to_params(params, kwargs)
    data = get_records(params)
    try:
        record = random.choice(data['search']['results'])
    except (KeyError, IndexError):
        # The response has no results (a missing key or an empty page)
        record = None
    return record

def get_facets(facet, **kwargs):
    '''
    Get values for the specified facet.
    '''
    params = {
        'facets': [facet],
        'api_key': API_KEY,
        'per_page': 0,
        'facets_per_page': 350 # 350 is the max
    }
    params = add_kwargs_to_params(params, kwargs)
    data = get_records(params)
    total = data['search']['result_count']
    facets = data['search']['facets'][facet]
    return (total, facets)

def get_random_facet(facets):
    '''
    Select a facet value from a list of facets, using the facet counts as weights.
    '''
    values = list(facets.items())
    weights = list(facets.values())
    # Returns a (value, count) tuple
    return random.choices(values, weights=weights, k=1)[0]

def select_facet(facet, **kwargs):
    '''
    Apply the specified facet to a query, if the total results are less than 1,000,000 then get a random result.
    '''
    _, facets = get_facets(facet, **kwargs)
    value = get_random_facet(facets)
    print(f'  * {facet.title()}: {value[0]}')
    kwargs[facet] = value[0]
    if value[1] < 1000000:
        record = get_random_result(**kwargs)
    else:
        record = None
    return (record, kwargs)
    
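def get_random_record(**kwargs):
    '''
    Get a random record, applying any supplied filters.
    If there are too many results, add randomly selected category, content_partner, and year filters until the pool is small enough.
    Returns the string 'Too many' if no record could be selected.
    '''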
    print('Additional filters:')
    if kwargs:
        total = get_total(**kwargs)
        if total < 1000000:
            print('  * None')
            return get_random_result(**kwargs)
    for facet in ['category', 'content_partner', 'year']:
        if facet not in kwargs:
            record, kwargs = select_facet(facet, **kwargs)
            if record:
                return record
    return 'Too many'

A random record

In [226]:
# Get a record
record = get_random_record()

# Display the results
display(HTML(f'\n<h4>{record["title"]}</h4>'))
if record['description']:
    display(HTML(f'<p>{record["description"]}</p>'))
display(HTML(f'<a href="{record["landing_url"]}">More...</a>'))
Additional filters:
  * Category: Newspapers
  * Content_Partner: National Library of New Zealand
  * Year: 1878

SAILED. (Nelson Evening Mail, 06 June 1878)

A random newspaper article

In [198]:
# Get a record
record = get_random_record(category='Newspapers')

# Display the results
display(HTML(f'\n<h4>{record["title"]}</h4>'))
display(HTML(f'<p>{record["fulltext"][:500]}...</p>'))
display(HTML(f'<a href="{record["landing_url"]}">More...</a>'))
Additional filters:
  * Content_Partner: National Library of New Zealand
  * Year: 1902

Page 3 Advertisements Column 9 (Wanganui Herald, 30 January 1902)

buju aiitotiteiisciiu d i ic k tt thibbpaoej k besbbved i i cummins co advertisement v telspkons 70 box 11 a c lennard has just received 2 tanks 9 english bftcuiti and ib oases is preserving jaw sole agent for pubqti natorafc mineral water a c l victoria avenne wanganni usefal presents fob birthdays marriages ahx other great bvjfiotst tarn w m rooxn at bhjlts pbaoticawatchsuxbb aju jewelilkb avenue opposite english chorea ahandsomb stock of odd brooohes in ail the latest dewbom and quality from ...

A random newspaper article from a specific decade

In [202]:
# Get a record
record = get_random_record(category='Newspapers', decade='1920')

# Display the results
display(HTML(f'<h4>{record["title"]}</h4>'))
display(HTML(f'<p>{record["fulltext"][:500]}...</p>'))
display(HTML(f'<a href="{record["landing_url"]}">More...</a>'))
Additional filters:
  * Content_Partner: National Library of New Zealand
  * Year: 1924

TARANAKI ASSOCIATION (Hawera & Normanby Star, 01 November 1924)

taranaki association at the annual meeting of the association the following officers were elected for the ensuing year president mr c b webster vicepresidents messrs a g wallace and o j dickie hon secretary and treasurer mr e g foden auditor mr s e shaw delegate to the new zealand association mr e d salmo ml management committee messrs j o nicholson waverley n balharry eltham f j shearer park e h young sportsdale and j b wilson new plymouth selection committee messrs wallace dickie and webster a...

A random article from a specific newspaper

The newspaper title is stored in collection_title and publisher, but you don't seem to be able to filter using either of these, so we'll just do a text search for the title instead. This may mean we get results that are not actually from this newspaper...
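
One way to guard against stray matches is to check the title of the returned record. This is only a sketch: it assumes that newspaper article titles end with `(Newspaper Name, DD Month YYYY)`, which is the pattern in the results shown in this notebook, and `newspaper_from_title` and `is_from_newspaper` are hypothetical helpers rather than anything provided by the API.

```python
import re

def newspaper_from_title(title):
    """Extract the newspaper name from a title ending in '(Name, DD Month YYYY)'."""
    match = re.search(r'\(([^()]+), \d{1,2} \w+ \d{4}\)\s*$', title)
    return match.group(1) if match else None

def is_from_newspaper(record, newspaper):
    """Check whether a record's title names the expected newspaper."""
    return newspaper_from_title(record.get('title') or '') == newspaper

record = {'title': 'GRAHAMSTOWN. 7th August. (Evening Post, 08 August 1874)'}
print(newspaper_from_title(record['title']))
print(is_from_newspaper(record, 'Evening Post'))
```

You could call `get_random_record()` again whenever the check fails, though titles that don't follow this pattern would always be rejected.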

In [203]:
# Get a record
record = get_random_record(category='Newspapers', text='Evening Post')

# Display the results
display(HTML(f'<h4>{record["title"]}</h4>'))
display(HTML(f'<p>{record["fulltext"][:500]}...</p>'))
display(HTML(f'<a href="{record["landing_url"]}">More...</a>'))
Additional filters:
  * Content_Partner: National Library of New Zealand
  * Year: 1874

GRAHAMSTOWN. 7th August. (Evening Post, 08 August 1874)

grahamstown 7th august te hira and other chiefs have not i yetf been induced to go to the tiatitd meeting at whakatuwai v mi vogela speech has caused great astonishment but is very favorably received the advertiser urges him not to delay but to go to the country at once with a grand scheme of constitutional reform for abolishing pro vinces in both island and not in the north only 1 the tribulers in the rose and shamrock claim lodged siity minces of gold in the bank yesterday in assaying it to-da...

A random item from a specific content partner

In [227]:
# Get a record
record = get_random_record(content_partner='Puke Ariki')

# Display the results
display(HTML(f'<h4>{record["title"]}</h4>'))
if 'thumbnail_url' in record and record['thumbnail_url']:
    display(Image(url=record['thumbnail_url'], format='jpg'))
display(HTML(f'<p>{record["description"]}</p>'))
display(HTML(f'<a href="{record["landing_url"]}">More...</a>'))
Additional filters:
  * None

St Josephs Parish, Exterior

Exterior of a building and grounds.; Black and White 120 Roll Film/Black and White Negative/Photographic Negative

A random open image

In [205]:
# Get a record
record = get_random_record(category='Images', usage='Use commercially')

# Display the results
display(HTML(f'<h4>{record["title"]}</h4>'))
# Prefer the large thumbnail, falling back to the standard one
image_url = record.get('large_thumbnail_url') or record.get('thumbnail_url')
if image_url:
    display(Image(url=image_url, format='jpg'))
if record['description']:
    display(HTML(f'<p>{record["description"]}</p>'))
display(HTML(f'<a href="{record["landing_url"]}">More...</a>'))
Additional filters:
  * Content_Partner: Auckland Libraries

Corp. F. T. Cameron, of Whangarei, Died of pneumonia.

1/4 length portrait of Corporal F. T. Cameron, of Whangarei, Died of pneumonia.

Coverage of facets

To decide which facets to use in making random selections, I looked to see how widely they were applied.

In [206]:
def check_facet(facet):
    params = {
        'facets': [facet],
        'api_key': API_KEY,
        'per_page': 0,
        'facets_per_page': 350
    }
    data = get_records(params)
    try:
        facets = data['search']['facets'][facet]
    except KeyError:
        print('Not a facet!')
    else:
        df = pd.DataFrame.from_dict(facets, orient='index')
        print(f'Number of facets: {df.shape[0]}')
        print(f'Number of records: {df[0].sum():,}')

Let's see how many records are returned by a blank search.

In [209]:
print(f'There are {get_total():,} records in total...')
There are 31,643,262 records in total...

Let's compare the total from a blank search to the number of records with values for each available facet.

Below are the available facets listed in the API docs. I've added usage. Note that display_collection is not actually a facet despite what the docs say. Also, you can only get a maximum of 350 facet values in one request, so if it says there are 350 facet values, there might actually be a lot more. (You can harvest the full set of values using the code below.)

I'm assuming that some records have multiple values for collection and usage – hence the high number of records. It looks like category, content_partner, and year have the best coverage. However, we're only looking at the first 350 values for year.

In [212]:
for facet in ['category', 'display_collection', 'creator', 'placename', 'year', 'decade', 'century', 'language', 'content_partner', 'rights', 'collection', 'usage']:
    print(f'\n{facet}')
    check_facet(facet)
category
Number of facets: 18
Number of records: 31,656,240

display_collection
Not a facet!

creator
Number of facets: 350
Number of records: 2,133,338

placename
Number of facets: 350
Number of records: 25,007,163

year
Number of facets: 350
Number of records: 30,867,985

decade
Number of facets: 260
Number of records: 30,594,835

century
Number of facets: 71
Number of records: 30,516,639

language
Number of facets: 151
Number of records: 24,611,541

content_partner
Number of facets: 211
Number of records: 31,645,551

rights
Number of facets: 350
Number of records: 29,261,370

collection
Number of facets: 350
Number of records: 59,572,442

usage
Number of facets: 5
Number of records: 80,913,278
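
The figures above can be compared more easily as ratios against the blank search total. This is a rough sketch using numbers copied from the output above. The `summarise` helper is hypothetical, and the 1% margin used to flag facets that probably hold multiple values per record is an arbitrary assumption.

```python
BLANK_TOTAL = 31_643_262  # blank search total reported in this notebook
FACET_CAP = 350           # maximum number of facet values returned by one request

# facet: (number of values returned, sum of record counts), copied from the output above
results = {
    'category': (18, 31_656_240),
    'creator': (350, 2_133_338),
    'placename': (350, 25_007_163),
    'year': (350, 30_867_985),
    'decade': (260, 30_594_835),
    'century': (71, 30_516_639),
    'language': (151, 24_611_541),
    'content_partner': (211, 31_645_551),
    'rights': (350, 29_261_370),
    'collection': (350, 59_572_442),
    'usage': (5, 80_913_278),
}

def summarise(results, blank_total=BLANK_TOTAL, cap=FACET_CAP):
    """Rank facets by the ratio of summed facet counts to the blank search total."""
    rows = []
    for facet, (n_values, n_records) in results.items():
        rows.append({
            'facet': facet,
            'ratio': round(n_records / blank_total, 3),
            # Hitting the cap means more values may exist than were returned
            'truncated': n_values >= cap,
            # Well above 1 suggests some records carry several values (assumed 1% margin)
            'multi_valued': n_records > blank_total * 1.01,
        })
    return sorted(rows, key=lambda row: row['ratio'], reverse=True)

summary = summarise(results)
for row in summary:
    print(row)
```

A ratio close to 1 with no flags set is the best sign of good coverage. A truncated facet like year might cover more records than its ratio suggests.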

Let's get all the values for year.

In [218]:
def harvest_facet_values(facet):
    '''
    Harvest all the values of the specified facet, 100 at a time.
    '''
    facets = {}
    more = True
    page = 1
    params = {
        'api_key': API_KEY,
        'per_page': 0,
        'facets': facet,
        'facets_per_page': 100,
    }
    with tqdm() as pbar:
        while more:
            params['facets_page'] = page
            data = get_records(params)
            values = data['search']['facets'][facet]
            if values:
                facets.update(values)
                # Advance the progress bar by the number of values actually returned
                pbar.update(len(values))
                page += 1
            else:
                more = False
    return facets
In [219]:
facets = harvest_facet_values('year')

In [220]:
df_years = pd.DataFrame.from_dict(facets, orient='index')

What's the number of facet values?

In [223]:
df_years.shape[0]
Out[223]:
918

How many records have a value for year (assuming that there's only one value per record)?

In [224]:
df_years[0].sum()
Out[224]:
30870087

Created by Tim Sherratt for the GLAM Workbench.