Get a random newspaper article from Trove

Changes to the Trove API mean that the techniques I've previously used to select resources at random will no longer work. This notebook provides one alternative.

I wanted something that would work efficiently, but would also expose as much of the content as possible. Applying multiple facets together with a randomly-generated query seems to do a good job of getting the result set down to 100 or fewer (the maximum available from a single API call). This should mean that most of the newspaper articles are reachable, but it's a bit hard to quantify.
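The narrowing strategy can be sketched offline, without calling the API. In this minimal sketch, `fake_total` is a hypothetical stand-in for a Trove request that reports the size of the result set, and the facet values and the shrink factor of 30 are invented purely for illustration.

```python
import random

# Hypothetical facet values; the real ones come back from the Trove API.
FACET_VALUES = {
    'decade': ['190', '191', '192'],
    'category': ['Article', 'Advertising'],
    'illustrated': ['true', 'false'],
}


def fake_total(applied):
    '''Stand-in for an API call: each applied facet cuts the total by a factor of 30.'''
    return 1_000_000 // (30 ** len(applied))


def narrow(max_results=100, rng=random):
    '''Apply randomly chosen facet values until the result set fits in one request.'''
    applied = {}
    remaining = list(FACET_VALUES)
    total = fake_total(applied)
    while total > max_results and remaining:
        facet = remaining.pop()
        applied[facet] = rng.choice(FACET_VALUES[facet])
        total = fake_total(applied)
    return applied, total


applied, total = narrow()
print(applied, total)
```

With real data the loop can also run out of facets before the total drops to 100 or fewer, in which case only the first 100 results are candidates for selection.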

Thanks to Mitchell Harrop for suggesting I could use randomly selected stopwords as queries. I've supplemented the stopwords with letters and digits, and together they seem to do a good job of applying an initial filter and mixing up the relevance ranking.
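As a rough illustration of the query pool, the sketch below combines a few sample stopwords with single letters and digits, then picks one at random. The names `SAMPLE_STOPWORDS`, `QUERY_TERMS`, and `random_query` are hypothetical; the notebook itself loads its list from `stopwords.json`, which is assumed here to be a flat list of strings.

```python
import random
import string

# A few sample stopwords; the notebook loads its full list from stopwords.json.
SAMPLE_STOPWORDS = ['the', 'and', "she's", 'had', 'what']

# Supplement the stopwords with single letters and digits.
QUERY_TERMS = SAMPLE_STOPWORDS + list(string.ascii_lowercase) + list(string.digits)


def random_query(rng=random):
    '''Pick a term at random and wrap it in double quotes, as the notebook's code does.'''
    return f'"{rng.choice(QUERY_TERMS)}"'


print(random_query())
```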

As you can see from the examples below, you can supply any of the facets available in the newspaper zone – for example: state, title, year, illtype, category.
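Each facet supplied as a keyword argument is sent to the API as a parameter with an `l-` prefix. A minimal sketch of that mapping, where `build_facet_params` is a hypothetical helper rather than part of the notebook:

```python
def build_facet_params(**facets):
    '''Turn keyword arguments into Trove-style `l-` facet parameters.'''
    return {f'l-{name}': value for name, value in facets.items()}


params = build_facet_params(state='Tasmania', year='1930')
print(params)
# {'l-state': 'Tasmania', 'l-year': '1930'}
```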

In [1]:
import random
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import json

s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))

with open('stopwords.json', 'r') as json_file:
    STOPWORDS = json.load(json_file)
In [2]:
API_KEY = 'YOUR API KEY'
API_URL = 'http://api.trove.nla.gov.au/v2/result'
In [3]:
def get_random_facet_value(params, facet):
    '''
    Get values for the supplied facet and choose one at random.
    '''
    these_params = params.copy()
    these_params['facet'] = facet
    response = s.get(API_URL, params=these_params)
    data = response.json()
    try:
        values = [t['search'] for t in data['response']['zone'][0]['facets']['facet']['term']]
    except TypeError:
        return None
    return random.choice(values)

    
def get_total_results(params):
    '''
    Get the total number of results for the supplied search parameters.
    '''
    response = s.get(API_URL, params=params)
    data = response.json()
    total = int(data['response']['zone'][0]['records']['total'])
    return total


def get_random_article(query=None, **kwargs):
    '''
    Get a random article.
    The kwargs can be any of the available facets, such as 'state', 'title', 'illtype', 'year'.
    '''
    total = 0
    applied_facets = []
    facets = ['month', 'year', 'decade', 'word', 'illustrated', 'category', 'title']
    tries = 0
    params = {
        'zone': 'newspaper',
        'encoding': 'json',
        # Note that keeping n at 0 until we've filtered the result set speeds things up considerably
        'n': '0',
        # Uncomment these if you need more than the basic data
        #'reclevel': 'full',
        #'include': 'articleText',
        'key': API_KEY
    }
    if query:
        params['q'] = query
    # If there's no query supplied then use a random stopword to mix up the results
    else:
        random_word = random.choice(STOPWORDS)
        params['q'] = f'"{random_word}"'
    # Apply any supplied facets
    for key, value in kwargs.items():
        params[f'l-{key}'] = value
        applied_facets.append(key)
    # Remove any facets that have already been applied from the list of available facets
    facets[:] = [f for f in facets if f not in applied_facets]
    total = get_total_results(params)
    # If our randomly selected stopword has produced no results,
    # keep trying with new queries until we get some (give up after 10 tries).
    # A user-supplied query is never swapped out, so there's no point retrying it.
    while total == 0 and not query and tries < 10:
        random_word = random.choice(STOPWORDS)
        params['q'] = f'"{random_word}"'
        total = get_total_results(params)
        tries += 1
    # Apply facets one at a time until we have 100 or fewer results, or we run out of facets
    while total > 100 and len(facets) > 0:
        # Get the next facet
        facet = facets.pop()
        # Set the facet to a randomly selected value
        params[f'l-{facet}'] = get_random_facet_value(params, facet)
        total = get_total_results(params)
        #print(total)
        #print(response.url)
    # If we've ended up with some results, then select one (of the first 100) at random
    if total > 0:
        params['n'] = '100'
        response = s.get(API_URL, params=params)
        data = response.json()
        article = random.choice(data['response']['zone'][0]['records']['article'])
        return article

Get any old article...

In [4]:
get_random_article()
Out[4]:
{'id': '254828266',
 'url': '/newspaper/254828266',
 'heading': 'monday APRIL 24',
 'category': 'Detailed lists, results, guides',
 'title': {'id': '1500',
  'value': 'The Bananacoast Opinion (Coffs Harbour, NSW : 1973 - 1978)'},
 'date': '1978-04-19',
 'page': 7,
 'pageSequence': 7,
 'relevance': {'score': '0.5012544', 'value': 'likely to be relevant'},
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/254828266?searchTerm=%22she%27s%22'}

Get a random article about pademelons

In [8]:
get_random_article(query='pademelon')
Out[8]:
{'id': '204735164',
 'url': '/newspaper/204735164',
 'heading': 'MELBOURNE. June 28.',
 'category': 'Article',
 'title': {'id': '970', 'value': 'The Millicent Times (SA : 1891 - 1905)'},
 'date': '1893-07-15',
 'page': 4,
 'pageSequence': 4,
 'relevance': {'score': '24.534437', 'value': 'very relevant'},
 'snippet': 'Each successive day sees a new scheme for settling the people on the land or at least brings amendments of earlier proposals. A great truth is struck by a “Farmer” writing',
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/204735164?searchTerm=pademelon'}

Get a random article from Tasmania

In [9]:
get_random_article(state='Tasmania')
Out[9]:
{'id': '65216403',
 'url': '/newspaper/65216403',
 'heading': 'Advertising',
 'category': 'Advertising',
 'title': {'id': '116',
  'value': 'Emu Bay Times and North West and West Coast Advocate (Tas. : 1897 - 1899)'},
 'date': '1898-02-01',
 'page': 2,
 'pageSequence': 2,
 'relevance': {'score': '0.030104881', 'value': 'may have relevance'},
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/65216403?searchTerm=%22had%22'}

Get a random article from the Sydney Morning Herald (title id 35), limited to the 'Article' category

In [10]:
get_random_article(title='35', category='Article')
Out[10]:
{'id': '15981248',
 'url': '/newspaper/15981248',
 'heading': 'CANADIAN ELECTIONS. GOVERNMENT ROUTED. OTTAWA, Dec. 7.',
 'category': 'Article',
 'title': {'id': '35',
  'value': 'The Sydney Morning Herald (NSW : 1842 - 1954)'},
 'date': '1921-12-09',
 'page': 9,
 'pageSequence': 9,
 'relevance': {'score': '645.6865', 'value': 'very relevant'},
 'snippet': 'Final election returns show that the Liberals overwhelmed the Government, taking 121 seats against the Conservatives 53, Farmers 58, and Independents 2. The Premier, Mr.',
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/15981248?searchTerm=%227%22'}

Get a random illustrated article

In [11]:
get_random_article(illustrated='true')
Out[11]:
{'id': '258856101',
 'url': '/newspaper/258856101',
 'heading': 'FOOTBALL The Australian Code DAMPIER ASSOCIATION PREMIERSHIP AND BURTON CUP THE FINAL Mukinbudin Victors NUNGARIN DISAPPOINT  Another Big Attendance',
 'category': 'Article',
 'title': {'id': '1656', 'value': 'The Nungarin Standard (WA : 1934 - 1939)'},
 'date': '1935-09-12',
 'page': 1,
 'pageSequence': 1,
 'status': 'coming soon',
 'relevance': {'score': '0.47818616', 'value': 'likely to be relevant'},
 'snippet': 'In brilliant sunshine and little wind, Mukinbudin and Nungarin contested the final match on Saturday afternoon last, when the former registered'}
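Not every record carries the same keys: the result above has a `status` of 'coming soon' and no `troveUrl`, and several of the other results have no `snippet`. A minimal sketch of reading a record defensively, where `summarise` is a hypothetical helper and the sample record is a trimmed copy of the output above:

```python
def summarise(article):
    '''Pull out a few fields, tolerating keys that are sometimes missing.'''
    return {
        'date': article['date'],
        'heading': article['heading'],
        'newspaper': article['title']['value'],
        # These two keys are absent from some of the results shown in this notebook.
        'link': article.get('troveUrl', 'no link yet'),
        'snippet': article.get('snippet', ''),
    }


sample = {
    'id': '258856101',
    'heading': 'FOOTBALL The Australian Code',
    'title': {'id': '1656', 'value': 'The Nungarin Standard (WA : 1934 - 1939)'},
    'date': '1935-09-12',
    'status': 'coming soon',
}
print(summarise(sample))
```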

Get a random illustrated advertisement from the Australian Women's Weekly

In [12]:
get_random_article(title='112', illustrated='true', category='Advertising')
Out[12]:
{'id': '51197885',
 'url': '/newspaper/51197885',
 'heading': 'Advertising',
 'category': 'Advertising',
 'title': {'id': '112',
  'value': "The Australian Women's Weekly (1933 - 1982)"},
 'date': '1980-03-19',
 'page': 66,
 'pageSequence': 66,
 'relevance': {'score': '1.054913', 'value': 'likely to be relevant'},
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/51197885?searchTerm=%22needn%22'}

Get a random cartoon

In [13]:
get_random_article(illtype='Cartoon')
Out[13]:
{'id': '132061740',
 'url': '/newspaper/132061740',
 'heading': 'The Bookworm Criticises',
 'category': 'Detailed lists, results, guides',
 'title': {'id': '505', 'value': 'Sunday Times (Sydney, NSW : 1895 - 1930)'},
 'date': '1930-04-13',
 'page': 3,
 'pageSequence': '3 S',
 'relevance': {'score': '0.30180344', 'value': 'likely to be relevant'},
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/132061740?searchTerm=%22what%22'}

Get a random article from 1930

In [14]:
get_random_article(year='1930')
Out[14]:
{'id': '250504028',
 'url': '/newspaper/250504028',
 'heading': 'TRACES OF ANCIENT CIVILISATION IN NEW GUINEA',
 'category': 'Article',
 'title': {'id': '1375',
  'value': 'Papuan Courier (Port Moresby, Papua New Guinea : 1917 - 1942)'},
 'date': '1930-08-15',
 'page': 3,
 'pageSequence': 3,
 'relevance': {'score': '23.659304', 'value': 'very relevant'},
 'snippet': '“Sea Nomad”: E. W. P. Chinnery Government anthropologist, states, that circles of standing stone resembling the relics left by the',
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/250504028?searchTerm=%22won%27t%22'}

Get a random article tagged 'poem'

In [15]:
get_random_article(publictag='poem')
Out[15]:
{'id': '4221723',
 'url': '/newspaper/4221723',
 'heading': 'THE KANGAROOS: A FABLE.',
 'category': 'Article',
 'title': {'id': '19',
  'value': 'The Hobart Town Courier (Tas. : 1827 - 1839)'},
 'date': '1828-07-19',
 'page': 2,
 'pageSequence': 2,
 'relevance': {'score': '1.035787', 'value': 'likely to be relevant'},
 'snippet': 'A pair of married kangaroos (The case is oft a human one too,) Were greatly puzzled once to choose, A trade to put their eldest son to,',
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/4221723?searchTerm=%22again%22'}

Speed test

In [16]:
%%timeit
get_random_article()
The slowest run took 4.73 times longer than the fastest. This could mean that an intermediate result is being cached.
2.87 s ± 1.62 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
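The `%%timeit` magic only works inside IPython or Jupyter. Outside a notebook, a similar measurement could be sketched with the standard library. Here `time_calls` and `stand_in` are hypothetical names, and `stand_in` simply sleeps so the example runs without an API key.

```python
import statistics
import time


def time_calls(func, runs=7):
    '''Call func `runs` times and return the mean and standard deviation in seconds.'''
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


def stand_in():
    '''Hypothetical placeholder; swap in get_random_article to time real requests.'''
    time.sleep(0.01)


mean, spread = time_calls(stand_in, runs=3)
print(f'{mean:.3f} s ± {spread:.3f} s per call')
```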

Created by Tim Sherratt for the GLAM Workbench.