Create a list of Trove's digitised journals

Everyone knows about Trove's newspapers, but there is also a growing collection of digitised journals available in the journals zone. They're not easy to find, however, which is why I created the Trove Titles web app.

This notebook uses the Trove API to harvest metadata relating to digitised journals – or more accurately, journals that are freely available online in a digital form. This includes some born digital publications that are available to view in formats like PDF and MOBI, but excludes some digital journals that have access restrictions.

The search strategy to find digitised (and digital) journals takes advantage of the fact that Trove's digitised resources (excluding the newspapers) all have an identifier that includes the string nla.obj. So we start by searching in the journals zone for records that include nla.obj and have the format 'Periodical'. By specifying 'Periodical' we exclude individual articles from digitised journals.

Then it's just a matter of looping through all the results and checking to see if a record includes a fulltext link to a digital copy. If it does, the record is saved.
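
The loop-and-check pattern described above can be tried offline, without an API key. This is only a minimal sketch: `FAKE_PAGES`, `fetch_page` and `harvest_fake_titles` are hypothetical stand-ins, and the titles and URLs are invented. Only the `nextStart` cursor and the `linktype` / `value` keys mirror the fields used by the harvesting code in this notebook.

```python
# Hypothetical, hand-made pages standing in for API responses.
FAKE_PAGES = {
    '*': {
        'nextStart': 'page2',
        'work': [
            {'title': 'Journal A', 'identifier': [
                {'linktype': 'fulltext', 'value': 'https://example.org/nla.obj-111'}]},
            {'title': 'Journal B', 'identifier': [
                {'linktype': 'notonline', 'value': 'https://example.org/other'}]},
        ],
    },
    'page2': {
        # No 'nextStart' key here, so this is the last page.
        'work': [
            {'title': 'Journal C'},  # No identifiers at all
            {'title': 'Journal D', 'identifier': [
                {'linktype': 'fulltext', 'value': 'https://example.org/nla.obj-222'}]},
        ],
    },
}


def fetch_page(cursor):
    """Hypothetical stand-in for a request to the API."""
    return FAKE_PAGES[cursor]


def harvest_fake_titles():
    """Follow the cursor through every page, keeping works with a digital copy."""
    saved = []
    cursor = '*'
    while cursor:
        page = fetch_page(cursor)
        for work in page['work']:
            for link in work.get('identifier', []):
                if link['linktype'] == 'fulltext' and 'nla.obj' in link['value']:
                    saved.append(work['title'])
                    break
        # A missing cursor means there are no more pages.
        cursor = page.get('nextStart')
    return saved


print(harvest_fake_titles())
```

Only the works with a fulltext link containing `nla.obj` are kept; records with no identifiers, or with other kinds of links, are skipped.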

You can see the results in this CSV file. Obviously you could extract additional metadata from each record if you wanted to.

The default fields are:

  • fulltext_url – the url of the landing page of the digital version of the journal
  • title – the title of the journal
  • trove_id – the 'nla.obj' part of the fulltext_url, a unique identifier for the digital journal
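  • trove_url – url of the journal's metadata record in Trove
  • nla_digitised – True if the fulltext link is labelled 'National Library of Australia digitised item', otherwise False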
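
To illustrate how a trove_id relates to a fulltext_url, here is a small sketch that uses the same regular expression as the harvesting code. The function name `extract_trove_id` and the example URLs are assumptions made for illustration; they are not taken from the harvested data.

```python
import re


def extract_trove_id(fulltext_url):
    """Return the 'nla.obj-<digits>' identifier from a URL, or None if absent."""
    match = re.search(r'(nla\.obj-\d+)', fulltext_url)
    return match.group(1) if match else None


# The host in these URLs is invented for illustration.
print(extract_trove_id('https://example.org/nla.obj-614066685'))
print(extract_trove_id('https://example.org/no-identifier-here'))
```

Returning None for URLs without an identifier avoids the AttributeError you would get from calling `.group(1)` on a failed match.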

I've used this list to harvest all the OCRd text from digitised journals.

In [8]:
# Let's import the libraries we need.
import requests
import pandas as pd
from bs4 import BeautifulSoup
import time
import json
import os
import re
from tqdm.auto import tqdm
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from slugify import slugify
from IPython.display import display, FileLink

s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))

Add your Trove API key

You can get a Trove API key by following these instructions.

In [9]:
# Add your Trove API key between the quotes
api_key = 'YOUR API KEY'

Define some functions to do the work

In [35]:
def get_total_results(params):
    '''
    Get the total number of results for a search.
    '''
    these_params = params.copy()
    # Ask for zero records, we only want the total
    these_params['n'] = 0
    response = s.get('', params=these_params)
    data = response.json()
    return int(data['response']['zone'][0]['records']['total'])

def get_fulltext_url(links):
    '''
    Loop through the identifiers to find a link to the digital version of the journal.
    Returns the url and a flag indicating whether the link is labelled as an
    NLA digitised item. Returns None if there's no matching link.
    '''
    nla_digitised = False
    for link in links:
        if link['linktype'] == 'fulltext' and 'nla.obj' in link['value']:
            url = link['value']
            if link['linktext'] == 'National Library of Australia digitised item':
                nla_digitised = True
            return url, nla_digitised

def get_titles():
    '''
    Harvest metadata about digitised journals.
    With a little adaptation, this basic pattern could be used to harvest
    other types of works from Trove.
    '''
    url = ''
    titles = []
    params = {
        # We can 'NOT' the format facet in the query
        'q': '"nla.obj-" NOT format:"Government publication" NOT format:Article',
        'zone': 'article',
        'l-format': 'Periodical', # Journals only
        'include': 'links',
        'bulkHarvest': 'true', # Needed to maintain a consistent order across requests
        'key': api_key,
        'n': 100,
        'encoding': 'json'
    }
    start = '*'
    total = get_total_results(params)
    with tqdm(total=total) as pbar:
        while start:
            params['s'] = start
            response = s.get(url, params=params)
            data = response.json()
            # If there's a nextStart value then we use it to request the next page of results
            try:
                start = data['response']['zone'][0]['records']['nextStart']
            except KeyError:
                start = None
            for work in data['response']['zone'][0]['records']['work']:
                # Check to see if there's a link to a digital version
                try:
                    fulltext_url, nla_digitised = get_fulltext_url(work['identifier'])
                except (KeyError, TypeError):
                    # No identifiers, or no link to a digital copy
                    pass
                else:
                    if fulltext_url:
                        trove_id = re.search(r'(nla\.obj\-\d+)', fulltext_url).group(1)
                        # Get basic metadata
                        # You could add more work data here
                        # Check the Trove API docs for work record structure
                        title = {
                            'title': work['title'],
                            'fulltext_url': fulltext_url,
                            'trove_url': work['troveUrl'],
                            'trove_id': trove_id,
                            'nla_digitised': nla_digitised
                        }
                        titles.append(title)
                pbar.update(1)
            # Pause briefly between requests
            time.sleep(0.2)
    return titles

Run the harvest

In [ ]:
titles = get_titles()

Convert to a dataframe and save as a CSV file

Let's convert the Python list to a Pandas DataFrame, have a peek inside, then save in CSV format.

In [37]:
df = pd.DataFrame(titles)
df.head()
title fulltext_url trove_url trove_id nla_digitised
0 The Silver stream songster nla.obj-614066685 False
1 Stonequarry journal (Online) nla.obj-862209995 False
2 Philament (Sydney, N.S.W. : Online) nla.obj-749489295 False
3 Journal (Queensland Law Society) nla.obj-2735787548 False
4 The Order of service for the annual festival t... nla.obj-657473276 True
In [38]:
# How many journals are there?
df.shape
(2730, 5)

For some reason there are a number of duplicates in the list, where multiple Trove work records point to the same digitised journal. We can display the duplicates like this.

In [41]:
# Show all the rows
pd.set_option('display.max_rows', None)
# Show dupes
df.loc[df.duplicated(subset=['trove_id'], keep=False)].sort_values(by=['trove_id', 'nla_digitised'])
title fulltext_url trove_url trove_id nla_digitised
1594 Wings (Sydney, N.S.W. : Online) nla.obj-1226109179 False
2379 Wings (Sydney, N.S.W.) nla.obj-1226109179 False
672 [Event programme] / Australian Festival of Cha... nla.obj-1252107366 False
1942 [Event programme] / Australian Festival of Cha... nla.obj-1252107366 False
632 The Brisbane Bushwalker nla.obj-1252263267 False
2314 The Brisbane bushwalker : monthly magazine of ... nla.obj-1252263267 False
1840 The Shadowland newsletter nla.obj-1771610885 False
2692 The Atlas of the solar system / Patrick Moore ... nla.obj-1771610885 False
1996 Photographic review of reviews (Online) nla.obj-389050007 False
2471 Photographic review of reviews nla.obj-389050007 True
1963 The Australian woman's mirror (Online) nla.obj-389050376 False
2482 The Australian woman's mirror nla.obj-389050376 True
352 U3A Sunshine e-Voice nla.obj-483060448 False
1127 U3A Sunshine e-Voice (Online) nla.obj-483060448 False
243 Newsletter (Genealogical Society of Queensland... nla.obj-485357469 False
1131 Bremer echoes (Online) nla.obj-485357469 False
1965 The New South Wales Post Office directory (Onl... nla.obj-518308191 False
1469 The New South Wales Post Office directory nla.obj-518308191 True
1964 Everyones (Sydney, N.S.W. : Online) nla.obj-522690001 False
2635 Everyones nla.obj-522690001 True
1971 Month (Sydney, N.S.W. : Online) nla.obj-597762006 False
2612 Month (Sydney, N.S.W.) nla.obj-597762006 True
1967 South-Asian register (Online) nla.obj-597769314 False
2613 The South-Asian register nla.obj-597769314 True
1969 Tegg's monthly magazine (Online) nla.obj-598267619 False
805 Tegg's monthly magazine nla.obj-598267619 True
1962 Rugby League news (Sydney, N.S.W. : Online) nla.obj-598579045 False
2653 Rugby League news (Sydney, N.S.W.) nla.obj-598579045 True
1972 Bookfellow (Sydney, N.S.W. : 1899 : Online) nla.obj-636005630 False
823 Bookfellow (Sydney, N.S.W. : 1899) nla.obj-636005630 True
1974 Australian magazine (Sydney, N.S.W. : 1899 : O... nla.obj-636091247 False
386 Australian magazine (Sydney, N.S.W. : 1899) nla.obj-636091247 True
324 Bulletin (Sydney, N.S.W. : 1880) nla.obj-68375465 False
39 The bulletin nla.obj-68375465 True
1966 Dun's gazette for New South Wales (Online) nla.obj-724008889 False
393 Dun's gazette for New South Wales nla.obj-724008889 True
1995 Weldon's matrimonial gazette (Online) nla.obj-744869630 False
549 Weldon's matrimonial gazette nla.obj-744869630 True
1986 New South Wales school magazine of literature ... nla.obj-748141557 False
9 The New South Wales school magazine of literat... nla.obj-748141557 True
1979 Australian magazine (Sydney, N.S.W. : 1838 : O... nla.obj-752101760 False
343 Australian magazine (Sydney, N.S.W. : 1838) nla.obj-752101760 True
1984 New South Wales magazine (1833 : Online) nla.obj-753076802 False
388 New South Wales magazine (1833) nla.obj-753076802 True
1994 Sydney coronal (Online) nla.obj-753479079 False
35 The Sydney coronal / by Charles M'Donald nla.obj-753479079 True
1978 Tegg's New South Wales pocket almanac and reme... nla.obj-754081281 False
2615 Tegg's New South Wales pocket almanac and reme... nla.obj-754081281 True
1977 Liberty (Sydney, N.S.W. : Online) nla.obj-760289107 False
2665 Liberty (Sydney, N.S.W.) nla.obj-760289107 True
1973 Sydney once a week magazine (Online) nla.obj-760335335 False
115 The Sydney once a week magazine nla.obj-760335335 True
1970 Literary news (Online) nla.obj-765536757 False
384 The Literary news : a review and magazine of f... nla.obj-765536757 True
1982 Australia and the bookfellow (Online) nla.obj-768289925 False
935 Australia and the bookfellow nla.obj-768289925 True
1981 Australia (Sydney, N.S.W. : 1907 : Online) nla.obj-768329137 False
936 Australia (Sydney, N.S.W. : 1907) nla.obj-768329137 True
1983 Bookfellow (Sydney, N.S.W. : 1911 : Online) nla.obj-768936943 False
937 Bookfellow (Sydney, N.S.W. : 1911) nla.obj-768936943 True
1985 New Triad (Online) nla.obj-788254980 False
2639 The New Triad nla.obj-788254980 True
1975 Triad (Sydney, N.S.W. : Online) nla.obj-875780662 False
2357 Triad (Sydney, N.S.W.) nla.obj-875780662 True
In [43]:
df.sort_values(by=['trove_id', 'nla_digitised']).drop_duplicates(subset='trove_id', keep='last').shape
(2698, 5)
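
The sort-then-drop step above keeps one row per trove_id. Because False sorts before True, `keep='last'` prefers the record flagged as NLA digitised where one exists. Here's a minimal sketch with invented rows (the titles and identifiers are made up for illustration):

```python
import pandas as pd

# Invented rows: two records share one trove_id, a third is unique.
sample = pd.DataFrame([
    {'title': 'Example journal (Online)', 'trove_id': 'nla.obj-1', 'nla_digitised': False},
    {'title': 'Example journal', 'trove_id': 'nla.obj-1', 'nla_digitised': True},
    {'title': 'Another journal', 'trove_id': 'nla.obj-2', 'nla_digitised': False},
])

# False sorts before True, so within each trove_id the NLA digitised record comes last.
deduped = (
    sample.sort_values(by=['trove_id', 'nla_digitised'])
    .drop_duplicates(subset='trove_id', keep='last')
)
print(deduped[['title', 'trove_id', 'nla_digitised']])
```

Note that `drop_duplicates` returns a new DataFrame, so you need to assign the result to a variable if you want to keep the deduplicated rows.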
In [40]:
# Save as CSV and display a download link
df.to_csv('digital-journals.csv', index=False)
display(FileLink('digital-journals.csv'))

Created by Tim Sherratt for the GLAM Workbench.

Work on this notebook was supported by the Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab.