Create a list of Trove's digitised journals

Everyone knows about Trove's newspapers, but there's also a growing collection of digitised journals available in the journals zone. They're not easy to find, however, which is why I created the Trove Titles web app.

This notebook uses the Trove API to harvest metadata relating to digitised journals – or more accurately, journals that are freely available online in a digital form. This includes some born digital publications that are available to view in formats like PDF and MOBI, but excludes some digital journals that have access restrictions.

The search strategy to find digitised (and digital) journals takes advantage of the fact that Trove's digitised resources (excluding the newspapers) all have an identifier that includes the string nla.obj. So we start by searching in the journals zone for records that include nla.obj and have the format 'Periodical'. By specifying 'Periodical' we exclude individual articles from digitised journals.
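As a minimal sketch, the search described above boils down to a small set of query parameters sent to the Trove API (assuming the v2 `/result` endpoint; `'YOUR_API_KEY'` is a placeholder):

```python
import urllib.parse

# The core search parameters for finding digitised journals
params = {
    'q': '"nla.obj-"',         # digitised resources all have nla.obj identifiers
    'zone': 'article',         # the journals zone
    'l-format': 'Periodical',  # journals only, not individual articles
    'encoding': 'json',
    'key': 'YOUR_API_KEY',     # placeholder, not a real key
}
query_string = urllib.parse.urlencode(params)
print(query_string)
```

The full harvesting code below adds a few more parameters (like `bulkHarvest` and `include=links`), but this is the heart of the search.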

Then it's just a matter of looping through all the results and checking to see if a record includes a fulltext link to a digital copy. If it does it gets saved.
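The link check amounts to scanning a work's list of identifiers for a fulltext link that points at an `nla.obj` identifier. A sketch with made-up sample records:

```python
# Hypothetical sample of the 'identifier' links returned for a work record
links = [
    {'linktype': 'notonline', 'value': 'https://example.com/catalogue/123'},
    {'linktype': 'fulltext', 'value': 'https://nla.gov.au/nla.obj-862209995'},
]

# Keep the work only if there's a fulltext link containing 'nla.obj'
fulltext_url = None
for link in links:
    if link['linktype'] == 'fulltext' and 'nla.obj' in link['value']:
        fulltext_url = link['value']

print(fulltext_url)  # https://nla.gov.au/nla.obj-862209995
```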

You can see the results in this CSV file. Obviously you could extract additional metadata from each record if you wanted to.

The default fields are:

  • fulltext_url – the url of the landing page of the digital version of the journal
  • title – the title of the journal
  • trove_id – the 'nla.obj' part of the fulltext_url, a unique identifier for the digital journal
  • trove_url – url of the journal's metadata record in Trove
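For example, extracting the trove_id from a fulltext_url is a simple regular expression match (assuming a landing-page URL of the usual nla.gov.au form):

```python
import re

# Pull the 'nla.obj' identifier out of a digital journal's landing page URL
fulltext_url = 'https://nla.gov.au/nla.obj-862209995'
trove_id = re.search(r'(nla\.obj\-\d+)', fulltext_url).group(1)
print(trove_id)  # nla.obj-862209995
```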

I've used this list to harvest all the OCRd text from digitised journals.

In [1]:
# Let's import the libraries we need.
import requests
import pandas as pd
from bs4 import BeautifulSoup
import time
import json
import os
import re
from tqdm import tqdm_notebook
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from slugify import slugify
from IPython.display import display, FileLink

s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))

Add your Trove API key

You can get a Trove API key by following these instructions.

In [2]:
# Add your Trove API key between the quotes
api_key = ''

Define some functions to do the work

In [3]:
def get_total_results(params):
    '''
    Get the total number of results for a search.
    '''
    these_params = params.copy()
    these_params['n'] = 0
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)
    data = response.json()
    return int(data['response']['zone'][0]['records']['total'])


def get_fulltext_url(links):
    '''
    Loop through the identifiers to find a link to the digital version of the journal.
    '''
    url = None
    for link in links:
        if link['linktype'] == 'fulltext' and 'nla.obj' in link['value']:
            url = link['value']
    return url


def get_titles():
    '''
    Harvest metadata about digitised journals.
    With a little adaptation, this basic pattern could be used to harvest
    other types of works from Trove.
    '''
    url = 'https://api.trove.nla.gov.au/v2/result'
    titles = []
    params = {
        'q': '"nla.obj-"',
        'zone': 'article',
        'l-format': 'Periodical', # Journals only, not journal articles
        'include': 'links',
        'bulkHarvest': 'true', # Needed to maintain a consistent order across requests
        'key': api_key,
        'n': 100,
        'encoding': 'json'
    }
    start = '*'
    total = get_total_results(params)
    with tqdm_notebook(total=total) as pbar:
        while start:
            params['s'] = start
            response = s.get(url, params=params)
            data = response.json()
            # If there's a nextStart value then we use it to request the next page of results
            try:
                start = data['response']['zone'][0]['records']['nextStart']
            except KeyError:
                start = None
            for work in data['response']['zone'][0]['records']['work']:
                # Check to see if there's a link to a digital version
                try:
                    fulltext_url = get_fulltext_url(work['identifier'])
                except KeyError:
                    pass
                else:
                    if fulltext_url:
                        trove_id = re.search(r'(nla\.obj\-\d+)', fulltext_url).group(1)
                        # Get basic metadata
                        # You could add more work data here
                        # Check the Trove API docs for work record structure
                        title = {
                            'title': work['title'],
                            'fulltext_url': fulltext_url,
                            'trove_url': work['troveUrl'],
                            'trove_id': trove_id
                        }
                        titles.append(title)
            pbar.update(100)
            time.sleep(0.2)
    return titles

Run the harvest

In [ ]:
titles = get_titles()

Convert to a dataframe and save as a CSV file

Let's convert the Python list to a Pandas DataFrame, have a peek inside, then save in CSV format.

In [5]:
df = pd.DataFrame(titles)
df.head()

   fulltext_url  title                                              trove_id            trove_url
0  …             Laws, etc. (Acts of the Parliament)                nla.obj-54127737    …
1  …             Stonequarry journal (Online)                       nla.obj-862209995   …
2  …             Report of the Auditor-General upon the financi...  nla.obj-1371947658  …
3  …             Report of the Auditor-General upon the stateme...  nla.obj-1270248615  …
4  …             Review of activities / Department of Immigrati...  nla.obj-837116187   …
In [6]:
# How many journals are there?
df.shape

(2620, 4)
In [7]:
# Save as CSV and display a download link
df.to_csv('digital-journals.csv', index=False)
display(FileLink('digital-journals.csv'))
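If you want to work with the harvested data later, you can simply read the CSV back into a dataframe. A minimal round-trip sketch (using an in-memory buffer and a made-up row rather than the real file):

```python
import io
import pandas as pd

# A single sample row standing in for the harvested titles
sample = pd.DataFrame([
    {'title': 'Stonequarry journal (Online)', 'trove_id': 'nla.obj-862209995'},
])

# Write to CSV and read it back, exactly as you would with a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded)
```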

Created by Tim Sherratt.

Work on this notebook was supported by the Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab.