Harvesting the text of digitised books (and ephemera)

This notebook harvests metadata and OCRd text from digitised books in Trove. There are three main steps:

  • Harvest metadata of digitised books using the Trove API
  • Extract the number of pages for each book from the Trove web interface (the number of pages is necessary to download the OCRd text)
  • Download the OCRd text for each book

It's not easy to identify all the digitised books with OCRd text in Trove. I'm starting with a search in the book zone for records that include the phrase "nla.obj" and are available online. This currently returns 57,759 results. However, this includes books where access to the digital copy is 'restricted'. I think these are mostly recent books submitted in digital form under legal deposit. I've filtered the 57,759 results to remove records where the digital copy is not available, and used the new fullTextInd index to filter out works without any OCRd text. This currently reduces the total to 34,289 results.

But some of those 34,289 results are actually parent records that contain multiple volumes or parts. When I find the number of pages in each book, I'm also checking to see if the record is a 'Multi volume book' and has child works. If it does, I add the child works to the list of books. After this stage there are 35,691 works.

However, not all of these 35,691 records have OCRd text. Parent records of multi volume works, and ebook formats like PDFs or MOBI, don't have individual pages, and therefore don't have any text to download. If we exclude works without pages, there are 24,658 works that might have some OCRd text to download.

But when you harvest the text files from these works, you find that some of them are empty. I've excluded these from the final dataset, leaving a grand total of 19,795 text files.

If you compare the number of downloaded files to the number in the CSV file that are identified as having OCRd text you'll notice a difference – 19,795 compared to 22,758. After a bit more poking around I realised that there are some duplicates in the list of works. This seems to be because more than one Trove metadata record can point to the same digitised work. For example, both this record and this record point to this digitised work. As they're not exact duplicates, I've left them in the results.

Looking through the downloaded text files, it's clear that we're getting ephemera (particularly pamphlets and posters) as well as books. There doesn't seem to be an obvious way to filter these out up front, but of course you could filter later by the number of pages.
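
Here's a minimal sketch of that kind of after-the-fact filtering. The DataFrame is a made-up stand-in for the harvested CSV, and the `min_pages` cut-off is an arbitrary, hypothetical value rather than a tested threshold for separating books from ephemera.

```python
import pandas as pd

# Toy stand-in for trove_digitised_books.csv (values are made up for illustration)
df = pd.DataFrame({
    'title': ['A poster', 'A pamphlet', 'A novel'],
    'pages': [1, 12, 246],
})

# min_pages is an arbitrary, hypothetical cut-off; tune it to your own needs
min_pages = 50
longer_works = df.loc[df['pages'] >= min_pages]
print(longer_works['title'].tolist())
```

A page threshold is a blunt instrument, so it's worth checking a sample of what falls on either side of the line.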

Here's the metadata I've harvested in CSV format:

This file includes the following columns:

  • children – pipe-separated ids of any child works
  • contributors – pipe-separated names of contributors
  • date – publication date
  • form – work format
  • fulltext_url – link to the digitised version
  • language – main language of the work
  • pages – number of pages
  • parent – id of parent work (if any)
  • rights – copyright status
  • text_downloaded – True/False, has any OCRd text been downloaded
  • text_file – file name of the downloaded OCRd text (if any)
  • title – title of the work
  • trove_id – unique identifier
  • url – link to the metadata record in Trove
  • volume – volume/part number
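
As a rough illustration of working with the pipe-separated columns, the sketch below loads two made-up rows and splits `contributors` and `children` into lists. The sample data and the `split_pipes` helper are hypothetical; only the column names come from the list above.

```python
import io
import pandas as pd

# Two made-up rows using a few of the columns described above
csv_text = (
    'trove_id,title,contributors,children,pages\n'
    'nla.obj-1,Example parent,"Smith, A.|Jones, B.",nla.obj-2|nla.obj-3,0\n'
    'nla.obj-2,Example volume,,,120\n'
)

# keep_default_na=False keeps empty cells as '' rather than NaN
df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)


def split_pipes(value):
    # Turn a pipe-separated string into a list; empty cells become []
    return value.split('|') if value else []


df['contributors_list'] = df['contributors'].apply(split_pipes)
df['children_list'] = df['children'].apply(split_pipes)
print(df[['trove_id', 'children_list']])
```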

Since the last harvest a lot of Commonwealth Parliamentary Papers have been digitised and added to the book zone. The code below harvests everything together, but for the sake of convenience, I've separated the text files into two collections on CloudStor. There are:

The previous harvest (April 2019) is also available for download (400MB zip file).

Setting things up

In [1]:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from tqdm import tqdm_notebook
from IPython.display import display, FileLink
import pandas as pd
import json
import re
import time
import os
from copy import deepcopy
from bs4 import BeautifulSoup
from slugify import slugify
import requests_cache
In [2]:
s = requests_cache.CachedSession()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))
In [3]:
# Add your Trove API key below
api_key = ''
In [7]:
params = {
    'key': api_key,
    'zone': 'book',
    'q': '"nla.obj" fullTextInd:y', # API v 2.1 added the full text indicator
    'bulkHarvest': 'true',
    'n': 100,
    'encoding': 'json',
    'l-availability': 'y',
    'include': 'links,workversions'
}

Harvest metadata using the API

In [8]:
def get_total_results():
    '''
    Get the total number of results for a search.
    '''
    these_params = params.copy()
    these_params['n'] = 0
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)
    data = response.json()
    return int(data['response']['zone'][0]['records']['total'])


def get_fulltext_url(links):
    '''
    Loop through the identifiers to find a link to the full text version of the book.
    '''
    url = None
    for link in links:
        if link['linktype'] == 'fulltext' and 'nla.obj' in link['value']:
            url = link['value']
            break
    return url

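def get_version_record(record):
    '''
    Find the version record whose metadataSource is 'ANL:DL'.
    Returns None if there isn't one.
    '''
    for version in record.get('version') or []:
        for version_record in version['record']:
            try:
                if version_record['metadataSource'].get('value') == 'ANL:DL':
                    return version_record
            except (AttributeError, TypeError, KeyError):
                pass
    return None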
                
def join_list(record, key):
    # A field may have a single value or an array.
    # If it's an array, join the values into a string.
    string_list = ''
    if record:
        value = record.get(key)
        if value:
            try:
                string_list = '|'.join(value)
            except TypeError:
                string_list = value
    return string_list


def harvest_books():
    '''
    Harvest metadata relating to digitised books.
    '''
    books = []
    total = get_total_results()
    start = '*'
    these_params = params.copy()
    with tqdm_notebook(total=total) as pbar:
        while start:
            these_params['s'] = start
            response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)
            data = response.json()
            # The nextStart parameter is used to get the next page of results.
            # If there's no nextStart then it means we're on the last page of results.
            try:
                start = data['response']['zone'][0]['records']['nextStart']
            except KeyError:
                start = None
            for record in data['response']['zone'][0]['records']['work']:
                # See if there's a link to the full text version.
                if 'identifier' in record:
                    fulltext_url = get_fulltext_url(record['identifier'])
                    # I'm making the assumption that if this is a booky book (not a map or music etc),
                    # then 'Book' will appear first in the list of types.
                    # This might not be a valid assumption.
                    # try:
                    #    format_type = record.get('type')[0]
                    # except (IndexError, TypeError):
                    #    format_type = None
                    # Save the record if there's a full text link (the format check above is disabled).
                    if fulltext_url:
                        trove_id = re.search(r'(nla\.obj\-\d+)', fulltext_url).group(1)
                        # Get the basic metadata.
                        book = {
                            'title': record.get('title'),
                            'url': record.get('troveUrl'),
                            'contributors': join_list(record, 'contributor'),
                            'date': record.get('issued'),
                            'fulltext_url': fulltext_url,
                            'trove_id': trove_id
                        }
                        # Add some extra info if available
                        version = get_version_record(record)
                        book['language'] = join_list(version, 'language')
                        book['rights'] = join_list(version, 'rights')
                        books.append(book)
                        # print(book)
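            # Update by the number of records actually returned (the last page may have fewer than 100)
            pbar.update(len(data['response']['zone'][0]['records']['work']))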
    return books
In [ ]:
# Do the harvest!
books = harvest_books()
In [10]:
len(books)
Out[10]:
34289
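
If you want to try the `nextStart` paging logic without an API key, here's a minimal sketch that swaps the API request for a stub. `FAKE_PAGES`, `fetch_page` and `harvest_all` are hypothetical names, and the fake responses only mimic the two fields the loop relies on.

```python
# Fake API pages keyed by the 's' (start) value; the structure mimics
# only the parts of the response that the harvesting loop reads
FAKE_PAGES = {
    '*': {'work': [{'id': 1}, {'id': 2}], 'nextStart': 'abc'},
    'abc': {'work': [{'id': 3}]},  # no nextStart: this is the last page
}


def fetch_page(start):
    # Hypothetical stand-in for s.get(...).json()
    return FAKE_PAGES[start]


def harvest_all():
    works = []
    start = '*'
    while start:
        records = fetch_page(start)
        works.extend(records['work'])
        # A missing nextStart means there are no more pages
        start = records.get('nextStart')
    return works


works = harvest_all()
print(len(works))
```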

Get the number of pages in each book

In order to download the OCRd text we need to know the number of pages in a work. This information is not available via the API, so we have to scrape it from the work's HTML page.
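
Here's a small, offline sketch of the scraping idea: pull the JSON out of a script tag with a regular expression, then count the entries in the page list. The HTML fragment is made up, and the page layout is assumed from the pattern used in this notebook, so treat it as an illustration rather than a description of the live site.

```python
import json
import re

# Made-up HTML fragment; the real page layout is assumed from the regex used in this notebook
sample_html = (
    '<script>var work = JSON.parse(JSON.stringify('
    '{"form": "Book", "children": {"page": [{}, {}, {}]}}));</script>'
)

match = re.search(r'var work = JSON\.parse\(JSON\.stringify\((\{.*\})', sample_html)
# Fall back to an empty dict if the pattern isn't found
work = json.loads(match.group(1)) if match else {}

# Each entry in work['children']['page'] is treated as one page
pages = len(work.get('children', {}).get('page', []))
print(work.get('form'), pages)
```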

In [20]:
def get_work_data(url):
    '''
    Extract work data in a JSON string from the work's HTML page.
    '''
    response = s.get(url)
    try:
        work_data = re.search(r'var work = JSON\.parse\(JSON\.stringify\((\{.*\})', response.text).group(1)
    except AttributeError:
        work_data = '{}'
    return json.loads(work_data)


def get_pages(work):
    '''
    Get the number of pages from the work data.
    '''
    try:
        pages = len(work['children']['page'])
    except KeyError:
        pages = 0
    return pages


def get_volumes(parent_id):
    '''
    Get the ids of volumes that are children of the current record.
    '''
    start_url = 'https://nla.gov.au/{}/browse?startIdx={}&rows=20&op=c'
    # The initial startIdx value
    start = 0
    # Number of results per page
    n = 20
    parts = []
    # If there aren't 20 results on the page then we've reached the end, so continue harvesting until that happens.
    while n == 20:
        # Get the browse page
        response = s.get(start_url.format(parent_id, start))
        # Beautifulsoup turns the HTML into an easily navigable structure
        soup = BeautifulSoup(response.text, 'lxml')
        # Find all the divs containing volume details and loop through them
        details = soup.find_all(class_='l-item-info')
        for detail in details:
            # Get the volume id
            parts.append(detail.dt.a.string)
        # Pause briefly between requests for browse pages
        time.sleep(0.2)
        # Increment the startIdx
        start += n
        # Set n to the number of results on the current page
        n = len(details)
    return parts


def add_pages(books):
    '''
    Add the number of pages to the metadata for each book.
    Add volumes from multi volume books.
    '''
    books_with_pages = []
    for book in tqdm_notebook(books):
        # print(book['fulltext_url'])
        work = get_work_data(book['fulltext_url'])
        form = work.get('form')
        pages = get_pages(work)
        book['pages'] = pages
        book['form'] = form
        book['volume'] = ''
        book['parent'] = ''
        book['children'] = ''
        time.sleep(0.2)
        # Multi volume books are containers with child volumes
        # so we have to get the ids of each individual volume and process them
        if pages == 0 and form == 'Multi Volume Book':
            # Get child volumes
            volumes = get_volumes(book['trove_id'])
            # For each volume get details and add as a new book entry
            for index, volume_id in enumerate(volumes):
                volume = book.copy()
                # Add link up to the container
                volume['parent'] = book['trove_id']
                volume['fulltext_url'] = 'http://nla.gov.au/{}'.format(volume_id)
                volume['trove_id'] = volume_id
                work = get_work_data(volume['fulltext_url'])
                form = work.get('form')
                pages = get_pages(work)
                volume['form'] = form
                volume['pages'] = pages
                volume['volume'] = str(index + 1)
                # print(volume)
                books_with_pages.append(volume)
                time.sleep(0.2)
            # Add links from container to volumes
            book['children'] = '|'.join(volumes)
        # print(book)
        books_with_pages.append(book)
    return books_with_pages
In [ ]:
# Add number of pages to the book metadata
books_with_pages = add_pages(deepcopy(books))

Convert and save results

Getting the number of pages for every book takes quite a while, so it's a good idea to save the results to a CSV file before proceeding. That way, you won't have to repeat the process if something goes wrong and you lose the data that's sitting in memory.

In [56]:
df = pd.DataFrame(books_with_pages)
In [57]:
df.head()
Out[57]:
children contributors date form fulltext_url language pages parent rights title trove_id url volume
0 1878-1880 Book http://nla.gov.au/nla.obj-688657424 English 24 Out of Copyright|http://rightsstatements.org/v... Grammar of the Narrinyeri tribe of Australian ... nla.obj-688657424 https://trove.nla.gov.au/work/10029401
1 1839-1900 Book https://nla.gov.au/nla.obj-630176596 English 65 No known copyright restrictions|http://rightss... The works of the Rev. Sydney Smith nla.obj-630176596 https://trove.nla.gov.au/work/1004403
2 1914-1923 Book http://nla.gov.au/nla.obj-24357566 English 246 Out of Copyright|http://rightsstatements.org/v... Nellie Doran : a story of Australian home and ... nla.obj-24357566 https://trove.nla.gov.au/work/10049667
3 1942 Book https://nla.gov.au/nla.obj-51530748 German 80 Out of Copyright|http://rightsstatements.org/v... Lastkraftwagen 3 t Ford : Baumuster V 3000 S :... nla.obj-51530748 https://trove.nla.gov.au/work/10053234
4 1900-1920 Book http://nla.gov.au/nla.obj-19907304 English 388 Out of Copyright|http://rightsstatements.org/v... Trefoil : the story of a girls' society / by M... nla.obj-19907304 https://trove.nla.gov.au/work/10057400
In [58]:
# How many records?
df.shape
Out[58]:
(35691, 13)
In [59]:
# How many have pages?
df.loc[df['pages'] != 0].shape
Out[59]:
(24658, 13)
In [60]:
# How many of each format?
df['form'].value_counts()
Out[60]:
Book                   22555
Digital Publication    10051
Multi Volume Book       2105
Picture                  541
Journal                  367
Manuscript                38
Other - General           14
Map                        2
Other - Australian         1
Name: form, dtype: int64
In [12]:
# Breakdown by language
df['language'].value_counts()
Out[12]:
English                                       18970
                                              14724
Chinese                                        1211
French                                          181
Undetermined                                     86
German                                           78
Japanese                                         61
Australian languages                             53
Dutch                                            50
Austronesian (Other)                             35
Italian                                          29
Latin                                            25
Spanish                                          17
Maori                                            16
Swedish                                          15
Korean                                           15
Portuguese                                       14
Tahitian                                         11
Danish                                           11
Indonesian                                       11
Multiple languages                                8
Finnish                                           7
Tongan                                            7
Papiamento                                        5
Greek, Modern (1453- )                            5
Russian                                           5
Thai                                              4
Czech                                             4
Norwegian                                         4
Samoan                                            3
Polish                                            3
Fijian                                            2
Malay                                             2
Papuan (Other)                                    2
Welsh                                             2
No linguistic content                             2
Miscellaneous languages                           2
Vietnamese                                        1
Philippine (Other)                                1
Creoles and Pidgins, English-based (Other)        1
Hawaiian                                          1
Scottish Gaelic                                   1
Niger-Kordofanian (Other)                         1
pol                                               1
Tagalog                                           1
Javanese                                          1
Sanskrit                                          1
Gã                                                1
Name: language, dtype: int64
In [61]:
# Save as CSV
df.to_csv('trove_digitised_books.csv', index=False)
display(FileLink('trove_digitised_books.csv'))

Download the OCRd texts

In [11]:
# Run this cell if you need to reload the books data from the CSV
df = pd.read_csv('trove_digitised_books.csv', keep_default_na=False)
books_with_pages = df.to_dict('records')
In [6]:
def save_ocr(books, output_dir='text'):
    '''
    Download the OCRd text for each book.
    '''
    os.makedirs(output_dir, exist_ok=True)
    for book in tqdm_notebook(books):
        # Default values
        book['text_downloaded'] = False
        book['text_file'] = ''
        if book['pages'] != 0:       
            # print(book['title'])
            # The index value for the last page of a book will be the total pages - 1
            last_page = book['pages'] - 1
            file_name = '{}-{}.txt'.format(slugify(book['title'][:50]), book['trove_id'])
            file_path = os.path.join(output_dir, file_name)
            # Check to see if the file has already been harvested
            if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
                # print('Already saved')
                book['text_file'] = file_name
                book['text_downloaded'] = True
            else:
                url = 'https://trove.nla.gov.au/{}/download?downloadOption=ocr&firstPage=0&lastPage={}'.format(book['trove_id'], last_page)
                # print(url)
                # Get the file
                r = s.get(url)
                # Check there was no error
                if r.status_code == requests.codes.ok:
                    # Check that the file's not empty
                    r.encoding = 'utf-8'
                    if len(r.text) > 0 and not r.text.isspace():
                        # Check that the file isn't HTML (some not found pages don't return 404s)
                        if BeautifulSoup(r.text, 'html.parser').find('html') is None:
                            # If everything's ok, save the file
                            with open(file_path, 'w', encoding='utf-8') as text_file:
                                text_file.write(r.text)
                            # print('Saved')
                            book['text_file'] = file_name
                            book['text_downloaded'] = True
                time.sleep(1)
In [ ]:
save_ocr(books_with_pages)
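
To check the download URL pattern and the empty-text test on their own, here's a minimal sketch using only the standard library. `build_ocr_url` and `has_text` are hypothetical helpers that repeat the logic in `save_ocr()`; the id and page count are taken from the first row of the sample output above.

```python
def build_ocr_url(trove_id, pages):
    # Page indexes start at 0, so the last page is pages - 1
    last_page = pages - 1
    return (
        'https://trove.nla.gov.au/{}/download'
        '?downloadOption=ocr&firstPage=0&lastPage={}'
    ).format(trove_id, last_page)


def has_text(text):
    # Mirror the emptiness check: reject empty or whitespace-only responses
    return len(text) > 0 and not text.isspace()


url = build_ocr_url('nla.obj-688657424', 24)
print(url)
print(has_text('Some OCRd text'), has_text('  \n'))
```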

Convert and save updated results

The new books list includes the file name of the downloaded text file (if there is one), and a boolean field indicating if the text has been downloaded.

In [8]:
# Convert this to df
df_downloaded = pd.DataFrame(books_with_pages)
In [9]:
df_downloaded.head()
Out[9]:
children contributors date form fulltext_url language pages parent rights text_downloaded text_file title trove_id url volume
0 1878-1880 Book http://nla.gov.au/nla.obj-688657424 English 24 Out of Copyright|http://rightsstatements.org/v... True grammar-of-the-narrinyeri-tribe-of-australian-... Grammar of the Narrinyeri tribe of Australian ... nla.obj-688657424 https://trove.nla.gov.au/work/10029401
1 1839-1900 Book https://nla.gov.au/nla.obj-630176596 English 65 No known copyright restrictions|http://rightss... True the-works-of-the-rev-sydney-smith-nla.obj-6301... The works of the Rev. Sydney Smith nla.obj-630176596 https://trove.nla.gov.au/work/1004403
2 1914-1923 Book http://nla.gov.au/nla.obj-24357566 English 246 Out of Copyright|http://rightsstatements.org/v... True nellie-doran-a-story-of-australian-home-and-sc... Nellie Doran : a story of Australian home and ... nla.obj-24357566 https://trove.nla.gov.au/work/10049667
3 1942 Book https://nla.gov.au/nla.obj-51530748 German 80 Out of Copyright|http://rightsstatements.org/v... True lastkraftwagen-3-t-ford-baumuster-v-3000-s-ger... Lastkraftwagen 3 t Ford : Baumuster V 3000 S :... nla.obj-51530748 https://trove.nla.gov.au/work/10053234
4 1900-1920 Book http://nla.gov.au/nla.obj-19907304 English 388 Out of Copyright|http://rightsstatements.org/v... True trefoil-the-story-of-a-girls-society-by-m-p-nl... Trefoil : the story of a girls' society / by M... nla.obj-19907304 https://trove.nla.gov.au/work/10057400
In [10]:
# How many have been downloaded?
df_downloaded.loc[df_downloaded['text_downloaded'] == True].shape
Out[10]:
(22758, 15)

Why is the number above different to the number of files actually downloaded? Let's have a look for duplicates.

As you can see below, some digitised works are linked to from multiple metadata records. Hence there are duplicates.
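
One way to compare the two numbers is to count distinct `trove_id` values among the records flagged as downloaded. This is a minimal sketch on made-up data, and it assumes that each distinct `trove_id` corresponds to a single text file.

```python
import pandas as pd

# Toy data: two metadata records point to the same digitised work
df_example = pd.DataFrame({
    'trove_id': ['nla.obj-1', 'nla.obj-1', 'nla.obj-2', 'nla.obj-3'],
    'url': ['work/a', 'work/b', 'work/c', 'work/d'],
    'text_downloaded': [True, True, True, False],
})

downloaded = df_example.loc[df_example['text_downloaded']]
# Number of metadata records flagged as downloaded
records = downloaded.shape[0]
# Number of distinct digitised works among them
unique_files = downloaded['trove_id'].nunique()
print(records, unique_files)
```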

In [11]:
df_downloaded.loc[df_downloaded.duplicated('trove_id', keep=False)].sort_values('trove_id')
Out[11]:
children contributors date form fulltext_url language pages parent rights text_downloaded text_file title trove_id url volume
25115 1885 Book https://nla.gov.au/nla.obj-101207695 English 66 Out of Copyright|http://rightsstatements.org/v... True three-weeks-in-southland-being-the-account-of-... Three weeks in Southland : being the account o... nla.obj-101207695 https://trove.nla.gov.au/work/237350529
6939 1885 Book http://nla.gov.au/nla.obj-101207695 66 nla.obj-477008239 True three-weeks-in-southland-being-the-account-of-... Three weeks in Southland : being the account o... nla.obj-101207695 https://trove.nla.gov.au/work/19178390 2
25117 1831 Book https://nla.gov.au/nla.obj-101212925 English 8 No known copyright restrictions|http://rightss... True a-recent-visit-to-several-of-the-polynesian-is... A recent visit to several of the Polynesian is... nla.obj-101212925 https://trove.nla.gov.au/work/237350531
7228 1831-1832 Book http://nla.gov.au/nla.obj-101212925 8 True a-recent-visit-to-several-of-the-polynesian-is... A recent visit to several of the Polynesian is... nla.obj-101212925 https://trove.nla.gov.au/work/19241288
7225 1908 Book http://nla.gov.au/nla.obj-101227721 10 True how-capt-cook-died-new-light-from-an-old-book-... How Capt. Cook died : new light from an old book nla.obj-101227721 https://trove.nla.gov.au/work/19240402
25134 1908 Book https://nla.gov.au/nla.obj-101227721 English 10 No known copyright restrictions|http://rightss... True how-capt-cook-died-new-light-from-an-old-book-... How Capt. Cook died : new light from an old book nla.obj-101227721 https://trove.nla.gov.au/work/237350548
7242 1843 Book http://nla.gov.au/nla.obj-101260366 19 True propagandism-in-the-pacific-nla.obj-101260366.txt Propagandism in the Pacific nla.obj-101260366 https://trove.nla.gov.au/work/19244339
25133 1843 Book https://nla.gov.au/nla.obj-101260366 English 19 No known copyright restrictions|http://rightss... True propagandism-in-the-pacific-nla.obj-101260366.txt Propagandism in the Pacific nla.obj-101260366 https://trove.nla.gov.au/work/237350547
25120 1886 Book https://nla.gov.au/nla.obj-101963006 English 87 No known copyright restrictions|http://rightss... True prize-essays-on-the-industries-of-new-zealand-... Prize essays on the industries of New Zealand nla.obj-101963006 https://trove.nla.gov.au/work/237350534
4914 1886 Book http://nla.gov.au/nla.obj-101963006 87 True prize-essays-on-the-industries-of-new-zealand-... Prize essays on the industries of New Zealand nla.obj-101963006 https://trove.nla.gov.au/work/18429265
34554 1880 Book http://nla.gov.au/nla.obj-101978472 60 True recherches-sur-les-dialectes-tasmaniens-par-h-... Recherches sur les dialectes Tasmaniens / par ... nla.obj-101978472 https://trove.nla.gov.au/work/5171222
25126 1880 Book https://nla.gov.au/nla.obj-101978472 Australian languages 60 No known copyright restrictions|http://rightss... True recherches-sur-les-dialectes-tasmaniens-par-h-... Recherches sur les dialectes tasmaniens / par ... nla.obj-101978472 https://trove.nla.gov.au/work/237350540
7560 1853 Book http://nla.gov.au/nla.obj-102415118 24 True the-paradise-in-the-pacific-nla.obj-102415118.txt The Paradise in the Pacific nla.obj-102415118 https://trove.nla.gov.au/work/19290742
25112 1853 Book https://nla.gov.au/nla.obj-102415118 English 24 No known copyright restrictions|http://rightss... True the-paradise-in-the-pacific-nla.obj-102415118.txt The Paradise in the Pacific nla.obj-102415118 https://trove.nla.gov.au/work/237350526
7556 1874 Book http://nla.gov.au/nla.obj-102459217 24 True the-annexation-of-fiji-and-the-pacific-slave-t... The annexation of Fiji and the Pacific slave t... nla.obj-102459217 https://trove.nla.gov.au/work/19290301
25140 1874 Book https://nla.gov.au/nla.obj-102459217 English 24 No known copyright restrictions|http://rightss... True the-annexation-of-fiji-and-the-pacific-slave-t... The annexation of Fiji and the Pacific slave t... nla.obj-102459217 https://trove.nla.gov.au/work/237350554
25116 1882 Book https://nla.gov.au/nla.obj-103012708 English 17 No known copyright restrictions|http://rightss... True macquarie-island-by-john-h-scott-nla.obj-10301... Macquarie Island / by John H. Scott nla.obj-103012708 https://trove.nla.gov.au/work/237350530
7522 1882 Book http://nla.gov.au/nla.obj-103012708 17 True macquarie-island-by-john-h-scott-nla.obj-10301... Macquarie Island / by John H. Scott nla.obj-103012708 https://trove.nla.gov.au/work/19286855
7152 1915 Book http://nla.gov.au/nla.obj-103545681 8 True some-aspects-of-the-war-lecture-by-the-hon-p-n... "Some aspects of the war" : lecture by the Hon... nla.obj-103545681 https://trove.nla.gov.au/work/19226149
25123 1915 Book https://nla.gov.au/nla.obj-103545681 English 8 No known copyright restrictions|http://rightss... True some-aspects-of-the-war-lecture-by-the-hon-p-n... "Some aspects of the war" : lecture by the Hon... nla.obj-103545681 https://trove.nla.gov.au/work/237350537
7164 1864 Book http://nla.gov.au/nla.obj-103558551 4 True northern-territory-of-south-australia-sales-of... Northern Territory of South Australia : sales ... nla.obj-103558551 https://trove.nla.gov.au/work/19228668
25139 1864 Book https://nla.gov.au/nla.obj-103558551 English 4 No known copyright restrictions|http://rightss... True northern-territory-of-south-australia-sales-of... Northern Territory of South Australia : sales ... nla.obj-103558551 https://trove.nla.gov.au/work/237350553
25136 1648-1843 Book https://nla.gov.au/nla.obj-103710704 English 1 No known copyright restrictions|http://rightss... True fac-simile-of-the-original-warrant-for-beheadi... Fac simile of the original warrant for beheadi... nla.obj-103710704 https://trove.nla.gov.au/work/237350550
3134 1648-1843 Book http://nla.gov.au/nla.obj-103710704 1 True fac-simile-of-the-original-warrant-for-beheadi... Fac simile of the original warrant for beheadi... nla.obj-103710704 https://trove.nla.gov.au/work/170446753
3135 1880-1889 Book http://nla.gov.au/nla.obj-103711020 2 True to-the-queen-s-most-excellent-majesty-the-humb... To the Queen's Most Excellent Majesty, the hum... nla.obj-103711020 https://trove.nla.gov.au/work/170446767
25111 1880-1889 Book https://nla.gov.au/nla.obj-103711020 English 2 No known copyright restrictions|http://rightss... True to-the-queen-s-most-excellent-majesty-the-humb... To the Queen's Most Excellent Majesty, the hum... nla.obj-103711020 https://trove.nla.gov.au/work/237350525
25128 1901 Book https://nla.gov.au/nla.obj-103715027 English 2 No known copyright restrictions|http://rightss... True opening-of-the-commonwealth-parliament-by-his-... Opening of the Commonwealth Parliament by His ... nla.obj-103715027 https://trove.nla.gov.au/work/237350542
3144 1901 Book http://nla.gov.au/nla.obj-103715027 2 True opening-of-the-commonwealth-parliament-by-his-... Opening of the Commonwealth Parliament by His ... nla.obj-103715027 https://trove.nla.gov.au/work/170448703
515 1840 Book http://nla.gov.au/nla.obj-103716394 1 True the-adelaide-libel-at-a-glance-published-by-re... The Adelaide libel at a glance : published by ... nla.obj-103716394 https://trove.nla.gov.au/work/11996585
25109 1840 Book https://nla.gov.au/nla.obj-103716394 English 1 No known copyright restrictions|http://rightss... True the-adelaide-libel-at-a-glance-published-by-re... The Adelaide libel at a glance : published by ... nla.obj-103716394 https://trove.nla.gov.au/work/237350523
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6498 1861 Book http://nla.gov.au/nla.obj-99043868 1 True history-of-tasmania-nla.obj-99043868.txt History of Tasmania nla.obj-99043868 https://trove.nla.gov.au/work/18984790
25101 1861 Book https://nla.gov.au/nla.obj-99043868 English 1 No known copyright restrictions|http://rightss... True history-of-tasmania-nla.obj-99043868.txt History of Tasmania nla.obj-99043868 https://trove.nla.gov.au/work/237350515
6572 1841 Book http://nla.gov.au/nla.obj-99045723 1 True the-zealand-chief-mr-burns-will-deliver-two-le... The Zealand chief, Mr Burns will deliver two l... nla.obj-99045723 https://trove.nla.gov.au/work/19007694
25082 1841 Book https://nla.gov.au/nla.obj-99045723 English 1 No known copyright restrictions|http://rightss... True the-zealand-chief-mr-burns-will-deliver-two-le... The Zealand chief, Mr Burns will deliver two l... nla.obj-99045723 https://trove.nla.gov.au/work/237350495
25081 1939 Book https://nla.gov.au/nla.obj-99417090 English 1 No known copyright restrictions|http://rightss... True german-organisations-throughout-n-s-wales-sept... German organisations throughout N.S.Wales, Sep... nla.obj-99417090 https://trove.nla.gov.au/work/237350494
35089 1939 Book http://nla.gov.au/nla.obj-99417090 1 True german-organisations-throughout-n-s-wales-sept... German organisations throughout N.S.Wales, Sep... nla.obj-99417090 https://trove.nla.gov.au/work/6963744
25072 1909 Book https://nla.gov.au/nla.obj-99417309 English 103 Out of Copyright|http://rightsstatements.org/v... True the-rescue-of-victoria-the-beautiful-nihilist-... The rescue of Victoria : the beautiful nihilis... nla.obj-99417309 https://trove.nla.gov.au/work/237350485
6504 1909 Book http://nla.gov.au/nla.obj-99417309 103 True the-rescue-of-victoria-the-beautiful-nihilist-... The rescue of Victoria : the beautiful nihilis... nla.obj-99417309 https://trove.nla.gov.au/work/18988259
25080 1930 Book https://nla.gov.au/nla.obj-99417405 English 1 No known copyright restrictions|http://rightss... True centenary-of-congregationalism-1830-1930-celeb... Centenary of Congregationalism 1830-1930 celeb... nla.obj-99417405 https://trove.nla.gov.au/work/237350493
9685 1930 Book http://nla.gov.au/nla.obj-99417405 1 True centenary-of-congregationalism-1830-1930-celeb... Centenary of Congregationalism 1830-1930 celeb... nla.obj-99417405 https://trove.nla.gov.au/work/21081846
3182 1945 Book http://nla.gov.au/nla.obj-99434150 1 True revised-australian-coupon-scale-nla.obj-994341... Revised Australian coupon scale nla.obj-99434150 https://trove.nla.gov.au/work/17155491
25083 1945 Book https://nla.gov.au/nla.obj-99434150 English 1 No known copyright restrictions|http://rightss... True revised-australian-coupon-scale-nla.obj-994341... Revised Australian coupon scale nla.obj-99434150 https://trove.nla.gov.au/work/237350496
25085 1841 Book https://nla.gov.au/nla.obj-99438221 English 1 No known copyright restrictions|http://rightss... True the-hereford-bull-lottery-just-imported-by-the... The Hereford bull "Lottery", just imported by ... nla.obj-99438221 https://trove.nla.gov.au/work/237350499
435 1841 Book http://nla.gov.au/nla.obj-99438221 1 True the-hereford-bull-lottery-just-imported-by-the... The Hereford bull "Lottery", just imported by ... nla.obj-99438221 https://trove.nla.gov.au/work/11862068
25087 1818 Book https://nla.gov.au/nla.obj-99453041 English 2 No known copyright restrictions|http://rightss... True a-serious-address-and-lamentation-to-the-young... A Serious address and lamentation to the young... nla.obj-99453041 https://trove.nla.gov.au/work/237350501
10924 1818 Book http://nla.gov.au/nla.obj-99453041 2 True a-serious-address-and-lamentation-to-the-young... A Serious address and lamentation to the young... nla.obj-99453041 https://trove.nla.gov.au/work/21855529
8308 1933 Book http://nla.gov.au/nla.obj-99566674 1 True who-are-the-real-clique-pass-resolutions-of-p-... Who are the real "clique"? : pass resolutions ... nla.obj-99566674 https://trove.nla.gov.au/work/20025101
25074 1933 Book https://nla.gov.au/nla.obj-99566674 English 1 No known copyright restrictions|http://rightss... True who-are-the-real-clique-pass-resolutions-of-p-... Who are the real "clique"? : pass resolutions ... nla.obj-99566674 https://trove.nla.gov.au/work/237350487
7195 1859 Book http://nla.gov.au/nla.obj-99609527 30 True les-europeens-dans-l-oceanie-l-australie-colon... Les Europeens dans l'Oceanie : l'Australie col... nla.obj-99609527 https://trove.nla.gov.au/work/19233998
25103 1859 Book https://nla.gov.au/nla.obj-99609527 French 30 No known copyright restrictions|http://rightss... True les-europeens-dans-l-oceanie-l-australie-colon... Les Europeens dans l'Oceanie : l'Australie col... nla.obj-99609527 https://trove.nla.gov.au/work/237350517
25077 1883 Book https://nla.gov.au/nla.obj-99616102 English 14 No known copyright restrictions|http://rightss... True amongst-the-pacific-islands-j-c-bell-nla.obj-9... Amongst the Pacific islands / [J.C. Bell] nla.obj-99616102 https://trove.nla.gov.au/work/237350490
7222 1883 Book http://nla.gov.au/nla.obj-99616102 14 True amongst-the-pacific-islands-j-c-bell-nla.obj-9... Amongst the Pacific islands / [J.C. Bell] nla.obj-99616102 https://trove.nla.gov.au/work/19240023
31591 1884 Book http://nla.gov.au/nla.obj-99644337 1 True camperdown-public-park-nla.obj-99644337.txt Camperdown Public Park nla.obj-99644337 https://trove.nla.gov.au/work/24047372
25091 1884 Book https://nla.gov.au/nla.obj-99644337 English 1 No known copyright restrictions|http://rightss... True camperdown-public-park-nla.obj-99644337.txt Camperdown Public Park nla.obj-99644337 https://trove.nla.gov.au/work/237350505
25093 1895 Book https://nla.gov.au/nla.obj-99671695 English 1 No known copyright restrictions|http://rightss... True a-wonderful-illawarra-waterfall-a-rare-beauty-... A Wonderful Illawarra waterfall : a rare beaut... nla.obj-99671695 https://trove.nla.gov.au/work/237350507
31596 1895 Book http://nla.gov.au/nla.obj-99671695 1 True a-wonderful-illawarra-waterfall-a-rare-beauty-... A Wonderful Illawarra waterfall : a rare beaut... nla.obj-99671695 https://trove.nla.gov.au/work/24063846
25138 1873 Book https://nla.gov.au/nla.obj-99716940 English 2 No known copyright restrictions|http://rightss... True the-results-of-the-census-of-1871-supplement-t... The Results of the census of 1871 : supplement... nla.obj-99716940 https://trove.nla.gov.au/work/237350552
3759 1873 Book http://nla.gov.au/nla.obj-99716940 2 True the-results-of-the-census-of-1871-supplement-t... The Results of the census of 1871 : supplement... nla.obj-99716940 https://trove.nla.gov.au/work/17856108
25122 1850 Book https://nla.gov.au/nla.obj-99727992 English 1 No known copyright restrictions|http://rightss... True regular-packets-for-australia-emigration-to-po... Regular packets for Australia : emigration to ... nla.obj-99727992 https://trove.nla.gov.au/work/237350536
815 1850 Book http://nla.gov.au/nla.obj-99727992 1 True regular-packets-for-australia-emigration-to-po... Regular packets for Australia : emigration to ... nla.obj-99727992 https://trove.nla.gov.au/work/12328620

6336 rows × 15 columns
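
The paired rows above share a `book_id` but point to different Trove work records. As a rough illustration only (not necessarily how the table above was produced), rows like these could be picked out with pandas' `duplicated()`. The first two sample rows are taken from the table; the third is a made-up placeholder, and the column name `work_url` is an assumption.

```python
import pandas as pd

# The first two rows come from the table above; the third is a made-up
# placeholder so that there is something to filter out.
# Assumption: 'work_url' is a hypothetical column name used for illustration.
sample = pd.DataFrame([
    {'book_id': 'nla.obj-99644337', 'work_url': 'https://trove.nla.gov.au/work/24047372'},
    {'book_id': 'nla.obj-99644337', 'work_url': 'https://trove.nla.gov.au/work/237350505'},
    {'book_id': 'nla.obj-placeholder', 'work_url': 'https://example.org/not-a-real-work'},
])

# keep=False flags every member of a duplicated group, not just the repeats
shared = sample[sample.duplicated(subset='book_id', keep=False)].sort_values('book_id')
print(shared)
print(shared['book_id'].nunique(), 'digitised work(s) shared by', len(shared), 'records')
```

Counting unique `book_id` values rather than rows is one way to reconcile the number of metadata records with the number of downloaded text files.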

In [12]:
# Save as CSV
df_downloaded.to_csv('trove_digitised_books_with_ocr.csv', index=False)
display(FileLink('trove_digitised_books_with_ocr.csv'))

Some leftover code used for renaming the text files

In [ ]:
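# Rename files to include the truncated title of the book
# (os and slugify are assumed to have been imported earlier in the notebook)
for row in df.itertuples():
    old_path = os.path.join('text', '{}.txt'.format(row.book_id))
    new_path = os.path.join('text', '{}-{}.txt'.format(slugify(row.title[:50]), row.book_id))
    try:
        os.rename(old_path, new_path)
    except FileNotFoundError:
        # No text file was downloaded for this book
        pass

In [ ]: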
# Convert all filenames back to just nla.obj- form
for filename in [f for f in os.listdir('text') if f.endswith('.txt')]:
    match = re.search(r'.*(nla\.obj.*)', filename)
    if match is None:
        # No identifier in the filename, so report it and leave the file alone
        print(filename)
        continue
    os.rename(os.path.join('text', filename), os.path.join('text', match.group(1)))
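
The two cells above are inverses of each other: one prefixes the identifier with a slug of the title, the other strips it back off. The sketch below shows that round trip on a single filename without touching any files. It makes two assumptions: `simple_slugify` is a simplified stand-in for the notebook's `slugify()` (which may treat Unicode and punctuation differently), and the pattern used to recover the identifier is stricter than the one in the cell above, so that it only matches `nla.obj-` followed by digits.

```python
import re


def simple_slugify(text):
    # Assumption: a simplified stand-in for the slugify() used in the notebook.
    # Lowercase the text and collapse runs of non-alphanumerics into hyphens.
    return re.sub(r'[^a-z0-9]+', '-', text.lower()).strip('-')


def titled_filename(title, book_id, max_chars=50):
    """Build '<slug-of-truncated-title>-<book_id>.txt'."""
    return '{}-{}.txt'.format(simple_slugify(title[:max_chars]), book_id)


def plain_filename(filename):
    """Recover '<book_id>.txt' from a titled filename, or None if no identifier is found."""
    match = re.search(r'(nla\.obj-\d+\.txt)$', filename)
    return match.group(1) if match else None


name = titled_filename('Camperdown Public Park', 'nla.obj-99644337')
print(name)                  # camperdown-public-park-nla.obj-99644337.txt
print(plain_filename(name))  # nla.obj-99644337.txt
```

Returning `None` when no identifier is found lets the caller skip the file instead of renaming it to an unintended name.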

Created by Tim Sherratt.

Work on this notebook was supported by the Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab.