Harvesting the text of digitised books (and ephemera)

This notebook harvests metadata and OCRd text from digitised books in Trove. There are three main steps:

  • Harvest metadata of digitised books using the Trove API
  • Extract the number of pages for each book from the Trove web interface (the number of pages is necessary to download the OCRd text)
  • Download the OCRd text for each book

It's not easy to identify all the digitised books with OCRd text in Trove. I'm starting with a search in the book zone for records that include the phrase "nla.obj" and are available online. This currently returns 21,699 results. However, this includes books where access to the digital copy is 'restricted'. I think these are mostly recent books submitted in digital form under legal deposit. I've filtered the 21,699 results to remove records where the digital copy is not available. This currently reduces the total to 13,500 results.

But some of those 13,500 results are actually parent records that contain multiple volumes or parts. When I find the number of pages in each book, I'm also checking to see if the record is a 'Multi Volume Book' and has child works. If it does, I add the child works to the list of books. After this stage there are 14,538 works.

However, not all of these 14,538 records have OCRd text. Parent records of multi volume works, and ebook formats like PDFs or MOBI, don't have individual pages, and therefore don't have any text to download. If we exclude works without pages, there are 11,045 works that might have some OCRd text to download.

But when you harvest the text files from these works, you find that some of them are empty. I've excluded these from the final dataset, leaving a grand total of 9,738 text files.

If you compare the number of downloaded files to the number in the CSV file that are identified as having OCRd text you'll notice a difference – 9,738 compared to 9,754. After a bit more poking around I realised that there are some duplicates in the list of works. This seems to be because more than one Trove metadata record can point to the same digitised work. For example, both this record and this record point to this digitised work. As they're not exact duplicates, I've left them in the results.
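To check where the difference comes from, you can compare the number of rows flagged as downloaded with the number of distinct file names. Here's a minimal sketch using a few invented rows (the ids, urls and file names are made up), assuming the `text_downloaded` column holds the True/False flag and `text_file` holds the file name, as they do in the code further down:

```python
import pandas as pd

# Invented sample rows: two metadata records point to the same digitised work
sample = pd.DataFrame({
    'trove_id': ['nla.obj-1', 'nla.obj-1', 'nla.obj-2', 'nla.obj-3'],
    'url': [
        'https://trove.nla.gov.au/work/1',
        'https://trove.nla.gov.au/work/2',
        'https://trove.nla.gov.au/work/3',
        'https://trove.nla.gov.au/work/4',
    ],
    'text_file': ['a-title-nla.obj-1.txt', 'a-title-nla.obj-1.txt', 'another-nla.obj-2.txt', ''],
    'text_downloaded': [True, True, True, False],
})

downloaded = sample.loc[sample['text_downloaded']]
# Number of metadata records flagged as having downloaded text
records_with_text = len(downloaded)
# Number of distinct file names, which is what you'd expect to find on disk
unique_files = downloaded['text_file'].nunique()
```

Note that two records for the same work can have slightly different titles, and so different file names, which is why this counts file names rather than `trove_id` values.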

Looking through the downloaded text files, it's clear that we're getting ephemera (particularly pamphlets and posters) as well as books. There doesn't seem to be an obvious way to filter these out up front, but of course you could filter later by the number of pages.
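As a rough illustration of filtering by the number of pages, here's a sketch using invented rows. The 50 page cut-off is just an assumption for the example, not a tested rule, so you'd want to adjust it after looking at the data:

```python
import pandas as pd

# Invented sample rows using the 'pages' column from the CSV
sample = pd.DataFrame({
    'title': ['A poster', 'A pamphlet', 'A novel'],
    'form': ['Book', 'Book', 'Book'],
    'pages': [1, 12, 246],
})

# Hypothetical threshold: treat anything under 50 pages as probable ephemera
MIN_PAGES = 50
likely_books = sample.loc[sample['pages'] >= MIN_PAGES]
# Works with 0 pages have no text at all, so leave them out of both groups
likely_ephemera = sample.loc[(sample['pages'] > 0) & (sample['pages'] < MIN_PAGES)]
```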

Here's the metadata I've harvested in CSV format:

This file includes the following columns:

  • children – pipe-separated ids of any child works
  • contributors – pipe-separated names of contributors
  • date – publication date
  • form – work format
  • fulltext_url – link to the digitised version
  • pages – number of pages
  • parent – id of parent work (if any)
  • text_downloaded – True/False, has any OCRd text been downloaded?
  • text_file – file name of the downloaded OCR text
  • title – title of the work
  • trove_id – unique identifier
  • url – link to the metadata record in Trove
  • volume – volume/part number
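Because the children and contributors columns are pipe-separated, you'll probably want to split them into lists when you load the file. Here's a minimal sketch using the standard library's csv module and a couple of invented rows (the `split_pipes` helper is my own name for this example, not part of the notebook):

```python
import csv
import io

# Invented sample data with a few of the columns listed above
csv_text = io.StringIO(
    'children,contributors,title,trove_id\n'
    'nla.obj-2|nla.obj-3,"Smith, A|Jones, B",A parent work,nla.obj-1\n'
    ',,A single work,nla.obj-4\n'
)


def split_pipes(value):
    '''
    Split a pipe-separated string into a list, returning an empty list for empty values.
    '''
    return value.split('|') if value else []


rows = []
for row in csv.DictReader(csv_text):
    row['children'] = split_pipes(row['children'])
    row['contributors'] = split_pipes(row['contributors'])
    rows.append(row)
```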

The 9,738 downloaded text files are in the text directory of this repository. You can also browse the collection in CloudStor, or download the complete set as a zip file (400MB).

Setting things up

In [76]:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from tqdm import tqdm_notebook
from IPython.display import display, FileLink
import pandas as pd
import json
import re
import time
import os
from copy import deepcopy
from bs4 import BeautifulSoup
from slugify import slugify
In [9]:
s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))
In [10]:
# Add your Trove API key below
api_key = ''
In [11]:
params = {
    'key': api_key,
    'zone': 'book',
    'q': 'nla.obj',
    'bulkHarvest': 'true',
    'n': 100,
    'encoding': 'json',
    'l-availability': 'y'
}

Harvest metadata using the API

In [89]:
def get_total_results():
    '''
    Get the total number of results for a search.
    '''
    these_params = params.copy()
    these_params['n'] = 0
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)
    data = response.json()
    return int(data['response']['zone'][0]['records']['total'])


def get_fulltext_url(links):
    '''
    Loop through the identifiers to find a link to the full text version of the book.
    '''
    url = None
    for link in links:
        if link['linktype'] == 'fulltext' and 'nla.obj' in link['value']:
            url = link['value']
            break
    return url


def harvest_books():
    '''
    Harvest metadata relating to digitised books.
    '''
    books = []
    total = get_total_results()
    start = '*'
    these_params = params.copy()
    with tqdm_notebook(total=total) as pbar:
        while start:
            these_params['s'] = start
            response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)
            data = response.json()
            # The nextStart parameter is used to get the next page of results.
            # If there's no nextStart then it means we're on the last page of results.
            try:
                start = data['response']['zone'][0]['records']['nextStart']
            except KeyError:
                start = None
            for record in data['response']['zone'][0]['records']['work']:
                # See if there's a link to the full text version.
                fulltext_url = get_fulltext_url(record['identifier'])
                # I'm making the assumption that if this is a booky book (not a map or music etc),
                # then 'Book' will appear first in the list of types.
                # This might not be a valid assumption.
                # try:
                #    format_type = record.get('type')[0]
                # except (IndexError, TypeError):
                #    format_type = None
                # Save the record if there's a full text link (the format check above is disabled).
                if fulltext_url:
                    trove_id = re.search(r'(nla\.obj\-\d+)', fulltext_url).group(1)
                    # The 'contributor' field may have a single value or an array.
                    # If it's an array, join the values into a string.
                    try:
                        contributors = '|'.join(record.get('contributor'))
                    except TypeError:
                        contributors = record.get('contributor')
                    # Get the basic metadata.
                    book = {
                        'title': record.get('title'),
                        'url': record.get('troveUrl'),
                        'contributors': contributors,
                        'date': record.get('issued'),
                        'fulltext_url': fulltext_url,
                        'trove_id': trove_id
                    }
                    books.append(book)
                    #print(book)
            pbar.update(100)
    return books
In [90]:
# Do the harvest!
books = harvest_books()
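If you want to see the `nextStart` loop on its own, without calling the API, here's a stripped-down sketch. The `harvest_all` function and the `fetch_page` callable are hypothetical names for this example, and the fake pages only imitate the shape of the 'records' part of a response:

```python
def harvest_all(fetch_page):
    '''
    Collect records from a cursor-paginated set of results.
    fetch_page is a hypothetical callable that takes a start token
    and returns a dict with a 'work' list and (optionally) a 'nextStart' token.
    '''
    works = []
    start = '*'
    while start:
        records = fetch_page(start)
        works.extend(records.get('work', []))
        # If there's no nextStart, this was the last page of results
        start = records.get('nextStart')
    return works


# A fake two-page result set standing in for the API
fake_pages = {
    '*': {'work': [{'id': '1'}, {'id': '2'}], 'nextStart': 'abc'},
    'abc': {'work': [{'id': '3'}]},
}
results = harvest_all(lambda start: fake_pages[start])
```

Passing in the function that fetches a page makes the loop easy to test with fake data before you point it at the real thing.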

Get the number of pages in each book

In order to download the OCRd text we need to know the number of pages in a work. This information is not available via the API, so we have to scrape it from the work's HTML page.

In [126]:
def get_work_data(url):
    '''
    Extract work data in a JSON string from the work's HTML page.
    '''
    response = s.get(url)
    try:
        work_data = re.search(r'var work = JSON\.parse\(JSON\.stringify\((\{.*\})', response.text).group(1)
    except AttributeError:
        work_data = '{}'
    return json.loads(work_data)


def get_pages(work):
    '''
    Get the number of pages from the work data.
    '''
    try:
        pages = len(work['children']['page'])
    except KeyError:
        pages = 0
    return pages


def get_volumes(parent_id):
    '''
    Get the ids of volumes that are children of the current record.
    '''
    start_url = 'https://nla.gov.au/{}/browse?startIdx={}&rows=20&op=c'
    # The initial startIdx value
    start = 0
    # Number of results per page
    n = 20
    parts = []
    # If there aren't 20 results on the page then we've reached the end, so continue harvesting until that happens.
    while n == 20:
        # Get the browse page
        response = s.get(start_url.format(parent_id, start))
        # Beautifulsoup turns the HTML into an easily navigable structure
        soup = BeautifulSoup(response.text, 'lxml')
        # Find all the divs containing volume details and loop through them
        details = soup.find_all(class_='l-item-info')
        for detail in details:
            # Get the volume id
            parts.append(detail.dt.a.string)
        # Pause between requests for the browse pages
        time.sleep(0.2)
        # Increment the startIdx
        start += n
        # Set n to the number of results on the current page
        n = len(details)
    return parts


def add_pages(books):
    '''
    Add the number of pages to the metadata for each book.
    Add volumes from multi volume books.
    '''
    books_with_pages = []
    for book in tqdm_notebook(books):
        # print(book['fulltext_url'])
        work = get_work_data(book['fulltext_url'])
        form = work.get('form')
        pages = get_pages(work)
        book['pages'] = pages
        book['form'] = form
        book['volume'] = ''
        book['parent'] = ''
        book['children'] = ''
        time.sleep(0.2)
        # Multi volume books are containers with child volumes
        # so we have to get the ids of each individual volume and process them
        if pages == 0 and form == 'Multi Volume Book':
            # Get child volumes
            volumes = get_volumes(book['trove_id'])
            # For each volume get details and add as a new book entry
            for index, volume_id in enumerate(volumes):
                volume = book.copy()
                # Add link up to the container
                volume['parent'] = book['trove_id']
                volume['fulltext_url'] = 'http://nla.gov.au/{}'.format(volume_id)
                volume['trove_id'] = volume_id
                work = get_work_data(volume['fulltext_url'])
                form = work.get('form')
                pages = get_pages(work)
                volume['form'] = form
                volume['pages'] = pages
                volume['volume'] = str(index + 1)
                # print(volume)
                books_with_pages.append(volume)
                time.sleep(0.2)
            # Add links from container to volumes
            book['children'] = '|'.join(volumes)
        # print(book)
        books_with_pages.append(book)
    return books_with_pages
In [93]:
# Add number of pages to the book metadata
books_with_pages = add_pages(deepcopy(books))
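To see how the page count is pulled out of a work's HTML page, here's a minimal sketch that runs the same regular expression against an invented HTML fragment. The fragment and the 'pid' key are made up for illustration; real pages are much larger and their structure could change:

```python
import json
import re

# Invented HTML fragment imitating the embedded work data
sample_html = '''
<script>
var work = JSON.parse(JSON.stringify({"form": "Book", "children": {"page": [{"pid": "a"}, {"pid": "b"}, {"pid": "c"}]}}));
</script>
'''

# Capture everything from the first { to the last } on the line
match = re.search(r'var work = JSON\.parse\(JSON\.stringify\((\{.*\})', sample_html)
# Fall back to an empty dict if the pattern isn't found
work = json.loads(match.group(1)) if match else {}
# Each entry in children.page is one page of the digitised work
pages = len(work.get('children', {}).get('page', []))
```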

Convert and save results

Getting the number of pages for each book takes quite a while, so it's a good idea to save the results to a CSV file before proceeding. That way, you won't have to repeat the process if something goes wrong and you lose the data that's sitting in memory.

In [ ]:
df = pd.DataFrame(books_with_pages)
In [168]:
df.head()
Out[168]:
  | children | contributors | date | form | fulltext_url | pages | parent | title | trove_id | url | volume
0 |  | Taplin, George | 1878-1880 | Book | http://nla.gov.au/nla.obj-688657424 | 24 |  | Grammar of the Narrinyeri tribe of Australian ... | nla.obj-688657424 | https://trove.nla.gov.au/work/10029401 |
1 |  | Miriam Agatha | 1914-1923 | Book | http://nla.gov.au/nla.obj-24357566 | 246 |  | Nellie Doran : a story of Australian home and ... | nla.obj-24357566 | https://trove.nla.gov.au/work/10049667 |
2 |  | Germany. Heer. Heereswaffenamt | 1942 | Book | http://nla.gov.au/nla.obj-51530748 | 80 |  | Lastkraftwagen 3 t Ford : Baumuster V 3000 S :... | nla.obj-51530748 | https://trove.nla.gov.au/work/10053234 |
3 |  | Macdonald, M. P. (Margaret P.) | 1900-1920 | Book | http://nla.gov.au/nla.obj-19907304 | 388 |  | Trefoil : the story of a girls' society / by M... | nla.obj-19907304 | https://trove.nla.gov.au/work/10057400 |
4 |  |  | 1909 | Picture | http://nla.gov.au/nla.obj-233089297 | 0 |  | Military report on the province of Chiang-su (... | nla.obj-233089297 | https://trove.nla.gov.au/work/10068876 |
In [95]:
# How many records?
df.shape
Out[95]:
(14538, 11)
In [122]:
# How many have pages?
df.loc[df['pages'] != 0].shape
Out[122]:
(11045, 11)
In [181]:
# How many of each format?
df['form'].value_counts()
Out[181]:
Book                   9652
Digital Publication    2356
Multi Volume Book      1681
Picture                 607
Journal                 116
                         78
Manuscript               32
Other - General          13
Other - Australian        3
Name: form, dtype: int64
In [129]:
# Save as CSV
df.to_csv('trove_digitised_books.csv', index=False)
display(FileLink('trove_digitised_books.csv'))

Download the OCRd texts

In [157]:
# Run this cell if you need to reload the books data from the CSV
df = pd.read_csv('trove_digitised_books.csv', keep_default_na=False)
books_with_pages = df.to_dict('records')
In [158]:
def save_ocr(books, output_dir='text'):
    '''
    Download the OCRd text for each book.
    '''
    os.makedirs(output_dir, exist_ok=True)
    for book in tqdm_notebook(books):
        # Default values
        book['text_downloaded'] = False
        book['text_file'] = ''
        if book['pages'] != 0:
            # print(book['title'])
            # The index value for the last page of a book will be the total pages - 1
            last_page = book['pages'] - 1
            file_name = '{}-{}.txt'.format(slugify(book['title'][:50]), book['trove_id'])
            file_path = os.path.join(output_dir, file_name)
            # Check to see if the file has already been harvested
            if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
                # print('Already saved')
                book['text_file'] = file_name
                book['text_downloaded'] = True
            else:
                url = 'https://trove.nla.gov.au/{}/download?downloadOption=ocr&firstPage=0&lastPage={}'.format(book['trove_id'], last_page)
                # print(url)
                # Get the file
                r = s.get(url)
                # Check there was no error
                if r.status_code == requests.codes.ok:
                    # Check that the file's not empty
                    r.encoding = 'utf-8'
                    if len(r.text) > 0 and not r.text.isspace():
                        # Check that the file isn't HTML (some not found pages don't return 404s)
                        if BeautifulSoup(r.text, 'html.parser').find('html') is None:
                            # If everything's ok, save the file
                            with open(file_path, 'w', encoding='utf-8') as text_file:
                                text_file.write(r.text)
                            # print('Saved')
                            book['text_file'] = file_name
                            book['text_downloaded'] = True
                time.sleep(1)
In [159]:
save_ocr(books_with_pages)
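The checks on each downloaded file (not empty, not just whitespace, not an HTML error page) can be pulled out into a small function if you want to test them without making any requests. This is just a sketch: `is_usable_ocr` and `HtmlTagFinder` are names I've made up for the example, and it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages:

```python
from html.parser import HTMLParser


class HtmlTagFinder(HTMLParser):
    '''
    Record whether an <html> start tag appears in the text.
    '''
    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # Tag names are lowercased by the parser
        if tag == 'html':
            self.found = True


def is_usable_ocr(text):
    '''
    Return True if the text is not empty, not just whitespace, and not an HTML page.
    '''
    if not text or text.isspace():
        return False
    finder = HtmlTagFinder()
    finder.feed(text)
    return not finder.found
```

For example, `is_usable_ocr('<html><body>Not found</body></html>')` returns `False`, while a string of plain OCRd text returns `True`.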

Convert and save updated results

The new books list includes the file name of the downloaded text file (if there is one), and a boolean field indicating if the text has been downloaded.

In [169]:
# Convert this to df
df_downloaded = pd.DataFrame(books_with_pages)
In [171]:
df_downloaded.head()
Out[171]:
  | children | contributors | date | form | fulltext_url | pages | parent | text_downloaded | text_file | title | trove_id | url | volume
0 |  | Taplin, George | 1878-1880 | Book | http://nla.gov.au/nla.obj-688657424 | 24 |  | True | grammar-of-the-narrinyeri-tribe-of-australian-... | Grammar of the Narrinyeri tribe of Australian ... | nla.obj-688657424 | https://trove.nla.gov.au/work/10029401 |
1 |  | Miriam Agatha | 1914-1923 | Book | http://nla.gov.au/nla.obj-24357566 | 246 |  | True | nellie-doran-a-story-of-australian-home-and-sc... | Nellie Doran : a story of Australian home and ... | nla.obj-24357566 | https://trove.nla.gov.au/work/10049667 |
2 |  | Germany. Heer. Heereswaffenamt | 1942 | Book | http://nla.gov.au/nla.obj-51530748 | 80 |  | True | lastkraftwagen-3-t-ford-baumuster-v-3000-s-ger... | Lastkraftwagen 3 t Ford : Baumuster V 3000 S :... | nla.obj-51530748 | https://trove.nla.gov.au/work/10053234 |
3 |  | Macdonald, M. P. (Margaret P.) | 1900-1920 | Book | http://nla.gov.au/nla.obj-19907304 | 388 |  | True | trefoil-the-story-of-a-girls-society-by-m-p-nl... | Trefoil : the story of a girls' society / by M... | nla.obj-19907304 | https://trove.nla.gov.au/work/10057400 |
4 |  |  | 1909 | Picture | http://nla.gov.au/nla.obj-233089297 | 0 |  | False |  | Military report on the province of Chiang-su (... | nla.obj-233089297 | https://trove.nla.gov.au/work/10068876 |
In [165]:
# How many have been downloaded?
df_downloaded.loc[df_downloaded['text_downloaded'] == True].shape
Out[165]:
(9754, 13)

Why is the number above different to the number of files actually downloaded? Let's have a look for duplicates.

As you can see below, some digitised works are linked to from multiple metadata records. Hence there are duplicates.

In [179]:
df_downloaded.loc[df_downloaded.duplicated('trove_id', keep=False) == True].sort_values('trove_id')
Out[179]:
children contributors date form fulltext_url pages parent text_downloaded text_file title trove_id url volume
5768 1830-1839 Book http://nla.gov.au/nla.obj-1874683 1 True mother-don-t-you-cry-oh-well-i-can-remember-n-... Mother, don't you cry. : Oh, well I can rememb... nla.obj-1874683 https://trove.nla.gov.au/work/192090169
12530 1830-1839 Book http://nla.gov.au/nla.obj-1874683 1 True mother-don-t-you-cry-oh-well-i-can-remember-n-... Mother, don't you cry. : Oh, well I can rememb... nla.obj-1874683 https://trove.nla.gov.au/work/31771096
5010 Proeschel, F. (Frederick), 1809-1870 1863 Picture http://nla.gov.au/nla.obj-230987283 0 False Atlas, containing a map of Australasia : accom... nla.obj-230987283 https://trove.nla.gov.au/work/18784604
4031 Proeschel, F 1863 Picture http://nla.gov.au/nla.obj-230987283/view 0 False Atlas, containing a map of Australasia : accom... nla.obj-230987283 https://trove.nla.gov.au/work/184158101
11754 Brady, J. (John), approximately 1800-1871 1845 Book http://nla.gov.au/nla.obj-26250201 58 True a-descriptive-vocabulary-of-the-native-languag... A descriptive vocabulary of the native languag... nla.obj-26250201 https://trove.nla.gov.au/work/26205931
2431 Brady, J. (John), approximately 1800-1871 1845 Book http://nla.gov.au/nla.obj-26250201/view 58 True a-descriptive-vocabulary-of-the-native-languag... A descriptive vocabulary of the native languag... nla.obj-26250201 https://trove.nla.gov.au/work/16212988
2464 1935 Book http://nla.gov.au/nla.obj-293324241 36 True the-kangaroo-kook-book-a-book-containing-simpl... The Kangaroo Kook book : a book containing sim... nla.obj-293324241 https://trove.nla.gov.au/work/163413287
8972 1935 Book http://nla.gov.au/nla.obj-293324241/view?partI... 36 True the-kangaroo-kook-book-a-book-containing-simpl... The Kangaroo Kook book : a book containing sim... nla.obj-293324241 https://trove.nla.gov.au/work/224334184
8901 1845 Book http://nla.gov.au/nla.obj-33498844 16 True narrative-of-the-massacre-of-the-crew-of-the-s... Narrative of the massacre of the crew of the s... nla.obj-33498844 https://trove.nla.gov.au/work/22370708
9789 1845 Book http://nla.gov.au/nla.obj-33498844 16 True narrative-of-the-massacre-of-the-crew-of-the-s... Narrative of the massacre of the crew of the s... nla.obj-33498844 https://trove.nla.gov.au/work/230355479
5971 Hamlyn-Harris, Ronald 1912 Book http://nla.gov.au/nla.obj-33526052/view?search... 10 True papuan-mummification-as-practised-in-the-torre... Papuan mummification : as practised in the Tor... nla.obj-33526052 https://trove.nla.gov.au/work/192614849
7838 Hamlyn-Harris, Ronald, 1874-1953 1912 Book http://nla.gov.au/nla.obj-33526052 10 True papuan-mummification-as-practised-in-the-torre... Papuan mummification : as practised in the Tor... nla.obj-33526052 https://trove.nla.gov.au/work/21434566
8344 De Mole, F. E. (Fanny Elizabeth), 1835-1866 1861 Book http://nla.gov.au/nla.obj-33536409 47 True wild-flowers-of-south-australia-by-f-e-d-nla.o... Wild flowers of South Australia / by F.E.D nla.obj-33536409 https://trove.nla.gov.au/work/22034245
41 De Mole, F. E. (Fanny Elizabeth), 1835-1866 1861-1981 Book http://nla.gov.au/nla.obj-33536409 47 True wild-flowers-of-south-australia-f-e-d-nla.obj-... Wild flowers of South Australia / [F.E.D.] nla.obj-33536409 https://trove.nla.gov.au/work/10282429
12140 1917 Book http://nla.gov.au/nla.obj-37914471 7 True greater-britain-compulsion-in-australia-by-an-... Greater Britain : compulsion in Australia / by... nla.obj-37914471 https://trove.nla.gov.au/work/26859141
6111 1917 Book http://nla.gov.au/nla.obj-37914471 7 True greater-britain-compulsion-in-australia-elect-... Greater Britain : compulsion in Australia / [e... nla.obj-37914471 https://trove.nla.gov.au/work/192824195
6115 1852 Book http://nla.gov.au/nla.obj-46906507 152 False Xin yue quan shu / [electronic resource] nla.obj-46906507 https://trove.nla.gov.au/work/192846358
1834 1852 Book http://nla.gov.au/nla.obj-46906507 152 False Xin yue quan shu / [translated by the Committe... nla.obj-46906507 https://trove.nla.gov.au/work/12626164
6196 1901 Book https://nla.gov.au/nla.obj-485691946 208 True sydney-souvenir-of-the-arrival-of-the-first-go... Sydney : [souvenir of the arrival of the first... nla.obj-485691946 https://trove.nla.gov.au/work/19304426
6197 Wood, Samuel 1910-1919 Book http://nla.gov.au/nla.obj-485691946 208 True sydney-40-full-page-views-of-sydney-and-surrou... Sydney : 40 full page views of Sydney and surr... nla.obj-485691946 https://trove.nla.gov.au/work/19304446
2238 Ngarimu Victoria Cross Investiture Meeting, Ro... 1943 Multi Volume Book http://nla.gov.au/nla.obj-497842369 12 nla.obj-497824348 True souvenir-of-the-ngarimu-victoria-cross-investi... Souvenir of the Ngarimu Victoria Cross investi... nla.obj-497842369 https://trove.nla.gov.au/work/13744662 3.0
2236 Ngarimu Victoria Cross Investiture Meeting, Ro... 1943 Multi Volume Book http://nla.gov.au/nla.obj-497842369 12 nla.obj-497824348 True souvenir-of-the-ngarimu-victoria-cross-investi... Souvenir of the Ngarimu Victoria Cross investi... nla.obj-497842369 https://trove.nla.gov.au/work/13744662 1.0
6113 880-06 Kawabe, Gyokuen 1848 Book http://nla.gov.au/nla.obj-50152052 35 False Rikka hayakeiko. 2-hen / [electronic resource] nla.obj-50152052 https://trove.nla.gov.au/work/192841577
950 Ikeda, Tōri 1848 Book http://nla.gov.au/nla.obj-50152052 35 False Rikka hayakeiko. Ikeda Tōri Ō sho ; Kawabe G... nla.obj-50152052 https://trove.nla.gov.au/work/12468600
2043 Lees, William 1899 Book http://nla.gov.au/nla.obj-52757589 78 True the-goldfields-of-queensland-nla.obj-52757589.txt The goldfields of Queensland nla.obj-52757589 https://trove.nla.gov.au/work/12945049
8557 Lees, William 1899 Book http://nla.gov.au/nla.obj-52757589 78 True the-goldfields-of-queensland-the-warwick-glads... The goldfields of Queensland. the Warwick, Gla... nla.obj-52757589 https://trove.nla.gov.au/work/22224726
13716 Feilberg, Carl Adolph, 1844-1887 1880 Book http://nla.gov.au/nla.obj-52760287 61 True the-way-we-civilise-black-and-white-the-native... The way we civilise : black and white, the nat... nla.obj-52760287 https://trove.nla.gov.au/work/37436463
6476 1880 Book http://nla.gov.au/nla.obj-52760287 61 True the-way-we-civilise-black-and-white-the-native... The way we civilise : black and white, the nat... nla.obj-52760287 https://trove.nla.gov.au/work/19683457
9506 1863 Book http://nla.gov.au/nla.obj-52760768 64 True the-history-of-the-pilot-schooner-sea-witch-nl... The History of the pilot schooner, Sea witch nla.obj-52760768 https://trove.nla.gov.au/work/228940314
2581 1863 Book http://nla.gov.au/nla.obj-52760768 64 True the-history-of-the-pilot-schooner-sea-witch-nl... The History of the pilot schooner, Sea witch nla.obj-52760768 https://trove.nla.gov.au/work/16649477
12498 Bingle, John 1837 Book http://nla.gov.au/nla.obj-52763581 196 True a-letter-to-the-right-honorable-lord-viscount-... A letter to the Right Honorable Lord Viscount ... nla.obj-52763581 https://trove.nla.gov.au/work/30974356
8032 Bingle, John, 1796-1882 1837 Book http://nla.gov.au/nla.obj-52763581 196 True a-letter-to-the-right-honorable-lord-viscount-... A letter to the Right Honorable Lord Viscount ... nla.obj-52763581 https://trove.nla.gov.au/work/21702434
9473 Great Britain. Colonial Office 1839 Book http://nla.gov.au/nla.obj-52764244/view 61 True australian-aborigines-copies-of-extracts-of-de... Australian Aborigines : copies of extracts of ... nla.obj-52764244 https://trove.nla.gov.au/work/228835870
5437 Glenelg, Charles Grant, Baron, 1778-1866 1839 Book http://nla.gov.au/nla.obj-52764244 61 True australian-aborigines-copies-of-extracts-of-de... Australian Aborigines : copies of extracts of ... nla.obj-52764244 https://trove.nla.gov.au/work/18987234
6673 McConnel, Mary, 1824-1910 1905 Book http://nla.gov.au/nla.obj-52767712 60 True memories-of-days-long-gone-by-by-the-wife-of-a... Memories of days long gone by / by the wife of... nla.obj-52767712 https://trove.nla.gov.au/work/200519339
349 McConnel, Mary, 1830-1910 1905-2017 Book http://nla.gov.au/nla.obj-52767712 60 True memories-of-days-long-gone-by-by-the-wife-of-a... Memories of days long gone by / by the wife of... nla.obj-52767712 https://trove.nla.gov.au/work/11901799
7147 Berry, Richard, 1824- 1895 Book http://nla.gov.au/nla.obj-52808317 82 True an-old-tar-s-yarn-being-the-autobiography-of-c... An old tar's yarn : being the autobiography of... nla.obj-52808317 https://trove.nla.gov.au/work/20965104
3222 Berry, Richard, b. 1824 1895 Book http://nla.gov.au/nla.obj-52808317 82 True an-old-tar-s-yarn-being-the-autobiography-of-c... An old tar's yarn : being the autobiography of... nla.obj-52808317 https://trove.nla.gov.au/work/179838571
13718 1929 Book http://nla.gov.au/nla.obj-52818931 35 True metropolitan-golf-club-1908-1929-nla.obj-52818... Metropolitan Golf Club, 1908-1929 nla.obj-52818931 https://trove.nla.gov.au/work/37474049
5223 1929 Book http://nla.gov.au/nla.obj-52818931 35 True metropolitan-golf-club-1908-1929-nla.obj-52818... Metropolitan Golf Club, 1908-1929 nla.obj-52818931 https://trove.nla.gov.au/work/18884620
7058 Australian News and Information Bureau 1947 Book http://nla.gov.au/nla.obj-52821207 15 True exhibition-of-australian-aboriginal-cave-paint... Exhibition of Australian Aboriginal cave paint... nla.obj-52821207 https://trove.nla.gov.au/work/209191950
11671 Australian News and Information Bureau 1947 Book http://nla.gov.au/nla.obj-52821207/view 15 True exhibition-of-australian-aboriginal-cave-paint... Exhibition of Australian Aboriginal cave paint... nla.obj-52821207 https://trove.nla.gov.au/work/25578531
7728 1947 Book http://nla.gov.au/nla.obj-52845752 18 True stirling-henry-ltd-1924-1947-nla.obj-52845752.txt Stirling Henry Ltd., 1924-1947 nla.obj-52845752 https://trove.nla.gov.au/work/21289109
13757 1947 Book http://nla.gov.au/nla.obj-52845752 18 True stirling-henry-ltd-1924-1947-nla.obj-52845752.txt Stirling Henry Ltd., 1924-1947 nla.obj-52845752 https://trove.nla.gov.au/work/38125513
14346 Queensland. Royal Commission Appointed to Inqu... 1908 Book http://nla.gov.au/nla.obj-52848217/view?partId... 366 True report-together-with-minutes-of-proceedings-mi... Report, together with minutes of proceedings, ... nla.obj-52848217 https://trove.nla.gov.au/work/8465926
5544 Queensland. Royal Commission Appointed to Inqu... 1908 Book http://nla.gov.au/nla.obj-52848217 366 True report-of-the-royal-commission-appointed-to-in... Report of the Royal Commission Appointed to In... nla.obj-52848217 https://trove.nla.gov.au/work/19096180
3908 1895 Book http://nla.gov.au/nla.obj-52876957 215 True the-plates-of-the-chinese-annamese-japanese-co... The plates of the Chinese, Annamese, Japanese,... nla.obj-52876957 https://trove.nla.gov.au/work/18353045
3907 Lockhart, James H. Stewart Sir, (James Haldane... 1895-1907 Book http://nla.gov.au/nla.obj-52876957 215 True the-currency-of-the-farther-east-from-the-earl... The currency of the farther East from the earl... nla.obj-52876957 https://trove.nla.gov.au/work/18352867
11246 Blüm, Fleur, 1984- 2018 Digital Publication http://nla.gov.au/nla.obj-875406373 0 False Sophie's path : a choose your own adventure ro... nla.obj-875406373 https://trove.nla.gov.au/work/235394514
11300 Blüm, Fleur 2018 Digital Publication http://nla.gov.au/nla.obj-875406373 0 False Sophie's path : a choose your own adventure ro... nla.obj-875406373 https://trove.nla.gov.au/work/235519254
122 Walker, W. D 1935 Book http://nla.gov.au/nla.obj-88862902 3 True three-interesting-narratives-illustrated-by-la... Three interesting narratives : illustrated by ... nla.obj-88862902 https://trove.nla.gov.au/work/10928822
9475 The Lecture Agency 1935 Book http://nla.gov.au/nla.obj-88862902 3 True three-interesting-narratives-illustrated-by-la... Three interesting narratives : illustrated by ... nla.obj-88862902 https://trove.nla.gov.au/work/228838404
In [166]:
# Save as CSV
df_downloaded.to_csv('trove_digitised_books_with_ocr.csv', index=False)
display(FileLink('trove_digitised_books_with_ocr.csv'))

Some leftover bits used for renaming the text files

In [ ]:
# Rename files to include truncated title of book
# (the identifier column in the current DataFrame is trove_id, not book_id)
for row in df.itertuples():
    try:
        os.rename(os.path.join('text', '{}.txt'.format(row.trove_id)), os.path.join('text', '{}-{}.txt'.format(slugify(row.title[:50]), row.trove_id)))
    except FileNotFoundError:
        pass
In [ ]:
# Convert all filenames back to just nla.obj- form
for filename in [f for f in os.listdir('text') if f.endswith('.txt')]:
    try:
        objname = re.search(r'.*(nla\.obj.*)', filename).group(1)
    except AttributeError:
        # No nla.obj identifier in the file name, so report it and skip the rename
        print(filename)
        continue
    os.rename(os.path.join('text', filename), os.path.join('text', objname))

Created by Tim Sherratt.

Work on this notebook was supported by the Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab.