Getting the text of Trove books from the Internet Archive¶

Previously I've harvested the text of books digitised by the National Library of Australia and made available through Trove. But it occurred to me it might be possible to get the full text of other books in Trove by making use of the links to the Open Library.

There are lots of links to the Open Library in Trove. A search for "http://openlibrary.org/" in the books zone currently returns almost a million results. Many of the linked Open Library records themselves point to digital copies in the Internet Archive. However, this is less useful than it seems as many of the digital copies have access restrictions. Nonetheless, at least some of the books in Trove will have freely accessible versions in the Internet Archive.

At first I thought finding them might require three steps – get the Open Library identifier from Trove, query the Open Library API to get the Internet Archive identifier, then download the text from the Internet Archive. But then I realised that you can query the Internet Archive API with an Open Library identifier, so that cut out a step. This is the basic method:

Search the Trove API for Australian books that include "http://openlibrary.org/"
Extract metadata, including the Open Library identifier, from these records
Work through the results, retrieving item metadata from the Internet Archive using the Open Library identifier
If the item has a freely available text version of the book, download it and save the metadata
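As a rough sketch of the last two steps, the lookup hinges on a single search field: the Internet Archive can be searched on `openlibrary_edition`, so an Open Library identifier turns directly into a query. The helper names here are hypothetical and only for illustration; the query field and URL pattern are the ones used in the code later in this notebook.

```python
def ia_query_for(ol_id):
    """Build an Internet Archive search query for an Open Library edition id."""
    return 'openlibrary_edition:{}'.format(ol_id)


def ia_details_url(ia_id):
    """Build the public details page URL for an Internet Archive item."""
    return 'https://archive.org/details/{}'.format(ia_id)


# Example identifiers taken from the harvested results
print(ia_query_for('OL24169301M'))
print(ia_details_url('cu31924003182643'))
```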

To talk to the Internet Archive API I made use of the internetarchive Python package. Before you can use this, you need to have an account at the Internet Archive, and then run ia configure on the command line. This will prompt you for your login details and save them in a config file.

The results:

The list of books with full text includes the following fields:

creators – pipe-separated list of creators
date – publication date
ia_formats – pipe-separated list of file formats available from the Internet Archive (these can be downloaded from the IA)
ia_id – Internet Archive identifier
ia_url – link to more information in the Internet Archive
ol_id – Open Library identifier
publisher – publisher
text_filename – name of the downloaded text file
title – title of the book
trove_url – link to more information in Trove
version_id – Trove version identifier
work_id – Trove work identifier
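Because creators and ia_formats are pipe-separated, they need to be split before you can filter or count on them. The following is a minimal sketch using only the standard library; the sample rows are made up and show only a few of the columns, and split_pipes is a hypothetical helper name.

```python
import csv
import io

# A made-up two-row sample in the same shape as the harvested CSV
# (only a few of the columns are shown)
sample = io.StringIO(
    'creators,ia_formats,ia_id,title\n'
    'Author One|Author Two,DjVu|DjVuTXT|Text PDF,example_item_1,First example\n'
    'Author Three,DjVuTXT,example_item_2,Second example\n'
)


def split_pipes(value):
    """Split a pipe-separated field into a list, ignoring empty values."""
    return [part for part in value.split('|') if part]


rows = []
for row in csv.DictReader(sample):
    row['creators'] = split_pipes(row['creators'])
    row['ia_formats'] = split_pipes(row['ia_formats'])
    rows.append(row)

for row in rows:
    print(row['ia_id'], row['creators'], row['ia_formats'])
```

To work with the real data, the sample could be swapped for the saved file, for example `open('trove-books-in-ia.csv', newline='')`.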

I ended up downloading 1,513 text files. However, despite the fact that I used the Australian content filter in Trove, it's clear that some of them have nothing to do with Australia. Nonetheless, there are many interesting books amongst the results, and it's a useful example of how you can make use of cross-links between resources.

You can download the harvested text files from CloudStor.

Set things up¶

In [1]:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from tqdm import tqdm_notebook
import pandas as pd
import time
import os
import urllib
# Remember to run ia configure at the command line first
import internetarchive as ia

In [2]:

api_key = '[YOUR TROVE API KEY GOES HERE]'

# Note that we're excluding periodicals even though there seem to be quite a few in the IA.
# I thought it would be best to stick to books for now & do the journals later.
# I'm using the 'Australian content' filter to try & limit to books published in or about Australia.
# The filter is not always accurate as you can see in some of the results...
params = {
    'q': '"http://openlibrary.org/" NOT format:Periodical',
    'zone': 'book',
    'l-australian': 'y', # Australian content --> yes please
    'include': 'workVersions', # We want all versions to make sure we find the OL record
    'key': api_key,
    'encoding': 'json',
    'bulkHarvest': 'true',
    'n': 100
}
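To see roughly what these parameters look like on the wire, you can encode a similar dictionary into a URL. This is only a sketch: the key is a placeholder, n is set to 0 as in the total-count request, and the exact encoding produced by requests may differ slightly from urlencode.

```python
from urllib.parse import urlencode

# Same shape as the params above, with a placeholder key
example_params = {
    'q': '"http://openlibrary.org/" NOT format:Periodical',
    'zone': 'book',
    'l-australian': 'y',
    'encoding': 'json',
    'n': 0,  # ask for no records, just the totals
    'key': 'YOUR_KEY',
}

url = 'https://api.trove.nla.gov.au/v2/result?' + urlencode(example_params)
print(url)
```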

Define some functions to do the work¶

In [3]:

s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))

def get_total_results():
    '''
    Get the total number of results for a search.
    '''
    these_params = params.copy()
    these_params['n'] = 0
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)
    data = response.json()
    return int(data['response']['zone'][0]['records']['total'])

def get_ol_id(record):
    '''
    Extract the Open Library identifier from a record.
    '''
    ol_id = None
    for link in record['identifier']:
        if link['type'] == 'control number' and link['value'][:2] == 'OL':
            ol_id = link['value']
    return ol_id

def get_details(record):
    '''
    Get basic metadata from a record.
    '''
    if isinstance(record.get('creator'), list):
        creators = '|'.join(record.get('creator'))
    else:
        creators = record.get('creator')
    book = {
        'title': record.get('title'),
        'creators': creators,
        'date': record.get('issued'),
        'publisher': record.get('publisher')
    }
    return book

def process_record(record, work_id, version_id):
    '''
    Check to see if a version record comes from the Open Library.
    If it does, extract the OL identifier and prepare basic metadata.
    '''
    book = None
    source = record.get('metadataSource')
    if source and source == 'Open Library':
        ol_id = get_ol_id(record)
        if ol_id:
            book = get_details(record)
            book['ol_id'] = ol_id
            book['work_id'] = work_id
            book['version_id'] = version_id
            book['trove_url'] = 'https://trove.nla.gov.au/version/{}'.format(version_id)
    return book
    
def harvest_books():
    '''
    Get records from Trove with Open Library links.
    Extract and save the OL identifier and basic book metadata.
    '''
    books = []
    total = get_total_results()
    start = '*'
    these_params = params.copy()
    with tqdm_notebook(total=total) as pbar:
        while start:
            these_params['s'] = start
            response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)
            data = response.json()
            # The nextStart parameter is used to get the next page of results.
            # If there's no nextStart then it means we're on the last page of results.
            try:
                start = data['response']['zone'][0]['records']['nextStart']
            except KeyError:
                start = None
            for work in data['response']['zone'][0]['records']['work']:
                # Sometimes a version has a single record, other times a list of records
                # Make sure we process all of them to get different editions
                for version in work['version']:
                    if isinstance(version['record'], list):
                        for record in version['record']:
                            book = process_record(record, work['id'], version['id'])
                            if book:
                                books.append(book)
                    else:
                        book = process_record(version['record'], work['id'], version['id'])
                        if book:
                            books.append(book)
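            # Update by the number of works on this page (the last page may have fewer than 100)
            pbar.update(len(data['response']['zone'][0]['records']['work']))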
    return books

def get_ia_details(books):
    '''
    Process a list of books with OL identifiers.
    Retrieve metadata from Internet Archive.
    If there's a freely available text file, download it.
    Remember to run ia configure at the command line first or else you'll get authentication errors!
    '''
    ia_books = []
    for book in tqdm_notebook(books):
        # Search for items with a specific OL identifier
        for item in ia.search_items('openlibrary_edition:{}'.format(book['ol_id'])).iter_as_items():
            formats = []
            # Check to see if there are digital files available
            if 'files' in item.item_metadata:
                # Loop through the files and save the format names
                for file in item.item_metadata['files']:
                    if file.get('private') != 'true':
                        formats.append(file['format'])
                        # If there's a text version, grab the filename
                        if file['format'] == 'DjVuTXT':
                            text_file = file['name']
            # If there's a text version we'll download it
            if 'DjVuTXT' in formats:
                # Check to see if we've already got it
                if not os.path.exists(os.path.join('ia_texts', text_file)):
                    try:
                        dl = ia.download(item.identifier, formats='DjVuTXT', destdir='ia_texts', no_directory=True)
                    except requests.exceptions.HTTPError as err:
                        # Even though I tried to exclude 'private' files above, I still got some authentication errors
                        if err.response.status_code == 403:
                            dl = None
                        else:
                            raise
                else:
                    dl = True
                # If we've successfully downloaded a text file, save the book details
                if dl is not None:
                    ia_book = book.copy()
                    ia_book['ia_formats'] = '|'.join(formats)
                    ia_book['ia_id'] = item.identifier
                    ia_book['text_filename'] = text_file
                    ia_book['ia_url'] = 'https://archive.org/details/{}'.format(item.identifier)
                    ia_books.append(ia_book)
    return ia_books
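The file check inside get_ia_details can be tried out without touching the network. Here is a minimal sketch that pulls the same logic into a standalone function and runs it on made-up file metadata; the function name and sample data are hypothetical, and it assumes, as the code above does, that restricted files carry a 'private' value of 'true'.

```python
def find_public_text_file(files):
    """
    Return (formats, text_filename) for a list of file metadata dicts.
    Files flagged as private are skipped; text_filename is None if
    there is no public DjVuTXT file.
    """
    formats = []
    text_filename = None
    for file in files:
        if file.get('private') == 'true':
            continue
        formats.append(file.get('format'))
        if file.get('format') == 'DjVuTXT':
            text_filename = file.get('name')
    return formats, text_filename


# Made-up file metadata for illustration only
sample_files = [
    {'name': 'example_item.pdf', 'format': 'Text PDF'},
    {'name': 'example_item_djvu.txt', 'format': 'DjVuTXT'},
    {'name': 'example_item_abbyy.gz', 'format': 'Abbyy GZ', 'private': 'true'},
]

formats, text_filename = find_public_text_file(sample_files)
print(formats, text_filename)
```

Returning None for the filename when no public text file exists makes the "nothing to download" case explicit, rather than relying on a variable left over from an earlier item.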

Get Trove books with Open Library links¶

In [ ]:

books = harvest_books()
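The harvest works by following the nextStart value from one page of results to the next, stopping when a page arrives without one. The loop can be sketched in isolation like this, with made-up pages standing in for API responses; fetch_page and the page structure are hypothetical simplifications, not the real Trove response format.

```python
# Made-up pages keyed by cursor value; a real harvest would request these from the API
fake_pages = {
    '*': {'records': ['a', 'b'], 'nextStart': 'cursor1'},
    'cursor1': {'records': ['c', 'd'], 'nextStart': 'cursor2'},
    'cursor2': {'records': ['e']},  # no nextStart, so this is the last page
}


def fetch_page(start):
    """Hypothetical stand-in for an API request."""
    return fake_pages[start]


def harvest_all():
    """Follow nextStart cursors until a page arrives without one."""
    records = []
    start = '*'
    while start:
        page = fetch_page(start)
        records.extend(page['records'])
        start = page.get('nextStart')
    return records


print(harvest_all())
```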

In [5]:

# Convert to a dataframe
df = pd.DataFrame(books)
df.head()

Out[5]:

	creators	date	ol_id	publisher	title	trove_url	version_id	work_id
0	Steve Parker 1952-	1993	OL1404493M	London Dorling Kindersley	Rocks and minerals written by Steve Parker.	https://trove.nla.gov.au/version/257741553	257741553	10007961
1	South Australia. Premier's Dept. Publicity and...	1980	OL24656253M	Adelaide Publicity, Premier's Department, Sout...	South Australia	https://trove.nla.gov.au/version/166795880	166795880	10015487
2	Thomas Chastain	1981	OL4092928M	Garden City, N.Y Doubleday	The diamond exchange Thomas Chastain	https://trove.nla.gov.au/version/49089270	49089270	10020752
3	Baker, Richard W.|East-West Center.	1994	OL1405863M	Westport, Conn Praeger	The ANZUS states and their region regional pol...	https://trove.nla.gov.au/version/186090623	186090623	10022258
4	Arthur, Elizabeth 1953-	1995	OL1413756M	New York Knopf	Antarctic navigation a novel Elizabeth Arthur.	https://trove.nla.gov.au/version/171444544	171444544	10025636

In [6]:

# How many results?
df.shape

Out[6]:

(8273, 8)

In [7]:

# Save to CSV
df.to_csv('books_with_olids.csv', index=False)

Get metadata and download text files from the Internet Archive¶

In [ ]:

ia_books = get_ia_details(books)

In [13]:

# Convert to dataframe
df_ia = pd.DataFrame(ia_books)
df_ia.head()

Out[13]:

	creators	date	ia_formats	ia_id	ia_url	ol_id	publisher	text_filename	title	trove_url	version_id	work_id
0	George Jeffrey|Perkins, Arthur J. (Arthur Jame...	1907	Item Tile|DjVu|Animated GIF|Text PDF|Abbyy GZ|...	cu31924003182643	https://archive.org/details/cu31924003182643	OL24169301M	Adelaide Printed by Vardon & sons, ltd.	cu31924003182643_djvu.txt	A practical handbook on sheep and wool for the...	https://trove.nla.gov.au/version/49129017	49129017	10051865
1	Rolf Boldrewood 1826-1915	1969	DjVu|Animated GIF|Image Container PDF|Abbyy GZ...	oldmelbournemem00boldgoog	https://archive.org/details/oldmelbournemem00b...	OL5729660M	Melbourne Heinemann	oldmelbournemem00boldgoog_djvu.txt	Old Melbourne memories [by] Rolf Boldrewood. I...	https://trove.nla.gov.au/version/544219	544219	10070727
2	Rolf Boldrewood 1826-1915	1896	Item Tile|DjVu|Animated GIF|Text PDF|Abbyy GZ|...	oldmelbournememo00bold	https://archive.org/details/oldmelbournememo00...	OL7114646M	London, New York Macmillan and Co.	oldmelbournememo00bold_djvu.txt	Old Melbourne memories [by] Rolf Boldrewood.	https://trove.nla.gov.au/version/1560640	1560640	10070727
3	Rolf Boldrewood 1826-1915	1884	Item Tile|DjVu|Animated GIF|Image Container PD...	oldmelbournemem01boldgoog	https://archive.org/details/oldmelbournemem01b...	OL23448373M	George Robertson	oldmelbournemem01boldgoog_djvu.txt	Old Melbourne Memories	https://trove.nla.gov.au/version/3575747	3575747	10070727
4	Athel D'Ombrain 1901-|Swan, Wendy.	1981	Item Tile|DjVu|Animated GIF|Text PDF|Abbyy GZ|...	religionbusiness00stim	https://archive.org/details/religionbusiness00...	OL3518420M	Sydney, N.S.W Reed	religionbusiness00stim_djvu.txt	Historic buildings of Maitland District Maitla...	https://trove.nla.gov.au/version/183601587	183601587	10077826

In [14]:

# Check the number of records
df_ia.shape

Out[14]:

(1511, 12)

In [15]:

# Save as a CSV
df_ia.to_csv('trove-books-in-ia.csv', index=False)