This notebook harvests metadata and OCRd text from digitised books in Trove. There are three main steps: harvesting the metadata of digitised books from the Trove API; scraping the number of pages in each book from its web page (this information isn't available through the API); and downloading the OCRd text of each book.
It's not easy to identify all the digitised books with OCRd text in Trove. I'm starting with a search in the book zone for works that include the phrase "nla.obj" and are available online. This currently returns 65,050 results in the web interface. Amongst these are a range of government papers and reports, as well as publications where access to the digital copy is restricted – I think these are mostly recent books submitted in digital form under legal deposit. Things are made even more confusing by the fact that the contents of the 'Books & libraries' category in the web interface are not the same as the API's book zone. Anyway, I've used the new fullTextInd index to try to filter out works without any OCRd text. This reduces the total to 40,751 results.
But some of those 40,751 results are actually parent records that contain multiple volumes or parts. When I find the number of pages in each book, I also check whether the record is a 'Multi Volume Book' with child works. If it is, I add the child works to the list of books. After this stage there are 42,174 works. However, not all of these records have OCRd text. Parent records of multi-volume works, and ebook formats like PDF or MOBI, don't have individual pages, and therefore don't have any text to download. If we exclude works without pages, there are 31,402 works that might have some OCRd text to download.
After downloading all the OCRd text files (ignoring any that were empty) I ended up with a grand total of 26,762 files.
If you compare the number of downloaded files to the number of works in the CSV file identified as having OCRd text, you'll notice a difference – 26,762 compared to 29,652. After a bit more poking around, I realised that there are some duplicates in the list of works. This seems to be because more than one Trove metadata record can point to the same digitised work. For example, both this record and this record point to this digitised work. As they're not exact duplicates, I've left them in the results.
Looking through the downloaded text files, it's clear that we're getting ephemera (particularly pamphlets and posters) as well as books. There doesn't seem to be an obvious way to filter these out up front, but of course you could filter later by the number of pages.
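For example, once the harvested metadata is in a DataFrame (as it will be below), a minimal sketch of such a filter might look like this – the 10-page threshold is an arbitrary assumption, not a recommendation:

# Minimal sketch: exclude short works that are likely to be ephemera.
# The 10-page cutoff is an arbitrary assumption for illustration.
df_books_only = df.loc[df['pages'] >= 10]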
Here's the metadata I've harvested in CSV format:
This file includes the following columns:
- title – title of the work
- url – link to the metadata record in Trove
- contributors – pipe-separated names of contributors
- date – publication date
- format – the type of work, eg 'Book' or 'Government publication'; can have multiple values (pipe-separated)
- fulltext_url – link to the digital version
- trove_id – unique identifier of the digital version
- language – main language of the work
- rights – copyright status
- pages – number of pages
- form – work format, generally one of 'Book', 'Multi Volume Book', or 'Digital Publication'
- volume – volume/part number
- children – pipe-separated ids of any child works
- parent – id of parent work (if any)
- text_downloaded – True/False: has the OCRd text been downloaded?
- text_file – file name of the downloaded OCRd text

Browse and download text files from CloudStor.
The full list of books in digital format is also available as a searchable database running on Glitch. It includes links to download OCRd text from CloudStor. You can use this database to filter the titles and create your own list of books. Search results can be downloaded in CSV or JSON format.
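If you just want the text of a single book, you can fetch it directly from the public CloudStor share (the same share the database links point to). This is a minimal sketch, assuming the share URL pattern used later in this notebook; the file name is a hypothetical example of the slug-plus-nla.obj-id convention used when saving the files:

# Minimal sketch: fetch one OCRd text file from the public CloudStor share.
# The share URL pattern is the one used for the Datasette links below;
# the file name is a hypothetical example, not a real file.
import requests

share_url = 'https://cloudstor.aarnet.edu.au/plus/s/ugiw3gdijSKaoTL/download'
file_name = 'some-book-title-nla.obj-123456789.txt'  # hypothetical
response = requests.get(share_url, params={'path': file_name})
response.raise_for_status()
text = response.text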
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from tqdm.auto import tqdm
from IPython.display import display, FileLink
import pandas as pd
import json
import re
import time
import os
import arrow
from copy import deepcopy
from bs4 import BeautifulSoup
from slugify import slugify
import requests_cache
s = requests_cache.CachedSession()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))
# Add your Trove API key below
api_key = 'YOUR API KEY'
params = {
'key': api_key,
'zone': 'book',
'q': '"nla.obj" fullTextInd:y', # API v 2.1 added the full text indicator
'bulkHarvest': 'true',
'n': 100,
'encoding': 'json',
'l-availability': 'y',
'l-format': 'Book',
'include': 'links,workversions'
}
def get_total_results():
'''
Get the total number of results for a search.
'''
these_params = params.copy()
these_params['n'] = 0
response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)
data = response.json()
return int(data['response']['zone'][0]['records']['total'])
def get_fulltext_url(links):
'''
Loop through the identifiers to find a link to the full text version of the book.
'''
url = None
for link in links:
if link['linktype'] == 'fulltext' and 'nla.obj' in link['value']:
url = link['value']
break
return url
def get_version_record(record):
    '''
    Loop through a work's versions, looking for the metadata record
    supplied by the digitised collection (metadata source 'ANL:DL').
    '''
    for version in record.get('version', []):
        for version_record in version['record']:
            try:
                if version_record['metadataSource'].get('value') == 'ANL:DL':
                    return version_record
            except (AttributeError, TypeError, KeyError):
                pass
def join_list(record, key):
# A field may have a single value or an array.
# If it's an array, join the values into a string.
string_list = ''
if record:
value = record.get(key, [])
if not isinstance(value, list):
value = [value]
string_list = '|'.join(value)
return string_list
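To see what join_list does in practice, here's a quick illustration using made-up records shaped like the API responses:

# A list value is joined with pipes.
join_list({'type': ['Book', 'Book/Illustrated']}, 'type')  # 'Book|Book/Illustrated'
# A single value comes back unchanged.
join_list({'type': 'Book'}, 'type')  # 'Book'
# A missing record or key yields an empty string.
join_list(None, 'type')  # ''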
def harvest_books():
'''
Harvest metadata relating to digitised books.
'''
books = []
total = get_total_results()
start = '*'
these_params = params.copy()
with tqdm(total=total) as pbar:
while start:
these_params['s'] = start
response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)
data = response.json()
# The nextStart parameter is used to get the next page of results.
# If there's no nextStart then it means we're on the last page of results.
try:
start = data['response']['zone'][0]['records']['nextStart']
except KeyError:
start = None
for record in data['response']['zone'][0]['records']['work']:
# See if there's a link to the full text version.
if 'identifier' in record:
fulltext_url = get_fulltext_url(record['identifier'])
                    # I originally assumed that if this was a booky book (not a map or music etc),
                    # 'Book' would appear first in the list of types, and filtered on that.
                    # That's not a safe assumption, so the type check has been dropped.
                    # Save the record if there's a full text link.
if fulltext_url:
trove_id = re.search(r'(nla\.obj\-\d+)', fulltext_url).group(1)
# Get the basic metadata.
book = {
'title': record.get('title'),
'url': record.get('troveUrl'),
'contributors': join_list(record, 'contributor'),
'date': record.get('issued'),
'format': join_list(record, 'type'),
'fulltext_url': fulltext_url,
'trove_id': trove_id
}
                        # Add some extra info if available
version = get_version_record(record)
book['language'] = join_list(version, 'language')
book['rights'] = join_list(version, 'rights')
books.append(book)
# print(book)
if not response.from_cache:
time.sleep(0.2)
pbar.update(100)
return books
# Do the harvest!
books = harvest_books()
len(books)
40751
In order to download the OCRd text we need to know the number of pages in a work. This information is not available via the API, so we have to scrape it from the work's HTML page.
def get_work_data(url):
'''
Extract work data in a JSON string from the work's HTML page.
'''
response = s.get(url)
try:
work_data = re.search(r'var work = JSON\.parse\(JSON\.stringify\((\{.*\})', response.text).group(1)
except AttributeError:
work_data = '{}'
if not response.from_cache:
time.sleep(0.2)
return json.loads(work_data)
def get_pages(work):
'''
Get the number of pages from the work data.
'''
try:
pages = len(work['children']['page'])
except KeyError:
pages = 0
return pages
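As a quick sanity check, you can run these two helpers over a single digitised work. The identifier below is taken from the sample results further on; this assumes the work's web page still embeds its metadata in the same JSON structure:

# Quick check of get_work_data() and get_pages() on one work.
# nla.obj-688657424 appears in the harvested sample below (24 pages).
work = get_work_data('http://nla.gov.au/nla.obj-688657424')
print(get_pages(work))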
def get_volumes(parent_id):
'''
Get the ids of volumes that are children of the current record.
'''
start_url = 'https://nla.gov.au/{}/browse?startIdx={}&rows=20&op=c'
# The initial startIdx value
start = 0
# Number of results per page
n = 20
parts = []
    # Keep harvesting until a page returns fewer than 20 results,
    # which means we've reached the end.
while n == 20:
# Get the browse page
response = s.get(start_url.format(parent_id, start))
# Beautifulsoup turns the HTML into an easily navigable structure
soup = BeautifulSoup(response.text, 'lxml')
# Find all the divs containing issue details and loop through them
details = soup.find_all(class_='l-item-info')
for detail in details:
title = detail.find('h3')
if title:
issue_id = title.parent['href'].strip('/')
else:
issue_id = detail.find('a')['href'].strip('/')
# Get the issue id
parts.append(issue_id)
if not response.from_cache:
time.sleep(0.2)
# Increment the startIdx
start += n
# Set n to the number of results on the current page
n = len(details)
return parts
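To see what get_volumes returns, you can point it at a multi-volume container. The identifier below appears as a parent id in the duplicates table later in this notebook, so treat this as an illustrative sketch rather than a guaranteed example:

# Illustrative sketch: list the volume ids under a multi-volume container.
# nla.obj-477008239 appears as a parent id in the results below.
volume_ids = get_volumes('nla.obj-477008239')
print(len(volume_ids), volume_ids[:5])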
def add_pages(books):
'''
Add the number of pages to the metadata for each book.
Add volumes from multi volume books.
'''
books_with_pages = []
for book in tqdm(books):
# print(book['fulltext_url'])
work = get_work_data(book['fulltext_url'])
form = work.get('form')
pages = get_pages(work)
book['pages'] = pages
book['form'] = form
book['volume'] = ''
book['parent'] = ''
book['children'] = ''
# Multi volume books are containers with child volumes
# so we have to get the ids of each individual volume and process them
if pages == 0 and form == 'Multi Volume Book':
# Get child volumes
volumes = get_volumes(book['trove_id'])
# For each volume get details and add as a new book entry
for index, volume_id in enumerate(volumes):
volume = book.copy()
# Add link up to the container
volume['parent'] = book['trove_id']
volume['fulltext_url'] = 'http://nla.gov.au/{}'.format(volume_id)
volume['trove_id'] = volume_id
work = get_work_data(volume['fulltext_url'])
form = work.get('form')
pages = get_pages(work)
volume['form'] = form
volume['pages'] = pages
volume['volume'] = str(index + 1)
# print(volume)
books_with_pages.append(volume)
# Add links from container to volumes
book['children'] = '|'.join(volumes)
# print(book)
books_with_pages.append(book)
return books_with_pages
# Add number of pages to the book metadata
books_with_pages = add_pages(deepcopy(books))
Getting the page numbers takes quite a while, so it's a good idea to save the results to a CSV file before proceeding. That way, you won't have to repeat the process if something goes wrong and you lose the data that's sitting in memory.
df = pd.DataFrame(books_with_pages)
df.head()
title | url | contributors | date | format | fulltext_url | trove_id | language | rights | pages | form | volume | parent | children | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Goliath Joe, fisherman / by Charles Thackeray ... | https://trove.nla.gov.au/work/10013347 | Thackeray, Charles | 1900-1919 | Book|Book/Illustrated | https://nla.gov.au/nla.obj-2831231419 | nla.obj-2831231419 | English | Out of Copyright|http://rightsstatements.org/v... | 130 | Book | |||
1 | Grammar of the Narrinyeri tribe of Australian ... | https://trove.nla.gov.au/work/10029401 | Taplin, George | 1878-1880 | Book|Government publication | http://nla.gov.au/nla.obj-688657424 | nla.obj-688657424 | English | Out of Copyright|http://rightsstatements.org/v... | 24 | Book | |||
2 | The works of the Rev. Sydney Smith | https://trove.nla.gov.au/work/1004403 | Smith, Sydney, 1771-1845 | 1839-1900 | Book|Book/Illustrated|Microform | https://nla.gov.au/nla.obj-630176596 | nla.obj-630176596 | English | No known copyright restrictions|http://rightss... | 65 | Book | |||
3 | Nellie Doran : a story of Australian home and ... | https://trove.nla.gov.au/work/10049667 | Miriam Agatha | 1914-1923 | Book | http://nla.gov.au/nla.obj-24357566 | nla.obj-24357566 | English | Out of Copyright|http://rightsstatements.org/v... | 246 | Book | |||
4 | Lastkraftwagen 3 t Ford : Baumuster V 3000 S :... | https://trove.nla.gov.au/work/10053234 | Germany. Heer. Heereswaffenamt | 1942 | Book|Book/Illustrated|Government publication | https://nla.gov.au/nla.obj-51530748 | nla.obj-51530748 | German | Out of Copyright|http://rightsstatements.org/v... | 80 | Book |
# How many records?
df.shape
(42174, 14)
# How many have pages?
df.loc[df['pages'] != 0].shape
(31402, 14)
# How many of each format?
df['form'].value_counts()
Book                   29069
Digital Publication     9808
Multi Volume Book       2348
Picture                  523
Journal                  357
Manuscript                36
Other - General           14
Map                        2
Other - Australian         1
Name: form, dtype: int64
# Breakdown by language
df['language'].value_counts()
English                                       25674
                                              14284
Chinese                                        1219
French                                          210
Undetermined                                    193
German                                           92
Japanese                                         63
Dutch                                            56
Australian languages                             55
Austronesian (Other)                             55
Italian                                          31
Latin                                            31
Spanish                                          22
Maori                                            20
Swedish                                          19
Portuguese                                       16
Korean                                           15
Tahitian                                         13
Indonesian                                       12
Danish                                           11
Multiple languages                                8
Tongan                                            7
Greek, Modern (1453- )                            7
Finnish                                           7
Russian                                           6
Norwegian                                         5
Czech                                             4
Samoan                                            4
Thai                                              4
Polish                                            3
Fijian                                            2
Miscellaneous languages                           2
Papiamento                                        2
Malay                                             2
Welsh                                             2
Papuan (Other)                                    2
No linguistic content                             2
Tagalog                                           1
Niger-Kordofanian (Other)                         1
Sanskrit                                          1
Javanese                                          1
pol                                               1
Philippine (Other)                                1
Scottish Gaelic                                   1
Vietnamese                                        1
Yiddish                                           1
Hawaiian                                          1
Creoles and Pidgins, English-based (Other)        1
Irish                                             1
Gã                                                1
Nauru                                             1
Name: language, dtype: int64
# Save as CSV
df.to_csv('trove_digitised_books.csv', index=False)
display(FileLink('trove_digitised_books.csv'))
# Run this cell if you need to reload the books data from the CSV
df = pd.read_csv('trove_digitised_books.csv', keep_default_na=False)
books_with_pages = df.to_dict('records')
def save_ocr(books, output_dir='text'):
'''
Download the OCRd text for each book.
'''
os.makedirs(output_dir, exist_ok=True)
for book in tqdm(books):
# Default values
book['text_downloaded'] = False
book['text_file'] = ''
if book['pages'] != 0:
# print(book['title'])
# The index value for the last page of an issue will be the total pages - 1
last_page = book['pages'] - 1
file_name = '{}-{}.txt'.format(slugify(str(book['title'])[:50]), book['trove_id'])
file_path = os.path.join(output_dir, file_name)
# Check to see if the file has already been harvested
if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
# print('Already saved')
book['text_file'] = file_name
book['text_downloaded'] = True
else:
url = 'https://trove.nla.gov.au/{}/download?downloadOption=ocr&firstPage=0&lastPage={}'.format(book['trove_id'], last_page)
# print(url)
# Get the file
r = s.get(url)
# Check there was no error
if r.status_code == requests.codes.ok:
# Check that the file's not empty
r.encoding = 'utf-8'
if len(r.text) > 0 and not r.text.isspace():
# Check that the file isn't HTML (some not found pages don't return 404s)
if BeautifulSoup(r.text, 'html.parser').find('html') is None:
# If everything's ok, save the file
with open(file_path, 'w', encoding='utf-8') as text_file:
text_file.write(r.text)
# print('Saved')
book['text_file'] = file_name
book['text_downloaded'] = True
if not r.from_cache:
time.sleep(1)
save_ocr(books_with_pages, '/Volumes/bigdata/mydata/Trove/books')
The new books list includes the file name of the downloaded text file (if there is one), and a boolean field indicating if the text has been downloaded.
# Convert this to df
df_downloaded = pd.DataFrame(books_with_pages)
df_downloaded.head()
title | url | contributors | date | format | fulltext_url | trove_id | language | rights | pages | form | volume | parent | children | text_downloaded | text_file | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Goliath Joe, fisherman / by Charles Thackeray ... | https://trove.nla.gov.au/work/10013347 | Thackeray, Charles | 1900-1919 | Book|Book/Illustrated | https://nla.gov.au/nla.obj-2831231419 | nla.obj-2831231419 | English | Out of Copyright|http://rightsstatements.org/v... | 130 | Book | True | goliath-joe-fisherman-by-charles-thackeray-wob... | |||
1 | Grammar of the Narrinyeri tribe of Australian ... | https://trove.nla.gov.au/work/10029401 | Taplin, George | 1878-1880 | Book|Government publication | http://nla.gov.au/nla.obj-688657424 | nla.obj-688657424 | English | Out of Copyright|http://rightsstatements.org/v... | 24 | Book | True | grammar-of-the-narrinyeri-tribe-of-australian-... | |||
2 | The works of the Rev. Sydney Smith | https://trove.nla.gov.au/work/1004403 | Smith, Sydney, 1771-1845 | 1839-1900 | Book|Book/Illustrated|Microform | https://nla.gov.au/nla.obj-630176596 | nla.obj-630176596 | English | No known copyright restrictions|http://rightss... | 65 | Book | True | the-works-of-the-rev-sydney-smith-nla.obj-6301... | |||
3 | Nellie Doran : a story of Australian home and ... | https://trove.nla.gov.au/work/10049667 | Miriam Agatha | 1914-1923 | Book | http://nla.gov.au/nla.obj-24357566 | nla.obj-24357566 | English | Out of Copyright|http://rightsstatements.org/v... | 246 | Book | True | nellie-doran-a-story-of-australian-home-and-sc... | |||
4 | Lastkraftwagen 3 t Ford : Baumuster V 3000 S :... | https://trove.nla.gov.au/work/10053234 | Germany. Heer. Heereswaffenamt | 1942 | Book|Book/Illustrated|Government publication | https://nla.gov.au/nla.obj-51530748 | nla.obj-51530748 | German | Out of Copyright|http://rightsstatements.org/v... | 80 | Book | True | lastkraftwagen-3-t-ford-baumuster-v-3000-s-ger... |
# How many have been downloaded?
df_downloaded.loc[df_downloaded['text_downloaded'] == True].shape
(29652, 16)
Why is the number above different to the number of files actually downloaded? Let's have a look for duplicates.
As you can see below, some digitised works are linked to from multiple metadata records. Hence there are duplicates.
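If the duplicates fully account for the gap, counting distinct trove_id values among the records marked as downloaded should get us back to the number of files on disk (26,762):

# Count distinct digitised works among the records marked as downloaded.
df_downloaded.loc[df_downloaded['text_downloaded'] == True]['trove_id'].nunique()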
df_downloaded.loc[df_downloaded.duplicated('trove_id', keep=False) == True].sort_values('trove_id')
title | url | contributors | date | format | fulltext_url | trove_id | language | rights | pages | form | volume | parent | children | text_downloaded | text_file | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
25788 | Three weeks in Southland : being the account o... | https://trove.nla.gov.au/work/237350529 | Reid, Stuart, active 1884-1885 | 1885 | Book | https://nla.gov.au/nla.obj-101207695 | nla.obj-101207695 | English | Out of Copyright|http://rightsstatements.org/v... | 66 | Book | True | three-weeks-in-southland-being-the-account-of-... | |||
7469 | Three weeks in Southland : being the account o... | https://trove.nla.gov.au/work/19178390 | Reid, Stuart, active 1884-1885 | 1885 | Book | http://nla.gov.au/nla.obj-101207695 | nla.obj-101207695 | 66 | Book | 2 | nla.obj-477008239 | True | three-weeks-in-southland-being-the-account-of-... | |||
25790 | A recent visit to several of the Polynesian is... | https://trove.nla.gov.au/work/237350531 | Bennett, George, active 1830-1831 | 1831 | Book | https://nla.gov.au/nla.obj-101212925 | nla.obj-101212925 | English | No known copyright restrictions|http://rightss... | 8 | Book | True | a-recent-visit-to-several-of-the-polynesian-is... | |||
7771 | A recent visit to several of the Polynesian is... | https://trove.nla.gov.au/work/19241288 | Bennett, George, active 1830-1831 | 1831-1832 | Book/Illustrated|Book | http://nla.gov.au/nla.obj-101212925 | nla.obj-101212925 | 8 | Book | True | a-recent-visit-to-several-of-the-polynesian-is... | |||||
25807 | How Capt. Cook died : new light from an old book | https://trove.nla.gov.au/work/237350548 | 1908 | Book | https://nla.gov.au/nla.obj-101227721 | nla.obj-101227721 | English | No known copyright restrictions|http://rightss... | 10 | Book | True | how-capt-cook-died-new-light-from-an-old-book-... | ||||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
37508 | A Wonderful Illawarra waterfall : a rare beaut... | https://trove.nla.gov.au/work/24063846 | 1895 | Book | http://nla.gov.au/nla.obj-99671695 | nla.obj-99671695 | 1 | Book | True | a-wonderful-illawarra-waterfall-a-rare-beauty-... | ||||||
25811 | The Results of the census of 1871 : supplement... | https://trove.nla.gov.au/work/237350552 | 1873 | Book | https://nla.gov.au/nla.obj-99716940 | nla.obj-99716940 | English | No known copyright restrictions|http://rightss... | 2 | Book | True | the-results-of-the-census-of-1871-supplement-t... | ||||
4099 | The Results of the census of 1871 : supplement... | https://trove.nla.gov.au/work/17856108 | 1873 | Book | http://nla.gov.au/nla.obj-99716940 | nla.obj-99716940 | 2 | Book | True | the-results-of-the-census-of-1871-supplement-t... | ||||||
25795 | Regular packets for Australia : emigration to ... | https://trove.nla.gov.au/work/237350536 | 1850 | Book | https://nla.gov.au/nla.obj-99727992 | nla.obj-99727992 | English | No known copyright restrictions|http://rightss... | 1 | Book | True | regular-packets-for-australia-emigration-to-po... | ||||
909 | Regular packets for Australia : emigration to ... | https://trove.nla.gov.au/work/12328620 | 1850 | Book | http://nla.gov.au/nla.obj-99727992 | nla.obj-99727992 | 1 | Book | True | regular-packets-for-australia-emigration-to-po... |
6234 rows × 16 columns
# Save as CSV
df_downloaded.to_csv('trove_digitised_books_with_ocr.csv', index=False)
display(FileLink('trove_digitised_books_with_ocr.csv'))
To make it easy to explore the list of books, let's load the CSV file into Datasette. First we'll drop some columns, do some reordering, and add links to the downloaded text files stored on CloudStor.
df_datasette = df_downloaded.copy()
# Add link to Cloudstor
df_datasette['cloudstor_url'] = df_datasette.loc[df_datasette['text_downloaded'] == True]['text_file'].apply(lambda x: f'https://cloudstor.aarnet.edu.au/plus/s/ugiw3gdijSKaoTL/download?path={x}')
Remove some columns that aren't going to be useful.
df_datasette = df_datasette[['title', 'contributors', 'date', 'format', 'language', 'rights', 'pages', 'url', 'fulltext_url', 'cloudstor_url', 'form', 'volume', 'parent', 'children']]
Rename columns for clarity.
df_datasette.columns = ['title', 'contributors', 'date', 'format', 'language', 'copyright', 'pages', 'view_details_url', 'view_book_url', 'download_text_url', 'form', 'volume', 'parent', 'children']
df_datasette.head()
title | contributors | date | format | language | copyright | pages | view_details_url | view_book_url | download_text_url | form | volume | parent | children | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Goliath Joe, fisherman / by Charles Thackeray ... | Thackeray, Charles | 1900-1919 | Book|Book/Illustrated | English | Out of Copyright|http://rightsstatements.org/v... | 130 | https://trove.nla.gov.au/work/10013347 | https://nla.gov.au/nla.obj-2831231419 | https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd... | Book | |||
1 | Grammar of the Narrinyeri tribe of Australian ... | Taplin, George | 1878-1880 | Book|Government publication | English | Out of Copyright|http://rightsstatements.org/v... | 24 | https://trove.nla.gov.au/work/10029401 | http://nla.gov.au/nla.obj-688657424 | https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd... | Book | |||
2 | The works of the Rev. Sydney Smith | Smith, Sydney, 1771-1845 | 1839-1900 | Book|Book/Illustrated|Microform | English | No known copyright restrictions|http://rightss... | 65 | https://trove.nla.gov.au/work/1004403 | https://nla.gov.au/nla.obj-630176596 | https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd... | Book | |||
3 | Nellie Doran : a story of Australian home and ... | Miriam Agatha | 1914-1923 | Book | English | Out of Copyright|http://rightsstatements.org/v... | 246 | https://trove.nla.gov.au/work/10049667 | http://nla.gov.au/nla.obj-24357566 | https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd... | Book | |||
4 | Lastkraftwagen 3 t Ford : Baumuster V 3000 S :... | Germany. Heer. Heereswaffenamt | 1942 | Book|Book/Illustrated|Government publication | German | Out of Copyright|http://rightsstatements.org/v... | 80 | https://trove.nla.gov.au/work/10053234 | https://nla.gov.au/nla.obj-51530748 | https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd... | Book |
df_datasette.to_csv('trove-digital-books-datasette.csv', index=False)
This post describes how you can load your CSV files into Datasette using Glitch. Here's the result – a searchable database of Trove books available in digital form.
# Rename files to include truncated title of book
for row in df.itertuples():
    try:
        os.rename(
            os.path.join('text', '{}.txt'.format(row.trove_id)),
            os.path.join('text', '{}-{}.txt'.format(slugify(str(row.title)[:50]), row.trove_id))
        )
    except FileNotFoundError:
        pass
# Convert all filenames back to just nla.obj- form
for filename in [f for f in os.listdir('text') if f.endswith('.txt')]:
    try:
        objname = re.search(r'.*(nla\.obj.*)', filename).group(1)
    except AttributeError:
        # No nla.obj identifier in the file name -- just report it
        print(filename)
    else:
        os.rename(os.path.join('text', filename), os.path.join('text', objname))
Created by Tim Sherratt for the GLAM Workbench.
Work on this notebook was supported by the Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab.