Ever searched for items in RecordSearch and wanted to save the results as a CSV file, or in some other machine-readable format? This notebook walks you through the process of creating, managing, and saving item searches – all the way from search terms to downloadable dataset. You can even download all the images from items that have been digitised!
RecordSearch doesn't currently have an option for downloading machine-readable data. So to get collection metadata in a structured form, we have to resort to screen scraping. All the screen-scraping code used in this notebook is contained in the recordsearch_tools package.
If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!
Some tips:
Is this thing on? If you can't edit or run any of the code cells, you might be viewing a static (read only) version of this notebook. Click here to load a live version running on Binder.
from tqdm.auto import tqdm
from tinydb import TinyDB, Query
from tinydb.operations import increment
from pathlib import Path
import math
import time
from datetime import datetime
from recordsearch_tools.client import RSSearchClient, TooManyError
import pandas as pd
import json
import string
from IPython.display import display, FileLink, HTML
import requests
from PIL import Image
from io import BytesIO
from slugify import slugify
# Make sure the 'data' directory exists
Path('data').mkdir(exist_ok=True)
# This is a workaround for a problem with tqdm adding space to cells
HTML("""
<style>
.p-Widget.jp-OutputPrompt.jp-OutputArea-prompt:empty {
padding: 0;
border: 0;
}
</style>
""")
The available search parameters are the same as those in RecordSearch's Advanced Search form. There are lots of them, but you'll probably only end up using a few, like kw and series. Note that you can use * for wildcard searches, just as you can in the web interface. So setting kw to 'wragge*' will find both 'wragge' and 'wragges'.

- kw – string containing keywords to search for
- kw_options – how to interpret kw; one of the possible values treats kw as a phrase rather than a list of words
- kw_exclude – string containing keywords to exclude from the search
- kw_exclude_options – how to interpret kw_exclude; one of the possible values treats it as a phrase rather than a list of words
- search_notes – set to 'on' to search item notes as well as metadata
- series – search for items in this series
- series_exclude – exclude items from this series
- control – search for items matching this control symbol
- control_exclude – exclude items matching this control symbol
- item_id – search for items with this item ID number (formerly called barcode)
- date_from – search for items with a date (year) greater than or equal to this, eg. '1935'
- date_to – search for items with a date (year) less than or equal to this
- formats – limit search to items in a particular format, see possible values below
- formats_exclude – exclude items in a particular format, see possible values below
- locations – limit search to items held in a particular location, see possible values below
- locations_exclude – exclude items held in a particular location, see possible values below
- access – limit to items with a particular access status, see possible values below
- access_exclude – exclude items with a particular access status, see possible values below
- digital – set to 'on' to limit to items that are digitised

Possible values for formats and formats_exclude:

Possible values for locations and locations_exclude:

Possible values for access and access_exclude:
Once you've decided on your parameters you can use them to create a search. For example, if we wanted to find all items that included the word 'wragge' and were digitised, our parameters would be:
kw='wragge'
digital='on'
# Create a search client
c = RSSearchClient()
# Feed the search parameters to the client and save the results
results = c.search(kw='wragge', digital='on')
Now we can have a look to see how many results there are in the complete results set.
# Display total results
results['total_results']
Note that the search client only gets one page of results (containing 20 items) at a time. You can check this.
# How many results do we actually have
len(results['results'])
Let's have a look at the first item.
results['results'][0]
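Each result is just a Python dictionary, so you can pull out individual fields by name. Here's a quick sketch, assuming field names like title and digitised_status that appear elsewhere in this notebook:
# Grab the first item and display a couple of its fields
item = results['results'][0]
print(item['title'])
print(item['digitised_status'])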
There are some additional parameters that affect the way the search results are delivered. We'll use some of these to harvest the complete results set.
- page – return a specific page of search results
- sort – return results in a specified order (for example, sort=9 orders results by barcode)
- digitised – set to True (default) or False to control whether to include the number of pages in each digitised file (if True, extra requests are made to get this information, which slows things down a bit)

So to get the second page of results from the search above:
results = c.search(kw='wragge', digital='on', page=2)
The first item in our result set should be different, because it's coming from the second page of results.
results['results'][0]
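You can also combine any number of the search parameters above in a single call. Here's a rough example – the series identifier and date range are just placeholder values for illustration:
# Combine a wildcard keyword with a series and a date range
# (series 'A1' and the dates are placeholders, not a recommended search)
combined = c.search(kw='wragge*', series='A1', date_from='1900', date_to='1920')
combined['total_results']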
Ok, we've learnt how to create a search and get back some data, but only getting the first 20 results is not so useful. What if our search contains hundreds or thousands of items? How do we get them all?
To save everything, we have to loop through each page in the result set, saving the results as we go. The functions below do just that.
But wait! You might have noticed that RecordSearch only displays results for searches that return fewer than 20,000 items. Because the screen scraper is just extracting details from the RecordSearch web pages, the 20,000 limit applies here as well. If your search has more than 20,000 results, you'll need to narrow it down using additional parameters.
The main function below is harvest_items(). You just give it any of the search parameters listed above. It will loop through all the pages in the result set, saving the items to a simple JSON database using TinyDB.

The database will be created in a directory named with a timestamp (identifying when the harvest was started) inside the harvests directory, for example harvests/1567492794/db-items.json. There are more functions for using and managing the db files below.
def get_total_results(client, **kwargs):
'''
Get the total number of results returned by a search.
'''
try:
# Get the first page of results, passing digitised=False to speed things up
results = client.search(digitised=False, **kwargs)
# Get the total number of results
total = results['total_results']
# Uh oh there are more than 20,000 results
except TooManyError:
print('There are more than 20,000 results.')
total = None
return total
def harvest_items(start=1, db_path=None, check_duplicates=False, **kwargs):
'''
Harvest items from a search and save them to a database.
Supply any of the search parameters listed above.
Set check_duplicates to True if you want to check for possible duplicates (probably not necessary in most cases).
'''
# Initiate the client
client = RSSearchClient()
# Get the total number of results returned by this search
total_results = get_total_results(client, **kwargs)
# If the number of results is between 1 and 20,000 we can harvest!
if total_results:
# Calculate the number of results pages
total_pages = math.ceil(int(total_results) / client.results_per_page)
# Get the current timestamp to add into the meta
timestamp = int(time.time())
# We're creating a new db
if not db_path:
# Use the timestamp to create a directory & file name for the db
harvest_path = Path('harvests', str(timestamp))
harvest_path.mkdir(parents=True, exist_ok=True)
db_path = Path(harvest_path, 'db-items.json')
# Create the new db
db = TinyDB(db_path)
else:
# If we have an existing db, open it
db = TinyDB(db_path)
# Save the details of this harvest (or restart) to the 'meta' table of the database
# This keeps the query metadata with the results, and helps us to restart the harvest if necessary.
db.table('meta').insert({
'timestamp': timestamp,
'total_results': int(total_results),
'total_pages': total_pages,
'start_page': start,
'pages_harvested': start - 1,
'results_per_page': client.results_per_page,
'params': kwargs
})
# Loop through the range of pages
for page in tqdm(range(start, total_pages + 1), unit='page', desc='Pages:'):
# Get results from each page
# Note that sort is set to 9 (barcode) to make sure the pages stay in the same order
# If we don't set the sort param we can end up getting duplicates and missing records
items = client.search(sort=9, page=page, **kwargs)
# If check_duplicates is set to true we'll loop through each result individually
# This will slow things down a little, and is probably unnecessary unless you're stitching together multiple harvests
if check_duplicates is True:
Record = Query()
# Loop through results
for item in items['results']:
# If they're not in the db, then add them!
if db.table('items').contains(Record.identifier == item['identifier']) is False:
db.table('items').insert(item)
else:
# Save the results from this page to the db in one hit
db.table('items').insert_multiple(items['results'])
Meta = Query()
db.table('meta').update(increment('pages_harvested'), Meta.timestamp == timestamp)
# Pause briefly
time.sleep(0.5)
return db_path
# Create a harvest!
harvest_items(series='J3115')
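harvest_items() returns the path to the database it creates, so you can capture it and open the db directly with TinyDB if you want to poke around in the raw results. A minimal sketch (note this runs the small harvest again, just for illustration):
# Capture the db path returned by harvest_items()
db_path = harvest_items(series='J3115')
# Open the database and check how many items were saved
db = TinyDB(db_path)
print(len(db.table('items').all()))
# Find just the digitised items
Record = Query()
print(len(db.table('items').search(Record.digitised_status == True)))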
If you're doing a large harvest, you might find that it fails part way through. You might also want to check on the details of a past harvest, or even reharvest a query to see if anything new has been added. Because we've saved the harvest metadata and results into a TinyDB database, it's easy to perform some basic checks and management tasks.
There are three main functions defined below:
- harvest_report() – prints basic details of a harvest
- harvest_restart() – restarts a failed harvest
- reharvest_items() – creates a new harvest using the query settings of an existing harvest

In each case you can specify the path to an existing harvest database, something like harvests/1567480717/db-items.json. If you don't specify a database, the function will assume you want the most recent one.

Here's an example of the output from harvest_report():
Harvest started: 2019-09-03 15:21:11
Items harvested: 200 of 200
{'timestamp': 1567488071,
'total_results': 200,
'total_pages': 10,
'results_per_page': 20,
'params': {'kw': 'wragge'}}
def get_latest_db():
'''
Get the database created by the most recent harvest.
'''
p = Path('harvests')
harvests = sorted([d for d in p.iterdir() if d.is_dir()])
try:
latest = Path(harvests[-1], 'db-items.json')
except IndexError:
print('No databases')
latest = None
return latest
def get_db(db_path):
'''
Get a harvest database.
If db_path is supplied then return that db.
If not, then return the most recently created db and path.
'''
    db = None
    if not db_path:
        db_path = get_latest_db()
    if db_path:
        db = TinyDB(db_path)
return (db_path, db)
def harvest_report(db_path=None):
'''
Print a report of the specified harvest.
If db_path is not supplied, display details from the most recently created harvest.
'''
_, db = get_db(db_path)
if db is not None:
meta = db.table('meta').all()[0]
date = datetime.fromtimestamp(meta['timestamp']).strftime('%Y-%m-%d %H:%M:%S')
items_harvested = len(db.table('items').all())
print(f'Harvest started: {date}')
print(f'Items harvested: {items_harvested} of {meta["total_results"]}\n')
display(meta)
def harvest_restart(db_path=None):
'''
Attempt to restart the specified harvest.
If db_path is not supplied, restart the most recently created harvest.
'''
db_path, db = get_db(db_path)
if db is not None:
meta = db.table('meta').all()[-1]
pages_harvested = meta['pages_harvested']
if pages_harvested < meta['total_pages']:
start = pages_harvested + 1
harvest_items(db_path=str(db_path), start=start, **meta['params'])
else:
print('Harvest complete')
def reharvest_items(db_path=None):
'''
Harvest items using the parameters of the specified db.
If db_path is not supplied, use the most recently created harvest.
'''
_, db = get_db(db_path)
if db is not None:
meta = db.table('meta').all()[0]
harvest_items(**meta['params'])
# Display details of the most recent harvest
# Optionally, supply the path to an existing db, eg: harvest_report('harvests/1567480968/db-items.json')
harvest_report()
# Restart the most recent harvest
# Optionally, supply the path to an existing db, eg: harvest_restart('harvests/1567480968/db-items.json')
harvest_restart()
# Create a new harvest using the parameters of the most recent harvest
# Optionally, supply the path to an existing db, eg: reharvest_items('harvests/1567480968/db-items.json')
reharvest_items()
Although your harvest is already saved in a TinyDB database, you might want to convert it to a simpler format for download and analysis. The functions below provide two options:
- save_harvest_as_json() – save the harvested items as a JSON file
- save_harvest_as_csv() – save the harvested items as a CSV file

The columns in the CSV-formatted file are:

- identifier – the barcode number
- series – identifier of the series which contains the item
- control_symbol – individual control symbol
- title – title of the item
- contents_date_str – the contents date string as in RecordSearch
- contents_start_date – the first date in the contents date string, converted to ISO format
- contents_end_date – the second date in the contents date string, converted to ISO format
- location – where the file is held
- access_status – 'Closed', 'Open', 'OWE', or 'NYE'
- digitised_status – True/False, has the file been digitised
- digitised_pages – number of pages in the digitised file

def save_harvest_as_json(db_path=None):
'''
Save harvested items as a json file.
If db_path is not supplied, use the most recently created harvest.
'''
_, db = get_db(db_path)
if db is not None:
# Get harvest metadata
meta = db.table('meta').all()[0]
# Get harvested items
items = db.table('items').all()
# Set file name and path
filename = Path(f'harvests/{meta["timestamp"]}/items.json')
# Dump items to a JSON file
with open(filename, 'w') as json_file:
json.dump(items, json_file)
# Display link to download
display(FileLink(filename))
def save_harvest_as_csv(db_path=None):
'''
Save harvested items as a CSV file.
If db_path is not supplied, use the most recently created harvest.
'''
_, db = get_db(db_path)
if db is not None:
# Get harvest metadata
meta = db.table('meta').all()[0]
# Get harvested items
items = db.table('items').all()
# Flatten the nested date field using json_normalize and convert the items to a dataframe
df = pd.DataFrame(pd.json_normalize(items))
# Rename the date columns
df.rename(columns={'contents_dates.date_str': 'contents_date_str', 'contents_dates.start_date': 'contents_start_date', 'contents_dates.end_date': 'contents_end_date'}, inplace=True)
# Put the columns in a nice order
df = df[['identifier', 'series', 'control_symbol', 'title', 'contents_date_str', 'contents_start_date', 'contents_end_date', 'location', 'access_status', 'digitised_status', 'digitised_pages']]
# Set file name and path
filename = Path(f'harvests/{meta["timestamp"]}/items.csv')
# Save as CSV
df.to_csv(filename, index=False)
# Display link to download
display(FileLink(filename))
# Save the most recent harvest as a json file
# Optionally, supply the path to an existing db, eg: save_harvest_as_json('harvests/1567480968/db-items.json')
save_harvest_as_json()
# Save the most recent harvest as a CSV file
# Optionally, supply the path to an existing db, eg: save_harvest_as_csv('harvests/1567480968/db-items.json')
save_harvest_as_csv()
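Once the CSV has been saved, you can load it back into pandas for a quick look at your data. This is just a sketch – the timestamp in the path is an example of the harvests/<timestamp>/items.csv pattern used above:
# Load the saved CSV into a dataframe -- replace the timestamp with your own harvest's
df = pd.read_csv('harvests/1567492794/items.csv', parse_dates=['contents_start_date', 'contents_end_date'])
# How many items have each access status?
df['access_status'].value_counts()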
You can harvest a complete series using the examples above. Just set the series search parameter to the series identifier. For example:
harvest_items(series='A6119')
But what do you do if the series contains more than 20,000 items? One way of splitting the series up into harvestable chunks is to use wildcard values and the control search parameter. For example, B13 has more than 20,000 items, but if we limit the results to items with a control symbol starting with '1', we bring the number down to under 20,000:
harvest_items(series='B13', control='1*')
To make sure we get everything in the series we can repeat the harvest using a range of prefixes for the control symbol – the easiest approach is simply to loop through each letter and number from A to Z and 0 to 9. That's exactly what the harvest_large_series() function below does. The key thing is the control_range parameter. By default, it is a list that looks like this:
['A*', 'B*', 'C*', 'D*', 'E*', 'F*', 'G*', 'H*', 'I*', 'J*', 'K*', 'L*', 'M*', 'N*', 'O*', 'P*', 'Q*', 'R*', 'S*', 'T*', 'U*', 'V*', 'W*', 'X*', 'Y*', 'Z*', '0*', '1*', '2*', '3*', '4*', '5*', '6*', '7*', '8*', '9*']
Note that it's possible to get duplicate items this way because some items include earlier versions of control symbols and these are searched as well as the current ones. We can filter out the duplicates by asking the harvest function to check whether an item is already in the db before saving it.
def harvest_large_series(series, control_range=None, restart=False, db_path=None):
'''
RecordSearch will not return more than 20,000 results.
If a series has more than 20,000 items you'll need to break it up.
The easiest way to do this is to add a param for control_symbol.
This function will break a series harvest down into a series of smaller harvests --
using each letter and number with a wildcard as the control_symbol parameter.
This should be enough to harvest most large series, but in some cases you might need to supply a custom list of control_symbol prefixes.
'''
start = 1
# If you don't supply a range of prefixes, use A-Z and 0-9
if not control_range:
control_range = [letter + '*' for letter in string.ascii_uppercase] + [str(number) + '*' for number in range(0, 10)]
# Try to restart a failed harvest
if restart is True:
# Get the current db
db_path, db = get_db(db_path)
# Get the most recent metadata
meta = db.table('meta').all()[-1]
# Get the last used control
control = meta['params']['control']
# Limit the prefixes to ones that haven't been used yet
control_range = control_range[control_range.index(control):]
# Work out which page to start from
pages_harvested = meta['pages_harvested']
if pages_harvested < meta['total_pages']:
start = pages_harvested + 1
# Loop through control prefixes initiating a new harvest for each
# After the first harvest we'll have a value for db_path, which we can then use in subsequent calls.
# This means all the separate harvests will be saved in a single db
for control in control_range:
# Start a new harvest using the control symbol. Set harvester to check for duplicates.
db_path = harvest_items(db_path=db_path, start=start, check_duplicates=True, series=series, control=control)
start = 1
# Harvest a series with more than 20,000 items
harvest_large_series(series='B13')
However, some series can't easily be sliced up into chunks of fewer than 20,000 items. Either they're extremely large, or their range of control symbol prefixes is very small. In these cases you have to experiment to find a list of prefixes that will work. How do you know whether it works? The function below lets you test a range of control symbol prefixes to check whether the slices return fewer than 20,000 items. If you give it a series identifier, it will use the default range of prefixes described above. But you can also feed it a customised list.
def test_control_prefixes(series, control_range=None):
'''
Test a range of control symbol prefixes to see if they split a series up into chunks of less than 20,000 items.
Prints the number of results for each prefix, or 'More than 20,000' if greater than 20,000.
'''
c = RSSearchClient()
if not control_range:
control_range = [letter + '*' for letter in string.ascii_uppercase] + [str(number) + '*' for number in range(0, 10)]
for control in control_range:
try:
results = c.search(series=series, control=control)
total = results['total_results']
except TooManyError:
total = 'More than 20,000'
print(f'{control}: {total}')
For example, series A1 has more than 60,000 items and uses the year as the control symbol prefix. In this case we can create a list of control symbol prefixes to slice the series up by year-like combinations of digits. The following cell creates such a list.
# For series like A1 that use the year as the control symbol prefix, this range should work.
control_range = [str(num) + '*' for num in range(2,10)] + ['1{}*'.format(num2) for num2 in [str(num) for num in range(0,9)]] + ['19{}*'.format(num2) for num2 in [str(num) for num in range(1,10)]]
When the cell above is run, it creates a list that looks like this:
['2*', '3*', '4*', '5*', '6*', '7*', '8*', '9*', '10*', '11*', '12*', '13*', '14*', '15*', '16*', '17*', '18*', '191*', '192*', '193*', '194*', '195*', '196*', '197*', '198*', '199*']
Note how it splits years in the 20th century by decade? To test that this will actually work, we can feed this list to the test_control_prefixes() function.
test_control_prefixes('A1', control_range=control_range)
The test_control_prefixes() function prints out the following results:
2*: 0
3*: 2
4*: 0
5*: 0
6*: 0
7*: 0
8*: 0
9*: 2
10*: 1
11*: 0
12*: 0
13*: 0
14*: 0
15*: 0
16*: 0
17*: 1
18*: 0
191*: 17501
192*: 18908
193*: 18247
194*: 0
195*: 0
196*: 0
197*: 0
198*: 0
199*: 2
Note that all the slices contain fewer than 20,000 items. Yay, it works! You might also notice that while most of the control symbol prefixes are years, there are a few oddities as well. In general, you can't rely on control symbols being consistent. Even if the series notes say they use numbers, you need to check to make sure no letters have snuck in. So basically it takes a fair bit of trial and error!
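If you want a rough check that a set of prefixes covers the whole series, you can collect the counts rather than just printing them. This is only a sketch along the lines of test_control_prefixes() above – remember that overlaps caused by superseded control symbols can inflate the total slightly.
def count_control_prefixes(series, control_range):
    '''
    Like test_control_prefixes(), but returns a dict of counts so you can add them up.
    '''
    c = RSSearchClient()
    counts = {}
    for control in control_range:
        try:
            counts[control] = int(c.search(series=series, control=control)['total_results'])
        except TooManyError:
            counts[control] = None
    return counts

# Sum the slices for series A1, ignoring any that were over the 20,000 limit
counts = count_control_prefixes('A1', control_range=control_range)
print(sum(n for n in counts.values() if n))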
Once we've defined a custom control range as above, we can feed it to harvest_large_series() to use in place of the default.
# Use custom range to harvest a large series
harvest_large_series(series='A1', control_range=control_range)
Once you've saved all the metadata from your search, you can use it to download images from all the items that have been digitised.
The function below will look for all items in a harvest db that have a digitised_status of True, and then download all the images from them. The images will be saved in an images folder inside the original harvest's timestamped directory, eg. harvests/1567492815/images.

Within that folder, the images from each item will be saved in a separate folder, named using the series, control_symbol, and identifier values. So the folder a2487-1919-8962-156686 contains images from a file in series A2487, with the control symbol 1919/8962, and the barcode 156686. Images are named with the identifier (barcode) and page number, eg: 156686-1.jpg.
The function also saves some image metadata back into the harvest db, specifically:
- image_dir – the path where the image has been saved
- image_name – the filename of the image
- identifier – the item barcode
- page – the page number
- width – the width of the image (in pixels)
- height – the height of the image (in pixels)

def harvest_images(db_path=None):
'''
Download images from all the digitised files in a harvest.
If db_path is not supplied, use the most recently created harvest.
'''
# Get the harvest db
_, db = get_db(db_path)
if db is not None:
# Get the metadata
meta = db.table('meta').all()[0]
# Find items where digitised_status is True
Record = Query()
items = db.table('items').search(Record.digitised_status == True)
# Loop through digitised items
for item in tqdm(items, desc='Items'):
# Set name of the folder in which to save the images
image_dir = Path(f'harvests/{meta["timestamp"]}/images/{slugify(item["series"])}-{slugify(item["control_symbol"])}-{item["identifier"]}')
# Create the folder (and parent if necessary)
image_dir.mkdir(exist_ok=True, parents=True)
# Loop through the page numbers
for page in tqdm(range(1, item['digitised_pages'] + 1), desc='Pages', leave=False):
# Define the image filename using the barcode and page number
filename = Path(f'{image_dir}/{item["identifier"]}-{page}.jpg')
# Check to see if the image already exists (useful if rerunning a failed harvest)
if not filename.exists():
# If it doesn't already exist then download it
img_url = f'http://recordsearch.naa.gov.au/NaaMedia/ShowImage.asp?B={item["identifier"]}&S={page}&T=P'
response = requests.get(img_url)
response.raise_for_status()
# Try opening the file as an image
try:
image = Image.open(BytesIO(response.content))
except IOError:
print('Not an image')
else:
# If it's an image, get its dimensions
width, height = image.size
# Save the image
image.save(filename)
# Create image metadata
image_meta = {
'image_dir': str(image_dir),
'image_name': '{}-{}.jpg'.format(item['identifier'], page),
'identifier': item['identifier'],
'page': page,
'width': width,
'height': height
}
# Add/update the image metadata
db.table('images').upsert(image_meta, Record.image_name == image_meta['image_name'])
# Pause
time.sleep(0.5)
# Download images from the most recent harvest
# Optionally, supply the path to an existing db, eg: harvest_images('harvests/1567480968/db-items.json')
harvest_images()
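If you just want a single page from a digitised file, you can request it directly using the same image URL pattern that harvest_images() uses. Here's a rough example – barcode 156686 is the example file mentioned above:
# Download one page of a digitised file using its barcode
barcode = 156686  # example barcode from the folder-naming example above
page = 1
img_url = f'http://recordsearch.naa.gov.au/NaaMedia/ShowImage.asp?B={barcode}&S={page}&T=P'
response = requests.get(img_url)
response.raise_for_status()
image = Image.open(BytesIO(response.content))
print(image.size)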
It might be handy to have all the image metadata in a single CSV file.
def save_metadata_as_csv(db_path=None):
'''
Save metadata of harvested images as a CSV file.
If db_path is not supplied, use the most recently created harvest.
'''
_, db = get_db(db_path)
if db is not None:
# Get harvest metadata
meta = db.table('meta').all()[0]
# Get harvested images
images = db.table('images').all()
# Convert the image metadata to a dataframe
df = pd.DataFrame(images)
# Set file name and path
filename = Path(f'harvests/{meta["timestamp"]}/images.csv')
# Save as CSV
df.to_csv(filename, index=False)
# Display link to download
display(FileLink(filename))
# Save the image metadata and display a link
# Optionally, supply the path to an existing db, eg: save_metadata_as_csv('harvests/1567480968/db-items.json')
save_metadata_as_csv()
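As with the item metadata, you can load the saved image metadata back into pandas for a quick overview. A small sketch (again, the timestamp in the path is just an example):
# Load the image metadata and summarise it -- replace the timestamp with your own harvest's
df = pd.read_csv('harvests/1567492794/images.csv')
print(f'{len(df)} images from {df["identifier"].nunique()} digitised files')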
Coming soon, notebooks to help you explore your harvested data...
Created by Tim Sherratt as part of the GLAM Workbench.