Harvest items from a search in RecordSearch

Ever searched for items in RecordSearch and wanted to save the results as a CSV file, or in some other machine-readable format? This notebook walks you through the process of creating, managing, and saving item searches – all the way from search terms to downloadable dataset. You can even download all the images from items that have been digitised!

RecordSearch doesn't currently have an option for downloading machine-readable data. So to get collection metadata in a structured form, we have to resort to screen-scraping. All the screen-scraping code used in this notebook is contained in the recordsearch_tools package.

If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!

Some tips:

  • Code cells have boxes around them.
  • To run a code cell click on the cell and then hit Shift+Enter. The Shift+Enter combo will also move you to the next cell, so it's a quick way to work through the notebook.
  • While a cell is running, a * appears in the square brackets next to the cell. Once the cell has finished running, the asterisk will be replaced with a number.
  • In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.
  • To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.

Is this thing on? If you can't edit or run any of the code cells, you might be viewing a static (read only) version of this notebook. Click here to load a live version running on Binder.

1. Import what we need

In [310]:
from tqdm.auto import tqdm
from tinydb import TinyDB, Query
from tinydb.operations import increment
from pathlib import Path
import math
import time
from datetime import datetime
from recordsearch_tools.client import RSSearchClient, TooManyError
import pandas as pd
from pandas.io.json import json_normalize
import json
import string
from IPython.display import display, FileLink, HTML
import requests
from PIL import Image
from io import BytesIO
from slugify import slugify  # python-slugify; used later by harvest_images() to create folder names

# Make sure the 'data' directory exists
Path('data').mkdir(exist_ok=True)

# This is a workaround for a problem with tqdm adding space to cells
HTML("""
    <style>
    .p-Widget.jp-OutputPrompt.jp-OutputArea-prompt:empty {
      padding: 0;
      border: 0;
    }
    </style>
""")
Out[310]:

2. Available search parameters

The available search parameters are the same as those in RecordSearch's Advanced Search form. There's lots of them, but you'll probably only end up using a few like kw and series. Note that you can use * for wildcard searches as you can in the web interface. So setting kw to 'wragge*' will find both 'wragge' and 'wragges'.

  • kw – string containing keywords to search for
  • kw_options – how to interpret kw, possible values are:
    • 'ALL' – return results containing all of the keywords (default)
    • 'ANY' – return results containing any of the keywords
    • 'EXACT' – treat kw as a phrase rather than a list of words
  • kw_exclude – string containing keywords to exclude from search
  • kw_exclude_options – how to interpret kw_exclude, possible values are:
    • 'ALL' – exclude results containing all of the keywords (default)
    • 'ANY' – exclude results containing any of the keywords
    • 'EXACT' – treat kw_exclude as a phrase rather than a list of words
  • search_notes – set to 'on' to search item notes as well as metadata
  • series – search for items in this series
  • series_exclude – exclude items from this series
  • control – search for items matching this control symbol
  • control_exclude – exclude items matching this control symbol
  • barcode – search for items with this barcode number
  • date_from – search for items with a date (year) greater than or equal to this, eg. '1935'
  • date_to – search for items with a date (year) less than or equal to this
  • formats – limit search to items in a particular format, see possible values below
  • formats_exclude – exclude items in a particular format, see possible values below
  • locations – limit search to items held in a particular location, see possible values below
  • locations_exclude – exclude items held in a particular location, see possible values below
  • access – limit to items with a particular access status, see possible values below
  • access_exclude – exclude items with a particular access status, see possible values below
  • digital – set to 'on' to limit to items that are digitised

Possible values for formats and formats_exclude:

  • 'Paper files and documents'
  • 'Index cards'
  • 'Bound volumes'
  • 'Cartographic records'
  • 'Photographs'
  • 'Microforms'
  • 'Audio-visual records'
  • 'Audio records'
  • 'Electronic records'
  • '3-dimensional records'
  • 'Scientific specimens'
  • 'Textiles'

Possible values for locations and locations_exclude:

  • 'NAT, ACT'
  • 'Adelaide'
  • 'Australian War Memorial'
  • 'Brisbane'
  • 'Darwin'
  • 'Hobart'
  • 'Melbourne'
  • 'Perth'
  • 'Sydney'

Possible values for access and access_exclude:

  • 'OPEN'
  • 'OWE'
  • 'CLOSED'
  • 'NYE'

3. Creating a search

Once you've decided on your parameters you can use them to create a search. For example, if we wanted to find all items that included the word 'wragge' and were digitised, our parameters would be:

  • kw='wragge'
  • digital='on'
In [194]:
# Create a search client
c = RSSearchClient()
In [195]:
# Feed the search parameters to the client and save the results
results =  c.search(kw='wragge', digital='on')

Now we can have a look at how many results there are in the complete results set. Note that the total is returned as a string, so the harvesting code below converts it with int() before doing any calculations.

In [196]:
# Display total results
results['total_results']
Out[196]:
'33'

Note that the search client only gets one page of results (containing 20 items) at a time. You can check this.

In [200]:
# How many results do we actually have
len(results['results'])
Out[200]:
20

Let's have a look at the first item.

In [199]:
results['results'][0]
Out[199]:
{'series': 'A2479',
 'control_symbol': '17/1306',
 'title': 'The Wragge Estate. Property for sale.',
 'access_status': 'Open',
 'location': 'Canberra',
 'contents_dates': {'date_str': '1917 - 1917',
  'start_date': '1917',
  'end_date': '1917'},
 'digitised_status': True,
 'digitised_pages': 4,
 'identifier': '149309'}

4. Changing how your search results are delivered

There are some additional parameters that affect the way the search results are delivered. We'll use some of these to harvest the complete results set.

  • page – return a specific page of search results
  • sort – return results in a specified order, possible values:
    • 1 – series and control symbol
    • 3 – title
    • 5 – start date
    • 7 – digitised items first
    • 12 – items with pdfs first
    • 9 – barcode
    • 11 – audio visual items first
  • digitised – set to True (default) or False to control whether to include the number of pages in each digitised file (if True, extra requests are made to get the info which slows things down a bit)

So to get the second page of results from the search above.

In [29]:
results =  c.search(kw='wragge', digital='on', page=2)

The first item in our result set should be different, because it's coming from the second page of results.

In [30]:
results['results'][0]
Out[30]:
{'series': 'A6770',
 'control_symbol': 'WRAGGE K C',
 'title': 'WRAGGE KEITH CLEMENT : Service Number - B/2680 : Date of birth - 16 Jan 1922 : Place of birth - BRISBANE QLD : Place of enlistment - BRISBANE : Next of Kin - RUPERT',
 'access_status': 'Open',
 'location': 'Canberra',
 'contents_dates': {'date_str': '1939 - 1948',
  'start_date': '1939',
  'end_date': '1948'},
 'digitised_status': True,
 'digitised_pages': 2,
 'identifier': '4523493'}
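
The other delivery parameters work in the same way. For example, here's a quick sketch (using the same search) that sorts the results by start date and skips the extra requests for digitised page counts:

# Sort by start date (5) and skip fetching the number of pages in each digitised file
results = c.search(kw='wragge', digital='on', sort=5, digitised=False)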

5. Harvesting a complete set of (less than 20,000) results

Ok, we've learnt how to create a search and get back some data, but only getting the first 20 results is not so useful. What if our search contains hundreds or thousands of items? How do we get them all?

To save everything, we have to loop through each page in the result set, saving the results as we go. The functions below do just that.

But wait! You might have noticed that RecordSearch only displays results for searches that return fewer than 20,000 items. Because the screen scraper is just extracting details from the RecordSearch web pages, the 20,000 limit applies here as well. If your search has more than 20,000 results, you'll need to narrow it down using additional parameters.

The main function below is harvest_items(). You just give it any of the search parameters listed above. It will loop through all the pages in the result set, saving the items to a simple JSON database using TinyDB.

The database will be created in the data directory. Its name will include a timestamp, identifying the time at which the harvest was started. For example, db-items-1567492794.json. There are more functions for using and managing the db files below.
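
If you want to poke around in one of these databases yourself, here's a minimal sketch using TinyDB directly (the file name is just the example above; substitute the name of your own harvest db):

# Open a harvest db and inspect its contents
db = TinyDB('data/db-items-1567492794.json')

# The 'meta' table records the details of the harvest
db.table('meta').all()

# The 'items' table holds the harvested item records
len(db.table('items'))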

In [303]:
def get_total_results(client, **kwargs):
    '''
    Get the total number of results returned by a search.
    '''
    try:
        # Get the first page of results, passing digitised=False to speed things up
        results = client.search(digitised=False, **kwargs)
        
        # Get the total number of results
        total = results['total_results']
        
    # Uh oh there are more than 20,000 results
    except TooManyError:
        print('There are more than 20,000 results.')
        total = None
    return total

def harvest_items(start=1, db_path=None, check_duplicates=False, **kwargs):
    '''
    Harvest items from a search and save them to a database.
    Supply any of the search parameters listed above.
    
    Set check_duplicates to True if you want to check for possible duplicates (probably not necessary in most cases).
    '''
    # Initiate the client
    client = RSSearchClient()
    
    # Get the total number of results returned by this search
    total_results = get_total_results(client, **kwargs)
    
    # If the number of results is between 1 and 20,000 we can harvest!
    if total_results:
        
        # Calculate the number of results pages
        total_pages = math.ceil(int(total_results) / client.results_per_page)
        
        # We're creating a new db
        if not db_path:
            
            # Get the current timestamp and use it to create a file name for the db
            timestamp = int(time.time())
            db_path = Path(f'data/db-items-{timestamp}.json')
            
            # Create a new db using the timestamp as a name
            db = TinyDB(db_path)
            
        else:
            # If we have an existing db, open it
            db = TinyDB(db_path)
            
            # Get the current timestamp to add into the meta
            timestamp = int(time.time())
            
        # Save the details of this harvest (or restart) to the 'meta' table of the database
        # This keeps the query metadata with the results, and helps us to restart the harvest if necessary.
        db.table('meta').insert({
            'timestamp': timestamp, 
            'total_results': int(total_results), 
            'total_pages': total_pages, 
            'start_page': start,
            'pages_harvested': start - 1,
            'results_per_page': client.results_per_page,
            'params': kwargs
        })
            
        # Loop through the range of pages
        for page in tqdm(range(start, total_pages + 1), unit='page', desc='Pages:'):
            
            # Get results from each page
            # Note that sort is set to 9 (barcode) to make sure the pages stay in the same order
            # If we don't set the sort param we can end up getting duplicates and missing records
            items = client.search(sort=9, page=page, **kwargs)
            
            # If check_duplicates is set to true we'll loop through each result individually
            # This will slow things down a little, and is probably unnecessary unless you're stitching together multiple harvests
            if check_duplicates is True:
                Record = Query()
                
                # Loop through results
                for item in items['results']:
                    
                    # If they're not in the db, then add them!
                    if db.table('items').contains(Record.identifier == item['identifier']) is False:
                        db.table('items').insert(item)
            else:
                # Save the results from this page to the db in one hit
                db.table('items').insert_multiple(items['results'])
            Meta = Query()
            db.table('meta').update(increment('pages_harvested'), Meta.timestamp == timestamp)
            # Pause briefly
            time.sleep(0.5)
    return db_path
In [ ]:
# Create a harvest!
harvest_items(kw='wragge')
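
You can combine any of the search parameters listed above to define your harvest. For example, this sketch (the series identifier is just an illustration) harvests only the digitised items in a single series:

# Harvest the digitised items in series A2479
harvest_items(series='A2479', digital='on')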

6. Managing harvests

If you're doing a large harvest, you might find that it fails part way through. You might also want to check on the details of a past harvest, or even reharvest a query to see if anything new has been added. Because we've saved the harvest metadata and results into a TinyDB database, it's easy to perform some basic checks and management tasks.

There are three main functions defined below:

  • harvest_report() – prints basic details of a harvest
  • harvest_restart() – restarts a failed harvest
  • reharvest_items() – creates a new harvest using the query settings of an existing harvest

In each case you can specify the path to an existing harvest database, something like data/db-items-1567480717.json. If you don't specify a database, the function will assume you want the most recent.

Here's an example of the output from harvest_report().

Harvest started: 2019-09-03 15:21:11
Items harvested: 200 of 200

{'timestamp': 1567488071,
 'total_results': 200,
 'total_pages': 10,
 'results_per_page': 20,
 'params': {'kw': 'wragge'}}
In [298]:
def get_latest_db():
    '''
    Get the database created by the most recent harvest.
    '''
    p = Path('data')
    dbs = sorted(list(p.glob('db-items-[0-9]*.json')))
    try:
        latest = dbs[-1]
    except IndexError:
        print('No databases')
        latest = None
    return latest

def get_db(db_path):
    '''
    Get a harvest database.
    If db_path is supplied then return that db.
    If not, then return the most recently created db and path.
    '''
    db = None
    if not db_path:
        db_path = get_latest_db()
    if db_path:  
        db = TinyDB(db_path)
    return (db_path, db)

def harvest_report(db_path=None):
    '''
    Print a report of the specified harvest.
    If db_path is not supplied, display details from the most recently created harvest.
    '''
    _, db = get_db(db_path)
    if db is not None:
        meta = db.table('meta').all()[0]
        date = datetime.fromtimestamp(meta['timestamp']).strftime('%Y-%m-%d %H:%M:%S')
        items_harvested = len(db.table('items').all())
        print(f'Harvest started: {date}')
        print(f'Items harvested: {items_harvested} of {meta["total_results"]}\n')
        display(meta)
    
def harvest_restart(db_path=None):
    '''
    Attempt to restart the specified harvest.
    If db_path is not supplied, restart the most recently created harvest.
    '''
    db_path, db = get_db(db_path)
    if db is not None:
        meta = db.table('meta').all()[-1]
        pages_harvested = meta['pages_harvested']
        if pages_harvested < meta['total_pages']:
            start = pages_harvested + 1
            harvest_items(db_path=db_path, start=start, **meta['params'])
        else:
            print('Harvest complete')
            
def reharvest_items(db_path=None):
    '''
    Harvest items using the parameters of the specified db.
    If db_path is not supplied, use the most recently created harvest.
    '''
    _, db = get_db(db_path)
    if db is not None:
        meta = db.table('meta').all()[0]
        harvest_items(**meta['params'])
In [291]:
# Display details of the most recent harvest
# Optionally, supply the path to an existing db, eg: harvest_report('data/db-items-1567480968.json')
harvest_report()
Harvest started: 2019-09-04 21:07:24
Items harvested: 80 of 200

{'timestamp': 1567595244,
 'total_results': 200,
 'total_pages': 10,
 'start_page': 1,
 'pages_harvested': 4,
 'results_per_page': 20,
 'params': {'kw': 'wragge'}}
In [ ]:
# Restart the most recent harvest
# Optionally, supply the path to an existing db, eg: harvest_restart('data/db-items-1567480968.json')
harvest_restart()
In [ ]:
# Create a new harvest using the parameters of the most recent harvest
# Optionally, supply the path to an existing db, eg: reharvest_items('data/db-items-1567480968.json')
reharvest_items()

7. Saving a harvest

Although your harvest is already saved in a TinyDB database, you might want to convert it to a simpler format for download and analysis. The functions below provide two options:

  • save_harvest_as_json() – save the harvested items as a JSON file
  • save_harvest_as_csv() – save the harvested items as a CSV file

The columns in the CSV-formatted file are:

  • identifier – the barcode number
  • series – identifier of the series which contains the item
  • control_symbol – individual control symbol
  • title – title of the item
  • contents_date_str – the contents date string as in RecordSearch
  • contents_start_date – the first date in the contents date string converted to ISO format
  • contents_end_date – the second date in the contents date string converted to ISO format
  • location – where the file is held
  • access_status – 'Closed', 'Open', 'OWE', or 'NYE'
  • digitised_status – True/False, has the file been digitised
  • digitised_pages – number of pages in the digitised file
In [187]:
def save_harvest_as_json(db_path=None):
    '''
    Save harvested items as a json file.
    If db_path is not supplied, use the most recently created harvest.
    '''
    _, db = get_db(db_path)
    if db is not None:
        
        # Get harvest metadata
        meta = db.table('meta').all()[0]
        
        # Get harvested items
        items = db.table('items').all()
        
        # Set file name and path
        filename = Path(f'data/items-{meta["timestamp"]}.json')
        
        # Dump items to a JSON file
        with open(filename, 'w') as json_file:
            json.dump(items, json_file)
        
        # Display link to download
        display(FileLink(filename))
            
def save_harvest_as_csv(db_path=None):
    '''
    Save harvested items as a CSV file.
    If db_path is not supplied, use the most recently created harvest.
    '''
    _, db = get_db(db_path)
    if db is not None:
        
        # Get harvest metadata
        meta = db.table('meta').all()[0]
        
        # Get harvested items
        items = db.table('items').all()
        
        # Flatten the date field using json_normalise and convert the items to a dataframe
        df = pd.DataFrame(json_normalize(items))
        
        # Rename the date columns
        df.rename(columns={'contents_dates.date_str': 'contents_date_str', 'contents_dates.start_date': 'contents_start_date', 'contents_dates.end_date': 'contents_end_date'}, inplace=True)
        
        # Put the columns in a nice order
        df = df[['identifier', 'series', 'control_symbol', 'title', 'contents_date_str', 'contents_start_date', 'contents_end_date', 'location', 'access_status', 'digitised_status', 'digitised_pages']]
        
        # Set file name and path
        filename = Path(f'data/items-{meta["timestamp"]}.csv')
        
        # Save as CSV
        df.to_csv(filename, index=False)
        
        # Display link to download
        display(FileLink(filename))
In [ ]:
# Save the most recent harvest as a json file
# Optionally, supply the path to an existing db, eg: save_harvest_as_json('data/db-items-1567480968.json')
save_harvest_as_json()
In [ ]:
# Save the most recent harvest as a CSV file
# Optionally, supply the path to an existing db, eg: save_harvest_as_csv('data/db-items-1567480968.json')
save_harvest_as_csv()
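
Once you've saved a CSV file you can load it back into pandas for some quick analysis. Here's a minimal sketch (the file name is just an example; use the name generated by your own harvest):

# Load the saved CSV into a dataframe
df = pd.read_csv('data/items-1567492794.csv')

# How many items are there in each access status category?
df['access_status'].value_counts()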

8. Harvesting complete series

You can harvest a complete series using the examples above. Just set the series search parameter to the series identifier. For example:

harvest_items(series='A6119')

But what do you do if the series contains more than 20,000 items? One way of splitting the series up into harvestable chunks is to use wildcard values and the control search parameter. For example, B13 has more than 20,000 items, but if we limit the results to items with a control symbol starting with '1', we bring the number down to under 20,000:

harvest_items(series='B13', control='1*')

To make sure we get everything in the series we can repeat the harvest using a range of prefixes for the control symbol – the easiest approach is simply to loop through each letter and number from A to Z and 0 to 9. That's exactly what the harvest_large_series() function below does. The key thing is the control_range. By default, it is a list that looks like this:

['A*', 'B*', 'C*', 'D*', 'E*', 'F*', 'G*', 'H*', 'I*', 'J*', 'K*', 'L*', 'M*', 'N*', 'O*', 'P*', 'Q*', 'R*', 'S*', 'T*', 'U*', 'V*', 'W*', 'X*', 'Y*', 'Z*', '0*', '1*', '2*', '3*', '4*', '5*', '6*', '7*', '8*', '9*']

Note that it's possible to get duplicate items this way because some items include earlier versions of control symbols and these are searched as well as the current ones. We can filter out the duplicates by asking the harvest function to check whether an item is already in the db before saving it.

In [270]:
def harvest_large_series(series, control_range=None, restart=False, db_path=None):
    '''
    RecordSearch will not return more than 20,000 results.
    If a series has more than 20,000 items you'll need to break it up.
    The easiest way to do this is to add a param for control_symbol.
    This function will break a series harvest down into a series of harvests --
    using each letter and number with a wildcard as the control_symbol parameter.
    This should be enough to harvest most large series, but in some cases you might need to supply a custom list of control_symbol prefixes.
    '''
    start = 1
    
    # If you don't supply a range of prefixes, use A-Z and 0-9
    if not control_range:
        control_range = [letter + '*' for letter in string.ascii_uppercase] + [str(number) + '*' for number in range(0, 10)]
        
    # Try to restart a failed harvest
    if restart is True:
        
        # Get the current db
        db_path, db = get_db(db_path)
        
        # Get the most recent metadata
        meta = db.table('meta').all()[-1]
        
        # Get the last used control
        control = meta['params']['control']
        
        # Limit the prefixes to ones that haven't been used yet
        control_range = control_range[control_range.index(control):]
        
        # Work out which page to start from
        pages_harvested = meta['pages_harvested']
        if pages_harvested < meta['total_pages']:
            start = pages_harvested + 1
    
    # Loop through control prefixes initiating a new harvest for each
    # After the first harvest we'll have a value for db_path, which we can then use in subsequent calls.
    # This means all the separate harvests will be saved in a single db
    for control in control_range:
        
        # Start a new harvest using the control symbol. Set harvester to check for duplicates.
        db_path = harvest_items(db_path=db_path, start=start, check_duplicates=True, series=series, control=control)
        start = 1
In [ ]:
# Harvest a series with more than 20,000 items
harvest_large_series(series='B13')

However, there are some series that can't easily be sliced up into chunks of fewer than 20,000 items. Either they're extremely large, or their range of control symbol prefixes is very small. In this case you have to experiment to find a list of prefixes that will work. How do you know whether a particular list will work? The function below lets you test a range of control symbol prefixes to check whether each slice returns fewer than 20,000 items. If you give it a series identifier, it will use the default range of prefixes described above, but you can also feed it a customised list.

In [326]:
def test_control_prefixes(series, control_range=None):
    '''
    Test a range of control symbol prefixes to see if they split a series up into chunks of less than 20,000 items.
    Prints the number of results for each prefix, or 'More than 20,000' if greater than 20,000.
    '''
    c = RSSearchClient()
    if not control_range:
        control_range = [letter + '*' for letter in string.ascii_uppercase] + [str(number) + '*' for number in range(0, 10)]
    for control in control_range:
        try:
            results =  c.search(series=series, control=control)
            total = results['total_results']
        except TooManyError:
            total = 'More than 20,000'
        print(f'{control}: {total}')

For example, series A1 has more than 60,000 items and uses the year as the control symbol prefix. In this case we can create a list of control symbol prefixes to slice the series up by year-like combinations of digits. The following cell creates such a list.

In [319]:
# For series like A1 that use the year as the control symbol prefix, this range should work.
control_range = [f'{num}*' for num in range(2, 10)] + [f'1{num}*' for num in range(0, 9)] + [f'19{num}*' for num in range(1, 10)]

When the cell above is run, it creates a list that looks like this:

['2*', '3*', '4*', '5*', '6*', '7*', '8*', '9*', '10*', '11*', '12*', '13*', '14*', '15*', '16*', '17*', '18*', '191*', '192*', '193*', '194*', '195*', '196*', '197*', '198*', '199*']

See how it splits years in the 20th century into decades? To test that this will actually work, we can feed the list to the test_control_prefixes() function.

In [ ]:
test_control_prefixes('A1', control_range=control_range)

The test_control_prefixes() function prints out the following results:

2*: 0
3*: 2
4*: 0
5*: 0
6*: 0
7*: 0
8*: 0
9*: 2
10*: 1
11*: 0
12*: 0
13*: 0
14*: 0
15*: 0
16*: 0
17*: 1
18*: 0
191*: 17501
192*: 18908
193*: 18247
194*: 0
195*: 0
196*: 0
197*: 0
198*: 0
199*: 2

Note that all the slices return fewer than 20,000 results. Yay, it works! You might also note that while most of the control symbol prefixes are years, there are a few oddities as well. In general, you can't rely on control symbols being consistent. Even if the series notes say they use numbers, you need to check to make sure no letters have snuck in. So basically it takes a fair bit of trial and error!

Once we've defined a custom control range as above, we can feed it to harvest_large_series() to use in place of the default.

In [ ]:
# Use custom range to harvest a large series
harvest_large_series(series='A1', control_range=control_range)

9. Saving images from digitised files

Once you've saved all the metadata from your search, you can use it to download images from all the items that have been digitised.

The function below will look for all items that have a digitised_status of True in a harvest db, and then download all the images from them. The images will be saved in a folder named with the timestamp of the original harvest, eg. data/items-1567492815-images.

Within that folder, the images from each item will be saved in a separate folder, named using the series, control_symbol, and identifier values. So the folder a2487-1919-8962-156686 contains images from a file in series A2487, with the control symbol 1919/8962 and the barcode 156686. Images are named with the identifier (barcode) and page numbers, eg: 156686-1.jpg.
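
The folder names are built with the slugify() function (assumed here to be from the python-slugify package, imported above), which lowercases values and replaces characters like '/' with hyphens. A quick sketch using the example above:

# How the folder name for series A2487, control symbol 1919/8962, barcode 156686 is built
f'{slugify("A2487")}-{slugify("1919/8962")}-156686'
# returns 'a2487-1919-8962-156686'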

The function also saves some image metadata back into the harvest db, specifically:

  • image_dir – the path where the image has been saved
  • image_name – the filename of the image
  • identifier – the item barcode
  • page – the page number
  • width – the width of the image (in pixels)
  • height – the height of the image (in pixels)
In [307]:
def harvest_images(db_path=None):
    '''
    Download images from all the digitised files in a harvest.
    If db_path is not supplied, use the most recently created harvest.
    '''
    
    # Get the harvest db
    _, db = get_db(db_path)
    if db is not None:
        
        # Get the metadata
        meta = db.table('meta').all()[0]
        
        # Find items where digitised_status is True
        Record = Query()
        items = db.table('items').search(Record.digitised_status == True)
        
        # Loop through digitised items
        for item in tqdm(items, desc='Items'):
            
            # Set name of the folder in which to save the images
            image_dir = Path(f'data/items-{meta["timestamp"]}-images/{slugify(item["series"])}-{slugify(item["control_symbol"])}-{item["identifier"]}')
            
            # Create the folder (and parent if necessary)
            image_dir.mkdir(exist_ok=True, parents=True)
            
            # Loop through the page numbers
            for page in tqdm(range(1, item['digitised_pages'] + 1), desc='Pages', leave=False):
                
                # Define the image filename using the barcode and page number
                filename = Path(f'{image_dir}/{item["identifier"]}-{page}.jpg')
                
                # Check to see if the image already exists (useful if rerunning a failed harvest)
                if not filename.exists():
                    
                    # If it doesn't already exist, download it
                    img_url = f'http://recordsearch.naa.gov.au/NaaMedia/ShowImage.asp?B={item["identifier"]}&S={page}&T=P'
                    response = requests.get(img_url)
                    response.raise_for_status()
                    
                    # Try opening the file as an image
                    try:
                        image = Image.open(BytesIO(response.content))
                    except IOError:
                        print('Not an image')
                    else:
                        # If it's an image, get its dimensions
                        width, height = image.size
                        
                        # Save the image
                        image.save(filename)
                        
                        # Create image metadata
                        image_meta = {
                            'image_dir': str(image_dir),
                            'image_name': '{}-{}.jpg'.format(item['identifier'], page),
                            'identifier': item['identifier'],
                            'page': page,
                            'width': width,
                            'height': height
                        }
                        
                        # Add/update the image metadata
                        db.table('images').upsert(image_meta, Record.image_name == image_meta['image_name'])
                        
                        # Pause
                        time.sleep(0.5)
In [ ]:
# Download images from the most recent harvest
# Optionally, supply the path to an existing db, eg: harvest_images('data/db-items-1567480968.json')
harvest_images()
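
Once the images have been downloaded, you can query the image metadata in the db directly. For example, here's a minimal sketch that retrieves the saved image records for a single digitised file (the barcode is just the example from above):

# Open the most recent harvest db (pass None to get_db to use the latest)
_, db = get_db(None)

# Find the saved images from the file with barcode 156686
Record = Query()
db.table('images').search(Record.identifier == '156686')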

10. Saving image metadata

It might be handy to have all the image metadata in a single CSV file.

In [312]:
def save_metadata_as_csv(db_path=None):
    '''
    Save metadata of harvested images as a CSV file.
    If db_path is not supplied, use the most recently created harvest.
    '''
    _, db = get_db(db_path)
    if db is not None:
        
        # Get harvest metadata
        meta = db.table('meta').all()[0]
        
        # Get harvested images
        images = db.table('images').all()
        
        # Convert the image metadata to a dataframe
        df = pd.DataFrame(images)
        
        # Set file name and path
        filename = Path(f'data/items-{meta["timestamp"]}-images.csv')
        
        # Save as CSV
        df.to_csv(filename, index=False)
        
        # Display link to download
        display(FileLink(filename))
In [ ]:
# Save the image metadata and display a link
# Optionally, supply the path to an existing db, eg: save_metadata_as_csv('data/db-items-1567480968.json')
save_metadata_as_csv()
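
As with the item metadata, you can load the saved image metadata back into pandas. A minimal sketch (again, the file name is just an example):

# Load the image metadata and summarise the image dimensions
df = pd.read_csv('data/items-1567492794-images.csv')
df[['width', 'height']].describe()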

Next steps

Coming soon, notebooks to help you explore your harvested data...


Created by Tim Sherratt as part of the GLAM Workbench.