Harvest items from a search in RecordSearch

Ever searched for items in RecordSearch and wanted to save the results as a CSV file, or in some other machine-readable format? This notebook walks you through the process of creating, managing, and saving item searches – all the way from search terms to downloadable dataset. You can even download all the images from items that have been digitised!

RecordSearch doesn't currently have an option for downloading machine-readable data. So to get collection metadata in a structured form, we have to resort to screen scraping. This notebook uses the RecordSearch Data Scraper to do most of the work.

Note that the RecordSearch Data Scraper caches results to improve efficiency. This also makes it easy to resume a failed harvest. If you want to completely refresh a harvest, then delete the cache_db.sqlite file to start from scratch.
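If you do want to start completely afresh, you can delete the cache from within the notebook – a minimal sketch, assuming cache_db.sqlite sits in the same directory as this notebook:

In [ ]:
from pathlib import Path

# Delete the scraper's cache so the next harvest starts from scratch
# (missing_ok means this won't complain if there's no cache yet)
Path('cache_db.sqlite').unlink(missing_ok=True)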

1. Import what we need

In [ ]:
from tqdm.auto import tqdm
from pathlib import Path
import math
import time
from datetime import datetime
import pandas as pd
import json
import string
from IPython.display import display, FileLink, HTML
import requests
from slugify import slugify
from recordsearch_data_scraper.scrapers import *

# This is a workaround for a problem with tqdm adding space to cells
HTML("""
    <style>
    .p-Widget.jp-OutputPrompt.jp-OutputArea-prompt:empty {
      padding: 0;
      border: 0;
    }
    </style>
""")

2. Available search parameters

The available search parameters are the same as those in RecordSearch's Advanced Search form. There are lots of them, but you'll probably only end up using a few, like kw and series. Note that you can use * for wildcard searches, just as you can in the web interface. So setting kw to 'wragge*' will find both 'wragge' and 'wragges'.

See the RecordSearch Data Scraper documentation for more information on search parameters.

  • kw – string containing keywords to search for
  • kw_options – how to interpret kw, possible values are:
    • 'ALL' – return results containing all of the keywords (default)
    • 'ANY' – return results containing any of the keywords
    • 'EXACT' – treat kw as a phrase rather than a list of words
  • kw_exclude – string containing keywords to exclude from search
  • kw_exclude_options – how to interpret kw_exclude, possible values are:
    • 'ALL' – exclude results containing all of the keywords (default)
    • 'ANY' – exclude results containing any of the keywords
    • 'EXACT' – treat kw_exclude as a phrase rather than a list of words
  • search_notes – set to 'on' to search item notes as well as metadata
  • series – search for items in this series
  • series_exclude – exclude items from this series
  • control – search for items matching this control symbol
  • control_exclude – exclude items matching this control symbol
  • item_id – search for items with this item ID number (formerly called barcode)
  • date_from – search for items with a date (year) greater than or equal to this, eg. '1935'
  • date_to – search for items with a date (year) less than or equal to this
  • formats – limit search to items in a particular format, see possible values below
  • formats_exclude – exclude items in a particular format, see possible values below
  • locations – limit search to items held in a particular location, see possible values below
  • locations_exclude – exclude items held in a particular location, see possible values below
  • access – limit to items with a particular access status, see possible values below
  • access_exclude – exclude items with a particular access status, see possible values below
  • digital – set to True to limit to items that are digitised

Possible values for formats and formats_exclude:

  • 'Paper files and documents'
  • 'Index cards'
  • 'Bound volumes'
  • 'Cartographic records'
  • 'Photographs'
  • 'Microforms'
  • 'Audio-visual records'
  • 'Audio records'
  • 'Electronic records'
  • '3-dimensional records'
  • 'Scientific specimens'
  • 'Textiles'

Possible values for locations and locations_exclude:

  • 'NAT, ACT'
  • 'Adelaide'
  • 'Australian War Memorial'
  • 'Brisbane'
  • 'Darwin'
  • 'Hobart'
  • 'Melbourne'
  • 'Perth'
  • 'Sydney'

Possible values for access and access_exclude:

  • 'OPEN'
  • 'OWE'
  • 'CLOSED'
  • 'NYE'
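
For example (the values below are purely illustrative), you might combine keywords, a format, a location, an access status, and a date range like this:

In [ ]:
# An illustrative combination of search parameters -- edit the values to suit your own research
example_params = {
    'kw': 'meteorology',
    'kw_options': 'ANY',
    'formats': 'Photographs',
    'locations': 'Sydney',
    'access': 'OPEN',
    'date_from': '1900',
    'date_to': '1950'
}

# These can be passed to a search as keyword arguments, eg. RSItemSearch(**example_params)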

3. Create a search

Once you've decided on your parameters you can use them to create a search. For example, if we wanted to find all items that included the word 'wragge' and were digitised, our parameters would be:

  • kw='wragge'
  • digital=True
In [ ]:
# Initialise the search
search = RSItemSearch(kw='wragge', digital=True)

Now we can have a look to see how many results there are in the complete results set.

In [ ]:
# Display total results
search.total_results

To get the first page of results, use .get_results(). Note that by default there are 20 search results on a page. Subsequent calls to .get_results() will retrieve the next page of results. You can use this to work your way through the complete results set. That's how we'll harvest all the items in a search below.

In [ ]:
items = search.get_results()

# Show the first result
items['results'][0]

4. Changing how your search results are delivered

There are some additional parameters that affect the way the search results are delivered.

  • page – return a specific page of search results
  • results_per_page – default is 20
  • sort – return results in a specified order, possible values:
    • 1 – series and control symbol
    • 3 – title
    • 5 – start date
    • 7 – digitised items first
    • 9 – barcode
    • 11 – audio visual items first
    • 12 – items with pdfs first
  • record_detail – controls the amount of information included in each item record:
    • 'brief' (default) – just the info in the search results
    • 'digitised' – add the number of pages if the file is digitised (slower)
    • 'full' – get the full individual record for each result (slowest)

Note that if you want to harvest all the digitised page images from a search, you need to set record_detail to either 'digitised' or 'full'.
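
For example, to fetch bigger pages of results sorted by title, you might set up a search like this (a sketch – adjust the values to suit):

In [ ]:
# 50 results per page, sorted by title (sort=3)
sorted_search = RSItemSearch(kw='wragge', results_per_page=50, sort=3)

# The first page should now contain up to 50 results
len(sorted_search.get_results()['results'])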

Let's repeat the search above, but ask for the full record details.

In [ ]:
# Initialise the search
# Edit the search parameters as desired.
search = RSItemSearch(record_detail='full', kw='wragge', digital=True)

If we look at the first result again, we'll see that it contains some extra fields.

In [ ]:
items = search.get_results()

# Show the first result
items['results'][0]

5. Harvesting a complete set of search results

Ok, we've learnt how to create a search and get back some data, but only getting the first 20 results is not so useful. What if our search contains hundreds or thousands of items? How do we get them all?

To save everything, we have to loop through each page in the result set, saving the results as we go. The cell below does just that.

In [ ]:
items = []
# Initialise the search -- edit the search parameters below as desired
search = RSItemSearch(record_detail='digitised', kw='wragge')
with tqdm(total=search.total_results) as pbar:
    more = True
    while more:
        # Get a page of results
        data = search.get_results()
        if data['results']:
            # Add the page of results to the items list
            items += data['results']
            pbar.update(len(data['results']))
            time.sleep(0.5)
        else:
            more = False

But wait! You might have noticed that RecordSearch only displays results for searches that return fewer than 20,000 items. Because the screen scraper is just extracting details from the RecordSearch web pages, the 20,000 limit applies here as well. If your search has more than 20,000 results, you'll need to narrow it down using additional parameters.

One way of splitting a large search up into harvestable chunks is to use wildcard values and the control search parameter. For example, series B13 has more than 20,000 items, but if we limit the results to items with a control symbol starting with '1', we bring the number down to under 20,000:
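
In [ ]:
# Series B13 has more than 20,000 items in total,
# but limiting the control symbol to '1*' brings the count under the limit
RSItemSearch(series='B13', control='1*').total_results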

To make sure we get everything in the series we can repeat the harvest using a range of prefixes for the control symbol – the easiest approach is simply to loop through each number and letter (0–9, A–Z, and '/'). That's exactly what the harvest_search() function below does if there are more than 20,000 results in a search.

Note that it's possible to get duplicate items this way because some items include earlier versions of control symbols and these are searched as well as the current ones. These can be removed from the saved harvests by using something like Pandas .drop_duplicates().
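
For example, once you have a list of harvested items you could deduplicate it with something like this (a sketch, keying on the item identifier):

In [ ]:
# Drop duplicate records from a harvest, keyed on the item identifier (barcode)
# (run this after harvesting -- `items` is the list returned by harvest_search() below)
df = pd.json_normalize(items)
df = df.drop_duplicates(subset=['identifier'])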

In [ ]:
# This is basically a list of letters and numbers that we can use to build up control symbol values.
control_range = [str(number) for number in range(0, 10)] + [letter for letter in string.ascii_uppercase] + ['/']

def get_results(**kwargs):
    '''
    Save all the results from a search using the given parameters.
    If there are more than 20,000 results, return False.
    Otherwise, return the harvested items.
    '''
    s = RSItemSearch(**kwargs)
    if s.total_results == '20,000+':
        return False
    else:
        items = []
        with tqdm(total=s.total_results, leave=False) as pbar:
            more = True
            while more:
                data = s.get_results()
                if data['results']:
                    items += data['results']
                    pbar.update(len(data['results']))
                    time.sleep(0.5)
                else:
                    more = False
        return items
    
def refine_controls(current_control, **kwargs):
    '''
    Add additional letters/numbers to the control symbol wildcard search 
    until the number of results is less than 20,000.
    Then harvest the results.
    Returns:
        * the RSItemSearch object (containing the search params, total results etc)
        * a list containing the harvested items
    '''
    items = []
    for control in control_range:
        new_control = current_control.strip('*') + control + '*'
        # print(new_control)
        kwargs['control'] = new_control
        results = get_results(**kwargs)
        # print(total)
        if results is False:
            items += refine_controls(new_control, **kwargs)
        else:
            items += results
    return items

def harvest_search(**kwargs):
    '''
    Harvest all the items from a search using the supplied parameters.
    If there are more than 20,000 results, it will use control symbol 
    wildcard values to try and split the results into harvestable chunks.
    '''
    # Initialise the search
    search = RSItemSearch(**kwargs)
    # If there are more than 20,000 results, try chunking using control symbols
    if search.total_results == '20,000+':
        items = []
        # Loop through the letters and numbers
        for control in control_range:
            # Add letter/number as a wildcard value
            kwargs['control'] = f'{control}*'
            # Try getting the results
            results = get_results(**kwargs)
            if results:
                items += results
            # If there's still more than 20,000, add more letters/numbers to the control symbol!
            else:
                items += refine_controls(control, **kwargs)
    # If there's less than 20,000 results, save them all
    else:
        items = get_results(**kwargs)
    return search, items

Start a harvest

Insert your search parameters in the brackets below.

Examples:

  • search, items = harvest_search(kw='rabbit')
  • search, items = harvest_search(kw='rabbit', digital=True)
  • search, items = harvest_search(record_detail='full', kw='rabbit', series='A1')
  • search, items = harvest_search(series='B13')
In [ ]:
search, items = harvest_search(series='B13')

6. Save the harvested results

Once the harvest is finished we can save the results. The function below creates a directory for the harvest, and saves three files inside:

  • metadata.json – this is a summary of your harvest, including the parameters you used and the date it was run
  • results.jsonl – this is the harvested data with each record saved as a JSON object on a new line
  • results.csv – the harvested data saved as a CSV file (if you've saved 'full' records, the list of access_decision_reasons will be saved as a pipe-separated string)

The metadata.json file looks something like this:

{
    "date_harvested": "2021-05-22T22:05:10.705184", 
    "search_params": {"results_per_page": 20, "sort": 9, "record_detail": "digitised"}, 
    "search_kwargs": {"kw": "wragge"}, 
    "total_results": 208, 
    "total_harvested": 208
}

The fields in the results files are:

  • title
  • identifier
  • series
  • control_symbol
  • digitised_status
  • digitised_pages – if record_detail is set to 'digitised' or 'full'
  • access_status
  • access_decision_reasons – if record_detail is set to 'full'
  • location
  • retrieved – date/time when this record was retrieved from RecordSearch
  • contents_date_str
  • contents_start_date
  • contents_end_date
  • access_decision_date_str – if record_detail is set to 'full'
  • access_decision_date – if record_detail is set to 'full'
In [ ]:
def save_harvest(search, items):
    params = search.params.copy()
    params.update(search.kwargs)
    today = datetime.now()
    search_param_str = '_'.join(sorted([f'{k}_{v}' for k, v in params.items() if v is not None and k not in ['results_per_page', 'sort']]))
    data_dir = Path('harvests', f'{today.strftime("%Y%m%d")}_{search_param_str}')
    data_dir.mkdir(exist_ok=True, parents=True)
    metadata = {
        'date_harvested': today.isoformat(),
        'search_params': search.params,
        'search_kwargs': search.kwargs,
        'total_results': search.total_results,
        'total_harvested': len(items)
    }

    with Path(data_dir, 'metadata.json').open('w') as md_file:
        json.dump(metadata, md_file)

    with Path(data_dir, 'results.jsonl').open('w') as data_file:
        for item in items:
            data_file.write(json.dumps(item) + '\n')

    df = pd.json_normalize(items)
    # Flatten list
    try:
        df['access_decision_reasons'] = df['access_decision_reasons'].dropna().apply(lambda l: ' | '.join(l))
    except KeyError:
        pass
    # Remove any duplicates
    df.drop_duplicates(inplace=True)
    df.to_csv(Path(data_dir, 'results.csv'), index=False)
    print(f'Harvest directory: {data_dir}')
    display(FileLink(Path(data_dir, 'metadata.json')))
    display(FileLink(Path(data_dir, 'results.jsonl')))
    display(FileLink(Path(data_dir, 'results.csv')))
    return str(data_dir)
In [ ]:
# The function returns the path to the harvest directory
# We can use that below to save images or pdfs
harvest_dir = save_harvest(search, items)

7. Saving images from digitised files

Once you've saved all the metadata from your search, you can use it to download images from all the items that have been digitised.

Note that you can only save the images if you set the record_detail parameter to 'digitised' or 'full' in the original harvest.

The function below will look for all items that have a digitised_pages value in the harvest results, and then download an image for each page. The images will be saved in an images subdirectory, inside the original harvest directory.

In [ ]:
def save_images(harvest_dir):
    df = pd.read_csv(Path(harvest_dir, 'results.csv'))
    with tqdm(total=df.loc[df['digitised_status'] == True].shape[0], desc='Files') as pbar:
        for item in df.loc[df['digitised_status'] == True].itertuples():
            image_dir = Path(f'{harvest_dir}/images/{slugify(item.series)}-{slugify(item.control_symbol)}-{item.identifier}')

            # Create the folder (and parent if necessary)
            image_dir.mkdir(exist_ok=True, parents=True)

            # Loop through the page numbers
            for page in tqdm(range(1, int(item.digitised_pages) + 1), desc='Images', leave=False):

                # Define the image filename using the barcode and page number
                filename = Path(f'{image_dir}/{item.identifier}-{page}.jpg')

                # Check to see if the image already exists (useful if rerunning a failed harvest)
                if not filename.exists():
                    # If it doesn't already exist then download it
                    img_url = f'https://recordsearch.naa.gov.au/NaaMedia/ShowImage.asp?B={item.identifier}&S={page}&T=P'
                    response = requests.get(img_url)
                    response.raise_for_status()
                    filename.write_bytes(response.content)

                    time.sleep(0.5)
            pbar.update(1)
    
In [ ]:
# Supply the path to the directory containing the harvested data
# This is the value returned by the `save_harvest()` function.
# eg: 'harvests/20210522_digital_True_kw_wragge_record_detail_full'
save_images(harvest_dir)

8. Saving digitised files as PDFs

You can also save digitised files as PDFs. The function below will save any digitised files in the results to a pdfs subdirectory within the harvest directory.

In [ ]:
def save_pdfs(harvest_dir):
    df = pd.read_csv(Path(harvest_dir, 'results.csv'))
    pdf_dir = Path(harvest_dir, 'pdfs')
    pdf_dir.mkdir(exist_ok=True, parents=True)
    with tqdm(total=df.loc[df['digitised_status'] == True].shape[0], desc='Files') as pbar:
        for item in df.loc[df['digitised_status'] == True].itertuples():
            image_dir = Path(f'{harvest_dir}/images/{slugify(item.series)}-{slugify(item.control_symbol)}-{item.identifier}')
            pdf_file = Path(pdf_dir, f'{slugify(item.series)}-{slugify(item.control_symbol)}-{item.identifier}.pdf') 
            if not pdf_file.exists():
                pdf_url = f'https://recordsearch.naa.gov.au/SearchNRetrieve/NAAMedia/ViewPDF.aspx?B={item.identifier}&D=D'
                response = requests.get(pdf_url)
                response.raise_for_status()
                pdf_file.write_bytes(response.content)
                time.sleep(0.5)
            pbar.update(1)
In [ ]:
# Supply the path to the directory containing the harvested data
# This is the value returned by the `save_harvest()` function.
# eg: 'harvests/20210522_digital_True_kw_wragge_record_detail_full'
save_pdfs(harvest_dir)

Next steps

Coming soon, notebooks to help you explore your harvested data...


Created by Tim Sherratt for the GLAM Workbench. Support me by becoming a GitHub sponsor!