Ever searched for items in RecordSearch and wanted to save the results as a CSV file, or in some other machine-readable format? This notebook walks you through the process of creating, managing, and saving item searches – all the way from search terms to downloadable dataset. You can even download all the images from items that have been digitised!
RecordSearch doesn't currently have an option for downloading machine-readable data. So to get collection metadata in a structured form, we have to resort to screen scraping. All the screen-scraping code used in this notebook is contained in the recordsearch_tools package.
If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!
Some tips:
Is this thing on? If you can't edit or run any of the code cells, you might be viewing a static (read only) version of this notebook. Click here to load a live version running on Binder.
from tqdm.auto import tqdm
from tinydb import TinyDB, Query
from tinydb.operations import increment
from pathlib import Path
import math
import time
from datetime import datetime
from recordsearch_tools.client import RSSearchClient, TooManyError
import pandas as pd
import json
import string
from IPython.display import display, FileLink, HTML
import requests
from PIL import Image
from io import BytesIO
from slugify import slugify
# Make sure the 'data' directory exists
Path('data').mkdir(exist_ok=True)
# This is a workaround for a problem with tqdm adding space to cells
HTML("""
<style>
.p-Widget.jp-OutputPrompt.jp-OutputArea-prompt:empty {
padding: 0;
border: 0;
}
</style>
""")
The available search parameters are the same as those in RecordSearch's Advanced Search form. There are lots of them, but you'll probably only end up using a few, like kw and series. Note that you can use * for wildcard searches, just as you can in the web interface. So setting kw to 'wragge*' will find both 'wragge' and 'wragges'.

- kw – string containing keywords to search for
- kw_options – how to interpret kw; one of the possible values treats kw as a phrase rather than a list of words
- kw_exclude – string containing keywords to exclude from the search
- kw_exclude_options – how to interpret kw_exclude; one of the possible values treats it as a phrase rather than a list of words
- search_notes – set to 'on' to search item notes as well as metadata
- series – search for items in this series
- series_exclude – exclude items from this series
- control – search for items matching this control symbol
- control_exclude – exclude items matching this control symbol
- item_id – search for items with this item ID number (formerly called barcode)
- date_from – search for items with a date (year) greater than or equal to this, eg. '1935'
- date_to – search for items with a date (year) less than or equal to this
- formats – limit search to items in a particular format, see possible values below
- formats_exclude – exclude items in a particular format, see possible values below
- locations – limit search to items held in a particular location, see possible values below
- locations_exclude – exclude items held in a particular location, see possible values below
- access – limit to items with a particular access status, see possible values below
- access_exclude – exclude items with a particular access status, see possible values below
- digital – set to 'on' to limit to items that are digitised

Possible values for formats and formats_exclude:

Possible values for locations and locations_exclude:

Possible values for access and access_exclude:
Once you've decided on your parameters you can use them to create a search. For example, if we wanted to find all items that included the word 'wragge' and were digitised, our parameters would be:
kw='wragge'
digital='on'
# Create a search client
c = RSSearchClient()
# Feed the search parameters to the client and save the results
results = c.search(kw='wragge', digital='on')
Now we can have a look to see how many results there are in the complete results set.
# Display total results
results['total_results']
Note that the search client only gets one page of results (containing 20 items) at a time. You can check this.
# How many results do we actually have
len(results['results'])
Let's have a look at the first item.
results['results'][0]
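Each result is just a Python dictionary, so you can pull out individual fields by name. Here's a quick sketch, assuming field names like title and digitised_status that appear elsewhere in this notebook:
# Grab the first item and display a couple of its fields
item = results['results'][0]
print(item['title'])
print(item['digitised_status'])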
There are some additional parameters that affect the way the search results are delivered. We'll use some of these to harvest the complete results set.
- page – return a specific page of search results
- sort – return results in a specified order (for example, sort=9 orders results by barcode)
- digitised – set to True (default) or False to control whether to include the number of pages in each digitised file (if True, extra requests are made to get this information, which slows things down a bit)

So to get the second page of results from the search above:
results = c.search(kw='wragge', digital='on', page=2)
The first item in our result set should be different, because it's coming from the second page of results.
results['results'][0]
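You can also combine any number of the search parameters above in a single call. Here's a rough example – the series identifier and date range are just placeholder values for illustration:
# Combine a wildcard keyword with a series and a date range
# (series 'A1' and the dates are placeholders, not a recommended search)
combined = c.search(kw='wragge*', series='A1', date_from='1900', date_to='1920')
combined['total_results']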
Ok, we've learnt how to create a search and get back some data, but only getting the first 20 results is not so useful. What if our search contains hundreds or thousands of items? How do we get them all?
To save everything, we have to loop through each page in the result set, saving the results as we go. The functions below do just that.
But wait! You might have noticed that RecordSearch only displays results for searches that return fewer than 20,000 items. Because the screen scraper is just extracting details from the RecordSearch web pages, the 20,000 limit applies here as well. If your search has more than 20,000 results, you'll need to narrow it down using additional parameters.
The main function below is harvest_items(). You just give it any of the search parameters listed above. It will loop through all the pages in the result set, saving the items to a simple JSON database using TinyDB.

The database will be created in a directory named with a timestamp (identifying when the harvest was started) inside the harvests directory, for example harvests/1567492794/db-items.json. There are more functions for using and managing the db files below.
def get_total_results(client, **kwargs):
'''
Get the total number of results returned by a search.
'''
try:
# Get the first page of results, passing digitised=False to speed things up
results = client.search(digitised=False, **kwargs)
# Get the total number of results
total = results['total_results']
# Uh oh there are more than 20,000 results
except TooManyError:
print('There are more than 20,000 results.')
total = None
return total
def harvest_items(start=1, db_path=None, check_duplicates=False, **kwargs):
'''
Harvest items from a search and save them to a database.
Supply any of the search parameters listed above.
Set check_duplicates to True if you want to check for possible duplicates (probably not necessary in most cases).
'''
# Initiate the client
client = RSSearchClient()
# Get the total number of results returned by this search
total_results = get_total_results(client, **kwargs)
# If the number of results is between 1 and 20,000 we can harvest!
if total_results:
# Calculate the number of results pages
total_pages = math.ceil(int(total_results) / client.results_per_page)
# Get the current timestamp to add into the meta
timestamp = int(time.time())
# We're creating a new db
if not db_path:
# Use the timestamp to create a directory & file name for the db
harvest_path = Path('harvests', str(timestamp))
harvest_path.mkdir(parents=True, exist_ok=True)
db_path = Path(harvest_path, 'db-items.json')
# Create the new db
db = TinyDB(db_path)
else:
# If we have an existing db, open it
db = TinyDB(db_path)
# Save the details of this harvest (or restart) to the 'meta' table of the database
# This keeps the query metadata with the results, and helps us to restart the harvest if necessary.
db.table('meta').insert({
'timestamp': timestamp,
'total_results': int(total_results),
'total_pages': total_pages,
'start_page': start,
'pages_harvested': start - 1,
'results_per_page': client.results_per_page,
'params': kwargs
})
# Loop through the range of pages
for page in tqdm(range(start, total_pages + 1), unit='page', desc='Pages:'):
# Get results from each page
# Note that sort is set to 9 (barcode) to make sure the pages stay in the same order
# If we don't set the sort param we can end up getting duplicates and missing records
items = client.search(sort=9, page=page, **kwargs)
# If check_duplicates is set to true we'll loop through each result individually
# This will slow things down a little, and is probably unnecessary unless you're stitching together multiple harvests
if check_duplicates is True:
Record = Query()
# Loop through results
for item in items['results']:
# If they're not in the db, then add them!
if db.table('items').contains(Record.identifier == item['identifier']) is False:
db.table('items').insert(item)
else:
# Save the results from this page to the db in one hit
db.table('items').insert_multiple(items['results'])
Meta = Query()
db.table('meta').update(increment('pages_harvested'), Meta.timestamp == timestamp)
# Pause briefly
time.sleep(0.5)
return db_path
# Create a harvest!
harvest_items(series='J3115')
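harvest_items() returns the path to the database it creates, so you can capture it and open the db directly with TinyDB if you want to poke around in the raw results. A minimal sketch (note this runs the small harvest again, just for illustration):
# Capture the db path returned by harvest_items()
db_path = harvest_items(series='J3115')
# Open the database and check how many items were saved
db = TinyDB(db_path)
print(len(db.table('items').all()))
# Find just the digitised items
Record = Query()
print(len(db.table('items').search(Record.digitised_status == True)))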
If you're doing a large harvest, you might find that it fails part way through. You might also want to check on the details of a past harvest, or even reharvest a query to see if anything new has been added. Because we've saved the harvest metadata and results into a TinyDB database, it's easy to perform some basic checks and management tasks.
There are three main functions defined below:
- harvest_report() – prints basic details of a harvest
- harvest_restart() – restarts a failed harvest
- reharvest_items() – creates a new harvest using the query settings of an existing harvest

In each case you can specify the path to an existing harvest database, something like harvests/1567480717/db-items.json. If you don't specify a database, the function will assume you want the most recent one.

Here's an example of the output from harvest_report():
Harvest started: 2019-09-03 15:21:11
Items harvested: 200 of 200
{'timestamp': 1567488071,
'total_results': 200,
'total_pages': 10,
'results_per_page': 20,
'params': {'kw': 'wragge'}}
def get_latest_db():
'''
Get the database created by the most recent harvest.
'''
p = Path('harvests')
harvests = sorted([d for d in p.iterdir() if d.is_dir()])
try:
latest = Path(harvests[-1], 'db-items.json')
except IndexError:
print('No databases')
latest = None
return latest
def get_db(db_path):
'''
Get a harvest database.
If db_path is supplied then return that db.
If not, then return the most recently created db and path.
'''
    db = None
    if not db_path:
        db_path = get_latest_db()
    if db_path:
        db = TinyDB(db_path)
return (db_path, db)
def harvest_report(db_path=None):
'''
Print a report of the specified harvest.
If db_path is not supplied, display details from the most recently created harvest.
'''
_, db = get_db(db_path)
if db is not None:
meta = db.table('meta').all()[0]
date = datetime.fromtimestamp(meta['timestamp']).strftime('%Y-%m-%d %H:%M:%S')
items_harvested = len(db.table('items').all())
print(f'Harvest started: {date}')
print(f'Items harvested: {items_harvested} of {meta["total_results"]}\n')
display(meta)
def harvest_restart(db_path=None):
'''
Attempt to restart the specified harvest.
If db_path is not supplied, restart the most recently created harvest.
'''
db_path, db = get_db(db_path)
if db is not None:
meta = db.table('meta').all()[-1]
pages_harvested = meta['pages_harvested']
if pages_harvested < meta['total_pages']:
start = pages_harvested + 1
harvest_items(db_path=str(db_path), start=start, **meta['params'])
else:
print('Harvest complete')
def reharvest_items(db_path=None):
'''
Harvest items using the parameters of the specified db.
If db_path is not supplied, use the most recently created harvest.
'''
_, db = get_db(db_path)
if db is not None:
meta = db.table('meta').all()[0]
harvest_items(**meta['params'])
# Display details of the most recent harvest
# Optionally, supply the path to an existing db, eg: harvest_report('harvests/1567480968/db-items.json')
harvest_report()
# Restart the most recent harvest
# Optionally, supply the path to an existing db, eg: harvest_restart('harvests/1567480968/db-items.json')
harvest_restart()
# Create a new harvest using the parameters of the most recent harvest
# Optionally, supply the path to an existing db, eg: reharvest_items('harvests/1567480968/db-items.json')
reharvest_items()
Although your harvest is already saved in a TinyDB database, you might want to convert it to a simpler format for download and analysis. The functions below provide two options:
- save_harvest_as_json() – save the harvested items as a JSON file
- save_harvest_as_csv() – save the harvested items as a CSV file

The columns in the CSV-formatted file are:

- identifier – the barcode number
- series – identifier of the series which contains the item
- control_symbol – individual control symbol
- title – title of the item
- contents_date_str – the contents date string as in RecordSearch
- contents_start_date – the first date in the contents date string, converted to ISO format
- contents_end_date – the second date in the contents date string, converted to ISO format
- location – where the file is held
- access_status – 'Closed', 'Open', 'OWE', or 'NYE'
- digitised_status – True/False, has the file been digitised
- digitised_pages – number of pages in the digitised file

def save_harvest_as_json(db_path=None):
'''
Save harvested items as a json file.
If db_path is not supplied, use the most recently created harvest.
'''
_, db = get_db(db_path)
if db is not None:
# Get harvest metadata
meta = db.table('meta').all()[0]
# Get harvested items
items = db.table('items').all()
# Set file name and path
filename = Path(f'harvests/{meta["timestamp"]}/items.json')
# Dump items to a JSON file
with open(filename, 'w') as json_file:
json.dump(items, json_file)
# Display link to download
display(FileLink(filename))
def save_harvest_as_csv(db_path=None):
'''
Save harvested items as a CSV file.
If db_path is not supplied, use the most recently created harvest.
'''
_, db = get_db(db_path)
if db is not None:
# Get harvest metadata
meta = db.table('meta').all()[0]
# Get harvested items
items = db.table('items').all()
# Flatten the nested date field using json_normalize and convert the items to a dataframe
df = pd.DataFrame(pd.json_normalize(items))
# Rename the date columns
df.rename(columns={'contents_dates.date_str': 'contents_date_str', 'contents_dates.start_date': 'contents_start_date', 'contents_dates.end_date': 'contents_end_date'}, inplace=True)
# Put the columns in a nice order
df = df[['identifier', 'series', 'control_symbol', 'title', 'contents_date_str', 'contents_start_date', 'contents_end_date', 'location', 'access_status', 'digitised_status', 'digitised_pages']]
# Set file name and path
filename = Path(f'harvests/{meta["timestamp"]}/items.csv')
# Save as CSV
df.to_csv(filename, index=False)
# Display link to download
display(FileLink(filename))
# Save the most recent harvest as a json file
# Optionally, supply the path to an existing db, eg: save_harvest_as_json('harvests/1567480968/db-items.json')
save_harvest_as_json()
# Save the most recent harvest as a CSV file
# Optionally, supply the path to an existing db, eg: save_harvest_as_csv('harvests/1567480968/db-items.json')
save_harvest_as_csv()
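Once the CSV has been saved, you can load it back into pandas for a quick look at your data. This is just a sketch – the timestamp in the path is an example of the harvests/<timestamp>/items.csv pattern used above:
# Load the saved CSV into a dataframe -- replace the timestamp with your own harvest's
df = pd.read_csv('harvests/1567492794/items.csv', parse_dates=['contents_start_date', 'contents_end_date'])
# How many items have each access status?
df['access_status'].value_counts()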
You can harvest a complete series using the examples above. Just set the series search parameter to the series identifier. For example:
harvest_items(series='A6119')
But what do you do if the series contains more than 20,000 items? One way of splitting the series up into harvestable chunks is to use wildcard values and the control search parameter. For example, B13 has more than 20,000 items, but if we limit the results to items with a control symbol starting with '1', we bring the number down to under 20,000:
harvest_items(series='B13', control='1*')
To make sure we get everything in the series we can repeat the harvest using a range of prefixes for the control symbol – the easiest approach is simply to loop through each letter and number from A to Z and 0 to 9. That's exactly what the harvest_large_series() function below does. The key thing is the control_range parameter. By default, it is a list that looks like this:
['A*', 'B*', 'C*', 'D*', 'E*', 'F*', 'G*', 'H*', 'I*', 'J*', 'K*', 'L*', 'M*', 'N*', 'O*', 'P*', 'Q*', 'R*', 'S*', 'T*', 'U*', 'V*', 'W*', 'X*', 'Y*', 'Z*', '0*', '1*', '2*', '3*', '4*', '5*', '6*', '7*', '8*', '9*']
Note that it's possible to get duplicate items this way because some items include earlier versions of control symbols and these are searched as well as the current ones. We can filter out the duplicates by asking the harvest function to check whether an item is already in the db before saving it.
def harvest_large_series(series, control_range=None, restart=False, db_path=None):
'''
RecordSearch will not return more than 20,000 results.
If a series has more than 20,000 items you'll need to break it up.
The easiest way to do this is to add a param for control_symbol.
This function will break a series harvest down into a series of smaller harvests --
using each letter and number with a wildcard as the control_symbol parameter.
This should be enough to harvest most large series, but in some cases you might need to supply a custom list of control_symbol prefixes.
'''
start = 1
# If you don't supply a range of prefixes, use A-Z and 0-9
if not control_range:
control_range = [letter + '*' for letter in string.ascii_uppercase] + [str(number) + '*' for number in range(0, 10)]
# Try to restart a failed harvest
if restart is True:
# Get the current db
db_path, db = get_db(db_path)
# Get the most recent metadata
meta = db.table('meta').all()[-1]
# Get the last used control
control = meta['params']['control']
# Limit the prefixes to ones that haven't been used yet
control_range = control_range[control_range.index(control):]
# Work out which page to start from
pages_harvested = meta['pages_harvested']
if pages_harvested < meta['total_pages']:
start = pages_harvested + 1
# Loop through control prefixes initiating a new harvest for each
# After the first harvest we'll have a value for db_path, which we can then use in subsequent calls.
# This means all the separate harvests will be saved in a single db
for control in control_range:
# Start a new harvest using the control symbol. Set harvester to check for duplicates.
db_path = harvest_items(db_path=db_path, start=start, check_duplicates=True, series=series, control=control)
start = 1
# Harvest a series with more than 20,000 items
harvest_large_series(series='B13')
However, some series can't easily be sliced up into chunks of fewer than 20,000 items. Either they're extremely large, or their range of control symbol prefixes is very small. In these cases you have to experiment to find a list of prefixes that will work. How do you know whether it works? The function below lets you test a range of control symbol prefixes to check whether the slices return fewer than 20,000 items. If you give it a series identifier, it will use the default range of prefixes described above. But you can also feed it a customised list.
def test_control_prefixes(series, control_range=None):
'''
Test a range of control symbol prefixes to see if they split a series up into chunks of less than 20,000 items.
Prints the number of results for each prefix, or 'More than 20,000' if greater than 20,000.
'''
c = RSSearchClient()
if not control_range:
control_range = [letter + '*' for letter in string.ascii_uppercase] + [str(number) + '*' for number in range(0, 10)]
for control in control_range:
try:
results = c.search(series=series, control=control)
total = results['total_results']
except TooManyError:
total = 'More than 20,000'
print(f'{control}: {total}')
For example, series A1 has more than 60,000 items and uses the year as the control symbol prefix. In this case we can create a list of control symbol prefixes to slice the series up by year-like combinations of digits. The following cell creates such a list.
# For series like A1 that use the year as the control symbol prefix, this range should work.
control_range = [str(num) + '*' for num in range(2,10)] + ['1{}*'.format(num2) for num2 in [str(num) for num in range(0,9)]] + ['19{}*'.format(num2) for num2 in [str(num) for num in range(1,10)]]
When the cell above is run, it creates a list that looks like this:
['2*', '3*', '4*', '5*', '6*', '7*', '8*', '9*', '10*', '11*', '12*', '13*', '14*', '15*', '16*', '17*', '18*', '191*', '192*', '193*', '194*', '195*', '196*', '197*', '198*', '199*']
Note how it splits years in the 20th century by decade? To test that this will actually work, we can feed this list to the test_control_prefixes() function.
test_control_prefixes('A1', control_range=control_range)
The test_control_prefixes() function prints out the following results:
2*: 0
3*: 2
4*: 0
5*: 0
6*: 0
7*: 0
8*: 0
9*: 2
10*: 1
11*: 0
12*: 0
13*: 0
14*: 0
15*: 0
16*: 0
17*: 1
18*: 0
191*: 17501
192*: 18908
193*: 18247
194*: 0
195*: 0
196*: 0
197*: 0
198*: 0
199*: 2
Note that all the slices contain fewer than 20,000 items. Yay, it works! You might also notice that while most of the control symbol prefixes are years, there are a few oddities as well. In general, you can't rely on control symbols being consistent. Even if the series notes say they use numbers, you need to check to make sure no letters have snuck in. So basically it takes a fair bit of trial and error!
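If you want a rough check that a set of prefixes covers the whole series, you can collect the counts rather than just printing them. This is only a sketch along the lines of test_control_prefixes() above – remember that overlaps caused by superseded control symbols can inflate the total slightly.
def count_control_prefixes(series, control_range):
    '''
    Like test_control_prefixes(), but returns a dict of counts so you can add them up.
    '''
    c = RSSearchClient()
    counts = {}
    for control in control_range:
        try:
            counts[control] = int(c.search(series=series, control=control)['total_results'])
        except TooManyError:
            counts[control] = None
    return counts

# Sum the slices for series A1, ignoring any that were over the 20,000 limit
counts = count_control_prefixes('A1', control_range=control_range)
print(sum(n for n in counts.values() if n))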
Once we've defined a custom control range as above, we can feed it to harvest_large_series() to use in place of the default.
# Use custom range to harvest a large series
harvest_large_series(series='A1', control_range=control_range)
Once you've saved all the metadata from your search, you can use it to download images from all the items that have been digitised.
The function below will look for all items in a harvest db that have a digitised_status of True, and then download all the images from them. The images will be saved in an images folder inside the original harvest's timestamped directory, eg. harvests/1567492815/images.

Within that folder, the images from each item will be saved in a separate folder, named using the series, control_symbol, and identifier values. So the folder a2487-1919-8962-156686 contains images from a file in series A2487, with the control symbol 1919/8962, and the barcode 156686. Images are named with the identifier (barcode) and page number, eg: 156686-1.jpg.
The function also saves some image metadata back into the harvest db, specifically:
- image_dir – the path where the image has been saved
- image_name – the filename of the image
- identifier – the item barcode
- page – the page number
- width – the width of the image (in pixels)
- height – the height of the image (in pixels)

def harvest_images(db_path=None):
'''
Download images from all the digitised files in a harvest.
If db_path is not supplied, use the most recently created harvest.
'''
# Get the harvest db
_, db = get_db(db_path)
if db is not None:
# Get the metadata
meta = db.table('meta').all()[0]
# Find items where digitised_status is True
Record = Query()
items = db.table('items').search(Record.digitised_status == True)
# Loop through digitised items
for item in tqdm(items, desc='Items'):
# Set name of the folder in which to save the images
image_dir = Path(f'harvests/{meta["timestamp"]}/images/{slugify(item["series"])}-{slugify(item["control_symbol"])}-{item["identifier"]}')
# Create the folder (and parent if necessary)
image_dir.mkdir(exist_ok=True, parents=True)
# Loop through the page numbers
for page in tqdm(range(1, item['digitised_pages'] + 1), desc='Pages', leave=False):
# Define the image filename using the barcode and page number
filename = Path(f'{image_dir}/{item["identifier"]}-{page}.jpg')
# Check to see if the image already exists (useful if rerunning a failed harvest)
if not filename.exists():
# If it doesn't already exist then download it
img_url = f'http://recordsearch.naa.gov.au/NaaMedia/ShowImage.asp?B={item["identifier"]}&S={page}&T=P'
response = requests.get(img_url)
response.raise_for_status()
# Try opening the file as an image
try:
image = Image.open(BytesIO(response.content))
except IOError:
print('Not an image')
else:
# If it's an image, get its dimensions
width, height = image.size
# Save the image
image.save(filename)
# Create image metadata
image_meta = {
'image_dir': str(image_dir),
'image_name': '{}-{}.jpg'.format(item['identifier'], page),
'identifier': item['identifier'],
'page': page,
'width': width,
'height': height
}
# Add/update the image metadata
db.table('images').upsert(image_meta, Record.image_name == image_meta['image_name'])
# Pause
time.sleep(0.5)
# Download images from the most recent harvest
# Optionally, supply the path to an existing db, eg: harvest_images('harvests/1567480968/db-items.json')
harvest_images()
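If you just want a single page from a digitised file, you can request it directly using the same image URL pattern that harvest_images() uses. Here's a rough example – barcode 156686 is the example file mentioned above:
# Download one page of a digitised file using its barcode
barcode = 156686  # example barcode from the folder-naming example above
page = 1
img_url = f'http://recordsearch.naa.gov.au/NaaMedia/ShowImage.asp?B={barcode}&S={page}&T=P'
response = requests.get(img_url)
response.raise_for_status()
image = Image.open(BytesIO(response.content))
print(image.size)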
It might be handy to have all the image metadata in a single CSV file.
def save_metadata_as_csv(db_path=None):
'''
Save metadata of harvested images as a CSV file.
If db_path is not supplied, use the most recently created harvest.
'''
_, db = get_db(db_path)
if db is not None:
# Get harvest metadata
meta = db.table('meta').all()[0]
# Get harvested images
images = db.table('images').all()
# Convert the image metadata to a dataframe
df = pd.DataFrame(images)
# Set file name and path
filename = Path(f'harvests/{meta["timestamp"]}/images.csv')
# Save as CSV
df.to_csv(filename, index=False)
# Display link to download
display(FileLink(filename))
# Save the image metadata and display a link
# Optionally, supply the path to an existing db, eg: save_metadata_as_csv('harvests/1567480968/db-items.json')
save_metadata_as_csv()
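As with the item metadata, you can load the saved image metadata back into pandas for a quick overview. A small sketch (again, the timestamp in the path is just an example):
# Load the image metadata and summarise it -- replace the timestamp with your own harvest's
df = pd.read_csv('harvests/1567492794/images.csv')
print(f'{len(df)} images from {df["identifier"].nunique()} digitised files')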
Coming soon, notebooks to help you explore your harvested data...
Created by Tim Sherratt as part of the GLAM Workbench.