Convert a Trove list into a CollectionBuilder exhibition¶

This notebook converts Trove lists into a series of files that can be uploaded to a CollectionBuilder-GH repository to create an instant exhibition. See the CollectionBuilder site for more information on how CollectionBuilder works and what it can do.

Demo: this exhibition was generated from this Trove list.

1. What you need¶

a Trove API key (copy & paste your key where indicated below)
a GitHub account
a Trove List containing items you want to include in your exhibition

2. Set up a GitHub repository for your exhibition¶

Log in to your GitHub account.
Go to my customised CollectionBuilder-GH template repository.
Click on the big green Use this template button.
Give your repository a name by typing in the Repository name box – the name of the repository will form part of the url for your new exhibition, so you probably want to give it a name that relates to the exhibition.
Click on the big green Create a repository from template button. You'll be automatically redirected to your new repository.

3. Enable GitHub Pages for your repository¶

GitHub builds your exhibition from the files in the repository using GitHub Pages. You need to enable this after you create your repository:

Click on the Settings button in your new repository.
Click on the Pages button in the side menubar.
Under Branch select 'main' from the dropdown list and click on Save.

GitHub will now build your exhibition. Once it's ready you'll see a link on the 'Pages' page. The url will have the form https://[your GH user name].github.io/[your repository name]. At the moment the exhibition will contain dummy data – the next step is to generate your own exhibition data!

4. Generate your exhibition files from your Trove list¶

Find your Trove list's numeric id. The list id is the number in the url of your Trove list. So the list with this url https://trove.nla.gov.au/list/83774 has an id of 83774.
Copy and paste your list id and Trove API key where indicated below in this notebook.
From the Jupyter Run menu select Run all cells.
When everything has finished running, a link to a zip file will be displayed at the bottom of the notebook. Download it to your own computer and open the zip file. Done!
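
If you'd rather not pick the id out of the url by hand, it can be extracted with a regular expression. This is only a minimal sketch: `extract_list_id` is a hypothetical helper, not part of this notebook, and it assumes list urls always follow the `/list/[number]` pattern shown above.

```python
import re


def extract_list_id(list_url):
    """Return the numeric id from a Trove list url as a string, or None if none is found."""
    # Assumes the id is the run of digits that follows '/list/' in the url
    match = re.search(r"/list/(\d+)", list_url)
    return match.group(1) if match else None


print(extract_list_id("https://trove.nla.gov.au/list/83774"))
```

The value is returned as a string because the notebook expects `list_id` to be pasted between quotes.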

5. Add more metadata (optional)¶

The metadata describing the items in your exhibition is contained in the _data/[list id]-items.csv file. If the items in your exhibition relate to specific places, you may want to add some extra metadata so that CollectionBuilder can display them on a map.

Information about places is contained in three columns: location, latitude, and longitude. In the location field you can include a list of place names, separated by semicolons, for example: 'Melbourne; Sydney; Hobart'. These place names will be used to build a word cloud when you click on the Location tab in your exhibition.

To add an item to CollectionBuilder's map view, you need to supply values for latitude and longitude.

You might also want to edit the subject and description fields.

Open your metadata file with either a text editor or a spreadsheet program (but beware that some programs, like Excel, might mangle your dates).
Edit the desired values.
Make sure the edited file is saved in CSV (plain text) format, replacing the original metadata file.
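
The same edits can also be scripted. The sketch below uses Python's built-in `csv` module, which reads every value as plain text, so dates are written back exactly as they were read. It assumes the column names described above (objectid, location, latitude, longitude). The `add_place` function, the file path, the object id, and the coordinates in the usage comment are all hypothetical examples.

```python
import csv
from pathlib import Path


def add_place(csv_path, objectid, location, latitude, longitude):
    """Set the place columns for one item in an items CSV. Returns True if the item was found."""
    csv_path = Path(csv_path)
    # Read every row as text so that no values (such as dates) are reinterpreted
    with csv_path.open(newline="", encoding="utf-8") as csv_in:
        reader = csv.DictReader(csv_in)
        fieldnames = reader.fieldnames
        rows = list(reader)
    updated = False
    for row in rows:
        if row["objectid"] == objectid:
            row["location"] = location
            row["latitude"] = str(latitude)
            row["longitude"] = str(longitude)
            updated = True
    # Write the rows back over the original file, keeping the same column order
    with csv_path.open("w", newline="", encoding="utf-8") as csv_out:
        writer = csv.DictWriter(csv_out, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return updated


# Hypothetical usage:
# add_place("_data/83774-items.csv", "article-12345", "Melbourne", -37.8136, 144.9631)
```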

Note that GitHub has its own built-in file editor. So if you don't have a way of editing the CSV file on your own computer, just skip down to the 'Upload your files...' section below and add them to your GitHub repository. To edit the file just view it in GitHub and click on the pencil icon. Once you've finished editing, make sure you click the Commit button to save your changes.

6. Replace tiny images (optional)¶

Trove work records often only include links to tiny thumbnailed versions of images. These don't look great in an exhibition, so you might want to replace them. Different collections use different image viewers, so there's no easy, automated way to do this. You'll have to manually download them and replace the thumbnailed versions.

From the Trove work record, click on the View button and open the link to the original item.
Use whatever download mechanism is provided to save a copy of the image on your computer.
Rename the downloaded image to match the name of the tiny thumbnailed version in your exhibition's objects directory.
Replace the thumbnail image in the objects directory with the new downloaded version.
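
To decide which images are worth replacing, you could list the work images that take up very little disk space. This is a rough sketch only. File size is a crude proxy for image dimensions, the 20,000 byte threshold is an arbitrary assumption, and `find_small_images` is a hypothetical helper. It relies on the `work-[id].jpg` naming that this notebook uses when saving work images.

```python
from pathlib import Path


def find_small_images(objects_dir, max_bytes=20_000):
    """List (name, size in bytes) for work images no larger than max_bytes."""
    small = []
    # Only work images are checked, as these are the ones that may be thumbnails
    for path in sorted(Path(objects_dir).glob("work-*.jpg")):
        size = path.stat().st_size
        if size <= max_bytes:
            small.append((path.name, size))
    return small


# Hypothetical usage:
# for name, size in find_small_images("objects"):
#     print(name, size)
```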

7. Upload your files to the exhibition repository¶

You're now ready to add your exhibition files to the exhibition repository!

Go to the GitHub repository you created above.
Click on the Add file button and select Upload files.
Select the _config.yml file in the exhibition files you downloaded from this notebook.
Click on the green Commit changes button to save the file in your repository.
Open the _data directory in your GitHub repository.
Click on the Add file button and select Upload files.
Select the _data/[list id]-items.csv file in your exhibition files.
Click on the green Commit changes button to save the file in your repository.
Open the objects directory in your GitHub repository.
Click on the Add file button and select Upload files.
Select all the files in the objects directory of your exhibition files.
Click on the green Commit changes button to save the files in your repository.
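
Before uploading, it can be worth confirming that the unzipped folder contains everything the steps above refer to. The sketch below checks for `_config.yml`, the items CSV, and the objects directory, and then checks that each filename listed in the CSV exists in the objects directory. `check_exhibition_files` is a hypothetical helper, and it assumes the CSV has objectid and filename columns like the file generated by this notebook.

```python
import csv
from pathlib import Path


def check_exhibition_files(exhibition_dir, list_id):
    """Return a list of problems found in a downloaded exhibition folder (empty if none)."""
    exhibition_dir = Path(exhibition_dir)
    problems = []
    metadata = exhibition_dir / "_data" / f"{list_id}-items.csv"
    objects = exhibition_dir / "objects"
    if not (exhibition_dir / "_config.yml").is_file():
        problems.append("missing _config.yml")
    if not objects.is_dir():
        problems.append("missing objects directory")
    if not metadata.is_file():
        problems.append(f"missing _data/{list_id}-items.csv")
        return problems
    # Every filename in the metadata should have a matching file in objects
    with metadata.open(newline="", encoding="utf-8") as csv_in:
        for row in csv.DictReader(csv_in):
            filename = row.get("filename", "")
            if filename and not (objects / filename).is_file():
                problems.append(
                    f"{row.get('objectid', '?')}: objects/{filename} not found"
                )
    return problems


# Hypothetical usage:
# print(check_exhibition_files("83774", "83774"))
```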

Once you've uploaded the files, GitHub will rebuild the exhibition using your data. It might take a little while to generate, but once it's ready you'll see it at https://[your GH user name].github.io/[your repository name].

If you're not happy with the metadata and how it displays, you can either edit the exhibition files on your own computer and re-upload them to GitHub, or use GitHub's built-in file editor to make changes. To edit a file just view it in GitHub and click on the pencil icon. Once you've finished editing, make sure you click the Commit button to save your changes.

Every time you make a change to your repository, GitHub will automatically rebuild your exhibition.

8. Further customisation¶

You can further customise the look and feel of your exhibition by editing the _data/theme.yml file. For example, you can:

Set a different featured-image to display in the header of your exhibition.
Change the latitude and longitude values to set the centre of the map view.

See the CollectionBuilder documentation for more options.
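
If you have a local copy of the repository, a setting such as latitude or longitude could be changed with a few lines of Python. This is a sketch under assumptions: it treats theme.yml as containing simple top-level `key: value` lines for these settings, which you should confirm against your own copy of the file. `set_theme_value` is a hypothetical helper, and the coordinates in the usage comment are just an example. It replaces the whole line, so any comment on the same line is lost, but the rest of the file is left untouched.

```python
import re
from pathlib import Path


def set_theme_value(theme_path, key, value):
    """Replace the first top-level 'key: ...' line in a theme file. Returns True if found."""
    theme_path = Path(theme_path)
    text = theme_path.read_text(encoding="utf-8")
    # Match a line that starts with the key followed by a colon
    pattern = re.compile(rf"^{re.escape(key)}:.*$", flags=re.MULTILINE)
    new_text, count = pattern.subn(lambda m: f"{key}: {value}", text, count=1)
    if count:
        theme_path.write_text(new_text, encoding="utf-8")
    return bool(count)


# Hypothetical usage, centring the map on Canberra:
# set_theme_value("_data/theme.yml", "latitude", -35.28)
# set_theme_value("_data/theme.yml", "longitude", 149.13)
```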

Annotating Trove list items¶

You can add your own annotations to Trove list items and these will automatically be included in your exhibition. To add a descriptive note:

Make sure you're logged in to your Trove account.
Go to your list (you can find a list of your lists in your Trove user profile).
Go to the item in your list you want to annotate and click on the Add list item note button.
Add your note.

Your note will be added to the description field of the item when you generate your exhibition files. In addition, any tags added to items in your list will be added to the subject field.

Note that if you make changes to your list, you'll need to regenerate the exhibition files using this notebook and upload them to your GitHub repository before the changes are visible in your exhibition.

In [ ]:

import os
import shutil
from pathlib import Path

import pandas as pd
import requests
import yaml
from IPython.display import HTML
from PIL import Image
from PIL.ImageOps import fit
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from tqdm.auto import tqdm
from trove_newspaper_images.articles import download_images

s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
s.mount("http://", HTTPAdapter(max_retries=retries))
s.mount("https://", HTTPAdapter(max_retries=retries))

In [ ]:

%%capture
# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv

Add your API key and list ID values¶

This is the only section that you'll need to edit. Paste your API key and list id in the cells below as indicated. Once you've finished, select Run all cells from the Run menu to generate your exhibition files.

In [ ]:

# Insert your Trove API key between the quotes
API_KEY = "YOUR API KEY"

# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")

In [ ]:

# Paste your list id between the quotes
list_id = "83777"

Define some functions¶

In [ ]:

def listify(value):
    """
    Sometimes values can be lists and sometimes not.
    Turn them all into lists to make life easier.
    """
    if isinstance(value, (str, int)):
        value = [str(value)]
    return value


def get_url(identifiers, linktype):
    """
    Loop through the identifiers to find the request url.
    """
    url = ""
    for identifier in identifiers:
        if identifier["linktype"] == linktype:
            url = identifier["value"]
            break
    return url


def save_as_csv(list_dir, data, data_type):
    df = pd.DataFrame(data)
    # Lists that contain no newspaper articles won't have a 'pages' column
    if "pages" in df.columns:
        df["pages"] = df["pages"].astype("Int64")
    df.to_csv(Path(list_dir, "_data", f"{list_id}-{data_type}.csv"), index=False)


def make_filename(article):
    """
    Create a filename for a text file or PDF.
    For easy sorting/aggregation the filename has the format:
        PUBLICATIONDATE-NEWSPAPERID-ARTICLEID
    """
    date = article["date"]
    date = date.replace("-", "")
    newspaper_id = article["newspaper_id"]
    article_id = article["id"]
    return "{}-{}-{}".format(date, newspaper_id, article_id)


def get_list(list_id):
    list_url = f"https://api.trove.nla.gov.au/v2/list/{list_id}?encoding=json&reclevel=full&include=listItems&key={API_KEY}"
    response = s.get(list_url)
    return response.json()


def get_article(id):
    article_api_url = f"https://api.trove.nla.gov.au/v2/newspaper/{id}/?encoding=json&reclevel=full&key={API_KEY}&include=tags"
    response = s.get(article_api_url)
    return response.json()


def get_work(id):
    article_api_url = f"https://api.trove.nla.gov.au/v2/work/{id}/?encoding=json&reclevel=full&key={API_KEY}&include=tags,links"
    response = s.get(article_api_url)
    return response.json()


def make_dirs(list_id):
    list_dir = Path("cb-exhibitions", list_id)
    list_dir.mkdir(parents=True, exist_ok=True)
    Path(list_dir, "objects").mkdir(exist_ok=True)
    Path(list_dir, "temp").mkdir(exist_ok=True)
    Path(list_dir, "_data").mkdir(exist_ok=True)
    return list_dir


def get_subjects(work):
    subjects = []
    if "subject" in work:
        subjects = listify(work["subject"])
    if "tag" in work:
        for tag in work["tag"]:
            subjects.append(tag["value"])
    return subjects


def get_work_image_url(record):
    image_url = get_url(record.get("identifier", ""), "viewcopy")
    if not image_url:
        image_url = get_url(record.get("identifier", ""), "thumbnail")
    return image_url


def save_work_image(list_dir, record):
    image_url = get_work_image_url(record)
    if image_url:
        response = s.get(image_url)
        if response.status_code == 200:
            filename = Path(list_dir, "objects", f"work-{record.get('id', '')}.jpg")
            filename.write_bytes(response.content)
            return filename


def get_article_tags(record):
    subjects = []
    article = get_article(record["id"])["article"]
    if "tag" in article:
        for tag in article["tag"]:
            subjects.append(tag["value"])
    return subjects


def get_parent(record):
    parent = ""
    parents = listify(record.get("isPartOf", []))
    if parents:
        if isinstance(parents[0], dict) and "value" in parents[0]:
            parent = parents[0]["value"]
        else:
            parent = parents[0]
    return parent


def update_config(list_data, list_dir):
    with Path("cb-config", "_config.yml").open("r") as config_in:
        config = yaml.safe_load(config_in)
    config["title"] = list_data["list"][0]["title"]
    config["author"] = list_data["list"][0]["creator"].replace("public:", "")
    config["metadata"] = f'{list_data["list"][0]["id"]}-items'
    with Path(list_dir, "_config.yml").open("w") as config_out:
        config_out.write(yaml.dump(config))


def harvest_list(list_id):
    list_dir = make_dirs(list_id)
    data = get_list(list_id)
    update_config(data, list_dir)
    items = []
    for item in tqdm(data["list"][0]["listItem"]):
        for zone, record in item.items():
            if zone == "work":
                # Some fields aren't included in the list data, so get the full work record
                work_data = get_work(record["id"])["work"]
                work = {
                    "objectid": f"work-{record.get('id', '')}",
                    "title": record.get("title", ""),
                    "type": ";".join(listify(record.get("type", ""))),
                    "date": listify(record.get("issued", ""))[0],
                    "creator": "; ".join(listify(record.get("contributor", ""))),
                    "is_part_of": get_parent(record),
                    "trove_url": record.get("troveUrl", ""),
                    "source_url": get_url(record.get("identifier", ""), "fulltext"),
                    "description": item.get("note", ""),
                    "subject": "; ".join(get_subjects(work_data)),
                    "location": "",
                    "latitude": "",
                    "longitude": "",
                }
                image_filename = save_work_image(list_dir, work_data)
                if image_filename:
                    work["filename"] = image_filename.name
                    work["format"] = "image/jpeg"
                items.append(work)
            elif zone == "article":
                newspaper_id = record.get("title", {}).get("id")
                newspaper_title = record.get("title", {}).get("value")
                newspaper_link = f'<a href="http://nla.gov.au/nla.news-title{newspaper_id}">{newspaper_title}</a>'
                # citation =
                article = {
                    "objectid": f"article-{record.get('id', '')}",
                    "title": record.get("heading", ""),
                    "date": record.get("date", ""),
                    "is_part_of": newspaper_link,
                    "pages": record.get("pageSequence", ""),
                    "trove_url": f'http://nla.gov.au/nla.news-article{record.get("id")}',
                    "type": "Newspaper article",
                    "format": "image/jpeg",
                    "description": item.get("note", ""),
                    "subject": "; ".join(get_article_tags(record)),
                    "location": "",
                    "latitude": "",
                    "longitude": "",
                }
                images = download_images(record["id"], Path(list_dir, "temp"))
                # Only crop and save an image if one was actually downloaded
                if images:
                    img = Image.open(Path(list_dir, "temp", images[0]))
                    cropped = fit(
                        img,
                        (800, 800),
                        method=Image.Resampling.LANCZOS,
                        centering=(0.5, 0),
                    )
                    cropped.save(Path(list_dir, "objects", images[0]), "JPEG")
                    article["filename"] = images[0]
                items.append(article)
    shutil.rmtree(Path(list_dir, "temp"))
    if items:
        save_as_csv(list_dir, items, "items")
    return items

Let's do it!¶

Run the cell below to start the exhibition building process.

In [ ]:

items = harvest_list(list_id)

Download the results¶

Run the cell below to zip up all the harvested files and create a download link.

In [ ]:

list_dir = Path("cb-exhibitions", list_id)
shutil.make_archive(list_dir, "zip", list_dir)
HTML(f'<a download="{list_id}.zip" href="{list_dir}.zip">Download your files</a>')

Created by Tim Sherratt for the GLAM Workbench.