Notebook

Download the OCRd text for ALL the digitised periodicals in Trove!¶

Many of the digitised periodicals available in Trove make OCRd text available for download. This notebook helps you download all the OCRd text from all (or most of?) Trove's digitised periodicals, creating one text file for each issue. It also saves a CSV-formatted list of the issues in each periodical. If you want to harvest all the text of a single periodical, see Get OCRd text from a digitised journal in Trove.

While you can download the complete text of an issue using the web interface, there's no option to do this with the API alone. The workaround is to mimic the web interface by constructing a download link using the issue identifier and number of pages.

This notebook works through the dataset of digitised periodical issues created by Enrich the list of periodicals from the Trove API, requesting and saving the OCRd text for each issue. However, not every issue has text available to download.

Setting things up¶

In [ ]:

# Let's import the libraries we need.
import os
import re
import shutil
import time
from datetime import timedelta
from pathlib import Path

import pandas as pd
import requests
import requests_cache
from dotenv import load_dotenv
from requests.adapters import HTTPAdapter
from requests.exceptions import ConnectionError, Timeout
from requests.packages.urllib3.util.retry import Retry
from slugify import slugify
from tqdm.auto import tqdm

s = requests_cache.CachedSession(expire_after=timedelta(days=30))
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("https://", HTTPAdapter(max_retries=retries))
s.mount("http://", HTTPAdapter(max_retries=retries))

load_dotenv()

In [ ]:

def check_for_file(issue, texts_path):
    issue_date = issue["date"] if issue["date"] else "nd"
    file_name = f"{issue_date}-{issue['id']}.txt"
    file_path = Path(texts_path, file_name)
    if file_path.exists():
        return file_name
    return ""


def download_issue(issue_id, last_page, file_path):
    url = f"https://trove.nla.gov.au/{issue_id}/download?downloadOption=ocr&firstPage=0&lastPage={last_page}"
    # print(url)
    # Get the file
    try:
        response = s.get(url, timeout=180)
    except (Timeout, ConnectionError) as err:
        print(f"{type(err).__name__}: {url}")
    else:
        # Check there was no error
        if response.status_code == requests.codes.ok:
            # Check that the file's not empty
            response.encoding = "utf-8"
            # Check that the file isn't HTML (some not found pages don't return 404s)
            # BS is too lax and will pass text files that happen to have html tags in them
            # if BeautifulSoup(r.text, "html.parser").find("html") is None:
            if (
                len(response.text) > 0
                and not response.text.isspace()
                and not re.search(r"</HTML>", response.text, re.IGNORECASE)
            ):
                file_path.write_text(response.text)
            if not response.from_cache:
                time.sleep(0.5)


def download_all_issues(df, output_dir="periodicals"):
    # Group issues by title, then loop trhough the titles/issues
    for title, issues in tqdm(df.groupby(["title_id", "title"])):
        output_path = Path(output_dir, f"{slugify(title[1])[:50]}-{title[0]}")
        texts_path = Path(output_path, "texts")
        texts_path.mkdir(exist_ok=True, parents=True)
        issues_with_pages = issues.loc[issues["pages"] > 0]
        for issue in tqdm(
            issues_with_pages.itertuples(),
            total=issues_with_pages.shape[0],
            leave=False,
        ):
            last_page = issue.pages - 1
            issue_date = issue.date if issue.date else "nd"
            file_name = f"{issue_date}-{issue.id}.txt"
            file_path = Path(texts_path, file_name)
            # Check to see if the file has already been harvested
            if not file_path.exists():
                download_issue(issue.id, last_page, file_path)
        issues["text_file"] = issues.apply(check_for_file, args=(texts_path,), axis=1)
        issues.to_csv(Path(output_path, "issues.csv"), index=False)

In [ ]:

# Load issues dataset
df = pd.read_csv(
    "https://github.com/GLAM-Workbench/trove-periodicals-data/raw/main/periodical-issues.csv",
    keep_default_na=False,
)

# Download texts
download_all_issues(df)

Results for each journal are saved in a separate directory in the output directory (which defaults to periodicals). The name of the journal directory is created using the journal title and journal id. Inside this directory is a CSV formatted file containing details of all the available issues, and a texts sub-directory to contain the downloaded text files.

The individual file names are created using the journal title, issue details, and issue identifier. So the resulting hierarchy might look something like this:

periodicals
    - angry-penguins-nla.obj-320790312
        - issues.csv
        - texts
            - angry-penguins-broadsheet-no-1-nla.obj-320791009.txt

The issues.csv will contain details of all the issues in the journal. Its structure is the same as the periodical-issues.csv loaded from the trove-periodicals-data repository, except that a new text_file column is added. If text was successfully from an issue, the test_file column will include the name of the saved text file.

In [ ]:

# IGNORE THIS CELL -- FOR TESTING ONLY
if os.getenv("GW_STATUS") == "dev":
    df = pd.read_csv(
        "https://github.com/GLAM-Workbench/trove-periodicals-data/raw/main/periodical-issues.csv",
        keep_default_na=False,
    )

download_all_issues(df[:10], output_dir="texts_test")
shutil.rmtree("texts_test")

Created by Tim Sherratt for the GLAM Workbench.

Work on this notebook was supported by the Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab.