Download the OCRd text for ALL the digitised periodicals in Trove!

Putting together the list of periodicals created by a separate notebook with the code in this notebook, you can download the OCRd text from every digitised periodical in the journals zone. If you're going to try this, you'll need lots of patience and lots of disk space. Needless to say, don't try this on a cloud service like Binder.

Fortunately you don't have to do it yourself, as I've already run the harvest and made all the text files available. See below for details.

I repeat, you probably don't want to do this yourself. The point of this notebook is really to document the methodology used to create the repository.

If you really, really do want to do it yourself, you should first generate an updated list of digitised periodicals.

Here's a harvest I prepared earlier...

I last ran this harvest in August 2021. Here are the results:

  • 1,163 periodicals had OCRd text available for download
  • OCRd text was downloaded from 51,928 periodical issues
  • About 10GB of text was downloaded

Note that, unlike previous harvests, this one excluded periodicals with the format 'government publication' – so the total amount harvested has decreased. Government publications are actually spread across both the books and journals zones, so I'm planning to do a separate harvest just for them.

The list of digital journals with OCRd text is available both as a human-readable list and as a CSV-formatted spreadsheet.

The complete collection of text files for all the journals can be downloaded from this repository on CloudStor.

Setting things up

In [1]:
# Let's import the libraries we need.
import requests
import arrow
from bs4 import BeautifulSoup
import time
import os
import re
import glob
import pandas as pd
from tqdm.auto import tqdm
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from slugify import slugify
from IPython.display import display, HTML, FileLink
import requests_cache
from pathlib import Path

# Use a cached session so responses are saved and interrupted harvests can be resumed
s = requests_cache.CachedSession()
# Retry failed requests up to 5 times, backing off between attempts
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))
In [2]:
# These functions are copied from Get-text-from-a-Trove-journal.ipynb

def harvest_metadata(obj_id):
    '''
    This calls an internal API from a journal landing page to extract a list of available issues.
    '''
    start_url = 'https://nla.gov.au/{}/browse?startIdx={}&rows=20&op=c'
    # The initial startIdx value
    start = 0
    # Number of results per page
    n = 20
    issues = []
    with tqdm(desc='Issues', leave=False) as pbar:
        # If there aren't 20 results on the page then we've reached the end, so continue harvesting until that happens.
        while n == 20:
            # Get the browse page
            response = s.get(start_url.format(obj_id, start), timeout=60)
            # Beautifulsoup turns the HTML into an easily navigable structure
            soup = BeautifulSoup(response.text, 'lxml')
            # Find all the divs containing issue details and loop through them
            details = soup.find_all(class_='l-item-info')
            for detail in details:
                issue = {}
                title = detail.find('h3')
                if title:
                    issue['title'] = title.text
                    issue['id'] = title.parent['href'].strip('/')
                else:
                    issue['title'] = 'No title'
                    issue['id'] = detail.find('a')['href'].strip('/')
                try:
                    # Get the issue details
                    issue['details'] = detail.find(class_='obj-reference content').string.strip()
                except (AttributeError, IndexError):
                    issue['details'] = 'issue'
                # Get the number of pages
                try:
                    issue['pages'] = int(re.search(r'^(\d+)', detail.find('a', attrs={'data-pid': issue['id']}).text, flags=re.MULTILINE).group(1))
                except AttributeError:
                    issue['pages'] = 0
                issues.append(issue)
                # print(issue)
                if not response.from_cache:
                    time.sleep(0.5)
            # Increment the startIdx
            start += n
            # Set n to the number of results on the current page
            n = len(details)
            pbar.update(n)
    return issues

def save_ocr(issues, obj_id, title=None, output_dir='journals'):
    '''
    Download the OCRd text for each issue.
    '''
    processed_issues = []
    if not title:
        title = issues[0]['title']
    output_path = os.path.join(output_dir, '{}-{}'.format(slugify(title)[:50], obj_id))
    texts_path = os.path.join(output_path, 'texts')
    os.makedirs(texts_path, exist_ok=True)
    for issue in tqdm(issues, desc='Texts', leave=False):
        # Default values
        issue['text_file'] = ''
        if issue['pages'] != 0:
            # print(issue['title'])
            # The index value for the last page of an issue will be the total pages - 1
            last_page = issue['pages'] - 1
            file_name = '{}-{}-{}.txt'.format(slugify(issue['title'])[:50], slugify(issue['details'])[:50], issue['id'])
            file_path = os.path.join(texts_path, file_name)
            # Check to see if the file has already been harvested
            if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
                # print('Already saved')
                issue['text_file'] = file_name
            else:
                url = 'https://trove.nla.gov.au/{}/download?downloadOption=ocr&firstPage=0&lastPage={}'.format(issue['id'], last_page)
                # print(url)
                # Get the file
                r = s.get(url, timeout=120)
                # Check there was no error
                if r.status_code == requests.codes.ok:
                    # Check that the file's not empty
                    r.encoding = 'utf-8'
                    if len(r.text) > 0 and not r.text.isspace():
                        # Check that the file isn't HTML (some not found pages don't return 404s)
                        if BeautifulSoup(r.text, 'html.parser').find('html') is None:
                            # If everything's ok, save the file
                            with open(file_path, 'w', encoding='utf-8') as text_file:
                                text_file.write(r.text)
                            issue['text_file'] = file_name
                if not r.from_cache:
                    time.sleep(1)
        processed_issues.append(issue)
    df = pd.DataFrame(processed_issues)
    # Remove empty directories
    try:
        os.rmdir(texts_path)
        os.rmdir(output_path)
    except OSError:
        # The directory isn't empty, so save the list of issues as a CSV
        df.to_csv(os.path.join(output_path, '{}-issues.csv'.format(obj_id)), index=False)
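
If you want to test these functions on a single journal before launching a full harvest, you can call them directly. Here's a minimal sketch using the Angry Penguins identifier from the example directory structure further down:

# Harvest the issue metadata and OCRd text for a single journal
# (nla.obj-320790312 is the Angry Penguins journal used in the directory example below)
test_id = 'nla.obj-320790312'
test_issues = harvest_metadata(test_id)
if test_issues:
    save_ocr(test_issues, test_id, title='Angry Penguins', output_dir='journals')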

Process all the journals!

As already mentioned, this takes a long time. It will also probably fail at various points and you'll have to run it again. If you do restart, the script will start at the beginning, but it won't redownload any text files that have already been harvested.

Results for each journal are saved in a separate directory within the output directory (which defaults to journals). The name of the journal directory is created using the journal title and journal id. Inside this directory is a CSV-formatted file containing details of all the available issues, and a texts sub-directory containing the downloaded text files.

The individual file names are created using the journal title, issue details, and issue identifier. So the resulting hierarchy might look something like this:

journals
    - angry-penguins-nla.obj-320790312
        - nla.obj-320790312-issues.csv
        - texts
            - angry-penguins-broadsheet-no-1-nla.obj-320791009.txt

The CSV list of issues includes the following fields:

  • details – string with issue details, might include dates, issue numbers etc.
  • id – issue identifier
  • pages – number of pages in this issue
  • text_file – file name of any downloaded OCRd text
  • title – journal title (as extracted from issue browse list, might differ from original journal title)

Note that if the text_file field is empty, it means that no OCRd text could be extracted for that particular issue. Note also that if no OCRd text is available, no journal directory will be created, and nothing will be saved.
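
Once a journal has been harvested, you can load its issues CSV with pandas (imported above) to see which issues are missing text. A quick sketch – the path below is just the Angry Penguins example from the hierarchy above:

# Load the issues list for one harvested journal (path from the example above)
issues_df = pd.read_csv('journals/angry-penguins-nla.obj-320790312/nla.obj-320790312-issues.csv', keep_default_na=False)
# Issues with an empty text_file value had no OCRd text available
missing = issues_df.loc[issues_df['text_file'] == '']
print(f'{missing.shape[0]} of {issues_df.shape[0]} issues have no OCRd text')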

In [3]:
# You can provide a different output_dir if you want
def process_titles(output_dir='journals'):
    df = pd.read_csv('digital-journals.csv')
    # df = pd.read_csv('government-publications-periodicals.csv')
    # Remove duplicate records, keeping one version of each trove_id (sorted by fulltext_url_type)
    journals = df.sort_values(by=['trove_id', 'fulltext_url_type']).drop_duplicates(subset='trove_id', keep='last').to_dict('records')
    for journal in tqdm(journals, desc='Journals'):
        issues = harvest_metadata(journal['trove_id'])
        if issues:
            save_ocr(issues, journal['trove_id'], title=journal['title'], output_dir=output_dir)
In [ ]:
# Start harvesting!!!!
process_titles()
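
If you're saving the harvest somewhere other than the current working directory – for example, to an external drive, as in the collect_issue_data call further down – pass that path as output_dir. A sketch:

# Save the harvest to a volume with plenty of disk space
process_titles(output_dir='/Volumes/bigdata/mydata/Trove/journals')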

Gather data about the harvest

Because the harvesting takes a long time and is prone to failure, it seemed wise to gather data at the end, rather than keeping a running total.

The cells below create a list of journals that have OCRd text. The list has the following fields:

  • fulltext_url – the url of the landing page of the digital version of the journal
  • title – the title of the journal
  • trove_id – the 'nla.obj' part of the fulltext_url, a unique identifier for the digital journal
  • trove_url – url of the journal's metadata record in Trove
  • issues – the number of available issues
  • issues_with_text – the number of issues that OCRd text could be downloaded from
  • directory – the directory in which the files from this journal have been saved (relative to the output directory)
In [15]:
def collect_issue_data(output_path='journals'):
    titles_with_text = []
    df = pd.read_csv('digital-journals-20210805.csv', keep_default_na=False)
    # df = pd.read_csv('government-publications-periodicals-20210802.csv')
    journals = df.to_dict('records')
    for j in journals:
        j_dir = os.path.join(output_path, '{}-{}'.format(slugify(j['title'])[:50], j['trove_id']))
        if os.path.exists(j_dir):
            csv_file = os.path.join(j_dir, '{}-issues.csv'.format(j['trove_id']))
            issues_df = pd.read_csv(csv_file, keep_default_na=False)
            j['issues'] = issues_df.shape[0]
            j['issues_with_text'] = issues_df.loc[issues_df['text_file'] != ''].shape[0]
            j['directory'] = '{}-{}'.format(slugify(j['title'])[:50], j['trove_id'])
            titles_with_text.append(j)
    return titles_with_text
In [16]:
# Gather the data
# titles_with_text = collect_issue_data()
titles_with_text = collect_issue_data('/Volumes/bigdata/mydata/Trove/journals')

Convert to a dataframe.

In [17]:
df = pd.DataFrame(titles_with_text)
df.head()
Out[17]:
title contributor issued format fulltext_url trove_url trove_id fulltext_url_type issues issues_with_text directory
0 Laws, etc. (Acts of the Parliament) Victoria 1900-2021 Government publication | Periodical | Periodic... http://nla.gov.au/nla.obj-54127737 https://trove.nla.gov.au/work/10078182 nla.obj-54127737 digitised 15 15 laws-etc-acts-of-the-parliament-nla.obj-54127737
1 The Silver stream songster 1890-1900 Periodical | Periodical/Journal, magazine, other https://nla.gov.au/nla.obj-614066685 https://trove.nla.gov.au/work/10087062 nla.obj-614066685 digitised 1 1 the-silver-stream-songster-nla.obj-614066685
2 Report / Defence Force Remuneration Tribunal Australia. Defence Force Remuneration Tribunal 1980-2021 Government publication | Periodical | Periodic... https://nla.gov.au/nla.obj-2137302489 https://trove.nla.gov.au/work/10096343 nla.obj-2137302489 digitised 1 1 report-defence-force-remuneration-tribunal-nla...
3 Territory of Papua : annual report for the per... Australia. Department of External Territories 1940-1949 Government publication | Periodical | Periodic... https://nla.gov.au/nla.obj-2060262652 https://trove.nla.gov.au/work/10103835 nla.obj-2060262652 digitised 4 4 territory-of-papua-annual-report-for-the-perio...
4 Review of operations / Australian Land Transpo... Australian Land Transport Development Program 1991-1994 Government publication | Periodical | Periodic... https://nla.gov.au/nla.obj-1654948325 https://trove.nla.gov.au/work/10105924 nla.obj-1654948325 digitised 3 3 review-of-operations-australian-land-transport...

Save as a CSV file.

In [24]:
df.to_csv('digital-journals-with-text.csv', index=False)
display(FileLink('digital-journals-with-text.csv'))

Or, if you want to explore data you've already harvested and saved as a CSV, just reload the file.

In [19]:
df = pd.read_csv('digital-journals-with-text.csv', keep_default_na=False)

Let's have a peek inside...

In [20]:
# Number of journals with OCRd text
df.shape
Out[20]:
(1163, 11)
In [21]:
# Total number of issues
df['issues'].sum()
Out[21]:
52185
In [22]:
# Number of issues with OCRd text
df['issues_with_text'].sum()
Out[22]:
51928
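
You could dig a little deeper – for example, this sketch lists the journals where some issues have no OCRd text (the difference between the two totals above):

# Journals where at least one issue has no OCRd text
incomplete = df.loc[df['issues_with_text'] < df['issues']]
incomplete[['title', 'issues', 'issues_with_text']].sort_values(by='issues', ascending=False).head()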

Create a markdown-formatted list

In [26]:
df.sort_values(by=['title'], inplace=True)
with open('digital-journals-with-text.md', 'w') as md_file:
    md_file.write('# Digitised journals from Trove with OCRd text')
    md_file.write('\n\nFor harvesting details see [this notebook](Download-text-for-all-digitised-journals.ipynb), or the [digitised journals section](https://glam-workbench.github.io/trove-journals/) of the GLAM Workbench.')
    md_file.write(f'\n\nThis harvest was completed on {arrow.now("Australia/Canberra").format("D MMMM YYYY")}.')
    md_file.write(f'\n\nNumber of journals harvested: {df.shape[0]:,}')
    md_file.write(f'\n\nNumber of issues with OCRd text: {df["issues_with_text"].sum():,}')
    md_file.write('\n\n----\n\n')
    for row in df.itertuples():
        md_file.write(f'\n### {row.title}')
        if row.contributor:
            md_file.write(f'\n**{row.contributor}, {row.issued}**')
        else:
            md_file.write(f'\n**{row.issued}**')
        md_file.write(f'  \n{row.format}')
        md_file.write(f'\n\n{row.issues_with_text} of {row.issues} issues have OCRd text available for download.')
        md_file.write(f'\n\n* [Details on Trove]({row.trove_url})\n')
        md_file.write(f'* [Browse issues on Trove]({row.fulltext_url})\n')
        md_file.write(f'* [Download issue data as CSV from CloudStor](https://cloudstor.aarnet.edu.au/plus/s/QOmnqpGQCNCSC2h/download?path=%2F{row.directory}&files={row.trove_id}-issues.csv)\n')
        md_file.write(f'* [Download all OCRd text from CloudStor](https://cloudstor.aarnet.edu.au/plus/s/QOmnqpGQCNCSC2h/download?path=%2F{row.directory})\n')
        
display(FileLink('digital-journals-with-text.md'))

Created by Tim Sherratt for the GLAM Workbench.

Work on this notebook was supported by the Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab.