Finding editorial cartoons in the Bulletin

In another notebook I showed how you could download all the front pages of The Bulletin (and other journals) as images. Amongst the front pages you'll find a number of full page editorial cartoons under The Bulletin's masthead. But you'll also find that many of the front pages are advertising wraparounds. The full page editorial cartoons were a consistent feature of The Bulletin for many decades, but they moved around between pages one and eleven. That makes them hard to find.

I wanted to try and assemble a collection of all the editorial cartoons, but how? At one stage I was thinking about training an image classifier to identify the cartoons. The problem with that was I'd have to download lots of pages (they're about 20MB each), most of which I'd just discard. That didn't seem like a very efficient way of proceeding, so I started thinking about what I could do with the available metadata.

The Bulletin is indexed at article level in Trove, so I thought I might be able to construct a search that would find the cartoons. I'd noticed that the cartoons tended to have the title 'The Bulletin', so a search for title:"The Bulletin" should find them. The problem was, of course, that there were lots of false positives. I tried adding extra words that often appeared in the masthead to the search, but the results were unreliable. At that stage I was ready to give up.

Then I realised I was going about it backwards. If my aim was to get an editorial cartoon for each issue, I should start with a list of issues and process them individually to find the cartoon. I'd already worked out how to get a complete list of issues from the web interface. I'd also found you could extract useful metadata from each issue's landing page. Putting these two approaches together gave me a way forward. The basic methodology was this:

  • I manually selected all the cartoons from my harvest of front pages, as downloading them again would just make things slower
  • I harvested all the issue metadata
  • Looping through the list of issues I checked to see if I already had a cartoon from each issue, if not...
  • I grabbed the metadata from the issue landing page – for issues with OCRd text this includes a list of the articles and the pages that make up the issue
  • I looked through the list of articles to see if there was one with the exact title 'The Bulletin'
  • I then found the page on which the article was published
  • If the page was odd (almost all the cartoons were on odd pages), I downloaded the page
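A note on the last step: the page indexes used for downloading are zero-based, so index 0 is printed page 1. That means a printed page number is odd exactly when its zero-based index is even, which is the check the harvesting code makes. A tiny illustration (this helper is mine, not part of the harvesting code):

```python
def is_odd_printed_page(page_index):
    """Trove page downloads are zero-indexed: index 0 is printed page 1.
    The printed page number (index + 1) is odd when the index is even."""
    return page_index % 2 == 0

print(is_odd_printed_page(6))  # True -- index 6 is printed page 7
print(is_odd_printed_page(7))  # False -- index 7 is printed page 8
```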

This did result in some false positives, but they were easy enough to remove manually. At the end I was able to see when the editorial cartoons started appearing, and when they stopped having a page to themselves. This gave me a range of 1886 to 1952 to focus on. Looking within that range it seemed that there were only about 400 issues (out of 3452) that I had no cartoon for. The results were much better than I'd hoped!

I then repeated this process several times, changing the title string and looping through the missing issues. I gradually widened the title match from exactly 'The Bulletin', to a string containing 'Bulletin', to a case-insensitive match etc. I also found and harvested a small group of cartoons from the 1940s that were published on an even-numbered page! After a few repeats I had about 100 issues left without cartoons. These were mostly issues that didn't have OCRd text, so there were no articles to find. I thought I might need to process these manually, but then I thought why not just go through the odd numbers from one to thirteen, harvesting these pages from the 'missing' issues and manually discarding the misses. This was easy and effective. Before too long I had a cartoon for every issue between 4 September 1886 and 17 September 1952. Yay!
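The progressively widened title matches can be expressed as regular expressions, moving from strict to loose (these mirror the patterns in the `get_articles` function further down; the little `matches` helper is just for demonstration):

```python
import re

# Title patterns tried on successive passes, from strictest to loosest
patterns = [
    (r'^The Bulletin\.*', 0),              # exact title at the start, case-sensitive
    (r'^The Bulletin\.*', re.IGNORECASE),  # same, but case-insensitive
    (r'Bulletin', re.IGNORECASE),          # 'Bulletin' anywhere in the title
]

def matches(title, pattern, flags):
    return bool(re.search(pattern, title, flags=flags))

print(matches('The Bulletin.', *patterns[0]))          # True
print(matches('THE BULLETIN', *patterns[0]))           # False
print(matches('THE BULLETIN', *patterns[1]))           # True
print(matches('Sydney Bulletin notes', *patterns[2]))  # True
```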

If you have a look at the results you'll see that I ended up with more cartoons than issues. This is for two reasons:

  • Some badly damaged issues were digitised twice, once from the hard copy and once from microfilm. Both versions are in Trove, so I thought I might as well keep both cartoons in my dataset.
  • The cartoons stopped appearing with the masthead in the 1920s, so they became harder to uniquely identify. In the 1940s, there were sometimes two full page cartoons by different artists commenting on current affairs. Where my harvesting found two such cartoons, I kept them both. However, because my aim was to find at least one cartoon from each issue, there are going to be other full page political cartoons that aren't included in my collection.

Because I kept refining and adjusting the code as I went through, I'm not sure it will be very useful. But it's all here just in case.

The complete collection of 3,471 images (approximately 60GB in total) can be downloaded from CloudStor. The name of each image file provides useful contextual metadata. For example, the file name 19330412-2774-nla.obj-606969767-7.jpg tells you:

  • 19330412 – the cartoon was published on 12 April 1933
  • 2774 – it was published in issue number 2774
  • nla.obj-606969767 – the Trove identifier for the issue, which can be used to make a URL, eg https://nla.gov.au/nla.obj-606969767
  • 7 – on page 7
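Going the other way, a file name can be split back into these parts with a simple regular expression. This helper isn't part of the harvesting code, just a sketch:

```python
import re

def parse_filename(filename):
    """Split a cartoon file name into its metadata parts.
    Expected format: YYYYMMDD-issueNumber-nla.obj-id-page.jpg"""
    match = re.match(r'(\d{8})-(\d+)-(nla\.obj-\d+)-(\d+)\.jpg', filename)
    date, number, obj_id, page = match.groups()
    return {
        'date': '{}-{}-{}'.format(date[:4], date[4:6], date[6:]),
        'issue_number': int(number),
        'trove_url': 'https://nla.gov.au/{}'.format(obj_id),
        'page': int(page),
    }

print(parse_filename('19330412-2774-nla.obj-606969767-7.jpg'))
```

For the example file above this returns the publication date 1933-04-12, issue number 2774, the issue's Trove URL, and page 7.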

To make it easy to browse the collection I've created a series of PDFs, one for each decade. You can find them on Dropbox:

In [40]:
import requests
import re
import time
import json
import glob
import os
import zipfile
import pandas as pd
import io
from tqdm import tqdm_notebook
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))
In [74]:
def harvest_metadata(obj_id):
    '''
    This calls an internal API from a journal landing page to extract a list of available issues.
    '''
    start_url = 'https://nla.gov.au/{}/browse?startIdx={}&rows=20&op=c'
    # The initial startIdx value
    start = 0
    # Number of results per page
    n = 20
    issues = []
    with tqdm_notebook(desc='Issues', leave=False) as pbar:
        # If there aren't 20 results on the page then we've reached the end, so continue harvesting until that happens.
        while n == 20:
            # Get the browse page
            response = s.get(start_url.format(obj_id, start), timeout=60)
            # Beautifulsoup turns the HTML into an easily navigable structure
            soup = BeautifulSoup(response.text, 'lxml')
            # Find all the divs containing issue details and loop through them
            details = soup.find_all(class_='l-item-info')
            for detail in details:
                issue = {}
                # Get the issue id
                issue['issue_id'] = detail.dt.a.string
                rows = detail.find_all('dd')
                try:
                    issue['title'] = rows[0].p.string.strip()
                except (AttributeError, IndexError):
                    issue['title'] = 'title'
                try:
                    # Get the issue details
                    issue['details'] = rows[2].p.string.strip()
                except (AttributeError, IndexError):
                    issue['details'] = 'issue'
                # Get the number of pages
                try:
                    issue['pages'] = int(re.search(r'^(\d+)', detail.find('a', class_="browse-child").text, flags=re.MULTILINE).group(1))
                except AttributeError:
                    issue['pages'] = 0
                issues.append(issue)
                #print(issue)
                time.sleep(0.2)
            # Increment the startIdx
            start += n
            # Set n to the number of results on the current page
            n = len(details)
            pbar.update(n)
    return issues

def get_page_number(work_data, article):
    '''
    Find the zero-based index of the page on which an article was published.
    '''
    page_number = 0
    page_id = article['existson'][0]['page']
    for index, page in enumerate(work_data['children']['page']):
        if page['pid'] == page_id:
            page_number = index
            break
    return page_number

def get_articles(work_data):
    '''
    Return the articles in an issue whose titles match the current pattern.
    The pattern was loosened on successive passes -- other versions are left commented out.
    '''
    articles = []
    for article in work_data['children']['article']:
        #if re.search(r'^The Bulletin\.*', article['title'], flags=re.IGNORECASE):
        if re.search(r'Bulletin', article['title'], flags=re.IGNORECASE):
        #if re.search(r'^No title\.*$', article['title']):
            articles.append(article)
    return articles

def get_work_data(url):
    '''
    Extract work data in a JSON string from the work's HTML page.
    '''
    response = s.get(url, allow_redirects=True, timeout=60)
    # print(response.url)
    try:
        work_data = re.search(r'var work = JSON\.parse\(JSON\.stringify\((\{.*\})', response.text).group(1)
    except AttributeError:
        work_data = '{}'
    return json.loads(work_data)

def download_page(issue_id, page_number, image_dir):
    if not os.path.exists('{}/{}-{}.jpg'.format(image_dir, issue_id, page_number + 1)):
        url = 'https://nla.gov.au/{0}/download?downloadOption=zip&firstPage={1}&lastPage={1}'.format(issue_id, page_number)
        # Get the file
        r = s.get(url, timeout=180)
        # The image is in a zip, so we need to extract the contents into the output directory
        try:
            z = zipfile.ZipFile(io.BytesIO(r.content))
            z.extractall(image_dir)
        except zipfile.BadZipFile:
            print('{}-{}'.format(issue_id, page_number))

def download_issue_title_pages(issue_id, image_dir):
    '''
    Download the pages of an issue's matching articles.
    Page indexes are zero-based, so an even index is an odd printed page.
    '''
    url = 'https://nla.gov.au/{}'.format(issue_id)
    work_data = get_work_data(url)
    articles = get_articles(work_data)
    for article in articles:
        page_number = get_page_number(work_data, article)
        # Skip index 0 (the front page, already harvested) and only
        # download odd printed pages (even zero-based indexes)
        if page_number and (page_number % 2 == 0):
            download_page(issue_id, page_number, image_dir)
            time.sleep(1)
            
def get_downloaded_issues(image_dir):
    issues = []
    images = [i for i in os.listdir(image_dir) if i[-4:] == '.jpg']
    for image in images:
        issue_id = re.search(r'(nla\.obj-\d+)', image).group(1)
        issues.append(issue_id)
    return issues
            
def download_all_title_pages(issues, image_dir):
    dl_issues = get_downloaded_issues(image_dir)
    for issue in tqdm_notebook(issues):
        if issue['issue_id'] not in dl_issues:
            download_issue_title_pages(issue['issue_id'], image_dir)
            dl_issues.append(issue['issue_id'])
      
In [4]:
# Get current list of issues
issues = harvest_metadata('nla.obj-68375465')

In [37]:
# First pass
download_all_title_pages(issues, '/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/images')
In [ ]:
# Change filenames of downloaded images to include date and issue number
# This makes it easier to sort and group them
df = pd.read_csv('journals/bulletin/bulletin_issues.csv')
all_issues = df.to_dict('records')
# Add dates / issue nums to file titles
for record in all_issues:
    pages = glob.glob('/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/images/{}-*.jpg'.format(record['id']))
    for page in pages:
        page_number = re.search(r'nla\.obj-\d+-(\d+)\.jpg', page).group(1)
        date = record['date'].replace('-', '')
        os.rename(page, '/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/images/{}-{}-{}-{}.jpg'.format(date, record['number'], record['id'], page_number))
In [51]:
# Only select issues between 1886 and 1952
df = pd.read_csv('journals/bulletin/bulletin_issues.csv', dtype={'number': 'Int64'}, keep_default_na=False)
df = df.rename(index=str, columns={"id": "issue_id"})
filtered = df.loc[(df['number'] >= 344) & (df['number'] <= 3788)]
filtered_issues = filtered.to_dict('records')
In [99]:
# After a round of harvesting, check to see what's missing
missing = []
multiple = []
for issue in tqdm_notebook(filtered_issues):
    pages = glob.glob('/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/images/*{}-*.jpg'.format(issue['issue_id']))
    if len(pages) == 0:
        missing.append(issue)
    elif len(pages) > 1:
        multiple.append(issue)
In [76]:
# Adjust the matching settings and run again with just the missing issues
# Rinse and repeat
download_all_title_pages(missing, '/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/missing')
In [100]:
# How many are missing now?
len(missing)
Out[100]:
0
In [66]:
# How many images have been harvested
dl = get_downloaded_issues('/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/images')
In [67]:
len(dl)
Out[67]:
3362
In [70]:
len(multiple)
Out[70]:
17
In [98]:
# Create links for missing issues so I can look at them on Trove
for m in missing:
    print('https://nla.gov.au/{}'.format(m['issue_id']))
https://nla.gov.au/nla.obj-690693549
https://nla.gov.au/nla.obj-568185986
In [97]:
# Download a single page from each missing issue by its zero-based index
# (0 = page 1, 2 = page 3 etc); the index was changed on each pass through the odd pages
for m in missing:
    download_page(m['issue_id'], 14, '/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/missing')
In [ ]: