## Finding editorial cartoons in the Bulletin

In another notebook I showed how you could download all the front pages of The Bulletin (and other journals) as images. Amongst the front pages you'll find a number of full page editorial cartoons under The Bulletin's masthead. But you'll also find that many of the front pages are advertising wrap arounds. The full page editorial cartoons were a consistent feature of The Bulletin for many decades, but they moved around between pages one and eleven. That makes them hard to find.

I wanted to try and assemble a collection of all the editorial cartoons, but how? At one stage I was thinking about training an image classifier to identify the cartoons. The problem with that was I'd have to download lots of pages (they're about 20MB each), most of which I'd just discard. That didn't seem like a very efficient way of proceeding, so I started thinking about what I could do with the available metadata.

The Bulletin is indexed at article level in Trove, so I thought I might be able to construct a search that would find the cartoons. I'd noticed that the cartoons tended to have the title 'The Bulletin', so a search for title:"The Bulletin" should find them. The problem was, of course, that there were lots of false positives. I tried adding extra words that often appeared in the masthead to the search, but the results were unreliable. At that stage I was ready to give up.

Then I realised I was going about it backwards. If my aim was to get an editorial cartoon for each issue, I should start with a list of issues and process them individually to find the cartoon. I'd already worked out how to get a complete list of issues from the web interface. I'd also found you could extract useful metadata from each issue's landing page. Putting these two approaches together gave me a way forward. The basic methodology was this.

• I manually selected all the cartoons from my harvest of front pages, as downloading them again would just make things slower
• I harvested all the issue metadata
• Looping through the list of issues I checked to see if I already had a cartoon from each issue, if not...
• I grabbed the metadata from the issue landing page – for issues with OCRd text this includes a list of the articles and the pages that make up the issue
• I looked through the list of articles to see if there was one with the exact title 'The Bulletin'
• I then found the page on which the article was published
• If the page was odd (almost all the cartoons were on odd pages), I downloaded the page
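As a rough sketch of the matching steps above, the function below looks for articles with the exact title 'The Bulletin' and keeps only those on odd-numbered pages. The structure of `work_data` mirrors what the harvesting code in this notebook reads from an issue landing page, but `find_candidate_pages` and `sample_issue` are hypothetical names and made-up data used purely for illustration.

```python
def find_candidate_pages(work_data, title="The Bulletin"):
    """Return 1-based numbers of odd pages that carry an article with the exact title."""
    # Map each page identifier to its zero-based position in the issue
    page_index = {
        page["pid"]: index
        for index, page in enumerate(work_data["children"]["page"])
    }
    candidates = []
    for article in work_data["children"]["article"]:
        if article["title"] != title:
            continue
        # Use the first page that the article appears on
        index = page_index.get(article["existson"][0]["page"])
        if index is None:
            continue
        page_number = index + 1
        # Almost all the cartoons were on odd pages
        if page_number % 2 == 1:
            candidates.append(page_number)
    return candidates


# Hypothetical issue with three pages and three articles
sample_issue = {
    "children": {
        "page": [{"pid": "nla.obj-1"}, {"pid": "nla.obj-2"}, {"pid": "nla.obj-3"}],
        "article": [
            {"title": "Advertising", "existson": [{"page": "nla.obj-1"}]},
            {"title": "The Bulletin", "existson": [{"page": "nla.obj-3"}]},
            {"title": "The Bulletin", "existson": [{"page": "nla.obj-2"}]},
        ],
    }
}

print(find_candidate_pages(sample_issue))
```

Unlike the harvesting code further down, this sketch doesn't skip page one, which had already been collected with the front pages.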

This did result in some false positives, but they were easy enough to remove manually. At the end I was able to see when the editorial cartoons started appearing, and when they stopped having a page to themselves. This gave me a range of 1886 to 1952 to focus on. Looking within that range it seemed that there were only about 400 issues (out of 3452) that I had no cartoon for. The results were much better than I'd hoped!

I then repeated this process several times, changing the title string and looping through the missing issues. I gradually widened the title match from exactly 'The Bulletin', to a string containing 'Bulletin', to a case-insensitive match etc. I also found and harvested a small group of cartoons from the 1940s that were published on an even-numbered page! After a few repeats I had about 100 issues left without cartoons. These were mostly issues that didn't have OCRd text, so there were no articles to find. I thought I might need to process these manually, but then I thought why not just go through the odd numbers from one to thirteen, harvesting these pages from the 'missing' issues and manually discarding the misses. This was easy and effective. Before too long I had a cartoon for every issue between 4 September 1886 and 17 September 1952. Yay!
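The gradual widening of the title match could be expressed as an ordered list of tests, from strictest to loosest. This is only a sketch of the idea: `MATCHERS` and `first_matching_pass` are hypothetical names, and in practice each pass was run separately over the issues that were still missing a cartoon, rather than all at once.

```python
import re

# Ordered from strictest to loosest
MATCHERS = [
    ("exact", lambda title: title == "The Bulletin"),
    ("contains", lambda title: "Bulletin" in title),
    (
        "case-insensitive",
        lambda title: re.search(r"bulletin", title, flags=re.IGNORECASE) is not None,
    ),
]


def first_matching_pass(title):
    """Return the name of the strictest pass that accepts this title, or None."""
    for name, matches in MATCHERS:
        if matches(title):
            return name
    return None


for title in ["The Bulletin", "The Bulletin.", "THE BULLETIN", "Plain English"]:
    print(title, "->", first_matching_pass(title))
```

Each looser pass picks up titles the stricter ones missed, at the cost of more false positives to weed out by hand.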

If you have a look at the results you'll see that I ended up with more cartoons than issues. This is for two reasons:

• Some badly damaged issues were digitised twice, once from the hard copy and once from microfilm. Both versions are in Trove, so I thought I might as well keep both cartoons in my dataset.
• The cartoons stopped appearing with the masthead in the 1920s, so they became harder to uniquely identify. In the 1940s, there were sometimes two full page cartoons by different artists commenting on current affairs. Where my harvesting found two such cartoons, I kept them both. However, because my aim was to find at least one cartoon from each issue, there are going to be other full page political cartoons that aren't included in my collection.

Because I kept refining and adjusting the code as I went through, I'm not sure it will be very useful. But it's all here just in case.

The complete collection of 3,471 images (approximately 60GB in total) can be downloaded from CloudStor. The name of each image file provides useful contextual metadata. For example, the file name 19330412-2774-nla.obj-606969767-7.jpg tells you:

• 19330412 – the cartoon was published on 12 April 1933
• 2774 – it was published in issue number 2774
• nla.obj-606969767 – the Trove identifier for the issue, which can be used to make a URL, e.g. https://nla.gov.au/nla.obj-606969767
• 7 – on page 7
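If you want to work with the file names in code, something like the following would pull the parts out. It's a minimal sketch that assumes every file name follows the pattern described above (including a numeric issue number); `parse_cartoon_filename` is a hypothetical helper, not part of the harvesting code.

```python
import re
from datetime import datetime

# date (YYYYMMDD) - issue number - Trove identifier - page number
FILENAME_PATTERN = re.compile(r"^(\d{8})-(\d+)-(nla\.obj-\d+)-(\d+)\.jpg$")


def parse_cartoon_filename(filename):
    """Split an image file name into date, issue number, identifier, url and page."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError("Unexpected file name: {}".format(filename))
    date, number, issue_id, page = match.groups()
    return {
        "date": datetime.strptime(date, "%Y%m%d").date(),
        "issue_number": int(number),
        "issue_id": issue_id,
        "url": "https://nla.gov.au/{}".format(issue_id),
        "page": int(page),
    }


info = parse_cartoon_filename("19330412-2774-nla.obj-606969767-7.jpg")
print(info["date"].isoformat(), info["issue_number"], info["url"], info["page"])
```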

To make it easy to browse the collection I've created a series of PDFs, one for each decade. You can find them on Dropbox:

In [ ]:
import requests
import re
import time
import json
import glob
import os
import zipfile
import pandas as pd
import io
from tqdm.auto import tqdm
from bs4 import BeautifulSoup
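from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import requests_cache

s = requests_cache.CachedSession()
# Retry failed requests up to five times, backing off between attempts
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))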

In [ ]:
def harvest_metadata(obj_id):
    '''
    This calls an internal API from a journal landing page to extract a list of available issues.
    '''
    start_url = 'https://nla.gov.au/{}/browse?startIdx={}&rows=20&op=c'
    # The initial startIdx value
    start = 0
    # Number of results per page
    n = 20
    issues = []
    with tqdm(desc='Issues', leave=False) as pbar:
        # If there aren't 20 results on the page then we've reached the end, so continue harvesting until that happens.
        while n == 20:
            # Get the browse page
            response = s.get(start_url.format(obj_id, start), timeout=60)
            # Beautifulsoup turns the HTML into an easily navigable structure
            soup = BeautifulSoup(response.text, 'lxml')
            # Find all the divs containing issue details and loop through them
            details = soup.find_all(class_='l-item-info')
            for detail in details:
                issue = {}
                title = detail.find('h3')
                if title:
                    issue['title'] = title.text
                    issue['id'] = title.parent['href'].strip('/')
                else:
                    issue['title'] = 'No title'
                    issue['id'] = detail.find('a')['href'].strip('/')
                try:
                    # Get the issue details
                    issue['details'] = detail.find(class_='obj-reference content').string.strip()
                except (AttributeError, IndexError):
                    issue['details'] = 'issue'
                # Get the number of pages
                try:
                    issue['pages'] = int(re.search(r'^(\d+)', detail.find('a', attrs={'data-pid': issue['id']}).text, flags=re.MULTILINE).group(1))
                except AttributeError:
                    issue['pages'] = 0
                issues.append(issue)
                # print(issue)
            if not response.from_cache:
                time.sleep(0.5)
            # Increment the startIdx
            start += n
            # Set n to the number of results on the current page
            n = len(details)
            pbar.update(n)
    return issues

def get_page_number(work_data, article):
    '''
    Find the (zero-based) position in the issue of the page an article first appears on.
    '''
    page_number = 0
    page_id = article['existson'][0]['page']
    for index, page in enumerate(work_data['children']['page']):
        if page['pid'] == page_id:
            page_number = index
            break
    # print(page_number)
    return page_number

def get_articles(work_data):
    '''
    Get the articles whose titles match the current pattern.
    '''
    articles = []
    for article in work_data['children']['article']:
        # Alternative patterns used in other passes
        #if re.search(r'^The Bulletin\.*', article['title'], flags=re.IGNORECASE):
        #if re.search(r'^No title\.*\$', article['title']):
        if re.search(r'Bulletin', article['title'], flags=re.IGNORECASE):
            articles.append(article)
    return articles

def get_work_data(url):
    '''
    Extract work data in a JSON string from the work's HTML page.
    '''
    response = s.get(url, allow_redirects=True, timeout=60)
    # print(response.url)
    try:
        work_data = re.search(r'var work = JSON\.parse\(JSON\.stringify\((\{.*\})', response.text).group(1)
    except AttributeError:
        work_data = '{}'
    # Convert the JSON string into a Python dict
    return json.loads(work_data)

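# The 'def' lines are missing from this cell, so the function names and signatures below are assumptions

def download_page(issue_id, page_number, image_dir):
    '''
    Download a single page image (page_number is zero-based) into image_dir.
    '''
    if not os.path.exists('{}/{}-{}.jpg'.format(image_dir, issue_id, page_number + 1)):
        # Assumption: this url pattern is not in the surviving code, it's the form Trove uses for zipped page images
        url = 'https://nla.gov.au/{0}/download?downloadOption=zip&firstPage={1}&lastPage={1}'.format(issue_id, page_number)
        # Get the file
        r = s.get(url, timeout=180)
        # The image is in a zip, so we need to extract the contents into the output directory
        try:
            z = zipfile.ZipFile(io.BytesIO(r.content))
            z.extractall(image_dir)
        except zipfile.BadZipFile:
            # Assumption: report pages that couldn't be unzipped
            print('{}-{}'.format(issue_id, page_number))

def process_issue(issue_id, image_dir):
    '''
    Look for matching articles in an issue and download the pages they appear on.
    '''
    url = 'https://nla.gov.au/{}'.format(issue_id)
    work_data = get_work_data(url)
    articles = get_articles(work_data)
    for article in articles:
        page_number = get_page_number(work_data, article)
        # Page indexes start at zero, so an even index is an odd-numbered page
        # Index 0 (the front page) is skipped as it's already been harvested
        if page_number and (page_number % 2 == 0):
            download_page(issue_id, page_number, image_dir)
            time.sleep(1)

def get_downloaded_issues(image_dir):
    '''
    Get the identifiers of issues that already have at least one image in image_dir.
    '''
    issues = []
    images = [i for i in os.listdir(image_dir) if i[-4:] == '.jpg']
    for image in images:
        issue_id = re.search(r'(nla\.obj-\d+)', image).group(1)
        issues.append(issue_id)
    return issues

def harvest_cartoons(issues, image_dir):
    '''
    Process each issue that doesn't already have a downloaded image.
    '''
    dl_issues = get_downloaded_issues(image_dir)
    for issue in tqdm(issues):
        if issue['issue_id'] not in dl_issues:
            process_issue(issue['issue_id'], image_dir)
            dl_issues.append(issue['issue_id'])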


In [ ]:
# Get current list of issues

In [ ]:
# First pass

In [ ]:
# Change filenames of downloaded images to include date and issue number
# This makes it easier to sort and group them
all_issues = df.to_dict('records')
# Add dates / issue nums to file titles
for record in all_issues:
    pages = glob.glob('/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/images/{}-*.jpg'.format(record['id']))
    for page in pages:
        page_number = re.search(r'nla\.obj-\d+-(\d+)\.jpg', page).group(1)
        date = record['date'].replace('-', '')
        os.rename(page, '/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/images/{}-{}-{}-{}.jpg'.format(date, record['number'], record['id'], page_number))

In [ ]:
# Only select issues between 1886 and 1952
df = pd.read_csv('journals/bulletin/bulletin_issues.csv', dtype={'number': 'Int64'}, keep_default_na=False)
df = df.rename(index=str, columns={"id": "issue_id"})
filtered = df.loc[(df['number'] >= 344) & (df['number'] <= 3788)]
filtered_issues = filtered.to_dict('records')

In [ ]:
# After a round of harvesting, check to see what's missing
missing = []
multiple = []
for issue in tqdm(filtered_issues):
    pages = glob.glob('/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/images/*{}-*.jpg'.format(issue['issue_id']))
    if len(pages) == 0:
        # No cartoon found for this issue yet
        missing.append(issue)
    elif len(pages) > 1:
        # More than one candidate page was downloaded
        multiple.append(issue)

In [ ]:
# Adjust the matching settings and run again with just the missing issues
# Rinse and repeat

In [ ]:
# How many are missing now?
len(missing)

In [ ]:
# How many images have been harvested

In [ ]:
len(dl)

In [ ]:
len(multiple)

In [ ]:
# Create links for missing issues so I can look at them on Trove
for m in missing:
    print('https://nla.gov.au/{}'.format(m['issue_id']))

In [ ]:
# Download odd pages (they start at zero, so 0=1 etc)
for m in missing: