Generate a thumbnail image from a Trove newspaper article

In another notebook, I showed how to get high-resolution page images from newspapers. But what if you only want a nice square thumbnail for display purposes? This notebook gets the page image and then crops and resizes the top of the article to create a thumbnail.

Of course, if you're generating thumbnails for lots of articles, you won't want to feed each one in manually. If you're viewing this notebook in app mode (no code visible), click the 'Edit app' button to see what's going on behind the scenes. You should be able to copy and modify the code to suit your purposes.
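
If you do adapt the code for bulk use, one possible shape for the loop is sketched here. It is only an illustration: `make_thumbnails` is a hypothetical helper, and the `make_thumbnail` argument stands in for whatever function you build from this notebook's code to turn one article url into a saved thumbnail.

```python
def make_thumbnails(article_urls, make_thumbnail):
    """Run make_thumbnail over many urls, collecting results and failures.

    make_thumbnail is a placeholder for your own function: it should take
    an article url and return something useful (such as a file path).
    """
    results = {}
    failures = {}
    for url in article_urls:
        try:
            results[url] = make_thumbnail(url)
        except Exception as err:
            # Keep going so one bad url doesn't stop the whole batch
            failures[url] = str(err)
    return results, failures
```

Collecting failures separately means you can check afterwards which articles need another look, rather than having the batch stop at the first problem.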

Briefly, the steps to generate a thumbnail are:

  • Get the article record using the Trove API
  • Get the page identifier from the article record
  • Use the page identifier to download a high-res page image
  • Scrape the article's HTML page to get the first row of the OCR'd text (or illustration)
  • Extract the coordinates (left, top, and width) from the row element's data attributes to find its position within the high-res image
  • Crop a square image from the page using the coordinates
  • Resize the cropped image
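
The cropping step can be tried out on its own, without downloading anything. This is a minimal sketch rather than the notebook's own code: `square_crop_box` is a hypothetical helper, the coordinates are made up, and shrinking the square to stay inside the page is an added assumption (the code cell in this notebook crops without that check).

```python
def square_crop_box(x, y, w, page_width, page_height):
    """Return a (left, upper, right, lower) box for a square crop.

    The square starts at the top-left corner of the article's first line
    and uses that line's width as its side length. If that would run off
    the page, the side is shrunk so the box stays square and in bounds.
    """
    side = min(w, page_width - x, page_height - y)
    return (x, y, x + side, y + side)


# Made-up coordinates: a line 800 pixels wide near the top of a page
print(square_crop_box(x=1200, y=350, w=800, page_width=5000, page_height=7000))
```

The tuple is in the order that Pillow's `Image.crop()` expects, so it could be passed straight to that method.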
In [1]:
import re
from io import BytesIO

import ipywidgets as widgets
import requests
from bs4 import BeautifulSoup
from IPython.display import display, HTML, FileLink, clear_output
from PIL import Image

def display_button():
    button = widgets.Button(
        description='Get thumbnail',
        disabled=False,
        button_style='primary',
        tooltip='Click to download',
        icon=''
    )
    button.on_click(get_page_image)
    display(button)
    
def get_article_top(article_url):
    '''
    Positional information about the article is attached to each line of the OCR output in data attributes.
    This function loads the HTML version of the article and scrapes the x, y, and width values for the
    top line of text (i.e. the top of the article).
    '''
    response = requests.get(article_url)
    soup = BeautifulSoup(response.text, 'lxml')
    # Lines of OCR are in divs with the class 'zone'
    # 'onPage' limits to those on the current page
    zones = soup.select('div.zone.onPage')
    # Illustrations might come after text even if they're above it on the page,
    # so pick the zone with the lowest 'y' attribute rather than just the first one
    top_element = min(zones, key=lambda zone: int(zone['data-y']))
    top_y = int(top_element['data-y'])
    top_x = int(top_element['data-x'])
    top_w = int(top_element['data-w'])
    return {'x': top_x, 'y': top_y, 'w': top_w}

def get_page_image(b):
    clear_output(wait=True)
    display_button()
    article = None
    page_id = None
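    # Get the article id from the url (works for browser urls and permalinks)
    match = re.search(r'article/?(\d+)', article_url.value)
    if match is None:
        print('Couldn\'t find an article id in that url!')
        return
    article_id = match.group(1)
    # Get the article record from the API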
    params = {
        'reclevel': 'full',
        'encoding': 'json',
        'key': api_key.value
    }
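    api_response = requests.get('https://api.trove.nla.gov.au/v2/newspaper/{}'.format(article_id), params=params)
    # Stop here if the request failed (for example, because of an invalid API key)
    api_response.raise_for_status()
    data = api_response.json()
    article = data['article']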
    try:
        # Get page id
        page_id = re.search(r'page\/(\d+)', article['trovePageUrl']).group(1)
    except (AttributeError, KeyError):
        # Either the record has no page url, or the url has no page id in it
        print('Couldn\'t extract page details!')
    else:
        # Get position of top line of article
        article_top = get_article_top(article_url.value)
        # Construct the url we need to download the image
        page_url = 'https://trove.nla.gov.au/ndp/imageservice/nla.news-page{}/level{}'.format(page_id, 7)
        # Download the page image
        response = requests.get(page_url)
        # Open download as an image for editing
        img = Image.open(BytesIO(response.content))
        # Use coordinates of top line to create a square box to crop thumbnail
        box = (article_top['x'], article_top['y'], article_top['x'] + article_top['w'], article_top['y'] + article_top['w'])
        # Crop image to create thumb
        thumb = img.crop(box)
        # Resize thumb (Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter)
        thumb.thumbnail((size.value, size.value), Image.LANCZOS)
        # Save and display thumbnail (assumes a 'data' directory already exists)
        thumbfile = 'data/{}-thumb-{}.jpg'.format(page_id, size.value)
        thumb.save(thumbfile)
        display(FileLink(thumbfile))
        display(HTML('<img src="{}">'.format(thumbfile)))

Enter your Trove API key

Get your own Trove API key and enter it below.

In [2]:
api_key = widgets.Text(
    placeholder='Enter your Trove API key',
    description='API key:',
    disabled=False
)
display(api_key)

Enter an article URL

You can use the URL from your browser's location bar or an article permalink.
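
Both kinds of URL work because the code only looks for the number that follows the word 'article'. Here is a small sketch of that lookup on its own. `extract_article_id` is a hypothetical helper, and the two URLs are illustrative examples of the shapes I'd expect, with a made-up id.

```python
import re


def extract_article_id(url):
    """Return the numeric article id from a Trove url, or None if there isn't one."""
    # The slash after 'article' is optional, so permalinks match too
    match = re.search(r'article/?(\d+)', url)
    return match.group(1) if match else None


# Illustrative urls only; the id is made up
print(extract_article_id('https://trove.nla.gov.au/newspaper/article/12345678'))
print(extract_article_id('http://nla.gov.au/nla.news-article12345678'))
```

Returning `None` for a url without an id makes it easy to show a friendly message instead of a traceback.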

In [3]:
article_url = widgets.Text(
    placeholder='Enter an article url',
    description='Article url:',
    disabled=False
)
display(article_url)

Thumbnail size

Generate a square thumbnail with this height and width (in pixels).

In [4]:
size = widgets.BoundedIntText(
    min=100,
    max=500,
    value=500,
    step=50,
    description='Size:',
    disabled=False
)
display(size)

Get the thumbnail!

In [5]:
display_button()