Harvesting data from Home

This is an example of how my original recipe for harvesting data from The Bulletin can be modified for other journals.

If you'd like a pre-harvested dataset of all the Home covers (229 images in a 3.3GB zip file), open this link using your preferred BitTorrent client: magnet:?xt=urn:btih:7888BCEA44E5FF5670931A3394369E5018BFC32B&dn=home-quarterly.zip

In [ ]:
# Let's import the libraries we need.
import requests
from bs4 import BeautifulSoup
import time
import json
import os
import re
In [ ]:
# Create a directory for this journal
# Edit as necessary for a new journal
data_dir = os.path.join('journals', 'Home')
os.makedirs(data_dir, exist_ok=True)

Getting the issue data

Each issue of a digitised journal like Home has its own unique identifier. You've probably noticed them in the urls of Trove resources. They look something like this: nla.obj-362409353. Once we have the identifier for an issue we can easily download the contents, but how do we get a complete list of identifiers?

The harvesting data from the Bulletin notebook explains how we can find a url that lists all the available issues of a journal.

This is the url we need to start harvesting issue metadata about Home. You could easily modify this to get metadata from another journal by changing the identifier.

https://nla.gov.au/nla.obj-362409353/browse?startIdx=0&rows=20&op=c
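
If you want to check that the url is working before you run the full harvest, you can load the first page of results and count the issue entries it contains. This is just an optional sanity check (it assumes the imports in the first cell have been run).

In [ ]:
# Optional check -- load the first page of the browse list and count the issue entries
response = requests.get('https://nla.gov.au/nla.obj-362409353/browse?startIdx=0&rows=20&op=c')
soup = BeautifulSoup(response.text, 'lxml')
print(len(soup.find_all(class_='l-item-info')))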
In [ ]:
# This is just the url we found above, with a slot into which we can insert the startIdx value
# If you want to download data from another journal, just change the nla.obj identifier to point to the journal.
start_url = 'https://nla.gov.au/nla.obj-362409353/browse?startIdx={}&rows=20&op=c'
In [ ]:
# The initial startIdx value
start = 0
# Number of results per page
n = 20
issues = []
# A full page contains 20 results, so keep harvesting until a page returns fewer than that (which means we've reached the end)
while n == 20:
    # Get the browse page
    response = requests.get(start_url.format(start))
    # BeautifulSoup turns the HTML into an easily navigable structure
    soup = BeautifulSoup(response.text, 'lxml')
    # Find all the divs containing issue details and loop through them
    details = soup.find_all(class_='l-item-info')
    for detail in details:
        issue = {}
        # Get the issue id
        issue['id'] = detail.dt.a.string
        rows = detail.find_all('dd')
        # Get the issue details
        issue['details'] = rows[2].p.string
        # Get the number of pages
        issue['pages'] = re.search(r'^(\d+)', detail.find('a', class_="browse-child").text, flags=re.MULTILINE).group(1)
        issues.append(issue)
        print(issue)
        time.sleep(0.2)
    # Increment the startIdx
    start += n
    # Set n to the number of results on the current page
    n = len(details)
In [ ]:
len(issues)
In [ ]:
# Save the harvested results as a JSON file in case we need them later on
with open('{}/home_issues.json'.format(data_dir), 'w') as outfile:
    json.dump(issues, outfile)
In [ ]:
# Open the saved JSON file
with open('{}/home_issues.json'.format(data_dir), 'r') as infile:
    issues = json.load(infile)

Cleaning up the metadata

So far we've just grabbed the complete issue details as a single string. It would be good to parse this string so that we have the dates, volume and issue numbers in separate fields. As is always the case, there's a bit of variation in the way this information is recorded. The code below tries out different combinations and then saves the structured data in a Python list.

I had to modify the code I used with the Bulletin due to slight variations in the way the issue data was recorded. For example, issue dates for Home use the full names of months, while the Bulletin records used abbreviations. It's likely that there will be other variations between journals, so you might have to adjust this code.

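To get a feel for how these patterns work, here's a quick test against a couple of made-up details strings (the exact wording varies between journals, so treat this as a sketch).

In [ ]:
# A quick demonstration of the pattern matching used below -- the details strings here are made-up examples
for details in ['The Home Vol. 2 No. 3 (1 March 1921)', 'No. 12 (1 December 1925)']:
    match = re.search(r'(.*)Vol\. (\d+) No\.* (\d+) \((.+)\)', details)
    if match:
        print(match.groups())
    else:
        # Fall back to the simpler 'No. x (date)' pattern
        print(re.search(r'No\. (\d+) \((.+)\)', details).groups())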
In [ ]:
import arrow
from arrow.parser import ParserError
issues_data = []
# Loop through the issues
for issue in issues:
    issue_data = {}
    issue_data['id'] = issue['id']
    issue_data['pages'] = int(issue['pages'])
    print(issue['details'])
    try:
        # This pattern looks for details in the form: Vol. 2 No. 3 (2 Jul 1878)
        details = re.search(r'(.*)Vol\. (\d+) No\.* (\d+) \((.+)\)', issue['details'].strip())
        issue_data['label'] = details.group(1).strip()
        issue_data['volume'] = details.group(2)
        issue_data['number'] = details.group(3)
        date = details.group(4)
    except AttributeError:
        try:
            # This pattern looks for details in the form: No. 3 (2 Jul 1878)
            details = re.search(r'No\. (\d+) \((.+)\)', issue['details'].strip())
            issue_data['label'] = ''
            issue_data['volume'] = ''
            issue_data['number'] = details.group(1)
            date = details.group(2)
        except AttributeError:
            try:
                # This pattern looks for details in the form: Bulletin Christmas Edition (2 Jul 1878)
                details = re.search(r'(.*) \((.+)\)', issue['details'].strip())
                issue_data['label'] = details.group(1)
                issue_data['volume'] = ''
                issue_data['number'] = ''
                date = details.group(2)
            except AttributeError:
                # This pattern looks for details in the form: Bulletin 1878 Jul 3
                details = re.search(r'Bulletin (.+)', issue['details'].strip())
                date_str = details.group(1)
                # Date is wrong way round, split and reverse
                date = ' '.join(reversed(date_str.split()))
                issue_data['label'] = ''
                issue_data['volume'] = ''
                issue_data['number'] = ''
    # Normalise months so that arrow can parse them.
    # Replacing 'Sept' with 'Sep' also mangles 'September', so the next replace repairs it,
    # and combined issues like 'July August' are reduced to the first month.
    date = date.replace('Sept', 'Sep').replace('Sepember', 'September').replace('July August', 'July').replace('September October', 'September').replace('  ', ' ')
    # Convert the date to ISO format, keeping just the date portion of the timestamp
    try:
        issue_data['date'] = arrow.get(date, 'D MMMM YYYY').isoformat()[:-15]
    except ParserError:
        issue_data['date'] = arrow.get(date, 'D MMM YYYY').isoformat()[:-15]
    issues_data.append(issue_data)

Save as CSV

Now that the issues data is in a nice, structured form, we can load it into a Pandas dataframe. This allows us to do things like find the total number of pages digitised.

We can also save the metadata as a CSV.

In [ ]:
import pandas as pd
# Convert issues metadata into a dataframe
df = pd.DataFrame(issues_data, columns=['id', 'label', 'volume', 'number', 'date', 'pages'])
In [ ]:
# Find the total number of pages
df['pages'].sum()
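
The dates are stored as ISO formatted strings, so you can also do things like group the pages by year. This is just a sketch of the sort of analysis that's possible (it assumes the dates above parsed cleanly).

In [ ]:
# A sketch of further analysis -- pages digitised per year (assumes the dates parsed cleanly above)
df.groupby(pd.to_datetime(df['date']).dt.year)['pages'].sum()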
In [ ]:
# Save metadata as a CSV.
df.to_csv('{}/home_issues.csv'.format(data_dir), index=False)

Download front covers

Options for downloading images, PDFs and text are described in the harvesting data from the Bulletin notebook. In this recipe we'll just download the front covers (because they're awesome).

The code below checks to see if an image has already been saved before downloading it, so if the process is interrupted you can just run it again to pick up where it stopped. If more issues are added to Trove you could run it again to pick up any new images.

In [ ]:
import zipfile
import io
# Prepare a directory to save the images into
output_dir = os.path.join(data_dir, 'images')
os.makedirs(output_dir, exist_ok=True)
# Loop through the issue metadata
for issue in issues_data:
    print(issue['id'])
    id = issue['id']
    # Check to see if the first page of this issue has already been downloaded
    if not os.path.exists('{}/{}-1.jpg'.format(output_dir, id)):
        url = 'https://nla.gov.au/{}/download?downloadOption=zip&firstPage=0&lastPage=0'.format(id)
        # Get the file
        r = requests.get(url)
        # The image is in a zip, so we need to extract the contents into the output directory
        z = zipfile.ZipFile(io.BytesIO(r.content))
        z.extractall(output_dir)
        time.sleep(1)
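
Once the downloads have finished you can check how many covers you've ended up with by counting the jpg files saved in the images directory.

In [ ]:
# Count the jpg files saved in the images directory
import glob
len(glob.glob('{}/*.jpg'.format(output_dir)))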