Get the archived version of a page closest to a particular date

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

To get the archived version of a page closest to a particular date we can use the Memento API. Variations in the way Memento is implemented across repositories are documented in Getting data from web archives using Memento. The functions below smooth out these variations to provide a (mostly) consistent interface to the UK Web Archive, Australian Web Archive, New Zealand Web Archive, and the Internet Archive. They could be easily modified to work with other Memento-compliant repositories.
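
At the protocol level, Memento datetime negotiation is a request to a Timegate that carries an Accept-Datetime header; the response's Link header then points to the related Mementos. The following is a minimal sketch, using only the standard library, that builds such a request without sending it. The helper name build_timegate_request is hypothetical and is not one of the functions defined in this notebook.

```python
from urllib.request import Request

def build_timegate_request(timegate_url, target_url, accept_datetime=None):
    """Build (but don't send) a HEAD request for Memento datetime negotiation."""
    headers = {}
    if accept_datetime:
        # The target date, formatted as an HTTP date in GMT
        headers['Accept-Datetime'] = accept_datetime
    # The Timegate url is the repository prefix followed by the original url
    return Request(f'{timegate_url}{target_url}', headers=headers, method='HEAD')

req = build_timegate_request(
    'https://web.archive.org/web/',
    'http://discontents.com.au/',
    'Sat, 01 Jan 2005 01:00:00 GMT',
)
print(req.get_method(), req.full_url)
print(req.header_items())
```

Nothing is sent over the network here. The functions below use requests to send the equivalent request and read the Link header from the response.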

To get information about available Mementos:

query_timegate([timegate], [url], [date], [tz])

To get a single Memento closest to your target date:

get_memento([timegate], [url], [date], [tz])

Parameters:

  • timegate – one of 'ukwa' (UK), 'awa' (Australia), 'nzwa' (New Zealand), or 'ia' (Internet Archive)
  • url – the url you want to look for in the archive
  • date – the target date in ISO format, 'YYYY-MM-DD' (optional, will default to most recent date)
  • tz – a timezone name for your local timezone (optional, defaults to 'Australia/Canberra')
In [1]:
import requests
import arrow
import re
import json
In [2]:
# These are the repositories we'll be using
TIMEGATES = {
    'awa': 'https://web.archive.org.au/awa/',
    'nzwa': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/',
    'ukwa': 'https://www.webarchive.org.uk/wayback/en/archive/',
    'ia': 'https://web.archive.org/web/'
}

def format_date_for_headers(iso_date, tz):
    '''
    Convert an ISO date (YYYY-MM-DD) to a datetime at noon in the specified timezone.
    Convert the datetime to UTC and format as required by Accept-Datetime headers:
    e.g. Fri, 23 Mar 2007 01:00:00 GMT
    '''
    local = arrow.get(f'{iso_date} 12:00:00 {tz}', 'YYYY-MM-DD HH:mm:ss ZZZ')
    gmt = local.to('utc')
    return f'{gmt.format("ddd, DD MMM YYYY HH:mm:ss")} GMT'

def parse_links_from_headers(response):
    '''
    Extract Memento links from 'Link' header.
    '''
    links = response.links
    return {k: v['url'] for k, v in links.items()}

def query_timegate(timegate, url, date=None, tz='Australia/Canberra'):
    '''
    Query the specified repository for a Memento.
    '''
    headers = {}
    if date:
        formatted_date = format_date_for_headers(date, tz)
        headers['Accept-Datetime'] = formatted_date
    # UKWA (BL) & NZWA (NLNZ) don't seem to default to the latest date if no date is supplied
    # The check uses the keys defined in TIMEGATES, otherwise this branch never runs
    elif timegate in ['ukwa', 'nzwa']:
        formatted_date = format_date_for_headers(arrow.utcnow().format('YYYY-MM-DD'), tz)
        headers['Accept-Datetime'] = formatted_date
    # Note that you don't get a timegate response if you leave off the trailing slash, but extras don't hurt!
    tg_url = f'{TIMEGATES[timegate]}{url}/' if not url.endswith('/') else f'{TIMEGATES[timegate]}{url}'
    # print(tg_url)
    # IA only works if redirects are followed -- this defaults to False with HEAD requests...
    allow_redirects = timegate == 'ia'
    response = requests.head(tg_url, headers=headers, allow_redirects=allow_redirects)
    return parse_links_from_headers(response)

def get_memento(timegate, url, date=None, tz='Australia/Canberra'):
    '''
    Get the url of the Memento closest to the target date.
    If there's no 'memento' link in the results, look for an alternative.
    Returns None if no suitable link is found.
    '''
    links = query_timegate(timegate, url, date, tz)
    # NLNZ doesn't always seem to return a Memento, so we'll build in some fuzziness
    # Try each relation in order of preference and return the first one found
    for rel in ['memento', 'prev memento', 'next memento', 'last memento']:
        if rel in links:
            return links[rel]
    return None
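
The date conversion done by format_date_for_headers can also be sketched with the standard library alone, which may help when checking the value sent in the Accept-Datetime header. This is an illustrative alternative, not the notebook's own method: the function name to_accept_datetime is hypothetical, and it takes a tzinfo object rather than a timezone name. A fixed UTC+11 offset stands in here for Canberra's daylight saving time; zoneinfo.ZoneInfo could be passed instead where the timezone database is installed.

```python
from datetime import date, datetime, time, timedelta, timezone
from email.utils import format_datetime

def to_accept_datetime(iso_date, tzinfo):
    """Noon on iso_date in the given timezone, formatted as an HTTP date in GMT."""
    local_noon = datetime.combine(date.fromisoformat(iso_date), time(12), tzinfo=tzinfo)
    # usegmt=True requires a datetime whose tzinfo is exactly timezone.utc
    return format_datetime(local_noon.astimezone(timezone.utc), usegmt=True)

# Assumption: a fixed UTC+11 offset, standing in for Canberra in March 2007
aedt = timezone(timedelta(hours=11))
print(to_accept_datetime('2007-03-23', aedt))
# → Fri, 23 Mar 2007 01:00:00 GMT
```

Using noon local time as the target keeps the converted UTC value on or next to the requested day for most timezones.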

Examples

Query the NZWA Timegate for information about the NLNZ home page.

In [3]:
query_timegate('nzwa', 'http://natlib.govt.nz')
Out[3]:
{'original': 'http://natlib.govt.nz/',
 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/timemap/link/http://natlib.govt.nz/',
 'first memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/http://natlib.govt.nz/',
 'prev memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200730070058/http://natlib.govt.nz/',
 'memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20210129013644/http://natlib.govt.nz/',
 'next last memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20210130060056/http://natlib.govt.nz/'}
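
The keys in this output are the rel values from the response's Link header, which requests exposes through response.links. To show roughly where they come from, here is a simplified parser sketch. The function parse_link_header is hypothetical, it only handles quoted parameters, and the header string is hand-written for illustration from the urls above rather than captured from a real response.

```python
import re

# One link: <url> followed by any number of ; name="value" parameters
LINK_PATTERN = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_PATTERN = re.compile(r'([\w-]+)="([^"]*)"')

def parse_link_header(value):
    """Map each link's rel value (or its url if rel is missing) to its url."""
    links = {}
    for url, params in LINK_PATTERN.findall(value):
        attrs = dict(PARAM_PATTERN.findall(params))
        links[attrs.get('rel', url)] = url
    return links

# Hand-written example header, not a captured response
base = 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/'
header = (
    '<http://natlib.govt.nz/>; rel="original", '
    f'<{base}20210129013644/http://natlib.govt.nz/>; rel="memento"; '
    'datetime="Fri, 29 Jan 2021 01:36:44 GMT", '
    f'<{base}20210130060056/http://natlib.govt.nz/>; rel="next last memento"'
)
for rel, url in parse_link_header(header).items():
    print(rel, '->', url)
```

Note that a single link can carry several relations at once, as in 'next last memento', so an exact lookup for 'next memento' or 'last memento' will not match that key. The datetime value contains a comma, which is why splitting the header on commas would not work.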

Get a version of my blog from around 2005. First from the AWA:

In [4]:
get_memento('awa', 'http://discontents.com.au', '2005-01-01')
Out[4]:
'https://web.archive.org.au/awa/20041126212006mp_/http://www.discontents.com.au:80/'

Then from the IA:

In [5]:
get_memento('ia', 'http://discontents.com.au', '2005-01-01')
Out[5]:
'https://web.archive.org/web/20041126212006/http://www.discontents.com.au:80/'
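
Both repositories return a capture with the same timestamp. To check how close a Memento is to the target date, the 14-digit timestamp in the url can be parsed. This is a rough sketch under two assumptions: that the url follows the Wayback-style pattern shown in the results above, and that the timestamp is in UTC. The function name memento_datetime is hypothetical.

```python
import re
from datetime import datetime, timezone

def memento_datetime(memento_url):
    """Return the capture time encoded in a Wayback-style Memento url, or None."""
    # 14 digits (YYYYMMDDHHMMSS), optionally followed by a modifier such as 'mp_'
    match = re.search(r'/(\d{14})(?:[a-z]{2}_)?/', memento_url)
    if not match:
        return None
    # Assumption: the timestamp is expressed in UTC
    return datetime.strptime(match.group(1), '%Y%m%d%H%M%S').replace(tzinfo=timezone.utc)

mementos = [
    'https://web.archive.org.au/awa/20041126212006mp_/http://www.discontents.com.au:80/',
    'https://web.archive.org/web/20041126212006/http://www.discontents.com.au:80/',
]
target = datetime(2005, 1, 1, tzinfo=timezone.utc)
for memento in mementos:
    captured = memento_datetime(memento)
    print(captured.isoformat(), '|', (target - captured).days, 'days before target')
```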

Created by Tim Sherratt for the GLAM Workbench. Support me by becoming a GitHub sponsor!

Work on this notebook was supported by the IIPC Discretionary Funding Programme 2019-2020