Getting data from web archives using Memento

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

Systems supporting the Memento protocol provide machine-readable information about web archive captures, even if other APIs are not available. In this notebook we'll look at the way the Memento protocol is supported across four web archive repositories – the UK Web Archive, the National Library of Australia, the National Library of New Zealand, and the Internet Archive. In particular we'll examine:

  • Timegates – request web page captures from (around) a particular date
  • Timemaps – request a list of web archive captures from a particular url
  • Mementos – use url modifiers to change the way an archived web page is presented

Notebooks using Timegates or Timemaps to access capture data include:

Useful tools and documentation

In [1]:
import requests
import arrow
import re
import json

# Alternatively use the python Memento client 
In [2]:
# These are the repositories we'll be using
TIMEGATES = {
    'awa': 'https://web.archive.org.au/awa/',
    'nzwa': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/',
    'ukwa': 'https://www.webarchive.org.uk/wayback/archive/',
    'ia': 'https://web.archive.org/web/'
}

Timegates

Timegates let you query a web archive for the capture closest to a specific date. You do this by supplying your target date as the Accept-Datetime value in the headers of your request.

For example, if you wanted to query the Australian Web Archive to find the version of http://nla.gov.au/ that was captured as close as possible to 1 January 2010, you'd set the Accept-Datetime header to 'Fri, 01 Jan 2010 01:00:00 GMT' and request the url:

https://web.archive.org.au/awa/http://nla.gov.au/
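
The Accept-Datetime value has to be an HTTP-style date in GMT. As a minimal sketch using only the standard library, a timezone-aware datetime can be converted and formatted as shown here. The function name to_accept_datetime is hypothetical, and the fixed +11:00 offset is an assumption standing in for Canberra daylight saving time in January.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

def to_accept_datetime(dt):
    '''
    Format a timezone-aware datetime as an HTTP date in GMT,
    suitable for use as an Accept-Datetime header value.
    '''
    if dt.tzinfo is None:
        raise ValueError('Expected a timezone-aware datetime')
    # Convert to UTC first, as usegmt=True only accepts UTC datetimes
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)

# Assumption: a fixed +11:00 offset stands in for Canberra daylight saving time
canberra_summer = timezone(timedelta(hours=11))
accept_datetime = to_accept_datetime(datetime(2010, 1, 1, 12, 0, 0, tzinfo=canberra_summer))
headers = {'Accept-Datetime': accept_datetime}
print(headers)
```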

A GET request will return the captured page, but if all you want is the url of the archived page you can use a HEAD request and extract the information you need from the response headers. Try this:

In [3]:
response = requests.head('https://web.archive.org.au/awa/http://nla.gov.au/', headers={'Accept-Datetime': 'Fri, 01 Jan 2010 01:00:00 GMT'})
response.headers
Out[3]:
{'Server': 'nginx', 'Date': 'Fri, 22 May 2020 02:40:23 GMT', 'Content-Length': '0', 'Connection': 'keep-alive', 'Location': 'https://web.archive.org.au/awa/20100205144227/http://nla.gov.au/', 'Link': '<http://nla.gov.au/>; rel="original", <https://web.archive.org.au/awa/http://nla.gov.au/>; rel="timegate", <https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>; rel="timemap"; type="application/link-format", <https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>; rel="memento"; datetime="Fri, 05 Feb 2010 14:42:27 GMT"', 'Vary': 'accept-datetime'}

The request above returns the following headers:

{
    'Server': 'nginx', 
    'Date': 'Wed, 06 May 2020 04:34:50 GMT', 
    'Content-Length': '0', 'Connection': 'keep-alive', 
    'Location': 'https://web.archive.org.au/awa/20100205144227/http://nla.gov.au/', 
    'Link': '<http://nla.gov.au/>; rel="original", <https://web.archive.org.au/awa/http://nla.gov.au/>; rel="timegate", <https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>; rel="timemap"; type="application/link-format", <https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>; rel="memento"; datetime="Fri, 05 Feb 2010 14:42:27 GMT"', 
    'Vary': 'accept-datetime'
}

The Link header contains the Memento information. You can see that it's actually providing information on four types of link:

  • the original url (ie the url that was archived) – <http://nla.gov.au/>
  • the timegate for the harvested url (which is what we just used) – <https://web.archive.org.au/awa/http://nla.gov.au/>
  • the timemap for the harvested url (we'll look at this below) – <https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>
  • the memento – <https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>

The memento link is the capture closest in time to the date we requested. In this case there's only about a month's difference, but of course this will depend on how frequently a url is captured. Opening the link will display the capture in the web archive. As we'll see below, some systems provide additional links such as first memento, last memento, prev memento, and next memento.
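
To make the structure of the Link header concrete, here's a rough sketch that splits a header value into a dictionary keyed by rel. It's an illustration only: parse_link_header is a hypothetical helper, and the regular expression assumes that every parameter value is wrapped in double quotes, as in the example above.

```python
import re

# Matches <url> followed by any number of ; key="value" parameters
LINK_PATTERN = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')

def parse_link_header(value):
    '''
    Split a Link header value into a dict keyed by rel.
    Assumes all parameter values are double-quoted.
    '''
    links = {}
    for url, params in LINK_PATTERN.findall(value):
        attrs = dict(re.findall(r'([\w-]+)="([^"]*)"', params))
        rel = attrs.pop('rel', None)
        if rel:
            links[rel] = {'url': url, **attrs}
    return links

link_header = (
    '<http://nla.gov.au/>; rel="original", '
    '<https://web.archive.org.au/awa/http://nla.gov.au/>; rel="timegate", '
    '<https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>; rel="timemap"; type="application/link-format", '
    '<https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>; rel="memento"; datetime="Fri, 05 Feb 2010 14:42:27 GMT"'
)
parsed_links = parse_link_header(link_header)
for rel, details in parsed_links.items():
    print(rel, details['url'])
```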

Here are some functions to query a timegate in one of the four systems we're exploring. We'll use them to compare the results we get from each.

In [4]:
def format_date_for_headers(iso_date, tz):
    '''
    Convert an ISO date (YYYY-MM-DD) to a datetime at noon in the specified timezone.
    Convert the datetime to UTC and format as required by Accept-Datetime headers:
    eg Fri, 23 Mar 2007 01:00:00 GMT
    '''
    local = arrow.get(f'{iso_date} 12:00:00 {tz}', 'YYYY-MM-DD HH:mm:ss ZZZ')
    gmt = local.to('utc')
    return f'{gmt.format("ddd, DD MMM YYYY HH:mm:ss")} GMT'

def parse_links_from_headers(response):
    '''
    Extract original, timegate, timemap, and memento links from 'Link' header.
    '''
    links = response.links
    return {k: v['url'] for k, v in links.items()}

def format_timestamp(timestamp, date_format='YYYY-MM-DD HH:mm:ss'):
    return arrow.get(timestamp, 'YYYYMMDDHHmmss').format(date_format)

def query_timegate(timegate, url, date=None, tz='Australia/Canberra', request_type='head', allow_redirects=True):
    headers = {}
    if date:
        formatted_date = format_date_for_headers(date, tz)
        headers['Accept-Datetime'] = formatted_date
    # Note that you don't get a timegate response if you leave off the trailing slash
    tg_url = f'{TIMEGATES[timegate]}{url}/' if not url.endswith('/') else f'{TIMEGATES[timegate]}{url}'
    print(tg_url)
    if request_type == 'head':
        response = requests.head(tg_url, headers=headers, allow_redirects=allow_redirects)
    else:
        response = requests.get(tg_url, headers=headers, allow_redirects=allow_redirects)
    # print(response.headers)
    return parse_links_from_headers(response)

Australian Web Archive

A HEAD request that follows redirects returns no results

In [5]:
query_timegate('awa', 'http://www.nla.gov.au')
https://web.archive.org.au/awa/http://www.nla.gov.au/
Out[5]:
{}

A HEAD request that doesn't follow redirects returns results as expected

In [6]:
query_timegate('awa', 'http://www.nla.gov.au', allow_redirects=False)
https://web.archive.org.au/awa/http://www.nla.gov.au/
Out[6]:
{'original': 'http://pandora.nla.gov.au/pan/161756/20200306-0200/www.nla.gov.au/index.html',
 'timegate': 'https://web.archive.org.au/awa/http://pandora.nla.gov.au/pan/161756/20200306-0200/www.nla.gov.au/index.html',
 'timemap': 'https://web.archive.org.au/awa/timemap/link/http://pandora.nla.gov.au/pan/161756/20200306-0200/www.nla.gov.au/index.html',
 'memento': 'https://web.archive.org.au/awa/20200305172547mp_/http://pandora.nla.gov.au/pan/161756/20200306-0200/www.nla.gov.au/index.html'}

A query without an Accept-Datetime value returns a recent capture.

In [7]:
query_timegate('awa', 'http://www.nla.gov.au', allow_redirects=False)
https://web.archive.org.au/awa/http://www.nla.gov.au/
Out[7]:
{'original': 'http://pandora.nla.gov.au/pan/161756/20200306-0200/www.nla.gov.au/index.html',
 'timegate': 'https://web.archive.org.au/awa/http://pandora.nla.gov.au/pan/161756/20200306-0200/www.nla.gov.au/index.html',
 'timemap': 'https://web.archive.org.au/awa/timemap/link/http://pandora.nla.gov.au/pan/161756/20200306-0200/www.nla.gov.au/index.html',
 'memento': 'https://web.archive.org.au/awa/20200305172547mp_/http://pandora.nla.gov.au/pan/161756/20200306-0200/www.nla.gov.au/index.html'}

A query with an Accept-Datetime value of 1 January 2002 returns a capture from 20 January 2002.

In [8]:
query_timegate('awa', 'http://www.education.gov.au/', date='2002-01-01', allow_redirects=False)
https://web.archive.org.au/awa/http://www.education.gov.au/
Out[8]:
{'original': 'http://www.education.gov.au:80/',
 'timegate': 'https://web.archive.org.au/awa/http://www.education.gov.au:80/',
 'timemap': 'https://web.archive.org.au/awa/timemap/link/http://www.education.gov.au:80/',
 'memento': 'https://web.archive.org.au/awa/20020120171009mp_/http://www.education.gov.au:80/'}
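
The capture date is read from the 14-digit timestamp embedded in the memento url. As a small sketch, the timestamp can be pulled out with a regular expression and converted to a datetime. The helper name memento_datetime is hypothetical, and the pattern assumes the timestamp is optionally followed by a two-letter modifier such as mp_, as in the urls shown in this notebook.

```python
import re
from datetime import datetime, timezone

def memento_datetime(memento_url):
    '''
    Extract the 14-digit capture timestamp from a memento url
    and return it as a UTC datetime (or None if there's no timestamp).
    '''
    # Timestamp may be followed by a modifier such as mp_ or id_
    match = re.search(r'/(\d{14})(?:[a-z]{2}_)?/', memento_url)
    if match is None:
        return None
    captured = datetime.strptime(match.group(1), '%Y%m%d%H%M%S')
    return captured.replace(tzinfo=timezone.utc)

captured = memento_datetime('https://web.archive.org.au/awa/20020120171009mp_/http://www.education.gov.au:80/')
print(captured.isoformat())
```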

Using a GET rather than a HEAD request returns no Memento information when redirects are followed.

In [9]:
query_timegate('awa', 'http://www.education.gov.au/', date='2002-01-01', request_type='get')
https://web.archive.org.au/awa/http://www.education.gov.au/
Out[9]:
{}

Using a GET rather than a HEAD request returns Memento information when redirects are not followed.

In [10]:
query_timegate('awa', 'http://www.education.gov.au/', date='2002-01-01', request_type='get', allow_redirects=False)
https://web.archive.org.au/awa/http://www.education.gov.au/
Out[10]:
{'original': 'http://www.education.gov.au:80/',
 'timegate': 'https://web.archive.org.au/awa/http://www.education.gov.au:80/',
 'timemap': 'https://web.archive.org.au/awa/timemap/link/http://www.education.gov.au:80/',
 'memento': 'https://web.archive.org.au/awa/20020120171009mp_/http://www.education.gov.au:80/'}

New Zealand Web Archive

Changing whether or not redirects are followed has no effect on any of these responses.

A query without an Accept-Datetime value doesn't return a memento, but does include first memento, last memento, and prev memento.

In [11]:
query_timegate('nzwa', 'http://natlib.govt.nz')
https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://natlib.govt.nz/
Out[11]:
{'original': 'https://natlib.govt.nz/',
 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/timemap/link/https://natlib.govt.nz/',
 'timegate': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/https://natlib.govt.nz/',
 'last memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060111/https://natlib.govt.nz/',
 'first memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/https://natlib.govt.nz/',
 'prev memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060106/https://natlib.govt.nz/'}

A query with an Accept-Datetime value of 1 January 2005 doesn't return a memento, even though there's a capture available from July 2004. I don't know why this is.

In [12]:
query_timegate('nzwa', 'http://natlib.govt.nz', date='2005-01-01')
https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://natlib.govt.nz/
Out[12]:
{'original': 'http://www.natlib.govt.nz/',
 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/timemap/link/http://www.natlib.govt.nz/',
 'timegate': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://www.natlib.govt.nz/',
 'first memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/http://www.natlib.govt.nz/',
 'next memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20060704033135/http://www.natlib.govt.nz/',
 'last memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060111/http://www.natlib.govt.nz/'}

A query with an Accept-Datetime value of 1 January 2008 returns a memento from 25 February 2008, as well as first memento, last memento, prev memento, and next memento.

In [13]:
query_timegate('nzwa', 'http://natlib.govt.nz', date='2008-01-01')
https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://natlib.govt.nz/
Out[13]:
{'original': 'http://www.natlib.govt.nz/',
 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/timemap/link/http://www.natlib.govt.nz/',
 'timegate': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://www.natlib.govt.nz/',
 'first memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/http://www.natlib.govt.nz/',
 'prev memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20070322041546/http://www.natlib.govt.nz/',
 'memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20080225060238/http://www.natlib.govt.nz/',
 'next memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20081019225343/http://www.natlib.govt.nz/',
 'last memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060111/http://www.natlib.govt.nz/'}

A GET request returns the same results as a HEAD request.

In [14]:
query_timegate('nzwa', 'http://natlib.govt.nz', date='2008-01-01', request_type='get')
https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://natlib.govt.nz/
Out[14]:
{'original': 'http://www.natlib.govt.nz/',
 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/timemap/link/http://www.natlib.govt.nz/',
 'timegate': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://www.natlib.govt.nz/',
 'first memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/http://www.natlib.govt.nz/',
 'prev memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20070322041546/http://www.natlib.govt.nz/',
 'memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20080225060238/http://www.natlib.govt.nz/',
 'next memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20081019225343/http://www.natlib.govt.nz/',
 'last memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060111/http://www.natlib.govt.nz/'}

Internet Archive

Using a HEAD request that follows redirects returns results as expected.

In [15]:
query_timegate('ia', 'http://discontents.com.au')
https://web.archive.org/web/http://discontents.com.au/
Out[15]:
{'original': 'https://discontents.com.au/',
 'timemap': 'https://web.archive.org/web/timemap/link/https://discontents.com.au/',
 'timegate': 'https://web.archive.org/web/https://discontents.com.au/',
 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',
 'prev memento': 'https://web.archive.org/web/20200418035854/http://www.discontents.com.au/',
 'memento': 'https://web.archive.org/web/20200510215639/https://discontents.com.au/',
 'last memento': 'https://web.archive.org/web/20200510215639/https://discontents.com.au/'}

Using a HEAD request returns no Memento information if redirects are not followed.

In [16]:
query_timegate('ia', 'http://discontents.com.au', allow_redirects=False)
https://web.archive.org/web/http://discontents.com.au/
Out[16]:
{}

A query without an Accept-Datetime value returns a memento and also includes a first memento, prev memento, and last memento. It seems that the memento returned can be the second last capture, although in the output below it's the same as the last memento.

In [17]:
query_timegate('ia', 'http://discontents.com.au')
https://web.archive.org/web/http://discontents.com.au/
Out[17]:
{'original': 'https://discontents.com.au/',
 'timemap': 'https://web.archive.org/web/timemap/link/https://discontents.com.au/',
 'timegate': 'https://web.archive.org/web/https://discontents.com.au/',
 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',
 'prev memento': 'https://web.archive.org/web/20200418035854/http://www.discontents.com.au/',
 'memento': 'https://web.archive.org/web/20200510215639/https://discontents.com.au/',
 'last memento': 'https://web.archive.org/web/20200510215639/https://discontents.com.au/'}

A query with an Accept-Datetime value of 1 January 2010 returns a memento from 9 February 2010. This is closer to the requested date than the prev memento from 30 October 2009.

In [18]:
query_timegate('ia', 'http://discontents.com.au', date='2010-01-01')
https://web.archive.org/web/http://discontents.com.au/
Out[18]:
{'original': 'http://discontents.com.au:80/',
 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au:80/',
 'timegate': 'https://web.archive.org/web/http://discontents.com.au:80/',
 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',
 'prev memento': 'https://web.archive.org/web/20091030053520/http://discontents.com.au/',
 'memento': 'https://web.archive.org/web/20100209041537/http://discontents.com.au:80/',
 'next memento': 'https://web.archive.org/web/20100523101442/http://discontents.com.au:80/',
 'last memento': 'https://web.archive.org/web/20200510215639/https://discontents.com.au/'}
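
One way to check how far each returned link is from the requested date is to compare the timestamps in the urls. The sketch below is an illustration under a few assumptions: the helper names are hypothetical, the timestamps are treated as UTC, and the target is noon in Canberra on 1 January 2010 expressed as 01:00 UTC.

```python
import re
from datetime import datetime, timezone

def capture_datetime(url):
    '''Read the 14-digit timestamp from a memento url as a UTC datetime.'''
    match = re.search(r'/(\d{14})(?:[a-z]{2}_)?/', url)
    return datetime.strptime(match.group(1), '%Y%m%d%H%M%S').replace(tzinfo=timezone.utc)

def rank_by_distance(links, target):
    '''
    Return (distance, rel) pairs for every memento-type link,
    sorted so the capture closest to the target comes first.
    '''
    candidates = {rel: url for rel, url in links.items() if rel.endswith('memento')}
    return sorted((abs(capture_datetime(url) - target), rel) for rel, url in candidates.items())

# A subset of the links returned by the query above
links = {
    'timegate': 'https://web.archive.org/web/http://discontents.com.au:80/',
    'prev memento': 'https://web.archive.org/web/20091030053520/http://discontents.com.au/',
    'memento': 'https://web.archive.org/web/20100209041537/http://discontents.com.au:80/',
    'next memento': 'https://web.archive.org/web/20100523101442/http://discontents.com.au:80/',
}
# Assumption: noon on 1 January 2010 in Canberra is 01:00 UTC
target = datetime(2010, 1, 1, 1, 0, 0, tzinfo=timezone.utc)
ranked = rank_by_distance(links, target)
for distance, rel in ranked:
    print(rel, distance.days, 'days')
```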

GET requests return different results if redirects are not followed.

In [19]:
query_timegate('ia', 'http://discontents.com.au', date='2010-01-01', request_type='get')
https://web.archive.org/web/http://discontents.com.au/
Out[19]:
{'original': 'http://discontents.com.au:80/',
 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au:80/',
 'timegate': 'https://web.archive.org/web/http://discontents.com.au:80/',
 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',
 'prev memento': 'https://web.archive.org/web/20091030053520/http://discontents.com.au/',
 'memento': 'https://web.archive.org/web/20100209041537/http://discontents.com.au:80/',
 'next memento': 'https://web.archive.org/web/20100523101442/http://discontents.com.au:80/',
 'last memento': 'https://web.archive.org/web/20200510215639/https://discontents.com.au/'}
In [20]:
query_timegate('ia', 'http://discontents.com.au', date='2010-01-01', request_type='get', allow_redirects=False)
https://web.archive.org/web/http://discontents.com.au/
Out[20]:
{'original': 'http://discontents.com.au/',
 'memento': 'https://web.archive.org/web/20100209041537/http://discontents.com.au/',
 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au/'}

UK Web Archive

Changing whether or not redirects are followed has no effect on any of these responses.

A query without an Accept-Datetime value doesn't return a memento.

In [21]:
query_timegate('ukwa', 'http://bl.uk')
https://www.webarchive.org.uk/wayback/archive/http://bl.uk/
Out[21]:
{'original': 'http://bl.uk/',
 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://bl.uk/',
 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://bl.uk/'}

A query with an Accept-Datetime value of 1 January 2006 returns a memento from 1 January 2006. However, this date doesn't seem to represent an actual capture. There seems to be a problem with the Timegate.

In [22]:
query_timegate('ukwa', 'http://bl.uk', date='2006-01-01')
https://www.webarchive.org.uk/wayback/archive/http://bl.uk/
Out[22]:
{'original': 'http://bl.uk/',
 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://bl.uk/',
 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://bl.uk/',
 'memento': 'https://www.webarchive.org.uk/wayback/archive/20060101010000mp_/http://bl.uk/'}

A GET request returns the same results as a HEAD request.

In [23]:
query_timegate('ukwa', 'http://bl.uk', date='2006-01-01', request_type='get')
https://www.webarchive.org.uk/wayback/archive/http://bl.uk/
Out[23]:
{'original': 'http://bl.uk/',
 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://bl.uk/',
 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://bl.uk/',
 'memento': 'https://www.webarchive.org.uk/wayback/archive/20060101010000mp_/http://bl.uk/'}

Summarising the differences

As you can see above, there are a couple of significant differences in the way that Timegates behave across the four repositories.

  • The Wayback systems (IA and NZWA) provide more information than the Pywb systems (first memento, last memento, prev memento, and next memento).
  • The UKWA and NZWA don't return a memento unless you include a date in the Accept-Datetime header. The AWA and IA return a recently captured memento by default (though not necessarily the most recent?).
  • You can use either HEAD or GET with UKWA and NZWA, but IA and AWA behave differently depending on the type of request and whether redirects are followed. To get results from either a HEAD or GET request, AWA requests should not follow redirects. To get results from a HEAD request, IA requests should follow redirects. GET requests to IA will return results whether or not redirects are followed; however, those results differ.

Normalising Timegate responses and queries

Here's some code to smooth out the differences between systems, and return Memento data as a Python dictionary. Specifically it:

  • Inserts the current date into requests to the UKWA or NZWA if no date is specified. This means they behave like the other repositories that return a recent Memento.
  • Follows redirects for requests to the IA.
  • If there is no memento value in the response (as sometimes happens with NZWA), it looks for a first, last, prev or next value instead.
In [24]:
def query_timegate(timegate, url, date=None, tz='Australia/Canberra'):
    '''
    Query the specified repository for a Memento.
    '''
    headers = {}
    if date:
        formatted_date = format_date_for_headers(date, tz)
        headers['Accept-Datetime'] = formatted_date
    # UKWA & NZWA don't seem to default to latest date if no date supplied
    # (the keys checked here need to match the keys used in TIMEGATES)
    elif timegate in ['ukwa', 'nzwa']:
        formatted_date = format_date_for_headers(arrow.utcnow().format('YYYY-MM-DD'), tz)
        headers['Accept-Datetime'] = formatted_date
    # Note that you don't get a timegate response if you leave off the trailing slash, but extras don't hurt!
    tg_url = f'{TIMEGATES[timegate]}{url}/' if not url.endswith('/') else f'{TIMEGATES[timegate]}{url}'
    # print(tg_url)
    # IA only works if redirects are followed -- this defaults to False with HEAD requests...
    if timegate == 'ia':
        allow_redirects = True
    else:
        allow_redirects = False
    response = requests.head(tg_url, headers=headers, allow_redirects=allow_redirects)
    return parse_links_from_headers(response)

def get_memento(timegate, url, date=None, tz='Australia/Canberra'):
    '''
    If there's no memento in the results, look for an alternative.
    Returns None if no suitable link is found.
    '''
    links = query_timegate(timegate, url, date, tz)
    # NZWA doesn't always seem to return a Memento, so we'll build in some fuzziness
    for rel in ['memento', 'prev memento', 'next memento', 'last memento', 'first memento']:
        if rel in links:
            return links[rel]
    return None

Now we can request a Memento from any of the four repositories and get back the results as a Python dictionary. You can see this code in action in the Get full page screenshots from archived web pages notebook.

In [25]:
query_timegate('ukwa', 'http://bl.uk', date='2015-01-01')
Out[25]:
{'original': 'http://bl.uk/',
 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://bl.uk/',
 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://bl.uk/',
 'memento': 'https://www.webarchive.org.uk/wayback/archive/20150101010000mp_/http://bl.uk/'}

Or we can just get the url for a Memento, falling back to alternative values if memento is missing.

In [26]:
get_memento('nzwa', 'http://natlib.govt.nz')
Out[26]:
'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060106/http://natlib.govt.nz/'

Timemaps

Memento Timemaps provide machine-processable lists of web page captures from a particular archive. They are available from both OpenWayback and Pywb systems, though there are some differences. The Pywb documentation notes that the following formats are available:

  • link – returns an application/link-format as required by the Memento spec
  • cdxj – returns a timemap in the native CDXJ format
  • json – returns the timemap as newline-delimited JSON lines (NDJSON) format

Timemaps are requested using a url with the following format:

http://[address.of.archive]/[collection]/timemap/[format]/[web page url]

So if you wanted to query the Australian Web Archive to get a list of captures in JSON format from http://nla.gov.au/ you'd use this url:

https://web.archive.org.au/awa/timemap/json/http://nla.gov.au/
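
As a quick illustration of the url pattern, here's a sketch that assembles a Timemap url from its parts. The function name build_timemap_url and its parameters are hypothetical, and the list of formats is taken from the Pywb options listed above, so other systems may not support them all.

```python
def build_timemap_url(archive_base, url, timemap_format='link'):
    '''
    Assemble a Timemap url from the base url of an archive collection,
    the requested format, and the url of the original web page.
    '''
    # Assumption: these are the formats noted in the Pywb documentation
    if timemap_format not in ('link', 'cdxj', 'json'):
        raise ValueError(f'Unknown timemap format: {timemap_format}')
    # Remove any trailing slash so we don't end up with a double slash
    base = archive_base.rstrip('/')
    return f'{base}/timemap/{timemap_format}/{url}'

timemap_url = build_timemap_url('https://web.archive.org.au/awa/', 'http://nla.gov.au/', 'json')
print(timemap_url)
```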

The examples below show how the format and behaviour of Timemaps vary slightly across the four repositories we're interested in.

In [27]:
def get_timemap(timegate, url, format='json'):
    '''
    Basic function to get a Timemap for the supplied url.
    '''
    tg_url = f'{TIMEGATES[timegate]}timemap/{format}/{url}/'
    response = requests.get(tg_url)
    # Show the content-type
    print(response.headers['content-type'])
    return response.text

National Library of Australia

Request a Timemap in link format. Note that response headers include content-type of application/link-format.

In [28]:
timemap = get_timemap('awa', 'http://www.gov.au', 'link')
# Show the first 5 lines
print('\n'.join(timemap.splitlines()[:5]))
application/link-format
<https://web.archive.org.au/awa/timemap/link/http://www.gov.au/>; rel="self"; type="application/link-format"; from="Wed, 06 Dec 2000 21:15:00 GMT",
<https://web.archive.org.au/awa/http://www.gov.au/>; rel="timegate",
<http://www.gov.au/>; rel="original",
<https://web.archive.org.au/awa/20001206211500mp_/http://www.gov.au/>; rel="memento"; datetime="Wed, 06 Dec 2000 21:15:00 GMT"; collection="awa",
<https://web.archive.org.au/awa/20010118203600mp_/http://www.gov.au/>; rel="memento"; datetime="Thu, 18 Jan 2001 20:36:00 GMT"; collection="awa",
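
Link format results are plain text, so they need some parsing before you can work with the captures. The sketch below is one possible approach, not a full link-format parser: parse_link_timemap is a hypothetical helper, and it assumes every parameter value is double-quoted, as in the output above.

```python
import re
from email.utils import parsedate_to_datetime

# Matches <url> followed by any number of ; key="value" parameters
LINK_PATTERN = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')

def parse_link_timemap(text):
    '''
    Extract the url and capture datetime of each memento
    from a Timemap in link format.
    '''
    mementos = []
    for url, params in LINK_PATTERN.findall(text):
        attrs = dict(re.findall(r'([\w-]+)="([^"]*)"', params))
        # rel can be 'memento', 'first memento', 'last memento' etc
        if 'memento' in attrs.get('rel', '').split() and 'datetime' in attrs:
            mementos.append({'url': url, 'datetime': parsedate_to_datetime(attrs['datetime'])})
    return mementos

# Sample lines copied from the output above
sample = '''<https://web.archive.org.au/awa/timemap/link/http://www.gov.au/>; rel="self"; type="application/link-format"; from="Wed, 06 Dec 2000 21:15:00 GMT",
<https://web.archive.org.au/awa/http://www.gov.au/>; rel="timegate",
<http://www.gov.au/>; rel="original",
<https://web.archive.org.au/awa/20001206211500mp_/http://www.gov.au/>; rel="memento"; datetime="Wed, 06 Dec 2000 21:15:00 GMT"; collection="awa",
<https://web.archive.org.au/awa/20010118203600mp_/http://www.gov.au/>; rel="memento"; datetime="Thu, 18 Jan 2001 20:36:00 GMT"; collection="awa",'''
link_mementos = parse_link_timemap(sample)
for memento in link_mementos:
    print(memento['datetime'].isoformat(), memento['url'])
```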

Request a Timemap in json format. This returns ndjson (Newline Delimited JSON) – each capture is a JSON object, separated by a line break. Note that the response headers include content-type of text/x-ndjson.

In [29]:
timemap = get_timemap('awa', 'http://www.aph.gov.au/Senate/committee/eet_ctte/uni_finances/report/index.htm', 'json')
# Show the first line
print('\n'.join(timemap.splitlines()[:1]))
text/x-ndjson
{"urlkey": "au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm", "timestamp": "20031122074837", "url": "http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm", "mime": "text/html", "status": "200", "digest": "3H5Z77RMYKXRFCE6ODTWAKMBPZOGJ4YE", "offset": "97170362", "filename": "NLA-EXTRACTION-1996-2004-ARCS-PART-01336-000000.arc.gz", "source": "awa", "source-coll": "awa"}
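
Because each line is a separate JSON object, the response can't be loaded with a single call to json.loads(). A minimal sketch of parsing it line by line follows. The function name parse_ndjson_timemap is hypothetical, and the sample is the capture above trimmed to a few fields, plus a second made-up line to show multiple captures.

```python
import json

def parse_ndjson_timemap(text):
    '''
    Convert a Timemap in ndjson format into a list of dicts,
    one for each capture. Blank lines are ignored.
    '''
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# The first line is trimmed from the output above; the second is invented for illustration
sample = '''{"urlkey": "au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm", "timestamp": "20031122074837", "status": "200"}
{"urlkey": "au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm", "timestamp": "20040101000000", "status": "200"}
'''
ndjson_captures = parse_ndjson_timemap(sample)
print(len(ndjson_captures), ndjson_captures[0]['timestamp'])
```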

Request a Timemap in cdxj format. Note that response headers include content-type of text/x-cdxj.

In [30]:
timemap = get_timemap('awa', 'http://www.aph.gov.au/Senate/committee/eet_ctte/uni_finances/report/index.htm', 'cdxj')
# Show the first line
print('\n'.join(timemap.splitlines()[:1]))
text/x-cdxj
au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm 20031122074837 {"url": "http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm", "mime": "text/html", "status": "200", "digest": "3H5Z77RMYKXRFCE6ODTWAKMBPZOGJ4YE", "offset": "97170362", "filename": "NLA-EXTRACTION-1996-2004-ARCS-PART-01336-000000.arc.gz", "source": "awa", "source-coll": "awa"}
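
Each cdxj line is made up of the url key, the timestamp, and a block of JSON, separated by spaces. As a sketch, a line can be split on the first two spaces and the remainder parsed as JSON. The helper name parse_cdxj_line is hypothetical, and it assumes the url key itself contains no spaces.

```python
import json

def parse_cdxj_line(line):
    '''
    Split a cdxj line into urlkey, timestamp, and JSON block,
    and return them combined in a single dict.
    '''
    # Only split on the first two spaces, as the JSON block contains spaces
    urlkey, timestamp, json_block = line.split(' ', 2)
    record = json.loads(json_block)
    record['urlkey'] = urlkey
    record['timestamp'] = timestamp
    return record

# Sample trimmed from the output above
sample_line = 'au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm 20031122074837 {"url": "http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm", "mime": "text/html", "status": "200"}'
cdxj_record = parse_cdxj_line(sample_line)
print(cdxj_record['timestamp'], cdxj_record['url'])
```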

UK Web Archive

Request a Timemap in link format. Note that response headers include content-type of application/link-format.

In [31]:
timemap = get_timemap('ukwa', 'http://bl.uk', 'link')
print('\n'.join(timemap.splitlines()[:5]))
application/link-format
<https://www.webarchive.org.uk/wayback/archive/timemap/link/http://bl.uk/>; rel="self"; type="application/link-format"; from="Tue, 30 Oct 2001 00:00:19 GMT",
<https://www.webarchive.org.uk/wayback/archive/http://bl.uk/>; rel="timegate",
<http://bl.uk/>; rel="original",
<https://www.webarchive.org.uk/wayback/archive/20011030000019mp_/http://www.bl.uk/>; rel="memento"; datetime="Tue, 30 Oct 2001 00:00:19 GMT"; collection="archive",
<https://www.webarchive.org.uk/wayback/archive/20011113000000mp_/http://www.bl.uk/>; rel="memento"; datetime="Tue, 13 Nov 2001 00:00:00 GMT"; collection="archive",

Request a Timemap in json format. This returns ndjson (Newline Delimited JSON) – each capture is a JSON object, separated by a line break. Note that the response headers include content-type of text/x-ndjson.

In [32]:
timemap = get_timemap('ukwa', 'http://bl.uk', 'json')
print('\n'.join(timemap.splitlines()[:1]))
text/x-ndjson
{"urlkey": "uk,bl)/", "timestamp": "20011030000019", "url": "http://www.bl.uk/", "mime": "text/html", "status": "200", "digest": "JN4RHYLNGS7X64HADIX3XHDIMYDBLAAW", "redirect": "-", "robotflags": "-", "length": "0", "offset": "10813988", "filename": "/data/102148/31031347/WARCS/BL-31031347.warc.gz", "load_url": "https://www.webarchive.org.uk/wayback/archive/20011030000019id_/http://www.bl.uk/", "source": "archive", "source-coll": "archive", "access": "allow"}

Request a Timemap in cdxj format. Note that response headers include content-type of text/x-cdxj.

In [33]:
timemap = get_timemap('ukwa', 'http://bl.uk', 'cdxj')
print('\n'.join(timemap.splitlines()[:1]))
text/x-cdxj
uk,bl)/ 20011030000019 {"url": "http://www.bl.uk/", "mime": "text/html", "status": "200", "digest": "JN4RHYLNGS7X64HADIX3XHDIMYDBLAAW", "redirect": "-", "robotflags": "-", "length": "0", "offset": "10813988", "filename": "/data/102148/31031347/WARCS/BL-31031347.warc.gz", "load_url": "https://www.webarchive.org.uk/wayback/archive/20011030000019id_/http://www.bl.uk/", "source": "archive", "source-coll": "archive", "access": "allow"}

National Library of New Zealand

Request a Timemap in link format. Note that response headers include content-type of application/link-format.

In [34]:
timemap = get_timemap('nzwa', 'http://natlib.govt.nz', 'link')
print('\n'.join(timemap.splitlines()[:5]))
application/link-format
<http://natlib.govt.nz/>; rel="original",
<https://ndhadeliver.natlib.govt.nz/webarchive/wayback/timemap/link/http://natlib.govt.nz/>; rel="self"; type="application/link-format"; from="Sun, 11 Jul 2004 21:32:25 GMT"; until="Thu, 30 Jan 2020 06:01:11 GMT",
<https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://natlib.govt.nz/>; rel="timegate",
<https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/http://www.natlib.govt.nz/>; rel="first memento"; datetime="Sun, 11 Jul 2004 21:32:25 GMT",
<https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20060704033135/http://www.natlib.govt.nz/>; rel="memento"; datetime="Tue, 04 Jul 2006 03:31:35 GMT",

A request for a Timemap in json returns results in link format. OpenWayback only supports the link format.

In [35]:
timemap = get_timemap('nzwa', 'http://natlib.govt.nz', 'json')
print('\n'.join(timemap.splitlines()[:5]))
application/link-format
<http://natlib.govt.nz/>; rel="original",
<https://ndhadeliver.natlib.govt.nz/webarchive/wayback/timemap/link/http://natlib.govt.nz/>; rel="self"; type="application/link-format"; from="Sun, 11 Jul 2004 21:32:25 GMT"; until="Thu, 30 Jan 2020 06:01:11 GMT",
<https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://natlib.govt.nz/>; rel="timegate",
<https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/http://www.natlib.govt.nz/>; rel="first memento"; datetime="Sun, 11 Jul 2004 21:32:25 GMT",
<https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20060704033135/http://www.natlib.govt.nz/>; rel="memento"; datetime="Tue, 04 Jul 2006 03:31:35 GMT",

Internet Archive

Request a Timemap in link format. Note that response headers include content-type of application/link-format.

In [36]:
timemap = get_timemap('ia', 'http://discontents.com.au', 'link')
print('\n'.join(timemap.splitlines()[:5]))
application/link-format
<http://www.discontents.com.au:80/>; rel="original",
<https://web.archive.org/web/timemap/link/http://discontents.com.au/>; rel="self"; type="application/link-format"; from="Sun, 06 Dec 1998 01:22:33 GMT",
<https://web.archive.org>; rel="timegate",
<https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/>; rel="first memento"; datetime="Sun, 06 Dec 1998 01:22:33 GMT",
<https://web.archive.org/web/19981212024410/http://www.discontents.com.au:80/>; rel="memento"; datetime="Sat, 12 Dec 1998 02:44:10 GMT",

Request for timemap in json format returns results in JSON as an array of arrays, where the first row provides the column headings. Response headers include content-type of application/json.

In [37]:
timemap = get_timemap('ia', 'http://discontents.com.au', 'json')
print('\n'.join(timemap.splitlines()[:5]))
application/json
[["urlkey","timestamp","original","mimetype","statuscode","digest","redirect","robotflags","length","offset","filename"],
["au,com,discontents)/","19981206012233","http://www.discontents.com.au:80/","text/html","200","FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36","-","-","1610","43993900","green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz"],
["au,com,discontents)/","19981212024410","http://www.discontents.com.au:80/","text/html","200","FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36","-","-","1613","17792789","slash-913417727-c/slash-913430608.arc.gz"],
["au,com,discontents)/","19990125094813","http://www.discontents.com.au:80/","text/html","200","FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36","-","-","1613","11419234","slash-913417727-c/slash_19990124232053-917257670.arc.gz"],
["au,com,discontents)/","19990208004052","http://discontents.com.au:80/","text/html","200","FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36","-","-","1612","13269748","slash-913417727-c/slash-918434425.arc.gz"],

A request for a Timemap in cdxj format returns results in plain text, with fields separated by spaces and captures separated by line breaks. The response headers include a content-type of text/plain.

In [38]:
timemap = get_timemap('ia', 'http://discontents.com.au', 'cdxj')
print('\n'.join(timemap.splitlines()[:5]))
text/plain
au,com,discontents)/ 19981206012233 http://www.discontents.com.au:80/ text/html 200 FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36 - - 1610 43993900 green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz
au,com,discontents)/ 19981212024410 http://www.discontents.com.au:80/ text/html 200 FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36 - - 1613 17792789 slash-913417727-c/slash-913430608.arc.gz
au,com,discontents)/ 19990125094813 http://www.discontents.com.au:80/ text/html 200 FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36 - - 1613 11419234 slash-913417727-c/slash_19990124232053-917257670.arc.gz
au,com,discontents)/ 19990208004052 http://discontents.com.au:80/ text/html 200 FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36 - - 1612 13269748 slash-913417727-c/slash-918434425.arc.gz
au,com,discontents)/ 19990208012714 http://www.discontents.com.au:80/ text/html 200 FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36 - - 1613 17395194 slash-913417727-c/slash-918437200.arc.gz
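
Because these plain text results have no header row, field names have to be supplied separately. Here is a minimal sketch, assuming the columns appear in the same order as the header row of IA's JSON Timemap (parse_cdx_text and IA_FIELDS are hypothetical names, not part of this notebook):

```python
# Assumed column order, copied from the first row of IA's JSON Timemap
IA_FIELDS = ['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode',
             'digest', 'redirect', 'robotflags', 'length', 'offset', 'filename']

def parse_cdx_text(text):
    '''
    Hypothetical helper: convert space-separated capture lines
    into a list of dicts keyed by the IA field names.
    '''
    captures = []
    for line in text.splitlines():
        values = line.split()
        # Skip anything that doesn't have the expected number of fields
        if len(values) == len(IA_FIELDS):
            captures.append(dict(zip(IA_FIELDS, values)))
    return captures

# First two lines of the output above
sample = '''au,com,discontents)/ 19981206012233 http://www.discontents.com.au:80/ text/html 200 FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36 - - 1610 43993900 green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz
au,com,discontents)/ 19981212024410 http://www.discontents.com.au:80/ text/html 200 FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36 - - 1613 17792789 slash-913417727-c/slash-913430608.arc.gz'''

captures = parse_cdx_text(sample)
print(captures[0]['timestamp'], captures[0]['original'])
```

Splitting on whitespace only works because none of the fields in this sample contain spaces. Check that this holds before relying on it for other data.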

Differences in field labels

If we compare the Pywb JSON output with the IA Wayback output, we see there are also some differences in the field labels. In particular original in IA Wayback is just url in Pywb, while statuscode and mimetype are shortened to status and mime in Pywb.

In [39]:
timemap = get_timemap('ia', 'http://bl.uk', 'json')
data = json.loads(timemap)
data[0]
application/json
Out[39]:
['urlkey',
 'timestamp',
 'original',
 'mimetype',
 'statuscode',
 'digest',
 'redirect',
 'robotflags',
 'length',
 'offset',
 'filename']
In [40]:
timemap = get_timemap('ukwa', 'http://bl.uk', 'json')
data = [json.loads(line) for line in timemap.splitlines()]
list(data[0].keys())
text/x-ndjson
Out[40]:
['urlkey',
 'timestamp',
 'url',
 'mime',
 'status',
 'digest',
 'redirect',
 'robotflags',
 'length',
 'offset',
 'filename',
 'load_url',
 'source',
 'source-coll',
 'access']

Summarising the differences

The good news is that all repositories provide Timemaps in the standard link format as required by the Memento specification. However, there's more variation when it comes to other formats.

  • NLNZ only provides the link format.
  • IA's json format is different to the Pywb format from UKWA and NLA.
  • IA uses different labels for some values.

Normalising Timemaps

With the information above we can construct some functions to return normalised Timemap results as JSON. To do this we need to:

  • Convert the link format from NLNZ to JSON
  • Restructure the JSON output from IA to match the Pywb format
  • Change some of the column headings in the IA data to match the Pywb format

Because the link format provides less information than the json format, we could also try to enrich the NLNZ data by requesting more information about individual Mementos.

In [41]:
def convert_lists_to_dicts(results):
    '''
    Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.
    Renames keys to standardise IA with other Timemaps.
    '''
    if results:
        keys = results[0]
        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]
    else:
        results_as_dicts = results
    for d in results_as_dicts:
        d['status'] = d.pop('statuscode')
        d['mime'] = d.pop('mimetype')
        d['url'] = d.pop('original')
    return results_as_dicts

def get_capture_data_from_memento(url, request_type='head'):
    '''
    For OpenWayback systems this can get some extra capture info to insert into Timemaps.
    '''
    if request_type == 'head':
        response = requests.head(url)
    else:
        response = requests.get(url)
    headers = response.headers
    length = headers.get('x-archive-orig-content-length')
    status = headers.get('x-archive-orig-status')
    status = status.split(' ')[0] if status else None
    mime = headers.get('x-archive-orig-content-type')
    mime = mime.split(';')[0] if mime else None
    return {'length': length, 'status': status, 'mime': mime}

def convert_link_to_json(results, enrich_data=False):
    '''
    Converts link formatted Timemap to JSON.
    '''
    data = []
    for line in results.splitlines():
        parts = line.split('; ')
        if len(parts) > 1:
            rel = re.search(r'rel="([^"]*)"', parts[1])
            # Check for the word 'memento' so that links labelled
            # 'first memento' and 'last memento' are kept as captures too
            if rel and 'memento' in rel.group(1).split():
                link = parts[0].strip('<>')
                match = re.search(r'/(\d{14})/(.*)$', link)
                # Skip links that don't include a 14 digit timestamp
                if not match:
                    continue
                timestamp, original = match.groups()
                capture = {'timestamp': timestamp, 'url': original}
                if enrich_data:
                    capture.update(get_capture_data_from_memento(link))
                    print(capture)
                data.append(capture)
    return data
                
def get_timemap_as_json(timegate, url):
    '''
    Get a Timemap then normalise results (if necessary) to return a list of dicts.
    '''
    tg_url = f'{TIMEGATES[timegate]}timemap/json/{url}/'
    response = requests.get(tg_url)
    response_type = response.headers['content-type']
    print(response_type)
    if response_type == 'text/x-ndjson':
        data = [json.loads(line) for line in response.text.splitlines()]
    elif response_type == 'application/json':
        data = convert_lists_to_dicts(response.json())
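    elif response_type in ['application/link-format', 'text/html;charset=utf-8']:
        data = convert_link_to_json(response.text)
    else:
        # Fail clearly rather than hit an UnboundLocalError on the return below
        raise ValueError(f'Unexpected content type: {response_type}')
    return data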

Now we can get information about captures in a standardised JSON format from all four repositories, although we can't rely on NLNZ data having anything more than timestamp and url for each capture. You can see this in action in the Display changes in the text of an archived web page over time notebook.

In [43]:
timemap = get_timemap_as_json('ukwa', 'http://bl.uk')
timemap[0]
text/x-ndjson
Out[43]:
{'urlkey': 'uk,bl)/',
 'timestamp': '20011030000019',
 'url': 'http://www.bl.uk/',
 'mime': 'text/html',
 'status': '200',
 'digest': 'JN4RHYLNGS7X64HADIX3XHDIMYDBLAAW',
 'redirect': '-',
 'robotflags': '-',
 'length': '0',
 'offset': '10813988',
 'filename': '/data/102148/31031347/WARCS/BL-31031347.warc.gz',
 'load_url': 'https://www.webarchive.org.uk/wayback/archive/20011030000019id_/http://www.bl.uk/',
 'source': 'archive',
 'source-coll': 'archive',
 'access': 'allow'}
In [44]:
timemap = get_timemap_as_json('ia', 'http://bl.uk')
timemap[0]
application/json
Out[44]:
{'urlkey': 'uk,bl)/',
 'timestamp': '19970218190613',
 'digest': 'Z42UMUL76GODKO3EMNSLXDTCST66VDAX',
 'redirect': '-',
 'robotflags': '-',
 'length': '1208',
 'offset': '19524651',
 'filename': 'GR-001114-c/GR-002277.arc.gz',
 'status': '200',
 'mime': 'text/html',
 'url': 'http://www.bl.uk:80/'}
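
Once results share the same keys, captures from different repositories can be handled together. As a rough sketch (merge_timemaps is a hypothetical helper, and the sample records are trimmed copies of the two results above), this combines normalised Timemaps into a single list ordered by capture date:

```python
def merge_timemaps(timemaps):
    '''
    Hypothetical helper: combine normalised Timemaps from several
    repositories, labelling each capture with its source archive.
    '''
    merged = []
    for archive, captures in timemaps.items():
        for capture in captures:
            merged.append({'archive': archive, **capture})
    # 14 digit timestamps sort chronologically as strings
    return sorted(merged, key=lambda c: c['timestamp'])

# Trimmed copies of the first UKWA and IA captures shown above
timemaps = {
    'ukwa': [{'timestamp': '20011030000019', 'url': 'http://www.bl.uk/', 'status': '200', 'mime': 'text/html'}],
    'ia': [{'timestamp': '19970218190613', 'url': 'http://www.bl.uk:80/', 'status': '200', 'mime': 'text/html'}]
}

merged = merge_timemaps(timemaps)
for capture in merged:
    print(capture['timestamp'], capture['archive'], capture['url'])
```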

Mementos

You can also modify the url of a Memento to change the way it's presented. In particular, adding id_ after the timestamp will tell the server that you want the original harvested version of the webpage, without any rewriting of links, or web archive navigation features. For example:

https://web.archive.org.au/awa/20200302223537id_/http://discontents.com.au/

This works with all four repositories. Note, however, that for the Australian Web Archive you need to use the web.archive.org.au domain, not webarchive.nla.gov.au.

In addition, NLNZ and IA both support the if_ option, which provides a view of the archived page without the web archive's navigation headers inserted, but with links to CSS, JS, and images rewritten to point to archived versions. This is as close as you can get to looking at the original page, and I've used it in the Get full page screenshots from archived web pages notebook. Note that if you add if_ to requests from the UKWA or the NLA, you'll be redirected to the standard view, with the original page framed by the web archive navigation.

Pywb's page on url rewriting has some useful information about this.
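
The url modifiers described above can be sketched as a small helper. This assumes Memento urls follow the pattern of repository prefix, 14-digit timestamp, optional modifier, then the original url. The name get_memento_url is hypothetical, and the TIMEGATES dictionary is repeated from the top of this notebook so the example runs on its own:

```python
# Repeated from the top of the notebook so this example is self-contained
TIMEGATES = {
    'awa': 'https://web.archive.org.au/awa/',
    'nzwa': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/',
    'ukwa': 'https://www.webarchive.org.uk/wayback/archive/',
    'ia': 'https://web.archive.org/web/'
}

def get_memento_url(timegate, timestamp, url, modifier=''):
    '''
    Hypothetical helper: build a Memento url.
    Use modifier='id_' for the original harvested page, or
    modifier='if_' (NLNZ and IA only) for the page without navigation.
    '''
    return f'{TIMEGATES[timegate]}{timestamp}{modifier}/{url}'

print(get_memento_url('awa', '20200302223537', 'http://discontents.com.au/', 'id_'))
```

With these arguments the function reproduces the Australian Web Archive example url shown above.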


Created by Tim Sherratt for the GLAM Workbench.

Work on this notebook was supported by the IIPC Discretionary Funding Programme 2019-2020.