Find all the archived versions of a web page

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

You can find all the archived versions of a web page by requesting a Timemap from a Memento-compliant repository. If the repository has a CDX API, you can get much the same data by doing an exact url search.

In [1]:
import requests
import json
import re
from surt import surt

Using Timemaps

Works with AWA, IA, NZWA & UKWA

Variations in the way Memento is implemented across repositories are documented in Getting data from web archives using Memento. The functions below smooth out these variations to provide a (mostly) consistent interface to the UK Web Archive, Australian Web Archive, New Zealand Web Archive, and the Internet Archive. They could be easily modified to work with other Memento-compliant repositories.

To get all captures of a url in JSON format:

get_timemap_as_json([timegate], [url], enrich_data=[True or False])

Parameters:

  • timegate – one of 'ukwa' (UK), 'awa' (Australia), 'nzwa' (New Zealand), or 'ia' (Internet Archive)
  • url – the url you want to look for in the archive
  • enrich_data – NZWA Timemaps include less information than the others. If you set this to True, the script will query each memento in turn to try to find more capture information (such as mime and status). This slows things down quite a bit and isn't always successful, so leave it as False unless you have a good reason.

The data is returned in JSON format. The number of fields returned varies, but these will always be present:

  • urlkey – SURT formatted url (in the case of NZWA this is generated by the script rather than the archive)
  • timestamp – the date and time when the page was captured by the archive, in YYYYMMDDHHmmss format
  • url – the url of the page that was captured

The AWA, IA, and UKWA Timemaps also include:

  • status – HTTP status code returned by the capture request
  • mime – the mimetype of the captured resource
  • digest – an algorithmically generated string that uniquely identifies the contents of the captured resource

For more information on the contents of these fields, see Exploring the Internet Archive's CDX API.
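
The timestamp values are plain strings, so you'll need to convert them yourself if you want to treat them as dates. A minimal sketch using Python's standard datetime module:

In [ ]:
from datetime import datetime

# Timestamps are strings in YYYYMMDDHHmmss format
datetime.strptime('19981206012233', '%Y%m%d%H%M%S')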

In [2]:
# These are the repositories we'll be using
TIMEGATES = {
    'awa': 'https://web.archive.org.au/awa/',
    'nzwa': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/',
    'ukwa': 'https://www.webarchive.org.uk/wayback/en/archive/',
    'ia': 'https://web.archive.org/web/'
}

def convert_lists_to_dicts(results):
    '''
    Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.
    Renames keys to standardise IA with other Timemaps.
    '''
    if results:
        keys = results[0]
        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]
    else:
        results_as_dicts = results
    for d in results_as_dicts:
        d['status'] = d.pop('statuscode')
        d['mime'] = d.pop('mimetype')
        d['url'] = d.pop('original')
    return results_as_dicts

def get_capture_data_from_memento(url, request_type='head'):
    '''
    For OpenWayback systems this can get some extra capture info to insert into Timemaps.
    '''
    if request_type == 'head':
        response = requests.head(url)
    else:
        response = requests.get(url)
    headers = response.headers
    length = headers.get('x-archive-orig-content-length')
    status = headers.get('x-archive-orig-status')
    status = status.split(' ')[0] if status else None
    mime = headers.get('x-archive-orig-content-type')
    mime = mime.split(';')[0] if mime else None
    return {'length': length, 'status': status, 'mime': mime}

def convert_link_to_json(results, enrich_data=False):
    '''
    Converts link formatted Timemap to JSON.
    '''
    data = []
    for line in results.splitlines():
        parts = line.split('; ')
        if len(parts) > 1:
            link_type = re.search(r'rel="(original|self|timegate|first memento|last memento|memento)"', parts[1]).group(1)
            if link_type == 'memento':
                link = parts[0].strip('<>')
                timestamp, original = re.search(r'/(\d{14})/(.*)$', link).groups()
                capture = {'urlkey': surt(original), 'timestamp': timestamp, 'url': original}
                if enrich_data:
                    capture.update(get_capture_data_from_memento(link))
                    print(capture)
                data.append(capture)
    return data
                
def get_timemap_as_json(timegate, url, enrich_data=False):
    '''
    Get a Timemap then normalise results (if necessary) to return a list of dicts.
    '''
    tg_url = f'{TIMEGATES[timegate]}timemap/json/{url}/'
    response = requests.get(tg_url)
    response_type = response.headers['content-type']
    if response_type == 'text/x-ndjson':
        data = [json.loads(line) for line in response.text.splitlines()]
    elif response_type == 'application/json':
        data = convert_lists_to_dicts(response.json())
    elif response_type in ['application/link-format', 'text/html;charset=utf-8']:
        data = convert_link_to_json(response.text, enrich_data=enrich_data)
    return data

Examples

In [3]:
t1 = get_timemap_as_json('ia', 'http://discontents.com.au')
len(t1)
Out[3]:
310
In [4]:
# First -- results in date order
t1[0]
Out[4]:
{'urlkey': 'au,com,discontents)/',
 'timestamp': '19981206012233',
 'digest': 'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36',
 'redirect': '-',
 'robotflags': '-',
 'length': '1610',
 'offset': '43993900',
 'filename': 'green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz',
 'status': '200',
 'mime': 'text/html',
 'url': 'http://www.discontents.com.au:80/'}
In [5]:
# Last -- the most recent
t1[-1]
Out[5]:
{'urlkey': 'au,com,discontents)/',
 'timestamp': '20200510215639',
 'digest': '65H2UO6L3CBCB3SJ2NDWEXO2D6OA44Z6',
 'redirect': '-',
 'robotflags': '-',
 'length': '1690',
 'offset': '73114318',
 'filename': 'SURVEY_00010-20200510211045-crawl421/SURVEY_00010-20200510211045-00002.warc.gz',
 'status': '503',
 'mime': 'text/html',
 'url': 'https://discontents.com.au/'}
In [6]:
t2 = get_timemap_as_json('ukwa', 'http://bl.uk')
len(t2)
Out[6]:
450
In [7]:
t3 = get_timemap_as_json('nzwa', 'http://natlib.govt.nz')
len(t3)
Out[7]:
1360
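
The NZWA Timemap only includes the three core fields. If you want the script to try to fill in extra details (such as mime and status) you can set enrich_data=True, but remember this requests each memento in turn, so it will be slow on large Timemaps. A sketch, not run here:

In [ ]:
# Warning -- this fires off a request for every memento, so expect it to take a while
t3_enriched = get_timemap_as_json('nzwa', 'http://natlib.govt.nz', enrich_data=True)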

Using the CDX API

Works with AWA, IA, & UKWA

The CDX APIs of the Internet Archive and pywb-based systems such as the AWA and UKWA behave slightly differently. These differences are documented in Comparing CDX APIs. The functions below smooth out some of these bumps and should return consistently formatted results from the three repositories.

To get all the captures of a url in JSON format:

query_cdx([api], [url], [other optional parameters])

Required parameters:

  • api – one of 'ukwa' (UK), 'awa' (Australia), or 'ia' (Internet Archive)
  • url – the url you want to look for in the archive

Supplying only these parameters is essentially equivalent to requesting a Timemap (though when I compared results, I found the CDX API included more duplicates). One advantage of the CDX API is that you can filter results by supplying additional parameters. These optional parameters can be anything the CDX APIs support, such as from, to, and filter. However, note that from is a reserved word in Python, so use from_ instead. See below for some examples.

The data is returned in JSON format. The number of fields returned varies, but these will always be present:

  • urlkey – SURT formatted url
  • timestamp – the date and time when the page was captured by the archive, in YYYYMMDDHHmmss format
  • url – the url of the page that was captured
  • status – HTTP status code returned by the capture request
  • mime – the mimetype of the captured resource
  • digest – an algorithmically generated string that uniquely identifies the contents of the captured resource
In [8]:
APIS = {
    'ia': {'url': 'http://web.archive.org/cdx/search/cdx', 'type': 'wb'},
    'awa': {'url': 'https://web.archive.org.au/awa/cdx', 'type': 'pywb'},
    'ukwa': {'url': 'https://www.webarchive.org.uk/wayback/archive/cdx', 'type': 'pywb'}
}

def normalise_filter(api, f):
    '''
    Normalise parameter names and regexp formatting across CDX systems.
    '''
    sys_type = APIS[api]['type']
    if sys_type == 'pywb':
        f = f.replace('mimetype:', 'mime:')
        f = f.replace('statuscode:', 'status:')
        f = f.replace('original:', 'url:')
        f = re.sub(r'^(!{0,1})(\w)', r'\1~\2', f)
    elif sys_type == 'wb':
        f = f.replace('mime:', 'mimetype:')
        f = f.replace('status:', 'statuscode:')
        f = f.replace('url:', 'original:')
    return f

def normalise_filters(api, filters):
    if isinstance(filters, list):
        normalised = []
        for f in filters:
            normalised.append(normalise_filter(api, f))
    else:
        normalised = normalise_filter(api, filters)
    return normalised

def convert_lists_to_dicts(results):
    '''
    Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.
    Renames keys to standardise IA with other Timemaps.
    '''
    if results:
        keys = results[0]
        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]
    else:
        results_as_dicts = results
    for d in results_as_dicts:
        d['status'] = d.pop('statuscode')
        d['mime'] = d.pop('mimetype')
        d['url'] = d.pop('original')
    return results_as_dicts

def query_cdx(api, url, **kwargs):
    '''
    Query a CDX API, normalising filters and results across the different systems.
    '''
    params = kwargs
    if 'filter' in params:
        params['filter'] = normalise_filters(api, params['filter'])
    # CDX accepts a 'from' parameter, but this is a reserved word in Python
    # Use 'from_' to pass the value to the function & here we'll change it back to 'from'.
    if 'from_' in params:
        params['from'] = params['from_']
        del params['from_']
    params['url'] = url
    params['output'] = 'json'
    response = requests.get(APIS[api]['url'], params=params)
    response.raise_for_status()
    response_type = response.headers['content-type'].split(';')[0]
    if response_type == 'text/x-ndjson':
        data = [json.loads(line) for line in response.text.splitlines()]
    elif response_type == 'application/json':
        data = convert_lists_to_dicts(response.json())
    return data

Examples

In [9]:
# No filters -- give us all the captures!
d1 = query_cdx('ia', 'http://discontents.com.au')
len(d1)
Out[9]:
312
In [10]:
# First result
d1[0]
Out[10]:
{'urlkey': 'au,com,discontents)/',
 'timestamp': '19981206012233',
 'digest': 'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36',
 'length': '1610',
 'status': '200',
 'mime': 'text/html',
 'url': 'http://www.discontents.com.au:80/'}
In [11]:
# Last result -- note that the results are in date order, so this is the most recent
d1[-1]
Out[11]:
{'urlkey': 'au,com,discontents)/',
 'timestamp': '20200510215639',
 'digest': '65H2UO6L3CBCB3SJ2NDWEXO2D6OA44Z6',
 'length': '1690',
 'status': '503',
 'mime': 'text/html',
 'url': 'https://discontents.com.au/'}
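
As noted above, the CDX results can include duplicate captures. Since the digest field identifies the captured content, here's a rough sketch for dropping rows that repeat the same timestamp and digest:

In [ ]:
# A rough de-duplication -- keep only the first capture for each timestamp/digest pair
seen = set()
unique_captures = []
for capture in d1:
    key = (capture['timestamp'], capture['digest'])
    if key not in seen:
        seen.add(key)
        unique_captures.append(capture)
len(unique_captures)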
In [12]:
# Filter by status code - note the number of results decreases
d2 = query_cdx('ia', 'http://discontents.com.au', filter='status:200')
len(d2)
Out[12]:
274
In [13]:
# Filter by date range using from_ and to
d3 = query_cdx('ia', 'http://discontents.com.au', from_='2005', to='2006')
len(d3)
Out[13]:
25
In [14]:
# First result should be from 2005
d3[0]
Out[14]:
{'urlkey': 'au,com,discontents)/',
 'timestamp': '20050209204432',
 'digest': 'IWLJRLZLB7WBQNHYTVXJGD7TTARRGAXM',
 'length': '1024',
 'status': '200',
 'mime': 'text/html',
 'url': 'http://www.discontents.com.au:80/'}
In [15]:
# Last result should be from 2006
d3[-1]
Out[15]:
{'urlkey': 'au,com,discontents)/',
 'timestamp': '20061205043957',
 'digest': 'QGCDU54UYAOMFBTZKGOV27NGYAFE27HZ',
 'length': '1122',
 'status': '200',
 'mime': 'text/html',
 'url': 'http://discontents.com.au:80/'}
In [16]:
# Same as d1, except from AWA
d4 = query_cdx('awa', 'http://discontents.com.au')
len(d4)
Out[16]:
142
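
You can also supply more than one filter by passing a list -- normalise_filters() adjusts each one to suit the target system. For example, this unexecuted sketch asks for captures with a 200 status and an HTML mimetype:

In [ ]:
# Combine filters -- only 'ok' captures with an HTML mimetype
d5 = query_cdx('ia', 'http://discontents.com.au', filter=['status:200', 'mime:text/html'])
len(d5)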

Created by Tim Sherratt for the GLAM Workbench.

Work on this notebook was supported by the IIPC Discretionary Funding Programme, 2019-2020.