Find all the archived versions of a web page

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

You can find all the archived versions of a web page by requesting a Timemap from a Memento-compliant repository. If the repository has a CDX API, you can get much the same data by doing an exact url search.

In [1]:
import requests
import json
import re
from surt import surt

Using Timemaps

Works with AWA, IA, NZWA & UKWA

Variations in the way Memento is implemented across repositories are documented in Getting data from web archives using Memento. The functions below smooth out these variations to provide a (mostly) consistent interface to the UK Web Archive, Australian Web Archive, New Zealand Web Archive, and the Internet Archive. They could be easily modified to work with other Memento-compliant repositories.

To get all captures of a url in JSON format:

get_timemap_as_json([timegate], [url], enrich_data=[True or False])

Parameters:

  • timegate – one of 'ukwa' (UK), 'awa' (Australia), 'nzwa' (New Zealand), or 'ia' (Internet Archive)
  • url – the url you want to look for in the archive
  • enrich_data – NZWA Timemaps include less information than the others. If you set this to True, the script will query each memento in turn to try to find more capture information (such as mime and status). This slows things down quite a bit and isn't always successful, so leave it as False unless you have a good reason.

The data is returned in JSON format. The number of fields returned varies, but these will always be present:

  • urlkey – SURT formatted url (in the case of NZWA this is generated by the script rather than the archive)
  • timestamp – the date and time when the page was captured by the archive, in YYYYMMDDHHmmss format
  • url – the url of the page that was captured

The AWA, IA, and UKWA Timemaps also include:

  • status – HTTP status code returned by the capture request
  • mime – the mimetype of the captured resource
  • digest – an algorithmically generated string that uniquely identifies the contents of the captured resource

For more information on the contents of these fields, see Exploring the Internet Archive's CDX API.
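
The timestamp values are plain strings, so you'll need to convert them yourself if you want to treat them as dates. A minimal sketch using Python's standard datetime module:

In [ ]:
from datetime import datetime

# Timestamps are strings in YYYYMMDDHHmmss format
datetime.strptime('19981206012233', '%Y%m%d%H%M%S')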

In [2]:
# These are the repositories we'll be using
TIMEGATES = {
    'awa': 'https://web.archive.org.au/awa/',
    'nzwa': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/',
    'ukwa': 'https://www.webarchive.org.uk/wayback/en/archive/',
    'ia': 'https://web.archive.org/web/'
}

def convert_lists_to_dicts(results):
    '''
    Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.
    Renames keys to standardise IA with other Timemaps.
    '''
    if results:
        keys = results[0]
        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]
    else:
        results_as_dicts = results
    for d in results_as_dicts:
        d['status'] = d.pop('statuscode')
        d['mime'] = d.pop('mimetype')
        d['url'] = d.pop('original')
    return results_as_dicts

def get_capture_data_from_memento(url, request_type='head'):
    '''
    For OpenWayback systems this can get some extra capture info to insert into Timemaps.
    '''
    if request_type == 'head':
        response = requests.head(url)
    else:
        response = requests.get(url)
    headers = response.headers
    length = headers.get('x-archive-orig-content-length')
    status = headers.get('x-archive-orig-status')
    status = status.split(' ')[0] if status else None
    mime = headers.get('x-archive-orig-content-type')
    mime = mime.split(';')[0] if mime else None
    return {'length': length, 'status': status, 'mime': mime}

def convert_link_to_json(results, enrich_data=False):
    '''
    Converts link formatted Timemap to JSON.
    '''
    data = []
    for line in results.splitlines():
        parts = line.split('; ')
        if len(parts) > 1:
            link_type = re.search(r'rel="(original|self|timegate|first memento|last memento|memento)"', parts[1]).group(1)
            if link_type == 'memento':
                link = parts[0].strip('<>')
                timestamp, original = re.search(r'/(\d{14})/(.*)$', link).groups()
                capture = {'urlkey': surt(original), 'timestamp': timestamp, 'url': original}
                if enrich_data:
                    capture.update(get_capture_data_from_memento(link))
                    print(capture)
                data.append(capture)
    return data
                
def get_timemap_as_json(timegate, url, enrich_data=False):
    '''
    Get a Timemap then normalise results (if necessary) to return a list of dicts.
    '''
    tg_url = f'{TIMEGATES[timegate]}timemap/json/{url}/'
    response = requests.get(tg_url)
    response_type = response.headers['content-type']
    if response_type == 'text/x-ndjson':
        data = [json.loads(line) for line in response.text.splitlines()]
    elif response_type == 'application/json':
        data = convert_lists_to_dicts(response.json())
    elif response_type in ['application/link-format', 'text/html;charset=utf-8']:
        data = convert_link_to_json(response.text, enrich_data=enrich_data)
    return data

Examples

In [3]:
t1 = get_timemap_as_json('ia', 'http://discontents.com.au')
len(t1)
Out[3]:
310
In [4]:
# First -- results in date order
t1[0]
Out[4]:
{'urlkey': 'au,com,discontents)/',
 'timestamp': '19981206012233',
 'digest': 'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36',
 'redirect': '-',
 'robotflags': '-',
 'length': '1610',
 'offset': '43993900',
 'filename': 'green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz',
 'status': '200',
 'mime': 'text/html',
 'url': 'http://www.discontents.com.au:80/'}
In [5]:
# Last -- the most recent
t1[-1]
Out[5]:
{'urlkey': 'au,com,discontents)/',
 'timestamp': '20200510215639',
 'digest': '65H2UO6L3CBCB3SJ2NDWEXO2D6OA44Z6',
 'redirect': '-',
 'robotflags': '-',
 'length': '1690',
 'offset': '73114318',
 'filename': 'SURVEY_00010-20200510211045-crawl421/SURVEY_00010-20200510211045-00002.warc.gz',
 'status': '503',
 'mime': 'text/html',
 'url': 'https://discontents.com.au/'}
In [6]:
t2 = get_timemap_as_json('ukwa', 'http://bl.uk')
len(t2)
Out[6]:
450
In [7]:
t3 = get_timemap_as_json('nzwa', 'http://natlib.govt.nz')
len(t3)
Out[7]:
1360
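
The NZWA Timemap only includes the three core fields. If you want the script to try to fill in extra details (such as mime and status) you can set enrich_data=True, but remember this requests each memento in turn, so it will be slow on large Timemaps. A sketch, not run here:

In [ ]:
# Warning -- this fires off a request for every memento, so expect it to take a while
t3_enriched = get_timemap_as_json('nzwa', 'http://natlib.govt.nz', enrich_data=True)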

Using the CDX API

Works with AWA, IA, & UKWA

The CDX APIs of the Internet Archive and pywb-based systems such as the AWA and UKWA behave slightly differently. These differences are documented in Comparing CDX APIs. The functions below smooth out some of these bumps and should return consistently formatted results from the three repositories.

To get all the captures of a url in JSON format:

query_cdx([api], [url], [other optional parameters])

Required parameters:

  • api – one of 'ukwa' (UK), 'awa' (Australia), or 'ia' (Internet Archive)
  • url – the url you want to look for in the archive

Supplying only these parameters is essentially equivalent to requesting a Timemap (though when I compared results, I found the CDX API included more duplicates). One advantage of the CDX API is that you can filter results by supplying additional parameters. These optional parameters can be anything the CDX APIs support, such as from, to, and filter. However, note that from is a reserved word in Python, so use from_ instead. See below for some examples.

The data is returned in JSON format. The number of fields returned varies, but these will always be present:

  • urlkey – SURT formatted url
  • timestamp – the date and time when the page was captured by the archive, in YYYYMMDDHHmmss format
  • url – the url of the page that was captured
  • status – HTTP status code returned by the capture request
  • mime – the mimetype of the captured resource
  • digest – an algorithmically generated string that uniquely identifies the contents of the captured resource
In [8]:
APIS = {
    'ia': {'url': 'http://web.archive.org/cdx/search/cdx', 'type': 'wb'},
    'awa': {'url': 'https://web.archive.org.au/awa/cdx', 'type': 'pywb'},
    'ukwa': {'url': 'https://www.webarchive.org.uk/wayback/archive/cdx', 'type': 'pywb'}
}

def normalise_filter(api, f):
    '''
    Normalise parameter names and regexp formatting across CDX systems.
    '''
    sys_type = APIS[api]['type']
    if sys_type == 'pywb':
        f = f.replace('mimetype:', 'mime:')
        f = f.replace('statuscode:', 'status:')
        f = f.replace('original:', 'url:')
        f = re.sub(r'^(!{0,1})(\w)', r'\1~\2', f)
    elif sys_type == 'wb':
        f = f.replace('mime:', 'mimetype:')
        f = f.replace('status:', 'statuscode:')
        f = f.replace('url:', 'original:')
    return f

def normalise_filters(api, filters):
    if isinstance(filters, list):
        normalised = []
        for f in filters:
            normalised.append(normalise_filter(api, f))
    else:
        normalised = normalise_filter(api, filters)
    return normalised

def convert_lists_to_dicts(results):
    '''
    Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.
    Renames keys to standardise IA with other Timemaps.
    '''
    if results:
        keys = results[0]
        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]
    else:
        results_as_dicts = results
    for d in results_as_dicts:
        d['status'] = d.pop('statuscode')
        d['mime'] = d.pop('mimetype')
        d['url'] = d.pop('original')
    return results_as_dicts

def query_cdx(api, url, **kwargs):
    '''
    Query a CDX API, normalising filters and results across the different systems.
    '''
    params = kwargs
    if 'filter' in params:
        params['filter'] = normalise_filters(api, params['filter'])
    # CDX accepts a 'from' parameter, but this is a reserved word in Python
    # Use 'from_' to pass the value to the function & here we'll change it back to 'from'.
    if 'from_' in params:
        params['from'] = params['from_']
        del params['from_']
    params['url'] = url
    params['output'] = 'json'
    response = requests.get(APIS[api]['url'], params=params)
    response.raise_for_status()
    response_type = response.headers['content-type'].split(';')[0]
    if response_type == 'text/x-ndjson':
        data = [json.loads(line) for line in response.text.splitlines()]
    elif response_type == 'application/json':
        data = convert_lists_to_dicts(response.json())
    return data

Examples

In [9]:
# No filters -- give us all the captures!
d1 = query_cdx('ia', 'http://discontents.com.au')
len(d1)
Out[9]:
312
In [10]:
# First result
d1[0]
Out[10]:
{'urlkey': 'au,com,discontents)/',
 'timestamp': '19981206012233',
 'digest': 'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36',
 'length': '1610',
 'status': '200',
 'mime': 'text/html',
 'url': 'http://www.discontents.com.au:80/'}
In [11]:
# Last result -- note that the results are in date order, so this is the most recent
d1[-1]
Out[11]:
{'urlkey': 'au,com,discontents)/',
 'timestamp': '20200510215639',
 'digest': '65H2UO6L3CBCB3SJ2NDWEXO2D6OA44Z6',
 'length': '1690',
 'status': '503',
 'mime': 'text/html',
 'url': 'https://discontents.com.au/'}
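
As noted above, the CDX results can include duplicate captures. Since the digest field identifies the captured content, here's a rough sketch for dropping rows that repeat the same timestamp and digest:

In [ ]:
# A rough de-duplication -- keep only the first capture for each timestamp/digest pair
seen = set()
unique_captures = []
for capture in d1:
    key = (capture['timestamp'], capture['digest'])
    if key not in seen:
        seen.add(key)
        unique_captures.append(capture)
len(unique_captures)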
In [12]:
# Filter by status code - note the number of results decreases
d2 = query_cdx('ia', 'http://discontents.com.au', filter='status:200')
len(d2)
Out[12]:
274
In [13]:
# Filter by date range using from_ and to
d3 = query_cdx('ia', 'http://discontents.com.au', from_='2005', to='2006')
len(d3)
Out[13]:
25
In [14]:
# First result should be from 2005
d3[0]
Out[14]:
{'urlkey': 'au,com,discontents)/',
 'timestamp': '20050209204432',
 'digest': 'IWLJRLZLB7WBQNHYTVXJGD7TTARRGAXM',
 'length': '1024',
 'status': '200',
 'mime': 'text/html',
 'url': 'http://www.discontents.com.au:80/'}
In [15]:
# Last result should be from 2006
d3[-1]
Out[15]:
{'urlkey': 'au,com,discontents)/',
 'timestamp': '20061205043957',
 'digest': 'QGCDU54UYAOMFBTZKGOV27NGYAFE27HZ',
 'length': '1122',
 'status': '200',
 'mime': 'text/html',
 'url': 'http://discontents.com.au:80/'}
In [16]:
# Same as d1, except from AWA
d4 = query_cdx('awa', 'http://discontents.com.au')
len(d4)
Out[16]:
142
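
You can also supply more than one filter by passing a list -- normalise_filters() adjusts each one to suit the target system. For example, this unexecuted sketch asks for captures with a 200 status and an HTML mimetype:

In [ ]:
# Combine filters -- only 'ok' captures with an HTML mimetype
d5 = query_cdx('ia', 'http://discontents.com.au', filter=['status:200', 'mime:text/html'])
len(d5)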

Created by Tim Sherratt for the GLAM Workbench.

Work on this notebook was supported by the IIPC Discretionary Funding Programme, 2019-2020.