Comparing CDX APIs

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

This notebook documents differences between the Internet Archive's CDX API and the CDX API available from PyWb systems such as the UK Web Archive and the National Library of Australia.

For more details on the data available from the CDX APIs see Exploring the Internet Archive's CDX API.

For examples using CDX APIs to harvest capture data see:

In [5]:
import requests
import json
import re
import pandas as pd
In [6]:
APIS = {
    'ia': {'url': 'http://web.archive.org/cdx/search/cdx', 'type': 'wb'},
    'nla': {'url': 'https://web.archive.org.au/awa/cdx', 'type': 'pywb'},
    'bl': {'url': 'https://www.webarchive.org.uk/wayback/archive/cdx', 'type': 'pywb'}
}

def raw_cdx_query(api, url, **kwargs):
    params = kwargs
    params['url'] = url
    params['output'] = 'json'
    response = requests.get(APIS[api]['url'], params=params)
    return response

Differences between PyWb and IA Wayback

JSON results format

As with Timemaps, requesting JSON-formatted results from IA and PyWb CDX servers returns different data structures. IA results are an array of arrays, with the field labels in the first array. PyWb results are formatted as NDJSON (Newline Delimited JSON) – each capture is a JSON object on its own line.

Internet Archive (Wayback)

In [7]:
raw_cdx_query('ia', 'discontents.com.au', limit=1, format='json').json()
Out[7]:
[['urlkey',
  'timestamp',
  'original',
  'mimetype',
  'statuscode',
  'digest',
  'length'],
 ['au,com,discontents)/',
  '19981206012233',
  'http://www.discontents.com.au:80/',
  'text/html',
  '200',
  'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36',
  '1610']]

NLA (PyWb)

In [8]:
json.loads(raw_cdx_query('nla', 'discontents.com.au', limit=1, format='json').text)
Out[8]:
{'urlkey': 'au,com,discontents)/',
 'timestamp': '19981206012233',
 'url': 'http://www.discontents.com.au/',
 'mime': 'text/html',
 'status': '200',
 'digest': 'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36',
 'offset': '59442416',
 'filename': 'NLA-EXTRACTION-1996-2004-ARCS-PART-00309-000001.arc.gz',
 'length': '1610',
 'source': 'awa',
 'source-coll': 'awa'}
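
As a minimal sketch, the two response styles shown above could be parsed into a common list-of-dictionaries shape. The `parse_ia` and `parse_pywb` helper names and the abbreviated sample responses are illustrations, not part of either API:

```python
import json

def parse_ia(json_text):
    # IA returns a JSON array of arrays; the first array holds the field labels
    rows = json.loads(json_text)
    if not rows:
        return []
    keys = rows[0]
    return [dict(zip(keys, row)) for row in rows[1:]]

def parse_pywb(ndjson_text):
    # PyWb returns NDJSON -- one JSON object per line
    return [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]

# Abbreviated sample responses illustrating the two formats
ia_sample = '[["urlkey", "timestamp"], ["au,com,discontents)/", "19981206012233"]]'
pywb_sample = '{"urlkey": "au,com,discontents)/", "timestamp": "19981206012233"}'

print(parse_ia(ia_sample))
print(parse_pywb(pywb_sample))
```

Both calls produce the same structure, which makes downstream processing (with Pandas, for example) independent of which system the captures came from.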

Field labels

As with Timemaps, some of the field labels are different between the two systems:

IA          PyWb
original    url
statuscode  status
mimetype    mime
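
Since only three labels differ, relabelling can be reduced to a simple dictionary lookup. This is just a sketch – `IA_TO_PYWB` and `relabel` are hypothetical names:

```python
# Mapping from IA field labels to their PyWb equivalents
IA_TO_PYWB = {'original': 'url', 'statuscode': 'status', 'mimetype': 'mime'}

def relabel(capture):
    # Rename IA-style keys to PyWb-style ones, leaving shared keys untouched
    return {IA_TO_PYWB.get(k, k): v for k, v in capture.items()}

print(relabel({'urlkey': 'au,com,discontents)/', 'statuscode': '200', 'mimetype': 'text/html'}))
```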

Internet Archive (Wayback)

In [9]:
raw_cdx_query('ia', 'discontents.com.au', limit=1, format='json').json()[0]
Out[9]:
['urlkey',
 'timestamp',
 'original',
 'mimetype',
 'statuscode',
 'digest',
 'length']

NLA (PyWb)

In [10]:
list(json.loads(raw_cdx_query('nla', 'discontents.com.au', limit=1, format='json').text).keys())
Out[10]:
['urlkey',
 'timestamp',
 'url',
 'mime',
 'status',
 'digest',
 'offset',
 'filename',
 'length',
 'source',
 'source-coll']

Match types

From the documentation it seems that you should be able to supply a matchType parameter or use url wildcards on both systems. But there seem to be some inconsistencies. In summary:

  • UKWA needs both the url wildcard and the matchType parameter to work correctly
  • domain queries do not work with NLA

NLA (PyWb)

Prefix queries work as expected; domain queries do not.

In [11]:
# Look for an exact url
len(raw_cdx_query('nla', 'http://nla.gov.au/', filter='status:200', format='json').text.splitlines())
Out[11]:
490
In [12]:
# Prefix query using url wildcard works as expected
len(raw_cdx_query('nla', 'http://nla.gov.au/*', filter='status:200', format='json').text.splitlines())
Out[12]:
20075
In [13]:
# Prefix query using matchType=prefix works as expected
len(raw_cdx_query('nla', 'http://nla.gov.au/', filter='status:200', format='json', matchType='prefix').text.splitlines())
Out[13]:
20075
In [14]:
# Domain query using matchType parameter causes error
raw_cdx_query('nla', 'nla.gov.au', filter='status:200', format='json', matchType='domain').text.splitlines()
Out[14]:
['{"message": "Internal Error: tuple index out of range"}']
In [15]:
# Domain query using url wildcard causes error
raw_cdx_query('nla', '*.nla.gov.au', filter='status:200', format='json').text.splitlines()
Out[15]:
['{"message": "Internal Error: tuple index out of range"}']

UKWA (PyWb)

Domain and prefix queries need both the matchType parameter and a url wildcard.

In [16]:
# Look for an exact url
len(raw_cdx_query('bl', 'anjackson.net', filter='status:200', format='json').text.splitlines())
Out[16]:
37
In [17]:
# Domain query using url wildcard has no effect
len(raw_cdx_query('bl', '*.anjackson.net', filter='status:200', format='json').text.splitlines())
Out[17]:
33416
In [18]:
# Domain query using matchType parameter has no effect
len(raw_cdx_query('bl', 'anjackson.net', filter='status:200', format='json', matchType='domain').text.splitlines())
Out[18]:
33416
In [19]:
# Domain query using *both* matchType parameter and url wildcard performs domain search
len(raw_cdx_query('bl', '*.anjackson.net', filter='status:200', format='json', matchType='domain').text.splitlines())
Out[19]:
0

Collapse

PyWb doesn't support the collapse parameter, so if you want to remove duplicates you'll need to use something like Pandas' .drop_duplicates() after the results have arrived. Note, too, that collapse only works on adjacent index entries, so if unique values are important you'll probably want to run .drop_duplicates() on IA results anyway.
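
The difference between collapse and true de-duplication can be illustrated without any network requests – collapse merges only *adjacent* duplicates, while de-duplication removes them all. The sample captures below are made up:

```python
from itertools import groupby

captures = [
    {'urlkey': 'a)/'}, {'urlkey': 'a)/'}, {'urlkey': 'b)/'}, {'urlkey': 'a)/'},
]

# collapse=urlkey behaviour: only adjacent duplicates are merged,
# so the final 'a)/' survives because 'b)/' sits between it and the others
collapsed = [next(g) for _, g in groupby(captures, key=lambda c: c['urlkey'])]

# True de-duplication keeps the first capture of every urlkey, adjacent or not
seen, unique = set(), []
for c in captures:
    if c['urlkey'] not in seen:
        seen.add(c['urlkey'])
        unique.append(c)

print(len(collapsed))  # 3
print(len(unique))     # 2
```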

Internet Archive (Wayback)

In [20]:
# Without collapse -- total number of results (subtract one for the header row)
len(raw_cdx_query('ia', 'discontents.com.au', format='json').json()) - 1
Out[20]:
334
In [21]:
# With collapse -- should only be one result as we're collapsing on urlkey and searching for an exact url
len(raw_cdx_query('ia', 'discontents.com.au', format='json', collapse='urlkey').json()) - 1
Out[21]:
1

NLA (PyWb)

In [22]:
# Without collapse
len(raw_cdx_query('nla', 'discontents.com.au', format='json').text.splitlines())
Out[22]:
148
In [23]:
# With collapse
len(raw_cdx_query('nla', 'discontents.com.au', collapse='urlkey', format='json').text.splitlines())
Out[23]:
148

De-duplicate results using Pandas.

In [24]:
data = [json.loads(line) for line in raw_cdx_query('nla', 'discontents.com.au', fields='urlkey', format='json').text.splitlines()]
df = pd.DataFrame(data).drop_duplicates(subset=['urlkey'])
df.shape[0]
Out[24]:
1

Sort and Closest

IA doesn't support sort or the closest parameter. To implement something similar, I suppose you could use from and to to set a window around a date, and then process the results to calculate time deltas and sort by 'closeness'.
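
A rough sketch of that approach, sorting captures by their absolute time delta from a target 14-digit CDX timestamp. The `closest_captures` helper and the sample captures are made up for illustration:

```python
from datetime import datetime

def closest_captures(captures, target, n=5):
    # Sort captures by the absolute time delta between each capture's
    # timestamp and a target timestamp (both in 14-digit CDX format),
    # then keep the n 'closest' captures
    t = datetime.strptime(target, '%Y%m%d%H%M%S')
    return sorted(
        captures,
        key=lambda c: abs(datetime.strptime(c['timestamp'], '%Y%m%d%H%M%S') - t)
    )[:n]

captures = [
    {'timestamp': '19981206012233'},
    {'timestamp': '20050101000000'},
    {'timestamp': '20030615120000'},
]
print(closest_captures(captures, '20040101000000', n=1))  # the 2003 capture
```

In practice you'd build the `captures` list from a from/to windowed CDX query first, then apply this sort locally.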


Limiting fields

The parameter used for limiting the fields returned from a query is different. The IA server expects fl, while PyWb uses fields (the PyWb documentation says fl, but it doesn't work).
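
One way to paper over the difference is a small helper that picks the parameter name based on the system type. `fields_param` is a hypothetical name, and it assumes both systems accept a comma-separated list of field names:

```python
def fields_param(api_type, field_list):
    # IA-style ('wb') servers expect 'fl'; PyWb servers expect 'fields'
    key = 'fl' if api_type == 'wb' else 'fields'
    return {key: ','.join(field_list)}

print(fields_param('wb', ['urlkey', 'timestamp']))    # {'fl': 'urlkey,timestamp'}
print(fields_param('pywb', ['urlkey', 'timestamp']))  # {'fields': 'urlkey,timestamp'}
```

In practice you'd pass `APIS[api]['type']` from the dictionary defined at the top of the notebook and merge the returned dictionary into your request parameters.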

NLA (PyWb)

In [25]:
json.loads(raw_cdx_query('nla', 'discontents.com.au', limit=1, fl='urlkey', format='json').text)
Out[25]:
{'urlkey': 'au,com,discontents)/',
 'timestamp': '19981206012233',
 'url': 'http://www.discontents.com.au/',
 'mime': 'text/html',
 'status': '200',
 'digest': 'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36',
 'offset': '59442416',
 'filename': 'NLA-EXTRACTION-1996-2004-ARCS-PART-00309-000001.arc.gz',
 'length': '1610',
 'source': 'awa',
 'source-coll': 'awa'}
In [26]:
json.loads(raw_cdx_query('nla', 'discontents.com.au', limit=1, fields='urlkey', format='json').text)
Out[26]:
{'urlkey': 'au,com,discontents)/'}

Comparison operators in filters

This seems to create the most potential for confusion. In PyWb, the filter parameter uses a number of different operators to indicate the type of match required. IA uses only !. There's no way of indicating that a query should be treated as a regular expression in IA, because all queries are treated as regular expressions.

Operator     Example                   Result
no operator  filter=mime:html          mime field contains 'html'
=            filter==mime:text/html    mime field matches 'text/html' exactly
~            filter=~status:30\d{1}    status field matches any 3-digit code starting with 30
!            filter=!mime:html         mime field doesn't contain 'html'
!=           filter=!=mime:text/html   mime field doesn't match 'text/html' exactly
!~           filter=!~status:30\d{1}   status field doesn't match any 3-digit code starting with 30

IA filter queries look for an exact match (which could be a regular expression) by default. This can be negated by using the ! operator.

Operator     Example                     Result
no operator  filter=mimetype:text/html   mimetype field matches 'text/html'
!            filter=!mimetype:text/html  mimetype field doesn't match 'text/html' exactly

In IA you need to use a regular expression to find a field containing a particular value. So these two expressions should result in the same matching behaviour:

PyWb                    IA
filter=mime:powerpoint  filter=mimetype:.*powerpoint.*

For interoperability, it seems easiest to always use regular expressions, inserting the ~ operator for PyWb systems. So:

PyWb                         IA
filter=~mime:.*powerpoint.*  filter=mimetype:.*powerpoint.*

Internet Archive (Wayback)

In [27]:
len(raw_cdx_query('ia', 'defence.gov.au/*', filter='mimetype:.*powerpoint.*', format='json', collapse='urlkey').json()) - 1
Out[27]:
229

NLA (PyWb)

In [28]:
len(raw_cdx_query('nla', 'defence.gov.au/*', filter='mime:powerpoint', format='json').text.splitlines())
Out[28]:
177
In [29]:
len(raw_cdx_query('nla', 'defence.gov.au/*', filter='~mime:.*powerpoint.*', format='json').text.splitlines())
Out[29]:
177

Pagination

Both IA and PyWb can support pagination of results; however, it's not available by default in PyWb – it's only available where repositories are using ZipNum indexes. Neither the UKWA nor the National Library of Australia CDX APIs support pagination. This means that queries to these systems will return all matching results in one hit (unless there's a system-defined limit). This is something to bear in mind, as large requests might be slow and prone to breakage.
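
Where pagination is available, an IA-style paged harvest boils down to requesting page=0, 1, 2, and so on until an empty page comes back. Here's a sketch with the network request stubbed out so the looping logic is clear – `fetch_all_pages` and the stub `pages` dictionary are illustrations only; in practice `fetch_page` would wrap `requests.get` with a `page` parameter:

```python
def fetch_all_pages(fetch_page):
    # Request successive page numbers until a page comes back empty,
    # accumulating the rows from each page along the way
    results, page = [], 0
    while True:
        rows = fetch_page(page)
        if not rows:
            break
        results.extend(rows)
        page += 1
    return results

# Stub fetcher standing in for a real paginated CDX request
pages = {0: ['row1', 'row2'], 1: ['row3']}
print(fetch_all_pages(lambda p: pages.get(p, [])))  # ['row1', 'row2', 'row3']
```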

Internet Archive (Wayback)

In [30]:
int(raw_cdx_query('ia', 'discontents.com.au', showNumPages='true', format='json').text)
Out[30]:
1

NLA (PyWb)

In [31]:
# NLA CDX server just ignores the showNumPages parameter and performs the query as normal
json.loads(raw_cdx_query('nla', 'discontents.com.au', showNumPages='true', format='json').text.splitlines()[0])
Out[31]:
{'urlkey': 'au,com,discontents)/',
 'timestamp': '19981206012233',
 'url': 'http://www.discontents.com.au/',
 'mime': 'text/html',
 'status': '200',
 'digest': 'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36',
 'offset': '59442416',
 'filename': 'NLA-EXTRACTION-1996-2004-ARCS-PART-00309-000001.arc.gz',
 'length': '1610',
 'source': 'awa',
 'source-coll': 'awa'}

Fuzzy matching

If your query to a PyWb CDX API returns no matches, the system will use regular expressions to broaden your search and return a set of 'fuzzy' matches. These results will include an is_fuzzy field set to a value of 1. This is not supported in IA.

While fuzzy matching is useful for discovery, it might not be what you want if you're assembling a specific dataset. In this case you'd need to filter the results to remove the is_fuzzy matches.
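
Filtering out fuzzy matches is straightforward once the results are a list of dictionaries – just drop any capture whose is_fuzzy field is set. `drop_fuzzy` is a hypothetical helper and the sample captures are made up:

```python
def drop_fuzzy(captures):
    # Keep only captures that PyWb did not flag as fuzzy matches
    return [c for c in captures if c.get('is_fuzzy') != '1']

captures = [
    {'urlkey': 'a)/', 'status': '200'},
    {'urlkey': 'b)/', 'status': '200', 'is_fuzzy': '1'},
]
print(drop_fuzzy(captures))  # only the first capture survives
```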

Internet Archive (Wayback)

In [32]:
# This should return no results
raw_cdx_query('ia', 'discontents.com.au', limit=1, filter='statuscode:666', format='json').json()
Out[32]:
[]

NLA (PyWb)

In [33]:
# This would return no results except for fuzzy matching
# Note the status value in the result and the 'is_fuzzy' field
json.loads(raw_cdx_query('nla', 'discontents.com.au', limit=1, filter='status:666', format='json').text)
Out[33]:
{'urlkey': 'au,com,discontents)/',
 'timestamp': '19981206012233',
 'url': 'http://www.discontents.com.au/',
 'mime': 'text/html',
 'status': '200',
 'digest': 'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36',
 'offset': '59442416',
 'filename': 'NLA-EXTRACTION-1996-2004-ARCS-PART-00309-000001.arc.gz',
 'length': '1610',
 'source': 'awa',
 'source-coll': 'awa',
 'is_fuzzy': '1'}

Normalising queries

It would be possible to wrap some code around queries that simulated collapse and closest across the two systems, but for the moment I'll just focus on some basic normalisation of query parameters and results. The functions below:

  • Normalise field names in queries and results
  • Convert results into a list of dictionaries
In [34]:
def normalise_filter(api, f):
    sys_type = APIS[api]['type']
    if sys_type == 'pywb':
        f = f.replace('mimetype:', 'mime:')
        f = f.replace('statuscode:', 'status:')
        f = f.replace('original:', 'url:')
        f = re.sub(r'^(!{0,1})(\w)', r'\1~\2', f)
    elif sys_type == 'wb':
        f = f.replace('mime:', 'mimetype:')
        f = f.replace('status:', 'statuscode:')
        f = f.replace('url:', 'original:')
    return f

def normalise_filters(api, filters):
    if isinstance(filters, list):
        normalised = []
        for f in filters:
            normalised.append(normalise_filter(api, f))
    else:
        normalised = normalise_filter(api, filters)
    return normalised

def convert_lists_to_dicts(results):
    '''
    Converts IA-style CDX results (a JSON array of arrays) to a list of dictionaries.
    Renames keys to standardise field labels across systems.
    '''
    if results:
        keys = results[0]
        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]
    else:
        results_as_dicts = results
    for d in results_as_dicts:
        # Guard against missing keys in case the query limited the returned fields
        if 'statuscode' in d:
            d['status'] = d.pop('statuscode')
        if 'mimetype' in d:
            d['mime'] = d.pop('mimetype')
        if 'original' in d:
            d['url'] = d.pop('original')
    return results_as_dicts

def query_cdx(api, url, **kwargs):
    params = kwargs
    if 'filter' in params:
        params['filter'] = normalise_filters(api, params['filter'])
    params['url'] = url
    params['output'] = 'json'
    response = requests.get(APIS[api]['url'], params=params)
    print(response.url)
    response.raise_for_status()
    response_type = response.headers['content-type'].split(';')[0]
    print(response_type)
    if response_type == 'text/x-ndjson':
        data = [json.loads(line) for line in response.text.splitlines()]
    elif response_type == 'application/json':
        data = convert_lists_to_dicts(response.json())
    else:
        raise ValueError(f'Unexpected content type: {response_type}')
    return data

Here are some examples – note that the parameters and their values are unchanged; you can just switch repositories.

Internet Archive (Wayback)

In [35]:
query_cdx('ia', 'defence.gov.au/*', filter=['mimetype:.*pdf', 'status:200'], limit=1)
http://web.archive.org/cdx/search/cdx?filter=mimetype%3A.%2Apdf&filter=statuscode%3A200&limit=1&url=defence.gov.au%2F%2A&output=json
application/json
Out[35]:
[{'urlkey': 'au,gov,defence)/28sqn/ad097.pdf',
  'timestamp': '20140304175138',
  'digest': 'AQBSAVSJJYOYKKLW7GM36PDCYDREFQXA',
  'length': '141731',
  'status': '200',
  'mime': 'application/pdf',
  'url': 'http://www.defence.gov.au/28sqn/AD097.pdf'}]

NLA (PyWb)

In [36]:
query_cdx('nla', 'defence.gov.au', filter=['mimetype:.*pdf', 'status:200'], matchType='prefix', limit=1)
https://web.archive.org.au/awa/cdx?filter=~mime%3A.%2Apdf&filter=~status%3A200&matchType=prefix&limit=1&url=defence.gov.au&output=json
text/x-ndjson
Out[36]:
[{'urlkey': 'au,gov,defence)/',
  'timestamp': '19981202111842',
  'url': 'http://www.defence.gov.au/',
  'mime': 'text/html',
  'status': '200',
  'digest': 'ERQQ3XVKGL4VFGI4KXIPE24QI7YMW4Z6',
  'offset': '8871025',
  'filename': 'NLA-EXTRACTION-1996-2004-ARCS-PART-00307-000001.arc.gz',
  'length': '4038',
  'source': 'awa',
  'source-coll': 'awa',
  'is_fuzzy': '1'}]

Created by Tim Sherratt for the GLAM Workbench. Support me by becoming a GitHub sponsor!

Work on this notebook was supported by the IIPC Discretionary Funding Programme 2019-2020