Comparing CDX APIs

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

This notebook documents differences between the Internet Archive's CDX API and the CDX API available from PyWb systems such as the UK Web Archive and the National Library of Australia.

For more details on the data available from the CDX APIs see Exploring the Internet Archive's CDX API.

For examples using CDX APIs to harvest capture data see:

In [5]:
import requests
import json
import re
import pandas as pd
In [6]:
APIS = {
    'ia': {'url': 'http://web.archive.org/cdx/search/cdx', 'type': 'wb'},
    'nla': {'url': 'https://web.archive.org.au/awa/cdx', 'type': 'pywb'},
    'bl': {'url': 'https://www.webarchive.org.uk/wayback/archive/cdx', 'type': 'pywb'}
}

def raw_cdx_query(api, url, **kwargs):
    params = kwargs
    params['url'] = url
    params['output'] = 'json'
    response = requests.get(APIS[api]['url'], params=params)
    return response

Differences between PyWb and IA Wayback

JSON results format

As with Timemaps, requesting JSON-formatted results from IA and PyWb CDX servers returns different data structures. IA results are an array of arrays, with the field labels in the first array. PyWb results are formatted as NDJSON (Newline Delimited JSON) – each capture is a JSON object on its own line.

Internet Archive (Wayback)

In [7]:
raw_cdx_query('ia', 'discontents.com.au', limit=1, format='json').json()
Out[7]:
[['urlkey',
  'timestamp',
  'original',
  'mimetype',
  'statuscode',
  'digest',
  'length'],
 ['au,com,discontents)/',
  '19981206012233',
  'http://www.discontents.com.au:80/',
  'text/html',
  '200',
  'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36',
  '1610']]

NLA (PyWb)

In [8]:
json.loads(raw_cdx_query('nla', 'discontents.com.au', limit=1, format='json').text)
Out[8]:
{'urlkey': 'au,com,discontents)/',
 'timestamp': '19981206012233',
 'url': 'http://www.discontents.com.au/',
 'mime': 'text/html',
 'status': '200',
 'digest': 'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36',
 'offset': '59442416',
 'filename': 'NLA-EXTRACTION-1996-2004-ARCS-PART-00309-000001.arc.gz',
 'length': '1610',
 'source': 'awa',
 'source-coll': 'awa'}
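
As a minimal sketch, the two response styles shown above could be parsed into a common list-of-dictionaries shape. The `parse_ia` and `parse_pywb` helper names and the abbreviated sample responses are illustrations, not part of either API:

```python
import json

def parse_ia(json_text):
    # IA returns a JSON array of arrays; the first array holds the field labels
    rows = json.loads(json_text)
    if not rows:
        return []
    keys = rows[0]
    return [dict(zip(keys, row)) for row in rows[1:]]

def parse_pywb(ndjson_text):
    # PyWb returns NDJSON -- one JSON object per line
    return [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]

# Abbreviated sample responses illustrating the two formats
ia_sample = '[["urlkey", "timestamp"], ["au,com,discontents)/", "19981206012233"]]'
pywb_sample = '{"urlkey": "au,com,discontents)/", "timestamp": "19981206012233"}'

print(parse_ia(ia_sample))
print(parse_pywb(pywb_sample))
```

Both calls produce the same structure, which makes downstream processing (with Pandas, for example) independent of which system the captures came from.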

Field labels

As with Timemaps, some of the field labels are different between the two systems:

IA          PyWb
original    url
statuscode  status
mimetype    mime
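
Since only three labels differ, relabelling can be reduced to a simple dictionary lookup. This is just a sketch – `IA_TO_PYWB` and `relabel` are hypothetical names:

```python
# Mapping from IA field labels to their PyWb equivalents
IA_TO_PYWB = {'original': 'url', 'statuscode': 'status', 'mimetype': 'mime'}

def relabel(capture):
    # Rename IA-style keys to PyWb-style ones, leaving shared keys untouched
    return {IA_TO_PYWB.get(k, k): v for k, v in capture.items()}

print(relabel({'urlkey': 'au,com,discontents)/', 'statuscode': '200', 'mimetype': 'text/html'}))
```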

Internet Archive (Wayback)

In [9]:
raw_cdx_query('ia', 'discontents.com.au', limit=1, format='json').json()[0]
Out[9]:
['urlkey',
 'timestamp',
 'original',
 'mimetype',
 'statuscode',
 'digest',
 'length']

NLA (PyWb)

In [10]:
list(json.loads(raw_cdx_query('nla', 'discontents.com.au', limit=1, format='json').text).keys())
Out[10]:
['urlkey',
 'timestamp',
 'url',
 'mime',
 'status',
 'digest',
 'offset',
 'filename',
 'length',
 'source',
 'source-coll']

Match types

From the documentation it seems that you should be able to supply a matchType parameter or use url wildcards on both systems. But there seem to be some inconsistencies. In summary:

  • UKWA needs both the url wildcard and the matchType parameter to work correctly
  • domain queries do not work with NLA

NLA (PyWb)

Prefix queries work as expected; domain queries do not.

In [11]:
# Look for an exact url
len(raw_cdx_query('nla', 'http://nla.gov.au/', filter='status:200', format='json').text.splitlines())
Out[11]:
490
In [12]:
# Prefix query using url wildcard works as expected
len(raw_cdx_query('nla', 'http://nla.gov.au/*', filter='status:200', format='json').text.splitlines())
Out[12]:
20075
In [13]:
# Prefix query using matchType=prefix works as expected
len(raw_cdx_query('nla', 'http://nla.gov.au/', filter='status:200', format='json', matchType='prefix').text.splitlines())
Out[13]:
20075
In [14]:
# Domain query using matchType parameter causes error
raw_cdx_query('nla', 'nla.gov.au', filter='status:200', format='json', matchType='domain').text.splitlines()
Out[14]:
['{"message": "Internal Error: tuple index out of range"}']
In [15]:
# Domain query using url wildcard causes error
raw_cdx_query('nla', '*.nla.gov.au', filter='status:200', format='json').text.splitlines()
Out[15]:
['{"message": "Internal Error: tuple index out of range"}']

UKWA (PyWb)

Domain and prefix queries need both the matchType parameter and a url wildcard.

In [16]:
# Look for an exact url
len(raw_cdx_query('bl', 'anjackson.net', filter='status:200', format='json').text.splitlines())
Out[16]:
37
In [17]:
# Domain query using url wildcard has no effect
len(raw_cdx_query('bl', '*.anjackson.net', filter='status:200', format='json').text.splitlines())
Out[17]:
33416
In [18]:
# Domain query using matchType parameter has no effect
len(raw_cdx_query('bl', 'anjackson.net', filter='status:200', format='json', matchType='domain').text.splitlines())
Out[18]:
33416
In [19]:
# Domain query using *both* matchType parameter and url wildcard performs domain search
len(raw_cdx_query('bl', '*.anjackson.net', filter='status:200', format='json', matchType='domain').text.splitlines())
Out[19]:
0

Collapse

PyWb doesn't support the collapse parameter, so if you want to remove duplicates you'll need to use something like Pandas' .drop_duplicates() after the results have arrived. Note, too, that collapse only works on adjacent index entries, so if unique values are important you'll probably want to run .drop_duplicates() on IA results anyway.
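
The difference between collapse and true de-duplication can be illustrated without any network requests – collapse merges only *adjacent* duplicates, while de-duplication removes them all. The sample captures below are made up:

```python
from itertools import groupby

captures = [
    {'urlkey': 'a)/'}, {'urlkey': 'a)/'}, {'urlkey': 'b)/'}, {'urlkey': 'a)/'},
]

# collapse=urlkey behaviour: only adjacent duplicates are merged,
# so the final 'a)/' survives because 'b)/' sits between it and the others
collapsed = [next(g) for _, g in groupby(captures, key=lambda c: c['urlkey'])]

# True de-duplication keeps the first capture of every urlkey, adjacent or not
seen, unique = set(), []
for c in captures:
    if c['urlkey'] not in seen:
        seen.add(c['urlkey'])
        unique.append(c)

print(len(collapsed))  # 3
print(len(unique))     # 2
```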

Internet Archive (Wayback)

In [20]:
# Without collapse -- total number of results (subtract one for the header row)
len(raw_cdx_query('ia', 'discontents.com.au', format='json').json()) - 1
Out[20]:
334
In [21]:
# With collapse -- should only be one result as we're collapsing on urlkey and searching for an exact url
len(raw_cdx_query('ia', 'discontents.com.au', format='json', collapse='urlkey').json()) - 1
Out[21]:
1

NLA (PyWb)

In [22]:
# Without collapse
len(raw_cdx_query('nla', 'discontents.com.au', format='json').text.splitlines())
Out[22]:
148
In [23]:
# With collapse
len(raw_cdx_query('nla', 'discontents.com.au', collapse='urlkey', format='json').text.splitlines())
Out[23]:
148

De-duplicate results using Pandas.

In [24]:
data = [json.loads(line) for line in raw_cdx_query('nla', 'discontents.com.au', fields='urlkey', format='json').text.splitlines()]
df = pd.DataFrame(data).drop_duplicates(subset=['urlkey'])
df.shape[0]
Out[24]:
1

Sort and Closest

IA doesn't support sort or the closest parameter. To implement something similar, I suppose you could use from and to to set a window around a date, and then process the results to calculate time deltas and sort by 'closeness'.
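
A rough sketch of that approach, sorting captures by their absolute time delta from a target 14-digit CDX timestamp. The `closest_captures` helper and the sample captures are made up for illustration:

```python
from datetime import datetime

def closest_captures(captures, target, n=5):
    # Sort captures by the absolute time delta between each capture's
    # timestamp and a target timestamp (both in 14-digit CDX format),
    # then keep the n 'closest' captures
    t = datetime.strptime(target, '%Y%m%d%H%M%S')
    return sorted(
        captures,
        key=lambda c: abs(datetime.strptime(c['timestamp'], '%Y%m%d%H%M%S') - t)
    )[:n]

captures = [
    {'timestamp': '19981206012233'},
    {'timestamp': '20050101000000'},
    {'timestamp': '20030615120000'},
]
print(closest_captures(captures, '20040101000000', n=1))  # the 2003 capture
```

In practice you'd build the `captures` list from a from/to windowed CDX query first, then apply this sort locally.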


Limiting fields

The parameter used for limiting the fields returned from a query is different. The IA server expects fl, while PyWb uses fields (the PyWb documentation says fl, but it doesn't work).
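
One way to paper over the difference is a small helper that picks the parameter name based on the system type. `fields_param` is a hypothetical name, and it assumes both systems accept a comma-separated list of field names:

```python
def fields_param(api_type, field_list):
    # IA-style ('wb') servers expect 'fl'; PyWb servers expect 'fields'
    key = 'fl' if api_type == 'wb' else 'fields'
    return {key: ','.join(field_list)}

print(fields_param('wb', ['urlkey', 'timestamp']))    # {'fl': 'urlkey,timestamp'}
print(fields_param('pywb', ['urlkey', 'timestamp']))  # {'fields': 'urlkey,timestamp'}
```

In practice you'd pass `APIS[api]['type']` from the dictionary defined at the top of the notebook and merge the returned dictionary into your request parameters.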

NLA (PyWb)

In [25]:
json.loads(raw_cdx_query('nla', 'discontents.com.au', limit=1, fl='urlkey', format='json').text)
Out[25]:
{'urlkey': 'au,com,discontents)/',
 'timestamp': '19981206012233',
 'url': 'http://www.discontents.com.au/',
 'mime': 'text/html',
 'status': '200',
 'digest': 'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36',
 'offset': '59442416',
 'filename': 'NLA-EXTRACTION-1996-2004-ARCS-PART-00309-000001.arc.gz',
 'length': '1610',
 'source': 'awa',
 'source-coll': 'awa'}
In [26]:
json.loads(raw_cdx_query('nla', 'discontents.com.au', limit=1, fields='urlkey', format='json').text)
Out[26]:
{'urlkey': 'au,com,discontents)/'}

Comparison operators in filters

This seems to create the most potential for confusion. In PyWb, the filter parameter uses a number of different operators to indicate the type of match required. IA uses only !. There's no way of indicating that a query should be treated as a regular expression in IA, because all queries are treated as regular expressions.

Operator     Example                   Result
no operator  filter=mime:html          mime field contains 'html'
=            filter==mime:text/html    mime field matches 'text/html' exactly
~            filter=~status:30\d{1}    status field matches any 3-digit code starting with 30
!            filter=!mime:html         mime field doesn't contain 'html'
!=           filter=!=mime:text/html   mime field doesn't match 'text/html' exactly
!~           filter=!~status:30\d{1}   status field doesn't match any 3-digit code starting with 30

IA filter queries look for an exact match (which could be a regular expression) by default. This can be negated by using the ! operator.

Operator     Example                     Result
no operator  filter=mimetype:text/html   mimetype field matches 'text/html'
!            filter=!mimetype:text/html  mimetype field doesn't match 'text/html' exactly

In IA you need to use a regular expression to find a field containing a particular value. So these two expressions should result in the same matching behaviour:

PyWb                    IA
filter=mime:powerpoint  filter=mimetype:.*powerpoint.*

For interoperability, it seems easiest to always use regular expressions, inserting the ~ operator for PyWb systems. So:

PyWb                         IA
filter=~mime:.*powerpoint.*  filter=mimetype:.*powerpoint.*

Internet Archive (Wayback)

In [27]:
len(raw_cdx_query('ia', 'defence.gov.au/*', filter='mimetype:.*powerpoint.*', format='json', collapse='urlkey').json()) - 1
Out[27]:
229

NLA (PyWb)

In [28]:
len(raw_cdx_query('nla', 'defence.gov.au/*', filter='mime:powerpoint', format='json').text.splitlines())
Out[28]:
177
In [29]:
len(raw_cdx_query('nla', 'defence.gov.au/*', filter='~mime:.*powerpoint.*', format='json').text.splitlines())
Out[29]:
177

Pagination

Both IA and PyWb can support pagination of results; however, it's not available by default in PyWb – it's only available where repositories are using ZipNum indexes. Neither the UKWA nor the National Library of Australia CDX APIs support pagination. This means that queries to these systems will return all matching results in one hit (unless there's a system-defined limit). This is something to bear in mind, as large requests might be slow and prone to breakage.
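
Where pagination is available, an IA-style paged harvest boils down to requesting page=0, 1, 2, and so on until an empty page comes back. Here's a sketch with the network request stubbed out so the looping logic is clear – `fetch_all_pages` and the stub `pages` dictionary are illustrations only; in practice `fetch_page` would wrap `requests.get` with a `page` parameter:

```python
def fetch_all_pages(fetch_page):
    # Request successive page numbers until a page comes back empty,
    # accumulating the rows from each page along the way
    results, page = [], 0
    while True:
        rows = fetch_page(page)
        if not rows:
            break
        results.extend(rows)
        page += 1
    return results

# Stub fetcher standing in for a real paginated CDX request
pages = {0: ['row1', 'row2'], 1: ['row3']}
print(fetch_all_pages(lambda p: pages.get(p, [])))  # ['row1', 'row2', 'row3']
```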

Internet Archive (Wayback)

In [30]:
int(raw_cdx_query('ia', 'discontents.com.au', showNumPages='true', format='json').text)
Out[30]:
1

NLA (PyWb)

In [31]:
# NLA CDX server just ignores the showNumPages parameter and performs the query as normal
json.loads(raw_cdx_query('nla', 'discontents.com.au', showNumPages='true', format='json').text.splitlines()[0])
Out[31]:
{'urlkey': 'au,com,discontents)/',
 'timestamp': '19981206012233',
 'url': 'http://www.discontents.com.au/',
 'mime': 'text/html',
 'status': '200',
 'digest': 'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36',
 'offset': '59442416',
 'filename': 'NLA-EXTRACTION-1996-2004-ARCS-PART-00309-000001.arc.gz',
 'length': '1610',
 'source': 'awa',
 'source-coll': 'awa'}

Fuzzy matching

If your query to a PyWb CDX API returns no matches, the system will use regular expressions to broaden your search and return a set of 'fuzzy' matches. These results will include an is_fuzzy field set to a value of 1. This is not supported in IA.

While fuzzy matching is useful for discovery, it might not be what you want if you're assembling a specific dataset. In this case you'd need to filter the results to remove the is_fuzzy matches.
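
Filtering out fuzzy matches is straightforward once the results are a list of dictionaries – just drop any capture whose is_fuzzy field is set. `drop_fuzzy` is a hypothetical helper and the sample captures are made up:

```python
def drop_fuzzy(captures):
    # Keep only captures that PyWb did not flag as fuzzy matches
    return [c for c in captures if c.get('is_fuzzy') != '1']

captures = [
    {'urlkey': 'a)/', 'status': '200'},
    {'urlkey': 'b)/', 'status': '200', 'is_fuzzy': '1'},
]
print(drop_fuzzy(captures))  # only the first capture survives
```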

Internet Archive (Wayback)

In [32]:
# This should return no results
raw_cdx_query('ia', 'discontents.com.au', limit=1, filter='statuscode:666', format='json').json()
Out[32]:
[]

NLA (PyWb)

In [33]:
# This would return no results except for fuzzy matching
# Note the status value in the result and the 'is_fuzzy' field
json.loads(raw_cdx_query('nla', 'discontents.com.au', limit=1, filter='status:666', format='json').text)
Out[33]:
{'urlkey': 'au,com,discontents)/',
 'timestamp': '19981206012233',
 'url': 'http://www.discontents.com.au/',
 'mime': 'text/html',
 'status': '200',
 'digest': 'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36',
 'offset': '59442416',
 'filename': 'NLA-EXTRACTION-1996-2004-ARCS-PART-00309-000001.arc.gz',
 'length': '1610',
 'source': 'awa',
 'source-coll': 'awa',
 'is_fuzzy': '1'}

Normalising queries

It would be possible to wrap some code around queries that simulated collapse and closest across the two systems, but for the moment I'll just focus on some basic normalisation of query parameters and results. The functions below:

  • Normalise field names in queries and results
  • Convert results into a list of dictionaries
In [34]:
def normalise_filter(api, f):
    sys_type = APIS[api]['type']
    if sys_type == 'pywb':
        f = f.replace('mimetype:', 'mime:')
        f = f.replace('statuscode:', 'status:')
        f = f.replace('original:', 'url:')
        f = re.sub(r'^(!{0,1})(\w)', r'\1~\2', f)
    elif sys_type == 'wb':
        f = f.replace('mime:', 'mimetype:')
        f = f.replace('status:', 'statuscode:')
        f = f.replace('url:', 'original:')
    return f

def normalise_filters(api, filters):
    if isinstance(filters, list):
        normalised = []
        for f in filters:
            normalised.append(normalise_filter(api, f))
    else:
        normalised = normalise_filter(api, filters)
    return normalised

def convert_lists_to_dicts(results):
    '''
    Converts IA-style CDX results (a JSON array of arrays) to a list of dictionaries.
    Renames keys to standardise field labels across systems.
    '''
    if results:
        keys = results[0]
        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]
    else:
        results_as_dicts = results
    for d in results_as_dicts:
        # Guard against missing keys in case the query limited the returned fields
        if 'statuscode' in d:
            d['status'] = d.pop('statuscode')
        if 'mimetype' in d:
            d['mime'] = d.pop('mimetype')
        if 'original' in d:
            d['url'] = d.pop('original')
    return results_as_dicts

def query_cdx(api, url, **kwargs):
    params = kwargs
    if 'filter' in params:
        params['filter'] = normalise_filters(api, params['filter'])
    params['url'] = url
    params['output'] = 'json'
    response = requests.get(APIS[api]['url'], params=params)
    print(response.url)
    response.raise_for_status()
    response_type = response.headers['content-type'].split(';')[0]
    print(response_type)
    if response_type == 'text/x-ndjson':
        data = [json.loads(line) for line in response.text.splitlines()]
    elif response_type == 'application/json':
        data = convert_lists_to_dicts(response.json())
    else:
        raise ValueError(f'Unexpected content type: {response_type}')
    return data

Here are some examples – note that the parameters and their values are unchanged; you can just switch repositories.

Internet Archive (Wayback)

In [35]:
query_cdx('ia', 'defence.gov.au/*', filter=['mimetype:.*pdf', 'status:200'], limit=1)
http://web.archive.org/cdx/search/cdx?filter=mimetype%3A.%2Apdf&filter=statuscode%3A200&limit=1&url=defence.gov.au%2F%2A&output=json
application/json
Out[35]:
[{'urlkey': 'au,gov,defence)/28sqn/ad097.pdf',
  'timestamp': '20140304175138',
  'digest': 'AQBSAVSJJYOYKKLW7GM36PDCYDREFQXA',
  'length': '141731',
  'status': '200',
  'mime': 'application/pdf',
  'url': 'http://www.defence.gov.au/28sqn/AD097.pdf'}]

NLA (PyWb)

In [36]:
query_cdx('nla', 'defence.gov.au', filter=['mimetype:.*pdf', 'status:200'], matchType='prefix', limit=1)
https://web.archive.org.au/awa/cdx?filter=~mime%3A.%2Apdf&filter=~status%3A200&matchType=prefix&limit=1&url=defence.gov.au&output=json
text/x-ndjson
Out[36]:
[{'urlkey': 'au,gov,defence)/',
  'timestamp': '19981202111842',
  'url': 'http://www.defence.gov.au/',
  'mime': 'text/html',
  'status': '200',
  'digest': 'ERQQ3XVKGL4VFGI4KXIPE24QI7YMW4Z6',
  'offset': '8871025',
  'filename': 'NLA-EXTRACTION-1996-2004-ARCS-PART-00307-000001.arc.gz',
  'length': '4038',
  'source': 'awa',
  'source-coll': 'awa',
  'is_fuzzy': '1'}]

Created by Tim Sherratt for the GLAM Workbench. Support me by becoming a GitHub sponsor!

Work on this notebook was supported by the IIPC Discretionary Funding Programme 2019-2020