Getting all available snapshots of a particular page from the Internet Archive – Timemap or CDX?

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

There are a couple of ways of getting a list of the available snapshots for a particular url. In this notebook, we'll compare the Internet Archive's CDX index API with their Memento Timemap API. Do they give us the same data?

See Exploring the Internet Archive's CDX API for more information about the CDX API.

In [1]:
import requests
import pandas as pd

Get the data for comparison

In [2]:
def query_timemap(url):
    '''
    Get a Timemap in JSON format for the specified url.
    '''
    # As with the CDX API below, a User-Agent value is necessary or else IA gives an error
    response = requests.get(f'https://web.archive.org/web/timemap/json/{url}', headers={'User-Agent': ''})
    response.raise_for_status()
    return response.json()
In [3]:
def query_cdx(url, **kwargs):
    '''
    Query the IA CDX API for the supplied url.
    You can optionally provide any of the parameters accepted by the API.
    '''
    params = kwargs
    params['url'] = url
    params['output'] = 'json'
    # User-Agent value is necessary or else IA gives an error
    response = requests.get('http://web.archive.org/cdx/search/cdx', params=params, headers={'User-Agent': ''})
    response.raise_for_status()
    return response.json()
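Because query_cdx() passes any extra keyword arguments straight through to the API, you can use the CDX server's own parameters, such as filter and limit. A quick sketch (the parameter values here are just examples):

# Only return successful (200) captures, and cap the number of results --
# 'filter' and 'limit' are standard IA CDX server parameters passed via **kwargs
sample = query_cdx('http://nla.gov.au', filter='statuscode:200', limit=5)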
In [4]:
url = 'http://nla.gov.au'
tm_data = query_timemap(url)
tm_df = pd.DataFrame(tm_data[1:], columns=tm_data[0])
cdx_data = query_cdx(url)
cdx_df = pd.DataFrame(cdx_data[1:], columns=cdx_data[0])

Are the columns the same?

In [5]:
list(cdx_df.columns)
Out[5]:
['urlkey',
 'timestamp',
 'original',
 'mimetype',
 'statuscode',
 'digest',
 'length']
In [6]:
list(tm_df.columns)
Out[6]:
['urlkey',
 'timestamp',
 'original',
 'mimetype',
 'statuscode',
 'digest',
 'redirect',
 'robotflags',
 'length',
 'offset',
 'filename']
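
We can also get the difference directly by comparing the two sets of column names:

# Columns in the Timemap results that aren't in the CDX results --
# from the lists above, this should be redirect, robotflags, offset, and filename
set(tm_df.columns) - set(cdx_df.columns)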

The Timemap data includes four extra columns: redirect, robotflags, offset, and filename. The offset and filename columns tell you where to find the snapshot, but I'm not sure what robotflags is for (it's not in the specification). Let's have a look at what sort of values it has.

In [7]:
tm_df['robotflags'].value_counts()
Out[7]:
-    3528
Name: robotflags, dtype: int64

There's nothing in it – at least for this particular url.

For my purposes, it doesn't look like the Timemap adds anything useful.
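
In any case, the timestamp and original values are enough to construct a link to any snapshot using the Wayback Machine's standard replay pattern. A minimal sketch:

# Build a replay url for the first snapshot -- the standard Wayback Machine
# pattern is https://web.archive.org/web/{timestamp}/{original}
snapshot = tm_df.iloc[0]
print(f'https://web.archive.org/web/{snapshot["timestamp"]}/{snapshot["original"]}')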

Do they provide the same number of snapshots?

In [8]:
tm_df.shape
Out[8]:
(3528, 11)
In [9]:
cdx_df.shape
Out[9]:
(3581, 7)

So there are more rows in the CDX results than the Timemap – 3,581 versus 3,528, a difference of 53. Can we find out what the extra rows are?

In [10]:
# Combine the two dataframes, then only keep rows that aren't duplicated
# based on timestamp, original, digest, and statuscode
pd.concat([cdx_df, tm_df]).drop_duplicates(subset=['timestamp', 'original', 'digest', 'statuscode'], keep=False)
Out[10]:
urlkey timestamp original mimetype statuscode digest length redirect robotflags offset filename

Hmm, if there were rows in the cdx_df that weren't in the tm_df I'd expect them to show up, but nothing survives at all. With keep=False, drop_duplicates removes every row that has a duplicate, so an empty result means every row appears at least twice – either matched in the other dataframe, or duplicated within its own. Perhaps the extra CDX rows are just internal duplicates...

Let's try this another way, by finding the number of unique snapshots in each df.

In [11]:
# Remove duplicate rows 
cdx_df.drop_duplicates(subset=['timestamp', 'digest', 'statuscode', 'original'], keep='first').shape
Out[11]:
(3439, 7)
In [12]:
# Remove duplicate rows 
tm_df.drop_duplicates(subset=['timestamp', 'digest', 'statuscode', 'original'], keep='first').shape
Out[12]:
(3439, 11)

Ah, so both sets of data contain duplicates, and there are really only 3,439 unique snapshots.
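
As a quick cross-check, we can compare the unique snapshot keys directly. If the two sources really do describe the same set of captures, this should return True.

# Compare the sets of unique (timestamp, original, digest, statuscode) combinations
key_cols = ['timestamp', 'original', 'digest', 'statuscode']
set(map(tuple, cdx_df[key_cols].values)) == set(map(tuple, tm_df[key_cols].values))

Now let's look at some of the duplicates in the CDX data.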

In [13]:
dupes = cdx_df.loc[cdx_df.duplicated(subset=['timestamp', 'digest'], keep=False)].sort_values(by='timestamp')
dupes.head(10)
Out[13]:
urlkey timestamp original mimetype statuscode digest length
29 au,gov,nla)/ 19990508095540 http://www2.nla.gov.au:80/ text/html 200 6XT5KK4SHMW2N3CAHPACCPJ2J5ZUIMES 1995
30 au,gov,nla)/ 19990508095540 http://www2.nla.gov.au:80/ text/html 200 6XT5KK4SHMW2N3CAHPACCPJ2J5ZUIMES 1995
879 au,gov,nla)/ 20090327043759 http://www.nla.gov.au/ text/html 200 537C3S5FANRHGLW3A6WPE6A57LULWNOF 6306
880 au,gov,nla)/ 20090327043759 http://www.nla.gov.au/ text/html 200 537C3S5FANRHGLW3A6WPE6A57LULWNOF 6473
881 au,gov,nla)/ 20090515004007 http://www.nla.gov.au/ text/html 200 CC747V3CYGCYQZELL37KNOW5DRPEMFEW 6614
882 au,gov,nla)/ 20090515004007 http://www.nla.gov.au/ text/html 200 CC747V3CYGCYQZELL37KNOW5DRPEMFEW 6614
884 au,gov,nla)/ 20090521102300 http://www.nla.gov.au/ text/html 200 25VWCDZDMMC57PLHGKIJ6XUBG566EW33 6619
885 au,gov,nla)/ 20090521102300 http://www.nla.gov.au/ text/html 200 25VWCDZDMMC57PLHGKIJ6XUBG566EW33 6619
887 au,gov,nla)/ 20090521230410 http://nla.gov.au/ warc/revisit - BDOBBSVBWA4WL3PLC7TSVIA5PE2RZKRD 469
886 au,gov,nla)/ 20090521230410 http://nla.gov.au/ warc/revisit - BDOBBSVBWA4WL3PLC7TSVIA5PE2RZKRD 469
In [14]:
print(f'Date range of duplicates: {dupes["timestamp"].min()} to {dupes["timestamp"].max()}')
Date range of duplicates: 19990508095540 to 20210506043731

So it seems they provide the same number of unique snapshots, but the CDX index adds a few more duplicates.
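
From the shapes above, that's 3,581 - 3,439 = 142 duplicate rows in the CDX results, against 3,528 - 3,439 = 89 in the Timemap. You can count them directly with duplicated():

# Count rows that duplicate an earlier row on the key columns --
# given the shapes above, these should be 142 and 89 respectively
key_cols = ['timestamp', 'original', 'digest', 'statuscode']
print(cdx_df.duplicated(subset=key_cols).sum())
print(tm_df.duplicated(subset=key_cols).sum())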

Is there a difference in speed?

In [15]:
%%timeit
tm_data = query_timemap(url)
2.04 s ± 248 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [16]:
%%timeit
cdx_data = query_cdx(url)
1.04 s ± 156 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Conclusion

Both methods provide much the same data, so the choice comes down to convenience and performance. In this case, the CDX API was about twice as fast as the Timemap, and the Timemap's extra columns didn't add anything I needed.


Created by Tim Sherratt for the GLAM Workbench. Support me by becoming a GitHub sponsor!

Work on this notebook was supported by the IIPC Discretionary Funding Programme 2019-2020