New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.
There are a couple of ways of getting a list of the available snapshots for a particular url. In this notebook, we'll compare the Internet Archive's CDX index API with their Memento Timemap API. Do they give us the same data?
See Exploring the Internet Archive's CDX API for more information about the CDX API.
import requests
import pandas as pd
def query_timemap(url):
    '''
    Get a Timemap in JSON format for the specified url.
    '''
    # As with the CDX API, a User-Agent header is needed to avoid an error
    response = requests.get(f'https://web.archive.org/web/timemap/json/{url}', headers={'User-Agent': ''})
    response.raise_for_status()
    return response.json()
def query_cdx(url, **kwargs):
    '''
    Query the IA CDX API for the supplied url.
    You can optionally provide any of the parameters accepted by the API.
    '''
    params = kwargs
    params['url'] = url
    params['output'] = 'json'
    # A User-Agent value is necessary or else IA gives an error
    response = requests.get('http://web.archive.org/cdx/search/cdx', params=params, headers={'User-Agent': ''})
    response.raise_for_status()
    return response.json()
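Because `query_cdx` passes its keyword arguments straight through as request parameters, you can add any parameter the CDX API accepts (such as `limit`). This sketch pulls that parameter-building step out into a hypothetical helper, `build_cdx_params`, just to show how the kwargs are combined with the defaults (no network call involved):

```python
def build_cdx_params(url, **kwargs):
    '''
    Sketch of how query_cdx assembles its request parameters.
    Any keyword arguments are passed through to the API unchanged.
    '''
    params = kwargs
    params['url'] = url
    params['output'] = 'json'
    return params

# For example, limiting the number of results returned
print(build_cdx_params('http://nla.gov.au', limit=10))
# → {'limit': 10, 'url': 'http://nla.gov.au', 'output': 'json'}
```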
url = 'http://nla.gov.au'
tm_data = query_timemap(url)
tm_df = pd.DataFrame(tm_data[1:], columns=tm_data[0])
cdx_data = query_cdx(url)
cdx_df = pd.DataFrame(cdx_data[1:], columns=cdx_data[0])
list(cdx_df.columns)
['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode', 'digest', 'length']
list(tm_df.columns)
['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode', 'digest', 'redirect', 'robotflags', 'length', 'offset', 'filename']
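As a quick check, we can compute which Timemap columns are missing from the CDX results, using the column lists shown above:

```python
# Column names reported by each API (copied from the output above)
cdx_cols = ['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode', 'digest', 'length']
tm_cols = ['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode', 'digest',
           'redirect', 'robotflags', 'length', 'offset', 'filename']

# Columns in the Timemap results that aren't in the CDX results
extra_cols = [c for c in tm_cols if c not in cdx_cols]
print(extra_cols)
# → ['redirect', 'robotflags', 'offset', 'filename']
```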
The Timemap data includes four extra columns: `redirect`, `robotflags`, `offset`, and `filename`. The `offset` and `filename` columns tell you where to find the snapshot, but I'm not sure what `robotflags` is for (it's not in the specification). Let's have a look at what sort of values it has.
tm_df['robotflags'].value_counts()
-    2863
Name: robotflags, dtype: int64
There's nothing in it – at least for this particular url.
For my purposes, it doesn't look like the Timemap adds anything useful.
tm_df.shape
(2863, 11)
cdx_df.shape
(2886, 7)
So there are more snapshots in the CDX results than in the Timemap results. Can we find out what they are?
# Combine the two dataframes, then only keep rows that aren't duplicated based on timestamp, original, digest, and statuscode
pd.concat([cdx_df,tm_df]).drop_duplicates(subset=['timestamp', 'original', 'digest', 'statuscode'], keep=False)
urlkey | timestamp | original | mimetype | statuscode | digest | length | redirect | robotflags | offset | filename
---|---|---|---|---|---|---|---|---|---|---
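Another way of comparing two dataframes is pandas' `merge` with `indicator=True`, which adds a `_merge` column labelling each row as appearing in the left frame only, the right frame only, or both. Here's a minimal sketch using made-up data rather than the real snapshot results:

```python
import pandas as pd

# Two toy 'snapshot' tables -- each has one row the other lacks
df_a = pd.DataFrame({'timestamp': ['2020', '2021'], 'digest': ['X', 'Y']})
df_b = pd.DataFrame({'timestamp': ['2020', '2022'], 'digest': ['X', 'Z']})

# An outer merge keeps all rows; indicator=True adds a '_merge' column
# with values 'left_only', 'right_only', or 'both'
merged = df_a.merge(df_b, how='outer', indicator=True)
print(merged)
```

Filtering on `_merge != 'both'` would then give you just the rows unique to one source.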
Hmm, if there were rows in the `cdx_df` that weren't in the `tm_df` I'd expect them to show up, but there are no rows that aren't duplicated based on the `timestamp`, `original`, `digest`, and `statuscode` columns...
Let's try this another way, by finding the number of unique snapshots in each dataframe.
# Remove duplicate rows
cdx_df.drop_duplicates(subset=['timestamp', 'digest', 'statuscode', 'original'], keep='first').shape
(2788, 7)
# Remove duplicate rows
tm_df.drop_duplicates(subset=['timestamp', 'digest', 'statuscode', 'original'], keep='first').shape
(2788, 11)
Ah, so both sets of data contain duplicates, and there are really only 2,788 unique snapshots. Let's look at some of the duplicates in the CDX data.
dupes = cdx_df.loc[cdx_df.duplicated(subset=['timestamp', 'digest'], keep=False)].sort_values(by='timestamp')
dupes.head(10)
 | urlkey | timestamp | original | mimetype | statuscode | digest | length
---|---|---|---|---|---|---|---|
29 | au,gov,nla)/ | 19990508095540 | http://www2.nla.gov.au:80/ | text/html | 200 | 6XT5KK4SHMW2N3CAHPACCPJ2J5ZUIMES | 1995 |
30 | au,gov,nla)/ | 19990508095540 | http://www2.nla.gov.au:80/ | text/html | 200 | 6XT5KK4SHMW2N3CAHPACCPJ2J5ZUIMES | 1995 |
31 | au,gov,nla)/ | 20000229171639 | http://www.nla.gov.au:80/ | text/html | 200 | D6ZAUEL66NTHWCW3UP6K7WBT3CJLRYTT | 2155 |
32 | au,gov,nla)/ | 20000229171639 | http://www.nla.gov.au:80/ | text/html | 200 | D6ZAUEL66NTHWCW3UP6K7WBT3CJLRYTT | 2155 |
34 | au,gov,nla)/ | 20000302132843 | http://www.nla.gov.au:80/ | text/html | 200 | D6ZAUEL66NTHWCW3UP6K7WBT3CJLRYTT | 2158 |
35 | au,gov,nla)/ | 20000302132843 | http://www.nla.gov.au:80/ | text/html | 200 | D6ZAUEL66NTHWCW3UP6K7WBT3CJLRYTT | 2158 |
61 | au,gov,nla)/ | 20010301210012 | http://www.nla.gov.au:80/ | text/html | 200 | BQJYH5FYDY6DEAXDZMSZWKLZXA5GGJNC | 2390 |
62 | au,gov,nla)/ | 20010301210012 | http://www.nla.gov.au:80/ | text/html | 200 | BQJYH5FYDY6DEAXDZMSZWKLZXA5GGJNC | 2390 |
882 | au,gov,nla)/ | 20090327043759 | http://www.nla.gov.au/ | text/html | 200 | 537C3S5FANRHGLW3A6WPE6A57LULWNOF | 6306 |
883 | au,gov,nla)/ | 20090327043759 | http://www.nla.gov.au/ | text/html | 200 | 537C3S5FANRHGLW3A6WPE6A57LULWNOF | 6473 |
print(f'Date range of duplicates: {dupes["timestamp"].min()} to {dupes["timestamp"].max()}')
Date range of duplicates: 19990508095540 to 20200517065916
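For the record, the shape figures reported above let us count how many duplicate rows each source contains (taking 2,788 as the number of unique snapshots):

```python
# Row counts from the .shape outputs above
cdx_total = 2886
tm_total = 2863
unique = 2788

print(f'Duplicate rows in CDX results: {cdx_total - unique}')      # 98
print(f'Duplicate rows in Timemap results: {tm_total - unique}')   # 75
```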
So it seems they provide the same number of unique snapshots, but the CDX index adds a few more duplicates.
%%timeit
tm_data = query_timemap(url)
4.22 s ± 453 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
cdx_data = query_cdx(url)
The slowest run took 4.86 times longer than the fastest. This could mean that an intermediate result is being cached.
5.06 s ± 2.2 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Both methods provide much the same data, so it just comes down to convenience and performance.
Created by Tim Sherratt for the GLAM Workbench.
Work on this notebook was supported by the IIPC Discretionary Funding Programme 2019-2020.