New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.
CDX APIs provide access to an index of resources captured by web archives. The results can be filtered in a number of ways, making them a convenient way of harvesting and exploring the holdings of web archives. This notebook focuses on the data you can obtain from the Internet Archive's CDX API. For more information on the differences between this and other CDX APIs see Comparing CDX APIs. To examine differences between CDX data and Timemaps see Timemaps vs CDX APIs.
Notebooks demonstrating ways of getting and using CDX data include:
import requests
import pandas as pd
from io import BytesIO
import altair as alt
import os
from base64 import b32encode
from hashlib import sha1
import arrow
import re
from tqdm.notebook import tqdm
Let's have a look at the sort of data the CDX server gives us. At the very least, we have to provide a url
parameter to point to a particular page (or domain as we'll see below). To avoid flinging too much data about, we'll also add a limit
parameter that tells the CDX server how many rows of data to give us.
# 8 April 2020 -- without the 'User-Agent' header parameter I get a 445 error
# 27 April 2020 - now seems ok without changing User-Agent
# Feel free to change these values
params1 = {
'url': 'http://nla.gov.au',
'limit': 10
}
# Get the data and print the results
response = requests.get('http://web.archive.org/cdx/search/cdx', params=params1)
print(response.text)
By default, the results are returned in a simple text format – fields are separated by spaces, and each result is on a separate line. It's a bit hard to read in this format, so let's add the output
parameter to get the results in JSON format. We'll then use Pandas to display the results in a table.
params2 = {
'url': 'http://nla.gov.au',
'limit': 10,
'output': 'json'
}
# Get the data and print the results
response = requests.get('http://web.archive.org/cdx/search/cdx', params=params2)
results = response.json()
# Use Pandas to turn the results into a DataFrame then display
pd.DataFrame(results[1:], columns=results[0]).head(10)
The JSON results are, in Python terms, a list of lists, rather than a list of dictionaries. The first of these lists contains the field names. If you look at the line below, you'll see that we use the first list (results[0]
) to set the column names in the dataframe, while the rest of the data (results[1:]
) makes up the rows.
pd.DataFrame(results[1:], columns=results[0]).head(10)
Let's have a look at the fields.
- urlkey – the page url expressed as a SURT (Sort-friendly URI Reordering Transform)
- timestamp – the date and time of the capture in a YYYYMMDDhhmmss format
- original – the url that was captured
- mimetype – the type of file captured, expressed in a standard format
- statuscode – a standard code provided by the web server that reports on the result of the capture request
- digest – also known as a 'checksum' or 'fingerprint', the digest provides an algorithmically generated string that uniquely identifies the content of the captured url
- length – the size of the captured content in bytes (compressed on disk)

All makes perfect sense right? Hmmm, we'll dig a little deeper below, but first...
We can use the timestamp
value to retrieve the contents of a particular capture. A url like this will open the captured resource in the Wayback Machine:
https://web.archive.org/web/[timestamp]/[url]
For example: https://web.archive.org/web/20130201130329/http://www.nla.gov.au/
If you want the original contents, without the modifications and navigation added by the Wayback Machine, just add id_
after the timestamp
:
https://web.archive.org/web/[timestamp]id_/[url]
For example: https://web.archive.org/web/20130201130329id_/http://www.nla.gov.au/
You'll probably notice that the original version doesn't look very pretty because links to CSS or Javascript files are still pointing to their old, broken addresses. If you want a version without the Wayback Machine navigation, but with urls to any linked files rewritten to point to archived versions, then add if_
after the timestamp.
https://web.archive.org/web/[timestamp]if_/[url]
For example: https://web.archive.org/web/20130201130329if_/http://www.nla.gov.au/
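Here's a minimal sketch that builds each of these url forms from the first capture in the results list we retrieved above (the variable names are just for illustration):
# Construct the three url forms from the first capture in `results`
# (the JSON results from the query above – row 0 is the header row)
timestamp = results[1][1]  # the timestamp field
original = results[1][2]   # the original field
print(f'https://web.archive.org/web/{timestamp}/{original}')     # Wayback Machine view
print(f'https://web.archive.org/web/{timestamp}id_/{original}')  # unmodified original content
print(f'https://web.archive.org/web/{timestamp}if_/{original}')  # no navigation, links rewritten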
If you want to get all the captures of a particular page, you can just leave out the limit
parameter. However, there is (supposedly) a limit on the number of results returned in a single request. The API documentation says the current limit is 150,000, but it seems much larger – if you ask for cnn.com
without using limit
you get more than 290,000 results! To try and make sure that you're getting everything, there's a couple of ways you can break up the results set into chunks. The first is to set the showResumeKey
parameter to true
. Then, if there are more results available than are returned in your initial request, a couple of extra rows of data will be added to your results. The last row will include a resumption key, while the second last row will be empty, for example:
[],
['com%2Ccnn%29%2F+20000621011732']
You then set the resumeKey
parameter to the value of the resumption key, and add it to your next requests. You can combine the use of the resumption key with the limit
parameter to break a large collection of captures into manageable chunks.
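As a quick illustration (a sketch using the parameters described above), you can make a single limited request with showResumeKey set and look at the last two rows of the response:
# Request a small number of results with showResumeKey set,
# then peek at the last two rows – an empty row followed by the resumption key
params_rk = {
    'url': 'http://nla.gov.au',
    'limit': 10,
    'output': 'json',
    'showResumeKey': 'true'
}
response = requests.get('http://web.archive.org/cdx/search/cdx', params=params_rk)
print(response.json()[-2:])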
The other way is to add a page
parameter, starting at 0
then incrementing the page
value by one until you've worked through the complete set of results. But how do you know the total number of pages? If you add showNumPages=true
to your query, the server will return a single number representing the total pages. But the pages themselves come from a special index and can contain different numbers of results depending on your query, so there's no obvious way to calculate the number of captures from the number of pages. Also, the maximum size of a page seems quite large and this sometimes causes errors. You can control this by adding a pageSize
parameter. The meaning of this value seems a bit mysterious, but I've found that a pageSize
of 5
seems to be a reasonable balance between the amount of data returned by each request and the number of requests you have to make.
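As a quick illustration of showNumPages (just a sketch – the exact number returned will change over time), you can ask how many pages of results there are for a query:
# Ask the server how many pages of results there are for this query
params_np = {
    'url': 'nla.gov.au',
    'pageSize': 5,
    'showNumPages': 'true'
}
response = requests.get('http://web.archive.org/cdx/search/cdx', params=params_np)
print(response.text)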
Let's put all this together in a few functions that will help us construct CDX queries of any size or complexity.
def check_for_resumption_key(results):
'''
Checks to see if the second-last row is an empty list,
if it is, return the last value as the resumption key.
'''
if not results[-2]:
return results[-1][0]
def get_total_pages(params):
'''
Gets the total number of pages in a set of results.
'''
these_params = params.copy()
these_params['showNumPages'] = 'true'
    response = requests.get('http://web.archive.org/cdx/search/cdx', params=these_params, headers={'User-Agent': ''})
return int(response.text)
def prepare_params(url, use_resume_key=False, **kwargs):
'''
    Prepare the parameters for a CDX API request.
Adds all supplied keyword arguments as parameters (changing from_ to from).
Adds in a few necessary parameters and showResumeKey if requested.
'''
params = kwargs
params['url'] = url
params['output'] = 'json'
if use_resume_key:
params['showResumeKey'] = 'true'
# CDX accepts a 'from' parameter, but this is a reserved word in Python
# Use 'from_' to pass the value to the function & here we'll change it back to 'from'.
if 'from_' in params:
params['from'] = params['from_']
del(params['from_'])
return params
def get_cdx_data(params):
'''
Make a request to the CDX API using the supplied parameters.
Check the results for a resumption key, and return the key (if any) and the results.
'''
response = requests.get('http://web.archive.org/cdx/search/cdx', params=params, headers={'User-Agent': ''})
response.raise_for_status()
results = response.json()
resumption_key = check_for_resumption_key(results)
# Remove the resumption key from the results
if resumption_key:
results = results[:-2]
return resumption_key, results
def query_cdx_by_page(url, **kwargs):
    '''
    Harvest results from the CDX API using the page parameter.
    Gets the total number of pages, then loops through them,
    adding the results from each page.
    '''
all_results = []
page = 0
params = prepare_params(url, **kwargs)
total_pages = get_total_pages(params)
with tqdm(total=total_pages-page) as pbar1:
with tqdm() as pbar2:
while page < total_pages:
params['page'] = page
_, results = get_cdx_data(params)
if page == 0:
all_results += results
else:
all_results += results[1:]
page += 1
pbar1.update(1)
pbar2.update(len(results) - 1)
return all_results
def query_cdx_with_key(url, **kwargs):
'''
Harvest results from the CDX API using the supplied parameters.
Uses showResumeKey to check if there are more than one page of results,
and if so loops through pages until all results are downloaded.
'''
params = prepare_params(url, use_resume_key=True, **kwargs)
with tqdm() as pbar:
# This will include the header row
resumption_key, all_results = get_cdx_data(params)
pbar.update(len(all_results) - 1)
while resumption_key is not None:
params['resumeKey'] = resumption_key
resumption_key, results = get_cdx_data(params)
            # Remove the header row and add the rest of the results
all_results += results[1:]
pbar.update(len(results) - 1)
return all_results
To harvest all of the captures of 'http://www.nla.gov.au', you can just call:
results = query_cdx_with_key('http://www.nla.gov.au')
To break the harvest down into chunks of 1,000 results at a time, you'd call:
results = query_cdx_with_key('http://www.nla.gov.au', limit=1000)
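To harvest the same url using the page-based approach instead, you'd call the other function in the same way, optionally adding the pageSize value suggested above:
results = query_cdx_by_page('http://www.nla.gov.au', pageSize=5)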
There are a number of other parameters you can use to filter results from the CDX API – you can supply any of these as well. We'll see some examples below.
So let's get all the captures of 'http://www.nla.gov.au'.
results = query_cdx_with_key('http://www.nla.gov.au')
And convert them into a dataframe.
df = pd.DataFrame(results[1:], columns=results[0])
How many captures are there?
df.shape
Ok, now we've got a dataset, let's look at the structure of the data in a little more detail.
As noted above, the urlkey
field contains things that are technically known as SURTs (Sort-friendly URI Reordering Transform). Basically, the order of components in the url's domain is reversed to make captures easier to sort and group. So instead of nla.gov.au
we have au,gov,nla
. The path component of the url, the bit that points to a specific file within the domain, is tacked on the end of the urlkey
after a closing bracket. Here are some examples:
http://www.nla.gov.au
becomes au,gov,nla
plus the path /
, so the urlkey is:
au,gov,nla)/
http://www.dsto.defence.gov.au/attachments/9%20LEWG%20Oct%2008%20DEU.ppt
becomes au,gov,defence,dsto
plus the path /attachments/9%20lewg%20oct%2008%20deu.ppt
, so the urlkey is:
au,gov,defence,dsto)/attachments/9%20lewg%20oct%2008%20deu.ppt
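Just to make the basic transformation concrete, here's a simplified sketch (not the full SURT algorithm – it ignores things like ports, query strings and other normalisation) that reverses the domain components of a url:
from urllib.parse import urlparse

def simple_surt(url):
    # A simplified illustration only – reverse the domain components,
    # drop 'www', and lowercase everything
    parsed = urlparse(url.lower())
    host_parts = [p for p in parsed.netloc.split('.') if p != 'www']
    return ','.join(reversed(host_parts)) + ')' + (parsed.path or '/')

print(simple_surt('http://www.nla.gov.au'))
print(simple_surt('http://www.dsto.defence.gov.au/attachments/9%20LEWG%20Oct%2008%20DEU.ppt'))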
From the examples above, you'll notice there's a bit of extra normalisation going on. For example, the url components are all converted to lowercase. You might also be wondering what happened to the www
subdomain. By convention these are aliases that just point to the underlying domain – www.nla.gov.au
ends up at the same place as nla.gov.au
– so they're removed from the SURT. We can explore this a bit further by comparing the original
urls in our dataset to the urlkeys
.
How many unique urlkey
s are there? Hopefully just one, as we're gathering captures from a single url!
df['urlkey'].unique().shape[0]
But how many different original
urls were captured?
df['original'].unique().shape[0]
Let's have a look at them.
df['original'].value_counts()
So we can see that as well as removing www
, the normalisation process removes www2
and port numbers, and groups together the http
and https
protocols. There are also some odd things that look like email addresses and were probably harvested by mistake from mailto
links.
But wait a minute – our original query was just for the url http://www.nla.gov.au, so why did we get all these other urls? When we request a particular url
from the CDX API, it matches results based on the url's SURT, not on the original url. This ensures that we get all the variations in the way the url might be expressed. If we want to limit results to a specific form of the url, we can do that by filtering on the original
field, as we'll see below.
Because the urlkey
is essentially a normalised identifier for an individual url, you can use it to group together all the captures of individual pages across a whole domain. For example, if we wanted to know how many urls have been captured from the nla.gov.au
domain, we can call our query function like this:
results = query_cdx_with_key('nla.gov.au/*', collapse='urlkey', limit=1000)
Note that the url
parameter includes a *
to indicate that we want everything under the nla.gov.au
domain. The collapse='urlkey'
parameter says that we only want unique urlkey
values – so we'll get just one capture for each individual url within the nla.gov.au
domain. This can be a useful way of gathering a domain-level summary.
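For example, here's a sketch that harvests one capture per unique url across the nla.gov.au domain and counts them (this downloads a fair amount of data, so it can take a while):
# Get one capture for each unique urlkey across the whole domain –
# the limit just breaks the download into chunks, and a few duplicates
# may still slip through at chunk boundaries
results_domain = query_cdx_with_key('nla.gov.au/*', collapse='urlkey', limit=1000)
df_domain = pd.DataFrame(results_domain[1:], columns=results_domain[0])
print(f'{df_domain["urlkey"].unique().shape[0]:,} unique urls captured')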
The timestamp
field is pretty straightforward – it contains the date and time of the capture expressed in the format YYYYMMDDhhmmss
. Once we have the harvested results in a dataframe, we can easily convert the timestamps into a datetime object.
df['date'] = pd.to_datetime(df['timestamp'], format='%Y%m%d%H%M%S')
This makes it possible to plot the number of captures over time. Here we group the captures by year.
alt.Chart(df).mark_bar().encode(
x='year(date):T',
y='count()'
).properties(width=700, height=200)
You can use the timestamp
field to filter results by date using the from
and to
parameters. For example, to get results from the year 2000 you'd use from=20000101
and to=20001231
. However, if you're using my functions above, you'll need to use from_
rather than from
as from
is a reserved word in Python. The function will change it back before sending to the CDX server.
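For example, to harvest just the captures of the NLA home page from the year 2000 using the functions above, you might do something like this (the variable names are just illustrative):
# Harvest captures from 2000 only – note from_ rather than from
results_2000 = query_cdx_with_key('http://www.nla.gov.au', from_='20000101', to='20001231')
df_2000 = pd.DataFrame(results_2000[1:], columns=results_2000[0])
print(df_2000['timestamp'].min(), df_2000['timestamp'].max())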
The timestamp
field can also be used with the collapse
parameter. If you include collapse=timestamp:4
, the server will look at the first four digits of the timestamp
– ie the year – and only include the first capture from that year. Similarly, collapse=timestamp:8
should give you a maximum of one capture per day. In reality, collapse
is dependent on the order of results and doesn't work perfectly – so you probably want to check your results for duplicates (Pandas .drop_duplicates()
makes this easy).
Let's test it out – if it works we should end up with a very boring bar chart showing one result per year...
# Get the data - note the `collapse` parameter
results_ts = query_cdx_with_key('http://www.nla.gov.au', collapse='timestamp:4')
# Convert to dataframe
df_ts = pd.DataFrame(results_ts[1:], columns=results_ts[0])
# Convert timestamp to date
df_ts['date'] = pd.to_datetime(df_ts['timestamp'], format='%Y%m%d%H%M%S')
# Chart number of results per year
alt.Chart(df_ts).mark_bar().encode(
x='year(date):T',
y='count()'
).properties(width=700, height=200)
As noted above, the original
field includes the actual url that was captured. You can use the filter
parameter and regular expressions to limit your results using the original
value. For example, to only get urls including www
you could use filter=original:https*://www.*
. Let's give it a try. Compare the results produced here to those above.
results_o = query_cdx_with_key('http://www.nla.gov.au', filter='original:https*://www.*')
# Convert to dataframe
df_o = pd.DataFrame(results_o[1:], columns=results_o[0])
df_o['original'].value_counts()
The mimetype
indicates the type of file captured. There's a long list of recognised media types, but you're only likely to meet a small subset of these in a web archive. The most common, of course, will be text/html
, but there will also be the various image formats, CSS and Javascript files, and other formats shared via the web like PDFs and Powerpoint files.
If you're not interested in all the extra bits and pieces, like CSS and Javascript, that make up a web page, you might want to use the filter
parameter to limit your query results to text/html
. You can also use regular expressions with filter
, so if you can't be bothered entering all the possible mimetypes for Powerpoint presentations, you could try something like filter=['mimetype:.*(powerpoint|presentation).*']
. This uses a regular expression to look for mimetype values that contain either 'powerpoint' or 'presentation'. Let's give it a try:
results_m = query_cdx_with_key('*.education.gov.au', filter=['mimetype:.*(powerpoint|presentation).*'])
df_m = pd.DataFrame(results_m[1:], columns=results_m[0])
df_m['mimetype'].value_counts()
One thing you might notice is that sometimes the mimetype
value doesn't seem to match the file extension. Let's try looking for captures with a text/html
mimetype, where the original
value ends in 'pdf'. We can do this by combining the filters mimetype:text/html
and original:.*\.pdf$
. Note that we're using a regular expression to find the '.pdf' extension.
results_m2 = query_cdx_with_key('naa.gov.au/*', filter=['mimetype:text/html', r'original:.*\.pdf$'])
df_m2 = pd.DataFrame(results_m2[1:], columns=results_m2[0])
df_m2.head()
It certainly looks a bit weird, but if we look at the status codes we see that most of these captures are actually redirections or errors, so the server's response is HTML even though the file requested was a PDF. We'll look more at status codes below.
df_m2['statuscode'].value_counts()
The statuscode is a standard code used by web servers to indicate the result of a file request. A code of 200
indicates everything was ok and the requested file was delivered. A code of 404
means the requested file couldn't be found. Let's look at all the status codes received when attempting to capture nla.gov.au
.
df['statuscode'].value_counts()
As we'd expect, most were ok (200
), but there were a couple of server errors (503
). The -
is not a standard status code – it's used in the archiving process to indicate that a duplicate of the file already exists in the archive. These captures also have a mimetype
of warc/revisit
. The 301
and 302
codes indicate that the original request was redirected. I look at this in more detail in another notebook, but it's worth thinking for a minute about what redirects are, how they are captured, and how they are played back by the Wayback Machine.
Sometimes files get moved around on web servers. To avoid a lot of 'not found' errors, servers can be configured to respond to requests for the old addresses with a 301
or 302
response that includes the new address. Browsers can then load the new address automatically without you even knowing that the page has moved. It's these exchanges between the server and browser (or web archiving bot) that are being captured and presented through the CDX archive.
When you try to look at one of these captures in the Wayback Machine, the captured redirect does what redirects are supposed to do and sends you off to the new address. However, in this case you're redirected to an archived version of the file at the new address from about the same time as the redirect was captured. The Wayback Machine does this by looking for the capture from the new address that is closest in date to the date of the redirect. There's no guarantee that the new address was captured immediately after the redirect was received, as happens in a normal web browser. As a result, the redirect might take you back or forward in time. Let's try an experiment. Here we take the first ten 302
responses from nla.gov.au
and compare the timestamp
of the captured redirect with the timestamp
of the page we're actually redirected to by the Wayback Machine.
results_s = query_cdx_with_key('nla.gov.au', filter='statuscode:302')
for capture in results_s[1:11]:
timestamp = capture[1]
redirect_date = arrow.get(timestamp, 'YYYYMMDDHHmmss')
response = requests.get(f'https://web.archive.org/web/{timestamp}id_/{capture[2]}')
capture_timestamp = re.search(r'web\/(\d{14})', response.url).group(1)
capture_date = arrow.get(capture_timestamp, 'YYYYMMDDHHmmss')
direction = 'later' if capture_date > redirect_date else 'earlier'
print(f'{redirect_date.humanize(other=capture_date, granularity=["hour", "minute", "second"], only_distance=True)} {direction}')
Does this matter? Probably not, but it's something to be aware of. When we're using something like the Wayback Machine it can seem like we're accessing the live web, but we're not – what we're seeing is an attempt to reconstruct a version of the live web from available captures.
The digest
is an algorithmically generated string that uniquely identifies the contents of the captured url. It's like the file's fingerprint, and it helps us to see when things change. It seems weird that you can represent the complete contents of a file in a short string, but there's nothing too mysterious about it. To create the digests, the captured files are first hashed using the SHA-1 algorithm, then the resulting hashes are encoded as Base 32. Try it!
print(b32encode(sha1('This is a string.'.encode()).digest()).decode())
One interesting thing about digests is that small changes to a page can result in very different digests. Let's try adding an exclamation mark to the string above.
print(b32encode(sha1('This is a string!'.encode()).digest()).decode())
Completely changed! So while digests can tell you two files are different, they can't tell you how different.
You can use the digest
field with the collapse
parameter to filter out identical captures, but this only works if the captures are next to each other in the index. As noted above, if you wanted to remove all duplicates, you'd probably need to use Pandas to process the harvested results.
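For example, here's a sketch of how you might use Pandas to remove exact duplicates (same url and same content) from the dataframe we created earlier:
# Drop captures with the same urlkey and digest, keeping the first of each –
# a stricter de-duplication than collapse=digest, which only works on adjacent rows
df_unique = df.drop_duplicates(subset=['urlkey', 'digest'], keep='first')
print(f'{df.shape[0]:,} captures, {df_unique.shape[0]:,} after removing duplicates')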
If we look again at our initial harvest you might notice something odd.
df.head(10)
Rows 1, 4, and 7 all have the same digest
, but the length
value is different. How can the files be longer, but the same? We'll look at length
next, but the answer is that the length
includes the response headers sent by the web server as well as the actual content of the file. The length of the headers might change depending on the context in which the file was requested, even though the file itself remains the same.
Using the digest
field we can find out how many of the captures in the nla.gov.au
result set are unique.
print(f'{len(df["digest"].unique()) / df.shape[0]:.2%} unique')
In theory, we should be able to use the digest
value to check that the file that was originally captured is the same as the file we can access now through the Wayback Machine. Let's give it a try!
for row in results[1:11]:
snapshot_url = f'https://web.archive.org/web/{row[1]}id_/http://www.nla.gov.au/'
response = requests.get(snapshot_url)
checksum = b32encode(sha1(response.content).digest())
print(f'Digests match? {checksum.decode() == row[5]}')
Hmm, so it seems we can't assume that pages preserved in the web archive will remain unchanged from the moment of capture, but the digest
does at least give us a way of checking.
You'd think length
would be pretty straightforward, but as noted above it includes the headers as well as the file. Also, it's the size of the file and headers as stored in compressed form on disk. As a result, the length might vary according to the technology used to store the capture. So length
gives us an indication of the original file size, but not an exact measure.
To use the length
field in calculations using Pandas, you'll need to make sure it's being stored as an integer.
df['length'] = df['length'].astype('int')
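Once it's stored as an integer, we can do some simple calculations – for example, the average and total (compressed) size of the captures:
# Remember these figures include headers and reflect the compressed size on disk
print(f'Average size: {df["length"].mean():.0f} bytes')
print(f'Total size: {df["length"].sum():,} bytes')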
Let's use the timestamp
, length
, and statuscode
fields to look at all the captures of http://nla.gov.au
.
alt.Chart(df).mark_point().encode(
x='date:T',
y='length:Q',
color='statuscode',
).properties(width=700)