Exploring the Internet Archive's CDX API

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

CDX APIs provide access to an index of resources captured by web archives. The results can be filtered in a number of ways, making them a convenient way of harvesting and exploring the holdings of web archives. This notebook focuses on the data you can obtain from the Internet Archive's CDX API. For more information on the differences between this and other CDX APIs see Comparing CDX APIs. To examine differences between CDX data and Timemaps see Timemaps vs CDX APIs.

Notebooks demonstrating ways of getting and using CDX data include:

In [1]:
import requests
import pandas as pd
from io import BytesIO
import altair as alt
import os
from base64 import b32encode
from hashlib import sha1
import arrow
import re
from tqdm.auto import tqdm

Useful resources

Your first CDX request

Let's have a look at the sort of data the CDX server gives us. At the very least, we have to provide a url parameter to point to a particular page (or domain as we'll see below). To avoid flinging too much data about, we'll also add a limit parameter that tells the CDX server how many rows of data to give us.

In [2]:
# 8 April 2020 -- without the 'User-Agent' header parameter I get a 445 error
# 27 April 2020 - now seems ok without changing User-Agent

# Feel free to change these values
params1 = {
    'url': 'http://nla.gov.au',
    'limit': 10
}

# Get the data and print the results
response = requests.get('https://web.archive.org/cdx/search/cdx', params=params1)
print(response.text)
au,gov,nla)/ 19961019064223 http://www.nla.gov.au:80/ text/html 200 M5ORM4XQ5QCEZEDRNZRGSWXPCOGUVASI 1135
au,gov,nla)/ 19961221102755 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1138
au,gov,nla)/ 19961221132358 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603
au,gov,nla)/ 19961223031839 http://www2.nla.gov.au:80/ text/html 200 6XHDP66AXEPMVKVROHHDN6CPZYHZICEX 457
au,gov,nla)/ 19970212053405 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1141
au,gov,nla)/ 19970215222554 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603
au,gov,nla)/ 19970315230640 http://www.nla.gov.au:80/ text/html 200 NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB 1126
au,gov,nla)/ 19970315230640 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1140
au,gov,nla)/ 19970413005246 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603
au,gov,nla)/ 19970418074154 http://www.nla.gov.au:80/ text/html 200 NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB 1123

By default, the results are returned in a simple text format – fields are separated by spaces, and each result is on a separate line. It's a bit hard to read in this format, so let's add the output parameter to get the results in JSON format. We'll then use Pandas to display the results in a table.

In [3]:
params2 = {
    'url': 'http://nla.gov.au',
    'limit': 10,
    'output': 'json'
}

# Get the data and print the results
response = requests.get('https://web.archive.org/cdx/search/cdx', params=params2)
results = response.json()

# Use Pandas to turn the results into a DataFrame then display
pd.DataFrame(results[1:], columns=results[0]).head(10)
Out[3]:
urlkey timestamp original mimetype statuscode digest length
0 au,gov,nla)/ 19961019064223 http://www.nla.gov.au:80/ text/html 200 M5ORM4XQ5QCEZEDRNZRGSWXPCOGUVASI 1135
1 au,gov,nla)/ 19961221102755 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1138
2 au,gov,nla)/ 19961221132358 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603
3 au,gov,nla)/ 19961223031839 http://www2.nla.gov.au:80/ text/html 200 6XHDP66AXEPMVKVROHHDN6CPZYHZICEX 457
4 au,gov,nla)/ 19970212053405 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1141
5 au,gov,nla)/ 19970215222554 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603
6 au,gov,nla)/ 19970315230640 http://www.nla.gov.au:80/ text/html 200 NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB 1126
7 au,gov,nla)/ 19970315230640 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1140
8 au,gov,nla)/ 19970413005246 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603
9 au,gov,nla)/ 19970418074154 http://www.nla.gov.au:80/ text/html 200 NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB 1123

The JSON results are, in Python terms, a list of lists, rather than a list of dictionaries. The first of these lists contains the field names. If you look at the line below, you'll see that we use the first list (results[0]) to set the column names in the dataframe, while the rest of the data (results[1:]) makes up the rows.

pd.DataFrame(results[1:], columns=results[0]).head(10)

Let's have a look at the fields.

  • urlkey – the page url expressed as a SURT (Sort-friendly URI Reordering Transform)
  • timestamp – the date and time of the capture in a YYYYMMDDhhmmss format
  • original – the url that was captured
  • mimetype – the type of file captured, expressed in a standard format
  • statuscode – a standard code provided by the web server that reports on the result of the capture request
  • digest – also known as a 'checksum' or 'fingerprint', the digest provides an algorithmically generated string that uniquely identifies the content of the captured url
  • length – the size of the captured content in bytes (compressed on disk)

All makes perfect sense right? Hmmm, we'll dig a little deeper below, but first...

Requesting a particular capture

We can use the timestamp value to retrieve the contents of a particular capture. A url like this will open the captured resource in the Wayback Machine:

https://web.archive.org/web/[timestamp]/[url]

For example: https://web.archive.org/web/20130201130329/http://www.nla.gov.au/

If you want the original contents, without the modifications and navigation added by the Wayback Machine, just add id_ after the timestamp:

https://web.archive.org/web/[timestamp]id_/[url]

For example: https://web.archive.org/web/20130201130329id_/http://www.nla.gov.au/

You'll probably notice that the original version doesn't look very pretty because links to CSS or Javascript files are still pointing to their old, broken addresses. If you want a version without the Wayback Machine navigation, but with urls to any linked files rewritten to point to archived versions, then add if_ after the timestamp.

https://web.archive.org/web/[timestamp]if_/[url]

For example: https://web.archive.org/web/20130201130329if_/http://www.nla.gov.au/
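
To see how these patterns connect to the CDX data, here's a minimal sketch that builds each form of url from the first capture in the results list we retrieved above (it just constructs the urls – nothing is downloaded):

In [ ]:
# results[1] is the first data row: [urlkey, timestamp, original, mimetype, statuscode, digest, length]
timestamp = results[1][1]
original = results[1][2]
print(f'Wayback Machine view: https://web.archive.org/web/{timestamp}/{original}')
print(f'Original content:     https://web.archive.org/web/{timestamp}id_/{original}')
print(f'Rewritten links:      https://web.archive.org/web/{timestamp}if_/{original}')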

Getting all the captures of a particular page

If you want to get all the captures of a particular page, you can just leave out the limit parameter. However, there is (supposedly) a limit on the number of results returned in a single request. The API documentation says the current limit is 150,000, but it seems much larger – if you ask for cnn.com without using limit you get more than 290,000 results! To try to make sure that you're getting everything, there are a couple of ways you can break the result set up into chunks. The first is to set the showResumeKey parameter to true. Then, if there are more results available than are returned in your initial request, a couple of extra rows of data will be added to your results. The last row will include a resumption key, while the second-last row will be empty, for example:

[], 
    ['com%2Ccnn%29%2F+20000621011732']

You then set the resumeKey parameter to the value of the resumption key, and add it to your next request. You can combine the use of the resumption key with the limit parameter to break a large collection of captures into manageable chunks.

The other way is to add a page parameter, starting at 0 and then incrementing the page value by one until you've worked through the complete set of results. But how do you know the total number of pages? If you add showNumPages=true to your query, the server will return a single number representing the total number of pages. But the pages themselves come from a special index and can contain different numbers of results depending on your query, so there's no obvious way to calculate the number of captures from the number of pages. Also, the maximum size of a page seems quite large, and this sometimes causes errors. You can control this by adding a pageSize parameter. The meaning of this value seems a bit mysterious, but I've found that a pageSize of 5 seems to be a reasonable balance between the amount of data returned by each request and the number of requests you have to make.

Let's put all this together in a few functions that will help us construct CDX queries of any size or complexity.

In [4]:
def check_for_resumption_key(results):
    '''
    Checks to see if the second-last row is an empty list;
    if it is, returns the last value as the resumption key.
    '''
    if not results[-2]:
        return results[-1][0]

def get_total_pages(params):
    '''
    Gets the total number of pages in a set of results.
    '''
    these_params = params.copy()
    these_params['showNumPages'] = 'true'
    response = requests.get('http://web.archive.org/cdx/search/cdx', params=these_params, headers={'User-Agent': ''})
    return int(response.text)

def prepare_params(url, use_resume_key=False, **kwargs):
    '''
    Prepare the parameters for a CDX API request.
    Adds all supplied keyword arguments as parameters (changing from_ to from).
    Adds in a few necessary parameters and showResumeKey if requested.
    '''
    params = kwargs
    params['url'] = url
    params['output'] = 'json'
    if use_resume_key:
        params['showResumeKey'] = 'true'
    # CDX accepts a 'from' parameter, but this is a reserved word in Python
    # Use 'from_' to pass the value to the function & here we'll change it back to 'from'.
    if 'from_' in params:
        params['from'] = params['from_']
        del(params['from_'])
    return params

def get_cdx_data(params):
    '''
    Make a request to the CDX API using the supplied parameters.
    Check the results for a resumption key, and return the key (if any) and the results.
    '''
    response = requests.get('http://web.archive.org/cdx/search/cdx', params=params, headers={'User-Agent': ''})
    response.raise_for_status()
    results = response.json()
    resumption_key = check_for_resumption_key(results)
    # Remove the resumption key from the results
    if resumption_key:
        results = results[:-2]
    return resumption_key, results

def query_cdx_by_page(url, **kwargs):
    '''
    Harvest results from the CDX API using the supplied parameters.
    Uses showNumPages to find the total number of pages,
    then loops through the pages until all results are downloaded.
    '''
    all_results = []
    page = 0
    params = prepare_params(url, **kwargs)
    total_pages = get_total_pages(params)
    with tqdm(total=total_pages-page) as pbar1:
        with tqdm() as pbar2:
            while page < total_pages:
                params['page'] = page
                _, results = get_cdx_data(params)
                if page == 0:
                    all_results += results
                else:
                    all_results += results[1:]
                page += 1
                pbar1.update(1)
                pbar2.update(len(results) - 1)
    return all_results

def query_cdx_with_key(url, **kwargs):
    '''
    Harvest results from the CDX API using the supplied parameters.
    Uses showResumeKey to check if there are more than one page of results,
    and if so loops through pages until all results are downloaded.
    '''
    params = prepare_params(url, use_resume_key=True, **kwargs)
    with tqdm() as pbar:
        # This will include the header row
        resumption_key, all_results = get_cdx_data(params)
        pbar.update(len(all_results) - 1)
        while resumption_key is not None:
            params['resumeKey'] = resumption_key
            resumption_key, results = get_cdx_data(params)
            # Remove the header row and add
            all_results += results[1:]
            pbar.update(len(results) - 1)
    return all_results

To harvest all of the captures of 'http://www.nla.gov.au', you can just call:

results = query_cdx_with_key('http://www.nla.gov.au')

To break the harvest down into chunks of 1,000 results at a time, you'd call:

results = query_cdx_with_key('http://www.nla.gov.au', limit=1000)
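
If you'd rather harvest using the page parameter described above, a sketch of the equivalent call to query_cdx_by_page might look like this (the pageSize of 5 is just the value suggested earlier):

results_paged = query_cdx_by_page('http://www.nla.gov.au', pageSize=5)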

There are a number of other parameters you can use to filter results from the CDX API, you can supply any of these as well. We'll see some examples below.

So let's get all the captures of 'http://www.nla.gov.au'.

In [5]:
results = query_cdx_with_key('http://www.nla.gov.au')

And convert them into a dataframe.

In [6]:
df = pd.DataFrame(results[1:], columns=results[0])

How many captures are there?

In [7]:
df.shape
Out[7]:
(3581, 7)

Ok, now we've got a dataset, let's look at the structure of the data in a little more detail.

CDX data in depth

SURTs, urlkeys, & urls

As noted above, the urlkey field contains things that are technically known as SURTs (Sort-friendly URI Reordering Transform). Basically, the order of components in the url's domain is reversed to make captures easier to sort and group. So instead of nla.gov.au we have au,gov,nla. The path component of the url, the bit that points to a specific file within the domain, is tacked on the end of the urlkey after a closing bracket. Here are some examples:

http://www.nla.gov.au becomes au,gov,nla plus the path /, so the urlkey is:

au,gov,nla)/

http://www.dsto.defence.gov.au/attachments/9%20LEWG%20Oct%2008%20DEU.ppt becomes au,gov,defence,dsto plus the path /attachments/9%20lewg%20oct%2008%20deu.ppt, so the urlkey is:

au,gov,defence,dsto)/attachments/9%20lewg%20oct%2008%20deu.ppt

From the examples above, you'll notice there's a bit of extra normalisation going on. For example, the url components are all converted to lowercase. You might also be wondering what happened to the www subdomain. By convention these are aliases that just point to the underlying domain – www.nla.gov.au ends up at the same place as nla.gov.au – so they're removed from the SURT. We can explore this a bit further by comparing the original urls in our dataset to the urlkeys.
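
As an aside, if you want to generate SURTs yourself, the Internet Archive publishes a surt package on PyPI that applies similar transformations. It's not one of the imports above, so treat this as an optional sketch (you'd need to pip install surt first, and the CDX index's own normalisation might differ in some details):

In [ ]:
# Optional: requires the third-party 'surt' package (pip install surt)
from surt import surt

print(surt('http://www.nla.gov.au'))
print(surt('http://www.dsto.defence.gov.au/attachments/9%20LEWG%20Oct%2008%20DEU.ppt'))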

How many unique urlkeys are there? Hopefully just one, as we're gathering captures from a single url!

In [8]:
df['urlkey'].unique().shape[0]
Out[8]:
1

But how many different original urls were captured?

In [9]:
df['original'].unique().shape[0]
Out[9]:
17

Let's have a look at them.

In [10]:
df['original'].value_counts()
Out[10]:
https://www.nla.gov.au/                  1061
http://www.nla.gov.au/                    937
http://www.nla.gov.au:80/                 868
http://nla.gov.au/                        533
http://nla.gov.au:80/                      77
https://nla.gov.au/                        41
http://www.nla.gov.au//                    18
http://www2.nla.gov.au:80/                 11
http://www.nla.gov.au                      11
https://www.nla.gov.au                     11
http://[email protected]/                    6
http://www.nla.gov.au:80/?                  2
http://www.nla.gov.au:80//                  1
http://mailto:[email protected]/       1
http://www.nla.gov.au./                     1
http://mailto:[email protected]/               1
http://nla.gov.au                           1
Name: original, dtype: int64

So we can see that as well as removing www, the normalisation process removes www2 and port numbers, and groups together the http and https protocols. There are also some odd things that look like email addresses and were probably harvested by mistake from mailto links.

But wait a minute – our query was just for the url http://www.nla.gov.au, so why did we get all these other urls? When we request a particular url from the CDX API, it matches results based on the url's SURT, not on the original url. This ensures that we get all the variations in the way the url might be expressed. If we want to limit results to a specific form of the url, we can do that by filtering on the original field, as we'll see below.

Because the urlkey is essentially a normalised identifier for an individual url, you can use it to group together all the captures of individual pages across a whole domain. For example, if we wanted to know how many urls have been captured from the nla.gov.au domain, we can call our query function like this:

results = query_cdx_with_key('nla.gov.au/*', collapse='urlkey', limit=1000)

Note that the url parameter includes a * to indicate that we want everything under the nla.gov.au domain. The collapse='urlkey' parameter says that we only want unique urlkey values – so we'll get just one capture for each individual url within the nla.gov.au domain. This can be a useful way of gathering a domain-level summary.
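
Here's a sketch of what such a domain-level harvest might look like using the query_cdx_with_key function defined above (a whole-of-domain harvest can take a while – the limit parameter just breaks it into chunks of 1,000 results):

In [ ]:
# One capture per unique urlkey under the nla.gov.au domain
results_domain = query_cdx_with_key('nla.gov.au/*', collapse='urlkey', limit=1000)
df_domain = pd.DataFrame(results_domain[1:], columns=results_domain[0])
print(f'{df_domain["urlkey"].nunique():,} unique urls captured from the nla.gov.au domain')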

Timestamps

The timestamp field is pretty straightforward, it contains the date and time of the capture expressed in the format YYYYMMDDhhmmss. Once we have the harvested results in a dataframe, we can easily convert the timestamps into a datetime object.

In [11]:
df['date'] = pd.to_datetime(df['timestamp'], format='%Y%m%d%H%M%S')

This makes it possible to plot the number of captures over time. Here we group the captures by year.

In [12]:
alt.Chart(df).mark_bar().encode(
    x='year(date):T',
    y='count()'
).properties(width=700, height=200)
Out[12]:

You can use the timestamp field to filter results by date using the from and to parameters. For example, to get results from the year 2000 you'd use from=20000101 and to=20001231. However, if you're using my functions above, you'll need to use from_ rather than from, as from is a reserved word in Python. The function will change it back before sending the query to the CDX server.
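
For example, here's a minimal sketch that harvests just the captures made during 2000 (note the use of from_):

In [ ]:
# Captures of the NLA home page made during the year 2000
# 'from_' is converted back to 'from' by prepare_params() before the request is sent
results_2000 = query_cdx_with_key('http://www.nla.gov.au', from_='20000101', to='20001231')
df_2000 = pd.DataFrame(results_2000[1:], columns=results_2000[0])
df_2000['timestamp'].min(), df_2000['timestamp'].max()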

The timestamp field can also be used with the collapse parameter. If you include collapse=timestamp:4, the server will look at the first four digits of the timestamp – ie the year – and only include the first capture from that year. Similarly, collapse=timestamp:8 should give you a maximum of one capture per day. In reality, collapse is dependent on the order of results and doesn't work perfectly – so you probably want to check your results for duplicates (Pandas .drop_duplicates() makes this easy).

Let's test it out – if it works we should end up with a very boring bar chart showing one result per year...

In [13]:
# Get the data - note the `collapse` parameter
results_ts = query_cdx_with_key('http://www.nla.gov.au', collapse='timestamp:4')

# Convert to dataframe
df_ts = pd.DataFrame(results_ts[1:], columns=results_ts[0])

# Convert timestamp to date
df_ts['date'] = pd.to_datetime(df_ts['timestamp'], format='%Y%m%d%H%M%S')

# Chart number of results per year
alt.Chart(df_ts).mark_bar().encode(
    x='year(date):T',
    y='count()'
).properties(width=700, height=200)
Out[13]:

Original

As noted above, the original field includes the actual url that was captured. You can use the filter parameter and regular expressions to limit your results using the original value. For example, to only get urls including www you could use filter=original:https*://www.*. Let's give it a try. Compare the results produced here to those above.

In [14]:
results_o = query_cdx_with_key('http://www.nla.gov.au', filter='original:https*://www.*')

# Convert to dataframe
df_o = pd.DataFrame(results_o[1:], columns=results_o[0])

df_o['original'].value_counts()
Out[14]:
https://www.nla.gov.au/       1061
http://www.nla.gov.au/         937
http://www.nla.gov.au:80/      868
http://www.nla.gov.au//         18
http://www2.nla.gov.au:80/      11
http://www.nla.gov.au           11
https://www.nla.gov.au          11
http://www.nla.gov.au:80/?       2
http://www.nla.gov.au./          1
http://www.nla.gov.au:80//       1
Name: original, dtype: int64

Mimetypes

The mimetype indicates the type of file captured. There's a long list of recognised media types, but you're only likely to meet a small subset of these in a web archive. The most common, of course, will be text/html, but there will also be the various image formats, CSS and Javascript files, and other formats shared via the web like PDFs and Powerpoint files.

If you're not interested in all the extra bits and pieces, like CSS and Javascript, that make up a web page, you might want to use the filter parameter to limit your query results to text/html. You can also use regular expressions with filter, so if you can't be bothered entering all the possible mimetypes for Powerpoint presentations, you could try something like filter=['mimetype:.*(powerpoint|presentation).*']. This uses a regular expression to look for mimetype values that contain either 'powerpoint' or 'presentation'. Let's give it a try:

In [15]:
results_m = query_cdx_with_key('*.education.gov.au', filter=['mimetype:.*(powerpoint|presentation).*'])
df_m = pd.DataFrame(results_m[1:], columns=results_m[0])
df_m['mimetype'].value_counts()
Out[15]:
application/vnd.openxmlformats-officedocument.presentationml.presentation    117
application/vnd.ms-powerpoint.presentation.12                                 82
application/vnd.openxmlformats-officedocument.presentationml.slideshow         3
application/vnd.ms-powerpoint.show.12                                          3
Name: mimetype, dtype: int64

One thing you might notice is that sometimes the mimetype value doesn't seem to match the file extension. Let's try looking for captures with a text/html mimetype, where the original value ends in 'pdf'. We can do this by combining the filters mimetype:text/html and original:.*\.pdf$. Note that we're using a regular expression to find the '.pdf' extension.

In [16]:
results_m2 = query_cdx_with_key('naa.gov.au/*', filter=['mimetype:text/html', r'original:.*\.pdf$'])
df_m2 = pd.DataFrame(results_m2[1:], columns=results_m2[0])
df_m2.head()
Out[16]:
urlkey timestamp original mimetype statuscode digest length
0 au,gov,naa)/.../an-approach-green-paper_tcm16-... 20140924080305 http://www.naa.gov.au/.../An-approach-Green-Pa... text/html 302 RBAUTMMEDESHYHSQ5PCUWUILGZSLFOIR 902
1 au,gov,naa)/.../digital-preservation-software-... 20141011134912 http://www.naa.gov.au/.../Digital-Preservation... text/html 302 2EKIQ2YLXTDK5CS4VPFJMEZGNMATFEDG 913
2 au,gov,naa)/.../holt.pdf 20141010120041 http://www.naa.gov.au/.../holt.pdf text/html 302 GDGHFNKCSTNJENDMGLHQWNVXUIAIUZYN 880
3 au,gov,naa)/.../horrie_tcm16-36799.pdf 20141011001020 http://www.naa.gov.au/.../horrie_tcm16-36799.pdf text/html 302 IMGTBT5B33WIEHQ7E6MDMIHBW3UMI2MQ 889
4 au,gov,naa)/.../memento36.pdf 20141010012935 http://www.naa.gov.au/.../memento36.pdf text/html 302 ZNNRD65HNWIR6YYAA4PWJ3EG6E7762UX 882

It certainly looks a bit weird, but if we look at the status codes we see that most of these captures are actually redirections or errors, so the server's response is HTML even though the file requested was a PDF. We'll look more at status codes below.

In [17]:
df_m2['statuscode'].value_counts()
Out[17]:
404    641
302    620
200      7
Name: statuscode, dtype: int64

Status code

This is a standard code used by web servers to indicate the result of a file request. A code of 200 indicates everything was ok and the requested file was delivered. A code of 404 means the requested file couldn't be found. Let's look at all the status codes received when attempting to capture nla.gov.au.

In [18]:
df['statuscode'].value_counts()
Out[18]:
200    2531
301     515
302     291
-       241
503       3
Name: statuscode, dtype: int64

As we'd expect, most were ok (200), but there were a few server errors (503). The - is not a standard status code; it's used in the archiving process to indicate that a duplicate of the file already exists in the archive – these captures also have a mimetype of warc/revisit. The 301 and 302 codes indicate that the original request was redirected. I look at this in more detail in another notebook, but it's worth thinking for a minute about what redirects are, how they are captured, and how they are played back by the Wayback Machine.

Sometimes files get moved around on web servers. To avoid a lot of 'not found' errors, servers can be configured to respond to requests for the old addresses with a 301 or 302 response that includes the new address. Browsers can then load the new address automatically without you even knowing that the page has moved. It's these exchanges between the server and browser (or web archiving bot) that are being captured and presented through the CDX archive.

When you try to look at one of these captures in the Wayback Machine, the captured redirect does what redirects are supposed to do and sends you off to the new address. However, in this case you're redirected to an archived version of the file at the new address from about the same time as the redirect was captured. The Wayback Machine does this by looking for the capture from the new address that is closest in date to the date of the redirect. There's no guarantee that the new address was captured immediately after the redirect was received, as happens in a normal web browser. As a result, the redirect might take you back or forward in time. Let's try an experiment. Here we take the first ten 302 responses from nla.gov.au and compare the timestamp of the captured redirect with the timestamp of the page we're actually redirected to by the Wayback Machine.

In [19]:
results_s = query_cdx_with_key('nla.gov.au', filter='statuscode:302')
for capture in results_s[1:11]:
    timestamp = capture[1]
    redirect_date = arrow.get(timestamp, 'YYYYMMDDHHmmss')
    response = requests.get(f'https://web.archive.org/web/{timestamp}id_/{capture[2]}')
    capture_timestamp = re.search(r'web\/(\d{14})', response.url).group(1)
    capture_date = arrow.get(capture_timestamp, 'YYYYMMDDHHmmss')
    direction = 'later' if capture_date > redirect_date else 'earlier'
    print(f'{redirect_date.humanize(other=capture_date, granularity=["hour", "minute", "second"], only_distance=True)} {direction}')
12 hours 13 minutes and 8 seconds later
0 hours 0 minutes and 0 seconds earlier
98 hours 16 minutes and 30 seconds later
15 hours 12 minutes and 8 seconds later
2 hours 6 minutes and 35 seconds earlier
2 hours 2 minutes and 31 seconds earlier
8 hours 7 minutes and 21 seconds earlier
3 hours 43 minutes and 9 seconds later
an hour 6 minutes and 56 seconds later
7 hours 54 minutes and 47 seconds earlier

Does this matter? Probably not, but it's something to be aware of. When we're using something like the Wayback Machine it can seem like we're accessing the live web, but we're not – what we're seeing is an attempt to reconstruct a version of the live web from available captures.

Digest

The digest is an algorithmically generated string that uniquely identifies the contents of the captured url. It's like the file's fingerprint, and it helps us to see when things change. It seems weird that you can represent the complete contents of a file in a short string, but there's nothing too mysterious about it. To create the digests, the contents of each capture are hashed using the SHA-1 algorithm, and the resulting hash is encoded as Base 32. Try it!

In [20]:
print(b32encode(sha1('This is a string.'.encode()).digest()).decode())
3VQDI552JQRW5ROPWTSKINAWFWGWQ6CQ

One interesting thing about digests is that small changes to a page can result in very different digests. Let's try adding an exclamation mark to the string above.

In [21]:
print(b32encode(sha1('This is a string!'.encode()).digest()).decode())
MWTI7PY7WJDIBYQKZ2P2Y5UA75UWOSYR

Completely changed! So while digests can tell you two files are different, they can't tell you how different.

You can use the digest field with the collapse parameter to filter out identical captures, but this only works if the captures are next to each other in the index. As noted above, if you wanted to remove all duplicates, you'd probably need to use Pandas to process the harvested results.
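
For example, here's a minimal sketch using the dataframe we harvested above – it keeps just the earliest capture of each distinct version of the page:

In [ ]:
# Keep only the first capture of each unique digest (ie each distinct version of the page)
df_unique = df.sort_values('timestamp').drop_duplicates(subset='digest', keep='first')
print(f'{df.shape[0]} captures, {df_unique.shape[0]} unique versions')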

If we look again at our initial harvest you might notice something odd.

In [22]:
df.head(10)
Out[22]:
urlkey timestamp original mimetype statuscode digest length date
0 au,gov,nla)/ 19961019064223 http://www.nla.gov.au:80/ text/html 200 M5ORM4XQ5QCEZEDRNZRGSWXPCOGUVASI 1135 1996-10-19 06:42:23
1 au,gov,nla)/ 19961221102755 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1138 1996-12-21 10:27:55
2 au,gov,nla)/ 19961221132358 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603 1996-12-21 13:23:58
3 au,gov,nla)/ 19961223031839 http://www2.nla.gov.au:80/ text/html 200 6XHDP66AXEPMVKVROHHDN6CPZYHZICEX 457 1996-12-23 03:18:39
4 au,gov,nla)/ 19970212053405 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1141 1997-02-12 05:34:05
5 au,gov,nla)/ 19970215222554 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603 1997-02-15 22:25:54
6 au,gov,nla)/ 19970315230640 http://www.nla.gov.au:80/ text/html 200 NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB 1126 1997-03-15 23:06:40
7 au,gov,nla)/ 19970315230640 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1140 1997-03-15 23:06:40
8 au,gov,nla)/ 19970413005246 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603 1997-04-13 00:52:46
9 au,gov,nla)/ 19970418074154 http://www.nla.gov.au:80/ text/html 200 NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB 1123 1997-04-18 07:41:54

Rows 1, 4, and 7 all have the same digest, but the length value is different. How can the files have different lengths, but be the same? We'll look at length next, but the answer is that the length includes the response headers sent by the web server as well as the actual content of the file. The length of the headers might change depending on the context in which the file was requested, even though the file itself remains the same.
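
We can check how often this happens with a quick groupby on the harvested dataframe – for each digest, how many different length values were recorded?

In [ ]:
# For each digest, count the number of distinct 'length' values,
# then see how many digests have 1, 2, 3... different lengths
df.groupby('digest')['length'].nunique().value_counts()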

Using the digest field we can find out how many of the captures in the nla.gov.au result set are unique.

In [23]:
print(f'{len(df["digest"].unique()) / df.shape[0]:.2%} unique')
50.21% unique

In theory, we should be able to use the digest value to check that the file that was originally captured is the same as the file we can access now through the Wayback Machine. Let's give it a try!

In [24]:
for row in results[1:11]:
    snapshot_url = f'https://web.archive.org/web/{row[1]}id_/http://www.nla.gov.au/'
    response = requests.get(snapshot_url)
    checksum = b32encode(sha1(response.content).digest())
    print(f'Digests match? {checksum.decode() == row[5]}')
Digests match? True
Digests match? True
Digests match? True
Digests match? True
Digests match? True
Digests match? True
Digests match? True
Digests match? False
Digests match? True
Digests match? True

Hmm, so it seems we can't assume that pages preserved in the web archive will remain unchanged from the moment of capture, but the digest does at least give us a way of checking.

Length

You'd think length would be pretty straightforward, but as noted above it includes the headers as well as the file. Also, it's the size of the file and headers stored in compressed form on disk. As a result the length might vary according to the technology used to store the capture. So length gives us an indication of the original file size, but not an exact measure.

To use the length field in calculations using Pandas, you'll need to make sure it's being stored as an integer.

In [25]:
df['length'] = df['length'].astype('int')

Putting it all together

Let's use the timestamp, length, and statuscode fields to look at all the captures of http://nla.gov.au.

In [26]:
alt.Chart(df).mark_point().encode(
    x='date:T',
    y='length:Q',
    color='statuscode',
).properties(width=700)
Out[26]: