Observing change in a web page over time

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

This notebook explores what we can find when we look at all captures of a single page over time.

Work in progress – this notebook isn't finished yet. Check back later for more...

In [1]:
import requests
import pandas as pd
import altair as alt
import re
from difflib import HtmlDiff
from IPython.display import display, HTML
import arrow
# alt.renderers.enable('default')
In [2]:
def query_cdx(url, **kwargs):
    # Query the IA CDX API for captures of the given url and return the JSON results
    params = kwargs
    params['url'] = url
    params['output'] = 'json'
    response = requests.get('http://web.archive.org/cdx/search/cdx', params=params, headers={'User-Agent': ''})
    response.raise_for_status()
    return response.json()
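
Any extra keyword arguments are passed straight through to the CDX API as query parameters, so you can make use of the API's standard options. A quick sketch, assuming the documented filter and limit parameters:

# For example, fetch only the first ten captures with a 200 status code
sample = query_cdx('http://nla.gov.au', filter='statuscode:200', limit=10)
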
In [3]:
url = 'http://nla.gov.au'

Getting the data

In this example we're using the Internet Archive's CDX API, but this could easily be adapted to use Timemaps from a range of Memento-compliant repositories.
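
For instance, here's a minimal sketch of fetching a Timemap instead – this assumes the Wayback Machine's link-format Timemap endpoint, but other Memento-compliant repositories expose a similar pattern:

import requests

def query_timemap(url):
    # Request the link-format Timemap listing all captures (mementos) of a url
    response = requests.get(f'http://web.archive.org/web/timemap/link/{url}')
    response.raise_for_status()
    return response.text

# Each line of the Timemap describes one memento
timemap = query_timemap('http://nla.gov.au')
print(timemap.splitlines()[:5])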

In [4]:
data = query_cdx(url)

# Convert to a dataframe
# The column names are in the first row
df = pd.DataFrame(data[1:], columns=data[0])

# Convert the CDX timestamp string (YYYYMMDDhhmmss) into a datetime object
df['date'] = pd.to_datetime(df['timestamp'], format='%Y%m%d%H%M%S')
df.sort_values(by='date', inplace=True, ignore_index=True)

# Convert the length from a string into an integer
df['length'] = df['length'].astype('int')

As noted in the notebook comparing the CDX API with Timemaps, there are a number of duplicate snapshots in the CDX results, so let's remove them.

In [5]:
print(f'Before: {df.shape[0]}')
df.drop_duplicates(subset=['timestamp', 'original', 'digest', 'statuscode', 'mimetype'], keep='first', inplace=True)
print(f'After: {df.shape[0]}')
Before: 3581
After: 3439
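
The CDX API can also collapse adjacent duplicates on the server side via its collapse parameter. A quick sketch using the query_cdx function defined above:

# Ask the CDX server to drop adjacent captures with an identical content digest
collapsed = query_cdx(url, collapse='digest')
# The first row of the results is the list of column names
print(f'Rows returned with collapse=digest: {len(collapsed) - 1}')

Note that collapse only removes adjacent duplicates, so the count won't necessarily match the drop_duplicates result above.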

The basic shape

In [6]:
df['date'].min()
Out[6]:
Timestamp('1996-10-19 06:42:23')
In [7]:
df['date'].max()
Out[7]:
Timestamp('2021-05-09 04:55:17')
In [8]:
df['length'].describe()
Out[8]:
count     3439.000000
mean      7010.414074
std       5664.637406
min        235.000000
25%        603.500000
50%       5438.000000
75%      11777.000000
max      16005.000000
Name: length, dtype: float64
In [9]:
df['statuscode'].value_counts()
Out[9]:
200    2444
301     485
302     287
-       220
503       3
Name: statuscode, dtype: int64
In [10]:
df['mimetype'].value_counts()
Out[10]:
text/html       3217
warc/revisit     220
unk                2
Name: mimetype, dtype: int64

Plotting snapshots over time

In [11]:
# This is just a bit of fancy customisation to assign consistent colors to the status codes
# See https://altair-viz.github.io/user_guide/customization.html#customizing-colors
domain = ['-', '200', '301', '302', '404', '503']
# grey for revisit records, green for ok, blues for redirects, reds for errors
range_ = ['#888888', '#39a035', '#5ba3cf', '#125ca4', '#e13128', '#b21218']

alt.Chart(df).mark_point().encode(
    x='date:T',
    y='length:Q',
    color=alt.Color('statuscode', scale=alt.Scale(domain=domain, range=range_)),
    tooltip=['date', 'length', 'statuscode']
).properties(width=700, height=300)
Out[11]:
[Altair scatter plot: capture length over time, coloured by status code]

Looking at domains, protocols, and redirects

Looking at the chart above, it's hard to understand why a request for the page is sometimes redirected and sometimes not. To understand this we have to look a bit more closely at what urls are actually being captured. Let's look at the breakdown of values in the original column – these are the urls being requested by the archiving bot.

In [12]:
df['original'].value_counts()
Out[12]:
https://www.nla.gov.au/                  990
http://www.nla.gov.au/                   893
http://www.nla.gov.au:80/                868
http://nla.gov.au/                       510
http://nla.gov.au:80/                     77
https://nla.gov.au/                       39
http://www.nla.gov.au//                   18
http://www.nla.gov.au                     11
http://www2.nla.gov.au:80/                10
https://www.nla.gov.au                    10
http://[email protected]/                   6
http://www.nla.gov.au:80/?                 2
http://www.nla.gov.au:80//                 1
http://mailto:[email protected]/              1
http://www.nla.gov.au./                    1
http://nla.gov.au                          1
http://mailto:[email protected]/      1
Name: original, dtype: int64

Ah ok, so there's actually a mix of things in here – some include the 'www' prefix and some don't, some use the 'https' protocol and some just plain old 'http'. There's also a bit of junk in there from badly parsed mailto links. To look at the differences in more detail, let's create new columns for subdomain and protocol.

In [13]:
base_domain = re.search(r'https*:\/\/(\w*)\.', url).group(1)
df['subdomain'] = df['original'].str.extract(r'^https*:\/\/(\w*)\.{}\.'.format(base_domain), flags=re.IGNORECASE)
df['subdomain'].fillna('', inplace=True)
df['subdomain'].value_counts()
Out[13]:
www     2794
         635
www2      10
Name: subdomain, dtype: int64
In [14]:
df['protocol'] = df['original'].str.extract(r'^(https*):')
df['protocol'].value_counts()
Out[14]:
http     2400
https    1039
Name: protocol, dtype: int64

Change in protocol

Let's see how the proportion of requests using each protocol changes over time. Here we're grouping the rows by year.

In [15]:
alt.Chart(df).mark_bar().encode(
    x='year(date):T',
    y=alt.Y('count()', stack='normalize'),
    color='protocol:N',
    #tooltip=['date', 'length', 'subdomain:N']
).properties(width=700, height=200)
Out[15]:
[Altair bar chart: yearly proportion of captures using http vs https]
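
If you'd rather see the underlying numbers than a chart, you can build a similar yearly breakdown directly in Pandas – a quick sketch:

# Count captures per year and protocol, then convert each year's counts to proportions
protocols_by_year = (
    df.groupby([df['date'].dt.year, 'protocol'])
      .size()
      .unstack(fill_value=0)
)
protocols_by_year.div(protocols_by_year.sum(axis=1), axis=0).round(3)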