Observing change in a web page over time

New to Jupyter notebooks? Try the 'Using Jupyter notebooks' notebook for a quick introduction.

This notebook explores what we can find when we look at all captures of a single page over time.

Work in progress – this notebook isn't finished yet. Check back later for more...

In [71]:
import requests
import pandas as pd
import altair as alt
import re
from difflib import HtmlDiff
from IPython.display import display, HTML
import arrow
In [72]:
def query_cdx(url, **kwargs):
    # Any additional keyword arguments are passed to the CDX API as query parameters
    params = kwargs
    params['url'] = url
    params['output'] = 'json'
    response = requests.get('http://web.archive.org/cdx/search/cdx', params=params, headers={'User-Agent': ''})
    response.raise_for_status()
    return response.json()
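
Any extra keyword arguments passed to query_cdx are forwarded to the CDX API as query parameters, so the same function can be used with options like limit or filter. For example (a minimal sketch, with parameter values chosen just for illustration):

In [ ]:
# Keyword arguments become CDX query parameters, eg limiting the results
# to the first 10 captures that returned a 200 status code
sample = query_cdx('http://nla.gov.au', limit=10, filter='statuscode:200')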
In [73]:
url = 'http://nla.gov.au'

Getting the data

In this example we're using the Internet Archive's CDX API, but this could easily be adapted to use Timemaps from a range of repositories.
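
For example, the Wayback Machine also publishes Memento Timemaps for each page. Here's a minimal sketch of retrieving one, assuming the standard link-format endpoint at /web/timemap/link/:

In [ ]:
# Fetch a Timemap in link format from the Wayback Machine
timemap_url = f'http://web.archive.org/web/timemap/link/{url}'
response = requests.get(timemap_url)
response.raise_for_status()
# Each memento line in the Timemap describes one capture of the page
mementos = [line for line in response.text.splitlines() if 'memento' in line]
print(len(mementos))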

In [74]:
data = query_cdx(url)

# Convert to a dataframe
# The column names are in the first row
df = pd.DataFrame(data[1:], columns=data[0])

# Convert the timestamp string into a datetime object
df['date'] = pd.to_datetime(df['timestamp'])
df.sort_values(by='date', inplace=True, ignore_index=True)

# Convert the length from a string into an integer
df['length'] = df['length'].astype('int')

As noted in the notebook comparing the CDX API with Timemaps, there are a number of duplicate snapshots in the CDX results, so let's remove them.

In [75]:
print(f'Before: {df.shape[0]}')
df.drop_duplicates(subset=['timestamp', 'original', 'digest', 'statuscode', 'mimetype'], keep='first', inplace=True)
print(f'After: {df.shape[0]}')
Before: 2840
After: 2740

The basic shape

In [35]:
df['date'].min()
Out[35]:
Timestamp('1996-10-19 06:42:23')
In [36]:
df['date'].max()
Out[36]:
Timestamp('2020-04-27 07:42:20')
In [37]:
df['length'].describe()
Out[37]:
count     2740.000000
mean      6497.322263
std       5027.627203
min        296.000000
25%        643.000000
50%       5405.500000
75%      11409.500000
max      15950.000000
Name: length, dtype: float64
In [38]:
df['statuscode'].value_counts()
Out[38]:
200    2036
301     273
302     263
-       166
503       2
Name: statuscode, dtype: int64
In [39]:
df['mimetype'].value_counts()
Out[39]:
text/html       2574
warc/revisit     166
Name: mimetype, dtype: int64
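
Notice that the 166 captures with a statuscode of '-' match the 166 'warc/revisit' records. Revisit records are captures where the harvested content was identical to an earlier snapshot, so no new status code is recorded. We can cross-tabulate the two columns to check that they line up:

In [ ]:
# Cross-tabulate mimetype against statuscode.
# The '-' status codes should all belong to the 'warc/revisit' rows.
pd.crosstab(df['mimetype'], df['statuscode'])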

Plotting snapshots over time

In [40]:
# This is just a bit of fancy customisation to group the status codes by color
# See https://altair-viz.github.io/user_guide/customization.html#customizing-colors
domain = ['-', '200', '301', '302', '404', '503']
# grey for unknown (revisits), green for ok, blues for redirects, reds for errors
range_ = ['#888888', '#39a035', '#5ba3cf', '#125ca4', '#e13128', '#b21218']

alt.Chart(df).mark_point().encode(
    x='date:T',
    y='length:Q',
    color=alt.Color('statuscode', scale=alt.Scale(domain=domain, range=range_)),
    tooltip=['date', 'length', 'statuscode']
).properties(width=700, height=300)
Out[40]:

Looking at domains, protocols, and redirects

Looking at the chart above, it's hard to understand why a request for the page is sometimes redirected and sometimes not. To understand this we have to look a bit more closely at what pages are actually being archived. Let's look at the breakdown of values in the original column. These are the URLs being requested by the archiving bot.

In [41]:
df['original'].value_counts()
Out[41]:
http://www.nla.gov.au:80/                863
http://www.nla.gov.au/                   728
https://www.nla.gov.au/                  590
http://nla.gov.au/                       421
http://nla.gov.au:80/                     74
http://www.nla.gov.au//                   17
https://nla.gov.au/                       14
http://www.nla.gov.au                     11
http://www2.nla.gov.au:80/                10
http://[email protected]/                   6
http://www.nla.gov.au:80/?                 2
http://www.nla.gov.au:80//                 1
http://www.nla.gov.au./                    1
http://mailto:[email protected]/      1
http://mailto:[email protected]/              1
Name: original, dtype: int64

Ah ok, so there's actually a mix of things in here – some include the 'www' prefix and some don't, some use the 'https' protocol and some just plain old 'http'. There's also a bit of junk in there from badly parsed mailto links. To look at the differences in more detail, let's create new columns for subdomain and protocol.

In [42]:
# Extract the base domain from the url we queried (eg 'nla')
base_domain = re.search(r'https*:\/\/(\w*)\.', url).group(1)
# Extract whatever sits between the protocol and the base domain (eg 'www')
df['subdomain'] = df['original'].str.extract(r'^https*:\/\/(\w*)\.{}\.'.format(base_domain), flags=re.IGNORECASE)
df['subdomain'].fillna('', inplace=True)
df['subdomain'].value_counts()
Out[42]:
www     2213
         517
www2      10
Name: subdomain, dtype: int64
In [43]:
df['protocol'] = df['original'].str.extract(r'^(https*):')
df['protocol'].value_counts()
Out[43]:
http     2136
https     604
Name: protocol, dtype: int64

Change in protocol

Let's see how the proportion of requests using each protocol changes over time. Here we're grouping the rows by year.

In [44]:
alt.Chart(df).mark_bar().encode(
    x='year(date):T',
    y=alt.Y('count()',stack="normalize"),
    color='protocol:N',
    #tooltip=['date', 'length', 'subdomain:N']
).properties(width=700, height=200)
Out[44]:
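
We could do the same thing with the subdomain column we created earlier, to see how the mix of subdomains in the captured URLs changes over time. Here's a sketch along the same lines as the protocol chart above:

In [ ]:
# The same normalised bar chart, coloured by subdomain instead of protocol
alt.Chart(df).mark_bar().encode(
    x='year(date):T',
    y=alt.Y('count()', stack="normalize"),
    color='subdomain:N'
).properties(width=700, height=200)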