New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.
This notebook explores what we can find when we look at all the captures of a single page over time.
Work in progress – this notebook isn't finished yet. Check back later for more...
import requests
import pandas as pd
import altair as alt
import re
from difflib import HtmlDiff
from IPython.display import display, HTML
import arrow
def query_cdx(url, **kwargs):
    # Query the Internet Archive's CDX API for captures of the given url,
    # passing any extra keyword arguments through as request parameters
    params = kwargs
    params['url'] = url
    params['output'] = 'json'
    response = requests.get('http://web.archive.org/cdx/search/cdx', params=params, headers={'User-Agent': ''})
    response.raise_for_status()
    return response.json()
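Because the extra keyword arguments are passed straight through as CDX query parameters, you can use any of the documented CDX options. As a quick sketch added here (not part of the original notebook), limit caps the number of captures returned:
# Sketch: 'limit' is a standard CDX parameter, so this should return just the
# first few captures (the first row of the JSON response is the list of column names)
sample = query_cdx('http://nla.gov.au', limit=5)
sample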
url = 'http://nla.gov.au'
data = query_cdx(url)
# Convert to a dataframe
# The column names are in the first row
df = pd.DataFrame(data[1:], columns=data[0])
# Convert the 14-digit timestamp string (YYYYMMDDhhmmss) into a datetime object
df['date'] = pd.to_datetime(df['timestamp'], format='%Y%m%d%H%M%S')
df.sort_values(by='date', inplace=True, ignore_index=True)
# Convert the length from a string into an integer
df['length'] = df['length'].astype('int')
As noted in the notebook comparing the CDX API with Timemaps, there are a number of duplicate snapshots in the CDX results, so let's remove them.
print(f'Before: {df.shape[0]}')
df.drop_duplicates(subset=['timestamp', 'original', 'digest', 'statuscode', 'mimetype'], keep='first', inplace=True)
print(f'After: {df.shape[0]}')
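As a small added aside, the digest column used above contains a hash of each capture's content, so counting the distinct digests gives a rough sense of how many different versions of the page have actually been archived.
# How many distinct content digests (roughly, different versions of the page) are there?
df['digest'].nunique()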
# When was the page first captured?
df['date'].min()
# And when was it most recently captured?
df['date'].max()
# Summary statistics for the size (in bytes) recorded for each capture
df['length'].describe()
# How many captures returned each HTTP status code?
df['statuscode'].value_counts()
# What file types (mimetypes) were captured?
df['mimetype'].value_counts()
# This is just a bit of fancy customisation to group the types of errors by color
# See https://altair-viz.github.io/user_guide/customization.html#customizing-colors
domain = ['-', '200', '301', '302', '404', '503']
# grey for no status code ('-'), green for ok, blue for redirects, red for errors
range_ = ['#888888', '#39a035', '#5ba3cf', '#125ca4', '#e13128', '#b21218']
alt.Chart(df).mark_point().encode(
    x='date:T',
    y='length:Q',
    color=alt.Color('statuscode', scale=alt.Scale(domain=domain, range=range_)),
    tooltip=['date', 'length', 'statuscode']
).properties(width=700, height=300)
Looking at the chart above, it's hard to understand why a request for the page is sometimes redirected and sometimes not. To understand this we have to look a bit more closely at what pages are actually being archived. Let's look at the breakdown of values in the original column. These are the URLs being requested by the archiving bot.
df['original'].value_counts()
Ah ok, so there's actually a mix of things in here – some include the 'www' prefix and some don't, some use the 'https' protocol and some just plain old 'http'. There's also a bit of junk in there from badly parsed mailto links. To look at the differences in more detail, let's create new columns for subdomain and protocol.
# Get the base domain (eg 'nla') from the url we queried
base_domain = re.search(r'https*:\/\/(\w*)\.', url).group(1)
# Extract anything sitting between the protocol and the base domain as a subdomain
df['subdomain'] = df['original'].str.extract(r'^https*:\/\/(\w*)\.{}\.'.format(base_domain), flags=re.IGNORECASE)
# Captures with no subdomain will be NaN, so replace them with an empty string
df['subdomain'] = df['subdomain'].fillna('')
df['subdomain'].value_counts()
# Extract the protocol (http or https) from each requested url
df['protocol'] = df['original'].str.extract(r'^(https*):')
df['protocol'].value_counts()
Let's see how the proportion of requests using each protocol changes over time. Here we're grouping the rows by year.
alt.Chart(df).mark_bar().encode(
    x='year(date):T',
    y=alt.Y('count()', stack='normalize'),
    color='protocol:N',
    # tooltip=['date', 'length', 'subdomain:N']
).properties(width=700, height=200)
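The subdomain column created above isn't charted yet – this notebook is still a work in progress – but as a sketch, the same normalised bar chart approach could show how the mix of subdomains in the captures changes over time.
# Sketch: proportion of captures for each subdomain, grouped by year
alt.Chart(df).mark_bar().encode(
    x='year(date):T',
    y=alt.Y('count()', stack='normalize'),
    color='subdomain:N'
).properties(width=700, height=200)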