Observing change in a web page over time

New to Jupyter notebooks? Try the 'Using Jupyter notebooks' notebook for a quick introduction.

This notebook explores what we can find when we look at all captures of a single page over time.

Work in progress – this notebook isn't finished yet. Check back later for more...

In [71]:
import requests
import pandas as pd
import altair as alt
import re
from difflib import HtmlDiff
from IPython.display import display, HTML
import arrow
In [72]:
def query_cdx(url, **kwargs):
    # Any additional keyword arguments are passed to the CDX API as query parameters
    params = kwargs
    params['url'] = url
    params['output'] = 'json'
    response = requests.get('http://web.archive.org/cdx/search/cdx', params=params, headers={'User-Agent': ''})
    response.raise_for_status()
    return response.json()
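
Any extra keyword arguments passed to query_cdx are forwarded to the CDX API as query parameters, so the same function can be used with options like limit or filter. For example (a minimal sketch, with parameter values chosen just for illustration):

In [ ]:
# Keyword arguments become CDX query parameters, eg limiting the results
# to the first 10 captures that returned a 200 status code
sample = query_cdx('http://nla.gov.au', limit=10, filter='statuscode:200')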
In [73]:
url = 'http://nla.gov.au'

Getting the data

In this example we're using the Internet Archive's CDX API, but this could easily be adapted to use Timemaps from a range of repositories.
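
For example, the Wayback Machine also publishes Memento Timemaps for each page. Here's a minimal sketch of retrieving one, assuming the standard link-format endpoint at /web/timemap/link/:

In [ ]:
# Fetch a Timemap in link format from the Wayback Machine
timemap_url = f'http://web.archive.org/web/timemap/link/{url}'
response = requests.get(timemap_url)
response.raise_for_status()
# Each memento line in the Timemap describes one capture of the page
mementos = [line for line in response.text.splitlines() if 'memento' in line]
print(len(mementos))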

In [74]:
data = query_cdx(url)

# Convert to a dataframe
# The column names are in the first row
df = pd.DataFrame(data[1:], columns=data[0])

# Convert the timestamp string into a datetime object
df['date'] = pd.to_datetime(df['timestamp'])
df.sort_values(by='date', inplace=True, ignore_index=True)

# Convert the length from a string into an integer
df['length'] = df['length'].astype('int')

As noted in the notebook comparing the CDX API with Timemaps, there are a number of duplicate snapshots in the CDX results, so let's remove them.

In [75]:
print(f'Before: {df.shape[0]}')
df.drop_duplicates(subset=['timestamp', 'original', 'digest', 'statuscode', 'mimetype'], keep='first', inplace=True)
print(f'After: {df.shape[0]}')
Before: 2840
After: 2740

The basic shape

In [35]:
df['date'].min()
Out[35]:
Timestamp('1996-10-19 06:42:23')
In [36]:
df['date'].max()
Out[36]:
Timestamp('2020-04-27 07:42:20')
In [37]:
df['length'].describe()
Out[37]:
count     2740.000000
mean      6497.322263
std       5027.627203
min        296.000000
25%        643.000000
50%       5405.500000
75%      11409.500000
max      15950.000000
Name: length, dtype: float64
In [38]:
df['statuscode'].value_counts()
Out[38]:
200    2036
301     273
302     263
-       166
503       2
Name: statuscode, dtype: int64
In [39]:
df['mimetype'].value_counts()
Out[39]:
text/html       2574
warc/revisit     166
Name: mimetype, dtype: int64
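
Notice that the 166 captures with a statuscode of '-' match the 166 'warc/revisit' records. Revisit records are captures where the harvested content was identical to an earlier snapshot, so no new status code is recorded. We can cross-tabulate the two columns to check that they line up:

In [ ]:
# Cross-tabulate mimetype against statuscode.
# The '-' status codes should all belong to the 'warc/revisit' rows.
pd.crosstab(df['mimetype'], df['statuscode'])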

Plotting snapshots over time

In [40]:
# This is just a bit of fancy customisation to group the status codes by color
# See https://altair-viz.github.io/user_guide/customization.html#customizing-colors
domain = ['-', '200', '301', '302', '404', '503']
# grey for unknown (revisits), green for ok, blues for redirects, reds for errors
range_ = ['#888888', '#39a035', '#5ba3cf', '#125ca4', '#e13128', '#b21218']

alt.Chart(df).mark_point().encode(
    x='date:T',
    y='length:Q',
    color=alt.Color('statuscode', scale=alt.Scale(domain=domain, range=range_)),
    tooltip=['date', 'length', 'statuscode']
).properties(width=700, height=300)
Out[40]:

Looking at domains, protocols, and redirects

Looking at the chart above, it's hard to understand why a request for the page is sometimes redirected and sometimes not. To understand this we have to look a bit more closely at what pages are actually being archived. Let's look at the breakdown of values in the original column. These are the URLs being requested by the archiving bot.

In [41]:
df['original'].value_counts()
Out[41]:
http://www.nla.gov.au:80/                863
http://www.nla.gov.au/                   728
https://www.nla.gov.au/                  590
http://nla.gov.au/                       421
http://nla.gov.au:80/                     74
http://www.nla.gov.au//                   17
https://nla.gov.au/                       14
http://www.nla.gov.au                     11
http://www2.nla.gov.au:80/                10
http://[email protected]/                   6
http://www.nla.gov.au:80/?                 2
http://www.nla.gov.au:80//                 1
http://www.nla.gov.au./                    1
http://mailto:[email protected]/      1
http://mailto:[email protected]/              1
Name: original, dtype: int64

Ah ok, so there's actually a mix of things in here – some include the 'www' prefix and some don't, some use the 'https' protocol and some just plain old 'http'. There's also a bit of junk in there from badly parsed mailto links. To look at the differences in more detail, let's create new columns for subdomain and protocol.

In [42]:
# Extract the base domain from the url we queried (eg 'nla')
base_domain = re.search(r'https*:\/\/(\w*)\.', url).group(1)
# Extract whatever sits between the protocol and the base domain (eg 'www')
df['subdomain'] = df['original'].str.extract(r'^https*:\/\/(\w*)\.{}\.'.format(base_domain), flags=re.IGNORECASE)
df['subdomain'].fillna('', inplace=True)
df['subdomain'].value_counts()
Out[42]:
www     2213
         517
www2      10
Name: subdomain, dtype: int64
In [43]:
df['protocol'] = df['original'].str.extract(r'^(https*):')
df['protocol'].value_counts()
Out[43]:
http     2136
https     604
Name: protocol, dtype: int64

Change in protocol

Let's see how the proportion of requests using each protocol changes over time. Here we're grouping the rows by year.

In [44]:
alt.Chart(df).mark_bar().encode(
    x='year(date):T',
    y=alt.Y('count()',stack="normalize"),
    color='protocol:N',
    #tooltip=['date', 'length', 'subdomain:N']
).properties(width=700, height=200)
Out[44]:
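
We could do the same thing with the subdomain column we created earlier, to see how the mix of subdomains in the captured URLs changes over time. Here's a sketch along the same lines as the protocol chart above:

In [ ]:
# The same normalised bar chart, coloured by subdomain instead of protocol
alt.Chart(df).mark_bar().encode(
    x='year(date):T',
    y=alt.Y('count()', stack="normalize"),
    color='subdomain:N'
).properties(width=700, height=200)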