New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.
This notebook helps you assemble datasets of text extracted from all available captures of archived web pages. You can then feed these datasets to the text analysis tool of your choice to analyse changes over time.
filter_text=False (default) – save all of the human-visible text on the page, including boilerplate, footers, and navigation text.
filter_text=True – save only the significant text on the page, excluding recurring items like boilerplate and navigation. This is done by Trafilatura.
save_texts_from_url([timegate], [url], filter_text=[True or False])
The timegate value should be one of:
nla – National Library of Australia
nlnz – National Library of New Zealand
bl – UK Web Archive
ia – Internet Archive
ukgwa – UK Government Web Archive
Use a CDX query to find all urls that include the specified keyword in their url.
save_texts_from_cdx_query([url], filter_text=[True or False], filter=['original:.*[keyword].*', 'statuscode:200', 'mimetype:text/html'])
The url value can use wildcards to indicate whether it is a domain or prefix query, for example:
nla.gov.au/* – prefix query, search all files under nla.gov.au
*.nla.gov.au – domain query, search all files under nla.gov.au and any of its subdomains
You can use any of the keyword parameters that the CDX API recognises, but you probably want to filter for statuscode and mimetype and apply some sort of regular expression to original.
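The CDX server applies these filters itself, so nothing extra is needed to run a harvest. As a rough illustration of what the three filters select, the sketch below mimics them locally on a few invented capture records. The `matches_filters` helper and the sample records are hypothetical, and it assumes each pattern must match the whole field value (which is why the keyword is wrapped in `.*`); the server's own regular expression handling may differ in detail.

```python
import re

# Invented capture records, shaped like rows returned by a CDX query.
captures = [
    {"original": "http://example.gov.au/about/policy.html", "statuscode": "200", "mimetype": "text/html"},
    {"original": "http://example.gov.au/policy/privacy.pdf", "statuscode": "200", "mimetype": "application/pdf"},
    {"original": "http://example.gov.au/contact.html", "statuscode": "200", "mimetype": "text/html"},
    {"original": "http://example.gov.au/old-policy.html", "statuscode": "404", "mimetype": "text/html"},
]

filters = ["original:.*policy.*", "statuscode:200", "mimetype:text/html"]


def matches_filters(capture, filters):
    """Return True if every 'field:regex' filter matches the whole field value."""
    for item in filters:
        field, pattern = item.split(":", 1)
        if not re.fullmatch(pattern, capture.get(field, "")):
            return False
    return True


selected = [c["original"] for c in captures if matches_filters(c, filters)]
print(selected)
```

Only the first record passes all three filters: the others fail on mimetype, keyword, or status code respectively.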
A directory will be created for each url processed. The name of the directory will be a slugified version of the url in SURT (Sort-friendly URI Reordering Transform) format.
Each text file will be saved separately within the directory. Filenames follow the pattern:
[SURT formatted url]-[capture timestamp].txt
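Because each filename ends with the capture timestamp, the files in a directory can be put in date order before analysis. A minimal sketch, assuming timestamps of 14 digits (`YYYYMMDDhhmmss`) or 12 digits (no seconds); the `capture_datetime` helper and the sample filenames are made up for illustration.

```python
import re
from datetime import datetime
from pathlib import Path


def capture_datetime(file_path):
    """Extract the capture timestamp from a harvested filename and parse it."""
    match = re.search(r"-(\d{14}|\d{12})\.txt$", Path(file_path).name)
    if not match:
        return None
    timestamp = match.group(1)
    # 12-digit timestamps omit the seconds.
    fmt = "%Y%m%d%H%M%S" if len(timestamp) == 14 else "%Y%m%d%H%M"
    return datetime.strptime(timestamp, fmt)


# Hypothetical filenames following the pattern described above.
files = [
    "text/au-com-example/au-com-example-20150601093000.txt",
    "text/au-com-example/au-com-example-20100102030405.txt",
]
for path in sorted(files, key=capture_datetime):
    print(capture_datetime(path), path)
```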
There's also a metadata.json file that includes basic details of the harvest:
timegate – the repository used
url – the url harvested
filter_text – text filtering option used
date – date and time the harvest was started
mementos – details of each capture, including:
    url – link to capture in web archive
    text_file – path to harvested text file
import json
import re
import time
from pathlib import Path
import arrow
import pandas as pd
import requests
import trafilatura
from bs4 import BeautifulSoup
from IPython.display import FileLink, FileLinks, display
from lxml.etree import ParserError
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from slugify import slugify
from surt import surt
from tqdm.auto import tqdm
s = requests.Session()
retries = Retry(total=10, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("https://", HTTPAdapter(max_retries=retries))
s.mount("http://", HTTPAdapter(max_retries=retries))
# Default list of repositories -- you could add to this
TIMEGATES = {
    "nla": "https://web.archive.org.au/awa/",
    "nlnz": "https://ndhadeliver.natlib.govt.nz/webarchive/",
    "bl": "https://www.webarchive.org.uk/wayback/archive/",
    "ia": "https://web.archive.org/web/",
    "ukgwa": "https://webarchive.nationalarchives.gov.uk/ukgwa/",
}
def is_memento(url):
    """
    Is this url a Memento? Checks for the presence of a timestamp.
    """
    return bool(re.search(r"/(\d{12}|\d{14})(?:id_|mp_|if_)*/http", url))
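To make the pattern concrete, the check below runs the same regular expression over a few example urls. The urls are illustrative only and are never fetched.

```python
import re


def is_memento(url):
    """Same check as above: look for a 12 or 14 digit timestamp before the original url."""
    return bool(re.search(r"/(\d{12}|\d{14})(?:id_|mp_|if_)*/http", url))


examples = [
    "https://web.archive.org/web/20200101000000/http://example.com/",
    "https://web.archive.org/web/20200101000000id_/http://example.com/",
    "http://example.com/",
]
for url in examples:
    print(is_memento(url), url)
```

The first two urls contain a timestamp followed by the original url, so they count as Mementos; the last is a live url and does not.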
def get_html(url):
    """
    Retrieve the original HTML content of an archived page.
    Follow redirects if they go to another archived page.
    Return the (possibly redirected) url from the response and the HTML content.
    """
    # Adding the id_ hint tells the archive to give us the original harvested version, without any rewriting.
    url = re.sub(r"/(\d{12}|\d{14})(?:mp_)*/http", r"/\1id_/http", url)
    response = requests.get(url, allow_redirects=True)
    # Some captures might redirect themselves to live versions
    # If the redirected url doesn't look like a Memento rerun this without redirection
    if not is_memento(response.url):
        response = requests.get(url, allow_redirects=False)
    return {"url": response.url, "html": response.content}
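The url rewriting step in get_html can be tried on its own without making any requests. This sketch applies the same substitution to example urls; the `add_id_hint` name is hypothetical and the urls are illustrative.

```python
import re


def add_id_hint(url):
    """Insert the id_ flag after the timestamp, replacing any mp_ flag."""
    return re.sub(r"/(\d{12}|\d{14})(?:mp_)*/http", r"/\1id_/http", url)


print(add_id_hint("https://web.archive.org/web/20200101000000/http://example.com/"))
print(add_id_hint("https://web.archive.org/web/20200101000000mp_/http://example.com/"))
```

A url that already carries the id_ flag does not match the pattern, so it passes through unchanged.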
def convert_lists_to_dicts(results):
    """
    Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.
    Renames keys to standardise IA with other Timemaps.
    """
    if results:
        keys = results[0]
        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]
    else:
        results_as_dicts = results
    # Rename keys
    for d in results_as_dicts:
        d["status"] = d.pop("statuscode")
        d["mime"] = d.pop("mimetype")
        d["url"] = d.pop("original")
    return results_as_dicts
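The conversion is easier to follow with a small sample. The sketch below is a compact equivalent written for illustration, not the notebook's own function: `rows_to_dicts` is a hypothetical name and the sample rows are invented, but they follow the header-row-first layout described in the docstring above.

```python
def rows_to_dicts(results):
    """Pair the header row with each data row, then rename the IA-specific keys."""
    if not results:
        return []
    keys = results[0]
    renames = {"statuscode": "status", "mimetype": "mime", "original": "url"}
    return [
        {renames.get(key, key): value for key, value in zip(keys, row)}
        for row in results[1:]
    ]


# Invented sample: the first row holds the field names, the rest are captures.
sample = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20200101000000", "http://example.com/", "text/html", "200", "ABC123", "1234"],
]
print(rows_to_dicts(sample))
```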
def get_capture_data_from_memento(url, request_type="head"):
    """
    For OpenWayback systems this can get some extra capture info to insert in Timemaps.
    """
    if request_type == "head":
        response = requests.head(url)
    else:
        response = requests.get(url)
    headers = response.headers
    length = headers.get("x-archive-orig-content-length")
    status = headers.get("x-archive-orig-status")
    status = status.split(" ")[0] if status else None
    mime = headers.get("x-archive-orig-content-type")
    mime = mime.split(";")[0] if mime else None
    return {"length": length, "status": status, "mime": mime}
def convert_link_to_json(results, enrich_data=False):
    """
    Converts link formatted Timemap to JSON.
    """
    data = []
    for line in results.splitlines():
        parts = line.split("; ")
        if len(parts) > 1:
            link_type = re.search(
                r'rel="(original|self|timegate|first memento|last memento|memento)"',
                parts[1],
            ).group(1)
            if link_type == "memento":
                link = parts[0].strip("<>")
                timestamp, original = re.search(r"/(\d{12}|\d{14})/(.*)$", link).groups()
                capture = {"timestamp": timestamp, "url": original}
                if enrich_data:
                    capture.update(get_capture_data_from_memento(link))
                data.append(capture)
    return data
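To show what the link-format parsing extracts, here is a simplified, self-contained version of the same loop run over a few invented Timemap lines. It leaves out the optional enrichment step, and the sample urls are illustrative only.

```python
import re

# Invented link-format Timemap lines for illustration.
timemap = """<http://example.com/>; rel="original",
<https://web.archive.org.au/awa/20100101000000/http://example.com/>; rel="first memento"; datetime="Fri, 01 Jan 2010 00:00:00 GMT",
<https://web.archive.org.au/awa/20150601093000/http://example.com/>; rel="memento"; datetime="Mon, 01 Jun 2015 09:30:00 GMT","""

captures = []
for line in timemap.splitlines():
    parts = line.split("; ")
    # Keep only lines whose relation is exactly "memento".
    if len(parts) > 1 and parts[1].rstrip(",") == 'rel="memento"':
        link = parts[0].strip("<>")
        timestamp, original = re.search(r"/(\d{12}|\d{14})/(.*)$", link).groups()
        captures.append({"timestamp": timestamp, "url": original})

print(captures)
```

As in the function above, lines labelled first memento or last memento are not collected, so only the 2015 capture appears in the output.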
def get_timemap_as_json(timegate, url):
    """
    Get a Timemap then normalise results (if necessary) to return a list of dicts.
    """
    tg_url = f"{TIMEGATES[timegate]}timemap/json/{url}/"
    response = requests.get(tg_url)
    response_type = response.headers["content-type"]
    # pywb style Timemap
    if response_type == "text/x-ndjson":
        data = [json.loads(line) for line in response.text.splitlines()]
    # IA Wayback style Timemap
    elif response_type == "application/json":
        data = convert_lists_to_dicts(response.json())
    # Link style Timemap (OpenWayback)
    elif response_type in ["application/link-format", "text/html;charset=utf-8"]:
        data = convert_link_to_json(response.text)
    # Unrecognised response type, so there are no captures to process
    else:
        data = []
    return data
def get_all_text(capture_data):
    """
    Get all the human visible text from a web page, including headers, footers, and navigation.
    Does some cleaning up to remove multiple spaces, tabs, and newlines.
    """
    try:
        text = BeautifulSoup(capture_data["html"]).get_text()
    except TypeError:
        return None
    else:
        # Collapse runs of blank lines into a single blank line
        text = re.sub(r"\n\s*\n", "\n\n", text)
        # Replace multiple spaces or tabs with a single space
        text = re.sub(r"( |\t){2,}", " ", text)
        # Remove leading spaces
        text = re.sub(r"\n ", "\n", text)
        # Remove leading newlines
        text = re.sub(r"^\n*", "", text)
    return text
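The clean-up steps in get_all_text can be tested on a plain string without fetching or parsing any HTML. In this sketch the `tidy_whitespace` name and the sample string are made up; the four substitutions are the same ones used above, in the same order.

```python
import re


def tidy_whitespace(text):
    """Apply the same four substitutions as get_all_text, in the same order."""
    text = re.sub(r"\n\s*\n", "\n\n", text)
    text = re.sub(r"( |\t){2,}", " ", text)
    text = re.sub(r"\n ", "\n", text)
    text = re.sub(r"^\n*", "", text)
    return text


# Sample text with leading newlines, repeated spaces, and a run of blank lines.
raw = "\n\nHome   About\n\n\n\n  Welcome \tto  the site\n"
print(repr(tidy_whitespace(raw)))
```

Note that the second substitution only fires on runs of two or more spaces or tabs, so a single tab on its own is left in place.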
def get_main_text(capture_data):
    """
    Get only the main text from a page, excluding boilerplate and navigation.
    """
    try:
        text = trafilatura.extract(capture_data["html"])
    except ParserError:
        text = ""
    return text
def get_text_from_capture(capture_url, filter_text=False):
    """
    Get text from the given memento.
    If filter_text is True, only return the significant text (excluding things like navigation).
    """
    capture_data = get_html(capture_url)
    if filter_text:
        text = get_main_text(capture_data)
    else:
        text = get_all_text(capture_data)
    return text
def process_capture_list(timegate, captures, filter_text=False, url=None):
    """
    Save the text of each successful capture in the list to its own file.
    Also writes a metadata.json file describing the harvest.
    """
    if not url:
        url = captures[0]["url"]
    metadata = {
        "timegate": TIMEGATES[timegate],
        "url": url,
        "filter_text": filter_text,
        "date": arrow.now().format("YYYY-MM-DD HH:mm:ss"),
        "mementos": [],
    }
    try:
        urlkey = captures[0]["urlkey"]
    except KeyError:
        urlkey = surt(url)
    # Truncate urls longer than 50 chars so that filenames are not too long
    output_dir = Path("text", slugify(urlkey)[:50])
    output_dir.mkdir(parents=True, exist_ok=True)
    for capture in tqdm(captures, desc="Captures"):
        file_path = Path(
            output_dir, f'{slugify(urlkey)[:50]}-{capture["timestamp"]}.txt'
        )
        # Don't reharvest if file already exists
        if not file_path.exists():
            # Only process successful captures
            if capture["status"] == "200":
                capture_url = (
                    f'{TIMEGATES[timegate]}{capture["timestamp"]}id_/{capture["url"]}'
                )
                capture_text = get_text_from_capture(capture_url, filter_text)
                if capture_text:
                    file_path.write_text(capture_text)
                    metadata["mementos"].append(
                        {"url": capture_url, "text_file": str(file_path)}
                    )
                # Pause briefly between requests to the archive
                time.sleep(0.2)
    metadata_file = Path(output_dir, "metadata.json")
    with metadata_file.open("wt") as md_json:
        json.dump(metadata, md_json)
def save_texts_from_url(timegate, url, filter_text=False):
    """
    Save the text contents of all available captures for a given url from the specified repository.
    Saves both the harvested text files and a json file with the harvest metadata.
    """
    timemap = get_timemap_as_json(timegate, url)
    if timemap:
        process_capture_list(timegate, timemap, url=url, filter_text=filter_text)
def prepare_params(url, **kwargs):
    """
    Prepare the parameters for a CDX API request.
    Adds all supplied keyword arguments as parameters (changing from_ to from).
    Adds in a few necessary parameters.
    """
    params = kwargs
    params["url"] = url
    params["output"] = "json"
    params["pageSize"] = 5
    # CDX accepts a 'from' parameter, but this is a reserved word in Python
    # Use 'from_' to pass the value to the function & here we'll change it back to 'from'.
    if "from_" in params:
        params["from"] = params["from_"]
        del params["from_"]
    return params
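The from_ workaround is clearest when you look at the resulting dictionary. This sketch repeats the function in compact form so that it runs on its own; the argument values are examples only.

```python
def prepare_params(url, **kwargs):
    """Compact copy of the function above, for illustration."""
    params = kwargs
    params["url"] = url
    params["output"] = "json"
    params["pageSize"] = 5
    # 'from' is a reserved word in Python, so callers pass 'from_' instead.
    if "from_" in params:
        params["from"] = params.pop("from_")
    return params


params = prepare_params(
    "nla.gov.au/*",
    from_="2010",
    to="2015",
    filter=["statuscode:200", "mimetype:text/html"],
)
print(params)
```

The returned dictionary contains a from key rather than from_, alongside the url, output, and pageSize values that the function always adds.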
def get_total_pages(params):
    """
    Get number of pages in a query.
    Note that the number of pages doesn't tell you much about the number of results, as the numbers per page vary.
    """
    these_params = params.copy()
    these_params["showNumPages"] = "true"
    response = s.get(
        "http://web.archive.org/cdx/search/cdx",
        params=these_params,
        headers={"User-Agent": ""},
    )
    return int(response.text)
def get_cdx_data(params):
    """
    Make a request to the CDX API using the supplied parameters.
    Return results converted to a list of dicts.
    """
    response = s.get("http://web.archive.org/cdx/search/cdx", params=params)
    response.raise_for_status()
    results = response.json()
    try:
        if not response.from_cache:
            time.sleep(0.2)
    except AttributeError:
        # Not using cache
        time.sleep(0.2)
    return convert_lists_to_dicts(results)
def harvest_cdx_query(url, **kwargs):
    """
    Harvest results of query from the IA CDX API using pagination.
    Returns captures as a list of dicts.
    """
    results = []
    page = 0
    params = prepare_params(url, **kwargs)
    total_pages = get_total_pages(params)
    with tqdm(total=total_pages - page, desc="CDX") as pbar:
        while page < total_pages:
            params["page"] = page
            results += get_cdx_data(params)
            page += 1
            pbar.update(1)
    return results
def save_texts_from_cdx_query(url, filter_text=False, **kwargs):
    captures = harvest_cdx_query(url, **kwargs)
    if captures:
        df = pd.DataFrame(captures)
        groups = df.groupby(by="urlkey")
        print(f"{len(groups)} matching urls")
        for name, group in groups:
            process_capture_list(
                "ia", group.to_dict("records"), filter_text=filter_text
            )
Get all human-visible text from all captures of a single url in the Australian Web Archive.
save_texts_from_url("nla", "http://discontents.com.au/", filter_text=False)
Get only significant text from all captures of a single url in the New Zealand Web Archive.
save_texts_from_url("nlnz", "http://digitalnz.org/", filter_text=True)
Harvest text from a series of urls.
urls = ["http://nla.gov.au", "http://nma.gov.au", "http://awm.gov.au"]
for url in urls:
    save_texts_from_url("nla", url, filter_text=True)
Harvest text from all pages under the dfat.gov.au domain that include the word 'policy' in the url. Note the use of the regular expression .*policy.* to match the original url.
save_texts_from_cdx_query(
    "dfat.gov.au/*",
    filter_text=True,
    filter=["original:.*policy.*", "statuscode:200", "mimetype:text/html"],
)
If you're using JupyterLab, you can browse the results of this notebook by just looking inside the text folder. I've also enabled the jupyter-archive extension, which adds a download option to the right-click menu. Just right-click on a folder and you'll see an option to 'Download as an Archive'. This will zip up and download the folder.
The cells below provide a couple of alternative ways of viewing and downloading the results.
# Display all the files under the current text folder (this could be a long list)
display(FileLinks("text"))
# Tar/gzip the current domain folder
!tar -czf text.tar.gz text
# Display a link to the gzipped data
# In JupyterLab you'll need to Shift+right-click on the link and choose 'Download link'
display(FileLink("text.tar.gz"))
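Another way to review a harvest is to read the metadata.json files directly. The sketch below is one possible approach rather than part of the notebook: `list_captures` and `load_all_metadata` are hypothetical helpers, and the keys they read (url, date, mementos, text_file) are the ones written by process_capture_list above.

```python
import json
import re
from pathlib import Path


def list_captures(metadata):
    """Return (timestamp, text_file) pairs from a harvest's metadata, oldest first."""
    pairs = []
    for memento in metadata.get("mementos", []):
        # Capture urls have the form [timegate][timestamp]id_/[original url]
        match = re.search(r"/(\d{12}|\d{14})id_/", memento["url"])
        timestamp = match.group(1) if match else None
        pairs.append((timestamp, memento["text_file"]))
    return sorted(pairs, key=lambda pair: pair[0] or "")


def load_all_metadata(root="text"):
    """Load every metadata.json found one level below the output folder."""
    return [json.loads(p.read_text()) for p in sorted(Path(root).glob("*/metadata.json"))]


for metadata in load_all_metadata():
    print(metadata["url"], metadata["date"], len(metadata["mementos"]))
```

Passing one of the loaded dictionaries to `list_captures` gives the saved text files in capture order, which is a convenient starting point for comparing versions over time.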
Created by Tim Sherratt for the GLAM Workbench. Support me by becoming a GitHub sponsor!
Work on this notebook was supported by the IIPC Discretionary Funding Programme 2019-2020.
The Web Archives section of the GLAM Workbench is sponsored by the British Library.