Harvesting collections of text from archived web pages

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

This notebook helps you assemble datasets of text extracted from all available captures of archived web pages. You can then feed these datasets to the text analysis tool of your choice to analyse changes over time.

Harvest sources

  • Timemaps – harvest text from a single url, or list of urls, using the repository of your choice
  • CDX API – harvest text from the results of a query to the Internet Archive's CDX API

Options

  • filter_text=False (default) – save all of the human-visible text on the page, including boilerplate such as headers, footers, and navigation text.
  • filter_text=True – save only the significant text on the page, excluding recurring elements like boilerplate and navigation. This filtering is done using Trafilatura.
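If you're not sure which option you need, here's a minimal sketch of the difference using a throwaway HTML snippet (the snippet and variable names are just for illustration). The two libraries are the same ones the notebook uses: BeautifulSoup keeps everything a person could see, while Trafilatura tries to keep only the main content.

from bs4 import BeautifulSoup
import trafilatura

# A hypothetical page with navigation 'boilerplate' around the main content
html = '''
<html><body>
  <nav><a href="/">Home</a> <a href="/about">About</a></nav>
  <article><h1>Annual report</h1><p>The library digitised many thousands of pages this year,
  and made them freely available online for researchers and the general public.</p></article>
  <footer>Contact us | Privacy</footer>
</body></html>
'''

# filter_text=False behaves like this – all the visible text, navigation and footer included
print(BeautifulSoup(html, 'lxml').get_text())

# filter_text=True behaves like this – just the 'significant' content
# (may return None if nothing substantial is found on the page)
print(trafilatura.extract(html))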

Usage

Using Timemaps

save_texts_from_url([timegate], [url], filter_text=[True or False])

The timegate value should be one of:

  • nla – National Library of Australia
  • nlnz – National Library of New Zealand
  • bl – UK Web Archive
  • ia – Internet Archive
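
For example, to save the filtered text of every capture of the National Library of Australia's homepage held by the Australian Web Archive:

save_texts_from_url('nla', 'http://nla.gov.au/', filter_text=True)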

Using the Internet Archive's CDX API

Use a CDX query to find captures of all urls that include a specified keyword.

save_texts_from_cdx_query([url], filter_text=[True or False], filter=['original:.*[keyword].*', 'statuscode:200', 'mimetype:text/html'])

The url value can use wildcards to indicate whether it is a domain or prefix query, for example:

  • nla.gov.au/* – prefix query, search all files under nla.gov.au
  • *.nla.gov.au – domain query, search all files under nla.gov.au and any of its subdomains

You can use any of the keyword parameters that the CDX API recognises, but you'll probably want to filter on statuscode and mimetype, and apply a regular expression to the original url, as in the template above.
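
If you want a sense of what a set of filters will match before launching a full harvest, you could query the CDX API directly. This is just a quick preview sketch, not part of the notebook's own workflow – the limit parameter keeps the response small.

import requests

params = {
    'url': 'nla.gov.au/*',  # prefix query
    'output': 'json',
    'filter': ['original:.*policy.*', 'statuscode:200', 'mimetype:text/html'],
    'limit': 10  # just a preview, not the full result set
}
response = requests.get('http://web.archive.org/cdx/search/cdx', params=params)
for row in response.json()[1:]:  # the first row contains the field names
    print(row)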

Output

A directory will be created for each url processed. The name of the directory will be a slugified version of the url in SURT (Sort-friendly URI Reordering Transform) format, truncated to 50 characters to keep file paths manageable.

Each text file will be saved separately within the directory. Filenames follow the pattern:

[slugified SURT url]-[capture timestamp].txt
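
To see how a url is transformed into a directory and file name, you can run the same functions the notebook uses (surt and slugify). The exact values shown in the comments are indicative only.

from surt import surt
from slugify import slugify

url = 'http://www.nla.gov.au/about'
urlkey = surt(url)           # SURT form, e.g. 'au,gov,nla)/about'
print(urlkey)
print(slugify(urlkey)[:50])  # slugified and truncated, e.g. 'au-gov-nla-about'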

There's also a metadata.json file that includes basic details of the harvest:

  • timegate – the repository used
  • url – the url harvested
  • filter_text – text filtering option used
  • date – date and time the harvest was started
  • mementos – details of each capture, including:
    • url – link to capture in web archive
    • text_file – path to the harvested text file
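
A metadata.json file for a small harvest might look something like this (the values here are made up for illustration):

{
  "timegate": "https://web.archive.org.au/awa/",
  "url": "http://nla.gov.au/",
  "filter_text": true,
  "date": "2021-01-01 09:30:00",
  "mementos": [
    {
      "url": "https://web.archive.org.au/awa/19990101000000id_/http://nla.gov.au/",
      "text_file": "text/au-gov-nla/au-gov-nla-19990101000000.txt"
    }
  ]
}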

Import what we need

In [ ]:
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
import re
import pandas as pd
from bs4 import BeautifulSoup
from surt import surt
from pathlib import Path
from slugify import slugify
from tqdm.auto import tqdm
import trafilatura
import arrow
import json
import time
from lxml.etree import ParserError
from IPython.display import display, FileLink, FileLinks
s = requests.Session()
retries = Retry(total=10, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))
In [ ]:
# Default list of repositories -- you could add to this
TIMEGATES = {
    'nla': 'https://web.archive.org.au/awa/',
    'nlnz': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/',
    'bl': 'https://www.webarchive.org.uk/wayback/archive/',
    'ia': 'https://web.archive.org/web/'
}

Define some functions

In [ ]:
def is_memento(url):
    '''
    Is this url a Memento? Checks for the presence of a timestamp.
    '''
    return bool(re.search(r'/\d{14}(?:id_|mp_|if_)*/http', url))

def get_html(url):
    '''
    Retrieve the original HTML content of an archived page.
    Follow redirects if they go to another archived page.
    Return the (possibly redirected) url from the response and the HTML content.
    '''
    # Adding the id_ hint tells the archive to give us the original harvested version, without any rewriting.
    url = re.sub(r'/(\d{14})(?:mp_)*/http', r'/\1id_/http', url)
    response = requests.get(url, allow_redirects=True)
    # Some captures might redirect themselves to live versions.
    # If the redirected url doesn't look like a Memento, rerun the request without following redirects.
    if not is_memento(response.url):
        response = requests.get(url, allow_redirects=False)
    return {'url': response.url, 'html': response.content}

def convert_lists_to_dicts(results):
    '''
    Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.
    Renames keys to standardise IA with other Timemaps.
    '''
    if results:
        keys = results[0]
        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]
    else:
        results_as_dicts = results
    # Rename keys
    for d in results_as_dicts:
        d['status'] = d.pop('statuscode')
        d['mime'] = d.pop('mimetype')
        d['url'] = d.pop('original')
    return results_as_dicts

def get_capture_data_from_memento(url, request_type='head'):
    '''
    For OpenWayback systems this can get some extra capture info to insert into Timemaps.
    '''
    if request_type == 'head':
        response = requests.head(url)
    else:
        response = requests.get(url)
    headers = response.headers
    length = headers.get('x-archive-orig-content-length')
    status = headers.get('x-archive-orig-status')
    status = status.split(' ')[0] if status else None
    mime = headers.get('x-archive-orig-content-type')
    mime = mime.split(';')[0] if mime else None
    return {'length': length, 'status': status, 'mime': mime}

def convert_link_to_json(results, enrich_data=False):
    '''
    Converts link formatted Timemap to JSON.
    '''
    data = []
    for line in results.splitlines():
        parts = line.split('; ')
        if len(parts) > 1:
            link_type = re.search(r'rel="(original|self|timegate|first memento|last memento|memento)"', parts[1]).group(1)
            if link_type == 'memento':
                link = parts[0].strip('<>')
                timestamp, original = re.search(r'/(\d{14})/(.*)$', link).groups()
                capture = {'timestamp': timestamp, 'url': original}
                if enrich_data:
                    capture.update(get_capture_data_from_memento(link))
                data.append(capture)
    return data
                
def get_timemap_as_json(timegate, url):
    '''
    Get a Timemap then normalise results (if necessary) to return a list of dicts.
    '''
    tg_url = f'{TIMEGATES[timegate]}timemap/json/{url}/'
    response = requests.get(tg_url)
    response_type = response.headers['content-type']
    # pywb style Timemap
    if response_type == 'text/x-ndjson':
        data = [json.loads(line) for line in response.text.splitlines()]
    # IA Wayback style Timemap
    elif response_type == 'application/json':
        data = convert_lists_to_dicts(response.json())
    # Link style Timemap (OpenWayback)
    elif response_type in ['application/link-format', 'text/html;charset=utf-8']:
        data = convert_link_to_json(response.text)
    return data

def get_all_text(capture_data):
    '''
    Get all the human visible text from a web page, including headers, footers, and navigation.
    Does some cleaning up to remove multiple spaces, tabs, and newlines.
    ''' 
    try:
        text = BeautifulSoup(capture_data['html'], 'lxml').get_text()
    except TypeError:
        return None
    else:
        # Remove multiple newlines
        text = re.sub(r'\n\s*\n', '\n\n', text)
        # Replace runs of spaces or tabs with a single space
        text = re.sub(r'( |\t){2,}', ' ', text)
        # Remove leading spaces
        text = re.sub(r'\n ', '\n', text)
        # Remove leading newlines
        text = re.sub(r'^\n*', '', text)
        return text

def get_main_text(capture_data):
    '''
    Get only the main text from a page, excluding boilerplate and navigation.
    '''
    try:
        text = trafilatura.extract(capture_data['html'])
    except ParserError:
        text = ''
    return text
    
def get_text_from_capture(capture_url, filter_text=False):
    '''
    Get text from the given memento.
    If filter_text is True, only return the significant text (excluding things like navigation).
    '''
    capture_data = get_html(capture_url)
    if filter_text:
        text = get_main_text(capture_data)
    else:
        text = get_all_text(capture_data)
    return text

def process_capture_list(timegate, captures, filter_text=False, url=None):
    if not url:
        url = captures[0]['url']
    metadata = {
            'timegate': TIMEGATES[timegate],
            'url': url,
            'filter_text': filter_text,
            'date': arrow.now().format('YYYY-MM-DD HH:mm:ss'),
            'mementos': []
    }
    try:
        urlkey = captures[0]['urlkey']
    except KeyError:
        urlkey = surt(url)
    # Truncate urls longer than 50 chars so that filenames are not too long
    output_dir = Path('text', slugify(urlkey)[:50])
    output_dir.mkdir(parents=True, exist_ok=True)
    for capture in tqdm(captures, desc='Captures'):
        file_path = Path(output_dir, f'{slugify(urlkey)[:50]}-{capture["timestamp"]}.txt')
        # Don't reharvest if file already exists
        if not file_path.exists():
            # Only process successful captures (or all for NLNZ)
            if timegate == 'nlnz' or capture['status'] == '200':
                capture_url = f'{TIMEGATES[timegate]}{capture["timestamp"]}id_/{capture["url"]}'
                capture_text = get_text_from_capture(capture_url, filter_text)
                if capture_text:
                    file_path.write_text(capture_text)
                    metadata['mementos'].append({'url': capture_url, 'text_file': str(file_path)})
                time.sleep(0.2)
    metadata_file = Path(output_dir, 'metadata.json')
    with metadata_file.open('wt') as md_json:
        json.dump(metadata, md_json)

def save_texts_from_url(timegate, url, filter_text=False):
    '''
    Save the text contents of all available captures for a given url from the specified repository.
    Saves both the harvested text files and a json file with the harvest metadata.
    '''
    timemap = get_timemap_as_json(timegate, url)
    if timemap:
        process_capture_list(timegate, timemap, url=url, filter_text=filter_text)
        
def prepare_params(url, **kwargs):
    '''
    Prepare the parameters for a CDX API request.
    Adds all supplied keyword arguments as parameters (changing from_ to from).
    Adds in a few necessary parameters.
    '''
    params = kwargs
    params['url'] = url
    params['output'] = 'json'
    params['pageSize'] = 5
    # CDX accepts a 'from' parameter, but this is a reserved word in Python
    # Use 'from_' to pass the value to the function & here we'll change it back to 'from'.
    if 'from_' in params:
        params['from'] = params['from_']
        del(params['from_'])
    return params

def get_total_pages(params):
    '''
    Get number of pages in a query.
    Note that the number of pages doesn't tell you much about the number of results, as the numbers per page vary.
    '''
    these_params = params.copy()
    these_params['showNumPages'] = 'true'
    response = s.get('http://web.archive.org/cdx/search/cdx', params=these_params, headers={'User-Agent': ''})
    return int(response.text)

def get_cdx_data(params):
    '''
    Make a request to the CDX API using the supplied parameters.
    Return results converted to a list of dicts.
    '''
    response = s.get('http://web.archive.org/cdx/search/cdx', params=params)
    response.raise_for_status()
    results = response.json()
    try:
        if not response.from_cache:
            time.sleep(0.2)
    except AttributeError:
        # Not using cache
        time.sleep(0.2)
    return convert_lists_to_dicts(results)

def harvest_cdx_query(url, **kwargs):
    '''
    Harvest results of query from the IA CDX API using pagination.
    Returns captures as a list of dicts.
    '''
    results = []
    page = 0
    params = prepare_params(url, **kwargs)
    total_pages = get_total_pages(params)
    with tqdm(total=total_pages-page, desc='CDX') as pbar:
        while page < total_pages:
            params['page'] = page
            results += get_cdx_data(params)
            page += 1
            pbar.update(1)
    return results

def save_texts_from_cdx_query(url, filter_text=False, **kwargs):
    captures = harvest_cdx_query(url, **kwargs)
    if captures:
        df = pd.DataFrame(captures)
        groups = df.groupby(by='urlkey')
        print(f'{len(groups)} matching urls')
        for name, group in groups:
            process_capture_list('ia', group.to_dict('records'), filter_text=filter_text)

Harvesting a single url or list of urls

Get all human-visible text from all captures of a single url in the Australian Web Archive.

In [ ]:
save_texts_from_url('nla', 'http://nla.gov.au/', filter_text=False)

Get only significant text from all captures of a single url in the Australian Web Archive.

In [ ]:
save_texts_from_url('nla', 'http://nla.gov.au/', filter_text=True)

Harvest text from a series of urls.

In [ ]:
urls = [
    'http://nla.gov.au',
    'http://nma.gov.au',
    'http://awm.gov.au'
]

for url in urls:
    save_texts_from_url('nla', url, filter_text=True)

Harvesting matching pages from a domain

Harvest text from all pages under dfat.gov.au that include the word policy in the url. Note the use of the regular expression .*policy.* to match against the original url.

In [ ]:
save_texts_from_cdx_query('dfat.gov.au/*', filter_text=True, filter=['original:.*policy.*', 'statuscode:200', 'mimetype:text/html'])

Viewing and downloading the results

If you're using Jupyter Lab, you can browse the results of this notebook by looking inside the text folder. I've also enabled the jupyter-archive extension, which adds a download option to the right-click menu. Just right click on a folder and you'll see an option to 'Download as an Archive'. This will zip up and download the folder.

The cells below provide a couple of alternative ways of viewing and downloading the results.

In [ ]:
# Display all the files under the text directory (this could be a long list)
display(FileLinks('text'))
In [ ]:
# Tar/gzip the text directory
!tar -czf text.tar.gz text
In [ ]:
# Display a link to the gzipped data
# In JupyterLab you'll need to Shift+right-click on the link and choose 'Download link'
display(FileLink('text.tar.gz'))

Created by Tim Sherratt for the GLAM Workbench. Support me by becoming a GitHub sponsor!

Work on this notebook was supported by the IIPC Discretionary Funding Programme 2019-2020