Harvesting collections of text from archived web pages

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

This notebook helps you assemble datasets of text extracted from all available captures of archived web pages. You can then feed these datasets to the text analysis tool of your choice to analyse changes over time.

Harvest sources

  • Timemaps – harvest text from a single url, or list of urls, using the repository of your choice
  • CDX API – harvest text from the results of a query to the Internet Archive's CDX API

Options

  • filter_text=False (default) – save all of the human-visible text on the page, including boilerplate such as headers, footers, and navigation text.
  • filter_text=True – save only the significant text on the page, excluding recurring elements like boilerplate and navigation. This filtering is done using Trafilatura.
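If you're not sure which option you need, here's a minimal sketch of the difference using a throwaway HTML snippet (the snippet and variable names are just for illustration). The two libraries are the same ones the notebook uses: BeautifulSoup keeps everything a person could see, while Trafilatura tries to keep only the main content.

from bs4 import BeautifulSoup
import trafilatura

# A hypothetical page with navigation 'boilerplate' around the main content
html = '''
<html><body>
  <nav><a href="/">Home</a> <a href="/about">About</a></nav>
  <article><h1>Annual report</h1><p>The library digitised many thousands of pages this year,
  and made them freely available online for researchers and the general public.</p></article>
  <footer>Contact us | Privacy</footer>
</body></html>
'''

# filter_text=False behaves like this – all the visible text, navigation and footer included
print(BeautifulSoup(html, 'lxml').get_text())

# filter_text=True behaves like this – just the 'significant' content
# (may return None if nothing substantial is found on the page)
print(trafilatura.extract(html))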

Usage

Using Timemaps

save_texts_from_url([timegate], [url], filter_text=[True or False])

The timegate value should be one of:

  • nla – National Library of Australia
  • nlnz – National Library of New Zealand
  • bl – UK Web Archive
  • ia – Internet Archive
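
For example, to save the filtered text of every capture of the National Library of Australia's homepage held by the Australian Web Archive:

save_texts_from_url('nla', 'http://nla.gov.au/', filter_text=True)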

Using the Internet Archive's CDX API

Use a CDX query to find captures of all urls that include a specified keyword.

save_texts_from_cdx_query([url], filter_text=[True or False], filter=['original:.*[keyword].*', 'statuscode:200', 'mimetype:text/html'])

The url value can use wildcards to indicate whether it is a domain or prefix query, for example:

  • nla.gov.au/* – prefix query, search all files under nla.gov.au
  • *.nla.gov.au – domain query, search all files under nla.gov.au and any of its subdomains

You can use any of the keyword parameters that the CDX API recognises, but you'll probably want to filter on statuscode and mimetype, and apply a regular expression to the original url, as in the template above.
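
If you want a sense of what a set of filters will match before launching a full harvest, you could query the CDX API directly. This is just a quick preview sketch, not part of the notebook's own workflow – the limit parameter keeps the response small.

import requests

params = {
    'url': 'nla.gov.au/*',  # prefix query
    'output': 'json',
    'filter': ['original:.*policy.*', 'statuscode:200', 'mimetype:text/html'],
    'limit': 10  # just a preview, not the full result set
}
response = requests.get('http://web.archive.org/cdx/search/cdx', params=params)
for row in response.json()[1:]:  # the first row contains the field names
    print(row)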

Output

A directory will be created for each url processed. The name of the directory will be a slugified version of the url in SURT (Sort-friendly URI Reordering Transform) format, truncated to 50 characters to keep file paths manageable.

Each text file will be saved separately within the directory. Filenames follow the pattern:

[slugified SURT url]-[capture timestamp].txt
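
To see how a url is transformed into a directory and file name, you can run the same functions the notebook uses (surt and slugify). The exact values shown in the comments are indicative only.

from surt import surt
from slugify import slugify

url = 'http://www.nla.gov.au/about'
urlkey = surt(url)           # SURT form, e.g. 'au,gov,nla)/about'
print(urlkey)
print(slugify(urlkey)[:50])  # slugified and truncated, e.g. 'au-gov-nla-about'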

There's also a metadata.json file that includes basic details of the harvest:

  • timegate – the repository used
  • url – the url harvested
  • filter_text – text filtering option used
  • date – date and time the harvest was started
  • mementos – details of each capture, including:
    • url – link to capture in web archive
    • text_file – path to the harvested text file
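
A metadata.json file for a small harvest might look something like this (the values here are made up for illustration):

{
  "timegate": "https://web.archive.org.au/awa/",
  "url": "http://nla.gov.au/",
  "filter_text": true,
  "date": "2021-01-01 09:30:00",
  "mementos": [
    {
      "url": "https://web.archive.org.au/awa/19990101000000id_/http://nla.gov.au/",
      "text_file": "text/au-gov-nla/au-gov-nla-19990101000000.txt"
    }
  ]
}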

Import what we need

In [ ]:
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
import re
import pandas as pd
from bs4 import BeautifulSoup
from surt import surt
from pathlib import Path
from slugify import slugify
from tqdm.auto import tqdm
import trafilatura
import arrow
import json
import time
from lxml.etree import ParserError
from IPython.display import display, FileLink, FileLinks
s = requests.Session()
retries = Retry(total=10, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))
In [ ]:
# Default list of repositories -- you could add to this
TIMEGATES = {
    'nla': 'https://web.archive.org.au/awa/',
    'nlnz': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/',
    'bl': 'https://www.webarchive.org.uk/wayback/archive/',
    'ia': 'https://web.archive.org/web/'
}

Define some functions

In [ ]:
def is_memento(url):
    '''
    Is this url a Memento? Checks for the presence of a timestamp.
    '''
    return bool(re.search(r'/\d{14}(?:id_|mp_|if_)*/http', url))

def get_html(url):
    '''
    Retrieve the original HTML content of an archived page.
    Follow redirects if they go to another archived page.
    Return the (possibly redirected) url from the response and the HTML content.
    '''
    # Adding the id_ hint tells the archive to give us the original harvested version, without any rewriting.
    url = re.sub(r'/(\d{14})(?:mp_)*/http', r'/\1id_/http', url)
    response = requests.get(url, allow_redirects=True)
    # Some captures might redirect themselves to live versions.
    # If the redirected url doesn't look like a Memento, rerun the request without following redirects.
    if not is_memento(response.url):
        response = requests.get(url, allow_redirects=False)
    return {'url': response.url, 'html': response.content}

def convert_lists_to_dicts(results):
    '''
    Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.
    Renames keys to standardise IA with other Timemaps.
    '''
    if results:
        keys = results[0]
        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]
    else:
        results_as_dicts = results
    # Rename keys
    for d in results_as_dicts:
        d['status'] = d.pop('statuscode')
        d['mime'] = d.pop('mimetype')
        d['url'] = d.pop('original')
    return results_as_dicts

def get_capture_data_from_memento(url, request_type='head'):
    '''
    For OpenWayback systems this can get some extra capture info to insert into Timemaps.
    '''
    if request_type == 'head':
        response = requests.head(url)
    else:
        response = requests.get(url)
    headers = response.headers
    length = headers.get('x-archive-orig-content-length')
    status = headers.get('x-archive-orig-status')
    status = status.split(' ')[0] if status else None
    mime = headers.get('x-archive-orig-content-type')
    mime = mime.split(';')[0] if mime else None
    return {'length': length, 'status': status, 'mime': mime}

def convert_link_to_json(results, enrich_data=False):
    '''
    Converts link formatted Timemap to JSON.
    '''
    data = []
    for line in results.splitlines():
        parts = line.split('; ')
        if len(parts) > 1:
            link_type = re.search(r'rel="(original|self|timegate|first memento|last memento|memento)"', parts[1]).group(1)
            if link_type == 'memento':
                link = parts[0].strip('<>')
                timestamp, original = re.search(r'/(\d{14})/(.*)$', link).groups()
                capture = {'timestamp': timestamp, 'url': original}
                if enrich_data:
                    capture.update(get_capture_data_from_memento(link))
                data.append(capture)
    return data
                
def get_timemap_as_json(timegate, url):
    '''
    Get a Timemap then normalise results (if necessary) to return a list of dicts.
    '''
    tg_url = f'{TIMEGATES[timegate]}timemap/json/{url}/'
    response = requests.get(tg_url)
    response_type = response.headers['content-type']
    # pywb style Timemap
    if response_type == 'text/x-ndjson':
        data = [json.loads(line) for line in response.text.splitlines()]
    # IA Wayback style Timemap
    elif response_type == 'application/json':
        data = convert_lists_to_dicts(response.json())
    # Link style Timemap (OpenWayback)
    elif response_type in ['application/link-format', 'text/html;charset=utf-8']:
        data = convert_link_to_json(response.text)
    return data

def get_all_text(capture_data):
    '''
    Get all the human visible text from a web page, including headers, footers, and navigation.
    Does some cleaning up to remove multiple spaces, tabs, and newlines.
    ''' 
    try:
        text = BeautifulSoup(capture_data['html'], 'lxml').get_text()
    except TypeError:
        return None
    else:
        # Remove multiple newlines
        text = re.sub(r'\n\s*\n', '\n\n', text)
        # Replace runs of spaces or tabs with a single space
        text = re.sub(r'( |\t){2,}', ' ', text)
        # Remove leading spaces
        text = re.sub(r'\n ', '\n', text)
        # Remove leading newlines
        text = re.sub(r'^\n*', '', text)
        return text

def get_main_text(capture_data):
    '''
    Get only the main text from a page, excluding boilerplate and navigation.
    '''
    try:
        text = trafilatura.extract(capture_data['html'])
    except ParserError:
        text = ''
    return text
    
def get_text_from_capture(capture_url, filter_text=False):
    '''
    Get text from the given memento.
    If filter_text is True, only return the significant text (excluding things like navigation).
    '''
    capture_data = get_html(capture_url)
    if filter_text:
        text = get_main_text(capture_data)
    else:
        text = get_all_text(capture_data)
    return text

def process_capture_list(timegate, captures, filter_text=False, url=None):
    if not url:
        url = captures[0]['url']
    metadata = {
            'timegate': TIMEGATES[timegate],
            'url': url,
            'filter_text': filter_text,
            'date': arrow.now().format('YYYY-MM-DD HH:mm:ss'),
            'mementos': []
    }
    try:
        urlkey = captures[0]['urlkey']
    except KeyError:
        urlkey = surt(url)
    # Truncate urls longer than 50 chars so that filenames are not too long
    output_dir = Path('text', slugify(urlkey)[:50])
    output_dir.mkdir(parents=True, exist_ok=True)
    for capture in tqdm(captures, desc='Captures'):
        file_path = Path(output_dir, f'{slugify(urlkey)[:50]}-{capture["timestamp"]}.txt')
        # Don't reharvest if file already exists
        if not file_path.exists():
            # Only process successful captures (or all for NLNZ)
            if timegate == 'nlnz' or capture['status'] == '200':
                capture_url = f'{TIMEGATES[timegate]}{capture["timestamp"]}id_/{capture["url"]}'
                capture_text = get_text_from_capture(capture_url, filter_text)
                if capture_text:
                    file_path.write_text(capture_text)
                    metadata['mementos'].append({'url': capture_url, 'text_file': str(file_path)})
                time.sleep(0.2)
    metadata_file = Path(output_dir, 'metadata.json')
    with metadata_file.open('wt') as md_json:
        json.dump(metadata, md_json)

def save_texts_from_url(timegate, url, filter_text=False):
    '''
    Save the text contents of all available captures for a given url from the specified repository.
    Saves both the harvested text files and a json file with the harvest metadata.
    '''
    timemap = get_timemap_as_json(timegate, url)
    if timemap:
        process_capture_list(timegate, timemap, url=url, filter_text=filter_text)
        
def prepare_params(url, **kwargs):
    '''
    Prepare the parameters for a CDX API request.
    Adds all supplied keyword arguments as parameters (changing from_ to from).
    Adds in a few necessary parameters.
    '''
    params = kwargs
    params['url'] = url
    params['output'] = 'json'
    params['pageSize'] = 5
    # CDX accepts a 'from' parameter, but this is a reserved word in Python
    # Use 'from_' to pass the value to the function & here we'll change it back to 'from'.
    if 'from_' in params:
        params['from'] = params['from_']
        del(params['from_'])
    return params

def get_total_pages(params):
    '''
    Get number of pages in a query.
    Note that the number of pages doesn't tell you much about the number of results, as the numbers per page vary.
    '''
    these_params = params.copy()
    these_params['showNumPages'] = 'true'
    response = s.get('http://web.archive.org/cdx/search/cdx', params=these_params, headers={'User-Agent': ''})
    return int(response.text)

def get_cdx_data(params):
    '''
    Make a request to the CDX API using the supplied parameters.
    Return results converted to a list of dicts.
    '''
    response = s.get('http://web.archive.org/cdx/search/cdx', params=params)
    response.raise_for_status()
    results = response.json()
    try:
        if not response.from_cache:
            time.sleep(0.2)
    except AttributeError:
        # Not using cache
        time.sleep(0.2)
    return convert_lists_to_dicts(results)

def harvest_cdx_query(url, **kwargs):
    '''
    Harvest results of query from the IA CDX API using pagination.
    Returns captures as a list of dicts.
    '''
    results = []
    page = 0
    params = prepare_params(url, **kwargs)
    total_pages = get_total_pages(params)
    with tqdm(total=total_pages-page, desc='CDX') as pbar:
        while page < total_pages:
            params['page'] = page
            results += get_cdx_data(params)
            page += 1
            pbar.update(1)
    return results

def save_texts_from_cdx_query(url, filter_text=False, **kwargs):
    captures = harvest_cdx_query(url, **kwargs)
    if captures:
        df = pd.DataFrame(captures)
        groups = df.groupby(by='urlkey')
        print(f'{len(groups)} matching urls')
        for name, group in groups:
            process_capture_list('ia', group.to_dict('records'), filter_text=filter_text)

Harvesting a single url or list of urls

Get all human-visible text from all captures of a single url in the Australian Web Archive.

In [ ]:
save_texts_from_url('nla', 'http://nla.gov.au/', filter_text=False)

Get only significant text from all captures of a single url in the Australian Web Archive.

In [ ]:
save_texts_from_url('nla', 'http://nla.gov.au/', filter_text=True)

Harvest text from a series of urls.

In [ ]:
urls = [
    'http://nla.gov.au',
    'http://nma.gov.au',
    'http://awm.gov.au'
]

for url in urls:
    save_texts_from_url('nla', url, filter_text=True)

Harvesting matching pages from a domain

Harvest text from all pages under dfat.gov.au that include the word policy in the url. Note the use of the regular expression .*policy.* to match against the original url.

In [ ]:
save_texts_from_cdx_query('dfat.gov.au/*', filter_text=True, filter=['original:.*policy.*', 'statuscode:200', 'mimetype:text/html'])

Viewing and downloading the results

If you're using Jupyter Lab, you can browse the results of this notebook by looking inside the text folder. I've also enabled the jupyter-archive extension, which adds a download option to the right-click menu. Just right click on a folder and you'll see an option to 'Download as an Archive'. This will zip up and download the folder.

The cells below provide a couple of alternative ways of viewing and downloading the results.

In [ ]:
# Display all the files under the text directory (this could be a long list)
display(FileLinks('text'))
In [ ]:
# Tar/gzip the text directory
!tar -czf text.tar.gz text
In [ ]:
# Display a link to the gzipped data
# In JupyterLab you'll need to Shift+right-click on the link and choose 'Download link'
display(FileLink('text.tar.gz'))

Created by Tim Sherratt for the GLAM Workbench. Support me by becoming a GitHub sponsor!

Work on this notebook was supported by the IIPC Discretionary Funding Programme 2019-2020