Find and explore Powerpoint presentations from a specific domain


New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

Web archives don't just contain HTML pages! Using the filter parameter in CDX queries we can limit our results to particular types of files, for example Powerpoint presentations.

This notebook helps you find, download, and explore all the presentation files captured from a particular domain, like defence.gov.au. It uses the Internet Archive by default, as their CDX API allows domain-level queries and pagination; however, you could try using the UKWA or the National Library of Australia instead (prefix queries only).

This notebook includes a series of processing steps:

  1. Harvest capture data
  2. Remove duplicates from capture data and download files
  3. Convert Powerpoint files to PDFs
  4. Extract screenshots and text from the PDFs
  5. Save metadata, screenshots, and text into an SQLite database for exploration
  6. Open the SQLite db in Datasette for exploration

Here's an example of the SQLite database created by harvesting Powerpoint files from the defence.gov.au domain, running in Datasette on Glitch.

Moving large files around and extracting useful data from proprietary formats is not a straightforward process. While this notebook has been tested and will work running on Binder, you'll probably want to shift across to a local machine if you're doing any large-scale harvesting. That'll make it easier for you to deal with corrupted files, broken downloads etc.

For more examples of harvesting domain-level data see Harvesting data about a domain using the IA CDX API.

In [ ]:
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from tqdm.auto import tqdm
import pandas as pd
import time
import os
from urllib.parse import urlparse
from pathlib import Path
import math
import io
from PIL import Image
import PIL
# pyMuPDF (aka Fitz) seems to do a better job of converting PDFs to images than pdf2image
import fitz
import sqlite_utils
from slugify import slugify
import arrow
from notebook.notebookapp import list_running_servers
from IPython.display import display, HTML, FileLink, FileLinks

# Also need LibreOffice installed by whatever means is needed for local system

s = requests.Session()
retries = Retry(total=10, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))
In [ ]:
def convert_lists_to_dicts(results):
    '''
    Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.
    Renames keys to standardise IA with other Timemaps.
    '''
    if results:
        keys = results[0]
        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]
    else:
        results_as_dicts = results
    for d in results_as_dicts:
        d['status'] = d.pop('statuscode')
        d['mime'] = d.pop('mimetype')
        d['url'] = d.pop('original')
    return results_as_dicts

def get_total_pages(params):
    these_params = params.copy()
    these_params['showNumPages'] = 'true'
    response = s.get('http://web.archive.org/cdx/search/cdx', params=these_params, headers={'User-Agent': ''})
    return int(response.text)

def query_cdx_ia(url, **kwargs):
    results = []
    page = 0
    params = kwargs
    params['url'] = url
    params['output'] = 'json'
    total_pages = get_total_pages(params)
    # print(total_pages)
    with tqdm(total=total_pages-page) as pbar:
        while page < total_pages:
            params['page'] = page
            response = s.get('http://web.archive.org/cdx/search/cdx', params=params, headers={'User-Agent': ''})
            # print(response.url)
            response.raise_for_status()
            results += convert_lists_to_dicts(response.json())
            page += 1
            pbar.update(1)
            time.sleep(0.2)
    return results

Harvest capture data

See Exploring the Internet Archive's CDX API and Comparing CDX APIs for more information on getting CDX data.

Setting filter=mimetype:application/vnd.ms-powerpoint would get standard MS .ppt files, but what about .pptx or Open Office files? Using a regular expression like filter=mimetype:.*(powerpoint|presentation).* should get most presentation software variants.
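
If you want a quick sense of what that pattern covers, you can test it against a few sample mimetype values, as in the sketch below. Note that this is just an illustration in Python; the CDX server applies the filter using its own regex engine.

In [ ]:
import re

# Check the mimetype pattern against some common presentation mimetypes (illustration only)
pattern = re.compile(r'.*(powerpoint|presentation).*')
samples = [
    'application/vnd.ms-powerpoint',  # .ppt
    'application/vnd.openxmlformats-officedocument.presentationml.presentation',  # .pptx
    'application/vnd.oasis.opendocument.presentation',  # .odp (Open Office / LibreOffice)
    'application/pdf'  # not a presentation, so shouldn't match
]
for mime in samples:
    print(mime, '->', bool(pattern.match(mime)))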

In [ ]:
# Set the domain that we're interested in
domain = 'defence.gov.au'

# Create output directory for data
domain_dir = Path('domains', slugify(domain))
domain_dir.mkdir(parents=True, exist_ok=True)
In [ ]:
# Domain or prefix search? Domain...
# Collapse on digest? Only removes adjacent captures with the same digest, so probably won't make much difference
# Note the use of regex in the mimetype filter -- should capture all(?) presentations.
results = query_cdx_ia(f'*.{domain}', filter='mimetype:.*(powerpoint|presentation).*')

Let's convert the harvested data into a dataframe.

In [ ]:
df = pd.DataFrame(results)

How many captures are there?

In [ ]:
df.shape

How many unique files are there? This drops duplicates based on the digest field.

In [ ]:
df_unique = df.drop_duplicates(subset=['digest'], keep='first')
In [ ]:
df_unique.shape

What mimetype values are there?

In [ ]:
df_unique['mime'].value_counts()

Download all the Powerpoint files and save some metadata about them

In this step we'll work through the capture data and attempt to download each unique file. We'll also add in some extra metadata:

  • first_capture – the first time this file appears in the archive (ISO formatted date, YYYY-MM-DD)
  • last_capture – the last time this file appears in the archive (ISO formatted date, YYYY-MM-DD)
  • current_status – HTTP status code currently returned by the original URL

The downloaded Powerpoint files will be stored in /domains/[this domain]/powerpoints/original. The script won't overwrite any existing files, so if it fails and you need to restart, it'll just pick up from where it left off.

In [ ]:
def format_timestamp_as_iso(timestamp):
    return arrow.get(timestamp, 'YYYYMMDDHHmmss').format('YYYY-MM-DD')

def get_date_range(df, digest):
    '''
    Find the first and last captures identified by digest.
    Return the dates in ISO format.
    '''
    captures = df.loc[df['digest'] == digest]
    from_date = format_timestamp_as_iso(captures['timestamp'].min())
    to_date = format_timestamp_as_iso(captures['timestamp'].max())
    return(from_date, to_date)

def check_if_exists(url):
    '''
    Check to see if a file still exists on the web.
    Return the current status code.
    '''
    try:
        response = s.head(url, allow_redirects=True)
    # If there are any problems that don't generate codes
    except requests.exceptions.ConnectionError:
        return 'unreachable'
    return response.status_code

def save_files(df, add_current_status=True):
    '''
    Attempts to download each unique file.
    Adds first_capture, last_capture, and file_path to the file metadata.
    Optionally checks to see if file is still accessible on the live web, adding current status code to metadata.
    '''
    metadata = df.drop_duplicates(subset=['digest'], keep='first').to_dict('records')
    for row in tqdm(metadata):
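        # The 'id_' suffix in the Wayback Machine url asks for the original file
        # as archived, without any of the archive's rewriting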
        url = f'https://web.archive.org/web/{row["timestamp"]}id_/{row["url"]}'
        parsed = urlparse(row['url'])
        suffix = Path(parsed.path).suffix
        # This should give a sortable and unique filename if there are multiple versions of a file
        file_name = f'{slugify(row["urlkey"])}-{row["timestamp"]}{suffix}'
        # print(file_name)
        # Create the output directory for the downloaded files
        output_dir = Path(domain_dir, 'powerpoints', 'original')
        output_dir.mkdir(parents=True, exist_ok=True)
        file_path = Path(output_dir, file_name)
        # Don't re-get files we already got
        if not file_path.exists():
            response = s.get(url=url)
            file_path.write_bytes(response.content)
            time.sleep(1)
        # Get first and last capture dates
        first, last = get_date_range(df, row["digest"])
        row['first_capture'] = first
        row['last_capture'] = last
        # This can slow things down a bit, so make it optional
        if add_current_status:
            row['current_status'] = check_if_exists(row["url"])
        row['file_path'] = str(file_path)
        row['memento'] = f'https://web.archive.org/web/{row["timestamp"]}/{row["url"]}'
    return metadata
In [ ]:
metadata = save_files(df)

Let's save the enriched data as a CSV file.

In [ ]:
df_md = pd.DataFrame(metadata)
csv_file = Path(domain_dir, f'{slugify(domain)}-powerpoints.csv')
df_md.to_csv(csv_file, index=False)
display(FileLink(csv_file))

How many of the archived Powerpoints are still accessible on the live web?

In [ ]:
df_md['current_status'].value_counts()

Convert the Powerpoint files to PDFs

In this step we'll convert the Powerpoint files to PDFs. PDFs are easier to work with, and we can use them to extract text and images from the original presentation files. However, the conversion process can be a bit fiddly. Before you undertake this you'll need to make sure that you have a recent version of LibreOffice installed on your system.

If you're using this notebook on Binder, or have created a virtualenv using this repository's requirements.txt, you'll have installed the Python command-line program unoconv. Unoconv wraps around LibreOffice to provide a standard interface for converting files from one file type to another. However, while unoconv seems to work ok when running a notebook in a conda environment, it fails in a standard virtualenv. It's got something to do with the default interpreter, but even using LibreOffice's own packaged Python doesn't seem to help.

Fortunately, you can just call LibreOffice directly. To do this, you'll need to know the location of the LibreOffice program. For example, on MacOS this is probably /Applications/LibreOffice.app/Contents/MacOS/soffice. Once you have that, it's easy to convert a directory full of Powerpoint files to PDFs, for example:

!/Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to [format to convert to, eg pdf] --outdir [output directory] [input files]

I've included both the unoconv and direct LibreOffice options below so you can use the one that works best for you. Just comment/uncomment lines as necessary. If you're using this notebook on Binder, it should just work with the default settings.

Some other issues:

  • Both commands might fail silently if LibreOffice is already running on your system – close it first! If a previous call to LibreOffice has failed in some ugly way, it might still be running in the background, so if nothing seems to be happening, hunt around for a running LibreOffice process.
  • Both commands might fail in unhelpful ways on corrupted or malformed Powerpoints. If you keep getting stuck on a particular file, just try moving or deleting it and then run the conversion command again.
  • I've seen reports that suggest the commands might fail if you feed in lots of files. I've processed > 100 without problem, but if this does happen it might be necessary to loop through individual files, or break them up into groups (there's a rough sketch of a file-by-file loop after this list).
  • Because of missing fonts or general weirdness, the PDFs are unlikely to be exact representations of the Powerpoint files. But the point of this exercise is to give some insight into the contents of the files, rather than create perfectly styled PDFs. Again, you'll probably get better results running on your local machine where you've got a range of fonts installed.
  • If you use the LibreOffice program directly, you'll see a lot of output which might include some warnings. Don't worry unless it fails completely.
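
If you do run into problems with a large batch, or keep getting stuck on a particular file, something like the loop below converts the files one at a time instead. This is only a sketch: the soffice location is an assumption (point it at the LibreOffice program on your own system), and it reuses the original and pdfs directories created elsewhere in this notebook.

In [ ]:
import subprocess

# Convert each Powerpoint file individually rather than feeding LibreOffice one big batch.
# The soffice path is an assumption -- adjust it for your system,
# eg /Applications/LibreOffice.app/Contents/MacOS/soffice on MacOS.
soffice = '/usr/bin/soffice'
input_dir = Path(domain_dir, 'powerpoints', 'original')
output_dir = Path(domain_dir, 'powerpoints', 'pdfs')
output_dir.mkdir(parents=True, exist_ok=True)

for ppt_file in tqdm(sorted(input_dir.glob('*.pp*'))):
    # check=False means one corrupted file won't stop the whole loop
    subprocess.run([soffice, '--headless', '--convert-to', 'pdf', '--outdir', str(output_dir), str(ppt_file)], check=False)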
In [ ]:
input_dir = str(Path(domain_dir, 'powerpoints', 'original')) + '/*[ppt,pps,pptx]'
output_dir = Path(domain_dir, 'powerpoints', 'pdfs')
output_dir.mkdir(parents=True, exist_ok=True)

# Try with unoconv?
#!unoconv -v -f pdf -o {output_dir} {input_dir}

# Using typical Libre Office location on Ubuntu (ie Binder)
!/usr/bin/soffice --headless --convert-to pdf --outdir {output_dir} {input_dir}

# Typical MacOS installation of most recent LibreOffice
#!/Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to pdf --outdir {output_dir} {input_dir}

Generate screenshots and save data from PDFs

Now that we have PDF-formatted versions of the presentations, we can save individual slides as images and extract all the text from each presentation. The function below works its way through our metadata, looking for the PDF version of each of the downloaded Powerpoint files. If the PDF exists it:

  • Saves the first slide (ie the first page in the PDF) as a PNG image in the /domains/[your selected domain]/powerpoints/images directory.
  • Resizes the image to a maximum width of 300px and saves it as binary data in the metadata file.
  • Loops through all the pages to extract the complete text of the presentation, and saves the text to the metadata.

We end up with a metadata record for each file that includes its full text as well as an image of its first page. In the next step, we'll turn this into a database.

In [ ]:
def get_data_from_pdfs(metadata):
    pdf_data = metadata.copy()
    for pdf in tqdm(pdf_data):
        # See if pdf exists
        pdf_name = f'{slugify(pdf["urlkey"])}-{pdf["timestamp"]}.pdf'
        # print(pdf_name)
        input_dir = Path(domain_dir, 'powerpoints', 'pdfs')
        output_dir = Path(domain_dir, 'powerpoints', 'images')
        output_dir.mkdir(parents=True, exist_ok=True)
        pdf_path = Path(input_dir, pdf_name)
        if pdf_path.exists():
            doc = fitz.open(pdf_path)
            page_count = doc.pageCount
            page_num = 0
            text = ''
            # You can define matrix to increase resolution / size - https://pymupdf.readthedocs.io/en/latest/faq/
            # Increase numbers below to zoom
            mat = fitz.Matrix(1,1)
            while (page_num < page_count):
                page = doc.loadPage(page_num)
                # Convert first page to image
                if page_num == 0:
                    # Get page as image and save
                    pix = page.getPixmap(matrix=mat)
                    image_file = Path(output_dir, f'{pdf_path.stem}.png')
                    pix.writePNG(str(image_file))
                    # Resize the image as a thumbnail for saving in db
                    img_data = pix.getPNGData()
                    img = Image.open(io.BytesIO(img_data))
                    ratio = 300 / img.width
                    (width, height) = (300, math.ceil(img.height * ratio))
                    resized_img = img.resize((width, height), PIL.Image.LANCZOS)
                    # Save as data
                    buffer = io.BytesIO()
                    resized_img.save(buffer, format='PNG')
                    pdf['image'] = buffer.getvalue()
                text += page.getText()
                page_num += 1
            pdf['text'] = text
        else:
            pdf['text'] = ''
            pdf['image'] = ''
    return pdf_data
In [ ]:
pdf_data = get_data_from_pdfs(metadata)

Save into SQLite so we can explore the results in Datasette

In this step we load the enriched metadata into an SQLite database using sqlite-utils. Then we can open the database in Datasette and explore the results.

There are a few interesting things to note:

  • Using sqlite-utils we can index the text column so that it's searchable as full text.
  • To get Datasette to work on Binder we have to find the base url of the Jupyter proxy and give it to Datasette.
In [ ]:
# Create the db
db_file = Path(domain_dir, f'{slugify(domain)}-powerpoints.db')
db = sqlite_utils.Database(db_file)
db['files'].delete_where()
In [ ]:
# Insert the metadata into the db
db['files'].insert_all(pdf_data, column_order=('image', 'timestamp', 'memento', 'first_capture', 'last_capture', 'current_status', 'text', 'mime', 'status', 'url'))
In [ ]:
# Make the text column full text searchable
db['files'].enable_fts(['text'], fts_version='FTS4')
db['files'].optimize()
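
If you'd like a quick peek at the results without launching Datasette, you can query the full-text index directly. The cell below is just a sketch: 'exercise' is an arbitrary example search term, and the query assumes the files_fts table created by enable_fts() above.

In [ ]:
# Search the full-text index directly -- 'exercise' is just an example term
query = '''
    select files.url, files.memento, files.first_capture
    from files
    join files_fts on files.rowid = files_fts.rowid
    where files_fts match :search
'''
pd.read_sql(query, db.conn, params={'search': 'exercise'})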

The following code should launch Datasette correctly on either Binder or a local system.

In [ ]:
# Get current running servers
servers = list_running_servers()

# Get the current base url
base_url = next(servers)['base_url']

# Create a base url for Datasette using the proxy path
proxy_url = f'{base_url}proxy/absolute/8001/'

# Display a link to Datasette
display(HTML(f'<p><a style="display: block; border: 1px solid #307fc1; background-color: #1976d2; color: #ffffff; padding: 10px; text-align: center;"href="{proxy_url}">View Datasette</a> (Click on the stop button in the top menu bar to close the Datasette server)</p>'))

# Launch Datasette
!datasette {str(db_file)} --port 8001 --config base_url:{proxy_url} --plugins-dir plugins

Viewing or downloading files

If you're using Jupyter Lab, you can browse the results of this notebook by just looking inside the domains folder. I've also enabled the jupyter-archive extension which adds a download option to the right-click menu. Just right click on a folder and you'll see an option to 'Download as an Archive'. This will zip up and download the folder. Remember, however, that some of these folders will contain a LOT of data, so downloading everything might not always work.

The cells below provide a couple of alternative ways of viewing and downloading the results.

In [ ]:
# Display all the files under the current domain folder (this could be a long list)
display(FileLinks(str(domain_dir)))
In [ ]:
# Tar/gzip the current domain folder
!tar -czf {str(domain_dir)}.tar.gz {str(domain_dir)}
In [ ]:
# Display a link to the gzipped data
# In JupyterLab you'll need to Shift+right-click on the link and choose 'Download link'
display(FileLink(f'{str(domain_dir)}.tar.gz'))

Created by Tim Sherratt for the GLAM Workbench.

Work on this notebook was supported by the IIPC Discretionary Funding Programme 2019-2020.