Display the results of a harvest as a searchable database using Datasette

You've harvested lots of newspaper articles from Trove using the Newspaper Harvester. But how do you examine the results? Before you fire up Excel (which is likely to weird things with you dates), give Datasette a try!

Datasette is 'a tool for exploring and publishing data'. Give it a CSV file and it turns it into a fully-searchable database, running in your browser. It supports facets, full-text search, and, with a bit of tweaking, can even present images. Although Datasette is a command-line tool, we can run from within a Jupyter notebook, and open a new window to display the results. This notebook shows you how to load the newspaper data you've harvested into Datasette, and start it up. If you've also harvested full-text and images from the newspaper articles, you can add these to your database as well!

Screenshot of Datasette

Import what we need

In [ ]:
import os
import pandas as pd
import sqlite_utils
from pathlib import Path
from notebook.notebookapp import list_running_servers
from IPython.display import display, HTML, FileLink, FileLinks
import re
import json

Load harvested data into Datasette

By default, the cells below load the most recently completed newspaper harvest into Datasette. If you want to load a different harvest, simply supply the harvest's timestamp when you run open_datasette().

In [ ]:
def get_latest_harvest():
    '''
    Get the timestamp of the most recent harvest.
    '''
    harvests = sorted([d for d in os.listdir('data') if os.path.isdir(os.path.join('data', d))])
    return harvests[-1]

def open_harvest_data(timestamp=None):
    '''
    Open the results of the specified harvest (most recent by default).
    Returns a list of records and a timestamp.
    '''
    if not timestamp:
        timestamp = get_latest_harvest()
    df = pd.read_csv(Path('data', timestamp, 'results.csv'))
    return df.to_dict('records')

def get_db(timestamp=None):
    '''
    Look for a db named with the supplied timestamp (or the timestamp of the latest harvest).
    If one exists, return the path to it.
    If it doesn't, create one, and load the CSV data from the harvest.
    '''
    if not timestamp:
        timestamp = get_latest_harvest()
    db_path = Path('data', timestamp, 'results.db')
    # If the db doesn't exist, create it
    if not db_path.exists():
        # Get the harvest data
        data = open_harvest_data(timestamp)
        # Create the db
        db = sqlite_utils.Database(db_path)
        # Load the data, specifying `article_id` as the primary key
        db['records'].insert_all(data, pk='article_id')
    return db_path

def get_proxy_url():
    # Get current running servers
    servers = list_running_servers()
    # Get the current base url
    base_url = next(servers)['base_url']
    # Create a base url for Datasette using the proxy path
    proxy_url = f'{base_url}proxy/absolute/8001/'
    return proxy_url

def open_datasette(timestamp=None):
    '''
    This gets the base url of the currently running notebook. It then uses this url to 
    construct a link to your Datasette instance, using jupyter-server-proxy.
    Finally it creates a button to open up a new tab to view your database.
    '''
    if not timestamp:
        timestamp = get_latest_harvest()
    db_path = get_db(timestamp)
    proxy_url = get_proxy_url()
    # Display a link to Datasette
    display(HTML(f'<p><a style="width: 200px; display: block; border: 1px solid #307fc1; background-color: #1976d2; color: #ffffff; padding: 10px; text-align: center; font-weight: bold;"href="{proxy_url}">View in Datasette</a> (Click on the stop button in the top menu bar to close the Datasette server)</p>'))
    # Launch Datasette
    !datasette -- {str(db_path)} --port 8001 --config base_url:{proxy_url} --config truncate_cells_html:100 --metadata metadata.yml
    
In [ ]:
open_datasette()

Add OCRd text and make it searchable

The Trove newspaper harvester saves the OCRd text of each article into a separate text file. This is to make the harvest more manageable. But you can easily insert the text into Datasette and make it fully-searchable. You might remember that the text files are named using the article id, so we can just grab a file, extract the id, look up the corresponding record in our database, and add the text to the record. Then we just tell Datasette to add a full-text index so it can be easily searched.

In [ ]:
def get_text_files(timestamp=None):
    '''
    Get a list of the text files in a harvest.
    If no timestamp is supplied, it gets the files from the most recent harvest.
    '''
    if not timestamp:
        timestamp = get_latest_harvest()
    text_path = Path('data', timestamp, 'text')
    if not text_path.exists():
        print('No text files in this harvest! Remember to use the --text option when harvesting.')
    for text_file in text_path.glob('*.txt'):
        yield text_file
        
def add_text_to_db(timestamp=None):
    '''
    Add harvested text files to a db.
    '''
    db_path = get_db(timestamp)
    db = sqlite_utils.Database(db_path)
    # Add a text column to the db
    db['records'].add_column('text', str)
    # Loop through the text files, adding them to the appropriate record
    for text_file in get_text_files(timestamp):
        # Get the article id from the file name
        article_id = re.search(r'(\d+)\.txt', str(text_file)).group(1)
        # Get the article record from the db
        record = db['records'].get(int(article_id))
        # Update the article record with the text
        db['records'].update(record['article_id'], {'text': text_file.read_text()})
    # Make the text column full text searchable
    db['records'].enable_fts(['text'], fts_version='FTS4')
    db['records'].optimize()
In [ ]:
# Add the text (supply a timestamp if you want to specify a harvest)
add_text_to_db()
In [ ]:
# Start Datasette
open_datasette()

Our aim here is to display thumbnails of the article images alongside the article metadata. To do this we're making use of two Datasette plugins: datasette-media to serve the images, and datasette-json-html to insert image metadata into our database that will be automatically rendered as a HTML img tag. You just install the plugins via pip, and Datasette will automatically use them. As with the text files, the images file names include the article id, so we'll use it to link each image to its record. We'll even add a link on each thumbnail to open up the full-sized image.

Note that the configuration for the datasette-media plugin is contained in the metadata.yml file in this repository.

In [ ]:
def get_image_files(timestamp=None):
    '''
    Get a list of the image files in a harvest.
    If no timestamp is supplied, it gets the files from the most recent harvest.
    '''
    if not timestamp:
        timestamp = get_latest_harvest()
    image_path = Path('data', timestamp, 'image')
    if not image_path.exists():
        print('No image files in this harvest! Remember to use the --image option when harvesting.')
    for image_file in sorted(image_path.glob('*.jpg')):
        yield image_file
        
def add_images_to_db(timestamp=None):
    '''
    Add harvested images to a db.
    '''
    db_path = get_db(timestamp)
    db = sqlite_utils.Database(db_path)
    db['records'].add_column('image', str)
    db['records'].add_column('imagepath', str)
    proxy_url = get_proxy_url()
    # Make the text column full text searchable
    for image_file in get_image_files(timestamp):
        article_id = re.search(r'(\d+)-\d+\.jpg', str(image_file)).group(1)
        record = db['records'].get(int(article_id))
        # Just add the first image (for article spread over multiple pages)
        if not record['imagepath']:
            # Add the image file path
            db['records'].update(record['article_id'], {'imagepath': str(image_file)})
            # Add some JSON with the url to the image (via the media server and Jupyter proxy)
            db['records'].update(record['article_id'], {'image': json.dumps({'img_src': f'{proxy_url}-/media/thumbnail/{article_id}', 'href': f'{proxy_url}-/media/large/{article_id}'})})
In [ ]:
# Add the images
add_images_to_db()
In [ ]:
# Open Datasette
open_datasette()

Now what?

Datasette provides a number of other plugins you might use to explore or visualise your data. If you'd like to make your database public, look at Share your CSVs online using Datasette and Glitch.


Created by [Tim Sherratt](https://timsherratt.org) ([@wragge](https://twitter.com/wragge)) as part of the [GLAM Workbench project](https://github.com/glam-workbench/).

If you think this project is worthwhile you can [support it on Patreon](https://www.patreon.com/timsherratt).