Chapter 8 - Practical: Searching your own PDF library

-- A Python Course for the Humanities


Our Information Retrieval system of the previous chapter is simple and also quite efficient, but it lacks many features of a full-blown search engine. One particular downside of the system is that it stores the index in RAM. This means that we either have to keep it there or rebuild the index each time we would like to search through a particular collection. We could try to improve our system, but there are some excellent Python packages for Information Retrieval and search engines that we could use instead. In this section we will explore one of them, called Whoosh, a search engine and retrieval system written in pure Python.


Ever since science-journal giant Elsevier bought the once so promising bibliography management software Mendeley, I have looked for alternative ways to manage my research PDF collection. For me, one of the most important features of a PDF management tool is the ability to do full-text search in the PDFs for the content I am interested in. To this day I have not found a tool that fulfills all my needs, so we are going to build one ourselves. We will develop a full-blown search engine using Whoosh and build a web interface on top of Flask to query it in a user-friendly way. Just to tease you a bit: this is what our search engine will look like:

pydf

This chapter is the first in a series of more practical chapters, in which we will build actual applications ready for use by end-users. You won't be learning many new programming techniques in this chapter, but we will introduce you to a large number of modules and packages that are available either in the standard library of Python or as third-party packages. The most important take-home message is that if you think about implementing a piece of software, the first thing you should do is check whether someone else has already done it. Chances are good that someone has, and she or he has probably done a better job.

Before we get started, make sure you have installed both Whoosh and Flask on your computer. If you use Anaconda's Python distribution, you can execute the following command on the command line:

conda install whoosh flask

It is also possible to install the modules using pip:

pip install whoosh flask

Finally we will need one of the programs shipped with the PDF reader Xpdf. Please install Xpdf from here.


The Index

The Index is the central object in the Whoosh search engine. It allows us to store our data in a secure and sustainable way and makes it possible to search through our data. Every index requires an index schema, which defines the available fields in the index. Fields represent pieces of information for each document in the collection. Examples of fields are the author of a text, its publication date or the text body itself. For our PDF Index we will use a schema consisting of the following fields:

  1. id (a unique document ID);
  2. path (the filepath to the converted text file on your computer);
  3. source (the filepath to the original PDF file on your computer);
  4. author (the author or authors of the text);
  5. title (the title of a text);
  6. text (the actual text of the pdf).

All these fields will be indexed for each pdf file in our collection and each field will be searchable. To create our schema in Whoosh, we first import the Schema object from whoosh.fields:

In [ ]:
from whoosh.fields import Schema

For each field we need to specify to Whoosh what kind of field it is. Whoosh defines fields like KEYWORD for keyword fields, ID for unique identifiable fields (e.g. filepaths), DATETIME for fields specifying information about dates, TEXT for textual objects, and some others.


Quiz!

Have a look at the different field types provided by Whoosh, here. Which types would be appropriate for the fields defined above?

Double click this cell and write down your answer


Each field type can be passed some additional arguments, such as whether the field should be stored, whether it should be sortable and scorable, etc. For reasons that will be clear later on, we want all our ID fields plus the author and the title field to be stored in the index. We import the appropriate fields and define our schema as follows:

In [ ]:
from whoosh.fields import ID, KEYWORD, TEXT

pdf_schema = Schema(id = ID(unique=True, stored=True), 
                    path = ID(stored=True), 
                    source = ID(stored=True),
                    author = TEXT(stored=True), 
                    title = TEXT(stored=True),
                    text = TEXT)

Once we have defined a schema, we can create the index using the function create_in. We'll create the index in the directory pydf. The web application will reside in the same folder. Therefore, it is convenient to first direct your notebook to that directory.

In [ ]:
cd pydf/
In [ ]:
import os
from whoosh.index import create_in

if not os.path.exists("pdf-index"):
    os.mkdir("pdf-index")
    index = create_in("pdf-index", pdf_schema)

After creation, the index can be opened using the function open_dir:

In [ ]:
from whoosh.index import open_dir

index = open_dir("pdf-index")

Now that we have an index, we can add some documents to it. The IndexWriter object lets you do just that. The method writer() of the class Index returns an instance of IndexWriter.

In [ ]:
writer = index.writer()

We use the add_document method of IndexWriter to add documents to the index. add_document accepts keyword arguments corresponding to the fields we specified in our schema. We add our first document:

In [ ]:
writer.add_document(id = 'blei2003', 
                    path = 'data/blei2003.txt',
                    source = 'static/pdfs/blei2003.pdf',
                    author = 'David Blei, Andrew Ng, Michael Jordan',
                    title = 'Latent Dirichlet Allocation',
                    text = open('data/blei2003.txt', encoding='utf-8').read())

And some more:

In [ ]:
writer.add_document(id = 'goodwyn2013', 
                    path = 'data/goodwyn2013.txt',
                    source = 'static/pdfs/goodwyn2013.pdf',
                    author = 'Erik Goodwyn',
                    title = 'Recurrent motifs as resonant attractor states in the narrative field: a testable model of archetype',
                    text = open('data/goodwyn2013.txt', encoding='utf-8').read())

writer.add_document(id = 'meij2009', 
                    path = 'data/meij2009.txt',
                    source = 'static/pdfs/meij2009.pdf',
                    author = 'Edgar Meij, Dolf Trieschnigg, Maarten de Rijke, Wessel Kraaij',
                    title = 'Conceptual language models for domain-specific retrieval',
                    text = open('data/meij2009.txt', encoding='utf-8').read())

writer.add_document(id = 'muellner2011', 
                    path = 'data/muellner2011.txt',
                    source = 'static/pdfs/muellner2011.pdf',
                    author = 'David Muellner',
                    title = 'Modern hierarchical, agglomerative clustering algorithms',
                    text = open('data/muellner2011.txt', encoding='utf-8').read())

Calling commit() on the IndexWriter saves the changes to the index:

In [ ]:
writer.commit()

Searching and Querying

The index contains four documents. How can we search the index for particular documents? Similar to the method writer of the Index object, the method searcher returns a Searcher object which allows us to search the index. The Searcher object opens a connection to the index, similar to the way Python opens regular files. To prevent the system from running out of file handles, it is best practice to instantiate the searcher within a with statement:

with index.searcher() as searcher:
    ...  # do something with the searcher

We'll do so later when we automatically index a complete collection of pdfs, but for now it is slightly more convenient to instantiate a searcher as follows:

In [ ]:
searcher = index.searcher()

An instance of the class Searcher has a search method that takes a Query object as its argument. There are two ways to construct Query objects: manually or via a query parser. To construct a query that searches for the terms model and topic in the text field, we could write something like the following:

In [ ]:
from whoosh.query import Term, And

query = And([Term("text", "model"), Term("text", "topic")])

We can feed this to the search method to obtain a Results object:

In [ ]:
results = searcher.search(query)
print('Number of hits:', len(results))
print('Best hit:', results[0])

Note that all four of our documents contain both the terms model and topic, but the paper about topic models is considered the most relevant one for our query.


Quiz!

a) Construct the same query, but this time using an Or operator.

In [ ]:
from whoosh.query import Or 

# insert your code here

b) Construct a query using the And operator that searches for the terms index and topic in documents of which Dolf Trieschnigg is one of the co-authors:

In [ ]:
# insert your code here

These query constructs are very explicit and clean. It is, however, much more convenient to use Whoosh's QueryParser object to automatically parse strings into Query objects. We construct a query parser as follows:

In [ ]:
from whoosh.qparser import QueryParser

parser = QueryParser("text", index.schema)

Note that we have to specify the default field in which we want to search and pass the schema of our index. The QueryParser is quite an intelligent object. It allows users to group terms using the strings AND or OR and to exclude terms with the string NOT. It also lets you specify other fields in which to search for specific terms:

In [ ]:
parser.parse("probability model prior")
In [ ]:
parser.parse("(cluster OR grouping) AND (model OR schema)")
In [ ]:
parser.parse("topic index author:'Dolf Trieschnigg'")
In [ ]:
parser.parse("clust*")

The last query is a wildcard query that attempts to match all terms starting with clust.


Quiz!

Experiment a little with the QueryParser object.

In [ ]:
# insert your code here

Indexing our PDFs

Now that we have some basic understanding of Whoosh's most important data structures and functions, it is time to put together a number of Python scripts that will construct a Whoosh index on the basis of your own PDF library.

Open your favorite text editor and create a new Python file called indexer.py. Save it in the directory python-course/pydf. Add the schema we defined above to this file as well as the corresponding import statements. Before we start, it is good to make a list of all the components we need to index our collection. In order to pass our documents to the add_document method we will need the following components:

  1. A function that transforms a PDF into text, since Whoosh requires plain text;
  2. A way to extract meta information from the PDF files (e.g. the author and the title);
  3. The directory or directories containing PDF files we would like to index;
  4. A way to remember which files are already in the index.

Let's start with the first item on our list. Once you have installed Xpdf, a program named pdftotext will be available. This program converts PDF files into .txt files. Fire up a terminal or command-line prompt and test whether the command pdftotext is available. The subprocess module provides different ways to execute external programs from Python. For example, on a Unix machine, the following lines

In [ ]:
import subprocess

subprocess.call(['ls', '-l'])

will silently call the program ls with the argument -l and return 0 to Python, meaning that the program exited without errors. In a similar way we can call pdftotext to convert a PDF file into a text file:

In [ ]:
subprocess.call(['pdftotext', 'pdfs/blei2003.pdf', 'data/blei2003.txt'])

If all went well, Python should return 0 to your notebook. Whoosh requires that all text is encoded in UTF-8. We can pass the argument -enc UTF-8 to pdftotext to make sure our text files are in the right encoding:

In [ ]:
subprocess.call(['pdftotext', '-enc', 'UTF-8', 'pdfs/blei2003.pdf', 'data/blei2003.txt'])
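
Because subprocess.call only reports problems through its return value, a failed conversion is easy to miss. The sketch below shows one possible way to guard against that: shutil.which checks whether a program can be found on the search path, and a non-zero exit status is turned into an error. The helper names program_available and run_checked are made up for this illustration, and the child process is the Python interpreter itself rather than pdftotext, so that the example runs on any machine.

```python
import shutil
import subprocess
import sys


def program_available(name):
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


def run_checked(command):
    """Run `command` (a list of strings) and raise if it exits with a non-zero status."""
    status = subprocess.call(command)
    if status != 0:
        raise RuntimeError('%s exited with status %d' % (command[0], status))
    return status


# Stand-in for an external program: a Python one-liner with a chosen exit status.
ok = run_checked([sys.executable, '-c', 'import sys; sys.exit(0)'])
print('exit status:', ok)

try:
    run_checked([sys.executable, '-c', 'import sys; sys.exit(2)'])
except RuntimeError as error:
    print('conversion failed:', error)

print('pdftotext available:', program_available('pdftotext'))
```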

Quiz!

Write a function called pdftotext in Python that takes as argument the filename of a PDF file. It should convert the PDF into plain text and store the result in the directory pydf/data. The .txt file should have the same filename as the PDF file only with a different extension.

In [ ]:
import os

def pdftotext(pdf):
    # insert your code here

# if your answer is correct this should print the first 1000 characters of the text file
pdftotext("pdfs/blei2003.pdf")
with open(os.path.join('data', 'blei2003.txt'), encoding='utf-8') as infile:
    print(infile.read(1000))

pdftotext has the option -htmlmeta to extract some of the metadata stored in a PDF file. If we run the following command:

In [ ]:
subprocess.call(['pdftotext', '-htmlmeta', '-enc', 'UTF-8', 
                 'pdfs/muellner2011.pdf', 'data/muellner2011.html'])

the output is an HTML file that looks like this:

In [ ]:
print(open('data/muellner2011.html').read(500))

We will write a function called parse_html to extract the meta information and the text from these files. It takes as argument the filepath of the HTML file and returns a dictionary formatted as follows:

d = {'author': AUTHOR, 'title': TITLE, 'text': TEXT}

We have used BeautifulSoup in the previous chapter to read and parse web pages. It will be of service here as well.

In [ ]:
from bs4 import BeautifulSoup

def parse_html(filename):
    """Extract the Author, Title and Text from an HTML file
    which was produced by pdftotext with the option -htmlmeta."""
    # decode the file as UTF-8 ourselves; BeautifulSoup then receives text
    with open(filename, encoding='utf-8') as infile:
        html = BeautifulSoup(infile, "html.parser")
        d = {'text': html.pre.text}
        if html.title is not None:
            d['title'] = html.title.text
        for meta in html.find_all('meta'):
            try:
                if meta['name'] in ('Author', 'Title'):
                    d[meta['name'].lower()] = meta['content']
            except KeyError:
                continue
        return d
    
parse_html('data/muellner2011.html')

Quiz!

a) The parse_html function returns a dictionary consisting of the contents of some of the fields in our index schema. We will need this dictionary when we add new documents to our index. Reimplement the pdftotext function. It should convert a PDF file into an HTML file, stored in the directory pydf/data. Plug the function parse_html into the pdftotext function. Write the contents of the text field to a .txt file in the pydf/data directory. Make sure that this text file has the same filename as the PDF file except for the extension.

In [ ]:
def pdftotext(pdf):
    """Convert a pdf to a text file. Extract the Author and Title 
    and return a dictionary consisting of the author, title and 
    text."""
    basename, _ = os.path.splitext(os.path.basename(pdf))
    # insert your code here

b) For reasons that will become clear, it would be convenient to add the other fields of our index schema to the dictionary returned by parse_html as well. Rewrite the function pdftotext and add the values for the source, path and id to the dictionary returned by parse_html. The values for the source, path and id should match the values we used in the examples above, where we manually added some documents to the index.

In [ ]:
import shutil

def pdftotext(pdf):
    """Convert a pdf to a text file. Extract the Author and Title 
    and return a dictionary consisting of the author, title, text
    the source path, the path of the converted text file and the 
    file ID."""
    basename, _ = os.path.splitext(os.path.basename(pdf))
    subprocess.call(['pdftotext', '-enc', 'UTF-8', '-htmlmeta',
                     pdf, os.path.join('data', basename + '.html')])
    data = parse_html(os.path.join('data', basename + '.html'))
    with open(os.path.join('data', basename + '.txt'), 'w', encoding='utf-8') as outfile:
        outfile.write(data['text'])
    # insert your code here

pdftotext("pdfs/muellner2011.pdf")

c) After we have extracted the meta information from the HTML file, we no longer need it. The remove function in the module os can be used to remove files from your computer. Add a line of code to the pdftotext function that removes the HTML file.

d) We need to store the PDF files in the directory pydf/static/pdfs. The function copy from the shutil module allows you to copy files from one directory to another. Add some lines of code to the pdftotext function in which you copy the original PDF file to the directory pydf/static/pdfs. Make sure that this directory exists. Otherwise, first create it using the function mkdir from the os module. You can use the function exists from the os.path module to check whether a particular file or directory exists.


With the two hardest points off our list, we can move on to the next two. We need a way to make Python aware of the directories we would like to index. There are many different ways to accomplish this. I chose to create a configuration file in which we store the paths of the directories we want to index (and possibly some other information). The Python module configparser provides a class ConfigParser with which we can parse configuration files in a format similar to Microsoft Windows INI files:

[filepaths]
# pdf directory represents the directory or directories
# that you would like to index. Separate multiple directories
# by a semicolon.
pdf directory = pdfs
txt directory = data
index directory = pdf-index
source directory = static/pdfs

[programpaths]
pdftotext = /usr/local/bin/pdftotext

[indexer.options]
recompile = no
move = no
search limit = 20

Copy these lines to a file named pydf.ini in the pydf directory and adapt the paths to your own. We will read the contents of the configuration file using the class ConfigParser:

In [ ]:
import configparser

config = configparser.ConfigParser()
config.read('pydf.ini')
config.sections()

The config object now functions much like a dictionary:

In [ ]:
config['filepaths']['pdf directory']
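
Besides dictionary-style access, ConfigParser offers typed getters. The following sketch parses a shortened copy of the configuration above from a string (via read_string, so that no file is needed) and shows how the semicolon-separated directory list and the boolean and integer options could be read. The second directory, more-pdfs, and the variable names are invented for the example.

```python
import configparser

# A shortened copy of pydf.ini; 'more-pdfs' is a made-up second directory.
example_ini = """
[filepaths]
pdf directory = pdfs;more-pdfs
index directory = pdf-index

[indexer.options]
recompile = no
search limit = 20
"""

example_config = configparser.ConfigParser()
example_config.read_string(example_ini)

# Multiple PDF directories are separated by a semicolon.
pdf_dirs = example_config.get('filepaths', 'pdf directory').split(';')
# getboolean understands yes/no, true/false, on/off and 1/0.
recompile = example_config.getboolean('indexer.options', 'recompile')
# getint converts the stored string to an integer.
limit = example_config.getint('indexer.options', 'search limit')

print(pdf_dirs, recompile, limit)
```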

Now that we have a configuration file, let's adjust the function pdftotext to make it a little more general. In the current version we hard-coded the path to the output directory as well as the path to the pdftotext binary. We move those elements to the function declaration to make them variable arguments of the function. While we're at it, let's also remove some other code redundancies:

In [ ]:
from os.path import basename, splitext


def fileid(filepath):
    """
    Return the basename of a file without its extension.
    >>> fileid('/some/path/to/a/file.pdf')
    'file'
    """
    base, _ = splitext(basename(filepath))
    return base


def pdftotext(pdf, outdir='.', sourcedir='source', p2t='pdftotext', move=False):
    """Convert a pdf to a text file. Extract the Author and Title 
    and return a dictionary consisting of the author, title, text
    the source path, the path of the converted text file and the 
    file ID."""    
    filename = fileid(pdf)
    htmlpath = os.path.join(outdir, filename + '.html')
    txtpath = os.path.join(outdir, filename + '.txt')
    # makedirs also creates missing parent directories (e.g. static/ for static/pdfs)
    os.makedirs(sourcedir, exist_ok=True)
    sourcepath = os.path.join(sourcedir, filename + '.pdf')
    subprocess.call([p2t, '-enc', 'UTF-8', '-htmlmeta', pdf, htmlpath])
    data = parse_html(htmlpath)
    os.remove(htmlpath)
    file_action = shutil.move if move else shutil.copy
    file_action(pdf, sourcepath)
    with open(txtpath, 'w', encoding='utf-8') as outfile:
        outfile.write(data['text'])
    data['source'] = sourcepath
    data['path'] = txtpath
    data['id'] = fileid(pdf)
    return data

pdftotext("pdfs/blei2003.pdf", 
          outdir=config.get('filepaths', 'txt directory'),
          sourcedir=config.get('filepaths', 'source directory'),
          move=config.getboolean('indexer.options', 'move'))

Quiz!

With that set, we are ready to write the main routine of our indexing procedure. I give you the skeleton of the main routine. At first sight it might seem quite daunting, but it is actually just a procedure that puts together all the statements we have used before. Fill in the gaps.

In [ ]:
import glob

def index_collection(configpath):
    "Main routine to index a collection of PDFs using Whoosh."
    config = configparser.ConfigParser()
    # read the configuration file
    # insert your code here
    
    recompile = config.getboolean("indexer.options", "recompile")
    # check whether the supplied index directory already exists
    if not os.path.exists(config.get("filepaths", "index directory")):
        # if not, create a new directory and initialize the index
        os.mkdir(config.get("filepaths", "index directory"))
        index = create_in(config.get("filepaths", "index directory"), schema=pdf_schema)
        recompile = True
    # open a connection to the index
    index = # insert your code here
    
    # retrieve a set of all file IDs we already indexed
    indexed = set(map(fileid, os.listdir(config.get("filepaths", "txt directory"))))
    # initialize an IndexWriter object
    writer = # insert your code here
    
    # iterate over all directories 
    for directory in config.get("filepaths", "pdf directory").split(';'):
        # iterate over all PDF files in this directory
        for filepath in glob.glob(directory + "/*.pdf"):
            # poor man's solution to check whether we already indexed this pdf
            if fileid(filepath) not in indexed or recompile:
                try:
                    # call the function pdftotext with the correct arguments
                    data = # insert your code here
                    
                    # add the new document to the index
                    writer.add_document(**data)
                except (IOError, UnicodeDecodeError) as error:
                    print(error)
    # commit our changes
    # insert your code here

Great! Now, add some of your own PDF files to the pdf folder pydf/pdfs and execute the following cell:

In [ ]:
index_collection('pydf.ini')

Before we continue, move the functions index_collection, pdftotext, parse_html and fileid to the file indexer.py together with their corresponding imports. Add the call from the cell above to the end of the file, inside a main guard:

if __name__ == '__main__':
    index_collection('pydf.ini')
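
The bookkeeping in index_collection (which PDFs still need to be converted?) can be tried out in isolation. The sketch below recreates it on a temporary directory with empty placeholder files. The helper names fileid_of and pending_pdfs are made up for this illustration, and no PDF is actually converted or indexed.

```python
import glob
import os
import tempfile


def fileid_of(filepath):
    """Return the basename of a file without its extension."""
    return os.path.splitext(os.path.basename(filepath))[0]


def pending_pdfs(pdf_dir, txt_dir, recompile=False):
    """Return the PDFs in pdf_dir that have no text file in txt_dir yet."""
    indexed = set(map(fileid_of, os.listdir(txt_dir)))
    return sorted(path for path in glob.glob(os.path.join(pdf_dir, '*.pdf'))
                  if recompile or fileid_of(path) not in indexed)


with tempfile.TemporaryDirectory() as root:
    pdf_dir = os.path.join(root, 'pdfs')
    txt_dir = os.path.join(root, 'data')
    os.mkdir(pdf_dir)
    os.mkdir(txt_dir)
    # Empty placeholder files stand in for real PDFs and converted texts.
    for name in ('pdfs/blei2003.pdf', 'pdfs/meij2009.pdf', 'data/blei2003.txt'):
        open(os.path.join(root, *name.split('/')), 'w').close()

    # Only meij2009 lacks a text file, so only that one is pending.
    todo = [fileid_of(path) for path in pending_pdfs(pdf_dir, txt_dir)]
    # With recompile=True every PDF is processed again.
    everything = [fileid_of(path)
                  for path in pending_pdfs(pdf_dir, txt_dir, recompile=True)]

print(todo)
print(everything)
```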

Building a Web Interface with Flask


OK, the dry and hard part of our PDF archiving and search app is over. Now it is time to focus on creating a web application with which we can query the index in a user-friendly way. I chose the microframework Flask, an elegant web framework that enables you to get a web app up and running in no time. In no time? Really, in no time! Open a new file called hello.py in your favorite text editor and add the following lines of code:

In [ ]:
from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run(port=5000)

Next, open a terminal and run the script using:

python hello.py

Direct your browser to http://127.0.0.1:5000/ to see the result. That is what I call a simple web framework, yet a very powerful one too.
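
Flask applications can also be exercised without starting a server or opening a browser. As a small sketch, the snippet below repeats the hello application and requests the front page through Flask's built-in test client:

```python
from flask import Flask

app = Flask(__name__)


@app.route("/")
def hello():
    return "Hello World!"


# The test client sends requests to the app directly, without a running server.
client = app.test_client()
response = client.get("/")
print(response.status_code, response.data)
```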

In the directory pydf/templates I created a simple web page, index.html, that will serve as the landing page of our web application. We can render such pages using Flask's render_template function. Open a file called pydf.py in the directory pydf and add the following lines of code:

In [ ]:
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('index.html')

if __name__ == '__main__':
    app.run(debug=True, host='localhost', port=8000, use_reloader=True, threaded=True)

Run the application with

python pydf.py

and check out the result at http://127.0.0.1:8000/. The search box is not working yet. We need two things: (1) a function to search our collection on the basis of a query and (2) a function that passes these results to Flask so that it can render them properly. We'll start with the search function.


Quiz!

a) Write a function called search that takes as argument a query represented by a string. Open the PDF Index, parse the query, search for the results and return a list of dictionaries in which each dictionary represents a separate search result with the field names as keys and their corresponding values as values.

In [ ]:
from whoosh.index import open_dir
from whoosh.qparser import QueryParser

def search(query):
    # insert your code here
    
print(list(search("(topic model) OR (index probability)")))

b) Each search result returned by Whoosh (a Hit object) has a method to create highlighted search result excerpts. Field values that are stored in our index can be directly highlighted by Whoosh, using:

result.highlights(FIELDNAME)

Since we did not store the actual text of our PDFs in the index, we must first open and read the text file corresponding to our search result. That is the reason why we stored the path to our text files in the field path. Once we have the contents of the text file, we can call the highlights method as follows:

result.highlights("text", text=contents)

Adapt the function search in such a way that it includes the highlighted search result excerpts for each search result.

In [ ]:
def search(query):
    # insert your code here
    
print(list(search("(topic model) OR (index probability)")))

Now that we have a search function, we need a way to represent the results in a format that a browser can read. I chose the simple solution of converting the results directly into HTML. The following function takes as argument a single search result yielded by search and returns a representation in HTML of the result:

In [ ]:
def to_html(result):
    "Return a representation of a search result in HTML."
    title = result['title'] if 'title' in result else result['id']
    author = result['author'] if 'author' in result else ''
    html = """
        <div id='match'>
          <span id='id'>
            <a href='%s' target='_blank'>%s</a>
          </span>
          <span id='author'>%s</span>
          <br>
          <span id='text'>%s</span>
        </div>
           """ % (result['source'], title, author, result['snippet'])
    return html

# iter() makes this work whether search returns a list or a generator
print(to_html(next(iter(search("topic model")))))

With that in place, all that is left is to write a function that is connected to the search box in the web interface. This function will be called after a user presses enter in the search box and returns the results of the query.

In [ ]:
from flask import request, jsonify

@app.route('/searchbox', methods=['POST'])
def searchbox():
    query = request.form['q'].strip()
    html_results = '\n'.join(map(to_html, search(query)))
    return jsonify({'html': html_results})

The searchbox function is called by a piece of JavaScript that resides in static/script.js. It extracts the query, converts the results to HTML and returns that as JSON to the same JavaScript, which is responsible for putting it in the right place on our web page.

An exciting moment: our PDF search application is ready. Take it for a spin:

python pydf.py

and direct your browser to http://127.0.0.1:8000/. Have fun!

