-- A Python Course for the Humanities
Our Information Retrieval system from the previous chapter is simple and quite efficient, but it lacks many features of a full-blown search engine. One particular downside of the system is that it stores the index in RAM. This means that we either have to keep it there or rebuild the index each time we want to search a particular collection. We could try to improve our system, but there are some excellent Python packages for Information Retrieval and search engines that we could use instead. In this section we will explore one of them, called Whoosh, a search engine and retrieval system written in pure Python.
Ever since science-journal giant Elsevier bought the once so promising bibliography management software Mendeley, I have been looking for alternative ways to manage my research PDF collection. For me, one of the most important features of a PDF management tool is being able to do full-text search in the PDFs for the content I am interested in. To this day I have not found a tool that fulfills all my needs, so we are going to build one ourselves. We will develop a full-blown search engine using Whoosh and build a web interface on top of Flask to query our search engine in a user-friendly way. Just to tease you a bit: this is what our search engine will look like:
This chapter is the first in a series of more practical chapters, in which we will build actual applications ready for use by end-users. You won't be learning many new programming techniques in this chapter, but we will introduce you to a large number of modules and packages that are available either in the standard library of Python or as third-party packages. The most important take-home message is that if you think about implementing a piece of software, the first thing you should do is check whether someone else hasn't done it before you. Chances are good that someone has, and he or she has probably done a better job.
Before we get started, make sure you have installed both Whoosh and Flask on your computer. If you use Anaconda's Python distribution, you can execute the following command on the commandline:
conda install whoosh flask
It is also possible to install the modules using pip:
pip install whoosh flask
Finally, we will need one of the programs shipped with the PDF reader Xpdf. Please install Xpdf from its website before you continue.
The Index is the most central object in Whoosh's search engine. It allows us to store our data in a secure and sustainable way and makes it possible to search through our data. Every index requires an index schema which defines the available fields in the index. Fields represent pieces of information about each document in the collection, such as the author of a text, its publication date or the text body itself. For our PDF index we will use a schema consisting of the following fields:
id: a unique identifier for each document;
path: the path to the converted text file;
source: the path to the original PDF file;
author: the author(s) of the document;
title: the title of the document;
text: the full text of the document.
All these fields will be indexed for each PDF file in our collection and each field will be searchable. To create our schema in Whoosh, we first import the Schema object from whoosh.fields:
from whoosh.fields import Schema
For each field we need to specify to Whoosh what kind of field it is. Whoosh defines field types like KEYWORD for keyword fields, ID for fields holding a unique identifier (e.g. filepaths), DATETIME for fields containing date information, TEXT for textual objects, and some others.
Have a look at the different field types provided by Whoosh in its documentation. Which types would be appropriate for the fields defined above?
Double click this cell and write down your answer
Each field type can be passed some additional arguments, such as whether the field should be stored, whether it should be sortable or scorable, etc. For reasons that will become clear later on, we want all our ID fields plus the author and title fields to be stored in the index. We import the appropriate field types and define our schema as follows:
from whoosh.fields import ID, KEYWORD, TEXT
pdf_schema = Schema(id=ID(unique=True, stored=True),
                    path=ID(stored=True),
                    source=ID(stored=True),
                    author=TEXT(stored=True),
                    title=TEXT(stored=True),
                    text=TEXT)
Once we have defined a schema, we can create the index using the function create_in. We'll create the index in the directory pydf, where the web application will also reside. It is therefore convenient to first direct your notebook to that directory.
cd pydf/
import os
from whoosh.index import create_in
if not os.path.exists("pdf-index"):
    os.mkdir("pdf-index")
index = create_in("pdf-index", pdf_schema)
After creation, the index can be opened using the function open_dir:
from whoosh.index import open_dir
index = open_dir("pdf-index")
Now that we have an index, we can add some documents to it. The IndexWriter object lets you do just that. The method writer() of the class Index returns an instance of IndexWriter.
writer = index.writer()
We use the add_document method of IndexWriter to add documents to the index. add_document accepts keyword arguments corresponding to the fields we specified in our schema. We add our first document:
writer.add_document(id='blei2003',
                    path='data/blei2003.txt',
                    source='static/pdfs/blei2003.pdf',
                    author='David Blei, Andrew Ng, Michael Jordan',
                    title='Latent Dirichlet Allocation',
                    text=open('data/blei2003.txt', encoding='utf-8').read())
And some more:
writer.add_document(id='goodwyn2013',
                    path='data/goodwyn2013.txt',
                    source='static/pdfs/goodwyn2013.pdf',
                    author='Erik Goodwyn',
                    title='Recurrent motifs as resonant attractor states in the narrative field: a testable model of archetype',
                    text=open('data/goodwyn2013.txt', encoding='utf-8').read())

writer.add_document(id='meij2009',
                    path='data/meij2009.txt',
                    source='static/pdfs/meij2009.pdf',
                    author='Edgar Meij, Dolf Trieschnigg, Maarten de Rijke, Wessel Kraaij',
                    title='Conceptual language models for domain-specific retrieval',
                    text=open('data/meij2009.txt', encoding='utf-8').read())

writer.add_document(id='muellner2011',
                    path='data/muellner2011.txt',
                    source='static/pdfs/muellner2011.pdf',
                    author='Daniel Müllner',
                    title='Modern hierarchical, agglomerative clustering algorithms',
                    text=open('data/muellner2011.txt', encoding='utf-8').read())
Calling commit() on the IndexWriter saves the changes to the index:
writer.commit()
The index now contains four documents. How can we search it for particular documents? Similar to the method writer of the Index object, the method searcher returns a Searcher object which allows us to search the index. The Searcher object opens a connection to the index, similar to the way Python opens regular files. To prevent the system from running out of file handles, it is best practice to instantiate the searcher within a with statement:
with index.searcher() as searcher:
    pass  # do something with the searcher
We'll do so later when we automatically index a complete collection of pdfs, but for now it is slightly more convenient to instantiate a searcher as follows:
searcher = index.searcher()
An instance of the class Searcher has a search method that takes as argument a Query object. There are two ways to construct Query objects: manually or via a query parser. To manually construct a query that searches for the terms model and topic in the text field, we could write something like the following:
from whoosh.query import Term, And
query = And([Term("text", "model"), Term("text", "topic")])
We can feed this query to the search method to obtain a Results object:
results = searcher.search(query)
print('Number of hits:', len(results))
print('Best hit:', results[0])
Note that all four of our documents contain both the terms model and topic, but the paper about topic models is considered the most relevant one for our query.
a) Construct the same query, but this time using an Or operator.
from whoosh.query import Or
# insert your code here
b) Construct a query using the And operator that searches for the terms index and topic in documents of which Dolf Trieschnigg is one of the co-authors:
# insert your code here
These query constructs are very explicit and clean. It is, however, much more convenient to use Whoosh's QueryParser object to automatically parse strings into Query objects. We construct a query parser as follows:
from whoosh.qparser import QueryParser
parser = QueryParser("text", index.schema)
Note that we have to specify the default field in which we want to search and pass the schema of our index. The QueryParser is quite an intelligent object. It allows users to group terms using the strings AND and OR and to eliminate terms with the string NOT. It also allows you to manually specify other fields in which to search for specific terms:
parser.parse("probability model prior")
parser.parse("(cluster OR grouping) AND (model OR schema)")
parser.parse("topic index author:'Dolf Trieschnigg'")
parser.parse("clust*")
The last query is a wildcard query that attempts to match all terms starting with clust.
Experiment a little with the QueryParser object.
# insert your code here
Now that we have some basic understanding of Whoosh's most important data structures and functions, it is time to put together a number of Python scripts that will construct a Whoosh index on the basis of your own PDF library.
Open your favorite text editor and create a new Python file called indexer.py. Save it in the directory python-course/pydf. Add the schema we defined above to this file, as well as the corresponding import statements. Before we start, it is good to make a list of all the components we need to index our collection. In order to pass our documents to the add_document method, we will need the following components:
a way to convert PDF files into plain text;
a way to extract metadata such as the author and title from the PDFs;
a way to tell Python which directories to index;
a main routine that adds all documents in those directories to the index.
Let's start with the first item on our bullet list. Once you have installed Xpdf, a program named pdftotext will be available. This program converts PDF files into .txt files. Fire up a terminal or command prompt and test whether the command pdftotext is available. The subprocess module provides different ways to execute external programs from Python. For example, on a Unix machine, the following lines
import subprocess
subprocess.call(['ls', '-l'])
will silently call the program ls with the list argument -l and return 0 to Python, meaning that the call completed successfully. In a similar way we can call pdftotext to convert a PDF file into a text file:
subprocess.call(['pdftotext', 'pdfs/blei2003.pdf', 'data/blei2003.txt'])
If all went well, Python should return 0 to your notebook. Whoosh requires that all text is encoded in UTF-8. We can pass the argument -enc UTF-8 to pdftotext to make sure our text files are in the right encoding:
subprocess.call(['pdftotext', '-enc', 'UTF-8', 'pdfs/blei2003.pdf', 'data/blei2003.txt'])
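Before writing the function, it can help to separate building the argument list from calling the program. The helper below (its name is our own invention) derives the output path from the PDF's filename; shutil.which can be used to check first that pdftotext is actually on your PATH:

```python
import os
import shutil
import subprocess

def build_pdftotext_cmd(pdf, outdir='data', p2t='pdftotext'):
    """Return the pdftotext argument list that converts `pdf` to a
    UTF-8 text file in `outdir` with the same basename."""
    base, _ = os.path.splitext(os.path.basename(pdf))
    return [p2t, '-enc', 'UTF-8', pdf, os.path.join(outdir, base + '.txt')]

cmd = build_pdftotext_cmd('pdfs/blei2003.pdf')
print(cmd)

# only call the program when it is actually installed
if shutil.which('pdftotext'):
    subprocess.call(cmd)
```

Keeping the command construction in a pure function also makes it easy to test without converting any real PDFs.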
Write a function called pdftotext in Python that takes as argument the filename of a PDF file. It should convert the PDF into plain text and store the result in the directory pydf/data. The .txt file should have the same filename as the PDF file, only with a different extension.
import os

def pdftotext(pdf):
    # insert your code here

# if your answer is correct this should print the first 1000 characters of the text file
pdftotext("pdfs/blei2003.pdf")
with open(os.path.join('data', 'blei2003.txt')) as infile:
    print(infile.read(1000))
pdftotext has the option -htmlmeta to extract some of the metadata stored in a PDF file. If we run the following command:
subprocess.call(['pdftotext', '-htmlmeta', '-enc', 'UTF-8',
                 'pdfs/muellner2011.pdf', 'data/muellner2011.html'])
the output is an HTML file that looks like this:
print(open('data/muellner2011.html').read(500))
We will write a function called parse_html to extract the meta information and the text from these files. It takes as argument the filepath of the HTML file and returns a dictionary formatted as follows:
d = {'author': AUTHOR, 'title': TITLE, 'text': TEXT}
We have used BeautifulSoup in the previous chapter to read and parse web pages. It will be of service here as well.
from bs4 import BeautifulSoup

def parse_html(filename):
    """Extract the Author, Title and Text from an HTML file
    which was produced by pdftotext with the option -htmlmeta."""
    with open(filename) as infile:
        html = BeautifulSoup(infile, "html.parser", from_encoding='utf-8')
    d = {'text': html.pre.text}
    if html.title is not None:
        d['title'] = html.title.text
    for meta in html.findAll('meta'):
        try:
            if meta['name'] in ('Author', 'Title'):
                d[meta['name'].lower()] = meta['content']
        except KeyError:
            continue
    return d
parse_html('data/muellner2011.html')
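To check how parse_html behaves without running pdftotext at all, we can feed it a tiny hand-made file that mimics the -htmlmeta output (the title and author below are invented). The function is repeated here so the cell runs on its own:

```python
import tempfile

from bs4 import BeautifulSoup

def parse_html(filename):
    """Extract the Author, Title and Text from an HTML file
    which was produced by pdftotext with the option -htmlmeta."""
    with open(filename) as infile:
        html = BeautifulSoup(infile, "html.parser", from_encoding='utf-8')
    d = {'text': html.pre.text}
    if html.title is not None:
        d['title'] = html.title.text
    for meta in html.findAll('meta'):
        try:
            if meta['name'] in ('Author', 'Title'):
                d[meta['name'].lower()] = meta['content']
        except KeyError:
            continue
    return d

# a minimal stand-in for what pdftotext -htmlmeta produces
sample = """<html><head>
<title>A Made-up Title</title>
<meta name="Author" content="Jane Doe">
</head><body><pre>The full text of the paper goes here.</pre></body></html>"""

with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as f:
    f.write(sample)

print(parse_html(f.name))
```

The resulting dictionary has the keys text, title and author, exactly the shape we need for add_document later on.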
a) The parse_html function returns a dictionary consisting of the contents of some of the fields in our index schema. We will need this dictionary when we add new documents to our index. Reimplement the pdftotext function. It should convert a PDF file into an HTML file, stored in the directory pydf/data. Plug the function parse_html into the pdftotext function. Write the contents of the text field to a .txt file in the pydf/data directory. Make sure that this text file has the same filename as the PDF file except for the extension.
def pdftotext(pdf):
    """Convert a pdf to a text file. Extract the Author and Title
    and return a dictionary consisting of the author, title and
    text."""
    basename, _ = os.path.splitext(os.path.basename(pdf))
    # insert your code here
b) For reasons that will become clear, it would be convenient to add the other fields of our index schema to the dictionary returned by parse_html as well. Rewrite the function pdftotext and add the values for the source, path and id to the dictionary returned by parse_html. The values for the source, path and id should match the values we used in the examples above, where we manually added some documents to the index.
import shutil

def pdftotext(pdf):
    """Convert a pdf to a text file. Extract the Author and Title
    and return a dictionary consisting of the author, title, text,
    the source path, the path of the converted text file and the
    file ID."""
    basename, _ = os.path.splitext(os.path.basename(pdf))
    subprocess.call(['pdftotext', '-enc', 'UTF-8', '-htmlmeta',
                     pdf, os.path.join('data', basename + '.html')])
    data = parse_html(os.path.join('data', basename + '.html'))
    with open(os.path.join('data', basename + '.txt'), 'w') as outfile:
        outfile.write(data['text'])
    # insert your code here
pdftotext("pdfs/muellner2011.pdf")
c) After we have extracted the meta information from the HTML file, we no longer need it. The remove function in the module os can be used to remove files from your computer. Add a line of code to the pdftotext function that removes the HTML file.
d) We need to store the PDF files in the directory pydf/static/pdfs. The function copy from the shutil module allows you to copy or move files from one directory to another. Add some lines of code to the pdftotext function in which you copy the original PDF file to the directory pydf/static/pdfs. Make sure that this directory exists. Otherwise, first create it using the function mkdir from the os module. You can use the function exists from the os.path module to check whether a particular file or directory exists.
With the two hardest points off our list, we can move on to the next two. We need a way to make Python aware of the directories we would like to index. There are many different ways to accomplish this. I choose to use a configuration file in which we store the paths of the directories we want to index (and possibly some other information). The Python module configparser provides a class ConfigParser with which we can parse configuration files in a format similar to Microsoft Windows INI files:
[filepaths]
# pdf directory represents the directory or directories
# that you would like to index. Separate multiple directories
# by a semicolon.
pdf directory = pdfs
txt directory = data
index directory = pdf-index
source directory = static/pdfs
[programpaths]
pdftotext = /usr/local/bin/pdftotext
[indexer.options]
recompile = no
move = no
search limit = 20
Copy these lines to a file named pydf.ini in the pydf directory and adapt the paths to your own setup. We will read the contents of the configuration file using the class ConfigParser:
import configparser
config = configparser.ConfigParser()
config.read('pydf.ini')
config.sections()
The parser is a structure that functions much like a dictionary:
config['filepaths']['pdf directory']
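For quick experiments you don't even need a file on disk: ConfigParser.read_string parses the same format from a string. A small sketch with a trimmed-down copy of our pydf.ini:

```python
import configparser

config = configparser.ConfigParser()
config.read_string("""
[filepaths]
pdf directory = pdfs
txt directory = data

[indexer.options]
recompile = no
search limit = 20
""")

# plain lookups return strings; use the typed getters for the rest
print(config['filepaths']['pdf directory'])
print(config.getboolean('indexer.options', 'recompile'))
print(config.getint('indexer.options', 'search limit'))
```

Note that getboolean understands the usual spellings (yes/no, on/off, true/false, 1/0), which is why we can write "recompile = no" in the configuration file.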
Now that we have a configuration file, let's adjust the function pdftotext to make it a little more general. In the current version we hard-coded the path to the output directory as well as the path to the pdftotext binary. We move those elements to the function declaration to make them arguments of the function. While we're at it, let's also remove some other code redundancies:
from os.path import basename, splitext

def fileid(filepath):
    """
    Return the basename of a file without its extension.
    >>> fileid('/some/path/to/a/file.pdf')
    'file'
    """
    base, _ = splitext(basename(filepath))
    return base

def pdftotext(pdf, outdir='.', sourcedir='source', p2t='pdftotext', move=False):
    """Convert a pdf to a text file. Extract the Author and Title
    and return a dictionary consisting of the author, title, text,
    the source path, the path of the converted text file and the
    file ID."""
    filename = fileid(pdf)
    htmlpath = os.path.join(outdir, filename + '.html')
    txtpath = os.path.join(outdir, filename + '.txt')
    if not os.path.exists(sourcedir):
        os.mkdir(sourcedir)
    sourcepath = os.path.join(sourcedir, filename + '.pdf')
    subprocess.call([p2t, '-enc', 'UTF-8', '-htmlmeta', pdf, htmlpath])
    data = parse_html(htmlpath)
    os.remove(htmlpath)
    file_action = shutil.move if move else shutil.copy
    file_action(pdf, sourcepath)
    with open(txtpath, 'w') as outfile:
        outfile.write(data['text'])
    data['source'] = sourcepath
    data['path'] = txtpath
    data['id'] = fileid(pdf)
    return data
pdftotext("pdfs/blei2003.pdf",
          outdir=config.get('filepaths', 'txt directory'),
          sourcedir=config.get('filepaths', 'source directory'),
          move=config.getboolean('indexer.options', 'move'))
With that set, we are ready to write the main routine of our indexing procedure. I'll give you the skeleton of the main routine. At first sight it might seem quite daunting, but it is actually just a procedure that puts together all the statements we have used before. Fill in the gaps.
import glob

def index_collection(configpath):
    "Main routine to index a collection of PDFs using Whoosh."
    config = configparser.ConfigParser()
    # read the configuration file
    # insert your code here
    recompile = config.getboolean("indexer.options", "recompile")
    # check whether the supplied index directory already exists
    if not os.path.exists(config.get("filepaths", "index directory")):
        # if not, create a new directory and initialize the index
        os.mkdir(config.get("filepaths", "index directory"))
        index = create_in(config.get("filepaths", "index directory"), schema=pdf_schema)
        recompile = True
    # open a connection to the index
    index = # insert your code here
    # retrieve a set of all file IDs we already indexed
    indexed = set(map(fileid, os.listdir(config.get("filepaths", "txt directory"))))
    # initialize an IndexWriter object
    writer = # insert your code here
    # iterate over all directories
    for directory in config.get("filepaths", "pdf directory").split(';'):
        # iterate over all PDF files in this directory
        for filepath in glob.glob(directory + "/*.pdf"):
            # poor man's solution to check whether we already indexed this pdf
            if fileid(filepath) not in indexed or recompile:
                try:
                    # call the function pdftotext with the correct arguments
                    data = # insert your code here
                    # add the new document to the index
                    writer.add_document(**data)
                except (IOError, UnicodeDecodeError) as error:
                    print(error)
    # commit our changes
    # insert your code here
Great! Now, add some of your own PDF files to the folder pydf/pdfs and execute the following cell:
index_collection('pydf.ini')
Before we continue, move the functions index_collection, pdftotext, parse_html and fileid to the file indexer.py, together with their corresponding imports. Then add the cell above to the end of the file, inside the main guard:
if __name__ == '__main__':
    index_collection('pydf.ini')
OK, the dry and hard part of our PDF archiving and search app is over. Now it is time to focus on creating a web application with which we can query the index in a user-friendly way. I choose to use Flask, an elegant microframework that enables you to get a web app up and running in no time. In no time? Really, in no time! Open a new file called hello.py in your favorite text editor and add the following lines of code:
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run(port=5000)
Next, open a terminal and run the script using:
python hello.py
Direct your browser to http://127.0.0.1:5000/ to see the result. That is what I call a simple web framework, yet a very powerful one too.
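As an aside, you don't even need a browser to check a route: Flask ships with a test client that fakes HTTP requests in-process. A small sketch against the hello app:

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

# the test client issues requests without starting a server
client = app.test_client()
response = client.get("/")

print(response.status_code)
print(response.get_data(as_text=True))
```

This is handy later on, when you want to check the routes of the search application without repeatedly restarting the server.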
In the directory pydf/templates I created a simple web page, index.html, that will serve as the landing page of our web application. We can render such pages using Flask's render_template function. Open a file called pydf.py in the directory pydf and add the following lines of code:
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('index.html')

if __name__ == '__main__':
    app.run(debug=True, host='localhost', port=8000, use_reloader=True, threaded=True)
Run the application with
python pydf.py
and check out the result at http://127.0.0.1:8000/. The search box is not working yet. We need two things: (1) a function to search our collection on the basis of a query and (2) a function to show these results to Flask allowing it to render them properly. We'll start with the search function.
a) Write a function called search that takes as argument a query represented as a string. Open the PDF index, parse the query, search for the results and yield for each search result a dictionary with the field names as keys and their corresponding values as values.
from whoosh.index import open_dir
from whoosh.qparser import QueryParser
def search(query):
    # insert your code here

print(list(search("(topic model) OR (index probability)")))
b) Whoosh's result objects contain a method to create highlighted search result excerpts. Field values that are stored in our index can be highlighted directly by Whoosh, using:
result.highlights(FIELDNAME)
Since we did not store the actual text of our PDFs in the index, we must first open and read the text file corresponding to our search result. That is the reason why we stored the path to our text files in the field path. Once we have the contents of the text file, we can call the highlights method as follows:
result.highlights("text", text=contents)
Adapt the function search in such a way that it includes the highlighted search result excerpts for each search result.
def search(query):
    # insert your code here

print(list(search("(topic model) OR (index probability)")))
Now that we have a search function, we need a way to represent the results in a format that a browser can read. I opt for the simple solution of directly converting the results into HTML. The following function takes as argument a single search result yielded by search and returns an HTML representation of the result:
def to_html(result):
    "Return a representation of a search result in HTML."
    title = result['title'] if 'title' in result else result['id']
    author = result['author'] if 'author' in result else ''
    html = """
    <div id='match'>
      <span id='id'>
        <a href='%s' target='_blank'>%s</a>
      </span>
      <span id='author'>%s</span>
      <br/>
      <span id='text'>%s</span>
    </div>
    """ % (result['source'], title, author, result['snippet'])
    return html
print(to_html(next(search("topic model"))))
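Because to_html only needs a dictionary, it can also be tried on a hand-made result (the values below are invented) without touching the index at all; the function is repeated here so the cell runs on its own:

```python
def to_html(result):
    "Return a representation of a search result in HTML."
    title = result['title'] if 'title' in result else result['id']
    author = result['author'] if 'author' in result else ''
    html = """
    <div id='match'>
      <span id='id'>
        <a href='%s' target='_blank'>%s</a>
      </span>
      <span id='author'>%s</span>
      <br/>
      <span id='text'>%s</span>
    </div>
    """ % (result['source'], title, author, result['snippet'])
    return html

# an invented result without a title, to exercise the fallback to the id
fake = {'id': 'doe2014',
        'source': 'static/pdfs/doe2014.pdf',
        'author': 'Jane Doe',
        'snippet': 'an invented <b>snippet</b>'}

print(to_html(fake))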
With that in place, all that is left is to write a function that is connected to the search box in the web interface. This function will be called after a user presses enter in the search box and returns the results of the query.
from flask import request, jsonify

@app.route('/searchbox', methods=['POST'])
def searchbox():
    query = request.form['q'].strip()
    html_results = '\n'.join(map(to_html, search(query)))
    return jsonify({'html': html_results})
The searchbox function is called by a piece of JavaScript that resides in static/script.js. It extracts the query, converts the results to HTML and returns them as JSON to the same JavaScript, which is responsible for putting them in the right place on our web page.
An exciting moment: our PDF search application is ready. Take it for a spin:
python pydf.py
and direct your browser to http://127.0.0.1:8000/. Have fun!
You've reached the end of the chapter.
Python Programming for the Humanities by http://fbkarsdorp.github.io/python-course is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at https://github.com/fbkarsdorp/python-course.