Extracting Text From Heterogeneous Sources

I consume online media through different apps on different devices (plain browsers, Feedly, Twitter, etc.), and to archive interesting reads for future reference I have developed a habit of e-mailing links to myself.

Gmail addresses are great for this purpose since an e-mail sent to yourname+links@gmail.com still ends up in the inbox of yourname@gmail.com. Combining this with Gmail filters allows you to set up a rule so that all e-mails received on yourname+links@gmail.com are archived automatically and are assigned a label that you can use to collect all blog posts you want to archive (I use the label Links).
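For illustration, the plus-addressing convention can be sketched as a small helper (the function name and addresses are hypothetical, and the sketch ignores that Gmail also disregards dots in the local part):

```python
def gmail_canonical(address):
    """Reduce a Gmail address to the inbox it actually delivers to.

    Gmail ignores everything between '+' and '@' in the local part,
    so yourname+links@gmail.com delivers to yourname@gmail.com.
    """
    local, domain = address.split('@')
    return local.split('+')[0] + '@' + domain

print(gmail_canonical('yourname+links@gmail.com'))  # yourname@gmail.com
```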

I have already collected a couple of hundred links to blog articles, webpages, PDFs, etc. that I feel compelled to do something interesting with (such as topic modelling and automatic summarization).

In this blog article I describe a pipeline that automatically accesses my Gmail account and extracts the readable text from all links I have sent myself over time. Of course I am opening a can of worms here since these e-mails come in lots of different formats and point to different materials:

  • some e-mails are just plain text with one or multiple links
  • many e-mails come straight out of Feedly on my iPad which tends to send entire blog articles with html formatting
  • many e-mails contain direct links to PDF files which are generally not as easy to parse as plain HTML

Here I will use a rule-based approach to distinguish between these different formats.

Import Python Libraries

These are the Python imports I will need throughout the pipeline:

In [ ]:
import httplib2

from apiclient.discovery import build
from oauth2client.client import flow_from_clientsecrets
from oauth2client.file import Storage
from oauth2client.tools import run

import base64
import re
import requests
from lxml import etree
from StringIO import StringIO
import itertools as it

import urllib2
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

from collections import defaultdict

from lxml.etree import fromstring

import sqlite3
from datetime import datetime
import zlib

Set Up Access to Gmail

Google offers an API for many of its services, including Gmail. The free-of-charge quota of the Gmail API is very generous, and you will essentially never need to worry about exceeding the free allowance.

To access Gmail through Python you need to follow the steps outlined here (all steps assume that you are logged into your Gmail account in the browser):

  • create a new project in the Google Developer Console
  • on the page of your new project, make certain to set a product name in the APIs&auth/consent screen tab
  • download the JSON from the APIs&auth/credentials tab and store it in a location accessible from this Python script (I called mine client_secret.json)

Get the Google API Python package

pip install google-api-python-client

and the following code (source) will connect to the Gmail API and expose an API resource object pointed to by gmail_service.

In [ ]:
# Path to the client_secret.json file downloaded from the Developer Console
CLIENT_SECRET_FILE = 'client_secret.json'

# Check https://developers.google.com/gmail/api/auth/scopes for all available scopes
OAUTH_SCOPE = 'https://www.googleapis.com/auth/gmail.readonly'

# Location of the credentials storage file
STORAGE = Storage('gmail.storage')

# Start the OAuth flow to retrieve credentials
flow = flow_from_clientsecrets(CLIENT_SECRET_FILE, scope=OAUTH_SCOPE)
http = httplib2.Http()

# Try to retrieve credentials from storage or run the flow to generate them
credentials = STORAGE.get()
if credentials is None or credentials.invalid:
  credentials = run(flow, STORAGE, http=http)

# Authorize the httplib2.Http object with our credentials
http = credentials.authorize(http)

# Build the Gmail service from discovery
gmail_service = build('gmail', 'v1', http=http)

Retrieve Bodies of Relevant E-Mail Messages

The Gmail API resource gmail_service is the only object we need to query to eventually download all e-mail bodies of interest. The Gmail API reference explains how to use the different resource types (labels, message lists, messages) that we need to work with.

The Gmail label of interest is known to me as Links but internally Gmail identifies all labels by a unique ID. The following code identifies the label ID that corresponds to the label name Links:

In [ ]:
labels = gmail_service.users().labels().list(userId='me').execute()['labels']  # 'me' is the currently logged-in user
label_id = filter(lambda x: x['name'] == 'Links', labels)[0]['id']

We can now query gmail_service for all e-mail messages with the label Links (as identified by label_id). The API returns messages in pages of at most 100 messages, so we will need to track a pointer to the next page (nextPageToken) until all pages are consumed.

In [ ]:
def get_message_ids():
    """ Page through all messages in `label_id` """
    next_page = None

    while True:
        if next_page is not None:
            response = gmail_service.users().messages().list(userId='me', labelIds=[label_id], pageToken=next_page).execute()
        else:
            response = gmail_service.users().messages().list(userId='me', labelIds=[label_id]).execute()

        messages = response.get('messages')
        next_page = response.get('nextPageToken')

        for el in messages:
            yield el['id']

        if next_page is None:
            break  # all pages consumed

To extract message bodies we need to distinguish between text/plain and MIME e-mails (to my understanding the latter allows embedded images and such). When an e-mail is just plain text then the body is found directly in the payload, however when the e-mail is MIME I will extract the body as the first of potentially many parts. See the message API reference for more detail.

In [ ]:
def message_bodies():
    for ctr, message_id in enumerate(get_message_ids()):
        message = gmail_service.users().messages().get(userId='me', id=message_id, format='full').execute()
        try:
            body = message['payload']['parts'][0]['body']['data']  # MIME
        except KeyError:
            body = message['payload']['body']['data']  # text/plain
        body = base64.b64decode(str(body), '-_')
        yield body
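The '-_' argument passed to b64decode above is needed because the Gmail API returns bodies in URL-safe base64, which substitutes - and _ for the + and / of the standard alphabet. A quick round trip illustrates (sample bytes chosen so the standard encoding contains both special characters):

```python
import base64

raw = b'\xfbhello?\xff'                   # encodes to base64 containing '+' and '/'
url_safe = base64.urlsafe_b64encode(raw)  # '-' and '_' replace '+' and '/'
# the '-_' altchars argument undoes exactly this substitution:
decoded = base64.b64decode(url_safe, altchars=b'-_')
print(decoded == raw)  # True
```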

I think of every e-mail with the Links label as a pointer to a resource on the Internet (be it a link to a webpage, PDF, etc.). Generally to extract URLs from plain text I will make use of the following regular expression formulated by John Gruber:

In [ ]:
pattern = (r'(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|'
           r'[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|'
           r'(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|'
           r"[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))")
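As a quick sanity check of the pattern: re.findall returns a tuple of groups per hit, which is why the pipeline below always reads match[0] to obtain the full URL. The sample body and URLs here are made up, and the trailing non-ASCII quote characters of Gruber's original pattern are omitted from this sketch:

```python
import re

# John Gruber's liberal URL pattern (trailing non-ASCII quote chars omitted)
pattern = (r'(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|'
           r'[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|'
           r'(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|'
           r"[^\s`!()\[\]{};:'\".,<>?]))")

body = 'read this: https://example.com/post and http://example.org/a(b)'
urls = [match[0] for match in re.findall(pattern, body)]
print(urls)  # ['https://example.com/post', 'http://example.org/a(b)']
```

Note that the pattern correctly keeps balanced parentheses inside a URL while refusing to end a URL on trailing punctuation.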

When I send myself a link through the Feedly iPad app, Feedly sometimes sends the entire article in the e-mail body, but it does not do so consistently as far as I can judge.

Some Feedly-originating e-mail bodies will contain multiple URLs (e.g. links as part of a blog article) but they mostly provide a link to the original story as the first URL in the e-mail.

Other times I may type an e-mail myself in which I copy and paste multiple links to interesting resources - in which case I want to extract all URLs from the body and not just the first one.

In [ ]:
def is_feedly(body):
    return 'feedly.com' in body
In [ ]:
def urls():
    for body in message_bodies():
        matches = re.findall(pattern, body)
        if not matches:
            continue  # no URL found in this body
        if is_feedly(body):
            yield matches[0][0]  # Feedly e-mail: first URL is link to original story
        else:
            for match in matches:
                yield match[0]  # plain e-mail: keep all URLs

I noticed a number of URLs in urls() that did not match my expectation - such as links to disqus comments and an online retailer. This is where a rule-based filter comes in:

In [ ]:
exclude = ['packtpub.com', 'disqus', '@', 'list-manage', 'utm_', 'ref=', 'campaign-archive']
def urls_filtered():
    for url in urls():
        # `excl` rather than `pattern` to avoid shadowing the URL regex
        if not any(excl in url.lower() for excl in exclude):
            yield url
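The effect of the substring blacklist, on a couple of made-up URLs:

```python
exclude = ['packtpub.com', 'disqus', '@', 'list-manage', 'utm_', 'ref=', 'campaign-archive']

candidates = [
    'http://example.com/a-good-read',
    'http://blog.example.org/post?utm_source=feedly',   # tracking parameter
    'http://shop.packtpub.com/some-book',               # online retailer
]
kept = [u for u in candidates if not any(e in u.lower() for e in exclude)]
print(kept)  # ['http://example.com/a-good-read']
```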

I quite often send myself a link to the Hacker News message thread on an article of interest.

To extract the URL to the actual article we need to parse the HTML of the Hacker News discussion and locate the relevant URL in a td element with class="title" - a rule that works for these particular Hacker News pages.

In [ ]:
def is_hn(url):
    return 'news.ycombinator.com' in url
In [ ]:
parser = etree.HTMLParser()
def urls_hn_filtered():
    for url in urls_filtered():
        if is_hn(url) and (re.search(r'item\?id=', url) is None):
            continue  # do not keep HN links that do not point to an article
        elif is_hn(url):
            r = requests.get(url)
            if r.status_code != 200:
                continue  # download of HN html failed, skip
            root = etree.parse(StringIO(r.text), parser).getroot()
            title = root.find(".//td[@class='title']")
            try:
                a = [child for child in title.getchildren() if child.tag == 'a'][0]
            except (AttributeError, IndexError):
                continue  # title is None or contains no link

            story_url = a.get('href')
            yield story_url  # HN discussion: yield link to the underlying article
        else:
            yield url  # not an HN link: yield as-is
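The td[@class='title'] rule can be exercised against a static snippet of HN-style markup. The HTML below is a hypothetical, stripped-down imitation of the real page, and since it happens to be well-formed the stdlib ElementTree stands in for lxml in this sketch:

```python
import xml.etree.ElementTree as ET

# hypothetical, stripped-down imitation of an HN discussion page
html = ('<html><body><table><tr>'
        '<td class="title">'
        '<a href="http://example.com/story">An Interesting Story</a>'
        '</td></tr></table></body></html>')

root = ET.fromstring(html)
title = root.find(".//td[@class='title']")       # first title cell
a = [child for child in list(title) if child.tag == 'a'][0]
print(a.get('href'))  # http://example.com/story
```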

Poor memory gets the better of me quite often and I end up sending myself the same link multiple times. To avoid downloading the same resource twice and skewing follow-on work with duplicates I will track seen URLs with the following code:

In [ ]:
def unique_urls():
    seen = defaultdict(bool)
    for url in urls_hn_filtered():
        key = hash(url)
        if not seen[key]:
            seen[key] = True
            yield url
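A defaultdict keyed on hash(url) does the job, though hashing by hand is not required: a set of the URLs themselves achieves the same first-seen-wins behaviour and avoids the (remote) possibility of two distinct URLs colliding on hash(). A minimal sketch of that variant, with a hypothetical helper name:

```python
def unique(iterable):
    """Yield each item only the first time it is seen."""
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

print(list(unique(['a', 'b', 'a', 'c', 'b'])))  # ['a', 'b', 'c']
```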

Extract Readable Text from PDF and HTML Documents

Now that we have a stream of unique URLs of interesting articles provided by unique_urls we can download each of these resources and extract their readable text.

To extract readable text I will here distinguish between HTML and PDF. PDF text extraction is handled with pdfminer and I am making use of this code shared in a StackOverflow answer. HTML text extraction is handled with lxml and this code snippet.

In [ ]:
def pdf_from_url_to_txt(url):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    # Open the url provided as an argument to the function and read the content
    f = urllib2.urlopen(urllib2.Request(url)).read()
    # Cast to StringIO object
    fp = StringIO(f)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    string = retstr.getvalue()
    retstr.close()
    return string
In [ ]:
def resource_text():
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:33.0) Gecko/20100101 Firefox/33.0'}
    html_parser = etree.HTMLParser(recover=True, encoding='utf-8')
    for url in unique_urls():
        if url.endswith('.pdf'):
            try:
                text = pdf_from_url_to_txt(url)
            except Exception:
                continue  # something went wrong, just skip ahead
            yield url, text
        else:
            try:
                r = requests.get(url, headers=headers)
            except Exception:
                continue  # something went wrong with HTTP GET, just skip ahead
            if r.status_code != 200:
                continue  # HTTP GET did not succeed, skip
            if not 'text/html' in r.headers.get('content-type', ''):
                continue  # not an HTML resource, skip
            # from: http://stackoverflow.com/a/23929292 and http://stackoverflow.com/a/15830619
            try:
                document = fromstring(r.text.encode('utf-8'), html_parser)
            except Exception:
                continue  # error parsing document, just skip ahead
            yield url, '\n'.join(etree.XPath('//text()')(document))
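What the //text() XPath does is collect every text node in document order. On a small well-formed snippet, with the stdlib ElementTree standing in for lxml (itertext() is its analogue of //text()):

```python
import xml.etree.ElementTree as ET

html = '<html><body><h1>Title</h1><p>First <b>bold</b> words.</p></body></html>'
root = ET.fromstring(html)
# every text node in document order, newline-joined as in the pipeline
text = '\n'.join(root.itertext())
print(text)
```

Note that text nodes keep their surrounding whitespace ('First ' and ' words.' here), which is harmless for follow-on work like topic modelling.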

Consume Pipeline and Store Extracted Text

This concludes our pipeline: the generator exposed by resource_text returns our current best representation of the useful text for each resource.

All that remains now is to consume this generator and store the extracted text together with the source URL for future reference.

Since this pipeline consists of a number of failure-prone steps (communicating with APIs, extracting text, etc.) I will also want to make certain to catch errors and deal with them appropriately: in the generator resource_text I already included a number of places where we just skip ahead to the next resource in case of failure, but failures may still happen elsewhere, and it is best to catch those errors right here where we consume the pipeline.

I will be lazy here and just skip all resources that generate an error during text extraction. Hence I will consume the generator resource_text until it throws the StopIteration exception while skipping all other exceptions; once StopIteration is thrown we close the database connection to tidy up.

In [ ]:
text_generator = resource_text()

try:
    db = sqlite3.connect('gmail_extracted_text.db')
    db.execute('CREATE TABLE gmail (date text, url text, compression text, extracted blob)')
except sqlite3.OperationalError:
    pass  # table gmail already exists

while True:
    try:
        url, text = text_generator.next()
        now = datetime.now().__str__()
        if isinstance(text, unicode):
            text = zlib.compress(text.encode('utf-8'))  # to decompress: zlib.decompress(text).decode('utf-8')
        else:
            text = zlib.compress(text)
        db.execute('INSERT INTO gmail VALUES (?, ?, ?, ?)', (unicode(now), unicode(url), u'zlib', sqlite3.Binary(text)))
        db.commit()
    except StopIteration:
        break  # generator consumed, stop calling .next() on it
    except Exception, e:
        print e
        continue  # some other exception was thrown by the pipeline, just skip ahead

db.close()  # tidy up

In my case the above pipeline extracted text from 571 resources and the resultant database is 7.9 MB in size.
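To read the archive back later, the compression step is simply reversed. A minimal sketch against an in-memory database with the same table layout (the URL, date, and text are made up):

```python
import sqlite3
import zlib

db = sqlite3.connect(':memory:')  # stand-in for gmail_extracted_text.db
db.execute('CREATE TABLE gmail (date text, url text, compression text, extracted blob)')
blob = zlib.compress('Some extracted text.'.encode('utf-8'))
db.execute('INSERT INTO gmail VALUES (?, ?, ?, ?)',
           ('2015-01-01 00:00:00', 'http://example.com', 'zlib', sqlite3.Binary(blob)))

row = db.execute('SELECT url, extracted FROM gmail').fetchone()
text = zlib.decompress(bytes(row[1])).decode('utf-8')
print(text)  # Some extracted text.
db.close()
```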