The AutoExtract API is a service for automatically extracting information from web content. This notebook shows how to extract article body content from articles automatically, focusing on the features offered by the articleBodyHtml attribute.
The articleBodyHtml attribute returns a clean version of the article content where all irrelevant material has been removed (framing, ads, links to content not directly related to the article, call-to-action elements, etc.) and where the resultant HTML is simplified and normalized in such a way that it is consistent across content from different sites.
The resultant HTML offers great flexibility for styling, querying and transforming the content.
AutoExtract relies on machine learning models and is able to detect elements like figure captions or block quotes even if they were not annotated with the proper HTML tag, bringing normalization one step further.
Recommendation: for a better viewing experience, execute this notebook cell by cell.
Before starting, let's import some stuff that will be needed:
import os
import re
import json
from itertools import chain
from autoextract.sync import request_batch
from IPython.core.display import HTML
from parsel import Selector
import html_text
The Scrapinghub client library scrapinghub-autoextract provides access to the Articles Extraction API in Python. A key is required to access the service; you can obtain one on this page. The client library will look for this key in the environment variable SCRAPINGHUB_AUTOEXTRACT_KEY, but you can also set it in the variable AUTOEXTRACT_KEY below and then evaluate the cell.
# Set your AutoExtract key in the variable below
AUTOEXTRACT_KEY = ""
if AUTOEXTRACT_KEY:
    os.environ['SCRAPINGHUB_AUTOEXTRACT_KEY'] = AUTOEXTRACT_KEY

if not os.environ.get('SCRAPINGHUB_AUTOEXTRACT_KEY'):
    raise Exception("Please, fill the variable 'AUTOEXTRACT_KEY' above with your "
                    "AutoExtract key")
The method request_raw is the entry point to the AutoExtract API. For convenience, let's define the method autoextract_article as:
def autoextract_article(url):
    return request_batch([url], page_type='article')[0]['article']
Among the attributes returned by AutoExtract, this notebook will focus on articleBodyHtml, which contains the simplified, normalized and cleaned-up article content as HTML.
Let's see an extraction example for this page:
sport_article = autoextract_article(
    "https://thenewdaily.com.au/sport/afl/2020/03/12/clear-the-decks-and-"
    "let-the-aflw-thrive-and-prosper/")
HTML(sport_article['articleBodyHtml'])
Note how only the relevant content of the article was extracted, avoiding elements like ads, unrelated content, etc. AutoExtract relies on advanced machine learning models that are able to discriminate between what is relevant and what is not.
Also note how figures with captions were extracted. Many other elements can also be present.
Having normalized HTML code has some cool advantages. One is that the content can be formatted independently of the original style with simple CSS rules. That means that the same consistent formatting can be applied even if the content is coming from very different pages with different formats.
AutoExtract encapsulates the articleBodyHtml content within article tags. For example:
<article>
<p>This is a simple article</p>
</article>
For convenience, we are going to encapsulate the content within a div with the class beauty. This way we will be able to apply our custom styling only to div tags with this mark.
The method show
will take care of that:
def show(article):
    return HTML(f"""
        <div class=beauty>
            {article['articleBodyHtml']}
        </div>""")
Now let's create some CSS style rules to be applied to the beauty class:
style = """
<style>
.beauty {
font-family: 'Benton Sans', Sans-Serif;
line-height: 23px;
font-size: 17.008px;
font-style: normal;
background-color: #F9F9F9;
padding: 20px;
border: 0.063rem dotted #D0D0D0;
}
.beauty h2, .beauty h3, .beauty h4, .beauty h5, .beauty h6 {
font-family: Majerit, serif;
font-weight: 700;
}
.beauty p {
margin-bottom: 10px;
color: #444;
}
.beauty dl { margin-top: 30px; }
.beauty dd { margin-left: 20px; }
.beauty figure {
display: table;
margin: 0 auto;
}
.beauty figure img {
width: 100%;
height: auto;
}
.beauty figcaption {
display: table-caption;
caption-side: bottom;
border-bottom: 0.063rem dotted #D0D0D0;
margin-bottom: 10px;
line-height: 22px;
font-size: 13px;
color: #646464;
text-align: center;
}
.beauty figcaption * {
text-align: center;
font-size: 13px;
color: #646464;
}
.beauty figcaption p { margin-bottom: 0px;}
</style>
"""
HTML(style)
Let's show the article again. It looks better, doesn't it? And the best part is that this style (with a little more work) would work consistently across content from different websites.
show(sport_article)
Have a look at the following page:
musk_article = autoextract_article(
    "https://www.geekwire.com/2019/tesla-shares-slump-sec-accuses-ceo-elon-"
    "musk-violating-tweet-deal/")
show(musk_article)
The page is full of tweets, but they are not rendered the usual way. Don't worry: everything is ready to get them formatted. All we have to do is include the Twitter widgets JavaScript library in the page. Let's do it:
twitter_js = """<script async src="https://platform.twitter.com/widgets.js" charset="utf-8">
</script>"""
HTML(twitter_js)
Now the tweets in the article are nicely formatted. Facebook and Instagram content can also be formatted by including their JavaScript libraries.
And not only that: other iframe-based multimedia content like videos, podcasts, maps, etc. will also be present and functional in the articleBodyHtml attribute.
Another advantage of having a normalized structure is that we can pick only the parts we are interested in.
In the following example, we are going to pick just the images from this article, with their corresponding captions, to compose an images array.
queen_article = autoextract_article(
    "https://www.theguardian.com/uk-news/2019/aug/23/prince-albert-passions-digitised-"
    "website-photos-200th-anniversary")
sel = Selector(queen_article['articleBodyHtml'])
images = [{'img_url': fig.xpath(".//img/@src").get(),
           'caption': html_text.selector_to_text(fig.xpath(".//figcaption"))}
          for fig in sel.xpath("//figure")]
print(json.dumps(images, indent=4))
The parsel and html-text libraries were used as helpers for the task: parsel makes it possible to query the content using XPath and CSS expressions, and html-text converts HTML content to raw text.
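html-text does quite a bit more than naive tag stripping (whitespace normalization, block vs. inline handling, etc.), but the core HTML-to-text idea can be sketched with the standard library alone. This is an illustrative toy, not html-text's actual implementation:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Naive HTML-to-text conversion: collect text nodes, skip script/style."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

print(html_to_text("<p>Hello <b>world</b>!</p><script>var x=1;</script>"))
```

The real html-text additionally handles entities, spacing around inline tags and much more, which is why the notebook uses it instead.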
Note that in the source code of the page in question there is no figcaption tag: AutoExtract's machine learning capabilities can detect that a particular section of the page is really a figure caption even if it was not annotated with the right HTML tag. The same intelligence is also applied to other elements like blockquote.
Let's go further. We are now going to compose a summary page that also
includes independent sections for figures and tweets. It is really easy to cherry pick
such elements from articleBodyHtml
. Let's see it applied to the Musk page:
sel = Selector(musk_article['articleBodyHtml'])
only_tweets = sel.css(".twitter-tweet")
only_figures = sel.css("figure")
HTML(
f"""
<article class='beauty'>
<h2>{musk_article['headline']}</h2>
<dl>
<dt>Author</dt> <dd>{musk_article['author']}</dd>
<dt>Published</dt> <dd>{musk_article['datePublished'][:10]}</dd>
<dt>Time to read</dt> <dd>{len(musk_article['articleBody'].split()) / 130:.1f}
minutes
</dd>
</dl>
<h3>First paragraph</h3>
{sel.css("article > p").get()}
<h3>Tweets ({len(only_tweets)})</h3>
{"".join(only_tweets.getall())}
<h3>Figures ({len(only_figures)})</h3>
{"".join(only_figures.getall())}
</article>
{twitter_js}
"""
)
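The time-to-read figure in the summary above is simply the word count divided by an assumed reading speed of 130 words per minute; as a standalone helper, that estimate looks like this:

```python
def reading_time_minutes(text, words_per_minute=130):
    """Estimate reading time: word count divided by an assumed reading speed."""
    return len(text.split()) / words_per_minute

# A 1300-word text at 130 words per minute takes about 10 minutes
print(f"{reading_time_minutes('lorem ipsum ' * 650):.1f} minutes")
```

The 130 words per minute is just the figure this notebook picked; adjust words_per_minute to taste.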
The normalized HTML thus brings flexibility to adapt the article content to your own purposes: you might decide to exclude figure captions, exclude multimedia content from iframes, or show figures in a separate carousel, for example.
Heading levels are also normalized, which makes it handy to automatically extract a "table of contents" from articleBodyHtml. The function print_toc presented below prints the table of contents of an article extracted by AutoExtract.
def print_toc(html):
    for section in Selector(html).css("h2,h3,h4,h5,h6"):
        level = int(section.root.tag[-1]) - 2
        print(f"{' ' * level}{section.css('::text').get()}")
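print_toc depends on parsel; to see the heading-level arithmetic in isolation, here is a stdlib-only sketch of the same idea (the sample HTML below is made up, and the four-space indentation width is a choice of this sketch):

```python
from html.parser import HTMLParser

HEADINGS = ("h2", "h3", "h4", "h5", "h6")

class TocParser(HTMLParser):
    """Collect h2-h6 heading texts, indented by level (h2 is the top level)."""
    def __init__(self):
        super().__init__()
        self.toc = []
        self._level = None  # level of the currently open heading, if any

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._level = int(tag[1]) - 2  # h2 -> 0, h3 -> 1, ...

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self._level = None

    def handle_data(self, data):
        if self._level is not None and data.strip():
            self.toc.append("    " * self._level + data.strip())

parser = TocParser()
parser.feed("<article><h2>Intro</h2><h3>Background</h3><h2>Results</h2></article>")
print("\n".join(parser.toc))
```

Because AutoExtract normalizes heading levels, this kind of arithmetic works reliably across articles from different sites.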
Let's try it with this article:
article_toc = autoextract_article("http://cs231n.github.io/neural-networks-1/")
print_toc(article_toc['articleBodyHtml'])
By default, the textual attribute articleBody does not include any text from figure elements (i.e. figure captions). This is generally desired because images cannot be included in raw text, and showing a caption without its figure is confusing for the reader.
But sometimes the body text is used as the input for some analysis algorithm: for example, you could be grouping articles by similarity using a simple technique like k-nearest neighbors, or feeding advanced deep learning models for NLP.
In all these cases you might want to have the textual information of figure captions included. It is very easy to do. Let's do it for the sport article:
# Converting `articleBodyHtml` into text is enough to have figure captions included
sport_text_with_captions = html_text.selector_to_text(
    Selector(sport_article['articleBodyHtml']))
print("Without captions:")
print("-----------------")
print(sport_article['articleBody'][500:800])
print("\nWith captions:")
print("---------------")
print(sport_text_with_captions[500:800])
Pull quotes are used very often in articles these days. A pull quote is an excerpt of the article content that is repeated within the article but highlighted with a different format (e.g. appearing in its own box and using a bigger font). A couple of examples can be seen on this page.
Pull quotes are a nice formatting element, but it might be better to strip them out when converting the document to plain text: formatting is lost in raw text, so the repeated content is not useful but distracting for the reader. The attribute articleBody already contains a text version of the article, but pull quotes are not removed there. In the following example, we are going to convert the article to raw text, excluding all pull quotes.
Note that AutoExtract detects quotes using machine learning techniques and returns them in articleBodyHtml under blockquote tags.
chris_article = autoextract_article("https://www.vox.com/the-highlight/2020/1/15/20863236/chris-hughes-break-up-facebook-economic-security-basic-income-new-republic")
def drop_elements(selectors):
    """Drops HTML subtrees for the given selectors"""
    for element in selectors:
        tree = element.root
        if tree.getparent() is not None:
            tree.drop_tree()
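drop_tree here comes from lxml, which parsel uses under the hood. The underlying idea, detaching a subtree from its parent, can be sketched with the stdlib xml.etree on a made-up snippet:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<article><p>keep</p><blockquote>drop</blockquote></article>")
for bq in root.findall("blockquote"):
    root.remove(bq)  # detach the blockquote subtree from its parent
result = ET.tostring(root, encoding="unicode")
print(result)  # <article><p>keep</p></article>
```

lxml's drop_tree additionally preserves the element's tail text, one reason the notebook uses it rather than a plain remove.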
# First let's get the text of the article without any quote.
# We'll search over it to detect which quotes are pull quotes.
sel = Selector(chris_article['articleBodyHtml'])
drop_elements(sel.css("blockquote"))
text_without_quotes = html_text.selector_to_text(sel)
# Some quotes can change the case or add some '"' characters.
# A bit of normalization helps with the matching
normalized = lambda text: re.sub(r'"|“|”', '', ' '.join(text.split()).lower().strip())
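To see the normalization in action on a made-up excerpt (curly quotes, extra whitespace and mixed case all collapse away):

```python
import re

# Same normalization as above: collapse whitespace, lowercase,
# and drop straight and curly double quotes
normalized = lambda text: re.sub(r'"|“|”', '', ' '.join(text.split()).lower().strip())

print(normalized('“We  Haven’t Heard\nfrom Mark”'))  # we haven’t heard from mark
```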
# Now let's iterate over all `blockquote` tags
sel = Selector(chris_article['articleBodyHtml'])
pull_quotes = []
for quote in sel.css("blockquote"):
    # bq_text contains the quote text
    bq_text = html_text.selector_to_text(quote)
    # The quote is a pull quote if its text already appears in the text without quotes
    if normalized(bq_text) in normalized(text_without_quotes):
        pull_quotes.append(quote)
# Let's show found pull quotes
print(f"Found {len(pull_quotes)} pull quotes from {len(sel.css('blockquote'))} "
      "source quotes:\n")
for idx, quote in enumerate(pull_quotes):
    print(f"Pull quote {idx}:")
    print("------------------")
    print(html_text.selector_to_text(quote))
    print()
Finally, we can obtain the full text with the pull quotes stripped out:
# Removing figures as well, since you will probably also want them removed
drop_elements(chain(pull_quotes, sel.css("figure")))
cleaned_text = html_text.selector_to_text(sel)
# Printing first 500 characters of the clean text
print(cleaned_text[:500])
Let's verify that we have removed the duplicated text:
def count(needle, haystack):
    return len(re.findall(needle, haystack))
pquote_excerpt = "haven’t heard from Mark"
cases_before = count(pquote_excerpt, chris_article['articleBodyHtml'])
cases_after = count(pquote_excerpt, cleaned_text)
print(f"Occurrences before: {cases_before} and after the clean up: {cases_after}")
Now it is the moment to try it yourself. Set the url variable below and execute the cell to see the results of AutoExtract on it:
url = "https://www.vox.com/policy-and-politics/2020/1/17/21046874/netherlands-universal-health-insurance-private"
article = autoextract_article(url)
show(article)