The AutoExtract API is a service for automatically extracting information from web content. This notebook shows how to extract article body content from articles automatically, focusing on the features offered by the articleBodyHtml attribute.
The articleBodyHtml attribute returns a clean version of the article content where all irrelevant material has been removed (framing, ads, links to content not directly related to the article, call-to-action elements, etc.) and where the resultant HTML is simplified and normalized in such a way that it is consistent across content from different sites.
The resultant HTML offers great flexibility for styling, querying and transforming the content.
AutoExtract relies on machine learning models and is able to detect elements like figure captions or block quotes even if they were not annotated with the proper HTML tag, bringing normalization one step further.
Recommendation: for a better viewing experience, execute this notebook cell by cell.
Before starting, let's import some stuff that will be needed:
import os
import re
import json
from itertools import chain
from autoextract.sync import request_batch
from IPython.core.display import HTML
from parsel import Selector
import html_text
The Scrapinghub client library scrapinghub-autoextract provides access to the Articles Extraction API in Python. A key is required to access the service; you can obtain one on this page. The client library will look for this key in the environment variable SCRAPINGHUB_AUTOEXTRACT_KEY, but you can also set it in the variable AUTOEXTRACT_KEY below and then evaluate the cell.
# Set your AutoExtract key in the variable below
AUTOEXTRACT_KEY = ""
if AUTOEXTRACT_KEY:
    os.environ['SCRAPINGHUB_AUTOEXTRACT_KEY'] = AUTOEXTRACT_KEY

if not os.environ.get('SCRAPINGHUB_AUTOEXTRACT_KEY'):
    raise Exception("Please, fill the variable 'AUTOEXTRACT_KEY' above with your "
                    "AutoExtract key")
The method request_raw is the entry point to the AutoExtract API. For convenience, let's define the method autoextract_article as:
def autoextract_article(url):
    return request_batch([url], page_type='article')[0]['article']
Among the attributes returned by AutoExtract, this notebook will focus on articleBodyHtml, which contains the simplified, normalized and cleaned-up article content as HTML.
Let's see an extraction example for this page:
sport_article = autoextract_article(
    "https://thenewdaily.com.au/sport/afl/2020/03/12/clear-the-decks-and-"
    "let-the-aflw-thrive-and-prosper/")
HTML(sport_article['articleBodyHtml'])
Note how only the relevant content of the article was extracted, avoiding elements like ads, unrelated content, etc. AutoExtract relies on advanced machine learning models that are able to discriminate between what is relevant and what is not.
Also note how figures with captions were extracted. Many other elements can also be present.
Having normalized HTML code has some cool advantages. One is that the content can be formatted independently of the original style with simple CSS rules. That means that the same consistent formatting can be applied even if the content is coming from very different pages with different formats.
AutoExtract encapsulates the articleBodyHtml content within article tags. For example:
<article>
<p>This is a simple article</p>
</article>
For convenience, we are going to encapsulate the content within a div with the class beauty. This way we will be able to apply our custom styling only to div tags with this mark.
The method show
will take care of that:
def show(article):
    return HTML(f"""
        <div class=beauty>
            {article['articleBodyHtml']}
        </div>""")
Now let's create some CSS style rules to be applied to the beauty class:
style = """
<style>
.beauty {
font-family: 'Benton Sans', Sans-Serif;
line-height: 23px;
font-size: 17.008px;
font-style: normal;
background-color: #F9F9F9;
padding: 20px;
border: 0.063rem dotted #D0D0D0;
}
.beauty h2, .beauty h3, .beauty h4, .beauty h5, .beauty h6 {
font-family: Majerit, serif;
font-weight: 700;
}
.beauty p {
margin-bottom: 10px;
color: #444;
}
.beauty dl { margin-top: 30px; }
.beauty dd { margin-left: 20px; }
.beauty figure {
display: table;
margin: 0 auto;
}
.beauty figure img {
width: 100%;
height: auto;
}
.beauty figcaption {
display: table-caption;
caption-side: bottom;
border-bottom: 0.063rem dotted #D0D0D0;
margin-bottom: 10px;
line-height: 22px;
font-size: 13px;
color: #646464;
text-align: center;
}
.beauty figcaption * {
text-align: center;
font-size: 13px;
color: #646464;
}
.beauty figcaption p { margin-bottom: 0px;}
</style>
"""
HTML(style)
Let's show the article again. It looks better, doesn't it? And the best part is that this style (with a little more work) would work consistently across content from different websites.
show(sport_article)
Have a look at the following page:
musk_article = autoextract_article(
    "https://www.geekwire.com/2019/tesla-shares-slump-sec-accuses-ceo-elon-"
    "musk-violating-tweet-deal/")
show(musk_article)
The page is full of tweets, but they are not rendered the usual way. Don't worry: everything is ready to get them formatted. All we have to do is include the Twitter widgets JavaScript library in the page. Let's do it:
twitter_js = """<script async src="https://platform.twitter.com/widgets.js" charset="utf-8">
</script>"""
HTML(twitter_js)
Now the tweets in the article are nicely formatted. Facebook and Instagram content can also be formatted by including their JavaScript libraries.
And not only that: other iframe-based multimedia content like videos, podcasts, maps, etc. will also be present and functional in the articleBodyHtml attribute.
Another advantage of having a normalized structure is that we can pick only the parts we are interested in.
In the following example, we are going to pick just the images from this article, with their corresponding captions, to compose an images array.
queen_article = autoextract_article(
    "https://www.theguardian.com/uk-news/2019/aug/23/prince-albert-passions-digitised-"
    "website-photos-200th-anniversary")
sel = Selector(queen_article['articleBodyHtml'])
images = [{'img_url': fig.xpath(".//img/@src").get(),
           'caption': html_text.selector_to_text(fig.xpath(".//figcaption"))}
          for fig in sel.xpath("//figure")]
print(json.dumps(images, indent=4))
The parsel and html-text libraries were used as helpers for the task: parsel makes it possible to query the content using XPath and CSS expressions, and html-text converts HTML content to raw text.
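html-text does quite a bit more than naive tag stripping (whitespace normalization, block vs. inline handling, etc.), but the core HTML-to-text idea can be sketched with the standard library alone. This is an illustrative toy, not html-text's actual implementation:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Naive HTML-to-text conversion: collect text nodes, skip script/style."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

print(html_to_text("<p>Hello <b>world</b>!</p><script>var x=1;</script>"))
```

The real html-text additionally handles entities, spacing around inline tags and much more, which is why the notebook uses it instead.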
Note that in the source code of the page in question there is no figcaption tag: AutoExtract's machine learning capabilities can detect that a particular section of the page is really a figure caption even if it was not annotated with the right HTML tag. The same intelligence is also applied to other elements like blockquote.
Let's go further. We are now going to compose a summary page that also
includes independent sections for figures and tweets. It is really easy to cherry pick
such elements from articleBodyHtml
. Let's see it applied to the Musk page:
sel = Selector(musk_article['articleBodyHtml'])
only_tweets = sel.css(".twitter-tweet")
only_figures = sel.css("figure")
HTML(
f"""
<article class='beauty'>
<h2>{musk_article['headline']}</h2>
<dl>
<dt>Author</dt> <dd>{musk_article['author']}</dd>
<dt>Published</dt> <dd>{musk_article['datePublished'][:10]}</dd>
<dt>Time to read</dt> <dd>{len(musk_article['articleBody'].split()) / 130:.1f}
minutes
</dd>
</dl>
<h3>First paragraph</h3>
{sel.css("article > p").get()}
<h3>Tweets ({len(only_tweets)})</h3>
{"".join(only_tweets.getall())}
<h3>Figures ({len(only_figures)})</h3>
{"".join(only_figures.getall())}
</article>
{twitter_js}
"""
)
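The time-to-read figure in the summary above is simply the word count divided by an assumed reading speed of 130 words per minute; as a standalone helper, that estimate looks like this:

```python
def reading_time_minutes(text, words_per_minute=130):
    """Estimate reading time: word count divided by an assumed reading speed."""
    return len(text.split()) / words_per_minute

# A 1300-word text at 130 words per minute takes about 10 minutes
print(f"{reading_time_minutes('lorem ipsum ' * 650):.1f} minutes")
```

The 130 words per minute is just the figure this notebook picked; adjust words_per_minute to taste.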
The normalized HTML thus brings flexibility to adapt the article content to your own purposes: you might decide to exclude figure captions, exclude multimedia content from iframes, or show figures in a separate carousel, for example.
Heading levels are also normalized, which makes it handy to automatically extract a "table of contents" from articleBodyHtml. The function print_toc presented below prints the table of contents of an article extracted by AutoExtract.
def print_toc(html):
    for section in Selector(html).css("h2,h3,h4,h5,h6"):
        level = int(section.root.tag[-1]) - 2
        print(f"{' ' * level}{section.css('::text').get()}")
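print_toc depends on parsel; to see the heading-level arithmetic in isolation, here is a stdlib-only sketch of the same idea (the sample HTML below is made up, and the four-space indentation width is a choice of this sketch):

```python
from html.parser import HTMLParser

HEADINGS = ("h2", "h3", "h4", "h5", "h6")

class TocParser(HTMLParser):
    """Collect h2-h6 heading texts, indented by level (h2 is the top level)."""
    def __init__(self):
        super().__init__()
        self.toc = []
        self._level = None  # level of the currently open heading, if any

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._level = int(tag[1]) - 2  # h2 -> 0, h3 -> 1, ...

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self._level = None

    def handle_data(self, data):
        if self._level is not None and data.strip():
            self.toc.append("    " * self._level + data.strip())

parser = TocParser()
parser.feed("<article><h2>Intro</h2><h3>Background</h3><h2>Results</h2></article>")
print("\n".join(parser.toc))
```

Because AutoExtract normalizes heading levels, this kind of arithmetic works reliably across articles from different sites.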
Let's try it with this article:
article_toc = autoextract_article("http://cs231n.github.io/neural-networks-1/")
print_toc(article_toc['articleBodyHtml'])
By default, the textual attribute articleBody does not include any text from figure elements (i.e. figure captions). This is generally desired because images cannot be included in raw text, and showing a caption without its figure is confusing for the reader.
But sometimes the body text is used as the input for some analysis algorithm: for example, you could be grouping articles by similarity using a simple technique like k-nearest neighbors, or feeding advanced deep learning models for NLP.
In all these cases you might want to have the textual information of figure captions included. It is very easy to do. Let's do it for the sport article:
# Converting `articleBodyHtml` into text is enough to have figure captions included
sport_text_with_captions = html_text.selector_to_text(
    Selector(sport_article['articleBodyHtml']))
print("Without captions:")
print("-----------------")
print(sport_article['articleBody'][500:800])
print("\nWith captions:")
print("---------------")
print(sport_text_with_captions[500:800])
Pull quotes are used very often in articles these days. A pull quote is an excerpt of the article content that is repeated within the article but highlighted with a different format (e.g. appearing in its own box and using a bigger font). A couple of examples can be seen on this page.
Pull quotes are a nice formatting element, but it might be better to strip them out when converting the document to plain text: formatting is lost in raw text, so the repeated content is not useful but distracting for the reader. The attribute articleBody already contains a text version of the article, but pull quotes are not removed there. In the following example, we are going to convert the article to raw text, excluding all pull quotes.
Note that AutoExtract detects quotes using machine learning techniques and returns them in articleBodyHtml under blockquote tags.
chris_article = autoextract_article("https://www.vox.com/the-highlight/2020/1/15/20863236/chris-hughes-break-up-facebook-economic-security-basic-income-new-republic")
def drop_elements(selectors):
    """Drops HTML subtrees for the given selectors"""
    for element in selectors:
        tree = element.root
        if tree.getparent() is not None:
            tree.drop_tree()
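drop_tree here comes from lxml, which parsel uses under the hood. The underlying idea, detaching a subtree from its parent, can be sketched with the stdlib xml.etree on a made-up snippet:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<article><p>keep</p><blockquote>drop</blockquote></article>")
for bq in root.findall("blockquote"):
    root.remove(bq)  # detach the blockquote subtree from its parent
result = ET.tostring(root, encoding="unicode")
print(result)  # <article><p>keep</p></article>
```

lxml's drop_tree additionally preserves the element's tail text, one reason the notebook uses it rather than a plain remove.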
# First let's get the text of the article without any quote.
# We'll search over it to detect which quotes are pull quotes.
sel = Selector(chris_article['articleBodyHtml'])
drop_elements(sel.css("blockquote"))
text_without_quotes = html_text.selector_to_text(sel)
# Some quotes can change the case or add some '"' characters.
# A bit of normalization helps with the matching
normalized = lambda text: re.sub(r'"|“|”', '', ' '.join(text.split()).lower().strip())
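To see the normalization in action on a made-up excerpt (curly quotes, extra whitespace and mixed case all collapse away):

```python
import re

# Same normalization as above: collapse whitespace, lowercase,
# and drop straight and curly double quotes
normalized = lambda text: re.sub(r'"|“|”', '', ' '.join(text.split()).lower().strip())

print(normalized('“We  Haven’t Heard\nfrom Mark”'))  # we haven’t heard from mark
```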
# Now let's iterate over all `blockquote` tags
sel = Selector(chris_article['articleBodyHtml'])
pull_quotes = []
for quote in sel.css("blockquote"):
    # bq_text contains the quote text
    bq_text = html_text.selector_to_text(quote)
    # The quote is a pull quote if its text already appears in the text without quotes
    if normalized(bq_text) in normalized(text_without_quotes):
        pull_quotes.append(quote)
# Let's show found pull quotes
print(f"Found {len(pull_quotes)} pull quotes from {len(sel.css('blockquote'))} "
      "source quotes:\n")
for idx, quote in enumerate(pull_quotes):
    print(f"Pull quote {idx}:")
    print("------------------")
    print(html_text.selector_to_text(quote))
    print()
Finally, we can obtain the full text with the pull quotes stripped out:
# Removing figures as well, since you will probably also want them removed
drop_elements(chain(pull_quotes, sel.css("figure")))
cleaned_text = html_text.selector_to_text(sel)
# Printing first 500 characters of the clean text
print(cleaned_text[:500])
Let's verify that we have removed the duplicated text:
def count(needle, haystack):
    return len(re.findall(needle, haystack))
pquote_excerpt = "haven’t heard from Mark"
cases_before = count(pquote_excerpt, chris_article['articleBodyHtml'])
cases_after = count(pquote_excerpt, cleaned_text)
print(f"Occurrences before: {cases_before} and after the clean up: {cases_after}")
Now it is the moment to try it yourself. Set the url variable below and execute the cell to see the results of AutoExtract on it:
url = "https://www.vox.com/policy-and-politics/2020/1/17/21046874/netherlands-universal-health-insurance-private"
article = autoextract_article(url)
show(article)