How To Retrieve Unstructured Web Data in a Structured Manner with Riko

A Riveting 15-688 Tutorial

*by* Ahmet Emre Unal ([aemreunal](https://github.com/aemreunal))

You might have heard about Google Reader. It was a free RSS reader that brought RSS reading to the masses. It was a great product and I, personally, was a very heavy user. Google Reader allowed me to follow many websites that publish things infrequently. This, though, was only possible through the RSS feeds published by the websites.

It's great when a website admin takes the time to create the necessary RSS feeds (or set up a tool that does it), but every so often you come across a website that you want to follow but that doesn't have an RSS feed. How can you still make use of this beautiful system? Can you somehow parse the plain HTML web page to retrieve data in an ordered fashion?

Riko is a library that allows you to do exactly that. Using Riko, we can parse the plain HTML of a website and retrieve its elements in an orderly fashion, for example by iterating through <li> elements with a for-loop.

I personally believe in walking through examples to learn something, so let's jump right in. (If you would like to follow along, you can install Riko in your local environment.)

In [ ]:
import os
import itertools
from riko.collections.sync import SyncPipe

def get_test_site_url(test_site_name):
    return 'file://' + os.getcwd() + '/test_sites/' + test_site_name

In [ ]:
##########################################################################################
#
# Note: You can use the following section to create the test sites' files:
#
##########################################################################################

test_site_1_contents = '''<!DOCTYPE html>\n<html>\n<body>\n\n<h4>This is a simple example</h4>
<div class="container">\n    <ul>\n      <li class="drink hot">Coffee</li>
      <li class="drink hot">Green Tea</li>\n      <li class="hot drink">Black Tea</li>
      <li class="drink cold">Milk</li>\n      <li class="food">Chocolate</li>
      <li class="food">Marshmallow</li>
    </ul>\n</div>\n\n</body>\n</html>\n'''

test_site_2_contents = '''<!DOCTYPE html>\n<html>\n<body>\n\n<h4>This is a slightly more complex example</h4>
<div class="container">\n    <ul>\n      <li class="drink hot">Coffee</li>
      <li class="drink hot">Green Tea\n          <p>Oolong Tea</p>
          <a href="https://en.wikipedia.org/wiki/Oolong"></a>\n      </li>\n      <li class="hot drink">Black Tea
          <p>Rize Tea</p>\n          <a href="https://en.wikipedia.org/wiki/Rize_Tea"></a>\n      </li>
      <li class="drink cold">Milk</li>\n      <li class="food">Chocolate</li>
      <li class="food">Marshmallow</li>\n    </ul>\n</div>\n\n</body>\n</html>\n'''

# Create the 'test_sites' folder and the test files, if they don't already exist:

path = os.getcwd() + '/test_sites/'

# Check if 'test_sites' folder exists
if not os.path.exists(path):
    os.mkdir(path)  # Create the 'test_sites' folder
    
# Check if 'test1.html' file exists
if not os.path.exists(path + 'test1.html'):
    with open(path + 'test1.html', "w") as test_site_1:
        test_site_1.write(test_site_1_contents)
    
# Check if 'test2.html' file exists
if not os.path.exists(path + 'test2.html'):
    with open(path + 'test2.html', "w") as test_site_2:
        test_site_2.write(test_site_2_contents)

##########################################################################################

In the test_sites folder, you will find two HTML files (test1.html and test2.html) that are simple website examples. The first one, test1.html, is as follows:

<!DOCTYPE html>
<html>
<body>

<h4>This is a simple example</h4>
<div class="container">
    <ul>
      <li class="drink hot">Coffee</li>
      <li class="drink hot">Green Tea</li>
      <li class="hot drink">Black Tea</li>
      <li class="drink cold">Milk</li>
      <li class="food">Chocolate</li>
      <li class="food">Marshmallow</li>
    </ul>
</div>

</body>
</html>

Riko sees things through what's called a 'pipe'. By fetching a webpage through a URL and pointing Riko to the appropriate part of that webpage, we can obtain 'streams' coming out of those pipes that can be iterated over. Let's start with the very simple act of retrieving the webpage in its entirety. We can achieve this with the fetchpage module, which will literally just fetch a page:

In [ ]:
url = get_test_site_url('test1.html')          # The URL of our test website
fetch_conf = {'url': url}                      # A configuration dictionary for Riko
pipe = SyncPipe('fetchpage', conf=fetch_conf)  # A pipe that streams 'test1.html'
stream = pipe.output                           # The stream being output from the pipe

What we did was to tell Riko to create a synchronous pipe (using the SyncPipe class) that uses the webpage fetching module (called fetchpage) to fetch the URL specified in the fetch_conf configuration dictionary.

We could've created the stream directly by using the fetchpage module itself:

from riko.modules import fetchpage
stream = fetchpage.pipe(conf=fetch_conf)

but we'll see in a bit why we're using the SyncPipe class.

You might be wondering when Riko even had the time to go fetch the page. Well, pipes in Riko are lazy: a pipe won't start fetching (or processing) a URL before we start iterating over its stream. So let's iterate:

In [ ]:
for item in stream:
    print item

I told you it would literally just fetch the entire page:

{u'content': '<!DOCTYPE html>\n\n<html>\n\n<body>\n\n\n\n<h4>This is a simple example</h4>\n\n<div class="container">\n\n    <ul>\n\n      <li class="drink hot">Coffee</li>\n\n      <li class="drink hot">Green Tea</li>\n\n      <li class="hot drink">Black Tea</li>\n\n      <li class="drink cold">Milk</li>\n\n      <li class="food">Chocolate</li>\n\n      <li class="food">Marshmallow</li>\n\n    </ul>\n\n</div>\n\n\n\n</body>\n\n</html>\n\n\n'}
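As an aside on the laziness mentioned earlier, here is a minimal plain-Python sketch of the same idea using a generator. This is not Riko code, and the names fetch_lines and log are made up for illustration; the only point is that no work happens until the first item is requested:

```python
def fetch_lines(log):
    """A stand-in for a lazy pipe: it records when the 'fetch' actually starts."""
    log.append('fetch started')
    for line in ['<html>', '<body>', '</body>', '</html>']:
        yield line


log = []
stream = fetch_lines(log)      # Creating the stream does not run any of the body
started_before = list(log)     # Still empty at this point

first = next(stream)           # Asking for the first item triggers the 'fetch'
started_after = list(log)      # Now the log has one entry

print(started_before)
print(first)
print(started_after)
```

Riko's own pipes may be implemented differently under the hood, but the behavior you observe (nothing is fetched until you iterate) is the same in spirit.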

The whole webpage being printed is not really that useful; there is nothing special about this. We could've at least specified a start and end tag for Riko to fetch only that part:

In [ ]:
fetch_conf = {   # The same config as above, but with the start and end tags to fetch specified
    'url': url,
    'start': '<body>',
    'end': '</body>'
}
pipe = SyncPipe('fetchpage', conf=fetch_conf)  # A pipe that streams 'test1.html' according to the config above
stream = pipe.output                           # The stream being output from the pipe

for item in stream:
    print item

This isn't very useful either, honestly:

{u'content': '\n\n\n\n<h4>This is a simple example</h4>\n\n<div class="container">\n\n    <ul>\n\n      <li class="drink hot">Coffee</li>\n\n      <li class="drink hot">Green Tea</li>\n\n      <li class="hot drink">Black Tea</li>\n\n      <li class="drink cold">Milk</li>\n\n      <li class="food">Chocolate</li>\n\n      <li class="food">Marshmallow</li>\n\n    </ul>\n\n</div>\n\n\n\n'}

To get to the list items we want, we'd need to do some weird string processing. We don't want to do that and that's why we have Riko!


Let's take a side step and ask ourselves a question: a URL is a string that points to a webpage (or a file in the filesystem), but what could point to an element inside a webpage? The answer is XPath. An XPath is very similar to a URL, except that it denotes a path inside a markup file. For example, the XPath of the <ul> element in the website structure above is /html/body/div/ul. In turn, all of the <li> elements under that <ul> element can be pointed to with the XPath /html/body/div/ul/li, and a single one with /html/body/div/ul/li[<index>], where <index> is the 1-based index (index = 1 is the first element).
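If you want to get a feel for XPath before involving Riko, the sketch below tries the two paths above using only Python's standard library. It is an illustration under a few assumptions: xml.etree.ElementTree supports just a subset of XPath, it needs well-formed markup (which real-world HTML often isn't), and its paths are written relative to the root <html> element, so /html/body/div/ul/li becomes ./body/div/ul/li. The markup is the body of test1.html without the DOCTYPE line:

```python
import xml.etree.ElementTree as ET

page = '''<html>
<body>
<h4>This is a simple example</h4>
<div class="container">
    <ul>
      <li class="drink hot">Coffee</li>
      <li class="drink hot">Green Tea</li>
      <li class="hot drink">Black Tea</li>
      <li class="drink cold">Milk</li>
      <li class="food">Chocolate</li>
      <li class="food">Marshmallow</li>
    </ul>
</div>
</body>
</html>'''

root = ET.fromstring(page)  # 'root' is the <html> element

# All <li> elements under the <ul>
all_items = [li.text for li in root.findall('./body/div/ul/li')]

# Only the first <li> element (XPath indices are 1-based)
first_item = root.find('./body/div/ul/li[1]').text

print(all_items)
print(first_item)
```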


Riko has an alternative module called xpathfetchpage that takes a URL as well as an XPath, and pipes the element(s) pointed to by that XPath:

In [ ]:
xpath = '/html/body/div/ul'                         # The XPath of the <ul> element
xpath_conf = {'xpath': xpath, 'url': url}           # The XPath configuration dictionary for Riko
pipe = SyncPipe('xpathfetchpage', conf=xpath_conf)  # A pipe that streams what's pointed by the 
                                                    # XPath inside 'test1.html'
stream = pipe.output                                # The stream being output from the pipe
    
for item in stream:
    print item

Ah, now this seems interesting:

{u'{http://www.w3.org/1999/xhtml}li': [{u'content': u'Coffee', u'class': u'drink hot'}, {u'content': u'Green Tea', u'class': u'drink hot'}, {u'content': u'Black Tea', u'class': u'hot drink'}, {u'content': u'Milk', u'class': u'drink cold'}, {u'content': u'Chocolate', u'class': u'food'}, {u'content': u'Marshmallow', u'class': u'food'}]}

The pipe seems to have retrieved a dictionary with a single key, u'{http://www.w3.org/1999/xhtml}li' (weird key, I know), which points to a list of dictionaries, like {u'content': u'Coffee', u'class': u'drink hot'}, that look eerily similar to our list elements! But it's still tedious at this point to unwrap that outer dictionary. Let's try pointing Riko to an XPath that matches all of the <li> elements, which is /html/body/div/ul/li:

In [ ]:
xpath = '/html/body/div/ul/li'                      # The XPath of the <li> element(s)
xpath_conf = {'xpath': xpath, 'url': url}           # The XPath configuration dictionary for Riko
pipe = SyncPipe('xpathfetchpage', conf=xpath_conf)  # A pipe that streams what's pointed by the 
                                                    # XPath inside 'test1.html'
stream = pipe.output                                # The stream being output from the pipe
    
for item in stream:
    print item

Now we're talking:

{u'content': u'Coffee', u'class': u'drink hot'}
{u'content': u'Green Tea', u'class': u'drink hot'}
{u'content': u'Black Tea', u'class': u'hot drink'}
{u'content': u'Milk', u'class': u'drink cold'}
{u'content': u'Chocolate', u'class': u'food'}
{u'content': u'Marshmallow', u'class': u'food'}

We have retrieved each <li> element as a separate item through the stream we created.

As mentioned above, we could've retrieved a specific <li> element by specifying its index on the XPath; adding '[1]' to the end of the XPath above will return:

{u'content': u'Coffee', u'class': u'drink hot'}

Let's say we are only interested in the drinks. How do we only get the drinks? Do we check for and do some weird string matching with the class of each element while iterating over the stream elements and only add ones that match our criteria? Nope!

The point of having streams and pipes is to filter the streams and prevent the unwanted objects from going through the stream in the first place. Riko has a way to filter streams: the very handy filter pipe module. The gist of thinking in Riko's terms is to think of chaining pipes together. The first pipe will be carrying a flow of the <li> elements we pointed to. The second pipe, the filter pipe, will only let through elements that match certain criteria:

In [ ]:
url = get_test_site_url('test1.html')               # The URL of our test website
xpath = '/html/body/div/ul/li'                      # The XPath of the <li> element(s)
xpath_conf = {'xpath': xpath, 'url': url}           # The XPath configuration dictionary for Riko
pipe = SyncPipe('xpathfetchpage', conf=xpath_conf)  # A pipe that streams what's pointed by the 
                                                    # XPath inside 'test1.html'
filter_rule = {                                     # A 'filter' rule that tells the 'filter'
    'field': 'class',                               # pipe to perform the 'contains' operation on the 'class'
    'op': 'contains',                               # field, to check whether the value 'drink' exists, and
    'value': 'drink'                                # only let through the items that do match the rule
}
filter_conf = {'rule': filter_rule}                 # The 'filter' pipe configuration created from the rule
pipe = pipe.filter(conf=filter_conf)                # A chained pipe that filters according to the configuration
stream = pipe.output                                # The stream being output from the pipe
    
for item in stream:
    print item

This is getting really cool:

{u'content': u'Coffee', u'class': u'drink hot'}
{u'content': u'Green Tea', u'class': u'drink hot'}
{u'content': u'Black Tea', u'class': u'hot drink'}
{u'content': u'Milk', u'class': u'drink cold'}

We seem to have retrieved all the drinks, and only the drinks! A similar operation can be performed to retrieve only the hot drinks:

In [ ]:
url = get_test_site_url('test1.html')               # The URL of our test website
xpath = '/html/body/div/ul/li'                      # The XPath of the <li> element(s)
xpath_conf = {'xpath': xpath, 'url': url}           # The XPath configuration dictionary for Riko
pipe = SyncPipe('xpathfetchpage', conf=xpath_conf)  # A pipe that streams what's pointed by the 
                                                    # XPath inside 'test1.html'
filter_rule = {                                     # A 'filter' rule that tells the 'filter'
    'field': 'class',                               # pipe to perform the 'contains' operation on the 'class'
    'op': 'contains',                               # field, to check whether the value 'drink hot' exists, and
    'value': 'drink hot'                            # only let through the items that do match the rule
}
filter_conf = {'rule': filter_rule}                 # The 'filter' pipe configuration created from the rule
pipe = pipe.filter(conf=filter_conf)                # A chained pipe that filters according to the configuration
stream = pipe.output                                # The stream being output from the pipe
    
for item in stream:
    print item

Wow, this is even cooler:

{u'content': u'Coffee', u'class': u'drink hot'}
{u'content': u'Green Tea', u'class': u'drink hot'}

but it seems like we have a problem: because the 'value' key in the rule above is 'drink hot', the rule doesn't match the <li> element with the class 'hot drink', which is perfectly valid and equivalent to the class 'drink hot'. A long, more specific value can get pretty unwieldy. It would make more sense if we could apply multiple shorter, more general rules to the filter pipe:

In [ ]:
url = get_test_site_url('test1.html')               # The URL of our test website
xpath = '/html/body/div/ul/li'                      # The XPath of the <li> element(s)
xpath_conf = {'xpath': xpath, 'url': url}           # The XPath configuration dictionary for Riko
pipe = SyncPipe('xpathfetchpage', conf=xpath_conf)  # A pipe that streams what's pointed by the 
                                                    # XPath inside 'test1.html'
filter_rule_drink = {                               # A 'filter' rule that tells the 'filter'
    'field': 'class',                               # pipe to perform the 'contains' operation on the 'class'
    'op': 'contains',                               # field, to check whether the value 'drink' exists, and
    'value': 'drink'                                # only let through the items that do match the rule
}
filter_rule_hot = {                                 # A 'filter' rule that tells the 'filter'
    'field': 'class',                               # pipe to perform the 'contains' operation on the 'class'
    'op': 'contains',                               # field, to check whether the value 'hot' exists, and
    'value': 'hot'                                  # only let through the items that do match the rule
}
filter_conf = {                                     # The 'filter' pipe configuration created from the two
    'rule': [filter_rule_drink, filter_rule_hot]    # rules specified above
}
pipe = pipe.filter(conf=filter_conf)                # A chained pipe that filters according to the configuration
stream = pipe.output                                # The stream being output from the pipe
    
for item in stream:
    print item

Have you heard? They're saying you're the coolest kid on the block:

{u'content': u'Coffee', u'class': u'drink hot'}
{u'content': u'Green Tea', u'class': u'drink hot'}
{u'content': u'Black Tea', u'class': u'hot drink'}
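The class-order problem we just worked around is easy to reproduce in plain Python. The sketch below is not Riko code; it works on dictionaries shaped like the stream items above, and has_classes is a helper name I made up. It compares a substring check on the whole class string with a check on the individual class names:

```python
items = [
    {'content': 'Coffee', 'class': 'drink hot'},
    {'content': 'Green Tea', 'class': 'drink hot'},
    {'content': 'Black Tea', 'class': 'hot drink'},
    {'content': 'Milk', 'class': 'drink cold'},
    {'content': 'Chocolate', 'class': 'food'},
    {'content': 'Marshmallow', 'class': 'food'},
]


def has_classes(item, wanted):
    """True if every wanted class name appears in the item's class attribute."""
    return set(wanted) <= set(item['class'].split())


# Substring check on the whole attribute: sensitive to the order of the classes
substring_matches = [i['content'] for i in items if 'drink hot' in i['class']]

# Check on the individual class names: the order no longer matters
token_matches = [i['content'] for i in items if has_classes(i, ['drink', 'hot'])]

print(substring_matches)
print(token_matches)
```

The first list misses Black Tea for the same reason the single 'drink hot' rule did, while the second one finds all three hot drinks, much like the two shorter rules above.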

It seems to be pretty clear how you can apply different filters to get the elements you want. You can use the filter pipe to filter based on the content as well, to, for example, print only the teas:

filter_rule = {          # A 'filter' rule that tells the 'filter'
    'field': 'content',  # pipe to perform the 'contains' operation on the 'content'
    'op': 'contains',    # field, to check whether the value 'tea' exists, and
    'value': 'tea'       # only let through the items that do match the rule
}

which, when used in the ways above, would print:

{u'content': u'Green Tea', u'class': u'drink hot'}
{u'content': u'Black Tea', u'class': u'hot drink'}

Notice that the rule was applied case-insensitively: the value 'tea' matched both 'Green Tea' and 'Black Tea'.
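To build some intuition for what a chained filter pipe does, here is a rough plain-Python imitation of the behavior we have observed so far: a case-insensitive 'contains' check, with every rule having to match. This is only a sketch of the observed behavior, not how Riko implements it; contains and filter_stream are hypothetical names:

```python
def contains(field, value):
    """Build a rule: a case-insensitive substring check on one field."""
    needle = value.lower()
    return lambda item: needle in item.get(field, '').lower()


def filter_stream(stream, rules):
    """Lazily let through only the items that satisfy every rule."""
    for item in stream:
        if all(rule(item) for rule in rules):
            yield item


items = [
    {'content': 'Coffee', 'class': 'drink hot'},
    {'content': 'Green Tea', 'class': 'drink hot'},
    {'content': 'Black Tea', 'class': 'hot drink'},
    {'content': 'Milk', 'class': 'drink cold'},
    {'content': 'Chocolate', 'class': 'food'},
    {'content': 'Marshmallow', 'class': 'food'},
]

# One rule on the 'content' field
teas = [i['content'] for i in filter_stream(iter(items), [contains('content', 'tea')])]

# Two rules on the 'class' field, both of which must match
hot_drink_rules = [contains('class', 'drink'), contains('class', 'hot')]
hot_drinks = [i['content'] for i in filter_stream(iter(items), hot_drink_rules)]

print(teas)
print(hot_drinks)
```

Because filter_stream is a generator, it fits the same mental model as a pipe: it consumes one stream and produces another, and nothing is processed until someone iterates over the result.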


Through all of these streams, you can use the items, which are plain old Python objects, in any way you want. You can go ahead and print the list of hot drinks you have with the following for-loop:

for item in stream:
    print item['content']  # 'item' object is a regular Python dictionary

which would print:

Coffee
Green Tea
Black Tea

Let's look at the following, more complicated webpage structure, which is test2.html:

<!DOCTYPE html>
<html>
<body>

<h4>This is a slightly more complex example</h4>
<div class="container">
    <ul>
      <li class="drink hot">Coffee</li>
      <li class="drink hot">Green Tea
          <p>Oolong Tea</p>
          <a href="https://en.wikipedia.org/wiki/Oolong"></a>
      </li>
      <li class="hot drink">Black Tea
          <p>Rize Tea</p>
          <a href="https://en.wikipedia.org/wiki/Rize_Tea"></a>
      </li>
      <li class="drink cold">Milk</li>
      <li class="food">Chocolate</li>
      <li class="food">Marshmallow</li>
    </ul>
</div>

</body>
</html>

How would we access the URLs nested under the teas in the list? If you thought of 'XPath', you can congratulate yourself:

In [ ]:
url = get_test_site_url('test2.html')               # The URL of our test website
xpath = '/html/body/div/ul/li/a'                    # The XPath of the <a> element(s)
xpath_conf = {'xpath': xpath, 'url': url}           # The XPath configuration dictionary for Riko
pipe = SyncPipe('xpathfetchpage', conf=xpath_conf)  # A pipe that streams what's pointed by the 
                                                    # XPath inside 'test2.html'
stream = pipe.output                                # The stream being output from the pipe
    
for item in stream:
    print item['href']

It seems like we got both of the URLs:

https://en.wikipedia.org/wiki/Oolong
https://en.wikipedia.org/wiki/Rize_Tea

Notice how Riko didn't raise an error for <li> tags that lack <a> tags underneath them. This is because the XPath only matches those that do have the <a> tags. This is very handy for unstructured web data, where some tags might have nested elements, while some might not.
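You can see the same "only what matches" behavior with the standard library alone. The sketch below is an illustration rather than Riko code: it parses the markup of test2.html (without the DOCTYPE line) with xml.etree.ElementTree, which supports a subset of XPath and expects paths relative to the root <html> element. The variable links_by_item is a name I made up:

```python
import xml.etree.ElementTree as ET

page = '''<html>
<body>
<h4>This is a slightly more complex example</h4>
<div class="container">
    <ul>
      <li class="drink hot">Coffee</li>
      <li class="drink hot">Green Tea
          <p>Oolong Tea</p>
          <a href="https://en.wikipedia.org/wiki/Oolong"></a>
      </li>
      <li class="hot drink">Black Tea
          <p>Rize Tea</p>
          <a href="https://en.wikipedia.org/wiki/Rize_Tea"></a>
      </li>
      <li class="drink cold">Milk</li>
      <li class="food">Chocolate</li>
      <li class="food">Marshmallow</li>
    </ul>
</div>
</body>
</html>'''

root = ET.fromstring(page)

# The path simply yields nothing for <li> elements without an <a> child
links = [a.get('href') for a in root.findall('./body/div/ul/li/a')]

# Walking the <li> elements instead makes the missing links explicit
links_by_item = {}
for li in root.findall('./body/div/ul/li'):
    a = li.find('a')
    links_by_item[li.text.strip()] = a.get('href') if a is not None else None

print(links)
print(links_by_item['Coffee'])
```

Six list items go in, but only two links come out of the first query, and no error is raised for the four items that have no link.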


Finally, let's apply what we learned to a real-world example: a prominent Turkish writer by the name of 'Yılmaz Özdil' publishes an article every day in the newspaper 'Sözcü', talking about the current affairs of Turkey. The newspaper lists his articles under the URL:

In [ ]:
url = 'http://www.sozcu.com.tr/kategori/yazarlar/yilmaz-ozdil/'

On this page, you can see a list of article titles (that link to the articles themselves), along with the date each one was published. The XPath of the list elements is:

In [ ]:
xpath = '/html/body/div[5]/div[6]/div[3]/div[1]/div[2]/div[1]/div[1]/div[2]/ul/li/a'

Let's go ahead and set up a pipe to fetch these list entries:

In [ ]:
xpath_conf = {'xpath': xpath, 'url': url}           # The XPath configuration dictionary for Riko
pipe = SyncPipe('xpathfetchpage', conf=xpath_conf)  # A pipe that streams what's pointed by the 
                                                    # XPath inside the web page
stream = pipe.output                                # The stream being output from the pipe
    
for item in itertools.islice(stream, 3):            # itertools.islice will allow us to get only
    print item                                      # the first n elements, which is 3 in this case
    print 

It seems like we retrieved the first 3 articles (the exact articles you retrieve will be different when run on a different day):

{u'href': u'http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/ilelebet-payidar-2-1477851/', u'{http://www.w3.org/1999/xhtml}p': u'\u0130lelebet payidar', u'{http://www.w3.org/1999/xhtml}span': {u'content': u'30 Ekim 2016', u'class': u'date'}, u'title': u'\u0130lelebet payidar'}

{u'href': u'http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/cumhuriyet-mucizedir-1475895/', u'{http://www.w3.org/1999/xhtml}p': u'Cumhuriyet, mucizedir', u'{http://www.w3.org/1999/xhtml}span': {u'content': u'29 Ekim 2016', u'class': u'date'}, u'title': u'Cumhuriyet, mucizedir'}

{u'href': u'http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/yarin-bayram-1473877/', u'{http://www.w3.org/1999/xhtml}p': u'Yar\u0131n bayram...', u'{http://www.w3.org/1999/xhtml}span': {u'content': u'28 Ekim 2016', u'class': u'date'}, u'title': u'Yar\u0131n bayram...'}

Notice that each element has a title, a URL and a date. Let's say that we want to parse all of this and return it as a list of tuples, where each entry is of the form (title, date, url). We can do this the old-fashioned way, where we iterate through each of those dictionaries and get the data we want. Instead, let's do something a bit different: let's set up two pipes for two different XPaths, and iterate through them synchronously:

In [ ]:
# Top-level <a> elements stream
xpath_conf_top = {'xpath': xpath, 'url': url}                   # The XPath config. for the top-level <a> elements
pipe_top = SyncPipe('xpathfetchpage', conf=xpath_conf_top)      # A pipe that streams the top-level <a> elements 
stream_top = pipe_top.output                                    # The stream being output from the pipe
  
# The child <span> element stream
xpath_date = xpath + '/span'                                    # XPath of the <span> children
xpath_conf_date = {'xpath': xpath_date, 'url': url}             # The XPath config. for the child <span> elements
pipe_date = SyncPipe('xpathfetchpage', conf=xpath_conf_date)    # A pipe that streams the child <span> elements
stream_date = pipe_date.output                                  # The stream being output from the pipe

sync_iterator = itertools.izip(stream_top, stream_date)         # Lazily pair up the items of the two streams
for top_item, date_item in itertools.islice(sync_iterator, 3):  # itertools.islice will allow us to get only
    article = (top_item['title'],                               # the first n elements, which is 3 in this case
               date_item['content'], 
               top_item['href'])
    print article
    print

This is awesome!

(u'\u0130lelebet payidar', u'30 Ekim 2016', u'http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/ilelebet-payidar-2-1477851/')

(u'Cumhuriyet, mucizedir', u'29 Ekim 2016', u'http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/cumhuriyet-mucizedir-1475895/')

(u'Yar\u0131n bayram...', u'28 Ekim 2016', u'http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/yarin-bayram-1473877/')

You can go to the website and see the list elements for yourself. For a website that is mostly auto-generated (disastrously, might I say), this was relatively easy to achieve!
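For comparison, the 'old-fashioned' single-stream approach mentioned earlier might look like the sketch below. It works on plain dictionaries shaped like the items printed above (two of them are copied here as sample data), so it runs without network access. The names to_article and XHTML_NS are made up for illustration:

```python
# The prefix that shows up in front of the nested tag names in the items
XHTML_NS = '{http://www.w3.org/1999/xhtml}'


def to_article(item):
    """Turn one <a> item into a (title, date, url) tuple."""
    date = item[XHTML_NS + 'span']['content']
    return (item['title'], date, item['href'])


# Sample data copied from the output of the earlier cell
items = [
    {
        'href': 'http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/cumhuriyet-mucizedir-1475895/',
        XHTML_NS + 'p': 'Cumhuriyet, mucizedir',
        XHTML_NS + 'span': {'content': '29 Ekim 2016', 'class': 'date'},
        'title': 'Cumhuriyet, mucizedir',
    },
    {
        'href': 'http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/yarin-bayram-1473877/',
        XHTML_NS + 'p': 'Yar\u0131n bayram...',
        XHTML_NS + 'span': {'content': '28 Ekim 2016', 'class': 'date'},
        'title': 'Yar\u0131n bayram...',
    },
]

articles = [to_article(item) for item in items]
for article in articles:
    print(article)
```

One thing this version has going for it is that the title, date and URL all come from the same item, so nothing depends on two separate streams staying aligned with each other.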

Let's look at one last example: let's fetch this list, dynamically follow the links to the articles it points to, and get each article in full:

In [ ]:
# Article list elements stream
xpath_conf_list = {'xpath': xpath, 'url': url}                 # The XPath configuration for the article list
pipe_list = SyncPipe('xpathfetchpage', conf=xpath_conf_list)   # A pipe that streams the article list elements
stream_list = pipe_list.output                                 # The stream being output from the pipe

# The article stream
xpath_article = '/html/body/div[5]/div[6]/div[3]/div/div[2]/div[1]/div/div[2]/div[2]'  # XPath of article body

xpath_conf_article = {                                         # The XPath configuration for the articles
    'url': {'subkey': 'href'},                                 # Notice how we can refer to a 'subkey' as the
    'xpath': xpath_article                                     # URL of this configuration
}
pipe_article = pipe_list.xpathfetchpage(                       # A pipe that streams the articles linked to
    conf=xpath_conf_article                                    # by the list stream
)                                                              # Notice how we create this pipe by chaining a
                                                               # pipe on top of the list pipe; how this one is
                                                               # 'dependent' on the list pipe
stream_article = pipe_article.output                           # The stream being output from the article pipe

sync_iterator = itertools.izip(stream_list, stream_article)    # Lazily pair up the items of the two streams
for list_item, article in itertools.islice(sync_iterator, 3):  # itertools.islice will allow us to get only
                                                               # the first n elements, which is 3 in this case
    p_elements = article['{http://www.w3.org/1999/xhtml}p']    # Get the list of <p> elements under this XPath
    article_body = [paragraph                                  # Grab only the strings under the <p> elements
                    for paragraph in p_elements
                    if isinstance(paragraph, basestring)]
    article_body = '\n'.join(article_body)                     # Join strings to create the whole article
    article = (list_item['title'], article_body)               # Create the article's (title, body) tuple
    print article
    print

We now have a script to fetch the articles and read them easily, without needing to go to the website:

(u'\u0130lelebet payidar', u"17 Kas\u0131m 1938.\n*\nMaalesef, izdihamdan dalga ...")

(u'Cumhuriyet, mucizedir', u"*\nYanm\u0131\u015f bina say\u0131s\u0131 115 bin, ...")

(u'Yar\u0131n bayram...', u"An\u0131tkabir'e gitti\u011finde seni en \xe7ok etk ...")

The article body looks nicely formatted when you print it. Can this be the most effective ad blocker?
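The paragraph-joining step in the cell above can be tried on its own, without fetching anything. The sketch below assumes, as that cell does, that the list under the <p> key mixes plain strings with dictionaries for more complex elements. The sample data is placeholder text I made up, and the code is written for Python 3, where str plays the role of the str/unicode pair in the Python 2 cell:

```python
# Hypothetical sample: plain paragraphs mixed with a non-text element
p_elements = [
    'First paragraph.',
    {'content': 'Some nested element', 'class': 'caption'},
    '*',
    'Second paragraph.',
]

# Keep only the plain strings and join them into one body of text
article_body = '\n'.join(paragraph
                         for paragraph in p_elements
                         if isinstance(paragraph, str))

print(article_body)
```

Anything that isn't a plain string is simply dropped, which is the same filtering that leaves only the article text in the output above.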


The power of Riko and its pipes may not be immediately visible from parsing a single website, but as you explore different options, you can appreciate the power it gives you, the developer, over the mess that is HTML and the World Wide Web.