XPath¶

XPath is short for XML Path Language which is a query language for selecting nodes in an XML document. This is very useful in webscraping because all HTML documents are a form of XML documents.

In [1]:

import requests
from lxml import html

In [2]:

%%HTML
<html>
  <body>
    <h1>Favorite Python Librarires</h1>
    <ul>
      <li>Numpy</li>
      <li>Pandas</li>
      <li>requests</li>
    </ul>
  </body>
</html>

Favorite Python Librarires

Numpy
Pandas
requests

Load HTML Code¶

Now I'll read the code from cell number 2 and store it in html_code. Finally we will parse that into a lxml node object.

In [3]:

html_code = In[2]
html_code = html_code[42:-2].replace("\\n","\n")
print(html_code)

doc = html.fromstring(html_code)

<html>
  <body>
    <h1>Favorite Python Librarires</h1>
    <ul>
      <li>Numpy</li>
      <li>Pandas</li>
      <li>requests</li>
    </ul>
</html>

Using xpath to find nodes in a document¶

There many methods for fidning a node that you are interested in from a XML or HTML document. The first way is to write the whole path separated by forward slashes /

Reading `<h1>` tag¶

In [4]:

title = doc.xpath("/html/body/h1")[0]
title

Out[4]:

<Element h1 at 0x7f447cafa458>

To read the text inside that tag you can use the text variable.

In [5]:

title.text

Out[5]:

'Favorite Python Librarires'

Another way is read the text is to use the text() function in xpath.

In [6]:

title = doc.xpath("/html/body/h1/text()")[0]
title

Out[6]:

'Favorite Python Librarires'

Working with multiple items¶

xpath always returns a list. If there are no matches, it will return an empty list. If there is one match it will return a list with one item.

In [7]:

item_list = doc.xpath("/html/body/ul/li")
item_list

Out[7]:

[<Element li at 0x7f447cafa9a8>,
 <Element li at 0x7f447cafaae8>,
 <Element li at 0x7f447cafab38>]

We can use text() function with multiple items.

In [8]:

doc = html.fromstring(html_code)
item_list = doc.xpath("/html/body/ul/li/text()")
item_list

Out[8]:

['Numpy', 'Pandas', 'requests']

Tag selector without full path¶

you can select any node in your document that matches a node selector without using the full path with a double forward slash //

In [9]:

doc = html.fromstring(html_code)
item_list = doc.xpath("//li/text()")
item_list

Out[9]:

['Numpy', 'Pandas', 'requests']

Selecting one result¶

You can select one result from a list using [index] after your tag selector. Make sure you use it on the tag selector and not a function selector.

Notice: This is index starts from 1.

In [10]:

doc = html.fromstring(html_code)
item_list = doc.xpath("/html/body/ul/li[1]/text()")
item_list

Out[10]:

['Numpy']

In [11]:

%%HTML
<html>
  <body>
    <h1 class="text-muted">Favorite Python Librarires</h1>
    <ul class="nav nav-pills nav-stacked">
      <li role="presentation"><a href="http://www.numpy.org/">Numpy</a></li>
      <li role="presentation"><a href="http://pandas.pydata.org/">Pandas</a></li>
      <li role="presentation"><a href="http://python-requests.org/">requests</a></li>
    </ul>
    <h1 class="text-success">Favorite JS Librarires</h1>
    <ul class="nav nav-tabs">
      <li role="presentation"><a href="http://getbootstrap.com/">Bootstrap</a></li>
      <li role="presentation"><a href="https://jquery.com/">jQuery</a></li>
      <li role="presentation"><a href="http://d3js.org/">d3.js</a></li>
    </ul>
</html>

Favorite Python Librarires

Favorite JS Librarires

In [12]:

html_code = In[11]
html_code = html_code[42:-2].replace("\\n","\n")
print(html_code)

doc = html.fromstring(html_code)

<html>
  <body>
    <h1 class="text-muted">Favorite Python Librarires</h1>
    <ul class="nav nav-pills nav-stacked">
      <li role="presentation"><a href="http://www.numpy.org/">Numpy</a></li>
      <li role="presentation"><a href="http://pandas.pydata.org/">Pandas</a></li>
      <li role="presentation"><a href="http://python-requests.org/">requests</a></li>
    </ul>
    <h1 class="text-success">Favorite JS Librarires</h1>
    <ul class="nav nav-tabs">
      <li role="presentation"><a href="http://getbootstrap.com/">Bootstrap</a></li>
      <li role="presentation"><a href="https://jquery.com/">jQuery</a></li>
      <li role="presentation"><a href="http://d3js.org/">d3.js</a></li>
    </ul>
</html>

Attributes selector¶

In this example we have two <h1> tags with different css classes. We can select tags based on css classes as follows:

In [13]:

title = doc.xpath("/html/body/h1[@class='text-muted']/text()")[0]
title

Out[13]:

'Favorite Python Librarires'

`contains()` function¶

I want to select all items in the first list. I could use the full class for selection or I could just use one of the classed only used in the first list with the contains() function.

In [14]:

item_list = doc.xpath("/html/body/ul[contains(@class,'nav-stacked')]/li/a/text()")
item_list

Out[14]:

['Numpy', 'Pandas', 'requests']

Returning attributes¶

What if we want to read the href attribute of the <a> tag to get the link. This is how you do that:

In [15]:

item_list = doc.xpath("/html/body/ul[contains(@class,'nav-stacked')]/li/a/@href")
item_list

Out[15]:

['http://www.numpy.org/',
 'http://pandas.pydata.org/',
 'http://python-requests.org/']

Real world example¶

Read the list of languages with 1M+ articles on http://www.wikipedia.org/

In [16]:

response = requests.get("http://www.wikipedia.org")
doc = html.fromstring(response.content, parser=html.HTMLParser(encoding="utf-8"))

In [17]:

lang_list = doc.xpath("//div[@class='langlist langlist-large hlist'][1]/ul/li/a/text()")
lang_list

Out[17]:

['Deutsch',
 'English',
 'EspaÃ±ol',
 'FranÃ§ais',
 'Italiano',
 'Nederlands',
 'Polski',
 'Ð ÑƒÑÑÐºÐ¸Ð¹',
 'Sinugboanong Binisaya',
 'Svenska',
 'Tiáº¿ng Viá»‡t',
 'Winaray']

In [ ]:

XPath¶

Favorite Python Librarires

Load HTML Code¶

Using xpath to find nodes in a document¶

Reading <h1> tag¶

Working with multiple items¶

Tag selector without full path¶

Selecting one result¶

Favorite Python Librarires

Favorite JS Librarires

Attributes selector¶

contains() function¶

Returning attributes¶

Real world example¶

Reading `<h1>` tag¶

`contains()` function¶