XPath is short for XML Path Language, a query language for selecting nodes in an XML document. It is very useful in web scraping because an HTML document can be parsed into the same kind of node tree, so XPath expressions work on HTML as well.
import requests
from lxml import html
%%HTML
<html>
<body>
<h1>Favorite Python Librarires</h1>
<ul>
<li>Numpy</li>
<li>Pandas</li>
<li>requests</li>
</ul>
</body>
</html>
Now I'll read the source of cell number 2 and store it in html_code. Finally, we will parse that into an lxml element object.
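# In holds the notebook's input history; In[2] is the source of the %%HTML cell above.
html_code = In[2]
# Drop the cell-magic wrapper around the markup and unescape the newlines.
# The offsets 42 and -2 depend on how IPython stores the cell, so adjust them if the printed result looks wrong.
html_code = html_code[42:-2].replace("\\n","\n")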
print(html_code)
doc = html.fromstring(html_code)
<html> <body> <h1>Favorite Python Librarires</h1> <ul> <li>Numpy</li> <li>Pandas</li> <li>requests</li> </ul> </html>
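Slicing the input history depends on the cell number and on how the notebook stores cell magics. A minimal alternative sketch, assuming you are free to keep the markup in an ordinary Python string instead of an %%HTML cell (the variable names are the same as above, the markup is a shortened copy):

```python
from lxml import html

# The markup lives in a plain string, so no cell number or slicing offsets are needed.
html_code = """<html>
<body>
<h1>Favorite Python Libraries</h1>
<ul>
<li>Numpy</li>
<li>Pandas</li>
<li>requests</li>
</ul>
</body>
</html>"""

# Parse the string into an lxml element tree rooted at <html>.
doc = html.fromstring(html_code)
print(doc.tag)

# In a notebook, IPython.display.HTML(html_code) would render the same markup.
```

The rest of the examples work the same way with a doc built like this.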
title = doc.xpath("/html/body/h1")[0]
title
<Element h1 at 0x7f447cafa458>
To read the text inside that tag, you can use the element's text attribute.
title.text
'Favorite Python Librarires'
Another way to read the text is to use the text() function in XPath.
title = doc.xpath("/html/body/h1/text()")[0]
title
'Favorite Python Librarires'
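The two approaches give the same result here because the heading contains only text. They differ once the tag has child elements. The following sketch uses a hypothetical heading with an <em> child (not part of the notebook's sample) to show the difference, along with text_content() from lxml.html, which collects all nested text:

```python
from lxml import html

# Hypothetical markup: the heading now contains a nested <em> element.
sample = "<html><body><h1>Favorite <em>Python</em> Libraries</h1></body></html>"
doc = html.fromstring(sample)
h1 = doc.xpath("/html/body/h1")[0]

# .text holds only the text before the first child element.
before_child = h1.text

# text() selects every text node that is a direct child of <h1>.
direct_text = doc.xpath("/html/body/h1/text()")

# text_content() joins the text of the element and all of its descendants.
full_text = h1.text_content()

print(repr(before_child))
print(direct_text)
print(full_text)
```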
For path expressions like these, xpath() returns a list. If there are no matches, it returns an empty list; if there is one match, it returns a list with one item.
item_list = doc.xpath("/html/body/ul/li")
item_list
[<Element li at 0x7f447cafa9a8>, <Element li at 0x7f447cafaae8>, <Element li at 0x7f447cafab38>]
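Because an unmatched path gives an empty list, indexing the result with [0] raises an IndexError when nothing is found. A small sketch of one way to guard against that; first_or_default is a hypothetical helper, not part of lxml:

```python
from lxml import html

# Hypothetical sample markup with a heading but no <h2>.
sample = "<html><body><h1>Title</h1><ul><li>Numpy</li></ul></body></html>"
doc = html.fromstring(sample)


def first_or_default(node, expression, default=None):
    """Return the first XPath match, or `default` when nothing matches."""
    matches = node.xpath(expression)
    return matches[0] if matches else default


heading = first_or_default(doc, "/html/body/h1/text()")
missing = first_or_default(doc, "/html/body/h2/text()", default="(no h2)")

print(heading)
print(missing)
```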
We can use the text() function with multiple items.
doc = html.fromstring(html_code)
item_list = doc.xpath("/html/body/ul/li/text()")
item_list
['Numpy', 'Pandas', 'requests']
You can select any node in your document that matches a node selector, without writing the full path, by using a double forward slash (//).
doc = html.fromstring(html_code)
item_list = doc.xpath("//li/text()")
item_list
['Numpy', 'Pandas', 'requests']
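One detail worth knowing about //: it always searches from the document root, even when xpath() is called on a sub-element. To search only below the current element, start the expression with a dot. A short sketch with hypothetical two-list markup:

```python
from lxml import html

# Hypothetical markup with two separate lists.
sample = """
<html><body>
<ul><li>Numpy</li><li>Pandas</li></ul>
<ul><li>jQuery</li></ul>
</body></html>
"""
doc = html.fromstring(sample)

# Pick the second <ul> element using ordinary Python indexing.
second_list = doc.xpath("//ul")[1]

# "//li" searches the whole document, regardless of the element it is called on.
everywhere = second_list.xpath("//li/text()")

# ".//li" searches only the descendants of second_list.
below_only = second_list.xpath(".//li/text()")

print(everywhere)
print(below_only)
```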
You can select one result from a list using [index] after your tag selector. Make sure you use it on the tag selector and not on a function selector such as text().
Note: this index starts from 1, not 0.
doc = html.fromstring(html_code)
item_list = doc.xpath("/html/body/ul/li[1]/text()")
item_list
['Numpy']
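The position in [index] is counted among siblings under the same parent, which matters when the path matches items in more than one list. The sketch below, on hypothetical two-list markup, contrasts li[1], a parenthesised (…)[1], the XPath last() function, and plain Python indexing of the result:

```python
from lxml import html

# Hypothetical markup with two lists.
sample = """
<html><body>
<ul><li>Numpy</li><li>Pandas</li><li>requests</li></ul>
<ul><li>Bootstrap</li><li>jQuery</li></ul>
</body></html>
"""
doc = html.fromstring(sample)

# li[1] is the first <li> inside each <ul>, so two items match.
first_of_each = doc.xpath("//ul/li[1]/text()")

# Parentheses apply the index to the combined result instead.
first_overall = doc.xpath("(//ul/li)[1]/text()")

# last() selects the final <li> of each list.
last_of_each = doc.xpath("//ul/li[last()]/text()")

# Python indexing on the returned list starts from 0.
python_first = doc.xpath("//ul/li/text()")[0]

print(first_of_each, first_overall, last_of_each, python_first)
```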
%%HTML
<html>
<body>
<h1 class="text-muted">Favorite Python Librarires</h1>
<ul class="nav nav-pills nav-stacked">
<li role="presentation"><a href="http://www.numpy.org/">Numpy</a></li>
<li role="presentation"><a href="http://pandas.pydata.org/">Pandas</a></li>
<li role="presentation"><a href="http://python-requests.org/">requests</a></li>
</ul>
<h1 class="text-success">Favorite JS Librarires</h1>
<ul class="nav nav-tabs">
<li role="presentation"><a href="http://getbootstrap.com/">Bootstrap</a></li>
<li role="presentation"><a href="https://jquery.com/">jQuery</a></li>
<li role="presentation"><a href="http://d3js.org/">d3.js</a></li>
</ul>
</html>
html_code = In[11]
html_code = html_code[42:-2].replace("\\n","\n")
print(html_code)
doc = html.fromstring(html_code)
<html> <body> <h1 class="text-muted">Favorite Python Librarires</h1> <ul class="nav nav-pills nav-stacked"> <li role="presentation"><a href="http://www.numpy.org/">Numpy</a></li> <li role="presentation"><a href="http://pandas.pydata.org/">Pandas</a></li> <li role="presentation"><a href="http://python-requests.org/">requests</a></li> </ul> <h1 class="text-success">Favorite JS Librarires</h1> <ul class="nav nav-tabs"> <li role="presentation"><a href="http://getbootstrap.com/">Bootstrap</a></li> <li role="presentation"><a href="https://jquery.com/">jQuery</a></li> <li role="presentation"><a href="http://d3js.org/">d3.js</a></li> </ul> </html>
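In this example we have two <h1> tags with different CSS classes. We can select tags based on their CSS class as follows. Note that this predicate compares the whole class attribute, so it matches only when the attribute is exactly 'text-muted'.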
title = doc.xpath("/html/body/h1[@class='text-muted']/text()")[0]
title
'Favorite Python Librarires'
contains() function
I want to select all items in the first list. I could use the full class attribute for selection, or I could use just one of the classes that appears only in the first list, with the contains() function.
item_list = doc.xpath("/html/body/ul[contains(@class,'nav-stacked')]/li/a/text()")
item_list
['Numpy', 'Pandas', 'requests']
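Keep in mind that contains() is a plain substring test, so a short class name can match more than intended. A sketch of the pitfall and of a common workaround that matches a whole class token; the "subnav" list is a hypothetical addition for illustration:

```python
from lxml import html

# Hypothetical markup: the third list's class merely contains the letters "nav".
sample = """
<html><body>
<ul class="nav nav-pills nav-stacked"><li>Numpy</li></ul>
<ul class="nav nav-tabs"><li>jQuery</li></ul>
<ul class="subnav"><li>Other</li></ul>
</body></html>
"""
doc = html.fromstring(sample)

# Substring match: "subnav" also contains "nav", so all three lists match.
loose = doc.xpath("//ul[contains(@class, 'nav')]/li/text()")

# Token match: pad the class attribute and the wanted name with spaces,
# so only a whole, space-separated class name can match.
token_query = (
    "//ul[contains(concat(' ', normalize-space(@class), ' '), ' nav ')]"
    "/li/text()"
)
whole_token = doc.xpath(token_query)

print(loose)
print(whole_token)
```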
What if we want to read the href attribute of the <a> tag to get the link? This is how you do that:
item_list = doc.xpath("/html/body/ul[contains(@class,'nav-stacked')]/li/a/@href")
item_list
['http://www.numpy.org/', 'http://pandas.pydata.org/', 'http://python-requests.org/']
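When you need the link text and the link target together, one option is to select the <a> elements themselves and read both values from each element. A minimal sketch on a shortened copy of the markup, using the element's text attribute and its get() method:

```python
from lxml import html

# Shortened copy of the first list from the sample markup.
sample = """
<html><body><ul class="nav nav-pills nav-stacked">
<li><a href="http://www.numpy.org/">Numpy</a></li>
<li><a href="http://pandas.pydata.org/">Pandas</a></li>
</ul></body></html>
"""
doc = html.fromstring(sample)

# Map each link's text to its href so the two values stay paired.
links = {}
for anchor in doc.xpath("//ul[contains(@class,'nav-stacked')]/li/a"):
    links[anchor.text] = anchor.get("href")

print(links)
```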
Read the list of languages with 1M+ articles on http://www.wikipedia.org/
response = requests.get("http://www.wikipedia.org")
doc = html.fromstring(response.content, parser=html.HTMLParser(encoding="utf-8"))
lang_list = doc.xpath("//div[@class='langlist langlist-large hlist'][1]/ul/li/a/text()")
lang_list
['Deutsch', 'English', 'Español', 'Français', 'Italiano', 'Nederlands', 'Polski', 'Русский', 'Sinugboanong Binisaya', 'Svenska', 'Tiếng Việt', 'Winaray']
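The encoding argument passed to HTMLParser above matters because response.content is raw bytes and several language names are not ASCII. An offline sketch of the same parsing step, using hypothetical hand-written bytes in place of the live page so that it runs without a network connection:

```python
from lxml import html

# Hypothetical stand-in for response.content: UTF-8 encoded bytes.
raw = (
    "<html><body><ul>"
    "<li><a>Русский</a></li>"
    "<li><a>Tiếng Việt</a></li>"
    "</ul></body></html>"
).encode("utf-8")

# Tell the parser explicitly how the bytes are encoded.
parser = html.HTMLParser(encoding="utf-8")
doc = html.fromstring(raw, parser=parser)

names = doc.xpath("//li/a/text()")
print(names)
```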