Author: Maxim Keremet, @maximkeremet

Tutorial

Webscraping an online retailer assortment.

When it comes to retrieving data from a website, you will hear several terms: parsing, (web)scraping and crawling. Let's first check whether these notions actually differ and why everybody uses different terms.

After some brief googling, one can conclude that:
Parsing is extracting information from basically any data source (logs, tables or files).
(Web)scraping is extracting data from a web page.
Crawling is the process of moving around a website, from page to page.

So, when somebody talks about retrieving data from a web page, or about recursive data extraction from a whole website, he/she will probably use one of the words listed above. In reality, these notions are independent but interrelated and sequential (first you scrape one page, crawl to another, scrape that page, etc.).
Now that the terms are defined and understood, it's time to dive into the details.

1. A powerful webscraping library - Scrapy.

Formally, Scrapy is more than a library: it is a framework, a powerful tool for extracting data from websites and automating this process with only a little code written by the programmer.
The strong point of Scrapy is that it ships with a bunch of template spiders (programs that go around the targeted locations and search for the needed data) that can be adjusted in the blink of an eye to user-specific needs with a few lines of Python and bash.

Well, I was strolling around the internet before Black Friday looking for a tempting offer, and I thought it would be interesting to collect the phone assortment at Svyaznoy, one of the biggest online retailers in Russia.

So the ultimate goal of this tutorial is to get each phone's name, its price, its discount (if there is one) and its photo, and to store them neatly.
So I went to the Svyaznoy website and looked at the phone assortment.

I saw that there were 109 pages and around 2.5K phone listings, which would be tough and tiring even to look through.

P.S. We also note the link to the 1st page (which looks weird), because we will need it later on.

1.1 Creating a crawler.

First, install the library if you don't have it.

conda install -c conda-forge scrapy or pip install scrapy

In [1]:
pwd
Out[1]:
'/Users/maximkeremet/courses/mlcourse.ai/jupyter_english/tutorials'

Scrapy works in terms of projects. So you create a default project with a bunch of scripts that Scrapy runs to get the data from defined locations.
Let's create a default project.

In [2]:
!scrapy startproject svyaznoy
New Scrapy project 'svyaznoy', using template directory '/Users/maximkeremet/anaconda3/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /Users/maximkeremet/courses/mlcourse.ai/jupyter_english/tutorials/svyaznoy

You can start your first spider with:
    cd svyaznoy
    scrapy genspider example example.com
In [3]:
ls
basic_semi-supervised_learning_models_altprof.ipynb
learn_regex_the_easy_way_aditya_soni.ipynb
module_8. webscraping_ecommerce_website_with_scrapy.ipynb
plotly_tutorial_for_interactive_plots_sankovalev.ipynb
svyaznoy/
tutorial_template.ipynb

Look inside the project folder that we have just created.

In [4]:
cd svyaznoy
/Users/maximkeremet/courses/mlcourse.ai/jupyter_english/tutorials/svyaznoy

So there is some config file and a folder with Scrapy-specific scripts.

In [5]:
ls
scrapy.cfg  svyaznoy/

First we have to create a spider, the key program that defines which locations to crawl and which data to collect.
For simplicity, we shall call it svz (it will be stored in svz.py).
We will also pass the address of the web page, so that Scrapy can pre-fill the spider template with it.
P.S. You cannot give the spider the same name as the project (I guess it would just confuse everybody).

In [6]:
!scrapy genspider svz https://www.svyaznoy.ru/catalog/phone/224
Created spider 'svz' using template 'basic' in module:
  svyaznoy.spiders.svz

We can see that our spider has been created in the module svyaznoy.spiders.svz.
Let's go into the svyaznoy folder to see what is inside.

In [7]:
cd svyaznoy/
/Users/maximkeremet/courses/mlcourse.ai/jupyter_english/tutorials/svyaznoy/svyaznoy
In [8]:
ls
__init__.py     items.py        pipelines.py    spiders/
__pycache__/    middlewares.py  settings.py

Look inside the spiders folder and see our script.

In [9]:
cd spiders/
/Users/maximkeremet/courses/mlcourse.ai/jupyter_english/tutorials/svyaznoy/svyaznoy/spiders
In [10]:
ls
__init__.py  __pycache__/ svz.py

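Look inside the spider script. Note that, because we passed a full URL rather than a bare domain, the generated allowed_domains and start_urls values are malformed; we will replace them with our own later on.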

In [11]:
cat svz.py
# -*- coding: utf-8 -*-
import scrapy


class SvzSpider(scrapy.Spider):
    name = 'svz'
    allowed_domains = ['https://www.svyaznoy.ru/catalog/phone/224']
    start_urls = ['http://https://www.svyaznoy.ru/catalog/phone/224/']

    def parse(self, response):
        pass
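
Scrapy expects allowed_domains to hold bare domain names rather than full URLs, which is why the generated values above look odd. Here is a minimal sketch of deriving the domain from the catalog link with the standard library; the helper name domain_of is made up for this illustration and is not part of Scrapy.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the host part of a URL (hypothetical helper, not a Scrapy function)."""
    return urlparse(url).netloc


catalog_url = "https://www.svyaznoy.ru/catalog/phone/224"
allowed_domains = [domain_of(catalog_url)]  # bare host name, suitable for allowed_domains
start_urls = [catalog_url]                  # full URLs belong in start_urls
print(allowed_domains)
```

Passing only the domain to scrapy genspider in the first place avoids the problem altogether.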

1.2. Understanding what we want to get from the website. Dealing with developer tools.

We all know that everything in Python can be regarded as an object. The same applies to a website.
Every phone is embedded in some kind of card that holds all its characteristics, like name, price, discounts, rating and others, and each of them can be regarded as a separate object. A card can be regarded as an object as well, so a set of goods is a set of cards.

Here is what I mean by card of the good:

Everybody knows about the developer tools, so we can right-click an element and inspect it, to understand where the needed objects are located in terms of HTML/XML markup.

Here we can see the phone price block:
The phone name block:
The photo link block:

Now we have to understand what HTML and XML are, how they differ and how to efficiently retrieve information from website markup.

1.3. Brief understanding of XML and Xpath.

The key difference between the two, and the only one that we will need, is that XML is used for storing and transporting data, while HTML is used for formatting and displaying that data.

What is XML?

  • Although XML looks a lot like HTML, it has an absolutely different purpose and guts. XML is short for eXtensible Markup Language, which actually explains itself.

  • Surprisingly, XML doesn't really do anything: it just structures, stores and transports data.

  • One of the reasons why it is called eXtensible is that you can invent your own tags, which helps you navigate the data the way you like, while HTML has predefined tags and all HTML documents are based on standardised tags, like <body>, <p>, <li> etc.

  • This lets the developer structure the data in the way that fits the nature of the document. However, XML is not a replacement for HTML: the two complement each other.

So, in many web solutions they work in synergy: XML transports the data, and HTML formats and displays it nicely. All this has made XML a vital tool for the internet, and it is used wherever data has to be transported between all kinds of applications.

This is how XML code looks:

In [ ]:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
   <book category="COOKING">
      <title lang="en">Everyday Italian</title>
      <author>Giada De Laurentiis</author>
      <year>2005</year>
      <price>30.00</price>
   </book>
   <book category="CHILDREN">
      <title lang="en">Harry Potter</title>
      <author>J K. Rowling</author>
      <year>2005</year>
      <price>29.99</price>
   </book>
   <book category="WEB">
      <title lang="en">XQuery Kick Start</title>
      <author>James McGovern</author>
      <author>Per Bothner</author>
      <author>Kurt Cagle</author>
      <author>James Linn</author>
      <author>Vaidyanathan Nagarajan</author>
      <year>2003</year>
      <price>49.99</price>
   </book>
   <book category="WEB">
      <title lang="en">Learning XML</title>
      <author>Erik T. Ray</author>
      <year>2003</year>
      <price>39.95</price>
   </book>
</bookstore>

Or it can also be represented in a tree-form, which can be easier to grasp.
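
To make the tree form tangible, here is a small sketch that parses a shortened copy of the bookstore document with Python's built-in xml.etree.ElementTree and prints one indented line per element. The helper tree_lines is my own illustration, not something Scrapy provides.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the bookstore document from the cell above.
XML = """<bookstore>
   <book category="COOKING">
      <title lang="en">Everyday Italian</title>
      <price>30.00</price>
   </book>
   <book category="WEB">
      <title lang="en">Learning XML</title>
      <price>39.95</price>
   </book>
</bookstore>"""


def tree_lines(element, depth=0):
    """Return one indented line per element: tag, attributes and own text."""
    text = (element.text or "").strip()
    line = "  " * depth + element.tag
    if element.attrib:
        line += " " + str(element.attrib)
    if text:
        line += ": " + text
    lines = [line]
    for child in element:  # iterating an element yields its direct children
        lines.extend(tree_lines(child, depth + 1))
    return lines


lines = tree_lines(ET.fromstring(XML))
print("\n".join(lines))
```

Every level of indentation in the printout corresponds to one step down the tree, which is exactly what the path expressions in the next part walk along.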

What is XPath?

  • XPath is a special language to identify parts of XML documents, search and select information.

  • It uses path expressions to navigate, that look a lot like queries.

  • It also has a list of functions (logical and numerical) to test the data.

And this is what the queries look like. With their help we can drill down through the XML notation to the data elements, using hierarchical selectors:

XPath expression                    Result
/bookstore/book[1]                  Selects the first book element that is the child of the bookstore element
/bookstore/book[last()]             Selects the last book element that is the child of the bookstore element
/bookstore/book[last()-1]           Selects the last but one book element that is the child of the bookstore element
/bookstore/book[position()<3]       Selects the first two book elements that are children of the bookstore element
//title[@lang]                      Selects all the title elements that have an attribute named lang
//title[@lang='en']                 Selects all the title elements that have a "lang" attribute with a value of "en"
/bookstore/book[price>35.00]        Selects all the book elements of the bookstore element that have a price element with a value greater than 35.00
/bookstore/book[price>35.00]/title  Selects all the title elements of the book elements of the bookstore element that have a price element with a value greater than 35.00
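
Some of these expressions can be tried without Scrapy at all. The sketch below uses Python's built-in xml.etree.ElementTree, which implements only a subset of XPath: paths are written relative to the root element (./book instead of /bookstore/book), and numeric predicates such as price>35.00 are not supported, so that one is emulated with a plain Python condition. The XML is a shortened copy of the bookstore example.

```python
import xml.etree.ElementTree as ET

XML = """<bookstore>
  <book category="COOKING"><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book category="CHILDREN"><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book category="WEB"><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book category="WEB"><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""

root = ET.fromstring(XML)  # root is the <bookstore> element itself

# Positional predicates: first, last and last-but-one book.
first_title = root.find("./book[1]/title").text
last_title = root.find("./book[last()]/title").text
second_to_last = root.find("./book[last()-1]/title").text

# Attribute predicates: any lang attribute, and lang equal to "en".
titles_with_lang = [t.text for t in root.findall(".//title[@lang]")]
english_titles = [t.text for t in root.findall(".//title[@lang='en']")]

# price>35.00 is outside ElementTree's XPath subset, so filter in Python instead.
expensive = [
    book.find("title").text
    for book in root.findall("./book")
    if float(book.find("price").text) > 35.00
]

print(first_title, last_title, second_to_last, expensive, sep="\n")
```

Scrapy's own selectors support full XPath 1.0, so there the expressions from the table can be used as they are.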

For now, it is sufficient to know that a web page's markup has the same kind of tag tree as XML, so we can efficiently retrieve its textual data by querying it with XPath.
XPath, in its turn, is a query language that is based on the hierarchical tag structure: the tags form a tree that can be easily decomposed, selected from and manipulated for our purposes.

2. Getting to know the website through the terminal.

Scrapy has an interactive shell where you can debug your scraping code very quickly and try out selecting data without running a spider every time.
Try it yourself!

In [ ]:
scrapy shell # this will start the shell
In [ ]:
fetch("https://www.svyaznoy.ru/catalog/phone/224") # get the structure of the web page
In [ ]:
print(response.text) # bring back the bare html and css, like in developer tools

2.1. Getting phone names.

Since you cannot execute everything in Jupyter notebooks (unfortunately), we test and debug in the Scrapy shell, run from a terminal, the command line or another console app, and use XPath notation to drill into the tag tree and get the needed data.

Titles can be found like so:

In [ ]:
response.xpath("//div[@class='b-product-block__image']//img[@class='lazy']/@title").extract() 

Write them to an object.

In [ ]:
titles = response.xpath("//div[@class='b-product-block__image']//img[@class='lazy']/@title").extract()

2.2. Getting photos.

Our photo links are also located in the b-product-block__image class, and we can extract them in the following way:

In [ ]:
imgs = response.xpath("//div[@class='b-product-block__image']//img[@class='lazy']/@data-original").extract()
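
If you don't have a Scrapy shell at hand, the same idea can be tried offline. The sketch below runs the element part of these queries against a tiny piece of markup that I made up to mimic the card structure; the real page is much bigger and not guaranteed to be well-formed XML. It uses the built-in xml.etree.ElementTree, which returns elements rather than attribute values, so the trailing /@title and /@data-original steps become .get() calls.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified card markup modelled on the class names used above.
SAMPLE = """<div>
  <div class="b-product-block__image">
    <a href="/phone-a"><img class="lazy" title="Phone A" data-original="/img/a.jpg"/></a>
  </div>
  <div class="b-product-block__image">
    <a href="/phone-b"><img class="lazy" title="Phone B" data-original="/img/b.jpg"/></a>
  </div>
</div>"""

root = ET.fromstring(SAMPLE)

# Same path as in the Scrapy queries, minus the final attribute step.
images = root.findall(".//div[@class='b-product-block__image']//img[@class='lazy']")

titles = [img.get("title") for img in images]          # stands in for /@title
imgs = [img.get("data-original") for img in images]    # stands in for /@data-original
print(titles, imgs)
```

Both lists come from the same img elements, which is why titles and photo links line up one to one.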

2.3. Getting prices.

Prices are located a bit further down the tree, in the b-product-block__price block, inside a span with the class b-product-block__visible-price.

In [ ]:
response.xpath(".//div[@class='b-product-block__price']//span[@class='b-product-block__visible-price']/text()").extract()

However, the raw data is dirty and we will have to clean it up, using some time, magic and regular expressions (which are actually equivalent to magic).

This is how we can clean it up:

In [ ]:
prices = [price.replace("\n", "") for price in response.xpath(".//div[@class='b-product-block__price']//span[@class='b-product-block__visible-price']/text()").extract()]
prices = [price.replace("\xa0", "") for price in prices]  # cleaning from non-breaking space in Latin1(ISO 8859-1)
prices = [price.strip() for price in prices] # cleaning from unwanted spaces 
prices = [int(price) for price in prices if price] # turning string objects to integers
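
The same cleaning steps can be wrapped into one function and checked on a few raw strings. The function name clean_prices and the sample values are made up for this sketch; they only imitate the kind of text nodes the query returns.

```python
def clean_prices(raw_prices):
    """Turn raw text nodes into integers, skipping nodes that are empty after cleaning."""
    cleaned = []
    for raw in raw_prices:
        # Drop newlines and non-breaking spaces, then trim ordinary whitespace.
        text = raw.replace("\n", "").replace("\xa0", "").strip()
        if text:
            cleaned.append(int(text))
    return cleaned


# Hypothetical raw values: a price, a whitespace-only node and another price.
raw = ["\n      12\xa0990\n   ", "\n", " 5\xa0490 "]
prices = clean_prices(raw)
print(prices)
```

Note that whitespace-only nodes are dropped silently, so the result can be shorter than the input list.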

2.4. Getting sale offers.

It is pretty much the same as with prices, but there are cases when there is no sale offer for an item, so we will have to be a bit witty here.
Below you can find a list comprehension that gets the sale offers. For each price block we check whether it contains a sale offer: if it does, we extract it, otherwise we fill our object with a default value.

In [ ]:
# Query each price block separately, so every phone gets its own sale offer ('' if it has none)
[block.xpath(".//div[@class='b-product-block__gain']").extract_first(default='')
 for block in response.xpath("//div[@class='b-product-block__price']")]
In [ ]:
import re

# One entry per price block, so that sales stay aligned with titles and prices
blocks = response.xpath("//div[@class='b-product-block__price']")
sales = [block.xpath("string(.//div[@class='b-product-block__gain'])").extract_first(default="") for block in blocks]  # text of the sale offer, '' if there is none
sales = [re.sub(r"\D", "", sale) for sale in sales]  # keeping digits only (drops text, spaces and non-breaking spaces)
sales = [int(sale) if sale else 0 for sale in sales]  # turning strings into integers, 0 means no sale offer
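
The important detail with sale offers is that the list has to keep one entry per phone, otherwise discounts end up attached to the wrong items. Here is an offline sketch of that idea, using the built-in xml.etree.ElementTree on made-up markup: one price block has a b-product-block__gain element, the other does not. The markup and the "Save" text are assumptions for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup: the first phone has a sale offer, the second one does not.
SAMPLE = """<div>
  <div class="b-product-block__price">
    <span class="b-product-block__visible-price">12&#160;990</span>
    <div class="b-product-block__gain">Save 1&#160;000</div>
  </div>
  <div class="b-product-block__price">
    <span class="b-product-block__visible-price">5&#160;490</span>
  </div>
</div>"""

root = ET.fromstring(SAMPLE)

sales = []
for block in root.findall(".//div[@class='b-product-block__price']"):
    # Search inside this block only, never across the whole page.
    gain = block.find(".//div[@class='b-product-block__gain']")
    text = "".join(gain.itertext()) if gain is not None else ""
    digits = re.sub(r"\D", "", text)  # keep digits only
    sales.append(int(digits) if digits else 0)  # 0 stands for "no sale offer"

print(sales)
```

Because the loop appends exactly one value per block, the result always has as many entries as there are price blocks.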

3. Compiling it all together.

So we put all things together.

In [ ]:
# Retrieving objects
imgs = response.xpath("//div[@class='b-product-block__image']//img[@class='lazy']/@data-original").extract()
titles = response.xpath("//div[@class='b-product-block__image']//img[@class='lazy']/@title").extract()
prices = response.xpath(".//div[@class='b-product-block__price']//span[@class='b-product-block__visible-price']/text()").extract()
blocks = response.xpath("//div[@class='b-product-block__price']")
sales = [block.xpath("string(.//div[@class='b-product-block__gain'])").extract_first(default="") for block in blocks]  # one entry per price block

# Processing the prices
prices = [price.replace("\n", "") for price in prices]
prices = [price.replace("\xa0", "") for price in prices]
prices = [price.strip() for price in prices]
prices = [int(price) for price in prices if price]

# Processing the discounts
sales = [re.sub(r"\D", "", sale) for sale in sales]  # keeping digits only
sales = [int(sale) if sale else 0 for sale in sales]  # 0 means no sale offer

Then we want to have the data in a familiar format. First the data will be stored in dictionaries, and then we will pass them to Scrapy's internal scripts, so that we get a table in .csv format.

We will use a for loop and the zip construction.
We will also use yield, which turns parse into a generator, so that the spider emits one dictionary for each phone.
All together it looks like this:

In [ ]:
for item in zip(titles, prices, sales, imgs):
    scraped_info = {
        'title': item[0],
        'price': item[1],
        'sale_offer': item[2],  # the discount of this very phone
        'image_urls': [item[3]]}
    yield scraped_info
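
Outside a spider, the same zip-and-yield pattern can be tried as an ordinary generator function. The function name build_items and the sample values are made up for this sketch.

```python
def build_items(titles, prices, sales, imgs):
    """Yield one dictionary per phone, pairing the four lists position by position."""
    for title, price, sale, img in zip(titles, prices, sales, imgs):
        yield {
            "title": title,
            "price": price,
            "sale_offer": sale,
            "image_urls": [img],  # a list, as the images pipeline expects
        }


items = list(build_items(
    ["Phone A", "Phone B"],
    [12990, 5490],
    [1000, 0],
    ["/img/a.jpg", "/img/b.jpg"],
))
print(items)
```

Keep in mind that zip stops at the shortest list, so if one of the lists loses an element during cleaning, the extra phones are dropped without any error.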

4. Pagination. How to iterate through all pages.

Scrapy has built-in structures for extracting page links and defining rules to crawl, but I decided to keep it very simple: build the list of page links with a list comprehension and merge it with the first page.

In [ ]:
# Pagination.
allowed_domains = ['svyaznoy.ru']  # bare domain names, not URLs
first_page = ['http://www.svyaznoy.ru/catalog/phone/224/']
all_others = ['http://www.svyaznoy.ru/catalog/phone/224/page-'+str(x) for x in range(2,110)]  # pages 2..109

# Put the 1st page in front.
start_urls = first_page + all_others
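
The page list is easy to get off by one, because range excludes its upper bound: covering pages 2 to N needs range(2, N + 1). A small sketch with the page count as a parameter (the function name page_urls is made up here; 109 is the number of pages mentioned at the start of the tutorial):

```python
BASE_URL = "http://www.svyaznoy.ru/catalog/phone/224/"


def page_urls(total_pages):
    """Return the URL of the first page followed by page-2 ... page-<total_pages>."""
    return [BASE_URL] + [f"{BASE_URL}page-{n}" for n in range(2, total_pages + 1)]


start_urls = page_urls(109)
print(len(start_urls), start_urls[1], start_urls[-1])
```

A quick look at the length and the last element is usually enough to catch a missing page.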

Apart from that, we will also make our spider follow the link to the next page after each response.

In [ ]:
next_page = response.xpath(".//li[@class='next']//a/@data-page").extract_first()
if next_page is not None:  # extract_first() returns None when nothing matched
    next_page = response.urljoin(str(int(next_page) + 1))
    yield scrapy.Request(next_page, callback=self.parse)
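
It helps to know what response.urljoin does with a relative value: it resolves it against the URL of the current page, following the same rules as urljoin from the standard library. The sketch below shows how a bare page number and a page-N segment resolve; which of these addresses the site actually serves is something to check in the Scrapy shell, not something this sketch can tell.

```python
from urllib.parse import urljoin

page_2 = "http://www.svyaznoy.ru/catalog/phone/224/page-2"

# A bare number replaces the last path segment of the current page.
joined_number = urljoin(page_2, "3")

# A full "page-N" segment gives the same URL pattern as the start_urls list.
joined_segment = urljoin(page_2, "page-3")

# From the first page (note the trailing slash) the segment is appended.
joined_from_first = urljoin("http://www.svyaznoy.ru/catalog/phone/224/", "page-2")

print(joined_number, joined_segment, joined_from_first, sep="\n")
```

Since start_urls already lists every page, Scrapy's duplicate filter takes care of pages that would otherwise be requested twice.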

5. Some finishing touches.

In addition, we will have to include several adjustments in our scripts. Firstly, we have to specify where we would like to store our results, in our main script, svz.py.

In [ ]:
custom_settings = {'FEED_URI' : 'results/svyaznoy.csv'}

Secondly, we have to open the settings.py script in our project folder svyaznoy and include the parameters listed below.

In [ ]:
BOT_NAME = 'svyaznoy'

SPIDER_MODULES = ['svyaznoy.spiders']  # the spiders package of our project
NEWSPIDER_MODULE = 'svyaznoy.spiders'
FEED_FORMAT = "csv"
FEED_URI = "svyaznoy.csv"

And lastly, we need to specify the pipeline that downloads the phone photos from the extracted links (include this in settings.py).

In [ ]:
ITEM_PIPELINES = {
  'scrapy.pipelines.images.ImagesPipeline': 1
}
IMAGES_STORE = 'results/images/'

Done!
Other scripts in the project folder do not need any adjustments, at least for our purposes.
Go to the project folder (svyaznoy), make some popcorn and run the spider with the command scrapy crawl svyaznoy. Enjoy.


After a couple of minutes we will get a report with stats about the job done. Something like this:

We can see that the majority of the data was scraped, while some pages returned a 301 code, which means a redirect. Of course, there are tips and tricks for dealing with that as well, but I will leave it to the reader to find them in the documentation.

You can also find small resized photos and a csv file in the results folder.
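
Once the csv file is there, the standard library is enough to look for the tempting offers we started with. The sketch below reads rows with csv.DictReader and sorts them by discount. To stay self-contained it reads a made-up sample from io.StringIO instead of the real file; the column names follow the dictionary keys from the spider, while the rows and the function name best_offers are invented for illustration.

```python
import csv
import io

# Hypothetical sample in the same shape as the scraped table.
SAMPLE_CSV = """title,price,sale_offer,image_urls
Phone A,12990,1000,/img/a.jpg
Phone B,5490,0,/img/b.jpg
Phone C,30990,4000,/img/c.jpg
"""


def best_offers(csv_file, top=2):
    """Return (title, discount) pairs for the biggest discounts in an open csv file."""
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda row: int(row["sale_offer"]), reverse=True)
    return [(row["title"], int(row["sale_offer"])) for row in rows[:top]]


offers = best_offers(io.StringIO(SAMPLE_CSV))
print(offers)
```

For the real results, pass a file opened with open('results/svyaznoy.csv', newline='', encoding='utf-8') instead of the StringIO object.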


6. All together. A full script.

The script below can simply be copied and pasted (Ctrl+C/Ctrl+V) into the key spider/crawler script, svz.py. Don't forget the additional adjustments in settings.py.

P.S. Bear in mind the indentation problem (4 spaces or Tab) when writing/debugging code in text editors/IDEs.
In my case, I chose Tabs.

In [ ]:
import scrapy
import re


class SvzSpider(scrapy.Spider):


	custom_settings = {'FEED_URI' : 'results/svyaznoy.csv'}

	name = 'svyaznoy'

	""" Making a proper list of pages. """
	allowed_domains = ['svyaznoy.ru']  # bare domain names, not URLs
	first_page = ['http://www.svyaznoy.ru/catalog/phone/224/']
	all_others = ['http://www.svyaznoy.ru/catalog/phone/224/page-'+str(x) for x in range(2,110)]  # pages 2..109
	
	""" Inserting the 1st page. """
	start_urls = first_page+all_others


	def parse(self, response):
        
		# Retrieving objects
		imgs = response.xpath("//div[@class='b-product-block__image']//img[@class='lazy']/@data-original").extract()
		titles = response.xpath("//div[@class='b-product-block__image']//img[@class='lazy']/@title").extract()
		prices = response.xpath(".//div[@class='b-product-block__price']//span[@class='b-product-block__visible-price']/text()").extract()
		blocks = response.xpath("//div[@class='b-product-block__price']")
		sales = [block.xpath("string(.//div[@class='b-product-block__gain'])").extract_first(default="") for block in blocks]

		# Processing prices
		prices = [price.replace("\n", "") for price in prices]
		prices = [price.replace("\xa0", "") for price in prices]
		prices = [price.strip() for price in prices]
		prices = [int(price) for price in prices if price]

		# Processing sale offers (one entry per price block, 0 means no sale offer)
		sales = [re.sub(r"\D", "", sale) for sale in sales]
		sales = [int(sale) if sale else 0 for sale in sales]
        
		# Yielding objects    
		for item in zip(titles, prices, sales, imgs):
			scraped_info = {
				'title': item[0],
				'price': item[1],
				'sale_offer': item[2],
				'image_urls': [item[3]]}
			yield scraped_info
  
		# Pagination loop
		next_page = response.xpath(".//li[@class='next']//a/@data-page").extract_first()
		if next_page is not None:  # extract_first() returns None when nothing matched
			next_page = response.urljoin(str(int(next_page) + 1))
			yield scrapy.Request(next_page, callback=self.parse)