We are going to use Scrapy, a Python framework for automating data extraction from webpages. In this notebook we are following along with a very good introduction to web scraping in general (along with some other tools that you might wish to explore) at the Library Carpentry "Webscraping with Python" October 2016 site.

Scrapy has already been installed in this notebook for you; but for reference, you can install it with pip install scrapy.

In [4]:
!scrapy version
Scrapy 1.5.1

We're going to create a spider that scrapes the search results page of the Portable Antiquities Scheme database at finds.org.uk.

In [6]:
!scrapy startproject romanbrick
New Scrapy project 'romanbrick', using template directory '/srv/conda/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /home/jovyan/romanbrick

You can start your first spider with:
    cd romanbrick
    scrapy genspider example example.com

If you go to the Home page (right-click on the Jupyter logo at top left and open in a new tab) you'll see the new folder called romanbrick. Inside that folder there is a new folder called (again) romanbrick and a file called scrapy.cfg. The second romanbrick folder contains all of the plumbing for our spider. Your entire scrapy project then looks like this:


romanbrick/            # the root project directory
    scrapy.cfg      # deploy configuration file

    romanbrick/        # project's Python module, you'll import your code from here
        __init__.py

        items.py        # Holds the structure of the data we want to collect

        middlewares.py  # We aren't going to use this today. 
                        # A file to manipulate how spiders process input, in case pretending to be a normal HTTP browser doesn't work.

        pipelines.py    # We aren't going to use this today, either.
                        # Once our spider writes things to the "item," we can use pipelines to do additional processing before we export it.

        settings.py # project settings file

        spiders/        # a directory where you'll later put your spiders
            __init__.py
            ...

Spiders

The spiders directory is where we're going to put our crawler (spiders... crawlers, geddit?). Spiders have three parts:

  1. the starting URL(s)
  2. a list of allowed domains, or places the crawler is allowed to go, for fear of downloading the entire internet (Not an idle fear, actually.)
  3. a way of parsing the results, to get the data you're after.

Scrapy makes this easy for us. Say we're interested in the archaeology of Roman stamped bricks. At the finds.org.uk search page, we enter 'stamped brick' and end up with the following url:

https://finds.org.uk/database/search/results/q/stamped+brick

To create the spider, we run scrapy genspider <SCRAPER NAME> <START URL>, so:

In [8]:
!scrapy genspider rbrickspider "finds.org.uk/database/search/results/q/stamped+brick"
Created spider 'rbrickspider' using template 'basic' 

Note: Open the file list (right-click on the Jupyter logo, open in a new tab so you don't close this notebook), and see where your rbrickspider.py file has been created. Tick the check box beside it, select 'move', and put it in the spiders folder, e.g. romanbrick/romanbrick/spiders. Once it's moved, you can click on this file and it will open in the text editor - it'll look like this:

# -*- coding: utf-8 -*-
import scrapy


class RbrickspiderSpider(scrapy.Spider):
    name = 'rbrickspider'
    allowed_domains = ['finds.org.uk/database/search/results/q/stamped+brick']
    start_urls = ['http://finds.org.uk/database/search/results/q/stamped+brick/']

    def parse(self, response):
        pass

Look at the allowed_domains. That's a bit too restrictive - Scrapy expects a bare domain name here, and the generated value includes the whole search path, so only requests matching that exact pattern would be allowed. We want all the materials, so let's let the spider grab everything inside the top-level domain. Change the pattern for allowed_domains to just the domain, then save the file. (psst: finds.org.uk).
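After the edit, the attribute should read as follows (shown here on its own; in your file it lives inside the spider class):

```python
# allowed_domains takes bare domain names, not full URLs;
# the spider may then follow links anywhere under finds.org.uk.
allowed_domains = ['finds.org.uk']
```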

Let's run it!

In [9]:
%cd romanbrick
/home/jovyan/romanbrick
In [ ]:
!scrapy crawl rbrickspider -s DEPTH_LIMIT=1

It worked! You know this from the downloader/response_status_count/200 line in the log - the scraper was able to reach the webpage and interact with it. However, because we haven't told the spider what to parse, it hasn't grabbed anything.

Add the following to your rbrickspider.py file (which is in romanbrick/romanbrick/spiders, remember):

    def parse(self, response):
        with open("test.html", 'wb') as file:
            file.write(response.body)

This writes the html of the page to file. Save that, then run the spider again in the block below. Be careful with the copying and pasting; Python is very fussy about tabs versus spaces and so on. If you get an 'unexpected indent' error, make sure that you haven't mixed spaces with tabs.
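As an aside, response.body arrives as bytes rather than a string, which is why the file is opened with mode 'wb' (write binary). A standalone illustration, using a made-up byte string in place of a real response:

```python
# Stand-in for response.body, which Scrapy supplies as bytes.
body = b"<html><head><title>test</title></head></html>"

# 'wb' = write binary; writing bytes to a text-mode file would raise TypeError.
with open("test.html", "wb") as f:
    f.write(body)

# Reading it back confirms the bytes round-trip unchanged.
with open("test.html", "rb") as f:
    print(f.read() == body)  # True
```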

In [ ]:
# paste your crawl command in this block and run it.

If you check the contents of this directory with the ls command, you should see a new file test.html

In [11]:
!ls
romanbrick  scrapy.cfg	test.html

Let's look at the first few lines of the file with the head command. Have we actually scraped anything?

In [12]:
!head test.html
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:batlas="http://atlantides.org/batlas/" xmlns:gml="http://www.opengis.net/gml/" xmlns:nm="http://nomisma.org/id/" xmlns:ov="http://open.vocab.org/terms/" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:skos="http://www.w3.org/2008/05/skos#" xmlns:pas="https://finds.org.uk/database/" xmlns:google="http://rdf.data-vocabulary.org/#" xmlns:con="http://www.w3.org/2000/10/swap/pim/contact#" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml" xml:lang="en">
<head>

    <title>Search results from the database Page: 1</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<meta http-equiv="expires" content="Fri, 17 Aug 2018 17:01:01 +0100"/>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<meta http-equiv="Content-Language" content="en-GB"/>

But saving webpages isn't the point of scraping; rather, we want the data itself. In the file list screen (accessed by clicking on the Jupyter logo at top left) tick the box beside test.html and then select edit. Let's scroll through the results. Examine the html, paying attention to the <div> elements that wrap the information we want, and how these elements nest one within the other.

We can tell Scrapy which of those bits of information we want to collect using XPath selectors. Many websites are built by pulling data out of a database and then displaying it by running it through a 'document object model'; this model tells the browser where to put, and how to display, the various pieces of information. An 'xpath' lets us specify which object in this model we want. We tell Scrapy the xpath to the info, and it duly collects all the data found at that path.
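To get a feel for how an XPath selector picks elements out of the document tree, here is a minimal sketch using Python's standard library on an invented fragment (Scrapy's response.xpath works on the same principle, with fuller XPath support):

```python
import xml.etree.ElementTree as ET

# Invented fragment, loosely mimicking the structure of a results page.
html = """
<html><body>
  <div id="preview">
    <p>Record one</p>
    <p>Record two</p>
  </div>
  <div id="footer"><p>Not wanted</p></div>
</body></html>
"""

root = ET.fromstring(html)
# ElementTree supports a subset of XPath; the selector
# //*[@id='preview']/p becomes .//*[@id='preview']/p here.
texts = [p.text for p in root.findall(".//*[@id='preview']/p")]
print(texts)  # ['Record one', 'Record two']
```

Only the paragraphs inside the element with id="preview" are selected; the footer paragraph is skipped, which is exactly the filtering we want from the real page.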

There are a variety of ways of finding the correct xpath; some people recommend this scraper extension for Chrome.

If you open a new terminal in Jupyter, type cd romanbrick and then

scrapy shell "https://finds.org.uk/database/search/results/q/stamped+brick"

you can try different xpath combinations and see what results you get using the response.xpath command:

>>> response.xpath("//THE XPATH SELECTOR GOES IN HERE AND MAKE SURE TO CHANGE QUOTES TO SINGLE QUOTES IF NECESSARY")

After some examination, we see that //*[@id='preview']/p/text() will get the information we want.

Edit rbrickspider.py so that you're parsing for just the data:

    def parse(self, response):
        print(response.xpath("//*[@id='preview']/p/text()").extract())

Now run

!scrapy crawl rbrickspider -s DEPTH_LIMIT=1

in the code block below. Check with !pwd to make sure you're in /home/jovyan/romanbrick first.

In [ ]:
# paste your crawl command here and run it.

We're getting there! But all of the data is coming out in one big mess. Edit rbrickspider.py so that each result prints on its own line:

    def parse(self, response):
        for resource in response.xpath("//*[@id='preview']/p/text()"):
            print(resource.extract())

Now run

!scrapy crawl rbrickspider -s DEPTH_LIMIT=1

(By the way - if you ever find that nothing happens when you run the scrapy crawl command, try refreshing the browser window and re-running the %cd romanbrick command first).

In [ ]:
# put your code here

When you look at the underlying HTML of the finds.org.uk search results page using Chrome's inspector, you'll notice that the images are all contained within the ul tag. You can right-click on that html and select 'copy xpath'. You should get //*[@id='preview']/ul/li/div/a. Swap that into your spider and run it.
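The same idea extends to attribute values such as link targets. A quick standard-library sketch on an invented fragment (the nesting here is made up to echo what the inspector shows; in the Scrapy shell you would use response.xpath(...) and .extract() instead):

```python
import xml.etree.ElementTree as ET

# Invented fragment echoing the ul/li/div/a nesting from the inspector.
html = """
<html><body>
<div id="preview">
  <ul>
    <li><div><a href="/database/artefacts/record/id/1">find 1</a></div></li>
    <li><div><a href="/database/artefacts/record/id/2">find 2</a></div></li>
  </ul>
</div>
</body></html>
"""

root = ET.fromstring(html)
# Mirrors the selector //*[@id='preview']/ul/li/div/a;
# a.get("href") stands in for XPath's .../a/@href step.
links = [a.get("href") for a in root.findall(".//*[@id='preview']/ul/li/div/a")]
print(links)  # ['/database/artefacts/record/id/1', '/database/artefacts/record/id/2']
```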

At this point, you'll be getting the hang of it. For the final step of writing your data to csv, please scroll down to the Writing to a csv section of the Library Carpentry tutorial - but make sure you right-click on that link and open in a new tab! You now know how to grab two different sets of data from the finds page. Complete the tutorial so that you end up with a table of results from the finds.org.uk webpage.
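Once your parse() is yielding structured rows, the CSV writing itself is plain Python. A hedged sketch with invented field names (the tutorial's own fields will differ):

```python
import csv

# Hypothetical rows, standing in for what the spider's parse() would yield.
rows = [
    {"description": "Stamped brick fragment", "county": "Kent"},
    {"description": "Roman tile", "county": "Essex"},
]

# DictWriter maps each dict onto a CSV row under the given column headers.
with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["description", "county"])
    writer.writeheader()
    writer.writerows(rows)

# Reading it back: one header line plus one line per row.
with open("results.csv", newline="") as f:
    print(sum(1 for _ in f))  # 3
```

Scrapy also has a built-in shortcut: running scrapy crawl rbrickspider -o results.csv writes any items the spider yields straight to a CSV file, no extra code required.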

Save your work, and remember to file -> download as -> notebook when you're done. Remember that this binder will time out after ten minutes.