We are going to use Scrapy, a Python framework for automating data extraction from webpages. In this notebook we are following along with a very good introduction to web scraping in general (along with some other tools that you might wish to explore): the Library Carpentry "Webscraping with Python" October 2016 site.
Scrapy has already been installed in this notebook for you; but for reference, you can install it with pip install scrapy.
!scrapy version
Scrapy 1.5.1
We're going to create a spider that scrapes the databases of the Portable Antiquities Scheme.
!scrapy startproject romanbrick
New Scrapy project 'romanbrick', using template directory '/srv/conda/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /home/jovyan/romanbrick

You can start your first spider with:
    cd romanbrick
    scrapy genspider example example.com
If you go to the Home page (right-click on the Jupyter logo at top left and open in a new tab) you'll see the new folder called romanbrick. Inside that folder there is another folder called (again) romanbrick and a configuration file called scrapy.cfg. The second romanbrick folder contains all of the plumbing for our spider. Your entire Scrapy project then looks like this:
romanbrick/ # the root project directory
scrapy.cfg # deploy configuration file
romanbrick/ # project's Python module, you'll import your code from here
__init__.py
items.py # Holds the structure of the data we want to collect
middlewares.py # We aren't going to use this today.
# A file to manipulate how spiders process input, in case pretending to be a normal HTTP browser doesn't work.
pipelines.py # We aren't going to use this today, either.
# Once our spider writes things to the "item," we can use pipelines to do additional processing before we export it.
settings.py # project settings file
spiders/ # a directory where you'll later put your spiders
__init__.py
...
The spiders directory is where we're going to put our crawler (spiders... crawlers, geddit?). Spiders have three parts: a name, a list of URLs to start crawling from, and a parse method that handles what comes back.
Scrapy makes this easy for us. Say we're interested in the archaeology of the Roman construction industry. At the PAS search page, we enter 'stamped brick' and end up with the following url:
https://finds.org.uk/database/search/results/q/stamped+brick
To create the spider, we tell Scrapy scrapy genspider <SPIDER NAME> <START URL>, so:
!scrapy genspider rbrickspider "finds.org.uk/database/search/results/q/stamped+brick"
Created spider 'rbrickspider' using template 'basic'
Note: open the file list (right-click on the Jupyter logo and open in a new tab, so you don't close this notebook) and see where your rbrickspider.py file has been created. Tick its check box, select 'Move', and put it in the spiders folder, i.e. romanbrick/romanbrick/spiders. Once it's moved, you can click on the file and it will open in the text editor; it'll look like this:
# -*- coding: utf-8 -*-
import scrapy


class RbrickspiderSpider(scrapy.Spider):
    name = 'rbrickspider'
    allowed_domains = ['finds.org.uk/database/search/results/q/stamped+brick']
    start_urls = ['http://finds.org.uk/database/search/results/q/stamped+brick/']

    def parse(self, response):
        pass
Look at the allowed_domains. That's a bit too restrictive: only URLs that exactly match that pattern would be allowed. We want all the material, so let's let the spider grab everything on the site's domain. Change the pattern for allowed_domains to just the domain itself, then save the file. (Psst: finds.org.uk.)
Let's run it!
%cd romanbrick
/home/jovyan/romanbrick
!scrapy crawl rbrickspider -s DEPTH_LIMIT=1
It worked! You can tell by the downloader/response_status_count/200 line in the log: the scraper was able to reach the webpage and interact with it. However, because we haven't told the spider what to parse, it hasn't grabbed anything.
Add the following to your rbrickspider.py file (which is in romanbrick/romanbrick/spiders, remember):
def parse(self, response):
    with open("test.html", 'wb') as file:
        file.write(response.body)
This writes the HTML of the page to file. Save that, then run the spider again in the block below. Be careful with the copying and pasting; Python is very fussy about indentation. If you get an 'unexpected indent' error, make sure that you haven't mixed spaces with tabs.
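Outside of Scrapy, that callback boils down to writing the response's raw bytes to disk. Here is a minimal stand-alone sketch of the same idea, using an invented StubResponse class in place of Scrapy's real Response object:

```python
# A stand-in for Scrapy's Response object (hypothetical, for illustration only).
class StubResponse:
    def __init__(self, body):
        self.body = body  # raw bytes of the page, like response.body in Scrapy


def parse(response):
    # Write the raw bytes of the "page" to disk, just as the spider does.
    with open("test.html", "wb") as file:
        file.write(response.body)


parse(StubResponse(b"<html><body>stamped brick</body></html>"))
```

In the real spider, Scrapy constructs the response for you and calls parse() once per downloaded page.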
# paste your crawl command in this block and run it.
If you check the contents of this directory with the ls command, you should see a new file, test.html:
!ls
romanbrick scrapy.cfg test.html
Let's look at the first few lines of the file with the head command. Have we actually scraped anything?
!head test.html
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:batlas="http://atlantides.org/batlas/" xmlns:gml="http://www.opengis.net/gml/" xmlns:nm="http://nomisma.org/id/" xmlns:ov="http://open.vocab.org/terms/" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:skos="http://www.w3.org/2008/05/skos#" xmlns:pas="https://finds.org.uk/database/" xmlns:google="http://rdf.data-vocabulary.org/#" xmlns:con="http://www.w3.org/2000/10/swap/pim/contact#" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml" xml:lang="en">
<head>
<title>Search results from the database Page: 1</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<meta http-equiv="expires" content="Fri, 17 Aug 2018 17:01:01 +0100"/>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<meta http-equiv="Content-Language" content="en-GB"/>
But saving webpages isn't the point of scraping; rather, we want the data itself. In the file list screen (accessed by clicking on the Jupyter logo at top left), tick the box beside test.html and then select 'Edit'. Let's scroll through the results. Examine the HTML, paying attention to the <div> elements that wrap the information we want, and how these elements nest one within the other.
We can tell Scrapy which of those bits of information we want to collect using Xpath Selectors. Many websites are built by pulling data out of a database and then displaying it by running it through a 'document object model'; this model tells the webpage where to put and how to display the various information. An 'xpath' lets us specify which object in this model we want. We tell Scrapy the xpath to the info, and it duly collects all the data found in that path.
There are a variety of ways of finding the correct xpath; some people recommend this scraper extension for chrome.
If you open a new terminal in Jupyter, type cd romanbrick and then

scrapy shell "https://finds.org.uk/database/search/results/q/stamped+brick"

you can try out different XPath combinations and test what results you get using the response.xpath command:
>>> response.xpath("//THE XPATH SELECTOR GOES IN HERE AND MAKE SURE TO CHANGE QUOTES TO SINGLE QUOTES IF NECESSARY")
After some examination, we see that //*[@id='preview']/p/text() will get the information we want.
Edit rbrickspider.py so that you're parsing for just the data:
def parse(self, response):
    print(response.xpath("//*[@id='preview']/p/text()").extract())
Now run
!scrapy crawl rbrickspider -s DEPTH_LIMIT=1
in the code block below. Check with !pwd first to make sure you're in /home/jovyan/romanbrick.
# paste your crawl command in this block and run it.
We're getting there! But all of the data is coming in as one big mess. Edit rbrickspider.py so that you loop over the results and print each one on its own line:
def parse(self, response):
    for resource in response.xpath("//*[@id='preview']/p/text()"):
        print(resource.extract())
Now run
!scrapy crawl rbrickspider -s DEPTH_LIMIT=1
(By the way: if you ever find that nothing happens when you run the scrapy crawl command, try refreshing the browser window and re-running the %cd romanbrick command first.)
# put your code here
When you look at the underlying HTML of the finds.org.uk search results page using Chrome's inspector, you'll notice that the images are all contained within a ul tag. You can right-click on that HTML and select 'Copy XPath'. You should get //*[@id='preview']/ul/li/div/a. Swap that into your spider and run it.
At this point, you'll be getting the hang of it. For the final step of writing your data to CSV, please scroll down to the 'Writing to a csv' section of the Library Carpentry tutorial - but make sure you right-click on that link and open it in a new tab! You now know how to grab two different sets of data from the finds page. Complete the tutorial so that you end up with a table of results from the finds.org.uk webpage.
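The core of that final step is plain Python: each scraped item becomes a row in a CSV file. A minimal sketch using only the csv module, with some invented example rows and an invented output filename (finds.csv), looks like this:

```python
import csv

# Invented example rows standing in for scraped find descriptions.
rows = [
    {"id": "ABC-123", "description": "Roman stamped brick"},
    {"id": "DEF-456", "description": "Roman tile, legionary stamp"},
]

# Write a header row, then one row per item.
with open("finds.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "description"])
    writer.writeheader()
    writer.writerows(rows)
```

In a real spider you can skip even this: yield dictionaries from parse() and let Scrapy write the file for you with scrapy crawl rbrickspider -o finds.csv.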
Save your work, and remember to File -> Download as -> Notebook when you're done. Remember that this Binder will time out after ten minutes of inactivity.