Collecting web networks with scraping

Alex Hanna, University of Wisconsin-Madison
alex-hanna.com
@alexhanna

Yesterday we focused on collecting data via APIs (application programming interfaces). While APIs are great, sometimes they don't give us the data we want or need. We are limited to whatever the provider chooses to expose.

As an alternative, one way to get Internet data is to scrape the websites themselves for connections. Say we want to get a sense of how several political candidates are connected to each other, or to understand how particular online communities organize through their websites. Those are some examples in which scrapers can thrive.

The intuition behind scraping

The idea behind scraping is that we look at all the links on a webpage and record each one as a connection. We then look at all the links on the pages linked from the original page, and so on. In computer science, this is known as a breadth-first search.

So, for instance, on my blog Bad Hessian, there is a set of links that can be traversed.

This is a post drawn from this post.

So when we visit Bad Hessian first, the list of pages that we may traverse looks like this:

$G = \{B_1, B_2, B_3, \ldots, B_N\}$

where $B_i$ is a link on the Bad Hessian article. Then we visit, say, the original OrgTheory article, which is the link $B_1$, and add its links to the end of the list, which then looks like this:

$G = \{B_2, B_3, \ldots, B_N, O_1, O_2, O_3, \ldots, O_M\}$

where $O_i$ is a link on the OrgTheory article. Note that $B_1$ has been removed from the list because we've already visited the link.
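The bookkeeping above can be sketched in a few lines of Python. This is only an illustration of breadth-first search, not the crawler we use later. The `links` dictionary is a made-up stand-in for fetching a page and extracting its links, and the page names in it are hypothetical.

```python
from collections import deque

# Hypothetical link structure: each page maps to the links found on it.
links = {
    "badhessian": ["orgtheory", "scatter", "badhessian"],
    "orgtheory": ["scatter", "badhessian", "permut"],
    "scatter": ["orgtheory"],
    "permut": [],
}

def crawl(start):
    """Breadth-first traversal returning (visit order, list of edges)."""
    queue = deque([start])  # the list G of pages still to visit
    seen = {start}          # pages already queued, so nothing is visited twice
    order, edges = [], []
    while queue:
        page = queue.popleft()            # take the oldest entry off the list
        order.append(page)
        for target in links.get(page, []):
            edges.append((page, target))  # every link counts as a connection
            if target not in seen:
                seen.add(target)
                queue.append(target)      # new links go to the back of the list
    return order, edges

order, edges = crawl("badhessian")
print(order)
print(len(edges))
```

Because new links are appended to the back of the queue, every page linked from the starting page is visited before any page that is two links away.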

Blogs tend to link to many different things. So this blog post links to different types of pages -- to other blogs, to social media, to academic citations and journal articles, and to itself. The network map might look something like this:

If we are looking into a particular kind of phenomenon, like a blogging community, maybe we want to restrict our links to those that link to other blogs. How do we do that?

Luckily, blogs tend to share the same sorts of URLs. Blogspot, WordPress, and LiveJournal all have similar URL structures. For this exercise, we're going to focus on collecting network data on blogging networks.
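As a rough sketch of what "similar URL structures" buys us: blogs on these services usually live on a subdomain of the service, so checking the end of the hostname is enough to recognise them. The `BLOG_SERVICES` tuple and the `is_blog_url` helper are names invented for this illustration, not part of scrapy.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

# Assumed list of blogging services, for illustration only.
BLOG_SERVICES = ("wordpress.com", "blogspot.com", "livejournal.com")

def is_blog_url(url):
    """Return True if the URL's host is a subdomain of a known blog service."""
    host = urlparse(url).netloc.lower()
    return any(host.endswith("." + service) for service in BLOG_SERVICES)

print(is_blog_url("http://orgtheory.wordpress.com/"))  # a blog on WordPress
print(is_blog_url("http://twitter.com/alexhanna"))     # not a blog service
```

Self-hosted blogs fail this check, which is the trade-off of filtering by hosting service.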

Starting up

Like yesterday, you first need to connect to the Amazon EC2 server. The hostname is ec2-54-225-7-147.compute-1.amazonaws.com. If you are on Windows, log in with PuTTY; if you are on a Mac, log in with the Terminal.

Once you have logged in, you need to grab the GitHub repository that contains the files we're working from today. Run this command:

 [[email protected] ~]$ git clone https://github.com/raynach/hse-scraping

If you're successful you should see this:

Cloning into 'hse-scraping'...
remote: Counting objects: 44, done.
remote: Compressing objects: 100% (37/37), done.
remote: Total 44 (delta 11), reused 35 (delta 5)
Unpacking objects: 100% (44/44), done.

This sets up the basic framework for the scrapy Python package. We won't get much into the complicated internals of the package. Once you've done that, change into the hse-scraping/blogcrawler/ directory.

[[email protected] ~]$ cd hse-scraping/blogcrawler/
[[email protected] blogcrawler]$ 

Now that we're there, the main file that I want to focus on is blogcrawler/spiders/blogspider.py

In [ ]:
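#!/usr/bin/python

# These are the import paths of older Scrapy releases; newer releases provide
# CrawlSpider and Rule in scrapy.spiders and a LinkExtractor class in
# scrapy.linkextractors.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from blogcrawler.items import BlogcrawlerItem
from urlparse import urlparse  # Python 2 module

class BlogSpider(CrawlSpider):
    name = "blogspider"

    # Only follow links whose host belongs to one of these blogging services.
    allowed_domains = [
        "wordpress.com",
        "blogspot.com",
        "blogger.com",
        "livejournal.com",
        "typepad.com",
        "tumblr.com"]

    # The crawl starts from this page.
    start_urls = ["http://badhessian.org"]

    rules = (
        Rule(SgmlLinkExtractor(
            allow=('/', ),
            # Skip the services' own pages (home pages, profiles, developer
            # sites), which are not individual blogs.
            deny=(r'www\.blogger\.com', r'profile\.typepad\.com',
                  r'http:\/\/wordpress\.com', r'.+\.trac\.wordpress\.org',
                  r'.+\.wordpress\.org', r'wordpress\.org', r'www\.tumblr\.com',
                  r'en\..+\.wordpress\.com', r'vip\.wordpress\.com'),
            ), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        item = BlogcrawlerItem()

        # Index 1 of the parsed URL is the host name (netloc).
        # url1 is the page we came from, url2 is the page we arrived at.
        item['url1'] = urlparse(response.request.headers.get('Referer'))[1]
        item['url2'] = urlparse(response.url)[1]

        yield item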

The intuition behind this file is in the allowed_domains list and the parse_item function. We are just trying to get blogs, so there is a list of blogging services from which we will gather information. The downside is that we don't get blogs that are not on one of these services.
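The deny patterns in the rule work alongside allowed_domains: they drop pages that sit on a blogging service's domain but look like the service's own pages rather than somebody's blog. The sketch below mimics that filtering with plain regular expressions. It does not call scrapy, and it assumes a link is dropped whenever any pattern is found anywhere in the URL, so treat it as an approximation. The `is_denied` helper is a name made up for this example.

```python
import re

# A subset of the deny patterns from the spider.
DENY = [
    r'www\.blogger\.com',
    r'profile\.typepad\.com',
    r'www\.tumblr\.com',
    r'vip\.wordpress\.com',
]

def is_denied(url):
    """Assumed behaviour: drop the URL if any deny pattern occurs in it."""
    return any(re.search(pattern, url) for pattern in DENY)

print(is_denied("http://www.blogger.com/profile/123"))  # service page
print(is_denied("http://scatter.wordpress.com/"))       # an actual blog
```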

Next, look at the parse_item function.

In [ ]:
    def parse_item(self, response):
        item = BlogcrawlerItem()

        item['url1'] = urlparse(response.request.headers.get('Referer'))[1]
        item['url2'] = urlparse(response.url)[1]

        yield item

This function is called every time that the crawl visits a link. The "item" here can be considered a network edge. url1 is the source node while url2 is the destination node.
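To see what the `[1]` index is doing, here is a small standalone sketch. The two URLs are hypothetical examples of a referring page and a visited page. The import is written so that it runs on either Python 2 or Python 3.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

# Hypothetical URLs standing in for the Referer header and response.url.
referer = "http://badhessian.org/2013/06/some-post/"
visited = "http://orgtheory.wordpress.com/2013/05/another-post/"

parts = urlparse(visited)
print(parts[1] == parts.netloc)  # index 1 is the host name

# One network edge: (source host, destination host).
edge = (urlparse(referer)[1], urlparse(visited)[1])
print(edge)
```

Keeping only the host name means that every post on the same blog collapses into a single node.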

Running scrapy

To actually run scrapy, type the following:

[[email protected] blogcrawler]$ scrapy crawl blogspider -o output.csv -t csv

This writes the data from the crawler to a CSV file, which represents an edgelist. You'll see a lot of log messages scroll by while this is happening.

The process could go on indefinitely. To stop the process, press Ctrl + C and it should close itself. Give it a few seconds to do so.

If you want to suppress all the log messages that come along with it, use this command:

[[email protected] blogcrawler]$ scrapy crawl blogspider -o output.csv -t csv --nolog

You'll get an output that looks like this:

[[email protected] blogcrawler]$ more output.csv 
url1,url2
badhessian.org,orgtheory.wordpress.com
badhessian.org,scatter.wordpress.com
badhessian.org,orgtheory.wordpress.com
badhessian.org,orgtheory.wordpress.com
badhessian.org,scatter.wordpress.com
badhessian.org,permut.wordpress.com
badhessian.org,mobilizingideas.wordpress.com
badhessian.org,asecondmouse.wordpress.com
badhessian.org,codeandculture.wordpress.com
badhessian.org,dartthrowingchimp.wordpress.com
badhessian.org,exploringpossibilityspace.blogspot.com
orgtheory.wordpress.com,orgtheory.wordpress.com
orgtheory.wordpress.com,orgtheory.wordpress.com
orgtheory.wordpress.com,orgtheory.wordpress.com
orgtheory.wordpress.com,orgtheory.wordpress.com
...

If you look at the Blog Roll on the side of the Bad Hessian page, you'll notice that this closely matches the blogs listed there. You'll also note that many blogs appear as both url1 and url2 on the same line -- that means they are linking to themselves. Like the diagram above, these nodes are in a loop.
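Before analysing the edgelist you may want to drop the self-links and count how often each remaining pair occurs. The following is a minimal sketch using only the Python 3 standard library. It reads a few rows copied from the output above out of a string; to use the real file, pass `open("output.csv")` to `csv.DictReader` instead.

```python
import csv
import io
from collections import Counter

# A few rows copied from the sample output.
sample = """url1,url2
badhessian.org,orgtheory.wordpress.com
badhessian.org,scatter.wordpress.com
badhessian.org,orgtheory.wordpress.com
orgtheory.wordpress.com,orgtheory.wordpress.com
orgtheory.wordpress.com,orgtheory.wordpress.com
"""

weights = Counter()
for row in csv.DictReader(io.StringIO(sample)):
    if row["url1"] != row["url2"]:  # skip blogs linking to themselves
        weights[(row["url1"], row["url2"])] += 1

# Each key is an edge; each value is how many times that link was seen.
for (source, target), count in sorted(weights.items()):
    print(source, target, count)
```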

Additional materials

There are a lot more options available in scrapy. If you want to learn more about scrapy and the various types of tasks you can accomplish with it, check out the official scrapy documentation.