As I'm interested in text processing, I decided to write a simple web page crawler. I know there is the Python library Beautiful Soup, but writing a parser is interesting in itself... OK, let's go.
This is an n-ary tree, its traversal, and a function to build an abstract syntax tree - once those are properly done, we are home.
# n-ary tree class
class NaryTree:
    """n-ary tree, a Python list as an array of children"""
    def __init__(self, rootObj):
        self.key = rootObj
        self.parent = None
        self.kids = None

    def insert(self, newNode):
        if self.kids:
            t = NaryTree(newNode)
            self.kids.append(t)
            t.parent = self
        else:
            t = NaryTree(newNode)
            self.kids = []
            self.kids.append(t)
            t.parent = self

    def setRootVal(self, obj):
        self.key.append(obj)

    def getRootVal(self):
        return self.key

    def getParent(self):
        return self.parent

    def getNthKid(self, i=-1):  # return child i; the rightmost one if i is not given
        return self.kids[i]
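A quick sanity check of the class (a condensed, self-contained copy of the code above, with hypothetical 'a'/'b'/'c' payloads):

```python
class NaryTree:
    """n-ary tree; a Python list serves as the array of children"""
    def __init__(self, rootObj):
        self.key = rootObj
        self.parent = None
        self.kids = None

    def insert(self, newNode):
        # create the child, link it back to us, append left to right
        t = NaryTree(newNode)
        t.parent = self
        if self.kids is None:
            self.kids = []
        self.kids.append(t)

# build a root with three children, appended left to right
root = NaryTree('root')
for label in ('a', 'b', 'c'):
    root.insert(label)

print([k.key for k in root.kids])    # → ['a', 'b', 'c'], insertion order preserved
print(root.kids[-1].parent is root)  # → True, every child points back to its parent
```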
The tree code is surprisingly simple - simpler than a binary tree class! The nodes sit inside a Python list (self.kids), which can be any size we want; the insert function creates child nodes, appending them to the list from left to right. Traversal:
# tree traversal and conversion to a Python list
def traverse(tree):  # DFS over the tree
    if tree:
        stk = []
        stk.append(tree)
        while len(stk) > 0:
            top = stk.pop()
            if top.kids:
                for child in top.kids:
                    stk.append(child)
            print(top.getRootVal())
    else:
        return None

def nary_tree_tolist(tree, lst):  # convert the tree to a list
    if tree:
        stk = []
        stk.append(tree)
        while len(stk) > 0:
            top = stk.pop()
            if top.kids:
                for child in top.kids:
                    stk.append(child)
            lst.append(top.getRootVal())
        return lst
    else:
        return None
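To see the pop order concretely, here is a self-contained run of nary_tree_tolist on a small hand-built tree (class and function condensed from the code above):

```python
class NaryTree:
    def __init__(self, rootObj):
        self.key = rootObj
        self.parent = None
        self.kids = None

    def insert(self, newNode):
        t = NaryTree(newNode)
        t.parent = self
        if self.kids is None:
            self.kids = []
        self.kids.append(t)

def nary_tree_tolist(tree, lst):
    """iterative DFS: pop a node, push its children, record its key"""
    if not tree:
        return None
    stk = [tree]
    while stk:
        top = stk.pop()
        if top.kids:
            stk.extend(top.kids)
        lst.append(top.key)
    return lst

root = NaryTree(1)
root.insert(2)
root.insert(3)
root.kids[0].insert(4)  # 4 is a child of node 2

print(nary_tree_tolist(root, []))  # → [1, 3, 2, 4]
```

Because a stack is used, siblings come out right to left: node 3 is listed before node 2.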
The traverse function is similar to DFS on a graph - yes, a tree is a graph, no problem ;). The second function just transfers the tree's content to a list. Finally, time to build the syntax tree:
# given the supplied delimiters, builds a syntax tree
def buildNaryParseTree(lexp):
    """build a syntax tree from a tokenized list expression, returns an n-ary tree"""
    tree = NaryTree([])
    cur = tree
    ctr = 0
    ctr1 = 0
    for token in lexp:
        if token == '<p>':
            cur.insert([])
            cur = cur.getNthKid()
            ctr += 1
        elif token == '<title>':
            cur.insert([])
            cur = cur.getNthKid()
            ctr1 += 1
        elif token not in ['<p>', '</p>', '<title>', '</title>'] and (ctr != 0 or ctr1 != 0):
            cur.setRootVal(token)
        elif token == '</p>':
            ctr -= 1
            cur = cur.getParent()
        elif token == '</title>':
            ctr1 -= 1
            cur = cur.getParent()
    return tree
The function parses an expression in the standard way: on an opening token it creates a new node and moves into it; on a closing token it goes one level up the tree; in the meantime it writes down what's inside the delimiters (cur.setRootVal(token)).
The variables ctr and ctr1 control that writing (so that the whole page doesn't end up in the tree - just what's between the delimiters).
Of course more tags must be added; this is just for testing whether the design is correct (I believe it is!).
In general the function can parse anything, whether the delimiters are nested or not (I checked lots of expressions).
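Putting the pieces together on a hypothetical token list - condensed copies of the class and helpers above, so the snippet runs standalone (same logic as the function above, with the branch order slightly rearranged):

```python
class NaryTree:
    def __init__(self, rootObj):
        self.key = rootObj
        self.parent = None
        self.kids = None

    def insert(self, newNode):
        t = NaryTree(newNode)
        t.parent = self
        if self.kids is None:
            self.kids = []
        self.kids.append(t)

def buildNaryParseTree(lexp):
    """opening tag: descend into a fresh node; closing tag: climb back up"""
    tree = NaryTree([])
    cur = tree
    ctr = ctr1 = 0
    for token in lexp:
        if token == '<p>':
            cur.insert([])
            cur = cur.kids[-1]   # descend into the newest child
            ctr += 1
        elif token == '<title>':
            cur.insert([])
            cur = cur.kids[-1]
            ctr1 += 1
        elif token == '</p>':
            ctr -= 1
            cur = cur.parent
        elif token == '</title>':
            ctr1 -= 1
            cur = cur.parent
        elif ctr != 0 or ctr1 != 0:
            cur.key.append(token)  # record text only while inside a tag
    return tree

def nary_tree_tolist(tree, lst):
    stk = [tree]
    while stk:
        top = stk.pop()
        if top.kids:
            stk.extend(top.kids)
        lst.append(top.key)
    return lst

tokens = ['junk', '<title>', 'H', 'i', '</title>', 'junk']
print(nary_tree_tolist(buildNaryParseTree(tokens), []))
# → [[], ['H', 'i']] : the root key stays empty, the <title> node collects its text
```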
The worst part is now preparing the HTML code for parsing; I think I'll just scratch the surface ;/...
def prepare_expression(exp):
    """takes a page as a list of characters and makes the tags visible"""
    l = len(exp)
    for i in range(l):
        if exp[i] == '<' and exp[i + 1] == 't' and exp[i + 2] == 'i' and exp[i + 3] == 't' and exp[i + 4] == 'l' and \
                exp[i + 5] == 'e' and exp[i + 6] == '>':
            exp[i] = '<title>'
            exp[i + 1] = []
            exp[i + 2] = []
            exp[i + 3] = []
            exp[i + 4] = []
            exp[i + 5] = []
            exp[i + 6] = []
        if exp[i] == '<' and exp[i + 1] == '/' and exp[i + 2] == 't' and exp[i + 3] == 'i' and exp[i + 4] == 't' and \
                exp[i + 5] == 'l' and exp[i + 6] == 'e' and exp[i + 7] == '>':
            exp[i] = '</title>'
            exp[i + 1] = []
            exp[i + 2] = []
            exp[i + 3] = []
            exp[i + 4] = []
            exp[i + 5] = []
            exp[i + 6] = []
            exp[i + 7] = []
    # walk backwards so the indices stay valid while deleting
    for t in range(len(exp) - 1, -1, -1):
        if isinstance(exp[t], list):
            del exp[t]
    return exp
Again, given an HTML page as a list of tokens, this function traverses it and brings into view the tags defined in the loop; here it is just "title", but there are more on the GitHub page. Lots of things remain to do: sanity checks of the HTML code, removing unbalanced tags, checking tag balance and probably more - it depends on the particular page and task.
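The same idea can also be written with slice comparisons, which sidesteps indexing past the end of the list on a truncated page; a minimal sketch with a quick demo (the input string is invented):

```python
def prepare_expression(exp):
    """takes a page as a list of characters and makes the tags visible"""
    for i in range(len(exp)):
        # slice comparison: an out-of-range slice is just short, never an IndexError
        if exp[i:i + 7] == list('<title>'):
            exp[i] = '<title>'
            for j in range(i + 1, i + 7):
                exp[j] = []          # placeholder cells, removed below
        if exp[i:i + 8] == list('</title>'):
            exp[i] = '</title>'
            for j in range(i + 1, i + 8):
                exp[j] = []
    # walk backwards so the indices stay valid while deleting
    for t in range(len(exp) - 1, -1, -1):
        if isinstance(exp[t], list):
            del exp[t]
    return exp

print(prepare_expression(list('<title>Hi</title>')))
# → ['<title>', 'H', 'i', '</title>']
```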
Finally, here is the recursive crawler code:
# crawler
import urllib.request
from parse_tree import *
import re

url = 'http://www.example.com/'
s = set()
List = []

def f_go(List, s, url, iter_cnt):
    try:
        if iter_cnt > 200:
            return
        if url in s:
            return
        s.add(url)
        with urllib.request.urlopen(url) as response:
            html = response.read()
        print("here", url)
        h = html.decode("utf-8")
        lst0 = prepare_expression(list(h))
        ntr = buildNaryParseTree(lst0)
        lst1 = []
        lst2 = nary_tree_tolist(ntr, lst1)
        List.append(lst2)
        f1 = re.finditer(str_regex, h)
        l1 = []
        for tok in f1:
            ind1 = tok.span()
            l1.append(h[ind1[0]:ind1[1]])
        for exp in l1:
            f_go(List, s, exp, iter_cnt + 1)
    except:
        return
The function f_go (function go!) recursively traverses a set of pages (a single domain, with subdomains, though it may be tweaked for different goals); it uses a Python set to store URLs (to avoid repetitions), and the scraped web page content is stored in the Python list List.
How does it work? It takes a URL, returns if it has already been seen, otherwise adds it to the set and tries to open it; if that succeeds, it parses the content and appends it to List, then, using a regexp, it extracts all the URLs in the given domain into a list and, in the last loop, recursively visits them.
The weak points are the arbitrary base case - just a counter (calls stop after a certain limit, iter_cnt of 200 this time, which needs manual tuning) - and the care needed in choosing which subdomains to visit: too wide a selection may make the function slow, too narrow could (certainly will) cause the loss of too much data...
Here are the regexps used:
str_regex = r'(https?:\/\/)?([a-z]+\d\.)?([a-z]+\.)?activeingredients\.[a-z]+(/?(work|about|contact)?/?([a-zA-z-]+)*)?/?'
str_regex1 = r'(https?:\/\/)?([a-z]+\d\.)?([a-z]+\.)?activeingredients\.[a-z]+(/?[^\s"\.]*/?)*'
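A quick check of what the first pattern captures, on an invented sample string:

```python
import re

str_regex = r'(https?:\/\/)?([a-z]+\d\.)?([a-z]+\.)?activeingredients\.[a-z]+(/?(work|about|contact)?/?([a-zA-z-]+)*)?/?'

# one in-domain URL and one outside URL; only the former should match
sample = 'see https://www.activeingredients.com/work and http://www.example.com/'
urls = [m.group(0) for m in re.finditer(str_regex, sample)]
print(urls)  # → ['https://www.activeingredients.com/work']
```

Note that the `[a-zA-z-]` class is broader than it looks: the `A-z` range also covers a few punctuation characters such as `_` and `^`, so it was probably meant to be `A-Z`.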
After all this, an example (a small test, rather):
url = 'http://www.activeingredients.com/'
s = set()
List = []
%time f_go(List, s, url, 0)
print(len(str(List)))
print(str(List))
here http://www.activeingredients.com/
here https://www.activeingredients.com/work
here https://www.activeingredients.com/
here https://www.activeingredients.com/about
here https://www.activeingredients.com/contact
here https://www.activeingredients.com/work/lindsay
here https://www.activeingredients.com/work/lithium
here https://www.activeingredients.com/work/docusign
here https://www.activeingredients.com/work/chicken-of-the-sea
here https://www.activeingredients.com/work/cline
here https://www.activeingredients.com/work/comfort-zone
CPU times: user 1.04 s, sys: 0 ns, total: 1.04 s
Wall time: 13.5 s
2392
[[[], ['A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's', ' ', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', ' ', 'D', 'i', 'g', 'i', 't', 'a', 'l', ' ', 'E', 'x', 'p', 'e', 'r', 'i', 'e', 'n', 'c', 'e', ' ', 'A', 'g', 'e', 'n', 'c', 'y']], [[], ['W', 'o', 'r', 'k', ' ', '\x1f', '\x1f', '\x1f', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's', ' ', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', ' ', 'D', 'i', 'g', 'i', 't', 'a', 'l', ' ', 'E', 'x', 'p', 'e', 'r', 'i', 'e', 'n', 'c', 'e', ' ', 'A', 'g', 'e', 'n', 'c', 'y']], [[], ['A', 'b', 'o', 'u', 't', ' ', '\x1f', '\x1f', '\x1f', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['C', 'o', 'n', 't', 'a', 'c', 't', ' ', '\x1f', '\x1f', '\x1f', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['L', 'i', 'n', 'd', 's', 'a', 'y', ' ', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['L', 'i', 't', 'h', 'i', 'u', 'm', ' ', '&', 'n', 'd', 'a', 
's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['D', 'o', 'c', 'u', 'S', 'i', 'g', 'n', ' ', '\x1f', '\x1f', '\x1f', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['C', 'h', 'i', 'c', 'k', 'e', 'n', ' ', 'o', 'f', ' ', 't', 'h', 'e', ' ', 'S', 'e', 'a', ' ', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['C', 'l', 'i', 'n', 'e', ' ', 'C', 'e', 'l', 'l', 'a', 'r', 's', ' ', '\x1f', '\x1f', '\x1f', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['C', 'o', 'm', 'f', 'o', 'r', 't', ' ', 'Z', 'o', 'n', 'e', ' ', '\x1f', '\x1f', '\x1f', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']]]
I've chosen a small page, just to make things visible, and I've left the debugging print (print("here", url)) in, to check which pages are visited. It's a simple function, but it may give nice material for data analysis. Having spaces as word delimiters, we could easily get back to the original text and do whatever we want: collocations, frequencies, statistics, information retrieval. The biggest problem, I think, is HTML itself. Thanks!