As I'm interested in text processing, I decided to write a simple web page crawler. I know there is the Python library Beautiful Soup, but writing a parser is interesting in itself... OK, let's go.
This is an n-ary tree, its traversal, and a function to build an abstract syntax tree - once those are properly done, we are home.
# n-ary tree class
class NaryTree:
    """n-ary tree, a Python list as an array of children"""
    def __init__(self, rootObj):
        self.key = rootObj
        self.parent = None
        self.kids = None

    def insert(self, newNode):
        if self.kids:
            t = NaryTree(newNode)
            self.kids.append(t)
            t.parent = self
        else:
            t = NaryTree(newNode)
            self.kids = []
            self.kids.append(t)
            t.parent = self

    def setRootVal(self, obj):
        self.key.append(obj)

    def getRootVal(self):
        return self.key

    def getParent(self):
        return self.parent

    def getNthKid(self, i=-1):  # return child i; the rightmost one if i is not given
        return self.kids[i]
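A quick sanity check of the class (a condensed, self-contained copy of the code above, with hypothetical 'a'/'b'/'c' payloads):

```python
class NaryTree:
    """n-ary tree; a Python list serves as the array of children"""
    def __init__(self, rootObj):
        self.key = rootObj
        self.parent = None
        self.kids = None

    def insert(self, newNode):
        # create the child, link it back to us, append left to right
        t = NaryTree(newNode)
        t.parent = self
        if self.kids is None:
            self.kids = []
        self.kids.append(t)

# build a root with three children, appended left to right
root = NaryTree('root')
for label in ('a', 'b', 'c'):
    root.insert(label)

print([k.key for k in root.kids])    # → ['a', 'b', 'c'], insertion order preserved
print(root.kids[-1].parent is root)  # → True, every child points back to its parent
```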
The tree code is surprisingly simple - simpler than a binary tree class! The nodes sit inside a Python list (self.kids), which can be any size we want; the insert function creates child nodes, appending them to the list from left to right. Traversal:
# tree traversal and conversion to a Python list
def traverse(tree):  # DFS over the tree
    if tree:
        stk = []
        stk.append(tree)
        while len(stk) > 0:
            top = stk.pop()
            if top.kids:
                for child in top.kids:
                    stk.append(child)
            print(top.getRootVal())
    else:
        return None

def nary_tree_tolist(tree, lst):  # convert the tree to a list
    if tree:
        stk = []
        stk.append(tree)
        while len(stk) > 0:
            top = stk.pop()
            if top.kids:
                for child in top.kids:
                    stk.append(child)
            lst.append(top.getRootVal())
        return lst
    else:
        return None
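To see the pop order concretely, here is a self-contained run of nary_tree_tolist on a small hand-built tree (class and function condensed from the code above):

```python
class NaryTree:
    def __init__(self, rootObj):
        self.key = rootObj
        self.parent = None
        self.kids = None

    def insert(self, newNode):
        t = NaryTree(newNode)
        t.parent = self
        if self.kids is None:
            self.kids = []
        self.kids.append(t)

def nary_tree_tolist(tree, lst):
    """iterative DFS: pop a node, push its children, record its key"""
    if not tree:
        return None
    stk = [tree]
    while stk:
        top = stk.pop()
        if top.kids:
            stk.extend(top.kids)
        lst.append(top.key)
    return lst

root = NaryTree(1)
root.insert(2)
root.insert(3)
root.kids[0].insert(4)  # 4 is a child of node 2

print(nary_tree_tolist(root, []))  # → [1, 3, 2, 4]
```

Because a stack is used, siblings come out right to left: node 3 is listed before node 2.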
The traverse function is similar to DFS on a graph - yes, a tree is a graph, no problem ;). The second function just transfers the tree's content to a list. Finally, time to build the syntax tree:
# given the supplied delimiters, builds a syntax tree
def buildNaryParseTree(lexp):
    """build a syntax tree from a tokenized list expression, returns an n-ary tree"""
    tree = NaryTree([])
    cur = tree
    ctr = 0
    ctr1 = 0
    for token in lexp:
        if token == '<p>':
            cur.insert([])
            cur = cur.getNthKid()
            ctr += 1
        elif token == '<title>':
            cur.insert([])
            cur = cur.getNthKid()
            ctr1 += 1
        elif token not in ['<p>', '</p>', '<title>', '</title>'] and (ctr != 0 or ctr1 != 0):
            cur.setRootVal(token)
        elif token == '</p>':
            ctr -= 1
            cur = cur.getParent()
        elif token == '</title>':
            ctr1 -= 1
            cur = cur.getParent()
    return tree
The function parses an expression in the standard way: on an opening token it creates a new node and moves into it; on a closing token it goes one level up the tree; in the meantime it writes down what's inside the delimiters (cur.setRootVal(token)).
The variables ctr and ctr1 control that writing (so that the whole page doesn't end up in the tree - just what's between the delimiters).
Of course more tags must be added; this is just for testing whether the design is correct (I believe it is!).
In general the function can parse anything, whether the delimiters are nested or not (I checked lots of expressions).
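Putting the pieces together on a hypothetical token list - condensed copies of the class and helpers above, so the snippet runs standalone (same logic as the function above, with the branch order slightly rearranged):

```python
class NaryTree:
    def __init__(self, rootObj):
        self.key = rootObj
        self.parent = None
        self.kids = None

    def insert(self, newNode):
        t = NaryTree(newNode)
        t.parent = self
        if self.kids is None:
            self.kids = []
        self.kids.append(t)

def buildNaryParseTree(lexp):
    """opening tag: descend into a fresh node; closing tag: climb back up"""
    tree = NaryTree([])
    cur = tree
    ctr = ctr1 = 0
    for token in lexp:
        if token == '<p>':
            cur.insert([])
            cur = cur.kids[-1]   # descend into the newest child
            ctr += 1
        elif token == '<title>':
            cur.insert([])
            cur = cur.kids[-1]
            ctr1 += 1
        elif token == '</p>':
            ctr -= 1
            cur = cur.parent
        elif token == '</title>':
            ctr1 -= 1
            cur = cur.parent
        elif ctr != 0 or ctr1 != 0:
            cur.key.append(token)  # record text only while inside a tag
    return tree

def nary_tree_tolist(tree, lst):
    stk = [tree]
    while stk:
        top = stk.pop()
        if top.kids:
            stk.extend(top.kids)
        lst.append(top.key)
    return lst

tokens = ['junk', '<title>', 'H', 'i', '</title>', 'junk']
print(nary_tree_tolist(buildNaryParseTree(tokens), []))
# → [[], ['H', 'i']] : the root key stays empty, the <title> node collects its text
```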
The worst part is now preparing the HTML code for parsing; I think I'll just scratch the surface ;/...
def prepare_expression(exp):
    """takes a page as a list of characters and makes the tags visible"""
    l = len(exp)
    for i in range(l):
        if exp[i] == '<' and exp[i + 1] == 't' and exp[i + 2] == 'i' and exp[i + 3] == 't' and exp[i + 4] == 'l' and \
                exp[i + 5] == 'e' and exp[i + 6] == '>':
            exp[i] = '<title>'
            exp[i + 1] = []
            exp[i + 2] = []
            exp[i + 3] = []
            exp[i + 4] = []
            exp[i + 5] = []
            exp[i + 6] = []
        if exp[i] == '<' and exp[i + 1] == '/' and exp[i + 2] == 't' and exp[i + 3] == 'i' and exp[i + 4] == 't' and \
                exp[i + 5] == 'l' and exp[i + 6] == 'e' and exp[i + 7] == '>':
            exp[i] = '</title>'
            exp[i + 1] = []
            exp[i + 2] = []
            exp[i + 3] = []
            exp[i + 4] = []
            exp[i + 5] = []
            exp[i + 6] = []
            exp[i + 7] = []
    # walk backwards so the indices stay valid while deleting
    for t in range(len(exp) - 1, -1, -1):
        if isinstance(exp[t], list):
            del exp[t]
    return exp
Again, given an HTML page as a list of tokens, this function traverses it and brings into view the tags defined in the loop; here it is just "title", but there are more on the GitHub page. Lots of things remain to do: sanity checks of the HTML code, removing unbalanced tags, checking tag balance and probably more - it depends on the particular page and task.
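The same idea can also be written with slice comparisons, which sidesteps indexing past the end of the list on a truncated page; a minimal sketch with a quick demo (the input string is invented):

```python
def prepare_expression(exp):
    """takes a page as a list of characters and makes the tags visible"""
    for i in range(len(exp)):
        # slice comparison: an out-of-range slice is just short, never an IndexError
        if exp[i:i + 7] == list('<title>'):
            exp[i] = '<title>'
            for j in range(i + 1, i + 7):
                exp[j] = []          # placeholder cells, removed below
        if exp[i:i + 8] == list('</title>'):
            exp[i] = '</title>'
            for j in range(i + 1, i + 8):
                exp[j] = []
    # walk backwards so the indices stay valid while deleting
    for t in range(len(exp) - 1, -1, -1):
        if isinstance(exp[t], list):
            del exp[t]
    return exp

print(prepare_expression(list('<title>Hi</title>')))
# → ['<title>', 'H', 'i', '</title>']
```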
Finally, here is the recursive crawler code:
# crawler
import urllib.request
from parse_tree import *
import re

url = 'http://www.example.com/'
s = set()
List = []

def f_go(List, s, url, iter_cnt):
    try:
        if iter_cnt > 200:
            return
        if url in s:
            return
        s.add(url)
        with urllib.request.urlopen(url) as response:
            html = response.read()
        print("here", url)
        h = html.decode("utf-8")
        lst0 = prepare_expression(list(h))
        ntr = buildNaryParseTree(lst0)
        lst1 = []
        lst2 = nary_tree_tolist(ntr, lst1)
        List.append(lst2)
        f1 = re.finditer(str_regex, h)
        l1 = []
        for tok in f1:
            ind1 = tok.span()
            l1.append(h[ind1[0]:ind1[1]])
        for exp in l1:
            f_go(List, s, exp, iter_cnt + 1)
    except:
        return
The function f_go (function go!) recursively traverses a set of pages (a single domain, with subdomains, though it may be tweaked for different goals); it uses a Python set to store URLs (to avoid repetitions), and the scraped web page content is stored in the Python list List.
How does it work? It takes a URL, returns if it has already been seen, otherwise adds it to the set and tries to open it; if that succeeds, it parses the content and appends it to List, then, using a regexp, it extracts all the URLs in the given domain into a list and, in the last loop, recursively visits them.
The weak points are the arbitrary base case - just a counter (calls stop after a certain limit, iter_cnt of 200 this time, which needs manual tuning) - and the care needed in choosing which subdomains to visit: too wide a selection may make the function slow, too narrow could (certainly will) cause the loss of too much data...
Here are the regexps used:
str_regex = r'(https?:\/\/)?([a-z]+\d\.)?([a-z]+\.)?activeingredients\.[a-z]+(/?(work|about|contact)?/?([a-zA-z-]+)*)?/?'
str_regex1 = r'(https?:\/\/)?([a-z]+\d\.)?([a-z]+\.)?activeingredients\.[a-z]+(/?[^\s"\.]*/?)*'
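A quick check of what the first pattern captures, on an invented sample string:

```python
import re

str_regex = r'(https?:\/\/)?([a-z]+\d\.)?([a-z]+\.)?activeingredients\.[a-z]+(/?(work|about|contact)?/?([a-zA-z-]+)*)?/?'

# one in-domain URL and one outside URL; only the former should match
sample = 'see https://www.activeingredients.com/work and http://www.example.com/'
urls = [m.group(0) for m in re.finditer(str_regex, sample)]
print(urls)  # → ['https://www.activeingredients.com/work']
```

Note that the `[a-zA-z-]` class is broader than it looks: the `A-z` range also covers a few punctuation characters such as `_` and `^`, so it was probably meant to be `A-Z`.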
After all this, an example (a small test, rather):
url = 'http://www.activeingredients.com/'
s = set()
List = []
%time f_go(List, s, url, 0)
print(len(str(List)))
print(str(List))
here http://www.activeingredients.com/
here https://www.activeingredients.com/work
here https://www.activeingredients.com/
here https://www.activeingredients.com/about
here https://www.activeingredients.com/contact
here https://www.activeingredients.com/work/lindsay
here https://www.activeingredients.com/work/lithium
here https://www.activeingredients.com/work/docusign
here https://www.activeingredients.com/work/chicken-of-the-sea
here https://www.activeingredients.com/work/cline
here https://www.activeingredients.com/work/comfort-zone
CPU times: user 1.04 s, sys: 0 ns, total: 1.04 s
Wall time: 13.5 s
2392
[[[], ['A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's', ' ', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', ' ', 'D', 'i', 'g', 'i', 't', 'a', 'l', ' ', 'E', 'x', 'p', 'e', 'r', 'i', 'e', 'n', 'c', 'e', ' ', 'A', 'g', 'e', 'n', 'c', 'y']], [[], ['W', 'o', 'r', 'k', ' ', '\x1f', '\x1f', '\x1f', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's', ' ', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', ' ', 'D', 'i', 'g', 'i', 't', 'a', 'l', ' ', 'E', 'x', 'p', 'e', 'r', 'i', 'e', 'n', 'c', 'e', ' ', 'A', 'g', 'e', 'n', 'c', 'y']], [[], ['A', 'b', 'o', 'u', 't', ' ', '\x1f', '\x1f', '\x1f', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['C', 'o', 'n', 't', 'a', 'c', 't', ' ', '\x1f', '\x1f', '\x1f', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['L', 'i', 'n', 'd', 's', 'a', 'y', ' ', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['L', 'i', 't', 'h', 'i', 'u', 'm', ' ', '&', 'n', 'd', 'a', 
's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['D', 'o', 'c', 'u', 'S', 'i', 'g', 'n', ' ', '\x1f', '\x1f', '\x1f', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['C', 'h', 'i', 'c', 'k', 'e', 'n', ' ', 'o', 'f', ' ', 't', 'h', 'e', ' ', 'S', 'e', 'a', ' ', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['C', 'l', 'i', 'n', 'e', ' ', 'C', 'e', 'l', 'l', 'a', 'r', 's', ' ', '\x1f', '\x1f', '\x1f', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['C', 'o', 'm', 'f', 'o', 'r', 't', ' ', 'Z', 'o', 'n', 'e', ' ', '\x1f', '\x1f', '\x1f', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']]]
I've chosen a small page, just to make things visible, and I've left the debugging print (print("here", url)) in, to check which pages are visited. It's a simple function, but it may give nice material for data analysis. Having spaces as word delimiters, we could easily get back to the original text and do whatever we want: collocations, frequencies, statistics, information retrieval. The biggest problem, I think, is HTML itself. Thanks!