Crawling Web Pages

This notebook crawls an apress.com blog post to:

  • extract the blog post content using regex
In [1]:
# import required libraries
import re
import requests

Utility

In [2]:
def extract_blog_content(content):
    """This function extracts blog post content using regex

    Args:
        content (str): HTML content (as a string) returned from requests.get

    Returns:
        str: matched blog content, or the string "None" if no match is found

    """
    # non-greedy match: capture everything between the opening
    # <div class="cms-richtext"> tag and the first closing </div>
    content_pattern = re.compile(r'<div class="cms-richtext">(.*?)</div>')
    result = re.findall(content_pattern, content)
    return result[0] if result else "None"
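
As an illustration, the cell below runs extract_blog_content on a small, made-up HTML fragment; the wrapper and its contents are hypothetical and serve only to show that the non-greedy pattern stops at the first closing </div>.

In [ ]:
# illustrative only: a made-up fragment mimicking the page structure
sample_html = '<div class="cms-richtext"><p>Hello, reader!</p></div><div>footer</div>'
extract_blog_content(sample_html)
# expected value: '<p>Hello, reader!</p>'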

Crawl the Web

Set the URL and blog post to be parsed

In [3]:
base_url = "http://www.apress.com/in/blog/all-blog-posts"
blog_suffix = "/wannacry-how-to-prepare/12302194"

Use the requests library to make a GET request

In [4]:
response = requests.get(base_url + blog_suffix)
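
If the request hangs or is blocked, a variant with an explicit timeout and a User-Agent header can help; the header value below is an arbitrary placeholder, not something apress.com requires.

In [ ]:
# optional variant: same GET request with a timeout and a placeholder User-Agent header
headers = {"User-Agent": "Mozilla/5.0 (compatible; blog-crawler-example)"}
response = requests.get(base_url + blog_suffix, headers=headers, timeout=10)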

Identify and parse the blog content using Python's regex library (re)

In [5]:
if response.status_code == 200:
    content = response.text.encode('utf-8', 'ignore').decode('utf-8', 'ignore')
    # remove newlines so the non-DOTALL regex can match content that spans multiple lines
    content = content.replace("\n", '')
    blog_post_content = extract_blog_content(content)
else:
    blog_post_content = "None"

View the first 500 characters of the blog post

In [6]:
blog_post_content[0:500]
Out[6]:
'<p class="intro--paragraph"><em>By Mike Halsey</em></p><p><br/></p><p>It was a perfectly ordinary Friday when the Wannacry ransomware struck in May 2017. The malware spread around the world to more than 150 countries in just a matter of a few hours, affecting the National Health Service in the UK, telecoms provider Telefonica in Spain, and many other organisations and businesses in the USA, Canada, China, Japan, Russia, and right across Europe, the Middle-East, and Asia.</p><p>The malware was re'