Crawling Web Pages

This notebook crawls a blog post to:

  • extract content related to the blog post using a regular expression
In [1]:
# import required libraries
import re
import requests


In [2]:
def extract_blog_content(content):
    """Extract blog post content using a regex.

    Args:
        content (str): string content returned from requests.get

    Returns:
        str: string content matched by the regex, or None if no match
    """
    content_pattern = re.compile(r'<div class="cms-richtext">(.*?)</div>')
    result = re.findall(content_pattern, content)
    return result[0] if result else None
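The non-greedy `(.*?)` in the pattern stops at the first closing `</div>`, so if the target `div` contains nested `div` elements the match will be truncated. A minimal sketch of the pattern's behaviour on a made-up HTML snippet:

```python
import re

# Hypothetical HTML snippet for illustration only.
sample = '<div class="cms-richtext"><p>Hello</p></div><div>other</div>'

# The non-greedy group captures up to the first </div>, not the last.
match = re.findall(r'<div class="cms-richtext">(.*?)</div>', sample)
print(match[0])  # <p>Hello</p>
```

For HTML with nested markup, an HTML parser such as BeautifulSoup is the more robust choice; the regex works here only because the blog markup is flat.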

Crawl the Web

Set the base URL and the blog post suffix to be parsed

In [3]:
base_url = ""
blog_suffix = "/wannacry-how-to-prepare/12302194"
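Plain string concatenation works here, but `urllib.parse.urljoin` handles stray or missing slashes between the base and the suffix. A sketch using a hypothetical base URL (the real one is not shown in this notebook):

```python
from urllib.parse import urljoin

# Hypothetical base URL for illustration only.
base_url = "https://example.com"
blog_suffix = "/wannacry-how-to-prepare/12302194"

# urljoin normalises the joining slash between base and path.
print(urljoin(base_url, blog_suffix))
# https://example.com/wannacry-how-to-prepare/12302194
```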

Use the requests library to make a GET request

In [4]:
response = requests.get(base_url+blog_suffix)

Identify and parse the blog content using Python's regex library (re)

In [5]:
if response.status_code == 200:
    content = response.text.encode('utf-8', 'ignore').decode('utf-8', 'ignore')
    content = content.replace("\n", '')
    blog_post_content = extract_blog_content(content)

View the first 500 characters of the blog post

In [6]:
blog_post_content[:500]
Out[6]:
'<p class="intro--paragraph"><em>By Mike Halsey</em></p><p><br/></p><p>It was a perfectly ordinary Friday when the Wannacry ransomware struck in May 2017. The malware spread around the world to more than 150 countries in just a matter of a few hours, affecting the National Health Service in the UK, telecoms provider Telefonica in Spain, and many other organisations and businesses in the USA, Canada, China, Japan, Russia, and right across Europe, the Middle-East, and Asia.</p><p>The malware was re'