This notebook crawls a blog post on apress.com to:
# import required libraries
import re
import requests
def extract_blog_content(content):
    """Extract the blog post body from a page's HTML using a regex.

    Args:
        content (str): HTML text of the page, e.g. response.text from requests.get

    Returns:
        str: first match of the content pattern, or the string "None" if nothing matches
    """
    # Non-greedy match of everything inside the first cms-richtext div
    content_pattern = re.compile(r'<div class="cms-richtext">(.*?)</div>')
    result = content_pattern.findall(content)
    return result[0] if result else "None"
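The pattern in extract_blog_content is non-greedy, so it stops at the first closing </div> it meets. The sketch below illustrates what that means in practice. The helper name first_richtext_block and both sample strings are made up for illustration and do not come from the Apress page; the helper also returns a real None rather than the string "None", which is a deliberate difference from the function above.

```python
import re

# Same pattern as in extract_blog_content.
CONTENT_PATTERN = re.compile(r'<div class="cms-richtext">(.*?)</div>')


def first_richtext_block(html):
    """Return the body of the first cms-richtext div, or None if absent.

    Hypothetical helper for illustration only.
    """
    match = CONTENT_PATTERN.search(html)
    return match.group(1) if match else None


# Hypothetical markup: one flat block, one block with a nested div.
sample_flat = '<div class="cms-richtext"><p>Hello</p></div><div>footer</div>'
sample_nested = '<div class="cms-richtext"><div>inner</div><p>tail</p></div>'

flat_result = first_richtext_block(sample_flat)
nested_result = first_richtext_block(sample_nested)

print(repr(flat_result))    # whole paragraph is captured
print(repr(nested_result))  # capture is cut off at the inner closing tag
```

If the real page nests a div inside the rich-text block, the regex would truncate the post in the same way, and an HTML parser would likely be a safer choice than a regex.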
Set the URL and blog post to be parsed
base_url = "http://www.apress.com/in/blog/all-blog-posts"
blog_suffix = "/wannacry-how-to-prepare/12302194"
Use the requests library to make a GET request
response = requests.get(base_url+blog_suffix)
Identify and parse the blog content using Python's regex library (re)
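if response.status_code == 200:
    # Drop any characters that cannot be represented in UTF-8
    content = response.text.encode('utf-8', 'ignore').decode('utf-8', 'ignore')
    # Remove newlines, because "." in the regex does not match them by default
    content = content.replace("\n", '')
    blog_post_content = extract_blog_content(content)
else:
    # Keep the name defined so the next cell does not raise a NameError
    blog_post_content = "None"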
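The newlines are removed before matching because "." does not match a newline by default. As a possible alternative, the same pattern can be compiled with re.DOTALL, which leaves the line breaks in the extracted text. This is a minimal sketch under that assumption, not the notebook's method; extract_keeping_newlines and the sample string are hypothetical.

```python
import re

RICHTEXT = r'<div class="cms-richtext">(.*?)</div>'

# re.DOTALL lets "." match newlines, so the input needs no preprocessing.
DOTALL_PATTERN = re.compile(RICHTEXT, re.DOTALL)


def extract_keeping_newlines(html):
    """Return the first cms-richtext body with line breaks intact, or None."""
    match = DOTALL_PATTERN.search(html)
    return match.group(1) if match else None


# Hypothetical multi-line markup for illustration.
sample = '<div class="cms-richtext">\n<p>line one</p>\n<p>line two</p>\n</div>'

# Without DOTALL and without stripping newlines, nothing matches.
no_flag = re.findall(RICHTEXT, sample)

# The notebook's approach: strip newlines first, then match.
stripped = re.findall(RICHTEXT, sample.replace("\n", ""))

# The alternative: keep newlines and match with DOTALL.
kept = extract_keeping_newlines(sample)
```

Both approaches find the block; they differ only in whether the original line breaks survive in the result.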
View the first 500 characters of the blog post
blog_post_content[0:500]
'<p class="intro--paragraph"><em>By Mike Halsey</em></p><p><br/></p><p>It was a perfectly ordinary Friday when the Wannacry ransomware struck in May 2017. The malware spread around the world to more than 150 countries in just a matter of a few hours, affecting the National Health Service in the UK, telecoms provider Telefonica in Spain, and many other organisations and businesses in the USA, Canada, China, Japan, Russia, and right across Europe, the Middle-East, and Asia.</p><p>The malware was re'