This notebook crawls a blog post on apress.com and extracts its content with a regular expression.
# import required libraries
import re
import requests
def extract_blog_content(content):
    """Extract blog post content using a regex.

    Args:
        content (str): HTML text of the page, e.g. from requests.get(...).text

    Returns:
        list of str: every match of the regex, or the string "None"
        when nothing matches
    """
    # Non-greedy match of everything inside the rich-text container div
    content_pattern = re.compile(r'<div class="cms-richtext">(.*?)</div>')
    result = re.findall(content_pattern, content)
    return result if result else "None"
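As a quick offline check, the sketch below runs the same pattern against two small hand-written HTML strings. The sample markup is invented for illustration and is not taken from apress.com. The second string shows a limitation of the non-greedy pattern: the match ends at the first closing div tag, so a nested div cuts the captured content short.

```python
import re

# Same pattern as in extract_blog_content
content_pattern = re.compile(r'<div class="cms-richtext">(.*?)</div>')

# Hypothetical markup, made up for this example
sample_html = (
    '<html><body><div class="cms-richtext">'
    '<p>Hello</p><p>World</p>'
    '</div></body></html>'
)
nested_html = '<div class="cms-richtext"><div>inner</div><p>after</p></div>'

matches = re.findall(content_pattern, sample_html)
nested_matches = re.findall(content_pattern, nested_html)

print(matches)         # content of the rich-text div
print(nested_matches)  # truncated at the first closing div tag
```

For pages where the container may hold nested divs, an HTML parser would likely be more robust than a regex.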
Set the URL and blog post to be parsed
base_url = "http://www.apress.com/in/blog/all-blog-posts"
blog_suffix = "/wannacry-how-to-prepare/12302194"
Use the requests library to make a GET request
response = requests.get(base_url+blog_suffix)
Identify and parse the blog content using Python's regex library (re)
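if response.status_code == 200:
    # Drop any characters that cannot be encoded as UTF-8
    content = response.text.encode('utf-8', 'ignore').decode('utf-8', 'ignore')
    # Remove newlines so the regex can match across the whole page
    content = content.replace("\n", '')
    blog_post_content = extract_blog_content(content)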
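The newline removal matters because `.` does not match a newline by default, so content that spans several lines would otherwise not match at all. The sketch below illustrates this on a made-up HTML string and shows one possible alternative, assuming you would rather keep the original line breaks: compiling the pattern with `re.DOTALL`.

```python
import re

# Hypothetical two-line markup, invented for this example
html = '<div class="cms-richtext"><p>line one</p>\n<p>line two</p></div>'

default_pattern = re.compile(r'<div class="cms-richtext">(.*?)</div>')
dotall_pattern = re.compile(r'<div class="cms-richtext">(.*?)</div>', re.DOTALL)

# Without any handling, the newline blocks the match
without_flag = default_pattern.findall(html)

# The notebook's approach: strip newlines before matching
stripped = default_pattern.findall(html.replace("\n", ""))

# Alternative: let "." match newlines and keep the text unchanged
with_flag = dotall_pattern.findall(html)

print(without_flag)
print(stripped)
print(with_flag)
```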
View the first 500 characters of the blog post
'<p class="intro--paragraph"><em>By Mike Halsey</em></p><p><br/></p><p>It was a perfectly ordinary Friday when the Wannacry ransomware struck in May 2017. The malware spread around the world to more than 150 countries in just a matter of a few hours, affecting the National Health Service in the UK, telecoms provider Telefonica in Spain, and many other organisations and businesses in the USA, Canada, China, Japan, Russia, and right across Europe, the Middle-East, and Asia.</p><p>The malware was re'
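The cell that produced the preview above is not shown. A minimal sketch of one way to get such a preview, assuming the result is the list returned by `re.findall`, is to slice the first match. The helper name `preview_first_match` and the sample data are hypothetical and not part of the original notebook.

```python
def preview_first_match(matches, limit=500):
    """Return the first `limit` characters of the first match, or '' if none."""
    # extract_blog_content returns the string "None" when nothing matched,
    # so anything that is not a non-empty list is treated as "no match"
    if not isinstance(matches, list) or not matches:
        return ""
    return matches[0][:limit]


# Made-up stand-in for blog_post_content
sample_matches = ["<p>" + "x" * 1000 + "</p>"]
preview = preview_first_match(sample_matches)
print(len(preview))
```

In the notebook, the equivalent call would be `preview_first_match(blog_post_content)`.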