Crawling Web Pages using Beautiful Soup

This notebook crawls the blog page of apress.com, using requests and BeautifulSoup, to:

  • extract the list of recent blog post titles and their URLs
  • extract the content of each blog post as plain text

In [1]:
# import required libraries
import requests
from time import sleep
from bs4 import BeautifulSoup

Utilities

In [2]:
def get_post_mapping(content):
    """Extract blog post titles and URLs from the listing page

    Args:
        content (bytes): Response body (response.content) returned from requests.get

    Returns:
        list: a list of dictionaries with keys title and url

    """
    post_detail_list = []
    post_soup = BeautifulSoup(content, "lxml")
    h3_content = post_soup.find_all("h3")

    for h3 in h3_content:
        # skip headings that do not wrap a link; h3.a is None for those
        if h3.a is None:
            continue
        post_detail_list.append(
            {'title': h3.a.get_text(), 'url': h3.a.attrs.get('href')}
        )

    return post_detail_list


def get_post_content(content):
    """Extract the content of a blog post as plain text

    Args:
        content (bytes): Response body (response.content) returned from requests.get

    Returns:
        str: blog's content in plain text, or an empty string if the page
            has no div with class cms-richtext

    """
    text_soup = BeautifulSoup(content, "lxml")
    para_list = text_soup.find_all("div",
                                   {'class': 'cms-richtext'})

    # guard against pages that lack the expected container
    if not para_list:
        return ""

    # collect the text of every node inside the first matching div
    return para_list[0].get_text()
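
The two helpers can be tried without touching the network. The sketch below is only an illustration: the HTML snippets are made up to mimic the structure the helpers look for (h3 tags wrapping links, and a div with class cms-richtext), the names extract_links and extract_text are hypothetical stand-ins for the functions above, and it uses the built-in "html.parser" instead of "lxml" so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the expected structure (not copied from the site).
SAMPLE_LISTING = """
<html><body>
  <h3><a href="/in/blog/all-blog-posts/first-post">First Post</a></h3>
  <h3>Section heading without a link</h3>
  <h3><a href="/in/blog/all-blog-posts/second-post">Second Post</a></h3>
</body></html>
"""

SAMPLE_POST = """
<html><body>
  <div class="cms-richtext"><p>By Jane Doe</p><p>Hello, world.</p></div>
</body></html>
"""


def extract_links(html):
    """Return title/url dictionaries for every <h3> that wraps an <a> tag."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"title": h3.a.get_text(strip=True), "url": h3.a.get("href")}
        for h3 in soup.find_all("h3")
        if h3.a is not None
    ]


def extract_text(html):
    """Return the text of the first cms-richtext div, or "" if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", class_="cms-richtext")
    # a space separator keeps adjacent paragraphs from running together
    return body.get_text(" ", strip=True) if body else ""


links = extract_links(SAMPLE_LISTING)
text = extract_text(SAMPLE_POST)
print(links)
print(text)
```

Passing a separator to get_text is a design choice: without it, the text of neighbouring paragraphs is joined with nothing in between.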

Crawl the Web

Set the URL of the blog listing page and the prefix used to build post URLs

In [3]:
crawl_url = "http://www.apress.com/in/blog/all-blog-posts"
post_url_prefix = "http://www.apress.com"
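
Joining the prefix and a link by plain string concatenation works as long as every href is a root-relative path. A slightly more forgiving alternative, sketched here with the standard library's urljoin, also copes with hrefs that are already absolute. The helper name absolute_url and the sample paths are hypothetical.

```python
from urllib.parse import urljoin

post_url_prefix = "http://www.apress.com"


def absolute_url(href, base=post_url_prefix):
    """Resolve an href taken from the listing page against the site root."""
    return urljoin(base, href)


# a root-relative href is attached to the prefix
print(absolute_url("/in/blog/all-blog-posts/sample-post"))
# an href that is already absolute is returned unchanged
print(absolute_url("http://www.apress.com/in/blog/other-post"))
```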

Crawl recent posts on the website

In [4]:
response = requests.get(crawl_url)
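
A bare requests.get call has no timeout, so it can wait indefinitely on an unresponsive server. One possible hardening, shown as a sketch rather than as part of the original notebook, wraps the call in a small helper. The name fetch_html, the session parameter and the 10 second timeout are assumptions made for illustration.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    session is optional: any object with a requests-style get() method
    works, which also makes the helper easy to exercise without a network.
    """
    http = session or requests
    try:
        response = http.get(url, timeout=timeout)
        # turn 4xx and 5xx status codes into exceptions
        response.raise_for_status()
    except requests.RequestException as exc:
        print("Request failed for {}: {}".format(url, exc))
        return None
    return response.content


# Example call (commented out so that this cell makes no network request):
# content = fetch_html("http://www.apress.com/in/blog/all-blog-posts")
```

Returning None on failure lets the caller decide whether to skip the page or stop the crawl.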

Extract blog post title and url from response object

In [5]:
# start with an empty list so later cells still run if the request failed
blog_post_details = []
if response.status_code == 200:
    blog_post_details = get_post_mapping(response.content)

For each recent post, fetch the page and extract its plain-text content using Beautiful Soup

In [6]:
if blog_post_details:
    print("Blog posts found:{}".format(len(blog_post_details)))

    for post in blog_post_details:
        print("Crawling content for post titled:", post.get('title'))
        # default to an empty string so every post has a 'content' key,
        # even when the link is missing or the request fails
        post['content'] = ""

        if post.get('url'):
            post_response = requests.get(post_url_prefix + post.get('url'))

            if post_response.status_code == 200:
                post['content'] = get_post_content(post_response.content)

        print("Waiting for 10 secs before crawling next post...\n\n")
        sleep(10)

    print("Content crawled for all posts")
Blog posts found:20
Crawling content for post titled: Creating Complex Validation Rules Using Fluent Validation with ASP.NET Core
Waiting for 10 secs before crawling next post...


Crawling content for post titled: Writing Functions in Python
Waiting for 10 secs before crawling next post...


Crawling content for post titled: Push Notifications: Responsible Web App Development
Waiting for 10 secs before crawling next post...


Crawling content for post titled: Best practices for using Simple Lookup Tables
Waiting for 10 secs before crawling next post...


Crawling content for post titled: Learn AI - The Time is NOW
Waiting for 10 secs before crawling next post...


Crawling content for post titled: The Definitive Guide to Shopify Themes
Waiting for 10 secs before crawling next post...


Crawling content for post titled: Why a Deadlock Is Not Just “Really Bad Blocking”
Waiting for 10 secs before crawling next post...


Crawling content for post titled: Testing, 1-2-3: Getting Started Debugging Python
Waiting for 10 secs before crawling next post...


Crawling content for post titled: Surviving the Corporate PowerPoint Template
Waiting for 10 secs before crawling next post...


Crawling content for post titled: Beginning Data Science and Supervised Learning in R
Waiting for 10 secs before crawling next post...


Crawling content for post titled: Common Table Expressions vs. Derived Tables
Waiting for 10 secs before crawling next post...


Crawling content for post titled: The Power of Pixlr
Waiting for 10 secs before crawling next post...


Crawling content for post titled: Math and Science with a 3D Printer
Waiting for 10 secs before crawling next post...


Crawling content for post titled: A Brief Primer to the Internet of Things
Waiting for 10 secs before crawling next post...


Crawling content for post titled: Wannacry: Why It's Only the Beginning, and How to Prepare for What Comes Next
Waiting for 10 secs before crawling next post...


Crawling content for post titled: Reusing ngrx/effects in Angular (communicating between reducers)
Waiting for 10 secs before crawling next post...


Crawling content for post titled: Interview with Tony Smith - Author and SharePoint Expert
Waiting for 10 secs before crawling next post...


Crawling content for post titled: Making Sense of Sensors – Types and Levels of Recognition
Waiting for 10 secs before crawling next post...


Crawling content for post titled: VS 2017, .NET Core, and JavaScript Frameworks, Oh My!
Waiting for 10 secs before crawling next post...


Crawling content for post titled: Relabel the Email Send Button “Make Public”
Waiting for 10 secs before crawling next post...


Content crawled for all posts
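
The loop above also waits 10 seconds after the last post, which is harmless but unnecessary. A minimal sketch of a loop that pauses only between requests follows. The helper crawl_politely and its fetch and sleep_fn parameters are hypothetical; they are injected so the example runs instantly without a network connection.

```python
from time import sleep


def crawl_politely(urls, fetch, delay_secs=10, sleep_fn=sleep):
    """Call fetch(url) for each URL, pausing between requests only."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # wait before every request except the first one
            sleep_fn(delay_secs)
        results[url] = fetch(url)
    return results


# Demonstration with a stand-in fetch function and no real waiting.
pauses = []
pages = crawl_politely(
    ["/post-1", "/post-2", "/post-3"],
    fetch=lambda url: "<html>{}</html>".format(url),
    sleep_fn=pauses.append,
)
print(len(pages), pauses)  # prints: 3 [10, 10]
```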

Print Title and Content for first 5 posts

In [7]:
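# print the title and the first 250 characters of content
for post in blog_post_details[:5]:
    print("title:{}\n-----------".format(post['title']))
    # .get avoids a KeyError for posts whose content could not be crawled
    print("content:{}\n".format(post.get('content', '')[0:250]))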
title:Creating Complex Validation Rules Using Fluent Validation with ASP.NET Core
-----------
content:By John Ciliberti Many ecommerce web sites are driven by user input and the choices they make. As a developer, you want to help them make the right choices and have a positive experience with your site, so they will complete their purchase, and retur

title:Writing Functions in Python
-----------
content:By Paul Gerrard Why Write Functions?When you write more complicated programs, you can choose to write them in long, complicated modules, but complicated modules are harder to write and difficult to understand. A better approach is to modularize a com

title:Push Notifications: Responsible Web App Development
-----------
content:By Dennis Sheppard While push notifications on the web are a powerful feature that inches the web ever closer to native apps, some developers have started to transform them into trite annoyances that have conditioned users to ignore notifications or 

title:Best practices for using Simple Lookup Tables
-----------
content:By Nick Harrison Simple lookup tables are just one type of logic table that you may find useful. This article explores what simple lookup tables look like, discusses when they may be applicable, and steps through some of the details for proper implem

title:Learn AI - The Time is NOW
-----------
content:By Nishith PathakImagine creating a software so smart that it will not only understand human languages but also slangs and subtle variations of these languages, such that your software will know that “Hello, Computer! How are you doing?” and “wassup
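
The cell above only prints the posts. To keep the crawled data, one option is to write the list of dictionaries to a JSON file. This is a sketch under stated assumptions: the helper names save_posts and load_posts and the file name posts.json are hypothetical, and ensure_ascii=False is chosen so that characters such as curly quotes in titles stay readable in the file.

```python
import json


def save_posts(posts, path):
    """Write the list of post dictionaries to a UTF-8 encoded JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(posts, handle, ensure_ascii=False, indent=2)


def load_posts(path):
    """Read the list of post dictionaries back from the JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Example call (commented out so that this cell writes nothing by itself):
# save_posts(blog_post_details, "posts.json")
```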