#!/usr/bin/env python
# coding: utf-8

# # Scraping StackOverflow
#
# In this project, we will scrape the StackOverflow website and:
#
# - [Goal 1: List the most mentioned/tagged languages along with their tag counts](#Goal1)
# - [Goal 2: List the most voted questions along with their attributes (summary, tags, number of votes, answers and views)](#Goal2)
#
# We will divide our project into the two goals above.
#
# Before starting our project, we need to understand a few basics of web pages and web scraping.

# # Web Scraping Basics
#
# When we visit a page, our browser makes a request to a web server. Most of the time, this is a [GET Request](https://realpython.com/lessons/the-get-request/). The browser then receives a set of files, typically HTML, CSS and JavaScript. HTML contains the content, while CSS and JavaScript tell the browser how to render the webpage. So, we will mainly be interested in the HTML file.
#
# ### HTML:
# HTML has elements called [tags](https://www.w3schools.com/html/html_elements.asp), which mark the different parts of an HTML document. Some common tags are:
# * `html` - all content is inside this tag
# * `head` - contains the title and other related files
# * `body` - contains the main content to be displayed on the webpage
# * `div` - division or area of a page
# * `p` - paragraph
# * `a` - links
#
# We will get our content from inside the `body` tag and use the `p` and `a` tags to get paragraphs and links.
#
# HTML elements also have [class and id properties](https://www.codecademy.com/articles/classes-vs-ids). These properties give HTML elements names and make it easier for us to refer to a particular element. A `class` can be shared among multiple elements, and an element can have more than one class, whereas an `id` must be unique and can be used only once in the document.
#
# ### Requests
# The `requests` module in Python lets us easily download pages from the web.
# We can request the contents of a webpage with `requests.get()`, passing in the target link as a parameter. This gives us a [response object](https://realpython.com/python-requests/#the-response).
#
# ### Beautiful Soup
# The [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library helps us parse the contents of a webpage in an easy-to-use manner. It provides some very useful methods and attributes, such as:
# * `find()`, `select_one()` - return the first tag object that matches our filter
# * `find_all()`, `select()` - return a list of all tag objects that match our filter
# * `children` - iterates over the direct children of a given tag
#
# These methods help us extract specific portions of the webpage.
#
# ***Tip: When scraping, we try to find common properties shared among target objects. This helps us extract all of them in just one or two commands.***
#
# For example, say we want to scrape the points of teams on a league table. We could go to each element and extract its value. Alternatively, we can find a common thread (like **same class, same parent + same element type**) between all the points and pass that common thread as an argument to BeautifulSoup, which will then extract and return the elements to us.

# # Goal 1: Listing most tagged Languages
#
# Now that we know the basics of web scraping, we will move on to our first goal.
#
# In Goal 1, we have to list the most tagged languages along with their tag counts. First, let's make a list of steps to follow:
#
# - [1. Download Webpage from stackoverflow](#1.1)
# - [2. Parse the document content into BeautifulSoup](#1.2)
# - [3. Extract Top Languages](#1.3)
# - [4. Extract their respective Tag Counts](#1.4)
# - [5. Put all code together and join the two lists](#1.5)
# - [6. Plot Data](#1.6)

# Let's import all the required libraries and packages

# In[1]:


import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import requests  # Getting Webpage content
from bs4 import BeautifulSoup as bs  # Scraping webpages
import matplotlib.pyplot as plt  # Visualization
import matplotlib.style as style  # For styling plots
from matplotlib import pyplot as mp  # For Saving plots as images

# For displaying plots in jupyter notebook
get_ipython().run_line_magic('matplotlib', 'inline')

style.use('fivethirtyeight')  # matplotlib Style


# ### Downloading Tags page from StackOverflow
#
# We will download the [tags page](https://stackoverflow.com/tags) from [stackoverflow](https://stackoverflow.com/), which lists all the languages with their tag counts.

# In[2]:


# Using requests module for downloading webpage content
response = requests.get('https://stackoverflow.com/tags')

# Getting status of the request
# 200 status code means our request was successful
# 404 status code means that the resource you were looking for was not found
response.status_code


# ### Parsing the document into Beautiful Soup

# In[3]:


# Parsing html data using BeautifulSoup
soup = bs(response.content, 'html.parser')

# body
body = soup.find('body')

# printing the object type of body
type(body)


# ### Extract Top Languages
#
# In order to achieve this, we need to understand the HTML structure of the document that we have, and then narrow down to our element of interest.
#
#
# One way of doing this would be to manually search the webpage (hint: print the `body` variable from above and search through it).
# The second method is to use the browser's Developer Tools.
#
# We will use the second one. In Chrome, open the [tags page](http://stackoverflow.com/tags?tab=popular), right-click on a language name (shown in the top left) and choose **Inspect**.
#
# ![Image for Reference](https://github.com/nveenverma/Projects/blob/master/Exploring%20StackOverflow/tags.png?raw=true)
# We can see that the `a` tag has a class of `post-tag`. Using this class along with the `a` tag, we can extract all the language links into a list.
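# As a minimal sketch of the step described above, the snippet below runs `find_all()` on a small, made-up HTML fragment rather than the live page. The markup in `sample_html` is a hypothetical stand-in; only the `a` tag and the `post-tag` class come from the text.

```python
from bs4 import BeautifulSoup as bs

# Hypothetical, simplified stand-in for the tags page markup
sample_html = """
<body>
  <div>
    <a href="/questions/tagged/javascript" class="post-tag">javascript</a>
    <a href="/questions/tagged/java" class="post-tag">java</a>
    <a href="/help" class="nav-link">help</a>
  </div>
</body>
"""

soup = bs(sample_html, 'html.parser')

# Keep only <a> elements whose class list includes 'post-tag'
lang_tags = soup.find_all('a', class_='post-tag')

# Pull the visible text out of each matched link
languages = [tag.text for tag in lang_tags]
print(languages)
```

# The `help` link is skipped because it does not carry the `post-tag` class.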
# Here, the tag counts are inside a `span` tag with a class of `item-multiplier-count`. Using this class along with the `span` tag, we will extract all the tag count spans into a list.
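# The same idea can be sketched for the counts. The fragment and the numbers in it are invented for illustration; only the `span` tag and the `item-multiplier-count` class come from the text. Stripping whitespace and commas before calling `int()` is an extra precaution assumed here, not something the page is known to require.

```python
from bs4 import BeautifulSoup as bs

# Hypothetical fragment; the counts are made-up numbers
sample_html = """
<div>
  <span class="item-multiplier-x">x</span>
  <span class="item-multiplier-count">1824582</span>
  <span class="item-multiplier-x">x</span>
  <span class="item-multiplier-count"> 1,539,292 </span>
</div>
"""

soup = bs(sample_html, 'html.parser')

# Match only the spans that hold the counts
count_spans = soup.find_all('span', class_='item-multiplier-count')

# Remove surrounding whitespace and thousands separators, then convert
tag_counts = [int(span.text.strip().replace(',', '')) for span in count_spans]
print(tag_counts)
```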
# In order to make a DataFrame, we need to pass both lists (in dictionary form) as an argument to `pd.DataFrame`.

# In[8]:


# Function to check whether the extracted bs4 result has the expected length
def error_checking(items, length, name='list'):
    if len(items) != length:
        # Report the name of the list instead of dumping its whole content
        print("Error in {} parsing, length not equal to {}!!!".format(name, length))
        return -1


def get_top_languages(url):
    # Using requests module for downloading webpage content
    response = requests.get(url)

    # Parsing html data using BeautifulSoup
    soup = bs(response.content, 'html.parser')
    body = soup.find('body')

    # Extracting Top Languages
    lang_tags = body.find_all('a', class_='post-tag')
    error_checking(lang_tags, 36, 'languages')  # Error Checking
    languages = [i.text for i in lang_tags]  # Languages List

    # Extracting Tag Counts
    tag_counts = body.find_all('span', class_='item-multiplier-count')
    error_checking(tag_counts, 36, 'tag counts')  # Error Checking
    no_of_tags = [int(i.text) for i in tag_counts]  # Tag Counts List

    # Putting the two lists together
    df = pd.DataFrame({'Languages': languages, 'Tag Count': no_of_tags})

    return df


# ### Plot Data

# In[9]:


URL1 = 'https://stackoverflow.com/tags'

df = get_top_languages(URL1)
df.head()


# Now, we will plot the Top Languages along with their Tag Counts.

# In[10]:


plt.figure(figsize=(8, 3))
plt.bar(height=df['Tag Count'][:10], x=df['Languages'][:10])
plt.xticks(rotation=90)
plt.xlabel('Languages')
plt.ylabel('Tag Counts')
plt.savefig('lang_vs_tag_counts.png', bbox_inches='tight')
plt.show()


# # Goal 2: Listing most voted Questions
#
# Now that we have collected data using web scraping once, it won't be difficult the next time.
# In Goal 2, we have to list the questions with the most votes along with their attributes:
# > - Summary
# - Tags
# - Number of Votes
# - Number of Answers
# - Number of Views
#
# I would suggest giving it a try on your own before coming here to see my solution.
#
# As in the previous goal, we will make a list of steps to follow:
#
# - [1. Download Webpage from stackoverflow](#2.1)
# - [2. Parse the document content into BeautifulSoup](#2.2)
# - [3. Extract Top Questions](#2.3)
# - [4. Extract their respective Summary](#2.4)
# - [5. Extract their respective Tags](#2.5)
# - [6. Extract their respective no. of votes, answers and views](#2.6)
# - [7. Put all code together and join the lists](#2.7)
# - [8. Plot Data](#2.8)

# ### Downloading Questions page from StackOverflow
#
# We will download the [questions page](https://stackoverflow.com/questions?sort=votes&pagesize=50) from [stackoverflow](https://stackoverflow.com/), which lists the top voted questions.
#
# Here, I've appended `?sort=votes&pagesize=50` to the end of the default questions URL to get a list of the top 50 questions.
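# As a small sketch of how that query string can be built rather than typed by hand, the snippet below uses the standard library's `urlencode`. The names `BASE_URL` and `query` are illustrative, not part of the original notebook.

```python
from urllib.parse import urlencode

# Default questions URL, without any query string
BASE_URL = 'https://stackoverflow.com/questions'

# Query parameters from the text: sort by votes, 50 questions per page
query = {'sort': 'votes', 'pagesize': 50}

# urlencode turns the dict into 'sort=votes&pagesize=50'
url = BASE_URL + '?' + urlencode(query)
print(url)
```

# With `requests`, passing the same dictionary as `requests.get(BASE_URL, params=query)` lets the library append the query string for us.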
# We can see that the question is inside an `a` tag, which has a class of `question-hyperlink`.
#
# Taking a cue from our previous goal, we can use this class along with the `a` tag to extract all the question links into a list. However, there are more question hyperlinks in the sidebar, which would also be extracted in this case. To avoid this, we can combine the `a` tag and the `question-hyperlink` class with their parent `h3` tag. This will give us exactly 50 tag objects, one per question.
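# The difference between the two approaches can be sketched on a made-up fragment. The markup below is a hypothetical simplification with two questions and one sidebar link; the `h3 > a.question-hyperlink` selector is one way to express the "parent `h3`" condition with `select()`.

```python
from bs4 import BeautifulSoup as bs

# Hypothetical fragment: two main questions and one sidebar link
sample_html = """
<div id="questions">
  <div class="summary"><h3><a class="question-hyperlink" href="/q/1">First question</a></h3></div>
  <div class="summary"><h3><a class="question-hyperlink" href="/q/2">Second question</a></h3></div>
</div>
<div id="sidebar">
  <a class="question-hyperlink" href="/q/3">Sidebar question</a>
</div>
"""

soup = bs(sample_html, 'html.parser')

# Class alone also picks up the sidebar link
all_links = soup.find_all('a', class_='question-hyperlink')

# Requiring an <h3> parent keeps only the main question links
question_links = soup.select('h3 > a.question-hyperlink')
questions = [a.text for a in question_links]

print(len(all_links), questions)
```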
# We can see that the question summary is inside a `div` tag, which has a class of `excerpt`. Using this class along with the `div` tag, we can extract all the question summaries into a list.
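# A minimal sketch of this step follows, again on invented markup. Excerpt text often comes with newlines and indentation, so this version also collapses whitespace; that cleanup is an assumption added here, and the excerpt strings are placeholders.

```python
from bs4 import BeautifulSoup as bs

# Hypothetical fragment; the excerpt text is placeholder content
sample_html = """
<div class="summary">
  <div class="excerpt">
        How do I undo the most recent
        local commit?
  </div>
</div>
<div class="summary">
  <div class="excerpt">  What does this operator do?  </div>
</div>
"""

soup = bs(sample_html, 'html.parser')

# Match every <div> that carries the 'excerpt' class
summary_divs = soup.select('div.excerpt')

# split() + join() collapses newlines and repeated spaces into single spaces
summaries = [' '.join(div.get_text().split()) for div in summary_divs]
print(summaries)
```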
# Extracting **tags per question** is the most complex task in this post. Here, we cannot find a unique class or id for each tag, and there are multiple tags per question that we need to store.
#
# To extract **tags per question**, we will follow a multi-step process:
#
# * As shown in the figure, individual tags are in a third layer, under two nested `div` tags, and only the outer `div` has a unique class (`summary`).
#   - First, we will extract the `div` with the `summary` class.
#   - Now notice that our target `div` is the third child overall and the second `div` child of the object extracted above. Here, we can use the `:nth-of-type()` CSS selector to extract this 2nd `div` child. It is easy to use, and a few examples can be found [here](https://gist.github.com/yoki/b7f2fcef64c893e307c4c59303ead19a#file-20_search-py). This selector extracts the 2nd `div` child directly, so we do not need to extract the `summary` `div` separately first.
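# The steps above can be sketched as follows. The fragment is a hypothetical reduction of the layout described in the text (an `h3`, an `excerpt` `div`, then an unnamed `div` holding the tags); the real page may differ. The selector `div.summary > div:nth-of-type(2)` is passed to `select()`, and the result is kept as one list of tags per question.

```python
from bs4 import BeautifulSoup as bs

# Hypothetical fragment: the tags sit in the 2nd <div> child of each summary
sample_html = """
<div class="summary">
  <h3><a class="question-hyperlink">Question one</a></h3>
  <div class="excerpt">Summary one</div>
  <div>
    <a class="post-tag">python</a>
    <a class="post-tag">list</a>
  </div>
</div>
<div class="summary">
  <h3><a class="question-hyperlink">Question two</a></h3>
  <div class="excerpt">Summary two</div>
  <div>
    <a class="post-tag">git</a>
  </div>
</div>
"""

soup = bs(sample_html, 'html.parser')

# 2nd <div> among the direct children of each div.summary
tag_divs = soup.select('div.summary > div:nth-of-type(2)')

# One inner list per question, holding the text of each tag link
tags_per_question = [[a.text for a in div.find_all('a')] for div in tag_divs]
print(tags_per_question)
```

# Keeping a nested list preserves which tags belong to which question, which matters when the lists are later joined into one table.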
# ### No. of Votes
# - These can be found using the `span` tag with the `vote-count-post` class and its nested `strong` tag.
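# As a minimal sketch, the descendant selector `span.vote-count-post strong` reaches the nested `strong` tag in one step. The surrounding markup and the vote numbers below are invented for illustration.

```python
from bs4 import BeautifulSoup as bs

# Hypothetical fragment; the vote numbers are made up
sample_html = """
<div class="votes"><span class="vote-count-post"><strong>24091</strong></span></div>
<div class="votes"><span class="vote-count-post"><strong>20187</strong></span></div>
"""

soup = bs(sample_html, 'html.parser')

# <strong> tags nested anywhere inside span.vote-count-post
vote_tags = soup.select('span.vote-count-post strong')

# Convert the text of each match to an integer
votes = [int(tag.text) for tag in vote_tags]
print(votes)
```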