This workshop is a one-hour beginner's introduction to web scraping.
This notebook deliberately has more content than we can reasonably cover in one hour. The most important material is in bold, and we'll focus on that material in person. To get the most out of this workshop, I'd suggest spending some time working through it in full after the workshop.
We'll cover the following topics:
Why would you want to scrape data from the web?
A high-level appreciation of how the Web works will help us to scrape data effectively.
How can we ask other computers on the Internet to send us data using Python?
Web pages are just files in a special format. Extracting information out of these files involves parsing HTML.
Don't go scraping willy-nilly!
So you want to learn more about web scraping.
It's 2019. The web is everywhere.
The point is this: there is an enormous amount of information (also known as data) on the web.
If we (in our capacities as, for example, data scientists, social scientists, digital humanists, businesses, public servants or members of the public) can get our hands on this information, we can answer all sorts of interesting questions or solve important problems.
Here's our high-level description of the web.
The internet is a bunch of computers connected together. Some computers are laptops, some are desktops, some are smart phones, some are servers owned by companies. Each computer has its own address on the internet. Using these addresses, one computer can ask another computer for some information (data). We say that the first computer sends a request to the second computer, asking for some particular information. The second computer sends back a response. The response could include the information requested, or it could be an error message. Perhaps the second computer doesn't have that information any more, or the first computer isn't allowed to access that information.
We said that there is an enormous amount of information available on the web. When people put information on the web, they generally have two different audiences in mind, two different types of consumers of their information: humans and computers. If they want their information to be used primarily by humans, they'll make a website. This will let them lay out the information in a visually appealing way, choose colours, add pictures, and make the information interactive. If they want their information to be used by computers, they'll make a web API. A web API provides other computers structured access to their data. We won't cover APIs in this workshop, but you should know that i) APIs are very common and ii) if there is an API for a website/data source, you should use that over web scraping. Many data sources that you might be interested in (e.g. social media sites) have APIs.
Websites are just a bunch of files on one of those computers. They are just plain text files, so you can view them if you want. When you type in the address of a website in your browser, your computer sends a request to the computer located at that address. The request says "hey buddy, please send me the file(s) for this website". If everything goes well, the other computer will send back the file(s) in the response. Every time you navigate to a new website or page in your browser, this process repeats.
There are three main languages that website files are written in: HyperText Markup Language (HTML), Cascading Style Sheets (CSS) and JavaScript (JS). They normally have .html, .css and .js file extensions. Each language (and thus each type of file) serves a different purpose. HTML files are the ones we care about the most, because they are the ones that contain the text you see on a web page. CSS files contain the instructions on how to make the content in an HTML file visually appealing (all the colours, font sizes, border widths, etc.). JavaScript files have the instructions on how to make the information on a website interactive (things like changing colour when you click something, entering data in a form). In this workshop, we're going to focus on HTML.
It's not too much of a simplification to say:
\begin{equation} \textrm{Web scraping} = \textrm{Making a request for an HTML file} + \textrm{Parsing the HTML response} \end{equation}
import requests
url = 'https://en.wikipedia.org/wiki/Canberra'
response = requests.get(url)
Great, it looks like everything worked! Let's see our beautiful HTML:
response
Huh, that's weird. Doesn't look like HTML to me.
What the requests.get function returned (and the thing in our response variable) was a Response object. It isn't itself the HTML that we wanted, but rather a collection of metadata about the request/response interaction between your computer and the Wikipedia server.
For example, it knows whether the response was successful or not (response.ok), how long the whole interaction took (response.elapsed), what time the request took place (response.headers['Date']) and a whole bunch of other metadata.
response.ok
response.headers['Date']
Of course, what we really care about is the HTML content. We can get that from the Response object with response.text. What we get back is a string of HTML: exactly the contents of the HTML file at the URL that we requested.
html = response.text
print(html[:1000])
Get the HTML for the Wikipedia page about HTML. Print out the first 1000 characters and compare it to the HTML you see when you view the page source in your browser.
# your solution here
# solution
url = 'https://en.wikipedia.org/wiki/HTML'
response = requests.get(url)
html = response.text
Write a function called get_html that takes a URL as an argument and returns the HTML contents as a string. Test your function on the page for Sir Tim Berners-Lee.
# your solution here
# solution
def get_html(url):
    response = requests.get(url)
    return response.text
url = 'https://en.wikipedia.org/wiki/Tim_Berners-Lee'
html = get_html(url)
What happens if the request doesn't go so smoothly? Add a defensive measure to your function to check that the response received was successful.
# your solution here
# solution
def get_html(url):
    response = requests.get(url)
    assert response.ok, "Whoops, this request didn't go as planned!"
    return response.text
url = 'https://en.wikipedia.org/wiki/Tim_Berners-Lee'
html = get_html(url)
The second step in web scraping is parsing HTML. This is where things can get a little tricky.
Imagine you're in the field of education; in fact, your specialty is studying higher education institutions. You're wondering how different disciplines change over time. Is it true that disciplines are incorporating more computational techniques as the years go on? Is that true for all disciplines or only some? Can we spot emerging themes across a whole university?
To answer these questions, we're going to need data. We're going to collect a dataset of all courses registered at UC Berkeley, not just those being taught this semester but all courses currently approved to be taught. These are listed on this page, called the Academic Guide. Well, actually they're not directly listed on that page. That page lists the departments/programs/units that teach currently approved courses. If we click on each department (for the sake of brevity, I'm just going to call them all "departments"), we can see the list of all courses they're approved to teach. For example, here's the page for Aerospace Studies. We'll call these pages departmental pages.
View the source HTML of the page listing all departments, and see if you can find the part of the HTML where the departments are listed. There's a lot of other stuff in the file that we don't care too much about. You could try Ctrl-F-ing for the name of a department you can see on the webpage.
Solution
You should see something like this:
<div id="atozindex">
<h2 class="letternav-head" id='A'><a name='A'>A</a></h2>
<ul>
<li><a href="/courses/aerospc/">Aerospace Studies (AEROSPC)</a></li>
<li><a href="/courses/africam/">African American Studies (AFRICAM)</a></li>
<li><a href="/courses/a,resec/">Agricultural and Resource Economics (A,RESEC)</a></li>
<li><a href="/courses/amerstd/">American Studies (AMERSTD)</a></li>
<li><a href="/courses/ahma/">Ancient History and Mediterranean Archaeology (AHMA)</a></li>
<li><a href="/courses/anthro/">Anthropology (ANTHRO)</a></li>
<li><a href="/courses/ast/">Applied Science and Technology (AST)</a></li>
<li><a href="/courses/arabic/">Arabic (ARABIC)</a></li>
<li><a href="/courses/arch/">Architecture (ARCH)</a></li>
This is HTML. HTML uses "tags": code that surrounds the raw text and indicates the structure of the content. Tags are enclosed in < and > symbols. The <li> tag says "this is a new item in a list" and </li> says "that's the end of that item". Similarly, <a ...> and </a> say "everything between us is a hyperlink". In this HTML file, each department is listed with <li>...</li> and is also linked to its own page using <a>...</a>. In our browser, if we click on the name of a department, it takes us to that department's own page. The browser knows where to go because the <a>...</a> tag tells it what page to go to. You'll see that inside the <a> tag there's an href=... attribute. That tells us the (relative) location of the page it links to.
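To make the tag idea concrete, here's a tiny standalone sketch that pulls the href out of a link like the ones above. It uses Python's built-in html.parser module (not the parser we'll use later in the workshop, but it needs no extra installs, so it's handy for a quick demonstration):

```python
from html.parser import HTMLParser

# A small excerpt of the HTML above.
snippet = '<li><a href="/courses/aerospc/">Aerospace Studies (AEROSPC)</a></li>'

class LinkFinder(HTMLParser):
    """Collect the href attribute of every <a> tag we see."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs
            self.hrefs.extend(value for name, value in attrs if name == 'href')

finder = LinkFinder()
finder.feed(snippet)
print(finder.hrefs)  # ['/courses/aerospc/']
```

The parser walks through the string tag by tag, which is exactly the "turn a string into structure" idea we'll lean on for the rest of the workshop.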
Look at the HTML source of the page for the Aerospace Studies department, and try to find the part of the file where the information on each course is. Again, try searching for it using Ctrl-F.
Solution
<div class="courseblock">
<button class="btn_toggleCoursebody" aria-expanded="false" aria-controls="cb_aerospc1a" data-toggle="#cb_aerospc1a">
<a name="spanaerospc1aspanspanfoundationsoftheu.s.airforcespanspan1unitspan"></a>
<h3 class="courseblocktitle">
<span class="code">AEROSPC 1A</span>
<span class="title">Foundations of the U.S. Air Force</span>
<span class="hours">1 Unit</span>
</h3>
The content that we care about is enclosed within HTML tags. It looks like the course code is enclosed in a span tag, which has a class attribute with the value "code". What we'll have to do is extract the information we care about by specifying which tag it's enclosed in.
But first, we're going to need to get the HTML of the first page.
Get the HTML content of http://guide.berkeley.edu/courses/ and store it in a variable called academic_guide_html. You can use the get_html function you wrote before.
Print the first 500 characters to see what we got back.
# your solution here
# solution
academic_guide_url = 'http://guide.berkeley.edu/courses/'
academic_guide_html = get_html(academic_guide_url)
print(academic_guide_html[:500])
Great, we've got the HTML contents of the Academic Guide site we want to scrape. Now we can parse it. "Parsing" means turning a string of data into a structured representation. When we're parsing HTML, we're taking the Python string and turning it into a tree. The Python package BeautifulSoup does all our HTML parsing for us. We give it our HTML as a string and it returns a parsed HTML tree. Here, we're also telling BeautifulSoup to use the lxml parser behind the scenes.
from bs4 import BeautifulSoup
academic_guide_soup = BeautifulSoup(academic_guide_html, 'lxml')
We said before that all the departments were listed on the Academic Guide page with links to their departmental pages, where the actual courses are listed. So we can find all the departments by looking in our parsed HTML for all the links. Remember that links are represented in the HTML with the <a>...</a> tag, so we ask our academic_guide_soup to find us all the tags called a. What we get back is a list of all the a elements in the HTML page.
links = academic_guide_soup.find_all('a')
# print an arbitrary link element
links[48]
So now we have a list of a elements, each one representing a link on the Academic Guide page. But there are other links on this page in addition to the ones we care about, for example, a link back to the UC Berkeley home page. How can we filter out all the links we don't care about?
Look through the list links, or the HTML source, and figure out how we can identify just the links that we care about, namely the links to departmental pages.
# your solution here
# solution
import re

def is_departmental_page(link):
    """
    Return True if `link` points to a departmental page.

    By examining the source HTML by eye, I noticed that
    the links we care about (i.e. the departmental pages)
    all point to a relative path that starts with "/courses/".
    This function uses that idea to determine if the link is
    a departmental page.
    """
    # some links don't have an href attribute, only a name attribute;
    # we don't care about them
    try:
        href = link.attrs['href']
    except KeyError:
        return False
    pattern = r'/courses/(.*)/'
    match = re.search(pattern, href)
    return bool(match)
print(links[0])
print(is_departmental_page(links[0]))
print()
print(links[48])
print(is_departmental_page(links[48]))
Let's use our new is_departmental_page function to filter out the links we don't care about. How many departments do we have?
departmental_page_links = [link for link in links if is_departmental_page(link)]
len(departmental_page_links)
Each link in our departmental_page_links list is an HTML element representing a link. Each element contains not only the relative location of the link but also the text that is linked (i.e. the words on the page that are underlined and that you can click on to go to the linked page). In BeautifulSoup, we can get that text by asking for it with element.text, like this:
departmental_page_links[0].text
From the departmental_page_links, we can extract the name and the code for each department. Try doing this.
# your solution here
# solution
import re

def extract_department_name_and_code(departmental_link):
    """
    Return the (name, code) for a department.

    The easiest way to do this is to use regular expressions.
    We're not going to cover regular expressions in this workshop,
    but here's how to do it anyway.
    """
    text = departmental_link.text
    pattern = r'([^(]+) \((.*)\)'
    match = re.search(pattern, text)
    if match:
        return match.group(1), match.group(2)
extract_department_name_and_code(links[48])
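If regular expressions are new to you, here's a minimal standalone sketch of how the pattern above behaves on a sample string (the department name here is just an illustration, not live data from the page):

```python
import re

# The same pattern used in extract_department_name_and_code:
# one group of characters up to " (", then a second group inside parentheses.
pattern = r'([^(]+) \((.*)\)'

sample = 'Aerospace Studies (AEROSPC)'
match = re.search(pattern, sample)
if match:
    name, code = match.group(1), match.group(2)
    print(name)  # Aerospace Studies
    print(code)  # AEROSPC
```

The character class [^(]+ means "one or more characters that aren't an opening parenthesis", which is what stops the first group just before the code.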
From each link in our departmental_page_links list, we can get the relative link that it points to like this:
departmental_page_links[0].attrs['href']
Write a function that extracts the relative link of a link element.
Hint: this has a similar solution to our is_departmental_page function from before.
# your solution here
# solution
def extract_relative_link(departmental_link):
    """
    We noted above that all the departmental links point to "/courses/something/",
    where the "something" looks a lot like the department's code. This function
    extracts out that "something", so we can add it to the base URL of
    the Academic Guide page and get the full path to each departmental page.
    """
    href = departmental_link.attrs['href']
    pattern = r'/courses/(.*)/'
    match = re.search(pattern, href)
    if match:
        return match.group(1)
extract_relative_link(departmental_page_links[0])
Alright! Now we've identified all the departmental links on the Academic Guide page, we've found their name and code, and we know the relative link they point to. Next, we can use this relative link to construct the full URL they point to, which we'll then use to scrape the HTML for each departmental page.
Let's write a function that takes a departmental link and returns the absolute URL of its departmental page.
def construct_absolute_url(departmental_link):
    relative_link = extract_relative_link(departmental_link)
    return academic_guide_url + relative_link
construct_absolute_url(departmental_page_links[37])
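String concatenation works here because we hand-crafted the pieces. As an aside, a more robust general approach is the standard library's urljoin, which resolves a relative href against a base URL for you (the URLs below are just the ones from this notebook):

```python
from urllib.parse import urljoin

base_url = 'http://guide.berkeley.edu/courses/'

# urljoin resolves a relative or absolute path against the base URL,
# handling leading and trailing slashes correctly.
full_url = urljoin(base_url, '/courses/aerospc/')
print(full_url)  # http://guide.berkeley.edu/courses/aerospc/
```

This is worth knowing because real pages mix relative paths, absolute paths and full URLs in their href attributes, and urljoin handles all three.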
To summarize so far: starting from the URL of the Academic Guide website, we've found all the departments that offer approved courses, identified their names and codes, and extracted the link to each departmental page, which lists all the courses that department teaches.
Now we want to get the HTML for each departmental page and scrape it for all the courses they offer. Let's focus on one page for now, the Aerospace Studies page. Once we select the link, we use our functions from above to get the name (I guess we already know it's Aerospace, but whatever) and code, construct the full URL, get the HTML for that URL, and then parse the HTML.
aerospace_link = departmental_page_links[0]
aerospace_name, aerospace_code = extract_department_name_and_code(aerospace_link)
aerospace_url = construct_absolute_url(aerospace_link)
aerospace_html = get_html(aerospace_url)
aerospace_soup = BeautifulSoup(aerospace_html, 'lxml')
print(aerospace_html[:500])
Right at the start of this section on parsing HTML, we saw the HTML for a departmental page. Here it is again.
<div class="courseblock">
<button class="btn_toggleCoursebody" aria-expanded="false" aria-controls="cb_aerospc1a" data-toggle="#cb_aerospc1a">
<a name="spanaerospc1aspanspanfoundationsoftheu.s.airforcespanspan1unitspan"></a>
<h3 class="courseblocktitle">
<span class="code">AEROSPC 1A</span>
<span class="title">Foundations of the U.S. Air Force</span>
<span class="hours">1 Unit</span>
</h3>
It looks like each course is listed in a div element that has a class attribute with the value "courseblock". We can use this information to identify all the courses on a page and then extract the information from them. You've seen how to do this before; here it is again:
aerospace_courseblocks = aerospace_soup.find_all(class_='courseblock')
len(aerospace_courseblocks)
Looks like the Aerospace department has seven current courses they're approved to teach (at the time of writing). Looking at the page in our browser, that looks right to me! So now we have a list called aerospace_courseblocks that holds seven elements, each referring to one course taught by the Aerospace department. Now we can extract any information we care about. We just have to look at the page in our browser, decide what information we care about, then look at the HTML source to see where that information sits in the HTML structure. Finally, we write a function for each piece of information we want to extract from a course.
Write functions to take a courseblock and extract:
the course code
the course title
the number of units
the course description
# your solution here
# solution
def extract_course_code(courseblock):
    span = courseblock.find(class_='code')
    return span.text

def extract_course_title(courseblock):
    span = courseblock.find(class_='title')
    return span.text

def extract_course_units(courseblock):
    span = courseblock.find(class_='hours')
    return span.text

def extract_course_description(courseblock):
    span = courseblock.find(class_='coursebody')
    return span.text

def extract_one_course(courseblock):
    course = {}
    course['course_code'] = extract_course_code(courseblock)
    course['course_title'] = extract_course_title(courseblock)
    course['course_units'] = extract_course_units(courseblock)
    course['course_description'] = extract_course_description(courseblock)
    return course
first_aerospace_course = extract_one_course(aerospace_courseblocks[0])
for value in first_aerospace_course.values():
    print(value)
    print()
Let's write a function to scrape these four pieces of information from every course from every department and save it as a csv file.
def scrape_one_department(department_link):
    department_name, department_code = extract_department_name_and_code(department_link)
    department_url = construct_absolute_url(department_link)
    department_html = get_html(department_url)
    department_soup = BeautifulSoup(department_html, 'lxml')
    department_courseblocks = department_soup.find_all(class_='courseblock')
    result = []
    for courseblock in department_courseblocks:
        course = extract_one_course(courseblock)
        course['department_name'] = department_name
        course['department_code'] = department_code
        result.append(course)
    return result
aerospace_courses = scrape_one_department(aerospace_link)
for value in aerospace_courses[0].values():
    print(value)
    print()
import time

def scrape_all_departments(be_nice=True):
    academic_guide_url = 'http://guide.berkeley.edu/courses/'
    academic_guide_html = get_html(academic_guide_url)
    academic_guide_soup = BeautifulSoup(academic_guide_html, 'lxml')
    links = academic_guide_soup.find_all('a')
    departmental_page_links = [link for link in links if is_departmental_page(link)]
    result = []
    for departmental_page_link in departmental_page_links:
        department_result = scrape_one_department(departmental_page_link)
        result.extend(department_result)
        if be_nice:
            time.sleep(1)
    return result
import pandas as pd
result = scrape_all_departments(be_nice=False)
df = pd.DataFrame(result)
print(str(len(df)) + ' courses scraped')
df.head()
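We said we'd save the results as a CSV file. One way is pandas' to_csv method; here's a minimal sketch run on a tiny hand-made sample (the sample row and the filename are just illustrations) so it works without scraping anything:

```python
import pandas as pd

# A tiny hand-made sample standing in for the scraped `result` list.
sample_result = [
    {'course_code': 'AEROSPC 1A',
     'course_title': 'Foundations of the U.S. Air Force',
     'course_units': '1 Unit',
     'department_name': 'Aerospace Studies',
     'department_code': 'AEROSPC'},
]
df = pd.DataFrame(sample_result)

# index=False keeps pandas' row numbers out of the file.
df.to_csv('uc_berkeley_courses.csv', index=False)
```

On the real df above, the same df.to_csv(...) call writes all of the scraped courses to disk so you don't have to re-scrape every time you analyse the data.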
9360 courses scraped (at the time of writing)! Wow, that was a lot easier than doing it by hand!
As you've seen, web scraping involves making requests to other computers for their data. It costs people money to maintain the computers that we request data from: they need electricity, they require staff, sometimes the hardware needs upgrading, etc. But we didn't pay anyone for using their resources.
Because we're making these requests programmatically, we could make many, many requests per second. For example, we could put a request in a never-ending loop which would constantly request data from a server. But computers can't handle too much traffic, so eventually this might crash someone else's computer. Moreover, if we make too many requests when we're web scraping, that might restrict the number of people who can view the web page in their browser. This isn't very nice.
Websites often have Terms of Service: documents that you agree to whenever you visit a site. Some of these terms prohibit web scraping, because it puts too much strain on their servers, or because they just don't want their data accessed programmatically. Whatever the reason, we need to respect a website's Terms of Service. Before you scrape a site, you should always check its Terms of Service to make sure scraping is allowed.
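Besides the Terms of Service, many sites publish a robots.txt file describing which paths automated clients may fetch, and Python's standard library can read it. Here's a minimal sketch using a hand-written, illustrative policy so it runs without touching the network; for a real site you'd call set_url('https://thesite.com/robots.txt') and read() instead of parse():

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt policy, supplied by hand for this sketch.
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch(user_agent, url) tells us whether fetching a URL is allowed.
print(parser.can_fetch('*', 'https://example.com/courses/'))   # True
print(parser.can_fetch('*', 'https://example.com/private/x'))  # False
```

Checking robots.txt before scraping is a simple way to be a polite citizen of the web, alongside the time.sleep delay we built into scrape_all_departments.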
Often, there are better ways of accessing the same data. For the Wikipedia pages we scraped, there's actually an API that we could have used. In fact, Wikipedia would prefer that we access their data that way. There's even a Python package that wraps this API to make it even easier to use. Furthermore, Wikipedia makes all of its content available for direct download. The moral of the story: before web scraping, see if you can get the same data elsewhere. This will often be easier for you and preferred by the people who own the data.
Moreover, if you're affiliated with an institution, you may be breaching existing contracts by engaging in scraping. UC Berkeley's Library recommends following this workflow:
Work through this notebook in full.
Fantastic Data and Where To Find Them: An introduction to APIs, RSS, and Scraping
Remember the requests library? Well, the author of that library, Kenneth Reitz, has another library for parsing HTML. I'm not that familiar with it, but it looks promising if it's by Reitz!