This workshop is a one-hour beginner's introduction to web scraping.
This notebook deliberately has more content than we can reasonably cover in one hour. The most important material is in bold, and we'll focus on that material in person. To get the most out of this workshop, I'd suggest spending some time working through it in full after the workshop.
We'll cover the following topics:
Why would you want to scrape data from the web?
A high-level appreciation of how the Web works will help us to scrape data effectively.
How can we ask other computers on the Internet to send us data using Python?
Web pages are just files in a special format. Extracting information out of these files involves parsing HTML.
Don't go scraping willy-nilly!
So you want to learn more about web scraping.
It's 2019. The web is everywhere.
The point is this: there is an enormous amount of information (also known as data) on the web.
If we (in our capacities as, for example, data scientists, social scientists, digital humanists, businesses, public servants or members of the public) can get our hands on this information, we can answer all sorts of interesting questions or solve important problems.
Here's our high-level description of the web.
The internet is a bunch of computers connected together. Some computers are laptops, some are desktops, some are smart phones, some are servers owned by companies. Each computer has its own address on the internet. Using these addresses, one computer can ask another computer for some information (data). We say that the first computer sends a request to the second computer, asking for some particular information. The second computer sends back a response. The response could include the information requested, or it could be an error message. Perhaps the second computer doesn't have that information any more, or the first computer isn't allowed to access that information.
We said that there is an enormous amount of information available on the web. When people put information on the web, they generally have two different audiences in mind, two different types of consumers of their information: humans and computers. If they want their information to be used primarily by humans, they'll make a website. This will let them lay out the information in a visually appealing way, choose colours, add pictures, and make the information interactive. If they want their information to be used by computers, they'll make a web API. A web API provides other computers structured access to their data. We won't cover APIs in this workshop, but you should know that i) APIs are very common and ii) if there is an API for a website/data source, you should use that over web scraping. Many data sources that you might be interested in (e.g. social media sites) have APIs.
Websites are just a bunch of files on one of those computers. They are just plain text files, so you can view them if you want. When you type in the address of a website in your browser, your computer sends a request to the computer located at that address. The request says "hey buddy, please send me the file(s) for this website". If everything goes well, the other computer will send back the file(s) in the response. Every time you navigate to a new website or page in your browser, this process repeats.
There are three main languages that website files are written in: HyperText Markup Language (HTML), Cascading Style Sheets (CSS) and JavaScript (JS). They normally have .html, .css and .js file extensions. Each language (and thus each type of file) serves a different purpose. HTML files are the ones we care about the most, because they are the ones that contain the text you see on a web page. CSS files contain the instructions on how to make the content in an HTML file visually appealing (all the colours, font sizes, border widths, etc.). JavaScript files have the instructions on how to make the information on a website interactive (things like changing colour when you click something, entering data in a form). In this workshop, we're going to focus on HTML.
It's not too much of a simplification to say:
\begin{equation} \textrm{Web scraping} = \textrm{Making a request for an HTML file} + \textrm{Parsing the HTML response} \end{equation}
import requests
url = 'https://en.wikipedia.org/wiki/Canberra'
response = requests.get(url)
Great, it looks like everything worked! Let's see our beautiful HTML:
response
Huh, that's weird. Doesn't look like HTML to me.
What the requests.get function returned (and the thing in our response variable) was a Response object. It isn't itself the HTML that we wanted, but rather a collection of metadata about the request/response interaction between your computer and the Wikipedia server.
For example, it knows whether the response was successful or not (response.ok), how long the whole interaction took (response.elapsed), what time the request took place (response.headers['Date']) and a whole bunch of other metadata.
response.ok
response.headers['Date']
Of course, what we really care about is the HTML content. We can get that from the Response object with response.text. What we get back is a string of HTML: exactly the contents of the HTML file at the URL that we requested.
html = response.text
print(html[:1000])
Get the HTML for the Wikipedia page about HTML. Print out the first 1000 characters and compare it to the HTML you see when you view the page source in your browser.
# your solution here
# solution
url = 'https://en.wikipedia.org/wiki/HTML'
response = requests.get(url)
html = response.text
Write a function called get_html that takes a URL as an argument and returns the HTML contents as a string. Test your function on the page for Sir Tim Berners-Lee.
# your solution here
# solution
def get_html(url):
    response = requests.get(url)
    return response.text
url = 'https://en.wikipedia.org/wiki/Tim_Berners-Lee'
html = get_html(url)
What happens if the request doesn't go so smoothly? Add a defensive measure to your function to check that the response received was successful.
# your solution here
# solution
def get_html(url):
    response = requests.get(url)
    assert response.ok, "Whoops, this request didn't go as planned!"
    return response.text
url = 'https://en.wikipedia.org/wiki/Tim_Berners-Lee'
html = get_html(url)
The second step in web scraping is parsing HTML. This is where things can get a little tricky.
Imagine you're in the field of education; in fact, your specialty is studying higher education institutions. You're wondering how different disciplines change over time. Is it true that disciplines are incorporating more computational techniques as the years go on? Is that true for all disciplines or only some? Can we spot emerging themes across a whole university?
To answer these questions, we're going to need data. We're going to collect a dataset of all courses registered at UC Berkeley, not just those being taught this semester but all courses currently approved to be taught. These are listed on this page, called the Academic Guide. Well, actually they're not directly listed on that page. That page lists the departments/programs/units that teach currently approved courses. If we click on each department (for the sake of brevity, I'm just going to call them all "departments"), we can see the list of all courses they're approved to teach. For example, here's the page for Aerospace Studies. We'll call these pages departmental pages.
View the source HTML of the page listing all departments, and see if you can find the part of the HTML where the departments are listed. There's a lot of other stuff in the file that we don't care too much about. You could try Ctrl-F-ing for the name of a department you can see on the webpage.
Solution
You should see something like this:
<div id="atozindex">
<h2 class="letternav-head" id='A'><a name='A'>A</a></h2>
<ul>
<li><a href="/courses/aerospc/">Aerospace Studies (AEROSPC)</a></li>
<li><a href="/courses/africam/">African American Studies (AFRICAM)</a></li>
<li><a href="/courses/a,resec/">Agricultural and Resource Economics (A,RESEC)</a></li>
<li><a href="/courses/amerstd/">American Studies (AMERSTD)</a></li>
<li><a href="/courses/ahma/">Ancient History and Mediterranean Archaeology (AHMA)</a></li>
<li><a href="/courses/anthro/">Anthropology (ANTHRO)</a></li>
<li><a href="/courses/ast/">Applied Science and Technology (AST)</a></li>
<li><a href="/courses/arabic/">Arabic (ARABIC)</a></li>
<li><a href="/courses/arch/">Architecture (ARCH)</a></li>
This is HTML. HTML uses "tags": code that surrounds the raw text and indicates the structure of the content. Tags are enclosed in < and > symbols. The <li> tag says "this is a new item in a list" and </li> says "that's the end of that item". Similarly, <a ...> and </a> say "everything between us is a hyperlink". In this HTML file, each department is listed with <li>...</li> and is also linked to its own page using <a>...</a>. In our browser, if we click on the name of a department, it takes us to that department's own page. The browser knows where to go because the <a>...</a> tag tells it what page to go to. You'll see that inside the <a> tag there's an href=... attribute. That tells us the (relative) location of the page it links to.
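To make the tag idea concrete, here's a tiny standalone sketch that pulls the href out of a link like the ones above. It uses Python's built-in html.parser module (not the parser we'll use later in the workshop, but it needs no extra installs, so it's handy for a quick demonstration):

```python
from html.parser import HTMLParser

# A small excerpt of the HTML above.
snippet = '<li><a href="/courses/aerospc/">Aerospace Studies (AEROSPC)</a></li>'

class LinkFinder(HTMLParser):
    """Collect the href attribute of every <a> tag we see."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs
            self.hrefs.extend(value for name, value in attrs if name == 'href')

finder = LinkFinder()
finder.feed(snippet)
print(finder.hrefs)  # ['/courses/aerospc/']
```

The parser walks through the string tag by tag, which is exactly the "turn a string into structure" idea we'll lean on for the rest of the workshop.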
Look at the HTML source of the page for the Aerospace Studies department, and try to find the part of the file where the information on each course is. Again, try searching for it using Ctrl-F.
Solution
<div class="courseblock">
<button class="btn_toggleCoursebody" aria-expanded="false" aria-controls="cb_aerospc1a" data-toggle="#cb_aerospc1a">
<a name="spanaerospc1aspanspanfoundationsoftheu.s.airforcespanspan1unitspan"></a>
<h3 class="courseblocktitle">
<span class="code">AEROSPC 1A</span>
<span class="title">Foundations of the U.S. Air Force</span>
<span class="hours">1 Unit</span>
</h3>
The content that we care about is enclosed within HTML tags. It looks like the course code is enclosed in a span tag, which has a class attribute with the value "code". What we'll have to do is extract the information we care about by specifying which tag it's enclosed in.
But first, we're going to need to get the HTML of the first page.
Get the HTML content of http://guide.berkeley.edu/courses/ and store it in a variable called academic_guide_html. You can use the get_html function you wrote before.
Print the first 500 characters to see what we got back.
# your solution here
# solution
academic_guide_url = 'http://guide.berkeley.edu/courses/'
academic_guide_html = get_html(academic_guide_url)
print(academic_guide_html[:500])
Great, we've got the HTML contents of the Academic Guide site we want to scrape. Now we can parse it. "Parsing" means turning a string of data into a structured representation. When we're parsing HTML, we're taking the Python string and turning it into a tree. The Python package BeautifulSoup does all our HTML parsing for us. We give it our HTML as a string and it returns a parsed HTML tree. Here, we're also telling BeautifulSoup to use the lxml parser behind the scenes.
from bs4 import BeautifulSoup
academic_guide_soup = BeautifulSoup(academic_guide_html, 'lxml')
We said before that all the departments were listed on the Academic Guide page with links to their departmental pages, where the actual courses are listed. So we can find all the departments by looking in our parsed HTML for all the links. Remember that links are represented in the HTML with the <a>...</a> tag, so we ask our academic_guide_soup to find us all the tags called a. What we get back is a list of all the a elements in the HTML page.
links = academic_guide_soup.find_all('a')
# print an arbitrary link element
links[48]
So now we have a list of a elements, each one representing a link on the Academic Guide page. But there are other links on this page in addition to the ones we care about, for example, a link back to the UC Berkeley home page. How can we filter out all the links we don't care about?
Look through the list links, or the HTML source, and figure out how we can identify just the links that we care about, namely the links to departmental pages.
# your solution here
# solution
import re

def is_departmental_page(link):
    """
    Return True if `link` points to a departmental page.

    By examining the source HTML by eye, I noticed that
    the links we care about (i.e. the departmental pages)
    all point to a relative path that starts with "/courses/".
    This function uses that idea to determine if the link is
    a departmental page.
    """
    # some links don't have an href attribute, only a name attribute;
    # we don't care about them
    try:
        href = link.attrs['href']
    except KeyError:
        return False
    pattern = r'/courses/(.*)/'
    match = re.search(pattern, href)
    return bool(match)
print(links[0])
print(is_departmental_page(links[0]))
print()
print(links[48])
print(is_departmental_page(links[48]))
Let's use our new is_departmental_page function to filter out the links we don't care about. How many departments do we have?
departmental_page_links = [link for link in links if is_departmental_page(link)]
len(departmental_page_links)
Each link in our departmental_page_links list is an HTML element representing a link. Each element contains not only the relative location of the link but also the text that is linked (i.e. the words on the page that are underlined and that you can click on to go to the linked page). In BeautifulSoup, we can get that text by asking for it with element.text, like this:
departmental_page_links[0].text
From the departmental_page_links, we can extract the name and the code for each department. Try doing this.
# your solution here
# solution
import re

def extract_department_name_and_code(departmental_link):
    """
    Return the (name, code) for a department.

    The easiest way to do this is to use regular expressions.
    We're not going to cover regular expressions in this workshop,
    but here's how to do it anyway.
    """
    text = departmental_link.text
    pattern = r'([^(]+) \((.*)\)'
    match = re.search(pattern, text)
    if match:
        return match.group(1), match.group(2)
extract_department_name_and_code(links[48])
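If regular expressions are new to you, here's a minimal standalone sketch of how the pattern above behaves on a sample string (the department name here is just an illustration, not live data from the page):

```python
import re

# The same pattern used in extract_department_name_and_code:
# one group of characters up to " (", then a second group inside parentheses.
pattern = r'([^(]+) \((.*)\)'

sample = 'Aerospace Studies (AEROSPC)'
match = re.search(pattern, sample)
if match:
    name, code = match.group(1), match.group(2)
    print(name)  # Aerospace Studies
    print(code)  # AEROSPC
```

The character class [^(]+ means "one or more characters that aren't an opening parenthesis", which is what stops the first group just before the code.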
From each link in our departmental_page_links list, we can get the relative link that it points to like this:
departmental_page_links[0].attrs['href']
Write a function that extracts the relative link of a link element.
Hint: this has a similar solution to our is_departmental_page function from before.
# your solution here
# solution
def extract_relative_link(departmental_link):
    """
    We noted above that all the departmental links point to "/courses/something/",
    where the "something" looks a lot like the department's code. This function
    extracts out that "something", so we can add it to the base URL of
    the Academic Guide page and get the full path to each departmental page.
    """
    href = departmental_link.attrs['href']
    pattern = r'/courses/(.*)/'
    match = re.search(pattern, href)
    if match:
        return match.group(1)
extract_relative_link(departmental_page_links[0])
Alright! Now we've identified all the departmental links on the Academic Guide page, we've found their name and code, and we know the relative link they point to. Next, we can use this relative link to construct the full URL they point to, which we'll then use to scrape the HTML for each departmental page.
Let's write a function that takes a departmental link and returns the absolute URL of its departmental page.
def construct_absolute_url(departmental_link):
    relative_link = extract_relative_link(departmental_link)
    return academic_guide_url + relative_link
construct_absolute_url(departmental_page_links[37])
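String concatenation works here because we hand-crafted the pieces. As an aside, a more robust general approach is the standard library's urljoin, which resolves a relative href against a base URL for you (the URLs below are just the ones from this notebook):

```python
from urllib.parse import urljoin

base_url = 'http://guide.berkeley.edu/courses/'

# urljoin resolves a relative or absolute path against the base URL,
# handling leading and trailing slashes correctly.
full_url = urljoin(base_url, '/courses/aerospc/')
print(full_url)  # http://guide.berkeley.edu/courses/aerospc/
```

This is worth knowing because real pages mix relative paths, absolute paths and full URLs in their href attributes, and urljoin handles all three.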
To summarize so far: starting from the URL of the Academic Guide website, we've found all the departments that offer approved courses, identified their names and codes, and extracted the link to each departmental page, which lists all the courses that department teaches.
Now we want to get the HTML for each departmental page and scrape it for all the courses they offer. Let's focus on one page for now, the Aerospace Studies page. Once we select the link, we use our functions from above to get the name (I guess we already know it's Aerospace, but whatever) and code, construct the full URL, get the HTML for that URL, and then parse the HTML.
aerospace_link = departmental_page_links[0]
aerospace_name, aerospace_code = extract_department_name_and_code(aerospace_link)
aerospace_url = construct_absolute_url(aerospace_link)
aerospace_html = get_html(aerospace_url)
aerospace_soup = BeautifulSoup(aerospace_html, 'lxml')
print(aerospace_html[:500])
Right at the start of this section on parsing HTML, we saw the HTML for a departmental page. Here it is again.
<div class="courseblock">
<button class="btn_toggleCoursebody" aria-expanded="false" aria-controls="cb_aerospc1a" data-toggle="#cb_aerospc1a">
<a name="spanaerospc1aspanspanfoundationsoftheu.s.airforcespanspan1unitspan"></a>
<h3 class="courseblocktitle">
<span class="code">AEROSPC 1A</span>
<span class="title">Foundations of the U.S. Air Force</span>
<span class="hours">1 Unit</span>
</h3>
It looks like each course is listed in a div element that has a class attribute with the value "courseblock". We can use this information to identify all the courses on a page and then extract the information from them. You've seen how to do this before; here it is again:
aerospace_courseblocks = aerospace_soup.find_all(class_='courseblock')
len(aerospace_courseblocks)
Looks like the Aerospace department has seven current courses they're approved to teach (at the time of writing). Looking at the page in our browser, that looks right to me! So now we have a list called aerospace_courseblocks that holds seven elements, each referring to one course taught by the Aerospace department. Now we can extract any information we care about. We just have to look at the page in our browser, decide what information we care about, then look at the HTML source to see where that information sits in the HTML structure. Finally, we write a function for each piece of information we want to extract from a course.
Write functions to take a courseblock and extract:
the course code
the course title
the number of units
the course description
# your solution here
# solution
def extract_course_code(courseblock):
    span = courseblock.find(class_='code')
    return span.text

def extract_course_title(courseblock):
    span = courseblock.find(class_='title')
    return span.text

def extract_course_units(courseblock):
    span = courseblock.find(class_='hours')
    return span.text

def extract_course_description(courseblock):
    span = courseblock.find(class_='coursebody')
    return span.text

def extract_one_course(courseblock):
    course = {}
    course['course_code'] = extract_course_code(courseblock)
    course['course_title'] = extract_course_title(courseblock)
    course['course_units'] = extract_course_units(courseblock)
    course['course_description'] = extract_course_description(courseblock)
    return course
first_aerospace_course = extract_one_course(aerospace_courseblocks[0])
for value in first_aerospace_course.values():
    print(value)
    print()
Let's write a function to scrape these four pieces of information from every course from every department and save it as a csv file.
def scrape_one_department(department_link):
    department_name, department_code = extract_department_name_and_code(department_link)
    department_url = construct_absolute_url(department_link)
    department_html = get_html(department_url)
    department_soup = BeautifulSoup(department_html, 'lxml')
    department_courseblocks = department_soup.find_all(class_='courseblock')
    result = []
    for courseblock in department_courseblocks:
        course = extract_one_course(courseblock)
        course['department_name'] = department_name
        course['department_code'] = department_code
        result.append(course)
    return result
aerospace_courses = scrape_one_department(aerospace_link)
for value in aerospace_courses[0].values():
    print(value)
    print()
import time

def scrape_all_departments(be_nice=True):
    academic_guide_url = 'http://guide.berkeley.edu/courses/'
    academic_guide_html = get_html(academic_guide_url)
    academic_guide_soup = BeautifulSoup(academic_guide_html, 'lxml')
    links = academic_guide_soup.find_all('a')
    departmental_page_links = [link for link in links if is_departmental_page(link)]
    result = []
    for departmental_page_link in departmental_page_links:
        department_result = scrape_one_department(departmental_page_link)
        result.extend(department_result)
        if be_nice:
            time.sleep(1)
    return result
import pandas as pd
result = scrape_all_departments(be_nice=False)
df = pd.DataFrame(result)
print(str(len(df)) + ' courses scraped')
df.head()
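We said we'd save the results as a CSV file. One way is pandas' to_csv method; here's a minimal sketch run on a tiny hand-made sample (the sample row and the filename are just illustrations) so it works without scraping anything:

```python
import pandas as pd

# A tiny hand-made sample standing in for the scraped `result` list.
sample_result = [
    {'course_code': 'AEROSPC 1A',
     'course_title': 'Foundations of the U.S. Air Force',
     'course_units': '1 Unit',
     'department_name': 'Aerospace Studies',
     'department_code': 'AEROSPC'},
]
df = pd.DataFrame(sample_result)

# index=False keeps pandas' row numbers out of the file.
df.to_csv('uc_berkeley_courses.csv', index=False)
```

On the real df above, the same df.to_csv(...) call writes all of the scraped courses to disk so you don't have to re-scrape every time you analyse the data.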
9360 courses scraped (at the time of writing)! Wow, that was a lot easier than doing it by hand!
As you've seen, web scraping involves making requests to other computers for their data. It costs people money to maintain the computers that we request data from: they need electricity, they require staff, sometimes the hardware needs upgrading, etc. But we didn't pay anyone for using their resources.
Because we're making these requests programmatically, we could make many, many requests per second. For example, we could put a request in a never-ending loop which would constantly request data from a server. But computers can't handle too much traffic, so eventually this might crash someone else's computer. Moreover, if we make too many requests when we're web scraping, that might restrict the number of people who can view the web page in their browser. This isn't very nice.
Websites often have Terms of Service: documents that you agree to whenever you visit a site. Some of these terms prohibit web scraping, because it puts too much strain on their servers, or because they just don't want their data accessed programmatically. Whatever the reason, we need to respect a website's Terms of Service. Before you scrape a site, you should always check its Terms of Service to make sure scraping is allowed.
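Besides the Terms of Service, many sites publish a robots.txt file describing which paths automated clients may fetch, and Python's standard library can read it. Here's a minimal sketch using a hand-written, illustrative policy so it runs without touching the network; for a real site you'd call set_url('https://thesite.com/robots.txt') and read() instead of parse():

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt policy, supplied by hand for this sketch.
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch(user_agent, url) tells us whether fetching a URL is allowed.
print(parser.can_fetch('*', 'https://example.com/courses/'))   # True
print(parser.can_fetch('*', 'https://example.com/private/x'))  # False
```

Checking robots.txt before scraping is a simple way to be a polite citizen of the web, alongside the time.sleep delay we built into scrape_all_departments.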
Often, there are better ways of accessing the same data. For the Wikipedia pages we scraped, there's actually an API that we could have used. In fact, Wikipedia would prefer that we access their data that way. There's even a Python package that wraps this API to make it even easier to use. Furthermore, Wikipedia makes all of its content available for direct download. The moral of the story: before web scraping, see if you can get the same data elsewhere. This will often be easier for you and preferred by the people who own the data.
Moreover, if you're affiliated with an institution, you may be breaching existing contracts by engaging in scraping. UC Berkeley's Library recommends following this workflow:
Work through this notebook in full.
Fantastic Data and Where To Find Them: An introduction to APIs, RSS, and Scraping
Remember the requests library? Well, the author of that library, Kenneth Reitz, has another library for parsing HTML. I'm not that familiar with it, but it looks promising if it's by Reitz!