Webscraping is the activity of downloading, manipulating, and using information obtained online. Webscraping can get very complicated, and we won't do much of it in this course. These lecture notes can help you get started on the basics. We'll look into this a bit more when we get to regular expressions in a few lectures.
There are several modules for downloading files from the internet. We'll use urllib:
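import urllib.request  # "import urllib" alone does not reliably load the request submodule
url = "https://philchodrow.github.io/PIC16A/content/IO_and_modules/IO/palmer_penguins.csv"
filedata = urllib.request.urlopen(url)  # open a connection to the URL
to_write = filedata.read()              # read the whole response as bytes
# "wb" opens the file for writing in binary mode
with open("downloaded_penguins.csv", "wb") as f:
    f.write(to_write)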
Having run this code, you can check in your file explorer that a file called downloaded_penguins.csv now lives in the same directory as this notebook. We used the somewhat unusual flag "wb" to open() in order to indicate that we need to write a binary file, rather than the usual text file. This is because to_write, the return value of filedata.read(), is binary data (a bytes object) rather than a string. We might ask you in assignments to use this pattern, but we won't evaluate you on it in any timed or closed-book contexts.
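To see why the "wb" flag matters, here is a minimal sketch that runs without an internet connection. The bytes literal is a made-up stand-in for what filedata.read() returns, and the file name sample.csv is just a placeholder in a temporary directory.

```python
import os
import tempfile

# Hypothetical stand-in for the bytes returned by filedata.read().
to_write = b"species,island\nAdelie,Torgersen\n"

path = os.path.join(tempfile.mkdtemp(), "sample.csv")

# Text mode ("w") expects a str, so handing it bytes raises a TypeError.
try:
    with open(path, "w") as f:
        f.write(to_write)
    text_mode_ok = True
except TypeError:
    text_mode_ok = False

# Binary mode ("wb") accepts bytes directly.
with open(path, "wb") as f:
    f.write(to_write)

# Reading the file back in binary mode ("rb") returns the same bytes.
with open(path, "rb") as f:
    round_trip = f.read()

print(text_mode_ok)            # False
print(round_trip == to_write)  # True
```

If you would rather write in text mode, you can first convert the bytes to a string with to_write.decode("utf-8").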
The module wget is another popular tool for downloading files from the internet.
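Whichever module does the fetching, it can be convenient to wrap the download step in a small reusable function. The following is only a sketch using the standard library: download_file is a hypothetical helper name, not something provided by urllib or wget. To keep the example runnable offline, the demonstration downloads from a file:// URL pointing at a temporary file. A real web URL would be used the same way.

```python
import shutil
import tempfile
from pathlib import Path
from urllib.request import urlopen

def download_file(url, dest):
    """Copy the contents at `url` into the file `dest` (hypothetical helper)."""
    # Both the response and the output file are closed automatically.
    with urlopen(url) as response, open(dest, "wb") as f:
        # Copy in chunks rather than reading everything into memory at once.
        shutil.copyfileobj(response, f)
    return Path(dest)

# Offline demonstration: create a small local file and "download" it.
work_dir = Path(tempfile.mkdtemp())
source = work_dir / "source.csv"
source.write_bytes(b"species,island\nAdelie,Torgersen\n")

saved = download_file(source.as_uri(), work_dir / "copy.csv")
print(saved.read_bytes() == source.read_bytes())
```

Copying in chunks is a design choice that matters mostly for large files. For a small CSV, the read-then-write pattern works just as well.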
Often, we want to access the contents of a webpage. In this case, the urlopen function from the request submodule of urllib can help us easily access the contents of a desired URL.
from urllib.request import urlopen
url = "https://philchodrow.github.io/PIC16A/schedule/"
page = urlopen(url)
html_bits = page.read()
html = html_bits.decode("utf-8")
print(html[0:500])
<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="viewport" content="width=device-width, initial-scale=1"> <title>PIC16A: Course Schedule (Fall 2020)</title> <meta name="description" content="Course materials for PIC16A at UCLA"> <link rel="stylesheet" href="/PIC16A//_css/main.css"> <link rel="canonical" href="http://philchodrow.github.io/PIC16A/PIC16A//schedule/"> <link rel="alternate" type="application/rs
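The read-then-decode steps above assume that the page is encoded as UTF-8. As a hedged sketch, we can instead ask the response which character set it declares and fall back to UTF-8 when it declares none. The helper name fetch_text is hypothetical, and the demonstration reads a temporary local file through a file:// URL so that it runs without a network connection.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

def fetch_text(url, default_encoding="utf-8"):
    """Return the contents at `url` as a str (hypothetical helper)."""
    with urlopen(url) as page:
        raw = page.read()  # bytes
        # Use the charset declared in the response headers, if there is one.
        charset = page.headers.get_content_charset() or default_encoding
    return raw.decode(charset)

# Offline demonstration with a small UTF-8 encoded HTML file.
sample = Path(tempfile.mkdtemp()) / "sample.html"
sample.write_bytes("<title>Café schedule</title>".encode("utf-8"))

html = fetch_text(sample.as_uri())
print(type(html).__name__)
print(html)
```

Decoding with the wrong character set can either raise a UnicodeDecodeError or silently garble non-ASCII characters, which is why checking the declared charset is a reasonable precaution.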
import re
urls = re.findall(r'href=[\'"]?([^\'">]+)', html)
urls
[url for url in urls if "http" in url]
# ---
['http://philchodrow.github.io/PIC16A/PIC16A//schedule/', 'http://philchodrow.github.io/PIC16A/PIC16A//feed.xml', 'https://fonts.googleapis.com/css?family=Titillium+Web:600italic,600,400,400italic', 'https://fonts.googleapis.com/css2?family=Lato&display=swap', 'https://fonts.googleapis.com/css2?family=Lato:ital,wght@0,400;0,700;1,400&display=swap', 'https://fonts.googleapis.com/css2?family=Raleway&display=swap', 'https://use.fontawesome.com/releases/v5.2.0/css/all.css', 'http://philchodrow.github.io/PIC16A/syllabus/', 'http://philchodrow.github.io/PIC16A/schedule/', 'http://philchodrow.github.io/PIC16A/materials/', 'https://github.com/philchodrow/PIC16A', 'http://www.philchodrow.com', 'https://docs.anaconda.com/anaconda/install/', 'https://docs.python.org/3/tutorial/appetite.html', 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/basics/numbers.ipynb', 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/basics/strings.ipynb', 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/basics/lists.ipynb', 'https://youtu.be/Vws-gJxqM5s', 'https://youtu.be/duCSMMX8RUc', 'https://www.youtube.com/watch?v=2e1Al1yaY4U', 'https://docs.python.org/3/tutorial/introduction.html', 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/basics/for_loops_and_comprehensions.ipynb', 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/basics/control_flow.ipynb', 'https://youtu.be/Y08doVJjv84', 'https://youtu.be/GnFg3f6oFqU', 'https://docs.python.org/3/tutorial/controlflow.html', 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/basics/more_iterables.ipynb', 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/basics/dictionaries.ipynb', 'https://youtu.be/5JUqacQcewM', 'https://youtu.be/ms1D4zEHOMM', 'https://docs.python.org/3/tutorial/datastructures.html', 
'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/functions/functions_1.ipynb', 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/functions/functions_2.ipynb', 'https://youtu.be/Y6c-1VxXYvE', 'https://youtu.be/N1jT_ZpplQs', 'https://docs.python.org/3/tutorial/controlflow.html#defining-functions', 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/functions/functions_3.ipynb', 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/functions/exceptions.ipynb', 'https://youtu.be/ojdHJ4qSkaM', 'https://youtu.be/JEKXteMktwA', 'https://docs.python.org/3/tutorial/errors.html', 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/object_oriented_programming/class_and_objects_I.ipynb', 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/object_oriented_programming/classes_and_objects_II.ipynb', 'https://youtu.be/_GrQScemoz4', 'https://youtu.be/PjOpuWaK40k', 'https://docs.python.org/3/tutorial/classes.html', 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/object_oriented_programming/inheritance_I.ipynb', 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/object_oriented_programming/inheritance_II.ipynb', 'https://youtu.be/XChF4v8FLq4', 'https://youtu.be/PHiAsOuApgg', 'https://docs.python.org/3/tutorial/classes.html', 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/object_oriented_programming/Iterators_1.ipynb', 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/object_oriented_programming/Iterators_1.ipynb', 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/object_oriented_programming/generators.ipynb', 'https://youtu.be/kn5yT12ohlk', 'https://youtu.be/Nid6KGKeZ2E', 'https://youtu.be/okVpT_PrOx4', 'https://docs.python.org/3/tutorial/classes.html', 
'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/IO_and_modules/IO/IO.ipynb', 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/IO_and_modules/IO/online_data.ipynb', 'https://docs.python.org/3/tutorial/inputoutput.html', 'https://docs.python.org/3/library/csv.html#reader-objects', 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/IO_and_modules/modules/modules.ipynb', 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/IO_and_modules/modules/unit_testing.ipynb', 'https://youtu.be/dfH0-x1tgRo', 'https://youtu.be/TwOmk9oSaR8', 'https://www.geeksforgeeks.org/what-does-the-if-__name__-__main__-do/', 'https://docs.python.org/3/library/unittest.html', 'https://github.com/philchodrow/PIC16A', 'https://twitter.com/philchodrow', 'http://www.philchodrow.com']
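To see what the pattern passed to re.findall is doing, it helps to try it on a tiny piece of HTML. The sample string below is made up for illustration. The pattern looks for href=, allows an optional opening quote, and then captures everything up to the next quote or >.

```python
import re

# Made-up HTML with double-quoted, single-quoted, and unquoted href values.
sample_html = (
    '<a href="https://example.com/a">A</a> '
    "<a href='/syllabus/'>S</a> "
    "<link href=style.css>"
)

# href=       the literal attribute name
# [\'"]?      an optional opening quote (single or double)
# ([^\'">]+)  capture one or more characters that are not a quote or ">"
pattern = r'href=[\'"]?([^\'">]+)'

links = re.findall(pattern, sample_html)
print(links)     # ['https://example.com/a', '/syllabus/', 'style.css']

# Keep only the links that contain "http", as in the filter above.
absolute = [link for link in links if "http" in link]
print(absolute)  # ['https://example.com/a']
```

Note that the check "http" in link is a substring test, so it would also keep a relative link that merely contains the text http somewhere in the middle. Using link.startswith("http") is a stricter alternative.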
As you can imagine, parsing HTML in order to extract useful content is a difficult problem. We will revisit this problem when we learn regular expressions in a few lectures. Here's an example of the kind of thing we'll be able to do:
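For instance, given a list of URLs like the one extracted from the schedule page, a regular expression can pick out the video identifiers from the youtu.be links. This is only a sketch: the four URLs are copied from the output above, while the pattern and the variable names are our own choices.

```python
import re

# A few URLs copied from the list extracted from the schedule page.
urls = [
    "https://youtu.be/Vws-gJxqM5s",
    "https://docs.python.org/3/tutorial/introduction.html",
    "https://youtu.be/duCSMMX8RUc",
    "https://github.com/philchodrow/PIC16A",
]

# Capture the identifier after "youtu.be/": letters, digits, "_" or "-".
video_pattern = re.compile(r"https://youtu\.be/([\w-]+)")

video_ids = []
for url in urls:
    match = video_pattern.fullmatch(url)
    if match:  # fullmatch returns None when the URL is not a youtu.be link
        video_ids.append(match.group(1))

print(video_ids)  # ['Vws-gJxqM5s', 'duCSMMX8RUc']
```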