Online Data

Webscraping is the activity of downloading, manipulating, and using information obtained online. It can get very complicated, and we won't do much of it in this course. This set of lecture notes can help you get started on the basics. We'll look into this a bit more when we get to regular expressions in a few lectures.

Downloading Files

There are several modules for downloading files from the internet. We'll use urllib:

In [1]:

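# Import the submodule explicitly: a bare "import urllib" is not guaranteed to load urllib.request
import urllib.request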

In [3]:

url = "https://philchodrow.github.io/PIC16A/content/IO_and_modules/IO/palmer_penguins.csv"

filedata = urllib.request.urlopen(url)
to_write = filedata.read()

with open("downloaded_penguins.csv", "wb") as f:
    f.write(to_write)

Having run this code, you can check in your file explorer that a file called downloaded_penguins.csv now lives in the same directory as this notebook. We passed the somewhat unusual mode "wb" to open() to indicate that we are writing a binary file rather than the usual text file. This is because to_write, the return value of filedata.read(), is a bytes object rather than a string. We might ask you to use this pattern in assignments, but we won't evaluate you on it in any timed or closed-book contexts.
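The same pattern can be wrapped in a small function. The following is a minimal sketch, not part of the original notes: the name download_file and the sample data are made up for illustration. So that it runs without an internet connection, it "downloads" from a file:// URL pointing at a temporary local file; urlopen accepts an https:// URL such as the penguins one above in the same way.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def download_file(url, filename):
    """Fetch url, save the raw bytes to filename, and return the byte count.

    Hypothetical helper for illustration only.
    """
    # The with-statement closes the connection even if reading fails.
    with urlopen(url) as response:
        data = response.read()  # a bytes object, not a str

    # "wb" = write binary, matching the bytes we just read.
    with open(filename, "wb") as f:
        f.write(data)
    return len(data)


# Offline stand-in for a web server: a small local file reached via file://
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.csv"
source.write_bytes(b"species,island\nAdelie,Torgersen\n")

n_bytes = download_file(source.as_uri(), workdir / "copy.csv")
print(n_bytes, "bytes written")
```

Opening the destination with "w" instead of "wb" would raise a TypeError here, because a text-mode file expects a str and to_write is bytes.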

The module wget, a third-party package that must be installed separately, is another popular tool for downloading files from the internet.

Data from Websites

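Often, we want to access the contents of a webpage. In this case, the urlopen function in the request submodule of urllib gives us easy access to the contents of a desired URL. As before, read() returns bytes, so we decode them into a string before working with the text.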

In [11]:

from urllib.request import urlopen

url = "https://philchodrow.github.io/PIC16A/schedule/"

page = urlopen(url)
html_bits = page.read()

html = html_bits.decode("utf-8")

print(html[0:500])

<!DOCTYPE html>
<html>

  <head>
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1">

  <title>PIC16A: Course Schedule (Fall 2020)</title>
  <meta name="description" content="Course materials for PIC16A at UCLA">

  <link rel="stylesheet" href="/PIC16A//_css/main.css">
  <link rel="canonical" href="http://philchodrow.github.io/PIC16A/PIC16A//schedule/">
  <link rel="alternate" type="application/rs
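The decode("utf-8") step above is what turns the raw bytes into a string we can slice and search. Here is a small offline sketch of that conversion; the HTML snippet is invented for illustration and stands in for the result of page.read().

```python
# Bytes as they might arrive from page.read(); the "é" is not an ASCII character
html_bits = "<title>Café menu</title>".encode("utf-8")

# Decode with the same encoding the page was written in
html = html_bits.decode("utf-8")

# In UTF-8, "é" takes two bytes, so the byte count exceeds the character count
print(type(html_bits).__name__, len(html_bits))
print(type(html).__name__, len(html))

# Decoding with the wrong codec does not raise an error here, but it garbles the text
wrong = html_bits.decode("latin-1")
print(wrong)
```

We used "utf-8" because the page declares it in its meta charset tag. For a real HTTP response, page.headers.get_content_charset() reports the encoding the server declared, if it declared one.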

In [13]:

import re

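# Capture whatever follows href=, with or without quotes, up to the next quote or >
urls = re.findall(r'href=[\'"]?([^\'">]+)', html)

# Keep only the links that contain "http" (this drops relative links such as /PIC16A//_css/main.css)
[url for url in urls if "http" in url]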

Out[13]:

['http://philchodrow.github.io/PIC16A/PIC16A//schedule/',
 'http://philchodrow.github.io/PIC16A/PIC16A//feed.xml',
 'https://fonts.googleapis.com/css?family=Titillium+Web:600italic,600,400,400italic',
 'https://fonts.googleapis.com/css2?family=Lato&display=swap',
 'https://fonts.googleapis.com/css2?family=Lato:ital,wght@0,400;0,700;1,400&display=swap',
 'https://fonts.googleapis.com/css2?family=Raleway&display=swap',
 'https://use.fontawesome.com/releases/v5.2.0/css/all.css',
 'http://philchodrow.github.io/PIC16A/syllabus/',
 'http://philchodrow.github.io/PIC16A/schedule/',
 'http://philchodrow.github.io/PIC16A/materials/',
 'https://github.com/philchodrow/PIC16A',
 'http://www.philchodrow.com',
 'https://docs.anaconda.com/anaconda/install/',
 'https://docs.python.org/3/tutorial/appetite.html',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/basics/numbers.ipynb',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/basics/strings.ipynb',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/basics/lists.ipynb',
 'https://youtu.be/Vws-gJxqM5s',
 'https://youtu.be/duCSMMX8RUc',
 'https://www.youtube.com/watch?v=2e1Al1yaY4U',
 'https://docs.python.org/3/tutorial/introduction.html',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/basics/for_loops_and_comprehensions.ipynb',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/basics/control_flow.ipynb',
 'https://youtu.be/Y08doVJjv84',
 'https://youtu.be/GnFg3f6oFqU',
 'https://docs.python.org/3/tutorial/controlflow.html',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/basics/more_iterables.ipynb',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/basics/dictionaries.ipynb',
 'https://youtu.be/5JUqacQcewM',
 'https://youtu.be/ms1D4zEHOMM',
 'https://docs.python.org/3/tutorial/datastructures.html',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/functions/functions_1.ipynb',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/functions/functions_2.ipynb',
 'https://youtu.be/Y6c-1VxXYvE',
 'https://youtu.be/N1jT_ZpplQs',
 'https://docs.python.org/3/tutorial/controlflow.html#defining-functions',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/functions/functions_3.ipynb',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/functions/exceptions.ipynb',
 'https://youtu.be/ojdHJ4qSkaM',
 'https://youtu.be/JEKXteMktwA',
 'https://docs.python.org/3/tutorial/errors.html',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/object_oriented_programming/class_and_objects_I.ipynb',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/object_oriented_programming/classes_and_objects_II.ipynb',
 'https://youtu.be/_GrQScemoz4',
 'https://youtu.be/PjOpuWaK40k',
 'https://docs.python.org/3/tutorial/classes.html',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/object_oriented_programming/inheritance_I.ipynb',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/object_oriented_programming/inheritance_II.ipynb',
 'https://youtu.be/XChF4v8FLq4',
 'https://youtu.be/PHiAsOuApgg',
 'https://docs.python.org/3/tutorial/classes.html',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/object_oriented_programming/Iterators_1.ipynb',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/object_oriented_programming/Iterators_1.ipynb',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/object_oriented_programming/generators.ipynb',
 'https://youtu.be/kn5yT12ohlk',
 'https://youtu.be/Nid6KGKeZ2E',
 'https://youtu.be/okVpT_PrOx4',
 'https://docs.python.org/3/tutorial/classes.html',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/IO_and_modules/IO/IO.ipynb',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/IO_and_modules/IO/online_data.ipynb',
 'https://docs.python.org/3/tutorial/inputoutput.html',
 'https://docs.python.org/3/library/csv.html#reader-objects',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/IO_and_modules/modules/modules.ipynb',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/IO_and_modules/modules/unit_testing.ipynb',
 'https://youtu.be/dfH0-x1tgRo',
 'https://youtu.be/TwOmk9oSaR8',
 'https://www.geeksforgeeks.org/what-does-the-if-__name__-__main__-do/',
 'https://docs.python.org/3/library/unittest.html',
 'https://github.com/philchodrow/PIC16A',
 'https://twitter.com/philchodrow',
 'http://www.philchodrow.com']
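To see what the pattern is doing, it helps to try it on a few lines of HTML small enough to check by hand. The sketch below uses a made-up sample string rather than the real schedule page. It also shows one way to turn the relative links into full URLs with urljoin from urllib.parse, which is an addition here and not something the cell above does.

```python
import re
from urllib.parse import urljoin

# Invented sample: double-quoted, single-quoted, and unquoted href values
sample = '''
<a href="https://docs.python.org/3/">docs</a>
<a href='/PIC16A/syllabus/'>syllabus</a>
<link rel="stylesheet" href=/css/main.css>
'''

# href=      the literal text href=
# [\'"]?     an optional opening quote, single or double
# ([^\'">]+) capture one or more characters that are not a quote or >
pattern = r'href=[\'"]?([^\'">]+)'
links = re.findall(pattern, sample)
print(links)

# Resolve relative links against the address of the page they came from
base = "https://philchodrow.github.io/PIC16A/schedule/"
absolute = [urljoin(base, link) for link in links]
print(absolute)
```

Links that are already absolute pass through urljoin unchanged, so the same call can be applied to every match.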

As you can imagine, parsing HTML in order to extract useful content is a difficult problem. We will revisit this problem when we learn regular expressions in a few lectures. The cell above, which uses a regular expression to pull every link out of the page, is an example of the kind of thing we'll be able to do.
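Regular expressions are not the only route. As a point of comparison, here is a minimal sketch using the standard library's html.parser module, which understands HTML structure instead of matching raw text. The class name LinkCollector and the sample string are made up for illustration; this is not the approach used in these notes.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every tag the parser encounters (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag just opened
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)


# Invented sample: two real links plus one that is commented out
sample = (
    '<p>Visit <a href="https://www.python.org">Python</a> or '
    '<a class="x" href="/about/">about</a>.</p>'
    '<!-- <a href="hidden.html"> -->'
)

collector = LinkCollector()
collector.feed(sample)
print(collector.links)

# The regular expression sees only text, so it also matches inside the comment
regex_links = re.findall(r'href=[\'"]?([^\'">]+)', sample)
print(regex_links)
```

The parser skips the commented-out link because it treats the comment as a comment, while the regular expression picks it up. This is one small instance of why extracting content from HTML is harder than it first looks.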