Welcome to Lab 2!
Last time, we had our first look at Python and Jupyter notebooks. So far, we've only used Python to manipulate numbers. There's a lot more to life than numbers, so Python lets us represent many other types of data in programs.
In this lab, you'll first see how to represent and manipulate another fundamental type of data: text. A piece of text is called a string in Python.
You'll also see how to invoke methods. A method is very similar to a function. Calling a method looks different because the method is tied to a particular piece of data.
Last, you'll learn more about working with datasets in Python.
First, initialize the grader. Each time you come back to this site to work on the lab, you will need to run this cell again.
import otter
grader = otter.Notebook()
The two building blocks of Python code are expressions and statements. An expression is a piece of code that evaluates to a value.
Here are two expressions that both evaluate to 3:
3
5 - 2
One important form of an expression is the call expression, which first names a function and then describes its arguments. The function returns some value, based on its arguments. Some important mathematical functions are:
Function | Description |
---|---|
abs | Returns the absolute value of its argument |
max | Returns the maximum of all its arguments |
min | Returns the minimum of all its arguments |
pow | Raises its first argument to the power of its second argument |
round | Rounds its argument to the nearest integer |
Here are two call expressions that both evaluate to 3:
abs(2 - 5)
max(round(2.8), min(pow(2, 10), -1 * pow(2, 10)))
All these expressions but the first are compound expressions, meaning that they are actually combinations of several smaller expressions. For example, 2 + 3 combines the expressions 2 and 3 by addition. In this case, 2 and 3 are called subexpressions because they're expressions that are part of a larger expression. Any expression can be used as part of a larger expression.
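One way to read a compound expression is to evaluate its subexpressions from the inside out. The sketch below does that for the second call expression above. The intermediate names (power, smallest, rounded, result) are made up for illustration and are not part of the lab.

```python
# Evaluate max(round(2.8), min(pow(2, 10), -1 * pow(2, 10))) from the inside out.
# The intermediate names are hypothetical, chosen only for this illustration.
power = pow(2, 10)                  # 2 raised to the 10th power: 1024
smallest = min(power, -1 * power)   # the smaller of 1024 and -1024: -1024
rounded = round(2.8)                # 2.8 rounded to the nearest integer: 3
result = max(rounded, smallest)     # the larger of 3 and -1024: 3

print(result)
```

Each line on its own is simple; the compound expression just nests them.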
A statement is a piece of code that makes something happen rather than having a value. For example, an assignment statement assigns a value to a name.
Every assignment statement has one = sign. The whole statement is executed by evaluating the expression on the right-hand side of the equals sign and then assigning its value to the name on the left-hand side. Here are some assignment statements:
height = 1.3
the_number_five = abs(-5)
absolute_height_difference = abs(height - 1.688)
A key idea in programming is that large, interesting things can be built by combining many simple, uninteresting things. The key to understanding a complicated piece of code is breaking it down into its simple components.
For example, a lot is going on in the last statement above, but it's really just a combination of a few things. This picture describes what's going on.
Any names that you assign in one cell are available in later cells and can be used in place of the value assigned to them.
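As a rough sketch of that breakdown, the last assignment statement above can be split into two smaller steps. The name difference is hypothetical, introduced here only to hold the intermediate value.

```python
height = 1.3

# Step 1: evaluate the subexpression inside the call to abs.
# The name "difference" is hypothetical and used only for illustration.
difference = height - 1.688          # a negative number, about -0.388

# Step 2: call abs on that value and assign the result to a name.
absolute_height_difference = abs(difference)

# The two-step version gives the same value as the one-line version.
print(absolute_height_difference == abs(height - 1.688))
```

Because height was assigned on an earlier line, later lines can use the name in place of 1.3.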
Question 1.1.
In the next cell, assign the name new_year to the larger number among the following two numbers:
Try to use just one statement (one line of code).
new_year = ...
new_year
Check your work by executing the next cell.
grader.check("q11")
Programming doesn't just concern numbers. Text is one of the most common types of values used in programs.
A snippet of text is represented by a string value in Python. The word "string" is a programming term for a sequence of characters. A string might contain a single character, a word, a sentence, or a whole book.
To distinguish text data from actual code, we demarcate strings by putting quotation marks around them. Single quotes (') and double quotes (") are both valid, but the types of opening and closing quotation marks must match. The contents can be any sequence of characters, including numbers and symbols.
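A few small examples may make the quoting rules concrete. This is only an illustrative sketch, and every name in it is hypothetical.

```python
# Single and double quotes produce the same kind of value.
single_quoted = 'Data Science'
double_quoted = "Data Science"
print(single_quoted == double_quoted)   # True

# Using one kind of quote on the outside lets the other kind appear inside.
contains_apostrophe = "It's a string"
contains_double_quotes = 'She said "hello"'
print(contains_apostrophe)
print(contains_double_quotes)

# Digits and symbols inside quotes are text, not arithmetic.
looks_like_math = "3 + 4"
print(looks_like_math)                  # prints 3 + 4, not 7
```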
We've seen strings before in print statements. Below, two different strings are passed as arguments to the print function.
print("I <3", 'Data Science')
Just like names can be given to numbers, names can be given to string values. The names and strings aren't required to be similar in any way. Any name can be assigned to any string.
one = 'two'
plus = '*'
print(one, plus, one)
Question 2.1.
Yuri Gagarin was the first person to travel through outer space. When he emerged from his capsule upon landing on Earth, he reportedly had the following conversation with a woman and girl who saw the landing:
The woman asked: "Can it be that you have come from outer space?"
Gagarin replied: "As a matter of fact, I have!"
The cell below contains unfinished code. Fill in the ...s so that it prints out this conversation exactly as it appears above.
woman_asking = ...
woman_quote = '"Can it be that you have come from outer space?"'
gagarin_reply = 'Gagarin replied:'
gagarin_quote = ...
print(woman_asking, woman_quote)
print(gagarin_reply, gagarin_quote)
grader.check("q21")
Strings can be transformed using methods, which are functions that involve an existing string and some other arguments. One example is the replace method, which replaces all instances of some part of a string with some alternative.
A method is invoked on a string by placing a . after the string value, then the name of the method, and finally parentheses containing the arguments. Here's a sketch, where the < and > symbols aren't part of the syntax; they just mark the boundaries of sub-expressions.
<expression that evaluates to a string>.<method name>(<argument>, <argument>, ...)
Try to predict the output of these examples, then execute them.
# Replace one letter
'Hello'.replace('H', 'C')
# Replace a sequence of letters, which appears twice
'hitchhiker'.replace('hi', 'ma')
Once a name is bound to a string value, methods can be invoked on that name as well. The name is still bound to the original string, so a new name is needed to capture the result.
sharp = 'edged'
hot = sharp.replace('ed', 'ma')
print('sharp:', sharp)
print('hot:', hot)
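Two details of replace are worth checking for yourself: matching is case-sensitive, and a string with no matches comes back unchanged. The sketch below uses made-up names and strings that do not appear elsewhere in the lab.

```python
word = 'banana'

# Every non-overlapping occurrence of 'an' is replaced.
shouted = word.replace('an', 'AN')
print(shouted)       # bANANa

# Matching is case-sensitive, so 'B' does not match the lowercase 'b'.
unchanged = word.replace('B', 'C')
print(unchanged)     # banana

# The original name is still bound to the original string.
print(word)          # banana
```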
You can call functions on the results of other functions. For example, max(abs(-5), abs(3)) has value 5. Similarly, you can invoke methods on the results of other method (or function) calls.
# Calling replace on the output of another call to replace
'train'.replace('t', 'ing').replace('in', 'de')
Here's a picture of how Python evaluates a "chained" method call like that:
Question 2.1.1.
Assign strings to the names you and this so that the final expression evaluates to a 10-letter English word with three double letters in a row.
Hint: The call to print is there to print out the intermediate result, which is assigned to the name the. This should be an English word with two double letters in a row.
Hint 2: Run the tests if you're stuck. They'll give you some hints.
you = ...
this = ...
a = 'beeper'
the = a.replace('p', you)
print('the:', the)
the.replace('bee', this)
grader.check("q211")
Other string methods do not take any arguments at all, because the original string is all that's needed to compute the result. In these cases, parentheses are still needed, but there's nothing in between the parentheses. Here are some methods that take no arguments:
Method name | Value |
---|---|
lower | a lowercased version of the string |
upper | an uppercased version of the string |
capitalize | a version with the first letter capitalized |
title | a version with the first letter of every word capitalized |
'unIverSITy of caliFORnia'.title()
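To compare the four methods in the table side by side, here is a small sketch that applies each one to the same string. The name phrase is hypothetical.

```python
phrase = 'unIverSITy of caliFORnia'

# None of these methods take arguments, but the parentheses are still required.
print(phrase.lower())        # university of california
print(phrase.upper())        # UNIVERSITY OF CALIFORNIA
print(phrase.capitalize())   # University of california
print(phrase.title())        # University Of California
```

Note that capitalize also lowercases everything after the first letter.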
All these string methods are useful, but most programmers don't memorize their names or how to use them. Instead, people usually just search the internet for documentation and examples. A complete list of string methods appears in the Python language documentation. Stack Overflow has a huge database of answered questions that often demonstrate how to use these methods to achieve various ends.
Strings and numbers are different types of values, even when a string contains the digits of a number. For example, evaluating the following cell causes an error because an integer cannot be added to a string.
8 + "8"
However, there are built-in functions to convert numbers to strings and strings to numbers.
Function name | Effect | Example |
---|---|---|
int | Converts a string of digits and perhaps a negative sign to an integer (int) value | int("42") |
float | Converts a string of digits and perhaps a negative sign and decimal point to a decimal (float) value | float("4.2") |
str | Converts any value to a string (str) value | str(42) |
Try to predict what the following cell will evaluate to, then evaluate it.
8 + int("8")
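The sketch below shows how the same + sign behaves differently depending on whether its operands are numbers or strings after conversion. The names are hypothetical and chosen only for illustration.

```python
text_number = "42"

as_int = int(text_number)    # the integer 42
as_float = float("4.2")      # the decimal number 4.2
as_text = str(42)            # the string "42"

print(as_int + 1)            # 43: + adds numbers
print(as_text + "1")         # 421: + joins strings end to end
print(type(as_int), type(as_float), type(as_text))
```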
Suppose you're writing a program that looks for dates in a text, and you want your program to find the amount of time that elapsed between two years it has identified. It doesn't make sense to subtract two texts, but you can first convert the text containing the years into numbers.
Question 2.2.1.
Finish the code below to compute the number of years that elapsed between one_year and another_year. Don't just write the numbers 1618 and 1648 (or 30); use a conversion function to turn the given text data into numbers.
# Some text data:
one_year = "1618"
another_year = "1648"
# Complete the next line. Note that we can't just write:
# another_year - one_year
# If you don't see why, try seeing what happens when you
# write that here.
difference = ...
difference
grader.check("q221")
Question 2.2.2. Use replace and int together to compute the difference between the year 753 BC (the founding of Rome) and the year 410 AD (the sack of Rome). Try not to use any numbers in your solution, but instead manipulate the strings that are provided.
Hint: It's ok to be off by one year. In historical calendars, there is no year zero, but astronomical calendars do include year zero to simplify calculations.
founded = 'BC 753'
sacked = 'AD 410'
start = ...
end = ...
print('Ancient Rome lasted for about', end-start, 'years from', founded, 'to', sacked)
grader.check("q222")
String values, like numbers, can be arguments to functions and can be returned by functions. The function len takes a single string as its argument and returns the number of characters in the string: its length.
Note that it doesn't count words. len("one small step for man") is 22, not 5.
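Here is a short sketch of what len counts. The last line uses the string method split, which this lab does not cover; it is included only as an assumption-labeled aside to show where the word count of 5 would come from.

```python
phrase = "one small step for man"

print(len(phrase))            # 22: letters and spaces all count
print(len(""))                # 0: the empty string has no characters
print(len("  "))              # 2: spaces are characters too

# Not covered in this lab: split() breaks the text apart at whitespace,
# so the length of its result is the number of words.
print(len(phrase.split()))    # 5
```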
Question 2.3.1.
Use len to find out the number of characters in the very long string in the next cell. (It's the first sentence of the English translation of the French Declaration of the Rights of Man.) The length of a string is the total number of characters in it, including things like spaces and punctuation. Assign sentence_length to that number.
a_very_long_sentence = "The representatives of the French people, organized as a National Assembly, believing that the ignorance, neglect, or contempt of the rights of man are the sole cause of public calamities and of the corruption of governments, have determined to set forth in a solemn declaration the natural, unalienable, and sacred rights of man, in order that this declaration, being constantly before all the members of the Social body, shall remind them continually of their rights and duties; in order that the acts of the legislative power, as well as those of the executive power, may be compared at any moment with the objects and purposes of all political institutions and may thus be more respected, and, lastly, in order that the grievances of the citizens, based hereafter upon simple and incontestable principles, shall tend to the maintenance of the constitution and redound to the happiness of all."
sentence_length = ...
sentence_length
grader.check("q231")
What has been will be again,
what has been done will be done again;
there is nothing new under the sun.
Most programming involves work that is very similar to work that has been done before. Since writing code is time-consuming, it's good to rely on others' published code when you can. Rather than requiring us to copy and paste that code, Python allows us to import it, creating a module that contains all of the names created by that code.
Python includes many useful modules that are just an import away. We'll look at the math module as a first example. The math module is extremely useful in computing mathematical expressions in Python.
Suppose we want to very accurately compute the area of a circle with radius 5 meters. For that, we need the constant $\pi$, which is roughly 3.14. Conveniently, the math module has pi defined for us:
import math
radius = 5
area_of_circle = radius**2 * math.pi
area_of_circle
pi is defined inside math, and the way that we access names that are inside modules is by writing the module's name, then a dot, then the name of the thing we want:
<module name>.<name>
In order to use a module at all, we must first write the statement import <module name>. That statement creates a module object with things like pi in it and then assigns the name math to that module. Above we have done that for math.
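As one more illustration of the <module name>.<name> pattern, the sketch below computes the circumference of the same circle. The name circumference is hypothetical and not used elsewhere in the lab.

```python
import math

radius = 5

# math.pi means "the name pi inside the module named math".
circumference = 2 * math.pi * radius

print(math.pi)
print(circumference)
```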
Question 3.1.
math also provides the name e for the base of the natural logarithm, which is roughly 2.71. Compute $e^{\pi}-\pi$, giving it the name near_twenty.
near_twenty = ...
near_twenty
grader.check("q31")
Modules can provide other named things, including functions. For example, math provides the name sin for the sine function. Having imported math already, we can write math.sin(3) to compute the sine of 3. (Note that this sine function considers its argument to be in radians, not degrees. 180 degrees are equivalent to $\pi$ radians.)
Question 3.1.1.
A $\frac{\pi}{4}$-radian (45-degree) angle forms a right triangle with equal base and height, pictured below. If the hypotenuse (the radius of the circle in the picture) is 1, then the height is $\sin(\frac{\pi}{4})$. Compute that using sin and pi from the math module. Give the result the name sine_of_pi_over_four.
sine_of_pi_over_four = ...
sine_of_pi_over_four
grader.check("q311")
For your reference, here are some more examples of functions from the math module.
Note how different functions take different numbers of arguments. Often, the module's documentation tells you how many arguments each function requires.
# Calculating factorials.
math.factorial(5)
# Calculating logarithms (the logarithm of 8 in base 2).
# The result is 3 because 2 to the power of 3 is 8.
math.log(8, 2)
# Calculating square roots.
math.sqrt(5)
There are several ways to import names from a module. For example, we can import just specific names from a module, we can give a module a shorter nickname when we import it, and we can import every name from a module at once.
# Importing just cos and pi from math.
# Now, we don't have to use "math." before these names.
from math import cos, pi
print(cos(pi))
# We can nickname math as something else, if we don't want to type the name math
import math as m
m.log(m.pi)
# Lastly, we can import everything from math and use all of its names without "math."
from math import *
log(pi)
People have written Python functions that do very cool and complicated things, like crawling web pages for data, transforming videos, or learning functions from data. Now that you can import things, when you want to do something with code, first check to see if someone else has done it for you.
Let's see an example of a function that's used for downloading and displaying pictures.
The module IPython.display provides a function called Image. The Image function takes a single argument, a string that is the URL of the image on the web. It returns an image value that this Jupyter notebook understands how to display. To display an image, make it the value of the last expression in a cell, just like you'd display a number or a string.
Question 3.1.2.
In the next cell, import the module IPython.display and use its Image function to display the image at this URL:
https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/David_-_The_Death_of_Socrates.jpg/1024px-David_-_The_Death_of_Socrates.jpg
Give the name art to the output of the call to Image. (It might take a few seconds to load the image. It's a painting called The Death of Socrates by Jacques-Louis David, depicting events from a philosophical text by Plato.)
Hint: A link isn't a special data type in Python. You can't just write a link into Python and expect it to work; you need to type the link in as a specific data type. Which one makes the most sense?
# Import the module IPython.display. Watch out for capitalization.
import IPython.display
# Replace the ... with a call to the Image function
# in the IPython.display module, which should produce
# a picture.
art = ...
art
grader.check("q312")
Up to now, we haven't done much that you couldn't do yourself by hand, without going through the trouble of learning Python. Computers are most useful when a small amount of code performs a lot of work by performing the same action on many different things.
For example, in the time it takes you to calculate the 18% tip on a restaurant bill, a laptop can calculate 18% tips for every restaurant bill paid by every human on Earth that day. (That's if you're pretty fast at doing arithmetic in your head!)
Arrays are how we put many values in one place so that we can operate on them as a group. For example, if billions_of_numbers is an array of numbers, the expression .18 * billions_of_numbers gives a new array of numbers that's the result of multiplying each number in billions_of_numbers by .18 (18%). Arrays are not limited to numbers; we can also put all the words in a book into an array of strings.
Concretely, an array is a collection of values of the same type, like a column in an Excel spreadsheet.
You can type in the data that goes in an array yourself, but that's not typically how programs work. Normally, we create arrays by loading them from an external source, like a data file.
First, though, let's learn how to start from scratch. Execute the following cell so that all the names from the datascience module are available to you. The documentation for this module is available at http://data8.org/datascience.
from datascience import *
Now, to create an array, call the function make_array. Each argument you pass to make_array will be in the array it returns. Run this cell to see an example:
make_array(0.125, 4.75, -1.3)
Each value in an array (in the above case, the numbers 0.125, 4.75, and -1.3) is called an element or item of that array.
Arrays themselves are also values, just like numbers and strings. That means you can assign them names or use them as arguments to functions.
Question 4.1.1.
Make an array containing the numbers 1, 2, and 3, in that order. Name it small_numbers.
small_numbers = ...
small_numbers
grader.check("q411")
Question 4.1.2.
Make an array containing the numbers 0, 1, -1, $\pi$, and $e$, in that order. Name it interesting_numbers.
Hint: How did you get the values $\pi$ and $e$ earlier? You can refer to them in exactly the same way here.
interesting_numbers = ...
interesting_numbers
grader.check("q412")
Question 4.1.3.
Make an array containing the five strings "Hello", ",", " ", "world", and "!". (The third one is a single space inside quotes.) Name it hello_world_components.
Note: If you print hello_world_components, you'll notice some extra information in addition to its contents: dtype='<U5'. That's just NumPy's extremely cryptic way of saying that the things in the array are strings.
hello_world_components = ...
hello_world_components
grader.check("q413")
The join method of a string takes an array of strings as its argument and puts all of the elements together into one string. Try it:
'-'.join(make_array('a', 'b', 'c', 'd'))
Question 4.1.4.
Assign separator to a string so that the name hello is bound to the string 'Hello, world!' in the cell below.
separator = ...
hello = separator.join(hello_world_components)
hello
grader.check("q414")
np.arange
Arrays are provided by a package called NumPy (pronounced "NUM-pie" or, if you prefer to pronounce things incorrectly, "NUM-pee"). The package is called numpy, but it's standard to rename it np for brevity. You can do that with:
import numpy as np
Very often in data science, we want to work with many numbers that are evenly spaced within some range. NumPy provides a special function for this called arange. np.arange(start, stop, space) produces an array with all the numbers starting at start and counting up by space, stopping before stop is reached.
For example, the value of np.arange(1, 6, 2) is an array with elements 1, 3, and 5; it starts at 1 and counts up by 2, then stops before 6. In other words, it's equivalent to make_array(1, 3, 5).
np.arange(4, 9, 1) is an array with elements 4, 5, 6, 7, and 8. (It doesn't contain 9 because np.arange stops before the stop value is reached.)
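A few more calls may help fix the "stop before stop" rule in mind. This is a sketch with made-up names and values, assuming NumPy is installed.

```python
import numpy as np

# Start at 0, count up by 3, stop before reaching 10.
by_threes = np.arange(0, 10, 3)
print(by_threes)       # [0 3 6 9]

# The stop value is never included, even when the steps land exactly on it.
stops_early = np.arange(0, 9, 3)
print(stops_early)     # [0 3 6]

# The spacing does not have to be a whole number.
quarters = np.arange(0, 1, 0.25)
print(quarters)
```

To include the stop value itself, the stop argument has to be a little larger than the last element you want.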
Question 4.1.1.1.
Import numpy as np and then use np.arange to create an array with the multiples of 99 from 0 up to (and including) 9999. (So its elements are 0, 99, 198, 297, etc.)
...
multiples_of_99 = ...
multiples_of_99
grader.check("q4111")
NOAA (the US National Oceanic and Atmospheric Administration) operates weather stations that measure surface temperatures at different sites around the United States. The hourly readings are publicly available.
Suppose we download all the hourly data from the Oakland, California site for the month of December 2015. To analyze the data, we want to know when each reading was taken, but we find that the data don't include the timestamps of the readings (the time at which each one was taken).
However, we know the first reading was taken at the first instant of December 2015 (midnight on December 1st) and each subsequent reading was taken exactly 1 hour after the last.
Question 4.1.1.2.
Create an array of the time, in seconds, since the start of the month at which each hourly reading was taken. Name it collection_times.
Hint 1: There were 31 days in December, which is equivalent to ($31 \times 24$) hours or ($31 \times 24 \times 60 \times 60$) seconds. So your array should have $31 \times 24$ elements in it.
Hint 2: The len function works on arrays, too. If your collection_times isn't passing the tests, check its length and make sure it has $31 \times 24$ elements.
collection_times = ...
collection_times
grader.check("q4112")
Let's work with a more interesting dataset. The next cell creates an array called population that includes estimated world populations in every year from 1950 to roughly the present. (The estimates come from the US Census Bureau website.)
Rather than type in the data manually, we've loaded them from a file on your computer called world_population.csv. You'll learn how to do that next week.
# Don't worry too much about what goes on in this cell.
from datascience import *
population = Table.read_table("world_population.csv").column("Population")
population
Here's how we get the first element of population, which is the world population in the first year in the dataset, 1950.
population.item(0)
The value of that expression is the number 2557628654 (roughly 2.6 billion), because that's the first thing in the array population.
Notice that we wrote .item(0), not .item(1), to get the first element. This is a weird convention in computer science. 0 is called the index of the first item. It's the number of elements that appear before that item. So 3 is the index of the 4th item.
Here are some more examples. In the examples, we've given names to the things we get out of population. Read and run each cell.
# The third element in the array is the population
# in 1952.
population_1952 = population.item(2)
population_1952
# The thirteenth element in the array is the population
# in 1962 (which is 1950 + 12).
population_1962 = population.item(12)
population_1962
# The 66th element is the population in 2015.
population_2015 = population.item(65)
population_2015
# The array has only 66 elements, so this doesn't work.
# (There's no element with 66 other elements before it.)
population_2016 = population.item(66)
population_2016
# Since make_array returns an array, we can call .item(3)
# on its output to get its 4th element, just like we
# "chained" together calls to the method "replace" earlier.
make_array(-1, -3, 4, -2).item(3)
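Since indices start at 0, the largest valid index is one less than the array's length. The sketch below illustrates this on a small stand-in array; it uses np.array directly as an assumption in place of make_array, and the name values is hypothetical.

```python
import numpy as np

# A small stand-in array (assumption: built with np.array rather than make_array).
values = np.array([10, 20, 30, 40])

print(values.item(0))     # 10, the first element
print(values.item(3))     # 40, the fourth and last element
print(len(values))        # 4

# Asking for an index equal to the length raises an IndexError.
try:
    values.item(4)
except IndexError as error:
    print("IndexError:", error)
```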
Question 4.2.1.
Set population_1973 to the world population in 1973, by getting the appropriate element from population using item.
population_1973 = ...
population_1973
grader.check("q421")
Arrays are primarily useful for doing the same operation many times, so we don't often have to use .item and work with single elements.
Here is one simple question we might ask about world population:
How big was the population in orders of magnitude in each year?
The logarithm function is one way of measuring how big a number is. The logarithm (base 10) of a number increases by 1 every time we multiply the number by 10. It's like a measure of how many decimal digits the number has, or how big it is in orders of magnitude.
We could try to answer our question like this, using the log10 function from the math module and the item method you just saw:
import math
population_1950_magnitude = math.log10(population.item(0))
population_1951_magnitude = math.log10(population.item(1))
population_1952_magnitude = math.log10(population.item(2))
population_1953_magnitude = math.log10(population.item(3))
...
But this is tedious and doesn't really take advantage of the fact that we are using a computer.
Instead, NumPy provides its own version of log10 that takes the logarithm of each element of an array. It takes a single array of numbers as its argument. It returns an array of the same length, where the first element of the result is the logarithm of the first element of the argument, and so on.
Question 4.3.1.
Use it to compute the logarithms of the world population in every year. Give the result (an array of 66 numbers) the name population_magnitudes. Your code should be very short.
population_magnitudes = ...
population_magnitudes
grader.check("q431")
This is called elementwise application of the function, since it operates separately on each element of the array it's called on. The textbook's section on arrays has a useful list of NumPy functions that are designed to work elementwise, like np.log10.
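To see elementwise application on an array small enough to check by hand, here is a minimal sketch. The arrays and names are made up for illustration, and np.sqrt is shown only as a second example of an elementwise NumPy function.

```python
import numpy as np

powers_of_ten = np.array([1, 10, 100, 1000])

# np.log10 is applied to each element separately.
magnitudes = np.log10(powers_of_ten)
print(magnitudes)     # approximately [0. 1. 2. 3.]

# np.sqrt works the same way: one output element per input element.
roots = np.sqrt(np.array([1, 4, 9]))
print(roots)          # [1. 2. 3.]
```

In both cases the result has the same length as the argument.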
Arithmetic also works elementwise on arrays. For example, you can divide all the population numbers by 1 billion to get numbers in billions:
population_in_billions = population / 1000000000
population_in_billions
You can do the same with addition, subtraction, multiplication, and exponentiation (**). For example, you can calculate a tip on several restaurant bills at once (in this case just 3):
restaurant_bills = make_array(20.12, 39.90, 31.01)
print("Restaurant bills:\t", restaurant_bills)
tips = .2 * restaurant_bills
print("Tips:\t\t\t", tips)
Question 4.3.2.
Suppose the total charge at a restaurant is the original bill plus the tip. That means we can multiply the original bill by 1.2 to get the total charge. Compute the total charge for each bill in restaurant_bills.
total_charges = ...
total_charges
grader.check("q432")
Question 4.3.3.
more_restaurant_bills.csv contains 100,000 bills! Compute the total charge for each one. How is your code different?
more_restaurant_bills = Table.read_table("more_restaurant_bills.csv").column("Bill")
more_total_charges = ...
more_total_charges
grader.check("q433")
The function sum takes a single array of numbers as its argument. It returns the sum of all the numbers in that array (so it returns a single number, not an array).
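As a quick sketch of the difference between an elementwise result and a single-number result, consider the small made-up array below. It is built with np.array as an assumption in place of make_array, and np.sum is shown only as NumPy's equivalent of the built-in sum.

```python
import numpy as np

# Hypothetical values chosen so the total is easy to check by hand.
small_bills = np.array([10.0, 20.0, 30.0])

doubled = 2 * small_bills     # elementwise: still an array of 3 numbers
total = sum(small_bills)      # a single number

print(doubled)
print(total)                  # 60.0
print(np.sum(small_bills))    # 60.0
```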
Question 4.3.4.
What was the sum of all the bills in more_restaurant_bills, including tips?
sum_of_bills = ...
sum_of_bills
grader.check("q434")
Question 4.3.5.
The powers of 2 ($2^0 = 1$, $2^1 = 2$, $2^2 = 4$, etc.) arise frequently in computer science. (For example, you may have noticed that storage on smartphones or USBs comes in powers of 2, like 16 GB, 32 GB, or 64 GB.) Use np.arange and the exponentiation operator ** to compute the first 15 powers of 2, starting from $2^0$.
powers_of_2 = ...
powers_of_2
grader.check("q435")
Once you're finished, select "Save and Checkpoint" in the File menu and then execute the cells below to rerun all autograder tests and generate a PDF of this notebook. Download the PDF and this IPYNB file and submit these to Gradescope.
# For your convenience, you can run this cell to run all the tests at once!
grader.check_all()
# This will generate a PDF of this notebook.
grader.export("lab02.ipynb")