# HIDDEN
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import numpy as np
import math
In this note, we'll go over the structure of Python code in a bit more detail than we have before. When you've absorbed this material, you should be able to read Python code and decompose it into simple, understandable parts. This note should be particularly useful if you've seen a lot of Python code, but you have a hard time interpreting complicated-looking code like table['foo'] = np.array([1,2,3]) + table['bar'].
Decomposing Python into small parts is kind of like diagramming an English sentence. While our brains are perfectly capable of generating and understanding English without explicitly identifying things like subjects and predicates, Python interprets code very literally according to its rules (its syntax). So if you want to understand Python code, it's more important to have a precise model of Python's rules in your head. On the flip side, Python's rules are much simpler than those of English (see, for example, this amusingly complicated English sentence). They just seem complicated because we're less familiar with them. That makes it possible to learn Python much faster than you learned English.
Note: Everything in this note is also available, with even more pedantic precision, at the official Python language reference. This note is focused on the material in chapters 6, 7, and 8 of the reference. We will omit some details and fudge some truths in the interest of pedagogy. Once you feel like an expert in this stuff, feel free to brave the official documentation.
This note contains a bunch of code cells, in addition to text. The code cells typically illustrate points from the text. Please run the code cells as you go through the note, and pay attention to what their output is. Recall that the thing that's printed when you run a cell is the value of the last line.
Below is a cell containing various Python code that might look familiar by now.
3 # Line 1.0
z = 3 # Line 1.1
4+3 # Line 1.2
y = 4+3 # Line 1.3
(2+3)+z # Line 1.4
"foo"+"bar" # Line 1.5
[1,2,3] # Line 1.6
x = [1,2,3] # Line 1.7
sum(x) # Line 1.8
x[2] # Line 1.9
x[2] = 4 # Line 1.10
t = Table() # Line 1.11
t['Things'] = np.array(["foo", "bar", "baz"]) # Line 1.12
t.sort('Things') # Line 1.13
u = t.sort('Things') # Line 1.14
u.relabel('Things', 'Nonsense') # Line 1.15
u # Line 1.16
(The # Line X comments are just there for labeling; don't consider them part of the lines. Similarly, other instances of # some text here that you see in this note are just for explanation. In the text below, "line 4" means the line labeled Line 1.4, and so on.) Each line in the cell is a statement. A statement is a (somewhat) self-contained piece of code. Python executes statements in the order in which they appear. There are many kinds of statements, and to execute a statement, Python first has to figure out what kind of statement it is.
The most basic kind of statement is the expression.
Line 0 above is just an expression: 3. Like many (but not all) expressions, it has a value, the integer 3. Like some (but not all) expressions, computing its value causes nothing to "happen" to the world. (We say it has no side effects.) When Python executes line 0, it computes that value. Since nothing is done with it, it just gets discarded. The same is true of lines 2, 4, 5, 6, 8, 9, 13, and 16 -- those are expression statements that cause values to be computed, but the computation has no side effects, and the value of the full expression is eventually discarded. Line 15 is an expression that does have side effects -- it causes the 'Things' column in the table named u to be renamed to 'Nonsense'. The other lines are statements but not expressions, but we will see that, like many statements, they contain expressions.
Expressions are themselves usually made up of several smaller expressions joined together by some rules; we call these compound expressions, and we sometimes call the component expressions subexpressions. Line 2, for example, is a compound expression made up of the subexpressions 4 and 3 joined by +. Python knows what a + between two expressions means, and it puts them together so that the value of 4+3 is the value of 4 plus the value of 3, or 7.
Line 4 is another compound expression. We can think of it as the subexpressions (2+3) and z, again joined by +. But (2+3) is itself a compound expression, made up of 2 and 3 joined by +. Python first computes the value of (2+3), which is 5, and then computes the value of z, which is 3 (z having been assigned previously), and then adds 5 and 3 to get 8. Similarly, (2+3)*(4*((5*6)+7)) is also a valid expression. It contains 10 subexpressions (not including itself):
2
3
(2+3)
4
5
6
(5*6)
7
((5*6)+7)
(4*((5*6)+7))
Compound expressions can be arbitrarily complicated compositions of expressions.
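One way to see this decomposition in action is to give each subexpression's value a name and build the larger values from the smaller ones. The sketch below does this for (2+3)*(4*((5*6)+7)). The names a, b, c, d, whole, and all_at_once are made up for this illustration and aren't part of the original expression.

```python
# Evaluate (2+3)*(4*((5*6)+7)) one subexpression at a time.
# The names below are hypothetical labels for the intermediate values.
a = 2 + 3          # (2+3)
b = 5 * 6          # (5*6)
c = b + 7          # ((5*6)+7)
d = 4 * c          # (4*((5*6)+7))
whole = a * d      # the full expression

# Python computes the same value when given the expression all at once.
all_at_once = (2+3)*(4*((5*6)+7))
print(a, b, c, d, whole, whole == all_at_once)
```

Each named step is itself a valid expression that could be typed into a cell on its own.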
Question. How many subexpressions are contained in the expression ((1+2)+(3+4))+((5+6)+(7+8))?
It's critical to recognize that subexpressions are valid expressions that could be written by themselves or made part of other compound expressions. If you see a complicated expression like the one above (or even more exotic ones later), and you don't understand what it does, you can always break it down into smaller bits until you get to very basic expressions. There is a fairly small list of basic expression types (things that can't be broken down into subexpressions) to learn.
This note will tell you the rules about most of the basic expressions in Python, but in order to understand and write real code (which very regularly involves large compound expressions) you'll need to develop the skill of breaking down compound expressions into subexpressions. You can try to do that mentally while you're reading code, but if that's too hard, you can just type them into a Python code cell and see what they do.
Question. What's the value of each subexpression you found above? You can just type them into the empty code cell below if you like.
Line 5 ("foo"+"bar") is a compound expression adding two strings, with "foo" and "bar" as subexpressions. This is okay, since the + operator knows how to handle two strings. It produces the string "foobar" as its value.
When the following cell is executed, however, there is an error. (Run the cell to confirm that.)
"foo"+5 # Error!
When you see an error, don't just give up. Often (though unfortunately not always) the error message will tell you what's wrong. The error message first tells us that the problem happened on line 1 of the cell (in this case, the only line) and the text of the error is "TypeError: Can't convert 'int' object to str implicitly". (The exact wording depends on your Python version; newer versions say something like "can only concatenate str (not "int") to str".) Python evaluates "foo" and 5 just fine, but when the + operator tries to apply itself to "foo" and 5, it becomes unhappy. The error refers to the fact that the + operator tries to convert its arguments to something it can add. For example, adding an integer and a float, like 3+4.5, works because + converts the integer 3 to a float. But + can't convert a number to text (or vice-versa), so it gives up.
The important thing to realize about that cell, for our purposes, is exactly where the error happens. In the next cell, for example, some work is done before an error happens:
("foo"+"bar")+5 # Error!
Python actually evaluates the subexpression ("foo"+"bar") successfully, producing the string "foobar", before again failing to add "foobar" and 5. The error occurs only when trying to add a string and a number, and not before.
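If you'd like to watch the order of evaluation yourself, here is a small sketch. The helper named note is hypothetical (it isn't part of Python or of this note's earlier code); it just records each value it is handed and passes that value along unchanged. The sketch also uses def and try/except, which this note hasn't explained at this point, so treat them as scaffolding.

```python
evaluated = []  # values recorded in the order Python computes them

def note(value):
    """Record a value, then hand it back unchanged (hypothetical helper)."""
    evaluated.append(value)
    return value

try:
    # Same shape as ("foo"+"bar")+5, with every subexpression wrapped in note.
    note(note("foo") + note("bar")) + note(5)
except TypeError as error:
    print("Error:", error)

print(evaluated)
```

The recorded list includes "foobar", which shows that the inner addition finished before the outer one failed.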
Now, let's go over the kinds of expressions that Python has.
The most basic kinds of expressions, which we've seen repeatedly above, are string, int, and float expressions. These just look like this:
"foo" # a string expression, whose value is the string "foo"
'foo' # a string expression, essentially identical to the one above
'5' # a string expression, which happens to contain a single character called 5
5 # an int expression, whose value is the integer number 5
5.1 # a float expression, whose value is the decimal number 5.1
It's important to recognize that string, int, and float expressions produce values of different types. A string is not an int, nor is it a float. You can see the type of anything by calling type(thing) (or print it out with print(type(thing))), as in type(2), type('foo'), or
i_am_a_string = "blah"
type(i_am_a_string)
Confusingly but conveniently, many functions built into Python will try to convert values of one type to another. 3+4.5 was one example we just saw -- in order to add 3 and 4.5, Python first converts the integer 3 to the float 3.0. print(3) is another -- in order to print anything so you can see its value, the print function first converts it to a string. So sometimes you can forget about the types of values. Other times, as in the error we saw before, you have to think about types.
You can do conversions between these three types yourself with the str(), int(), and float() functions.
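A short sketch of these conversions follows. The variable names are made up for illustration. Note one detail that is easy to miss: int() applied to a float drops the fractional part rather than rounding.

```python
as_int = int("5")        # the string '5' becomes the integer 5
as_float = float("5.1")  # the string '5.1' becomes the float 5.1
as_str = str(5)          # the integer 5 becomes the string '5'

total = as_int + 3       # adding two ints works
label = "foo" + as_str   # adding two strings works
truncated = int(5.9)     # the fractional part is dropped, not rounded
print(total, label, truncated, type(as_float))
```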
Here is a more exotic kind of string expression:
"""blah
...
# looks like a comment but isn't
last line"""
The result is just a string like "foo" above, with a few differences. Triple double-quotation marks denote the beginning and end of this string, and it can take up multiple lines, unlike an ordinary string expression.
Frankly, this is an arcane detail of Python, but we bring it up because triple-quoted strings are often used for writing long-form comments in code, instead of # comments. This works even though the string is just an expression, not a special device for long comments. That's because an expression doesn't do anything by itself, except that the last expression in a Jupyter notebook cell gets printed. So you can sprinkle string expressions (or other expressions that have no side-effects) throughout your code (on their own lines) and no harm will come of it.
The following (oddly and excessively) documented code shows this:
"""The code in this cell produces
pi rounded to 5 decimal digits."""
"First, let's give a name to pi."
my_name_for_pi = math.pi
# Now, we round it to 5 decimal
# digits.
pi_rounded = round(my_name_for_pi, 5)
"Now make that the last expression in this cell."
pi_rounded
Names, also called variables, are just expressions like x or my_name_for_pi that refer to some actual values. When Python sees a name expression, it basically just substitutes the current value of that name for the name. We'll later talk about what kinds of statements assign names to values.
Line 6 above, [1,2,3], is another kind of compound expression, the list literal. Python knows that when square brackets ([]) appear by themselves with a comma-separated list of expressions inside them, we are asking for a list consisting of those expressions' values.
Again, each expression in the list can be a compound expression. So it's okay to write something like:
["foo"+"bar", sum([1,2,3]), [4, 5, 6]]
Question. Describe the value of the above list expression in English.
Line 8, sum(x), is also a compound expression, a function call. Python evaluates the subexpression sum, producing a function that adds members of lists, and the subexpression x, which was previously set to a list of integers. Then the parentheses () direct Python to call the function on the left of the parentheses (the one named sum) on the value of x, producing the value 6. Note that it's possible to write things like 5(3) or nonexistent_function(0). Python will just complain that 5 is not a function (specifically, that it is not "callable") or that nonexistent_function hasn't been defined, respectively.
The following line is similar to line 8, but the subexpression inside the parentheses, x + [4], is itself a compound expression:
sum(x + [4])
(Recall that adding two lists with + makes a new list consisting of the two lists smashed together. So x + [4] above has value equal to [1,2,4,4]. x is equal to [1,2,4], not [1,2,3] as it was defined on line 7, because on line 10 we set its last element to 4.)
We haven't seen how to define new functions yet, but here is one example to see how the expression before the ( is just an expression (whose value must be a function):
my_name_for_sum = sum
my_name_for_sum(x)
Line 9, x[2], is yet another compound expression. Python evaluates the subexpression x, producing a list, and the subexpression 2. The square brackets [], appearing immediately after an expression and with another expression inside them, tell Python to index into the value of the first expression using the value of the second expression. As x stands on line 9 (before line 10 changes it), this produces the value 3.
Notice that the code string [2] can have two different meanings, depending on the code immediately around it. If there is an expression to the left, for example x[2], then Python will take it to mean an indexing expression. If not, Python will think you mean a list with a single element, 2.
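The two meanings can be seen side by side in a small sketch. The names element, one_item, and both are made up for illustration, and x is redefined here so the cell stands alone.

```python
x = [1, 2, 3]
element = x[2]    # an expression (x) sits to the left: indexing
one_item = [2]    # nothing to the left: a list with the single element 2
both = [2][0]     # a list literal [2], then indexed at position 0
print(element, one_item, both)
```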
As with function calls, the expressions before and inside the square brackets can be compound expressions:
x[2-1]
(x + [13])[2+1]
Question. In the last cell, there are 7 subexpressions, not counting the whole expression (x + [13])[2+1]. Can you identify all of them?
Finally, note that different kinds of values support different kinds of indexing. A Table, for example, supports indexing by strings, producing a column:
t['Things']
Question. To put together list indexing and function calls, try to figure out what the following code is doing. (Note that an expression like sum has a value, like any other name expression, and that value is a function. We can put function values into lists, just like other values.)
some_functions_on_lists = [sum, len]
(some_functions_on_lists[0])(x)
Objects (just another name for a value, like 1, "four score", or a Table) often have things called properties, attributes, fields, or (in the case when the things are functions) methods. Let's call them attributes. Though in this class we won't see how to create new kinds of objects, we will use attributes all the time.
We access attributes using a dot (.). For example:
t.rows
Generically, the thing on the left of the . must be an expression whose value is an object with the attribute we want. As with calling and indexing, it can be an arbitrarily complicated compound expression. The thing on the right of the dot is the name of the attribute. Unlike the arguments of a function or the index in an indexing expression, it is not an expression. It must be the name of an attribute that the object on the left has.
As we said, sometimes an attribute is a function, in which case we sometimes call it a method instead. The syntax is the same as other attribute accesses:
t.sort
t.sort('Things')
The only difference between a method and a normal function is that the object itself (t in this case) is automatically passed as the first argument to the method. So the sort function technically has two arguments -- the first is the table that sort is being called on, and the second is the column name. This is how sort knows which table to sort! Normally this is a really technical detail that you don't need to worry about, but it can come up when you accidentally pass the wrong number of arguments to a method:
t.sort('This', 'is', 'too', 'many', 'arguments') # Error!
The error complains that we gave 6 arguments to sort, but it looks like we only passed 5. The extra first argument is the table t.
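The same automatic first argument shows up with Python's built-in types, so you can see it without a Table. In this sketch (variable names are made up for illustration), each method call is paired with the equivalent call that passes the object explicitly.

```python
word = "hello"
via_method = word.upper()        # word is passed automatically
via_function = str.upper(word)   # same function, word passed explicitly

numbers = [1, 2, 2, 3]
count_method = numbers.count(2)            # numbers is passed automatically
count_function = list.count(numbers, 2)    # numbers passed explicitly
print(via_method, via_function, count_method, count_function)
```

Both spellings call the same function; the method syntax just fills in the first argument for you.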
You might notice at some point that dots are used in two ways in Python: accessing attributes, and in expressions for floating-point numbers. For example, x.y is accessing the attribute named y in the value named x, while 1.2 is just an expression for the number 1.2. This is one reason why you can't have numbers at the start of names. It also means that the expression on the left of a . can't just be a number. For example, we can't access the attribute real of an integer this way (for this example, you don't need to know what real is doing, other than that it should just return the same value as the integer):
1.real
That's because Python can't tell whether we're trying to write an (invalid) decimal number 1.real or access the real attribute of the value 1. Surrounding the 1 in parentheses makes it clear to Python:
(1).real
These are not all the kinds of expressions in Python, and we might see more as the class goes on. But these are most of the important ones, and you've seen most of the difficult ideas.
Question. Many people, when they first encounter tables and try to use them to manipulate data, assume that Python allows more syntactic flexibility than it really does. Below are some examples of things we might hope would work, but don't. For each one, describe what it actually does, what its author was probably trying to do, what went wrong, and how to fix it.
# No error here, just setup for the next cells. Run this cell to see the table we're working with.
my_table = Table([[1, 2, 3, 4], [9, 2, 3, 1]], ['x', 'y'])
my_table
my_table['x + y']
my_table['x' + 'y']
my_table['x'] + ['y']
my_table.where('x' >= 3)
my_table.where(['x'] >= 3)
my_table.sort('y')
row_with_smallest_y = my_table.rows[0]
If we had only expressions, it would be difficult to put together many steps in our code. For example, which piece of code is more legible?
Table([['Alice', 'Bob', 'Alice', 'Alice', 'Connie'], [119.99, 29.99, 10.00, 350.00, 5.29]], ['Customer', 'Bill']).group('Customer', np.sum).sort('Bill sum', descending=True)['Customer'][0]
transactions = Table() # Line 3.0
transactions['Customer'] = ['Alice', 'Bob', 'Alice', 'Alice', 'Connie'] # Line 3.1
transactions['Bill'] = [119.99, 29.99, 10.00, 350.00, 5.29] # Line 3.2
total_bill_per_customer = transactions.group('Customer', np.sum) # Line 3.3
customers_sorted_by_total_bill = total_bill_per_customer.sort('Bill sum', descending=True)['Customer'] # Line 3.4
top_customer = customers_sorted_by_total_bill[0] # Line 3.5
top_customer # Line 3.6
Many programs do hundreds (or millions) of different things, and it would be cumbersome to do this only using expressions. In this example, we are doing only one thing, using several steps. The first cell is concise, but it's very hard to read. In the second cell, we use assignment statements to break down the steps into things that are (hopefully) understandable.
An assignment statement is executed like other statements, but it always causes an effect on the world (recall that we called these side effects). That is, subsequent statements will see the changes made by the assignment.
An assignment statement generally has two expressions separated by an equals sign. The expression on the right can be anything, but the expression on the left must be an "assignable thing". The simplest case is a name that has not been assigned to anything yet, like total_bill_per_customer on line 3 above (the line labeled Line 3.3). Before line 3 is executed, it would be an error to refer to total_bill_per_customer, but after line 3, that name can be used to refer to the table created by transactions.group('Customer', np.sum).
Assignment statements can also reassign existing names to something else:
number = 3
number = 4
number = number + 2
number
As a matter of code style, it is best to avoid this where possible, because it can make your code more confusing. (If everything is assigned only once, it's trivial to see what its value is when you read code. Otherwise you might need to hunt down all the assignments.) But occasionally it is useful, and sometimes it is necessary. We'll see examples of the latter when we cover iteration.
Lines 1 and 2 of the transactions cell above (labeled Line 3.1 and Line 3.2) are assignments to parts of an indexable thing. In this case, they add new columns to the transactions Table associated with the strings "Customer" and "Bill", respectively. Generically, an indexing assignment looks like:
<expression with indexable value>[<expression>] = <expression>
The same pattern happens when we assign elements of a list or array:
my_list = [4, 5, "foo"]
my_list[0] = "bar"
Different indexable things can have different behavior when you set something in them. For example, Tables use string indexing instead of number indexing, and they are okay with adding new columns using indexing assignments (as we saw in lines 1 and 2) or with replacing existing columns with something else. If we want to change the customer names (say because we made a mistake the first time), we could do that by changing the whole "Customer" column:
transactions['Customer'] = ['Alice', 'Bob', 'Alice', 'Alice', 'Dora']
Lists, however, don't let us add new elements this way. We can only assign new things to slots the list already has:
my_list[2] = "baz" # Okay.
my_list[3] = "garply" # Error.
Note that it is possible to make an existing list longer using extend(), or to make a new, longer copy of the list with +. You just can't do it with index assignment.
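The three behaviors just described can be compared in one sketch. It redefines my_list so the cell stands alone, uses try/except (not covered in this note) only to keep the error from stopping the cell, and the name longer_copy is made up for illustration.

```python
my_list = [4, 5, "foo"]

try:
    my_list[3] = "garply"           # no slot 3 exists yet
except IndexError as error:
    print("Error:", error)

longer_copy = my_list + ["garply"]  # a new list; my_list is unchanged
print(len(my_list), len(longer_copy))

my_list.extend(["garply"])          # my_list itself grows by one element
my_list[3] = "waldo"                # now slot 3 exists, so this works
print(my_list)
```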
Why do lists have this restriction?
Lists are supposed to contain contiguous ranges of things; they can't have "holes" that aren't indexable. If you could extend a list by assigning to it at whatever indices you wanted, you could assign elements, say, 0, 1, and 3, leaving 2 unassigned. Then what should len return for that list -- 3 or 4? And what should happen when you print it? Should it say [0,1,<blank>,3]? It's not clear. To make sure you don't have to worry about this when you use lists, Python doesn't let you do it.
A simple, standalone kind of statement is the import statement, as in import numpy as np. It has the side effect of making the numpy module available, giving it the name np. Notice that the import statement has its own special rules, and it doesn't include other expressions as subexpressions anywhere.
Modules are actually values, just like strings or functions. Saying import numpy just loads the module named numpy from the computer's library of modules and assigns it the name numpy. import numpy as np assigns it the name np instead. We could imagine that import numpy as np does something like this:
np = load_module('numpy') # BEWARE: NOT REAL PYTHON CODE.
When you say something like np.array([1,2,3]), you're accessing the array attribute of the module named np and calling it on the list given by [1,2,3]. (Note that, unlike function attributes of some other values, function attributes of modules are not usually called methods, and they don't get the module value as an extra argument.)
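The imaginary load_module has a rough real counterpart in the standard library, importlib.import_module, which loads a module by name and returns it as a value. The sketch below uses math instead of numpy so that it runs without extra packages; the names m, also_math, and square_root are made up for illustration.

```python
import importlib
import math as m

# Load the module by name and get it back as an ordinary value.
also_math = importlib.import_module("math")

print(type(m))           # modules have a type, like any other value
print(also_math is m)    # both names refer to the same module value
square_root = m.sqrt     # a function attribute of the module
print(square_root(16))
```

In everyday code the import statement is still the normal way to load a module; the function form is shown only to make the "modules are values" point concrete.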
Question. How many subexpressions (not counting the whole expression) are there in the following expression?
np.array([1,1+2,3])*4
Another important statement is the function definition:
def square(x):
    return x*x
square(5.5)
After the def statement runs, the function square will be available for calling. Defining a function doesn't do anything else. In particular, it's not called unless you call it somewhere.
The function definition is our first example of a statement that takes up multiple lines. In fact, a function definition is a compound statement that typically includes multiple substatements; its general form is:
def <function name>(<argument list>):
    <substatement 0>
    <substatement 1>
    ...
Notice the indentation of the statements inside the function. Indentation tells Python where your function definition ends. You can use as many spaces as you want (as long as you're consistent), but 4 is traditional.
When a function is executed (using the function call syntax we saw above), its substatements are executed sequentially, just like an ordinary sequence of statements in a cell. A substatement can be any statement you want, just like a subexpression can be any expression you want. You can even put function definitions as substatements inside function definitions. A special kind of substatement often seen in functions (and nowhere else) is the return statement, which is covered in detail below. When a return statement is reached, execution finishes (even if there are statements below) and the expression after return becomes the value of the function call.
Before the statements are executed, each name in the argument list is set to the corresponding value in the arguments passed to the function. For example, when we call square(5.5) above, Python starts executing the statements in the square function, but first sets x to 5.5. Arguments are how we pass information into functions; functions with no arguments can only behave one way.
Functions are extremely useful for packaging small pieces of functionality into easily-understandable pieces. Computer code is so powerful that organizing and maintaining it is often much more difficult than just getting the computer to do what we want. If you can wrap a complicated procedure into a single function, then you can focus once on getting that function written correctly, and then move on to something else, never worrying about its correctness again. In most moderate- or large-scale software, all code is organized this way -- that is, all code is just a bunch of (relatively short) functions that call each other.
In your labs, and in coding you do outside and after this class, you'll often notice yourself repeating the same thing several times, with slight modifications. For example, you might analyze a dataset and then perform the same analysis with a different dataset for comparison. Or you might find yourself repeatedly doing the same mathematical operation, like "square each element and add 5". When that happens, you should rewrite your code so that the thing you're repeating happens inside a function with a memorable name.
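As a sketch of that advice, here is the "square each element and add 5" operation wrapped in a function. The function name and the sample inputs are made up for illustration, and the for loop inside it is a kind of statement this note hasn't covered at this point.

```python
def square_and_add_five(values):
    """Return a new list with each element squared, plus 5."""
    results = []
    for value in values:
        results.append(value*value + 5)
    return results

# The same operation, reused on two different inputs.
first_result = square_and_add_five([1, 2, 3])
second_result = square_and_add_five([10, 0.5])
print(first_result, second_result)
```

Once the function exists, each reuse is a single short call rather than a copy of the loop.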
Let's go back to our definition of square for a moment.
def square(x):
    return x*x
It's important to know that the name x is assigned to a value only for the purposes of the statements inside the function. Outside the function call, argument names are not modified or visible. For example:
x = 5
def cube(x):
    return x*x*x
cube(3) # 27
x # Still 5!
def square_root(does_not_appear_elsewhere):
    return does_not_appear_elsewhere**(1/2)
square_root(4)
does_not_appear_elsewhere # Causes an error. does_not_appear_elsewhere was only defined inside the function while it was being called.
Similarly, any names defined inside a function are only defined inside the function while it's running. They don't even stick around across calls to the function; each time the function body finishes, the names defined inside it are wiped out, just like argument names.
def times_three(x):
    multiplier = 3
    return multiplier*x
six = times_three(2)
three = multiplier # Error!
A function definition like
def my_func(x):
    return 2*x
is really just producing a function value and assigning the name my_func to that value. In this case, the function value is the function that multiplies its single argument by 2. You should imagine def as doing something similar to the following (non-functioning) code:
my_func = make_a_function(x): # BEWARE: NOT REAL PYTHON CODE.
    return 2*x
...where we're imagining for a moment that the special syntax make_a_function(...): ... returns a function. So names assigned to functions are really just ordinary names, and function values are just like other values. Of course, function values, like other values, have special behaviors; they can be called using (), and they can't be added together like strings or numbers.
Because these are ordinary names, it is possible, for example, to reassign a name that was previously defined as a function using def (though this is so confusing that it is usually a bad idea):
def my_func(x):
    return 2*x
eight = my_func(4)
my_func = 3 # Technically possible, but inadvisable!
my_func + 2
We can also put function values into a list, as we saw earlier:
def my_func_0(x):
    return 0*x
def my_func_1(x):
    return 1*x
funcs = [my_func_0, my_func_1]
zero = funcs[0](3)
zero
Though Python prints function values in a slightly cryptic way, you can print them if you want:
funcs
Inside a function definition, we very often see yet another kind of statement: the return statement. This has the form return <expression>. Any of the expressions we saw above can appear after the return. The value of that expression is the value produced by calls to the function. For example, the value of square(5) is 25, since square will return 5*5 when it is called with the argument 5.
A return statement also stops execution of the function; subsequent statements are not reached. For example:
def weird_but_technically_correct_square(x):
    return x*x
    return (x*x)+1
weird_but_technically_correct_square(5)
If a return statement is never reached, calling the function produces no value. The following code is wrong, for example:
def wrong_circle_area(r):
    math.pi*(r**2)
some_name = wrong_circle_area(4)
some_name
Unfortunately, this is a mistake that Python will not complain about; it will just silently let some_name have no value. (Technically it is given a special value called None. If the last line in a cell is an expression whose value is None, Jupyter doesn't print anything, and that's what happens in the above cell. But you can see the value of some_name if you write, for example, str(some_name).)
To be clear, we just fix this by returning whatever we want the function to return:
def correct_circle_area(r):
    return math.pi*(r**2)
circle_radius_four_area = correct_circle_area(4)
circle_radius_four_area
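To see the None value directly, you can print the result of a function that forgets to return, or compare it to None. The function and variable names in this sketch are made up for illustration.

```python
import math

def wrong_area(r):
    math.pi*(r**2)          # computed, then discarded: there is no return

def right_area(r):
    return math.pi*(r**2)

missing = wrong_area(4)
print(missing)              # print displays the special value
print(missing is None)
print(right_area(4))
```

Checking a suspicious result with "is None" is a quick way to catch a forgotten return.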
Conditionals are another important kind of multi-line statement:
x = [1,2,3]
if len(x) > 4:
    message = "x is a long list!"
else:
    message = "x is a short list!"
message
The general form of a conditional is:
if <boolean-valued expression 0>:
    <statement 0.0>
    ...
elif <boolean-valued expression 1>:
    <statement 1.0>
    ...
elif <boolean-valued expression 2>:
    <statement 2.0>
    ...
...
else:
    <statement n.0>
    ...
If there is an else clause, then exactly one of the statement groups will be executed; otherwise, it's possible that none of them will happen (if none of the expressions next to if or elif are True).
Conditionals are pretty simple, but like functions, they are very important for writing code that does interesting things.
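Here is a small sketch of the general form with an elif clause, wrapped in a function so it can be tried on several inputs. The function name and the length thresholds are arbitrary choices made for this illustration.

```python
def describe_length(values):
    """Classify a list by its length (thresholds chosen arbitrarily)."""
    if len(values) > 4:
        message = "long"
    elif len(values) > 0:
        message = "short"
    else:
        message = "empty"
    return message

print(describe_length([1, 2, 3, 4, 5]), describe_length([1, 2, 3]), describe_length([]))
```

Because there is an else clause, exactly one of the three assignments to message runs on each call.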
Something to watch out for is that Python will implicitly convert non-boolean values to boolean values, sometimes using surprising rules. Typically, the convention is that something that is "zero-like" or "empty" is False, while other things are True. It's best not to rely on this behavior, though; use an explicit comparison that produces a boolean value. See what happens in the following examples:
if 0:
    x = True
else:
    x = False
x
if 1:
    x = True
else:
    x = False
x
if "some string":
    x = True
else:
    x = False
x
if "":
    x = True
else:
    x = False
x
if []: # (an empty list)
    x = True
else:
    x = False
x
if [3]:
    x = True
else:
    x = False
x
if np.array([]):
    x = True
else:
    x = False
x
if np.array([True]):
    x = True
else:
    x = False
x
if np.array([False]):
    x = True
else:
    x = False
x
if np.array([True, False]):
    x = True
else:
    x = False
x
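By contrast, the explicit comparisons recommended above always produce a real boolean value, so there is nothing to guess about. The two helper functions in this sketch are hypothetical names made up for illustration; they use only plain Python values, not arrays.

```python
def is_nonempty(values):
    """Explicitly compare the length to 0 instead of writing `if values:`."""
    return len(values) > 0

def is_nonzero(number):
    """Explicitly compare to 0 instead of writing `if number:`."""
    return number != 0

x = []
if is_nonempty(x):
    message = "x has elements"
else:
    message = "x is empty"
print(message, is_nonzero(0), is_nonzero(1), is_nonempty("some string"))
```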