Expressions and evaluation

Let's start with a very high-level description of how computer programming works. When you're writing a computer program, you're describing to the computer what you want, and then asking the computer to figure that thing out for you. Your description of what you want is called an expression. The process that the computer uses to turn your expression into whatever that expression means is called evaluation.

Think of a science fiction movie where a character asks the computer, out loud, "What's the square root of nine billion?" or "How many people older than 50 live in Paris, France?" Those are examples of expressions. The process that the computer uses to transform those expressions into a response is evaluation.

When the process of evaluation is complete, you're left with a single "value". Think of it schematically like so:

Expression (you write this) -> Evaluation (the computer does this) -> Value (the resulting output)

What makes computer programs powerful is that they make it possible to write very precise and sophisticated expressions. And importantly, you can embed the results of evaluating one expression inside of another expression, or save the results of evaluating an expression for later in your program.

Unfortunately, computers can't understand and intuit your desires simply from a verbal description. That's why we need computer programming languages: to give us a way to write expressions in a way that the computer can understand. Because programming languages are designed to be precise, they can also be persnickety (and frustrating). And every programming language is different. It's tricky, but worth it.

Arithmetic expressions

Let's start with simple arithmetic expressions. The way that you write arithmetic expressions in Python is very similar to the way that you write arithmetic expressions in, say, grade school arithmetic, or algebra. In the example below, 3 + 5 is the expression, and print is a special Python keyword that tells iPython Notebook to put the result of evaluating the expression below the code.

In [128]:
print 3 + 5
8

Arithmetic expressions in Python can be much more sophisticated than this, of course. We won't go over all of the details right now, but one thing you should know immediately is that Python arithmetic operations are evaluated using the typical order of operations, which you can override with parentheses:

In [129]:
print 4 + 5 * 6
print (4 + 5) * 6
34
54
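Here are a few more cases of the order of operations, as a rough sketch. The numbers are ones I picked just for illustration, and print is written with parentheses around the whole expression so that the lines also run under Python 3 (the rest of these notes use Python 2's print):

```python
# Multiplication is evaluated before addition unless parentheses say otherwise.
print(2 + 3 * 4)                # 3 * 4 first, then + 2: prints 14
print((2 + 3) * 4)              # parentheses first: prints 20

# Parentheses can be nested; the innermost pairs are evaluated first.
print(((1 + 2) * (3 + 4)) - 1)  # 3 * 7, then - 1: prints 20

# Operators with the same precedence are evaluated left to right.
print(10 - 4 - 3)               # (10 - 4) - 3: prints 3
```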

You can write arithmetic expressions with or without spaces between the numbers and the operators (but usually it's considered better style to include spaces):

In [130]:
print 10+20+30
60

Expressions in Python can also be very simple. In fact, a number on its own is its own expression, which Python evaluates to that number itself:

In [131]:
print 19
19

If you write an expression that Python doesn't understand, then you'll get an error. Here's what that looks like:

In [132]:
print + 20 19
  File "<ipython-input-132-b7afa4290b0f>", line 1
    print + 20 19
                ^
SyntaxError: invalid syntax

Expressions of inequality

You can also ask Python whether two expressions evaluate to the same value, or whether one expression evaluates to a value greater than another, using similarly familiar syntax. When evaluating such expressions, Python will return one of two special values: either True or False.

The == operator compares the expression on its left side to the expression on its right side. It evaluates to True if the values are equal, and False if they're not equal.

In [133]:
print 3 * 5 == 9 + 6
True
In [134]:
print 20 == 7 * 3
False

The < operator compares the expression on its left side to the expression on its right side, evaluating to True if the left-side expression is less than the right-side expression, and False otherwise. The > operator does the same thing, except that it checks whether the left-side expression is greater than the right-side expression:

In [135]:
print 17 < 18
True
In [136]:
print 17 > 18
False
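Python has a few close relatives of these operators as well: <= ("less than or equal to"), >= ("greater than or equal to") and != ("not equal to"). A quick sketch, with values chosen only for illustration (print is parenthesized so the lines run under Python 2 and Python 3 alike):

```python
print(17 <= 17)        # 17 is equal to 17, so this prints True
print(17 >= 18)        # 17 is neither greater than nor equal to 18: prints False
print(3 * 5 != 9 + 6)  # both sides evaluate to 15, so "not equal" prints False
print(20 != 7 * 3)     # 20 and 21 differ: prints True
```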

Variables

You can save the result of evaluating an expression for later using the = operator. On the left-hand side of the =, write a word that you'd like to use to refer to the value of the expression, and on the right-hand side, write the expression itself. After you've assigned a value like this, whenever you include that word in your Python code, Python will evaluate the word and replace it with the value you assigned to it earlier. Like so:

In [137]:
x = (4 + 5) * 6
print x
54

You can create as many variables as you want!

In [138]:
another_variable = (x + 2) * 4
print another_variable
224
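One detail worth seeing in action: a variable holds the value that its expression evaluated to at the moment of assignment, not the expression itself. Here's a minimal sketch (the name doubled is made up for this example):

```python
x = (4 + 5) * 6      # x now refers to the value 54
doubled = x * 2      # evaluated right away, using the current value of x: 108

x = 10               # assigning again replaces the old value of x

print(x)             # prints 10
print(doubled)       # still prints 108; it was computed before x changed
```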

Variable names can contain letters, numbers and underscores, but must begin with a letter or underscore. There are other, more technical constraints on variable names; you can review them here.

If you attempt to use the name of a variable that you haven't defined in the notebook, Python will raise an error:

In [139]:
print voldemort
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-139-4b4667502dbf> in <module>()
----> 1 print voldemort

NameError: name 'voldemort' is not defined

Types

Another important thing to know is that when Python evaluates an expression, it assigns the result to a "type." A type is a description of what kind of thing a value is, and Python uses that information to determine later what you can do with that value, and what kinds of expressions that value can be used in. You can ask Python what type it thinks a particular expression evaluates to, or what type a particular value is, using the type() function:

In [140]:
print type(100 + 1)
<type 'int'>

The word int stands for "integer." Python has many, many other types, and lots of (sometimes arcane) rules for how those types interact with each other when used in the same expression. For example, you can create a floating point type by writing a number with a decimal point in it:

In [141]:
print type(3.14)
<type 'float'>

Interestingly, the result of adding a floating-point number and an integer number together is always a floating point number:

In [142]:
print type(3.14 + 17)
<type 'float'>

... and, in Python 2 (the version these notes use), the result of dividing one integer by another integer is always an integer:

In [143]:
print type(4 / 3)
<type 'int'>
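Integer division is one place where Python versions disagree: in Python 3, / between two integers evaluates to a float instead. If you want the same answer no matter which version runs your code, something like the following sketch should work (the variable names are my own):

```python
whole = 4 // 3           # floor division: evaluates to the integer 1 in Python 2 and 3
exact = 4 / 3.0          # making one operand a float gives a float: 1.333...
converted = float(4) / 3 # float() turns the integer into a float before dividing

print(whole)                   # prints 1
print(type(whole) == int)      # prints True
print(type(exact) == float)    # prints True
print(exact == converted)      # prints True
```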

Throwing an expression into the type() function is a good way to know whether or not the value you're working with is the value you were expecting to work with. We'll use it for debugging some example code later.

Lists

A list is a type of value in Python that represents a sequence of values. The list is a very common and versatile data structure in Python and is used frequently to represent (among other things) tabular data. Here's how you write one out in Python:

In [144]:
[5, 10, 15, 20, 25, 30]
Out[144]:
[5, 10, 15, 20, 25, 30]

That is: a left square bracket, followed by a series of comma-separated expressions, followed by a right square bracket. Items in a list don't have to be values; they can be more complex expressions as well. Python will evaluate those expressions and put them in the list.

In [145]:
[5, 2*5, 3*5, 4*5, 5*5, 6*5]
Out[145]:
[5, 10, 15, 20, 25, 30]

Lists can have an arbitrary number of values. Here's a list with only one value in it:

In [146]:
[5]
Out[146]:
[5]

And here's a list with no values in it:

In [147]:
[]
Out[147]:
[]

Here's what happens when we ask Python what type of value a list is:

In [148]:
print type([1, 2, 3])
<type 'list'>

It's a value of type list.

Getting values out of lists

Once we have a list, we might want to get values out of the list. You can write a Python expression that evaluates to a particular value in a list using square brackets to the right of your list, with a number representing which value you want, numbered from the beginning (the left-hand side) of the list. Here's an example:

In [149]:
print [5, 10, 15, 20][2]
15

If we were to say this expression out loud, it might read, "I have a list of four things: 5, 10, 15, 20. Give me back the second item in the list." Python evaluates that expression to 15, the second item in the list.

The second item? Am I seeing things? 15 is clearly the third item in the list.

You're right---good catch. But for reasons too complicated to go into here, Python (along with many other programming languages!) starts list indexes at 0, instead of 1. So what looks like the third element of the list to human eyes is actually the second element to Python. The first element of the list is accessed using index 0, like so:

In [150]:
print [5, 10, 15, 20][0]
5

The way I like to conceptualize this is to think of list indexes not as specifying the number of the item you want, but instead specifying how "far away" from the beginning of the list to look for that value.
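To make the "how far away" idea concrete, here's a small sketch that walks through every valid index of a four-item list. I've assigned the list to a variable (sizes, a name chosen just for this example) to avoid retyping it:

```python
sizes = [5, 10, 15, 20]

print(sizes[0])   # zero steps from the beginning: prints 5
print(sizes[1])   # one step from the beginning: prints 10
print(sizes[2])   # two steps from the beginning: prints 15
print(sizes[3])   # three steps; the last item of a four-item list: prints 20
```

A list with four items has no index 4, because the last item is only three steps from the beginning.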

If you attempt to use a value for the index of a list that is beyond the end of the list (i.e., the value you use is higher than the last index in the list), Python gives you an error:

In [151]:
print [1, 2, 3][47]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-151-6875f95c5cd6> in <module>()
----> 1 print [1, 2, 3][47]

IndexError: list index out of range

Note that while the type of a list is list, the type of an expression using index brackets to get an item out of the list is the type of whatever was in the list to begin with. To illustrate:

In [152]:
print type([1, 2, 3])
<type 'list'>
In [153]:
print type([1, 2, 3][0])
<type 'int'>

Indexes can be expressions too

The thing that goes inside of the index brackets doesn't have to be a number that you've just typed in there. Any Python expression that evaluates to an integer can go in there.

In [154]:
print [5, 10, 15, 20][6 / 2]
20
In [155]:
x = 3
print [5, 10, 15, 20][x]
20

Assigning lists to variables

In the above examples, we've used the square-bracket indexing syntax on a list that we've just typed out in Python code. Usually you'll want to assign the list to a variable first, and then use indexing syntax, like so:

In [156]:
shoe_sizes = [6.5, 9, 11, 10.5, 7]
print shoe_sizes[1]
9

Other operations on lists

Because lists are so central to Python programming, Python includes a number of built-in functions that allow us to write expressions that evaluate to interesting facts about lists. For example, try putting a list between the parentheses of the len() function. It will evaluate to the number of items in the list:

In [157]:
print len([10, 20, 30, 40])
4
In [158]:
print len([20])
1
In [159]:
print len([])
0

The max() function will evaluate to the highest value in the list:

In [160]:
print max([9, 8, 42, 3, -17, 2])
42

... and the min() function will evaluate to the lowest value in the list:

In [161]:
print min([9, 8, 42, 3, -17, 2])
-17

The sum() function evaluates to the sum of all values in the list.

In [162]:
print sum([2, 4, 6, 8, 80])
100

Finally, the sorted() function evaluates to a copy of the list, sorted from smallest value to largest value:

In [163]:
print sorted([9, 8, 42, 3, -17, 2])
[-17, 2, 3, 8, 9, 42]
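Because sorted() evaluates to a copy, the original list is left alone. The sketch below (with variable names of my own choosing) checks that, and also shows how these functions relate to one another:

```python
scores = [9, 8, 42, 3, -17, 2]
in_order = sorted(scores)

print(in_order)                     # prints [-17, 2, 3, 8, 9, 42]
print(scores)                       # prints [9, 8, 42, 3, -17, 2]; unchanged

# The first item of the sorted copy is the minimum, and the last is the maximum.
print(in_order[0] == min(scores))                   # prints True
print(in_order[len(in_order) - 1] == max(scores))   # prints True
```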

Generating lists with range()

Python has a built-in function, range(), which evaluates to a list of integers from zero up to (but not including) the value of the expression between the parentheses. For example, to generate a list of the ten integers from zero through nine:

In [164]:
print range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

You can start the range at an integer other than zero by giving two parameters to the range() function (separated by commas):

In [165]:
print range(10, 20)
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
In [166]:
print range(-15, 15)
[-15, -14, -13, -12, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
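A note in case you ever run this code under Python 3: there, range() evaluates to a special "range" value rather than a list, and you need to wrap it in list() to see the numbers. Wrapping it in list() is harmless in Python 2, so the following sketch behaves the same in both (variable names are mine):

```python
first_ten = list(range(10))    # 0 through 9; the value 10 itself is not included
teens = list(range(10, 20))    # 10 through 19; again, 20 is not included

print(first_ten)               # prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(len(first_ten))          # prints 10
print(teens[0])                # prints 10
print(max(teens))              # prints 19
```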

Negative indexes

If you use -1 as the value inside of the brackets, something interesting happens:

In [167]:
fib = [1, 1, 2, 3, 5]
print fib[-1]
5

The expression evaluates to the last item in the list. This is essentially the same thing as the following code:

In [168]:
print fib[len(fib) - 1]
5

... except easier to write. In fact, you can use any negative integer in the index brackets, and Python will count that many items from the end of the list, and evaluate the expression to that item.

In [169]:
print fib[-3]
2
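Put another way, a negative index -n picks out the same item as the index len(fib) - n. A short sketch to check that claim:

```python
fib = [1, 1, 2, 3, 5]

print(fib[-1] == fib[len(fib) - 1])   # prints True: both are the last item, 5
print(fib[-3] == fib[len(fib) - 3])   # prints True: both are 2

# The most negative index that works is -len(fib), which is the first item.
print(fib[-5])                        # prints 1
```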

If the value in the brackets would "go past" the beginning of the list, Python will raise an error:

In [170]:
print fib[-14]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-170-a9b12c62649a> in <module>()
----> 1 print fib[-14]

IndexError: list index out of range

List slices

The index bracket syntax explained above allows you to write an expression that evaluates to a particular item in a list, based on its position in the list. Python also has a powerful way for you to write expressions that return a section of a list, starting from a particular index and ending with another index. In Python parlance we'll call this section a slice.

Writing an expression to get a slice of a list looks a lot like writing an expression to get a single value. The difference is that instead of putting one number between square brackets, we put two numbers, separated by a colon. The first number tells Python where to begin the slice, and the second number tells Python where to end it.

In [171]:
print [4, 5, 6, 10, 12, 15][1:4]
[5, 6, 10]

Note that the value after the colon specifies at which index the slice should end, but the slice does not include the value at that index. (You can tell how long the slice will be by subtracting the value before the colon from the value after it.)

Also note that---as always!---any expression that evaluates to an integer can be used for either value in the brackets. For example:

In [172]:
x = 3
print [4, 5, 6, 10, 12, 15][x:x+2]
[10, 12]
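Here's a small sketch checking the subtraction rule for slice length, along with one edge case: when the two numbers are the same, the slice is empty. The variable names are assumptions made for this example:

```python
values = [4, 5, 6, 10, 12, 15]

middle = values[1:4]           # items at indexes 1, 2 and 3
print(middle)                  # prints [5, 6, 10]
print(len(middle) == 4 - 1)    # prints True

empty = values[2:2]            # starts and ends at the same index
print(empty)                   # prints []

print(values)                  # prints [4, 5, 6, 10, 12, 15]; slicing doesn't change the list
```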

Finally, note that the type of a slice is list:

In [173]:
print type([4, 5, 6, 10, 12, 15])
<type 'list'>
In [174]:
print type([4, 5, 6, 10, 12, 15][1:4])
<type 'list'>

Omitting slice values

Because it's so common to use the slice syntax to get a list that is either a slice starting at the beginning of the list or a slice ending at the end of the list, Python has a special shortcut. Instead of writing:

In [175]:
print [4, 5, 6, 10, 12, 15][0:3]
[4, 5, 6]

You can leave out the 0 and write this instead:

In [176]:
print [4, 5, 6, 10, 12, 15][:3]
[4, 5, 6]

Likewise, if you wanted a slice that starts at index 4 and goes to the end of the list, you might write:

In [177]:
print [4, 5, 6, 10, 12, 15][4:]
[12, 15]
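A handy consequence: using the same number on both sides of the shortcut divides a list into two pieces with nothing lost and nothing repeated. And if you leave out both numbers, you get a copy of the whole list. A sketch (names are mine):

```python
values = [4, 5, 6, 10, 12, 15]

head = values[:3]     # everything before index 3
tail = values[3:]     # everything from index 3 onward

print(head)           # prints [4, 5, 6]
print(tail)           # prints [10, 12, 15]
print(len(head) + len(tail) == len(values))   # prints True

whole_copy = values[:]          # both numbers omitted
print(whole_copy == values)     # prints True
```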

Negative index values in slices

Now for some tricky stuff: You can use negative index values in slice brackets as well! For example, to get a slice of a list from the fourth-to-last element of the list up to (but not including) the second-to-last element of the list:

In [178]:
print [4, 5, 6, 10, 12, 15][-4:-2]
[6, 10]

To get everything except the last three elements of the list:

In [179]:
print [4, 5, 6, 10, 12, 15][:-3]
[4, 5, 6]
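It's easy to mix up which side of the colon the negative number goes on, so here's a side-by-side sketch (the variable names are made up for illustration):

```python
values = [4, 5, 6, 10, 12, 15]

last_three = values[-3:]           # start three from the end, go to the end
all_but_last_three = values[:-3]   # start at the beginning, stop three from the end

print(last_three)                  # prints [10, 12, 15]
print(all_but_last_three)          # prints [4, 5, 6]

# In a six-item list, -4 means index 2 and -2 means index 4.
print(values[-4:-2] == values[2:4])   # prints True: both are [6, 10]
```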

Lists within lists

So far we've seen lists that contain integers and floating-point numbers. But we're not limited to just those types! Importantly, lists can themselves contain... other lists. Lists within lists is one of the ways Python represents a matrix of values, like a spreadsheet. Here's what it looks like when you have lists inside of a list:

In [180]:
[[1, 2, 3, 4], [5, 10, 15, 20], [100, 200, 300, 400]]
Out[180]:
[[1, 2, 3, 4], [5, 10, 15, 20], [100, 200, 300, 400]]

Whew, that's a lot of brackets! This is a list that has three items, each of which is itself a list containing four items. One way to visualize this list is to think of it as a table or spreadsheet that looks like this:

col 0   col 1   col 2   col 3
1       2       3       4
5       10      15      20
100     200     300     400

Using the len() function on this list returns the number of items in the outer list, or the number of "rows" in the "table":

In [181]:
print len([[1, 2, 3, 4], [5, 10, 15, 20], [100, 200, 300, 400]])
3

You can write an expression that evaluates to a single "row" of a list-of-lists by using the square bracket indexing syntax:

In [182]:
print [[1, 2, 3, 4], [5, 10, 15, 20], [100, 200, 300, 400]][1]
[5, 10, 15, 20]

It's important to remember that a single element of a list-of-lists is itself a list:

In [183]:
print type([[1, 2, 3, 4], [5, 10, 15, 20], [100, 200, 300, 400]][1])
<type 'list'>

What if you want to write an expression that evaluates to a single item in that row? Well, then you need to use the square brackets index syntax... twice. The first square bracket index gets the row, and the second square bracket index gets the item from that row:

In [184]:
print [[1, 2, 3, 4], [5, 10, 15, 20], [100, 200, 300, 400]][1][3]
20

Still more brackets! Maybe it'll be clearer if we assign our list-of-lists to a variable first:

In [185]:
data = [[1, 2, 3, 4], [5, 10, 15, 20], [100, 200, 300, 400]]
print data[1]
print data[1][3]
[5, 10, 15, 20]
20
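Everything we've learned about ordinary lists still applies to each row, since each row is just a list. A brief sketch combining the earlier functions with the double-bracket syntax (the variable names other than data are my own):

```python
data = [[1, 2, 3, 4], [5, 10, 15, 20], [100, 200, 300, 400]]

row_count = len(data)             # number of rows: 3
column_count = len(data[0])       # number of items in the first row: 4
bottom_right = data[-1][-1]       # last item of the last row: 400
top_row_total = sum(data[0])      # 1 + 2 + 3 + 4: 10
biggest_in_middle = max(data[1])  # largest value in the middle row: 20

print(row_count)
print(column_count)
print(bottom_right)
```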

List comprehensions: Applying transformations to lists

A very common task in both data analysis and computer programming is applying some operation to every item in a list (e.g., scaling the numbers in a list by a fixed factor), or creating a copy of a list with only those items that match a particular criterion (e.g., eliminating values that fall below a certain threshold). Python has a succinct syntax, called a list comprehension, which allows you to easily write expressions that transform and filter lists.

A list comprehension has a few parts:

  • a source list, or the list whose values will be transformed or filtered;
  • a predicate expression, to be evaluated for every item in the list;
  • (optionally) a membership expression that determines whether or not an item in the source list will be included in the result of evaluating the list comprehension, based on whether the expression evaluates to True or False; and
  • a temporary variable name by which each value from the source list will be known in the predicate expression and membership expression.

These parts are arranged like so:

[ predicate expression for temporary variable name in source list if membership expression ]

The words for, in, and if are a part of the syntax of the expression. They don't mean anything in particular (and in fact, they do completely different things in other parts of the Python language). You just have to spell them right and put them in the right place in order for the list comprehension to work.

Here's an example, returning the squares of the integers from zero up to (but not including) ten:

In [186]:
print [x * x for x in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In the example above, x*x is the predicate expression; x is the temporary variable name; and [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] is the source list. There's no membership expression in this example, so we omit it (and the word if).

There's nothing special about the variable x; it's just a name that we chose. We could easily choose any other temporary variable name, as long as we use it in the predicate expression as well. Below, I use the name of one of my cats as the temporary variable name, and the expression evaluates the same way it did with x:

In [187]:
print [shumai * shumai for shumai in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Notice that the type of the value that a list comprehension evaluates to is itself type list:

In [188]:
print type([x * x for x in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
<type 'list'>

The expression we supply for the source list component of a list comprehension doesn't have to be a list that you've written out by hand. It can also be a variable:

In [189]:
numbers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print [x * x for x in numbers]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

... or it can be the result of some other expression that evaluates to a list:

In [222]:
print [x * x for x in range(10)]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

We've used the expression x * x as the predicate expression in the examples above, but you can use any expression you want. For example, to scale the values of a list by 0.5:

In [226]:
print [x * 0.5 for x in range(10)]
[0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]

In fact, the expression in the list comprehension can just be the temporary variable itself, in which case the list comprehension will simply evaluate to a copy of the original list:

In [227]:
print [x for x in range(10)]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

You don't technically even need to use the temporary variable in the predicate expression:

In [228]:
print [42 for x in range(10)]
[42, 42, 42, 42, 42, 42, 42, 42, 42, 42]

Bonus exercise: Write a list comprehension for the list range(5) that evaluates to a list where every value has been multiplied by two (i.e., the expression would evaluate to [0, 2, 4, 6, 8]).

The membership expression

As indicated above, you can include an expression at the end of the list comprehension to determine whether or not the item in the source list will be evaluated and included in the resulting list. For example, here's one way to include only those values from the source list that are greater than or equal to five:

In [191]:
print [x*x for x in range(10) if x >= 5]
[25, 36, 49, 64, 81]
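Here are a few more variations, sketched with a list of values I made up: filtering without transforming, filtering and transforming at once, and a membership expression that nothing satisfies.

```python
values = [9, 8, 42, 3, -17, 2]

# Filter only: the predicate expression is just the temporary variable.
big = [x for x in values if x > 5]

# Filter and transform: only items that pass the test get doubled.
big_doubled = [x * 2 for x in values if x > 5]

# If no item passes the test, the result is an empty list (not an error).
nothing = [x for x in values if x > 100]

print(big)           # prints [9, 8, 42]
print(big_doubled)   # prints [18, 16, 84]
print(nothing)       # prints []
```

Note that the items keep the order they had in the source list.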

Making lists from other kinds of data

We've learned how lists work, and we've learned some basic techniques for getting data from lists and turning lists into other lists. But so far it's been pretty abstract—we've been working with lists of numbers we've been typing in by hand, instead of with, you know, actual data fetched from some source. The reason for this is that getting data from real-world sources is... well, it's hard, and learning how to do that constitutes much of the content of this course.

Data files from real world sources usually come in a series of bytes—basically, a long sequence of numbers that correlate to how data is stored on disk or transmitted over the network. Our job as data mungers is to figure out how to "parse" this data ("parse" is used here loosely, in its colloquial meaning), transforming it from its "raw" form into actual Python data structures, like integers and lists.

Strings

One way of representing raw data in Python is with a data type called a string. A string is essentially a sequence of characters of arbitrary length. You can write one in iPython using single quotes (') or double quotes (") surrounding whatever characters you want. (The rules are a little more complicated than that, but we're focusing on the simple stuff for now.) Here's an example of a string:

In [192]:
print "this is a string. I can put a bunch of characters in here."
this is a string. I can put a bunch of characters in here.

Using print on a string causes Python simply to display the characters in the string. You can assign strings to variables as well, and the len() function will return the length of a string, just as it returns the length of a list:

In [193]:
x = "hi i'm a string"
print x
print len(x)
hi i'm a string
15

Strings have their own data type, type str:

In [194]:
print type("mother said there'd be days like these")
<type 'str'>

Strings and numbers

Notably, a string that contains what looks like a number does not behave like an actual integer or floating point number does. For example, attempting to subtract one string containing a number from another string containing a number will cause an error to be raised:

In [195]:
print "15" - "4"
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-195-a8252f2e06d4> in <module>()
----> 1 print "15" - "4"

TypeError: unsupported operand type(s) for -: 'str' and 'str'

Attempting to add an integer or floating-point number to a string that has a number inside of it will raise a similar error:

In [196]:
print 16 + "8.9"
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-196-cf6d61fa7c30> in <module>()
----> 1 print 16 + "8.9"

TypeError: unsupported operand type(s) for +: 'int' and 'str'

"TypeError: unsupported operand type(s)" translates from Python talk to "you gave me two values, and asked me to perform an operation on those values, but I don't know how to do that when the values belong to these types." In this case, Python has no idea how to "add" a number and a string of characters. Fortunately, there are built-in functions whose purpose is to convert from one type to another; notably, you can put a string inside the parentheses of the int() and float() functions, and it will evaluate to (what Python interprets as) the integer and floating-point values (respectively) of the string:

In [197]:
print type("17")
print int("17")
print type(int("17"))
<type 'str'>
17
<type 'int'>
In [198]:
print type("3.14159")
print float("3.14159")
print type(float("3.14159"))
<type 'str'>
3.14159
<type 'float'>

If you give a string to one of these functions that Python can't interpret as an integer or floating-point number, Python will raise an error:

In [199]:
print int("shumai")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-199-725f52c691bd> in <module>()
----> 1 print int("shumai")

ValueError: invalid literal for int() with base 10: 'shumai'
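With int() and float() in hand, arithmetic on strings that look like numbers becomes possible once you convert them first. A minimal sketch; the particular numbers and variable names are my own:

```python
difference = int("15") - int("4")   # convert both strings, then subtract: 11
total = 16 + float("8.5")           # an integer plus a float is a float: 24.5

# int() won't accept a string with a decimal point in it (int("3.14") raises
# a ValueError), but you can convert to a float first and then to an integer,
# which drops the fractional part.
truncated = int(float("3.14"))      # 3

print(difference)
print(total)
print(truncated)
```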

Splitting strings

We'll be talking a LOT about what you can do with strings in this class. But for the purposes of this session, I just want to talk about one additional thing: the split() method. The split() method is a funny thing you can do with a string to transform it into a list. If you have an expression that evaluates to a string, you can put .split() right after it, and Python will evaluate the whole expression to mean "take this string, and 'split' it on white space, giving me a list of strings with the remaining parts." For example:

In [221]:
print "this is a test".split()
['this', 'is', 'a', 'test']

Notably, while the type of a string is str, the type of the result of split() is list:

In [201]:
print type("this is a test".split())
<type 'list'>

If the string in question has some delimiter in it other than whitespace that we want to use to separate the fields in the resulting list, we can put a string with that delimiter inside the parentheses of the split() method. Maybe you can tell where I'm going with this at this point!
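Here's a quick sketch of split() with a delimiter, using a date string in the year-month-day format as the example. Note that the pieces are still strings until you convert them:

```python
fields = "2013-10-29".split("-")   # split on hyphens instead of whitespace

print(fields)                      # prints ['2013', '10', '29']
print(len(fields))                 # prints 3

year = int(fields[0])              # convert the first piece to an integer
print(year + 1)                    # prints 2014
```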

From string to list of numbers: an example

For example, I happen to have here a string that represents the total points scored by LeBron James in each of his NBA games in the 2013-2014 regular season.

17,25,26,25,35,18,25,33,39,30,13,21,22,35,28,27,26,23,21,21,24,17,25,30,24,18,38,19,33,26,26,15,30,32,32,36,25,21,34,30,29,27,18,34,30,24,31,13,37,36,42,33,31,20,61,22,19,17,23,19,21,24,43,15,25,32,38,17,13,32,17,34,38,29,37,36,27

You can either cut-and-paste this string from the notes, or see a file on github with these values here.

Now if I just cut-and-pasted this string into a variable and tried to call list functions on it, I wouldn't get very helpful responses:

In [202]:
raw_str = "17,25,26,25,35,18,25,33,39,30,13,21,22,35,28,27,26,23,21,21,24,17,25,30,24,18,38,19,33,26,26,15,30,32,32,36,25,21,34,30,29,27,18,34,30,24,31,13,37,36,42,33,31,20,61,22,19,17,23,19,21,24,43,15,25,32,38,17,13,32,17,34,38,29,37,36,27"
print max(raw_str)
9

This is wrong—we know that LeBron James scored more than nine points in his highest scoring game. The max() function clearly does strange things when we give it a string instead of a list. The reason for this is that all Python knows about a string is that it's a series of characters. It's easy for a human to look at this string and think, "Hey, that's a list of numbers!" But Python doesn't know that. We have to explicitly "translate" that string into the kind of data we want Python to treat it as.

Bonus advanced exercise: Take a guess as to why, specifically, Python evaluates max(raw_str) to 9. Hint: what's the result of type(max(raw_str))?

What we want to do, then, is find some way to convert this string that represents integer values into an actual Python list of integer values. We'll start by splitting this string into a list, using the split() method, passing "," as a parameter so it splits on commas instead of on whitespace:

In [203]:
str_list = raw_str.split(",")
print str_list
['17', '25', '26', '25', '35', '18', '25', '33', '39', '30', '13', '21', '22', '35', '28', '27', '26', '23', '21', '21', '24', '17', '25', '30', '24', '18', '38', '19', '33', '26', '26', '15', '30', '32', '32', '36', '25', '21', '34', '30', '29', '27', '18', '34', '30', '24', '31', '13', '37', '36', '42', '33', '31', '20', '61', '22', '19', '17', '23', '19', '21', '24', '43', '15', '25', '32', '38', '17', '13', '32', '17', '34', '38', '29', '37', '36', '27']

Looks good so far. What does min() have to say about it?

In [204]:
print min(str_list)
13

This works. But what if we wanted to find the total number of points scored by LBJ? We should be able to do something like this:

In [205]:
print sum(str_list)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-205-cb32670f7064> in <module>()
----> 1 print sum(str_list)

TypeError: unsupported operand type(s) for +: 'int' and 'str'

... but we get an error. Why this error? The reason lies in what kind of data is in our list. We can check the data type of an element of the list with the type() function:

In [206]:
print type(str_list[0])
<type 'str'>

A-ha! The type is str. So the error message we got before (unsupported operand type(s) for +: 'int' and 'str') is Python's way of telling us, "You gave me a list of strings and then asked me to add them all together. I'm not sure what I can do for you."

So there's one step left in our process of "converting" our "raw" string, consisting of comma-separated numbers, into a list of numbers. What we have is a list of strings; what we want is a list of numbers. Fortunately, we know how to write an expression to transform one list into another list, applying an expression to each member of the list along the way—it's called a list comprehension. Equally fortunately, we know how to write an expression that converts a string representing an integer into an actual integer (int()). Here's how to write that expression:

In [207]:
print [int(x) for x in str_list]
[17, 25, 26, 25, 35, 18, 25, 33, 39, 30, 13, 21, 22, 35, 28, 27, 26, 23, 21, 21, 24, 17, 25, 30, 24, 18, 38, 19, 33, 26, 26, 15, 30, 32, 32, 36, 25, 21, 34, 30, 29, 27, 18, 34, 30, 24, 31, 13, 37, 36, 42, 33, 31, 20, 61, 22, 19, 17, 23, 19, 21, 24, 43, 15, 25, 32, 38, 17, 13, 32, 17, 34, 38, 29, 37, 36, 27]

Let's double-check that the values in this list are, in fact, integers, by spot-checking the first item in the list:

In [208]:
print type([int(x) for x in str_list][0])
<type 'int'>

Hey, voila! Now we'll assign that list to a variable, for the sake of convenience, and then check to see if sum() works how we expect it to.

In [209]:
int_list = [int(x) for x in str_list]
print sum(int_list)
2089

Wow! 2089 points in one season! Good work, King James.
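To recap, here's the whole process in one place, sketched on a short made-up string (not real game data) so the results are easy to check by hand. It also shows why converting matters for min() and max(): strings are compared character by character, not by numeric value.

```python
raw = "17,25,9,100"                       # made-up sample, not real game data

as_strings = raw.split(",")               # ['17', '25', '9', '100']
as_ints = [int(s) for s in as_strings]    # [17, 25, 9, 100]

print(max(as_strings))   # prints 9, because the character "9" sorts after "1" and "2"
print(max(as_ints))      # prints 100
print(min(as_strings))   # prints 100, because "100" sorts before "17"
print(min(as_ints))      # prints 9
print(sum(as_ints))      # prints 151
```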

Comma-separated value files (CSVs)

A common task in this class will be to (1) take data from some source, (2) figure out what format that data is in, then (3) parse the data into Python data structures so we can (4) perform operations on it and synthesize useful information. The example in the section above—taking a string containing a comma-separated list of numbers, splitting it apart, converting it to integers, and then finding its sum—is a simple example of that task.

Step (3) above usually turns out to be the most difficult step in the process. Fortunately, Python makes available many "libraries" that know how to take apart data in particular formats and convert them to Python data structures that we can use in our notebooks. (A library is a piece of pre-existing code that you can incorporate into your program. Some libraries come pre-installed with Python; some are pre-installed on your EC2 machine as a part of the AMI for this class; others still you may need to install by hand. We'll talk about that step when the time comes!)

One such library is called csv—it's a library for parsing comma-separated value files (CSVs). CSV is a common "exchange" format, often used when exporting data from a spreadsheet program. Here's an example of some CSV-formatted data, representing statistics for LeBron James' first five games of the 2013-2014 NBA season (source):

Rk,G,Date,Age,Tm,,Opp,,GS,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,+/-
1,1,2013-10-29,28-303,MIA,,CHI,W (+12),1,38:01,5,11,.455,0,1,.000,7,9,.778,0,6,6,8,1,0,2,0,17,16.9,+8
2,2,2013-10-30,28-304,MIA,@,PHI,L (-4),1,36:38,9,17,.529,4,7,.571,3,4,.750,0,4,4,13,0,0,4,3,25,21.4,-8
3,3,2013-11-01,28-306,MIA,@,BRK,L (-1),1,42:14,11,19,.579,1,2,.500,3,5,.600,1,6,7,6,2,1,5,2,26,19.9,-3
4,4,2013-11-03,28-308,MIA,,WAS,W (+10),1,34:41,9,14,.643,3,5,.600,4,5,.800,0,3,3,5,1,0,6,2,25,17.0,+16
5,5,2013-11-05,28-310,MIA,@,TOR,W (+9),1,36:01,13,20,.650,1,3,.333,8,8,1.000,2,6,8,8,0,1,1,2,35,33.9,+3

The gist of the CSV format is that it consists of a number of records, one per line, each of which consists of an equal number of fields. Fields are separated with a comma. Often, the first line of data in a CSV file gives some clue as to how to interpret the information in the corresponding fields in the following rows. We can surmise that, e.g., the data in the fields corresponding to PTS represent the number of points scored in that game, that FGA is the number of field goals attempted, etc.

Knowing what we know already, we can sort of imagine what the code to parse a file in this format might look like. We'd need to put that whole chunk of data into a big string, then split that string (somehow?) into individual lines; that would give us a list of strings. We could then write an expression to split those strings into lists of individual fields. In the end, we'd end up with a list of lists.
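
One way to fill in the "somehow" is to split the big string on the newline character. The sketch below is a rough, do-it-yourself version under that assumption; the names data, lines, and table are made up for illustration, and the string holds only the header row and the first two records from the sample above. Note that a plain split(",") would break on any field that itself contains a comma, which is one reason a dedicated library is usually the safer choice.

```python
# The header row and the first two records from the sample above, in one big string.
data = """Rk,G,Date,Age,Tm,,Opp,,GS,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,+/-
1,1,2013-10-29,28-303,MIA,,CHI,W (+12),1,38:01,5,11,.455,0,1,.000,7,9,.778,0,6,6,8,1,0,2,0,17,16.9,+8
2,2,2013-10-30,28-304,MIA,@,PHI,L (-4),1,36:38,9,17,.529,4,7,.571,3,4,.750,0,4,4,13,0,0,4,3,25,21.4,-8"""

lines = data.split("\n")                     # one string per line
table = [line.split(",") for line in lines]  # one list of fields per line

print(len(table))    # number of rows, including the header row
print(table[1][27])  # the PTS field of the first game (still a string)
```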

Using the csv library

Fortunately, we don't have to do that work for ourselves (if we don't want to). There's an existing library for Python called csv that will do the work of taking a string and turning it into a list of lists for us.
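
To get a feel for what the library buys us before any downloading is involved, here is a small offline sketch. The two lines of data are invented for illustration (they are not from the LeBron James file), and the names lines, naive, and parsed are hypothetical. csv.reader accepts any sequence of lines, so a plain list of strings works as input.

```python
import csv

# Invented data: the second line has a quoted field that contains a comma.
lines = ['name,team', '"James, LeBron",MIA']

naive = [line.split(",") for line in lines]  # plain splitting cuts the name in two
parsed = list(csv.reader(lines))             # csv keeps the quoted field together

print(naive[1])
print(parsed[1])
```

Plain splitting yields three fields for the second line, while csv.reader yields the two fields the data actually describes.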

I'm going to show you how to use this library, so you can get started doing simple data tasks with CSV files you find on the web. In the examples below, there is going to be some code that you won't be prepared for just yet—but I'm going to try to be careful to show you which parts of the code you can change yourself, and which parts you need to leave alone.

The full version of the LeBron James CSV file is here. Instead of cutting and pasting this entire file into a string, the code I write below will fetch the data straight from github. (I'll show you how this works and how to do it for arbitrary web addresses next session!)

Here's some code for loading an entire CSV into a list of lists:

In [230]:
import csv
import urllib

url = "https://gist.githubusercontent.com/aparrish/cb1672e98057ea2ab7a1/raw/13166792e0e8436221ef85d2a655f1965c400f75/lebron_james.csv"
stats = list(csv.reader(urllib.urlopen(url)))

The import csv line at the top of the cell simply tells Python that we want to use the csv library for the rest of the notebook. The import urllib line does the same for urllib, the library that fetches the data from the web. The part of the code that does all the work is this one:

url = "YOUR_URL_HERE"
stats = list(csv.reader(urllib.urlopen(url)))

You can change the value of the url variable to whatever URL you want, as long as what's on the other side is a CSV file. The second line (stats = list(csv.reader(urllib.urlopen(url)))) loads the CSV at that URL and parses it into a list, assigning that list to a variable called stats. Let's examine the stats variable:

In [231]:
print type(stats)
<type 'list'>

It's a list! What's in the list?

In [232]:
print type(stats[0])
<type 'list'>

The first element of this list is... also a list! Looks like we've got a list of lists here. Let's see what's actually inside the list:

In [233]:
print stats[0]
['Rk', 'G', 'Date', 'Age', 'Tm', '', 'Opp', '', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'GmSc', '+/-']

Looks like the first element of the stats list is a list of column headings. What's in the second element?

In [234]:
print stats[1]
['1', '1', '2013-10-29', '28-303', 'MIA', '', 'CHI', 'W (+12)', '1', '38:01', '5', '11', '.455', '0', '1', '.000', '7', '9', '.778', '0', '6', '6', '8', '1', '0', '2', '0', '17', '16.9', '+8']

Ah, okay, now we're finally getting some actual data. How many records do we have? We'll use the len() function to check, taking care to not include the first record in our count (since that's the column heading row, and doesn't itself represent a game):

In [216]:
print len(stats[1:])
77

77 games total. (NBA fans: yes, there are 82 games in a season, but I purposefully excluded games in which James was marked as "Inactive.")

We can access a particular item in a particular record by using the list indexing brackets twice. According to the column headings, the number of points scored in a game is in the... let's count it together, actually (remember to start counting at zero!):

['Rk', 'G', 'Date', 'Age', 'Tm', '', 'Opp', '', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'GmSc', '+/-']

...27! So we can get the number of points LBJ scored in his first game of the season like so:

In [217]:
print stats[1][27]
17
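
Counting columns by hand is easy to get wrong, so it can help to let Python do the counting. The sketch below uses the list method index(), which returns the position of the first element equal to its argument. The names header, pts_index, and ast_index are made up for illustration; in the notebook, header would simply be stats[0].

```python
# The column heading row, copied from stats[0] above.
header = ['Rk', 'G', 'Date', 'Age', 'Tm', '', 'Opp', '', 'GS', 'MP', 'FG',
          'FGA', 'FG%', '3P', '3PA', '3P%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB',
          'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'GmSc', '+/-']

pts_index = header.index('PTS')  # position of the points column
ast_index = header.index('AST')  # position of the assists column

print(pts_index)
print(ast_index)
```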

Selecting a single column

Now we're in a position to do some interesting things with the data from our CSV file. Let's start by creating an expression that evaluates to all of the values in a particular column. We'll do this using a list comprehension. Here's what it looks like:

In [218]:
print [int(record[27]) for record in stats[1:]]
[17, 25, 26, 25, 35, 18, 25, 33, 39, 30, 13, 21, 22, 35, 28, 27, 26, 23, 21, 21, 24, 17, 25, 30, 24, 18, 38, 19, 33, 26, 26, 15, 30, 32, 32, 36, 25, 21, 34, 30, 29, 27, 18, 34, 30, 24, 31, 13, 37, 36, 42, 33, 31, 20, 61, 22, 19, 17, 23, 19, 21, 24, 43, 15, 25, 32, 38, 17, 13, 32, 17, 34, 38, 29, 37, 36, 27]

Hey, that looks familiar! It's the same list of numbers we came up with earlier. Let's break down that list comprehension a bit.

  • The source list is stats[1:]. (Why stats[1:] and not just stats? Because we want to omit the column header row.)
  • The temporary variable name is record. As mentioned above, this can be anything! I chose record to remind us of the fact that each element in the source list is itself a record in a table.
  • The predicate expression is int(record[27]), which translates into English as "get the element at index 27 of the list called record and convert it to an integer value." (Because we start counting at zero, index 27 is actually the 28th item in the list.)
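
If it helps to see the three parts working separately, the same result can be built up one step at a time. The sketch below is only an illustration: it uses a for loop and append(), which may not have been introduced yet, and it builds a small stand-in for stats from the header row and the first two records of the sample shown earlier (the names sample and points are made up).

```python
# A small stand-in for stats: the header row plus the first two records.
sample = [
    "Rk,G,Date,Age,Tm,,Opp,,GS,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,+/-",
    "1,1,2013-10-29,28-303,MIA,,CHI,W (+12),1,38:01,5,11,.455,0,1,.000,7,9,.778,0,6,6,8,1,0,2,0,17,16.9,+8",
    "2,2,2013-10-30,28-304,MIA,@,PHI,L (-4),1,36:38,9,17,.529,4,7,.571,3,4,.750,0,4,4,13,0,0,4,3,25,21.4,-8",
]
stats = [line.split(",") for line in sample]

points = []
for record in stats[1:]:             # source list: every row after the header
    points.append(int(record[27]))   # predicate expression, applied to each record

print(points)
print(points == [int(record[27]) for record in stats[1:]])  # same as the comprehension
```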

Bonus exercise: Write an expression to get the sum of these values.

Here's another example. Let's get a list of how many points LBJ scored in games where he had exactly ten assists. ("Assists" are in the column labelled AST, or column number 22.)

In [219]:
print [int(record[27]) for record in stats[1:] if int(record[22]) == 10]
[25, 26, 21]

Bonus exercise: Write an expression to get the number of blocks (the column labelled BLK) that LeBron James had in games 10 up to 20.

Conclusion

We've put down the foundation today for you to become fluent in Python's very powerful and super-convenient syntax for lists. We've also done a bit of data parsing and analysis! Pretty good for day one.

Further resources: