Python basics: Part 3

Before we can start working with data, we need to work out some of the basics of Python. The goal is to learn enough so that we can do some interesting data work --- we do not need to be Python Jedi.

We now know about the basic data structures in python, how types work, and how to do some basic computation and string manipulation. What we need next is flow control.

A python program is a list of statements. The python interpreter reads those statements from top to bottom and executes them. Depending on what is happening in our program, we may want to skip some statements, or repeat some other statements. Flow control statements manage this process.

In this notebook we will cover

  1. The bool type
  2. The if/else statement
  3. The for loop
  4. List comprehensions

Remember: Ask questions as we go.

Bool

Flow control often requires asking whether a statement is true or false and then taking an action conditional on the answer. For example: Is this data a string? If yes, convert to a float. If not, do nothing.

The python type bool can take on two values: True or False. Let's see it in action.

In [1]:
my_age = 41                # I am still old...

is_a_minor = my_age < 18   # here we compare age to see if it is less than 18

print(is_a_minor)
print(type(is_a_minor))
False
<class 'bool'>

The comparison operators we will often use are

  • < (less than)
  • > (greater than)
  • <= (less than or equal to)
  • >= (greater than or equal to)
  • == (equal)
  • != (not equal)

Important: We use a double equal sign == to check for equality and a single equal sign for assignment.

In [2]:
# a bit of code to see if the variable year is equal to the current year
year = 2019
is_current_year = (2019 == year)  # the parentheses are not needed, but I like them for clarity
print(is_current_year)
True

Go back and change the third line to is_current_year = (2019 = year). What happened?

More complicated comparisons

We can build more complicated expressions using and and or. For and all the sub-comparisons need to evaluate to True for the whole comparison to be True. For or only one of the sub-comparisons needs to be true for the whole comparison to be true.

In [3]:
x = (2 < 3) and (1 > 0)      # Both sub-comparisons are true
print('Is 2<3 and 1>0?', x)

y = (2 < 3) and (1 < 0)      # Only one sub-comparison is true
print('Is 2<3 and 1<0?', y)

z = (2 < 3) or (1 < 0)       # Only one sub-comparison is true
print('Is 2<3 or 1<0?', z)
Is 2<3 and 1>0? True
Is 2<3 and 1<0? False
Is 2<3 or 1<0? True
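
The same operators work on comparisons that involve variables, which is how they usually show up in data work. Here is a minimal sketch, assuming a hypothetical age variable and illustrative cutoffs of 18 and 65 (neither comes from the examples above).

```python
# Hypothetical example: the variable names and the cutoffs are illustrative only
age = 41

is_working_age = (age >= 18) and (age < 65)   # both parts must be True
is_outside = (age < 18) or (age >= 65)        # at least one part must be True

print('Working age?', is_working_age)
print('Outside working age?', is_outside)
```

Try changing age to 70 and predict both results before running it.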

Comparing strings

Given the nature of data, we might need to compare strings. Remember, programming languages are picky...

In [4]:
gender = 'Male'

is_male = ('male' == gender)
print(is_male)
False

Case matters. Luckily, python has lots of features to manipulate strings. We will learn some of these as we go along. In this case we use the lower() method of the string class to make the string all lower case.

We are using the 'dot' notation again without really explaining it yet, but that explanation is coming.

In [5]:
gender_lowcase = gender.lower()  # we are applying the lower() method to the variable gender
print('gender, after being lowered:', gender_lowcase)

is_male = 'male' == gender_lowcase  
print(gender_lowcase, is_male)

is_male = 'male' == gender.lower()  # you don't have to store the lowered string separately
print(gender.lower(), is_male)
gender, after being lowered: male
male True
male True

Conditional statements

Conditional statements check a condition. If the condition is true, python runs one block of code. If the condition is false, it runs a different block of code.

Important: Earlier, I mentioned that white space doesn't matter around operators like + or * and that we can insert blank lines wherever we want. Here comes a caveat: When we form a conditional, the lines following the condition statement must be indented, and every line within a branch must be indented by the same amount. The convention, which we will follow, is four spaces. The indents define the lines of code that are executed in each branch of the if statement.

In [6]:
quantity = 10

if quantity > 0: 
    print('This print statement occurred because the statement is true.')  # this indented code is the 'if branch'
    print('The quantity is positive.')
    temp = quantity + 5
    print('When I add 5 to the quantity it is:', temp, '\n')
else:
    print('This print statement occurred because the statement is false.')  # this indented code is the 'else branch'
    print('The quantity is not positive.\n')
    

print('This un-indented code runs no matter what.')
This print statement occurred because the statement is true.
The quantity is positive.
When I add 5 to the quantity it is: 15 

This un-indented code runs no matter what.
  1. Now go back to the code and change quantity to 0, or -10 and run the cell. What happens?

  2. Now go back to the code and change the indentation of the first print statement after if quantity > 0: to be two spaces. Run the cell. What happened?

In [7]:
# the else is optional. 

size = 'md'
if (size == 'sm') or (size == 'md') or (size == 'lg'):
    print('A standard size was requested.\n')
    
print('This un-indented code runs no matter what.')
A standard size was requested.

This un-indented code runs no matter what.

Change size to 'xxl'. Run the cell.

Practice: Conditionals

Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. I am here, too.

  1. Edit this markdown cell and write True, False, or error next to each statement
  • 1 > 2 False
  • 'bill' = 'Bill' error (a single = is assignment, not comparison)
  • (1 > 2) or (2*10 < 100) True
  • 'Dennis' == 'Dennis' True
  • x = 2
    0 < x < 5
    
    True
  • x = 0.10
    y = 1/10
    x == y
    
    True
  1. Before you run the code cell below: do you think it will be true or false?
  2. Run the code cell.
In [8]:
x = 1/3
y = 0.333333333     # This is an approximation of 1/3
print(x == y)
False

In the previous cell, add a few more 3s to the end of the definition of y so you get a better approximation of x. Can you get x==y to be true?

Representing a number that does not have an exact base-2 fractional representation is a problem in all programming languages. It is a limitation of the computer hardware itself. The python documentation has a nice discussion. https://docs.python.org/3.7/tutorial/floatingpoint.html

This will not likely be an issue for us (although it could crop up) but it is a big deal in numerical computing.
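
One common way to cope is to ask whether two floats are close, rather than exactly equal. Here is a minimal sketch using isclose() from the standard library's math module. The tolerance of 1e-6 is an arbitrary choice for illustration, not a recommendation.

```python
import math

# Exact comparison is tripped up by tiny rounding errors
print(0.1 + 0.2 == 0.3)                  # False

# isclose() checks whether the two numbers agree to within a tolerance
print(math.isclose(0.1 + 0.2, 0.3))      # True

x = 1/3
y = 0.333333333
close_enough = math.isclose(x, y, rel_tol=1e-6)   # rel_tol is an illustrative choice
print(close_enough)                      # True
```

How tight the tolerance should be depends on the data, so treat the value above as a placeholder.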

  1. Let's introduce a new function that is built into python: the len() function. This computes the length of an object. In the code cell below try print(len('hello world'))
In [9]:
print(len('hello world'))
11
  1. In the cell below, write some code (use an if statement) that prints out the longer string in all lower case letters and prints out the shorter string in all upper case letters. [Hint: the companion to lower() is upper()].
In [10]:
string1 = 'MemoriaL'
string2 = 'unIon'

if len(string1) > len(string2):
    print(string1.lower(), string2.upper())
else:
    print(string2.lower(), string1.upper())
        
    
    
memorial UNION

The for loop

The conditional statement allows us to selectively run parts of our program. Loops allow us to re-run parts of our code several times, perhaps making some changes each time the code is run. There are several types of loops. The for loop runs a block of code 'for' a fixed number of times.

Here is a basic example.

In [11]:
# loop three times and print out the value of 'i'

for i in range(3):       # The counter variable 'i' can be named anything. 
    print('i =', i )
i = 0
i = 1
i = 2

Important: Notice the 4-space indent again. In general, the colon tells us that the indented lines below 'belong' to the line of code with the colon.
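
A loop can also update a variable on each pass. Here is a small sketch, using the hypothetical variable name total, that adds up the numbers produced by range(5).

```python
total = 0                  # start the running total at zero

for i in range(5):         # i = 0, 1, 2, 3, 4
    total = total + i      # this indented line runs once per pass
    print('i =', i, 'total so far =', total)

print('Final total:', total)   # un-indented, so it runs once, after the loop ends
```

The final total is 10. The last print statement is not indented, so it is not part of the loop.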

Ranges

The function range() creates a sequence of whole numbers. With a single argument, it starts at zero, but it can do more. Examples:

  • range(3) returns 0, 1, 2
  • range(2,7) returns 2, 3, 4, 5, 6
  • range(0, 10, 2) returns 0, 2, 4, 6, 8 [the third argument is the 'step' size]

Change the code above to try out these ranges.

In [12]:
# a range is a python type, like a float or a str
my_range = range(5)
print(type(my_range))

# what happens if I print the range?
print(my_range)
<class 'range'>
range(0, 5)

That last print out might not be what you expected. If you want to see the sequence, convert it to a list first.

In [13]:
print(list(my_range))     #Remember what list() does?
[0, 1, 2, 3, 4]
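
The same trick lets us check the example ranges listed earlier. A quick sketch:

```python
# Convert each range to a list so we can see the numbers it produces
print(list(range(3)))          # [0, 1, 2]
print(list(range(2, 7)))       # [2, 3, 4, 5, 6]
print(list(range(0, 10, 2)))   # [0, 2, 4, 6, 8]

# The stop value is never included, so range(2, 7) has 5 elements
print(len(range(2, 7)))        # 5
```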

Looping over lists and strings

Looping over a counter is the usual style of for loop in languages like C or MATLAB. Python gives us a very easy way to loop over many kinds of things. Here, we loop over a list.

In [14]:
var_names = ['GDP', 'POP', 'INVEST', 'EXPORTS']

# Here is a clunky, C-style way to do this
print('The old-school way:')
for i in range(4):       # i = 0, 1, 2, 3
    print(var_names[i])


# The python way
print('\nThe python way:')
for var in var_names:     # again, 'var' can be named anything
    print(var)
The old-school way:
GDP
POP
INVEST
EXPORTS

The python way:
GDP
POP
INVEST
EXPORTS

We can do the same kind of thing for a string.

In [15]:
street = 'observatory drive'

for letter in street:
    print(letter)
o
b
s
e
r
v
a
t
o
r
y
 
d
r
i
v
e

Wow.

Ranges, lists, and strings are all 'iterable objects'. An iterable object is an object that knows how to return the 'next' element within it. When we iterate over a list, each time the for loop 'asks' for the next element, the list knows how to answer with it.

  • Ranges iterate over whole numbers
  • Lists iterate over the elements of the list
  • Strings iterate over the characters
  • Dicts iterate over the keys
  • and more...

Iterators are used in places besides loops, too. We will see other uses as we go on. Powerful stuff.
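
To make the 'next element' idea concrete, here is a sketch that uses the built-in functions iter() and next() to step through a list by hand. It then loops over a dict to show that the loop hands back the keys. The dict contents are made up for illustration.

```python
var_names = ['GDP', 'POP', 'INVEST']

# iter() asks the list for an iterator; next() asks for one element at a time
it = iter(var_names)
first = next(it)
second = next(it)
print(first, second)       # GDP POP

# A for loop does the same asking for us, behind the scenes.
# Looping over a dict gives us its keys.
units = {'GDP': 'dollars', 'POP': 'people'}   # hypothetical data
for key in units:
    print(key, '->', units[key])
```

We will rarely call next() ourselves; the for loop handles it.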

Practice: Loops

Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. The TA and I are here, too.

Remember this example from earlier?

  1. We have 5 integer observations in our dataset: 1, 3, 8, 3, 9. Unfortunately, the data file ran all the observations together and we are left with the variable raw_data in the cell below.
  2. What type is raw_data?
  3. Turn raw_data into a list.
In [16]:
raw_data = '13839'
print(type(raw_data))

list_data = list(raw_data)
print(list_data)
<class 'str'>
['1', '3', '8', '3', '9']

Is your data ready to be analyzed? Why not?

  1. In the cell below, convert your list to a list of integers. You might try repeating statements like list_data[0]=int(list_data[0]). Put a loop to work! (We will see even better ways to do this soon...)
In [17]:
for i in range(len(list_data)):
    list_data[i] = int(list_data[i])
    
print(list_data)
[1, 3, 8, 3, 9]
  1. Loop through the following list: commands = ['go', 'go', 'go', 'stop', 'go', 'go'] If the command is 'go' print out the word 'Green'. If the command is 'stop' print out the word 'Red'.
In [18]:
commands = ['go', 'go', 'go', 'stop', 'go', 'go']

for com in commands:
    if com == 'go':
        print('Green')
    else:
        print('Red')
Green
Green
Green
Red
Green
Green

List comprehensions

List comprehensions provide a very compact syntax to do loops over lists (or other iterable objects). Anything you can do with a list comprehension you can do with a for loop. In this sense, we don't really need to know this, but python programmers love list comprehensions, so you will see them in other people's code. Plus, it's a cool skill to have.

[Programmers call this kind of thing syntactic sugar. It makes the code 'sweeter' for humans to read, but doesn't add functionality to the language. You might want to casually drop this kind of language around your programmer friends.]

Here is a common problem with data in the wild. We would like to check for certain string values, but we have to be careful about cases. To facilitate comparison, let's make all the strings lower case.

First, the loop way

In [19]:
# Some data
class_rank = ['Senior', 'senior', 'Freshman', 'sophomore', 'senior', 'Junior']

# Create a new list with all lower case entries
class_rank_cleaned = []                             # creates an empty list
for datum in class_rank:
    class_rank_cleaned.append(datum.lower())        # append() adds an element to the end of a list
    
print(class_rank_cleaned)
['senior', 'senior', 'freshman', 'sophomore', 'senior', 'junior']

Not bad. We now have cleaned up data and can do comparisons without worrying about case.

Now, let's roll out a list comprehension.

In [20]:
# Some data
class_rank = ['Senior', 'senior', 'Freshman', 'sophomore', 'senior', 'Junior']

class_rank_cleaned_lc = [elem.lower() for elem in class_rank]     # elem can be named anything. It is the loop variable.

print(class_rank_cleaned_lc)
['senior', 'senior', 'freshman', 'sophomore', 'senior', 'junior']

Very clean. Very easy. Let's break down the list comprehension.

class_rank_cleaned_lc = [elem.lower() for elem in class_rank]

  1. The square brackets [] are creating a new list, just like we have done in the past
  2. The code on the left-hand side of for is the operation we want performed on each element of the list
  3. The for elem in class_rank is the for loop syntax, like we have used before.

Let's try another. Before you run the cell, what do you think this code does?

In [21]:
# What does this code do?
sq = [item**2 for item in range(3)]
print(sq)
[0, 1, 4]
In [22]:
# What about this code? What does it do?
class_rank_len = [len(elem) for elem in class_rank]
print(class_rank_len)
[6, 6, 8, 9, 6, 6]

We can apply a conditional statement so that we only perform an operation on certain elements.

In [23]:
# Seniors rule!
class_rank_caps = [i.upper() for i in class_rank_cleaned if i=='senior']
print(class_rank_caps)
['SENIOR', 'SENIOR', 'SENIOR']
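
For comparison, here is a sketch of the same filter written as a for loop with an if statement inside. The name class_rank_caps_loop is introduced here just to keep the two results apart.

```python
class_rank_cleaned = ['senior', 'senior', 'freshman', 'sophomore', 'senior', 'junior']

# The loop version: keep only the seniors and convert them to upper case
class_rank_caps_loop = []
for i in class_rank_cleaned:
    if i == 'senior':                             # the condition after 'if' in the comprehension
        class_rank_caps_loop.append(i.upper())    # the operation to the left of 'for'

# The list comprehension version
class_rank_caps = [i.upper() for i in class_rank_cleaned if i == 'senior']

print(class_rank_caps_loop == class_rank_caps)    # True
```

Elements that fail the condition are skipped entirely, so the new list is shorter than the original.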

Practice: List comprehensions

Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. The TA and I are here, too.

  1. Here is a list of interest rates: r = [0.01, 0.01, 0.015, 0.02, 0.022] Multiply each of them by 100 to make them percentage interest rates.
  2. Here we go again! Turn raw_data = '13839' into a list of integers. Use a list comprehension. (We've come a long way!)
  3. A bit harder: create two lists derived from the following list: data_list = [1, 2, 3, 4, 5, 6, 7, 8,] One list should have only the odd numbers and the other should have only the even numbers. You might use the modulo operator % which 'yields the remainder from the division of the first argument by the second.' For example 3%2 = 1.
In [24]:
# Part 1

r = [0.01, 0.01, 0.015, 0.02, 0.022]
r_pct = [i*100 for i in r]
print(r_pct)
[1.0, 1.0, 1.5, 2.0, 2.1999999999999997]
In [25]:
# Part 2
raw_data = '13839'

int_data = [int(i) for i in raw_data]
print(int_data)
[1, 3, 8, 3, 9]
In [26]:
# Part 3

data_list = [1, 2, 3, 4, 5, 6, 7, 8,]
odds = [i for i in data_list if i%2 == 1]
evens = [i for i in data_list if i%2 == 0]

print(odds)
print(evens)
[1, 3, 5, 7]
[2, 4, 6, 8]