files needed = none
Before we can start working with data, we need to work out some of the basics of Python. The goal is to learn enough so that we can do some interesting data work --- we do not need to be Python Jedi.
We now know about the basic data structures in python, how types work, and how to do some basic computation and string manipulation. We can use flow control statements to steer our program to different blocks of code depending on conditional statements and we have sorted out loops and list comprehensions.
Up next are a few more important topics before we get started with pandas. We will cover slicing, user-defined functions, and objects and methods.
Slicing is an important part of python life. We slice a list (or a tuple or a string) when we take a subset of it. As you can probably imagine, slicing will be a common thing we do with data in pandas. We often want to grab slices of the data set and analyze them.
The slice syntax uses square brackets, even if we are slicing a string or a tuple. The basic command is some_list[start:stop:stride], where
start is the first element to include in the slice
stop is the first element we do NOT include
stride is the step size
Notice that the start is inclusive and the stop is exclusive. Think of a slice as a half open interval in mathematics, [start, stop): we include start in the interval but exclude stop.
The default stride is 1, meaning take every element from [start, stop).
some_list = [5, 6, 7, 8, 9]
print(some_list[0:2])    # indexes start with zero; stride defaults to 1
print(some_list[0:2:1])  # this should be the same
print(some_list[0:5:2])  # take every other element

[5, 6]
[5, 6]
[5, 7, 9]
# take a slice out of the middle
print(some_list[1:3])    # take the second element and the third element
If we want to take a start and then 'everything to the end' we just leave the second argument blank. A similar syntax takes everything from the beginning up to a stop: we leave the first argument blank.
print(some_list[2:])   # the third element to the end of the list
print(some_list[:4])   # everything up to but not including the fifth element

[7, 8, 9]
[5, 6, 7, 8]
One nice thing about this half open interval syntax is that we can divide up a list very neatly:
first_part = some_list[:3]
second_part = some_list[3:]
print(first_part, second_part, some_list)

[5, 6, 7] [8, 9] [5, 6, 7, 8, 9]
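The half open convention has two handy consequences that we can check directly. The sketch below is only an illustration (the names data, start, stop, and middle are made up for this example): when 0 <= start <= stop <= len(data), a slice with the default stride has stop - start elements, and splitting at any index k gives two pieces that glue back into the original list.

```python
data = [5, 6, 7, 8, 9]   # illustrative list, same values as some_list

start, stop = 1, 4
middle = data[start:stop]
print(middle, len(middle) == stop - start)

# splitting at any k from 0 to len(data) loses nothing and duplicates nothing
for k in range(len(data) + 1):
    assert data[:k] + data[k:] == data
print('every split point glues back together')
```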
Slice arguments can be negative. When we use a negative number for start or stop, we are telling python to count from the end of the list.
print(some_list[:-1])     # all but the last one
print(some_list[:-2])     # all but the last two
print(some_list[-4:-2])   # ugh (again, we don't take the -2 value)

# [ 5 |  6 |  7 |  8 |  9]    # The list
#  -5   -4   -3   -2   -1     # backwards counting
#   0    1    2    3    4     # forwards counting

[5, 6, 7, 8]
[5, 6, 7]
[6, 7]
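If the negative numbers feel awkward, it may help to translate them into positive ones. The following is a small sketch (data, n, neg_slice, and pos_slice are names invented for this example): an index of -k refers to the same element as the index len(data) - k.

```python
data = [5, 6, 7, 8, 9]
n = len(data)

# a negative index -k names the same element as the index n - k
for k in range(1, n + 1):
    assert data[-k] == data[n - k]

# so a slice with negative arguments can be rewritten with positive ones
neg_slice = data[-4:-2]
pos_slice = data[n - 4:n - 2]   # the same as data[1:3]
print(neg_slice, pos_slice)
```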
If we use a negative number for the stride argument, we iterate backwards.
print(some_list[::-1])     # print the list out backwards
print(some_list[4:1:-1])   # we are counting backwards, so be careful about start and stop
                           # start at the fifth element in the list and end at the third

[9, 8, 7, 6, 5]
[9, 8, 7]
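A common slip with a negative stride is to leave start to the left of stop, which quietly returns an empty list rather than an error. Here is a short sketch of that behavior (the variable names are made up for this example). It also shows that slicing builds a new list and leaves the original alone.

```python
data = [5, 6, 7, 8, 9]

backwards = data[4:1:-1]    # visits indexes 4, 3, 2
empty = data[1:4:-1]        # start is to the left of stop: nothing to take
all_reversed = data[::-1]   # blank start and stop cover the whole list
print(backwards, empty, all_reversed)

# the slices are new lists; the original is unchanged
print(data)
```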
# don't forget, we can do this with strings, too
slogan = 'onward'
print(slogan[:2])     # just print 'on'
print(slogan[::-1])   # backwards
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. The TA and I are here, too.
boss = 'Ananth Seshadri'
Slice boss to create the variables first_name and last_name.
Redo the slices of boss using the negative number notation that counts from the end of the string.
boss = 'Ananth Seshadri'
first_name = boss[:6]
last_name = boss[7:]
print(first_name)
print(last_name, '\n')

first_name_neg = boss[-15:-9]
last_name_neg = boss[-8:]
print(first_name_neg)
print(last_name_neg)

Ananth
Seshadri 

Ananth
Seshadri
Consider this list of sorted data.
x_sorted = [10, 40, 100, 1000, 50000]
x_sorted = [10, 40, 100, 1000, 50000]
print('The three largest elements are:', x_sorted[-3:])
print('The three smallest elements are:', x_sorted[:3])

The three largest elements are: [100, 1000, 50000]
The three smallest elements are: [10, 40, 100]
We have seen some of python's built-in functions:
len(). Like many other languages, python allows users to create their own functions.
Using functions lets us (or someone else) write and debug the code once --- then we can reuse it. Very powerful stuff. Here is a simple example:
def lb_to_kg(pounds):
    """
    Input a weight in pounds. Return the weight in kilograms.
    """
    kilos = pounds * 0.453592   # 1 pound = 0.453592 kilos...
    return kilos                # this is the value the function returns
When you run the cell above, it looks like nothing happened, but python read the code and created the function. We can use the whos statement (a jupyter notebook 'magic' command) to learn about what objects are in the namespace. [A namespace is a list of all the objects we have created and the names we have assigned them.]
Variable         Type        Data/Info
--------------------------------------
boss             str         Ananth Seshadri
first_name       str         Ananth
first_name_neg   str         Ananth
first_part       list        n=3
last_name        str         Seshadri
last_name_neg    str         Seshadri
lb_to_kg         function    <function lb_to_kg at 0x000002C735BEB378>
second_part      list        n=2
slogan           str         onward
some_list        list        n=5
x_sorted         list        n=5
We can see the variables we have created earlier as well as the function lb_to_kg. Notice functions are of type function. Just like any other variable, lb_to_kg is loaded into the namespace.
Now that our function is defined, we are ready to use it.
car_weight_pounds = 5000
car_weight_kilos = lb_to_kg(car_weight_pounds)
print('The car weighs', car_weight_kilos, 'kilos.')
The car weighs 2267.96 kilos.
Since it is our function, we have to handle potentially bad inputs, or python will throw an error.
truck_weight_pounds = '5000'   # A classic problem with real data
truck_weight_kilos = lb_to_kg(truck_weight_pounds)
print('The truck weighs', truck_weight_kilos, 'kilos.')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-49d4bc652b32> in <module>
      1 truck_weight_pounds = '5000' # A classic problem with real data
----> 2 truck_weight_kilos = lb_to_kg(truck_weight_pounds)
      3 print('The truck weighs', truck_weight_kilos, 'kilos.')

<ipython-input-10-e046052c2859> in lb_to_kg(pounds)
      4     """
      5 
----> 6     kilos = pounds * 0.453592 # 1 pound = 0.453592 kilos...
      7 
      8     return kilos # this is the value the function returns

TypeError: can't multiply sequence by non-int of type 'float'
def lb_to_kg_v2(pounds):
    """
    Input a weight in pounds. Return the weight in kilograms.
    """
    if type(pounds) == float or type(pounds) == int:   # check that pounds is an allowable type
        kilos = pounds * 0.453592   # 1 pound = 0.453592 kilos...
        return kilos                # this is the value the function returns
    else:
        print('error: lb_to_kg_v2 only takes integers or floats.')
        return -99
truck_weight_pounds = '5000'   # A classic problem with real data
truck_weight_kilos = lb_to_kg_v2(truck_weight_pounds)
print('The truck weighs', truck_weight_kilos, 'kilos.')

error: lb_to_kg_v2 only takes integers or floats.
The truck weighs -99 kilos.
How much time you spend writing code that is safe from errors is a tradeoff between your time and how robust your code needs to be. Life is all about tradeoffs.
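As one point on that tradeoff, here is a sketch of a slightly different type check. The function name lb_to_kg_checked is hypothetical and not part of the notebook. It uses the built-in isinstance(), which accepts a tuple of allowed types. Since python treats True and False as integers, the sketch assumes we want to reject them explicitly.

```python
def lb_to_kg_checked(pounds):
    """
    Hypothetical variant of lb_to_kg_v2 that checks the type with isinstance().
    """
    # bool is a subclass of int, so rule it out explicitly (an assumption about what we want)
    if isinstance(pounds, (int, float)) and not isinstance(pounds, bool):
        return pounds * 0.453592   # 1 pound = 0.453592 kilos
    print('error: lb_to_kg_checked only takes integers or floats.')
    return -99   # same error code the notes use in lb_to_kg_v2

print(lb_to_kg_checked(5000))
print(lb_to_kg_checked('5000'))
```

Whether True should count as a valid weight is a judgment call; drop the second isinstance() test if you would rather accept it.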
We can have functions with several input variables:
def name_fixer(first, middle, last):
    """
    Fix any capitalization problems and create a single variable with the complete name.
    """
    # the string method title() makes the first letter capital
    return first.title() + ' ' + middle.title() + ' ' + last.title()
You may have noticed that we write a triple-quote comment at the beginning of our functions. This is called a docstring, and we use it to tell others what the function does. Remember the '?' operator? Give it a try below.
Help on function name_fixer in module __main__:

name_fixer(first, middle, last)
    Fix any capitalization problems and create a single variable with the complete name.
Now let's try out the name_fixer() function.
mascot_first = 'bucKingham'
mascot_middle = 'u'
mascot_last = 'badger'
full_name = name_fixer(mascot_first, mascot_middle, mascot_last)
print(full_name)
Buckingham U Badger
Important: We can assign several return variables. This is called multiple assignment. First, let's look at multiple assignment outside of a function, then we use it in a function.
# this is an example of multiple assignment.
a, b = 'foo', 10   # assign 'foo' to a and 10 to b...all in one statement
print(a, b)
Back on day one, we worked on the problem of swapping two variables: given m=2 and n=3, write some code that swaps the values of m and n.
Back then, we created a temp variable to help us make the swap. Now that we have some python under our belts we can just do this:
m = 2
n = 3   # I could have used multiple assignment here, too, but didn't
print('m=', m, 'n=', n)
m, n = n, m   # make the swap
print('m=', m, 'n=', n)

m= 2 n= 3
m= 3 n= 2
Wow. Don't underestimate the Force.
Multiple assignment lets us return several objects from a function.
def temp_converter(temp_in_fahrenheit):
    """
    Takes a temperature in fahrenheit and returns it in celsius and in kelvin.
    """
    temp_in_celsius = (temp_in_fahrenheit - 32) * 5/9
    temp_in_kelvin = (temp_in_fahrenheit + 459.67) * 5/9
    return temp_in_celsius, temp_in_kelvin

# Note that I am defining the function and using it in the same code cell.
# The code below is NOT part of the function definition. We can see that because it is not indented.
t_f = 65   # temp in fahrenheit
t_c, t_k = temp_converter(t_f)
print(t_f, 'degrees fahrenheit', 'is', t_c, 'degrees celsius and', t_k, 'degrees kelvin.')
65 degrees fahrenheit is 18.333333333333332 degrees celsius and 291.48333333333335 degrees kelvin.
We need to work on our string formatting at some point...
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. The TA and I are here, too.
Modify the name_fixer() function to return both the fixed-up full name and the length of the full name. Use multiple assignment. Test it.
# Part 1
def change_counter(pennies, nickels, dimes, quarters):
    """
    Compute the value of a given number of pennies, nickels, dimes, and quarters.
    """
    return pennies + nickels*5 + dimes*10 + quarters*25

print(change_counter(5, 0, 4, 2), 'cents')
# Part 2
def name_fixer_improved(first, middle, last):
    """
    Fix any capitalization problems and create a single variable with the complete name.
    Also return the length of the name.
    """
    # the string method title() makes the first letter capital
    full_name = first.title() + ' ' + middle.title() + ' ' + last.title()
    return full_name, len(full_name)

print(name_fixer_improved('nelsoN', 'websTER', 'DEweY'))
('Nelson Webster Dewey', 20)
The split(sep) string method breaks up a string into sub-strings. The argument sep defines the delimiting character. [try help(str.split)]
test_string = 'There is a place where the sidewalk ends'
test_string_chunks = test_string.split(sep=' ')   # use the space as the delimiter
print(type(test_string_chunks))
print(test_string_chunks)

<class 'list'>
['There', 'is', 'a', 'place', 'where', 'the', 'sidewalk', 'ends']
Write a function that takes names of the form 'last,first,middle' and returns three strings: first, middle, and last. Test your function with 'Silverstein,Sheldon,Allan'.
def name_splitter(name):
    split_name = name.split(',')   # the pieces arrive in the order last, first, middle
    return split_name[1], split_name[2], split_name[0]

first, middle, last = name_splitter('Silverstein,Sheldon,Allan')
print(first, middle, last)
Sheldon Allan Silverstein
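Since split() returns a list, multiple assignment can also name the pieces as they come out, which avoids juggling index numbers. The following is one possible sketch (the function name name_splitter_unpacked is made up for this example). It assumes the input always has exactly two commas; anything else raises a ValueError at the unpacking step.

```python
def name_splitter_unpacked(name):
    """
    Take a name of the form 'last,first,middle' and return first, middle, last.
    """
    last, first, middle = name.split(',')   # name the three pieces in one statement
    return first, middle, last

first, middle, last = name_splitter_unpacked('Silverstein,Sheldon,Allan')
print(first, middle, last)
```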
Everything in python is an object. The variables we have been creating are objects. The functions we have written are objects. Objects are useful because they have attributes and methods associated with them. What attributes and methods an object has, depends on the object's type. Let's take lists for example.
list_1 = ['a', 'b', 'c'] list_2 = [4, 5, 6, 7, 8]
Both lists are objects and both have type list, but their attributes are different. For example list length is an attribute: list_1 is of length 3, while list_2 is of length 5.
Methods are like functions that are attached to an object. Different types of objects have different methods available. Methods implement operations that we often use with a particular data type. We access methods with the 'dot' notation.
method() is a method associated with the list class. We have been using the
title() methods of the string class already. We just used the
split() method of the string class.
list_1 = ['a', 'c', 'b']
print(list_1)
['a', 'c', 'b']
list_1.sort()   # the sort() method from the 'list' class
print(list_1)
['a', 'b', 'c']
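One detail worth seeing in action: the sort() method changes the list it is attached to and returns None, while the built-in function sorted() leaves the list alone and returns a new one. A small sketch of the difference (letters, new_list, and result are names made up for this example):

```python
letters = ['a', 'c', 'b']

new_list = sorted(letters)   # built-in function: returns a new sorted list
print(new_list, letters)     # letters is unchanged

result = letters.sort()      # list method: sorts letters in place
print(result, letters)       # the method itself returns None
```

This is why writing letters = letters.sort() is a common mistake: it replaces the list with None.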
How do we find out what methods are available for an object? Google is always a good way. You can also use help() with the class name: help(str) for strings, help(list) for lists.
Important: We can also use TAB completion in jupyter. Type list_1. in the cell below and hit the TAB key.
Help on class list in module builtins: class list(object) | list(iterable=(), /) | | Built-in mutable sequence. | | If no argument is given, the constructor creates a new empty list. | The argument must be an iterable if specified. | | Methods defined here: | | __add__(self, value, /) | Return self+value. | | __contains__(self, key, /) | Return key in self. | | __delitem__(self, key, /) | Delete self[key]. | | __eq__(self, value, /) | Return self==value. | | __ge__(self, value, /) | Return self>=value. | | __getattribute__(self, name, /) | Return getattr(self, name). | | __getitem__(...) | x.__getitem__(y) <==> x[y] | | __gt__(self, value, /) | Return self>value. | | __iadd__(self, value, /) | Implement self+=value. | | __imul__(self, value, /) | Implement self*=value. | | __init__(self, /, *args, **kwargs) | Initialize self. See help(type(self)) for accurate signature. | | __iter__(self, /) | Implement iter(self). | | __le__(self, value, /) | Return self<=value. | | __len__(self, /) | Return len(self). | | __lt__(self, value, /) | Return self<value. | | __mul__(self, value, /) | Return self*value. | | __ne__(self, value, /) | Return self!=value. | | __repr__(self, /) | Return repr(self). | | __reversed__(self, /) | Return a reverse iterator over the list. | | __rmul__(self, value, /) | Return value*self. | | __setitem__(self, key, value, /) | Set self[key] to value. | | __sizeof__(self, /) | Return the size of the list in memory, in bytes. | | append(self, object, /) | Append object to the end of the list. | | clear(self, /) | Remove all items from list. | | copy(self, /) | Return a shallow copy of the list. | | count(self, value, /) | Return number of occurrences of value. | | extend(self, iterable, /) | Extend list by appending elements from the iterable. | | index(self, value, start=0, stop=9223372036854775807, /) | Return first index of value. | | Raises ValueError if the value is not present. | | insert(self, index, object, /) | Insert object before index. 
| | pop(self, index=-1, /) | Remove and return item at index (default last). | | Raises IndexError if list is empty or index is out of range. | | remove(self, value, /) | Remove first occurrence of value. | | Raises ValueError if the value is not present. | | reverse(self, /) | Reverse *IN PLACE*. | | sort(self, /, *, key=None, reverse=False) | Stable sort *IN PLACE*. | | ---------------------------------------------------------------------- | Static methods defined here: | | __new__(*args, **kwargs) from builtins.type | Create and return a new object. See help(type) for accurate signature. | | ---------------------------------------------------------------------- | Data and other attributes defined here: | | __hash__ = None
['a', 'b', 'c']
The TAB gives us a list of possible methods. We have already seen
reverse() looks interesting. Let's give it a try.
['c', 'b', 'a']
TAB completion is also there to make it easier to reference variables in the namespace. Insert a code cell and start typing lis and hit tab. It should bring up a list of variables in the namespace that start with 'lis'. This is handy: it saves typing and avoids errors from typos.
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. The TA and I are here, too.
gdp = '18,570.50'. Convert the variable to a float. Use TAB completion (and Google, if needed) to find a method that removes the comma.
gdp = '18,570.50'
gdp = gdp.replace(',', '')
gdp = float(gdp)
print(gdp, 'is of type', type(gdp))
18570.5 is of type <class 'float'>
new_score into the list in the correct position so that the list stays sorted.
scores = [50, 32, 78, 99, 39, 75]
new_score = 85
print(scores)
scores.sort()
print(scores)

[50, 32, 78, 99, 39, 75]
[32, 39, 50, 75, 78, 99]
scores.insert(5, new_score)
print(scores)

[32, 39, 50, 75, 78, 85, 99]
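In the solution above we found position 5 by looking at the sorted list. If we would rather have python find the position, the standard library's bisect module can do it. This is an alternative sketch and not the method used in the notes; it relies on the list already being sorted.

```python
import bisect

scores = [50, 32, 78, 99, 39, 75]
new_score = 85

scores.sort()                            # insort() assumes the list is already sorted
position = bisect.bisect(scores, new_score)   # where new_score belongs
bisect.insort(scores, new_score)         # insert it there, keeping the list sorted
print(position, scores)
```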