About this notebook
This notebook/presentation was prepared for the 2017 edition of http://python.g-node.org, the renowned Advanced Scientific Programming in Python summer school (Happy 10th Anniversary!!). I gratefully acknowledge the efforts of the entire Python community, which produced the great documentation I drew on heavily to create this notebook; a list of sources can be found at the end. If I have missed anyone, apologies, let me know and I'll add you to the list!
Although you should be able to run the notebook straight out of the box, bear in mind that it was designed to work with Python 3, in conjunction with the following nbextensions:
The repository also contains exercises, with and without solutions, which I borrowed from last year's edition of the summer school.
I hope you enjoy it! By all means get in touch! :)
Etienne
import sys
print('Python version ' + sys.version)
import time
from IPython.display import display, Image
from IPython.core.display import HTML
def countdown(t, display_picture=False):
    """Displays countdown.

    Keyword arguments:
    t -- the amount of time to countdown in seconds
    """
    while t:
        mins, secs = divmod(t, 60)
        timeformat = '{:02d}:{:02d} left'.format(mins, secs)
        print(timeformat, end='\r')
        time.sleep(1)
        t -= 1
    print('Hands off of keyboards now!')
    if display_picture:
        display(Image(filename="./picts/aspp2017.png", width=400))
Python version 3.6.1 |Anaconda 4.4.0 (x86_64)| (default, May 11 2017, 13:04:09) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
Iterators are the arcane mechanism that powers for loops, and much else besides;
Generators are a kind of iterator that adds lazy evaluation and interactivity;
Decorators are a mechanism to incrementally power up existing code;
Context managers are semantically related to decorators and help you manage resources properly.
An iterable is any Python type that can be used with a for loop; calling iter() on it returns the iterator that actually produces the items.
Iterators implement the iterator protocol, which consists of two special methods, __iter__() and __next__(), that let you step through a collection of objects. In Python 3, you find them everywhere, e.g. files, i/o streams, etc.
import numpy as np
nums = np.arange(2) # ndarray contains [0, 1]
for x in nums:
print(x, end=" ")
0 1
iter(nums) # ndarray is an iterable
<iterator at 0x1040d00b8>
it = iter(nums)
it.__next__() # One way to iterate
0
next(it) # Another way to iterate
1
next(it) # Raises StopIteration exception
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-7-fad128c8b2df> in <module>()
----> 1 next(it) # Raises StopIteration exception

StopIteration:
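That is exactly what a for loop does behind the scenes; here is a minimal sketch of the equivalent while loop, using a plain list for simplicity:

```python
nums = [0, 1]
it = iter(nums)          # obtain an iterator from the iterable
while True:
    try:
        x = next(it)     # fetch the next item
    except StopIteration:
        break            # the for loop catches this for you
    print(x, end=" ")    # prints: 0 1
```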
Leonardo filius Bonacci (1175-1245), aka Leonardo Fibonacci, defined the recurrence relation that now bears his name and fuels conspiracy theorists.
class Fib:
    '''Iterator class to calculate the Fibonacci series'''
    def __init__(self, max):
        self.max = max
    def __iter__(self):          # defines initial conditions
        self.a = 0
        self.b = 1
        return self              # returns a handle to the object
    def __next__(self):          # defines behaviour for next()
        fib = self.a
        if fib > self.max:
            raise StopIteration  # is caught when in a _for_ loop
        temp_b = self.a + self.b
        self.a = self.b
        self.b = temp_b
        return fib               # F_n = F_n-1 + F_n-2
# 33rd degree in Freemason Antient & Accepted Scottish Rite
for i in Fib(33):
    print(i, end=' ')  # implicitly calls the __next__() method
0 1 1 2 3 5 8 13 21
Generators (or generator-iterators, as they are formally called) are a mechanism to simplify this process.
Python provides the yield keyword to define generators, which takes care of __iter__() and __next__() for you.
def fib_without_iterator_protocol(max):
    numbers = []         # Needs to build a full list of values
    a, b = 0, 1          # a = 0 and b = 1
    while a < max:
        numbers.append(a)
        a, b = b, a + b  # Evaluate right-hand side first, then assign
    return numbers       # Returns the full list of numbers
for i in fib_without_iterator_protocol(33):
    print(i, end=" ")    # iterates through the list of values
0 1 1 2 3 5 8 13 21
In real-life problems, this way of doing things is problematic because it forces us to compute every number up front and to store the whole sequence in memory in one go.
yield expression_list
yield does something similar to return:
yield saves local state and variables, instruction pointer and internal evaluation stack; i.e. enough information so that .__next__() behaves like an external call.
def fib_with_yield(max_limit):
    '''fib function using yield'''
    a, b = 0, 1          # a = 0 and b = 1
    while a < max_limit:
        yield a          # freezes execution, returns current a
        a, b = b, a + b  # evaluate right-hand side first, then assign
for i in fib_with_yield(33):
    print(i, end=" ")
0 1 1 2 3 5 8 13 21
my_masonic_secret = fib_with_yield(33)
my_masonic_secret
<generator object fib_with_yield at 0x104157620>
next(my_masonic_secret)
0
next(my_masonic_secret)
1
next(my_masonic_secret)
1
next(my_masonic_secret)
2
... and so on.
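When the generator is exhausted, the next call to next() raises StopIteration, just as with the hand-written iterator class; a small sketch reusing fib_with_yield:

```python
def fib_with_yield(max_limit):
    a, b = 0, 1
    while a < max_limit:
        yield a
        a, b = b, a + b

g = fib_with_yield(2)
print(list(g))       # consumes the whole generator: [0, 1, 1]
try:
    next(g)          # nothing left to yield
except StopIteration:
    print("exhausted")
```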
Write a function that uses yield to draw numbers for the lottery -- drawing with replacement is fine! That's six numbers between 1 and 40; and if you feel ambitious, add one number between 1 and 15.
countdown(1)
Hands off of keyboards now!
import random
def super_million_lottery():
    # yields 6 numbers between 1 and 40
    for i in range(6):
        yield random.randint(1, 40)
    # yields a 7th number between 1 and 15
    yield random.randint(1, 15)
for i in super_million_lottery():
    print(i, end=" ")
4 18 9 35 33 16 3
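If you prefer a draw without replacement (real lotteries don't repeat numbers), here is a sketch using random.sample, which picks distinct values:

```python
import random

def lottery_without_replacement():
    # 6 distinct numbers between 1 and 40
    yield from random.sample(range(1, 41), 6)
    # plus one bonus number between 1 and 15
    yield random.randint(1, 15)

numbers = list(lottery_without_replacement())
print(numbers)  # e.g. [12, 3, 40, 27, 8, 19, 7]
```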
Python's list comprehension, with [..], computes everything at once and can take a lot of memory.
squares = [i**2 for i in range(10)]
squares
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Generator expressions, with (..), are computed on demand.
squares = (i**2 for i in range(10))
squares
<generator object <genexpr> at 0x1041575c8>
On-demand calculation is important for the streamed processing of large amounts of data, where the size of the data is uncertain, the values of parameters are changing, etc., or when the processing steps might take a long time, yield errors, or enter infinite loops.
Generators are also an easier way to handle callbacks, and can be used to simulate concurrency.
Bash pipeline to count the number of characters, omitting whitespaces, per line, in a given file:
!sed 's/^//g' ./custom.css | tr -d ' ' | awk '{ printf "%i ", length($0); }'
# Noticed the magic "!"? Type %lsmagic in a cell to learn more
# https://blog.dominodatalab.com/lesser-known-ways-of-using-notebooks/
5 41 15 14 21 1 28 11 1
Same processing pipeline using native generators:
my_custom_css = open("./custom.css")
line_stripped = (line.replace(" ", "").rstrip('\n') for line in my_custom_css)
size_line = (len(line) for line in line_stripped)
for i in size_line:
    print(i, end=" ")
5 41 15 14 21 1 28 11 1
In your research, you may only need to analyse one single .csv file.
More likely, you will be faced with increasingly bigger and more complex data, leaning towards so-called "big data", whatever that actually is.
This data won't fit in your workspace, may be "live" and constantly changing, and will require real-time or batch analysis methods; e.g., you won't be able to store the raw data, but will have to filter it and compute "metrics" such as averages and standard deviations, to then decide what to do with the data.
You'll enter the realm of big data techniques, which will attempt to decouple data handling from analysis, and pipeline steps of preprocessing to ease analyses proper.
Keywords: dataflow, processing pipelines and stream processors, MapReduce, lambda & kappa architectures, Dremel; e.g., Hadoop.
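As a toy illustration of that streaming idea, here is a sketch of a generator that yields a running mean without ever storing the stream (the data source is just a stand-in):

```python
def running_mean(stream):
    """Yield the mean of the values seen so far, storing none of them."""
    total, count = 0.0, 0
    for value in stream:
        total += value
        count += 1
        yield total / count

# A generator expression stands in for a live data source
data = (x * x for x in range(1, 5))
for mean in running_mean(data):
    print(mean, end=" ")  # 1.0 2.5 4.666666666666667 7.5
```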
You can simulate concurrency by interacting with instantiated (currently alive) functions.
def receiver():
    while True:
        item = yield
        print("I'm currently processing:", item)
recv = receiver()   # Instantiate the function
next(recv)          # Starts it; alternatively recv.send(None)
recv.send("Hello")  # Python's .send() communicates..
recv.send("World")  # ..with the instantiated object
I'm currently processing: Hello
I'm currently processing: World
recv.close() # Obviously, clean up after yourself
def my_generator():
    ...
    item = yield
    ...
    value = do_something(item)
    ...
    yield value  # return value
gen = my_generator()
next(gen)                # Starts generator and advances to yield
value = gen.send(item)   # Sends and receives stuff
gen.close()              # Terminates
gen.throw(exc, val, tb)  # Raises exception
result = yield from gen  # Handles callback and returns content
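Of these, yield from deserves a quick illustration: it delegates to a sub-generator, passing its items through. A minimal sketch:

```python
def inner():
    yield 1
    yield 2

def outer():
    yield 0
    yield from inner()  # delegates: items of inner() pass through
    yield 3

print(list(outer()))    # [0, 1, 2, 3]
```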
Functions are objects themselves.
def shout(word="hello world"):
return word.capitalize() + "!"
print(shout())
Hello world!
yell = shout
print(yell())
Hello world!
del shout
try:                # this is how you catch an Exception
    print(shout())  # This won't work
except NameError as e:
    print(e)
print(yell())       # But this still works
name 'shout' is not defined
Hello world!
Being objects, functions can also be defined inside other functions.
def languaging():
    def whisper(word="Hello world"):
        return word.lower() + "..."
    print(whisper())
languaging()
hello world...
try:
    print(whisper())  # whisper is outside the scope!
except NameError as e:
    print(e)
name 'whisper' is not defined
def languaging(type="shout"):
    def shout(word="hello world"):
        return word.capitalize() + "!"
    def whisper(word="hello world"):
        return word.lower() + "..."
    if type == "shout":
        return shout
    else:
        return whisper
speak = languaging()
print(speak)
<function languaging.<locals>.shout at 0x10416a0d0>
print(speak())
Hello world!
print(languaging("whisper")())
hello world...
If functions, as objects, can be returned, they can also be arguments!
def my_good_old_analysis():
    print("Ah, the way we've always done analysis.")
my_good_old_analysis()
Ah, the way we've always done analysis.
def deprecated(my_function):
    def wrapper():
        print("!!! You should not be using this function.")
        my_function()
        print("!!! Please, don't do it.")
    return wrapper
my_good_old_analysis = deprecated(my_good_old_analysis)
my_good_old_analysis()
!!! You should not be using this function.
Ah, the way we've always done analysis.
!!! Please, don't do it.
And this is exactly what decorators do!
def deprecated(my_function):
    def wrapper():
        print("!!! You should not be using this function.")
        my_function()
        print("!!! Please, don't do it")
    return wrapper
@deprecated # <-- ain't this a pretty decorator?
def my_even_older_analysis():
    print("Aaaaah, please kill me.")
my_even_older_analysis()
!!! You should not be using this function.
Aaaaah, please kill me.
!!! Please, don't do it
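One caveat: the wrapper() above only accepts zero-argument functions, and it hides the wrapped function's name. A more general sketch forwards *args/**kwargs and uses the standard library's functools.wraps (the add function is just an example):

```python
import functools

def deprecated(my_function):
    @functools.wraps(my_function)  # keeps __name__ and __doc__ intact
    def wrapper(*args, **kwargs):
        print("!!! You should not be using this function.")
        return my_function(*args, **kwargs)
    return wrapper

@deprecated
def add(a, b):
    """Add two numbers."""
    return a + b

print(add(1, 2))     # prints the warning, then 3
print(add.__name__)  # 'add', not 'wrapper', thanks to wraps
```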
Some built-in Python decorators ease abstraction (exposing only relevant information) and encapsulation (combining data and functions into a usable unit). See: https://docs.python.org/3.6/howto/descriptor.html
class My_class:
    def __init__(self, x):
        self.x = x
    @property              # Built-in Python decorator
    def x(self):           # x is publicly accessible
        return self._x     # _x is private
    @x.setter              # ".setter" built-in Python decorator
    def x(self, x):
        if x < 0:          # Implementation is hidden from end-users
            self._x = 0    # _x actually stores the data
        elif x > 1000:     # "_" is a warning to end-users
            self._x = 1000 # that things under the hood may
        else:              # change in future releases, and that
            self._x = x    # it's dangerous to rely on them
my_instance = My_class(10000)
my_instance._x -= 1
print( my_instance._x )
999
Given a function that uses an ingredient "---Ham---" as a string, write decorators that will draw a sandwich, like this:
/''''''\
@Tomatoes@
---Ham---
~~Salad~~
\______/
countdown(1)
Hands off of keyboards now!
def bread(my_function):
    def wrapper():
        print(" /''''''\\ ")
        my_function()
        print(" \\______/ ")
    return wrapper

def ingredients(my_function):
    def wrapper():
        print("@Tomatoes@")
        my_function()
        print("~~Salad~~")
    return wrapper

@bread        # Order matters
@ingredients  #
def sandwich(food="---Ham---"):
    print(food)
sandwich()
 /''''''\ 
@Tomatoes@
---Ham---
~~Salad~~
 \______/ 
Context managers are semantically related to decorators.
They aim primarily to help you manage resources properly: groom your memory, avoid consumer bottlenecks, clean up after yourself, keep (database) connections alive, and other sensible things.
files = []
for x in range(100000):
    files.append(open("how_to_mess_up_my_memory.txt", "w"))
#.. at this point of the notebook, I have messed up my memory
# and won't be able to open any more files
ERROR:root:Internal Python error in the inspect module. Below is the traceback from this internal error.
Traceback (most recent call last):
  File "//anaconda/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-39-bfc1ac5e0354>", line 3, in <module>
    files.append(open("how_to_mess_up_my_memory.txt", "w"))
OSError: [Errno 24] Too many open files: 'how_to_mess_up_my_memory.txt'

(Further [Errno 24] OSErrors occurred while IPython itself tried to render this traceback.)
In real life, you are dealing with finite resources. When you allocate a resource to a particular task, you need to make sure you use only what you need and, when you are done, release it for other tasks/people to use.
print(my_custom_css) # Remember me? (see Section Generators)
if not my_custom_css.closed:
    print("Clean up, or you'll mess up your memory!")
<_io.TextIOWrapper name='./custom.css' mode='r' encoding='UTF-8'>
Clean up, or you'll mess up your memory!
my_custom_css.close() # Always clean up after yourself!
del my_custom_css
my_custom_css
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-41-40ecda7ea446> in <module>()
      1 my_custom_css.close() # Always clean up after yourself!
      2 del my_custom_css
----> 3 my_custom_css

NameError: name 'my_custom_css' is not defined
That's primarily what context managers do for you.
# First, I need to clean up the mess I created by opening
# 100K files, otherwise I won't be able to open files
files = []
#for name in dir():
#    if not name.startswith('files'):
#        del globals()[name]
with open("./custom.css") as my_custom_css:
    for line in my_custom_css:
        print(len(line), end=" ")
7 49 19 18 25 2 30 13 2
if not my_custom_css.closed:
    print("Clean up, or you'll mess up your memory!")
else:
    print("It's already closed! Ain't that wonderful?")
It's already closed! Ain't that wonderful?
That's all there is to it: the with ... as statement binds a short-lived variable within a given scope.
It automatically calls a number of "management" functions for you.
You'll find context managers for files, locks, threads, database connections, and you can implement your own.
class File():
    def __init__(self, filename, mode):
        self.filename = filename
        self.mode = mode
    def __enter__(self):
        self.open_file = open(self.filename, self.mode)
        return self.open_file
    def __exit__(self, *args):
        self.open_file.close()
files = []
for _ in range(100000):
    with File('that_shouldnt_mess_up_my_memory.txt', 'w') as myfile:
        files.append(myfile)
len(files)
100000
for i in range(len(files)):
    if not files[i].closed:
        print("Arrg, files[%i] is not closed!" % i)
# Hopefully, there is no output to this cell!! :)
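The same idea can be expressed with a generator, tying this section back to yield: the standard library's contextlib.contextmanager turns a one-yield generator into a context manager. A sketch (the temp-file path is illustrative):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def managed_file(filename, mode):
    f = open(filename, mode)
    try:
        yield f    # the body of the with-block runs here
    finally:
        f.close()  # runs even if the body raises

path = os.path.join(tempfile.gettempdir(), "demo.txt")
with managed_file(path, "w") as f:
    f.write("hello")
print(f.closed)    # True: closed automatically on exit
```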
Write a context manager that will measure the amount of time spent within its scope.
import time
class MyTimeIt():
    def __init__(self):
        ...  # Write something here to initialise the timer
    def __enter__(self):
        ...  # Here, start the timer
    def __exit__(self, *args):
        ...  # Here, measure the amount of time spent
with MyTimeIt():
    time.sleep(2)
countdown(1)
Hands off of keyboards now!
import time
class MyTimeIt():
    def __init__(self):
        self.t = 0.
    def __enter__(self):
        self.t = time.time()
    def __exit__(self, *args):
        print('This function took {:.2f} seconds.'.format(time.time() - self.t))
with MyTimeIt():
    time.sleep(2)
This function took 2.00 seconds.
Iterators are the arcane mechanism that powers for loops, and much else besides;
Generators are a kind of iterator that adds lazy evaluation and interactivity;
Decorators are a mechanism to incrementally power up existing code;
Context managers are semantically related to decorators and help you manage resources properly.