Today we briefly will covere some intermediate-to-advanced Python concepts.
Topics:
Knowledge of these concepts will take your Python to a more advanced level.
Remember: Any data Python computes during runtime is stored as an object in memory (RAM).
** Not to be confused with long term storage.**
Objects take up space. Just like books on a bookshelf.
In general, we want to keep the memory footprint of our programs to a minimum.
Misuse of memory can cause:
MemoryError
)In today's world of "Big Data", memory usage is a growing concern.
Typical computer memory size is a couple of GB ~4-16 GB of memory.
What happens when you need to process a data file that is 50GB large?
Generators to the rescue!
Under the hood we've been dealing with generators all along but thinking of them as lists.
A list is a pre-computed container object because all its values exist in memory all at once.
# this is a list of the first 10 numbers
nums = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
We can access each element directly as they are all stored in memory.
nums[6]
6
I can also create this list with a loop.
def firstn(n):
nums = []
count = 0
while count < n:
nums.append(count)
count += 1
return nums
nums = firstn(10)
print(nums)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
But what if we don't need all the numbers stored all at once?
If we know the rule for computing the next number in the sequence we can just not do anything and spit out the next number when asked for it.
This is what generator functions do.
generator funcions return generator objects and can yield values during their execution.
In this case, the rule is "the next number is the previous number + 1"
def firstn_gen(n):
count = 0
while count < n:
yield count
count += 1
nums = firstn_gen(10)
print(nums)
<generator object firstn_gen at 0x1798696d0>
Ok so we got a generator
object instead of a list.
generator
objects have an attribute called __next__()
which is a function that returns the next item in the sequence.
nums.__next__()
0
i = nums.__next__()
print(i)
3
If we put a geneator in a for loop, python repeatedly calls __next__()
for you until it runs out of items.
nums = firstn_gen(10)
for i in nums:
print(i)
0 1 2 3 4 5 6 7 8 9
Note: once we pulled out all the items in a geneator, we can't re-use it.
for i in nums:
print(i)
g = firstn_gen(100)
Okay back to the funny yield
statement.
def firstn_gen(n):
count = 0
print("before loop")
while count < n:
print("yielding number")
yield count
print("i'm back. incrementing count")
count += 1
nums = firstn_gen(10)
nums
is a generator and we can call its __next__
method.
Generators behave like functions except their execution can be interrupted and re-started.
When we reach a yield
statement, the function exits, yields a number, and all varible assignments are maintained.
With a normal function, once a function is exited, all local memory is lost.
__next__
executes the function code.
If it's the first time __next__
is called, execution begins at the top of the definition and stops when yield
is reached.
Subsequent calls resume after the yield
.
#first call
nums.__next__()
before loop yielding number
0
#second call, resumes after 'yield'
nums.__next__()
x = 5
d = {}
--------------------------------------------------------------------------- StopIteration Traceback (most recent call last) <ipython-input-106-65d4a4f802fb> in <module>() 1 #second call, resumes after 'yield' ----> 2 nums.__next__() 3 x = 5 4 d = {} StopIteration:
Because the generator remembers its current state, we can continue with the rest of the numbers with a for loop.
for n in nums:
print(n)
i'm back. incrementing count yielding number 2 i'm back. incrementing count yielding number 3 i'm back. incrementing count yielding number 4 i'm back. incrementing count yielding number 5 i'm back. incrementing count yielding number 6 i'm back. incrementing count yielding number 7 i'm back. incrementing count yielding number 8 i'm back. incrementing count yielding number 9 i'm back. incrementing count
Often we want to read from a very large file and do something with each line but we don't need the whole file loaded all at once.
I downloaded all the questions about Python on Stack Overflow, a website I'm sure you're familiar with by now.
filepath = "Questions.csv"
Here is the wrong way.
def load_data_list(path, encoding="utf-8"):
file_handle = open(path, "r", encoding=encoding)
file_lines = file_handle.readlines()
file_handle.close()
return file_lines
#this file has a non-standard encoding "latin-1" so I have to specify that. don't worry about file encodings.
questions = load_data_list(filepath, encoding="latin-1")
We can use memory_profiler
(doesn't work in Notebooks) to see how much memory this uses.
print(questions[:10])
['Id,OwnerUserId,CreationDate,Score,Title,Body\n', '469,147,2008-08-02T15:11:16Z,21,How can I find the full path to a font from its display name on a Mac?,"<p>I am using the Photoshop\'s javascript API to find the fonts in a given PSD.</p>\n', '\n', '<p>Given a font name returned by the API, I want to find the actual physical font file that that font name corresponds to on the disc.</p>\n', '\n', "<p>This is all happening in a python program running on OSX so I guess I'm looking for one of:</p>\n", '\n', '<ul>\n', '<li>Some Photoshop javascript</li>\n', '<li>A Python function</li>\n']
type(questions)
list
Now let's lazily read the lines from the file.
def lazy_read(filepath, encoding="utf-8"):
file_handle = open(filepath, "r", encoding=encoding)
while True:
yield file_handle.readline()
file_handle.close()
g = lazy_read(filepath, encoding="latin-1")
g.__next__()
'Id,OwnerUserId,CreationDate,Score,Title,Body\n'
g.__next__()
'469,147,2008-08-02T15:11:16Z,21,How can I find the full path to a font from its display name on a Mac?,"<p>I am using the Photoshop\'s javascript API to find the fonts in a given PSD.</p>\n'
g.__next__()
'\n'
count = 12
c = 0
for line in g:
if c < count:
print(line)
c += 1
<p>This is all happening in a python program running on OSX so I guess I'm looking for one of:</p> <ul> <li>Some Photoshop javascript</li> <li>A Python function</li> <li>An OSX API that I can call from python</li> </ul> " 502,147,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,"<p>I have a cross-platform (Python) application which needs to generate a JPEG preview of the first page of a PDF.</p> <p>On the Mac I am spawning <a href=""http://developer.apple.com/documentation/Darwin/Reference/ManPages/man1/sips.1.html"">sips</a>. Is there something similarly simple I can do on Windows?</p>
--------------------------------------------------------------------------- KeyboardInterrupt Traceback (most recent call last) <ipython-input-118-3631c47cbbba> in <module>() 1 count = 12 2 c = 0 ----> 3 for line in g: 4 if c < count: 5 print(line) <ipython-input-112-158a12bc81d4> in lazy_read(filepath, encoding) 2 file_handle = open(filepath, "r", encoding=encoding) 3 while True: ----> 4 yield file_handle.readline() 5 file_handle.close() 6 KeyboardInterrupt:
This is part of what Python does for you when you open a file using with open()
.
Just like list comprehensions, we can create generator comprehensions.
Generator comprehensions look exactly like list comprehensions except they use round brackets ().
evens = (x for x in range(100) if x % 2 == 0)
evens.__next__()
0
next(evens)
2
next(evens)
4
Is equivalent to:
def evens():
x = 0
while x < 100:
if x % 2 == 0:
yield x
x += 1
e = evens()
Generators can create an infinite amount of information while requiring almost zero memory!
Example: fibonacci numbers
$F_n = F_{n-1} + F_{n-2}$ where $F_0 = 0$ and $F_1 = 1$
e.g. $0, 1, 1, 2, 3, 5, 8, 13, 21, ...$
def fib():
a = 0
b = 1
while True:
yield a
a, b = b, a + b
f = fib()
for i in range(10):
print(next(f))
0 1 1 2 3 5 8 13 21 34
So now whenever we want to pull out a fibonacci number, we can just call our generator again.
Another cool thing we can do with iterables (such as generators and lists) is pass them as arguments to functions.
When we pass an iterable to a function with a * before its name, Python unpacks the items as positional arguments.
def polynomial(a, b, c):
return ((a*x**2) + (b * x) + c for x in range(50))
coefficients = [2, 2, 3]
poly_gen = polynomial(*coefficients)
This is the same as
poly_gen = polynomial(1, 2, 3)
arg_gen = (x for x in range(3))
polynomial(*arg_gen)
<generator object polynomial.<locals>.<genexpr> at 0x110850e60>
for p in poly_gen:
print(p)
3 6 11 18 27 38 51 66 83 102 123 146 171 198 227 258 291 326 363 402 443 486 531 578 627 678 731 786 843 902 963 1026 1091 1158 1227 1298 1371 1446 1523 1602 1683 1766 1851 1938 2027 2118 2211 2306 2403 2502
If a function takes keyword arguments, we can also unpack them using the ** operator.
def student(grade=4.0, name='bob'):
print(f"{name}'s gpa is {grade}")
s = {'name': 'carlos', 'grade':3.5}
student(s['name'], s['grade'])
3.5's gpa is carlos
student(**s)
carlos's gpa is 3.5
In the spirit of Christmas, let's talk about wrapping functions and decorators.
First let's do a little recap on functions.
Functions are objects
Just like objects functions can be:
def foo(x):
return x
f = foo
b = foo
f(5)
b(10)
10
def caller(func, arg):
return f"calling {func.__name__} with input {arg} result is: {func(arg)}"
def is_even(num):
return not num % 2
def is_odd(num):
return bool(num % 2)
print(caller(is_even, 5))
print(caller(is_odd, 5))
calling is_even with input 5 result is: False calling is_odd with input 5 result is: True
We can also define functions inside other functions and return them.
def foo():
x = 5
def boo():
return 5
return boo
five = foo()
print(type(five))
five()
<class 'function'>
5
What if I want to time how long a function takes to run?
import time
import random
#this function runs for 0 to 5 seconds
def foo():
print("doing some stuff..")
time.sleep(random.randrange(5))
start = time.time() #get current time
foo()
print(f"time elapsed: {time.time() - start} seconds")
doing some stuff.. time elapsed: 0.004309892654418945 seconds
That's nice but what happens next time I want to time a different function?
I have to write the same code again... repetitive code is no good.
def boo():
print("doing some other stuff")
time.sleep(random.randrange(3))
start = time.time()
boo()
print(f"time elapsed: {time.time() - start} seconds")
doing some other stuff time elapsed: 2.004883050918579 seconds
What if I want to make it so that I can automatically modify any function with timing functionality?
This is where decorators come in.
You can think of decorators as functions that create functions but with some useful "decorations".
Since we know that we can return functions, why don't we write a function that takes a function as input and returns a decorated version of it?
def timer(func):
def wrapped():
start = time.time()
func()
print(f"{func.__name__} took: {time.time() - start} seconds")
#return the decorated function
return wrapped
def stuff():
print("doing some stuff..")
time.sleep(random.randrange(3))
Let's transform our boring stuff
function into a decorated version of itself.
timed_stuff = timer(stuff)
type(timed_stuff)
function
Now we can use the improved version of our function.
timed_stuff()
doing some stuff.. stuff took: 2.0035247802734375 seconds
And we can use the decorator to transform any function (that takes no arguments) without having to re-write any code.
def other_stuff():
print("doing some other stuff..")
time.sleep(random.randrange(3))
timed_other_stuff = timer(other_stuff)
timed_other_stuff()
doing some other stuff.. other_stuff took: 0.0008678436279296875 seconds
Python makes using decorators a little easier with the @
operator.
Instead of creating a new function explicitly, we can just put @decoratorname
before any function we want to decorate.
@timer
def fibonaccis():
f = fib()
max_fib = 10
for i, num in enumerate(f):
print(num)
if i > max_fib:
break
i += 1
fibonaccis()
0 1 1 2 3 5 8 13 21 34 55 89 fibonaccis took: 0.0009970664978027344 seconds
Another good use of decorators is argument type checking. (source: https://www.python-course.eu/python3_decorators.php)
Here is one for debugging.
def debug(func):
def helper(*args):
print(f"calling {func.__name__} on arguments {args}")
func(*args)
return helper
@debug
def multiply(a, b):
return a * b
@debug
def add(*args):
return sum(args)
multiply(4, 5)
add(1, 2, 3, 4)
calling multiply on arguments (4, 5) calling add on arguments (1, 2, 3, 4)