Note: Click on "Kernel" > "Restart Kernel and Clear All Outputs" in JupyterLab before reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it in the cloud.
After introducing the Map-Filter-Reduce paradigm in the first part of this chapter, we first see how list comprehensions can replace the map() and filter() built-ins in many cases. Then, we learn how generator expressions are like list comprehensions but without materializing their elements in memory. We end this part with a short discussion of the built-in all() and any() functions.
list Comprehensions

Consider again the original "A simple Filter" example from Chapter 4, re-written such that both the mapping and the filtering are done in one for-loop instead of the two separate loops above.
numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]
evens_transformed = []
for number in numbers:
    if number % 2 == 0:
        evens_transformed.append((number ** 2) + 1)
sum(evens_transformed)
370
list comprehensions are expressions that derive new list objects out of existing ones (cf., the reference). Practically, this means that we place the for and if clauses inside brackets, [ and ].

So, the example above can be rewritten as a single expression, replacing the compound for statement.
[(n ** 2) + 1 for n in numbers if n % 2 == 0]
[65, 145, 5, 37, 101, 17]
A list comprehension may always be used in a place where otherwise a list object would work.

For example, let's rewrite the "A simple Filter" example from Chapter 4 in just one line. As a caveat, the code below materializes all elements in memory before summing them up and may, therefore, cause a MemoryError when executed with a bigger numbers list. We see with PythonTutor how a list object exists in memory at step 17 and then "gets lost" right after. As the next section shows, this downside may be mitigated.
sum([(n ** 2) + 1 for n in numbers if n % 2 == 0])
370
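As a small aside beyond the text, the same list comprehension may be passed to any other function that accepts a list object, for example, max():

max([(n ** 2) + 1 for n in numbers if n % 2 == 0])  # 145, the largest transformed even number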
List comprehensions may come with several for's and if's.

The cell below creates a list object that contains three other list objects with a series of numbers in them. The first and last numbers in each inner list object are offset by 1.
nested_numbers = [
    [1, 2, 3, 4, 5, 6, 7],
    [2, 3, 4, 5, 6, 7, 8],
    [3, 4, 5, 6, 7, 8, 9],
]
nested_numbers
[[1, 2, 3, 4, 5, 6, 7], [2, 3, 4, 5, 6, 7, 8], [3, 4, 5, 6, 7, 8, 9]]
To do something meaningful with the numbers, we have to remove the inner layer of list objects and flatten (i.e., "un-nest") the data. Without list comprehensions, we achieve that with two nested for-loops.
flat_numbers = []
for inner_numbers in nested_numbers:
    for number in inner_numbers:
        flat_numbers.append(number)
flat_numbers
[1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8, 3, 4, 5, 6, 7, 8, 9]
That translates into a list comprehension like below. The order of the for's may be confusing at first but is the same as writing out the nested for-loops as above.
[number for inner_numbers in nested_numbers for number in inner_numbers]
[1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8, 3, 4, 5, 6, 7, 8, 9]
sum([number for inner_numbers in nested_numbers for number in inner_numbers])
105
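As an aside not shown in the text, such a nested comprehension may also contain an if clause, for example, to keep only the even numbers while flattening:

[number for inner_numbers in nested_numbers for number in inner_numbers if number % 2 == 0]
# [2, 4, 6, 2, 4, 6, 8, 4, 6, 8]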
In this particular example, however, we can exploit the fact that any sum of numbers can be expressed as the sum of sums of mutually exclusive and collectively exhaustive subsets of these numbers and get away with just one for clause in the list comprehension.
sum([sum(inner_numbers) for inner_numbers in nested_numbers])
105
A popular use case of nested list comprehensions is applying a transformation to each n-tuple in the Cartesian product created from the elements of n iterables. In the generic illustration below, a transformation $f(x, y)$ is applied to each 2-tuple in the Cartesian product $X \times Y$, where $x$ is an element of $X = \{x_1, x_2, x_3\}$ and $y$ is an element of $Y = \{y_1, y_2, y_3\}$.
| Y \ X | $x_1$ | $x_2$ | $x_3$ |
|---|---|---|---|
| $y_1$ | $f(x_1, y_1)$ | $f(x_2, y_1)$ | $f(x_3, y_1)$ |
| $y_2$ | $f(x_1, y_2)$ | $f(x_2, y_2)$ | $f(x_3, y_2)$ |
| $y_3$ | $f(x_1, y_3)$ | $f(x_2, y_3)$ | $f(x_3, y_3)$ |
For example, let's add 1 to each quotient obtained by taking the numerator x from [10, 20, 30] and the denominator y from [40, 50, 60], and then find the overall product. This transformation can be described mathematically as the function $z = f(x, y)$ below.

$z = f(x, y) = 1 + \frac{x}{y}$

Further, the table below visualizes the calculations: The result is the product of the nine entries.
| y \ x | 10 | 20 | 30 |
|---|---|---|---|
| 40 | 1.25 | 1.50 | 1.75 |
| 50 | 1.20 | 1.40 | 1.60 |
| 60 | 1.17 | 1.33 | 1.50 |
To express that in Python, we start by creating two list objects, first and second.
first = [10, 20, 30]
second = [40, 50, 60]
For a Cartesian product, we must loop over all possible 2-tuples where one element is drawn from first and the other from second. We achieve that with two nested for-loops, in which we calculate each quotient and append it to an initially empty cartesian_product list.
cartesian_product = []
for numerator in first:
    for denominator in second:
        quotient = 1 + (numerator / denominator)
        cartesian_product.append(quotient)
cartesian_product
[1.25, 1.2, 1.1666666666666667, 1.5, 1.4, 1.3333333333333333, 1.75, 1.6, 1.5]
Next, we convert the two explicit for-loops into a single list comprehension and use x and y as shorter variable names.
[1 + (x / y) for x in first for y in second]
[1.25, 1.2, 1.1666666666666667, 1.5, 1.4, 1.3333333333333333, 1.75, 1.6, 1.5]
The order of the for's is important: The list comprehension above divides numbers from first by numbers from second, whereas the list comprehension below does the opposite.
[1 + (x / y) for x in second for y in first]
[5.0, 3.0, 2.333333333333333, 6.0, 3.5, 2.666666666666667, 7.0, 4.0, 3.0]
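As an aside beyond the text, the standard library's itertools.product() function creates the same 2-tuples explicitly; some readers may find iterating over the pairs easier to follow than the two for clauses:

import itertools

[1 + (x / y) for x, y in itertools.product(first, second)]
# same result as [1 + (x / y) for x in first for y in second]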
To find the overall product, we unpack the list comprehension into the product() function from the "Function Definitions & Calls" sub-section in Chapter 7.
def product(*args):
"""Multiply all arguments."""
result = args[0]
for arg in args[1:]:
result *= arg
return result
product(*[1 + (x / y) for x in first for y in second])
20.58
Alternatively, the reduce() function from the functools module, combined with a lambda expression, computes the same product.

from functools import reduce
reduce(lambda x, y: x * y, [1 + (x / y) for x in first for y in second])
20.58
While this example is stylized, Cartesian products are hidden in many applications, and it shows how the various language features introduced in this chapter can be seamlessly combined to process sequential data.
generator Expressions

Because of their high memory consumption, Pythonistas avoid materialized list objects, and, thus, also list comprehensions, whenever possible. Instead, they prefer to work with generator expressions. Syntactically, they work like list comprehensions except that parentheses, ( and ), replace the brackets, [ and ].
Let's go back to the original "A simple Filter" example from Chapter 4 one more time, apply the transformation $y := x^2 + 1$ to all even numbers, and sum them up.
numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]
To filter and transform numbers, we wrote a list comprehension above ...
[(n ** 2) + 1 for n in numbers if n % 2 == 0]
[65, 145, 5, 37, 101, 17]
... that now becomes a generator expression.
((n ** 2) + 1 for n in numbers if n % 2 == 0)
<generator object <genexpr> at 0x7fcf703d84a0>
A generator expression evaluates to yet another "rule"-like object in memory that knows how to generate the individual objects in a series one by one. Whereas a list comprehension materializes its elements in memory when it is evaluated, the opposite holds true for generator expressions: No object is created in memory except the "rule" itself. Because of this behavior, we describe generator expressions as lazy and list comprehensions as eager.
To materialize the elements specified by a generator expression, we use the list() constructor as seen above.
list(((n ** 2) + 1 for n in numbers if n % 2 == 0))
[65, 145, 5, 37, 101, 17]
Whenever a generator expression is the only argument in a function call, we may merge the double parentheses, (( and )), into just ( and ).
list((n ** 2) + 1 for n in numbers if n % 2 == 0)
[65, 145, 5, 37, 101, 17]
A common use case is to reduce the elements into a single object instead, for example, by adding them up with sum() as in the original "A simple Filter" example. PythonTutor shows how the code cell below does not create a temporary list object in memory, whereas a list comprehension would do so (cf., step 17).
sum((n ** 2) + 1 for n in numbers if n % 2 == 0)
370
Let's assign the object to which the generator expression below evaluates to a variable gen and inspect it.
gen = ((n ** 2) + 1 for n in numbers if n % 2 == 0)
gen
<generator object <genexpr> at 0x7fcf703d8c10>
Unsurprisingly, generator expressions create objects of type generator.
type(gen)
generator
generator objects work just like the map and filter objects in the first part of this chapter. So, with the next() function, we can generate the elements one by one.
next(gen)
65
next(gen)
145
next(gen)
5
next(gen)
37
next(gen)
101
next(gen)
17
Once a generator object runs out of elements, it raises a StopIteration exception, and we say that the generator object is exhausted. To loop over its elements again, we must create a new one.
next(gen)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
Cell In[36], line 1
----> 1 next(gen)

StopIteration:
next(gen)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
Cell In[37], line 1
----> 1 next(gen)

StopIteration:
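As an aside not covered in the text, next() also accepts an optional second argument that is returned instead of raising StopIteration once the generator object is exhausted:

next(gen, None)  # returns None because gen is exhausted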
Calling the next() function repeatedly with the same generator object as the argument is essentially what a for-loop automates for us. So, generator objects are iterable. We look into this in detail further below in the "The for Statement (revisited)" section.
for number in ((n ** 2) + 1 for n in numbers if n % 2 == 0):
    print(number, end=" ")
65 145 5 37 101 17
If we are only interested in reducing nested_numbers into a single statistic, such as the overall sum in the "Nested Lists" example above, we should replace list objects and list comprehensions with generator expressions wherever possible. The result is the same, but no intermediate list objects are materialized! That makes our code scale to large amounts of data.

Let's adapt the nested_numbers example from above.
nested_numbers
[[1, 2, 3, 4, 5, 6, 7], [2, 3, 4, 5, 6, 7, 8], [3, 4, 5, 6, 7, 8, 9]]
Compared to the implementation with the list comprehension above, we simply remove the brackets, [ and ]: The argument to sum() becomes a generator expression.
sum(number for inner_numbers in nested_numbers for number in inner_numbers)
105
That also holds for the alternative formulation as a sum of sums.
sum(sum(inner_numbers) for inner_numbers in nested_numbers)
105
Because nested_numbers has an internal structure (i.e., the inner list objects are series of consecutive int objects), we can even provide an effectively memoryless implementation by expressing it as a generator expression derived from range objects. PythonTutor confirms that no list objects materialize at any point in time.
nested_numbers = ((range(x, y + 1)) for x, y in zip(range(1, 4), range(7, 10)))
nested_numbers
<generator object <genexpr> at 0x7fcf70347970>
sum(number for inner_numbers in nested_numbers for number in inner_numbers)
105
We must be careful when assigning a generator object to a variable: If we use nested_numbers again, for example, in the alternative formulation below, sum() returns 0 because nested_numbers is exhausted after executing the previous code cell. PythonTutor also shows that.
sum(sum(inner_numbers) for inner_numbers in nested_numbers)
0
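A common way around the exhaustion issue, sketched here as an aside and not part of the text, is to wrap the generator expression in a small helper function so that a fresh generator object can be created on demand; the name make_nested_numbers() is hypothetical:

def make_nested_numbers():
    """Return a fresh generator of the inner range objects."""
    return (range(x, y + 1) for x, y in zip(range(1, 4), range(7, 10)))

sum(sum(inner_numbers) for inner_numbers in make_nested_numbers())  # 105
sum(sum(inner_numbers) for inner_numbers in make_nested_numbers())  # 105 again, no exhaustion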
Let's also rewrite the "Working with Cartesian Products" example from above with generator expressions.
As a first optimization, we replace the materialized list objects, first and second, with memoryless range objects.
first = range(10, 31, 10) # "==" [10, 20, 30]
second = range(40, 61, 10) # "==" [40, 50, 60]
Of the two alternative solutions that follow, the first may be more appealing to many readers: In general, many practitioners seem to dislike lambda expressions.
In the first solution, we unpack the elements produced by (1 + (x / y) for x in first for y in second) into the product() function from the "Function Definitions & Calls" sub-section in Chapter 7. However, inside product(), the elements are packed into args, a materialized tuple object! So, all the memory efficiency gained by using a generator expression is lost! PythonTutor shows how a tuple object exists in steps 38-58.
product(*(1 + (x / y) for x in first for y in second)) # not memory efficient!
20.58
By contrast, the second solution, with the reduce() function from the functools module and the lambda expression, works without ever materializing all elements at the same time, and PythonTutor confirms that. So, only the second alternative is truly memory-efficient!
reduce(lambda x, y: x * y, (1 + (x / y) for x in first for y in second))
20.58
In summary, we learn from this example that unpacking generator objects may be a bad idea.
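As an aside beyond the text, Python 3.8 added the math.prod() function, which multiplies the elements of any iterable one by one; it accepts the generator expression directly and does not materialize the elements either:

from math import prod

prod(1 + (x / y) for x in first for y in second)  # roughly 20.58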
With the new concepts in this chapter, let's rewrite the book's introductory "Averaging all even Numbers in a List" example from Chapter 1 such that it efficiently handles a large sequence of numbers. We continue from its latest implementation, the average_evens() function in the "Keyword-only Arguments" section in Chapter 2.

We assume that average_evens() is called with a finite and iterable object that generates a stream of numeric objects that can be rounded to int objects. After all, the idea of even and odd numbers makes sense only in the context of whole numbers.
The generator expression (round(n) for n in numbers) implements the rounding, and, when it is evaluated during a function call, nothing happens except that a generator object is assigned to integers. Then, with the reduce() function from the functools module, we simultaneously add up and count the even numbers with the generator object to which the inner generator expression ((n, 1) for n in integers if n % 2 == 0) evaluates. That generator object takes the integers generator as its source and produces tuple objects consisting of the next even number in line and 1. Two such tuple objects are then iteratively passed to the function object to which the lambda expression evaluates. x represents the total and the count of the even numbers processed so far, while y's first element, y[0], is the next even number to be added to the running total. The result of calling reduce() is again a tuple object, namely the final total and count. Lastly, we calculate the simple average and scale it.
In summary, this implementation of average_evens() does not keep materialized list objects internally like its predecessors in Chapter 2 but processes the elements of the numbers argument on a one-by-one basis.
def average_evens(numbers, *, scalar=1):
"""Calculate the average of all even numbers.
Args:
numbers (iterable): a finite stream of the numbers to be averaged;
if non-whole numbers are provided, they are rounded
scalar (float, optional): multiplies the average; defaults to 1
Returns:
float: (scaled) average
"""
integers = (round(n) for n in numbers)
total, count = reduce(
lambda x, y: (x[0] + y[0], x[1] + y[1]),
((n, 1) for n in integers if n % 2 == 0)
)
return scalar * total / count
average_evens([7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4])
7.0
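To make the reduction step more concrete, here is a small hand-run illustration, not part of the original text, with the six (number, 1) tuples that the inner generator expression produces for this argument:

from functools import reduce

pairs = [(8, 1), (12, 1), (2, 1), (6, 1), (10, 1), (4, 1)]  # the even numbers, each paired with a 1
reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]), pairs)  # (42, 6), and 42 / 6 == 7.0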
We may provide an optional scalar argument as before.
average_evens([7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4], scalar=2)
14.0
An argument with float objects works just as well.
average_evens([7., 11., 8., 5., 3., 12., 2., 6., 9., 10., 1., 4.])
7.0
To show that average_evens() can process a large stream of data, we simulate 10_000_000 randomly drawn integers between 0 and 100 with the randint() function from the random module. We use a generator expression derived from a range object as the numbers argument. So, at no point in time is there a materialized list object in memory. The result's being close to 50 is consistent with randint() drawing from a uniform distribution.
import random
random.seed(42)
average_evens(random.randint(0, 100) for _ in range(10_000_000))
49.994081434519636
To show that average_evens() filters out odd numbers, we simulate another stream of 10_000_000 randomly drawn odd integers between 1 and 99. As no function in the random module does that "out of the box," we must be creative: Doubling a number drawn from random.randint(0, 49) results in an even number between 0 and 98, and adding 1 makes it odd. Then, average_evens() raises a TypeError, essentially because the inner generator expression ((n, 1) for n in integers if n % 2 == 0) does not generate any element, and reduce() cannot process an empty iterable without an initial value.
average_evens(2 * random.randint(0, 49) + 1 for _ in range(10_000_000))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[56], line 1
----> 1 average_evens(2 * random.randint(0, 49) + 1 for _ in range(10_000_000))

Cell In[49], line 13, in average_evens(numbers, scalar)
      2 """Calculate the average of all even numbers.
      3
      4 Args:
   (...)
     10     float: (scaled) average
     11 """
     12 integers = (round(n) for n in numbers)
---> 13 total, count = reduce(
     14     lambda x, y: (x[0] + y[0], x[1] + y[1]),
     15     ((n, 1) for n in integers if n % 2 == 0)
     16 )
     17 return scalar * total / count

TypeError: reduce() of empty iterable with no initial value
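One possible way to fail more gracefully, sketched here as an aside and not part of the book's code, is to pass the initial value (0, 0) to reduce() and raise a descriptive error when no even numbers arrive; the name average_evens_defensive() is hypothetical:

from functools import reduce

def average_evens_defensive(numbers, *, scalar=1):
    """Like average_evens(), but complains explicitly if there are no even numbers."""
    integers = (round(n) for n in numbers)
    total, count = reduce(
        lambda x, y: (x[0] + y[0], x[1] + y[1]),
        ((n, 1) for n in integers if n % 2 == 0),
        (0, 0),  # the initial value avoids the TypeError on an empty stream
    )
    if count == 0:
        raise ValueError("numbers must contain at least one even number")
    return scalar * total / count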
tuple Comprehensions

There is no comprehension syntax to derive new tuple objects out of existing ones. However, we can mimic such a construct by combining the built-in tuple() constructor with a generator expression.

So, to convert the list comprehension [(n ** 2) + 1 for n in numbers if n % 2 == 0] from above into a "tuple comprehension," we write the following.
tuple((n ** 2) + 1 for n in numbers if n % 2 == 0)
(65, 145, 5, 37, 101, 17)
The all() and any() Built-ins

The built-in all() function checks whether every element of an iterable is truthy. To continue with the numbers example, let's check if the squares of all elements are below 100 or 150, respectively.

numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]
all(x ** 2 < 100 for x in numbers)
False
all(x ** 2 < 150 for x in numbers)
True
all() can be viewed as syntactic sugar replacing a for-loop: Internally, all() implements the short-circuiting strategy explained in Chapter 3, and we mimic that below by testing for the opposite condition in the if statement and stopping the for-loop early with the break statement. In the worst case, if threshold were, for example, 150, we would loop over all elements in the iterable argument, which must be finite for the code to work. So, all() is a linear search in disguise.
threshold = 100
for number in numbers:
    if number ** 2 >= threshold:  # the opposite of what we are checking for
        all_below_threshold = False
        break
else:
    all_below_threshold = True
all_below_threshold
False
The documentation of all() shows another way to express what it does in code: By placing a return statement inside a for-loop's body within a function, the iteration is stopped prematurely once an element does not meet the condition. That is the familiar early exit pattern at work.
def all_alt(iterable):
"""Alternative implementation of the built-in all() function."""
for element in iterable:
if not element: # the opposite of what we are checking for
return False
return True
all_alt(x ** 2 < 100 for x in numbers)
False
all_alt(x ** 2 < 150 for x in numbers)
True
Similarly, any() checks if at least one element in the iterable argument is truthy.

To continue the example, let's check if the square of any element in numbers is above 100 or 150, respectively.
any(x ** 2 > 100 for x in numbers)
True
any(x ** 2 > 150 for x in numbers)
False
The code cell below shows how any() works internally: It also follows the short-circuiting strategy. Here, we do not need to check for the opposite condition.
threshold = 100
for number in numbers:
    if number ** 2 > threshold:
        any_above_threshold = True
        break
else:
    any_above_threshold = False
any_above_threshold
True
The alternative formulation in the documentation of any() is straightforward and also uses the early exit pattern.
def any_alt(iterable):
"""Alternative implementation of the built-in any() function."""
for element in iterable:
if element:
return True
return False
any_alt(x ** 2 > 100 for x in numbers)
True
any_alt(x ** 2 > 150 for x in numbers)
False
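As a closing aside beyond the text, all() and any() mirror each other via negation: Asserting that all squares stay below a threshold is the same as asserting that no square reaches it.

not any(x ** 2 >= 100 for x in numbers)  # False, same as all(x ** 2 < 100 for x in numbers)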