Note: Click on "Kernel" > "Restart Kernel and Clear All Outputs" in JupyterLab before reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it in the cloud.
In this chapter, we continue the study of sequential data by looking at memory-efficient ways to process the elements in a sequence. That is an important topic for the data science practitioner who must be able to work with data that does not fit into a single computer's memory.
As shown in Chapter 4, both the list objects [0, 1, 2, 3, 4] and [1, 3, 5, 7, 9] on the one side and the range objects range(5) and range(1, 10, 2) on the other side allow us to loop over the same numbers. However, the latter two only create one int object in every iteration, while the former two create all int objects before the loop even starts. In this sense, we may consider range objects to be "rules" in memory that know how to calculate the numbers without actually calculating them upfront.
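As a quick sanity check, we could compare object sizes with the sys.getsizeof() function from the standard library: on CPython, a range object stays tiny no matter how many numbers it describes, while a materialized list grows with the number of elements. The exact byte counts are only indicative.
import sys
print(sys.getsizeof(range(1_000_000)))        # a few dozen bytes, independent of the length
print(sys.getsizeof(list(range(1_000_000))))  # several megabytes, growing with the length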
In Chapter 7, we see how the built-in list() constructor materializes the range(1, 13) object into the list object [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. In other words, we make range(1, 13) calculate all numbers at once and store them in a list object for further processing.
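For reference, that materialization is a one-liner, and the resulting list object is precisely the one shown above.
list(range(1, 13))
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]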
In many cases, however, it is not necessary to do that, and, in this chapter, we look at other types of "rules" in memory and how we can compose them to implement bigger computations.
Next, we take a step back and continue with a simple example involving the familiar numbers list. Then, we iteratively exchange the list objects with "rule"-like objects without changing the overall computation at all. As computations involving sequential data are commonly classified into the three categories map, filter, and reduce, we structure the numbers example accordingly.
numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]
Mapping refers to the idea of applying a transformation to every element in a sequence.
For example, let's square each element in numbers and add 1 to the squares. In essence, we apply the transformation y := x² + 1, as expressed with the transform() function below.
def transform(element):
"""Map elements to their squares plus 1."""
return (element ** 2) + 1
With the syntax we know so far, we revert to a for-loop that iteratively appends the transformed elements to an initially empty transformed_numbers list.
transformed_numbers = []
for old in numbers:
new = transform(old)
transformed_numbers.append(new)
transformed_numbers
[50, 122, 65, 26, 10, 145, 5, 37, 82, 101, 2, 17]
As this kind of data processing is so common, Python provides the map() built-in. In its simplest form, it takes two arguments: a transformation function that takes exactly one positional argument and an iterable that provides the objects to be mapped.
We call map() with a reference to the transform() function and the numbers list as the arguments and store the result in the variable transformer to inspect it.
transformer = map(transform, numbers)
We might expect to get back a materialized sequence (i.e., all elements exist in memory), and a list object would feel the most natural because of the type of the numbers argument. However, transformer is an object of type map.
transformer
<map at 0x7f778422fc10>
type(transformer)
map
Like range objects, map objects generate a series of objects "on the fly" (i.e., one by one), and we use the built-in next() function to obtain the next object in line. So, we should think of a map object as a "rule" stored in memory that only knows how to calculate the next object of possibly infinitely many.
next(transformer)
50
next(transformer)
122
next(transformer)
65
It is essential to understand that by creating a map object with the map() built-in, nothing happens in memory except the creation of the map object. In particular, no second list object derived from numbers is created. Also, we may view range objects as a special case of map objects: They are constrained to generating int objects only, and the iterable argument is replaced with start, stop, and step arguments.
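To make that analogy concrete, one way (of several) to rebuild the numbers of range(1, 10, 2) is a map object whose "rule" applies the start and step explicitly; the range(5) below merely acts as a counter for the five elements.
list(map(lambda i: 1 + 2 * i, range(5)))  # start 1, step 2, five elements
[1, 3, 5, 7, 9]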
If we are sure that a map object generates a finite number of elements, we may materialize them into a list object with the built-in list() constructor. Below, we "pull out" the remaining int objects from transformer, which itself is derived from a finite list object.
list(transformer)
[26, 10, 145, 5, 37, 82, 101, 2, 17]
In summary, instead of creating an empty list first and appending to it in a for-loop as above, we write the following one-liner and obtain an equal transformed_numbers list.
transformed_numbers = list(map(transform, numbers))
transformed_numbers
[50, 122, 65, 26, 10, 145, 5, 37, 82, 101, 2, 17]
Filtering refers to the idea of creating a subset of a sequence with a boolean filter function that indicates if an element should be kept (i.e., True) or not (i.e., False).
In the example, let's only keep the even elements in numbers. The is_even() function implements that as a filter.
def is_even(element):
"""Filter out odd numbers."""
if element % 2 == 0:
return True
return False
As element % 2 == 0 is already a boolean expression, we could shorten is_even() like so.
def is_even(element):
"""Filter out odd numbers."""
return element % 2 == 0
As before, we first use a for-loop that iteratively appends the elements to be kept to an initially empty even_numbers list.
even_numbers = []
for number in transformed_numbers:
if is_even(number):
even_numbers.append(number)
even_numbers
[50, 122, 26, 10, 82, 2]
Analogously to the map object above, we use the filter() built-in to create an object of type filter and assign it to evens.
evens = filter(is_even, transformed_numbers)
evens
<filter at 0x7f778422fd30>
type(evens)
filter
evens works like transformer above: With the built-in next() function, we obtain the even numbers one by one. So, the "next" element in line is simply the next even int object the filter object encounters.
transformed_numbers
[50, 122, 65, 26, 10, 145, 5, 37, 82, 101, 2, 17]
next(evens)
50
next(evens)
122
next(evens)
26
As above, we could create a materialized list object with the list() constructor.
list(filter(is_even, transformed_numbers))
[50, 122, 26, 10, 82, 2]
We may also chain map and filter objects derived from the original numbers list. As the entire cell is one big expression consisting of nested function calls, we read it from the inside out.
list(
filter(
is_even,
map(transform, numbers),
)
)
[50, 122, 26, 10, 82, 2]
Using the map() and filter() built-ins, we can quickly switch the order: filter first and then transform the remaining elements. This variant corresponds to the "A simple Filter" example in Chapter 4. In contrast, code with for-loops and if statements is more tedious to adapt. Additionally, map and filter objects loop "at the C level," which makes them a lot faster. That is why experienced Pythonistas tend not to use explicit for-loops so often.
list(
map(
transform,
filter(is_even, numbers),
)
)
[65, 145, 5, 37, 101, 17]
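As a rough check of the speed claim above, we could time both variants with the timeit module from the standard library. The helper functions with_loop() and with_map_filter() exist only for this comparison, the exact numbers depend on the machine, and with only twelve numbers the difference may be small, so treat this as a sketch rather than a benchmark.
import timeit

def with_loop():
    """Filter and transform numbers with an explicit for-loop."""
    result = []
    for old in numbers:
        if is_even(old):
            result.append(transform(old))
    return result

def with_map_filter():
    """Filter and transform numbers with filter and map objects."""
    return list(map(transform, filter(is_even, numbers)))

print(timeit.timeit(with_loop, number=100_000))
print(timeit.timeit(with_map_filter, number=100_000))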
Lastly, reducing sequential data means summarizing the elements into a single statistic.
A simple example is the built-in sum() function.
sum(
map(
transform,
filter(is_even, numbers),
)
)
370
min(map(transform, filter(is_even, numbers)))
5
max(map(transform, filter(is_even, numbers)))
145
sum(), min(), and max() can be regarded as special cases of reducing.
The generic way of reducing a sequence is to apply a function of two arguments on a rolling horizon: Its first argument is the reduction of the elements processed so far, and the second the next element to be reduced.
For illustration, let's replicate sum() as such a function, called sum_alt(). Its implementation only adds two numbers.
def sum_alt(sum_so_far, next_number):
"""Reduce a sequence by addition."""
return sum_so_far + next_number
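As a side note, the standard library's operator module already provides such two-argument helpers: operator.add() does exactly what sum_alt() does, so we could use it instead of defining our own function.
from operator import add
add(3, 4)
7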
Further, we create a new map object derived from numbers ...
evens_transformed = map(transform, filter(is_even, numbers))
... and loop over all but the first element it generates. The latter is captured separately as the initial result with the next() function. We know from above that evens_transformed generates six elements. That is why we see five growing result values resembling a cumulative sum. The first, 210, is the sum of the first two elements generated by evens_transformed, 65 and 145.
So, we also learn that map objects, and analogously filter objects, are iterable, as we may loop over them.
result = next(evens_transformed)
for number in evens_transformed:
result = sum_alt(result, number)
print(result, end=" ") # line added for didactical purposes
210 215 252 353 370
The final result is the same 370 as above.
result
370
The reduce() function in the functools module in the standard library provides a more convenient (and faster) replacement for the for-loop. It takes two arguments, function and iterable, in the same way as the map() and filter() built-ins.
reduce() is eager, meaning that all computations implied by the contained map and filter "rules" are executed immediately, and the code cell evaluates to 370. In contrast, map() and filter() create lazy map and filter objects, and we have to use the next() function to obtain the elements one by one.
from functools import reduce
reduce(
sum_alt,
map(
transform,
filter(is_even, numbers),
)
)
370
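To see that min() really is such a special case, we may swap the addition for a function that keeps the smaller of two numbers; the result matches the 5 obtained with min() above. As a further note, reduce() also accepts an optional third argument that serves as the initial value of the reduction.
reduce(
    lambda x, y: x if x < y else y,  # keep the smaller of the two numbers
    map(
        transform,
        filter(is_even, numbers),
    )
)
5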
map(), filter(), and reduce() take a function object as their first argument, and we defined transform(), is_even(), and sum_alt() precisely for that purpose. Often, however, such functions are used only once in a program, even though the primary purpose of defining a named function is to reuse it. In such cases, it makes more sense to define them "anonymously" right at the position where the first argument goes.
As mentioned in Chapter 2, we use lambda expressions to create function objects without a name referencing them. So, the above sum_alt() function could be rewritten as a lambda expression like so ...
lambda sum_so_far, next_number: sum_so_far + next_number
<function __main__.<lambda>(sum_so_far, next_number)>
... or even shorter.
lambda x, y: x + y
<function __main__.<lambda>(x, y)>
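Plugged directly into reduce(), the short lambda expression could replace sum_alt() entirely and reproduces the 370 from above.
reduce(
    lambda x, y: x + y,
    map(
        transform,
        filter(is_even, numbers),
    )
)
370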
With the new concepts in this section, we can rewrite the entire example in just a few lines of code without any for, if, or def statements. The resulting code is concise, easy to read, quick to modify, and even faster in execution. Most importantly, it is optimized to handle large amounts of data, as no temporary list objects are materialized in memory.
numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]
evens = filter(lambda x: x % 2 == 0, numbers)
transformed = map(lambda x: (x ** 2) + 1, evens)
sum(transformed)
370
If numbers comes as a sorted sequence of whole numbers, we may use the range() built-in and get away without any materialized list object in memory at all!
numbers = range(1, 13)
evens = filter(lambda x: x % 2 == 0, numbers)
transformed = map(lambda x: (x ** 2) + 1, evens)
sum(transformed)
370
To additionally avoid the temporary variables numbers, evens, and transformed, we could write the entire computation as one expression.
sum(
map(
lambda x: (x ** 2) + 1,
filter(
lambda x: x % 2 == 0,
range(1, 13),
)
)
)
370
PythonTutor visualizes the differences in the number of computational steps and memory usage:
Version 1: for-loops, if statements, and named functions -> 116 steps and 3 list objects
Version 2: map and filter objects with temporary variables -> 58 steps and no list object
Version 3: map and filter objects as one expression -> 55 steps and no list object
Versions 2 and 3 are the same, except for the three additional steps required to create the temporary variables. The major downside of Version 1 is that, in the worst case, it may need three times the memory as compared to the other two versions!
An experienced Pythonista would probably go with Version 2 in a production system to keep the code readable and maintainable.
The map-filter-reduce paradigm has attracted attention in recent years as it enables parallel computing, which becomes important when dealing with large amounts of data. The workings in memory shown in this section provide an idea why.
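For a taste of what that can look like, below is a minimal sketch that distributes the mapping step across several worker processes with the standard library's multiprocessing module. It is meant to be run as a standalone script rather than inside this notebook, and the pool size of four processes is an arbitrary choice.
from multiprocessing import Pool


def transform(x):
    """Map elements to their squares plus 1."""
    return (x ** 2) + 1


if __name__ == "__main__":
    evens = [x for x in range(1, 13) if x % 2 == 0]
    with Pool(processes=4) as pool:
        transformed = pool.map(transform, evens)  # the mapping runs in parallel
    print(sum(transformed))  # 370, as in the sequential versions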