Note: Click on "Kernel" > "Restart Kernel and Clear All Outputs" in JupyterLab before reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it in the cloud.
In this chapter, we continue the study of sequential data by looking at memory-efficient ways to process the elements in a sequence. That is an important topic for the data science practitioner who must be able to work with data that does not fit into a single computer's memory.
As shown in Chapter 4, both the list objects [0, 1, 2, 3, 4] and [1, 3, 5, 7, 9] on the one side and the range objects range(5) and range(1, 10, 2) on the other side allow us to loop over the same numbers. However, the latter two only create one int object in every iteration, while the former two create all int objects before the loop even starts. In this sense, we may consider range objects to be "rules" in memory that know how to calculate the numbers without actually calculating them upfront.
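As a quick sanity check, we could compare object sizes with the sys.getsizeof() function from the standard library: on CPython, a range object stays tiny no matter how many numbers it describes, while a materialized list grows with the number of elements. The exact byte counts are only indicative.
import sys
print(sys.getsizeof(range(1_000_000)))        # a few dozen bytes, independent of the length
print(sys.getsizeof(list(range(1_000_000))))  # several megabytes, growing with the length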
In Chapter 7, we see how the built-in list() constructor materializes the range(1, 13) object into the list object [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. In other words, we make range(1, 13) calculate all numbers at once and store them in a list object for further processing.
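For reference, that materialization is a one-liner, and the resulting list object is precisely the one shown above.
list(range(1, 13))
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]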
In many cases, however, it is not necessary to do that, and, in this chapter, we look at other types of "rules" in memory and how we can compose them to implement bigger computations.
Next, we take a step back and continue with a simple example involving the familiar numbers list. Then, we iteratively exchange the list objects with "rule"-like objects without changing the overall computation at all. As computations involving sequential data are commonly classified into the three categories map, filter, and reduce, we structure the numbers example accordingly.
numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]
Mapping refers to the idea of applying a transformation to every element in a sequence.
For example, let's square each element in numbers and add 1 to the squares. In essence, we apply the transformation y := x² + 1, as expressed with the transform() function below.
def transform(element):
"""Map elements to their squares plus 1."""
return (element ** 2) + 1
With the syntax we know so far, we revert to a for-loop that iteratively appends the transformed elements to an initially empty transformed_numbers list.
transformed_numbers = []
for old in numbers:
new = transform(old)
transformed_numbers.append(new)
transformed_numbers
[50, 122, 65, 26, 10, 145, 5, 37, 82, 101, 2, 17]
As this kind of data processing is so common, Python provides the map() built-in. In its simplest form, it takes two arguments: a transformation function that takes exactly one positional argument and an iterable that provides the objects to be mapped.
We call map() with a reference to the transform() function and the numbers list as the arguments and store the result in the variable transformer to inspect it.
transformer = map(transform, numbers)
We might expect to get back a materialized sequence (i.e., all elements exist in memory), and a list object would feel the most natural because of the type of the numbers argument. However, transformer is an object of type map.
transformer
<map at 0x7f778422fc10>
type(transformer)
map
Like range objects, map objects generate a series of objects "on the fly" (i.e., one by one), and we use the built-in next() function to obtain the next object in line. So, we should think of a map object as a "rule" stored in memory that only knows how to calculate the next object of possibly infinitely many.
next(transformer)
50
next(transformer)
122
next(transformer)
65
It is essential to understand that by creating a map object with the map() built-in, nothing happens in memory except the creation of the map object. In particular, no second list object derived from numbers is created. Also, we may view range objects as a special case of map objects: They are constrained to generating int objects only, and the iterable argument is replaced with start, stop, and step arguments.
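To make that analogy concrete, one way (of several) to rebuild the numbers of range(1, 10, 2) is a map object whose "rule" applies the start and step explicitly; the range(5) below merely acts as a counter for the five elements.
list(map(lambda i: 1 + 2 * i, range(5)))  # start 1, step 2, five elements
[1, 3, 5, 7, 9]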
If we are sure that a map object generates a finite number of elements, we may materialize them into a list object with the built-in list() constructor. Below, we "pull out" the remaining int objects from transformer, which itself is derived from a finite list object.
list(transformer)
[26, 10, 145, 5, 37, 82, 101, 2, 17]
In summary, instead of creating an empty list first and appending to it in a for-loop as above, we write the following one-liner and obtain an equal transformed_numbers list.
transformed_numbers = list(map(transform, numbers))
transformed_numbers
[50, 122, 65, 26, 10, 145, 5, 37, 82, 101, 2, 17]
Filtering refers to the idea of creating a subset of a sequence with a boolean filter function that indicates if an element should be kept (i.e., True) or not (i.e., False).
In the example, let's only keep the even elements in numbers. The is_even() function implements that as a filter.
def is_even(element):
"""Filter out odd numbers."""
if element % 2 == 0:
return True
return False
As element % 2 == 0 is already a boolean expression, we could shorten is_even() like so.
def is_even(element):
"""Filter out odd numbers."""
return element % 2 == 0
As before, we first use a for-loop that iteratively appends the elements to be kept to an initially empty even_numbers list.
even_numbers = []
for number in transformed_numbers:
if is_even(number):
even_numbers.append(number)
even_numbers
[50, 122, 26, 10, 82, 2]
Analogously to the map object above, we use the filter() built-in to create an object of type filter and assign it to evens.
evens = filter(is_even, transformed_numbers)
evens
<filter at 0x7f778422fd30>
type(evens)
filter
evens works like transformer above: With the built-in next() function, we obtain the even numbers one by one. So, the "next" element in line is simply the next even int object the filter object encounters.
transformed_numbers
[50, 122, 65, 26, 10, 145, 5, 37, 82, 101, 2, 17]
next(evens)
50
next(evens)
122
next(evens)
26
As above, we could create a materialized list object with the list() constructor.
list(filter(is_even, transformed_numbers))
[50, 122, 26, 10, 82, 2]
We may also chain map and filter objects derived from the original numbers list. As the entire cell is one big expression consisting of nested function calls, we read it from the inside out.
list(
filter(
is_even,
map(transform, numbers),
)
)
[50, 122, 26, 10, 82, 2]
Using the map() and filter() built-ins, we can quickly switch the order: filter first and then transform the remaining elements. This variant corresponds to the "A simple Filter" example in Chapter 4. In contrast, code with for-loops and if statements is more tedious to adapt. Additionally, map and filter objects loop "at the C level," which makes them a lot faster. That is why experienced Pythonistas tend not to use explicit for-loops so often.
list(
map(
transform,
filter(is_even, numbers),
)
)
[65, 145, 5, 37, 101, 17]
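As a rough check of the speed claim above, we could time both variants with the timeit module from the standard library. The helper functions with_loop() and with_map_filter() exist only for this comparison, the exact numbers depend on the machine, and with only twelve numbers the difference may be small, so treat this as a sketch rather than a benchmark.
import timeit

def with_loop():
    """Filter and transform numbers with an explicit for-loop."""
    result = []
    for old in numbers:
        if is_even(old):
            result.append(transform(old))
    return result

def with_map_filter():
    """Filter and transform numbers with filter and map objects."""
    return list(map(transform, filter(is_even, numbers)))

print(timeit.timeit(with_loop, number=100_000))
print(timeit.timeit(with_map_filter, number=100_000))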
Lastly, reducing sequential data means summarizing the elements into a single statistic.
A simple example is the built-in sum() function.
sum(
map(
transform,
filter(is_even, numbers),
)
)
370
min(map(transform, filter(is_even, numbers)))
5
max(map(transform, filter(is_even, numbers)))
145
sum(), min(), and max() can be regarded as special cases of reducing.
The generic way of reducing a sequence is to apply a function of two arguments on a rolling horizon: Its first argument is the reduction of the elements processed so far, and the second the next element to be reduced.
For illustration, let's replicate sum() as such a function, called sum_alt(). Its implementation only adds two numbers.
def sum_alt(sum_so_far, next_number):
"""Reduce a sequence by addition."""
return sum_so_far + next_number
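As a side note, the standard library's operator module already provides such two-argument helpers: operator.add() does exactly what sum_alt() does, so we could use it instead of defining our own function.
from operator import add
add(3, 4)
7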
Further, we create a new map object derived from numbers ...
evens_transformed = map(transform, filter(is_even, numbers))
... and loop over all but the first element it generates. The latter is captured separately as the initial result with the next() function. We know from above that evens_transformed generates six elements. That is why we see five growing result values resembling a cumulative sum. The first, 210, is the sum of the first two elements generated by evens_transformed, 65 and 145.
So, we also learn that map objects, and analogously filter objects, are iterable, as we may loop over them.
result = next(evens_transformed)
for number in evens_transformed:
result = sum_alt(result, number)
print(result, end=" ") # line added for didactical purposes
210 215 252 353 370
The final result is the same 370 as above.
result
370
The reduce() function in the functools module in the standard library provides a more convenient (and faster) replacement for the for-loop. It takes two arguments, function and iterable, in the same way as the map() and filter() built-ins.
reduce() is eager, meaning that all computations implied by the contained map and filter "rules" are executed immediately, and the code cell evaluates to 370. In contrast, map() and filter() create lazy map and filter objects, and we have to use the next() function to obtain the elements one by one.
from functools import reduce
reduce(
sum_alt,
map(
transform,
filter(is_even, numbers),
)
)
370
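To see that min() really is such a special case, we may swap the addition for a function that keeps the smaller of two numbers; the result matches the 5 obtained with min() above. As a further note, reduce() also accepts an optional third argument that serves as the initial value of the reduction.
reduce(
    lambda x, y: x if x < y else y,  # keep the smaller of the two numbers
    map(
        transform,
        filter(is_even, numbers),
    )
)
5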
map(), filter(), and reduce() take a function object as their first argument, and we defined transform(), is_even(), and sum_alt() precisely for that purpose. Often, however, such functions are used only once in a program, even though the primary purpose of defining a named function is to reuse it. In such cases, it makes more sense to define them "anonymously" right at the position where the first argument goes.
As mentioned in Chapter 2, we use lambda expressions to create function objects without a name referencing them. So, the above sum_alt() function could be rewritten as a lambda expression like so ...
lambda sum_so_far, next_number: sum_so_far + next_number
<function __main__.<lambda>(sum_so_far, next_number)>
... or even shorter.
lambda x, y: x + y
<function __main__.<lambda>(x, y)>
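Plugged directly into reduce(), the short lambda expression could replace sum_alt() entirely and reproduces the 370 from above.
reduce(
    lambda x, y: x + y,
    map(
        transform,
        filter(is_even, numbers),
    )
)
370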
With the new concepts in this section, we can rewrite the entire example in just a few lines of code without any for, if, or def statements. The resulting code is concise, easy to read, quick to modify, and even faster in execution. Most importantly, it is optimized to handle large amounts of data, as no temporary list objects are materialized in memory.
numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]
evens = filter(lambda x: x % 2 == 0, numbers)
transformed = map(lambda x: (x ** 2) + 1, evens)
sum(transformed)
370
If numbers comes as a sorted sequence of whole numbers, we may use the range() built-in and get away without any materialized list object in memory at all!
numbers = range(1, 13)
evens = filter(lambda x: x % 2 == 0, numbers)
transformed = map(lambda x: (x ** 2) + 1, evens)
sum(transformed)
370
To additionally avoid the temporary variables numbers, evens, and transformed, we could write the entire computation as one expression.
sum(
map(
lambda x: (x ** 2) + 1,
filter(
lambda x: x % 2 == 0,
range(1, 13),
)
)
)
370
PythonTutor visualizes the differences in the number of computational steps and memory usage:
Version 1: for-loops, if statements, and named functions -> 116 steps and 3 list objects
Version 2: map and filter objects with temporary variables -> 58 steps and no list object
Version 3: map and filter objects as one expression -> 55 steps and no list object
Versions 2 and 3 are the same, except for the three additional steps required to create the temporary variables. The major downside of Version 1 is that, in the worst case, it may need three times the memory as compared to the other two versions!
An experienced Pythonista would probably go with Version 2 in a production system to keep the code readable and maintainable.
The map-filter-reduce paradigm has attracted attention in recent years as it enables parallel computing, which becomes important when dealing with large amounts of data. The workings in memory shown in this section provide an idea why.
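For a taste of what that can look like, below is a minimal sketch that distributes the mapping step across several worker processes with the standard library's multiprocessing module. It is meant to be run as a standalone script rather than inside this notebook, and the pool size of four processes is an arbitrary choice.
from multiprocessing import Pool


def transform(x):
    """Map elements to their squares plus 1."""
    return (x ** 2) + 1


if __name__ == "__main__":
    evens = [x for x in range(1, 13) if x % 2 == 0]
    with Pool(processes=4) as pool:
        transformed = pool.map(transform, evens)  # the mapping runs in parallel
    print(sum(transformed))  # 370, as in the sequential versions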