Note: Click on "Kernel" > "Restart Kernel and Clear All Outputs" in JupyterLab before reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it in the cloud .

Chapter 8: Map, Filter, & Reduce¶

Similarly to how we classify different concrete data types like list or str by how they behave abstractly in a given context in Chapter 7 , we also do so for the data types we have introduced in this chapter.

Iterators vs. Iterables¶

Here, the map, filter, and generator types all behave like "rules" in memory that govern how objects are produced "on the fly." Their main commonality is their support for the built-in next() function. In computer science terminology, such data types are called iterators , and the collections.abc module formalizes them with the Iterator ABC in Python.

So, one example of an iterator is evens_transformed below, an object of type generator.

In [1]:

numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]

In [2]:

evens_transformed = ((x ** 2) + 1 for x in numbers if x % 2 == 0)

Let's first confirm that evens_transformed is indeed an Iterator, "abstractly speaking."

In [3]:

import collections.abc as abc

In [4]:

isinstance(evens_transformed, abc.Iterator)

Out[4]:

True

In Python, iterators are always also iterables. The reverse is not true! To be precise, iterators are specializations of iterables. That is what the "Inherits from" column means in the collections.abc module's documentation.

In [5]:

isinstance(evens_transformed, abc.Iterable)

Out[5]:

True

Furthermore, we sharpen our definition of an iterable from Chapter 7 : Just as we define an iterator to be any object that supports the next() function, we define an iterable to be any object that supports the built-in iter() function.

The confused reader may now be wondering how the two concepts relate to each other.

In short, the iter() function is the general way to create an iterator object out of a given iterable object. Then, the iterator object manages the iteration over the iterable object. In real-world code, we hardly ever see iter() as Python calls it for us in the background.

For illustration, let's do that ourselves and create two iterators out of the iterable numbers and see what we can do with them.

In [6]:

iterator1 = iter(numbers)

In [7]:

iterator2 = iter(numbers)

iterator1 and iterator2 are of type list_iterator.

In [8]:

type(iterator1)

Out[8]:

list_iterator

Iterators are useful for only one thing: Get the next object from the associated iterable.

By calling next() three times with iterator1 as the argument, we obtain the first three elements of numbers.

In [9]:

next(iterator1), next(iterator1), next(iterator1)

Out[9]:

(7, 11, 8)

iterator1 and iterator2 keep their states separate. So, we could loop over the same iterable several times in parallel.

In [10]:

next(iterator1), next(iterator2)

Out[10]:

(5, 7)

We can also play a "trick" and exchange some elements in numbers. iterator1 and iterator2 do not see these changes and present us with the new elements. So, iterators not only have state on their own but also keep this separate from the underlying iterable.

In [11]:

numbers[1], numbers[4] = 99, 99

In [12]:

next(iterator1), next(iterator2)

Out[12]:

(99, 99)

Let's re-assign the elements in numbers so that they are in order. After that, the numbers returned from next() also tell us how often next() was called with iterator1 or iterator2. We conclude that list_iterator objects work by simply keeping track of the last index obtained from the underlying iterable.

In [13]:

numbers[:] = list(range(1, 13))

In [14]:

numbers

Out[14]:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

In [15]:

next(iterator1), next(iterator2)

Out[15]:

(6, 3)

With the concepts introduced in this section, we can now understand the first sentence in the documentation on the zip() built-in better: "Make an iterator that aggregates elements from each of the iterables."

Because iterators are always also iterables, we may pass iterator1 and iterator2 as arguments to zip() .

The returned zipper object is of type zip and, "abstractly speaking," an Iterator as well.

In [16]:

zipper = zip(iterator1, iterator2)

In [17]:

zipper

Out[17]:

<zip at 0x7f2788118b00>

In [18]:

type(zipper)

Out[18]:

zip

In [19]:

isinstance(zipper, abc.Iterator)

Out[19]:

True

So far, we have always used zip() in a for-loop. That was our earlier definition of an iterable. Our revised definition in this section states that an iterable is an object that supports the iter() function. So, let's see what happens if we pass zipper to iter() .

In [20]:

zipper_iterator = iter(zipper)

In [21]:

zipper_iterator

Out[21]:

<zip at 0x7f2788118b00>

zipper_iterator references the same object as zipper! That is true for iterators in general: Any iterator created from an existing iterator with iter() is the iterator itself.

In [22]:

zipper is zipper_iterator

Out[22]:

True

The Python core developers made that design decision so that iterators may also be looped over.

The for-loop below prints out only six more tuple objects derived from the now ordered numbers because the iterator1 object hidden inside zipper already returned its first six elements. So, the respective first elements of the tuple objects printed range from 7 to 12. Similarly, as iterator2 already returned its first three elements from numbers, we see the respective second elements in the range from 4 to 9.

In [23]:

for x, y in zipper:
    print(x, "and", y, end="   ")

7 and 4   8 and 5   9 and 6   10 and 7   11 and 8   12 and 9

zipper is now exhausted. So, the for-loop below does not make any iteration at all.

In [24]:

for x, y in zipper:
    raise RuntimeError("We won't see this error")

We verify that iterator1 is exhausted by passing it to next() again, which raises a StopIteration exception.

In [25]:

next(iterator1)

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
Cell In[25], line 1
----> 1 next(iterator1)

StopIteration:

On the contrary, iterator2 is not yet exhausted.

In [26]:

next(iterator2)

Out[26]:

Understanding iterators and iterables is helpful for any data science practitioner that deals with large amounts of data. Even without that, these two terms occur everywhere in Python-related texts and documentation. So, a beginner should regularly review this section until it becomes second nature.

The `for` Statement (revisited)¶

In Chapter 4 , we argue that the for statement is syntactic sugar, replacing the while statement in many common scenarios. In particular, a for-loop saves us two tasks: Managing an index variable and obtaining the individual elements by indexing. In this sub-section, we look at a more realistic picture, using the new terminology as well.

Let's print out the elements of a list object as the iterable being looped over.

In [27]:

iterable = [0, 1, 2, 3, 4]

In [28]:

for element in iterable:
    print(element, end=" ")

0 1 2 3 4

Our previous and equivalent formulation with a while statement is like so.

In [29]:

index = 0

while index < len(iterable):
    element = iterable[index]
    print(element, end=" ")
    index += 1

del index

0 1 2 3 4

What actually happens behind the scenes in the Python interpreter is shown below.

First, Python calls iter() with the iterable to be looped over as the argument. The returned iterator contains the entire logic of how the iterable is looped over. In particular, the iterator may or may not pick the iterable's elements in a predictable order. That is up to the "rule" it models.

Second, Python enters an indefinite while-loop. It tries to obtain the next element with next() . If that succeeds, the for-loop's code block is executed. Below, that code is placed within the else-clause that runs only if no exception is raised in the try-clause. Then, Python jumps into the next iteration and tries to obtain the next element from the iterator, and so on. Once the iterator is exhausted, it raises a StopIteration exception, and Python stops the while-loop with the break statement.

In [30]:

iterator = iter(iterable)

while True:
    try:
        element = next(iterator)
    except StopIteration:
        break
    else:
        print(element, end=" ")

del iterator

0 1 2 3 4

sorted() vs. reversed()¶

Now that we know the concept of an iterator, let's compare some of the built-ins introduced in Chapter 7 in detail and make sure we understand what is going on in memory. This section also serves as a summary of all the concepts in this chapter.

We use two simple examples, numbers and memoryless: numbers creates thirteen objects in memory and memoryless only one (cf., PythonTutor ).

In [31]:

numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]

In [32]:

memoryless = range(1, 13)

The sorted() function takes a finite iterable argument and materializes its elements into a new list object that is returned.

The argument may already be materialized, as is the case with numbers, but may also be an iterable without any objects in it, such as memoryless. In both cases, we end up with materialized list objects with the elements sorted in forward order (cf., PythonTutor ).

In [33]:

sorted(numbers)

Out[33]:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

In [34]:

sorted(memoryless)

Out[34]:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

By adding a keyword-only argument reverse=True, the materialized list objects are sorted in reverse order (cf., PythonTutor ).

In [35]:

sorted(numbers, reverse=True)

Out[35]:

[12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

In [36]:

sorted(memoryless, reverse=True)

Out[36]:

[12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

The order in numbers remains unchanged, and memoryless is still not materialized.

In [37]:

numbers

Out[37]:

[7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]

In [38]:

memoryless

Out[38]:

range(1, 13)

The reversed() built-in takes a sequence argument and returns an iterator. The argument must be finite and reversible (i.e., iterable in reverse order) as otherwise reversed() could neither determine the last element that becomes the first nor loop in a predictable backward fashion. PythonTutor confirms that reversed() does not materialize any elements but only returns an iterator.

Side Note: Even though range objects, like memoryless here, do not "contain" references to other objects, they count as sequence types, and as such, they are also container types. The in operator works with range objects because we can always cast the object to be checked as an int and check if that lies within the range object's start and stop values, taking a potential step value into account (cf., this blog post for more details on the range() built-in).

In [39]:

reversed(numbers)

Out[39]:

<list_reverseiterator at 0x7f2788158880>

In [40]:

reversed(memoryless)

Out[40]:

<range_iterator at 0x7f2788158fc0>

To materialize the elements, we can pass the returned iterators to, for example, the list() or tuple() constructors. That creates new list and tuple objects (cf., PythonTutor ).

To reiterate some more new terminology from this chapter, we describe reversed() as lazy whereas list() and tuple() are eager. The former has no significant side effect in memory, while the latter may require a lot of memory.

In [41]:

list(reversed(numbers))

Out[41]:

[4, 1, 10, 9, 6, 2, 12, 3, 5, 8, 11, 7]

In [42]:

tuple(reversed(memoryless))

Out[42]:

(12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1)

Of course, we can also loop over the returned iterators instead.

That works because iterators are always iterable; in particular, as the previous "The for Statement (revisited)" sub-section explains, the for-loops below call iter(reversed(numbers)) and iter(reversed(memoryless)) behind the scenes. However, the iterators returned by iter() are the same as the reversed(numbers) and reversed(memoryless) iterators passed in! In summary, the for-loops below involve many subtleties that together make Python the expressive language it is.

In [43]:

for number in reversed(numbers):
    print(number, end=" ")

4 1 10 9 6 2 12 3 5 8 11 7

In [44]:

for element in reversed(memoryless):
    print(element, end=" ")

12 11 10 9 8 7 6 5 4 3 2 1

As with sorted() , the reversed() built-in does not mutate its argument.

In [45]:

numbers

Out[45]:

[7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]

In [46]:

memoryless

Out[46]:

range(1, 13)

To point out the potentially obvious, we compare the results of sorting numbers in reverse order with reversing it: These are different concepts!

In [47]:

numbers

Out[47]:

[7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]

In [48]:

sorted(numbers, reverse=True)

Out[48]:

[12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

In [49]:

list(reversed(numbers))

Out[49]:

[4, 1, 10, 9, 6, 2, 12, 3, 5, 8, 11, 7]

Whereas both sorted() and reversed() do not mutate their arguments, the mutable list type comes with two methods, sort() and reverse(), that implement the same logic but mutate an object, like numbers below, in place. To indicate that all changes occur in place, the sort() and reverse() methods always return None, which is not shown in JupyterLab.

In [50]:

numbers

Out[50]:

[7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]

The reverse() method on the list type is eager, as opposed to the lazy reversed() built-in. That means the mutations caused by the reverse() method are written into memory right away.

In [51]:

numbers.reverse()

In [52]:

numbers

Out[52]:

[4, 1, 10, 9, 6, 2, 12, 3, 5, 8, 11, 7]

Sorting numbers in place occurs eagerly.

In [53]:

numbers.sort(reverse=True)

In [54]:

numbers

Out[54]:

[12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

In [55]:

numbers.sort()

In [56]:

numbers

Out[56]:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

Chapter 8: Map, Filter, & Reduce¶

Iterators vs. Iterables¶

The for Statement (revisited)¶

sorted() vs. reversed()¶

The `for` Statement (revisited)¶