Note: Click on "Kernel" > "Restart Kernel and Clear All Outputs" in JupyterLab before reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it in the cloud.
We studied numbers (cf., Chapter 5) and textual data (cf., Chapter 6) first, mainly because objects of the presented data types are "simple." That is so for two reasons: First, they are immutable, and, as we saw in the "Who am I? And how many?" section in Chapter 1, mutable objects can quickly become hard to reason about. Second, they are "flat" in the sense that they are not composed of other objects.
The str type is a bit of a corner case in this regard. While one could argue that a longer str object, for example, "text", is composed of individual characters, this is not the case in memory, as the literal "text" creates only one object (i.e., one "bag" of 0s and 1s modeling all characters).
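As a minimal sketch of this corner case, the following cell shows that Python has no dedicated character type: indexing into a str object evaluates to another str object of length one. The names word and first are illustrative only and not part of the original text.

```python
word = "text"
first = word[0]  # indexing a str object evaluates to a str object again

print(type(word).__name__)   # str
print(type(first).__name__)  # str
print(len(first))            # 1
```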
This chapter, Chapter 8, Chapter 9, and Chapter 10 introduce various "complex" data types. While some are mutable and others are not, they all share that they are primarily used to "manage," or structure, the memory in a program (i.e., they provide references to other objects). Unsurprisingly, computer scientists refer to the ideas behind these data types as data structures.
In this chapter, we focus on data types that model all kinds of sequential data. Examples of such data are spreadsheets, matrices, and vectors. These formats share the property that they are composed of smaller units that come in a sequence of, for example, rows/columns/cells or elements/entries.
Chapter 6 already describes the sequence properties of str objects. In this section, we take a step back and study these properties one by one.
The collections.abc module in the standard library defines a variety of abstract base classes (ABCs). We saw ABCs already in Chapter 5, where we use the ones from the numbers module in the standard library to classify Python's numeric data types according to mathematical ideas. Now, we take the ABCs from the collections.abc module to classify the data types in this chapter according to their behavior in various contexts.
As an illustration, consider numbers and text below, two objects of different types.
numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]
text = "Lorem ipsum dolor sit amet."
Among others, one commonality between the two is that we may loop over them with the for statement. So, in the context of iteration, both exhibit the same behavior.
for number in numbers:
    print(number, end=" ")
7 11 8 5 3 12 2 6 9 10 1 4
for character in text:
    print(character, end=" ")
L o r e m i p s u m d o l o r s i t a m e t .
In Chapter 4, we referred to such types as iterables. That is not a proper English word, even if it may sound like one at first sight. Yet, it is an official term in the Python world, formalized with the Iterable ABC in the collections.abc module.
For the data science practitioner, it is worthwhile to know such terms because, for example, the documentation on the built-ins uses them extensively: In simple words, any built-in that takes an argument called "iterable" may be called with any object that supports being looped over. Already familiar built-ins include enumerate(), sum(), and zip(). So, they do not require the argument to be of a certain data type (e.g., list); instead, any iterable type works.
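As a small sketch of this point, the cell below passes both a list and a str object to the built-ins named above. The variable names total, first_three, and pairs are illustrative only; numbers and text are redefined with the same values as above so that the cell is self-contained.

```python
numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]
text = "Lorem ipsum dolor sit amet."

# sum() accepts any iterable of numbers, not just list objects.
total = sum(numbers)

# enumerate() and zip() work with str objects as well.
first_three = list(enumerate(text))[:3]
pairs = list(zip(numbers, text))[:3]

print(total)        # 78
print(first_three)  # [(0, 'L'), (1, 'o'), (2, 'r')]
print(pairs)        # [(7, 'L'), (11, 'o'), (8, 'r')]
```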
import collections.abc as abc
abc.Iterable
collections.abc.Iterable
As seen in Chapter 5, we can use ABCs with the built-in isinstance() function to check if an object supports a behavior.
So, let's "ask" Python if it can loop over numbers or text.
isinstance(numbers, abc.Iterable)
True
isinstance(text, abc.Iterable)
True
Contrary to list or str objects, numeric objects are not iterable.
isinstance(999, abc.Iterable)
False
Instead of asking, we could try to loop over 999, but this results in a TypeError.
for digit in 999:
    print(digit)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[9], line 1
----> 1 for digit in 999:
      2     print(digit)

TypeError: 'int' object is not iterable
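"Asking" and "trying" do not always agree. As the collections.abc documentation notes, the isinstance() check against Iterable does not detect classes that support looping only through the __getitem__() method, whereas calling the built-in iter() does. The sketch below illustrates this; the Countdown class and the is_iterable() helper are hypothetical names made up for this example.

```python
import collections.abc as abc


class Countdown:
    """Hypothetical class that can be looped over via __getitem__() only."""

    def __getitem__(self, index):
        if index >= 3:
            raise IndexError(index)
        return 3 - index


def is_iterable(obj):
    """Hypothetical helper: "try" instead of "ask" by calling iter()."""
    try:
        iter(obj)
    except TypeError:
        return False
    return True


countdown = Countdown()

print(isinstance(countdown, abc.Iterable))  # False
print(is_iterable(countdown))               # True
print(list(countdown))                      # [3, 2, 1]
print(is_iterable(999))                     # False
```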
Most of the data types in this chapter, Chapter 9, and Chapter 10 exhibit three orthogonal (i.e., "independent") behaviors, formalized by ABCs in the collections.abc module as:
Iterable: An object may be looped over.
Container: An object "contains" references to other objects; a "whole" is composed of many "parts."
Sized: The number of references to other objects, the "parts," is finite.
The characteristic operation supported by Container types is the in operator for membership testing.
0 in numbers
False
"l" in text
True
Alternatively, we could also check if numbers and text are Container types with isinstance().
isinstance(numbers, abc.Container)
True
isinstance(text, abc.Container)
True
Numeric objects do not "contain" references to other objects, and that is why they are considered "flat" data types. Accordingly, using the in operator with a numeric object on the right-hand side raises a TypeError. Conceptually speaking, Python views numeric types as "wholes" without any "parts."
isinstance(999, abc.Container)
False
9 in 999
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[15], line 1
----> 1 9 in 999

TypeError: argument of type 'int' is not iterable
Analogously, because numbers and text are Sized types, we can pass them as the argument to the built-in len() function and obtain "meaningful" results. The exact meaning depends on the data type: For numbers, len() tells us how many elements are in the list object; for text, it tells us how many Unicode characters make up the str object. Abstractly speaking, both data types exhibit the same behavior of finiteness.
len(numbers)
12
len(text)
27
isinstance(numbers, abc.Sized)
True
isinstance(text, abc.Sized)
True
On the contrary, even though 999 consists of three digits for humans, numeric objects in Python have no concept of a "size" or "length," and the len() function raises a TypeError.
isinstance(999, abc.Sized)
False
len(999)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[21], line 1
----> 1 len(999)

TypeError: object of type 'int' has no len()
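That the three behaviors are indeed independent of one another can be sketched with an object that exhibits only one of them. The example below assumes a generator expression, a construct not introduced in the text so far; the names squares and behaviors are illustrative only.

```python
import collections.abc as abc

numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]

# A generator expression (used here purely for illustration) may be
# looped over but is neither a Container nor a Sized type.
squares = (number ** 2 for number in numbers)

behaviors = {
    behavior.__name__: isinstance(squares, behavior)
    for behavior in (abc.Iterable, abc.Container, abc.Sized)
}

print(behaviors)  # {'Iterable': True, 'Container': False, 'Sized': False}
```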
These three behaviors are so essential that whenever they coincide for a data type, it is called a collection, formalized with the Collection ABC. That is where the collections.abc module got its name from: It summarizes all ABCs related to collections; in particular, it defines a hierarchy of specialized kinds of collections.
Without going into too much detail, one way to read the summary table at the beginning of the collections.abc module's documentation is as follows: The first column, titled "ABC," lists all collection-related ABCs in Python. The second column, titled "Inherits from," indicates if the idea behind the ABC is original (e.g., the first row with the Container ABC has an empty "Inherits from" column) or a combination (e.g., the row with the Collection ABC has Sized, Iterable, and Container in the "Inherits from" column). The third and fourth columns list the methods that come with a data type following an ABC. We keep ignoring the methods named in the dunder style for now.
So, let's confirm that both numbers and text are collections.
isinstance(numbers, abc.Collection)
True
isinstance(text, abc.Collection)
True
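The "Inherits from" relationship described above can also be checked in code. The following sketch uses the built-in issubclass() function to confirm that the Collection ABC combines the three behaviors; the name parents is illustrative only.

```python
import collections.abc as abc

# Check each of the three ABCs that Collection inherits from.
parents = {
    parent.__name__: issubclass(abc.Collection, parent)
    for parent in (abc.Sized, abc.Iterable, abc.Container)
}

print(parents)  # {'Sized': True, 'Iterable': True, 'Container': True}
```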
They share one more common behavior: When looping over them, we can predict the order of the elements or characters. The ABC in the collections.abc module corresponding to this behavior is Reversible. That name may seem unintuitive at first, but if something can be reversed, it must have a well-defined forward order to begin with.
The reversed() built-in allows us to loop over the elements or characters in reverse order.
for number in reversed(numbers):
    print(number, end=" ")
4 1 10 9 6 2 12 3 5 8 11 7
for character in reversed(text):
    print(character, end=" ")
. t e m a t i s r o l o d m u s p i m e r o L
isinstance(numbers, abc.Reversible)
True
isinstance(text, abc.Reversible)
True
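As a brief sketch, the object returned by reversed() can also be materialized instead of being looped over with a for statement, and the original objects stay untouched. The names backwards_numbers and backwards_text are illustrative only; numbers and text are redefined with the same values as above so that the cell is self-contained.

```python
numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]
text = "Lorem ipsum dolor sit amet."

# Collect the elements and characters in reverse order.
backwards_numbers = list(reversed(numbers))
backwards_text = "".join(reversed(text))

print(backwards_numbers)  # [4, 1, 10, 9, 6, 2, 12, 3, 5, 8, 11, 7]
print(backwards_text)     # .tema tis rolod muspi meroL
print(numbers[0])         # 7 (the original list object is unchanged)
```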
Collections that exhibit this fourth behavior are referred to as sequences, formalized with the Sequence ABC in the collections.abc module.
isinstance(numbers, abc.Sequence)
True
isinstance(text, abc.Sequence)
True
The data types introduced in this chapter are sequences. Nevertheless, in Chapter 8, we also look at some data types that are neither collections nor sequences but are still useful for modeling sequential data in practice.
In Python-related documentation, the terms collection and sequence are used heavily, and the data science practitioner should always think of them in terms of the three or four behaviors they exhibit.
Data types that are collections but not sequences are covered in Chapter 9.
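To make the distinction concrete, the sketch below compares a few built-in types against the Collection, Reversible, and Sequence ABCs. The built-in set type serves as an assumed example of a collection that is not a sequence; the source does not state which types Chapter 9 covers. The names examples and results are illustrative only.

```python
import collections.abc as abc

examples = [[1, 2, 3], (1, 2, 3), "abc", range(3), {1, 2, 3}]

# Map each type name to (is Collection, is Reversible, is Sequence).
results = {
    type(obj).__name__: (
        isinstance(obj, abc.Collection),
        isinstance(obj, abc.Reversible),
        isinstance(obj, abc.Sequence),
    )
    for obj in examples
}

for name, checks in results.items():
    print(name, checks)

# list (True, True, True)
# tuple (True, True, True)
# str (True, True, True)
# range (True, True, True)
# set (True, False, False)
```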