Note: Click on "Kernel" > "Restart Kernel and Clear All Outputs" in JupyterLab before reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it in the cloud .
In this second part of the chapter, we learn how we make our Vector
and Matrix
instances behave like Python's built-in sequence types, for example, list
or tuple
objects.
As discussed in detail in Chapter 7 , a sequence is any finite and iterable container type with a predictable order of its elements such that we can label each element with an index in the range
0 <= index < len(sequence)
.
To make Vector
and Matrix
instances emulate sequences, we implement the .__len__()
(cf., reference ) and
.__getitem__()
(cf., reference ) methods. While the former returns the total number of elements in a container and is automatically invoked on any object passed to the built-in len()
function, the latter is invoked by the interpreter behind the scenes when we use the indexing operator
[]
.
In the example, both .__len__()
and .__getitem__()
delegate parts of the work to the embedded list
object named ._entries
. This is a design principle known as delegation in software engineering. Also, we implicitly invoke the
.__len__()
method inside the .__init__()
method already via the len(self)
expression. This reuses code and also ensures that we calculate the number of entries in one way only within the entire class.
class Vector:
def __init__(self, data):
self._entries = list(float(x) for x in data)
if len(self) == 0:
raise ValueError("a vector must have at least one entry")
def __repr__(self):
args = ", ".join(repr(x) for x in self._entries)
return f"Vector(({args}))"
def __len__(self):
return len(self._entries)
def __getitem__(self, index):
return self._entries[index]
Now, we may obtain the number of elements with len() and index into
Vector
instances.
v = Vector([1, 2, 3])
len(v)
3
v[0]
1.0
Negative indexes work "out of the box" because of the delegation to the internal list
object.
v[-1]
3.0
Somehow "magically" we can loop over v
with a for
statement. This works as Python simply loops over the indexes implied by len(v)
and obtains the entries one by one.
for entry in v:
print(entry, end=" ")
1.0 2.0 3.0
v
may also be looped over in reverse order with the reversed() built-in.
for entry in reversed(v):
print(entry, end=" ")
3.0 2.0 1.0
Membership testing with the in
operator also comes "for free." Here, Python compares the object to be searched to each element with the ==
operator and stops early once one compares equal. That constitutes a linear search as seen before.
1 in v
True
99 in v
False
So far, indexing is a read-only operation.
v[0] = 99
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[10], line 1 ----> 1 v[0] = 99 TypeError: 'Vector' object does not support item assignment
Because a Matrix
is two-dimensional, we must decide how we flatten the ._entries
. We choose to loop over the first row, then the second row, and so on. This is called a row major approach .
In addition to indexing by int
objects, we also implement indexing by 2-tuple
s of int
s where the first element indicates the row and the second the column. Deciding what to do inside a method depending on the type of an argument is known as type dispatching. We achieve that with the built-in isinstance() function.
Lastly, we ensure that integer indexing also works with negative values as we are used to from sequences in general.
Note how all of the methods work together:
.__init__()
, .__len__()
, and .__getitem__()
reuse the .n_rows
and .n_cols
properties, and.__init__()
and .__getitem__()
invoke .__len__()
via the len(self)
expressions.class Matrix:
def __init__(self, data):
self._entries = list(list(float(x) for x in r) for r in data)
for row in self._entries[1:]:
if len(row) != self.n_cols:
raise ValueError("rows must have the same number of entries")
if len(self) == 0:
raise ValueError("a matrix must have at least one entry")
@property
def n_rows(self):
return len(self._entries)
@property
def n_cols(self):
return len(self._entries[0])
def __len__(self):
return self.n_rows * self.n_cols
def __getitem__(self, index):
if isinstance(index, int):
if index < 0:
index += len(self)
if not (0 <= index < len(self)):
raise IndexError("integer index out of range")
row, col = divmod(index, self.n_cols)
return self._entries[row][col]
elif (
isinstance(index, tuple) and len(index) == 2
and isinstance(index[0], int) and isinstance(index[1], int)
):
return self._entries[index[0]][index[1]]
raise TypeError("index must be either an int or a tuple of two int's")
Now, we may use a Matrix
instance just like any other sequence ...
m = Matrix([(1, 2, 3), (4, 5, 6), (7, 8, 9)])
len(m)
9
m[0] # entry in the upper left corner
1.0
m[-1] # entry in the lower right corner
9.0
... but also index in the two dimensions separately.
m[0, 2] # first row, third column
3.0
m[-1, -1] # last row, last column / lower right corner
9.0
As before, Python figures out the iteration on its own ...
for entry in m:
print(entry, end=" ")
1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
for entry in reversed(m):
print(entry, end=" ")
9.0 8.0 7.0 6.0 5.0 4.0 3.0 2.0 1.0
... and makes the in
operator do a linear search.
1 in m
True
99 in m
False
Sequence emulation itself is not a property of object-oriented languages in general. Instead, it is a behavior any data type may or may not exhibit in Python.
The collection of all such behaviors a programming language offers is commonly referred to as its object model. In Python, the term data model is used instead and all possible behaviors are documented in the language reference , in particular, in the section on special methods. We can think of the data model as the collection of all the behaviors we can make our user-defined data types follow. Pythonistas also use the term protocol instead of behavior, for example, we may say that "the
Vector
and Matrix
classes follow the sequence protocol."
So, merely defining the two .__len__()
and .__getitem__()
methods is enough to make instances of any user-defined type behave like the built-in sequences in Chapter 7 . Yet, there we defined sequences as all objects having the four properties of being finite, iterable, and ordered container types. And, these properties correspond to special methods by the names of
.__len__()
, .__iter__()
, .__reversed__()
, and .__contains__()
as we see in the next section. Thus, Python "magically" knows how to derive the logic for the .__iter__()
, .__reversed__()
, and .__contains__()
methods from the combination of the .__len__()
and .__getitem__()
methods. In general, while some special methods are related, others are not. Understanding these relationships means understanding the Python data model and vice versa. That is what every aspiring data scientist should aim for.
On the contrary, we could also look at special methods individually. Whereas .__len__()
is invoked on the object passed to len() , Python "translates" the indexing operator applied on any name like
a[i]
, for example, into the method invocation a.__getitem__(i)
. So, in both cases, the special methods are executed according to a deterministic rule of the language. In that sense, they act as some sort of syntactic sugar. Thus, they even work if only one of them is defined. For example, without .__len__()
, iteration with a for
-loop still works but only in forward direction.
When implementing the sequence protocol for our Matrix
class, we had to make the assumption that the user of our class wants to loop over the entries in a rows first fashion. While such assumptions can often be justified by referring to popular conventions (e.g., mathematicians usually look at matrices also in a "row by column" way), we could instead provide several iteration methods such that the user may choose one, just like dict
objects come with several built-in methods that provide iteration.
In the revised Matrix
class below, we add the .rows()
, .cols()
, and .entries()
methods that return generator
s providing different and memory efficient ways of looping over the entries. .rows()
and .cols()
sequentially produce Vector
instances representing individual rows and columns. This is in line with a popular idea in linear algebra to view a matrix as a collection of either row or column vectors. Further, .entries()
by default produces the entries in the matrix one by one in a flat and row major fashion. Called with the optional row_major=False
flag, it does the same in a column major fashion. The optional reverse=True
flag allows iteration in backwards order.
We also implement the .__iter__()
(cf., reference ) and
.__reversed__()
(cf., reference ) methods that immediately forward invocation to
.entries()
. So, Python does not need to fall back to .__len__()
and .__getitem__()
as we described above.
class Matrix:
def __init__(self, data):
self._entries = list(list(float(x) for x in r) for r in data)
# ...
def __repr__(self):
args = ", ".join("(" + ", ".join(repr(c) for c in r) + ",)" for r in self._entries)
return f"Matrix(({args}))"
@property
def n_rows(self):
return len(self._entries)
@property
def n_cols(self):
return len(self._entries[0])
def rows(self):
return (Vector(r) for r in self._entries)
def cols(self):
return (
Vector(self._entries[r][c] for r in range(self.n_rows)) for c in range(self.n_cols)
)
def entries(self, *, reverse=False, row_major=True):
if reverse:
rows, cols = (range(self.n_rows - 1, -1, -1), range(self.n_cols - 1, -1, -1))
else:
rows, cols = range(self.n_rows), range(self.n_cols)
if row_major:
return (self._entries[r][c] for r in rows for c in cols)
return (self._entries[r][c] for c in cols for r in rows)
def __iter__(self):
return self.entries()
def __reversed__(self):
return self.entries(reverse=True)
The revised version of Vector
below also works without .__len__()
and .__getitem__()
methods and leaves the creation of memory efficient generator
s up to the embedded list
object in ._entries
by using the built-in iter() and reversed()
functions. Also,
.__repr__()
now relies on the sequence protocol as the instance loops over "itself" with for x in self
, a subtle reuse of code again.
class Vector:
def __init__(self, data):
self._entries = list(float(x) for x in data)
# ...
def __repr__(self):
args = ", ".join(repr(x) for x in self)
return f"Vector(({args}))"
def __iter__(self):
return iter(self._entries)
def __reversed__(self):
return reversed(self._entries)
m = Matrix([(1, 2, 3), (4, 5, 6), (7, 8, 9)])
m
Matrix(((1.0, 2.0, 3.0,), (4.0, 5.0, 6.0,), (7.0, 8.0, 9.0,)))
Iteration works as before ...
for entry in m:
print(entry, end=" ")
1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
for entry in reversed(m):
print(entry, end=" ")
9.0 8.0 7.0 6.0 5.0 4.0 3.0 2.0 1.0
... but now we have some ways of customizing it.
for row_vector in m.rows():
print(row_vector, end=" ")
Vector((1.0, 2.0, 3.0)) Vector((4.0, 5.0, 6.0)) Vector((7.0, 8.0, 9.0))
for col_vector in m.cols():
print(col_vector, end=" ")
Vector((1.0, 4.0, 7.0)) Vector((2.0, 5.0, 8.0)) Vector((3.0, 6.0, 9.0))
for entry in m.entries():
print(entry, end=" ")
1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
for entry in m.entries(row_major=False):
print(entry, end=" ")
1.0 4.0 7.0 2.0 5.0 8.0 3.0 6.0 9.0
In the above implementations, the instance attribute ._entries
on a Vector
or Matrix
instance references either a list
or a list
of row list
s , which is by the convention of the leading underscore _
an implementation detail. If users of our classes adhere to this convention, Vector
and Matrix
instances can be regarded as immutable.
In line with the implied immutability, we implemented the .transpose()
method such that it returns a new Matrix
instance. Instead, we could make the method change the internal self._entries
attribute in place as we do in the next example. To indicate this mutation to the user of the Matrix
class clearly, the revised .transpose()
method returns None
. That mirrors, for example, how the mutating methods of the built-in list
type behave (cf., Chapter 7 ).
Such decisions are better made consciously when designing a custom data type. The main trade-off is that immutable data types are typically easier to reason about when reading code whereas mutable data types tend to be more memory efficient and make programs faster as less copying operations take place in memory. However, this trade-off only becomes critical when we deal with big amounts of data.
class Matrix:
def __init__(self, data):
self._entries = list(list(float(x) for x in r) for r in data)
# ...
def __repr__(self):
args = ", ".join("(" + ", ".join(repr(c) for c in r) + ",)" for r in self._entries)
return f"Matrix(({args}))"
def transpose(self):
self._entries = list(list(float(x) for x in r) for r in (zip(*self._entries)))
return None
m = Matrix([(1, 2, 3), (4, 5, 6), (7, 8, 9)])
m
Matrix(((1.0, 2.0, 3.0,), (4.0, 5.0, 6.0,), (7.0, 8.0, 9.0,)))
Transposing m
has no cell output ...
m.transpose()
... so we must look at m
again.
m
Matrix(((1.0, 4.0, 7.0,), (2.0, 5.0, 8.0,), (3.0, 6.0, 9.0,)))
A downside of returning None
is that we can not chain repeated invocations of .transpose()
.
m.transpose().transpose()
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) Cell In[37], line 1 ----> 1 m.transpose().transpose() AttributeError: 'NoneType' object has no attribute 'transpose'
To fix the missing method chaining, we end the .transpose()
method with return self
, which returns a reference to the instance on which the method is invoked.
class Matrix:
def __init__(self, data):
self._entries = list(list(float(x) for x in r) for r in data)
# ...
def __repr__(self):
args = ", ".join("(" + ", ".join(repr(c) for c in r) + ",)" for r in self._entries)
return f"Matrix(({args}))"
def __iter__(self): # adapted for brevity; uses parts of entries()
rows, cols = range(len(self._entries)), range(len(self._entries[0]))
return (self._entries[r][c] for r in rows for c in cols)
def transpose(self):
self._entries = list(list(float(x) for x in r) for r in (zip(*self._entries)))
return self
m = Matrix([(1, 2, 3), (4, 5, 6), (7, 8, 9)])
m
Matrix(((1.0, 2.0, 3.0,), (4.0, 5.0, 6.0,), (7.0, 8.0, 9.0,)))
The downside of this approach is that a user may unknowingly end up with two references to the same instance. That can only be mitigated by clear documentation.
n = m.transpose()
m
Matrix(((1.0, 4.0, 7.0,), (2.0, 5.0, 8.0,), (3.0, 6.0, 9.0,)))
n
Matrix(((1.0, 4.0, 7.0,), (2.0, 5.0, 8.0,), (3.0, 6.0, 9.0,)))
m is n
True
Analogous to the .__getitem__()
method above, there are also the .__setitem__()
(cf., reference ) and
.__delitem__()
(cf., reference ) methods that assign a new element to or delete an existing element from a sequence.
Whereas deleting an individual entry in a Vector
or Matrix
instance may not really make sense semantically, we interpret this as setting the corresponding entry to "unknown" (i.e., NaN
). Also, we implement changing individual entries via index assignment. Here, .__setitem__()
delegates the assignment to the embedded list
object after casting the assigned value as a float
. While the example below only allows indexing by an integer, it could be generalized to slicing as well.
class Vector:
def __init__(self, data):
self._entries = list(float(x) for x in data)
# ...
def __repr__(self):
args = ", ".join(repr(x) for x in self)
return f"Vector(({args}))"
def __getitem__(self, index):
return self._entries[index]
def __setitem__(self, index, value):
self._entries[index] = float(value)
def __delitem__(self, index):
self._entries[index] = float("NaN")
v = Vector([1, 2, 3])
v
Vector((1.0, 2.0, 3.0))
v
can now be changed in place.
del v[0]
v
Vector((nan, 2.0, 3.0))
v[0] = 99
v
Vector((99.0, 2.0, 3.0))
After this discussion of mutable Vector
and Matrix
classes, we continue with immutable implementations in the rest of this chapter. To lower the chance that we accidently design parts of our classes to be mutable, we replace the built-in list() constructor with tuple()
in the
.__init__()
methods. As we learn in Chapter 7 ,
tuple
s are like immutable list
s.
A function is considered polymorphic if it can work with different data types. The main advantage is reuse of the function's code. Polymorphism goes hand in hand with the concept of duck typing , first mentioned in Chapter 4
in the context of input validation.
We know polymorphic functions already: The built-in sum() function is a trivial example that works with all kinds of
iterable
arguments.
sum((1, 2, 3, 4))
10
sum([1, 2, 3, 4])
10
sum({1, 2, 3, 4})
10
sum({1: 996, 2: 997, 3: 998, 4: 999}) # loops over the keys
10
As we implemented the Vector
and Matrix
classes to be iterable, we may pass them to sum() as well.
sum(Vector([1, 2, 3, 4]))
10.0
sum(Matrix([(1, 2), (3, 4)]))
10.0
A polymorphic function with a semantic meaning in the context of linear algebra would be one that calculates the Euclidean norm for vectors, which is a generalization of the popular Pythagorean theorem
. Extending the same kind of computation to a matrix results in the even more general Frobenius norm
:
The norm()
function below can handle both a Vector
or a Matrix
instance and is therefore polymorphic. In this sense, Vector
and Matrix
instances "walk" and "quack" alike. In particular, they they both can provide all their entries as a flat sequence.
import math
def norm(vec_or_mat):
"""Calculate the Frobenius or Euclidean norm of a matrix or vector.
Args:
vec_or_mat (Vector / Matrix): object whose entries are squared and summed up
Returns:
norm (float)
"""
return math.sqrt(sum(x ** 2 for x in vec_or_mat))
While norm()
is intended to work with Vector
or Matrix
instances ...
norm(Vector([1, 2, 3, 4]))
5.477225575051661
norm(Matrix([(1, 2), (3, 4)]))
5.477225575051661
... it also works for any sequence of numbers.
norm([1, 2, 3, 4])
5.477225575051661
An important criterion if different classes are compatible in the sense that the same polymorphic function can work with them is that they implement the same interface.
Whereas many other programming languages formalize this concept , in Python the term refers to the loose idea that different classes define the same attributes and implement the various protocols behind the special methods in a consistent way. This is what it means to "walk" and "quack" alike.